* [PATCH rcu 1/4] locking/csd_lock: Print large numbers as negatives
2024-08-16 4:36 [PATCH rcu 0/4] CSD-lock diagnostics updates for v6.12 Neeraj Upadhyay
@ 2024-08-16 4:39 ` neeraj.upadhyay
2024-08-16 4:39 ` [PATCH rcu 2/4] locking/csd_lock: Provide an indication of ongoing CSD-lock stall neeraj.upadhyay
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: neeraj.upadhyay @ 2024-08-16 4:39 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, paulmck, neeraj.upadhyay,
neeraj.upadhyay, boqun.feng, joel, urezki, frederic, mingo,
peterz, leobras, imran.f.khan, riel, tglx
From: "Paul E. McKenney" <paulmck@kernel.org>
The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are
printed in nanoseconds as unsigned long longs, which is a bit obtuse for
human readers when timing bugs result in negative CSD-lock hold times.
Yes, there are some people to whom it is immediately obvious that
18446744073709551615 is really -1, but for the rest of us...
Therefore, print these numbers as signed long longs, making the negative
hold times immediately apparent.
Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
---
kernel/smp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index aaffecdad319..e87953729230 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -249,8 +249,8 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
/* How long since this CSD lock was stuck. */
ts_delta = ts2 - ts0;
- pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
- firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
+ pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n",
+ firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta,
cpu, csd->func, csd->info);
/*
* If the CSD lock is still stuck after 5 minutes, it is unlikely
--
2.40.1
* [PATCH rcu 2/4] locking/csd_lock: Provide an indication of ongoing CSD-lock stall
From: neeraj.upadhyay @ 2024-08-16 4:39 UTC (permalink / raw)
To: rcu
From: "Paul E. McKenney" <paulmck@kernel.org>
If a CSD-lock stall goes on long enough, it will cause an RCU CPU
stall warning. This additional warning adds considerable console-log
traffic while providing little new information. Therefore,
provide a new csd_lock_is_stuck() function that returns true if there
is an ongoing CSD-lock stall. This function will be used by the RCU
CPU stall warnings to provide a one-line indication of the stall when
this function returns true.
[ neeraj.upadhyay: Apply Rik van Riel feedback. ]
[ neeraj.upadhyay: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
---
include/linux/smp.h | 6 ++++++
kernel/smp.c | 16 ++++++++++++++++
lib/Kconfig.debug | 1 +
3 files changed, 23 insertions(+)
diff --git a/include/linux/smp.h b/include/linux/smp.h
index fcd61dfe2af3..3871bd32018f 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -294,4 +294,10 @@ int smpcfd_prepare_cpu(unsigned int cpu);
int smpcfd_dead_cpu(unsigned int cpu);
int smpcfd_dying_cpu(unsigned int cpu);
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+bool csd_lock_is_stuck(void);
+#else
+static inline bool csd_lock_is_stuck(void) { return false; }
+#endif
+
#endif /* __LINUX_SMP_H */
diff --git a/kernel/smp.c b/kernel/smp.c
index e87953729230..202cda4d2a55 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -208,6 +208,19 @@ static int csd_lock_wait_getcpu(call_single_data_t *csd)
return -1;
}
+static atomic_t n_csd_lock_stuck;
+
+/**
+ * csd_lock_is_stuck - Has a CSD-lock acquisition been stuck too long?
+ *
+ * Returns @true if a CSD-lock acquisition is stuck and has been stuck
+ * long enough for a "non-responsive CSD lock" message to be printed.
+ */
+bool csd_lock_is_stuck(void)
+{
+ return !!atomic_read(&n_csd_lock_stuck);
+}
+
/*
* Complain if too much time spent waiting. Note that only
* the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
@@ -229,6 +242,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
cpu = csd_lock_wait_getcpu(csd);
pr_alert("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock.\n",
*bug_id, raw_smp_processor_id(), cpu);
+ atomic_dec(&n_csd_lock_stuck);
return true;
}
@@ -252,6 +266,8 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n",
firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta,
cpu, csd->func, csd->info);
+ if (firsttime)
+ atomic_inc(&n_csd_lock_stuck);
/*
* If the CSD lock is still stuck after 5 minutes, it is unlikely
* to become unstuck. Use a signed comparison to avoid triggering
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a30c03a66172..4e5f61cba8e4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1614,6 +1614,7 @@ config SCF_TORTURE_TEST
config CSD_LOCK_WAIT_DEBUG
bool "Debugging for csd_lock_wait(), called from smp_call_function*()"
depends on DEBUG_KERNEL
+ depends on SMP
depends on 64BIT
default n
help
--
2.40.1
* [PATCH rcu 3/4] locking/csd-lock: Use backoff for repeated reports of same incident
From: neeraj.upadhyay @ 2024-08-16 4:39 UTC (permalink / raw)
To: rcu
From: "Paul E. McKenney" <paulmck@kernel.org>
Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y
kernels are emitted at five-second intervals. Although this has proven
to be a good time interval for the first diagnostic, if the target CPU
keeps interrupts disabled for much longer than five seconds, the ratio
of useful new information to pointless repetition drops considerably.
Therefore, back off the time period for repeated reports of the same
incident, increasing linearly with the number of reports and
logarithmically with the number of online CPUs.
[ paulmck: Apply Dan Carpenter feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
---
kernel/smp.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 202cda4d2a55..b484ee6dcaf6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -226,7 +226,7 @@ bool csd_lock_is_stuck(void)
* the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
* so waiting on other types gets much less information.
*/
-static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
+static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id, unsigned long *nmessages)
{
int cpu = -1;
int cpux;
@@ -249,7 +249,9 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
ts2 = sched_clock();
/* How long since we last checked for a stuck CSD lock.*/
ts_delta = ts2 - *ts1;
- if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
+ if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
+ (!*nmessages ? 1 : (ilog2(num_online_cpus()) / 2 + 1)) ||
+ csd_lock_timeout_ns == 0))
return false;
firsttime = !*bug_id;
@@ -266,6 +268,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n",
firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta,
cpu, csd->func, csd->info);
+ (*nmessages)++;
if (firsttime)
atomic_inc(&n_csd_lock_stuck);
/*
@@ -306,12 +309,13 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
*/
static void __csd_lock_wait(call_single_data_t *csd)
{
+ unsigned long nmessages = 0;
int bug_id = 0;
u64 ts0, ts1;
ts1 = ts0 = sched_clock();
for (;;) {
- if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
+ if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
break;
cpu_relax();
}
--
2.40.1
* [PATCH rcu 4/4] smp: print only local CPU info when sched_clock goes backward
From: neeraj.upadhyay @ 2024-08-16 4:39 UTC (permalink / raw)
To: rcu
From: Rik van Riel <riel@surriel.com>
About 40% of all csd_lock warnings observed in our fleet appear to
be due to sched_clock() going backward in time (usually only a little
bit), resulting in ts0 being larger than ts2.
When the local CPU is at fault, we should print out a message reflecting
that, rather than trying to get the remote CPU's stack trace.
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
---
kernel/smp.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/smp.c b/kernel/smp.c
index b484ee6dcaf6..f25e20617b7e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -254,6 +254,14 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
csd_lock_timeout_ns == 0))
return false;
+ if (ts0 > ts2) {
+ /* Our own sched_clock went backward; don't blame another CPU. */
+ ts_delta = ts0 - ts2;
+ pr_alert("sched_clock on CPU %d went backward by %llu ns\n", raw_smp_processor_id(), ts_delta);
+ *ts1 = ts2;
+ return false;
+ }
+
firsttime = !*bug_id;
if (firsttime)
*bug_id = atomic_inc_return(&csd_bug_count);
--
2.40.1