From: Leonardo Bras <leobras@redhat.com>
To: neeraj.upadhyay@kernel.org
Cc: Leonardo Bras <leobras@redhat.com>,
linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
kernel-team@meta.com, rostedt@goodmis.org, mingo@kernel.org,
peterz@infradead.org, paulmck@kernel.org,
imran.f.khan@oracle.com, riel@surriel.com, tglx@linutronix.de
Subject: Re: [PATCH v2 3/3] locking/csd-lock: Use backoff for repeated reports of same incident
Date: Wed, 31 Jul 2024 19:01:23 -0300 [thread overview]
Message-ID: <Zqq0M92zcR1kcuKz@LeoBras> (raw)
In-Reply-To: <20240722133735.667161-3-neeraj.upadhyay@kernel.org>
On Mon, Jul 22, 2024 at 07:07:35PM +0530, neeraj.upadhyay@kernel.org wrote:
> From: "Paul E. McKenney" <paulmck@kernel.org>
>
> Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y
> kernels are emitted at five-second intervals. Although this has proven
> to be a good time interval for the first diagnostic, if the target CPU
> keeps interrupts disabled for way longer than five seconds, the ratio
> of useful new information to pointless repetition decreases considerably.
>
> Therefore, back off the time period for repeated reports of the same
> incident, increasing linearly with the number of reports and logarithmically
> with the number of online CPUs.
>
> [ paulmck: Apply Dan Carpenter feedback. ]
>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Imran Khan <imran.f.khan@oracle.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Leonardo Bras <leobras@redhat.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Reviewed-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> ---
> kernel/smp.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 9385cc05de53..dfcde438ef63 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -225,7 +225,7 @@ bool csd_lock_is_stuck(void)
> * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
> * so waiting on other types gets much less information.
> */
> -static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
> +static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id, unsigned long *nmessages)
> {
> int cpu = -1;
> int cpux;
> @@ -248,7 +248,9 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
> ts2 = sched_clock();
> /* How long since we last checked for a stuck CSD lock.*/
> ts_delta = ts2 - *ts1;
> - if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> + if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
> + (!*nmessages ? 1 : (ilog2(num_online_cpus()) / 2 + 1)) ||
> + csd_lock_timeout_ns == 0))
I think this is a nice change.

OTOH, the condition above is quite hard to read, IMHO.
IIUC you want, for csd_lock_timeout_ns = 5s and num_online_cpus() = 64:

1st message: after 5s
2nd message: after 5 * 2 * (6 / 2 + 1) = 10 * 4 = 40s
3rd message: after 5 * 3 * 4 = 60s
...

Is that correct?
I think this could be achieved with something like (using a local copy so the
global timeout is not compounded across repeated calls):

 	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
+	csd_lock_timeout = csd_lock_timeout_ns;
+	if (*nmessages)
+		csd_lock_timeout *= (*nmessages + 1) * (ilog2(num_online_cpus()) / 2 + 1);
 	if (likely(ts_delta <= csd_lock_timeout || csd_lock_timeout == 0))
 		return false;

Does that look better?
Thanks!
Leo
>
> firsttime = !*bug_id;
> @@ -265,6 +267,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
> pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n",
> firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta,
> cpu, csd->func, csd->info);
> + (*nmessages)++;
> if (firsttime)
> atomic_inc(&n_csd_lock_stuck);
> /*
> @@ -305,12 +308,13 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
> */
> static void __csd_lock_wait(call_single_data_t *csd)
> {
> + unsigned long nmessages = 0;
> int bug_id = 0;
> u64 ts0, ts1;
>
> ts1 = ts0 = sched_clock();
> for (;;) {
> - if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
> + if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
> break;
> cpu_relax();
> }
> --
> 2.40.1
>