From: Leonardo Bras <leobras@redhat.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>,
Peter Zijlstra <peterz@infradead.org>,
Valentin Schneider <vschneid@redhat.com>,
Juergen Gross <jgross@suse.com>,
Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long
Date: Mon, 16 Oct 2023 17:58:23 -0300 [thread overview]
Message-ID: <ZS2j73BWypYk-S1u@redhat.com> (raw)
In-Reply-To: <ZS2iqck0tWDWEVMZ@redhat.com>
On Mon, Oct 16, 2023 at 05:52:57PM -0300, Leonardo Bras wrote:
> On Mon, Oct 16, 2023 at 11:27:51AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote:
> > > On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote:
> > > > > Hello Paul,
> > > > >
> > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote:
> > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> > > > > > temporarily, it usually gets released in a few seconds, and sometimes
> > > > > > up to one or two minutes.
> > > > > >
> > > > > > If the CSD lock stays stuck for more than several minutes, it never
> > > > > > seems to get unstuck, and gradually more and more things in the system
> > > > > > end up also getting stuck.
> > > > > >
> > > > > > In the latter case, we should just give up, so the system can dump out
> > > > > > a little more information about what went wrong, and, with panic_on_oops
> > > > > > and a kdump kernel loaded, dump a whole bunch more information about
> > > > > > what might have gone wrong.
> > > > > >
> > > > > > Question: should this have its own panic_on_ipistall switch in
> > > > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> > > > > > way than via BUG_ON?
> > > > > >
> > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems
> > > > > where such delay is acceptable and system can eventually get back to sane state,
> > > > > this option (set to 0 after boot) would prevent crashing the system for
> > > > > apparently benign CSD hangs of long duration.
> > > >
> > > > Good point! How about like the following?
> > > >
> > > > Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3
> > > > Author: Rik van Riel <riel@surriel.com>
> > > > Date: Mon Aug 21 16:04:09 2023 -0400
> > > >
> > > > smp,csd: Throw an error if a CSD lock is stuck for too long
> > > >
> > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> > > > temporarily, it usually gets released in a few seconds, and sometimes
> > > > up to one or two minutes.
> > > >
> > > > If the CSD lock stays stuck for more than several minutes, it never
> > > > seems to get unstuck, and gradually more and more things in the system
> > > > end up also getting stuck.
> > > >
> > > > In the latter case, we should just give up, so the system can dump out
> > > > a little more information about what went wrong, and, with panic_on_oops
> > > > and a kdump kernel loaded, dump a whole bunch more information about what
> > > > might have gone wrong. In addition, there is an smp.panic_on_ipistall
> > > > kernel boot parameter that by default retains the old behavior, but when
> > > > set enables the panic after the CSD lock has been stuck for more than
> > > > five minutes.
> > > >
> > > > [ paulmck: Apply Imran Khan feedback. ]
> > > >
> > > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > > Cc: Juergen Gross <jgross@suse.com>
> > > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > > Cc: Randy Dunlap <rdunlap@infradead.org>
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 0a1731a0f0ef..592935267ce2 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -5858,6 +5858,11 @@
> > > > This feature may be more efficiently disabled
> > > > using the csdlock_debug- kernel parameter.
> > > >
> > > > + smp.panic_on_ipistall= [KNL]
> > > > + If a csd_lock_timeout extends for more than
> > > > + five minutes, panic the system. By default, let
> > > > + CSD-lock acquisition take as long as they take.
> > > > +
> > >
> > > It could be interesting to have it as an s64 parameter (in {mili,}seconds)
> > > instead of bool, this way the user could pick the time to wait before the
> > > panic happens. 0 or -1 could mean disabled.
> > >
> > > What do you think?
> > >
> > > Other than that,
> > > Reviewed-by: Leonardo Bras <leobras@redhat.com>
> >
> > Thank you for looking this over!
> >
> > How about with the diff shown below, to be folded into the original?
> > I went with int instead of s64 because I am having some difficulty
> > imagining anyone specifying more than a 24-day timeout. ;-)
>
> I suggested s64 just because it was the type being used in
> BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL);
>
> But anyway, int should be fine.
>
>
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index ccb7621eff79..ea5ae9deb753 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5931,8 +5931,10 @@
> >
> > smp.panic_on_ipistall= [KNL]
> > If a csd_lock_timeout extends for more than
> > - five minutes, panic the system. By default, let
> > - CSD-lock acquisition take as long as they take.
> > + the specified number of milliseconds, panic the
> > + system. By default, let CSD-lock acquisition
> > + take as long as they take. Specifying 300,000
> > + for this value provides a 10-minute timeout.
>
> 300,000 ms is 300s, which is 5 minutes, right?
>
> >
> > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
> > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index b6a0773a7015..d3ca47f32f38 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
> >
> > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */
> > module_param(csd_lock_timeout, ulong, 0444);
> > -static bool panic_on_ipistall;
> > -module_param(panic_on_ipistall, bool, 0444);
> > +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for ten minutes. */
>
> s/ten/five
>
> > +module_param(panic_on_ipistall, int, 0444);
> >
> > static atomic_t csd_bug_count = ATOMIC_INIT(0);
> >
> > @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> > * to become unstuck. Use a signed comparison to avoid triggering
> > * on underflows when the TSC is out of sync between sockets.
> > */
> > - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL);
> > + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC));
>
> s64 here would avoid casting (s64)panic_on_ipistall, but I think it does
> not really impact readability.
>
> IIUC ts_delta is an u64 being casted as s64, which could be an issue but no
> computer system will actually take over 2^31 ns (292 years) to run 1
> iteration, so it's safe.
>
> I think it's a nice feaure :)
s/feaure/feature
Also, FWIW:
Reviewed-by: Leonardo Bras <leobras@redhat.com>
>
> Thanks!
> Leo
>
>
> > if (cpu_cur_csd && csd != cpu_cur_csd) {
> > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
> > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
> >
next prev parent reply other threads:[~2023-10-16 20:59 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-05 16:48 [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long Paul E. McKenney
2023-10-05 23:32 ` Imran Khan
2023-10-09 16:39 ` Paul E. McKenney
2023-10-10 4:58 ` Imran Khan
2023-10-10 13:47 ` Paul E. McKenney
2023-10-13 15:26 ` Leonardo Bras
2023-10-16 18:27 ` Paul E. McKenney
2023-10-16 20:52 ` Leonardo Bras
2023-10-16 20:58 ` Leonardo Bras [this message]
2023-10-16 21:37 ` Paul E. McKenney
2023-10-16 22:15 ` Leonardo Bras
2023-10-06 18:48 ` Jonas Oberhauser
2023-10-06 20:26 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZS2j73BWypYk-S1u@redhat.com \
--to=leobras@redhat.com \
--cc=imran.f.khan@oracle.com \
--cc=jgross@suse.com \
--cc=linux-kernel@vger.kernel.org \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.