From: Guillaume Morin <guillaume@morinfr.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: linux-kernel@vger.kernel.org
Subject: Re: call_rcu data race patch
Date: Fri, 17 Sep 2021 23:34:06 +0200 [thread overview]
Message-ID: <20210917213404.GA14271@bender.morinfr.org> (raw)
In-Reply-To: <20210917211148.GU4156@paulmck-ThinkPad-P17-Gen-1>
On 17 Sep 14:11, Paul E. McKenney wrote:
> On Fri, Sep 17, 2021 at 09:15:57PM +0200, Guillaume Morin wrote:
> > Hello Paul,
> >
> > I've been researching some RCU warnings we see that lead to full lockups
> > with longterm 5.x kernels.
> >
> > Basically the rcu_advance_cbs() == true warning in
> > rcu_advance_cbs_nowake() is firing then everything eventually gets
> > stuck on RCU synchronization because the GP thread stays asleep while
> > rcu_state.gp_flags & 1 == 1 (these are machines with a bunch of
> > nohz_full CPUs).
> >
> > During that search I found your patch from July 12th
> > https://www.spinics.net/lists/rcu/msg05731.html that seems related (all
> > warnings we've seen happened in the __fput call path). Is there a reason
> > this patch was not pushed? Is there an issue with it, or did it just
> > fall through the cracks?
>
> It is still in -rcu:
>
> 2431774f04d1 ("rcu: Mark accesses to rcu_state.n_force_qs")
>
> It is slated for the v5.16 merge window. But does it really fix the
> problem that you are seeing?
I am going to try it soon. Since I could not see it in Linus' tree, I
wanted to make sure there was nothing wrong with the patch, hence my
email :-)
To my dismay, I can't reproduce this issue, which has made debugging
and testing very complicated.
I have a few kdumps from 5.4 and 5.10 kernels (that's how I was able to
observe that the GP thread had been sleeping for a long time and that
rcu_state.gp_flags & 1 == 1).
But this warning has fired a couple of dozen times on multiple machines
(different kinds of HW as well), always in the __fput path. Removing
nohz_full from the command line makes the problem disappear.
Most machines have had a fairly long uptime (30+ days) before showing
the warning, though on a couple of occasions it has happened after only
a few hours.
That's pretty much all I have been able to gather so far, unfortunately.
> > PS: FYI during my research, I've found another similar report in
> > bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=208685
>
> Huh. First I have heard of it. It looks like they hit this after about
> nine days of uptime. I have run way more than nine days of testing of
> nohz_full RCU operation with rcutorture, and have never seen it myself.
>
> Can you reproduce this? If so, can you reproduce it on mainline kernels
> (as opposed to -stable kernels as in that bugzilla)?
I have at least one prod machine where the problem usually happens
within a couple of days. All my attempts to reproduce it in any test
environment have failed.
>
> The theory behind that WARN_ON_ONCE() is as follows:
>
> o The check of rcu_seq_state(rcu_seq_current(&rnp->gp_seq))
> says that there is a grace period either in effect or just
> now ending.
>
> o In the latter case, the grace-period cleanup has not yet
> reached the current rcu_node structure, which means that
> it has not yet checked to see if another grace period
> is needed.
>
> o Either way, the RCU_GP_FLAG_INIT will cause the next grace
> period to start. (This flag is protected by the root
> rcu_node structure's ->lock.)
>
> Again, can you reproduce this, especially in mainline?
I have not tried, because running a mainline kernel in our prod
environment is quite difficult and requires a lot of validation work.
I could probably make it happen, but it would take some time.
Patches that I can apply on a stable kernel are much easier for me to
try, as you have probably guessed.
I appreciate your answer,
Guillaume.
--
Guillaume Morin <guillaume@morinfr.org>
Thread overview: 17+ messages
[not found] <20210917191555.GA2198@bender.morinfr.org>
2021-09-17 21:11 ` call_rcu data race patch Paul E. McKenney
2021-09-17 21:34 ` Guillaume Morin [this message]
2021-09-17 22:07 ` Paul E. McKenney
2021-09-18 0:39 ` Guillaume Morin
2021-09-18 4:00 ` Paul E. McKenney
2021-09-18 7:08 ` Guillaume Morin
2021-09-19 16:35 ` Paul E. McKenney
2021-09-20 16:05 ` Guillaume Morin
2021-09-22 19:14 ` Guillaume Morin
2021-09-22 19:24 ` Paul E. McKenney
2021-09-27 15:38 ` Guillaume Morin
2021-09-27 16:10 ` Paul E. McKenney
2021-09-27 16:49 ` Guillaume Morin
2021-09-27 21:46 ` Paul E. McKenney
2021-09-30 13:50 ` Guillaume Morin
2021-11-18 18:41 ` Daniel Vacek
2021-11-18 22:59 ` Paul E. McKenney