Re: Normal RCU grace period can be stalled for long because need-resched flags not set?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	rcu <rcu@vger.kernel.org>
Subject: Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
Date: Sun, 7 Jul 2019 04:19:06 -0700	[thread overview]
Message-ID: <20190707111906.GB26519@linux.ibm.com> (raw)
In-Reply-To: <20190706230331.GA158271@google.com>

On Sat, Jul 06, 2019 at 07:03:31PM -0400, Joel Fernandes wrote:
> On Sat, Jul 06, 2019 at 11:21:17AM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> > > [snip] 
> > > > > I did have a request, could you help me understand why is the grace period
> > > > > duration double that of my busy wait time? You mentioned this has something
> > > > > to do with the thread not waking up before another GP is started. But I did
> > > > > not follow this. Thanks a lot!!
> > > > 
> > > > Let's look at a normal grace period, and assume the ->gp_seq grace-period
> > > > sequence counter is initially 0x1, so that a grace period is already
> > > > in progress (keep in mind that the low-order two bits of ->gp_seq are
> > > > the phase within the grace period and the rest of the bits are the
> > > > grace-period number, so we are in phase 1 of the grace period following
> > > > grace period #0).  This grace period was started via synchronize_rcu()
> > > > by Task A.
> > > > 
> > > > Then we have the following sequence of events:
> > > > 
> > > > o	Task B does call_rcu(), but is too late to use the already-started
> > > > 	grace period, so it needs another one.	The ->gp_seq_needed
> > > > 	counter thus becomes 0x8.
> > > > 
> > > > o	The current grace period completes, so that the ->gp_seq
> > > > 	counter becomes 0x4.  Callback invocation commences.
> > > > 
> > > > o	The grace-period kthread notices that ->gp_seq_needed is greater
> > > > 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> > > > 	counter therefore becomes 0x5.
> > > > 
> > > > o	The callback for Task A's synchronize_rcu() is invoked, awakening
> > > > 	Task A.  This happens almost immediately after the new grace
> > > > 	period starts, but does definitely happen -after- that grace
> > > > 	period starts.
> > > 
> > > Yes, but at this point, we are still at the 1GP mark. But the distribution's
> > > median below shows a strong correlation with 2 preempt-disable durations and
> > > grace-period completion. For example, if the duration of the preempt
> > > disable/enable loop is 50ms, it strongly shows the writer's median time
> > > difference before and after the synchronize_rcu as 100ms, not really showing
> > > it is 60 ms or 75 ms or anything.
> > > 
> > > > 
> > > > o	Task A, being part of rcuperf, almost immediately does another
> > > > 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.
> > > 
> > > Yes, but before that another synchronize_rcu, it also gets the timestamp and
> > > does a diff between old/new timestamp and has captured the data, so at that
> > > point the data captured should only be for around 1GPs worth give or take.
> > 
> > But in this case, the "give or take" is almost a full grace period, for
> > a total of almost two grace periods.
> > 
> > > > If you play out the rest of this sequence, you should see how Task A
> > > > waits for almost two full grace periods.
> > > 
> > > I tried to play this out, still didn't get it :-|. Not that it is an issue
> > > per-se, but still would like to understand it better.
> > 
> > Let's try it a different way.  Let's skip the state portion of ->gp_seq
> > and just focus on the number of elapsed grace periods.
> > 
> > o	Let's start with RCU idle, having completed grace-period number
> > 	zero and not yet having started grace-period number 1.
> > 	Let's also abbreviate: GP0 and GP1.
> > 
> > o	Task A wants a grace period.  Because RCU is idle, Task A's
> > 	GP will complete at completion of GP1.
> > 
> > o	RCU starts GP1.
> > 
> > o	Almost immediately thereafter, Task B wants a grace period.
> > 	But it cannot use GP1, because GP1 has already started, even
> > 	if just barely, because some CPU or tasks might already have
> > 	reported a quiescent state for GP1.
> > 
> > 	So Task B's GP will not complete until the end of GP2.	And that
> > 	is almost two GPs worth of time from now.
> > 
> > And on a system where RCU is busy, Task B's experience is the common
> > case if it is in a tight loop doing synchronize_rcu().  Because RCU is
> > busy, by the end of each grace period, there will always be a request
> > for another grace period.  And therefore RCU's grace-period kthread
> > will immediately start a new grace period upon completion of each
> > grace period.  So the sequence of events from Task B's viewpoint will
> > be as follows:
> > 
> > o	Task B executes synchronize_rcu(), which requests a new grace
> > 	period.
> > 
> > o	This requested grace period ends, and another immediately starts.
> > 
> > o	Task B's RCU callback is invoked, which does a wakeup.
> > 
> > o	Task B executes another synchronize_rcu(), but just after a new
> > 	grace period has started.  Task B thus has to wait almost two
> > 	full grace periods.
> > 
> > This could be perceived as a problem, but my question back to you would
> > be "Why on earth are you invoking synchronize_rcu() in a tight loop???"
> > "Oh, because you are running rcuperf?  Well, that is why rcuperf uses
> > ftrace data!"
> > 
> > Which might be the point on which we are talking past each other.  You
> > might have been thinking of the latency of RCU's grace-period computation.
> > I was thinking of the length of time between the call to synchronize_rcu()
> > and the corresponding return.
> 
> This registered perfectly with me now, I was having a hard time visualizing
> but it makes sense to me now based on your description. Thanks a lot!
> 
> Ascii art:
> (Basically the synch rcu is done just after a slight offset)
> 
>                synchronize_rcu Wait     synchronize_rcu Wait 
>               |<-------------------->|  |--------------------|
>               v                      v  v                    v
> <----------> <----------> <----------> <---------> <--------->
>      GP            GP          GP         GP          GP
> 
> GP = long preempt disable duration

You got it!

							Thanx, Paul

next prev parent reply	other threads:[~2019-07-07 11:19 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes
2019-07-03 15:30 ` Steven Rostedt
2019-07-03 16:41   ` Joel Fernandes
2019-07-03 16:43     ` Joel Fernandes
2019-07-03 17:39     ` Paul E. McKenney
2019-07-03 21:24       ` Joel Fernandes
2019-07-03 21:57         ` Paul E. McKenney
2019-07-03 22:24           ` Joel Fernandes
2019-07-03 23:01             ` Paul E. McKenney
2019-07-04  0:21               ` Joel Fernandes
2019-07-04  0:32                 ` Joel Fernandes
2019-07-04  0:50                   ` Paul E. McKenney
2019-07-04  3:24                     ` Joel Fernandes
2019-07-04 17:13                       ` Paul E. McKenney
2019-07-04 18:50                         ` Joel Fernandes
2019-07-04 22:17                           ` Paul E. McKenney
2019-07-05  0:08                             ` Joel Fernandes
2019-07-05  1:30                               ` Joel Fernandes
2019-07-05  1:57                                 ` Paul E. McKenney
2019-07-06 12:18                                   ` [attn: Steve] " Joel Fernandes
2019-07-06 18:05                                     ` Paul E. McKenney
2019-07-06 23:25                                       ` Steven Rostedt
2019-07-06 12:02                             ` Joel Fernandes
2019-07-06 18:21                               ` Paul E. McKenney
2019-07-06 23:03                                 ` Joel Fernandes
2019-07-07 11:19                                   ` Paul E. McKenney [this message]
2019-07-04  0:47                 ` Paul E. McKenney
2019-07-04 16:49                   ` Joel Fernandes
2019-07-04 17:08                     ` Paul E. McKenney
2019-07-03 16:10 ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190707111906.GB26519@linux.ibm.com \
    --to=paulmck@linux.ibm.com \
    --cc=joel@joelfernandes.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.