Re: Normal RCU grace period can be stalled for long because need-resched flags not set?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	rcu <rcu@vger.kernel.org>
Subject: Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
Date: Wed, 3 Jul 2019 14:57:14 -0700	[thread overview]
Message-ID: <20190703215714.GW26519@linux.ibm.com> (raw)
In-Reply-To: <20190703212426.GC146386@google.com>

On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > 
> > > > 
> > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > 
> > > > > I'll keep poking ;-)
> > > > 
> > > > Hi Joel,
> > > > 
> > > > Can you also share the tests you are performing as well as any
> > > > module/code changes you made so that we can duplicate the results?
> > > 
> > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > believe Linus's tree should have same results.
> > > 
> > > After applying the diff below, I run it like this:
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > 
> > > Some new options I added:
> > > pd_test=1 runs the preempt disable loop test
> > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > 
> > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > 
> > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > there's nothing to worry about (at least this is what I remember tracing).
> > > 
> > > With pd_resched = 0, the output of the command above:
> > > Average grace-period duration: 195629 microseconds
> > > Minimum grace-period duration: 30111.7
> > > 50th percentile grace-period duration: 211000
> > > 90th percentile grace-period duration: 218000
> > > 99th percentile grace-period duration: 222999
> > > Maximum grace-period duration: 236351
> > > 
> > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > I wonder why its twice, but that's still Ok. It is as follows:
> > > Average grace-period duration: 12302.2 microseconds
> > > Minimum grace-period duration: 5998.35
> > > 50th percentile grace-period duration: 12000.4
> > > 90th percentile grace-period duration: 15996.4
> > > 99th percentile grace-period duration: 18000.6
> > > Maximum grace-period duration: 20998.6
> > 
> > Both of these results are within the design range for normal
> > RCU grace-period durations on busy systems.  See the code in
> > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > durations" at which RCU starts taking more aggressive actions to end
> > the current grace period.  See especially:
> > 
> > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > 
> > This usually gets you about 100 milliseconds, and if you are starting
> > grace periods in quick succession from a single thread while other threads
> > are doing likewise, each grace-period wait gets to wait about two grace
> > periods worth due to the end of the previous grace period having started
> > a new grace period before the thread is awakened.
> > 
> > Of course, if this is causing trouble for some use case, it would not
> > be hard to create a tunable to override this panic duration.  But that
> > would of course require a real use case in real use, given that RCU isn't
> > exactly short on tunables at the moment.  Significantly shortening this
> > panic duration caused 0day to complain about slowness last I tried it,
> > just so you know.
> 
> Thanks a lot for the explanation.
> Indeed this code in the tick is doing a good job and I just had to drop
> jiffies_till_first_fqs to bring down the latencies. With a
> jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> drop by 4 fold.

You lost me on this one.  The normal value of jiffies_till_first_fqs
is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
have to ask...  What did you do to get jiffies_till_first_fqs=100?
The normal default automatic settings would need something like 8,000
CPUs to get it up to that level.

Or did you instead mean replacing the "HZ / 10" in the code snippet
above with "HZ / 20" or similar?

Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
-That- does default to 100, and you could set it using the
rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
must admit that I would naively expect halving jiffies_till_first_fqs to
halve the latencies.  But I have not looked at it closely, and there are
lots of moving parts in RCU's grace-period encouragement code, so maybe
that is the effect.

> In the tick:
>         if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
>                 /* Idle and userspace execution already are quiescent states. */
>                 if (!rcu_is_cpu_rrupt_from_idle() && !user) {
>                         set_preempt_need_resched();     <--------\
>                         set_tsk_need_resched(current);  <------- the preempt
> 								count test loop
> 								stands no chance!
>                 }
>                 __this_cpu_write(rcu_data.rcu_urgent_qs, false);
>         }

You got it!  That is indeed where it sets resched.

But if I understand what your preempt-count test loop consists of, RCU
would have a tough time with it in a CONFIG_PREEMPT=n kernel.

							Thanx, Paul

> Appreciate it! thanks,
> 
>  - Joel
>

next prev parent reply	other threads:[~2019-07-03 21:57 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes
2019-07-03 15:30 ` Steven Rostedt
2019-07-03 16:41   ` Joel Fernandes
2019-07-03 16:43     ` Joel Fernandes
2019-07-03 17:39     ` Paul E. McKenney
2019-07-03 21:24       ` Joel Fernandes
2019-07-03 21:57         ` Paul E. McKenney [this message]
2019-07-03 22:24           ` Joel Fernandes
2019-07-03 23:01             ` Paul E. McKenney
2019-07-04  0:21               ` Joel Fernandes
2019-07-04  0:32                 ` Joel Fernandes
2019-07-04  0:50                   ` Paul E. McKenney
2019-07-04  3:24                     ` Joel Fernandes
2019-07-04 17:13                       ` Paul E. McKenney
2019-07-04 18:50                         ` Joel Fernandes
2019-07-04 22:17                           ` Paul E. McKenney
2019-07-05  0:08                             ` Joel Fernandes
2019-07-05  1:30                               ` Joel Fernandes
2019-07-05  1:57                                 ` Paul E. McKenney
2019-07-06 12:18                                   ` [attn: Steve] " Joel Fernandes
2019-07-06 18:05                                     ` Paul E. McKenney
2019-07-06 23:25                                       ` Steven Rostedt
2019-07-06 12:02                             ` Joel Fernandes
2019-07-06 18:21                               ` Paul E. McKenney
2019-07-06 23:03                                 ` Joel Fernandes
2019-07-07 11:19                                   ` Paul E. McKenney
2019-07-04  0:47                 ` Paul E. McKenney
2019-07-04 16:49                   ` Joel Fernandes
2019-07-04 17:08                     ` Paul E. McKenney
2019-07-03 16:10 ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190703215714.GW26519@linux.ibm.com \
    --to=paulmck@linux.ibm.com \
    --cc=joel@joelfernandes.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.