From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com,
josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com,
tglx@linutronix.de, bunk@kernel.org, ego@in.ibm.com,
oleg@tv-sign.ru
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU
Date: Fri, 21 Sep 2007 15:44:59 -0700 [thread overview]
Message-ID: <20070921224459.GD9059@linux.vnet.ibm.com> (raw)
In-Reply-To: <20070921223112.GG15697@goodmis.org>
On Fri, Sep 21, 2007 at 06:31:12PM -0400, Steven Rostedt wrote:
> On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> > On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <rostedt@goodmis.org>
> > wrote:
> >
> > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> >
> >
> > > Can you have a pointer somewhere that explains these states. And not a
> > > "it's in this paper or directory". Either have a short discription here,
> > > or specify where exactly to find the information (perhaps a
> > > Documentation/RCU/preemptible_states.txt?).
> > >
> > > Trying to understand these states has caused me the most agony in
> > > reviewing these patches.
> > >
> > > > + */
> > > > +
> > > > +enum rcu_try_flip_states {
> > > > + rcu_try_flip_idle_state, /* "I" */
> > > > + rcu_try_flip_waitack_state, /* "A" */
> > > > + rcu_try_flip_waitzero_state, /* "Z" */
> > > > + rcu_try_flip_waitmb_state /* "M" */
> > > > +};
> >
> > I thought the 4 flip states corresponded to the 4 GP stages, but now
> > you confused me. It seems to indeed progress one stage for every 4 flip
> > states.
>
> I'm still confused ;-)
If you do a synchronize_rcu() it might well have to wait through the
following sequence of states:
Stage 0: (might have to wait through part of this to get out of "next" queue)
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 1:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 2:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 3:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 4:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
So yes, grace periods do indeed have some latency.
> > Hmm, now I have to puzzle how these 4 stages are required by the lock
> > and unlock magic.
> >
> > > > +/*
> > > > + * Return the number of RCU batches processed thus far. Useful for debug
> > > > + * and statistics. The _bh variant is identical to straight RCU.
> > > > + */
> > >
> > > If they are identical, then why the separation?
> >
> > I guess a smaller RCU domain makes for quicker grace periods.
>
> No, I mean that both the rcu_batches_completed and
> rcu_batches_completed_bh are identical. Perhaps we can just put in a
>
> #define rcu_batches_completed_bh rcu_batches_completed
>
> in rcupreempt.h. In rcuclassic, they are different. But no need to have
> two identical functions in the preempt version. A macro should do.
Ah!!! Good point, #define does make sense here.
> > > > +void __rcu_read_lock(void)
> > > > +{
> > > > + int idx;
> > > > + struct task_struct *me = current;
> > >
> > > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > > variable to assign current to. It's just that "me" thows me off a
> > > little while reviewing this. But this is just a nitpick, so do as you
> > > will.
> >
> > struct task_struct *curr = current;
> >
> > is also not uncommon.
>
> True, but the "me" confused me. Since that task struct is not me ;-)
Well, who is it, then? ;-)
> > > > + int nesting;
> > > > +
> > > > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > > > + if (nesting != 0) {
> > > > +
> > > > + /* An earlier rcu_read_lock() covers us, just count it. */
> > > > +
> > > > + me->rcu_read_lock_nesting = nesting + 1;
> > > > +
> > > > + } else {
> > > > + unsigned long oldirq;
> > >
> > > > +
> > > > + /*
> > > > + * Disable local interrupts to prevent the grace-period
> > > > + * detection state machine from seeing us half-done.
> > > > + * NMIs can still occur, of course, and might themselves
> > > > + * contain rcu_read_lock().
> > > > + */
> > > > +
> > > > + local_irq_save(oldirq);
> > >
> > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > local_bh_disable be sufficient here? You already cover NMIs, which would
> > > also handle normal interrupts.
> >
> > This is also my understanding, but I think this disable is an
> > 'optimization' in that it avoids the regular IRQs from jumping through
> > these hoops outlined below.
>
> But isn't disabling irqs slower than doing a local_bh_disable? So the
> majority of times (where irqs will not happen) we have this overhead.
The current code absolutely must exclude the scheduling-clock hardirq
handler.
> > > > + /*
> > > > + * Outermost nesting of rcu_read_lock(), so increment
> > > > + * the current counter for the current CPU. Use volatile
> > > > + * casts to prevent the compiler from reordering.
> > > > + */
> > > > +
> > > > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > > > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > > > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > > > +
> > > > + /*
> > > > + * Now that the per-CPU counter has been incremented, we
> > > > + * are protected from races with rcu_read_lock() invoked
> > > > + * from NMI handlers on this CPU. We can therefore safely
> > > > + * increment the nesting counter, relieving further NMIs
> > > > + * of the need to increment the per-CPU counter.
> > > > + */
> > > > +
> > > > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > > > +
> > > > + /*
> > > > + * Now that we have preventing any NMIs from storing
> > > > + * to the ->rcu_flipctr_idx, we can safely use it to
> > > > + * remember which counter to decrement in the matching
> > > > + * rcu_read_unlock().
> > > > + */
> > > > +
> > > > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > > > + local_irq_restore(oldirq);
> > > > + }
> > > > +}
> >
> > > > +/*
> > > > + * Attempt a single flip of the counters. Remember, a single flip does
> > > > + * -not- constitute a grace period. Instead, the interval between
> > > > + * at least three consecutive flips is a grace period.
> > > > + *
> > > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> > >
> > > Oh, come now! It's not "nuts" to use this ;-)
> > >
> > > > + * on a large SMP, they might want to use a hierarchical organization of
> > > > + * the per-CPU-counter pairs.
> > > > + */
> >
> > Its the large SMP case that's nuts, and on that I have to agree with
> > Paul, its not really large SMP friendly.
>
> Hmm, that could be true. But on large SMP systems, you usually have a
> large amounts of memory, so hopefully a really long synchronize_rcu
> would not be a problem.
Somewhere in the range from 64 to a few hundred CPUs, the global lock
protecting the try_flip state machine would start sucking air pretty
badly. But the real problem is synchronize_sched(), which loops through
all the CPUs -- this would likely cause problems at a few tens of
CPUs, perhaps as early as 10-20.
Thanx, Paul
next prev parent reply other threads:[~2007-09-21 22:57 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-09-10 18:30 [PATCH RFC 0/9] RCU: Preemptible RCU Paul E. McKenney
2007-09-10 18:32 ` [PATCH RFC 1/9] RCU: Split API to permit multiple RCU implementations Paul E. McKenney
2007-09-21 4:14 ` Steven Rostedt
2007-09-10 18:33 ` [PATCH RFC 2/9] RCU: Fix barriers Paul E. McKenney
2007-09-10 18:34 ` [PATCH RFC 3/9] RCU: Preemptible RCU Paul E. McKenney
2007-09-21 4:17 ` Steven Rostedt
2007-09-21 5:50 ` Paul E. McKenney
2007-09-21 5:56 ` Dipankar Sarma
2007-09-21 14:40 ` Steven Rostedt
2007-09-21 15:46 ` Peter Zijlstra
2007-09-21 22:06 ` Paul E. McKenney
2007-09-21 22:31 ` Steven Rostedt
2007-09-21 22:44 ` Paul E. McKenney [this message]
2007-09-21 23:23 ` Steven Rostedt
2007-09-21 23:44 ` Paul E. McKenney
2007-09-22 0:26 ` Paul E. McKenney
2007-09-22 1:15 ` Steven Rostedt
2007-09-22 1:53 ` Paul E. McKenney
2007-09-22 3:15 ` Steven Rostedt
2007-09-22 4:07 ` Paul E. McKenney
2007-09-21 15:20 ` Steven Rostedt
2007-09-21 23:03 ` Paul E. McKenney
2007-09-22 0:32 ` Paul E. McKenney
2007-09-22 1:19 ` Steven Rostedt
2007-09-22 1:43 ` Paul E. McKenney
2007-09-22 2:56 ` Steven Rostedt
2007-09-22 4:10 ` Paul E. McKenney
2007-09-23 17:38 ` Oleg Nesterov
2007-09-24 0:15 ` Paul E. McKenney
2007-09-26 15:13 ` Oleg Nesterov
2007-09-27 15:46 ` Paul E. McKenney
2007-09-28 14:47 ` Oleg Nesterov
2007-09-28 18:57 ` Paul E. McKenney
2007-09-30 16:31 ` Oleg Nesterov
2007-09-30 23:02 ` Davide Libenzi
2007-10-01 1:37 ` Paul E. McKenney
2007-10-01 18:44 ` Davide Libenzi
2007-10-01 19:21 ` Paul E. McKenney
2007-10-01 22:09 ` Davide Libenzi
2007-10-01 22:24 ` Paul E. McKenney
2007-10-02 18:02 ` Oleg Nesterov
2007-10-01 1:20 ` Paul E. McKenney
2007-09-10 18:35 ` [PATCH RFC 4/9] RCU: synchronize_sched() workaround for CPU hotplug Paul E. McKenney
2007-09-10 18:36 ` [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU Paul E. McKenney
2007-09-30 16:38 ` Oleg Nesterov
2007-10-01 1:41 ` Paul E. McKenney
2007-09-10 18:39 ` [PATCH RFC 6/9] RCU priority boosting " Paul E. McKenney
2007-09-28 22:56 ` Gautham R Shenoy
2007-09-28 23:05 ` Steven Rostedt
2007-09-30 3:11 ` Paul E. McKenney
2007-10-05 11:46 ` Gautham R Shenoy
2007-10-05 12:24 ` Steven Rostedt
2007-10-05 13:21 ` Gautham R Shenoy
2007-10-05 14:07 ` Paul E. McKenney
2007-09-10 18:39 ` [PATCH RFC 7/9] RCU: rcutorture testing for RCU priority boosting Paul E. McKenney
2007-09-10 18:41 ` [PATCH RFC 8/9] RCU: Make RCU priority boosting consume less power Paul E. McKenney
2007-09-10 18:42 ` [PATCH RFC 9/9] RCU: preemptible documentation and comment cleanups Paul E. McKenney
2007-09-10 18:44 ` [PATCH RFC 0/9] RCU: Preemptible RCU Ingo Molnar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070921224459.GD9059@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=bunk@kernel.org \
--cc=dipankar@in.ibm.com \
--cc=dvhltc@us.ibm.com \
--cc=ego@in.ibm.com \
--cc=josht@linux.vnet.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rt-users@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=oleg@tv-sign.ru \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=tytso@us.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox