From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Oleg Nesterov <oleg@tv-sign.ru>
Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com,
josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com,
tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org,
ego@in.ibm.com, srostedt@redhat.com
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU
Date: Fri, 28 Sep 2007 11:57:59 -0700 [thread overview]
Message-ID: <20070928185759.GC9153@linux.vnet.ibm.com> (raw)
In-Reply-To: <20070928144714.GA397@tv-sign.ru>
On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> On 09/27, Paul E. McKenney wrote:
> >
> > On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> > >
> > > Yes, yes, I see now. We really need this barriers, except I think
> > > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
> > >
> > > // rcu_try_flip_waitzero()
> > > if (A == 0) {
> > > mb();
> > > B == 0;
> > > }
> > >
> > > Do we really need the mb() in this case? How it is possible that STORE
> > > goes before LOAD? "Obviously", the LOAD should be completed first, no?
> >
> > Suppose that A was most recently stored by a CPU that shares a store
> > buffer with this CPU. Then it is possible that some other CPU sees
> > the store to B as happening before the store that "A==0" above is
> > loading from. This other CPU would naturally conclude that the store
> > to B must have happened before the load from A.
>
> Ah, I was confused by the comment,
>
> smp_mb(); /* Don't call for memory barriers before we see zero. */
> ^^^^^^^^^^^^^^^^^^
> So, in fact, we need this barrier to make sure that _other_ CPUs see these
> changes in order, thanks. Of course, _we_ already saw zero.
Fair point!
Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
rcu_flipctrs sum to zero" or some such?
> But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> it doesn't really need the barrier? (besides, it is always called under
> fliplock).
The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
so we cannot necessarily rely on locking to trivialize all of this.
> > In more detail, suppose that CPU 0 and 1 share a store buffer, and
> > that CPU 2 and 3 share a second store buffer. This happens naturally
> > if CPUs 0 and 1 are really just different hardware threads within a
> > single core.
> >
> > So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
> > and that the cacheline for B is initially owned by CPUs 0 and 1.
> > Then consider the following sequence of events:
> >
> > o CPU 0 stores zero to A. This is a cache miss, so the new value
> > for A is placed in CPU 0's and 1's store buffer.
> >
> > o CPU 1 executes the above code, first loading A. It sees
> > the value of A==0 in the store buffer, and therefore
> > stores zero to B, which hits in the cache. (I am assuming
> > that we left out the mb() above).
> >
> > o CPU 2 loads from B, which misses the cache, and gets the
> > value that CPU 1 stored. Suppose it checks the value,
> > and based on this check, loads A. The old value of A might
> > still be in cache, which would lead CPU 2 to conclude that
> > the store to B by CPU 1 must have happened before the store
> > to A by CPU 0.
> >
> > Memory barriers would prevent this confusion.
>
> Thanks a lot!!! This fills another gap in my understanding.
Glad it helped! ;-)
> OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> store buffer, the situation is simple, we can replace "CPU 0 stores" with
> "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
>
> Suppose that CPU 1 does
>
> wmb();
> B = 0
>
> Can we assume that CPU 2 doing
>
> if (B == 0) {
> rmb();
>
> must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?
Yes. CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
reads following its rmb() must therefore see all of CPU 1's stores
preceding the wmb().
> > > > > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> > > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> > > >
> > > > I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
> > > > transition has to be synchronized to the moving of the callbacks --
> > > > either that or we need more GP_STAGES.
> > >
> > > Hmm. Still can't understand.
> >
> > Callbacks would be able to be injected into a grace period after it
> > started.
>
> Yes, but this is _exactly_ what the current code does in the scenario below,
>
> > Or are you arguing that as long as interrupts remain disabled between
> > the two events, no harm done?
>
> no,
>
> > > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> > > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> > > (say, another CPU does rcu_try_flip_idle() in between).
> > >
> > > We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> > > __rcu_advance_callbacks() again and moves the callbacks.
> > >
> > > So, it is still possible that "move callbacks" and "ack the flip" happen
> > > out of order. But why this is bad?
>
> Look, what happens is
>
> // call_rcu()
> rcu_flip_flag = rcu_flipped
> insert the new callback
> // timer irq
> move the callbacks (the new one goes to wait[0])
>
> But I still can't understand why this is bad,
>
> > > This can't "speedup" the moving of our callbacks from next to done lists.
> > > Yes, RCU state machine can switch to the next state/stage, but this looks
> > > safe to me.
>
> Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed
> further events, we can't miss rcu_read_lock() section, no?
>
> > > Help!
>
> Please :)
Quite possibly my paranoia -- need to think about this some more.
Guess I need to expand the code-level portion of the document to cover
the grace-period side. That would force me either to get my explanation
into shape or admit that you are correct, as the case might be. ;-)
> > > if (rcu_ctrlblk.completed == rdp->completed)
> > > rcu_try_flip();
> > >
> > > Could you clarify the check above? Afaics this is just optimization,
> > > technically it is correct to rcu_try_flip() at any time, even if ->completed
> > > are not synchronized. Most probably in that case rcu_try_flip_waitack() will
> > > fail, but nothing bad can happen, yes?
> >
> > >From a conceptual viewpoint, if this CPU hasn't caught up with the
> > last grace-period stage, it has no business trying to push forward to
> > the next stage. So this might (or might not) happen to work with this
> > particular implementation, it needs to stay as is. We need this code
> > to be robust enough to optimize the grace-period latencies, right?
>
> Yes, yes. I just wanted to be sure I didn't miss some other subtle reason.
>
> > > void synchronize_sched(void)
> > > {
> > > struct migration_req req;
> > >
> > > req->task = NULL;
> > > init_completion(&req.done);
> > >
> > > for_each_online_cpu(cpu) {
> > > struct rq *rq = cpu_rq(cpu);
> > > int online;
> > >
> > > spin_lock_irq(&rq->lock);
> > > online = cpu_online(cpu); // HOTPLUG_CPU
> > > if (online) {
> > > list_add(&req->list, &rq->migration_queue);
> > > req.done.done = 0;
> > > }
> > > spin_unlock_irq(&rq->lock);
> > >
> > > if (online) {
> > > wake_up_process(rq->migration_thread);
> > > wait_for_completion(&req.done);
> > > }
> > > }
> > > }
> > >
> >
> > I need to think about your approach above. It looks like you are
> > leveraging the migration tasks, but I am concerned about concurrent
> > hotplug events.
>
> I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue,
> if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not
> interesting for us, we can miss it.
>
> Hmm... but wake_up_process() should be moved under spin_lock().
The other approach would be to simply have a separate thread for this
purpose. Batching would amortize the overhead (a single trip around the
CPUs could satisfy an arbitrarily large number of synchronize_sched()
requests).
Thanx, Paul
next prev parent reply other threads:[~2007-09-28 18:58 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-09-10 18:30 [PATCH RFC 0/9] RCU: Preemptible RCU Paul E. McKenney
2007-09-10 18:32 ` [PATCH RFC 1/9] RCU: Split API to permit multiple RCU implementations Paul E. McKenney
2007-09-21 4:14 ` Steven Rostedt
2007-09-10 18:33 ` [PATCH RFC 2/9] RCU: Fix barriers Paul E. McKenney
2007-09-10 18:34 ` [PATCH RFC 3/9] RCU: Preemptible RCU Paul E. McKenney
2007-09-21 4:17 ` Steven Rostedt
2007-09-21 5:50 ` Paul E. McKenney
2007-09-21 5:56 ` Dipankar Sarma
2007-09-21 14:40 ` Steven Rostedt
2007-09-21 15:46 ` Peter Zijlstra
2007-09-21 22:06 ` Paul E. McKenney
2007-09-21 22:31 ` Steven Rostedt
2007-09-21 22:44 ` Paul E. McKenney
2007-09-21 23:23 ` Steven Rostedt
2007-09-21 23:44 ` Paul E. McKenney
2007-09-22 0:26 ` Paul E. McKenney
2007-09-22 1:15 ` Steven Rostedt
2007-09-22 1:53 ` Paul E. McKenney
2007-09-22 3:15 ` Steven Rostedt
2007-09-22 4:07 ` Paul E. McKenney
2007-09-21 15:20 ` Steven Rostedt
2007-09-21 23:03 ` Paul E. McKenney
2007-09-22 0:32 ` Paul E. McKenney
2007-09-22 1:19 ` Steven Rostedt
2007-09-22 1:43 ` Paul E. McKenney
2007-09-22 2:56 ` Steven Rostedt
2007-09-22 4:10 ` Paul E. McKenney
2007-09-23 17:38 ` Oleg Nesterov
2007-09-24 0:15 ` Paul E. McKenney
2007-09-26 15:13 ` Oleg Nesterov
2007-09-27 15:46 ` Paul E. McKenney
2007-09-28 14:47 ` Oleg Nesterov
2007-09-28 18:57 ` Paul E. McKenney [this message]
2007-09-30 16:31 ` Oleg Nesterov
2007-09-30 23:02 ` Davide Libenzi
2007-10-01 1:37 ` Paul E. McKenney
2007-10-01 18:44 ` Davide Libenzi
2007-10-01 19:21 ` Paul E. McKenney
2007-10-01 22:09 ` Davide Libenzi
2007-10-01 22:24 ` Paul E. McKenney
2007-10-02 18:02 ` Oleg Nesterov
2007-10-01 1:20 ` Paul E. McKenney
2007-09-10 18:35 ` [PATCH RFC 4/9] RCU: synchronize_sched() workaround for CPU hotplug Paul E. McKenney
2007-09-10 18:36 ` [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU Paul E. McKenney
2007-09-30 16:38 ` Oleg Nesterov
2007-10-01 1:41 ` Paul E. McKenney
2007-09-10 18:39 ` [PATCH RFC 6/9] RCU priority boosting " Paul E. McKenney
2007-09-28 22:56 ` Gautham R Shenoy
2007-09-28 23:05 ` Steven Rostedt
2007-09-30 3:11 ` Paul E. McKenney
2007-10-05 11:46 ` Gautham R Shenoy
2007-10-05 12:24 ` Steven Rostedt
2007-10-05 13:21 ` Gautham R Shenoy
2007-10-05 14:07 ` Paul E. McKenney
2007-09-10 18:39 ` [PATCH RFC 7/9] RCU: rcutorture testing for RCU priority boosting Paul E. McKenney
2007-09-10 18:41 ` [PATCH RFC 8/9] RCU: Make RCU priority boosting consume less power Paul E. McKenney
2007-09-10 18:42 ` [PATCH RFC 9/9] RCU: preemptible documentation and comment cleanups Paul E. McKenney
2007-09-10 18:44 ` [PATCH RFC 0/9] RCU: Preemptible RCU Ingo Molnar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070928185759.GC9153@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=bunk@kernel.org \
--cc=dipankar@in.ibm.com \
--cc=dvhltc@us.ibm.com \
--cc=ego@in.ibm.com \
--cc=josht@linux.vnet.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rt-users@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=oleg@tv-sign.ru \
--cc=srostedt@redhat.com \
--cc=tglx@linutronix.de \
--cc=tytso@us.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.