linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, josh@joshtriplett.org, niv@us.ibm.com,
	tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org,
	Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
	eric.dumazet@gmail.com, darren@dvhart.com, fweisbec@gmail.com,
	patches@linaro.org
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
Date: Thu, 16 Feb 2012 06:00:30 -0500	[thread overview]
Message-ID: <20120216110030.GA1425@Krystal> (raw)
In-Reply-To: <20120216063805.GF2976@linux.vnet.ibm.com>

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > [...]
> > > > +/*
> > > > + * To be called from the update side after an index flip.  Returns true
> > > > + * if the modulo sum of the counters is stably zero, false if there is
> > > > + * some possibility of non-zero.
> > > > + */
> > > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > > >  {
> > > >  	int cpu;
> > > > -	int sum;
> > > >  
> > > > -	sum = 0;
> > > > +	/*
> > > > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > > > +	 * zero even though there is a pre-existing reader throughout.
> > > > +	 * To see this, suppose that task A is in a very long SRCU
> > > > +	 * read-side critical section that started on CPU 0, and that
> > > > +	 * no other reader exists, so that the modulo sum of the counters
> > > > +	 * is equal to one.  Then suppose that task B starts executing
> > > > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > > +	 * task C starts reading on CPU 0, so that its increment is not
> > > > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > > > +	 * -is- summed.  Then when task B completes its sum, it will
> > > > +	 * incorrectly get zero, despite the fact that task A has been
> > > > +	 * in its SRCU read-side critical section the whole time.
> > > > +	 *
> > > > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > > > +	 * return zero.
> > > > +	 */
> > > > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * Since the caller recently flipped ->completed, we can see at
> > > > +	 * most one increment of each CPU's counter from this point
> > > > +	 * forward.  The reason for this is that the reader CPU must have
> > > > +	 * fetched the index before srcu_readers_active_idx checked
> > > > +	 * that CPU's counter, but not yet incremented its counter.
> > > > +	 * Its eventual counter increment will follow the read in
> > > > +	 * srcu_readers_active_idx(), and that increment is immediately
> > > > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > > > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > > > +	 * that CPU's subsequent load of ->completed must see the new
> > > > +	 * value, and therefore increment the counter in the other rank.
> > > > +	 */
> > > > +	smp_mb(); /* A */
> > > > +
> > > > +	/*
> > > > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > > > +	 * filled in from the per-CPU counter values.  Since both
> > > > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > > +	 * upper bits of the per-CPU counter, an increment/decrement
> > > > +	 * pair will change the value of the counter.  Since there is
> > > > +	 * only one possible increment, the only way to wrap the counter
> > > > +	 * is to have a huge number of counter decrements, which requires
> > > > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > > > +	 * nesting levels, even on 32-bit systems.
> > > > +	 *
> > > > +	 * All of the ways of confusing the readings require that the scan
> > > > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > > > +	 * but not its increment.  However, between that decrement and
> > > > +	 * increment are smb_mb() B and C.  Either or both of these pair
> > > > +	 * with smp_mb() A above to ensure that the scan below will see
> > > > +	 * the read-side tasks's increment, thus noting a difference in
> > > > +	 * the counter values between the two passes.
> > > 
> > > Hi Paul,
> > > 
> > > I think the implementation is correct, but the explanation above might
> > > be improved. Let's consider the following a scenario, where a reader is
> > > migrated between increment of the counter and issuing the memory barrier
> > > in the read lock:
> > > 
> > > A,B,C are readers
> > > D is synchronize_rcu (one flip'n'wait)
> > > 
> > > CPU A               CPU B           CPU C         CPU D
> > >                                     c[1]++
> > >                                     smp_mb(1)
> > >                                                   read c[0] -> 0
> > > c[0]++
> > > (implicit smp_mb (2))
> > >       -> migrated ->
> > >                     (implicit smp_mb (3))
> > >                     smp_mb (4)
> > >                     smp_mb (5)
> > >                     c[1]--
> > >                                                   read c[1] -> -1
> > >                                                   read c[2] -> 1
> > >                                                   (false 0 sum)
> > >                                                   smp_mb (6)
> > >                                                   re-check each.
> > >                                     c[1]--
> > > 
> > > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > > barriers within thread migration (2 and 3), we can assume that we _will_
> > > observe the updated value of c[0] after smp_mb (6).
> > > 
> > > The current explanation states that memory barriers 4 and 5, along with  
> > > 6, are responsible for ensuring that the increment will be observed by 
> > > the re-check. However, I doubt they have anything to do with it: it's 
> > > rather the implicit memory barriers in thread migration, along with
> > > program order guarantees on writes to the same address, that seems to be
> > > the reason why we can do this ordering assumption.
> > 
> > Please disregard the part about program order: CPU A writes to c[0], and
> > CPU B writes to c[1], which are two different memory locations. The rest
> > of my discussion stands though.
> > 
> > Simply reasoning about write to c[0], memory barriers 2-3, write to
> > c[1], along with c[1] read, memory barrier 6, and then c[0] read is
> > enough to explain the ordering guarantees you need, without invoking
> > program order.
> 
> I am assuming that if the scheduler migrates a process, it applies enough
> memory ordering to allow the proof to operate as if it had stayed on a
> single CPU throughout.

When applied to per-cpu variables, the reasoning cannot invoke program
order though, since we're touching two different memory locations.

> The reasoning for this would consider the
> scheduler access and memory barriers

indeed,

>-- but there would be an arbitrarily
> large number of migration patterns, so I am not convinced that it would
> help...

This brings the following question then: which memory barriers, in the
scheduler activity, act as full memory barriers to migrated threads ? I
see that the rq_lock is taken, but this lock is permeable in one
direction (operations can spill into the critical section). I'm probably
missing something else, but this something else probably needs to be
documented somewhere, since we are doing tons of assumptions based on
it.

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > Does it make sense, or shall I get another coffee to wake myself up ?
> > > ;)
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > > +	 *
> > > > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > > > +	 * none of the counters changed, we know that the zero was the
> > > > +	 * correct sum.
> > > > +	 *
> > > > +	 * Of course, it is possible that a task might be delayed
> > > > +	 * for a very long time in __srcu_read_lock() after fetching
> > > > +	 * the index but before incrementing its counter.  This
> > > > +	 * possibility will be dealt with in __synchronize_srcu().
> > > > +	 */
> > > >  	for_each_possible_cpu(cpu)
> > > > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > > -	return sum;
> > > > +		if (sp->snap[cpu] !=
> > > > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > > +			return false;  /* False zero reading! */
> > > > +	return true;
> > > >  }
> > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

  reply	other threads:[~2012-02-16 11:00 UTC|newest]

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-13  2:09 [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation Paul E. McKenney
2012-02-15 12:59 ` Peter Zijlstra
2012-02-16  6:35   ` Paul E. McKenney
2012-02-16 10:50     ` Mathieu Desnoyers
2012-02-16 10:52       ` Peter Zijlstra
2012-02-16 11:14         ` Mathieu Desnoyers
2012-02-15 14:31 ` Mathieu Desnoyers
2012-02-15 14:51   ` Mathieu Desnoyers
2012-02-16  6:38     ` Paul E. McKenney
2012-02-16 11:00       ` Mathieu Desnoyers [this message]
2012-02-16 11:51         ` Peter Zijlstra
2012-02-16 12:18           ` Mathieu Desnoyers
2012-02-16 12:44             ` Peter Zijlstra
2012-02-16 14:52               ` Mathieu Desnoyers
2012-02-16 14:58                 ` Peter Zijlstra
2012-02-16 15:13               ` Paul E. McKenney
2012-02-20  7:15 ` Lai Jiangshan
2012-02-20 17:44   ` Paul E. McKenney
2012-02-21  1:11     ` Lai Jiangshan
2012-02-21  1:50       ` Paul E. McKenney
2012-02-21  8:44         ` Lai Jiangshan
2012-02-21 17:24           ` Paul E. McKenney
2012-02-22  9:29             ` [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path Lai Jiangshan
2012-02-22  9:29             ` [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock() Lai Jiangshan
2012-02-22  9:50               ` Peter Zijlstra
2012-02-22 21:20               ` Paul E. McKenney
2012-02-22 21:26                 ` Paul E. McKenney
2012-02-22 21:39                   ` Steven Rostedt
2012-02-23  1:01                     ` Paul E. McKenney
2012-02-22  9:29             ` [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period Lai Jiangshan
2012-02-23  1:01               ` Paul E. McKenney
2012-02-24  8:06               ` Lai Jiangshan
2012-02-24 20:01                 ` Paul E. McKenney
2012-02-27  8:01                   ` [PATCH 1/2 RFC] srcu: change the comments of the wait algorithm Lai Jiangshan
2012-02-27  8:01                   ` [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm Lai Jiangshan
2012-02-27 18:30                     ` Paul E. McKenney
2012-02-28  1:51                       ` Lai Jiangshan
2012-02-28 13:47                         ` Paul E. McKenney
2012-02-29 10:07                           ` Lai Jiangshan
2012-02-29 13:55                             ` Paul E. McKenney
2012-03-01  2:31                               ` Lai Jiangshan
2012-03-01 13:20                                 ` Paul E. McKenney
2012-03-10  3:41                                   ` Lai Jiangshan
2012-03-06  8:42             ` [RFC PATCH 0/6 paul/rcu/srcu] srcu: implement call_srcu() Lai Jiangshan
2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
2012-03-06  9:57                 ` [PATCH 2/6] Don't touch the snap in srcu_readers_active() Lai Jiangshan
2012-03-08 19:14                   ` Paul E. McKenney
2012-03-06  9:57                 ` [PATCH 3/6] use "int trycount" instead of "bool expedited" Lai Jiangshan
2012-03-08 19:25                   ` Paul E. McKenney
2012-03-06  9:57                 ` [PATCH 4/6] remove flip_idx_and_wait() Lai Jiangshan
2012-03-06 10:41                   ` Peter Zijlstra
2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
2012-03-08 13:04                     ` Peter Zijlstra
2012-03-08 14:17                       ` Lai Jiangshan
2012-03-08 13:08                     ` Peter Zijlstra
2012-03-08 20:35                     ` Paul E. McKenney
2012-03-10  3:16                       ` Lai Jiangshan
2012-03-12 18:03                         ` Paul E. McKenney
2012-03-14  7:47                           ` Lai Jiangshan
2012-04-10 20:15                             ` Paul E. McKenney
2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
2012-03-06 10:47                   ` Peter Zijlstra
2012-03-08 19:44                     ` Paul E. McKenney
2012-03-06 10:58                   ` Peter Zijlstra
2012-03-06 15:17                     ` Lai Jiangshan
2012-03-06 15:38                       ` Peter Zijlstra
2012-03-08 19:49                         ` Paul E. McKenney
2012-03-10 10:12                           ` Peter Zijlstra
2012-03-12 17:52                             ` Paul E. McKenney
2012-03-06 11:16                   ` Peter Zijlstra
2012-03-06 15:12                     ` Lai Jiangshan
2012-03-06 15:34                       ` Peter Zijlstra
2012-03-08 19:58                         ` Paul E. McKenney
2012-03-10  3:32                           ` Lai Jiangshan
2012-03-10 10:09                           ` Peter Zijlstra
2012-03-12 17:54                             ` Paul E. McKenney
2012-03-12 17:58                               ` Peter Zijlstra
2012-03-12 18:32                                 ` Paul E. McKenney
2012-03-12 20:25                                   ` Peter Zijlstra
2012-03-12 23:15                                     ` Paul E. McKenney
2012-03-12 23:18                                       ` Peter Zijlstra
2012-03-12 23:38                                         ` Paul E. McKenney
2012-03-06 15:26                     ` Lai Jiangshan
2012-03-06 15:37                       ` Peter Zijlstra
2012-03-06 11:17                   ` Peter Zijlstra
2012-03-06 11:22                   ` Peter Zijlstra
2012-03-06 11:35                   ` Peter Zijlstra
2012-03-06 11:36                   ` Peter Zijlstra
2012-03-06 11:39                   ` Peter Zijlstra
2012-03-06 14:50                     ` Lai Jiangshan
2012-03-06 11:52                   ` Peter Zijlstra
2012-03-06 14:44                     ` Lai Jiangshan
2012-03-06 15:31                       ` Peter Zijlstra
2012-03-06 15:32                       ` Peter Zijlstra
2012-03-07  6:44                         ` Lai Jiangshan
2012-03-07  8:10                       ` Gilad Ben-Yossef
2012-03-07  9:21                         ` Lai Jiangshan
2012-03-06 14:47                     ` Lai Jiangshan
2012-03-06  9:57                 ` [PATCH 6/6] add srcu torture test Lai Jiangshan
2012-03-08 19:03                 ` [PATCH 1/6] remove unused srcu_barrier() Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120216110030.GA1425@Krystal \
    --to=mathieu.desnoyers@efficios.com \
    --cc=Valdis.Kletnieks@vt.edu \
    --cc=akpm@linux-foundation.org \
    --cc=compudj@krystal.dyndns.org \
    --cc=darren@dvhart.com \
    --cc=dhowells@redhat.com \
    --cc=dipankar@in.ibm.com \
    --cc=eric.dumazet@gmail.com \
    --cc=fweisbec@gmail.com \
    --cc=josh@joshtriplett.org \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=niv@us.ibm.com \
    --cc=patches@linaro.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).