public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
@ 2008-07-06  9:23 Lai Jiangshan
  2008-07-18 14:09 ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Lai Jiangshan @ 2008-07-06  9:23 UTC (permalink / raw)
  To: Paul E. McKenney, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Ingo Molnar, Peter Zijlstra
  Cc: Linux Kernel Mailing List


This is v2; it differs slightly from the v1 that I
had sent to lkml:
use ACCESS_ONCE
use rcu_batch_after/rcu_batch_before for batch # comparison.

rcutorture test result:
(hotplugs: do cpu-online/offline once per second)

No CONFIG_NO_HZ:           OK, 12hours
No CONFIG_NO_HZ, hotplugs: OK, 12hours
CONFIG_NO_HZ=y:            OK, 24hours
CONFIG_NO_HZ=y, hotplugs:  Failed.
(It also failed without my patch applied; exactly the same bug occurred:
http://lkml.org/lkml/2008/7/3/24)

v1's email thread:
http://lkml.org/lkml/2008/6/2/539


v1's description:

The code/algorithm of the current callbacks-processing implementation
is very efficient and clever, but when I studied it I found a
disadvantage:

In multi-CPU systems, when a new RCU callback is queued
(call_rcu[_bh]), the current implementation makes it very likely that
the callback will only be invoked after the grace period for the
batch with batch number = rcp->cur+2 has completed. In fact, the
callback could already be invoked once the grace period for the batch
with batch number = rcp->cur+1 has completed. This delayed invocation
extends the latency of synchronize_rcu(). More importantly, the
callbacks usually free memory, so that work is delayed too; it is
necessary for the reclaimer to free memory as soon as possible when
little memory is left.

A very simple approach solves this problem: add a field
(struct rcu_head::batch) recording the batch number of each RCU
callback. When a new RCU callback is queued, we determine its batch
number (head->batch = rcp->cur+1), and when we process callbacks we
move it to rdp->donelist once we find that
head->batch <= rcp->completed. This reduces the average wait time
before invocation a lot (from about 2.5 grace periods to about
1.5 grace periods in multi-CPU systems).

That is the algorithm, but my implementation does not actually add a
field to struct rcu_head. We only need to remember the last two
batches and their batch numbers, because those two batches contain
all the entries whose grace period has not yet completed. So we use a
specially segmented linked list instead of an extra field.
Please see the comment of struct rcu_data.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index 5efd734..8ed9de6 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -66,11 +66,7 @@ static inline int rcu_batch_after(long a, long b)
 	return (a - b) > 0;
 }
 
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
+/* Per-CPU data for Read-Copy UPdate. */
 struct rcu_data {
 	/* 1) quiescent state handling : */
 	long		quiescbatch;     /* Batch # for grace period */
@@ -78,12 +74,24 @@ struct rcu_data {
 	int		qs_pending;	 /* core waits for quiesc state */
 
 	/* 2) batch handling */
-	long  	       	batch;           /* Batch # for current RCU batch */
+	/*
+	 * if nxtlist is not NULL, then:
+	 * batch:
+	 *	The batch # for the last entry of nxtlist
+	 * [*nxttail[1], NULL = *nxttail[2]):
+	 *	Entries that batch # <= batch
+	 * [*nxttail[0], *nxttail[1]):
+	 *	Entries that batch # <= batch - 1
+	 * [nxtlist, *nxttail[0]):
+	 *	Entries that batch # <= batch - 2
+	 *	The grace period for these entries has completed, and
+	 *	the other grace-period-completed entries may be moved
+	 *	here temporarily in rcu_process_callbacks().
+	 */
+	long  	       	batch;
 	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
+	struct rcu_head **nxttail[3];
 	long            qlen; 	 	 /* # of queued callbacks */
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
 	struct rcu_head *donelist;
 	struct rcu_head **donetail;
 	long		blimit;		 /* Upper limit on a processed batch */
diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index 9ac3b94..d9c1caf 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -120,6 +120,43 @@ static inline void force_quiescent_state(struct rcu_data *rdp,
 }
 #endif
 
+static void __call_rcu(struct rcu_head *head, struct rcu_ctrlblk *rcp,
+		struct rcu_data *rdp)
+{
+	long batch;
+	smp_mb(); /* reads the most recently updated value of rcu->cur. */
+
+	/*
+	 * Determine the batch number of this callback.
+	 *
+	 * Using ACCESS_ONCE to avoid the following error when gcc eliminates
+	 * local variable "batch" and emits codes like this:
+	 *	1) rdp->batch = rcp->cur + 1 # gets old value
+	 *	......
+	 *	2)rcu_batch_after(rcp->cur + 1, rdp->batch) # gets new value
+	 * then [*nxttail[0], *nxttail[1]) may contain callbacks
+	 * that batch# = rdp->batch, see the comment of struct rcu_data.
+	 */
+	batch = ACCESS_ONCE(rcp->cur) + 1;
+
+	if (rdp->nxtlist && rcu_batch_after(batch, rdp->batch)) {
+		/* process callbacks */
+		rdp->nxttail[0] = rdp->nxttail[1];
+		rdp->nxttail[1] = rdp->nxttail[2];
+		if (rcu_batch_after(batch - 1, rdp->batch))
+			rdp->nxttail[0] = rdp->nxttail[2];
+	}
+
+	rdp->batch = batch;
+	*rdp->nxttail[2] = head;
+	rdp->nxttail[2] = &head->next;
+
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_ctrlblk);
+	}
+}
+
 /**
  * call_rcu - Queue an RCU callback for invocation after a grace period.
  * @head: structure to be used for queueing the RCU updates.
@@ -135,18 +172,11 @@ void call_rcu(struct rcu_head *head,
 				void (*func)(struct rcu_head *rcu))
 {
 	unsigned long flags;
-	struct rcu_data *rdp;
 
 	head->func = func;
 	head->next = NULL;
 	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
-	if (unlikely(++rdp->qlen > qhimark)) {
-		rdp->blimit = INT_MAX;
-		force_quiescent_state(rdp, &rcu_ctrlblk);
-	}
+	__call_rcu(head, &rcu_ctrlblk, &__get_cpu_var(rcu_data));
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(call_rcu);
@@ -171,20 +201,11 @@ void call_rcu_bh(struct rcu_head *head,
 				void (*func)(struct rcu_head *rcu))
 {
 	unsigned long flags;
-	struct rcu_data *rdp;
 
 	head->func = func;
 	head->next = NULL;
 	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_bh_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
-
-	if (unlikely(++rdp->qlen > qhimark)) {
-		rdp->blimit = INT_MAX;
-		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
-	}
-
+	__call_rcu(head, &rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(call_rcu_bh);
@@ -213,12 +234,6 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
 static inline void raise_rcu_softirq(void)
 {
 	raise_softirq(RCU_SOFTIRQ);
-	/*
-	 * The smp_mb() here is required to ensure that this cpu's
-	 * __rcu_process_callbacks() reads the most recently updated
-	 * value of rcu->cur.
-	 */
-	smp_mb();
 }
 
 /*
@@ -360,13 +375,15 @@ static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
  * which is dead and hence not processing interrupts.
  */
 static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
-				struct rcu_head **tail)
+				struct rcu_head **tail, long batch)
 {
-	local_irq_disable();
-	*this_rdp->nxttail = list;
-	if (list)
-		this_rdp->nxttail = tail;
-	local_irq_enable();
+	if (list) {
+		local_irq_disable();
+		this_rdp->batch = batch;
+		*this_rdp->nxttail[2] = list;
+		this_rdp->nxttail[2] = tail;
+		local_irq_enable();
+	}
 }
 
 static void __rcu_offline_cpu(struct rcu_data *this_rdp,
@@ -380,9 +397,9 @@ static void __rcu_offline_cpu(struct rcu_data *this_rdp,
 	if (rcp->cur != rcp->completed)
 		cpu_quiet(rdp->cpu, rcp);
 	spin_unlock_bh(&rcp->lock);
-	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
-	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
-	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+	/* spin_lock implies smp_mb() */
+	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail, rcp->cur + 1);
+	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail[2], rcp->cur + 1);
 }
 
 static void rcu_offline_cpu(int cpu)
@@ -412,27 +429,37 @@ static void rcu_offline_cpu(int cpu)
 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
 					struct rcu_data *rdp)
 {
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
-		*rdp->donetail = rdp->curlist;
-		rdp->donetail = rdp->curtail;
-		rdp->curlist = NULL;
-		rdp->curtail = &rdp->curlist;
-	}
-
-	if (rdp->nxtlist && !rdp->curlist) {
+	if (rdp->nxtlist) {
 		local_irq_disable();
-		rdp->curlist = rdp->nxtlist;
-		rdp->curtail = rdp->nxttail;
-		rdp->nxtlist = NULL;
-		rdp->nxttail = &rdp->nxtlist;
-		local_irq_enable();
 
 		/*
-		 * start the next batch of callbacks
+		 * move the other grace-period-completed entries to
+		 * [rdp->nxtlist, *rdp->nxttail[0]) temporarily
+		 */
+		if (!rcu_batch_before(rcp->completed, rdp->batch))
+			rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2];
+		else if (!rcu_batch_before(rcp->completed, rdp->batch - 1))
+			rdp->nxttail[0] = rdp->nxttail[1];
+
+		/*
+		 * the grace period for entries in
+		 * [rdp->nxtlist, *rdp->nxttail[0]) has completed and
+		 * move these entries to donelist
 		 */
+		if (rdp->nxttail[0] != &rdp->nxtlist) {
+			*rdp->donetail = rdp->nxtlist;
+			rdp->donetail = rdp->nxttail[0];
+			rdp->nxtlist = *rdp->nxttail[0];
+			*rdp->donetail = NULL;
+
+			if (rdp->nxttail[1] == rdp->nxttail[0])
+				rdp->nxttail[1] = &rdp->nxtlist;
+			if (rdp->nxttail[2] == rdp->nxttail[0])
+				rdp->nxttail[2] = &rdp->nxtlist;
+			rdp->nxttail[0] = &rdp->nxtlist;
+		}
 
-		/* determine batch number */
-		rdp->batch = rcp->cur + 1;
+		local_irq_enable();
 
 		if (rcu_batch_after(rdp->batch, rcp->pending)) {
 			/* and start it/schedule start if it's a new batch */
@@ -458,15 +485,26 @@ static void rcu_process_callbacks(struct softirq_action *unused)
 
 static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
+	if (rdp->nxtlist) {
+		/*
+		 * This cpu has pending rcu entries and the grace period
+		 * for them has completed.
+		 */
+		if (!rcu_batch_before(rcp->completed, rdp->batch))
+			return 1;
+		if (!rcu_batch_before(rcp->completed, rdp->batch - 1) &&
+				rdp->nxttail[0] != rdp->nxttail[1])
+			return 1;
+		if (rdp->nxttail[0] != &rdp->nxtlist)
+			return 1;
 
-	/* This cpu has no pending entries, but there are new entries */
-	if (!rdp->curlist && rdp->nxtlist)
-		return 1;
+		/*
+		 * This cpu has pending rcu entries and the new batch
+		 * for then hasn't been started nor scheduled start
+		 */
+		if (rcu_batch_after(rdp->batch, rcp->pending))
+			return 1;
+	}
 
 	/* This cpu has finished callbacks to invoke */
 	if (rdp->donelist)
@@ -502,7 +540,7 @@ int rcu_needs_cpu(int cpu)
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 	struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
 
-	return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu));
+	return !!rdp->nxtlist || !!rdp_bh->nxtlist || rcu_pending(cpu);
 }
 
 void rcu_check_callbacks(int cpu, int user)
@@ -521,8 +559,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
 						struct rcu_data *rdp)
 {
 	memset(rdp, 0, sizeof(*rdp));
-	rdp->curtail = &rdp->curlist;
-	rdp->nxttail = &rdp->nxtlist;
+	rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2] = &rdp->nxtlist;
 	rdp->donetail = &rdp->donelist;
 	rdp->quiescbatch = rcp->completed;
 	rdp->qs_pending = 0;


* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
  2008-07-06  9:23 [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2) Lai Jiangshan
@ 2008-07-18 14:09 ` Ingo Molnar
  2008-08-01 21:10   ` Paul E. McKenney
       [not found]   ` <20080721100433.GC8384@linux.vnet.ibm.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Ingo Molnar @ 2008-07-18 14:09 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List


* Lai Jiangshan <laijs@cn.fujitsu.com> wrote:

> This is v2, it's a little deference from v1 that I
> had send to lkml.
> use ACCESS_ONCE
> use rcu_batch_after/rcu_batch_before for batch # comparison.
> 
> rcutorture test result:
> (hotplugs: do cpu-online/offline once per second)
> 
> No CONFIG_NO_HZ:           OK, 12hours
> No CONFIG_NO_HZ, hotplugs: OK, 12hours
> CONFIG_NO_HZ=y:            OK, 24hours
> CONFIG_NO_HZ=y, hotplugs:  Failed.
> (Failed also without my patch applied, exactly the same bug occurred,
> http://lkml.org/lkml/2008/7/3/24)

thanks, i've applied this to tip/core/rcu - but it would be nice to have
an ack/nak from Paul as well, before we can proceed further.

	Ingo


* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
  2008-07-18 14:09 ` Ingo Molnar
@ 2008-08-01 21:10   ` Paul E. McKenney
       [not found]   ` <20080721100433.GC8384@linux.vnet.ibm.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lai Jiangshan, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

On Fri, Jul 18, 2008 at 04:09:30PM +0200, Ingo Molnar wrote:
> 
> * Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> 
> > This is v2, it's a little deference from v1 that I
> > had send to lkml.
> > use ACCESS_ONCE
> > use rcu_batch_after/rcu_batch_before for batch # comparison.
> > 
> > rcutorture test result:
> > (hotplugs: do cpu-online/offline once per second)
> > 
> > No CONFIG_NO_HZ:           OK, 12hours
> > No CONFIG_NO_HZ, hotplugs: OK, 12hours
> > CONFIG_NO_HZ=y:            OK, 24hours
> > CONFIG_NO_HZ=y, hotplugs:  Failed.
> > (Failed also without my patch applied, exactly the same bug occurred,
> > http://lkml.org/lkml/2008/7/3/24)
> 
> thanks, i've applied this to tip/core/rcu - but it would be nice have an 
> ack/nak from Paul as well, before we can proceed further.

I like the general approach -- simplification is a good thing.

So, for the moment:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

I need to do a more detailed review (e.g., the memory-barrier changes
and the incorporation of some grace-period processing into call_rcu()),
after which I will either ask for changes or give a Reviewed-by.

I need to do this more-detailed review before we ship to mainline.
(Gak, already two weeks since Jiangshan sent this!!!  My apologies!!!)

							Thanx, Paul


* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
       [not found]   ` <20080721100433.GC8384@linux.vnet.ibm.com>
@ 2008-08-01 21:10     ` Paul E. McKenney
  2008-08-03  8:01       ` Lai Jiangshan
       [not found]     ` <20080725165454.GA7147@linux.vnet.ibm.com>
  1 sibling, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lai Jiangshan, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

On Mon, Jul 21, 2008 at 03:04:33AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 18, 2008 at 04:09:30PM +0200, Ingo Molnar wrote:
> > 
> > * Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> > 
> > > This is v2, it's a little deference from v1 that I
> > > had send to lkml.
> > > use ACCESS_ONCE
> > > use rcu_batch_after/rcu_batch_before for batch # comparison.
> > > 
> > > rcutorture test result:
> > > (hotplugs: do cpu-online/offline once per second)
> > > 
> > > No CONFIG_NO_HZ:           OK, 12hours
> > > No CONFIG_NO_HZ, hotplugs: OK, 12hours
> > > CONFIG_NO_HZ=y:            OK, 24hours
> > > CONFIG_NO_HZ=y, hotplugs:  Failed.
> > > (Failed also without my patch applied, exactly the same bug occurred,
> > > http://lkml.org/lkml/2008/7/3/24)
> > 
> > thanks, i've applied this to tip/core/rcu - but it would be nice have an 
> > ack/nak from Paul as well, before we can proceed further.
> 
> I like the general approach -- simplification is a good thing.
> 
> So, for the moment:
> 
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> I need to do a more detailed review (e.g., the memory-barrier changes
> and the incorporation of some grace-period processing into call_rcu()),
> after which I will either ask for changes or give a Reviewed-by.
> 
> I need to do this more-detailed review before we ship to mainline.
> (Gak, already two weeks since Jiangshan sent this!!!  My apologies!!!)


> diff -urpNa -X dontdiff linux-2.6.26/include/linux/rcuclassic.h linux-2.6.26-ljsimp/include/linux/rcuclassic.h
> --- linux-2.6.26/include/linux/rcuclassic.h	2008-07-13 14:51:29.000000000 -0700
> +++ linux-2.6.26-ljsimp/include/linux/rcuclassic.h	2008-07-24 13:44:59.000000000 -0700
> @@ -45,7 +45,7 @@
>  struct rcu_ctrlblk {
>  	long	cur;		/* Current batch number.                      */
>  	long	completed;	/* Number of the last completed batch         */
> -	int	next_pending;	/* Is the next batch already waiting?         */
> +	long	pending;	/* Number of the last pending batch           */

I vaguely remember some issue with cache locality motivating the change
to a simple flag.  However, if this change causes problems on large SMP
machines, a better fix would be to go to a hierarchical grace-period
detection algorithm.  So I am OK with this change.

>  	int	signaled;
>  
> @@ -66,11 +66,7 @@ static inline int rcu_batch_after(long a
>  	return (a - b) > 0;
>  }
>  
> -/*
> - * Per-CPU data for Read-Copy UPdate.
> - * nxtlist - new callbacks are added here
> - * curlist - current batch for which quiescent cycle started if any
> - */
> +/* Per-CPU data for Read-Copy UPdate. */
>  struct rcu_data {
>  	/* 1) quiescent state handling : */
>  	long		quiescbatch;     /* Batch # for grace period */
> @@ -78,12 +74,24 @@ struct rcu_data {
>  	int		qs_pending;	 /* core waits for quiesc state */
>  
>  	/* 2) batch handling */
> -	long  	       	batch;           /* Batch # for current RCU batch */
> +	/*
> +	 * if nxtlist is not NULL, then:
> +	 * batch:
> +	 *	The batch # for the last entry of nxtlist
> +	 * [*nxttail[1], NULL = *nxttail[2]):
> +	 *	Entries that batch # <= batch
> +	 * [*nxttail[0], *nxttail[1]):
> +	 *	Entries that batch # <= batch - 1
> +	 * [nxtlist, *nxttail[0]):
> +	 *	Entries that batch # <= batch - 2
> +	 *	The grace period for these entries has completed, and
> +	 *	the other grace-period-completed entries may be moved
> +	 *	here temporarily in rcu_process_callbacks().
> +	 */
> +	long  	       	batch;
>  	struct rcu_head *nxtlist;
> -	struct rcu_head **nxttail;
> +	struct rcu_head **nxttail[3];
>  	long            qlen; 	 	 /* # of queued callbacks */
> -	struct rcu_head *curlist;
> -	struct rcu_head **curtail;
>  	struct rcu_head *donelist;
>  	struct rcu_head **donetail;
>  	long		blimit;		 /* Upper limit on a processed batch */
> diff -urpNa -X dontdiff linux-2.6.26/kernel/rcuclassic.c linux-2.6.26-ljsimp/kernel/rcuclassic.c
> --- linux-2.6.26/kernel/rcuclassic.c	2008-07-13 14:51:29.000000000 -0700
> +++ linux-2.6.26-ljsimp/kernel/rcuclassic.c	2008-07-24 13:46:58.000000000 -0700
> @@ -60,12 +60,14 @@ EXPORT_SYMBOL_GPL(rcu_lock_map);
>  static struct rcu_ctrlblk rcu_ctrlblk = {
>  	.cur = -300,
>  	.completed = -300,
> +	.pending = -300,
>  	.lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
>  	.cpumask = CPU_MASK_NONE,
>  };
>  static struct rcu_ctrlblk rcu_bh_ctrlblk = {
>  	.cur = -300,
>  	.completed = -300,
> +	.pending = -300,
>  	.lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
>  	.cpumask = CPU_MASK_NONE,
>  };
> @@ -118,6 +120,43 @@ static inline void force_quiescent_state
>  }
>  #endif
>  
> +static void __call_rcu(struct rcu_head *head, struct rcu_ctrlblk *rcp,
> +		struct rcu_data *rdp)
> +{
> +	long batch;
> +	smp_mb(); /* reads the most recently updated value of rcu->cur. */

A better comment would be something like:

	smp_mb(); /* Read of rcu->cur must happen after any change by caller. */

> +
> +	/*
> +	 * Determine the batch number of this callback.
> +	 *
> +	 * Using ACCESS_ONCE to avoid the following error when gcc eliminates
> +	 * local variable "batch" and emits codes like this:
> +	 *	1) rdp->batch = rcp->cur + 1 # gets old value
> +	 *	......
> +	 *	2)rcu_batch_after(rcp->cur + 1, rdp->batch) # gets new value
> +	 * then [*nxttail[0], *nxttail[1]) may contain callbacks
> +	 * that batch# = rdp->batch, see the comment of struct rcu_data.
> +	 */
> +	batch = ACCESS_ONCE(rcp->cur) + 1;
> +
> +	if (rdp->nxtlist && rcu_batch_after(batch, rdp->batch)) {
> +		/* process callbacks */
> +		rdp->nxttail[0] = rdp->nxttail[1];
> +		rdp->nxttail[1] = rdp->nxttail[2];
> +		if (rcu_batch_after(batch - 1, rdp->batch))
> +			rdp->nxttail[0] = rdp->nxttail[2];
> +	}
> +
> +	rdp->batch = batch;
> +	*rdp->nxttail[2] = head;
> +	rdp->nxttail[2] = &head->next;
> +
> +	if (unlikely(++rdp->qlen > qhimark)) {
> +		rdp->blimit = INT_MAX;
> +		force_quiescent_state(rdp, &rcu_ctrlblk);
> +	}
> +}
> +
>  /**
>   * call_rcu - Queue an RCU callback for invocation after a grace period.
>   * @head: structure to be used for queueing the RCU updates.
> @@ -133,18 +172,11 @@ void call_rcu(struct rcu_head *head,
>  				void (*func)(struct rcu_head *rcu))
>  {
>  	unsigned long flags;
> -	struct rcu_data *rdp;
>  
>  	head->func = func;
>  	head->next = NULL;
>  	local_irq_save(flags);

I very much like the gathering of common code from call_rcu() and
call_rcu_bh() into __call_rcu().  But why not also move the
local_irq_save() and local_irq_restore() to __call_rcu(), perhaps
along with the initialization of head->next?

(I understand the motivation for keeping the initialization of the
fields of "head" at this level -- otherwise, you must add another
argument to __call_rcu().  But might be worth considering...)

> -	rdp = &__get_cpu_var(rcu_data);
> -	*rdp->nxttail = head;
> -	rdp->nxttail = &head->next;
> -	if (unlikely(++rdp->qlen > qhimark)) {
> -		rdp->blimit = INT_MAX;
> -		force_quiescent_state(rdp, &rcu_ctrlblk);
> -	}
> +	__call_rcu(head, &rcu_ctrlblk, &__get_cpu_var(rcu_data));
>  	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(call_rcu);
> @@ -169,20 +201,11 @@ void call_rcu_bh(struct rcu_head *head,
>  				void (*func)(struct rcu_head *rcu))
>  {
>  	unsigned long flags;
> -	struct rcu_data *rdp;
>  
>  	head->func = func;
>  	head->next = NULL;
>  	local_irq_save(flags);
> -	rdp = &__get_cpu_var(rcu_bh_data);
> -	*rdp->nxttail = head;
> -	rdp->nxttail = &head->next;
> -
> -	if (unlikely(++rdp->qlen > qhimark)) {
> -		rdp->blimit = INT_MAX;
> -		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
> -	}
> -
> +	__call_rcu(head, &rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
>  	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(call_rcu_bh);
> @@ -211,12 +234,6 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed_
>  static inline void raise_rcu_softirq(void)
>  {
>  	raise_softirq(RCU_SOFTIRQ);
> -	/*
> -	 * The smp_mb() here is required to ensure that this cpu's
> -	 * __rcu_process_callbacks() reads the most recently updated
> -	 * value of rcu->cur.
> -	 */
> -	smp_mb();

I have not yet convinced myself that it is OK to get rid of this memory
barrier.  This memory barrier was intended to handle the following
sequence of events:

	rcu_read_lock_bh();  /* no memory barrier. */
	p = rcu_dereference(some_global_pointer);
	do_something_with(p);
	rcu_read_unlock_bh();  /* no memory barrier. */

	---- scheduling-clock interrupt occurs, eventually invoking
	---- rcu_check_callbacks()

	---- And the access to "p" above could potentially be reordered
	---- into the grace-period computation

Such reordering is of course quite unlikely to be harmful, due to the
long duration of RCU grace periods.  But I am paranoid.

If this memory barrier turns out to be necessary, one approach would
be to place it at the beginning of rcu_check_callbacks(), which is
a better place for it in any case.

Thoughts?

>  }
>  
>  /*
> @@ -276,14 +293,8 @@ static void rcu_do_batch(struct rcu_data
>   */
>  static void rcu_start_batch(struct rcu_ctrlblk *rcp)
>  {
> -	if (rcp->next_pending &&
> +	if (rcp->cur != rcp->pending &&
>  			rcp->completed == rcp->cur) {
> -		rcp->next_pending = 0;
> -		/*
> -		 * next_pending == 0 must be visible in
> -		 * __rcu_process_callbacks() before it can see new value of cur.
> -		 */
> -		smp_wmb();

These memory barriers were added to allow for the fact that we run much
of the grace-period machinery without holding locks.

>  		rcp->cur++;
>  
>  		/*
> @@ -364,13 +375,15 @@ static void rcu_check_quiescent_state(st
>   * which is dead and hence not processing interrupts.
>   */
>  static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
> -				struct rcu_head **tail)
> +				struct rcu_head **tail, long batch)
>  {
> -	local_irq_disable();
> -	*this_rdp->nxttail = list;
> -	if (list)
> -		this_rdp->nxttail = tail;
> -	local_irq_enable();
> +	if (list) {
> +		local_irq_disable();
> +		this_rdp->batch = batch;
> +		*this_rdp->nxttail[2] = list;
> +		this_rdp->nxttail[2] = tail;
> +		local_irq_enable();
> +	}
>  }
>  
>  static void __rcu_offline_cpu(struct rcu_data *this_rdp,
> @@ -384,9 +397,9 @@ static void __rcu_offline_cpu(struct rcu
>  	if (rcp->cur != rcp->completed)
>  		cpu_quiet(rdp->cpu, rcp);
>  	spin_unlock_bh(&rcp->lock);
> -	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
> -	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
> -	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
> +	/* spin_lock implies smp_mb() */

Almost.  Please note that spin_lock() and spin_unlock() are permitted
to imply semi-permeable memory barriers.  See the ia64 implementation for
an example of this.  The upshot is that spin_lock() is permitted to
allow prior memory references to "bleed" into the critical section, and
spin_unlock() is likewise permitted to allow subsequent memory
references to "bleed" into the critical section.  So, if you have code
as follows:

	a = 1;
	spin_lock(&my_lock);
	b = 1;
	c = 1;
	spin_unlock(&my_lock);
	d = 1;

Then the following is a perfectly legal execution ordering:

	spin_lock(&my_lock);
	d = 1;
	c = 1;
	b = 1;
	a = 1;
	spin_unlock(&my_lock);

Not sure if this causes the code any problems, but at the very least,
the comment needs to change.

more later...

							Thanx, Paul

> +	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail, rcp->cur + 1);
> +	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail[2], rcp->cur + 1);
>  }
>  
>  static void rcu_offline_cpu(int cpu)
> @@ -416,37 +429,45 @@ static void rcu_offline_cpu(int cpu)
>  static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
>  					struct rcu_data *rdp)
>  {
> -	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
> -		*rdp->donetail = rdp->curlist;
> -		rdp->donetail = rdp->curtail;
> -		rdp->curlist = NULL;
> -		rdp->curtail = &rdp->curlist;
> -	}
> -
> -	if (rdp->nxtlist && !rdp->curlist) {
> +	if (rdp->nxtlist) {
>  		local_irq_disable();
> -		rdp->curlist = rdp->nxtlist;
> -		rdp->curtail = rdp->nxttail;
> -		rdp->nxtlist = NULL;
> -		rdp->nxttail = &rdp->nxtlist;
> -		local_irq_enable();
>  
>  		/*
> -		 * start the next batch of callbacks
> +		 * move the other grace-period-completed entries to
> +		 * [rdp->nxtlist, *rdp->nxttail[0]) temporarily
>  		 */
> +		if (!rcu_batch_before(rcp->completed, rdp->batch))
> +			rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2];
> +		else if (!rcu_batch_before(rcp->completed, rdp->batch - 1))
> +			rdp->nxttail[0] = rdp->nxttail[1];
>  
> -		/* determine batch number */
> -		rdp->batch = rcp->cur + 1;
> -		/* see the comment and corresponding wmb() in
> -		 * the rcu_start_batch()
> +		/*
> +		 * the grace period for entries in
> +		 * [rdp->nxtlist, *rdp->nxttail[0]) has completed and
> +		 * move these entries to donelist
>  		 */
> -		smp_rmb();
> +		if (rdp->nxttail[0] != &rdp->nxtlist) {
> +			*rdp->donetail = rdp->nxtlist;
> +			rdp->donetail = rdp->nxttail[0];
> +			rdp->nxtlist = *rdp->nxttail[0];
> +			*rdp->donetail = NULL;
> +
> +			if (rdp->nxttail[1] == rdp->nxttail[0])
> +				rdp->nxttail[1] = &rdp->nxtlist;
> +			if (rdp->nxttail[2] == rdp->nxttail[0])
> +				rdp->nxttail[2] = &rdp->nxtlist;
> +			rdp->nxttail[0] = &rdp->nxtlist;
> +		}
> +
> +		local_irq_enable();
>  
> -		if (!rcp->next_pending) {
> +		if (rcu_batch_after(rdp->batch, rcp->pending)) {
>  			/* and start it/schedule start if it's a new batch */
>  			spin_lock(&rcp->lock);
> -			rcp->next_pending = 1;
> -			rcu_start_batch(rcp);
> +			if (rcu_batch_after(rdp->batch, rcp->pending)) {
> +				rcp->pending = rdp->batch;
> +				rcu_start_batch(rcp);
> +			}
>  			spin_unlock(&rcp->lock);
>  		}
>  	}
> @@ -464,15 +485,26 @@ static void rcu_process_callbacks(struct
>  
>  static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
>  {
> -	/* This cpu has pending rcu entries and the grace period
> -	 * for them has completed.
> -	 */
> -	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
> -		return 1;
> +	if (rdp->nxtlist) {
> +		/*
> +		 * This cpu has pending rcu entries and the grace period
> +		 * for them has completed.
> +		 */
> +		if (!rcu_batch_before(rcp->completed, rdp->batch))
> +			return 1;
> +		if (!rcu_batch_before(rcp->completed, rdp->batch - 1) &&
> +				rdp->nxttail[0] != rdp->nxttail[1])
> +			return 1;
> +		if (rdp->nxttail[0] != &rdp->nxtlist)
> +			return 1;
>  
> -	/* This cpu has no pending entries, but there are new entries */
> -	if (!rdp->curlist && rdp->nxtlist)
> -		return 1;
> +		/*
> +		 * This cpu has pending rcu entries and the new batch
> +		 * for then hasn't been started nor scheduled start
> +		 */
> +		if (rcu_batch_after(rdp->batch, rcp->pending))
> +			return 1;
> +	}
>  
>  	/* This cpu has finished callbacks to invoke */
>  	if (rdp->donelist)
> @@ -508,7 +540,7 @@ int rcu_needs_cpu(int cpu)
>  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
>  	struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
>  
> -	return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu));
> +	return !!rdp->nxtlist || !!rdp_bh->nxtlist || rcu_pending(cpu);
>  }
>  
>  void rcu_check_callbacks(int cpu, int user)
> @@ -527,8 +559,7 @@ static void rcu_init_percpu_data(int cpu
>  						struct rcu_data *rdp)
>  {
>  	memset(rdp, 0, sizeof(*rdp));
> -	rdp->curtail = &rdp->curlist;
> -	rdp->nxttail = &rdp->nxtlist;
> +	rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2] = &rdp->nxtlist;
>  	rdp->donetail = &rdp->donelist;
>  	rdp->quiescbatch = rcp->completed;
>  	rdp->qs_pending = 0;


* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
       [not found]     ` <20080725165454.GA7147@linux.vnet.ibm.com>
@ 2008-08-01 21:11       ` Paul E. McKenney
  0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2008-08-01 21:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lai Jiangshan, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

On Fri, Jul 25, 2008 at 09:54:54AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 21, 2008 at 03:04:33AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 18, 2008 at 04:09:30PM +0200, Ingo Molnar wrote:
> > > 
> > > * Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> > > 
> > > > This is v2, it's a little deference from v1 that I
> > > > had send to lkml.
> > > > use ACCESS_ONCE
> > > > use rcu_batch_after/rcu_batch_before for batch # comparison.
> > > > 
> > > > rcutorture test result:
> > > > (hotplugs: do cpu-online/offline once per second)
> > > > 
> > > > No CONFIG_NO_HZ:           OK, 12hours
> > > > No CONFIG_NO_HZ, hotplugs: OK, 12hours
> > > > CONFIG_NO_HZ=y:            OK, 24hours
> > > > CONFIG_NO_HZ=y, hotplugs:  Failed.
> > > > (Failed also without my patch applied, exactly the same bug occurred,
> > > > http://lkml.org/lkml/2008/7/3/24)
> > > 
> > > thanks, i've applied this to tip/core/rcu - but it would be nice have an 
> > > ack/nak from Paul as well, before we can proceed further.
> > 
> > I like the general approach -- simplification is a good thing.
> > 
> > So, for the moment:
> > 
> > Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > I need to do a more detailed review (e.g., the memory-barrier changes
> > and the incorporation of some grace-period processing into call_rcu()),
> > after which I will either ask for changes or give a Reviewed-by.
> > 
> > I need to do this more-detailed review before we ship to mainline.
> > (Gak, already two weeks since Jiangshan sent this!!!  My apologies!!!)

And here is a patch to apply on top of Jiangshan's patch that
rationalizes the locking and memory barriers.  Much of this is general
cleanup, not necessarily directly related to Jiangshan's patch.
Given this as a base, it should be fairly easy to make a hierarchical
implementation for machines that see lock contention in the RCU
infrastructure -- split and make a hierarchy on rcu_ctrlblk.

Thoughts?

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 rcuclassic.c |   62 +++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 44 insertions(+), 18 deletions(-)

diff -urpNa -X dontdiff linux-2.6.26-ljsimp/kernel/rcuclassic.c linux-2.6.26-ljsimpfix/kernel/rcuclassic.c
--- linux-2.6.26-ljsimp/kernel/rcuclassic.c	2008-07-24 13:46:58.000000000 -0700
+++ linux-2.6.26-ljsimpfix/kernel/rcuclassic.c	2008-07-28 12:47:29.000000000 -0700
@@ -86,6 +86,7 @@ static void force_quiescent_state(struct
 	int cpu;
 	cpumask_t cpumask;
 	set_need_resched();
+	spin_lock(&rcp->lock);
 	if (unlikely(!rcp->signaled)) {
 		rcp->signaled = 1;
 		/*
@@ -111,6 +112,7 @@ static void force_quiescent_state(struct
 		for_each_cpu_mask(cpu, cpumask)
 			smp_send_reschedule(cpu);
 	}
+	spin_unlock(&rcp->lock);
 }
 #else
 static inline void force_quiescent_state(struct rcu_data *rdp,
@@ -124,7 +126,11 @@ static void __call_rcu(struct rcu_head *
 		struct rcu_data *rdp)
 {
 	long batch;
-	smp_mb(); /* reads the most recently updated value of rcu->cur. */
+	unsigned long flags;
+
+	head->next = NULL;
+	local_irq_save(flags);
+	smp_mb(); /* Read of rcu->cur must happen after any change by caller. */
 
 	/*
 	 * Determine the batch number of this callback.
@@ -155,6 +161,7 @@ static void __call_rcu(struct rcu_head *
 		rdp->blimit = INT_MAX;
 		force_quiescent_state(rdp, &rcu_ctrlblk);
 	}
+	local_irq_restore(flags);
 }
 
 /**
@@ -171,13 +178,8 @@ static void __call_rcu(struct rcu_head *
 void call_rcu(struct rcu_head *head,
 				void (*func)(struct rcu_head *rcu))
 {
-	unsigned long flags;
-
 	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
 	__call_rcu(head, &rcu_ctrlblk, &__get_cpu_var(rcu_data));
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(call_rcu);
 
@@ -200,13 +202,8 @@ EXPORT_SYMBOL_GPL(call_rcu);
 void call_rcu_bh(struct rcu_head *head,
 				void (*func)(struct rcu_head *rcu))
 {
-	unsigned long flags;
-
 	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
 	__call_rcu(head, &rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(call_rcu_bh);
 
@@ -389,17 +386,17 @@ static void rcu_move_batch(struct rcu_da
 static void __rcu_offline_cpu(struct rcu_data *this_rdp,
 				struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
-	/* if the cpu going offline owns the grace period
+	/*
+	 * if the cpu going offline owns the grace period
 	 * we can block indefinitely waiting for it, so flush
 	 * it here
 	 */
 	spin_lock_bh(&rcp->lock);
 	if (rcp->cur != rcp->completed)
 		cpu_quiet(rdp->cpu, rcp);
-	spin_unlock_bh(&rcp->lock);
-	/* spin_lock implies smp_mb() */
 	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail, rcp->cur + 1);
 	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail[2], rcp->cur + 1);
+	spin_unlock_bh(&rcp->lock);
 }
 
 static void rcu_offline_cpu(int cpu)
@@ -429,16 +426,19 @@ static void rcu_offline_cpu(int cpu)
 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
 					struct rcu_data *rdp)
 {
+	long completed_snap;
+
 	if (rdp->nxtlist) {
 		local_irq_disable();
+		completed_snap = ACCESS_ONCE(rcp->completed);
 
 		/*
 		 * move the other grace-period-completed entries to
 		 * [rdp->nxtlist, *rdp->nxttail[0]) temporarily
 		 */
-		if (!rcu_batch_before(rcp->completed, rdp->batch))
+		if (!rcu_batch_before(completed_snap, rdp->batch))
 			rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2];
-		else if (!rcu_batch_before(rcp->completed, rdp->batch - 1))
+		else if (!rcu_batch_before(completed_snap, rdp->batch - 1))
 			rdp->nxttail[0] = rdp->nxttail[1];
 
 		/*
@@ -479,20 +479,38 @@ static void __rcu_process_callbacks(stru
 
 static void rcu_process_callbacks(struct softirq_action *unused)
 {
+	/*
+	 * Memory references from any prior RCU read-side critical sections
+	 * executed by the interrupted code must be seen before any RCU
+	 * grace-period manipulations below.
+	 */
+
+	smp_mb(); /* See above block comment. */
+
 	__rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
 	__rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
+
+	/*
+	 * Memory references from any later RCU read-side critical sections
+	 * executed by the interrupted code must be seen after any RCU
+	 * grace-period manipulations above.
+	 */
+
+	smp_mb(); /* See above block comment. */
 }
 
 static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
 	if (rdp->nxtlist) {
+		long completed_snap = ACCESS_ONCE(rcp->completed);
+
 		/*
 		 * This cpu has pending rcu entries and the grace period
 		 * for them has completed.
 		 */
-		if (!rcu_batch_before(rcp->completed, rdp->batch))
+		if (!rcu_batch_before(completed_snap, rdp->batch))
 			return 1;
-		if (!rcu_batch_before(rcp->completed, rdp->batch - 1) &&
+		if (!rcu_batch_before(completed_snap, rdp->batch - 1) &&
 				rdp->nxttail[0] != rdp->nxttail[1])
 			return 1;
 		if (rdp->nxttail[0] != &rdp->nxtlist)
@@ -543,6 +561,12 @@ int rcu_needs_cpu(int cpu)
 	return !!rdp->nxtlist || !!rdp_bh->nxtlist || rcu_pending(cpu);
 }
 
+/*
+ * Top-level function driving RCU grace-period detection, normally
+ * invoked from the scheduler-clock interrupt.  This function simply
+ * increments counters that are read only from softirq by this same
+ * CPU, so there are no memory barriers required.
+ */
 void rcu_check_callbacks(int cpu, int user)
 {
 	if (user ||
@@ -558,6 +582,7 @@ void rcu_check_callbacks(int cpu, int us
 static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
 						struct rcu_data *rdp)
 {
+	spin_lock(&rcp->lock);
 	memset(rdp, 0, sizeof(*rdp));
 	rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2] = &rdp->nxtlist;
 	rdp->donetail = &rdp->donelist;
@@ -565,6 +590,7 @@ static void rcu_init_percpu_data(int cpu
 	rdp->qs_pending = 0;
 	rdp->cpu = cpu;
 	rdp->blimit = blimit;
+	spin_unlock(&rcp->lock);
 }
 
 static void __cpuinit rcu_online_cpu(int cpu)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
  2008-08-01 21:10     ` Paul E. McKenney
@ 2008-08-03  8:01       ` Lai Jiangshan
  2008-08-04 22:54         ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Lai Jiangshan @ 2008-08-03  8:01 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

Paul E. McKenney wrote:

[...]

>>  /**
>>   * call_rcu - Queue an RCU callback for invocation after a grace period.
>>   * @head: structure to be used for queueing the RCU updates.
>> @@ -133,18 +172,11 @@ void call_rcu(struct rcu_head *head,
>>  				void (*func)(struct rcu_head *rcu))
>>  {
>>  	unsigned long flags;
>> -	struct rcu_data *rdp;
>>  
>>  	head->func = func;
>>  	head->next = NULL;
>>  	local_irq_save(flags);
> 
> I very much like the gathering of common code from call_rcu() and
> call_rcu_bh() into __call_rcu().  But why not also move the
> local_irq_save() and local_irq_restore() to __call_rcu(), perhaps
> along with the initialization of head->next?

We need to evaluate __get_cpu_var() inside a preempt-disabled
critical section, so I didn't move the local_irq_save() and
local_irq_restore() into __call_rcu().

I agree with your changes except for this one.
percpu_ptr() might help us here.

> 
> (I understand the motivation for keeping the initialization of the
> fields of "head" at this level -- otherwise, you must add another
> argument to __call_rcu().  But might be worth considering...)
> 
>> -	rdp = &__get_cpu_var(rcu_data);
>> -	*rdp->nxttail = head;
>> -	rdp->nxttail = &head->next;
>> -	if (unlikely(++rdp->qlen > qhimark)) {
>> -		rdp->blimit = INT_MAX;
>> -		force_quiescent_state(rdp, &rcu_ctrlblk);
>> -	}
>> +	__call_rcu(head, &rcu_ctrlblk, &__get_cpu_var(rcu_data));
>>  	local_irq_restore(flags);
>>  }
>>  EXPORT_SYMBOL_GPL(call_rcu);
>> @@ -169,20 +201,11 @@ void call_rcu_bh(struct rcu_head *head,
>>  				void (*func)(struct rcu_head *rcu))
>>  {
>>  	unsigned long flags;
>> -	struct rcu_data *rdp;
>>  
>>  	head->func = func;
>>  	head->next = NULL;
>>  	local_irq_save(flags);
>> -	rdp = &__get_cpu_var(rcu_bh_data);
>> -	*rdp->nxttail = head;
>> -	rdp->nxttail = &head->next;
>> -
>> -	if (unlikely(++rdp->qlen > qhimark)) {
>> -		rdp->blimit = INT_MAX;
>> -		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
>> -	}
>> -
>> +	__call_rcu(head, &rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
>>  	local_irq_restore(flags);
>>  }
>>  EXPORT_SYMBOL_GPL(call_rcu_bh);
>> @@ -211,12 +234,6 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed_
>>  static inline void raise_rcu_softirq(void)
>>  {
>>  	raise_softirq(RCU_SOFTIRQ);
>> -	/*
>> -	 * The smp_mb() here is required to ensure that this cpu's
>> -	 * __rcu_process_callbacks() reads the most recently updated
>> -	 * value of rcu->cur.
>> -	 */
>> -	smp_mb();
> 
> I have not yet convinced myself that it is OK to get rid of this memory
> barrier.  This memory barrier was intended order to handle the following
> sequence of events:
> 
> 	rcu_read_lock_bh();  /* no memory barrier. */
> 	p = rcu_dereference(some_global_pointer);
> 	do_something_with(p);
> 	rcu_read_unlock_bh();  /* no memory barrier. */
> 
> 	---- scheduling-clock interrupt occurs, eventually invoking
> 	---- rcu_check_callbacks()
> 
> 	---- And the access to "p" above could potentially be reordered
> 	---- into the grace-period computation
> 
> Such reordering is of course quite unlikely to be harmful, due to the
> long duration of RCU grace periods.  But I am paranoid.
> 
> If this memory barrier turns out to be necessary, one approach would
> to be to place it at the beginning of rcu_check_callbacks(), which is
> a better place for it in any case.
> 
> Thoughts?

I hadn't thought about that before. I thought the smp_mb() was
for rcu->cur, as the original comment said.

I prefer adding the memory barriers to rcu_process_callbacks(), as in
your patch.

But I have a question here:

In this case, p->rcu_head is not on the donelist, so
__rcu_process_callbacks() only accesses p->rcu_head
(p->rcu_head.next), and other cpus do not access a p->rcu_head
that has been queued on this cpu's rcu_data.

Is this reordering harmful (how would it make another cpu's
accesses go wrong)?

[...]



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
  2008-08-03  8:01       ` Lai Jiangshan
@ 2008-08-04 22:54         ` Paul E. McKenney
  2008-08-06  7:08           ` Lai Jiangshan
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2008-08-04 22:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Ingo Molnar, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

On Sun, Aug 03, 2008 at 04:01:00PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> 
> [...]
> 
> >>  /**
> >>   * call_rcu - Queue an RCU callback for invocation after a grace period.
> >>   * @head: structure to be used for queueing the RCU updates.
> >> @@ -133,18 +172,11 @@ void call_rcu(struct rcu_head *head,
> >>  				void (*func)(struct rcu_head *rcu))
> >>  {
> >>  	unsigned long flags;
> >> -	struct rcu_data *rdp;
> >>  
> >>  	head->func = func;
> >>  	head->next = NULL;
> >>  	local_irq_save(flags);
> > 
> > I very much like the gathering of common code from call_rcu() and
> > call_rcu_bh() into __call_rcu().  But why not also move the
> > local_irq_save() and local_irq_restore() to __call_rcu(), perhaps
> > along with the initialization of head->next?
> 
> We need to evaluate __get_cpu_var() inside a preempt-disabled
> critical section, so I didn't move the local_irq_save() and
> local_irq_restore() into __call_rcu().

Good point -- a preemption just at the call to __call_rcu() does need
to be handled correctly.  I will update the patch.

> I agree with your changes except for this one.
> percpu_ptr() might help us here.

Tell me more about percpu_ptr().

> > (I understand the motivation for keeping the initialization of the
> > fields of "head" at this level -- otherwise, you must add another
> > argument to __call_rcu().  But might be worth considering...)
> > 
> >> -	rdp = &__get_cpu_var(rcu_data);
> >> -	*rdp->nxttail = head;
> >> -	rdp->nxttail = &head->next;
> >> -	if (unlikely(++rdp->qlen > qhimark)) {
> >> -		rdp->blimit = INT_MAX;
> >> -		force_quiescent_state(rdp, &rcu_ctrlblk);
> >> -	}
> >> +	__call_rcu(head, &rcu_ctrlblk, &__get_cpu_var(rcu_data));
> >>  	local_irq_restore(flags);
> >>  }
> >>  EXPORT_SYMBOL_GPL(call_rcu);
> >> @@ -169,20 +201,11 @@ void call_rcu_bh(struct rcu_head *head,
> >>  				void (*func)(struct rcu_head *rcu))
> >>  {
> >>  	unsigned long flags;
> >> -	struct rcu_data *rdp;
> >>  
> >>  	head->func = func;
> >>  	head->next = NULL;
> >>  	local_irq_save(flags);
> >> -	rdp = &__get_cpu_var(rcu_bh_data);
> >> -	*rdp->nxttail = head;
> >> -	rdp->nxttail = &head->next;
> >> -
> >> -	if (unlikely(++rdp->qlen > qhimark)) {
> >> -		rdp->blimit = INT_MAX;
> >> -		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
> >> -	}
> >> -
> >> +	__call_rcu(head, &rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
> >>  	local_irq_restore(flags);
> >>  }
> >>  EXPORT_SYMBOL_GPL(call_rcu_bh);
> >> @@ -211,12 +234,6 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed_
> >>  static inline void raise_rcu_softirq(void)
> >>  {
> >>  	raise_softirq(RCU_SOFTIRQ);
> >> -	/*
> >> -	 * The smp_mb() here is required to ensure that this cpu's
> >> -	 * __rcu_process_callbacks() reads the most recently updated
> >> -	 * value of rcu->cur.
> >> -	 */
> >> -	smp_mb();
> > 
> > I have not yet convinced myself that it is OK to get rid of this memory
> > barrier.  This memory barrier was intended order to handle the following
> > sequence of events:
> > 
> > 	rcu_read_lock_bh();  /* no memory barrier. */
> > 	p = rcu_dereference(some_global_pointer);
> > 	do_something_with(p);
> > 	rcu_read_unlock_bh();  /* no memory barrier. */
> > 
> > 	---- scheduling-clock interrupt occurs, eventually invoking
> > 	---- rcu_check_callbacks()
> > 
> > 	---- And the access to "p" above could potentially be reordered
> > 	---- into the grace-period computation
> > 
> > Such reordering is of course quite unlikely to be harmful, due to the
> > long duration of RCU grace periods.  But I am paranoid.
> > 
> > If this memory barrier turns out to be necessary, one approach would
> > to be to place it at the beginning of rcu_check_callbacks(), which is
> > a better place for it in any case.
> > 
> > Thoughts?
> 
> I hadn't thought about that before. I thought the smp_mb() was
> for rcu->cur, as the original comment said.
> 
> I prefer adding the memory barriers to rcu_process_callbacks(), as in
> your patch.

Yeah, that became clear as I wrote the code.  ;-)

> But I have a question here:
> 
> In this case, p->rcu_head is not on the donelist, so
> __rcu_process_callbacks() only accesses p->rcu_head
> (p->rcu_head.next), and other cpus do not access a p->rcu_head
> that has been queued on this cpu's rcu_data.
> 
> Is this reordering harmful (how would it make another cpu's
> accesses go wrong)?

I have a somewhat different goal here.  I want to simplify the memory
ordering design without giving up too much performance -- the current
state in mainline is much too fragile, in my opinion, especially given
that the grace-period code paths are not fastpaths.

Next step -- hierarchical grace-period detection to handle the 4096-CPU
machines that I was being buttonholed about at OLS...

Would you be interested in applying your multi-tailed list change to
preemptable RCU?

						Thanx, Paul

> [...]
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
  2008-08-04 22:54         ` Paul E. McKenney
@ 2008-08-06  7:08           ` Lai Jiangshan
  2008-08-07  3:19             ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Lai Jiangshan @ 2008-08-06  7:08 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

Paul E. McKenney wrote:
[...]
> 
> Tell me more about percpu_ptr().

Sorry about that. percpu_ptr() is used to dereference pointers to
dynamically allocated percpu data.

It seems that we cannot take a pointer to statically declared percpu
data and use it as if it were a pointer to dynamically allocated
percpu data.

> 
[...]
> 
> I have a somewhat different goal here.  I want to simplify the memory
> ordering design without giving up too much performance -- the current
> state in mainline is much too fragile, in my opinion, especially given
> that the grace-period code paths are not fastpaths.
> 
> Next step -- hierarchical grace-period detection to handle the 4096-CPU
> machines that I was being buttonholed about at OLS...
> 
> Would you be interested in applying your multi-tailed list change to
> preemptable RCU?
> 
It's not necessary. Actually, I like one tail per list, which is good
for readability.

But in my patch most of the work is combining lists, not moving one
list on to the next, so I used the multi-tailed list to simplify that
work and a few other things (e.g., "if (rdp->nxtlist)" would become a
more complex and less readable test with one tail per list).

That doesn't mean the multi-tailed list is always a good thing.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2)
  2008-08-06  7:08           ` Lai Jiangshan
@ 2008-08-07  3:19             ` Paul E. McKenney
  0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2008-08-07  3:19 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Ingo Molnar, Dipankar Sarma, Gautham Shenoy, Dhaval Giani,
	Peter Zijlstra, Linux Kernel Mailing List

On Wed, Aug 06, 2008 at 03:08:10PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> [...]
> > 
> > Tell me more about percpu_ptr().
> 
> Sorry about that. percpu_ptr() is used to dereference pointers to
> dynamically allocated percpu data.

Yep, that I knew.

> It seems that we cannot take a pointer to statically declared percpu
> data and use it as if it were a pointer to dynamically allocated
> percpu data.

Sad but true...  Ran into this with SRCU a couple of years back.  :-/

> > 
> [...]
> > 
> > I have a somewhat different goal here.  I want to simplify the memory
> > ordering design without giving up too much performance -- the current
> > state in mainline is much too fragile, in my opinion, especially given
> > that the grace-period code paths are not fastpaths.
> > 
> > Next step -- hierarchical grace-period detection to handle the 4096-CPU
> > machines that I was being buttonholed about at OLS...
> > 
> > Would you be interested in applying your multi-tailed list change to
> > preemptable RCU?
> > 
> It's not necessary. Actually, I like one tail per list, which is good
> for readability.
> 
> But in my patch most of the work is combining lists, not moving one
> list on to the next, so I used the multi-tailed list to simplify that
> work and a few other things (e.g., "if (rdp->nxtlist)" would become a
> more complex and less readable test with one tail per list).
> 
> That doesn't mean the multi-tailed list is always a good thing.

It does indeed depend on the details of the implementation.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-08-07  3:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-06  9:23 [RFC][PATCH 2/2] rcu classic: new algorithm for callbacks-processing(v2) Lai Jiangshan
2008-07-18 14:09 ` Ingo Molnar
2008-08-01 21:10   ` Paul E. McKenney
     [not found]   ` <20080721100433.GC8384@linux.vnet.ibm.com>
2008-08-01 21:10     ` Paul E. McKenney
2008-08-03  8:01       ` Lai Jiangshan
2008-08-04 22:54         ` Paul E. McKenney
2008-08-06  7:08           ` Lai Jiangshan
2008-08-07  3:19             ` Paul E. McKenney
     [not found]     ` <20080725165454.GA7147@linux.vnet.ibm.com>
2008-08-01 21:11       ` Paul E. McKenney
