[RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
@ 2004-05-25  5:35 Manfred Spraul
  2004-05-27 23:22 ` [Lse-tech] " Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: Manfred Spraul @ 2004-05-25  5:35 UTC (permalink / raw)
  To: linux-kernel, lse-tech

Hi,

Step two for reducing cacheline trashing within rcupdate.c:

rcu_process_callbacks always acquires rcu_ctrlblk.state.mutex and
calls rcu_start_batch, even if the batch is already running or already
scheduled to run.
This can be avoided with a sequence lock: A sequence lock allows
to read the current batch number and next_pending atomically. If
next_pending is already set, then there is no need to acquire the
global mutex.

This means that for each grace period, there will be

- one write access to the rcu_ctrlblk.batch cacheline
- lots of read accesses to rcu_ctrlblk.batch (3-10*cpus_online(),
  perhaps even more).
  
  Behavior similar to the jiffies cacheline, shouldn't be a problem.
  
- cpus_online()+1 write accesses to rcu_ctrlblk.state, all of them
  starting with spin_lock(&rcu_ctrlblk.state.mutex).

  For large enough cpus_online() this will be a problem, but all
  except two of the spin_lock calls only protect the rcu_cpu_mask
  bitmap, thus a hierarchical bitmap would allow to split the write
  accesses to multiple cachelines.

Tested on an 8-way with reaim. Unfortunately it probably won't help
with Jack Steiner's 'ls' test since in this test only one cpu
generates rcu entries.

What do you think?

--
	Manfred

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 6
//  SUBLEVEL = 6
//  EXTRAVERSION = -mm4
--- 2.6/kernel/rcupdate.c	2004-05-23 11:53:46.000000000 +0200
+++ build-2.6/kernel/rcupdate.c	2004-05-23 11:53:52.000000000 +0200
@@ -47,7 +47,7 @@
 
 /* Definition for rcupdate control block. */
 struct rcu_ctrlblk rcu_ctrlblk = 
-	{ .batch = { .cur = -300, .completed = -300 },
+	{ .batch = { .cur = -300, .completed = -300 , .lock = SEQCNT_ZERO },
 	  .state = {.mutex = SPIN_LOCK_UNLOCKED, .rcu_cpu_mask = CPU_MASK_NONE } };
 DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
 
@@ -124,16 +124,18 @@
 	cpumask_t active;
 
 	if (next_pending)
-		rcu_ctrlblk.state.next_pending = 1;
+		rcu_ctrlblk.batch.next_pending = 1;
 
-	if (rcu_ctrlblk.state.next_pending &&
+	if (rcu_ctrlblk.batch.next_pending &&
 			rcu_ctrlblk.batch.completed == rcu_ctrlblk.batch.cur) {
-		rcu_ctrlblk.state.next_pending = 0;
 		/* Can't change, since spin lock held. */
 		active = nohz_cpu_mask;
 		cpus_complement(active);
 		cpus_and(rcu_ctrlblk.state.rcu_cpu_mask, cpu_online_map, active);
+		write_seqcount_begin(&rcu_ctrlblk.batch.lock);
+		rcu_ctrlblk.batch.next_pending = 0;
 		rcu_ctrlblk.batch.cur++;
+		write_seqcount_end(&rcu_ctrlblk.batch.lock);
 	}
 }
 
@@ -261,6 +263,8 @@
 
 	local_irq_disable();
 	if (!list_empty(&RCU_nxtlist(cpu)) && list_empty(&RCU_curlist(cpu))) {
+		int next_pending, seq;
+
 		__list_splice(&RCU_nxtlist(cpu), &RCU_curlist(cpu));
 		INIT_LIST_HEAD(&RCU_nxtlist(cpu));
 		local_irq_enable();
@@ -268,10 +272,19 @@
 		/*
 		 * start the next batch of callbacks
 		 */
-		spin_lock(&rcu_ctrlblk.state.mutex);
-		RCU_batch(cpu) = rcu_ctrlblk.batch.cur + 1;
-		rcu_start_batch(1);
-		spin_unlock(&rcu_ctrlblk.state.mutex);
+		do {
+			seq = read_seqcount_begin(&rcu_ctrlblk.batch.lock);
+			/* determine batch number */
+			RCU_batch(cpu) = rcu_ctrlblk.batch.cur + 1;
+			next_pending = rcu_ctrlblk.batch.next_pending;
+		} while (read_seqcount_retry(&rcu_ctrlblk.batch.lock, seq));
+
+		if (!next_pending) {
+			/* and start it/schedule start if it's a new batch */
+			spin_lock(&rcu_ctrlblk.state.mutex);
+			rcu_start_batch(1);
+			spin_unlock(&rcu_ctrlblk.state.mutex);
+		}
 	} else {
 		local_irq_enable();
 	}
--- 2.6/include/linux/rcupdate.h	2004-05-23 11:53:46.000000000 +0200
+++ build-2.6/include/linux/rcupdate.h	2004-05-23 11:53:52.000000000 +0200
@@ -41,6 +41,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
+#include <linux/seqlock.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -69,11 +70,14 @@
 	struct {
 		long	cur;		/* Current batch number.	      */
 		long	completed;	/* Number of the last completed batch */
+		int	next_pending;	/* Is the next batch already waiting? */
+		seqcount_t lock;	/* for atomically reading cur and     */
+		                        /* next_pending. Spinlock not used,   */
+		                        /* protected by state.mutex           */
 	} batch ____cacheline_maxaligned_in_smp;
 	/* remaining members: bookkeeping of the progress of the grace period */
 	struct {
 		spinlock_t	mutex;	/* Guard this struct                  */
-		int	next_pending;	/* Is the next batch already waiting? */
 		cpumask_t	rcu_cpu_mask; 	/* CPUs that need to switch   */
 				/* in order for current batch to proceed.     */
 	} state ____cacheline_maxaligned_in_smp;

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Lse-tech] [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
  2004-05-25  5:35 [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches Manfred Spraul
@ 2004-05-27 23:22 ` Paul E. McKenney
  2004-05-28 13:57   ` Manfred Spraul
  0 siblings, 1 reply; 4+ messages in thread
From: Paul E. McKenney @ 2004-05-27 23:22 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, lse-tech

Hello, Manfred,

I am still digging through these, and things look quite good in general,
but I have a question on your second patch.

Given the following sequence of events:

1.	CPU 0 executes the

		rcu_ctrlblk.batch.next_pending = 1;

	at the beginning of rcu_start_batch().

2.	CPU 1 executes the read_seqcount code sequence in
	rcu_process_callbacks(), setting RCU_batch(cpu) to
	the next batch number, and setting next_pending to 1.

3.	CPU 0 executes the remainder of rcu_start_batch(),
	setting rcu_ctrlblk.batch.next_pending to 0 and
	incrementing rcu_ctrlblk.batch.cur.

4.	CPU 1's state is now as if the grace period had already
	completed for the callbacks that were just moved to
	RCU_curlist(), which would be very bad.

First, can this sequence of events really happen?  I cannot see
anything that prevents it, but on the other hand, am writing this
while under the influence of heavy-duty decongestants.

Second, is CPU 1's state really "grace period completed"?
See previous disclaimer.

If this really is a problem, it seems like the fix would be
to extend the range of the write_seqcount to cover the entire
body of rcu_start_batch().  Which would increase the number
of write_seqcount calls, degrading performance.  Might not be
enough to worry about, but 512 CPUs is 512 CPUs...

Another approach would be to do the write_seqcount_begin() in
both "if" statements, and to do the corresponding write_seqcount_end()
if the begin had happened.  If this is a concern, something like the
following might be appropriate:

static void rcu_start_batch(int next_pending)
{
	cpumask_t active;
	int writeseq = 0;

	if (next_pending) {
		write_seqcount_begin(&rcu_ctrlblk.batch.lock);
		writeseq = 1;
		rcu_ctrlblk.batch.next_pending = 1;
	}

	if (rcu_ctrlblk.batch.next_pending &&
			rcu_ctrlblk.batch.completed == rcu_ctrlblk.batch.cur) {
		/* Can't change, since spin lock held. */
		active = nohz_cpu_mask;
		cpus_complement(active);
		cpus_and(rcu_ctrlblk.state.rcu_cpu_mask, cpu_online_map, active);
		if (!writeseq) {
			write_seqcount_begin(&rcu_ctrlblk.batch.lock);
			writeseq = 1;
		}
		rcu_ctrlblk.batch.next_pending = 0;
		rcu_ctrlblk.batch.cur++;
	}
	if (writeseq)
		write_seqcount_end(&rcu_ctrlblk.batch.lock);
}

Thoughts?

Will continue looking through the patches, more later.

						Thanx, Paul

On Tue, May 25, 2004 at 07:35:21AM +0200, Manfred Spraul wrote:
> Hi,
> 
> Step two for reducing cacheline trashing within rcupdate.c:
> 
> rcu_process_callbacks always acquires rcu_ctrlblk.state.mutex and
> calls rcu_start_batch, even if the batch is already running or already
> scheduled to run.
> This can be avoided with a sequence lock: A sequence lock allows
> to read the current batch number and next_pending atomically. If
> next_pending is already set, then there is no need to acquire the
> global mutex.
> 
> This means that for each grace period, there will be
> 
> - one write access to the rcu_ctrlblk.batch cacheline
> - lots of read accesses to rcu_ctrlblk.batch (3-10*cpus_online(),
>   perhaps even more).
>   
>   Behavior similar to the jiffies cacheline, shouldn't be a problem.
>   
> - cpus_online()+1 write accesses to rcu_ctrlblk.state, all of them
>   starting with spin_lock(&rcu_ctrlblk.state.mutex).
> 
>   For large enough cpus_online() this will be a problem, but all
>   except two of the spin_lock calls only protect the rcu_cpu_mask
>   bitmap, thus a hierarchical bitmap would allow to split the write
>   accesses to multiple cachelines.
> 
> Tested on an 8-way with reaim. Unfortunately it probably won't help
> with Jack Steiner's 'ls' test since in this test only one cpu
> generates rcu entries.
> 
> What do you think?
> 
> --
> 	Manfred
> 
> // $Header$
> // Kernel Version:
> //  VERSION = 2
> //  PATCHLEVEL = 6
> //  SUBLEVEL = 6
> //  EXTRAVERSION = -mm4
> --- 2.6/kernel/rcupdate.c	2004-05-23 11:53:46.000000000 +0200
> +++ build-2.6/kernel/rcupdate.c	2004-05-23 11:53:52.000000000 +0200
> @@ -47,7 +47,7 @@
>  
>  /* Definition for rcupdate control block. */
>  struct rcu_ctrlblk rcu_ctrlblk = 
> -	{ .batch = { .cur = -300, .completed = -300 },
> +	{ .batch = { .cur = -300, .completed = -300 , .lock = SEQCNT_ZERO },
>  	  .state = {.mutex = SPIN_LOCK_UNLOCKED, .rcu_cpu_mask = CPU_MASK_NONE } };
>  DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
>  
> @@ -124,16 +124,18 @@
>  	cpumask_t active;
>  
>  	if (next_pending)
> -		rcu_ctrlblk.state.next_pending = 1;
> +		rcu_ctrlblk.batch.next_pending = 1;
>  
> -	if (rcu_ctrlblk.state.next_pending &&
> +	if (rcu_ctrlblk.batch.next_pending &&
>  			rcu_ctrlblk.batch.completed == rcu_ctrlblk.batch.cur) {
> -		rcu_ctrlblk.state.next_pending = 0;
>  		/* Can't change, since spin lock held. */
>  		active = nohz_cpu_mask;
>  		cpus_complement(active);
>  		cpus_and(rcu_ctrlblk.state.rcu_cpu_mask, cpu_online_map, active);
> +		write_seqcount_begin(&rcu_ctrlblk.batch.lock);
> +		rcu_ctrlblk.batch.next_pending = 0;
>  		rcu_ctrlblk.batch.cur++;
> +		write_seqcount_end(&rcu_ctrlblk.batch.lock);
>  	}
>  }
>  
> @@ -261,6 +263,8 @@
>  
>  	local_irq_disable();
>  	if (!list_empty(&RCU_nxtlist(cpu)) && list_empty(&RCU_curlist(cpu))) {
> +		int next_pending, seq;
> +
>  		__list_splice(&RCU_nxtlist(cpu), &RCU_curlist(cpu));
>  		INIT_LIST_HEAD(&RCU_nxtlist(cpu));
>  		local_irq_enable();
> @@ -268,10 +272,19 @@
>  		/*
>  		 * start the next batch of callbacks
>  		 */
> -		spin_lock(&rcu_ctrlblk.state.mutex);
> -		RCU_batch(cpu) = rcu_ctrlblk.batch.cur + 1;
> -		rcu_start_batch(1);
> -		spin_unlock(&rcu_ctrlblk.state.mutex);
> +		do {
> +			seq = read_seqcount_begin(&rcu_ctrlblk.batch.lock);
> +			/* determine batch number */
> +			RCU_batch(cpu) = rcu_ctrlblk.batch.cur + 1;
> +			next_pending = rcu_ctrlblk.batch.next_pending;
> +		} while (read_seqcount_retry(&rcu_ctrlblk.batch.lock, seq));
> +
> +		if (!next_pending) {
> +			/* and start it/schedule start if it's a new batch */
> +			spin_lock(&rcu_ctrlblk.state.mutex);
> +			rcu_start_batch(1);
> +			spin_unlock(&rcu_ctrlblk.state.mutex);
> +		}
>  	} else {
>  		local_irq_enable();
>  	}
> --- 2.6/include/linux/rcupdate.h	2004-05-23 11:53:46.000000000 +0200
> +++ build-2.6/include/linux/rcupdate.h	2004-05-23 11:53:52.000000000 +0200
> @@ -41,6 +41,7 @@
>  #include <linux/threads.h>
>  #include <linux/percpu.h>
>  #include <linux/cpumask.h>
> +#include <linux/seqlock.h>
>  
>  /**
>   * struct rcu_head - callback structure for use with RCU
> @@ -69,11 +70,14 @@
>  	struct {
>  		long	cur;		/* Current batch number.	      */
>  		long	completed;	/* Number of the last completed batch */
> +		int	next_pending;	/* Is the next batch already waiting? */
> +		seqcount_t lock;	/* for atomically reading cur and     */
> +		                        /* next_pending. Spinlock not used,   */
> +		                        /* protected by state.mutex           */
>  	} batch ____cacheline_maxaligned_in_smp;
>  	/* remaining members: bookkeeping of the progress of the grace period */
>  	struct {
>  		spinlock_t	mutex;	/* Guard this struct                  */
> -		int	next_pending;	/* Is the next batch already waiting? */
>  		cpumask_t	rcu_cpu_mask; 	/* CPUs that need to switch   */
>  				/* in order for current batch to proceed.     */
>  	} state ____cacheline_maxaligned_in_smp;
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by: Oracle 10g
> Get certified on the hottest thing ever to hit the market... Oracle 10g. 
> Take an Oracle 10g class now, and we'll give you the exam FREE.
> http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
> _______________________________________________
> Lse-tech mailing list
> Lse-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lse-tech
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Lse-tech] [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
  2004-05-27 23:22 ` [Lse-tech] " Paul E. McKenney
@ 2004-05-28 13:57   ` Manfred Spraul
  2004-05-28 16:28     ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: Manfred Spraul @ 2004-05-28 13:57 UTC (permalink / raw)
  To: paulmck; +Cc: Manfred Spraul, linux-kernel, lse-tech

Paul E. McKenney wrote:

>Hello, Manfred,
>
>I am still digging through these, and things look quite good in general,
>but I have a question on your second patch.
>  
>
Let's assume that

batch.completed = 5;
batch.cur = 5;
batch.next_pending = 0;

>Given the following sequence of events:
>
>1.	CPU 0 executes the
>
>		rcu_ctrlblk.batch.next_pending = 1;
>  
>
batch.next_pending = 1.

>	at the beginning of rcu_start_batch().
>
>2.	CPU 1 executes the read_seqcount code sequence in
>	rcu_process_callbacks(), setting RCU_batch(cpu) to
>	the next batch number, and setting next_pending to 1.
>  
>
RCU_batch(1) is now 6.
next_pending is 1, rcu_process_callbacks continues without calling 
rcu_start_batch().

>3.	CPU 0 executes the remainder of rcu_start_batch(),
>	setting rcu_ctrlblk.batch.next_pending to 0 and
>	incrementing rcu_ctrlblk.batch.cur.
>  
>
batch.cur = 6.

>4.	CPU 1's state is now as if the grace period had already
>	completed for the callbacks that were just moved to
>	RCU_curlist(), which would be very bad.
>  
>
AFAICS: No. RCU_batch(1) is 6 and rcu_ctrlblk.batch.completed is still 
5. The test for grace period completed is

>  if (!list_empty(&RCU_curlist(cpu)) &&
>             
> !rcu_batch_before(rcu_ctrlblk.batch.completed,RCU_batch(cpu))) {
>          __list_splice(&RCU_curlist(cpu), &list);
>          INIT_LIST_HEAD(&RCU_curlist(cpu));
>  }

5 is before 6, thus the callbacks won't be processed.

The only write operation to rcu_ctrlblk.batch.completed is in cpu_quiet, 
after checking that the cpu bitmap is empty and under 
spin_lock(rcu_ctrlblk.state.mutex).

Thanks for looking at my patches,

--
    Manfred


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Lse-tech] [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches
  2004-05-28 13:57   ` Manfred Spraul
@ 2004-05-28 16:28     ` Paul E. McKenney
  0 siblings, 0 replies; 4+ messages in thread
From: Paul E. McKenney @ 2004-05-28 16:28 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Manfred Spraul, linux-kernel, lse-tech

On Fri, May 28, 2004 at 03:57:07PM +0200, Manfred Spraul wrote:
> Paul E. McKenney wrote:
> 
> >Hello, Manfred,
> >
> >I am still digging through these, and things look quite good in general,
> >but I have a question on your second patch.

Sorry for the bother, will keep looking, hopefully with more accurate
comments in the future.  :-/

						Thanx, Paul

> Let's assume that
> 
> batch.completed = 5;
> batch.cur = 5;
> batch.next_pending = 0;
> 
> >Given the following sequence of events:
> >
> >1.	CPU 0 executes the
> >
> >		rcu_ctrlblk.batch.next_pending = 1;
> > 
> >
> batch.next_pending = 1.
> 
> >	at the beginning of rcu_start_batch().
> >
> >2.	CPU 1 executes the read_seqcount code sequence in
> >	rcu_process_callbacks(), setting RCU_batch(cpu) to
> >	the next batch number, and setting next_pending to 1.
> > 
> >
> RCU_batch(1) is now 6.
> next_pending is 1, rcu_process_callbacks continues without calling 
> rcu_start_batch().
> 
> >3.	CPU 0 executes the remainder of rcu_start_batch(),
> >	setting rcu_ctrlblk.batch.next_pending to 0 and
> >	incrementing rcu_ctrlblk.batch.cur.
> > 
> >
> batch.cur = 6.
> 
> >4.	CPU 1's state is now as if the grace period had already
> >	completed for the callbacks that were just moved to
> >	RCU_curlist(), which would be very bad.
> > 
> >
> AFAICS: No. RCU_batch(1) is 6 and rcu_ctrlblk.batch.completed is still 
> 5. The test for grace period completed is
> 
> > if (!list_empty(&RCU_curlist(cpu)) &&
> >            
> >!rcu_batch_before(rcu_ctrlblk.batch.completed,RCU_batch(cpu))) {
> >         __list_splice(&RCU_curlist(cpu), &list);
> >         INIT_LIST_HEAD(&RCU_curlist(cpu));
> > }
> 
> 5 is before 6, thus the callbacks won't be processed.
> 
> The only write operation to rcu_ctrlblk.batch.completed is in cpu_quiet, 
> after checking that the cpu bitmap is empty and under 
> spin_lock(rcu_ctrlblk.state.mutex).
> 
> Thanks for looking at my patches,
> 
> --
>    Manfred
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2004-05-28 16:30 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-05-25  5:35 [RFC, PATCH] 2/5 rcu lock update: Use a sequence lock for starting batches Manfred Spraul
2004-05-27 23:22 ` [Lse-tech] " Paul E. McKenney
2004-05-28 13:57   ` Manfred Spraul
2004-05-28 16:28     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox