* RCU scaling on large systems
@ 2004-05-01 12:08 Jack Steiner
2004-05-01 21:17 ` William Lee Irwin III
2004-05-02 18:28 ` Paul E. McKenney
0 siblings, 2 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-01 12:08 UTC (permalink / raw)
To: linux-kernel
Has anyone investigated scaling of the RCU algorithms on large
cpu-count systems?
We are running 2.6.5 on a 512p system. The RCU update code is causing
a near-livelock when booting. Several times, I dropped into kdb &
found many of the cpus in the RCU code. The cpus are in
rcu_check_quiescent_state() spinning on the rcu_ctrlblk.mutex.
I also found an interesting anomaly that was traced to RCU. I have
a program that measures "cpu efficiency". Basically, the program
creates a cpu bound thread for each cpu & measures the percentage
of time that each cpu ACTUALLY spends executing user code.
On an idle system, each cpu *should* spend >99% of its time in user mode.
On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
RCU code. The time is spent contending for shared cache lines.
Even more bizarre: if I repeatedly type "ls" in a *single* window
(probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
of the time in the RCU code.
The RCU algorithms don't scale - at least on our systems!!!!!
Attached is an experimental hack that fixes the problem. I
don't believe that this is the correct fix but it does prove
that slowing down the frequency of updates fixes the problem.
With this hack, "ls" no longer measurably disturbs other cpus. Each
cpu spends ~99.8% of its time in user code regardless of the frequency
of typing "ls".
By default, the RCU code attempts to process callbacks on each cpu
every tick. The hack adds a mask so that only a few cpus process
callbacks each tick.
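The staggering idea can be modeled in plain C outside the kernel. This is a sketch only: `RCU_JIFFIES_MASK` and `rcu_time()` mirror the hack in the patch attached further down, but `jiffies` is passed in explicitly here so the model is testable in userspace.

```c
#include <assert.h>

/* Standalone model of the staggering hack: with a mask of 15, cpu c
 * runs its RCU callbacks only on ticks where (jiffies - c) & 15 == 0,
 * so on any given tick only 1/16th of the cpus touch the shared RCU
 * cachelines. The kernel version reads the global jiffies counter;
 * here it is a parameter. */
#define RCU_JIFFIES_MASK 15

static int rcu_time(unsigned long jiffies, int cpu)
{
	return ((jiffies - (unsigned long)cpu) & RCU_JIFFIES_MASK) == 0;
}
```

Over any window of 16 ticks each cpu qualifies exactly once, and on each tick exactly NR_CPUS/16 cpus qualify, which is where the roughly order-of-magnitude drop in contention comes from.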
I ran a simple experiment to vary the mask. Col 1 shows the mask
value, col 2 shows system time when the system is idle, col 3 shows
system time when typing "ls" on a single cpu. The percentages are
"eyeballed averages" for all cpus.
	VAL	idle%	"ls" %
	  0	 6.00	 55.00
	  1	 2.00	 50.00
	  3	  .20	 20.00
	  7	  .20	   .22
	 15	  .20	   .24
	 31	  .19	   .25
	 63	  .19	   .26
	127	  .19	   .25
	511	  .19	   .19
Looks like any value of 7 or larger should be ok on our hardware. I picked
a larger value because I don't understand how heavy loads affect the
optimal value. I suspect that the optimal value is architecture
dependent & probably platform dependent.
Is anyone already working on this issue?
Comments and additional insight appreciated..........
Index: linux/include/linux/rcupdate.h
===================================================================
--- linux.orig/include/linux/rcupdate.h 2004-04-30 09:59:00.000000000 -0500
+++ linux/include/linux/rcupdate.h 2004-04-30 10:01:37.000000000 -0500
@@ -41,6 +41,7 @@
#include <linux/threads.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
+#include <linux/jiffies.h>
/**
* struct rcu_head - callback structure for use with RCU
@@ -84,6 +85,14 @@
return (a - b) > 0;
}
+#define RCU_JIFFIES_MASK 15
+
+/* Is it time to process a batch on this cpu */
+static inline int rcu_time(int cpu)
+{
+ return (((jiffies - cpu) & RCU_JIFFIES_MASK) == 0);
+}
+
/*
* Per-CPU data for Read-Copy Update.
* nxtlist - new callbacks are added here
@@ -111,11 +120,11 @@
static inline int rcu_pending(int cpu)
{
- if ((!list_empty(&RCU_curlist(cpu)) &&
+ if (rcu_time(cpu) && ((!list_empty(&RCU_curlist(cpu)) &&
rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
(list_empty(&RCU_curlist(cpu)) &&
!list_empty(&RCU_nxtlist(cpu))) ||
- cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+ cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask)))
return 1;
else
return 0;
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RCU scaling on large systems
  2004-05-01 12:08 RCU scaling on large systems Jack Steiner
@ 2004-05-01 21:17 ` William Lee Irwin III
  2004-05-01 22:35   ` William Lee Irwin III
  ` (3 more replies)
  2004-05-02 18:28 ` Paul E. McKenney
  1 sibling, 4 replies; 22+ messages in thread
From: William Lee Irwin III @ 2004-05-01 21:17 UTC (permalink / raw)
To: Jack Steiner; +Cc: linux-kernel

On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> RCU code. The time is spent contending for shared cache lines.

Would something like this help cacheline contention? This uses the
per_cpu data areas to hold per-cpu booleans for needing switches.
Untested/uncompiled.

The global lock is unfortunately still there.

-- wli

Index: wli-2.6.6-rc3-mm1/include/linux/rcupdate.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/rcupdate.h	2004-04-03 19:36:52.000000000 -0800
+++ wli-2.6.6-rc3-mm1/include/linux/rcupdate.h	2004-05-01 14:15:09.000000000 -0700
@@ -41,6 +41,9 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
+#include <asm/atomic.h>
+
+#define RCU_CPU_SCATTER (NR_CPUS > 128)
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -68,8 +71,10 @@
 	spinlock_t	mutex;		/* Guard this struct */
 	long		curbatch;	/* Current batch number. */
 	long		maxbatch;	/* Max requested batch number. */
+#if !RCU_CPU_SCATTER
 	cpumask_t	rcu_cpu_mask;	/* CPUs that need to switch in order */
 					/* for current batch to proceed. */
+#endif
 };
 
 /* Is batch a before batch b ? */
@@ -96,6 +101,9 @@
 	long		batch;		/* Batch # for current RCU batch */
 	struct list_head nxtlist;
 	struct list_head curlist;
+#if RCU_CPU_SCATTER
+	atomic_t	need_switch;
+#endif
 };
 
 DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -109,13 +117,39 @@
 
 #define RCU_QSCTR_INVALID	0
 
+#if RCU_CPU_SCATTER
+#define rcu_need_switch(cpu)	(!!atomic_read(&per_cpu(rcu_data, cpu).need_switch))
+#define rcu_clear_need_switch(cpu) atomic_set(&per_cpu(rcu_data, cpu).need_switch, 0)
+static inline int rcu_any_cpu_need_switch(void)
+{
+	int cpu;
+	for_each_online_cpu(cpu) {
+		if (rcu_need_switch(cpu))
+			return 1;
+	}
+	return 0;
+}
+
+static inline void rcu_set_need_switch_cpumask(cpumask_t cpumask)
+{
+	int cpu;
+	for_each_cpu_mask(cpu, cpumask)
+		atomic_set(&per_cpu(rcu_data, cpu).need_switch, 1);
+}
+#else
+#define rcu_need_switch(cpu)	cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask)
+#define rcu_clear_need_switch(cpu) cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask)
+#define rcu_any_cpu_need_switch() (!cpus_empty(rcu_ctrlblk.rcu_cpu_mask))
+#define rcu_set_need_switch_cpumask(x) cpus_copy(rcu_ctrlblk.rcu_cpu_mask, x)
+#endif
+
 static inline int rcu_pending(int cpu)
 {
 	if ((!list_empty(&RCU_curlist(cpu)) &&
 	     rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
 	    (list_empty(&RCU_curlist(cpu)) &&
 	     !list_empty(&RCU_nxtlist(cpu))) ||
-	    cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	    rcu_need_switch(cpu))
 		return 1;
 	else
 		return 0;
Index: wli-2.6.6-rc3-mm1/kernel/rcupdate.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/rcupdate.c	2004-04-30 15:05:53.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/rcupdate.c	2004-05-01 13:47:05.000000000 -0700
@@ -46,10 +46,19 @@
 #include <linux/cpu.h>
 
 /* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk =
-	{ .mutex = SPIN_LOCK_UNLOCKED, .curbatch = 1,
-	  .maxbatch = 1, .rcu_cpu_mask = CPU_MASK_NONE };
+struct rcu_ctrlblk rcu_ctrlblk = {
+	.mutex = SPIN_LOCK_UNLOCKED,
+	.curbatch = 1,
+	.maxbatch = 1,
+#if !RCU_CPU_SCATTER
+	.rcu_cpu_mask = CPU_MASK_NONE
+#endif
+};
+#if RCU_CPU_SCATTER
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = { .need_switch = ATOMIC_INIT(0), };
+#else
 DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+#endif
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
@@ -109,13 +118,14 @@
 		rcu_ctrlblk.maxbatch = newbatch;
 	}
 	if (rcu_batch_before(rcu_ctrlblk.maxbatch, rcu_ctrlblk.curbatch) ||
-	    !cpus_empty(rcu_ctrlblk.rcu_cpu_mask)) {
+	    !rcu_any_cpu_need_switch()) {
 		return;
 	}
 	/* Can't change, since spin lock held. */
 	active = idle_cpu_mask;
 	cpus_complement(active);
-	cpus_and(rcu_ctrlblk.rcu_cpu_mask, cpu_online_map, active);
+	cpus_and(active, cpu_online_map, active);
+	rcu_set_need_switch_cpumask(active);
 }
 
 /*
@@ -127,7 +137,7 @@
 {
 	int cpu = smp_processor_id();
 
-	if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_need_switch(cpu))
 		return;
 
 	/*
@@ -143,12 +153,12 @@
 		return;
 
 	spin_lock(&rcu_ctrlblk.mutex);
-	if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_need_switch(cpu))
 		goto out_unlock;
 
-	cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask);
+	rcu_clear_need_switch(cpu);
 	RCU_last_qsctr(cpu) = RCU_QSCTR_INVALID;
-	if (!cpus_empty(rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_any_cpu_need_switch())
 		goto out_unlock;
 
 	rcu_ctrlblk.curbatch++;
@@ -186,11 +196,11 @@
 	 * it here
 	 */
 	spin_lock_irq(&rcu_ctrlblk.mutex);
-	if (cpus_empty(rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_any_cpu_need_switch())
 		goto unlock;
 
-	cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask);
-	if (cpus_empty(rcu_ctrlblk.rcu_cpu_mask)) {
+	rcu_clear_need_switch(cpu);
+	if (!rcu_any_cpu_need_switch()) {
 		rcu_ctrlblk.curbatch++;
 		/* We may avoid calling start batch if
 		 * we are starting the batch only

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
@ 2004-05-01 22:35   ` William Lee Irwin III
  2004-05-02  1:38   ` Jack Steiner
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: William Lee Irwin III @ 2004-05-01 22:35 UTC (permalink / raw)
To: Jack Steiner, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> +static inline void rcu_set_need_switch_cpumask(cpumask_t cpumask)
> +{
> +	int cpu;
> +	for_each_cpu_mask(cpu, cpumask)
> +		atomic_set(&per_cpu(rcu_data, cpu).need_switch, 1);
> +}

needs to be:

	for_each_online_cpu(cpu)
		atomic_set(&per_cpu(rcu_data, cpu).need_switch,
			   !!cpu_isset(cpu, cpumask));

-- wli

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
  2004-05-01 22:35   ` William Lee Irwin III
@ 2004-05-02  1:38   ` Jack Steiner
  2004-05-07 17:53   ` Andrea Arcangeli
  2004-05-07 20:49   ` Jack Steiner
  3 siblings, 0 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-02 1:38 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.

It will be Monday before I can get time on the large system & do
additional experiments.

I have 2 logs & both show the contention is on the line
"spin_lock(&rcu_ctrlblk.mutex);". These traces were from a heavy
workload with processes on most cpus.

I want to repeat the "ls" experiment & see if it has the same hot spot.
The "ls" experiment has only 1 active cpu. The rest of the cpus are idle.

I'll post additional info on Monday (if possible). I'll also try your
patch.

(Thanks for quick reply)

[wli's patch quoted in full; snipped]

--
Thanks

Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
  2004-05-01 22:35   ` William Lee Irwin III
  2004-05-02  1:38   ` Jack Steiner
@ 2004-05-07 17:53   ` Andrea Arcangeli
  2004-05-07 18:17     ` William Lee Irwin III
  2004-05-07 20:49   ` Jack Steiner
  3 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2004-05-07 17:53 UTC (permalink / raw)
To: William Lee Irwin III, Jack Steiner, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.

I'm afraid this cannot help, the rcu_cpu_mask and the mutex are in the
same cacheline, so it's not just about the global lock being still
there, it's about the cpumask being in the same cacheline with the
global lock.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-07 17:53 ` Andrea Arcangeli
@ 2004-05-07 18:17   ` William Lee Irwin III
  2004-05-07 19:59     ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-05-07 18:17 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Jack Steiner, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
>> Would something like this help cacheline contention? This uses the
>> per_cpu data areas to hold per-cpu booleans for needing switches.
>> Untested/uncompiled.
>> The global lock is unfortunately still there.

On Fri, May 07, 2004 at 07:53:58PM +0200, Andrea Arcangeli wrote:
> I'm afraid this cannot help, the rcu_cpu_mask and the mutex are in the same
> cacheline, so it's not just about the global lock being still there,
> it's about the cpumask being in the same cacheline with the global lock.

Hmm. I can't quite make out what you're trying to say. If it were about
the cpumask sharing the cacheline with the global lock, then the patch
would help, but you say it should not.

I don't care much about the conclusion, since I wrote the patch to
express the notion that the concentration of accesses to the cpumask's
shared cacheline(s) could be dispersed by using integers in per_cpu
data to represent the individual bits of the cpumask if that were the
problem, and by trying something similar to the posted patch, it could
be determined if that were so, but later heard back that it'd been
determined by other means that it was the lock itself...

-- wli

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-07 18:17 ` William Lee Irwin III
@ 2004-05-07 19:59   ` Andrea Arcangeli
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2004-05-07 19:59 UTC (permalink / raw)
To: William Lee Irwin III, Jack Steiner, linux-kernel

On Fri, May 07, 2004 at 11:17:06AM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> >> Would something like this help cacheline contention? This uses the
> >> per_cpu data areas to hold per-cpu booleans for needing switches.
> >> Untested/uncompiled.
> >> The global lock is unfortunately still there.
>
> On Fri, May 07, 2004 at 07:53:58PM +0200, Andrea Arcangeli wrote:
> > I'm afraid this cannot help, the rcu_cpu_mask and the mutex are in the same
> > cacheline, so it's not just about the global lock being still there,
> > it's about the cpumask being in the same cacheline with the global lock.
>
> Hmm. I can't quite make out what you're trying to say. If it were about
> the cpumask sharing the cacheline with the global lock, then the patch
> would help, but you say it should not. I don't care much about the

Since cpumask and global lock are on the same cacheline, and the global
lock still forces that cacheline to bounce, it shouldn't make any
difference to remove the cpumask from that global cacheline; by the
time you take the lock you have the cacheline local and in the next
nanosecond you can use the cpumask zerocost. I understood the real cost
is the bouncing of _such_ cacheline across all 512 cpus and the global
lock will still generate it as far as I can tell. Though I'm mostly
guessing, I certainly wouldn't be against trying your patch, I was just
afraid it cannot help.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
  ` (2 preceding siblings ...)
  2004-05-07 17:53 ` Andrea Arcangeli
@ 2004-05-07 20:49 ` Jack Steiner
  3 siblings, 0 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-07 20:49 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.
>
> -- wli
> ...
> +#if RCU_CPU_SCATTER
> +#define rcu_need_switch(cpu)	(!!atomic_read(&per_cpu(rcu_data, cpu).need_switch))
> +#define rcu_clear_need_switch(cpu) atomic_set(&per_cpu(rcu_data, cpu).need_switch, 0)
> +static inline int rcu_any_cpu_need_switch(void)
> +{
> +	int cpu;
> +	for_each_online_cpu(cpu) {
> +		if (rcu_need_switch(cpu))
> +			return 1;
> +	}
> +	return 0;
> +}
..

This fixes only part of the problem.

Referencing percpu data for every cpu is not particularly efficient -
at least on our platform.

Percpu data is allocated so that it is on the local node of each cpu.
We use 64MB granules. The percpu data structures on the individual
nodes are separated by addresses that differ by many GB. A scan of all
percpu data structures requires a TLB entry for each node. This is
costly & trashes the TLB. (Our max system size is currently 256 nodes.)

For example, a 4 node, 2 cpus/node system shows:

	mdb> m d __per_cpu_offset
	<0xa000000100b357c8> 0xe000003001010000
	<0xa000000100b357d0> 0xe000003001020000
	<0xa000000100b357d8> 0xe00000b001020000
	<0xa000000100b357e0> 0xe00000b001030000
	<0xa000000100b357e8> 0xe000013001030000
	<0xa000000100b357f0> 0xe000013001040000
	<0xa000000100b357f8> 0xe00001b001040000
	<0xa000000100b35800> 0xe00001b001050000

The node number is encoded in bits [48:39] of the virtual/physical
address. Unfortunately, our hardware does not provide a way to allocate
node local memory for every node and have all the memory covered by a
single TLB entry.

Moving "need_switch" to a single array with cacheline aligned entries
would work. I can give that a try.....

--
Thanks

Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
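Jack's cacheline-aligned-array suggestion can be sketched as follows. This is an illustration only: the 128-byte line size, the `NR_CPUS` value, and all the names are assumptions made up for the sketch, not taken from the thread.

```c
#include <assert.h>

#define NR_CPUS		512
#define SMP_CACHE_BYTES	128	/* assumed cacheline size; platform dependent */

/* One flag per cpu, each padded out to its own cacheline so writers on
 * different cpus never share a line. Unlike per_cpu data, the whole
 * array sits in one contiguous allocation, so a scan of every cpu's
 * flag needs only one TLB entry instead of one per node. */
struct need_switch_slot {
	int need_switch;
	char pad[SMP_CACHE_BYTES - sizeof(int)];
};

static struct need_switch_slot rcu_need_switch_arr[NR_CPUS]
	__attribute__((aligned(SMP_CACHE_BYTES)));
```

The trade-off versus a cpumask is memory (one line per cpu instead of one bit) in exchange for contention-free per-cpu writes.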
* Re: RCU scaling on large systems
  2004-05-01 12:08 RCU scaling on large systems Jack Steiner
  2004-05-01 21:17 ` William Lee Irwin III
@ 2004-05-02 18:28 ` Paul E. McKenney
  2004-05-03 16:39   ` Jesse Barnes
  2004-05-03 18:40   ` Jack Steiner
  1 sibling, 2 replies; 22+ messages in thread
From: Paul E. McKenney @ 2004-05-02 18:28 UTC (permalink / raw)
To: Jack Steiner; +Cc: linux-kernel

Hello, Jack!

On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
>
> Has anyone investigated scaling of the RCU algorithms on large
> cpu-count systems?
>
> We are running 2.6.5 on a 512p system. The RCU update code is causing
> a near-livelock when booting. Several times, I dropped into kdb &
> found many of the cpus in the RCU code. The cpus are in
> rcu_check_quiescent_state() spinning on the rcu_ctrlblk.mutex.

The current RCU grace-period-detection infrastructure was designed for
a few tens of CPUs, so it is absolutely no surprise that it does not
serve you well on a 512-CPU system.

The current RCU infrastructure algorithm can be thought of as a lazy
barrier controlled by a bitmask. For 512 CPUs, I believe that more of
a combining-tree algorithm will be needed. If you have a NUMA-like
architecture, then the structure of the combining tree will of course
want to reflect the underlying hardware architecture. Then any time
period during which each of the individual NUMA nodes passed through a
local grace period is a global grace period. This approach allows you
to have per-NUMA-node RCU locks and a global RCU lock, reducing the
number of acquisitions of the global lock by the ratio of the number
of CPUs to the number of nodes.

From your numbers below, I would guess that if you have at least
8 CPUs per NUMA node, a two-level tree would suffice. If you have
only 4 CPUs per NUMA node, you might well need a per-node level,
a per-4-nodes level, and a global level to get the global lock
contention reduced sufficiently.

> I also found an interesting anomaly that was traced to RCU. I have
> a program that measures "cpu efficiency". Basically, the program
> creates a cpu bound thread for each cpu & measures the percentage
> of time that each cpu ACTUALLY spends executing user code.
> On an idle system, each cpu *should* spend >99% in user mode.
>
> On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> RCU code. The time is spent contending for shared cache lines.

Again, no surprise, Linux's RCU was not designed for a 512-CPU
system. ;-) The hierarchical grace-period-detection scheme described
above also increases cache locality, greatly reducing the
cache-thrashing you are seeing.

> Even more bizarre: if I repeatedly type "ls" in a *single* window
> (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
> of the time in the RCU code.

Hmmm... How large was the directory you were running "ls" on?
At first blush, it sounds like the "ls" was somehow provoking a
dcache update, which would then exercise RCU.

> The RCU algorithms don't scale - at least on our systems!!!!!

As noted earlier, the current implementation is not designed for
512 CPUs. And, as noted earlier, there are ways to make it scale.
But for some reason, we felt it advisable to start with a smaller,
simpler, and hence less scalable implementation. ;-)

> Attached is an experimental hack that fixes the problem. I
> don't believe that this is the correct fix but it does prove
> that slowing down the frequency of updates fixes the problem.
>
> With this hack, "ls" no longer measurably disturbs other cpus. Each
> cpu spends ~99.8% of its time in user code regardless of the frequency
> of typing "ls".
>
> By default, the RCU code attempts to process callbacks on each cpu
> every tick. The hack adds a mask so that only a few cpus process
> callbacks each tick.

Cute! However, it is not clear to me that this approach is compatible
with real-time use of RCU, since it results in CPUs processing their
callbacks less frequently, and thus getting more of them to process
at a time. But it is not clear to me that anyone is looking for
realtime response from a 512-CPU machine (yow!!!), so perhaps this is
not a problem...

> I ran a simple experiment to vary the mask. Col 1 shows the mask
> value, col 2 shows system time when the system is idle, col 3 shows
> system time when typing "ls" on a single cpu. The percentages are
> "eyeballed averages" for all cpus.
>
>	VAL	idle%	"ls" %
>	  0	 6.00	 55.00
>	  1	 2.00	 50.00
>	  3	  .20	 20.00
>	  7	  .20	   .22
>	 15	  .20	   .24
>	 31	  .19	   .25
>	 63	  .19	   .26
>	127	  .19	   .25
>	511	  .19	   .19
>
> Looks like any value >7 or 15 should be ok on our hardware. I picked
> a larger value because I don't understand how heavy loads affect the
> optimal value. I suspect that the optimal value is architecture
> dependent & probably platform dependent.
>
> Is anyone already working on this issue?
>
> Comments and additional insight appreciated..........

This patch certainly seems simple enough, and I would guess that
"jiffies" is referenced often enough that it is warm in the cache
despite being frequently updated.

Other thoughts?

						Thanx, Paul

[Jack's patch quoted in full; snipped]

^ permalink raw reply	[flat|nested] 22+ messages in thread
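The combining-tree idea Paul outlines above can be modeled with two levels of counters. This is a toy, single-threaded sketch for illustration only: the node/cpu counts and all names are made up, and a real implementation would need a per-node lock plus the global lock rather than bare decrements.

```c
#include <assert.h>

#define NODES		64
#define CPUS_PER_NODE	8	/* 64 * 8 = 512 cpus, sizes assumed */

static int node_pending[NODES];	/* cpus on this node still to report */
static int global_pending;	/* nodes still to report */

static void grace_period_start(void)
{
	global_pending = NODES;
	for (int n = 0; n < NODES; n++)
		node_pending[n] = CPUS_PER_NODE;
}

/* A cpu reports its quiescent state. Only the last cpu of each node
 * touches the global counter, so the global cacheline is written
 * NODES times per grace period instead of NR_CPUS times. Returns 1
 * when this report completes the global grace period. */
static int report_quiescent(int cpu)
{
	int node = cpu / CPUS_PER_NODE;

	if (--node_pending[node] != 0)
		return 0;		/* node not done yet */
	return --global_pending == 0;	/* last node ends the grace period */
}
```

This is exactly the "ratio of CPUs to nodes" reduction in global-lock acquisitions that Paul describes: each cpu contends only on its node's counter, and one cpu per node per grace period touches the global state.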
* Re: RCU scaling on large systems
  2004-05-02 18:28 ` Paul E. McKenney
@ 2004-05-03 16:39   ` Jesse Barnes
  2004-05-03 20:04     ` Paul E. McKenney
  0 siblings, 1 reply; 22+ messages in thread
From: Jesse Barnes @ 2004-05-03 16:39 UTC (permalink / raw)
To: linux-kernel, paulmck; +Cc: Jack Steiner

On Sunday, May 2, 2004 11:28 am, Paul E. McKenney wrote:
> From your numbers below, I would guess that if you have at least
> 8 CPUs per NUMA node, a two-level tree would suffice. If you have
> only 4 CPUs per NUMA node, you might well need a per-node level,
> a per-4-nodes level, and a global level to get the global lock
> contention reduced sufficiently.

Actually, only 2, but it sounds like your approach would work.

> Cute! However, it is not clear to me that this approach is
> compatible with real-time use of RCU, since it results in CPUs
> processing their callbacks less frequently, and thus getting
> more of them to process at a time.

I think it was just a proof-of-concept--the current RCU design
obviously wasn't designed with this machine in mind :).

> But it is not clear to me that anyone is looking for realtime
> response from a 512-CPU machine (yow!!!), so perhaps this
> is not a problem...

There are folks that would like realtime (or close to realtime)
response on such systems, so it would be best not to do anything that
would explicitly prevent it.

> This patch certainly seems simple enough, and I would guess that
> "jiffies" is referenced often enough that it is warm in the cache
> despite being frequently updated.
>
> Other thoughts?

On a big system like this though, won't reading jiffies frequently be
another source of contention?

Jesse

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-03 16:39 ` Jesse Barnes @ 2004-05-03 20:04 ` Paul E. McKenney 0 siblings, 0 replies; 22+ messages in thread From: Paul E. McKenney @ 2004-05-03 20:04 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-kernel, Jack Steiner On Mon, May 03, 2004 at 09:39:11AM -0700, Jesse Barnes wrote: > On Sunday, May 2, 2004 11:28 am, Paul E. McKenney wrote: > > From your numbers below, I would guess that if you have at least > > 8 CPUs per NUMA node, a two-level tree would suffice. If you have > > only 4 CPUs per NUMA node, you might well need a per-node level, > > a per-4-nodes level, and a global level to get the global lock > > contention reduced sufficiently. > > Actually, only 2, but it sounds like your approach would work. OK, make that a per-node level, a per-8-nodes level, and a global level. ;-) The per-node level might or might not be helpful, depending on your memory latencies. > > Cute! However, it is not clear to me that this approach is > > compatible with real-time use of RCU, since it results in CPUs > > processing their callbacks less frequently, and thus getting > > more of them to process at a time. > > I think it was just a proof-of-concept--the current RCU design obviously > wasn't designed with this machine in mind :). Agreed! ;-) > > But it is not clear to me that anyone is looking for realtime > > response from a 512-CPU machine (yow!!!), so perhaps this > > is not a problem... > > There are folks that would like realtime (or close to realtime) response on > such systems, so it would be best not to do anything that would explicitly > prevent it. The potential problem with less-frequent processing of callbacks is that it would result in larger "bursts" of callbacks to be processed, degrading realtime scheduling latency. There are some patches that help avoid this problem, but they probably need more testing and tuning. 
> > This patch certainly seems simple enough, and I would guess that > > "jiffies" is referenced often enough that it is warm in the cache > > despite being frequently updated. > > > > Other thoughts? > > On a big system like this though, won't reading jiffies frequently be another > source of contention? My thought was that each CPU was already reading jiffies several times each tick anyway, so that it would already be cached when RCU wanted to look at it. But I must defer to your experience with this particular machine. Thanx, Paul ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-02 18:28 ` Paul E. McKenney 2004-05-03 16:39 ` Jesse Barnes @ 2004-05-03 18:40 ` Jack Steiner 2004-05-07 20:50 ` Paul E. McKenney 1 sibling, 1 reply; 22+ messages in thread From: Jack Steiner @ 2004-05-03 18:40 UTC (permalink / raw) To: Paul E. McKenney; +Cc: linux-kernel Paul - Thanks for the reply. Additional data from experiments today. As expected, there are multiple hot spots related to the rcu_ctrlblk. - scheduler_tick() in the rcu_pending macro. Specifically, on the load of the rcu_cpu_mask. - rcu_check_quiescent_state() on spin_lock(&rcu_ctrlblk.mutex); These two spots are ~equally hot. Some of the cache line contention could be alleviated by separating these fields into multiple cache lines. Wli posted a patch over the weekend that does that. I have not had a chance to review the patch in detail but it looks like a reasonable idea. ----------------- Response to your mail: > >From your numbers below, I would guess that if you have at least > 8 CPUs per NUMA node, a two-level tree would suffice. If you have > only 4 CPUs per NUMA node, you might well need a per-node level, > a per-4-nodes level, and a global level to get the global lock > contention reduced sufficiently. The system consists of 256 nodes. Each node has 2 cpus located on a shared FSB. The nodes are packaged as 128 modules - 2 nodes per module. The 2 nodes in a module are slightly "closer" latency-wise than nodes in different modules. > > > I also found an interesting anomaly that was traced to RCU. I have > > a program that measures "cpu efficiency". Basically, the program > > creates a cpu bound thread for each cpu & measures the percentage > > of time that each cpu ACTUALLY spends executing user code. > > On an idle each system, each cpu *should* spend >99% in user mode. > > > > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel > > RCU code. The time is spent contending for shared cache lines. 
> > Again, no surprise, Linux's RCU was not designed for a 512-CPU > system. ;-) > > The hierarchical grace-period-detection scheme described above > also increases cache locality, greatly reducing the cache-thrashing > you are seeing. > > > Even more bizarre: if I repeatedly type "ls" in a *single* window > > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50% > > of the time in the RCU code. > > Hmmm... How large was the directory you were running "ls" on? > At first blush, it sounds like the "ls" was somehow provoking > a dcache update, which would then exercise RCU. The directory size does not seem to be too significant. I tried one test on a NFS directory with 250 files. Another test on /tmp with 25 files. In both cases, the results were similar. > > > The RCU algorithms don't scale - at least on our systems!!!!! > > As noted earlier, the current implementation is not designed for > 512 CPUs. And, as noted earlier, there are ways to make it > scale. But for some reason, we felt it advisable to start with > a smaller, simpler, and hence less scalable implementation. ;-) Makes sense. I would not have expected otherwise. Overall, linux scales to 512p much better than I would have predicted. Is anyone working on improving RCU scaling to higher cpu counts? I don't want to duplicate any work that is already in progress. Otherwise, I'll start investigating what can be done to improve scaling. > > > Attached is an experimental hack that fixes the problem. I > > don't believe that this is the correct fix but it does prove > > that slowing down the frequency of updates fixes the problem. > > > > > > With this hack, "ls" no longer measurable disturbs other cpus. Each > > cpu spends ~99.8% of its time in user code regardless of the frequency > > of typing "ls". > > > > > > > > By default, the RCU code attempts to process callbacks on each cpu > > every tick. The hack adds a mask so that only a few cpus process > > callbacks each tick. > > Cute! 
However, it is not clear to me that this approach is > compatible with real-time use of RCU, since it results in CPUs > processing their callbacks less frequently, and thus getting > more of them to process at a time. > > But it is not clear to me that anyone is looking for realtime > response from a 512-CPU machine (yow!!!), so perhaps this > is not a problem... Agree on both counts. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-03 18:40 ` Jack Steiner @ 2004-05-07 20:50 ` Paul E. McKenney 2004-05-07 22:06 ` Jack Steiner 0 siblings, 1 reply; 22+ messages in thread From: Paul E. McKenney @ 2004-05-07 20:50 UTC (permalink / raw) To: Jack Steiner; +Cc: linux-kernel On Mon, May 03, 2004 at 01:40:06PM -0500, Jack Steiner wrote: > Paul - > > Thanks for the reply. > > > Additional data from experiments today. > > As expected, there are multiple hot spots related to the rcu_ctrlblk. > > - scheduler_tick() in the rcu_pending macro. Specifically, on the > load of the rcu_cpu_mask. Going to a hierarchical approach would break this up, especially if the hierarchy reflected the NUMA structure of the machine. > - rcu_check_quiescent_state() on spin_lock(&rcu_ctrlblk.mutex); Again, in a hierarchical approach, this would be the lock at the lowest level of the hierarchy. > These two spots are are ~equally hot. > > Some of the cache line contention could be alleviated by separating > these fields into multiple cache lines. Wli posted a patch over the > weekend that does that. I have not had a chance to review the patch in > detail but it looks a reasonable idea. Could help or hurt -- multiple cachelines can be accessed in parallel, which should speed things up, but the need to load multiple cachelines could slow other things down. > ----------------- > Response to your mail: > > > > > >From your numbers below, I would guess that if you have at least > > 8 CPUs per NUMA node, a two-level tree would suffice. If you have > > only 4 CPUs per NUMA node, you might well need a per-node level, > > a per-4-nodes level, and a global level to get the global lock > > contention reduced sufficiently. > > The system consists of 256 nodes. Each node has 2 cpus located on > a shared FSB. The nodes are packaged as 128 modules - 2 nodes per > module. The 2 nodes in a module are slightly "closer" latency-wise > than nodes in different modules. 
If the latency difference is sufficiently slight, a two-level scheme that split the system up into 16 blocks of 8 nodes (16 CPUs) each would seem most likely to help. > > > I also found an interesting anomaly that was traced to RCU. I have > > > a program that measures "cpu efficiency". Basically, the program > > > creates a cpu bound thread for each cpu & measures the percentage > > > of time that each cpu ACTUALLY spends executing user code. > > > On an idle each system, each cpu *should* spend >99% in user mode. > > > > > > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel > > > RCU code. The time is spent contending for shared cache lines. > > > > Again, no surprise, Linux's RCU was not designed for a 512-CPU > > system. ;-) > > > > The hierarchical grace-period-detection scheme described above > > also increases cache locality, greatly reducing the cache-thrashing > > you are seeing. > > > > > Even more bizarre: if I repeatedly type "ls" in a *single* window > > > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50% > > > of the time in the RCU code. > > > > Hmmm... How large was the directory you were running "ls" on? > > At first blush, it sounds like the "ls" was somehow provoking > > a dcache update, which would then exercise RCU. > > The directory size does not seem to be too significant. I tried one test > on a NFS directory with 250 files. Another test on /tmp with 25 files. > In both cases, the results were similar. Hmmm... I took a quick glance, but don't see why an "ls" would cause RCU to be invoked. This should only happen if a dentry is ejected from dcache. Any enlightenment appreciated! > > > The RCU algorithms don't scale - at least on our systems!!!!! > > > > As noted earlier, the current implementation is not designed for > > 512 CPUs. And, as noted earlier, there are ways to make it > > scale. 
But for some reason, we felt it advisable to start with > > a smaller, simpler, and hence less scalable implementation. ;-) > > Makes sense. I would not have expected otherwise. Overall, linux scales > to 512p much better than I would have predicted. Indeed! I am impressed! > Is anyone working on improving RCU scaling to higher cpu counts. I > dont want to duplicate any work that is already in progress. > Otherwise, I'll start investigating what can be done to improve > scaling. We are focusing first on some of the update-side code, which on our systems is a much tighter bottleneck than is the RCU infrastructure. I expect that we will need to increase RCU scaling eventually, but if you have a short-term need, you should go ahead. I will of course be more than happy to review and comment. One challenge will be coming up with something that works efficiently both for small and large machines. Maybe some arch-specific code is required, but it would be good to avoid this, if possible. > > > Attached is an experimental hack that fixes the problem. I > > > don't believe that this is the correct fix but it does prove > > > that slowing down the frequency of updates fixes the problem. > > > > > > > > > With this hack, "ls" no longer measurable disturbs other cpus. Each > > > cpu spends ~99.8% of its time in user code regardless of the frequency > > > of typing "ls". > > > > > > > > > > > > By default, the RCU code attempts to process callbacks on each cpu > > > every tick. The hack adds a mask so that only a few cpus process > > > callbacks each tick. > > > > Cute! However, it is not clear to me that this approach is > > compatible with real-time use of RCU, since it results in CPUs > > processing their callbacks less frequently, and thus getting > > more of them to process at a time. > > > > But it is not clear to me that anyone is looking for realtime > > response from a 512-CPU machine (yow!!!), so perhaps this > > is not a problem... > > Agree on both counts. 
I will let you and Jesse hash out the need for realtime response on a 512-way machine. ;-) Thanx, Paul > -- > Thanks > > Jack Steiner (steiner@sgi.com) 651-683-5302 > Principal Engineer SGI - Silicon Graphics, Inc. > ^ permalink raw reply [flat|nested] 22+ messages in thread
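To make the two-level scheme Paul sketches above concrete, here is a stand-alone model (hypothetical names and layout, not actual kernel code, and with the per-level locking omitted) of quiescent-state accounting for 512 CPUs split into 32 blocks of 16. A CPU reporting a quiescent state normally writes only its block's word; the shared top-level word, the analogue of the globally contended rcu_cpu_mask, is written once per block per grace period, by the last CPU of that block.

```c
#include <assert.h>

#define NCPUS 512
#define BLKSZ 16			/* CPUs per block (8 nodes x 2 CPUs) */
#define NBLKS (NCPUS / BLKSZ)		/* 32 blocks */

struct gp_state {
	unsigned int block_mask[NBLKS];	/* one bit per CPU yet to report */
	unsigned int top_mask;		/* one bit per block yet to report */
	int top_writes;			/* writes to the shared top word */
};

static void gp_start(struct gp_state *gp)
{
	int b;

	for (b = 0; b < NBLKS; b++)
		gp->block_mask[b] = (1u << BLKSZ) - 1;
	gp->top_mask = ~0u;		/* 32 bits for 32 blocks */
	gp->top_writes = 0;
}

/* CPU `cpu` passed through a quiescent state; returns 1 when the
 * grace period has completed. */
static int gp_report(struct gp_state *gp, int cpu)
{
	int b = cpu / BLKSZ;

	gp->block_mask[b] &= ~(1u << (cpu % BLKSZ));
	if (gp->block_mask[b] == 0) {	/* last CPU in this block */
		gp->top_mask &= ~(1u << b);
		gp->top_writes++;
	}
	return gp->top_mask == 0;
}
```

Per grace period, the shared word is written NBLKS (32) times rather than NCPUS (512) times, and the per-block words can live in node-local memory.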
* Re: RCU scaling on large systems 2004-05-07 20:50 ` Paul E. McKenney @ 2004-05-07 22:06 ` Jack Steiner 2004-05-07 23:32 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Jack Steiner @ 2004-05-07 22:06 UTC (permalink / raw) To: Paul E. McKenney; +Cc: linux-kernel On Fri, May 07, 2004 at 01:50:48PM -0700, Paul E. McKenney wrote: > On Mon, May 03, 2004 at 01:40:06PM -0500, Jack Steiner wrote: > > Paul - > > > > Hmmm... I took a quick glance, but don't see why an "ls" would > cause RCU to be invoked. This should only happen if a dentry is > ejected from dcache. > > Any enlightenment appreciated! > The calls to RCU are coming from here: [11]kdb> bt Stack traceback for pid 3553 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls 0xa0000001000feee0 call_rcu 0xa0000001001a3b20 d_free+0x80 0xa0000001001a3ec0 dput+0x340 0xa00000010016bcd0 __fput+0x210 0xa00000010016baa0 fput+0x40 0xa000000100168760 filp_close+0xc0 0xa000000100168960 sys_close+0x180 0xa000000100011be0 ia64_ret_from_syscall I see this same backtrace from numerous processes. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-07 22:06 ` Jack Steiner @ 2004-05-07 23:32 ` Andrew Morton 2004-05-08 4:55 ` Jack Steiner 2004-05-17 21:18 ` Andrea Arcangeli 0 siblings, 2 replies; 22+ messages in thread From: Andrew Morton @ 2004-05-07 23:32 UTC (permalink / raw) To: Jack Steiner; +Cc: paulmck, linux-kernel Jack Steiner <steiner@sgi.com> wrote: > > The calls to RCU are coming from here: > > [11]kdb> bt > Stack traceback for pid 3553 > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > 0xa0000001000feee0 call_rcu > 0xa0000001001a3b20 d_free+0x80 > 0xa0000001001a3ec0 dput+0x340 > 0xa00000010016bcd0 __fput+0x210 > 0xa00000010016baa0 fput+0x40 > 0xa000000100168760 filp_close+0xc0 > 0xa000000100168960 sys_close+0x180 > 0xa000000100011be0 ia64_ret_from_syscall > > I see this same backtrace from numerous processes. eh? Why is dput freeing the dentry? It should just be leaving it in cache. What filesystem is being used? procfs? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-07 23:32 ` Andrew Morton @ 2004-05-08 4:55 ` Jack Steiner 2004-05-17 21:18 ` Andrea Arcangeli 1 sibling, 0 replies; 22+ messages in thread From: Jack Steiner @ 2004-05-08 4:55 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, linux-kernel On Fri, May 07, 2004 at 07:04:36PM -0500, Jack Steiner wrote: > On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote: > > Jack Steiner <steiner@sgi.com> wrote: > > > > > > The calls to RCU are coming from here: > > > > > > [11]kdb> bt > > > Stack traceback for pid 3553 > > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > > > 0xa0000001000feee0 call_rcu > > > 0xa0000001001a3b20 d_free+0x80 > > > 0xa0000001001a3ec0 dput+0x340 > > > 0xa00000010016bcd0 __fput+0x210 > > > 0xa00000010016baa0 fput+0x40 > > > 0xa000000100168760 filp_close+0xc0 > > > 0xa000000100168960 sys_close+0x180 > > > 0xa000000100011be0 ia64_ret_from_syscall > > > > > > I see this same backtrace from numerous processes. > > > > eh? Why is dput freeing the dentry? It should just be leaving it in cache. > > > > What filesystem is being used? procfs? > > Good possibility. I verified that /proc DOES cause a call to call_rcu. > > I also did an strace on "ls". I see the following. Does > this make sense??? Note open of /proc/meminfo... 
> > [root@piton tmp]# strace -o zzz ls > espdbd.sock jd_sockV4 ProPack-installer s.eventmond zz zzz > > [root@piton tmp]# egrep 'open|close|fork|exec' zzz > execve("/bin/ls", ["ls"], [/* 26 vars */]) = 0 > open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or > directory) > open("/etc/ld.so.cache", O_RDONLY) = 3 > close(3) = 0 > open("/lib/libacl.so.1", O_RDONLY) = 3 > close(3) = 0 > open("/lib/libtermcap.so.2", O_RDONLY) = 3 > close(3) = 0 > open("/lib/tls/libc.so.6.1", O_RDONLY) = 3 > close(3) = 0 > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/tls/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > open("/lib/libattr.so.1", O_RDONLY) = 3 > close(3) = 0 > open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 > close(3) = 0 > open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > close(3) = 0 > open("/etc/mtab", O_RDONLY) = 3 > close(3) = 0 > >>>> open("/proc/meminfo", O_RDONLY) = 3 > close(3) = 0 > > Full output of strace: > [root@piton tmp]# cat zzz > execve("/bin/ls", ["ls"], [/* 26 vars */]) = 0 > uname({sys="Linux", node="piton.americas.sgi.com", ...}) = 0 > brk(0) = 0x6000000000004000 > open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such > file or directory) > open("/etc/ld.so.cache", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=82621, ...}) = 0 > mmap(NULL, 82621, PROT_READ, MAP_PRIVATE, 3, 0) = > 0x2000000000040000 > close(3) = 0 > open("/lib/libacl.so.1", O_RDONLY) = 3 > read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0 > \'\0\0"..., 640) = 640 > fstat(3, {st_mode=S_IFREG|0644, st_size=187607, ...}) = 0 > mmap(NULL, 160576, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x2000000000058000 > mprotect(0x2000000000070000, 62272, PROT_NONE) = 0 > mmap(0x2000000000078000, 32768, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0x10000) = 0x2000000000078000 > close(3) = 0 > 
open("/lib/libtermcap.so.2", O_RDONLY) = 3 > read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0 > \31\0"..., 640) = 640 > fstat(3, {st_mode=S_IFREG|0755, st_size=26088, ...}) = 0 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000080000 > mmap(NULL, 89656, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x2000000000084000 > mprotect(0x200000000008c000, 56888, PROT_NONE) = 0 > mmap(0x2000000000094000, 32768, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x2000000000094000 > close(3) = 0 > open("/lib/tls/libc.so.6.1", O_RDONLY) = 3 > read(3, > "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0\320\252"..., > 640) = 640 > fstat(3, {st_mode=S_IFREG|0755, st_size=9329267, ...}) = 0 > mmap(NULL, 2519264, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x200000000009c000 > mmap(0x20000000002ec000, 81920, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0x240000) = 0x20000000002ec000 > mmap(0x2000000000300000, 12512, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2000000000300000 > close(3) = 0 > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/tls/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > stat("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/tls", > 0x60000fffffffad00) = -1 ENOENT (No such file or directory) > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > stat("/usr/src/redhat/xfs-cmds/attr/libattr/.libs", > 0x60000fffffffad00) = -1 ENOENT (No such file or directory) > open("/lib/libattr.so.1", O_RDONLY) = 3 > read(3, > "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0\240\31"..., > 640) = 640 > fstat(3, {st_mode=S_IFREG|0644, st_size=55577, ...}) = 0 > mmap(NULL, 102360, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x2000000000304000 > mprotect(0x2000000000310000, 53208, PROT_NONE) = 0 > mmap(0x2000000000314000, 49152, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x2000000000314000 > close(3) = 
0 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000320000 > mmap(NULL, 32768, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000324000 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x200000000032c000 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000330000 > munmap(0x2000000000040000, 82621) = 0 > brk(0) = 0x6000000000004000 > brk(0x6000000000028000) = 0x6000000000028000 > brk(0) = 0x6000000000028000 > open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=34848240, ...}) = 0 > mmap(NULL, 34848240, PROT_READ, MAP_PRIVATE, 3, 0) = > 0x2000000000334000 > close(3) = 0 > rt_sigaction(SIGTERM, {0x400000000001e9a0, [], SA_RESTART}, > {SIG_DFL}, 8) = 0 > rt_sigaction(SIGKILL, {0x400000000001e9a0, [], SA_RESTART}, > {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) > rt_sigaction(SIGSTOP, {0x400000000001e9a0, [], SA_RESTART}, > {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) > ioctl(1, TCGETS or SNDCTL_TMR_TIMEBASE, {B9600 opost isig icanon > echo ...}) = 0 > ioctl(1, TIOCGWINSZ, {ws_row=38, ws_col=137, ws_xpixel=0, > ws_ypixel=0}) = 0 > open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > fstat(3, {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0 > fcntl(3, F_SETFD, FD_CLOEXEC) = 0 > getdents64(0x3, 0x60000000000095d8, 0x4000) = 432 > getdents64(0x3, 0x60000000000095d8, 0x4000) = 0 > close(3) = 0 > open("/etc/mtab", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=1533, ...}) = 0 > mmap(NULL, 65536, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000002470000 > read(3, "/dev/xscsi/pci01.03.0-1/target1/"..., 16384) = 1533 > close(3) = 0 > munmap(0x2000000002470000, 65536) = 0 > >>>> open("/proc/meminfo", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 > mmap(NULL, 65536, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000002470000 > 
read(3, "MemTotal: 22874016 kB\nMemFre"..., 1024) = 653 > close(3) = 0 > munmap(0x2000000002470000, 65536) = 0 > fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = > 0 > mmap(NULL, 65536, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000002470000 > write(1, "espdbd.sock jd_sockV4\tProPack-i"..., 62) = 62 > munmap(0x2000000002470000, 65536) = 0 > exit_group(0) = ? > > -- > Thanks > > Jack Steiner (steiner@sgi.com) 651-683-5302 > Principal Engineer SGI - Silicon Graphics, Inc. > > -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-07 23:32 ` Andrew Morton 2004-05-08 4:55 ` Jack Steiner @ 2004-05-17 21:18 ` Andrea Arcangeli 2004-05-17 21:42 ` Andrew Morton 1 sibling, 1 reply; 22+ messages in thread From: Andrea Arcangeli @ 2004-05-17 21:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Jack Steiner, paulmck, linux-kernel On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote: > Jack Steiner <steiner@sgi.com> wrote: > > > > The calls to RCU are coming from here: > > > > [11]kdb> bt > > Stack traceback for pid 3553 > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > > 0xa0000001000feee0 call_rcu > > 0xa0000001001a3b20 d_free+0x80 > > 0xa0000001001a3ec0 dput+0x340 > > 0xa00000010016bcd0 __fput+0x210 > > 0xa00000010016baa0 fput+0x40 > > 0xa000000100168760 filp_close+0xc0 > > 0xa000000100168960 sys_close+0x180 > > 0xa000000100011be0 ia64_ret_from_syscall > > > > I see this same backtrace from numerous processes. > > eh? Why is dput freeing the dentry? It should just be leaving it in cache. > > What filesystem is being used? procfs? deleting entries from dcache can be a frequent operation, even rename() triggers d_free. note that I changed my tree to free all negative entries that are currently generated by unlink. I find it useless to leave negative dentries after "unlink". I leave them of course after a failed lookup (that's the fundamental usage of the negative dentries for the PATHs userspace lookups), but not after unlink. I think it's wasted memory that would better be used for other purposes. I think this is also the reason why, when dcache-RCU was once benchmarked on top of my 2.4 tree, it resulted in a loss of performance. 
in my 2.6-aa I didn't forward port all my 2.4-aa stuff but I really would like to avoid wasting memory there again for the files that are being deleted, plus during memory pressure (that could be generated even from pagecache) dcache must be freed up (admittedly there we could free more than 1 entry for every quiescent point). I suggested a few years ago that I agreed completely with RCU _only_ for usages where the reader is extremely _frequent_ and the writer is _extremely_ unlikely to happen, my most obvious example was the replacement of the big reader lock. RCU basically trades much higher performance for the reader, with much lower performance for the writer. RCU is like a rwlock with the read-lock being extremely fast (actually it's literally a noop), but with a very slow writer (but the writer is still better than the big writer lock, especially since the writer has no starvation and can coalesce things together to maximize icache etc..). The more cpus, the higher performance you get by RCU by having a noop read-lock operation. The more cpus the lower performance you get in the writer. Very few things come for free ;). ^ permalink raw reply [flat|nested] 22+ messages in thread
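Andrea's rwlock comparison can be illustrated with a toy, single-threaded model (hypothetical code, not the kernel API: real RCU needs memory barriers and quiescent-state detection, and the grace period here is invoked by hand). The read side is a bare pointer dereference with no lock at all; the write side pays by publishing a copy and deferring the free of the old one until a grace period.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int value;
};

static struct item *cur;	/* the RCU-protected pointer */
static struct item *pending;	/* at most one deferred free in this toy */
static int deferred_frees;

static void item_init(int v)
{
	cur = malloc(sizeof(*cur));
	cur->value = v;
}

/* Read side: rcu_read_lock()/rcu_read_unlock() would be no-ops here. */
static int reader(void)
{
	return cur->value;
}

/* Write side: publish a new copy, defer freeing the old one
 * (call_rcu() in the real thing). */
static void writer_replace(int v)
{
	struct item *repl = malloc(sizeof(*repl));
	struct item *old = cur;

	repl->value = v;
	cur = repl;		/* publish */
	pending = old;		/* cannot free yet: readers may hold it */
}

/* Called once all pre-existing readers are known to have finished. */
static void grace_period(void)
{
	if (pending) {
		free(pending);
		pending = NULL;
		deferred_frees++;
	}
}
```

The dcache case is exactly the frequent-writer regime: every d_free becomes a deferred callback, so the cost moves from the (cheap) read side to grace-period bookkeeping.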
* Re: RCU scaling on large systems 2004-05-17 21:18 ` Andrea Arcangeli @ 2004-05-17 21:42 ` Andrew Morton 2004-05-17 23:50 ` Andrea Arcangeli ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Andrew Morton @ 2004-05-17 21:42 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: steiner, paulmck, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote: > > Jack Steiner <steiner@sgi.com> wrote: > > > > > > The calls to RCU are coming from here: > > > > > > [11]kdb> bt > > > Stack traceback for pid 3553 > > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > > > 0xa0000001000feee0 call_rcu > > > 0xa0000001001a3b20 d_free+0x80 > > > 0xa0000001001a3ec0 dput+0x340 > > > 0xa00000010016bcd0 __fput+0x210 > > > 0xa00000010016baa0 fput+0x40 > > > 0xa000000100168760 filp_close+0xc0 > > > 0xa000000100168960 sys_close+0x180 > > > 0xa000000100011be0 ia64_ret_from_syscall > > > > > > I see this same backtrace from numerous processes. > > > > eh? Why is dput freeing the dentry? It should just be leaving it in cache. > > > > What filesystem is being used? procfs? > > deleting entries from dcache can be a frequent operation, even rename() > triggers d_free. This issue has gone all quiet. Is anyone doing anything? > note that I changed my tree to free all negative entries that are > currently generated by unlink. I find useless to leave negative dentries > after "unlink". I leave them of course after a failed lookup (that's the > fundamental usage of the negative dentries for the PATHs userspace > lookups), but not after unlink. Sounds sensible. Could you please send out the patch? > RCU basically trades mugh higher performance for reader, with much lower > performance for the writer. If the writer wants synchronous-removal semantics, yes. The problem here and, I believe, in the route cache is in finding a balance between the amount of storage and the frequency of RCU callback runs. 
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-17 21:42 ` Andrew Morton @ 2004-05-17 23:50 ` Andrea Arcangeli 2004-05-18 13:33 ` Jack Steiner 2004-05-18 23:13 ` Matt Mackall 2 siblings, 0 replies; 22+ messages in thread From: Andrea Arcangeli @ 2004-05-17 23:50 UTC (permalink / raw) To: Andrew Morton; +Cc: steiner, paulmck, linux-kernel On Mon, May 17, 2004 at 02:42:28PM -0700, Andrew Morton wrote: > This issue has gone all quiet. Is anyone doing aything? dunno > Sounds sensible. Could you please send out the patch? this is the 2.4 version but problem is this logic will now work fine on UP or small SMP, but it will be very detrimental on the 256-way, so we may want to address the dcache-rcu problem first? (my wild guess is that renames are less frequent than unlinks). diff -urNp 2.4.19pre9/fs/dcache.c neg-dentry-ref/fs/dcache.c --- 2.4.19pre9/fs/dcache.c Wed May 29 02:12:36 2002 +++ neg-dentry-ref/fs/dcache.c Wed May 29 04:19:13 2002 @@ -806,6 +806,7 @@ out: void d_delete(struct dentry * dentry) { +#ifdef DENTRY_WASTE_RAM /* * Are we the only user? */ @@ -815,6 +816,7 @@ void d_delete(struct dentry * dentry) return; } spin_unlock(&dcache_lock); +#endif /* * If not, just drop the dentry and let dput diff -urNp 2.4.19pre9/fs/namei.c neg-dentry-ref/fs/namei.c --- 2.4.19pre9/fs/namei.c Wed May 29 02:12:36 2002 +++ neg-dentry-ref/fs/namei.c Wed May 29 04:20:45 2002 @@ -1042,6 +1042,10 @@ do_last: error = vfs_create(dir->d_inode, dentry, mode & ~current->fs->umask); up(&dir->d_inode->i_sem); +#ifndef DENTRY_WASTE_RAM + if (error) + d_drop(dentry); +#endif dput(nd->dentry); nd->dentry = dentry; if (error) > If the writer wants synchronous-removal semantics, yes. > > The problem here and, I believe, in the route cache is in finding a balance > between the amount of storage and the frequency of RCU callback runs. RCU is by design an asynchronous-removal. 
The problem is the "latency" of this asynchronous-removal, so the time between the request-of-freeing and the effective release of the memory. Jack's patch increases the latency of reaching a quiescent point a lot (this is fine as long as he has enough memory) and reduces the overhead. He practically removed the contention on the spinlock completely, so the cpu will not waste any time spinning and thrashing the cacheline everywhere. The problem is that there is a "max latency" we can allow under any certain workload, before throttling on rcu (normally inside the VM due to memory exhaustion, or like in the routing cache case throttling by losing incoming packets). So again, for a not-frequent writer, rcu is fine and Jack's approach is fine too and it solves the contention problem completely. For a frequent writer like the dcache, rcu is much more problematic, and here we're even ignoring the fact that on all machines calling rcu for the writer is more expensive than taking a write_lock, so if there's no reader to optimize, there's nearly no point to use RCU. This even ignoring the fact that the current rcu implementation will spin on the spinlock (but we could change that like Jack did). An improvement over Jack's patch that will likely not increase the latency much, is to minimize as much as possible the delay between the processing of cpu0 then cpu1 then cpu2 etc.. The basic logic of Jack's patch is to _serialize_ the rcu_check_callbacks. By serializing it cpu after cpu, he removes the contention on the lock, but there's absolutely no reason to leave millisecond delays in between cpus, so using IPI to queue the next callback in the next cpu would solve the problem as well as Jack's patch but without increasing latency as much as he did. Problem is that sending IPIs from softirq context is not allowed, like queuing tasklets in other cpus is not allowed. 
So there's no easy fix other than Jack's huge-latency-increase one, but that makes rcu not suitable for frequent-writer case (probably not an issue for him with some tera of ram and without zone-normal ;) but an issue for others). And again, rcu is probably slower anyways for the frequent-writer case (even on a 2-way), so it really should be used for frequent-reader, if there's any legitimate frequent-writer rcu is troublesome, it's not that rcu isn't designed for huge machines, rcu is _perfect_ for huge machines, but only as far as you don't mind running the writer frequently during production (like it can happen with dcache). RCU is like the big reader lock, but it's much better than the big reader lock. Routing cache is less of an issue since the collection should happen extremely infrequently in normal usages (it's not nearly similar to calling unlink or rename in normal usages), but it's troublesome too if you're under routing attack and you've to collect entries all the time. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-17 21:42 ` Andrew Morton
  2004-05-17 23:50 ` Andrea Arcangeli
@ 2004-05-18 13:33 ` Jack Steiner
  2004-05-18 23:13 ` Matt Mackall
  2 siblings, 0 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-18 13:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, paulmck, linux-kernel
On Mon, May 17, 2004 at 02:42:28PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote:
> > > Jack Steiner <steiner@sgi.com> wrote:
> > > >
> > > > The calls to RCU are coming from here:
> > > >
> > > > [11]kdb> bt
> > > > Stack traceback for pid 3553
> > > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls
> > > > 0xa0000001000feee0 call_rcu
> > > > 0xa0000001001a3b20 d_free+0x80
> > > > 0xa0000001001a3ec0 dput+0x340
> > > > 0xa00000010016bcd0 __fput+0x210
> > > > 0xa00000010016baa0 fput+0x40
> > > > 0xa000000100168760 filp_close+0xc0
> > > > 0xa000000100168960 sys_close+0x180
> > > > 0xa000000100011be0 ia64_ret_from_syscall
> > > >
> > > > I see this same backtrace from numerous processes.
> > >
> > > eh? Why is dput freeing the dentry? It should just be leaving it in cache.
> > >
> > > What filesystem is being used? procfs?
> >
> > deleting entries from dcache can be a frequent operation, even rename()
> > triggers d_free.
>
> This issue has gone all quiet. Is anyone doing anything?

I plan to look into the RCU scaling issues (unless someone beats me to it)
but it will be a couple of weeks before I can start.

> > note that I changed my tree to free all negative entries that are
> > currently generated by unlink. I find useless to leave negative dentries
> > after "unlink". I leave them of course after a failed lookup (that's the
> > fundamental usage of the negative dentries for the PATHs userspace
> > lookups), but not after unlink.
>
> Sounds sensible. Could you please send out the patch?
>
> > RCU basically trades much higher performance for the reader, with much
> > lower performance for the writer.
>
> If the writer wants synchronous-removal semantics, yes.
>
> The problem here and, I believe, in the route cache is in finding a balance
> between the amount of storage and the frequency of RCU callback runs.

--
Thanks

Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
* Re: RCU scaling on large systems
  2004-05-17 21:42 ` Andrew Morton
  2004-05-17 23:50 ` Andrea Arcangeli
  2004-05-18 13:33 ` Jack Steiner
@ 2004-05-18 23:13 ` Matt Mackall
  2 siblings, 0 replies; 22+ messages in thread
From: Matt Mackall @ 2004-05-18 23:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, steiner, paulmck, linux-kernel
On Mon, May 17, 2004 at 02:42:28PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> > note that I changed my tree to free all negative entries that are
> > currently generated by unlink. I find useless to leave negative dentries
> > after "unlink". I leave them of course after a failed lookup (that's the
> > fundamental usage of the negative dentries for the PATHs userspace
> > lookups), but not after unlink.
>
> Sounds sensible. Could you please send out the patch?

Seems like it'll be the wrong thing for the fairly common lockfile case
and also stuff like "make clean all".

--
Matt Mackall : http://www.selenic.com : Linux development and consulting
* Re: RCU scaling on large systems
@ 2004-05-20 11:36 Manfred Spraul
0 siblings, 0 replies; 22+ messages in thread
From: Manfred Spraul @ 2004-05-20 11:36 UTC (permalink / raw)
To: Jack Steiner; +Cc: William Lee Irwin III, linux-kernel, Paul E. McKenney
[-- Attachment #1: Type: text/plain, Size: 1456 bytes --]
Jack Steiner wrote:
>On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
>> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
>> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
>> > RCU code. The time is spent contending for shared cache lines.
>>
>> Would something like this help cacheline contention? This uses the
>> per_cpu data areas to hold per-cpu booleans for needing switches.
>> Untested/uncompiled.
>
>
> [snip]
>
>We use 64MB granules. The percpu data structures on the individual nodes
>are separated by addresses that differ by many GB. A scan of all percpu
>data structures requires a TLB entry for each node. This is costly &
thrashes the TLB. (Our max system size is currently 256 nodes).
>
>
rcu_cpu_mask serves two purposes:
- if a bit for a cpu is set, then the cpu must report quiescent states. Thus all cpus continuously poll their own bits.
- If a cpu notices that it completed a quiescent state, it clears its own bit, checks all the bits, and completes the grace period if the bitmap is empty. This happens under a global spinlock.
What about splitting that into two variables?
A global counter could be used to inform the cpus that they should look for quiescent states. That cacheline would be mostly read-only, so it shouldn't cause thrashing.
Attached is a patch - it boots uniprocessor i386. What do you think? It should remove the hotspot from scheduler_tick entirely.
--
Manfred
[-- Attachment #2: patch-rcu-simple --]
[-- Type: text/plain, Size: 4162 bytes --]
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 6
// EXTRAVERSION = -mm4
--- 2.6/kernel/rcupdate.c 2004-05-19 19:51:17.000000000 +0200
+++ build-2.6/kernel/rcupdate.c 2004-05-20 13:22:59.000000000 +0200
@@ -55,6 +55,7 @@
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
#define RCU_tasklet(cpu) (per_cpu(rcu_tasklet, cpu))
+long rcu_global_grace_num ____cacheline_aligned_in_smp;
/**
* call_rcu - Queue an RCU update request.
* @head: structure to be used for queueing the RCU updates.
@@ -97,6 +98,23 @@
}
/*
+ * Grace period handling:
+ * The grace period handling consists out of two steps:
+ * - A new grace period is started.
+ * This is done by rcu_start_batch. The start is not broadcasted to
+ * all cpus, they must pick this up from rcu_global_grace_num. All cpus are
+ * recorded in the rcu_ctrlblk.rcu_cpu_mask bitmap.
+ * - All cpus must go through a quiescent state.
+ * Since the start of the grace period is not broadcasted, at least two
+ * calls to rcu_check_quiescent_state are required:
+ * The first call just notices that a new grace period is running. The
+ * following calls check if there was a quiescent state since the beginning
+ * of the grace period. If so, it updates rcu_ctrlblk.rcu_cpu_mask. If
+ * the bitmap is empty, then the grace period is completed.
+ * rcu_check_quiescent_state calls rcu_start_batch to start the next grace
+ * period.
+ */
+/*
* Register a new batch of callbacks, and start it up if there is currently no
* active batch and the batch to be registered has not already occurred.
* Caller must hold the rcu_ctrlblk lock.
@@ -116,6 +134,8 @@
active = nohz_cpu_mask;
cpus_complement(active);
cpus_and(rcu_ctrlblk.rcu_cpu_mask, cpu_online_map, active);
+ smp_wmb();
+ rcu_global_grace_num++;
}
/*
@@ -127,18 +147,28 @@
{
int cpu = smp_processor_id();
- if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+ if (RCU_grace_num(cpu) != rcu_global_grace_num) {
+ /* new grace period: record qsctr value. */
+ BUG_ON(RCU_last_qsctr(cpu) != RCU_QSCTR_INVALID);
+ RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
+ RCU_grace_num(cpu) = rcu_global_grace_num;
+ return;
+ }
+
+ /* grace period already completed for this cpu?
+ * last_qsctr is checked instead of the actual bitmap to avoid
+ * cacheline trashing
+ */
+ if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
+ /* yes. Just leave. */
return;
+ }
/*
* Races with local timer interrupt - in the worst case
* we may miss one quiescent state of that CPU. That is
* tolerable. So no need to disable interrupts.
*/
- if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
- RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
- return;
- }
if (RCU_qsctr(cpu) == RCU_last_qsctr(cpu))
return;
--- 2.6/include/linux/rcupdate.h 2004-05-19 19:42:16.000000000 +0200
+++ build-2.6/include/linux/rcupdate.h 2004-05-20 12:46:02.000000000 +0200
@@ -94,16 +94,19 @@
long last_qsctr; /* value of qsctr at beginning */
/* of rcu grace period */
long batch; /* Batch # for current RCU batch */
+ long grace_num; /* grace period number */
struct list_head nxtlist;
struct list_head curlist;
};
DECLARE_PER_CPU(struct rcu_data, rcu_data);
extern struct rcu_ctrlblk rcu_ctrlblk;
+extern long rcu_global_grace_num;
#define RCU_qsctr(cpu) (per_cpu(rcu_data, (cpu)).qsctr)
#define RCU_last_qsctr(cpu) (per_cpu(rcu_data, (cpu)).last_qsctr)
#define RCU_batch(cpu) (per_cpu(rcu_data, (cpu)).batch)
+#define RCU_grace_num(cpu) (per_cpu(rcu_data, (cpu)).grace_num)
#define RCU_nxtlist(cpu) (per_cpu(rcu_data, (cpu)).nxtlist)
#define RCU_curlist(cpu) (per_cpu(rcu_data, (cpu)).curlist)
@@ -115,7 +118,8 @@
rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
(list_empty(&RCU_curlist(cpu)) &&
!list_empty(&RCU_nxtlist(cpu))) ||
- cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+ RCU_grace_num(cpu) != rcu_global_grace_num ||
+ RCU_last_qsctr(cpu) != RCU_QSCTR_INVALID)
return 1;
else
return 0;