public inbox for linux-kernel@vger.kernel.org
* Re: RCU scaling on large systems
@ 2004-05-20 11:36 Manfred Spraul
  0 siblings, 0 replies; 22+ messages in thread
From: Manfred Spraul @ 2004-05-20 11:36 UTC (permalink / raw)
  To: Jack Steiner; +Cc: William Lee Irwin III, linux-kernel, Paul E. McKenney

[-- Attachment #1: Type: text/plain, Size: 1456 bytes --]

Jack Steiner wrote:

>On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
>> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
>> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
>> > RCU code. The time is spent contending for shared cache lines.
>> 
>> Would something like this help cacheline contention? This uses the
>> per_cpu data areas to hold per-cpu booleans for needing switches.
>> Untested/uncompiled.
>  
>
> [snip]
>
>We use 64MB granules. The percpu data structures on the individual nodes 
>are separated by addresses that differ by many GB. A scan of all percpu 
>data structures requires a TLB entry for each node. This is costly and 
>thrashes the TLB. (Our max system size is currently 256 nodes.)
>  
>
rcu_cpu_mask serves two purposes:
- If a bit for a cpu is set, that cpu must report quiescent states. Thus all cpus continuously poll their own bits.
- When a cpu notices that it has completed a quiescent state, it clears its own bit, checks all the bits, and completes the grace period. This happens under a global spinlock.

What about splitting that into two variables?
A global counter could be used to inform the cpus that they should look for quiescent states. That cacheline would be mostly read-only, so it shouldn't cause thrashing.

Attached is a patch - it boots on uniprocessor i386. What do you think? It should remove the hotspot from scheduler_tick entirely.

--
	Manfred


[-- Attachment #2: patch-rcu-simple --]
[-- Type: text/plain, Size: 4162 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 6
//  SUBLEVEL = 6
//  EXTRAVERSION = -mm4
--- 2.6/kernel/rcupdate.c	2004-05-19 19:51:17.000000000 +0200
+++ build-2.6/kernel/rcupdate.c	2004-05-20 13:22:59.000000000 +0200
@@ -55,6 +55,7 @@
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
 #define RCU_tasklet(cpu) (per_cpu(rcu_tasklet, cpu))
 
+long rcu_global_grace_num ____cacheline_aligned_in_smp;
 /**
  * call_rcu - Queue an RCU update request.
  * @head: structure to be used for queueing the RCU updates.
@@ -97,6 +98,23 @@
 }
 
 /*
+ * Grace period handling:
+ * The grace period handling consists of two steps:
+ * - A new grace period is started.
+ *   This is done by rcu_start_batch. The start is not broadcast to
+ *   all cpus; they must pick it up from rcu_global_grace_num. All cpus are
+ *   recorded in the rcu_ctrlblk.rcu_cpu_mask bitmap.
+ * - All cpus must go through a quiescent state.
+ *   Since the start of the grace period is not broadcast, at least two
+ *   calls to rcu_check_quiescent_state are required:
+ *   The first call just notices that a new grace period is running. The
+ *   following calls check whether there was a quiescent state since the
+ *   beginning of the grace period. If so, they update rcu_ctrlblk.rcu_cpu_mask.
+ *   If the bitmap is empty, the grace period is complete.
+ *   rcu_check_quiescent_state calls rcu_start_batch to start the next grace
+ *   period.
+ */
+/*
  * Register a new batch of callbacks, and start it up if there is currently no
  * active batch and the batch to be registered has not already occurred.
  * Caller must hold the rcu_ctrlblk lock.
@@ -116,6 +134,8 @@
 	active = nohz_cpu_mask;
 	cpus_complement(active);
 	cpus_and(rcu_ctrlblk.rcu_cpu_mask, cpu_online_map, active);
+	smp_wmb();
+	rcu_global_grace_num++;
 }
 
 /*
@@ -127,18 +147,28 @@
 {
 	int cpu = smp_processor_id();
 
-	if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	if (RCU_grace_num(cpu) != rcu_global_grace_num) {
+		/* new grace period: record qsctr value. */
+		BUG_ON(RCU_last_qsctr(cpu) != RCU_QSCTR_INVALID);
+		RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
+		RCU_grace_num(cpu) = rcu_global_grace_num;
+		return;
+	}
+
+	/* grace period already completed for this cpu?
+	 * last_qsctr is checked instead of the actual bitmap to avoid
+	 * cacheline thrashing
+	 */
+	if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
+		/* yes. Just leave. */
 		return;
+	}
 
 	/* 
 	 * Races with local timer interrupt - in the worst case
 	 * we may miss one quiescent state of that CPU. That is
 	 * tolerable. So no need to disable interrupts.
 	 */
-	if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
-		RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
-		return;
-	}
 	if (RCU_qsctr(cpu) == RCU_last_qsctr(cpu))
 		return;
 
--- 2.6/include/linux/rcupdate.h	2004-05-19 19:42:16.000000000 +0200
+++ build-2.6/include/linux/rcupdate.h	2004-05-20 12:46:02.000000000 +0200
@@ -94,16 +94,19 @@
         long            last_qsctr;	 /* value of qsctr at beginning */
                                          /* of rcu grace period */
         long  	       	batch;           /* Batch # for current RCU batch */
+        long		grace_num;       /* grace period number */
         struct list_head  nxtlist;
         struct list_head  curlist;
 };
 
 DECLARE_PER_CPU(struct rcu_data, rcu_data);
 extern struct rcu_ctrlblk rcu_ctrlblk;
+extern long rcu_global_grace_num;
 
 #define RCU_qsctr(cpu) 		(per_cpu(rcu_data, (cpu)).qsctr)
 #define RCU_last_qsctr(cpu) 	(per_cpu(rcu_data, (cpu)).last_qsctr)
 #define RCU_batch(cpu) 		(per_cpu(rcu_data, (cpu)).batch)
+#define RCU_grace_num(cpu)	(per_cpu(rcu_data, (cpu)).grace_num)
 #define RCU_nxtlist(cpu) 	(per_cpu(rcu_data, (cpu)).nxtlist)
 #define RCU_curlist(cpu) 	(per_cpu(rcu_data, (cpu)).curlist)
 
@@ -115,7 +118,8 @@
 	     rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
 	    (list_empty(&RCU_curlist(cpu)) &&
 			 !list_empty(&RCU_nxtlist(cpu))) ||
-	    cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	    RCU_grace_num(cpu) != rcu_global_grace_num ||
+	    RCU_last_qsctr(cpu) != RCU_QSCTR_INVALID)
 		return 1;
 	else
 		return 0;

* RCU scaling on large systems
@ 2004-05-01 12:08 Jack Steiner
  2004-05-01 21:17 ` William Lee Irwin III
  2004-05-02 18:28 ` Paul E. McKenney
  0 siblings, 2 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-01 12:08 UTC (permalink / raw)
  To: linux-kernel


Has anyone investigated scaling of the RCU algorithms on large
cpu-count systems?

We are running 2.6.5 on a 512p system. The RCU update code is causing 
a near-livelock when booting. Several times, I dropped into kdb &
found many of the cpus in the RCU code. The cpus are in
rcu_check_quiescent_state() spinning on the rcu_ctrlblk.mutex.


I also found an interesting anomaly that was traced to RCU. I have
a program that measures "cpu efficiency". Basically, the program 
creates a cpu-bound thread for each cpu & measures the percentage 
of time that each cpu ACTUALLY spends executing user code.
On an idle system, each cpu *should* spend >99% of its time in user mode.

On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
RCU code. The time is spent contending for shared cache lines.

Even more bizarre: if I repeatedly type "ls" in a *single* window 
(probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
of the time in the RCU code.

The RCU algorithms don't scale - at least on our systems!!!!!


Attached is an experimental hack that fixes the problem. I
don't believe that this is the correct fix but it does prove
that slowing down the frequency of updates fixes the problem.


With this hack, "ls" no longer measurably disturbs other cpus. Each
cpu spends ~99.8% of its time in user code regardless of the frequency
of typing "ls".



By default, the RCU code attempts to process callbacks on each cpu
every tick. The hack adds a mask so that only a few cpus process
callbacks each tick. 

I ran a simple experiment varying the mask. Col 1 shows the mask
value, col 2 shows system time when the system is idle, col 3 shows
system time when typing "ls" on a single cpu. The percentages are
"eyeballed averages" for all cpus.

	VAL	idle%	"ls" %
	0	6.00	55.00
	1	2.00	50.00
	3	 .20	20.00
	7	 .20	  .22
	15	 .20	  .24
	31	 .19	  .25
	63	 .19	  .26
	127	 .19	  .25
	511	 .19	  .19

Looks like any value of 7 or above should be ok on our hardware. I picked 
a larger value because I don't understand how heavy loads affect the 
optimal value. I suspect that the optimal value is architecture
dependent & probably platform dependent.


Is anyone already working on this issue?

Comments and additional insight appreciated..........




Index: linux/include/linux/rcupdate.h
===================================================================
--- linux.orig/include/linux/rcupdate.h	2004-04-30 09:59:00.000000000 -0500
+++ linux/include/linux/rcupdate.h	2004-04-30 10:01:37.000000000 -0500
@@ -41,6 +41,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
+#include <linux/jiffies.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -84,6 +85,14 @@
         return (a - b) > 0;
 }
 
+#define RCU_JIFFIES_MASK 15
+
+/* Is it time to process a batch on this cpu */
+static inline int rcu_time(int cpu)
+{
+	return (((jiffies - cpu) & RCU_JIFFIES_MASK) == 0);
+}
+
 /*
  * Per-CPU data for Read-Copy Update.
  * nxtlist - new callbacks are added here
@@ -111,11 +120,11 @@
 
 static inline int rcu_pending(int cpu) 
 {
-	if ((!list_empty(&RCU_curlist(cpu)) &&
+	if (rcu_time(cpu) && ((!list_empty(&RCU_curlist(cpu)) &&
 	     rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
 	    (list_empty(&RCU_curlist(cpu)) &&
 			 !list_empty(&RCU_nxtlist(cpu))) ||
-	    cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	    cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask)))
 		return 1;
 	else
 		return 0;
-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




end of thread, other threads:[~2004-05-20 11:44 UTC | newest]

Thread overview: 22+ messages
2004-05-20 11:36 RCU scaling on large systems Manfred Spraul
  -- strict thread matches above, loose matches on Subject: below --
2004-05-01 12:08 Jack Steiner
2004-05-01 21:17 ` William Lee Irwin III
2004-05-01 22:35   ` William Lee Irwin III
2004-05-02  1:38   ` Jack Steiner
2004-05-07 17:53   ` Andrea Arcangeli
2004-05-07 18:17     ` William Lee Irwin III
2004-05-07 19:59       ` Andrea Arcangeli
2004-05-07 20:49   ` Jack Steiner
2004-05-02 18:28 ` Paul E. McKenney
2004-05-03 16:39   ` Jesse Barnes
2004-05-03 20:04     ` Paul E. McKenney
2004-05-03 18:40   ` Jack Steiner
2004-05-07 20:50     ` Paul E. McKenney
2004-05-07 22:06       ` Jack Steiner
2004-05-07 23:32         ` Andrew Morton
2004-05-08  4:55           ` Jack Steiner
2004-05-17 21:18           ` Andrea Arcangeli
2004-05-17 21:42             ` Andrew Morton
2004-05-17 23:50               ` Andrea Arcangeli
2004-05-18 13:33               ` Jack Steiner
2004-05-18 23:13               ` Matt Mackall
