* RCU scaling on large systems
@ 2004-05-01 12:08 Jack Steiner
2004-05-01 21:17 ` William Lee Irwin III
2004-05-02 18:28 ` Paul E. McKenney
0 siblings, 2 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-01 12:08 UTC (permalink / raw)
To: linux-kernel
Has anyone investigated scaling of the RCU algorithms on large
cpu-count systems?
We are running 2.6.5 on a 512p system. The RCU update code is causing
a near-livelock when booting. Several times, I dropped into kdb &
found many of the cpus in the RCU code. The cpus are in
rcu_check_quiescent_state() spinning on the rcu_ctrlblk.mutex.
I also found an interesting anomaly that was traced to RCU. I have
a program that measures "cpu efficiency". Basically, the program
creates a cpu bound thread for each cpu & measures the percentage
of time that each cpu ACTUALLY spends executing user code.
On an idle system, each cpu *should* spend >99% of its time in user mode.
On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
RCU code. The time is spent contending for shared cache lines.
Even more bizarre: if I repeatedly type "ls" in a *single* window
(probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
of the time in the RCU code.
The RCU algorithms don't scale - at least on our systems!!!!!
Attached is an experimental hack that fixes the problem. I
don't believe that this is the correct fix but it does prove
that slowing down the frequency of updates fixes the problem.
With this hack, "ls" no longer measurably disturbs other cpus. Each
cpu spends ~99.8% of its time in user code regardless of the frequency
of typing "ls".
By default, the RCU code attempts to process callbacks on each cpu
every tick. The hack adds a mask so that only a few cpus process
callbacks each tick.
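The staggering idea can be modeled in plain C outside the kernel. This is a sketch only: `RCU_JIFFIES_MASK` and `rcu_time()` mirror the hack in the patch attached further down, but `jiffies` is passed in explicitly here so the model is testable in userspace.

```c
#include <assert.h>

/* Standalone model of the staggering hack: with a mask of 15, cpu c
 * runs its RCU callbacks only on ticks where (jiffies - c) & 15 == 0,
 * so on any given tick only 1/16th of the cpus touch the shared RCU
 * cachelines. The kernel version reads the global jiffies counter;
 * here it is a parameter. */
#define RCU_JIFFIES_MASK 15

static int rcu_time(unsigned long jiffies, int cpu)
{
	return ((jiffies - (unsigned long)cpu) & RCU_JIFFIES_MASK) == 0;
}
```

Over any window of 16 ticks each cpu qualifies exactly once, and on each tick exactly NR_CPUS/16 cpus qualify, which is where the roughly order-of-magnitude drop in contention comes from.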
I ran a simple experiment to vary the mask. Col 1 shows the mask
value, col 2 shows system time when the system is idle, col 3 shows
system time when typing "ls" on a single cpu. The percentages are
"eyeballed averages" for all cpus.
	VAL	idle%	"ls" %
	  0	 6.00	 55.00
	  1	 2.00	 50.00
	  3	  .20	 20.00
	  7	  .20	   .22
	 15	  .20	   .24
	 31	  .19	   .25
	 63	  .19	   .26
	127	  .19	   .25
	511	  .19	   .19
Looks like any value of 7 or larger should be ok on our hardware. I picked
a larger value because I don't understand how heavy loads affect the
optimal value. I suspect that the optimal value is architecture
dependent & probably platform dependent.
Is anyone already working on this issue?
Comments and additional insight appreciated..........
Index: linux/include/linux/rcupdate.h
===================================================================
--- linux.orig/include/linux/rcupdate.h 2004-04-30 09:59:00.000000000 -0500
+++ linux/include/linux/rcupdate.h 2004-04-30 10:01:37.000000000 -0500
@@ -41,6 +41,7 @@
#include <linux/threads.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
+#include <linux/jiffies.h>
/**
* struct rcu_head - callback structure for use with RCU
@@ -84,6 +85,14 @@
return (a - b) > 0;
}
+#define RCU_JIFFIES_MASK 15
+
+/* Is it time to process a batch on this cpu */
+static inline int rcu_time(int cpu)
+{
+ return (((jiffies - cpu) & RCU_JIFFIES_MASK) == 0);
+}
+
/*
* Per-CPU data for Read-Copy Update.
* nxtlist - new callbacks are added here
@@ -111,11 +120,11 @@
static inline int rcu_pending(int cpu)
{
- if ((!list_empty(&RCU_curlist(cpu)) &&
+ if (rcu_time(cpu) && ((!list_empty(&RCU_curlist(cpu)) &&
rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
(list_empty(&RCU_curlist(cpu)) &&
!list_empty(&RCU_nxtlist(cpu))) ||
- cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+ cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask)))
return 1;
else
return 0;
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RCU scaling on large systems
  2004-05-01 12:08 RCU scaling on large systems Jack Steiner
@ 2004-05-01 21:17 ` William Lee Irwin III
  2004-05-01 22:35   ` William Lee Irwin III
  ` (3 more replies)
  2004-05-02 18:28 ` Paul E. McKenney
  1 sibling, 4 replies; 22+ messages in thread
From: William Lee Irwin III @ 2004-05-01 21:17 UTC (permalink / raw)
To: Jack Steiner; +Cc: linux-kernel

On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> RCU code. The time is spent contending for shared cache lines.

Would something like this help cacheline contention? This uses the
per_cpu data areas to hold per-cpu booleans for needing switches.
Untested/uncompiled.

The global lock is unfortunately still there.

-- wli

Index: wli-2.6.6-rc3-mm1/include/linux/rcupdate.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/rcupdate.h	2004-04-03 19:36:52.000000000 -0800
+++ wli-2.6.6-rc3-mm1/include/linux/rcupdate.h	2004-05-01 14:15:09.000000000 -0700
@@ -41,6 +41,9 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
+#include <asm/atomic.h>
+
+#define RCU_CPU_SCATTER (NR_CPUS > 128)
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -68,8 +71,10 @@
 	spinlock_t	mutex;		/* Guard this struct */
 	long		curbatch;	/* Current batch number. */
 	long		maxbatch;	/* Max requested batch number. */
+#if !RCU_CPU_SCATTER
 	cpumask_t	rcu_cpu_mask;	/* CPUs that need to switch in order */
 					/* for current batch to proceed. */
+#endif
 };
 
 /* Is batch a before batch b ? */
@@ -96,6 +101,9 @@
 	long		batch;		/* Batch # for current RCU batch */
 	struct list_head nxtlist;
 	struct list_head curlist;
+#if RCU_CPU_SCATTER
+	atomic_t	need_switch;
+#endif
 };
 
 DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -109,13 +117,39 @@
 
 #define RCU_QSCTR_INVALID	0
 
+#if RCU_CPU_SCATTER
+#define rcu_need_switch(cpu)	(!!atomic_read(&per_cpu(rcu_data, cpu).need_switch))
+#define rcu_clear_need_switch(cpu) atomic_set(&per_cpu(rcu_data, cpu).need_switch, 0)
+static inline int rcu_any_cpu_need_switch(void)
+{
+	int cpu;
+	for_each_online_cpu(cpu) {
+		if (rcu_need_switch(cpu))
+			return 1;
+	}
+	return 0;
+}
+
+static inline void rcu_set_need_switch_cpumask(cpumask_t cpumask)
+{
+	int cpu;
+	for_each_cpu_mask(cpu, cpumask)
+		atomic_set(&per_cpu(rcu_data, cpu).need_switch, 1);
+}
+#else
+#define rcu_need_switch(cpu)	cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask)
+#define rcu_clear_need_switch(cpu) cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask)
+#define rcu_any_cpu_need_switch() (!cpus_empty(rcu_ctrlblk.rcu_cpu_mask))
+#define rcu_set_need_switch_cpumask(x) cpus_copy(rcu_ctrlblk.rcu_cpu_mask, x)
+#endif
+
 static inline int rcu_pending(int cpu)
 {
 	if ((!list_empty(&RCU_curlist(cpu)) &&
 	     rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
 	    (list_empty(&RCU_curlist(cpu)) &&
 	     !list_empty(&RCU_nxtlist(cpu))) ||
-	    cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	    rcu_need_switch(cpu))
 		return 1;
 	else
 		return 0;
Index: wli-2.6.6-rc3-mm1/kernel/rcupdate.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/rcupdate.c	2004-04-30 15:05:53.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/rcupdate.c	2004-05-01 13:47:05.000000000 -0700
@@ -46,10 +46,19 @@
 #include <linux/cpu.h>
 
 /* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk =
-	{ .mutex = SPIN_LOCK_UNLOCKED, .curbatch = 1,
-	  .maxbatch = 1, .rcu_cpu_mask = CPU_MASK_NONE };
+struct rcu_ctrlblk rcu_ctrlblk = {
+	.mutex = SPIN_LOCK_UNLOCKED,
+	.curbatch = 1,
+	.maxbatch = 1,
+#if !RCU_CPU_SCATTER
+	.rcu_cpu_mask = CPU_MASK_NONE
+#endif
+};
+#if RCU_CPU_SCATTER
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = { .need_switch = ATOMIC_INIT(0), };
+#else
 DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+#endif
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
@@ -109,13 +118,14 @@
 		rcu_ctrlblk.maxbatch = newbatch;
 	}
 	if (rcu_batch_before(rcu_ctrlblk.maxbatch, rcu_ctrlblk.curbatch) ||
-	    !cpus_empty(rcu_ctrlblk.rcu_cpu_mask)) {
+	    !rcu_any_cpu_need_switch()) {
 		return;
 	}
 	/* Can't change, since spin lock held. */
 	active = idle_cpu_mask;
 	cpus_complement(active);
-	cpus_and(rcu_ctrlblk.rcu_cpu_mask, cpu_online_map, active);
+	cpus_and(active, cpu_online_map, active);
+	rcu_set_need_switch_cpumask(active);
 }
 
 /*
@@ -127,7 +137,7 @@
 {
 	int cpu = smp_processor_id();
 
-	if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_need_switch(cpu))
 		return;
 
 	/*
@@ -143,12 +153,12 @@
 		return;
 
 	spin_lock(&rcu_ctrlblk.mutex);
-	if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_need_switch(cpu))
 		goto out_unlock;
 
-	cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask);
+	rcu_clear_need_switch(cpu);
 	RCU_last_qsctr(cpu) = RCU_QSCTR_INVALID;
-	if (!cpus_empty(rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_any_cpu_need_switch())
 		goto out_unlock;
 
 	rcu_ctrlblk.curbatch++;
@@ -186,11 +196,11 @@
 	 * it here
 	 */
 	spin_lock_irq(&rcu_ctrlblk.mutex);
-	if (cpus_empty(rcu_ctrlblk.rcu_cpu_mask))
+	if (!rcu_any_cpu_need_switch())
 		goto unlock;
 
-	cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask);
-	if (cpus_empty(rcu_ctrlblk.rcu_cpu_mask)) {
+	rcu_clear_need_switch(cpu);
+	if (!rcu_any_cpu_need_switch()) {
 		rcu_ctrlblk.curbatch++;
 		/* We may avoid calling start batch if
 		 * we are starting the batch only

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
@ 2004-05-01 22:35   ` William Lee Irwin III
  2004-05-02  1:38   ` Jack Steiner
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: William Lee Irwin III @ 2004-05-01 22:35 UTC (permalink / raw)
To: Jack Steiner, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> +static inline void rcu_set_need_switch_cpumask(cpumask_t cpumask)
> +{
> +	int cpu;
> +	for_each_cpu_mask(cpu, cpumask)
> +		atomic_set(&per_cpu(rcu_data, cpu).need_switch, 1);
> +}

needs to be:

	for_each_online_cpu(cpu)
		atomic_set(&per_cpu(rcu_data, cpu).need_switch,
			   !!cpu_isset(cpu, cpumask));

-- wli

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
  2004-05-01 22:35   ` William Lee Irwin III
@ 2004-05-02  1:38   ` Jack Steiner
  2004-05-07 17:53   ` Andrea Arcangeli
  2004-05-07 20:49   ` Jack Steiner
  3 siblings, 0 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-02 1:38 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.

It will be Monday before I can get time on the large system & do
additional experiments.

I have 2 logs & both show the contention is on the line
"spin_lock(&rcu_ctrlblk.mutex);". These traces were from a heavy
workload with processes on most cpus.

I want to repeat the "ls" experiment & see if it has the same hot spot.
The "ls" experiment has only 1 active cpu. The rest of the cpus are idle.

I'll post additional info on Monday (if possible). I'll also try your
patch.

(Thanks for quick reply)

[wli's patch quoted in full; snipped]

--
Thanks

Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
  2004-05-01 22:35   ` William Lee Irwin III
  2004-05-02  1:38   ` Jack Steiner
@ 2004-05-07 17:53   ` Andrea Arcangeli
  2004-05-07 18:17     ` William Lee Irwin III
  2004-05-07 20:49   ` Jack Steiner
  3 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2004-05-07 17:53 UTC (permalink / raw)
To: William Lee Irwin III, Jack Steiner, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.

I'm afraid this cannot help, the rcu_cpu_mask and the mutex are in the
same cacheline, so it's not just about the global lock being still
there, it's about the cpumask being in the same cacheline with the
global lock.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-07 17:53 ` Andrea Arcangeli
@ 2004-05-07 18:17   ` William Lee Irwin III
  2004-05-07 19:59     ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-05-07 18:17 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Jack Steiner, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
>> Would something like this help cacheline contention? This uses the
>> per_cpu data areas to hold per-cpu booleans for needing switches.
>> Untested/uncompiled.
>> The global lock is unfortunately still there.

On Fri, May 07, 2004 at 07:53:58PM +0200, Andrea Arcangeli wrote:
> I'm afraid this cannot help, the rcu_cpu_mask and the mutex are in the same
> cacheline, so it's not just about the global lock being still there,
> it's about the cpumask being in the same cacheline with the global lock.

Hmm. I can't quite make out what you're trying to say. If it were about
the cpumask sharing the cacheline with the global lock, then the patch
would help, but you say it should not.

I don't care much about the conclusion, since I wrote the patch to
express the notion that the concentration of accesses to the cpumask's
shared cacheline(s) could be dispersed by using integers in per_cpu
data to represent the individual bits of the cpumask if that were the
problem, and by trying something similar to the posted patch, it could
be determined if that were so, but later heard back that it'd been
determined by other means that it was the lock itself...

-- wli

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-07 18:17 ` William Lee Irwin III
@ 2004-05-07 19:59   ` Andrea Arcangeli
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2004-05-07 19:59 UTC (permalink / raw)
To: William Lee Irwin III, Jack Steiner, linux-kernel

On Fri, May 07, 2004 at 11:17:06AM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> >> Would something like this help cacheline contention? This uses the
> >> per_cpu data areas to hold per-cpu booleans for needing switches.
> >> Untested/uncompiled.
> >> The global lock is unfortunately still there.
>
> On Fri, May 07, 2004 at 07:53:58PM +0200, Andrea Arcangeli wrote:
> > I'm afraid this cannot help, the rcu_cpu_mask and the mutex are in the same
> > cacheline, so it's not just about the global lock being still there,
> > it's about the cpumask being in the same cacheline with the global lock.
>
> Hmm. I can't quite make out what you're trying to say. If it were about
> the cpumask sharing the cacheline with the global lock, then the patch
> would help, but you say it should not. I don't care much about the

Since cpumask and global lock are on the same cacheline, and the global
lock still forces that cacheline to bounce, it shouldn't make any
difference to remove the cpumask from that global cacheline; by the
time you take the lock you have the cacheline local and in the next
nanosecond you can use the cpumask zerocost. I understood the real cost
is the bouncing of _such_ cacheline across all 512 cpus and the global
lock will still generate it as far as I can tell. Though I'm mostly
guessing, I certainly wouldn't be against trying your patch, I was just
afraid it cannot help.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-01 21:17 ` William Lee Irwin III
  ` (2 preceding siblings ...)
  2004-05-07 17:53 ` Andrea Arcangeli
@ 2004-05-07 20:49 ` Jack Steiner
  3 siblings, 0 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-07 20:49 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel

On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.
>
> -- wli
> ...
> +#if RCU_CPU_SCATTER
> +#define rcu_need_switch(cpu)	(!!atomic_read(&per_cpu(rcu_data, cpu).need_switch))
> +#define rcu_clear_need_switch(cpu) atomic_set(&per_cpu(rcu_data, cpu).need_switch, 0)
> +static inline int rcu_any_cpu_need_switch(void)
> +{
> +	int cpu;
> +	for_each_online_cpu(cpu) {
> +		if (rcu_need_switch(cpu))
> +			return 1;
> +	}
> +	return 0;
> +}
..

This fixes only part of the problem.

Referencing percpu data for every cpu is not particularly efficient -
at least on our platform.

Percpu data is allocated so that it is on the local node of each cpu.
We use 64MB granules. The percpu data structures on the individual
nodes are separated by addresses that differ by many GB. A scan of all
percpu data structures requires a TLB entry for each node. This is
costly & trashes the TLB. (Our max system size is currently 256 nodes.)

For example, a 4 node, 2 cpus/node system shows:

	mdb> m d __per_cpu_offset
	<0xa000000100b357c8> 0xe000003001010000
	<0xa000000100b357d0> 0xe000003001020000
	<0xa000000100b357d8> 0xe00000b001020000
	<0xa000000100b357e0> 0xe00000b001030000
	<0xa000000100b357e8> 0xe000013001030000
	<0xa000000100b357f0> 0xe000013001040000
	<0xa000000100b357f8> 0xe00001b001040000
	<0xa000000100b35800> 0xe00001b001050000

The node number is encoded in bits [48:39] of the virtual/physical
address. Unfortunately, our hardware does not provide a way to allocate
node local memory for every node and have all the memory covered by a
single TLB entry.

Moving "need_switch" to a single array with cacheline aligned entries
would work. I can give that a try.....

--
Thanks

Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
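Jack's cacheline-aligned-array suggestion can be sketched as follows. This is an illustration only: the 128-byte line size, the `NR_CPUS` value, and all the names are assumptions made up for the sketch, not taken from the thread.

```c
#include <assert.h>

#define NR_CPUS		512
#define SMP_CACHE_BYTES	128	/* assumed cacheline size; platform dependent */

/* One flag per cpu, each padded out to its own cacheline so writers on
 * different cpus never share a line. Unlike per_cpu data, the whole
 * array sits in one contiguous allocation, so a scan of every cpu's
 * flag needs only one TLB entry instead of one per node. */
struct need_switch_slot {
	int need_switch;
	char pad[SMP_CACHE_BYTES - sizeof(int)];
};

static struct need_switch_slot rcu_need_switch_arr[NR_CPUS]
	__attribute__((aligned(SMP_CACHE_BYTES)));
```

The trade-off versus a cpumask is memory (one line per cpu instead of one bit) in exchange for contention-free per-cpu writes.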
* Re: RCU scaling on large systems
  2004-05-01 12:08 RCU scaling on large systems Jack Steiner
  2004-05-01 21:17 ` William Lee Irwin III
@ 2004-05-02 18:28 ` Paul E. McKenney
  2004-05-03 16:39   ` Jesse Barnes
  2004-05-03 18:40   ` Jack Steiner
  1 sibling, 2 replies; 22+ messages in thread
From: Paul E. McKenney @ 2004-05-02 18:28 UTC (permalink / raw)
To: Jack Steiner; +Cc: linux-kernel

Hello, Jack!

On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
>
> Has anyone investigated scaling of the RCU algorithms on large
> cpu-count systems?
>
> We are running 2.6.5 on a 512p system. The RCU update code is causing
> a near-livelock when booting. Several times, I dropped into kdb &
> found many of the cpus in the RCU code. The cpus are in
> rcu_check_quiescent_state() spinning on the rcu_ctrlblk.mutex.

The current RCU grace-period-detection infrastructure was designed for
a few tens of CPUs, so it is absolutely no surprise that it does not
serve you well on a 512-CPU system.

The current RCU infrastructure algorithm can be thought of as a lazy
barrier controlled by a bitmask. For 512 CPUs, I believe that more of
a combining-tree algorithm will be needed. If you have a NUMA-like
architecture, then the structure of the combining tree will of course
want to reflect the underlying hardware architecture. Then any time
period during which each of the individual NUMA nodes passed through a
local grace period is a global grace period. This approach allows you
to have per-NUMA-node RCU locks and a global RCU lock, reducing the
number of acquisitions of the global lock by the ratio of the number
of CPUs to the number of nodes.

From your numbers below, I would guess that if you have at least
8 CPUs per NUMA node, a two-level tree would suffice. If you have
only 4 CPUs per NUMA node, you might well need a per-node level,
a per-4-nodes level, and a global level to get the global lock
contention reduced sufficiently.

> I also found an interesting anomaly that was traced to RCU. I have
> a program that measures "cpu efficiency". Basically, the program
> creates a cpu bound thread for each cpu & measures the percentage
> of time that each cpu ACTUALLY spends executing user code.
> On an idle system, each cpu *should* spend >99% in user mode.
>
> On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> RCU code. The time is spent contending for shared cache lines.

Again, no surprise, Linux's RCU was not designed for a 512-CPU
system. ;-) The hierarchical grace-period-detection scheme described
above also increases cache locality, greatly reducing the
cache-thrashing you are seeing.

> Even more bizarre: if I repeatedly type "ls" in a *single* window
> (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
> of the time in the RCU code.

Hmmm... How large was the directory you were running "ls" on?
At first blush, it sounds like the "ls" was somehow provoking a
dcache update, which would then exercise RCU.

> The RCU algorithms don't scale - at least on our systems!!!!!

As noted earlier, the current implementation is not designed for
512 CPUs. And, as noted earlier, there are ways to make it scale.
But for some reason, we felt it advisable to start with a smaller,
simpler, and hence less scalable implementation. ;-)

> Attached is an experimental hack that fixes the problem. I
> don't believe that this is the correct fix but it does prove
> that slowing down the frequency of updates fixes the problem.
>
> With this hack, "ls" no longer measurably disturbs other cpus. Each
> cpu spends ~99.8% of its time in user code regardless of the frequency
> of typing "ls".
>
> By default, the RCU code attempts to process callbacks on each cpu
> every tick. The hack adds a mask so that only a few cpus process
> callbacks each tick.

Cute! However, it is not clear to me that this approach is compatible
with real-time use of RCU, since it results in CPUs processing their
callbacks less frequently, and thus getting more of them to process
at a time. But it is not clear to me that anyone is looking for
realtime response from a 512-CPU machine (yow!!!), so perhaps this is
not a problem...

> I ran a simple experiment to vary the mask. Col 1 shows the mask
> value, col 2 shows system time when the system is idle, col 3 shows
> system time when typing "ls" on a single cpu. The percentages are
> "eyeballed averages" for all cpus.
>
>	VAL	idle%	"ls" %
>	  0	 6.00	 55.00
>	  1	 2.00	 50.00
>	  3	  .20	 20.00
>	  7	  .20	   .22
>	 15	  .20	   .24
>	 31	  .19	   .25
>	 63	  .19	   .26
>	127	  .19	   .25
>	511	  .19	   .19
>
> Looks like any value >7 or 15 should be ok on our hardware. I picked
> a larger value because I don't understand how heavy loads affect the
> optimal value. I suspect that the optimal value is architecture
> dependent & probably platform dependent.
>
> Is anyone already working on this issue?
>
> Comments and additional insight appreciated..........

This patch certainly seems simple enough, and I would guess that
"jiffies" is referenced often enough that it is warm in the cache
despite being frequently updated.

Other thoughts?

						Thanx, Paul

[Jack's patch quoted in full; snipped]

^ permalink raw reply	[flat|nested] 22+ messages in thread
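The combining-tree idea Paul outlines above can be modeled with two levels of counters. This is a toy, single-threaded sketch for illustration only: the node/cpu counts and all names are made up, and a real implementation would need a per-node lock plus the global lock rather than bare decrements.

```c
#include <assert.h>

#define NODES		64
#define CPUS_PER_NODE	8	/* 64 * 8 = 512 cpus, sizes assumed */

static int node_pending[NODES];	/* cpus on this node still to report */
static int global_pending;	/* nodes still to report */

static void grace_period_start(void)
{
	global_pending = NODES;
	for (int n = 0; n < NODES; n++)
		node_pending[n] = CPUS_PER_NODE;
}

/* A cpu reports its quiescent state. Only the last cpu of each node
 * touches the global counter, so the global cacheline is written
 * NODES times per grace period instead of NR_CPUS times. Returns 1
 * when this report completes the global grace period. */
static int report_quiescent(int cpu)
{
	int node = cpu / CPUS_PER_NODE;

	if (--node_pending[node] != 0)
		return 0;		/* node not done yet */
	return --global_pending == 0;	/* last node ends the grace period */
}
```

This is exactly the "ratio of CPUs to nodes" reduction in global-lock acquisitions that Paul describes: each cpu contends only on its node's counter, and one cpu per node per grace period touches the global state.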
* Re: RCU scaling on large systems
  2004-05-02 18:28 ` Paul E. McKenney
@ 2004-05-03 16:39   ` Jesse Barnes
  2004-05-03 20:04     ` Paul E. McKenney
  0 siblings, 1 reply; 22+ messages in thread
From: Jesse Barnes @ 2004-05-03 16:39 UTC (permalink / raw)
To: linux-kernel, paulmck; +Cc: Jack Steiner

On Sunday, May 2, 2004 11:28 am, Paul E. McKenney wrote:
> From your numbers below, I would guess that if you have at least
> 8 CPUs per NUMA node, a two-level tree would suffice. If you have
> only 4 CPUs per NUMA node, you might well need a per-node level,
> a per-4-nodes level, and a global level to get the global lock
> contention reduced sufficiently.

Actually, only 2, but it sounds like your approach would work.

> Cute! However, it is not clear to me that this approach is
> compatible with real-time use of RCU, since it results in CPUs
> processing their callbacks less frequently, and thus getting
> more of them to process at a time.

I think it was just a proof-of-concept--the current RCU design
obviously wasn't designed with this machine in mind :).

> But it is not clear to me that anyone is looking for realtime
> response from a 512-CPU machine (yow!!!), so perhaps this
> is not a problem...

There are folks that would like realtime (or close to realtime)
response on such systems, so it would be best not to do anything that
would explicitly prevent it.

> This patch certainly seems simple enough, and I would guess that
> "jiffies" is referenced often enough that it is warm in the cache
> despite being frequently updated.
>
> Other thoughts?

On a big system like this though, won't reading jiffies frequently be
another source of contention?

Jesse

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-03 16:39 ` Jesse Barnes @ 2004-05-03 20:04 ` Paul E. McKenney 0 siblings, 0 replies; 22+ messages in thread From: Paul E. McKenney @ 2004-05-03 20:04 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-kernel, Jack Steiner On Mon, May 03, 2004 at 09:39:11AM -0700, Jesse Barnes wrote: > On Sunday, May 2, 2004 11:28 am, Paul E. McKenney wrote: > > From your numbers below, I would guess that if you have at least > > 8 CPUs per NUMA node, a two-level tree would suffice. If you have > > only 4 CPUs per NUMA node, you might well need a per-node level, > > a per-4-nodes level, and a global level to get the global lock > > contention reduced sufficiently. > > Actually, only 2, but it sounds like your approach would work. OK, make that a per-node level, a per-8-nodes level, and a global level. ;-) The per-node level might or might not be helpful, depending on your memory latencies. > > Cute! However, it is not clear to me that this approach is > > compatible with real-time use of RCU, since it results in CPUs > > processing their callbacks less frequently, and thus getting > > more of them to process at a time. > > I think it was just a proof-of-concept--the current RCU design obviously > wasn't designed with this machine in mind :). Agreed! ;-) > > But it is not clear to me that anyone is looking for realtime > > response from a 512-CPU machine (yow!!!), so perhaps this > > is not a problem... > > There are folks that would like realtime (or close to realtime) response on > such systems, so it would be best not to do anything that would explicitly > prevent it. The potential problem with less-frequent processing of callbacks is that it would result in larger "bursts" of callbacks to be processed, degrading realtime scheduling latency. There are some patches that help avoid this problem, but they probably need more testing and tuning. 
> > This patch certainly seems simple enough, and I would guess that > > "jiffies" is referenced often enough that it is warm in the cache > > despite being frequently updated. > > > > Other thoughts? > > On a big system like this though, won't reading jiffies frequently be another > source of contention? My thought was that each CPU was already reading jiffies several times each tick anyway, so that it would already be cached when RCU wanted to look at it. But I must defer to your experience with this particular machine. Thanx, Paul ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-02 18:28 ` Paul E. McKenney 2004-05-03 16:39 ` Jesse Barnes @ 2004-05-03 18:40 ` Jack Steiner 2004-05-07 20:50 ` Paul E. McKenney 1 sibling, 1 reply; 22+ messages in thread From: Jack Steiner @ 2004-05-03 18:40 UTC (permalink / raw) To: Paul E. McKenney; +Cc: linux-kernel Paul - Thanks for the reply. Additional data from experiments today. As expected, there are multiple hot spots related to the rcu_ctrlblk. - scheduler_tick() in the rcu_pending macro. Specifically, on the load of the rcu_cpu_mask. - rcu_check_quiescent_state() on spin_lock(&rcu_ctrlblk.mutex); These two spots are ~equally hot. Some of the cache line contention could be alleviated by separating these fields into multiple cache lines. Wli posted a patch over the weekend that does that. I have not had a chance to review the patch in detail but it looks like a reasonable idea. ----------------- Response to your mail: > >From your numbers below, I would guess that if you have at least > 8 CPUs per NUMA node, a two-level tree would suffice. If you have > only 4 CPUs per NUMA node, you might well need a per-node level, > a per-4-nodes level, and a global level to get the global lock > contention reduced sufficiently. The system consists of 256 nodes. Each node has 2 cpus located on a shared FSB. The nodes are packaged as 128 modules - 2 nodes per module. The 2 nodes in a module are slightly "closer" latency-wise than nodes in different modules. > > > I also found an interesting anomaly that was traced to RCU. I have > > a program that measures "cpu efficiency". Basically, the program > > creates a cpu bound thread for each cpu & measures the percentage > > of time that each cpu ACTUALLY spends executing user code. > > On an idle each system, each cpu *should* spend >99% in user mode. > > > > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel > > RCU code. The time is spent contending for shared cache lines. 
> > Again, no surprise, Linux's RCU was not designed for a 512-CPU > system. ;-) > > The hierarchical grace-period-detection scheme described above > also increases cache locality, greatly reducing the cache-thrashing > you are seeing. > > > Even more bizarre: if I repeatedly type "ls" in a *single* window > > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50% > > of the time in the RCU code. > > Hmmm... How large was the directory you were running "ls" on? > At first blush, it sounds like the "ls" was somehow provoking > a dcache update, which would then exercise RCU. The directory size does not seem to be too significant. I tried one test on a NFS directory with 250 files. Another test on /tmp with 25 files. In both cases, the results were similar. > > > The RCU algorithms don't scale - at least on our systems!!!!! > > As noted earlier, the current implementation is not designed for > 512 CPUs. And, as noted earlier, there are ways to make it > scale. But for some reason, we felt it advisable to start with > a smaller, simpler, and hence less scalable implementation. ;-) Makes sense. I would not have expected otherwise. Overall, linux scales to 512p much better than I would have predicted. Is anyone working on improving RCU scaling to higher cpu counts? I don't want to duplicate any work that is already in progress. Otherwise, I'll start investigating what can be done to improve scaling. > > > Attached is an experimental hack that fixes the problem. I > > don't believe that this is the correct fix but it does prove > > that slowing down the frequency of updates fixes the problem. > > > > > > With this hack, "ls" no longer measurable disturbs other cpus. Each > > cpu spends ~99.8% of its time in user code regardless of the frequency > > of typing "ls". > > > > > > > > By default, the RCU code attempts to process callbacks on each cpu > > every tick. The hack adds a mask so that only a few cpus process > > callbacks each tick. > > Cute! 
However, it is not clear to me that this approach is > compatible with real-time use of RCU, since it results in CPUs > processing their callbacks less frequently, and thus getting > more of them to process at a time. > > But it is not clear to me that anyone is looking for realtime > response from a 512-CPU machine (yow!!!), so perhaps this > is not a problem... Agree on both counts. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-03 18:40 ` Jack Steiner @ 2004-05-07 20:50 ` Paul E. McKenney 2004-05-07 22:06 ` Jack Steiner 0 siblings, 1 reply; 22+ messages in thread From: Paul E. McKenney @ 2004-05-07 20:50 UTC (permalink / raw) To: Jack Steiner; +Cc: linux-kernel On Mon, May 03, 2004 at 01:40:06PM -0500, Jack Steiner wrote: > Paul - > > Thanks for the reply. > > > Additional data from experiments today. > > As expected, there are multiple hot spots related to the rcu_ctrlblk. > > - scheduler_tick() in the rcu_pending macro. Specifically, on the > load of the rcu_cpu_mask. Going to a hierarchical approach would break this up, especially if the hierarchy reflected the NUMA structure of the machine. > - rcu_check_quiescent_state() on spin_lock(&rcu_ctrlblk.mutex); Again, in a hierarchical approach, this would be the lock at the lowest level of the hierarchy. > These two spots are are ~equally hot. > > Some of the cache line contention could be alleviated by separating > these fields into multiple cache lines. Wli posted a patch over the > weekend that does that. I have not had a chance to review the patch in > detail but it looks a reasonable idea. Could help or hurt -- multiple cachelines can be accessed in parallel, which should speed things up, but the need to load multiple cachelines could slow other things down. > ----------------- > Response to your mail: > > > > > >From your numbers below, I would guess that if you have at least > > 8 CPUs per NUMA node, a two-level tree would suffice. If you have > > only 4 CPUs per NUMA node, you might well need a per-node level, > > a per-4-nodes level, and a global level to get the global lock > > contention reduced sufficiently. > > The system consists of 256 nodes. Each node has 2 cpus located on > a shared FSB. The nodes are packaged as 128 modules - 2 nodes per > module. The 2 nodes in a module are slightly "closer" latency-wise > than nodes in different modules. 
If the latency difference is sufficiently slight, a two-level scheme that split the system up into 16 blocks of 8 nodes (16 CPUs) each would seem most likely to help. > > > I also found an interesting anomaly that was traced to RCU. I have > > > a program that measures "cpu efficiency". Basically, the program > > > creates a cpu bound thread for each cpu & measures the percentage > > > of time that each cpu ACTUALLY spends executing user code. > > > On an idle each system, each cpu *should* spend >99% in user mode. > > > > > > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel > > > RCU code. The time is spent contending for shared cache lines. > > > > Again, no surprise, Linux's RCU was not designed for a 512-CPU > > system. ;-) > > > > The hierarchical grace-period-detection scheme described above > > also increases cache locality, greatly reducing the cache-thrashing > > you are seeing. > > > > > Even more bizarre: if I repeatedly type "ls" in a *single* window > > > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50% > > > of the time in the RCU code. > > > > Hmmm... How large was the directory you were running "ls" on? > > At first blush, it sounds like the "ls" was somehow provoking > > a dcache update, which would then exercise RCU. > > The directory size does not seem to be too significant. I tried one test > on a NFS directory with 250 files. Another test on /tmp with 25 files. > In both cases, the results were similar. Hmmm... I took a quick glance, but don't see why an "ls" would cause RCU to be invoked. This should only happen if a dentry is ejected from dcache. Any enlightenment appreciated! > > > The RCU algorithms don't scale - at least on our systems!!!!! > > > > As noted earlier, the current implementation is not designed for > > 512 CPUs. And, as noted earlier, there are ways to make it > > scale. 
But for some reason, we felt it advisable to start with > > a smaller, simpler, and hence less scalable implementation. ;-) > > Makes sense. I would not have expected otherwise. Overall, linux scales > to 512p much better than I would have predicted. Indeed! I am impressed! > Is anyone working on improving RCU scaling to higher cpu counts. I > dont want to duplicate any work that is already in progress. > Otherwise, I'll start investigating what can be done to improve > scaling. We are focusing first on some of the update-side code, which on our systems is a much tighter bottleneck than is the RCU infrastructure. I expect that we will need to increase RCU scaling eventually, but if you have a short-term need, you should go ahead. I will of course be more than happy to review and comment. One challenge will be coming up with something that works efficiently both for small and large machines. Maybe some arch-specific code is required, but it would be good to avoid this, if possible. > > > Attached is an experimental hack that fixes the problem. I > > > don't believe that this is the correct fix but it does prove > > > that slowing down the frequency of updates fixes the problem. > > > > > > > > > With this hack, "ls" no longer measurable disturbs other cpus. Each > > > cpu spends ~99.8% of its time in user code regardless of the frequency > > > of typing "ls". > > > > > > > > > > > > By default, the RCU code attempts to process callbacks on each cpu > > > every tick. The hack adds a mask so that only a few cpus process > > > callbacks each tick. > > > > Cute! However, it is not clear to me that this approach is > > compatible with real-time use of RCU, since it results in CPUs > > processing their callbacks less frequently, and thus getting > > more of them to process at a time. > > > > But it is not clear to me that anyone is looking for realtime > > response from a 512-CPU machine (yow!!!), so perhaps this > > is not a problem... > > Agree on both counts. 
I will let you and Jesse hash out the need for realtime response on a 512-way machine. ;-) Thanx, Paul > -- > Thanks > > Jack Steiner (steiner@sgi.com) 651-683-5302 > Principal Engineer SGI - Silicon Graphics, Inc. > ^ permalink raw reply [flat|nested] 22+ messages in thread
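To make the two-level scheme Paul sketches above concrete, here is a stand-alone model (hypothetical names and layout, not actual kernel code, and with the per-level locking omitted) of quiescent-state accounting for 512 CPUs split into 32 blocks of 16. A CPU reporting a quiescent state normally writes only its block's word; the shared top-level word, the analogue of the globally contended rcu_cpu_mask, is written once per block per grace period, by the last CPU of that block.

```c
#include <assert.h>

#define NCPUS 512
#define BLKSZ 16			/* CPUs per block (8 nodes x 2 CPUs) */
#define NBLKS (NCPUS / BLKSZ)		/* 32 blocks */

struct gp_state {
	unsigned int block_mask[NBLKS];	/* one bit per CPU yet to report */
	unsigned int top_mask;		/* one bit per block yet to report */
	int top_writes;			/* writes to the shared top word */
};

static void gp_start(struct gp_state *gp)
{
	int b;

	for (b = 0; b < NBLKS; b++)
		gp->block_mask[b] = (1u << BLKSZ) - 1;
	gp->top_mask = ~0u;		/* 32 bits for 32 blocks */
	gp->top_writes = 0;
}

/* CPU `cpu` passed through a quiescent state; returns 1 when the
 * grace period has completed. */
static int gp_report(struct gp_state *gp, int cpu)
{
	int b = cpu / BLKSZ;

	gp->block_mask[b] &= ~(1u << (cpu % BLKSZ));
	if (gp->block_mask[b] == 0) {	/* last CPU in this block */
		gp->top_mask &= ~(1u << b);
		gp->top_writes++;
	}
	return gp->top_mask == 0;
}
```

Per grace period, the shared word is written NBLKS (32) times rather than NCPUS (512) times, and the per-block words can live in node-local memory.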
* Re: RCU scaling on large systems 2004-05-07 20:50 ` Paul E. McKenney @ 2004-05-07 22:06 ` Jack Steiner 2004-05-07 23:32 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Jack Steiner @ 2004-05-07 22:06 UTC (permalink / raw) To: Paul E. McKenney; +Cc: linux-kernel On Fri, May 07, 2004 at 01:50:48PM -0700, Paul E. McKenney wrote: > On Mon, May 03, 2004 at 01:40:06PM -0500, Jack Steiner wrote: > > Paul - > > > > Hmmm... I took a quick glance, but don't see why an "ls" would > cause RCU to be invoked. This should only happen if a dentry is > ejected from dcache. > > Any enlightenment appreciated! > The calls to RCU are coming from here: [11]kdb> bt Stack traceback for pid 3553 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls 0xa0000001000feee0 call_rcu 0xa0000001001a3b20 d_free+0x80 0xa0000001001a3ec0 dput+0x340 0xa00000010016bcd0 __fput+0x210 0xa00000010016baa0 fput+0x40 0xa000000100168760 filp_close+0xc0 0xa000000100168960 sys_close+0x180 0xa000000100011be0 ia64_ret_from_syscall I see this same backtrace from numerous processes. -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-07 22:06 ` Jack Steiner @ 2004-05-07 23:32 ` Andrew Morton 2004-05-08 4:55 ` Jack Steiner 2004-05-17 21:18 ` Andrea Arcangeli 0 siblings, 2 replies; 22+ messages in thread From: Andrew Morton @ 2004-05-07 23:32 UTC (permalink / raw) To: Jack Steiner; +Cc: paulmck, linux-kernel Jack Steiner <steiner@sgi.com> wrote: > > The calls to RCU are coming from here: > > [11]kdb> bt > Stack traceback for pid 3553 > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > 0xa0000001000feee0 call_rcu > 0xa0000001001a3b20 d_free+0x80 > 0xa0000001001a3ec0 dput+0x340 > 0xa00000010016bcd0 __fput+0x210 > 0xa00000010016baa0 fput+0x40 > 0xa000000100168760 filp_close+0xc0 > 0xa000000100168960 sys_close+0x180 > 0xa000000100011be0 ia64_ret_from_syscall > > I see this same backtrace from numerous processes. eh? Why is dput freeing the dentry? It should just be leaving it in cache. What filesystem is being used? procfs? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-07 23:32 ` Andrew Morton @ 2004-05-08 4:55 ` Jack Steiner 2004-05-17 21:18 ` Andrea Arcangeli 1 sibling, 0 replies; 22+ messages in thread From: Jack Steiner @ 2004-05-08 4:55 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, linux-kernel On Fri, May 07, 2004 at 07:04:36PM -0500, Jack Steiner wrote: > On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote: > > Jack Steiner <steiner@sgi.com> wrote: > > > > > > The calls to RCU are coming from here: > > > > > > [11]kdb> bt > > > Stack traceback for pid 3553 > > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > > > 0xa0000001000feee0 call_rcu > > > 0xa0000001001a3b20 d_free+0x80 > > > 0xa0000001001a3ec0 dput+0x340 > > > 0xa00000010016bcd0 __fput+0x210 > > > 0xa00000010016baa0 fput+0x40 > > > 0xa000000100168760 filp_close+0xc0 > > > 0xa000000100168960 sys_close+0x180 > > > 0xa000000100011be0 ia64_ret_from_syscall > > > > > > I see this same backtrace from numerous processes. > > > > eh? Why is dput freeing the dentry? It should just be leaving it in cache. > > > > What filesystem is being used? procfs? > > Good possibility. I verified that /proc DOES cause a call to call_rcu. > > I also did an strace on "ls". I see the following. Does > this make sense??? Note open of /proc/meminfo... 
> > [root@piton tmp]# strace -o zzz ls > espdbd.sock jd_sockV4 ProPack-installer s.eventmond zz zzz > > [root@piton tmp]# egrep 'open|close|fork|exec' zzz > execve("/bin/ls", ["ls"], [/* 26 vars */]) = 0 > open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or > directory) > open("/etc/ld.so.cache", O_RDONLY) = 3 > close(3) = 0 > open("/lib/libacl.so.1", O_RDONLY) = 3 > close(3) = 0 > open("/lib/libtermcap.so.2", O_RDONLY) = 3 > close(3) = 0 > open("/lib/tls/libc.so.6.1", O_RDONLY) = 3 > close(3) = 0 > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/tls/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > open("/lib/libattr.so.1", O_RDONLY) = 3 > close(3) = 0 > open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 > close(3) = 0 > open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > close(3) = 0 > open("/etc/mtab", O_RDONLY) = 3 > close(3) = 0 > >>>> open("/proc/meminfo", O_RDONLY) = 3 > close(3) = 0 > > Full output of strace: > [root@piton tmp]# cat zzz > execve("/bin/ls", ["ls"], [/* 26 vars */]) = 0 > uname({sys="Linux", node="piton.americas.sgi.com", ...}) = 0 > brk(0) = 0x6000000000004000 > open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such > file or directory) > open("/etc/ld.so.cache", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=82621, ...}) = 0 > mmap(NULL, 82621, PROT_READ, MAP_PRIVATE, 3, 0) = > 0x2000000000040000 > close(3) = 0 > open("/lib/libacl.so.1", O_RDONLY) = 3 > read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0 > \'\0\0"..., 640) = 640 > fstat(3, {st_mode=S_IFREG|0644, st_size=187607, ...}) = 0 > mmap(NULL, 160576, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x2000000000058000 > mprotect(0x2000000000070000, 62272, PROT_NONE) = 0 > mmap(0x2000000000078000, 32768, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0x10000) = 0x2000000000078000 > close(3) = 0 > 
open("/lib/libtermcap.so.2", O_RDONLY) = 3 > read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0 > \31\0"..., 640) = 640 > fstat(3, {st_mode=S_IFREG|0755, st_size=26088, ...}) = 0 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000080000 > mmap(NULL, 89656, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x2000000000084000 > mprotect(0x200000000008c000, 56888, PROT_NONE) = 0 > mmap(0x2000000000094000, 32768, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x2000000000094000 > close(3) = 0 > open("/lib/tls/libc.so.6.1", O_RDONLY) = 3 > read(3, > "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0\320\252"..., > 640) = 640 > fstat(3, {st_mode=S_IFREG|0755, st_size=9329267, ...}) = 0 > mmap(NULL, 2519264, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x200000000009c000 > mmap(0x20000000002ec000, 81920, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0x240000) = 0x20000000002ec000 > mmap(0x2000000000300000, 12512, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2000000000300000 > close(3) = 0 > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/tls/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > stat("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/tls", > 0x60000fffffffad00) = -1 ENOENT (No such file or directory) > open("/usr/src/redhat/xfs-cmds/attr/libattr/.libs/libattr.so.1", > O_RDONLY) = -1 ENOENT (No such file or directory) > stat("/usr/src/redhat/xfs-cmds/attr/libattr/.libs", > 0x60000fffffffad00) = -1 ENOENT (No such file or directory) > open("/lib/libattr.so.1", O_RDONLY) = 3 > read(3, > "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0002\0\1\0\0\0\240\31"..., > 640) = 640 > fstat(3, {st_mode=S_IFREG|0644, st_size=55577, ...}) = 0 > mmap(NULL, 102360, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = > 0x2000000000304000 > mprotect(0x2000000000310000, 53208, PROT_NONE) = 0 > mmap(0x2000000000314000, 49152, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x2000000000314000 > close(3) = 
0 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000320000 > mmap(NULL, 32768, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000324000 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x200000000032c000 > mmap(NULL, 16384, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000330000 > munmap(0x2000000000040000, 82621) = 0 > brk(0) = 0x6000000000004000 > brk(0x6000000000028000) = 0x6000000000028000 > brk(0) = 0x6000000000028000 > open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=34848240, ...}) = 0 > mmap(NULL, 34848240, PROT_READ, MAP_PRIVATE, 3, 0) = > 0x2000000000334000 > close(3) = 0 > rt_sigaction(SIGTERM, {0x400000000001e9a0, [], SA_RESTART}, > {SIG_DFL}, 8) = 0 > rt_sigaction(SIGKILL, {0x400000000001e9a0, [], SA_RESTART}, > {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) > rt_sigaction(SIGSTOP, {0x400000000001e9a0, [], SA_RESTART}, > {SIG_DFL}, 8) = -1 EINVAL (Invalid argument) > ioctl(1, TCGETS or SNDCTL_TMR_TIMEBASE, {B9600 opost isig icanon > echo ...}) = 0 > ioctl(1, TIOCGWINSZ, {ws_row=38, ws_col=137, ws_xpixel=0, > ws_ypixel=0}) = 0 > open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > fstat(3, {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0 > fcntl(3, F_SETFD, FD_CLOEXEC) = 0 > getdents64(0x3, 0x60000000000095d8, 0x4000) = 432 > getdents64(0x3, 0x60000000000095d8, 0x4000) = 0 > close(3) = 0 > open("/etc/mtab", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=1533, ...}) = 0 > mmap(NULL, 65536, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000002470000 > read(3, "/dev/xscsi/pci01.03.0-1/target1/"..., 16384) = 1533 > close(3) = 0 > munmap(0x2000000002470000, 65536) = 0 > >>>> open("/proc/meminfo", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 > mmap(NULL, 65536, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000002470000 > 
read(3, "MemTotal: 22874016 kB\nMemFre"..., 1024) = 653 > close(3) = 0 > munmap(0x2000000002470000, 65536) = 0 > fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = > 0 > mmap(NULL, 65536, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000002470000 > write(1, "espdbd.sock jd_sockV4\tProPack-i"..., 62) = 62 > munmap(0x2000000002470000, 65536) = 0 > exit_group(0) = ? > > -- > Thanks > > Jack Steiner (steiner@sgi.com) 651-683-5302 > Principal Engineer SGI - Silicon Graphics, Inc. > > -- Thanks Jack Steiner (steiner@sgi.com) 651-683-5302 Principal Engineer SGI - Silicon Graphics, Inc. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-07 23:32 ` Andrew Morton 2004-05-08 4:55 ` Jack Steiner @ 2004-05-17 21:18 ` Andrea Arcangeli 2004-05-17 21:42 ` Andrew Morton 1 sibling, 1 reply; 22+ messages in thread From: Andrea Arcangeli @ 2004-05-17 21:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Jack Steiner, paulmck, linux-kernel On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote: > Jack Steiner <steiner@sgi.com> wrote: > > > > The calls to RCU are coming from here: > > > > [11]kdb> bt > > Stack traceback for pid 3553 > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > > 0xa0000001000feee0 call_rcu > > 0xa0000001001a3b20 d_free+0x80 > > 0xa0000001001a3ec0 dput+0x340 > > 0xa00000010016bcd0 __fput+0x210 > > 0xa00000010016baa0 fput+0x40 > > 0xa000000100168760 filp_close+0xc0 > > 0xa000000100168960 sys_close+0x180 > > 0xa000000100011be0 ia64_ret_from_syscall > > > > I see this same backtrace from numerous processes. > > eh? Why is dput freeing the dentry? It should just be leaving it in cache. > > What filesystem is being used? procfs? deleting entries from dcache can be a frequent operation, even rename() triggers d_free. note that I changed my tree to free all negative entries that are currently generated by unlink. I find it useless to leave negative dentries after "unlink". I leave them of course after a failed lookup (that's the fundamental usage of the negative dentries for the PATHs userspace lookups), but not after unlink. I think it's wasted memory that would better be used for other purposes. I think this is also the reason why, when dcache-RCU was once benchmarked on top of my 2.4 tree, it resulted in a loss of performance. 
in my 2.6-aa I didn't forward port all my 2.4-aa stuff but I really would like to avoid wasting memory there again for the files that are being deleted, plus during memory pressure (that could be generated even from pagecache) dcache must be freed up (admittedly there we could free more than 1 entry for every quiescent point). I suggested a few years ago that I agreed completely with RCU _only_ for usages where the reader is extremely _frequent_ and the writer is _extremely_ unlikely to happen, my most obvious example was the replacement of the big reader lock. RCU basically trades much higher performance for the reader, with much lower performance for the writer. RCU is like a rwlock with the read-lock being extremely fast (actually it's literally a noop), but with a very slow writer (but the writer is still better than the big writer lock, especially since the writer has no starvation and can coalesce things together to maximize icache etc..). The more cpus, the higher performance you get by RCU by having a noop read-lock operation. The more cpus the lower performance you get in the writer. Very few things come for free ;). ^ permalink raw reply [flat|nested] 22+ messages in thread
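Andrea's rwlock comparison can be illustrated with a toy, single-threaded model (hypothetical code, not the kernel API: real RCU needs memory barriers and quiescent-state detection, and the grace period here is invoked by hand). The read side is a bare pointer dereference with no lock at all; the write side pays by publishing a copy and deferring the free of the old one until a grace period.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int value;
};

static struct item *cur;	/* the RCU-protected pointer */
static struct item *pending;	/* at most one deferred free in this toy */
static int deferred_frees;

static void item_init(int v)
{
	cur = malloc(sizeof(*cur));
	cur->value = v;
}

/* Read side: rcu_read_lock()/rcu_read_unlock() would be no-ops here. */
static int reader(void)
{
	return cur->value;
}

/* Write side: publish a new copy, defer freeing the old one
 * (call_rcu() in the real thing). */
static void writer_replace(int v)
{
	struct item *repl = malloc(sizeof(*repl));
	struct item *old = cur;

	repl->value = v;
	cur = repl;		/* publish */
	pending = old;		/* cannot free yet: readers may hold it */
}

/* Called once all pre-existing readers are known to have finished. */
static void grace_period(void)
{
	if (pending) {
		free(pending);
		pending = NULL;
		deferred_frees++;
	}
}
```

The dcache case is exactly the frequent-writer regime: every d_free becomes a deferred callback, so the cost moves from the (cheap) read side to grace-period bookkeeping.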
* Re: RCU scaling on large systems 2004-05-17 21:18 ` Andrea Arcangeli @ 2004-05-17 21:42 ` Andrew Morton 2004-05-17 23:50 ` Andrea Arcangeli ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Andrew Morton @ 2004-05-17 21:42 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: steiner, paulmck, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote: > > Jack Steiner <steiner@sgi.com> wrote: > > > > > > The calls to RCU are coming from here: > > > > > > [11]kdb> bt > > > Stack traceback for pid 3553 > > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls > > > 0xa0000001000feee0 call_rcu > > > 0xa0000001001a3b20 d_free+0x80 > > > 0xa0000001001a3ec0 dput+0x340 > > > 0xa00000010016bcd0 __fput+0x210 > > > 0xa00000010016baa0 fput+0x40 > > > 0xa000000100168760 filp_close+0xc0 > > > 0xa000000100168960 sys_close+0x180 > > > 0xa000000100011be0 ia64_ret_from_syscall > > > > > > I see this same backtrace from numerous processes. > > > > eh? Why is dput freeing the dentry? It should just be leaving it in cache. > > > > What filesystem is being used? procfs? > > deleting entries from dcache can be a frequent operation, even rename() > triggers d_free. This issue has gone all quiet. Is anyone doing anything? > note that I changed my tree to free all negative entries that are > currently generated by unlink. I find useless to leave negative dentries > after "unlink". I leave them of course after a failed lookup (that's the > fundamental usage of the negative dentries for the PATHs userspace > lookups), but not after unlink. Sounds sensible. Could you please send out the patch? > RCU basically trades mugh higher performance for reader, with much lower > performance for the writer. If the writer wants synchronous-removal semantics, yes. The problem here and, I believe, in the route cache is in finding a balance between the amount of storage and the frequency of RCU callback runs. 
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems 2004-05-17 21:42 ` Andrew Morton @ 2004-05-17 23:50 ` Andrea Arcangeli 2004-05-18 13:33 ` Jack Steiner 2004-05-18 23:13 ` Matt Mackall 2 siblings, 0 replies; 22+ messages in thread From: Andrea Arcangeli @ 2004-05-17 23:50 UTC (permalink / raw) To: Andrew Morton; +Cc: steiner, paulmck, linux-kernel On Mon, May 17, 2004 at 02:42:28PM -0700, Andrew Morton wrote: > This issue has gone all quiet. Is anyone doing aything? dunno > Sounds sensible. Could you please send out the patch? this is the 2.4 version but problem is this logic will now work fine on UP or small SMP, but it will be very detrimental on the 256-way, so we may want to address the dcache-rcu problem first? (my wild guess is that renames are less frequent than unlinks). diff -urNp 2.4.19pre9/fs/dcache.c neg-dentry-ref/fs/dcache.c --- 2.4.19pre9/fs/dcache.c Wed May 29 02:12:36 2002 +++ neg-dentry-ref/fs/dcache.c Wed May 29 04:19:13 2002 @@ -806,6 +806,7 @@ out: void d_delete(struct dentry * dentry) { +#ifdef DENTRY_WASTE_RAM /* * Are we the only user? */ @@ -815,6 +816,7 @@ void d_delete(struct dentry * dentry) return; } spin_unlock(&dcache_lock); +#endif /* * If not, just drop the dentry and let dput diff -urNp 2.4.19pre9/fs/namei.c neg-dentry-ref/fs/namei.c --- 2.4.19pre9/fs/namei.c Wed May 29 02:12:36 2002 +++ neg-dentry-ref/fs/namei.c Wed May 29 04:20:45 2002 @@ -1042,6 +1042,10 @@ do_last: error = vfs_create(dir->d_inode, dentry, mode & ~current->fs->umask); up(&dir->d_inode->i_sem); +#ifndef DENTRY_WASTE_RAM + if (error) + d_drop(dentry); +#endif dput(nd->dentry); nd->dentry = dentry; if (error) > If the writer wants synchronous-removal semantics, yes. > > The problem here and, I believe, in the route cache is in finding a balance > between the amount of storage and the frequency of RCU callback runs. RCU is by design an asynchronous-removal. 
The problem is the "latency" of this asynchronous-removal, so the time between the request-of-freeing and the effective release of the memory. Jack's patch increases the latency of reaching a quiescent point a lot (this is fine as long as he has enough memory) and reduces the overhead. He practically removed the contention on the spinlock completely, so the cpu will not waste any time spinning and thrashing the cacheline everywhere. The problem is that there is a "max latency" we can allow under any certain workload, before throttling on rcu (normally inside the VM due to memory exhaustion, or like in the routing cache case throttling by losing incoming packets). So again, for a not-frequent writer, rcu is fine and Jack's approach is fine too and it solves the contention problem completely. For a frequent writer like the dcache, rcu is much more problematic, and here we're even ignoring the fact that on all machines calling rcu for the writer is more expensive than taking a write_lock, so if there's no reader to optimize, there's nearly no point to use RCU. This even ignoring the fact that the current rcu implementation will spin on the spinlock (but we could change that like Jack did). An improvement over Jack's patch that will likely not increase the latency much, is to minimize as much as possible the delay between the processing of cpu0 then cpu1 then cpu2 etc.. The basic logic of Jack's patch is to _serialize_ the rcu_check_callbacks. By serializing it cpu after cpu, he removes the contention on the lock, but there's absolutely no reason to leave millisecond delays in between cpus, so using IPI to queue the next callback in the next cpu would solve the problem as well as Jack's patch but without increasing latency as much as he did. Problem is that sending IPIs from softirq context is not allowed, like queuing tasklets in other cpus is not allowed. 
So there's no easy fix other than Jack's huge-latency-increase one, but that makes rcu not suitable for frequent-writer case (probably not an issue for him with some tera of ram and without zone-normal ;) but an issue for others). And again, rcu is probably slower anyways for the frequent-writer case (even on a 2-way), so it really should be used for frequent-reader, if there's any legitimate frequent-writer rcu is troublesome, it's not that rcu isn't designed for huge machines, rcu is _perfect_ for huge machines, but only as far as you don't mind running the writer frequently during production (like it can happen with dcache). RCU is like the big reader lock, but it's much better than the big reader lock. Routing cache is less of an issue since the collection should happen extremely infrequently in normal usages (it's not nearly similar to calling unlink or rename in normal usages), but it's troublesome too if you're under routing attack and you've to collect entries all the time. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RCU scaling on large systems
  2004-05-17 21:42 ` Andrew Morton
  2004-05-17 23:50 ` Andrea Arcangeli
@ 2004-05-18 13:33 ` Jack Steiner
  2004-05-18 23:13 ` Matt Mackall
  2 siblings, 0 replies; 22+ messages in thread
From: Jack Steiner @ 2004-05-18 13:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, paulmck, linux-kernel
On Mon, May 17, 2004 at 02:42:28PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Fri, May 07, 2004 at 04:32:35PM -0700, Andrew Morton wrote:
> > > Jack Steiner <steiner@sgi.com> wrote:
> > > >
> > > > The calls to RCU are coming from here:
> > > >
> > > > [11]kdb> bt
> > > > Stack traceback for pid 3553
> > > > 0xe00002b007230000 3553 3139 1 11 R 0xe00002b0072304f0 *ls
> > > > 0xa0000001000feee0 call_rcu
> > > > 0xa0000001001a3b20 d_free+0x80
> > > > 0xa0000001001a3ec0 dput+0x340
> > > > 0xa00000010016bcd0 __fput+0x210
> > > > 0xa00000010016baa0 fput+0x40
> > > > 0xa000000100168760 filp_close+0xc0
> > > > 0xa000000100168960 sys_close+0x180
> > > > 0xa000000100011be0 ia64_ret_from_syscall
> > > >
> > > > I see this same backtrace from numerous processes.
> > >
> > > eh? Why is dput freeing the dentry? It should just be leaving it in cache.
> > >
> > > What filesystem is being used? procfs?
> >
> > deleting entries from dcache can be a frequent operation, even rename()
> > triggers d_free.
>
> This issue has gone all quiet. Is anyone doing anything?

I plan to look into the RCU scaling issues (unless someone beats me to it)
but it will be a couple of weeks before I can start.

> > note that I changed my tree to free all negative entries that are
> > currently generated by unlink. I find useless to leave negative dentries
> > after "unlink". I leave them of course after a failed lookup (that's the
> > fundamental usage of the negative dentries for the PATHs userspace
> > lookups), but not after unlink.
>
> Sounds sensible. Could you please send out the patch?
>
> > RCU basically trades much higher performance for the reader, with much
> > lower performance for the writer.
>
> If the writer wants synchronous-removal semantics, yes.
>
> The problem here and, I believe, in the route cache is in finding a balance
> between the amount of storage and the frequency of RCU callback runs.

--
Thanks

Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
* Re: RCU scaling on large systems
  2004-05-17 21:42 ` Andrew Morton
  2004-05-17 23:50 ` Andrea Arcangeli
  2004-05-18 13:33 ` Jack Steiner
@ 2004-05-18 23:13 ` Matt Mackall
  2 siblings, 0 replies; 22+ messages in thread
From: Matt Mackall @ 2004-05-18 23:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, steiner, paulmck, linux-kernel
On Mon, May 17, 2004 at 02:42:28PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> > note that I changed my tree to free all negative entries that are
> > currently generated by unlink. I find useless to leave negative dentries
> > after "unlink". I leave them of course after a failed lookup (that's the
> > fundamental usage of the negative dentries for the PATHs userspace
> > lookups), but not after unlink.
>
> Sounds sensible. Could you please send out the patch?

Seems like it'll be the wrong thing for the fairly common lockfile case
and also stuff like "make clean all".

--
Matt Mackall : http://www.selenic.com : Linux development and consulting
* Re: RCU scaling on large systems
@ 2004-05-20 11:36 Manfred Spraul
0 siblings, 0 replies; 22+ messages in thread
From: Manfred Spraul @ 2004-05-20 11:36 UTC (permalink / raw)
To: Jack Steiner; +Cc: William Lee Irwin III, linux-kernel, Paul E. McKenney
[-- Attachment #1: Type: text/plain, Size: 1456 bytes --]
Jack Steiner wrote:
>On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
>> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
>> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
>> > RCU code. The time is spent contending for shared cache lines.
>>
>> Would something like this help cacheline contention? This uses the
>> per_cpu data areas to hold per-cpu booleans for needing switches.
>> Untested/uncompiled.
>
>
> [snip]
>
>We use 64MB granules. The percpu data structures on the individual nodes
>are separated by addresses that differ by many GB. A scan of all percpu
>data structures requires a TLB entry for each node. This is costly &
thrashes the TLB. (Our max system size is currently 256 nodes).
>
>
rcu_cpu_mask serves two purposes:
- if a bit for a cpu is set, then the cpu must report quiescent states. Thus all cpus continuously poll their own bits.
- If a cpu notices that it completed a quiescent state, it clears its own bit, checks all the bits, and completes the grace period if the bitmap is empty. This happens under a global spinlock.
What about splitting that into two variables?
A global counter could be used to inform the cpus that they should look for quiescent states. That cacheline would be mostly read-only, so it shouldn't cause thrashing.
Attached is a patch - it boots uniprocessor i386. What do you think? It should remove the hotspot from scheduler_tick entirely.
--
Manfred
[-- Attachment #2: patch-rcu-simple --]
[-- Type: text/plain, Size: 4162 bytes --]
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 6
// EXTRAVERSION = -mm4
--- 2.6/kernel/rcupdate.c 2004-05-19 19:51:17.000000000 +0200
+++ build-2.6/kernel/rcupdate.c 2004-05-20 13:22:59.000000000 +0200
@@ -55,6 +55,7 @@
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
#define RCU_tasklet(cpu) (per_cpu(rcu_tasklet, cpu))
+long rcu_global_grace_num ____cacheline_aligned_in_smp;
/**
* call_rcu - Queue an RCU update request.
* @head: structure to be used for queueing the RCU updates.
@@ -97,6 +98,23 @@
}
/*
+ * Grace period handling:
+ * The grace period handling consists out of two steps:
+ * - A new grace period is started.
+ * This is done by rcu_start_batch. The start is not broadcasted to
+ * all cpus, they must pick this up from rcu_global_grace_num. All cpus are
+ * recorded in the rcu_ctrlblk.rcu_cpu_mask bitmap.
+ * - All cpus must go through a quiescent state.
+ * Since the start of the grace period is not broadcasted, at least two
+ * calls to rcu_check_quiescent_state are required:
+ * The first call just notices that a new grace period is running. The
+ * following calls check if there was a quiescent state since the beginning
+ * of the grace period. If so, it updates rcu_ctrlblk.rcu_cpu_mask. If
+ * the bitmap is empty, then the grace period is completed.
+ * rcu_check_quiescent_state calls rcu_start_batch to start the next grace
+ * period.
+ */
+/*
* Register a new batch of callbacks, and start it up if there is currently no
* active batch and the batch to be registered has not already occurred.
* Caller must hold the rcu_ctrlblk lock.
@@ -116,6 +134,8 @@
active = nohz_cpu_mask;
cpus_complement(active);
cpus_and(rcu_ctrlblk.rcu_cpu_mask, cpu_online_map, active);
+ smp_wmb();
+ rcu_global_grace_num++;
}
/*
@@ -127,18 +147,28 @@
{
int cpu = smp_processor_id();
- if (!cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+ if (RCU_grace_num(cpu) != rcu_global_grace_num) {
+ /* new grace period: record qsctr value. */
+ BUG_ON(RCU_last_qsctr(cpu) != RCU_QSCTR_INVALID);
+ RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
+ RCU_grace_num(cpu) = rcu_global_grace_num;
+ return;
+ }
+
+ /* grace period already completed for this cpu?
+ * last_qsctr is checked instead of the actual bitmap to avoid
+ * cacheline trashing
+ */
+ if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
+ /* yes. Just leave. */
return;
+ }
/*
* Races with local timer interrupt - in the worst case
* we may miss one quiescent state of that CPU. That is
* tolerable. So no need to disable interrupts.
*/
- if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
- RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
- return;
- }
if (RCU_qsctr(cpu) == RCU_last_qsctr(cpu))
return;
--- 2.6/include/linux/rcupdate.h 2004-05-19 19:42:16.000000000 +0200
+++ build-2.6/include/linux/rcupdate.h 2004-05-20 12:46:02.000000000 +0200
@@ -94,16 +94,19 @@
long last_qsctr; /* value of qsctr at beginning */
/* of rcu grace period */
long batch; /* Batch # for current RCU batch */
+ long grace_num; /* grace period number */
struct list_head nxtlist;
struct list_head curlist;
};
DECLARE_PER_CPU(struct rcu_data, rcu_data);
extern struct rcu_ctrlblk rcu_ctrlblk;
+extern long rcu_global_grace_num;
#define RCU_qsctr(cpu) (per_cpu(rcu_data, (cpu)).qsctr)
#define RCU_last_qsctr(cpu) (per_cpu(rcu_data, (cpu)).last_qsctr)
#define RCU_batch(cpu) (per_cpu(rcu_data, (cpu)).batch)
+#define RCU_grace_num(cpu) (per_cpu(rcu_data, (cpu)).grace_num)
#define RCU_nxtlist(cpu) (per_cpu(rcu_data, (cpu)).nxtlist)
#define RCU_curlist(cpu) (per_cpu(rcu_data, (cpu)).curlist)
@@ -115,7 +118,8 @@
rcu_batch_before(RCU_batch(cpu), rcu_ctrlblk.curbatch)) ||
(list_empty(&RCU_curlist(cpu)) &&
!list_empty(&RCU_nxtlist(cpu))) ||
- cpu_isset(cpu, rcu_ctrlblk.rcu_cpu_mask))
+ RCU_grace_num(cpu) != rcu_global_grace_num ||
+ RCU_last_qsctr(cpu) != RCU_QSCTR_INVALID)
return 1;
else
return 0;