* [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
@ 2010-01-25 3:48 Paul E. McKenney
2010-01-25 12:28 ` Lai Jiangshan
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Paul E. McKenney @ 2010-01-25 3:48 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc,
niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
[Experimental RFC, not for inclusion.]
I recently received a complaint that RCU was refusing to let a system
go into low-power state immediately, instead waiting a few ticks after
the system had gone idle before letting go of the last CPU. Of course,
the reason for this was that there were a couple of RCU callbacks on
the last CPU.
Currently, rcu_needs_cpu() simply checks whether the current CPU has
an outstanding RCU callback, which means that the last CPU to go into
dyntick-idle mode might wait a few ticks for the relevant grace periods
to complete. However, if all the other CPUs are in dyntick-idle mode,
and if this CPU is in a quiescent state (which it is for RCU-bh and
RCU-sched any time that we are considering going into dyntick-idle mode),
then the grace period is instantly complete.
This patch therefore repeatedly invokes the RCU grace-period machinery
in order to force any needed grace periods to complete quickly. It does
so a limited number of times in order to prevent starvation by an RCU
callback function that might pass itself to call_rcu().
Thoughts?
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
diff --git a/init/Kconfig b/init/Kconfig
index d95ca7c..42bf914 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -396,6 +396,22 @@ config RCU_FANOUT_EXACT
Say N if unsure.
+config RCU_FAST_NO_HZ
+ bool "Accelerate last non-dyntick-idle CPU's grace periods"
+ depends on TREE_RCU && NO_HZ && SMP
+ default n
+ help
+ This option causes RCU to attempt to accelerate grace periods
+ in order to allow the final CPU to enter dynticks-idle state
+ more quickly. On the other hand, this option increases the
+ overhead of the dynticks-idle checking, particularly on systems
+ with large numbers of CPUs.
+
+ Say Y if energy efficiency is critically important, particularly
+ if you have relatively few CPUs.
+
+ Say N if you are unsure.
+
config TREE_RCU_TRACE
def_bool RCU_TRACE && ( TREE_RCU || TREE_PREEMPT_RCU )
select DEBUG_FS
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 099a255..29d88c0 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1550,10 +1550,9 @@ static int rcu_pending(int cpu)
/*
* Check to see if any future RCU-related work will need to be done
* by the current CPU, even if none need be done immediately, returning
- * 1 if so. This function is part of the RCU implementation; it is -not-
- * an exported member of the RCU API.
+ * 1 if so.
*/
-int rcu_needs_cpu(int cpu)
+static int rcu_needs_cpu_quick_check(int cpu)
{
/* RCU callbacks either ready or pending? */
return per_cpu(rcu_sched_data, cpu).nxtlist ||
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index e77cdf3..d6170a9 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -906,3 +906,72 @@ static void __init __rcu_init_preempt(void)
}
#endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
+
+#if defined(CONFIG_TREE_PREEMPT_RCU) || !defined(CONFIG_RCU_FAST_NO_HZ)
+
+/*
+ * Check to see if any future RCU-related work will need to be done
+ * by the current CPU, even if none need be done immediately, returning
+ * 1 if so. This function is part of the RCU implementation; it is -not-
+ * an exported member of the RCU API.
+ *
+ * Because we have preemptible RCU, just check whether this CPU needs
+ * any flavor of RCU. Do not chew up lots of CPU cycles with preemption
+ * disabled in a most-likely vain attempt to cause RCU not to need this CPU.
+ */
+int rcu_needs_cpu(int cpu)
+{
+ return rcu_needs_cpu_quick_check(cpu);
+}
+
+#else
+
+#define RCU_NEEDS_CPU_FLUSHES 5
+
+/*
+ * Check to see if any future RCU-related work will need to be done
+ * by the current CPU, even if none need be done immediately, returning
+ * 1 if so. This function is part of the RCU implementation; it is -not-
+ * an exported member of the RCU API.
+ *
+ * Because we are not supporting preemptible RCU, attempt to accelerate
+ * any current grace periods so that RCU no longer needs this CPU, but
+ * only if all other CPUs are already in dynticks-idle mode. This will
+ * allow the CPU cores to be powered down immediately, as opposed to after
+ * waiting many milliseconds for grace periods to elapse.
+ */
+int rcu_needs_cpu(int cpu)
+{
+ int c = 1;
+ int i;
+ int thatcpu;
+
+ /* Don't bother unless we are the last non-dyntick-idle CPU. */
+ for_each_cpu(thatcpu, nohz_cpu_mask)
+ if (thatcpu != cpu)
+ return rcu_needs_cpu_quick_check(cpu);
+
+ /* Try to push remaining RCU-sched and RCU-bh callbacks through. */
+ for (i = 0; i < RCU_NEEDS_CPU_FLUSHES && c; i++) {
+ c = 0;
+ if (per_cpu(rcu_sched_data, cpu).nxtlist) {
+ c = 1;
+ rcu_sched_qs(cpu);
+ force_quiescent_state(&rcu_sched_state, 0);
+ __rcu_process_callbacks(&rcu_sched_state,
+ &per_cpu(rcu_sched_data, cpu));
+ }
+ if (per_cpu(rcu_bh_data, cpu).nxtlist) {
+ c = 1;
+ rcu_bh_qs(cpu);
+ force_quiescent_state(&rcu_bh_state, 0);
+ __rcu_process_callbacks(&rcu_bh_state,
+ &per_cpu(rcu_bh_data, cpu));
+ }
+ }
+
+ /* If RCU callbacks are still pending, RCU still needs this CPU. */
+ return c;
+}
+
+#endif
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-25  3:48 [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU Paul E. McKenney
@ 2010-01-25 12:28 ` Lai Jiangshan
  2010-01-25 12:35   ` Peter Zijlstra
  2010-01-25 15:08   ` Steven Rostedt
  2010-01-27  5:17   ` Paul E. McKenney
  2 siblings, 3 replies; 23+ messages in thread
From: Lai Jiangshan @ 2010-01-25 12:28 UTC (permalink / raw)
To: paulmck
Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells

Paul E. McKenney wrote:
> [Experimental RFC, not for inclusion.]
>
> [ . . . full patch quoted in the original message; trimmed here to the
> hunk under discussion . . . ]
>
> +	/* Don't bother unless we are the last non-dyntick-idle CPU. */
> +	for_each_cpu(thatcpu, nohz_cpu_mask)
> +		if (thatcpu != cpu)
> +			return rcu_needs_cpu_quick_check(cpu);

The comment and the code are not the same, I think.

-----------
I found this thing, although I think it is an ugly thing.
Does it help?

See select_nohz_load_balancer():

/*
 * This routine will try to nominate the ilb (idle load balancing)
 * owner among the cpus whose ticks are stopped. ilb owner will do the idle
 * load balancing on behalf of all those cpus. If all the cpus in the system
 * go into this tickless mode, then there will be no ilb owner (as there is
 * no need for one) and all the cpus will sleep till the next wakeup event
 * arrives...
 *
 * For the ilb owner, tick is not stopped. And this tick will be used
 * for idle load balancing. ilb owner will still be part of
 * nohz.cpu_mask..
 *
 * While stopping the tick, this cpu will become the ilb owner if there
 * is no other owner. And will be the owner till that cpu becomes busy
 * or if all cpus in the system stop their ticks at which point
 * there is no need for ilb owner.
 *
 * When the ilb owner becomes busy, it nominates another owner, during the
 * next busy scheduler_tick()
 */
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-25 12:28 ` Lai Jiangshan
@ 2010-01-25 12:35   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-01-25 12:35 UTC (permalink / raw)
To: Lai Jiangshan
Cc: paulmck, linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers,
	josh, dvhltc, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	Pallipadi, Venkatesh

On Mon, 2010-01-25 at 20:28 +0800, Lai Jiangshan wrote:
> [ . . . full quote of the patch and of the select_nohz_load_balancer()
> comment trimmed . . . ]
>
> See select_nohz_load_balancer().

Not quite sure what your point is, but Venki was poking at the ILB so
he might want to be aware of things...
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-25 12:28 ` Lai Jiangshan
  2010-01-25 12:35   ` Peter Zijlstra
@ 2010-01-25 15:08   ` Steven Rostedt
  2010-01-27  5:17   ` Paul E. McKenney
  2 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2010-01-25 15:08 UTC (permalink / raw)
To: Lai Jiangshan
Cc: paulmck, linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, 2010-01-25 at 20:28 +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > +	/* Don't bother unless we are the last non-dyntick-idle CPU. */
> > +	for_each_cpu(thatcpu, nohz_cpu_mask)
> > +		if (thatcpu != cpu)
> > +			return rcu_needs_cpu_quick_check(cpu);
>
> The comment and the code are not the same, I think.

I once heard this quote, but I don't know who said it:

  "If the comment and the code do not match, they probably are both wrong"

Anyway, you are correct, the comment does not match, but I think the
code is wrong. The code takes the slow path as soon as it finds any
dyntick-idle CPU other than ourselves, instead of checking whether some
CPU other than ourselves is still *non*-dyntick-idle. Reading the
change log, that looks wrong. Perhaps this is what was needed:

	for_each_online_cpu(thatcpu) {
		if (thatcpu != cpu &&
		    !cpumask_test_cpu(thatcpu, nohz_cpu_mask))
			return rcu_needs_cpu_quick_check(cpu);
	}

-- Steve
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-25 12:28 ` Lai Jiangshan
  2010-01-25 12:35   ` Peter Zijlstra
  2010-01-25 15:08   ` Steven Rostedt
@ 2010-01-27  5:17   ` Paul E. McKenney
  2 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2010-01-27 5:17 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells

On Mon, Jan 25, 2010 at 08:28:34PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > [ . . . full patch quoted in the original message; trimmed here to the
> > hunk under discussion . . . ]
> >
> > +	/* Don't bother unless we are the last non-dyntick-idle CPU. */
> > +	for_each_cpu(thatcpu, nohz_cpu_mask)
> > +		if (thatcpu != cpu)
> > +			return rcu_needs_cpu_quick_check(cpu);
>
> The comment and the code are not the same, I think.

Indeed, for this code to be correct, I would need to sequence through
the non-dyntick-idle CPUs, not the dyntick-idle ones. Good catch!

I will likely come back with something similar to Steve Rostedt's
suggestion. Probably better to sequence through all the CPUs rather
than to allocate a cpumask and invert it. Or a 'for_each_cpu_not()'
or some such. ;-)

There does appear to be a cpumask_next_zero() that I should be able
to use.

							Thanx, Paul

> [ . . . select_nohz_load_balancer() comment trimmed . . . ]
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-25  3:48 [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU Paul E. McKenney
  2010-01-25 12:28 ` Lai Jiangshan
@ 2010-01-25 15:12 ` Steven Rostedt
  2010-01-27 14:11   ` Paul E. McKenney
  2010-01-26 21:30 ` Andi Kleen
  2 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2010-01-25 15:12 UTC (permalink / raw)
To: paulmck
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Sun, 2010-01-24 at 19:48 -0800, Paul E. McKenney wrote:
> +int rcu_needs_cpu(int cpu)
> +{
> +	int c = 1;
> +	int i;
> +	int thatcpu;
> +
> +	/* Don't bother unless we are the last non-dyntick-idle CPU. */
> +	for_each_cpu(thatcpu, nohz_cpu_mask)
> +		if (thatcpu != cpu)
> +			return rcu_needs_cpu_quick_check(cpu);
> +
> +	/* Try to push remaining RCU-sched and RCU-bh callbacks through. */
> +	for (i = 0; i < RCU_NEEDS_CPU_FLUSHES && c; i++) {
> +		c = 0;
> +		if (per_cpu(rcu_sched_data, cpu).nxtlist) {
> +			c = 1;
> +			rcu_sched_qs(cpu);
> +			force_quiescent_state(&rcu_sched_state, 0);
> +			__rcu_process_callbacks(&rcu_sched_state,
> +						&per_cpu(rcu_sched_data, cpu));
> +		}
> +		if (per_cpu(rcu_bh_data, cpu).nxtlist) {
> +			c = 1;
> +			rcu_bh_qs(cpu);
> +			force_quiescent_state(&rcu_bh_state, 0);
> +			__rcu_process_callbacks(&rcu_bh_state,
> +						&per_cpu(rcu_bh_data, cpu));
> +		}
> +	}
> +
> +	/* If RCU callbacks are still pending, RCU still needs this CPU. */
> +	return c;
> +}

What happens if the last loop pushes out all callbacks? Then we would
be returning 1 when we could really be returning 0. Wouldn't a better
answer be:

	return per_cpu(rcu_sched_data, cpu).nxtlist ||
	       per_cpu(rcu_bh_data, cpu).nxtlist;

-- Steve
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-25 15:12 ` Steven Rostedt
@ 2010-01-27 14:11   ` Paul E. McKenney
  2010-01-27 14:52     ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2010-01-27 14:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Mon, Jan 25, 2010 at 10:12:03AM -0500, Steven Rostedt wrote:
> On Sun, 2010-01-24 at 19:48 -0800, Paul E. McKenney wrote:
> > [ . . . quoted rcu_needs_cpu() trimmed . . . ]
> >
> > +	/* If RCU callbacks are still pending, RCU still needs this CPU. */
> > +	return c;
>
> What happens if the last loop pushes out all callbacks? Then we would
> be returning 1 when we could really be returning 0. Wouldn't a better
> answer be:
>
> 	return per_cpu(rcu_sched_data, cpu).nxtlist ||
> 	       per_cpu(rcu_bh_data, cpu).nxtlist;

Good point!!!

Or I can move the assignment to "c" to the end of each branch of the
"if" statement, and do something like the following:

	c = !!per_cpu(rcu_sched_data, cpu).nxtlist;

But either way, you are right, it does not make sense to go to all the
trouble of forcing a grace period and then failing to take advantage
of it.

							Thanx, Paul
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-27 14:11 ` Paul E. McKenney
@ 2010-01-27 14:52   ` Steven Rostedt
  0 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2010-01-27 14:52 UTC (permalink / raw)
To: paulmck
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh,
    dvhltc, niv, tglx, peterz, Valdis.Kletnieks, dhowells

On Wed, 2010-01-27 at 06:11 -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2010 at 10:12:03AM -0500, Steven Rostedt wrote:
> > On Sun, 2010-01-24 at 19:48 -0800, Paul E. McKenney wrote:
> >
> > > +/*
> > > + * Check to see if any future RCU-related work will need to be done
> > > + * by the current CPU, even if none need be done immediately, returning
> > > + * 1 if so.  This function is part of the RCU implementation; it is -not-
> > > + * an exported member of the RCU API.
> > > + *
> > > + * Because we are not supporting preemptible RCU, attempt to accelerate
> > > + * any current grace periods so that RCU no longer needs this CPU, but
> > > + * only if all other CPUs are already in dynticks-idle mode.  This will
> > > + * allow the CPU cores to be powered down immediately, as opposed to after
> > > + * waiting many milliseconds for grace periods to elapse.
> > > + */
> > > +int rcu_needs_cpu(int cpu)
> > > +{
> > > +	int c = 1;
> > > +	int i;
> > > +	int thatcpu;
> > > +
> > > +	/* Don't bother unless we are the last non-dyntick-idle CPU. */
> > > +	for_each_cpu(thatcpu, nohz_cpu_mask)
> > > +		if (thatcpu != cpu)
> > > +			return rcu_needs_cpu_quick_check(cpu);
> > > +
> > > +	/* Try to push remaining RCU-sched and RCU-bh callbacks through. */
> > > +	for (i = 0; i < RCU_NEEDS_CPU_FLUSHES && c; i++) {
> > > +		c = 0;
> > > +		if (per_cpu(rcu_sched_data, cpu).nxtlist) {
> > > +			c = 1;
> > > +			rcu_sched_qs(cpu);
> > > +			force_quiescent_state(&rcu_sched_state, 0);
> > > +			__rcu_process_callbacks(&rcu_sched_state,
> > > +						&per_cpu(rcu_sched_data, cpu));
>
> > > +		}
> > > +		if (per_cpu(rcu_bh_data, cpu).nxtlist) {
> > > +			c = 1;
> > > +			rcu_bh_qs(cpu);
> > > +			force_quiescent_state(&rcu_bh_state, 0);
> > > +			__rcu_process_callbacks(&rcu_bh_state,
> > > +						&per_cpu(rcu_bh_data, cpu));
> > > +		}
> > > +	}
> > > +
> > > +	/* If RCU callbacks are still pending, RCU still needs this CPU. */
> > > +	return c;
> >
> > What happens if the last loop pushes out all callbacks? Then we would be
> > returning 1 when we could really be returning 0. Wouldn't a better
> > answer be:
> >
> > 	return per_cpu(rcu_sched_data, cpu).nxtlist ||
> > 		per_cpu(rcu_bh_data, cpu).nxtlist;
>
> Good point!!!
>
> Or I can move the assignment to "c" to the end of each branch of the
> "if" statement, and do something like the following:
>
> 	c = !!per_cpu(rcu_sched_data, cpu).nxtlist;

Hmm, that may just add obfuscation to those looking at the code.

> But either way, you are right, it does not make sense to go to all the
> trouble of forcing a grace period and then failing to take advantage
> of it.

Yeah, whatever implementation is fine, as long as it works and takes
advantage of all forced grace periods.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-25 3:48 [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU Paul E. McKenney 2010-01-25 12:28 ` Lai Jiangshan 2010-01-25 15:12 ` Steven Rostedt @ 2010-01-26 21:30 ` Andi Kleen 2010-01-26 23:55 ` Mathieu Desnoyers 2010-01-27 5:20 ` Paul E. McKenney 2 siblings, 2 replies; 23+ messages in thread From: Andi Kleen @ 2010-01-26 21:30 UTC (permalink / raw) To: paulmck Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: Kind of offtopic to the original patch, but I couldn't resist... > +config RCU_FAST_NO_HZ > + bool "Accelerate last non-dyntick-idle CPU's grace periods" > + depends on TREE_RCU && NO_HZ && SMP Having such a thing as a config option doesn't really make any sense to me. Who would want to recompile their kernel to enable/disable this? If anything it should be runtime, or better just unconditionally on. In general RCU could probably reduce its number of "weird" Kconfig options. While I think I have a better understanding of RCU than a lot of normal users I often have no clue what to set there when building a kernel. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-26 21:30 ` Andi Kleen
@ 2010-01-26 23:55   ` Mathieu Desnoyers
  2010-01-27  5:21     ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2010-01-26 23:55 UTC (permalink / raw)
To: Andi Kleen
Cc: paulmck, linux-kernel, mingo, laijs, dipankar, akpm, josh, dvhltc,
    niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells

* Andi Kleen (andi@firstfloor.org) wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
> Kind of offtopic to the original patch, but I couldn't
> resist...
>
> > +config RCU_FAST_NO_HZ
> > +	bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > +	depends on TREE_RCU && NO_HZ && SMP
>
> Having such a thing as a config option doesn't really make
> any sense to me. Who would want to recompile their kernel
> to enable/disable this? If anything it should be runtime, or better
> just unconditionally on.
>
> In general RCU could probably reduce its number of "weird"
> Kconfig options.
>
> While I think I have a better understanding of RCU than a lot
> of normal users I often have no clue what to set there when
> building a kernel.

Maybe we could keep them under a CONFIG_DEBUG_RCU umbrella. Compiling
out parts of the rcu options can be useful for debugging purposes, but
I agree with you that end users should not be bothered with that many
options when some of them are "obviously" wanted.

OTOH, I understand that Paul seems to want to introduce new RCU
features gradually, without hitting all kernel users with bugs in newer
features. That's a sane approach to keep things generally stable, but
maybe it is time to set some of the stabilized RCU options to default Y
and move their config to the debug menu.

Let's see what Paul has to say about this...

Thanks,

Mathieu

> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.
-- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 23+ messages in thread
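Mathieu's suggestion could look roughly like the following Kconfig fragment. This is purely illustrative: the use of a `DEBUG_RCU` umbrella symbol and the `default y` are his proposal sketched out, not anything in the posted patch (which uses `default n` and an unconditional prompt):

```kconfig
# Hypothetical: prompt for RCU_FAST_NO_HZ only when RCU debugging is
# enabled; everyone else silently gets the stabilized default.
config RCU_FAST_NO_HZ
	bool "Accelerate last non-dyntick-idle CPU's grace periods" if DEBUG_RCU
	default y
	depends on TREE_RCU && NO_HZ && SMP
```

With `if DEBUG_RCU` on the prompt, distributor configs never see the question, while RCU developers can still compile the feature out.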
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-26 23:55 ` Mathieu Desnoyers @ 2010-01-27 5:21 ` Paul E. McKenney 0 siblings, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2010-01-27 5:21 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells On Tue, Jan 26, 2010 at 06:55:16PM -0500, Mathieu Desnoyers wrote: > * Andi Kleen (andi@firstfloor.org) wrote: > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > > > > Kind of offtopic to the original patch, but I couldn't > > resist... > > > > > +config RCU_FAST_NO_HZ > > > + bool "Accelerate last non-dyntick-idle CPU's grace periods" > > > + depends on TREE_RCU && NO_HZ && SMP > > > > Having such a thing as a config option doesn't really make > > any sense to me. Who would want to recompile their kernel > > to enable/disable this? If anything it should be runtime, or better > > just unconditionally on. > > > > In general RCU could probably reduce its number of "weird" > > Kconfig options. > > > > While I think I have a better understanding of RCU than a lot > > of normal users I often have no clue what to set there when > > building a kernel. > > Maybe we could keep them under a CONFIG_DEBUG_RCU umbrella. Compiling > out parts of the rcu options can be useful for debugging purposes, but > I agree with you that end users should not be bothered with that much > options when some of them are "obviously" wanted. > > OTOH, I understand that Paul seems to want to introduce new RCU > features gradually, without hitting all kernel users with bugs in newer > features. That's a sane approach to keep things generally stable, but > maybe it is time to set some of the stabilized RCU options to default Y > and move their config to the debug menu. That is indeed a big part of the motivation. ;-) Thanx, Paul > Let's see what Paul has to say about this... 
> > Thanks, > > Mathieu > > > > > -Andi > > > > -- > > ak@linux.intel.com -- Speaking for myself only. > > -- > Mathieu Desnoyers > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-26 21:30 ` Andi Kleen 2010-01-26 23:55 ` Mathieu Desnoyers @ 2010-01-27 5:20 ` Paul E. McKenney 2010-01-27 9:43 ` Andi Kleen 1 sibling, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2010-01-27 5:20 UTC (permalink / raw) To: Andi Kleen Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells On Tue, Jan 26, 2010 at 10:30:57PM +0100, Andi Kleen wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > > Kind of offtopic to the original patch, but I couldn't > resist... > > > +config RCU_FAST_NO_HZ > > + bool "Accelerate last non-dyntick-idle CPU's grace periods" > > + depends on TREE_RCU && NO_HZ && SMP > > Having such a thing as a config option doesn't really make > any sense to me. Who would want to recompile their kernel > to enable/disable this? If anything it should be runtime, or better > just unconditionally on. It adds significant overhead on entry to dyntick-idle mode for systems with large numbers of CPUs. :-( > In general RCU could probably reduce its number of "weird" > Kconfig options. > > While I think I have a better understanding of RCU than a lot > of normal users I often have no clue what to set there when > building a kernel. One approach would be to make it be default for small numbers of CPUs (as in systems likely to be battery powered) but not for large numbers of CPUs. The reason I didn't do this initially is that a server-class four-CPU system would have no need for this, but a four-core cellphone most definitely would. So I just created another config variable. In any case, I do agree with your point about reducing the number of config variables. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-27  5:20 ` Paul E. McKenney
@ 2010-01-27  9:43   ` Andi Kleen
  2010-01-27  9:50     ` Peter Zijlstra
  2010-01-27 10:01     ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Andi Kleen @ 2010-01-27 9:43 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm,
    mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt,
    Valdis.Kletnieks, dhowells

On Tue, Jan 26, 2010 at 09:20:50PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2010 at 10:30:57PM +0100, Andi Kleen wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >
> > Kind of offtopic to the original patch, but I couldn't
> > resist...
> >
> > > +config RCU_FAST_NO_HZ
> > > +	bool "Accelerate last non-dyntick-idle CPU's grace periods"
> > > +	depends on TREE_RCU && NO_HZ && SMP
> >
> > Having such a thing as a config option doesn't really make
> > any sense to me. Who would want to recompile their kernel
> > to enable/disable this? If anything it should be runtime, or better
> > just unconditionally on.
>
> It adds significant overhead on entry to dyntick-idle mode for systems
> with large numbers of CPUs. :-(

Can't you simply check that at runtime then?

	if (num_possible_cpus() > 20)
		...

BTW, the new small is large. This year's high-end desktop PC will come
with up to 12 CPU threads. It would likely be challenging to find a good
value for that 20 that holds up in the future.

Or better, perhaps have some threshold so that you don't do it that
often, or only do it when you expect to be idle for a long enough time
that the CPU can enter deeper idle states.

(In higher idle states, a few more wakeups typically don't matter that
much.)

The cpufreq/cstate governors have a reasonably good idea now of how
"idle" the system is and will be. Maybe you can reuse that information
somehow.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 23+ messages in thread
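Andi's two suggestions — gate on CPU count at runtime, and gate on how long the CPU expects to idle — can be combined into a single predicate. The sketch below is hypothetical: the function name, the threshold of 16, and the latency parameters are all illustrative stand-ins (picking the "right" cutoff is exactly the difficulty Andi points out), and a real implementation would pull the expected-idle estimate from the cpuidle governor rather than take it as an argument:

```c
#include <stdbool.h>

/* Illustrative placeholder only, not a recommendation: Andi's point is
 * precisely that any fixed CPU-count cutoff ages badly. */
#define RCU_FAST_NO_HZ_CPU_LIMIT 16

/* Hypothetical runtime gate replacing the Kconfig option: accelerate
 * grace periods only when the machine is small enough for the extra
 * dyntick-idle-entry work to be cheap, and only when the CPU expects
 * to idle long enough to reach a deeper C-state. */
static bool rcu_fast_no_hz_wanted(int possible_cpus,
				  unsigned int expected_idle_us,
				  unsigned int deep_idle_entry_us)
{
	if (possible_cpus > RCU_FAST_NO_HZ_CPU_LIMIT)
		return false;
	return expected_idle_us >= deep_idle_entry_us;
}
```

A small battery-powered system expecting a long idle period passes both tests; a many-CPU server, or a CPU about to be woken again anyway, skips the acceleration.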
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 9:43 ` Andi Kleen @ 2010-01-27 9:50 ` Peter Zijlstra 2010-01-27 10:00 ` Andi Kleen 2010-01-27 10:04 ` Paul E. McKenney 2010-01-27 10:01 ` Paul E. McKenney 1 sibling, 2 replies; 23+ messages in thread From: Peter Zijlstra @ 2010-01-27 9:50 UTC (permalink / raw) To: Andi Kleen Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, rostedt, Valdis.Kletnieks, dhowells On Wed, 2010-01-27 at 10:43 +0100, Andi Kleen wrote: > > Can't you simply check that at runtime then? > > if (num_possible_cpus() > 20) > ... > > BTW the new small is large. This years high end desktop PC will come with > upto 12 CPU threads. It would likely be challenging to find a good > number for 20 that holds up with the future. If only scalability were that easy :/ These massive core/thread count things are causing more problems as well, the cpus/node ratios are constantly growing, giving grief in the page allocator as well as other places that used to scale per node. As to the current problem, the call_rcu() interface doesn't make a hard promise that the callback will be done on the same cpu, right? So why not simply move the callback list over to a more active cpu? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 9:50 ` Peter Zijlstra @ 2010-01-27 10:00 ` Andi Kleen 2010-01-27 10:04 ` Paul E. McKenney 1 sibling, 0 replies; 23+ messages in thread From: Andi Kleen @ 2010-01-27 10:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Andi Kleen, Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, rostedt, Valdis.Kletnieks, dhowells On Wed, Jan 27, 2010 at 10:50:50AM +0100, Peter Zijlstra wrote: > On Wed, 2010-01-27 at 10:43 +0100, Andi Kleen wrote: > > > > Can't you simply check that at runtime then? > > > > if (num_possible_cpus() > 20) > > ... > > > > BTW the new small is large. This years high end desktop PC will come with > > upto 12 CPU threads. It would likely be challenging to find a good > > number for 20 that holds up with the future. > > If only scalability were that easy :/ See the rest of my email... I see you're already suffering from "Linus disease" ... only reading the first paragraph/sentence in the email. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 9:50 ` Peter Zijlstra 2010-01-27 10:00 ` Andi Kleen @ 2010-01-27 10:04 ` Paul E. McKenney 2010-01-27 11:39 ` Nick Piggin 1 sibling, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2010-01-27 10:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, rostedt, Valdis.Kletnieks, dhowells On Wed, Jan 27, 2010 at 10:50:50AM +0100, Peter Zijlstra wrote: > On Wed, 2010-01-27 at 10:43 +0100, Andi Kleen wrote: > > > > Can't you simply check that at runtime then? > > > > if (num_possible_cpus() > 20) > > ... > > > > BTW the new small is large. This years high end desktop PC will come with > > upto 12 CPU threads. It would likely be challenging to find a good > > number for 20 that holds up with the future. > > If only scalability were that easy :/ > > These massive core/thread count things are causing more problems as > well, the cpus/node ratios are constantly growing, giving grief in the > page allocator as well as other places that used to scale per node. > > As to the current problem, the call_rcu() interface doesn't make a hard > promise that the callback will be done on the same cpu, right? So why > not simply move the callback list over to a more active cpu? I could indeed do that. However, there is nothing stopping the more-active CPU from going into dynticks-idle mode between the time that I decide to push the callback to it and the time I actually do the pushing. :-( I considered pushing the callbacks to the orphanage, but that is a global lock that I would rather not acquire on each dyntick-idle transition. This conversation is having the effect of making me much more comfortable adding a kernel configuration parameter. Might not have been the intent, but there you have it! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 10:04 ` Paul E. McKenney @ 2010-01-27 11:39 ` Nick Piggin 2010-01-27 11:59 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2010-01-27 11:39 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, rostedt, Valdis.Kletnieks, dhowells On Wed, Jan 27, 2010 at 02:04:34AM -0800, Paul E. McKenney wrote: > I could indeed do that. However, there is nothing stopping the > more-active CPU from going into dynticks-idle mode between the time > that I decide to push the callback to it and the time I actually do > the pushing. :-( > > I considered pushing the callbacks to the orphanage, but that is a > global lock that I would rather not acquire on each dyntick-idle > transition. Well we already have to do atomic operations on the nohz mask, so maybe it would be acceptable to actually have a spinlock there to serialise operations on the nohz mask and also allow some subsystem specific things (synchronisation here should allow either one of those above approaches). It's not going to be zero cost, but seeing as there is already the contended cacheline there, it's not going to introduce a fundamentally new bottleneck. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 11:39 ` Nick Piggin @ 2010-01-27 11:59 ` Paul E. McKenney 0 siblings, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2010-01-27 11:59 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, rostedt, Valdis.Kletnieks, dhowells On Wed, Jan 27, 2010 at 10:39:22PM +1100, Nick Piggin wrote: > On Wed, Jan 27, 2010 at 02:04:34AM -0800, Paul E. McKenney wrote: > > I could indeed do that. However, there is nothing stopping the > > more-active CPU from going into dynticks-idle mode between the time > > that I decide to push the callback to it and the time I actually do > > the pushing. :-( > > > > I considered pushing the callbacks to the orphanage, but that is a > > global lock that I would rather not acquire on each dyntick-idle > > transition. > > Well we already have to do atomic operations on the nohz mask, so > maybe it would be acceptable to actually have a spinlock there to > serialise operations on the nohz mask and also allow some subsystem > specific things (synchronisation here should allow either one of > those above approaches). > > It's not going to be zero cost, but seeing as there is already the > contended cacheline there, it's not going to introduce a > fundamentally new bottleneck. Good point, although a contended global lock is nastier than a contended cache line. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 9:43 ` Andi Kleen 2010-01-27 9:50 ` Peter Zijlstra @ 2010-01-27 10:01 ` Paul E. McKenney 2010-01-27 10:13 ` Andi Kleen 1 sibling, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2010-01-27 10:01 UTC (permalink / raw) To: Andi Kleen Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells On Wed, Jan 27, 2010 at 10:43:36AM +0100, Andi Kleen wrote: > On Tue, Jan 26, 2010 at 09:20:50PM -0800, Paul E. McKenney wrote: > > On Tue, Jan 26, 2010 at 10:30:57PM +0100, Andi Kleen wrote: > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > > > > > > Kind of offtopic to the original patch, but I couldn't > > > resist... > > > > > > > +config RCU_FAST_NO_HZ > > > > + bool "Accelerate last non-dyntick-idle CPU's grace periods" > > > > + depends on TREE_RCU && NO_HZ && SMP > > > > > > Having such a thing as a config option doesn't really make > > > any sense to me. Who would want to recompile their kernel > > > to enable/disable this? If anything it should be runtime, or better > > > just unconditionally on. > > > > It adds significant overhead on entry to dyntick-idle mode for systems > > with large numbers of CPUs. :-( > > Can't you simply check that at runtime then? > > if (num_possible_cpus() > 20) > ... > > BTW the new small is large. This years high end desktop PC will come with > upto 12 CPU threads. It would likely be challenging to find a good > number for 20 that holds up with the future. And this was another line of reasoning that lead me to the extra kernel config parameter. 
> Or better perhaps have some threshold that you don't do it > that often, or only do it when you expect to be idle for a long > enough time that the CPU can enter deeper idle states > > (I higher idle states some more wakeups typically don't matter > that much) > > The cpufreq/cstate governour have a reasonable good idea > now how "idle" the system is and will be. Maybe you can reuse > that information somehow. My first thought was to find an existing "I am a small device running on battery power" or "low power consumption is critical to me" config parameter. I didn't find anything that looked like that. If there was one, I would make RCU_FAST_NO_HZ depend on it. Or did I miss some kernel parameter or API? Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 10:01 ` Paul E. McKenney @ 2010-01-27 10:13 ` Andi Kleen 2010-01-27 11:44 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Andi Kleen @ 2010-01-27 10:13 UTC (permalink / raw) To: Paul E. McKenney Cc: Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells, arjan > > Can't you simply check that at runtime then? > > > > if (num_possible_cpus() > 20) > > ... > > > > BTW the new small is large. This years high end desktop PC will come with > > upto 12 CPU threads. It would likely be challenging to find a good > > number for 20 that holds up with the future. > > And this was another line of reasoning that lead me to the extra kernel > config parameter. Which doesn't solve the problem at all. > > Or better perhaps have some threshold that you don't do it > > that often, or only do it when you expect to be idle for a long > > enough time that the CPU can enter deeper idle states > > > > (I higher idle states some more wakeups typically don't matter > > that much) > > > > The cpufreq/cstate governour have a reasonable good idea > > now how "idle" the system is and will be. Maybe you can reuse > > that information somehow. > > My first thought was to find an existing "I am a small device running on > battery power" or "low power consumption is critical to me" config > parameter. I didn't find anything that looked like that. If there was > one, I would make RCU_FAST_NO_HZ depend on it. > > Or did I miss some kernel parameter or API? There are a few for scalability (e.g. numa_distance()), but they're obscure. The really good ones are just known somewhere. But I think in this case scalability is not the key thing to check for, but expected idle latency. Even on a large system if near all CPUs are idle spending some time to keep them idle even longer is a good thing. 
But only if the CPUs actually benefit from long idle.

There's the "pm_qos_latency" framework that could be used for this in
theory, but it's not 100% the right match because it's not dynamic.
Unfortunately, last time I looked, the interfaces were rather clumsy
(e.g. they don't allow interrupt-level notifiers).

Better would be some insight into the expected future latency: look at
exporting this information from the various frequency/idle governors.

Perhaps pm_qos_latency could be extended to support that? CC Arjan,
maybe he has some ideas on that.

After all of that there would still, of course, be the question of what
the right latency threshold would be, but at least that's a much easier
question than the number of CPUs.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 10:13 ` Andi Kleen @ 2010-01-27 11:44 ` Paul E. McKenney 2010-01-27 12:11 ` Andi Kleen 0 siblings, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2010-01-27 11:44 UTC (permalink / raw) To: Andi Kleen Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells, arjan On Wed, Jan 27, 2010 at 11:13:42AM +0100, Andi Kleen wrote: > > > Can't you simply check that at runtime then? > > > > > > if (num_possible_cpus() > 20) > > > ... > > > > > > BTW the new small is large. This years high end desktop PC will come with > > > upto 12 CPU threads. It would likely be challenging to find a good > > > number for 20 that holds up with the future. > > > > And this was another line of reasoning that lead me to the extra kernel > > config parameter. > > Which doesn't solve the problem at all. Depending on what you consider the problem to be, of course. >From what I can see, most people would want RCU_FAST_NO_HZ=n. Only people with extreme power-consumption concerns would likely care enough to select this. > > > Or better perhaps have some threshold that you don't do it > > > that often, or only do it when you expect to be idle for a long > > > enough time that the CPU can enter deeper idle states > > > > > > (I higher idle states some more wakeups typically don't matter > > > that much) > > > > > > The cpufreq/cstate governour have a reasonable good idea > > > now how "idle" the system is and will be. Maybe you can reuse > > > that information somehow. > > > > My first thought was to find an existing "I am a small device running on > > battery power" or "low power consumption is critical to me" config > > parameter. I didn't find anything that looked like that. If there was > > one, I would make RCU_FAST_NO_HZ depend on it. > > > > Or did I miss some kernel parameter or API? 
> > There are a few for scalability (e.g. numa_distance()), but they're > obscure. The really good ones are just known somewhere. > > But I think in this case scalability is not the key thing to check > for, but expected idle latency. Even on a large system if near all > CPUs are idle spending some time to keep them idle even longer is a good > thing. But only if the CPUs actually benefit from long idle. The larger the number of CPUs, the lower the probability of all of them going idle, so the less difference this patch makes. Perhaps some larger system will care about this on a per-socket basis, but I have yet to hear any requests. > There's the "pm_qos_latency" frame work that could be used for this > in theory, but it's not 100% the right match because it's not > dynamic. > > Unfortunately last time I looked the interfaces were rather clumpsy > (e.g. don't allow interrupt level notifiers) I do need to query from interrupt context, but could potentially have a notifier set up state for me. Still, the real question is "how important is a small reduction in power consumption?" > Better would be some insight into the expected future latency: > look at exporting this information from the various frequency/idle > governours. > > Perhaps pm_qos_latency could be extended to support that? > CC Arjan, maybe he has some ideas on that. > > After all of that there would be still of course the question > what the right latency threshold would be, but at least that's > a much easier question that number of CPUs. Hmmm... I am still believing that very few people want RCU_FAST_NO_HZ, and that those who want it can select it for their devices. 
Trying to apply this to server-class machines gets into questions like
"where are the core/socket boundaries", "can this hardware turn entire
cores/sockets off", "given the current workload, does it really make
sense to try to turn off entire cores/sockets", and "are a few ticks
important when migrating processes, irqs, timers, and whatever else is
required to actually turn off a given core or socket for a significant
time period".

I took a quick look at the pm_qos_latency, and, as you note, it doesn't
really seem to be designed to handle this situation.

And we really should not be gold-plating this thing. I have one
requester (off list) who needs it badly, and who is willing to deal
with a kernel configuration parameter. I have no other requesters, and
therefore cannot reasonably anticipate their needs. As a result, we
cannot justify building any kind of infrastructure beyond what is
reasonable for the single requester.

Maybe the situation will be different next year. But if so, we would
then have some information on what people really need. So, if it turns
out that more will be needed in 2011, I will be happy to do something
about it once I have some hard information on what will really be
needed.

Fair enough?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU 2010-01-27 11:44 ` Paul E. McKenney @ 2010-01-27 12:11 ` Andi Kleen 2010-01-27 13:23 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Andi Kleen @ 2010-01-27 12:11 UTC (permalink / raw) To: Paul E. McKenney Cc: Andi Kleen, linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells, arjan > From what I can see, most people would want RCU_FAST_NO_HZ=n. Only Most people do not recompile their kernel. And even those that do most likely will not have enough information to make an informed choice at build time. > people with extreme power-consumption concerns would likely care enough > to select this. What would a distributor shipping binary kernels use? > > But I think in this case scalability is not the key thing to check > > for, but expected idle latency. Even on a large system if near all > > CPUs are idle spending some time to keep them idle even longer is a good > > thing. But only if the CPUs actually benefit from long idle. > > The larger the number of CPUs, the lower the probability of all of them > going idle, so the less difference this patch makes. Perhaps some My shiny new 8 CPU threads desktop is not less likely to go idle when I do nothing on it than an older dual core 2 CPU thread desktop. Especially not given all the recent optimizations (no idle tick) in this area etc. And core/thread counts are growing. In terms of CPU numbers today's large machine is tomorrow's small machine. > I do need to query from interrupt context, but could potentially have a > notifier set up state for me. Still, the real question is "how important > is a small reduction in power consumption?" I think any (measurable) power saving is important. Also on modern Intel CPUs power saving often directly translates into performance: if more cores are idle the others can clock faster. 
> I took a quick look at the pm_qos_latency code, and, as you note, it
> doesn't really seem to be designed to handle this situation.

It could be extended for it.  It's just software after all; we can
change it.

> And we really should not be gold-plating this thing.  I have one requester
> (off list) who needs it badly, and who is willing to deal with a kernel
> configuration parameter.  I have no other requesters, and therefore
> cannot reasonably anticipate their needs.  As a result, we cannot justify
> building any kind of infrastructure beyond what is reasonable for the
> single requester.

If this has a measurable power advantage, I think it's better to do the
extra steps to make it usable everywhere, with automatic heuristics and
no Kconfig hacks.

If it's not, then it's probably not worth merging.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU
  2010-01-27 12:11               ` Andi Kleen
@ 2010-01-27 13:23                 ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2010-01-27 13:23 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks,
	dhowells, arjan

On Wed, Jan 27, 2010 at 01:11:50PM +0100, Andi Kleen wrote:
> > From what I can see, most people would want RCU_FAST_NO_HZ=n.  Only
>
> Most people do not recompile their kernel.  And even those
> that do most likely will not have enough information to make
> an informed choice at build time.

I believe that only a few embedded people will be using RCU_FAST_NO_HZ=y.

> > people with extreme power-consumption concerns would likely care enough
> > to select this.
>
> What would a distributor shipping binary kernels use?

RCU_FAST_NO_HZ=n.

> > > But I think in this case scalability is not the key thing to check
> > > for, but expected idle latency.  Even on a large system if near all
> > > CPUs are idle spending some time to keep them idle even longer is a good
> > > thing.  But only if the CPUs actually benefit from long idle.
> >
> > The larger the number of CPUs, the lower the probability of all of them
> > going idle, so the less difference this patch makes.  Perhaps some
>
> My shiny new 8 CPU threads desktop is not less likely to go idle when I
> do nothing on it than an older dual core 2 CPU thread desktop.
>
> Especially not given all the recent optimizations (no idle tick) in this
> area etc.
>
> And core/thread counts are growing.  In terms of CPU numbers, today's
> large machine is tomorrow's small machine.

But your shiny new 8-CPU-thread desktop runs off of AC power, right?
If so, I don't think you will care about a 4-5-tick delay for the last
CPU going into dyntick-idle mode.
And I bet you won't be able to measure the difference on your
battery-powered laptop.

> > I do need to query from interrupt context, but could potentially have a
> > notifier set up state for me.  Still, the real question is "how important
> > is a small reduction in power consumption?"
>
> I think any (measurable) power saving is important.  Also, on modern
> Intel CPUs power saving often directly translates into performance: if
> more cores are idle, the others can clock faster.

OK, I am testing a corrected patch with the kernel configuration
parameter.  If you can show a measurable difference on typical
desktop/server systems, then we can look into doing something more
generally useful.

> > I took a quick look at the pm_qos_latency code, and, as you note, it
> > doesn't really seem to be designed to handle this situation.
>
> It could be extended for it.  It's just software after all; we can
> change it.

Of course we can change it.  But should we?

> > And we really should not be gold-plating this thing.  I have one requester
> > (off list) who needs it badly, and who is willing to deal with a kernel
> > configuration parameter.  I have no other requesters, and therefore
> > cannot reasonably anticipate their needs.  As a result, we cannot justify
> > building any kind of infrastructure beyond what is reasonable for the
> > single requester.
>
> If this has a measurable power advantage, I think it's better to do the
> extra steps to make it usable everywhere, with automatic heuristics and
> no Kconfig hacks.

I would agree with the following: if this has a measurable power
advantage -on- -a- -large- -fraction- -of- -systems-, then it -might-
be better to do extra steps to make it usable everywhere, which -might-
involve heuristics instead of a kernel configuration parameter.

> If it's not, then it's probably not worth merging.

This is not necessarily the case.
It can make a lot of sense to try something for a special case, and
then use the experience gained in that special case to produce a good
solution.  On the other hand, it does not necessarily make sense to do
a lot of possibly useless work based on vague guesses as to what is
needed.

If we merge the special case, then others have the opportunity to try
it out, thus getting us the experience required to see (1) whether
something more general-purpose is needed in the first place and (2) if
so, what that more general-purpose thing might look like.

							Thanx, Paul
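For reference, the special-case mechanism being debated here can be sketched as a small standalone model.  This is not the kernel implementation — the struct, function names, and the flush limit below are all hypothetical — but it illustrates the idea from the patch: on entry to dyntick-idle mode, repeatedly force grace periods to drain this CPU's callbacks, giving up after a fixed number of passes so that a callback that re-posts itself via call_rcu() cannot starve entry into the low-power state.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed small bound on forced grace-period passes (illustrative value). */
#define MAX_FLUSHES 5

/* Hypothetical model of one CPU's callback state. */
struct cpu_model {
	int pending_callbacks;  /* callbacks currently queued on this CPU */
	int callbacks_reposted; /* callbacks that re-queue themselves when invoked */
};

/*
 * Model of one pass of the grace-period machinery: complete a grace
 * period and invoke the callbacks it covers.  Self-posting callbacks
 * re-queue themselves, so the queue may refill.
 */
static void force_grace_period(struct cpu_model *cpu)
{
	int invoked = cpu->pending_callbacks;

	cpu->pending_callbacks = 0;
	if (invoked > 0 && cpu->callbacks_reposted > 0)
		cpu->pending_callbacks = cpu->callbacks_reposted;
}

/*
 * Model of the rcu_needs_cpu() decision under the patch: try up to
 * MAX_FLUSHES passes to drain the queue, then report whether this CPU
 * still needs the scheduling-clock tick.
 */
static bool rcu_needs_cpu_model(struct cpu_model *cpu)
{
	for (int i = 0; i < MAX_FLUSHES && cpu->pending_callbacks > 0; i++)
		force_grace_period(cpu);
	return cpu->pending_callbacks > 0;
}
```

With ordinary callbacks the queue drains in one pass and the CPU may enter dyntick-idle immediately; with a self-posting callback the bound is hit and the model keeps the tick, which is the starvation-avoidance property the patch description calls out.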
end of thread, other threads:[~2010-01-27 14:52 UTC | newest]

Thread overview: 23+ messages:
2010-01-25  3:48 [PATCH RFC tip/core/rcu] accelerate grace period if last non-dynticked CPU Paul E. McKenney
2010-01-25 12:28 ` Lai Jiangshan
2010-01-25 12:35 ` Peter Zijlstra
2010-01-25 15:08 ` Steven Rostedt
2010-01-27  5:17 ` Paul E. McKenney
2010-01-25 15:12 ` Steven Rostedt
2010-01-27 14:11 ` Paul E. McKenney
2010-01-27 14:52 ` Steven Rostedt
2010-01-26 21:30 ` Andi Kleen
2010-01-26 23:55 ` Mathieu Desnoyers
2010-01-27  5:21 ` Paul E. McKenney
2010-01-27  5:20 ` Paul E. McKenney
2010-01-27  9:43 ` Andi Kleen
2010-01-27  9:50 ` Peter Zijlstra
2010-01-27 10:00 ` Andi Kleen
2010-01-27 10:04 ` Paul E. McKenney
2010-01-27 11:39 ` Nick Piggin
2010-01-27 11:59 ` Paul E. McKenney
2010-01-27 10:01 ` Paul E. McKenney
2010-01-27 10:13 ` Andi Kleen
2010-01-27 11:44 ` Paul E. McKenney
2010-01-27 12:11 ` Andi Kleen
2010-01-27 13:23 ` Paul E. McKenney