[PATCH RFC] rcu: Permit limiting of force_quiescent

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency
@ 2012-03-16 16:03 Paul E. McKenney
  2012-03-17 16:57 ` Mike Galbraith
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2012-03-16 16:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet, darren,
	fweisbec, efault, sivanich

Hello, Dimitri and Mike,

This one is a first step in limiting RCU-induced latencies for systems
that have not just NR_CPUS=4096, but also a lot of CPUs.  I have tested
this using my big-system-simulation setup, but of course testing on
systems that really are big would be quite valuable.

Remaining issues include the latency of setting up grace periods and
the latency of finalizing them in the case where another grace period
is not immediately rquired.

							Thanx, Paul

------------------------------------------------------------------------

Systems whose cache-miss penalties are large compared to the permitted
per-CPU force_quiescent_state() latency can limit these latencies by
defining CONFIG_ARCH_RCU_FQS_LIMIT to be the maximum number of CPUs
that force_quiescent_state() is permitted to visit per invocation.
If this limit is exceeded, the next CPU to scan is stored into a new
->fqs_next_cpu field in the rcu_state structure, and the existing
->jiffies_force_qs field is set so that the next CPU will re-invoke
force_quiescent_state() rather than waiting a few jiffies.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 7247fa8..e99b405 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1685,47 +1685,66 @@ void rcu_check_callbacks(int cpu, int user)
  * have not yet encountered a quiescent state, using the function specified.
  * Also initiate boosting for any threads blocked on the root rcu_node.
  *
+ * Returns 0 if the scan could not be completed, 1 otherwise.
  * The caller must have suppressed start of new grace periods.
  */
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
+static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 {
 	unsigned long bit;
 	int cpu;
+	int cpusvisited = 0;
 	unsigned long flags;
 	unsigned long mask;
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
+		if (rnp->grphi < rsp->fqs_next_cpu)
+			continue;
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		if (!rcu_gp_in_progress(rsp)) {
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-			return;
+			return 1;
 		}
 		if (rnp->qsmask == 0) {
 			rcu_initiate_boost(rnp, flags); /* releases rnp->lock */
 			continue;
 		}
 		cpu = rnp->grplo;
+		if (cpu < rsp->fqs_next_cpu)
+			cpu = rsp->fqs_next_cpu;
 		bit = 1;
 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
-			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu)))
+			if ((rnp->qsmask & bit) == 0)
+				continue;
+			if (cpu > rcu_get_max_cpu())
+				goto done;
+			if (f(per_cpu_ptr(rsp->rda, cpu)))
 				mask |= bit;
+			if (++cpusvisited >= FQS_MAX_CPU_VISIT) {
+				rsp->fqs_next_cpu = cpu + 1;
+				break;
+			}
 		}
 		if (mask != 0) {
 
 			/* rcu_report_qs_rnp() releases rnp->lock. */
 			rcu_report_qs_rnp(mask, rsp, rnp, flags);
-			continue;
+			if (cpusvisited < FQS_MAX_CPU_VISIT)
+				continue;
 		}
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		if (cpusvisited >= FQS_MAX_CPU_VISIT)
+			break;
 	}
 	rnp = rcu_get_root(rsp);
 	if (rnp->qsmask == 0) {
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		rcu_initiate_boost(rnp, flags); /* releases rnp->lock. */
 	}
+done:
+	rsp->fqs_next_cpu = 0;
+	return 1;
 }
 
 /*
@@ -1735,6 +1754,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
 {
 	unsigned long flags;
+	int retval;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	trace_rcu_utilization("Start fqs");
@@ -1771,21 +1791,25 @@ static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
 		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
 
 		/* Record dyntick-idle state. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		retval = force_qs_rnp(rsp, dyntick_save_progress_counter);
 		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		if (rcu_gp_in_progress(rsp))
+		if (rcu_gp_in_progress(rsp) && retval) {
 			rsp->fqs_state = RCU_FORCE_QS;
+			rsp->fqs_next_cpu = 0;
+		} else if (rcu_gp_in_progress(rsp))
+			rsp->jiffies_force_qs = jiffies - 1;
 		break;
 
 	case RCU_FORCE_QS:
 
 		/* Check dyntick-idle state, send IPI to laggarts. */
 		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+		retval = force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
 
 		/* Leave state in case more forcing is required. */
-
 		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
+		if (rcu_gp_in_progress(rsp) && !retval)
+			rsp->jiffies_force_qs = jiffies - 1;
 		break;
 	}
 	rsp->fqs_active = 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 772df1c..444a39b 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -348,6 +348,21 @@ do {									\
 } while (0)
 
 /*
+ * Large latency-sensitive configurations can limit force_quiescent_state()
+ * latencies by defining an CONFIG_ARCH_RCU_FQS_LIMIT.  This should be
+ * sized based on that architecture's cache-miss latency and the maximum
+ * desired force_quiescent_state latency.  For example, if the cache-miss
+ * latency was 100 nanoseconds, and the maximum force_quiescent_state()
+ * latency contribution was 5 microseconds, then that architecture should
+ * define CONFIG_ARCH_RCU_FQS_LIMIT to be 50.
+ */
+#ifdef CONFIG_ARCH_RCU_FQS_LIMIT
+#define FQS_MAX_CPU_VISIT CONFIG_ARCH_RCU_FQS_LIMIT
+#else /* #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
+#define FQS_MAX_CPU_VISIT NR_CPUS
+#endif /* #else #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
+
+/*
  * RCU global state, including node hierarchy.  This hierarchy is
  * represented in "heap" form in a dense array.  The root (first level)
  * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
@@ -396,6 +411,7 @@ struct rcu_state {
 						/*  or NULL if no barrier. */
 	raw_spinlock_t fqslock;			/* Only one task forcing */
 						/*  quiescent states. */
+	int fqs_next_cpu;			/* Next CPU for fqs to scan. */
 	unsigned long jiffies_force_qs;		/* Time at which to invoke */
 						/*  force_quiescent_state(). */
 	unsigned long n_force_qs;		/* Number of calls to */


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency
  2012-03-16 16:03 [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency Paul E. McKenney
@ 2012-03-17 16:57 ` Mike Galbraith
  2012-03-18  7:19   ` Mike Galbraith
  2012-03-18 15:00   ` Mike Galbraith
  0 siblings, 2 replies; 6+ messages in thread
From: Mike Galbraith @ 2012-03-17 16:57 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sivanich

On Fri, 2012-03-16 at 09:03 -0700, Paul E. McKenney wrote: 
> Hello, Dimitri and Mike,
> 
> This one is a first step in limiting RCU-induced latencies for systems
> that have not just NR_CPUS=4096, but also a lot of CPUs.  I have tested
> this using my big-system-simulation setup, but of course testing on
> systems that really are big would be quite valuable.

This didn't work for me.

On a 48 core DL785, with your GP initialization patch applied, and
tracing with irqsoff tracer with box idling along, tracer reports >
200us latencies (tracer overhead alert) from force_qs_rnp() banging on
254 locks.. iow it didn't stop traversing, did a good imitation of a
latency spike.

So I bent it up a little, see below.  I haven't instrumented it yet, and
might have busted it all to pieces, so RFC back.


> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 7247fa8..e99b405 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1685,47 +1685,66 @@ void rcu_check_callbacks(int cpu, int user)
>   * have not yet encountered a quiescent state, using the function specified.
>   * Also initiate boosting for any threads blocked on the root rcu_node.
>   *
> + * Returns 0 if the scan could not be completed, 1 otherwise.
>   * The caller must have suppressed start of new grace periods.
>   */

But it fibs, always returns 1.

> -static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
> +static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
>  {
>  	unsigned long bit;
>  	int cpu;
> +	int cpusvisited = 0;
>  	unsigned long flags;
>  	unsigned long mask;
>  	struct rcu_node *rnp;
>  
>  	rcu_for_each_leaf_node(rsp, rnp) {
..... 
> 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> -			if ((rnp->qsmask & bit) != 0 &&
> -			    f(per_cpu_ptr(rsp->rda, cpu)))
> +			if ((rnp->qsmask & bit) == 0)
> +				continue;
> +			if (cpu > rcu_get_max_cpu())
> +				goto done;

No unlock.

.... 
> +done:
> +	rsp->fqs_next_cpu = 0;
> +	return 1;
>  }

rsp->fqs_next_cpu = 0 means we start over, no?

WRT FQS_MAX_CPU_VISIT, maybe a read_mostly tweakable instead?  (Like we
don't have enough tweakables lying about)


Anyway, this is what I did.  Setting ARCH_RCU_FQS_LIMIT to 64 on a 48
core box didn't test sip vs gulp logic particularly well, but that
doesn't matter ATM.  The only test done was to boot it and see if it
stopped banging on oodles of locks.

Ok, box has been running rcutorture for ten whole minutes now without
deadlocking or exploding.. ship it, and go have part of a weekend.

<poof>


rcu: Permit limiting of force_quiescent_state() latency

Systems whose cache-miss penalties are large compared to the permitted
per-CPU force_quiescent_state() latency can limit these latencies by
defining CONFIG_ARCH_RCU_FQS_LIMIT to be the maximum number of CPUs
that force_quiescent_state() is permitted to visit per invocation.
If this limit is exceeded, the next CPU to scan is stored into a new
->fqs_next_cpu field in the rcu_state structure, and the existing
->jiffies_force_qs field is set so that the next CPU will re-invoke
force_quiescent_state() rather than waiting a few jiffies.

<snip SOB>
---
 arch/x86/Kconfig |    4 ++++
 kernel/rcutree.c |   48 +++++++++++++++++++++++++++++++++++++++---------
 kernel/rcutree.h |   16 ++++++++++++++++
 3 files changed, 59 insertions(+), 9 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -254,6 +254,10 @@ config ARCH_CPU_PROBE_RELEASE
 	def_bool y
 	depends on HOTPLUG_CPU && !XEN
 
+config ARCH_RCU_FQS_LIMIT
+	int
+	default 64
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1316,47 +1316,72 @@ void rcu_check_callbacks(int cpu, int us
  * have not yet encountered a quiescent state, using the function specified.
  * Also initiate boosting for any threads blocked on the root rcu_node.
  *
+ * Returns 0 if the scan could not be completed, 1 otherwise.
  * The caller must have suppressed start of new grace periods.
  */
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
+static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 {
 	unsigned long bit;
 	int cpu;
+	int cpusvisited = 0, done = 1;
 	unsigned long flags;
 	unsigned long mask;
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
+		if (rnp->grphi < rsp->fqs_next_cpu)
+			continue;
+		if (rnp->grplo > rcu_get_max_cpu())
+			continue;
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		if (!rcu_gp_in_progress(rsp)) {
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-			return;
+			return done;
 		}
 		if (rnp->qsmask == 0) {
 			rcu_initiate_boost(rnp, flags); /* releases rnp->lock */
 			continue;
 		}
 		cpu = rnp->grplo;
+		if (cpu < rsp->fqs_next_cpu)
+			cpu = rsp->fqs_next_cpu;
 		bit = 1;
 		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
-			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu)))
+			if ((rnp->qsmask & bit) == 0)
+				continue;
+			if (f(per_cpu_ptr(rsp->rda, cpu)))
 				mask |= bit;
+			rsp->fqs_next_cpu = cpu + 1;
+			if (cpu >= rcu_get_max_cpu())
+				break;
+			if (++cpusvisited >= FQS_MAX_CPU_VISIT) {
+				if (FQS_MAX_CPU_VISIT != NR_CPUS)
+					done = 0;
+				break;
+			}
 		}
 		if (mask != 0) {
 
 			/* rcu_report_qs_rnp() releases rnp->lock. */
 			rcu_report_qs_rnp(mask, rsp, rnp, flags);
-			continue;
+			if (cpusvisited < FQS_MAX_CPU_VISIT)
+				continue;
 		}
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		if (cpusvisited >= FQS_MAX_CPU_VISIT)
+			break;
 	}
 	rnp = rcu_get_root(rsp);
 	if (rnp->qsmask == 0) {
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		rcu_initiate_boost(rnp, flags); /* releases rnp->lock. */
 	}
+
+	if (done)
+		rsp->fqs_next_cpu = 0;
+
+	return done;
 }
 
 /*
@@ -1366,6 +1391,7 @@ static void force_qs_rnp(struct rcu_stat
 static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
 {
 	unsigned long flags;
+	int retval;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	if (!rcu_gp_in_progress(rsp))
@@ -1398,21 +1424,25 @@ static void force_quiescent_state(struct
 		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
 
 		/* Record dyntick-idle state. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		retval = force_qs_rnp(rsp, dyntick_save_progress_counter);
 		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		if (rcu_gp_in_progress(rsp))
+		if (rcu_gp_in_progress(rsp) && retval) {
 			rsp->signaled = RCU_FORCE_QS;
+			rsp->fqs_next_cpu = 0;
+		} else if (rcu_gp_in_progress(rsp))
+			rsp->jiffies_force_qs = jiffies - 1;
 		break;
 
 	case RCU_FORCE_QS:
 
 		/* Check dyntick-idle state, send IPI to laggarts. */
 		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+		retval = force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
 
 		/* Leave state in case more forcing is required. */
-
 		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
+		if (rcu_gp_in_progress(rsp) && !retval)
+			rsp->jiffies_force_qs = jiffies - 1;
 		break;
 	}
 	rsp->fqs_active = 0;
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -354,6 +354,21 @@ do {									\
 } while (0)
 
 /*
+ * Large latency-sensitive configurations can limit force_quiescent_state()
+ * latencies by defining an CONFIG_ARCH_RCU_FQS_LIMIT.  This should be
+ * sized based on that architecture's cache-miss latency and the maximum
+ * desired force_quiescent_state latency.  For example, if the cache-miss
+ * latency was 100 nanoseconds, and the maximum force_quiescent_state()
+ * latency contribution was 5 microseconds, then that architecture should
+ * define CONFIG_ARCH_RCU_FQS_LIMIT to be 50.
+ */
+#ifdef CONFIG_ARCH_RCU_FQS_LIMIT
+#define FQS_MAX_CPU_VISIT CONFIG_ARCH_RCU_FQS_LIMIT
+#else /* #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
+#define FQS_MAX_CPU_VISIT NR_CPUS
+#endif /* #else #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
+
+/*
  * RCU global state, including node hierarchy.  This hierarchy is
  * represented in "heap" form in a dense array.  The root (first level)
  * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
@@ -391,6 +406,7 @@ struct rcu_state {
 						/*  starting new GP. */
 	raw_spinlock_t fqslock;			/* Only one task forcing */
 						/*  quiescent states. */
+	int fqs_next_cpu;			/* Next CPU for fqs to scan. */
 	unsigned long jiffies_force_qs;		/* Time at which to invoke */
 						/*  force_quiescent_state(). */
 	unsigned long n_force_qs;		/* Number of calls to */



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency
  2012-03-17 16:57 ` Mike Galbraith
@ 2012-03-18  7:19   ` Mike Galbraith
  2012-03-18 15:00   ` Mike Galbraith
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Galbraith @ 2012-03-18  7:19 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sivanich

On Sat, 2012-03-17 at 17:57 +0100, Mike Galbraith wrote:

> So I bent it up a little, see below.

s/bent/busted.  FQS_MAX_CPU_VISIT < CPUs == deadlock.  Poo.

-Mike


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency
  2012-03-17 16:57 ` Mike Galbraith
  2012-03-18  7:19   ` Mike Galbraith
@ 2012-03-18 15:00   ` Mike Galbraith
  2012-03-19  0:10     ` Paul E. McKenney
  1 sibling, 1 reply; 6+ messages in thread
From: Mike Galbraith @ 2012-03-18 15:00 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sivanich

On Sat, 2012-03-17 at 17:57 +0100, Mike Galbraith wrote: 

> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1316,47 +1316,72 @@ void rcu_check_callbacks(int cpu, int us
>   * have not yet encountered a quiescent state, using the function specified.
>   * Also initiate boosting for any threads blocked on the root rcu_node.
>   *
> + * Returns 0 if the scan could not be completed, 1 otherwise.
>   * The caller must have suppressed start of new grace periods.
>   */
> -static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
> +static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
>  {
>  	unsigned long bit;
>  	int cpu;
> +	int cpusvisited = 0, done = 1;
>  	unsigned long flags;
>  	unsigned long mask;
>  	struct rcu_node *rnp;
>  
>  	rcu_for_each_leaf_node(rsp, rnp) {
> +		if (rnp->grphi < rsp->fqs_next_cpu)
> +			continue;
> +		if (rnp->grplo > rcu_get_max_cpu())
> +			continue;

...

> 		if (mask != 0) {
>  
>  			/* rcu_report_qs_rnp() releases rnp->lock. */
>  			rcu_report_qs_rnp(mask, rsp, rnp, flags);
> -			continue;
> +			if (cpusvisited < FQS_MAX_CPU_VISIT)    deadlock: if no continue..
> +				continue;
>  		}
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);  _thoroughly_ unlock it.


Boots/runs.  Altered to not bail mid-group.

I tried a synchronized jitter test on 45 cores, patches made diddly spit
difference.  There's too much other noise.  Also applied both patches to
RT, and tested on 60 cores, which also made little if any difference,
but then jitter is single digit (usecs) running this test with the RT
kernel already.  With NOPREEMPT kernel, jitter is low triple digit worst
case with or without patches.

---
 arch/x86/Kconfig |    4 ++++
 kernel/rcutree.c |   49 ++++++++++++++++++++++++++++++++++++-------------
 kernel/rcutree.h |   16 ++++++++++++++++
 3 files changed, 56 insertions(+), 13 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -254,6 +254,10 @@ config ARCH_CPU_PROBE_RELEASE
 	def_bool y
 	depends on HOTPLUG_CPU && !XEN
 
+config ARCH_RCU_FQS_LIMIT
+	int
+	default 16
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1316,47 +1316,65 @@ void rcu_check_callbacks(int cpu, int us
  * have not yet encountered a quiescent state, using the function specified.
  * Also initiate boosting for any threads blocked on the root rcu_node.
  *
+ * Returns 0 if the scan could not be completed, 1 otherwise.
  * The caller must have suppressed start of new grace periods.
  */
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
+static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 {
 	unsigned long bit;
 	int cpu;
+	int cpusvisited = 0, done = 1;
 	unsigned long flags;
 	unsigned long mask;
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
+		if (rnp->grphi < rsp->fqs_next_cpu)
+			continue;
+		if (rnp->grplo > rcu_get_max_cpu())
+			continue;
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		if (!rcu_gp_in_progress(rsp)) {
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-			return;
+			return done;
 		}
 		if (rnp->qsmask == 0) {
 			rcu_initiate_boost(rnp, flags); /* releases rnp->lock */
 			continue;
 		}
-		cpu = rnp->grplo;
 		bit = 1;
-		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
-			if ((rnp->qsmask & bit) != 0 &&
-			    f(per_cpu_ptr(rsp->rda, cpu)))
+		rsp->fqs_next_cpu = rnp->grphi + 1;
+		cpusvisited += rnp->grphi - rnp->grplo;
+		for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, bit <<= 1) {
+			if ((rnp->qsmask & bit) == 0)
+				continue;
+			if (f(per_cpu_ptr(rsp->rda, cpu)))
 				mask |= bit;
 		}
-		if (mask != 0) {
 
+		if (mask != 0) {
 			/* rcu_report_qs_rnp() releases rnp->lock. */
 			rcu_report_qs_rnp(mask, rsp, rnp, flags);
-			continue;
+		} else
+			raw_spin_unlock_irqrestore(&rnp->lock, flags);
+
+		if (cpusvisited >= FQS_MAX_CPU_VISIT) {
+			if (FQS_MAX_CPU_VISIT != NR_CPUS)
+				done = 0;
+			break;
 		}
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 	rnp = rcu_get_root(rsp);
 	if (rnp->qsmask == 0) {
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		rcu_initiate_boost(rnp, flags); /* releases rnp->lock. */
 	}
+
+	if (done)
+		rsp->fqs_next_cpu = 0;
+
+	return done;
 }
 
 /*
@@ -1366,6 +1384,7 @@ static void force_qs_rnp(struct rcu_stat
 static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
 {
 	unsigned long flags;
+	int retval;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	if (!rcu_gp_in_progress(rsp))
@@ -1398,21 +1417,25 @@ static void force_quiescent_state(struct
 		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
 
 		/* Record dyntick-idle state. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		retval = force_qs_rnp(rsp, dyntick_save_progress_counter);
 		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		if (rcu_gp_in_progress(rsp))
+		if (rcu_gp_in_progress(rsp) && retval) {
 			rsp->signaled = RCU_FORCE_QS;
+			rsp->fqs_next_cpu = 0;
+		} else if (rcu_gp_in_progress(rsp))
+			rsp->jiffies_force_qs = jiffies - 1;
 		break;
 
 	case RCU_FORCE_QS:
 
 		/* Check dyntick-idle state, send IPI to laggarts. */
 		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+		retval = force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
 
 		/* Leave state in case more forcing is required. */
-
 		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
+		if (rcu_gp_in_progress(rsp) && !retval)
+			rsp->jiffies_force_qs = jiffies - 1;
 		break;
 	}
 	rsp->fqs_active = 0;
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -354,6 +354,21 @@ do {									\
 } while (0)
 
 /*
+ * Large latency-sensitive configurations can limit force_quiescent_state()
+ * latencies by defining an CONFIG_ARCH_RCU_FQS_LIMIT.  This should be
+ * sized based on that architecture's cache-miss latency and the maximum
+ * desired force_quiescent_state latency.  For example, if the cache-miss
+ * latency was 100 nanoseconds, and the maximum force_quiescent_state()
+ * latency contribution was 5 microseconds, then that architecture should
+ * define CONFIG_ARCH_RCU_FQS_LIMIT to be 50.
+ */
+#ifdef CONFIG_ARCH_RCU_FQS_LIMIT
+#define FQS_MAX_CPU_VISIT CONFIG_ARCH_RCU_FQS_LIMIT
+#else /* #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
+#define FQS_MAX_CPU_VISIT NR_CPUS
+#endif /* #else #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
+
+/*
  * RCU global state, including node hierarchy.  This hierarchy is
  * represented in "heap" form in a dense array.  The root (first level)
  * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
@@ -391,6 +406,7 @@ struct rcu_state {
 						/*  starting new GP. */
 	raw_spinlock_t fqslock;			/* Only one task forcing */
 						/*  quiescent states. */
+	int fqs_next_cpu;			/* Next CPU for fqs to scan. */
 	unsigned long jiffies_force_qs;		/* Time at which to invoke */
 						/*  force_quiescent_state(). */
 	unsigned long n_force_qs;		/* Number of calls to */



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency
  2012-03-18 15:00   ` Mike Galbraith
@ 2012-03-19  0:10     ` Paul E. McKenney
  2012-03-19  3:47       ` Mike Galbraith
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2012-03-19  0:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sivanich

On Sun, Mar 18, 2012 at 04:00:21PM +0100, Mike Galbraith wrote:
> On Sat, 2012-03-17 at 17:57 +0100, Mike Galbraith wrote: 
> 
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1316,47 +1316,72 @@ void rcu_check_callbacks(int cpu, int us
> >   * have not yet encountered a quiescent state, using the function specified.
> >   * Also initiate boosting for any threads blocked on the root rcu_node.
> >   *
> > + * Returns 0 if the scan could not be completed, 1 otherwise.
> >   * The caller must have suppressed start of new grace periods.
> >   */
> > -static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
> > +static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
> >  {
> >  	unsigned long bit;
> >  	int cpu;
> > +	int cpusvisited = 0, done = 1;
> >  	unsigned long flags;
> >  	unsigned long mask;
> >  	struct rcu_node *rnp;
> >  
> >  	rcu_for_each_leaf_node(rsp, rnp) {
> > +		if (rnp->grphi < rsp->fqs_next_cpu)
> > +			continue;
> > +		if (rnp->grplo > rcu_get_max_cpu())
> > +			continue;
> 
> ...
> 
> > 		if (mask != 0) {
> >  
> >  			/* rcu_report_qs_rnp() releases rnp->lock. */
> >  			rcu_report_qs_rnp(mask, rsp, rnp, flags);
> > -			continue;
> > +			if (cpusvisited < FQS_MAX_CPU_VISIT)    deadlock: if no continue..
> > +				continue;
> >  		}
> >  		raw_spin_unlock_irqrestore(&rnp->lock, flags);  _thoroughly_ unlock it.
> 
> 
> Boots/runs.  Altered to not bail mid-group.
> 
> I tried a synchronized jitter test on 45 cores, patches made diddly spit
> difference.  There's too much other noise.  Also applied both patches to
> RT, and tested on 60 cores, which also made little if any difference,
> but then jitter is single digit (usecs) running this test with the RT
> kernel already.  With NOPREEMPT kernel, jitter is low triple digit worst
> case with or without patches.

Thank you very much for your efforts on this!!!

Given that you were seeing about 200-microsecond latency spikes on
grace-period initialization, I would expect that you would need about
200 dyntick-idle CPUs for force_quiescent_state() to give you a
ten-microsecond spike, so I am not surprised that you could not see
the difference on 60 CPUs, which probably have given you something
like 3 microseconds..

I will take a closer look at your patch (thank you for posting it!) --
but I still have not given up hope on the CPU-by-CPU approach.

							Thanx, Paul

> ---
>  arch/x86/Kconfig |    4 ++++
>  kernel/rcutree.c |   49 ++++++++++++++++++++++++++++++++++++-------------
>  kernel/rcutree.h |   16 ++++++++++++++++
>  3 files changed, 56 insertions(+), 13 deletions(-)
> 
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -254,6 +254,10 @@ config ARCH_CPU_PROBE_RELEASE
>  	def_bool y
>  	depends on HOTPLUG_CPU && !XEN
> 
> +config ARCH_RCU_FQS_LIMIT
> +	int
> +	default 16
> +
>  source "init/Kconfig"
>  source "kernel/Kconfig.freezer"
> 
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1316,47 +1316,65 @@ void rcu_check_callbacks(int cpu, int us
>   * have not yet encountered a quiescent state, using the function specified.
>   * Also initiate boosting for any threads blocked on the root rcu_node.
>   *
> + * Returns 0 if the scan could not be completed, 1 otherwise.
>   * The caller must have suppressed start of new grace periods.
>   */
> -static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
> +static int force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
>  {
>  	unsigned long bit;
>  	int cpu;
> +	int cpusvisited = 0, done = 1;
>  	unsigned long flags;
>  	unsigned long mask;
>  	struct rcu_node *rnp;
> 
>  	rcu_for_each_leaf_node(rsp, rnp) {
> +		if (rnp->grphi < rsp->fqs_next_cpu)
> +			continue;
> +		if (rnp->grplo > rcu_get_max_cpu())
> +			continue;
>  		mask = 0;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
>  		if (!rcu_gp_in_progress(rsp)) {
>  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -			return;
> +			return done;
>  		}
>  		if (rnp->qsmask == 0) {
>  			rcu_initiate_boost(rnp, flags); /* releases rnp->lock */
>  			continue;
>  		}
> -		cpu = rnp->grplo;
>  		bit = 1;
> -		for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> -			if ((rnp->qsmask & bit) != 0 &&
> -			    f(per_cpu_ptr(rsp->rda, cpu)))
> +		rsp->fqs_next_cpu = rnp->grphi + 1;
> +		cpusvisited += rnp->grphi - rnp->grplo;
> +		for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> +			if ((rnp->qsmask & bit) == 0)
> +				continue;
> +			if (f(per_cpu_ptr(rsp->rda, cpu)))
>  				mask |= bit;
>  		}
> -		if (mask != 0) {
> 
> +		if (mask != 0) {
>  			/* rcu_report_qs_rnp() releases rnp->lock. */
>  			rcu_report_qs_rnp(mask, rsp, rnp, flags);
> -			continue;
> +		} else
> +			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +
> +		if (cpusvisited >= FQS_MAX_CPU_VISIT) {
> +			if (FQS_MAX_CPU_VISIT != NR_CPUS)
> +				done = 0;
> +			break;
>  		}
> -		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	}
>  	rnp = rcu_get_root(rsp);
>  	if (rnp->qsmask == 0) {
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
>  		rcu_initiate_boost(rnp, flags); /* releases rnp->lock. */
>  	}
> +
> +	if (done)
> +		rsp->fqs_next_cpu = 0;
> +
> +	return done;
>  }
> 
>  /*
> @@ -1366,6 +1384,7 @@ static void force_qs_rnp(struct rcu_stat
>  static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
>  {
>  	unsigned long flags;
> +	int retval;
>  	struct rcu_node *rnp = rcu_get_root(rsp);
> 
>  	if (!rcu_gp_in_progress(rsp))
> @@ -1398,21 +1417,25 @@ static void force_quiescent_state(struct
>  		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
> 
>  		/* Record dyntick-idle state. */
> -		force_qs_rnp(rsp, dyntick_save_progress_counter);
> +		retval = force_qs_rnp(rsp, dyntick_save_progress_counter);
>  		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
> -		if (rcu_gp_in_progress(rsp))
> +		if (rcu_gp_in_progress(rsp) && retval) {
>  			rsp->signaled = RCU_FORCE_QS;
> +			rsp->fqs_next_cpu = 0;
> +		} else if (rcu_gp_in_progress(rsp))
> +			rsp->jiffies_force_qs = jiffies - 1;
>  		break;
> 
>  	case RCU_FORCE_QS:
> 
>  		/* Check dyntick-idle state, send IPI to laggarts. */
>  		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
> -		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
> +		retval = force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
> 
>  		/* Leave state in case more forcing is required. */
> -
>  		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
> +		if (rcu_gp_in_progress(rsp) && !retval)
> +			rsp->jiffies_force_qs = jiffies - 1;
>  		break;
>  	}
>  	rsp->fqs_active = 0;
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -354,6 +354,21 @@ do {									\
>  } while (0)
> 
>  /*
> + * Large latency-sensitive configurations can limit force_quiescent_state()
> + * latencies by defining an CONFIG_ARCH_RCU_FQS_LIMIT.  This should be
> + * sized based on that architecture's cache-miss latency and the maximum
> + * desired force_quiescent_state latency.  For example, if the cache-miss
> + * latency was 100 nanoseconds, and the maximum force_quiescent_state()
> + * latency contribution was 5 microseconds, then that architecture should
> + * define CONFIG_ARCH_RCU_FQS_LIMIT to be 50.
> + */
> +#ifdef CONFIG_ARCH_RCU_FQS_LIMIT
> +#define FQS_MAX_CPU_VISIT CONFIG_ARCH_RCU_FQS_LIMIT
> +#else /* #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
> +#define FQS_MAX_CPU_VISIT NR_CPUS
> +#endif /* #else #ifdef CONFIG_ARCH_RCU_FQS_LIMIT */
> +
> +/*
>   * RCU global state, including node hierarchy.  This hierarchy is
>   * represented in "heap" form in a dense array.  The root (first level)
>   * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
> @@ -391,6 +406,7 @@ struct rcu_state {
>  						/*  starting new GP. */
>  	raw_spinlock_t fqslock;			/* Only one task forcing */
>  						/*  quiescent states. */
> +	int fqs_next_cpu;			/* Next CPU for fqs to scan. */
>  	unsigned long jiffies_force_qs;		/* Time at which to invoke */
>  						/*  force_quiescent_state(). */
>  	unsigned long n_force_qs;		/* Number of calls to */
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency
  2012-03-19  0:10     ` Paul E. McKenney
@ 2012-03-19  3:47       ` Mike Galbraith
  0 siblings, 0 replies; 6+ messages in thread
From: Mike Galbraith @ 2012-03-19  3:47 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sivanich

On Sun, 2012-03-18 at 17:10 -0700, Paul E. McKenney wrote:

> Thank you very much for your efforts on this!!!

No problem, irq holdoff troubles are worth effort.

> Given that you were seeing about 200-microsecond latency spikes on
> grace-period initialization, I would expect that you would need about
> 200 dyntick-idle CPUs for force_quiescent_state() to give you a
> ten-microsecond spike, so I am not surprised that you could not see
> the difference on 60 CPUs, which probably have given you something
> like 3 microseconds..

Crawling over fewer locks should still save cycles, so I'll measure.
Big box rt needs every little usec we can scrape together.

-Mike


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-03-19  3:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-16 16:03 [PATCH RFC] rcu: Permit limiting of force_quiescent_state() latency Paul E. McKenney
2012-03-17 16:57 ` Mike Galbraith
2012-03-18  7:19   ` Mike Galbraith
2012-03-18 15:00   ` Mike Galbraith
2012-03-19  0:10     ` Paul E. McKenney
2012-03-19  3:47       ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox