public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] rcu: Fix series of spurious RCU softirqs
@ 2010-11-24  0:31 Frederic Weisbecker
  2010-11-24  0:31 ` [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods Frederic Weisbecker
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24  0:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Lai Jiangshan,
	Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Ingo Molnar

Hi,

I've observed some not-so-infrequent series of spurious RCU
softirqs, sometimes firing on every tick for a random while.

These patches aim to fix them.

Thanks.

Frederic Weisbecker (2):
  rcu: Don't chase unnecessary quiescent states after extended grace periods
  rcu: Stop checking quiescent states after grace period completion from remote

 kernel/rcutree.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24  0:31 [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Frederic Weisbecker
@ 2010-11-24  0:31 ` Frederic Weisbecker
  2010-11-24  0:58   ` Paul E. McKenney
  2010-11-24  0:31 ` [PATCH 2/2] rcu: Stop checking quiescent states after grace period completion from remote Frederic Weisbecker
  2010-11-25  3:42 ` [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Lai Jiangshan
  2 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24  0:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Lai Jiangshan,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Steven Rostedt

When a CPU is in an extended quiescent state, which includes nohz
idle or CPU offline, other CPUs will take care of the grace periods
on its behalf.

When this CPU exits its extended quiescent state, it will catch up
with the last started grace period and start chasing its own
quiescent states to end the current grace period.

However, in this case we always start tracking quiescent states if the
grace period number has changed since we entered our extended
quiescent state. We do this because we always assume that the last
grace period is not finished and needs us to complete it, which is
sometimes wrong.

This patch checks whether the last grace period still needs us. If so,
we start hunting local quiescent states like we always did; otherwise
we don't do anything, which saves us some work and an unnecessary
softirq.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ccdc04c..5f038a1 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -620,8 +620,17 @@ static void __init check_cpu_stall_init(void)
 static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 {
 	if (rdp->gpnum != rnp->gpnum) {
-		rdp->qs_pending = 1;
-		rdp->passed_quiesc = 0;
+		/*
+		 * Another CPU might have taken care of this new grace period
+		 * while we were idle, treating us as being in an extended
+		 * quiescent state. In that case we don't need to chase a
+		 * local quiescent state; otherwise we do:
+		 */
+		if (rdp->completed != rnp->gpnum) {
+			rdp->qs_pending = 1;
+			rdp->passed_quiesc = 0;
+		}
+
 		rdp->gpnum = rnp->gpnum;
 	}
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] rcu: Stop checking quiescent states after grace period completion from remote
  2010-11-24  0:31 [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Frederic Weisbecker
  2010-11-24  0:31 ` [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods Frederic Weisbecker
@ 2010-11-24  0:31 ` Frederic Weisbecker
  2010-11-24  1:03   ` Paul E. McKenney
  2010-11-25  3:42 ` [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Lai Jiangshan
  2 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24  0:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Lai Jiangshan,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Steven Rostedt

After a CPU starts to chase its quiescent states by setting
rdp->qs_pending to 1, it can still enter an extended quiescent
state, in which case another CPU will take care of this and
complete the grace period if necessary.

rcu_report_qs_rdp() currently doesn't handle this case well and
assumes it must try again later to report its quiescent state.

However if the last grace period has been completed there is
nothing left to do for the current CPU.

This means that until the next grace period starts, a CPU that
runs into this case will keep chasing its own quiescent states,
raising a softirq on every tick for no good reason.

It can take a while before a new grace period starts, and that
whole time slice is covered by spurious softirqs and other kinds
of RCU checks.

Fix this by resetting rdp->qs_pending if the last grace
period has been completed by a remote CPU while we were
in an extended quiescent state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rcutree.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5f038a1..f287eaa 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -937,6 +937,15 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long las
 		 * race occurred.
 		 */
 		rdp->passed_quiesc = 0;	/* try again later! */
+
+		/*
+		 * Another CPU may have taken care of us if we were in an
+		 * extended quiescent state, in which case we don't need
+		 * to continue to track anything.
+		 */
+		if (rnp->gpnum == rnp->completed)
+			rdp->qs_pending = 0;
+
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		return;
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24  0:31 ` [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods Frederic Weisbecker
@ 2010-11-24  0:58   ` Paul E. McKenney
  2010-11-24  2:29     ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24  0:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
> When a cpu is in an extended quiescent state, which includes idle
> nohz or CPU offline, others CPUs will take care of the grace periods
> on its behalf.
> 
> When this CPU exits its extended quiescent state, it will catch up
> with the last started grace period and start chasing its own
> quiescent states to end the current grace period.
> 
> However in this case we always start to track quiescent states if the
> grace period number has changed since we started our extended
> quiescent state. And we do this because we always assume that the last
> grace period is not finished and needs us to complete it, which is
> sometimes wrong.
> 
> This patch verifies if the last grace period has been completed and
> if so, start hunting local quiescent states like we always did.
> Otherwise don't do anything, this economizes us some work and
> an unnecessary softirq.

Interesting approach!  I can see how this helps in the case where the
CPU just came online, but I don't see it in the nohz case, because the
nohz case does not update the rdp->completed variable.  In contrast,
the online path calls rcu_init_percpu_data() which sets up this variable.

So, what am I missing here?

							Thanx, Paul

PS.  It might well be worthwhile for the online case alone, but
     the commit message does need to be accurate.

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/rcutree.c |   13 +++++++++++--
>  1 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index ccdc04c..5f038a1 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -620,8 +620,17 @@ static void __init check_cpu_stall_init(void)
>  static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
>  {
>  	if (rdp->gpnum != rnp->gpnum) {
> -		rdp->qs_pending = 1;
> -		rdp->passed_quiesc = 0;
> +		/*
> +		 * Another CPU might have taken take of this new grace period
> +		 * while we were idle and handled us as in an extended quiescent
> +		 * state. In that case, we don't need to chase a local quiescent
> +		 * state, otherwise:
> +		 */
> +		if (rdp->completed != rnp->gpnum) {
> +			rdp->qs_pending = 1;
> +			rdp->passed_quiesc = 0;
> +		}
> +
>  		rdp->gpnum = rnp->gpnum;
>  	}
>  }
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] rcu: Stop checking quiescent states after grace period completion from remote
  2010-11-24  0:31 ` [PATCH 2/2] rcu: Stop checking quiescent states after grace period completion from remote Frederic Weisbecker
@ 2010-11-24  1:03   ` Paul E. McKenney
  0 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24  1:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 01:31:13AM +0100, Frederic Weisbecker wrote:
> After a CPU starts to chase its quiescent states by setting
> rdp->qs_pending to 1, it can still enter into an extended
> quiescent state and then another CPU will take care of this
> and complete the grace period if necessary.
> 
> rcu_report_qs_rdp() currently doesn't handle well this case
> and considers it must try later to notify its quiescent state.
> 
> However if the last grace period has been completed there is
> nothing left to do for the current CPU.
> 
> It means that until a next grace period starts, the CPU that
> runs into that case will keep chasing its own quiescent states
> by raising a softirq on every tick for no good reason.
> 
> This can take a while before a new grace period starts and
> this time slice is covered by spurious softirqs and other
> kinds of rcu checks.
> 
> Fix this by resetting rdp->qs_pending if the last grace
> period has been completed by a remote CPU while we were
> in an extended quiescent state.

This one looks very good, at least at first glance!!!  Queued.

							Thanx, Paul

> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/rcutree.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 5f038a1..f287eaa 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -937,6 +937,15 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long las
>  		 * race occurred.
>  		 */
>  		rdp->passed_quiesc = 0;	/* try again later! */
> +
> +		/*
> +		 * Another CPU may have taken care of us if we were in an
> +		 * extended quiescent state, in which case we don't need
> +		 * to continue to track anything.
> +		 */
> +		if (rnp->gpnum == rnp->completed)
> +			rdp->qs_pending = 0;
> +
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  		return;
>  	}
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24  0:58   ` Paul E. McKenney
@ 2010-11-24  2:29     ` Frederic Weisbecker
  2010-11-24  2:33       ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24  2:29 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
> > When a cpu is in an extended quiescent state, which includes idle
> > nohz or CPU offline, others CPUs will take care of the grace periods
> > on its behalf.
> >
> > When this CPU exits its extended quiescent state, it will catch up
> > with the last started grace period and start chasing its own
> > quiescent states to end the current grace period.
> >
> > However in this case we always start to track quiescent states if the
> > grace period number has changed since we started our extended
> > quiescent state. And we do this because we always assume that the last
> > grace period is not finished and needs us to complete it, which is
> > sometimes wrong.
> >
> > This patch verifies if the last grace period has been completed and
> > if so, start hunting local quiescent states like we always did.
> > Otherwise don't do anything, this economizes us some work and
> > an unnecessary softirq.
>
> Interesting approach!  I can see how this helps in the case where the
> CPU just came online, but I don't see it in the nohz case, because the
> nohz case does not update the rdp->completed variable.  In contrast,
> the online path calls rcu_init_percpu_data() which sets up this variable.
>
> So, what am I missing here?
>
> 							Thanx, Paul
>
> PS.  It might well be worthwhile for the online case alone, but
>      the commit message does need to be accurate.


So, let's take this scenario (inspired from a freshly dumped trace to
clarify my ideas):

CPU 1 was idle, it has missed several grace periods, but CPU 0 took care
of that.

Hence, CPU 0's rdp->gpnum = rdp->completed = 4294967000

But the last grace period was 4294967002 and it's completed
(rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).

Now CPU 0 gets a tick for a random reason, it calls rcu_check_callbacks()
and then rcu_pending() which raises the softirq because of this:

	/* Has another RCU grace period completed?  */
	if (ACCESS_ONCE(rnp->completed) != rdp->completed) { /* outside lock */
		rdp->n_rp_gp_completed++;
		return 1;
	}

The softirq fires, we call rcu_process_gp_end() which will
update rdp->completed into the global state:
(rsp->completed = rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).

But rsp->pgnum is still 2 offsets backwards.

Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
-> note_new_gpnum() and then we end up with a requested quiescent state
while every grace period is completed.

So, now that I've described all that, I wonder whether the better
solution would be to change the above condition so that it doesn't
fire the softirq to begin with, because:

	rnp->completed != rdp->completed

doesn't seem to mean the node needs the current cpu. It just means
the node was smart enough to make its way without us while we were
in an extended quiescent state :)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24  2:29     ` Frederic Weisbecker
@ 2010-11-24  2:33       ` Frederic Weisbecker
  2010-11-24  6:22         ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24  2:33 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

2010/11/24 Frederic Weisbecker <fweisbec@gmail.com>:
> On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
>> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
>> > When a cpu is in an extended quiescent state, which includes idle
>> > nohz or CPU offline, others CPUs will take care of the grace periods
>> > on its behalf.
>> >
>> > When this CPU exits its extended quiescent state, it will catch up
>> > with the last started grace period and start chasing its own
>> > quiescent states to end the current grace period.
>> >
>> > However in this case we always start to track quiescent states if the
>> > grace period number has changed since we started our extended
>> > quiescent state. And we do this because we always assume that the last
>> > grace period is not finished and needs us to complete it, which is
>> > sometimes wrong.
>> >
>> > This patch verifies if the last grace period has been completed and
>> > if so, start hunting local quiescent states like we always did.
>> > Otherwise don't do anything, this economizes us some work and
>> > an unnecessary softirq.
>>
>> Interesting approach!  I can see how this helps in the case where the
>> CPU just came online, but I don't see it in the nohz case, because the
>> nohz case does not update the rdp->completed variable.  In contrast,
>> the online path calls rcu_init_percpu_data() which sets up this variable.
>>
>> So, what am I missing here?
>>
>>                                                       Thanx, Paul
>>
>> PS.  It might well be worthwhile for the online case alone, but
>>      the commit message does need to be accurate.
>
>
> So, let's take this scenario (inspired from a freshly dumped trace to
> clarify my ideas):
>
> CPU 1 was idle, it has missed several grace periods, but CPU 0 took care
> of that.
>
> Hence, CPU 0's rdp->gpnum = rdp->completed = 4294967000
>
> But the last grace period was 4294967002 and it's completed
> (rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).
>
> Now CPU 0 gets a tick for a random reason, it calls rcu_check_callbacks()
> and then rcu_pending() which raises the softirq because of this:
>
>        /* Has another RCU grace period completed?  */
>        if (ACCESS_ONCE(rnp->completed) != rdp->completed) { /* outside lock */
>                rdp->n_rp_gp_completed++;
>                return 1;
>        }
>
> The softirq fires, we call rcu_process_gp_end() which will
> update rdp->completed into the global state:
> (rsp->completed = rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).
>
> But rsp->pgnum is still 2 offsets backwards.
>
> Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
> -> note_new_gpnum() and then we end up a requested quiescent state while
> every grace periods are completed.


Sorry, I should have described that in the changelogs, but my ideas
weren't as clear then as they are now (at least I feel they are,
which doesn't mean they actually are ;)
Chasing these RCU bugs for too many hours has toasted my brain...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24  2:33       ` Frederic Weisbecker
@ 2010-11-24  6:22         ` Paul E. McKenney
  2010-11-24 13:48           ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24  6:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 03:33:21AM +0100, Frederic Weisbecker wrote:
> 2010/11/24 Frederic Weisbecker <fweisbec@gmail.com>:
> > On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
> >> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
> >> > When a cpu is in an extended quiescent state, which includes idle
> >> > nohz or CPU offline, others CPUs will take care of the grace periods
> >> > on its behalf.
> >> >
> >> > When this CPU exits its extended quiescent state, it will catch up
> >> > with the last started grace period and start chasing its own
> >> > quiescent states to end the current grace period.
> >> >
> >> > However in this case we always start to track quiescent states if the
> >> > grace period number has changed since we started our extended
> >> > quiescent state. And we do this because we always assume that the last
> >> > grace period is not finished and needs us to complete it, which is
> >> > sometimes wrong.
> >> >
> >> > This patch verifies if the last grace period has been completed and
> >> > if so, start hunting local quiescent states like we always did.
> >> > Otherwise don't do anything, this economizes us some work and
> >> > an unnecessary softirq.
> >>
> >> Interesting approach!  I can see how this helps in the case where the
> >> CPU just came online, but I don't see it in the nohz case, because the
> >> nohz case does not update the rdp->completed variable.  In contrast,
> >> the online path calls rcu_init_percpu_data() which sets up this variable.
> >>
> >> So, what am I missing here?
> >>
> >>                                                       Thanx, Paul
> >>
> >> PS.  It might well be worthwhile for the online case alone, but
> >>      the commit message does need to be accurate.
> >
> >
> > So, let's take this scenario (inspired from a freshly dumped trace to
> > clarify my ideas):
> >
> > CPU 1 was idle, it has missed several grace periods, but CPU 0 took care
> > of that.
> >
> > Hence, CPU 0's rdp->gpnum = rdp->completed = 4294967000

But CPU 1's rdp state is outdated, due to locking design.

> > But the last grace period was 4294967002 and it's completed
> > (rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).
> >
> > Now CPU 0 gets a tick for a random reason, it calls rcu_check_callbacks()
> > and then rcu_pending() which raises the softirq because of this:
> >
> >        /* Has another RCU grace period completed?  */
> >        if (ACCESS_ONCE(rnp->completed) != rdp->completed) { /* outside lock */
> >                rdp->n_rp_gp_completed++;
> >                return 1;
> >        }

Yes, because CPU 0 has not yet updated its rdp state.

> > The softirq fires, we call rcu_process_gp_end() which will
> > update rdp->completed into the global state:
> > (rsp->completed = rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).
> >
> > But rsp->pgnum is still 2 offsets backwards.

This one is hard for me to believe -- rsp->gpnum drives the rest of
the ->gpnum fields, right?

> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
> > -> note_new_gpnum() and then we end up a requested quiescent state while
> > every grace periods are completed.
> 
> Sorry I should have described that in the changelogs but my ideas
> weren't as clear as they
> are now (at least I feel they are, doesn't mean they actually are ;)
> Chasing these RCU bugs for too much hours has toasted my brain..

Welcome to my world!!!  But keep in mind that an extra timer tick
or two is much preferable to a potential hang!  And you only get
the extra timer tick if there was some other reason that the
CPU came out of nohz mode, correct?

Which is why checking the rnp fields makes more sense to me, actually.
Acquiring rnp->lock is much less painful than pinning down the rsp state.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24  6:22         ` Paul E. McKenney
@ 2010-11-24 13:48           ` Frederic Weisbecker
  2010-11-24 14:42             ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24 13:48 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Wed, Nov 24, 2010 at 03:33:21AM +0100, Frederic Weisbecker wrote:
>> 2010/11/24 Frederic Weisbecker <fweisbec@gmail.com>:
>> > On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
>> >> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
>> >> > When a cpu is in an extended quiescent state, which includes idle
>> >> > nohz or CPU offline, others CPUs will take care of the grace periods
>> >> > on its behalf.
>> >> >
>> >> > When this CPU exits its extended quiescent state, it will catch up
>> >> > with the last started grace period and start chasing its own
>> >> > quiescent states to end the current grace period.
>> >> >
>> >> > However in this case we always start to track quiescent states if the
>> >> > grace period number has changed since we started our extended
>> >> > quiescent state. And we do this because we always assume that the last
>> >> > grace period is not finished and needs us to complete it, which is
>> >> > sometimes wrong.
>> >> >
>> >> > This patch verifies if the last grace period has been completed and
>> >> > if so, start hunting local quiescent states like we always did.
>> >> > Otherwise don't do anything, this economizes us some work and
>> >> > an unnecessary softirq.
>> >>
>> >> Interesting approach!  I can see how this helps in the case where the
>> >> CPU just came online, but I don't see it in the nohz case, because the
>> >> nohz case does not update the rdp->completed variable.  In contrast,
>> >> the online path calls rcu_init_percpu_data() which sets up this variable.
>> >>
>> >> So, what am I missing here?
>> >>
>> >>                                                       Thanx, Paul
>> >>
>> >> PS.  It might well be worthwhile for the online case alone, but
>> >>      the commit message does need to be accurate.
>> >
>> >
>> > So, let's take this scenario (inspired from a freshly dumped trace to
>> > clarify my ideas):
>> >
>> > CPU 1 was idle, it has missed several grace periods, but CPU 0 took care
>> > of that.
>> >
>> > Hence, CPU 0's rdp->gpnum = rdp->completed = 4294967000


(Actually I was talking about CPU 1 here. CPU 1 was idle and
has rdp->gpnum and rdp->completed at 4294967000, while the
global state and even the node are at 4294967002.)

> But CPU 1's rdp state is outdated, due to locking design.

Yeah.


>> > But the last grace period was 4294967002 and it's completed
>> > (rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).
>> >
>> > Now CPU 0 gets a tick for a random reason, it calls rcu_check_callbacks()
>> > and then rcu_pending() which raises the softirq because of this:
>> >
>> >        /* Has another RCU grace period completed?  */
>> >        if (ACCESS_ONCE(rnp->completed) != rdp->completed) { /* outside lock */
>> >                rdp->n_rp_gp_completed++;
>> >                return 1;
>> >        }
>
> Yes, because CPU 0 has not yet updated its rdp state.


So again I made a mistake. Here it's CPU 1 that takes this path: the
outdated CPU that was idle.

So yeah, it has not yet updated its rdp state; it's still 2 offsets
backwards.


>> > The softirq fires, we call rcu_process_gp_end() which will
>> > update rdp->completed into the global state:
>> > (rsp->completed = rnp->pgnum = rnp->completed = rsp->pgnum = 4294967002).
>> >
>> > But rsp->pgnum is still 2 offsets backwards.
>
> This one is hard for me to believe -- rsp->gpnum drives the rest of
> the ->gpnum fields, right?


Oops, I meant rdp->pgnum is still 2 offsets backward, for CPU 1.

Sorry.


>> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
>> > -> note_new_gpnum() and then we end up a requested quiescent state while
>> > every grace periods are completed.
>>
>> Sorry I should have described that in the changelogs but my ideas
>> weren't as clear as they
>> are now (at least I feel they are, doesn't mean they actually are ;)
>> Chasing these RCU bugs for too much hours has toasted my brain..
>
> Welcome to my world!!!  But keep in mind that an extra timer tick
> or two is much preferable to a potential hang!  And you only get
> the extra timer tick if there was some other reason that the
> CPU came out of nohz mode, correct?


Yeah, either because of a timer, hrtimer, or a reschedule.
But you still generate a spurious softirq in this scheme.

Two, in fact: one because of the rnp->completed != rdp->completed condition
in rcu_pending(), another because when we update the gpnum we always
start chasing a QS, regardless of whether the last GP is completed.


> Which is why checking the rnp fields makes more sense to me, actually.
> Acquiring rnp->lock is much less painful than pinning down the rsp state.

Right.

Another thing: we already have the rnp->gpnum != rdp->gpnum check in
rcu_pending(), so why also check rnp->completed != rdp->completed?

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 13:48           ` Frederic Weisbecker
@ 2010-11-24 14:42             ` Paul E. McKenney
  2010-11-24 15:45               ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24 14:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > On Wed, Nov 24, 2010 at 03:33:21AM +0100, Frederic Weisbecker wrote:
> >> 2010/11/24 Frederic Weisbecker <fweisbec@gmail.com>:
> >> > On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
> >> >> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
> >> >> > When a cpu is in an extended quiescent state, which includes idle
> >> >> > nohz or CPU offline, others CPUs will take care of the grace periods
> >> >> > on its behalf.
> >> >> >
> >> >> > When this CPU exits its extended quiescent state, it will catch up
> >> >> > with the last started grace period and start chasing its own
> >> >> > quiescent states to end the current grace period.
> >> >> >
> >> >> > However in this case we always start to track quiescent states if the
> >> >> > grace period number has changed since we started our extended
> >> >> > quiescent state. And we do this because we always assume that the last
> >> >> > grace period is not finished and needs us to complete it, which is
> >> >> > sometimes wrong.
> >> >> >
> >> >> > This patch verifies if the last grace period has been completed and
> >> >> > if so, start hunting local quiescent states like we always did.
> >> >> > Otherwise don't do anything, this economizes us some work and
> >> >> > an unnecessary softirq.
> >> >>
> >> >> Interesting approach!  I can see how this helps in the case where the
> >> >> CPU just came online, but I don't see it in the nohz case, because the
> >> >> nohz case does not update the rdp->completed variable.  In contrast,
> >> >> the online path calls rcu_init_percpu_data() which sets up this variable.
> >> >>
> >> >> So, what am I missing here?
> >> >>
> >> >>                                                       Thanx, Paul
> >> >>
> >> >> PS.  It might well be worthwhile for the online case alone, but
> >> >>      the commit message does need to be accurate.
> >> >
> >> >
> >> > So, let's take this scenario (inspired from a freshly dumped trace to
> >> > clarify my ideas):
> >> >
> >> > CPU 1 was idle, it has missed several grace periods, but CPU 0 took care
> >> > of that.
> >> >
> >> > Hence, CPU 0's rdp->gpnum = rdp->completed = 4294967000
> 
> 
> (Actually I was talking about CPU 1 here. CPU 1 was idle and
> has rdp->gpnum and rdp->completed at 4294967000,
> 
> while the global state and even the node are at 4294967002.)
> 
> 
> > But CPU 1's rdp state is outdated, due to locking design.
> 
> Yeah.
> 
> 
> >> > But the last grace period was 4294967002 and it's completed
> >> > (rnp->gpnum = rnp->completed = rsp->gpnum = 4294967002).
> >> >
> >> > Now CPU 0 gets a tick for a random reason, it calls rcu_check_callbacks()
> >> > and then rcu_pending() which raises the softirq because of this:
> >> >
> >> >        /* Has another RCU grace period completed?  */
> >> >        if (ACCESS_ONCE(rnp->completed) != rdp->completed) { /* outside lock */
> >> >                rdp->n_rp_gp_completed++;
> >> >                return 1;
> >> >        }
> >
> > Yes, because CPU 0 has not yet updated its rdp state.
> 
> So again I made a mistake. Here it's CPU 1 that takes this path, the outdated
> CPU that was idle.
> 
> So yeah, here it has not yet updated its rdp state, it's still 2
> offsets backwards.

OK, we might now be on the same page.  ;-)

> >> > The softirq fires, we call rcu_process_gp_end() which will
> >> > update rdp->completed into the global state:
> >> > (rsp->completed = rnp->gpnum = rnp->completed = rsp->gpnum = 4294967002).
> >> >
> >> > But rsp->gpnum is still 2 offsets backwards.
> >
> > This one is hard for me to believe -- rsp->gpnum drives the rest of
> > the ->gpnum fields, right?
> 
> Oops, I meant rdp->gpnum is still 2 offsets backward, for CPU 1.
> 
> Sorry.

CPU 1?  Or CPU 0?

> >> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
> >> > -> note_new_gpnum() and then we end up requesting a quiescent state while
> >> > every grace period is already completed.
> >>
> >> Sorry I should have described that in the changelogs but my ideas
> >> weren't as clear as they
> >> are now (at least I feel they are, doesn't mean they actually are ;)
> >> Chasing these RCU bugs for too many hours has toasted my brain..
> >
> > Welcome to my world!!!  But keep in mind that an extra timer tick
> > or two is much preferable to a potential hang!  And you only get
> > the extra timer tick if there was some other reason that the
> > CPU came out of nohz mode, correct?
> 
> Yeah, either because of a timer, hrtimer, or a reschedule.
> But you still generate a spurious softirq in this scheme.

Eliminating spurious softirqs is a good thing, but let's review the
overall priorities (probably missing a few, but this should give you
an overall view):

1.	No too-short grace periods!!!

2.	No grace-period hangs!!

3.	Minimize read-side overhead!

4.	Maintain decent update-side performance and scalability!

5.	Avoid inflicting real-time latency problems!

6.	Avoid waking up sleeping CPUs!

7.	Let CPUs that can do so go to sleep immediately (as opposed to
	waiting a few milliseconds).

8.	Avoid spurious softirqs.

9.	Super-optimization of update and grace-period paths.  (I normally
	just say "no" -- it has to be an impressive optimization for me
	to be willing to risk messing up #1 through #7.)

Nevertheless, simple changes that avoid spurious softirqs are good things.

> Two in fact: one because of the rnp->completed != rdp->completed condition
> in rcu_pending(), another one because when we update the gpnum, we
> always start chasing QS, regardless of the last GP being completed or not.

OK -- is this an example of #8 above, or is it really #7?  I am absolutely
not worried about a pair of back-to-back softirqs, and I don't believe
that you should be, either.  ;-)

> > Which is why checking the rnp fields makes more sense to me, actually.
> > Acquiring rnp->lock is much less painful than pinning down the rsp state.
> 
> Right.
> 
> Another thing, we already have the (rnp->gpnum != rdp->gpnum) check in
> rcu_pending(),
> why also check (rnp->completed != rdp->completed)?

Because if (rnp->completed != rdp->completed), we might need to process
some callbacks, either advancing them or invoking them.

By the way, have you introduced a config variable for your HPC dynticks
changes?  Longer term, __rcu_pending() for your HPC dynticks case
could check for the current CPU having any callbacks before the call to
check_cpu_stall(), as in rcu_needs_cpu_quick_check().  That way, the
CPUs with callbacks would drive the RCU core state machine.  We don't
necessarily want this in the common case because it can increase
grace-period latency, but it could be very useful in the HPC-dyntick
case -- eliminate any number of sources of spurious ticks with low
risk to the high-priority RCU properties.  There are some corresponding
changes required on the force_quiescent_state() path, but these are
reasonably straightforward.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 14:42             ` Paul E. McKenney
@ 2010-11-24 15:45               ` Frederic Weisbecker
  2010-11-24 16:15                 ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24 15:45 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
>> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>> > On Wed, Nov 24, 2010 at 03:33:21AM +0100, Frederic Weisbecker wrote:
>> >> 2010/11/24 Frederic Weisbecker <fweisbec@gmail.com>:
>> >> > On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
>> >> >> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
>> >> > The softirq fires, we call rcu_process_gp_end() which will
>> >> > update rdp->completed into the global state:
>> >> > (rsp->completed = rnp->gpnum = rnp->completed = rsp->gpnum = 4294967002).
>> >> >
>> >> > But rsp->gpnum is still 2 offsets backwards.
>> >
>> > This one is hard for me to believe -- rsp->gpnum drives the rest of
>> > the ->gpnum fields, right?
>>
>> Oops, I meant rdp->gpnum is still 2 offsets backward, for CPU 1.
>>
>> Sorry.
>
> CPU 1?  Or CPU 0?


CPU 1, the one that was idle :-D

So CPU 1's rdp did catch up with the node and state for its completed field.
But not its gpnum yet.


>> >> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
>> >> > -> note_new_gpnum() and then we end up requesting a quiescent state while
>> >> > every grace period is already completed.
>> >>
>> >> Sorry I should have described that in the changelogs but my ideas
>> >> weren't as clear as they
>> >> are now (at least I feel they are, doesn't mean they actually are ;)
>> >> Chasing these RCU bugs for too many hours has toasted my brain..
>> >
>> > Welcome to my world!!!  But keep in mind that an extra timer tick
>> > or two is much preferable to a potential hang!  And you only get
>> > the extra timer tick if there was some other reason that the
>> > CPU came out of nohz mode, correct?
>>
>> Yeah, either because of a timer, hrtimer, or a reschedule.
>> But you still generate a spurious softirq in this scheme.
>
> Eliminating spurious softirqs is a good thing, but let's review the
> overall priorities (probably missing a few, but this should give you
> an overall view):
>
> 1.      No too-short grace periods!!!


Oh, I did not know that. Is it to avoid too many
short series of callback executions, or something else?


> 2.      No grace-period hangs!!
>
> 3.      Minimize read-side overhead!
>
> 4.      Maintain decent update-side performance and scalability!
>
> 5.      Avoid inflicting real-time latency problems!
>
> 6.      Avoid waking up sleeping CPUs!
>
> 7.      Let CPUs that can do so go to sleep immediately (as opposed to
>        waiting a few milliseconds).
>
> 8.      Avoid spurious softirqs.


Avoiding spurious softirqs brings us back to #7, even though it's not a matter
of ms but rather µs.


>
> 9.      Super-optimization of update and grace-period paths.  (I normally
>        just say "no" -- it has to be an impressive optimization for me
>        to be willing to risk messing up #1 through #7.)
>
> Nevertheless, simple changes that avoid spurious softirqs are good things.


Ok, the original patch of this discussion seems to fit, then.
Although I need to respin the changelog.


>> Two in fact: one because of the rnp->completed != rdp->completed condition
>> in rcu_pending(), another one because when we update the gpnum, we
>> always start chasing QS, regardless of the last GP being completed or not.
>
> OK -- is this an example of #8 above, or is it really #7?  I am absolutely
> not worried about a pair of back-to-back softirqs, and I don't believe
> that you should be, either.  ;-)


This seems to be 8 and 7.

A pair of back-to-back softirqs is no huge deal, but still time
consuming, spurious, etc...
If we can easily spot that they are useless, why not get rid of them?
At least we can easily and reliably avoid the second one from
note_new_gpnum().



>> > Which is why checking the rnp fields makes more sense to me, actually.
>> > Acquiring rnp->lock is much less painful than pinning down the rsp state.
>>
>> Right.
>>
>> Another thing, we already have the (rnp->gpnum != rdp->gpnum) check in
>> rcu_pending(),
>> why also check (rnp->completed != rdp->completed)?
>
> Because if (rnp->completed != rdp->completed), we might need to process
> some callbacks, either advancing them or invoking them.


The rdp->nxttail in rcu is still an obscure part for me, so I just believe you :)


> By the way, have you introduced a config variable for your HPC dynticks
> changes?

I will, because I'll have to touch some fast path and I would prefer to do
that compile-conditional.

But for now I don't care and am experimenting with drafts :)

I'm quite close to something that seems to work well BTW; those series of
softirqs were my latest problem as they made rcu_pending() return 1 for
a good while and I couldn't stop the tick. After the 2nd patch it should be
fine now, I hope.


>  Longer term, __rcu_pending() for your HPC dynticks case
> could check for the current CPU having any callbacks before the call to
> check_cpu_stalls(), as in rcu_needs_cpu_quick_check().  That way, the
> CPUs with callbacks would drive the RCU core state machine.


Ah, because currently every CPU calling rcu_pending() can be pulled into
handling the state machine because of this check, right?

	/* Has an RCU GP gone long enough to send resched IPIs &c? */
	if (rcu_gp_in_progress(rsp) &&
	    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
		rdp->n_rp_need_fqs++;
		return 1;
	}
So the goal would be to leave this job to CPUs that have callbacks, right?


> We don't
> necessarily want this in the common case because it can increase
> grace-period latency


Because it leaves fewer CPUs to handle the grace period state machine?


> but it could be very useful in the HPC-dyntick
> case -- eliminate any number of sources of spurious ticks with low
> risk to the high-priority RCU properties.  There are some corresponding
> changes required on the force_quiescent_state() path, but these are
> reasonably straightforward.

Ok, I may have a try at it (if I completely understand what you suggest :-)
But first I'll try to get that dyntick mode with the current state, which seems
to be enough for the basic desired functionality.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 15:45               ` Frederic Weisbecker
@ 2010-11-24 16:15                 ` Paul E. McKenney
  2010-11-24 17:38                   ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24 16:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 04:45:11PM +0100, Frederic Weisbecker wrote:
> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
> >> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> >> > On Wed, Nov 24, 2010 at 03:33:21AM +0100, Frederic Weisbecker wrote:
> >> >> 2010/11/24 Frederic Weisbecker <fweisbec@gmail.com>:
> >> >> > On Tue, Nov 23, 2010 at 04:58:20PM -0800, Paul E. McKenney wrote:
> >> >> >> On Wed, Nov 24, 2010 at 01:31:12AM +0100, Frederic Weisbecker wrote:
> >> >> > The softirq fires, we call rcu_process_gp_end() which will
> >> >> > update rdp->completed into the global state:
> >> >> > (rsp->completed = rnp->gpnum = rnp->completed = rsp->gpnum = 4294967002).
> >> >> >
> >> >> > But rsp->gpnum is still 2 offsets backwards.
> >> >
> >> > This one is hard for me to believe -- rsp->gpnum drives the rest of
> >> > the ->gpnum fields, right?
> >>
> >> Oops, I meant rdp->gpnum is still 2 offsets backward, for CPU 1.
> >>
> >> Sorry.
> >
> > CPU 1?  Or CPU 0?
> 
> CPU 1, the one that was idle :-D
> 
> So CPU 1's rdp did catch up with the node and state for its completed field.
> But not its gpnum yet.

OK, I will need to take a closer look at the rdp->gpnum setting.

> >> >> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
> >> >> > -> note_new_gpnum() and then we end up requesting a quiescent state while
> >> >> > every grace period is already completed.
> >> >>
> >> >> Sorry I should have described that in the changelogs but my ideas
> >> >> weren't as clear as they
> >> >> are now (at least I feel they are, doesn't mean they actually are ;)
> >> >> Chasing these RCU bugs for too many hours has toasted my brain..
> >> >
> >> > Welcome to my world!!!  But keep in mind that an extra timer tick
> >> > or two is much preferable to a potential hang!  And you only get
> >> > the extra timer tick if there was some other reason that the
> >> > CPU came out of nohz mode, correct?
> >>
> >> Yeah, either because of a timer, hrtimer, or a reschedule.
> >> But you still generate a spurious softirq in this scheme.
> >
> > Eliminating spurious softirqs is a good thing, but let's review the
> > overall priorities (probably missing a few, but this should give you
> > an overall view):
> >
> > 1.      No too-short grace periods!!!
> 
> Oh, I did not know that. Is it to avoid too many
> short series of callback executions, or something else?

To avoid breaking RCU's fundamental guarantee.  ;-)

It is easy to create bugs where a CPU thinks that it needs to respond
to the grace period based on a prior quiescent state, but unknown to
it that grace period has finished and a new one has started.  This
CPU might then wrongly report a quiescent state against the new grace
period.  Needless to say, this is very very very very very very bad.

> > 2.      No grace-period hangs!!
> >
> > 3.      Minimize read-side overhead!
> >
> > 4.      Maintain decent update-side performance and scalability!
> >
> > 5.      Avoid inflicting real-time latency problems!
> >
> > 6.      Avoid waking up sleeping CPUs!
> >
> > 7.      Let CPUs that can do so go to sleep immediately (as opposed to
> >        waiting a few milliseconds).
> >
> > 8.      Avoid spurious softirqs.
> 
> Avoiding spurious softirqs brings us back to #7, even though it's not a matter
> of ms but rather µs.

Microseconds should not be a problem as long as they are rare.  Milliseconds
are a problem for the battery-powered folks as well as for people who care
about OS jitter.

> > 9.      Super-optimization of update and grace-period paths.  (I normally
> >        just say "no" -- it has to be an impressive optimization for me
> >        to be willing to risk messing up #1 through #7.)
> >
> > Nevertheless, simple changes that avoid spurious softirqs are good things.
> 
> Ok, the original patch of this discussion seems to fit, then.
> Although I need to respin the changelog.

Sounds good -- I will then review it carefully.

> >> Two in fact: one because of the rnp->completed != rdp->completed condition
> >> in rcu_pending(), another one because when we update the gpnum, we
> >> always start chasing QS, regardless of the last GP being completed or not.
> >
> > OK -- is this an example of #8 above, or is it really #7?  I am absolutely
> > not worried about a pair of back-to-back softirqs, and I don't believe
> > that you should be, either.  ;-)
> 
> This seems to be 8 and 7.
> 
> A pair of back-to-back softirqs is no huge deal, but still time
> consuming, spurious, etc...
> If we can easily spot that they are useless, why not get rid of them?
> At least we can easily and reliably avoid the second one from
> note_new_gpnum().

As long as I can prove to myself that the patch that gets rid of them
does not cause other problems.  ;-)

> >> > Which is why checking the rnp fields makes more sense to me, actually.
> >> > Acquiring rnp->lock is much less painful than pinning down the rsp state.
> >>
> >> Right.
> >>
> >> Another thing, we already have the (rnp->gpnum != rdp->gpnum) check in
> >> rcu_pending(),
> >> why also check (rnp->completed != rdp->completed)?
> >
> > Because if (rnp->completed != rdp->completed), we might need to process
> > some callbacks, either advancing them or invoking them.
> 
> The rdp->nxttail in rcu is still an obscure part for me, so I just believe you :)

It is just the place that the pending rcu_head callbacks are stored.

> > By the way, have you introduced a config variable for your HPC dynticks
> > changes?
> 
> I will, because I'll have to touch some fast path and I would prefer to do
> that compile-conditional.
> 
> But for now I don't care and am experimenting with drafts :)

Well, if any of the softirq/tick streamlining in RCU is risky, it needs
to go under that same #ifdef!  ;-)

> I'm quite close to something that seems to work well BTW; those series of
> softirqs were my latest problem as they made rcu_pending() return 1 for
> a good while and I couldn't stop the tick. After the 2nd patch it should be
> fine now, I hope.

By "couldn't stop the tick" you mean "the tick went on a few times more
than you like" or do you really mean "couldn't ever stop the tick"?
My guess is the former.  If there were problems like the latter, Arjan
would probably have beat me up about it long ago!

> >  Longer term, __rcu_pending() for your HPC dynticks case
> > could check for the current CPU having any callbacks before the call to
> > check_cpu_stall(), as in rcu_needs_cpu_quick_check().  That way, the
> > CPUs with callbacks would drive the RCU core state machine.
> 
> Ah, because currently every CPU calling rcu_pending() can be pulled into
> handling the state machine because of this check, right?
> 
>         /* Has an RCU GP gone long enough to send resched IPIs &c? */
> 	if (rcu_gp_in_progress(rsp) &&
> 	    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
> 		rdp->n_rp_need_fqs++;
> 		return 1;
> 	}
> So the goal would be to leave this job to CPUs that have callbacks, right?

In the dynticks-HPC case, yes.

> > We don't
> > necessarily want this in the common case because it can increase
> > grace-period latency
> 
> Because it leaves fewer CPUs to handle the grace period state machine?

Not quite.  Here I am not worried about the dyntick-HPC case, but rather
the usual more-runnable-tasks-than-CPUs case.  Here, CPUs are context
switching frequently, so if they report their own quiescent states (even
when they don't happen to have any callbacks queued) the grace period will
complete more quickly.  After all, force_quiescent_state() doesn't even
get started until the third tick following the start of the grace period.

In contrast, if we let the context-switching callback-free CPUs do their
own work, the grace period will likely end in a tick or two.

> > but it could be very useful in the HPC-dyntick
> > case -- eliminate any number of sources of spurious ticks with low
> > risk to the high-priority RCU properties.  There are some corresponding
> > changes required on the force_quiescent_state() path, but these are
> > reasonably straightforward.
> 
> Ok, I may have a try at it (if I completely understand what you suggest :-)
> But first I'll try to get that dyntick mode with the current state, which seems
> to be enough for the basic desired functionality.

If the two changes you have posted thus far are all you need, this does
sound like a good starting point.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 16:15                 ` Paul E. McKenney
@ 2010-11-24 17:38                   ` Frederic Weisbecker
  2010-11-24 18:20                     ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24 17:38 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Wed, Nov 24, 2010 at 04:45:11PM +0100, Frederic Weisbecker wrote:
>> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>> > On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
>> CPU 1, the one that was idle :-D
>>
>> So CPU 1's rdp did catch up with the node and state for its completed field.
>> But not its gpnum yet.
>
> OK, I will need to take a closer look at the rdp->gpnum setting.

Ok, do you want me to resend the patch with the changelog updated according
to our discussion?


>> >> >> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
>> >> >> > -> note_new_gpnum() and then we end up requesting a quiescent state while
>> >> >> > every grace period is already completed.
>> >> >>
>> >> >> Sorry I should have described that in the changelogs but my ideas
>> >> >> weren't as clear as they
>> >> >> are now (at least I feel they are, doesn't mean they actually are ;)
>> >> >> Chasing these RCU bugs for too many hours has toasted my brain..
>> >> >
>> >> > Welcome to my world!!!  But keep in mind that an extra timer tick
>> >> > or two is much preferable to a potential hang!  And you only get
>> >> > the extra timer tick if there was some other reason that the
>> >> > CPU came out of nohz mode, correct?
>> >>
>> >> Yeah, either because of a timer, hrtimer, or a reschedule.
>> >> But you still generate a spurious softirq in this scheme.
>> >
>> > Eliminating spurious softirqs is a good thing, but let's review the
>> > overall priorities (probably missing a few, but this should give you
>> > an overall view):
>> >
>> > 1.      No too-short grace periods!!!
>>
>> Oh, I did not know that. Is it to avoid too many
>> short series of callback executions, or something else?
>
> To avoid breaking RCU's fundamental guarantee.  ;-)
>
> It is easy to create bugs where a CPU thinks that it needs to respond
> to the grace period based on a prior quiescent state, but unknown to
> it that grace period has finished and a new one has started.  This
> CPU might then wrongly report a quiescent state against the new grace
> period.  Needless to say, this is very very very very very very bad.

Aaah ok. I thought you were talking about the time of a correct grace period,
not a bug that would make it seem shorter than it actually is.

Ok. Of course, priority number 1.


>> > 7.      Let CPUs that can do so go to sleep immediately (as opposed to
>> >        waiting a few milliseconds).
>> >
>> > 8.      Avoid spurious softirqs.
>>
>> Avoiding spurious softirqs brings us back to #7, even though it's not a matter
>> of ms but rather µs.
>
> Microseconds should not be a problem as long as they are rare.  Milliseconds
> are a problem for the battery-powered folks as well as for people who care
> about OS jitter.

Right.


>> >> Two in fact: one because of the rnp->completed != rdp->completed condition
>> >> in rcu_pending(), another one because when we update the gpnum, we
>> >> always start chasing QS, regardless of the last GP being completed or not.
>> >
>> > OK -- is this an example of #8 above, or is it really #7?  I am absolutely
>> > not worried about a pair of back-to-back softirqs, and I don't believe
>> > that you should be, either.  ;-)
>>
>> This seems to be 8 and 7.
>>
>> A pair of back-to-back softirqs is no huge deal, but still time
>> consuming, spurious, etc...
>> If we can easily spot that they are useless, why not get rid of them?
>> At least we can easily and reliably avoid the second one from
>> note_new_gpnum().
>
> As long as I can prove to myself that the patch that gets rid of them
> does not cause other problems.  ;-)

:)

>> >> > Which is why checking the rnp fields makes more sense to me, actually.
>> >> > Acquiring rnp->lock is much less painful than pinning down the rsp state.
>> >>
>> >> Right.
>> >>
>> >> Another thing, we already have the (rnp->gpnum != rdp->gpnum) check in
>> >> rcu_pending(),
>> >> why also check (rnp->completed != rdp->completed)?
>> >
>> > Because if (rnp->completed != rdp->completed), we might need to process
>> > some callbacks, either advancing them or invoking them.
>>
>> The rdp->nxttail in rcu is still an obscure part for me, so I just believe you :)
>
> It is just the place that the pending rcu_head callbacks are stored.

Yeah. I mean, I need to read how the code manages the different queues.
But __rcu_process_gp_end() seems to sum it up quite well.

>> > By the way, have you introduced a config variable for your HPC dynticks
>> > changes?
>>
>> I will, because I'll have to touch some fast path and I would prefer to do
>> that compile-conditional.
>>
>> But for now I don't care and am experimenting with drafts :)
>
> Well, if any of the softirq/tick streamlining in RCU is risky, it needs
> to go under that same #ifdef!  ;-)

Sure, if something needs a reorg or a change even in the dyntick-hpc
off-case code,
I'll propose that to you, like I did for these two patches.

But I'll try to keep the invasive changes that are only for
dyntick-hpc under that config.

>> I'm quite close to something that seems to work well BTW; those series of
>> softirqs were my latest problem as they made rcu_pending() return 1 for
>> a good while and I couldn't stop the tick. After the 2nd patch it should be
>> fine now, I hope.
>
> By "couldn't stop the tick" you mean "the tick went on a few times more
> than you like" or do you really mean "couldn't ever stop the tick"?
> My guess is the former.  If there were problems like the latter, Arjan
> would probably have beat me up about it long ago!

It's more like couldn't ever stop the tick. But that doesn't concern mainline.
This is because I have a hook that prevents the tick from being stopped
until rcu_pending() == 0.

In mainline it doesn't prevent the CPU from going nohz idle though, because
the softirq is armed from the tick. Once the softirq is processed, the CPU
can go to sleep. On the next timer tick it would again raise the softirq and
then could again go to sleep, etc..

I still have a trace of that in dyntick-hpc, with my rcu_pending() hook,
where rcu_pending() kept returning 1 during at least 100 seconds, on each tick.
I did not dig much further into my own code, as I immediately switched to
tip:master to check whether the problem came from my code or not.
There I discovered that rcu_pending() indeed kept returning 1 for a while
in mainline too (I don't remember how long "a while" could be, though): I saw
all these spurious rcu softirqs at each tick, caused by rcu_pending(), for
random time slices: probably between a wake up from idle and the next grace
period, if my theory is right. And I think it happened most likely with the
bh flavour, probably because it is subject to fewer grace periods.

And this is what the second patch fixes in mainline, which also seems to
fix my issue in dyntick-hpc.

It probably happened more easily on dyntick-hpc as I was calling
rcu_pending() after calling rcu_enter_nohz() (some buggy part of mine).

>> >  Longer term, __rcu_pending() for your HPC dynticks case
>> > could check for the current CPU having any callbacks before the call to
>> > check_cpu_stall(), as in rcu_needs_cpu_quick_check().  That way, the
>> > CPUs with callbacks would drive the RCU core state machine.
>>
>> Ah, because currently every CPU calling rcu_pending() can be pulled into
>> handling the state machine because of this check, right?
>>
>>         /* Has an RCU GP gone long enough to send resched IPIs &c? */
>>       if (rcu_gp_in_progress(rsp) &&
>>           ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
>>               rdp->n_rp_need_fqs++;
>>               return 1;
>>       }
>> So the goal would be to leave this job to CPUs that have callbacks, right?
>
> In the dynticks-HPC case, yes.
>
>> > We don't
>> > necessarily want this in the common case because it can increase
>> > grace-period latency
>>
>> Because it leaves fewer CPUs to handle the grace period state machine?
>
> Not quite.  Here I am not worried about the dyntick-HPC case, but rather
> the usual more-runnable-tasks-than-CPUs case.  Here, CPUs are context
> switching frequently, so if they report their own quiescent states (even
> when they don't happen to have any callbacks queued) the grace period will
> complete more quickly.  After all, force_quiescent_state() doesn't even
> get started until the third tick following the start of the grace period.


Ah, I see what you mean. So you would suggest to even ignore those
explicit QS reports when in dyntick-hpc mode for CPUs that don't have callbacks?

Why not keep them?


>> > but it could be very useful in the HPC-dyntick
>> > case -- eliminate any number of sources of spurious ticks with low
>> > risk to the high-priority RCU properties.  There are some corresponding
>> > changes required on the force_quiescent_state() path, but these are
>> > reasonably straightforward.
>>
>> Ok, I may have a try at it (if I completely understand what you suggest :-)
>> But first I'll try to get that dyntick mode with the current state, which seems
>> to be enough for the basic desired functionality.
>
> If the two changes you have posted thus far are all you need, this does
> sound like a good starting point.

For now yeah, it seems to work well. I may run into further surprises though :)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 17:38                   ` Frederic Weisbecker
@ 2010-11-24 18:20                     ` Paul E. McKenney
  2010-11-24 20:22                       ` Paul E. McKenney
  2010-11-24 22:42                       ` Frederic Weisbecker
  0 siblings, 2 replies; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24 18:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 06:38:45PM +0100, Frederic Weisbecker wrote:
> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > On Wed, Nov 24, 2010 at 04:45:11PM +0100, Frederic Weisbecker wrote:
> >> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> >> > On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
> >> CPU 1, the one that was idle :-D
> >>
> >> So CPU 1's rdp did catch up with the node and state for its completed field.
> >> But not its gpnum yet.
> >
> > OK, I will need to take a closer look at the rdp->gpnum setting.
> 
> Ok, do you want me to resend the patch with the changelog updated according
> to our discussion?

Please!

> >> >> >> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
> >> >> >> > -> note_new_gpnum() and then we end up a requested quiescent state while
> >> >> >> > every grace periods are completed.
> >> >> >>
> >> >> >> Sorry I should have described that in the changelogs but my ideas
> >> >> >> weren't as clear as they
> >> >> >> are now (at least I feel they are, doesn't mean they actually are ;)
> >> >> >> Chasing these RCU bugs for too much hours has toasted my brain..
> >> >> >
> >> >> > Welcome to my world!!!  But keep in mind that an extra timer tick
> >> >> > or two is much preferable to a potential hang!  And you only get
> >> >> > the extra timer tick if there was some other reason that the
> >> >> > CPU came out of nohz mode, correct?
> >> >>
> >> >> Yeah, either because of a timer, hrtimer, or a reschedule.
> >> >> But you still generate a spurious softirq in this scheme.
> >> >
> >> > Eliminating spurious softirqs is a good thing, but let's review the
> >> > overall priorities (probably missing a few, but this should give you
> >> > an overall view):
> >> >
> >> > 1.      No too-short grace periods!!!
> >>
> >> Oh, I did not know that. Is it to avoid too much
> >> short series of callbacks execution or?
> >
> > To avoid breaking RCU's fundamental guarantee.  ;-)
> >
> > It is easy to create bugs where a CPU thinks that it needs to respond
> > to the grace period based on a prior quiescent state, but unknown to
> > it that grace period has finished and a new one has started.  This
> > CPU might then wrongly report a quiescent state against the new grace
> > period.  Needless to say, this is very very very very very very bad.
> 
> Aaah ok. I thought you were talking about the time of a correct grace period,
> not a bug that would make it seem shorter than it actually is.
> 
> Ok. Of course, priority number 1.

;-)

> >> > 7.      Let CPUs that can do so go to sleep immediately (as opposed to
> >> >        waiting a few milliseconds).
> >> >
> >> > 8.      Avoid spurious softirqs.
> >>
> >> Avoiding spurious softirqs bring us back to 7, even though it's not a matter
> >> of ms but rather us.
> >
> > Microseconds should not be a problem as long as they are rare.  Milliseconds
> > are a problem for the battery-powered folks as well as for people who care
> > about OS jitter.
> 
> Right.
> 
> 
> >> >> Two in fact: one because of the rnp->completed != rsp->completed condition
> >> >> in rcu_pending(), another one because when we update the pgnum, we
> >> >> always start chasing QS, regardless of the last GP beeing completed or not.
> >> >
> >> > OK -- is this an example of #8 above, or is it really #7?  I am absolutely
> >> > not worried about a pair of back-to-back softirqs, and I don't believe
> >> > that you should be, either.  ;-)
> >>
> >> This seems to be 8 and 7.
> >>
> >> A pair of back-to-back softirqs is no huge deal, but still time
> >> consuming, spurious, etc...
> >> if we can easily spott they are useless, why not get rid of them. At
> >> least we can easily
> >> and reliably avoid the second one from note_new_gpnum().
> >
> > As long as I can prove to myself that the patch that gets rid of them
> > does not cause other problems.  ;-)
> 
> :)
> 
> >> >> > Which is why checking the rnp fields makes more sense to me, actually.
> >> >> > Acquiring rnp->lock is much less painful than pinning down the rsp state.
> >> >>
> >> >> Right.
> >> >>
> >> >> Another thing, we  already have the (rnp->gpnum) != rdp->gpnu check in
> >> >> rcu_pending(),
> >> >> why also checking (rnp->completed) != rdp->completed) ?
> >> >
> >> > Because if (rnp->completed != rdp->completed), we might need to process
> >> > some callbacks, either advancing them or invoking them.
> >>
> >> The rdp->nxttail in rcu is still an obscur part for me, so I just believe you :)
> >
> > It is just the place that the pending rcu_head callbacks are stored.
> 
> Yeah. I mean, I need to read how the code manages the different queues.
> But __rcu_process_gp_end() seems to sum it up quite well.

For advancing callbacks, that is the one!  For invocation of callbacks,
see rcu_do_batch().

> >> > By the way, have you introduced a config variable for your HPC dynticks
> >> > changes?
> >>
> >> I will, because I'll have to touch some fast path and I would prefer to do
> >> that compile-conditional.
> >>
> >> But for now I don't care and experiment drafts :)
> >
> > Well, if any of the softirq/tick streamlining in RCU is risky, it needs
> > to go under that same #ifdef!  ;-)
> 
> Sure, if something needs a reorg or a change even in the dyntick-hpc
> off-case code,
> I'll propose that to you, like I did for these two patches.
> 
> But I'll try to keep the invasive changes that are only for
> dyntick-hpc only under that config.

Sounds good!

> >> I'm quite close to something that seems to work well BTW, those series of
> >> softirq were my latest problem as it made rcu_pending() returning 1 for
> >> a good while and I coudn't stop the tick. After the 2nd patch it should be
> >> fine now, I hope.
> >
> > By "couldn't stop the tick" you mean "the tick went on a few times more
> > than you like" or do you really mean "couldn't ever stop the tick"?
> > My guess is the former.  If there were problems like the latter, Arjan
> > would probably have beat me up about it long ago!
> 
> It's more like couldn't ever stop the tick. But that doesn't concern mainline.
> This is because I have a hook that prevents the tick from beeing stopped
> until rcu_pending() == 0.

That would certainly change behavior!!!  Why did you need to do that?

Ah, because force_quiescent_state() has not yet been taught about
dyntick-HPC, got it...

> In mainline it doesn't prevent the CPU from going nohz idle though, because
> the softirq is armed from the tick. Once the softirq is processed, the CPU
> can go to sleep. On the next timer tick it would again raise the softirq and
> then could again go to sleep, etc..

You lost me on this one.  If the CPU goes to sleep (AKA enters dyntick-idle
mode, right?), then there wouldn't be a next timer tick, right?

> I still have a trace of that, with my rcu_pending() hook, in
> dyntick-hpc, that kept
> returning 1 during at least 100 seconds and on each tick.
> I did not go really further into this from my code as I immediately
> switched to tip:master
> to check if the problem came from my code or not.
> And then I discovered that rcu_pending() indeed kept returning 1 for some while
> in mainline (don't remember how much could be "some while" though), I
> saw all these
> spurious rcu softirq at each ticks caused by rcu_pending() and for
> random time slices:
> probably between a wake up from idle and the next grace period, if my
> theory is right, and I
> think that happened likely with bh flavour probably because it's
> subject to less grace periods.
> 
> And this is what the second patch fixes in mainline and that also
> seems to fix my issue in
> dyntick-hpc.
> 
> Probably it happened more easily on dynctick-hpc as I was calling
> rcu_pending() after
> calling rcu_enter_nohz() (some buggy part of mine).

OK, but that is why dyntick-idle is governed by rcu_needs_cpu() rather
than rcu_pending().  But yes, need to upgrade force_quiescent_state().

One hacky way to do that would be to replace smp_send_reschedule() with
an smp_call_function_single() that invoked something like the following
on the target CPU:

	static void rcu_poke_cpu(void *info)
	{
		raise_softirq(RCU_SOFTIRQ);
	}

So rcu_implicit_offline_qs() does something like the following in place
of the smp_send_reschedule():

	smp_call_function_single(rdp->cpu, rcu_poke_cpu, NULL, 0);

The call to set_need_resched() can remain as is.

Of course, a mainline version would need to be a bit more discerning,
but this should work just fine for your experimental use.

This should allow you to revert back to rcu_needs_cpu().

Or am I missing something here?

> >> >  Longer term, __rcu_pending() for your HPC dynticks case
> >> > could check for the current CPU having any callbacks before the call to
> >> > check_cpu_stalls(), as in rcu_needs_cpu_quick_check().  That way, the
> >> > CPUs with callbacks would drive the RCU core state machine.
> >>
> >> Ah, because currently every CPUs calling rcu_pending() can be pulled into
> >> handling the state machine because of this check, right?
> >>
> >>         /* Has an RCU GP gone long enough to send resched IPIs &c? */
> >>       if (rcu_gp_in_progress(rsp) &&
> >>           ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
> >>               rdp->n_rp_need_fqs++;
> >>               return 1;
> >>
> >> So the goal would be to let this job to CPUs that have callbacks, right?
> >
> > In the dynticks-HPC case, yes.
> >
> >> > We don't
> >> > necessarily want this in the common case because it can increase
> >> > grace-period latency
> >>
> >> Because it makes less CPUs to handle the grace period state machine?
> >
> > Not quite.  Here I am not worried about the dyntick-HPC case, but rather
> > the usual more-runnable-tasks-than-CPUs case.  Here, CPUs are context
> > switching frequently, so if they report their own quiescent states (even
> > when they don't happen to have any callbacks queued) the grace period will
> > complete more quickly.  After all, force_quiescent_state() doesn't even
> > get started until the third tick following the start of the grace period.
> 
> Ah, I see what you mean. So you would suggest to even ignore those
> explicit QS report when in dynticj-hpc mode for CPUs that don't have callbacks?
> 
> Why not keeping them?

My belief is that people needing dyntick-HPC are OK with RCU grace periods
taking a few jiffies longer than they might otherwise.  Besides, when you
are running dyntick-HPC, you aren't context switching much, so keeping
the tick doesn't buy you as much reduction in grace-period latency.

> >> > but it could be very useful in the HPC-dyntick
> >> > case -- eliminate any number of sources of spurious ticks with low
> >> > risk to the high-priority RCU properties.  There are some corresponding
> >> > changes required on the force_quiescent_state() path, but these are
> >> > reasonably straightforward.
> >>
> >> Ok, I may have a try at it (if I completely understand what you suggest :-)
> >> But first I'll try to get that dyntick mode with the current state, which seems
> >> to be enough for the basic desired functionality.
> >
> > If the two changes you have posted thus far are all you need, this does
> > sound like a good starting point.
> 
> For now yeah, it seems to work well. I may run into further surpises though :)

Perish the thought!!!  ;-)

							Thanx, Paul

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 18:20                     ` Paul E. McKenney
@ 2010-11-24 20:22                       ` Paul E. McKenney
  2010-11-24 20:45                         ` Frederic Weisbecker
  2010-11-24 22:42                       ` Frederic Weisbecker
  1 sibling, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24 20:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 10:20:51AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 24, 2010 at 06:38:45PM +0100, Frederic Weisbecker wrote:
> > 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > > On Wed, Nov 24, 2010 at 04:45:11PM +0100, Frederic Weisbecker wrote:
> > >> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > >> > On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
> > >> CPU 1, the one that was idle :-D
> > >>
> > >> So CPU 1 rdp did catch up with node and state for its completed field.
> > >> But not its gpnum yet.
> > >
> > > OK, I will need to take a closer look at the rdp->gpnum setting.
> > 
> > Ok, do you want me to resend the patch with the changelog changed accordingly
> > to our discussion or?
> 
> Please!

I take it back.  I queued the following -- your code, but updated
comment and commit message.  Please let me know if I missed anything.

							Thanx, Paul

------------------------------------------------------------------------

commit 1d9d947bb882371a0877ba05207a0b996dcb38ee
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Wed Nov 24 01:31:12 2010 +0100

    rcu: Don't chase unnecessary quiescent states after extended grace periods
    
    When a CPU is in an extended quiescent state, including offline and
    dyntick-idle mode, other CPUs will detect the extended quiescent state
    and respond to the current grace period on that CPU's behalf.
    However, the locking design prevents those other CPUs from updating
    the first CPU's rcu_data state.
    
    Therefore, when this CPU exits its extended quiescent state, it must
    update its rcu_data state.  Because such a CPU will usually check for
    the completion of a prior grace period before checking for the start of a
    new grace period, the rcu_data ->completed field will be updated before
    the rcu_data ->gpnum field.  This means that if RCU is currently idle,
    the CPU will usually enter __note_new_gpnum() with ->completed set to
    the current grace-period number, but with ->gpnum set to some long-ago
    grace period number.  Unfortunately, __note_new_gpnum() will then insist
    that the current CPU needlessly check for a new quiescent state.  This
    checking can result in this CPU needlessly taking several scheduling-clock
    interrupts.
    
    This bug is harmless in most cases, but is a problem for users concerned
    with OS jitter for HPC applications or concerned with battery lifetime
    for portable SMP embedded devices.  This commit therefore makes the
    test in __note_new_gpnum() check for this situation and avoid the needless
    quiescent-state checks.
    
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5df948f..76cd5d2 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -616,8 +616,20 @@ static void __init check_cpu_stall_init(void)
 static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 {
 	if (rdp->gpnum != rnp->gpnum) {
-		rdp->qs_pending = 1;
-		rdp->passed_quiesc = 0;
+		/*
+		 * Because RCU checks for the prior grace period ending
+		 * before checking for a new grace period starting, it
+		 * is possible for rdp->gpnum to be set to the old grace
+		 * period and rdp->completed to be set to the new grace
+		 * period.  So don't bother checking for a quiescent state
+		 * for the rnp->gpnum grace period unless it really is
+		 * waiting for this CPU.
+		 */
+		if (rdp->completed != rnp->gpnum) {
+			rdp->qs_pending = 1;
+			rdp->passed_quiesc = 0;
+		}
+
 		rdp->gpnum = rnp->gpnum;
 	}
 }

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 20:22                       ` Paul E. McKenney
@ 2010-11-24 20:45                         ` Frederic Weisbecker
  2010-11-24 21:19                           ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24 20:45 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> I take it back.  I queued the following -- your code, but updated
> comment and commit message.  Please let me know if I missed anything.
>
>                                                        Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 1d9d947bb882371a0877ba05207a0b996dcb38ee
> Author: Frederic Weisbecker <fweisbec@gmail.com>
> Date:   Wed Nov 24 01:31:12 2010 +0100
>
>    rcu: Don't chase unnecessary quiescent states after extended grace periods
>
>    When a CPU is in an extended quiescent state, including offline and
>    dyntick-idle mode, other CPUs will detect the extended quiescent state
>    and respond to the the current grace period on that CPU's behalf.
>    However, the locking design prevents those other CPUs from updating
>    the first CPU's rcu_data state.
>
>    Therefore, when this CPU exits its extended quiescent state, it must
>    update its rcu_data state.  Because such a CPU will usually check for
>    the completion of a prior grace period before checking for the start of a
>    new grace period, the rcu_data ->completed field will be updated before
>    the rcu_data ->gpnum field.  This means that if RCU is currently idle,
>    the CPU will usually enter __note_new_gpnum() with ->completed set to
>    the current grace-period number, but with ->gpnum set to some long-ago
>    grace period number.  Unfortunately, __note_new_gpnum() will then insist
>    that the current CPU needlessly check for a new quiescent state.  This
>    checking can result in this CPU needlessly taking several scheduling-clock
>    interrupts.


So I'm all ok for the commit and the updated comments. But I have a doubt
about the above sentence.

The effect seems to be one extra softirq rather than an extra tick, because
before sleeping the CPU will check rcu_needs_cpu(), which doesn't check for
the need of noting a quiescent state, IIRC...

And I think the softirq will only be raised on the next tick.

Hm?

>
>    This bug is harmless in most cases, but is a problem for users concerned
>    with OS jitter for HPC applications or concerned with battery lifetime
>    for portable SMP embedded devices.  This commit therefore makes the
>    test in __note_new_gpnum() check for this situation and avoid the needless
>    quiescent-state checks.
>
>    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
>    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
>    Cc: Ingo Molnar <mingo@elte.hu>
>    Cc: Thomas Gleixner <tglx@linutronix.de>
>    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>    Cc: Steven Rostedt <rostedt@goodmis.org>
>    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 5df948f..76cd5d2 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -616,8 +616,20 @@ static void __init check_cpu_stall_init(void)
>  static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
>  {
>        if (rdp->gpnum != rnp->gpnum) {
> -               rdp->qs_pending = 1;
> -               rdp->passed_quiesc = 0;
> +               /*
> +                * Because RCU checks for the prior grace period ending
> +                * before checking for a new grace period starting, it
> +                * is possible for rdp->gpnum to be set to the old grace
> +                * period and rdp->completed to be set to the new grace
> +                * period.  So don't bother checking for a quiescent state
> +                * for the rnp->gpnum grace period unless it really is
> +                * waiting for this CPU.
> +                */
> +               if (rdp->completed != rnp->gpnum) {
> +                       rdp->qs_pending = 1;
> +                       rdp->passed_quiesc = 0;
> +               }
> +
>                rdp->gpnum = rnp->gpnum;
>        }
>  }
>

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 20:45                         ` Frederic Weisbecker
@ 2010-11-24 21:19                           ` Paul E. McKenney
  2010-11-24 21:50                             ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-24 21:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 09:45:08PM +0100, Frederic Weisbecker wrote:
> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > I take it back.  I queued the following -- your code, but updated
> > comment and commit message.  Please let me know if I missed anything.
> >
> >                                                        Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 1d9d947bb882371a0877ba05207a0b996dcb38ee
> > Author: Frederic Weisbecker <fweisbec@gmail.com>
> > Date:   Wed Nov 24 01:31:12 2010 +0100
> >
> >    rcu: Don't chase unnecessary quiescent states after extended grace periods
> >
> >    When a CPU is in an extended quiescent state, including offline and
> >    dyntick-idle mode, other CPUs will detect the extended quiescent state
> >    and respond to the the current grace period on that CPU's behalf.
> >    However, the locking design prevents those other CPUs from updating
> >    the first CPU's rcu_data state.
> >
> >    Therefore, when this CPU exits its extended quiescent state, it must
> >    update its rcu_data state.  Because such a CPU will usually check for
> >    the completion of a prior grace period before checking for the start of a
> >    new grace period, the rcu_data ->completed field will be updated before
> >    the rcu_data ->gpnum field.  This means that if RCU is currently idle,
> >    the CPU will usually enter __note_new_gpnum() with ->completed set to
> >    the current grace-period number, but with ->gpnum set to some long-ago
> >    grace period number.  Unfortunately, __note_new_gpnum() will then insist
> >    that the current CPU needlessly check for a new quiescent state.  This
> >    checking can result in this CPU needlessly taking several scheduling-clock
> >    interrupts.
> 
> 
> So I'm all ok for the commit and the comments updated. But just a doubt about
> the about sentence.
> 
> The effect seems more that there will be one extra softirq. But not an
> extra tick
> because before sleeping, the CPU will check rcu_needs_cpu() which
> doesn't check for
> the need of noting a quiescent state, IIRC...
> 
> And I think the softirq will be only raised on the next tick.
> 
> Hm?

Good point!  This paragraph now reads:

	Therefore, when this CPU exits its extended quiescent state,
	it must update its rcu_data state.  Because such a CPU will
	usually check for the completion of a prior grace period
	before checking for the start of a new grace period, the
	rcu_data ->completed field will be updated before the rcu_data
	->gpnum field.	This means that if RCU is currently idle, the
	CPU will usually enter __note_new_gpnum() with ->completed set
	to the current grace-period number, but with ->gpnum set to some
	long-ago grace period number.  Unfortunately, __note_new_gpnum()
	will then insist that the current CPU needlessly check for a new
	quiescent state.  This checking can result in this CPU needlessly
	taking an additional softirq for unnecessary RCU processing.

Fair enough?

							Thanx, Paul

> >    This bug is harmless in most cases, but is a problem for users concerned
> >    with OS jitter for HPC applications or concerned with battery lifetime
> >    for portable SMP embedded devices.  This commit therefore makes the
> >    test in __note_new_gpnum() check for this situation and avoid the needless
> >    quiescent-state checks.
> >
> >    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> >    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> >    Cc: Ingo Molnar <mingo@elte.hu>
> >    Cc: Thomas Gleixner <tglx@linutronix.de>
> >    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >    Cc: Steven Rostedt <rostedt@goodmis.org>
> >    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 5df948f..76cd5d2 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -616,8 +616,20 @@ static void __init check_cpu_stall_init(void)
> >  static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
> >  {
> >        if (rdp->gpnum != rnp->gpnum) {
> > -               rdp->qs_pending = 1;
> > -               rdp->passed_quiesc = 0;
> > +               /*
> > +                * Because RCU checks for the prior grace period ending
> > +                * before checking for a new grace period starting, it
> > +                * is possible for rdp->gpnum to be set to the old grace
> > +                * period and rdp->completed to be set to the new grace
> > +                * period.  So don't bother checking for a quiescent state
> > +                * for the rnp->gpnum grace period unless it really is
> > +                * waiting for this CPU.
> > +                */
> > +               if (rdp->completed != rnp->gpnum) {
> > +                       rdp->qs_pending = 1;
> > +                       rdp->passed_quiesc = 0;
> > +               }
> > +
> >                rdp->gpnum = rnp->gpnum;
> >        }
> >  }
> >

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 21:19                           ` Paul E. McKenney
@ 2010-11-24 21:50                             ` Frederic Weisbecker
  0 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24 21:50 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Wed, Nov 24, 2010 at 09:45:08PM +0100, Frederic Weisbecker wrote:
>> 2010/11/24 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>> > I take it back.  I queued the following -- your code, but updated
>> > comment and commit message.  Please let me know if I missed anything.
>> >
>> >                                                        Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > commit 1d9d947bb882371a0877ba05207a0b996dcb38ee
>> > Author: Frederic Weisbecker <fweisbec@gmail.com>
>> > Date:   Wed Nov 24 01:31:12 2010 +0100
>> >
>> >    rcu: Don't chase unnecessary quiescent states after extended grace periods
>> >
>> >    When a CPU is in an extended quiescent state, including offline and
>> >    dyntick-idle mode, other CPUs will detect the extended quiescent state
>> >    and respond to the the current grace period on that CPU's behalf.
>> >    However, the locking design prevents those other CPUs from updating
>> >    the first CPU's rcu_data state.
>> >
>> >    Therefore, when this CPU exits its extended quiescent state, it must
>> >    update its rcu_data state.  Because such a CPU will usually check for
>> >    the completion of a prior grace period before checking for the start of a
>> >    new grace period, the rcu_data ->completed field will be updated before
>> >    the rcu_data ->gpnum field.  This means that if RCU is currently idle,
>> >    the CPU will usually enter __note_new_gpnum() with ->completed set to
>> >    the current grace-period number, but with ->gpnum set to some long-ago
>> >    grace period number.  Unfortunately, __note_new_gpnum() will then insist
>> >    that the current CPU needlessly check for a new quiescent state.  This
>> >    checking can result in this CPU needlessly taking several scheduling-clock
>> >    interrupts.
>>
>>
>> So I'm all ok for the commit and the comments updated. But just a doubt about
>> the about sentence.
>>
>> The effect seems more that there will be one extra softirq. But not an
>> extra tick
>> because before sleeping, the CPU will check rcu_needs_cpu() which
>> doesn't check for
>> the need of noting a quiescent state, IIRC...
>>
>> And I think the softirq will be only raised on the next tick.
>>
>> Hm?
>
> Good point!  This paragraph now reads:
>
>        Therefore, when this CPU exits its extended quiescent state,
>        it must update its rcu_data state.  Because such a CPU will
>        usually check for the completion of a prior grace period
>        before checking for the start of a new grace period, the
>        rcu_data ->completed field will be updated before the rcu_data
>        ->gpnum field.  This means that if RCU is currently idle, the
>        CPU will usually enter __note_new_gpnum() with ->completed set
>        to the current grace-period number, but with ->gpnum set to some
>        long-ago grace period number.  Unfortunately, __note_new_gpnum()
>        will then insist that the current CPU needlessly check for a new
>        quiescent state.  This checking can result in this CPU needlessly
>        taking an additional softirq for unnecessary RCU processing.
>
> Fair enough?


Perfect! :)

Thanks a lot!

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 18:20                     ` Paul E. McKenney
  2010-11-24 20:22                       ` Paul E. McKenney
@ 2010-11-24 22:42                       ` Frederic Weisbecker
  2010-11-25 14:56                         ` Paul E. McKenney
  1 sibling, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-24 22:42 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, 24 Nov 2010 10:20:51 -0800,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Wed, Nov 24, 2010 at 06:38:45PM +0100, Frederic Weisbecker wrote:
> > Yeah. I mean, I need to read how the code manages the different
> > queues. But __rcu_process_gp_end() seems to sum it up quite well.
> 
> For advancing callbacks, that is the one!  For invocation of
> callbacks, see rcu_do_batch().

Ok.


> > It's more like couldn't ever stop the tick. But that doesn't
> > concern mainline. This is because I have a hook that prevents the
> > tick from beeing stopped until rcu_pending() == 0.
> 
> That would certainly change behavior!!!  Why did you need to do that?
> 
> Ah, because force_quiescent_state() has not yet been taught about
> dyntick-HPC, got it...

Oh, actually I have taught it about that. For such an isolated CPU that
doesn't respond, it sends a specific IPI that will restart the tick if
we are not in nohz.

The point in restarting the tick is to find some quiescent states and
also to keep the tick for a little while for potential further grace
periods to complete while we are in the kernel.

This is why I use rcu_pending() from the tick: to check if we still
need the tick for rcu.
 
> > In mainline it doesn't prevent the CPU from going nohz idle though,
> > because the softirq is armed from the tick. Once the softirq is
> > processed, the CPU can go to sleep. On the next timer tick it would
> > again raise the softirq and then could again go to sleep, etc..
> 
> You lost me on this one.  If the CPU goes to sleep (AKA enters
> dyntick-idle mode, right?), then there wouldn't be a next timer tick,
> right?

If there is a timer queued (timer_list or hrtimer), then the next timer
tick is programmed to fire for the next timer. Until then the CPU can
go to sleep and it will be woken up on that next timer interrupt.


> > I still have a trace of that in dyntick-hpc, with my rcu_pending()
> > hook returning 1 on each tick for at least 100 seconds.
> > I did not dig much further into my own code, as I immediately
> > switched to tip:master to check whether the problem came from my
> > code or not. And then I discovered that rcu_pending() indeed kept
> > returning 1 for a while in mainline too (I don't remember how long
> > "a while" was): I saw all these spurious rcu softirqs at each tick,
> > caused by rcu_pending(), for random time slices: probably between a
> > wake up from idle and the next grace period, if my theory is right.
> > And I think it happened most likely with the bh flavour, probably
> > because it is subject to fewer grace periods.
> > 
> > And this is what the second patch fixes in mainline, and it also
> > seems to fix my issue in dyntick-hpc.
> > 
> > It probably happened more easily on dyntick-hpc as I was calling
> > rcu_pending() after calling rcu_enter_nohz() (some buggy part of
> > mine).
> 
> OK, but that is why dyntick-idle is governed by rcu_needs_cpu() rather
> than rcu_pending().  But yes, need to upgrade force_quiescent_state().
> 
> One hacky way to do that would be to replace smp_send_reschedule()
> with an smp_call_function_single() that invoked something like the
> following on the target CPU:
> 
> 	static void rcu_poke_cpu(void *info)
> 	{
> 		raise_softirq(RCU_SOFTIRQ);
> 	}
> 
> So rcu_implicit_offline_qs() does something like the following in
> place of the smp_send_reschedule():
> 
> 	smp_call_function_single(rdp->cpu, rcu_poke_cpu, NULL, 0);
> 
> The call to set_need_resched() can remain as is.
> 
> Of course, a mainline version would need to be a bit more discerning,
> but this should work just fine for your experimental use.
> 
> This should allow you to revert back to rcu_needs_cpu().
> 
> Or am I missing something here?

So, as I explained above, I'm currently using such an alternate IPI. But
raising the softirq would only take care of:

* checking if there is a new grace period (rearming rdp->qs_pending and so on)
* taking care of callbacks

But it's not enough to track quiescent states. And we have no more timer
interrupts to track them. So we need to restart the tick at least until
we find a quiescent state for the grace period waiting for us.

But I may be missing something too :)


> > Ah, I see what you mean. So you would suggest to even ignore those
> > explicit QS reports when in dyntick-hpc mode for CPUs that don't
> > have callbacks?
> > 
> > Why not keep them?
> 
> My belief is that people needing dyntick-HPC are OK with RCU grace
> periods taking a few jiffies longer than they might otherwise.
> Besides, when you are running dyntick-HPC, you aren't context
> switching much, so keeping the tick doesn't buy you as much reduction
> in grace-period latency.

But don't we still need the tick in such cases (if we aren't in
userspace) when a grace period starts, to note our quiescent states?
The rcu IPI itself doesn't seem to be sufficient for that.

I'm not sure I understand what you mean.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/2] rcu: Fix series of spurious RCU softirqs
  2010-11-24  0:31 [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Frederic Weisbecker
  2010-11-24  0:31 ` [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods Frederic Weisbecker
  2010-11-24  0:31 ` [PATCH 2/2] rcu: Stop checking quiescent states after grace period completion from remote Frederic Weisbecker
@ 2010-11-25  3:42 ` Lai Jiangshan
  2010-11-25  7:38   ` Frederic Weisbecker
  2 siblings, 1 reply; 27+ messages in thread
From: Lai Jiangshan @ 2010-11-25  3:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt, Ingo Molnar

On 11/24/2010 08:31 AM, Frederic Weisbecker wrote:
> Hi,
> 
> I've observed some not so infrequent series of spurious rcu
> softirqs, sometimes happening at each tick for a random
> while.
> 
> These patches aim at fixing them.
> 
> Thanks.
> 
> Frederic Weisbecker (2):
>   rcu: Don't chase unnecessary quiescent states after extended grace periods
>   rcu: Stop checking quiescent states after grace period completion from remote
> 

If we ensure rdp->gpnum >= rdp->completed is always true, the problems
you described will not exist. Or maybe I misunderstood you.

rdp->gpnum >= rdp->completed is a very important guarantee I think.
(In my RCURING, it is guaranteed.) I'm afraid there are some other
problems still hidden if it is not guaranteed.

so I recommend: (code is better than words)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d5bc439..af4e87a 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -648,6 +648,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 
 		/* Remember that we saw this grace-period completion. */
 		rdp->completed = rnp->completed;
+
+		/* Ensure ->gpnum >= ->completed after NO_HZ */
+		if (unlikely(rnp->completed - rdp->gpnum > 0
+				|| rdp->gpnum - rnp->gpnum > 0)) {
+			rdp->gpnum = rnp->completed;
+			rdp->qs_pending = 0;
+		}
 	}
 }
 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/2] rcu: Fix series of spurious RCU softirqs
  2010-11-25  3:42 ` [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Lai Jiangshan
@ 2010-11-25  7:38   ` Frederic Weisbecker
  2010-11-25  8:35     ` Lai Jiangshan
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-25  7:38 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, LKML, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt, Ingo Molnar

On Thu, Nov 25, 2010 at 11:42:34AM +0800, Lai Jiangshan wrote:
> On 11/24/2010 08:31 AM, Frederic Weisbecker wrote:
> > Hi,
> > 
> > I've observed some not so infrequent series of spurious rcu
> > softirqs, sometimes happening at each tick for a random
> > while.
> > 
> > These patches aim at fixing them.
> > 
> > Thanks.
> > 
> > Frederic Weisbecker (2):
> >   rcu: Don't chase unnecessary quiescent states after extended grace periods
> >   rcu: Stop checking quiescent states after grace period completion from remote
> > 
> 
> If we ensure rdp->gpnum >= rdp->completed is always true, the problems
> you described will not exist. Or maybe I misunderstood you.
> 
> rdp->gpnum >= rdp->completed is a very important guarantee I think.
> (In my RCURING, it is guaranteed.) I'm afraid there are some other
> problems still hidden if it is not guaranteed.
> 
> so I recommend: (code is better than words)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index d5bc439..af4e87a 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -648,6 +648,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  
>  		/* Remember that we saw this grace-period completion. */
>  		rdp->completed = rnp->completed;
> +
> +		/* Ensure ->gpnum >= ->completed after NO_HZ */
> +		if (unlikely(rnp->completed - rdp->gpnum > 0
> +				|| rdp->gpnum - rnp->gpnum > 0)) {
> +			rdp->gpnum = rnp->completed;
> +			rdp->qs_pending = 0;


That's an alternative to my first patch, yeah. And if rdp->gpnum >= rdp->completed
must be a guarantee outside the rnp lock, then it's certainly better, because
the lock is dropped between rcu_process_gp_end() and note_new_gpnum(), and
both values can be out of sync in this lockless window.

But perhaps this shouldn't touch rdp->qs_pending:

"if (rnp->completed > rdp->gpnum || rdp->gpnum > rnp->gpnum)" is not
a guarantee that we don't need to find quiescent states.

but rnp->completed == rnp->gpnum would provide that guarantee.
That said, note_new_gpnum() would fix the value of rdp->qs_pending.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/2] rcu: Fix series of spurious RCU softirqs
  2010-11-25  7:38   ` Frederic Weisbecker
@ 2010-11-25  8:35     ` Lai Jiangshan
  2010-11-25  9:27       ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Lai Jiangshan @ 2010-11-25  8:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt, Ingo Molnar

On 11/25/2010 03:38 PM, Frederic Weisbecker wrote:
> On Thu, Nov 25, 2010 at 11:42:34AM +0800, Lai Jiangshan wrote:
>> On 11/24/2010 08:31 AM, Frederic Weisbecker wrote:
>>> Hi,
>>>
>>> I've observed some not so infrequent series of spurious rcu
>>> softirqs, sometimes happening at each tick for a random
>>> while.
>>>
>>> These patches aim at fixing them.
>>>
>>> Thanks.
>>>
>>> Frederic Weisbecker (2):
>>>   rcu: Don't chase unnecessary quiescent states after extended grace periods
>>>   rcu: Stop checking quiescent states after grace period completion from remote
>>>
>>
>> If we ensure rdp->gpnum >= rdp->completed is always true, the problems
>> you described will not exist. Or maybe I misunderstood you.
>>
>> rdp->gpnum >= rdp->completed is a very important guarantee I think.
>> (In my RCURING, it is guaranteed.) I'm afraid there are some other
>> problems still hidden if it is not guaranteed.
>>
>> so I recommend: (code is better than words)
>>
>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>> index d5bc439..af4e87a 100644
>> --- a/kernel/rcutree.c
>> +++ b/kernel/rcutree.c
>> @@ -648,6 +648,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>>  
>>  		/* Remember that we saw this grace-period completion. */
>>  		rdp->completed = rnp->completed;
>> +
>> +		/* Ensure ->gpnum >= ->completed after NO_HZ */
>> +		if (unlikely(rnp->completed - rdp->gpnum > 0
>> +				|| rdp->gpnum - rnp->gpnum > 0)) {
>> +			rdp->gpnum = rnp->completed;
>> +			rdp->qs_pending = 0;
> 
> 
> That's an alternative to my first patch yeah.

Since rdp->gpnum >= rdp->completed is guaranteed,
your second patch is not needed; the problem is also fixed.

If rnp->gpnum == rnp->completed, rcu_report_qs_rdp() will not be called,
because rdp->qs_pending == 0 when rnp->gpnum == rnp->completed.


> And if rdp->gpnum >= rdp->completed
> must be a guarantee outside the rnp lock, then it's certainly better, because
> the lock is dropped between rcu_process_gp_end() and note_new_gpnum(), and
> both values can be out of sync in this lockless window.
> 
> But perhaps this shouldn't touch rdp->qs_pending:

If rdp->gpnum == rnp->completed, it means we don't need a qs for
rdp->gpnum, since it is completed. So we must set rdp->qs_pending = 0;

When we really need a qs, rdp->qs_pending will be fixed in note_new_gpnum().

> 
> "if (rnp->completed > rdp->gpnum || rdp->gpnum > rnp->gpnum)" is not
> a guarantee that we don't need to find quiescent states.
> 
> but rnp->completed == rnp->gpnum would provide that guarantee.
> That said, note_new_gpnum() would fix the value of rdp->qs_pending.
> 
> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/2] rcu: Fix series of spurious RCU softirqs
  2010-11-25  8:35     ` Lai Jiangshan
@ 2010-11-25  9:27       ` Frederic Weisbecker
  2010-11-25 14:58         ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-25  9:27 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, LKML, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt, Ingo Molnar

On Thu, Nov 25, 2010 at 04:35:30PM +0800, Lai Jiangshan wrote:
> On 11/25/2010 03:38 PM, Frederic Weisbecker wrote:
> > On Thu, Nov 25, 2010 at 11:42:34AM +0800, Lai Jiangshan wrote:
> >> On 11/24/2010 08:31 AM, Frederic Weisbecker wrote:
> >>> Hi,
> >>>
> >>> I've observed some not so infrequent series of spurious rcu
> >>> softirqs, sometimes happening at each tick for a random
> >>> while.
> >>>
> >>> These patches aim at fixing them.
> >>>
> >>> Thanks.
> >>>
> >>> Frederic Weisbecker (2):
> >>>   rcu: Don't chase unnecessary quiescent states after extended grace periods
> >>>   rcu: Stop checking quiescent states after grace period completion from remote
> >>>
> >>
> >> If we ensure rdp->gpnum >= rdp->completed is always true, the problems
> >> you described will not exist. Or maybe I misunderstood you.
> >>
> >> rdp->gpnum >= rdp->completed is a very important guarantee I think.
> >> (In my RCURING, it is guaranteed.) I'm afraid there are some other
> >> problems still hidden if it is not guaranteed.
> >>
> >> so I recommend: (code is better than words)
> >>
> >> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> >> index d5bc439..af4e87a 100644
> >> --- a/kernel/rcutree.c
> >> +++ b/kernel/rcutree.c
> >> @@ -648,6 +648,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
> >>  
> >>  		/* Remember that we saw this grace-period completion. */
> >>  		rdp->completed = rnp->completed;
> >> +
> >> +		/* Ensure ->gpnum >= ->completed after NO_HZ */
> >> +		if (unlikely(rnp->completed - rdp->gpnum > 0
> >> +				|| rdp->gpnum - rnp->gpnum > 0)) {
> >> +			rdp->gpnum = rnp->completed;
> >> +			rdp->qs_pending = 0;
> > 
> > 
> > That's an alternative to my first patch yeah.
> 
> Since rdp->gpnum >= rdp->completed is guaranteed,
> your second patch is not needed; the problem is also fixed.
> 
> If rnp->gpnum == rnp->completed, rcu_report_qs_rdp() will not be called,
> because rdp->qs_pending == 0 when rnp->gpnum == rnp->completed.


Aaah...


> 
> > And if rdp->gpnum >= rdp->completed
> > must be a guarantee outside the rnp lock, then it's certainly better, because
> > the lock is dropped between rcu_process_gp_end() and note_new_gpnum(), and
> > both values can be out of sync in this lockless window.
> > 
> > But perhaps this shouldn't touch rdp->qs_pending:
> 
> If rdp->gpnum == rnp->completed, it means we don't need a qs for
> rdp->gpnum, since it is completed. So we must set rdp->qs_pending = 0;
> 
> When we really need a qs, rdp->qs_pending will be fixed in note_new_gpnum().


Ok that makes all sense now!

I'm just not sure about your check above.

(rdp->gpnum - rnp->gpnum > 0) can never happen, right?

Also perhaps we should set rdp->qs_pending = 0 only if
rnp->completed == rdp->completed?

Which in the end would be:

	/* Remember that we saw this grace-period completion. */
	rdp->completed = rnp->completed;

+	if (rdp->gpnum < rdp->completed)
+		rdp->gpnum = rdp->completed;
+
+	if (rdp->gpnum == rdp->completed)
+		rdp->qs_pending = 0;


And then if there is a new grace period to handle, this will
be done through note_new_gpnum().

Hm?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-24 22:42                       ` Frederic Weisbecker
@ 2010-11-25 14:56                         ` Paul E. McKenney
  2010-11-26 14:06                           ` Frederic Weisbecker
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-25 14:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Wed, Nov 24, 2010 at 11:42:57PM +0100, Frederic Weisbecker wrote:
> On Wed, 24 Nov 2010 10:20:51 -0800,
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Wed, Nov 24, 2010 at 06:38:45PM +0100, Frederic Weisbecker wrote:
> > > Yeah. I mean, I need to read how the code manages the different
> > > queues. But __rcu_process_gp_end() seems to sum it up quite well.
> > 
> > For advancing callbacks, that is the one!  For invocation of
> > callbacks, see rcu_do_batch().
> 
> Ok.
> 
> > > It's more like it couldn't ever stop the tick. But that doesn't
> > > concern mainline. This is because I have a hook that prevents the
> > > tick from being stopped until rcu_pending() == 0.
> > 
> > That would certainly change behavior!!!  Why did you need to do that?
> > 
> > Ah, because force_quiescent_state() has not yet been taught about
> > dyntick-HPC, got it...
> 
> Oh actually I have taught it about that. For such isolated CPU that
> doesn't respond, it sends a specific IPI that will restart the tick if
> we are not in nohz.
> 
> The point in restarting the tick is to find some quiescent states and
> also to keep the tick for a little while for potential further grace
> periods to complete while we are in the kernel.
> 
> This is why I use rcu_pending() from the tick: to check if we still
> need the tick for rcu.

So if you have received the IPI, then you turn on the tick.  As soon as
rcu_pending() returns false, you can turn off the tick.  Once the tick
is off, you can go back to using rcu_needs_cpu(), right?  (At least
until that CPU receives another RCU IPI.)

Either that or you need to check for both rcu_needs_cpu() -and-
rcu_pending().  Actually, you do need to check both during the time that
you have the tick on after receiving an RCU IPI.

> > > In mainline it doesn't prevent the CPU from going nohz idle though,
> > > because the softirq is armed from the tick. Once the softirq is
> > > processed, the CPU can go to sleep. On the next timer tick it would
> > > again raise the softirq and then could again go to sleep, etc..
> > 
> > You lost me on this one.  If the CPU goes to sleep (AKA enters
> > dyntick-idle mode, right?), then there wouldn't be a next timer tick,
> > right?
> 
> If there is a timer queued (timer_list or hrtimer), then the next timer
> tick is programmed to fire for the next timer. Until then the CPU can
> go to sleep and it will be woken up on that next timer interrupt.

OK.

> > > I still have a trace of that in dyntick-hpc, with my rcu_pending()
> > > hook returning 1 on each tick for at least 100 seconds.
> > > I did not dig much further into my own code, as I immediately
> > > switched to tip:master to check whether the problem came from my
> > > code or not. And then I discovered that rcu_pending() indeed kept
> > > returning 1 for a while in mainline too (I don't remember how long
> > > "a while" was): I saw all these spurious rcu softirqs at each tick,
> > > caused by rcu_pending(), for random time slices: probably between a
> > > wake up from idle and the next grace period, if my theory is right.
> > > And I think it happened most likely with the bh flavour, probably
> > > because it is subject to fewer grace periods.
> > > 
> > > And this is what the second patch fixes in mainline, and it also
> > > seems to fix my issue in dyntick-hpc.
> > > 
> > > It probably happened more easily on dyntick-hpc as I was calling
> > > rcu_pending() after calling rcu_enter_nohz() (some buggy part of
> > > mine).
> > 
> > OK, but that is why dyntick-idle is governed by rcu_needs_cpu() rather
> > than rcu_pending().  But yes, need to upgrade force_quiescent_state().
> > 
> > One hacky way to do that would be to replace smp_send_reschedule()
> > with an smp_call_function_single() that invoked something like the
> > following on the target CPU:
> > 
> > 	static void rcu_poke_cpu(void *info)
> > 	{
> > 		raise_softirq(RCU_SOFTIRQ);
> > 	}
> > 
> > So rcu_implicit_offline_qs() does something like the following in
> > place of the smp_send_reschedule():
> > 
> > 	smp_call_function_single(rdp->cpu, rcu_poke_cpu, NULL, 0);
> > 
> > The call to set_need_resched() can remain as is.
> > 
> > Of course, a mainline version would need to be a bit more discerning,
> > but this should work just fine for your experimental use.
> > 
> > This should allow you to revert back to rcu_needs_cpu().
> > 
> > Or am I missing something here?
> 
> So, as I explained above, I'm currently using such an alternate IPI. But
> raising the softirq would only take care of:
> 
> * checking if there is a new grace period (rearming rdp->qs_pending and so on)
> * taking care of callbacks
> 
> But it's not enough to track quiescent states. And we have no more timer
> interrupts to track them. So we need to restart the tick at least until
> we find a quiescent state for the grace period waiting for us.
> 
> But I may be missing something too :)

Well, I don't know whether or not you are missing it, but I do need to
get my act together and get RCU priority boosting ported from tiny
RCU to tree RCU.  Otherwise, force_quiescent_state() cannot do much
for preemptible RCU.

> > > Ah, I see what you mean. So you would suggest to even ignore those
> > > explicit QS reports when in dyntick-hpc mode for CPUs that don't
> > > have callbacks?
> > > 
> > > Why not keep them?
> > 
> > My belief is that people needing dyntick-HPC are OK with RCU grace
> > periods taking a few jiffies longer than they might otherwise.
> > Besides, when you are running dyntick-HPC, you aren't context
> > switching much, so keeping the tick doesn't buy you as much reduction
> > in grace-period latency.
> 
> But don't we still need the tick in such cases (if we aren't in
> userspace) when a grace period starts, to note our quiescent states?
> The rcu IPI itself doesn't seem to be sufficient for that.
> 
> I'm not sure I understand what you mean.

If there is a grace period, there must be an RCU callback.  The CPU that
has the RCU callback queued will keep its own tick going, because
rcu_needs_cpu() will return true.

The reason for ignoring the explicit QS reports in at least some of the
cases is that paying attention to them requires that the tick be enabled.
Which is fine for most workloads, but not so good for workloads that care
about OS jitter.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/2] rcu: Fix series of spurious RCU softirqs
  2010-11-25  9:27       ` Frederic Weisbecker
@ 2010-11-25 14:58         ` Paul E. McKenney
  0 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-25 14:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Lai Jiangshan, LKML, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt, Ingo Molnar

On Thu, Nov 25, 2010 at 10:27:08AM +0100, Frederic Weisbecker wrote:
> On Thu, Nov 25, 2010 at 04:35:30PM +0800, Lai Jiangshan wrote:
> > On 11/25/2010 03:38 PM, Frederic Weisbecker wrote:
> > > On Thu, Nov 25, 2010 at 11:42:34AM +0800, Lai Jiangshan wrote:
> > >> On 11/24/2010 08:31 AM, Frederic Weisbecker wrote:
> > >>> Hi,
> > >>>
> > >>> I've observed some not so infrequent series of spurious rcu
> > >>> softirqs, sometimes happening at each tick for a random
> > >>> while.
> > >>>
> > >>> These patches aim at fixing them.
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Frederic Weisbecker (2):
> > >>>   rcu: Don't chase unnecessary quiescent states after extended grace periods
> > >>>   rcu: Stop checking quiescent states after grace period completion from remote
> > >>>
> > >>
> > >> If we ensure rdp->gpnum >= rdp->completed is always true, the problems
> > >> you described will not exist. Or maybe I misunderstood you.
> > >>
> > >> rdp->gpnum >= rdp->completed is a very important guarantee I think.
> > >> (In my RCURING, it is guaranteed.) I'm afraid there are some other
> > >> problems still hidden if it is not guaranteed.
> > >>
> > >> so I recommend: (code is better than words)
> > >>
> > >> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > >> index d5bc439..af4e87a 100644
> > >> --- a/kernel/rcutree.c
> > >> +++ b/kernel/rcutree.c
> > >> @@ -648,6 +648,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
> > >>  
> > >>  		/* Remember that we saw this grace-period completion. */
> > >>  		rdp->completed = rnp->completed;
> > >> +
> > >> +		/* Ensure ->gpnum >= ->completed after NO_HZ */
> > >> +		if (unlikely(rnp->completed - rdp->gpnum > 0
> > >> +				|| rdp->gpnum - rnp->gpnum > 0)) {
> > >> +			rdp->gpnum = rnp->completed;
> > >> +			rdp->qs_pending = 0;
> > > 
> > > 
> > > That's an alternative to my first patch yeah.
> > 
> > Since rdp->gpnum >= rdp->completed is guaranteed,
> > your second patch is not needed; the problem is also fixed.
> > 
> > If rnp->gpnum == rnp->completed, rcu_report_qs_rdp() will not be called,
> > because rdp->qs_pending == 0 when rnp->gpnum == rnp->completed.
> 
> 
> Aaah...
> 
> 
> > 
> > > And if rdp->gpnum >= rdp->completed
> > > must be a guarantee outside the rnp lock, then it's certainly better, because
> > > the lock is dropped between rcu_process_gp_end() and note_new_gpnum(), and
> > > both values can be out of sync in this lockless window.
> > > 
> > > But perhaps this shouldn't touch rdp->qs_pending:
> > 
> > If rdp->gpnum == rnp->completed, it means we don't need a qs for
> > rdp->gpnum, since it is completed. So we must set rdp->qs_pending = 0;
> > 
> > When we really need a qs, rdp->qs_pending will be fixed in note_new_gpnum().
> 
> 
> Ok that makes all sense now!
> 
> I'm just not sure about your check above.
> 
> (rdp->gpnum - rnp->gpnum > 0) can never happen, right?
> 
> Also perhaps we should set rdp->qs_pending = 0 only if
> rnp->completed == rdp->completed?
> 
> Which in the end would be:
> 
> 	/* Remember that we saw this grace-period completion. */
> 	rdp->completed = rnp->completed;
> 
> +	if (rdp->gpnum < rdp->completed)
> +		rdp->gpnum = rdp->completed;
> +
> +	if (rdp->gpnum == rdp->completed)
> +		rdp->qs_pending = 0;
> 
> 
> And then if there is a new grace period to handle, this will
> be done through note_new_gpnum().
> 
> Hm?

Given that it is the Thanksgiving holiday here in the USA, I am going to
give you guys several days to come to agreement on this.  Then I will inspect
the resulting patch.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-25 14:56                         ` Paul E. McKenney
@ 2010-11-26 14:06                           ` Frederic Weisbecker
  2010-11-29 23:06                             ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2010-11-26 14:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Thu, Nov 25, 2010 at 06:56:32AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 24, 2010 at 11:42:57PM +0100, Frederic Weisbecker wrote:
> > On Wed, 24 Nov 2010 10:20:51 -0800,
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Wed, Nov 24, 2010 at 06:38:45PM +0100, Frederic Weisbecker wrote:
> > > > Yeah. I mean, I need to read how the code manages the different
> > > > queues. But __rcu_process_gp_end() seems to sum it up quite well.
> > > 
> > > For advancing callbacks, that is the one!  For invocation of
> > > callbacks, see rcu_do_batch().
> > 
> > Ok.
> > 
> > > > It's more like it couldn't ever stop the tick. But that doesn't
> > > > concern mainline. This is because I have a hook that prevents the
> > > > tick from being stopped until rcu_pending() == 0.
> > > 
> > > That would certainly change behavior!!!  Why did you need to do that?
> > > 
> > > Ah, because force_quiescent_state() has not yet been taught about
> > > dyntick-HPC, got it...
> > 
> > Oh actually I have taught it about that. For such isolated CPU that
> > doesn't respond, it sends a specific IPI that will restart the tick if
> > we are not in nohz.
> > 
> > The point in restarting the tick is to find some quiescent states and
> > also to keep the tick for a little while for potential further grace
> > periods to complete while we are in the kernel.
> > 
> > This is why I use rcu_pending() from the tick: to check if we still
> > need the tick for rcu.
> 
> So if you have received the IPI, then you turn on the tick.  As soon as
> rcu_pending() returns false, you can turn off the tick.  Once the tick
> is off, you can go back to using rcu_needs_cpu(), right?  (At least
> until that CPU receives another RCU IPI.)

Yep. Although there is already a test for rcu_needs_cpu() in
tick_nohz_stop_sched_tick() that cancels the switch to nohz mode.

 
> Either that or you need to check for both rcu_needs_cpu() -and-
> rcu_pending().  Actually, you do need to check both during the time that
> you have the tick on after receiving an RCU IPI.

Yeah.
 
> > But it's not enough to track quiescent states. And we have no more timer
> > interrupts to track them. So we need to restart the tick at least until
> > we find a quiescent state for the grace period waiting for us.
> > 
> > But I may be missing something either :)
> 
> Well, I don't know whether or not you are missing it, but I do need to
> get my act together and get RCU priority boosting ported from tiny
> RCU to tree RCU.  Otherwise, force_quiescent_state() cannot do much
> for preemptible RCU.

And if I understood it correctly, RCU priority boosting involves a new
kthread that handles the callbacks instead of a softirq? So that you
can give a dynamic priority to this thread and so on?

How will that help force a quiescent state in the preemptible RCU case?
Scheduling the kthread doesn't mean that every task that was preempted
has been rescheduled and exited its rcu_read_unlock(), so I guess you
plan some other trickery :)


> > > > Ah, I see what you mean. So you would suggest to even ignore those
> > > > explicit QS reports when in dyntick-hpc mode for CPUs that don't
> > > > have callbacks?
> > > > 
> > > > Why not keep them?
> > > 
> > > My belief is that people needing dyntick-HPC are OK with RCU grace
> > > periods taking a few jiffies longer than they might otherwise.
> > > Besides, when you are running dyntick-HPC, you aren't context
> > > switching much, so keeping the tick doesn't buy you as much reduction
> > > in grace-period latency.
> > 
> > But don't we still need the tick in such cases (if we aren't in
> > userspace) when a grace period starts, to note our quiescent states?
> > The rcu IPI itself doesn't seem to be sufficient for that.
> > 
> > I'm not sure I understand what you mean.
> 
> If there is a grace period, there must be an RCU callback.  The CPU that
> has the RCU callback queued will keep its own tick going, because
> rcu_needs_cpu() will return true.
> 
> The reason for ignoring the explicit QS reports in at least some of the
> cases is that paying attention to them requires that the tick be enabled.
> Which is fine for most workloads, but not so good for workloads that care
> about OS jitter.

Is there another solution than restarting the tick on the local CPU when we
need a quiescent state from it? The simple IPI is not enough to ensure we
find a decent grace period; in that case we need to restart the tick anyway
if we are in the kernel, no?

Enjoy your turkey(s) ;-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods
  2010-11-26 14:06                           ` Frederic Weisbecker
@ 2010-11-29 23:06                             ` Paul E. McKenney
  0 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2010-11-29 23:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Lai Jiangshan, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt

On Fri, Nov 26, 2010 at 03:06:43PM +0100, Frederic Weisbecker wrote:
> On Thu, Nov 25, 2010 at 06:56:32AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 24, 2010 at 11:42:57PM +0100, Frederic Weisbecker wrote:
> > > On Wed, 24 Nov 2010 10:20:51 -0800,
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Wed, Nov 24, 2010 at 06:38:45PM +0100, Frederic Weisbecker wrote:
> > > > > Yeah. I mean, I need to read how the code manages the different
> > > > > queues. But __rcu_process_gp_end() seems to sum it up quite well.
> > > > 
> > > > For advancing callbacks, that is the one!  For invocation of
> > > > callbacks, see rcu_do_batch().
> > > 
> > > Ok.
> > > 
> > > > > It's more like we couldn't ever stop the tick. But that doesn't
> > > > > concern mainline. This is because I have a hook that prevents the
> > > > > tick from being stopped until rcu_pending() == 0.
> > > > 
> > > > That would certainly change behavior!!!  Why did you need to do that?
> > > > 
> > > > Ah, because force_quiescent_state() has not yet been taught about
> > > > dyntick-HPC, got it...
> > > 
> > > Oh actually I have taught it about that. For such isolated CPU that
> > > doesn't respond, it sends a specific IPI that will restart the tick if
> > > we are not in nohz.
> > > 
> > > The point in restarting the tick is to find some quiescent states and
> > > also to keep the tick for a little while for potential further grace
> > > periods to complete while we are in the kernel.
> > > 
> > > This is why I use rcu_pending() from the tick: to check if we still
> > > need the tick for rcu.
> > 
> > So if you have received the IPI, then you turn on the tick.  As soon as
> > rcu_pending() returns false, you can turn off the tick.  Once the tick
> > is off, you can go back to using rcu_needs_cpu(), right?  (At least
> > until that CPU receives another RCU IPI.)
> 
> Yep. Although there is already a test for rcu_needs_cpu() in
> tick_nohz_stop_sched_tick() that cancels the switch to nohz mode.

Sounds plausible.  ;-)

> > Either that or you need to check for both rcu_needs_cpu() -and-
> > rcu_pending().  Actually, you do need to check both during the time that
> > you have the tick on after receiving an RCU IPI.
> 
> Yeah.

Before I forget (again!) -- my thinking over the past few months has
been in terms of making rcu_pending() simpler by permitting more false
positives.  This makes sense if the only penalty is a needless softirq.
I am guessing that this sort of change would be a bad idea from your
viewpoint, but thought I should check.

> > > But it's not enough to track quiescent states. And we have no more timer
> > > interrupts to track them. So we need to restart the tick at least until
> > > we find a quiescent state for the grace period waiting for us.
> > > 
> > > But I may be missing something either :)
> > 
> > Well, I don't know whether or not you are missing it, but I do need to
> > get my act together and get RCU priority boosting ported from tiny
> > RCU to tree RCU.  Otherwise, force_quiescent_state() cannot do much
> > for preemptible RCU.
> 
> And if I understood it correctly, RCU priority boosting involves a new
> kthread that handles the callbacks instead of a softirq? So that you
> can give dynamic priority to this thread and so?

Yes, there will be a new kthread, and yes, because this allows its
priority to be adjusted independently of non-RCU stuff.  My current
design (and code, in the case of Tiny RCU) runs this at a fixed real-time
priority.  This priority can be adjusted via the chrt command, if desired,
though you can of course shoot yourself in the foot quite impressively
if you do this carelessly.

> How will that help in the preemptible RCU case to force the quiescent
> state? Scheduling the kthread doesn't mean that every task that was
> preempted has been rescheduled and exited its rcu_read_unlock(), so
> I guess you plan another trickery :)

There is one kthread per CPU, so that there can be lock-free interaction
similar to that between mainline and the RCU softirq.  ;-)

But in the meantime, your approach is good for experimental purposes.
And might well be the way it needs to happen longer term, for all I know.

> > > > > Ah, I see what you mean. So you would suggest to even ignore those
> > > > > explicit QS reports when in dyntick-HPC mode for CPUs that don't
> > > > > have callbacks?
> > > > > 
> > > > > Why not keeping them?
> > > > 
> > > > My belief is that people needing dyntick-HPC are OK with RCU grace
> > > > periods taking a few jiffies longer than they might otherwise.
> > > > Besides, when you are running dyntick-HPC, you aren't context
> > > > switching much, so keeping the tick doesn't buy you as much reduction
> > > > in grace-period latency.
> > > 
> > > But don't we still need the tick on such cases, (if we aren't in
> > > userspace) when a grace period starts, to note our grace periods?
> > > The rcu IPI itself doesn't seem to be sufficient for that.
> > > 
> > > I'm not sure I understand what you mean.
> > 
> > If there is a grace period, there must be an RCU callback.  The CPU that
> > has the RCU callback queued will keep its own tick going, because
> > rcu_needs_cpu() will return true.
> > 
> > The reason for ignoring the explicit QS reports in at least some of the
> > cases is that paying attention to them requires that the tick be enabled.
> > Which is fine for most workloads, but not so good for workloads that care
> > about OS jitter.
> 
> Is there another solution than restarting the tick on the local CPU when we
> need a quiescent state from it? The simple IPI alone is not enough to ensure
> we get a decent grace period; in that case we need to restart the tick anyway
> if we are in the kernel, no?

It depends on the flavor of RCU.  RCU-sched and RCU-bh are taken
care of if the IPI handler does a set_need_resched() followed by a
raise_softirq().  Of course, perhaps set_need_resched() ends up restarting
the tick, depending on the order of checks in your schedule() path.

But it doesn't need to.  RCU's force_quiescent_state() would resend the
IPI periodically, which would force the CPU through its state machine.
The set_need_resched() would force a quiescent state, and the IPIs
following that would push RCU through the steps needed to note that the
corresponding CPU was no longer blocking the grace period.

Preemptible RCU works similarly, unless there is a low-priority task
that is being indefinitely preempted in an RCU read-side critical
section.  In that case, the low-priority task is in need of priority
boosting.

Or am I still missing something?

> Enjoy your turkey(s) ;-)

I did, thank you!  Only one turkey, though.  You must be confusing me
with my 30-years-ago self.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2010-11-29 23:07 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-24  0:31 [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Frederic Weisbecker
2010-11-24  0:31 ` [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after extended grace periods Frederic Weisbecker
2010-11-24  0:58   ` Paul E. McKenney
2010-11-24  2:29     ` Frederic Weisbecker
2010-11-24  2:33       ` Frederic Weisbecker
2010-11-24  6:22         ` Paul E. McKenney
2010-11-24 13:48           ` Frederic Weisbecker
2010-11-24 14:42             ` Paul E. McKenney
2010-11-24 15:45               ` Frederic Weisbecker
2010-11-24 16:15                 ` Paul E. McKenney
2010-11-24 17:38                   ` Frederic Weisbecker
2010-11-24 18:20                     ` Paul E. McKenney
2010-11-24 20:22                       ` Paul E. McKenney
2010-11-24 20:45                         ` Frederic Weisbecker
2010-11-24 21:19                           ` Paul E. McKenney
2010-11-24 21:50                             ` Frederic Weisbecker
2010-11-24 22:42                       ` Frederic Weisbecker
2010-11-25 14:56                         ` Paul E. McKenney
2010-11-26 14:06                           ` Frederic Weisbecker
2010-11-29 23:06                             ` Paul E. McKenney
2010-11-24  0:31 ` [PATCH 2/2] rcu: Stop checking quiescent states after grace period completion from remote Frederic Weisbecker
2010-11-24  1:03   ` Paul E. McKenney
2010-11-25  3:42 ` [PATCH 0/2] rcu: Fix series of spurious RCU softirqs Lai Jiangshan
2010-11-25  7:38   ` Frederic Weisbecker
2010-11-25  8:35     ` Lai Jiangshan
2010-11-25  9:27       ` Frederic Weisbecker
2010-11-25 14:58         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox