public inbox for linux-kernel@vger.kernel.org
* [RFC] RCU and CONFIG_PREEMPT_RT progress
@ 2005-05-10  1:24 Paul E. McKenney
  2005-05-10 10:55 ` Esben Nielsen
  2005-05-10 20:08 ` Peter Zijlstra
  0 siblings, 2 replies; 12+ messages in thread
From: Paul E. McKenney @ 2005-05-10  1:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, dipankar

Hello!

Making some progress on CONFIG_PREEMPT_RT-compatible RCU.  Am exploring
two approaches, the lock-based approach discussed earlier and a
counter-flip approach, with the latter very likely being the method
of choice.  The reason for working the former is to get myself up to
speed on details of CONFIG_PREEMPT_RT with something relatively simple.

I am basing my work off of:

	http://people.redhat.com/mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc1-V0.7.41-14

since it works well on a four-CPU x86 system.  I will port forward when
I have something worth considering for inclusion.  To reiterate, the
patches referenced below are playtoys, for experimental and educational
use only.


Lock-Based Approach

1.	Trivial working patch:

	http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-2a.patch

	This one uses a single rwlock and a single global callback
	list.  This of course means that only one task may be in
	an RCU read-side critical section at a time.  Even so,
	I had to split synchronize_kernel() into synchronize_rcu()
	and synchronize_sched() -- I get infrequent hangs otherwise.
	The implementation of synchronize_sched() is probably not what
	it eventually needs to be, since it simply forces each CPU to
	context switch, whether voluntarily or via preemption.  Will be looking
	into this later on.

2.	Slightly less trivial working patch:

	http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-3a.patch

	This one uses a per-CPU rwlock, but keeps the single global
	callback list.  It is otherwise identical to #1.

Next step is to go to per-CPU callback lists.  If I was taking this
approach seriously, I would also experiment with multiple RCU read-side
locks per CPU, but I don't believe I would learn anything from that
exercise.

The reason that I am not taking this approach seriously is that it
can impose high latencies on RCU read-side critical sections, as
discussed earlier on LKML.  It also has high rcu_read_lock() and
rcu_read_unlock() overhead.


Counter-Based Approach

The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
counter-based approach, which seems to work, but which can result in
indefinite-duration grace periods.  The following are very hazy thoughts
on how to get the benefits of this approach, but with short grace periods.

1.	The basic trick is to maintain a pair of counters per CPU.
	There would also be a global boolean variable that would select
	one or the other of each pair.  The rcu_read_lock() primitive
	would then increment the counter indicated by the boolean
	corresponding to the CPU that it is currently running on.
	It would also keep a pointer to that particular counter in
	the task structure.  The rcu_read_unlock() primitive would
	decrement this counter.  (And, yes, you would also have a
	counter in the task structure so that only the outermost of
	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
	actually increment/decrement the per-CPU counter pairs.)

	To force a grace period, one would invert the value of the
	global boolean variable.  Once all the counters indicated
	by the old value of the global boolean variable hit zero,
	the corresponding set of RCU callbacks can be safely invoked.

	The big problem with this approach is that a pair of inversions
	of the global boolean variable could be spaced arbitrarily 
	closely, especially when you consider that the read side code
	can be preempted.  This could cause RCU callbacks to be invoked
	prematurely, which could greatly reduce the life expectancy
	of your kernel.

2.	#1 above, but use a seqlock to guard the counter selection in
	rcu_read_lock().  One potential disadvantage of this approach
	is that an extremely unlucky instance of rcu_read_lock() might
	be indefinitely delayed by a series of counter flips.  I am
	concerned that this might actually happen under low-memory
	conditions.  Also requires memory barriers on the read side,
	which we might be stuck with, but still hope to be able to
	get rid of.  And the per-CPU counter manipulation must use
	atomic instructions.

3.	#1 above, but use per-CPU locks to guard the counter selection.
	I don't like this any better than #2, worse, in fact, since it
	requires expensive atomic instructions as well.

4.	The Australian National Zoo alternative:  keep the counter pairs
	in the task structure rather than keeping them per-CPU.  This
	eliminates the need for atomic operations in rcu_read_lock() and
	rcu_read_unlock(), but makes the update side do horribly expensive
	task-list trawls.  [So named because I thought of it while trying
	to jog to the Australian National Zoo.  I took a wrong turn, and
	ended up running up a valley on the other side of Black Mountain,
	so never did make it to the zoo.  On the other hand, I did encounter
	a herd of wild kangaroo and also thought up this approach, so I
	think I came out OK on the deal.]

5.	The National Train notion:  #4 above, but keep a separate list
	containing only preempted tasks that had non-zero RCU counters
	at the time of preemption.  In the (presumably) common case of
	no preemption in RCU read-side critical sections, both the
	read-side and the update-side overhead is low.  But...  There
	is a problem with detecting tasks that are in long-running
	RCU read-side critical sections that don't get preempted.
	[So named because I thought of it on the UK National Train
	somewhere between London and Winchester.]

6.	Oak Hills option:  keep per-CPU counters, which require atomic
	increment/decrement in the general case, but use a fastpath
	that (with preemption disabled) checks to see if the value of
	the counter is zero (for rcu_read_lock()) or one (for
	rcu_read_unlock()), and, if so, does the counter manipulation
	non-atomically.  Use atomics on the (presumably infrequent)
	slow path, which is taken if someone gets preempted in the middle
	of an RCU read-side critical section.

	Handle races between rcu_read_lock() and counter flips by
	having rcu_read_lock() increment the counter, then checking
	to see if it incremented the correct counter of the pair.
	If it did not (i.e., the flip just happened), increment
	the other counter of the pair as well, recording the fact that
	both were incremented in the task struct.  The rcu_read_unlock()
	primitive then decrements any/all counters that rcu_read_lock()
	incremented.

	Memory barriers are still needed in the non-atomic increment
	and decrement cases.  However, it may be possible to leverage
	naturally occurring memory barriers (see for example Joe Seigh's
	recent LKML posting on RCU+SMR: http://lkml.org/lkml/2005/5/9/129).
	If the naturally occurring memory barriers aren't happening fast
	enough (e.g., low memory situation), a round of IPIs should
	suffice, for example, smp_call_function() to a function that
	advances the callbacks on each CPU.

	If this one pans out, the common-case overhead of rcu_read_lock()
	and rcu_read_unlock() would not be much more expensive than the
	current CONFIG_PREEMPT implementations.

There are probably better approaches, but that is what I have thus far.
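
For concreteness, here is a rough userspace model of the counter-pair scheme
from #1.  It is single-threaded, so it deliberately ignores the flip races,
preemption, and memory ordering that #2-#6 are trying to deal with, and all
of the names in it are invented for this sketch rather than taken from any
of the patches above:

	#include <stdio.h>

	#define NR_CPUS 4

	static int rcu_flipctr[NR_CPUS][2];	/* per-CPU counter pair */
	static int rcu_flip_idx;		/* global boolean selecting a counter */

	struct task {
		int rcu_read_lock_nesting;	/* nesting depth of rcu_read_lock() */
		int rcu_flip_idx_snap;		/* counter used by the outermost lock */
	};

	static void rcu_read_lock(struct task *t, int cpu)
	{
		/* Only the outermost rcu_read_lock() touches the per-CPU pair. */
		if (t->rcu_read_lock_nesting++ == 0) {
			t->rcu_flip_idx_snap = rcu_flip_idx;
			rcu_flipctr[cpu][t->rcu_flip_idx_snap]++;
		}
	}

	static void rcu_read_unlock(struct task *t, int cpu)
	{
		/* Decrement whichever counter the outermost lock incremented. */
		if (--t->rcu_read_lock_nesting == 0)
			rcu_flipctr[cpu][t->rcu_flip_idx_snap]--;
	}

	/*
	 * After a flip, the old side must drain to zero on every CPU before
	 * the corresponding batch of RCU callbacks may be invoked.
	 */
	static int old_side_drained(void)
	{
		int cpu;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (rcu_flipctr[cpu][!rcu_flip_idx] != 0)
				return 0;
		return 1;
	}

	int main(void)
	{
		struct task t = { 0, 0 };

		rcu_read_lock(&t, 0);
		rcu_flip_idx ^= 1;			/* "force" a grace period */
		printf("drained: %d\n", old_side_drained());	/* 0: reader still on old side */
		rcu_read_unlock(&t, 0);
		printf("drained: %d\n", old_side_drained());	/* 1: old-side callbacks now safe */
		return 0;
	}

The whole difficulty in #2-#6 is making the two steps in rcu_read_lock()
above (sampling rcu_flip_idx and incrementing the chosen counter) safe
against a concurrent flip.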

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10  1:24 [RFC] RCU and CONFIG_PREEMPT_RT progress Paul E. McKenney
@ 2005-05-10 10:55 ` Esben Nielsen
  2005-05-10 14:45   ` Paul E. McKenney
  2005-05-10 20:08 ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Esben Nielsen @ 2005-05-10 10:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, mingo, dipankar

On Mon, 9 May 2005, Paul E. McKenney wrote:

> Hello!
> 
> Making some progress on CONFIG_PREEMPT_RT-compatible RCU.  Am exploring
> two approaches, the lock-based approach discussed earlier and a
> counter-flip approach, with the latter very likely being the method
> of choice.  The reason for working the former is to get myself up to
> speed on details of CONFIG_PREEMPT_RT with something relatively simple.
> 
> I am basing my work off of:
> 
> 	http://people.redhat.com/mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc1-V0.7.41-14
> 
> since it works well on a four-CPU x86 system.  I will port forward when
> I have something worth considering for inclusion.  To reiterate, the
> patches referenced below are playtoys, for experimental and educational
> use only.
> 
> 
> Lock-Based Approach
> 
> 1.	Trivial working patch:
> 
> 	http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-2a.patch
> 
> 	This one uses a single rwlock and a single global callback
> 	list.  This of course means that only one task may be in
> 	an RCU read-side critical section at a time.  Even so,
> 	I had to split synchronize_kernel() into synchronize_rcu()
> 	and synchronize_sched() -- I get infrequent hangs otherwise.
> 	The implementation of synchronize_sched() is probably not what
> 	it eventually needs to be, since it simply forces each CPU to
> 	context switch, whether voluntarily or via preemption.  Will be looking
> 	into this later on.
> 
*shiver* A single rwlock is bad in all respects, both regarding
scaling and real-time behaviour.


> 2.	Slightly less trivial working patch:
> 
> 	http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-3a.patch
> 
> 	This one uses a per-CPU rwlock, but keeps the single global
> 	callback list.  It is otherwise identical to #1.
> 
Again, a single rwlock is very bad for RT behaviour.

> Next step is to go to per-CPU callback lists.  If I was taking this
> approach seriously, I would also experiment with multiple RCU read-side
> locks per CPU, but I don't believe I would learn anything from that
> exercise.
> 
> The reason that I am not taking this approach seriously is that it
> can impose high latencies on RCU read-side critical sections, as
> discussed earlier on LKML.  It also has high rcu_read_lock() and
> rcu_read_unlock() overhead.
> 
> 
> Counter-Based Approach
> 
> The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> counter-based approach, which seems to work, but which can result in
> indefinite-duration grace periods.
Is that really a problem in practice? After all, RCU updates should happen
rarely.

>  The following are very hazy thoughts
> on how to get the benefits of this approach, but with short grace periods.
> 
> 1.	The basic trick is to maintain a pair of counters per CPU.
> 	There would also be a global boolean variable that would select
> 	one or the other of each pair.  The rcu_read_lock() primitive
> 	would then increment the counter indicated by the boolean
> 	corresponding to the CPU that it is currently running on.
> 	It would also keep a pointer to that particular counter in
> 	the task structure.  The rcu_read_unlock() primitive would
> 	decrement this counter.  (And, yes, you would also have a
> 	counter in the task structure so that only the outermost of
> 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> 	actually increment/decrement the per-CPU counter pairs.)
Why bother? What is the overhead of checking relative to just doing the
increment/decrement?
I had another idea: Do it in the scheduler:
  per_cpu_rcu_count += prev->rcu_read_count;
  if(per_cpu_rcu_count==0) {
     /* grace period encountered */
  }
  (do the task switch)
  
  per_cpu_rcu_count -= next->rcu_read_count;

Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
current. During schedule there is no current and therefore it is the total
count.

> 
> 	To force a grace period, one would invert the value of the
> 	global boolean variable.  Once all the counters indicated
> 	by the old value of the global boolean variable hit zero,
> 	the corresponding set of RCU callbacks can be safely invoked.
> 
> 	The big problem with this approach is that a pair of inversions
> 	of the global boolean variable could be spaced arbitrarily 
> 	closely, especially when you consider that the read side code
> 	can be preempted.  This could cause RCU callbacks to be invoked
> 	prematurely, which could greatly reduce the life expectancy
> 	of your kernel.
> 
> 2.	#1 above, but use a seqlock to guard the counter selection in
> 	rcu_read_lock().  One potential disadvantage of this approach
> 	is that an extremely unlucky instance of rcu_read_lock() might
> 	be indefinitely delayed by a series of counter flips.  I am
> 	concerned that this might actually happen under low-memory
> 	conditions.  Also requires memory barriers on the read side,
> 	which we might be stuck with, but still hope to be able to
> 	get rid of.  And the per-CPU counter manipulation must use
> 	atomic instructions.
Why not just use 
 preempt_disable(); 
 manipulate counters; 
 preempt_enable();
??
> 
> 3.	#1 above, but use per-CPU locks to guard the counter selection.
> 	I don't like this any better than #2, worse, in fact, since it
> 	requires expensive atomic instructions as well.
local_irq_disable() and/or preempt_disable() should work, as the counters
are per-cpu, right?

Or see the alternative above: The counter is per task but latched to a
per-cpu on schedule().

> 
> 4.	The Australian National Zoo alternative:  keep the counter pairs
> 	in the task structure rather than keeping them per-CPU.  This
> 	eliminates the need for atomic operations in rcu_read_lock() and
> 	rcu_read_unlock(), but makes the update side do horribly expensive
> 	task-list trawls.  [So named because I thought of it while trying
> 	to jog to the Australian National Zoo.  I took a wrong turn, and
> 	ended up running up a valley on the other side of Black Mountain,
> 	so never did make it to the zoo.  On the other hand, I did encounter
> 	a herd of wild kangaroo and also thought up this approach, so I
> 	think I came out OK on the deal.]
> 
> 5.	The National Train notion:  #4 above, but keep a separate list
> 	containing only preempted tasks that had non-zero RCU counters
> 	at the time of preemption.  In the (presumably) common case of
> 	no preemption in RCU read-side critical sections, both the
> 	read-side and the update-side overhead is low.  But...  There
> 	is a problem with detecting tasks that are in long-running
> 	RCU read-side critical sections that don't get preempted.
> 	[So named because I thought of it on the UK National Train
> 	somewhere between London and Winchester.]
> 
> 6.	Oak Hills option:  keep per-CPU counters, which require atomic
> 	increment/decrement in the general case, but use a fastpath
> 	that (with preemption disabled) checks to see if the value of
> 	the counter is zero (for rcu_read_lock()) or one (for
> 	rcu_read_unlock()), and, if so, does the counter manipulation
> 	non-atomically.  Use atomics on the (presumably infrequent)
> 	slow path, which is taken if someone gets preempted in the middle
> 	of an RCU read-side critical section.
> 
> 	Handle races between rcu_read_lock() and counter flips by
> 	having rcu_read_lock() increment the counter, then checking
> 	to see if it incremented the correct counter of the pair.
> 	If it did not (i.e., the flip just happened), increment
> 	the other counter of the pair as well, recording the fact that
> 	both were incremented in the task struct.  The rcu_read_unlock()
> 	primitive then decrements any/all counters that rcu_read_lock()
> 	incremented.
> 
> 	Memory barriers are still needed in the non-atomic increment
> 	and decrement cases.  However, it may be possible to leverage
> 	naturally occurring memory barriers (see for example Joe Seigh's
> 	recent LKML posting on RCU+SMR: http://lkml.org/lkml/2005/5/9/129).
> 	If the naturally occurring memory barriers aren't happening fast
> 	enough (e.g., low memory situation), a round of IPIs should
> 	suffice, for example, smp_call_function() to a function that
> 	advances the callbacks on each CPU.
> 
> 	If this one pans out, the common-case overhead of rcu_read_lock()
> 	and rcu_read_unlock() would not be much more expensive than the
> 	current CONFIG_PREEMPT implementations.
> 
> There are probably better approaches, but that is what I have thus far.
>

I was playing around with it a little while back. I didn't send a patch
though as I didn't have time. It works on a UP machine but I don't have an
SMP machine to test on :-(

The ingredients are:
1) A per-cpu read-side counter, protected locally by preempt_disable().
2) A task read-side counter (doesn't need protection at all).
3) Tasks having a non-zero read-side counter can't be migrated to other
CPUs. That is to make the per-cpu counter truly per-cpu.
4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter > 0, it
boosts the priority to the maximum non-RT priority such that it will not be
preempted by other SCHED_OTHER tasks. This makes the delay in
synchronize_kernel() somewhat deterministic (it depends on how
"deterministic" the RT tasks in the system are).

I'll try to send you what I got when I get my computer at home back up.
I have only tested it on my UP laptop, where it seems to run fine.

> Thoughts?
> 
> 							Thanx, Paul

Esben


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 10:55 ` Esben Nielsen
@ 2005-05-10 14:45   ` Paul E. McKenney
  2005-05-11 18:14     ` Esben Nielsen
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2005-05-10 14:45 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: linux-kernel, mingo, dipankar

On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> On Mon, 9 May 2005, Paul E. McKenney wrote:
> 
> > Hello!
> > 
> > Making some progress on CONFIG_PREEMPT_RT-compatible RCU.  Am exploring
> > two approaches, the lock-based approach discussed earlier and a
> > counter-flip approach, with the latter very likely being the method
> > of choice.  The reason for working the former is to get myself up to
> > speed on details of CONFIG_PREEMPT_RT with something relatively simple.
> > 
> > I am basing my work off of:
> > 
> > 	http://people.redhat.com/mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc1-V0.7.41-14
> > 
> > since it works well on a four-CPU x86 system.  I will port forward when
> > I have something worth considering for inclusion.  To reiterate, the
> > patches referenced below are playtoys, for experimental and educational
> > use only.
> > 
> > 
> > Lock-Based Approach
> > 
> > 1.	Trivial working patch:
> > 
> > 	http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-2a.patch
> > 
> > 	This one uses a single rwlock and a single global callback
> > 	list.  This of course means that only one task may be in
> > 	an RCU read-side critical section at a time.  Even so,
> > 	I had to split synchronize_kernel() into synchronize_rcu()
> > 	and synchronize_sched() -- I get infrequent hangs otherwise.
> > 	The implementation of synchronize_sched() is probably not what
> > 	it eventually needs to be, since it simply forces each CPU to
> > 	context switch, whether voluntarily or via preemption.  Will be looking
> > 	into this later on.
> > 
> *shiver* A single rwlock is bad in all respects, both regarding
> scaling and real-time behaviour.

Agreed, see above: "for experimental and educational purposes only".

> > 2.	Slightly less trivial working patch:
> > 
> > 	http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-3a.patch
> > 
> > 	This one uses a per-CPU rwlock, but keeps the single global
> > 	callback list.  It is otherwise identical to #1.
> > 
> Again, a single rwlock is very bad for RT behaviour.

Again, please see above.  BTW, this is per-CPU rwlock, not a single
global one.  But of course your point still stands, since bad latencies
can still be transmitted via the locks.

The purpose of this patch is to educate me on CONFIG_PREEMPT_RT, -not-
to be submitted for inclusion!  ;-)

> > Next step is to go to per-CPU callback lists.  If I was taking this
> > approach seriously, I would also experiment with multiple RCU read-side
> > locks per CPU, but I don't believe I would learn anything from that
> > exercise.
> > 
> > The reason that I am not taking this approach seriously is that it
> > can impose high latencies on RCU read-side critical sections, as
> > discussed earlier on LKML.  It also has high rcu_read_lock() and
> > rcu_read_unlock() overhead.
> > 
> > 
> > Counter-Based Approach
> > 
> > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > counter-based approach, which seems to work, but which can result in
> > indefinite-duration grace periods.
>
> Is that really a problem in practice? After all, RCU updates should happen
> rarely.

Yes.  Several people have run into serious problems on small-memory
machines under denial-of-service workloads.  Not pretty.

> >  The following are very hazy thoughts
> > on how to get the benefits of this approach, but with short grace periods.
> > 
> > 1.	The basic trick is to maintain a pair of counters per CPU.
> > 	There would also be a global boolean variable that would select
> > 	one or the other of each pair.  The rcu_read_lock() primitive
> > 	would then increment the counter indicated by the boolean
> > 	corresponding to the CPU that it is currently running on.
> > 	It would also keep a pointer to that particular counter in
> > 	the task structure.  The rcu_read_unlock() primitive would
> > 	decrement this counter.  (And, yes, you would also have a
> > 	counter in the task structure so that only the outermost of
> > 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > 	actually increment/decrement the per-CPU counter pairs.)
>
> Why bother? What is the overhead of checking relative to just doing the
> increment/decrement?

The increment of the per-CPU counter pair is likely to incur a cache miss,
and thus be orders of magnitude more expensive than the increment of
the variable in the task structure.  I did not propose this optimization
lightly.  ;-)

> I had another idea: Do it in the scheduler:
>   per_cpu_rcu_count += prev->rcu_read_count;
>   if(per_cpu_rcu_count==0) {
>      /* grace period encountered */
>   }
>   (do the task switch)
>   
> >   per_cpu_rcu_count -= next->rcu_read_count;
> 
> Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> current. During schedule there is no current and therefore it is the total
> count.

One downside of this approach is that it would maximally expand the
RCU read-side critical sections -- one of the things we need is to
avoid holding onto memory longer than necessary, as noted above.

Also, what prevents a grace period from ending while a task is
preempted in an RCU read-side critical section?  Or are you intending
that synchronize_rcu() scan the task list as well as the per-CPU
counters?

> > 	To force a grace period, one would invert the value of the
> > 	global boolean variable.  Once all the counters indicated
> > 	by the old value of the global boolean variable hit zero,
> > 	the corresponding set of RCU callbacks can be safely invoked.
> > 
> > 	The big problem with this approach is that a pair of inversions
> > 	of the global boolean variable could be spaced arbitrarily 
> > 	closely, especially when you consider that the read side code
> > 	can be preempted.  This could cause RCU callbacks to be invoked
> > 	prematurely, which could greatly reduce the life expectancy
> > 	of your kernel.
> > 
> > 2.	#1 above, but use a seqlock to guard the counter selection in
> > 	rcu_read_lock().  One potential disadvantage of this approach
> > 	is that an extremely unlucky instance of rcu_read_lock() might
> > 	be indefinitely delayed by a series of counter flips.  I am
> > 	concerned that this might actually happen under low-memory
> > 	conditions.  Also requires memory barriers on the read side,
> > 	which we might be stuck with, but still hope to be able to
> > 	get rid of.  And the per-CPU counter manipulation must use
> > 	atomic instructions.
>
> Why not just use 
>  preempt_disable(); 
>  manipulate counters; 
>  preempt_enable();
> ??

Because we can race with the counter switch, which can happen on some
other CPU.  Therefore, preempt_disable() does not help with this race.
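
For example (purely illustrative):

  CPU 0                                   CPU 1
  preempt_disable();
  reads the flip flag, picks counter A
                                          flips the flag: new readers are
                                            supposed to use counter B
                                          samples counter A as zero,
                                            declares the old side drained
  increments counter A, enters its
    read-side critical section
  preempt_enable();

Nothing CPU 0 does locally orders its read of the flip flag against CPU 1's
flip, so CPU 0 ends up counted on a side that nothing will wait for.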

> > 3.	#1 above, but use per-CPU locks to guard the counter selection.
> > 	I don't like this any better than #2, worse, in fact, since it
> > 	requires expensive atomic instructions as well.
>
> local_irq_disable() and/or preempt_disable() should work, as the counters
> are per-cpu, right?

Again, neither of these help with the race against the counter-switch,
which would likely be happening on some other CPU.

> Or see the alternative above: The counter is per task but latched to a
> per-cpu on schedule().

This could work (assuming that synchronize_rcu() scanned the task list
as well as the per-CPU counters), but again would extend grace periods
too long for small-memory machines subject to denial-of-service attacks.

> > 4.	The Australian National Zoo alternative:  keep the counter pairs
> > 	in the task structure rather than keeping them per-CPU.  This
> > 	eliminates the need for atomic operations in rcu_read_lock() and
> > 	rcu_read_unlock(), but makes the update side do horribly expensive
> > 	task-list trawls.  [So named because I thought of it while trying
> > 	to jog to the Australian National Zoo.  I took a wrong turn, and
> > 	ended up running up a valley on the other side of Black Mountain,
> > 	so never did make it to the zoo.  On the other hand, I did encounter
> > 	a herd of wild kangaroo and also thought up this approach, so I
> > 	think I came out OK on the deal.]
> > 
> > 5.	The National Train notion:  #4 above, but keep a separate list
> > 	containing only preempted tasks that had non-zero RCU counters
> > 	at the time of preemption.  In the (presumably) common case of
> > 	no preemption in RCU read-side critical sections, both the
> > 	read-side and the update-side overhead is low.  But...  There
> > 	is a problem with detecting tasks that are in long-running
> > 	RCU read-side critical sections that don't get preempted.
> > 	[So named because I thought of it on the UK National Train
> > 	somewhere between London and Winchester.]
> > 
> > 6.	Oak Hills option:  keep per-CPU counters, which require atomic
> > 	increment/decrement in the general case, but use a fastpath
> > 	that (with preemption disabled) checks to see if the value of
> > 	the counter is zero (for rcu_read_lock()) or one (for
> > 	rcu_read_unlock()), and, if so, does the counter manipulation
> > 	non-atomically.  Use atomics on the (presumably infrequent)
> > 	slow path, which is taken if someone gets preempted in the middle
> > 	of an RCU read-side critical section.
> > 
> > 	Handle races between rcu_read_lock() and counter flips by
> > 	having rcu_read_lock() increment the counter, then checking
> > 	to see if it incremented the correct counter of the pair.
> > 	If it did not (i.e., the flip just happened), increment
> > 	the other counter of the pair as well, recording the fact that
> > 	both were incremented in the task struct.  The rcu_read_unlock()
> > 	primitive then decrements any/all counters that rcu_read_lock()
> > 	incremented.
> > 
> > 	Memory barriers are still needed in the non-atomic increment
> > 	and decrement cases.  However, it may be possible to leverage
> > 	naturally occurring memory barriers (see for example Joe Seigh's
> > 	recent LKML posting on RCU+SMR: http://lkml.org/lkml/2005/5/9/129).
> > 	If the naturally occurring memory barriers aren't happening fast
> > 	enough (e.g., low memory situation), a round of IPIs should
> > 	suffice, for example, smp_call_function() to a function that
> > 	advances the callbacks on each CPU.
> > 
> > 	If this one pans out, the common-case overhead of rcu_read_lock()
> > 	and rcu_read_unlock() would not be much more expensive than the
> > 	current CONFIG_PREEMPT implementations.
> > 
> > There are probably better approaches, but that is what I have thus far.
> 
> I was playing around with it a little while back. I didn't send a patch
> though as I didn't have time. It works on a UP machine but I don't have an
> SMP machine to test on :-(

There are some available from OSDL, right?  Or am I out of date here?

> The ingredients are:
> 1) A per-cpu read-side counter, protected locally by preempt_disable().

In order to avoid too-long grace periods, you need a pair of counters.
If you only have a single counter, you can end up with a series of
RCU read-side critical-section preemptions resulting in the counter
never reaching zero, and thus the grace period never terminating.
Not good, especially in small-memory machines subject to denial-of-service
attacks.
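
Purely for illustration, with a single counter and overlapping (possibly
preempted) readers:

  task A: rcu_read_lock()    counter 0 -> 1, A is then preempted
  task B: rcu_read_lock()    counter 1 -> 2
  task A: rcu_read_unlock()  counter 2 -> 1
  task C: rcu_read_lock()    counter 1 -> 2
  task B: rcu_read_unlock()  counter 2 -> 1
  ...

As long as read-side critical sections keep overlapping, the counter never
reaches zero.  With a pair of counters, a flip steers new readers onto the
other counter, so the old one is guaranteed to drain once the pre-flip
readers finish.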

> 2) A task read-side counter (doesn't need protection at all).

Yep.  Also a field in the task struct indicating which per-CPU counter
applies (unless the scheduler is managing this).

> 3) Tasks having a non-zero read-side counter can't be migrated to other
> CPUs. That is to make the per-cpu counter truly per-cpu.

If CPU hotplug can be disallowed, then this can work.  Do we really want
to disallow CPU hotplug?  Especially given that some implementations
might want to use CPU hotplug to reduce power consumption?  There also
might be issues with tickless operation.

> 4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter > 0, it
> boosts the priority to the maximum non-RT priority such that it will not be
> preempted by other SCHED_OTHER tasks. This makes the delay in
> synchronize_kernel() somewhat deterministic (it depends on how
> "deterministic" the RT tasks in the system are).

But one would only want to do this boost if there was a synchronize_rcu()
waiting -and- if there is some reason to accelerate grace periods (e.g.,
low on memory), right?  If there is no reason to accelerate grace periods,
then extending them increases efficiency by amortizing grace-period
overhead over many updates.

> I'll try to send you what I got when I get my computer at home back up.
> I have only tested it on my UP laptop, where it seems to run fine.

I look forward to seeing it!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10  1:24 [RFC] RCU and CONFIG_PREEMPT_RT progress Paul E. McKenney
  2005-05-10 10:55 ` Esben Nielsen
@ 2005-05-10 20:08 ` Peter Zijlstra
  2005-05-10 20:18   ` Valdis.Kletnieks
  2005-05-10 20:29   ` Paul E. McKenney
  1 sibling, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2005-05-10 20:08 UTC (permalink / raw)
  To: paulmck; +Cc: dipankar, Ingo Molnar, LKML

On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:

> 
> Counter-Based Approach
> 
> The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> counter-based approach, which seems to work, but which can result in
> indefinite-duration grace periods.  The following are very hazy thoughts
> on how to get the benefits of this approach, but with short grace periods.
> 
> 1.	The basic trick is to maintain a pair of counters per CPU.
> 	There would also be a global boolean variable that would select
> 	one or the other of each pair.  The rcu_read_lock() primitive
> 	would then increment the counter indicated by the boolean
> 	corresponding to the CPU that it is currently running on.
> 	It would also keep a pointer to that particular counter in
> 	the task structure.  The rcu_read_unlock() primitive would
> 	decrement this counter.  (And, yes, you would also have a
> 	counter in the task structure so that only the outermost of
> 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> 	actually increment/decrement the per-CPU counter pairs.)
> 
> 	To force a grace period, one would invert the value of the
> 	global boolean variable.  Once all the counters indicated
> 	by the old value of the global boolean variable hit zero,
> 	the corresponding set of RCU callbacks can be safely invoked.
> 
> 	The big problem with this approach is that a pair of inversions
> 	of the global boolean variable could be spaced arbitrarily 
> 	closely, especially when you consider that the read side code
> 	can be preempted.  This could cause RCU callbacks to be invoked
> 	prematurely, which could greatly reduce the life expectancy
> 	of your kernel.

> Thoughts?
> 

How about having another boolean indicating the ability to flip the
selector boolean. This boolean would be set false on an actual flip and
cleared during a grace period. That way the flips cannot ever interfere
with one another such that the callbacks would be cleared prematurely.

-- 
Peter Zijlstra <a.p.zijlstra@chello.nl>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 20:08 ` Peter Zijlstra
@ 2005-05-10 20:18   ` Valdis.Kletnieks
  2005-05-10 20:29   ` Paul E. McKenney
  1 sibling, 0 replies; 12+ messages in thread
From: Valdis.Kletnieks @ 2005-05-10 20:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: paulmck, dipankar, Ingo Molnar, LKML

[-- Attachment #1: Type: text/plain, Size: 420 bytes --]

On Tue, 10 May 2005 22:08:11 +0200, Peter Zijlstra said:

> How about having another boolean indicating the ability to flip the
> selector boolean. This boolean would be set false on an actual flip and
> cleared during a grace period. That way the flips cannot ever interfere
> with one another such that the callbacks would be cleared prematurely.

As all the dining philosophers grab a fork and a spoon and dig in. ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 20:08 ` Peter Zijlstra
  2005-05-10 20:18   ` Valdis.Kletnieks
@ 2005-05-10 20:29   ` Paul E. McKenney
  2005-05-10 20:56     ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2005-05-10 20:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: dipankar, Ingo Molnar, LKML

On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> 
> > 
> > Counter-Based Approach
> > 
> > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > counter-based approach, which seems to work, but which can result in
> > indefinite-duration grace periods.  The following are very hazy thoughts
> > on how to get the benefits of this approach, but with short grace periods.
> > 
> > 1.	The basic trick is to maintain a pair of counters per CPU.
> > 	There would also be a global boolean variable that would select
> > 	one or the other of each pair.  The rcu_read_lock() primitive
> > 	would then increment the counter indicated by the boolean
> > 	corresponding to the CPU that it is currently running on.
> > 	It would also keep a pointer to that particular counter in
> > 	the task structure.  The rcu_read_unlock() primitive would
> > 	decrement this counter.  (And, yes, you would also have a
> > 	counter in the task structure so that only the outermost of
> > 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > 	actually increment/decrement the per-CPU counter pairs.)
> > 
> > 	To force a grace period, one would invert the value of the
> > 	global boolean variable.  Once all the counters indicated
> > 	by the old value of the global boolean variable hit zero,
> > 	the corresponding set of RCU callbacks can be safely invoked.
> > 
> > 	The big problem with this approach is that a pair of inversions
> > 	of the global boolean variable could be spaced arbitrarily 
> > 	closely, especially when you consider that the read side code
> > 	can be preempted.  This could cause RCU callbacks to be invoked
> > 	prematurely, which could greatly reduce the life expectancy
> > 	of your kernel.
> 
> > Thoughts?
> 
> How about having another boolean indicating the ability to flip the
> selector boolean. This boolean would be set false on an actual flip and
> cleared during a grace period. That way the flips cannot ever interfere
> with one another such that the callbacks would be cleared prematurely.

But the flip is an integral part of detecting a grace period.  So, if I
understand your proposal correctly, I would have to flip to figure out
when it was safe to flip.

What am I missing here?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 20:29   ` Paul E. McKenney
@ 2005-05-10 20:56     ` Peter Zijlstra
  2005-05-10 22:36       ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2005-05-10 20:56 UTC (permalink / raw)
  To: paulmck; +Cc: dipankar, Ingo Molnar, LKML

On Tue, 2005-05-10 at 13:29 -0700, Paul E. McKenney wrote:
> On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> > On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> > 
> > > 
> > > Counter-Based Approach
> > > 
> > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > counter-based approach, which seems to work, but which can result in
> > > indefinite-duration grace periods.  The following are very hazy thoughts
> > > on how to get the benefits of this approach, but with short grace periods.
> > > 
> > > 1.	The basic trick is to maintain a pair of counters per CPU.
> > > 	There would also be a global boolean variable that would select
> > > 	one or the other of each pair.  The rcu_read_lock() primitive
> > > 	would then increment the counter indicated by the boolean
> > > 	corresponding to the CPU that it is currently running on.
> > > 	It would also keep a pointer to that particular counter in
> > > 	the task structure.  The rcu_read_unlock() primitive would
> > > 	decrement this counter.  (And, yes, you would also have a
> > > 	counter in the task structure so that only the outermost of
> > > 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > > 	actually increment/decrement the per-CPU counter pairs.)
> > > 
> > > 	To force a grace period, one would invert the value of the
> > > 	global boolean variable.  Once all the counters indicated
> > > 	by the old value of the global boolean variable hit zero,
> > > 	the corresponding set of RCU callbacks can be safely invoked.
> > > 
> > > 	The big problem with this approach is that a pair of inversions
> > > 	of the global boolean variable could be spaced arbitrarily 
> > > 	closely, especially when you consider that the read side code
> > > 	can be preempted.  This could cause RCU callbacks to be invoked
> > > 	prematurely, which could greatly reduce the life expectancy
> > > 	of your kernel.
> > 
> > > Thoughts?
> > 
> > How about having another boolean indicating the ability to flip the
> > selector boolean. This boolean would be set false on an actual flip and
> > cleared during a grace period. That way the flips cannot ever interfere
> > with one another such that the callbacks would be cleared prematurely.
> 
> But the flip is an integral part of detecting a grace period.  So, if I
> understand your proposal correctly, I would have to flip to figure out
> when it was safe to flip.
> 
> What am I missing here?


int can_flip = 1;
int selector = 0;

int counter[2] = {0, 0};

void up()
{
  ++counter[current->selection = selector];
}

void down()
{
  if (!--counter[current->selection])
    do_grace();
}

void do_grace()
{
  // free stuff
  can_flip = 1;
}

void force_grace()
{
  if (can_flip)
  {
    can_flip = 0;
    selector ^= 1;
  }
}


The way I understood your proposal was that in order to force a grace
period you flip the selector and once the old one reaches zero again it
does a cleanup.

Now your problem was that when you force another flip before the old
counter reached zero the shit will hit the proverbial fan. So what I
proposed (as hopefully illustrated by the code) was another boolean; my
can_flip; that controls the flippability :-)

One can for example call force_grace every few seconds or when a
watermark on the rcu-callback queue has been reached. If can_flip blocks
the actual flip, nothing is lost since a cleanup is already pending.

I hope to have been clearer in explaining my idea; or if I'm missing the
issue tell me to read the thread in the morning ;)

-- 
Peter Zijlstra <a.p.zijlstra@chello.nl>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 20:56     ` Peter Zijlstra
@ 2005-05-10 22:36       ` Paul E. McKenney
  2005-05-12  8:24         ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2005-05-10 22:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: dipankar, Ingo Molnar, LKML

On Tue, May 10, 2005 at 10:56:24PM +0200, Peter Zijlstra wrote:
> On Tue, 2005-05-10 at 13:29 -0700, Paul E. McKenney wrote:
> > On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> > > 
> > > > 
> > > > Counter-Based Approach
> > > > 
> > > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > > counter-based approach, which seems to work, but which can result in
> > > > indefinite-duration grace periods.  The following are very hazy thoughts
> > > > on how to get the benefits of this approach, but with short grace periods.
> > > > 
> > > > 1.	The basic trick is to maintain a pair of counters per CPU.
> > > > 	There would also be a global boolean variable that would select
> > > > 	one or the other of each pair.  The rcu_read_lock() primitive
> > > > 	would then increment the counter indicated by the boolean
> > > > 	corresponding to the CPU that it is currently running on.
> > > > 	It would also keep a pointer to that particular counter in
> > > > 	the task structure.  The rcu_read_unlock() primitive would
> > > > 	decrement this counter.  (And, yes, you would also have a
> > > > 	counter in the task structure so that only the outermost of
> > > > 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > > > 	actually increment/decrement the per-CPU counter pairs.)
> > > > 
> > > > 	To force a grace period, one would invert the value of the
> > > > 	global boolean variable.  Once all the counters indicated
> > > > 	by the old value of the global boolean variable hit zero,
> > > > 	the corresponding set of RCU callbacks can be safely invoked.
> > > > 
> > > > 	The big problem with this approach is that a pair of inversions
> > > > 	of the global boolean variable could be spaced arbitrarily 
> > > > 	closely, especially when you consider that the read side code
> > > > 	can be preempted.  This could cause RCU callbacks to be invoked
> > > > 	prematurely, which could greatly reduce the life expectancy
> > > > 	of your kernel.
> > > 
> > > > Thoughts?
> > > 
> > > How about having another boolean indicating the ability to flip the
> > > selector boolean. This boolean would be set false on an actual flip and
> > > cleared during a grace period. That way the flips cannot ever interfere
> > > with one another such that the callbacks would be cleared prematurely.
> > 
> > But the flip is an integral part of detecting a grace period.  So, if I
> > understand your proposal correctly, I would have to flip to figure out
> > when it was safe to flip.
> > 
> > What am I missing here?
> 
> 
> int can_flip = 1;
> int selector = 0;
> 
> int counter[2] = {0, 0};
> 
> void up()
> {
>   ++counter[current->selection = selector];

Suppose task 0 has just fetched the value of "selector".  How does
force_grace() know that it is now inappropriate to invert the value
of "selector"?

One might suppress preemption, but there can still be interrupts,
ECC error correction, and just plain bad luck.  So up() needs to
be able to deal with "selector" getting inverted out from under it.
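
Concretely, one (purely illustrative) interleaving of your sketch:

  task 0, in up():       reads selector (== 0), is interrupted before the ++
  elsewhere:             force_grace() sees can_flip == 1, clears it, and
                           flips selector to 1
  task 0, resumes up():  ++counter[0] -- it is now accounted on the old side,
                           even though its critical section starts after
                           the flip

So can_flip serializes flips against each other, but not against this window
in up().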

Unless I am missing something still...

						Thanx, Paul

> }
> 
> void down()
> {
>   if (!--counter[current->selection])
>     do_grace();
> }
> 
> void do_grace()
> {
>   // free stuff
>   can_flip = 1;
> }
> 
> void force_grace()
> {
>   if (can_flip)
>   {
>     can_flip = 0;
>     selector ^= 1;
>   }
> }
> 
> 
> The way I understood your proposal was that in order to force a grace
> period you flip the selector and once the old one reaches zero again it
> does a cleanup.
> 
> Now your problem was that when you force another flip before the old
> counter reached zero the shit will hit the proverbial fan. So what I
> proposed (as hopefully illustrated by the code) was another boolean; my
> can_flip; that controls the flippability :-)
> 
> One can for example call force_grace every few seconds or when a
> watermark on the rcu-callback queue has been reached. If can_flip blocks
> the actual flip, nothing is lost since a cleanup is already pending.
> 
> I hope to have been clearer in explaining my idea; or if I'm missing the
> issue tell me to read the thread in the morning ;)
> 
> -- 
> Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 14:45   ` Paul E. McKenney
@ 2005-05-11 18:14     ` Esben Nielsen
  2005-05-12  1:43       ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Esben Nielsen @ 2005-05-11 18:14 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, mingo, dipankar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 7943 bytes --]

On Tue, 10 May 2005, Paul E. McKenney wrote:

> On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> [...]
> > > 
> > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > counter-based approach, which seems to work, but which can result in
> > > indefinite-duration grace periods.
> >
> > Is that really a problem in practice? After all, RCU updates should happen
> > rarely.
> 
> Yes.  Several people have run into serious problems on small-memory
> machines under denial-of-service workloads.  Not pretty.
>
That is what I tell people at work: Dynamic memory allocation is _not_ a
good thing in real-time embedded systems. They call me old-fashioned when
I say I want anything important pre-allocated in separate pools. But that is
exactly what avoids these kinds of surprises, makes the system more
deterministic, and makes it easier to find leaks.
 
> [...]
> > 
> > Why bother? What is the overhead of checking relative to just doing the
> > increment/decrement?
> 
> The increment of the per-CPU counter pair is likely to incur a cache miss,
> and thus be orders of magnitude more expensive than the increment of
> the variable in the task structure.  I did not propose this optimization
> lightly.  ;-)
> 
> > I had another idea: Do it in the scheduler:
> >   per_cpu_rcu_count += prev->rcu_read_count;
> >   if(per_cpu_rcu_count==0) {
> >      /* grace period encountered */
> >   }
> >   (do the task switch)
> >   
> >   per_cpu_rcu_count -= next->rcu_read_count;
> > 
> > Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> > current. During schedule there is no current and therefore it is the total
> > count.
> 
> One downside of this approach is that it would maximally expand the
> RCU read-side critical sections -- one of the things we need is to
> avoid holding onto memory longer than necessary, as noted above.
> 
> Also, what prevents a grace period from ending while a task is
> preempted in an RCU read-side critical section?  Or are you intending
> that synchronize_rcu() scan the task list as well as the per-CPU
> counters?
Eh, if an RCU task is preempted in a read-side CS there IS no grace period!
The trick is to avoid having it hanging there for too long (see my boosting
mechanism below).
> 
> > > [...] 
> > > 2.	#1 above, but use a seqlock to guard the counter selection in
> > > 	rcu_read_lock().  One potential disadvantage of this approach
> > > 	is that an extremely unlucky instance of rcu_read_lock() might
> > > 	be indefinitely delayed by a series of counter flips.  I am
> > > 	concerned that this might actually happen under low-memory
> > > 	conditions.  Also requires memory barriers on the read side,
> > > 	which we might be stuck with, but still hope to be able to
> > > 	get rid of.  And the per-CPU counter manipulation must use
> > > 	atomic instructions.
> >
> > Why not just use 
> >  preempt_disable(); 
> >  manipulate counters; 
> >  preempt_enable();
> > ??
> 
> Because we can race with the counter switch, which can happen on some
> other CPU.  Therefore, preempt_disable() does not help with this race.

Well, the per-cpu counters should be truly per-cpu. I.e. all tasks must
_only_ touch the counter on their current cpu. preempt_disable() prevents
the migration of the current task, right?
 
> 
> > > 3.	#1 above, but use per-CPU locks to guard the counter selection.
> > > 	I don't like this any better than #2, worse, in fact, since it
> > > 	requires expensive atomic instructions as well.
> >
> > local_irq_disable() and/or preempt_disable() should work, as the counters
> > are per-cpu, right?
> 
> Again, neither of these help with the race against the counter-switch,
> which would likely be happening on some other CPU.
As above :-)

> 
> > Or see the alternative above: The counter is per task but latched to a
> > per-cpu on schedule().
> 
> This could work (assuming that synchronize_rcu() scanned the task list
> as well as the per-CPU counters), but again would extend grace periods
> too long for small-memory machines subject to denial-of-service attacks.
> 
No, there is no need to traverse the list. Again, synchronize_rcu() waits for
all CPUs to check their number _locally_. The task doing that can simply check
its latched number, as it knows it is not in a read-side CS.

> > [...] 
> > I was playing around with it a little while back. I didn't send a patch
> > though as I didn't have time. It works on a UP machine but I don't have an
> > SMP machine to test on :-(
> 
> There are some available from OSDL, right?  Or am I out of date here?
Honestly, I can't remember if I mailed it to anyone. I only have the code
on my home PC, which broke down. I am right now running off a backup. I
have attached the patch to this mail. I don't have time to clean it up :-(
I see it also includes code for a testing device where I try
to measure the grace period.

> 
> > The ingredients are:
> > 1) A per-cpu read-side counter, protected locally by preempt_disable().
> 
> In order to avoid too-long grace periods, you need a pair of counters.
> If you only have a single counter, you can end up with a series of
> RCU read-side critical-section preemptions resulting in the counter
> never reaching zero, and thus the grace period never terminating.
> Not good, especially in small-memory machines subject to denial-of-service
> attacks.
See the boosting trick below. That will do the trick.

> 
> > 2) A task read-side counter (doesn't need protection at all).
> 
> Yep.  Also a field in the task struct indicating which per-CPU counter
> applies (unless the scheduler is managing this).
No, it is _always_ the local CPU which applies, as I have tried to ensure
with 3) below (which isn't tested at all).
> 
> > 3) Tasks having a non-zero read-side counter can't be migrated to other
> > CPUs. That is to make the per-cpu counter truly per-cpu.
> 
> If CPU hotplug can be disallowed, then this can work.  Do we really want
> to disallow CPU hotplug?  Especially given that some implementations
> might want to use CPU hotplug to reduce power consumption?  There also
> might be issues with tickless operation.
> 
Yeah, I noticed the hotplug code. What are the RT requirements for
hotplugging? With 4) below it shouldn't be a problem to wait until all
tasks are out of the read-lock section.

A better alternative is to make migration work like this (roughly as
sketched below):
1) Deactivate the task.
2) Make the target cpu increment its rcu-counter.
3) Migrate the task.
4) Make the source cpu decrement its rcu-counter.
Then you can safely migrate tasks preempted in a read-side CS as well.
Harder to code than my small hack, though.
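
A hypothetical helper along these lines would capture the four steps;
rcu_deactivate_task() and rcu_set_task_cpu() are placeholders for whatever
the scheduler ends up doing, and only the active_readers and
rcu_read_lock_nesting fields come from the patch attached below:

	static void rcu_migrate_reader(struct task_struct *p, int src_cpu, int dst_cpu)
	{
		rcu_deactivate_task(p);			/* 1) take it off the source runqueue */
		per_cpu(rcu_data, dst_cpu).active_readers +=
			p->rcu_read_lock_nesting;	/* 2) target CPU takes over the count */
		rcu_set_task_cpu(p, dst_cpu);		/* 3) migrate */
		per_cpu(rcu_data, src_cpu).active_readers -=
			p->rcu_read_lock_nesting;	/* 4) source CPU drops the count */
	}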

> > 4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter > 0, it
> > boosts the priority to the maximum non-RT priority such that it will not be
> > preempted by other SCHED_OTHER tasks. This makes the delay in
> > synchronize_kernel() somewhat deterministic (it depends on how
> > "deterministic" the RT tasks in the system are).
> 
> But one would only want to do this boost if there was a synchronize_rcu()
> waiting -and- if there is some reason to accelerate grace periods (e.g.,
> low on memory), right?  If there is no reason to accelerate grace periods,
> then extending them increases efficiency by amortizing grace-period
> overhead over many updates.
Well, I only do the boost if the task is actually preempted in the
read-side CS - it should be relatively rare. I could make a check and
only do the boost if there is a request for a grace period set. BUT the
problem is that when somebody requests a grace period later on it would
have to traverse the runqueue and boost the tasks in read-side CSs. That
sounds more expensive, but one can only know that by testing....

> 
> > I'll try to send you what I got when I get my computer at home back up.
> > I have only tested it on my UP laptop, where it seems to run fine.
> 
> I look forward to seeing it!
> 
Attached here :-)
> 							Thanx, Paul
> 

Esben

[-- Attachment #2: Type: TEXT/PLAIN, Size: 11227 bytes --]

diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/Makefile linux-2.6.12-rc1-RCU/Makefile
--- Orig/linux-2.6.12-rc1/Makefile	Sun Mar 20 20:30:55 2005
+++ linux-2.6.12-rc1-RCU/Makefile	Sun Mar 27 21:05:58 2005
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 12
-EXTRAVERSION =-rc1
+EXTRAVERSION =-rc1-RCU-SMP
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Kconfig linux-2.6.12-rc1-RCU/drivers/char/Kconfig
--- Orig/linux-2.6.12-rc1/drivers/char/Kconfig	Sun Mar 20 20:31:03 2005
+++ linux-2.6.12-rc1-RCU/drivers/char/Kconfig	Fri Mar 25 00:19:07 2005
@@ -978,6 +978,13 @@
 	  The mmtimer device allows direct userspace access to the
 	  Altix system timer.
 
+
+config RCU_TESTER
+	tristate "Test device for testing RCU code."
+	default n
+	help
+	  To help debugging the RCU code. Don't include this.
+
 source "drivers/char/tpm/Kconfig"
 
 endmenu
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Makefile linux-2.6.12-rc1-RCU/drivers/char/Makefile
--- Orig/linux-2.6.12-rc1/drivers/char/Makefile	Sun Mar 20 20:31:03 2005
+++ linux-2.6.12-rc1-RCU/drivers/char/Makefile	Fri Mar 25 00:19:11 2005
@@ -48,6 +48,8 @@
 obj-$(CONFIG_VIOTAPE)		+= viotape.o
 obj-$(CONFIG_HVCS)		+= hvcs.o
 
+obj-$(CONFIG_RCU_TESTER)	+= rcu_tester.o
+
 obj-$(CONFIG_PRINTER) += lp.o
 obj-$(CONFIG_TIPAR) += tipar.o
 
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c
--- Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c	Thu Jan  1 01:00:00 1970
+++ linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c	Wed Mar 30 00:15:52 2005
@@ -0,0 +1,120 @@
+/*
+ * test device for testing RCU code
+ */
+
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+
+#define RCU_TESTER_MINOR	 222
+
+#define RCU_TESTER_WRITE	4247
+#define RCU_TESTER_READ 	4248
+
+#define ERCU_FAILED               35
+
+#ifndef notrace
+#define notrace
+#endif
+
+u64 notrace get_cpu_tick(void)
+{
+	u64 tsc;
+#ifdef ARCHARM
+	tsc = *oscr;
+#else
+	__asm__ __volatile__("rdtsc" : "=A" (tsc));
+#endif
+	return tsc;
+}
+
+void notrace loop(int loops)
+{
+	int i;
+
+	for (i = 0; i < loops; i++)
+		get_cpu_tick();
+}
+
+static spinlock_t write_lock;
+static int *protected_data = NULL;
+
+static int rcu_tester_open(struct inode *in, struct file *file)
+{
+	return 0;
+}
+
+static long rcu_tester_ioctl(struct file *file,
+			  unsigned int cmd, unsigned long args)
+{
+	switch(cmd) {
+	case RCU_TESTER_READ: {
+		int *dataptr, data1,data2;
+		rcu_read_lock();
+		dataptr = protected_data;
+		data1 = *dataptr;
+		loop(args);
+		data2 = *dataptr;
+		rcu_read_unlock();
+		if(data1!=data2)
+			return -ERCU_FAILED;
+		else
+			return 0; /* ok */
+	}
+	case RCU_TESTER_WRITE: {
+		int *newdata, *olddata;
+		newdata = kmalloc(sizeof(int), GFP_KERNEL);
+		if(!newdata)
+			return -ENOMEM;
+
+		spin_lock(&write_lock);
+		olddata = rcu_dereference(protected_data);
+		*newdata = *olddata + 1;
+		rcu_assign_pointer(protected_data, newdata);
+		spin_unlock(&write_lock);
+		synchronize_kernel();
+		kfree(olddata);
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
+}
+
+static struct file_operations rcu_tester_fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= no_llseek,
+	.unlocked_ioctl = rcu_tester_ioctl,
+	.open		= rcu_tester_open,
+};
+
+static struct miscdevice rcu_tester_dev =
+{
+	RCU_TESTER_MINOR,
+	"rcu_tester",
+	&rcu_tester_fops
+};
+
+static int __init rcu_tester_init(void)
+{
+	if (misc_register(&rcu_tester_dev))
+		return -ENODEV;
+
+	protected_data = kmalloc(sizeof(int), GFP_KERNEL);
+	*protected_data = 0;
+	
+	spin_lock_init(&write_lock);
+
+	return 0;
+}
+
+void __exit rcu_tester_exit(void)
+{
+	printk(KERN_INFO "rcu-tester device uninstalled\n");
+	misc_deregister(&rcu_tester_dev);
+}
+
+module_init(rcu_tester_init);
+module_exit(rcu_tester_exit);
+
+MODULE_LICENSE("GPL");
+
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/init_task.h linux-2.6.12-rc1-RCU/include/linux/init_task.h
--- Orig/linux-2.6.12-rc1/include/linux/init_task.h	Sun Mar 20 20:31:28 2005
+++ linux-2.6.12-rc1-RCU/include/linux/init_task.h	Fri Mar 25 00:22:00 2005
@@ -74,6 +74,7 @@
 	.usage		= ATOMIC_INIT(2),				\
 	.flags		= 0,						\
 	.lock_depth	= -1,						\
+	.rcu_read_lock_nesting = 0,				       	\
 	.prio		= MAX_PRIO-20,					\
 	.static_prio	= MAX_PRIO-20,					\
 	.policy		= SCHED_NORMAL,					\
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/rcupdate.h linux-2.6.12-rc1-RCU/include/linux/rcupdate.h
--- Orig/linux-2.6.12-rc1/include/linux/rcupdate.h	Wed Mar  2 08:37:50 2005
+++ linux-2.6.12-rc1-RCU/include/linux/rcupdate.h	Sun Mar 27 18:12:45 2005
@@ -41,6 +41,8 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
+#include <linux/sched.h>
+#include <linux/interrupt.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -85,6 +87,7 @@
  * curlist - current batch for which quiescent cycle started if any
  */
 struct rcu_data {
+	int             active_readers;
 	/* 1) quiescent state handling : */
 	long		quiescbatch;     /* Batch # for grace period */
 	int		passed_quiesc;	 /* User-mode/idle loop etc. */
@@ -115,12 +118,14 @@
 static inline void rcu_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+	if(rdp->active_readers==0)
+		rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+	if(rdp->active_readers==0)
+		rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +188,34 @@
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()		preempt_disable()
+static inline void rcu_read_lock(void)
+{	
+	unsigned long flags;
+
+	local_irq_save(flags);
+ 	__get_cpu_var(rcu_data).active_readers++;
+	current->rcu_read_lock_nesting++;
+	local_irq_restore(flags);
+}
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()	preempt_enable()
+static inline void rcu_read_unlock(void)
+{
+	unsigned long flags;
+	int cpu;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	
+ 	per_cpu(rcu_data,cpu).active_readers--;
+	current->rcu_read_lock_nesting--;
+	rcu_qsctr_inc(cpu);
+	local_irq_restore(flags);
+}
 
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
@@ -213,14 +238,34 @@
  * can use just rcu_read_lock().
  *
  */
-#define rcu_read_lock_bh()	local_bh_disable()
+static inline void rcu_read_lock_bh(void)
+{ 
+	unsigned long flags;
+
+	local_irq_save(flags);
+ 	__get_cpu_var(rcu_bh_data).active_readers++;
+	current->rcu_read_lock_nesting++;
+	local_irq_restore(flags);
+}
 
 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
-#define rcu_read_unlock_bh()	local_bh_enable()
+static inline void rcu_read_unlock_bh(void)	
+{ 
+	unsigned long flags;
+	int cpu;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	
+ 	per_cpu(rcu_bh_data,cpu).active_readers--;
+	current->rcu_read_lock_nesting--;
+	rcu_bh_qsctr_inc(cpu);
+	local_irq_restore(flags);
+}
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/sched.h linux-2.6.12-rc1-RCU/include/linux/sched.h
--- Orig/linux-2.6.12-rc1/include/linux/sched.h	Sun Mar 20 20:31:29 2005
+++ linux-2.6.12-rc1-RCU/include/linux/sched.h	Fri Mar 25 00:22:00 2005
@@ -569,6 +569,9 @@
 	unsigned long ptrace;
 
 	int lock_depth;		/* Lock depth */
+	
+	int rcu_read_lock_nesting; /* How many RCU read regions the thread is 
+				      in */
 
 	int prio, static_prio;
 	struct list_head run_list;
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/init/main.c linux-2.6.12-rc1-RCU/init/main.c
--- Orig/linux-2.6.12-rc1/init/main.c	Sun Mar 20 20:31:30 2005
+++ linux-2.6.12-rc1-RCU/init/main.c	Fri Mar 25 13:36:26 2005
@@ -688,8 +688,11 @@
 	system_state = SYSTEM_RUNNING;
 	numa_default_policy();
 
-	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
-		printk(KERN_WARNING "Warning: unable to open an initial console.\n");
+	{
+	  int res = sys_open((const char __user *) "/dev/console", O_RDWR, 0);
+	  if(res<0)
+	    printk(KERN_WARNING "Warning: unable to open an initial console.: retval=%d\n",res);
+	}
 
 	(void) sys_dup(0);
 	(void) sys_dup(0);
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/rcupdate.c linux-2.6.12-rc1-RCU/kernel/rcupdate.c
--- Orig/linux-2.6.12-rc1/kernel/rcupdate.c	Wed Mar  2 08:37:30 2005
+++ linux-2.6.12-rc1-RCU/kernel/rcupdate.c	Fri Mar 25 13:08:38 2005
@@ -66,7 +66,10 @@
 	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
 
 DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+EXPORT_PER_CPU_SYMBOL(rcu_data);
 DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+EXPORT_PER_CPU_SYMBOL(rcu_bh_data);
+
 
 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/sched.c linux-2.6.12-rc1-RCU/kernel/sched.c
--- Orig/linux-2.6.12-rc1/kernel/sched.c	Sun Mar 20 20:31:30 2005
+++ linux-2.6.12-rc1-RCU/kernel/sched.c	Wed Mar 30 00:06:39 2005
@@ -594,6 +594,9 @@
 	if (rt_task(p))
 		return p->prio;
 
+	if (p->rcu_read_lock_nesting)
+		return MAX_RT_PRIO;
+
 	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
 
 	prio = p->static_prio - bonus;
@@ -986,6 +989,9 @@
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
+	if (unlikely(p->rcu_read_lock_nesting))
+		goto out_activate;
+
 #ifdef CONFIG_SCHEDSTATS
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
@@ -1644,6 +1650,8 @@
 		return 0;
 	if (!cpu_isset(this_cpu, p->cpus_allowed))
 		return 0;
+	if (p->rcu_read_lock_nesting)
+		return 0;
 
 	/*
 	 * Aggressive migration if:
@@ -2633,6 +2641,7 @@
 need_resched:
 	preempt_disable();
 	prev = current;
+	  
 	release_kernel_lock(prev);
 need_resched_nonpreemptible:
 	rq = this_rq();
@@ -2675,6 +2684,7 @@
 		else {
 			if (prev->state == TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
+
 			deactivate_task(prev, rq);
 		}
 	}
@@ -2710,6 +2720,17 @@
 	}
 
 	array = rq->active;
+
+	if( unlikely(prev->rcu_read_lock_nesting) && 
+	    prev->prio > MAX_RT_PRIO ) {
+		prio_array_t *prev_array = prev->array;
+		if(prev_array)
+			dequeue_task(prev, prev_array);
+		prev->prio = MAX_RT_PRIO;
+		if(prev_array)
+			enqueue_task_head(prev, prev_array);
+	}
+
 	if (unlikely(!array->nr_active)) {
 		/*
 		 * Switch the active and expired arrays.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-11 18:14     ` Esben Nielsen
@ 2005-05-12  1:43       ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2005-05-12  1:43 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: linux-kernel, mingo, dipankar

On Wed, May 11, 2005 at 08:14:49PM +0200, Esben Nielsen wrote:
> On Tue, 10 May 2005, Paul E. McKenney wrote:
> 
> > On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> > [...]
> > > > 
> > > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > > counter-based approach, which seems to work, but which can result in
> > > > indefinite-duration grace periods.
> > >
> > > Is that really a problem in practise? For RCU updates should happen
> > > rarely.
> > 
> > Yes.  Several people have run into serious problems on small-memory
> > machines under denial-of-service workloads.  Not pretty.
> >
> That is what I tell people at work: Dynamic memory allocation is _not_ a
> good thing in real-time embedded systems. They call me old-fashioned when
> I say I want anything important pre-allocated in separate pools. But that
> avoids exactly these kinds of surprises, makes the system more
> deterministic, and makes it easier to find leaks.

As always, depends on the situation.  Configuring all the separate
pools can be a bit on the painful side, right?  ;-)

> > [...]
> > > 
> > > Why bother? What is the overhead of checking relative to just doing the
> > > increment/decrement?
> > 
> > The increment of the per-CPU counter pair is likely to incur a cache miss,
> > and thus be orders of magnitude more expensive than the increment of
> > the variable in the task structure.  I did not propose this optimization
> > lightly.  ;-)
> > 
> > > I had another idea: Do it in the scheduler:
> > >   per_cpu_rcu_count += prev->rcu_read_count;
> > >   if(per_cpu_rcu_count==0) {
> > >      /* grace period encountered */
> > >   }
> > >   (do the task switch)
> > >   
> > >   per_cpu_rcu_count -= count->rcu_read_count;
> > > 
> > > Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> > > current. During schedule there is no current and therefore it is the total
> > > count.
> > 
> > One downside of this approach is that it would maximally expand the
> > RCU read-side critical sections -- one of the things we need is to
> > avoid holding onto memory longer than necessary, as noted above.
> > 
> > Also, what prevents a grace period from ending while a task is
> > preempted in an RCU read-side critical section?  Or are you intending
> > that synchronize_rcu() scan the task list as well as the per-CPU
> > counters?
>
> Eh, if an RCU task is preempted in a read-side CS there IS no grace period!
> The trick is to avoid having it hanging there for too long (see my boosting
> mechanism below).

There is not -supposed- to be a grace period in that case.

> > > > [...] 
> > > > 2.	#1 above, but use a seqlock to guard the counter selection in
> > > > 	rcu_read_lock().  One potential disadvantage of this approach
> > > > 	is that an extremely unlucky instance of rcu_read_lock() might
> > > > 	be indefinitely delayed by a series of counter flips.  I am
> > > > 	concerned that this might actually happen under low-memory
> > > > 	conditions.  Also requires memory barriers on the read side,
> > > > 	which we might be stuck with, but still hope to be able to
> > > > 	get rid of.  And the per-CPU counter manipulation must use
> > > > 	atomic instructions.
> > >
> > > Why not just use 
> > >  preempt_disable(); 
> > >  manipulate counters; 
> > >  preempt_enable();
> > > ??
> > 
> > Because we can race with the counter switch, which can happen on some
> > other CPU.  Therefore, preempt_disable() does not help with this race.
> 
> Well, the per-cpu counters should be truly per-cpu. I.e. all tasks must
> _only_ touch the counter on their current cpu. preempt_disable() prevents
> the migration of the current task, right?
>  
> > 
> > > > 3.	#1 above, but use per-CPU locks to guard the counter selection.
> > > > 	I don't like this any better than #2, worse, in fact, since it
> > > > 	requires expensive atomic instructions as well.
> > >
> > > local_irqs_disable() and/or preempt_disable() should work, as the counters
> > > are per-cpu, right?
> > 
> > Again, neither of these help with the race against the counter-switch,
> > which would likely be happening on some other CPU.
> As above :-)
> 
> > 
> > > Or see the alternative above: The counter is per task but latched to a
> > > per-cpu on schedule().
> > 
> > This could work (assuming that synchronize_rcu() scanned the task list
> > as well as the per-CPU counters), but again would extend grace periods
> > too long for small-memory machines subject to denial-of-service attacks.
> > 
> No need to traverse the list. Again, synchronize_rcu() waits for all
> CPUs to check their numbers _locally_. The task doing that can simply check
> its latched number, as it knows it is not in a read-side CS.
> 
> > > [...] 
> > > I was playing around with it a little while back. I didn't sned a patch
> > > though as I didn't have time. It works on a UP machine but I don't have a
> > > SMP to test on :-(
> > 
> > There are some available from OSDL, right?  Or am I out of date here?
> Honestly, I can't remember if I mailed it to anyone. I only have the code
> on my home PC, which broke down. I am right now running off a backup. I
> have attached the patch to this mail. I don't have time to clean it up :-(
> I see it also includes code for a testing device with which I try
> to measure the grace period.
> 
> > 
> > > The ingreedients are:
> > > 1) A per-cpu read-side counter, protected locally by preempt_disable().
> > 
> > In order to avoid too-long grace periods, you need a pair of counters.
> > If you only have a single counter, you can end up with a series of
> > RCU read-side critical-section preemptions resulting in the counter
> > never reaching zero, and thus the grace period never terminating.
> > Not good, especially in small-memory machines subject to denial-of-service
> > attacks.
> See the boosting trick below. That will do the trick.
> 
> > 
> > > 2) A task read-side counter (doesn't need protectection at all).
> > 
> > Yep.  Also a field in the task struct indicating which per-CPU counter
> > applies (unless the scheduler is managing this).
> No, it is _always_ the local CPU which applies, as I have tried to express
> in 3) below (which isn't tested at all).
> > 
> > > 3) Tasks having non-zero read-side counter can't be migrated to other
> > > CPUs. That is to make the per-cpu counter truely per-cpu.
> > 
> > If CPU hotplug can be disallowed, then this can work.  Do we really want
> > to disallow CPU hotplug?  Especially given that some implementations
> > might want to use CPU hotplug to reduce power consumption?  There also
> > might be issues with tickless operation.
> > 
> Yeah, I noticed the hotplug code. What are the RT requirements for
> hotplugging? With 4) below it shouldn't be a problem to wait until all
> tasks are out of the read-lock section.
> 
> A better alternative is to make migration work like this:
> 1) Deactivate the task.
> 2) Make the target cpu increment its rcu-counter.
> 3) Migrate the task.
> 4) Make the source cpu decrement its rcu-counter.
> Then you can safely migrate tasks preempted in read-side CS as well.
> Harder to code than my small hack, though.
> 
> > > 4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter>0 it
> > > boost the priority to the maximum non-RT priority such it will not be 
> > > preempted by other SCHED_OTHER tasks. This makes the  delay in
> > > syncronize_kernel() somewhat deterministic (it depends on how
> > > "deterministic" the RT tasks in the system are.)
> > 
> > But one would only want to do this boost if there was a synchronize_rcu()
> > waiting -and- if there is some reason to accelerate grace periods (e.g.,
> > low on memory), right?  If there is no reason to accelerate grace periods,
> > then extending them increases efficiency by amortizing grace-period
> > overhead over many updates.
> Well, I only do the boost if the task is actually preempted in the
> read-side CS - it should be relatively rare. I could make a check and
> only do the boost if a grace-period request is set. BUT the
> problem is that when somebody requests a grace period later on, it would
> have to traverse the runqueue and boost the tasks in read-side CS. That
> sounds more expensive, but one can only know by testing....
> 
> > 
> > > I'll try to send you what I got when I get my computer at home back up.
> > > I have only tested it on my UP laptop, where it seems to run fine.
> > 
> > I look forward to seeing it!
> > 
> Attached here :-)

Thank you, see comments interspersed.

> > 							Thanx, Paul
> > 
> 
> Esben

> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/Makefile linux-2.6.12-rc1-RCU/Makefile
> --- Orig/linux-2.6.12-rc1/Makefile	Sun Mar 20 20:30:55 2005
> +++ linux-2.6.12-rc1-RCU/Makefile	Sun Mar 27 21:05:58 2005
> @@ -1,7 +1,7 @@
>  VERSION = 2
>  PATCHLEVEL = 6
>  SUBLEVEL = 12
> -EXTRAVERSION =-rc1
> +EXTRAVERSION =-rc1-RCU-SMP
>  NAME=Woozy Numbat
>  
>  # *DOCUMENTATION*
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Kconfig linux-2.6.12-rc1-RCU/drivers/char/Kconfig
> --- Orig/linux-2.6.12-rc1/drivers/char/Kconfig	Sun Mar 20 20:31:03 2005
> +++ linux-2.6.12-rc1-RCU/drivers/char/Kconfig	Fri Mar 25 00:19:07 2005
> @@ -978,6 +978,13 @@
>  	  The mmtimer device allows direct userspace access to the
>  	  Altix system timer.
>  
> +
> +config RCU_TESTER
> +	tristate "Test device for testing RCU code."
> +	default n
> +	help
> +	  To help debugging the RCU code. Don't include this.
> +
>  source "drivers/char/tpm/Kconfig"
>  
>  endmenu
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Makefile linux-2.6.12-rc1-RCU/drivers/char/Makefile
> --- Orig/linux-2.6.12-rc1/drivers/char/Makefile	Sun Mar 20 20:31:03 2005
> +++ linux-2.6.12-rc1-RCU/drivers/char/Makefile	Fri Mar 25 00:19:11 2005
> @@ -48,6 +48,8 @@
>  obj-$(CONFIG_VIOTAPE)		+= viotape.o
>  obj-$(CONFIG_HVCS)		+= hvcs.o
>  
> +obj-$(CONFIG_RCU_TESTER)	+= rcu_tester.o
> +
>  obj-$(CONFIG_PRINTER) += lp.o
>  obj-$(CONFIG_TIPAR) += tipar.o
>  
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c
> --- Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c	Thu Jan  1 01:00:00 1970
> +++ linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c	Wed Mar 30 00:15:52 2005
> @@ -0,0 +1,120 @@
> +/*
> + * test device for testing RCU code
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/miscdevice.h>
> +
> +#define RCU_TESTER_MINOR	 222
> +
> +#define RCU_TESTER_WRITE	4247
> +#define RCU_TESTER_READ 	4248
> +
> +#define ERCU_FAILED               35
> +
> +#ifndef notrace
> +#define notrace
> +#endif
> +
> +u64 notrace get_cpu_tick(void)
> +{
> +	u64 tsc;
> +#ifdef ARCHARM
> +	tsc = *oscr;
> +#else
> +	__asm__ __volatile__("rdtsc" : "=A" (tsc));
> +#endif
> +	return tsc;
> +}
> +
> +void notrace loop(int loops)
> +{
> +	int i;
> +
> +	for (i = 0; i < loops; i++)
> +		get_cpu_tick();
> +}
> +
> +static spinlock_t write_lock;
> +static int *protected_data = NULL;
> +
> +static int rcu_tester_open(struct inode *in, struct file *file)
> +{
> +	return 0;
> +}
> +
> +static long rcu_tester_ioctl(struct file *file,
> +			  unsigned int cmd, unsigned long args)
> +{
> +	switch(cmd) {
> +	case RCU_TESTER_READ: {
> +		int *dataptr, data1,data2;
> +		rcu_read_lock();
> +		dataptr = protected_data;
> +		data1 = *dataptr;
> +		loop(args);
> +		data2 = *dataptr;
> +		rcu_read_unlock();
> +		if(data1!=data2)
> +			return -ERCU_FAILED;
> +		else
> +			return 0; /* ok */
> +	}
> +	case RCU_TESTER_WRITE: {
> +		int *newdata, *olddata;
> +		newdata = kmalloc(sizeof(int), GFP_KERNEL);
> +		if(!newdata)
> +			return -ENOMEM;
> +
> +		spin_lock(&write_lock);
> +		olddata = rcu_dereference(protected_data);
> +		*newdata = *olddata + 1;
> +		rcu_assign_pointer(protected_data, newdata);
> +		spin_unlock(&write_lock);
> +		synchronize_kernel();
> +		kfree(olddata);
> +		return 0;
> +	}
> +	default:
> +		return -EINVAL;
> +	}
> +}

Cute RCU test code!

You could get better coverage by switching between two different structs,
with each containing a pair of integers.

Initialize both, point to one of them.  Repeatedly do the following
in RCU_TESTER_WRITE:

1.	Point to the other struct.

2.	Wait for a grace period to elapse.

3.	Increment one of the integers in the not-pointed-to structure.

4.	Delay for some time.

5.	Increment the other integer in the not-pointed-to structure.

Then RCU_TESTER_READ can repeatedly dereference the pointer under RCU
protection, and compare the two integers.  If they differ, RCU's grace
period was not long enough.
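
As a rough, untested sketch of that scheme (reusing loop(), write_lock,
ERCU_FAILED, and synchronize_kernel() from the patch above; struct
rcu_test_data and the helper-function names are made up purely for
illustration):

struct rcu_test_data {
	int a;
	int b;
};

static struct rcu_test_data rcu_test_pool[2];
static struct rcu_test_data *rcu_test_cur = &rcu_test_pool[0];

/*
 * RCU_TESTER_WRITE: point to the other struct, wait for a grace period,
 * then update the no-longer-referenced struct in two separated steps,
 * so that a == b holds only once the update has completed.
 */
static int rcu_test_do_write(unsigned long delay_loops)
{
	struct rcu_test_data *old;

	spin_lock(&write_lock);
	old = rcu_test_cur;
	rcu_assign_pointer(rcu_test_cur,
			   old == &rcu_test_pool[0] ?
			   &rcu_test_pool[1] : &rcu_test_pool[0]);
	spin_unlock(&write_lock);

	synchronize_kernel();	/* no reader may still reference "old" now */

	old->a++;		/* first half of the update */
	loop(delay_loops);	/* widen the window a late reader would hit */
	old->b++;		/* second half; struct is consistent again */
	return 0;
}

/*
 * RCU_TESTER_READ: if grace periods are long enough, a == b always.
 */
static int rcu_test_do_read(unsigned long delay_loops)
{
	struct rcu_test_data *p;
	int a, b;

	rcu_read_lock();
	p = rcu_dereference(rcu_test_cur);
	a = p->a;
	loop(delay_loops);
	b = p->b;
	rcu_read_unlock();

	return (a == b) ? 0 : -ERCU_FAILED;
}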

> +
> +static struct file_operations rcu_tester_fops = {
> +	.owner		= THIS_MODULE,
> +	.llseek		= no_llseek,
> +	.unlocked_ioctl = rcu_tester_ioctl,
> +	.open		= rcu_tester_open,
> +};
> +
> +static struct miscdevice rcu_tester_dev =
> +{
> +	RCU_TESTER_MINOR,
> +	"rcu_tester",
> +	&rcu_tester_fops
> +};
> +
> +static int __init rcu_tester_init(void)
> +{
> +	if (misc_register(&rcu_tester_dev))
> +		return -ENODEV;
> +
> +	protected_data = kmalloc(sizeof(int), GFP_KERNEL);
> +	*protected_data = 0;
> +	
> +	spin_lock_init(&write_lock);
> +
> +	return 0;
> +}
> +
> +void __exit rcu_tester_exit(void)
> +{
> +	printk(KERN_INFO "rcu-tester device uninstalled\n");
> +	misc_deregister(&rcu_tester_dev);
> +}
> +
> +module_init(rcu_tester_init);
> +module_exit(rcu_tester_exit);
> +
> +MODULE_LICENSE("GPL");
> +
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/init_task.h linux-2.6.12-rc1-RCU/include/linux/init_task.h
> --- Orig/linux-2.6.12-rc1/include/linux/init_task.h	Sun Mar 20 20:31:28 2005
> +++ linux-2.6.12-rc1-RCU/include/linux/init_task.h	Fri Mar 25 00:22:00 2005
> @@ -74,6 +74,7 @@
>  	.usage		= ATOMIC_INIT(2),				\
>  	.flags		= 0,						\
>  	.lock_depth	= -1,						\
> +	.rcu_read_lock_nesting = 0,				       	\
>  	.prio		= MAX_PRIO-20,					\
>  	.static_prio	= MAX_PRIO-20,					\
>  	.policy		= SCHED_NORMAL,					\
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/rcupdate.h linux-2.6.12-rc1-RCU/include/linux/rcupdate.h
> --- Orig/linux-2.6.12-rc1/include/linux/rcupdate.h	Wed Mar  2 08:37:50 2005
> +++ linux-2.6.12-rc1-RCU/include/linux/rcupdate.h	Sun Mar 27 18:12:45 2005
> @@ -41,6 +41,8 @@
>  #include <linux/percpu.h>
>  #include <linux/cpumask.h>
>  #include <linux/seqlock.h>
> +#include <linux/sched.h>
> +#include <linux/interrupt.h>
>  
>  /**
>   * struct rcu_head - callback structure for use with RCU
> @@ -85,6 +87,7 @@
>   * curlist - current batch for which quiescent cycle started if any
>   */
>  struct rcu_data {
> +	int             active_readers;
>  	/* 1) quiescent state handling : */
>  	long		quiescbatch;     /* Batch # for grace period */
>  	int		passed_quiesc;	 /* User-mode/idle loop etc. */
> @@ -115,12 +118,14 @@
>  static inline void rcu_qsctr_inc(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> -	rdp->passed_quiesc = 1;
> +	if(rdp->active_readers==0)
> +		rdp->passed_quiesc = 1;
>  }
>  static inline void rcu_bh_qsctr_inc(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> -	rdp->passed_quiesc = 1;
> +	if(rdp->active_readers==0)
> +		rdp->passed_quiesc = 1;
>  }

This seems to suffer from extended grace periods.  Or am I missing something?

>  static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> @@ -183,14 +188,34 @@
>   *
>   * It is illegal to block while in an RCU read-side critical section.
>   */
> -#define rcu_read_lock()		preempt_disable()
> +static inline void rcu_read_lock(void)
> +{	
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> + 	__get_cpu_var(rcu_data).active_readers++;
> +	current->rcu_read_lock_nesting++;
> +	local_irq_restore(flags);
> +}
>  
>  /**
>   * rcu_read_unlock - marks the end of an RCU read-side critical section.
>   *
>   * See rcu_read_lock() for more information.
>   */
> -#define rcu_read_unlock()	preempt_enable()
> +static inline void rcu_read_unlock(void)
> +{
> +	unsigned long flags;
> +	int cpu;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	
> + 	per_cpu(rcu_data,cpu).active_readers--;
> +	current->rcu_read_lock_nesting--;
> +	rcu_qsctr_inc(cpu);
> +	local_irq_restore(flags);
> +}
>  
>  /*
>   * So where is rcu_write_lock()?  It does not exist, as there is no
> @@ -213,14 +238,34 @@
>   * can use just rcu_read_lock().
>   *
>   */
> -#define rcu_read_lock_bh()	local_bh_disable()
> +static inline void rcu_read_lock_bh(void)
> +{ 
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> + 	__get_cpu_var(rcu_bh_data).active_readers++;
> +	current->rcu_read_lock_nesting++;
> +	local_irq_restore(flags);
> +}
>  
>  /*
>   * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
>   *
>   * See rcu_read_lock_bh() for more information.
>   */
> -#define rcu_read_unlock_bh()	local_bh_enable()
> +static inline void rcu_read_unlock_bh(void)	
> +{ 
> +	unsigned long flags;
> +	int cpu;
> +
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	
> + 	per_cpu(rcu_bh_data,cpu).active_readers--;
> +	current->rcu_read_lock_nesting--;
> +	rcu_bh_qsctr_inc(cpu);
> +	local_irq_restore(flags);
> +}
>  
>  /**
>   * rcu_dereference - fetch an RCU-protected pointer in an
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/sched.h linux-2.6.12-rc1-RCU/include/linux/sched.h
> --- Orig/linux-2.6.12-rc1/include/linux/sched.h	Sun Mar 20 20:31:29 2005
> +++ linux-2.6.12-rc1-RCU/include/linux/sched.h	Fri Mar 25 00:22:00 2005
> @@ -569,6 +569,9 @@
>  	unsigned long ptrace;
>  
>  	int lock_depth;		/* Lock depth */
> +	
> +	int rcu_read_lock_nesting; /* How many RCU read regions the thread is 
> +				      in */
>  
>  	int prio, static_prio;
>  	struct list_head run_list;
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/init/main.c linux-2.6.12-rc1-RCU/init/main.c
> --- Orig/linux-2.6.12-rc1/init/main.c	Sun Mar 20 20:31:30 2005
> +++ linux-2.6.12-rc1-RCU/init/main.c	Fri Mar 25 13:36:26 2005
> @@ -688,8 +688,11 @@
>  	system_state = SYSTEM_RUNNING;
>  	numa_default_policy();
>  
> -	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
> -		printk(KERN_WARNING "Warning: unable to open an initial console.\n");
> +	{
> +	  int res = sys_open((const char __user *) "/dev/console", O_RDWR, 0);
> +	  if(res<0)
> +	    printk(KERN_WARNING "Warning: unable to open an initial console.: retval=%d\n",res);
> +	}
>  
>  	(void) sys_dup(0);
>  	(void) sys_dup(0);
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/rcupdate.c linux-2.6.12-rc1-RCU/kernel/rcupdate.c
> --- Orig/linux-2.6.12-rc1/kernel/rcupdate.c	Wed Mar  2 08:37:30 2005
> +++ linux-2.6.12-rc1-RCU/kernel/rcupdate.c	Fri Mar 25 13:08:38 2005
> @@ -66,7 +66,10 @@
>  	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
>  
>  DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
> +EXPORT_PER_CPU_SYMBOL(rcu_data);
>  DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
> +EXPORT_PER_CPU_SYMBOL(rcu_bh_data);
> +
>  
>  /* Fake initialization required by compiler */
>  static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/sched.c linux-2.6.12-rc1-RCU/kernel/sched.c
> --- Orig/linux-2.6.12-rc1/kernel/sched.c	Sun Mar 20 20:31:30 2005
> +++ linux-2.6.12-rc1-RCU/kernel/sched.c	Wed Mar 30 00:06:39 2005
> @@ -594,6 +594,9 @@
>  	if (rt_task(p))
>  		return p->prio;
>  
> +	if (p->rcu_read_lock_nesting)
> +		return MAX_RT_PRIO;
> +

Do you really want to do this priority boost unconditionally?
Seems like you would only want to if there are callbacks waiting
for a grace period when the system is low on memory.

Otherwise, how does boosting the priority help?

>  	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>  
>  	prio = p->static_prio - bonus;
> @@ -986,6 +989,9 @@
>  	if (unlikely(task_running(rq, p)))
>  		goto out_activate;
>  
> +	if (unlikely(p->rcu_read_lock_nesting))
> +		goto out_activate;
> +
>  #ifdef CONFIG_SCHEDSTATS
>  	schedstat_inc(rq, ttwu_cnt);
>  	if (cpu == this_cpu) {
> @@ -1644,6 +1650,8 @@
>  		return 0;
>  	if (!cpu_isset(this_cpu, p->cpus_allowed))
>  		return 0;
> +	if (p->rcu_read_lock_nesting)
> +		return 0;

Could you please add "p" to your "diff" command so that the function
is listed?
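
(That is, GNU diff's -p / --show-c-function option, which puts the name of
the enclosing C function into each hunk's @@ header.  For example:

	diff -Naurp --exclude-from=diff_exclude Orig/linux-2.6.12-rc1 linux-2.6.12-rc1-RCU

would make it obvious which scheduler functions the hunks above touch.)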

>  	/*
>  	 * Aggressive migration if:
> @@ -2633,6 +2641,7 @@
>  need_resched:
>  	preempt_disable();
>  	prev = current;
> +	  
>  	release_kernel_lock(prev);
>  need_resched_nonpreemptible:
>  	rq = this_rq();
> @@ -2675,6 +2684,7 @@
>  		else {
>  			if (prev->state == TASK_UNINTERRUPTIBLE)
>  				rq->nr_uninterruptible++;
> +
>  			deactivate_task(prev, rq);
>  		}
>  	}
> @@ -2710,6 +2720,17 @@
>  	}
>  
>  	array = rq->active;
> +
> +	if( unlikely(prev->rcu_read_lock_nesting) && 
> +	    prev->prio > MAX_RT_PRIO ) {
> +		prio_array_t *prev_array = prev->array;
> +		if(prev_array)
> +			dequeue_task(prev, prev_array);
> +		prev->prio = MAX_RT_PRIO;
> +		if(prev_array)
> +			enqueue_task_head(prev, prev_array);
> +	}
> +

Again, do you really want to boost priority unconditionally when in
an RCU read-side critical section?

>  	if (unlikely(!array->nr_active)) {
>  		/*
>  		 * Switch the active and expired arrays.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-10 22:36       ` Paul E. McKenney
@ 2005-05-12  8:24         ` Peter Zijlstra
  2005-05-12 14:30           ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2005-05-12  8:24 UTC (permalink / raw)
  To: paulmck; +Cc: dipankar, Ingo Molnar, Linux-kernel

On Tue, 2005-05-10 at 15:36 -0700, Paul E. McKenney wrote:
> On Tue, May 10, 2005 at 10:56:24PM +0200, Peter Zijlstra wrote:
> > On Tue, 2005-05-10 at 13:29 -0700, Paul E. McKenney wrote:
> > > On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> > > > On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> > > > 
> > > > > 
> > > > > Counter-Based Approach
> > > > > 
> > > > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > > > counter-based approach, which seems to work, but which can result in
> > > > > indefinite-duration grace periods.  The following are very hazy thoughts
> > > > > on how to get the benefits of this approach, but with short grace periods.
> > > > > 
> > > > > 1.	The basic trick is to maintain a pair of counters per CPU.
> > > > > 	There would also be a global boolean variable that would select
> > > > > 	one or the other of each pair.  The rcu_read_lock() primitive
> > > > > 	would then increment the counter indicated by the boolean
> > > > > 	corresponding to the CPU that it is currently running on.
> > > > > 	It would also keep a pointer to that particular counter in
> > > > > 	the task structure.  The rcu_read_unlock() primitive would
> > > > > 	decrement this counter.  (And, yes, you would also have a
> > > > > 	counter in the task structure so that only the outermost of
> > > > > 	a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > > > > 	actually increment/decrement the per-CPU counter pairs.)
> > > > > 
> > > > > 	To force a grace period, one would invert the value of the
> > > > > 	global boolean variable.  Once all the counters indicated
> > > > > 	by the old value of the global boolean variable hit zero,
> > > > > 	the corresponding set of RCU callbacks can be safely invoked.
> > > > > 
> > > > > 	The big problem with this approach is that a pair of inversions
> > > > > 	of the global boolean variable could be spaced arbitrarily 
> > > > > 	closely, especially when you consider that the read side code
> > > > > 	can be preempted.  This could cause RCU callbacks to be invoked
> > > > > 	prematurely, which could greatly reduce the life expectancy
> > > > > 	of your kernel.
> > > > 
> > > > > Thoughts?
> > > > 
> > > > How about having another boolean indicating the ability to flip the
> > > > selector boolean. This boolean would be set false on an actual flip and
> > > > cleared during a grace period. That way the flips cannot ever interfere
> > > > with one another such that the callbacks would be cleared prematurely.
> > > 
> > > But the flip is an integral part of detecting a grace period.  So, if I
> > > understand your proposal correctly, I would have to flip to figure out
> > > when it was safe to flip.
> > > 
> > > What am I missing here?
> > 
> > 
> > int can_flip = 1;
> > int selector = 0;
> > 
> > int counter[2] = {0, 0};
> > 
> > void up()
> > {
> >   ++counter[current->selection = selector];
> 
> Suppose task 0 has just fetched the value of "selector".  How does
> force_grace() know that it is now inappropriate to invert the value
> of "selector"?
> 
> One might suppress preemption, but there can still be interrupts,
> ECC error correction, and just plain bad luck.  So up() needs to
> be able to deal with "selector" getting inverted out from under it.
> 
> Unless I am missing something still...

True, I see your point; there is a race between the = and ++ operators.

current->selection = selector;
--> gap
++counter[current->selection];

If you flip, and the (now old) current->selection reaches 0 before this
task gets executed again, do_grace gets called and cleans up the callbacks;
that should not matter, since this task has not yet started using any
data. However, it will be problematic because when it does get scheduled
again it works on the wrong counter and thus does not prevent a grace
period on the data it will be using.

However, I assumed you had these problems solved with your counter-based
approach. My code was meant to illustrate how I thought your double
inversion problem could be avoided.

Do you by any chance have an RCU implementation based on the counter-based
approach, so I can try to understand it and maybe integrate my
ideas?


> > }
> > 
> > void down()
> > {
> >   if (!--counter[current->selection])
> >     do_grace();
> > }
> > 
> > void do_grace()
> > {
> >   // free stuff
> >   can_flip = 1;
> > }
> > 
> > void force_grace()
> > {
> >   if (can_flip)
> >   {
> >     can_flip = 0;
> >     selector ^= 1;
> >   }
> > }
> > 
> > 
> > The way I understood your proposal was that in order to force a grace
> > period you flip the selector and once the old one reaches zero again it
> > does a cleanup.
> > 
> > Now your problem was that when you force another flip before the old
> > counter has reached zero, the shit will hit the proverbial fan. So what I
> > proposed (as hopefully illustrated by the code) was another boolean; my
> > can_flip; that controls the flippability :-)
> > 
> > One can for example call force_grace every few seconds or when a
> > watermark on the rcu-callback queue has been reached. If can_flip blocks
> > the actual flip nothing is lost since a cleanup is allready pending.
> > 
> > I hope to have been clearer in explaining my idea; or if I'm missing the
> > issue tell me to read the thread in the morning ;)
> > 
> > -- 
> > Peter Zijlstra <a.p.zijlstra@chello.nl>
> > 
> > 
-- 
Peter Zijlstra <a.p.zijlstra@chello.nl>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] RCU and CONFIG_PREEMPT_RT progress
  2005-05-12  8:24         ` Peter Zijlstra
@ 2005-05-12 14:30           ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2005-05-12 14:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: dipankar, Ingo Molnar, Linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2707 bytes --]

On Thu, May 12, 2005 at 10:24:37AM +0200, Peter Zijlstra wrote:
> On Tue, 2005-05-10 at 15:36 -0700, Paul E. McKenney wrote:

[ . . . ]

> > > > But the flip is an integral part of detecting a grace period.  So, if I
> > > > understand your proposal correctly, I would have to flip to figure out
> > > > when it was safe to flip.
> > > > 
> > > > What am I missing here?
> > > 
> > > 
> > > int can_flip = 1;
> > > int selector = 0;
> > > 
> > > int counter[2] = {0, 0};
> > > 
> > > void up()
> > > {
> > >   ++counter[current->selection = selector];
> > 
> > Suppose task 0 has just fetched the value of "selector".  How does
> > force_grace() know that it is now inappropriate to invert the value
> > of "selector"?
> > 
> > One might suppress preemption, but there can still be interrupts,
> > ECC error correction, and just plain bad luck.  So up() needs to
> > be able to deal with "selector" getting inverted out from under it.
> > 
> > Unless I am missing something still...
> 
> True, I see your point; there is a race between the = and ++ operators.
> 
> current->selection = selector;
> --> gap
> ++counter[current->selection];
> 
> If you flip, and the (now old) current->selection reaches 0 before this
> task gets executed again, do_grace gets called and cleans up the callbacks;
> that should not matter, since this task has not yet started using any
> data. However, it will be problematic because when it does get scheduled
> again it works on the wrong counter and thus does not prevent a grace
> period on the data it will be using.

Yep, that is indeed my concern.

> However, I assumed you had these problems solved with your counter-based
> approach. My code was meant to illustrate how I thought your double
> inversion problem could be avoided.
> 
> Do you by any chance have an RCU implementation based on the counter-based
> approach, so I can try to understand it and maybe integrate my
> ideas?

There are a couple of counter-based implementations out there:

o	The K42/Tornado implementation of RCU.

o	Dipankar Sarma's rcu_preempt-2.5.8-3.patch and
	rcu_poll_preempt-2.5.14-2.patch that were proposed for
	Linux some years back.  These are attached.

Both of these can potentially suffer from huge grace periods, though
the K42/Tornado guys may have come up with some tricks to avoid this
by now.  These long grace periods were why the above patches were
passed over for Linux's RCU.

Both of these also rely on the fact that the time interval between
a pair of counter switches must have context switches on all CPUs
(or threads, depending on the exact implementation).  We need to
avoid this requirement in CONFIG_PREEMPT_RT in order to deal with
small-memory environments.

						Thanx, Paul

[-- Attachment #2: rcu_preempt-2.5.8-3.patch --]
[-- Type: text/plain, Size: 19832 bytes --]

diff -urN linux-2.5.8-base/include/linux/init_task.h linux-2.5.8-rcu/include/linux/init_task.h
--- linux-2.5.8-base/include/linux/init_task.h	Mon Apr 15 00:48:45 2002
+++ linux-2.5.8-rcu/include/linux/init_task.h	Wed Apr 17 14:47:02 2002
@@ -79,6 +79,7 @@
     blocked:		{{0}},						\
     alloc_lock:		SPIN_LOCK_UNLOCKED,				\
     journal_info:	NULL,						\
+    cpu_preempt_cntr:	NULL,					\
 }
 
 
diff -urN linux-2.5.8-base/include/linux/rcupdate.h linux-2.5.8-rcu/include/linux/rcupdate.h
--- linux-2.5.8-base/include/linux/rcupdate.h	Thu Jan  1 05:30:00 1970
+++ linux-2.5.8-rcu/include/linux/rcupdate.h	Wed Apr 17 14:38:14 2002
@@ -0,0 +1,73 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (c) International Business Machines Corp., 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>
+ * 
+ * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
+ * and inputs from Andrea Arcangeli, Rusty Russell, Andi Kleen etc.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * 		http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ */
+
+#ifndef __LINUX_RCUPDATE_H
+#define __LINUX_RCUPDATE_H
+
+#include <linux/list.h>
+#include <linux/smp.h>
+
+/*
+ * Callback structure for use with call_rcu(). 
+ */
+struct rcu_head {
+	struct list_head list;
+	void (*func)(void *obj);
+	void *arg;
+};
+
+#define RCU_HEAD_INIT(head) { LIST_HEAD_INIT(head.list), NULL, NULL }
+#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT(head)
+#define INIT_RCU_HEAD(ptr) do { \
+	INIT_LIST_HEAD(&(ptr)->list); (ptr)->func = NULL; (ptr)->arg = NULL; \
+} while (0)
+
+/* Batch counter type. */
+typedef long rcu_batch_t;
+
+/*
+ * RCU_BATCH_XX(rcu_batch_t a, rcu_batch_t b)
+ * 
+ * Returns true if batch number ``a'' compares as specified to ``b''.
+ * This comparison allows for the fact that the batch numbers wrap.
+ */
+#define RCU_BATCH_EQ(a, b)		((a) - (b) == 0)
+#define RCU_BATCH_GE(a, b)		((a) - (b) >= 0)
+#define RCU_BATCH_GT(a, b)		((a) - (b) > 0)
+#define RCU_BATCH_LE(a, b)		((a) - (b) <= 0)
+#define RCU_BATCH_LT(a, b)		((a) - (b) < 0)
+#define RCU_BATCH_NE(a, b)		((a) - (b) != 0)
+
+extern void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);
+extern void synchronize_kernel(void);
+
+#endif /* __LINUX_RCUPDATE_H */
diff -urN linux-2.5.8-base/include/linux/sched.h linux-2.5.8-rcu/include/linux/sched.h
--- linux-2.5.8-base/include/linux/sched.h	Mon Apr 15 00:48:45 2002
+++ linux-2.5.8-rcu/include/linux/sched.h	Fri Apr 19 22:10:43 2002
@@ -28,6 +28,7 @@
 #include <linux/securebits.h>
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
+#include <linux/percpu.h>
 
 struct exec_domain;
 
@@ -346,6 +347,7 @@
 
 /* journalling filesystem info */
 	void *journal_info;
+	atomic_t *cpu_preempt_cntr;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
@@ -418,6 +420,7 @@
 
 extern struct   mm_struct init_mm;
 extern struct task_struct *init_tasks[NR_CPUS];
+extern long cpu_quiescent __per_cpu_data;
 
 /* PID hashing. (shouldnt this be dynamic?) */
 #define PIDHASH_SZ (4096 >> 2)
@@ -883,6 +886,53 @@
 		clear_thread_flag(TIF_SIGPENDING);
 }
 
+#ifdef CONFIG_PREEMPT
+
+extern atomic_t rcu_preempt_cntr[2] __per_cpu_data;
+extern atomic_t *curr_preempt_cntr __per_cpu_data;
+extern atomic_t *next_preempt_cntr __per_cpu_data;
+
+static inline void rcu_switch_preempt_cntr(int cpu)
+{
+	atomic_t *tmp;
+	tmp = per_cpu(curr_preempt_cntr, cpu);
+	per_cpu(curr_preempt_cntr, cpu) = per_cpu(next_preempt_cntr, cpu);
+	per_cpu(next_preempt_cntr, cpu) = tmp;
+
+}
+
+static inline void rcu_preempt_put(void)
+{
+	if (unlikely(current->cpu_preempt_cntr != NULL)) {
+		atomic_dec(current->cpu_preempt_cntr);
+		current->cpu_preempt_cntr = NULL;
+	}
+}
+
+/* Must not be preempted */
+static inline void rcu_preempt_get(void)
+{
+	if (likely(current->cpu_preempt_cntr == NULL)) {
+		current->cpu_preempt_cntr = 
+				this_cpu(next_preempt_cntr);
+		atomic_inc(current->cpu_preempt_cntr);
+	}
+}
+
+static inline int rcu_cpu_preempted(int cpu)
+{
+	return (atomic_read(per_cpu(curr_preempt_cntr, cpu)) != 0);
+}
+#else
+
+#define rcu_init_preempt_cntr(cpu) do { } while(0)
+#define rcu_switch_preempt_cntr(cpu) do { } while(0)
+#define rcu_preempt_put() do { } while(0)
+#define rcu_preempt_get() do { } while(0)
+#define rcu_cpu_preempted(cpu) (0)
+
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif
diff -urN linux-2.5.8-base/kernel/Makefile linux-2.5.8-rcu/kernel/Makefile
--- linux-2.5.8-base/kernel/Makefile	Mon Apr 15 00:48:47 2002
+++ linux-2.5.8-rcu/kernel/Makefile	Tue Apr 16 12:32:09 2002
@@ -10,12 +10,12 @@
 O_TARGET := kernel.o
 
 export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \
-		printk.o platform.o 
+		printk.o platform.o rcupdate.o
 
 obj-y     = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o futex.o platform.o
+	    signal.o sys.o kmod.o context.o futex.o platform.o rcupdate.o
 
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -urN linux-2.5.8-base/kernel/exit.c linux-2.5.8-rcu/kernel/exit.c
--- linux-2.5.8-base/kernel/exit.c	Mon Apr 15 00:48:49 2002
+++ linux-2.5.8-rcu/kernel/exit.c	Wed Apr 17 11:25:01 2002
@@ -518,6 +518,7 @@
 
 	tsk->exit_code = code;
 	exit_notify();
+	rcu_preempt_put();
 	schedule();
 	BUG();
 /*
diff -urN linux-2.5.8-base/kernel/rcupdate.c linux-2.5.8-rcu/kernel/rcupdate.c
--- linux-2.5.8-base/kernel/rcupdate.c	Thu Jan  1 05:30:00 1970
+++ linux-2.5.8-rcu/kernel/rcupdate.c	Fri Apr 19 16:59:03 2002
@@ -0,0 +1,393 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (c) International Business Machines Corp., 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>
+ * 
+ * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
+ * and inputs from Andrea Arcangeli, Rusty Russell, Andi Kleen etc.
+ *
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * 		http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/config.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
+#include <asm/bitops.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/smp.h>
+#include <linux/init.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+
+/*
+ * Per-CPU data for Read-Copy UPdate.
+ * We maintain callbacks in 3 per-cpu lists. Each list represents
+ * a batch and every batch is given a number that is global across
+ * all CPUs. The global current batch number (curbatch) represents
+ * the current batch of callbacks for which the quiescent cycle
+ * has started.
+ * rcu_nextlist - new callbacks are added here
+ * rcu_currlist - current batch for which quiescent cycle started if any
+ * rcu_donelist - one or more batches that have finished waiting to be invoked
+ */
+
+static long rcu_last_qsctr __per_cpu_data;
+static rcu_batch_t rcu_batch __per_cpu_data;
+static struct list_head  rcu_nextlist __per_cpu_data;
+struct list_head  rcu_currlist __per_cpu_data;
+struct list_head  rcu_donelist __per_cpu_data;
+struct tasklet_struct  rcu_tasklet __per_cpu_data;
+struct task_struct *rcu_task __per_cpu_data;
+struct semaphore rcu_sema __per_cpu_data;
+
+#define RCU_QSCTR_INVALID	0
+
+static spinlock_t	rcu_lock = SPIN_LOCK_UNLOCKED;
+static rcu_batch_t	rcu_curbatch = 1; /* Current batch number */
+static rcu_batch_t	rcu_maxbatch = 1; /* Max requested batch number*/
+
+
+/* CPUs that need to switch in order for current RCU batch to proceed */
+static unsigned long	rcu_cpumask = 0; 	
+/* CPUs that have active RCUs */
+static unsigned long	rcu_active_cpumask = 0;
+/* Global timer to nudge CPUs */
+static struct timer_list rcu_timer;	
+
+asmlinkage long sys_sched_get_priority_max(int policy);
+
+/*
+ * Register a new rcu callback. This will be invoked as soon
+ * as all CPUs have performed a context switch or been seen in the
+ * idle loop or in a user process.
+ */
+void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg)
+{
+	unsigned long flags;
+
+	head->func = func;
+	head->arg = arg;
+	local_irq_save(flags);
+	list_add_tail(&head->list, &this_cpu(rcu_nextlist));
+	local_irq_restore(flags);
+	tasklet_schedule(&this_cpu(rcu_tasklet));
+}
+
+/*
+ * Invoke the completed RCU callbacks. They are expected to be in
+ * a per-cpu list.
+ */
+static inline void rcu_invoke_callbacks(struct list_head *list)
+{
+	struct list_head *entry;
+	struct rcu_head *head;
+
+	while (!list_empty(list)) {
+		entry = list->next;
+		list_del(entry);
+		head = list_entry(entry, struct rcu_head, list);
+		head->func(head->arg);
+	}
+}
+
+/*
+ * Register a new batch of callbacks with the global state maintainers.
+ * Caller must hold the rcu global lock.
+ */
+static inline void rcu_reg_batch(rcu_batch_t newbatch)
+{
+	int cpu = cpu_number_map(smp_processor_id());
+	
+	if (RCU_BATCH_LT(rcu_maxbatch, newbatch)) {
+		rcu_maxbatch = newbatch;
+	}
+	if (RCU_BATCH_LT(rcu_maxbatch, rcu_curbatch) || (rcu_cpumask != 0)) {
+		return;
+	}
+	rcu_cpumask = cpu_online_map;
+	rcu_switch_preempt_cntr(cpu);
+}
+
+/*
+ * Check if the cpu has gone through a quiescent state.
+ * If so and if it already hasn't done so in this RCU
+ * quiescent cycle, then indicate that it has done so.
+ */
+static void rcu_check_quiescent_state(void)
+{
+	int cpu = cpu_number_map(smp_processor_id());
+
+	if (!test_bit(cpu, &rcu_cpumask)) {
+		return;
+	}
+
+	/* 
+	 * May race with rcu per-cpu tick - in the worst case
+	 * we may miss one quiescent state of that CPU. That is
+	 * tolerable. So no need to disable interrupts.
+	 */
+	if (this_cpu(rcu_last_qsctr) == RCU_QSCTR_INVALID) {
+		this_cpu(rcu_last_qsctr) = this_cpu(cpu_quiescent);
+		return;
+	}
+	if (this_cpu(cpu_quiescent) == this_cpu(rcu_last_qsctr) ||
+		rcu_cpu_preempted(cpu)) {
+		return;
+	}
+
+	spin_lock(&rcu_lock);
+	if (!test_bit(cpu, &rcu_cpumask)) {
+		spin_unlock(&rcu_lock);
+		return;
+	}
+	clear_bit(cpu, &rcu_cpumask);
+	this_cpu(rcu_last_qsctr) = RCU_QSCTR_INVALID;
+	if (rcu_cpumask != 0) {
+		/* All CPUs haven't gone through a quiescent state */
+		spin_unlock(&rcu_lock);
+		return;
+	}
+	rcu_curbatch++;
+	rcu_reg_batch(rcu_maxbatch);
+	spin_unlock(&rcu_lock);
+}
+
+static void rcu_move_next_batch(void)
+{
+	int rcu_timer_active;
+	int cpu = cpu_number_map(smp_processor_id());
+
+	local_irq_disable();
+	if (!list_empty(&this_cpu(rcu_nextlist)) && 
+			list_empty(&this_cpu(rcu_currlist))) {
+		list_splice(&this_cpu(rcu_nextlist), &this_cpu(rcu_currlist));
+		INIT_LIST_HEAD(&this_cpu(rcu_nextlist));
+		local_irq_enable();
+
+		/*
+		 * start the next batch of callbacks
+		 */
+		spin_lock(&rcu_lock);
+		rcu_timer_active = (rcu_active_cpumask != 0);
+		if (!test_bit(cpu, &rcu_active_cpumask))
+			set_bit(cpu, &rcu_active_cpumask);
+		if (!rcu_timer_active && 
+		    !timer_pending(&rcu_timer)) {
+			rcu_timer.expires = jiffies + 5;
+			add_timer(&rcu_timer);
+		}
+		this_cpu(rcu_batch) = rcu_curbatch + 1;
+		rcu_reg_batch(this_cpu(rcu_batch));
+		spin_unlock(&rcu_lock);
+	} else {
+		local_irq_enable();
+	}
+}
+
+/*
+ * Look into the per-cpu callback information to see if there is
+ * any processing necessary - if so do it.
+ */
+static void rcu_process_callbacks(unsigned long data)
+{
+	int cpu = cpu_number_map(smp_processor_id());
+
+	if (!list_empty(&this_cpu(rcu_currlist)) &&
+	    RCU_BATCH_GT(rcu_curbatch, this_cpu(rcu_batch))) {
+		list_splice(&this_cpu(rcu_currlist), &this_cpu(rcu_donelist));
+		INIT_LIST_HEAD(&this_cpu(rcu_currlist));
+	}
+
+	rcu_move_next_batch();
+
+	/* 
+	 * No race between nxtlist here & call_rcu() from irq -
+         * call_rcu() will anyway reschedule the tasklet so if the
+	 * bit in cpu_active_mask gets cleared, it will get set again.
+	 */
+	if (list_empty(&this_cpu(rcu_nextlist)) && 
+			list_empty(&this_cpu(rcu_currlist))) {
+		spin_lock(&rcu_lock);
+		clear_bit(cpu, &rcu_active_cpumask);
+		spin_unlock(&rcu_lock);
+	}
+	rcu_check_quiescent_state();
+
+	if (!list_empty(&this_cpu(rcu_donelist))) {
+		rcu_invoke_callbacks(&this_cpu(rcu_donelist));
+	}
+}
+
+/*
+ * Per-cpu tick that processes per-cpu queues
+ */
+static void rcu_percpu_tick_common(void)
+{
+	/* Take this opportunity to check for idle loop */
+	if (idle_cpu(smp_processor_id()))
+		this_cpu(cpu_quiescent)++;
+	rcu_process_callbacks(0);
+}
+
+
+#ifdef CONFIG_SMP
+static void rcu_percpu_tick(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
+		up(&per_cpu(rcu_sema, cpu));
+	}
+}
+#else
+static void rcu_percpu_tick(void)
+{
+	rcu_percpu_tick_common();
+}
+#endif
+
+/*
+ * RCU timer handler - called one in every 50ms only if there is
+ * any RCU pending in the system. It nudges every CPU.
+ */
+static void rcu_timer_handler(unsigned long data)
+{
+	rcu_percpu_tick();
+	spin_lock(&rcu_lock);
+	if (rcu_active_cpumask) {
+		rcu_timer.expires = jiffies + 5;
+		add_timer(&rcu_timer);
+	}
+	spin_unlock(&rcu_lock);
+
+}
+
+#ifdef CONFIG_SMP
+/*
+ * Per-CPU RCU daemon. It runs at an absurdly high priority so
+ * that it is not starved out by the scheduler thereby holding
+ * up RC updates.
+ */
+static int krcud(void * __bind_cpu)
+{
+	int bind_cpu = *(int *) __bind_cpu;
+	int cpu = cpu_logical_map(bind_cpu);
+
+	daemonize();
+        current->policy = SCHED_FIFO;
+        current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);
+
+	sigfillset(&current->blocked);
+
+	/* Migrate to the right CPU */
+	set_cpus_allowed(current, 1UL << cpu);
+
+	sprintf(current->comm, "krcud_CPU%d", bind_cpu);
+	sema_init(&this_cpu(rcu_sema), 0);
+
+	this_cpu(rcu_task) = current;
+
+	for (;;) {
+		while (down_interruptible(&this_cpu(rcu_sema)));
+		local_bh_disable();
+		rcu_percpu_tick_common();
+		local_bh_enable();
+	}
+}
+
+static void spawn_krcud(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
+		if (kernel_thread(krcud, (void *) &cpu,
+				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
+			printk(KERN_ERR "spawn_krcud() failed for cpu %d\n", 
+									cpu);
+		else {
+			while (!per_cpu(rcu_task, cpu)) {
+				schedule();
+			}
+		}
+	}
+}
+
+#endif
+
+/*
+ * Initializes rcu mechanism.  Assumed to be called early.
+ * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
+ * Note that cpu_quiescent and friends are implicitly
+ * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
+ */
+static __init int rcu_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		tasklet_init(&per_cpu(rcu_tasklet, i), 
+				rcu_process_callbacks, 0UL);
+		INIT_LIST_HEAD(&per_cpu(rcu_nextlist, i));
+		INIT_LIST_HEAD(&per_cpu(rcu_currlist, i));
+		INIT_LIST_HEAD(&per_cpu(rcu_donelist, i));
+	}
+	init_timer(&rcu_timer);
+	rcu_timer.function = rcu_timer_handler;
+#ifdef CONFIG_SMP
+	spawn_krcud();
+#endif
+	return 0;
+}
+
+/* Because of FASTCALL declaration of complete, we use this wrapper */
+static void wakeme_after_rcu(void *completion)
+{
+	complete(completion);
+}
+
+/*
+ * Wait until all the CPUs have gone through a "quiescent" state.
+ */
+void synchronize_kernel(void)
+{
+	struct rcu_head rcu;
+	DECLARE_COMPLETION(completion);
+
+	/* Will wake me after RCU finished */
+	call_rcu(&rcu, wakeme_after_rcu, &completion);
+
+	/* Wait for it */
+	wait_for_completion(&completion);
+}
+
+__initcall(rcu_init);
+
+EXPORT_SYMBOL(call_rcu);
+EXPORT_SYMBOL(synchronize_kernel);
diff -urN linux-2.5.8-base/kernel/sched.c linux-2.5.8-rcu/kernel/sched.c
--- linux-2.5.8-base/kernel/sched.c	Mon Apr 15 00:48:47 2002
+++ linux-2.5.8-rcu/kernel/sched.c	Fri Apr 19 18:26:09 2002
@@ -17,11 +17,13 @@
 #include <linux/init.h>
 #include <asm/uaccess.h>
 #include <linux/highmem.h>
+#include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <asm/mmu_context.h>
 #include <linux/interrupt.h>
 #include <linux/completion.h>
 #include <linux/kernel_stat.h>
+#include <linux/percpu.h>
 
 /*
  * Priority of a process goes from 0 to 139. The 0-99
@@ -150,6 +152,7 @@
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
+long cpu_quiescent __per_cpu_data;
 
 #define cpu_rq(cpu)		(runqueues + (cpu))
 #define this_rq()		cpu_rq(smp_processor_id())
@@ -157,6 +160,24 @@
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define rt_task(p)		((p)->prio < MAX_RT_PRIO)
 
+#ifdef CONFIG_PREEMPT
+atomic_t rcu_preempt_cntr[2] __per_cpu_data;
+atomic_t *curr_preempt_cntr __per_cpu_data;
+atomic_t *next_preempt_cntr __per_cpu_data;
+#endif
+
+#ifdef CONFIG_PREEMPT
+static inline void rcu_init_preempt_cntr(int cpu)
+{
+        atomic_set(&per_cpu(rcu_preempt_cntr[0], cpu), 0);
+        atomic_set(&per_cpu(rcu_preempt_cntr[1], cpu), 0);
+	per_cpu(curr_preempt_cntr, cpu) = 
+			&per_cpu(rcu_preempt_cntr[1], cpu);
+	per_cpu(next_preempt_cntr, cpu) = 
+			&per_cpu(rcu_preempt_cntr[0], cpu);
+}
+#endif
+
 static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
 {
 	struct runqueue *rq;
@@ -768,8 +789,11 @@
 	 * if entering from preempt_schedule, off a kernel preemption,
 	 * go straight to picking the next task.
 	 */
-	if (unlikely(preempt_get_count() & PREEMPT_ACTIVE))
+	if (unlikely(preempt_get_count() & PREEMPT_ACTIVE)) {
 		goto pick_next_task;
+	} else {
+		rcu_preempt_put();
+	}
 	
 	switch (prev->state) {
 	case TASK_INTERRUPTIBLE:
@@ -812,6 +836,7 @@
 switch_tasks:
 	prefetch(next);
 	clear_tsk_need_resched(prev);
+	per_cpu(cpu_quiescent, prev->thread_info->cpu)++;
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
@@ -852,6 +877,7 @@
 		return;
 
 	ti->preempt_count = PREEMPT_ACTIVE;
+	rcu_preempt_get();
 	schedule();
 	ti->preempt_count = 0;
 	barrier();
@@ -1575,6 +1601,7 @@
 		spin_lock_init(&rq->lock);
 		spin_lock_init(&rq->frozen);
 		INIT_LIST_HEAD(&rq->migration_queue);
+		rcu_init_preempt_cntr(i);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

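For readers trying to follow the preempt-counter logic in the patch above, here is
a minimal user-space sketch of the counter-pair flip: reader_enter()/reader_exit()
stand in for rcu_preempt_get()/rcu_preempt_put() (which the patch actually hooks
into preempt_schedule() and schedule()), start_grace_period() stands in for
rcu_switch_preempt_cntr(), and grace_period_complete() stands in for the
!rcu_cpu_preempted() test.  It collapses the per-CPU pairs down to a single pair
and every name in it is invented for illustration; it is not code from the patch.

	/*
	 * Minimal, single-"CPU" user-space model of the counter-pair flip;
	 * names are invented for this sketch.  Build with: cc -std=c11 flip.c
	 */
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int cntr[2];              /* models rcu_preempt_cntr[]  */
	static atomic_int *curr = &cntr[1];     /* grace period waits on this */
	static atomic_int *nxt  = &cntr[0];     /* new readers increment this */

	static atomic_int *reader_enter(void)   /* models rcu_preempt_get()   */
	{
		atomic_int *c = nxt;
		atomic_fetch_add(c, 1);
		return c;                       /* task remembers its counter */
	}

	static void reader_exit(atomic_int *c)  /* models rcu_preempt_put()   */
	{
		atomic_fetch_sub(c, 1);
	}

	static void start_grace_period(void)    /* models rcu_switch_preempt_cntr() */
	{
		atomic_int *tmp = curr;
		curr = nxt;                     /* old readers now tracked by curr   */
		nxt  = tmp;                     /* drained counter takes new readers */
	}

	static int grace_period_complete(void)  /* models !rcu_cpu_preempted() */
	{
		return atomic_load(curr) == 0;
	}

	int main(void)
	{
		atomic_int *c = reader_enter();
		start_grace_period();
		printf("complete? %d\n", grace_period_complete()); /* 0: reader still in */
		reader_exit(c);
		printf("complete? %d\n", grace_period_complete()); /* 1: safe to free    */
		return 0;
	}

The point is simply that once the pair is flipped, the counter a grace period
waits on can only be decremented, so it must eventually drain to zero.
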
[-- Attachment #3: rcu_poll_preempt-2.5.14-2.patch --]
[-- Type: text/plain, Size: 17841 bytes --]

diff -urN linux-2.5.14-base/include/linux/init_task.h linux-2.5.14-rcu_poll_preempt/include/linux/init_task.h
--- linux-2.5.14-base/include/linux/init_task.h	Mon May  6 09:07:54 2002
+++ linux-2.5.14-rcu_poll_preempt/include/linux/init_task.h	Tue May  7 17:23:51 2002
@@ -79,6 +79,7 @@
     blocked:		{{0}},						\
     alloc_lock:		SPIN_LOCK_UNLOCKED,				\
     journal_info:	NULL,						\
+    cpu_preempt_cntr:	NULL,					\
 }
 
 
diff -urN linux-2.5.14-base/include/linux/rcupdate.h linux-2.5.14-rcu_poll_preempt/include/linux/rcupdate.h
--- linux-2.5.14-base/include/linux/rcupdate.h	Thu Jan  1 05:30:00 1970
+++ linux-2.5.14-rcu_poll_preempt/include/linux/rcupdate.h	Tue May  7 17:23:51 2002
@@ -0,0 +1,71 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (c) International Business Machines Corp., 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>
+ * 
+ * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
+ * and inputs from Andrea Arcangeli, Rusty Russell, Andi Kleen etc.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * 		http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ */
+
+#ifndef __LINUX_RCUPDATE_H
+#define __LINUX_RCUPDATE_H
+
+#include <linux/list.h>
+
+/*
+ * Callback structure for use with call_rcu(). 
+ */
+struct rcu_head {
+	struct list_head list;
+	void (*func)(void *obj);
+	void *arg;
+};
+
+#define RCU_HEAD_INIT(head) { LIST_HEAD_INIT(head.list), NULL, NULL }
+#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT(head)
+#define INIT_RCU_HEAD(ptr) do { \
+	INIT_LIST_HEAD(&(ptr)->list); (ptr)->func = NULL; (ptr)->arg = NULL; \
+} while (0)
+
+
+extern void FASTCALL(call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg));
+
+#ifdef CONFIG_PREEMPT
+extern void FASTCALL(call_rcu_preempt(struct rcu_head *head, void (*func)(void *arg), void *arg));
+#else
+static inline void call_rcu_preempt(struct rcu_head *head, 
+                               void (*func)(void *arg), void *arg)
+{
+	call_rcu(head, func, arg);
+}
+#endif
+
+extern void synchronize_kernel(void);
+
+extern void rcu_init(void);
+
+#endif /* __LINUX_RCUPDATE_H */
diff -urN linux-2.5.14-base/include/linux/sched.h linux-2.5.14-rcu_poll_preempt/include/linux/sched.h
--- linux-2.5.14-base/include/linux/sched.h	Mon May  6 09:07:54 2002
+++ linux-2.5.14-rcu_poll_preempt/include/linux/sched.h	Tue May  7 17:23:51 2002
@@ -28,6 +28,7 @@
 #include <linux/securebits.h>
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
+#include <linux/percpu.h>
 
 struct exec_domain;
 
@@ -162,6 +163,7 @@
 extern void flush_scheduled_tasks(void);
 extern int start_context_thread(void);
 extern int current_is_keventd(void);
+extern void force_cpu_reschedule(int cpu);
 
 struct namespace;
 
@@ -347,6 +349,7 @@
 /* journalling filesystem info */
 	void *journal_info;
 	struct dentry *proc_dentry;
+	atomic_t *cpu_preempt_cntr;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
@@ -419,6 +422,7 @@
 
 extern struct   mm_struct init_mm;
 extern struct task_struct *init_tasks[NR_CPUS];
+extern long cpu_quiescent __per_cpu_data;
 
 /* PID hashing. (shouldnt this be dynamic?) */
 #define PIDHASH_SZ (4096 >> 2)
@@ -876,6 +880,53 @@
 		clear_thread_flag(TIF_SIGPENDING);
 }
 
+#ifdef CONFIG_PREEMPT
+
+extern atomic_t rcu_preempt_cntr[2] __per_cpu_data;
+extern atomic_t *curr_preempt_cntr __per_cpu_data;
+extern atomic_t *next_preempt_cntr __per_cpu_data;
+
+static inline void rcu_switch_preempt_cntr(int cpu)
+{
+	atomic_t *tmp;
+	tmp = per_cpu(curr_preempt_cntr, cpu);
+	per_cpu(curr_preempt_cntr, cpu) = per_cpu(next_preempt_cntr, cpu);
+	per_cpu(next_preempt_cntr, cpu) = tmp;
+
+}
+
+static inline void rcu_preempt_put(void)
+{
+	if (unlikely(current->cpu_preempt_cntr != NULL)) {
+		atomic_dec(current->cpu_preempt_cntr);
+		current->cpu_preempt_cntr = NULL;
+	}
+}
+
+/* Must not be preempted */
+static inline void rcu_preempt_get(void)
+{
+	if (likely(current->cpu_preempt_cntr == NULL)) {
+		current->cpu_preempt_cntr = 
+				this_cpu(next_preempt_cntr);
+		atomic_inc(current->cpu_preempt_cntr);
+	}
+}
+
+static inline int rcu_cpu_preempted(int cpu)
+{
+	return (atomic_read(per_cpu(curr_preempt_cntr, cpu)) != 0);
+}
+#else
+
+#define rcu_init_preempt_cntr(cpu) do { } while(0)
+#define rcu_switch_preempt_cntr(cpu) do { } while(0)
+#define rcu_preempt_put() do { } while(0)
+#define rcu_preempt_get() do { } while(0)
+#define rcu_cpu_preempted(cpu) (0)
+
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif
diff -urN linux-2.5.14-base/init/main.c linux-2.5.14-rcu_poll_preempt/init/main.c
--- linux-2.5.14-base/init/main.c	Mon May  6 09:07:56 2002
+++ linux-2.5.14-rcu_poll_preempt/init/main.c	Tue May  7 17:23:51 2002
@@ -28,6 +28,7 @@
 #include <linux/bootmem.h>
 #include <linux/tty.h>
 #include <linux/percpu.h>
+#include <linux/rcupdate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -346,6 +347,7 @@
 	printk("Kernel command line: %s\n", saved_command_line);
 	parse_options(command_line);
 	trap_init();
+	rcu_init();
 	init_IRQ();
 	sched_init();
 	softirq_init();
diff -urN linux-2.5.14-base/kernel/Makefile linux-2.5.14-rcu_poll_preempt/kernel/Makefile
--- linux-2.5.14-base/kernel/Makefile	Mon May  6 09:07:56 2002
+++ linux-2.5.14-rcu_poll_preempt/kernel/Makefile	Tue May  7 17:23:51 2002
@@ -10,12 +10,12 @@
 O_TARGET := kernel.o
 
 export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \
-		printk.o platform.o 
+		printk.o platform.o rcupdate.o
 
 obj-y     = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o futex.o platform.o
+	    signal.o sys.o kmod.o context.o futex.o platform.o rcupdate.o
 
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -urN linux-2.5.14-base/kernel/exit.c linux-2.5.14-rcu_poll_preempt/kernel/exit.c
--- linux-2.5.14-base/kernel/exit.c	Mon May  6 09:07:59 2002
+++ linux-2.5.14-rcu_poll_preempt/kernel/exit.c	Tue May  7 17:23:51 2002
@@ -550,6 +550,7 @@
 
 	tsk->exit_code = code;
 	exit_notify();
+	rcu_preempt_put();
 	schedule();
 	BUG();
 /*
diff -urN linux-2.5.14-base/kernel/fork.c linux-2.5.14-rcu_poll_preempt/kernel/fork.c
--- linux-2.5.14-base/kernel/fork.c	Mon May  6 09:07:54 2002
+++ linux-2.5.14-rcu_poll_preempt/kernel/fork.c	Tue May  7 17:23:51 2002
@@ -117,6 +117,7 @@
 	tsk->thread_info = ti;
 	ti->task = tsk;
 	atomic_set(&tsk->usage,1);
+	tsk->cpu_preempt_cntr = NULL;
 
 	return tsk;
 }
diff -urN linux-2.5.14-base/kernel/rcupdate.c linux-2.5.14-rcu_poll_preempt/kernel/rcupdate.c
--- linux-2.5.14-base/kernel/rcupdate.c	Thu Jan  1 05:30:00 1970
+++ linux-2.5.14-rcu_poll_preempt/kernel/rcupdate.c	Tue May  7 17:23:51 2002
@@ -0,0 +1,285 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (c) International Business Machines Corp., 2001
+ * Copyright (C) Andrea Arcangeli <andrea@suse.de> SuSE, 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>,
+ *	   Andrea Arcangeli <andrea@suse.de>
+ * 
+ * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
+ * and inputs from Andrea Arcangeli, Rusty Russell, Andi Kleen etc.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * 		http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+
+/* Definition for rcupdate internal data. */
+struct rcu_data {
+	spinlock_t lock;
+	struct list_head nxtlist;
+	struct list_head curlist;
+	struct tasklet_struct tasklet;
+	unsigned long qsmask;
+	int polling_in_progress;
+	long quiescent_checkpoint[NR_CPUS];
+};
+
+struct rcu_data rcu_data; 
+
+#ifdef CONFIG_PREEMPT
+struct rcu_data rcu_data_preempt;
+#endif
+
+#define RCU_quiescent(cpu) per_cpu(cpu_quiescent, cpu)
+
+/*
+ * Register a new rcu callback. This will be invoked as soon
+ * as all CPUs have performed a context switch or been seen in the
+ * idle loop or in a user process.  It can be called only from
+ * process or BH context, though it could be made to work from irq
+ * context too with minor code changes if necessary.
+ */
+void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg)
+{
+	head->func = func;
+	head->arg = arg;
+
+	spin_lock_bh(&rcu_data.lock);
+	list_add(&head->list, &rcu_data.nxtlist);
+	spin_unlock_bh(&rcu_data.lock);
+
+	tasklet_hi_schedule(&rcu_data.tasklet);
+}
+
+#ifdef CONFIG_PREEMPT
+/*
+ * Same as call_rcu() except that readers don't need to disable preemption
+ * during the RCU read-side critical section.  Preemption is
+ * handled transparently here.
+ */
+void call_rcu_preempt(struct rcu_head *head, void (*func)(void *arg), void *arg)
+{
+	head->func = func;
+	head->arg = arg;
+
+	spin_lock_bh(&rcu_data_preempt.lock);
+	list_add(&head->list, &rcu_data_preempt.nxtlist);
+	spin_unlock_bh(&rcu_data_preempt.lock);
+
+	tasklet_hi_schedule(&rcu_data_preempt.tasklet);
+}
+static inline void rcu_setup_grace_period(struct rcu_data *rdata, int cpu)
+{
+	rdata->qsmask |= 1UL << cpu;
+	rdata->quiescent_checkpoint[cpu] = RCU_quiescent(cpu);
+	if (rdata == &rcu_data_preempt)
+		rcu_switch_preempt_cntr(cpu);
+	force_cpu_reschedule(cpu);
+}
+static inline int rcu_grace_period_complete(struct rcu_data *rdata, int cpu)
+{
+	if (rdata == &rcu_data)  {
+		return ((rdata->qsmask & (1UL << cpu)) &&
+			(rdata->quiescent_checkpoint[cpu] != 
+						RCU_quiescent(cpu)));
+	} else {
+		return ((rdata->qsmask & (1UL << cpu)) &&
+			(rdata->quiescent_checkpoint[cpu] != 
+						RCU_quiescent(cpu)) &&
+              		!rcu_cpu_preempted(cpu));
+	}
+}
+#else
+static inline void rcu_setup_grace_period(struct rcu_data *rdata, int cpu)
+{
+	rdata->qsmask |= 1UL << cpu;
+	rdata->quiescent_checkpoint[cpu] = RCU_quiescent(cpu);
+	force_cpu_reschedule(cpu);
+}
+static inline int rcu_grace_period_complete(struct rcu_data *rdata, int cpu)
+{
+	return ((rdata->qsmask & (1UL << cpu)) &&
+		(rdata->quiescent_checkpoint[cpu] != RCU_quiescent(cpu)));
+}
+#endif
+
+
+static int rcu_prepare_polling(struct rcu_data *rdata)
+{
+	int stop;
+	int i;
+
+#ifdef DEBUG
+	if (!list_empty(&rdata->curlist))
+		BUG();
+#endif
+
+	stop = 1;
+	if (!list_empty(&rdata->nxtlist)) {
+		list_splice(&rdata->nxtlist, &rdata->curlist);
+		INIT_LIST_HEAD(&rdata->nxtlist);
+
+		rdata->polling_in_progress = 1;
+
+		for (i = 0; i < smp_num_cpus; i++) {
+			int cpu = cpu_logical_map(i);
+			rcu_setup_grace_period(rdata, cpu);
+		}
+		stop = 0;
+	}
+
+	return stop;
+}
+
+/*
+ * Invoke the completed RCU callbacks.
+ */
+static void rcu_invoke_callbacks(struct rcu_data *rdata)
+{
+	struct list_head *entry;
+	struct rcu_head *head;
+
+#ifdef DEBUG
+	if (list_empty(&rdata->curlist))
+		BUG();
+#endif
+
+	entry = rdata->curlist.prev;
+	do {
+		head = list_entry(entry, struct rcu_head, list);
+		entry = entry->prev;
+
+		head->func(head->arg);
+	} while (entry != &rdata->curlist);
+
+	INIT_LIST_HEAD(&rdata->curlist);
+}
+
+static int rcu_completion(struct rcu_data *rdata)
+{
+	int stop;
+
+	rdata->polling_in_progress = 0;
+	rcu_invoke_callbacks(rdata);
+
+	stop = rcu_prepare_polling(rdata);
+
+	return stop;
+}
+
+static int rcu_polling(struct rcu_data *rdata)
+{
+	int i;
+	int stop;
+
+	for (i = 0; i < smp_num_cpus; i++) {
+		int cpu = cpu_logical_map(i);
+
+		if (rcu_grace_period_complete(rdata, cpu))
+			rdata->qsmask &= ~(1UL << cpu);
+	}
+
+	stop = 0;
+	if (!rdata->qsmask)
+		stop = rcu_completion(rdata);
+
+	return stop;
+}
+
+/*
+ * Look at the pending callback state to see if there is
+ * any processing necessary; if so, do it.
+ */
+static void rcu_process_callbacks(unsigned long data)
+{
+	int stop;
+	struct rcu_data *rdata = (struct rcu_data *)data;
+
+	spin_lock(&rdata->lock);
+	if (!rdata->polling_in_progress)
+		stop = rcu_prepare_polling(rdata);
+	else
+		stop = rcu_polling(rdata);
+	spin_unlock(&rdata->lock);
+
+	if (!stop)
+		tasklet_hi_schedule(&rdata->tasklet);
+}
+
+/* Because of the FASTCALL declaration of complete(), we use this wrapper */
+static void wakeme_after_rcu(void *completion)
+{
+	complete(completion);
+}
+
+static void rcu_init_data(struct rcu_data *rdata)
+{
+	tasklet_init(&rdata->tasklet, rcu_process_callbacks, 
+			(unsigned long)rdata);
+	INIT_LIST_HEAD(&rdata->nxtlist);
+	INIT_LIST_HEAD(&rdata->curlist);
+	spin_lock_init(&rdata->lock);
+}
+
+/*
+ * Initializes the RCU mechanism.  Assumed to be called early,
+ * that is, before the local timer (SMP) or jiffy timer (UP) is set up.
+ */
+void __init rcu_init(void)
+{
+	rcu_init_data(&rcu_data);
+#ifdef CONFIG_PREEMPT
+	rcu_init_data(&rcu_data_preempt);
+#endif
+}
+
+/*
+ * Wait until all the CPUs have gone through a "quiescent" state.
+ */
+void synchronize_kernel(void)
+{
+	struct rcu_head rcu;
+	DECLARE_COMPLETION(completion);
+
+	/* Will wake me after RCU finished */
+	call_rcu_preempt(&rcu, wakeme_after_rcu, &completion);
+
+	/* Wait for it */
+	wait_for_completion(&completion);
+}
+
+EXPORT_SYMBOL(call_rcu);
+#ifdef CONFIG_PREEMPT
+EXPORT_SYMBOL(call_rcu_preempt);
+#endif
+EXPORT_SYMBOL(synchronize_kernel);
diff -urN linux-2.5.14-base/kernel/sched.c linux-2.5.14-rcu_poll_preempt/kernel/sched.c
--- linux-2.5.14-base/kernel/sched.c	Mon May  6 09:07:57 2002
+++ linux-2.5.14-rcu_poll_preempt/kernel/sched.c	Tue May  7 17:23:51 2002
@@ -22,6 +22,7 @@
 #include <linux/interrupt.h>
 #include <linux/completion.h>
 #include <linux/kernel_stat.h>
+#include <linux/percpu.h>
 
 /*
  * Priority of a process goes from 0 to 139. The 0-99
@@ -156,12 +157,32 @@
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
 
+long cpu_quiescent __per_cpu_data;
+
 #define cpu_rq(cpu)		(runqueues + (cpu))
 #define this_rq()		cpu_rq(smp_processor_id())
 #define task_rq(p)		cpu_rq((p)->thread_info->cpu)
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define rt_task(p)		((p)->prio < MAX_RT_PRIO)
 
+#ifdef CONFIG_PREEMPT
+atomic_t rcu_preempt_cntr[2] __per_cpu_data;
+atomic_t *curr_preempt_cntr __per_cpu_data;
+atomic_t *next_preempt_cntr __per_cpu_data;
+#endif
+
+#ifdef CONFIG_PREEMPT
+static inline void rcu_init_preempt_cntr(int cpu)
+{
+        atomic_set(&per_cpu(rcu_preempt_cntr[0], cpu), 0);
+        atomic_set(&per_cpu(rcu_preempt_cntr[1], cpu), 0);
+	per_cpu(curr_preempt_cntr, cpu) = 
+			&per_cpu(rcu_preempt_cntr[1], cpu);
+	per_cpu(next_preempt_cntr, cpu) = 
+			&per_cpu(rcu_preempt_cntr[0], cpu);
+}
+#endif
+
 static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
 {
 	struct runqueue *rq;
@@ -773,8 +794,11 @@
 	 * if entering from preempt_schedule, off a kernel preemption,
 	 * go straight to picking the next task.
 	 */
-	if (unlikely(preempt_get_count() & PREEMPT_ACTIVE))
+	if (unlikely(preempt_get_count() & PREEMPT_ACTIVE)) {
 		goto pick_next_task;
+	} else {
+		rcu_preempt_put();
+	}
 	
 	switch (prev->state) {
 	case TASK_INTERRUPTIBLE:
@@ -817,6 +841,7 @@
 switch_tasks:
 	prefetch(next);
 	clear_tsk_need_resched(prev);
+	per_cpu(cpu_quiescent, prev->thread_info->cpu)++;
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
@@ -857,6 +882,7 @@
 		return;
 
 	ti->preempt_count = PREEMPT_ACTIVE;
+	rcu_preempt_get();
 	schedule();
 	ti->preempt_count = 0;
 	barrier();
@@ -1031,6 +1057,21 @@
 	task_rq_unlock(rq, &flags);
 }
 
+void force_cpu_reschedule(int cpu)
+{
+	unsigned long flags;
+	struct runqueue *rq, *newrq;
+	struct task_struct *p;
+
+	rq = cpu_rq(cpu);
+	p = rq->curr;
+	newrq = task_rq_lock(p, &flags);
+	if (newrq == rq)
+		resched_task(p);
+	task_rq_unlock(newrq, &flags);
+}
+
+
 #ifndef __alpha__
 
 /*
@@ -1592,6 +1633,7 @@
 		spin_lock_init(&rq->lock);
 		spin_lock_init(&rq->frozen);
 		INIT_LIST_HEAD(&rq->migration_queue);
+		rcu_init_preempt_cntr(i);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

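To make the update-side usage of this second patch concrete, here is a
hypothetical caller written in the style of 2.5-era code.  struct foo, foo_lock,
global_foo, and the functions around them are invented for illustration; only
call_rcu_preempt() and struct rcu_head come from the patch.

	/*
	 * Hypothetical user of call_rcu_preempt(): everything named foo_*
	 * here is illustrative only and not part of the patch.
	 */
	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>
	#include <linux/errno.h>
	#include <asm/system.h>

	struct foo {
		struct rcu_head rcu;
		int data;
	};

	static struct foo *global_foo;
	static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;

	static void foo_free_rcu(void *arg)
	{
		kfree(arg);			/* runs only after the grace period */
	}

	int foo_update(int new_data)
	{
		struct foo *new, *old;

		new = kmalloc(sizeof(*new), GFP_KERNEL);
		if (!new)
			return -ENOMEM;
		new->data = new_data;

		spin_lock(&foo_lock);
		old = global_foo;
		wmb();				/* init new's fields before publishing it */
		global_foo = new;
		spin_unlock(&foo_lock);

		if (old)			/* free old copy once readers are done */
			call_rcu_preempt(&old->rcu, foo_free_rcu, old);
		return 0;
	}

	int foo_read(void)
	{
		/* Readers may be preempted here, but must not block voluntarily. */
		struct foo *p = global_foo;

		return p ? p->data : -1;
	}

Note that there is no explicit read-side primitive: a reader is protected as long
as it does not voluntarily block, and the preempt counters extend that protection
across involuntary preemption.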

Thread overview: 12+ messages
2005-05-10  1:24 [RFC] RCU and CONFIG_PREEMPT_RT progress Paul E. McKenney
2005-05-10 10:55 ` Esben Nielsen
2005-05-10 14:45   ` Paul E. McKenney
2005-05-11 18:14     ` Esben Nielsen
2005-05-12  1:43       ` Paul E. McKenney
2005-05-10 20:08 ` Peter Zijlstra
2005-05-10 20:18   ` Valdis.Kletnieks
2005-05-10 20:29   ` Paul E. McKenney
2005-05-10 20:56     ` Peter Zijlstra
2005-05-10 22:36       ` Paul E. McKenney
2005-05-12  8:24         ` Peter Zijlstra
2005-05-12 14:30           ` Paul E. McKenney
