* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask
@ 2005-12-08 19:31 Oleg Nesterov
2005-12-09 2:46 ` Srivatsa Vaddagiri
2005-12-09 2:56 ` Srivatsa Vaddagiri
0 siblings, 2 replies; 33+ messages in thread
From: Oleg Nesterov @ 2005-12-08 19:31 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton
Srivatsa Vaddagiri wrote:
>
> Accessing nohz_cpu_mask before incrementing rcp->cur is racy. It can
> cause tickless idle CPUs to be included in rsp->cpumask, which will
> extend grace periods unnecessarily.
> ...
> @@ -244,15 +244,15 @@ static void rcu_start_batch(struct rcu_c
>
> if (rcp->next_pending &&
> rcp->completed == rcp->cur) {
> - /* Can't change, since spin lock held. */
> - cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
> -
> rcp->next_pending = 0;
> /* next_pending == 0 must be visible in __rcu_process_callbacks()
> * before it can see new value of cur.
> */
> smp_wmb();
> rcp->cur++;
> + smp_mb();
> + cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
> +
Could you explain why this patch makes any difference?
grep shows that the only users of nohz_cpu_mask are in
arch/s390/kernel/time.c: start_hz_timer() and stop_hz_timer().
I can't see how this change can prevent idle cpus from being included in
->cpumask; a cpu can add itself to nohz_cpu_mask right after some other
cpu has started a new grace period.
I think a cpu should call cpu_quiet() after adding itself to the nohz mask
to eliminate this race.
No?
Oleg.
^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask
From: Srivatsa Vaddagiri @ 2005-12-09  2:46 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton

On Thu, Dec 08, 2005 at 10:31:06PM +0300, Oleg Nesterov wrote:
> I can't see how this change can prevent idle cpus from being included in
> ->cpumask; a cpu can add itself to nohz_cpu_mask right after some other
> cpu has started a new grace period.

Yes, that can happen, but if they check for rcu_pending right after that,
it will at least prevent them from going tickless (which will prevent
grace periods from being unnecessarily extended). Something like below:

	CPU0				CPU1

	rcp->cur++; /* New GP */

	smp_mb();

	rsp->cpumask = 0x3
					cpu_set(1, nohz_cpu_mask);

					rcu_pending()? - Yes,
					cpu_clear(1, nohz_cpu_mask);
					Abort going tickless

Ideally we would have needed a smp_mb() in CPU1 also, between setting CPU1
in nohz_cpu_mask and checking for rcu_pending(), but I guess it is not
needed on s390 because of its strong ordering?

--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask
From: Oleg Nesterov @ 2005-12-09 19:17 UTC (permalink / raw)
To: vatsa; +Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton

Srivatsa Vaddagiri wrote:
>
> On Thu, Dec 08, 2005 at 10:31:06PM +0300, Oleg Nesterov wrote:
> > I can't see how this change can prevent idle cpus from being included in
> > ->cpumask; a cpu can add itself to nohz_cpu_mask right after some other
> > cpu has started a new grace period.
>
> Yes, that can happen, but if they check for rcu_pending right after that,
> it will at least prevent them from going tickless (which will prevent
> grace periods from being unnecessarily extended). Something like below:
>
> 	CPU0				CPU1
>
> 	rcp->cur++; /* New GP */
>
> 	smp_mb();

I think I need some education on memory barriers.

Does this mb() guarantee that the new value of ->cur will be visible
on other cpus immediately after smp_mb() (so that rcu_pending() will
notice it)?

My understanding is that it only guarantees that all stores before it
must be visible before any store after the mb. (Yes, mb implies rmb, but
I think it does not matter if CPU1 adds itself to the nohz mask after
CPU0 reads nohz_cpu_mask.) This means that CPU1 can read a stale
value of ->cur. I guess I am wrong, but I can't prove it to myself.

Could you please clarify this?

Even simpler question:

	CPU0
	var = 1;
	wmb();

after that CPU1 does rmb().

Does it guarantee that CPU1 will see the new value of var?

> Ideally we would have needed a smp_mb() in CPU1 also, between setting CPU1
> in nohz_cpu_mask and checking for rcu_pending(), but I guess it is not
> needed on s390 because of its strong ordering?

I don't know, but please note that s390's definition of
smp_mb__after_atomic_inc() is not a nop.

Oleg.
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask
From: Srivatsa Vaddagiri @ 2005-12-10 15:19 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton

On Fri, Dec 09, 2005 at 10:17:38PM +0300, Oleg Nesterov wrote:
> I think I need some education on memory barriers.
>
> Does this mb() guarantee that the new value of ->cur will be visible
> on other cpus immediately after smp_mb() (so that rcu_pending() will
> notice it)?

AFAIK the new value of ->cur should be visible to other CPUs when smp_mb()
returns. Here's a definition of smp_mb() from Paul's article:

  smp_mb(): "memory barrier" that orders both loads and stores. This means
  loads and stores preceding the memory barrier are committed to memory
  before any loads and stores following the memory barrier.

[ http://www.linuxjournal.com/article/8211 ]

> Even simpler question:
>
> 	CPU0
> 	var = 1;
> 	wmb();
>
> after that CPU1 does rmb().
>
> Does it guarantee that CPU1 will see the new value of var?

Again, going by the above article, I would expect CPU1 to see the new
value of var.
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask
From: Oleg Nesterov @ 2005-12-10 18:55 UTC (permalink / raw)
To: vatsa; +Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton

Srivatsa Vaddagiri wrote:
>
> AFAIK the new value of ->cur should be visible to other CPUs when smp_mb()
> returns. Here's a definition of smp_mb() from Paul's article:
>
>   smp_mb(): "memory barrier" that orders both loads and stores. This means
>   loads and stores preceding the memory barrier are committed to memory
>   before any loads and stores following the memory barrier.

Thanks, but this definition talks about ordering; it does not say
anything about the time when stores are really committed to memory
(and does it also mean that cache lines are invalidated on other
cpus?).

> [ http://www.linuxjournal.com/article/8211 ]

And thanks for this link, now I have some understanding of
read_barrier_depends() ...

Oleg.
* Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Srivatsa Vaddagiri @ 2005-12-11 17:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton,
	Ingo Molnar, Rusty Russell

[Changed the subject line to be more generic in the interest of a wider
audience.]

We seem to be having some confusion over the exact semantics of smp_mb().

Specifically, are all stores preceding smp_mb() guaranteed to have finished
(committed to memory / corresponding cache lines on other CPUs invalidated)
*before* successive loads are issued?

From the IA-32 manual (download.intel.com/design/Pentium4/manuals/25366817.pdf,
page 271):

  "Memory mapped devices and other I/O devices on the bus are often
  sensitive to the order of writes to their I/O buffers. I/O instructions
  (the IN and OUT instructions) can be used to impose strong write ordering
  on such accesses as follows. Prior to executing an I/O instruction, the
  processor *waits for all previous instructions in the program to complete
  and for all buffered writes to drain to memory*. Only instruction fetches
  and page table walks can pass I/O instructions. Execution of subsequent
  instructions does not begin until the processor determines that the I/O
  instruction has been completed.

  Synchronization mechanisms in multiple-processor systems may depend upon
  a strong memory-ordering model. Here, a program can use a locking
  instruction such as the XCHG instruction or the LOCK prefix to insure
  that a read-modify-write operation on memory is carried out atomically.
  Locking operations typically operate like I/O operations in that they
  *wait for all previous instructions to complete and for all buffered
  writes to drain to memory* (see Section 7.1.2, "Bus Locking").

  Program synchronization can also be carried out with serializing
  instructions (see Section 7.4). These instructions are typically used at
  critical procedure or task boundaries to force completion of all previous
  instructions before a jump to a new section of code or a context switch
  occurs. Like the I/O and locking instructions, the processor *waits until
  all previous instructions have been completed and all buffered writes
  have been drained to memory* before executing the serializing
  instruction."

From this, it looks like we can interpret that an IA-32 processor will wait
for all writes to drain to memory (which implies cache invalidation on
other CPUs?) *before* executing the synchronizing instruction.

The question is: can this be generalized to other CPUs too?

On Sat, Dec 10, 2005 at 09:55:35PM +0300, Oleg Nesterov wrote:
> Thanks, but this definition talks about ordering; it does not say
> anything about the time when stores are really committed to memory
> (and does it also mean that cache lines are invalidated on other
> cpus?).
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Andrew James Wade @ 2005-12-11 21:21 UTC (permalink / raw)
To: vatsa
Cc: Oleg Nesterov, linux-kernel, Dipankar Sarma, Paul E. McKenney,
	Andrew Morton, Ingo Molnar, Rusty Russell

On Sunday 11 December 2005 12:41, Srivatsa Vaddagiri wrote:
> We seem to be having some confusion over the exact semantics of smp_mb().
>
> Specifically, are all stores preceding smp_mb() guaranteed to have finished
> (committed to memory / corresponding cache lines on other CPUs invalidated)
> *before* successive loads are issued?

I doubt it. That's definitely not true of smp_wmb(), which boils down to
__asm__ __volatile__ ("": : :"memory") on SMP i386 (which constrains
how the compiler orders write instructions, but is otherwise a nop; i386
has in-order writes).

And it makes sense that wmb() wouldn't wait for writes: RCU needs
constraints on the order in which writes become visible, but has very weak
constraints on when they do. Waiting for writes to flush would hurt
performance.

Andrew Wade
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Rusty Russell @ 2005-12-11 23:45 UTC (permalink / raw)
To: ajwade
Cc: vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma,
	Paul E. McKenney, Andrew Morton, Ingo Molnar

On Sun, 2005-12-11 at 16:21 -0500, Andrew James Wade wrote:
> I doubt it. That's definitely not true of smp_wmb(), ...
>
> And it makes sense that wmb() wouldn't wait for writes: RCU needs
> constraints on the order in which writes become visible, but has very weak
> constraints on when they do. Waiting for writes to flush would hurt
> performance.

On the contrary. I did some digging and asking and thinking about this
for the Unreliable Guide to Kernel Locking, years ago:

wmb() means all writes preceding it will complete before any writes
following it are started.
rmb() means all reads preceding it will complete before any reads
following it are started.
mb() means all reads and writes preceding it will complete before any
reads and writes following it are started.

This does not map to all the possibilities, nor does it take advantage of
all architectures, but it's generally sufficient without getting insanely
complex.

Hope that clarifies,
Rusty.
--
ccontrol: http://ozlabs.org/~rusty/ccontrol
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Keith Owens @ 2005-12-12 0:49 UTC (permalink / raw)
To: Rusty Russell
Cc: ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma,
	Paul E. McKenney, Andrew Morton, Ingo Molnar

On Mon, 12 Dec 2005 10:45:16 +1100, Rusty Russell <rusty@rustcorp.com.au> wrote:
>On the contrary. I did some digging and asking and thinking about this
>for the Unreliable Guide to Kernel Locking, years ago:
>
>wmb() means all writes preceding it will complete before any writes
>following it are started.
>rmb() means all reads preceding it will complete before any reads
>following it are started.
>mb() means all reads and writes preceding it will complete before any
>reads and writes following it are started.

FWIW, wmb() on IA64 does not require that preceding stores are flushed
to main memory. It only requires that they be "made visible to other
processors in the coherence domain". "Visible" means that the updated
value must reach (at least) an externally snooped cache. There is no
requirement that the preceding stores be flushed all the way to main
memory; the updates only have to get as far as a cache level that other
cpus can see. The cache snooping takes care of flushing to main memory
when necessary.

IA64 does have a memory fence that stalls the cpu until the data is
"accepted by the external platform". That form of fence is expensive and
is only used for memory mapped I/O, where the data really does have to
reach the memory before the cpu can perform its next operation. For
example, in the mmiowb() case.
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Srivatsa Vaddagiri @ 2005-12-12 8:41 UTC (permalink / raw)
To: Keith Owens
Cc: Rusty Russell, ajwade, Oleg Nesterov, linux-kernel, Dipankar Sarma,
	Paul E. McKenney, Andrew Morton, Ingo Molnar

On Mon, Dec 12, 2005 at 11:49:07AM +1100, Keith Owens wrote:
> FWIW, wmb() on IA64 does not require that preceding stores are flushed
> to main memory. It only requires that they be "made visible to other
> processors in the coherence domain". "Visible" means that the updated
> value must reach (at least) an externally snooped cache. There is no
> requirement that the preceding stores be flushed all the way to main
> memory; the updates only have to get as far as a cache level that other
> cpus can see. The cache snooping takes care of flushing to main memory
> when necessary.

For the context of the problem that we are dealing with, I think the fact
that writes are made "visible" to other CPUs (before smp_mb() finishes and
before other reads are started) is good enough.

Oleg, with all these inputs, I consider the patch I had sent to be correct.
Let me know if you still have some lingering doubts!

P.S: Thanks to everybody who responded clarifying this subject.
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Oleg Nesterov @ 2005-12-12 19:33 UTC (permalink / raw)
To: vatsa
Cc: Keith Owens, Rusty Russell, ajwade, linux-kernel, Dipankar Sarma,
	Paul E. McKenney, Andrew Morton, Ingo Molnar

Srivatsa Vaddagiri wrote:
>
> Oleg, with all these inputs, I consider the patch I had sent to be correct.

Yes, I think so. My suspicions were due to my misunderstanding.

I was wrong when I said that we can ignore the rmb() part of mb() in the
case when start_hz_timer() runs after rcu_start_batch().

	rcp->cur++;
			// ->cur will be visible to other cpus
			// _before_ we *READ* nohz_cpu_mask.
			// We don't have any 'timing' problems.
			// In other words: if another cpu does not
			// see the new value - we did not read this
			// mask yet.
	smp_mb();

Thanks for your patience.

> P.S: Thanks to everybody who responded clarifying this subject.

Yes! It was really helpful, thanks to all. I think it would be great
to have Paul's very clear (and short!) explanation somewhere in
Documentation/.

Oleg.
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Paul E. McKenney @ 2005-12-13 5:20 UTC (permalink / raw)
To: Oleg Nesterov
Cc: vatsa, Keith Owens, Rusty Russell, ajwade, linux-kernel,
	Dipankar Sarma, Andrew Morton, Ingo Molnar

On Mon, Dec 12, 2005 at 10:33:22PM +0300, Oleg Nesterov wrote:
> Srivatsa Vaddagiri wrote:
> > P.S: Thanks to everybody who responded clarifying this subject.
>
> Yes! It was really helpful, thanks to all. I think it would be great
> to have Paul's very clear (and short!) explanation somewhere in
> Documentation/.

Glad you liked it! I need to do an RCU documentation update soon anyway,
so will include memory barriers in that update.

						Thanx, Paul
* Re: Semantics of smp_mb() [was: Re: [PATCH] Fix RCU race in access of nohz_cpu_mask]
From: Andrew James Wade @ 2005-12-13 5:07 UTC (permalink / raw)
To: Rusty Russell
Cc: vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma,
	Paul E. McKenney, Andrew Morton, Ingo Molnar

On Sunday 11 December 2005 18:45, Rusty Russell wrote:
> On Sun, 2005-12-11 at 16:21 -0500, Andrew James Wade wrote:
> > I doubt it. That's definitely not true of smp_wmb(), which boils down to
> > __asm__ __volatile__ ("": : :"memory") on SMP i386 (which constrains
> > how the compiler orders write instructions, but is otherwise a nop; i386
> > has in-order writes).

Hrrm, after doing a bit of reading on cache coherency, verifying that the
corresponding cache lines on other CPUs are invalidated can (sometimes)
be done quite quickly, so waiting for that before issuing reads might not
destroy performance. I still doubt that i386s do things this way, but I
don't think it matters (see below).

> > And it makes sense that wmb() wouldn't wait for writes: ...
>
> On the contrary. I did some digging and asking and thinking about this
> for the Unreliable Guide to Kernel Locking, years ago:
>
> wmb() means all writes preceding it will complete before any writes
> following it are started.

What does it mean for a write to start? For that matter, what does it mean
for a write to complete?

I think my focusing on the hardware details (of which I am woefully
ignorant) was a mistake: the hardware can really do whatever it wants so
long as it implements the expected abstract machine model, and it is
ordering that matters in that model. So it makes sense that ordering, not
time, is what the definitions of the memory barriers focus on.

I think that Oleg's question:
> Does this mb() guarantee that the new value of ->cur will be visible
> on other cpus immediately after smp_mb() (so that rcu_pending() will
> notice it)?
is really just about the timeliness with which writes before a smp_mb()
become visible to other CPUs. Does Linux run on architectures in which
writes are not propagated in a timely manner (that is: other CPUs can read
stale values for a "considerable" time)? I rather doubt it.

But with a proviso to my answer: the compiler will happily hoist reads out
of loops. So writes won't necessarily get read in a timely manner.

Andrew
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 5:07 ` Andrew James Wade @ 2005-12-13 5:43 ` Paul E. McKenney 2005-12-13 11:20 ` Andi Kleen 1 sibling, 0 replies; 33+ messages in thread From: Paul E. McKenney @ 2005-12-13 5:43 UTC (permalink / raw) To: ajwade Cc: Rusty Russell, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Tue, Dec 13, 2005 at 12:07:47AM -0500, Andrew James Wade wrote: > On Sunday 11 December 2005 18:45, Rusty Russell wrote: > > On Sun, 2005-12-11 at 16:21 -0500, Andrew James Wade wrote: > > > On Sunday 11 December 2005 12:41, Srivatsa Vaddagiri wrote: > > > > [Changed the subject line to be more generic in the interest of wider audience] > > > > > > > > We seem to be having some confusion over the exact semantics of smp_mb(). > > > > > > > > Specifically, are all stores preceding smp_mb() guaranteed to have finished > > > > (committed to memory/corresponding cache-lines on other CPUs invalidated) > > > > *before* successive loads are issued? > > > > > > I doubt it. That's definitely not true of smp_wmb(), which boils down to > > > __asm__ __volatile__ ("": : :"memory") on SMP i386 (which the constrains > > > how the compiler orders write instructions, but is otherwise a nop. i386 > > > has in-order writes.). > > Hrrm, after doing a bit of reading on cache-coherency, verifying that the > corresponding cache-lines on other CPUs are invalidated can (sometimes) > be done quite quickly, so waiting for that before issuing reads might not > destroy performance. I still doubt that i386es do thing this way, but I > don't think it matters (see below). Hardware designers need only be concerned with how things -appear- to the software. They can and do use optimizations that get the effect of waiting without actually waiting. For example, an x86 CPU need not -really- keep writes ordered as long as the software cannot tell that they have become misordered. 
This may sound strange, but one important example would be a CPU doing writes that are to variables that it knows do not appear in any other CPU's cache. In this sort of case, the CPU could reorder the writes until such time as it detected a request from another CPU requsting a cache line containing one of the variables. > > > And it makes sense that wmb() wouldn't wait for writes: RCU needs > > > constraints on the order in which writes become visible, but has very week > > > constraints on when they do. Waiting for writes to flush would hurt > > > performance. > > > > On the contrary. I did some digging and asking and thinking about this > > for the Unreliable Guide to Kernel Locking, years ago: > > > > wmb() means all writes preceeding will complete before any writes > > following are started. > > What does it mean for a write to start? For that matter, what does it mean > for a write to complete? Potentially many things in both cases: o Fetching the store instruction that will do the write. o Determining that all instructions whose execution logically precedes the write are free of fatal errors (e.g., page faults, exceptions, traps, ...). o Determining the exact value that is to be written (which might have been computed by prior instructions). o Determining the exact address that is to be written to (which also might have been computed by prior instructions). o Determining that the store instruction itself is free of fatal errors. Note that it might be necessary to compute the address to be sure. o Determining that no hardware interrupt will precede the completion of the store instruction. o Storing the value to one of the CPU's write buffers (which is not yet in the coherence region represented by the CPU's cache). o Starting the process of fetching the cache line into which the value is to be stored. o Starting the process of invalidating the corresponding cache line from all the other CPUs' caches. 
o Receiving the cache line into which the value is to be stored. o Storing the value to the CPU's cache, which involves combining the value to be written with the rest of the cache line. o Responding to the first request for the corresponding cache line from some other CPU. o Completing the process of invalidating the corresponding cache line from all the other CPUs' caches. Note that Alpha's weak memory-barrier semantics allow this step to complete long after the value is stored into the writing CPU's cache. o Storing the value to physical DRAM. Most of these steps can execute in parallel or out of order. And a CPU designer would gleefully point out -lots- of steps I have omitted, some for the sake of brevity, others out of ignorance. You asked!!! > I think my focusing on the hardware details (of which I am woefully > ignorant) was a mistake: the hardware can really do whatever it wants so > long as it implements the expected abstract machine model, and it is > ordering that matters in that model. So it makes sense that ordering, not > time, is what the definitions of the memory barriers focus on. Yes, maintaining one's sanity requires taking a very simplified view of memory ordering, especially since no two CPUs order memory references in exactly the same way. > I think that Oleg's question: > > Does this mb() garantees that the new value of ->cur will be visible > > on other cpus immediately after smp_mb() (so that rcu_pending() will > > notice it) ? > is really just about the timeliness with which writes before a smp_mb() > become visible to other CPUs. Does Linux run on architectures in which > writes are not propagated in a timely manner (that is: other CPUs can read > stale values for a "considerable" time)? I rather doubt it. Remember that smp_mb() is officially about controlling the order in which values are communicated between the CPU executing the smp_mb() and the interconnect, -not- between a pair of CPUs. 
In the general case (which includes Alpha), controlling the order in
which values are communicated from one CPU to another requires that
-both- CPUs execute memory barriers.

> But with a proviso to my answer: the compiler will happily hoist reads out
> of loops. So writes won't necessarily get read in a timely manner.

Very good! And this is in fact why smp_wmb() and friends also contain
directives instructing the compiler not to mess with ordering.

						Thanx, Paul
^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 5:07 ` Andrew James Wade 2005-12-13 5:43 ` Paul E. McKenney @ 2005-12-13 11:20 ` Andi Kleen 2005-12-13 16:20 ` Paul E. McKenney 1 sibling, 1 reply; 33+ messages in thread From: Andi Kleen @ 2005-12-13 11:20 UTC (permalink / raw) To: ajwade Cc: vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton, Ingo Molnar

Andrew James Wade <andrew.j.wade@gmail.com> writes:
>
> What does it mean for a write to start? For that matter, what does it mean
> for a write to complete?

wmb() or smp_wmb() really makes no guarantees when a write completes
(e.g. when another CPU or a device sees the new value). It just
guarantees ordering. This means when you do e.g.

	*p = 1;
	wmb();
	*p = 2;
	wmb();

then another CPU will never see 2 before 1 (but possibly it might miss
the "1" state completely). When it actually sees the values (or if it
keeps using whatever value was in *p before the first assignment) is
undefined.

The definition quoted above is wrong I think. Rusty can you perhaps
fix it up?

> I think my focusing on the hardware details (of which I am woefully
> ignorant) was a mistake: the hardware can really do whatever it wants so
> long as it implements the expected abstract machine model, and it is
> ordering that matters in that model. So it makes sense that ordering, not
> time, is what the definitions of the memory barriers focus on.

Exactly.

> > Does this mb() guarantee that the new value of ->cur will be visible
> > on other cpus immediately after smp_mb() (so that rcu_pending() will
> > notice it) ?
> is really just about the timeliness with which writes before a smp_mb()
> become visible to other CPUs. Does Linux run on architectures in which
> writes are not propagated in a timely manner (that is: other CPUs can read
> stale values for a "considerable" time)? I rather doubt it.
>
> But with a proviso to my answer: the compiler will happily hoist reads out
> of loops. So writes won't necessarily get read in a timely manner.

Or just consider interrupts. Any CPU can stop for a very long time at
any time as seen by the other CPUs. Or it might take a machine check
and come back. You can never make any assumptions about when another
CPU sees a write unless you add explicit synchronization like a
spinlock.

-Andi
^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 11:20 ` Andi Kleen @ 2005-12-13 16:20 ` Paul E. McKenney 2005-12-13 22:27 ` Keith Owens 0 siblings, 1 reply; 33+ messages in thread From: Paul E. McKenney @ 2005-12-13 16:20 UTC (permalink / raw) To: Andi Kleen Cc: ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar

On Tue, Dec 13, 2005 at 04:20:13AM -0700, Andi Kleen wrote:
> Andrew James Wade <andrew.j.wade@gmail.com> writes:
> >
> > What does it mean for a write to start? For that matter, what does it mean
> > for a write to complete?
>
> wmb() or smp_wmb() really makes no guarantees when a write
> completes (e.g. when another CPU or a device sees the new value).
> It just guarantees ordering. This means when you do e.g.
>
> 	*p = 1;
> 	wmb();
> 	*p = 2;
> 	wmb();
>
> then another CPU will never see 2 before 1 (but possibly it might
> miss the "1" state completely). When it actually sees the values
> (or if it keeps using whatever value was in *p before the first
> assignment) is undefined.

It is OK on the write side, but the corresponding barriers are required
on the read side (even on non-Alpha CPUs, since the value of the
pointer remains constant):

	Writer				Reader
	------				------
	p = 1;				p1 = p;
	smp_wmb();			smp_rmb();
	p = 2;				p2 = p;

With these memory barriers in place, if p2 is set to 1, then p1 must
also be set to 1. Similarly, if p1 is set to 2, then p2 must also be
set to 2. If the smp_rmb() is not present on the read side, then the
reading CPU (and compiler) are free to interchange the order of the
two assignment statements.

As Andi says, memory barriers do not necessarily make reads or writes
happen faster; instead they only enforce limited ordering constraints.

If the variable p references MMIO rather than normal memory, then
wmb() and rmb() are needed instead of smp_wmb() and smp_rmb(). This
is because the I/O device needs to see the accesses ordered even in a
UP system.
						Thanx, Paul

> The definition quoted above is wrong I think. Rusty can you
> perhaps fix it up?
>
> > I think my focusing on the hardware details (of which I am woefully
> > ignorant) was a mistake: the hardware can really do whatever it wants so
> > long as it implements the expected abstract machine model, and it is
> > ordering that matters in that model. So it makes sense that ordering, not
> > time, is what the definitions of the memory barriers focus on.
>
> Exactly.
>
> > > Does this mb() guarantee that the new value of ->cur will be visible
> > > on other cpus immediately after smp_mb() (so that rcu_pending() will
> > > notice it) ?
> > is really just about the timeliness with which writes before a smp_mb()
> > become visible to other CPUs. Does Linux run on architectures in which
> > writes are not propagated in a timely manner (that is: other CPUs can read
> > stale values for a "considerable" time)? I rather doubt it.
>
> > But with a proviso to my answer: the compiler will happily hoist reads out
> > of loops. So writes won't necessarily get read in a timely manner.
>
> Or just consider interrupts. Any CPU can stop for a very long time
> at any time as seen by the other CPUs. Or it might take a machine check
> and come back. You can never make any assumptions about when another
> CPU sees a write unless you add explicit synchronization like a spinlock.
>
> -Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 16:20 ` Paul E. McKenney @ 2005-12-13 22:27 ` Keith Owens 2005-12-13 22:50 ` Paul E. McKenney ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Keith Owens @ 2005-12-13 22:27 UTC (permalink / raw) To: paulmck Cc: Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Tue, 13 Dec 2005 08:20:27 -0800, "Paul E. McKenney" <paulmck@us.ibm.com> wrote: >If the variable p references MMIO rather than normal memory, then >wmb() and rmb() are needed instead of smp_wmb() and smp_rmb(). mmiowb(), not wmb(). IA64 has a different form of memory fence for I/O space compared to normal memory. MIPS also has a non-empty form of mmiowb(). >This is because the I/O device needs to see the accesses ordered >even in a UP system. Why does mmiowb() map to empty on most systems, including Alpha? Should it not map to wmb() for everything except IA64 and MIPS? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 22:27 ` Keith Owens @ 2005-12-13 22:50 ` Paul E. McKenney 2005-12-14 1:12 ` Andi Kleen 2005-12-15 21:15 ` Semantics of smp_mb() Roland Dreier 2005-12-16 7:46 ` Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] Jeremy Higdon 2 siblings, 1 reply; 33+ messages in thread From: Paul E. McKenney @ 2005-12-13 22:50 UTC (permalink / raw) To: Keith Owens Cc: Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Wed, Dec 14, 2005 at 09:27:41AM +1100, Keith Owens wrote: > On Tue, 13 Dec 2005 08:20:27 -0800, > "Paul E. McKenney" <paulmck@us.ibm.com> wrote: > >If the variable p references MMIO rather than normal memory, then > >wmb() and rmb() are needed instead of smp_wmb() and smp_rmb(). > > mmiowb(), not wmb(). IA64 has a different form of memory fence for I/O > space compared to normal memory. MIPS also has a non-empty form of > mmiowb(). New one on me! > >This is because the I/O device needs to see the accesses ordered > >even in a UP system. > > Why does mmiowb() map to empty on most systems, including Alpha? > Should it not map to wmb() for everything except IA64 and MIPS? Sounds like I am not the only one that it is new to... There are over four hundred instances of wmb() still in the drivers tree in 2.6.14. I suspect that most of them are for MMIO fencing -- the ones I looked at quickly certainly seem to be. But, given mmiowb(), what is the point of having wmb()? Why are write memory barriers needed in UP kernels if not for MMIO and other hardware-specific accesses? Is your thought that use of wmb() should be phased out in favor of mmiowb()? (Might be a good idea, but doesn't look like we are there yet.) Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 22:50 ` Paul E. McKenney @ 2005-12-14 1:12 ` Andi Kleen 2005-12-14 1:46 ` Paul E. McKenney 0 siblings, 1 reply; 33+ messages in thread From: Andi Kleen @ 2005-12-14 1:12 UTC (permalink / raw) To: Paul E. McKenney Cc: Keith Owens, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Tue, Dec 13, 2005 at 02:50:59PM -0800, Paul E. McKenney wrote: > On Wed, Dec 14, 2005 at 09:27:41AM +1100, Keith Owens wrote: > > On Tue, 13 Dec 2005 08:20:27 -0800, > > "Paul E. McKenney" <paulmck@us.ibm.com> wrote: > > >If the variable p references MMIO rather than normal memory, then > > >wmb() and rmb() are needed instead of smp_wmb() and smp_rmb(). > > > > mmiowb(), not wmb(). IA64 has a different form of memory fence for I/O > > space compared to normal memory. MIPS also has a non-empty form of > > mmiowb(). > > New one on me! Didn't it make only a difference on the Altix or something like that? I suppose they added it only on the drivers for devices supported by SGI. -Andi ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-14 1:12 ` Andi Kleen @ 2005-12-14 1:46 ` Paul E. McKenney 0 siblings, 0 replies; 33+ messages in thread From: Paul E. McKenney @ 2005-12-14 1:46 UTC (permalink / raw) To: Andi Kleen Cc: Keith Owens, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Wed, Dec 14, 2005 at 02:12:53AM +0100, Andi Kleen wrote: > On Tue, Dec 13, 2005 at 02:50:59PM -0800, Paul E. McKenney wrote: > > On Wed, Dec 14, 2005 at 09:27:41AM +1100, Keith Owens wrote: > > > On Tue, 13 Dec 2005 08:20:27 -0800, > > > "Paul E. McKenney" <paulmck@us.ibm.com> wrote: > > > >If the variable p references MMIO rather than normal memory, then > > > >wmb() and rmb() are needed instead of smp_wmb() and smp_rmb(). > > > > > > mmiowb(), not wmb(). IA64 has a different form of memory fence for I/O > > > space compared to normal memory. MIPS also has a non-empty form of > > > mmiowb(). > > > > New one on me! > > Didn't it make only a difference on the Altix or something like that? > I suppose they added it only on the drivers for devices supported by SGI. It could potentially help on a few other CPUs, but quite a few driver changes would be needed to really bring out the full benefits. I am concerned that the current state leaves a number of CPU families broken -- the empty definitions cannot be good for other weakly ordered CPUs! Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() 2005-12-13 22:27 ` Keith Owens 2005-12-13 22:50 ` Paul E. McKenney @ 2005-12-15 21:15 ` Roland Dreier 2005-12-16 7:46 ` Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] Jeremy Higdon 2 siblings, 0 replies; 33+ messages in thread From: Roland Dreier @ 2005-12-15 21:15 UTC (permalink / raw) To: Keith Owens Cc: paulmck, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar

    Keith> Why does mmiowb() map to empty on most systems, including
    Keith> Alpha? Should it not map to wmb() for everything except
    Keith> IA64 and MIPS?

I think the intention (as spelled out in
Documentation/DocBook/deviceiobook.tmpl) is that mmiowb() must be used
in conjunction with spinlocks or some other SMP synchronization
mechanism. The locks themselves are sufficient ordering on the archs
where mmiowb() is a NOP.

One way of thinking about this is that the usual barrier operations
like wmb() affect the order of a single CPU's operations -- that is,
wmb() is saying that all writes from the current thread of execution
before the wmb() become visible before any of the writes from the
current thread after the wmb(). But wmb() doesn't say anything about
how one CPU's writes are ordered against anything a different CPU does.

mmiowb() is something else -- it controls the visibility of writes
from different CPUs. It says that all writes before the mmiowb()
become visible before any writes from any CPU after the mmiowb().
However, this isn't sensible unless we can order the writes between
CPUs, and that only makes sense if there's a lock that lets us say
that one CPU is executing something after the mmiowb().

 - R.
^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-13 22:27 ` Keith Owens 2005-12-13 22:50 ` Paul E. McKenney 2005-12-15 21:15 ` Semantics of smp_mb() Roland Dreier @ 2005-12-16 7:46 ` Jeremy Higdon 2006-03-13 18:39 ` Paul E. McKenney 2 siblings, 1 reply; 33+ messages in thread From: Jeremy Higdon @ 2005-12-16 7:46 UTC (permalink / raw) To: Keith Owens Cc: paulmck, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar Roland Dreier got this right. The purpose of the mmiowb is to ensure that writes to I/O devices while holding a spinlock are ordered with respect to writes issued after the original processor releases and a second processor acquires said spinlock. An MMIO read would be sufficient, but is much heavier weight. On the SGI MIPS-based systems, the "sync" instruction was used. On the Altix systems, a register on the hub chip is read. From comments by jejb, we're looking at modifying the mmiowb API by adding an argument which would be a register to read from if the architecture in question needs ordering in this way but does not have a lighter weight mechanism like the Altix mmiowb. Since there will now need to be a width indication, mmiowb will be replaced with mmiowb[bwlq]. jeremy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2005-12-16 7:46 ` Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] Jeremy Higdon @ 2006-03-13 18:39 ` Paul E. McKenney 2006-03-31 4:56 ` Jeremy Higdon 0 siblings, 1 reply; 33+ messages in thread From: Paul E. McKenney @ 2006-03-13 18:39 UTC (permalink / raw) To: Jeremy Higdon Cc: Keith Owens, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Thu, Dec 15, 2005 at 11:46:26PM -0800, Jeremy Higdon wrote: > Roland Dreier got this right. The purpose of the mmiowb is > to ensure that writes to I/O devices while holding a spinlock > are ordered with respect to writes issued after the original > processor releases and a second processor acquires said > spinlock. > > A MMIO read would be sufficient, but is much heavier weight. > > On the SGI MIPS-based systems, the "sync" instruction was used. > On the Altix systems, a register on the hub chip is read. > > >From comments by jejb, we're looking at modifying the mmiowb > API by adding an argument which would be a register to read > from if the architecture in question needs ordering in this > way but does not have a lighter weight mechanism like the Altix > mmiowb. Since there will now need to be a width indication, > mmiowb will be replaced with mmiowb[bwlq]. Any progress on this front? I figured that I would wait to update the ordering document until after this change happened, but if it is going to be awhile, I should proceed with the current API. Thoughts? Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2006-03-13 18:39 ` Paul E. McKenney @ 2006-03-31 4:56 ` Jeremy Higdon 2006-03-31 6:18 ` Paul E. McKenney 2006-03-31 23:38 ` Jesse Barnes 0 siblings, 2 replies; 33+ messages in thread From: Jeremy Higdon @ 2006-03-31 4:56 UTC (permalink / raw) To: Paul E. McKenney, bcasavan Cc: Keith Owens, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Mon, Mar 13, 2006 at 10:39:32AM -0800, Paul E. McKenney wrote: > On Thu, Dec 15, 2005 at 11:46:26PM -0800, Jeremy Higdon wrote: > > Roland Dreier got this right. The purpose of the mmiowb is > > to ensure that writes to I/O devices while holding a spinlock > > are ordered with respect to writes issued after the original > > processor releases and a second processor acquires said > > spinlock. > > > > An MMIO read would be sufficient, but is much heavier weight. > > > > On the SGI MIPS-based systems, the "sync" instruction was used. > > On the Altix systems, a register on the hub chip is read. > > > > From comments by jejb, we're looking at modifying the mmiowb > > API by adding an argument which would be a register to read > > from if the architecture in question needs ordering in this > > way but does not have a lighter weight mechanism like the Altix > > mmiowb. Since there will now need to be a width indication, > > mmiowb will be replaced with mmiowb[bwlq]. > > Any progress on this front? I figured that I would wait to update > the ordering document until after this change happened, but if it > is going to be awhile, I should proceed with the current API. > > Thoughts? > > Thanx, Paul Brent Casavant was going to be working on this. I'll CC him so that he can indicate the status. jeremy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2006-03-31 4:56 ` Jeremy Higdon @ 2006-03-31 6:18 ` Paul E. McKenney 0 siblings, 0 replies; 33+ messages in thread From: Paul E. McKenney @ 2006-03-31 6:18 UTC (permalink / raw) To: Jeremy Higdon Cc: bcasavan, Keith Owens, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar On Thu, Mar 30, 2006 at 08:56:27PM -0800, Jeremy Higdon wrote: > On Mon, Mar 13, 2006 at 10:39:32AM -0800, Paul E. McKenney wrote: > > On Thu, Dec 15, 2005 at 11:46:26PM -0800, Jeremy Higdon wrote: > > > Roland Dreier got this right. The purpose of the mmiowb is > > > to ensure that writes to I/O devices while holding a spinlock > > > are ordered with respect to writes issued after the original > > > processor releases and a second processor acquires said > > > spinlock. > > > > > > An MMIO read would be sufficient, but is much heavier weight. > > > > > > On the SGI MIPS-based systems, the "sync" instruction was used. > > > On the Altix systems, a register on the hub chip is read. > > > > > > From comments by jejb, we're looking at modifying the mmiowb > > > API by adding an argument which would be a register to read > > > from if the architecture in question needs ordering in this > > > way but does not have a lighter weight mechanism like the Altix > > > mmiowb. Since there will now need to be a width indication, > > > mmiowb will be replaced with mmiowb[bwlq]. > > > > Any progress on this front? I figured that I would wait to update > > the ordering document until after this change happened, but if it > > is going to be awhile, I should proceed with the current API. > > > > Thoughts? > > Brent Casavant was going to be working on this. I'll CC him so that > he can indicate the status. David Howells is documenting memory barriers in the Documentation directory as well. 
Thanx, Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] 2006-03-31 4:56 ` Jeremy Higdon 2006-03-31 6:18 ` Paul E. McKenney @ 2006-03-31 23:38 ` Jesse Barnes 1 sibling, 0 replies; 33+ messages in thread From: Jesse Barnes @ 2006-03-31 23:38 UTC (permalink / raw) To: Jeremy Higdon Cc: Paul E. McKenney, bcasavan, Keith Owens, Andi Kleen, ajwade, vatsa, Oleg Nesterov, linux-kernel, Dipankar Sarma, Andrew Morton, Ingo Molnar > > > From comments by jejb, we're looking at modifying the mmiowb > > > API by adding an argument which would be a register to read > > > from if the architecture in question needs ordering in this > > > way but does not have a lighter weight mechanism like the Altix > > > mmiowb. Since there will now need to be a width indication, > > > mmiowb will be replaced with mmiowb[bwlq]. > > > > Any progress on this front? I figured that I would wait to update > > the ordering document until after this change happened, but if it > > is going to be awhile, I should proceed with the current API. I avoided doing this initially on the premise that I shouldn't do things 'just because we might need it in the future' since that way seems to lead to madness. Is there actual hardware that needs an argument to mmiowb() (i.e. that can't just do a read from the system's single bridge or something)? Jesse ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask 2005-12-10 18:55 ` Oleg Nesterov 2005-12-11 17:41 ` Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] Srivatsa Vaddagiri @ 2005-12-12 3:10 ` Paul E. McKenney 2005-12-12 4:32 ` Andrew Morton 1 sibling, 1 reply; 33+ messages in thread From: Paul E. McKenney @ 2005-12-12 3:10 UTC (permalink / raw) To: Oleg Nesterov; +Cc: vatsa, linux-kernel, Dipankar Sarma, Andrew Morton

On Sat, Dec 10, 2005 at 09:55:35PM +0300, Oleg Nesterov wrote:
> Srivatsa Vaddagiri wrote:
> >
> > On Fri, Dec 09, 2005 at 10:17:38PM +0300, Oleg Nesterov wrote:
> > > > 	rcp->cur++; /* New GP */
> > > >
> > > > 	smp_mb();
> > >
> > > I think I need some education on memory barriers.
> > >
> > > Does this mb() guarantee that the new value of ->cur will be visible
> > > on other cpus immediately after smp_mb() (so that rcu_pending() will
> > > notice it) ?
> >
> > AFAIK the new value of ->cur should be visible to other CPUs when smp_mb()
> > returns. Here's a definition of smp_mb() from Paul's article:
> >
> > smp_mb(): "memory barrier" that orders both loads and stores. This means loads
> > and stores preceding the memory barrier are committed to memory before any
> > loads and stores following the memory barrier.
>
> Thanks, but this definition talks about ordering, it does not say
> anything about the time when stores are really committed to memory
> (and does it mean also that cache-lines are invalidated on other
> cpus ?).

Think of a pair of CPUs (creatively named "CPU 0" and "CPU 1") with a
cache-coherent interconnect between them. Then:

1. wmb() guarantees that any writes preceding the wmb() will
   be seen by the interconnect before any writes following the
   wmb(). But this applies -only- to the writes executed by
   the CPU doing the wmb().

2. rmb() guarantees that any changes seen by the interconnect
   preceding the rmb() will be seen by any reads following the
   rmb().
   Again, this applies only to reads executed by the CPU doing
   the rmb(). However, the changes might be due to any CPU.

3. mb() combines the guarantees made by rmb() and wmb().

All CPUs but Alpha make stronger guarantees. On those CPUs, you can
replace "interconnect" in #1 above with "all CPUs", and you can replace
"seen by the interconnect" in #2 above with "caused by any CPU". Rumor
has it that more recent Alpha CPUs also have stronger memory barriers,
but I have thus far been unable to confirm this.

On with the rest of the definitions:

4. smp_wmb(), smp_rmb(), and smp_mb() do nothing in UP kernels, but do
   the corresponding memory barrier in SMP kernels.

5. read_barrier_depends() is rmb() on Alpha, and a nop on other CPUs.

Non-Alpha CPUs guarantee that if CPU 0 does the following, where p->a
is initially zero and global_pointer->a is initially 1:

	p->a = 2;
	smp_wmb();
	global_pointer = p;

and if CPU 1 does:

	x = global_pointer->a;

then the value of x will be either 1 or 2, never zero. On Alpha,
strange though it may seem (and it did seem strange to me when I first
encountered it!), the value of x could well be zero. To get the same
guarantee on Alpha as on the other CPUs, the assignment to x must
instead be as follows:

	tmp = global_pointer;
	read_barrier_depends();
	x = tmp->a;

or, equivalently:

	x = rcu_dereference(global_pointer)->a;

There is an smp_read_barrier_depends() that takes effect only in an
SMP kernel, similar to smp_wmb() and friends.

This example may seem quite strange, but the need for the barrier
follows directly from definition #1, which makes its ordering guarantee
only to the -interconnect-, -not- to the other CPUs.

> > [ http://www.linuxjournal.com/article/8211 ]

Hmm... That one does look familiar. ;-)

> And thanks for this link, now I have some understanding about
> read_barrier_depends() ...

Let's just say it required much patience and perseverance on the part
of the Alpha architects to explain it to me the first time around.
;-)

						Thanx, Paul
^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask 2005-12-12 3:10 ` [PATCH] Fix RCU race in access of nohz_cpu_mask Paul E. McKenney @ 2005-12-12 4:32 ` Andrew Morton 2005-12-12 4:38 ` David S. Miller ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Andrew Morton @ 2005-12-12 4:32 UTC (permalink / raw) To: paulmck; +Cc: oleg, vatsa, linux-kernel, dipankar

"Paul E. McKenney" <paulmck@us.ibm.com> wrote:
>
> 1. wmb() guarantees that any writes preceding the wmb() will
>    be seen by the interconnect before any writes following the
>    wmb(). But this applies -only- to the writes executed by
>    the CPU doing the wmb().
>
> 2. rmb() guarantees that any changes seen by the interconnect
>    preceding the rmb() will be seen by any reads following the
>    rmb(). Again, this applies only to reads executed by the
>    CPU doing the rmb(). However, the changes might be due to
>    any CPU.
>
> 3. mb() combines the guarantees made by rmb() and wmb().

So foo_mb() in preemptible code is potentially buggy.

I guess we assume that a context switch accidentally did enough of the
right types of barriers for things to work OK.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask 2005-12-12 4:32 ` Andrew Morton @ 2005-12-12 4:38 ` David S. Miller 2005-12-12 4:47 ` Nick Piggin 2005-12-12 4:49 ` Paul Mackerras 2005-12-12 6:27 ` Keith Owens 2 siblings, 1 reply; 33+ messages in thread From: David S. Miller @ 2005-12-12 4:38 UTC (permalink / raw) To: akpm; +Cc: paulmck, oleg, vatsa, linux-kernel, dipankar From: Andrew Morton <akpm@osdl.org> Date: Sun, 11 Dec 2005 20:32:26 -0800 > So foo_mb() in preemptible code is potentially buggy. > > I guess we assume that a context switch accidentally did enough of the > right types of barriers for things to work OK. A trap ought to flush all memory operations. There are some incredibly difficult memory error handling cases if the cpu does not synchronize all pending memory operations when a trap occurs. Failing that, yes, to be absolutely safe we'd need to have some explicit memory barrier in the context switch. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask 2005-12-12 4:38 ` David S. Miller @ 2005-12-12 4:47 ` Nick Piggin 0 siblings, 0 replies; 33+ messages in thread From: Nick Piggin @ 2005-12-12 4:47 UTC (permalink / raw) To: David S. Miller; +Cc: akpm, paulmck, oleg, vatsa, linux-kernel, dipankar David S. Miller wrote: > From: Andrew Morton <akpm@osdl.org> > Date: Sun, 11 Dec 2005 20:32:26 -0800 > > >>So foo_mb() in preemptible code is potentially buggy. >> >>I guess we assume that a context switch accidentally did enough of the >>right types of barriers for things to work OK. > > > A trap ought to flush all memory operations. > > There are some incredibly difficult memory error handling cases if the > cpu does not synchronize all pending memory operations when a trap > occurs. > > Failing that, yes, to be absolutely safe we'd need to have some > explicit memory barrier in the context switch. But it isn't that mbs in preemptible code are buggy. If they are scheduled off then back on the same CPU, the barrier will still be executed in the expected instruction sequence for that process. I think the minimum needed is for cpu *migrations* to be memory barriers. Again, this isn't any particular problem of mb() instructions - if cpu migrations weren't memory barriers, then preemptible code isn't even guaranteed ordering with its own memory operations, which would be quite interesting :) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask 2005-12-12 4:32 ` Andrew Morton 2005-12-12 4:38 ` David S. Miller @ 2005-12-12 4:49 ` Paul Mackerras 2005-12-12 6:27 ` Keith Owens 2 siblings, 0 replies; 33+ messages in thread From: Paul Mackerras @ 2005-12-12 4:49 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, oleg, vatsa, linux-kernel, dipankar Andrew Morton writes: > So foo_mb() in preemptible code is potentially buggy. > > I guess we assume that a context switch accidentally did enough of the > right types of barriers for things to work OK. The context switch code on powerpc includes an mb() for exactly this reason. Paul. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask 2005-12-12 4:32 ` Andrew Morton 2005-12-12 4:38 ` David S. Miller 2005-12-12 4:49 ` Paul Mackerras @ 2005-12-12 6:27 ` Keith Owens 2 siblings, 0 replies; 33+ messages in thread From: Keith Owens @ 2005-12-12 6:27 UTC (permalink / raw) To: Andrew Morton; +Cc: paulmck, oleg, vatsa, linux-kernel, dipankar

On Sun, 11 Dec 2005 20:32:26 -0800,
Andrew Morton <akpm@osdl.org> wrote:
>"Paul E. McKenney" <paulmck@us.ibm.com> wrote:
>>
>> 1. wmb() guarantees that any writes preceding the wmb() will
>>    be seen by the interconnect before any writes following the
>>    wmb(). But this applies -only- to the writes executed by
>>    the CPU doing the wmb().
>>
>> 2. rmb() guarantees that any changes seen by the interconnect
>>    preceding the rmb() will be seen by any reads following the
>>    rmb(). Again, this applies only to reads executed by the
>>    CPU doing the rmb(). However, the changes might be due to
>>    any CPU.
>>
>> 3. mb() combines the guarantees made by rmb() and wmb().
>
>So foo_mb() in preemptible code is potentially buggy.
>
>I guess we assume that a context switch accidentally did enough of the
>right types of barriers for things to work OK.

Not by accident. Any context switch must flush the memory state from
the old cpu's internal buffers, and that flush must get at least as
far as the globally snoopable cache. Otherwise the old cpu could still
own partial memory updates from the process, even though the process
was now running on a new cpu.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH] Fix RCU race in access of nohz_cpu_mask
  2005-12-08 19:31 [PATCH] Fix RCU race in access of nohz_cpu_mask Oleg Nesterov
  2005-12-09  2:46 ` Srivatsa Vaddagiri
@ 2005-12-09  2:56 ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 33+ messages in thread
From: Srivatsa Vaddagiri @ 2005-12-09 2:56 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, Dipankar Sarma, Paul E. McKenney, Andrew Morton
On Thu, Dec 08, 2005 at 10:31:06PM +0300, Oleg Nesterov wrote:
> I think cpu should call cpu_quiet() after adding itself to nohz mask
> to eliminate this race.

That would require rsp->lock to be taken on every idle CPU that wishes
to go tickless. IMO that may not be a good idea.

-- 
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply	[flat|nested] 33+ messages in thread
end of thread, other threads:[~2006-03-31 23:38 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2005-12-08 19:31 [PATCH] Fix RCU race in access of nohz_cpu_mask Oleg Nesterov
2005-12-09  2:46 ` Srivatsa Vaddagiri
2005-12-09 19:17   ` Oleg Nesterov
2005-12-10 15:19     ` Srivatsa Vaddagiri
2005-12-10 18:55       ` Oleg Nesterov
2005-12-11 17:41         ` Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] Srivatsa Vaddagiri
2005-12-11 21:21           ` Andrew James Wade
2005-12-11 23:45           ` Rusty Russell
2005-12-12  0:49             ` Keith Owens
2005-12-12  8:41               ` Srivatsa Vaddagiri
2005-12-12 19:33                 ` Oleg Nesterov
2005-12-13  5:20                   ` Paul E. McKenney
2005-12-13  5:07               ` Andrew James Wade
2005-12-13  5:43                 ` Paul E. McKenney
2005-12-13 11:20                   ` Andi Kleen
2005-12-13 16:20                     ` Paul E. McKenney
2005-12-13 22:27                       ` Keith Owens
2005-12-13 22:50                         ` Paul E. McKenney
2005-12-14  1:12                           ` Andi Kleen
2005-12-14  1:46                             ` Paul E. McKenney
2005-12-15 21:15                               ` Semantics of smp_mb() Roland Dreier
2005-12-16  7:46           ` Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ] Jeremy Higdon
2006-03-13 18:39             ` Paul E. McKenney
2006-03-31  4:56               ` Jeremy Higdon
2006-03-31  6:18                 ` Paul E. McKenney
2006-03-31 23:38                   ` Jesse Barnes
2005-12-12  3:10         ` [PATCH] Fix RCU race in access of nohz_cpu_mask Paul E. McKenney
2005-12-12  4:32           ` Andrew Morton
2005-12-12  4:38             ` David S. Miller
2005-12-12  4:47               ` Nick Piggin
2005-12-12  4:49             ` Paul Mackerras
2005-12-12  6:27             ` Keith Owens
2005-12-09  2:56 ` Srivatsa Vaddagiri