* on memory barriers and cachelines
From: Peter Zijlstra @ 2012-02-01  9:33 UTC
To: Linus Torvalds, Paul E. McKenney
Cc: benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

Hi all,

So I was talking to Paul yesterday and he mentioned how the SRCU sync
primitive has to use extra synchronize_sched() calls in order to avoid
smp_rmb() calls in the srcu_read_{un,}lock() calls.

Now memory barriers are usually explained as observable order between
two (or more) unrelated variables, as Documentation/memory-barriers.txt
does in great detail.

What I couldn't find in there though, is what happens when both
variables are on the same cacheline. The "The effects of the CPU cache"
and "Cache coherency" sections are closest but leave me wanting on this
point.

Can we get some implicit behaviour from being on the same cacheline? Or
can this memory access queue still totally wreck the game?
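For concreteness, the question can be sketched as follows; the struct,
names, and barrier placement are illustrative assumptions, not code
from SRCU or any other kernel source:

	/* Two fields deliberately packed into one cacheline. */
	struct shared {
		int a;
		int b;		/* same cacheline as 'a' by construction */
	} ____cacheline_aligned;

	static struct shared s;

	static void writer(void)
	{
		s.a = 1;
		smp_wmb();	/* order the two stores */
		s.b = 1;
	}

	static int reader(void)
	{
		int a, b;

		b = s.b;
		smp_rmb();	/* still needed when a and b share a cacheline? */
		a = s.a;
		/* If b == 1 was observed, is a == 1 then guaranteed? */
		return a;
	}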
* Re: on memory barriers and cachelines
From: Paul E. McKenney @ 2012-02-01 14:22 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

On Wed, Feb 01, 2012 at 10:33:58AM +0100, Peter Zijlstra wrote:
> Hi all,
>
> So I was talking to Paul yesterday and he mentioned how the SRCU sync
> primitive has to use extra synchronize_sched() calls in order to avoid
> smp_rmb() calls in the srcu_read_{un,}lock() calls.
>
> Now memory barriers are usually explained as observable order between
> two (or more) unrelated variables, as Documentation/memory-barriers.txt
> does in great detail.
>
> What I couldn't find in there though, is what happens when both
> variables are on the same cacheline. The "The effects of the CPU cache"
> and "Cache coherency" sections are closest but leave me wanting on this
> point.
>
> Can we get some implicit behaviour from being on the same cacheline? Or
> can this memory access queue still totally wreck the game?

I don't know of any guarantees in this area, but am checking with
hardware architects for a couple of architectures.

							Thanx, Paul
* Re: on memory barriers and cachelines
From: Jamie Lokier @ 2012-02-10  2:51 UTC
To: Paul E. McKenney
Cc: Peter Zijlstra, Linus Torvalds, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

Paul E. McKenney wrote:
> On Wed, Feb 01, 2012 at 10:33:58AM +0100, Peter Zijlstra wrote:
> > Hi all,
> >
> > So I was talking to Paul yesterday and he mentioned how the SRCU sync
> > primitive has to use extra synchronize_sched() calls in order to avoid
> > smp_rmb() calls in the srcu_read_{un,}lock() calls.
> >
> > Now memory barriers are usually explained as observable order between
> > two (or more) unrelated variables, as Documentation/memory-barriers.txt
> > does in great detail.
> >
> > What I couldn't find in there though, is what happens when both
> > variables are on the same cacheline. The "The effects of the CPU cache"
> > and "Cache coherency" sections are closest but leave me wanting on this
> > point.
> >
> > Can we get some implicit behaviour from being on the same cacheline? Or
> > can this memory access queue still totally wreck the game?
>
> I don't know of any guarantees in this area, but am checking with
> hardware architects for a couple of architectures.

On a related note:

- What's to stop the compiler optimising away a data dependency,
  converting it to a speculative control dependency?  Here's a
  contrived example:

  ORIGINAL:

	int func(int *p)
	{
		int index = p[0], first = p[1];
		read_barrier_depends();	/* do..while(0) on most archs */
		return max(first, p[index]);
	}

  OPTIMISED:

	int func(int *p)
	{
		int index = p[0], val = p[1];
		if (index != 1)
			val = max(val, p[index]);
		return val;
	}

  A quick search of the GCC manual for "speculation" and
  "speculative" comes up with quite a few hits.  I've no idea if
  they are relevant.

- If I understood correctly, IA64 has explicit special registers to
  assist data-memory speculation by the compiler.  These would be
  the ALAT registers.  I don't know if they are used in a way that
  affects RCU, but they do appear in the GCC machine description,
  and in the manual some kinds of "data speculative scheduling" are
  enabled by default.  But read_barrier_depends() is a do {} while
  on IA64.

- The GCC manual mentions data speculation in conjunction with
  Blackfin as well.  I have no idea if it's relevant, but Blackfin
  does at least define read_barrier_depends() in an interesting way,
  sometimes.

- I read that ARM can do speculative memory loads these days.  It
  complicates DMA.  But are they implemented by speculative
  preloading into the cache, or by speculatively executing load
  instructions whose results are predicated on a control path
  taken?  If the latter, is an empty read_barrier_depends() still
  ok on ARM?

Thanks,
-- Jamie
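For reference, the read_barrier_depends() definitions being alluded to
look roughly like this; the two forms below are paraphrased from kernel
headers of that era, not copied verbatim from any one tree:

	/* Most architectures: the hardware orders dependent loads,
	 * so the barrier compiles to nothing. */
	#define read_barrier_depends()	do { } while (0)

	/* Alpha: dependent loads are not ordered, so a real memory
	 * barrier instruction is required. */
	#define read_barrier_depends()	__asm__ __volatile__("mb" : : : "memory")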
* Re: on memory barriers and cachelines
From: Paul E. McKenney @ 2012-02-10 16:32 UTC
To: Jamie Lokier
Cc: Peter Zijlstra, Linus Torvalds, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

On Fri, Feb 10, 2012 at 02:51:29AM +0000, Jamie Lokier wrote:
> Paul E. McKenney wrote:
> > On Wed, Feb 01, 2012 at 10:33:58AM +0100, Peter Zijlstra wrote:
> > > Hi all,
> > >
> > > So I was talking to Paul yesterday and he mentioned how the SRCU sync
> > > primitive has to use extra synchronize_sched() calls in order to avoid
> > > smp_rmb() calls in the srcu_read_{un,}lock() calls.
> > >
> > > Now memory barriers are usually explained as observable order between
> > > two (or more) unrelated variables, as Documentation/memory-barriers.txt
> > > does in great detail.
> > >
> > > What I couldn't find in there though, is what happens when both
> > > variables are on the same cacheline. The "The effects of the CPU cache"
> > > and "Cache coherency" sections are closest but leave me wanting on this
> > > point.
> > >
> > > Can we get some implicit behaviour from being on the same cacheline? Or
> > > can this memory access queue still totally wreck the game?
> >
> > I don't know of any guarantees in this area, but am checking with
> > hardware architects for a couple of architectures.
>
> On a related note:
>
> - What's to stop the compiler optimising away a data dependency,
>   converting it to a speculative control dependency?  Here's a
>   contrived example:
>
>   ORIGINAL:
>
>	int func(int *p)
>	{
>		int index = p[0], first = p[1];
>		read_barrier_depends();	/* do..while(0) on most archs */
>		return max(first, p[index]);
>	}
>
>   OPTIMISED:
>
>	int func(int *p)
>	{
>		int index = p[0], val = p[1];
>		if (index != 1)
>			val = max(val, p[index]);
>		return val;
>	}
>
>   A quick search of the GCC manual for "speculation" and
>   "speculative" comes up with quite a few hits.  I've no idea if
>   they are relevant.

Well, that would be one reason why I did all that work to get
memory_order_consume into C++11.  ;-)

More seriously, you can defeat some of the speculative optimizations
by using ACCESS_ONCE():

	int index = ACCESS_ONCE(p[0]), first = ACCESS_ONCE(p[1]);

This forces a volatile access which should make the compiler at least
a bit more reluctant to apply speculation optimizations.  And using
rcu_dereference_index_check() in the kernel packages the ACCESS_ONCE()
and the smp_read_barrier_depends().

> - If I understood correctly, IA64 has explicit special registers to
>   assist data-memory speculation by the compiler.  These would be
>   the ALAT registers.  I don't know if they are used in a way that
>   affects RCU, but they do appear in the GCC machine description,
>   and in the manual some kinds of "data speculative scheduling" are
>   enabled by default.  But read_barrier_depends() is a do {} while
>   on IA64.

As I understand it, the ALAT registers do respect dependency ordering.
But you would need to talk to an IA64 hardware architect and an IA64
compiler expert to get the whole story.

> - The GCC manual mentions data speculation in conjunction with
>   Blackfin as well.  I have no idea if it's relevant, but Blackfin
>   does at least define read_barrier_depends() in an interesting way,
>   sometimes.

Are there SMP blackfin systems now?  There were not last I checked,
and these issues matter only on SMP.

> - I read that ARM can do speculative memory loads these days.  It
>   complicates DMA.  But are they implemented by speculative
>   preloading into the cache, or by speculatively executing load
>   instructions whose results are predicated on a control path
>   taken?  If the latter, is an empty read_barrier_depends() still
>   ok on ARM?

But ARM does guarantee dependency ordering, so whatever it does to
speculate, it must validate -- the results must be as if the hardware
had done no speculation.

							Thanx, Paul
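For context, ACCESS_ONCE() itself is tiny; a sketch matching the usual
kernel definition of the time (the exact header it lives in varies):

	/* Force the compiler to emit exactly one load or store by
	 * going through a volatile-qualified lvalue. */
	#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))

The volatile access pins down how many times the location is accessed
and in what program order, but it does not by itself order the loads at
the hardware level -- that remains the barrier's job.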
* Re: on memory barriers and cachelines
From: Peter Zijlstra @ 2012-02-10 18:13 UTC
To: paulmck
Cc: Jamie Lokier, Linus Torvalds, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

On Fri, 2012-02-10 at 08:32 -0800, Paul E. McKenney wrote:
>
> Are there SMP blackfin systems now?  There were not last I checked,
> and these issues matter only on SMP.

Those are not cache coherent and a total trainwreck.
* Re: on memory barriers and cachelines
From: Paul E. McKenney @ 2012-02-10 18:47 UTC
To: Peter Zijlstra
Cc: Jamie Lokier, Linus Torvalds, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

On Fri, Feb 10, 2012 at 07:13:09PM +0100, Peter Zijlstra wrote:
> On Fri, 2012-02-10 at 08:32 -0800, Paul E. McKenney wrote:
> >
> > Are there SMP blackfin systems now?  There were not last I checked,
> > and these issues matter only on SMP.
>
> Those are not cache coherent and a total trainwreck.

Indeed, I was ignoring the cache-incoherent "SMP" blackfin systems.  ;-)

							Thanx, Paul
* Re: on memory barriers and cachelines
From: Linus Torvalds @ 2012-02-01 17:17 UTC
To: Peter Zijlstra
Cc: Paul E. McKenney, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

On Wed, Feb 1, 2012 at 1:33 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I was talking to Paul yesterday and he mentioned how the SRCU sync
> primitive has to use extra synchronize_sched() calls in order to avoid
> smp_rmb() calls in the srcu_read_{un,}lock() calls.

So that's probably a bad optimization these days, simply because
smp_rmb() is totally free on x86. And on other architectures, it is
*usually* a fairly cheap pipeline sync. But they mostly don't really
matter, outside of ARM.

> Now memory barriers are usually explained as observable order between
> two (or more) unrelated variables, as Documentation/memory-barriers.txt
> does in great detail.
>
> What I couldn't find in there though, is what happens when both
> variables are on the same cacheline. The "The effects of the CPU cache"
> and "Cache coherency" sections are closest but leave me wanting on this
> point.
>
> Can we get some implicit behaviour from being on the same cacheline? Or
> can this memory access queue still totally wreck the game?

At least on alpha, the cacheline itself is subpartitioned into
sectors, and accesses to different parts of the same cacheline can go
to different sectors, and literally have ordering issues because a
write from another CPU will update the sectors individually. This is
where the insane "smp_read_barrier_depends()" comes from, iirc.

So no, you cannot assume that a single cacheline is somehow "atomic"
and inherently ordered.

Also, even if you were to find an atomic sub-chunk, if you need a
"smp_rmb()", what else would guarantee that the CPU core wouldn't
re-order things to do the second read first, then lose the cacheline,
re-read it, and then do the first read?

So the reason smp_rmb() is free on x86 is that x86 won't do that kind
of re-ordering. Either because the uarch won't re-order the cache
accesses of reads wrt each other in the first place, or because the
uarch makes sure that cachelines stay around until instructions have
been retired in order. But other architectures that do need smp_rmb()
can well re-order loads wildly even if they share a cacheline.

But smp_rmb() and smp_wmb() are usually supposed to be *much* cheaper
than a full barrier. Of course, various architectures can get it
totally wrong, so..

		Linus
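The Alpha behaviour Linus describes is why the canonical
pointer-publication pattern carries a dependency barrier on the read
side; a minimal sketch, with illustrative names (the full discussion
is in Documentation/memory-barriers.txt):

	struct foo {
		int data;
	};

	static struct foo *gp;

	static void publish(struct foo *p)
	{
		p->data = 42;
		smp_wmb();		/* initialize before publishing */
		gp = p;
	}

	static int subscribe(void)
	{
		struct foo *p = gp;

		/* A no-op everywhere except Alpha, where the split
		 * cache can return a stale p->data even though the
		 * pointer value itself is new. */
		smp_read_barrier_depends();
		return p->data;
	}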
* Re: on memory barriers and cachelines
From: Peter Zijlstra @ 2012-02-01 17:29 UTC
To: Linus Torvalds
Cc: Paul E. McKenney, benh, davem, H. Peter Anvin, Linux-Arch, Ingo Molnar, dhowells

On Wed, 2012-02-01 at 09:17 -0800, Linus Torvalds wrote:
>
> Also, even if you were to find an atomic sub-chunk, if you need a
> "smp_rmb()", what else would guarantee that the CPU core wouldn't
> re-order things to do the second read first, then lose the cacheline,
> re-read it, and then do the first read?

Fair enough.. Thanks!