public inbox for linux-kernel@vger.kernel.org
* Re: [patch] mm: fix anon_vma races
       [not found] <20081016041033.GB10371@wotan.suse.de>
@ 2008-10-17 22:14 ` Hugh Dickins
  2008-10-17 23:05   ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-10-17 22:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linus Torvalds, linux-kernel, linux-mm

Added Linus and LKML to the Cc now I see that he's suggesting
a similar but not identical patch under that other thread.

On Thu, 16 Oct 2008, Nick Piggin wrote:

> Hi,
> 
> Still would like independent confirmation of these problems and the fix, but
> it still looks buggy to me...
> 
> ---
> 
> There are some races in the anon_vma code.
> 
> The race comes about because adding our vma to the tail of anon_vma->head comes
> after the assignment of vma->anon_vma to a new anon_vma pointer. In the case
> where this is a new anon_vma, this is done without holding any locks.  So a
> parallel anon_vma_prepare might see the vma already has an anon_vma, so it
> won't serialise on the page_table_lock. It may proceed and assign the anon_vma to a
> page in the page fault path. Another thread may then pick up the page from the
> LRU list, find its mapcount incremented, and attempt to iterate over the
> anon_vma's list concurrently with the first thread (because the first one is
> not holding the anon_vma lock). This is a fairly subtle race, and only likely
> to be hit in kernels where the spinlock is preemptible and the first thread is
> preempted at the right time... but OTOH it is _possible_ to hit here; on bigger
> SMP systems cacheline transfer latencies could be very large, or we could take
> an NMI inside the lock or something. Fix this by initialising the list before
> adding the anon_vma to vma.
> 
> After that, there is a similar data-race with memory ordering where the store
> to make the anon_vma visible passes previous stores to initialize the anon_vma.
> This race also includes stores to initialize the anon_vma spinlock by the
> slab constructor. Add and comment appropriate barriers to solve this.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Ow, you do break my head with these things.

I'm pretty sure you're on to something here, it certainly looks
suspicious.  But I've not yet convinced myself that what you (or
Linus) propose is the right answer - I was going to think it through
some more, but spotting the other patch from Linus, I thought I ought at
least to introduce you to each other ;)

My problem is really with the smp_read_barrier_depends() you each
have in anon_vma_prepare().  But the only thing which its CPU
does with the anon_vma is put its address into a struct page
(or am I forgetting more?).  Wouldn't the smp_read_barrier_depends()
need to be, not there in anon_vma_prepare(), but over on the third
CPU, perhaps in page_lock_anon_vma()?

I've also been toying with the idea of taking the newly alloc'ed
anon_vma's lock here in the !vma->anon_vma case, instead of having
that "locked = NULL" business.  That might be a good idea, but I'm
rather thinking it doesn't quite address the issue at hand.

Taking a break...

Hugh

> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -81,8 +81,15 @@ int anon_vma_prepare(struct vm_area_stru
>  		/* page_table_lock to protect against threads */
>  		spin_lock(&mm->page_table_lock);
>  		if (likely(!vma->anon_vma)) {
> -			vma->anon_vma = anon_vma;
>  			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> +			/*
> +			 * This smp_wmb() is required to order all previous
> +			 * stores to initialize the anon_vma (by the slab
> +			 * ctor) and add this vma, with the store to make it
> +			 * visible to other CPUs via vma->anon_vma.
> +			 */
> +			smp_wmb();
> +			vma->anon_vma = anon_vma;
>  			allocated = NULL;
>  		}
>  		spin_unlock(&mm->page_table_lock);
> @@ -91,6 +98,15 @@ int anon_vma_prepare(struct vm_area_stru
>  			spin_unlock(&locked->lock);
>  		if (unlikely(allocated))
>  			anon_vma_free(allocated);
> +	} else {
> +		/*
> +		 * This smp_read_barrier_depends is required to order the data
> +		 * dependent loads of fields in anon_vma, with the load of the
> +		 * anon_vma pointer vma->anon_vma. This complements the above
> +		 * smp_wmb, and prevents a CPU from loading uninitialized
> +		 * contents of anon_vma.
> +		 */
> +		smp_read_barrier_depends();
>  	}
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] mm: fix anon_vma races
  2008-10-17 22:14 ` [patch] mm: fix anon_vma races Hugh Dickins
@ 2008-10-17 23:05   ` Linus Torvalds
  2008-10-18  0:13     ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2008-10-17 23:05 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm



On Fri, 17 Oct 2008, Hugh Dickins wrote:
> 
> My problem is really with the smp_read_barrier_depends() you each
> have in anon_vma_prepare().  But the only thing which its CPU
> does with the anon_vma is put its address into a struct page
> (or am I forgetting more?).  Wouldn't the smp_read_barrier_depends()
> need to be, not there in anon_vma_prepare(), but over on the third
> CPU, perhaps in page_lock_anon_vma()?

I thought about it, but it's a disaster from a maintenance standpoint to 
put it there, rather than make it all clear in the _one_ function that 
actually does things optimistically.

I agree that it's a bit subtle the way I did it (haven't seen Nick's 
patch, I assume he was upset at me for shouting at him), but that's part 
of why I put that comment in there and said things are subtle.

Anyway, technically you're right: the smp_read_barrier_depends() really 
would be more obvious in the place where we actually fetch that "anon_vma" 
pointer again and actually dereference it.

HOWEVER:

 - there are potentially multiple places that do that, and putting it in 
   the anon_vma_prepare() thing not only matches things with the 
   smp_wmb(), making that whole pairing much more obvious, but it also 
   means that we're guaranteed that any anon_vma user will have done the 
   smp_read_barrier_depends(), since they all have to do that prepare 
   thing anyway.

   So putting it there is simpler and gives better guarantees, and pairs 
   up the barriers better.

 - Now, "simpler" (etc) is no help if it doesn't work, so now I have to 
   convince you that it's _sufficient_ to do that "read_barrier_depends()" 
   early, even if we then end up re-doing the first read and thus the 
   "depends" part doesn't work any more. So "simpler" is all good, but not 
   if it's incorrect.

   And I admit it, here my argument is one of implementation. The fact is, 
   the only architecture where "read_barrier_depends()" exists at all as 
   anything but a no-op is alpha, and there it's a full read barrier. On 
   all other architectures, causality implies a read barrier anyway, so 
   for them, placement (or non-placement) of the smp_read_barrier_depends 
   is a total non-issue.

   And so, since on the only architecture where it could possibly matter, 
   that _depends thing turns into a full read barrier, and since 
   "anon_vma" is actually stable since written, and since the only 
   ordering constraint is that initial ordering of seeing the "anon_vma" 
   turn non-NULL, you may as well think of that "read_barrier_depends()" 
   as a full read barrier between the _original_ read of the anon_vma 
   pointer and then the read of the lock data we want to protect.

   Which it is, on alpha. And that is sufficient. IOW, think of it as a 
   real read_barrier(), with no dependency thing, but that only happens 
   when an architecture doesn't already guarantee the causality barrier.

   And once you think of it as a "smp_rmb() for alpha", you realize that 
   it's perfectly ok for it to be where it is.

Anyway, lockless is bad. It would certainly be a *lot* simpler to just 
take the page_table_lock around the whole thing, except I think we really 
*really* don't want to do that. That thing is solidly in a couple of 
*very* timing-critical routines. Doing another lock there is just not an 
option.

		Linus


* Re: [patch] mm: fix anon_vma races
  2008-10-17 23:05   ` Linus Torvalds
@ 2008-10-18  0:13     ` Hugh Dickins
  2008-10-18  0:25       ` Linus Torvalds
  2008-10-18  1:53       ` Nick Piggin
  0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-10-18  0:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, linux-kernel, linux-mm

On Fri, 17 Oct 2008, Linus Torvalds wrote:
> On Fri, 17 Oct 2008, Hugh Dickins wrote:
> > 
> > My problem is really with the smp_read_barrier_depends() you each
> > have in anon_vma_prepare().  But the only thing which its CPU
> > does with the anon_vma is put its address into a struct page
> > (or am I forgetting more?).  Wouldn't the smp_read_barrier_depends()
> > need to be, not there in anon_vma_prepare(), but over on the third
> > CPU, perhaps in page_lock_anon_vma()?
> 
> I thought about it, but it's a disaster from a maintenance standpoint to 
> put it there, rather than make it all clear in the _one_ function that 
> actually does things optimistically.
> 
> I agree that it's a bit subtle the way I did it (haven't seen Nick's 
> patch, I assume he was upset at me for shouting at him), but that's part 
> of why I put that comment in there and said things are subtle.

Nick's patch was below in the mail you're replying to - I expect
it looked so much like yours, that at a glance you thought it was
yours - though there are little differences.

(Yes, I think he felt he'd make more progress by backing away from
your harangues and taking cover with a patch to linux-mm - though
surely you were right that his original ctor angle was mistaken.)

> 
> Anyway, technically you're right: the smp_read_barrier_depends() really 
> would be more obvious in the place where we actually fetch that "anon_vma" 
> pointer again and actually dereference it.
> 
> HOWEVER:
> 
>  - there are potentially multiple places that do that, and putting it in 
>    the anon_vma_prepare() thing not only matches things with the 
>    smp_wmb(), making that whole pairing much more obvious, but it also 
>    means that we're guaranteed that any anon_vma user will have done the 
>    smp_read_barrier_depends(), since they all have to do that prepare 
>    thing anyway.

No, it's not so that any anon_vma user would have done the
smp_read_barrier_depends() placed in anon_vma_prepare().

Anyone faulting in a page would have done it (swapoff? that
assumes it's been done, let's not worry about it right now).

But they're doing it to make the page's ptes accessible to
memory reclaim, and the CPU doing memory reclaim will not
(unless by coincidence) have done that anon_vma_prepare() -
it's just reading the links which the faulters are providing.

But I've given up thought for the night, will leave digesting
all you've written until morning, just wanted to point that
out before sleeping.

Hugh

> 
>    So putting it there is simpler and gives better guarantees, and pairs 
>    up the barriers better.
> 
>  - Now, "simpler" (etc) is no help if it doesn't work, so now I have to 
>    convince you that it's _sufficient_ to do that "read_barrier_depends()" 
>    early, even if we then end up re-doing the first read and thus the 
>    "depends" part doesn't work any more. So "simpler" is all good, but not 
>    if it's incorrect.
> 
>    And I admit it, here my argument is one of implementation. The fact is, 
>    the only architecture where "read_barrier_depends()" exists at all as 
>    anything but a no-op is alpha, and there it's a full read barrier. On 
>    all other architectures, causality implies a read barrier anyway, so 
>    for them, placement (or non-placement) of the smp_read_barrier_depends 
>    is a total non-issue.
> 
>    And so, since on the only architecture where it could possibly matter, 
>    that _depends thing turns into a full read barrier, and since 
>    "anon_vma" is actually stable since written, and since the only 
>    ordering constraint is that initial ordering of seeing the "anon_vma" 
>    turn non-NULL, you may as well think of that "read_barrier_depends()" 
>    as a full read barrier between the _original_ read of the anon_vma 
>    pointer and then the read of the lock data we want to protect.
> 
>    Which it is, on alpha. And that is sufficient. IOW, think of it as a 
>    real read_barrier(), with no dependency thing, but that only happens 
>    when an architecture doesn't already guarantee the causality barrier.
> 
>    And once you think of it as a "smp_rmb() for alpha", you realize that 
>    it's perfectly ok for it to be where it is.
> 
> Anyway, lockless is bad. It would certainly be a *lot* simpler to just 
> take the page_table_lock around the whole thing, except I think we really 
> *really* don't want to do that. That thing is solidly in a couple of 
> *very* timing-critical routines. Doing another lock there is just not an 
> option.
> 
> 		Linus


* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:13     ` Hugh Dickins
@ 2008-10-18  0:25       ` Linus Torvalds
  2008-10-18  1:53       ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2008-10-18  0:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm



On Sat, 18 Oct 2008, Hugh Dickins wrote:
> 
> But they're doing it to make the page's ptes accessible to
> memory reclaim, and the CPU doing memory reclaim will not
> (unless by coincidence) have done that anon_vma_prepare() -
> it's just reading the links which the faulters are providing.

Ahh. Good point. Then yes, those ones would really need the 
smp_read_barrier_depends() too.

Very annoying.

Of course, we _could_ just admit that the situation is really *really* 
unlikely, and there are probably something like five people running 
Linux/alpha, and that we don't care enough. With just the smp_wmb(), we 
cover all non-alpha people, since this is all through a pointer, so all 
the readers will inevitably be of the smp_read_barrier_depends kind.

If we want to guarantee the proper smp_read_barrier_depends() behaviour, 
then we'd need to find them all. Possibly by renaming the whole field 
(prepend an underscore like we usually do) and forcing all readers to use 
some "get_anon_vma(vma)" helper function, aka

  static inline struct anon_vma *get_anon_vma(struct vm_area_struct *vma)
  {
	struct anon_vma *anon_vma = vma->_anon_vma;
	smp_read_barrier_depends();
	return anon_vma;
  }

which would find them all.

Ugh. I really would have preferred not having something like that. 
Especially considering how limited that issue really is. Hmm. 

		Linus


* Re: [patch] mm: fix anon_vma races
  2008-10-18  0:13     ` Hugh Dickins
  2008-10-18  0:25       ` Linus Torvalds
@ 2008-10-18  1:53       ` Nick Piggin
  2008-10-18  2:50         ` Paul Mackerras
  1 sibling, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2008-10-18  1:53 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 01:13:16AM +0100, Hugh Dickins wrote:
> On Fri, 17 Oct 2008, Linus Torvalds wrote:
> > would be more obvious in the place where we actually fetch that "anon_vma" 
> > pointer again and actually dereference it. 
> > 
> > HOWEVER:
> > 
> >  - there are potentially multiple places that do that, and putting it in 
> >    the anon_vma_prepare() thing not only matches things with the 
> >    smp_wmb(), making that whole pairing much more obvious, but it also 
> >    means that we're guaranteed that any anon_vma user will have done the 
> >    smp_read_barrier_depends(), since they all have to do that prepare 
> >    thing anyway.
> 
> No, it's not so that any anon_vma user would have done the
> smp_read_barrier_depends() placed in anon_vma_prepare().
> 
> Anyone faulting in a page would have done it (swapoff? that
> assumes it's been done, let's not worry about it right now).
> 
> But they're doing it to make the page's ptes accessible to
> memory reclaim, and the CPU doing memory reclaim will not
> (unless by coincidence) have done that anon_vma_prepare() -
> it's just reading the links which the faulters are providing.

Yes, that's a very important flaw you point out with the fix. Good
spotting.

Actually another thing I was staying awake thinking about was the
pairwise consistency problem. "Apparently" Linux is supposed to
support arbitrary pairwise consistency.

This means:
CPU0
anon_vma.initialized = 1;
smp_wmb()
vma->anon_vma = anon_vma;

CPU1
if (vma->anon_vma)
  page->anon_vma = vma->anon_vma;

CPU2
if (page->anon_vma) {
  smp_read_barrier_depends();
  assert(page->anon_vma->initialized);
}

The assertion may trigger because the store from CPU0 may not have
propagated to CPU2 before the stores from CPU1.

But after thinking about this a bit more, I think Linux would be
broken all over the map under such ordering schemes. I think we'd
have to mandate causal consistency. Are there any architectures we
run on where this is not guaranteed? (I think recent clarifications
to x86 ordering give us CC on that architecture).

powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)




* Re: [patch] mm: fix anon_vma races
  2008-10-18  1:53       ` Nick Piggin
@ 2008-10-18  2:50         ` Paul Mackerras
  2008-10-18  2:57           ` Linus Torvalds
  2008-10-18  5:49           ` Nick Piggin
  0 siblings, 2 replies; 13+ messages in thread
From: Paul Mackerras @ 2008-10-18  2:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

Nick Piggin writes:

> But after thinking about this a bit more, I think Linux would be
> broken all over the map under such ordering schemes. I think we'd
> have to mandate causal consistency. Are there any architectures we
> run on where this is not guaranteed? (I think recent clarifications
> to x86 ordering give us CC on that architecture).
> 
> powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)

Not sure what you mean by causal consistency, but I assume it's the
same as saying that barriers give cumulative ordering, as described on
page 413 of the Power Architecture V2.05 document at:

http://www.power.org/resources/reading/PowerISA_V2.05.pdf

The ordering provided by sync, lwsync and eieio is cumulative (see
pages 446 and 448), so we should be OK on powerpc AFAICS.  (The
cumulative property of eieio only applies to accesses to normal system
memory, but that should be OK since we use sync when we want barriers
that affect non-cacheable accesses as well as cacheable.)

Paul.


* Re: [patch] mm: fix anon_vma races
  2008-10-18  2:50         ` Paul Mackerras
@ 2008-10-18  2:57           ` Linus Torvalds
  2008-10-18  5:49           ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2008-10-18  2:57 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Nick Piggin, Hugh Dickins, linux-kernel, linux-mm, linux-arch



On Sat, 18 Oct 2008, Paul Mackerras wrote:
> 
> Not sure what you mean by causal consistency, but I assume it's the
> same as saying that barriers give cumulative ordering, as described on
> page 413 of the Power Architecture V2.05 document at:

I'm pretty sure that everybody but alpha is ok.

And alpha needs the smp_read_barrier_depends() not because it doesn't 
really support causality, but because each CPU internally doesn't 
guarantee that they handle the cache invalidates in-order without a 
barrier. 

So without the smp_read_barrier_depends(), alpha will actually have the 
proper causal relationships (cachelines will move to exclusive state on 
CPU0 in the right order and others will see the causality), but CPU2 
may still see stale data, from not even having invalidated the 
"anon_vma.initialized" cacheline, because its cache invalidation queue 
hadn't been flushed in order.

Alpha is insane. And the odd man out.

			Linus


* Re: [patch] mm: fix anon_vma races
  2008-10-18  2:50         ` Paul Mackerras
  2008-10-18  2:57           ` Linus Torvalds
@ 2008-10-18  5:49           ` Nick Piggin
  2008-10-18 10:49             ` Paul Mackerras
  2008-10-18 17:00             ` Linus Torvalds
  1 sibling, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2008-10-18  5:49 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 01:50:57PM +1100, Paul Mackerras wrote:
> Nick Piggin writes:
> 
> > But after thinking about this a bit more, I think Linux would be
> > broken all over the map under such ordering schemes. I think we'd
> > have to mandate causal consistency. Are there any architectures we
> > run on where this is not guaranteed? (I think recent clarifications
> > to x86 ordering give us CC on that architecture).
> > 
> > powerpc, ia64, alpha, sparc, arm, mips? (cced linux-arch)
> 
> Not sure what you mean by causal consistency, but I assume it's the

I think it can be called transitive. Basically (assume memory starts off zeroed)
CPU0
x := 1

CPU1
if (x == 1) {
  fence
  y := 1
}

CPU2
if (y == 1) {
  fence
  assert(x == 1)
}


As opposed to pairwise, which only provides an ordering of visibility between
any given two CPUs (so the store to y might be propagated to CPU2 after the
store to x, regardless of the fences).

Apparently pairwise ordering is more interesting than just a theoretical
thing, and not just restricted to Alpha's funny caches. It can allow for
arbitrary network propagating stores / cache coherency between CPUs. x86's
publicly documented memory model supposedly could allow for such ordering
up until a year or so ago (when they clarified and strengthened it).


> same as saying that barriers give cumulative ordering, as described on
> page 413 of the Power Architecture V2.05 document at:
> 
> http://www.power.org/resources/reading/PowerISA_V2.05.pdf
> 
> The ordering provided by sync, lwsync and eieio is cumulative (see
> pages 446 and 448), so we should be OK on powerpc AFAICS.  (The
> cumulative property of eieio only applies to accesses to normal system
> memory, but that should be OK since we use sync when we want barriers
> that affect non-cacheable accesses as well as cacheable.)

The section on cumulative ordering sounds like it might do the trick. But
I haven't really worked through exactly what it is saying ;)



* Re: [patch] mm: fix anon_vma races
  2008-10-18  5:49           ` Nick Piggin
@ 2008-10-18 10:49             ` Paul Mackerras
  2008-10-18 17:00             ` Linus Torvalds
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Mackerras @ 2008-10-18 10:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linus Torvalds, linux-kernel, linux-mm, linux-arch

Nick Piggin writes:

> > Not sure what you mean by causal consistency, but I assume it's the
> 
> I think it can be called transitive. Basically (assume memory starts off zeroed)
> CPU0
> x := 1
> 
> CPU1
> if (x == 1) {
>   fence
>   y := 1
> }
> 
> CPU2
> if (y == 1) {
>   fence
>   assert(x == 1)
> }

That's essentially the same as example 1 on page 415, so yes we are
talking about the same thing.

Paul.


* Re: [patch] mm: fix anon_vma races
  2008-10-18  5:49           ` Nick Piggin
  2008-10-18 10:49             ` Paul Mackerras
@ 2008-10-18 17:00             ` Linus Torvalds
  2008-10-18 18:44               ` Matthew Wilcox
  2008-10-19  2:53               ` Nick Piggin
  1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2008-10-18 17:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm, linux-arch



On Sat, 18 Oct 2008, Nick Piggin wrote:
> 
> > I think it can be called transitive. Basically (assume memory starts off zeroed)

Alpha is transitive. It has a notion of "processor issue order" and 
"location access order", and the ordering those two creates is a 
transitive "before" and "after" ordering.

The issue with alpha is not that it wouldn't be transitive - the issue is 
that *local* read dependencies do not cause a "processor issue order"!

So the real issue with alpha is not about any big pair-wise ordering vs 
transitive thing, the big issue is that alpha's totally _local_ and 
per-core orderings are so totally screwed up, and are defined to be very 
loose - because back when alpha was designed, loose memory ordering was 
thought to be a good thing for performance.

They were wrong, but that was mainly because the alpha designers lived in 
a time when threading wasn't really even an issue. They were optimizing 
purely for the case where memory ordering doesn't matter, and considered 
locking etc to be one of those non-RISCy rare operations that can be 
basically ignored.

> CPU0
> x := 1

So this creates a "location access" event on 'x' on alpha, call it "event 
A".

> CPU1
> if (x == 1) {
>   fence
>   y := 1
> }

This has two events: let's call the read of 'x' "B", and "C" is the write 
to 'y'.

And according to the alpha rules, we now have:

 - A << B

   Because we saw a '1' in B, we now have a "location access ordering" 
   on the _same_ variable between A and B.

 - B < C

   Because we have the fence in between the read and the write, we now 
   have a "processor issue order" between B and C (despite the fact that 
   they are different variables).

And now, the alpha definition of "before" means that we can declare that A 
is before C.

But on alpha, we really do need that fence, even if the address of 'y' was 
somehow directly data-dependent on 'x'. THAT is what makes alpha special, 
not any odd ordering rules.

> CPU2
> if (y == 1) {
>   fence
>   assert(x == 1)
> }

So again, we now have two events: the access of 'y' is "D", and the access 
of x is "E". And again, according to the alpha rules, we have two 
orderings:

 - C << D

   Because we saw a '1' in D, we have another "location access ordering" 
   on the variable 'y' between C and D.

 - D < E

   Because of the fence, we have a "processor issue ordering" between D 
   and E.

And for the same reason as above, we now get that C is "before" E 
according to the alpha ordering rules. And because the definition of 
"before" is transitive, then A is before E.

And that, in turn, means that that assert() can never trigger, because if 
it triggered, then by the access ordering rules that would imply that E << 
A, which would mean that E is "before" A, which in turn would violate the 
whole chain we just got to.

So while the alpha architecture manual doesn't have the exact sequence 
mentioned above (it only has nine so-called "Litmus tests"), it's fairly 
close to Litmus test 3, and the ordering on alpha is very clear: it's all 
transitive and causal (ie "before" can never be "after").

> Apparently pairwise ordering is more interesting than just a theoretical
> thing, and not just restricted to Alpha's funny caches.

Nobody does just pairwise ordering, afaik. It's an insane model. Everybody 
does some form of transitive ordering.

The real (and only really odd) issue with alpha is that for everybody 
else, if you have

	read x -> data dependency -> read y

(ie you read a pointer and dereference it, or you read an index and 
dereference an array through it), then on all other architectures that 
will imply a local processor ordering, which in turn will be part of the 
whole transitive order of operations. 

On alpha, it doesn't. You can think of it as alpha doing value speculation 
(ie allowing speculative reads even across data dependencies), so on 
alpha, you could imagine a CPU doing address speculation, and turning the 
two reads into a sequence of

 (a) read off speculative pointer '*p'
 (b) read x
 (c) verify that x == p

and THAT is what "smp_read_barrier_depends()" will basically stop on 
alpha. Nothing else. Other CPU's will always basically do

 (a) read x
 (b) read *x

so they have an implied read barrier between those two events thanks to 
simply the causal relationship.

Some more notes:

 - The reason that alpha has this odd thing is not actually that any alpha 
   implementation does value speculation, but the way the caches are 
   invalidated, the invalidates can be delayed and re-ordered almost 
   arbitrarily on the local CPU, and in the absence of a memory barrier 
   the second read (that does happen "after" the first read in some local 
   internal CPU sense and wasn't really speculative in that way) can get 
   stale data because one cacheline has been updated before another one 
   has.

   So while you can think of it as value speculation, the underlying cause 
   is actually not some fancy speculation infrastructure, just an internal 
   implementation issue.

 - The _data_ dependency is important, because other architectures _will_ 
   still speculatively move memory operations around across other "causal" 
   relationships, notably across control dependencies. IOW, if you have

	if (read(x))
		read(y)

   then there is NOT necessarily any real ordering between the reads, 
   because the conditional ends up being speculated, and you may well see 
   "y" being read before "x", and you really need a smp_rmb() on other 
   architectures than alpha too. So in this sense, alpha is very 
   "consistent" - for alpha, _no_ amount of local causality matters, and 
   only accesses to the *same* variable are implicitly locally ordered.

 - On x86, the new memory ordering semantics means that _all_ local causal 
   relationships are honored, so x86, like alpha, is very consistent. It 
   will consider both the data-dependency and the control dependency to be 
   100% the same. It just does it differently than alpha: for alpha, 
   neither matters for ordering, for x86, both matter.

Of course, even on x86, the local causal relationships still do allow 
loads to pass stores, so x86 isn't _totally_ ordered. x86 obviously still 
does need the smp_mb().

So alpha is "more consistent" in the respect of really having very clear 
rules. The fact that those "clear rules" are totally insane and very 
inconvenient for threading (and weren't the big performance advantage that 
people used to think they would be) is a separate matter.

		Linus


* Re: [patch] mm: fix anon_vma races
  2008-10-18 17:00             ` Linus Torvalds
@ 2008-10-18 18:44               ` Matthew Wilcox
  2008-10-19  2:54                 ` Nick Piggin
  2008-10-19  2:53               ` Nick Piggin
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2008-10-18 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm,
	linux-arch

On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> > Apparently pairwise ordering is more interesting than just a theoretical
> > thing, and not just restricted to Alpha's funny caches.
> 
> Nobody does just pairwise ordering, afaik. It's an insane model. Everybody 
> does some form of transitive ordering.

I assume you're talking about CPUs in particular here, and I don't know
of any counterexamples.

If you're talking about PCI devices, the model is not transitive.
Here's the exact text from Appendix E of PCI 3.0:

  A system may have multiple Producer-Consumer pairs operating
  simultaneously, with different data - flag-status sets located all
  around the system. But since only one Producer can write to a single
  data-flag set, there are no ordering requirements between different
  masters. Writes from one master on one bus may occur in one order on
  one bus, with respect to another master's writes, and occur in another
  order on another bus. In this case, the rules allow for some writes
  to be rearranged; for example, an agent on Bus 1 may see Transaction
  A from a master on Bus 1 complete first, followed by Transaction B
  from another master on Bus 0. An agent on Bus 0 may see Transaction
  B complete first followed by Transaction A. Even though the actual
  transactions complete in a different order, this causes no problem
  since the different masters must be addressing different data-flag sets.

I seem to remember earlier versions of the spec having clearer language
saying that A happening before B, and B happening before C, didn't mean
that A happened before C from the perspective of a third device, but I
can't find it right now.

Anyway, as the spec says, you're not supposed to use PCI like that,
so it doesn't matter.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: [patch] mm: fix anon_vma races
  2008-10-18 17:00             ` Linus Torvalds
  2008-10-18 18:44               ` Matthew Wilcox
@ 2008-10-19  2:53               ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2008-10-19  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Hugh Dickins, linux-kernel, linux-mm, linux-arch

On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 18 Oct 2008, Nick Piggin wrote:
> > 
> > I think it can be called transitive. Basically (assume memory starts off zeroed)
> 
> Alpha is transitive. It has a notion of "processor issue order" and 
> "location access order", and the ordering those two creates is a 
> transitive "before" and "after" ordering.
> 
> The issue with alpha is not that it wouldn't be transitive - the issue is 
> that *local* read dependencies do not cause a "processor issue order"!

That's fine. That's not so different from most other weakly ordered
processors having control dependencies not appearing in-order. So long as
stores propagate according to causality.

 
> So this creates a "location access" event on 'x' on alpha, call it "event 
> A".
> 
> > CPU1
> > if (x == 1) {
> >   fence
> >   y := 1
> > }
> 
> This has two events: let's call the read of 'x' "B", and "C" is the write 
> to 'y'.
> 
> And according to the alpha rules, we now have:
> 
>  - A << B
> 
>    Because we saw a '1' in B, we now have a "location access ordering" 
>    on the _same_ variable between A and B.
> 
>  - B < C
> 
>    Because we have the fence in between the read and the write, we now 
>    have a "processor issue order" between B and C (despite the fact that 
>    they are different variables).
> 
> And now, the alpha definition of "before" means that we can declare that A 
> is before C.
> 
> But on alpha, we really do need that fence, even if the address of 'y' was 
> somehow directly data-dependent on 'x'. THAT is what makes alpha special, 
> not any odd ordering rules.
> 
> > CPU2
> > if (y == 1) {
> >   fence
> >   assert(x == 1)
> > }
> 
> So again, we now have two events: the access of 'y' is "D", and the access 
> of x is "E". And again, according to the alpha rules, we have two 
> orderings:
> 
>  - C << D
> 
>    Because we saw a '1' in D, we have another "location access ordering" 
>    on the variable 'y' between C and D.
> 
>  - D < E
> 
>    Because of the fence, we have a "processor issue ordering" between D 
>    and E.
> 
> And for the same reason as above, we now get that C is "before" E 
> according to the alpha ordering rules. And because the definition of 
> "before" is transitive, then A is before E.
> 
> And that, in turn, means that that assert() can never trigger, because if 
> it triggered, then by the access ordering rules that would imply that E << 
> A, which would mean that E is "before" A, which in turn would violate the 
> whole chain we just got to.
> 
> So while the alpha architecture manual doesn't have the exact sequence 
> mentioned above (it only has nine so-called "Litmus tests"), it's fairly 
> close to Litmus test 3, and the ordering on alpha is very clear: it's all 
> transitive and causal (ie "before" can never be "after").

OK, good.

 
> > Apparently pairwise ordering is more interesting than just a theoretical
> > thing, and not just restricted to Alpha's funny caches.
> 
> Nobody does just pairwise ordering, afaik. It's an insane model. Everybody 
> does some form of transitive ordering.
 
We were chatting with Andy Glew a while back, and he said it actually can
be quite beneficial for HW designers (but I imagine that's true of a lot
of "insane" things) ;)

I remember though that you said Linux should be pairwise-safe. I think
that's wrong (for more reasons than this anon-vma race), which is why
I got concerned and started off this subthread.

I think Linux probably has a lot of problems in a pairwise consistency
model, so I'd just like to check if we actually attempt to support any
architecture where that is the case.

x86, powerpc, alpha are good ;) That gives me hope.

Thanks,
Nick


* Re: [patch] mm: fix anon_vma races
  2008-10-18 18:44               ` Matthew Wilcox
@ 2008-10-19  2:54                 ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2008-10-19  2:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Paul Mackerras, Hugh Dickins, linux-kernel,
	linux-mm, linux-arch

On Sat, Oct 18, 2008 at 12:44:05PM -0600, Matthew Wilcox wrote:
> On Sat, Oct 18, 2008 at 10:00:30AM -0700, Linus Torvalds wrote:
> > > Apparently pairwise ordering is more interesting than just a theoretical
> > > thing, and not just restricted to Alpha's funny caches.
> > 
> > Nobody does just pairwise ordering, afaik. It's an insane model. Everybody 
> > does some form of transitive ordering.
> 
> I assume you're talking about CPUs in particular here, and I don't know
> of any counterexamples.

Yes, just CPUs.

 
> If you're talking about PCI devices, the model is not transitive.
> Here's the exact text from Appendix E of PCI 3.0:
> 
>   A system may have multiple Producer-Consumer pairs operating
>   simultaneously, with different data/flag-status sets located all
>   around the system. But since only one Producer can write to a single
>   data-flag set, there are no ordering requirements between different
>   masters. Writes from one master on one bus may occur in one order on
>   one bus, with respect to another master's writes, and occur in another
>   order on another bus. In this case, the rules allow for some writes
>   to be rearranged; for example, an agent on Bus 1 may see Transaction
>   A from a master on Bus 1 complete first, followed by Transaction B
>   from another master on Bus 0. An agent on Bus 0 may see Transaction
>   B complete first followed by Transaction A. Even though the actual
>   transactions complete in a different order, this causes no problem
>   since the different masters must be addressing different data-flag sets.
> 
> I seem to remember earlier versions of the spec having more clear
> language about A happening before B and B happening before C didn't
> mean that A happened before C from the perspective of a third device,
> but I can't find it right now.
> 
> Anyway, as the spec says, you're not supposed to use PCI like that,
> so it doesn't matter.

Interesting. Hopefully as you say it won't matter.



end of thread, other threads:[~2008-10-19  2:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20081016041033.GB10371@wotan.suse.de>
2008-10-17 22:14 ` [patch] mm: fix anon_vma races Hugh Dickins
2008-10-17 23:05   ` Linus Torvalds
2008-10-18  0:13     ` Hugh Dickins
2008-10-18  0:25       ` Linus Torvalds
2008-10-18  1:53       ` Nick Piggin
2008-10-18  2:50         ` Paul Mackerras
2008-10-18  2:57           ` Linus Torvalds
2008-10-18  5:49           ` Nick Piggin
2008-10-18 10:49             ` Paul Mackerras
2008-10-18 17:00             ` Linus Torvalds
2008-10-18 18:44               ` Matthew Wilcox
2008-10-19  2:54                 ` Nick Piggin
2008-10-19  2:53               ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox