Possible memory ordering bug in page reclaim?

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Possible memory ordering bug in page reclaim?
@ 2005-10-15  3:28 Nick Piggin
  2005-10-15  6:17 ` Hugh Dickins
  0 siblings, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2005-10-15  3:28 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Hugh Dickins, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 598 bytes --]

Is there anything that prevents PageDirty from theoretically being
speculatively loaded before page_count here? (see patch)

It would result in pagecache corruption in the following situation:

1                                2
find_get_page();
write to page                    write_lock(tree_lock);
SetPageDirty();                  if (page_count != 2
put_page();                          || PageDirty())

Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
1, and page_count *after* put_page in 1.

Or am I seeing things that aren't there?

Thanks,

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-reclaim-memorder-fix.patch --]
[-- Type: text/plain, Size: 604 bytes --]

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -511,7 +511,12 @@ static int shrink_list(struct list_head 
 		 * PageDirty _after_ making sure that the page is freeable and
 		 * not in use by anybody. 	(pagecache + us == 2)
 		 */
-		if (page_count(page) != 2 || PageDirty(page)) {
+		if (page_count(page) != 2) {
+			write_unlock_irq(&mapping->tree_lock);
+			goto keep_locked;
+		}
+		smp_rmb();
+		if (PageDirty(page)) {
 			write_unlock_irq(&mapping->tree_lock);
 			goto keep_locked;
 		}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15  3:28 Possible memory ordering bug in page reclaim? Nick Piggin
@ 2005-10-15  6:17 ` Hugh Dickins
  2005-10-15  7:43   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 21+ messages in thread
From: Hugh Dickins @ 2005-10-15  6:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, Ben Herrenschmidt,
	Andrea Arcangeli, linux-kernel

On Sat, 15 Oct 2005, Nick Piggin wrote:
> 
> Is there anything that prevents PageDirty from theoretically being
> speculatively loaded before page_count here? (see patch)
> 
> It would result in pagecache corruption in the following situation:
> 
> 1                                2
> find_get_page();
> write to page                    write_lock(tree_lock);
> SetPageDirty();                  if (page_count != 2
> put_page();                          || PageDirty())
> 
> Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
                                  page->flags
> 1, and page_count *after* put_page in 1.

I think you're right.  But I'm the last person to ask
barrier/ordering questions of.  CC'ed Ben and Andrea.

Hugh

> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -511,7 +511,12 @@ static int shrink_list(struct list_head 
>  		 * PageDirty _after_ making sure that the page is freeable and
>  		 * not in use by anybody. 	(pagecache + us == 2)
>  		 */
> -		if (page_count(page) != 2 || PageDirty(page)) {
> +		if (page_count(page) != 2) {
> +			write_unlock_irq(&mapping->tree_lock);
> +			goto keep_locked;
> +		}
> +		smp_rmb();
> +		if (PageDirty(page)) {
>  			write_unlock_irq(&mapping->tree_lock);
>  			goto keep_locked;
>  		}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15  6:17 ` Hugh Dickins
@ 2005-10-15  7:43   ` Benjamin Herrenschmidt
  2005-10-15  8:00     ` Herbert Xu
  2005-10-15  8:59     ` Nick Piggin
  0 siblings, 2 replies; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2005-10-15  7:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Paul Mackerras, Anton Blanchard, Nick Piggin, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, linux-kernel

On Sat, 2005-10-15 at 07:17 +0100, Hugh Dickins wrote:
> On Sat, 15 Oct 2005, Nick Piggin wrote:
> > 
> > Is there anything that prevents PageDirty from theoretically being
> > speculatively loaded before page_count here? (see patch)
> > 
> > It would result in pagecache corruption in the following situation:
> > 
> > 1                                2
> > find_get_page();
> > write to page                    write_lock(tree_lock);
> > SetPageDirty();                  if (page_count != 2
> > put_page();                          || PageDirty())
> > 
> > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
>                                   page->flags
> > 1, and page_count *after* put_page in 1.
> 
> I think you're right.  But I'm the last person to ask
> barrier/ordering questions of.  CC'ed Ben and Andrea.

yup, now the question is wether PG_Dirty will be visible to CPU 2 before
the page count is decremented right ? That depends on put_page, I
suppose. If it's doing a simple atomic, there is an issue. But atomics
with return has been so often abused as locks that they may have been
implemented with a barrier... (On ppc64, it will do an eieio, thus I
think it should be ok).

There is also a problem the other way around. Write to page, then set
page dirty... those writes may be visible to CPU 2 (that is the page
content be dirty) before find_get_page even increased the page count,
unless there is a barrier in there too.

Paul, Anton ?

Ben.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15  7:43   ` Benjamin Herrenschmidt
@ 2005-10-15  8:00     ` Herbert Xu
  2005-10-15 16:57       ` Linus Torvalds
  2005-10-15  8:59     ` Nick Piggin
  1 sibling, 1 reply; 21+ messages in thread
From: Herbert Xu @ 2005-10-15  8:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: hugh, paulus, anton, nickpiggin, torvalds, akpm, andrea,
	linux-kernel

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
>> > 1                                2
>> > find_get_page();
>> > write to page                    write_lock(tree_lock);
>> > SetPageDirty();                  if (page_count != 2
>> > put_page();                          || PageDirty())
>> > 
>> > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
>>                                   page->flags
>> > 1, and page_count *after* put_page in 1.
> 
> yup, now the question is wether PG_Dirty will be visible to CPU 2 before
> the page count is decremented right ? That depends on put_page, I
> suppose. If it's doing a simple atomic, there is an issue. But atomics
> with return has been so often abused as locks that they may have been
> implemented with a barrier... (On ppc64, it will do an eieio, thus I
> think it should be ok).

Yes atomic_add_negative should always be a barrier.

> There is also a problem the other way around. Write to page, then set
> page dirty... those writes may be visible to CPU 2 (that is the page
> content be dirty) before find_get_page even increased the page count,
> unless there is a barrier in there too.

find_get_page does a read_unlock_irq after the increment which also
serves as a barrier.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15  8:00     ` Herbert Xu
@ 2005-10-15 16:57       ` Linus Torvalds
  2005-10-15 19:29         ` David S. Miller
  2005-10-16  0:04         ` Nick Piggin
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2005-10-15 16:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Benjamin Herrenschmidt, hugh, paulus, anton, nickpiggin, akpm,
	andrea, linux-kernel



On Sat, 15 Oct 2005, Herbert Xu wrote:
>
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> > yup, now the question is wether PG_Dirty will be visible to CPU 2 before
> > the page count is decremented right ? That depends on put_page, I
> > suppose. If it's doing a simple atomic, there is an issue. But atomics
> > with return has been so often abused as locks that they may have been
> > implemented with a barrier... (On ppc64, it will do an eieio, thus I
> > think it should be ok).
> 
> Yes atomic_add_negative should always be a barrier.

I disagree. That would be very expensive on anything but x86, where it 
just happens to be true for other reasons. Atomics do _not_ implement 
barriers.

Normally, we really don't care about any other decrement than the one that 
goes to zero, and that one tends to come with its own serialization logic.

In this case, we don't even care.

I agree, however, that it looks like PG_dirty is racy. Probably not in 
practice, but still.

So I'd suggest adding a smp_wmb() into set_page_dirty, and the rmb where 
Nick suggested.

So I'd suggest a patch something more like this.. Marking the dirty/count 
cases unlikely too in mm/page-writeback.c, since we should have tested for 
these conditions optimistically outside the lock.

Comments? Nick, did you have some test-case that you think might actually 
have been impacted by this?

		Linus

---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0166ea1..722779c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -668,8 +668,14 @@ int fastcall set_page_dirty(struct page 
 			return (*spd)(page);
 		return __set_page_dirty_buffers(page);
 	}
-	if (!PageDirty(page))
+	if (!PageDirty(page)) {
 		SetPageDirty(page);
+		/*
+		 * Make sure the dirty bit percolates out
+		 * to other CPU's before we release the page
+		 */
+		smp_wmb();
+	}
 	return 0;
 }
 EXPORT_SYMBOL(set_page_dirty);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0ea71e8..64f9570 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -511,10 +511,11 @@ static int shrink_list(struct list_head 
 		 * PageDirty _after_ making sure that the page is freeable and
 		 * not in use by anybody. 	(pagecache + us == 2)
 		 */
-		if (page_count(page) != 2 || PageDirty(page)) {
-			write_unlock_irq(&mapping->tree_lock);
-			goto keep_locked;
-		}
+		if (unlikely(page_count(page) != 2))
+			goto cannot_free;
+		smp_rmb();
+		if (unlikely(PageDirty(page)))
+			goto cannot_free;
 
 #ifdef CONFIG_SWAP
 		if (PageSwapCache(page)) {
@@ -538,6 +539,10 @@ free_it:
 			__pagevec_release_nonlru(&freed_pvec);
 		continue;
 
+cannot_free:
+		write_unlock_irq(&mapping->tree_lock);
+		goto keep_locked;
+
 activate_locked:
 		SetPageActive(page);
 		pgactivate++;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 16:57       ` Linus Torvalds
@ 2005-10-15 19:29         ` David S. Miller
  2005-10-15 22:17           ` Benjamin Herrenschmidt
  2005-10-16  0:04         ` Nick Piggin
  1 sibling, 1 reply; 21+ messages in thread
From: David S. Miller @ 2005-10-15 19:29 UTC (permalink / raw)
  To: torvalds
  Cc: herbert, benh, hugh, paulus, anton, nickpiggin, akpm, andrea,
	linux-kernel

From: Linus Torvalds <torvalds@osdl.org>
Date: Sat, 15 Oct 2005 09:57:47 -0700 (PDT)

> On Sat, 15 Oct 2005, Herbert Xu wrote:
> >
> > Yes atomic_add_negative should always be a barrier.
> 
> I disagree. That would be very expensive on anything but x86, where it 
> just happens to be true for other reasons. Atomics do _not_ implement 
> barriers.

When they return values, they are defined to be barriers.
It's even on the documentation :-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 19:29         ` David S. Miller
@ 2005-10-15 22:17           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2005-10-15 22:17 UTC (permalink / raw)
  To: David S. Miller
  Cc: torvalds, herbert, hugh, paulus, anton, nickpiggin, akpm, andrea,
	linux-kernel

On Sat, 2005-10-15 at 12:29 -0700, David S. Miller wrote:
> From: Linus Torvalds <torvalds@osdl.org>
> Date: Sat, 15 Oct 2005 09:57:47 -0700 (PDT)
> 
> > On Sat, 15 Oct 2005, Herbert Xu wrote:
> > >
> > > Yes atomic_add_negative should always be a barrier.
> > 
> > I disagree. That would be very expensive on anything but x86, where it 
> > just happens to be true for other reasons. Atomics do _not_ implement 
> > barriers.
> 
> When they return values, they are defined to be barriers.
> It's even on the documentation :-)

Ahhh, good to know :)

/me should read the documentation sometimes....

Ben.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 16:57       ` Linus Torvalds
  2005-10-15 19:29         ` David S. Miller
@ 2005-10-16  0:04         ` Nick Piggin
  1 sibling, 0 replies; 21+ messages in thread
From: Nick Piggin @ 2005-10-16  0:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, Benjamin Herrenschmidt, hugh, paulus, anton, akpm,
	andrea, linux-kernel, David S. Miller

Linus Torvalds wrote:

> I agree, however, that it looks like PG_dirty is racy. Probably not in 
> practice, but still.
> 
> So I'd suggest adding a smp_wmb() into set_page_dirty, and the rmb where 
> Nick suggested.
> 
> So I'd suggest a patch something more like this.. Marking the dirty/count 
> cases unlikely too in mm/page-writeback.c, since we should have tested for 
> these conditions optimistically outside the lock.
> 

As Dave suggested, I think there is too much other code that depends on
these atomics to be barriers for us to change it (at least not in this
patch! :)).


> Comments? Nick, did you have some test-case that you think might actually 
> have been impacted by this?
> 

I guess your vmscan.c hunks are slightly nicer, though I might put
'cannot_free' right at the end, because it will be a very uncommon case.

And no, I don't have a test case. In fact, I wouldn't be surprised if
nobody anywhere has ever hit it :) I was just browsing code...

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15  7:43   ` Benjamin Herrenschmidt
  2005-10-15  8:00     ` Herbert Xu
@ 2005-10-15  8:59     ` Nick Piggin
  2005-10-15 12:08       ` Herbert Xu
  1 sibling, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2005-10-15  8:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Hugh Dickins, Paul Mackerras, Anton Blanchard, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, linux-kernel

Benjamin Herrenschmidt wrote:
> On Sat, 2005-10-15 at 07:17 +0100, Hugh Dickins wrote:
> 
>>On Sat, 15 Oct 2005, Nick Piggin wrote:
>>

>>>1                                2
>>>find_get_page();
>>>write to page                    write_lock(tree_lock);
>>>SetPageDirty();                  if (page_count != 2
>>>put_page();                          || PageDirty())
>>>
>>>Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
>>
>>                                  page->flags
>>
>>>1, and page_count *after* put_page in 1.
>>
>>I think you're right.  But I'm the last person to ask
>>barrier/ordering questions of.  CC'ed Ben and Andrea.
> 
> 
> yup, now the question is wether PG_Dirty will be visible to CPU 2 before
> the page count is decremented right ? That depends on put_page, I
> suppose. If it's doing a simple atomic, there is an issue. But atomics
> with return has been so often abused as locks that they may have been
> implemented with a barrier... (On ppc64, it will do an eieio, thus I
> think it should be ok).
> 

Well yes, that's on the store side (1, above). However can't a CPU
still speculatively (eg. guess the branch) load the page->flags
cacheline which might be satisfied from memory before the page->count
cacheline loads? Ie. you can still have the correct write ordering
but have incorrect read ordering?

Because neither PageDirty nor page_count is a barrier, and there is
no read barrier between them.

Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15  8:59     ` Nick Piggin
@ 2005-10-15 12:08       ` Herbert Xu
  2005-10-15 13:35         ` Nick Piggin
  2005-10-15 18:00         ` Andrea Arcangeli
  0 siblings, 2 replies; 21+ messages in thread
From: Herbert Xu @ 2005-10-15 12:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: benh, hugh, paulus, anton, torvalds, akpm, andrea, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> Well yes, that's on the store side (1, above). However can't a CPU
> still speculatively (eg. guess the branch) load the page->flags
> cacheline which might be satisfied from memory before the page->count
> cacheline loads? Ie. you can still have the correct write ordering
> but have incorrect read ordering?
> 
> Because neither PageDirty nor page_count is a barrier, and there is
> no read barrier between them.

Yes you're right.  A read barrier is required here.

I think Ben was actually agreeing with you.  He's just questioning
whether the corresponding write barrier existed on CPU 1 (the answer
to which is affirmative).
 
Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 12:08       ` Herbert Xu
@ 2005-10-15 13:35         ` Nick Piggin
  2005-10-15 18:00         ` Andrea Arcangeli
  1 sibling, 0 replies; 21+ messages in thread
From: Nick Piggin @ 2005-10-15 13:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: benh, hugh, paulus, anton, torvalds, akpm, andrea, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 891 bytes --]

Herbert Xu wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Well yes, that's on the store side (1, above). However can't a CPU
>>still speculatively (eg. guess the branch) load the page->flags
>>cacheline which might be satisfied from memory before the page->count
>>cacheline loads? Ie. you can still have the correct write ordering
>>but have incorrect read ordering?
>>
>>Because neither PageDirty nor page_count is a barrier, and there is
>>no read barrier between them.
> 
> 
> Yes you're right.  A read barrier is required here.
> 
> I think Ben was actually agreeing with you.  He's just questioning
> whether the corresponding write barrier existed on CPU 1 (the answer
> to which is affirmative).
>  

Ah, that clears up my misunderstanding.

Yes I agree the write side is OK.

Thanks Ben and Herbert. I guess I should do a proper patch then.

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-reclaim-memorder-fix.patch --]
[-- Type: text/plain, Size: 1560 bytes --]

In mm/vmscan.c, the page reclaim may have the following sequence 2
running concurrently with sequence 1 on another CPU:

1                                2
find_get_page();
write to page                    write_lock(tree_lock);
SetPageDirty();                  if (page_count != 2
put_page();                              || PageDirty())
                                     /* page dirty or busy */
				 else
				     /* free it */

The comment indicates that PageDirty must be checked *after* page_count
indicates there are no users of this page, which prevents the dirty bit
from being lost in the case that that sequence 2 might see the state of
PageDirty() *before* SetPageDirty() in 1, but page_count *after* put_page
in 1.

However, there is no read memory barrier there, and so nothing to stop a
CPU from loading page_count before PageDirty (ie. ->flags). Theoretically,
data corruption is possible.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -511,7 +511,12 @@ static int shrink_list(struct list_head 
 		 * PageDirty _after_ making sure that the page is freeable and
 		 * not in use by anybody. 	(pagecache + us == 2)
 		 */
-		if (page_count(page) != 2 || PageDirty(page)) {
+		if (page_count(page) != 2) {
+			write_unlock_irq(&mapping->tree_lock);
+			goto keep_locked;
+		}
+		smp_rmb();
+		if (PageDirty(page)) {
 			write_unlock_irq(&mapping->tree_lock);
 			goto keep_locked;
 		}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 12:08       ` Herbert Xu
  2005-10-15 13:35         ` Nick Piggin
@ 2005-10-15 18:00         ` Andrea Arcangeli
  2005-10-15 19:48           ` Herbert Xu
  2005-10-15 22:16           ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2005-10-15 18:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Nick Piggin, benh, hugh, paulus, anton, torvalds, akpm,
	linux-kernel

On Sat, Oct 15, 2005 at 10:08:08PM +1000, Herbert Xu wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> > Well yes, that's on the store side (1, above). However can't a CPU
> > still speculatively (eg. guess the branch) load the page->flags
> > cacheline which might be satisfied from memory before the page->count
> > cacheline loads? Ie. you can still have the correct write ordering
> > but have incorrect read ordering?
> > 
> > Because neither PageDirty nor page_count is a barrier, and there is
> > no read barrier between them.
> 
> Yes you're right.  A read barrier is required here.

Even a write barrier is required on the left side, the read barrier on
the right side is useless if there is no write barrier on the left side.

Note that the barrier in atomic_add_negative is useless here because it
happens way too late, _after_ the count is decremented (not _before_)
so the decreased count could be already visible to the other cpu.

Not all archs are like x86 where a barrier happens implicitly both
before and after the instruction, and the way atomic_add_negative is
implemented the barrier from a common code point of view is only added
_after_ the instruction. 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 18:00         ` Andrea Arcangeli
@ 2005-10-15 19:48           ` Herbert Xu
  2005-10-15 20:07             ` Andrea Arcangeli
  2005-10-15 22:16           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 21+ messages in thread
From: Herbert Xu @ 2005-10-15 19:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, benh, hugh, paulus, anton, torvalds, akpm,
	linux-kernel

On Sat, Oct 15, 2005 at 06:00:18PM +0000, Andrea Arcangeli wrote:
> 
> Note that the barrier in atomic_add_negative is useless here because it
> happens way too late, _after_ the count is decremented (not _before_)
> so the decreased count could be already visible to the other cpu.

Could you please point me to an architecture that does this?

This assumption is in fact made in a number of places in the kernel
where constructs such as atomic_add_negative or atomic_dec_and_test
are used and assumed to imply a memory barrier.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 19:48           ` Herbert Xu
@ 2005-10-15 20:07             ` Andrea Arcangeli
  2005-10-15 23:07               ` David S. Miller
  0 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2005-10-15 20:07 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Nick Piggin, benh, hugh, paulus, anton, torvalds, akpm,
	linux-kernel

On Sun, Oct 16, 2005 at 05:48:55AM +1000, Herbert Xu wrote:
> On Sat, Oct 15, 2005 at 06:00:18PM +0000, Andrea Arcangeli wrote:
> > 
> > Note that the barrier in atomic_add_negative is useless here because it
> > happens way too late, _after_ the count is decremented (not _before_)
> > so the decreased count could be already visible to the other cpu.
> 
> Could you please point me to an architecture that does this?

sure see alpha:

	__asm__ __volatile__(
	"1:	ldq_l %0,%1\n"
	"	addq %0,%3,%2\n"
	"	addq %0,%3,%0\n"
	"	stq_c %0,%1\n"
	"	beq %0,2f\n"
	"	mb\n"

the memory barrier is applied way after the write is visible to other
cpus, you can even get an irq before the mb and block there for some
usec.

>From a common code point of view, the barrier you mentioned in
atomic_add_negative is absolutely useless for this specific case
(setpagedirty+put_page)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 20:07             ` Andrea Arcangeli
@ 2005-10-15 23:07               ` David S. Miller
  2005-10-16 19:36                 ` Ivan Kokshaysky
  0 siblings, 1 reply; 21+ messages in thread
From: David S. Miller @ 2005-10-15 23:07 UTC (permalink / raw)
  To: andrea
  Cc: herbert, nickpiggin, benh, hugh, paulus, anton, torvalds, akpm,
	linux-kernel

From: Andrea Arcangeli <andrea@suse.de>
Date: Sat, 15 Oct 2005 22:07:01 +0200

> sure see alpha:
> 
> 	__asm__ __volatile__(
> 	"1:	ldq_l %0,%1\n"
> 	"	addq %0,%3,%2\n"
> 	"	addq %0,%3,%0\n"
> 	"	stq_c %0,%1\n"
> 	"	beq %0,2f\n"
> 	"	mb\n"
> 
> the memory barrier is applied way after the write is visible to other
> cpus, you can even get an irq before the mb and block there for some
> usec.

For atomic operations returning values, there must be a memory
barrier both before and after the atomic operation.  This is
defined in Documentation/atomic_ops.txt, so Alpha needs to be
fixed to add a memory barrier at the beginning of these
assembler sequences.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 23:07               ` David S. Miller
@ 2005-10-16 19:36                 ` Ivan Kokshaysky
  2005-10-17  4:29                   ` David S. Miller
  2005-10-17 11:28                   ` Andrea Arcangeli
  0 siblings, 2 replies; 21+ messages in thread
From: Ivan Kokshaysky @ 2005-10-16 19:36 UTC (permalink / raw)
  To: David S. Miller
  Cc: andrea, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds,
	akpm, linux-kernel

On Sat, Oct 15, 2005 at 04:07:02PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <andrea@suse.de>
> Date: Sat, 15 Oct 2005 22:07:01 +0200
> 
> > sure see alpha:
> > 
> > 	__asm__ __volatile__(
> > 	"1:	ldq_l %0,%1\n"
> > 	"	addq %0,%3,%2\n"
> > 	"	addq %0,%3,%0\n"
> > 	"	stq_c %0,%1\n"
> > 	"	beq %0,2f\n"
> > 	"	mb\n"
> > 
> > the memory barrier is applied way after the write is visible to other
> > cpus, you can even get an irq before the mb and block there for some
> > usec.
> 
> For atomic operations returning values, there must be a memory
> barrier both before and after the atomic operation.  This is
> defined in Documentation/atomic_ops.txt, so Alpha needs to be
> fixed to add a memory barrier at the beginning of these
> assembler sequences.

My opinion is that we don't need a barrier even _after_ ll/sc sequences
on Alpha - it's redundant; perhaps it's just a historical baggage.
I strongly believe that ll/sc _does_ imply an SMP memory barrier, as can
be seen from instructions timing: a standalone mb takes tens of cycles,
the mb before/after ll/sc takes 0 to 1 cycle on ev6 (a bit more on ev56)
depending on instruction slotting.
Note that superfluous mb's around atomic stuff still can hurt -
Alpha mb instruction also flushes IO write buffers, so it can
be _extremely_ expensive under heavy IO...

Ivan.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-16 19:36                 ` Ivan Kokshaysky
@ 2005-10-17  4:29                   ` David S. Miller
  2005-10-17  7:23                     ` Ivan Kokshaysky
  2005-10-17 11:28                   ` Andrea Arcangeli
  1 sibling, 1 reply; 21+ messages in thread
From: David S. Miller @ 2005-10-17  4:29 UTC (permalink / raw)
  To: ink
  Cc: andrea, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds,
	akpm, linux-kernel

From: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Date: Sun, 16 Oct 2005 23:36:00 +0400

> My opinion is that we don't need a barrier even _after_ ll/sc sequences
> on Alpha - it's redundant; perhaps it's just a historical baggage.
> I strongly believe that ll/sc _does_ imply an SMP memory barrier, as can
> be seen from instructions timing: a standalone mb takes tens of cycles,
> the mb before/after ll/sc takes 0 to 1 cycle on ev6 (a bit more on ev56)
> depending on instruction slotting.
> Note that superfluous mb's around atomic stuff still can hurt -
> Alpha mb instruction also flushes IO write buffers, so it can
> be _extremely_ expensive under heavy IO...

Even a quick google tells me that on Alpha, LL/SC have implicit
barriers only with respect to the locations being acted upon
by the LL/SC, but don't have more general barrier properties.

This is in line with how PPC is defined as well.

I truly believe that you would be removing those Alpha memory barriers
only at your own peril. :-)  And since the PPC and Alpha defined
semantics aparently match, I think it would be wise to add the missing
memory barriers from the front of the LL/SC sequences which need them,
just as PPC does.

I just spent 6 months tracking down a bug that turned out to be a
couple of missing memory barriers on sparc64.  So, this has taught me
to be extra careful in this area :-)  I really think this is an area
in which it's better to be safe than sorry.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-17  4:29                   ` David S. Miller
@ 2005-10-17  7:23                     ` Ivan Kokshaysky
  0 siblings, 0 replies; 21+ messages in thread
From: Ivan Kokshaysky @ 2005-10-17  7:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: andrea, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds,
	akpm, linux-kernel

On Sun, Oct 16, 2005 at 09:29:17PM -0700, David S. Miller wrote:
> Even a quick google tells me that on Alpha, LL/SC have implicit
> barriers only with respect to the locations being acted upon
> by the LL/SC, but don't have more general barrier properties.

Argh, you are right - I should have read the manual before posting...

Ivan.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-16 19:36                 ` Ivan Kokshaysky
  2005-10-17  4:29                   ` David S. Miller
@ 2005-10-17 11:28                   ` Andrea Arcangeli
  1 sibling, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2005-10-17 11:28 UTC (permalink / raw)
  To: Ivan Kokshaysky
  Cc: David S. Miller, herbert, nickpiggin, benh, hugh, paulus, anton,
	torvalds, akpm, linux-kernel

On Sun, Oct 16, 2005 at 11:36:00PM +0400, Ivan Kokshaysky wrote:
> Note that superfluous mb's around atomic stuff still can hurt -
> Alpha mb instruction also flushes IO write buffers, so it can
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's what needed to implement the wmb on alpha, this is why mb is
needed there and we need to add it at the top as well to comply with
docs (and especially for atomic_dec_and_test kind of usage like Dave
said).

Ivan I assume you'll take care of fixing it, thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 18:00         ` Andrea Arcangeli
  2005-10-15 19:48           ` Herbert Xu
@ 2005-10-15 22:16           ` Benjamin Herrenschmidt
  2005-10-15 23:13             ` David S. Miller
  1 sibling, 1 reply; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2005-10-15 22:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Herbert Xu, Nick Piggin, hugh, paulus, anton, torvalds, akpm,
	linux-kernel


> Even a write barrier is required on the left side, the read barrier on
> the right side is useless if there is no write barrier on the left side.
> 
> Note that the barrier in atomic_add_negative is useless here because it
> happens way too late, _after_ the count is decremented (not _before_)
> so the decreased count could be already visible to the other cpu.

Not on ppc64. Our atomic*return functions have a write barrier before
the atomic operation. I think there is just too much code out there that
either expects implicit barriers done by atomics (abuse them as locks)
or simply written by people who don't understand those ordering issues
(to be fair, they can be fairly brain damaging).

I agree this is not a good generic solution though.

> Not all archs are like x86 where a barrier happens implicitly both
> before and after the instruction, and the way atomic_add_negative is
> implemented the barrier from a common code point of view is only added
> _after_ the instruction. 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible memory ordering bug in page reclaim?
  2005-10-15 22:16           ` Benjamin Herrenschmidt
@ 2005-10-15 23:13             ` David S. Miller
  0 siblings, 0 replies; 21+ messages in thread
From: David S. Miller @ 2005-10-15 23:13 UTC (permalink / raw)
  To: benh
  Cc: andrea, herbert, nickpiggin, hugh, paulus, anton, torvalds, akpm,
	linux-kernel

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sun, 16 Oct 2005 08:16:30 +1000

> > Even a write barrier is required on the left side, the read barrier on
> > the right side is useless if there is no write barrier on the left side.
> > 
> > Note that the barrier in atomic_add_negative is useless here because it
> > happens way too late, _after_ the count is decremented (not _before_)
> > so the decreased count could be already visible to the other cpu.
> 
> Not on ppc64.

Or sparc64, or ppc, or ... any platform that adheres to the
rules specified in Documentation/atomic_ops.txt :-)

One really needs the barriers both before and after the atomic
operation when you return a value, because this is the only way to get
sane semantics for dropping the final reference to some object.  When
one does:

	if (atomic_dec_and_test(&obj->count))
		__free_stuff(obj);

All other modifications to object state before the atomic
decrement must reach global visibility _first_, and then
the atomic decrement itself must be strongly ordered wrt.
future memory operations.

That's why it is documented that the memory barriers both before and
after the atomic operation are required when the operation returns a
result value of some kind.

There is a lot of code in the kernel that depends upon this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2005-10-17  9:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-10-15  3:28 Possible memory ordering bug in page reclaim? Nick Piggin
2005-10-15  6:17 ` Hugh Dickins
2005-10-15  7:43   ` Benjamin Herrenschmidt
2005-10-15  8:00     ` Herbert Xu
2005-10-15 16:57       ` Linus Torvalds
2005-10-15 19:29         ` David S. Miller
2005-10-15 22:17           ` Benjamin Herrenschmidt
2005-10-16  0:04         ` Nick Piggin
2005-10-15  8:59     ` Nick Piggin
2005-10-15 12:08       ` Herbert Xu
2005-10-15 13:35         ` Nick Piggin
2005-10-15 18:00         ` Andrea Arcangeli
2005-10-15 19:48           ` Herbert Xu
2005-10-15 20:07             ` Andrea Arcangeli
2005-10-15 23:07               ` David S. Miller
2005-10-16 19:36                 ` Ivan Kokshaysky
2005-10-17  4:29                   ` David S. Miller
2005-10-17  7:23                     ` Ivan Kokshaysky
2005-10-17 11:28                   ` Andrea Arcangeli
2005-10-15 22:16           ` Benjamin Herrenschmidt
2005-10-15 23:13             ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox