* Possible memory ordering bug in page reclaim?
@ 2005-10-15 3:28 Nick Piggin
2005-10-15 6:17 ` Hugh Dickins
0 siblings, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2005-10-15 3:28 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Hugh Dickins, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 598 bytes --]
Is there anything that prevents PageDirty from theoretically being
speculatively loaded before page_count here? (see patch)
It would result in pagecache corruption in the following situation:
1 2
find_get_page();
write to page write_lock(tree_lock);
SetPageDirty(); if (page_count != 2
put_page(); || PageDirty())
Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
1, and page_count *after* put_page in 1.
Or am I seeing things that aren't there?
Thanks,
--
SUSE Labs, Novell Inc.
[-- Attachment #2: mm-reclaim-memorder-fix.patch --]
[-- Type: text/plain, Size: 604 bytes --]
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -511,7 +511,12 @@ static int shrink_list(struct list_head
* PageDirty _after_ making sure that the page is freeable and
* not in use by anybody. (pagecache + us == 2)
*/
- if (page_count(page) != 2 || PageDirty(page)) {
+ if (page_count(page) != 2) {
+ write_unlock_irq(&mapping->tree_lock);
+ goto keep_locked;
+ }
+ smp_rmb();
+ if (PageDirty(page)) {
write_unlock_irq(&mapping->tree_lock);
goto keep_locked;
}
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: Possible memory ordering bug in page reclaim? 2005-10-15 3:28 Possible memory ordering bug in page reclaim? Nick Piggin @ 2005-10-15 6:17 ` Hugh Dickins 2005-10-15 7:43 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 21+ messages in thread From: Hugh Dickins @ 2005-10-15 6:17 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, Ben Herrenschmidt, Andrea Arcangeli, linux-kernel On Sat, 15 Oct 2005, Nick Piggin wrote: > > Is there anything that prevents PageDirty from theoretically being > speculatively loaded before page_count here? (see patch) > > It would result in pagecache corruption in the following situation: > > 1 2 > find_get_page(); > write to page write_lock(tree_lock); > SetPageDirty(); if (page_count != 2 > put_page(); || PageDirty()) > > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in page->flags > 1, and page_count *after* put_page in 1. I think you're right. But I'm the last person to ask barrier/ordering questions of. CC'ed Ben and Andrea. Hugh > --- linux-2.6.orig/mm/vmscan.c > +++ linux-2.6/mm/vmscan.c > @@ -511,7 +511,12 @@ static int shrink_list(struct list_head > * PageDirty _after_ making sure that the page is freeable and > * not in use by anybody. (pagecache + us == 2) > */ > - if (page_count(page) != 2 || PageDirty(page)) { > + if (page_count(page) != 2) { > + write_unlock_irq(&mapping->tree_lock); > + goto keep_locked; > + } > + smp_rmb(); > + if (PageDirty(page)) { > write_unlock_irq(&mapping->tree_lock); > goto keep_locked; > } ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 6:17 ` Hugh Dickins @ 2005-10-15 7:43 ` Benjamin Herrenschmidt 2005-10-15 8:00 ` Herbert Xu 2005-10-15 8:59 ` Nick Piggin 0 siblings, 2 replies; 21+ messages in thread From: Benjamin Herrenschmidt @ 2005-10-15 7:43 UTC (permalink / raw) To: Hugh Dickins Cc: Paul Mackerras, Anton Blanchard, Nick Piggin, Linus Torvalds, Andrew Morton, Andrea Arcangeli, linux-kernel On Sat, 2005-10-15 at 07:17 +0100, Hugh Dickins wrote: > On Sat, 15 Oct 2005, Nick Piggin wrote: > > > > Is there anything that prevents PageDirty from theoretically being > > speculatively loaded before page_count here? (see patch) > > > > It would result in pagecache corruption in the following situation: > > > > 1 2 > > find_get_page(); > > write to page write_lock(tree_lock); > > SetPageDirty(); if (page_count != 2 > > put_page(); || PageDirty()) > > > > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in > page->flags > > 1, and page_count *after* put_page in 1. > > I think you're right. But I'm the last person to ask > barrier/ordering questions of. CC'ed Ben and Andrea. yup, now the question is wether PG_Dirty will be visible to CPU 2 before the page count is decremented right ? That depends on put_page, I suppose. If it's doing a simple atomic, there is an issue. But atomics with return has been so often abused as locks that they may have been implemented with a barrier... (On ppc64, it will do an eieio, thus I think it should be ok). There is also a problem the other way around. Write to page, then set page dirty... those writes may be visible to CPU 2 (that is the page content be dirty) before find_get_page even increased the page count, unless there is a barrier in there too. Paul, Anton ? Ben. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 7:43 ` Benjamin Herrenschmidt @ 2005-10-15 8:00 ` Herbert Xu 2005-10-15 16:57 ` Linus Torvalds 2005-10-15 8:59 ` Nick Piggin 1 sibling, 1 reply; 21+ messages in thread From: Herbert Xu @ 2005-10-15 8:00 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: hugh, paulus, anton, nickpiggin, torvalds, akpm, andrea, linux-kernel Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > >> > 1 2 >> > find_get_page(); >> > write to page write_lock(tree_lock); >> > SetPageDirty(); if (page_count != 2 >> > put_page(); || PageDirty()) >> > >> > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in >> page->flags >> > 1, and page_count *after* put_page in 1. > > yup, now the question is wether PG_Dirty will be visible to CPU 2 before > the page count is decremented right ? That depends on put_page, I > suppose. If it's doing a simple atomic, there is an issue. But atomics > with return has been so often abused as locks that they may have been > implemented with a barrier... (On ppc64, it will do an eieio, thus I > think it should be ok). Yes atomic_add_negative should always be a barrier. > There is also a problem the other way around. Write to page, then set > page dirty... those writes may be visible to CPU 2 (that is the page > content be dirty) before find_get_page even increased the page count, > unless there is a barrier in there too. find_get_page does a read_unlock_irq after the increment which also serves as a barrier. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 8:00 ` Herbert Xu @ 2005-10-15 16:57 ` Linus Torvalds 2005-10-15 19:29 ` David S. Miller 2005-10-16 0:04 ` Nick Piggin 0 siblings, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2005-10-15 16:57 UTC (permalink / raw) To: Herbert Xu Cc: Benjamin Herrenschmidt, hugh, paulus, anton, nickpiggin, akpm, andrea, linux-kernel On Sat, 15 Oct 2005, Herbert Xu wrote: > > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > yup, now the question is wether PG_Dirty will be visible to CPU 2 before > > the page count is decremented right ? That depends on put_page, I > > suppose. If it's doing a simple atomic, there is an issue. But atomics > > with return has been so often abused as locks that they may have been > > implemented with a barrier... (On ppc64, it will do an eieio, thus I > > think it should be ok). > > Yes atomic_add_negative should always be a barrier. I disagree. That would be very expensive on anything but x86, where it just happens to be true for other reasons. Atomics do _not_ implement barriers. Normally, we really don't care about any other decrement than the one that goes to zero, and that one tends to come with its own serialization logic. In this case, we don't even care. I agree, however, that it looks like PG_dirty is racy. Probably not in practice, but still. So I'd suggest adding a smp_wmb() into set_page_dirty, and the rmb where Nick suggested. So I'd suggest a patch something more like this.. Marking the dirty/count cases unlikely too in mm/page-writeback.c, since we should have tested for these conditions optimistically outside the lock. Comments? Nick, did you have some test-case that you think might actually have been impacted by this? Linus --- diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0166ea1..722779c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -668,8 +668,14 @@ int fastcall set_page_dirty(struct page return (*spd)(page); return __set_page_dirty_buffers(page); } - if (!PageDirty(page)) + if (!PageDirty(page)) { SetPageDirty(page); + /* + * Make sure the dirty bit percolates out + * to other CPU's before we release the page + */ + smp_wmb(); + } return 0; } EXPORT_SYMBOL(set_page_dirty); diff --git a/mm/vmscan.c b/mm/vmscan.c index 0ea71e8..64f9570 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -511,10 +511,11 @@ static int shrink_list(struct list_head * PageDirty _after_ making sure that the page is freeable and * not in use by anybody. (pagecache + us == 2) */ - if (page_count(page) != 2 || PageDirty(page)) { - write_unlock_irq(&mapping->tree_lock); - goto keep_locked; - } + if (unlikely(page_count(page) != 2)) + goto cannot_free; + smp_rmb(); + if (unlikely(PageDirty(page))) + goto cannot_free; #ifdef CONFIG_SWAP if (PageSwapCache(page)) { @@ -538,6 +539,10 @@ free_it: __pagevec_release_nonlru(&freed_pvec); continue; +cannot_free: + write_unlock_irq(&mapping->tree_lock); + goto keep_locked; + activate_locked: SetPageActive(page); pgactivate++; ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 16:57 ` Linus Torvalds @ 2005-10-15 19:29 ` David S. Miller 2005-10-15 22:17 ` Benjamin Herrenschmidt 2005-10-16 0:04 ` Nick Piggin 1 sibling, 1 reply; 21+ messages in thread From: David S. Miller @ 2005-10-15 19:29 UTC (permalink / raw) To: torvalds Cc: herbert, benh, hugh, paulus, anton, nickpiggin, akpm, andrea, linux-kernel From: Linus Torvalds <torvalds@osdl.org> Date: Sat, 15 Oct 2005 09:57:47 -0700 (PDT) > On Sat, 15 Oct 2005, Herbert Xu wrote: > > > > Yes atomic_add_negative should always be a barrier. > > I disagree. That would be very expensive on anything but x86, where it > just happens to be true for other reasons. Atomics do _not_ implement > barriers. When they return values, they are defined to be barriers. It's even on the documentation :-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 19:29 ` David S. Miller @ 2005-10-15 22:17 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 21+ messages in thread From: Benjamin Herrenschmidt @ 2005-10-15 22:17 UTC (permalink / raw) To: David S. Miller Cc: torvalds, herbert, hugh, paulus, anton, nickpiggin, akpm, andrea, linux-kernel On Sat, 2005-10-15 at 12:29 -0700, David S. Miller wrote: > From: Linus Torvalds <torvalds@osdl.org> > Date: Sat, 15 Oct 2005 09:57:47 -0700 (PDT) > > > On Sat, 15 Oct 2005, Herbert Xu wrote: > > > > > > Yes atomic_add_negative should always be a barrier. > > > > I disagree. That would be very expensive on anything but x86, where it > > just happens to be true for other reasons. Atomics do _not_ implement > > barriers. > > When they return values, they are defined to be barriers. > It's even on the documentation :-) Ahhh, good to know :) /me should read the documentation sometimes.... Ben. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 16:57 ` Linus Torvalds 2005-10-15 19:29 ` David S. Miller @ 2005-10-16 0:04 ` Nick Piggin 1 sibling, 0 replies; 21+ messages in thread From: Nick Piggin @ 2005-10-16 0:04 UTC (permalink / raw) To: Linus Torvalds Cc: Herbert Xu, Benjamin Herrenschmidt, hugh, paulus, anton, akpm, andrea, linux-kernel, David S. Miller Linus Torvalds wrote: > I agree, however, that it looks like PG_dirty is racy. Probably not in > practice, but still. > > So I'd suggest adding a smp_wmb() into set_page_dirty, and the rmb where > Nick suggested. > > So I'd suggest a patch something more like this.. Marking the dirty/count > cases unlikely too in mm/page-writeback.c, since we should have tested for > these conditions optimistically outside the lock. > As Dave suggested, I think there is too much other code that depends on these atomics to be barriers for us to change it (at least not in this patch! :)). > Comments? Nick, did you have some test-case that you think might actually > have been impacted by this? > I guess your vmscan.c hunks are slightly nicer, though I might put 'cannot_free' right at the end, because it will be a very uncommon case. And no, I don't have a test case. In fact, I wouldn't be surprised if nobody anywhere has ever hit it :) I was just browsing code... Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 7:43 ` Benjamin Herrenschmidt 2005-10-15 8:00 ` Herbert Xu @ 2005-10-15 8:59 ` Nick Piggin 2005-10-15 12:08 ` Herbert Xu 1 sibling, 1 reply; 21+ messages in thread From: Nick Piggin @ 2005-10-15 8:59 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Hugh Dickins, Paul Mackerras, Anton Blanchard, Linus Torvalds, Andrew Morton, Andrea Arcangeli, linux-kernel Benjamin Herrenschmidt wrote: > On Sat, 2005-10-15 at 07:17 +0100, Hugh Dickins wrote: > >>On Sat, 15 Oct 2005, Nick Piggin wrote: >> >>>1 2 >>>find_get_page(); >>>write to page write_lock(tree_lock); >>>SetPageDirty(); if (page_count != 2 >>>put_page(); || PageDirty()) >>> >>>Now I'm worried that 2 might see PageDirty *before* SetPageDirty in >> >> page->flags >> >>>1, and page_count *after* put_page in 1. >> >>I think you're right. But I'm the last person to ask >>barrier/ordering questions of. CC'ed Ben and Andrea. > > > yup, now the question is wether PG_Dirty will be visible to CPU 2 before > the page count is decremented right ? That depends on put_page, I > suppose. If it's doing a simple atomic, there is an issue. But atomics > with return has been so often abused as locks that they may have been > implemented with a barrier... (On ppc64, it will do an eieio, thus I > think it should be ok). > Well yes, that's on the store side (1, above). However can't a CPU still speculatively (eg. guess the branch) load the page->flags cacheline which might be satisfied from memory before the page->count cacheline loads? Ie. you can still have the correct write ordering but have incorrect read ordering? Because neither PageDirty nor page_count is a barrier, and there is no read barrier between them. Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 8:59 ` Nick Piggin @ 2005-10-15 12:08 ` Herbert Xu 2005-10-15 13:35 ` Nick Piggin 2005-10-15 18:00 ` Andrea Arcangeli 0 siblings, 2 replies; 21+ messages in thread From: Herbert Xu @ 2005-10-15 12:08 UTC (permalink / raw) To: Nick Piggin Cc: benh, hugh, paulus, anton, torvalds, akpm, andrea, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Well yes, that's on the store side (1, above). However can't a CPU > still speculatively (eg. guess the branch) load the page->flags > cacheline which might be satisfied from memory before the page->count > cacheline loads? Ie. you can still have the correct write ordering > but have incorrect read ordering? > > Because neither PageDirty nor page_count is a barrier, and there is > no read barrier between them. Yes you're right. A read barrier is required here. I think Ben was actually agreeing with you. He's just questioning whether the corresponding write barrier existed on CPU 1 (the answer to which is affirmative). Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 12:08 ` Herbert Xu @ 2005-10-15 13:35 ` Nick Piggin 2005-10-15 18:00 ` Andrea Arcangeli 1 sibling, 0 replies; 21+ messages in thread From: Nick Piggin @ 2005-10-15 13:35 UTC (permalink / raw) To: Herbert Xu Cc: benh, hugh, paulus, anton, torvalds, akpm, andrea, linux-kernel [-- Attachment #1: Type: text/plain, Size: 891 bytes --] Herbert Xu wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>Well yes, that's on the store side (1, above). However can't a CPU >>still speculatively (eg. guess the branch) load the page->flags >>cacheline which might be satisfied from memory before the page->count >>cacheline loads? Ie. you can still have the correct write ordering >>but have incorrect read ordering? >> >>Because neither PageDirty nor page_count is a barrier, and there is >>no read barrier between them. > > > Yes you're right. A read barrier is required here. > > I think Ben was actually agreeing with you. He's just questioning > whether the corresponding write barrier existed on CPU 1 (the answer > to which is affirmative). > Ah, that clears up my misunderstanding. Yes I agree the write side is OK. Thanks Ben and Herbert. I guess I should do a proper patch then. -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-reclaim-memorder-fix.patch --] [-- Type: text/plain, Size: 1560 bytes --] In mm/vmscan.c, the page reclaim may have the following sequence 2 running concurrently with sequence 1 on another CPU: 1 2 find_get_page(); write to page write_lock(tree_lock); SetPageDirty(); if (page_count != 2 put_page(); || PageDirty()) /* page dirty or busy */ else /* free it */ The comment indicates that PageDirty must be checked *after* page_count indicates there are no users of this page, which prevents the dirty bit from being lost in the case that that sequence 2 might see the state of PageDirty() *before* SetPageDirty() in 1, but page_count *after* put_page in 1. However, there is no read memory barrier there, and so nothing to stop a CPU from loading page_count before PageDirty (ie. ->flags). Theoretically, data corruption is possible. Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/mm/vmscan.c =================================================================== --- linux-2.6.orig/mm/vmscan.c +++ linux-2.6/mm/vmscan.c @@ -511,7 +511,12 @@ static int shrink_list(struct list_head * PageDirty _after_ making sure that the page is freeable and * not in use by anybody. (pagecache + us == 2) */ - if (page_count(page) != 2 || PageDirty(page)) { + if (page_count(page) != 2) { + write_unlock_irq(&mapping->tree_lock); + goto keep_locked; + } + smp_rmb(); + if (PageDirty(page)) { write_unlock_irq(&mapping->tree_lock); goto keep_locked; } ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 12:08 ` Herbert Xu 2005-10-15 13:35 ` Nick Piggin @ 2005-10-15 18:00 ` Andrea Arcangeli 2005-10-15 19:48 ` Herbert Xu 2005-10-15 22:16 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 21+ messages in thread From: Andrea Arcangeli @ 2005-10-15 18:00 UTC (permalink / raw) To: Herbert Xu Cc: Nick Piggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel On Sat, Oct 15, 2005 at 10:08:08PM +1000, Herbert Xu wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > Well yes, that's on the store side (1, above). However can't a CPU > > still speculatively (eg. guess the branch) load the page->flags > > cacheline which might be satisfied from memory before the page->count > > cacheline loads? Ie. you can still have the correct write ordering > > but have incorrect read ordering? > > > > Because neither PageDirty nor page_count is a barrier, and there is > > no read barrier between them. > > Yes you're right. A read barrier is required here. Even a write barrier is required on the left side, the read barrier on the right side is useless if there is no write barrier on the left side. Note that the barrier in atomic_add_negative is useless here because it happens way too late, _after_ the count is decremented (not _before_) so the decreased count could be already visible to the other cpu. Not all archs are like x86 where a barrier happens implicitly both before and after the instruction, and the way atomic_add_negative is implemented the barrier from a common code point of view is only added _after_ the instruction. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 18:00 ` Andrea Arcangeli @ 2005-10-15 19:48 ` Herbert Xu 2005-10-15 20:07 ` Andrea Arcangeli 2005-10-15 22:16 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 21+ messages in thread From: Herbert Xu @ 2005-10-15 19:48 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel On Sat, Oct 15, 2005 at 06:00:18PM +0000, Andrea Arcangeli wrote: > > Note that the barrier in atomic_add_negative is useless here because it > happens way too late, _after_ the count is decremented (not _before_) > so the decreased count could be already visible to the other cpu. Could you please point me to an architecture that does this? This assumption is in fact made in a number of places in the kernel where constructs such as atomic_add_negative or atomic_dec_and_test are used and assumed to imply a memory barrier. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 19:48 ` Herbert Xu @ 2005-10-15 20:07 ` Andrea Arcangeli 2005-10-15 23:07 ` David S. Miller 0 siblings, 1 reply; 21+ messages in thread From: Andrea Arcangeli @ 2005-10-15 20:07 UTC (permalink / raw) To: Herbert Xu Cc: Nick Piggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel On Sun, Oct 16, 2005 at 05:48:55AM +1000, Herbert Xu wrote: > On Sat, Oct 15, 2005 at 06:00:18PM +0000, Andrea Arcangeli wrote: > > > > Note that the barrier in atomic_add_negative is useless here because it > > happens way too late, _after_ the count is decremented (not _before_) > > so the decreased count could be already visible to the other cpu. > > Could you please point me to an architecture that does this? sure see alpha: __asm__ __volatile__( "1: ldq_l %0,%1\n" " addq %0,%3,%2\n" " addq %0,%3,%0\n" " stq_c %0,%1\n" " beq %0,2f\n" " mb\n" the memory barrier is applied way after the write is visible to other cpus, you can even get an irq before the mb and block there for some usec. >From a common code point of view, the barrier you mentioned in atomic_add_negative is absolutely useless for this specific case (setpagedirty+put_page) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 20:07 ` Andrea Arcangeli @ 2005-10-15 23:07 ` David S. Miller 2005-10-16 19:36 ` Ivan Kokshaysky 0 siblings, 1 reply; 21+ messages in thread From: David S. Miller @ 2005-10-15 23:07 UTC (permalink / raw) To: andrea Cc: herbert, nickpiggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel From: Andrea Arcangeli <andrea@suse.de> Date: Sat, 15 Oct 2005 22:07:01 +0200 > sure see alpha: > > __asm__ __volatile__( > "1: ldq_l %0,%1\n" > " addq %0,%3,%2\n" > " addq %0,%3,%0\n" > " stq_c %0,%1\n" > " beq %0,2f\n" > " mb\n" > > the memory barrier is applied way after the write is visible to other > cpus, you can even get an irq before the mb and block there for some > usec. For atomic operations returning values, there must be a memory barrier both before and after the atomic operation. This is defined in Documentation/atomic_ops.txt, so Alpha needs to be fixed to add a memory barrier at the beginning of these assembler sequences. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 23:07 ` David S. Miller @ 2005-10-16 19:36 ` Ivan Kokshaysky 2005-10-17 4:29 ` David S. Miller 2005-10-17 11:28 ` Andrea Arcangeli 0 siblings, 2 replies; 21+ messages in thread From: Ivan Kokshaysky @ 2005-10-16 19:36 UTC (permalink / raw) To: David S. Miller Cc: andrea, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel On Sat, Oct 15, 2005 at 04:07:02PM -0700, David S. Miller wrote: > From: Andrea Arcangeli <andrea@suse.de> > Date: Sat, 15 Oct 2005 22:07:01 +0200 > > > sure see alpha: > > > > __asm__ __volatile__( > > "1: ldq_l %0,%1\n" > > " addq %0,%3,%2\n" > > " addq %0,%3,%0\n" > > " stq_c %0,%1\n" > > " beq %0,2f\n" > > " mb\n" > > > > the memory barrier is applied way after the write is visible to other > > cpus, you can even get an irq before the mb and block there for some > > usec. > > For atomic operations returning values, there must be a memory > barrier both before and after the atomic operation. This is > defined in Documentation/atomic_ops.txt, so Alpha needs to be > fixed to add a memory barrier at the beginning of these > assembler sequences. My opinion is that we don't need a barrier even _after_ ll/sc sequences on Alpha - it's redundant; perhaps it's just a historical baggage. I strongly believe that ll/sc _does_ imply an SMP memory barrier, as can be seen from instructions timing: a standalone mb takes tens of cycles, the mb before/after ll/sc takes 0 to 1 cycle on ev6 (a bit more on ev56) depending on instruction slotting. Note that superfluous mb's around atomic stuff still can hurt - Alpha mb instruction also flushes IO write buffers, so it can be _extremely_ expensive under heavy IO... Ivan. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-16 19:36 ` Ivan Kokshaysky @ 2005-10-17 4:29 ` David S. Miller 2005-10-17 7:23 ` Ivan Kokshaysky 2005-10-17 11:28 ` Andrea Arcangeli 1 sibling, 1 reply; 21+ messages in thread From: David S. Miller @ 2005-10-17 4:29 UTC (permalink / raw) To: ink Cc: andrea, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel From: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Date: Sun, 16 Oct 2005 23:36:00 +0400 > My opinion is that we don't need a barrier even _after_ ll/sc sequences > on Alpha - it's redundant; perhaps it's just a historical baggage. > I strongly believe that ll/sc _does_ imply an SMP memory barrier, as can > be seen from instructions timing: a standalone mb takes tens of cycles, > the mb before/after ll/sc takes 0 to 1 cycle on ev6 (a bit more on ev56) > depending on instruction slotting. > Note that superfluous mb's around atomic stuff still can hurt - > Alpha mb instruction also flushes IO write buffers, so it can > be _extremely_ expensive under heavy IO... Even a quick google tells me that on Alpha, LL/SC have implicit barriers only with respect to the locations being acted upon by the LL/SC, but don't have more general barrier properties. This is in line with how PPC is defined as well. I truly believe that you would be removing those Alpha memory barriers only at your own peril. :-) And since the PPC and Alpha defined semantics aparently match, I think it would be wise to add the missing memory barriers from the front of the LL/SC sequences which need them, just as PPC does. I just spent 6 months tracking down a bug that turned out to be a couple of missing memory barriers on sparc64. So, this has taught me to be extra careful in this area :-) I really think this is an area in which it's better to be safe than sorry. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-17 4:29 ` David S. Miller @ 2005-10-17 7:23 ` Ivan Kokshaysky 0 siblings, 0 replies; 21+ messages in thread From: Ivan Kokshaysky @ 2005-10-17 7:23 UTC (permalink / raw) To: David S. Miller Cc: andrea, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel On Sun, Oct 16, 2005 at 09:29:17PM -0700, David S. Miller wrote: > Even a quick google tells me that on Alpha, LL/SC have implicit > barriers only with respect to the locations being acted upon > by the LL/SC, but don't have more general barrier properties. Argh, you are right - I should have read the manual before posting... Ivan. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-16 19:36 ` Ivan Kokshaysky 2005-10-17 4:29 ` David S. Miller @ 2005-10-17 11:28 ` Andrea Arcangeli 1 sibling, 0 replies; 21+ messages in thread From: Andrea Arcangeli @ 2005-10-17 11:28 UTC (permalink / raw) To: Ivan Kokshaysky Cc: David S. Miller, herbert, nickpiggin, benh, hugh, paulus, anton, torvalds, akpm, linux-kernel On Sun, Oct 16, 2005 at 11:36:00PM +0400, Ivan Kokshaysky wrote: > Note that superfluous mb's around atomic stuff still can hurt - > Alpha mb instruction also flushes IO write buffers, so it can ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ That's what needed to implement the wmb on alpha, this is why mb is needed there and we need to add it at the top as well to comply with docs (and especially for atomic_dec_and_test kind of usage like Dave said). Ivan I assume you'll take care of fixing it, thanks! ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 18:00 ` Andrea Arcangeli 2005-10-15 19:48 ` Herbert Xu @ 2005-10-15 22:16 ` Benjamin Herrenschmidt 2005-10-15 23:13 ` David S. Miller 1 sibling, 1 reply; 21+ messages in thread From: Benjamin Herrenschmidt @ 2005-10-15 22:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Herbert Xu, Nick Piggin, hugh, paulus, anton, torvalds, akpm, linux-kernel > Even a write barrier is required on the left side, the read barrier on > the right side is useless if there is no write barrier on the left side. > > Note that the barrier in atomic_add_negative is useless here because it > happens way too late, _after_ the count is decremented (not _before_) > so the decreased count could be already visible to the other cpu. Not on ppc64. Our atomic*return functions have a write barrier before the atomic operation. I think there is just too much code out there that either expects implicit barriers done by atomics (abuse them as locks) or simply written by people who don't understand those ordering issues (to be fair, they can be fairly brain damaging). I agree this is not a good generic solution though. > Not all archs are like x86 where a barrier happens implicitly both > before and after the instruction, and the way atomic_add_negative is > implemented the barrier from a common code point of view is only added > _after_ the instruction. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Possible memory ordering bug in page reclaim? 2005-10-15 22:16 ` Benjamin Herrenschmidt @ 2005-10-15 23:13 ` David S. Miller 0 siblings, 0 replies; 21+ messages in thread From: David S. Miller @ 2005-10-15 23:13 UTC (permalink / raw) To: benh Cc: andrea, herbert, nickpiggin, hugh, paulus, anton, torvalds, akpm, linux-kernel From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Sun, 16 Oct 2005 08:16:30 +1000 > > Even a write barrier is required on the left side, the read barrier on > > the right side is useless if there is no write barrier on the left side. > > > > Note that the barrier in atomic_add_negative is useless here because it > > happens way too late, _after_ the count is decremented (not _before_) > > so the decreased count could be already visible to the other cpu. > > Not on ppc64. Or sparc64, or ppc, or ... any platform that adheres to the rules specified in Documentation/atomic_ops.txt :-) One really needs the barriers both before and after the atomic operation when you return a value, because this is the only way to get sane semantics for dropping the final reference to some object. When one does: if (atomic_dec_and_test(&obj->count)) __free_stuff(obj); All other modifications to object state before the atomic decrement must reach global visibility _first_, and then the atomic decrement itself must be strongly ordered wrt. future memory operations. That's why it is documented that the memory barriers both before and after the atomic operation are required when the operation returns a result value of some kind. There is a lot of code in the kernel that depends upon this. ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2005-10-17 9:31 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-10-15 3:28 Possible memory ordering bug in page reclaim? Nick Piggin 2005-10-15 6:17 ` Hugh Dickins 2005-10-15 7:43 ` Benjamin Herrenschmidt 2005-10-15 8:00 ` Herbert Xu 2005-10-15 16:57 ` Linus Torvalds 2005-10-15 19:29 ` David S. Miller 2005-10-15 22:17 ` Benjamin Herrenschmidt 2005-10-16 0:04 ` Nick Piggin 2005-10-15 8:59 ` Nick Piggin 2005-10-15 12:08 ` Herbert Xu 2005-10-15 13:35 ` Nick Piggin 2005-10-15 18:00 ` Andrea Arcangeli 2005-10-15 19:48 ` Herbert Xu 2005-10-15 20:07 ` Andrea Arcangeli 2005-10-15 23:07 ` David S. Miller 2005-10-16 19:36 ` Ivan Kokshaysky 2005-10-17 4:29 ` David S. Miller 2005-10-17 7:23 ` Ivan Kokshaysky 2005-10-17 11:28 ` Andrea Arcangeli 2005-10-15 22:16 ` Benjamin Herrenschmidt 2005-10-15 23:13 ` David S. Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox