* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Linus Torvalds @ 2012-10-17  2:23 UTC
To: Mikulas Patocka
Cc: Jens Axboe, linux-kernel, linux-arch

[ Architecture people, note the potential new SMP barrier! ]

On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> +	/*
> +	 * The lock is considered unlocked when p->locked is set to false.
> +	 * Use a barrier to prevent reordering of operations around p->locked.
> +	 */
> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> +	barrier();
> +#else
> +	smp_mb();
> +#endif
> 	p->locked = false;

Ugh. The #if is too ugly to live.

This is a classic case of "people who write their own serialization
primitives invariably get them wrong". And this fix is just horrible,
and code like this should not be allowed.

I suspect we could make the above x86-specific optimization be valid by
introducing a new barrier, called "smp_release_before_store()" or
something like that, which on x86 just happens to be a no-op (but still
a compiler barrier). That's because earlier reads will not pass later
stores, and stores are viewed in program order, so all x86 stores have
"release consistency" modulo the theoretical PPro bugs (that I don't
think we actually ever saw in practice).

But it is possible that that is true on other architectures too, so
your hand-crafted x86-specific #if is not just ugly, it's liable to get
worse.

The alternative is to just say that we should use "smp_mb()"
unconditionally, depending on how critical this code actually ends up
being.

Added linux-arch in case architecture maintainers have comments on this
"smp_release_before_store()" thing. It would be kind of similar to the
odd "smp_read_barrier_depends()", except it would normally be a full
memory barrier, except on architectures where a weaker barrier might be
sufficient.

I suspect there may be many architectures where a "smp_wmb()" is
sufficient for this case, for the simple reason that no sane
microarchitecture would *ever* move earlier reads down past a later
write, so release consistency really only needs the local writes to be
ordered, not the full memory ordering.

Arch people?

The more optimal solution may be to mark the store *itself* to be
"store with release consistency", which on x86 would be a regular store
(with the compiler barrier), but on other architectures may be a
special memory operation. On architectures with
release/acquire consistency, there's not a separate barrier before the
store; the store instruction itself is done with special semantics. So
maybe the right thing to do is

	#define smp_release_consistency_store(val, ptr) ...

where on x86, the implementation would be a simple

	do { barrier(); *(ptr) = (val); } while (0)

but on other architectures it might be an inline asm with the required
magic store-with-release instruction.

How important is this code sequence? Is the "percpu_up_write()"
function really so critical that we can't have an extra memory barrier?
Or do people perhaps see *other* places where release-consistency
stores might be worth doing?

But in no event do I want to see that butt-ugly #if statement.

		Linus
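
[ Note: below is a minimal sketch of the "smp_release_consistency_store()"
  idea described above. The override/fallback arrangement is an assumption
  made for illustration; no such macro existed in the tree at this point. ]

	/*
	 * x86: plain stores already have release semantics (modulo the
	 * PPro fencing quirks), so a compiler barrier is enough and no
	 * barrier instruction is emitted.
	 */
	#if defined(CONFIG_X86) && !defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE)
	#define smp_release_consistency_store(val, ptr) \
		do { barrier(); *(ptr) = (val); } while (0)
	#endif

	/* Generic fallback: a full barrier before the store is always safe. */
	#ifndef smp_release_consistency_store
	#define smp_release_consistency_store(val, ptr) \
		do { smp_mb(); *(ptr) = (val); } while (0)
	#endif

With such a primitive, percpu_up_write() could end with
smp_release_consistency_store(false, &p->locked) instead of the
open-coded #if.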
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Lai Jiangshan @ 2012-10-17  5:58 UTC
To: Linus Torvalds
Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-arch,
    Paul E. McKenney, Peter Zijlstra, Steven Rostedt,
    Thomas Gleixner, Eric Dumazet

On 10/17/2012 10:23 AM, Linus Torvalds wrote:
> [ Architecture people, note the potential new SMP barrier! ]
>
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> +	/*
>> +	 * The lock is considered unlocked when p->locked is set to false.
>> +	 * Use a barrier to prevent reordering of operations around p->locked.
>> +	 */
>> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
>> +	barrier();
>> +#else
>> +	smp_mb();
>> +#endif
>> 	p->locked = false;
>
> Ugh. The #if is too ugly to live.

Even if the previous patch is applied, percpu_down_read() still needs
an mb() to pair with it.

> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.

One of the biggest problems with commit
62ac665ff9fc07497ca524bd20d6a96893d11071 is that it was merged without
an Acked-by or Reviewed-by from Paul or Peter or someone else who is an
expert on synchronization and arch memory models.

I suggest that any new synchronization primitive stay in -tip for two
or more cycles before being merged to mainline.

Thanks,
Lai
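
[ Note: for context, the fast paths under discussion, reconstructed
  approximately from the 3.7-rc percpu-rwsem code introduced by commit
  62ac665ff9fc ("blockdev: turn a rw semaphore into a percpu rw
  semaphore"); details may differ slightly from the exact tree. ]

	struct percpu_rw_semaphore {
		unsigned __percpu *counters;
		bool locked;
		struct mutex mtx;
	};

	static inline void percpu_down_read(struct percpu_rw_semaphore *p)
	{
		rcu_read_lock();
		if (unlikely(p->locked)) {
			/* slow path: a writer holds the lock */
			rcu_read_unlock();
			mutex_lock(&p->mtx);
			this_cpu_inc(*p->counters);
			mutex_unlock(&p->mtx);
			return;
		}
		/* fast path: no writer seen, just bump the per-cpu count */
		this_cpu_inc(*p->counters);
		rcu_read_unlock();
	}

	static inline void percpu_down_write(struct percpu_rw_semaphore *p)
	{
		mutex_lock(&p->mtx);
		p->locked = true;
		synchronize_rcu();	/* wait for readers to leave the fast path */
		while (__percpu_count(p->counters))
			msleep(1);	/* wait for in-flight readers to drain */
		smp_rmb();	/* paired with smp_mb() in percpu_up_read() */
	}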
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka @ 2012-10-17 15:07 UTC
To: Lai Jiangshan
Cc: Linus Torvalds, Jens Axboe, linux-kernel, linux-arch,
    Paul E. McKenney, Peter Zijlstra, Steven Rostedt,
    Thomas Gleixner, Eric Dumazet

Hi

On Wed, 17 Oct 2012, Lai Jiangshan wrote:

> On 10/17/2012 10:23 AM, Linus Torvalds wrote:
> > [ Architecture people, note the potential new SMP barrier! ]
> >
> > On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> +	/*
> >> +	 * The lock is considered unlocked when p->locked is set to false.
> >> +	 * Use a barrier to prevent reordering of operations around p->locked.
> >> +	 */
> >> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> >> +	barrier();
> >> +#else
> >> +	smp_mb();
> >> +#endif
> >> 	p->locked = false;
> >
> > Ugh. The #if is too ugly to live.
>
> Even if the previous patch is applied, percpu_down_read() still needs
> an mb() to pair with it.

percpu_down_read uses rcu_read_lock, which should guarantee that memory
accesses don't escape in front of an rcu-protected section.

If rcu_read_unlock has only an unlock barrier and not a full barrier,
memory accesses could be moved in front of rcu_read_unlock and
reordered with this_cpu_inc(*p->counters), but it doesn't matter,
because percpu_down_write does synchronize_rcu(), so it never sees
these accesses halfway through.

> > This is a classic case of "people who write their own serialization
> > primitives invariably get them wrong". And this fix is just horrible,
> > and code like this should not be allowed.
>
> One of the biggest problems with commit
> 62ac665ff9fc07497ca524bd20d6a96893d11071 is that it was merged without
> an Acked-by or Reviewed-by from Paul or Peter or someone else who is an
> expert on synchronization and arch memory models.
>
> I suggest that any new synchronization primitive stay in -tip for two
> or more cycles before being merged to mainline.

But the bug that this synchronization is fixing is quite serious (it
causes random crashes when the block size is being changed; the crash
happens regularly at multiple important business sites), so it must be
fixed soon and not wait half a year.

> Thanks,
> Lai

Mikulas
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Steven Rostedt @ 2012-10-17 20:28 UTC
To: Mikulas Patocka
Cc: Lai Jiangshan, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> >
> > Even if the previous patch is applied, percpu_down_read() still needs
> > an mb() to pair with it.
>
> percpu_down_read uses rcu_read_lock, which should guarantee that memory
> accesses don't escape in front of an rcu-protected section.

You do realize that rcu_read_lock() does nothing more than a barrier(),
right?

Paul worked really hard to get rcu_read_lock() to not call HW barriers.

> If rcu_read_unlock has only an unlock barrier and not a full barrier,
> memory accesses could be moved in front of rcu_read_unlock and
> reordered with this_cpu_inc(*p->counters), but it doesn't matter,
> because percpu_down_write does synchronize_rcu(), so it never sees
> these accesses halfway through.

Looking at the patch, you are correct. The read side doesn't need the
memory barrier, as the worst thing that will happen is that it sees
locked == false and will just grab the mutex unnecessarily.

> > I suggest that any new synchronization primitive stay in -tip for two
> > or more cycles before being merged to mainline.
>
> But the bug that this synchronization is fixing is quite serious (it
> causes random crashes when the block size is being changed; the crash
> happens regularly at multiple important business sites), so it must be
> fixed soon and not wait half a year.

I don't think Lai was suggesting to wait on this fix, but instead to
totally rip out the percpu_rwsems, work on them some more, and then
re-introduce them in half a year.

-- Steve
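
[ Note: a sketch of Steve's point, assuming a !CONFIG_PREEMPT kernel and
  ignoring lockdep/sparse annotations; the real code routes through
  __rcu_read_lock() and preempt_disable(). ]

	static inline void rcu_read_lock(void)
	{
		preempt_disable();	/* just barrier() when !CONFIG_PREEMPT:
					 * no hardware barrier is emitted */
	}

	static inline void rcu_read_unlock(void)
	{
		preempt_enable();	/* likewise just barrier() here */
	}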
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Lai Jiangshan @ 2012-10-18  2:18 UTC
To: Mikulas Patocka
Cc: Steven Rostedt, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
>>>
>>> Even if the previous patch is applied, percpu_down_read() still needs
>>> an mb() to pair with it.
>>
>> percpu_down_read uses rcu_read_lock, which should guarantee that memory
>> accesses don't escape in front of an rcu-protected section.
>
> You do realize that rcu_read_lock() does nothing more than a barrier(),
> right?
>
> Paul worked really hard to get rcu_read_lock() to not call HW barriers.
>
>> If rcu_read_unlock has only an unlock barrier and not a full barrier,
>> memory accesses could be moved in front of rcu_read_unlock and
>> reordered with this_cpu_inc(*p->counters), but it doesn't matter,
>> because percpu_down_write does synchronize_rcu(), so it never sees
>> these accesses halfway through.
>
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier, as the worst thing that will happen is that it sees
> locked == false and will just grab the mutex unnecessarily.

---------------------

A memory barrier can be added only if these two things are known:
1) which reordering it prevents (between what and what), and
2) which corresponding mb() it pairs with.

You tried to add an mb() in percpu_up_write(). OK, I see that it
prevents the reordering between the writes to the protected data and
the statement "p->locked = false", but I can't find the corresponding
mb() that it pairs with. (In the trace below, the left column is the
reader; the right column is the writer.)

	percpu_down_read()			writes to the data
	  the cpu caches/prefetches		writes to the data
	  the data (which is stale)		writes to the data
						percpu_up_write()
						  mb()
						  p->locked = false;
	  unlikely(p->locked)
	    (the cpu sees p->locked == false,
	     doesn't discard the cached/
	     prefetched data)
	  this_cpu_inc(*p->counters);
	  the code of read-access to the data
	  **** and we use the stale data ****

So you need to add an mb() after "unlikely(p->locked)".

-------------------------

The RCU here doesn't protect any data. It protects the code of the
fast path:

	unlikely(p->locked);
	this_cpu_inc(*p->counters);

and synchronize_rcu() ensures that all previous fast paths have fully
finished "this_cpu_inc(*p->counters);". It doesn't protect any other
code or data; if you want to protect other code or other data, please
add more synchronization or mb()s.

---------------

I extremely hate synchronization that protects code instead of data,
but sometimes I also have to do it.
---------------

A very draft example of paired mb()s is here:

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index cf80f7e..84a93c0 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
 	struct mutex mtx;
 };
 
+#if 1
+#define light_mb()	barrier()
+#define heavy_mb()	synchronize_sched()
+#else
+#define light_mb()	smp_mb()
+#define heavy_mb()	smp_mb()
+#endif
+
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
 	rcu_read_lock();
@@ -24,22 +32,12 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 	}
 	this_cpu_inc(*p->counters);
 	rcu_read_unlock();
+	light_mb(); /* A, between read of p->locked and read of data, paired with D */
 }
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	/*
-	 * On X86, write operation in this_cpu_dec serves as a memory unlock
-	 * barrier (i.e. memory accesses may be moved before the write, but
-	 * no memory accesses are moved past the write).
-	 * On other architectures this may not be the case, so we need smp_mb()
-	 * there.
-	 */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-	barrier();
-#else
-	smp_mb();
-#endif
+	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
 	this_cpu_dec(*p->counters);
 }
 
@@ -61,11 +59,12 @@ static inline void percpu_down_write(struct percpu_rw_semaphore *p)
 	synchronize_rcu();
 	while (__percpu_count(p->counters))
 		msleep(1);
-	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
+	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
 }
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Steven Rostedt @ 2012-10-18  4:13 UTC
To: Lai Jiangshan
Cc: Mikulas Patocka, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote:
> >
> > Looking at the patch, you are correct. The read side doesn't need the
> > memory barrier, as the worst thing that will happen is that it sees
> > locked == false and will just grab the mutex unnecessarily.
>
> ---------------------
> A memory barrier can be added only if these two things are known:
> 1) which reordering it prevents (between what and what), and
> 2) which corresponding mb() it pairs with.

OK, I was just looking at the protection and actions of the locked
flag, but I see what you are saying with the data itself.

> You tried to add an mb() in percpu_up_write(). OK, I see that it
> prevents the reordering between the writes to the protected data and
> the statement "p->locked = false", but I can't find the corresponding
> mb() that it pairs with. (In the trace below, the left column is the
> reader; the right column is the writer.)
>
>	percpu_down_read()			writes to the data
>	  the cpu caches/prefetches		writes to the data
>	  the data (which is stale)		writes to the data
>						percpu_up_write()
>						  mb()
>						  p->locked = false;
>	  unlikely(p->locked)
>	    (the cpu sees p->locked == false,
>	     doesn't discard the cached/
>	     prefetched data)
>	  this_cpu_inc(*p->counters);
>	  the code of read-access to the data
>	  **** and we use the stale data ****
>
> So you need to add an mb() after "unlikely(p->locked)".

Does it need a full mb() or could it be just an rmb()? I wouldn't
think the down_read would need to protect against stores, would it?
The memory barrier should probably go in front of the unlikely() too.
The write to p->counters is handled by the synchronize_sched(), and
adding an rmb() in front of the unlikely check would keep prefetched
data from passing this barrier.

This is a perfect example of why this primitive should be vetted
outside of mainline before it gets merged.

-- Steve
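
[ Note: a sketch of the placement Steve describes. This exact variant was
  never posted as a patch, and whether an rmb() here is sufficient is
  precisely the point being debated. ]

	static inline void percpu_down_read(struct percpu_rw_semaphore *p)
	{
		rcu_read_lock();
		smp_rmb();	/* Steve's suggested placement: keep reads of
				 * the protected data from being satisfied
				 * before the read of p->locked below */
		if (unlikely(p->locked)) {
			rcu_read_unlock();
			mutex_lock(&p->mtx);
			this_cpu_inc(*p->counters);
			mutex_unlock(&p->mtx);
			return;
		}
		this_cpu_inc(*p->counters);
		rcu_read_unlock();
	}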
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka @ 2012-10-18 16:17 UTC
To: Steven Rostedt
Cc: Lai Jiangshan, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

On Thu, 18 Oct 2012, Steven Rostedt wrote:

> On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote:
> > >
> > > Looking at the patch, you are correct. The read side doesn't need the
> > > memory barrier, as the worst thing that will happen is that it sees
> > > locked == false and will just grab the mutex unnecessarily.
> >
> > ---------------------
> > A memory barrier can be added only if these two things are known:
> > 1) which reordering it prevents (between what and what), and
> > 2) which corresponding mb() it pairs with.
>
> OK, I was just looking at the protection and actions of the locked
> flag, but I see what you are saying with the data itself.
>
> > You tried to add an mb() in percpu_up_write(). OK, I see that it
> > prevents the reordering between the writes to the protected data and
> > the statement "p->locked = false", but I can't find the corresponding
> > mb() that it pairs with.
> >
> > So you need to add an mb() after "unlikely(p->locked)".
>
> Does it need a full mb() or could it be just an rmb()? I wouldn't
> think the down_read would need to protect against stores, would it?
> The memory barrier should probably go in front of the unlikely() too.
> The write to p->counters is handled by the synchronize_sched(), and
> adding an rmb() in front of the unlikely check would keep prefetched
> data from passing this barrier.
>
> This is a perfect example of why this primitive should be vetted
> outside of mainline before it gets merged.
>
> -- Steve

If we do synchronize_rcu() in percpu_up_write, we don't need a barrier
in percpu_down_read(). So I would do that.

Mikulas
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka @ 2012-10-18 15:32 UTC
To: Lai Jiangshan
Cc: Steven Rostedt, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

On Thu, 18 Oct 2012, Lai Jiangshan wrote:

> On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> > On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> >>>
> >>> Even if the previous patch is applied, percpu_down_read() still needs
> >>> an mb() to pair with it.
> >>
> >> percpu_down_read uses rcu_read_lock, which should guarantee that memory
> >> accesses don't escape in front of an rcu-protected section.
> >
> > You do realize that rcu_read_lock() does nothing more than a barrier(),
> > right?
> >
> > Paul worked really hard to get rcu_read_lock() to not call HW barriers.
> >
> >> If rcu_read_unlock has only an unlock barrier and not a full barrier,
> >> memory accesses could be moved in front of rcu_read_unlock and
> >> reordered with this_cpu_inc(*p->counters), but it doesn't matter,
> >> because percpu_down_write does synchronize_rcu(), so it never sees
> >> these accesses halfway through.
> >
> > Looking at the patch, you are correct. The read side doesn't need the
> > memory barrier, as the worst thing that will happen is that it sees
> > locked == false and will just grab the mutex unnecessarily.
>
> ---------------------
> A memory barrier can be added only if these two things are known:
> 1) which reordering it prevents (between what and what), and
> 2) which corresponding mb() it pairs with.
>
> You tried to add an mb() in percpu_up_write(). OK, I see that it
> prevents the reordering between the writes to the protected data and
> the statement "p->locked = false", but I can't find the corresponding
> mb() that it pairs with.

Or alternatively, instead of a barrier, we can do this.

Mikulas

---
percpu-rwsem: use barrier in unlock path

The lock is considered unlocked when p->locked is set to false.
Use a barrier to prevent reordering of operations around p->locked.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-17 20:48:40.000000000 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-18 17:19:24.000000000 +0200
@@ -66,6 +66,12 @@ static inline void percpu_down_write(str
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	/*
+	 * Make sure that other processes that are in an rcu section and that
+	 * may have seen partially modified state exit the rcu section and
+	 * try to grab the mutex.
+	 */
+	synchronize_rcu();
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka @ 2012-10-18 19:56 UTC
To: Lai Jiangshan
Cc: Steven Rostedt, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

This patch looks sensible. I'd apply either this or my previous patch
that adds synchronize_rcu() to percpu_up_write.

This patch avoids the memory barrier on non-x86 cpus in
percpu_up_read, so it is faster than the previous approach.

Mikulas

On Thu, 18 Oct 2012, Lai Jiangshan wrote:

> ---------------
>
> A very draft example of paired mb()s is here:
>
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index cf80f7e..84a93c0 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
>  	struct mutex mtx;
>  };
>  
> +#if 1
> +#define light_mb()	barrier()
> +#define heavy_mb()	synchronize_sched()
> +#else
> +#define light_mb()	smp_mb()
> +#define heavy_mb()	smp_mb()
> +#endif
> +
>  static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  {
>  	rcu_read_lock();
> @@ -24,22 +32,12 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  	}
>  	this_cpu_inc(*p->counters);
>  	rcu_read_unlock();
> +	light_mb(); /* A, between read of p->locked and read of data, paired with D */
>  }
>  
>  static inline void percpu_up_read(struct percpu_rw_semaphore *p)
>  {
> -	/*
> -	 * On X86, write operation in this_cpu_dec serves as a memory unlock
> -	 * barrier (i.e. memory accesses may be moved before the write, but
> -	 * no memory accesses are moved past the write).
> -	 * On other architectures this may not be the case, so we need smp_mb()
> -	 * there.
> -	 */
> -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> -	barrier();
> -#else
> -	smp_mb();
> -#endif
> +	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
>  	this_cpu_dec(*p->counters);
>  }
>  
> @@ -61,11 +59,12 @@ static inline void percpu_down_write(struct percpu_rw_semaphore *p)
>  	synchronize_rcu();
>  	while (__percpu_count(p->counters))
>  		msleep(1);
> -	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
> +	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
>  }
>  
>  static inline void percpu_up_write(struct percpu_rw_semaphore *p)
>  {
> +	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
>  	p->locked = false;
>  	mutex_unlock(&p->mtx);
>  }
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka @ 2012-10-18 16:05 UTC
To: Steven Rostedt
Cc: Lai Jiangshan, Linus Torvalds, Jens Axboe, linux-kernel,
    linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
    Eric Dumazet

On Wed, 17 Oct 2012, Steven Rostedt wrote:

> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> > >
> > > Even if the previous patch is applied, percpu_down_read() still needs
> > > an mb() to pair with it.
> >
> > percpu_down_read uses rcu_read_lock, which should guarantee that memory
> > accesses don't escape in front of an rcu-protected section.
>
> You do realize that rcu_read_lock() does nothing more than a barrier(),
> right?
>
> Paul worked really hard to get rcu_read_lock() to not call HW barriers.
>
> > If rcu_read_unlock has only an unlock barrier and not a full barrier,
> > memory accesses could be moved in front of rcu_read_unlock and
> > reordered with this_cpu_inc(*p->counters), but it doesn't matter,
> > because percpu_down_write does synchronize_rcu(), so it never sees
> > these accesses halfway through.
>
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier, as the worst thing that will happen is that it sees
> locked == false and will just grab the mutex unnecessarily.

It wasn't correct.

CPU 1 is holding the write lock. CPU 2 could get into percpu_down_read
past rcu_read_lock and prefetch some data that are accessed after
percpu_down_read.

CPU 1 goes into percpu_up_write(), executes a barrier, sets
p->locked = false, and calls mutex_unlock(&p->mtx).

CPU 2 now sees p->locked == false, so it goes through percpu_down_read.
It exits percpu_down_read and uses the invalid prefetched data.

It could be fixed by using synchronize_rcu() in percpu_up_write; I sent
a patch for that.

Mikulas
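
[ Note: the fix referred to above, as in Mikulas's synchronize_rcu()
  patch posted elsewhere in this thread, makes percpu_up_write look
  like this: ]

	static inline void percpu_up_write(struct percpu_rw_semaphore *p)
	{
		/*
		 * Make sure that other processes that are in an rcu section
		 * and that may have seen partially modified state exit the
		 * rcu section and try to grab the mutex.
		 */
		synchronize_rcu();
		p->locked = false;
		mutex_unlock(&p->mtx);
	}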
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Alan Cox @ 2012-10-17  9:56 UTC
To: Linus Torvalds
Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-arch

> later stores, and stores are viewed in program order, so all x86
> stores have "release consistency" modulo the theoretical PPro bugs
> (that I don't think we actually ever saw in practice).

We did see them for real. The PPro is on its way out, however, so I
hardly think we care about optimising for that case.

Alan
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka @ 2012-10-18 16:00 UTC
To: Linus Torvalds
Cc: Jens Axboe, linux-kernel, linux-arch, Lai Jiangshan,
    Steven Rostedt, Paul E. McKenney, Peter Zijlstra,
    Thomas Gleixner, Eric Dumazet

On Tue, 16 Oct 2012, Linus Torvalds wrote:

> [ Architecture people, note the potential new SMP barrier! ]
>
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > +	/*
> > +	 * The lock is considered unlocked when p->locked is set to false.
> > +	 * Use a barrier to prevent reordering of operations around p->locked.
> > +	 */
> > +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> > +	barrier();
> > +#else
> > +	smp_mb();
> > +#endif
> > 	p->locked = false;
>
> Ugh. The #if is too ugly to live.

One instance of this is already present in the code. I suggest that you
drop this patch and use the synchronize_rcu() patch that I have just
posted. But another instance of this "#if defined(CONFIG_X86) && ..."
is already there in percpu_up_read.

What is the procedure for making changes that require the support of
architectures? It is trivial to make a patch that moves this into
arch-specific includes; the problem is that the patch breaks all the
architectures: I wrote support for x86, sparc, parisc, and alpha (I can
test those) but not the others.

Are you going to apply this patch, break the majority of architectures,
and wait until the architecture maintainers fix it? Or is there some
arch-specific git tree where the patch should be applied and where the
maintainers fix things?

> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.
>
> I suspect we could make the above x86-specific optimization be valid by
> introducing a new barrier, called "smp_release_before_store()" or
> something like that, which on x86 just happens to be a no-op (but still
> a compiler barrier). That's because earlier reads will not pass later
> stores, and stores are viewed in program order, so all x86 stores have
> "release consistency" modulo the theoretical PPro bugs (that I don't
> think we actually ever saw in practice).
>
> But it is possible that that is true on other architectures too, so
> your hand-crafted x86-specific #if is not just ugly, it's liable to get
> worse.
>
> The alternative is to just say that we should use "smp_mb()"
> unconditionally, depending on how critical this code actually ends up
> being.
>
> Added linux-arch in case architecture maintainers have comments on this
> "smp_release_before_store()" thing. It would be kind of similar to the
> odd "smp_read_barrier_depends()", except it would normally be a full
> memory barrier, except on architectures where a weaker barrier might be
> sufficient.
>
> I suspect there may be many architectures where a "smp_wmb()" is
> sufficient for this case, for the simple reason that no sane
> microarchitecture would *ever* move earlier reads down past a later
> write,

Alpha can move reads down past writes (if the read is not in the cache
and the write hits the cache, it may make sense to do this reordering).
> so release consistency really only needs the local writes to be
> ordered, not the full memory ordering.
>
> Arch people?
>
> The more optimal solution may be to mark the store *itself* to be
> "store with release consistency", which on x86 would be a regular store
> (with the compiler barrier), but on other architectures may be a
> special memory operation. On architectures with
> release/acquire consistency, there's not a separate barrier before the
> store; the store instruction itself is done with special semantics. So
> maybe the right thing to do is
>
>	#define smp_release_consistency_store(val, ptr) ...
>
> where on x86, the implementation would be a simple
>
>	do { barrier(); *(ptr) = (val); } while (0)

smp_release_consistency_store doesn't work. In
include/linux/percpu-rwsem.h it is required that
"this_cpu_dec(*p->counters);" works as an unlock barrier. So we could
do one of the following:

	smp_mb();
	this_cpu_dec(*p->counters);

- generates a pointless barrier on x86

	#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
	barrier();
	#else
	smp_mb();
	#endif
	this_cpu_dec(*p->counters);

- removes the barrier on the most common architecture (x86), but the
code looks dirty

	smp_release_before_store();
	this_cpu_dec(*p->counters);

- the code is clean, but the downside is that it breaks architectures
that don't have smp_release_before_store().

> but on other architectures it might be an inline asm with the required
> magic store-with-release instruction.
>
> How important is this code sequence? Is the "percpu_up_write()"
> function really so critical that we can't have an extra memory barrier?

percpu_up_write() is not critical. But percpu_up_read() is critical and
it should be as fast as possible.

Mikulas

> Or do people perhaps see *other* places where release-consistency
> stores might be worth doing?
>
> But in no event do I want to see that butt-ugly #if statement.
>
>		Linus

---
Introduce smp_release_before_store()

smp_release_before_store() with the following store operation works as
an unlock barrier - that is, memory accesses may be moved before the
unlock barrier, but no memory accesses are moved past it.

On x86, writes are strongly ordered (unless CONFIG_X86_PPRO_FENCE or
CONFIG_X86_OOSTORE is selected), so smp_release_before_store() defaults
to a compiler barrier (and no barrier instruction). On other
architectures it may need more heavyweight barriers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/memory-barriers.txt   |    5 +++++
 arch/alpha/include/asm/barrier.h    |    2 ++
 arch/parisc/include/asm/barrier.h   |    1 +
 arch/sparc/include/asm/barrier_32.h |    1 +
 arch/sparc/include/asm/barrier_64.h |    2 ++
 arch/x86/include/asm/barrier.h      |    6 ++++++
 include/linux/percpu-rwsem.h        |   13 +------------
 7 files changed, 18 insertions(+), 12 deletions(-)

Index: linux-3.6.2-fast/arch/alpha/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/alpha/include/asm/barrier.h	2012-10-18 17:45:12.000000000 +0200
+++ linux-3.6.2-fast/arch/alpha/include/asm/barrier.h	2012-10-18 17:46:24.000000000 +0200
@@ -29,6 +29,8 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends()	do { } while (0)
 #endif
 
+#define smp_release_before_store()	smp_mb()
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
Index: linux-3.6.2-fast/arch/parisc/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/parisc/include/asm/barrier.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/parisc/include/asm/barrier.h	2012-10-18 17:46:24.000000000 +0200
@@ -29,6 +29,7 @@
 #define smp_wmb()	mb()
 #define smp_read_barrier_depends()	do { } while(0)
 #define read_barrier_depends()		do { } while(0)
+#define smp_release_before_store()	mb()
 
 #define set_mb(var, value) do { var = value; mb(); } while (0)
 
Index: linux-3.6.2-fast/arch/sparc/include/asm/barrier_32.h
===================================================================
--- linux-3.6.2-fast.orig/arch/sparc/include/asm/barrier_32.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/sparc/include/asm/barrier_32.h	2012-10-18 17:56:48.000000000 +0200
@@ -11,5 +11,6 @@
 #define smp_rmb()	__asm__ __volatile__("":::"memory")
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #define smp_read_barrier_depends()	do { } while(0)
+#define smp_release_before_store()	smp_mb()
 
 #endif /* !(__SPARC_BARRIER_H) */
Index: linux-3.6.2-fast/arch/sparc/include/asm/barrier_64.h
===================================================================
--- linux-3.6.2-fast.orig/arch/sparc/include/asm/barrier_64.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/sparc/include/asm/barrier_64.h	2012-10-18 17:46:24.000000000 +0200
@@ -53,4 +53,6 @@ do {	__asm__ __volatile__("ba,pt	%%xcc,
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_release_before_store()	__asm__ __volatile__("":::"memory")
+
 #endif /* !(__SPARC64_BARRIER_H) */
Index: linux-3.6.2-fast/arch/x86/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/x86/include/asm/barrier.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/x86/include/asm/barrier.h	2012-10-18 17:46:24.000000000 +0200
@@ -100,6 +100,12 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#ifdef CONFIG_X86_PPRO_FENCE
+#define smp_release_before_store()	smp_mb()
+#else
+#define smp_release_before_store()	smp_wmb()
+#endif
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-18 17:46:24.000000000 +0200
@@ -28,18 +28,7 @@ static inline void percpu_down_read(stru
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	/*
-	 * On X86, write operation in this_cpu_dec serves as a memory unlock
-	 * barrier (i.e. memory accesses may be moved before the write, but
-	 * no memory accesses are moved past the write).
-	 * On other architectures this may not be the case, so we need smp_mb()
-	 * there.
-	 */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-	barrier();
-#else
-	smp_mb();
-#endif
+	smp_release_before_store();
 	this_cpu_dec(*p->counters);
 }
 
Index: linux-3.6.2-fast/Documentation/memory-barriers.txt
===================================================================
--- linux-3.6.2-fast.orig/Documentation/memory-barriers.txt	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/Documentation/memory-barriers.txt	2012-10-18 17:46:24.000000000 +0200
@@ -1056,11 +1056,16 @@ The Linux kernel has eight basic CPU mem
 	WRITE		wmb()			smp_wmb()
 	READ		rmb()			smp_rmb()
 	DATA DEPENDENCY	read_barrier_depends()	smp_read_barrier_depends()
+	UNLOCK BARRIER	-			smp_release_before_store()
 
 All memory barriers except the data dependency barriers imply a compiler
 barrier. Data dependencies do not impose any additional compiler ordering.
 
+smp_release_before_store() together with the following store operation implies
+an unlock barrier. That is - memory accesses may be moved before the unlock
+barrier, but no memory accesses are moved past the unlock barrier.
+
 Aside: In the case of data dependencies, the compiler would be expected to
 issue the loads in the correct order (eg. `a[b]` would have to load the value
 of b before loading a[b]), however there is no guarantee in the C
 specification
* Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Linus Torvalds @ 2012-10-19 18:48 UTC
To: Mikulas Patocka
Cc: Jens Axboe, linux-kernel, linux-arch, Lai Jiangshan,
    Steven Rostedt, Paul E. McKenney, Peter Zijlstra,
    Thomas Gleixner, Eric Dumazet

On Thu, Oct 18, 2012 at 9:00 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> What is the procedure for making changes that require the support of
> architectures? It is trivial to make a patch that moves this into
> arch-specific includes; the problem is that the patch breaks all the
> architectures: I wrote support for x86, sparc, parisc, and alpha (I can
> test those) but not the others.

We'd need to add it to everybody. It shouldn't need per-architecture
testing, since "smp_mb()" is always safe. So we could just make all
architectures default to that, and then for x86 (and potentially others
that have cheaper models for release consistency) just do the optimized
one.

We *could* also simply do something like

	#ifndef smp_release_before_store
	#define smp_release_before_store() smp_mb()
	#endif

and basically make the rule be that only architectures that have a
cheaper one need to define it at all.

That may be the correct short-term fix, since initially there would be
only a single user.

		Linus