* [PATCH -tip 0/4] A few updates around smp_store_mb()
@ 2015-10-27 19:53 Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
` (4 more replies)
0 siblings, 5 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave,
linux-kernel
Hi,
Other than Peter's recent rename of set_mb() to reflect the SMP
ordering nature of the call, there really haven't been any updates
that further reflect this. Here are some pretty straightforward
updates, but ultimately I wonder if we cannot make the call the
same for all archs; that is where we seem to be headed _anyway_
(more on this in patch 3, which updates x86).
Patch 1 is just one that snuck in while looking at arch code.
Only the x86 bits have been tested.
Thanks!
Davidlohr Bueso (4):
arch,cmpxchg: Remove tas() definitions
arch,barrier: Use smp barriers in smp_store_release()
x86,asm: Re-work smp_store_mb()
doc,smp: Remove ambiguous statement in smp_store_mb()
Documentation/memory-barriers.txt | 4 ++--
arch/blackfin/include/asm/cmpxchg.h | 1 -
arch/c6x/include/asm/cmpxchg.h | 2 --
arch/frv/include/asm/cmpxchg.h | 2 --
arch/ia64/include/asm/barrier.h | 2 +-
arch/powerpc/include/asm/barrier.h | 2 +-
arch/s390/include/asm/barrier.h | 2 +-
arch/tile/include/asm/cmpxchg.h | 2 --
arch/x86/include/asm/barrier.h | 8 ++++++--
include/asm-generic/barrier.h | 2 +-
10 files changed, 12 insertions(+), 15 deletions(-)
--
2.1.4
^ permalink raw reply [flat|nested] 37+ messages in thread* [PATCH 1/4] arch,cmpxchg: Remove tas() definitions 2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso @ 2015-10-27 19:53 ` Davidlohr Bueso 2015-12-04 12:01 ` [tip:locking/core] locking/cmpxchg, arch: " tip-bot for Davidlohr Bueso 2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso ` (3 subsequent siblings) 4 siblings, 1 reply; 37+ messages in thread From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave, linux-kernel, Steven Miao, Aurelien Jacquiot, David Howells, Chris Metcalf, Davidlohr Bueso It seems that 5dc12ddee93 (Remove tas()) missed some files. Correct this and fully drop this macro, for which we should be using cmpxchg like calls. Cc: Steven Miao <realmz6@gmail.com> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: David Howells <dhowells@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- arch/blackfin/include/asm/cmpxchg.h | 1 - arch/c6x/include/asm/cmpxchg.h | 2 -- arch/frv/include/asm/cmpxchg.h | 2 -- arch/tile/include/asm/cmpxchg.h | 2 -- 4 files changed, 7 deletions(-) diff --git a/arch/blackfin/include/asm/cmpxchg.h b/arch/blackfin/include/asm/cmpxchg.h index c05868c..2539288 100644 --- a/arch/blackfin/include/asm/cmpxchg.h +++ b/arch/blackfin/include/asm/cmpxchg.h @@ -128,6 +128,5 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, #endif /* !CONFIG_SMP */ #define xchg(ptr, x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr)))) -#define tas(ptr) ((void)xchg((ptr), 1)) #endif /* __ARCH_BLACKFIN_CMPXCHG__ */ diff --git a/arch/c6x/include/asm/cmpxchg.h b/arch/c6x/include/asm/cmpxchg.h index b27c8ce..93d0a5a 100644 --- a/arch/c6x/include/asm/cmpxchg.h +++ b/arch/c6x/include/asm/cmpxchg.h @@ -47,8 +47,6 @@ static inline unsigned int __xchg(unsigned int x, volatile void *ptr, int size) #define xchg(ptr, x) \ ((__typeof__(*(ptr)))__xchg((unsigned int)(x), (void *) (ptr), \ sizeof(*(ptr)))) -#define tas(ptr) xchg((ptr), 1) - #include <asm-generic/cmpxchg-local.h> diff --git a/arch/frv/include/asm/cmpxchg.h b/arch/frv/include/asm/cmpxchg.h index 5b04dd0..a899765 100644 --- a/arch/frv/include/asm/cmpxchg.h +++ b/arch/frv/include/asm/cmpxchg.h @@ -69,8 +69,6 @@ extern uint32_t __xchg_32(uint32_t i, volatile void *v); #endif -#define tas(ptr) (xchg((ptr), 1)) - /*****************************************************************************/ /* * compare and conditionally exchange value with memory diff --git a/arch/tile/include/asm/cmpxchg.h b/arch/tile/include/asm/cmpxchg.h index 0ccda3c..25d5899 100644 --- a/arch/tile/include/asm/cmpxchg.h +++ b/arch/tile/include/asm/cmpxchg.h @@ -127,8 +127,6 @@ long long _atomic64_cmpxchg(long long *v, long long o, long long n); #endif -#define tas(ptr) xchg((ptr), 1) - #endif /* __ASSEMBLY__ */ #endif /* _ASM_TILE_CMPXCHG_H */ -- 2.1.4 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [tip:locking/core] locking/cmpxchg, arch: Remove tas() definitions 2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso @ 2015-12-04 12:01 ` tip-bot for Davidlohr Bueso 0 siblings, 0 replies; 37+ messages in thread From: tip-bot for Davidlohr Bueso @ 2015-12-04 12:01 UTC (permalink / raw) To: linux-tip-commits Cc: linux-arch, cmetcalf, realmz6, a-jacquiot, linux-kernel, dbueso, torvalds, tglx, dave, hpa, peterz, paulmck, akpm, mingo Commit-ID: fbd35c0d2fb41b75863a0e45fe939c8440375b0a Gitweb: http://git.kernel.org/tip/fbd35c0d2fb41b75863a0e45fe939c8440375b0a Author: Davidlohr Bueso <dave@stgolabs.net> AuthorDate: Tue, 27 Oct 2015 12:53:48 -0700 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Fri, 4 Dec 2015 11:39:51 +0100 locking/cmpxchg, arch: Remove tas() definitions It seems that commit 5dc12ddee93 ("Remove tas()") missed some files. Correct this and fully drop this macro, for which we should be using cmpxchg() like calls. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <linux-arch@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Howells <dhowells@re hat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Miao <realmz6@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1445975631-17047-2-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/blackfin/include/asm/cmpxchg.h | 1 - arch/c6x/include/asm/cmpxchg.h | 2 -- arch/frv/include/asm/cmpxchg.h | 2 -- arch/tile/include/asm/cmpxchg.h | 2 -- 4 files changed, 7 deletions(-) diff --git a/arch/blackfin/include/asm/cmpxchg.h b/arch/blackfin/include/asm/cmpxchg.h index c05868c..2539288 100644 --- a/arch/blackfin/include/asm/cmpxchg.h +++ b/arch/blackfin/include/asm/cmpxchg.h @@ -128,6 +128,5 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, #endif /* !CONFIG_SMP */ #define xchg(ptr, x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr)))) -#define tas(ptr) ((void)xchg((ptr), 1)) #endif /* __ARCH_BLACKFIN_CMPXCHG__ */ diff --git a/arch/c6x/include/asm/cmpxchg.h b/arch/c6x/include/asm/cmpxchg.h index b27c8ce..93d0a5a 100644 --- a/arch/c6x/include/asm/cmpxchg.h +++ b/arch/c6x/include/asm/cmpxchg.h @@ -47,8 +47,6 @@ static inline unsigned int __xchg(unsigned int x, volatile void *ptr, int size) #define xchg(ptr, x) \ ((__typeof__(*(ptr)))__xchg((unsigned int)(x), (void *) (ptr), \ sizeof(*(ptr)))) -#define tas(ptr) xchg((ptr), 1) - #include <asm-generic/cmpxchg-local.h> diff --git a/arch/frv/include/asm/cmpxchg.h b/arch/frv/include/asm/cmpxchg.h index 5b04dd0..a899765 100644 --- a/arch/frv/include/asm/cmpxchg.h +++ b/arch/frv/include/asm/cmpxchg.h @@ -69,8 +69,6 @@ extern uint32_t __xchg_32(uint32_t i, volatile void *v); #endif -#define tas(ptr) (xchg((ptr), 1)) - /*****************************************************************************/ /* * compare and conditionally exchange value with memory diff --git a/arch/tile/include/asm/cmpxchg.h b/arch/tile/include/asm/cmpxchg.h index 0ccda3c..25d5899 100644 --- a/arch/tile/include/asm/cmpxchg.h +++ b/arch/tile/include/asm/cmpxchg.h @@ -127,8 +127,6 @@ long long _atomic64_cmpxchg(long long *v, long long o, long long n); #endif -#define tas(ptr) xchg((ptr), 1) 
- #endif /* __ASSEMBLY__ */ #endif /* _ASM_TILE_CMPXCHG_H */ ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() 2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso 2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso @ 2015-10-27 19:53 ` Davidlohr Bueso 2015-10-27 20:03 ` Davidlohr Bueso 2015-12-04 12:01 ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso 2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso ` (2 subsequent siblings) 4 siblings, 2 replies; 37+ messages in thread From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave, linux-kernel, Tony Luck, Benjamin Herrenschmidt, Heiko Carstens, Davidlohr Bueso With b92b8b35a2e (locking/arch: Rename set_mb() to smp_store_mb()) it was made clear that the context of this call (and thus set_mb) is strictly for CPU ordering, as opposed to IO. As such all archs should use the smp variant of mb(), respecting the semantics and saving a mandatory barrier on UP. Cc: Tony Luck <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- Completely untested. arch/ia64/include/asm/barrier.h | 2 +- arch/powerpc/include/asm/barrier.h | 2 +- arch/s390/include/asm/barrier.h | 2 +- include/asm-generic/barrier.h | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h index df896a1..209c4b8 100644 --- a/arch/ia64/include/asm/barrier.h +++ b/arch/ia64/include/asm/barrier.h @@ -77,7 +77,7 @@ do { \ ___p1; \ }) -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) /* * The group barrier in front of the rsm & ssm are necessary to ensure diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index 0eca6ef..a7af5fb 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -34,7 +34,7 @@ #define rmb() __asm__ __volatile__ ("sync" : : : "memory") #define wmb() __asm__ __volatile__ ("sync" : : : "memory") -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) #ifdef __SUBARCH_HAS_LWSYNC # define SMPWMB LWSYNC diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h index d48fe01..d360737 100644 --- a/arch/s390/include/asm/barrier.h +++ b/arch/s390/include/asm/barrier.h @@ -36,7 +36,7 @@ #define smp_mb__before_atomic() smp_mb() #define smp_mb__after_atomic() smp_mb() -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) #define smp_store_release(p, v) \ do { \ diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h index b42afad..0f45f93 100644 --- a/include/asm-generic/barrier.h +++ b/include/asm-generic/barrier.h @@ -93,7 +93,7 @@ #endif /* CONFIG_SMP */ #ifndef smp_store_mb -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) #endif #ifndef smp_mb__before_atomic -- 2.1.4 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release()
  2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
@ 2015-10-27 20:03   ` Davidlohr Bueso
  2015-12-04 12:01   ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
  1 sibling, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 20:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, linux-kernel,
	Tony Luck, Benjamin Herrenschmidt, Heiko Carstens, Davidlohr Bueso

So the subject should actually say 'smp_store_mb()'... barriers here,
barriers there, barriers everywhere.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* [tip:locking/core] lcoking/barriers, arch: Use smp barriers in smp_store_release() 2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso 2015-10-27 20:03 ` Davidlohr Bueso @ 2015-12-04 12:01 ` tip-bot for Davidlohr Bueso 1 sibling, 0 replies; 37+ messages in thread From: tip-bot for Davidlohr Bueso @ 2015-12-04 12:01 UTC (permalink / raw) To: linux-tip-commits Cc: torvalds, linux-kernel, dave, hpa, tony.luck, linux-arch, heiko.carstens, akpm, peterz, mingo, paulmck, tglx, dbueso, benh Commit-ID: d5a73cadf3fdec95e9518ee5bb91bd0747c42b30 Gitweb: http://git.kernel.org/tip/d5a73cadf3fdec95e9518ee5bb91bd0747c42b30 Author: Davidlohr Bueso <dave@stgolabs.net> AuthorDate: Tue, 27 Oct 2015 12:53:49 -0700 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Fri, 4 Dec 2015 11:39:51 +0100 lcoking/barriers, arch: Use smp barriers in smp_store_release() With commit b92b8b35a2e ("locking/arch: Rename set_mb() to smp_store_mb()") it was made clear that the context of this call (and thus set_mb) is strictly for CPU ordering, as opposed to IO. As such all archs should use the smp variant of mb(), respecting the semantics and saving a mandatory barrier on UP. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <linux-arch@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1445975631-17047-3-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/ia64/include/asm/barrier.h | 2 +- arch/powerpc/include/asm/barrier.h | 2 +- arch/s390/include/asm/barrier.h | 2 +- include/asm-generic/barrier.h | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h index df896a1..209c4b8 100644 --- a/arch/ia64/include/asm/barrier.h +++ b/arch/ia64/include/asm/barrier.h @@ -77,7 +77,7 @@ do { \ ___p1; \ }) -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) /* * The group barrier in front of the rsm & ssm are necessary to ensure diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index 0eca6ef..a7af5fb 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -34,7 +34,7 @@ #define rmb() __asm__ __volatile__ ("sync" : : : "memory") #define wmb() __asm__ __volatile__ ("sync" : : : "memory") -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) #ifdef __SUBARCH_HAS_LWSYNC # define SMPWMB LWSYNC diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h index d68e11e..7ffd0b1 100644 --- a/arch/s390/include/asm/barrier.h +++ b/arch/s390/include/asm/barrier.h @@ -36,7 +36,7 @@ #define smp_mb__before_atomic() smp_mb() #define smp_mb__after_atomic() smp_mb() -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) #define 
smp_store_release(p, v) \ do { \ diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h index b42afad..0f45f93 100644 --- a/include/asm-generic/barrier.h +++ b/include/asm-generic/barrier.h @@ -93,7 +93,7 @@ #endif /* CONFIG_SMP */ #ifndef smp_store_mb -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) +#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0) #endif #ifndef smp_mb__before_atomic ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso 2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso 2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso @ 2015-10-27 19:53 ` Davidlohr Bueso 2015-10-27 21:33 ` Linus Torvalds 2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso 2015-10-27 23:27 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions David Howells 4 siblings, 1 reply; 37+ messages in thread From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave, linux-kernel, x86, Davidlohr Bueso With the exception of the recent rename of set_mb to smp_store_mb, thus explicitly enforcing SMP ordering, the code is quite stale - going back to 2002, afaict. Specifically, replace the implicit barriers of xchg for more standard smp_mb() call instead. Thus, (i) We need not re-define it for SMP and UP systems. The later already converts the smp_mb() to a compiler barrier. (ii) Like most other archs, avoid using ugly/hacky (void)xchg patterns and simply add the smp_mb() call explicitly after the assignment. Note that this might affect callers that could/would rely on the atomicity semantics, but there are no guarantees of that for smp_store_mb() mentioned anywhere, plus most archs use this anyway. Thus we continue to be consistent with the memory-barriers.txt file, and more importantly, maintain the semantics of the smp_ nature. Cc: x86@kernel.org Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- arch/x86/include/asm/barrier.h | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h index 0681d25..09f817a 100644 --- a/arch/x86/include/asm/barrier.h +++ b/arch/x86/include/asm/barrier.h @@ -35,14 +35,18 @@ #define smp_mb() mb() #define smp_rmb() dma_rmb() #define smp_wmb() barrier() -#define smp_store_mb(var, value) do { (void)xchg(&var, value); } while (0) #else /* !SMP */ #define smp_mb() barrier() #define smp_rmb() barrier() #define smp_wmb() barrier() -#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); barrier(); } while (0) #endif /* SMP */ +#define smp_store_mb(var, val) \ +do { \ + WRITE_ONCE(var, val); \ + smp_mb(); \ +} while (0) + #define read_barrier_depends() do { } while (0) #define smp_read_barrier_depends() do { } while (0) -- 2.1.4 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
@ 2015-10-27 21:33   ` Linus Torvalds
  2015-10-27 22:01     ` Davidlohr Bueso
  2015-10-27 22:37     ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2015-10-27 21:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso

On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> Note that this might affect callers that could/would rely on the
> atomicity semantics, but there are no guarantees of that for
> smp_store_mb() mentioned anywhere, plus most archs use this anyway.
> Thus we continue to be consistent with the memory-barriers.txt file,
> and more importantly, maintain the semantics of the smp_ nature.

So I dislike this patch, mostly because it now makes it obvious that
smp_store_mb() seems to be totally pointless. Every single
implementation is now apparently WRITE_ONCE+smp_mb(), and there are
what, five users of it, so why not then open-code it?

But more importantly, is the "WRITE_ONCE()" even necessary? If there
are no atomicity guarantees, then why bother with WRITE_ONCE() either?

So with this patch, the whole thing becomes pointless, I feel. (Ok, so
it may have been pointless before too, but at least before this patch
it generated special code, now it doesn't). So why carry it along at
all?

             Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
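Concretely, "open-coding" here would just mean replacing each of the
handful of call sites with the two underlying operations. A minimal
sketch of what that would look like (a set_current_state()-style call
site is assumed purely for illustration):

	/* With the helper: */
	smp_store_mb(current->state, TASK_UNINTERRUPTIBLE);

	/* Open-coded equivalent, if the helper were dropped: */
	WRITE_ONCE(current->state, TASK_UNINTERRUPTIBLE);
	smp_mb();	/* order the store against the later condition check */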
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-10-27 21:33 ` Linus Torvalds @ 2015-10-27 22:01 ` Davidlohr Bueso 2015-10-27 22:37 ` Peter Zijlstra 1 sibling, 0 replies; 37+ messages in thread From: Davidlohr Bueso @ 2015-10-27 22:01 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso On Wed, 28 Oct 2015, Linus Torvalds wrote: >On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote: >> >> Note that this might affect callers that could/would rely on the >> atomicity semantics, but there are no guarantees of that for >> smp_store_mb() mentioned anywhere, plus most archs use this anyway. >> Thus we continue to be consistent with the memory-barriers.txt file, >> and more importantly, maintain the semantics of the smp_ nature. > >So I dislike this patch, mostly because it now makes it obvious that >smp_store_mb() seems to be totally pointless. Every single >implementation is now apparently WRITE_ONCE+smp_mb(), and there are >what, five users of it, so why not then open-code it? So after having gone through pretty much all of smp_store_mb code, this is a feeling I also share. However I justified its existence (as opposed to dropping the call, updating all the callers/documenting the barriers etc.) to at least encapsulate the store+mb logic, which apparently is a pattern somewhat needed(?). Also, the name is obviously exactly what its name implies. But I have no strong preference either way. Now, if we should keep smp_store_mb(), it should probably be made generic, instead of having each arch define it. > >But more importantly, is the "WRITE_ONCE()" even necessary? If there >are no atomicity guarantees, then why bother with WRTE_ONCE() either? Agreed. Hmm, this was introduced by ab3f02fc237 (locking/arch: Add WRITE_ONCE() to set_mb()), back when atomicity aspects were not clear yet. >So with this patch, the whole thing becomes pointless, I feel. (Ok, so >it may have been pointless before too, but at least before this patch >it generated special code, now it doesn't). So why carry it along at >all? Ok, unless others are strongly against it, I'll send a series to drop the call altogether. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 37+ messages in thread
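Making it generic, as suggested above, would essentially mean letting
every architecture fall back to the asm-generic definition that patch 2
already touches; roughly (a sketch, lifted from the asm-generic version
shown earlier):

	#ifndef smp_store_mb
	#define smp_store_mb(var, value)			\
	do {							\
		WRITE_ONCE(var, value);				\
		smp_mb();					\
	} while (0)
	#endif

Since smp_mb() already degrades to a plain compiler barrier() on
!CONFIG_SMP builds, an arch would only need its own override if it can
do something cheaper than its smp_mb().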
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-10-27 21:33 ` Linus Torvalds 2015-10-27 22:01 ` Davidlohr Bueso @ 2015-10-27 22:37 ` Peter Zijlstra 2015-10-28 19:49 ` Davidlohr Bueso 2015-11-02 20:15 ` Davidlohr Bueso 1 sibling, 2 replies; 37+ messages in thread From: Peter Zijlstra @ 2015-10-27 22:37 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso On Wed, Oct 28, 2015 at 06:33:56AM +0900, Linus Torvalds wrote: > On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote: > > > > Note that this might affect callers that could/would rely on the > > atomicity semantics, but there are no guarantees of that for > > smp_store_mb() mentioned anywhere, plus most archs use this anyway. > > Thus we continue to be consistent with the memory-barriers.txt file, > > and more importantly, maintain the semantics of the smp_ nature. > > So with this patch, the whole thing becomes pointless, I feel. (Ok, so > it may have been pointless before too, but at least before this patch > it generated special code, now it doesn't). So why carry it along at > all? So I suppose this boils down to if: XCHG ends up being cheaper than MOV+FENCE. PeterA, any idea? ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-10-27 22:37 ` Peter Zijlstra @ 2015-10-28 19:49 ` Davidlohr Bueso 2015-11-02 20:15 ` Davidlohr Bueso 1 sibling, 0 replies; 37+ messages in thread From: Davidlohr Bueso @ 2015-10-28 19:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin On Tue, 27 Oct 2015, Peter Zijlstra wrote: >On Wed, Oct 28, 2015 at 06:33:56AM +0900, Linus Torvalds wrote: >> On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote: >> > >> > Note that this might affect callers that could/would rely on the >> > atomicity semantics, but there are no guarantees of that for >> > smp_store_mb() mentioned anywhere, plus most archs use this anyway. >> > Thus we continue to be consistent with the memory-barriers.txt file, >> > and more importantly, maintain the semantics of the smp_ nature. >> > >> So with this patch, the whole thing becomes pointless, I feel. (Ok, so >> it may have been pointless before too, but at least before this patch >> it generated special code, now it doesn't). So why carry it along at >> all? > >So I suppose this boils down to if: XCHG ends up being cheaper than >MOV+FENCE. If so, could this be the reasoning behind the mix and match of xchg and MOV+FENCE? for different archs? This is from the days when set_mb() was introduced. I wonder if it still even matters... I at least haven't seen much difference in general workloads (I guess any difference would be neglictible for practical matters). But could obviously be missing something. >PeterA, any idea? I suppose you're referring to hpa, Cc'ing him. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-10-27 22:37 ` Peter Zijlstra 2015-10-28 19:49 ` Davidlohr Bueso @ 2015-11-02 20:15 ` Davidlohr Bueso 2015-11-03 0:06 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Davidlohr Bueso @ 2015-11-02 20:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso On Tue, 27 Oct 2015, Peter Zijlstra wrote: >On Wed, Oct 28, 2015 at 06:33:56AM +0900, Linus Torvalds wrote: >> On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote: >> > >> > Note that this might affect callers that could/would rely on the >> > atomicity semantics, but there are no guarantees of that for >> > smp_store_mb() mentioned anywhere, plus most archs use this anyway. >> > Thus we continue to be consistent with the memory-barriers.txt file, >> > and more importantly, maintain the semantics of the smp_ nature. >> > >> So with this patch, the whole thing becomes pointless, I feel. (Ok, so >> it may have been pointless before too, but at least before this patch >> it generated special code, now it doesn't). So why carry it along at >> all? > >So I suppose this boils down to if: XCHG ends up being cheaper than >MOV+FENCE. So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is constantly cheaper (by at least half the latency) than MFENCE. While there was a decent amount of variation, this difference remained rather constant. Then again, I'm not sure this matters. Thoughts? Thanks, Davidlohr ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-11-02 20:15 ` Davidlohr Bueso @ 2015-11-03 0:06 ` Linus Torvalds 2015-11-03 1:36 ` Davidlohr Bueso 2016-01-12 13:57 ` Michael S. Tsirkin 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2015-11-03 0:06 UTC (permalink / raw) To: Davidlohr Bueso Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote: > > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is > constantly cheaper (by at least half the latency) than MFENCE. While there > was a decent amount of variation, this difference remained rather constant. Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we use on old cpu's without one (ie 32-bit). I'm not actually convinced that mfence is necessarily a good idea. I could easily see it being microcode, for example. At least on my Haswell, the "lock addq" is pretty much exactly half the cost of "mfence". Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-11-03 0:06 ` Linus Torvalds @ 2015-11-03 1:36 ` Davidlohr Bueso 2016-01-12 13:57 ` Michael S. Tsirkin 1 sibling, 0 replies; 37+ messages in thread From: Davidlohr Bueso @ 2015-11-03 1:36 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso On Mon, 02 Nov 2015, Linus Torvalds wrote: >On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote: >> >> So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is >> constantly cheaper (by at least half the latency) than MFENCE. While there >> was a decent amount of variation, this difference remained rather constant. > >Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we >use on old cpu's without one (ie 32-bit). I'm getting results very close to xchg. >I'm not actually convinced that mfence is necessarily a good idea. I >could easily see it being microcode, for example. Interesting. > >At least on my Haswell, the "lock addq" is pretty much exactly half >the cost of "mfence". Ok, his coincides with my results on IvB. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2015-11-03 0:06 ` Linus Torvalds 2015-11-03 1:36 ` Davidlohr Bueso @ 2016-01-12 13:57 ` Michael S. Tsirkin 2016-01-12 17:20 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 13:57 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin, virtualization On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote: > > > > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is > > constantly cheaper (by at least half the latency) than MFENCE. While there > > was a decent amount of variation, this difference remained rather constant. > > Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we > use on old cpu's without one (ie 32-bit). > > I'm not actually convinced that mfence is necessarily a good idea. I > could easily see it being microcode, for example. > > At least on my Haswell, the "lock addq" is pretty much exactly half > the cost of "mfence". > > Linus mfence was high on some traces I was seeing, so I got curious, too: ----> main.c ----> extern volatile int x; volatile int x; #ifdef __x86_64__ #define SP "rsp" #else #define SP "esp" #endif #ifdef lock #define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory") #endif #ifdef xchg #define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0) #endif #ifdef xchgrz /* same as xchg but poking at gcc red zone */ #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) #endif #ifdef mfence #define barrier() asm("mfence" ::: "memory") #endif #ifdef lfence #define barrier() asm("lfence" ::: "memory") #endif #ifdef sfence #define barrier() asm("sfence" ::: "memory") #endif int main(int argc, char **argv) { int i; int j = 1234; /* * Test barrier in a loop. We also poke at a volatile variable in an * attempt to make it a bit more realistic - this way there's something * in the store-buffer. */ for (i = 0; i < 10000000; ++i) { x = i - j; barrier(); j = x; } return 0; } ----> Makefile: ----> ALL = xchg xchgrz lock mfence lfence sfence CC = gcc CFLAGS += -Wall -O2 -ggdb PERF = perf stat -r 10 --log-fd 1 -- all: ${ALL} clean: rm -f ${ALL} run: all for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done .PHONY: all clean run ${ALL}: main.c ${CC} ${CFLAGS} -D$@ -o $@ main.c -----> Is this a good way to test it? E.g. 
on my laptop I get: perf stat -r 10 --log-fd 1 -- ./xchg Performance counter stats for './xchg' (10 runs): 53.236967 task-clock # 0.992 CPUs utilized ( +- 0.09% ) 10 context-switches # 0.180 K/sec ( +- 1.70% ) 0 CPU-migrations # 0.000 K/sec 37 page-faults # 0.691 K/sec ( +- 1.13% ) 190,997,612 cycles # 3.588 GHz ( +- 0.04% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,654,850 instructions # 0.42 insns per cycle ( +- 0.01% ) 10,122,372 branches # 190.138 M/sec ( +- 0.01% ) 4,514 branch-misses # 0.04% of all branches ( +- 3.37% ) 0.053642809 seconds time elapsed ( +- 0.12% ) perf stat -r 10 --log-fd 1 -- ./xchgrz Performance counter stats for './xchgrz' (10 runs): 53.189533 task-clock # 0.997 CPUs utilized ( +- 0.22% ) 0 context-switches # 0.000 K/sec 0 CPU-migrations # 0.000 K/sec 37 page-faults # 0.694 K/sec ( +- 0.75% ) 190,785,621 cycles # 3.587 GHz ( +- 0.03% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,602,086 instructions # 0.42 insns per cycle ( +- 0.00% ) 10,112,154 branches # 190.115 M/sec ( +- 0.01% ) 3,743 branch-misses # 0.04% of all branches ( +- 4.02% ) 0.053343693 seconds time elapsed ( +- 0.23% ) perf stat -r 10 --log-fd 1 -- ./lock Performance counter stats for './lock' (10 runs): 53.096434 task-clock # 0.997 CPUs utilized ( +- 0.16% ) 0 context-switches # 0.002 K/sec ( +-100.00% ) 0 CPU-migrations # 0.000 K/sec 37 page-faults # 0.693 K/sec ( +- 0.98% ) 190,796,621 cycles # 3.593 GHz ( +- 0.02% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,601,376 instructions # 0.42 insns per cycle ( +- 0.01% ) 10,112,074 branches # 190.447 M/sec ( +- 0.01% ) 3,475 branch-misses # 0.03% of all branches ( +- 1.33% ) 0.053252678 seconds time elapsed ( +- 0.16% ) perf stat -r 10 --log-fd 1 -- ./mfence Performance counter stats for './mfence' (10 runs): 126.376473 task-clock # 0.999 CPUs utilized ( +- 0.21% ) 0 context-switches # 0.002 K/sec ( +- 66.67% ) 0 CPU-migrations # 0.000 K/sec 36 page-faults # 0.289 K/sec ( +- 0.84% ) 456,147,770 cycles # 3.609 GHz ( +- 0.01% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,892,416 instructions # 0.18 insns per cycle ( +- 0.00% ) 10,163,220 branches # 80.420 M/sec ( +- 0.01% ) 4,653 branch-misses # 0.05% of all branches ( +- 1.27% ) 0.126539273 seconds time elapsed ( +- 0.21% ) perf stat -r 10 --log-fd 1 -- ./lfence Performance counter stats for './lfence' (10 runs): 47.617861 task-clock # 0.997 CPUs utilized ( +- 0.06% ) 0 context-switches # 0.002 K/sec ( +-100.00% ) 0 CPU-migrations # 0.000 K/sec 36 page-faults # 0.764 K/sec ( +- 0.45% ) 170,767,856 cycles # 3.586 GHz ( +- 0.03% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,581,607 instructions # 0.47 insns per cycle ( +- 0.00% ) 10,108,508 branches # 212.284 M/sec ( +- 0.00% ) 3,320 branch-misses # 0.03% of all branches ( +- 1.12% ) 0.047768505 seconds time elapsed ( +- 0.07% ) perf stat -r 10 --log-fd 1 -- ./sfence Performance counter stats for './sfence' (10 runs): 20.156676 task-clock # 0.988 CPUs utilized ( +- 0.45% ) 3 context-switches # 0.159 K/sec ( +- 12.15% ) 0 CPU-migrations # 0.000 K/sec 36 page-faults # 0.002 M/sec ( +- 0.87% ) 72,212,225 cycles # 3.583 GHz ( +- 0.33% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 80,479,149 instructions # 1.11 insns per cycle ( +- 0.00% ) 10,090,785 branches # 500.618 M/sec ( +- 0.01% ) 3,626 branch-misses # 0.04% of all branches ( +- 
3.59% ) 0.020411208 seconds time elapsed ( +- 0.52% ) So mfence is more expensive than locked instructions/xchg, but sfence/lfence are slightly faster, and xchg and locked instructions are very close if not the same. I poked at some 10 intel and AMD machines and the numbers are different but the results seem more or less consistent with this. >From size point of view xchg is longer and xchgrz pokes at the red zone which seems unnecessarily hacky, so good old lock+addl is probably the best. There isn't any extra magic behind mfence, is there? E.g. I think lock orders accesses to WC memory as well, so apparently mb() can be redefined unconditionally, without looking at XMM2: ---> x86: drop mfence in favor of lock+addl mfence appears to be way slower than a locked instruction - let's use lock+add unconditionally, same as we always did on old 32-bit. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> --- I'll play with this some more before posting this as a non-stand alone patch. Is there a macro-benchmark where mb is prominent? diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h index a584e1c..f0d36e2 100644 --- a/arch/x86/include/asm/barrier.h +++ b/arch/x86/include/asm/barrier.h @@ -15,15 +15,15 @@ * Some non-Intel clones support out of order store. wmb() ceases to be a * nop for these. */ -#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2) +#define mb() asm volatile("lock; addl $0,0(%%esp)":::"memory") #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2) #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM) #else +#define mb() asm volatile("lock; addl $0,0(%%rsp)":::"memory") #define rmb() asm volatile("lfence":::"memory") #define wmb() asm volatile("sfence" ::: "memory") #endif #ifdef CONFIG_X86_PPRO_FENCE #define dma_rmb() rmb() #else > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 13:57 ` Michael S. Tsirkin @ 2016-01-12 17:20 ` Linus Torvalds 2016-01-12 17:45 ` Michael S. Tsirkin 2016-01-12 20:30 ` Andy Lutomirski 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2016-01-12 17:20 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin, virtualization On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > #ifdef xchgrz > /* same as xchg but poking at gcc red zone */ > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) > #endif That's not safe in general. gcc might be using its redzone, so doing xchg into it is unsafe. But.. > Is this a good way to test it? .. it's fine for some basic testing. It doesn't show any subtle interactions (ie some operations may have different dynamic behavior when the write buffers are busy etc), but as a baseline for "how fast can things go" the stupid raw loop is fine. And while the xchg into the redzoen wouldn't be acceptable as a real implementation, for timing testing it's likely fine (ie you aren't hitting the problem it can cause). > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > are slightly faster, and xchg and locked instructions are very close if > not the same. Note that we never actually *use* lfence/sfence. They are pointless instructions when looking at CPU memory ordering, because for pure CPU memory ordering stores and loads are already ordered. The only reason to use lfence/sfence is after you've used nontemporal stores for IO. That's very very rare in the kernel. So I wouldn't worry about those. But yes, it does sound like mfence is just a bad idea too. > There isn't any extra magic behind mfence, is there? No. I think the only issue is that there has never been any real reason for CPU designers to try to make mfence go particularly fast. Nobody uses it, again with the exception of some odd loops that use nontemporal stores, and for those the cost tends to always be about the nontemporal accesses themselves (often to things like GPU memory over PCIe), and the mfence cost of a few extra cycles is negligible. The reason "lock ; add $0" has generally been the fastest we've found is simply that locked ops have been important for CPU designers. So I think the patch is fine, and we should likely drop the use of mfence.. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 17:20 ` Linus Torvalds @ 2016-01-12 17:45 ` Michael S. Tsirkin 2016-01-12 18:04 ` Linus Torvalds 2016-01-12 20:30 ` Andy Lutomirski 1 sibling, 1 reply; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 17:45 UTC (permalink / raw) To: Linus Torvalds Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin, virtualization On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > #ifdef xchgrz > > /* same as xchg but poking at gcc red zone */ > > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) > > #endif > > That's not safe in general. gcc might be using its redzone, so doing > xchg into it is unsafe. > > But.. > > > Is this a good way to test it? > > .. it's fine for some basic testing. It doesn't show any subtle > interactions (ie some operations may have different dynamic behavior > when the write buffers are busy etc), but as a baseline for "how fast > can things go" the stupid raw loop is fine. And while the xchg into > the redzoen wouldn't be acceptable as a real implementation, for > timing testing it's likely fine (ie you aren't hitting the problem it > can cause). > > > So mfence is more expensive than locked instructions/xchg, but sfence/lfence > > are slightly faster, and xchg and locked instructions are very close if > > not the same. > > Note that we never actually *use* lfence/sfence. They are pointless > instructions when looking at CPU memory ordering, because for pure CPU > memory ordering stores and loads are already ordered. > > The only reason to use lfence/sfence is after you've used nontemporal > stores for IO. By the way, the comment in barrier.h says: /* * Some non-Intel clones support out of order store. wmb() ceases to be * a nop for these. */ and while the 1st sentence may well be true, if you have an SMP system with out of order stores, making wmb not a nop will not help. Additionally as you point out, wmb is not a nop even for regular intel CPUs because of these weird use-cases. Drop this comment? > That's very very rare in the kernel. So I wouldn't > worry about those. Right - I'll leave these alone, whoever wants to optimize this path will have to do the necessary research. > But yes, it does sound like mfence is just a bad idea too. > > > There isn't any extra magic behind mfence, is there? > > No. > > I think the only issue is that there has never been any real reason > for CPU designers to try to make mfence go particularly fast. Nobody > uses it, again with the exception of some odd loops that use > nontemporal stores, and for those the cost tends to always be about > the nontemporal accesses themselves (often to things like GPU memory > over PCIe), and the mfence cost of a few extra cycles is negligible. > > The reason "lock ; add $0" has generally been the fastest we've found > is simply that locked ops have been important for CPU designers. > > So I think the patch is fine, and we should likely drop the use of mfence.. > > Linus OK so should I repost after a bit more testing? I don't believe this will affect the kernel build benchmark, but I'll try :) -- MST ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 17:45 ` Michael S. Tsirkin @ 2016-01-12 18:04 ` Linus Torvalds 0 siblings, 0 replies; 37+ messages in thread From: Linus Torvalds @ 2016-01-12 18:04 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney, Linux Kernel Mailing List, the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin, virtualization On Tue, Jan 12, 2016 at 9:45 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > By the way, the comment in barrier.h says: > > /* > * Some non-Intel clones support out of order store. wmb() ceases to be > * a nop for these. > */ > > and while the 1st sentence may well be true, if you have > an SMP system with out of order stores, making wmb > not a nop will not help. > > Additionally as you point out, wmb is not a nop even > for regular intel CPUs because of these weird use-cases. > > Drop this comment? We should drop it, yes. We dropped support for CONFIG_X86_OOSTORE almost two years ago. See commit 09df7c4c8097 ("x86: Remove CONFIG_X86_OOSTORE") and it was questionable for a long time even before that (perhaps ever). So the comment is stale. We *do* still use the non-nop rmb/wmb for IO barriers, but even that is generally questionable. See our "copy_user_64.S" for an actual use of "movnt" followed by sfence. There's a couple of other cases too. So that's all correct, but the point is that when we use "movnt" we don't actually use "wmb()", we are doing assembly, and the assembly should just use sfence directly. So it's actually very questionable to ever make even the IO wmb()/rmb() functions use lfence/sfence. They should never really need it. But at the same time, I _really_ don't think we care enough. I'd rather leave those non-smp barrier cases alone as historial unless somebody can point to a case where they care about the performance. We also do have the whole PPRO_FENCE thing, which we can hopefully get rid of at some point too. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
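For the record, the nontemporal-store pattern being referred to looks
roughly like this (a sketch only, not the actual copy_user_64.S code):

	/*
	 * movnti stores bypass the cache and are weakly ordered, so an
	 * sfence is needed before telling another observer the data is
	 * there - which is essentially the only reason the kernel uses
	 * sfence at all.
	 */
	static inline void nt_store_flush(unsigned long *dst, unsigned long val)
	{
		asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));
		asm volatile("sfence" ::: "memory");
	}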
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 17:20 ` Linus Torvalds 2016-01-12 17:45 ` Michael S. Tsirkin @ 2016-01-12 20:30 ` Andy Lutomirski 2016-01-12 20:54 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Andy Lutomirski @ 2016-01-12 20:30 UTC (permalink / raw) To: Linus Torvalds, Michael S. Tsirkin Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On 01/12/2016 09:20 AM, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote: >> #ifdef xchgrz >> /* same as xchg but poking at gcc red zone */ >> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0) >> #endif > > That's not safe in general. gcc might be using its redzone, so doing > xchg into it is unsafe. > > But.. > >> Is this a good way to test it? > > .. it's fine for some basic testing. It doesn't show any subtle > interactions (ie some operations may have different dynamic behavior > when the write buffers are busy etc), but as a baseline for "how fast > can things go" the stupid raw loop is fine. And while the xchg into > the redzoen wouldn't be acceptable as a real implementation, for > timing testing it's likely fine (ie you aren't hitting the problem it > can cause). I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) was better because it avoided stomping on very-likely-to-be-hot write buffers. --Andy ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:30 ` Andy Lutomirski @ 2016-01-12 20:54 ` Linus Torvalds 2016-01-12 20:59 ` Andy Lutomirski 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2016-01-12 20:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Michael S. Tsirkin, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote: > > I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) > was better because it avoided stomping on very-likely-to-be-hot write > buffers. I suspect it could go either way. You want a small constant (for the isntruction size), but any small constant is likely to be within the current stack frame anyway. I don't think 0(%rsp) is particularly likely to have a spill on it right then and there, but who knows.. And 64(%rsp) is possibly going to be cold in the L1 cache, especially if it's just after a deep function call. Which it might be. So it might work the other way. So my guess would be that you wouldn't be able to measure the difference. It might be there, but probably too small to really see in any noise. But numbers talk, bullshit walks. It would be interesting to be proven wrong. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:54 ` Linus Torvalds @ 2016-01-12 20:59 ` Andy Lutomirski 2016-01-12 21:37 ` Linus Torvalds 2016-01-12 22:21 ` Michael S. Tsirkin 0 siblings, 2 replies; 37+ messages in thread From: Andy Lutomirski @ 2016-01-12 20:59 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Michael S. Tsirkin, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) >> was better because it avoided stomping on very-likely-to-be-hot write >> buffers. > > I suspect it could go either way. You want a small constant (for the > isntruction size), but any small constant is likely to be within the > current stack frame anyway. I don't think 0(%rsp) is particularly > likely to have a spill on it right then and there, but who knows.. > > And 64(%rsp) is possibly going to be cold in the L1 cache, especially > if it's just after a deep function call. Which it might be. So it > might work the other way. > > So my guess would be that you wouldn't be able to measure the > difference. It might be there, but probably too small to really see in > any noise. > > But numbers talk, bullshit walks. It would be interesting to be proven wrong. Here's an article with numbers: http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ I think they're suggesting using a negative offset, which is safe as long as it doesn't page fault, even though we have the redzone disabled. --Andy ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:59 ` Andy Lutomirski @ 2016-01-12 21:37 ` Linus Torvalds 2016-01-12 22:14 ` Michael S. Tsirkin 2016-01-13 16:20 ` Michael S. Tsirkin 2016-01-12 22:21 ` Michael S. Tsirkin 1 sibling, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2016-01-12 21:37 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Michael S. Tsirkin, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > Here's an article with numbers: > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ Well, that's with the busy loop and one set of code generation. It doesn't show the "oops, deeper stack isn't even in the cache any more due to call chains" issue. But yes: > I think they're suggesting using a negative offset, which is safe as > long as it doesn't page fault, even though we have the redzone > disabled. I think a negative offset might work very well. Partly exactly *because* we have the redzone disabled: we know that inside the kernel, we'll never have any live stack frame accesses under the stack pointer, so "-4(%rsp)" sounds good to me. There should never be any pending writes in the write buffer, because even if it *was* live, it would have been read off first. Yeah, it potentially does extend the stack cache footprint by another 4 bytes, but that sounds very benign. So perhaps it might be worth trying to switch the "mfence" to "lock ; addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate for x86-32. I'd still want to see somebody try to benchmark it. I doubt it's noticeable, but making changes because you think it might save a few cycles without then even measuring it is just wrong. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
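What is being floated here would look something like the following (a
sketch of the idea only, safe in the kernel solely because the red zone
is disabled, and, as noted, still in need of actual numbers):

	/*
	 * x86-64 full barrier via a locked RMW just below the stack
	 * pointer, instead of mfence (sketch of the proposal above):
	 */
	#define mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory")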
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 21:37 ` Linus Torvalds @ 2016-01-12 22:14 ` Michael S. Tsirkin 2016-01-13 16:20 ` Michael S. Tsirkin 1 sibling, 0 replies; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 22:14 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > Here's an article with numbers: > > > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ > > Well, that's with the busy loop and one set of code generation. It > doesn't show the "oops, deeper stack isn't even in the cache any more > due to call chains" issue. > > But yes: > > > I think they're suggesting using a negative offset, which is safe as > > long as it doesn't page fault, even though we have the redzone > > disabled. > > I think a negative offset might work very well. Partly exactly > *because* we have the redzone disabled: we know that inside the > kernel, we'll never have any live stack frame accesses under the stack > pointer, so "-4(%rsp)" sounds good to me. There should never be any > pending writes in the write buffer, because even if it *was* live, it > would have been read off first. > > Yeah, it potentially does extend the stack cache footprint by another > 4 bytes, but that sounds very benign. > > So perhaps it might be worth trying to switch the "mfence" to "lock ; > addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate > for x86-32. > > I'd still want to see somebody try to benchmark it. I doubt it's > noticeable, but making changes because you think it might save a few > cycles without then even measuring it is just wrong. > > Linus Oops, I posted v2 with just offset 0 before reading the rest of this thread. I did try with offset 0 and didn't measure any change on any perf bench test, or on kernel build. I wonder which benchmark stresses smp_mb the most. I'll look into using a negative offset. -- MST ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 21:37 ` Linus Torvalds
  2016-01-12 22:14 ` Michael S. Tsirkin
@ 2016-01-13 16:20 ` Michael S. Tsirkin
  1 sibling, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso,
      Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List,
      virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney,
      Ingo Molnar

On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks!

So sp is read on return from function I think. I added a function and
sure enough, it slows the add 0(sp) variant down. It's still faster
than mfence for me though! Testing code + results below.

Reaching below stack, or allocating extra 4 bytes above the stack
pointer gives us back the performance.

> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
>                 Linus

I'll try this in the kernel now, will report, though I'm not optimistic
a high level benchmark can show this kind of thing.

---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif

#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif
#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);
	return 0;
}

---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}

clean:
	rm -f ${ALL}

run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done

perf time: run

time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------

perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed           ( +- 2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed           ( +- 2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed           ( +- 2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed           ( +- 0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed           ( +- 3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed           ( +- 0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed           ( +- 0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed           ( +- 0.21% )

^ permalink raw reply	[flat|nested] 37+ messages in thread
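[Side note on the harness above, an inference from its Makefile rather than something stated in the thread: only the xchgrz build is given -mno-red-zone, because "lock; addl $0,..." reads the target dword and writes the same value back, so even live red-zone data survives it, whereas "xchgl" dumps an arbitrary register value into -4(%rsp). Restated with that comment; the names below are illustrative, SP is the macro from main.c, and the "cc" clobber discussed later in the thread is added here:

	/* safe without -mno-red-zone: adds zero, the data is unchanged */
	#define barrier_lockrz() \
		asm volatile("lock; addl $0,-4(%%" SP ")" ::: "memory", "cc")

	/* needs -mno-red-zone in userspace: overwrites -4(%rsp) with junk */
	#define barrier_xchgrz() do { \
		int __ret; \
		asm volatile("xchgl %0, -4(%%" SP ")" \
			     : "=r"(__ret) :: "memory", "cc"); \
	} while (0)
]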
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 20:59 ` Andy Lutomirski 2016-01-12 21:37 ` Linus Torvalds @ 2016-01-12 22:21 ` Michael S. Tsirkin 2016-01-12 22:55 ` H. Peter Anvin 1 sibling, 1 reply; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-12 22:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, H. Peter Anvin, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 12:59:58PM -0800, Andy Lutomirski wrote: > On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote: > >> > >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64) > >> was better because it avoided stomping on very-likely-to-be-hot write > >> buffers. > > > > I suspect it could go either way. You want a small constant (for the > > isntruction size), but any small constant is likely to be within the > > current stack frame anyway. I don't think 0(%rsp) is particularly > > likely to have a spill on it right then and there, but who knows.. > > > > And 64(%rsp) is possibly going to be cold in the L1 cache, especially > > if it's just after a deep function call. Which it might be. So it > > might work the other way. > > > > So my guess would be that you wouldn't be able to measure the > > difference. It might be there, but probably too small to really see in > > any noise. > > > > But numbers talk, bullshit walks. It would be interesting to be proven wrong. > > Here's an article with numbers: > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/ > > I think they're suggesting using a negative offset, which is safe as > long as it doesn't page fault, even though we have the redzone > disabled. > > --Andy OK so I'll have to tweak the test to put something on stack to measure the difference: my test tweaks a global variable instead. I'll try that by tomorrow. I couldn't measure any difference between mfence and lock+addl except in a micro-benchmark, but hey since we are tweaking this, let's do the optimal thing. -- MST ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 22:21 ` Michael S. Tsirkin @ 2016-01-12 22:55 ` H. Peter Anvin 2016-01-12 23:24 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2016-01-12 22:55 UTC (permalink / raw) To: Michael S. Tsirkin, Andy Lutomirski Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On 01/12/16 14:21, Michael S. Tsirkin wrote: > > OK so I'll have to tweak the test to put something > on stack to measure the difference: my test tweaks a > global variable instead. > I'll try that by tomorrow. > > I couldn't measure any difference between mfence and lock+addl > except in a micro-benchmark, but hey since we are tweaking this, > let's do the optimal thing. > Be careful with this: if it only shows up in a microbenchmark, we may introduce a hard-to-debug regression for no real benefit. -hpa ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 22:55 ` H. Peter Anvin @ 2016-01-12 23:24 ` Linus Torvalds 2016-01-13 16:17 ` Borislav Petkov 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2016-01-12 23:24 UTC (permalink / raw) To: H. Peter Anvin Cc: Michael S. Tsirkin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 2:55 PM, H. Peter Anvin <hpa@zytor.com> wrote: > > Be careful with this: if it only shows up in a microbenchmark, we may > introduce a hard-to-debug regression for no real benefit. So I can pretty much guarantee that it shouldn't regress from a correctness angle, since we rely *heavily* on locked instructions being barriers, in locking and in various other situations. Indeed, much more so than we ever rely on "smp_mb()". The places that rely on smp_mb() are pretty few in the end. So I think the only issue is whether sometimes "mfence" might be faster. So far, I've never actually heard of that being the case. The fence instructions have always sucked when I've seen them. But talking to the hw people about this is certainly a good idea regardless. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
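[An illustrative example, not from the thread, of the point that LOCK-prefixed read-modify-writes already order like full barriers on x86 -- written here in plain C11 rather than kernel code, with made-up names:

	#include <stdatomic.h>

	static atomic_int flag_a, flag_b;

	int publish_and_check(void)
	{
		/* A seq_cst exchange typically compiles to XCHG on x86, which
		 * carries an implicit LOCK prefix and therefore provides the
		 * store->load ordering a plain store plus smp_mb() would. */
		atomic_exchange(&flag_a, 1);

		/* This load cannot move above the exchange. */
		return atomic_load(&flag_b);
	}
]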
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-12 23:24 ` Linus Torvalds @ 2016-01-13 16:17 ` Borislav Petkov 2016-01-13 16:25 ` Michael S. Tsirkin 0 siblings, 1 reply; 37+ messages in thread From: Borislav Petkov @ 2016-01-13 16:17 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Michael S. Tsirkin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote: > But talking to the hw people about this is certainly a good idea regardless. I'm not seeing it in this thread but I might've missed it too. Anyway, I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't touch them. Do we care? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:17 ` Borislav Petkov @ 2016-01-13 16:25 ` Michael S. Tsirkin 2016-01-13 16:33 ` Borislav Petkov 0 siblings, 1 reply; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 16:25 UTC (permalink / raw) To: Borislav Petkov Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Wed, Jan 13, 2016 at 05:17:04PM +0100, Borislav Petkov wrote: > On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote: > > But talking to the hw people about this is certainly a good idea regardless. > > I'm not seeing it in this thread but I might've missed it too. Anyway, > I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't > touch them. > > Do we care? Which flag do you refer to, exactly? > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:25 ` Michael S. Tsirkin @ 2016-01-13 16:33 ` Borislav Petkov 2016-01-13 16:42 ` Michael S. Tsirkin 0 siblings, 1 reply; 37+ messages in thread From: Borislav Petkov @ 2016-01-13 16:33 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote: > Which flag do you refer to, exactly? All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:33 ` Borislav Petkov @ 2016-01-13 16:42 ` Michael S. Tsirkin 2016-01-13 16:53 ` Borislav Petkov 2016-01-13 18:38 ` Linus Torvalds 0 siblings, 2 replies; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 16:42 UTC (permalink / raw) To: Borislav Petkov Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Wed, Jan 13, 2016 at 05:33:31PM +0100, Borislav Petkov wrote: > On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote: > > Which flag do you refer to, exactly? > > All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF Oh, I think this means we need a "cc" clobber. This also seems to be a bug on !XMM CPUs. cmpxchg.h gets it right. I'll send a patch. > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 37+ messages in thread
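[Concretely, the missing clobber amounts to one extra token in the asm statement; a sketch with assumed macro names, not the actual patch:

	/* before: the ADD silently rewrites OF/SF/ZF/AF/PF/CF behind gcc's back */
	#define smp_mb_noclobber()	asm volatile("lock; addl $0,0(%%rsp)" ::: "memory")

	/* after: declare the condition-code clobber explicitly */
	#define smp_mb_clobber()	asm volatile("lock; addl $0,0(%%rsp)" ::: "memory", "cc")

As Linus notes further down, gcc on x86 has historically treated flags as clobbered by any asm anyway, so this is mostly documentation plus insurance against a future change in semantics.]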
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:42 ` Michael S. Tsirkin @ 2016-01-13 16:53 ` Borislav Petkov 2016-01-13 17:00 ` Michael S. Tsirkin 2016-01-13 18:38 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Borislav Petkov @ 2016-01-13 16:53 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote: > Oh, I think this means we need a "cc" clobber. Btw, does your microbenchmark do it too? Because, the "cc" clobber should cause additional handling of flags, depending on the context. It won't matter if the context doesn't need rFLAGS handling in the benchmark but if we start using LOCK; ADD in the kernel, I can imagine some places where mb() is used and rFLAGS are live, causing gcc to either reorder code or stash them away... -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:53 ` Borislav Petkov @ 2016-01-13 17:00 ` Michael S. Tsirkin 0 siblings, 0 replies; 37+ messages in thread From: Michael S. Tsirkin @ 2016-01-13 17:00 UTC (permalink / raw) To: Borislav Petkov Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Wed, Jan 13, 2016 at 05:53:20PM +0100, Borislav Petkov wrote: > On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote: > > Oh, I think this means we need a "cc" clobber. > > Btw, does your microbenchmark do it too? Yes - I fixed it now, but it did not affect the result. We'd need some code where gcc carries flags around though. > Because, the "cc" clobber should cause additional handling of flags, > depending on the context. It won't matter if the context doesn't need > rFLAGS handling in the benchmark but if we start using LOCK; ADD in the > kernel, I can imagine some places where mb() is used and rFLAGS are > live, causing gcc to either reorder code or stash them away... It will reorder code but not necessarily for the worse :) Best I can do, I will add cc clobber to kernel and see whether binary size grows. > -- > Regards/Gruss, > Boris. > > ECO tip #101: Trim your mails when you reply. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb() 2016-01-13 16:42 ` Michael S. Tsirkin 2016-01-13 16:53 ` Borislav Petkov @ 2016-01-13 18:38 ` Linus Torvalds 1 sibling, 0 replies; 37+ messages in thread From: Linus Torvalds @ 2016-01-13 18:38 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Borislav Petkov, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar On Wed, Jan 13, 2016 at 8:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > Oh, I think this means we need a "cc" clobber. It's probably a good idea to add one. Historically, gcc doesn't need one on x86, and always considers flags clobbered. We are probably missing the cc clobber in a *lot* of places for this reason. But even if not necessary, it's probably a good thing to add for documentation, and in case gcc semantcs ever change. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb()
  2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
@ 2015-10-27 19:53 ` Davidlohr Bueso
  2015-12-04 12:01   ` [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation tip-bot for Davidlohr Bueso
  2015-10-27 23:27 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions David Howells
  4 siblings, 1 reply; 37+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave, linux-kernel

It serves no purpose but to confuse readers, and is most likely a left
over from constant memory-barriers.txt updates. Ie:
http://lists.openwall.net/linux-kernel/2006/07/15/27

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 Documentation/memory-barriers.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index b5fe765..ee6fbb0 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1683,8 +1683,8 @@ There are some more advanced barrier functions:
  (*) smp_store_mb(var, value)

      This assigns the value to the variable and then inserts a full memory
-     barrier after it, depending on the function. It isn't guaranteed to
-     insert anything more than a compiler barrier in a UP compilation.
+     barrier after it. It isn't guaranteed to insert anything more than a
+     compiler barrier in a UP compilation.

  (*) smp_mb__before_atomic();
--
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread
* [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation
  2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
@ 2015-12-04 12:01   ` tip-bot for Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-12-04 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: corbet, tglx, linux-kernel, hpa, linux-arch, paulmck, torvalds,
      mingo, dave, peterz, akpm

Commit-ID:  2d142e599bf73ab70a3457e6947f86935245415e
Gitweb:     http://git.kernel.org/tip/2d142e599bf73ab70a3457e6947f86935245415e
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 27 Oct 2015 12:53:51 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 11:39:51 +0100

locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation

It serves no purpose but to confuse readers, and is most likely a left
over from constant memory-barriers.txt updates. I.e.:

  http://lists.openwall.net/linux-kernel/2006/07/15/27

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-arch@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445975631-17047-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/memory-barriers.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index aef9487..c85054d 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1673,8 +1673,8 @@ There are some more advanced barrier functions:
  (*) smp_store_mb(var, value)

      This assigns the value to the variable and then inserts a full memory
-     barrier after it, depending on the function. It isn't guaranteed to
-     insert anything more than a compiler barrier in a UP compilation.
+     barrier after it. It isn't guaranteed to insert anything more than a
+     compiler barrier in a UP compilation.

  (*) smp_mb__before_atomic();

^ permalink raw reply related	[flat|nested] 37+ messages in thread
* Re: [PATCH 1/4] arch,cmpxchg: Remove tas() definitions 2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso ` (3 preceding siblings ...) 2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso @ 2015-10-27 23:27 ` David Howells 4 siblings, 0 replies; 37+ messages in thread From: David Howells @ 2015-10-27 23:27 UTC (permalink / raw) To: Davidlohr Bueso Cc: dhowells, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Linus Torvalds, Paul E. McKenney, linux-kernel, Steven Miao, Aurelien Jacquiot, Chris Metcalf, Davidlohr Bueso Davidlohr Bueso <dave@stgolabs.net> wrote: > It seems that 5dc12ddee93 (Remove tas()) missed some files. Correct > this and fully drop this macro, for which we should be using cmpxchg > like calls. > > Cc: Steven Miao <realmz6@gmail.com> > Cc: Aurelien Jacquiot <a-jacquiot@ti.com> > Cc: Chris Metcalf <cmetcalf@ezchip.com> > Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: David Howells <dhowells@redhat.com> [frv] ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2016-01-13 18:38 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
2015-12-04 12:01 ` [tip:locking/core] locking/cmpxchg, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
2015-10-27 20:03 ` Davidlohr Bueso
2015-12-04 12:01 ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
2015-10-27 21:33 ` Linus Torvalds
2015-10-27 22:01 ` Davidlohr Bueso
2015-10-27 22:37 ` Peter Zijlstra
2015-10-28 19:49 ` Davidlohr Bueso
2015-11-02 20:15 ` Davidlohr Bueso
2015-11-03  0:06 ` Linus Torvalds
2015-11-03  1:36 ` Davidlohr Bueso
2016-01-12 13:57 ` Michael S. Tsirkin
2016-01-12 17:20 ` Linus Torvalds
2016-01-12 17:45 ` Michael S. Tsirkin
2016-01-12 18:04 ` Linus Torvalds
2016-01-12 20:30 ` Andy Lutomirski
2016-01-12 20:54 ` Linus Torvalds
2016-01-12 20:59 ` Andy Lutomirski
2016-01-12 21:37 ` Linus Torvalds
2016-01-12 22:14 ` Michael S. Tsirkin
2016-01-13 16:20 ` Michael S. Tsirkin
2016-01-12 22:21 ` Michael S. Tsirkin
2016-01-12 22:55 ` H. Peter Anvin
2016-01-12 23:24 ` Linus Torvalds
2016-01-13 16:17 ` Borislav Petkov
2016-01-13 16:25 ` Michael S. Tsirkin
2016-01-13 16:33 ` Borislav Petkov
2016-01-13 16:42 ` Michael S. Tsirkin
2016-01-13 16:53 ` Borislav Petkov
2016-01-13 17:00 ` Michael S. Tsirkin
2016-01-13 18:38 ` Linus Torvalds
2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
2015-12-04 12:01 ` [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation tip-bot for Davidlohr Bueso
2015-10-27 23:27 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions David Howells