* [PATCH] arch: Introduce read_acquire()
@ 2014-11-11 18:57 alexander.duyck
2014-11-11 18:57 ` alexander.duyck
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: alexander.duyck @ 2014-11-11 18:57 UTC (permalink / raw)
To: linux-arch, linux-kernel
Cc: Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck,
Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens,
Oleg Nesterov, Will Deacon, Michael Ellerman, Geert Uytterhoeven,
Frederic Weisbecker, Martin Schwidefsky, Russell King,
Paul E. McKenney, Linus Torvalds, Ingo Molnar
From: Alexander Duyck <alexander.h.duyck@redhat.com>
In the case of device drivers it is common to use receive descriptors
in which a single field determines whether the descriptor is currently
owned by the device or the CPU. To prevent any of the other fields from
being read prematurely, an rmb() is used, resulting in code like this
snippet from ixgbe_main.c:
	if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
		break;

	/*
	 * This memory barrier is needed to keep us from reading
	 * any other fields out of the rx_desc until we know the
	 * RXD_STAT_DD bit is set
	 */
	rmb();
On reviewing the documentation and code for smp_load_acquire() it occurred
to me that implementing something similar for CPU <-> device interaction
would be worthwhile. This commit provides just the load/read side of this
in the form of read_acquire(). This new primitive orders the specified
read against any subsequent reads. As a result we can reduce the above
code snippet down to:
	/* This memory barrier is needed to keep us from reading
	 * any other fields out of the rx_desc until we know the
	 * RXD_STAT_DD bit is set
	 */
	if (!(read_acquire(&rx_desc->wb.upper.status_error) &
	      cpu_to_le32(IXGBE_RXD_STAT_DD)))
		break;
With this commit and the above change I have seen a reduction in processing
time of at least 7ns per 64B frame in the ixgbe driver on an Intel
Core i7-4930K.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
arch/arm/include/asm/barrier.h | 8 ++++++++
arch/arm64/include/asm/barrier.h | 10 ++++++++++
arch/ia64/include/asm/barrier.h | 2 ++
arch/metag/include/asm/barrier.h | 8 ++++++++
arch/mips/include/asm/barrier.h | 8 ++++++++
arch/powerpc/include/asm/barrier.h | 8 ++++++++
arch/s390/include/asm/barrier.h | 2 ++
arch/sparc/include/asm/barrier_64.h | 1 +
arch/x86/include/asm/barrier.h | 10 ++++++++++
include/asm-generic/barrier.h | 8 ++++++++
10 files changed, 65 insertions(+)
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..b082578 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,14 @@
#define smp_wmb() dmb(ishst)
#endif
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..5b0bfa7 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -52,6 +52,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#else
#define smp_mb() dmb(ish)
@@ -90,6 +98,8 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
#endif
#define read_barrier_depends() do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..2288d09 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -78,6 +78,8 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
/*
* XXX check on this ---I suspect what Linus really wants here is
* acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..670b679 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -100,6 +100,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..aa5eb06 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -195,6 +195,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_mb__before_atomic() smp_mb__before_llsc()
#define smp_mb__after_atomic() smp_llsc_mb()
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..3ddc884 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -84,6 +84,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_mb__before_atomic() smp_mb()
#define smp_mb__after_atomic() smp_mb()
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..516ad04 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -50,4 +50,6 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
#endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..c0ba305 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -68,6 +68,7 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..6aa9641 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -118,6 +118,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#else /* regular x86 TSO memory ordering */
#define smp_store_release(p, v) \
@@ -135,6 +143,8 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
#endif
/* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c186bfb 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,14 @@
#define smp_mb__after_atomic() smp_mb()
#endif
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
* Re: [PATCH] arch: Introduce read_acquire()
  2014-11-11 18:57 [PATCH] arch: Introduce read_acquire() alexander.duyck
  2014-11-11 18:57 ` alexander.duyck
@ 2014-11-11 19:40 ` Linus Torvalds
  ` (2 more replies)
  2014-11-11 19:47 ` Will Deacon
  2 siblings, 3 replies; 15+ messages in thread
From: Linus Torvalds @ 2014-11-11 19:40 UTC (permalink / raw)
To: alexander.duyck
Cc: linux-arch@vger.kernel.org, Linux Kernel Mailing List,
    Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck,
    Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens,
    Oleg Nesterov, Will Deacon, Michael Ellerman, Geert Uytterhoeven,
    Frederic Weisbecker, Martin Schwidefsky, Russell King,
    Paul E. McKenney, Ingo Molnar

On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote:
>
> On reviewing the documentation and code for smp_load_acquire() it occurred
> to me that implementing something similar for CPU <-> device interaction
> would be worthwhile. This commit provides just the load/read side of this
> in the form of read_acquire().

So I don't hate the concept, but there are a couple of reasons to think
this is broken.

One is just the name. Why do we have "smp_load_acquire()", but then call
the non-smp version "read_acquire()"? That makes very little sense to
me. Why did "load" become "read"?

The other is more of a technical issue. Namely the fact that being
*device* ordered is completely and totally different from being *CPU*
ordered.

All sane modern architectures do memory ordering as part of the cache
coherency protocol. But part of that is that it actually requires all
the CPUs to follow said cache coherency protocol.

And bus-master devices don't necessarily follow the same ordering rules.
Yes, any sane DMA will be cache-coherent, although sadly the insane kind
still exists. But even when DMA is cache _coherent_, that does not
necessarily mean that that coherency follows the *ordering* guarantees.

Now, in practice, I think that DMA tends to be more strictly ordered
than CPU memory ordering is, and the above all "just works". But I'd
really want a lot of acks from architecture maintainers. Particularly
PowerPC and ARM64. I'm not 100% sure that "smp_load_acquire()" will
necessarily order the read wrt DMA traffic. PPC in particular has some
really odd IO ordering rules, but I *think* all the problems come up
with just MMIO, not with DMA.

But we do have a very real difference between "smp_rmb()" (inter-cpu
cache coherency read barrier) and "rmb()" (full memory barrier that
synchronizes with IO).

And your patch is very confused about this. In *some* places you use
"rmb()", and in other places you just use "smp_load_acquire()". Have you
done extensive verification to check that this is actually ok? Because
the performance difference you quote very much seems to be about your
x86 testing now skipping the IO-synchronizing "rmb()", and depending on
DMA being ordered even without it.

And I'm pretty sure that's actually fine on x86. The real
IO-synchronizing rmb() (which translates into an lfence) is only needed
when you have uncached accesses (ie mmio) on x86. So I don't think your
code is wrong, I just want to verify that everybody understands the
issues. I'm not even sure DMA can ever really have weaker memory
ordering (I really don't see how you'd be able to do a read barrier
without DMA stores being ordered natively), so maybe I worry too much,
but the ppc people in particular should look at this, because the ppc
memory ordering rules and serialization are some completely odd ad-hoc
black magic....

But anything with non-cache-coherent DMA is obviously very suspect too.

                  Linus
* Re: [PATCH] arch: Introduce read_acquire()
  2014-11-11 19:40 ` Linus Torvalds
@ 2014-11-11 20:45 ` Alexander Duyck
  2014-11-12 10:10   ` Peter Zijlstra
  2014-11-12 10:10 ` Will Deacon
  2 siblings, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2014-11-11 20:45 UTC (permalink / raw)
To: Linus Torvalds, alexander.duyck
Cc: linux-arch@vger.kernel.org, Linux Kernel Mailing List,
    Michael Neuling, Tony Luck, Mathieu Desnoyers, Peter Zijlstra,
    Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Will Deacon,
    Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker,
    Martin Schwidefsky, Russell King, Paul E. McKenney, Ingo Molnar

On 11/11/2014 11:40 AM, Linus Torvalds wrote:
> On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote:
>>
>> On reviewing the documentation and code for smp_load_acquire() it
>> occurred to me that implementing something similar for CPU <-> device
>> interaction would be worthwhile. This commit provides just the
>> load/read side of this in the form of read_acquire().
>
> So I don't hate the concept, but there are a couple of reasons to think
> this is broken.
>
> One is just the name. Why do we have "smp_load_acquire()", but then
> call the non-smp version "read_acquire()"? That makes very little
> sense to me. Why did "load" become "read"?

The idea behind read vs load in the name was because smp_load_acquire()
is using a full smp_mb() whereas this just falls back to rmb() for the
cases it is dealing with. My main concern is that a full memory barrier
would be more expensive, so I didn't want to give the impression that
this is as complete as smp_load_acquire(). The read_acquire() call is
not strictly enforcing any limitations on writes/stores, although there
are a few cases where the barriers used do leak that functionality over
as a side-effect.

> The other is more of a technical issue. Namely the fact that being
> *device* ordered is completely and totally different from being *CPU*
> ordered.
>
> All sane modern architectures do memory ordering as part of the cache
> coherency protocol. But part of that is that it actually requires all
> the CPUs to follow said cache coherency protocol.
>
> And bus-master devices don't necessarily follow the same ordering
> rules. Yes, any sane DMA will be cache-coherent, although sadly the
> insane kind still exists. But even when DMA is cache _coherent_, that
> does not necessarily mean that that coherency follows the *ordering*
> guarantees.
>
> Now, in practice, I think that DMA tends to be more strictly ordered
> than CPU memory ordering is, and the above all "just works". But I'd
> really want a lot of acks from architecture maintainers. Particularly
> PowerPC and ARM64. I'm not 100% sure that "smp_load_acquire()" will
> necessarily order the read wrt DMA traffic. PPC in particular has some
> really odd IO ordering rules, but I *think* all the problems come up
> with just MMIO, not with DMA.
>
> But we do have a very real difference between "smp_rmb()" (inter-cpu
> cache coherency read barrier) and "rmb()" (full memory barrier that
> synchronizes with IO).
>
> And your patch is very confused about this. In *some* places you use
> "rmb()", and in other places you just use "smp_load_acquire()". Have
> you done extensive verification to check that this is actually ok?
> Because the performance difference you quote very much seems to be
> about your x86 testing now skipping the IO-synchronizing "rmb()", and
> depending on DMA being ordered even without it.

I am far from an expert on some of these synchronization primitives so
I probably did make some blunders. I meant to send this as an RFC, but
as is I plan to submit a v2 to just fix the spelling errors in the
patch description anyway.

The architectures where I thought I might be able to take advantage of
the smp_load_acquire() functionality are arm, x86, ia64, sparc, and
s390. The rest are essentially still just an rmb() call following the
volatile read.

> And I'm pretty sure that's actually fine on x86. The real
> IO-synchronizing rmb() (which translates into an lfence) is only needed
> when you have uncached accesses (ie mmio) on x86. So I don't think
> your code is wrong, I just want to verify that everybody understands
> the issues. I'm not even sure DMA can ever really have weaker memory
> ordering (I really don't see how you'd be able to do a read barrier
> without DMA stores being ordered natively), so maybe I worry too much,
> but the ppc people in particular should look at this, because the ppc
> memory ordering rules and serialization are some completely odd ad-hoc
> black magic....

Like I said before, the PowerPC version is likely the lowest risk. It
is just a standard rmb() call after the read. I'm pretty sure x86 is
safe as well, since the issue that triggered the introduction of the
rmb() into the Intel network drivers was on a PowerPC as I recall, and
the code had been running on x86 for quite some time without issue when
it was reported.

> But anything with non-cache-coherent DMA is obviously very suspect too.
>
> Linus

Well, for now I am only focused on what should be cache-coherent DMA,
which usually applies to things like descriptor rings where interaction
is expected to be bi-directional without the need for a map or unmap.

Thanks,

Alex
* Re: [PATCH] arch: Introduce read_acquire()
  2014-11-11 20:45 ` Alexander Duyck
@ 2014-11-12 10:10 ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2014-11-12 10:10 UTC (permalink / raw)
To: Alexander Duyck
Cc: Linus Torvalds, alexander.duyck, linux-arch@vger.kernel.org,
    Linux Kernel Mailing List, Michael Neuling, Tony Luck,
    Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens,
    Oleg Nesterov, Will Deacon, Michael Ellerman, Geert Uytterhoeven,
    Frederic Weisbecker, Martin Schwidefsky, Russell King,
    Paul E. McKenney, Ingo Molnar

On Tue, Nov 11, 2014 at 12:45:23PM -0800, Alexander Duyck wrote:
> On 11/11/2014 11:40 AM, Linus Torvalds wrote:
> > One is just the name. Why do we have "smp_load_acquire()", but then
> > call the non-smp version "read_acquire()"? That makes very little
> > sense to me. Why did "load" become "read"?
>
> The idea behind read vs load in the name was because smp_load_acquire()
> is using a full smp_mb() whereas this just falls back to rmb() for the
> cases it is dealing with. My main concern is that a full memory barrier
> would be more expensive, so I didn't want to give the impression that
> this is as complete as smp_load_acquire(). The read_acquire() call is
> not strictly enforcing any limitations on writes/stores, although there
> are a few cases where the barriers used do leak that functionality over
> as a side-effect.

Then I object. We should not name it acquire if it does not in fact
provide acquire semantics. Memory ordering is hard enough, we don't need
random weird semantics mixed in just because.
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 19:40 ` Linus Torvalds 2014-11-11 19:40 ` Linus Torvalds 2014-11-11 20:45 ` Alexander Duyck @ 2014-11-12 10:10 ` Will Deacon 2014-11-12 15:42 ` Alexander Duyck 2 siblings, 1 reply; 15+ messages in thread From: Will Deacon @ 2014-11-12 10:10 UTC (permalink / raw) To: Linus Torvalds Cc: alexander.duyck@gmail.com, linux-arch@vger.kernel.org, Linux Kernel Mailing List, Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Ingo Molnar On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote: > On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote: > > On reviewing the documentation and code for smp_load_acquire() it occured > > to me that implementing something similar for CPU <-> device interraction > > would be worth while. This commit provides just the load/read side of this > > in the form of read_acquire(). > > So I don't hate the concept, but. there's a couple of reasons to think > this is broken. > > One is just the name. Why do we have "smp_load_acquire()", but then > call the non-smp version "read_acquire()"? That makes very little > sense to me. Why did "load" become "read"? [...] > But we do have a very real difference between "smp_rmb()" (inter-cpu > cache coherency read barrier) and "rmb()" (full memory barrier that > synchronizes with IO). > > And your patch is very confused about this. In *some* places you use > "rmb()", and in other places you just use "smp_load_acquire()". Have > you done extensive verification to check that this is actually ok? > Because the performance difference you quote very much seems to be > about your x86 testing now akipping the IO-synchronizing "rmb()", and > depending on DMA being ordered even without it. > > And I'm pretty sure that's actually fine on x86. 
The real > IO-synchronizing rmb() (which translates into a lfence) is only needed > for when you have uncached accesses (ie mmio) on x86. So I don't think > your code is wrong, I just want to verify that everybody understands > the issues. I'm not even sure DMA can ever really have weaker memory > ordering (I really don't see how you'd be able to do a read barrier > without DMA stores being ordered natively), so maybe I worry too much, > but the ppc people in particular should look at this, because the ppc > memory ordering rules and serialization are some completely odd ad-hoc > black magic.... Right, so now I see what's going on here. This isn't actually anything to do with acquire/release (I don't know of any architectures that have a read-barrier-acquire instruction), it's all about DMA to main memory. If a device is DMA'ing data *and* control information (e.g. 'descriptor valid') to memory, then it must be maintaining order between those writes with respect to memory. In that case, using the usual MMIO barriers can be overkill because we really just want to enforce read-ordering on the CPU side. In fact, I think you could even do this with a fake address dependency on ARM (although I'm not actually suggesting we do that). In light of that, it actually sounds like we want a new set of barrier macros that apply only to DMA buffer accesses by the CPU -- they wouldn't enforce ordering against things like MMIO registers. I wonder whether any architectures would implement them differently to the smp_* flavours? > But anything with non-cache-coherent DMA is obviously very suspect too. I think non-cache-coherent DMA should work too (at least, on ARM), but only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable mapping). Will ^ permalink raw reply [flat|nested] 15+ messages in thread
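Will's point about enforcing only read-ordering on the CPU side of a coherent DMA buffer can be sketched with C11 atomics. This is illustrative userspace code, not the kernel API under discussion; the descriptor layout and names are invented for the sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical RX descriptor: the device DMAs the payload fields
 * first, then sets the DD ("descriptor done") bit in status_error. */
struct rx_desc {
    uint64_t addr;
    uint32_t length;
    _Atomic uint32_t status_error;
};

#define RXD_STAT_DD 0x01u

/* The read_acquire() pattern: an acquire load of status_error orders
 * it before any subsequent reads of the other descriptor fields,
 * without requiring a full IO-synchronizing barrier. */
static int rx_desc_ready(struct rx_desc *desc)
{
    uint32_t status = atomic_load_explicit(&desc->status_error,
                                           memory_order_acquire);
    return (status & RXD_STAT_DD) != 0;
}
```

On x86 such an acquire load compiles to a plain load plus a compiler barrier, which matches Linus's observation that the heavyweight lfence is only needed for uncached MMIO accesses.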
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 10:10 ` Will Deacon @ 2014-11-12 15:42 ` Alexander Duyck 0 siblings, 0 replies; 15+ messages in thread From: Alexander Duyck @ 2014-11-12 15:42 UTC (permalink / raw) To: Will Deacon, Linus Torvalds Cc: alexander.duyck@gmail.com, linux-arch@vger.kernel.org, Linux Kernel Mailing List, Michael Neuling, Tony Luck, Mathieu Desnoyers, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Ingo Molnar On 11/12/2014 02:10 AM, Will Deacon wrote: > On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote: >> On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote: >>> On reviewing the documentation and code for smp_load_acquire() it occured >>> to me that implementing something similar for CPU <-> device interraction >>> would be worth while. This commit provides just the load/read side of this >>> in the form of read_acquire(). >> So I don't hate the concept, but. there's a couple of reasons to think >> this is broken. >> >> One is just the name. Why do we have "smp_load_acquire()", but then >> call the non-smp version "read_acquire()"? That makes very little >> sense to me. Why did "load" become "read"? > [...] > >> But we do have a very real difference between "smp_rmb()" (inter-cpu >> cache coherency read barrier) and "rmb()" (full memory barrier that >> synchronizes with IO). >> >> And your patch is very confused about this. In *some* places you use >> "rmb()", and in other places you just use "smp_load_acquire()". Have >> you done extensive verification to check that this is actually ok? >> Because the performance difference you quote very much seems to be >> about your x86 testing now akipping the IO-synchronizing "rmb()", and >> depending on DMA being ordered even without it. >> >> And I'm pretty sure that's actually fine on x86. 
The real >> IO-synchronizing rmb() (which translates into a lfence) is only needed >> for when you have uncached accesses (ie mmio) on x86. So I don't think >> your code is wrong, I just want to verify that everybody understands >> the issues. I'm not even sure DMA can ever really have weaker memory >> ordering (I really don't see how you'd be able to do a read barrier >> without DMA stores being ordered natively), so maybe I worry too much, >> but the ppc people in particular should look at this, because the ppc >> memory ordering rules and serialization are some completely odd ad-hoc >> black magic.... > Right, so now I see what's going on here. This isn't actually anything > to do with acquire/release (I don't know of any architectures that have > a read-barrier-acquire instruction), it's all about DMA to main memory. Actually it is sort of, I just hadn't realized it until I read some of the explanations of the C11 acquire/release memory order specifics, but I believe most network drivers are engaged in acquire/release logic because we are usually using something such as a lockless descriptor ring to pass data back and forth between the device and the system. The net win for device drivers is that we can remove some of the heavy-weight barriers that are having to be used by making use of lighter barriers or primitives such as lwsync vs sync in PowerPC or ldar vs dsb(ld) on arm64. > If a device is DMA'ing data *and* control information (e.g. 'descriptor > valid') to memory, then it must be maintaining order between those writes > with respect to memory. In that case, using the usual MMIO barriers can > be overkill because we really just want to enforce read-ordering on the CPU > side. In fact, I think you could even do this with a fake address dependency > on ARM (although I'm not actually suggesting we do that). 
> > In light of that, it actually sounds like we want a new set of barrier > macros that apply only to DMA buffer accesses by the CPU -- they wouldn't > enforce ordering against things like MMIO registers. I wonder whether any > architectures would implement them differently to the smp_* flavours? My concern would be the cost of the barriers vs the acquire/release primitives. In the case of arm64 I am assuming there is a reason for wanting to use ldar vs dsb instructions. I would imagine the devices drivers would want to get the same kind of advantages. >> But anything with non-cache-coherent DMA is obviously very suspect too. > I think non-cache-coherent DMA should work too (at least, on ARM), but > only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable > mapping). > > Will For now my plan is to focus on coherent memory only with this. Specifically it is only really intended for use with dma_alloc_coherent. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
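The lockless descriptor ring Alexander describes maps onto the classic single-producer/single-consumer pattern, where index updates carry release semantics and index reads carry acquire semantics. A toy userspace sketch with C11 atomics (the ring size, field names, and API are invented for illustration):

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8  /* power of two so masking works */

struct ring {
    int slots[RING_SIZE];
    _Atomic size_t head;  /* written only by the producer */
    _Atomic size_t tail;  /* written only by the consumer */
};

static int ring_put(struct ring *r, int v)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return 0;                         /* full */
    r->slots[head & (RING_SIZE - 1)] = v;
    /* release: the slot write above is visible before the new head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

static int ring_get(struct ring *r, int *v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0;                         /* empty */
    *v = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

The acquire/release pairs on head and tail are exactly the lighter-weight primitives (lwsync on PowerPC, ldar on arm64) that the thread argues should replace full rmb()/wmb() for coherent-memory-only ordering.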
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 18:57 [PATCH] arch: Introduce read_acquire() alexander.duyck 2014-11-11 18:57 ` alexander.duyck 2014-11-11 19:40 ` Linus Torvalds @ 2014-11-11 19:47 ` Will Deacon 2014-11-11 21:12 ` Alexander Duyck 2 siblings, 1 reply; 15+ messages in thread From: Will Deacon @ 2014-11-11 19:47 UTC (permalink / raw) To: alexander.duyck@gmail.com Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar Hello, On Tue, Nov 11, 2014 at 06:57:05PM +0000, alexander.duyck@gmail.com wrote: > From: Alexander Duyck <alexander.h.duyck@redhat.com> > > In the case of device drivers it is common to utilize receive descriptors > in which a single field is used to determine if the descriptor is currently > in the possession of the device or the CPU. In order to prevent any other > fields from being read a rmb() is used resulting in something like code > snippet from ixgbe_main.c: > > if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) > break; > > /* > * This memory barrier is needed to keep us from reading > * any other fields out of the rx_desc until we know the > * RXD_STAT_DD bit is set > */ > rmb(); > > On reviewing the documentation and code for smp_load_acquire() it occurred > to me that implementing something similar for CPU <-> device interaction > would be worthwhile. This commit provides just the load/read side of this > in the form of read_acquire(). This new primitive orders the specified > read against any subsequent reads.
As a result we can reduce the above > code snippet down to: > > /* This memory barrier is needed to keep us from reading > * any other fields out of the rx_desc until we know the > * RXD_STAT_DD bit is set > */ > if (!(read_acquire(&rx_desc->wb.upper.status_error) & Minor nit on naming, but load_acquire would match what we do with barriers, where you simply drop the smp_ prefix if you want the thing to work on UP systems too. > cpu_to_le32(IXGBE_RXD_STAT_DD))) > break; I'm not familiar with the driver in question, but how are the descriptors mapped? Is the read barrier here purely limiting re-ordering of normal memory accesses by the CPU? If so, isn't there also scope for store_release when updating, e.g. next_to_watch in the same driver? We also need to understand how this plays out with smp_mb__after_unlock_lock, which is currently *only* implemented by PowerPC. If we end up having a similar mess to mmiowb, where PowerPC both implements the barrier *and* plays tricks in its spin_unlock code, then everybody loses because we'd end up with release doing the right thing anyway. Peter and I spoke with Paul at LPC about strengthening smp_load_acquire/smp_store_release so that release->acquire ordering is maintained, which would allow us to drop smp_mb__after_unlock_lock altogether. That's stronger than acquire/release in C11, but I think it's an awful lot easier to use, particularly if device drivers are going to start using these primitives. Thoughts? Will ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 19:47 ` Will Deacon @ 2014-11-11 21:12 ` Alexander Duyck 2014-11-12 10:15 ` Peter Zijlstra 2014-11-12 20:43 ` David Miller 0 siblings, 2 replies; 15+ messages in thread From: Alexander Duyck @ 2014-11-11 21:12 UTC (permalink / raw) To: Will Deacon, alexander.duyck@gmail.com Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 11/11/2014 11:47 AM, Will Deacon wrote: > Hello, > > On Tue, Nov 11, 2014 at 06:57:05PM +0000, alexander.duyck@gmail.com wrote: >> From: Alexander Duyck <alexander.h.duyck@redhat.com> >> >> In the case of device drivers it is common to utilize receive descriptors >> in which a single field is used to determine if the descriptor is currently >> in the possession of the device or the CPU. In order to prevent any other >> fields from being read a rmb() is used resulting in something like code >> snippet from ixgbe_main.c: >> >> if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) >> break; >> >> /* >> * This memory barrier is needed to keep us from reading >> * any other fields out of the rx_desc until we know the >> * RXD_STAT_DD bit is set >> */ >> rmb(); >> >> On reviewing the documentation and code for smp_load_acquire() it occured >> to me that implementing something similar for CPU <-> device interraction >> would be worth while. This commit provides just the load/read side of this >> in the form of read_acquire(). This new primative orders the specified >> read against any subsequent reads. 
As a result we can reduce the above >> code snippet down to: >> >> /* This memory barrier is needed to keep us from reading >> * any other fields out of the rx_desc until we know the >> * RXD_STAT_DD bit is set >> */ >> if (!(read_acquire(&rx_desc->wb.upper.status_error) & > Minor nit on naming, but load_acquire would match what we do with barriers, > where you simply drop the smp_ prefix if you want the thing to work on UP > systems too. The problem is this is slightly different, load_acquire in my mind would use a mb() call, I only use a rmb(). That is why I chose read_acquire as the name. >> cpu_to_le32(IXGBE_RXD_STAT_DD))) >> break; > I'm not familiar with the driver in question, but how are the descriptors > mapped? Is the read barrier here purely limiting re-ordering of normal > memory accesses by the CPU? If so, isn't there also scope for store_release > when updating, e.g. next_to_watch in the same driver? So the driver in question is using descriptor rings allocated via dma_alloc_coherent. The device is notified that new descriptors are present via a memory mapped I/O register, then the device will read the descriptor via a DMA operation and then write it back with another DMA operation and the process of doing so it will set the IXGBE_RXD_STAT_DD bit. The problem with the store_release logic is that it would need to key off of a write to memory mapped I/O. The idea had crossed my mind, but I wasn't confident I had a good enough understanding of things to try and deal with memory ordering for cacheable and uncachable memory in the same call. I would have to do some more research to see if something like that is even possible as I suspect some of the architectures may not support something like that. > We also need to understand how this plays out with > smp_mb__after_unlock_lock, which is currently *only* implemented by PowerPC. 
> If we end up having a similar mess to mmiowb, where PowerPC both implements > the barrier *and* plays tricks in its spin_unlock code, then everybody > loses because we'd end up with release doing the right thing anyway. PowerPC is not much of a risk in this patch. The implementation I did just fell back to a rmb(). The architectures I need to sort out are arm, x86, sparc, ia64, and s390 as they are the only ones that tried to make use of the smp_load_acquire logic. > Peter and I spoke with Paul at LPC about strengthening > smp_load_acquire/smp_store_release so that release->acquire ordering is > maintained, which would allow us to drop smp_mb__after_unlock_lock > altogether. That's stronger than acquire/release in C11, but I think it's > an awful lot easier to use, particularly if device drivers are going to > start using these primitives. > > Thoughts? > > Will I generally want just enough of a barrier in place to keep things working properly without costing much in terms of CPU time. If you can come up with a generic load_acquire/store_release that could take the place of this function I am fine with that as long as it would function at the same level of performance. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
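The store_release counterpart that Alexander considers for the transmit path can be sketched the same way: populate the descriptor fields, then publish the ownership bit with a release store. This is illustrative C11 userspace code with invented names; as the email notes, the MMIO doorbell write that follows in a real driver would still need its own ordering, which is the complication being discussed:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical TX descriptor; layout and names are not ixgbe's. */
struct tx_desc {
    uint64_t addr;
    uint32_t length;
    _Atomic uint32_t cmd;
};

#define TXD_CMD_OWN 0x80000000u

/* Fill in the descriptor, then release-store the ownership bit so any
 * observer that acquire-loads cmd and sees TXD_CMD_OWN is guaranteed
 * to also see addr and length. This orders coherent-memory accesses
 * only; it says nothing about a subsequent MMIO doorbell write. */
static void tx_desc_publish(struct tx_desc *desc, uint64_t addr,
                            uint32_t len)
{
    desc->addr = addr;
    desc->length = len;
    atomic_store_explicit(&desc->cmd, TXD_CMD_OWN, memory_order_release);
}
```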
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 21:12 ` Alexander Duyck @ 2014-11-12 10:15 ` Peter Zijlstra 2014-11-12 15:23 ` Alexander Duyck 2014-11-12 20:43 ` David Miller 1 sibling, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2014-11-12 10:15 UTC (permalink / raw) To: Alexander Duyck Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: > >Minor nit on naming, but load_acquire would match what we do with barriers, > >where you simply drop the smp_ prefix if you want the thing to work on UP > >systems too. > > The problem is this is slightly different, load_acquire in my mind would use > a mb() call, I only use a rmb(). That is why I chose read_acquire as the > name. acquire is not about rmb vs mb, do read up on Documentation/memory-barriers.txt. Its a distinctly different semantic. Some archs simply lack the means of implementing this semantics and have to revert to mb (stronger is always allowed). Using the read vs load to wreck the acquire semantics is just insane. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 10:15 ` Peter Zijlstra @ 2014-11-12 15:23 ` Alexander Duyck 2014-11-12 15:37 ` Peter Zijlstra 0 siblings, 1 reply; 15+ messages in thread From: Alexander Duyck @ 2014-11-12 15:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 11/12/2014 02:15 AM, Peter Zijlstra wrote: > On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: >>> Minor nit on naming, but load_acquire would match what we do with barriers, >>> where you simply drop the smp_ prefix if you want the thing to work on UP >>> systems too. >> The problem is this is slightly different, load_acquire in my mind would use >> a mb() call, I only use a rmb(). That is why I chose read_acquire as the >> name. > acquire is not about rmb vs mb, do read up on > Documentation/memory-barriers.txt. Its a distinctly different semantic. > Some archs simply lack the means of implementing this semantics and have > to revert to mb (stronger is always allowed). > > Using the read vs load to wreck the acquire semantics is just insane. Actually I have been reading up on it as I wasn't familiar with C11. Most of what I was doing was actually based on the documentation in barriers.txt which was referring to memory operations not loads/stores when referring to the acquire/release so I assumed the full memory barrier was required. I wasn't aware that smp_load_acquire was only supposed to be ordering loads, or that smp_store_release only applied to stores. I will probably go back and re-implement this patch as introducing load_acquire and adding store_release as well.
I figure it is possible that a device could be doing something like reading a linked list in memory ("Example combining sync and lwsync" http://www.ibm.com/developerworks/systems/articles/powerpc.html#N102B1) and then you would need both the store_release and the wmb() to deal with system memory vs system memory ordering, and system memory vs MMIO. Base drivers have been doing this kind of stuff for a while. The problem is we have been doing it using barriers that are heavier than they need to be. For example the reason why I am wanting to shift to using something like an acquire operation is because the current barrier we have been using, rmb(), is far heavier than we need for system memory vs system memory ordering on systems such as PowerPC. What I am looking for with the load_acquire/store_release is to allow for the use of lighter weight barriers for updates that only affect system memory. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 15:23 ` Alexander Duyck @ 2014-11-12 15:37 ` Peter Zijlstra 2014-11-12 19:24 ` Alexander Duyck 0 siblings, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2014-11-12 15:37 UTC (permalink / raw) To: Alexander Duyck Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On Wed, Nov 12, 2014 at 07:23:22AM -0800, Alexander Duyck wrote: > > On 11/12/2014 02:15 AM, Peter Zijlstra wrote: > >On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: > >>>Minor nit on naming, but load_acquire would match what we do with barriers, > >>>where you simply drop the smp_ prefix if you want the thing to work on UP > >>>systems too. > >>The problem is this is slightly different, load_acquire in my mind would use > >>a mb() call, I only use a rmb(). That is why I chose read_acquire as the > >>name. > >acquire is not about rmb vs mb, do read up on > >Documentation/memory-barriers.txt. Its a distinctly different semantic. > >Some archs simply lack the means of implementing this semantics and have > >to revert to mb (stronger is always allowed). > > > >Using the read vs load to wreck the acquire semantics is just insane. > > Actually I have been reading up on it as I wasn't familiar with C11. C11 is _different_ although somewhat related. > Most > of what I was doing was actually based on the documentation in barriers.txt > which was referring to memory operations not loads/stores when referring to > the acquire/release so I assumed the full memory barrier was required. I > wasn't aware that smp_load_acquire was only supposed to be ordering loads, > or that smp_ store_release only applied to stores. It does not.. 
an ACQUIRE is a semi-permeable barrier that doesn't allow LOADs nor STOREs that are issued _after_ it to appear to happen _before_. The RELEASE is the opposite number, it ensures LOADs and STOREs that are issued _before_ cannot happen _after_. This typically matches locking, where a lock (mutex_lock, spin_lock etc..) have ACQUIRE semantics and the unlock RELEASE. Such that: spin_lock(); a = 1; b = x; spin_unlock(); guarantees all LOADs (x) and STORESs (a,b) happen _inside_ the lock region. What they do not guarantee is: y = 1; spin_lock() a = 1; b = x; spin_unlock() z = 4; An order between y and z, both are allowed _into_ the region and can cross there like: spin_lock(); ... z = 4; y = 1; ... spin_unlock(); The only 'open' issue at the moment is if RELEASE+ACQUIRE := MB. Currently we say this is not so, but Will (and me) would very much like this to be so -- PPC64 being the only arch that actually makes this distinction. ^ permalink raw reply [flat|nested] 15+ messages in thread
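Peter's spin_lock/spin_unlock example corresponds to an acquire operation on lock and a release operation on unlock. A minimal userspace sketch with C11 atomics (illustrative only, not the kernel's spinlock implementation):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int a, b, x;

static void spin_lock_sketch(void)
{
    /* test-and-set with acquire ordering: LOADs and STOREs issued
     * after the lock cannot appear to happen before it. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void spin_unlock_sketch(void)
{
    /* clear with release ordering: LOADs and STOREs issued before
     * the unlock cannot appear to happen after it. */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void critical_section(void)
{
    spin_lock_sketch();
    a = 1;      /* confined to the region, as in Peter's example... */
    b = x;
    spin_unlock_sketch();
    /* ...while accesses outside (the y and z above) may still be
     * pulled into the region by the semi-permeable barriers. */
}
```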
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 15:37 ` Peter Zijlstra @ 2014-11-12 19:24 ` Alexander Duyck 0 siblings, 0 replies; 15+ messages in thread From: Alexander Duyck @ 2014-11-12 19:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 11/12/2014 07:37 AM, Peter Zijlstra wrote: > On Wed, Nov 12, 2014 at 07:23:22AM -0800, Alexander Duyck wrote: >> >> On 11/12/2014 02:15 AM, Peter Zijlstra wrote: >>> On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: >>>>> Minor nit on naming, but load_acquire would match what we do with barriers, >>>>> where you simply drop the smp_ prefix if you want the thing to work on UP >>>>> systems too. >>>> The problem is this is slightly different, load_acquire in my mind would use >>>> a mb() call, I only use a rmb(). That is why I chose read_acquire as the >>>> name. >>> acquire is not about rmb vs mb, do read up on >>> Documentation/memory-barriers.txt. Its a distinctly different semantic. >>> Some archs simply lack the means of implementing this semantics and have >>> to revert to mb (stronger is always allowed). >>> >>> Using the read vs load to wreck the acquire semantics is just insane. >> >> Actually I have been reading up on it as I wasn't familiar with C11. > > C11 is _different_ although somewhat related. Honestly I find this quite confusing. If you have some sort of other documentation you can point me at it would be useful in terms of what you are expecting for behaviour and names. 
>> Most >> of what I was doing was actually based on the documentation in barriers.txt >> which was referring to memory operations not loads/stores when referring to >> the acquire/release so I assumed the full memory barrier was required. I >> wasn't aware that smp_load_acquire was only supposed to be ordering loads, >> or that smp_store_release only applied to stores. > > It does not.. an ACQUIRE is a semi-permeable barrier that doesn't allow > LOADs nor STOREs that are issued _after_ it to appear to happen _before_. > The RELEASE is the opposite number, it ensures LOADs and STOREs that are > issued _before_ cannot happen _after_. > > This typically matches locking, where a lock (mutex_lock, spin_lock > etc..) have ACQUIRE semantics and the unlock RELEASE. Such that: > > spin_lock(); > a = 1; > b = x; > spin_unlock(); > > guarantees all LOADs (x) and STOREs (a,b) happen _inside_ the lock > region. What they do not guarantee is: > > > y = 1; > spin_lock() > a = 1; > b = x; > spin_unlock() > z = 4; > > An order between y and z, both are allowed _into_ the region and can > cross there like: > > spin_lock(); > ... > z = 4; > y = 1; > ... > spin_unlock(); > > > The only 'open' issue at the moment is if RELEASE+ACQUIRE := MB. > Currently we say this is not so, but Will (and me) would very much like > this to be so -- PPC64 being the only arch that actually makes this > distinction. In the grand scheme of things I suppose it doesn't matter too much. I actually found some documentation that explains some of the subtle nuances. Specifically, Acquire represents the first row in the table below, and Release represents the second column:

    Acquire ->  LoadLoad   LoadStore
                StoreLoad  StoreStore
                               ^
                               |
                            Release

The LoadStore bit was in question in a few different discussions I read, however as it turns out on x86, sparc, s390, PowerPC, arm64, and ia64 they would give you that as a free benefit anyway.
I think that covers a wide enough gamut that I don't really care if I take a performance hit on the other architectures for implementing a full mb() versus a rmb(). Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 21:12 ` Alexander Duyck 2014-11-12 10:15 ` Peter Zijlstra @ 2014-11-12 20:43 ` David Miller 1 sibling, 0 replies; 15+ messages in thread From: David Miller @ 2014-11-12 20:43 UTC (permalink / raw) To: alexander.h.duyck Cc: will.deacon, alexander.duyck, linux-arch, linux-kernel, mikey, tony.luck, mathieu.desnoyers, peterz, benh, heiko.carstens, oleg, michael, geert, fweisbec, schwidefsky, linux, paulmck, torvalds, mingo From: Alexander Duyck <alexander.h.duyck@redhat.com> Date: Tue, 11 Nov 2014 13:12:32 -0800 > The architectures I need to sort out are arm, x86, sparc, ia64, and > s390 as they are the only ones that tried to make use of the > smp_load_acquire logic. On sparc64, you don't have to worry about anything really. Only mb() does anything. The cpus execute in a reasonably strict memory ordering mode, and that's why rmb/wmb and friends are simply nops with a compiler barrier. ^ permalink raw reply [flat|nested] 15+ messages in thread