* [PATCH] arch: Introduce read_acquire()
@ 2014-11-11 18:57 alexander.duyck
2014-11-11 18:57 ` alexander.duyck
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: alexander.duyck @ 2014-11-11 18:57 UTC (permalink / raw)
To: linux-arch, linux-kernel
Cc: Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck,
Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens,
Oleg Nesterov, Will Deacon, Michael Ellerman, Geert Uytterhoeven,
Frederic Weisbecker, Martin Schwidefsky, Russell King,
Paul E. McKenney, Linus Torvalds, Ingo Molnar
From: Alexander Duyck <alexander.h.duyck@redhat.com>
In the case of device drivers it is common to use receive descriptors
in which a single field determines whether the descriptor is currently
owned by the device or the CPU. To prevent any of the other fields from
being read prematurely, an rmb() is used, resulting in code like this
snippet from ixgbe_main.c:
	if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
		break;

	/*
	 * This memory barrier is needed to keep us from reading
	 * any other fields out of the rx_desc until we know the
	 * RXD_STAT_DD bit is set
	 */
	rmb();
On reviewing the documentation and code for smp_load_acquire() it occurred
to me that implementing something similar for CPU <-> device interaction
would be worthwhile. This commit provides just the load/read side of this
in the form of read_acquire(). This new primitive orders the specified
read against any subsequent reads. As a result we can reduce the above
code snippet down to:
	/* This memory barrier is needed to keep us from reading
	 * any other fields out of the rx_desc until we know the
	 * RXD_STAT_DD bit is set
	 */
	if (!(read_acquire(&rx_desc->wb.upper.status_error) &
	      cpu_to_le32(IXGBE_RXD_STAT_DD)))
		break;
With this commit and the above change I have seen a reduction in processing
time of at least 7ns per 64B frame in the ixgbe driver on an Intel
Core i7-4930K.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
arch/arm/include/asm/barrier.h | 8 ++++++++
arch/arm64/include/asm/barrier.h | 10 ++++++++++
arch/ia64/include/asm/barrier.h | 2 ++
arch/metag/include/asm/barrier.h | 8 ++++++++
arch/mips/include/asm/barrier.h | 8 ++++++++
arch/powerpc/include/asm/barrier.h | 8 ++++++++
arch/s390/include/asm/barrier.h | 2 ++
arch/sparc/include/asm/barrier_64.h | 1 +
arch/x86/include/asm/barrier.h | 10 ++++++++++
include/asm-generic/barrier.h | 8 ++++++++
10 files changed, 65 insertions(+)
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..b082578 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,14 @@
#define smp_wmb() dmb(ishst)
#endif
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..5b0bfa7 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -52,6 +52,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#else
#define smp_mb() dmb(ish)
@@ -90,6 +98,8 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
#endif
#define read_barrier_depends() do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..2288d09 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -78,6 +78,8 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
/*
* XXX check on this ---I suspect what Linus really wants here is
* acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..670b679 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -100,6 +100,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..aa5eb06 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -195,6 +195,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_mb__before_atomic() smp_mb__before_llsc()
#define smp_mb__after_atomic() smp_llsc_mb()
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..3ddc884 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -84,6 +84,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_mb__before_atomic() smp_mb()
#define smp_mb__after_atomic() smp_mb()
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..516ad04 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -50,4 +50,6 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
#endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..c0ba305 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -68,6 +68,7 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..6aa9641 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -118,6 +118,14 @@ do { \
___p1; \
})
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#else /* regular x86 TSO memory ordering */
#define smp_store_release(p, v) \
@@ -135,6 +143,8 @@ do { \
___p1; \
})
+#define read_acquire(p) smp_load_acquire(p)
+
#endif
/* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c186bfb 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,14 @@
#define smp_mb__after_atomic() smp_mb()
#endif
+#define read_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ rmb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
* Re: [PATCH] arch: Introduce read_acquire()
  2014-11-11 18:57 [PATCH] arch: Introduce read_acquire() alexander.duyck
  2014-11-11 18:57 ` alexander.duyck
@ 2014-11-11 19:40 ` Linus Torvalds
  ` (2 more replies)
  2014-11-11 19:47 ` Will Deacon
  2 siblings, 3 replies; 15+ messages in thread
From: Linus Torvalds @ 2014-11-11 19:40 UTC (permalink / raw)
To: alexander.duyck
Cc: linux-arch@vger.kernel.org, Linux Kernel Mailing List,
    Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck,
    Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens,
    Oleg Nesterov, Will Deacon, Michael Ellerman, Geert Uytterhoeven,
    Frederic Weisbecker, Martin Schwidefsky, Russell King,
    Paul E. McKenney, Ingo Molnar

On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote:
>
> On reviewing the documentation and code for smp_load_acquire() it occurred
> to me that implementing something similar for CPU <-> device interaction
> would be worthwhile. This commit provides just the load/read side of this
> in the form of read_acquire().

So I don't hate the concept, but there are a couple of reasons to think
this is broken.

One is just the name. Why do we have "smp_load_acquire()", but then call
the non-smp version "read_acquire()"? That makes very little sense to
me. Why did "load" become "read"?

The other is more of a technical issue. Namely the fact that being
*device* ordered is completely and totally different from being *CPU*
ordered.

All sane modern architectures do memory ordering as part of the cache
coherency protocol. But part of that is that it actually requires all
the CPUs to follow said cache coherency protocol.

And bus-master devices don't necessarily follow the same ordering rules.
Yes, any sane DMA will be cache-coherent, although sadly the insane kind
still exists. But even when DMA is cache _coherent_, that does not
necessarily mean that that coherency follows the *ordering* guarantees.

Now, in practice, I think that DMA tends to be more strictly ordered
than CPU memory ordering is, and the above all "just works". But I'd
really want a lot of acks from architecture maintainers. Particularly
PowerPC and ARM64. I'm not 100% sure that "smp_load_acquire()" will
necessarily order the read wrt DMA traffic. PPC in particular has some
really odd IO ordering rules, but I *think* all the problems come up
with just MMIO, not with DMA.

But we do have a very real difference between "smp_rmb()" (inter-cpu
cache coherency read barrier) and "rmb()" (full memory barrier that
synchronizes with IO).

And your patch is very confused about this. In *some* places you use
"rmb()", and in other places you just use "smp_load_acquire()". Have you
done extensive verification to check that this is actually ok? Because
the performance difference you quote very much seems to be about your
x86 testing now skipping the IO-synchronizing "rmb()", and depending on
DMA being ordered even without it.

And I'm pretty sure that's actually fine on x86. The real
IO-synchronizing rmb() (which translates into an lfence) is only needed
when you have uncached accesses (ie mmio) on x86. So I don't think your
code is wrong, I just want to verify that everybody understands the
issues. I'm not even sure DMA can ever really have weaker memory
ordering (I really don't see how you'd be able to do a read barrier
without DMA stores being ordered natively), so maybe I worry too much,
but the ppc people in particular should look at this, because the ppc
memory ordering rules and serialization are some completely odd ad-hoc
black magic....

But anything with non-cache-coherent DMA is obviously very suspect too.

                  Linus
* Re: [PATCH] arch: Introduce read_acquire()
  2014-11-11 19:40 ` Linus Torvalds
@ 2014-11-11 20:45 ` Alexander Duyck
  2014-11-12 10:10   ` Peter Zijlstra
  2014-11-12 10:10 ` Will Deacon
  2 siblings, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2014-11-11 20:45 UTC (permalink / raw)
To: Linus Torvalds, alexander.duyck
Cc: linux-arch@vger.kernel.org, Linux Kernel Mailing List,
    Michael Neuling, Tony Luck, Mathieu Desnoyers, Peter Zijlstra,
    Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Will Deacon,
    Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker,
    Martin Schwidefsky, Russell King, Paul E. McKenney, Ingo Molnar

On 11/11/2014 11:40 AM, Linus Torvalds wrote:
> On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote:
>>
>> On reviewing the documentation and code for smp_load_acquire() it
>> occurred to me that implementing something similar for CPU <-> device
>> interaction would be worthwhile. This commit provides just the
>> load/read side of this in the form of read_acquire().
>
> So I don't hate the concept, but there are a couple of reasons to think
> this is broken.
>
> One is just the name. Why do we have "smp_load_acquire()", but then
> call the non-smp version "read_acquire()"? That makes very little
> sense to me. Why did "load" become "read"?

The idea behind read vs load in the name was because smp_load_acquire()
is using a full smp_mb() whereas this just falls back to rmb() for the
cases it is dealing with. My main concern is that a full memory barrier
would be more expensive, so I didn't want to give the impression that
this is as complete as smp_load_acquire(). The read_acquire() call is
not strictly enforcing any limitations on writes/stores, although there
are a few cases where the barriers used do leak that functionality over
as a side-effect.

> The other is more of a technical issue. Namely the fact that being
> *device* ordered is completely and totally different from being *CPU*
> ordered.
>
> All sane modern architectures do memory ordering as part of the cache
> coherency protocol. But part of that is that it actually requires all
> the CPUs to follow said cache coherency protocol.
>
> And bus-master devices don't necessarily follow the same ordering
> rules. Yes, any sane DMA will be cache-coherent, although sadly the
> insane kind still exists. But even when DMA is cache _coherent_, that
> does not necessarily mean that that coherency follows the *ordering*
> guarantees.
>
> Now, in practice, I think that DMA tends to be more strictly ordered
> than CPU memory ordering is, and the above all "just works". But I'd
> really want a lot of acks from architecture maintainers. Particularly
> PowerPC and ARM64. I'm not 100% sure that "smp_load_acquire()" will
> necessarily order the read wrt DMA traffic. PPC in particular has some
> really odd IO ordering rules, but I *think* all the problems come up
> with just MMIO, not with DMA.
>
> But we do have a very real difference between "smp_rmb()" (inter-cpu
> cache coherency read barrier) and "rmb()" (full memory barrier that
> synchronizes with IO).
>
> And your patch is very confused about this. In *some* places you use
> "rmb()", and in other places you just use "smp_load_acquire()". Have
> you done extensive verification to check that this is actually ok?
> Because the performance difference you quote very much seems to be
> about your x86 testing now skipping the IO-synchronizing "rmb()", and
> depending on DMA being ordered even without it.

I am far from an expert on some of these synchronization primitives so
I probably did make some blunders. I meant to send this as an RFC, but
as is I plan to submit a v2 to just fix the spelling errors in the
patch description anyway.

The architectures where I thought I might be able to take advantage of
the smp_load_acquire() functionality are arm, x86, ia64, sparc, and
s390. The rest are essentially still just an rmb() call following the
volatile read.

> And I'm pretty sure that's actually fine on x86. The real
> IO-synchronizing rmb() (which translates into an lfence) is only needed
> when you have uncached accesses (ie mmio) on x86. So I don't think
> your code is wrong, I just want to verify that everybody understands
> the issues. I'm not even sure DMA can ever really have weaker memory
> ordering (I really don't see how you'd be able to do a read barrier
> without DMA stores being ordered natively), so maybe I worry too much,
> but the ppc people in particular should look at this, because the ppc
> memory ordering rules and serialization are some completely odd ad-hoc
> black magic....

Like I said before, the PowerPC version is likely the lowest risk. It
is just a standard rmb() call after the read. I'm pretty sure x86 is
safe as well, since the issue that triggered the introduction of the
rmb() into the Intel network drivers was on a PowerPC as I recall, and
the code had been running on x86 for quite some time without issue when
it was reported.

> But anything with non-cache-coherent DMA is obviously very suspect too.
>
> Linus

Well, for now I am only focused on what should be cache-coherent DMA,
which usually applies to things like descriptor rings where interaction
is expected to be bi-directional without the need for a map or unmap.

Thanks,

Alex
* Re: [PATCH] arch: Introduce read_acquire()
  2014-11-11 20:45 ` Alexander Duyck
@ 2014-11-12 10:10 ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2014-11-12 10:10 UTC (permalink / raw)
To: Alexander Duyck
Cc: Linus Torvalds, alexander.duyck, linux-arch@vger.kernel.org,
    Linux Kernel Mailing List, Michael Neuling, Tony Luck,
    Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens,
    Oleg Nesterov, Will Deacon, Michael Ellerman, Geert Uytterhoeven,
    Frederic Weisbecker, Martin Schwidefsky, Russell King,
    Paul E. McKenney, Ingo Molnar

On Tue, Nov 11, 2014 at 12:45:23PM -0800, Alexander Duyck wrote:
> On 11/11/2014 11:40 AM, Linus Torvalds wrote:
> > One is just the name. Why do we have "smp_load_acquire()", but then
> > call the non-smp version "read_acquire()"? That makes very little
> > sense to me. Why did "load" become "read"?
>
> The idea behind read vs load in the name was because smp_load_acquire()
> is using a full smp_mb() whereas this just falls back to rmb() for the
> cases it is dealing with. My main concern is that a full memory barrier
> would be more expensive, so I didn't want to give the impression that
> this is as complete as smp_load_acquire(). The read_acquire() call is
> not strictly enforcing any limitations on writes/stores, although there
> are a few cases where the barriers used do leak that functionality over
> as a side-effect.

Then I object. We should not name it acquire if it does not in fact
provide acquire semantics. Memory ordering is hard enough, we don't need
random weird semantics mixed in just because.
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 19:40 ` Linus Torvalds 2014-11-11 19:40 ` Linus Torvalds 2014-11-11 20:45 ` Alexander Duyck @ 2014-11-12 10:10 ` Will Deacon 2014-11-12 15:42 ` Alexander Duyck 2 siblings, 1 reply; 15+ messages in thread From: Will Deacon @ 2014-11-12 10:10 UTC (permalink / raw) To: Linus Torvalds Cc: alexander.duyck@gmail.com, linux-arch@vger.kernel.org, Linux Kernel Mailing List, Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Ingo Molnar On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote: > On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote: > > On reviewing the documentation and code for smp_load_acquire() it occured > > to me that implementing something similar for CPU <-> device interraction > > would be worth while. This commit provides just the load/read side of this > > in the form of read_acquire(). > > So I don't hate the concept, but. there's a couple of reasons to think > this is broken. > > One is just the name. Why do we have "smp_load_acquire()", but then > call the non-smp version "read_acquire()"? That makes very little > sense to me. Why did "load" become "read"? [...] > But we do have a very real difference between "smp_rmb()" (inter-cpu > cache coherency read barrier) and "rmb()" (full memory barrier that > synchronizes with IO). > > And your patch is very confused about this. In *some* places you use > "rmb()", and in other places you just use "smp_load_acquire()". Have > you done extensive verification to check that this is actually ok? > Because the performance difference you quote very much seems to be > about your x86 testing now akipping the IO-synchronizing "rmb()", and > depending on DMA being ordered even without it. > > And I'm pretty sure that's actually fine on x86. 
The real > IO-synchronizing rmb() (which translates into a lfence) is only needed > for when you have uncached accesses (ie mmio) on x86. So I don't think > your code is wrong, I just want to verify that everybody understands > the issues. I'm not even sure DMA can ever really have weaker memory > ordering (I really don't see how you'd be able to do a read barrier > without DMA stores being ordered natively), so maybe I worry too much, > but the ppc people in particular should look at this, because the ppc > memory ordering rules and serialization are some completely odd ad-hoc > black magic.... Right, so now I see what's going on here. This isn't actually anything to do with acquire/release (I don't know of any architectures that have a read-barrier-acquire instruction), it's all about DMA to main memory. If a device is DMA'ing data *and* control information (e.g. 'descriptor valid') to memory, then it must be maintaining order between those writes with respect to memory. In that case, using the usual MMIO barriers can be overkill because we really just want to enforce read-ordering on the CPU side. In fact, I think you could even do this with a fake address dependency on ARM (although I'm not actually suggesting we do that). In light of that, it actually sounds like we want a new set of barrier macros that apply only to DMA buffer accesses by the CPU -- they wouldn't enforce ordering against things like MMIO registers. I wonder whether any architectures would implement them differently to the smp_* flavours? > But anything with non-cache-coherent DMA is obviously very suspect too. I think non-cache-coherent DMA should work too (at least, on ARM), but only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable mapping). Will ^ permalink raw reply [flat|nested] 15+ messages in thread
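Will's point about enforcing only read-ordering on the CPU side of a coherent DMA buffer can be sketched with C11 atomics. This is illustrative userspace code, not the kernel API under discussion; the descriptor layout and names are invented for the sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical RX descriptor: the device DMAs the payload fields
 * first, then sets the DD ("descriptor done") bit in status_error. */
struct rx_desc {
    uint64_t addr;
    uint32_t length;
    _Atomic uint32_t status_error;
};

#define RXD_STAT_DD 0x01u

/* The read_acquire() pattern: an acquire load of status_error orders
 * it before any subsequent reads of the other descriptor fields,
 * without requiring a full IO-synchronizing barrier. */
static int rx_desc_ready(struct rx_desc *desc)
{
    uint32_t status = atomic_load_explicit(&desc->status_error,
                                           memory_order_acquire);
    return (status & RXD_STAT_DD) != 0;
}
```

On x86 such an acquire load compiles to a plain load plus a compiler barrier, which matches Linus's observation that the heavyweight lfence is only needed for uncached MMIO accesses.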
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 10:10 ` Will Deacon @ 2014-11-12 15:42 ` Alexander Duyck 0 siblings, 0 replies; 15+ messages in thread From: Alexander Duyck @ 2014-11-12 15:42 UTC (permalink / raw) To: Will Deacon, Linus Torvalds Cc: alexander.duyck@gmail.com, linux-arch@vger.kernel.org, Linux Kernel Mailing List, Michael Neuling, Tony Luck, Mathieu Desnoyers, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Ingo Molnar On 11/12/2014 02:10 AM, Will Deacon wrote: > On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote: >> On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@gmail.com> wrote: >>> On reviewing the documentation and code for smp_load_acquire() it occured >>> to me that implementing something similar for CPU <-> device interraction >>> would be worth while. This commit provides just the load/read side of this >>> in the form of read_acquire(). >> So I don't hate the concept, but. there's a couple of reasons to think >> this is broken. >> >> One is just the name. Why do we have "smp_load_acquire()", but then >> call the non-smp version "read_acquire()"? That makes very little >> sense to me. Why did "load" become "read"? > [...] > >> But we do have a very real difference between "smp_rmb()" (inter-cpu >> cache coherency read barrier) and "rmb()" (full memory barrier that >> synchronizes with IO). >> >> And your patch is very confused about this. In *some* places you use >> "rmb()", and in other places you just use "smp_load_acquire()". Have >> you done extensive verification to check that this is actually ok? >> Because the performance difference you quote very much seems to be >> about your x86 testing now akipping the IO-synchronizing "rmb()", and >> depending on DMA being ordered even without it. >> >> And I'm pretty sure that's actually fine on x86. 
The real >> IO-synchronizing rmb() (which translates into a lfence) is only needed >> for when you have uncached accesses (ie mmio) on x86. So I don't think >> your code is wrong, I just want to verify that everybody understands >> the issues. I'm not even sure DMA can ever really have weaker memory >> ordering (I really don't see how you'd be able to do a read barrier >> without DMA stores being ordered natively), so maybe I worry too much, >> but the ppc people in particular should look at this, because the ppc >> memory ordering rules and serialization are some completely odd ad-hoc >> black magic.... > Right, so now I see what's going on here. This isn't actually anything > to do with acquire/release (I don't know of any architectures that have > a read-barrier-acquire instruction), it's all about DMA to main memory. Actually it is sort of, I just hadn't realized it until I read some of the explanations of the C11 acquire/release memory order specifics, but I believe most network drivers are engaged in acquire/release logic because we are usually using something such as a lockless descriptor ring to pass data back and forth between the device and the system. The net win for device drivers is that we can remove some of the heavy-weight barriers that are having to be used by making use of lighter barriers or primitives such as lwsync vs sync in PowerPC or ldar vs dsb(ld) on arm64. > If a device is DMA'ing data *and* control information (e.g. 'descriptor > valid') to memory, then it must be maintaining order between those writes > with respect to memory. In that case, using the usual MMIO barriers can > be overkill because we really just want to enforce read-ordering on the CPU > side. In fact, I think you could even do this with a fake address dependency > on ARM (although I'm not actually suggesting we do that). 
> > In light of that, it actually sounds like we want a new set of barrier > macros that apply only to DMA buffer accesses by the CPU -- they wouldn't > enforce ordering against things like MMIO registers. I wonder whether any > architectures would implement them differently to the smp_* flavours? My concern would be the cost of the barriers vs the acquire/release primitives. In the case of arm64 I am assuming there is a reason for wanting to use ldar vs dsb instructions. I would imagine the devices drivers would want to get the same kind of advantages. >> But anything with non-cache-coherent DMA is obviously very suspect too. > I think non-cache-coherent DMA should work too (at least, on ARM), but > only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable > mapping). > > Will For now my plan is to focus on coherent memory only with this. Specifically it is only really intended for use with dma_alloc_coherent. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
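The lockless descriptor ring Alexander describes maps onto the classic single-producer/single-consumer pattern, where index updates carry release semantics and index reads carry acquire semantics. A toy userspace sketch with C11 atomics (the ring size, field names, and API are invented for illustration):

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8  /* power of two so masking works */

struct ring {
    int slots[RING_SIZE];
    _Atomic size_t head;  /* written only by the producer */
    _Atomic size_t tail;  /* written only by the consumer */
};

static int ring_put(struct ring *r, int v)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return 0;                         /* full */
    r->slots[head & (RING_SIZE - 1)] = v;
    /* release: the slot write above is visible before the new head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

static int ring_get(struct ring *r, int *v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0;                         /* empty */
    *v = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

The acquire/release pairs on head and tail are exactly the lighter-weight primitives (lwsync on PowerPC, ldar on arm64) that the thread argues should replace full rmb()/wmb() for coherent-memory-only ordering.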
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 18:57 [PATCH] arch: Introduce read_acquire() alexander.duyck 2014-11-11 18:57 ` alexander.duyck 2014-11-11 19:40 ` Linus Torvalds @ 2014-11-11 19:47 ` Will Deacon 2014-11-11 21:12 ` Alexander Duyck 2 siblings, 1 reply; 15+ messages in thread From: Will Deacon @ 2014-11-11 19:47 UTC (permalink / raw) To: alexander.duyck@gmail.com Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Alexander Duyck, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar Hello, On Tue, Nov 11, 2014 at 06:57:05PM +0000, alexander.duyck@gmail.com wrote: > From: Alexander Duyck <alexander.h.duyck@redhat.com> > > In the case of device drivers it is common to utilize receive descriptors > in which a single field is used to determine if the descriptor is currently > in the possession of the device or the CPU. In order to prevent any other > fields from being read a rmb() is used resulting in something like code > snippet from ixgbe_main.c: > > if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) > break; > > /* > * This memory barrier is needed to keep us from reading > * any other fields out of the rx_desc until we know the > * RXD_STAT_DD bit is set > */ > rmb(); > > On reviewing the documentation and code for smp_load_acquire() it occurred > to me that implementing something similar for CPU <-> device interaction > would be worthwhile. This commit provides just the load/read side of this > in the form of read_acquire(). This new primitive orders the specified > read against any subsequent reads.
As a result we can reduce the above > code snippet down to: > > /* This memory barrier is needed to keep us from reading > * any other fields out of the rx_desc until we know the > * RXD_STAT_DD bit is set > */ > if (!(read_acquire(&rx_desc->wb.upper.status_error) & Minor nit on naming, but load_acquire would match what we do with barriers, where you simply drop the smp_ prefix if you want the thing to work on UP systems too. > cpu_to_le32(IXGBE_RXD_STAT_DD))) > break; I'm not familiar with the driver in question, but how are the descriptors mapped? Is the read barrier here purely limiting re-ordering of normal memory accesses by the CPU? If so, isn't there also scope for store_release when updating, e.g. next_to_watch in the same driver? We also need to understand how this plays out with smp_mb__after_unlock_lock, which is currently *only* implemented by PowerPC. If we end up having a similar mess to mmiowb, where PowerPC both implements the barrier *and* plays tricks in its spin_unlock code, then everybody loses because we'd end up with release doing the right thing anyway. Peter and I spoke with Paul at LPC about strengthening smp_load_acquire/smp_store_release so that release->acquire ordering is maintained, which would allow us to drop smp_mb__after_unlock_lock altogether. That's stronger than acquire/release in C11, but I think it's an awful lot easier to use, particularly if device drivers are going to start using these primitives. Thoughts? Will ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 19:47 ` Will Deacon @ 2014-11-11 21:12 ` Alexander Duyck 2014-11-12 10:15 ` Peter Zijlstra 2014-11-12 20:43 ` David Miller 0 siblings, 2 replies; 15+ messages in thread From: Alexander Duyck @ 2014-11-11 21:12 UTC (permalink / raw) To: Will Deacon, alexander.duyck@gmail.com Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Peter Zijlstra, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 11/11/2014 11:47 AM, Will Deacon wrote: > Hello, > > On Tue, Nov 11, 2014 at 06:57:05PM +0000, alexander.duyck@gmail.com wrote: >> From: Alexander Duyck <alexander.h.duyck@redhat.com> >> >> In the case of device drivers it is common to utilize receive descriptors >> in which a single field is used to determine if the descriptor is currently >> in the possession of the device or the CPU. In order to prevent any other >> fields from being read a rmb() is used resulting in something like code >> snippet from ixgbe_main.c: >> >> if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD)) >> break; >> >> /* >> * This memory barrier is needed to keep us from reading >> * any other fields out of the rx_desc until we know the >> * RXD_STAT_DD bit is set >> */ >> rmb(); >> >> On reviewing the documentation and code for smp_load_acquire() it occured >> to me that implementing something similar for CPU <-> device interraction >> would be worth while. This commit provides just the load/read side of this >> in the form of read_acquire(). This new primative orders the specified >> read against any subsequent reads. 
As a result we can reduce the above >> code snippet down to: >> >> /* This memory barrier is needed to keep us from reading >> * any other fields out of the rx_desc until we know the >> * RXD_STAT_DD bit is set >> */ >> if (!(read_acquire(&rx_desc->wb.upper.status_error) & > Minor nit on naming, but load_acquire would match what we do with barriers, > where you simply drop the smp_ prefix if you want the thing to work on UP > systems too. The problem is this is slightly different, load_acquire in my mind would use a mb() call, I only use a rmb(). That is why I chose read_acquire as the name. >> cpu_to_le32(IXGBE_RXD_STAT_DD))) >> break; > I'm not familiar with the driver in question, but how are the descriptors > mapped? Is the read barrier here purely limiting re-ordering of normal > memory accesses by the CPU? If so, isn't there also scope for store_release > when updating, e.g. next_to_watch in the same driver? So the driver in question is using descriptor rings allocated via dma_alloc_coherent. The device is notified that new descriptors are present via a memory mapped I/O register, then the device will read the descriptor via a DMA operation and then write it back with another DMA operation and the process of doing so it will set the IXGBE_RXD_STAT_DD bit. The problem with the store_release logic is that it would need to key off of a write to memory mapped I/O. The idea had crossed my mind, but I wasn't confident I had a good enough understanding of things to try and deal with memory ordering for cacheable and uncachable memory in the same call. I would have to do some more research to see if something like that is even possible as I suspect some of the architectures may not support something like that. > We also need to understand how this plays out with > smp_mb__after_unlock_lock, which is currently *only* implemented by PowerPC. 
> If we end up having a similar mess to mmiowb, where PowerPC both implements > the barrier *and* plays tricks in its spin_unlock code, then everybody > loses because we'd end up with release doing the right thing anyway. PowerPC is not much of a risk in this patch. The implementation I did just fell back to a rmb(). The architectures I need to sort out are arm, x86, sparc, ia64, and s390 as they are the only ones that tried to make use of the smp_load_acquire logic. > Peter and I spoke with Paul at LPC about strengthening > smp_load_acquire/smp_store_release so that release->acquire ordering is > maintained, which would allow us to drop smp_mb__after_unlock_lock > altogether. That's stronger than acquire/release in C11, but I think it's > an awful lot easier to use, particularly if device drivers are going to > start using these primitives. > > Thoughts? > > Will I generally want just enough of a barrier in place to keep things working properly without costing much in terms of CPU time. If you can come up with a generic load_acquire/store_release that could take the place of this function I am fine with that as long as it would function at the same level of performance. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
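The store_release counterpart that Alexander considers for the transmit path can be sketched the same way: populate the descriptor fields, then publish the ownership bit with a release store. This is illustrative C11 userspace code with invented names; as the email notes, the MMIO doorbell write that follows in a real driver would still need its own ordering, which is the complication being discussed:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical TX descriptor; layout and names are not ixgbe's. */
struct tx_desc {
    uint64_t addr;
    uint32_t length;
    _Atomic uint32_t cmd;
};

#define TXD_CMD_OWN 0x80000000u

/* Fill in the descriptor, then release-store the ownership bit so any
 * observer that acquire-loads cmd and sees TXD_CMD_OWN is guaranteed
 * to also see addr and length. This orders coherent-memory accesses
 * only; it says nothing about a subsequent MMIO doorbell write. */
static void tx_desc_publish(struct tx_desc *desc, uint64_t addr,
                            uint32_t len)
{
    desc->addr = addr;
    desc->length = len;
    atomic_store_explicit(&desc->cmd, TXD_CMD_OWN, memory_order_release);
}
```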
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 21:12 ` Alexander Duyck @ 2014-11-12 10:15 ` Peter Zijlstra 2014-11-12 15:23 ` Alexander Duyck 2014-11-12 20:43 ` David Miller 1 sibling, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2014-11-12 10:15 UTC (permalink / raw) To: Alexander Duyck Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: > >Minor nit on naming, but load_acquire would match what we do with barriers, > >where you simply drop the smp_ prefix if you want the thing to work on UP > >systems too. > > The problem is this is slightly different, load_acquire in my mind would use > a mb() call, I only use a rmb(). That is why I chose read_acquire as the > name. acquire is not about rmb vs mb, do read up on Documentation/memory-barriers.txt. Its a distinctly different semantic. Some archs simply lack the means of implementing this semantics and have to revert to mb (stronger is always allowed). Using the read vs load to wreck the acquire semantics is just insane. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 10:15 ` Peter Zijlstra @ 2014-11-12 15:23 ` Alexander Duyck 2014-11-12 15:37 ` Peter Zijlstra 0 siblings, 1 reply; 15+ messages in thread From: Alexander Duyck @ 2014-11-12 15:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 11/12/2014 02:15 AM, Peter Zijlstra wrote: > On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: >>> Minor nit on naming, but load_acquire would match what we do with barriers, >>> where you simply drop the smp_ prefix if you want the thing to work on UP >>> systems too. >> The problem is this is slightly different, load_acquire in my mind would use >> a mb() call, I only use a rmb(). That is why I chose read_acquire as the >> name. > acquire is not about rmb vs mb, do read up on > Documentation/memory-barriers.txt. Its a distinctly different semantic. > Some archs simply lack the means of implementing this semantics and have > to revert to mb (stronger is always allowed). > > Using the read vs load to wreck the acquire semantics is just insane. Actually I have been reading up on it as I wasn't familiar with C11. Most of what I was doing was actually based on the documentation in barriers.txt which was referring to memory operations not loads/stores when referring to the acquire/release so I assumed the full memory barrier was required. I wasn't aware that smp_load_acquire was only supposed to be ordering loads, or that smp_store_release only applied to stores. I will probably go back and re-implement this patch as introducing load_acquire and adding store_release as well.
I figure it is possible that a device could be doing something like reading a linked list in memory ("Example combining sync and lwsync" http://www.ibm.com/developerworks/systems/articles/powerpc.html#N102B1) and then you would need both the store_release and the wmb() to deal with system memory vs system memory ordering, and system memory vs MMIO. Base drivers have been doing this kind of stuff for a while. The problem is we have been doing it using barriers that are heavier than they need to be. For example the reason why I am wanting to shift to using something like an acquire operation is because the current barrier we have been using, rmb(), is far heavier than we need for system memory vs system memory ordering on systems such as PowerPC. What I am looking for with the load_acquire/store_release is to allow for the use of lighter weight barriers for updates that only affect system memory. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 15:23 ` Alexander Duyck @ 2014-11-12 15:37 ` Peter Zijlstra 2014-11-12 19:24 ` Alexander Duyck 0 siblings, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2014-11-12 15:37 UTC (permalink / raw) To: Alexander Duyck Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On Wed, Nov 12, 2014 at 07:23:22AM -0800, Alexander Duyck wrote: > > On 11/12/2014 02:15 AM, Peter Zijlstra wrote: > >On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: > >>>Minor nit on naming, but load_acquire would match what we do with barriers, > >>>where you simply drop the smp_ prefix if you want the thing to work on UP > >>>systems too. > >>The problem is this is slightly different, load_acquire in my mind would use > >>a mb() call, I only use a rmb(). That is why I chose read_acquire as the > >>name. > >acquire is not about rmb vs mb, do read up on > >Documentation/memory-barriers.txt. Its a distinctly different semantic. > >Some archs simply lack the means of implementing this semantics and have > >to revert to mb (stronger is always allowed). > > > >Using the read vs load to wreck the acquire semantics is just insane. > > Actually I have been reading up on it as I wasn't familiar with C11. C11 is _different_ although somewhat related. > Most > of what I was doing was actually based on the documentation in barriers.txt > which was referring to memory operations not loads/stores when referring to > the acquire/release so I assumed the full memory barrier was required. I > wasn't aware that smp_load_acquire was only supposed to be ordering loads, > or that smp_ store_release only applied to stores. It does not.. 
an ACQUIRE is a semi-permeable barrier that doesn't allow LOADs nor STOREs that are issued _after_ it to appear to happen _before_. The RELEASE is the opposite number, it ensures LOADs and STOREs that are issued _before_ cannot happen _after_. This typically matches locking, where a lock (mutex_lock, spin_lock etc..) have ACQUIRE semantics and the unlock RELEASE. Such that: spin_lock(); a = 1; b = x; spin_unlock(); guarantees all LOADs (x) and STORESs (a,b) happen _inside_ the lock region. What they do not guarantee is: y = 1; spin_lock() a = 1; b = x; spin_unlock() z = 4; An order between y and z, both are allowed _into_ the region and can cross there like: spin_lock(); ... z = 4; y = 1; ... spin_unlock(); The only 'open' issue at the moment is if RELEASE+ACQUIRE := MB. Currently we say this is not so, but Will (and me) would very much like this to be so -- PPC64 being the only arch that actually makes this distinction. ^ permalink raw reply [flat|nested] 15+ messages in thread
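Peter's spin_lock/spin_unlock example corresponds to an acquire operation on lock and a release operation on unlock. A minimal userspace sketch with C11 atomics (illustrative only, not the kernel's spinlock implementation):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int a, b, x;

static void spin_lock_sketch(void)
{
    /* test-and-set with acquire ordering: LOADs and STOREs issued
     * after the lock cannot appear to happen before it. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void spin_unlock_sketch(void)
{
    /* clear with release ordering: LOADs and STOREs issued before
     * the unlock cannot appear to happen after it. */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void critical_section(void)
{
    spin_lock_sketch();
    a = 1;      /* confined to the region, as in Peter's example... */
    b = x;
    spin_unlock_sketch();
    /* ...while accesses outside (the y and z above) may still be
     * pulled into the region by the semi-permeable barriers. */
}
```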
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-12 15:37 ` Peter Zijlstra @ 2014-11-12 19:24 ` Alexander Duyck 0 siblings, 0 replies; 15+ messages in thread From: Alexander Duyck @ 2014-11-12 19:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, alexander.duyck@gmail.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Michael Neuling, Tony Luck, Mathieu Desnoyers, Benjamin Herrenschmidt, Heiko Carstens, Oleg Nesterov, Michael Ellerman, Geert Uytterhoeven, Frederic Weisbecker, Martin Schwidefsky, Russell King, Paul E. McKenney, Linus Torvalds, Ingo Molnar On 11/12/2014 07:37 AM, Peter Zijlstra wrote: > On Wed, Nov 12, 2014 at 07:23:22AM -0800, Alexander Duyck wrote: >> >> On 11/12/2014 02:15 AM, Peter Zijlstra wrote: >>> On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote: >>>>> Minor nit on naming, but load_acquire would match what we do with barriers, >>>>> where you simply drop the smp_ prefix if you want the thing to work on UP >>>>> systems too. >>>> The problem is this is slightly different, load_acquire in my mind would use >>>> a mb() call, I only use a rmb(). That is why I chose read_acquire as the >>>> name. >>> acquire is not about rmb vs mb, do read up on >>> Documentation/memory-barriers.txt. Its a distinctly different semantic. >>> Some archs simply lack the means of implementing this semantics and have >>> to revert to mb (stronger is always allowed). >>> >>> Using the read vs load to wreck the acquire semantics is just insane. >> >> Actually I have been reading up on it as I wasn't familiar with C11. > > C11 is _different_ although somewhat related. Honestly I find this quite confusing. If you have some sort of other documentation you can point me at it would be useful in terms of what you are expecting for behaviour and names. 
>> Most >> of what I was doing was actually based on the documentation in barriers.txt >> which was referring to memory operations not loads/stores when referring to >> the acquire/release so I assumed the full memory barrier was required. I >> wasn't aware that smp_load_acquire was only supposed to be ordering loads, >> or that smp_store_release only applied to stores. > > It does not.. an ACQUIRE is a semi-permeable barrier that doesn't allow > LOADs nor STOREs that are issued _after_ it to appear to happen _before_. > The RELEASE is the opposite number, it ensures LOADs and STOREs that are > issued _before_ cannot happen _after_. > > This typically matches locking, where a lock (mutex_lock, spin_lock > etc..) have ACQUIRE semantics and the unlock RELEASE. Such that: > > spin_lock(); > a = 1; > b = x; > spin_unlock(); > > guarantees all LOADs (x) and STOREs (a,b) happen _inside_ the lock > region. What they do not guarantee is: > > > y = 1; > spin_lock() > a = 1; > b = x; > spin_unlock() > z = 4; > > An order between y and z, both are allowed _into_ the region and can > cross there like: > > spin_lock(); > ... > z = 4; > y = 1; > ... > spin_unlock(); > > > The only 'open' issue at the moment is if RELEASE+ACQUIRE := MB. > Currently we say this is not so, but Will (and me) would very much like > this to be so -- PPC64 being the only arch that actually makes this > distinction. In the grand scheme of things I suppose it doesn't matter too much. I actually found some documentation that explains some of the subtle nuances. Specifically, Acquire represents the first row in the table below, and Release represents the second column:

    Acquire ->  LoadLoad   LoadStore
                StoreLoad  StoreStore
                               ^
                               |
                            Release

The LoadStore bit was in question in a few different discussions I read, however as it turns out on x86, sparc, s390, PowerPC, arm64, and ia64 they would give you that as a free benefit anyway.
I think that covers a wide enough gamut that I don't really care if I take a performance hit on the other architectures for implementing a full mb() versus a rmb(). Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] arch: Introduce read_acquire() 2014-11-11 21:12 ` Alexander Duyck 2014-11-12 10:15 ` Peter Zijlstra @ 2014-11-12 20:43 ` David Miller 1 sibling, 0 replies; 15+ messages in thread From: David Miller @ 2014-11-12 20:43 UTC (permalink / raw) To: alexander.h.duyck Cc: will.deacon, alexander.duyck, linux-arch, linux-kernel, mikey, tony.luck, mathieu.desnoyers, peterz, benh, heiko.carstens, oleg, michael, geert, fweisbec, schwidefsky, linux, paulmck, torvalds, mingo From: Alexander Duyck <alexander.h.duyck@redhat.com> Date: Tue, 11 Nov 2014 13:12:32 -0800 > The architectures I need to sort out are arm, x86, sparc, ia64, and > s390 as they are the only ones that tried to make use of the > smp_load_acquire logic. On sparc64, you don't have to worry about anything really. Only mb() does anything. The cpus execute in a reasonably strict memory ordering mode, and that's why rmb/wmb and friends are simply nops with a compiler barrier. ^ permalink raw reply [flat|nested] 15+ messages in thread