futex atomic vs ordering constraints

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* futex atomic vs ordering constraints
@ 2015-08-26 18:16 Peter Zijlstra
  2015-08-29  1:33 ` Davidlohr Bueso
  2015-09-01 16:31 ` Will Deacon
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-08-26 18:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Oleg Nesterov, Paul McKenney, Ingo Molnar,
	mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf, ddaney,
	Will Deacon, linux-kernel

Hi all,

I tried to keep this email short, but failed miserably at this. For
the TL;DR skip to the tail.

So the question of ordering constraints of futex atomic operations has
come up recently:

  http://marc.info/?l=linux-kernel&m=143894765931868

This email will attempt to describe the two primitives and start a
discussion on the constraints.

 * futex_atomic_op_inuser()

There is but a single callsite of this function: futex_wake_op().

It being part of a wake primitive seems to suggest a (RCsc) RELEASE is
the strongest required (the RCsc part because I don't think we want to
expose RCpc to userspace if we don't have to).

The immediate scenario where this is important is:

        CPU0            CPU1            CPU2

        futex_lock(); -> uncontended user acquire
        A = 1;
                        futex_lock(); -> kernel, set pending, sleep
        B = 1;
        futex_unlock();
          if pending
            <kernel>
              futex_wake_op
                spin_lock(bh->lock)
                RELEASE
                futex_atomic_op_inuser(); -> futex unlocked
                                        futex_lock() -> uncontended user steal
                                        load A;

In other words, the moment we perform the WAKE_OP userspace can observe
the 'lock' as unlocked and do a lock (steal) acquire of the 'lock'.

If userspace succeeds with this acquire, we need full serialization of
the locked (RCsc) variables (eg A and B in the above).

Of course, if anything else prior to futex_atomic_op_inuser() implies an
(RCsc) RELEASE or stronger the primitive can do without providing
anything itself.

This turns out to be the case, a successful get_futex_key() implies a
full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
wakeups").

And since get_futex_key() is fundamental to doing _anything_ with a
futex, I think its semi-sane to rely on this.

So we have two valid options:

 - RCsc RELEASE
 - no ordering at all

Current implementation:

alpha:	 MB ll/sc		RELEASE
arm64:	 ll/sc-release MB	FULL
arm:	 MB ll/sc		RELEASE
mips:	 ll/sc MB		ACQUIRE
powerpc: lwsync ll/sc sync	FULL

 * futex_atomic_cmpxchg_inatomic()

This is called from:

	lock_pi_update_atomic
	wake_futex_pi
	fixup_pi_state_owner
	futex_unlock_pi
	handle_futex_death

But I think we can form a position from just two of them:

  futex_unlock_pi() and lock_pi_update_atomic()

these end up being ACQUIRE and RELEASE, and a combination of these two
would give us a requirement for full serialization.

And unlike the previous we cannot talk this one away. Even though every
futex op needs a get_futex_key() which implies a full memory barrier,
and every get_futex_key() needs a put_futex_key(), the latter does _NOT_
imply a full barrier.

So while we could relax the RELEASE semantics we cannot relax the
ACQUIRE semantics.

Then there is handle_futex_death(), which is difficult, I _think_ it
wants to be a RELEASE, but state is corrupted anyhow and I can well
imagine not wanting to play any games here and go fully serialized like
we're used to with cmpxchg.

Now the robust stuff doesn't use {get,put}_futex_key() stuff, so no
implied barriers here.

Which leaves us all with a great big mess.

Current implementation:

alpha:	 MB ll/sc		RELEASE
arm64:	 ll/sc-release MB	FULL
arm:	 MB ll/sc MB		FULL
mips:	 ll/sc MB		ACQUIRE
powerpc: lwsync ll/sc sync	FULL

There are a few options:

 1) punt, mandate they're both fully ordered and stop thinking about it

 2) make them both fully relaxed, rely on implied barriers and employ
    smp_mb__{before,after}_atomic in key places

Given the current state of things and that I don't really think there is
a compelling performance argument to be made for 2, I would suggest we
go with 1.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-08-26 18:16 futex atomic vs ordering constraints Peter Zijlstra
@ 2015-08-29  1:33 ` Davidlohr Bueso
  2015-09-01 16:38   ` Peter Zijlstra
  2015-09-01 16:31 ` Will Deacon
  1 sibling, 1 reply; 15+ messages in thread
From: Davidlohr Bueso @ 2015-08-29  1:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, Vineet.Gupta1, ralf, ddaney,
	Will Deacon, linux-kernel

On Wed, 2015-08-26 at 20:16 +0200, Peter Zijlstra wrote:
> Of course, if anything else prior to futex_atomic_op_inuser() implies an
> (RCsc) RELEASE or stronger the primitive can do without providing
> anything itself.
> 
> This turns out to be the case, a successful get_futex_key() implies a
> full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
> wakeups").

Hmm while it is certainly true that get_futex_key() implies a full
barrier, I don't see why you're referring to the recent wake_q stuff;
where the futex "wakeup" is done much after futex_atomic_op_inuser. Yes,
that too implies a barrier, but not wrt get_futex_key() -- which
fundamentally relies on get_futex_key_refs().

> 
> And since get_futex_key() is fundamental to doing _anything_ with a
> futex, I think its semi-sane to rely on this.

Right, and it wouldn't be the first thing that relies on get_futex_key()
implying a full barrier.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-08-29  1:33 ` Davidlohr Bueso
@ 2015-09-01 16:38   ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-01 16:38 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, Vineet.Gupta1, ralf, ddaney,
	Will Deacon, linux-kernel

On Fri, Aug 28, 2015 at 06:33:06PM -0700, Davidlohr Bueso wrote:
> On Wed, 2015-08-26 at 20:16 +0200, Peter Zijlstra wrote:
> > Of course, if anything else prior to futex_atomic_op_inuser() implies an
> > (RCsc) RELEASE or stronger the primitive can do without providing
> > anything itself.
> > 
> > This turns out to be the case, a successful get_futex_key() implies a
> > full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
> > wakeups").
> 
> Hmm while it is certainly true that get_futex_key() implies a full
> barrier, I don't see why you're referring to the recent wake_q stuff;

D'oh, because I'm a sheep or so. I meant:

  b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-08-26 18:16 futex atomic vs ordering constraints Peter Zijlstra
  2015-08-29  1:33 ` Davidlohr Bueso
@ 2015-09-01 16:31 ` Will Deacon
  2015-09-01 16:42   ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Will Deacon @ 2015-09-01 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org

Hi Peter,

On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> I tried to keep this email short, but failed miserably at this. For
> the TL;DR skip to the tail.

[...]

> There are a few options:
> 
>  1) punt, mandate they're both fully ordered and stop thinking about it
> 
>  2) make them both fully relaxed, rely on implied barriers and employ
>     smp_mb__{before,after}_atomic in key places
> 
> Given the current state of things and that I don't really think there is
> a compelling performance argument to be made for 2, I would suggest we
> go with 1.

I'd also go for (1). Since there is a userspace side to this, I'd *really*
like to avoid a potential situation on arm64 where the kernel builds its
side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
as acquire) and userspace builds its side out of native acquire/release
instructions and the two end up interacting badly (for example, loss of
SC).

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-01 16:31 ` Will Deacon
@ 2015-09-01 16:42   ` Peter Zijlstra
  2015-09-01 16:47     ` Will Deacon
  2015-09-01 19:05     ` Thomas Gleixner
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-01 16:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org

On Tue, Sep 01, 2015 at 05:31:40PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> > I tried to keep this email short, but failed miserably at this. For
> > the TL;DR skip to the tail.
> 
> [...]
> 
> > There are a few options:
> > 
> >  1) punt, mandate they're both fully ordered and stop thinking about it
> > 
> >  2) make them both fully relaxed, rely on implied barriers and employ
> >     smp_mb__{before,after}_atomic in key places
> > 
> > Given the current state of things and that I don't really think there is
> > a compelling performance argument to be made for 2, I would suggest we
> > go with 1.
> 
> I'd also go for (1). Since there is a userspace side to this, I'd *really*
> like to avoid a potential situation on arm64 where the kernel builds its
> side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
> as acquire) and userspace builds its side out of native acquire/release
> instructions and the two end up interacting badly (for example, loss of
> SC).

I thought your native acquire/release were RCsc, or is it that in
combination with the 'fancy' 'full' barrier of stlxr + dmb-ish something
goes sideways?

But yes, unless Thomas has other plans, I'll go ahead and create some
patches to make sure everybody is fully ordered so we can forget about
it again.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-01 16:42   ` Peter Zijlstra
@ 2015-09-01 16:47     ` Will Deacon
  2015-09-01 19:05     ` Thomas Gleixner
  1 sibling, 0 replies; 15+ messages in thread
From: Will Deacon @ 2015-09-01 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org

On Tue, Sep 01, 2015 at 05:42:47PM +0100, Peter Zijlstra wrote:
> On Tue, Sep 01, 2015 at 05:31:40PM +0100, Will Deacon wrote:
> > On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> > > I tried to keep this email short, but failed miserably at this. For
> > > the TL;DR skip to the tail.
> > 
> > [...]
> > 
> > > There are a few options:
> > > 
> > >  1) punt, mandate they're both fully ordered and stop thinking about it
> > > 
> > >  2) make them both fully relaxed, rely on implied barriers and employ
> > >     smp_mb__{before,after}_atomic in key places
> > > 
> > > Given the current state of things and that I don't really think there is
> > > a compelling performance argument to be made for 2, I would suggest we
> > > go with 1.
> > 
> > I'd also go for (1). Since there is a userspace side to this, I'd *really*
> > like to avoid a potential situation on arm64 where the kernel builds its
> > side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
> > as acquire) and userspace builds its side out of native acquire/release
> > instructions and the two end up interacting badly (for example, loss of
> > SC).
> 
> I thought your native acquire/release were RCsc, or is it that in
> combination with the 'fancy' 'full' barrier of stlxr + dmb-ish something
> goes sideways?

Yeah, they don't interact nicely because you can lose the multi-copy
atomicity guarantees you get from using either native acquire/release
everywhere or explicit barriers everywhere. IRIW shows the failure
if you use {DMB; STR} for the writers and LDAR for the readers.

> But yes, unless Thomas has other plans, I'll go ahead and create some
> patches to make sure everybody is fully ordered so we can forget about
> it again.

Sounds good to me!

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-01 16:42   ` Peter Zijlstra
  2015-09-01 16:47     ` Will Deacon
@ 2015-09-01 19:05     ` Thomas Gleixner
  2015-09-02 12:55       ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2015-09-01 19:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org

On Tue, 1 Sep 2015, Peter Zijlstra wrote:
> On Tue, Sep 01, 2015 at 05:31:40PM +0100, Will Deacon wrote:
> > Hi Peter,
> > 
> > On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> > > I tried to keep this email short, but failed miserably at this. For
> > > the TL;DR skip to the tail.
> > 
> > [...]
> > 
> > > There are a few options:
> > > 
> > >  1) punt, mandate they're both fully ordered and stop thinking about it
> > > 
> > >  2) make them both fully relaxed, rely on implied barriers and employ
> > >     smp_mb__{before,after}_atomic in key places
> > > 
> > > Given the current state of things and that I don't really think there is
> > > a compelling performance argument to be made for 2, I would suggest we
> > > go with 1.
> > 
> > I'd also go for (1). Since there is a userspace side to this, I'd *really*
> > like to avoid a potential situation on arm64 where the kernel builds its
> > side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
> > as acquire) and userspace builds its side out of native acquire/release
> > instructions and the two end up interacting badly (for example, loss of
> > SC).
> 
> I thought your native acquire/release were RCsc, or is it that in
> combination with the 'fancy' 'full' barrier of stlxr + dmb-ish something
> goes sideways?
> 
> But yes, unless Thomas has other plans, I'll go ahead and create some
> patches to make sure everybody is fully ordered so we can forget about
> it again.

No, I don't. There are too many ways to screw that up, so unless
someone has a serious performance issue, we should keep it on the safe
side.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-01 19:05     ` Thomas Gleixner
@ 2015-09-02 12:55       ` Peter Zijlstra
  2015-09-02 16:10         ` Chris Metcalf
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-02 12:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Will Deacon, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org, linux, rth, cmetcalf


So here goes..

Chris, I'm awfully sorry, but I seem to be Tile challenged.

TileGX seems to define:

#define smp_mb__before_atomic()	smp_mb()
#define smp_mb__after_atomic()	smp_mb()

However, its atomic_add_return() implementation looks like:

static inline int atomic_add_return(int i, atomic_t *v)
{
	int val;
	smp_mb();  /* barrier for proper semantics */
	val = __insn_fetchadd4((void *)&v->counter, i) + i;
	barrier();  /* the "+ i" above will wait on memory */
	return val;
}

Which leaves me confused on smp_mb__after_atomic().

That said, your futex ops seem to lack any memory barrier, so naively
I'd add both, its just that your add_return() confuses me.

---
Subject: futex, arch: Make futex atomic ops fully ordered

There was no clear ordering requirement for the two futex atomic
primitives which led to various archs implementing different things.

Since there currently is no clear performance argument (its the syscall
'slow' path) favour mandating full order over complexity seems the right
call.

This patch adds the new ordering requirements to the function comments
(which it places in a more visible location) and updates all arches that
were not already fully ordered (I checked those that have
smp_mb__{before,after}_atomic() definitions, as those are the only ones
that can actually be more relaxed).

In particular:

futex_atomic_op_inuser():

alpha:   MB ll/sc   (RELEASE) -> MB ll/sc MB
arm:     MB ll/sc   (RELEASE) -> MB ll/sc MB
mips:    ll/sc MB   (ACQUIRE) -> MB ll/sc MB

futex_atomic_cmpxchg_inatomic():

alpha:   MB ll/sc   (RELEASE) -> MB ll/sc MB
mips:    ll/sc MB   (ACQUIRE) -> MB ll/sc MB

XXX tile (again).

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/alpha/include/asm/futex.h | 10 +++++++---
 arch/arm/include/asm/futex.h   |  3 ++-
 arch/mips/include/asm/futex.h  |  4 ++++
 include/asm-generic/futex.h    | 26 +-------------------------
 include/linux/futex.h          | 35 +++++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 29 deletions(-)

diff --git a/arch/alpha/include/asm/futex.h b/arch/alpha/include/asm/futex.h
index f939794363ac..e0d5edad2cd4 100644
--- a/arch/alpha/include/asm/futex.h
+++ b/arch/alpha/include/asm/futex.h
@@ -9,8 +9,8 @@
 #include <asm/barrier.h>
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg)	\
+	smp_mb();						\
 	__asm__ __volatile__(					\
-		__ASM_SMP_MB					\
 	"1:	ldl_l	%0,0(%2)\n"				\
 		insn						\
 	"2:	stl_c	%1,0(%2)\n"				\
@@ -27,7 +27,8 @@
 	"	.previous\n"					\
 	:	"=&r" (oldval), "=&r"(ret)			\
 	:	"r" (uaddr), "r"(oparg)				\
-	:	"memory")
+	:	"memory");					\
+	smp_mb()
 
 static inline int futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 {
@@ -90,8 +91,9 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
+	smp_mb();
+
 	__asm__ __volatile__ (
-		__ASM_SMP_MB
 	"1:	ldl_l	%1,0(%3)\n"
 	"	cmpeq	%1,%4,%2\n"
 	"	beq	%2,3f\n"
@@ -111,6 +113,8 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	:	"r"(uaddr), "r"((long)(int)oldval), "r"(newval)
 	:	"memory");
 
+	smp_mb();
+
 	*uval = prev;
 	return ret;
 }
diff --git a/arch/arm/include/asm/futex.h b/arch/arm/include/asm/futex.h
index 5eed82809d82..1be3be8eb8c2 100644
--- a/arch/arm/include/asm/futex.h
+++ b/arch/arm/include/asm/futex.h
@@ -34,7 +34,8 @@
 	__futex_atomic_ex_table("%5")				\
 	: "=&r" (ret), "=&r" (oldval), "=&r" (tmp)		\
 	: "r" (uaddr), "r" (oparg), "Ir" (-EFAULT)		\
-	: "cc", "memory")
+	: "cc", "memory");					\
+	smp_mb()
 
 static inline int
 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
diff --git a/arch/mips/include/asm/futex.h b/arch/mips/include/asm/futex.h
index 1de190bdfb9c..2f38cc84fb04 100644
--- a/arch/mips/include/asm/futex.h
+++ b/arch/mips/include/asm/futex.h
@@ -20,6 +20,8 @@
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg)		\
 {									\
+	smp_mb__before_llsc();						\
+									\
 	if (cpu_has_llsc && R10000_LLSC_WAR) {				\
 		__asm__ __volatile__(					\
 		"	.set	push				\n"	\
@@ -149,6 +151,8 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
+	smp_mb__before_llsc();
+
 	if (cpu_has_llsc && R10000_LLSC_WAR) {
 		__asm__ __volatile__(
 		"# futex_atomic_cmpxchg_inatomic			\n"
diff --git a/include/asm-generic/futex.h b/include/asm-generic/futex.h
index e56272c919b5..e3c2a0cea438 100644
--- a/include/asm-generic/futex.h
+++ b/include/asm-generic/futex.h
@@ -12,18 +12,6 @@
  *
  */
 
-/**
- * futex_atomic_op_inuser() - Atomic arithmetic operation with constant
- *			  argument and comparison of the previous
- *			  futex value with another constant.
- *
- * @encoded_op:	encoded operation to execute
- * @uaddr:	pointer to user space address
- *
- * Return:
- * 0 - On success
- * <0 - On error
- */
 static inline int
 futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 {
@@ -88,19 +76,6 @@ futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 	return ret;
 }
 
-/**
- * futex_atomic_cmpxchg_inatomic() - Compare and exchange the content of the
- *				uaddr with newval if the current value is
- *				oldval.
- * @uval:	pointer to store content of @uaddr
- * @uaddr:	pointer to user space address
- * @oldval:	old value
- * @newval:	new value to store to @uaddr
- *
- * Return:
- * 0 - On success
- * <0 - On error
- */
 static inline int
 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval)
@@ -121,6 +96,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 }
 
 #else
+
 static inline int
 futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 {
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 6435f46d6e13..fadb46f38800 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -55,6 +55,41 @@ union futex_key {
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
+
+/**
+ * futex_atomic_op_inuser() - Atomic arithmetic operation with constant
+ *			  argument and comparison of the previous
+ *			  futex value with another constant.
+ *
+ * @encoded_op:	encoded operation to execute
+ * @uaddr:	pointer to user space address
+ *
+ * This operation is assumed to be fully ordered.
+ *
+ * Return:
+ * 0 - On success
+ * <0 - On error
+ */
+extern int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr);
+
+/**
+ * futex_atomic_cmpxchg_inatomic() - Compare and exchange the content of the
+ *				uaddr with newval if the current value is
+ *				oldval.
+ * @uval:	pointer to store content of @uaddr
+ * @uaddr:	pointer to user space address
+ * @oldval:	old value
+ * @newval:	new value to store to @uaddr
+ *
+ * This operation is assumed to be fully ordered.
+ *
+ * Return:
+ * 0 - On success
+ * <0 - On error
+ */
+extern int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
+					 u32 oldval, u32 newval);
+
 #ifdef CONFIG_HAVE_FUTEX_CMPXCHG
 #define futex_cmpxchg_enabled 1
 #else

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-02 12:55       ` Peter Zijlstra
@ 2015-09-02 16:10         ` Chris Metcalf
  2015-09-02 17:00           ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Metcalf @ 2015-09-02 16:10 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Will Deacon, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org, linux, rth

On 09/02/2015 08:55 AM, Peter Zijlstra wrote:
> So here goes..
>
> Chris, I'm awfully sorry, but I seem to be Tile challenged.
>
> TileGX seems to define:
>
> #define smp_mb__before_atomic()	smp_mb()
> #define smp_mb__after_atomic()	smp_mb()
>
> However, its atomic_add_return() implementation looks like:
>
> static inline int atomic_add_return(int i, atomic_t *v)
> {
> 	int val;
> 	smp_mb();  /* barrier for proper semantics */
> 	val = __insn_fetchadd4((void *)&v->counter, i) + i;
> 	barrier();  /* the "+ i" above will wait on memory */
> 	return val;
> }
>
> Which leaves me confused on smp_mb__after_atomic().

Are you concerned about whether it has proper memory
barrier semantics already, i.e. full barriers before and after?
In fact we do have a full barrier before, but then because of the
"+ i" / "barrier()", we know that the only other operation since
the previous mb(), namely the read of v->counter, has
completed after the atomic operation.  As a result we can
omit explicitly having a second barrier.

It does seem like all the current memory-order semantics are
correct, unless I'm missing something!

> That said, your futex ops seem to lack any memory barrier, so naively
> I'd add both, its just that your add_return() confuses me.

So something like this?

diff --git a/arch/tile/include/asm/futex.h b/arch/tile/include/asm/futex.h
index 1a6ef1b69cb1..0a5501b11d02 100644
--- a/arch/tile/include/asm/futex.h
+++ b/arch/tile/include/asm/futex.h
@@ -39,6 +39,7 @@
  #ifdef __tilegx__
  
  #define __futex_asm(OP) \
+	smp_mb();						\
  	asm("1: {" #OP " %1, %3, %4; movei %0, 0 }\n"		\
  	    ".pushsection .fixup,\"ax\"\n"			\
  	    "0: { movei %0, %5; j 9f }\n"			\
@@ -48,7 +49,8 @@
  	    ".popsection\n"					\
  	    "9:"						\
  	    : "=r" (ret), "=r" (val), "+m" (*(uaddr))		\
-	    : "r" (uaddr), "r" (oparg), "i" (-EFAULT))
+	    : "r" (uaddr), "r" (oparg), "i" (-EFAULT)); 	\
+	smp_mb()
  
  #define __futex_set() __futex_asm(exch4)
  #define __futex_add() __futex_asm(fetchadd4)
@@ -75,7 +77,10 @@
  
  #define __futex_call(FN)						\
  	{								\
-		struct __get_user gu = FN((u32 __force *)uaddr, lock, oparg); \
+		struct __get_user gu;					\
+		smp_mb();						\
+		gu = FN((u32 __force *)uaddr, lock, oparg);		\
+		/* See smp_mb__after_atomic() */			\
  		val = gu.val;						\
  		ret = gu.err;						\
  	}

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-02 16:10         ` Chris Metcalf
@ 2015-09-02 17:00           ` Peter Zijlstra
  2015-09-02 17:25             ` Chris Metcalf
  2015-09-02 21:18             ` Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-02 17:00 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Will Deacon, Linus Torvalds, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages@gmail.com,
	dvhart@infradead.org, dave@stgolabs.net,
	Vineet.Gupta1@synopsys.com, ralf@linux-mips.org,
	ddaney@caviumnetworks.com, linux-kernel@vger.kernel.org, linux,
	rth

On Wed, Sep 02, 2015 at 12:10:58PM -0400, Chris Metcalf wrote:
> On 09/02/2015 08:55 AM, Peter Zijlstra wrote:
> >So here goes..
> >
> >Chris, I'm awfully sorry, but I seem to be Tile challenged.
> >
> >TileGX seems to define:
> >
> >#define smp_mb__before_atomic()	smp_mb()
> >#define smp_mb__after_atomic()	smp_mb()
> >
> >However, its atomic_add_return() implementation looks like:
> >
> >static inline int atomic_add_return(int i, atomic_t *v)
> >{
> >	int val;
> >	smp_mb();  /* barrier for proper semantics */
> >	val = __insn_fetchadd4((void *)&v->counter, i) + i;
> >	barrier();  /* the "+ i" above will wait on memory */
> >	return val;
> >}
> >
> >Which leaves me confused on smp_mb__after_atomic().
> 
> Are you concerned about whether it has proper memory
> barrier semantics already, i.e. full barriers before and after?
> In fact we do have a full barrier before, but then because of the
> "+ i" / "barrier()", we know that the only other operation since
> the previous mb(), namely the read of v->counter, has
> completed after the atomic operation.  As a result we can
> omit explicitly having a second barrier.
> 
> It does seem like all the current memory-order semantics are
> correct, unless I'm missing something!

So I'm reading that code like:

	MB
 [RmW]	ret = *val += i


So what is stopping later memory ops like:

   [R]	a = *foo
   [S]	*bar = b

>From getting reordered with the RmW, like:

	MB

   [R]	a = *foo
   [S]	*bar = b

 [RmW]	ret = *val += i

Are you saying Tile does not reorder things like that? If so, why then
is smp_mb__after_atomic() a full mb(). If it does, I don't see how your
add_return is correct.

Alternatively I'm just confused..

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-02 17:00           ` Peter Zijlstra
@ 2015-09-02 17:25             ` Chris Metcalf
  2015-09-02 21:18             ` Linus Torvalds
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Metcalf @ 2015-09-02 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Will Deacon, Linus Torvalds, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages@gmail.com,
	dvhart@infradead.org, dave@stgolabs.net,
	Vineet.Gupta1@synopsys.com, ralf@linux-mips.org,
	ddaney@caviumnetworks.com, linux-kernel@vger.kernel.org, linux,
	rth

On 09/02/2015 01:00 PM, Peter Zijlstra wrote:
> On Wed, Sep 02, 2015 at 12:10:58PM -0400, Chris Metcalf wrote:
>> On 09/02/2015 08:55 AM, Peter Zijlstra wrote:
>>> So here goes..
>>>
>>> Chris, I'm awfully sorry, but I seem to be Tile challenged.
>>>
>>> TileGX seems to define:
>>>
>>> #define smp_mb__before_atomic()	smp_mb()
>>> #define smp_mb__after_atomic()	smp_mb()
>>>
>>> However, its atomic_add_return() implementation looks like:
>>>
>>> static inline int atomic_add_return(int i, atomic_t *v)
>>> {
>>> 	int val;
>>> 	smp_mb();  /* barrier for proper semantics */
>>> 	val = __insn_fetchadd4((void *)&v->counter, i) + i;
>>> 	barrier();  /* the "+ i" above will wait on memory */
>>> 	return val;
>>> }
>>>
>>> Which leaves me confused on smp_mb__after_atomic().
>> Are you concerned about whether it has proper memory
>> barrier semantics already, i.e. full barriers before and after?
>> In fact we do have a full barrier before, but then because of the
>> "+ i" / "barrier()", we know that the only other operation since
>> the previous mb(), namely the read of v->counter, has
>> completed after the atomic operation.  As a result we can
>> omit explicitly having a second barrier.
>>
>> It does seem like all the current memory-order semantics are
>> correct, unless I'm missing something!
> So I'm reading that code like:
>
> 	MB
>   [RmW]	ret = *val += i
>
>
> So what is stopping later memory ops like:
>
>     [R]	a = *foo
>     [S]	*bar = b
>
>  From getting reordered with the RmW, like:
>
> 	MB
>
>     [R]	a = *foo
>     [S]	*bar = b
>
>   [RmW]	ret = *val += i
>
> Are you saying Tile does not reorder things like that? If so, why then
> is smp_mb__after_atomic() a full mb(). If it does, I don't see how your
> add_return is correct.
>
> Alternatively I'm just confused..

Tile does not do out-of-order instruction issue, but it does have an
out-of-order memory subsystem, so in addition to stores becoming
unpredictably visible without a memory barrier, loads will also
potentially not read from memory predictably after issue.
As a result, later operations that use a register that was previously
loaded may stall instruction issue until the load value is available.
A memory fence instruction will cause the core to wait for all
stores to become visible and all load values to be available.

So [R] can't move up to before [RmW] due to the in-order issue
nature of the processor.  And smp_mb__after_atomic() has to
be a full mb() because that's the only barrier we have available
to guarantee that the load has read from memory.  (If the
value of the actual atomic was passed to smp_mb__after_atomic()
then we could just generate a fake use of the value, basically
generating something like "move r1, r1", which would cause
the instruction issue to halt until the value had been read.)

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-02 17:00           ` Peter Zijlstra
  2015-09-02 17:25             ` Chris Metcalf
@ 2015-09-02 21:18             ` Linus Torvalds
  2015-09-04 17:25               ` Chris Metcalf
  2015-09-05 17:53               ` Peter Zijlstra
  1 sibling, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2015-09-02 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Thomas Gleixner, Will Deacon, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages@gmail.com,
	dvhart@infradead.org, dave@stgolabs.net,
	Vineet.Gupta1@synopsys.com, ralf@linux-mips.org,
	ddaney@caviumnetworks.com, linux-kernel@vger.kernel.org,
	Russell King - ARM Linux, Richard Henderson

On Wed, Sep 2, 2015 at 10:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I'm reading that code like:
>
>         MB
>  [RmW]  ret = *val += i
>
>
> So what is stopping later memory ops like:
>
>    [R]  a = *foo
>    [S]  *bar = b
>
> From getting reordered with the RmW, like:
>
>         MB
>
>    [R]  a = *foo
>    [S]  *bar = b
>
>  [RmW]  ret = *val += i

So I do agree that for the atomic futex operation to be usable for
locking (which I think we all agree is a primary objective), the futex
write operations have to work both as acquire and release operations,
which basically means that it has to be both an acquire _and_ release
op.

That said, I'm not sure it really needs to be a full memory barrier on
both sides, and in particular, I'm not sure it needs to be a memory
barroer for the surrounding *kernel* operations.

So I do suspect that adding full smp_mb()'s on both sides is not
necessarily required for architectures, because there are often
serialization guarantees at kernel entry/exit.

So I think we could possibly relax the requirements (and document this
very clearly) to say that the futex operation must be totally ordered
wrt any other _user_space_ accesses by that thread. I suspect a lot of
architectures can then say "we may be very weakly ordered, but kernel
entry/exit implies enough synchronization that we do not need any
futher memory barriers".

For example, on x86, the locked instructions are obviously already
sufficiently strong, but even if they weren't, kernel entry/exit is
documented to be a serializing instruction (which is something
insanely much stronger than just memory ordering). And I suspect there
are similar issues on a lot of architectures where the memory ordering
is done by the core, but the cache subsystem is strongly ordered (ie
saen good SMP systems - so it sounds like tile needs the smp_mb()'s,
but I would almost suspect that POWER and ARM might *not* need them).

           Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-02 21:18             ` Linus Torvalds
@ 2015-09-04 17:25               ` Chris Metcalf
  2015-09-05 17:53               ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Metcalf @ 2015-09-04 17:25 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Thomas Gleixner, Will Deacon, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages@gmail.com, dvhart@infradead.org,
	dave@stgolabs.net, Vineet.Gupta1@synopsys.com,
	ralf@linux-mips.org, ddaney@caviumnetworks.com,
	linux-kernel@vger.kernel.org, Russell King - ARM Linux,
	Richard Henderson

On 09/02/2015 05:18 PM, Linus Torvalds wrote:
> For example, on x86, the locked instructions are obviously already
> sufficiently strong, but even if they weren't, kernel entry/exit is
> documented to be a serializing instruction (which is something
> insanely much stronger than just memory ordering). And I suspect there
> are similar issues on a lot of architectures where the memory ordering
> is done by the core, but the cache subsystem is strongly ordered (ie
> saen good SMP systems - so it sounds like tile needs the smp_mb()'s,
> but I would almost suspect that POWER and ARM might *not* need them).

Because POWER and ARM have serializing kernel entry/exit?
I think tile has relatively conventional cache/memory semantics,
but it's certainly true there is implicit memory ordering guarantee
for kernel entry/exit.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-02 21:18             ` Linus Torvalds
  2015-09-04 17:25               ` Chris Metcalf
@ 2015-09-05 17:53               ` Peter Zijlstra
  2015-09-07  9:30                 ` Will Deacon
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-05 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Metcalf, Thomas Gleixner, Will Deacon, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages@gmail.com,
	dvhart@infradead.org, dave@stgolabs.net,
	Vineet.Gupta1@synopsys.com, ralf@linux-mips.org,
	ddaney@caviumnetworks.com, linux-kernel@vger.kernel.org,
	Russell King - ARM Linux, Richard Henderson

On Wed, Sep 02, 2015 at 02:18:53PM -0700, Linus Torvalds wrote:
> So I think we could possibly relax the requirements (and document this
> very clearly) to say that the futex operation must be totally ordered
> wrt any other _user_space_ accesses by that thread. I suspect a lot of
> architectures can then say "we may be very weakly ordered, but kernel
> entry/exit implies enough synchronization that we do not need any
> futher memory barriers".

Right, so before sending this email I actually spoke to Ralf about this
option, and he said that this is not actually well defined for MIPS.

But we could certainly document it such and let archs for which this is
well documented (I would expect this to be most) choose that
implementation.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: futex atomic vs ordering constraints
  2015-09-05 17:53               ` Peter Zijlstra
@ 2015-09-07  9:30                 ` Will Deacon
  0 siblings, 0 replies; 15+ messages in thread
From: Will Deacon @ 2015-09-07  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Chris Metcalf, Thomas Gleixner, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages@gmail.com,
	dvhart@infradead.org, dave@stgolabs.net,
	Vineet.Gupta1@synopsys.com, ralf@linux-mips.org,
	ddaney@caviumnetworks.com, linux-kernel@vger.kernel.org,
	Russell King - ARM Linux, Richard Henderson

On Sat, Sep 05, 2015 at 06:53:02PM +0100, Peter Zijlstra wrote:
> On Wed, Sep 02, 2015 at 02:18:53PM -0700, Linus Torvalds wrote:
> > So I think we could possibly relax the requirements (and document this
> > very clearly) to say that the futex operation must be totally ordered
> > wrt any other _user_space_ accesses by that thread. I suspect a lot of
> > architectures can then say "we may be very weakly ordered, but kernel
> > entry/exit implies enough synchronization that we do not need any
> > futher memory barriers".
> 
> Right, so before sending this email I actually spoke to Ralf about this
> option, and he said that this is not actually well defined for MIPS.
> 
> But we could certainly document it such and let archs for which this is
> well documented (I would expect this to be most) choose that
> implementation.

Whilst a control-dependency + exception return forms a barrier of sorts
on arm/arm64, it's not required to be transitive [1], so I wouldn't be
comfortable making that relaxation on the futex path.

Will

[1] See, for example, "ISA2+dmb+ctrlisb+dmb" at
    https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#ARM

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-09-07  9:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-26 18:16 futex atomic vs ordering constraints Peter Zijlstra
2015-08-29  1:33 ` Davidlohr Bueso
2015-09-01 16:38   ` Peter Zijlstra
2015-09-01 16:31 ` Will Deacon
2015-09-01 16:42   ` Peter Zijlstra
2015-09-01 16:47     ` Will Deacon
2015-09-01 19:05     ` Thomas Gleixner
2015-09-02 12:55       ` Peter Zijlstra
2015-09-02 16:10         ` Chris Metcalf
2015-09-02 17:00           ` Peter Zijlstra
2015-09-02 17:25             ` Chris Metcalf
2015-09-02 21:18             ` Linus Torvalds
2015-09-04 17:25               ` Chris Metcalf
2015-09-05 17:53               ` Peter Zijlstra
2015-09-07  9:30                 ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).