Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, tglx@linutronix.de, juri.lelli@arm.com,
	rostedt@goodmis.org, xlpang@redhat.com, bigeasy@linutronix.de
Cc: linux-kernel@vger.kernel.org, mathieu.desnoyers@efficios.com,
	jdesfossez@efficios.com, bristot@redhat.com
Subject: Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
Date: Fri, 7 Oct 2016 13:21:43 +0200	[thread overview]
Message-ID: <20161007112143.GJ3117@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20161003091847.704255067@infradead.org>


New version..


This one seems to pass all the (pi) futex tests and survives many hours
of my modified pi_stress (I added MADV_UNMAP to punch holes in the
page-tables to trigger (minor) faults).

---
Subject: futex: Rewrite FUTEX_UNLOCK_PI
From: Peter Zijlstra <peterz@infradead.org>
Date: Sun Oct 2 18:42:33 CEST 2016

There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
caused by holding hb->lock while doing the rt_mutex_unlock()
equivalient.

Notably:
 - a PI inversion on hb->lock
 - DL crash because of pointer instability.

This patch doesn't attempt to fix any of the actual problems, but
instead reworks the code to not hold hb->lock across the unlock,
paving the way to actually fix the problems later.

The current reason we hold hb->lock over unlock is that it serializes
against FUTEX_LOCK_PI and avoids new waiters from coming in, this then
ensures the rt_mutex_next_owner() value is stable and can be written
into the user-space futex value before doing the unlock. Such that the
unlock will indeed end up at new_owner.

This patch recognises that holding rt_mutex::wait_lock results in the
very same guarantee, no new waiters can come in while we hold that
lock -- after all, waiters would need this lock to queue themselves.

This (of course) is not entirely straight forward either, see the
comment in rt_mutex_slowunlock(), doing the unlock itself might drop
wait_lock, letting new waiters in.

Another problem is the case where futex_lock_pi() failed to acquire
the lock (ie. released rt_mutex::wait_lock) but hasn't yet re-acquired
hb->lock and called unqueue_me_pi(). In this case we're confused about
having waiters (the futex state says yes, the rt_mutex state says no).

The current solution is to assign the futex to the waiter from the
futex state, and have futex_lock_pi() detect this and try and fix it
up. This again, all relies on hb->lock serializing things.


Solve all that by:

 - using futex specific rt_mutex calls that lack the fastpath, futexes
   have their own fastpath anyway. This makes that
   rt_mutex_futex_unlock() doesn't need to drop rt_mutex::wait_lock
   and the unlock is guaranteed if we manage to update user state.

 - make futex_unlock_pi() drop hb->lock early and only use
   rt_mutex::wait_lock to serialize against rt_mutex waiters
   update the futex value and unlock.

 - in case futex and rt_mutex disagree on waiters, side with rt_mutex
   and simply clear the user value. This works because either there
   really are no waiters left, or futex_lock_pi() triggers the
   lock-steal path and fixes up the WAITERS flag.


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |  178 ++++++++++++++++++++--------------------
 kernel/locking/rtmutex.c        |   55 +++++++++---
 kernel/locking/rtmutex_common.h |    9 +-
 3 files changed, 139 insertions(+), 103 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -914,7 +914,7 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
 		spin_unlock(&hb->lock);
 
@@ -1146,7 +1146,7 @@ static int lock_pi_update_atomic(u32 __u
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1291,49 +1291,58 @@ static void mark_wake_futex(struct wake_
 	smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
-	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
-	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
 
-	/*
-	 * We pass it to the next owner. The WAITERS bit is always
-	 * kept enabled while there is PI state around. We cleanup the
-	 * owner died bit, because we are the owner.
-	 */
-	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
+	if (!new_owner) {
+		/*
+		 * This is the case where futex_lock_pi() has not yet or failed
+		 * to acquire the lock but still has the futex_q enqueued. So
+		 * the futex state has a 'waiter' while the rt_mutex state does
+		 * not.
+		 *
+		 * Even though there still is pi_state for this futex, we can
+		 * clear FUTEX_WAITERS. Either:
+		 *
+		 *  - we or futex_lock_pi() will drop the last reference and
+		 *    clean up this pi_state,
+		 *
+		 *  - userspace acquires the futex through its fastpath
+		 *    and the above pi_state cleanup still happens,
+		 *
+		 *  - or futex_lock_pi() will re-set the WAITERS bit in
+		 *    fixup_owner().
+		 */
+		newval = 0;
+		/*
+		 * Since pi_state->owner must point to a valid task, and
+		 * task_pid_vnr(pi_state->owner) must match TID_MASK, use
+		 * init_task.
+		 */
+		new_owner = &init_task;
+	} else {
+		/*
+		 * We pass it to the next owner. The WAITERS bit is always kept
+		 * enabled while there is PI state around. We cleanup the owner
+		 * died bit, because we are the owner.
+		 */
+		newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	}
 
 	if (unlikely(should_fail_futex(true)))
 		ret = -EFAULT;
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1346,10 +1355,9 @@ static int wake_futex_pi(u32 __user *uad
 		else
 			ret = -EINVAL;
 	}
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1362,22 +1370,20 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
-	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+out_unlock:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -2228,7 +2234,6 @@ static long futex_wait_restart(struct re
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
@@ -2242,43 +2247,15 @@ static int fixup_owner(u32 __user *uaddr
 	}
 
 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
 	return ret ? ret : locked;
@@ -2566,7 +2543,7 @@ static int futex_lock_pi(u32 __user *uad
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2589,7 +2566,7 @@ static int futex_lock_pi(u32 __user *uad
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2656,7 +2633,34 @@ static int futex_unlock_pi(u32 __user *u
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
-		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
+		struct futex_pi_state *pi_state = top_waiter->pi_state;
+
+		ret = -EINVAL;
+		if (!pi_state)
+			goto out_unlock;
+
+		/*
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (pi_state->owner != current)
+			goto out_unlock;
+
+		/*
+		 * Grab a reference on the pi_state and drop hb->lock.
+		 *
+		 * The reference ensures pi_state lives, dropping the hb->lock
+		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
+		 * close the races against futex_lock_pi(), but in case of
+		 * _any_ fail we'll abort and retry the whole deal.
+		 */
+		WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+		spin_unlock(&hb->lock);
+
+		ret = wake_futex_pi(uaddr, uval, pi_state);
+
+		put_pi_state(pi_state);
+
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.
@@ -2674,7 +2678,6 @@ static int futex_unlock_pi(u32 __user *u
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2682,7 +2685,7 @@ static int futex_unlock_pi(u32 __user *u
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2692,8 +2695,10 @@ static int futex_unlock_pi(u32 __user *u
 	 * preserve the WAITERS bit not the OWNER_DIED one. We are the
 	 * owner.
 	 */
-	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) {
+		spin_unlock(&hb->lock);
 		goto pi_faulted;
+	}
 
 	/*
 	 * If uval has changed, let user space handle it.
@@ -2707,7 +2712,6 @@ static int futex_unlock_pi(u32 __user *u
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2937,7 +2941,7 @@ static int futex_wait_requeue_pi(u32 __u
 	 */
 	if (ret == -EFAULT) {
 		if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);
 	} else if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1422,15 +1422,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interrup
 
 /*
  * Futex variant with full deadlock detection.
+ * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
  */
-int rt_mutex_timed_futex_lock(struct rt_mutex *lock,
+int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
 			      struct hrtimer_sleeper *timeout)
 {
 	might_sleep();
 
-	return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout,
-				       RT_MUTEX_FULL_CHAINWALK,
-				       rt_mutex_slowlock);
+	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
+				 timeout, RT_MUTEX_FULL_CHAINWALK);
+}
+
+/*
+ * Futex variant, must not use fastpath.
+ */
+int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
+{
+	return rt_mutex_slowtrylock(lock);
 }
 
 /**
@@ -1489,19 +1497,38 @@ void __sched rt_mutex_unlock(struct rt_m
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
- * @lock: the rt_mutex to be unlocked
- *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Futex variant, that since futex variants do not use the fast-path, can be
+ * simple and will not need to retry.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				    struct wake_q_head *wake_q)
+{
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	if (!rt_mutex_has_waiters(lock)) {
+		lock->owner = NULL;
+		return false; /* done */
+	}
+
+	mark_wakeup_next_waiter(wake_q, lock);
+	return true; /* deboost and wakeups */
+}
+
+void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	WAKE_Q(wake_q);
+	bool deboost;
 
-	return rt_mutex_slowunlock(lock, wqh);
+	raw_spin_lock_irq(&lock->wait_lock);
+	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -108,9 +108,14 @@ extern int rt_mutex_start_proxy_lock(str
 extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
+
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_trylock(struct rt_mutex *l);
+
+extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
+extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
+
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

next prev parent reply	other threads:[~2016-10-07 11:22 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
2016-10-03 14:15   ` Steven Rostedt
2016-10-05  3:58   ` Davidlohr Bueso
2016-10-03  9:12 ` [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
2016-10-03 14:19   ` Steven Rostedt
2016-10-05  3:57   ` Davidlohr Bueso
2016-10-05  6:20     ` Peter Zijlstra
2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
2016-10-03  9:34   ` Peter Zijlstra
2016-10-03 14:25   ` Steven Rostedt
2016-10-05  1:08   ` Davidlohr Bueso
2016-10-05  7:29   ` Sebastian Andrzej Siewior
2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
2016-10-03 15:36   ` Steven Rostedt
2016-10-03 15:44     ` Peter Zijlstra
2016-10-03 15:45     ` Peter Zijlstra
2016-10-03 16:23       ` Steven Rostedt
2016-10-05  7:41   ` Sebastian Andrzej Siewior
2016-10-05  8:09     ` Peter Zijlstra
2016-10-05  8:21       ` Sebastian Andrzej Siewior
2016-10-05  8:32         ` Peter Zijlstra
2016-10-06 10:29   ` Peter Zijlstra
2016-10-07 11:21   ` Peter Zijlstra [this message]
2016-10-08 15:53     ` Thomas Gleixner
2016-10-08 16:55       ` Peter Zijlstra
2016-10-08 17:06         ` Thomas Gleixner
2016-10-10 10:17         ` Thomas Gleixner
2016-10-10 11:40           ` Peter Zijlstra
2016-10-21 12:27           ` Peter Zijlstra
2016-10-27 20:36             ` Thomas Gleixner
2016-11-23 19:20               ` Peter Zijlstra
2016-11-24 16:52                 ` Peter Zijlstra
2016-11-24 17:56                   ` Thomas Gleixner
2016-11-24 18:58                     ` Peter Zijlstra
2016-11-25  9:23                       ` Peter Zijlstra
2016-11-25 10:03                         ` Peter Zijlstra
2016-11-25 19:13                           ` Thomas Gleixner
2016-11-25 14:09               ` Peter Zijlstra
2016-10-08 18:22     ` Thomas Gleixner
2016-10-09 11:17     ` Thomas Gleixner
2016-10-10 14:06       ` Peter Zijlstra
2016-10-05  1:02 ` [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Davidlohr Bueso
2016-10-05  6:20   ` Peter Zijlstra
2016-10-05  7:26     ` Sebastian Andrzej Siewior
2016-10-05 16:04     ` Davidlohr Bueso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161007112143.GJ3117@twins.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=bigeasy@linutronix.de \
    --cc=bristot@redhat.com \
    --cc=jdesfossez@efficios.com \
    --cc=juri.lelli@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=xlpang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.