From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Darren Hart <darren@dvhart.com>,
Steven Rostedt <rostedt@goodmis.org>,
fredrik.markstrom@windriver.com,
Davidlohr Bueso <davidlohr@hp.com>,
Manfred Spraul <manfred@colorfullife.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT
Date: Tue, 7 Apr 2015 17:03:48 +0200 [thread overview]
Message-ID: <1428419030-20030-2-git-send-email-bigeasy@linutronix.de> (raw)
In-Reply-To: <1428419030-20030-1-git-send-email-bigeasy@linutronix.de>
From: Thomas Gleixner <tglx@linutronix.de>
The boosted priority is reverted after the unlock but before the
futex_hash_bucket (hb) has been accessed. The result is that we boost the
task, deboost the task, boost again for the hb lock, deboost again.
A sched trace of this scenario looks the following:
| med_prio-93 sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| med_prio-93 sched_switch: prev_comm=med_prio prev_pid=93 prev_prio=29 prev_state=R ==> next_comm=high_prio next_pid=92 next_prio=9
|high_prio-92 sched_pi_setprio: comm=low_prio pid=91 oldprio=120 newprio=9
|high_prio-92 sched_switch: prev_comm=high_prio prev_pid=92 prev_prio=9 prev_state=S ==> next_comm=low_prio next_pid=91 next_prio=9
| low_prio-91 sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| low_prio-91 sched_pi_setprio: comm=low_prio pid=91 oldprio=9 newprio=120
| low_prio-91 sched_switch: prev_comm=low_prio prev_pid=91 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=92 next_prio=9
|high_prio-92 sched_pi_setprio: comm=low_prio pid=91 oldprio=120 newprio=9
|high_prio-92 sched_switch: prev_comm=high_prio prev_pid=92 prev_prio=9 prev_state=D ==> next_comm=low_prio next_pid=91 next_prio=9
| low_prio-91 sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| low_prio-91 sched_pi_setprio: comm=low_prio pid=91 oldprio=9 newprio=120
| low_prio-91 sched_switch: prev_comm=low_prio prev_pid=91 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=92 next_prio=9
We see four sched_pi_setprio() invocation but ideally two would be enough.
The patch tries to avoid the double wakeup by a wake up once the hb lock is
released. The same test case:
| med_prio-21 sched_wakeup: comm=high_prio pid=20 prio=9 success=1 target_cpu=000
| med_prio-21 sched_switch: prev_comm=med_prio prev_pid=21 prev_prio=29 prev_state=R ==> next_comm=high_prio next_pid=20 next_prio=9
|high_prio-20 sched_pi_setprio: comm=low_prio pid=19 oldprio=120 newprio=9
|high_prio-20 sched_switch: prev_comm=high_prio prev_pid=20 prev_prio=9 prev_state=S ==> next_comm=low_prio next_pid=19 next_prio=9
| low_prio-19 sched_wakeup: comm=high_prio pid=20 prio=9 success=1 target_cpu=000
| low_prio-19 sched_pi_setprio: comm=low_prio pid=19 oldprio=9 newprio=120
| low_prio-19 sched_switch: prev_comm=low_prio prev_pid=19 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=20 next_prio=9
only two sched_pi_setprio() invocations as one would expect and see
without -RT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex.c | 32 +++++++++++++++++++++++++++++---
kernel/locking/rtmutex.c | 39 ++++++++++++++++++++++++++++-----------
kernel/locking/rtmutex_common.h | 4 ++++
3 files changed, 61 insertions(+), 14 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 2579e407ff67..b38abe3573a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1122,11 +1122,13 @@ static void wake_futex(struct futex_q *q)
put_task_struct(p);
}
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+ struct futex_hash_bucket *hb)
{
struct task_struct *new_owner;
struct futex_pi_state *pi_state = this->pi_state;
u32 uninitialized_var(curval), newval;
+ bool deboost;
int ret = 0;
if (!pi_state)
@@ -1178,7 +1180,17 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
raw_spin_unlock_irq(&new_owner->pi_lock);
raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
- rt_mutex_unlock(&pi_state->pi_mutex);
+
+ deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex);
+
+ /*
+ * We deboost after dropping hb->lock. That prevents a double
+ * wakeup on RT.
+ */
+ spin_unlock(&hb->lock);
+
+ if (deboost)
+ rt_mutex_adjust_prio(current);
return 0;
}
@@ -2412,13 +2424,26 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
match = futex_top_waiter(hb, &key);
if (match) {
- ret = wake_futex_pi(uaddr, uval, match);
+ ret = wake_futex_pi(uaddr, uval, match, hb);
+
+ /*
+ * In case of success wake_futex_pi dropped the hash
+ * bucket lock.
+ */
+ if (!ret)
+ goto out_putkey;
+
/*
* The atomic access to the futex value generated a
* pagefault, so retry the user-access and the wakeup:
*/
if (ret == -EFAULT)
goto pi_faulted;
+
+ /*
+ * wake_futex_pi has detected invalid state. Tell user
+ * space.
+ */
goto out_unlock;
}
@@ -2439,6 +2464,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
out_unlock:
spin_unlock(&hb->lock);
+out_putkey:
put_futex_key(&key);
return ret;
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b73279367087..9d858106ba0c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -298,7 +298,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
* of task. We do not use the spin_xx_mutex() variants here as we are
* outside of the debug path.)
*/
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
{
unsigned long flags;
@@ -955,8 +955,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
/*
* Wake up the next waiter on the lock.
*
- * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * Remove the top waiter from the current tasks pi waiter list take a
+ * reference and return a pointer to it.
*
* Called with lock->wait_lock held.
*/
@@ -1253,7 +1253,7 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
/*
* Slow path to release a rt-mutex:
*/
-static void __sched
+static bool __sched
rt_mutex_slowunlock(struct rt_mutex *lock)
{
raw_spin_lock(&lock->wait_lock);
@@ -1296,7 +1296,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
while (!rt_mutex_has_waiters(lock)) {
/* Drops lock->wait_lock ! */
if (unlock_rt_mutex_safe(lock) == true)
- return;
+ return false;
/* Relock the rtmutex and try again */
raw_spin_lock(&lock->wait_lock);
}
@@ -1309,8 +1309,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
raw_spin_unlock(&lock->wait_lock);
- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ return true;
}
/*
@@ -1361,12 +1360,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
static inline void
rt_mutex_fastunlock(struct rt_mutex *lock,
- void (*slowfn)(struct rt_mutex *lock))
+ bool (*slowfn)(struct rt_mutex *lock))
{
- if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
rt_mutex_deadlock_account_unlock(current);
- else
- slowfn(lock);
+ } else if (slowfn(lock)) {
+ /* Undo pi boosting if necessary: */
+ rt_mutex_adjust_prio(current);
+ }
}
/**
@@ -1461,6 +1462,22 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
EXPORT_SYMBOL_GPL(rt_mutex_unlock);
/**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
+{
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+ rt_mutex_deadlock_account_unlock(current);
+ return false;
+ }
+ return rt_mutex_slowunlock(lock);
+}
+
+/**
* rt_mutex_destroy - mark a mutex unusable
* @lock: the mutex to be destroyed
*
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..1bcf43099b16 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -132,6 +132,10 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter);
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock);
+
+extern void rt_mutex_adjust_prio(struct task_struct *task);
+
#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
#else
--
2.1.4
next prev parent reply other threads:[~2015-04-07 15:04 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-07 15:03 improve futex on -RT by avoiding the double wake-up Sebastian Andrzej Siewior
2015-04-07 15:03 ` Sebastian Andrzej Siewior [this message]
2015-04-07 18:41 ` [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT Thomas Gleixner
2015-04-10 14:42 ` [PATCH 1/3 v2] " Sebastian Andrzej Siewior
2015-04-07 15:03 ` [PATCH 2/3] futex: avoid double wake up in futex_wake() " Sebastian Andrzej Siewior
2015-04-07 19:47 ` Thomas Gleixner
2015-04-10 16:11 ` [PATCH 2/3 v2] " Sebastian Andrzej Siewior
2015-04-13 3:02 ` Davidlohr Bueso
2015-04-16 5:09 ` Davidlohr Bueso
2015-04-16 9:19 ` Thomas Gleixner
2015-04-16 10:16 ` Peter Zijlstra
2015-04-16 10:49 ` Thomas Gleixner
2015-04-16 14:42 ` Davidlohr Bueso
2015-04-16 15:54 ` Peter Zijlstra
2015-04-16 16:22 ` Davidlohr Bueso
2015-04-07 15:03 ` [PATCH 3/3] ipc/mqueue: remove STATE_PENDING Sebastian Andrzej Siewior
2015-04-07 17:48 ` Manfred Spraul
2015-04-07 18:28 ` Thomas Gleixner
2015-04-10 14:37 ` [PATCH v2] " Sebastian Andrzej Siewior
2015-04-23 22:18 ` Thomas Gleixner
2015-04-28 3:24 ` Davidlohr Bueso
2015-04-28 12:37 ` Peter Zijlstra
2015-04-28 16:36 ` Davidlohr Bueso
2015-04-28 16:43 ` Peter Zijlstra
2015-04-28 16:59 ` Davidlohr Bueso
2015-04-29 19:44 ` Manfred Spraul
2015-04-30 18:46 ` Davidlohr Bueso
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1428419030-20030-2-git-send-email-bigeasy@linutronix.de \
--to=bigeasy@linutronix.de \
--cc=darren@dvhart.com \
--cc=davidlohr@hp.com \
--cc=fredrik.markstrom@windriver.com \
--cc=linux-kernel@vger.kernel.org \
--cc=manfred@colorfullife.com \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).