From: Darren Hart <dvhltc@us.ibm.com>
To: "lkml, " <linux-kernel@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Sripathi Kodi <sripathik@in.ibm.com>,
Peter Zijlstra <peterz@infradead.org>,
John Stultz <johnstul@us.ibm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Dinakar Guniguntala <dino@in.ibm.com>,
Ulrich Drepper <drepper@redhat.com>,
Eric Dumazet <dada1@cosmosbay.com>, Ingo Molnar <mingo@elte.hu>,
Jakub Jelinek <jakub@redhat.com>
Subject: [tip PATCH] futex: add requeue-pi documentation
Date: Thu, 07 May 2009 15:40:14 -0700
Message-ID: <4A03634E.3080609@us.ibm.com>
From: Darren Hart <dvhltc@us.ibm.com>
Add Documentation/futex-requeue-pi.txt describing the motivation for the
newly added FUTEX_*REQUEUE_PI op codes and their implementation.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jakub Jelinek <jakub@redhat.com>
---
Documentation/futex-requeue-pi.txt | 124 ++++++++++++++++++++++++++++++++++++
1 files changed, 124 insertions(+), 0 deletions(-)
create mode 100644 Documentation/futex-requeue-pi.txt
diff --git a/Documentation/futex-requeue-pi.txt b/Documentation/futex-requeue-pi.txt
new file mode 100644
index 0000000..7933394
--- /dev/null
+++ b/Documentation/futex-requeue-pi.txt
@@ -0,0 +1,124 @@
+Futex Requeue PI
+----------------
+
+Requeueing of tasks from a non-PI futex to a PI futex requires special handling
+in order to ensure the underlying rt_mutex is never left without an owner if it
+has waiters; doing so would break the PI boosting logic [see
+rt-mutex-design.txt]. For brevity, this action will be referred to as
+"requeue_pi" throughout this document. Priority inheritance is abbreviated
+throughout as "PI".
+
+Motivation
+----------
+
+Without requeue_pi, the glibc implementation of pthread_cond_broadcast() must
+resort to waking all the tasks waiting on a pthread_condvar and letting them
+try to sort out which task gets to run first in classic thundering-herd
+formation. An ideal implementation would wake the highest-priority waiter, and
+leave the rest to the natural wakeup inherent in unlocking the mutex associated
+with the condvar.
+
+Consider the simplified glibc calls:
+
+/* caller must lock mutex */
+pthread_cond_wait(cond, mutex)
+{
+	lock(cond->__data.__lock);
+	unlock(mutex);
+	do {
+		unlock(cond->__data.__lock);
+		futex_wait(cond->__data.__futex);
+		lock(cond->__data.__lock);
+	} while (...);
+	unlock(cond->__data.__lock);
+	lock(mutex);
+}
+
+pthread_cond_broadcast(cond)
+{
+	lock(cond->__data.__lock);
+	unlock(cond->__data.__lock);
+	futex_requeue(cond->__data.__futex, cond->mutex);
+}
+
+Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
+Note that pthread_cond_wait() attempts to lock the mutex only after it has
+returned to user space. This will leave the underlying rt_mutex with waiters,
+and no owner, breaking the previously mentioned PI-boosting algorithms.
+
+In order to support PI-aware pthread_condvars, the kernel needs to be able to
+requeue tasks to PI futexes. This support implies that upon a successful
+futex_wait system call, the caller would return to user space already holding
+the PI futex. The glibc implementation would be modified as follows:
+
+
+/* caller must lock mutex */
+pthread_cond_wait_pi(cond, mutex)
+{
+	lock(cond->__data.__lock);
+	unlock(mutex);
+	do {
+		unlock(cond->__data.__lock);
+		futex_wait_requeue_pi(cond->__data.__futex);
+		lock(cond->__data.__lock);
+	} while (...);
+	unlock(cond->__data.__lock);
+	/* the kernel acquired the mutex for us */
+}
+
+pthread_cond_broadcast_pi(cond)
+{
+	lock(cond->__data.__lock);
+	unlock(cond->__data.__lock);
+	futex_requeue_pi(cond->__data.__futex, cond->mutex);
+}
+
+The actual glibc implementation will likely test for PI and make the
+necessary changes inside the existing calls rather than creating new calls
+for the PI cases. Similar changes are needed for pthread_cond_timedwait()
+and pthread_cond_signal().
+
+Implementation
+--------------
+
+In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
+for both the requeue code and the waiting code to be able to acquire the
+rt_mutex before returning to user space. The requeue code cannot simply wake
+the waiter and leave it to acquire the rt_mutex, as that would open a race
+window between the requeue call returning to user space and the waiter waking
+and starting to run. This is especially true in the uncontended case.
+
+The solution involves two new rt_mutex helper routines,
+rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which allow the
+requeue code to acquire an uncontended rt_mutex on behalf of the waiter and to
+enqueue the waiter on a contended rt_mutex. Two new futex op codes provide the
+kernel<->user interface to requeue_pi: FUTEX_WAIT_REQUEUE_PI and
+FUTEX_CMP_REQUEUE_PI.
+
+FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
+pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
+to a PI-aware futex. The implementation is the result of a high-speed
+collision between futex_wait() and futex_lock_pi(), with some extra logic to
+check for the additional wake-up scenarios.
+
+FUTEX_CMP_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
+pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
+Internally, this operation is still handled by futex_requeue() (by passing
+requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the
+requeue target PI futex on behalf of the top waiter. If it can, this waiter is
+woken. futex_requeue() then proceeds to requeue the remaining
+nr_wake+nr_requeue tasks to the PI futex, calling rt_mutex_start_proxy_lock()
+prior to each requeue to prepare the task as a waiter on the underlying
+rt_mutex. It is possible that the lock can be acquired at this stage as well;
+if so, the next waiter is woken to finish the acquisition of the lock.
+FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but their sum
+is all that really matters. futex_requeue() will wake or requeue up to
+nr_wake + nr_requeue tasks. It will wake only as many tasks as it can acquire
+the lock for, which in the majority of cases should be 0, as good programming
+practice dictates that the caller of either pthread_cond_broadcast() or
+pthread_cond_signal() acquire the mutex prior to making the call.
+FUTEX_CMP_REQUEUE_PI requires that nr_wake=1. nr_requeue should be INT_MAX for
+broadcast and 0 for signal.
+
+
+
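As an illustration only, and not part of the patch above, the argument rule
stated at the end of the document (nr_wake is always 1; nr_requeue is 0 for
signal and INT_MAX for broadcast) could be captured in a small user-space
helper. This is a minimal sketch: the enum, the struct and the function name
are hypothetical and do not exist in the kernel or in glibc.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical names for this sketch; not a kernel or glibc API. */
enum condvar_wake_kind { CONDVAR_SIGNAL, CONDVAR_BROADCAST };

struct requeue_pi_args {
	int nr_wake;	/* the requeue_pi requeue op requires exactly 1 */
	int nr_requeue;	/* 0 for signal, INT_MAX for broadcast */
};

/* Pick the nr_wake/nr_requeue pair described in the document. */
static struct requeue_pi_args requeue_pi_args_for(enum condvar_wake_kind kind)
{
	struct requeue_pi_args args;

	assert(kind == CONDVAR_SIGNAL || kind == CONDVAR_BROADCAST);
	args.nr_wake = 1;
	args.nr_requeue = (kind == CONDVAR_BROADCAST) ? INT_MAX : 0;
	return args;
}
```

A caller implementing pthread_cond_signal() or pthread_cond_broadcast() would
pass the two returned values as the nr_wake and nr_requeue arguments of the
requeue operation; how they are marshalled into the futex system call is left
out of this sketch.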
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team