From: Vernon Mauery <vernux@us.ibm.com>
To: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
linux-kernel@vger.kernel.org,
Sripathi Kodi <sripathik@in.ibm.com>,
Peter Zijlstra <peterz@infradead.org>,
John Stultz <johnstul@us.ibm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Dinakar Guniguntala <dino@in.ibm.com>,
Ulrich Drepper <drepper@redhat.com>,
Eric Dumazet <dada1@cosmosbay.com>, Ingo Molnar <mingo@elte.hu>,
Jakub Jelinek <jakub@redhat.com>,
niv@linux.vnet.ibm.com
Subject: Re: [tip PATCH v7 0/9] RFC: futex: requeue pi implementation
Date: Tue, 07 Apr 2009 14:33:03 -0700 [thread overview]
Message-ID: <49DBC68F.3080508@us.ibm.com> (raw)
In-Reply-To: <49DA2DFA.5080205@us.ibm.com>
Darren Hart wrote:
> Thomas Gleixner wrote:
>> Darren,
>>
>> On Fri, 3 Apr 2009, Darren Hart wrote:
[snip]
>> Darren, could you please polish the initial design notes - especially
>> point out the subtle differences between requeue and requeue_pi - and
>> send them into the thread? That might help Uli and Jakub and we
>> definitly want to have that info preserved in Documentation/ as well.
>
> I'll be away most of the day so I've whipped something up quickly in
> order to get discussion started. It is probably a little heavy on the
> glibc details and a little light on the kernel details for inclusion in
> Documentation/. Please review for high-level content and let me know
> what you would like to see more or less of. I'll pick this back up
> tomorrow to incorporate any suggestions and flesh out the kernel details
> a bit more.
>
Darren,
A gentle review of your documentation follows.
> Futex Requeue PI
> ----------------
>
> Requeueing of tasks from a non-PI futex to a PI futex requires special
> handling in order to ensure the underlying rt_mutex is never left without an
> owner if it has waiters; doing so would break the PI boosting logic [see
> rt-mutex-desgin.txt] For the purposes of brevity, this action will be referred
> to as "requeue_pi" throughout this document. Priority inheritance is
> abbreviated throughout as "PI".
>
> Motivation
> ----------
>
> Without requeue_pi, the current implementation of pthread_cond_broadcast()
> must resort to waking all the tasks waiting on a pthread_condvar and letting
> them try and sort out which task gets to run first in classic thundering-herd
^try to sort out^
> formation. An ideal implementation would wake the highest priority waiter,
^highest-priority^
> and leave the rest to the natural wakeup inherent in unlocking the mutex
> associated with the condvar.
>
> Consider the simplified glibc calls:
>
> /* caller must lock mutex */
> pthread_cond_wait(cond, mutex)
> {
> lock(cond->__data.__lock);
> unlock(mutex);
> do {
> unlock(cond->__data.__lock);
> futex_wait(cond->__data.__futex);
> lock(cond->__data.__lock);
> } while(...)
> unlock(cond->__data.__lock);
> lock(mutex);
> }
>
> pthread_cond_broadcast(cond)
> {
> lock(cond->__data.__lock);
> unlock(cond->__data.__lock);
> futex_requeue(cond->data.__futex, cond->mutex);
> }
>
> Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
> Note that pthread_cond_wait() attempts to lock the mutex only after it has
> returned to userspace. This will leave the underlying rt_mutex with waiters,
> and no owner, breaking the previously mentioned PI boosting algorithms.
^PI-boosting^
>
> In order to support PI-aware pthread_condvar's, the kernel needs to be able to
> requeue tasks to PI futexes. This support implies that upon a successful
> futex_wait system call, the caller would return to userspace already holding
> the PI futex. As such the glibc implemenation would be modified as follows:
^implementation^
>
>
> /* caller must lock mutex */
> pthread_cond_wait_pi(cond, mutex)
> {
> lock(cond->__data.__lock);
> unlock(mutex);
> do {
> unlock(cond->__data.__lock);
> futex_wait_requeue_pi(cond->__data.__futex);
> lock(cond->__data.__lock);
> } while(...)
> unlock(cond->__data.__lock);
> /* the kernel acquired the the mutex for us */
> }
>
> pthread_cond_broadcast_pi(cond)
> {
> lock(cond->__data.__lock);
> unlock(cond->__data.__lock);
> futex_requeue_pi(cond->data.__futex, cond->mutex);
> }
>
> The actual glibc implemenation will likely test for PI and make the
^implementation^
> necessary changes inside the existing calls rather than creating new calls
> for the PI cases. Similar changes are needed for pthread_cond_timedwait()
> and pthread_cond_signal().
>
> Implementation
> --------------
>
> In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
> for the both the requeue code as well as the waiting code to be able to acquire
> the rt_mutex before returning to userspace (the requeue code cannot simply wake
I would suggest not using a parenthetical expression here because of the length
of the enclosed sentence. Also, you have a missing right parenthesis.
> the waiter and leave it to acquire the rt_mutex as it would open a race window
> between the requeue call returning to userspace and the waiter waking and
> starting to run. This is especially true in the uncontended case.
>
> To accomplish this, rt_mutex lock acquistion has to be able to be accomlished
^acquisition^ ^accomplished^
> vicariously by the requeue code on behalf of the waiter. This is accomplished
Lots of accomplishing going on here^^^. And one is missing a its p. Maybe
rewritten as:
To accomplish this, the rt_mutex lock must be acquired vicariously by the
requeue code on behalf of the waiter. This is done by two new rt_mutex
helper routines: rt_mutex_start_proxy_lock(), called by he requeue code;
and rt_mutex_finish_proxy_lock(), called by the waiter upon wakeup.
> via two new rt_mutex helper routines: rt_mutex_start_proxy_lock() (called by
> the requeue code) and rt_mutex_finish_proxy_lock(), called by the waiter upon
> wakeup.
I also changed the punctuation there to remove the parenthetical expression
modifying the first function, where only a comma was used on the second. Too
many commas and modifications made me think that maybe a semicolon would work
well.
>
> Two new system calls provide the kernel<->userspace interface to requeue_pi:
> FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI).
>
> FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
> pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
> to a PI aware futex. The implemenation is basically the result of a high speed
^PI-aware^ ^implementation^ ^high-speed^
> collision between futex_wait() and futex_lock_pi(), with some extra logic to
> check for the additional wake-up scenarios.
>
> FUTEX_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
> pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
> Internally, this system call is still handled by futex_requeue (by passing
> requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the
> requeue target PI futex on behalf of the top waiter, if it can, this waiter is
^. If^
> woken. It then proceeds to requeue the remaining nr_wake+nr_requeue tasks to
> the PI futex. It calls rt_mutex_start_proxy_lock() prior to each requeue to
> prepare the task as a waiter on the underlying rt_mutex. It is possible that
> the lock can be acquired at this stage as well, if so, the next waiter is woken
^. If^
> to finish the acquisition of the lock. FUTEX_REQUEUE_PI accepts nr_wake and
> nr_requeue as arguments, but their sum is all that really matters. It will
Lots of 'It' by this point. Maybe replacing 'it' with a proper noun a couple
of times would help make sure we are all on the same page.
> wake or requeue up to nr_wake + nr_requeue tasks. It will wake only as many
> tasks as it can acquire the lock for, which in the majority of cases should be
> 0 as good programming practice dictates that the caller of either
^0, as^
> pthread_cond_broadcast() or pthread_cond_signal() acquire the mutex prior to
> making the call. FUTEX_REQUEUE_PI requires that nr_wake=1. nr_requeue should
> be INT_MAX for broadcast and 0 for signal.
One final point is that generally numbers less than ten are spelled out, but
since this is a technical document where we see things like nr_wake=1, I figured
it would look more consistent to let those 0s slide.
Keep in mind that I am not really an editor so take my changes with a grain of
salt. (I am only *married* to an editor and if she gets wind that I am
pretending to be one on the side I will be in big trouble :)
--Vernon
next prev parent reply other threads:[~2009-04-07 21:33 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-04-03 20:39 [tip PATCH v7 0/9] RFC: futex: requeue pi implementation Darren Hart
2009-04-03 20:39 ` [tip PATCH v7 1/9] RFC: futex: futex_wait_queue_me() Darren Hart
2009-04-03 20:39 ` [tip PATCH v7 2/9] RFC: futex: futex_top_waiter() Darren Hart
2009-04-03 20:39 ` [tip PATCH v7 3/9] RFC: futex: futex_lock_pi_atomic() Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 4/9] RFC: futex: fixup_owner() Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 5/9] RFC: rt_mutex: add proxy lock routines Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 6/9] RFC: futex: Add FUTEX_HAS_TIMEOUT flag to restart.futex.flags Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 7/9] RFC: futex: Add requeue_futex() call Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 8/9] RFC: futex: add futex_wait_setup() Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 9/9] RFC: futex: add requeue_pi calls Darren Hart
2009-04-05 10:01 ` [tip PATCH v7 0/9] RFC: futex: requeue pi implementation Thomas Gleixner
2009-04-05 21:49 ` Darren Hart
2009-04-06 16:29 ` Darren Hart
2009-04-07 21:33 ` Vernon Mauery [this message]
2009-04-08 22:16 ` Darren Hart
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=49DBC68F.3080508@us.ibm.com \
--to=vernux@us.ibm.com \
--cc=dada1@cosmosbay.com \
--cc=dino@in.ibm.com \
--cc=drepper@redhat.com \
--cc=dvhltc@us.ibm.com \
--cc=jakub@redhat.com \
--cc=johnstul@us.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=niv@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=sripathik@in.ibm.com \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox