public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Vernon Mauery <vernux@us.ibm.com>
To: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel@vger.kernel.org,
	Sripathi Kodi <sripathik@in.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	John Stultz <johnstul@us.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Dinakar Guniguntala <dino@in.ibm.com>,
	Ulrich Drepper <drepper@redhat.com>,
	Eric Dumazet <dada1@cosmosbay.com>, Ingo Molnar <mingo@elte.hu>,
	Jakub Jelinek <jakub@redhat.com>,
	niv@linux.vnet.ibm.com
Subject: Re: [tip PATCH v7 0/9] RFC: futex: requeue pi implementation
Date: Tue, 07 Apr 2009 14:33:03 -0700	[thread overview]
Message-ID: <49DBC68F.3080508@us.ibm.com> (raw)
In-Reply-To: <49DA2DFA.5080205@us.ibm.com>

Darren Hart wrote:
> Thomas Gleixner wrote:
>> Darren,
>>
>> On Fri, 3 Apr 2009, Darren Hart wrote:
[snip]
>> Darren, could you please polish the initial design notes - especially
>> point out the subtle differences between requeue and requeue_pi - and
>> send them into the thread? That might help Uli and Jakub and we
>> definitly want to have that info preserved in Documentation/ as well.
> 
> I'll be away most of the day so I've whipped something up quickly in
> order to get discussion started.  It is probably a little heavy on the
> glibc details and a little light on the kernel details for inclusion in
> Documentation/.  Please review for high-level content and let me know
> what you would like to see more or less of.  I'll pick this back up
> tomorrow to incorporate any suggestions and flesh out the kernel details
> a bit more.
> 

Darren,

A gentle review of your documentation follows.

> Futex Requeue PI
> ----------------
> 
> Requeueing of tasks from a non-PI futex to a PI futex requires special
> handling in order to ensure the underlying rt_mutex is never left without an
> owner if it has waiters; doing so would break the PI boosting logic [see
> rt-mutex-desgin.txt]  For the purposes of brevity, this action will be referred
> to as "requeue_pi" throughout this document.  Priority inheritance is
> abbreviated throughout as "PI".
> 
> Motivation
> ----------
> 
> Without requeue_pi, the current implementation of pthread_cond_broadcast()
> must resort to waking all the tasks waiting on a pthread_condvar and letting
> them try and sort out which task gets to run first in classic thundering-herd
       ^try to sort out^
> formation.  An ideal implementation would wake the highest priority waiter,
                                                     ^highest-priority^
> and leave the rest to the natural wakeup inherent in unlocking the mutex
> associated with the condvar.
> 
> Consider the simplified glibc calls:
> 
> /* caller must lock mutex */
> pthread_cond_wait(cond, mutex)
> {
>     lock(cond->__data.__lock);
>     unlock(mutex);
>     do {
>        unlock(cond->__data.__lock);
>        futex_wait(cond->__data.__futex);
>        lock(cond->__data.__lock);
>     } while(...)
>     unlock(cond->__data.__lock);
>     lock(mutex);
> }
> 
> pthread_cond_broadcast(cond)
> {
>     lock(cond->__data.__lock);
>     unlock(cond->__data.__lock);
>     futex_requeue(cond->data.__futex, cond->mutex);
> }
> 
> Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
> Note that pthread_cond_wait() attempts to lock the mutex only after it has
> returned to userspace.  This will leave the underlying rt_mutex with waiters,
> and no owner, breaking the previously mentioned PI boosting algorithms.
                                                  ^PI-boosting^
> 
> In order to support PI-aware pthread_condvar's, the kernel needs to be able to
> requeue tasks to PI futexes.  This support implies that upon a successful
> futex_wait system call, the caller would return to userspace already holding
> the PI futex.  As such the glibc implemenation would be modified as follows:
                                   ^implementation^
> 
> 
> /* caller must lock mutex */
> pthread_cond_wait_pi(cond, mutex)
> {
>     lock(cond->__data.__lock);
>     unlock(mutex);
>     do {
>        unlock(cond->__data.__lock);
>        futex_wait_requeue_pi(cond->__data.__futex);
>        lock(cond->__data.__lock);
>     } while(...)
>     unlock(cond->__data.__lock);
>        /* the kernel acquired the the mutex for us */
> }
> 
> pthread_cond_broadcast_pi(cond)
> {
>     lock(cond->__data.__lock);
>     unlock(cond->__data.__lock);
>     futex_requeue_pi(cond->data.__futex, cond->mutex);
> }
> 
> The actual glibc implemenation will likely test for PI and make the
                   ^implementation^

> necessary changes inside the existing calls rather than creating new calls
> for the PI cases.  Similar changes are needed for pthread_cond_timedwait()
> and pthread_cond_signal().
> 
> Implementation
> --------------
> 
> In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
> for the both the requeue code as well as the waiting code to be able to acquire
> the rt_mutex before returning to userspace (the requeue code cannot simply wake

I would suggest not using a parenthetical expression here because of the length
of the enclosed sentence.  Also, you have a missing right parenthesis.

> the waiter and leave it to acquire the rt_mutex as it would open a race window
> between the requeue call returning to userspace and the waiter waking and
> starting to run.  This is especially true in the uncontended case.
> 
> To accomplish this, rt_mutex lock acquistion has to be able to be accomlished
                                    ^acquisition^                   ^accomplished^
> vicariously by the requeue code on behalf of the waiter.  This is accomplished

Lots of accomplishing going on here^^^.  And one is missing a its p.  Maybe
rewritten as:

To accomplish this, the rt_mutex lock must be acquired vicariously by the
requeue code on behalf of the waiter.  This is done by two new rt_mutex
helper routines: rt_mutex_start_proxy_lock(), called by he requeue code;
and rt_mutex_finish_proxy_lock(), called by the waiter upon wakeup.

> via two new rt_mutex helper routines: rt_mutex_start_proxy_lock() (called by
 > the requeue code) and rt_mutex_finish_proxy_lock(), called by the waiter upon
> wakeup.

I also changed the punctuation there to remove the parenthetical expression
modifying the first function, where only a comma was used on the second.  Too
many commas and modifications made me think that maybe a semicolon would work
well.

> 
> Two new system calls provide the kernel<->userspace interface to requeue_pi:
> FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI).
> 
> FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
> pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
> to a PI aware futex.  The implemenation is basically the result of a high speed
       ^PI-aware^           ^implementation^                           ^high-speed^
> collision between futex_wait() and futex_lock_pi(), with some extra logic to
> check for the additional wake-up scenarios.
> 
> FUTEX_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
> pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
> Internally, this system call is still handled by futex_requeue (by passing
> requeue_pi=1).  Before requeueing, futex_requeue() attempts to acquire the
> requeue target PI futex on behalf of the top waiter, if it can, this waiter is
                                                     ^.  If^
> woken.  It then proceeds to requeue the remaining nr_wake+nr_requeue tasks to
> the PI futex.  It calls rt_mutex_start_proxy_lock() prior to each requeue to
> prepare the task as a waiter on the underlying rt_mutex.  It is possible that
> the lock can be acquired at this stage as well, if so, the next waiter is woken
                                                ^.  If^
> to finish the acquisition of the lock.  FUTEX_REQUEUE_PI accepts nr_wake and
> nr_requeue as arguments, but their sum is all that really matters.  It will

Lots of 'It' by this point.  Maybe replacing 'it' with a proper noun a couple
of times would help make sure we are all on the same page.

> wake or requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
> tasks as it can acquire the lock for, which in the majority of cases should be
> 0 as good programming practice dictates that the caller of either
  ^0, as^
> pthread_cond_broadcast() or pthread_cond_signal() acquire the mutex prior to
> making the call.  FUTEX_REQUEUE_PI requires that nr_wake=1.  nr_requeue should
> be INT_MAX for broadcast and 0 for signal.

One final point is that generally numbers less than ten are spelled out, but
since this is a technical document where we see things like nr_wake=1, I figured
it would look more consistent to let those 0s slide.

Keep in mind that I am not really an editor so take my changes with a grain of
salt. (I am only *married* to an editor and if she gets wind that I am
pretending to be one on the side I will be in big trouble :)

--Vernon

  reply	other threads:[~2009-04-07 21:33 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-03 20:39 [tip PATCH v7 0/9] RFC: futex: requeue pi implementation Darren Hart
2009-04-03 20:39 ` [tip PATCH v7 1/9] RFC: futex: futex_wait_queue_me() Darren Hart
2009-04-03 20:39 ` [tip PATCH v7 2/9] RFC: futex: futex_top_waiter() Darren Hart
2009-04-03 20:39 ` [tip PATCH v7 3/9] RFC: futex: futex_lock_pi_atomic() Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 4/9] RFC: futex: fixup_owner() Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 5/9] RFC: rt_mutex: add proxy lock routines Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 6/9] RFC: futex: Add FUTEX_HAS_TIMEOUT flag to restart.futex.flags Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 7/9] RFC: futex: Add requeue_futex() call Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 8/9] RFC: futex: add futex_wait_setup() Darren Hart
2009-04-03 20:40 ` [tip PATCH v7 9/9] RFC: futex: add requeue_pi calls Darren Hart
2009-04-05 10:01 ` [tip PATCH v7 0/9] RFC: futex: requeue pi implementation Thomas Gleixner
2009-04-05 21:49   ` Darren Hart
2009-04-06 16:29   ` Darren Hart
2009-04-07 21:33     ` Vernon Mauery [this message]
2009-04-08 22:16       ` Darren Hart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49DBC68F.3080508@us.ibm.com \
    --to=vernux@us.ibm.com \
    --cc=dada1@cosmosbay.com \
    --cc=dino@in.ibm.com \
    --cc=drepper@redhat.com \
    --cc=dvhltc@us.ibm.com \
    --cc=jakub@redhat.com \
    --cc=johnstul@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=niv@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=sripathik@in.ibm.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox