From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Darren Hart <dvhart@linux.intel.com>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: mtk.manpages@gmail.com, "Carlos O'Donell" <carlos@redhat.com>,
	Ingo Molnar <mingo@elte.hu>, Jakub Jelinek <jakub@redhat.com>,
	"linux-man@vger.kernel.org" <linux-man@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Linux API <linux-api@vger.kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: futex(2) man page update help request
Date: Sat, 17 Jan 2015 10:16:34 +0100
Message-ID: <54BA2872.5040003@gmail.com>
In-Reply-To: <D0DEF1AE.B7EDE%dvhart@linux.intel.com>

Hello Darren,

On 01/17/2015 02:33 AM, Darren Hart wrote:
> Corrected Davidlohr's email address.

Thanks!

> On 1/15/15, 7:12 AM, "Michael Kerrisk (man-pages)"
> <mtk.manpages@gmail.com> wrote:
> 
>> Hello Darren,
>>
>> I give you the same apology as to Thomas for the
>> long-delayed response to your mail.
>>
>> And I repeat my note to Thomas:
>> In the next day or two, I hope to send out the new version
>> of the futex(2) page for review. The new draft is a bit
>> bigger (okay -- 4 x bigger) than the current page. And there
>> are quite a number of FIXMEs that I've placed in the page
>> for various points--some minor, but a few major--that need
>> to be checked or fixed. Would you have some time to review
>> that page?
> 
> I'll make the time for that. I've wanted to see this for a while, so thank
> you for working on it!

Great!

>> In the meantime, I have a couple of questions, which, if
>> you could answer them, I would work some changes into the
>> page before sending.
>>
>> 1. In various places, distinction is made between non-PI
>>   futexes and PI futexes. But what determines that distinction?
>>   From the kernel's perspective, what makes a futex one type
>>   or another? I presume it is to do with the types of blocking
>>   waiters on the futex, but it would be good to have a formal
>>   definition.
> 
> You're right in that a uaddr is a uaddr is a uaddr. Also "there is no such
> thing as a futex", it doesn't exist as any kind of identifiable object, so
> these discussions can get rather confusing :-)

So, I want to make sure that I am clear on what you mean when you say this.
You say "there is no such thing as a futex" because from the kernel's
perspective there is no visible entity in the uncontended case
(where everything can be dealt with in user space). And from user-space,
in the uncontended case all we're doing is memory operations. Right?

On the other hand, from a kernel perspective, we could say that a 
futex "exists" in the contended phases, since the kernel has allocated
state associated with the uaddr. Right?
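
To make the uncontended/contended split concrete, here is a minimal
sketch of a non-PI lock built on a futex word. It assumes the common
three-state convention (0 = unlocked, 1 = locked, 2 = locked with
possible waiters); that convention is a user-space choice, not
something the kernel imposes. The names sketch_lock(), sketch_unlock(),
futex_call() and kernel_entries are hypothetical and exist only for
illustration; FUTEX_WAIT and FUTEX_WAKE are the real operations.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Counts futex(2) calls so the fast path can be observed (illustrative). */
static atomic_int kernel_entries;

/* Hypothetical wrapper: glibc provides no futex() function. */
static long futex_call(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    atomic_fetch_add(&kernel_entries, 1);
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Word states: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters. */
static void sketch_lock(_Atomic uint32_t *word)
{
    uint32_t c = 0;

    /* Uncontended case: a single atomic operation in user space. */
    if (atomic_compare_exchange_strong(word, &c, 1))
        return;

    /* Contended case: mark the word and block in the kernel. */
    if (c != 2)
        c = atomic_exchange(word, 2);
    while (c != 0) {
        futex_call(word, FUTEX_WAIT, 2);
        c = atomic_exchange(word, 2);
    }
}

static void sketch_unlock(_Atomic uint32_t *word)
{
    uint32_t prev = atomic_fetch_sub(word, 1);

    assert(prev != 0);              /* caller must hold the lock */
    if (prev != 1) {
        /* There may be waiters: release and wake one of them. */
        atomic_store(word, 0);
        futex_call(word, FUTEX_WAKE, 1);
    }
}
```

With a single thread, a lock/unlock pair never calls futex_call(), so
the kernel never learns that the word is being used as a lock.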

> A "futex" becomes a PI futex when it is "created" via a PI futex op code.

Precisely which PI op codes? Is it: FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, and
FUTEX_CMP_REQUEUE_PI, and not FUTEX_WAIT_REQUEUE_PI or FUTEX_UNLOCK_PI?

> At that point, the syscall will ensure a pi_state is populated for the
> futex_q entry. See futex_lock_pi() for example. Before the locks are
> taken, there is a call to refill_pi_state_cache() which preps a pi_state
> for assignment later in futex_lock_pi_atomic(). This pi_state provides the
> necessary linkage to perform the priority boosting in the event of a
> priority inversion. This is handled externally from the futexes via the
> rt_mutex construct.
> 
> Clear as mud?

Not quite that bad, but... The thing is, still, the man page has text
such as the following (based on your wording):

       FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
              This operation is a PI-aware variant of FUTEX_CMP_REQUEUE.
              It requeues waiters that are blocked via
              FUTEX_WAIT_REQUEUE_PI on uaddr from a non-PI source futex
              (uaddr) to a PI target futex (uaddr2).

And elsewhere you said

    EINVAL is returned if the non-pi to pi or 
    op pairing semantics are violated.

When someone in user-land (e.g., me) reads pieces like that, they then 
want to find somewhere in the man page a description of what makes a 
futex a *PI futex* and probably some statements of the distinction 
between PI and non-PI futexes. And those statements should be from a 
perspective that is somewhat comprehensible to user-space. I'm not
yet confident that I can do that. Do you care to take a shot at it?

>> 2. Can you say something about the pairing requirements of
>>   FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.
>>   What is the requirement and why do we need it?
> 
> Briefly, these op codes exist to support a fairly specific use case:
> support for PI aware pthread condvars (glibc patch acceptance STILL
> PENDING FOR LOVE OF EVERYTHING HOLY WHY?!?!?! 

Yes, Jan Kiszka recently alerted me to the existence of
https://sourceware.org/bugzilla/show_bug.cgi?id=11588
and I still have some text that you proposed quite a long time ago
(in a mail titled "Pthread Condition Variables and Priority Inversion")
for the pthread_cond_timedwait() page.
One day, when that page exists, I'll try to remember to add it.

> But is shipped with various
> PREEMPT_RT enabled Linux systems). Because these calls are paired, and more
> of the logic can happen on the kernel side (to preserve ownership of an
> rt_mutex with waiters), we pre-specify the target of the requeue in
> futex_wait_requeue_pi so that userspace and kernelspace remain in sync.
> This also limits the attack surface by only
> supporting exactly what it was meant to do. The corner cases get insane
> otherwise.

Thanks. I've added some text on pairing, based on your text above.

> We could walk through the various ways in which it would break if these
> pairing restrictions were not in place, but I'll have to take some serious
> time to page all those into working memory. Let me know if we need more
> detail here and I will.

I don't think we need that level of detail.

>> Most of the rest of this mail is just a checklist noting
>> what I did with your comments. No response is needed
>> in most cases, but there is one that I have marked with
>> "???". If you could reply to that, I'd be grateful.
> 
> ...
> 
>>> For all the PI opcodes, we should probably mention something about the
>>> futex value scheme (TID), whereas the other opcodes do not require any
>>> specific value scheme.
>>>
>>> No Owner:	0
>>> Owner:		TID
>>> Waiters:	TID | FUTEX_WAITERS
>>>
>>> This is the relevant section from the referenced paper:
>>>
>>> The PI futex operations diverge from the others in that they
>>> impose a policy describing how the futex value is to be used.
>>> If the lock is unowned, the futex value shall be 0. If owned, it
>>> shall be the thread id (tid) of the owning thread. If there are
>>> threads contending for the lock, then the FUTEX_WAITERS flag is
>>> set. With this policy in place, userspace can atomically acquire
>>> an unowned lock or release an uncontended lock using an atomic
>>> instruction and their own tid. A non-zero futex value will force
>>> waiters into the kernel to lock. The FUTEX_WAITERS flag forces
>>> the owner into the kernel to unlock. If the callers are forced
>>> into the kernel, they then deal directly with an underlying
>>> rt_mutex which implements the priority inheritance semantics.
>>> After the rt_mutex is acquired, the futex value is updated
>>> accordingly, before the calling thread returns to userspace.
>>>
>>> It is important to note that the kernel will update the futex value
>>> prior to returning to userspace. Unlike other futex op codes,
>>> FUTEX_CMP_REQUEUE_PI (and FUTEX_WAIT_REQUEUE_PI, FUTEX_LOCK_PI) are
>>> designed for the implementation of very specific IPC mechanisms.
>>
>> ??? Great text. May I presume that I can take this text
>> and freely adapt it for the man page? (Actually, this is a
>> request for forgiveness, rather than permission :-).)
> 
> Thanks, and no objection from me.

Thanks.
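
For reference, the value policy quoted above (0 = no owner, TID =
owner, TID | FUTEX_WAITERS = contended) can be sketched from the
user-space side roughly as follows. This is only an illustration under
stated assumptions, not text from the paper or from glibc: pi_lock(),
pi_unlock() and pi_futex_call() are hypothetical names, error handling
is omitted, and only FUTEX_LOCK_PI, FUTEX_UNLOCK_PI and FUTEX_TID_MASK
come from <linux/futex.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: glibc provides no futex() function. */
static long pi_futex_call(_Atomic uint32_t *uaddr, int op)
{
    /* val is unused by these two operations; NULL timeout = no timeout. */
    return syscall(SYS_futex, uaddr, op, 0, NULL, NULL, 0);
}

static int pi_lock(_Atomic uint32_t *uaddr)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    /* Unowned lock: acquire it atomically by storing our own TID. */
    if (atomic_compare_exchange_strong(uaddr, &expected, tid))
        return 0;

    /* Non-zero value: go to the kernel, which deals with the rt_mutex
       and updates the futex value before returning to user space. */
    return (int)pi_futex_call(uaddr, FUTEX_LOCK_PI);
}

static int pi_unlock(_Atomic uint32_t *uaddr)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);
    uint32_t expected = tid;

    /* Only the owner may unlock. */
    assert((atomic_load(uaddr) & FUTEX_TID_MASK) == tid);

    /* Uncontended: the value is exactly our TID, so reset it to 0. */
    if (atomic_compare_exchange_strong(uaddr, &expected, 0))
        return 0;

    /* FUTEX_WAITERS is set: the kernel must hand the lock over. */
    return (int)pi_futex_call(uaddr, FUTEX_UNLOCK_PI);
}
```

When there is no contention, both functions complete with a single
compare-and-swap and never enter the kernel.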

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
