From mboxrd@z Thu Jan 1 00:00:00 1970 From: Darren Hart Subject: Re: Next round: revised futex(2) man page for review Date: Thu, 8 Oct 2015 15:36:07 +0100 Message-ID: <20151008143607.GB101005@vmdeb7> References: <55B61EF3.7080302@gmail.com> <55C5A787.9010806@gmail.com> <5614D836.7070506@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Content-Disposition: inline In-Reply-To: <5614D836.7070506-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: "Michael Kerrisk (man-pages)" Cc: Thomas Gleixner , Torvald Riegel , Carlos O'Donell , Ingo Molnar , Jakub Jelinek , linux-man , lkml , Davidlohr Bueso , Arnd Bergmann , Steven Rostedt , Peter Zijlstra , Linux API , Roland McGrath , Anton Blanchard , Eric Dumazet , bill o gallmeister , Jan Kiszka , Daniel Wagner , Rich Felker , Andy Lutomirski , bert hubert , Rusty Russell , Heinrich Schuchardt List-Id: linux-api@vger.kernel.org On Wed, Oct 07, 2015 at 09:30:46AM +0100, Michael Kerrisk (man-pages) w= rote: > Hello Thomas, >=20 > Thanks for the follow up! >=20 > Some open questions below are marked with the string ###. A couple of comments from me below, although I suspect you have this mu= ch covered already. >=20 > On 08/19/2015 04:17 PM, Thomas Gleixner wrote: > > On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote: > >>>> FUTEX_CMP_REQUEUE (since Linux 2.6.7) > >>>> This operation first checks whether the locati= on uaddr > >>>> still contains the value val3. If not, the o= peration > >>>> fails with the error EAGAIN. Otherwise, the o= peration > >>>> wakes up a maximum of val waiters that are waiting= on the > >>>> futex at uaddr. If there are more than val waite= rs, then > >>>> the remaining waiters are removed from the wait q= ueue of > >>>> the source futex at uaddr and added to the wait = queue of > >>>> the target futex at uaddr2. The val2 argument s= pecifies > >>>> an upper limit on the number of waiters that are = requeued > >>>> to the futex at uaddr2. > >>>> > >>>> .\" FIXME(Torvald) Is the following correct? Or is just the dec= ision > >>>> .\" which threads to wake or requeue part of the atomic operatio= n? > >>>> > >>>> The load from uaddr is an atomic memory access= (i.e., > >>>> using atomic machine instructions of the respectiv= e archi=E2=80=90 > >>>> tecture). This load, the comparison with val3, = and the > >>>> requeueing of any waiters are performed atomic= ally and > >>>> totally ordered with respect to other operations = on the > >>>> same futex word. > >>> > >>> It's atomic as the other atomic operations on the futex word. It'= s > >>> always performed with the proper lock(s) held in the kernel. That > >>> means any concurrent operation will serialize on that lock(s). Us= er > >>> space has to make sure, that depending on the observed value no > >>> concurrent operations happen, but that's something the kernel can= not > >>> control. > >> > >> ??? > >> Sorry, I'm not clear here. Is the current text correct then? Or is= some > >> change needed. > >=20 > > I think we need some change here because the meaning of atomic is > > unclear. The basic rules of futexes are: > >=20 > > - All modifying operations on the futex value have to be done with > > atomic instructions, usually cmpxchg. That applies to both kerne= l > > and user space. > >=20 > > That's the atomicity at the futex value level. > >=20 > > - In the kernel we have to create/modify/destroy state in order to > > provide the blocking/requeueing etc. > >=20 > > This state needs protection as well. So all operations related t= o > > the kernel internal state are serialized on the hash bucket > > locks. The hash buckets are a scalability mechanism to avoid > > contention on a single lock protecting all kernel internal > > state. For simplicity reasons you can just think of a global loc= k > > protecting all kernel internal state. > >=20 > > If the kernel creates/modifies state then it can be necessary to > > either reread the futex value or modify it. That happens under t= he > > locks as well. > >=20 > > So in the case of requeue, we take the proper locks and perform = the > > comparison with val3 and the requeueing with the locks held. > > =20 > > So that lock protection makes these operations 'atomic'. The > > correct expression is 'serialized'. >=20 > ### > So, here, i think I need some specific pointers on the precise text > changes that are required. Let's talk about this f2f. For convenience= , > here's the relevant text once again quoted: Not speaking for tglx, but I think the point here is to distinguish bet= ween atomic (as in cmpxchg comparison tests performed on the futex word) and serialized (as in the management of futex hashbuckets and task states). >=20 > FUTEX_CMP_REQUEUE (since Linux 2.6.7) > This operation first checks whether the location ua= ddr > still contains the value val3. If not, the operat= ion > fails with the error EAGAIN. Otherwise, the operat= ion Here you might explain the _CMP_ qualifier and note atomicity of the op= eration: The _CMP_ refers to the verification of the userspace state as specifie= d by through the arguments. This operation first atomically compares the val= ue at uaddr with the value val3 ... > wakes up a maximum of val waiters that are waiting on = the > futex at uaddr. If there are more than val waiters, t= hen > the remaining waiters are removed from the wait queue = of > the source futex at uaddr and added to the wait queue= of > the target futex at uaddr2. The val2 argument specif= ies > an upper limit on the number of waiters that are reque= ued > to the futex at uaddr2. And here, perhaps: These operations are serialized on locks protecting the internal datast= ructures storing the kernel futex state. I believe this paragraph is sufficient, but I'll comment on the second = paragraph below: >=20 > The load from uaddr is an atomic memory access (i.= e., > using atomic machine instructions of the respective arc= hi=E2=80=90 > tecture). This load, the comparison with val3, We are "atomic" up to here. ^ And what follows below is serialized by the hash bucket locks.: > and the > requeueing of any waiters are performed atomically = and > totally ordered with respect to other operations on = the > same futex word. I personally think this is too much implementation detail for the man p= age, especially for FUTEX_CMP_REQUEUE which does not manipulate the futex wo= rd value from kernel space (We're not discussing PI here). The first paragraph i= s sufficient to fully describe the guaranteed behavior of the opcode. >=20 >=20 > >>>> .\" FIXME We need some explanation in the following paragraph of= *why* > >>>> .\" it is important to note that "the kernel will update t= he > >>>> .\" futex word's value prior > >>>> It is important to note to returning to user space" . Can= someone > >>>> explain? that the kernel will update the futex word= 's value > >>>> prior to returning to user space. Unlike the other futex= opera=E2=80=90 > >>>> tions described above, the PI futex operations are desi= gned for > >>>> the implementation of very specific IPC mechanisms. > >>> > >>> If there are multiple waiters on a pi futex then a wake pi operat= ion > >>> will wake the first waiter and hand over the lock to this waiter.= This > >>> includes handing over the rtmutex which represents the futex in t= he > >>> kernel. The strict requirement is that the futex owner and the rt= mutex > >>> owner must be the same, except for the update period which is > >>> serialized by the futex internal locking. That means the kernel m= ust > >>> update the user space value prior to returning to user space. > >=20 > > And as noted above: It must update while holding the proper locks. >=20 > (Okay.) >=20 >=20 > >>>> .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Dar= ren Hart > >>>> .\" made the observation that "EINVAL is returned if the n= on-pi=20 > >>>> .\" to pi or op pairing semantics are violated." > >>>> .\" Probably there needs to be a general statement about t= his > >>>> .\" requirement, probably located at about this point in t= he page. > >>>> .\" Darren (or someone else), care to take a shot at this? > >>> > >>> Well, that's hard to describe because the kernel only has a limit= ed > >>> way of detecting such mismatches. It only can detect it when ther= e are > >>> non PI waiters on a futex and a PI function is called or vice ver= sa. > >> > >> Hmmm. Okay, I filed your comments away for reference, but > >> hopefully someone can help with some actual text. > >=20 > > I let Darren come up with something sensible :) >=20 > Okay, let's see if Darren has something to say. >=20 > >>>> .\" FIXME Somewhere on this page (I guess under the discussion o= f PI > >>>> .\" futexes) we need a discussion of the FUTEX_OWNER_DIED = bit. > >>>> .\" Can someone propose a text? > >>> > >>> If a futex has a rtmutex associated in the kernel, i.e. when ther= e are > >>> blocked waiters, and the owner of the futex/rtmutex dies unexpect= edly, > >>> then the kernel cleans up the rtmutex (as it holds a reference to= the > >>> dying task) and hands it over to the next waiter. That requires t= hat > >>> the user space value is updated accordingly. The kernel sets the > >>> FUTEX_OWNER_DIED in the user space value along with the TID of th= e new > >>> owner. User space is responsible for cleaning this up, though the= re > >>> are cases where the kernel does the cleanup. > >>> =20 > >>> The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, = where > >>> the kernel has no state associated. This happens via the robust f= utex > >>> mechanism. In that case the futex value will be set to > >>> FUTEX_OWNER_DIED. The robust futex mechanism is also available fo= r non > >>> PI futexes. > >> > >> ??? > >> So, I added part of that text to the page, as follows: > >> > >> If a futex has an associated RT-mutex in the kernel (i.e.,= there > >> are blocked waiters) and the owner of the futex/RT-mutex= dies > >> unexpectedly, then the kernel cleans up the RT-mutex and ha= nds it > >> over to the next waiter. This in turn requires that the = user- > >> space value is updated accordingly. To indicate that t= his is > >> required, the kernel sets the FUTEX_OWNER_DIED bit in the = futex > >> word along with the thread ID of the new owner. User sp= ace is > >> then responsible for cleaning this up (though there are = cases > >> where the kernel does the cleanup). > >> > >> Okay? > >> > >> I think the last sentence still requires a little work though. Wha= t does > >> user space need to do in terms of clean up? > >=20 > > User space has usually state as well. So the FUTEX_OWNER_DIED bit > > tells userspace that it needs to cleanup the stale state left over = by > > the dead owner. >=20 > ### > So I changed the last sentence to=20 >=20 > User space is then responsible for cleaning up the stale state le= ft > over by the dead owner. >=20 > Okay? We can't say much beyond it's userspace's problem now because how it's = handled will vary widely. It might be worth mentioning that userspace can use =46UTEX_OWNER_DIED to detect this state and handle it appropriately. >=20 > >>>> .\" FIXME In the next line, what type of "priority" are we talki= ng about? > >>>> .\" Realtime priorities for SCHED_FIFO and SCHED_RR? > >>>> .\" Or something else? > >>>> > >>>> The > >>>> enqueueing of the waiter is in descending priori= ty order > >>>> if more than one waiter exists. =20 > >>> > >>> That also covers sched deadline. > >> > >> ??? > >> Thanks. If the realm is restricted purely to SCHED_OTHER (SCHED_NO= RMAL) > >> processes, does the nice value come into play also? > >=20 > > No. SCHED_OTHER/NORMAL tasks are handled in FIFO order. >=20 > Okay. Thanks. >=20 > >> So by now, I've reworked this text to be: > >> > >> FUTEX_TRYLOCK_PI (since Linux 2.6.18) > >> This operation tries to acquire the futex at uaddr. = It is > >> invoked when a user-space atomic acquire did not s= ucceed > >> because the futex word was not 0. > >> > >> The trylock in kernel might succeed because the fute= x word > >> contains stale state (FUTEX_WAITERS = and/or > >> FUTEX_OWNER_DIED). This can happen when the owner = of the > >> futex died. User space cannot handle this condition= in a > >> race-free manner > >> > >> Okay? > >> > >> I must admit that I find "the trylock in kernel might succeed"hard > >> to understand. Could you elaborate a little? > >=20 > > If the user space value has stale state, then the kernel can fix th= at > > up and acquire the futex. >=20 >=20 > Okay, so I made the last sentence: >=20 > User space cannot handle this condition in a race-free manner, > but the kernel can fix this up and acquire the futex. >=20 > Okay? >=20 > However, I realize I should have said more clearly why I find the tex= t > hard to understand. As I read this text: >=20 > The trylock in kernel might succeed because the futex word > contains stale state FUTEX_WAITERS and/or FUTEX_OWNER_DIED). >=20 > ### >=20 > The words "The trylock in kernel might succeed" makes it feel > like a contrast is being drawn with some other scenario, but > it's not clear what that alternative/contrast is. Do you see > what I mean? Let's talk about this f2f. I presume you got this covered in Dublin: The trylock in the kernel has more state, so it can independently verif= y the flags that userspace must trust implicitly. >=20 > >> ??? > >> So now I've reworded the opening text describing FUTEX_WAIT_REQUEU= E_PI > >> as follows: > >> > >> FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31) > >> Wait on a non-PI futex at uaddr and potential= ly be > >> requeued (via a FUTEX_CMP_REQUEUE_PI operation in a= nother > >> task) onto a PI futex at uaddr2. The wait operati= on on > >> uaddr is the same as for FUTEX_WAIT. > >> > >> The waiter can be removed from the wait on uaddr w= ithout > >> requeueing on uaddr2 via a FUTEX_WAIT operation in a= nother > >=20 > > s/FUTEX_WAIT/FUTEX_WAKE/ >=20 > Thanks. Fixed. >=20 > >> task. In this case, the FUTEX_WAIT_REQUEUE_PI ope= ration > >> returns with the error EWOULDBLOCK. > >> > >> Okay? > >=20 > > Yes. >=20 > Thanks. >=20 > Cheers, >=20 > Michael >=20 >=20 >=20 > --=20 > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ >=20 --=20 Darren Hart Intel Open Source Technology Center