From mboxrd@z Thu Jan 1 00:00:00 1970 From: Peter Zijlstra Subject: Re: Revised futex(2) man page for review Date: Sat, 28 Mar 2015 12:47:25 +0100 Message-ID: <20150328114725.GJ27490@worktop.programming.kicks-ass.net> References: <55166C01.7000803@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Content-Disposition: inline In-Reply-To: <55166C01.7000803-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: "Michael Kerrisk (man-pages)" Cc: Thomas Gleixner , Darren Hart , Carlos O'Donell , Ingo Molnar , Jakub Jelinek , "linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org" , lkml , Davidlohr Bueso , Arnd Bergmann , Steven Rostedt , Linux API , Torvald Riegel , Roland McGrath , Darren Hart , Anton Blanchard , Eric Dumazet , bill o gallmeister , Jan Kiszka , Daniel Wagner , Rich Felker , Andy Lutomirski , bert hubert , Rusty Russell , Heinrich List-Id: linux-man@vger.kernel.org On Sat, Mar 28, 2015 at 09:53:21AM +0100, Michael Kerrisk (man-pages) w= rote: > So, please take a look at the page below. At this point, > I would most especially appreciate help with the FIXMEs. =46or people who cannot read that troff gibberish (me).. --- =46UTEX(2) Linux Programmer's Manual = FUTEX(2) NAME futex - fast user-space locking SYNOPSIS #include #include int futex(int *uaddr, int futex_op, int val, const struct timespec *timeout, /* or: u32 val2 */ int *uaddr2, int val3); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The futex() system call provides a method for waiting until a = certain condition becomes true. It is typically used as a blocking co= nstruct in the context of shared-memory synchronization: The program imp= lements the majority of the synchronization in user space, and uses = one of operations of the system call when it is likely that it has t= o block for a longer time until the condition becomes true. The progra= m uses another operation of the system call to wake anyone waiting for= a par=E2=80=90 ticular condition. The condition is represented by the futex word, which is an addr= ess in memory supplied to the futex() system call, and the value at th= is mem=E2=80=90 ory location. (While the virtual addresses for the same memory = in sep=E2=80=90 arate processes may not be equal, the kernel maps them intern= ally so that the same memory mapped in different locations will correspo= nd for futex() calls.) When executing a futex operation that requests to block a thre= ad, the kernel will only block if the futex word has the value that the = calling thread supplied as expected value. The load from the futex wo= rd, the comparison with the expected value, and the actual blocking will= happen atomically and totally ordered with respect to concurrently ex= ecuting futex operations on the same futex word, such as operations tha= t wake threads blocked on this futex word. Thus, the futex word is = used to connect the synchronization in user spac with the implementat= ion of blocking by the kernel; similar to an atomic compare-and-exchang= e oper=E2=80=90 ation that potentially changes shared memory, blocking via a fu= tex is an atomic compare-and-block operation. See NOTES for a detailed= speci=E2=80=90 fication of the synchronization semantics. One example use of futexes is implementing locks. The state = of the lock (i.e., acquired or not acquired) can be represented as an= atomi=E2=80=90 cally accessed flag in shared memory. In the uncontended c= ase, a thread can access or modify the lock state with atomic instru= ctions, for example atomically changing it from not acquired to acquired= using an atomic compare-and-exchange instruction. If a thread cannot = acquire a lock because it is already acquired by another thread, it can = request to block if and only the lock is still acquired by using the= lock's flag as futex word and expecting a value that represents the a= cquired state. When releasing the lock, a thread has to first reset t= he lock state to not acquired and then execute the futex operation that= wakes one thread blocked on the futex word that is the lock's flag (t= his can be be further optimized to avoid unnecessary wake-ups). See f= utex(7) for more detail on how to use futexes. Besides the basic wait and wake-up futex functionality, there a= re fur=E2=80=90 ther futex operations aimed at supporting more complex use cases= =2E Also note that no explicit initialization or destruction are neces= sary to use futexes; the kernel maintains a futex (i.e., the kernel-i= nternal implementation artifact) only while operations such as FUTE= X_WAIT, described below, are being performed on a particular futex word. Arguments The uaddr argument points to the futex word. On all platforms, = futexes are four-byte integers that must be aligned on a four-byte bo= undary. The operation to perform on the futex is specified in the f= utex_op argument; val is a value whose meaning and purpose depends on fu= tex_op. The remaining arguments (timeout, uaddr2, and val3) are requir= ed only for certain of the futex operations described below. Where = one of these arguments is not required, it is ignored. For several blocking operations, the timeout argument is a point= er to a timespec structure that specifies a timeout for the operation.= How=E2=80=90 ever, notwithstanding the prototype shown above, for some oper= ations, this argument is instead a four-byte integer whose meaning is = deter=E2=80=90 mined by the operation. For these operations, the kernel ca= sts the timeout value to u32, and in the remainder of this page, this a= rgument is referred to as val2 when interpreted in this fashion. Where it is required, the uaddr2 argument is a pointer to a= second futex word that is employed by the operation. The interpretat= ion of the final integer argument, val3, depends on the operation. Futex operations The futex_op argument consists of two parts: a command that sp= ecifies the operation to be performed, bit-wise ORed with zero or o= r more options that modify the behaviour of the operation. The optio= ns that may be included in futex_op are as follows: FUTEX_PRIVATE_FLAG (since Linux 2.6.22) This option bit can be employed with all futex operation= s. It tells the kernel that the futex is process-private = and not shared with another process (i.e., it is only being us= ed for synchronization between threads of the same process)= =2E This allows the kernel to choose the fast path for validati= ng the user-space address and avoids expensive VMA lookups, taki= ng ref=E2=80=90 erence counts on file backing store, and so on. As a convenience, defines a set of co= nstants with the suffix _PRIVATE that are equivalents of all= of the operations listed below, but with the FUTEX_PRIVATE_FLA= G ORed into the constant value. Thus, there are FUTEX_WAIT_P= RIVATE, FUTEX_WAKE_PRIVATE, and so on. FUTEX_CLOCK_REALTIME (since Linux 2.6.28) This option bit can be employed only with the FUTEX_WAIT= _BITSET and FUTEX_WAIT_REQUEUE_PI operations. If this option is set, the kernel treats timeout as an a= bsolute time based on CLOCK_REALTIME. If this option is not set, the kernel treats timeout as r= elative time, measured against the CLOCK_MONOTONIC clock. The operation specified in futex_op is one of the following: FUTEX_WAIT (since Linux 2.6.0) This operation tests that the value at the futex word poi= nted to by the address uaddr still contains the expected value va= l, and if so, then sleeps awaiting FUTEX_WAKE on the futex wor= d. The load of the value of the futex word is an atomic memory = access (i.e., using atomic machine instructions of the res= pective architecture). This load, the comparison with the e= xpected value, and starting to sleep are performed atomica= lly and totally ordered with respect to other futex operations = on the same futex word. If the thread starts to sleep, it is = consid=E2=80=90 ered a waiter on this futex word. If the futex value do= es not match val, then the call fails immediately with th= e error EAGAIN. The purpose of the comparison with the expected value is = to pre=E2=80=90 vent lost wake-ups: If another thread changed the value= of the futex word after the calling thread decided to block ba= sed on the prior value, and if the other thread executed a FUT= EX_WAKE operation (or similar wake-up) after the value change and= before this FUTEX_WAIT operation, then the latter will obse= rve the value change and will not start to sleep. If the timeout argument is non-NULL, its contents specify= a rel=E2=80=90 ative timeout for the wait, measured according = to the CLOCK_MONOTONIC clock. (This interval will be rounded up= to the system clock granularity, and kernel scheduling delays me= an that the blocking interval may overrun by a small amount.) If= time=E2=80=90 out is NULL, the call blocks indefinitely. The arguments uaddr2 and val3 are ignored. FUTEX_WAKE (since Linux 2.6.0) This operation wakes at most val of the waiters that are = waiting (e.g., inside FUTEX_WAIT) on the futex word at the = address uaddr. Most commonly, val is specified as either 1 (wa= ke up a single waiter) or INT_MAX (wake up all waiters). No gu= arantee is provided about which waiters are awoken (e.g., a wait= er with a higher scheduling priority is not guaranteed to be awo= ken in preference to a waiter with a lower priority). The arguments timeout, uaddr2, and val3 are ignored. FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25) This operation creates a file descriptor that is associat= ed with the futex at uaddr. The caller must close the returne= d file descriptor after use. When another process or thread per= forms a FUTEX_WAKE on the futex word, the file descriptor indica= tes as being readable with select(2), poll(2), and epoll(7) The file descriptor can be used to obtain asynchronous no= tifica=E2=80=90 tions: if val is nonzero, then when another process or = thread executes a FUTEX_WAKE, the caller will receive the signal= number that was passed in val. The arguments timeout, uaddr2 and val3 are ignored. To prevent race conditions, the caller should test if the= futex has been upped after FUTEX_FD returns. Because it was inherently racy, FUTEX_FD has been remov= ed from Linux 2.6.26 onward. FUTEX_REQUEUE (since Linux 2.6.0) Avoid using this operation. It is broken for its intende= d pur=E2=80=90 pose. Use FUTEX_CMP_REQUEUE instead. This operation performs the same task as FUTEX_CMP_R= EQUEUE, except that no check is made using the value in val3.= (The argument val3 is ignored.) FUTEX_CMP_REQUEUE (since Linux 2.6.7) This operation first checks whether the location uadd= r still contains the value val3. If not, the operation fails wi= th the error EAGAIN. Otherwise, the operation wakes up a max= imum of val waiters that are waiting on the futex at uaddr. If= there are more than val waiters, then the remaining wait= ers are removed from the wait queue of the source futex at uad= dr and added to the wait queue of the target futex at uaddr2. T= he val2 argument specifies an upper limit on the number of waiter= s that are requeued to the futex at uaddr2. The load from uaddr is an atomic memory access (i.e.= , using atomic machine instructions of the respective archite= cture). This load, the comparison with val3, and the requeueing= of any waiters are performed atomically and totally ordere= d with respect to other operations on the same futex word. This operation was added as a replacement for the = earlier FUTEX_REQUEUE. The difference is that the check of the v= alue at uaddr can be used to ensure that requeueing only happen= s under certain conditions. Both operations can be used to a= void a "thundering herd" effect when FUTEX_WAKE is used and all= of the waiters that are woken need to acquire another futex. Typical values to specify for val are 0 or or 1. (Spe= cifying INT_MAX is not useful, because it would mak= e the FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.= ) The limit value specified via val2 is typically either 1 or I= NT_MAX. (Specifying the argument as 0 is not useful, because it= would make the FUTEX_CMP_REQUEUE operation equivalent to FUTEX_= WAIT.) FUTEX_WAKE_OP (since Linux 2.6.14) This operation was added to support some user-space us= e cases where more than one futex must be handled at the same tim= e. The most notable example is the implementation of pthread_co= nd_sig=E2=80=90 nal(3), which requires operations on two futexes, the on= e used to implement the mutex and the one used in the implementa= tion of the wait queue associated with the condition va= riable. FUTEX_WAKE_OP allows such cases to be implemented withou= t lead=E2=80=90 ing to high rates of contention and context switching. The FUTEX_WAIT_OP operation is equivalent to execute the = follow=E2=80=90 ing code atomically and totally ordered with respect t= o other futex operations on any of the two supplied futex words: int oldval =3D *(int *) uaddr2; *(int *) uaddr2 =3D oldval op oparg; futex(uaddr, FUTEX_WAKE, val, 0, 0, 0); if (oldval cmp cmparg) futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0); In other words, FUTEX_WAIT_OP does the following: * saves the original value of the futex word at uaddr2 a= nd per=E2=80=90 forms an operation to modify the value of the f= utex at uaddr2; this is an atomic read-modify-write memory = access (i.e., using atomic machine instructions of the res= pective architecture) * wakes up a maximum of val waiters on the futex for the= futex word at uaddr; and * dependent on the results of a test of the original v= alue of the futex word at uaddr2, wakes up a maximum of val2 = waiters on the futex for the futex word at uaddr2. The operation and comparison that are to be perfor= med are encoded in the bits of the argument val3. Pictoriall= y, the encoding is: +---+---+-----------+-----------+ |op |cmp| oparg | cmparg | +---+---+-----------+-----------+ 4 4 12 12 <=3D=3D # of bits Expressed in code, the encoding is: #define FUTEX_OP(op, oparg, cmp, cmparg) \ (((op & 0xf) << 28) | \ ((cmp & 0xf) << 24) | \ ((oparg & 0xfff) << 12) | \ (cmparg & 0xfff)) In the above, op and cmp are each one of the codes listed= below. The oparg and cmparg components are literal numeric = values, except as noted below. The op component has one of the following values: FUTEX_OP_SET 0 /* uaddr2 =3D oparg; */ FUTEX_OP_ADD 1 /* uaddr2 +=3D oparg; */ FUTEX_OP_OR 2 /* uaddr2 |=3D oparg; */ FUTEX_OP_ANDN 3 /* uaddr2 &=3D ~oparg; */ FUTEX_OP_XOR 4 /* uaddr2 ^=3D oparg; */ In addition, bit-wise ORing the following value into op= causes (1 << oparg) to be used as the operand: FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand= */ The cmp field is one of the following: FUTEX_OP_CMP_EQ 0 /* if (oldval =3D=3D cmparg) w= ake */ FUTEX_OP_CMP_NE 1 /* if (oldval !=3D cmparg) wak= e */ FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake *= / FUTEX_OP_CMP_LE 3 /* if (oldval <=3D cmparg) wak= e */ FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake *= / FUTEX_OP_CMP_GE 5 /* if (oldval >=3D cmparg) wak= e */ The return value of FUTEX_WAKE_OP is the sum of the num= ber of waiters woken on the futex uaddr plus the number of = waiters woken on the futex uaddr2. FUTEX_WAIT_BITSET (since Linux 2.6.25) This operation is like FUTEX_WAIT except that val3 is u= sed to provide a 32-bit bitset to the kernel. This bitset is st= ored in the kernel-internal state of the waiter. See the descrip= tion of FUTEX_WAKE_BITSET for further details. The FUTEX_WAIT_BITSET operation also interprets the = timeout argument differently from FUTEX_WAIT. See the discuss= ion of FUTEX_CLOCK_REALTIME, above. The uaddr2 argument is ignored. FUTEX_WAKE_BITSET (since Linux 2.6.25) This operation is the same as FUTEX_WAKE except that t= he val3 argument is used to provide a 32-bit bitset to the kernel= =2E This bitset is used to select which waiters should be woken u= p. The selection is done by a bit-wise AND of the "wake" bitset = (i.e., the value in val3) and the bitset which is stored in the = kernel- internal state of the waiter (the "wait" bitset that = is set using FUTEX_WAIT_BITSET). All of the waiters for wh= ich the result of the AND is nonzero are woken up; the remaining = waiters are left sleeping. The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSE= T is to allow selective wake-ups among multiple waiters that are = blocked on the same futex. Note, however, that using this bitset= multi=E2=80=90 plexing feature on a futex is less efficient than simply= using multiple futexes, because employing bitset multiplexing r= equires the kernel to check all waiters on a futex, including tho= se that are not interested in being woken up (i.e., they do not h= ave the relevant bit set in their "wait" bitset). The uaddr2 and timeout arguments are ignored. The FUTEX_WAIT and FUTEX_WAKE operations correspo= nd to FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET operations wh= ere the bitsets are all ones. Priority-inheritance futexes Linux supports priority-inheritance (PI) futexes in order to = handle priority-inversion problems that can be encountered with norma= l futex locks. Priority inversion is the problem that occurs when a hi= gh-pri=E2=80=90 ority task is blocked waiting to acquire a lock held by a low-p= riority task, while tasks at an intermediate priority continuously preem= pt the low-priority task from the CPU. Consequently, the low-priori= ty task makes no progress toward releasing the lock, and the high-priori= ty task remains blocked. Priority inheritance is a mechanism for dealing with the pr= iority- inversion problem. With this mechanism, when a high-priorit= y task becomes blocked by a lock held by a low-priority task, the l= atter's priority is temporarily raised to that of the former, so that it= is not preempted by any intermediate level tasks, and can thus make p= rogress toward releasing the lock. To be effective, priority inheritanc= e must be transitive, meaning that if a high-priority task blocks on= a lock held by a lower-priority task that is itself blocked by lock h= eld by another intermediate-priority task (and so on, for chains of ar= bitrary length), then both of those task (or more generally, all of the= tasks in a lock chain) have their priorities raised to be the same= as the high-priority task. From a user-space perspective, what makes a futex PI-aware is a = policy agreement between user space and the kernel about the value= of the futex word (described in a moment), coupled with the use of = the PI futex operations described below (in particular, FUTEX_L= OCK_PI, FUTEX_TRYLOCK_PI, and FUTEX_CMP_REQUEUE_PI). The PI futex operations described below differ from the other= futex operations in that they impose policy on the use of the value= of the futex word: * If the lock is not acquired, the futex word's value shall be = 0. * If the lock is acquired, the futex word's value shall be the = thread ID (TID; see gettid(2)) of the owning thread. * If the lock is owned and there are threads contending for th= e lock, then the FUTEX_WAITERS bit shall be set in the futex word's = value; in other words, this value is: FUTEX_WAITERS | TID Note that a PI futex word never just has the value FUTEX_WAITERS= , which is a permissible state for non-PI futexes. With this policy in place, a user-space application can acquire = a not- acquired lock or release a lock that no other threads try to = acquire using atomic instructions executed in user space (e.g., a compa= re-and- swap operation such as cmpxchg on the x86 architecture). Acqu= iring a lock simply consists of using compare-and-swap to atomically s= et the futex word's value to the caller's TID if its previous value= was 0. Releasing a lock requires using compare-and-swap to set the= futex word's value to 0 if the previous value was the expected TID. If a futex is already acquired (i.e., has a nonzero value), = waiters must employ the FUTEX_LOCK_PI operation to acquire the lock. If= other threads are waiting for the lock, then the FUTEX_WAITERS bit is= set in the futex value; in this case, the lock owner must empl= oy the FUTEX_UNLOCK_PI operation to release the lock. In the cases where callers are forced into the kernel (i.e., r= equired to perform a futex() operation), they then deal directly with = a so- called RT-mutex, a kernel locking mechanism which impleme= nts the required priority-inheritance semantics. After the RT-mut= ex is acquired, the futex value is updated accordingly, before the = calling thread returns to user space. It is important to note that the kernel will update the futex = word's value prior to returning to user space. Unlike the other futex= opera=E2=80=90 tions described above, the PI futex operations are designed f= or the implementation of very specific IPC mechanisms. PI futexes are operated on by specifying one of the following va= lues in futex_op: FUTEX_LOCK_PI (since Linux 2.6.18) This operation is used after after an attempt to acqui= re the lock via an atomic user-space instruction failed beca= use the futex word has a nonzero value=E2=80=94specifically, be= cause it con=E2=80=90 tained the namespace-specific TID of the lock owner. The operation checks the value of the futex word at the = address uaddr. If the value is 0, then the kernel tries to ato= mically set the futex value to the caller's TID. If that fails,= or the futex word's value is nonzero, the kernel atomically se= ts the FUTEX_WAITERS bit, which signals the futex owner that it= cannot unlock the futex in user space atomically by setting the= futex value to 0. After that, the kernel tries to find the= thread which is associated with the owner TID, creates or reuses= kernel state on behalf of the owner and attaches the waiter to i= t. The enqueueing of the waiter is in descending priority order = if more than one waiter exists. The owner inherits either the p= riority or the bandwidth of the waiter. This inheritance follo= ws the lock chain in the case of nested locking and performs d= eadlock detection. The timeout argument provides a timeout for the lock a= ttempt. It is interpreted as an absolute time, measured agai= nst the CLOCK_REALTIME clock. If timeout is NULL, the operatio= n will block indefinitely. The uaddr2, val, and val3 arguments are ignored. FUTEX_TRYLOCK_PI (since Linux 2.6.18) This operation tries to acquire the futex at uaddr. I= t deals with the situation where the TID value at uaddr is 0, b= ut the FUTEX_WAITERS bit is set. User space cannot handle this= condi=E2=80=90 tion in a race-free manner The uaddr2, val, timeout, and val3 arguments are ignored. FUTEX_UNLOCK_PI (since Linux 2.6.18) This operation wakes the top priority waiter that is wait= ing in FUTEX_LOCK_PI on the futex address provided by the uadd= r argu=E2=80=90 ment. This is called when the user space value at uaddr can= not be changed atomically from a TID (of the owner) to 0. The uaddr2, val, timeout, and val3 arguments are ignored. FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31) This operation is a PI-aware variant of FUTEX_CMP_REQUE= UE. It requeues waiters that are blocked via FUTEX_WAIT_REQUEUE= _PI on uaddr from a non-PI source futex (uaddr) to a PI targe= t futex (uaddr2). As with FUTEX_CMP_REQUEUE, this operation wakes up a maxi= mum of val waiters that are waiting on the futex at uaddr. H= owever, for FUTEX_CMP_REQUEUE_PI, val is required to be 1 (sin= ce the main point is to avoid a thundering herd). The remainin= g wait=E2=80=90 ers are removed from the wait queue of the source futex a= t uaddr and added to the wait queue of the target futex at uaddr2= =2E The val2 and val3 arguments serve the same purposes= as for FUTEX_CMP_REQUEUE. FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31) Wait operation to wait on a non-PI futex at uaddr and = poten=E2=80=90 tially be requeued onto a PI futex at uaddr2. The wait= opera=E2=80=90 tion on uaddr is the same as FUTEX_WAIT. The waiter = can be removed from the wait on uaddr via FUTEX_WAKE without req= ueueing on uaddr2. If timeout is not NULL, it specifies a timeout for th= e wait operation; this timeout is interpreted as outlined above= in the description of the FUTEX_CLOCK_REALTIME option. If time= out is NULL, the operation can block indefinitely. The val3 argument is ignored. The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI were a= dded to support a fairly specific use case: support for priority-= inheri=E2=80=90 tance-aware POSIX threads condition variables. The idea = is that these operations should always be paired, in order to = ensure that user space and the kernel remain in sync. Thus,= in the FUTEX_WAIT_REQUEUE_PI operation, the user-space applicati= on pre- specifies the target of the requeue that takes place= in the FUTEX_CMP_REQUEUE_PI operation. RETURN VALUE In the event of an error, all operations return -1 and set er= rno to indicate the cause of the error. The return value on success = depends on the operation, as described in the following list: FUTEX_WAIT Returns 0 if the caller was woken up. Note that a wake-= up can also be caused by common futex usage patterns in unrelat= ed code that happened to have previously used the futex word's = memory location (e.g., typical futex-based implementations of P= threads mutexes can cause this under some conditions). Therefore= , call=E2=80=90 ers should always conservatively assume that a return val= ue of 0 can mean a spurious wake-up, and use the futex word's= value (i.e., the user space synchronization scheme) to decide whether to continue to block or not. FUTEX_WAKE Returns the number of waiters that were woken up. FUTEX_FD Returns the new file descriptor associated with the futex= =2E FUTEX_REQUEUE Returns the number of waiters that were woken up. FUTEX_CMP_REQUEUE Returns the total number of waiters that were woke= n up or requeued to the futex for the futex word at uaddr2. I= f this value is greater than val, then difference is the nu= mber of waiters requeued to the futex for the futex word at uaddr= 2. FUTEX_WAKE_OP Returns the total number of waiters that were woken up. = This is the sum of the woken waiters on the two futexes for th= e futex words at uaddr and uaddr2. FUTEX_WAIT_BITSET Returns 0 if the caller was woken up. See FUTEX_WAIT for= how to interpret this correctly in practice. FUTEX_WAKE_BITSET Returns the number of waiters that were woken up. FUTEX_LOCK_PI Returns 0 if the futex was successfully locked. FUTEX_TRYLOCK_PI Returns 0 if the futex was successfully locked. FUTEX_UNLOCK_PI Returns 0 if the futex was successfully unlocked. FUTEX_CMP_REQUEUE_PI Returns the total number of waiters that were woke= n up or requeued to the futex for the futex word at uaddr2. I= f this value is greater than val, then difference is the nu= mber of waiters requeued to the futex for the futex word at uaddr= 2. FUTEX_WAIT_REQUEUE_PI Returns 0 if the caller was successfully requeued to the= futex for the futex word at uaddr2. ERRORS EACCES No read access to the memory of a futex word. EAGAIN (FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI) Th= e value pointed to by uaddr was not equal to the expected value = val at the time of the call. Note: on Linux, the symbolic names EAGAIN and EWOULDBLOC= K (both of which appear in different parts of the kernel futex= code) have the same value. EAGAIN (FUTEX_CMP_REQUEUE, FUTEX_CMP_REQUEUE_PI) The value poi= nted to by uaddr is not equal to the expected value val3. (This = proba=E2=80=90 bly indicates a race; use the safe FUTEX_WAKE now.) EAGAIN (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_P= I) The futex owner thread ID of uaddr (for FUTEX_CMP_REQU= EUE_PI: uaddr2) is about to exit, but has not yet handled the i= nternal state cleanup. Try again. EDEADLK (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI= ) The futex word at uaddr is already locked by the caller. EDEADLK (FUTEX_CMP_REQUEUE_PI) While requeueing a waiter to the P= I futex for the futex word at uaddr2, the kernel detected a deadl= ock. EFAULT A required pointer argument (i.e., uaddr, uaddr2, or t= imeout) did not point to a valid user-space address. EINTR A FUTEX_WAIT or FUTEX_WAIT_BITSET operation was interrupt= ed by a signal (see signal(7)). In kernels before Linux 2.6.22= , this error could also be returned for on a spurious wakeup= ; since Linux 2.6.22, this no longer happens. EINVAL The operation in futex_op is one of those that employs a= time=E2=80=90 out, but the supplied timeout argument was invalid (tv_= sec was less than zero, or tv_nsec was not less than 1000,000,000= ). EINVAL The operation specified in futex_op employs one or both = of the pointers uaddr and uaddr2, but one of these does not poi= nt to a valid object=E2=80=94that is, the address is not four-byt= e-aligned. EINVAL (FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET) The bitset suppl= ied in val3 is zero. EINVAL (FUTEX_CMP_REQUEUE_PI) uaddr equals uaddr2 (i.e., an atte= mpt was made to requeue to the same futex). EINVAL (FUTEX_FD) The signal number supplied in val is invalid. EINVAL (FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET, FUTEX_R= EQUEUE, FUTEX_CMP_REQUEUE) The kernel detected an inconsistency = between the user-space state at uaddr and the kernel state=E2=80=94= that is, it detected a waiter which waits in FUTEX_LOCK_PI on uaddr. EINVAL (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_UNLOCK_PI) The= kernel detected an inconsistency between the user-space state at= uaddr and the kernel state. This indicates either state corrup= tion or that the kernel found a waiter on uaddr which is waiti= ng via FUTEX_WAIT or FUTEX_WAIT_BITSET. EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an incons= istency between the user-space state at uaddr2 and the kernel = state; that is, the kernel detected a waiter which waits via FUT= EX_WAIT on uaddr2. EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an incons= istency between the user-space state at uaddr and the kernel stat= e; that is, the kernel detected a waiter which waits via FUTEX_W= AIT or FUTEX_WAIT_BITESET on uaddr. EINVAL (FUTEX_CMP_REQUEUE_PI) The kernel detected an incons= istency between the user-space state at uaddr and the kernel stat= e; that is, the kernel detected a waiter which waits on ua= ddr via FUTEX_LOCK_PI (instead of FUTEX_WAIT_REQUEUE_PI). EINVAL (FUTEX_CMP_REQUEUE_PI) An attempt was made to requeue a = waiter to a futex other than that specified by the m= atching FUTEX_WAIT_REQUEUE_PI call for that waiter. EINVAL (FUTEX_CMP_REQUEUE_PI) The val argument is not 1. EINVAL Invalid argument. ENOMEM (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) T= he ker=E2=80=90 nel could not allocate memory to hold state information. ENFILE (FUTEX_FD) The system limit on the total number of ope= n files has been reached. ENOSYS Invalid operation specified in futex_op. ENOSYS The FUTEX_CLOCK_REALTIME option was specified in futex_o= p, but the accompanying operation was neither FUTEX_WAIT_BIT= SET nor FUTEX_WAIT_REQUEUE_PI. ENOSYS (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_UNL= OCK_PI, FUTEX_CMP_REQUEUE_PI, FUTEX_WAIT_REQUEUE_PI) A run-tim= e check determined that the operation is not available. The PI= futex operations are not implemented on all architectures and = are not supported on some CPU variants. EPERM (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI= ) The caller is not allowed to attach itself to the futex a= t uaddr (for FUTEX_CMP_REQUEUE_PI: the futex at uaddr2). (This = may be caused by a state corruption in user space.) EPERM (FUTEX_UNLOCK_PI) The caller does not own the lock repr= esented by the futex word. ESRCH (FUTEX_LOCK_PI, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI= ) The thread ID in the futex word at uaddr does not exist. ESRCH (FUTEX_CMP_REQUEUE_PI) The thread ID in the futex word at= uaddr2 does not exist. ETIMEDOUT The operation in futex_op employed the timeout specif= ied in timeout, and the timeout expired before the operation com= pleted. VERSIONS Futexes were first made available in a stable kernel release wit= h Linux 2.6.0. Initial futex support was merged in Linux 2.5.7 but with di= fferent semantics from what was described above. A four-argument syst= em call with the semantics described in this page was introduced in= Linux 2.5.40. In Linux 2.5.70, one argument was added. In Linux 2= =2E6.7, a sixth argument was added=E2=80=94messy, especially on the s390 a= rchitecture. CONFORMING TO This system call is Linux-specific. NOTES Glibc does not provide a wrapper for this system call; call it= using syscall(2). EXAMPLE The program below demonstrates use of futexes in a program where= parent and child use a pair of futexes located inside a shared anonymou= s map=E2=80=90 ping to synchronize access to a shared resource: the terminal. = The two processes each write nloops (a command-line argument that defaul= ts to 5 if omitted) messages to the terminal and employ a synchronizati= on pro=E2=80=90 tocol that ensures that they alternate in writing messages. Upo= n run=E2=80=90 ning this program we see output such as the following: $ ./futex_demo Parent (18534) 0 Child (18535) 0 Parent (18534) 1 Child (18535) 1 Parent (18534) 2 Child (18535) 2 Parent (18534) 3 Child (18535) 3 Parent (18534) 4 Child (18535) 4 Program source /* futex_demo.c Usage: futex_demo [nloops] (Default: 5) Demonstrate the use of futexes in a program where parent and = child use a pair of futexes located inside a shared anonymous mappi= ng to synchronize access to a shared resource: the terminal. The tw= o processes each write 'num-loops' messages to the terminal and= employ a synchronization protocol that ensures that they alternate i= n writing messages. */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ } while (0) static int *futex1, *futex2, *iaddr; static int futex(int *uaddr, int futex_op, int val, const struct timespec *timeout, int *uaddr2, int val3) { return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr, val3); } /* Acquire the futex pointed to by 'futexp': wait for its value = to become 1, and then set the value to 0. */ static void fwait(int *futexp) { int s; /* __sync_bool_compare_and_swap(ptr, oldval, newval) is a gc= c built-in function. It atomically performs the equivalent= of: if (*ptr =3D=3D oldval) *ptr =3D newval; It returns true if the test yielded true and *ptr was upd= ated. The alternative here would be to employ the equivalent at= omic machine-language instructions. For further information, = see the GCC Manual. */ while (1) { /* Is the futex available? */ if (__sync_bool_compare_and_swap(futexp, 1, 0)) break; /* Yes */ /* Futex is not available; wait */ s =3D futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0); if (s =3D=3D -1 && errno !=3D EAGAIN) errExit("futex-FUTEX_WAIT"); } } /* Release the futex pointed to by 'futexp': if the futex curren= tly has the value 0, set its value to 1 and the wake any futex wa= iters, so that if the peer is blocked in fpost(), it can proceed. */ static void fpost(int *futexp) { int s; /* __sync_bool_compare_and_swap() was described in comments = above */ if (__sync_bool_compare_and_swap(futexp, 0, 1)) { s =3D futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0); if (s =3D=3D -1) errExit("futex-FUTEX_WAKE"); } } int main(int argc, char *argv[]) { pid_t childPid; int j, nloops; setbuf(stdout, NULL); nloops =3D (argc > 1) ? atoi(argv[1]) : 5; /* Create a shared anonymous mapping that will hold the fute= xes. Since the futexes are being shared between processes, we subsequently use the "shared" futex operations (i.e., not= the ones suffixed "_PRIVATE") */ iaddr =3D mmap(NULL, sizeof(int) * 2, PROT_READ | PROT_WRITE= , MAP_ANONYMOUS | MAP_SHARED, -1, 0); if (iaddr =3D=3D MAP_FAILED) errExit("mmap"); futex1 =3D &iaddr[0]; futex2 =3D &iaddr[1]; *futex1 =3D 0; /* State: unavailable */ *futex2 =3D 1; /* State: available */ /* Create a child process that inherits the shared anonymous mapping */ childPid =3D fork(); if (childPid =3D=3D -1) errExit("fork"); if (childPid =3D=3D 0) { /* Child */ for (j =3D 0; j < nloops; j++) { fwait(futex1); printf("Child (%ld) %d\n", (long) getpid(), j); fpost(futex2); } exit(EXIT_SUCCESS); } /* Parent falls through to here */ for (j =3D 0; j < nloops; j++) { fwait(futex2); printf("Parent (%ld) %d\n", (long) getpid(), j); fpost(futex1); } wait(NULL); exit(EXIT_SUCCESS); } SEE ALSO get_robust_list(2), restart_syscall(2), futex(7) The following kernel source files: * Documentation/pi-futex.txt * Documentation/futex-requeue-pi.txt * Documentation/locking/rt-mutex.txt * Documentation/locking/rt-mutex-design.txt * Documentation/robust-futex-ABI.txt Franke, H., Russell, R., and Kirwood, M., 2002. Fuss, Futexes a= nd Fur=E2=80=90 wocks: Fast Userlevel Locking in Linux (from proceedings of the = Ottawa Linux Symposium 2002), =E2=9F=A8http://kernel.org/doc/ols/2002/ols2002-pages-479-495.pd= f=E2=9F=A9 Hart, D., 2009. A futex overview and update, =E2=9F=A8http://lwn.net/Articles/360699/=E2=9F=A9 Hart, D. and Guniguntala, D., 2009. Requeue-PI: Making Glibc Co= ndvars PI-Aware (from proceedings of the 2009 Real-Time Linux Workshop)= , =E2=9F=A8http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf=E2= =9F=A9 Drepper, U., 2011. Futexes Are Tricky, =E2=9F=A8http://www.akkadia.org/drepper/futex.pdf=E2=9F=A9 Futex example library, futex-*.tar.bz2 at =E2=9F=A8ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/=E2=9F= =A9 Linux 2014-05-21 F= UTEX(2)