linux-rt-devel.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
@ 2025-09-08 16:33 Sebastian Andrzej Siewior
  2025-09-08 17:46 ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-08 16:33 UTC (permalink / raw)
  To: libc-alpha; +Cc: John Ogness, linux-rt-devel, Thomas Gleixner

The pthread_cond_t structure provides an internal lock using the lower two bits
of __g1_orig_size. This lock is always used by the signaller. The waiter uses
it only in certain corner cases such as when canceling a wait due to a timeout.

Under contention, a thread attempting to acquire the lock invokes the futex
operation FUTEX_WAIT while waiting for the lock to become available. This
behavior can be problematic in real-time environments because the priority of
the waiting task is not considered. As a result, a lower-priority task may
become runnable while the lock owner remains preempted.

To address this, repurpose the unused member __unused_initialized_1 as a
priority-inheritance (PI) futex named __pi_lock. This is the futex word, not a
pthread_mutex_t.

__condvar_acquire_lock() performs an atomic transition from 0 to TID. If it
fails, it falls back to __futex_lock_pi64(), which acquires the lock via
FUTEX_LOCK_PI.
__condvar_release_lock() performs an atomic transition from TID to 0. If it
fails, it falls back to futex_unlock_pi(), which releases the lock and
wakes a potential waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

This is a follow-up to
	https://lore.kernel.org/all/20250829145659.KhM4kIoQ@linutronix.de/

This change expects that PI-FUTEX is supported by the kernel. This is
the case since a long time but it is possible to disable the FUTEX
subsystem or the FUTEX_PI part of it while building the kernel.
Is it okay to assume that PI-FUTEX is available or should there be a
check somewhere during startup and in case of -ENOSYS a fallback to
current implementation?

 nptl/pthread_cond_common.c              | 58 ++++++++++---------------
 sysdeps/nptl/bits/thread-shared-types.h |  2 +-
 2 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/nptl/pthread_cond_common.c b/nptl/pthread_cond_common.c
index 2708d262958f2..74010787b18de 100644
--- a/nptl/pthread_cond_common.c
+++ b/nptl/pthread_cond_common.c
@@ -106,41 +106,23 @@ __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
 #endif /* !__HAVE_64B_ATOMICS */
 
 /* The lock that signalers use.  See pthread_cond_wait_common for uses.
-   The lock is our normal three-state lock: not acquired (0) / acquired (1) /
-   acquired-with-futex_wake-request (2).  However, we need to preserve the
-   other bits in the unsigned int used for the lock, and therefore it is a
-   little more complex.  */
+   The lock is the futex PI lock (PTHREAD_MUTEX_PI_NORMAL_NP) using just the
+   futex word. */
 static void __attribute__ ((unused))
 __condvar_acquire_lock (pthread_cond_t *cond, int private)
 {
-  unsigned int s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
-  while ((s & 3) == 0)
+  int oldval, newval;
+
+  newval = THREAD_GETMEM (THREAD_SELF, tid);
+  oldval = atomic_compare_and_exchange_val_acq (&cond->__data.__pi_lock,
+                                                newval, 0);
+  if (oldval != 0)
     {
-      if (atomic_compare_exchange_weak_acquire (&cond->__data.__g1_orig_size,
-	  &s, s | 1))
-	return;
-      /* TODO Spinning and back-off.  */
-    }
-  /* We can't change from not acquired to acquired, so try to change to
-     acquired-with-futex-wake-request and do a futex wait if we cannot change
-     from not acquired.  */
-  while (1)
-    {
-      while ((s & 3) != 2)
-	{
-	  if (atomic_compare_exchange_weak_acquire
-	      (&cond->__data.__g1_orig_size, &s, (s & ~(unsigned int) 3) | 2))
-	    {
-	      if ((s & 3) == 0)
-		return;
-	      break;
-	    }
-	  /* TODO Back off.  */
-	}
-      futex_wait_simple (&cond->__data.__g1_orig_size,
-	  (s & ~(unsigned int) 3) | 2, private);
-      /* Reload so we see a recent value.  */
-      s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
+      int e;
+
+      e = __futex_lock_pi64 (&cond->__data.__pi_lock, 0, NULL, private);
+      if (e != 0)
+        futex_fatal_error ();
     }
 }
 
@@ -148,10 +130,16 @@ __condvar_acquire_lock (pthread_cond_t *cond, int private)
 static void __attribute__ ((unused))
 __condvar_release_lock (pthread_cond_t *cond, int private)
 {
-  if ((atomic_fetch_and_release (&cond->__data.__g1_orig_size,
-				 ~(unsigned int) 3) & 3)
-      == 2)
-    futex_wake (&cond->__data.__g1_orig_size, 1, private);
+  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
+
+  if (!atomic_compare_exchange_weak_release (&cond->__data.__pi_lock, &id, 0))
+    {
+      int e;
+
+      e = futex_unlock_pi ((unsigned int *) &cond->__data.__pi_lock, private);
+      if (e != 0)
+	futex_fatal_error ();
+    }
 }
 
 /* Only use this when having acquired the lock.  */
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index e614c7f3c900e..9b362d5282cd9 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -99,7 +99,7 @@ struct __pthread_cond_s
   unsigned int __g1_orig_size;
   unsigned int __wrefs;
   unsigned int __g_signals[2];
-  unsigned int __unused_initialized_1;
+  int __pi_lock;
   unsigned int __unused_initialized_2;
 };
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-08 16:33 [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking Sebastian Andrzej Siewior
@ 2025-09-08 17:46 ` Adhemerval Zanella Netto
  2025-09-09  7:32   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 10+ messages in thread
From: Adhemerval Zanella Netto @ 2025-09-08 17:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, libc-alpha
  Cc: John Ogness, linux-rt-devel, Thomas Gleixner, Carlos O'Donell



On 08/09/25 13:33, Sebastian Andrzej Siewior wrote:
> The pthread_cond_t structure provides an internal lock using the lower two bits
> of __g1_orig_size. This lock is always used by the signaller. The waiter uses
> it only in certain corner cases such as when canceling a wait due to a timeout.
> 
> Under contention, a thread attempting to acquire the lock invokes the futex
> operation FUTEX_WAIT while waiting for the lock to become available. This
> behavior can be problematic in real-time environments because the priority of
> the waiting task is not considered. As a result, a lower-priority task may
> become runnable while the lock owner remains preempted.
> 
> To address this, repurpose the unused member __unused_initialized_1 as a
> priority-inheritance (PI) futex named __pi_lock. This is the futex word, not a
> pthread_mutex_t.
> 
> __condvar_acquire_lock() performs an atomic transition from 0 to TID. If it
> fails, it falls back to __futex_lock_pi64(), which acquires the lock via
> FUTEX_LOCK_PI.
> __condvar_release_lock() performs an atomic transition from TID to 0. If it
> fails, it falls back to futex_unlock_pi(), which releases the lock and
> wakes a potential waiter.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> This is a follow-up to
> 	https://lore.kernel.org/all/20250829145659.KhM4kIoQ@linutronix.de/
> 
> This change expects that PI-FUTEX is supported by the kernel. This is
> the case since a long time but it is possible to disable the FUTEX
> subsystem or the FUTEX_PI part of it while building the kernel.
> Is it okay to assume that PI-FUTEX is available or should there be a
> check somewhere during startup and in case of -ENOSYS a fallback to
> current implementation?

We removed the __ASSUME_FUTEX_LOCK_PI (f5c77f78ec03363d5e550c4996deb75ee3f2e32a)
in favor or always check for PI support at runtime during pthread_mutex_init
(prio_inherit_missing).

Since the kernel still might return ENOSYS for FUTEX_PI I think we should
keep probing its support as runtime.
> 
>  nptl/pthread_cond_common.c              | 58 ++++++++++---------------
>  sysdeps/nptl/bits/thread-shared-types.h |  2 +-
>  2 files changed, 24 insertions(+), 36 deletions(-)
> 
> diff --git a/nptl/pthread_cond_common.c b/nptl/pthread_cond_common.c
> index 2708d262958f2..74010787b18de 100644
> --- a/nptl/pthread_cond_common.c
> +++ b/nptl/pthread_cond_common.c
> @@ -106,41 +106,23 @@ __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
>  #endif /* !__HAVE_64B_ATOMICS */
>  
>  /* The lock that signalers use.  See pthread_cond_wait_common for uses.
> -   The lock is our normal three-state lock: not acquired (0) / acquired (1) /
> -   acquired-with-futex_wake-request (2).  However, we need to preserve the
> -   other bits in the unsigned int used for the lock, and therefore it is a
> -   little more complex.  */
> +   The lock is the futex PI lock (PTHREAD_MUTEX_PI_NORMAL_NP) using just the
> +   futex word. */
>  static void __attribute__ ((unused))
>  __condvar_acquire_lock (pthread_cond_t *cond, int private)
>  {
> -  unsigned int s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
> -  while ((s & 3) == 0)
> +  int oldval, newval;
> +
> +  newval = THREAD_GETMEM (THREAD_SELF, tid);
> +  oldval = atomic_compare_and_exchange_val_acq (&cond->__data.__pi_lock,
> +                                                newval, 0);
> +  if (oldval != 0)
>      {
> -      if (atomic_compare_exchange_weak_acquire (&cond->__data.__g1_orig_size,
> -	  &s, s | 1))
> -	return;
> -      /* TODO Spinning and back-off.  */
> -    }
> -  /* We can't change from not acquired to acquired, so try to change to
> -     acquired-with-futex-wake-request and do a futex wait if we cannot change
> -     from not acquired.  */
> -  while (1)
> -    {
> -      while ((s & 3) != 2)
> -	{
> -	  if (atomic_compare_exchange_weak_acquire
> -	      (&cond->__data.__g1_orig_size, &s, (s & ~(unsigned int) 3) | 2))
> -	    {
> -	      if ((s & 3) == 0)
> -		return;
> -	      break;
> -	    }
> -	  /* TODO Back off.  */
> -	}
> -      futex_wait_simple (&cond->__data.__g1_orig_size,
> -	  (s & ~(unsigned int) 3) | 2, private);
> -      /* Reload so we see a recent value.  */
> -      s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
> +      int e;
> +
> +      e = __futex_lock_pi64 (&cond->__data.__pi_lock, 0, NULL, private);
> +      if (e != 0)
> +        futex_fatal_error ();

The __futex_lock_pi64 already issues futex_fatal_error() non-expected return code;
so I think we should only use it futex-internal.{c,h}.

And I think we will need to handle ENOSYS here, maybe to use the old locking logic
if the kernel does not support FUTEX_PI. I think it should be fair to assume that
if your running your workload on an environment without FUTEX_PI support you are
not subject to the issue you raised.

I am ccing Carlos, he has helped fix some issue on cond vars recently and he
might have some idea if using PI aware locks does make sense here.

Also, I think we really need regression testcases for this change.

>      }
>  }
>  
> @@ -148,10 +130,16 @@ __condvar_acquire_lock (pthread_cond_t *cond, int private)
>  static void __attribute__ ((unused))
>  __condvar_release_lock (pthread_cond_t *cond, int private)
>  {
> -  if ((atomic_fetch_and_release (&cond->__data.__g1_orig_size,
> -				 ~(unsigned int) 3) & 3)
> -      == 2)
> -    futex_wake (&cond->__data.__g1_orig_size, 1, private);
> +  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
> +
> +  if (!atomic_compare_exchange_weak_release (&cond->__data.__pi_lock, &id, 0))
> +    {
> +      int e;
> +
> +      e = futex_unlock_pi ((unsigned int *) &cond->__data.__pi_lock, private);
> +      if (e != 0)
> +	futex_fatal_error ();
> +    }
>  }
>  
>  /* Only use this when having acquired the lock.  */
> diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
> index e614c7f3c900e..9b362d5282cd9 100644
> --- a/sysdeps/nptl/bits/thread-shared-types.h
> +++ b/sysdeps/nptl/bits/thread-shared-types.h
> @@ -99,7 +99,7 @@ struct __pthread_cond_s
>    unsigned int __g1_orig_size;
>    unsigned int __wrefs;
>    unsigned int __g_signals[2];
> -  unsigned int __unused_initialized_1;
> +  int __pi_lock;
>    unsigned int __unused_initialized_2;
>  };
>  


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-08 17:46 ` Adhemerval Zanella Netto
@ 2025-09-09  7:32   ` Sebastian Andrzej Siewior
  2025-09-09 12:22     ` Florian Weimer
  2025-09-09 13:09     ` Adhemerval Zanella Netto
  0 siblings, 2 replies; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-09  7:32 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: libc-alpha, John Ogness, linux-rt-devel, Thomas Gleixner,
	Carlos O'Donell

On 2025-09-08 14:46:12 [-0300], Adhemerval Zanella Netto wrote:
> > This change expects that PI-FUTEX is supported by the kernel. This is
> > the case since a long time but it is possible to disable the FUTEX
> > subsystem or the FUTEX_PI part of it while building the kernel.
> > Is it okay to assume that PI-FUTEX is available or should there be a
> > check somewhere during startup and in case of -ENOSYS a fallback to
> > current implementation?
> 
> We removed the __ASSUME_FUTEX_LOCK_PI (f5c77f78ec03363d5e550c4996deb75ee3f2e32a)
> in favor or always check for PI support at runtime during pthread_mutex_init
> (prio_inherit_missing).
> 
> Since the kernel still might return ENOSYS for FUTEX_PI I think we should
> keep probing its support as runtime.

Okay. Since that one is static, I guess I would have to make my own
check in pthread_cond.

But… Now that I look at it again. The kernel has this FUTEX_PI option
which depends on RT_MUTEXES. But RT_MUTEXES has no off switch so it must
always be selected once FUTEX itself is enabled. The I2C subsystem
selects RT_MUTEXES and I don't think there is a config without PI-FUTEX
considering this.
We _used_ to have runtime detection for PI-FUTEX support because not all
architectures provided a cmpxchg function for futex. This is gone and
all architectures as of v5.17 provide it. That would be commit
   3297481d688a5 ("futex: Remove futex_cmpxchg detection")
   https://git.kernel.org/torvalds/c/3297481d688a5

for reference. So I *think* this config option can be removed on kernel
side and it appears to me as of v5.17 there should be no need for a
runtime check regarding PI-futex.

> > --- a/nptl/pthread_cond_common.c
> > +++ b/nptl/pthread_cond_common.c
> > @@ -106,41 +106,23 @@ __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
> >  static void __attribute__ ((unused))
> >  __condvar_acquire_lock (pthread_cond_t *cond, int private)
> >  {
> > +      int e;
> > +
> > +      e = __futex_lock_pi64 (&cond->__data.__pi_lock, 0, NULL, private);
> > +      if (e != 0)
> > +        futex_fatal_error ();
> 
> The __futex_lock_pi64 already issues futex_fatal_error() non-expected return code;
> so I think we should only use it futex-internal.{c,h}.

There shouldn't be any error. There might be the case where the lock
owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
handled now but it can't be recognized either. It would require to kill
the thread owning the lock.
So either abort the operation if the futex-op returns an error because
"this shouldn't happen" or I don't know.

> And I think we will need to handle ENOSYS here, maybe to use the old locking logic
> if the kernel does not support FUTEX_PI. I think it should be fair to assume that
> if your running your workload on an environment without FUTEX_PI support you are
> not subject to the issue you raised.

Yes, that is fair to assume. Also given the above it might be a
leftover. But let me add the fallback code and then we can remove once
it is certain on Kernel's side.

> I am ccing Carlos, he has helped fix some issue on cond vars recently and he
> might have some idea if using PI aware locks does make sense here.
> 
> Also, I think we really need regression testcases for this change.

I came up with something to test this. It required a few CPUs and task
pinning otherwise the small window didn't trigger.

> >      }
> >  }

Sebastian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09  7:32   ` Sebastian Andrzej Siewior
@ 2025-09-09 12:22     ` Florian Weimer
  2025-09-09 15:37       ` Sebastian Andrzej Siewior
  2025-09-09 13:09     ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2025-09-09 12:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Adhemerval Zanella Netto, libc-alpha, John Ogness, linux-rt-devel,
	Thomas Gleixner, Carlos O'Donell

* Sebastian Andrzej Siewior:

> On 2025-09-08 14:46:12 [-0300], Adhemerval Zanella Netto wrote:
>> > This change expects that PI-FUTEX is supported by the kernel. This is
>> > the case since a long time but it is possible to disable the FUTEX
>> > subsystem or the FUTEX_PI part of it while building the kernel.
>> > Is it okay to assume that PI-FUTEX is available or should there be a
>> > check somewhere during startup and in case of -ENOSYS a fallback to
>> > current implementation?
>> 
>> We removed the __ASSUME_FUTEX_LOCK_PI (f5c77f78ec03363d5e550c4996deb75ee3f2e32a)
>> in favor or always check for PI support at runtime during pthread_mutex_init
>> (prio_inherit_missing).
>> 
>> Since the kernel still might return ENOSYS for FUTEX_PI I think we should
>> keep probing its support as runtime.
>
> Okay. Since that one is static, I guess I would have to make my own
> check in pthread_cond.
>
> But… Now that I look at it again. The kernel has this FUTEX_PI option
> which depends on RT_MUTEXES. But RT_MUTEXES has no off switch so it must
> always be selected once FUTEX itself is enabled. The I2C subsystem
> selects RT_MUTEXES and I don't think there is a config without PI-FUTEX
> considering this.
> We _used_ to have runtime detection for PI-FUTEX support because not all
> architectures provided a cmpxchg function for futex. This is gone and
> all architectures as of v5.17 provide it. That would be commit
>    3297481d688a5 ("futex: Remove futex_cmpxchg detection")
>    https://git.kernel.org/torvalds/c/3297481d688a5
>
> for reference. So I *think* this config option can be removed on kernel
> side and it appears to me as of v5.17 there should be no need for a
> runtime check regarding PI-futex.

Are there seccomp filters for PI futexes?  I wouldn't be surprised if
people added them after some of the high-profile futex vulnerabilities.
I think we should not support such seccomp filters, but we need to know
what we are up against.

> There shouldn't be any error. There might be the case where the lock
> owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
> handled now but it can't be recognized either. It would require to kill
> the thread owning the lock.
> So either abort the operation if the futex-op returns an error because
> "this shouldn't happen" or I don't know.

ENOMEM needs to be reported to the caller because the application may
want to react to it.

Long term, we should probably have different interfaces for full locking
(with expected failures during locking) and simple mutexes (where
locking can only fail due to memory corruption).

Unlocking must never fail with ENOMEM or similar error codes, it must
always suceed (except if there is memory corruption).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09  7:32   ` Sebastian Andrzej Siewior
  2025-09-09 12:22     ` Florian Weimer
@ 2025-09-09 13:09     ` Adhemerval Zanella Netto
  2025-09-09 15:52       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 10+ messages in thread
From: Adhemerval Zanella Netto @ 2025-09-09 13:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: libc-alpha, John Ogness, linux-rt-devel, Thomas Gleixner,
	Carlos O'Donell



On 09/09/25 04:32, Sebastian Andrzej Siewior wrote:
> On 2025-09-08 14:46:12 [-0300], Adhemerval Zanella Netto wrote:
>>> This change expects that PI-FUTEX is supported by the kernel. This is
>>> the case since a long time but it is possible to disable the FUTEX
>>> subsystem or the FUTEX_PI part of it while building the kernel.
>>> Is it okay to assume that PI-FUTEX is available or should there be a
>>> check somewhere during startup and in case of -ENOSYS a fallback to
>>> current implementation?
>>
>> We removed the __ASSUME_FUTEX_LOCK_PI (f5c77f78ec03363d5e550c4996deb75ee3f2e32a)
>> in favor or always check for PI support at runtime during pthread_mutex_init
>> (prio_inherit_missing).
>>
>> Since the kernel still might return ENOSYS for FUTEX_PI I think we should
>> keep probing its support as runtime.
> 
> Okay. Since that one is static, I guess I would have to make my own
> check in pthread_cond.
> 

Or move the prio_inherit_missing to its own TU and use is an internal
symbol.

> But… Now that I look at it again. The kernel has this FUTEX_PI option
> which depends on RT_MUTEXES. But RT_MUTEXES has no off switch so it must
> always be selected once FUTEX itself is enabled. The I2C subsystem
> selects RT_MUTEXES and I don't think there is a config without PI-FUTEX
> considering this.
> We _used_ to have runtime detection for PI-FUTEX support because not all
> architectures provided a cmpxchg function for futex. This is gone and
> all architectures as of v5.17 provide it. That would be commit
>    3297481d688a5 ("futex: Remove futex_cmpxchg detection")
>    https://git.kernel.org/torvalds/c/3297481d688a5
> 
> for reference. So I *think* this config option can be removed on kernel
> side and it appears to me as of v5.17 there should be no need for a
> runtime check regarding PI-futex.

Right, but at least for glibc we will need to keep the runtime check
because we still support old kernels.

> 
>>> --- a/nptl/pthread_cond_common.c
>>> +++ b/nptl/pthread_cond_common.c
>>> @@ -106,41 +106,23 @@ __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
> …
>>>  static void __attribute__ ((unused))
>>>  __condvar_acquire_lock (pthread_cond_t *cond, int private)
>>>  {
> …
>>> +      int e;
>>> +
>>> +      e = __futex_lock_pi64 (&cond->__data.__pi_lock, 0, NULL, private);
>>> +      if (e != 0)
>>> +        futex_fatal_error ();
>>
>> The __futex_lock_pi64 already issues futex_fatal_error() non-expected return code;
>> so I think we should only use it futex-internal.{c,h}.
> 
> There shouldn't be any error. There might be the case where the lock
> owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
> handled now but it can't be recognized either. It would require to kill
> the thread owning the lock.
> So either abort the operation if the futex-op returns an error because
> "this shouldn't happen" or I don't know.
> 

On pthread_mutex_lock we handle ESRCH only for non-robust PI mutexes, and
we delay the thread indefinitely. I am not sure which is the best approach,
but I am tending for the delay one for consistency.

>> And I think we will need to handle ENOSYS here, maybe to use the old locking logic
>> if the kernel does not support FUTEX_PI. I think it should be fair to assume that
>> if your running your workload on an environment without FUTEX_PI support you are
>> not subject to the issue you raised.
> 
> Yes, that is fair to assume. Also given the above it might be a
> leftover. But let me add the fallback code and then we can remove once
> it is certain on Kernel's side.

I think we won't be able to remove because we still need to handle older
kernels that might return ENOSYS.

> 
>> I am ccing Carlos, he has helped fix some issue on cond vars recently and he
>> might have some idea if using PI aware locks does make sense here.
>>
>> Also, I think we really need regression testcases for this change.
> 
> I came up with something to test this. It required a few CPUs and task
> pinning otherwise the small window didn't trigger.

Right, I think the idea would to provide a way to over multiples runs from
multiple developers in multiple machines we can assume that this is triggered.
I agree that for such problems is really hard to come up with reliable test
cases without spending a lot of cpu time.

> 
>>>      }
>>>  }
> 
> Sebastian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09 12:22     ` Florian Weimer
@ 2025-09-09 15:37       ` Sebastian Andrzej Siewior
  2025-09-09 16:32         ` Florian Weimer
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-09 15:37 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Adhemerval Zanella Netto, libc-alpha, John Ogness, linux-rt-devel,
	Thomas Gleixner, Carlos O'Donell

On 2025-09-09 14:22:26 [+0200], Florian Weimer wrote:
> * Sebastian Andrzej Siewior:
> 
> > On 2025-09-08 14:46:12 [-0300], Adhemerval Zanella Netto wrote:
> >> > This change expects that PI-FUTEX is supported by the kernel. This is
> >> > the case since a long time but it is possible to disable the FUTEX
> >> > subsystem or the FUTEX_PI part of it while building the kernel.
> >> > Is it okay to assume that PI-FUTEX is available or should there be a
> >> > check somewhere during startup and in case of -ENOSYS a fallback to
> >> > current implementation?
> >> 
> >> We removed the __ASSUME_FUTEX_LOCK_PI (f5c77f78ec03363d5e550c4996deb75ee3f2e32a)
> >> in favor or always check for PI support at runtime during pthread_mutex_init
> >> (prio_inherit_missing).
> >> 
> >> Since the kernel still might return ENOSYS for FUTEX_PI I think we should
> >> keep probing its support as runtime.
> >
> > Okay. Since that one is static, I guess I would have to make my own
> > check in pthread_cond.
> >
> > But… Now that I look at it again. The kernel has this FUTEX_PI option
> > which depends on RT_MUTEXES. But RT_MUTEXES has no off switch so it must
> > always be selected once FUTEX itself is enabled. The I2C subsystem
> > selects RT_MUTEXES and I don't think there is a config without PI-FUTEX
> > considering this.
> > We _used_ to have runtime detection for PI-FUTEX support because not all
> > architectures provided a cmpxchg function for futex. This is gone and
> > all architectures as of v5.17 provide it. That would be commit
> >    3297481d688a5 ("futex: Remove futex_cmpxchg detection")
> >    https://git.kernel.org/torvalds/c/3297481d688a5
> >
> > for reference. So I *think* this config option can be removed on kernel
> > side and it appears to me as of v5.17 there should be no need for a
> > runtime check regarding PI-futex.
> 
> Are there seccomp filters for PI futexes?  I wouldn't be surprised if
> people added them after some of the high-profile futex vulnerabilities.
> I think we should not support such seccomp filters, but we need to know
> what we are up against.

I don't know. But don't you allow a syscall such a sys_futex and don't
filter additional arguments? Unless one would filter the op argument, it
shouldn't be an issue. There was this switch from sys_futex to
sys_futex_time64 on 32bit architectures but this is more a permanent
switch…
And libc doesn't use sys_futex_wake, sys_futex_wait which would make a
difference in this case. So unless op is checked, it should be fine.

> > There shouldn't be any error. There might be the case where the lock
> > owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
> > handled now but it can't be recognized either. It would require to kill
> > the thread owning the lock.
> > So either abort the operation if the futex-op returns an error because
> > "this shouldn't happen" or I don't know.
> 
> ENOMEM needs to be reported to the caller because the application may
> want to react to it.

Well. Right now it only checks ESRCH and EDEADLK. Everything else is
considered success.
So do want to update this + man-page?
But what should be done this pthread_cond_.*() functions? I guess we
can't forward that possible -ENOMEM to the caller?

Also if we are in good mood, there pthread_mutex_lock() has this comment
|                 /* ESRCH can happen only for non-robust PI mutexes where
|                    the owner of the lock died.  */

This is simply not true as far as the kernel goes. If the futex uaddr
contains a pid of a non-existing task then LOCK_PI will return ESRCH. A
simple testcase would be

| #include <stdio.h>
| #include <pthread.h>
|
| static pthread_mutex_t l;
|
| static void *thread_code(void *arg)
| {
|         pthread_mutex_lock(&l);
|         return NULL;
| }
|
| int main(void)
| {
|         pthread_mutexattr_t attr;
|         pthread_t thread;
|         int ret;
|
|         ret = pthread_mutexattr_init(&attr);
|         ret |= pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
|         ret |= pthread_mutex_init(&l, &attr);
|         ret |= pthread_create(&thread, NULL, thread_code, NULL);
|         if (ret) {
|                 printf("->[%d] %d\n", __LINE__, ret);
|                 return 1;
|         }
|         pthread_join(thread, NULL);
|         ret = pthread_mutex_lock(&l);
|         printf("-> %d %m\n", ret);
|
|         return 0;
| }

and strace says:
| futex(0x55afb0ceb080, FUTEX_LOCK_PI_PRIVATE, NULL) = -1 ESRCH (No such process)
| futex(0x7ffe73216f24, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

the second futex() is the __futex_abstimed_wait64() that follows.

> Long term, we should probably have different interfaces for full locking
> (with expected failures during locking) and simple mutexes (where
> locking can only fail due to memory corruption).
> 
> Unlocking must never fail with ENOMEM or similar error codes, it must
> always suceed (except if there is memory corruption).

UNLOCK_PI. There is no ENOMEM, unlock always succeeds.
Except for pilot errors:
- EFAULT (can't read, uaddr not properly aligned)
- EPERM (we are not owner).
- EAGAIN if userland fiddled with with the value while kernel tried to
  unlock. This _shouldn't_ happen because if the 0->tid transition fails
  the user should go to kernel. But if it fiddles, EAGAIN will be a
  possible outcome.

> Thanks,
> Florian

Sebastian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09 13:09     ` Adhemerval Zanella Netto
@ 2025-09-09 15:52       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-09 15:52 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: libc-alpha, John Ogness, linux-rt-devel, Thomas Gleixner,
	Carlos O'Donell

On 2025-09-09 10:09:36 [-0300], Adhemerval Zanella Netto wrote:
> > Okay. Since that one is static, I guess I would have to make my own
> > check in pthread_cond.
> > 
> 
> Or move the prio_inherit_missing to its own TU and use is an internal
> symbol.
okay.

> Right, but at least for glibc we will need to keep the runtime check
> because we still support old kernels.

I see. Is there a minimum old kernel? And will this increased somehow
based on LTS support or y2038 requirements?

Basically we have this due to old ARM machines :)

> >>> --- a/nptl/pthread_cond_common.c
> >>> +++ b/nptl/pthread_cond_common.c
> >>> @@ -106,41 +106,23 @@ __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
> > …
> >>>  static void __attribute__ ((unused))
> >>>  __condvar_acquire_lock (pthread_cond_t *cond, int private)
> >>>  {
> > …
> >>> +      int e;
> >>> +
> >>> +      e = __futex_lock_pi64 (&cond->__data.__pi_lock, 0, NULL, private);
> >>> +      if (e != 0)
> >>> +        futex_fatal_error ();
> >>
> >> The __futex_lock_pi64 already issues futex_fatal_error() non-expected return code;
> >> so I think we should only use it futex-internal.{c,h}.
> > 
> > There shouldn't be any error. There might be the case where the lock
> > owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
> > handled now but it can't be recognized either. It would require to kill
> > the thread owning the lock.
> > So either abort the operation if the futex-op returns an error because
> > "this shouldn't happen" or I don't know.
> > 
> 
> On pthread_mutex_lock we handle ESRCH only for non-robust PI mutexes, and
> we delay the thread indefinitely. I am not sure which is the best approach,
> but I am tending for the delay one for consistency.

As I wrote in my other mail, we should somehow consider this dead-owner
case but can do this loop for consistency.

> >> I am ccing Carlos, he has helped fix some issue on cond vars recently and he
> >> might have some idea if using PI aware locks does make sense here.
> >>
> >> Also, I think we really need regression testcases for this change.
> > 
> > I came up with something to test this. It required a few CPUs and task
> > pinning otherwise the small window didn't trigger.
> 
> Right, I think the idea would to provide a way to over multiples runs from
> multiple developers in multiple machines we can assume that this is triggered.
> I agree that for such problems is really hard to come up with reliable test
> cases without spending a lot of cpu time.

Let me clean this up, follow the testsuite approach. I did verify it in
the with the syscall tracer that the case triggered. So it run a few
times and assume it touched the PI paths.

> > 
> >>>      }
> >>>  }

Sebastian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09 15:37       ` Sebastian Andrzej Siewior
@ 2025-09-09 16:32         ` Florian Weimer
  2025-09-09 19:14           ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2025-09-09 16:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Adhemerval Zanella Netto, libc-alpha, John Ogness, linux-rt-devel,
	Thomas Gleixner, Carlos O'Donell

* Sebastian Andrzej Siewior:

>> Are there seccomp filters for PI futexes?  I wouldn't be surprised if
>> people added them after some of the high-profile futex vulnerabilities.
>> I think we should not support such seccomp filters, but we need to know
>> what we are up against.
>
> I don't know. But don't you allow a syscall such a sys_futex and don't
> filter additional arguments? Unless one would filter the op argument, it
> shouldn't be an issue.

There's this historic example:

  Remove priority inheritance support for Futex in all Chrome policies,
  including NaCl NonSFI
  <https://issues.chromium.org/issues/40381864>

Some applications are exactly doing that, though, particularly for
FUTEX_CMP_REQUEUE_PI.

Given how widely the Chromium sandbox code is used, I wonder if this is
still an issue.

>> > There shouldn't be any error. There might be the case where the lock
>> > owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
>> > handled now but it can't be recognized either. It would require to kill
>> > the thread owning the lock.
>> > So either abort the operation if the futex-op returns an error because
>> > "this shouldn't happen" or I don't know.
>> 
>> ENOMEM needs to be reported to the caller because the application may
>> want to react to it.
>
> Well. Right now it only checks ESRCH and EDEADLK. Everything else is
> considered success.
> So do want to update this + man-page?

The man page already mentions ENOMEM, it's just glibc that is buggy.

> But what should be done this pthread_cond_.*() functions? I guess we
> can't forward that possible -ENOMEM to the caller?

Is this about the wait operation?  POSIX allows spurious wakeups, so we
could technically return without an error.

> Also if we are in good mood, there pthread_mutex_lock() has this comment
> |                 /* ESRCH can happen only for non-robust PI mutexes where
> |                    the owner of the lock died.  */
>
> This is simply not true as far as the kernel goes. If the futex uaddr
> contains a pid of a non-existing task then LOCK_PI will return ESRCH. A
> simple testcase would be
>
> | #include <stdio.h>
> | #include <pthread.h>
> |
> | static pthread_mutex_t l;
> |
> | static void *thread_code(void *arg)
> | {
> |         pthread_mutex_lock(&l);
> |         return NULL;
> | }
> |
> | int main(void)
> | {
> |         pthread_mutexattr_t attr;
> |         pthread_t thread;
> |         int ret;
> |
> |         ret = pthread_mutexattr_init(&attr);
> |         ret |= pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
> |         ret |= pthread_mutex_init(&l, &attr);
> |         ret |= pthread_create(&thread, NULL, thread_code, NULL);
> |         if (ret) {
> |                 printf("->[%d] %d\n", __LINE__, ret);
> |                 return 1;
> |         }
> |         pthread_join(thread, NULL);
> |         ret = pthread_mutex_lock(&l);
> |         printf("-> %d %m\n", ret);
> |
> |         return 0;
> | }
>
> and strace says:
> | futex(0x55afb0ceb080, FUTEX_LOCK_PI_PRIVATE, NULL) = -1 ESRCH (No such process)
> | futex(0x7ffe73216f24, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
>
> the second futex() is the __futex_abstimed_wait64() that follows.

Does POSIX define what happens if a non-robust mutex is accessed again
if a thread terminates without unlocking it first?  If yes, then this is
indeed a problem.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09 16:32         ` Florian Weimer
@ 2025-09-09 19:14           ` Adhemerval Zanella Netto
  2025-09-10  5:57             ` Florian Weimer
  0 siblings, 1 reply; 10+ messages in thread
From: Adhemerval Zanella Netto @ 2025-09-09 19:14 UTC (permalink / raw)
  To: Florian Weimer, Sebastian Andrzej Siewior
  Cc: libc-alpha, John Ogness, linux-rt-devel, Thomas Gleixner,
	Carlos O'Donell



On 09/09/25 13:32, Florian Weimer wrote:
> * Sebastian Andrzej Siewior:
> 
>>> Are there seccomp filters for PI futexes?  I wouldn't be surprised if
>>> people added them after some of the high-profile futex vulnerabilities.
>>> I think we should not support such seccomp filters, but we need to know
>>> what we are up against.
>>
>> I don't know. But don't you allow a syscall such a sys_futex and don't
>> filter additional arguments? Unless one would filter the op argument, it
>> shouldn't be an issue.
> 
> There's this historic example:
> 
>   Remove priority inheritance support for Futex in all Chrome policies,
>   including NaCl NonSFI
>   <https://issues.chromium.org/issues/40381864>
> 
> Some applications are exactly doing that, though, particularly for
> FUTEX_CMP_REQUEUE_PI.
> 
> Given how widely the Chromium sandbox code is used, I wonder if this is
> still an issue.
> 
>>>> There shouldn't be any error. There might be the case where the lock
>>>> owner is gone (ESRCH I believe) or the theoretical ENOMEM. ESRCH isn't
>>>> handled now but it can't be recognized either. It would require to kill
>>>> the thread owning the lock.
>>>> So either abort the operation if the futex-op returns an error because
>>>> "this shouldn't happen" or I don't know.
>>>
>>> ENOMEM needs to be reported to the caller because the application may
>>> want to react to it.
>>
>> Well. Right now it only checks ESRCH and EDEADLK. Everything else is
>> considered success.
>> So do want to update this + man-page?
> 
> The man page already mentions ENOMEM, it's just glibc that is buggy.
> 
>> But what should be done this pthread_cond_.*() functions? I guess we
>> can't forward that possible -ENOMEM to the caller?
> 
> Is this about the wait operation?  POSIX allows spurious wakeups, so we
> could technically return without an error.
> 
>> Also if we are in good mood, there pthread_mutex_lock() has this comment
>> |                 /* ESRCH can happen only for non-robust PI mutexes where
>> |                    the owner of the lock died.  */
>>
>> This is simply not true as far as the kernel goes. If the futex uaddr
>> contains a pid of a non-existing task then LOCK_PI will return ESRCH. A
>> simple testcase would be
>>
>> | #include <stdio.h>
>> | #include <pthread.h>
>> |
>> | static pthread_mutex_t l;
>> |
>> | static void *thread_code(void *arg)
>> | {
>> |         pthread_mutex_lock(&l);
>> |         return NULL;
>> | }
>> |
>> | int main(void)
>> | {
>> |         pthread_mutexattr_t attr;
>> |         pthread_t thread;
>> |         int ret;
>> |
>> |         ret = pthread_mutexattr_init(&attr);
>> |         ret |= pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
>> |         ret |= pthread_mutex_init(&l, &attr);
>> |         ret |= pthread_create(&thread, NULL, thread_code, NULL);
>> |         if (ret) {
>> |                 printf("->[%d] %d\n", __LINE__, ret);
>> |                 return 1;
>> |         }
>> |         pthread_join(thread, NULL);
>> |         ret = pthread_mutex_lock(&l);
>> |         printf("-> %d %m\n", ret);
>> |
>> |         return 0;
>> | }
>>
>> and strace says:
>> | futex(0x55afb0ceb080, FUTEX_LOCK_PI_PRIVATE, NULL) = -1 ESRCH (No such process)
>> | futex(0x7ffe73216f24, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
>>
>> the second futex() is the __futex_abstimed_wait64() that follows.
> 
> Does POSIX define what happens if a non-robust mutex is accessed again
> if a thread terminates without unlocking it first?  If yes, then this is
> indeed a problem.

Afaik this is UB for non-robust mutexes, although the resolution for Austin
Issue 755  [1] is not clear IMHO.

[1] https://www.austingroupbugs.net/view.php?id=755#c1875



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking
  2025-09-09 19:14           ` Adhemerval Zanella Netto
@ 2025-09-10  5:57             ` Florian Weimer
  0 siblings, 0 replies; 10+ messages in thread
From: Florian Weimer @ 2025-09-10  5:57 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: Sebastian Andrzej Siewior, libc-alpha, John Ogness,
	linux-rt-devel, Thomas Gleixner, Carlos O'Donell

* Adhemerval Zanella Netto:

>> Does POSIX define what happens if a non-robust mutex is accessed again
>> if a thread terminates without unlocking it first?  If yes, then this is
>> indeed a problem.
>
> Afaik this is UB for non-robust mutexes, although the resolution for Austin
> Issue 755  [1] is not clear IMHO.
>
> [1] https://www.austingroupbugs.net/view.php?id=755#c1875

Ah, so it's UB in practically all implementations (for at least some
mutex types), but not UB in POSIX.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-09-10  5:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-08 16:33 [PATCH] nptl: Use a PI-aware lock for internal pthread_cond-related locking Sebastian Andrzej Siewior
2025-09-08 17:46 ` Adhemerval Zanella Netto
2025-09-09  7:32   ` Sebastian Andrzej Siewior
2025-09-09 12:22     ` Florian Weimer
2025-09-09 15:37       ` Sebastian Andrzej Siewior
2025-09-09 16:32         ` Florian Weimer
2025-09-09 19:14           ` Adhemerval Zanella Netto
2025-09-10  5:57             ` Florian Weimer
2025-09-09 13:09     ` Adhemerval Zanella Netto
2025-09-09 15:52       ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).