linux-kernel.vger.kernel.org archive mirror
* [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
@ 2025-07-09 18:05 Waiman Long
  2025-07-09 18:19 ` Waiman Long
  2025-07-09 18:19 ` Linus Torvalds
  0 siblings, 2 replies; 7+ messages in thread
From: Waiman Long @ 2025-07-09 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet
  Cc: linux-kernel, Linus Torvalds, Jann Horn, Waiman Long

Jann reported a possible UAF scenario where a task in the mutex_unlock()
path had released the mutex and was about to acquire the wait_lock to
check for waiters. In the interim, another task could come in, acquire
and release the mutex, and then free the memory object holding the
mutex.

Thread A                      Thread B
========                      ========
                              eventpoll_release_file
                                mutex_lock
                                  [success on trylock fastpath]
                                __ep_remove
                                  ep_refcount_dec_and_test
                                    [drop refcount from 2 to 1]

ep_eventpoll_release
  ep_clear_and_put
    mutex_lock
      __mutex_lock_slowpath
        __mutex_lock
          __mutex_lock_common
            __mutex_add_waiter
              [enqueue waiter]
              [set MUTEX_FLAG_WAITERS]

                                mutex_unlock
                                  __mutex_unlock_slowpath
                                    atomic_long_try_cmpxchg_release
                                      [reads MUTEX_FLAG_WAITERS]
                                      [drops lock ownership]

            __mutex_trylock
              [success]
            __mutex_remove_waiter
    ep_refcount_dec_and_test
      [drop refcount from 1 to 0]
    mutex_unlock
    ep_free
      kfree(ep)

                                    raw_spin_lock_irqsave(&lock->wait_lock)
                                      *** UAF WRITE ***

This race condition is especially likely if a preemption happens right
after releasing the lock but before acquiring the wait_lock. Rwsem's
__up_write() and __up_read() helpers have already disabled preemption
to minimize this vulnerable time period; do the same for
__mutex_unlock_slowpath() to reduce the chance of hitting this race.
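
For reference, a minimal sketch of the rwsem-style bracketing that the
scoped guard(preempt)() used below is assumed to expand to (the function
name here is made up purely for illustration and is not part of this
patch):

    #include <linux/mutex.h>
    #include <linux/preempt.h>
    #include <linux/spinlock.h>

    /* Sketch only: open-coded equivalent of a scoped guard(preempt)(). */
    static void unlock_window_sketch(struct mutex *lock)
    {
            unsigned long flags;

            preempt_disable();   /* narrow the owner-release -> wait_lock gap */
            /* ... drop lock ownership with a release cmpxchg here ... */
            raw_spin_lock_irqsave(&lock->wait_lock, flags);
            /* ... wake up the top waiter ... */
            raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
            preempt_enable();    /* guard(preempt)() does this at scope exit */
    }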

Also add a note in Documentation/locking/mutex-design.rst to suggest
that callers can use kfree_rcu() to delay the actual freeing of the
memory object and so eliminate this UAF scenario.
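
As an illustration of that suggestion, a minimal sketch of the pattern
(the struct and function names are invented for the example and are not
taken from eventpoll):

    #include <linux/mutex.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /*
     * Hypothetical object whose embedded mutex may still be touched by a
     * racing mutex_unlock() when the last reference is dropped.
     */
    struct foo_object {
            struct mutex    lock;
            struct rcu_head rcu;
            /* ... payload ... */
    };

    static void foo_object_free(struct foo_object *obj)
    {
            /*
             * Defer the actual kfree() past an RCU grace period instead
             * of calling kfree(obj) directly, so an unlocker that has
             * already released obj->lock but has not yet finished with
             * the mutex internals does not touch freed memory.
             */
            kfree_rcu(obj, rcu);
    }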

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/locking/mutex-design.rst | 6 ++++--
 kernel/locking/mutex.c                 | 6 ++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
index 7c30b4aa5e28..51a3a28ca830 100644
--- a/Documentation/locking/mutex-design.rst
+++ b/Documentation/locking/mutex-design.rst
@@ -117,8 +117,10 @@ the structure anymore.
 
 The mutex user must ensure that the mutex is not destroyed while a
 release operation is still in progress - in other words, callers of
-mutex_unlock() must ensure that the mutex stays alive until mutex_unlock()
-has returned.
+mutex_unlock() must ensure that the mutex stays alive until
+mutex_unlock() has returned. One possible way to do that is to use
+kfree_rcu() or its variants to delay the actual freeing of the
+memory object containing the mutex.
 
 Interfaces
 ----------
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a39ecccbd106..d33f36d305fb 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -912,9 +912,15 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	 * Release the lock before (potentially) taking the spinlock such that
 	 * other contenders can get on with things ASAP.
 	 *
+	 * Preemption is disabled to minimize the time gap between releasing
+	 * the lock and acquiring the wait_lock. Callers may consider using
+	 * kfree_rcu() if the memory holding the mutex may be freed after
+	 * another mutex_unlock() call to ensure that UAF will not happen.
+	 *
 	 * Except when HANDOFF, in that case we must not clear the owner field,
 	 * but instead set it to the top waiter.
 	 */
+	guard(preempt)();
 	owner = atomic_long_read(&lock->owner);
 	for (;;) {
 		MUTEX_WARN_ON(__owner_task(owner) != current);
-- 
2.50.0



* Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
  2025-07-09 18:05 [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath() Waiman Long
@ 2025-07-09 18:19 ` Waiman Long
  2025-07-09 18:19 ` Linus Torvalds
  1 sibling, 0 replies; 7+ messages in thread
From: Waiman Long @ 2025-07-09 18:19 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet
  Cc: linux-kernel, Linus Torvalds, Jann Horn

On 7/9/25 2:05 PM, Waiman Long wrote:
> Jann reported a possible UAF scenario where a task in the mutex_unlock()
> path had released the mutex and was about to acquire the wait_lock to
> check for waiters. In the interim, another task could come in, acquire
> and release the mutex, and then free the memory object holding the
> mutex.
>
> Thread A                      Thread B
> ========                      ========
>                                eventpoll_release_file
>                                  mutex_lock
>                                    [success on trylock fastpath]
>                                  __ep_remove
>                                    ep_refcount_dec_and_test
>                                      [drop refcount from 2 to 1]
>
> ep_eventpoll_release
>    ep_clear_and_put
>      mutex_lock
>        __mutex_lock_slowpath
>          __mutex_lock
>            __mutex_lock_common
>              __mutex_add_waiter
>                [enqueue waiter]
>                [set MUTEX_FLAG_WAITERS]
>
>                                  mutex_unlock
>                                    __mutex_unlock_slowpath
>                                      atomic_long_try_cmpxchg_release
>                                        [reads MUTEX_FLAG_WAITERS]
>                                        [drops lock ownership]
>
>              __mutex_trylock
>                [success]
>              __mutex_remove_waiter
>      ep_refcount_dec_and_test
>        [drop refcount from 1 to 0]
>      mutex_unlock
>      ep_free
>        kfree(ep)
>
>                                      raw_spin_lock_irqsave(&lock->wait_lock)
>                                        *** UAF WRITE ***
>
> This race condition is especially likely if a preemption happens right
> after releasing the lock but before acquiring the wait_lock. Rwsem's
> __up_write() and __up_read() helpers have already disabled preemption
> to minimize this vulnerable time period; do the same for
> __mutex_unlock_slowpath() to reduce the chance of hitting this race.
>
> Also add a note in Documentation/locking/mutex-design.rst to suggest
> that callers can use kfree_rcu() to delay the actual freeing of the
> memory object and so eliminate this UAF scenario.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   Documentation/locking/mutex-design.rst | 6 ++++--
>   kernel/locking/mutex.c                 | 6 ++++++
>   2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
> index 7c30b4aa5e28..51a3a28ca830 100644
> --- a/Documentation/locking/mutex-design.rst
> +++ b/Documentation/locking/mutex-design.rst
> @@ -117,8 +117,10 @@ the structure anymore.
>   
>   The mutex user must ensure that the mutex is not destroyed while a
>   release operation is still in progress - in other words, callers of
> -mutex_unlock() must ensure that the mutex stays alive until mutex_unlock()
> -has returned.
> +mutex_unlock() must ensure that the mutex stays alive until
> +mutex_unlock() has returned. One possible way to do that is to use
> +kfree_rcu() or its variants to delay the actual freeing of the
> +memory object containing the mutex.
>   
>   Interfaces
>   ----------
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index a39ecccbd106..d33f36d305fb 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -912,9 +912,15 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>   	 * Release the lock before (potentially) taking the spinlock such that
>   	 * other contenders can get on with things ASAP.
>   	 *
> +	 * Preemption is disabled to minimize the time gap between releasing
> +	 * the lock and acquiring the wait_lock. Callers may consider using
> +	 * kfree_rcu() if the memory holding the mutex may be freed after
> +	 * another mutex_unlock() call to ensure that UAF will not happen.
> +	 *
>   	 * Except when HANDOFF, in that case we must not clear the owner field,
>   	 * but instead set it to the top waiter.
>   	 */
> +	guard(preempt)();
>   	owner = atomic_long_read(&lock->owner);
>   	for (;;) {
>   		MUTEX_WARN_ON(__owner_task(owner) != current);

Note that Linus' patch should fix this particular eventpoll UAF race 
condition, but there may be other code paths where similar situations 
can happen.

Cheers,
Longman



* Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
  2025-07-09 18:05 [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath() Waiman Long
  2025-07-09 18:19 ` Waiman Long
@ 2025-07-09 18:19 ` Linus Torvalds
  2025-07-09 18:21   ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2025-07-09 18:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet, linux-kernel, Jann Horn

On Wed, 9 Jul 2025 at 11:06, Waiman Long <longman@redhat.com> wrote:
>
> This race condition is possible especially if a preemption happens right
> after releasing the lock but before acquiring the wait_lock. Rwsem's
> __up_write() and __up_read() helpers have already disabled
> preemption to minimize this vulnernable time period, do the same for
> __mutex_unlock_slowpath() to minimize the chance of this race condition.

I think this patch is actively detrimental, in that it only helps hide
the bug. The bug still exists, it's just harder to hit.

Maybe that is worth it as a "hardening" thing, but I feel this makes
people believe even *more* that they can use mutexes for object
lifetimes.

And that's a fundamentally buggy assumption. Locking is about mutual
exclusion, not lifetimes. There are very specific things where
"release the lock" also releases an object, but they should be
considered very very special.

All objects that aren't purely thread-local should have lifetimes that
depend *solely* on refcounts. Anything else is typically a serious
bug, or needs to be actively discouraged, not encouraged like this.
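
As an illustration, a minimal sketch of that pattern (the names are
invented, not from any particular subsystem):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    /*
     * Hypothetical object: the refcount alone decides when it dies; any
     * locks inside it exist purely for mutual exclusion.
     */
    struct bar_object {
            refcount_t ref;
            /* ... payload, possibly including a mutex ... */
    };

    static void bar_object_put(struct bar_object *obj)
    {
            /* Free when the last user - not the last lock holder - is gone. */
            if (refcount_dec_and_test(&obj->ref))
                    kfree(obj);
    }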

Even with things like RCU, the lifetime of the object should be about
refcounts, and the RCU thing should be purely about "I'm doing
asynchronous lookups that aren't protected by any locks outside this
object, so I may see objects that are being torn down".

I absolutely detest the notion of "let's make locking be tied to
object lifetimes".

Note that locks *outside* the object is obviously very very normal,
but then the *lock* has a totally different lifetime entirely, and the
lifetime of the lock has nothing to do with the lifetime of the
object.

Please don't confuse the two. This was eventpoll being completely
broken. As usual. We've had less eventpoll breakage lately than we
historically used to have, but that's hopefully because that horrid
pile is not actively developed any more, and is slowly - oh so very
slowly - getting fixed.

I'm very sad that io_uring ended up with eventpoll support, but that
was sold on the premise that it makes it easier to incrementally turn
some eventpoll user into an io_uring user. I certainly hope it leads
to less epoll use in the long run, rather than more.

             Linus


* Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
  2025-07-09 18:19 ` Linus Torvalds
@ 2025-07-09 18:21   ` Linus Torvalds
  2025-07-09 18:28     ` Waiman Long
  2025-07-09 18:28     ` Linus Torvalds
  0 siblings, 2 replies; 7+ messages in thread
From: Linus Torvalds @ 2025-07-09 18:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet, linux-kernel, Jann Horn

On Wed, 9 Jul 2025 at 11:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I absolutely detest the notion of "let's make locking be tied to
> object lifetimes".

Side note: I wonder if there's any way to detect this kind of race in general.

And I suspect it would involve the exact *opposite* of your patch:
make mutex_unlock() actively cause preemption after it has released
the lock but before it has done the final accesses.

                 Linus


* Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
  2025-07-09 18:21   ` Linus Torvalds
@ 2025-07-09 18:28     ` Waiman Long
  2025-07-09 18:28     ` Linus Torvalds
  1 sibling, 0 replies; 7+ messages in thread
From: Waiman Long @ 2025-07-09 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet, linux-kernel, Jann Horn


On 7/9/25 2:21 PM, Linus Torvalds wrote:
> On Wed, 9 Jul 2025 at 11:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> I absolutely detest the notion of "let's make locking be tied to
>> object lifetimes".
> Side note: I wonder if there's any way to detect this kind of race in general.
>
> And I suspect it would involve the exact *opposite* of your patch:
> make mutex_unlock() actively cause preemption after it has released
> the lock but before it has done the final accesses.

I think we can do a cond_resched() under CONFIG_DEBUG_MUTEXES and
CONFIG_KASAN. We certainly don't want to do that in a production kernel.
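
Purely as a sketch (the helper name is made up, and it would only be
valid when the unlock is done from a context that is allowed to sleep):

    #include <linux/kconfig.h>
    #include <linux/sched.h>

    /*
     * Hypothetical debug-only preemption point: widen the window between
     * dropping lock->owner and taking lock->wait_lock so KASAN has a
     * better chance of catching a use-after-free on a freed mutex.
     */
    static inline void mutex_unlock_uaf_debug_point(void)
    {
            if (IS_ENABLED(CONFIG_DEBUG_MUTEXES) && IS_ENABLED(CONFIG_KASAN))
                    cond_resched();
    }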

Cheers,
Longman



* Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
  2025-07-09 18:21   ` Linus Torvalds
  2025-07-09 18:28     ` Waiman Long
@ 2025-07-09 18:28     ` Linus Torvalds
  2025-07-09 19:42       ` Waiman Long
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2025-07-09 18:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet, linux-kernel, Jann Horn

On Wed, 9 Jul 2025 at 11:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I suspect it would involve the exact *opposite* of your patch:
> make mutex_unlock() actively cause preemption after it has released
> the lock but before it has done the final accesses.

.. sadly, I suspect we have a ton of mutex_unlock() users in atomic
contexts, so we probably can't do that. It's not like you *should* do
it, but I don't think we've ever disallowed it.

You can't use mutex_unlock from interrupts etc, but you can use it
while holding a spinlock.

           Linus


* Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
  2025-07-09 18:28     ` Linus Torvalds
@ 2025-07-09 19:42       ` Waiman Long
  0 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2025-07-09 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
	Jonathan Corbet, linux-kernel, Jann Horn


On 7/9/25 2:28 PM, Linus Torvalds wrote:
> On Wed, 9 Jul 2025 at 11:21, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> And I suspect it would involve the exact *opposite* of your patch:
>> make mutex_unlock() actively cause preemption after it has released
>> the lock but before it has done the final accesses.
> .. sadly, I suspect we have a ton of mutex_unlock() users in atomic
> contexts, so we probably can't do that. It's not like you *should* do
> it, but I don't think we've ever disallowed it.
>
> You can't use mutex_unlock from interrupts etc, but you can use it
> while holding a spinlock.

I have just sent out another mutex patch to enforce a context switch in 
__mutex_unlock_slowpath() under the right context.

As for this one, you are right that hiding it may not be the best idea. 
So I am going to drop it.

Cheers,
Longman


