PROBLEM: Kernel 6.17 newly deadlocks futex

* PROBLEM: Kernel 6.17 newly deadlocks futex
@ 2025-12-19 10:02 Florian Albertz
  2025-12-19 20:07 ` Thomas Gleixner
  0 siblings, 1 reply; 4+ messages in thread
From: Florian Albertz @ 2025-12-19 10:02 UTC
  To: tglx, mingo; +Cc: linux-kernel

Hi everyone,

A program of mine started deadlocking on kernel 6.17: it hangs in a
FUTEX_WAIT_PRIVATE call.

Now first off, due to factors outside of my control, I am using futexes with
the FUTEX_PRIVATE_FLAG while also working with child processes which aren't
spawned with CLONE_THREAD. They are however created with CLONE_VM.

This did work before (and works now, excluding the specific edge case demonstrated
below), but I would understand this not being fixed as FUTEX_PRIVATE_FLAG
is documented to be specifically about threaded programs. I would be very happy
if the previous behaviour could be restored though. Ideally with FUTEX_PRIVATE_FLAG
being documented to work as long as processes run in the same memory space.

But about the actual deadlock. The following program completes execution on
a released 6.16.10 kernel on x86_64. On kernel 6.17.9 as well as 6.18.1 it deadlocks.
Tested kernels are from the official archlinux repositories:

---
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static uint32_t *fut;

static int noop(void *arg) { return 0; }

static int child(void *arg) {
    // It is important this call to create a thread happens between
    // the wait and wake calls.
    //
    // Due to the new behavior around `need_futex_hash_allocate_defaults`,
    // the first clone which includes CLONE_THREAD (CLONE_VM is not enough)
    // results in a change in how futex hashes are calculated.
    clone(noop, malloc(STACK_SIZE) + STACK_SIZE,
            CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL, NULL, NULL);

    // So this now works with another hash and therefore does not wake the main
    // process.
    *fut = 1;
    syscall(SYS_futex, fut, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);

    return 0;
}

int main(int argc, char *argv[]) {
    fut = calloc(1, sizeof(*fut));

    // Now we create a new process sharing virtual memory but crucially without
    // specifying CLONE_THREAD.
    clone(child, malloc(STACK_SIZE) + STACK_SIZE, CLONE_VM, NULL, NULL, NULL);

    // And now this futex wait never wakes from kernel 6.17 onwards.
    syscall(SYS_futex, fut, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
}
---

I realise for a fully reliable reproduction there would probably be more synchronization required,
but I hope the above is enough to demonstrate the problem. Same goes for error handling etc.
Also apologies for any other things causing confusion with the above code, I think this
reproduction may be the first C code I have written in years.
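
One way to make such a reproduction report the hang instead of blocking
forever is to bound the wait. The following is only a minimal sketch and not
part of the program above: `wait_for_flag` and its millisecond timeout are
hypothetical names chosen for illustration. It relies on FUTEX_WAIT accepting
a relative timeout, and it re-checks the futex word in a loop so that a store
which happened before the wait is not missed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until *addr becomes non-zero. Each individual
 * futex wait gives up after timeout_ms milliseconds.
 * Returns 0 once the flag is set, -1 with errno set otherwise
 * (ETIMEDOUT if the wake never arrived).
 */
static int wait_for_flag(uint32_t *addr, long timeout_ms)
{
    struct timespec ts = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    /* Re-check the value on every iteration to cope with spurious wakeups. */
    while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) == 0) {
        long rc = syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, 0, &ts, NULL, 0);

        /* EAGAIN (value already changed) and EINTR (signal) just retry. */
        if (rc == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }
    return 0;
}
```

With this, main could call wait_for_flag(fut, 1000) and print a message when
it returns -1, which turns the silent hang into a visible failure.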

The issue does not occur if any process with CLONE_THREAD was created before the wait.
It does not occur if no process with CLONE_THREAD is created at all. And the code also
works as expected if the FUTEX_PRIVATE_FLAG is omitted.
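
The last observation suggests a stopgap on affected kernels: use the
non-private operations for any futex that is shared across such a process
boundary. A minimal sketch, assuming the caller controls the futex calls
directly; the wrapper names `futex_wait_shared` and `futex_wake_shared` are
made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrappers around the non-private futex operations.
 * Both return the raw syscall result: -1 with errno set on failure.
 */

/* Block while *addr still holds `expected`; fails with EAGAIN if it does not. */
static long futex_wait_shared(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to `nr` waiters on addr; returns the number of waiters woken. */
static long futex_wake_shared(uint32_t *addr, int nr)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nr, NULL, NULL, 0);
}
```

In the program above, main would call futex_wait_shared(fut, 0) and child
would call futex_wake_shared(fut, 1). This does not help when the futex calls
are issued by a library that hardcodes the _PRIVATE variants.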

Thank you for your time and work on the kernel, I'll gladly provide any further info you need.
Greetings and happy holidays,

Florian A.

* Re: PROBLEM: Kernel 6.17 newly deadlocks futex
  2025-12-19 10:02 PROBLEM: Kernel 6.17 newly deadlocks futex Florian Albertz
@ 2025-12-19 20:07 ` Thomas Gleixner
  2026-01-09 16:56   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Gleixner @ 2025-12-19 20:07 UTC
  To: Florian Albertz, mingo
  Cc: linux-kernel, Sebastian Andrzej Siewior, Peter Zijlstra

On Fri, Dec 19 2025 at 11:02, Florian Albertz wrote:
> static int child(void *arg) {
>     // It is important this call to create a thread happens between
>     // the wait and wake calls.
>     //
>     // Due to the new behavior around `need_futex_hash_allocate_defaults`,
>     // the first clone which includes CLONE_THREAD (CLONE_VM is not enough)
>     // results in a change in how futex hashes are calculated.

The problem is not this one.

>     clone(noop, malloc(STACK_SIZE) + STACK_SIZE,
>             CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL, NULL, NULL);
>
>     // So this now works with another hash and therefore does not wake the main
>     // process.
>     *fut = 1;
>     syscall(SYS_futex, fut, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
>
>     return 0;
> }
>
> int main(int argc, char *argv[]) {
>     fut = calloc(1, sizeof(*fut));
>
>     // Now we create a new process sharing virtual memory but crucially without
>     // specifying CLONE_THREAD.

The problem is here because the condition for hash allocation is too
tight. The private hash is bound to the MM, which is shared with CLONE_VM,
so the clone has to install a private hash despite creating a process
and not a thread.

>     clone(child, malloc(STACK_SIZE) + STACK_SIZE, CLONE_VM, NULL, NULL, NULL);
>
>     // And now this futex wait never wakes from kernel 6.17 onwards.
>     syscall(SYS_futex, fut, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
> }

The below should fix that. It's not completely correct because the
resulting hash sizing looks at current->signal->threads. As signal is
not shared, each resulting process accounts only for its own threads.
Fixing that needs some more thought.

Thanks,

        tglx
---
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1948,11 +1948,9 @@ static void rv_task_fork(struct task_str
 #define rv_task_fork(p) do {} while (0)
 #endif
 
-static bool need_futex_hash_allocate_default(u64 clone_flags)
+static inline bool need_futex_hash_allocate_default(u64 clone_flags)
 {
-	if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
-		return false;
-	return true;
+	return !!(clone_flags & CLONE_VM);
 }
 
 /*
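
To see which clone() flag combinations are affected, the two conditions from
the patch above can be mirrored in user space. This is an illustrative sketch
only: `need_hash_old` and `need_hash_relaxed` are hypothetical names, and the
functions merely restate the removed and added lines of the diff using the
flag constants from <sched.h>.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition removed by the patch: both CLONE_THREAD and CLONE_VM must be set. */
static bool need_hash_old(uint64_t clone_flags)
{
    return (clone_flags & (CLONE_THREAD | CLONE_VM)) ==
           (CLONE_THREAD | CLONE_VM);
}

/* Condition added by the patch: sharing the MM alone is sufficient. */
static bool need_hash_relaxed(uint64_t clone_flags)
{
    return (clone_flags & CLONE_VM) != 0;
}
```

For the first clone() in the reproducer (CLONE_VM only), need_hash_old()
returns false while need_hash_relaxed() returns true. For the second one
(CLONE_VM | CLONE_SIGHAND | CLONE_THREAD), both return true.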

* Re: PROBLEM: Kernel 6.17 newly deadlocks futex
  2025-12-19 20:07 ` Thomas Gleixner
@ 2026-01-09 16:56   ` Sebastian Andrzej Siewior
  2026-01-19 18:24     ` Florian Albertz
  0 siblings, 1 reply; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-01-09 16:56 UTC
  To: Thomas Gleixner, Florian Albertz; +Cc: mingo, linux-kernel, Peter Zijlstra

On 2025-12-19 21:07:13 [+0100], Thomas Gleixner wrote:
> On Fri, Dec 19 2025 at 11:02, Florian Albertz wrote:
…
> >     clone(child, malloc(STACK_SIZE) + STACK_SIZE, CLONE_VM, NULL, NULL, NULL);
> >
> >     // And now this futex wait never wakes from kernel 6.17 onwards.
> >     syscall(SYS_futex, fut, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
> > }
> 
> The below should fix that. It's not completely correct because the
> resulting hash sizing looks at current->signal->threads. As signal is
> not shared, each resulting process accounts only for its own threads.
> Fixing that needs some more thought.

I'm not sure whether I'm mixing things up or whether this was based on an
earlier version where things were different, but if I'm right, PeterZ
said that anyone who uses CLONE_VM without CLONE_THREAD gets to keep the
pieces.

Using only CLONE_VM is okay (well, it is not, but it is not causing the
problem here). Using CLONE_VM for some clone() invocations and CLONE_VM
+ CLONE_THREAD for others is what causes the problem.
Who is doing this? Some exotic early container runtime?

CLONE_VM without CLONE_THREAD is common with CLONE_VFORK and in this
case we don't want to create the private hash.
I'm not sure if it is worth the effort. The inaccurate
get_nr_threads() shouldn't be a problem given the situation. I would
suggest limiting it to "CLONE_THREAD | CLONE_VM" or "!CLONE_THREAD &&
CLONE_VM" if we really want to support this.
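
One possible reading of this suggestion, combined with the vfork remark
above, is sketched below in user space. It is an interpretation and not a
proposed patch: the name `need_hash_skip_vfork` and the explicit CLONE_VFORK
test are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: allocate the private hash whenever the MM is
 * shared, except for the common vfork-style case (CLONE_VM together with
 * CLONE_VFORK and without CLONE_THREAD).
 */
static bool need_hash_skip_vfork(uint64_t clone_flags)
{
    if (!(clone_flags & CLONE_VM))
        return false;
    if (clone_flags & CLONE_THREAD)
        return true;
    return !(clone_flags & CLONE_VFORK);
}
```

Under this reading, a plain CLONE_VM clone such as the one in the reproducer
would get a private hash, while CLONE_VM | CLONE_VFORK would not.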

> Thanks,
> 
>         tglx

Sebastian

* Re: PROBLEM: Kernel 6.17 newly deadlocks futex
  2026-01-09 16:56   ` Sebastian Andrzej Siewior
@ 2026-01-19 18:24     ` Florian Albertz
  0 siblings, 0 replies; 4+ messages in thread
From: Florian Albertz @ 2026-01-19 18:24 UTC
  To: Sebastian Andrzej Siewior, Thomas Gleixner
  Cc: mingo, linux-kernel, Peter Zijlstra

On Fri, Jan 9, 2026, at 17:56, Sebastian Andrzej Siewior wrote:
> Who is doing this? Some exotic early container runtime?

Personally, I built an abstraction that behaves much like Rust's threads while allowing the namespaces to be configured for each of those "threads". This makes working with namespaces far more ergonomic, because changing namespaces is no longer an all-or-nothing proposition for a binary.

We can run a single function in another network namespace without having to exec an entirely different binary for example. And ideally we can still use in-process synchronization primitives to make this properly useful.
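
A rough sketch of what such a "fake thread" could look like in C follows. It
is not the actual abstraction described here: `run_in_namespaces`, `store_42`
and `FAKE_THREAD_STACK` are hypothetical names, and the helper simply blocks
until the child exits. Passing a namespace flag such as CLONE_NEWNET in
ns_flags typically requires CAP_SYS_ADMIN; with ns_flags set to 0 the child
stays in the caller's namespaces.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FAKE_THREAD_STACK (1024 * 1024)

/* Example callback: writes through the shared address space, exits with 7. */
static int store_42(void *arg)
{
    *(int *)arg = 42;
    return 7;
}

/*
 * Hypothetical helper: run fn(arg) in a child process that shares our
 * address space (CLONE_VM) and optionally lives in other namespaces.
 * Returns the child's exit status, or -1 on error.
 */
static int run_in_namespaces(int (*fn)(void *), void *arg, int ns_flags)
{
    char *stack = malloc(FAKE_THREAD_STACK);
    if (!stack)
        return -1;

    /* The stack grows down on x86_64, so pass the top of the allocation. */
    pid_t pid = clone(fn, stack + FAKE_THREAD_STACK,
                      CLONE_VM | SIGCHLD | ns_flags, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* Wait for the child before freeing its stack; retry on EINTR. */
    int status = 0;
    pid_t waited;
    do {
        waited = waitpid(pid, &status, 0);
    } while (waited == -1 && errno == EINTR);

    free(stack);
    if (waited != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_in_namespaces(store_42, &value, 0) runs the callback in a
CLONE_VM child without changing namespaces and returns its exit status.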

The _PRIVATE flags come into the picture because they are used by Rust's standard library, and it would be nice if we could use the language's default synchronization primitives. That way, other code can be mostly oblivious to whether it is used across such a fake-thread boundary or not.

I get that this is probably a pretty weird way to use these APIs, but it IS incredibly useful when having to deal with namespaces a lot. And you guessed correctly: in my case this happens to be in container tooling.

>> Thanks,
>> 
>>         tglx
>
> Sebastian

Thanks to both of you,
Florian A.
