linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] locking/urgent for v6.17-rc1
@ 2025-08-09 18:02 Borislav Petkov
  2025-08-10  5:58 ` pr-tracker-bot
  2025-08-21 18:19 ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: Borislav Petkov @ 2025-08-09 18:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: x86-ml, lkml

Hi Linus,

please pull a locking/urgent fix for v6.17-rc1.

Thx.

---

The following changes since commit 98e8f2c0e0930feee6a2538450c74d9d7de0a9cc:

  Merge tag 'x86-platform-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2025-07-29 20:05:06 -0700)

are available in the Git repository at:

  ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip tags/locking_urgent_for_v6.17_rc1

for you to fetch changes up to e703b7e247503b8bf87b62c02a4392749b09eca8:

  futex: Move futex cleanup to __mmdrop() (2025-08-02 15:11:52 +0200)

----------------------------------------------------------------
- Prevent a futex hash leak due to different mm lifetimes

----------------------------------------------------------------
Thomas Gleixner (1):
      futex: Move futex cleanup to __mmdrop()

 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-09 18:02 [GIT PULL] locking/urgent for v6.17-rc1 Borislav Petkov
@ 2025-08-10  5:58 ` pr-tracker-bot
  2025-08-21 18:19 ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: pr-tracker-bot @ 2025-08-10  5:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linus Torvalds, x86-ml, lkml

The pull request you sent on Sat, 9 Aug 2025 20:02:07 +0200:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip tags/locking_urgent_for_v6.17_rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8e8f6b635fae254252f7f52dd3e79fb68d06c332

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-09 18:02 [GIT PULL] locking/urgent for v6.17-rc1 Borislav Petkov
  2025-08-10  5:58 ` pr-tracker-bot
@ 2025-08-21 18:19 ` Linus Torvalds
  2025-08-21 19:45   ` Sean Christopherson
  2025-08-22 10:57   ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2025-08-21 18:19 UTC (permalink / raw)
  To: Borislav Petkov, Thomas Gleixner, Sebastian Andrzej Siewior,
	Peter Zijlstra
  Cc: x86-ml, lkml

On Sat, 9 Aug 2025 at 14:02, Borislav Petkov <bp@alien8.de> wrote:
>
> please pull a locking/urgent fix for v6.17-rc1.

Ok, so this clearly wasn't a fix.

> Thomas Gleixner (1):
>       futex: Move futex cleanup to __mmdrop()

So this causes problems, because __mmdrop is not done in thread
context, and the kvfree() calls then cause issues:

  https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
  https://lore.kernel.org/all/20250818131902.5039-1-hdanton@sina.com/

Hilf Danton sent out a patch, but honestly, that patch looks like pure
bandaid, and will make the exit path horribly much slower by moving
things into workqueues. It might not be visible in profiles exactly
*because* it's then hidden in workqueues, but it's not great.

I think it's a mistake to allow vmalloc'ing those hashes in the first
place, and I suggest the local hash be size-limited to the point where
it's just a kmalloc() and thus works in all contexts.

Or maybe the mistake was the mm-private hashing in the first place.
Maybe that hash shouldn't be allocated at mm_alloc() ->
futex_mm_init() at all. Only initialized by the futex code when
needed, and then dropped in exit_mmap().

So the problems seem deeper than just "free'd in the wrong context".

           Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-21 18:19 ` Linus Torvalds
@ 2025-08-21 19:45   ` Sean Christopherson
  2025-08-22 14:16     ` Sebastian Andrzej Siewior
  2025-08-22 10:57   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-08-21 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Thomas Gleixner, Sebastian Andrzej Siewior,
	Peter Zijlstra, x86-ml, lkml

On Thu, Aug 21, 2025, Linus Torvalds wrote:
> On Sat, 9 Aug 2025 at 14:02, Borislav Petkov <bp@alien8.de> wrote:
> >
> > please pull a locking/urgent fix for v6.17-rc1.
> 
> Ok, so this clearly wasn't a fix.
> 
> > Thomas Gleixner (1):
> >       futex: Move futex cleanup to __mmdrop()
> 
> So this causes problems, because __mmdrop is not done in thread
> context, and the kvfree() calls then cause issues:
> 
>   https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
>   https://lore.kernel.org/all/20250818131902.5039-1-hdanton@sina.com/
> 
> Hilf Danton sent out a patch, but honestly, that patch looks like pure
> bandaid, and will make the exit path horribly much slower by moving
> things into workqueues. It might not be visible in profiles exactly
> *because* it's then hidden in workqueues, but it's not great.
> 
> I think it's a mistake to allow vmalloc'ing those hashes in the first
> place, and I suggest the local hash be size-limited to the point where
> it's just a kmalloc() and thus works in all contexts.
> 
> Or maybe the mistake was the mm-private hashing in the first place.
> Maybe that hash shouldn't be allocated at mm_alloc() ->
> futex_mm_init() at all. Only initialized by the futex code when
> needed, and then dropped in exit_mmap().
> 
> So the problems seem deeper than just "free'd in the wrong context".

Piggybacking the futex private hashing attention, the new fanciness is causing
crashes in my testing.  The crashes are 100% reproducible, but my reproducer is
simply running a variety of tests in parallel, i.e. isn't very debug-friendly,
and the code itself is black magic to me, so all I've done is bisect.

I reported the issue on the original thread, but haven't seen any follow-up.

https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-21 18:19 ` Linus Torvalds
  2025-08-21 19:45   ` Sean Christopherson
@ 2025-08-22 10:57   ` Sebastian Andrzej Siewior
  2025-08-22 14:12     ` [PATCH] futex: Move futex_hash_free() back to __mmput() Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-22 10:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, x86-ml, lkml

On 2025-08-21 14:19:31 [-0400], Linus Torvalds wrote:
> On Sat, 9 Aug 2025 at 14:02, Borislav Petkov <bp@alien8.de> wrote:
> >
> > please pull a locking/urgent fix for v6.17-rc1.
> 
> Ok, so this clearly wasn't a fix.
> 
> > Thomas Gleixner (1):
> >       futex: Move futex cleanup to __mmdrop()
> 
> So this causes problems, because __mmdrop is not done in thread
> context, and the kvfree() calls then cause issues:
> 
>   https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
>   https://lore.kernel.org/all/20250818131902.5039-1-hdanton@sina.com/
> 
> Hilf Danton sent out a patch, but honestly, that patch looks like pure
> bandaid, and will make the exit path horribly much slower by moving
> things into workqueues. It might not be visible in profiles exactly
> *because* it's then hidden in workqueues, but it's not great.

vfree() has an in_interrupt() check. Extending it to an irq-check would
check this but not a section where a spin_lock_t is held since we can't
check for disabled preemption. And therefore another band aid.

> I think it's a mistake to allow vmalloc'ing those hashes in the first
> place, and I suggest the local hash be size-limited to the point where
> it's just a kmalloc() and thus works in all contexts.

default auto scaling with 512 CPUs without lockdep maxes out
	64 + (64 * 512 * 4) = 128KiB + 64.

For kmalloc() the slab cache is used up to 8KiB which crosses the limit
with 32 CPUs. Then we have kmalloc()'s max allocation limit which is at
4MiB.

Given that Jakub hits the warning after sometime might indicate that it
was possible to fulfill the requirement initially but over time the
memory became too fragmented. So it allocated virtual memory.

> Or maybe the mistake was the mm-private hashing in the first place.
> Maybe that hash shouldn't be allocated at mm_alloc() ->
> futex_mm_init() at all. Only initialized by the futex code when
> needed, and then dropped in exit_mmap().

mm_alloc() is fine. This does only an alloc_percpu() and its counter part
can be used in atomic context. So the hash pointer should end up in
kfree(NULL).
Let me stare at the initial report leading to the fix. Maybe we can
avoid the leak and the atomic context altogether.

> So the problems seem deeper than just "free'd in the wrong context".
> 
>            Linus

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] futex: Move futex_hash_free() back to __mmput()
  2025-08-22 10:57   ` Sebastian Andrzej Siewior
@ 2025-08-22 14:12     ` Sebastian Andrzej Siewior
  2025-08-27 11:34       ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
  2025-08-31 12:23       ` tip-bot2 for Sebastian Andrzej Siewior
  0 siblings, 2 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-22 14:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, x86-ml, lkml,
	Jakub Kicinski

To avoid a memory leak via mm_alloc() + mmdrop() the futex cleanup code
has been moved to __mmdrop(). This resulted in a warnings if the futex
hash table has been allocated via vmalloc() the mmdrop() was invoked
from atomic context.
The free path must stay in __mmput() to ensure it is invoked from
preemptible context.

In order to avoid the memory leak, delay the allocation of
mm_struct::mm->futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
  acquire reference if there is no private hash yet. There is also no
  reference put.

- Regular callers (futex_hash()) fallback to global hash. No reference
  counting here.

The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread so there can be either futex operation in progress
or the request to create a private hash.

Move futex_hash_free() back to __mmput();
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().

Fixes:  e703b7e247503 ("futex: Move futex cleanup to __mmdrop()")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/fork.c       |  2 +-
 kernel/futex/core.c | 15 +++++++++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499dc..c4ada32598bd5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -689,7 +689,6 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
-	futex_hash_free(mm);
 
 	free_mm(mm);
 }
@@ -1138,6 +1137,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	lru_gen_del_mm(mm);
+	futex_hash_free(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c5..fb63c13aa66fc 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1724,10 +1724,6 @@ int futex_mm_init(struct mm_struct *mm)
 	/* futex-ref */
 	atomic_long_set(&mm->futex_atomic, 0);
 	mm->futex_batches = get_state_synchronize_rcu();
-	mm->futex_ref = alloc_percpu(unsigned int);
-	if (!mm->futex_ref)
-		return -ENOMEM;
-	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
 	return 0;
 }
 
@@ -1801,6 +1797,17 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 		}
 	}
 
+	if (!mm->futex_ref) {
+		/*
+		 * This will always be allocated by the first thread and
+		 * therefore requires no locking.
+		 */
+		mm->futex_ref = alloc_percpu(unsigned int);
+		if (!mm->futex_ref)
+			return -ENOMEM;
+		this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+	}
+
 	fph = kvzalloc(struct_size(fph, queues, hash_slots),
 		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-21 19:45   ` Sean Christopherson
@ 2025-08-22 14:16     ` Sebastian Andrzej Siewior
  2025-08-23  0:28       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-22 14:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Linus Torvalds, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	x86-ml, lkml

On 2025-08-21 12:45:52 [-0700], Sean Christopherson wrote:
> Piggybacking the futex private hashing attention, the new fanciness is causing
> crashes in my testing.  The crashes are 100% reproducible, but my reproducer is
> simply running a variety of tests in parallel, i.e. isn't very debug-friendly,
> and the code itself is black magic to me, so all I've done is bisect.
> 
> I reported the issue on the original thread, but haven't seen any follow-up.
> 
> https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com

I somehow missed it. Can you try rc2 with the patch I just sent? 

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-22 14:16     ` Sebastian Andrzej Siewior
@ 2025-08-23  0:28       ` Sean Christopherson
  2025-08-25 16:04         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-08-23  0:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Linus Torvalds, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	x86-ml, lkml

On Fri, Aug 22, 2025, Sebastian Andrzej Siewior wrote:
> On 2025-08-21 12:45:52 [-0700], Sean Christopherson wrote:
> > Piggybacking the futex private hashing attention, the new fanciness is causing
> > crashes in my testing.  The crashes are 100% reproducible, but my reproducer is
> > simply running a variety of tests in parallel, i.e. isn't very debug-friendly,
> > and the code itself is black magic to me, so all I've done is bisect.
> > 
> > I reported the issue on the original thread, but haven't seen any follow-up.
> > 
> > https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> 
> I somehow missed it. Can you try rc2 with the patch I just sent? 

No dice, fails with the same signature.

I got a trimmed down reproduer.  Load KVM, run this in the background (in a loop)
to constantly trigger try_to_wake_up() on relevant tasks (needs to be run as root):

  echo Y > /sys/module/kvm/parameters/nx_huge_pages
  echo N > /sys/module/kvm/parameters/nx_huge_pages
  sleep .2

and then run the hardware_disable_test KVM selftest (from
tools/testing/selftests/kvm/hardware_disable_test.c).

Strace on hardware_disable_test spewed a whole pile of these

  wait4(32861, 0x7ffc66475dec, WNOHANG, NULL) = 0
  futex(0x7fb735c43000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

immediately before the crash.  I assume it corresponds to this:

		/* Child is still running, keep waiting. */
		if (pid != waitpid(pid, &status, WNOHANG))
			continue;

I also got a new splat on the "WARN_ON_ONCE(ret < 0);" at the end of __futex_ref_atomic_end().
This happened during boot; AFAICT our userspace was setting up cgroups.  In this
case, the system hung and I had to reboot.

  ------------[ cut here ]------------
  WARNING: CPU: 45 PID: 0 at kernel/futex/core.c:1604 futex_ref_rcu+0xbf/0xf0
  Modules linked in: vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd gq(O) sha3_generic
  CPU: 45 UID: 0 PID: 0 Comm: swapper/45 Tainted: G S         O        6.17.0-smp--1278d576b27d-futex #886 NONE 
  Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
  Hardware name: Google LLC Indus/Indus_QC_03, BIOS 30.110.0 09/13/2024
  RIP: 0010:futex_ref_rcu+0xbf/0xf0
  Code: c7 04 0a 00 00 00 00 48 ff c0 eb c2 65 ff 01 89 e8 4c 01 f0 48 ff c0 48 89 c1 f0 48 0f c1 8b 48 01 00 00 48 01 c1 74 06 79 0c <0f> 0b eb 08 48 89 df e8 55 0a f9 ff 48 89 df 5b 41 5e 5d e9 f9 5c
  RSP: 0018:ffffa43c8d440ec8 EFLAGS: 00010286
  RAX: 8000000000000000 RBX: ffff933782245080 RCX: ffffffffffffffff
  RDX: 0000000000000060 RSI: 0000000000000060 RDI: ffffffffac840520
  RBP: 0000000000000000 R08: ffff933680044d00 R09: ffffffff00000000
  R10: ffff9365c9b59e00 R11: ffff9365c9b59e00 R12: ffffffffab77ac10
  R13: ffff9337822451b8 R14: 7fffffffffffffff R15: ffff9365c749de00
  FS:  0000000000000000(0000) GS:ffff9395514f2000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fad8a21cf38 CR3: 00000055c062b002 CR4: 00000000007706f0
  PKRU: 55555554
  Call Trace:
   <IRQ>
   rcu_do_batch+0x250/0x7e0
   rcu_core+0x12f/0x230
   handle_softirqs+0xc8/0x280
   __irq_exit_rcu+0x48/0x100
   sysvec_apic_timer_interrupt+0x74/0x80
   </IRQ>
   <TASK>
   asm_sysvec_apic_timer_interrupt+0x1a/0x20
  RIP: 0010:cpuidle_enter_state+0xfb/0x290
  Code: bb f6 ff ff 49 89 c4 8b 73 04 bf ff ff ff ff e8 9b 68 d8 ff 31 ff e8 f4 32 48 ff 80 7c 24 04 00 74 05 e8 c8 68 d8 ff fb 85 ed <0f> 88 ba 00 00 00 89 e9 48 6b f9 68 4c 8b 44 24 08 49 8b 54 38 30
  RSP: 0018:ffffa43c803d3e80 EFLAGS: 00000206
  RAX: ffff9395514f2000 RBX: ffff9394ff776548 RCX: 000000000000001f
  RDX: 000000000018ec50 RSI: 000000000000002d RDI: 0000000000000000
  RBP: 0000000000000003 R08: 0000000000000002 R09: 0000000000000002
  R10: 00000000000003dc R11: 0000000000000389 R12: 00000010fb32644d
  R13: 00000010fb2333f7 R14: ffffffffad276f68 R15: 0000000000000003
   cpuidle_enter+0x2c/0x40
   do_idle+0x1ac/0x250
   cpu_startup_entry+0x2a/0x30
   start_secondary+0x80/0x80
   common_startup_64+0x13e/0x140
   </TASK>
  ---[ end trace 0000000000000000 ]---

Heh, and two more when booting a different system.  Guess it's my lucky day.
This time whatever went sideways didn't appear to be fatal as the system booted
and I could ssh in.  One is the same WARN as above, and the second WARN on the
system hit the

  WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);

in futex_hash_allocate().

  ------------[ cut here ]------------
  WARNING: CPU: 120 PID: 11779 at kernel/futex/core.c:1553 futex_hash_allocate+0x436/0x450
  Modules linked in: vfat fat ccp k10temp i2c_piix4 cdc_acm xhci_pci xhci_hcd gq(O) sha3_generic
  CPU: 120 UID: 0 PID: 11779 Comm: borglet Tainted: G     U  W  O        6.17.0-smp--1278d576b27d-futex #886 NONE 
  Tainted: [U]=USER, [W]=WARN, [O]=OOT_MODULE
  Hardware name: Google, Inc.                                                       Arcadia_IT_80/Arcadia_IT_80, BIOS 34.64.2-0 12/26/2024
  RIP: 0010:futex_hash_allocate+0x436/0x450
  Code: 31 ff 65 48 8b 05 ba bc ae 02 48 3b 44 24 48 75 20 44 89 f8 48 83 c4 50 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b e9 9d fe ff ff <0f> 0b e9 c0 fe ff ff e8 ce 99 af 00 66 66 66 66 66 2e 0f 1f 84 00
  RSP: 0018:ffffbbc0f1237d10 EFLAGS: 00010286
  RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffa25747532180
  RDX: 0000000000000400 RSI: 000000000000ffc0 RDI: 00000000000039b8
  RBP: ffffa296a2620000 R08: 00000000004029c0 R09: 00000000ffffffff
  R10: 00000000ffffffff R11: 0000000000010040 R12: ffffa2571336b700
  R13: ffffa2571336b600 R14: ffffa2571336b600 R15: ffffa296b9270000
  FS:  00007f6bbd3297c0(0000) GS:ffffa2d5a31b2000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f6bae810f38 CR3: 00000001330a4001 CR4: 0000000000770ef0
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? cgroup_can_fork+0x258/0x420
   copy_process+0xae3/0xff0
   kernel_clone+0x99/0x320
   __x64_sys_clone+0xc8/0xf0
   do_syscall_64+0x6f/0x1f0
   ? arch_exit_to_user_mode_prepare+0x9/0x50
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f6bbd466051
  Code: 48 85 ff 74 3d 48 85 f6 74 38 48 83 ee 10 48 89 4e 08 48 89 3e 48 89 d7 4c 89 c2 4d 89 c8 4c 8b 54 24 08 b8 38 00 00 00 0f 05 <48> 85 c0 7c 13 74 01 c3 31 ed 58 5f ff d0 48 89 c7 b8 3c 00 00 00
  RSP: 002b:00007fffff2eda98 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
  RAX: ffffffffffffffda RBX: 00007f6bae812700 RCX: 00007f6bbd466051
  RDX: 00007f6bae8129d0 RSI: 00007f6bae810f30 RDI: 00000000003d0f00
  RBP: 00007fffff2edad0 R08: 00007f6bae812700 R09: 00007f6bae812700
  R10: 00007f6bae8129d0 R11: 0000000000000206 R12: 00007f6bae8129d0
  R13: 00007fffff2edb66 R14: 00007fffff2edbd0 R15: 00007f6bae810f40
   </TASK>
  ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-23  0:28       ` Sean Christopherson
@ 2025-08-25 16:04         ` Sebastian Andrzej Siewior
  2025-08-25 23:55           ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-25 16:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Linus Torvalds, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	x86-ml, lkml

On 2025-08-22 17:28:02 [-0700], Sean Christopherson wrote:
> > > https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> > 
> > I somehow missed it. Can you try rc2 with the patch I just sent? 
> 
> No dice, fails with the same signature.
> 
> I got a trimmed down reproduer.  Load KVM, run this in the background (in a loop)
> to constantly trigger try_to_wake_up() on relevant tasks (needs to be run as root):
> 
>   echo Y > /sys/module/kvm/parameters/nx_huge_pages
>   echo N > /sys/module/kvm/parameters/nx_huge_pages
>   sleep .2
> 
> and then run the hardware_disable_test KVM selftest (from
> tools/testing/selftests/kvm/hardware_disable_test.c).

With this information I was able to reproduce what you had in the link a
the top. I don't know why it happens. It hangs and lockdep isn't happy
with the lock - it seems to be a valid task_struct::pi_lock lock for one
of the kvm-nx-lpage-recovery threads.

I got rid of all free_percpu() and kvfree() in futex/core.c (and leak
memory, yes) and this still happens.
I was able to avoid the crash if I skip the assignment of the second
private hash but it turned out that I was not patient enough.

The strange part here is that the private hash is not used. The private
hash gets allocated and resized because hardware_disable_test creates a
lot of threads. But it is not used, it just sits around and waits to be
cleared.

And it also seems to happen if I tell futex_hash_allocate_default() not
to do anything at all.

kvm-nx-lpage-recovery shares the mm but it grabs a reference.
It might be a coincidence but the task, on which the wakeup chokes,
seems to be gone according to my traces. And with

diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,7 +84,10 @@ static int vhost_task_fn(void *data)
  */
 void vhost_task_wake(struct vhost_task *vtsk)
 {
-	wake_up_process(vtsk->task);
+	mutex_lock(&vtsk->exit_mutex);
+	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
+		wake_up_process(vtsk->task);
+	mutex_unlock(&vtsk->exit_mutex);
 }
 EXPORT_SYMBOL_GPL(vhost_task_wake);
 
it doesn't crash anymore. Could it attempts to wake a task that is gone?

> Strace on hardware_disable_test spewed a whole pile of these
> 
>   wait4(32861, 0x7ffc66475dec, WNOHANG, NULL) = 0
>   futex(0x7fb735c43000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

That is a shared FUTEX and is probably part pthread_join().

> immediately before the crash.  I assume it corresponds to this:
> 
> 		/* Child is still running, keep waiting. */
> 		if (pid != waitpid(pid, &status, WNOHANG))
> 			continue;
> 
> I also got a new splat on the "WARN_ON_ONCE(ret < 0);" at the end of __futex_ref_atomic_end().
> This happened during boot; AFAICT our userspace was setting up cgroups.  In this
> case, the system hung and I had to reboot.

This is odd

>   ------------[ cut here ]------------
>   WARNING: CPU: 45 PID: 0 at kernel/futex/core.c:1604 futex_ref_rcu+0xbf/0xf0
> Heh, and two more when booting a different system.  Guess it's my lucky day.
> This time whatever went sideways didn't appear to be fatal as the system booted
> and I could ssh in.  One is the same WARN as above, and the second WARN on the
> system hit the
> 
>   WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
> 
> in futex_hash_allocate().

This means the counter don't add up after the switch. Not sure how. This
seems to be a random task but it might be part of the previous splat.

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] locking/urgent for v6.17-rc1
  2025-08-25 16:04         ` Sebastian Andrzej Siewior
@ 2025-08-25 23:55           ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-08-25 23:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Linus Torvalds, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	x86-ml, lkml

On Mon, Aug 25, 2025, Sebastian Andrzej Siewior wrote:
> On 2025-08-22 17:28:02 [-0700], Sean Christopherson wrote:
> kvm-nx-lpage-recovery shares the mm but it grabs a reference.
> It might be a coincidence but the task, on which the wakeup chokes,
> seems to be gone according to my traces. And with
> 
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +84,10 @@ static int vhost_task_fn(void *data)
>   */
>  void vhost_task_wake(struct vhost_task *vtsk)
>  {
> -	wake_up_process(vtsk->task);
> +	mutex_lock(&vtsk->exit_mutex);
> +	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
> +		wake_up_process(vtsk->task);
> +	mutex_unlock(&vtsk->exit_mutex);
>  }
>  EXPORT_SYMBOL_GPL(vhost_task_wake);
>  
> it doesn't crash anymore. Could it attempts to wake a task that is gone?

Oh fudge, that indeed is what's happening.

Each VM that KVM creates has a kvm-nx-lpage-recovery task, and KVM wakes all such
tasks across all VMs in response to any change to the hugepage recovery settings,
i.e. when privileged userspace changes any of the associate module params.

KVM holds a global lock when walking the list of VMs and so guarantees the VM
hasn't fully exited, but nothing prevents the recovery task from getting a signal
and exiting long before the VM is destroyed.  hardware_disable_test is (deliberately?)
not very tidy, and exits without explicitly closing the VM and vCPU fds, and so
its recovery task gets terminated via signal instead of by KVM explicitly calling
vhost_task_stop() when the VM is being destroyed.

The basic gist of the above diff works, but unfortunately simply taking
vtsk->exit_mutex in vhost_task_wake() doesn't appear to be an option because the
vhost code appears to have gone through a lot of effort to avoid waking an exited
task.

I think we can also add some sanity checks and hints to help future users of the
vhost task code from running into the same problem.

I'll post a proper series.

Thanks a ton, I owe you a drink of your choice :-)

> > Strace on hardware_disable_test spewed a whole pile of these
> > 
> >   wait4(32861, 0x7ffc66475dec, WNOHANG, NULL) = 0
> >   futex(0x7fb735c43000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
> 
> That is a shared FUTEX and is probably part pthread_join().
> 
> > immediately before the crash.  I assume it corresponds to this:
> > 
> > 		/* Child is still running, keep waiting. */
> > 		if (pid != waitpid(pid, &status, WNOHANG))
> > 			continue;
> > 
> > I also got a new splat on the "WARN_ON_ONCE(ret < 0);" at the end of __futex_ref_atomic_end().
> > This happened during boot; AFAICT our userspace was setting up cgroups.  In this
> > case, the system hung and I had to reboot.
> 
> This is odd
> 
> >   ------------[ cut here ]------------
> >   WARNING: CPU: 45 PID: 0 at kernel/futex/core.c:1604 futex_ref_rcu+0xbf/0xf0
> …
> > Heh, and two more when booting a different system.  Guess it's my lucky day.
> > This time whatever went sideways didn't appear to be fatal as the system booted
> > and I could ssh in.  One is the same WARN as above, and the second WARN on the
> > system hit the
> > 
> >   WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
> > 
> > in futex_hash_allocate().
> 
> This means the counter don't add up after the switch. Not sure how. This
> seems to be a random task but it might be part of the previous splat.

Yeah, IIRC, those only showed up when I kexec'd into a new kernel instead of doing
a normal reboot, so it may have been some weird leftovers and/or PEBKAC?  I'll
file a new bug report if I see either of those warnings again.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip: locking/urgent] futex: Move futex_hash_free() back to __mmput()
  2025-08-22 14:12     ` [PATCH] futex: Move futex_hash_free() back to __mmput() Sebastian Andrzej Siewior
@ 2025-08-27 11:34       ` tip-bot2 for Sebastian Andrzej Siewior
  2025-08-31 12:23       ` tip-bot2 for Sebastian Andrzej Siewior
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-08-27 11:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Jakub Kicinski, Sebastian Andrzej Siewior, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     1b708b38414d32838baa39c9dee59d40731ed202
Gitweb:        https://git.kernel.org/tip/1b708b38414d32838baa39c9dee59d40731ed202
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Fri, 22 Aug 2025 16:12:38 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 27 Aug 2025 13:31:07 +02:00

futex: Move futex_hash_free() back to __mmput()

To avoid a memory leak via mm_alloc() + mmdrop() the futex cleanup code
has been moved to __mmdrop(). This resulted in a warnings if the futex
hash table has been allocated via vmalloc() the mmdrop() was invoked
from atomic context.
The free path must stay in __mmput() to ensure it is invoked from
preemptible context.

In order to avoid the memory leak, delay the allocation of
mm_struct::mm->futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
  acquire reference if there is no private hash yet. There is also no
  reference put.

- Regular callers (futex_hash()) fallback to global hash. No reference
  counting here.

The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread so there can be either futex operation in progress
or the request to create a private hash.

Move futex_hash_free() back to __mmput();
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().

Fixes:  e703b7e247503 ("futex: Move futex cleanup to __mmdrop()")
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
---
 kernel/fork.c       |  2 +-
 kernel/futex/core.c | 15 +++++++++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index af67385..c4ada32 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -689,7 +689,6 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
-	futex_hash_free(mm);
 
 	free_mm(mm);
 }
@@ -1138,6 +1137,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	lru_gen_del_mm(mm);
+	futex_hash_free(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb556..fb63c13 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1724,10 +1724,6 @@ int futex_mm_init(struct mm_struct *mm)
 	/* futex-ref */
 	atomic_long_set(&mm->futex_atomic, 0);
 	mm->futex_batches = get_state_synchronize_rcu();
-	mm->futex_ref = alloc_percpu(unsigned int);
-	if (!mm->futex_ref)
-		return -ENOMEM;
-	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
 	return 0;
 }
 
@@ -1801,6 +1797,17 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 		}
 	}
 
+	if (!mm->futex_ref) {
+		/*
+		 * This will always be allocated by the first thread and
+		 * therefore requires no locking.
+		 */
+		mm->futex_ref = alloc_percpu(unsigned int);
+		if (!mm->futex_ref)
+			return -ENOMEM;
+		this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+	}
+
 	fph = kvzalloc(struct_size(fph, queues, hash_slots),
 		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [tip: locking/urgent] futex: Move futex_hash_free() back to __mmput()
  2025-08-22 14:12     ` [PATCH] futex: Move futex_hash_free() back to __mmput() Sebastian Andrzej Siewior
  2025-08-27 11:34       ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
@ 2025-08-31 12:23       ` tip-bot2 for Sebastian Andrzej Siewior
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-08-31 12:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Jakub Kicinski, Sebastian Andrzej Siewior, Peter Zijlstra (Intel),
	Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     d9b05321e21e4b218de4ce8a590bf375f58b6346
Gitweb:        https://git.kernel.org/tip/d9b05321e21e4b218de4ce8a590bf375f58b6346
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Fri, 22 Aug 2025 16:12:38 +02:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Sun, 31 Aug 2025 11:48:19 +02:00

futex: Move futex_hash_free() back to __mmput()

To avoid a memory leak via mm_alloc() + mmdrop() the futex cleanup code
has been moved to __mmdrop(). This resulted in a warnings if the futex
hash table has been allocated via vmalloc() the mmdrop() was invoked
from atomic context.
The free path must stay in __mmput() to ensure it is invoked from
preemptible context.

In order to avoid the memory leak, delay the allocation of
mm_struct::mm->futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
  acquire reference if there is no private hash yet. There is also no
  reference put.

- Regular callers (futex_hash()) fallback to global hash. No reference
  counting here.

The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread so there can be either futex operation in progress
or the request to create a private hash.

Move futex_hash_free() back to __mmput();
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().

  [ bp: Fold a follow-up fix to prevent a use-after-free:
    https://lore.kernel.org/r/20250830213806.sEKuuGSm@linutronix.de ]

Fixes:  e703b7e247503 ("futex: Move futex cleanup to __mmdrop()")
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
---
 kernel/fork.c       |  2 +-
 kernel/futex/core.c | 16 ++++++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index af67385..c4ada32 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -689,7 +689,6 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
-	futex_hash_free(mm);
 
 	free_mm(mm);
 }
@@ -1138,6 +1137,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	lru_gen_del_mm(mm);
+	futex_hash_free(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb556..125804f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1722,12 +1722,9 @@ int futex_mm_init(struct mm_struct *mm)
 	RCU_INIT_POINTER(mm->futex_phash, NULL);
 	mm->futex_phash_new = NULL;
 	/* futex-ref */
+	mm->futex_ref = NULL;
 	atomic_long_set(&mm->futex_atomic, 0);
 	mm->futex_batches = get_state_synchronize_rcu();
-	mm->futex_ref = alloc_percpu(unsigned int);
-	if (!mm->futex_ref)
-		return -ENOMEM;
-	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
 	return 0;
 }
 
@@ -1801,6 +1798,17 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 		}
 	}
 
+	if (!mm->futex_ref) {
+		/*
+		 * This will always be allocated by the first thread and
+		 * therefore requires no locking.
+		 */
+		mm->futex_ref = alloc_percpu(unsigned int);
+		if (!mm->futex_ref)
+			return -ENOMEM;
+		this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+	}
+
 	fph = kvzalloc(struct_size(fph, queues, hash_slots),
 		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-31 12:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-09 18:02 [GIT PULL] locking/urgent for v6.17-rc1 Borislav Petkov
2025-08-10  5:58 ` pr-tracker-bot
2025-08-21 18:19 ` Linus Torvalds
2025-08-21 19:45   ` Sean Christopherson
2025-08-22 14:16     ` Sebastian Andrzej Siewior
2025-08-23  0:28       ` Sean Christopherson
2025-08-25 16:04         ` Sebastian Andrzej Siewior
2025-08-25 23:55           ` Sean Christopherson
2025-08-22 10:57   ` Sebastian Andrzej Siewior
2025-08-22 14:12     ` [PATCH] futex: Move futex_hash_free() back to __mmput() Sebastian Andrzej Siewior
2025-08-27 11:34       ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
2025-08-31 12:23       ` tip-bot2 for Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).