From: Harry Yoo <harry.yoo@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Suren Baghdasaryan <surenb@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Phil Auld <pauld@redhat.com>
Subject: Re: [PATCH] slab: fix barn NULL pointer dereference on memoryless nodes
Date: Mon, 13 Oct 2025 10:21:36 +0900
Message-ID: <aOxUIDOuOSI743sR@hyeyoo>
In-Reply-To: <20251011-null-barn-fix-v1-1-5fe5af5b8fd8@suse.cz>

On Sat, Oct 11, 2025 at 10:45:41AM +0200, Vlastimil Babka wrote:
> Phil reported a boot failure once sheaves became used in commits
> 59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
> 3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000040
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: Oops: 0000 [#1] SMP NOPTI
>  CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
>  Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
>  RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>  Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
>  RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
>  RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
>  RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
>  R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
>  R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
>  FS:  0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
>  Call Trace:
>   <TASK>
>   ? srso_return_thunk+0x5/0x5f
>   ? vm_area_alloc+0x1e/0x60
>   kmem_cache_alloc_noprof+0x4ec/0x5b0
>   vm_area_alloc+0x1e/0x60
>   create_init_stack_vma+0x26/0x210
>   alloc_bprm+0x139/0x200
>   kernel_execve+0x4a/0x140
>   call_usermodehelper_exec_async+0xd0/0x190
>   ? __pfx_call_usermodehelper_exec_async+0x10/0x10
>   ret_from_fork+0xf0/0x110
>   ? __pfx_call_usermodehelper_exec_async+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Modules linked in:
>  CR2: 0000000000000040
>  ---[ end trace 0000000000000000 ]---
>  RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>  Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
>  RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
>  RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
>  RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
>  R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
>  R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
>  FS:  0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
>  Kernel panic - not syncing: Fatal exception
>  Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
> that memory is only on 2 of them."
> 
>  # numactl --hardware
>  available: 8 nodes (0-7)
>  node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
>  node 1 size: 31584 MB
>  node 1 free: 30397 MB
>  node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
>  node 2 size: 0 MB
>  node 2 free: 0 MB
>  node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
>  node 3 size: 0 MB
>  node 3 free: 0 MB
>  node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
>  node 4 size: 0 MB
>  node 4 free: 0 MB
>  node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
>  node 5 size: 32214 MB
>  node 5 free: 31625 MB
>  node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
>  node 6 size: 0 MB
>  node 6 free: 0 MB
>  node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
>  node 7 size: 0 MB
>  node 7 free: 0 MB
> 
> Linus decoded the stacktrace to get_barn() and get_node() and determined
> that kmem_cache->node[numa_mem_id()] is NULL.
> 
> The problem is due to a wrong assumption that memoryless nodes only
> exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
> points to the nearest node that has memory. SLUB allocates its
> kmem_cache_node structures only on nodes with memory, and does the
> same for struct node_barn.

Right, even without CONFIG_HAVE_MEMORYLESS_NODES, some nodes may not
have N_MEMORY state (and thus no s->node[nid] allocated) if there's
no memory attached to them.
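
To make that concrete, here is a minimal userspace sketch (not kernel
code) of the situation on Phil's machine, where only nodes 1 and 5 have
memory. All toy_* names are hypothetical stand-ins I made up for
illustration: toy_memory_mask loosely models the N_MEMORY node state and
toy_cache models only the node[] array of struct kmem_cache. The point
is just that per-node data exists only for nodes in the mask, so a
lookup for any other node id yields NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_NODES 8

/* Hypothetical stand-in for the per-node structure (kmem_cache_node). */
struct toy_node {
	int nr_partial;
};

/* Hypothetical stand-in for struct kmem_cache; only node[] is modelled. */
struct toy_cache {
	struct toy_node *node[TOY_MAX_NODES];
};

/* Bit n set means node n has memory; nodes 1 and 5 as in the report. */
static const unsigned int toy_memory_mask = (1u << 1) | (1u << 5);

static int toy_node_has_memory(int nid)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	return (toy_memory_mask >> nid) & 1u;
}

/* Allocate per-node data only for nodes with memory. Returns 0 on success. */
static int toy_cache_init(struct toy_cache *s)
{
	for (int nid = 0; nid < TOY_MAX_NODES; nid++) {
		s->node[nid] = NULL;
		if (!toy_node_has_memory(nid))
			continue;
		s->node[nid] = calloc(1, sizeof(*s->node[nid]));
		if (!s->node[nid])
			return -1;
	}
	return 0;
}

/* Plain array lookup: NULL for every node that has no memory. */
static struct toy_node *toy_get_node(struct toy_cache *s, int nid)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	return s->node[nid];
}
```

In the numactl output above, CPU 21 (the one that oopsed) sits on node 6,
which has no memory, so in this model toy_get_node(&s, 6) is NULL while
toy_get_node(&s, 1) and toy_get_node(&s, 5) are not.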

> For kmem_cache_node, get_partial_node() checks if get_node() result is
> not NULL, which I assumed was for protection from a bogus node id passed
> to kmalloc_node() but apparently it's also for systems where
> numa_mem_id() (used when no specific node is given) might return a
> memoryless node.
> 
> Fix the sheaves code the same way by checking the result of get_node()
> and bailing out if it's NULL. Note that cpus on such memoryless nodes
> will have degraded sheaves performance, which can be improved later,
> preferably by making numa_mem_id() work properly on such systems.
>
> Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
> Reported-and-tested-by: Phil Auld <pauld@redhat.com>
> Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
> Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
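
For readers following along, the shape of the fix as described in the
changelog can be sketched in userspace roughly as below. This is an
illustration under my own assumptions, not the actual patch: the demo_*
types and demo_refill_from_barn() are hypothetical, and the real sheaves
code paths do considerably more. It only shows the pattern of checking
the per-node lookup for NULL and bailing out to the slower path instead
of dereferencing it.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NODES 8

/* Hypothetical stand-in for struct node_barn: a count of full sheaves. */
struct demo_barn {
	int nr_full;
};

/* Hypothetical per-node structure that owns the barn. */
struct demo_node {
	struct demo_barn *barn;
};

/* Hypothetical cache; node[nid] stays NULL for memoryless nodes. */
struct demo_cache {
	struct demo_node *node[DEMO_MAX_NODES];
};

/* Return the barn for a node, or NULL if the node has no per-node data. */
static struct demo_barn *demo_get_barn(struct demo_cache *s, int nid)
{
	struct demo_node *n;

	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	n = s->node[nid];
	if (!n)
		return NULL;	/* memoryless node: no barn to use */
	return n->barn;
}

/*
 * Try to take one full sheaf from the local barn.
 * Returns 1 on success, 0 if the caller must fall back to the slow path.
 */
static int demo_refill_from_barn(struct demo_cache *s, int nid)
{
	struct demo_barn *barn = demo_get_barn(s, nid);

	if (!barn)
		return 0;	/* bail out instead of dereferencing NULL */
	if (barn->nr_full == 0)
		return 0;
	barn->nr_full--;
	return 1;
}
```

With only node[1] populated, demo_refill_from_barn(&s, 6) simply returns
0, which corresponds to the degraded (but no longer crashing) behaviour
the changelog mentions for cpus on memoryless nodes.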

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon

