From: Harry Yoo <harry.yoo@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Suren Baghdasaryan <surenb@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Phil Auld <pauld@redhat.com>
Subject: Re: [PATCH] slab: fix barn NULL pointer dereference on memoryless nodes
Date: Mon, 13 Oct 2025 10:21:36 +0900
Message-ID: <aOxUIDOuOSI743sR@hyeyoo>
In-Reply-To: <20251011-null-barn-fix-v1-1-5fe5af5b8fd8@suse.cz>

On Sat, Oct 11, 2025 at 10:45:41AM +0200, Vlastimil Babka wrote:
> Phil reported a boot failure once sheaves started being used in commits
> 59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
> 3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000040
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: Oops: 0000 [#1] SMP NOPTI
>  CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
>  Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
>  RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>  Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
>  RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
>  RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
>  RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
>  R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
>  R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
>  FS:  0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
>  Call Trace:
>   <TASK>
>   ? srso_return_thunk+0x5/0x5f
>   ? vm_area_alloc+0x1e/0x60
>   kmem_cache_alloc_noprof+0x4ec/0x5b0
>   vm_area_alloc+0x1e/0x60
>   create_init_stack_vma+0x26/0x210
>   alloc_bprm+0x139/0x200
>   kernel_execve+0x4a/0x140
>   call_usermodehelper_exec_async+0xd0/0x190
>   ? __pfx_call_usermodehelper_exec_async+0x10/0x10
>   ret_from_fork+0xf0/0x110
>   ? __pfx_call_usermodehelper_exec_async+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Modules linked in:
>  CR2: 0000000000000040
>  ---[ end trace 0000000000000000 ]---
>  RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
>  Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
>  RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
>  RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
>  RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
>  R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
>  R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
>  FS:  0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
>  Kernel panic - not syncing: Fatal exception
>  Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
> that memory is only on 2 of them."
> 
>  # numactl --hardware
>  available: 8 nodes (0-7)
>  node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
>  node 1 size: 31584 MB
>  node 1 free: 30397 MB
>  node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
>  node 2 size: 0 MB
>  node 2 free: 0 MB
>  node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
>  node 3 size: 0 MB
>  node 3 free: 0 MB
>  node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
>  node 4 size: 0 MB
>  node 4 free: 0 MB
>  node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
>  node 5 size: 32214 MB
>  node 5 free: 31625 MB
>  node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
>  node 6 size: 0 MB
>  node 6 free: 0 MB
>  node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
>  node 7 size: 0 MB
>  node 7 free: 0 MB
> 
> Linus decoded the stacktrace to get_barn() and get_node() and determined
> that kmem_cache->node[numa_mem_id()] is NULL.
> 
> The problem is due to a wrong assumption that memoryless nodes only
> exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
> points to the nearest node that has memory. SLUB has been allocating its
> kmem_cache_node structures only on nodes with memory and so it does with
> struct node_barn.

Right, even without CONFIG_HAVE_MEMORYLESS_NODES, some nodes may not
have N_MEMORY state (and thus no s->node[nid] allocated) if there's
no memory attached to them.
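
To illustrate, here is a minimal userspace sketch of that situation. It is
not the mm/slub.c code: the structures are simplified stand-ins,
MAX_NUMNODES is set to 8 only to mirror the reported machine, and
get_barn_node() and can_refill_from_barn() are hypothetical helper names.
It only shows how a lookup can bail out when s->node[nid] was never
allocated instead of dereferencing it.

```c
#include <stddef.h>

#define MAX_NUMNODES 8	/* assumption: matches the 8-node machine in the report */

/* Simplified stand-ins for the kernel structures, not the real definitions. */
struct node_barn {
	int nr_full;		/* number of full sheaves cached in the barn */
};

struct kmem_cache_node {
	struct node_barn *barn;
};

struct kmem_cache {
	/* An entry stays NULL for nodes that have no memory attached. */
	struct kmem_cache_node *node[MAX_NUMNODES];
};

static struct kmem_cache_node *get_node(struct kmem_cache *s, int nid)
{
	return s->node[nid];
}

/* Hypothetical helper: return the node's barn, or NULL for a memoryless node. */
static struct node_barn *get_barn_node(struct kmem_cache *s, int nid)
{
	struct kmem_cache_node *n = get_node(s, nid);

	if (!n)
		return NULL;

	return n->barn;
}

/*
 * Hypothetical caller: returns 1 if a full sheaf could be taken from the
 * barn, 0 if the caller has to fall back to a slower allocation path.
 */
static int can_refill_from_barn(struct kmem_cache *s, int nid)
{
	struct node_barn *barn = get_barn_node(s, nid);

	if (!barn)
		return 0;

	return barn->nr_full > 0;
}
```

With only node[1] populated, a lookup for nid 0 returns NULL and the caller
takes the fallback path rather than faulting.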

> For kmem_cache_node, get_partial_node() checks if get_node() result is
> not NULL, which I assumed was for protection from a bogus node id passed
> to kmalloc_node() but apparently it's also for systems where
> numa_mem_id() (used when no specific node is given) might return a
> memoryless node.
> 
> Fix the sheaves code the same way by checking the result of get_node()
> and bailing out if it's NULL. Note that cpus on such memoryless nodes
> will have degraded sheaves performance, which can be improved later,
> preferably by making numa_mem_id() work properly on such systems.
>
> Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
> Reported-and-tested-by: Phil Auld <pauld@redhat.com>
> Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
> Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon
