From: Ingo Molnar <mingo@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
Cc: x86@kernel.org, Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
David Hildenbrand <david@redhat.com>,
"H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
Michal Hocko <mhocko@suse.com>,
Peter Zijlstra <peterz@infradead.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Thomas Gleixner <tglx@linutronix.de>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2] x86/mm: Drop 4MB restriction on minimal NUMA node memory size
Date: Wed, 18 Oct 2023 12:42:50 +0200 [thread overview]
Message-ID: <ZS+2qqjEO5/867br@gmail.com> (raw)
In-Reply-To: <20231017062215.171670-1-rppt@kernel.org>
* Mike Rapoport <rppt@kernel.org> wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Qi Zheng reports crashes in a production environment and provides a
> simplified example as a reproducer:
>
> For example, if we use qemu to start a two NUMA node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [ 0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 0.150783] #PF: supervisor write access in kernel mode
> [ 0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [ 0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [ 0.169781] Call Trace:
> [ 0.170159] <TASK>
> [ 0.170448] deactivate_slab+0x187/0x3c0
> [ 0.171031] ? bootstrap+0x1b/0x10e
> [ 0.171559] ? preempt_count_sub+0x9/0xa0
> [ 0.172145] ? kmem_cache_alloc+0x12c/0x440
> [ 0.172735] ? bootstrap+0x1b/0x10e
> [ 0.173236] bootstrap+0x6b/0x10e
> [ 0.173720] kmem_cache_init+0x10a/0x188
> [ 0.174240] start_kernel+0x415/0x6ac
> [ 0.174738] secondary_startup_64_no_verify+0xe0/0xeb
> [ 0.175417] </TASK>
> [ 0.175713] Modules linked in:
> [ 0.176117] CR2: 0000000000000000
>
> The crashes happen because of inconsistency between nodemask that has
> nodes with less than 4MB as memoryless and the actual memory fed into
> core mm.
Presumably the core MM got fixed too to not just crash, but provide some
sort of warning?
> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> empty node in SRAT parsing") that introduced minimal size of a NUMA node
> does not explain why a node size cannot be less than 4MB and what boot
> failures this restriction might fix.
>
> Since then a lot has changed and core mm won't confuse badly about small
> node sizes.
Core MM won't get confused ... other than by the above weird Qemu topology,
to which it responds with a ... NULL pointer dereference?
Seems quite close to the literal definition of 'get confused badly' to me,
and doesn't give me the warm fuzzy feeling that giving the core MM even
*more* weird topologies is super safe ... :-/
> Drop the limitation for the minimal node size.
While I agree with dropping the limitation, and I agree that 9391a3f9c7f1
should have provided more of a justification, I believe a core MM fix is in
order as well, for it to not crash. [ If it's fixed upstream already,
please reference the relevant commit ID. ]
Also, the changelog spelling & general presentation were quite low quality
- I've fixed it up a bit below, please carry this version going forward.
Please spell-check your patches before sending out Nth versions of it,
maybe maintainers are skipping them for a reason!
Thanks,
Ingo
=================>
From: "Mike Rapoport (IBM)" <rppt@kernel.org>
Date: Tue, 17 Oct 2023 09:22:15 +0300
Subject: [PATCH] x86/mm: Drop 4MB restriction on minimal NUMA node memory size
Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:
| For example, if we use qemu to start a two NUMA node kernel,
| one of the nodes has 2M memory (less than NODE_MIN_SIZE),
| and the other node has 2G, then we will encounter the
| following panic:
|
| BUG: kernel NULL pointer dereference, address: 0000000000000000
| <...>
| RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
| <...>
| Call Trace:
| <TASK>
| deactivate_slab()
| bootstrap()
| kmem_cache_init()
| start_kernel()
| secondary_startup_64_no_verify()
The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.
The commit:
9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")
... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.
In the 17 years since then a lot has changed and core mm won't get
confused about small node sizes.
Drop the limitation for the minimal node size.
[ mingo: Improved changelog clarity. ]
Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/
---
arch/x86/include/asm/numa.h | 7 -------
arch/x86/mm/numa.c | 7 -------
2 files changed, 14 deletions(-)
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..ef2844d69173 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -12,13 +12,6 @@
#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
-/*
- * Too small node sizes may confuse the VM badly. Usually they
- * result from BIOS bugs. So dont recognize nodes as standalone
- * NUMA entities that have less than this amount of RAM listed:
- */
-#define NODE_MIN_SIZE (4*1024*1024)
-
extern int numa_off;
/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c01c5506fd4a..aa39d678fe81 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -602,13 +602,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (start >= end)
continue;
- /*
- * Don't confuse VM with a node that doesn't have the
- * minimum amount of memory:
- */
- if (end && (end - start) < NODE_MIN_SIZE)
- continue;
-
alloc_node_data(nid);
}
next prev parent reply other threads:[~2023-10-18 10:42 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-17 6:22 [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size Mike Rapoport
2023-10-17 7:28 ` David Hildenbrand
2023-10-17 7:35 ` David Hildenbrand
2023-10-17 7:52 ` Mike Rapoport
2023-10-18 10:42 ` Ingo Molnar [this message]
2023-10-18 12:26 ` [PATCH v2] x86/mm: Drop 4MB restriction on minimal NUMA node memory size Qi Zheng
2023-10-18 12:44 ` Ingo Molnar
2023-10-18 13:20 ` Qi Zheng
2023-10-20 8:46 ` Ingo Molnar
2023-10-20 8:59 ` Ingo Molnar
2023-10-19 9:35 ` Mike Rapoport
2023-10-18 11:55 ` [PATCH] x86/mm: drop 4MB restriction on minimal NUMA node size Mario Casquero
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZS+2qqjEO5/867br@gmail.com \
--to=mingo@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mhocko@suse.com \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rppt@kernel.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.