public inbox for linux-kernel@vger.kernel.org
* [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node()
@ 2024-05-29  7:42 Jan Beulich
  2024-05-29 15:36 ` Dave Hansen
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2024-05-29  7:42 UTC (permalink / raw)
  To: lkml; +Cc: Dave Hansen, Andrew Lutomirski, Peter Zijlstra

On an (old) x86 system with SRAT just covering space above 4Gb:

    ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0xfffffffff] hotplug

the commit referenced below leads to this NUMA configuration no longer
being refused by a CONFIG_NUMA=y kernel (previously

    NUMA: nodes only cover 6144MB of your 8185MB e820 RAM. Not used.
    No NUMA configuration found
    Faking a node at [mem 0x0000000000000000-0x000000027fffffff]

was seen in the log directly after the message quoted above), because of
memblock_validate_numa_coverage() checking for NUMA_NO_NODE (only). This
in turn led to memblock_alloc_range_nid()'s warning about MAX_NUMNODES
triggering, followed by a NULL deref in memmap_init() when trying to
access node 64's (NODE_SHIFT=6) node data.

To compensate for said change, avoid passing MAX_NUMNODES to
memblock_set_node(). In turn numa_clear_kernel_node_hotplug()'s check
then also needs adjusting.

Fixes: ff6c3d81f2e8 ("NUMA: optimize detection of memory with no node id assigned by firmware")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@vger.kernel.org
---
This still leaves MAX_NUMNODES uses in various other places in
mm/memblock.c. Interestingly
https://lore.kernel.org/lkml/20170309034415.GA16588@WeideMacBook-Pro.local/T/#t
was a more complete patch which, for unclear reasons, appears never to
have made it anywhere.

--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -493,7 +493,7 @@ static void __init numa_clear_kernel_nod
 	for_each_reserved_mem_region(mb_region) {
 		int nid = memblock_get_region_node(mb_region);
 
-		if (nid != MAX_NUMNODES)
+		if (nid != NUMA_NO_NODE)
 			node_set(nid, reserved_nodemask);
 	}
 
@@ -614,9 +614,9 @@ static int __init numa_init(int (*init_f
 	nodes_clear(node_online_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
-				  MAX_NUMNODES));
+				  NUMA_NO_NODE));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
-				  MAX_NUMNODES));
+				  NUMA_NO_NODE));
 	/* In case that parsing SRAT failed. */
 	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
 	numa_reset_distance();

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node()
  2024-05-29  7:42 [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node() Jan Beulich
@ 2024-05-29 15:36 ` Dave Hansen
  2024-05-29 16:00   ` Jan Beulich
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2024-05-29 15:36 UTC (permalink / raw)
  To: Jan Beulich, lkml; +Cc: Dave Hansen, Andrew Lutomirski, Peter Zijlstra

On 5/29/24 00:42, Jan Beulich wrote:
> On an (old) x86 system with SRAT just covering space above 4Gb:
> 
>     ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0xfffffffff] hotplug

OK, so you've got a system with buggy NUMA information.  It _used_ to
"refuse" the NUMA configuration.  Now it tries to move forward and
eventually does a NULL deref in memmap_init().

Right?

> the commit referenced below leads to this NUMA configuration no longer
> being refused by a CONFIG_NUMA=y kernel (previously
> 
>     NUMA: nodes only cover 6144MB of your 8185MB e820 RAM. Not used.
>     No NUMA configuration found
>     Faking a node at [mem 0x0000000000000000-0x000000027fffffff]
> 
> was seen in the log directly after the message quoted above), because of
> memblock_validate_numa_coverage() checking for NUMA_NO_NODE (only). This
> in turn led to memblock_alloc_range_nid()'s warning about MAX_NUMNODES
> triggering, followed by a NULL deref in memmap_init() when trying to
> access node 64's (NODE_SHIFT=6) node data.

This is a really oblique way of saying:

	... followed by a NULL deref in memmap_init() of
	NODE_DATA(MAX_NUMNODES).

> To compensate for said change, avoid passing MAX_NUMNODES to
> memblock_set_node(). In turn numa_clear_kernel_node_hotplug()'s check
> then also needs adjusting.
> 
> Fixes: ff6c3d81f2e8 ("NUMA: optimize detection of memory with no node id assigned by firmware")

I was expecting to see MAX_NUMNODES checks in ff6c3d81f2e8 somewhere.
But I don't see any in the numa_meminfo_cover_memory() or
__absent_pages_in_range().

In other words, it's not completely clear why ff6c3d81f2e8 introduced
this problem.


* Re: [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node()
  2024-05-29 15:36 ` Dave Hansen
@ 2024-05-29 16:00   ` Jan Beulich
  2024-05-29 16:08     ` Dave Hansen
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2024-05-29 16:00 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Dave Hansen, Andrew Lutomirski, Peter Zijlstra, lkml

On 29.05.2024 17:36, Dave Hansen wrote:
> On 5/29/24 00:42, Jan Beulich wrote:
>> On an (old) x86 system with SRAT just covering space above 4Gb:
>>
>>     ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0xfffffffff] hotplug
> 
> OK, so you've got a system with buggy NUMA information.  It _used_ to
> "refuse" the NUMA configuration.  Now it tries to move forward and
> eventually does a NULL deref in memmap_init().
> 
> Right?

Yes.

>> the commit referenced below leads to this NUMA configuration no longer
>> being refused by a CONFIG_NUMA=y kernel (previously
>>
>>     NUMA: nodes only cover 6144MB of your 8185MB e820 RAM. Not used.
>>     No NUMA configuration found
>>     Faking a node at [mem 0x0000000000000000-0x000000027fffffff]
>>
>> was seen in the log directly after the message quoted above), because of
>> memblock_validate_numa_coverage() checking for NUMA_NO_NODE (only). This
>> in turn led to memblock_alloc_range_nid()'s warning about MAX_NUMNODES
>> triggering, followed by a NULL deref in memmap_init() when trying to
>> access node 64's (NODE_SHIFT=6) node data.
> 
> This is a really oblique way of saying:
> 
> 	... followed by a NULL deref in memmap_init() of
> 	NODE_DATA(MAX_NUMNODES).
> 
>> To compensate for said change, avoid passing MAX_NUMNODES to
>> memblock_set_node(). In turn numa_clear_kernel_node_hotplug()'s check
>> then also needs adjusting.
>>
>> Fixes: ff6c3d81f2e8 ("NUMA: optimize detection of memory with no node id assigned by firmware")
> 
> I was expecting to see MAX_NUMNODES checks in ff6c3d81f2e8 somewhere.
> But I don't see any in the numa_meminfo_cover_memory() or
> __absent_pages_in_range().
> 
> In other words, it's not completely clear why ff6c3d81f2e8 introduced
> this problem.

It is my understanding that said change, by preventing the NUMA
configuration from being rejected, resulted in different code paths to
be taken. The observed crash was somewhat later than the "No NUMA
configuration found" etc messages. Thus I don't really see a connection
between said change not having had any MAX_NUMNODES check and it having
introduced the (only perceived?) regression.

Jan


* Re: [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node()
  2024-05-29 16:00   ` Jan Beulich
@ 2024-05-29 16:08     ` Dave Hansen
  2024-05-31  6:21       ` Jan Beulich
  2024-05-31  9:42       ` Mike Rapoport
  0 siblings, 2 replies; 6+ messages in thread
From: Dave Hansen @ 2024-05-29 16:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dave Hansen, Andrew Lutomirski, Peter Zijlstra, lkml

On 5/29/24 09:00, Jan Beulich wrote:
>> In other words, it's not completely clear why ff6c3d81f2e8 introduced
>> this problem.
> It is my understanding that said change, by preventing the NUMA
> configuration from being rejected, resulted in different code paths to
> be taken. The observed crash was somewhat later than the "No NUMA
> configuration found" etc messages. Thus I don't really see a connection
> between said change not having had any MAX_NUMNODES check and it having
> introduced the (only perceived?) regression.

So your system has a bad NUMA config.  If it's rejected, then all is
merry.  Something goes and writes over the nids in all of the memblocks
to point to 0 (probably).

If it _isn't_ rejected, then it leaves a memblock in place that points
to MAX_NUMNODES.  That MAX_NUMNODES is a ticking time bomb for later.

So this patch doesn't actually revert the rejection behavior change in
the Fixes: commit.  It just makes the rest of the code more tolerant to
_not_ rejecting the NUMA config?




* Re: [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node()
  2024-05-29 16:08     ` Dave Hansen
@ 2024-05-31  6:21       ` Jan Beulich
  2024-05-31  9:42       ` Mike Rapoport
  1 sibling, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2024-05-31  6:21 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Dave Hansen, Andrew Lutomirski, Peter Zijlstra, lkml

On 29.05.2024 18:08, Dave Hansen wrote:
> On 5/29/24 09:00, Jan Beulich wrote:
>>> In other words, it's not completely clear why ff6c3d81f2e8 introduced
>>> this problem.
>> It is my understanding that said change, by preventing the NUMA
>> configuration from being rejected, resulted in different code paths to
>> be taken. The observed crash was somewhat later than the "No NUMA
>> configuration found" etc messages. Thus I don't really see a connection
>> between said change not having had any MAX_NUMNODES check and it having
>> introduced the (only perceived?) regression.
> 
> So your system has a bad NUMA config.  If it's rejected, then all is
> merry.  Something goes and writes over the nids in all of the memblocks
> to point to 0 (probably).
> 
> If it _isn't_ rejected, then it leaves a memblock in place that points
> to MAX_NUMNODES.  That MAX_NUMNODES is a ticking time bomb for later.
> 
> So this patch doesn't actually revert the rejection behavior change in
> the Fixes: commit.  It just makes the rest of the code more tolerant to
> _not_ rejecting the NUMA config?

No, the NUMA config is now properly rejected again:

NUMA: no nodes coverage for 2041MB of 8185MB RAM
No NUMA configuration found
Faking a node at [mem 0x0000000000000000-0x000000027fffffff]

Jan


* Re: [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node()
  2024-05-29 16:08     ` Dave Hansen
  2024-05-31  6:21       ` Jan Beulich
@ 2024-05-31  9:42       ` Mike Rapoport
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Rapoport @ 2024-05-31  9:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jan Beulich, Dave Hansen, Andrew Lutomirski, Peter Zijlstra, lkml

Hi Dave,

On Wed, May 29, 2024 at 09:08:12AM -0700, Dave Hansen wrote:
> On 5/29/24 09:00, Jan Beulich wrote:
> >> In other words, it's not completely clear why ff6c3d81f2e8 introduced
> >> this problem.
> > It is my understanding that said change, by preventing the NUMA
> > configuration from being rejected, resulted in different code paths to
> > be taken. The observed crash was somewhat later than the "No NUMA
> > configuration found" etc messages. Thus I don't really see a connection
> > between said change not having had any MAX_NUMNODES check and it having
> > introduced the (only perceived?) regression.
> 
> So your system has a bad NUMA config.  If it's rejected, then all is
> merry.  Something goes and writes over the nids in all of the memblocks
> to point to 0 (probably).
> 
> If it _isn't_ rejected, then it leaves a memblock in place that points
> to MAX_NUMNODES.  That MAX_NUMNODES is a ticking time bomb for later.
> 
> So this patch doesn't actually revert the rejection behavior change in
> the Fixes: commit.  It just makes the rest of the code more tolerant to
> _not_ rejecting the NUMA config?
 
It actually does. Before ff6c3d81f2e8 the NUMA coverage was verified
against numa_meminfo rather than memblock, so it could detect that only
a small portion of the memory had a node ID assigned.

With the transition to memblock, the verification relies on node IDs set
by the arch code, but since memblock_validate_numa_coverage() only
checked for NUMA_NO_NODE, it missed the ranges with nid == MAX_NUMNODES.

I took Jan's fix for memblock:

https://lore.kernel.org/all/1c8a058c-5365-4f27-a9f1-3aeb7fb3e7b2@suse.com

but I think that we should replace MAX_NUMNODES with NUMA_NO_NODE in calls to
memblock_set_node() in arch/x86.

-- 
Sincerely yours,
Mike.


end of thread, other threads:[~2024-05-31  9:44 UTC | newest]

Thread overview: 6+ messages
2024-05-29  7:42 [PATCH] x86/NUMA: don't pass MAX_NUMNODES to memblock_set_node() Jan Beulich
2024-05-29 15:36 ` Dave Hansen
2024-05-29 16:00   ` Jan Beulich
2024-05-29 16:08     ` Dave Hansen
2024-05-31  6:21       ` Jan Beulich
2024-05-31  9:42       ` Mike Rapoport
