sparclinux.vger.kernel.org archive mirror
* [PATCH v4 00/26] mm: introduce numa_memblks
@ 2024-08-07  6:40 Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 01/26] mm: move kernel/numa.c to mm/ Mike Rapoport
                   ` (25 more replies)
  0 siblings, 26 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

Following the discussion about handling of CXL fixed memory windows on
arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
the generic code so they will be available on arm64/riscv and maybe on
loongarch sometime later.

While it could be possible to use memblock to describe CXL memory windows,
it currently lacks a notion of unpopulated memory ranges, whereas
numa_memblks does implement this.
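
To illustrate the idea of tracking unpopulated ranges, here is a minimal
userspace sketch. It is not the kernel code: the names blk, blk_table,
blk_add(), blk_lookup() and addr_to_target_node(), and the MAX_BLKS limit,
are hypothetical and exist only for this example. It assumes one table for
populated memory and a second one for ranges that have a node assignment
but no memory behind them yet (for example a CXL window), and resolves an
address by consulting the populated table first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLKS 8	/* hypothetical table size for this sketch */
#define NO_NODE  (-1)	/* stands in for "no node known" */

/* One physical address range [start, end) that belongs to node nid. */
struct blk {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* A small fixed-size table of ranges. */
struct blk_table {
	int nr_blks;
	struct blk blk[MAX_BLKS];
};

/* Record a range; returns 0 on success, -1 if it is empty or the table is full. */
static int blk_add(struct blk_table *t, int nid, uint64_t start, uint64_t end)
{
	assert(t != NULL);
	if (start >= end || t->nr_blks >= MAX_BLKS)
		return -1;
	t->blk[t->nr_blks].start = start;
	t->blk[t->nr_blks].end = end;
	t->blk[t->nr_blks].nid = nid;
	t->nr_blks++;
	return 0;
}

/* Return the node that owns addr in this table, or NO_NODE. */
static int blk_lookup(const struct blk_table *t, uint64_t addr)
{
	assert(t != NULL);
	for (int i = 0; i < t->nr_blks; i++) {
		if (addr >= t->blk[i].start && addr < t->blk[i].end)
			return t->blk[i].nid;
	}
	return NO_NODE;
}

/*
 * Resolve an address to a node: populated memory wins, and ranges that
 * are only reserved (no memory present yet) are the fallback.
 */
static int addr_to_target_node(const struct blk_table *populated,
			       const struct blk_table *reserved,
			       uint64_t addr)
{
	int nid = blk_lookup(populated, addr);

	return nid != NO_NODE ? nid : blk_lookup(reserved, addr);
}
```

The point of the second table is that an address inside a window with no
memory present can still be mapped to a node, which a description of
present memory alone cannot express.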

Another reason to make numa_memblks generic is that both arch_numa (arm64
and riscv) and loongarch use a trimmed copy of x86 code although there is no
fundamental reason why the same code cannot be used on all these platforms.
Having numa_memblks in mm/ will make its interaction with ACPI and FDT
more consistent and I believe will reduce the maintenance burden.

And with generic numa_memblks it is (almost) straightforward to enable NUMA
emulation on arm64 and riscv.

The first 9 commits in this series are cleanups that are not strictly
related to numa_memblks.
Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks
and NUMA emulation to the generic code.
Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22
do some follow-up cleanups.
Commit 23 updates of_numa_init() to return an error if no NUMA nodes were
found in the device tree.
Commit 24 switches arch_numa to numa_memblks.
Commit 25 enables usage of phys_to_target_node() and
memory_add_physaddr_to_nid() with numa_memblks.
Commit 26 moves the description for numa=fake from x86 to admin-guide.

[1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/

v3: https://lore.kernel.org/all/20240801060826.559858-1-rppt@kernel.org
* update allocation of offline node, thanks Jonathan
* add comment about dependency of get_pfn_range_for_nid on
  memblock_set_node(), per Dan
* fix build errors with 32-bit phys_addr_t reported by kbuild
* add Acked- and Reviewed-by, thanks Dan and David

v2: https://lore.kernel.org/all/20240723064156.4009477-1-rppt@kernel.org
* rebase on v6.11-rc1
* fix dummy_numa_init() in arch_numa, thanks Zi Yan
* update of_numa_init() to return an error if no NUMA nodes were
  found in the device tree
* add Tested-by, thanks Zi Yan

v1: https://lore.kernel.org/all/20240716111346.3676969-1-rppt@kernel.org
* add cleanup for arch_alloc_nodedata and HAVE_ARCH_NODEDATA_EXTENSION
* add patch that moves description of numa=fake kernel parameter from
  x86 to admin-guide
* reduce rounding up of node_data allocations from PAGE_SIZE to
  SMP_CACHE_BYTES
* restore single allocation attempt of numa_distance
* fix several comments
* added review tags

Mike Rapoport (Microsoft) (26):
  mm: move kernel/numa.c to mm/
  MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
  MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes
  MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION
  MIPS: loongson64: rename __node_data to node_data
  MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION
  arch, mm: move definition of node_data to generic code
  mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
  arch, mm: pull out allocation of NODE_DATA to generic code
  x86/numa: simplify numa_distance allocation
  x86/numa: use get_pfn_range_for_nid to verify that node spans memory
  x86/numa: move FAKE_NODE_* defines to numa_emu
  x86/numa_emu: simplify allocation of phys_dist
  x86/numa_emu: split __apicid_to_node update to a helper function
  x86/numa_emu: use a helper function to get MAX_DMA32_PFN
  x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
  mm: introduce numa_memblks
  mm: move numa_distance and related code from x86 to numa_memblks
  mm: introduce numa_emulation
  mm: numa_memblks: introduce numa_memblks_init
  mm: numa_memblks: make several functions and variables static
  mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing
    meminfo
  of, numa: return -EINVAL when no numa-node-id is found
  arch_numa: switch over to numa_memblks
  mm: make range-to-target_node lookup facility a part of numa_memblks
  docs: move numa=fake description to kernel-parameters.txt

 .../admin-guide/kernel-parameters.txt         |  15 +
 .../arch/x86/x86_64/boot-options.rst          |  12 -
 arch/arm64/include/asm/Kbuild                 |   1 +
 arch/arm64/include/asm/mmzone.h               |  13 -
 arch/arm64/include/asm/topology.h             |   1 +
 arch/loongarch/include/asm/Kbuild             |   1 +
 arch/loongarch/include/asm/mmzone.h           |  16 -
 arch/loongarch/include/asm/topology.h         |   1 +
 arch/loongarch/kernel/numa.c                  |  21 -
 arch/mips/Kconfig                             |   5 -
 arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
 .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
 arch/mips/loongson64/numa.c                   |  28 +-
 arch/mips/sgi-ip27/ip27-memory.c              |  12 +-
 arch/mips/sgi-ip27/ip27-smp.c                 |   2 +
 arch/powerpc/include/asm/mmzone.h             |   6 -
 arch/powerpc/mm/numa.c                        |  26 +-
 arch/riscv/include/asm/Kbuild                 |   1 +
 arch/riscv/include/asm/mmzone.h               |  13 -
 arch/riscv/include/asm/topology.h             |   4 +
 arch/s390/include/asm/Kbuild                  |   1 +
 arch/s390/include/asm/mmzone.h                |  17 -
 arch/s390/kernel/numa.c                       |   3 -
 arch/sh/include/asm/mmzone.h                  |   3 -
 arch/sh/mm/init.c                             |   7 +-
 arch/sh/mm/numa.c                             |   3 -
 arch/sparc/include/asm/mmzone.h               |   4 -
 arch/sparc/mm/init_64.c                       |  11 +-
 arch/x86/Kconfig                              |   9 +-
 arch/x86/include/asm/Kbuild                   |   1 +
 arch/x86/include/asm/mmzone.h                 |   6 -
 arch/x86/include/asm/mmzone_32.h              |  17 -
 arch/x86/include/asm/mmzone_64.h              |  18 -
 arch/x86/include/asm/numa.h                   |  26 +-
 arch/x86/include/asm/sparsemem.h              |   9 -
 arch/x86/mm/Makefile                          |   1 -
 arch/x86/mm/amdtopology.c                     |   1 +
 arch/x86/mm/numa.c                            | 622 +-----------------
 arch/x86/mm/numa_internal.h                   |  24 -
 drivers/acpi/numa/srat.c                      |   1 +
 drivers/base/Kconfig                          |   1 +
 drivers/base/arch_numa.c                      | 224 ++-----
 drivers/cxl/Kconfig                           |   2 +-
 drivers/dax/Kconfig                           |   2 +-
 drivers/of/of_numa.c                          |   5 +-
 include/asm-generic/mmzone.h                  |   5 +
 include/asm-generic/numa.h                    |   6 +-
 include/linux/memory_hotplug.h                |  48 --
 include/linux/numa.h                          |   8 +
 include/linux/numa_memblks.h                  |  58 ++
 kernel/Makefile                               |   1 -
 kernel/numa.c                                 |  26 -
 mm/Kconfig                                    |  11 +
 mm/Makefile                                   |   3 +
 mm/mm_init.c                                  |  10 +-
 mm/numa.c                                     |  69 ++
 {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
 mm/numa_memblks.c                             | 571 ++++++++++++++++
 58 files changed, 893 insertions(+), 1166 deletions(-)
 delete mode 100644 arch/arm64/include/asm/mmzone.h
 delete mode 100644 arch/loongarch/include/asm/mmzone.h
 delete mode 100644 arch/riscv/include/asm/mmzone.h
 delete mode 100644 arch/s390/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone_32.h
 delete mode 100644 arch/x86/include/asm/mmzone_64.h
 create mode 100644 include/asm-generic/mmzone.h
 create mode 100644 include/linux/numa_memblks.h
 delete mode 100644 kernel/numa.c
 create mode 100644 mm/numa.c
 rename {arch/x86/mm => mm}/numa_emulation.c (94%)
 create mode 100644 mm/numa_memblks.c


base-commit: 8400291e289ee6b2bf9779ff1c83a291501f017b
-- 
2.43.0



* [PATCH v4 01/26] mm: move kernel/numa.c to mm/
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 02/26] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The stub functions in kernel/numa.c belong in mm/ rather than in kernel/.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/Makefile       | 1 -
 mm/Makefile           | 1 +
 {kernel => mm}/numa.c | 0
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename {kernel => mm}/numa.c (100%)

diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..87866b037fbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -116,7 +116,6 @@ obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
 obj-$(CONFIG_HAVE_STATIC_CALL) += static_call.o
 obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call_inline.o
 obj-$(CONFIG_CFI_CLANG) += cfi.o
-obj-$(CONFIG_NUMA) += numa.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/mm/Makefile b/mm/Makefile
index d2915f8c9dc0..4e668be85f0b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -141,3 +141,4 @@ obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
+obj-$(CONFIG_NUMA) += numa.o
diff --git a/kernel/numa.c b/mm/numa.c
similarity index 100%
rename from kernel/numa.c
rename to mm/numa.c
-- 
2.43.0



* [PATCH v4 02/26] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 01/26] mm: move kernel/numa.c to mm/ Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 03/26] MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes Mike Rapoport
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

sgi-ip27 is the only system that defines NODE_DATA() differently from
the rest of the NUMA machines.

Add a node_data array of struct pglist_data pointers that will point to
__node_data[node]->pglist and redefine NODE_DATA() to use the node_data
array.

This will allow pulling declaration of node_data to the generic mm code
in the next commit.
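
For illustration only, the aliasing described above can be modelled in
userspace as follows. This is a sketch under assumptions, not the ip27
code: the struct layouts are reduced to a couple of fields, hub_data and
its member are placeholders, MAX_NUMNODES is set to a small arbitrary
value, and plat_node_data is a hypothetical name standing in for
__node_data (identifiers with a leading double underscore are reserved in
userspace C).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 4	/* small arbitrary value for this sketch */

/* Reduced stand-in for the generic per-node structure. */
struct pglist_data {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
};

/* Placeholder for the platform-specific part. */
struct hub_data {
	int placeholder;
};

/* Platform per-node structure that embeds the generic one. */
struct node_data {
	struct pglist_data pglist;
	struct hub_data hub;
};

/* Stand-in for __node_data[]: pointers to the platform structure. */
static struct node_data *plat_node_data[MAX_NUMNODES];

/* Array of generic pointers, shaped like on other architectures. */
static struct pglist_data *node_data[MAX_NUMNODES];

#define NODE_DATA(nid)	(node_data[(nid)])

/* Set up one node so that both views refer to the same storage. */
static void node_mem_init(int node, struct node_data *storage,
			  unsigned long start_pfn, unsigned long end_pfn)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	assert(storage != NULL && end_pfn >= start_pfn);

	plat_node_data[node] = storage;
	/* The generic pointer aliases the embedded pglist member. */
	node_data[node] = &plat_node_data[node]->pglist;

	NODE_DATA(node)->node_start_pfn = start_pfn;
	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
}
```

With this shape NODE_DATA() is a plain array lookup, while the
platform-specific part stays reachable through the original array.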

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/mips/include/asm/mach-ip27/mmzone.h | 5 ++++-
 arch/mips/sgi-ip27/ip27-memory.c         | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
index 08c36e50a860..629c3f290203 100644
--- a/arch/mips/include/asm/mach-ip27/mmzone.h
+++ b/arch/mips/include/asm/mach-ip27/mmzone.h
@@ -22,7 +22,10 @@ struct node_data {
 
 extern struct node_data *__node_data[];
 
-#define NODE_DATA(n)		(&__node_data[(n)]->pglist)
 #define hub_data(n)		(&__node_data[(n)]->hub)
 
+extern struct pglist_data *node_data[];
+
+#define NODE_DATA(nid)		(node_data[nid])
+
 #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index b8ca94cfb4fe..c30ef6958b97 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -34,8 +34,10 @@
 #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
 #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
 
-struct node_data *__node_data[MAX_NUMNODES];
+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
 
+struct node_data *__node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_data);
 
 static u64 gen_region_mask(void)
@@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
 	 */
 	__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
 	memset(__node_data[node], 0, PAGE_SIZE);
+	node_data[node] = &__node_data[node]->pglist;
 
 	NODE_DATA(node)->node_start_pfn = start_pfn;
 	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
-- 
2.43.0



* [PATCH v4 03/26] MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 01/26] mm: move kernel/numa.c to mm/ Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 02/26] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 04/26] MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

For SGI IP27 machines node_possible_map is statically set to
NODE_MASK_ALL and it is not updated during NUMA initialization.

Ensure that it only contains nodes present in the system.
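
As a rough userspace model of the change (not the kernel code), the probe
can start from an empty mask and set a bit only for nodes that are
actually found. The node_mask type, the probe_possible_nodes() helper and
the small MAX_NUMNODES value are assumptions made for this sketch;
INVALID_NASID is modelled as -1 here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NUMNODES  8		/* small arbitrary value for this sketch */
#define INVALID_NASID (-1)	/* modelled as -1 in this sketch */

typedef uint32_t node_mask;	/* hypothetical: one bit per node */

/*
 * Build the "possible" mask from a table of node ids. The table is
 * scanned until the first invalid entry, so later entries are ignored.
 */
static node_mask probe_possible_nodes(const int *nasidtable, size_t entries)
{
	node_mask possible = 0;	/* start empty rather than "all nodes" */

	assert(nasidtable != NULL);
	for (size_t i = 0; i < entries && i < MAX_NUMNODES; i++) {
		int nasid = nasidtable[i];

		if (nasid == INVALID_NASID)
			break;
		assert(nasid >= 0 && nasid < MAX_NUMNODES);
		possible |= (node_mask)1 << nasid;
	}
	return possible;
}
```

Anything that iterates over possible nodes then visits only the nodes
found during the probe, not every id up to MAX_NUMNODES.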

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/mips/sgi-ip27/ip27-smp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/mips/sgi-ip27/ip27-smp.c b/arch/mips/sgi-ip27/ip27-smp.c
index 5d2652a1d35a..62733e049570 100644
--- a/arch/mips/sgi-ip27/ip27-smp.c
+++ b/arch/mips/sgi-ip27/ip27-smp.c
@@ -70,11 +70,13 @@ void cpu_node_probe(void)
 	gda_t *gdap = GDA;
 
 	nodes_clear(node_online_map);
+	nodes_clear(node_possible_map);
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		nasid_t nasid = gdap->g_nasidtable[i];
 		if (nasid == INVALID_NASID)
 			break;
 		node_set_online(nasid);
+		node_set(nasid, node_possible_map);
 		highest = node_scan_cpus(nasid, highest);
 	}
 
-- 
2.43.0



* [PATCH v4 04/26] MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (2 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 03/26] MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 05/26] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and
sgi-ip27") added HAVE_ARCH_NODEDATA_EXTENSION to sgi-ip27 to silence a
compilation error that happened because sgi-ip27 didn't define an array of
pg_data_t named node_data, as most other architectures did.

After the addition of a node_data array that matches other architectures and
after ensuring that offline nodes do not appear in node_possible_map, it
is safe to drop arch_alloc_nodedata() and HAVE_ARCH_NODEDATA_EXTENSION
from sgi-ip27.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/mips/Kconfig                |  1 -
 arch/mips/sgi-ip27/ip27-memory.c | 10 ----------
 2 files changed, 11 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 60077e576935..ea5f3c3c31f6 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -735,7 +735,6 @@ config SGI_IP27
 	select WAR_R10000_LLSC
 	select MIPS_L1_CACHE_SHIFT_7
 	select NUMA
-	select HAVE_ARCH_NODEDATA_EXTENSION
 	help
 	  This are the SGI Origin 200, Origin 2000 and Onyx 2 Graphics
 	  workstations.  To compile a Linux kernel that runs on these, say Y
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index c30ef6958b97..eb6d2fa41a8a 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -426,13 +426,3 @@ void __init mem_init(void)
 	memblock_free_all();
 	setup_zero_pages();	/* This comes from node 0 */
 }
-
-pg_data_t * __init arch_alloc_nodedata(int nid)
-{
-	return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);
-}
-
-void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-	__node_data[nid] = (struct node_data *)pgdat;
-}
-- 
2.43.0



* [PATCH v4 05/26] MIPS: loongson64: rename __node_data to node_data
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (3 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 04/26] MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 06/26] MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Make definition of node_data match other architectures.
This will allow pulling declaration of node_data to the generic mm code in
the following commit.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/mips/include/asm/mach-loongson64/mmzone.h | 4 ++--
 arch/mips/loongson64/numa.c                    | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h b/arch/mips/include/asm/mach-loongson64/mmzone.h
index a3d65d37b8b5..2effd5f8ed62 100644
--- a/arch/mips/include/asm/mach-loongson64/mmzone.h
+++ b/arch/mips/include/asm/mach-loongson64/mmzone.h
@@ -14,9 +14,9 @@
 #define pa_to_nid(addr)  (((addr) & 0xf00000000000) >> NODE_ADDRSPACE_SHIFT)
 #define nid_to_addrbase(nid) ((unsigned long)(nid) << NODE_ADDRSPACE_SHIFT)
 
-extern struct pglist_data *__node_data[];
+extern struct pglist_data *node_data[];
 
-#define NODE_DATA(n)		(__node_data[n])
+#define NODE_DATA(n)		(node_data[n])
 
 extern void __init prom_init_numa_memory(void);
 
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index 68dafd6d3e25..b50ce28d2741 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -29,8 +29,8 @@
 
 unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES];
 EXPORT_SYMBOL(__node_distances);
-struct pglist_data *__node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(__node_data);
+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
 
 cpumask_t __node_cpumask[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_cpumask);
@@ -107,7 +107,7 @@ static void __init node_mem_init(unsigned int node)
 	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
 	if (tnid != node)
 		pr_info("NODE_DATA(%d) on node %d\n", node, tnid);
-	__node_data[node] = nd;
+	node_data[node] = nd;
 	NODE_DATA(node)->node_start_pfn = start_pfn;
 	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
 
@@ -206,5 +206,5 @@ pg_data_t * __init arch_alloc_nodedata(int nid)
 
 void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 {
-	__node_data[nid] = pgdat;
+	node_data[nid] = pgdat;
 }
-- 
2.43.0



* [PATCH v4 06/26] MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (4 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 05/26] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 07/26] arch, mm: move definition of node_data to generic code Mike Rapoport
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and
sgi-ip27") added HAVE_ARCH_NODEDATA_EXTENSION to loongson64 to silence a
compilation error that happened because loongson64 didn't define an array
of pg_data_t named node_data, as most other architectures did.

After the rename of __node_data to node_data, arch_alloc_nodedata() and
HAVE_ARCH_NODEDATA_EXTENSION can be dropped from loongson64.

Since it was the only user of the HAVE_ARCH_NODEDATA_EXTENSION config
option, also remove this option from arch/mips/Kconfig.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/mips/Kconfig           |  4 ----
 arch/mips/loongson64/numa.c | 10 ----------
 2 files changed, 14 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index ea5f3c3c31f6..43da6d596e2b 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -502,7 +502,6 @@ config MACH_LOONGSON64
 	select USE_OF
 	select BUILTIN_DTB
 	select PCI_HOST_GENERIC
-	select HAVE_ARCH_NODEDATA_EXTENSION if NUMA
 	help
 	  This enables the support of Loongson-2/3 family of machines.
 
@@ -2612,9 +2611,6 @@ config NUMA
 config SYS_SUPPORTS_NUMA
 	bool
 
-config HAVE_ARCH_NODEDATA_EXTENSION
-	bool
-
 config RELOCATABLE
 	bool "Relocatable kernel"
 	depends on SYS_SUPPORTS_RELOCATABLE
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index b50ce28d2741..64fcfaa885b6 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -198,13 +198,3 @@ void __init prom_init_numa_memory(void)
 	pr_info("CP0_PageGrain: CP0 5.1 (0x%x)\n", read_c0_pagegrain());
 	prom_meminit();
 }
-
-pg_data_t * __init arch_alloc_nodedata(int nid)
-{
-	return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);
-}
-
-void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-	node_data[nid] = pgdat;
-}
-- 
2.43.0



* [PATCH v4 07/26] arch, mm: move definition of node_data to generic code
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (5 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 06/26] MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 08/26] mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Every architecture that supports NUMA defines node_data in the same way:

	struct pglist_data *node_data[MAX_NUMNODES];

No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.

Add definition and declaration of node_data to generic code and drop
architecture-specific versions.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm64/include/asm/Kbuild                  |  1 +
 arch/arm64/include/asm/mmzone.h                | 13 -------------
 arch/arm64/include/asm/topology.h              |  1 +
 arch/loongarch/include/asm/Kbuild              |  1 +
 arch/loongarch/include/asm/mmzone.h            | 16 ----------------
 arch/loongarch/include/asm/topology.h          |  1 +
 arch/loongarch/kernel/numa.c                   |  3 ---
 arch/mips/include/asm/mach-ip27/mmzone.h       |  4 ----
 arch/mips/include/asm/mach-loongson64/mmzone.h |  4 ----
 arch/mips/loongson64/numa.c                    |  2 --
 arch/mips/sgi-ip27/ip27-memory.c               |  3 ---
 arch/powerpc/include/asm/mmzone.h              |  6 ------
 arch/powerpc/mm/numa.c                         |  2 --
 arch/riscv/include/asm/Kbuild                  |  1 +
 arch/riscv/include/asm/mmzone.h                | 13 -------------
 arch/riscv/include/asm/topology.h              |  4 ++++
 arch/s390/include/asm/Kbuild                   |  1 +
 arch/s390/include/asm/mmzone.h                 | 17 -----------------
 arch/s390/kernel/numa.c                        |  3 ---
 arch/sh/include/asm/mmzone.h                   |  3 ---
 arch/sh/mm/numa.c                              |  3 ---
 arch/sparc/include/asm/mmzone.h                |  4 ----
 arch/sparc/mm/init_64.c                        |  2 --
 arch/x86/include/asm/Kbuild                    |  1 +
 arch/x86/include/asm/mmzone.h                  |  6 ------
 arch/x86/include/asm/mmzone_32.h               | 17 -----------------
 arch/x86/include/asm/mmzone_64.h               | 18 ------------------
 arch/x86/mm/numa.c                             |  3 ---
 drivers/base/arch_numa.c                       |  2 --
 include/asm-generic/mmzone.h                   |  5 +++++
 include/linux/numa.h                           |  3 +++
 mm/numa.c                                      |  3 +++
 32 files changed, 22 insertions(+), 144 deletions(-)
 delete mode 100644 arch/arm64/include/asm/mmzone.h
 delete mode 100644 arch/loongarch/include/asm/mmzone.h
 delete mode 100644 arch/riscv/include/asm/mmzone.h
 delete mode 100644 arch/s390/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone_32.h
 delete mode 100644 arch/x86/include/asm/mmzone_64.h
 create mode 100644 include/asm-generic/mmzone.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 7d7d97ad3cd5..4e350df9a02d 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -9,6 +9,7 @@ syscall-y += unistd_compat_32.h
 
 generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
+generic-y += mmzone.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
 generic-y += parport.h
diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
deleted file mode 100644
index fa17e01d9ab2..000000000000
--- a/arch/arm64/include/asm/mmzone.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_MMZONE_H
-#define __ASM_MMZONE_H
-
-#ifdef CONFIG_NUMA
-
-#include <asm/numa.h>
-
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)		(node_data[(nid)])
-
-#endif /* CONFIG_NUMA */
-#endif /* __ASM_MMZONE_H */
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 0f6ef432fb84..5fc3af9f8f29 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -5,6 +5,7 @@
 #include <linux/cpumask.h>
 
 #ifdef CONFIG_NUMA
+#include <asm/numa.h>
 
 struct pci_bus;
 int pcibus_to_node(struct pci_bus *bus);
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 2bb3676429c0..8fa22cc52774 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -9,5 +9,6 @@ generic-y += qrwlock.h
 generic-y += qspinlock.h
 generic-y += user.h
 generic-y += ioctl.h
+generic-y += mmzone.h
 generic-y += statfs.h
 generic-y += param.h
diff --git a/arch/loongarch/include/asm/mmzone.h b/arch/loongarch/include/asm/mmzone.h
deleted file mode 100644
index 2b9a90727e19..000000000000
--- a/arch/loongarch/include/asm/mmzone.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Author: Huacai Chen (chenhuacai@loongson.cn)
- * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
- */
-#ifndef _ASM_MMZONE_H_
-#define _ASM_MMZONE_H_
-
-#include <asm/page.h>
-#include <asm/numa.h>
-
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)	(node_data[(nid)])
-
-#endif /* _ASM_MMZONE_H_ */
diff --git a/arch/loongarch/include/asm/topology.h b/arch/loongarch/include/asm/topology.h
index 66128dec0bf6..50273c9187d0 100644
--- a/arch/loongarch/include/asm/topology.h
+++ b/arch/loongarch/include/asm/topology.h
@@ -8,6 +8,7 @@
 #include <linux/smp.h>
 
 #ifdef CONFIG_NUMA
+#include <asm/numa.h>
 
 extern cpumask_t cpus_on_node[];
 
diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c
index 8fe21f868f72..acada671e020 100644
--- a/arch/loongarch/kernel/numa.c
+++ b/arch/loongarch/kernel/numa.c
@@ -27,10 +27,7 @@
 #include <asm/time.h>
 
 int numa_off;
-struct pglist_data *node_data[MAX_NUMNODES];
 unsigned char node_distances[MAX_NUMNODES][MAX_NUMNODES];
-
-EXPORT_SYMBOL(node_data);
 EXPORT_SYMBOL(node_distances);
 
 static struct numa_meminfo numa_meminfo;
diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
index 629c3f290203..56959eb9cb26 100644
--- a/arch/mips/include/asm/mach-ip27/mmzone.h
+++ b/arch/mips/include/asm/mach-ip27/mmzone.h
@@ -24,8 +24,4 @@ extern struct node_data *__node_data[];
 
 #define hub_data(n)		(&__node_data[(n)]->hub)
 
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)		(node_data[nid])
-
 #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h b/arch/mips/include/asm/mach-loongson64/mmzone.h
index 2effd5f8ed62..8fb70fd3c9c4 100644
--- a/arch/mips/include/asm/mach-loongson64/mmzone.h
+++ b/arch/mips/include/asm/mach-loongson64/mmzone.h
@@ -14,10 +14,6 @@
 #define pa_to_nid(addr)  (((addr) & 0xf00000000000) >> NODE_ADDRSPACE_SHIFT)
 #define nid_to_addrbase(nid) ((unsigned long)(nid) << NODE_ADDRSPACE_SHIFT)
 
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(n)		(node_data[n])
-
 extern void __init prom_init_numa_memory(void);
 
 #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index 64fcfaa885b6..d56238745744 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -29,8 +29,6 @@
 
 unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES];
 EXPORT_SYMBOL(__node_distances);
-struct pglist_data *node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(node_data);
 
 cpumask_t __node_cpumask[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_cpumask);
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index eb6d2fa41a8a..1963313f55d8 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -34,9 +34,6 @@
 #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
 #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
 
-struct pglist_data *node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(node_data);
-
 struct node_data *__node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_data);
 
diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index da827d2d0866..d99863cd6cde 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -20,12 +20,6 @@
 
 #ifdef CONFIG_NUMA
 
-extern struct pglist_data *node_data[];
-/*
- * Return a pointer to the node data for node n.
- */
-#define NODE_DATA(nid)		(node_data[nid])
-
 /*
  * Following are specific to this numa platform.
  */
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index aa89899f0c1a..0744a9a2944b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -43,11 +43,9 @@ static char *cmdline __initdata;
 
 int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
-struct pglist_data *node_data[MAX_NUMNODES];
 
 EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
-EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 5c589770f2a8..1461af12da6e 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -5,6 +5,7 @@ syscall-y += syscall_table_64.h
 generic-y += early_ioremap.h
 generic-y += flat.h
 generic-y += kvm_para.h
+generic-y += mmzone.h
 generic-y += parport.h
 generic-y += spinlock.h
 generic-y += spinlock_types.h
diff --git a/arch/riscv/include/asm/mmzone.h b/arch/riscv/include/asm/mmzone.h
deleted file mode 100644
index fa17e01d9ab2..000000000000
--- a/arch/riscv/include/asm/mmzone.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_MMZONE_H
-#define __ASM_MMZONE_H
-
-#ifdef CONFIG_NUMA
-
-#include <asm/numa.h>
-
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)		(node_data[(nid)])
-
-#endif /* CONFIG_NUMA */
-#endif /* __ASM_MMZONE_H */
diff --git a/arch/riscv/include/asm/topology.h b/arch/riscv/include/asm/topology.h
index 61183688bdd5..fe1a8bf6902d 100644
--- a/arch/riscv/include/asm/topology.h
+++ b/arch/riscv/include/asm/topology.h
@@ -4,6 +4,10 @@
 
 #include <linux/arch_topology.h>
 
+#ifdef CONFIG_NUMA
+#include <asm/numa.h>
+#endif
+
 /* Replace task scheduler's default frequency-invariant accounting */
 #define arch_scale_freq_tick		topology_scale_freq_tick
 #define arch_set_freq_scale		topology_set_freq_scale
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 4b904110d27c..297bf7157968 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
 generic-y += asm-offsets.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
+generic-y += mmzone.h
diff --git a/arch/s390/include/asm/mmzone.h b/arch/s390/include/asm/mmzone.h
deleted file mode 100644
index 73e3e7c6976c..000000000000
--- a/arch/s390/include/asm/mmzone.h
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * NUMA support for s390
- *
- * Copyright IBM Corp. 2015
- */
-
-#ifndef _ASM_S390_MMZONE_H
-#define _ASM_S390_MMZONE_H
-
-#ifdef CONFIG_NUMA
-
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid) (node_data[nid])
-
-#endif /* CONFIG_NUMA */
-#endif /* _ASM_S390_MMZONE_H */
diff --git a/arch/s390/kernel/numa.c b/arch/s390/kernel/numa.c
index 23ab9f02f278..ddc1448ea2e1 100644
--- a/arch/s390/kernel/numa.c
+++ b/arch/s390/kernel/numa.c
@@ -14,9 +14,6 @@
 #include <linux/node.h>
 #include <asm/numa.h>
 
-struct pglist_data *node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(node_data);
-
 void __init numa_setup(void)
 {
 	int nid;
diff --git a/arch/sh/include/asm/mmzone.h b/arch/sh/include/asm/mmzone.h
index 7b8dead2723d..63f88b465e39 100644
--- a/arch/sh/include/asm/mmzone.h
+++ b/arch/sh/include/asm/mmzone.h
@@ -5,9 +5,6 @@
 #ifdef CONFIG_NUMA
 #include <linux/numa.h>
 
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)		(node_data[nid])
-
 static inline int pfn_to_nid(unsigned long pfn)
 {
 	int nid;
diff --git a/arch/sh/mm/numa.c b/arch/sh/mm/numa.c
index 50f0dc1744d0..9bc212b5e762 100644
--- a/arch/sh/mm/numa.c
+++ b/arch/sh/mm/numa.c
@@ -14,9 +14,6 @@
 #include <linux/pfn.h>
 #include <asm/sections.h>
 
-struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL_GPL(node_data);
-
 /*
  * On SH machines the conventional approach is to stash system RAM
  * in node 0, and other memory blocks in to node 1 and up, ordered by
diff --git a/arch/sparc/include/asm/mmzone.h b/arch/sparc/include/asm/mmzone.h
index a236d8aa893a..74eb2c71d077 100644
--- a/arch/sparc/include/asm/mmzone.h
+++ b/arch/sparc/include/asm/mmzone.h
@@ -6,10 +6,6 @@
 
 #include <linux/cpumask.h>
 
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)		(node_data[nid])
-
 extern int numa_cpu_lookup_table[];
 extern cpumask_t numa_cpumask_lookup_table[];
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 53d7cb5bbffe..c6c7f43cb1e8 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1115,11 +1115,9 @@ static void init_node_masks_nonnuma(void)
 }
 
 #ifdef CONFIG_NUMA
-struct pglist_data *node_data[MAX_NUMNODES];
 
 EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(numa_cpumask_lookup_table);
-EXPORT_SYMBOL(node_data);
 
 static int scan_pio_for_cfg_handle(struct mdesc_handle *md, u64 pio,
 				   u32 cfg_handle)
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index a192bdea69e2..6c23d1661b17 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -11,3 +11,4 @@ generated-y += xen-hypercalls.h
 
 generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
+generic-y += mmzone.h
diff --git a/arch/x86/include/asm/mmzone.h b/arch/x86/include/asm/mmzone.h
deleted file mode 100644
index c41b41edd691..000000000000
--- a/arch/x86/include/asm/mmzone.h
+++ /dev/null
@@ -1,6 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifdef CONFIG_X86_32
-# include <asm/mmzone_32.h>
-#else
-# include <asm/mmzone_64.h>
-#endif
diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
deleted file mode 100644
index 2d4515e8b7df..000000000000
--- a/arch/x86/include/asm/mmzone_32.h
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Written by Pat Gaughen (gone@us.ibm.com) Mar 2002
- *
- */
-
-#ifndef _ASM_X86_MMZONE_32_H
-#define _ASM_X86_MMZONE_32_H
-
-#include <asm/smp.h>
-
-#ifdef CONFIG_NUMA
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)	(node_data[nid])
-#endif /* CONFIG_NUMA */
-
-#endif /* _ASM_X86_MMZONE_32_H */
diff --git a/arch/x86/include/asm/mmzone_64.h b/arch/x86/include/asm/mmzone_64.h
deleted file mode 100644
index 0c585046f744..000000000000
--- a/arch/x86/include/asm/mmzone_64.h
+++ /dev/null
@@ -1,18 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* K8 NUMA support */
-/* Copyright 2002,2003 by Andi Kleen, SuSE Labs */
-/* 2.5 Version loosely based on the NUMAQ Code by Pat Gaughen. */
-#ifndef _ASM_X86_MMZONE_64_H
-#define _ASM_X86_MMZONE_64_H
-
-#ifdef CONFIG_NUMA
-
-#include <linux/mmdebug.h>
-#include <asm/smp.h>
-
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)		(node_data[nid])
-
-#endif
-#endif /* _ASM_X86_MMZONE_64_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 6ce10e3c6228..7de725d6bb05 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -24,9 +24,6 @@
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
 
-struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
-
 static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
 static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 555aee3ee8e7..ceac5b59bf2b 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -15,8 +15,6 @@
 
 #include <asm/sections.h>
 
-struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
 nodemask_t numa_nodes_parsed __initdata;
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
 
diff --git a/include/asm-generic/mmzone.h b/include/asm-generic/mmzone.h
new file mode 100644
index 000000000000..2ab5193e8394
--- /dev/null
+++ b/include/asm-generic/mmzone.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_MMZONE_H
+#define _ASM_GENERIC_MMZONE_H
+
+#endif
diff --git a/include/linux/numa.h b/include/linux/numa.h
index eb19503604fe..e5841d4057ab 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -30,6 +30,9 @@ static inline bool numa_valid_node(int nid)
 #ifdef CONFIG_NUMA
 #include <asm/sparsemem.h>
 
+extern struct pglist_data *node_data[];
+#define NODE_DATA(nid)	(node_data[nid])
+
 /* Generic implementation available */
 int numa_nearest_node(int node, unsigned int state);
 
diff --git a/mm/numa.c b/mm/numa.c
index 67ca6b8585c0..8c157d41c026 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -3,6 +3,9 @@
 #include <linux/printk.h>
 #include <linux/numa.h>
 
+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
+
 /* Stub functions: */
 
 #ifndef memory_add_physaddr_to_nid
-- 
2.43.0


* [PATCH v4 08/26] mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (6 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 07/26] arch, mm: move definition of node_data to generic code Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 09/26] arch, mm: pull out allocation of NODE_DATA to generic code Mike Rapoport
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so
arch_alloc_nodedata() and arch_refresh_nodedata() are not needed
anymore.

Replace the call to arch_alloc_nodedata() in free_area_init() with a
new helper, alloc_offline_node_data(), remove arch_refresh_nodedata(),
and clean up the associated ifdefery in include/linux/memory_hotplug.h.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
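Note (not part of the commit): a rough userspace model of what the new
helper does for offline nodes in the free_area_init() loop. Here
mock_memblock_alloc() stands in for memblock_alloc() and abort() stands in
for panic(); both stand-ins, the one-field pg_data_t, MAX_NUMNODES and
init_offline_nodes() are assumptions for the sketch, not kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption for the sketch; the kernel derives this from its config. */
#define MAX_NUMNODES	4

/* Stand-in for the kernel's pg_data_t; only one field is modelled. */
typedef struct pglist_data {
	int node_id;
} pg_data_t;

static pg_data_t *node_data[MAX_NUMNODES];
#define NODE_DATA(nid)	(node_data[nid])

/* Userspace stand-in for memblock_alloc(): returns zeroed memory or NULL. */
static void *mock_memblock_alloc(size_t size)
{
	return calloc(1, size);
}

/* Mirrors the shape of the helper added to mm/numa.c by this patch. */
static void alloc_offline_node_data(int nid)
{
	pg_data_t *pgdat;

	assert(nid >= 0 && nid < MAX_NUMNODES);

	pgdat = mock_memblock_alloc(sizeof(*pgdat));
	if (!pgdat) {
		/* The kernel calls panic() here; the sketch aborts instead. */
		fprintf(stderr, "Cannot allocate %zuB for node %d.\n",
			sizeof(*pgdat), nid);
		abort();
	}

	node_data[nid] = pgdat;
}

/*
 * Hypothetical driver modelled on the loop in free_area_init(): only nodes
 * that are not online get a pgdat here. Returns how many were allocated.
 */
static int init_offline_nodes(const int *online)
{
	int nid, allocated = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (!online[nid]) {
			alloc_offline_node_data(nid);
			allocated++;
		}
	}

	return allocated;
}
```

The sketch leaves online nodes untouched, on the assumption that their
pgdat was already set up by earlier NUMA initialization.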
 include/linux/memory_hotplug.h | 48 ----------------------------------
 include/linux/numa.h           |  4 +++
 mm/mm_init.c                   | 10 ++-----
 mm/numa.c                      | 12 +++++++++
 4 files changed, 18 insertions(+), 56 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ebe876930e78..b27ddce5d324 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,54 +16,6 @@ struct resource;
 struct vmem_altmap;
 struct dev_pagemap;
 
-#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
-/*
- * For supporting node-hotadd, we have to allocate a new pgdat.
- *
- * If an arch has generic style NODE_DATA(),
- * node_data[nid] = kzalloc() works well. But it depends on the architecture.
- *
- * In general, generic_alloc_nodedata() is used.
- *
- */
-extern pg_data_t *arch_alloc_nodedata(int nid);
-extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
-
-#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
-#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
-
-#ifdef CONFIG_NUMA
-/*
- * XXX: node aware allocation can't work well to get new node's memory at this time.
- *	Because, pgdat for the new node is not allocated/initialized yet itself.
- *	To use new node's memory, more consideration will be necessary.
- */
-#define generic_alloc_nodedata(nid)				\
-({								\
-	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
-})
-
-extern pg_data_t *node_data[];
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-	node_data[nid] = pgdat;
-}
-
-#else /* !CONFIG_NUMA */
-
-/* never called */
-static inline pg_data_t *generic_alloc_nodedata(int nid)
-{
-	BUG();
-	return NULL;
-}
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-}
-#endif /* CONFIG_NUMA */
-#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
 
diff --git a/include/linux/numa.h b/include/linux/numa.h
index e5841d4057ab..b41b1569781b 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -33,6 +33,8 @@ static inline bool numa_valid_node(int nid)
 extern struct pglist_data *node_data[];
 #define NODE_DATA(nid)	(node_data[nid])
 
+void __init alloc_offline_node_data(int nid);
+
 /* Generic implementation available */
 int numa_nearest_node(int node, unsigned int state);
 
@@ -60,6 +62,8 @@ static inline int phys_to_target_node(u64 start)
 {
 	return 0;
 }
+
+static inline void alloc_offline_node_data(int nid) {}
 #endif
 
 #define numa_map_to_online_node(node) numa_nearest_node(node, N_ONLINE)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 75c3bd42799b..2785be04e7bb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1836,14 +1836,8 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	for_each_node(nid) {
 		pg_data_t *pgdat;
 
-		if (!node_online(nid)) {
-			/* Allocator not initialized yet */
-			pgdat = arch_alloc_nodedata(nid);
-			if (!pgdat)
-				panic("Cannot allocate %zuB for node %d.\n",
-				       sizeof(*pgdat), nid);
-			arch_refresh_nodedata(nid, pgdat);
-		}
+		if (!node_online(nid))
+			alloc_offline_node_data(nid);
 
 		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
diff --git a/mm/numa.c b/mm/numa.c
index 8c157d41c026..64e1b7d2c1ee 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -6,6 +6,18 @@
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
 
+void __init alloc_offline_node_data(int nid)
+{
+	pg_data_t *pgdat;
+
+	pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
+	if (!pgdat)
+		panic("Cannot allocate %zuB for node %d.\n",
+		      sizeof(*pgdat), nid);
+
+	node_data[nid] = pgdat;
+}
+
 /* Stub functions: */
 
 #ifndef memory_add_physaddr_to_nid
-- 
2.43.0


* [PATCH v4 09/26] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (7 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 08/26] mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 10/26] x86/numa: simplify numa_distance allocation Mike Rapoport
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Architectures that support NUMA duplicate the code that allocates
NODE_DATA on the node-local memory with slight variations in reporting
of the addresses where the memory was allocated.

Use the x86 version as the basis for the generic alloc_node_data()
function and call this function in architecture-specific NUMA
initialization.

Round up the node data size to SMP_CACHE_BYTES rather than to PAGE_SIZE.
x86 has rounded up to PAGE_SIZE since the bootmem era, when allocation
granularity was PAGE_SIZE anyway.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
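Note (not part of the commit): a small worked example of the rounding
change described above. The 64-byte cache line, the 4096-byte page and the
5000-byte structure size are assumptions picked for the arithmetic; the
real values depend on architecture and configuration. The roundup() macro
below is a simplified stand-in for the kernel's macro of the same name,
and node_data_alloc_size() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's roundup(); y must be non-zero. */
#define roundup(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

/* Assumed values for the sketch only. */
#define SKETCH_SMP_CACHE_BYTES	64
#define SKETCH_PAGE_SIZE	4096

/* Hypothetical helper: bytes reserved for a node's data at a given rounding. */
static size_t node_data_alloc_size(size_t raw_size, size_t align)
{
	assert(align != 0);
	return roundup(raw_size, align);
}
```

With these assumed numbers, a 5000-byte structure takes 5056 bytes when
rounded up to the cache line (79 * 64) and 8192 bytes when rounded up to
the page (2 * 4096), so the generic version reserves less memory per node.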
 arch/loongarch/kernel/numa.c | 18 ------------------
 arch/mips/loongson64/numa.c  | 16 ++--------------
 arch/powerpc/mm/numa.c       | 24 +++---------------------
 arch/sh/mm/init.c            |  7 +------
 arch/sparc/mm/init_64.c      |  9 ++-------
 arch/x86/mm/numa.c           | 34 +---------------------------------
 drivers/base/arch_numa.c     | 21 +--------------------
 include/linux/numa.h         |  1 +
 mm/numa.c                    | 27 +++++++++++++++++++++++++++
 9 files changed, 38 insertions(+), 119 deletions(-)

diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c
index acada671e020..84fe7f854820 100644
--- a/arch/loongarch/kernel/numa.c
+++ b/arch/loongarch/kernel/numa.c
@@ -187,24 +187,6 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
 }
 
-static void __init alloc_node_data(int nid)
-{
-	void *nd;
-	unsigned long nd_pa;
-	size_t nd_sz = roundup(sizeof(pg_data_t), PAGE_SIZE);
-
-	nd_pa = memblock_phys_alloc_try_nid(nd_sz, SMP_CACHE_BYTES, nid);
-	if (!nd_pa) {
-		pr_err("Cannot find %zu Byte for node_data (initial node: %d)\n", nd_sz, nid);
-		return;
-	}
-
-	nd = __va(nd_pa);
-
-	node_data[nid] = nd;
-	memset(nd, 0, sizeof(pg_data_t));
-}
-
 static void __init node_mem_init(unsigned int node)
 {
 	unsigned long start_pfn, end_pfn;
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index d56238745744..8388400d052f 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
 
 static void __init node_mem_init(unsigned int node)
 {
-	struct pglist_data *nd;
 	unsigned long node_addrspace_offset;
 	unsigned long start_pfn, end_pfn;
-	unsigned long nd_pa;
-	int tnid;
-	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
 
 	node_addrspace_offset = nid_to_addrbase(node);
 	pr_info("Node%d's addrspace_offset is 0x%lx\n",
@@ -96,16 +92,8 @@ static void __init node_mem_init(unsigned int node)
 	pr_info("Node%d: start_pfn=0x%lx, end_pfn=0x%lx\n",
 		node, start_pfn, end_pfn);
 
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, node);
-	if (!nd_pa)
-		panic("Cannot allocate %zu bytes for node %d data\n",
-		      nd_size, node);
-	nd = __va(nd_pa);
-	memset(nd, 0, sizeof(struct pglist_data));
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != node)
-		pr_info("NODE_DATA(%d) on node %d\n", node, tnid);
-	node_data[node] = nd;
+	alloc_node_data(node);
+
 	NODE_DATA(node)->node_start_pfn = start_pfn;
 	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0744a9a2944b..3c1da08304d0 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1093,27 +1093,9 @@ void __init dump_numa_cpu_topology(void)
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
 	u64 spanned_pages = end_pfn - start_pfn;
-	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
-	u64 nd_pa;
-	void *nd;
-	int tnid;
-
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-	if (!nd_pa)
-		panic("Cannot allocate %zu bytes for node %d data\n",
-		      nd_size, nid);
-
-	nd = __va(nd_pa);
-
-	/* report and initialize */
-	pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
-		nd_pa, nd_pa + nd_size - 1);
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != nid)
-		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
-
-	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+
+	alloc_node_data(nid);
+
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start_pfn;
 	NODE_DATA(nid)->node_spanned_pages = spanned_pages;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index d1fe90b2f5ff..2a88b0c9e70f 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -212,12 +212,7 @@ void __init allocate_pgdat(unsigned int nid)
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 
 #ifdef CONFIG_NUMA
-	NODE_DATA(nid) = memblock_alloc_try_nid(
-				sizeof(struct pglist_data),
-				SMP_CACHE_BYTES, MEMBLOCK_LOW_LIMIT,
-				MEMBLOCK_ALLOC_ACCESSIBLE, nid);
-	if (!NODE_DATA(nid))
-		panic("Can't allocate pgdat for node %d\n", nid);
+	alloc_node_data(nid);
 #endif
 
 	NODE_DATA(nid)->node_start_pfn = start_pfn;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index c6c7f43cb1e8..21f8cbbd0581 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1075,14 +1075,9 @@ static void __init allocate_node_data(int nid)
 {
 	struct pglist_data *p;
 	unsigned long start_pfn, end_pfn;
-#ifdef CONFIG_NUMA
 
-	NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data),
-					     SMP_CACHE_BYTES, nid);
-	if (!NODE_DATA(nid)) {
-		prom_printf("Cannot allocate pglist_data for nid[%d]\n", nid);
-		prom_halt();
-	}
+#ifdef CONFIG_NUMA
+	alloc_node_data(nid);
 
 	NODE_DATA(nid)->node_id = nid;
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 7de725d6bb05..5e1dde26674b 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -191,39 +191,6 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
 }
 
-/* Allocate NODE_DATA for a node on the local memory */
-static void __init alloc_node_data(int nid)
-{
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
-	u64 nd_pa;
-	void *nd;
-	int tnid;
-
-	/*
-	 * Allocate node data.  Try node-local memory and then any node.
-	 * Never allocate in DMA zone.
-	 */
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-	if (!nd_pa) {
-		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
-		       nd_size, nid);
-		return;
-	}
-	nd = __va(nd_pa);
-
-	/* report and initialize */
-	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
-	       nd_pa, nd_pa + nd_size - 1);
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != nid)
-		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
-
-	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
-}
-
 /**
  * numa_cleanup_meminfo - Cleanup a numa_meminfo
  * @mi: numa_meminfo to clean up
@@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index ceac5b59bf2b..b6af7475ec44 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -216,30 +216,11 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
  */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
-	u64 nd_pa;
-	void *nd;
-	int tnid;
-
 	if (start_pfn >= end_pfn)
 		pr_info("Initmem setup node %d [<memory-less node>]\n", nid);
 
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-	if (!nd_pa)
-		panic("Cannot allocate %zu bytes for node %d data\n",
-		      nd_size, nid);
-
-	nd = __va(nd_pa);
-
-	/* report and initialize */
-	pr_info("NODE_DATA [mem %#010Lx-%#010Lx]\n",
-		nd_pa, nd_pa + nd_size - 1);
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != nid)
-		pr_info("NODE_DATA(%d) on node %d\n", nid, tnid);
+	alloc_node_data(nid);
 
-	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start_pfn;
 	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
diff --git a/include/linux/numa.h b/include/linux/numa.h
index b41b1569781b..3567e40329eb 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -33,6 +33,7 @@ static inline bool numa_valid_node(int nid)
 extern struct pglist_data *node_data[];
 #define NODE_DATA(nid)	(node_data[nid])
 
+void __init alloc_node_data(int nid);
 void __init alloc_offline_node_data(int nid);
 
 /* Generic implementation available */
diff --git a/mm/numa.c b/mm/numa.c
index 64e1b7d2c1ee..1f1582dcdf4a 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -1,11 +1,38 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
+#include <linux/memblock.h>
 #include <linux/printk.h>
 #include <linux/numa.h>
 
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
 
+/* Allocate NODE_DATA for a node on the local memory */
+void __init alloc_node_data(int nid)
+{
+	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
+	u64 nd_pa;
+	void *nd;
+	int tnid;
+
+	/* Allocate node data.  Try node-local memory and then any node. */
+	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	if (!nd_pa)
+		panic("Cannot allocate %zu bytes for node %d data\n",
+		      nd_size, nid);
+	nd = __va(nd_pa);
+
+	/* report and initialize */
+	pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
+		nd_pa, nd_pa + nd_size - 1);
+	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+	if (tnid != nid)
+		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+
+	node_data[nid] = nd;
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+}
+
 void __init alloc_offline_node_data(int nid)
 {
 	pg_data_t *pgdat;
-- 
2.43.0



* [PATCH v4 10/26] x86/numa: simplify numa_distance allocation
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (8 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 09/26] arch, mm: pull out allocation of NODE_DATA to generic code Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 11/26] x86/numa: use get_pfn_range_for_nid to verify that node spans memory Mike Rapoport
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Allocation of numa_distance uses memblock_phys_alloc_range() to limit
allocation to be below the last mapped page.

But NUMA initialization runs after the direct map is populated and
there is also code in setup_arch() that adjusts memblock limit to
reflect how much memory is already mapped in the direct map.

Simplify the allocation of numa_distance and use plain memblock_alloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/numa.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5e1dde26674b..edfc38803779 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -331,7 +331,6 @@ static int __init numa_alloc_distance(void)
 	nodemask_t nodes_parsed;
 	size_t size;
 	int i, j, cnt = 0;
-	u64 phys;
 
 	/* size the new table and allocate it */
 	nodes_parsed = numa_nodes_parsed;
@@ -342,16 +341,14 @@ static int __init numa_alloc_distance(void)
 	cnt++;
 	size = cnt * cnt * sizeof(numa_distance[0]);
 
-	phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0,
-					 PFN_PHYS(max_pfn_mapped));
-	if (!phys) {
+	numa_distance = memblock_alloc(size, PAGE_SIZE);
+	if (!numa_distance) {
 		pr_warn("Warning: can't allocate distance table!\n");
 		/* don't retry until explicitly reset */
 		numa_distance = (void *)1LU;
 		return -ENOMEM;
 	}
 
-	numa_distance = __va(phys);
 	numa_distance_cnt = cnt;
 
 	/* fill with the default distances */
-- 
2.43.0



* [PATCH v4 11/26] x86/numa: use get_pfn_range_for_nid to verify that node spans memory
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (9 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 10/26] x86/numa: simplify numa_distance allocation Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 12/26] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Instead of looping over the numa_meminfo array to detect a node's start and
end addresses, use get_pfn_range_for_nid().

This is shorter and makes it easier to lift numa_memblks to generic code.
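
For illustration only, the removed loop can be modelled in userspace roughly
as below. This is a sketch, not kernel code: struct memblk and node_span()
are hypothetical stand-ins for the kernel's numa_memblk entries and the
open-coded loop, and UINT64_MAX replaces PFN_PHYS(max_pfn) as the initial
start value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one numa_meminfo block. */
struct memblk {
	uint64_t start;
	uint64_t end;
	int nid;
};

/*
 * Scan all blocks and compute the [start, end) span of node @nid,
 * mirroring the min/max loop that the patch removes.
 * Returns false when the node spans no memory (start >= end).
 */
static bool node_span(const struct memblk *blk, int nr_blks, int nid,
		      uint64_t *start, uint64_t *end)
{
	uint64_t lo = UINT64_MAX;	/* assumption: replaces PFN_PHYS(max_pfn) */
	uint64_t hi = 0;

	for (int i = 0; i < nr_blks; i++) {
		if (blk[i].nid != nid)
			continue;
		if (blk[i].start < lo)
			lo = blk[i].start;
		if (blk[i].end > hi)
			hi = blk[i].end;
	}

	*start = lo;
	*end = hi;
	return lo < hi;
}
```

In the patch, get_pfn_range_for_nid() supplies the same span in PFNs, and
the memory-less check becomes start_pfn >= end_pfn.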

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/numa.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index edfc38803779..30b0ec801b02 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -521,17 +521,14 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 
 	/* Finally register nodes. */
 	for_each_node_mask(nid, node_possible_map) {
-		u64 start = PFN_PHYS(max_pfn);
-		u64 end = 0;
+		unsigned long start_pfn, end_pfn;
 
-		for (i = 0; i < mi->nr_blks; i++) {
-			if (nid != mi->blk[i].nid)
-				continue;
-			start = min(mi->blk[i].start, start);
-			end = max(mi->blk[i].end, end);
-		}
-
-		if (start >= end)
+		/*
+		 * Note, get_pfn_range_for_nid() depends on
+		 * memblock_set_node() having already happened
+		 */
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+		if (start_pfn >= end_pfn)
 			continue;
 
 		alloc_node_data(nid);
-- 
2.43.0



* [PATCH v4 12/26] x86/numa: move FAKE_NODE_* defines to numa_emu
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (10 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 11/26] x86/numa: use get_pfn_range_for_nid to verify that node spans memory Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 13/26] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are
only used by the NUMA emulation code. Make them local to
arch/x86/mm/numa_emulation.c.
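
As a reading aid, the two macros evaluate to 32 MiB and to a mask that
clears the low 25 bits. The userspace sketch below copies the macro bodies
with uint64_t standing in for u64; fake_node_align_down() is a hypothetical
helper added only to show the effect of the mask and is not part of the
patch.

```c
#include <stdint.h>

/* Same expressions as the kernel macros, with uint64_t in place of u64. */
#define FAKE_NODE_MIN_SIZE	((uint64_t)32 << 20)
#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))

/* Hypothetical helper: round @addr down to a FAKE_NODE_MIN_SIZE boundary. */
static uint64_t fake_node_align_down(uint64_t addr)
{
	return addr & FAKE_NODE_MIN_HASH_MASK;
}
```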

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/numa.h  | 2 --
 arch/x86/mm/numa_emulation.c | 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index ef2844d69173..2dab1ada96cf 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -71,8 +71,6 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 #endif
 
 #ifdef CONFIG_NUMA_EMU
-#define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
-#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
 int numa_emu_cmdline(char *str);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 9a9305367fdd..1ce22e315b80 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -10,6 +10,9 @@
 
 #include "numa_internal.h"
 
+#define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
+#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
+
 static int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
 
-- 
2.43.0



* [PATCH v4 13/26] x86/numa_emu: simplify allocation of phys_dist
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (11 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 12/26] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 14/26] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

By the time numa_emulation() is called, all physical memory is already
mapped in the direct map and there is no need to define limits for
memblock allocation.

Replace memblock_phys_alloc_range() with memblock_alloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/numa_emulation.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 1ce22e315b80..439804e21962 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -448,15 +448,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 
 	/* copy the physical distance table */
 	if (numa_dist_cnt) {
-		u64 phys;
-
-		phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0,
-						 PFN_PHYS(max_pfn_mapped));
-		if (!phys) {
+		phys_dist = memblock_alloc(phys_size, PAGE_SIZE);
+		if (!phys_dist) {
 			pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
 			goto no_emu;
 		}
-		phys_dist = __va(phys);
 
 		for (i = 0; i < numa_dist_cnt; i++)
 			for (j = 0; j < numa_dist_cnt; j++)
-- 
2.43.0



* [PATCH v4 14/26] x86/numa_emu: split __apicid_to_node update to a helper function
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (12 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 13/26] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:40 ` [PATCH v4 15/26] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Splitting the __apicid_to_node update into a helper is required to make the
NUMA emulation code architecture independent so that it can be moved to
generic code in the following commits.
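
For readers unfamiliar with the reverse mapping, here is a minimal userspace
model of what the helper does. It is a sketch under assumptions: the
function name remap_to_emulated() and the explicit array-length parameter
are hypothetical, and a plain short array stands in for the kernel's
__apicid_to_node table.

```c
#include <stddef.h>

#define NUMA_NO_NODE	(-1)

/*
 * Rewrite each entry of @apicid_to_node from a physical node id to the
 * first emulated node id that maps back to that physical node.
 * Entries without a match fall back to 0, as in the patch.
 */
static void remap_to_emulated(short *apicid_to_node, size_t nr_apicids,
			      const int *emu_nid_to_phys,
			      unsigned int nr_emu_nids)
{
	for (size_t i = 0; i < nr_apicids; i++) {
		unsigned int j;

		if (apicid_to_node[i] == NUMA_NO_NODE)
			continue;
		for (j = 0; j < nr_emu_nids; j++)
			if (apicid_to_node[i] == emu_nid_to_phys[j])
				break;
		apicid_to_node[i] = j < nr_emu_nids ? (short)j : 0;
	}
}
```

With emu_nid_to_phys = {0, 0, 1, 1}, an APIC id on physical node 1 ends up
on emulated node 2, the first emulated node backed by physical node 1.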

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/numa.h  |  2 ++
 arch/x86/mm/numa.c           | 22 ++++++++++++++++++++++
 arch/x86/mm/numa_emulation.c | 14 +-------------
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 2dab1ada96cf..7017d540894a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -72,6 +72,8 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 
 #ifdef CONFIG_NUMA_EMU
 int numa_emu_cmdline(char *str);
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 30b0ec801b02..ea3fc2d866e2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -852,6 +852,28 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
+#ifdef CONFIG_NUMA_EMU
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids)
+{
+	int i, j;
+
+	/*
+	 * Transform __apicid_to_node table to use emulated nids by
+	 * reverse-mapping phys_nid.  The maps should always exist but fall
+	 * back to zero just in case.
+	 */
+	for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
+		if (__apicid_to_node[i] == NUMA_NO_NODE)
+			continue;
+		for (j = 0; j < nr_emu_nids; j++)
+			if (__apicid_to_node[i] == emu_nid_to_phys[j])
+				break;
+		__apicid_to_node[i] = j < nr_emu_nids ? j : 0;
+	}
+}
+#endif /* CONFIG_NUMA_EMU */
+
 #ifdef CONFIG_NUMA_KEEP_MEMINFO
 static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
 {
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 439804e21962..f2746e52ab93 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -476,19 +476,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		    ei.blk[i].nid != NUMA_NO_NODE)
 			node_set(ei.blk[i].nid, numa_nodes_parsed);
 
-	/*
-	 * Transform __apicid_to_node table to use emulated nids by
-	 * reverse-mapping phys_nid.  The maps should always exist but fall
-	 * back to zero just in case.
-	 */
-	for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
-		if (__apicid_to_node[i] == NUMA_NO_NODE)
-			continue;
-		for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
-			if (__apicid_to_node[i] == emu_nid_to_phys[j])
-				break;
-		__apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
-	}
+	numa_emu_update_cpu_to_node(emu_nid_to_phys, ARRAY_SIZE(emu_nid_to_phys));
 
 	/* make sure all emulated nodes are mapped to a physical node */
 	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
-- 
2.43.0



* [PATCH v4 15/26] x86/numa_emu: use a helper function to get MAX_DMA32_PFN
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (13 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 14/26] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
@ 2024-08-07  6:40 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 16/26] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Wrapping MAX_DMA32_PFN in a helper function is required to make the NUMA
emulation code architecture independent so that it can be moved to generic
code in the following commits.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/numa.h  | 1 +
 arch/x86/mm/numa.c           | 5 +++++
 arch/x86/mm/numa_emulation.c | 4 ++--
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 7017d540894a..b22c85c1ef18 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -74,6 +74,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 int numa_emu_cmdline(char *str);
 void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
 					unsigned int nr_emu_nids);
+u64 __init numa_emu_dma_end(void);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ea3fc2d866e2..2acf8d17b9b8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -872,6 +872,11 @@ void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
 		__apicid_to_node[i] = j < nr_emu_nids ? j : 0;
 	}
 }
+
+u64 __init numa_emu_dma_end(void)
+{
+	return PFN_PHYS(MAX_DMA32_PFN);
+}
 #endif /* CONFIG_NUMA_EMU */
 
 #ifdef CONFIG_NUMA_KEEP_MEMINFO
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index f2746e52ab93..fb4814497446 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -128,7 +128,7 @@ static int __init split_nodes_interleave(struct numa_meminfo *ei,
 	 */
 	while (!nodes_empty(physnode_mask)) {
 		for_each_node_mask(i, physnode_mask) {
-			u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+			u64 dma32_end = numa_emu_dma_end();
 			u64 start, limit, end;
 			int phys_blk;
 
@@ -275,7 +275,7 @@ static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
 	 */
 	while (!nodes_empty(physnode_mask)) {
 		for_each_node_mask(i, physnode_mask) {
-			u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+			u64 dma32_end = numa_emu_dma_end();
 			u64 start, limit, end;
 			int phys_blk;
 
-- 
2.43.0



* [PATCH v4 16/26] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (14 preceding siblings ...)
  2024-08-07  6:40 ` [PATCH v4 15/26] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 17/26] mm: introduce numa_memblks Mike Rapoport
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

CPU id cannot be negative.

Making it unsigned also aligns with declarations in
include/asm-generic/numa.h used by arm64 and riscv and allows sharing
numa emulation code with these architectures.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/numa.h  | 10 +++++-----
 arch/x86/mm/numa.c           | 10 +++++-----
 arch/x86/mm/numa_emulation.c | 10 +++++-----
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index b22c85c1ef18..6fa5ea925aac 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -54,20 +54,20 @@ static inline int numa_cpu_node(int cpu)
 extern void numa_set_node(int cpu, int node);
 extern void numa_clear_node(int cpu);
 extern void __init init_cpu_to_node(void);
-extern void numa_add_cpu(int cpu);
-extern void numa_remove_cpu(int cpu);
+extern void numa_add_cpu(unsigned int cpu);
+extern void numa_remove_cpu(unsigned int cpu);
 extern void init_gi_nodes(void);
 #else	/* CONFIG_NUMA */
 static inline void numa_set_node(int cpu, int node)	{ }
 static inline void numa_clear_node(int cpu)		{ }
 static inline void init_cpu_to_node(void)		{ }
-static inline void numa_add_cpu(int cpu)		{ }
-static inline void numa_remove_cpu(int cpu)		{ }
+static inline void numa_add_cpu(unsigned int cpu)	{ }
+static inline void numa_remove_cpu(unsigned int cpu)	{ }
 static inline void init_gi_nodes(void)			{ }
 #endif	/* CONFIG_NUMA */
 
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-void debug_cpumask_set_cpu(int cpu, int node, bool enable);
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable);
 #endif
 
 #ifdef CONFIG_NUMA_EMU
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2acf8d17b9b8..bf56f667fe0f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -741,12 +741,12 @@ void __init init_cpu_to_node(void)
 #ifndef CONFIG_DEBUG_PER_CPU_MAPS
 
 # ifndef CONFIG_NUMA_EMU
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	cpumask_set_cpu(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	cpumask_clear_cpu(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]);
 }
@@ -784,7 +784,7 @@ int early_cpu_to_node(int cpu)
 	return per_cpu(x86_cpu_to_node_map, cpu);
 }
 
-void debug_cpumask_set_cpu(int cpu, int node, bool enable)
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)
 {
 	struct cpumask *mask;
 
@@ -816,12 +816,12 @@ static void numa_set_cpumask(int cpu, bool enable)
 	debug_cpumask_set_cpu(cpu, early_cpu_to_node(cpu), enable);
 }
 
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, true);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, false);
 }
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index fb4814497446..235f8a4eb2fa 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -514,7 +514,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 }
 
 #ifndef CONFIG_DEBUG_PER_CPU_MAPS
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	int physnid, nid;
 
@@ -532,7 +532,7 @@ void numa_add_cpu(int cpu)
 			cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	int i;
 
@@ -540,7 +540,7 @@ void numa_remove_cpu(int cpu)
 		cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
 }
 #else	/* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void numa_set_cpumask(int cpu, bool enable)
+static void numa_set_cpumask(unsigned int cpu, bool enable)
 {
 	int nid, physnid;
 
@@ -560,12 +560,12 @@ static void numa_set_cpumask(int cpu, bool enable)
 	}
 }
 
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, true);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, false);
 }
-- 
2.43.0



* [PATCH v4 17/26] mm: introduce numa_memblks
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (15 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 16/26] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 18/26] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig
options to let x86 select it in its Kconfig.

This code will be later reused by arch_numa.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/Kconfig             |   1 +
 arch/x86/include/asm/numa.h  |   3 -
 arch/x86/mm/amdtopology.c    |   1 +
 arch/x86/mm/numa.c           | 372 +--------------------------------
 arch/x86/mm/numa_emulation.c |   1 +
 arch/x86/mm/numa_internal.h  |  15 +-
 drivers/acpi/numa/srat.c     |   1 +
 drivers/of/of_numa.c         |   1 +
 include/linux/numa_memblks.h |  35 ++++
 mm/Kconfig                   |   3 +
 mm/Makefile                  |   1 +
 mm/numa_memblks.c            | 385 +++++++++++++++++++++++++++++++++++
 12 files changed, 436 insertions(+), 383 deletions(-)
 create mode 100644 include/linux/numa_memblks.h
 create mode 100644 mm/numa_memblks.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 007bab9f2a0e..74afb59c6603 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -296,6 +296,7 @@ config X86
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
 	select NEED_SG_DMA_LENGTH
+	select NUMA_MEMBLKS			if NUMA
 	select PCI_DOMAINS			if PCI
 	select PCI_LOCKLESS_CONFIG		if PCI
 	select PERF_EVENTS
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 6fa5ea925aac..6e9a50bf03d4 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -10,8 +10,6 @@
 
 #ifdef CONFIG_NUMA
 
-#define NR_NODE_MEMBLKS		(MAX_NUMNODES*2)
-
 extern int numa_off;
 
 /*
@@ -25,7 +23,6 @@ extern int numa_off;
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
 
-extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
 
 static inline void set_apicid_to_node(int apicid, s16 node)
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index 9332b36a1091..628833afee37 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/io.h>
 #include <linux/pci_ids.h>
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index bf56f667fe0f..0bada905f409 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/topology.h>
 #include <linux/sort.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/e820/api.h>
 #include <asm/proto.h>
@@ -22,10 +23,6 @@
 #include "numa_internal.h"
 
 int numa_off;
-nodemask_t numa_nodes_parsed __initdata;
-
-static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
@@ -121,194 +118,6 @@ void __init setup_node_to_cpumask_map(void)
 	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
-static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
-				     struct numa_meminfo *mi)
-{
-	/* ignore zero length blks */
-	if (start == end)
-		return 0;
-
-	/* whine about and ignore invalid blks */
-	if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
-		pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
-			nid, start, end - 1);
-		return 0;
-	}
-
-	if (mi->nr_blks >= NR_NODE_MEMBLKS) {
-		pr_err("too many memblk ranges\n");
-		return -EINVAL;
-	}
-
-	mi->blk[mi->nr_blks].start = start;
-	mi->blk[mi->nr_blks].end = end;
-	mi->blk[mi->nr_blks].nid = nid;
-	mi->nr_blks++;
-	return 0;
-}
-
-/**
- * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
- * @idx: Index of memblk to remove
- * @mi: numa_meminfo to remove memblk from
- *
- * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
- * decrementing @mi->nr_blks.
- */
-void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
-{
-	mi->nr_blks--;
-	memmove(&mi->blk[idx], &mi->blk[idx + 1],
-		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
-}
-
-/**
- * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
- * @dst: numa_meminfo to append block to
- * @idx: Index of memblk to remove
- * @src: numa_meminfo to remove memblk from
- */
-static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
-					 struct numa_meminfo *src)
-{
-	dst->blk[dst->nr_blks++] = src->blk[idx];
-	numa_remove_memblk_from(idx, src);
-}
-
-/**
- * numa_add_memblk - Add one numa_memblk to numa_meminfo
- * @nid: NUMA node ID of the new memblk
- * @start: Start address of the new memblk
- * @end: End address of the new memblk
- *
- * Add a new memblk to the default numa_meminfo.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_add_memblk(int nid, u64 start, u64 end)
-{
-	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
-}
-
-/**
- * numa_cleanup_meminfo - Cleanup a numa_meminfo
- * @mi: numa_meminfo to clean up
- *
- * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
- * conflicts and clear unused memblks.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
-{
-	const u64 low = 0;
-	const u64 high = PFN_PHYS(max_pfn);
-	int i, j, k;
-
-	/* first, trim all entries */
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *bi = &mi->blk[i];
-
-		/* move / save reserved memory ranges */
-		if (!memblock_overlaps_region(&memblock.memory,
-					bi->start, bi->end - bi->start)) {
-			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
-			continue;
-		}
-
-		/* make sure all non-reserved blocks are inside the limits */
-		bi->start = max(bi->start, low);
-
-		/* preserve info for non-RAM areas above 'max_pfn': */
-		if (bi->end > high) {
-			numa_add_memblk_to(bi->nid, high, bi->end,
-					   &numa_reserved_meminfo);
-			bi->end = high;
-		}
-
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
-			numa_remove_memblk_from(i--, mi);
-	}
-
-	/* merge neighboring / overlapping entries */
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *bi = &mi->blk[i];
-
-		for (j = i + 1; j < mi->nr_blks; j++) {
-			struct numa_memblk *bj = &mi->blk[j];
-			u64 start, end;
-
-			/*
-			 * See whether there are overlapping blocks.  Whine
-			 * about but allow overlaps of the same nid.  They
-			 * will be merged below.
-			 */
-			if (bi->end > bj->start && bi->start < bj->end) {
-				if (bi->nid != bj->nid) {
-					pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
-					       bi->nid, bi->start, bi->end - 1,
-					       bj->nid, bj->start, bj->end - 1);
-					return -EINVAL;
-				}
-				pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
-					bi->nid, bi->start, bi->end - 1,
-					bj->start, bj->end - 1);
-			}
-
-			/*
-			 * Join together blocks on the same node, holes
-			 * between which don't overlap with memory on other
-			 * nodes.
-			 */
-			if (bi->nid != bj->nid)
-				continue;
-			start = min(bi->start, bj->start);
-			end = max(bi->end, bj->end);
-			for (k = 0; k < mi->nr_blks; k++) {
-				struct numa_memblk *bk = &mi->blk[k];
-
-				if (bi->nid == bk->nid)
-					continue;
-				if (start < bk->end && end > bk->start)
-					break;
-			}
-			if (k < mi->nr_blks)
-				continue;
-			printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
-			       bi->nid, bi->start, bi->end - 1, bj->start,
-			       bj->end - 1, start, end - 1);
-			bi->start = start;
-			bi->end = end;
-			numa_remove_memblk_from(j--, mi);
-		}
-	}
-
-	/* clear unused ones */
-	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
-		mi->blk[i].start = mi->blk[i].end = 0;
-		mi->blk[i].nid = NUMA_NO_NODE;
-	}
-
-	return 0;
-}
-
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
-					      const struct numa_meminfo *mi)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
-		if (mi->blk[i].start != mi->blk[i].end &&
-		    mi->blk[i].nid != NUMA_NO_NODE)
-			node_set(mi->blk[i].nid, *nodemask);
-}
-
 /**
  * numa_reset_distance - Reset NUMA distance table
  *
@@ -410,111 +219,13 @@ int __node_distance(int from, int to)
 }
 EXPORT_SYMBOL(__node_distance);
 
-/*
- * Mark all currently memblock-reserved physical memory (which covers the
- * kernel's own memory ranges) as hot-unswappable.
- */
-static void __init numa_clear_kernel_node_hotplug(void)
-{
-	nodemask_t reserved_nodemask = NODE_MASK_NONE;
-	struct memblock_region *mb_region;
-	int i;
-
-	/*
-	 * We have to do some preprocessing of memblock regions, to
-	 * make them suitable for reservation.
-	 *
-	 * At this time, all memory regions reserved by memblock are
-	 * used by the kernel, but those regions are not split up
-	 * along node boundaries yet, and don't necessarily have their
-	 * node ID set yet either.
-	 *
-	 * So iterate over all memory known to the x86 architecture,
-	 * and use those ranges to set the nid in memblock.reserved.
-	 * This will split up the memblock regions along node
-	 * boundaries and will set the node IDs as well.
-	 */
-	for (i = 0; i < numa_meminfo.nr_blks; i++) {
-		struct numa_memblk *mb = numa_meminfo.blk + i;
-		int ret;
-
-		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
-		WARN_ON_ONCE(ret);
-	}
-
-	/*
-	 * Now go over all reserved memblock regions, to construct a
-	 * node mask of all kernel reserved memory areas.
-	 *
-	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
-	 *   numa_meminfo might not include all memblock.reserved
-	 *   memory ranges, because quirks such as trim_snb_memory()
-	 *   reserve specific pages for Sandy Bridge graphics. ]
-	 */
-	for_each_reserved_mem_region(mb_region) {
-		int nid = memblock_get_region_node(mb_region);
-
-		if (nid != NUMA_NO_NODE)
-			node_set(nid, reserved_nodemask);
-	}
-
-	/*
-	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
-	 * belonging to the reserved node mask.
-	 *
-	 * Note that this will include memory regions that reside
-	 * on nodes that contain kernel memory - entire nodes
-	 * become hot-unpluggable:
-	 */
-	for (i = 0; i < numa_meminfo.nr_blks; i++) {
-		struct numa_memblk *mb = numa_meminfo.blk + i;
-
-		if (!node_isset(mb->nid, reserved_nodemask))
-			continue;
-
-		memblock_clear_hotplug(mb->start, mb->end - mb->start);
-	}
-}
-
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
-	int i, nid;
+	int nid, err;
 
-	/* Account for nodes with cpus and no memory */
-	node_possible_map = numa_nodes_parsed;
-	numa_nodemask_from_meminfo(&node_possible_map, mi);
-	if (WARN_ON(nodes_empty(node_possible_map)))
-		return -EINVAL;
-
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start,
-				  &memblock.memory, mb->nid);
-	}
-
-	/*
-	 * At very early time, the kernel have to use some memory such as
-	 * loading the kernel image. We cannot prevent this anyway. So any
-	 * node the kernel resides in should be un-hotpluggable.
-	 *
-	 * And when we come here, alloc node data won't fail.
-	 */
-	numa_clear_kernel_node_hotplug();
-
-	/*
-	 * If sections array is gonna be used for pfn -> nid mapping, check
-	 * whether its granularity is fine enough.
-	 */
-	if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) {
-		unsigned long pfn_align = node_map_pfn_alignment();
-
-		if (pfn_align && pfn_align < PAGES_PER_SECTION) {
-			pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
-				PFN_PHYS(pfn_align) >> 20,
-				PFN_PHYS(PAGES_PER_SECTION) >> 20);
-			return -EINVAL;
-		}
-	}
+	err = numa_register_meminfo(mi);
+	if (err)
+		return err;
 
 	if (!memblock_validate_numa_coverage(SZ_1M))
 		return -EINVAL;
@@ -916,76 +627,3 @@ int memory_add_physaddr_to_nid(u64 start)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 
 #endif
-
-static int __init cmp_memblk(const void *a, const void *b)
-{
-	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
-	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
-
-	return (ma->start > mb->start) - (ma->start < mb->start);
-}
-
-static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
-
-/**
- * numa_fill_memblks - Fill gaps in numa_meminfo memblks
- * @start: address to begin fill
- * @end: address to end fill
- *
- * Find and extend numa_meminfo memblks to cover the physical
- * address range @start-@end
- *
- * RETURNS:
- * 0		  : Success
- * NUMA_NO_MEMBLK : No memblks exist in address range @start-@end
- */
-
-int __init numa_fill_memblks(u64 start, u64 end)
-{
-	struct numa_memblk **blk = &numa_memblk_list[0];
-	struct numa_meminfo *mi = &numa_meminfo;
-	int count = 0;
-	u64 prev_end;
-
-	/*
-	 * Create a list of pointers to numa_meminfo memblks that
-	 * overlap start, end. The list is used to make in-place
-	 * changes that fill out the numa_meminfo memblks.
-	 */
-	for (int i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *bi = &mi->blk[i];
-
-		if (memblock_addrs_overlap(start, end - start, bi->start,
-					   bi->end - bi->start)) {
-			blk[count] = &mi->blk[i];
-			count++;
-		}
-	}
-	if (!count)
-		return NUMA_NO_MEMBLK;
-
-	/* Sort the list of pointers in memblk->start order */
-	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
-
-	/* Make sure the first/last memblks include start/end */
-	blk[0]->start = min(blk[0]->start, start);
-	blk[count - 1]->end = max(blk[count - 1]->end, end);
-
-	/*
-	 * Fill any gaps by tracking the previous memblks
-	 * end address and backfilling to it if needed.
-	 */
-	prev_end = blk[0]->end;
-	for (int i = 1; i < count; i++) {
-		struct numa_memblk *curr = blk[i];
-
-		if (prev_end >= curr->start) {
-			if (prev_end < curr->end)
-				prev_end = curr->end;
-		} else {
-			curr->start = prev_end;
-			prev_end = curr->end;
-		}
-	}
-	return 0;
-}
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 235f8a4eb2fa..33610026b7a3 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -6,6 +6,7 @@
 #include <linux/errno.h>
 #include <linux/topology.h>
 #include <linux/memblock.h>
+#include <linux/numa_memblks.h>
 #include <asm/dma.h>
 
 #include "numa_internal.h"
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index 86860f279662..a51229a2f5af 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -5,23 +5,12 @@
 #include <linux/types.h>
 #include <asm/numa.h>
 
-struct numa_memblk {
-	u64			start;
-	u64			end;
-	int			nid;
-};
-
-struct numa_meminfo {
-	int			nr_blks;
-	struct numa_memblk	blk[NR_NODE_MEMBLKS];
-};
-
-void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
-int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
 void __init numa_reset_distance(void);
 
 void __init x86_numa_init(void);
 
+struct numa_meminfo;
+
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,
 			   int numa_dist_cnt);
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 44f91f2c6c5d..bec0dcd1f9c3 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -17,6 +17,7 @@
 #include <linux/numa.h>
 #include <linux/nodemask.h>
 #include <linux/topology.h>
+#include <linux/numa_memblks.h>
 
 static nodemask_t nodes_found_map = NODE_MASK_NONE;
 
diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
index 5949829a1b00..838747e319a2 100644
--- a/drivers/of/of_numa.c
+++ b/drivers/of/of_numa.c
@@ -10,6 +10,7 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <linux/nodemask.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/numa.h>
 
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
new file mode 100644
index 000000000000..6981cf97d2c9
--- /dev/null
+++ b/include/linux/numa_memblks.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NUMA_MEMBLKS_H
+#define __NUMA_MEMBLKS_H
+
+#ifdef CONFIG_NUMA_MEMBLKS
+#include <linux/types.h>
+
+#define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
+
+struct numa_memblk {
+	u64			start;
+	u64			end;
+	int			nid;
+};
+
+struct numa_meminfo {
+	int			nr_blks;
+	struct numa_memblk	blk[NR_NODE_MEMBLKS];
+};
+
+extern struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+extern struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
+
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+int __init numa_register_meminfo(struct numa_meminfo *mi);
+
+void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+				       const struct numa_meminfo *mi);
+
+#endif /* CONFIG_NUMA_MEMBLKS */
+
+#endif	/* __NUMA_MEMBLKS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index b72e7d040f78..dc5912d29ed5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1263,6 +1263,9 @@ config IOMMU_MM_DATA
 config EXECMEM
 	bool
 
+config NUMA_MEMBLKS
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 4e668be85f0b..e3fac7efd880 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -142,3 +142,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_NUMA_MEMBLKS) += numa_memblks.o
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
new file mode 100644
index 000000000000..72f191a94c66
--- /dev/null
+++ b/mm/numa_memblks.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/array_size.h>
+#include <linux/sort.h>
+#include <linux/printk.h>
+#include <linux/memblock.h>
+#include <linux/numa.h>
+#include <linux/numa_memblks.h>
+
+nodemask_t numa_nodes_parsed __initdata;
+
+struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+				     struct numa_meminfo *mi)
+{
+	/* ignore zero length blks */
+	if (start == end)
+		return 0;
+
+	/* whine about and ignore invalid blks */
+	if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
+		pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
+			nid, start, end - 1);
+		return 0;
+	}
+
+	if (mi->nr_blks >= NR_NODE_MEMBLKS) {
+		pr_err("too many memblk ranges\n");
+		return -EINVAL;
+	}
+
+	mi->blk[mi->nr_blks].start = start;
+	mi->blk[mi->nr_blks].end = end;
+	mi->blk[mi->nr_blks].nid = nid;
+	mi->nr_blks++;
+	return 0;
+}
+
+/**
+ * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
+ * @idx: Index of memblk to remove
+ * @mi: numa_meminfo to remove memblk from
+ *
+ * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
+ * decrementing @mi->nr_blks.
+ */
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
+{
+	mi->nr_blks--;
+	memmove(&mi->blk[idx], &mi->blk[idx + 1],
+		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
+}
+
+/**
+ * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
+ * @dst: numa_meminfo to append block to
+ * @idx: Index of memblk to remove
+ * @src: numa_meminfo to remove memblk from
+ */
+static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
+					 struct numa_meminfo *src)
+{
+	dst->blk[dst->nr_blks++] = src->blk[idx];
+	numa_remove_memblk_from(idx, src);
+}
+
+/**
+ * numa_add_memblk - Add one numa_memblk to numa_meminfo
+ * @nid: NUMA node ID of the new memblk
+ * @start: Start address of the new memblk
+ * @end: End address of the new memblk
+ *
+ * Add a new memblk to the default numa_meminfo.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init numa_add_memblk(int nid, u64 start, u64 end)
+{
+	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
+}
+
+/**
+ * numa_cleanup_meminfo - Cleanup a numa_meminfo
+ * @mi: numa_meminfo to clean up
+ *
+ * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
+ * conflicts and clear unused memblks.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
+{
+	const u64 low = 0;
+	const u64 high = PFN_PHYS(max_pfn);
+	int i, j, k;
+
+	/* first, trim all entries */
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		/* move / save reserved memory ranges */
+		if (!memblock_overlaps_region(&memblock.memory,
+					bi->start, bi->end - bi->start)) {
+			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
+			continue;
+		}
+
+		/* make sure all non-reserved blocks are inside the limits */
+		bi->start = max(bi->start, low);
+
+		/* preserve info for non-RAM areas above 'max_pfn': */
+		if (bi->end > high) {
+			numa_add_memblk_to(bi->nid, high, bi->end,
+					   &numa_reserved_meminfo);
+			bi->end = high;
+		}
+
+		/* and there's no empty block */
+		if (bi->start >= bi->end)
+			numa_remove_memblk_from(i--, mi);
+	}
+
+	/* merge neighboring / overlapping entries */
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		for (j = i + 1; j < mi->nr_blks; j++) {
+			struct numa_memblk *bj = &mi->blk[j];
+			u64 start, end;
+
+			/*
+			 * See whether there are overlapping blocks.  Whine
+			 * about but allow overlaps of the same nid.  They
+			 * will be merged below.
+			 */
+			if (bi->end > bj->start && bi->start < bj->end) {
+				if (bi->nid != bj->nid) {
+					pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
+					       bi->nid, bi->start, bi->end - 1,
+					       bj->nid, bj->start, bj->end - 1);
+					return -EINVAL;
+				}
+				pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
+					bi->nid, bi->start, bi->end - 1,
+					bj->start, bj->end - 1);
+			}
+
+			/*
+			 * Join together blocks on the same node, holes
+			 * between which don't overlap with memory on other
+			 * nodes.
+			 */
+			if (bi->nid != bj->nid)
+				continue;
+			start = min(bi->start, bj->start);
+			end = max(bi->end, bj->end);
+			for (k = 0; k < mi->nr_blks; k++) {
+				struct numa_memblk *bk = &mi->blk[k];
+
+				if (bi->nid == bk->nid)
+					continue;
+				if (start < bk->end && end > bk->start)
+					break;
+			}
+			if (k < mi->nr_blks)
+				continue;
+			pr_info("NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
+			       bi->nid, bi->start, bi->end - 1, bj->start,
+			       bj->end - 1, start, end - 1);
+			bi->start = start;
+			bi->end = end;
+			numa_remove_memblk_from(j--, mi);
+		}
+	}
+
+	/* clear unused ones */
+	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
+		mi->blk[i].start = mi->blk[i].end = 0;
+		mi->blk[i].nid = NUMA_NO_NODE;
+	}
+
+	return 0;
+}
+
+/*
+ * Set nodes, which have memory in @mi, in *@nodemask.
+ */
+void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+				       const struct numa_meminfo *mi)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
+		if (mi->blk[i].start != mi->blk[i].end &&
+		    mi->blk[i].nid != NUMA_NO_NODE)
+			node_set(mi->blk[i].nid, *nodemask);
+}
+
+/*
+ * Mark all currently memblock-reserved physical memory (which covers the
+ * kernel's own memory ranges) as non-hotpluggable.
+ */
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+	nodemask_t reserved_nodemask = NODE_MASK_NONE;
+	struct memblock_region *mb_region;
+	int i;
+
+	/*
+	 * We have to do some preprocessing of memblock regions, to
+	 * make them suitable for reservation.
+	 *
+	 * At this time, all memory regions reserved by memblock are
+	 * used by the kernel, but those regions are not split up
+	 * along node boundaries yet, and don't necessarily have their
+	 * node ID set yet either.
+	 *
+	 * So iterate over all parsed memory blocks and use those ranges to
+	 * set the nid in memblock.reserved.  This will split up the
+	 * memblock regions along node boundaries and will set the node IDs
+	 * as well.
+	 */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		struct numa_memblk *mb = numa_meminfo.blk + i;
+		int ret;
+
+		ret = memblock_set_node(mb->start, mb->end - mb->start,
+					&memblock.reserved, mb->nid);
+		WARN_ON_ONCE(ret);
+	}
+
+	/*
+	 * Now go over all reserved memblock regions, to construct a
+	 * node mask of all kernel reserved memory areas.
+	 *
+	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
+	 *   numa_meminfo might not include all memblock.reserved
+	 *   memory ranges, because quirks such as trim_snb_memory()
+	 *   reserve specific pages for Sandy Bridge graphics. ]
+	 */
+	for_each_reserved_mem_region(mb_region) {
+		int nid = memblock_get_region_node(mb_region);
+
+		if (nid != MAX_NUMNODES)
+			node_set(nid, reserved_nodemask);
+	}
+
+	/*
+	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
+	 * belonging to the reserved node mask.
+	 *
+	 * Note that this will include memory regions that reside
+	 * on nodes that contain kernel memory - entire nodes
+	 * become non-hotpluggable:
+	 */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		struct numa_memblk *mb = numa_meminfo.blk + i;
+
+		if (!node_isset(mb->nid, reserved_nodemask))
+			continue;
+
+		memblock_clear_hotplug(mb->start, mb->end - mb->start);
+	}
+}
+
+int __init numa_register_meminfo(struct numa_meminfo *mi)
+{
+	int i;
+
+	/* Account for nodes with cpus and no memory */
+	node_possible_map = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&node_possible_map, mi);
+	if (WARN_ON(nodes_empty(node_possible_map)))
+		return -EINVAL;
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+
+		memblock_set_node(mb->start, mb->end - mb->start,
+				  &memblock.memory, mb->nid);
+	}
+
+	/*
+	 * Very early in boot, the kernel has to use some memory, e.g. for
+	 * loading the kernel image. We cannot prevent this, so any node the
+	 * kernel resides in should be un-hotpluggable.
+	 *
+	 * And when we come here, alloc node data won't fail.
+	 */
+	numa_clear_kernel_node_hotplug();
+
+	/*
+	 * If the sections array is going to be used for pfn -> nid mapping,
+	 * check whether its granularity is fine enough.
+	 */
+	if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) {
+		unsigned long pfn_align = node_map_pfn_alignment();
+
+		if (pfn_align && pfn_align < PAGES_PER_SECTION) {
+			pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
+				PFN_PHYS(pfn_align) >> 20,
+				PFN_PHYS(PAGES_PER_SECTION) >> 20);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int __init cmp_memblk(const void *a, const void *b)
+{
+	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
+	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
+
+	return (ma->start > mb->start) - (ma->start < mb->start);
+}
+
+static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
+
+/**
+ * numa_fill_memblks - Fill gaps in numa_meminfo memblks
+ * @start: address to begin fill
+ * @end: address to end fill
+ *
+ * Find and extend numa_meminfo memblks to cover the physical
+ * address range @start-@end
+ *
+ * RETURNS:
+ * 0		  : Success
+ * NUMA_NO_MEMBLK : No memblks exist in address range @start-@end
+ */
+
+int __init numa_fill_memblks(u64 start, u64 end)
+{
+	struct numa_memblk **blk = &numa_memblk_list[0];
+	struct numa_meminfo *mi = &numa_meminfo;
+	int count = 0;
+	u64 prev_end;
+
+	/*
+	 * Create a list of pointers to numa_meminfo memblks that
+	 * overlap start, end. The list is used to make in-place
+	 * changes that fill out the numa_meminfo memblks.
+	 */
+	for (int i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		if (memblock_addrs_overlap(start, end - start, bi->start,
+					   bi->end - bi->start)) {
+			blk[count] = &mi->blk[i];
+			count++;
+		}
+	}
+	if (!count)
+		return NUMA_NO_MEMBLK;
+
+	/* Sort the list of pointers in memblk->start order */
+	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
+
+	/* Make sure the first/last memblks include start/end */
+	blk[0]->start = min(blk[0]->start, start);
+	blk[count - 1]->end = max(blk[count - 1]->end, end);
+
+	/*
+	 * Fill any gaps by tracking the previous memblks
+	 * end address and backfilling to it if needed.
+	 */
+	prev_end = blk[0]->end;
+	for (int i = 1; i < count; i++) {
+		struct numa_memblk *curr = blk[i];
+
+		if (prev_end >= curr->start) {
+			if (prev_end < curr->end)
+				prev_end = curr->end;
+		} else {
+			curr->start = prev_end;
+			prev_end = curr->end;
+		}
+	}
+	return 0;
+}
-- 
2.43.0

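The gap-filling pass in numa_fill_memblks() above can be illustrated outside the kernel. The following is only a user-space sketch of the same idea, not the kernel code: struct blk, MAX_BLKS, cmp_blk() and fill_blks() are hypothetical names, qsort() stands in for the kernel's sort(), and -1 stands in for NUMA_NO_MEMBLK.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_BLKS 16	/* hypothetical bound, plays the role of NR_NODE_MEMBLKS */

/* Hypothetical user-space stand-in for struct numa_memblk. */
struct blk {
	uint64_t start;
	uint64_t end;	/* exclusive */
	int nid;
};

/* Order pointers to blocks by start address, like cmp_memblk(). */
static int cmp_blk(const void *a, const void *b)
{
	const struct blk *ma = *(const struct blk *const *)a;
	const struct blk *mb = *(const struct blk *const *)b;

	return (ma->start > mb->start) - (ma->start < mb->start);
}

/*
 * Extend the blocks that overlap [start, end) so that together they
 * cover the whole range. Returns 0 on success, -1 if no block overlaps.
 */
int fill_blks(struct blk *blks, int nr, uint64_t start, uint64_t end)
{
	struct blk *list[MAX_BLKS];
	uint64_t prev_end;
	int count = 0;

	assert(start < end && nr <= MAX_BLKS);

	/* Collect pointers to the blocks that overlap [start, end). */
	for (int i = 0; i < nr; i++)
		if (blks[i].start < end && start < blks[i].end)
			list[count++] = &blks[i];
	if (!count)
		return -1;

	qsort(list, count, sizeof(list[0]), cmp_blk);

	/* Stretch the first and last blocks out to the range limits. */
	if (list[0]->start > start)
		list[0]->start = start;
	if (list[count - 1]->end < end)
		list[count - 1]->end = end;

	/* Close each hole by pulling the next block's start back. */
	prev_end = list[0]->end;
	for (int i = 1; i < count; i++) {
		struct blk *curr = list[i];

		if (prev_end < curr->start)
			curr->start = prev_end;
		if (prev_end < curr->end)
			prev_end = curr->end;
	}
	return 0;
}
```

As in the patch, a hole is always assigned to the block that follows it, so the node ID of the higher-addressed neighbour wins. Blocks that do not overlap the requested range are left untouched.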


* [PATCH v4 18/26] mm: move numa_distance and related code from x86 to numa_memblks
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (16 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 17/26] mm: introduce numa_memblks Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 19/26] mm: introduce numa_emulation Mike Rapoport
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move the code that deals with the numa_distance array from arch/x86 to
mm/numa_memblks.c.

This code will later be reused by arch_numa.

No functional changes.
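
For readers unfamiliar with the code being moved: the distance table is a flat cnt * cnt array of bytes, pre-filled with default local/remote distances and indexed as from * cnt + to. The following user-space sketch only illustrates that layout and the validation rules and is not the kernel code. struct dist_table, dist_alloc(), dist_set() and dist_get() are hypothetical names, malloc() stands in for memblock_alloc(), the values 10 and 20 for LOCAL_DISTANCE and REMOTE_DISTANCE are assumptions of this sketch, and the negative-index check in dist_get() is an addition that __node_distance() does not have.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LOCAL_DISTANCE	10	/* assumed value for this sketch */
#define REMOTE_DISTANCE	20	/* assumed value for this sketch */

/* Hypothetical container for numa_distance and numa_distance_cnt. */
struct dist_table {
	int cnt;
	uint8_t *d;	/* cnt * cnt entries, row = from, column = to */
};

/* Allocate the table and fill it with the default distances. */
int dist_alloc(struct dist_table *t, int cnt)
{
	assert(cnt > 0);

	t->cnt = 0;
	t->d = malloc((size_t)cnt * cnt);
	if (!t->d)
		return -1;
	t->cnt = cnt;

	for (int i = 0; i < cnt; i++)
		for (int j = 0; j < cnt; j++)
			t->d[i * cnt + j] = i == j ?
				LOCAL_DISTANCE : REMOTE_DISTANCE;
	return 0;
}

/* Silently ignore out-of-range nodes and nonsensical distances. */
void dist_set(struct dist_table *t, int from, int to, int distance)
{
	if (from < 0 || to < 0 || from >= t->cnt || to >= t->cnt)
		return;
	if ((uint8_t)distance != distance ||
	    (from == to && distance != LOCAL_DISTANCE))
		return;

	t->d[from * t->cnt + to] = (uint8_t)distance;
}

/* Fall back to the defaults for nodes outside the table. */
int dist_get(const struct dist_table *t, int from, int to)
{
	if (from < 0 || to < 0 || from >= t->cnt || to >= t->cnt)
		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
	return t->d[from * t->cnt + to];
}
```

Note that setting the distance from A to B does not set the distance from B to A; each direction is a separate entry in the table.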

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/numa.h  |   2 -
 arch/x86/mm/numa.c           | 104 -----------------------------------
 arch/x86/mm/numa_internal.h  |   2 -
 include/linux/numa_memblks.h |   4 ++
 mm/numa_memblks.c            | 104 +++++++++++++++++++++++++++++++++++
 5 files changed, 108 insertions(+), 108 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 6e9a50bf03d4..203100500f24 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -23,8 +23,6 @@ extern int numa_off;
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
 
-extern void __init numa_set_distance(int from, int to, int distance);
-
 static inline void set_apicid_to_node(int apicid, s16 node)
 {
 	__apicid_to_node[apicid] = node;
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 0bada905f409..095502095503 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -24,9 +24,6 @@
 
 int numa_off;
 
-static int numa_distance_cnt;
-static u8 *numa_distance;
-
 static __init int numa_setup(char *opt)
 {
 	if (!opt)
@@ -118,107 +115,6 @@ void __init setup_node_to_cpumask_map(void)
 	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
-/**
- * numa_reset_distance - Reset NUMA distance table
- *
- * The current table is freed.  The next numa_set_distance() call will
- * create a new one.
- */
-void __init numa_reset_distance(void)
-{
-	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
-
-	/* numa_distance could be 1LU marking allocation failure, test cnt */
-	if (numa_distance_cnt)
-		memblock_free(numa_distance, size);
-	numa_distance_cnt = 0;
-	numa_distance = NULL;	/* enable table creation */
-}
-
-static int __init numa_alloc_distance(void)
-{
-	nodemask_t nodes_parsed;
-	size_t size;
-	int i, j, cnt = 0;
-
-	/* size the new table and allocate it */
-	nodes_parsed = numa_nodes_parsed;
-	numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
-
-	for_each_node_mask(i, nodes_parsed)
-		cnt = i;
-	cnt++;
-	size = cnt * cnt * sizeof(numa_distance[0]);
-
-	numa_distance = memblock_alloc(size, PAGE_SIZE);
-	if (!numa_distance) {
-		pr_warn("Warning: can't allocate distance table!\n");
-		/* don't retry until explicitly reset */
-		numa_distance = (void *)1LU;
-		return -ENOMEM;
-	}
-
-	numa_distance_cnt = cnt;
-
-	/* fill with the default distances */
-	for (i = 0; i < cnt; i++)
-		for (j = 0; j < cnt; j++)
-			numa_distance[i * cnt + j] = i == j ?
-				LOCAL_DISTANCE : REMOTE_DISTANCE;
-	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
-
-	return 0;
-}
-
-/**
- * numa_set_distance - Set NUMA distance from one NUMA to another
- * @from: the 'from' node to set distance
- * @to: the 'to'  node to set distance
- * @distance: NUMA distance
- *
- * Set the distance from node @from to @to to @distance.  If distance table
- * doesn't exist, one which is large enough to accommodate all the currently
- * known nodes will be created.
- *
- * If such table cannot be allocated, a warning is printed and further
- * calls are ignored until the distance table is reset with
- * numa_reset_distance().
- *
- * If @from or @to is higher than the highest known node or lower than zero
- * at the time of table creation or @distance doesn't make sense, the call
- * is ignored.
- * This is to allow simplification of specific NUMA config implementations.
- */
-void __init numa_set_distance(int from, int to, int distance)
-{
-	if (!numa_distance && numa_alloc_distance() < 0)
-		return;
-
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
-			from < 0 || to < 0) {
-		pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
-			     from, to, distance);
-		return;
-	}
-
-	if ((u8)distance != distance ||
-	    (from == to && distance != LOCAL_DISTANCE)) {
-		pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
-			     from, to, distance);
-		return;
-	}
-
-	numa_distance[from * numa_distance_cnt + to] = distance;
-}
-
-int __node_distance(int from, int to)
-{
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
-		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
-	return numa_distance[from * numa_distance_cnt + to];
-}
-EXPORT_SYMBOL(__node_distance);
-
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	int nid, err;
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index a51229a2f5af..249e3aaeadce 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -5,8 +5,6 @@
 #include <linux/types.h>
 #include <asm/numa.h>
 
-void __init numa_reset_distance(void);
-
 void __init x86_numa_init(void);
 
 struct numa_meminfo;
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 6981cf97d2c9..968a590535ac 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -7,6 +7,10 @@
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
 
+extern int numa_distance_cnt;
+void __init numa_set_distance(int from, int to, int distance);
+void __init numa_reset_distance(void);
+
 struct numa_memblk {
 	u64			start;
 	u64			end;
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 72f191a94c66..e3c3519725d4 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -7,11 +7,115 @@
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
 
+int numa_distance_cnt;
+static u8 *numa_distance;
+
 nodemask_t numa_nodes_parsed __initdata;
 
 struct numa_meminfo numa_meminfo __initdata_or_meminfo;
 struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
+/**
+ * numa_reset_distance - Reset NUMA distance table
+ *
+ * The current table is freed.  The next numa_set_distance() call will
+ * create a new one.
+ */
+void __init numa_reset_distance(void)
+{
+	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
+
+	/* numa_distance could be 1LU marking allocation failure, test cnt */
+	if (numa_distance_cnt)
+		memblock_free(numa_distance, size);
+	numa_distance_cnt = 0;
+	numa_distance = NULL;	/* enable table creation */
+}
+
+static int __init numa_alloc_distance(void)
+{
+	nodemask_t nodes_parsed;
+	size_t size;
+	int i, j, cnt = 0;
+
+	/* size the new table and allocate it */
+	nodes_parsed = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
+
+	for_each_node_mask(i, nodes_parsed)
+		cnt = i;
+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);
+
+	numa_distance = memblock_alloc(size, PAGE_SIZE);
+	if (!numa_distance) {
+		pr_warn("Warning: can't allocate distance table!\n");
+		/* don't retry until explicitly reset */
+		numa_distance = (void *)1LU;
+		return -ENOMEM;
+	}
+
+	numa_distance_cnt = cnt;
+
+	/* fill with the default distances */
+	for (i = 0; i < cnt; i++)
+		for (j = 0; j < cnt; j++)
+			numa_distance[i * cnt + j] = i == j ?
+				LOCAL_DISTANCE : REMOTE_DISTANCE;
+	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
+
+	return 0;
+}
+
+/**
+ * numa_set_distance - Set NUMA distance from one NUMA to another
+ * @from: the 'from' node to set distance
+ * @to: the 'to'  node to set distance
+ * @distance: NUMA distance
+ *
+ * Set the distance from node @from to @to to @distance.  If distance table
+ * doesn't exist, one which is large enough to accommodate all the currently
+ * known nodes will be created.
+ *
+ * If such table cannot be allocated, a warning is printed and further
+ * calls are ignored until the distance table is reset with
+ * numa_reset_distance().
+ *
+ * If @from or @to is higher than the highest known node or lower than zero
+ * at the time of table creation or @distance doesn't make sense, the call
+ * is ignored.
+ * This is to allow simplification of specific NUMA config implementations.
+ */
+void __init numa_set_distance(int from, int to, int distance)
+{
+	if (!numa_distance && numa_alloc_distance() < 0)
+		return;
+
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
+			from < 0 || to < 0) {
+		pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
+			     from, to, distance);
+		return;
+	}
+
+	if ((u8)distance != distance ||
+	    (from == to && distance != LOCAL_DISTANCE)) {
+		pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
+			     from, to, distance);
+		return;
+	}
+
+	numa_distance[from * numa_distance_cnt + to] = distance;
+}
+
+int __node_distance(int from, int to)
+{
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
+		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+	return numa_distance[from * numa_distance_cnt + to];
+}
+EXPORT_SYMBOL(__node_distance);
+
 static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
 				     struct numa_meminfo *mi)
 {
-- 
2.43.0



* [PATCH v4 19/26] mm: introduce numa_emulation
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (17 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 18/26] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 20/26] mm: numa_memblks: introduce numa_memblks_init Mike Rapoport
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move numa_emulation code from arch/x86 to mm/numa_emulation.c.

This code will later be reused by arch_numa.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig                     |  8 --------
 arch/x86/include/asm/numa.h          | 12 ------------
 arch/x86/mm/Makefile                 |  1 -
 arch/x86/mm/numa_internal.h          | 11 -----------
 include/linux/numa_memblks.h         | 17 +++++++++++++++++
 mm/Kconfig                           |  8 ++++++++
 mm/Makefile                          |  1 +
 {arch/x86/mm => mm}/numa_emulation.c |  4 +---
 8 files changed, 27 insertions(+), 35 deletions(-)
 rename {arch/x86/mm => mm}/numa_emulation.c (99%)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 74afb59c6603..acd9745bf2ae 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1600,14 +1600,6 @@ config X86_64_ACPI_NUMA
 	help
 	  Enable ACPI SRAT based node topology detection.
 
-config NUMA_EMU
-	bool "NUMA emulation"
-	depends on NUMA
-	help
-	  Enable NUMA emulation. A flat machine will be split
-	  into virtual nodes when booted with "numa=fake=N", where N is the
-	  number of nodes. This is only useful for debugging.
-
 config NODES_SHIFT
 	int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
 	range 1 10
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 203100500f24..5469d7a7c40f 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -65,16 +65,4 @@ static inline void init_gi_nodes(void)			{ }
 void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable);
 #endif
 
-#ifdef CONFIG_NUMA_EMU
-int numa_emu_cmdline(char *str);
-void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
-					unsigned int nr_emu_nids);
-u64 __init numa_emu_dma_end(void);
-#else /* CONFIG_NUMA_EMU */
-static inline int numa_emu_cmdline(char *str)
-{
-	return -EINVAL;
-}
-#endif /* CONFIG_NUMA_EMU */
-
 #endif	/* _ASM_X86_NUMA_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 8d3a00e5c528..690fbf48e853 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -57,7 +57,6 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
 obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
 obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
-obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index 249e3aaeadce..11e1ff370c10 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -7,15 +7,4 @@
 
 void __init x86_numa_init(void);
 
-struct numa_meminfo;
-
-#ifdef CONFIG_NUMA_EMU
-void __init numa_emulation(struct numa_meminfo *numa_meminfo,
-			   int numa_dist_cnt);
-#else
-static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
-				  int numa_dist_cnt)
-{ }
-#endif
-
 #endif	/* __X86_MM_NUMA_INTERNAL_H */
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 968a590535ac..f81f98678074 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -34,6 +34,23 @@ int __init numa_register_meminfo(struct numa_meminfo *mi);
 void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
 				       const struct numa_meminfo *mi);
 
+#ifdef CONFIG_NUMA_EMU
+int numa_emu_cmdline(char *str);
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids);
+u64 __init numa_emu_dma_end(void);
+void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+			   int numa_dist_cnt);
+#else
+static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
+				  int numa_dist_cnt)
+{ }
+static inline int numa_emu_cmdline(char *str)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_NUMA_EMU */
+
 #endif /* CONFIG_NUMA_MEMBLKS */
 
 #endif	/* __NUMA_MEMBLKS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index dc5912d29ed5..3b466df1d9e2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1266,6 +1266,14 @@ config EXECMEM
 config NUMA_MEMBLKS
 	bool
 
+config NUMA_EMU
+	bool "NUMA emulation"
+	depends on NUMA_MEMBLKS
+	help
+	  Enable NUMA emulation. A flat machine will be split
+	  into virtual nodes when booted with "numa=fake=N", where N is the
+	  number of nodes. This is only useful for debugging.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index e3fac7efd880..75a189cc67ef 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -143,3 +143,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_NUMA) += numa.o
 obj-$(CONFIG_NUMA_MEMBLKS) += numa_memblks.o
+obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
diff --git a/arch/x86/mm/numa_emulation.c b/mm/numa_emulation.c
similarity index 99%
rename from arch/x86/mm/numa_emulation.c
rename to mm/numa_emulation.c
index 33610026b7a3..031fb9961bf7 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -7,9 +7,7 @@
 #include <linux/topology.h>
 #include <linux/memblock.h>
 #include <linux/numa_memblks.h>
-#include <asm/dma.h>
-
-#include "numa_internal.h"
+#include <asm/numa.h>
 
 #define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
-- 
2.43.0



* [PATCH v4 20/26] mm: numa_memblks: introduce numa_memblks_init
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (18 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 19/26] mm: introduce numa_emulation Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 21/26] mm: numa_memblks: make several functions and variables static Mike Rapoport
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move most of x86::numa_init() to numa_memblks so that the latter will be
more self-contained.

With this change, numa_memblk data structures no longer need to be
exposed to architecture-specific code.
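
The ordering that numa_memblks_init() takes over from x86 can be
illustrated with a small userspace model. This is only a sketch: every
identifier ending in _model is hypothetical, the reset, cleanup,
emulation and registration steps are reduced to a counter, and the error
value is hard-coded rather than taken from kernel headers.

```c
#include <stdbool.h>

/* Userspace stand-ins; all *_model names are hypothetical. */
static bool bottom_up_model = true;	/* models the memblock direction */
static int steps_run_model;		/* post-parse steps that were run */

/* Two fake firmware parsers: one succeeds, one fails with -EINVAL. */
static int parse_ok_model(void)   { return 0; }
static int parse_fail_model(void) { return -22; }

/*
 * Follows the order used in the patch: reset state, call the firmware
 * parser, optionally force top-down allocation, then run cleanup,
 * emulation and registration (collapsed into one counter here).
 */
static int numa_memblks_init_model(int (*init_func)(void),
				   bool memblock_force_top_down)
{
	int ret;

	steps_run_model = 0;		/* stands in for the resets */

	ret = init_func();
	if (ret < 0)
		return ret;		/* nothing after the parser runs */

	if (memblock_force_top_down)
		bottom_up_model = false;

	steps_run_model = 3;		/* cleanup, emulation, register */
	return 0;
}
```

In this model an architecture that never switches memblock to bottom-up
allocation would pass false as the second argument, while x86 passes
true, as in the hunk for arch/x86/mm/numa.c.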

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/numa.c           | 40 ++++-------------------------------
 include/linux/numa_memblks.h |  3 +++
 mm/numa_memblks.c            | 41 ++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 095502095503..d23287611449 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -115,13 +115,9 @@ void __init setup_node_to_cpumask_map(void)
 	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_register_nodes(void)
 {
-	int nid, err;
-
-	err = numa_register_meminfo(mi);
-	if (err)
-		return err;
+	int nid;
 
 	if (!memblock_validate_numa_coverage(SZ_1M))
 		return -EINVAL;
@@ -175,39 +171,11 @@ static int __init numa_init(int (*init_func)(void))
 	for (i = 0; i < MAX_LOCAL_APIC; i++)
 		set_apicid_to_node(i, NUMA_NO_NODE);
 
-	nodes_clear(numa_nodes_parsed);
-	nodes_clear(node_possible_map);
-	nodes_clear(node_online_map);
-	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
-				  NUMA_NO_NODE));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
-				  NUMA_NO_NODE));
-	/* In case that parsing SRAT failed. */
-	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
-	numa_reset_distance();
-
-	ret = init_func();
+	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
 	if (ret < 0)
 		return ret;
 
-	/*
-	 * We reset memblock back to the top-down direction
-	 * here because if we configured ACPI_NUMA, we have
-	 * parsed SRAT in init_func(). It is ok to have the
-	 * reset here even if we did't configure ACPI_NUMA
-	 * or acpi numa init fails and fallbacks to dummy
-	 * numa init.
-	 */
-	memblock_set_bottom_up(false);
-
-	ret = numa_cleanup_meminfo(&numa_meminfo);
-	if (ret < 0)
-		return ret;
-
-	numa_emulation(&numa_meminfo, numa_distance_cnt);
-
-	ret = numa_register_memblks(&numa_meminfo);
+	ret = numa_register_nodes();
 	if (ret < 0)
 		return ret;
 
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index f81f98678074..07381320848f 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -34,6 +34,9 @@ int __init numa_register_meminfo(struct numa_meminfo *mi);
 void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
 				       const struct numa_meminfo *mi);
 
+int __init numa_memblks_init(int (*init_func)(void),
+			     bool memblock_force_top_down);
+
 #ifdef CONFIG_NUMA_EMU
 int numa_emu_cmdline(char *str);
 void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index e3c3519725d4..7749b6f6b250 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -415,6 +415,47 @@ int __init numa_register_meminfo(struct numa_meminfo *mi)
 	return 0;
 }
 
+int __init numa_memblks_init(int (*init_func)(void),
+			     bool memblock_force_top_down)
+{
+	int ret;
+
+	nodes_clear(numa_nodes_parsed);
+	nodes_clear(node_possible_map);
+	nodes_clear(node_online_map);
+	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
+				  NUMA_NO_NODE));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+				  NUMA_NO_NODE));
+	/* In case that parsing SRAT failed. */
+	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
+	numa_reset_distance();
+
+	ret = init_func();
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * We reset memblock back to the top-down direction
+	 * here because if we configured ACPI_NUMA, we have
+	 * parsed SRAT in init_func(). It is ok to have the
+	 * reset here even if we did't configure ACPI_NUMA
+	 * or acpi numa init fails and fallbacks to dummy
+	 * numa init.
+	 */
+	if (memblock_force_top_down)
+		memblock_set_bottom_up(false);
+
+	ret = numa_cleanup_meminfo(&numa_meminfo);
+	if (ret < 0)
+		return ret;
+
+	numa_emulation(&numa_meminfo, numa_distance_cnt);
+
+	return numa_register_meminfo(&numa_meminfo);
+}
+
 static int __init cmp_memblk(const void *a, const void *b)
 {
 	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
-- 
2.43.0



* [PATCH v4 21/26] mm: numa_memblks: make several functions and variables static
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (19 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 20/26] mm: numa_memblks: introduce numa_memblks_init Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 22/26] mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo Mike Rapoport
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Make functions and variables that are exclusively used by numa_memblks
static.

Move numa_nodemask_from_meminfo() before its callers to avoid a forward
declaration.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/numa_memblks.h |  8 --------
 mm/numa_memblks.c            | 36 ++++++++++++++++++------------------
 2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 07381320848f..5c6e12ad0b7a 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -7,7 +7,6 @@
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
 
-extern int numa_distance_cnt;
 void __init numa_set_distance(int from, int to, int distance);
 void __init numa_reset_distance(void);
 
@@ -22,17 +21,10 @@ struct numa_meminfo {
 	struct numa_memblk	blk[NR_NODE_MEMBLKS];
 };
 
-extern struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-extern struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
-
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
 
 int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
-int __init numa_register_meminfo(struct numa_meminfo *mi);
-
-void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
-				       const struct numa_meminfo *mi);
 
 int __init numa_memblks_init(int (*init_func)(void),
 			     bool memblock_force_top_down);
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 7749b6f6b250..e97665a5e8ce 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -7,13 +7,27 @@
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
 
-int numa_distance_cnt;
+static int numa_distance_cnt;
 static u8 *numa_distance;
 
 nodemask_t numa_nodes_parsed __initdata;
 
-struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+/*
+ * Set nodes, which have memory in @mi, in *@nodemask.
+ */
+static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+					      const struct numa_meminfo *mi)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
+		if (mi->blk[i].start != mi->blk[i].end &&
+		    mi->blk[i].nid != NUMA_NO_NODE)
+			node_set(mi->blk[i].nid, *nodemask);
+}
 
 /**
  * numa_reset_distance - Reset NUMA distance table
@@ -290,20 +304,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 	return 0;
 }
 
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
-				       const struct numa_meminfo *mi)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
-		if (mi->blk[i].start != mi->blk[i].end &&
-		    mi->blk[i].nid != NUMA_NO_NODE)
-			node_set(mi->blk[i].nid, *nodemask);
-}
-
 /*
  * Mark all currently memblock-reserved physical memory (which covers the
  * kernel's own memory ranges) as hot-unswappable.
@@ -371,7 +371,7 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
-int __init numa_register_meminfo(struct numa_meminfo *mi)
+static int __init numa_register_meminfo(struct numa_meminfo *mi)
 {
 	int i;
 
-- 
2.43.0



* [PATCH v4 22/26] mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (20 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 21/26] mm: numa_memblks: make several functions and variables static Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 23/26] of, numa: return -EINVAL when no numa-node-id is found Mike Rapoport
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

numa_cleanup_meminfo() moves blocks outside system RAM to
numa_reserved_meminfo and it uses 0 and PFN_PHYS(max_pfn) to determine
the memory boundaries.

Replace the memory range boundaries with the more portable
memblock_start_of_DRAM() and memblock_end_of_DRAM().
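
To see why the lower bound matters, here is a simplified userspace sketch
of trimming a block to the [low, high) window. It is not the kernel
routine: struct blk_model and trim_to_dram_model() are hypothetical
names, and moving out-of-range blocks to numa_reserved_meminfo is reduced
to a boolean return value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory block covering [start, end). */
struct blk_model {
	uint64_t start;
	uint64_t end;
};

/*
 * Trim @b to the window [low, high).  Returns false when the block lies
 * entirely outside the window; in that case the block is left untouched
 * and a caller would treat it as reserved rather than as system RAM.
 */
static bool trim_to_dram_model(struct blk_model *b, uint64_t low,
			       uint64_t high)
{
	if (b->start >= high || b->end <= low)
		return false;

	if (b->start < low)
		b->start = low;
	if (b->end > high)
		b->end = high;

	return true;
}
```

With low fixed at 0 the first clamp can never trigger, so a window that
starts at the real beginning of DRAM only makes a difference on
platforms where memory does not begin at physical address 0.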

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/numa_memblks.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index e97665a5e8ce..e4358ad92233 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -212,8 +212,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
  */
 int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 {
-	const u64 low = 0;
-	const u64 high = PFN_PHYS(max_pfn);
+	const u64 low = memblock_start_of_DRAM();
+	const u64 high = memblock_end_of_DRAM();
 	int i, j, k;
 
 	/* first, trim all entries */
-- 
2.43.0



* [PATCH v4 23/26] of, numa: return -EINVAL when no numa-node-id is found
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (21 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 22/26] mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 24/26] arch_numa: switch over to numa_memblks Mike Rapoport
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Currently of_numa_parse_memory_nodes() returns 0 if no "memory" node in
the device tree contains a "numa-node-id" property. This makes
of_numa_init() return "success" even though no NUMA nodes were actually
parsed and set up.

arch_numa works around this by returning an error if numa_nodes_parsed is
empty.

numa_memblks, however, would WARN() in such a case, and since arch_numa
will use it shortly, such a warning is not desirable.

Make sure of_numa_init() returns -EINVAL when no NUMA node information
was found in the device tree.
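
The effect of initializing r to -EINVAL and returning it can be shown
with a reduced userspace model. This is a sketch under assumptions:
struct mem_node_model and both functions are hypothetical, and the model
simply skips nodes that lack the property; address parsing and the other
error paths of the real function are left out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device tree "memory" node. */
struct mem_node_model {
	int has_nid;	/* non-zero if "numa-node-id" is present */
};

/* Models a property read: 0 on success, -EINVAL if it is missing. */
static int read_nid_model(const struct mem_node_model *node)
{
	return node->has_nid ? 0 : -EINVAL;
}

static int parse_memory_nodes_model(const struct mem_node_model *nodes,
				    size_t count)
{
	int r = -EINVAL;	/* result if no node is ever examined */
	size_t i;

	for (i = 0; i < count; i++) {
		r = read_nid_model(&nodes[i]);
		if (r == -EINVAL)
			continue;	/* assumption: such nodes are skipped */
	}

	return r;	/* previously an unconditional 0 */
}
```

In this model both an empty node list and a list in which no node
carries the property yield -EINVAL, so the caller can tell that no NUMA
information was found.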

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 drivers/of/of_numa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
index 838747e319a2..2ec20886d176 100644
--- a/drivers/of/of_numa.c
+++ b/drivers/of/of_numa.c
@@ -45,7 +45,7 @@ static int __init of_numa_parse_memory_nodes(void)
 	struct device_node *np = NULL;
 	struct resource rsrc;
 	u32 nid;
-	int i, r;
+	int i, r = -EINVAL;
 
 	for_each_node_by_type(np, "memory") {
 		r = of_property_read_u32(np, "numa-node-id", &nid);
@@ -72,7 +72,7 @@ static int __init of_numa_parse_memory_nodes(void)
 		}
 	}
 
-	return 0;
+	return r;
 }
 
 static int __init of_numa_parse_distance_map_v1(struct device_node *map)
-- 
2.43.0



* [PATCH v4 24/26] arch_numa: switch over to numa_memblks
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (22 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 23/26] of, numa: return -EINVAL when no numa-node-id is found Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:58   ` Arnd Bergmann
  2024-11-27 19:32   ` Marc Zyngier
  2024-08-07  6:41 ` [PATCH v4 25/26] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 26/26] docs: move numa=fake description to kernel-parameters.txt Mike Rapoport
  25 siblings, 2 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Until now arch_numa was directly translating firmware NUMA information
to memblock.

Using numa_memblks as an intermediate step has a few advantages:
* alignment with the more battle-tested x86 implementation
* availability of NUMA emulation
* maintaining node information for not-yet-populated memory

Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t,
replace the current arch_numa functionality related to numa_add_memblk()
and __node_distance() with the implementation based on numa_memblks, and
add the functions required by numa_emulation.
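
For readers unfamiliar with the distance table that arch_numa now
shares, the following userspace sketch mirrors the checks done by
numa_set_distance() and __node_distance() in numa_memblks. All *_model
names are hypothetical, malloc() stands in for memblock_alloc(), and the
constants 10 and 20 are assumed values for LOCAL_DISTANCE and
REMOTE_DISTANCE.

```c
#include <stdint.h>
#include <stdlib.h>

#define LOCAL_DISTANCE_MODEL	10	/* assumed LOCAL_DISTANCE */
#define REMOTE_DISTANCE_MODEL	20	/* assumed REMOTE_DISTANCE */

static uint8_t *dist_model;	/* flat cnt x cnt table, row = from */
static int dist_cnt_model;

/* Allocate the table and fill it with the default distances. */
static int alloc_distance_model(int cnt)
{
	int i, j;

	dist_model = malloc((size_t)cnt * cnt);
	if (!dist_model)
		return -1;
	dist_cnt_model = cnt;

	for (i = 0; i < cnt; i++)
		for (j = 0; j < cnt; j++)
			dist_model[i * cnt + j] = i == j ?
				LOCAL_DISTANCE_MODEL : REMOTE_DISTANCE_MODEL;
	return 0;
}

/* Ignore out-of-range node ids and distances that do not make sense. */
static void set_distance_model(int from, int to, int distance)
{
	if (!dist_model)
		return;
	if (from >= dist_cnt_model || to >= dist_cnt_model ||
	    from < 0 || to < 0)
		return;
	if ((uint8_t)distance != distance ||
	    (from == to && distance != LOCAL_DISTANCE_MODEL))
		return;

	dist_model[from * dist_cnt_model + to] = (uint8_t)distance;
}

/* Fall back to the defaults for nodes the table does not cover. */
static int node_distance_model(int from, int to)
{
	if (from >= dist_cnt_model || to >= dist_cnt_model)
		return from == to ?
			LOCAL_DISTANCE_MODEL : REMOTE_DISTANCE_MODEL;
	return dist_model[from * dist_cnt_model + to];
}
```

The table is directional: setting the distance from node 0 to node 1
leaves the entry for node 1 to node 0 at its default until that entry is
set as well.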

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/Kconfig       |   1 +
 drivers/base/arch_numa.c   | 201 +++++++++++--------------------------
 include/asm-generic/numa.h |   6 +-
 mm/numa_memblks.c          |  17 ++--
 4 files changed, 75 insertions(+), 150 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 2b8fd6bb7da0..064eb52ff7e2 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -226,6 +226,7 @@ config GENERIC_ARCH_TOPOLOGY
 
 config GENERIC_ARCH_NUMA
 	bool
+	select NUMA_MEMBLKS
 	help
 	  Enable support for generic NUMA implementation. Currently, RISC-V
 	  and ARM64 use it.
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index b6af7475ec44..8d49893c0e94 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -12,14 +12,12 @@
 #include <linux/memblock.h>
 #include <linux/module.h>
 #include <linux/of.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/sections.h>
 
-nodemask_t numa_nodes_parsed __initdata;
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
 
-static int numa_distance_cnt;
-static u8 *numa_distance;
 bool numa_off;
 
 static __init int numa_parse_early_param(char *opt)
@@ -28,6 +26,8 @@ static __init int numa_parse_early_param(char *opt)
 		return -EINVAL;
 	if (str_has_prefix(opt, "off"))
 		numa_off = true;
+	if (!strncmp(opt, "fake=", 5))
+		return numa_emu_cmdline(opt + 5);
 
 	return 0;
 }
@@ -59,6 +59,7 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif
 
+#ifndef CONFIG_NUMA_EMU
 static void numa_update_cpu(unsigned int cpu, bool remove)
 {
 	int nid = cpu_to_node(cpu);
@@ -81,6 +82,7 @@ void numa_remove_cpu(unsigned int cpu)
 {
 	numa_update_cpu(cpu, true);
 }
+#endif
 
 void numa_clear_node(unsigned int cpu)
 {
@@ -142,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
 
-int __init early_cpu_to_node(int cpu)
+int early_cpu_to_node(int cpu)
 {
 	return cpu_to_node_map[cpu];
 }
@@ -187,30 +189,6 @@ void __init setup_per_cpu_areas(void)
 }
 #endif
 
-/**
- * numa_add_memblk() - Set node id to memblk
- * @nid: NUMA node ID of the new memblk
- * @start: Start address of the new memblk
- * @end:  End address of the new memblk
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_add_memblk(int nid, u64 start, u64 end)
-{
-	int ret;
-
-	ret = memblock_set_node(start, (end - start), &memblock.memory, nid);
-	if (ret < 0) {
-		pr_err("memblock [0x%llx - 0x%llx] failed to add on node %d\n",
-			start, (end - 1), nid);
-		return ret;
-	}
-
-	node_set(nid, numa_nodes_parsed);
-	return ret;
-}
-
 /*
  * Initialize NODE_DATA for a node on the local memory
  */
@@ -226,116 +204,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
 }
 
-/*
- * numa_free_distance
- *
- * The current table is freed.
- */
-void __init numa_free_distance(void)
-{
-	size_t size;
-
-	if (!numa_distance)
-		return;
-
-	size = numa_distance_cnt * numa_distance_cnt *
-		sizeof(numa_distance[0]);
-
-	memblock_free(numa_distance, size);
-	numa_distance_cnt = 0;
-	numa_distance = NULL;
-}
-
-/*
- * Create a new NUMA distance table.
- */
-static int __init numa_alloc_distance(void)
-{
-	size_t size;
-	int i, j;
-
-	size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
-	numa_distance = memblock_alloc(size, PAGE_SIZE);
-	if (WARN_ON(!numa_distance))
-		return -ENOMEM;
-
-	numa_distance_cnt = nr_node_ids;
-
-	/* fill with the default distances */
-	for (i = 0; i < numa_distance_cnt; i++)
-		for (j = 0; j < numa_distance_cnt; j++)
-			numa_distance[i * numa_distance_cnt + j] = i == j ?
-				LOCAL_DISTANCE : REMOTE_DISTANCE;
-
-	pr_debug("Initialized distance table, cnt=%d\n", numa_distance_cnt);
-
-	return 0;
-}
-
-/**
- * numa_set_distance() - Set inter node NUMA distance from node to node.
- * @from: the 'from' node to set distance
- * @to: the 'to'  node to set distance
- * @distance: NUMA distance
- *
- * Set the distance from node @from to @to to @distance.
- * If distance table doesn't exist, a warning is printed.
- *
- * If @from or @to is higher than the highest known node or lower than zero
- * or @distance doesn't make sense, the call is ignored.
- */
-void __init numa_set_distance(int from, int to, int distance)
-{
-	if (!numa_distance) {
-		pr_warn_once("Warning: distance table not allocated yet\n");
-		return;
-	}
-
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
-			from < 0 || to < 0) {
-		pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
-			    from, to, distance);
-		return;
-	}
-
-	if ((u8)distance != distance ||
-	    (from == to && distance != LOCAL_DISTANCE)) {
-		pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
-			     from, to, distance);
-		return;
-	}
-
-	numa_distance[from * numa_distance_cnt + to] = distance;
-}
-
-/*
- * Return NUMA distance @from to @to
- */
-int __node_distance(int from, int to)
-{
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
-		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
-	return numa_distance[from * numa_distance_cnt + to];
-}
-EXPORT_SYMBOL(__node_distance);
-
 static int __init numa_register_nodes(void)
 {
 	int nid;
-	struct memblock_region *mblk;
-
-	/* Check that valid nid is set to memblks */
-	for_each_mem_region(mblk) {
-		int mblk_nid = memblock_get_region_node(mblk);
-		phys_addr_t start = mblk->base;
-		phys_addr_t end = mblk->base + mblk->size - 1;
-
-		if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) {
-			pr_warn("Warning: invalid memblk node %d [mem %pap-%pap]\n",
-				mblk_nid, &start, &end);
-			return -EINVAL;
-		}
-	}
 
 	/* Finally register nodes. */
 	for_each_node_mask(nid, numa_nodes_parsed) {
@@ -360,11 +231,7 @@ static int __init numa_init(int (*init_func)(void))
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 
-	ret = numa_alloc_distance();
-	if (ret < 0)
-		return ret;
-
-	ret = init_func();
+	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ false);
 	if (ret < 0)
 		goto out_free_distance;
 
@@ -382,7 +249,7 @@ static int __init numa_init(int (*init_func)(void))
 
 	return 0;
 out_free_distance:
-	numa_free_distance();
+	numa_reset_distance();
 	return ret;
 }
 
@@ -412,6 +279,7 @@ static int __init dummy_numa_init(void)
 		pr_err("NUMA init failed\n");
 		return ret;
 	}
+	node_set(0, numa_nodes_parsed);
 
 	numa_off = true;
 	return 0;
@@ -454,3 +322,54 @@ void __init arch_numa_init(void)
 
 	numa_init(dummy_numa_init);
 }
+
+#ifdef CONFIG_NUMA_EMU
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids)
+{
+	int i, j;
+
+	/*
+	 * Transform cpu_to_node_map table to use emulated nids by
+	 * reverse-mapping phys_nid.  The maps should always exist but fall
+	 * back to zero just in case.
+	 */
+	for (i = 0; i < ARRAY_SIZE(cpu_to_node_map); i++) {
+		if (cpu_to_node_map[i] == NUMA_NO_NODE)
+			continue;
+		for (j = 0; j < nr_emu_nids; j++)
+			if (cpu_to_node_map[i] == emu_nid_to_phys[j])
+				break;
+		cpu_to_node_map[i] = j < nr_emu_nids ? j : 0;
+	}
+}
+
+u64 __init numa_emu_dma_end(void)
+{
+	return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
+}
+
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)
+{
+	struct cpumask *mask;
+
+	if (node == NUMA_NO_NODE)
+		return;
+
+	mask = node_to_cpumask_map[node];
+	if (!cpumask_available(mask)) {
+		pr_err("node_to_cpumask_map[%i] NULL\n", node);
+		dump_stack();
+		return;
+	}
+
+	if (enable)
+		cpumask_set_cpu(cpu, mask);
+	else
+		cpumask_clear_cpu(cpu, mask);
+
+	pr_debug("%s cpu %d node %d: mask now %*pbl\n",
+		 enable ? "numa_add_cpu" : "numa_remove_cpu",
+		 cpu, node, cpumask_pr_args(mask));
+}
+#endif /* CONFIG_NUMA_EMU */
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index c32e0cf23c90..c2b046d1fd82 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -32,8 +32,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 
 void __init arch_numa_init(void);
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
-void __init numa_set_distance(int from, int to, int distance);
-void __init numa_free_distance(void);
 void __init early_map_cpu_to_node(unsigned int cpu, int nid);
 int __init early_cpu_to_node(int cpu);
 void numa_store_cpu_info(unsigned int cpu);
@@ -51,4 +49,8 @@ static inline int early_cpu_to_node(int cpu) { return 0; }
 
 #endif	/* CONFIG_NUMA */
 
+#ifdef CONFIG_NUMA_EMU
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable);
+#endif
+
 #endif	/* __ASM_GENERIC_NUMA_H */
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index e4358ad92233..c4037faa438b 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -405,9 +405,12 @@ static int __init numa_register_meminfo(struct numa_meminfo *mi)
 		unsigned long pfn_align = node_map_pfn_alignment();
 
 		if (pfn_align && pfn_align < PAGES_PER_SECTION) {
-			pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
-				PFN_PHYS(pfn_align) >> 20,
-				PFN_PHYS(PAGES_PER_SECTION) >> 20);
+			unsigned long node_align_mb = PFN_PHYS(pfn_align) >> 20;
+
+			unsigned long sect_align_mb = PFN_PHYS(PAGES_PER_SECTION) >> 20;
+
+			pr_warn("Node alignment %luMB < min %luMB, rejecting NUMA config\n",
+				node_align_mb, sect_align_mb);
 			return -EINVAL;
 		}
 	}
@@ -418,18 +421,18 @@ static int __init numa_register_meminfo(struct numa_meminfo *mi)
 int __init numa_memblks_init(int (*init_func)(void),
 			     bool memblock_force_top_down)
 {
+	phys_addr_t max_addr = (phys_addr_t)ULLONG_MAX;
 	int ret;
 
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
-				  NUMA_NO_NODE));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+	WARN_ON(memblock_set_node(0, max_addr, &memblock.memory, NUMA_NO_NODE));
+	WARN_ON(memblock_set_node(0, max_addr, &memblock.reserved,
 				  NUMA_NO_NODE));
 	/* In case that parsing SRAT failed. */
-	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
+	WARN_ON(memblock_clear_hotplug(0, max_addr));
 	numa_reset_distance();
 
 	ret = init_func();
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 25/26] mm: make range-to-target_node lookup facility a part of numa_memblks
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (23 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 24/26] arch_numa: switch over to numa_memblks Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  2024-08-07  6:41 ` [PATCH v4 26/26] docs: move numa=fake description to kernel-parameters.txt Mike Rapoport
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The x86 implementation of range-to-target_node lookup (i.e.
phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
numa_memblks.

Since numa_memblks are now part of the generic code, move these
functions from x86 to mm/numa_memblks.c and select
CONFIG_NUMA_KEEP_MEMINFO when CONFIG_NUMA_MEMBLKS=y for dax and cxl.
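
The lookup being moved can be sketched in userspace as follows. This is a
simplified model that mirrors the logic of the functions in the diff
below, under stated assumptions: struct memblk, struct meminfo, MAX_BLKS,
target_node() and add_physaddr_to_nid() are hypothetical names for this
sketch, and the tables are passed as parameters instead of being the
kernel's global numa_meminfo and numa_reserved_meminfo.

```c
#include <assert.h>
#include <stdint.h>

#define NUMA_NO_NODE	(-1)
#define MAX_BLKS	8	/* hypothetical capacity for this sketch */

struct memblk {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	int nid;
};

struct meminfo {
	int nr_blks;
	struct memblk blk[MAX_BLKS];
};

/* Linear scan: return the node of the block containing addr, if any. */
int meminfo_to_nid(const struct meminfo *mi, uint64_t addr)
{
	int i;

	assert(mi->nr_blks <= MAX_BLKS);
	for (i = 0; i < mi->nr_blks; i++)
		if (mi->blk[i].start <= addr && mi->blk[i].end > addr)
			return mi->blk[i].nid;
	return NUMA_NO_NODE;
}

/* Prefer populated ranges, then fall back to reserved ranges. */
int target_node(const struct meminfo *populated,
		const struct meminfo *reserved, uint64_t addr)
{
	int nid = meminfo_to_nid(populated, addr);

	if (nid != NUMA_NO_NODE)
		return nid;
	return meminfo_to_nid(reserved, addr);
}

/* Hot-add lookup: default to the node of the first populated block. */
int add_physaddr_to_nid(const struct meminfo *populated, uint64_t addr)
{
	int nid = meminfo_to_nid(populated, addr);

	if (nid == NUMA_NO_NODE)
		nid = populated->blk[0].nid;
	return nid;
}
```

The two lookups differ only in their fallback: the target-node variant
consults the reserved table and may still return NUMA_NO_NODE, while the
hot-add variant always returns some node.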

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/sparsemem.h |  9 --------
 arch/x86/mm/numa.c               | 38 --------------------------------
 drivers/cxl/Kconfig              |  2 +-
 drivers/dax/Kconfig              |  2 +-
 include/linux/numa_memblks.h     |  7 ++++++
 mm/numa.c                        |  1 +
 mm/numa_memblks.c                | 38 ++++++++++++++++++++++++++++++++
 7 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 64df897c0ee3..3918c7a434f5 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -31,13 +31,4 @@
 
 #endif /* CONFIG_SPARSEMEM */
 
-#ifndef __ASSEMBLY__
-#ifdef CONFIG_NUMA_KEEP_MEMINFO
-extern int phys_to_target_node(phys_addr_t start);
-#define phys_to_target_node phys_to_target_node
-extern int memory_add_physaddr_to_nid(u64 start);
-#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
-#endif
-#endif /* __ASSEMBLY__ */
-
 #endif /* _ASM_X86_SPARSEMEM_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d23287611449..64e5cdb2460a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -453,41 +453,3 @@ u64 __init numa_emu_dma_end(void)
 	return PFN_PHYS(MAX_DMA32_PFN);
 }
 #endif /* CONFIG_NUMA_EMU */
-
-#ifdef CONFIG_NUMA_KEEP_MEMINFO
-static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
-{
-	int i;
-
-	for (i = 0; i < mi->nr_blks; i++)
-		if (mi->blk[i].start <= start && mi->blk[i].end > start)
-			return mi->blk[i].nid;
-	return NUMA_NO_NODE;
-}
-
-int phys_to_target_node(phys_addr_t start)
-{
-	int nid = meminfo_to_nid(&numa_meminfo, start);
-
-	/*
-	 * Prefer online nodes, but if reserved memory might be
-	 * hot-added continue the search with reserved ranges.
-	 */
-	if (nid != NUMA_NO_NODE)
-		return nid;
-
-	return meminfo_to_nid(&numa_reserved_meminfo, start);
-}
-EXPORT_SYMBOL_GPL(phys_to_target_node);
-
-int memory_add_physaddr_to_nid(u64 start)
-{
-	int nid = meminfo_to_nid(&numa_meminfo, start);
-
-	if (nid == NUMA_NO_NODE)
-		nid = numa_meminfo.blk[0].nid;
-	return nid;
-}
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-
-#endif
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 99b5c25be079..29c192f20082 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -6,7 +6,7 @@ menuconfig CXL_BUS
 	select FW_UPLOAD
 	select PCI_DOE
 	select FIRMWARE_TABLE
-	select NUMA_KEEP_MEMINFO if (NUMA && X86)
+	select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
 	help
 	  CXL is a bus that is electrically compatible with PCI Express, but
 	  layers three protocols on that signalling (CXL.io, CXL.cache, and
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..d656e4c0eb84 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -30,7 +30,7 @@ config DEV_DAX_PMEM
 config DEV_DAX_HMEM
 	tristate "HMEM DAX: direct access to 'specific purpose' memory"
 	depends on EFI_SOFT_RESERVE
-	select NUMA_KEEP_MEMINFO if (NUMA && X86)
+	select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
 	default DEV_DAX
 	help
 	  EFI 2.8 platforms, and others, may advertise 'specific purpose'
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 5c6e12ad0b7a..17d4bcc34091 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -46,6 +46,13 @@ static inline int numa_emu_cmdline(char *str)
 }
 #endif /* CONFIG_NUMA_EMU */
 
+#ifdef CONFIG_NUMA_KEEP_MEMINFO
+extern int phys_to_target_node(phys_addr_t start);
+#define phys_to_target_node phys_to_target_node
+extern int memory_add_physaddr_to_nid(u64 start);
+#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
+#endif /* CONFIG_NUMA_KEEP_MEMINFO */
+
 #endif /* CONFIG_NUMA_MEMBLKS */
 
 #endif	/* __NUMA_MEMBLKS_H */
diff --git a/mm/numa.c b/mm/numa.c
index 1f1582dcdf4a..e2eec07707d1 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -3,6 +3,7 @@
 #include <linux/memblock.h>
 #include <linux/printk.h>
 #include <linux/numa.h>
+#include <linux/numa_memblks.h>
 
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index c4037faa438b..a28507cf1e7f 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -531,3 +531,41 @@ int __init numa_fill_memblks(u64 start, u64 end)
 	}
 	return 0;
 }
+
+#ifdef CONFIG_NUMA_KEEP_MEMINFO
+static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
+{
+	int i;
+
+	for (i = 0; i < mi->nr_blks; i++)
+		if (mi->blk[i].start <= start && mi->blk[i].end > start)
+			return mi->blk[i].nid;
+	return NUMA_NO_NODE;
+}
+
+int phys_to_target_node(phys_addr_t start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	/*
+	 * Prefer online nodes, but if reserved memory might be
+	 * hot-added continue the search with reserved ranges.
+	 */
+	if (nid != NUMA_NO_NODE)
+		return nid;
+
+	return meminfo_to_nid(&numa_reserved_meminfo, start);
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
+
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_meminfo.blk[0].nid;
+	return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+
+#endif /* CONFIG_NUMA_KEEP_MEMINFO */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 26/26] docs: move numa=fake description to kernel-parameters.txt
  2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
                   ` (24 preceding siblings ...)
  2024-08-07  6:41 ` [PATCH v4 25/26] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
@ 2024-08-07  6:41 ` Mike Rapoport
  25 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07  6:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

NUMA emulation can now be enabled on arm64 and riscv in addition to x86.

Move the description of the numa=fake parameters from the x86
documentation to admin-guide/kernel-parameters.txt.
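
For readers unfamiliar with the three documented forms of numa=fake, the
following userspace sketch classifies the text after "numa=fake=". It is
an illustration only and is not the kernel's parser: classify_fake(),
struct fake_spec and enum fake_mode are hypothetical names, the sketch
rejects anything but a decimal number with at most one suffix, and it
does not guard against overflow for very large sizes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum fake_mode { FAKE_INVALID, FAKE_SIZE, FAKE_COUNT, FAKE_PER_NODE };

struct fake_spec {
	enum fake_mode mode;
	unsigned long long value;	/* bytes for FAKE_SIZE, else a count */
};

/* Classify "<size>[MG]", "<N>" or "<N>U"; anything else is invalid. */
struct fake_spec classify_fake(const char *arg)
{
	struct fake_spec spec = { FAKE_INVALID, 0 };
	unsigned long long n;
	char *end;

	assert(arg);
	if (!isdigit((unsigned char)arg[0]))
		return spec;

	n = strtoull(arg, &end, 10);
	/* Reject zero and anything longer than a single suffix character. */
	if (n == 0 || (end[0] != '\0' && end[1] != '\0'))
		return spec;

	switch (end[0]) {
	case '\0':		/* N fake nodes over all RAM */
		spec.mode = FAKE_COUNT;
		spec.value = n;
		break;
	case 'U':		/* N emulated nodes per physical node */
		spec.mode = FAKE_PER_NODE;
		spec.value = n;
		break;
	case 'M':		/* nodes of <size> mebibytes */
		spec.mode = FAKE_SIZE;
		spec.value = n << 20;
		break;
	case 'G':		/* nodes of <size> gibibytes */
		spec.mode = FAKE_SIZE;
		spec.value = n << 30;
		break;
	default:
		break;
	}
	return spec;
}
```

For example, "512M" is classified as a size of 536870912 bytes, "4" as a
total node count of 4, and "4U" as 4 emulated nodes per physical node.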

Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 15 +++++++++++++++
 Documentation/arch/x86/x86_64/boot-options.rst  | 12 ------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f1384c7b59c9..bcdee8984e1f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4123,6 +4123,21 @@
 			Disable NUMA, Only set up a single NUMA node
 			spanning all memory.
 
+	numa=fake=<size>[MG]
+			[KNL, ARM64, RISCV, X86, EARLY]
+			If given as a memory unit, fills all system RAM with
+			nodes of size interleaved over physical nodes.
+
+	numa=fake=<N>
+			[KNL, ARM64, RISCV, X86, EARLY]
+			If given as an integer, fills all system RAM with N
+			fake nodes interleaved over physical nodes.
+
+	numa=fake=<N>U
+			[KNL, ARM64, RISCV, X86, EARLY]
+			If given as an integer followed by 'U', it will
+			divide each physical node into N emulated nodes.
+
 	numa_balancing=	[KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
 			NUMA balancing.
 			Allowed values are enable and disable
diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
index 137432d34109..98d4805f0823 100644
--- a/Documentation/arch/x86/x86_64/boot-options.rst
+++ b/Documentation/arch/x86/x86_64/boot-options.rst
@@ -170,18 +170,6 @@ NUMA
     Don't parse the HMAT table for NUMA setup, or soft-reserved memory
     partitioning.
 
-  numa=fake=<size>[MG]
-    If given as a memory unit, fills all system RAM with nodes of
-    size interleaved over physical nodes.
-
-  numa=fake=<N>
-    If given as an integer, fills all system RAM with N fake nodes
-    interleaved over physical nodes.
-
-  numa=fake=<N>U
-    If given as an integer followed by 'U', it will divide each
-    physical node into N emulated nodes.
-
 ACPI
 ====
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 24/26] arch_numa: switch over to numa_memblks
  2024-08-07  6:41 ` [PATCH v4 24/26] arch_numa: switch over to numa_memblks Mike Rapoport
@ 2024-08-07  6:58   ` Arnd Bergmann
  2024-08-07 18:18     ` Mike Rapoport
  2024-11-27 19:32   ` Marc Zyngier
  1 sibling, 1 reply; 33+ messages in thread
From: Arnd Bergmann @ 2024-08-07  6:58 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S . Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Palmer Dabbelt,
	Rafael J . Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, Linux-Arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86

On Wed, Aug 7, 2024, at 08:41, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Until now arch_numa was directly translating firmware NUMA information
> to memblock.

I get a link time warning from this:

    WARNING: modpost: vmlinux: section mismatch in reference: numa_set_cpumask+0x24 (section: .text.unlikely) -> early_cpu_to_node (section: .init.text)

> @@ -142,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>  unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
>  EXPORT_SYMBOL(__per_cpu_offset);
> 
> -int __init early_cpu_to_node(int cpu)
> +int early_cpu_to_node(int cpu)
>  {
>  	return cpu_to_node_map[cpu];
>  }

early_cpu_to_node() can no longer be __init here

> +#endif /* CONFIG_NUMA_EMU */
> diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
> index c32e0cf23c90..c2b046d1fd82 100644
> --- a/include/asm-generic/numa.h
> +++ b/include/asm-generic/numa.h
> @@ -32,8 +32,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
> 
>  void __init arch_numa_init(void);
>  int __init numa_add_memblk(int nodeid, u64 start, u64 end);
> -void __init numa_set_distance(int from, int to, int distance);
> -void __init numa_free_distance(void);
>  void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>  int __init early_cpu_to_node(int cpu);
>  void numa_store_cpu_info(unsigned int cpu);

but is still declared as __init in the header, so it is
still put in that section and discarded after boot.

I was confused by this at first, since the 'early' name
seems to imply that you shouldn't call it once the system
is up, but now you do.

     Arnd

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 24/26] arch_numa: switch over to numa_memblks
  2024-08-07  6:58   ` Arnd Bergmann
@ 2024-08-07 18:18     ` Mike Rapoport
  2024-08-07 18:53       ` Arnd Bergmann
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Rapoport @ 2024-08-07 18:18 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S . Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Palmer Dabbelt,
	Rafael J . Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, Linux-Arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86

On Wed, Aug 07, 2024 at 08:58:37AM +0200, Arnd Bergmann wrote:
> On Wed, Aug 7, 2024, at 08:41, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Until now arch_numa was directly translating firmware NUMA information
> > to memblock.
> 
> I get a link time warning from this:
> 
>     WARNING: modpost: vmlinux: section mismatch in reference: numa_set_cpumask+0x24 (section: .text.unlikely) -> early_cpu_to_node (section: .init.text)

I didn't see this in my build tests or in the kbuild reports :/
 
> > @@ -142,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
> >  unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
> >  EXPORT_SYMBOL(__per_cpu_offset);
> > 
> > -int __init early_cpu_to_node(int cpu)
> > +int early_cpu_to_node(int cpu)
> >  {
> >  	return cpu_to_node_map[cpu];
> >  }
> 
> early_cpu_to_node() can no longer be __init here
> 
> > +#endif /* CONFIG_NUMA_EMU */
> > diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
> > index c32e0cf23c90..c2b046d1fd82 100644
> > --- a/include/asm-generic/numa.h
> > +++ b/include/asm-generic/numa.h
> > @@ -32,8 +32,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
> > 
> >  void __init arch_numa_init(void);
> >  int __init numa_add_memblk(int nodeid, u64 start, u64 end);
> > -void __init numa_set_distance(int from, int to, int distance);
> > -void __init numa_free_distance(void);
> >  void __init early_map_cpu_to_node(unsigned int cpu, int nid);
> >  int __init early_cpu_to_node(int cpu);
> >  void numa_store_cpu_info(unsigned int cpu);
> 
> but is still declared as __init in the header, so it is
> still put in that section and discarded after boot.

I believe this should fix it

diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index c2b046d1fd82..e063d6487f66 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -33,7 +33,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
 void __init arch_numa_init(void);
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 void __init early_map_cpu_to_node(unsigned int cpu, int nid);
-int __init early_cpu_to_node(int cpu);
+int early_cpu_to_node(int cpu);
 void numa_store_cpu_info(unsigned int cpu);
 void numa_add_cpu(unsigned int cpu);
 void numa_remove_cpu(unsigned int cpu);
 
> I was confused by this at first, since the 'early' name
> seems to imply that you shouldn't call it once the system
> is up, but now you do.

I agree that this is confusing, but that's what x86 does and what
numa_emulation uses.
 
>      Arnd
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 24/26] arch_numa: switch over to numa_memblks
  2024-08-07 18:18     ` Mike Rapoport
@ 2024-08-07 18:53       ` Arnd Bergmann
  0 siblings, 0 replies; 33+ messages in thread
From: Arnd Bergmann @ 2024-08-07 18:53 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S . Miller, Davidlohr Bueso,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Jonathan Corbet, Michael Ellerman, Palmer Dabbelt,
	Rafael J . Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, Linux-Arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86

On Wed, Aug 7, 2024, at 20:18, Mike Rapoport wrote:
> On Wed, Aug 07, 2024 at 08:58:37AM +0200, Arnd Bergmann wrote:
>> On Wed, Aug 7, 2024, at 08:41, Mike Rapoport wrote:
>> > 
>> >  void __init arch_numa_init(void);
>> >  int __init numa_add_memblk(int nodeid, u64 start, u64 end);
>> > -void __init numa_set_distance(int from, int to, int distance);
>> > -void __init numa_free_distance(void);
>> >  void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>> >  int __init early_cpu_to_node(int cpu);
>> >  void numa_store_cpu_info(unsigned int cpu);
>> 
>> but is still declared as __init in the header, so it is
>> still put in that section and discarded after boot.
>
> I believe this should fix it

Yes, sorry, I should have posted the patch as well; this is
what I tested with locally.

     Arnd

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 24/26] arch_numa: switch over to numa_memblks
@ 2024-08-26 22:46 Bruno Faccini
  0 siblings, 0 replies; 33+ messages in thread
From: Bruno Faccini @ 2024-08-26 22:46 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel@vger.kernel.org, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dan Williams, Dave Hansen, David Hildenbrand,
	David S. Miller, Davidlohr Bueso, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Jonathan Corbet,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Samuel Holland, Thomas Bogendoerfer, Thomas Gleixner,
	Vasily Gorbik, Will Deacon, devicetree@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-cxl@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-mm@kvack.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
	nvdimm@lists.linux.dev, sparclinux@vger.kernel.org,
	x86@kernel.org, Zi Yan, Bruno Faccini

On 7 Aug 2024, at 2:41, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Until now arch_numa was directly translating firmware NUMA information
to memblock.

Using numa_memblks as an intermediate step has a few advantages:
* alignment with more battle tested x86 implementation
* availability of NUMA emulation
* maintaining node information for not yet populated memory

Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t
and replace current functionality related to numa_add_memblk() and
__node_distance() in arch_numa with the implementation based on
numa_memblks and add functions required by numa_emulation.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
  drivers/base/Kconfig       |   1 +
  drivers/base/arch_numa.c   | 201 +++++++++++--------------------------
  include/asm-generic/numa.h |   6 +-
  mm/numa_memblks.c          |  17 ++--
  4 files changed, 75 insertions(+), 150 deletions(-)


<snip>

+
+u64 __init numa_emu_dma_end(void)
+{
+	return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
+}
+

The PFN_PHYS() translation is unnecessary here, as
memblock_start_of_DRAM() + SZ_4G is already a
physical address, not a page frame number.

This should fix it:
====================================================
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 8d49893c0e94..e18701676426 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -346,7 +346,7 @@ void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,

u64 __init numa_emu_dma_end(void)
{
-              return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
+             return memblock_start_of_DRAM() + SZ_4G;
}

void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)
====================================================
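
To illustrate the point above, here is a minimal userspace sketch of the arithmetic. It is not the kernel code: PAGE_SHIFT of 12 (4 KiB pages), the local PFN_PHYS() definition, the dram_start parameter standing in for the return value of memblock_start_of_DRAM(), and the 2 GiB example base are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                 /* assumption: 4 KiB pages */
#define SZ_4G       0x100000000ULL
/* Same idea as the kernel macro: page frame number -> physical address. */
#define PFN_PHYS(x) ((uint64_t)(x) << PAGE_SHIFT)

/* dram_start is a physical address, so shifting it again is wrong. */
uint64_t dma_end_buggy(uint64_t dram_start)
{
	return PFN_PHYS(dram_start + SZ_4G);
}

/* The sum is already a physical address and is returned unchanged. */
uint64_t dma_end_fixed(uint64_t dram_start)
{
	return dram_start + SZ_4G;
}

/* Worked example with a hypothetical DRAM base of 2 GiB. */
void check_example(void)
{
	const uint64_t dram_start = 0x80000000ULL;

	assert(dma_end_fixed(dram_start) == 0x180000000ULL);
	assert(dma_end_buggy(dram_start) == 0x180000000000ULL);
}
```

With these assumed values the unfixed variant reports a DMA limit 4096 times larger than intended, so any comparison against it would treat nearly all memory as DMA-reachable.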



!!! I had a lot of trouble sending plain text from Outlook on my Mac; sorry for the noise and the duplicate copies !!!





* Re: [PATCH v4 24/26] arch_numa: switch over to numa_memblks
       [not found] <MW4PR12MB72616723E1A090E315681FF6A38B2@MW4PR12MB7261.namprd12.prod.outlook.com>
@ 2024-08-27  8:52 ` Mike Rapoport
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2024-08-27  8:52 UTC (permalink / raw)
  To: Bruno Faccini
  Cc: linux-kernel@vger.kernel.org, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dan Williams, Dave Hansen, David Hildenbrand,
	David S. Miller, Davidlohr Bueso, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Jonathan Corbet,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Samuel Holland, Thomas Bogendoerfer, Thomas Gleixner,
	Vasily Gorbik, Will Deacon, devicetree@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-cxl@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-mm@kvack.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
	nvdimm@lists.linux.dev, sparclinux@vger.kernel.org,
	x86@kernel.org, Zi Yan

Hi,

On Mon, Aug 26, 2024 at 06:17:22PM +0000, Bruno Faccini wrote:
> > On 7 Aug 2024, at 2:41, Mike Rapoport wrote:
> > 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Until now arch_numa was directly translating firmware NUMA information
> > to memblock.
> > 
> > Using numa_memblks as an intermediate step has a few advantages:
> > * alignment with more battle tested x86 implementation
> > * availability of NUMA emulation
> > * maintaining node information for not yet populated memory
> > 
> > Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t
> > and replace current functionality related to numa_add_memblk() and
> > __node_distance() in arch_numa with the implementation based on
> > numa_memblks and add functions required by numa_emulation.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
> > Acked-by: Dan Williams <dan.j.williams@intel.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >   drivers/base/Kconfig       |   1 +
> >   drivers/base/arch_numa.c   | 201 +++++++++++--------------------------
> >   include/asm-generic/numa.h |   6 +-
> >   mm/numa_memblks.c          |  17 ++--
> >   4 files changed, 75 insertions(+), 150 deletions(-)
> >  
> > <snip>
> > 
> > +
> > +u64 __init numa_emu_dma_end(void)
> > +{
> > +             return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
> > +}
> > +
> 
> PFN_PHYS() translation is unnecessary here, as
> memblock_start_of_DRAM() + SZ_4G is already a
> physical address, not a page frame number.
> 
> This should fix it:
>  
> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> index 8d49893c0e94..e18701676426 100644
> --- a/drivers/base/arch_numa.c
> +++ b/drivers/base/arch_numa.c
> @@ -346,7 +346,7 @@ void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
> 
> u64 __init numa_emu_dma_end(void)
> {
> -              return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
> +             return memblock_start_of_DRAM() + SZ_4G;
> }
> 
> void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)

Right, I've missed that. Thanks for the fix!

Andrew, can you please apply this (with fixed formatting)?

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 8d49893c0e94..e18701676426 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -346,7 +346,7 @@ void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
 
 u64 __init numa_emu_dma_end(void)
 {
-	return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
+	return memblock_start_of_DRAM() + SZ_4G;
 }
 
 void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)

-- 
Sincerely yours,
Mike.


* Re: [PATCH v4 24/26] arch_numa: switch over to numa_memblks
  2024-08-07  6:41 ` [PATCH v4 24/26] arch_numa: switch over to numa_memblks Mike Rapoport
  2024-08-07  6:58   ` Arnd Bergmann
@ 2024-11-27 19:32   ` Marc Zyngier
  1 sibling, 0 replies; 33+ messages in thread
From: Marc Zyngier @ 2024-11-27 19:32 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Davidlohr Bueso, Greg Kroah-Hartman, Heiko Carstens, Huacai Chen,
	Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Jonathan Cameron, Jonathan Corbet, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring, Samuel Holland,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	Zi Yan, devicetree, linux-acpi, linux-arch, linux-arm-kernel,
	linux-cxl, linux-doc, linux-mips, linux-mm, linux-riscv,
	linux-s390, linux-sh, linuxppc-dev, loongarch, nvdimm, sparclinux,
	x86, Jonathan Cameron

Hi Mike,

Sorry for reviving a rather old thread.

On Wed, 07 Aug 2024 07:41:08 +0100,
Mike Rapoport <rppt@kernel.org> wrote:
> 
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Until now arch_numa was directly translating firmware NUMA information
> to memblock.
> 
> Using numa_memblks as an intermediate step has a few advantages:
> * alignment with more battle tested x86 implementation
> * availability of NUMA emulation
> * maintaining node information for not yet populated memory
> 
> Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t
> and replace current functionality related to numa_add_memblk() and
> __node_distance() in arch_numa with the implementation based on
> numa_memblks and add functions required by numa_emulation.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/base/Kconfig       |   1 +
>  drivers/base/arch_numa.c   | 201 +++++++++++--------------------------
>  include/asm-generic/numa.h |   6 +-
>  mm/numa_memblks.c          |  17 ++--
>  4 files changed, 75 insertions(+), 150 deletions(-)
>

[...]

>  static int __init numa_register_nodes(void)
>  {
>  	int nid;
> -	struct memblock_region *mblk;
> -
> -	/* Check that valid nid is set to memblks */
> -	for_each_mem_region(mblk) {
> -		int mblk_nid = memblock_get_region_node(mblk);
> -		phys_addr_t start = mblk->base;
> -		phys_addr_t end = mblk->base + mblk->size - 1;
> -
> -		if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) {
> -			pr_warn("Warning: invalid memblk node %d [mem %pap-%pap]\n",
> -				mblk_nid, &start, &end);
> -			return -EINVAL;
> -		}
> -	}
>  

This hunk has the unfortunate side effect of killing my ThunderX
extremely early at boot time, as this sorry excuse for a machine
really relies on the kernel recognising that whatever NUMA information
the FW offers is BS.

Reverting this hunk restores happiness (sort of).
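
For readers without the tree at hand, the check removed by the quoted hunk can be sketched in userspace roughly as follows. This is only an illustration: struct region is a hypothetical stand-in for struct memblock_region, MAX_NUMNODES is set to an arbitrary small value, and the function name check_region_nids() does not exist in the kernel. The idea is that a single region without a valid node ID makes the whole registration fail, which presumably lets the caller discard the firmware-provided NUMA description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NUMA_NO_NODE  (-1)
#define MAX_NUMNODES  4        /* assumption: arbitrary small limit for the sketch */

/* Hypothetical stand-in for struct memblock_region. */
struct region {
	uint64_t base;
	uint64_t size;
	int nid;
};

/* Return 0 if every region carries a valid node ID, -EINVAL otherwise. */
int check_region_nids(const struct region *regions, size_t count)
{
	assert(regions != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		const struct region *r = &regions[i];
		uint64_t start = r->base;
		uint64_t end = r->base + r->size - 1;

		if (r->nid == NUMA_NO_NODE || r->nid >= MAX_NUMNODES) {
			fprintf(stderr, "invalid memblk node %d [mem %#llx-%#llx]\n",
				r->nid, (unsigned long long)start,
				(unsigned long long)end);
			return -EINVAL;
		}
	}
	return 0;
}
```

A caller would treat a non-zero return as a reason to reject the NUMA layout as a whole rather than to skip the single offending region.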

FWIW, I've posted a patch with such revert at [1].

Thanks,

	M.

[1] https://lore.kernel.org/r/20241127193000.3702637-1-maz@kernel.org

-- 
Without deviation from the norm, progress is not possible.


end of thread, other threads:[~2024-11-27 19:32 UTC | newest]

Thread overview: 33+ messages
2024-08-07  6:40 [PATCH v4 00/26] mm: introduce numa_memblks Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 01/26] mm: move kernel/numa.c to mm/ Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 02/26] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 03/26] MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 04/26] MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 05/26] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 06/26] MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 07/26] arch, mm: move definition of node_data to generic code Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 08/26] mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 09/26] arch, mm: pull out allocation of NODE_DATA to generic code Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 10/26] x86/numa: simplify numa_distance allocation Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 11/26] x86/numa: use get_pfn_range_for_nid to verify that node spans memory Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 12/26] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 13/26] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 14/26] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
2024-08-07  6:40 ` [PATCH v4 15/26] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 16/26] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 17/26] mm: introduce numa_memblks Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 18/26] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 19/26] mm: introduce numa_emulation Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 20/26] mm: numa_memblks: introduce numa_memblks_init Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 21/26] mm: numa_memblks: make several functions and variables static Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 22/26] mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 23/26] of, numa: return -EINVAL when no numa-node-id is found Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 24/26] arch_numa: switch over to numa_memblks Mike Rapoport
2024-08-07  6:58   ` Arnd Bergmann
2024-08-07 18:18     ` Mike Rapoport
2024-08-07 18:53       ` Arnd Bergmann
2024-11-27 19:32   ` Marc Zyngier
2024-08-07  6:41 ` [PATCH v4 25/26] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
2024-08-07  6:41 ` [PATCH v4 26/26] docs: move numa=fake description to kernel-parameters.txt Mike Rapoport
  -- strict thread matches above, loose matches on Subject: below --
2024-08-26 22:46 [PATCH v4 24/26] arch_numa: switch over to numa_memblks Bruno Faccini
     [not found] <MW4PR12MB72616723E1A090E315681FF6A38B2@MW4PR12MB7261.namprd12.prod.outlook.com>
2024-08-27  8:52 ` Mike Rapoport
