[PATCH 00/17] mm: introduce numa

* [PATCH 00/17] mm: introduce numa_memblks
@ 2024-07-16 11:13 Mike Rapoport
  2024-07-16 11:13 ` [PATCH 01/17] mm: move kernel/numa.c to mm/ Mike Rapoport
                   ` (17 more replies)
  0 siblings, 18 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

Following the discussion about handling of CXL fixed memory windows on
arm64 [1], I decided to bite the bullet and move numa_memblks from x86 to
the generic code so that they will be available on arm64/riscv and maybe
on loongarch sometime later.

While it could be possible to use memblock to describe CXL memory windows,
it currently lacks the notion of unpopulated memory ranges, which
numa_memblks already implements.

Another reason to make numa_memblks generic is that both arch_numa (arm64
and riscv) and loongarch use a trimmed copy of the x86 code, although there
is no fundamental reason why the same code cannot be used on all these
platforms. Having numa_memblks in mm/ will make its interaction with ACPI
and FDT more consistent and, I believe, will reduce the maintenance burden.

And with generic numa_memblks it is (almost) straightforward to enable NUMA
emulation on arm64 and riscv.

The first 5 commits in this series are cleanups that are not strictly
related to numa_memblks.

Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
and NUMA emulation to the generic code.

Commits 12-14 actually move the code from arch/x86/ to mm/, and commit 15
does some follow-up cleanups.

Commit 16 switches arch_numa to numa_memblks.

Commit 17 enables usage of phys_to_target_node() and
memory_add_physaddr_to_nid() with numa_memblks.
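
As a rough illustration of the idea (not the code from this series), the
sketch below models a numa_memblks-style table in userspace C: address
ranges are recorded together with a node id whether or not memory is
populated behind them, and lookups consult the populated table first and a
table of reserved windows second. All sketch_* names, the two table
variables and the array size are assumptions made for this example, and the
fallback to the first populated block is only one plausible policy.

```c
#include <stdint.h>

#define NUMA_NO_NODE	(-1)
#define SKETCH_MAX_BLKS	8	/* arbitrary limit for this sketch */

/* One physical address range [start, end) and the node it belongs to. */
struct sketch_memblk {
	uint64_t start;
	uint64_t end;
	int nid;
};

struct sketch_meminfo {
	int nr_blks;
	struct sketch_memblk blk[SKETCH_MAX_BLKS];
};

/* Ranges backed by memory, and windows that are described but unpopulated. */
static struct sketch_meminfo populated;
static struct sketch_meminfo reserved;

/* Record a range; returns 0 on success, -1 if it is empty or the table is full. */
static int sketch_add_memblk(struct sketch_meminfo *mi, int nid,
			     uint64_t start, uint64_t end)
{
	if (start >= end || mi->nr_blks >= SKETCH_MAX_BLKS)
		return -1;

	mi->blk[mi->nr_blks].start = start;
	mi->blk[mi->nr_blks].end = end;
	mi->blk[mi->nr_blks].nid = nid;
	mi->nr_blks++;
	return 0;
}

/* Linear search for the block containing addr. */
static int sketch_meminfo_to_nid(const struct sketch_meminfo *mi, uint64_t addr)
{
	for (int i = 0; i < mi->nr_blks; i++)
		if (addr >= mi->blk[i].start && addr < mi->blk[i].end)
			return mi->blk[i].nid;

	return NUMA_NO_NODE;
}

/* Target node lookup: populated ranges first, then reserved windows. */
static int sketch_target_node(uint64_t addr)
{
	int nid = sketch_meminfo_to_nid(&populated, addr);

	if (nid != NUMA_NO_NODE)
		return nid;
	return sketch_meminfo_to_nid(&reserved, addr);
}

/* Hotplug-style lookup: populated ranges only, else the first known node. */
static int sketch_hotplug_nid(uint64_t addr)
{
	int nid = sketch_meminfo_to_nid(&populated, addr);

	if (nid == NUMA_NO_NODE && populated.nr_blks > 0)
		nid = populated.blk[0].nid;
	return nid;
}
```

With RAM for node 0 in the populated table and an empty CXL window for node
2 in the reserved table, an address inside the window resolves to node 2
through sketch_target_node(), while sketch_hotplug_nid() falls back to
node 0.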

[1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/

Mike Rapoport (Microsoft) (17):
  mm: move kernel/numa.c to mm/
  MIPS: sgi-ip27: make NODE_DATA() the same as on all other
    architectures
  MIPS: loongson64: rename __node_data to node_data
  arch, mm: move definition of node_data to generic code
  arch, mm: pull out allocation of NODE_DATA to generic code
  x86/numa: simplify numa_distance allocation
  x86/numa: move FAKE_NODE_* defines to numa_emu
  x86/numa_emu: simplify allocation of phys_dist
  x86/numa_emu: split __apicid_to_node update to a helper function
  x86/numa_emu: use a helper function to get MAX_DMA32_PFN
  x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
  mm: introduce numa_memblks
  mm: move numa_distance and related code from x86 to numa_memblks
  mm: introduce numa_emulation
  mm: make numa_memblks more self-contained
  arch_numa: switch over to numa_memblks
  mm: make range-to-target_node lookup facility a part of numa_memblks

 arch/arm64/include/asm/Kbuild                 |   1 +
 arch/arm64/include/asm/mmzone.h               |  13 -
 arch/arm64/include/asm/topology.h             |   1 +
 arch/loongarch/include/asm/Kbuild             |   1 +
 arch/loongarch/include/asm/mmzone.h           |  16 -
 arch/loongarch/include/asm/topology.h         |   1 +
 arch/loongarch/kernel/numa.c                  |  21 -
 arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
 .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
 arch/mips/loongson64/numa.c                   |  20 +-
 arch/mips/sgi-ip27/ip27-memory.c              |   2 +-
 arch/powerpc/include/asm/mmzone.h             |   6 -
 arch/powerpc/mm/numa.c                        |  26 +-
 arch/riscv/include/asm/Kbuild                 |   1 +
 arch/riscv/include/asm/mmzone.h               |  13 -
 arch/riscv/include/asm/topology.h             |   4 +
 arch/s390/include/asm/Kbuild                  |   1 +
 arch/s390/include/asm/mmzone.h                |  17 -
 arch/s390/kernel/numa.c                       |   3 -
 arch/sh/include/asm/mmzone.h                  |   3 -
 arch/sh/mm/init.c                             |   7 +-
 arch/sh/mm/numa.c                             |   3 -
 arch/sparc/include/asm/mmzone.h               |   4 -
 arch/sparc/mm/init_64.c                       |  11 +-
 arch/x86/Kconfig                              |   9 +-
 arch/x86/include/asm/Kbuild                   |   1 +
 arch/x86/include/asm/mmzone.h                 |   6 -
 arch/x86/include/asm/mmzone_32.h              |  17 -
 arch/x86/include/asm/mmzone_64.h              |  18 -
 arch/x86/include/asm/numa.h                   |  24 +-
 arch/x86/include/asm/sparsemem.h              |   9 -
 arch/x86/mm/Makefile                          |   1 -
 arch/x86/mm/amdtopology.c                     |   1 +
 arch/x86/mm/numa.c                            | 618 +-----------------
 arch/x86/mm/numa_internal.h                   |  24 -
 drivers/acpi/numa/srat.c                      |   1 +
 drivers/base/Kconfig                          |   1 +
 drivers/base/arch_numa.c                      | 223 ++-----
 drivers/cxl/Kconfig                           |   2 +-
 drivers/dax/Kconfig                           |   2 +-
 drivers/of/of_numa.c                          |   1 +
 include/asm-generic/mmzone.h                  |   5 +
 include/asm-generic/numa.h                    |   6 +-
 include/linux/numa.h                          |   5 +
 include/linux/numa_memblks.h                  |  58 ++
 kernel/Makefile                               |   1 -
 kernel/numa.c                                 |  26 -
 mm/Kconfig                                    |  11 +
 mm/Makefile                                   |   3 +
 mm/numa.c                                     |  57 ++
 {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
 mm/numa_memblks.c                             | 565 ++++++++++++++++
 52 files changed, 847 insertions(+), 1070 deletions(-)
 delete mode 100644 arch/arm64/include/asm/mmzone.h
 delete mode 100644 arch/loongarch/include/asm/mmzone.h
 delete mode 100644 arch/riscv/include/asm/mmzone.h
 delete mode 100644 arch/s390/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone_32.h
 delete mode 100644 arch/x86/include/asm/mmzone_64.h
 create mode 100644 include/asm-generic/mmzone.h
 create mode 100644 include/linux/numa_memblks.h
 delete mode 100644 kernel/numa.c
 create mode 100644 mm/numa.c
 rename {arch/x86/mm => mm}/numa_emulation.c (94%)
 create mode 100644 mm/numa_memblks.c


base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
-- 
2.43.0


* [PATCH 01/17] mm: move kernel/numa.c to mm/
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-17 14:35   ` David Hildenbrand
  2024-07-19 13:55   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The stub functions in kernel/numa.c belong in mm/ rather than in kernel/.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 kernel/Makefile       | 1 -
 mm/Makefile           | 1 +
 {kernel => mm}/numa.c | 0
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename {kernel => mm}/numa.c (100%)

diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..87866b037fbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -116,7 +116,6 @@ obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
 obj-$(CONFIG_HAVE_STATIC_CALL) += static_call.o
 obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call_inline.o
 obj-$(CONFIG_CFI_CLANG) += cfi.o
-obj-$(CONFIG_NUMA) += numa.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/mm/Makefile b/mm/Makefile
index 8fb85acda1b1..773b3b267438 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -139,3 +139,4 @@ obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
+obj-$(CONFIG_NUMA) += numa.o
diff --git a/kernel/numa.c b/mm/numa.c
similarity index 100%
rename from kernel/numa.c
rename to mm/numa.c
-- 
2.43.0


* [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
  2024-07-16 11:13 ` [PATCH 01/17] mm: move kernel/numa.c to mm/ Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-17 14:32   ` David Hildenbrand
  2024-07-16 11:13 ` [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

sgi-ip27 is the only system that defines NODE_DATA() differently from
the rest of the NUMA machines.

Add a node_data array of struct pglist_data pointers that point to
__node_data[node]->pglist and redefine NODE_DATA() to use the node_data
array.

This will allow pulling the declaration of node_data into the generic mm
code in a later commit.
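
To make the aliasing concrete, here is a small userspace model of the
arrangement described above. It is only a sketch: the struct layouts are
trimmed stand-ins, ip27_node_data stands in for the kernel's __node_data
(double-underscore names are reserved in userspace C), and
sketch_node_mem_init() is a hypothetical helper rather than the real
node_mem_init().

```c
#include <stddef.h>

#define MAX_NUMNODES	4	/* arbitrary for this sketch */

/* Trimmed stand-ins for the kernel structures. */
struct pglist_data {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
};

struct hub_data {
	unsigned long placeholder;
};

/* Per-node container: the pglist_data is embedded next to the hub data. */
struct node_data {
	struct pglist_data pglist;
	struct hub_data hub;
};

/* Platform-private array of containers (models __node_data). */
struct node_data *ip27_node_data[MAX_NUMNODES];

/* Array with the same shape as on other architectures. */
struct pglist_data *node_data[MAX_NUMNODES];

#define NODE_DATA(nid)	(node_data[nid])
#define hub_data(n)	(&ip27_node_data[(n)]->hub)

/* Hook up both arrays for one node and fill in its PFN range. */
static void sketch_node_mem_init(int node, struct node_data *backing,
				 unsigned long start_pfn, unsigned long end_pfn)
{
	ip27_node_data[node] = backing;
	/* node_data[] points at the pglist embedded in the container */
	node_data[node] = &ip27_node_data[node]->pglist;

	NODE_DATA(node)->node_start_pfn = start_pfn;
	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
}
```

Because node_data[node] points into the container, writes made through
NODE_DATA() remain visible through the platform-private array, and
hub_data() keeps working unchanged.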

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/mips/include/asm/mach-ip27/mmzone.h | 5 ++++-
 arch/mips/sgi-ip27/ip27-memory.c         | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
index 08c36e50a860..629c3f290203 100644
--- a/arch/mips/include/asm/mach-ip27/mmzone.h
+++ b/arch/mips/include/asm/mach-ip27/mmzone.h
@@ -22,7 +22,10 @@ struct node_data {
 
 extern struct node_data *__node_data[];
 
-#define NODE_DATA(n)		(&__node_data[(n)]->pglist)
 #define hub_data(n)		(&__node_data[(n)]->hub)
 
+extern struct pglist_data *node_data[];
+
+#define NODE_DATA(nid)		(node_data[nid])
+
 #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index b8ca94cfb4fe..c30ef6958b97 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -34,8 +34,10 @@
 #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
 #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
 
-struct node_data *__node_data[MAX_NUMNODES];
+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
 
+struct node_data *__node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_data);
 
 static u64 gen_region_mask(void)
@@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
 	 */
 	__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
 	memset(__node_data[node], 0, PAGE_SIZE);
+	node_data[node] = &__node_data[node]->pglist;
 
 	NODE_DATA(node)->node_start_pfn = start_pfn;
 	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
-- 
2.43.0


* [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
  2024-07-16 11:13 ` [PATCH 01/17] mm: move kernel/numa.c to mm/ Mike Rapoport
  2024-07-16 11:13 ` [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-16 13:07   ` Jiaxun Yang
                     ` (2 more replies)
  2024-07-16 11:13 ` [PATCH 04/17] arch, mm: move definition of node_data to generic code Mike Rapoport
                   ` (14 subsequent siblings)
  17 siblings, 3 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Make the definition of node_data match other architectures.
This will allow pulling the declaration of node_data into the generic mm
code in the following commit.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/mips/include/asm/mach-loongson64/mmzone.h | 4 ++--
 arch/mips/loongson64/numa.c                    | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h b/arch/mips/include/asm/mach-loongson64/mmzone.h
index a3d65d37b8b5..2effd5f8ed62 100644
--- a/arch/mips/include/asm/mach-loongson64/mmzone.h
+++ b/arch/mips/include/asm/mach-loongson64/mmzone.h
@@ -14,9 +14,9 @@
 #define pa_to_nid(addr)  (((addr) & 0xf00000000000) >> NODE_ADDRSPACE_SHIFT)
 #define nid_to_addrbase(nid) ((unsigned long)(nid) << NODE_ADDRSPACE_SHIFT)
 
-extern struct pglist_data *__node_data[];
+extern struct pglist_data *node_data[];
 
-#define NODE_DATA(n)		(__node_data[n])
+#define NODE_DATA(n)		(node_data[n])
 
 extern void __init prom_init_numa_memory(void);
 
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index 68dafd6d3e25..b50ce28d2741 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -29,8 +29,8 @@
 
 unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES];
 EXPORT_SYMBOL(__node_distances);
-struct pglist_data *__node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(__node_data);
+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
 
 cpumask_t __node_cpumask[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_cpumask);
@@ -107,7 +107,7 @@ static void __init node_mem_init(unsigned int node)
 	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
 	if (tnid != node)
 		pr_info("NODE_DATA(%d) on node %d\n", node, tnid);
-	__node_data[node] = nd;
+	node_data[node] = nd;
 	NODE_DATA(node)->node_start_pfn = start_pfn;
 	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
 
@@ -206,5 +206,5 @@ pg_data_t * __init arch_alloc_nodedata(int nid)
 
 void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 {
-	__node_data[nid] = pgdat;
+	node_data[nid] = pgdat;
 }
-- 
2.43.0


* [PATCH 04/17] arch, mm: move definition of node_data to generic code
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (2 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-17 14:35   ` David Hildenbrand
                     ` (2 more replies)
  2024-07-16 11:13 ` [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA " Mike Rapoport
                   ` (13 subsequent siblings)
  17 siblings, 3 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Every architecture that supports NUMA defines node_data in the same way:

	struct pglist_data *node_data[MAX_NUMNODES];

There is no reason to keep multiple copies of this definition and its
forward declarations, especially when such a forward declaration is the
only thing in include/asm/mmzone.h for many architectures.

Add the definition and declaration of node_data to generic code and drop
the architecture-specific versions.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/arm64/include/asm/Kbuild                  |  1 +
 arch/arm64/include/asm/mmzone.h                | 13 -------------
 arch/arm64/include/asm/topology.h              |  1 +
 arch/loongarch/include/asm/Kbuild              |  1 +
 arch/loongarch/include/asm/mmzone.h            | 16 ----------------
 arch/loongarch/include/asm/topology.h          |  1 +
 arch/loongarch/kernel/numa.c                   |  3 ---
 arch/mips/include/asm/mach-ip27/mmzone.h       |  4 ----
 arch/mips/include/asm/mach-loongson64/mmzone.h |  4 ----
 arch/mips/loongson64/numa.c                    |  2 --
 arch/mips/sgi-ip27/ip27-memory.c               |  3 ---
 arch/powerpc/include/asm/mmzone.h              |  6 ------
 arch/powerpc/mm/numa.c                         |  2 --
 arch/riscv/include/asm/Kbuild                  |  1 +
 arch/riscv/include/asm/mmzone.h                | 13 -------------
 arch/riscv/include/asm/topology.h              |  4 ++++
 arch/s390/include/asm/Kbuild                   |  1 +
 arch/s390/include/asm/mmzone.h                 | 17 -----------------
 arch/s390/kernel/numa.c                        |  3 ---
 arch/sh/include/asm/mmzone.h                   |  3 ---
 arch/sh/mm/numa.c                              |  3 ---
 arch/sparc/include/asm/mmzone.h                |  4 ----
 arch/sparc/mm/init_64.c                        |  2 --
 arch/x86/include/asm/Kbuild                    |  1 +
 arch/x86/include/asm/mmzone.h                  |  6 ------
 arch/x86/include/asm/mmzone_32.h               | 17 -----------------
 arch/x86/include/asm/mmzone_64.h               | 18 ------------------
 arch/x86/mm/numa.c                             |  3 ---
 drivers/base/arch_numa.c                       |  2 --
 include/asm-generic/mmzone.h                   |  5 +++++
 include/linux/numa.h                           |  3 +++
 mm/numa.c                                      |  3 +++
 32 files changed, 22 insertions(+), 144 deletions(-)
 delete mode 100644 arch/arm64/include/asm/mmzone.h
 delete mode 100644 arch/loongarch/include/asm/mmzone.h
 delete mode 100644 arch/riscv/include/asm/mmzone.h
 delete mode 100644 arch/s390/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone.h
 delete mode 100644 arch/x86/include/asm/mmzone_32.h
 delete mode 100644 arch/x86/include/asm/mmzone_64.h
 create mode 100644 include/asm-generic/mmzone.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 4b6d2d52053e..4aaaa821ab6b 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
+generic-y += mmzone.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
 generic-y += parport.h
diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
deleted file mode 100644
index fa17e01d9ab2..000000000000
--- a/arch/arm64/include/asm/mmzone.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_MMZONE_H
-#define __ASM_MMZONE_H
-
-#ifdef CONFIG_NUMA
-
-#include <asm/numa.h>
-
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)		(node_data[(nid)])
-
-#endif /* CONFIG_NUMA */
-#endif /* __ASM_MMZONE_H */
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 0f6ef432fb84..5fc3af9f8f29 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -5,6 +5,7 @@
 #include <linux/cpumask.h>
 
 #ifdef CONFIG_NUMA
+#include <asm/numa.h>
 
 struct pci_bus;
 int pcibus_to_node(struct pci_bus *bus);
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index c862672ed953..2804f2a2ad61 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -15,6 +15,7 @@ generic-y += fcntl.h
 generic-y += ioctl.h
 generic-y += ioctls.h
 generic-y += mman.h
+generic-y += mmzone.h
 generic-y += msgbuf.h
 generic-y += sembuf.h
 generic-y += shmbuf.h
diff --git a/arch/loongarch/include/asm/mmzone.h b/arch/loongarch/include/asm/mmzone.h
deleted file mode 100644
index 2b9a90727e19..000000000000
--- a/arch/loongarch/include/asm/mmzone.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Author: Huacai Chen (chenhuacai@loongson.cn)
- * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
- */
-#ifndef _ASM_MMZONE_H_
-#define _ASM_MMZONE_H_
-
-#include <asm/page.h>
-#include <asm/numa.h>
-
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)	(node_data[(nid)])
-
-#endif /* _ASM_MMZONE_H_ */
diff --git a/arch/loongarch/include/asm/topology.h b/arch/loongarch/include/asm/topology.h
index 66128dec0bf6..50273c9187d0 100644
--- a/arch/loongarch/include/asm/topology.h
+++ b/arch/loongarch/include/asm/topology.h
@@ -8,6 +8,7 @@
 #include <linux/smp.h>
 
 #ifdef CONFIG_NUMA
+#include <asm/numa.h>
 
 extern cpumask_t cpus_on_node[];
 
diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c
index 8fe21f868f72..acada671e020 100644
--- a/arch/loongarch/kernel/numa.c
+++ b/arch/loongarch/kernel/numa.c
@@ -27,10 +27,7 @@
 #include <asm/time.h>
 
 int numa_off;
-struct pglist_data *node_data[MAX_NUMNODES];
 unsigned char node_distances[MAX_NUMNODES][MAX_NUMNODES];
-
-EXPORT_SYMBOL(node_data);
 EXPORT_SYMBOL(node_distances);
 
 static struct numa_meminfo numa_meminfo;
diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
index 629c3f290203..56959eb9cb26 100644
--- a/arch/mips/include/asm/mach-ip27/mmzone.h
+++ b/arch/mips/include/asm/mach-ip27/mmzone.h
@@ -24,8 +24,4 @@ extern struct node_data *__node_data[];
 
 #define hub_data(n)		(&__node_data[(n)]->hub)
 
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)		(node_data[nid])
-
 #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h b/arch/mips/include/asm/mach-loongson64/mmzone.h
index 2effd5f8ed62..8fb70fd3c9c4 100644
--- a/arch/mips/include/asm/mach-loongson64/mmzone.h
+++ b/arch/mips/include/asm/mach-loongson64/mmzone.h
@@ -14,10 +14,6 @@
 #define pa_to_nid(addr)  (((addr) & 0xf00000000000) >> NODE_ADDRSPACE_SHIFT)
 #define nid_to_addrbase(nid) ((unsigned long)(nid) << NODE_ADDRSPACE_SHIFT)
 
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(n)		(node_data[n])
-
 extern void __init prom_init_numa_memory(void);
 
 #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index b50ce28d2741..9208eaadf690 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -29,8 +29,6 @@
 
 unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES];
 EXPORT_SYMBOL(__node_distances);
-struct pglist_data *node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(node_data);
 
 cpumask_t __node_cpumask[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_cpumask);
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index c30ef6958b97..31e1d85b4fb2 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -34,9 +34,6 @@
 #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
 #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
 
-struct pglist_data *node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(node_data);
-
 struct node_data *__node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(__node_data);
 
diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index da827d2d0866..d99863cd6cde 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -20,12 +20,6 @@
 
 #ifdef CONFIG_NUMA
 
-extern struct pglist_data *node_data[];
-/*
- * Return a pointer to the node data for node n.
- */
-#define NODE_DATA(nid)		(node_data[nid])
-
 /*
  * Following are specific to this numa platform.
  */
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a490724e84ad..8c18973cd71e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -43,11 +43,9 @@ static char *cmdline __initdata;
 
 int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
-struct pglist_data *node_data[MAX_NUMNODES];
 
 EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
-EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 504f8b7e72d4..e44f168f60fc 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -2,6 +2,7 @@
 generic-y += early_ioremap.h
 generic-y += flat.h
 generic-y += kvm_para.h
+generic-y += mmzone.h
 generic-y += parport.h
 generic-y += spinlock.h
 generic-y += spinlock_types.h
diff --git a/arch/riscv/include/asm/mmzone.h b/arch/riscv/include/asm/mmzone.h
deleted file mode 100644
index fa17e01d9ab2..000000000000
--- a/arch/riscv/include/asm/mmzone.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_MMZONE_H
-#define __ASM_MMZONE_H
-
-#ifdef CONFIG_NUMA
-
-#include <asm/numa.h>
-
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)		(node_data[(nid)])
-
-#endif /* CONFIG_NUMA */
-#endif /* __ASM_MMZONE_H */
diff --git a/arch/riscv/include/asm/topology.h b/arch/riscv/include/asm/topology.h
index 61183688bdd5..fe1a8bf6902d 100644
--- a/arch/riscv/include/asm/topology.h
+++ b/arch/riscv/include/asm/topology.h
@@ -4,6 +4,10 @@
 
 #include <linux/arch_topology.h>
 
+#ifdef CONFIG_NUMA
+#include <asm/numa.h>
+#endif
+
 /* Replace task scheduler's default frequency-invariant accounting */
 #define arch_scale_freq_tick		topology_scale_freq_tick
 #define arch_set_freq_scale		topology_set_freq_scale
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 4b904110d27c..297bf7157968 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
 generic-y += asm-offsets.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
+generic-y += mmzone.h
diff --git a/arch/s390/include/asm/mmzone.h b/arch/s390/include/asm/mmzone.h
deleted file mode 100644
index 73e3e7c6976c..000000000000
--- a/arch/s390/include/asm/mmzone.h
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * NUMA support for s390
- *
- * Copyright IBM Corp. 2015
- */
-
-#ifndef _ASM_S390_MMZONE_H
-#define _ASM_S390_MMZONE_H
-
-#ifdef CONFIG_NUMA
-
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid) (node_data[nid])
-
-#endif /* CONFIG_NUMA */
-#endif /* _ASM_S390_MMZONE_H */
diff --git a/arch/s390/kernel/numa.c b/arch/s390/kernel/numa.c
index 23ab9f02f278..ddc1448ea2e1 100644
--- a/arch/s390/kernel/numa.c
+++ b/arch/s390/kernel/numa.c
@@ -14,9 +14,6 @@
 #include <linux/node.h>
 #include <asm/numa.h>
 
-struct pglist_data *node_data[MAX_NUMNODES];
-EXPORT_SYMBOL(node_data);
-
 void __init numa_setup(void)
 {
 	int nid;
diff --git a/arch/sh/include/asm/mmzone.h b/arch/sh/include/asm/mmzone.h
index 7b8dead2723d..63f88b465e39 100644
--- a/arch/sh/include/asm/mmzone.h
+++ b/arch/sh/include/asm/mmzone.h
@@ -5,9 +5,6 @@
 #ifdef CONFIG_NUMA
 #include <linux/numa.h>
 
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)		(node_data[nid])
-
 static inline int pfn_to_nid(unsigned long pfn)
 {
 	int nid;
diff --git a/arch/sh/mm/numa.c b/arch/sh/mm/numa.c
index 50f0dc1744d0..9bc212b5e762 100644
--- a/arch/sh/mm/numa.c
+++ b/arch/sh/mm/numa.c
@@ -14,9 +14,6 @@
 #include <linux/pfn.h>
 #include <asm/sections.h>
 
-struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL_GPL(node_data);
-
 /*
  * On SH machines the conventional approach is to stash system RAM
  * in node 0, and other memory blocks in to node 1 and up, ordered by
diff --git a/arch/sparc/include/asm/mmzone.h b/arch/sparc/include/asm/mmzone.h
index a236d8aa893a..74eb2c71d077 100644
--- a/arch/sparc/include/asm/mmzone.h
+++ b/arch/sparc/include/asm/mmzone.h
@@ -6,10 +6,6 @@
 
 #include <linux/cpumask.h>
 
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)		(node_data[nid])
-
 extern int numa_cpu_lookup_table[];
 extern cpumask_t numa_cpumask_lookup_table[];
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 00b247d924a9..3cb698204609 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1115,11 +1115,9 @@ static void init_node_masks_nonnuma(void)
 }
 
 #ifdef CONFIG_NUMA
-struct pglist_data *node_data[MAX_NUMNODES];
 
 EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(numa_cpumask_lookup_table);
-EXPORT_SYMBOL(node_data);
 
 static int scan_pio_for_cfg_handle(struct mdesc_handle *md, u64 pio,
 				   u32 cfg_handle)
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index a192bdea69e2..6c23d1661b17 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -11,3 +11,4 @@ generated-y += xen-hypercalls.h
 
 generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
+generic-y += mmzone.h
diff --git a/arch/x86/include/asm/mmzone.h b/arch/x86/include/asm/mmzone.h
deleted file mode 100644
index c41b41edd691..000000000000
--- a/arch/x86/include/asm/mmzone.h
+++ /dev/null
@@ -1,6 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifdef CONFIG_X86_32
-# include <asm/mmzone_32.h>
-#else
-# include <asm/mmzone_64.h>
-#endif
diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
deleted file mode 100644
index 2d4515e8b7df..000000000000
--- a/arch/x86/include/asm/mmzone_32.h
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Written by Pat Gaughen (gone@us.ibm.com) Mar 2002
- *
- */
-
-#ifndef _ASM_X86_MMZONE_32_H
-#define _ASM_X86_MMZONE_32_H
-
-#include <asm/smp.h>
-
-#ifdef CONFIG_NUMA
-extern struct pglist_data *node_data[];
-#define NODE_DATA(nid)	(node_data[nid])
-#endif /* CONFIG_NUMA */
-
-#endif /* _ASM_X86_MMZONE_32_H */
diff --git a/arch/x86/include/asm/mmzone_64.h b/arch/x86/include/asm/mmzone_64.h
deleted file mode 100644
index 0c585046f744..000000000000
--- a/arch/x86/include/asm/mmzone_64.h
+++ /dev/null
@@ -1,18 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* K8 NUMA support */
-/* Copyright 2002,2003 by Andi Kleen, SuSE Labs */
-/* 2.5 Version loosely based on the NUMAQ Code by Pat Gaughen. */
-#ifndef _ASM_X86_MMZONE_64_H
-#define _ASM_X86_MMZONE_64_H
-
-#ifdef CONFIG_NUMA
-
-#include <linux/mmdebug.h>
-#include <asm/smp.h>
-
-extern struct pglist_data *node_data[];
-
-#define NODE_DATA(nid)		(node_data[nid])
-
-#endif
-#endif /* _ASM_X86_MMZONE_64_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 6ce10e3c6228..7de725d6bb05 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -24,9 +24,6 @@
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
 
-struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
-
 static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
 static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 5b59d133b6af..9b71ad2869f1 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -15,8 +15,6 @@
 
 #include <asm/sections.h>
 
-struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
 nodemask_t numa_nodes_parsed __initdata;
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
 
diff --git a/include/asm-generic/mmzone.h b/include/asm-generic/mmzone.h
new file mode 100644
index 000000000000..2ab5193e8394
--- /dev/null
+++ b/include/asm-generic/mmzone.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_MMZONE_H
+#define _ASM_GENERIC_MMZONE_H
+
+#endif
diff --git a/include/linux/numa.h b/include/linux/numa.h
index eb19503604fe..e5841d4057ab 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -30,6 +30,9 @@ static inline bool numa_valid_node(int nid)
 #ifdef CONFIG_NUMA
 #include <asm/sparsemem.h>
 
+extern struct pglist_data *node_data[];
+#define NODE_DATA(nid)	(node_data[nid])
+
 /* Generic implementation available */
 int numa_nearest_node(int node, unsigned int state);
 
diff --git a/mm/numa.c b/mm/numa.c
index 67ca6b8585c0..8c157d41c026 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -3,6 +3,9 @@
 #include <linux/printk.h>
 #include <linux/numa.h>
 
+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
+
 /* Stub functions: */
 
 #ifndef memory_add_physaddr_to_nid
-- 
2.43.0



* [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (3 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 04/17] arch, mm: move definition of node_data to generic code Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-17 14:42   ` David Hildenbrand
  2024-07-19 16:11   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 06/17] x86/numa: simplify numa_distance allocation Mike Rapoport
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Architectures that support NUMA duplicate the code that allocates
NODE_DATA on the node-local memory with slight variations in reporting
of the addresses where the memory was allocated.

Use the x86 version as the basis for the generic alloc_node_data() function
and call this function in architecture-specific NUMA initialization.
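
For readers who want to see the shared pattern outside the kernel, here is
a minimal userspace sketch of "try node-local memory, then any node, and
report where the data landed". It is not the kernel code: every model_*
name, the per-node byte budget and the use of malloc() are assumptions
made for illustration, and where the generic function panics the sketch
returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_MAX_NODES 4
#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for pg_data_t. */
struct model_pgdat {
	int node_id;
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
};

/* Bytes still free on each modelled node; the caller fills this in. */
size_t model_free_bytes[MODEL_MAX_NODES];
struct model_pgdat *model_node_data[MODEL_MAX_NODES];

/* Take size bytes from node nid's budget, or return NULL if it is exhausted. */
static void *model_alloc_on(int nid, size_t size)
{
	void *p;

	if (model_free_bytes[nid] < size)
		return NULL;
	p = malloc(size);
	if (p)
		model_free_bytes[nid] -= size;
	return p;
}

/*
 * Allocate the node structure for nid.  Returns the node that actually
 * backs the allocation, or -1 when no node has room.
 */
int model_alloc_node_data(int nid)
{
	const size_t nd_size = (sizeof(struct model_pgdat) + MODEL_PAGE_SIZE - 1)
			       / MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;
	void *nd;
	int tnid = nid;
	int i;

	assert(nid >= 0 && nid < MODEL_MAX_NODES);

	/* Prefer node-local memory, then fall back to any other node. */
	nd = model_alloc_on(nid, nd_size);
	for (i = 0; !nd && i < MODEL_MAX_NODES; i++) {
		nd = model_alloc_on(i, nd_size);
		if (nd)
			tnid = i;
	}
	if (!nd)
		return -1;

	/* Publish the pointer and start from a zeroed structure. */
	model_node_data[nid] = nd;
	memset(model_node_data[nid], 0, sizeof(struct model_pgdat));
	return tnid;
}
```

A caller that gets back a node different from the one it asked for is in
the situation the "NODE_DATA(%d) on node %d" message reports.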

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/loongarch/kernel/numa.c | 18 ------------------
 arch/mips/loongson64/numa.c  | 16 ++--------------
 arch/powerpc/mm/numa.c       | 24 +++---------------------
 arch/sh/mm/init.c            |  7 +------
 arch/sparc/mm/init_64.c      |  9 ++-------
 arch/x86/mm/numa.c           | 34 +---------------------------------
 drivers/base/arch_numa.c     | 21 +--------------------
 include/linux/numa.h         |  2 ++
 mm/numa.c                    | 27 +++++++++++++++++++++++++++
 9 files changed, 39 insertions(+), 119 deletions(-)

diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c
index acada671e020..84fe7f854820 100644
--- a/arch/loongarch/kernel/numa.c
+++ b/arch/loongarch/kernel/numa.c
@@ -187,24 +187,6 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
 }
 
-static void __init alloc_node_data(int nid)
-{
-	void *nd;
-	unsigned long nd_pa;
-	size_t nd_sz = roundup(sizeof(pg_data_t), PAGE_SIZE);
-
-	nd_pa = memblock_phys_alloc_try_nid(nd_sz, SMP_CACHE_BYTES, nid);
-	if (!nd_pa) {
-		pr_err("Cannot find %zu Byte for node_data (initial node: %d)\n", nd_sz, nid);
-		return;
-	}
-
-	nd = __va(nd_pa);
-
-	node_data[nid] = nd;
-	memset(nd, 0, sizeof(pg_data_t));
-}
-
 static void __init node_mem_init(unsigned int node)
 {
 	unsigned long start_pfn, end_pfn;
diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index 9208eaadf690..909f6cec3a26 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
 
 static void __init node_mem_init(unsigned int node)
 {
-	struct pglist_data *nd;
 	unsigned long node_addrspace_offset;
 	unsigned long start_pfn, end_pfn;
-	unsigned long nd_pa;
-	int tnid;
-	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
 
 	node_addrspace_offset = nid_to_addrbase(node);
 	pr_info("Node%d's addrspace_offset is 0x%lx\n",
@@ -96,16 +92,8 @@ static void __init node_mem_init(unsigned int node)
 	pr_info("Node%d: start_pfn=0x%lx, end_pfn=0x%lx\n",
 		node, start_pfn, end_pfn);
 
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, node);
-	if (!nd_pa)
-		panic("Cannot allocate %zu bytes for node %d data\n",
-		      nd_size, node);
-	nd = __va(nd_pa);
-	memset(nd, 0, sizeof(struct pglist_data));
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != node)
-		pr_info("NODE_DATA(%d) on node %d\n", node, tnid);
-	node_data[node] = nd;
+	alloc_node_data(node);
+
 	NODE_DATA(node)->node_start_pfn = start_pfn;
 	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8c18973cd71e..4c54764af160 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1081,27 +1081,9 @@ void __init dump_numa_cpu_topology(void)
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
 	u64 spanned_pages = end_pfn - start_pfn;
-	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
-	u64 nd_pa;
-	void *nd;
-	int tnid;
-
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-	if (!nd_pa)
-		panic("Cannot allocate %zu bytes for node %d data\n",
-		      nd_size, nid);
-
-	nd = __va(nd_pa);
-
-	/* report and initialize */
-	pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
-		nd_pa, nd_pa + nd_size - 1);
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != nid)
-		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
-
-	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+
+	alloc_node_data(nid);
+
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start_pfn;
 	NODE_DATA(nid)->node_spanned_pages = spanned_pages;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index bf1b54055316..5cc89a0932c3 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -212,12 +212,7 @@ void __init allocate_pgdat(unsigned int nid)
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 
 #ifdef CONFIG_NUMA
-	NODE_DATA(nid) = memblock_alloc_try_nid(
-				sizeof(struct pglist_data),
-				SMP_CACHE_BYTES, MEMBLOCK_LOW_LIMIT,
-				MEMBLOCK_ALLOC_ACCESSIBLE, nid);
-	if (!NODE_DATA(nid))
-		panic("Can't allocate pgdat for node %d\n", nid);
+	alloc_node_data(nid);
 #endif
 
 	NODE_DATA(nid)->node_start_pfn = start_pfn;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 3cb698204609..83279c43572d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1075,14 +1075,9 @@ static void __init allocate_node_data(int nid)
 {
 	struct pglist_data *p;
 	unsigned long start_pfn, end_pfn;
-#ifdef CONFIG_NUMA
 
-	NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data),
-					     SMP_CACHE_BYTES, nid);
-	if (!NODE_DATA(nid)) {
-		prom_printf("Cannot allocate pglist_data for nid[%d]\n", nid);
-		prom_halt();
-	}
+#ifdef CONFIG_NUMA
+	alloc_node_data(nid);
 
 	NODE_DATA(nid)->node_id = nid;
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 7de725d6bb05..5e1dde26674b 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -191,39 +191,6 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
 }
 
-/* Allocate NODE_DATA for a node on the local memory */
-static void __init alloc_node_data(int nid)
-{
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
-	u64 nd_pa;
-	void *nd;
-	int tnid;
-
-	/*
-	 * Allocate node data.  Try node-local memory and then any node.
-	 * Never allocate in DMA zone.
-	 */
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-	if (!nd_pa) {
-		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
-		       nd_size, nid);
-		return;
-	}
-	nd = __va(nd_pa);
-
-	/* report and initialize */
-	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
-	       nd_pa, nd_pa + nd_size - 1);
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != nid)
-		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
-
-	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
-}
-
 /**
  * numa_cleanup_meminfo - Cleanup a numa_meminfo
  * @mi: numa_meminfo to clean up
@@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 9b71ad2869f1..2ebf12eab99f 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -216,30 +216,11 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
  */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
-	u64 nd_pa;
-	void *nd;
-	int tnid;
-
 	if (start_pfn >= end_pfn)
 		pr_info("Initmem setup node %d [<memory-less node>]\n", nid);
 
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-	if (!nd_pa)
-		panic("Cannot allocate %zu bytes for node %d data\n",
-		      nd_size, nid);
-
-	nd = __va(nd_pa);
-
-	/* report and initialize */
-	pr_info("NODE_DATA [mem %#010Lx-%#010Lx]\n",
-		nd_pa, nd_pa + nd_size - 1);
-	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-	if (tnid != nid)
-		pr_info("NODE_DATA(%d) on node %d\n", nid, tnid);
+	alloc_node_data(nid);
 
-	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start_pfn;
 	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
diff --git a/include/linux/numa.h b/include/linux/numa.h
index e5841d4057ab..3b12d8ca0afd 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -33,6 +33,8 @@ static inline bool numa_valid_node(int nid)
 extern struct pglist_data *node_data[];
 #define NODE_DATA(nid)	(node_data[nid])
 
+void __init alloc_node_data(int nid);
+
 /* Generic implementation available */
 int numa_nearest_node(int node, unsigned int state);
 
diff --git a/mm/numa.c b/mm/numa.c
index 8c157d41c026..0483cabc4c4b 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -1,11 +1,38 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
+#include <linux/memblock.h>
 #include <linux/printk.h>
 #include <linux/numa.h>
 
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
 
+/* Allocate NODE_DATA for a node on the local memory */
+void __init alloc_node_data(int nid)
+{
+	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	u64 nd_pa;
+	void *nd;
+	int tnid;
+
+	/* Allocate node data.  Try node-local memory and then any node. */
+	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	if (!nd_pa)
+		panic("Cannot allocate %zu bytes for node %d data\n",
+		      nd_size, nid);
+	nd = __va(nd_pa);
+
+	/* report and initialize */
+	pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
+		nd_pa, nd_pa + nd_size - 1);
+	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+	if (tnid != nid)
+		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+
+	node_data[nid] = nd;
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+}
+
 /* Stub functions: */
 
 #ifndef memory_add_physaddr_to_nid
-- 
2.43.0



* [PATCH 06/17] x86/numa: simplify numa_distance allocation
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (4 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA " Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:28   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Allocation of numa_distance uses memblock_phys_alloc_range() to limit
allocation to be below the last mapped page.

But NUMA initialization runs after the direct map is populated and
there is also code in setup_arch() that adjusts memblock limit to
reflect how much memory is already mapped in the direct map.

Simplify the allocation of numa_distance and use plain memblock_alloc().
This makes the code clearer and ensures that when numa_distance is not
allocated it is always NULL.
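
The invariant this gives can be sketched in userspace as follows. This is
an illustration only: the model_* names are hypothetical, malloc()/free()
stand in for memblock_alloc()/memblock_free(), and the values 10 and 20
are assumed defaults for local and remote distance.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_LOCAL_DISTANCE	10
#define MODEL_REMOTE_DISTANCE	20

/* NULL means "no table"; there is no sentinel value for a failed attempt. */
uint8_t *model_distance;
int model_distance_cnt;

void model_reset_distance(void)
{
	/* Testing the pointer is enough because failure leaves it NULL. */
	if (model_distance)
		free(model_distance);
	model_distance_cnt = 0;
	model_distance = NULL;
}

int model_alloc_distance(int cnt)
{
	size_t size;
	int i, j;

	assert(model_distance == NULL);
	if (cnt <= 0)
		return -1;

	size = (size_t)cnt * cnt * sizeof(model_distance[0]);
	model_distance = malloc(size);
	if (!model_distance)
		return -1;	/* pointer stays NULL, count stays 0 */

	model_distance_cnt = cnt;

	/* Fill with the default distances. */
	for (i = 0; i < cnt; i++)
		for (j = 0; j < cnt; j++)
			model_distance[i * cnt + j] = i == j ?
				MODEL_LOCAL_DISTANCE : MODEL_REMOTE_DISTANCE;
	return 0;
}
```

With the pointer as the only state, the reset path no longer has to
consult the count to decide whether there is something to free.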

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/mm/numa.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5e1dde26674b..ab2d4ecef786 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -319,8 +319,7 @@ void __init numa_reset_distance(void)
 {
 	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
 
-	/* numa_distance could be 1LU marking allocation failure, test cnt */
-	if (numa_distance_cnt)
+	if (numa_distance)
 		memblock_free(numa_distance, size);
 	numa_distance_cnt = 0;
 	numa_distance = NULL;	/* enable table creation */
@@ -331,7 +330,6 @@ static int __init numa_alloc_distance(void)
 	nodemask_t nodes_parsed;
 	size_t size;
 	int i, j, cnt = 0;
-	u64 phys;
 
 	/* size the new table and allocate it */
 	nodes_parsed = numa_nodes_parsed;
@@ -342,16 +340,12 @@ static int __init numa_alloc_distance(void)
 	cnt++;
 	size = cnt * cnt * sizeof(numa_distance[0]);
 
-	phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0,
-					 PFN_PHYS(max_pfn_mapped));
-	if (!phys) {
+	numa_distance = memblock_alloc(size, PAGE_SIZE);
+	if (!numa_distance) {
 		pr_warn("Warning: can't allocate distance table!\n");
-		/* don't retry until explicitly reset */
-		numa_distance = (void *)1LU;
 		return -ENOMEM;
 	}
 
-	numa_distance = __va(phys);
 	numa_distance_cnt = cnt;
 
 	/* fill with the default distances */
-- 
2.43.0



* [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (5 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 06/17] x86/numa: simplify numa_distance allocation Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:30   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are
used only by the NUMA emulation code, so make them local to
arch/x86/mm/numa_emulation.c.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/include/asm/numa.h  | 2 --
 arch/x86/mm/numa_emulation.c | 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index ef2844d69173..2dab1ada96cf 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -71,8 +71,6 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 #endif
 
 #ifdef CONFIG_NUMA_EMU
-#define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
-#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
 int numa_emu_cmdline(char *str);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 9a9305367fdd..1ce22e315b80 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -10,6 +10,9 @@
 
 #include "numa_internal.h"
 
+#define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
+#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
+
 static int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
 
-- 
2.43.0



* [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (6 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:38   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

By the time numa_emulation() is called, all physical memory is already
mapped in the direct map and there is no need to define limits for
memblock allocation.

Replace memblock_phys_alloc_range() with memblock_alloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/mm/numa_emulation.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 1ce22e315b80..439804e21962 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -448,15 +448,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 
 	/* copy the physical distance table */
 	if (numa_dist_cnt) {
-		u64 phys;
-
-		phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0,
-						 PFN_PHYS(max_pfn_mapped));
-		if (!phys) {
+		phys_dist = memblock_alloc(phys_size, PAGE_SIZE);
+		if (!phys_dist) {
 			pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n");
 			goto no_emu;
 		}
-		phys_dist = __va(phys);
 
 		for (i = 0; i < numa_dist_cnt; i++)
 			for (j = 0; j < numa_dist_cnt; j++)
-- 
2.43.0



* [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (7 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:47   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

This is required to make the NUMA emulation code architecture independent
so that it can be moved to generic code in the following commits.
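
The reverse mapping done by the helper can be tried out in userspace with
the sketch below. It mirrors the loop in the patch, but the function name,
the MODEL_NO_NODE constant and the explicit table-length parameter are
assumptions made so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NO_NODE	(-1)

/*
 * Rewrite a table of physical node ids in place so that each entry holds
 * the first emulated node id backed by that physical node.  Entries with
 * no matching emulated node fall back to node 0.
 */
void model_update_cpu_to_node(int *id_to_node, size_t nr_ids,
			      const int *emu_nid_to_phys,
			      unsigned int nr_emu_nids)
{
	size_t i;
	unsigned int j;

	assert(id_to_node && emu_nid_to_phys);

	for (i = 0; i < nr_ids; i++) {
		if (id_to_node[i] == MODEL_NO_NODE)
			continue;
		for (j = 0; j < nr_emu_nids; j++)
			if (id_to_node[i] == emu_nid_to_phys[j])
				break;
		id_to_node[i] = j < nr_emu_nids ? (int)j : 0;
	}
}
```

For example, with emulated nodes 0 and 1 backed by physical node 0 and
emulated nodes 2 and 3 backed by physical node 1, an entry that pointed at
physical node 1 ends up pointing at emulated node 2.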

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/include/asm/numa.h  |  2 ++
 arch/x86/mm/numa.c           | 22 ++++++++++++++++++++++
 arch/x86/mm/numa_emulation.c | 14 +-------------
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 2dab1ada96cf..7017d540894a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -72,6 +72,8 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 
 #ifdef CONFIG_NUMA_EMU
 int numa_emu_cmdline(char *str);
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ab2d4ecef786..1320d776caed 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -852,6 +852,28 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
+#ifdef CONFIG_NUMA_EMU
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids)
+{
+	int i, j;
+
+	/*
+	 * Transform __apicid_to_node table to use emulated nids by
+	 * reverse-mapping phys_nid.  The maps should always exist but fall
+	 * back to zero just in case.
+	 */
+	for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
+		if (__apicid_to_node[i] == NUMA_NO_NODE)
+			continue;
+		for (j = 0; j < nr_emu_nids; j++)
+			if (__apicid_to_node[i] == emu_nid_to_phys[j])
+				break;
+		__apicid_to_node[i] = j < nr_emu_nids ? j : 0;
+	}
+}
+#endif /* CONFIG_NUMA_EMU */
+
 #ifdef CONFIG_NUMA_KEEP_MEMINFO
 static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
 {
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 439804e21962..f2746e52ab93 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -476,19 +476,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		    ei.blk[i].nid != NUMA_NO_NODE)
 			node_set(ei.blk[i].nid, numa_nodes_parsed);
 
-	/*
-	 * Transform __apicid_to_node table to use emulated nids by
-	 * reverse-mapping phys_nid.  The maps should always exist but fall
-	 * back to zero just in case.
-	 */
-	for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
-		if (__apicid_to_node[i] == NUMA_NO_NODE)
-			continue;
-		for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
-			if (__apicid_to_node[i] == emu_nid_to_phys[j])
-				break;
-		__apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
-	}
+	numa_emu_update_cpu_to_node(emu_nid_to_phys, ARRAY_SIZE(emu_nid_to_phys));
 
 	/* make sure all emulated nodes are mapped to a physical node */
 	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
-- 
2.43.0



* [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (8 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:50   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 11/17] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

This is required to make the NUMA emulation code architecture independent
so that it can be moved to generic code in the following commits.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/include/asm/numa.h  | 1 +
 arch/x86/mm/numa.c           | 5 +++++
 arch/x86/mm/numa_emulation.c | 4 ++--
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 7017d540894a..b22c85c1ef18 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -74,6 +74,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 int numa_emu_cmdline(char *str);
 void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
 					unsigned int nr_emu_nids);
+u64 __init numa_emu_dma_end(void);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1320d776caed..0a59e3ceecda 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -872,6 +872,11 @@ void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
 		__apicid_to_node[i] = j < nr_emu_nids ? j : 0;
 	}
 }
+
+u64 __init numa_emu_dma_end(void)
+{
+	return PFN_PHYS(MAX_DMA32_PFN);
+}
 #endif /* CONFIG_NUMA_EMU */
 
 #ifdef CONFIG_NUMA_KEEP_MEMINFO
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index f2746e52ab93..fb4814497446 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -128,7 +128,7 @@ static int __init split_nodes_interleave(struct numa_meminfo *ei,
 	 */
 	while (!nodes_empty(physnode_mask)) {
 		for_each_node_mask(i, physnode_mask) {
-			u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+			u64 dma32_end = numa_emu_dma_end();
 			u64 start, limit, end;
 			int phys_blk;
 
@@ -275,7 +275,7 @@ static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
 	 */
 	while (!nodes_empty(physnode_mask)) {
 		for_each_node_mask(i, physnode_mask) {
-			u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN);
+			u64 dma32_end = numa_emu_dma_end();
 			u64 start, limit, end;
 			int phys_blk;
 
-- 
2.43.0



* [PATCH 11/17] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (9 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:57   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 12/17] mm: introduce numa_memblks Mike Rapoport
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

CPU id cannot be negative.

Making it unsigned also aligns with declarations in
include/asm-generic/numa.h used by arm64 and riscv and allows sharing
numa emulation code with these architectures.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
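
A note on the type change, as a minimal sketch rather than kernel code:
with an unsigned parameter, a single upper-bound comparison rejects every
out-of-range id, including values that would have been negative as an int.
MODEL_NR_CPUS and cpu_id_valid() are hypothetical names used only for
illustration.

```c
#include <stdbool.h>

/* Hypothetical CPU limit for this sketch, not a kernel constant. */
#define MODEL_NR_CPUS 8u

/*
 * Hypothetical helper, not a kernel function. Because cpu is unsigned,
 * there is no separate "cpu < 0" branch: a negative int converted to
 * unsigned int becomes a large value and fails the same comparison.
 */
bool cpu_id_valid(unsigned int cpu)
{
	return cpu < MODEL_NR_CPUS;
}
```

With these assumptions, cpu_id_valid(7) is true, while cpu_id_valid(8) and
cpu_id_valid((unsigned int)-1) are both false.
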
 arch/x86/include/asm/numa.h  | 10 +++++-----
 arch/x86/mm/numa.c           | 10 +++++-----
 arch/x86/mm/numa_emulation.c | 10 +++++-----
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index b22c85c1ef18..6fa5ea925aac 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -54,20 +54,20 @@ static inline int numa_cpu_node(int cpu)
 extern void numa_set_node(int cpu, int node);
 extern void numa_clear_node(int cpu);
 extern void __init init_cpu_to_node(void);
-extern void numa_add_cpu(int cpu);
-extern void numa_remove_cpu(int cpu);
+extern void numa_add_cpu(unsigned int cpu);
+extern void numa_remove_cpu(unsigned int cpu);
 extern void init_gi_nodes(void);
 #else	/* CONFIG_NUMA */
 static inline void numa_set_node(int cpu, int node)	{ }
 static inline void numa_clear_node(int cpu)		{ }
 static inline void init_cpu_to_node(void)		{ }
-static inline void numa_add_cpu(int cpu)		{ }
-static inline void numa_remove_cpu(int cpu)		{ }
+static inline void numa_add_cpu(unsigned int cpu)	{ }
+static inline void numa_remove_cpu(unsigned int cpu)	{ }
 static inline void init_gi_nodes(void)			{ }
 #endif	/* CONFIG_NUMA */
 
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-void debug_cpumask_set_cpu(int cpu, int node, bool enable);
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable);
 #endif
 
 #ifdef CONFIG_NUMA_EMU
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 0a59e3ceecda..deaa4816a895 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -741,12 +741,12 @@ void __init init_cpu_to_node(void)
 #ifndef CONFIG_DEBUG_PER_CPU_MAPS
 
 # ifndef CONFIG_NUMA_EMU
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	cpumask_set_cpu(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	cpumask_clear_cpu(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]);
 }
@@ -784,7 +784,7 @@ int early_cpu_to_node(int cpu)
 	return per_cpu(x86_cpu_to_node_map, cpu);
 }
 
-void debug_cpumask_set_cpu(int cpu, int node, bool enable)
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)
 {
 	struct cpumask *mask;
 
@@ -816,12 +816,12 @@ static void numa_set_cpumask(int cpu, bool enable)
 	debug_cpumask_set_cpu(cpu, early_cpu_to_node(cpu), enable);
 }
 
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, true);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, false);
 }
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index fb4814497446..235f8a4eb2fa 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -514,7 +514,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 }
 
 #ifndef CONFIG_DEBUG_PER_CPU_MAPS
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	int physnid, nid;
 
@@ -532,7 +532,7 @@ void numa_add_cpu(int cpu)
 			cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	int i;
 
@@ -540,7 +540,7 @@ void numa_remove_cpu(int cpu)
 		cpumask_clear_cpu(cpu, node_to_cpumask_map[i]);
 }
 #else	/* !CONFIG_DEBUG_PER_CPU_MAPS */
-static void numa_set_cpumask(int cpu, bool enable)
+static void numa_set_cpumask(unsigned int cpu, bool enable)
 {
 	int nid, physnid;
 
@@ -560,12 +560,12 @@ static void numa_set_cpumask(int cpu, bool enable)
 	}
 }
 
-void numa_add_cpu(int cpu)
+void numa_add_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, true);
 }
 
-void numa_remove_cpu(int cpu)
+void numa_remove_cpu(unsigned int cpu)
 {
 	numa_set_cpumask(cpu, false);
 }
-- 
2.43.0



* [PATCH 12/17] mm: introduce numa_memblks
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (10 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 11/17] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 18:16   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move code dealing with numa_memblks from arch/x86 to mm/ and add a Kconfig
option, NUMA_MEMBLKS, that x86 selects when NUMA is enabled.

This code will later be reused by arch_numa.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
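
For readers unfamiliar with numa_memblks, the userspace sketch below
models the two core operations being moved: appending a block with the
same checks as numa_add_memblk_to(), and the merge pass of
numa_cleanup_meminfo(). It is a simplified illustration under stated
assumptions, not the kernel code: all model_* and MODEL_* names are
hypothetical, and the trimming against memblock and max_pfn as well as
the log messages are left out.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_NODES	4			/* hypothetical stand-in for MAX_NUMNODES */
#define MODEL_NR_BLKS	(MODEL_MAX_NODES * 2)	/* mirrors NR_NODE_MEMBLKS */

struct model_memblk {
	uint64_t start;
	uint64_t end;
	int nid;
};

struct model_meminfo {
	int nr_blks;
	struct model_memblk blk[MODEL_NR_BLKS];
};

/* Append a block: ignore empty and invalid ranges, fail when the table is full. */
int model_add_memblk(struct model_meminfo *mi, int nid, uint64_t start, uint64_t end)
{
	if (start == end)
		return 0;
	if (start > end || nid < 0 || nid >= MODEL_MAX_NODES)
		return 0;
	if (mi->nr_blks >= MODEL_NR_BLKS)
		return -1;

	mi->blk[mi->nr_blks].start = start;
	mi->blk[mi->nr_blks].end = end;
	mi->blk[mi->nr_blks].nid = nid;
	mi->nr_blks++;
	return 0;
}

/* Remove the idx'th block by shifting the tail of the array down. */
void model_remove_memblk(struct model_meminfo *mi, int idx)
{
	mi->nr_blks--;
	memmove(&mi->blk[idx], &mi->blk[idx + 1],
		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
}

/*
 * Merge blocks of the same node unless the joined span would cover
 * memory of another node. Returns -1 if blocks of different nodes overlap.
 */
int model_merge(struct model_meminfo *mi)
{
	for (int i = 0; i < mi->nr_blks; i++) {
		struct model_memblk *bi = &mi->blk[i];

		for (int j = i + 1; j < mi->nr_blks; j++) {
			struct model_memblk *bj = &mi->blk[j];
			uint64_t start, end;
			int k;

			if (bi->nid != bj->nid) {
				if (bi->end > bj->start && bi->start < bj->end)
					return -1;
				continue;
			}

			start = bi->start < bj->start ? bi->start : bj->start;
			end = bi->end > bj->end ? bi->end : bj->end;

			/* Does any block of another node fall inside [start, end)? */
			for (k = 0; k < mi->nr_blks; k++) {
				struct model_memblk *bk = &mi->blk[k];

				if (bk->nid != bi->nid &&
				    start < bk->end && end > bk->start)
					break;
			}
			if (k < mi->nr_blks)
				continue;

			bi->start = start;
			bi->end = end;
			model_remove_memblk(mi, j--);
		}
	}
	return 0;
}
```

For example, adding node 0 blocks [0x0, 0x1000) and [0x1000, 0x2000)
around a node 1 block [0x2000, 0x3000) merges the two node 0 blocks into
one, while a further node 0 block at [0x4000, 0x5000) stays separate
because joining it would span node 1 memory.
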
 arch/x86/Kconfig             |   1 +
 arch/x86/include/asm/numa.h  |   3 -
 arch/x86/mm/amdtopology.c    |   1 +
 arch/x86/mm/numa.c           | 372 +--------------------------------
 arch/x86/mm/numa_emulation.c |   1 +
 arch/x86/mm/numa_internal.h  |  15 +-
 drivers/acpi/numa/srat.c     |   1 +
 drivers/of/of_numa.c         |   1 +
 include/linux/numa_memblks.h |  35 ++++
 mm/Kconfig                   |   3 +
 mm/Makefile                  |   1 +
 mm/numa_memblks.c            | 385 +++++++++++++++++++++++++++++++++++
 12 files changed, 436 insertions(+), 383 deletions(-)
 create mode 100644 include/linux/numa_memblks.h
 create mode 100644 mm/numa_memblks.c
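
The gap-filling done by numa_fill_memblks(), which moves along with the
rest of the code, can be sketched in userspace as follows. This is an
illustration under assumptions, not the kernel implementation: struct
fill_blk, model_fill_blks() and the fixed limit of 16 blocks are
hypothetical, qsort() stands in for the kernel's sort(), and -1 stands in
for NUMA_NO_MEMBLK.

```c
#include <stdint.h>
#include <stdlib.h>

#define FILL_MAX_BLKS 16	/* hypothetical limit for this sketch */

struct fill_blk {
	uint64_t start;
	uint64_t end;
};

/* Order pointers to blocks by ascending start address. */
static int cmp_blk_ptr(const void *a, const void *b)
{
	const struct fill_blk *ma = *(const struct fill_blk *const *)a;
	const struct fill_blk *mb = *(const struct fill_blk *const *)b;

	return (ma->start > mb->start) - (ma->start < mb->start);
}

/*
 * Extend the blocks overlapping [start, end) so that together they cover
 * the whole range. Returns 0 on success, -1 if no block overlaps the range
 * or if nr exceeds the sketch's limit.
 */
int model_fill_blks(struct fill_blk *blks, int nr, uint64_t start, uint64_t end)
{
	struct fill_blk *list[FILL_MAX_BLKS];
	uint64_t prev_end;
	int count = 0;

	if (nr > FILL_MAX_BLKS)
		return -1;

	/* Collect pointers to the blocks that overlap the range. */
	for (int i = 0; i < nr; i++)
		if (blks[i].start < end && start < blks[i].end)
			list[count++] = &blks[i];
	if (!count)
		return -1;

	qsort(list, count, sizeof(list[0]), cmp_blk_ptr);

	/* Make sure the first and last blocks include start and end. */
	if (list[0]->start > start)
		list[0]->start = start;
	if (list[count - 1]->end < end)
		list[count - 1]->end = end;

	/* Pull each block's start back to the previous end where a gap exists. */
	prev_end = list[0]->end;
	for (int i = 1; i < count; i++) {
		struct fill_blk *curr = list[i];

		if (prev_end >= curr->start) {
			if (prev_end < curr->end)
				prev_end = curr->end;
		} else {
			curr->start = prev_end;
			prev_end = curr->end;
		}
	}
	return 0;
}
```

Given blocks [0x3000, 0x4000) and [0x1000, 0x2000), filling the range
[0x800, 0x5000) turns them into [0x2000, 0x5000) and [0x800, 0x2000);
blocks outside the range are left untouched.
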

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d7122a1883e..d8084f37157c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -295,6 +295,7 @@ config X86
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
 	select NEED_SG_DMA_LENGTH
+	select NUMA_MEMBLKS			if NUMA
 	select PCI_DOMAINS			if PCI
 	select PCI_LOCKLESS_CONFIG		if PCI
 	select PERF_EVENTS
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 6fa5ea925aac..6e9a50bf03d4 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -10,8 +10,6 @@
 
 #ifdef CONFIG_NUMA
 
-#define NR_NODE_MEMBLKS		(MAX_NUMNODES*2)
-
 extern int numa_off;
 
 /*
@@ -25,7 +23,6 @@ extern int numa_off;
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
 
-extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
 
 static inline void set_apicid_to_node(int apicid, s16 node)
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index 9332b36a1091..628833afee37 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/io.h>
 #include <linux/pci_ids.h>
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index deaa4816a895..8bc0b34c6ea2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/topology.h>
 #include <linux/sort.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/e820/api.h>
 #include <asm/proto.h>
@@ -22,10 +23,6 @@
 #include "numa_internal.h"
 
 int numa_off;
-nodemask_t numa_nodes_parsed __initdata;
-
-static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
@@ -121,194 +118,6 @@ void __init setup_node_to_cpumask_map(void)
 	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
-static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
-				     struct numa_meminfo *mi)
-{
-	/* ignore zero length blks */
-	if (start == end)
-		return 0;
-
-	/* whine about and ignore invalid blks */
-	if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
-		pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
-			nid, start, end - 1);
-		return 0;
-	}
-
-	if (mi->nr_blks >= NR_NODE_MEMBLKS) {
-		pr_err("too many memblk ranges\n");
-		return -EINVAL;
-	}
-
-	mi->blk[mi->nr_blks].start = start;
-	mi->blk[mi->nr_blks].end = end;
-	mi->blk[mi->nr_blks].nid = nid;
-	mi->nr_blks++;
-	return 0;
-}
-
-/**
- * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
- * @idx: Index of memblk to remove
- * @mi: numa_meminfo to remove memblk from
- *
- * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
- * decrementing @mi->nr_blks.
- */
-void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
-{
-	mi->nr_blks--;
-	memmove(&mi->blk[idx], &mi->blk[idx + 1],
-		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
-}
-
-/**
- * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
- * @dst: numa_meminfo to append block to
- * @idx: Index of memblk to remove
- * @src: numa_meminfo to remove memblk from
- */
-static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
-					 struct numa_meminfo *src)
-{
-	dst->blk[dst->nr_blks++] = src->blk[idx];
-	numa_remove_memblk_from(idx, src);
-}
-
-/**
- * numa_add_memblk - Add one numa_memblk to numa_meminfo
- * @nid: NUMA node ID of the new memblk
- * @start: Start address of the new memblk
- * @end: End address of the new memblk
- *
- * Add a new memblk to the default numa_meminfo.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_add_memblk(int nid, u64 start, u64 end)
-{
-	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
-}
-
-/**
- * numa_cleanup_meminfo - Cleanup a numa_meminfo
- * @mi: numa_meminfo to clean up
- *
- * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
- * conflicts and clear unused memblks.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
-{
-	const u64 low = 0;
-	const u64 high = PFN_PHYS(max_pfn);
-	int i, j, k;
-
-	/* first, trim all entries */
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *bi = &mi->blk[i];
-
-		/* move / save reserved memory ranges */
-		if (!memblock_overlaps_region(&memblock.memory,
-					bi->start, bi->end - bi->start)) {
-			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
-			continue;
-		}
-
-		/* make sure all non-reserved blocks are inside the limits */
-		bi->start = max(bi->start, low);
-
-		/* preserve info for non-RAM areas above 'max_pfn': */
-		if (bi->end > high) {
-			numa_add_memblk_to(bi->nid, high, bi->end,
-					   &numa_reserved_meminfo);
-			bi->end = high;
-		}
-
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
-			numa_remove_memblk_from(i--, mi);
-	}
-
-	/* merge neighboring / overlapping entries */
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *bi = &mi->blk[i];
-
-		for (j = i + 1; j < mi->nr_blks; j++) {
-			struct numa_memblk *bj = &mi->blk[j];
-			u64 start, end;
-
-			/*
-			 * See whether there are overlapping blocks.  Whine
-			 * about but allow overlaps of the same nid.  They
-			 * will be merged below.
-			 */
-			if (bi->end > bj->start && bi->start < bj->end) {
-				if (bi->nid != bj->nid) {
-					pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
-					       bi->nid, bi->start, bi->end - 1,
-					       bj->nid, bj->start, bj->end - 1);
-					return -EINVAL;
-				}
-				pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
-					bi->nid, bi->start, bi->end - 1,
-					bj->start, bj->end - 1);
-			}
-
-			/*
-			 * Join together blocks on the same node, holes
-			 * between which don't overlap with memory on other
-			 * nodes.
-			 */
-			if (bi->nid != bj->nid)
-				continue;
-			start = min(bi->start, bj->start);
-			end = max(bi->end, bj->end);
-			for (k = 0; k < mi->nr_blks; k++) {
-				struct numa_memblk *bk = &mi->blk[k];
-
-				if (bi->nid == bk->nid)
-					continue;
-				if (start < bk->end && end > bk->start)
-					break;
-			}
-			if (k < mi->nr_blks)
-				continue;
-			printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
-			       bi->nid, bi->start, bi->end - 1, bj->start,
-			       bj->end - 1, start, end - 1);
-			bi->start = start;
-			bi->end = end;
-			numa_remove_memblk_from(j--, mi);
-		}
-	}
-
-	/* clear unused ones */
-	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
-		mi->blk[i].start = mi->blk[i].end = 0;
-		mi->blk[i].nid = NUMA_NO_NODE;
-	}
-
-	return 0;
-}
-
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
-					      const struct numa_meminfo *mi)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
-		if (mi->blk[i].start != mi->blk[i].end &&
-		    mi->blk[i].nid != NUMA_NO_NODE)
-			node_set(mi->blk[i].nid, *nodemask);
-}
-
 /**
  * numa_reset_distance - Reset NUMA distance table
  *
@@ -407,111 +216,13 @@ int __node_distance(int from, int to)
 }
 EXPORT_SYMBOL(__node_distance);
 
-/*
- * Mark all currently memblock-reserved physical memory (which covers the
- * kernel's own memory ranges) as hot-unswappable.
- */
-static void __init numa_clear_kernel_node_hotplug(void)
-{
-	nodemask_t reserved_nodemask = NODE_MASK_NONE;
-	struct memblock_region *mb_region;
-	int i;
-
-	/*
-	 * We have to do some preprocessing of memblock regions, to
-	 * make them suitable for reservation.
-	 *
-	 * At this time, all memory regions reserved by memblock are
-	 * used by the kernel, but those regions are not split up
-	 * along node boundaries yet, and don't necessarily have their
-	 * node ID set yet either.
-	 *
-	 * So iterate over all memory known to the x86 architecture,
-	 * and use those ranges to set the nid in memblock.reserved.
-	 * This will split up the memblock regions along node
-	 * boundaries and will set the node IDs as well.
-	 */
-	for (i = 0; i < numa_meminfo.nr_blks; i++) {
-		struct numa_memblk *mb = numa_meminfo.blk + i;
-		int ret;
-
-		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
-		WARN_ON_ONCE(ret);
-	}
-
-	/*
-	 * Now go over all reserved memblock regions, to construct a
-	 * node mask of all kernel reserved memory areas.
-	 *
-	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
-	 *   numa_meminfo might not include all memblock.reserved
-	 *   memory ranges, because quirks such as trim_snb_memory()
-	 *   reserve specific pages for Sandy Bridge graphics. ]
-	 */
-	for_each_reserved_mem_region(mb_region) {
-		int nid = memblock_get_region_node(mb_region);
-
-		if (nid != NUMA_NO_NODE)
-			node_set(nid, reserved_nodemask);
-	}
-
-	/*
-	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
-	 * belonging to the reserved node mask.
-	 *
-	 * Note that this will include memory regions that reside
-	 * on nodes that contain kernel memory - entire nodes
-	 * become hot-unpluggable:
-	 */
-	for (i = 0; i < numa_meminfo.nr_blks; i++) {
-		struct numa_memblk *mb = numa_meminfo.blk + i;
-
-		if (!node_isset(mb->nid, reserved_nodemask))
-			continue;
-
-		memblock_clear_hotplug(mb->start, mb->end - mb->start);
-	}
-}
-
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
-	int i, nid;
+	int i, nid, err;
 
-	/* Account for nodes with cpus and no memory */
-	node_possible_map = numa_nodes_parsed;
-	numa_nodemask_from_meminfo(&node_possible_map, mi);
-	if (WARN_ON(nodes_empty(node_possible_map)))
-		return -EINVAL;
-
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start,
-				  &memblock.memory, mb->nid);
-	}
-
-	/*
-	 * At very early time, the kernel have to use some memory such as
-	 * loading the kernel image. We cannot prevent this anyway. So any
-	 * node the kernel resides in should be un-hotpluggable.
-	 *
-	 * And when we come here, alloc node data won't fail.
-	 */
-	numa_clear_kernel_node_hotplug();
-
-	/*
-	 * If sections array is gonna be used for pfn -> nid mapping, check
-	 * whether its granularity is fine enough.
-	 */
-	if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) {
-		unsigned long pfn_align = node_map_pfn_alignment();
-
-		if (pfn_align && pfn_align < PAGES_PER_SECTION) {
-			pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
-				PFN_PHYS(pfn_align) >> 20,
-				PFN_PHYS(PAGES_PER_SECTION) >> 20);
-			return -EINVAL;
-		}
-	}
+	err = numa_register_meminfo(mi);
+	if (err)
+		return err;
 
 	if (!memblock_validate_numa_coverage(SZ_1M))
 		return -EINVAL;
@@ -916,76 +627,3 @@ int memory_add_physaddr_to_nid(u64 start)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 
 #endif
-
-static int __init cmp_memblk(const void *a, const void *b)
-{
-	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
-	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
-
-	return (ma->start > mb->start) - (ma->start < mb->start);
-}
-
-static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
-
-/**
- * numa_fill_memblks - Fill gaps in numa_meminfo memblks
- * @start: address to begin fill
- * @end: address to end fill
- *
- * Find and extend numa_meminfo memblks to cover the physical
- * address range @start-@end
- *
- * RETURNS:
- * 0		  : Success
- * NUMA_NO_MEMBLK : No memblks exist in address range @start-@end
- */
-
-int __init numa_fill_memblks(u64 start, u64 end)
-{
-	struct numa_memblk **blk = &numa_memblk_list[0];
-	struct numa_meminfo *mi = &numa_meminfo;
-	int count = 0;
-	u64 prev_end;
-
-	/*
-	 * Create a list of pointers to numa_meminfo memblks that
-	 * overlap start, end. The list is used to make in-place
-	 * changes that fill out the numa_meminfo memblks.
-	 */
-	for (int i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *bi = &mi->blk[i];
-
-		if (memblock_addrs_overlap(start, end - start, bi->start,
-					   bi->end - bi->start)) {
-			blk[count] = &mi->blk[i];
-			count++;
-		}
-	}
-	if (!count)
-		return NUMA_NO_MEMBLK;
-
-	/* Sort the list of pointers in memblk->start order */
-	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
-
-	/* Make sure the first/last memblks include start/end */
-	blk[0]->start = min(blk[0]->start, start);
-	blk[count - 1]->end = max(blk[count - 1]->end, end);
-
-	/*
-	 * Fill any gaps by tracking the previous memblks
-	 * end address and backfilling to it if needed.
-	 */
-	prev_end = blk[0]->end;
-	for (int i = 1; i < count; i++) {
-		struct numa_memblk *curr = blk[i];
-
-		if (prev_end >= curr->start) {
-			if (prev_end < curr->end)
-				prev_end = curr->end;
-		} else {
-			curr->start = prev_end;
-			prev_end = curr->end;
-		}
-	}
-	return 0;
-}
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 235f8a4eb2fa..33610026b7a3 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -6,6 +6,7 @@
 #include <linux/errno.h>
 #include <linux/topology.h>
 #include <linux/memblock.h>
+#include <linux/numa_memblks.h>
 #include <asm/dma.h>
 
 #include "numa_internal.h"
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index 86860f279662..a51229a2f5af 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -5,23 +5,12 @@
 #include <linux/types.h>
 #include <asm/numa.h>
 
-struct numa_memblk {
-	u64			start;
-	u64			end;
-	int			nid;
-};
-
-struct numa_meminfo {
-	int			nr_blks;
-	struct numa_memblk	blk[NR_NODE_MEMBLKS];
-};
-
-void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
-int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
 void __init numa_reset_distance(void);
 
 void __init x86_numa_init(void);
 
+struct numa_meminfo;
+
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,
 			   int numa_dist_cnt);
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index e3f26e71637a..6f2983cbe553 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -17,6 +17,7 @@
 #include <linux/numa.h>
 #include <linux/nodemask.h>
 #include <linux/topology.h>
+#include <linux/numa_memblks.h>
 
 static nodemask_t nodes_found_map = NODE_MASK_NONE;
 
diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
index 5949829a1b00..838747e319a2 100644
--- a/drivers/of/of_numa.c
+++ b/drivers/of/of_numa.c
@@ -10,6 +10,7 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <linux/nodemask.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/numa.h>
 
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
new file mode 100644
index 000000000000..6981cf97d2c9
--- /dev/null
+++ b/include/linux/numa_memblks.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NUMA_MEMBLKS_H
+#define __NUMA_MEMBLKS_H
+
+#ifdef CONFIG_NUMA_MEMBLKS
+#include <linux/types.h>
+
+#define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
+
+struct numa_memblk {
+	u64			start;
+	u64			end;
+	int			nid;
+};
+
+struct numa_meminfo {
+	int			nr_blks;
+	struct numa_memblk	blk[NR_NODE_MEMBLKS];
+};
+
+extern struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+extern struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
+
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+int __init numa_register_meminfo(struct numa_meminfo *mi);
+
+void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+				       const struct numa_meminfo *mi);
+
+#endif /* CONFIG_NUMA_MEMBLKS */
+
+#endif	/* __NUMA_MEMBLKS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..15c6efbaa1df 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1249,6 +1249,9 @@ config IOMMU_MM_DATA
 config EXECMEM
 	bool
 
+config NUMA_MEMBLKS
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 773b3b267438..17bc4013a2c5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -140,3 +140,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_NUMA_MEMBLKS) += numa_memblks.o
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
new file mode 100644
index 000000000000..e31307317ca7
--- /dev/null
+++ b/mm/numa_memblks.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/array_size.h>
+#include <linux/sort.h>
+#include <linux/printk.h>
+#include <linux/memblock.h>
+#include <linux/numa.h>
+#include <linux/numa_memblks.h>
+
+nodemask_t numa_nodes_parsed __initdata;
+
+struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+				     struct numa_meminfo *mi)
+{
+	/* ignore zero length blks */
+	if (start == end)
+		return 0;
+
+	/* whine about and ignore invalid blks */
+	if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
+		pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
+			nid, start, end - 1);
+		return 0;
+	}
+
+	if (mi->nr_blks >= NR_NODE_MEMBLKS) {
+		pr_err("too many memblk ranges\n");
+		return -EINVAL;
+	}
+
+	mi->blk[mi->nr_blks].start = start;
+	mi->blk[mi->nr_blks].end = end;
+	mi->blk[mi->nr_blks].nid = nid;
+	mi->nr_blks++;
+	return 0;
+}
+
+/**
+ * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo
+ * @idx: Index of memblk to remove
+ * @mi: numa_meminfo to remove memblk from
+ *
+ * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and
+ * decrementing @mi->nr_blks.
+ */
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
+{
+	mi->nr_blks--;
+	memmove(&mi->blk[idx], &mi->blk[idx + 1],
+		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
+}
+
+/**
+ * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
+ * @dst: numa_meminfo to append block to
+ * @idx: Index of memblk to remove
+ * @src: numa_meminfo to remove memblk from
+ */
+static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
+					 struct numa_meminfo *src)
+{
+	dst->blk[dst->nr_blks++] = src->blk[idx];
+	numa_remove_memblk_from(idx, src);
+}
+
+/**
+ * numa_add_memblk - Add one numa_memblk to numa_meminfo
+ * @nid: NUMA node ID of the new memblk
+ * @start: Start address of the new memblk
+ * @end: End address of the new memblk
+ *
+ * Add a new memblk to the default numa_meminfo.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init numa_add_memblk(int nid, u64 start, u64 end)
+{
+	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
+}
+
+/**
+ * numa_cleanup_meminfo - Cleanup a numa_meminfo
+ * @mi: numa_meminfo to clean up
+ *
+ * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
+ * conflicts and clear unused memblks.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
+{
+	const u64 low = 0;
+	const u64 high = PFN_PHYS(max_pfn);
+	int i, j, k;
+
+	/* first, trim all entries */
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		/* move / save reserved memory ranges */
+		if (!memblock_overlaps_region(&memblock.memory,
+					bi->start, bi->end - bi->start)) {
+			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
+			continue;
+		}
+
+		/* make sure all non-reserved blocks are inside the limits */
+		bi->start = max(bi->start, low);
+
+		/* preserve info for non-RAM areas above 'max_pfn': */
+		if (bi->end > high) {
+			numa_add_memblk_to(bi->nid, high, bi->end,
+					   &numa_reserved_meminfo);
+			bi->end = high;
+		}
+
+		/* and there's no empty block */
+		if (bi->start >= bi->end)
+			numa_remove_memblk_from(i--, mi);
+	}
+
+	/* merge neighboring / overlapping entries */
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		for (j = i + 1; j < mi->nr_blks; j++) {
+			struct numa_memblk *bj = &mi->blk[j];
+			u64 start, end;
+
+			/*
+			 * See whether there are overlapping blocks.  Whine
+			 * about but allow overlaps of the same nid.  They
+			 * will be merged below.
+			 */
+			if (bi->end > bj->start && bi->start < bj->end) {
+				if (bi->nid != bj->nid) {
+					pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
+					       bi->nid, bi->start, bi->end - 1,
+					       bj->nid, bj->start, bj->end - 1);
+					return -EINVAL;
+				}
+				pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
+					bi->nid, bi->start, bi->end - 1,
+					bj->start, bj->end - 1);
+			}
+
+			/*
+			 * Join together blocks on the same node, holes
+			 * between which don't overlap with memory on other
+			 * nodes.
+			 */
+			if (bi->nid != bj->nid)
+				continue;
+			start = min(bi->start, bj->start);
+			end = max(bi->end, bj->end);
+			for (k = 0; k < mi->nr_blks; k++) {
+				struct numa_memblk *bk = &mi->blk[k];
+
+				if (bi->nid == bk->nid)
+					continue;
+				if (start < bk->end && end > bk->start)
+					break;
+			}
+			if (k < mi->nr_blks)
+				continue;
+			pr_info("NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
+			       bi->nid, bi->start, bi->end - 1, bj->start,
+			       bj->end - 1, start, end - 1);
+			bi->start = start;
+			bi->end = end;
+			numa_remove_memblk_from(j--, mi);
+		}
+	}
+
+	/* clear unused ones */
+	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
+		mi->blk[i].start = mi->blk[i].end = 0;
+		mi->blk[i].nid = NUMA_NO_NODE;
+	}
+
+	return 0;
+}
+
+/*
+ * Set nodes, which have memory in @mi, in *@nodemask.
+ */
+void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+				       const struct numa_meminfo *mi)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
+		if (mi->blk[i].start != mi->blk[i].end &&
+		    mi->blk[i].nid != NUMA_NO_NODE)
+			node_set(mi->blk[i].nid, *nodemask);
+}
+
+/*
+ * Mark all currently memblock-reserved physical memory (which covers the
+ * kernel's own memory ranges) as hot-unswappable.
+ */
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+	nodemask_t reserved_nodemask = NODE_MASK_NONE;
+	struct memblock_region *mb_region;
+	int i;
+
+	/*
+	 * We have to do some preprocessing of memblock regions, to
+	 * make them suitable for reservation.
+	 *
+	 * At this time, all memory regions reserved by memblock are
+	 * used by the kernel, but those regions are not split up
+	 * along node boundaries yet, and don't necessarily have their
+	 * node ID set yet either.
+	 *
+	 * So iterate over all memory known to the x86 architecture,
+	 * and use those ranges to set the nid in memblock.reserved.
+	 * This will split up the memblock regions along node
+	 * boundaries and will set the node IDs as well.
+	 */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		struct numa_memblk *mb = numa_meminfo.blk + i;
+		int ret;
+
+		ret = memblock_set_node(mb->start, mb->end - mb->start,
+					&memblock.reserved, mb->nid);
+		WARN_ON_ONCE(ret);
+	}
+
+	/*
+	 * Now go over all reserved memblock regions, to construct a
+	 * node mask of all kernel reserved memory areas.
+	 *
+	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
+	 *   numa_meminfo might not include all memblock.reserved
+	 *   memory ranges, because quirks such as trim_snb_memory()
+	 *   reserve specific pages for Sandy Bridge graphics. ]
+	 */
+	for_each_reserved_mem_region(mb_region) {
+		int nid = memblock_get_region_node(mb_region);
+
+		if (nid != NUMA_NO_NODE)
+			node_set(nid, reserved_nodemask);
+	}
+
+	/*
+	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
+	 * belonging to the reserved node mask.
+	 *
+	 * Note that this will include memory regions that reside
+	 * on nodes that contain kernel memory - entire nodes
+	 * become hot-unpluggable:
+	 */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		struct numa_memblk *mb = numa_meminfo.blk + i;
+
+		if (!node_isset(mb->nid, reserved_nodemask))
+			continue;
+
+		memblock_clear_hotplug(mb->start, mb->end - mb->start);
+	}
+}
+
+int __init numa_register_meminfo(struct numa_meminfo *mi)
+{
+	int i;
+
+	/* Account for nodes with cpus and no memory */
+	node_possible_map = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&node_possible_map, mi);
+	if (WARN_ON(nodes_empty(node_possible_map)))
+		return -EINVAL;
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+
+		memblock_set_node(mb->start, mb->end - mb->start,
+				  &memblock.memory, mb->nid);
+	}
+
+	/*
+	 * Very early in boot, the kernel has to use some memory, for example
+	 * to load the kernel image. We cannot prevent this, so any node the
+	 * kernel resides in should be un-hotpluggable.
+	 *
+	 * And when we come here, alloc node data won't fail.
+	 */
+	numa_clear_kernel_node_hotplug();
+
+	/*
+	 * If sections array is gonna be used for pfn -> nid mapping, check
+	 * whether its granularity is fine enough.
+	 */
+	if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) {
+		unsigned long pfn_align = node_map_pfn_alignment();
+
+		if (pfn_align && pfn_align < PAGES_PER_SECTION) {
+			pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
+				PFN_PHYS(pfn_align) >> 20,
+				PFN_PHYS(PAGES_PER_SECTION) >> 20);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int __init cmp_memblk(const void *a, const void *b)
+{
+	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
+	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
+
+	return (ma->start > mb->start) - (ma->start < mb->start);
+}
+
+static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
+
+/**
+ * numa_fill_memblks - Fill gaps in numa_meminfo memblks
+ * @start: address to begin fill
+ * @end: address to end fill
+ *
+ * Find and extend numa_meminfo memblks to cover the physical
+ * address range @start-@end
+ *
+ * RETURNS:
+ * 0		  : Success
+ * NUMA_NO_MEMBLK : No memblks exist in address range @start-@end
+ */
+
+int __init numa_fill_memblks(u64 start, u64 end)
+{
+	struct numa_memblk **blk = &numa_memblk_list[0];
+	struct numa_meminfo *mi = &numa_meminfo;
+	int count = 0;
+	u64 prev_end;
+
+	/*
+	 * Create a list of pointers to numa_meminfo memblks that
+	 * overlap start, end. The list is used to make in-place
+	 * changes that fill out the numa_meminfo memblks.
+	 */
+	for (int i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		if (memblock_addrs_overlap(start, end - start, bi->start,
+					   bi->end - bi->start)) {
+			blk[count] = &mi->blk[i];
+			count++;
+		}
+	}
+	if (!count)
+		return NUMA_NO_MEMBLK;
+
+	/* Sort the list of pointers in memblk->start order */
+	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
+
+	/* Make sure the first/last memblks include start/end */
+	blk[0]->start = min(blk[0]->start, start);
+	blk[count - 1]->end = max(blk[count - 1]->end, end);
+
+	/*
+	 * Fill any gaps by tracking the previous memblks
+	 * end address and backfilling to it if needed.
+	 */
+	prev_end = blk[0]->end;
+	for (int i = 1; i < count; i++) {
+		struct numa_memblk *curr = blk[i];
+
+		if (prev_end >= curr->start) {
+			if (prev_end < curr->end)
+				prev_end = curr->end;
+		} else {
+			curr->start = prev_end;
+			prev_end = curr->end;
+		}
+	}
+	return 0;
+}
-- 
2.43.0
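
For readers following numa_fill_memblks() above, here is a minimal
user-space sketch of the same gap-filling idea. It is not the kernel
code: struct demo_memblk, DEMO_NO_MEMBLK, DEMO_MAX_BLKS and the demo_*
functions are hypothetical names, qsort() stands in for the kernel's
sort(), and the overlap test is written out by hand.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types and constants. */
struct demo_memblk {
	uint64_t start;
	uint64_t end;	/* exclusive */
	int nid;
};

#define DEMO_NO_MEMBLK	(-1)
#define DEMO_MAX_BLKS	16

/* Three-way compare on start; avoids overflow from subtracting u64s. */
static int demo_cmp_memblk(const void *a, const void *b)
{
	const struct demo_memblk *ma = *(const struct demo_memblk * const *)a;
	const struct demo_memblk *mb = *(const struct demo_memblk * const *)b;

	return (ma->start > mb->start) - (ma->start < mb->start);
}

static int demo_fill_memblks(struct demo_memblk *blks, int nr,
			     uint64_t start, uint64_t end)
{
	struct demo_memblk *list[DEMO_MAX_BLKS];
	uint64_t prev_end;
	int count = 0;

	/* Collect pointers to the blocks that overlap [start, end). */
	for (int i = 0; i < nr && count < DEMO_MAX_BLKS; i++)
		if (blks[i].start < end && start < blks[i].end)
			list[count++] = &blks[i];
	if (!count)
		return DEMO_NO_MEMBLK;

	/* Sort the pointers, not the blocks, so edits happen in place. */
	qsort(list, count, sizeof(list[0]), demo_cmp_memblk);

	/* Stretch the first and last block to cover the whole range. */
	if (list[0]->start > start)
		list[0]->start = start;
	if (list[count - 1]->end < end)
		list[count - 1]->end = end;

	/* Close each gap by pulling the next block's start back. */
	prev_end = list[0]->end;
	for (int i = 1; i < count; i++) {
		struct demo_memblk *curr = list[i];

		if (prev_end >= curr->start) {
			if (prev_end < curr->end)
				prev_end = curr->end;
		} else {
			curr->start = prev_end;
			prev_end = curr->end;
		}
	}
	return 0;
}
```

Note that a gap is always assigned to the block that follows it, so
the node that gains the unpopulated range is the one with the higher
start address.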



* [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (11 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 12/17] mm: introduce numa_memblks Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-18 21:46   ` Samuel Holland
  2024-07-19 17:48   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 14/17] mm: introduce numa_emulation Mike Rapoport
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move the code dealing with the numa_distance array from arch/x86 to
mm/numa_memblks.c.

This code will later be reused by arch_numa.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/mm/numa.c                   | 101 ---------------------------
 arch/x86/mm/numa_internal.h          |   2 -
 include/linux/numa_memblks.h         |   4 ++
 {arch/x86/mm => mm}/numa_emulation.c |   0
 mm/numa_memblks.c                    | 101 +++++++++++++++++++++++++++
 5 files changed, 105 insertions(+), 103 deletions(-)
 rename {arch/x86/mm => mm}/numa_emulation.c (100%)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bc0b34c6ea2..3848e68d771a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -24,9 +24,6 @@
 
 int numa_off;
 
-static int numa_distance_cnt;
-static u8 *numa_distance;
-
 static __init int numa_setup(char *opt)
 {
 	if (!opt)
@@ -118,104 +115,6 @@ void __init setup_node_to_cpumask_map(void)
 	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
-/**
- * numa_reset_distance - Reset NUMA distance table
- *
- * The current table is freed.  The next numa_set_distance() call will
- * create a new one.
- */
-void __init numa_reset_distance(void)
-{
-	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
-
-	if (numa_distance)
-		memblock_free(numa_distance, size);
-	numa_distance_cnt = 0;
-	numa_distance = NULL;	/* enable table creation */
-}
-
-static int __init numa_alloc_distance(void)
-{
-	nodemask_t nodes_parsed;
-	size_t size;
-	int i, j, cnt = 0;
-
-	/* size the new table and allocate it */
-	nodes_parsed = numa_nodes_parsed;
-	numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
-
-	for_each_node_mask(i, nodes_parsed)
-		cnt = i;
-	cnt++;
-	size = cnt * cnt * sizeof(numa_distance[0]);
-
-	numa_distance = memblock_alloc(size, PAGE_SIZE);
-	if (!numa_distance) {
-		pr_warn("Warning: can't allocate distance table!\n");
-		return -ENOMEM;
-	}
-
-	numa_distance_cnt = cnt;
-
-	/* fill with the default distances */
-	for (i = 0; i < cnt; i++)
-		for (j = 0; j < cnt; j++)
-			numa_distance[i * cnt + j] = i == j ?
-				LOCAL_DISTANCE : REMOTE_DISTANCE;
-	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
-
-	return 0;
-}
-
-/**
- * numa_set_distance - Set NUMA distance from one NUMA to another
- * @from: the 'from' node to set distance
- * @to: the 'to'  node to set distance
- * @distance: NUMA distance
- *
- * Set the distance from node @from to @to to @distance.  If distance table
- * doesn't exist, one which is large enough to accommodate all the currently
- * known nodes will be created.
- *
- * If such table cannot be allocated, a warning is printed and further
- * calls are ignored until the distance table is reset with
- * numa_reset_distance().
- *
- * If @from or @to is higher than the highest known node or lower than zero
- * at the time of table creation or @distance doesn't make sense, the call
- * is ignored.
- * This is to allow simplification of specific NUMA config implementations.
- */
-void __init numa_set_distance(int from, int to, int distance)
-{
-	if (!numa_distance && numa_alloc_distance() < 0)
-		return;
-
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
-			from < 0 || to < 0) {
-		pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
-			     from, to, distance);
-		return;
-	}
-
-	if ((u8)distance != distance ||
-	    (from == to && distance != LOCAL_DISTANCE)) {
-		pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
-			     from, to, distance);
-		return;
-	}
-
-	numa_distance[from * numa_distance_cnt + to] = distance;
-}
-
-int __node_distance(int from, int to)
-{
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
-		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
-	return numa_distance[from * numa_distance_cnt + to];
-}
-EXPORT_SYMBOL(__node_distance);
-
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	int i, nid, err;
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index a51229a2f5af..249e3aaeadce 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -5,8 +5,6 @@
 #include <linux/types.h>
 #include <asm/numa.h>
 
-void __init numa_reset_distance(void);
-
 void __init x86_numa_init(void);
 
 struct numa_meminfo;
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 6981cf97d2c9..968a590535ac 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -7,6 +7,10 @@
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
 
+extern int numa_distance_cnt;
+void __init numa_set_distance(int from, int to, int distance);
+void __init numa_reset_distance(void);
+
 struct numa_memblk {
 	u64			start;
 	u64			end;
diff --git a/arch/x86/mm/numa_emulation.c b/mm/numa_emulation.c
similarity index 100%
rename from arch/x86/mm/numa_emulation.c
rename to mm/numa_emulation.c
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index e31307317ca7..e0039549aaac 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -7,11 +7,112 @@
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
 
+int numa_distance_cnt;
+static u8 *numa_distance;
+
 nodemask_t numa_nodes_parsed __initdata;
 
 struct numa_meminfo numa_meminfo __initdata_or_meminfo;
 struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
+/**
+ * numa_reset_distance - Reset NUMA distance table
+ *
+ * The current table is freed.  The next numa_set_distance() call will
+ * create a new one.
+ */
+void __init numa_reset_distance(void)
+{
+	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
+
+	if (numa_distance)
+		memblock_free(numa_distance, size);
+	numa_distance_cnt = 0;
+	numa_distance = NULL;	/* enable table creation */
+}
+
+static int __init numa_alloc_distance(void)
+{
+	nodemask_t nodes_parsed;
+	size_t size;
+	int i, j, cnt = 0;
+
+	/* size the new table and allocate it */
+	nodes_parsed = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
+
+	for_each_node_mask(i, nodes_parsed)
+		cnt = i;
+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);
+
+	numa_distance = memblock_alloc(size, PAGE_SIZE);
+	if (!numa_distance) {
+		pr_warn("Warning: can't allocate distance table!\n");
+		return -ENOMEM;
+	}
+
+	numa_distance_cnt = cnt;
+
+	/* fill with the default distances */
+	for (i = 0; i < cnt; i++)
+		for (j = 0; j < cnt; j++)
+			numa_distance[i * cnt + j] = i == j ?
+				LOCAL_DISTANCE : REMOTE_DISTANCE;
+	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
+
+	return 0;
+}
+
+/**
+ * numa_set_distance - Set NUMA distance from one NUMA to another
+ * @from: the 'from' node to set distance
+ * @to: the 'to'  node to set distance
+ * @distance: NUMA distance
+ *
+ * Set the distance from node @from to @to to @distance.  If distance table
+ * doesn't exist, one which is large enough to accommodate all the currently
+ * known nodes will be created.
+ *
+ * If such table cannot be allocated, a warning is printed and further
+ * calls are ignored until the distance table is reset with
+ * numa_reset_distance().
+ *
+ * If @from or @to is higher than the highest known node or lower than zero
+ * at the time of table creation or @distance doesn't make sense, the call
+ * is ignored.
+ * This is to allow simplification of specific NUMA config implementations.
+ */
+void __init numa_set_distance(int from, int to, int distance)
+{
+	if (!numa_distance && numa_alloc_distance() < 0)
+		return;
+
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
+			from < 0 || to < 0) {
+		pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
+			     from, to, distance);
+		return;
+	}
+
+	if ((u8)distance != distance ||
+	    (from == to && distance != LOCAL_DISTANCE)) {
+		pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
+			     from, to, distance);
+		return;
+	}
+
+	numa_distance[from * numa_distance_cnt + to] = distance;
+}
+
+int __node_distance(int from, int to)
+{
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
+		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+	return numa_distance[from * numa_distance_cnt + to];
+}
+EXPORT_SYMBOL(__node_distance);
+
 static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
 				     struct numa_meminfo *mi)
 {
-- 
2.43.0
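
The distance table moved by this patch is a flat cnt x cnt array of
u8 with a default fallback for out-of-range lookups. The user-space
sketch below illustrates that layout only. The demo_* names are
hypothetical, malloc() stands in for memblock_alloc(), and the table
size is passed in explicitly, whereas the kernel derives it from the
parsed node mask and allocates lazily on the first
numa_set_distance() call. The values 10 and 20 mirror the kernel's
LOCAL_DISTANCE and REMOTE_DISTANCE defaults. The lookup here also
rejects negative node ids, which the kernel's __node_distance() does
not check.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_LOCAL_DISTANCE	10
#define DEMO_REMOTE_DISTANCE	20

static uint8_t *demo_distance;
static int demo_distance_cnt;

/* Allocate a cnt x cnt table and fill it with the default distances. */
static int demo_alloc_distance(int cnt)
{
	demo_distance = malloc((size_t)cnt * cnt * sizeof(demo_distance[0]));
	if (!demo_distance)
		return -1;

	demo_distance_cnt = cnt;
	for (int i = 0; i < cnt; i++)
		for (int j = 0; j < cnt; j++)
			demo_distance[i * cnt + j] = i == j ?
				DEMO_LOCAL_DISTANCE : DEMO_REMOTE_DISTANCE;
	return 0;
}

/* Silently ignore node ids or distances that do not make sense. */
static void demo_set_distance(int from, int to, int distance)
{
	if (!demo_distance)
		return;
	if (from < 0 || to < 0 ||
	    from >= demo_distance_cnt || to >= demo_distance_cnt)
		return;
	/* Must fit in a u8, and a node is always "local" to itself. */
	if ((uint8_t)distance != distance ||
	    (from == to && distance != DEMO_LOCAL_DISTANCE))
		return;

	demo_distance[from * demo_distance_cnt + to] = (uint8_t)distance;
}

/* Lookups outside the table fall back to the defaults. */
static int demo_node_distance(int from, int to)
{
	if (from < 0 || to < 0 ||
	    from >= demo_distance_cnt || to >= demo_distance_cnt)
		return from == to ? DEMO_LOCAL_DISTANCE : DEMO_REMOTE_DISTANCE;
	return demo_distance[from * demo_distance_cnt + to];
}
```

The table is not forced to be symmetric: setting the distance from
node 0 to node 1 leaves the distance from node 1 to node 0 untouched.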



* [PATCH 14/17] mm: introduce numa_emulation
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (12 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 16:03   ` Zi Yan
  2024-07-16 11:13 ` [PATCH 15/17] mm: make numa_memblks more self-contained Mike Rapoport
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Move numa_emulation code from arch/x86 to mm/numa_emulation.c.

This code will later be reused by arch_numa.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/Kconfig             |  8 --------
 arch/x86/include/asm/numa.h  | 12 ------------
 arch/x86/mm/Makefile         |  1 -
 arch/x86/mm/numa_internal.h  | 11 -----------
 include/linux/numa_memblks.h | 17 +++++++++++++++++
 mm/Kconfig                   |  8 ++++++++
 mm/Makefile                  |  1 +
 mm/numa_emulation.c          |  4 +---
 8 files changed, 27 insertions(+), 35 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d8084f37157c..a42735c126fa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1592,14 +1592,6 @@ config X86_64_ACPI_NUMA
 	help
 	  Enable ACPI SRAT based node topology detection.
 
-config NUMA_EMU
-	bool "NUMA emulation"
-	depends on NUMA
-	help
-	  Enable NUMA emulation. A flat machine will be split
-	  into virtual nodes when booted with "numa=fake=N", where N is the
-	  number of nodes. This is only useful for debugging.
-
 config NODES_SHIFT
 	int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
 	range 1 10
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 6e9a50bf03d4..c6e232e3c303 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -67,16 +67,4 @@ static inline void init_gi_nodes(void)			{ }
 void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable);
 #endif
 
-#ifdef CONFIG_NUMA_EMU
-int numa_emu_cmdline(char *str);
-void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
-					unsigned int nr_emu_nids);
-u64 __init numa_emu_dma_end(void);
-#else /* CONFIG_NUMA_EMU */
-static inline int numa_emu_cmdline(char *str)
-{
-	return -EINVAL;
-}
-#endif /* CONFIG_NUMA_EMU */
-
 #endif	/* _ASM_X86_NUMA_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 8d3a00e5c528..690fbf48e853 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -57,7 +57,6 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
 obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
 obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
-obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index 249e3aaeadce..11e1ff370c10 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -7,15 +7,4 @@
 
 void __init x86_numa_init(void);
 
-struct numa_meminfo;
-
-#ifdef CONFIG_NUMA_EMU
-void __init numa_emulation(struct numa_meminfo *numa_meminfo,
-			   int numa_dist_cnt);
-#else
-static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
-				  int numa_dist_cnt)
-{ }
-#endif
-
 #endif	/* __X86_MM_NUMA_INTERNAL_H */
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 968a590535ac..f81f98678074 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -34,6 +34,23 @@ int __init numa_register_meminfo(struct numa_meminfo *mi);
 void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
 				       const struct numa_meminfo *mi);
 
+#ifdef CONFIG_NUMA_EMU
+int numa_emu_cmdline(char *str);
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids);
+u64 __init numa_emu_dma_end(void);
+void __init numa_emulation(struct numa_meminfo *numa_meminfo,
+			   int numa_dist_cnt);
+#else
+static inline void numa_emulation(struct numa_meminfo *numa_meminfo,
+				  int numa_dist_cnt)
+{ }
+static inline int numa_emu_cmdline(char *str)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_NUMA_EMU */
+
 #endif /* CONFIG_NUMA_MEMBLKS */
 
 #endif	/* __NUMA_MEMBLKS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 15c6efbaa1df..ae58eecdefdc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1252,6 +1252,14 @@ config EXECMEM
 config NUMA_MEMBLKS
 	bool
 
+config NUMA_EMU
+	bool "NUMA emulation"
+	depends on NUMA_MEMBLKS
+	help
+	  Enable NUMA emulation. A flat machine will be split
+	  into virtual nodes when booted with "numa=fake=N", where N is the
+	  number of nodes. This is only useful for debugging.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 17bc4013a2c5..d5b1b30f76e3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -141,3 +141,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_NUMA) += numa.o
 obj-$(CONFIG_NUMA_MEMBLKS) += numa_memblks.o
+obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 33610026b7a3..031fb9961bf7 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -7,9 +7,7 @@
 #include <linux/topology.h>
 #include <linux/memblock.h>
 #include <linux/numa_memblks.h>
-#include <asm/dma.h>
-
-#include "numa_internal.h"
+#include <asm/numa.h>
 
 #define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
-- 
2.43.0
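
The header change above keeps the usual pattern of pairing real
declarations with static inline stubs, so callers need no #ifdef of
their own. A small user-space sketch of that pattern follows.
DEMO_NUMA_EMU is a hypothetical stand-in for CONFIG_NUMA_EMU, and
demo_numa_emu_cmdline() and demo_parse_numa_param() are illustrative
names, not kernel functions.

```c
#include <errno.h>
#include <string.h>

/*
 * DEMO_NUMA_EMU is a hypothetical stand-in for CONFIG_NUMA_EMU.
 * It is deliberately left undefined, so the stub branch is compiled.
 */
#ifdef DEMO_NUMA_EMU
int demo_numa_emu_cmdline(const char *str);
#else
static inline int demo_numa_emu_cmdline(const char *str)
{
	(void)str;
	return -EINVAL;	/* emulation not built in: reject the option */
}
#endif

/* Illustrative caller: dispatches "fake=..." without any #ifdef. */
static int demo_parse_numa_param(const char *opt)
{
	if (!opt)
		return -EINVAL;
	if (!strncmp(opt, "fake=", 5))
		return demo_numa_emu_cmdline(opt + 5);
	return 0;
}
```

With the option disabled, a "fake=" argument is rejected with -EINVAL
instead of being silently accepted.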



* [PATCH 15/17] mm: make numa_memblks more self-contained
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (13 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 14/17] mm: introduce numa_emulation Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 18:07   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 16/17] arch_numa: switch over to numa_memblks Mike Rapoport
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Introduce numa_memblks_init() and move some code around to avoid several
global variables in numa_memblks.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/mm/numa.c           | 53 ++++---------------------
 include/linux/numa_memblks.h |  9 +----
 mm/numa_memblks.c            | 77 +++++++++++++++++++++++++++---------
 3 files changed, 68 insertions(+), 71 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3848e68d771a..16bc703c9272 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -115,30 +115,19 @@ void __init setup_node_to_cpumask_map(void)
 	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_register_nodes(void)
 {
-	int i, nid, err;
-
-	err = numa_register_meminfo(mi);
-	if (err)
-		return err;
+	int nid;
 
 	if (!memblock_validate_numa_coverage(SZ_1M))
 		return -EINVAL;
 
 	/* Finally register nodes. */
 	for_each_node_mask(nid, node_possible_map) {
-		u64 start = PFN_PHYS(max_pfn);
-		u64 end = 0;
-
-		for (i = 0; i < mi->nr_blks; i++) {
-			if (nid != mi->blk[i].nid)
-				continue;
-			start = min(mi->blk[i].start, start);
-			end = max(mi->blk[i].end, end);
-		}
+		unsigned long start_pfn, end_pfn;
 
-		if (start >= end)
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+		if (start_pfn >= end_pfn)
 			continue;
 
 		alloc_node_data(nid);
@@ -178,39 +167,11 @@ static int __init numa_init(int (*init_func)(void))
 	for (i = 0; i < MAX_LOCAL_APIC; i++)
 		set_apicid_to_node(i, NUMA_NO_NODE);
 
-	nodes_clear(numa_nodes_parsed);
-	nodes_clear(node_possible_map);
-	nodes_clear(node_online_map);
-	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
-				  NUMA_NO_NODE));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
-				  NUMA_NO_NODE));
-	/* In case that parsing SRAT failed. */
-	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
-	numa_reset_distance();
-
-	ret = init_func();
-	if (ret < 0)
-		return ret;
-
-	/*
-	 * We reset memblock back to the top-down direction
-	 * here because if we configured ACPI_NUMA, we have
-	 * parsed SRAT in init_func(). It is ok to have the
-	 * reset here even if we did't configure ACPI_NUMA
-	 * or acpi numa init fails and fallbacks to dummy
-	 * numa init.
-	 */
-	memblock_set_bottom_up(false);
-
-	ret = numa_cleanup_meminfo(&numa_meminfo);
+	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
 	if (ret < 0)
 		return ret;
 
-	numa_emulation(&numa_meminfo, numa_distance_cnt);
-
-	ret = numa_register_memblks(&numa_meminfo);
+	ret = numa_register_nodes();
 	if (ret < 0)
 		return ret;
 
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index f81f98678074..5c6e12ad0b7a 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -7,7 +7,6 @@
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
 
-extern int numa_distance_cnt;
 void __init numa_set_distance(int from, int to, int distance);
 void __init numa_reset_distance(void);
 
@@ -22,17 +21,13 @@ struct numa_meminfo {
 	struct numa_memblk	blk[NR_NODE_MEMBLKS];
 };
 
-extern struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-extern struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
-
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
 
 int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
-int __init numa_register_meminfo(struct numa_meminfo *mi);
 
-void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
-				       const struct numa_meminfo *mi);
+int __init numa_memblks_init(int (*init_func)(void),
+			     bool memblock_force_top_down);
 
 #ifdef CONFIG_NUMA_EMU
 int numa_emu_cmdline(char *str);
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index e0039549aaac..640f3a3ce0ee 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -7,13 +7,27 @@
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
 
-int numa_distance_cnt;
+static int numa_distance_cnt;
 static u8 *numa_distance;
 
 nodemask_t numa_nodes_parsed __initdata;
 
-struct numa_meminfo numa_meminfo __initdata_or_meminfo;
-struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
+
+/*
+ * Set nodes, which have memory in @mi, in *@nodemask.
+ */
+static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+					      const struct numa_meminfo *mi)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
+		if (mi->blk[i].start != mi->blk[i].end &&
+		    mi->blk[i].nid != NUMA_NO_NODE)
+			node_set(mi->blk[i].nid, *nodemask);
+}
 
 /**
  * numa_reset_distance - Reset NUMA distance table
@@ -287,20 +301,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 	return 0;
 }
 
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
-				       const struct numa_meminfo *mi)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
-		if (mi->blk[i].start != mi->blk[i].end &&
-		    mi->blk[i].nid != NUMA_NO_NODE)
-			node_set(mi->blk[i].nid, *nodemask);
-}
-
 /*
  * Mark all currently memblock-reserved physical memory (which covers the
  * kernel's own memory ranges) as hot-unswappable.
@@ -368,7 +368,7 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
-int __init numa_register_meminfo(struct numa_meminfo *mi)
+static int __init numa_register_meminfo(struct numa_meminfo *mi)
 {
 	int i;
 
@@ -412,6 +412,47 @@ int __init numa_register_meminfo(struct numa_meminfo *mi)
 	return 0;
 }
 
+int __init numa_memblks_init(int (*init_func)(void),
+			     bool memblock_force_top_down)
+{
+	int ret;
+
+	nodes_clear(numa_nodes_parsed);
+	nodes_clear(node_possible_map);
+	nodes_clear(node_online_map);
+	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
+				  NUMA_NO_NODE));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+				  NUMA_NO_NODE));
+	/* In case that parsing SRAT failed. */
+	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
+	numa_reset_distance();
+
+	ret = init_func();
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * We reset memblock back to the top-down direction
+	 * here because if we configured ACPI_NUMA, we have
+	 * parsed SRAT in init_func(). It is ok to have the
+	 * reset here even if we didn't configure ACPI_NUMA
+	 * or ACPI NUMA init fails and falls back to dummy
+	 * NUMA init.
+	 */
+	if (memblock_force_top_down)
+		memblock_set_bottom_up(false);
+
+	ret = numa_cleanup_meminfo(&numa_meminfo);
+	if (ret < 0)
+		return ret;
+
+	numa_emulation(&numa_meminfo, numa_distance_cnt);
+
+	return numa_register_meminfo(&numa_meminfo);
+}
+
 static int __init cmp_memblk(const void *a, const void *b)
 {
 	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
-- 
2.43.0
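
The control flow of numa_memblks_init() can be modelled in user space
as: reset state, run the architecture callback, stop on failure, then
validate and register. The sketch below is only such a model. struct
demo_state, demo_memblks_init() and the two demo_init_* callbacks are
hypothetical, and the "no blocks" check merely stands in for the
cleanup and registration steps of the real function.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state replacing numa_meminfo and the memblock flags. */
struct demo_state {
	int nr_blks;
	bool bottom_up;
	bool registered;
};

static struct demo_state demo;

static int demo_memblks_init(int (*init_func)(void), bool force_top_down)
{
	int ret;

	/* Start from a clean slate so a failed attempt leaves no residue. */
	demo.nr_blks = 0;
	demo.registered = false;

	/* The architecture-specific parser fills in the blocks. */
	ret = init_func();
	if (ret < 0)
		return ret;

	/* Only callers that ask for it get the allocation direction reset. */
	if (force_top_down)
		demo.bottom_up = false;

	/* Stand-in for the cleanup and registration steps. */
	if (demo.nr_blks == 0)
		return -EINVAL;
	demo.registered = true;
	return 0;
}

/* Example callback that fails, e.g. no firmware table was found. */
static int demo_init_fails(void)
{
	return -EINVAL;
}

/* Example callback that succeeds and leaves bottom-up mode enabled. */
static int demo_init_ok(void)
{
	demo.nr_blks = 2;
	demo.bottom_up = true;
	return 0;
}
```

Because every call begins with the reset, a caller can try one
callback and, if it fails, simply call again with a fallback.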



* [PATCH 16/17] arch_numa: switch over to numa_memblks
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (14 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 15/17] mm: make numa_memblks more self-contained Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 18:16   ` Jonathan Cameron
  2024-07-16 11:13 ` [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
  2024-07-19 13:33 ` [PATCH 00/17] mm: introduce numa_memblks Jonathan Cameron
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Until now, arch_numa translated firmware NUMA information directly
to memblock.

Using numa_memblks as an intermediate step has a few advantages:
* alignment with the more battle-tested x86 implementation
* availability of NUMA emulation
* maintaining node information for not yet populated memory

Replace current functionality related to numa_add_memblk() and
__node_distance() with the implementation based on numa_memblks and add
functions required by numa_emulation.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 drivers/base/Kconfig       |   1 +
 drivers/base/arch_numa.c   | 200 +++++++++++--------------------------
 include/asm-generic/numa.h |   6 +-
 3 files changed, 64 insertions(+), 143 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 2b8fd6bb7da0..064eb52ff7e2 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -226,6 +226,7 @@ config GENERIC_ARCH_TOPOLOGY
 
 config GENERIC_ARCH_NUMA
 	bool
+	select NUMA_MEMBLKS
 	help
 	  Enable support for generic NUMA implementation. Currently, RISC-V
 	  and ARM64 use it.
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 2ebf12eab99f..333cbbad7466 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -12,14 +12,12 @@
 #include <linux/memblock.h>
 #include <linux/module.h>
 #include <linux/of.h>
+#include <linux/numa_memblks.h>
 
 #include <asm/sections.h>
 
-nodemask_t numa_nodes_parsed __initdata;
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
 
-static int numa_distance_cnt;
-static u8 *numa_distance;
 bool numa_off;
 
 static __init int numa_parse_early_param(char *opt)
@@ -28,6 +26,8 @@ static __init int numa_parse_early_param(char *opt)
 		return -EINVAL;
 	if (str_has_prefix(opt, "off"))
 		numa_off = true;
+	if (!strncmp(opt, "fake=", 5))
+		return numa_emu_cmdline(opt + 5);
 
 	return 0;
 }
@@ -59,6 +59,7 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif
 
+#ifndef CONFIG_NUMA_EMU
 static void numa_update_cpu(unsigned int cpu, bool remove)
 {
 	int nid = cpu_to_node(cpu);
@@ -81,6 +82,7 @@ void numa_remove_cpu(unsigned int cpu)
 {
 	numa_update_cpu(cpu, true);
 }
+#endif
 
 void numa_clear_node(unsigned int cpu)
 {
@@ -142,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
 
-int __init early_cpu_to_node(int cpu)
+int early_cpu_to_node(int cpu)
 {
 	return cpu_to_node_map[cpu];
 }
@@ -187,30 +189,6 @@ void __init setup_per_cpu_areas(void)
 }
 #endif
 
-/**
- * numa_add_memblk() - Set node id to memblk
- * @nid: NUMA node ID of the new memblk
- * @start: Start address of the new memblk
- * @end:  End address of the new memblk
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init numa_add_memblk(int nid, u64 start, u64 end)
-{
-	int ret;
-
-	ret = memblock_set_node(start, (end - start), &memblock.memory, nid);
-	if (ret < 0) {
-		pr_err("memblock [0x%llx - 0x%llx] failed to add on node %d\n",
-			start, (end - 1), nid);
-		return ret;
-	}
-
-	node_set(nid, numa_nodes_parsed);
-	return ret;
-}
-
 /*
  * Initialize NODE_DATA for a node on the local memory
  */
@@ -226,116 +204,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
 }
 
-/*
- * numa_free_distance
- *
- * The current table is freed.
- */
-void __init numa_free_distance(void)
-{
-	size_t size;
-
-	if (!numa_distance)
-		return;
-
-	size = numa_distance_cnt * numa_distance_cnt *
-		sizeof(numa_distance[0]);
-
-	memblock_free(numa_distance, size);
-	numa_distance_cnt = 0;
-	numa_distance = NULL;
-}
-
-/*
- * Create a new NUMA distance table.
- */
-static int __init numa_alloc_distance(void)
-{
-	size_t size;
-	int i, j;
-
-	size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
-	numa_distance = memblock_alloc(size, PAGE_SIZE);
-	if (WARN_ON(!numa_distance))
-		return -ENOMEM;
-
-	numa_distance_cnt = nr_node_ids;
-
-	/* fill with the default distances */
-	for (i = 0; i < numa_distance_cnt; i++)
-		for (j = 0; j < numa_distance_cnt; j++)
-			numa_distance[i * numa_distance_cnt + j] = i == j ?
-				LOCAL_DISTANCE : REMOTE_DISTANCE;
-
-	pr_debug("Initialized distance table, cnt=%d\n", numa_distance_cnt);
-
-	return 0;
-}
-
-/**
- * numa_set_distance() - Set inter node NUMA distance from node to node.
- * @from: the 'from' node to set distance
- * @to: the 'to'  node to set distance
- * @distance: NUMA distance
- *
- * Set the distance from node @from to @to to @distance.
- * If distance table doesn't exist, a warning is printed.
- *
- * If @from or @to is higher than the highest known node or lower than zero
- * or @distance doesn't make sense, the call is ignored.
- */
-void __init numa_set_distance(int from, int to, int distance)
-{
-	if (!numa_distance) {
-		pr_warn_once("Warning: distance table not allocated yet\n");
-		return;
-	}
-
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
-			from < 0 || to < 0) {
-		pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
-			    from, to, distance);
-		return;
-	}
-
-	if ((u8)distance != distance ||
-	    (from == to && distance != LOCAL_DISTANCE)) {
-		pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
-			     from, to, distance);
-		return;
-	}
-
-	numa_distance[from * numa_distance_cnt + to] = distance;
-}
-
-/*
- * Return NUMA distance @from to @to
- */
-int __node_distance(int from, int to)
-{
-	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
-		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
-	return numa_distance[from * numa_distance_cnt + to];
-}
-EXPORT_SYMBOL(__node_distance);
-
 static int __init numa_register_nodes(void)
 {
 	int nid;
-	struct memblock_region *mblk;
-
-	/* Check that valid nid is set to memblks */
-	for_each_mem_region(mblk) {
-		int mblk_nid = memblock_get_region_node(mblk);
-		phys_addr_t start = mblk->base;
-		phys_addr_t end = mblk->base + mblk->size - 1;
-
-		if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) {
-			pr_warn("Warning: invalid memblk node %d [mem %pap-%pap]\n",
-				mblk_nid, &start, &end);
-			return -EINVAL;
-		}
-	}
 
 	/* Finally register nodes. */
 	for_each_node_mask(nid, numa_nodes_parsed) {
@@ -360,11 +231,7 @@ static int __init numa_init(int (*init_func)(void))
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 
-	ret = numa_alloc_distance();
-	if (ret < 0)
-		return ret;
-
-	ret = init_func();
+	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ false);
 	if (ret < 0)
 		goto out_free_distance;
 
@@ -382,7 +249,7 @@ static int __init numa_init(int (*init_func)(void))
 
 	return 0;
 out_free_distance:
-	numa_free_distance();
+	numa_reset_distance();
 	return ret;
 }
 
@@ -454,3 +321,54 @@ void __init arch_numa_init(void)
 
 	numa_init(dummy_numa_init);
 }
+
+#ifdef CONFIG_NUMA_EMU
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+					unsigned int nr_emu_nids)
+{
+	int i, j;
+
+	/*
+	 * Transform cpu_to_node_map table to use emulated nids by
+	 * reverse-mapping phys_nid.  The maps should always exist but fall
+	 * back to zero just in case.
+	 */
+	for (i = 0; i < ARRAY_SIZE(cpu_to_node_map); i++) {
+		if (cpu_to_node_map[i] == NUMA_NO_NODE)
+			continue;
+		for (j = 0; j < nr_emu_nids; j++)
+			if (cpu_to_node_map[i] == emu_nid_to_phys[j])
+				break;
+		cpu_to_node_map[i] = j < nr_emu_nids ? j : 0;
+	}
+}
+
+u64 __init numa_emu_dma_end(void)
+{
+	return memblock_start_of_DRAM() + SZ_4G;
+}
+
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable)
+{
+	struct cpumask *mask;
+
+	if (node == NUMA_NO_NODE)
+		return;
+
+	mask = node_to_cpumask_map[node];
+	if (!cpumask_available(mask)) {
+		pr_err("node_to_cpumask_map[%i] NULL\n", node);
+		dump_stack();
+		return;
+	}
+
+	if (enable)
+		cpumask_set_cpu(cpu, mask);
+	else
+		cpumask_clear_cpu(cpu, mask);
+
+	pr_debug("%s cpu %d node %d: mask now %*pbl\n",
+		 enable ? "numa_add_cpu" : "numa_remove_cpu",
+		 cpu, node, cpumask_pr_args(mask));
+}
+#endif /* CONFIG_NUMA_EMU */
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index c32e0cf23c90..c2b046d1fd82 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -32,8 +32,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 
 void __init arch_numa_init(void);
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
-void __init numa_set_distance(int from, int to, int distance);
-void __init numa_free_distance(void);
 void __init early_map_cpu_to_node(unsigned int cpu, int nid);
 int __init early_cpu_to_node(int cpu);
 void numa_store_cpu_info(unsigned int cpu);
@@ -51,4 +49,8 @@ static inline int early_cpu_to_node(int cpu) { return 0; }
 
 #endif	/* CONFIG_NUMA */
 
+#ifdef CONFIG_NUMA_EMU
+void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable);
+#endif
+
 #endif	/* __ASM_GENERIC_NUMA_H */
-- 
2.43.0



* [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (15 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 16/17] arch_numa: switch over to numa_memblks Mike Rapoport
@ 2024-07-16 11:13 ` Mike Rapoport
  2024-07-19 18:19   ` Jonathan Cameron
  2024-07-19 13:33 ` [PATCH 00/17] mm: introduce numa_memblks Jonathan Cameron
  17 siblings, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-16 11:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Mike Rapoport, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The x86 implementation of range-to-target_node lookup (i.e.
phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
numa_memblks.

Since numa_memblks are now part of the generic code, move these
functions from x86 to mm/numa_memblks.c and select
CONFIG_NUMA_KEEP_MEMINFO when CONFIG_NUMA_MEMBLKS=y for dax and cxl.
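
For readers following the move, the lookup being relocated can be
sketched in userspace C roughly as below. This is only an illustration
and not the kernel code: struct blk, struct meminfo, NO_NODE and the
sample address ranges are hypothetical stand-ins for struct numa_meminfo,
NUMA_NO_NODE and the firmware-provided tables. It shows the half-open
[start, end) match, the fallback to reserved ranges, and the fallback to
the first block's node.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

struct blk {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	int nid;
};

struct meminfo {
	int nr_blks;
	struct blk blk[4];
};

/* Hypothetical sample data: two populated ranges and one reserved range. */
static const struct meminfo online = {
	.nr_blks = 2,
	.blk = { { 0x0000, 0x1000, 0 }, { 0x1000, 0x2000, 1 } },
};
static const struct meminfo reserved = {
	.nr_blks = 1,
	.blk = { { 0x4000, 0x5000, 2 } },
};

/* Return the node of the block containing addr, or NO_NODE. */
static int meminfo_to_nid(const struct meminfo *mi, uint64_t addr)
{
	for (int i = 0; i < mi->nr_blks; i++)
		if (mi->blk[i].start <= addr && addr < mi->blk[i].end)
			return mi->blk[i].nid;
	return NO_NODE;
}

/* Prefer populated ranges, then continue the search with reserved ranges. */
int target_node(uint64_t addr)
{
	int nid = meminfo_to_nid(&online, addr);

	return nid != NO_NODE ? nid : meminfo_to_nid(&reserved, addr);
}

/* Never fails: falls back to the node of the first populated block. */
int add_physaddr_to_nid(uint64_t addr)
{
	int nid = meminfo_to_nid(&online, addr);

	return nid != NO_NODE ? nid : online.blk[0].nid;
}

void check_examples(void)
{
	assert(target_node(0x0fff) == 0);	/* last byte of first range */
	assert(target_node(0x1000) == 1);	/* end is exclusive */
	assert(target_node(0x4800) == 2);	/* found in reserved ranges */
	assert(target_node(0x3000) == NO_NODE);	/* hole in both tables */
	assert(add_physaddr_to_nid(0x3000) == 0);
}
```

The two entry points differ only in the fallback: the first may return
"no node" for an address that is in neither table, the second always
returns a usable node id.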

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/include/asm/sparsemem.h |  9 --------
 arch/x86/mm/numa.c               | 38 --------------------------------
 drivers/cxl/Kconfig              |  2 +-
 drivers/dax/Kconfig              |  2 +-
 include/linux/numa_memblks.h     |  7 ++++++
 mm/numa.c                        |  1 +
 mm/numa_memblks.c                | 38 ++++++++++++++++++++++++++++++++
 7 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 64df897c0ee3..3918c7a434f5 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -31,13 +31,4 @@
 
 #endif /* CONFIG_SPARSEMEM */
 
-#ifndef __ASSEMBLY__
-#ifdef CONFIG_NUMA_KEEP_MEMINFO
-extern int phys_to_target_node(phys_addr_t start);
-#define phys_to_target_node phys_to_target_node
-extern int memory_add_physaddr_to_nid(u64 start);
-#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
-#endif
-#endif /* __ASSEMBLY__ */
-
 #endif /* _ASM_X86_SPARSEMEM_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 16bc703c9272..8e790528805e 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -449,41 +449,3 @@ u64 __init numa_emu_dma_end(void)
 	return PFN_PHYS(MAX_DMA32_PFN);
 }
 #endif /* CONFIG_NUMA_EMU */
-
-#ifdef CONFIG_NUMA_KEEP_MEMINFO
-static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
-{
-	int i;
-
-	for (i = 0; i < mi->nr_blks; i++)
-		if (mi->blk[i].start <= start && mi->blk[i].end > start)
-			return mi->blk[i].nid;
-	return NUMA_NO_NODE;
-}
-
-int phys_to_target_node(phys_addr_t start)
-{
-	int nid = meminfo_to_nid(&numa_meminfo, start);
-
-	/*
-	 * Prefer online nodes, but if reserved memory might be
-	 * hot-added continue the search with reserved ranges.
-	 */
-	if (nid != NUMA_NO_NODE)
-		return nid;
-
-	return meminfo_to_nid(&numa_reserved_meminfo, start);
-}
-EXPORT_SYMBOL_GPL(phys_to_target_node);
-
-int memory_add_physaddr_to_nid(u64 start)
-{
-	int nid = meminfo_to_nid(&numa_meminfo, start);
-
-	if (nid == NUMA_NO_NODE)
-		nid = numa_meminfo.blk[0].nid;
-	return nid;
-}
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-
-#endif
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 99b5c25be079..29c192f20082 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -6,7 +6,7 @@ menuconfig CXL_BUS
 	select FW_UPLOAD
 	select PCI_DOE
 	select FIRMWARE_TABLE
-	select NUMA_KEEP_MEMINFO if (NUMA && X86)
+	select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
 	help
 	  CXL is a bus that is electrically compatible with PCI Express, but
 	  layers three protocols on that signalling (CXL.io, CXL.cache, and
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..d656e4c0eb84 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -30,7 +30,7 @@ config DEV_DAX_PMEM
 config DEV_DAX_HMEM
 	tristate "HMEM DAX: direct access to 'specific purpose' memory"
 	depends on EFI_SOFT_RESERVE
-	select NUMA_KEEP_MEMINFO if (NUMA && X86)
+	select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
 	default DEV_DAX
 	help
 	  EFI 2.8 platforms, and others, may advertise 'specific purpose'
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 5c6e12ad0b7a..17d4bcc34091 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -46,6 +46,13 @@ static inline int numa_emu_cmdline(char *str)
 }
 #endif /* CONFIG_NUMA_EMU */
 
+#ifdef CONFIG_NUMA_KEEP_MEMINFO
+extern int phys_to_target_node(phys_addr_t start);
+#define phys_to_target_node phys_to_target_node
+extern int memory_add_physaddr_to_nid(u64 start);
+#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
+#endif /* CONFIG_NUMA_KEEP_MEMINFO */
+
 #endif /* CONFIG_NUMA_MEMBLKS */
 
 #endif	/* __NUMA_MEMBLKS_H */
diff --git a/mm/numa.c b/mm/numa.c
index 0483cabc4c4b..64c30cab2208 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -3,6 +3,7 @@
 #include <linux/memblock.h>
 #include <linux/printk.h>
 #include <linux/numa.h>
+#include <linux/numa_memblks.h>
 
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 640f3a3ce0ee..46ac3f998b4e 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -525,3 +525,41 @@ int __init numa_fill_memblks(u64 start, u64 end)
 	}
 	return 0;
 }
+
+#ifdef CONFIG_NUMA_KEEP_MEMINFO
+static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
+{
+	int i;
+
+	for (i = 0; i < mi->nr_blks; i++)
+		if (mi->blk[i].start <= start && mi->blk[i].end > start)
+			return mi->blk[i].nid;
+	return NUMA_NO_NODE;
+}
+
+int phys_to_target_node(phys_addr_t start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	/*
+	 * Prefer online nodes, but if reserved memory might be
+	 * hot-added continue the search with reserved ranges.
+	 */
+	if (nid != NUMA_NO_NODE)
+		return nid;
+
+	return meminfo_to_nid(&numa_reserved_meminfo, start);
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
+
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_meminfo.blk[0].nid;
+	return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+
+#endif /* CONFIG_NUMA_KEEP_MEMINFO */
-- 
2.43.0



* Re: [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data
  2024-07-16 11:13 ` [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
@ 2024-07-16 13:07   ` Jiaxun Yang
  2024-07-17 14:33   ` David Hildenbrand
  2024-07-19 15:27   ` Jonathan Cameron
  2 siblings, 0 replies; 60+ messages in thread
From: Jiaxun Yang @ 2024-07-16 13:07 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S . Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	John Paul Adrian Glaubitz, Jonathan Cameron, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips@vger.kernel.org,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-acpi, linux-cxl, nvdimm, devicetree, linux-arch, linux-mm,
	x86



On July 16, 2024, at 7:13 PM, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Make definition of node_data match other architectures.
> This will allow pulling declaration of node_data to the generic mm code in
> the following commit.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>

MIPS should move to arch_numa at some point as well.

Thanks
- Jiaxun

> ---
>  arch/mips/include/asm/mach-loongson64/mmzone.h | 4 ++--
>  arch/mips/loongson64/numa.c                    | 8 ++++----
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h 
> b/arch/mips/include/asm/mach-loongson64/mmzone.h
> index a3d65d37b8b5..2effd5f8ed62 100644
> --- a/arch/mips/include/asm/mach-loongson64/mmzone.h
> +++ b/arch/mips/include/asm/mach-loongson64/mmzone.h
> @@ -14,9 +14,9 @@
>  #define pa_to_nid(addr)  (((addr) & 0xf00000000000) >> 
> NODE_ADDRSPACE_SHIFT)
>  #define nid_to_addrbase(nid) ((unsigned long)(nid) << 
> NODE_ADDRSPACE_SHIFT)
> 
> -extern struct pglist_data *__node_data[];
> +extern struct pglist_data *node_data[];
> 
> -#define NODE_DATA(n)		(__node_data[n])
> +#define NODE_DATA(n)		(node_data[n])
> 
>  extern void __init prom_init_numa_memory(void);
> 
> diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> index 68dafd6d3e25..b50ce28d2741 100644
> --- a/arch/mips/loongson64/numa.c
> +++ b/arch/mips/loongson64/numa.c
> @@ -29,8 +29,8 @@
> 
>  unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES];
>  EXPORT_SYMBOL(__node_distances);
> -struct pglist_data *__node_data[MAX_NUMNODES];
> -EXPORT_SYMBOL(__node_data);
> +struct pglist_data *node_data[MAX_NUMNODES];
> +EXPORT_SYMBOL(node_data);
> 
>  cpumask_t __node_cpumask[MAX_NUMNODES];
>  EXPORT_SYMBOL(__node_cpumask);
> @@ -107,7 +107,7 @@ static void __init node_mem_init(unsigned int node)
>  	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>  	if (tnid != node)
>  		pr_info("NODE_DATA(%d) on node %d\n", node, tnid);
> -	__node_data[node] = nd;
> +	node_data[node] = nd;
>  	NODE_DATA(node)->node_start_pfn = start_pfn;
>  	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;
> 
> @@ -206,5 +206,5 @@ pg_data_t * __init arch_alloc_nodedata(int nid)
> 
>  void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
>  {
> -	__node_data[nid] = pgdat;
> +	node_data[nid] = pgdat;
>  }
> -- 
> 2.43.0

-- 
- Jiaxun


* Re: [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
  2024-07-16 11:13 ` [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
@ 2024-07-17 14:32   ` David Hildenbrand
  2024-07-19 14:38     ` Jonathan Cameron
  0 siblings, 1 reply; 60+ messages in thread
From: David Hildenbrand @ 2024-07-17 14:32 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David S. Miller, Greg Kroah-Hartman, Heiko Carstens,
	Huacai Chen, Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Jonathan Cameron, Michael Ellerman, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> sgi-ip27 is the only system that defines NODE_DATA() differently than
> the rest of NUMA machines.
> 
> Add node_data array of struct pglist pointers that will point to
> __node_data[node]->pglist and redefine NODE_DATA() to use node_data
> array.
> 
> This will allow pulling declaration of node_data to the generic mm code
> in the next commit.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>   arch/mips/include/asm/mach-ip27/mmzone.h | 5 ++++-
>   arch/mips/sgi-ip27/ip27-memory.c         | 5 ++++-
>   2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
> index 08c36e50a860..629c3f290203 100644
> --- a/arch/mips/include/asm/mach-ip27/mmzone.h
> +++ b/arch/mips/include/asm/mach-ip27/mmzone.h
> @@ -22,7 +22,10 @@ struct node_data {
>   
>   extern struct node_data *__node_data[];
>   
> -#define NODE_DATA(n)		(&__node_data[(n)]->pglist)
>   #define hub_data(n)		(&__node_data[(n)]->hub)
>   
> +extern struct pglist_data *node_data[];
> +
> +#define NODE_DATA(nid)		(node_data[nid])
> +
>   #endif /* _ASM_MACH_MMZONE_H */
> diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
> index b8ca94cfb4fe..c30ef6958b97 100644
> --- a/arch/mips/sgi-ip27/ip27-memory.c
> +++ b/arch/mips/sgi-ip27/ip27-memory.c
> @@ -34,8 +34,10 @@
>   #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
>   #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
>   
> -struct node_data *__node_data[MAX_NUMNODES];
> +struct pglist_data *node_data[MAX_NUMNODES];
> +EXPORT_SYMBOL(node_data);
>   
> +struct node_data *__node_data[MAX_NUMNODES];
>   EXPORT_SYMBOL(__node_data);
>   
>   static u64 gen_region_mask(void)
> @@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
>   	 */
>   	__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
>   	memset(__node_data[node], 0, PAGE_SIZE);
> +	node_data[node] = &__node_data[node]->pglist;
>   
>   	NODE_DATA(node)->node_start_pfn = start_pfn;
>   	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;

I was assuming we could get rid of __node_data->pglist.

But now I am confused where that is actually set.

Anyhow

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data
  2024-07-16 11:13 ` [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
  2024-07-16 13:07   ` Jiaxun Yang
@ 2024-07-17 14:33   ` David Hildenbrand
  2024-07-19 15:27   ` Jonathan Cameron
  2 siblings, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-17 14:33 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David S. Miller, Greg Kroah-Hartman, Heiko Carstens,
	Huacai Chen, Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Jonathan Cameron, Michael Ellerman, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Make definition of node_data match other architectures.
> This will allow pulling declaration of node_data to the generic mm code in
> the following commit.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code
  2024-07-16 11:13 ` [PATCH 04/17] arch, mm: move definition of node_data to generic code Mike Rapoport
@ 2024-07-17 14:35   ` David Hildenbrand
  2024-07-19 15:39   ` Jonathan Cameron
  2024-07-23  0:15   ` Davidlohr Bueso
  2 siblings, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-17 14:35 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David S. Miller, Greg Kroah-Hartman, Heiko Carstens,
	Huacai Chen, Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Jonathan Cameron, Michael Ellerman, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Every architecture that supports NUMA defines node_data in the same way:
> 
> 	struct pglist_data *node_data[MAX_NUMNODES];
> 
> No reason to keep multiple copies of this definition and its forward
> declarations, especially when such forward declaration is the only thing
> in include/asm/mmzone.h for many architectures.
> 
> Add definition and declaration of node_data to generic code and drop
> architecture-specific versions.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>   arch/arm64/include/asm/Kbuild                  |  1 +
>   arch/arm64/include/asm/mmzone.h                | 13 -------------
>   arch/arm64/include/asm/topology.h              |  1 +
>   arch/loongarch/include/asm/Kbuild              |  1 +
>   arch/loongarch/include/asm/mmzone.h            | 16 ----------------
>   arch/loongarch/include/asm/topology.h          |  1 +
>   arch/loongarch/kernel/numa.c                   |  3 ---
>   arch/mips/include/asm/mach-ip27/mmzone.h       |  4 ----
>   arch/mips/include/asm/mach-loongson64/mmzone.h |  4 ----
>   arch/mips/loongson64/numa.c                    |  2 --
>   arch/mips/sgi-ip27/ip27-memory.c               |  3 ---
>   arch/powerpc/include/asm/mmzone.h              |  6 ------
>   arch/powerpc/mm/numa.c                         |  2 --
>   arch/riscv/include/asm/Kbuild                  |  1 +
>   arch/riscv/include/asm/mmzone.h                | 13 -------------
>   arch/riscv/include/asm/topology.h              |  4 ++++
>   arch/s390/include/asm/Kbuild                   |  1 +
>   arch/s390/include/asm/mmzone.h                 | 17 -----------------
>   arch/s390/kernel/numa.c                        |  3 ---
>   arch/sh/include/asm/mmzone.h                   |  3 ---
>   arch/sh/mm/numa.c                              |  3 ---
>   arch/sparc/include/asm/mmzone.h                |  4 ----
>   arch/sparc/mm/init_64.c                        |  2 --
>   arch/x86/include/asm/Kbuild                    |  1 +
>   arch/x86/include/asm/mmzone.h                  |  6 ------
>   arch/x86/include/asm/mmzone_32.h               | 17 -----------------
>   arch/x86/include/asm/mmzone_64.h               | 18 ------------------
>   arch/x86/mm/numa.c                             |  3 ---
>   drivers/base/arch_numa.c                       |  2 --
>   include/asm-generic/mmzone.h                   |  5 +++++
>   include/linux/numa.h                           |  3 +++
>   mm/numa.c                                      |  3 +++
>   32 files changed, 22 insertions(+), 144 deletions(-)
>   delete mode 100644 arch/arm64/include/asm/mmzone.h
>   delete mode 100644 arch/loongarch/include/asm/mmzone.h
>   delete mode 100644 arch/riscv/include/asm/mmzone.h
>   delete mode 100644 arch/s390/include/asm/mmzone.h
>   delete mode 100644 arch/x86/include/asm/mmzone.h
>   delete mode 100644 arch/x86/include/asm/mmzone_32.h
>   delete mode 100644 arch/x86/include/asm/mmzone_64.h
>   create mode 100644 include/asm-generic/mmzone.h

Nice!

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 01/17] mm: move kernel/numa.c to mm/
  2024-07-16 11:13 ` [PATCH 01/17] mm: move kernel/numa.c to mm/ Mike Rapoport
@ 2024-07-17 14:35   ` David Hildenbrand
  2024-07-19 13:55   ` Jonathan Cameron
  1 sibling, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-17 14:35 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David S. Miller, Greg Kroah-Hartman, Heiko Carstens,
	Huacai Chen, Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Jonathan Cameron, Michael Ellerman, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> The stub functions in kernel/numa.c belong to mm/ rather than to kernel/
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-16 11:13 ` [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA " Mike Rapoport
@ 2024-07-17 14:42   ` David Hildenbrand
  2024-07-18  7:02     ` Mike Rapoport
  2024-07-20 10:24     ` Mike Rapoport
  2024-07-19 16:11   ` Jonathan Cameron
  1 sibling, 2 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-17 14:42 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David S. Miller, Greg Kroah-Hartman, Heiko Carstens,
	Huacai Chen, Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Jonathan Cameron, Michael Ellerman, Palmer Dabbelt,
	Rafael J. Wysocki, Rob Herring, Thomas Bogendoerfer,
	Thomas Gleixner, Vasily Gorbik, Will Deacon, linux-arm-kernel,
	loongarch, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, linux-acpi, linux-cxl, nvdimm, devicetree,
	linux-arch, linux-mm, x86

On 16.07.24 13:13, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Architectures that support NUMA duplicate the code that allocates
> NODE_DATA on the node-local memory with slight variations in reporting
> of the addresses where the memory was allocated.
> 
> Use x86 version as the basis for the generic alloc_node_data() function
> and call this function in architecture specific numa initialization.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

[...]

> diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> index 9208eaadf690..909f6cec3a26 100644
> --- a/arch/mips/loongson64/numa.c
> +++ b/arch/mips/loongson64/numa.c
> @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
>   
>   static void __init node_mem_init(unsigned int node)
>   {
> -	struct pglist_data *nd;
>   	unsigned long node_addrspace_offset;
>   	unsigned long start_pfn, end_pfn;
> -	unsigned long nd_pa;
> -	int tnid;
> -	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);

One interesting change is that we now always round up to full pages on 
architectures where we previously rounded up to SMP_CACHE_BYTES.

I assume we don't really expect a significant growth in memory 
consumption that we care about, especially because most systems with 
many nodes also have quite some memory around.


> -/* Allocate NODE_DATA for a node on the local memory */
> -static void __init alloc_node_data(int nid)
> -{
> -	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> -	u64 nd_pa;
> -	void *nd;
> -	int tnid;
> -
> -	/*
> -	 * Allocate node data.  Try node-local memory and then any node.
> -	 * Never allocate in DMA zone.
> -	 */
> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> -	if (!nd_pa) {
> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> -		       nd_size, nid);
> -		return;
> -	}
> -	nd = __va(nd_pa);
> -
> -	/* report and initialize */
> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> -	       nd_pa, nd_pa + nd_size - 1);
> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> -	if (tnid != nid)
> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> -
> -	node_data[nid] = nd;
> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -	node_set_online(nid);
> -}
> -
>   /**
>    * numa_cleanup_meminfo - Cleanup a numa_meminfo
>    * @mi: numa_meminfo to clean up
> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>   			continue;
>   
>   		alloc_node_data(nid);
> +		node_set_online(nid);
>   	}

I can spot that we only remove a single node_set_online() call from x86.

What about all the other architectures? Will there be any change in 
behavior for them? Or do we simply set the nodes online later once more?

-- 
Cheers,

David / dhildenb



* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-17 14:42   ` David Hildenbrand
@ 2024-07-18  7:02     ` Mike Rapoport
  2024-07-19 15:07       ` David Hildenbrand
  2024-07-20 10:24     ` Mike Rapoport
  1 sibling, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-18  7:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David S. Miller, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Wed, Jul 17, 2024 at 04:42:48PM +0200, David Hildenbrand wrote:
> On 16.07.24 13:13, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Architectures that support NUMA duplicate the code that allocates
> > NODE_DATA on the node-local memory with slight variations in reporting
> > of the addresses where the memory was allocated.
> > 
> > Use x86 version as the basis for the generic alloc_node_data() function
> > and call this function in architecture specific numa initialization.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> 
> [...]
> 
> > diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> > index 9208eaadf690..909f6cec3a26 100644
> > --- a/arch/mips/loongson64/numa.c
> > +++ b/arch/mips/loongson64/numa.c
> > @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
> >   static void __init node_mem_init(unsigned int node)
> >   {
> > -	struct pglist_data *nd;
> >   	unsigned long node_addrspace_offset;
> >   	unsigned long start_pfn, end_pfn;
> > -	unsigned long nd_pa;
> > -	int tnid;
> > -	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
> 
> One interesting change is that we now always round up to full pages on
> architectures where we previously rounded up to SMP_CACHE_BYTES.

On my workstation struct pglist_data takes 174400 bytes (cachelines: 2725, members: 43).
 
> I assume we don't really expect a significant growth in memory consumption
> that we care about, especially because most systems with many nodes also
> have  quite some memory around.

With the Debian kernel configuration for 6.5, struct pglist_data takes 174400
bytes, so the increase here is below 1%.

For NUMA systems with a lot of nodes that shouldn't be a problem.
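
As a rough check of that figure, here is a small user-space sketch of the
arithmetic. The 4096-byte page size and the 64-byte SMP_CACHE_BYTES are
assumptions for illustration (neither is stated in this thread), and the
helper names are made up:

```c
#include <stddef.h>

/* Round x up to the next multiple of align (align must be non-zero). */
static size_t round_up_to(size_t x, size_t align)
{
	return ((x + align - 1) / align) * align;
}

/*
 * Extra bytes allocated per node when the size is rounded up to
 * page_size instead of cache_bytes. Both values are assumptions
 * supplied by the caller, not taken from any kernel config.
 */
static size_t extra_bytes(size_t size, size_t page_size, size_t cache_bytes)
{
	return round_up_to(size, page_size) - round_up_to(size, cache_bytes);
}
```

With those assumptions, 174400 bytes is already a multiple of 64, rounding to
full pages gives 176128 bytes, and the extra 1728 bytes per node is about
0.99% of the structure size.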

> > -/* Allocate NODE_DATA for a node on the local memory */
> > -static void __init alloc_node_data(int nid)
> > -{
> > -	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> > -	u64 nd_pa;
> > -	void *nd;
> > -	int tnid;
> > -
> > -	/*
> > -	 * Allocate node data.  Try node-local memory and then any node.
> > -	 * Never allocate in DMA zone.
> > -	 */
> > -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> > -	if (!nd_pa) {
> > -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> > -		       nd_size, nid);
> > -		return;
> > -	}
> > -	nd = __va(nd_pa);
> > -
> > -	/* report and initialize */
> > -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> > -	       nd_pa, nd_pa + nd_size - 1);
> > -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> > -	if (tnid != nid)
> > -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> > -
> > -	node_data[nid] = nd;
> > -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > -
> > -	node_set_online(nid);
> > -}
> > -
> >   /**
> >    * numa_cleanup_meminfo - Cleanup a numa_meminfo
> >    * @mi: numa_meminfo to clean up
> > @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >   			continue;
> >   		alloc_node_data(nid);
> > +		node_set_online(nid);
> >   	}
> 
> I can spot that we only remove a single node_set_online() call from x86.
> 
> What about all the other architectures? Will there be any change in behavior
> for them? Or do we simply set the nodes online later once more?

On x86 node_set_online() was a part of alloc_node_data() and I moved it
outside so it's called right after alloc_node_data(). On other
architectures the allocation didn't include that call, so there should be
no difference there.
 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks
  2024-07-16 11:13 ` [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
@ 2024-07-18 21:46   ` Samuel Holland
  2024-07-19  5:55     ` Mike Rapoport
  2024-07-19 17:48   ` Jonathan Cameron
  1 sibling, 1 reply; 60+ messages in thread
From: Samuel Holland @ 2024-07-18 21:46 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexander Gordeev, Andreas Larsson, Andrew Morton, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christophe Leroy, Dan Williams,
	Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On 2024-07-16 6:13 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Move code dealing with numa_distance array from arch/x86 to
> mm/numa_memblks.c
> 
> This code will be later reused by arch_numa.
> 
> No functional changes.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/mm/numa.c                   | 101 ---------------------------
>  arch/x86/mm/numa_internal.h          |   2 -
>  include/linux/numa_memblks.h         |   4 ++
>  {arch/x86/mm => mm}/numa_emulation.c |   0
>  mm/numa_memblks.c                    | 101 +++++++++++++++++++++++++++
>  5 files changed, 105 insertions(+), 103 deletions(-)
>  rename {arch/x86/mm => mm}/numa_emulation.c (100%)

The numa_emulation.c rename looks like it should be part of the next commit, not
this one.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks
  2024-07-18 21:46   ` Samuel Holland
@ 2024-07-19  5:55     ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-19  5:55 UTC (permalink / raw)
  To: Samuel Holland
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Thu, Jul 18, 2024 at 04:46:17PM -0500, Samuel Holland wrote:
> On 2024-07-16 6:13 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Move code dealing with numa_distance array from arch/x86 to
> > mm/numa_memblks.c
> > 
> > This code will be later reused by arch_numa.
> > 
> > No functional changes.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/mm/numa.c                   | 101 ---------------------------
> >  arch/x86/mm/numa_internal.h          |   2 -
> >  include/linux/numa_memblks.h         |   4 ++
> >  {arch/x86/mm => mm}/numa_emulation.c |   0
> >  mm/numa_memblks.c                    | 101 +++++++++++++++++++++++++++
> >  5 files changed, 105 insertions(+), 103 deletions(-)
> >  rename {arch/x86/mm => mm}/numa_emulation.c (100%)
> 
> The numa_emulation.c rename looks like it should be part of the next commit, not
> this one.

Right, thanks!

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/17] mm: introduce numa_memblks
  2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
                   ` (16 preceding siblings ...)
  2024-07-16 11:13 ` [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
@ 2024-07-19 13:33 ` Jonathan Cameron
  2024-07-22  8:08   ` Mike Rapoport
  17 siblings, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 13:33 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:29 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Hi,
> 
> Following the discussion about handling of CXL fixed memory windows on
> arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
> the generic code so they will be available on arm64/riscv and maybe on
> loongarch sometime later.
> 
> While it could be possible to use memblock to describe CXL memory windows,
> it currently lacks notion of unpopulated memory ranges and numa_memblks
> does implement this.
> 
> Another reason to make numa_memblks generic is that both arch_numa (arm64
> and riscv) and loongarch use trimmed copy of x86 code although there is no
> fundamental reason why the same code cannot be used on all these platforms.
> Having numa_memblks in mm/ will make it's interaction with ACPI and FDT
> more consistent and I believe will reduce maintenance burden.
> 
> And with generic numa_memblks it is (almost) straightforward to enable NUMA
> emulation on arm64 and riscv.
> 
> The first 5 commits in this series are cleanups that are not strictly
> related to numa_memblks.
> 
> Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
> and NUMA emulation to the generic code.
> 
> Commits 12-14 actually move the code from arch/x86/ to mm/ and commit 15
> does some aftermath cleanups.
> 
> Commit 16 switches arch_numa to numa_memblks.
> 
> Commit 17 enables usage of phys_to_target_node() and
> memory_add_physaddr_to_nid() with numa_memblks.

Hi Mike,

I've lightly tested with emulated CXL + Generic Ports and Generic
Initiators as well as more normal cpus and memory via qemu on arm64 and it's
looking good.

From my earlier series, patch 4 is probably still needed to avoid
presenting nodes with nothing in them at boot (though not if we hotplug
memory and then remove it again, in which case they disappear):
https://lore.kernel.org/all/20240529171236.32002-5-Jonathan.Cameron@huawei.com/
However, that was broken/inconsistent before your rework, so I can send that
patch separately.

Thanks for getting this sorted!  I should get time to do more extensive
testing and review in the next week or so.

Jonathan

> 
> [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/
> 
> Mike Rapoport (Microsoft) (17):
>   mm: move kernel/numa.c to mm/
>   MIPS: sgi-ip27: make NODE_DATA() the same as on all other
>     architectures
>   MIPS: loongson64: rename __node_data to node_data
>   arch, mm: move definition of node_data to generic code
>   arch, mm: pull out allocation of NODE_DATA to generic code
>   x86/numa: simplify numa_distance allocation
>   x86/numa: move FAKE_NODE_* defines to numa_emu
>   x86/numa_emu: simplify allocation of phys_dist
>   x86/numa_emu: split __apicid_to_node update to a helper function
>   x86/numa_emu: use a helper function to get MAX_DMA32_PFN
>   x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
>   mm: introduce numa_memblks
>   mm: move numa_distance and related code from x86 to numa_memblks
>   mm: introduce numa_emulation
>   mm: make numa_memblks more self-contained
>   arch_numa: switch over to numa_memblks
>   mm: make range-to-target_node lookup facility a part of numa_memblks
> 
>  arch/arm64/include/asm/Kbuild                 |   1 +
>  arch/arm64/include/asm/mmzone.h               |  13 -
>  arch/arm64/include/asm/topology.h             |   1 +
>  arch/loongarch/include/asm/Kbuild             |   1 +
>  arch/loongarch/include/asm/mmzone.h           |  16 -
>  arch/loongarch/include/asm/topology.h         |   1 +
>  arch/loongarch/kernel/numa.c                  |  21 -
>  arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
>  .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
>  arch/mips/loongson64/numa.c                   |  20 +-
>  arch/mips/sgi-ip27/ip27-memory.c              |   2 +-
>  arch/powerpc/include/asm/mmzone.h             |   6 -
>  arch/powerpc/mm/numa.c                        |  26 +-
>  arch/riscv/include/asm/Kbuild                 |   1 +
>  arch/riscv/include/asm/mmzone.h               |  13 -
>  arch/riscv/include/asm/topology.h             |   4 +
>  arch/s390/include/asm/Kbuild                  |   1 +
>  arch/s390/include/asm/mmzone.h                |  17 -
>  arch/s390/kernel/numa.c                       |   3 -
>  arch/sh/include/asm/mmzone.h                  |   3 -
>  arch/sh/mm/init.c                             |   7 +-
>  arch/sh/mm/numa.c                             |   3 -
>  arch/sparc/include/asm/mmzone.h               |   4 -
>  arch/sparc/mm/init_64.c                       |  11 +-
>  arch/x86/Kconfig                              |   9 +-
>  arch/x86/include/asm/Kbuild                   |   1 +
>  arch/x86/include/asm/mmzone.h                 |   6 -
>  arch/x86/include/asm/mmzone_32.h              |  17 -
>  arch/x86/include/asm/mmzone_64.h              |  18 -
>  arch/x86/include/asm/numa.h                   |  24 +-
>  arch/x86/include/asm/sparsemem.h              |   9 -
>  arch/x86/mm/Makefile                          |   1 -
>  arch/x86/mm/amdtopology.c                     |   1 +
>  arch/x86/mm/numa.c                            | 618 +-----------------
>  arch/x86/mm/numa_internal.h                   |  24 -
>  drivers/acpi/numa/srat.c                      |   1 +
>  drivers/base/Kconfig                          |   1 +
>  drivers/base/arch_numa.c                      | 223 ++-----
>  drivers/cxl/Kconfig                           |   2 +-
>  drivers/dax/Kconfig                           |   2 +-
>  drivers/of/of_numa.c                          |   1 +
>  include/asm-generic/mmzone.h                  |   5 +
>  include/asm-generic/numa.h                    |   6 +-
>  include/linux/numa.h                          |   5 +
>  include/linux/numa_memblks.h                  |  58 ++
>  kernel/Makefile                               |   1 -
>  kernel/numa.c                                 |  26 -
>  mm/Kconfig                                    |  11 +
>  mm/Makefile                                   |   3 +
>  mm/numa.c                                     |  57 ++
>  {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
>  mm/numa_memblks.c                             | 565 ++++++++++++++++
>  52 files changed, 847 insertions(+), 1070 deletions(-)
>  delete mode 100644 arch/arm64/include/asm/mmzone.h
>  delete mode 100644 arch/loongarch/include/asm/mmzone.h
>  delete mode 100644 arch/riscv/include/asm/mmzone.h
>  delete mode 100644 arch/s390/include/asm/mmzone.h
>  delete mode 100644 arch/x86/include/asm/mmzone.h
>  delete mode 100644 arch/x86/include/asm/mmzone_32.h
>  delete mode 100644 arch/x86/include/asm/mmzone_64.h
>  create mode 100644 include/asm-generic/mmzone.h
>  create mode 100644 include/linux/numa_memblks.h
>  delete mode 100644 kernel/numa.c
>  create mode 100644 mm/numa.c
>  rename {arch/x86/mm => mm}/numa_emulation.c (94%)
>  create mode 100644 mm/numa_memblks.c
> 
> 
> base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/17] mm: move kernel/numa.c to mm/
  2024-07-16 11:13 ` [PATCH 01/17] mm: move kernel/numa.c to mm/ Mike Rapoport
  2024-07-17 14:35   ` David Hildenbrand
@ 2024-07-19 13:55   ` Jonathan Cameron
  1 sibling, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 13:55 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:30 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> The stub functions in kernel/numa.c belong to mm/ rather than to kernel/
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Makes sense. All arch-specific implementations are in arch/*/mm, not
arch/*/kernel, so this makes it more consistent with that.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
  2024-07-17 14:32   ` David Hildenbrand
@ 2024-07-19 14:38     ` Jonathan Cameron
  2024-07-22  7:34       ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 14:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, linux-kernel, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dan Williams, Dave Hansen, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Wed, 17 Jul 2024 16:32:59 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 16.07.24 13:13, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > sgi-ip27 is the only system that defines NODE_DATA() differently than
> > the rest of NUMA machines.
> > 
> > Add node_data array of struct pglist pointers that will point to
> > __node_data[node]->pglist and redefine NODE_DATA() to use node_data
> > array.
> > 
> > This will allow pulling declaration of node_data to the generic mm code
> > in the next commit.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >   arch/mips/include/asm/mach-ip27/mmzone.h | 5 ++++-
> >   arch/mips/sgi-ip27/ip27-memory.c         | 5 ++++-
> >   2 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
> > index 08c36e50a860..629c3f290203 100644
> > --- a/arch/mips/include/asm/mach-ip27/mmzone.h
> > +++ b/arch/mips/include/asm/mach-ip27/mmzone.h
> > @@ -22,7 +22,10 @@ struct node_data {
> >   
> >   extern struct node_data *__node_data[];
> >   
> > -#define NODE_DATA(n)		(&__node_data[(n)]->pglist)
> >   #define hub_data(n)		(&__node_data[(n)]->hub)
> >   
> > +extern struct pglist_data *node_data[];
> > +
> > +#define NODE_DATA(nid)		(node_data[nid])
> > +
> >   #endif /* _ASM_MACH_MMZONE_H */
> > diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
> > index b8ca94cfb4fe..c30ef6958b97 100644
> > --- a/arch/mips/sgi-ip27/ip27-memory.c
> > +++ b/arch/mips/sgi-ip27/ip27-memory.c
> > @@ -34,8 +34,10 @@
> >   #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
> >   #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
> >   
> > -struct node_data *__node_data[MAX_NUMNODES];
> > +struct pglist_data *node_data[MAX_NUMNODES];
> > +EXPORT_SYMBOL(node_data);
> >   
> > +struct node_data *__node_data[MAX_NUMNODES];
> >   EXPORT_SYMBOL(__node_data);
> >   
> >   static u64 gen_region_mask(void)
> > @@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
> >   	 */
> >   	__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
> >   	memset(__node_data[node], 0, PAGE_SIZE);
> > +	node_data[node] = &__node_data[node]->pglist;
> >   
> >   	NODE_DATA(node)->node_start_pfn = start_pfn;
> >   	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;  
> 
> I was assuming we could get rid of __node_data->pglist.
> 
> But now I am confused where that is actually set.

It looks nasty... arch_refresh_nodedata() takes the incoming pg_data_t * and
casts it to the local struct node_data *, which I think is this one:

struct node_data {
	struct pglist_data pglist;	/* i.e. pg_data_t pglist */
	struct hub_data hub;
};

https://elixir.bootlin.com/linux/v6.10/source/arch/mips/sgi-ip27/ip27-memory.c#L432

Now, that pg_data_t is allocated by arch_alloc_nodedata(), which might be fine
(though the types could be handled in a more readable fashion via some
container_of() magic):
https://elixir.bootlin.com/linux/v6.10/source/arch/mips/sgi-ip27/ip27-memory.c#L427

However that call is:
pg_data_t * __init arch_alloc_nodedata(int nid)
{
	return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);
}

So it doesn't seem to allocate enough space to me, as the size should be
sizeof(struct node_data).

Worth cleaning up whilst here?  Proper handling of types would definitely
help.
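
To illustrate the container_of() idea and the size mismatch, here is a minimal
user-space sketch. The struct layouts are stand-ins (the real pglist_data and
hub_data are much larger), the container_of() macro is a simplified version of
the kernel one, and the helper names are hypothetical:

```c
#include <stddef.h>

/* Stand-in types: only the nesting matters here, not the fields. */
struct pglist_data { unsigned long node_start_pfn; unsigned long node_spanned_pages; };
struct hub_data { unsigned long placeholder; };

struct node_data {
	struct pglist_data pglist;
	struct hub_data hub;
};

/* Simplified container_of(); the kernel macro adds type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing node_data from a pointer to its pglist member. */
static struct node_data *pgdat_to_node_data(struct pglist_data *pgdat)
{
	return container_of(pgdat, struct node_data, pglist);
}

/* Bytes missing if only sizeof(struct pglist_data) is allocated. */
static size_t node_data_shortfall(void)
{
	return sizeof(struct node_data) - sizeof(struct pglist_data);
}
```

The conversion is only safe when the pointer really is embedded in a full
struct node_data; an allocation of just sizeof(pg_data_t) leaves the hub part
outside the allocated region, which is the concern raised above.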

Jonathan


> 
> Anyhow
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-18  7:02     ` Mike Rapoport
@ 2024-07-19 15:07       ` David Hildenbrand
  2024-07-19 15:34         ` Mike Rapoport
  2024-07-19 15:51         ` Jonathan Cameron
  0 siblings, 2 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-19 15:07 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David S. Miller, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

>>> -	 * Allocate node data.  Try node-local memory and then any node.
>>> -	 * Never allocate in DMA zone.
>>> -	 */
>>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>> -	if (!nd_pa) {
>>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
>>> -		       nd_size, nid);
>>> -		return;
>>> -	}
>>> -	nd = __va(nd_pa);
>>> -
>>> -	/* report and initialize */
>>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
>>> -	       nd_pa, nd_pa + nd_size - 1);
>>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>> -	if (tnid != nid)
>>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
>>> -
>>> -	node_data[nid] = nd;
>>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>>> -
>>> -	node_set_online(nid);
>>> -}
>>> -
>>>    /**
>>>     * numa_cleanup_meminfo - Cleanup a numa_meminfo
>>>     * @mi: numa_meminfo to clean up
>>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>>    			continue;
>>>    		alloc_node_data(nid);
>>> +		node_set_online(nid);
>>>    	}
>>
>> I can spot that we only remove a single node_set_online() call from x86.
>>
>> What about all the other architectures? Will there be any change in behavior
>> for them? Or do we simply set the nodes online later once more?
> 
> On x86 node_set_online() was a part of alloc_node_data() and I moved it
> outside so it's called right after alloc_node_data(). On other
> architectures the allocation didn't include that call, so there should be
> no difference there.

But won't their arch code try setting the nodes online at a later stage?

And I think some architectures only set nodes online conditionally
(see most other node_set_online() calls).

Sorry if I'm confused here, but with the now-unconditional node_set_online(),
won't we change the behavior of other architectures?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data
  2024-07-16 11:13 ` [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
  2024-07-16 13:07   ` Jiaxun Yang
  2024-07-17 14:33   ` David Hildenbrand
@ 2024-07-19 15:27   ` Jonathan Cameron
  2 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 15:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:32 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Make definition of node_data match other architectures.
> This will allow pulling declaration of node_data to the generic mm code in
> the following commit.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
FWIW the rename looks fine.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-19 15:07       ` David Hildenbrand
@ 2024-07-19 15:34         ` Mike Rapoport
  2024-07-19 15:46           ` David Hildenbrand
  2024-07-19 15:51         ` Jonathan Cameron
  1 sibling, 1 reply; 60+ messages in thread
From: Mike Rapoport @ 2024-07-19 15:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David S. Miller, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 05:07:35PM +0200, David Hildenbrand wrote:
> > > > -	 * Allocate node data.  Try node-local memory and then any node.
> > > > -	 * Never allocate in DMA zone.
> > > > -	 */
> > > > -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> > > > -	if (!nd_pa) {
> > > > -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> > > > -		       nd_size, nid);
> > > > -		return;
> > > > -	}
> > > > -	nd = __va(nd_pa);
> > > > -
> > > > -	/* report and initialize */
> > > > -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> > > > -	       nd_pa, nd_pa + nd_size - 1);
> > > > -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> > > > -	if (tnid != nid)
> > > > -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> > > > -
> > > > -	node_data[nid] = nd;
> > > > -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> > > > -
> > > > -	node_set_online(nid);
> > > > -}
> > > > -
> > > >    /**
> > > >     * numa_cleanup_meminfo - Cleanup a numa_meminfo
> > > >     * @mi: numa_meminfo to clean up
> > > > @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > > >    			continue;
> > > >    		alloc_node_data(nid);
> > > > +		node_set_online(nid);
> > > >    	}
> > > 
> > > I can spot that we only remove a single node_set_online() call from x86.
> > > 
> > > What about all the other architectures? Will there be any change in behavior
> > > for them? Or do we simply set the nodes online later once more?
> > 
> > On x86 node_set_online() was a part of alloc_node_data() and I moved it
> > outside so it's called right after alloc_node_data(). On other
> > architectures the allocation didn't include that call, so there should be
> > no difference there.
> 
> But won't their arch code try setting the nodes online at a later stage?
> 
> And I think, some architectures only set nodes online conditionally
> (see most other node_set_online() calls).
> 
> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
> we change the behavior of other architectures?

The generic alloc_node_data() does not set the node online:

+/* Allocate NODE_DATA for a node on the local memory */
+void __init alloc_node_data(int nid)
+{
+	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	u64 nd_pa;
+	void *nd;
+	int tnid;
+
+	/* Allocate node data.  Try node-local memory and then any node. */
+	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	if (!nd_pa)
+		panic("Cannot allocate %zu bytes for node %d data\n",
+		      nd_size, nid);
+	nd = __va(nd_pa);
+
+	/* report and initialize */
+	pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
+		nd_pa, nd_pa + nd_size - 1);
+	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+	if (tnid != nid)
+		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+
+	node_data[nid] = nd;
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+}

I might have missed some architecture other than x86 that calls
node_set_online() in its alloc_node_data(), but the intention was to leave
that call outside the allocation and explicitly add it after the call to
alloc_node_data() where needed, as on x86.
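
The intended split can be sketched in user space as follows. Everything with
a _stub suffix is a hypothetical stand-in for the kernel object of similar
name, calloc() stands in for the memblock allocation, and the two caller
functions are invented for illustration; the sketch only shows who is
responsible for marking a node online:

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_NODES_STUB 4

struct pgdat_stub { int unused; };

static struct pgdat_stub *node_data_stub[MAX_NODES_STUB];
static bool node_online_stub[MAX_NODES_STUB];

/* Like the generic helper: allocate and zero, never touch online state. */
static void alloc_node_data_stub(int nid)
{
	node_data_stub[nid] = calloc(1, sizeof(struct pgdat_stub));
	if (!node_data_stub[nid])
		abort();	/* the generic code panics on failure */
}

static void node_set_online_stub(int nid)
{
	node_online_stub[nid] = true;
}

/* x86-style caller: sets the node online right after the allocation. */
static void register_node_x86_like(int nid)
{
	alloc_node_data_stub(nid);
	node_set_online_stub(nid);
}

/* Caller for an architecture that decides about online state elsewhere. */
static void register_node_other(int nid)
{
	alloc_node_data_stub(nid);
}
```

In this model only the x86-style caller changes the online state; a caller
that just allocates keeps whatever policy its architecture code already had.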

> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code
  2024-07-16 11:13 ` [PATCH 04/17] arch, mm: move definition of node_data to generic code Mike Rapoport
  2024-07-17 14:35   ` David Hildenbrand
@ 2024-07-19 15:39   ` Jonathan Cameron
  2024-07-23  0:15   ` Davidlohr Bueso
  2 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 15:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:33 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Every architecture that supports NUMA defines node_data in the same way:
> 
> 	struct pglist_data *node_data[MAX_NUMNODES];
> 
> No reason to keep multiple copies of this definition and its forward
> declarations, especially when such forward declaration is the only thing
> in include/asm/mmzone.h for many architectures.
> 
> Add definition and declaration of node_data to generic code and drop
> architecture-specific versions.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-19 15:34         ` Mike Rapoport
@ 2024-07-19 15:46           ` David Hildenbrand
  0 siblings, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-19 15:46 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David S. Miller, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On 19.07.24 17:34, Mike Rapoport wrote:
> On Fri, Jul 19, 2024 at 05:07:35PM +0200, David Hildenbrand wrote:
>>>>> -	 * Allocate node data.  Try node-local memory and then any node.
>>>>> -	 * Never allocate in DMA zone.
>>>>> -	 */
>>>>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>>>> -	if (!nd_pa) {
>>>>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
>>>>> -		       nd_size, nid);
>>>>> -		return;
>>>>> -	}
>>>>> -	nd = __va(nd_pa);
>>>>> -
>>>>> -	/* report and initialize */
>>>>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
>>>>> -	       nd_pa, nd_pa + nd_size - 1);
>>>>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>>>> -	if (tnid != nid)
>>>>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
>>>>> -
>>>>> -	node_data[nid] = nd;
>>>>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>>>>> -
>>>>> -	node_set_online(nid);
>>>>> -}
>>>>> -
>>>>>     /**
>>>>>      * numa_cleanup_meminfo - Cleanup a numa_meminfo
>>>>>      * @mi: numa_meminfo to clean up
>>>>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>>>>     			continue;
>>>>>     		alloc_node_data(nid);
>>>>> +		node_set_online(nid);
>>>>>     	}
>>>>
>>>> I can spot that we only remove a single node_set_online() call from x86.
>>>>
>>>> What about all the other architectures? Will there be any change in behavior
>>>> for them? Or do we simply set the nodes online later once more?
>>>
>>> On x86 node_set_online() was a part of alloc_node_data() and I moved it
>>> outside so it's called right after alloc_node_data(). On other
>>> architectures the allocation didn't include that call, so there should be
>>> no difference there.
>>
>> But won't their arch code try setting the nodes online at a later stage?
>>
>> And I think, some architectures only set nodes online conditionally
>> (see most other node_set_online() calls).
>>
>> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
>> we change the behavior of other architectures?
> 
> The generic alloc_node_data() does not set the node online:
> 
> +/* Allocate NODE_DATA for a node on the local memory */
> +void __init alloc_node_data(int nid)
> +{
> +	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +	u64 nd_pa;
> +	void *nd;
> +	int tnid;
> +
> +	/* Allocate node data.  Try node-local memory and then any node. */
> +	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> +	if (!nd_pa)
> +		panic("Cannot allocate %zu bytes for node %d data\n",
> +		      nd_size, nid);
> +	nd = __va(nd_pa);
> +
> +	/* report and initialize */
> +	pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> +		nd_pa, nd_pa + nd_size - 1);
> +	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> +	if (tnid != nid)
> +		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
> +
> +	node_data[nid] = nd;
> +	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> +}
> 
> I might have missed some architecture except x86 that calls
> node_set_online() in its alloc_node_data(), but the intention was to leave
> that call outside the alloc and explicitly add it after the call to
> alloc_node_data() if needed like in x86.

I'm stupid, I didn't realize it is still only called from x86 :(

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-19 15:07       ` David Hildenbrand
  2024-07-19 15:34         ` Mike Rapoport
@ 2024-07-19 15:51         ` Jonathan Cameron
  2024-07-19 16:07           ` David Hildenbrand
  1 sibling, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 15:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, linux-kernel, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dan Williams, Dave Hansen, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, 19 Jul 2024 17:07:35 +0200
David Hildenbrand <david@redhat.com> wrote:

> >>> -	 * Allocate node data.  Try node-local memory and then any node.
> >>> -	 * Never allocate in DMA zone.
> >>> -	 */
> >>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> >>> -	if (!nd_pa) {
> >>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> >>> -		       nd_size, nid);
> >>> -		return;
> >>> -	}
> >>> -	nd = __va(nd_pa);
> >>> -
> >>> -	/* report and initialize */
> >>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> >>> -	       nd_pa, nd_pa + nd_size - 1);
> >>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> >>> -	if (tnid != nid)
> >>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
> >>> -
> >>> -	node_data[nid] = nd;
> >>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> >>> -
> >>> -	node_set_online(nid);
> >>> -}
> >>> -
> >>>    /**
> >>>     * numa_cleanup_meminfo - Cleanup a numa_meminfo
> >>>     * @mi: numa_meminfo to clean up
> >>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >>>    			continue;
> >>>    		alloc_node_data(nid);
> >>> +		node_set_online(nid);
> >>>    	}  
> >>
> >> I can spot that we only remove a single node_set_online() call from x86.
> >>
> >> What about all the other architectures? Will there be any change in behavior
> >> for them? Or do we simply set the nodes online later once more?  
> > 
> > On x86 node_set_online() was a part of alloc_node_data() and I moved it
> > outside so it's called right after alloc_node_data(). On other
> > architectures the allocation didn't include that call, so there should be
> > no difference there.  
> 
> But won't their arch code try setting the nodes online at a later stage?
> 
> And I think, some architectures only set nodes online conditionally
> (see most other node_set_online() calls).
> 
> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
> we change the behavior of other architectures?
This is moving x86 code to x86 code, not to a generic location,
so how would that affect anyone else? Their onlining should be the same as
before.

The node onlining differences are a pain (I recall that fun from adding
generic initiators), as the ordering differs between x86 and arm64 at least.

Jonathan

> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/17] mm: introduce numa_emulation
  2024-07-16 11:13 ` [PATCH 14/17] mm: introduce numa_emulation Mike Rapoport
@ 2024-07-19 16:03   ` Zi Yan
  2024-07-20 12:09     ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Zi Yan @ 2024-07-19 16:03 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On 16 Jul 2024, at 7:13, Mike Rapoport wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Move numa_emulation code from arch/x86 to mm/numa_emulation.c
>
> This code will be later reused by arch_numa.
>
> No functional changes.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/Kconfig             |  8 --------
>  arch/x86/include/asm/numa.h  | 12 ------------
>  arch/x86/mm/Makefile         |  1 -
>  arch/x86/mm/numa_internal.h  | 11 -----------
>  include/linux/numa_memblks.h | 17 +++++++++++++++++
>  mm/Kconfig                   |  8 ++++++++
>  mm/Makefile                  |  1 +
>  mm/numa_emulation.c          |  4 +---
>  8 files changed, 27 insertions(+), 35 deletions(-)

After this code move, the documentation of numa=fake= should also be moved from
Documentation/arch/x86/x86_64/boot-options.rst to
Documentation/admin-guide/kernel-parameters.txt.

Something like:

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bc55fb55cd26..ce3659289b5e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4158,6 +4158,18 @@
                        Disable NUMA, Only set up a single NUMA node
                        spanning all memory.

+       numa=fake=<size>[MG]
+                       If given as a memory unit, fills all system RAM with nodes of
+                       size interleaved over physical nodes.
+
+       numa=fake=<N>
+                       If given as an integer, fills all system RAM with N fake nodes
+                       interleaved over physical nodes.
+
+       numa=fake=<N>U
+                       If given as an integer followed by 'U', it will divide each
+                       physical node into N emulated nodes.
+
        numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
                        NUMA balancing.
                        Allowed values are enable and disable
diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
index 137432d34109..98d4805f0823 100644
--- a/Documentation/arch/x86/x86_64/boot-options.rst
+++ b/Documentation/arch/x86/x86_64/boot-options.rst
@@ -170,18 +170,6 @@ NUMA
     Don't parse the HMAT table for NUMA setup, or soft-reserved memory
     partitioning.

-  numa=fake=<size>[MG]
-    If given as a memory unit, fills all system RAM with nodes of
-    size interleaved over physical nodes.
-
-  numa=fake=<N>
-    If given as an integer, fills all system RAM with N fake nodes
-    interleaved over physical nodes.
-
-  numa=fake=<N>U
-    If given as an integer followed by 'U', it will divide each
-    physical node into N emulated nodes.
-
 ACPI
 ====
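
For readers unfamiliar with the three numa=fake= forms documented above, a
minimal userspace sketch of how they can be told apart follows. It is not the
kernel's parser: parse_numa_fake(), struct fake_spec and the enum names are
hypothetical, and the real code may accept inputs (for example other suffix
spellings) that this simplified version rejects.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical classification of the text that follows "numa=fake=". */
enum fake_mode {
	FAKE_INVALID,
	FAKE_BY_SIZE,		/* numa=fake=<size>[MG] */
	FAKE_BY_COUNT,		/* numa=fake=<N> */
	FAKE_PER_PHYS_NODE,	/* numa=fake=<N>U */
};

struct fake_spec {
	enum fake_mode mode;
	unsigned long long value;	/* bytes for FAKE_BY_SIZE, otherwise a node count */
};

/* Illustrative helper, not a kernel function. */
static struct fake_spec parse_numa_fake(const char *arg)
{
	struct fake_spec spec = { FAKE_INVALID, 0 };
	unsigned long long n;
	char *end;

	assert(arg != NULL);

	/* Require a leading digit so strtoull() cannot skip spaces or a sign. */
	if (arg[0] < '0' || arg[0] > '9')
		return spec;

	n = strtoull(arg, &end, 10);
	if (n == 0)
		return spec;

	if (end[0] == '\0') {
		spec.mode = FAKE_BY_COUNT;
	} else if (end[1] != '\0') {
		return spec;		/* more than one trailing character */
	} else if (end[0] == 'M') {
		n <<= 20;
		spec.mode = FAKE_BY_SIZE;
	} else if (end[0] == 'G') {
		n <<= 30;
		spec.mode = FAKE_BY_SIZE;
	} else if (end[0] == 'U') {
		spec.mode = FAKE_PER_PHYS_NODE;
	} else {
		return spec;
	}

	spec.value = n;
	return spec;
}
```

With this sketch, "4" is classified as four interleaved fake nodes, "512M" as
fake nodes of 512 MiB each, and "4U" as four emulated nodes per physical node.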

Best Regards,
Yan, Zi

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-19 15:51         ` Jonathan Cameron
@ 2024-07-19 16:07           ` David Hildenbrand
  0 siblings, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2024-07-19 16:07 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mike Rapoport, linux-kernel, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dan Williams, Dave Hansen, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On 19.07.24 17:51, Jonathan Cameron wrote:
> On Fri, 19 Jul 2024 17:07:35 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>>>> -	 * Allocate node data.  Try node-local memory and then any node.
>>>>> -	 * Never allocate in DMA zone.
>>>>> -	 */
>>>>> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>>>> -	if (!nd_pa) {
>>>>> -		pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
>>>>> -		       nd_size, nid);
>>>>> -		return;
>>>>> -	}
>>>>> -	nd = __va(nd_pa);
>>>>> -
>>>>> -	/* report and initialize */
>>>>> -	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
>>>>> -	       nd_pa, nd_pa + nd_size - 1);
>>>>> -	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>>>> -	if (tnid != nid)
>>>>> -		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
>>>>> -
>>>>> -	node_data[nid] = nd;
>>>>> -	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>>>>> -
>>>>> -	node_set_online(nid);
>>>>> -}
>>>>> -
>>>>>     /**
>>>>>      * numa_cleanup_meminfo - Cleanup a numa_meminfo
>>>>>      * @mi: numa_meminfo to clean up
>>>>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>>>>     			continue;
>>>>>     		alloc_node_data(nid);
>>>>> +		node_set_online(nid);
>>>>>     	}
>>>>
>>>> I can spot that we only remove a single node_set_online() call from x86.
>>>>
>>>> What about all the other architectures? Will there be any change in behavior
>>>> for them? Or do we simply set the nodes online later once more?
>>>
>>> On x86 node_set_online() was a part of alloc_node_data() and I moved it
>>> outside so it's called right after alloc_node_data(). On other
>>> architectures the allocation didn't include that call, so there should be
>>> no difference there.
>>
>> But won't their arch code try setting the nodes online at a later stage?
>>
>> And I think, some architectures only set nodes online conditionally
>> (see most other node_set_online() calls).
>>
>> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
>> we change the behavior of other architectures?
> This is moving x86 code to x86 code, not to a generic location,
> so how would that affect anyone else? Their onlining should be the same as
> before.

Yes, see my reply to Mike.

> 
> The node onlining differences are a pain (I recall that fun from adding
> generic initiators), as the ordering differs between x86 and arm64 at least.

That's part of the reason I was confused, because I remember some nasty 
inconsistency.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-16 11:13 ` [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA " Mike Rapoport
  2024-07-17 14:42   ` David Hildenbrand
@ 2024-07-19 16:11   ` Jonathan Cameron
  1 sibling, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:11 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:34 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Architectures that support NUMA duplicate the code that allocates
> NODE_DATA on the node-local memory with slight variations in reporting
> of the addresses where the memory was allocated.
> 
> Use x86 version as the basis for the generic alloc_node_data() function
> and call this function in architecture specific numa initialization.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>


I've no idea what the rules are for the sparc prom_printf() calls, but given
that the file already mixes those and normal prints within
single functions, I assume this change is fine and we'll just
see the prints a bit later.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/17] x86/numa: simplify numa_distance allocation
  2024-07-16 11:13 ` [PATCH 06/17] x86/numa: simplify numa_distance allocation Mike Rapoport
@ 2024-07-19 16:28   ` Jonathan Cameron
  2024-07-22  7:51     ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:35 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Allocation of numa_distance uses memblock_phys_alloc_range() to limit
> allocation to be below the last mapped page.
> 
> But NUMA initializaition runs after the direct map is populated and

initialization (one too many 'i's)

> there is also code in setup_arch() that adjusts memblock limit to
> reflect how much memory is already mapped in the direct map.
> 
> Simplify the allocation of numa_distance and use plain memblock_alloc().
> This makes the code clearer and ensures that when numa_distance is not
> allocated it is always NULL.
Doesn't this break the comment in numa_set_distance() kernel-doc?
"
 * If such table cannot be allocated, a warning is printed and further
 * calls are ignored until the distance table is reset with
 * numa_reset_distance().
"

Superficially that looks to be to avoid repeatedly hitting the
singleton bit at the top of numa_set_distance() as SRAT or similar
parsing occurs.
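
The concern can be illustrated with a small userspace model. This is only a
sketch under stated assumptions: model_alloc(), model_set_distance() and
struct dist_table are hypothetical stand-ins, the 16-entry table is arbitrary,
and the real numa_set_distance() has additional bounds checks that are
collapsed here into a single sentinel test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int alloc_attempts;	/* how many times the allocator was tried */
static int alloc_should_fail;	/* test knob: simulate allocation failure */

/* Old-style "allocation failed, do not retry" marker (models (void *)1LU). */
#define DIST_FAILED ((unsigned char *)1UL)

struct dist_table {
	unsigned char *tbl;
	int use_sentinel;	/* 1: pre-patch behaviour, 0: post-patch behaviour */
};

/* Hypothetical stand-in for numa_alloc_distance(). */
static int model_alloc(struct dist_table *d)
{
	alloc_attempts++;
	if (alloc_should_fail) {
		if (d->use_sentinel)
			d->tbl = DIST_FAILED;	/* remember the failure */
		return -1;			/* otherwise tbl stays NULL */
	}
	d->tbl = calloc(16, 1);
	return d->tbl ? 0 : -1;
}

/* Hypothetical stand-in for the allocate-on-first-use check. */
static void model_set_distance(struct dist_table *d, int idx, unsigned char v)
{
	if (!d->tbl && model_alloc(d) < 0)
		return;
	if (d->tbl == DIST_FAILED)
		return;		/* earlier failure recorded: ignore the call */
	assert(idx >= 0 && idx < 16);
	d->tbl[idx] = v;
}
```

In this model, three calls after a failed allocation try the allocator once
with the sentinel and three times without it, which is the difference the
kernel-doc sentence above describes.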

> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/mm/numa.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 5e1dde26674b..ab2d4ecef786 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -319,8 +319,7 @@ void __init numa_reset_distance(void)
>  {
>  	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>  
> -	/* numa_distance could be 1LU marking allocation failure, test cnt */
> -	if (numa_distance_cnt)
> +	if (numa_distance)
>  		memblock_free(numa_distance, size);
>  	numa_distance_cnt = 0;
>  	numa_distance = NULL;	/* enable table creation */
> @@ -331,7 +330,6 @@ static int __init numa_alloc_distance(void)
>  	nodemask_t nodes_parsed;
>  	size_t size;
>  	int i, j, cnt = 0;
> -	u64 phys;
>  
>  	/* size the new table and allocate it */
>  	nodes_parsed = numa_nodes_parsed;
> @@ -342,16 +340,12 @@ static int __init numa_alloc_distance(void)
>  	cnt++;
>  	size = cnt * cnt * sizeof(numa_distance[0]);
>  
> -	phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0,
> -					 PFN_PHYS(max_pfn_mapped));
> -	if (!phys) {
> +	numa_distance = memblock_alloc(size, PAGE_SIZE);
> +	if (!numa_distance) {
>  		pr_warn("Warning: can't allocate distance table!\n");
> -		/* don't retry until explicitly reset */
> -		numa_distance = (void *)1LU;
>  		return -ENOMEM;
>  	}
>  
> -	numa_distance = __va(phys);
>  	numa_distance_cnt = cnt;
>  
>  	/* fill with the default distances */


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu
  2024-07-16 11:13 ` [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
@ 2024-07-19 16:30   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:36 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are
> only used by numa emulation code, make them local to
> arch/x86/mm/numa_emulation.c
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist
  2024-07-16 11:13 ` [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
@ 2024-07-19 16:38   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:37 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> By the time numa_emulation() is called, all physical memory is already
> mapped in the direct map and there is no need to define limits for
> memblock allocation.
> 
> Replace memblock_phys_alloc_range() with memblock_alloc().
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Indeed seems to be after mapping physical memory, so this looks fine.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function
  2024-07-16 11:13 ` [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
@ 2024-07-19 16:47   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:38 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> This is required to make numa emulation code architecture independent so
> that it can be moved to generic code in following commits.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Not the most intuitive of function names but I can't immediately
think of a better one.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN
  2024-07-16 11:13 ` [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
@ 2024-07-19 16:50   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:39 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> This is required to make numa emulation code architecture independent so
> that it can be moved to generic code in following commits.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/17] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
  2024-07-16 11:13 ` [PATCH 11/17] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
@ 2024-07-19 16:57   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 16:57 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:40 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> CPU id cannot be negative.
> 
> Making it unsigned also aligns with declarations in
> include/asm-generic/numa.h used by arm64 and riscv and allows sharing
> numa emulation code with these architectures.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Makes sense for both reasons. FWIW given how simple it is.

Maybe worth bringing a few more functions in line with this?
Probably something for another day, given we don't care about the
inconsistency for this series.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks
  2024-07-16 11:13 ` [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
  2024-07-18 21:46   ` Samuel Holland
@ 2024-07-19 17:48   ` Jonathan Cameron
  2024-07-20 12:25     ` Mike Rapoport
  1 sibling, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 17:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:42 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Move code dealing with numa_distance array from arch/x86 to
> mm/numa_memblks.c

It's not really numa memblock related. Is this the best place
to put it?

> 
> This code will be later reused by arch_numa.
> 
> No functional changes.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 15/17] mm: make numa_memblks more self-contained
  2024-07-16 11:13 ` [PATCH 15/17] mm: make numa_memblks more self-contained Mike Rapoport
@ 2024-07-19 18:07   ` Jonathan Cameron
  2024-07-20 12:32     ` Mike Rapoport
  2024-07-22  8:05     ` Mike Rapoport
  0 siblings, 2 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 18:07 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:44 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Introduce numa_memblks_init() and move some code around to avoid several
> global variables in numa_memblks.

Hi Mike,

Adding the effectively always-on memblock_force_top_down parameter
deserves a comment explaining why. I assume it is because you are going to do
something with it later?

There also seems to be more going on in here, such as the change to
get_pfn_range_for_nid(). Perhaps break this up so each
change can have an explanation.


> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  arch/x86/mm/numa.c           | 53 ++++---------------------
>  include/linux/numa_memblks.h |  9 +----
>  mm/numa_memblks.c            | 77 +++++++++++++++++++++++++++---------
>  3 files changed, 68 insertions(+), 71 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 3848e68d771a..16bc703c9272 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -115,30 +115,19 @@ void __init setup_node_to_cpumask_map(void)
>  	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
>  }
>  
> -static int __init numa_register_memblks(struct numa_meminfo *mi)
> +static int __init numa_register_nodes(void)
>  {
> -	int i, nid, err;
> -
> -	err = numa_register_meminfo(mi);
> -	if (err)
> -		return err;
> +	int nid;
>  
>  	if (!memblock_validate_numa_coverage(SZ_1M))
>  		return -EINVAL;
>  
>  	/* Finally register nodes. */
>  	for_each_node_mask(nid, node_possible_map) {
> -		u64 start = PFN_PHYS(max_pfn);
> -		u64 end = 0;
> -
> -		for (i = 0; i < mi->nr_blks; i++) {
> -			if (nid != mi->blk[i].nid)
> -				continue;
> -			start = min(mi->blk[i].start, start);
> -			end = max(mi->blk[i].end, end);
> -		}
> +		unsigned long start_pfn, end_pfn;
>  
> -		if (start >= end)
> +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

It's not immediately obvious to me that this code is equivalent so I'd
prefer it in a separate patch with some description of why
it is a valid change.
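
To make the equivalence question concrete, here is a minimal userspace
sketch of what the removed loop computes: the span of a node taken as the
smallest start and the largest end over all of its blocks. The names
`struct blk` and `node_span()` are hypothetical stand-ins, not kernel
APIs. The replacement would match this only if get_pfn_range_for_nid()
sees the same per-node ranges in memblock, which is an assumption here
and not something the patch description states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct numa_memblk. */
struct blk {
	uint64_t start;	/* inclusive, in bytes */
	uint64_t end;	/* exclusive, in bytes */
	int nid;
};

/*
 * Compute the span of node @nid the way the removed loop did:
 * minimum start and maximum end over all blocks on that node.
 * @limit plays the role of PFN_PHYS(max_pfn) as the initial start.
 * Returns 1 if the node has memory (start < end), 0 otherwise.
 */
static int node_span(const struct blk *blks, size_t nr, int nid,
		     uint64_t limit, uint64_t *start, uint64_t *end)
{
	uint64_t s = limit, e = 0;
	size_t i;

	assert(start && end);
	for (i = 0; i < nr; i++) {
		if (blks[i].nid != nid)
			continue;
		if (blks[i].start < s)
			s = blks[i].start;
		if (blks[i].end > e)
			e = blks[i].end;
	}
	*start = s;
	*end = e;
	return s < e;
}
```

Note that the span includes any holes between blocks of the same node,
and that the old code worked in bytes while the new call returns PFNs.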

> +		if (start_pfn >= end_pfn)
>  			continue;
>  
>  		alloc_node_data(nid);
> @@ -178,39 +167,11 @@ static int __init numa_init(int (*init_func)(void))
>  	for (i = 0; i < MAX_LOCAL_APIC; i++)
>  		set_apicid_to_node(i, NUMA_NO_NODE);
>  
> -	nodes_clear(numa_nodes_parsed);
> -	nodes_clear(node_possible_map);
> -	nodes_clear(node_online_map);
> -	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> -				  NUMA_NO_NODE));
> -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> -				  NUMA_NO_NODE));
> -	/* In case that parsing SRAT failed. */
> -	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> -	numa_reset_distance();
> -
> -	ret = init_func();
> -	if (ret < 0)
> -		return ret;
> -
> -	/*
> -	 * We reset memblock back to the top-down direction
> -	 * here because if we configured ACPI_NUMA, we have
> -	 * parsed SRAT in init_func(). It is ok to have the
> -	 * reset here even if we did't configure ACPI_NUMA
> -	 * or acpi numa init fails and fallbacks to dummy
> -	 * numa init.
> -	 */
> -	memblock_set_bottom_up(false);
> -
> -	ret = numa_cleanup_meminfo(&numa_meminfo);
> +	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
The comment in the parameter list seems unnecessary.
Maybe add a comment above the call instead if you need to call that out?

>  	if (ret < 0)
>  		return ret;
>  
> -	numa_emulation(&numa_meminfo, numa_distance_cnt);
> -
> -	ret = numa_register_memblks(&numa_meminfo);
> +	ret = numa_register_nodes();
>  	if (ret < 0)
>  		return ret;
>  

> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> index e0039549aaac..640f3a3ce0ee 100644
> --- a/mm/numa_memblks.c
> +++ b/mm/numa_memblks.c
> @@ -7,13 +7,27 @@
>  #include <linux/numa.h>
>  #include <linux/numa_memblks.h>
>  

> +/*
> + * Set nodes, which have memory in @mi, in *@nodemask.
> + */
> +static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> +					      const struct numa_meminfo *mi)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> +		if (mi->blk[i].start != mi->blk[i].end &&
> +		    mi->blk[i].nid != NUMA_NO_NODE)
> +			node_set(mi->blk[i].nid, *nodemask);
> +}

The code move doesn't have an obvious purpose. Maybe call that
out in the patch description if it is needed for a future patch.
Or do it in two steps: the first just adds the static, the second
shuffles the code.

>  
>  /**
>   * numa_reset_distance - Reset NUMA distance table
> @@ -287,20 +301,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>  	return 0;
>  }
>  
> -/*
> - * Set nodes, which have memory in @mi, in *@nodemask.
> - */
> -void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> -				       const struct numa_meminfo *mi)
> -{
> -	int i;
> -
> -	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> -		if (mi->blk[i].start != mi->blk[i].end &&
> -		    mi->blk[i].nid != NUMA_NO_NODE)
> -			node_set(mi->blk[i].nid, *nodemask);
> -}
> -
>  /*
>   * Mark all currently memblock-reserved physical memory (which covers the
>   * kernel's own memory ranges) as hot-unswappable.
> @@ -368,7 +368,7 @@ static void __init numa_clear_kernel_node_hotplug(void)
>  	}
>  }
>  
> -int __init numa_register_meminfo(struct numa_meminfo *mi)
> +static int __init numa_register_meminfo(struct numa_meminfo *mi)
>  {
>  	int i;
>  
> @@ -412,6 +412,47 @@ int __init numa_register_meminfo(struct numa_meminfo *mi)
>  	return 0;
>  }
>  
> +int __init numa_memblks_init(int (*init_func)(void),
> +			     bool memblock_force_top_down)
> +{
> +	int ret;
> +
> +	nodes_clear(numa_nodes_parsed);
> +	nodes_clear(node_possible_map);
> +	nodes_clear(node_online_map);
> +	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> +	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> +				  NUMA_NO_NODE));
> +	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> +				  NUMA_NO_NODE));
> +	/* In case that parsing SRAT failed. */
> +	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> +	numa_reset_distance();
> +
> +	ret = init_func();
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * We reset memblock back to the top-down direction
> +	 * here because if we configured ACPI_NUMA, we have
> +	 * parsed SRAT in init_func(). It is ok to have the
> +	 * reset here even if we did't configure ACPI_NUMA
> +	 * or acpi numa init fails and fallbacks to dummy
> +	 * numa init.
> +	 */
> +	if (memblock_force_top_down)
> +		memblock_set_bottom_up(false);
> +
> +	ret = numa_cleanup_meminfo(&numa_meminfo);
> +	if (ret < 0)
> +		return ret;
> +
> +	numa_emulation(&numa_meminfo, numa_distance_cnt);
> +
> +	return numa_register_meminfo(&numa_meminfo);
> +}
> +
>  static int __init cmp_memblk(const void *a, const void *b)
>  {
>  	const struct numa_memblk *ma = *(const struct numa_memblk **)a;


* Re: [PATCH 16/17] arch_numa: switch over to numa_memblks
  2024-07-16 11:13 ` [PATCH 16/17] arch_numa: switch over to numa_memblks Mike Rapoport
@ 2024-07-19 18:16   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 18:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:45 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Until now arch_numa was directly translating firmware NUMA information
> to memblock.
> 
> Using numa_memblks as an intermediate step has a few advantages:
> * alignment with more battle tested x86 implementation
> * availability of NUMA emulation
> * maintaining node information for not yet populated memory
> 
> Replace current functionality related to numa_add_memblk() and
> __node_distance() with the implementation based on numa_memblks and add
> functions required by numa_emulation.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

One trivial comment inline,

Jonathan
>  /*
>   * Initialize NODE_DATA for a node on the local memory
>   */
> @@ -226,116 +204,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
>  }

>  
> @@ -454,3 +321,54 @@ void __init arch_numa_init(void)
>  
>  	numa_init(dummy_numa_init);
>  }
> +
> +#ifdef CONFIG_NUMA_EMU
> +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
> +					unsigned int nr_emu_nids)
> +{
> +	int i, j;
> +
> +	/*
> +	 * Transform __apicid_to_node table to use emulated nids by

Comment needs an update seeing as there is no __apicid_to_node table
here.

> +	 * reverse-mapping phys_nid.  The maps should always exist but fall
> +	 * back to zero just in case.
> +	 */
> +	for (i = 0; i < ARRAY_SIZE(cpu_to_node_map); i++) {
> +		if (cpu_to_node_map[i] == NUMA_NO_NODE)
> +			continue;
> +		for (j = 0; j < nr_emu_nids; j++)
> +			if (cpu_to_node_map[i] == emu_nid_to_phys[j])
> +				break;
> +		cpu_to_node_map[i] = j < nr_emu_nids ? j : 0;
> +	}
> +}
> +
> +u64 __init numa_emu_dma_end(void)
> +{
> +	return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
> +}


* Re: [PATCH 12/17] mm: introduce numa_memblks
  2024-07-16 11:13 ` [PATCH 12/17] mm: introduce numa_memblks Mike Rapoport
@ 2024-07-19 18:16   ` Jonathan Cameron
  2024-07-22  8:03     ` Mike Rapoport
  0 siblings, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 18:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:41 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig
> options to let x86 select it in its Kconfig.
> 
> This code will be later reused by arch_numa.
> 
> No functional changes.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Hi Mike,

My only real concern in here is there are a few places where
the lifted code makes changes to memblocks that are x86 only today.
I need to do some more digging to work out if those are safe
in all cases.

Jonathan



> +/**
> + * numa_cleanup_meminfo - Cleanup a numa_meminfo
> + * @mi: numa_meminfo to clean up
> + *
> + * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
> + * conflicts and clear unused memblks.
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> +{
> +	const u64 low = 0;

Given it is always zero, why not just use that value inline?

> +	const u64 high = PFN_PHYS(max_pfn);
> +	int i, j, k;
> +
> +	/* first, trim all entries */
> +	for (i = 0; i < mi->nr_blks; i++) {
> +		struct numa_memblk *bi = &mi->blk[i];
> +
> +		/* move / save reserved memory ranges */
> +		if (!memblock_overlaps_region(&memblock.memory,
> +					bi->start, bi->end - bi->start)) {
> +			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
> +			continue;
> +		}
> +
> +		/* make sure all non-reserved blocks are inside the limits */
> +		bi->start = max(bi->start, low);
> +
> +		/* preserve info for non-RAM areas above 'max_pfn': */
> +		if (bi->end > high) {
> +			numa_add_memblk_to(bi->nid, high, bi->end,
> +					   &numa_reserved_meminfo);
> +			bi->end = high;
> +		}
> +
> +		/* and there's no empty block */
> +		if (bi->start >= bi->end)
> +			numa_remove_memblk_from(i--, mi);
> +	}
> +
> +	/* merge neighboring / overlapping entries */
> +	for (i = 0; i < mi->nr_blks; i++) {
> +		struct numa_memblk *bi = &mi->blk[i];
> +
> +		for (j = i + 1; j < mi->nr_blks; j++) {
> +			struct numa_memblk *bj = &mi->blk[j];
> +			u64 start, end;
> +
> +			/*
> +			 * See whether there are overlapping blocks.  Whine
> +			 * about but allow overlaps of the same nid.  They
> +			 * will be merged below.
> +			 */
> +			if (bi->end > bj->start && bi->start < bj->end) {
> +				if (bi->nid != bj->nid) {
> +					pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
> +					       bi->nid, bi->start, bi->end - 1,
> +					       bj->nid, bj->start, bj->end - 1);
> +					return -EINVAL;
> +				}
> +				pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
> +					bi->nid, bi->start, bi->end - 1,
> +					bj->start, bj->end - 1);
> +			}
> +
> +			/*
> +			 * Join together blocks on the same node, holes
> +			 * between which don't overlap with memory on other
> +			 * nodes.
> +			 */
> +			if (bi->nid != bj->nid)
> +				continue;
> +			start = min(bi->start, bj->start);
> +			end = max(bi->end, bj->end);
> +			for (k = 0; k < mi->nr_blks; k++) {
> +				struct numa_memblk *bk = &mi->blk[k];
> +
> +				if (bi->nid == bk->nid)
> +					continue;
> +				if (start < bk->end && end > bk->start)
> +					break;
> +			}
> +			if (k < mi->nr_blks)
> +				continue;
> +			pr_info("NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
> +			       bi->nid, bi->start, bi->end - 1, bj->start,
> +			       bj->end - 1, start, end - 1);
> +			bi->start = start;
> +			bi->end = end;
> +			numa_remove_memblk_from(j--, mi);
> +		}
> +	}
> +
> +	/* clear unused ones */
> +	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
> +		mi->blk[i].start = mi->blk[i].end = 0;
> +		mi->blk[i].nid = NUMA_NO_NODE;
> +	}
> +
> +	return 0;
> +}
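
The merge step in the function above is the least obvious part, so here
is a small userspace sketch of just that step. It is an illustration
under assumptions, not the kernel code: `struct meminfo`, `MAX_BLKS`,
`remove_blk()` and `merge_blks()` are hypothetical names, the trimming
pass and the log messages are left out, and -1 stands in for -EINVAL.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLKS 16	/* hypothetical capacity */

struct blk {
	uint64_t start, end;	/* [start, end) */
	int nid;
};

struct meminfo {
	int nr;
	struct blk blk[MAX_BLKS];
};

/* Drop entry @idx, keeping the order of the remaining entries. */
static void remove_blk(struct meminfo *mi, int idx)
{
	int i;

	assert(idx >= 0 && idx < mi->nr);
	for (i = idx; i < mi->nr - 1; i++)
		mi->blk[i] = mi->blk[i + 1];
	mi->nr--;
}

/*
 * Join blocks of the same node when the joined range does not overlap
 * memory of any other node. Returns -1 if blocks of different nodes
 * overlap, 0 otherwise.
 */
static int merge_blks(struct meminfo *mi)
{
	int i, j, k;

	for (i = 0; i < mi->nr; i++) {
		struct blk *bi = &mi->blk[i];

		for (j = i + 1; j < mi->nr; j++) {
			struct blk *bj = &mi->blk[j];
			uint64_t start, end;

			if (bi->end > bj->start && bi->start < bj->end &&
			    bi->nid != bj->nid)
				return -1;
			if (bi->nid != bj->nid)
				continue;

			start = bi->start < bj->start ? bi->start : bj->start;
			end = bi->end > bj->end ? bi->end : bj->end;

			/* Would the joined range swallow another node? */
			for (k = 0; k < mi->nr; k++) {
				const struct blk *bk = &mi->blk[k];

				if (bk->nid == bi->nid)
					continue;
				if (start < bk->end && end > bk->start)
					break;
			}
			if (k < mi->nr)
				continue;

			bi->start = start;
			bi->end = end;
			remove_blk(mi, j--);
		}
	}
	return 0;
}
```

For example, node 0 blocks [0, 100) and [200, 300) are joined into
[0, 300), but a later node 0 block at [600, 700) stays separate if node
1 owns [400, 500), because the joined range would cover node 1 memory.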

...


> +/*
> + * Mark all currently memblock-reserved physical memory (which covers the
> + * kernel's own memory ranges) as hot-unswappable.
> + */
> +static void __init numa_clear_kernel_node_hotplug(void)

This will be a change for non-x86 architectures. It 'should' be fine,
but I'm not 100% sure.

> +{
> +	nodemask_t reserved_nodemask = NODE_MASK_NONE;
> +	struct memblock_region *mb_region;
> +	int i;
> +
> +	/*
> +	 * We have to do some preprocessing of memblock regions, to
> +	 * make them suitable for reservation.
> +	 *
> +	 * At this time, all memory regions reserved by memblock are
> +	 * used by the kernel, but those regions are not split up
> +	 * along node boundaries yet, and don't necessarily have their
> +	 * node ID set yet either.
> +	 *
> +	 * So iterate over all memory known to the x86 architecture,

At the least, the comment needs an update given this is no longer x86 specific.

> +	 * and use those ranges to set the nid in memblock.reserved.
> +	 * This will split up the memblock regions along node
> +	 * boundaries and will set the node IDs as well.
> +	 */
> +	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> +		struct numa_memblk *mb = numa_meminfo.blk + i;
> +		int ret;
> +
> +		ret = memblock_set_node(mb->start, mb->end - mb->start,
> +					&memblock.reserved, mb->nid);
> +		WARN_ON_ONCE(ret);
> +	}
> +
> +	/*
> +	 * Now go over all reserved memblock regions, to construct a
> +	 * node mask of all kernel reserved memory areas.
> +	 *
> +	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> +	 *   numa_meminfo might not include all memblock.reserved
> +	 *   memory ranges, because quirks such as trim_snb_memory()
> +	 *   reserve specific pages for Sandy Bridge graphics. ]
> +	 */
> +	for_each_reserved_mem_region(mb_region) {
> +		int nid = memblock_get_region_node(mb_region);
> +
> +		if (nid != MAX_NUMNODES)
> +			node_set(nid, reserved_nodemask);
> +	}
> +
> +	/*
> +	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
> +	 * belonging to the reserved node mask.
> +	 *
> +	 * Note that this will include memory regions that reside
> +	 * on nodes that contain kernel memory - entire nodes
> +	 * become hot-unpluggable:
> +	 */
> +	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> +		struct numa_memblk *mb = numa_meminfo.blk + i;
> +
> +		if (!node_isset(mb->nid, reserved_nodemask))
> +			continue;
> +
> +		memblock_clear_hotplug(mb->start, mb->end - mb->start);
> +	}
> +}
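
The last two loops of the function above boil down to "collect the nodes
that hold reserved memory, then clear the hotplug flag on every block of
those nodes". A minimal userspace sketch of that idea follows. The names
(`struct blk`, `reserved_node_mask()`, `clear_hotplug_on_nodes()`,
`NO_NODE`) are hypothetical, and a 64-bit mask replaces nodemask_t, so
the sketch assumes fewer than 64 nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE (-1)	/* hypothetical "node not set" marker */

/* Hypothetical stand-in for a numa_memblk plus its hotplug flag. */
struct blk {
	int nid;
	int hotplug;
};

/* Build a bitmask of the nodes that hold at least one reserved region. */
static uint64_t reserved_node_mask(const int *region_nids, size_t nr)
{
	uint64_t mask = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (region_nids[i] == NO_NODE)
			continue;
		assert(region_nids[i] >= 0 && region_nids[i] < 64);
		mask |= UINT64_C(1) << region_nids[i];
	}
	return mask;
}

/* Clear the hotplug flag on every block that sits on a node in @mask. */
static void clear_hotplug_on_nodes(struct blk *blks, size_t nr,
				   uint64_t mask)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		assert(blks[i].nid >= 0 && blks[i].nid < 64);
		if (mask & (UINT64_C(1) << blks[i].nid))
			blks[i].hotplug = 0;
	}
}
```

The point to notice is the granularity: a single reserved region on a
node clears the flag for all blocks of that node, not just the block
that contains the reservation.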

* Re: [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks
  2024-07-16 11:13 ` [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
@ 2024-07-19 18:19   ` Jonathan Cameron
  0 siblings, 0 replies; 60+ messages in thread
From: Jonathan Cameron @ 2024-07-19 18:19 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024 14:13:46 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> The x86 implementation of range-to-target_node lookup (i.e.
> phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
> numa_memblks.
> 
> Since numa_memblks are now part of the generic code, move these
> functions from x86 to mm/numa_memblks.c and select
> CONFIG_NUMA_KEEP_MEMINFO when CONFIG_NUMA_MEMBLKS=y for dax and cxl.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks. I'll poke around more next week.  Have a good weekend.

Jonathan


* Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code
  2024-07-17 14:42   ` David Hildenbrand
  2024-07-18  7:02     ` Mike Rapoport
@ 2024-07-20 10:24     ` Mike Rapoport
  1 sibling, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-20 10:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David S. Miller, Greg Kroah-Hartman,
	Heiko Carstens, Huacai Chen, Ingo Molnar, Jiaxun Yang,
	John Paul Adrian Glaubitz, Jonathan Cameron, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Wed, Jul 17, 2024 at 04:42:48PM +0200, David Hildenbrand wrote:
> On 16.07.24 13:13, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Architectures that support NUMA duplicate the code that allocates
> > NODE_DATA on the node-local memory with slight variations in reporting
> > of the addresses where the memory was allocated.
> > 
> > Use x86 version as the basis for the generic alloc_node_data() function
> > and call this function in architecture specific numa initialization.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> 
> [...]
> 
> > diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
> > index 9208eaadf690..909f6cec3a26 100644
> > --- a/arch/mips/loongson64/numa.c
> > +++ b/arch/mips/loongson64/numa.c
> > @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
> >   static void __init node_mem_init(unsigned int node)
> >   {
> > -	struct pglist_data *nd;
> >   	unsigned long node_addrspace_offset;
> >   	unsigned long start_pfn, end_pfn;
> > -	unsigned long nd_pa;
> > -	int tnid;
> > -	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
> 
> One interesting change is that we now always round up to full pages on
> architectures where we previously rounded up to SMP_CACHE_BYTES.

I did some git archaeology and it seems that rounding up to full pages on
x86 dates back to the bootmem era, when allocation granularity was
PAGE_SIZE anyway. I'm going to change that to SMP_CACHE_BYTES in v2.
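
To give a feel for the difference between the two roundings, here is a
small sketch. The helper `round_up_to()` is a hypothetical name, and the
numbers in the note below (a 5000-byte structure, 64-byte cache lines,
4096-byte pages) are made up for illustration; the real
sizeof(pg_data_t) depends on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Round @x up to the next multiple of @align; @align must be nonzero. */
static size_t round_up_to(size_t x, size_t align)
{
	assert(align != 0);
	return ((x + align - 1) / align) * align;
}
```

With those assumed numbers, rounding to the cache line gives 5056 bytes
per node while rounding to a full page gives 8192 bytes, so the saving
is bounded by one page per node.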
 
> I assume we don't really expect a significant growth in memory consumption
> that we care about, especially because most systems with many nodes also
> have  quite some memory around.

-- 
Sincerely yours,
Mike.

* Re: [PATCH 14/17] mm: introduce numa_emulation
  2024-07-19 16:03   ` Zi Yan
@ 2024-07-20 12:09     ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-20 12:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 12:03:11PM -0400, Zi Yan wrote:
> On 16 Jul 2024, at 7:13, Mike Rapoport wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Move numa_emulation code from arch/x86 to mm/numa_emulation.c
> >
> > This code will be later reused by arch_numa.
> >
> > No functional changes.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/Kconfig             |  8 --------
> >  arch/x86/include/asm/numa.h  | 12 ------------
> >  arch/x86/mm/Makefile         |  1 -
> >  arch/x86/mm/numa_internal.h  | 11 -----------
> >  include/linux/numa_memblks.h | 17 +++++++++++++++++
> >  mm/Kconfig                   |  8 ++++++++
> >  mm/Makefile                  |  1 +
> >  mm/numa_emulation.c          |  4 +---
> >  8 files changed, 27 insertions(+), 35 deletions(-)
> 
> After this code move, the document of numa=fake= should be moved from
> Documentation/arch/x86/x86_64/boot-options.rst to
> Documentation/admin-guide/kernel-parameters.txt
> too.

I'll add this as a separate commit.
 
> Something like:
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bc55fb55cd26..ce3659289b5e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4158,6 +4158,18 @@
>                         Disable NUMA, Only set up a single NUMA node
>                         spanning all memory.
> 
> +       numa=fake=<size>[MG]
> +                       If given as a memory unit, fills all system RAM with nodes of
> +                       size interleaved over physical nodes.
> +
> +       numa=fake=<N>
> +                       If given as an integer, fills all system RAM with N fake nodes
> +                       interleaved over physical nodes.
> +
> +       numa=fake=<N>U
> +                       If given as an integer followed by 'U', it will divide each
> +                       physical node into N emulated nodes.
> +
>         numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
>                         NUMA balancing.
>                         Allowed values are enable and disable
> diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
> index 137432d34109..98d4805f0823 100644
> --- a/Documentation/arch/x86/x86_64/boot-options.rst
> +++ b/Documentation/arch/x86/x86_64/boot-options.rst
> @@ -170,18 +170,6 @@ NUMA
>      Don't parse the HMAT table for NUMA setup, or soft-reserved memory
>      partitioning.
> 
> -  numa=fake=<size>[MG]
> -    If given as a memory unit, fills all system RAM with nodes of
> -    size interleaved over physical nodes.
> -
> -  numa=fake=<N>
> -    If given as an integer, fills all system RAM with N fake nodes
> -    interleaved over physical nodes.
> -
> -  numa=fake=<N>U
> -    If given as an integer followed by 'U', it will divide each
> -    physical node into N emulated nodes.
> -
>  ACPI
>  ====
> 
> Best Regards,
> Yan, Zi
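
As a reading aid for the three documented forms of numa=fake=, here is a
hedged sketch that classifies the argument string. It is not the
kernel's parser: `parse_fake()`, `struct fake_spec` and the enum values
are hypothetical, and only the forms listed in the proposed
documentation (<size>[MG], <N>, <N>U) are recognized.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum fake_kind {
	FAKE_INVALID,
	FAKE_SIZE,	/* numa=fake=<size>[MG]: nodes of a given size */
	FAKE_COUNT,	/* numa=fake=<N>: N nodes over all RAM */
	FAKE_PER_NODE,	/* numa=fake=<N>U: N nodes per physical node */
};

struct fake_spec {
	enum fake_kind kind;
	unsigned long long value;	/* bytes for FAKE_SIZE, else a count */
};

/* Classify the text after "numa=fake=". */
static struct fake_spec parse_fake(const char *arg)
{
	struct fake_spec spec = { FAKE_INVALID, 0 };
	unsigned long long n;
	char *end;

	assert(arg);
	if (!isdigit((unsigned char)arg[0]))
		return spec;

	n = strtoull(arg, &end, 10);
	if (*end == '\0') {
		spec.kind = FAKE_COUNT;
		spec.value = n;
	} else if ((*end == 'M' || *end == 'G') && end[1] == '\0') {
		spec.kind = FAKE_SIZE;
		spec.value = n << (*end == 'M' ? 20 : 30);
	} else if (*end == 'U' && end[1] == '\0') {
		spec.kind = FAKE_PER_NODE;
		spec.value = n;
	}
	return spec;
}
```

For instance, "512M" is classified as a node size of 536870912 bytes,
"4" as four fake nodes in total, and "3U" as three emulated nodes per
physical node.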



-- 
Sincerely yours,
Mike.

* Re: [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks
  2024-07-19 17:48   ` Jonathan Cameron
@ 2024-07-20 12:25     ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-20 12:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 06:48:42PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:42 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Move code dealing with numa_distance array from arch/x86 to
> > mm/numa_memblks.c
> 
> It's not really numa memblock related. Is this the best place
> to put it?

There is a dependency of numa_alloc_distance() on
numa_nodemask_from_meminfo(), which relies on numa_memblk, but I agree
that they are not really related.

However, I'd prefer to keep this code in mm/numa_memblks.c because
node_distance() definitions and related code differ between
architectures, and having this code outside numa_memblks in, e.g.,
mm/numa.c would be way more involved.
 
> > This code will be later reused by arch_numa.
> > 
> > No functional changes.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> 

-- 
Sincerely yours,
Mike.

* Re: [PATCH 15/17] mm: make numa_memblks more self-contained
  2024-07-19 18:07   ` Jonathan Cameron
@ 2024-07-20 12:32     ` Mike Rapoport
  2024-07-22  8:05     ` Mike Rapoport
  1 sibling, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-20 12:32 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 07:07:12PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:44 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Introduce numa_memblks_init() and move some code around to avoid several
> > global variables in numa_memblks.
> 
> Hi Mike,
> 
> Adding the effectively always on memblock_force_top_down
> deserves a comment on why. I assume because you are going to do
> something with it later? 

Yes, arch_numa sets it to false. I'll add a note in the changelog.

> There also seems to be more going on in here such as the change to
> get_pfn_range_for_nid()  Perhaps break this up so each
> change can have an explanation. 
 
Ok.
 
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/mm/numa.c           | 53 ++++---------------------
> >  include/linux/numa_memblks.h |  9 +----
> >  mm/numa_memblks.c            | 77 +++++++++++++++++++++++++++---------
> >  3 files changed, 68 insertions(+), 71 deletions(-)
> > 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 3848e68d771a..16bc703c9272 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -115,30 +115,19 @@ void __init setup_node_to_cpumask_map(void)
> >  	pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
> >  }
> >  
> > -static int __init numa_register_memblks(struct numa_meminfo *mi)
> > +static int __init numa_register_nodes(void)
> >  {
> > -	int i, nid, err;
> > -
> > -	err = numa_register_meminfo(mi);
> > -	if (err)
> > -		return err;
> > +	int nid;
> >  
> >  	if (!memblock_validate_numa_coverage(SZ_1M))
> >  		return -EINVAL;
> >  
> >  	/* Finally register nodes. */
> >  	for_each_node_mask(nid, node_possible_map) {
> > -		u64 start = PFN_PHYS(max_pfn);
> > -		u64 end = 0;
> > -
> > -		for (i = 0; i < mi->nr_blks; i++) {
> > -			if (nid != mi->blk[i].nid)
> > -				continue;
> > -			start = min(mi->blk[i].start, start);
> > -			end = max(mi->blk[i].end, end);
> > -		}
> > +		unsigned long start_pfn, end_pfn;
> >  
> > -		if (start >= end)
> > +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> 
> It's not immediately obvious to me that this code is equivalent so I'd
> prefer it in a separate patch with some description of why
> it is a valid change.

Will do.
 
> > +		if (start_pfn >= end_pfn)
> >  			continue;
> >  
> >  		alloc_node_data(nid);
> > @@ -178,39 +167,11 @@ static int __init numa_init(int (*init_func)(void))
> >  	for (i = 0; i < MAX_LOCAL_APIC; i++)
> >  		set_apicid_to_node(i, NUMA_NO_NODE);
> >  
> > -	nodes_clear(numa_nodes_parsed);
> > -	nodes_clear(node_possible_map);
> > -	nodes_clear(node_online_map);
> > -	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> > -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> > -				  NUMA_NO_NODE));
> > -	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> > -				  NUMA_NO_NODE));
> > -	/* In case that parsing SRAT failed. */
> > -	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> > -	numa_reset_distance();
> > -
> > -	ret = init_func();
> > -	if (ret < 0)
> > -		return ret;
> > -
> > -	/*
> > -	 * We reset memblock back to the top-down direction
> > -	 * here because if we configured ACPI_NUMA, we have
> > -	 * parsed SRAT in init_func(). It is ok to have the
> > -	 * reset here even if we did't configure ACPI_NUMA
> > -	 * or acpi numa init fails and fallbacks to dummy
> > -	 * numa init.
> > -	 */
> > -	memblock_set_bottom_up(false);
> > -
> > -	ret = numa_cleanup_meminfo(&numa_meminfo);
> > +	ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
> The comment in parameter list seems unnecessary.
> Maybe add a comment above the call instead if need to call that out?

I'll drop it for now.
 
> >  	if (ret < 0)
> >  		return ret;
> >  
> > -	numa_emulation(&numa_meminfo, numa_distance_cnt);
> > -
> > -	ret = numa_register_memblks(&numa_meminfo);
> > +	ret = numa_register_nodes();
> >  	if (ret < 0)
> >  		return ret;
> >  
> 
> > diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> > index e0039549aaac..640f3a3ce0ee 100644
> > --- a/mm/numa_memblks.c
> > +++ b/mm/numa_memblks.c
> > @@ -7,13 +7,27 @@
> >  #include <linux/numa.h>
> >  #include <linux/numa_memblks.h>
> >  
> 
> > +/*
> > + * Set nodes, which have memory in @mi, in *@nodemask.
> > + */
> > +static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> > +					      const struct numa_meminfo *mi)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> > +		if (mi->blk[i].start != mi->blk[i].end &&
> > +		    mi->blk[i].nid != NUMA_NO_NODE)
> > +			node_set(mi->blk[i].nid, *nodemask);
> > +}
> 
> The code move doesn't have an obvious purpose. Maybe call that
> out in the patch description if it is needed for a future patch.
> Or do it in two goes so first just adds the static, 2nd shuffles
> the code.
 
Before the move, numa_nodemask_from_meminfo() was global, so it was OK to
define it after its callers. Now that it is static, it has to be defined
before them.
I'll split this into a separate commit.

-- 
Sincerely yours,
Mike.


* Re: [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
  2024-07-19 14:38     ` Jonathan Cameron
@ 2024-07-22  7:34       ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-22  7:34 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: David Hildenbrand, linux-kernel, Alexander Gordeev,
	Andreas Larsson, Andrew Morton, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christophe Leroy, Dan Williams, Dave Hansen,
	David S. Miller, Greg Kroah-Hartman, Heiko Carstens, Huacai Chen,
	Ingo Molnar, Jiaxun Yang, John Paul Adrian Glaubitz,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 03:38:52PM +0100, Jonathan Cameron wrote:
> On Wed, 17 Jul 2024 16:32:59 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
> > On 16.07.24 13:13, Mike Rapoport wrote:
> > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > > 
> > > sgi-ip27 is the only system that defines NODE_DATA() differently than
> > > the rest of NUMA machines.
> > > 
> > > Add node_data array of struct pglist pointers that will point to
> > > __node_data[node]->pglist and redefine NODE_DATA() to use node_data
> > > array.
> > > 
> > > This will allow pulling declaration of node_data to the generic mm code
> > > in the next commit.
> > > 
> > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > ---
> > >   arch/mips/include/asm/mach-ip27/mmzone.h | 5 ++++-
> > >   arch/mips/sgi-ip27/ip27-memory.c         | 5 ++++-
> > >   2 files changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
> > > index 08c36e50a860..629c3f290203 100644
> > > --- a/arch/mips/include/asm/mach-ip27/mmzone.h
> > > +++ b/arch/mips/include/asm/mach-ip27/mmzone.h
> > > @@ -22,7 +22,10 @@ struct node_data {
> > >   
> > >   extern struct node_data *__node_data[];
> > >   
> > > -#define NODE_DATA(n)		(&__node_data[(n)]->pglist)
> > >   #define hub_data(n)		(&__node_data[(n)]->hub)
> > >   
> > > +extern struct pglist_data *node_data[];
> > > +
> > > +#define NODE_DATA(nid)		(node_data[nid])
> > > +
> > >   #endif /* _ASM_MACH_MMZONE_H */
> > > diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
> > > index b8ca94cfb4fe..c30ef6958b97 100644
> > > --- a/arch/mips/sgi-ip27/ip27-memory.c
> > > +++ b/arch/mips/sgi-ip27/ip27-memory.c
> > > @@ -34,8 +34,10 @@
> > >   #define SLOT_PFNSHIFT		(SLOT_SHIFT - PAGE_SHIFT)
> > >   #define PFN_NASIDSHFT		(NASID_SHFT - PAGE_SHIFT)
> > >   
> > > -struct node_data *__node_data[MAX_NUMNODES];
> > > +struct pglist_data *node_data[MAX_NUMNODES];
> > > +EXPORT_SYMBOL(node_data);
> > >   
> > > +struct node_data *__node_data[MAX_NUMNODES];
> > >   EXPORT_SYMBOL(__node_data);
> > >   
> > >   static u64 gen_region_mask(void)
> > > @@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
> > >   	 */
> > >   	__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
> > >   	memset(__node_data[node], 0, PAGE_SIZE);
> > > +	node_data[node] = &__node_data[node]->pglist;
> > >   
> > >   	NODE_DATA(node)->node_start_pfn = start_pfn;
> > >   	NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;  
> > 
> > I was assuming we could get rid of __node_data->pglist.
> > 
> > But now I am confused where that is actually set.
> 
> It looks nasty... 

Nasty indeed :)

> Cast in arch_refresh_nodedata() takes incoming pg_data_t * and casts it
> to the local version of struct node_data * which I think is this one
> 
> struct node_data {
> 	struct pglist_data pglist; (which is pg_data_t pglist)
> 	struct hub_data hub;
> };
> 
> https://elixir.bootlin.com/linux/v6.10/source/arch/mips/sgi-ip27/ip27-memory.c#L432
> 
> Now that pg_data_t is allocated by 
> arch_alloc_nodedata() which might be fine (though types could be handled in a more
> readable fashion via some container_of() magic.
> https://elixir.bootlin.com/linux/v6.10/source/arch/mips/sgi-ip27/ip27-memory.c#L427
> 
> However that call is:
> pg_data_t * __init arch_alloc_nodedata(int nid)
> {
> 	return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);
> }
> 
> So doesn't seem to allocate enough space to me as should be sizeof(struct node_data)

Well, it's there to silence a compiler error (commit f8f9f21c7848 ("MIPS:
Fix build error for loongson64 and sgi-ip27")), but this is not a proper
fix :(
Luckily nothing calls cpumask_of_node() for offline nodes...
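
To make the size mismatch and the container_of() idea from the quoted
review concrete, here is a minimal userspace sketch. It is not the MIPS
code: the struct bodies are placeholder stand-ins (only the layout with
pglist as the first member follows the quoted definition), and
to_node_data() and node_data_shortfall() are hypothetical helper names
made up for this illustration.

```c
#include <stddef.h>

/* Stand-in types: field contents are placeholders, only the layout matters. */
struct pglist_data {
	int node_id;
	unsigned long node_start_pfn;
};

struct hub_data {
	unsigned long slice_map;
};

struct node_data {
	struct pglist_data pglist;	/* first member, as in the quoted struct */
	struct hub_data hub;
};

/* Simplified container_of(); the kernel macro also does type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical helper: recover the enclosing node_data without a bare cast. */
static struct node_data *to_node_data(struct pglist_data *pgdat)
{
	return container_of(pgdat, struct node_data, pglist);
}

/* Bytes by which a pglist_data-sized allocation falls short of node_data. */
static size_t node_data_shortfall(void)
{
	return sizeof(struct node_data) - sizeof(struct pglist_data);
}
```

With these stand-ins the shortfall is at least sizeof(struct hub_data),
which is the space an allocation of sizeof(pg_data_t) would not provide.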
 
> Worth cleaning up whilst here?  Proper handling of types would definitely
> help.

Worth cleaning up indeed, but I'd rather drop arch_alloc_nodedata() on MIPS
altogether.
 
> Jonathan

-- 
Sincerely yours,
Mike.


* Re: [PATCH 06/17] x86/numa: simplify numa_distance allocation
  2024-07-19 16:28   ` Jonathan Cameron
@ 2024-07-22  7:51     ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-22  7:51 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 05:28:49PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:35 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Allocation of numa_distance uses memblock_phys_alloc_range() to limit
> > allocation to be below the last mapped page.
> > 
> > But NUMA initializaition runs after the direct map is populated and
> 
> initialization (one too many 'i's)

Thanks.
 
> > there is also code in setup_arch() that adjusts memblock limit to
> > reflect how much memory is already mapped in the direct map.
> > 
> > Simplify the allocation of numa_distance and use plain memblock_alloc().
> > This makes the code clearer and ensures that when numa_distance is not
> > allocated it is always NULL.
> Doesn't this break the comment in numa_set_distance() kernel-doc?
> "
>  * If such table cannot be allocated, a warning is printed and further
>  * calls are ignored until the distance table is reset with
>  * numa_reset_distance().
> "
> 
> Superficially that looks to be to avoid repeatedly hitting the
> singleton bit at the top of numa_set_distance() as SRAT or similar
> parsing occurs.

I believe it's there to avoid allocation of numa_distance in the middle of
distance parsing (SLIT or DT numa-distance-map).

If the allocation fails for the first element in the table, numa_distance
and numa_distance_cnt remain zero and node_distance() falls back to

	return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;

This differs from arch_numa, which always tries to allocate MAX_NUMNODES *
MAX_NUMNODES entries for numa_distance and treats an allocation failure as
a failure to initialize NUMA.

I prefer the general approach x86 uses: if distance parsing fails in some
way, NUMA is still initialized, probably with suboptimal distances between
nodes.

I'm going to restore that "singleton" behavior for now and will look into
making this all less cumbersome later.
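
As a rough userspace model of the "singleton" behavior and the fallback
described above (a sketch under stated assumptions, not the kernel code):
all model_* names are made up, the table size is passed in as a parameter
instead of being derived from the parsed nodes, a boolean flag stands in
for the (void *)1LU marker, and simulate_oom and alloc_attempts exist only
to exercise the model.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LOCAL_DISTANCE	10
#define REMOTE_DISTANCE	20

static uint8_t *dist_table;	/* NULL until a table is allocated */
static int dist_cnt;		/* 0 while there is no table */
static bool dist_alloc_failed;	/* hypothetical stand-in for the 1LU marker */
static int alloc_attempts;	/* instrumentation for the model only */
static bool simulate_oom;	/* instrumentation for the model only */

static int model_alloc_distance(int cnt)
{
	int i, j;

	alloc_attempts++;
	dist_table = simulate_oom ? NULL : malloc((size_t)cnt * cnt);
	if (!dist_table) {
		dist_alloc_failed = true;	/* don't retry until reset */
		return -1;
	}

	/* fill with the default distances */
	for (i = 0; i < cnt; i++)
		for (j = 0; j < cnt; j++)
			dist_table[i * cnt + j] =
				i == j ? LOCAL_DISTANCE : REMOTE_DISTANCE;
	dist_cnt = cnt;
	return 0;
}

static void model_reset_distance(void)
{
	free(dist_table);
	dist_table = NULL;
	dist_cnt = 0;
	dist_alloc_failed = false;	/* enable table creation again */
}

static void model_set_distance(int from, int to, int distance, int cnt)
{
	if (dist_alloc_failed)
		return;		/* ignore further calls until reset */
	if (!dist_table && model_alloc_distance(cnt) < 0)
		return;
	if (from < 0 || to < 0 || from >= dist_cnt || to >= dist_cnt)
		return;
	dist_table[from * dist_cnt + to] = (uint8_t)distance;
}

static int model_node_distance(int from, int to)
{
	/* no table, or index outside it: fall back to the defaults */
	if (from < 0 || to < 0 || from >= dist_cnt || to >= dist_cnt)
		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
	return dist_table[from * dist_cnt + to];
}
```

In this model a failed allocation is attempted only once per parsing pass,
and lookups keep returning the local/remote defaults until
model_reset_distance() is called.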
 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  arch/x86/mm/numa.c | 12 +++---------
> >  1 file changed, 3 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 5e1dde26674b..ab2d4ecef786 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -319,8 +319,7 @@ void __init numa_reset_distance(void)
> >  {
> >  	size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
> >  
> > -	/* numa_distance could be 1LU marking allocation failure, test cnt */
> > -	if (numa_distance_cnt)
> > +	if (numa_distance)
> >  		memblock_free(numa_distance, size);
> >  	numa_distance_cnt = 0;
> >  	numa_distance = NULL;	/* enable table creation */
> > @@ -331,7 +330,6 @@ static int __init numa_alloc_distance(void)
> >  	nodemask_t nodes_parsed;
> >  	size_t size;
> >  	int i, j, cnt = 0;
> > -	u64 phys;
> >  
> >  	/* size the new table and allocate it */
> >  	nodes_parsed = numa_nodes_parsed;
> > @@ -342,16 +340,12 @@ static int __init numa_alloc_distance(void)
> >  	cnt++;
> >  	size = cnt * cnt * sizeof(numa_distance[0]);
> >  
> > -	phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0,
> > -					 PFN_PHYS(max_pfn_mapped));
> > -	if (!phys) {
> > +	numa_distance = memblock_alloc(size, PAGE_SIZE);
> > +	if (!numa_distance) {
> >  		pr_warn("Warning: can't allocate distance table!\n");
> > -		/* don't retry until explicitly reset */
> > -		numa_distance = (void *)1LU;
> >  		return -ENOMEM;
> >  	}
> >  
> > -	numa_distance = __va(phys);
> >  	numa_distance_cnt = cnt;
> >  
> >  	/* fill with the default distances */
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH 12/17] mm: introduce numa_memblks
  2024-07-19 18:16   ` Jonathan Cameron
@ 2024-07-22  8:03     ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-22  8:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 07:16:47PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:41 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig
> > options to let x86 select it in its Kconfig.
> > 
> > This code will be later reused by arch_numa.
> > 
> > No functional changes.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Hi Mike,
> 
> My only real concern in here is there are a few places where
> the lifted code makes changes to memblocks that are x86 only today.
> I need to do some more digging to work out if those are safe
> in all cases.
> 
> Jonathan
> 
> 
> 
> > +/**
> > + * numa_cleanup_meminfo - Cleanup a numa_meminfo
> > + * @mi: numa_meminfo to clean up
> > + *
> > + * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
> > + * conflicts and clear unused memblks.
> > + *
> > + * RETURNS:
> > + * 0 on success, -errno on failure.
> > + */
> > +int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> > +{
> > +	const u64 low = 0;
> 
> Given always zero, why not just use that value inline?

Actually it seems to me that it should be memblock_start_of_DRAM().

The blocks outside system memory are moved to numa_reserved_meminfo, so
AFAIU on arm64/riscv such blocks can be below the RAM.
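
A small userspace sketch of why the lower bound matters, assuming a
simplified block type. This is not the kernel function: struct numa_blk
and trim_blk() are made-up names, the step that first moves blocks not
overlapping memblock.memory is skipped, and the DRAM start used in the
usage note below is an arbitrary example value.

```c
#include <stdint.h>

struct numa_blk {
	uint64_t start, end;	/* [start, end) */
	int nid;
};

/*
 * Clamp @b to [low, high). The part above @high is returned in @tail,
 * mirroring how the quoted code preserves it as reserved info; the part
 * below @low is simply trimmed.
 * Returns 1 if @b is still non-empty after clamping, 0 otherwise.
 */
static int trim_blk(struct numa_blk *b, uint64_t low, uint64_t high,
		    struct numa_blk *tail)
{
	tail->start = tail->end = 0;
	tail->nid = b->nid;

	if (b->start < low)
		b->start = low;	/* a no-op when low == 0 */

	if (b->end > high) {
		tail->start = high;
		tail->end = b->end;
		b->end = high;
	}

	return b->start < b->end;
}
```

With low == 0 the start of a block never changes, which is why the
constant looked redundant. With a non-zero lower bound, for example
0x80000000 as a hypothetical start of DRAM, a block that begins below it
is trimmed to start there.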
 
> > +	const u64 high = PFN_PHYS(max_pfn);
> > +	int i, j, k;
> > +
> > +	/* first, trim all entries */
> > +	for (i = 0; i < mi->nr_blks; i++) {
> > +		struct numa_memblk *bi = &mi->blk[i];
> > +
> > +		/* move / save reserved memory ranges */
> > +		if (!memblock_overlaps_region(&memblock.memory,
> > +					bi->start, bi->end - bi->start)) {
> > +			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
> > +			continue;
> > +		}
> > +
> > +		/* make sure all non-reserved blocks are inside the limits */
> > +		bi->start = max(bi->start, low);
> > +
> > +		/* preserve info for non-RAM areas above 'max_pfn': */
> > +		if (bi->end > high) {
> > +			numa_add_memblk_to(bi->nid, high, bi->end,
> > +					   &numa_reserved_meminfo);
> > +			bi->end = high;
> > +		}
> > +
> > +		/* and there's no empty block */
> > +		if (bi->start >= bi->end)
> > +			numa_remove_memblk_from(i--, mi);
> > +	}
> > +
> > +	/* merge neighboring / overlapping entries */
> > +	for (i = 0; i < mi->nr_blks; i++) {
> > +		struct numa_memblk *bi = &mi->blk[i];
> > +
> > +		for (j = i + 1; j < mi->nr_blks; j++) {
> > +			struct numa_memblk *bj = &mi->blk[j];
> > +			u64 start, end;
> > +
> > +			/*
> > +			 * See whether there are overlapping blocks.  Whine
> > +			 * about but allow overlaps of the same nid.  They
> > +			 * will be merged below.
> > +			 */
> > +			if (bi->end > bj->start && bi->start < bj->end) {
> > +				if (bi->nid != bj->nid) {
> > +					pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
> > +					       bi->nid, bi->start, bi->end - 1,
> > +					       bj->nid, bj->start, bj->end - 1);
> > +					return -EINVAL;
> > +				}
> > +				pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
> > +					bi->nid, bi->start, bi->end - 1,
> > +					bj->start, bj->end - 1);
> > +			}
> > +
> > +			/*
> > +			 * Join together blocks on the same node, holes
> > +			 * between which don't overlap with memory on other
> > +			 * nodes.
> > +			 */
> > +			if (bi->nid != bj->nid)
> > +				continue;
> > +			start = min(bi->start, bj->start);
> > +			end = max(bi->end, bj->end);
> > +			for (k = 0; k < mi->nr_blks; k++) {
> > +				struct numa_memblk *bk = &mi->blk[k];
> > +
> > +				if (bi->nid == bk->nid)
> > +					continue;
> > +				if (start < bk->end && end > bk->start)
> > +					break;
> > +			}
> > +			if (k < mi->nr_blks)
> > +				continue;
> > +			pr_info("NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
> > +			       bi->nid, bi->start, bi->end - 1, bj->start,
> > +			       bj->end - 1, start, end - 1);
> > +			bi->start = start;
> > +			bi->end = end;
> > +			numa_remove_memblk_from(j--, mi);
> > +		}
> > +	}
> > +
> > +	/* clear unused ones */
> > +	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
> > +		mi->blk[i].start = mi->blk[i].end = 0;
> > +		mi->blk[i].nid = NUMA_NO_NODE;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> ...
> 
> 
> > +/*
> > + * Mark all currently memblock-reserved physical memory (which covers the
> > + * kernel's own memory ranges) as hot-unswappable.
> > + */
> > +static void __init numa_clear_kernel_node_hotplug(void)
> 
> This will be a change for non x86 architectures.  'should' be fine
> but I'm not 100% sure.

This function sets the nid in memblock.reserved, which does not change
anything except the dump in debugfs. It then uses the node info in
memblock.reserved to clear MEMBLOCK_HOTPLUG from the regions in
memblock.memory that contain reserved memory, because they cannot be
hot(un)plugged anyway.
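
The effect can be modelled in userspace roughly as follows. This is a
sketch under assumptions, not the kernel implementation: the types and
model_clear_kernel_node_hotplug() are made up, node IDs are assumed to fit
in a 64-bit mask, and a plain overlap test replaces the
memblock_set_node() / memblock_get_region_node() round trip used by the
quoted code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct node_blk {
	uint64_t start, end;	/* [start, end) */
	int nid;		/* assumed to be in 0..63 for this model */
	bool hotplug;		/* stands in for MEMBLOCK_HOTPLUG */
};

struct resv_range {
	uint64_t start, end;	/* [start, end) */
};

static void model_clear_kernel_node_hotplug(struct node_blk *blks,
					     size_t nr_blks,
					     const struct resv_range *resv,
					     size_t nr_resv)
{
	uint64_t reserved_nodes = 0;
	size_t i, j;

	/* collect the nodes that contain any reserved memory */
	for (i = 0; i < nr_blks; i++) {
		if (blks[i].nid < 0 || blks[i].nid > 63)
			continue;
		for (j = 0; j < nr_resv; j++)
			if (blks[i].start < resv[j].end &&
			    resv[j].start < blks[i].end)
				reserved_nodes |= 1ULL << blks[i].nid;
	}

	/* clear the flag on every block that belongs to such a node */
	for (i = 0; i < nr_blks; i++) {
		if (blks[i].nid < 0 || blks[i].nid > 63)
			continue;
		if (reserved_nodes & (1ULL << blks[i].nid))
			blks[i].hotplug = false;
	}
}
```

Note that the flag is cleared for every block of a node that holds
reserved memory, not only for the block that overlaps the reservation.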
 
> > +{
> > +	nodemask_t reserved_nodemask = NODE_MASK_NONE;
> > +	struct memblock_region *mb_region;
> > +	int i;
> > +
> > +	/*
> > +	 * We have to do some preprocessing of memblock regions, to
> > +	 * make them suitable for reservation.
> > +	 *
> > +	 * At this time, all memory regions reserved by memblock are
> > +	 * used by the kernel, but those regions are not split up
> > +	 * along node boundaries yet, and don't necessarily have their
> > +	 * node ID set yet either.
> > +	 *
> > +	 * So iterate over all memory known to the x86 architecture,
> 
> Comment needs an update at least given not x86 specific any more.

Sure, will fix.
 
> > +	 * and use those ranges to set the nid in memblock.reserved.
> > +	 * This will split up the memblock regions along node
> > +	 * boundaries and will set the node IDs as well.
> > +	 */
> > +	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> > +		struct numa_memblk *mb = numa_meminfo.blk + i;
> > +		int ret;
> > +
> > +		ret = memblock_set_node(mb->start, mb->end - mb->start,
> > +					&memblock.reserved, mb->nid);
> > +		WARN_ON_ONCE(ret);
> > +	}
> > +
> > +	/*
> > +	 * Now go over all reserved memblock regions, to construct a
> > +	 * node mask of all kernel reserved memory areas.
> > +	 *
> > +	 * [ Note, when booting with mem=nn[kMG] or in a kdump kernel,
> > +	 *   numa_meminfo might not include all memblock.reserved
> > +	 *   memory ranges, because quirks such as trim_snb_memory()
> > +	 *   reserve specific pages for Sandy Bridge graphics. ]
> > +	 */
> > +	for_each_reserved_mem_region(mb_region) {
> > +		int nid = memblock_get_region_node(mb_region);
> > +
> > +		if (nid != MAX_NUMNODES)
> > +			node_set(nid, reserved_nodemask);
> > +	}
> > +
> > +	/*
> > +	 * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory
> > +	 * belonging to the reserved node mask.
> > +	 *
> > +	 * Note that this will include memory regions that reside
> > +	 * on nodes that contain kernel memory - entire nodes
> > +	 * become hot-unpluggable:
> > +	 */
> > +	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> > +		struct numa_memblk *mb = numa_meminfo.blk + i;
> > +
> > +		if (!node_isset(mb->nid, reserved_nodemask))
> > +			continue;
> > +
> > +		memblock_clear_hotplug(mb->start, mb->end - mb->start);
> > +	}
> > +}
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH 15/17] mm: make numa_memblks more self-contained
  2024-07-19 18:07   ` Jonathan Cameron
  2024-07-20 12:32     ` Mike Rapoport
@ 2024-07-22  8:05     ` Mike Rapoport
  1 sibling, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-22  8:05 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 07:07:12PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:44 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Introduce numa_memblks_init() and move some code around to avoid several
> > global variables in numa_memblks.
> 
> Hi Mike,
> 
> Adding the effectively always on memblock_force_top_down
> deserves a comment on why. I assume because you are going to do
> something with it later? 
> 
> There also seems to be more going on in here such as the change to
> get_pfn_range_for_nid()  Perhaps break this up so each
> change can have an explanation. 

I'll split this into several commits. 
 
-- 
Sincerely yours,
Mike.


* Re: [PATCH 00/17] mm: introduce numa_memblks
  2024-07-19 13:33 ` [PATCH 00/17] mm: introduce numa_memblks Jonathan Cameron
@ 2024-07-22  8:08   ` Mike Rapoport
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Rapoport @ 2024-07-22  8:08 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Michael Ellerman,
	Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Fri, Jul 19, 2024 at 02:33:47PM +0100, Jonathan Cameron wrote:
> On Tue, 16 Jul 2024 14:13:29 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Hi,
> > 
> > Following the discussion about handling of CXL fixed memory windows on
> > arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
> > the generic code so they will be available on arm64/riscv and maybe on
> > loongarch sometime later.
> > 
> > While it could be possible to use memblock to describe CXL memory windows,
> > it currently lacks notion of unpopulated memory ranges and numa_memblks
> > does implement this.
> > 
> > Another reason to make numa_memblks generic is that both arch_numa (arm64
> > and riscv) and loongarch use trimmed copy of x86 code although there is no
> > fundamental reason why the same code cannot be used on all these platforms.
> > Having numa_memblks in mm/ will make it's interaction with ACPI and FDT
> > more consistent and I believe will reduce maintenance burden.
> > 
> > And with generic numa_memblks it is (almost) straightforward to enable NUMA
> > emulation on arm64 and riscv.
> > 
> > The first 5 commits in this series are cleanups that are not strictly
> > related to numa_memblks.
> > 
> > Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
> > and NUMA emulation to the generic code.
> > 
> > Commits 12-14 actually move the code from arch/x86/ to mm/ and commit 15
> > does some aftermath cleanups.
> > 
> > Commit 16 switches arch_numa to numa_memblks.
> > 
> > Commit 17 enables usage of phys_to_target_node() and
> > memory_add_physaddr_to_nid() with numa_memblks.
> 
> Hi Mike,
> 
> I've lightly tested with emulated CXL + Generic Ports and Generic
> Initiators as well as more normal cpus and memory via qemu on arm64 and it's
> looking good.
> 
> From my earlier series, patch 4 is probably still needed to avoid
> presenting nodes with nothing in them at boot (but not if we hotplug
> memory then remove it again in which case they disappear)
> https://lore.kernel.org/all/20240529171236.32002-5-Jonathan.Cameron@huawei.com/
> However that was broken/inconsistent before your rework so I can send that
> patch separately. 

I'd appreciate it :)
 
> Thanks for getting this sorted!  I should get time to do more extensive
> testing and review in next week or so.

Thanks, you may want to wait for v2, I'm planning to send it this week.
 
> Jonathan
> 
> > 
> > [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/
> > 
> > Mike Rapoport (Microsoft) (17):
> >   mm: move kernel/numa.c to mm/
> >   MIPS: sgi-ip27: make NODE_DATA() the same as on all other
> >     architectures
> >   MIPS: loongson64: rename __node_data to node_data
> >   arch, mm: move definition of node_data to generic code
> >   arch, mm: pull out allocation of NODE_DATA to generic code
> >   x86/numa: simplify numa_distance allocation
> >   x86/numa: move FAKE_NODE_* defines to numa_emu
> >   x86/numa_emu: simplify allocation of phys_dist
> >   x86/numa_emu: split __apicid_to_node update to a helper function
> >   x86/numa_emu: use a helper function to get MAX_DMA32_PFN
> >   x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
> >   mm: introduce numa_memblks
> >   mm: move numa_distance and related code from x86 to numa_memblks
> >   mm: introduce numa_emulation
> >   mm: make numa_memblks more self-contained
> >   arch_numa: switch over to numa_memblks
> >   mm: make range-to-target_node lookup facility a part of numa_memblks
> > 
> >  arch/arm64/include/asm/Kbuild                 |   1 +
> >  arch/arm64/include/asm/mmzone.h               |  13 -
> >  arch/arm64/include/asm/topology.h             |   1 +
> >  arch/loongarch/include/asm/Kbuild             |   1 +
> >  arch/loongarch/include/asm/mmzone.h           |  16 -
> >  arch/loongarch/include/asm/topology.h         |   1 +
> >  arch/loongarch/kernel/numa.c                  |  21 -
> >  arch/mips/include/asm/mach-ip27/mmzone.h      |   1 -
> >  .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
> >  arch/mips/loongson64/numa.c                   |  20 +-
> >  arch/mips/sgi-ip27/ip27-memory.c              |   2 +-
> >  arch/powerpc/include/asm/mmzone.h             |   6 -
> >  arch/powerpc/mm/numa.c                        |  26 +-
> >  arch/riscv/include/asm/Kbuild                 |   1 +
> >  arch/riscv/include/asm/mmzone.h               |  13 -
> >  arch/riscv/include/asm/topology.h             |   4 +
> >  arch/s390/include/asm/Kbuild                  |   1 +
> >  arch/s390/include/asm/mmzone.h                |  17 -
> >  arch/s390/kernel/numa.c                       |   3 -
> >  arch/sh/include/asm/mmzone.h                  |   3 -
> >  arch/sh/mm/init.c                             |   7 +-
> >  arch/sh/mm/numa.c                             |   3 -
> >  arch/sparc/include/asm/mmzone.h               |   4 -
> >  arch/sparc/mm/init_64.c                       |  11 +-
> >  arch/x86/Kconfig                              |   9 +-
> >  arch/x86/include/asm/Kbuild                   |   1 +
> >  arch/x86/include/asm/mmzone.h                 |   6 -
> >  arch/x86/include/asm/mmzone_32.h              |  17 -
> >  arch/x86/include/asm/mmzone_64.h              |  18 -
> >  arch/x86/include/asm/numa.h                   |  24 +-
> >  arch/x86/include/asm/sparsemem.h              |   9 -
> >  arch/x86/mm/Makefile                          |   1 -
> >  arch/x86/mm/amdtopology.c                     |   1 +
> >  arch/x86/mm/numa.c                            | 618 +-----------------
> >  arch/x86/mm/numa_internal.h                   |  24 -
> >  drivers/acpi/numa/srat.c                      |   1 +
> >  drivers/base/Kconfig                          |   1 +
> >  drivers/base/arch_numa.c                      | 223 ++-----
> >  drivers/cxl/Kconfig                           |   2 +-
> >  drivers/dax/Kconfig                           |   2 +-
> >  drivers/of/of_numa.c                          |   1 +
> >  include/asm-generic/mmzone.h                  |   5 +
> >  include/asm-generic/numa.h                    |   6 +-
> >  include/linux/numa.h                          |   5 +
> >  include/linux/numa_memblks.h                  |  58 ++
> >  kernel/Makefile                               |   1 -
> >  kernel/numa.c                                 |  26 -
> >  mm/Kconfig                                    |  11 +
> >  mm/Makefile                                   |   3 +
> >  mm/numa.c                                     |  57 ++
> >  {arch/x86/mm => mm}/numa_emulation.c          |  42 +-
> >  mm/numa_memblks.c                             | 565 ++++++++++++++++
> >  52 files changed, 847 insertions(+), 1070 deletions(-)
> >  delete mode 100644 arch/arm64/include/asm/mmzone.h
> >  delete mode 100644 arch/loongarch/include/asm/mmzone.h
> >  delete mode 100644 arch/riscv/include/asm/mmzone.h
> >  delete mode 100644 arch/s390/include/asm/mmzone.h
> >  delete mode 100644 arch/x86/include/asm/mmzone.h
> >  delete mode 100644 arch/x86/include/asm/mmzone_32.h
> >  delete mode 100644 arch/x86/include/asm/mmzone_64.h
> >  create mode 100644 include/asm-generic/mmzone.h
> >  create mode 100644 include/linux/numa_memblks.h
> >  delete mode 100644 kernel/numa.c
> >  create mode 100644 mm/numa.c
> >  rename {arch/x86/mm => mm}/numa_emulation.c (94%)
> >  create mode 100644 mm/numa_memblks.c
> > 
> > 
> > base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code
  2024-07-16 11:13 ` [PATCH 04/17] arch, mm: move definition of node_data to generic code Mike Rapoport
  2024-07-17 14:35   ` David Hildenbrand
  2024-07-19 15:39   ` Jonathan Cameron
@ 2024-07-23  0:15   ` Davidlohr Bueso
  2 siblings, 0 replies; 60+ messages in thread
From: Davidlohr Bueso @ 2024-07-23  0:15 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas, Christophe Leroy,
	Dan Williams, Dave Hansen, David Hildenbrand, David S. Miller,
	Greg Kroah-Hartman, Heiko Carstens, Huacai Chen, Ingo Molnar,
	Jiaxun Yang, John Paul Adrian Glaubitz, Jonathan Cameron,
	Michael Ellerman, Palmer Dabbelt, Rafael J. Wysocki, Rob Herring,
	Thomas Bogendoerfer, Thomas Gleixner, Vasily Gorbik, Will Deacon,
	linux-arm-kernel, loongarch, linux-mips, linuxppc-dev,
	linux-riscv, linux-s390, linux-sh, sparclinux, linux-acpi,
	linux-cxl, nvdimm, devicetree, linux-arch, linux-mm, x86

On Tue, 16 Jul 2024, Mike Rapoport wrote:

>From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
>Every architecture that supports NUMA defines node_data in the same way:
>
>	struct pglist_data *node_data[MAX_NUMNODES];
>
>No reason to keep multiple copies of this definition and its forward
>declarations, especially when such forward declaration is the only thing
>in include/asm/mmzone.h for many architectures.
>
>Add definition and declaration of node_data to generic code and drop
>architecture-specific versions.
>
>Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Nice cleanup.

Acked-by: Davidlohr Bueso <dave@stgolabs.net>


end of thread, other threads:[~2024-07-23  0:15 UTC | newest]

Thread overview: 60+ messages
2024-07-16 11:13 [PATCH 00/17] mm: introduce numa_memblks Mike Rapoport
2024-07-16 11:13 ` [PATCH 01/17] mm: move kernel/numa.c to mm/ Mike Rapoport
2024-07-17 14:35   ` David Hildenbrand
2024-07-19 13:55   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures Mike Rapoport
2024-07-17 14:32   ` David Hildenbrand
2024-07-19 14:38     ` Jonathan Cameron
2024-07-22  7:34       ` Mike Rapoport
2024-07-16 11:13 ` [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data Mike Rapoport
2024-07-16 13:07   ` Jiaxun Yang
2024-07-17 14:33   ` David Hildenbrand
2024-07-19 15:27   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 04/17] arch, mm: move definition of node_data to generic code Mike Rapoport
2024-07-17 14:35   ` David Hildenbrand
2024-07-19 15:39   ` Jonathan Cameron
2024-07-23  0:15   ` Davidlohr Bueso
2024-07-16 11:13 ` [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA " Mike Rapoport
2024-07-17 14:42   ` David Hildenbrand
2024-07-18  7:02     ` Mike Rapoport
2024-07-19 15:07       ` David Hildenbrand
2024-07-19 15:34         ` Mike Rapoport
2024-07-19 15:46           ` David Hildenbrand
2024-07-19 15:51         ` Jonathan Cameron
2024-07-19 16:07           ` David Hildenbrand
2024-07-20 10:24     ` Mike Rapoport
2024-07-19 16:11   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 06/17] x86/numa: simplify numa_distance allocation Mike Rapoport
2024-07-19 16:28   ` Jonathan Cameron
2024-07-22  7:51     ` Mike Rapoport
2024-07-16 11:13 ` [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu Mike Rapoport
2024-07-19 16:30   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist Mike Rapoport
2024-07-19 16:38   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function Mike Rapoport
2024-07-19 16:47   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN Mike Rapoport
2024-07-19 16:50   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 11/17] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned Mike Rapoport
2024-07-19 16:57   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 12/17] mm: introduce numa_memblks Mike Rapoport
2024-07-19 18:16   ` Jonathan Cameron
2024-07-22  8:03     ` Mike Rapoport
2024-07-16 11:13 ` [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks Mike Rapoport
2024-07-18 21:46   ` Samuel Holland
2024-07-19  5:55     ` Mike Rapoport
2024-07-19 17:48   ` Jonathan Cameron
2024-07-20 12:25     ` Mike Rapoport
2024-07-16 11:13 ` [PATCH 14/17] mm: introduce numa_emulation Mike Rapoport
2024-07-19 16:03   ` Zi Yan
2024-07-20 12:09     ` Mike Rapoport
2024-07-16 11:13 ` [PATCH 15/17] mm: make numa_memblks more self-contained Mike Rapoport
2024-07-19 18:07   ` Jonathan Cameron
2024-07-20 12:32     ` Mike Rapoport
2024-07-22  8:05     ` Mike Rapoport
2024-07-16 11:13 ` [PATCH 16/17] arch_numa: switch over to numa_memblks Mike Rapoport
2024-07-19 18:16   ` Jonathan Cameron
2024-07-16 11:13 ` [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks Mike Rapoport
2024-07-19 18:19   ` Jonathan Cameron
2024-07-19 13:33 ` [PATCH 00/17] mm: introduce numa_memblks Jonathan Cameron
2024-07-22  8:08   ` Mike Rapoport
