* [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
@ 2025-05-22 9:17 Donet Tom
2025-05-22 9:17 ` [PATCH v5 2/4] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 9:17 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Donet Tom
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node’s start and end PFNs, which are obtained from the node's pg_data
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a
`memory block` may be absent.
As a result, `memory blocks` that fall between a node’s start and end
PFNs may span across multiple nodes, and some sections within those blocks
may be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
for_each_online_node(nid):
start_pfn = pgdat->node_start_pfn;
end_pfn = pgdat->node_start_pfn + node_spanned_pages;
for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
mem_blk = memory_block_id(pfn_to_section_nr(pfn));
pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
pfn_mb_end = pfn_start + memory_block_pfns - 1
for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
if (get_nid_for_pfn(pfn) != nid):
continue;
else
do_register_memory_block_under_node(nid, mem_blk,
MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each
`memory block` in this range, we inspect all PFNs it contains and check
their associated NUMA node ID. If a PFN within the block matches the
current node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match is
found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
overhead of the binary search increases the execution time significantly,
potentially leading to soft lockups during boot.
In this patch, we iterate over `memblock region` to identify the
`memory blocks` that belong to the current NUMA node. `memblock regions`
are contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
for_each_memory_region(r): // r => region
if (!node_online(r->nid)):
continue;
else
for_each_memory_block_between(r->base, r->base + r->size - 1):
do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on the
region's start and end PFNs. We then register all the memory blocks within
that range under the region node.
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v4 -> v5
1. Moved all helpers(memory_block_id(), pfn_to_block_id(), and phys_to_block_id())
into memory.h and exported sections_per_block.
2. register_memory_blocks_early() moved out of for_each_online_node().
Now we iterate over all memory regions at once and register the
memory blocks.
Tested corner cases where memory blocks span across multiple memblock regions; it
is working fine.
#cd /sys/devices/system/node/
# find node1/ |grep memory0
node1/memory0
# find node0/ |grep memory0
node0/memory0
# find node0/ |grep memory0
node2/memory0
# cat node0/memory0/valid_zones
none
V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/memory.c | 21 ++++----------------
drivers/base/node.c | 45 ++++++++++++++++++++++++++++++++++++++++--
include/linux/memory.h | 19 +++++++++++++++++-
include/linux/node.h | 3 +++
4 files changed, 68 insertions(+), 20 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 19469e7f88c2..39fcc075a36f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -22,6 +22,7 @@
#include <linux/stat.h>
#include <linux/slab.h>
#include <linux/xarray.h>
+#include <linux/export.h>
#include <linux/atomic.h>
#include <linux/uaccess.h>
@@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
-static int sections_per_block;
-
-static inline unsigned long memory_block_id(unsigned long section_nr)
-{
- return section_nr / sections_per_block;
-}
-
-static inline unsigned long pfn_to_block_id(unsigned long pfn)
-{
- return memory_block_id(pfn_to_section_nr(pfn));
-}
-
-static inline unsigned long phys_to_block_id(unsigned long phys)
-{
- return pfn_to_block_id(PFN_DOWN(phys));
-}
+int sections_per_block;
+EXPORT_SYMBOL(sections_per_block);
static int memory_subsys_online(struct device *dev);
static int memory_subsys_offline(struct device *dev);
@@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
*
* Called under device_hotplug_lock.
*/
-static struct memory_block *find_memory_block_by_id(unsigned long block_id)
+struct memory_block *find_memory_block_by_id(unsigned long block_id)
{
struct memory_block *mem;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cd13ef287011..e8b6f6b9ce51 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
#include <linux/pm_runtime.h>
#include <linux/swap.h>
#include <linux/slab.h>
+#include <linux/memblock.h>
static const struct bus_type node_subsys = {
.name = "node",
@@ -850,6 +851,41 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
}
+/*
+ * register_memory_blocks_under_node_early : Register the memory blocks
+ * under the nodes.
+ *
+ * This function iterates over all memblock regions, and if the node associated with
+ * the region is online, calculates the start and end memory blocks based on the
+ * region's start and end PFNs. Then, registers all the memory blocks within that
+ * range under the region node.
+ */
+static void register_memory_blocks_under_node_early(void)
+{
+ struct memblock_region *r;
+
+ for_each_mem_region(r) {
+ const unsigned long start_block_id = phys_to_block_id(r->base);
+ const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
+ unsigned long block_id;
+
+ if (!node_online(r->nid))
+ continue;
+
+ for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
+ struct memory_block *mem;
+
+ mem = find_memory_block_by_id(block_id);
+ if (!mem)
+ continue;
+
+ do_register_memory_block_under_node(r->nid, mem, MEMINIT_EARLY);
+ put_device(&mem->dev);
+ }
+
+ }
+}
+
void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
unsigned long end_pfn,
enum meminit_context context)
@@ -971,11 +1007,16 @@ void __init node_dev_init(void)
/*
* Create all node devices, which will properly link the node
- * to applicable memory block devices and already created cpu devices.
+ * to already created cpu devices.
*/
for_each_online_node(i) {
- ret = register_one_node(i);
+ ret = __register_one_node(i);
if (ret)
panic("%s() failed to add node: %d\n", __func__, ret);
}
+
+ /*
+ * Link the node to memory block devices
+ */
+ register_memory_blocks_under_node_early();
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..2a61088e17ad 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
struct memory_group *excluded, void *arg);
+struct memory_block *find_memory_block_by_id(unsigned long block_id);
#define hotplug_memory_notifier(fn, pri) ({ \
static __meminitdata struct notifier_block fn##_mem_nb =\
{ .notifier_call = fn, .priority = pri };\
register_memory_notifier(&fn##_mem_nb); \
})
+extern int sections_per_block;
+
+static inline unsigned long memory_block_id(unsigned long section_nr)
+{
+ return section_nr / sections_per_block;
+}
+
+static inline unsigned long pfn_to_block_id(unsigned long pfn)
+{
+ return memory_block_id(pfn_to_section_nr(pfn));
+}
+
+static inline unsigned long phys_to_block_id(unsigned long phys)
+{
+ return pfn_to_block_id(PFN_DOWN(phys));
+}
+
#ifdef CONFIG_NUMA
void memory_block_add_nid(struct memory_block *mem, int nid,
enum meminit_context context);
@@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
* can sleep.
*/
extern struct mutex text_mutex;
-
#endif /* _LINUX_MEMORY_H_ */
diff --git a/include/linux/node.h b/include/linux/node.h
index 2b7517892230..5c763253c42c 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
enum meminit_context context)
{
}
+static inline void register_memory_blocks_under_node_early(void)
+{
+}
#endif
extern void unregister_node(struct node *node);
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 2/4] drivers/base/node: remove register_mem_block_under_node_early()
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
@ 2025-05-22 9:17 ` Donet Tom
2025-05-22 9:17 ` [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
` (4 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 9:17 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Donet Tom
The function register_mem_block_under_node_early() is no longer used,
as register_memory_blocks_under_node_early() now handles memory block
registration during early boot.
Removed register_mem_block_under_node_early() and get_nid_for_pfn(),
the latter was only used by the former.
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v4->v5
Added Acked-by tag
V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/node.c | 58 +--------------------------------------------
1 file changed, 1 insertion(+), 57 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index e8b6f6b9ce51..0c97523e7b7d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -748,15 +748,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-static int __ref get_nid_for_pfn(unsigned long pfn)
-{
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- if (system_state < SYSTEM_RUNNING)
- return early_pfn_to_nid(pfn);
-#endif
- return pfn_to_nid(pfn);
-}
-
static void do_register_memory_block_under_node(int nid,
struct memory_block *mem_blk,
enum meminit_context context)
@@ -783,46 +774,6 @@ static void do_register_memory_block_under_node(int nid,
ret);
}
-/* register memory section under specified node if it spans that node */
-static int register_mem_block_under_node_early(struct memory_block *mem_blk,
- void *arg)
-{
- unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
- unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
- unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
- int nid = *(int *)arg;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
- int page_nid;
-
- /*
- * memory block could have several absent sections from start.
- * skip pfn range from absent section
- */
- if (!pfn_in_present_section(pfn)) {
- pfn = round_down(pfn + PAGES_PER_SECTION,
- PAGES_PER_SECTION) - 1;
- continue;
- }
-
- /*
- * We need to check if page belongs to nid only at the boot
- * case because node's ranges can be interleaved.
- */
- page_nid = get_nid_for_pfn(pfn);
- if (page_nid < 0)
- continue;
- if (page_nid != nid)
- continue;
-
- do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
- return 0;
- }
- /* mem section does not span the specified node */
- return 0;
-}
-
/*
* During hotplug we know that all pages in the memory block belong to the same
* node.
@@ -890,15 +841,8 @@ void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
unsigned long end_pfn,
enum meminit_context context)
{
- walk_memory_blocks_func_t func;
-
- if (context == MEMINIT_HOTPLUG)
- func = register_mem_block_under_node_hotplug;
- else
- func = register_mem_block_under_node_early;
-
walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
- (void *)&nid, func);
+ (void *)&nid, register_mem_block_under_node_hotplug);
return;
}
#endif /* CONFIG_MEMORY_HOTPLUG */
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
2025-05-22 9:17 ` [PATCH v5 2/4] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
@ 2025-05-22 9:17 ` Donet Tom
2025-05-22 10:06 ` Oscar Salvador
2025-05-22 12:09 ` David Hildenbrand
2025-05-22 9:17 ` [PATCH v5 4/4] drivers/base/node : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
` (3 subsequent siblings)
5 siblings, 2 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 9:17 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Donet Tom
register_one_node() is now only called via cpu_up() → __try_online_node()
during CPU hotplug operations to online a node.
At this stage, the node has not yet had any memory added. As a result,
there are no memory blocks to walk or register, so calling
register_memory_blocks_under_node() is unnecessary.
Therefore, the call to register_memory_blocks_under_node()
has been removed from register_one_node().
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v4->v5
1. register_one_node() implementation simplifiedi based on David's
comment.
2. Added Acked-by tag
V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
include/linux/node.h | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/include/linux/node.h b/include/linux/node.h
index 5c763253c42c..6cf349c26780 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -136,18 +136,8 @@ static inline int register_one_node(int nid)
{
int error = 0;
- if (node_online(nid)) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- unsigned long start_pfn = pgdat->node_start_pfn;
- unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
-
+ if (node_online(nid))
error = __register_one_node(nid);
- if (error)
- return error;
- register_memory_blocks_under_node(nid, start_pfn, end_pfn,
- MEMINIT_EARLY);
- }
-
return error;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 4/4] drivers/base/node : Rename register_memory_blocks_under_node() and remove context argument
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
2025-05-22 9:17 ` [PATCH v5 2/4] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
2025-05-22 9:17 ` [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
@ 2025-05-22 9:17 ` Donet Tom
2025-05-22 10:36 ` [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Mike Rapoport
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 9:17 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Donet Tom
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug()
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v4->v5
Added Acked-by tag
V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/node.c | 5 ++---
include/linux/node.h | 11 +++++------
mm/memory_hotplug.c | 5 ++---
3 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0c97523e7b7d..3f7c8662696a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -837,9 +837,8 @@ static void register_memory_blocks_under_node_early(void)
}
}
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
{
walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
(void *)&nid, register_mem_block_under_node_hotplug);
diff --git a/include/linux/node.h b/include/linux/node.h
index 6cf349c26780..fb567e5aef1c 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -111,13 +111,12 @@ struct memory_block;
extern struct node *node_devices[];
#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context);
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
#else
-static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+static inline void register_memory_blocks_under_node_hotplug(int nid,
+ unsigned long start_pfn,
+ unsigned long end_pfn)
{
}
static inline void register_memory_blocks_under_node_early(void)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..f734cc924b51 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,9 +1575,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
BUG_ON(ret);
}
- register_memory_blocks_under_node(nid, PFN_DOWN(start),
- PFN_UP(start + size - 1),
- MEMINIT_HOTPLUG);
+ register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start),
+ PFN_UP(start + size - 1));
/* create new memmap entry */
if (!strcmp(res->name, "System RAM"))
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-22 9:17 ` [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
@ 2025-05-22 10:06 ` Oscar Salvador
2025-05-22 10:31 ` Mike Rapoport
2025-05-22 12:09 ` David Hildenbrand
1 sibling, 1 reply; 13+ messages in thread
From: Oscar Salvador @ 2025-05-22 10:06 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Mike Rapoport, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang
On Thu, May 22, 2025 at 04:17:30AM -0500, Donet Tom wrote:
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 5c763253c42c..6cf349c26780 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -136,18 +136,8 @@ static inline int register_one_node(int nid)
> {
> int error = 0;
>
> - if (node_online(nid)) {
> - struct pglist_data *pgdat = NODE_DATA(nid);
> - unsigned long start_pfn = pgdat->node_start_pfn;
> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> -
> + if (node_online(nid))
> error = __register_one_node(nid);
Heh, remembering this code always brings me joy.
After this patch, register_one_node() is only called from try_online_node(), right?
Which, before calling in, explicitly sets the node online, so... we can get rid of
the node_online() check unless I am missing something.
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-22 10:06 ` Oscar Salvador
@ 2025-05-22 10:31 ` Mike Rapoport
2025-05-22 11:46 ` Donet Tom
0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-05-22 10:31 UTC (permalink / raw)
To: Oscar Salvador
Cc: Donet Tom, David Hildenbrand, Andrew Morton, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang
On Thu, May 22, 2025 at 12:06:06PM +0200, Oscar Salvador wrote:
> On Thu, May 22, 2025 at 04:17:30AM -0500, Donet Tom wrote:
> > diff --git a/include/linux/node.h b/include/linux/node.h
> > index 5c763253c42c..6cf349c26780 100644
> > --- a/include/linux/node.h
> > +++ b/include/linux/node.h
> > @@ -136,18 +136,8 @@ static inline int register_one_node(int nid)
> > {
> > int error = 0;
> >
> > - if (node_online(nid)) {
> > - struct pglist_data *pgdat = NODE_DATA(nid);
> > - unsigned long start_pfn = pgdat->node_start_pfn;
> > - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> > -
> > + if (node_online(nid))
> > error = __register_one_node(nid);
>
> Heh, remembering this code always brings me joy.
>
> After this patch, register_one_node() is only called from try_online_node(), right?
> Which, before calling in, explicitly sets the node online, so... we can get rid of
> the node_online() check unless I am missing something.
I think you are right and a sensible follow up cleanup can be renaming
__register_one_node() to register_one_node() :)
> --
> Oscar Salvador
> SUSE Labs
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (2 preceding siblings ...)
2025-05-22 9:17 ` [PATCH v5 4/4] drivers/base/node : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
@ 2025-05-22 10:36 ` Mike Rapoport
2025-05-22 11:59 ` Donet Tom
2025-05-22 12:09 ` David Hildenbrand
2025-05-23 8:46 ` Oscar Salvador
5 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-05-22 10:36 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang
On Thu, May 22, 2025 at 04:17:28AM -0500, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data
>
> However, not all PFNs within this range necessarily belong to the same
> node—some may belong to other nodes. Additionally, due to the
> discontiguous nature of physical memory, certain sections within a
> `memory block` may be absent.
>
> As a result, `memory blocks` that fall between a node’s start and end
> PFNs may span across multiple nodes, and some sections within those blocks
> may be missing. `Memory blocks` have a fixed size, which is architecture
> dependent.
>
> Due to these considerations, the memory block registration is currently
> performed as follows:
>
> for_each_online_node(nid):
> start_pfn = pgdat->node_start_pfn;
> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
> for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
> pfn_mb_end = pfn_start + memory_block_pfns - 1
> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
> if (get_nid_for_pfn(pfn) != nid):
> continue;
> else
> do_register_memory_block_under_node(nid, mem_blk,
> MEMINIT_EARLY);
>
> Here, we derive the start and end PFNs from the node's pg_data, then
> determine the memory blocks that may belong to the node. For each
> `memory block` in this range, we inspect all PFNs it contains and check
> their associated NUMA node ID. If a PFN within the block matches the
> current node, the memory block is registered under that node.
>
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> a binary search in the `memblock regions` to determine the NUMA node ID
> for a given PFN. If it is not enabled, the node ID is retrieved directly
> from the struct page.
>
> On large systems, this process can become time-consuming, especially since
> we iterate over each `memory block` and all PFNs within it until a match is
> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> overhead of the binary search increases the execution time significantly,
> potentially leading to soft lockups during boot.
>
> In this patch, we iterate over `memblock region` to identify the
> `memory blocks` that belong to the current NUMA node. `memblock regions`
> are contiguous memory ranges, each associated with a single NUMA node, and
> they do not span across multiple nodes.
>
> for_each_memory_region(r): // r => region
> if (!node_online(r->nid)):
> continue;
> else
> for_each_memory_block_between(r->base, r->base + r->size - 1):
> do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>
> We iterate over all memblock regions, and if the node associated with the
> region is online, we calculate the start and end memory blocks based on the
> region's start and end PFNs. We then register all the memory blocks within
> that range under the region node.
>
> Test Results on My system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
>
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
>
> With this patch
> ---------------
> Startup finished in 15.621s (kernel) - 46% Improvement
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
> v4 -> v5
>
> 1. Moved all helpers(memory_block_id(), pfn_to_block_id(), and phys_to_block_id())
> into memory.h and exported sections_per_block.
> 2. register_memory_blocks_early() moved out of for_each_online_node().
> Now we iterate over all memory regions at once and register the
> memory blocks.
>
> Tested corner cases where memory blocks span across multiple memblock regions; it
> is working fine.
>
> #cd /sys/devices/system/node/
> # find node1/ |grep memory0
> node1/memory0
> # find node0/ |grep memory0
> node0/memory0
> # find node0/ |grep memory0
> node2/memory0
> # cat node0/memory0/valid_zones
> none
>
> V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
> V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> drivers/base/memory.c | 21 ++++----------------
> drivers/base/node.c | 45 ++++++++++++++++++++++++++++++++++++++++--
> include/linux/memory.h | 19 +++++++++++++++++-
> include/linux/node.h | 3 +++
> 4 files changed, 68 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 19469e7f88c2..39fcc075a36f 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -22,6 +22,7 @@
> #include <linux/stat.h>
> #include <linux/slab.h>
> #include <linux/xarray.h>
> +#include <linux/export.h>
>
> #include <linux/atomic.h>
> #include <linux/uaccess.h>
> @@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
>
> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>
> -static int sections_per_block;
> -
> -static inline unsigned long memory_block_id(unsigned long section_nr)
> -{
> - return section_nr / sections_per_block;
> -}
> -
> -static inline unsigned long pfn_to_block_id(unsigned long pfn)
> -{
> - return memory_block_id(pfn_to_section_nr(pfn));
> -}
> -
> -static inline unsigned long phys_to_block_id(unsigned long phys)
> -{
> - return pfn_to_block_id(PFN_DOWN(phys));
> -}
> +int sections_per_block;
> +EXPORT_SYMBOL(sections_per_block);
>
> static int memory_subsys_online(struct device *dev);
> static int memory_subsys_offline(struct device *dev);
> @@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> *
> * Called under device_hotplug_lock.
> */
> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
> {
> struct memory_block *mem;
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cd13ef287011..e8b6f6b9ce51 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
> #include <linux/pm_runtime.h>
> #include <linux/swap.h>
> #include <linux/slab.h>
> +#include <linux/memblock.h>
>
> static const struct bus_type node_subsys = {
> .name = "node",
> @@ -850,6 +851,41 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
> }
>
> +/*
> + * register_memory_blocks_under_node_early : Register the memory blocks
> + * under the nodes.
> + *
> + * This function iterates over all memblock regions, and if the node associated with
> + * the region is online, calculates the start and end memory blocks based on the
> + * region's start and end PFNs. Then, registers all the memory blocks within that
> + * range under the region node.
> + */
> +static void register_memory_blocks_under_node_early(void)
> +{
> + struct memblock_region *r;
> +
> + for_each_mem_region(r) {
> + const unsigned long start_block_id = phys_to_block_id(r->base);
> + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
> + unsigned long block_id;
> +
> + if (!node_online(r->nid))
memblock_get_region_node() please, otherwise it won't build for !NUMA.
Otherwise
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> + continue;
> +
> + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
> + struct memory_block *mem;
> +
> + mem = find_memory_block_by_id(block_id);
> + if (!mem)
> + continue;
> +
> + do_register_memory_block_under_node(r->nid, mem, MEMINIT_EARLY);
> + put_device(&mem->dev);
> + }
> +
> + }
> +}
> +
> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context)
> @@ -971,11 +1007,16 @@ void __init node_dev_init(void)
>
> /*
> * Create all node devices, which will properly link the node
> - * to applicable memory block devices and already created cpu devices.
> + * to already created cpu devices.
> */
> for_each_online_node(i) {
> - ret = register_one_node(i);
> + ret = __register_one_node(i);
> if (ret)
> panic("%s() failed to add node: %d\n", __func__, ret);
> }
> +
> + /*
> + * Link the node to memory block devices
> + */
> + register_memory_blocks_under_node_early();
> }
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 12daa6ec7d09..2a61088e17ad 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
> typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
> struct memory_group *excluded, void *arg);
> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
> #define hotplug_memory_notifier(fn, pri) ({ \
> static __meminitdata struct notifier_block fn##_mem_nb =\
> { .notifier_call = fn, .priority = pri };\
> register_memory_notifier(&fn##_mem_nb); \
> })
>
> +extern int sections_per_block;
> +
> +static inline unsigned long memory_block_id(unsigned long section_nr)
> +{
> + return section_nr / sections_per_block;
> +}
> +
> +static inline unsigned long pfn_to_block_id(unsigned long pfn)
> +{
> + return memory_block_id(pfn_to_section_nr(pfn));
> +}
> +
> +static inline unsigned long phys_to_block_id(unsigned long phys)
> +{
> + return pfn_to_block_id(PFN_DOWN(phys));
> +}
> +
> #ifdef CONFIG_NUMA
> void memory_block_add_nid(struct memory_block *mem, int nid,
> enum meminit_context context);
> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
> * can sleep.
> */
> extern struct mutex text_mutex;
> -
> #endif /* _LINUX_MEMORY_H_ */
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 2b7517892230..5c763253c42c 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
> enum meminit_context context)
> {
> }
> +static inline void register_memory_blocks_under_node_early(void)
> +{
> +}
> #endif
>
> extern void unregister_node(struct node *node);
> --
> 2.43.5
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-22 10:31 ` Mike Rapoport
@ 2025-05-22 11:46 ` Donet Tom
0 siblings, 0 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 11:46 UTC (permalink / raw)
To: Mike Rapoport, Oscar Salvador
Cc: David Hildenbrand, Andrew Morton, Zi Yan, Greg Kroah-Hartman,
Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang
On 5/22/25 4:01 PM, Mike Rapoport wrote:
> On Thu, May 22, 2025 at 12:06:06PM +0200, Oscar Salvador wrote:
>> On Thu, May 22, 2025 at 04:17:30AM -0500, Donet Tom wrote:
>>> diff --git a/include/linux/node.h b/include/linux/node.h
>>> index 5c763253c42c..6cf349c26780 100644
>>> --- a/include/linux/node.h
>>> +++ b/include/linux/node.h
>>> @@ -136,18 +136,8 @@ static inline int register_one_node(int nid)
>>> {
>>> int error = 0;
>>>
>>> - if (node_online(nid)) {
>>> - struct pglist_data *pgdat = NODE_DATA(nid);
>>> - unsigned long start_pfn = pgdat->node_start_pfn;
>>> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>>> -
>>> + if (node_online(nid))
>>> error = __register_one_node(nid);
>> Heh, remembering this code always brings me joy.
>>
>> After this patch, register_one_node() is only called from try_online_node(), right?
>> Which, before calling in, explicitly sets the node online, so... we can get rid of
>> the node_online() check unless I am missing something.
>
> I think you are right and a sensible follow up cleanup can be renaming
> __register_one_node() to register_one_node() :)
Thank you, Oscar and Mike.
Yes, I agree—we can remove register_one_node() entirely.
I will go ahead and remove register_one_node(), rename __register_one_node() to
register_one_node(), and send the next revision.
Thanks
DOnet
>
>> --
>> Oscar Salvador
>> SUSE Labs
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-22 10:36 ` [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Mike Rapoport
@ 2025-05-22 11:59 ` Donet Tom
0 siblings, 0 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 11:59 UTC (permalink / raw)
To: Mike Rapoport
Cc: David Hildenbrand, Andrew Morton, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang
On 5/22/25 4:06 PM, Mike Rapoport wrote:
> On Thu, May 22, 2025 at 04:17:28AM -0500, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's pg_data
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node—some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>> start_pfn = pgdat->node_start_pfn;
>> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>> for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
>> pfn_mb_end = pfn_start + memory_block_pfns - 1
>> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
>> if (get_nid_for_pfn(pfn) != nid):
>> continue;
>> else
>> do_register_memory_block_under_node(nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
>>
>> On large systems, this process can become time-consuming, especially since
>> we iterate over each `memory block` and all PFNs within it until a match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over `memblock region` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA node, and
>> they do not span across multiple nodes.
>>
>> for_each_memory_region(r): // r => region
>> if (!node_online(r->nid)):
>> continue;
>> else
>> for_each_memory_block_between(r->base, r->base + r->size - 1):
>> do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>>
>> We iterate over all memblock regions, and if the node associated with the
>> region is online, we calculate the start and end memory blocks based on the
>> region's start and end PFNs. We then register all the memory blocks within
>> that range under the region node.
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>
>> ---
>> v4 -> v5
>>
>> 1. Moved all helpers(memory_block_id(), pfn_to_block_id(), and phys_to_block_id())
>> into memory.h and exported sections_per_block.
>> 2. register_memory_blocks_early() moved out of for_each_online_node().
>> Now we iterate over all memory regions at once and register the
>> memory blocks.
>>
>> Tested corner cases where memory blocks span across multiple memblock regions; it
>> is working fine.
>>
>> #cd /sys/devices/system/node/
>> # find node1/ |grep memory0
>> node1/memory0
>> # find node0/ |grep memory0
>> node0/memory0
>> # find node0/ |grep memory0
>> node2/memory0
>> # cat node0/memory0/valid_zones
>> none
>>
>> V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
>> V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
>> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>> drivers/base/memory.c | 21 ++++----------------
>> drivers/base/node.c | 45 ++++++++++++++++++++++++++++++++++++++++--
>> include/linux/memory.h | 19 +++++++++++++++++-
>> include/linux/node.h | 3 +++
>> 4 files changed, 68 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 19469e7f88c2..39fcc075a36f 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -22,6 +22,7 @@
>> #include <linux/stat.h>
>> #include <linux/slab.h>
>> #include <linux/xarray.h>
>> +#include <linux/export.h>
>>
>> #include <linux/atomic.h>
>> #include <linux/uaccess.h>
>> @@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
>>
>> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>>
>> -static int sections_per_block;
>> -
>> -static inline unsigned long memory_block_id(unsigned long section_nr)
>> -{
>> - return section_nr / sections_per_block;
>> -}
>> -
>> -static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> -{
>> - return memory_block_id(pfn_to_section_nr(pfn));
>> -}
>> -
>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>> -{
>> - return pfn_to_block_id(PFN_DOWN(phys));
>> -}
>> +int sections_per_block;
>> +EXPORT_SYMBOL(sections_per_block);
>>
>> static int memory_subsys_online(struct device *dev);
>> static int memory_subsys_offline(struct device *dev);
>> @@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>> *
>> * Called under device_hotplug_lock.
>> */
>> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> {
>> struct memory_block *mem;
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..e8b6f6b9ce51 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>> #include <linux/pm_runtime.h>
>> #include <linux/swap.h>
>> #include <linux/slab.h>
>> +#include <linux/memblock.h>
>>
>> static const struct bus_type node_subsys = {
>> .name = "node",
>> @@ -850,6 +851,41 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>> }
>>
>> +/*
>> + * register_memory_blocks_under_node_early : Register the memory blocks
>> + * under the nodes.
>> + *
>> + * This function iterates over all memblock regions, and if the node associated with
>> + * the region is online, calculates the start and end memory blocks based on the
>> + * region's start and end PFNs. Then, registers all the memory blocks within that
>> + * range under the region node.
>> + */
>> +static void register_memory_blocks_under_node_early(void)
>> +{
>> + struct memblock_region *r;
>> +
>> + for_each_mem_region(r) {
>> + const unsigned long start_block_id = phys_to_block_id(r->base);
>> + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
>> + unsigned long block_id;
>> +
>> + if (!node_online(r->nid))
> memblock_get_region_node() please, otherwise it won't build for !NUMA.
Thank you Mike
I tested with !CONFIG_NUMA, and the build was successful. This is
because node.c is not compiled when CONFIG_NUMA is disabled:
obj-$(CONFIG_NUMA) += node.o
But it is better to use memblock_get_region_node(). I'll make the change
and include it in the next revision.Thanks Donet
> Otherwise
>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
>> + continue;
>> +
>> + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>> + struct memory_block *mem;
>> +
>> + mem = find_memory_block_by_id(block_id);
>> + if (!mem)
>> + continue;
>> +
>> + do_register_memory_block_under_node(r->nid, mem, MEMINIT_EARLY);
>> + put_device(&mem->dev);
>> + }
>> +
>> + }
>> +}
>> +
>> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>> unsigned long end_pfn,
>> enum meminit_context context)
>> @@ -971,11 +1007,16 @@ void __init node_dev_init(void)
>>
>> /*
>> * Create all node devices, which will properly link the node
>> - * to applicable memory block devices and already created cpu devices.
>> + * to already created cpu devices.
>> */
>> for_each_online_node(i) {
>> - ret = register_one_node(i);
>> + ret = __register_one_node(i);
>> if (ret)
>> panic("%s() failed to add node: %d\n", __func__, ret);
>> }
>> +
>> + /*
>> + * Link the node to memory block devices
>> + */
>> + register_memory_blocks_under_node_early();
>> }
>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>> index 12daa6ec7d09..2a61088e17ad 100644
>> --- a/include/linux/memory.h
>> +++ b/include/linux/memory.h
>> @@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
>> typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
>> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
>> struct memory_group *excluded, void *arg);
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
>> #define hotplug_memory_notifier(fn, pri) ({ \
>> static __meminitdata struct notifier_block fn##_mem_nb =\
>> { .notifier_call = fn, .priority = pri };\
>> register_memory_notifier(&fn##_mem_nb); \
>> })
>>
>> +extern int sections_per_block;
>> +
>> +static inline unsigned long memory_block_id(unsigned long section_nr)
>> +{
>> + return section_nr / sections_per_block;
>> +}
>> +
>> +static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> +{
>> + return memory_block_id(pfn_to_section_nr(pfn));
>> +}
>> +
>> +static inline unsigned long phys_to_block_id(unsigned long phys)
>> +{
>> + return pfn_to_block_id(PFN_DOWN(phys));
>> +}
>> +
>> #ifdef CONFIG_NUMA
>> void memory_block_add_nid(struct memory_block *mem, int nid,
>> enum meminit_context context);
>> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
>> * can sleep.
>> */
>> extern struct mutex text_mutex;
>> -
>> #endif /* _LINUX_MEMORY_H_ */
>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index 2b7517892230..5c763253c42c 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
>> enum meminit_context context)
>> {
>> }
>> +static inline void register_memory_blocks_under_node_early(void)
>> +{
>> +}
>> #endif
>>
>> extern void unregister_node(struct node *node);
>> --
>> 2.43.5
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (3 preceding siblings ...)
2025-05-22 10:36 ` [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Mike Rapoport
@ 2025-05-22 12:09 ` David Hildenbrand
2025-05-22 12:29 ` Donet Tom
2025-05-23 8:46 ` Oscar Salvador
5 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-05-22 12:09 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang
On 22.05.25 11:17, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data
>
> However, not all PFNs within this range necessarily belong to the same
> node—some may belong to other nodes. Additionally, due to the
> discontiguous nature of physical memory, certain sections within a
> `memory block` may be absent.
>
> As a result, `memory blocks` that fall between a node’s start and end
> PFNs may span across multiple nodes, and some sections within those blocks
> may be missing. `Memory blocks` have a fixed size, which is architecture
> dependent.
>
> Due to these considerations, the memory block registration is currently
> performed as follows:
>
> for_each_online_node(nid):
> start_pfn = pgdat->node_start_pfn;
> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
> for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
> pfn_mb_end = pfn_start + memory_block_pfns - 1
> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
> if (get_nid_for_pfn(pfn) != nid):
> continue;
> else
> do_register_memory_block_under_node(nid, mem_blk,
> MEMINIT_EARLY);
>
> Here, we derive the start and end PFNs from the node's pg_data, then
> determine the memory blocks that may belong to the node. For each
> `memory block` in this range, we inspect all PFNs it contains and check
> their associated NUMA node ID. If a PFN within the block matches the
> current node, the memory block is registered under that node.
>
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> a binary search in the `memblock regions` to determine the NUMA node ID
> for a given PFN. If it is not enabled, the node ID is retrieved directly
> from the struct page.
>
> On large systems, this process can become time-consuming, especially since
> we iterate over each `memory block` and all PFNs within it until a match is
> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> overhead of the binary search increases the execution time significantly,
> potentially leading to soft lockups during boot.
>
> In this patch, we iterate over `memblock region` to identify the
> `memory blocks` that belong to the current NUMA node. `memblock regions`
> are contiguous memory ranges, each associated with a single NUMA node, and
> they do not span across multiple nodes.
>
> for_each_memory_region(r): // r => region
> if (!node_online(r->nid)):
> continue;
> else
> for_each_memory_block_between(r->base, r->base + r->size - 1):
> do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>
> We iterate over all memblock regions, and if the node associated with the
> region is online, we calculate the start and end memory blocks based on the
> region's start and end PFNs. We then register all the memory blocks within
> that range under the region node.
>
> Test Results on My system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
>
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
>
> With this patch
> ---------------
> Startup finished in 15.621s (kernel) - 46% Improvement
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
> v4 -> v5
>
> 1. Moved all helpers(memory_block_id(), pfn_to_block_id(), and phys_to_block_id())
> into memory.h and exported sections_per_block.
> 2. register_memory_blocks_early() moved out of for_each_online_node().
> Now we iterate over all memory regions at once and register the
> memory blocks.
>
> Tested corner cases where memory blocks span across multiple memblock regions; it
> is working fine.
>
> #cd /sys/devices/system/node/
> # find node1/ |grep memory0
> node1/memory0
> # find node0/ |grep memory0
> node0/memory0
> # find node0/ |grep memory0
> node2/memory0
> # cat node0/memory0/valid_zones
> none
>
> V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
> V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> drivers/base/memory.c | 21 ++++----------------
> drivers/base/node.c | 45 ++++++++++++++++++++++++++++++++++++++++--
> include/linux/memory.h | 19 +++++++++++++++++-
> include/linux/node.h | 3 +++
> 4 files changed, 68 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 19469e7f88c2..39fcc075a36f 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -22,6 +22,7 @@
> #include <linux/stat.h>
> #include <linux/slab.h>
> #include <linux/xarray.h>
> +#include <linux/export.h>
>
> #include <linux/atomic.h>
> #include <linux/uaccess.h>
> @@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
>
> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>
> -static int sections_per_block;
> -
> -static inline unsigned long memory_block_id(unsigned long section_nr)
> -{
> - return section_nr / sections_per_block;
> -}
> -
> -static inline unsigned long pfn_to_block_id(unsigned long pfn)
> -{
> - return memory_block_id(pfn_to_section_nr(pfn));
> -}
> -
> -static inline unsigned long phys_to_block_id(unsigned long phys)
> -{
> - return pfn_to_block_id(PFN_DOWN(phys));
> -}
> +int sections_per_block;
> +EXPORT_SYMBOL(sections_per_block);
>
> static int memory_subsys_online(struct device *dev);
> static int memory_subsys_offline(struct device *dev);
> @@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> *
> * Called under device_hotplug_lock.
> */
> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
> {
> struct memory_block *mem;
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cd13ef287011..e8b6f6b9ce51 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
> #include <linux/pm_runtime.h>
> #include <linux/swap.h>
> #include <linux/slab.h>
> +#include <linux/memblock.h>
>
> static const struct bus_type node_subsys = {
> .name = "node",
> @@ -850,6 +851,41 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
> }
>
> +/*
> + * register_memory_blocks_under_node_early : Register the memory blocks
> + * under the nodes.
"register_memory_blocks_under_nodes"
> + *
> + * This function iterates over all memblock regions, and if the node associated with
"the node" does not apply.
> + * the region is online, calculates the start and end memory blocks based on the
> + * region's start and end PFNs. Then, registers all the memory blocks within that
> + * range under the region node.
More like "registers all memory blocks under the corresponding nodes
..." then clarify that a block might get registered under multiple nodes
etc.
> + */
> +static void register_memory_blocks_under_node_early(void)
> +{
> + struct memblock_region *r;
> +
> + for_each_mem_region(r) {
> + const unsigned long start_block_id = phys_to_block_id(r->base);
> + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
> + unsigned long block_id;
> +
> + if (!node_online(r->nid))
> + continue;
> +
> + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
> + struct memory_block *mem;
> +
> + mem = find_memory_block_by_id(block_id);
> + if (!mem)
> + continue;
> +
> + do_register_memory_block_under_node(r->nid, mem, MEMINIT_EARLY);
> + put_device(&mem->dev);
> + }
> +
> + }
> +}
> +
> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context)
> @@ -971,11 +1007,16 @@ void __init node_dev_init(void)
>
> /*
> * Create all node devices, which will properly link the node
> - * to applicable memory block devices and already created cpu devices.
> + * to already created cpu devices.
> */
> for_each_online_node(i) {
> - ret = register_one_node(i);
> + ret = __register_one_node(i);
> if (ret)
> panic("%s() failed to add node: %d\n", __func__, ret);
> }
> +
> + /*
> + * Link the node to memory block devices
> + */
> + register_memory_blocks_under_node_early();
> }> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 12daa6ec7d09..2a61088e17ad 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
> typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
> struct memory_group *excluded, void *arg);
> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
> #define hotplug_memory_notifier(fn, pri) ({ \
> static __meminitdata struct notifier_block fn##_mem_nb =\
> { .notifier_call = fn, .priority = pri };\
> register_memory_notifier(&fn##_mem_nb); \
> })
>
> +extern int sections_per_block;
> +
> +static inline unsigned long memory_block_id(unsigned long section_nr)
> +{
> + return section_nr / sections_per_block;
> +}
> +
> +static inline unsigned long pfn_to_block_id(unsigned long pfn)
> +{
> + return memory_block_id(pfn_to_section_nr(pfn));
> +}
> +
> +static inline unsigned long phys_to_block_id(unsigned long phys)
> +{
> + return pfn_to_block_id(PFN_DOWN(phys));
> +}
> +
> #ifdef CONFIG_NUMA
> void memory_block_add_nid(struct memory_block *mem, int nid,
> enum meminit_context context);
> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
> * can sleep.
> */
> extern struct mutex text_mutex;
> -
Unrelated change.
Apart from that LGTM
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-22 9:17 ` [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-22 10:06 ` Oscar Salvador
@ 2025-05-22 12:09 ` David Hildenbrand
1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-22 12:09 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang
On 22.05.25 11:17, Donet Tom wrote:
> register_one_node() is now only called via cpu_up() → __try_online_node()
> during CPU hotplug operations to online a node.
>
> At this stage, the node has not yet had any memory added. As a result,
> there are no memory blocks to walk or register, so calling
> register_memory_blocks_under_node() is unnecessary.
>
> Therefore, the call to register_memory_blocks_under_node()
> has been removed from register_one_node().
>
> Acked-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-22 12:09 ` David Hildenbrand
@ 2025-05-22 12:29 ` Donet Tom
0 siblings, 0 replies; 13+ messages in thread
From: Donet Tom @ 2025-05-22 12:29 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang
On 5/22/25 5:39 PM, David Hildenbrand wrote:
> On 22.05.25 11:17, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified
>> using
>> the node’s start and end PFNs, which are obtained from the node's
>> pg_data
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node—some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those
>> blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>> start_pfn = pgdat->node_start_pfn;
>> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>> for_each_memory_block_between(PFN_PHYS(start_pfn),
>> PFN_PHYS(end_pfn))
>> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
>> pfn_mb_end = pfn_start + memory_block_pfns - 1
>> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
>> if (get_nid_for_pfn(pfn) != nid):
>> continue;
>> else
>> do_register_memory_block_under_node(nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn()
>> performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
>>
>> On large systems, this process can become time-consuming, especially
>> since
>> we iterate over each `memory block` and all PFNs within it until a
>> match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time
>> significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over `memblock region` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA
>> node, and
>> they do not span across multiple nodes.
>>
>> for_each_memory_region(r): // r => region
>> if (!node_online(r->nid)):
>> continue;
>> else
>> for_each_memory_block_between(r->base, r->base + r->size - 1):
>> do_register_memory_block_under_node(r->nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> We iterate over all memblock regions, and if the node associated with
>> the
>> region is online, we calculate the start and end memory blocks based
>> on the
>> region's start and end PFNs. We then register all the memory blocks
>> within
>> that range under the region node.
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>
>> ---
>> v4 -> v5
>>
>> 1. Moved all helpers(memory_block_id(), pfn_to_block_id(), and
>> phys_to_block_id())
>> into memory.h and exported sections_per_block.
>> 2. register_memory_blocks_early() moved out of for_each_online_node().
>> Now we iterate over all memory regions at once and register the
>> memory blocks.
>>
>> Tested corner cases where memory blocks span across multiple
>> memblock regions; it
>> is working fine.
>>
>> #cd /sys/devices/system/node/
>> # find node1/ |grep memory0
>> node1/memory0
>> # find node0/ |grep memory0
>> node0/memory0
>> # find node0/ |grep memory0
>> node2/memory0
>> # cat node0/memory0/valid_zones
>> none
>>
>> V4 -
>> https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
>> V3 -
>> https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
>> v2 -
>> https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>> v1 -
>> https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>> drivers/base/memory.c | 21 ++++----------------
>> drivers/base/node.c | 45 ++++++++++++++++++++++++++++++++++++++++--
>> include/linux/memory.h | 19 +++++++++++++++++-
>> include/linux/node.h | 3 +++
>> 4 files changed, 68 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 19469e7f88c2..39fcc075a36f 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -22,6 +22,7 @@
>> #include <linux/stat.h>
>> #include <linux/slab.h>
>> #include <linux/xarray.h>
>> +#include <linux/export.h>
>> #include <linux/atomic.h>
>> #include <linux/uaccess.h>
>> @@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
>> #define to_memory_block(dev) container_of(dev, struct
>> memory_block, dev)
>> -static int sections_per_block;
>> -
>> -static inline unsigned long memory_block_id(unsigned long section_nr)
>> -{
>> - return section_nr / sections_per_block;
>> -}
>> -
>> -static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> -{
>> - return memory_block_id(pfn_to_section_nr(pfn));
>> -}
>> -
>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>> -{
>> - return pfn_to_block_id(PFN_DOWN(phys));
>> -}
>> +int sections_per_block;
>> +EXPORT_SYMBOL(sections_per_block);
>> static int memory_subsys_online(struct device *dev);
>> static int memory_subsys_offline(struct device *dev);
>> @@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned
>> long start_pfn)
>> *
>> * Called under device_hotplug_lock.
>> */
>> -static struct memory_block *find_memory_block_by_id(unsigned long
>> block_id)
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> {
>> struct memory_block *mem;
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..e8b6f6b9ce51 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>> #include <linux/pm_runtime.h>
>> #include <linux/swap.h>
>> #include <linux/slab.h>
>> +#include <linux/memblock.h>
>> static const struct bus_type node_subsys = {
>> .name = "node",
>> @@ -850,6 +851,41 @@ void unregister_memory_block_under_nodes(struct
>> memory_block *mem_blk)
>> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>> }
>> +/*
>> + * register_memory_blocks_under_node_early : Register the memory blocks
>> + * under the nodes.
>
> "register_memory_blocks_under_nodes"
>
>> + *
>> + * This function iterates over all memblock regions, and if the node
>> associated with
>
> "the node" does not apply.
>
>> + * the region is online, calculates the start and end memory blocks
>> based on the
>> + * region's start and end PFNs. Then, registers all the memory
>> blocks within that
>> + * range under the region node.
>
> More like "registers all memory blocks under the corresponding nodes
> ..." then clarify that a block might get registered under multiple
> nodes etc.
>
>> + */
>> +static void register_memory_blocks_under_node_early(void)
>> +{
>> + struct memblock_region *r;
>> +
>> + for_each_mem_region(r) {
>> + const unsigned long start_block_id = phys_to_block_id(r->base);
>> + const unsigned long end_block_id = phys_to_block_id(r->base
>> + r->size - 1);
>> + unsigned long block_id;
>> +
>> + if (!node_online(r->nid))
>> + continue;
>> +
>> + for (block_id = start_block_id; block_id <= end_block_id;
>> block_id++) {
>> + struct memory_block *mem;
>> +
>> + mem = find_memory_block_by_id(block_id);
>> + if (!mem)
>> + continue;
>> +
>> + do_register_memory_block_under_node(r->nid, mem,
>> MEMINIT_EARLY);
>> + put_device(&mem->dev);
>> + }
>> +
>> + }
>> +}
>> +
>> void register_memory_blocks_under_node(int nid, unsigned long
>> start_pfn,
>> unsigned long end_pfn,
>> enum meminit_context context)
>> @@ -971,11 +1007,16 @@ void __init node_dev_init(void)
>> /*
>> * Create all node devices, which will properly link the node
>> - * to applicable memory block devices and already created cpu
>> devices.
>> + * to already created cpu devices.
>> */
>> for_each_online_node(i) {
>> - ret = register_one_node(i);
>> + ret = __register_one_node(i);
>> if (ret)
>> panic("%s() failed to add node: %d\n", __func__, ret);
>> }
>> +
>> + /*
>> + * Link the node to memory block devices
>> + */
>> + register_memory_blocks_under_node_early();
> > }> diff --git a/include/linux/memory.h b/include/linux/memory.h
>> index 12daa6ec7d09..2a61088e17ad 100644
>> --- a/include/linux/memory.h
>> +++ b/include/linux/memory.h
>> @@ -171,12 +171,30 @@ struct memory_group
>> *memory_group_find_by_id(int mgid);
>> typedef int (*walk_memory_groups_func_t)(struct memory_group *,
>> void *);
>> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t
>> func,
>> struct memory_group *excluded, void *arg);
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
>> #define hotplug_memory_notifier(fn, pri) ({ \
>> static __meminitdata struct notifier_block fn##_mem_nb =\
>> { .notifier_call = fn, .priority = pri };\
>> register_memory_notifier(&fn##_mem_nb); \
>> })
>> +extern int sections_per_block;
>> +
>> +static inline unsigned long memory_block_id(unsigned long section_nr)
>> +{
>> + return section_nr / sections_per_block;
>> +}
>> +
>> +static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> +{
>> + return memory_block_id(pfn_to_section_nr(pfn));
>> +}
>> +
>> +static inline unsigned long phys_to_block_id(unsigned long phys)
>> +{
>> + return pfn_to_block_id(PFN_DOWN(phys));
>> +}
>> +
>> #ifdef CONFIG_NUMA
>> void memory_block_add_nid(struct memory_block *mem, int nid,
>> enum meminit_context context);
>> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block
>> *mem, int nid,
>> * can sleep.
>> */
>> extern struct mutex text_mutex;
>> -
>
> Unrelated change.
>
>
Thank you, David.
I will make the changes and send the next version.
>
>
> Apart from that LGTM
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (4 preceding siblings ...)
2025-05-22 12:09 ` David Hildenbrand
@ 2025-05-23 8:46 ` Oscar Salvador
5 siblings, 0 replies; 13+ messages in thread
From: Oscar Salvador @ 2025-05-23 8:46 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Mike Rapoport, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang
On Thu, May 22, 2025 at 04:17:28AM -0500, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data
>
...
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
With the fixup from Mike:
Acked-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-05-23 8:47 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22 9:17 [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
2025-05-22 9:17 ` [PATCH v5 2/4] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
2025-05-22 9:17 ` [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-22 10:06 ` Oscar Salvador
2025-05-22 10:31 ` Mike Rapoport
2025-05-22 11:46 ` Donet Tom
2025-05-22 12:09 ` David Hildenbrand
2025-05-22 9:17 ` [PATCH v5 4/4] drivers/base/node : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-22 10:36 ` [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Mike Rapoport
2025-05-22 11:59 ` Donet Tom
2025-05-22 12:09 ` David Hildenbrand
2025-05-22 12:29 ` Donet Tom
2025-05-23 8:46 ` Oscar Salvador
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).