From: Mike Rapoport <rppt@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: Donet Tom <donettom@linux.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <osalvador@suse.de>, Zi Yan <ziy@nvidia.com>,
	Ritesh Harjani <ritesh.list@gmail.com>,
	rafael@kernel.org, Danilo Krummrich <dakr@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Alison Schofield <alison.schofield@intel.com>,
	Yury Norov <yury.norov@gmail.com>,
	Dave Jiang <dave.jiang@intel.com>
Subject: Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
Date: Fri, 16 May 2025 13:09:11 +0300
Message-ID: <aCcOx34j5mgiwfcx@kernel.org>
In-Reply-To: <56cb2494-56ba-4895-9dd1-23243c2eecdb@redhat.com>

On Fri, May 16, 2025 at 11:15:29AM +0200, David Hildenbrand wrote:
> On 16.05.25 10:19, Donet Tom wrote:
> > During node device initialization, `memory blocks` are registered under
> > each NUMA node. The `memory blocks` to be registered are identified using
> > the node’s start and end PFNs, which are obtained from the node's pg_data.
> > 
> > However, not all PFNs within this range necessarily belong to the same
> > node—some may belong to other nodes. Additionally, due to the
> > discontiguous nature of physical memory, certain sections within a
> > `memory block` may be absent.
> > 
> > As a result, `memory blocks` that fall between a node’s start and end
> > PFNs may span across multiple nodes, and some sections within those blocks
> > may be missing. `Memory blocks` have a fixed size, which is architecture
> > dependent.
> > 
> > Due to these considerations, the memory block registration is currently
> > performed as follows:
> > 
> > for_each_online_node(nid):
> >      start_pfn = pgdat->node_start_pfn;
> >      end_pfn = pgdat->node_start_pfn + node_spanned_pages;
> >      for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
> >          mem_blk = memory_block_id(pfn_to_section_nr(pfn));
> >          pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
> >          pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
> >          for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
> >              if (get_nid_for_pfn(pfn) != nid):
> >                  continue;
> >              else
> >                  do_register_memory_block_under_node(nid, mem_blk,
> >                                                          MEMINIT_EARLY);
> > 
> > Here, we derive the start and end PFNs from the node's pg_data, then
> > determine the memory blocks that may belong to the node. For each
> > `memory block` in this range, we inspect all PFNs it contains and check
> > their associated NUMA node ID. If a PFN within the block matches the
> > current node, the memory block is registered under that node.
> > 
> > If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> > a binary search in the `memblock regions` to determine the NUMA node ID
> > for a given PFN. If it is not enabled, the node ID is retrieved directly
> > from the struct page.
> > 
> > On large systems, this process can become time-consuming, especially since
> > we iterate over each `memory block` and all PFNs within it until a match is
> > found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> > overhead of the binary search increases the execution time significantly,
> > potentially leading to soft lockups during boot.
> > 
> > In this patch, we iterate over the `memblock regions` to identify the
> > `memory blocks` that belong to the current NUMA node. `memblock regions`
> > are contiguous memory ranges, each associated with a single NUMA node, and
> > they do not span across multiple nodes.
> > 
> > for_each_online_node(nid):
> >    for_each_memory_region(r): // r => region
> >      if (r->nid != nid):
> >        continue;
> >      else
> >        for_each_memory_block_between(r->base, r->base + r->size - 1):
> >          do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
> > 
> > We iterate over all `memblock regions` and identify those that belong to
> > the current NUMA node. For each `memblock region` associated with the
> > current node, we calculate the start and end `memory blocks` based on the
> > region's start and end PFNs. We then register all `memory blocks` within
> > that range under the current node.
> > 
> > Test Results on My system with 32TB RAM
> > =======================================
> > 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
> > 
> > Without this patch
> > ------------------
> > Startup finished in 1min 16.528s (kernel)
> > 
> > With this patch
> > ---------------
> > Startup finished in 17.236s (kernel) - 78% Improvement
> > 
> > 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
> > 
> > Without this patch
> > ------------------
> > Startup finished in 28.320s (kernel)
> > 
> > With this patch
> > ---------------
> > Startup finished in 15.621s (kernel) - 46% Improvement
> > 
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Acked-by: Zi Yan <ziy@nvidia.com>
> > Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> > 
> > ---
> > v3 -> v4
> > 
> > Addressed Mike's comment by making node_dev_init() call __register_one_node().
> > 
> > V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> > v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> > v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> > ---
> >   drivers/base/memory.c  |  4 ++--
> >   drivers/base/node.c    | 41 ++++++++++++++++++++++++++++++++++++++++-
> >   include/linux/memory.h |  2 ++
> >   include/linux/node.h   |  3 +++
> >   4 files changed, 47 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 19469e7f88c2..7f1d266ae593 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
> >   	return memory_block_id(pfn_to_section_nr(pfn));
> >   }
> > -static inline unsigned long phys_to_block_id(unsigned long phys)
> > +unsigned long phys_to_block_id(unsigned long phys)
> >   {
> >   	return pfn_to_block_id(PFN_DOWN(phys));
> >   }
> 
> 
> I was wondering whether we should move all these helpers into a header, and
> export sections_per_block instead. Probably doesn't really matter for your
> use case.
> 
> > @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> >    *
> >    * Called under device_hotplug_lock.
> >    */
> > -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> > +struct memory_block *find_memory_block_by_id(unsigned long block_id)
> >   {
> >   	struct memory_block *mem;
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index cd13ef287011..f8cafd8c8fb1 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -20,6 +20,7 @@
> >   #include <linux/pm_runtime.h>
> >   #include <linux/swap.h>
> >   #include <linux/slab.h>
> > +#include <linux/memblock.h>
> >   static const struct bus_type node_subsys = {
> >   	.name = "node",
> > @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> >   			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
> >   }
> > +/*
> > + * register_memory_blocks_under_node_early : Register the memory
> > + *		  blocks under the current node.
> > + * @nid : Current node under registration
> > + *
> > + * This function iterates over all memblock regions and identifies the regions
> > + * that belong to the current node. For each region which belongs to current
> > + * node, it calculates the start and end memory blocks based on the region's
> > + * start and end PFNs. It then registers all memory blocks within that range
> > + * under the current node.
> > + */
> > +static void register_memory_blocks_under_node_early(int nid)
> > +{
> > +	struct memblock_region *r;
> > +
> > +	for_each_mem_region(r) {
> > +		if (r->nid != nid)
> > +			continue;
> > +
> > +		const unsigned long start_block_id = phys_to_block_id(r->base);
> > +		const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
> > +		unsigned long block_id;
> 
> This should definitely be above the if().
> 
> > +
> > +		for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
> > +			struct memory_block *mem;
> > +
> > +			mem = find_memory_block_by_id(block_id);
> > +			if (!mem)
> > +				continue;
> > +
> > +			do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
> > +			put_device(&mem->dev);
> > +		}
> > +
> > +	}
> > +}
> > +
> >   void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> >   				       unsigned long end_pfn,
> >   				       enum meminit_context context)
> > @@ -974,8 +1012,9 @@ void __init node_dev_init(void)
> >   	 * to applicable memory block devices and already created cpu devices.
> >   	 */
> >   	for_each_online_node(i) {
> > -		ret = register_one_node(i);
> > +		ret =  __register_one_node(i);
> >   		if (ret)
> >   			panic("%s() failed to add node: %d\n", __func__, ret);
> > +		register_memory_blocks_under_node_early(i);
> >   	}
> 
> In general, LGTM.
> 
> 
> BUT :)
> 
> I was wondering whether having a register_memory_blocks_early() call *after*
> the for_each_online_node(), and walking all memory regions only once would
> make a difference.

I don't know how many nodes it would take to see a measurable performance
difference, but having register_memory_blocks_under_node_early() after
for_each_online_node() is definitely nicer.
There's no real need to run for_each_mem_region() for every online node.
 
> We'd have to be smart about memory blocks that fall into multiple regions,
> but it should be a corner case and doable.

This is a corner case that should be handled regardless of the loop order.
And I don't think it's handled today at all.

If we have a block that crosses node boundaries, the current implementation
of register_mem_block_under_node_early() will register it under the first
node.
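
To make the corner case concrete, here is a userspace sketch (not kernel
code) of linking a straddling block to every node it touches. struct region,
BLOCK_SIZE, block_id() and link_blocks() are made-up stand-ins for
struct memblock_region, the memory block size, phys_to_block_id() and the
registration step. None of them exist in the tree, and the per-block node
bitmask is only an assumption about what "handled" could look like:

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 0x100UL	/* hypothetical memory block size in bytes */

/* Hypothetical stand-in for struct memblock_region: [base, base + size). */
struct region {
	unsigned long base;
	unsigned long size;
	int nid;
};

/* Stand-in for phys_to_block_id(): which block a physical address is in. */
static unsigned long block_id(unsigned long phys)
{
	return phys / BLOCK_SIZE;
}

/*
 * One pass over the regions. For every block a region overlaps, set the
 * region's node bit in node_mask[block]. A block shared by two regions on
 * different nodes ends up with both bits set.
 */
static void link_blocks(const struct region *regs, size_t nr_regs,
			unsigned long *node_mask, size_t nr_blocks)
{
	for (size_t i = 0; i < nr_regs; i++) {
		unsigned long start, end;

		assert(regs[i].size > 0);
		assert(regs[i].nid >= 0 &&
		       (size_t)regs[i].nid < 8 * sizeof(unsigned long));

		start = block_id(regs[i].base);
		end = block_id(regs[i].base + regs[i].size - 1);

		for (unsigned long b = start; b <= end && b < nr_blocks; b++)
			node_mask[b] |= 1UL << regs[i].nid;
	}
}
```

With node 0 at [0x0, 0x180) and node 1 at [0x180, 0x300), block 1 covers
[0x100, 0x200) and ends up linked to both nodes, while blocks 0 and 2 are
linked to one node each.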
 
> OTOH, we usually don't expect having a lot of regions, so iterating over
> them is probably not a big bottleneck? Anyhow, just wanted to raise it.

There would be at least one region per node, and having

for_each_online_node()
	for_each_mem_region()

makes the loop O(n²) for no good reason.
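
A small userspace model of the two loop orders, just to show the difference
in region visits. struct region, walk_nested() and walk_once() are
hypothetical names, not memblock API, and per_node[] simply counts the
regions seen for each node where the real code would register memory blocks:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct memblock_region. */
struct region {
	unsigned long base;
	unsigned long size;
	int nid;
};

/* Nested order: every node rescans all regions. Returns regions visited. */
static unsigned long walk_nested(const struct region *regs, size_t nr_regs,
				 int nr_nodes, unsigned int *per_node)
{
	unsigned long visits = 0;

	for (int nid = 0; nid < nr_nodes; nid++) {
		for (size_t i = 0; i < nr_regs; i++) {
			visits++;
			if (regs[i].nid != nid)
				continue;
			per_node[nid]++;	/* real code would register blocks here */
		}
	}
	return visits;
}

/* Single pass: each region already knows its node. Returns regions visited. */
static unsigned long walk_once(const struct region *regs, size_t nr_regs,
			       int nr_nodes, unsigned int *per_node)
{
	unsigned long visits = 0;

	for (size_t i = 0; i < nr_regs; i++) {
		visits++;
		assert(regs[i].nid >= 0 && regs[i].nid < nr_nodes);
		per_node[regs[i].nid]++;	/* real code would register blocks here */
	}
	return visits;
}
```

With 4 regions spread over 3 nodes, walk_nested() visits 12 regions and
walk_once() visits 4, and both fill per_node[] identically.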
 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

