From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Muchun Song <muchun.song@linux.dev>
Cc: linux-mm@kvack.org, Donet Tom <donettom@linux.ibm.com>,
Gang Li <gang.li@linux.dev>,
Daniel Jordan <daniel.m.jordan@oracle.com>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
Date: Fri, 10 Jan 2025 15:07:09 +0530
Message-ID: <87tta7ui0q.fsf@gmail.com>
In-Reply-To: <73DCC4CB-DB4F-4E66-B208-A515A6A4DE96@linux.dev>
Muchun Song <muchun.song@linux.dev> writes:
>> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>>
>> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
>>
>>> gather_bootmem_prealloc() assumes a start nid of 0 and a size of
>>> num_node_state(N_MEMORY). Since memory-attached NUMA nodes can be
>>> interleaved in any fashion, ensure the code checks all online NUMA
>>> nodes as part of gather_bootmem_prealloc_parallel().
>>> Keep max_threads as N_MEMORY so that we can still get a roughly
>>> uniform distribution of online nodes among these parallel threads.
>>>
>>> e.g. qemu cmdline
>>> ========================
>>> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
>>> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>>>
>>
>> I think this patch still might not work for the NUMA config below,
>> because it has an offline node 0, node 1 with only CPUs, and node 2
>> with both CPUs and memory.
>>
>> numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
>> mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
>>
>> Maybe using N_POSSIBLE instead of N_MEMORY in the patch below would
>> help, but let me give this some thought before posting v2.
>
> How about setting .size to nr_node_ids?
>
Yes, I agree. We could do .size = nr_node_ids, which covers every
possible node id regardless of whether a node is online or has memory,
so sparse or interleaved node numbering cannot truncate the walk.
Let me send a v2 with the above fix and your Suggested-by.
Sorry about the delay. Got pulled into other things.
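
For reference, a rough (untested) sketch of what the v2 hunk could end
up looking like, assuming the padata job setup in
gather_bootmem_prealloc() otherwise stays as in the patch below:

static void __init gather_bootmem_prealloc(void)
{
	struct padata_mt_job job = {
		.thread_fn	= gather_bootmem_prealloc_parallel,
		.fn_arg		= NULL,
		.start		= 0,
		/* walk every possible node id, not just nodes with memory */
		.size		= nr_node_ids,
		.align		= 1,
		.min_chunk	= 1,
		/* still cap parallelism at the number of memory nodes */
		.max_threads	= num_node_state(N_MEMORY),
	};

	padata_do_multithreaded(&job);
}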
-ritesh
> Muchun,
> Thanks.
>
>>
>> -ritesh
>>
>>
>>> w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
>>> ==========================
>>> ~ # cat /proc/meminfo |grep -i huge
>>> AnonHugePages: 0 kB
>>> ShmemHugePages: 0 kB
>>> FileHugePages: 0 kB
>>> HugePages_Total: 0
>>> HugePages_Free: 0
>>> HugePages_Rsvd: 0
>>> HugePages_Surp: 0
>>> Hugepagesize: 1048576 kB
>>> Hugetlb: 0 kB
>>>
>>> with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
>>> ===========================
>>> ~ # cat /proc/meminfo |grep -i huge
>>> AnonHugePages: 0 kB
>>> ShmemHugePages: 0 kB
>>> FileHugePages: 0 kB
>>> HugePages_Total: 2
>>> HugePages_Free: 2
>>> HugePages_Rsvd: 0
>>> HugePages_Surp: 0
>>> Hugepagesize: 1048576 kB
>>> Hugetlb: 2097152 kB
>>>
>>> Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
>>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>> Cc: Gang Li <gang.li@linux.dev>
>>> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
>>> Cc: Muchun Song <muchun.song@linux.dev>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: linux-mm@kvack.org
>>> ---
>>>
>>> ==== Additional data ====
>>>
>>> w/o this patch:
>>> ================
>>> ~ # dmesg |grep -Ei "numa|node|huge"
>>> [ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
>>> [ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
>>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
>>> [ 0.000000][ T0] numa: NODE_DATA(0) on node 1
>>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
>>> [ 0.000000][ T0] Movable zone start for each node
>>> [ 0.000000][ T0] Early memory node ranges
>>> [ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
>>> [ 0.000000][ T0] Initmem setup node 0 as memoryless
>>> [ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
>>> [ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
>>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
>>> [ 0.000000][ T0] Fallback order for Node 0: 1
>>> [ 0.000000][ T0] Fallback order for Node 1: 1
>>> [ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
>>> [ 0.044978][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>>> [ 0.209159][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
>>> [ 0.414281][ T1] smp: Brought up 2 nodes, 4 CPUs
>>> [ 0.415268][ T1] numa: Node 0 CPUs: 0-1
>>> [ 0.416030][ T1] numa: Node 1 CPUs: 2-3
>>> [ 13.644459][ T41] node 1 deferred pages initialised in 12040ms
>>> [ 14.241701][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>> [ 14.242781][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>> [ 14.243806][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>>> [ 14.244753][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
>>> [ 16.490452][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
>>> [ 27.804266][ T1] Demotion targets for Node 1: null
>>>
>>> with this patch:
>>> =================
>>> ~ # dmesg |grep -Ei "numa|node|huge"
>>> [ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
>>> [ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
>>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
>>> [ 0.000000][ T0] numa: NODE_DATA(0) on node 1
>>> [ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
>>> [ 0.000000][ T0] Movable zone start for each node
>>> [ 0.000000][ T0] Early memory node ranges
>>> [ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
>>> [ 0.000000][ T0] Initmem setup node 0 as memoryless
>>> [ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
>>> [ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
>>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
>>> [ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
>>> [ 0.000000][ T0] Fallback order for Node 0: 1
>>> [ 0.000000][ T0] Fallback order for Node 1: 1
>>> [ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
>>> [ 0.048825][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
>>> [ 0.204211][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
>>> [ 0.378821][ T1] smp: Brought up 2 nodes, 4 CPUs
>>> [ 0.379642][ T1] numa: Node 0 CPUs: 0-1
>>> [ 0.380302][ T1] numa: Node 1 CPUs: 2-3
>>> [ 11.577527][ T41] node 1 deferred pages initialised in 10250ms
>>> [ 12.557856][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
>>> [ 12.574197][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>> [ 12.576339][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>>> [ 12.577262][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
>>> [ 15.102445][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
>>> [ 26.173888][ T1] Demotion targets for Node 1: null
>>>
>>> mm/hugetlb.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9a3a6e2dee97..60f45314c151 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
>>> .thread_fn = gather_bootmem_prealloc_parallel,
>>> .fn_arg = NULL,
>>> .start = 0,
>>> - .size = num_node_state(N_MEMORY),
>>> + .size = num_node_state(N_ONLINE),
>>> .align = 1,
>>> .min_chunk = 1,
>>> .max_threads = num_node_state(N_MEMORY),
>>> --
>>> 2.39.5