From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, Ben Widawsky <ben.widawsky@intel.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Feng Tang <feng.tang@intel.com>, Michal Hocko <mhocko@kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Mel Gorman <mgorman@techsingularity.net>,
Mike Kravetz <mike.kravetz@oracle.com>,
Randy Dunlap <rdunlap@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>, Andi Kleen <ak@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Huang Ying <ying.huang@intel.com>,
linux-api@vger.kernel.org
Subject: Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
Date: Tue, 30 Nov 2021 14:29:02 +0530
Message-ID: <87wnkqaujt.fsf@linux.ibm.com>
In-Reply-To: <20211129140215.11b7cf9f1034a7fe7017768c@linux-foundation.org>
Andrew Morton <akpm@linux-foundation.org> writes:
> On Tue, 16 Nov 2021 12:12:37 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policy. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>>
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> home_node, 0);
>>
>> The syscall allows specifying a home node/preferred node from which kernel
>> will fulfill memory allocation requests first.
>>
>> For address range with MPOL_BIND memory policy, if nodemask specifies more
>> than one node, page allocations will come from the node in the nodemask
>> with sufficient free memory that is closest to the home node/preferred node.
>>
>> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory in all the nodes specified in the nodemask, the allocation
>> will be attempted from the closest numa node to the home node in the system.
>>
>> This helps applications to hint at a memory allocation preference node
>> and fallback to _only_ a set of nodes if the memory is not available
>> on the preferred node. Fallback allocation is attempted from the node which is
>> nearest to the preferred node.
>>
>> This gives applications control over which NUMA nodes memory is allocated
>> from and avoids the default fallback to slow memory NUMA nodes. For example,
>> on a system with DRAM on NUMA nodes 1, 2 and 3 and slow memory on nodes
>> 10, 11 and 12:
>>
>> new_nodes = numa_bitmask_alloc(nr_nodes);
>>
>> numa_bitmask_setbit(new_nodes, 1);
>> numa_bitmask_setbit(new_nodes, 2);
>> numa_bitmask_setbit(new_nodes, 3);
>>
>> p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);
>>
>> sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>>
>> This will allocate from nodes closer to node 2 and will make sure the kernel
>> only allocates from nodes 1, 2 and 3. Memory will not be allocated from slow
>> memory nodes 10, 11 and 12.
>>
>> MPOL_PREFERRED_MANY, on the other hand, will first try to allocate from the
>> node closest to node 2 among nodes 1, 2 and 3. If those nodes don't have
>> enough memory, the kernel will allocate from whichever of the slow memory
>> nodes 10, 11 and 12 is closest to node 2.
>>
>> ...
>>
>> @@ -1477,6 +1478,60 @@ static long kernel_mbind(unsigned long start, unsigned long len,
>> return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
>> }
>>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> + unsigned long, home_node, unsigned long, flags)
>> +{
>> + struct mm_struct *mm = current->mm;
>> + struct vm_area_struct *vma;
>> + struct mempolicy *new;
>> + unsigned long vmstart;
>> + unsigned long vmend;
>> + unsigned long end;
>> + int err = -ENOENT;
>> +
>> + if (start & ~PAGE_MASK)
>> + return -EINVAL;
>> + /*
>> + * flags is used for future extension if any.
>> + */
>> + if (flags != 0)
>> + return -EINVAL;
>> +
>> + if (!node_online(home_node))
>> + return -EINVAL;
>
> What's the thinking here? The node can later be offlined and the
> kernel takes no action to reset home nodes, so why not permit setting a
> presently-offline node as the home node? Checking here seems rather
> arbitrary?
The node_online() check is needed to avoid accessing an uninitialised
pgdat structure. Such an access can result in the crash below:
cpu 0x0: Vector: 300 (Data Access) at [c00000000a693840]
pc: c0000000004e9bac: __next_zones_zonelist+0xc/0xa0
lr: c000000000558d54: __alloc_pages+0x474/0x540
sp: c00000000a693ae0
msr: 8000000000009033
dar: 1508
dsisr: 40000000
current = 0xc0000000087f8380
paca = 0xc000000003130000 irqmask: 0x03 irq_happened: 0x01
pid = 1161, comm = test_mpol_prefe
Linux version 5.16.0-rc3-14872-gd6ef4ee28b4f-dirty (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #505 SMP Mon Nov 29 22:16:49 CST 2021
enter ? for help
[link register ] c000000000558d54 __alloc_pages+0x474/0x540
[c00000000a693ae0] c000000000558c68 __alloc_pages+0x388/0x540 (unreliable)
[c00000000a693b60] c00000000059299c alloc_pages_vma+0xcc/0x380
[c00000000a693bd0] c00000000052129c __handle_mm_fault+0xcec/0x1900
[c00000000a693cc0] c000000000522094 handle_mm_fault+0x1e4/0x4f0
[c00000000a693d20] c000000000087288 ___do_page_fault+0x2f8/0xc20
[c00000000a693de0] c000000000087e50 do_page_fault+0x60/0x130
[c00000000a693e10] c00000000000891c data_access_common_virt+0x19c/0x1f0
--- Exception: 300 (Data Access) at 000074931e429160
SP (7fffe8116a50) is in userspace
0:mon>
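For reference, the argument checks in the quoted hunk can be modelled in
userspace. This is only a sketch: MODEL_PAGE_SIZE, model_online_mask and the
model_* helpers are made-up stand-ins (the kernel uses PAGE_MASK and
node_online()), and nothing past the early checks is modelled.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4K pages, for the model only */
#define MODEL_MAX_NODES 64UL	/* assumption: width of the mask below */

/* Hypothetical stand-in for node state: bit n set means node n is online. */
static unsigned long long model_online_mask = 0xfULL;	/* nodes 0-3 */

/* Hypothetical stand-in for the kernel's node_online(). */
static int model_node_online(unsigned long node)
{
	return node < MODEL_MAX_NODES &&
	       ((model_online_mask >> node) & 1ULL) != 0;
}

/* Mirrors only the early argument checks of the quoted hunk. */
static int model_check_args(unsigned long start, unsigned long home_node,
			    unsigned long flags)
{
	/* The alignment test below relies on a power-of-two page size. */
	assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0);

	if (start & (MODEL_PAGE_SIZE - 1))
		return -EINVAL;		/* start must be page aligned */
	if (flags != 0)
		return -EINVAL;		/* flags reserved for future use */
	if (!model_node_online(home_node))
		return -EINVAL;		/* reject nodes that are not online */
	return 0;
}
```

With nodes 0-3 marked online in the model, model_check_args(0x10000, 2, 0)
returns 0, while a home node of 9 is rejected with -EINVAL before any policy
could record it.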
Now IIUC, even after a node is marked offline via try_offline_node(), we
should still be able to access the zonelist details through the pgdat struct.
I was not able to force a NUMA node offline in my test, even after removing
the memory assigned to it.
root@ubuntu-guest:/sys/devices/system/node/node2# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 4046 MB
node 0 free: 3362 MB
node 1 cpus: 2 3
node 1 size: 4090 MB
node 1 free: 3788 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 7
node 3 size: 4063 MB
node 3 free: 3832 MB
node distances:
node 0 1 2 3
0: 10 11 222 33
1: 44 10 55 66
2: 77 88 10 99
3: 101 121 132 10
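To make the "closest to the home node" ordering from the patch description
concrete, here is a rough userspace model that uses the distance table above.
It is a simplification under stated assumptions: nearest_allowed_node() is a
hypothetical helper, "sufficient free memory" is reduced to a non-zero free
size, and the kernel itself walks the home node's zonelist rather than
scanning a distance table like this.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4

/* Distance table copied from the numactl -H output (row = from, column = to). */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
	{  10,  11, 222,  33 },
	{  44,  10,  55,  66 },
	{  77,  88,  10,  99 },
	{ 101, 121, 132,  10 },
};

/*
 * Hypothetical helper: return the node set in 'nodemask' that has free
 * memory and is nearest to 'home', or -1 if no node in the mask qualifies.
 */
static int nearest_allowed_node(int home, unsigned int nodemask,
				const unsigned long free_mb[NR_NODES])
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(home >= 0 && home < NR_NODES);

	for (int n = 0; n < NR_NODES; n++) {
		if (!(nodemask & (1u << n)) || free_mb[n] == 0)
			continue;	/* not in the mask, or nothing free */
		if (node_distance_tbl[home][n] < best_dist) {
			best_dist = node_distance_tbl[home][n];
			best = n;
		}
	}
	return best;
}
```

With the free sizes shown above (node 2 has 0 MB), a home node of 2 and a mask
covering nodes 0-3 resolves to node 0 in this model, since distance 77 is the
smallest among the nodes that still have memory.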