From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Ben Widawsky <ben.widawsky@intel.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Feng Tang <feng.tang@intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Mel Gorman <mgorman@techsingularity.net>,
Mike Kravetz <mike.kravetz@oracle.com>,
Randy Dunlap <rdunlap@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>, Andi Kleen <ak@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Huang Ying <ying.huang@intel.com>,
linux-api@vger.kernel.org
Subject: Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
Date: Mon, 29 Nov 2021 16:16:05 +0530 [thread overview]
Message-ID: <87fsrf1bpu.fsf@linux.ibm.com> (raw)
In-Reply-To: <YaSsR0z6GN07qyH7@dhcp22.suse.cz>
Michal Hocko <mhocko@suse.com> writes:
> On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policies. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>>
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> home_node, 0);
>>
>> The syscall allows specifying a home node/preferred node from which the
>> kernel will fulfill memory allocation requests first.
>>
>> For an address range with the MPOL_BIND memory policy, if the nodemask
>> specifies more than one node, page allocations will come from the node in the
>> nodemask with sufficient free memory that is closest to the home node/preferred node.
>>
>> For MPOL_PREFERRED_MANY, if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If none of the
>> nodes specified in the nodemask has enough memory, the allocation will be
>> attempted from the NUMA node closest to the home node in the system.
>>
>> This lets applications hint at a preferred node for memory allocation while
>> restricting fallback to _only_ a set of nodes if memory is not available
>> on the preferred node. Fallback allocation is attempted from the node
>> nearest to the preferred node.
>>
>> This gives applications control over the NUMA nodes used for memory
>> allocation and avoids the default fallback to slow-memory NUMA nodes. For
>> example, consider a system with NUMA nodes 1, 2 and 3 backed by DRAM and
>> nodes 10, 11 and 12 backed by slow memory:
>>
>> new_nodes = numa_bitmask_alloc(nr_nodes);
>>
>> numa_bitmask_setbit(new_nodes, 1);
>> numa_bitmask_setbit(new_nodes, 2);
>> numa_bitmask_setbit(new_nodes, 3);
>>
>> p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);
>>
>> sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>>
>> This will allocate from nodes closer to node 2 and will make sure the kernel
>> only allocates from nodes 1, 2 and 3. Memory will not be allocated from slow
>> memory nodes 10, 11 and 12.
>
> I think you are not really explaining why the home node is really needed
> for that use case. You can limit memory access to those nodes even
> without the home node. Why the default local node is insufficient is
> really the missing part of the explanation.
>
> One use case would be CPU-less nodes and their preference for the
> allocation. If there are others, make sure to mention them in the
> changelog.
Will add this.
>
>> MPOL_PREFERRED_MANY, on the other hand, will first try to allocate from the
>> node closest to node 2 in the node list 1, 2 and 3. If those nodes don't have
>> enough memory, the kernel will allocate from slow memory nodes 10, 11 or 12,
>> whichever is closest to node 2.
>>
>> Cc: Ben Widawsky <ben.widawsky@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Feng Tang <feng.tang@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: linux-api@vger.kernel.org
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>> .../admin-guide/mm/numa_memory_policy.rst | 14 ++++-
>> include/linux/mempolicy.h | 1 +
>> mm/mempolicy.c | 62 +++++++++++++++++++
>> 3 files changed, 76 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> index 64fd0ba0d057..6eab52d4c3b2 100644
>> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
>> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> @@ -408,7 +408,7 @@ follows:
>> Memory Policy APIs
>> ==================
>>
>> -Linux supports 3 system calls for controlling memory policy. These APIS
>> +Linux supports 4 system calls for controlling memory policy. These APIS
>> always affect only the calling task, the calling task's address space, or
>> some shared object mapped into the calling task's address space.
>>
>> @@ -460,6 +460,18 @@ requested via the 'flags' argument.
>>
>> See the mbind(2) man page for more details.
>>
>> +Set home node for a Range of Task's Address Space::
>> +
>> + long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>> + unsigned long home_node,
>> + unsigned long flags);
>> +
>> +sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
>> +task's address range. The system call updates the home node only for the existing
>> +mempolicy range. Other address ranges are ignored. A home node is the NUMA node
>> +closest to which page allocation will come from.
>
> I would rephrase:
> The home node overrides the default allocation policy of allocating memory
> close to the local node of the executing CPU.
>
ok
> [...]
>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> + unsigned long, home_node, unsigned long, flags)
>> +{
>> + struct mm_struct *mm = current->mm;
>> + struct vm_area_struct *vma;
>> + struct mempolicy *new;
>> + unsigned long vmstart;
>> + unsigned long vmend;
>> + unsigned long end;
>> + int err = -ENOENT;
>> +
>> + if (start & ~PAGE_MASK)
>> + return -EINVAL;
>> + /*
>> + * flags is used for future extension if any.
>> + */
>> + if (flags != 0)
>> + return -EINVAL;
>> +
>> + if (!node_online(home_node))
>> + return -EINVAL;
>
> You really want to check the home_node before dereferencing the mask.
>
Any reason why we want to check the home node first?
>> +
>> + len = (len + PAGE_SIZE - 1) & PAGE_MASK;
>> + end = start + len;
>> +
>> + if (end < start)
>> + return -EINVAL;
>> + if (end == start)
>> + return 0;
>> + mmap_write_lock(mm);
>> + vma = find_vma(mm, start);
>> + for (; vma && vma->vm_start < end; vma = vma->vm_next) {
>> +
>> + vmstart = max(start, vma->vm_start);
>> + vmend = min(end, vma->vm_end);
>> + new = mpol_dup(vma_policy(vma));
>> + if (IS_ERR(new)) {
>> + err = PTR_ERR(new);
>> + break;
>> + }
>> + /*
>> + * Only update home node if there is an existing vma policy
>> + */
>> + if (!new)
>> + continue;
>
> Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
> supported, but you seem to be applying the home node to all existing
> policies.
The restriction is enforced in policy_node():
@@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if ((policy->mode == MPOL_BIND ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    policy->home_node != NUMA_NO_NODE)
+		return policy->home_node;
+
 	return nd;
 }
>
>> + new->home_node = home_node;
>> + err = mbind_range(mm, vmstart, vmend, new);
>> + if (err)
>> + break;
>> + }
>> + mmap_write_unlock(mm);
>> + return err;
>> +}
>> +
>
> Other than that I do not see any major issues.
> --
> Michal Hocko
> SUSE Labs
-aneesh
[not found] <20211116064238.727454-1-aneesh.kumar@linux.ibm.com>
2021-11-16 6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
2021-11-29 10:11 ` Michal Hocko
2021-11-29 10:12 ` [PATCH 4/3] mm: drop node from alloc_pages_vma Michal Hocko
2021-11-16 6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
2021-11-29 10:32 ` Michal Hocko
2021-11-29 10:46 ` Aneesh Kumar K.V [this message]
2021-11-29 12:45 ` Michal Hocko
2021-11-29 13:47 ` Aneesh Kumar K.V
2021-11-29 14:52 ` Michal Hocko
2021-11-29 14:59 ` Aneesh Kumar K.V
2021-11-29 15:19 ` Michal Hocko
2021-11-29 22:02 ` Andrew Morton
2021-11-30 8:59 ` Aneesh Kumar K.V
2021-11-30 9:59 ` Michal Hocko
2021-12-01 3:00 ` Andrew Morton
2021-12-01 6:22 ` Aneesh Kumar K.V
2021-12-01 0:47 ` Daniel Jordan
2021-12-01 6:15 ` Aneesh Kumar K.V
2021-12-01 16:22 ` Daniel Jordan
2021-11-16 6:42 ` [PATCH v5 3/3] mm/mempolicy: wire up syscall set_mempolicy_home_node Aneesh Kumar K.V