linux-api.vger.kernel.org archive mirror
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	Ben Widawsky <ben.widawsky@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Andi Kleen <ak@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	linux-api@vger.kernel.org
Subject: Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
Date: Mon, 29 Nov 2021 16:16:05 +0530
Message-ID: <87fsrf1bpu.fsf@linux.ibm.com>
In-Reply-To: <YaSsR0z6GN07qyH7@dhcp22.suse.cz>

Michal Hocko <mhocko@suse.com> writes:

> On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policy. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>> 
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> 	    new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> 				  home_node, 0);
>> 
>> The syscall allows specifying a home node/preferred node from which the kernel
>> will fulfill memory allocation requests first.
>> 
>> For address range with MPOL_BIND memory policy, if nodemask specifies more
>> than one node, page allocations will come from the node in the nodemask
>> with sufficient free memory that is closest to the home node/preferred node.
>> 
>> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory in all the nodes specified in the nodemask, the allocation
>> will be attempted from the closest numa node to the home node in the system.
>> 
>> This helps applications to hint at a preferred node for memory allocation
>> and to fall back to _only_ a set of nodes if the memory is not available
>> on the preferred node.  Fallback allocation is attempted from the node which is
>> nearest to the preferred node.
>> 
>> This gives applications control over which NUMA nodes memory is allocated from
>> and avoids the default fallback to slow memory NUMA nodes. For example, on a
>> system with DRAM on NUMA nodes 1, 2 and 3 and slow memory on nodes 10, 11 and 12:
>> 
>>  new_nodes = numa_bitmask_alloc(nr_nodes);
>> 
>>  numa_bitmask_setbit(new_nodes, 1);
>>  numa_bitmask_setbit(new_nodes, 2);
>>  numa_bitmask_setbit(new_nodes, 3);
>> 
>>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
>> 
>>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>> 
>> This will allocate from nodes closer to node 2 and will make sure the kernel
>> only allocates from nodes 1, 2 and 3. Memory will not be allocated from slow
>> memory nodes 10, 11 and 12.
>
> I think you are not really explaining why the home node is needed
> for that usecase. You can limit memory access to those nodes even
> without the home node. Why the default local node is insufficient is
> the missing part of the explanation.
>
> One usecase would be CPU-less nodes and their preference for the
> allocation. If there are others, make sure to mention them in the
> changelog.

Will add this.
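
For readers who want to try the changelog example from userspace: there is
no libc function named sys_set_mempolicy_home_node, so a caller would go
through syscall(2). The following is only a minimal sketch, assuming syscall
number 450 (the number the call received on most architectures once merged;
prefer the value from your kernel headers) and a nodemask that fits in one
unsigned long. The helper names nodes_to_mask, set_home_node and
map_bound_region are hypothetical and not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 450 is the syscall number on most architectures; older
 * headers may not define it, so fall back to that value. */
#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450
#endif

/* Value of MPOL_BIND in <linux/mempolicy.h>, repeated here so the sketch
 * does not depend on libnuma's <numaif.h>. */
#define EXAMPLE_MPOL_BIND 2

/* Build a single-word nodemask from a list of node ids. */
unsigned long nodes_to_mask(const int *nodes, size_t count)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < count; i++) {
		/* Sketch limitation: only ids that fit in one word. */
		assert(nodes[i] >= 0 && nodes[i] < (int)(8 * sizeof(mask)));
		mask |= 1UL << nodes[i];
	}
	return mask;
}

/* Raw wrapper: returns 0 on success, -1 with errno set on failure. */
long set_home_node(void *start, unsigned long len, unsigned long home_node)
{
	return syscall(__NR_set_mempolicy_home_node,
		       (unsigned long)start, len, home_node, 0UL);
}

/* Map an anonymous region, bind it to the given nodes with MPOL_BIND,
 * then set the home node. Returns the mapping, or NULL on any failure. */
void *map_bound_region(size_t len, const int *nodes, size_t count,
		       int home_node)
{
	unsigned long mask = nodes_to_mask(nodes, count);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* maxnode mirrors the "size + 1" convention used in the changelog. */
	if (syscall(__NR_mbind, (unsigned long)p, (unsigned long)len,
		    (unsigned long)EXAMPLE_MPOL_BIND, (unsigned long)&mask,
		    (unsigned long)(8 * sizeof(mask) + 1), 0UL) != 0 ||
	    set_home_node(p, len, (unsigned long)home_node) != 0) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

With the topology from the changelog this would be called with the node
list {1, 2, 3} and home node 2. On a kernel without the syscall, or on a
machine without those nodes, the call fails and the function returns NULL.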

>
>> MPOL_PREFERRED_MANY, on the other hand, will first try to allocate from the
>> node closest to node 2 among nodes 1, 2 and 3. If those nodes don't have
>> enough memory, the kernel will allocate from whichever of the slow memory
>> nodes 10, 11 and 12 is closest to node 2.
>> 
>> Cc: Ben Widawsky <ben.widawsky@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Feng Tang <feng.tang@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: linux-api@vger.kernel.org
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  .../admin-guide/mm/numa_memory_policy.rst     | 14 ++++-
>>  include/linux/mempolicy.h                     |  1 +
>>  mm/mempolicy.c                                | 62 +++++++++++++++++++
>>  3 files changed, 76 insertions(+), 1 deletion(-)
>> 
>> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> index 64fd0ba0d057..6eab52d4c3b2 100644
>> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
>> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> @@ -408,7 +408,7 @@ follows:
>>  Memory Policy APIs
>>  ==================
>>  
>> -Linux supports 3 system calls for controlling memory policy.  These APIS
>> +Linux supports 4 system calls for controlling memory policy.  These APIS
>>  always affect only the calling task, the calling task's address space, or
>>  some shared object mapped into the calling task's address space.
>>  
>> @@ -460,6 +460,18 @@ requested via the 'flags' argument.
>>  
>>  See the mbind(2) man page for more details.
>>  
>> +Set home node for a Range of Task's Address Spacec::
>> +
>> +	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>> +  					 unsigned long home_node,
>> +					 unsigned long flags);
>> +
>> +sys_set_mempolicy_home_node set the home node for a VMA policy present in the
>> +task's address range. The system call updates the home node only for the existing
>> +mempolicy range. Other address ranges are ignored.
>
>> A home node is the NUMA node
>> +closest to which page allocation will come from.
>
> I would rephrase:
> The home node overrides the default allocation policy to allocate memory
> close to the local node for an executing CPU.
>

ok

> [...]
>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *new;
>> +	unsigned long vmstart;
>> +	unsigned long vmend;
>> +	unsigned long end;
>> +	int err = -ENOENT;
>> +
>> +	if (start & ~PAGE_MASK)
>> +		return -EINVAL;
>> +	/*
>> +	 * flags is used for future extension if any.
>> +	 */
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	if (!node_online(home_node))
>> +		return -EINVAL;
>
> You really want to check the home_node before dereferencing the mask.
>

Any reason why we want to check for home node first?
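
For reference while reading the checks quoted on either side of this point,
the range handling can be modelled in userspace roughly as below. This is
only a sketch: MODEL_PAGE_SIZE is an assumed 4 KiB page size, the name
model_check_range is hypothetical, and the node_online() test is left out
because it depends on kernel state.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))

/*
 * Userspace model of the argument checks in the quoted syscall.
 * Returns -EINVAL for a rejected request, 1 for an empty range (where the
 * syscall returns 0 without touching any VMA), and 0 when the range
 * [start, *end) should be walked.
 */
int model_check_range(unsigned long start, unsigned long len,
		      unsigned long flags, unsigned long *end)
{
	assert(end != NULL);

	/* start must be page aligned */
	if (start & ~MODEL_PAGE_MASK)
		return -EINVAL;
	/* flags is reserved for future extension */
	if (flags != 0)
		return -EINVAL;

	/* round the length up to a whole number of pages */
	len = (len + MODEL_PAGE_SIZE - 1) & MODEL_PAGE_MASK;
	*end = start + len;

	/* unsigned wraparound means the range runs past the address space */
	if (*end < start)
		return -EINVAL;
	if (*end == start)
		return 1;
	return 0;
}
```

For example, a 100-byte request at address 4096 is widened to one full page
and yields end == 8192, while a request starting in the last page of the
address space with a length of two pages wraps around and is rejected.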

>> +
>> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
>> +	end = start + len;
>> +
>> +	if (end < start)
>> +		return -EINVAL;
>> +	if (end == start)
>> +		return 0;
>> +	mmap_write_lock(mm);
>> +	vma = find_vma(mm, start);
>> +	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
>> +
>> +		vmstart = max(start, vma->vm_start);
>> +		vmend   = min(end, vma->vm_end);
>> +		new = mpol_dup(vma_policy(vma));
>> +		if (IS_ERR(new)) {
>> +			err = PTR_ERR(new);
>> +			break;
>> +		}
>> +		/*
>> +		 * Only update home node if there is an existing vma policy
>> +		 */
>> +		if (!new)
>> +			continue;
>
> Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
> supported but you seem to be applying the home node to all existing
> policies.


The restriction is done in policy_node():

@@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
	}

+	if ((policy->mode == MPOL_BIND ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    policy->home_node != NUMA_NO_NODE)
+		return policy->home_node;
+
	return nd;
 }
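
To make the effect of that hunk concrete, here is a small userspace model of
the added condition. It is a sketch only: the enum, struct and function
names are hypothetical stand-ins for the kernel's MPOL_* modes, NUMA_NO_NODE
and struct mempolicy, and the rest of policy_node() (the part above the
hunk) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's policy modes; the numeric values here are
 * arbitrary and only meaningful inside this model. */
enum model_mode {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE,
	MODEL_LOCAL,
	MODEL_PREFERRED_MANY,
};

#define MODEL_NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

struct model_policy {
	enum model_mode mode;
	int home_node;
};

/*
 * Model of the added check: the stored home node is honoured only for
 * BIND and PREFERRED_MANY policies, and only if one was actually set.
 * In every other case the caller-supplied node nd is used unchanged.
 */
int model_policy_node(const struct model_policy *policy, int nd)
{
	assert(policy != NULL);

	if ((policy->mode == MODEL_BIND ||
	     policy->mode == MODEL_PREFERRED_MANY) &&
	    policy->home_node != MODEL_NO_NODE)
		return policy->home_node;

	return nd;
}
```

So even though the syscall stores home_node on any existing VMA policy, a
policy such as interleave keeps returning nd in this model, which is the
restriction referred to above.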




>
>> +		new->home_node = home_node;
>> +		err = mbind_range(mm, vmstart, vmend, new);
>> +		if (err)
>> +			break;
>> +	}
>> +	mmap_write_unlock(mm);
>> +	return err;
>> +}
>> +
>
> Other than that I do not see any major issues.
> -- 
> Michal Hocko
> SUSE Labs


-aneesh
