From: Wen Congyang <wency@cn.fujitsu.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-doc@vger.kernel.org, Rob Landley <rob@landley.net>,
	Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Jiang Liu <jiang.liu@huawei.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Minchan Kim <minchan.kim@gmail.com>, Mel Gorman <mgorman@suse.de>,
	Yinghai Lu <yinghai@kernel.org>,
	"rusty@rustcorp.com.au" <rusty@rustcorp.com.au>
Subject: Re: [PART3 Patch 00/14] introduce N_MEMORY
Date: Thu, 15 Nov 2012 14:33:49 +0800
Message-ID: <50A48CCD.4090604@cn.fujitsu.com>
In-Reply-To: <20121114115227.8763c3cd.akpm@linux-foundation.org>

At 11/15/2012 03:52 AM, Andrew Morton Wrote:
> On Fri, 02 Nov 2012 15:41:55 +0800
> Wen Congyang <wency@cn.fujitsu.com> wrote:
> 
>> At 11/02/2012 05:36 AM, David Rientjes Wrote:
>>> On Thu, 1 Nov 2012, Wen Congyang wrote:
>>>
>>>>> This doesn't describe why we need the new node state, unfortunately.  It 
>>>>
>>>> 1. Sometimes, we use the node which contains the memory that can be used by
>>>>    kernel.
>>>> 2. Sometimes, we use the node which contains the memory.
>>>>
>>>> In case1, we use N_HIGH_MEMORY, and we use N_MEMORY in case2.
>>>>
>>>
>>> Yeah, that's clear, but the question is still _why_ we want two different 
>>> nodemasks.  I know that this part of the patchset simply introduces the 
>>> new nodemask because the name "N_MEMORY" is more clear than 
>>> "N_HIGH_MEMORY", but there's no real incentive for making that change by 
>>> introducing a new nodemask where a simple rename would suffice.
>>>
>>> I can only assume that you want to later use one of them for a different 
>>> purpose: those that do not include nodes that consist of only 
>>> ZONE_MOVABLE.  But that change for MPOL_BIND is nacked since it 
>>> significantly changes the semantics of set_mempolicy() and you can't break 
>>> userspace (see my response to that from yesterday).  Until that problem is 
>>> addressed, then there's no reason for the additional nodemask so nack on 
>>> this series as well.
> 
> I cannot locate "my response to that from yesterday".  Specificity, please!
> 
>>
>> I still think that we need two nodemasks: one that stores the nodes which have
>> memory that the kernel can use, and one that stores the nodes which have memory.
>>
>> For example:
>>
>> ==========================
>> static void *__meminit alloc_page_cgroup(size_t size, int nid)
>> {
>> 	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
>> 	void *addr = NULL;
>>
>> 	addr = alloc_pages_exact_nid(nid, size, flags);
>> 	if (addr) {
>> 		kmemleak_alloc(addr, size, 1, flags);
>> 		return addr;
>> 	}
>>
>> 	if (node_state(nid, N_HIGH_MEMORY))
>> 		addr = vzalloc_node(size, nid);
>> 	else
>> 		addr = vzalloc(size);
>>
>> 	return addr;
>> }
>> ==========================
>> If the node only has ZONE_MOVABLE memory, we should use vzalloc().
>> So we should have a mask that stores the node which has memory that
>> the kernel can use.
>>
>> ==========================
>> static int mpol_set_nodemask(struct mempolicy *pol,
>> 		     const nodemask_t *nodes, struct nodemask_scratch *nsc)
>> {
>> 	int ret;
>>
>> 	/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
>> 	if (pol == NULL)
>> 		return 0;
>> 	/* Check N_HIGH_MEMORY */
>> 	nodes_and(nsc->mask1,
>> 		  cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
>> ...
>> 		if (pol->flags & MPOL_F_RELATIVE_NODES)
>> 			mpol_relative_nodemask(&nsc->mask2, nodes,&nsc->mask1);
>> 		else
>> 			nodes_and(nsc->mask2, *nodes, nsc->mask1);
>> ...
>> }
>> ==========================
>> Suppose the user specifies 2 nodes: one has only ZONE_MOVABLE memory, and the
>> other one doesn't. nsc->mask2 should contain both nodes. So we should have a
>> mask that stores the nodes which have memory.
>>
>> There may be something wrong in the change for MPOL_BIND. But this patchset is needed.
> 
> Well, let's discuss the userspace-visible non-back-compatible mpol
> change.  What is it, why did it happen, what is its impact, is it
> acceptable?

With all of the patchsets applied, we can create a node which has only
ZONE_MOVABLE memory. When we tested this feature, we found a problem: we can't
bind a task to such a node, because there is no normal memory on it.
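
To make the failure concrete, here is a small userspace model in C. It is only
a sketch, not kernel code: the names node_model, policy_zone_model and
nodemask_valid_model are made up for illustration, the zone list is reduced to
two entries, and nodemasks are plain bitmasks. It assumes the validity check
accepts a nodemask only if some node in it has pages in a zone at or below
policy_zone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced zone list; the real enum zone_type has more entries. */
enum zone_idx { ZONE_NORMAL_IDX, ZONE_MOVABLE_IDX, NR_ZONES };

#define MAX_NODES 4

/* Model of a node: number of present pages in each zone. */
struct node_model {
	unsigned long present_pages[NR_ZONES];
};

/* Assumption: policy_zone is ZONE_NORMAL in this model. */
static const int policy_zone_model = ZONE_NORMAL_IDX;

/*
 * Return true if at least one node selected by 'mask' (bit N = node N)
 * has pages in a zone at or below policy_zone_model.
 */
static bool nodemask_valid_model(const struct node_model *nodes,
				 unsigned int mask)
{
	assert(nodes != NULL);

	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!(mask & (1u << nid)))
			continue;
		for (int k = 0; k <= policy_zone_model; k++) {
			if (nodes[nid].present_pages[k] > 0)
				return true;
		}
	}

	return false;
}
```

In this model, a mask that contains only a node whose pages are all in
ZONE_MOVABLE is rejected, while a mask that also contains a node with normal
memory is accepted.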

According to the comment in policy_nodemask():
===============
static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
	/* Lower zones don't get a nodemask applied for MPOL_BIND */
	if (unlikely(policy->mode == MPOL_BIND) &&
			gfp_zone(gfp) >= policy_zone &&
			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;

	return NULL;
}
===============

Going by that comment, the mempolicy is meant to affect only the memory used
by userspace. So I think we should allow the user to bind a task to a movable
node.

So in part6 we modify the function is_valid_nodemask() to allow the user to
do this.

We also modify the function policy_nodemask() in part6, because memory may be
allocated in task context for the kernel's own use (for example: when forking
a process, the kernel allocates memory to manage the new task, and userspace
can't access that memory). In this case, gfp_zone() is ZONE_NORMAL, so
gfp_zone() >= policy_zone is true. policy_nodemask() then returns
policy->v.nodes, and we try to allocate the memory on the movable node. That
allocation fails, because the node has no memory that the kernel can use.
So we modify the function policy_nodemask() to fix this problem.
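
The kernel-allocation case can be modelled in userspace as well. The sketch
below is an illustration under stated assumptions, not the part6 patch:
bind_mask_current() mirrors the condition quoted above, can_allocate() assumes
an allocation may use any zone at or below its gfp zone, and
bind_mask_adjusted() is just one hypothetical way to avoid the failure. All
names here are invented for the example, and ALL_NODES stands in for
"return NULL" (no nodemask applied).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced zone list; the real enum zone_type has more entries. */
enum zone_idx { ZONE_NORMAL_IDX, ZONE_MOVABLE_IDX, NR_ZONES };

#define MAX_NODES 4
#define ALL_NODES ((1u << MAX_NODES) - 1)	/* models "no nodemask applied" */

struct node_model {
	unsigned long present_pages[NR_ZONES];
};

/* Assumption: policy_zone is ZONE_NORMAL in this model. */
static const int policy_zone_model = ZONE_NORMAL_IDX;

/* Models the quoted check: apply the bind mask when gfp zone >= policy_zone. */
static unsigned int bind_mask_current(int gfp_zone, unsigned int bind_nodes)
{
	if (gfp_zone >= policy_zone_model)
		return bind_nodes;
	return ALL_NODES;
}

/* True if some node in 'mask' has pages in a zone usable for 'gfp_zone'. */
static bool can_allocate(const struct node_model *nodes, unsigned int mask,
			 int gfp_zone)
{
	assert(nodes != NULL);

	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!(mask & (1u << nid)))
			continue;
		for (int k = 0; k <= gfp_zone; k++) {
			if (nodes[nid].present_pages[k] > 0)
				return true;
		}
	}

	return false;
}

/*
 * Hypothetical adjustment: drop the bind mask for an allocation that
 * none of the bound nodes could satisfy.
 */
static unsigned int bind_mask_adjusted(const struct node_model *nodes,
				       int gfp_zone, unsigned int bind_nodes)
{
	unsigned int mask = bind_mask_current(gfp_zone, bind_nodes);

	if (!can_allocate(nodes, mask, gfp_zone))
		return ALL_NODES;
	return mask;
}
```

With a task bound to a movable-only node, the model shows a ZONE_NORMAL
allocation failing under bind_mask_current(), while a ZONE_MOVABLE allocation
still respects the binding under both variants.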

Does this change mpol?

Thanks
Wen Congyang

> 
> I grabbed "PART1" and "PART2", but that's as far as I got with the six
> memory hotplug patch series.
> 

