linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* MPOL_BIND on memory only nodes
@ 2016-10-12  9:25 Anshuman Khandual
  2016-10-12  9:43 ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Anshuman Khandual @ 2016-10-12  9:25 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Linux Memory Management List, Andrew Morton, Mel Gorman,
	Michal Hocko, Aneesh Kumar K.V, Balbir Singh, Vlastimil Babka,
	Minchan Kim

Hi,

We have the following function policy_zonelist() which selects a zonelist
during various allocation paths. With this, general user space allocations
(IIUC might not have __GFP_THISNODE) fails while trying to get memory from
a memory only node without CPUs as the application runs some where else
and that node is not part of the nodemask. Why we insist on __GFP_THISNODE ?
On any memory only node its likely that the local node "nd" might not be
part of the nodemask, hence does it make sense to pick up the first node of
the nodemask in those cases without looking for __GFP_THISNODE ?

/* Return a zonelist indicated by gfp for node representing a mempolicy */
static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
	int nd)
{
	switch (policy->mode) {
	case MPOL_PREFERRED:
		if (!(policy->flags & MPOL_F_LOCAL))
			nd = policy->v.preferred_node;
		break;
	case MPOL_BIND:
		/*
		 * Normally, MPOL_BIND allocations are node-local within the
		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
		 * current node isn't part of the mask, we use the zonelist for
		 * the first node in the mask instead.
		 */
		if (unlikely(gfp & __GFP_THISNODE) &&
				unlikely(!node_isset(nd, policy->v.nodes)))
			nd = first_node(policy->v.nodes);
		break;
	default:
		BUG();
	}
	return node_zonelist(nd, gfp);
}

- Anshuman

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-12  9:25 MPOL_BIND on memory only nodes Anshuman Khandual
@ 2016-10-12  9:43 ` Michal Hocko
  2016-10-12 10:38   ` Anshuman Khandual
  2016-10-12 13:16   ` Michal Hocko
  0 siblings, 2 replies; 11+ messages in thread
From: Michal Hocko @ 2016-10-12  9:43 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Mel Gorman, Aneesh Kumar K.V, Balbir Singh,
	Vlastimil Babka, Minchan Kim

On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> Hi,
> 
> We have the following function policy_zonelist() which selects a zonelist
> during various allocation paths. With this, general user space allocations
> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from
> a memory only node without CPUs as the application runs some where else
> and that node is not part of the nodemask.

I am not sure I understand. So you have a task with MPOL_BIND without a
cpu less node in the mask and you are wondering why the memory is not
allocated from that node?

> Why we insist on __GFP_THISNODE ?

AFAIU __GFP_THISNODE just overrides the given node to the policy
nodemask in case the current node is not part of that node mask. In
other words we are ignoring the given node and use what the policy says. 
I can see how this can be confusing especially when confronting the
documentation:

 * __GFP_THISNODE forces the allocation to be satisified from the requested
 *   node with no fallbacks or placement policy enforcements.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-12  9:43 ` Michal Hocko
@ 2016-10-12 10:38   ` Anshuman Khandual
  2016-10-12 11:01     ` Michal Hocko
  2016-10-12 13:16   ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Anshuman Khandual @ 2016-10-12 10:38 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Mel Gorman, Aneesh Kumar K.V, Balbir Singh,
	Vlastimil Babka, Minchan Kim

On 10/12/2016 03:13 PM, Michal Hocko wrote:
> On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
>> Hi,
>>
>> We have the following function policy_zonelist() which selects a zonelist
>> during various allocation paths. With this, general user space allocations
>> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from
>> a memory only node without CPUs as the application runs some where else
>> and that node is not part of the nodemask.

My bad. Was playing with some changes to the zonelists rebuild after
a memory node hotplug and the order of various zones in them.

> 
> I am not sure I understand. So you have a task with MPOL_BIND without a
> cpu less node in the mask and you are wondering why the memory is not
> allocated from that node?

In my experiment, there is a MPOL_BIND call with a CPU less node in
the node mask and the memory is not allocated from that CPU less node.
Thats because the zone of the CPU less node was absent from the
FALLBACK zonelist of the local node.

> 
>> Why we insist on __GFP_THISNODE ?
> 
> AFAIU __GFP_THISNODE just overrides the given node to the policy
> nodemask in case the current node is not part of that node mask. In
> other words we are ignoring the given node and use what the policy says. 

Right but provided the gfp flag has __GFP_THISNODE in it. In absence
of __GFP_THISNODE, the node from the nodemask will not be selected. I
still wonder why ? Can we always go to the first node in the nodemask
for MPOL_BIND interface calls ? Just curious to know why preference
is given to the local node and it's FALLBACK zonelist.

> I can see how this can be confusing especially when confronting the
> documentation:
> 
>  * __GFP_THISNODE forces the allocation to be satisified from the requested
>  *   node with no fallbacks or placement policy enforcements.
> 

Yeah, right.

Thanks for your reply.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-12 10:38   ` Anshuman Khandual
@ 2016-10-12 11:01     ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2016-10-12 11:01 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Mel Gorman, Aneesh Kumar K.V, Balbir Singh,
	Vlastimil Babka, Minchan Kim

On Wed 12-10-16 16:08:48, Anshuman Khandual wrote:
> On 10/12/2016 03:13 PM, Michal Hocko wrote:
> > On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> >> Hi,
> >>
> >> We have the following function policy_zonelist() which selects a zonelist
> >> during various allocation paths. With this, general user space allocations
> >> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from
> >> a memory only node without CPUs as the application runs some where else
> >> and that node is not part of the nodemask.
> 
> My bad. Was playing with some changes to the zonelists rebuild after
> a memory node hotplug and the order of various zones in them.
> 
> > 
> > I am not sure I understand. So you have a task with MPOL_BIND without a
> > cpu less node in the mask and you are wondering why the memory is not
> > allocated from that node?
> 
> In my experiment, there is a MPOL_BIND call with a CPU less node in
> the node mask and the memory is not allocated from that CPU less node.
> Thats because the zone of the CPU less node was absent from the
> FALLBACK zonelist of the local node.

So do I understand this correctly that the issue was caused by
non-upstream changes?

> >> Why we insist on __GFP_THISNODE ?
> > 
> > AFAIU __GFP_THISNODE just overrides the given node to the policy
> > nodemask in case the current node is not part of that node mask. In
> > other words we are ignoring the given node and use what the policy says. 
> 
> Right but provided the gfp flag has __GFP_THISNODE in it. In absence
> of __GFP_THISNODE, the node from the nodemask will not be selected.

In absence of __GFP_THISNODE we will use the zonelist for the given node
and that should contain even memoryless nodes for the fallback. The
nodemask from policy_nodemask() will then make sure that only nodes
relevant to the used policy is used.

> I still wonder why ? Can we always go to the first node in the
> nodemask for MPOL_BIND interface calls ? Just curious to know why
> preference is given to the local node and it's FALLBACK zonelist.

It is not always a local node. Look at how do_huge_pmd_wp_page_fallback
tries to make all the pages into the same node. Also we have
alloc_pages_current() which tries to allocate from the local node which
should not fallback to the firs node in the policy nodemask.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-12  9:43 ` Michal Hocko
  2016-10-12 10:38   ` Anshuman Khandual
@ 2016-10-12 13:16   ` Michal Hocko
  2016-10-13  9:54     ` Anshuman Khandual
  2016-10-13 10:24     ` Mel Gorman
  1 sibling, 2 replies; 11+ messages in thread
From: Michal Hocko @ 2016-10-12 13:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Anshuman Khandual, Aneesh Kumar K.V, Balbir Singh,
	Vlastimil Babka, Minchan Kim

On Wed 12-10-16 11:43:37, Michal Hocko wrote:
> On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
[...]
> > Why we insist on __GFP_THISNODE ?
> 
> AFAIU __GFP_THISNODE just overrides the given node to the policy
> nodemask in case the current node is not part of that node mask. In
> other words we are ignoring the given node and use what the policy says. 
> I can see how this can be confusing especially when confronting the
> documentation:
> 
>  * __GFP_THISNODE forces the allocation to be satisified from the requested
>  *   node with no fallbacks or placement policy enforcements.

You made me think and look into this deeper. I came to the conclusion
that this is actually a relict from the past. policy_zonelist is called
only from 3 places:
- huge_zonelist - never should do __GFP_THISNODE when going this path
- alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
- alloc_pages_current - which uses default_policy id __GFP_THISNODE is
  used

So AFAICS this is essentially a dead code or I am missing something. Mel
do you remember why we needed it in the past? I would be really tempted
to just get rid of this confusing code and this instead:
---
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ad1c96ac313c..98beec47bba9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,25 +1679,17 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	int nd)
 {
-	switch (policy->mode) {
-	case MPOL_PREFERRED:
-		if (!(policy->flags & MPOL_F_LOCAL))
-			nd = policy->v.preferred_node;
-		break;
-	case MPOL_BIND:
+	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
+		nd = policy->v.preferred_node;
+	else {
 		/*
-		 * Normally, MPOL_BIND allocations are node-local within the
-		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
-		 * current node isn't part of the mask, we use the zonelist for
-		 * the first node in the mask instead.
+		 * __GFP_THISNODE shouldn't even be used with the bind policy because
+		 * we might easily break the expectation to stay on the requested node
+		 * and not break the policy.
 		 */
-		if (unlikely(gfp & __GFP_THISNODE) &&
-				unlikely(!node_isset(nd, policy->v.nodes)))
-			nd = first_node(policy->v.nodes);
-		break;
-	default:
-		BUG();
+		WARN_ON_ONCE(polic->mode == MPOL_BIND && (gfp && __GFP_THISNODE));
 	}
+
 	return node_zonelist(nd, gfp);
 }
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-12 13:16   ` Michal Hocko
@ 2016-10-13  9:54     ` Anshuman Khandual
  2016-10-13 10:07       ` Michal Hocko
  2016-10-13 10:24     ` Mel Gorman
  1 sibling, 1 reply; 11+ messages in thread
From: Anshuman Khandual @ 2016-10-13  9:54 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Aneesh Kumar K.V, Balbir Singh, Vlastimil Babka,
	Minchan Kim

On 10/12/2016 06:46 PM, Michal Hocko wrote:
> On Wed 12-10-16 11:43:37, Michal Hocko wrote:
>> On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> [...]
>>> Why we insist on __GFP_THISNODE ?
>>
>> AFAIU __GFP_THISNODE just overrides the given node to the policy
>> nodemask in case the current node is not part of that node mask. In
>> other words we are ignoring the given node and use what the policy says. 
>> I can see how this can be confusing especially when confronting the
>> documentation:
>>
>>  * __GFP_THISNODE forces the allocation to be satisified from the requested
>>  *   node with no fallbacks or placement policy enforcements.
> 
> You made me think and look into this deeper. I came to the conclusion
> that this is actually a relict from the past. policy_zonelist is called
> only from 3 places:
> - huge_zonelist - never should do __GFP_THISNODE when going this path
> - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
> - alloc_pages_current - which uses default_policy id __GFP_THISNODE is
>   used
> 
> So AFAICS this is essentially a dead code or I am missing something. Mel
> do you remember why we needed it in the past? I would be really tempted
> to just get rid of this confusing code and this instead:
> ---
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ad1c96ac313c..98beec47bba9 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1679,25 +1679,17 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>  	int nd)
>  {
> -	switch (policy->mode) {
> -	case MPOL_PREFERRED:
> -		if (!(policy->flags & MPOL_F_LOCAL))
> -			nd = policy->v.preferred_node;
> -		break;
> -	case MPOL_BIND:
> +	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
> +		nd = policy->v.preferred_node;
> +	else {
>  		/*
> -		 * Normally, MPOL_BIND allocations are node-local within the
> -		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
> -		 * current node isn't part of the mask, we use the zonelist for
> -		 * the first node in the mask instead.
> +		 * __GFP_THISNODE shouldn't even be used with the bind policy because
> +		 * we might easily break the expectation to stay on the requested node
> +		 * and not break the policy.
>  		 */
> -		if (unlikely(gfp & __GFP_THISNODE) &&
> -				unlikely(!node_isset(nd, policy->v.nodes)))
> -			nd = first_node(policy->v.nodes);
> -		break;
> -	default:
> -		BUG();
> +		WARN_ON_ONCE(polic->mode == MPOL_BIND && (gfp && __GFP_THISNODE));
>  	}
> +
>  	return node_zonelist(nd, gfp);
>  }

Which makes the function look like this. Even with these changes, MPOL_BIND is
still going to pick up the local node's zonelist instead of the first node in
policy->v.nodes nodemask. It completely ignores policy->v.nodes which it should
not.

/* Return a zonelist indicated by gfp for node representing a mempolicy */
static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
        int nd)
{
        if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
                nd = policy->v.preferred_node;
        else {
                /*
                 * __GFP_THISNODE shouldn't even be used with the bind policy because
                 * we might easily break the expectation to stay on the requested node
                 * and not break the policy.
                 */
                WARN_ON_ONCE(polic->mode == MPOL_BIND && (gfp && __GFP_THISNODE));
        }

        return node_zonelist(nd, gfp);
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-13  9:54     ` Anshuman Khandual
@ 2016-10-13 10:07       ` Michal Hocko
  2016-10-13 10:58         ` Anshuman Khandual
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2016-10-13 10:07 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, Linux Kernel Mailing List,
	Linux Memory Management List, Andrew Morton, Aneesh Kumar K.V,
	Balbir Singh, Vlastimil Babka, Minchan Kim

On Thu 13-10-16 15:24:54, Anshuman Khandual wrote:
[...]
> Which makes the function look like this. Even with these changes, MPOL_BIND is
> still going to pick up the local node's zonelist instead of the first node in
> policy->v.nodes nodemask. It completely ignores policy->v.nodes which it should
> not.

Not really. I have tried to explain earlier. We do not ignore policy
nodemask. This one comes from policy_nodemask. We start with the local
node but fallback to some of the nodes from the nodemask defined by the
policy.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-12 13:16   ` Michal Hocko
  2016-10-13  9:54     ` Anshuman Khandual
@ 2016-10-13 10:24     ` Mel Gorman
  2016-10-13 12:38       ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2016-10-13 10:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Anshuman Khandual, Aneesh Kumar K.V, Balbir Singh,
	Vlastimil Babka, Minchan Kim

On Wed, Oct 12, 2016 at 03:16:27PM +0200, Michal Hocko wrote:
> On Wed 12-10-16 11:43:37, Michal Hocko wrote:
> > On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> [...]
> > > Why we insist on __GFP_THISNODE ?
> > 
> > AFAIU __GFP_THISNODE just overrides the given node to the policy
> > nodemask in case the current node is not part of that node mask. In
> > other words we are ignoring the given node and use what the policy says. 
> > I can see how this can be confusing especially when confronting the
> > documentation:
> > 
> >  * __GFP_THISNODE forces the allocation to be satisified from the requested
> >  *   node with no fallbacks or placement policy enforcements.
> 
> You made me think and look into this deeper. I came to the conclusion
> that this is actually a relict from the past. policy_zonelist is called
> only from 3 places:
> - huge_zonelist - never should do __GFP_THISNODE when going this path
> - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
> - alloc_pages_current - which uses default_policy id __GFP_THISNODE is
>   used
> 
> So AFAICS this is essentially a dead code or I am missing something. Mel
> do you remember why we needed it in the past?

I don't recall a specific reason. It was likely due to confusion on my
part at the time on the exact use of __GFP_THISNODE. The expectation is
that flag is not used in fault paths or with policies. It's meant to
enforce node-locality for kernel internal decisions such as the locality
of slab pages and ensuring that a THP collapse from khugepaged is on the
same node.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-13 10:07       ` Michal Hocko
@ 2016-10-13 10:58         ` Anshuman Khandual
  2016-10-13 12:51           ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Anshuman Khandual @ 2016-10-13 10:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Linux Kernel Mailing List,
	Linux Memory Management List, Andrew Morton, Aneesh Kumar K.V,
	Balbir Singh, Vlastimil Babka, Minchan Kim

On 10/13/2016 03:37 PM, Michal Hocko wrote:
> On Thu 13-10-16 15:24:54, Anshuman Khandual wrote:
> [...]
>> Which makes the function look like this. Even with these changes, MPOL_BIND is
>> still going to pick up the local node's zonelist instead of the first node in
>> policy->v.nodes nodemask. It completely ignores policy->v.nodes which it should
>> not.
> 
> Not really. I have tried to explain earlier. We do not ignore policy
> nodemask. This one comes from policy_nodemask. We start with the local
> node but fallback to some of the nodes from the nodemask defined by the
> policy.
> 

Yeah saw your response but did not get that exactly. We dont ignore
policy nodemask while memory allocation, correct. But my point was
we are ignoring policy nodemask while selecting zonelist which will
be used during page allocation. Though the zone contents of both the
zonelists are likely to be same, would not it be better to get the
zone list from the nodemask as well ? Or I am still missing something
here. The following change is what I am trying to propose.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ad1c96a..f60ab80 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1685,14 +1685,7 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
                        nd = policy->v.preferred_node;
                break;
        case MPOL_BIND:
-               /*
-                * Normally, MPOL_BIND allocations are node-local within the
-                * allowed nodemask.  However, if __GFP_THISNODE is set and the
-                * current node isn't part of the mask, we use the zonelist for
-                * the first node in the mask instead.
-                */
-               if (unlikely(gfp & __GFP_THISNODE) &&
-                               unlikely(!node_isset(nd, policy->v.nodes)))
+               if (unlikely(!node_isset(nd, policy->v.nodes)))
                        nd = first_node(policy->v.nodes);
                break;
        default:

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-13 10:24     ` Mel Gorman
@ 2016-10-13 12:38       ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2016-10-13 12:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Andrew Morton, Anshuman Khandual, Aneesh Kumar K.V, Balbir Singh,
	Vlastimil Babka, Minchan Kim

On Thu 13-10-16 11:24:59, Mel Gorman wrote:
> On Wed, Oct 12, 2016 at 03:16:27PM +0200, Michal Hocko wrote:
> > On Wed 12-10-16 11:43:37, Michal Hocko wrote:
> > > On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> > [...]
> > > > Why we insist on __GFP_THISNODE ?
> > > 
> > > AFAIU __GFP_THISNODE just overrides the given node to the policy
> > > nodemask in case the current node is not part of that node mask. In
> > > other words we are ignoring the given node and use what the policy says. 
> > > I can see how this can be confusing especially when confronting the
> > > documentation:
> > > 
> > >  * __GFP_THISNODE forces the allocation to be satisified from the requested
> > >  *   node with no fallbacks or placement policy enforcements.
> > 
> > You made me think and look into this deeper. I came to the conclusion
> > that this is actually a relict from the past. policy_zonelist is called
> > only from 3 places:
> > - huge_zonelist - never should do __GFP_THISNODE when going this path
> > - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
> > - alloc_pages_current - which uses default_policy id __GFP_THISNODE is
> >   used
> > 
> > So AFAICS this is essentially a dead code or I am missing something. Mel
> > do you remember why we needed it in the past?
> 
> I don't recall a specific reason. It was likely due to confusion on my
> part at the time on the exact use of __GFP_THISNODE. The expectation is
> that flag is not used in fault paths or with policies. It's meant to
> enforce node-locality for kernel internal decisions such as the locality
> of slab pages and ensuring that a THP collapse from khugepaged is on the
> same node.

This is my understanding as well. Thanks for double checking. I will
send a proper patch (it will even compile as a bonus point ;).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: MPOL_BIND on memory only nodes
  2016-10-13 10:58         ` Anshuman Khandual
@ 2016-10-13 12:51           ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2016-10-13 12:51 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mel Gorman, Linux Kernel Mailing List,
	Linux Memory Management List, Andrew Morton, Aneesh Kumar K.V,
	Balbir Singh, Vlastimil Babka, Minchan Kim

On Thu 13-10-16 16:28:27, Anshuman Khandual wrote:
> On 10/13/2016 03:37 PM, Michal Hocko wrote:
> > On Thu 13-10-16 15:24:54, Anshuman Khandual wrote:
> > [...]
> >> Which makes the function look like this. Even with these changes, MPOL_BIND is
> >> still going to pick up the local node's zonelist instead of the first node in
> >> policy->v.nodes nodemask. It completely ignores policy->v.nodes which it should
> >> not.
> > 
> > Not really. I have tried to explain earlier. We do not ignore policy
> > nodemask. This one comes from policy_nodemask. We start with the local
> > node but fallback to some of the nodes from the nodemask defined by the
> > policy.
> > 
> 
> Yeah saw your response but did not get that exactly. We dont ignore
> policy nodemask while memory allocation, correct. But my point was
> we are ignoring policy nodemask while selecting zonelist which will
> be used during page allocation. Though the zone contents of both the
> zonelists are likely to be same, would not it be better to get the
> zone list from the nodemask as well ?

Why. Zonelist from the current node should contain all availanle zones
and get_page_from_freelist then filters this zonelist accoring to
mempolicy and nodemask

> Or I am still missing something
> here. The following change is what I am trying to propose.
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ad1c96a..f60ab80 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1685,14 +1685,7 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>                         nd = policy->v.preferred_node;
>                 break;
>         case MPOL_BIND:
> -               /*
> -                * Normally, MPOL_BIND allocations are node-local within the
> -                * allowed nodemask.  However, if __GFP_THISNODE is set and the
> -                * current node isn't part of the mask, we use the zonelist for
> -                * the first node in the mask instead.
> -                */
> -               if (unlikely(gfp & __GFP_THISNODE) &&
> -                               unlikely(!node_isset(nd, policy->v.nodes)))
> +               if (unlikely(!node_isset(nd, policy->v.nodes)))
>                         nd = first_node(policy->v.nodes);

That shouldn't make much difference as per above.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-10-13 12:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-10-12  9:25 MPOL_BIND on memory only nodes Anshuman Khandual
2016-10-12  9:43 ` Michal Hocko
2016-10-12 10:38   ` Anshuman Khandual
2016-10-12 11:01     ` Michal Hocko
2016-10-12 13:16   ` Michal Hocko
2016-10-13  9:54     ` Anshuman Khandual
2016-10-13 10:07       ` Michal Hocko
2016-10-13 10:58         ` Anshuman Khandual
2016-10-13 12:51           ` Michal Hocko
2016-10-13 10:24     ` Mel Gorman
2016-10-13 12:38       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).