Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Vlastimil Babka <vbabka@suse.cz>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Greg Thelen <gthelen@google.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
Date: Thu, 26 Feb 2015 00:55:40 +0100	[thread overview]
Message-ID: <54EE60FC.7000909@suse.cz> (raw)
In-Reply-To: <alpine.DEB.2.10.1502251311360.18097@chino.kir.corp.google.com>

On 25.2.2015 22:24, David Rientjes wrote:
>
>> alloc_pages_preferred_node() variant, change the exact_node() variant to pass
>> __GFP_THISNODE, and audit and adjust all callers accordingly.
>>
> Sounds like that should be done as part of a cleanup after the 4.0 issues
> are addressed.  alloc_pages_exact_node() does seem to suggest that we want
> exactly that node, implying __GFP_THISNODE behavior already, so it would
> be good to avoid having this come up again in the future.

Oh lovely, just found out that there's alloc_pages_node which should be the
preferred-only version, but in fact does not differ from 
alloc_pages_exact_node
in any relevant way. I agree we should do some larger cleanup for next 
version.

>> Also, you pass __GFP_NOWARN but that should be covered by GFP_TRANSHUGE
>> already. Of course, nothing guarantees that hugepage == true implies that gfp
>> == GFP_TRANSHUGE... but current in-tree callers conform to that.
>>
> Ah, good point, and it includes __GFP_NORETRY as well which means that
> this patch is busted.  It won't try compaction or direct reclaim in the
> page allocator slowpath because of this:
>
> 	/*
> 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> 	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
> 	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
> 	 * using a larger set of nodes after it has established that the
> 	 * allowed per node queues are empty and that nodes are
> 	 * over allocated.
> 	 */
> 	if (IS_ENABLED(CONFIG_NUMA) &&
> 	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> 		goto nopage;
>
> Hmm.  It would be disappointing to have to pass the nodemask of the exact
> node that we want to allocate from into the page allocator to avoid using
> __GFP_THISNODE.

Yeah.

>
> There's a sneaky way around it by just removing __GFP_NORETRY from
> GFP_TRANSHUGE so the condition above fails and since the page allocator
> won't retry for such a high-order allocation, but that probably just
> papers over this stuff too much already.  I think what we want to do is

Alternatively alloc_pages_exact_node() adds __GFP_THISNODE just to
node_zonelist() call and not to __alloc_pages() gfp_mask proper? Unless 
__GFP_THISNODE
was given *also* in the incoming gfp_mask, this should give us the right 
combination?
But it's also subtle....

> cause the slab allocators to not use __GFP_WAIT if they want to avoid
> reclaim.

Yes, the fewer subtle heuristics we have that include combinations of 
flags (*cough*
GFP_TRANSHUGE *cough*), the better.

> This is probably going to be a much more invasive patch than originally
> thought.

Right, we might be changing behavior not just for slab allocators, but 
also others using such
combination of flags.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Vlastimil Babka <vbabka@suse.cz>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Greg Thelen <gthelen@google.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
Date: Thu, 26 Feb 2015 00:55:40 +0100	[thread overview]
Message-ID: <54EE60FC.7000909@suse.cz> (raw)
In-Reply-To: <alpine.DEB.2.10.1502251311360.18097@chino.kir.corp.google.com>

On 25.2.2015 22:24, David Rientjes wrote:
>
>> alloc_pages_preferred_node() variant, change the exact_node() variant to pass
>> __GFP_THISNODE, and audit and adjust all callers accordingly.
>>
> Sounds like that should be done as part of a cleanup after the 4.0 issues
> are addressed.  alloc_pages_exact_node() does seem to suggest that we want
> exactly that node, implying __GFP_THISNODE behavior already, so it would
> be good to avoid having this come up again in the future.

Oh lovely, just found out that there's alloc_pages_node which should be the
preferred-only version, but in fact does not differ from 
alloc_pages_exact_node
in any relevant way. I agree we should do some larger cleanup for next 
version.

>> Also, you pass __GFP_NOWARN but that should be covered by GFP_TRANSHUGE
>> already. Of course, nothing guarantees that hugepage == true implies that gfp
>> == GFP_TRANSHUGE... but current in-tree callers conform to that.
>>
> Ah, good point, and it includes __GFP_NORETRY as well which means that
> this patch is busted.  It won't try compaction or direct reclaim in the
> page allocator slowpath because of this:
>
> 	/*
> 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> 	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
> 	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
> 	 * using a larger set of nodes after it has established that the
> 	 * allowed per node queues are empty and that nodes are
> 	 * over allocated.
> 	 */
> 	if (IS_ENABLED(CONFIG_NUMA) &&
> 	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> 		goto nopage;
>
> Hmm.  It would be disappointing to have to pass the nodemask of the exact
> node that we want to allocate from into the page allocator to avoid using
> __GFP_THISNODE.

Yeah.

>
> There's a sneaky way around it by just removing __GFP_NORETRY from
> GFP_TRANSHUGE so the condition above fails and since the page allocator
> won't retry for such a high-order allocation, but that probably just
> papers over this stuff too much already.  I think what we want to do is

Alternatively alloc_pages_exact_node() adds __GFP_THISNODE just to
node_zonelist() call and not to __alloc_pages() gfp_mask proper? Unless 
__GFP_THISNODE
was given *also* in the incoming gfp_mask, this should give us the right 
combination?
But it's also subtle....

> cause the slab allocators to not use __GFP_WAIT if they want to avoid
> reclaim.

Yes, the fewer subtle heuristics we have that include combinations of 
flags (*cough*
GFP_TRANSHUGE *cough*), the better.

> This is probably going to be a much more invasive patch than originally
> thought.

Right, we might be changing behavior not just for slab allocators, but 
also others using such
combination of flags.

next prev parent reply	other threads:[~2015-02-25 23:55 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-24 22:24 [patch for-4.0] mm, thp: really limit transparent hugepage allocation to local node David Rientjes
2015-02-24 22:24 ` David Rientjes
2015-02-24 23:24 ` [patch v2 " David Rientjes
2015-02-24 23:24   ` David Rientjes
2015-02-25 10:52   ` Vlastimil Babka
2015-02-25 10:52     ` Vlastimil Babka
2015-02-25 21:24     ` David Rientjes
2015-02-25 21:24       ` David Rientjes
2015-02-25 23:55       ` Vlastimil Babka [this message]
2015-02-25 23:55         ` Vlastimil Babka
2015-04-21  7:31         ` Aneesh Kumar K.V
2015-04-21  7:31           ` Aneesh Kumar K.V
2015-05-05  9:12           ` Vlastimil Babka
2015-05-05  9:12             ` Vlastimil Babka
2015-05-05 13:22             ` Aneesh Kumar K.V
2015-05-05 13:22               ` Aneesh Kumar K.V

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54EE60FC.7000909@suse.cz \
    --to=vbabka@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=gthelen@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.