* [patch 1/2] mm: remove GFP_THISNODE
@ 2015-02-26 0:23 ` David Rientjes
0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2015-02-26 0:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme,
Greg Thelen, linux-kernel, linux-mm, netdev, dev
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.
GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected. It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator
slowpath to fail without trying reclaim even though it may be used in
combination with __GFP_WAIT.
An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE. The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is
a no-op in these cases since the page allocator special-cases
__GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.
It's time to just remove GFP_TRANSHUGE entirely. We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_TRANSHUGE and
it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.
This allows the aforementioned functions to actually reclaim as they
should. It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.
Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/gfp.h | 10 ----------
mm/page_alloc.c | 22 ++++++----------------
mm/slab.c | 22 ++++++++++++++++++----
net/openvswitch/flow.c | 4 +++-
4 files changed, 27 insertions(+), 31 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -117,16 +117,6 @@ struct vm_area_struct;
__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
__GFP_NO_KSWAPD)
-/*
- * GFP_THISNODE does not perform any reclaim, you most likely want to
- * use __GFP_THISNODE to allocate from a given node without fallback!
- */
-#ifdef CONFIG_NUMA
-#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
-#else
-#define GFP_THISNODE ((__force gfp_t)0)
-#endif
-
/* This mask makes up all the page movable related flags */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
/* The OOM killer does not compensate for light reclaim */
if (!(gfp_mask & __GFP_FS))
goto out;
- /*
- * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
- * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
- * The caller should handle page allocation failure by itself if
- * it specifies __GFP_THISNODE.
- * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
- */
+ /* The OOM killer may not free memory on a specific node */
if (gfp_mask & __GFP_THISNODE)
goto out;
}
@@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
}
/*
- * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
- * __GFP_NOWARN set) should not cause reclaim since the subsystem
- * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
- * using a larger set of nodes after it has established that the
- * allowed per node queues are empty and that nodes are
- * over allocated.
+ * If this allocation cannot block and it is for a specific node, then
+ * fail early. There's no need to wakeup kswapd or retry for a
+ * speculative node-specific allocation.
*/
- if (IS_ENABLED(CONFIG_NUMA) &&
- (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
+ if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
goto nopage;
retry:
@@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/*
* Check the zones suitable for the gfp_mask contain at least one
* valid zone. It's possible to have an empty zonelist as a result
- * of GFP_THISNODE and a memoryless node
+ * of __GFP_THISNODE and a memoryless node
*/
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,6 +857,11 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep,
return NULL;
}
+static inline gfp_t gfp_exact_node(gfp_t flags)
+{
+ return flags;
+}
+
#else /* CONFIG_NUMA */
static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
@@ -1023,6 +1028,15 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
return __cache_free_alien(cachep, objp, node, page_node);
}
+
+/*
+ * Construct gfp mask to allocate from a specific node but do not invoke reclaim
+ * or warn about failures.
+ */
+static inline gfp_t gfp_exact_node(gfp_t flags)
+{
+ return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+}
#endif
/*
@@ -2825,7 +2839,7 @@ alloc_done:
if (unlikely(!ac->avail)) {
int x;
force_grow:
- x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
+ x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
@@ -3019,7 +3033,7 @@ retry:
get_node(cache, nid) &&
get_node(cache, nid)->free_objects) {
obj = ____cache_alloc_node(cache,
- flags | GFP_THISNODE, nid);
+ gfp_exact_node(flags), nid);
if (obj)
break;
}
@@ -3047,7 +3061,7 @@ retry:
nid = page_to_nid(page);
if (cache_grow(cache, flags, nid, page)) {
obj = ____cache_alloc_node(cache,
- flags | GFP_THISNODE, nid);
+ gfp_exact_node(flags), nid);
if (!obj)
/*
* Another processor may allocate the
@@ -3118,7 +3132,7 @@ retry:
must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
+ x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
if (x)
goto retry;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -100,7 +100,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
new_stats =
kmem_cache_alloc_node(flow_stats_cache,
- GFP_THISNODE |
+ GFP_NOWAIT |
+ __GFP_THISNODE |
+ __GFP_NOWARN |
__GFP_NOMEMALLOC,
node);
if (likely(new_stats)) {
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 50+ messages in thread* [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-26 0:23 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-26 0:23 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_TRANSHUGE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_TRANSHUGE and it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/gfp.h | 10 ---------- mm/page_alloc.c | 22 ++++++---------------- mm/slab.c | 22 ++++++++++++++++++---- net/openvswitch/flow.c | 4 +++- 4 files changed, 27 insertions(+), 31 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -117,16 +117,6 @@ struct vm_area_struct; __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ __GFP_NO_KSWAPD) -/* - * GFP_THISNODE does not perform any reclaim, you most likely want to - * use __GFP_THISNODE to allocate from a given node without fallback! - */ -#ifdef CONFIG_NUMA -#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) -#else -#define GFP_THISNODE ((__force gfp_t)0) -#endif - /* This mask makes up all the page movable related flags */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer does not compensate for light reclaim */ if (!(gfp_mask & __GFP_FS)) goto out; - /* - * GFP_THISNODE contains __GFP_NORETRY and we never hit this. - * Sanity check for bare calls of __GFP_THISNODE, not real OOM. - * The caller should handle page allocation failure by itself if - * it specifies __GFP_THISNODE. - * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER. - */ + /* The OOM killer may not free memory on a specific node */ if (gfp_mask & __GFP_THISNODE) goto out; } @@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, } /* - * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and - * __GFP_NOWARN set) should not cause reclaim since the subsystem - * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim - * using a larger set of nodes after it has established that the - * allowed per node queues are empty and that nodes are - * over allocated. + * If this allocation cannot block and it is for a specific node, then + * fail early. There's no need to wakeup kswapd or retry for a + * speculative node-specific allocation. */ - if (IS_ENABLED(CONFIG_NUMA) && - (gfp_mask & GFP_THISNODE) == GFP_THISNODE) + if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait) goto nopage; retry: @@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, /* * Check the zones suitable for the gfp_mask contain at least one * valid zone. It's possible to have an empty zonelist as a result - * of GFP_THISNODE and a memoryless node + * of __GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) return NULL; diff --git a/mm/slab.c b/mm/slab.c --- a/mm/slab.c +++ b/mm/slab.c @@ -857,6 +857,11 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep, return NULL; } +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return flags; +} + #else /* CONFIG_NUMA */ static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); @@ -1023,6 +1028,15 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) return __cache_free_alien(cachep, objp, node, page_node); } + +/* + * Construct gfp mask to allocate from a specific node but do not invoke reclaim + * or warn about failures. + */ +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; +} #endif /* @@ -2825,7 +2839,7 @@ alloc_done: if (unlikely(!ac->avail)) { int x; force_grow: - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); @@ -3019,7 +3033,7 @@ retry: get_node(cache, nid) && get_node(cache, nid)->free_objects) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (obj) break; } @@ -3047,7 +3061,7 @@ retry: nid = page_to_nid(page); if (cache_grow(cache, flags, nid, page)) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (!obj) /* * Another processor may allocate the @@ -3118,7 +3132,7 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); if (x) goto retry; diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -100,7 +100,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, new_stats = kmem_cache_alloc_node(flow_stats_cache, - GFP_THISNODE | + GFP_NOWAIT | + __GFP_THISNODE | + __GFP_NOWARN | __GFP_NOMEMALLOC, node); if (likely(new_stats)) { ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-26 0:23 ` David Rientjes @ 2015-02-26 0:56 ` Christoph Lameter -1 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-02-26 0:56 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Wed, 25 Feb 2015, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. Well but then its not removing it. You are replacing it with an inline function. > + > +/* > + * Construct gfp mask to allocate from a specific node but do not invoke reclaim > + * or warn about failures. > + */ > +static inline gfp_t gfp_exact_node(gfp_t flags) > +{ > + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; > +} > #endif -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-26 0:56 ` Christoph Lameter 0 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-02-26 0:56 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Wed, 25 Feb 2015, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. Well but then its not removing it. You are replacing it with an inline function. > + > +/* > + * Construct gfp mask to allocate from a specific node but do not invoke reclaim > + * or warn about failures. > + */ > +static inline gfp_t gfp_exact_node(gfp_t flags) > +{ > + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; > +} > #endif ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-26 0:56 ` Christoph Lameter @ 2015-02-26 1:04 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-26 1:04 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Wed, 25 Feb 2015, Christoph Lameter wrote: > On Wed, 25 Feb 2015, David Rientjes wrote: > > > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > Well but then its not removing it. You are replacing it with an inline > function. > Removing GFP_THISNODE, not __GFP_THISNODE. GFP_THISNODE, as the commit message says, is a special collection of flags that means "never try reclaim" and people confuse it for __GFP_THISNODE. There are legitimate usecases where we want __GFP_THISNODE, in other words restricting the allocation to only a specific node, and try reclaim but not warn in failure or retry. The most notable example is in the followup patch for thp, both for page faults and khugepaged, where we want to target the local node but silently fallback to small pages instead. This removes the special "no reclaim" behavior of __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN and relies on clearing __GFP_WAIT instead. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-26 1:04 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-26 1:04 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Wed, 25 Feb 2015, Christoph Lameter wrote: > On Wed, 25 Feb 2015, David Rientjes wrote: > > > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > Well but then its not removing it. You are replacing it with an inline > function. > Removing GFP_THISNODE, not __GFP_THISNODE. GFP_THISNODE, as the commit message says, is a special collection of flags that means "never try reclaim" and people confuse it for __GFP_THISNODE. There are legitimate usecases where we want __GFP_THISNODE, in other words restricting the allocation to only a specific node, and try reclaim but not warn in failure or retry. The most notable example is in the followup patch for thp, both for page faults and khugepaged, where we want to target the local node but silently fallback to small pages instead. This removes the special "no reclaim" behavior of __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN and relies on clearing __GFP_WAIT instead. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-26 0:23 ` David Rientjes @ 2015-02-26 8:30 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-26 8:30 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 02/26/2015 01:23 AM, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > GFP_THISNODE is a secret combination of gfp bits that have different > behavior than expected. It is a combination of __GFP_THISNODE, > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > slowpath to fail without trying reclaim even though it may be used in > combination with __GFP_WAIT. > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > that really just wanted __GFP_THISNODE. The problem doesn't end there, > however, because even it was a no-op for alloc_misplaced_dst_page(), > which also sets __GFP_NORETRY and __GFP_NOWARN, and > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > a no-op in these cases since the page allocator special-cases > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > It's time to just remove GFP_TRANSHUGE entirely. We leave __GFP_THISNODE ^THISNODE :) Although yes, it would be nice if we could replace the GFP_TRANSHUGE magic checks as well. > to restrict an allocation to a local node, but remove GFP_TRANSHUGE and > it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it > wants to avoid reclaim. > > This allows the aforementioned functions to actually reclaim as they > should. It also enables any future callers that want to do > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. So, I agree with the intention, but this has some subtle implications that should be mentioned/decided. The check for GFP_THISNODE in __alloc_pages_slowpath() comes earlier than the check for __GFP_WAIT. So the differences will be: 1) We will now call wake_all_kswapds(), unless __GFP_NO_KSWAPD is passed, which is only done for hugepages and some type of i915 allocation. Do we want the opportunistic attempts from slab to wake up kswapds or do we pass the flag? 2) There will be another attempt on get_page_from_freelist() with different alloc_flags than in the fast path attempt. Without __GFP_WAIT (and also, again, __GFP_KSWAPD, since your commit b104a35d32, which is another subtle check for hugepage allocations btw), it will consider the allocation atomic and add ALLOC_HARDER flag, unless __GFP_NOMEMALLOC is in __gfp_flags - it seems it's generally not. It will also clear ALLOC_CPUSET, which was the concern of b104a35d32. However, if I look at __cpuset_node_allowed(), I see that it's always true for __GFP_THISNODE, which makes me question commit b104a35d32 in light of your patch 2/2 and generally the sanity of all these flags and my career choice. Ugh :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-26 8:30 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-26 8:30 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 02/26/2015 01:23 AM, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > GFP_THISNODE is a secret combination of gfp bits that have different > behavior than expected. It is a combination of __GFP_THISNODE, > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > slowpath to fail without trying reclaim even though it may be used in > combination with __GFP_WAIT. > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > that really just wanted __GFP_THISNODE. The problem doesn't end there, > however, because even it was a no-op for alloc_misplaced_dst_page(), > which also sets __GFP_NORETRY and __GFP_NOWARN, and > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > a no-op in these cases since the page allocator special-cases > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > It's time to just remove GFP_TRANSHUGE entirely. We leave __GFP_THISNODE ^THISNODE :) Although yes, it would be nice if we could replace the GFP_TRANSHUGE magic checks as well. > to restrict an allocation to a local node, but remove GFP_TRANSHUGE and > it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it > wants to avoid reclaim. > > This allows the aforementioned functions to actually reclaim as they > should. It also enables any future callers that want to do > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. So, I agree with the intention, but this has some subtle implications that should be mentioned/decided. The check for GFP_THISNODE in __alloc_pages_slowpath() comes earlier than the check for __GFP_WAIT. So the differences will be: 1) We will now call wake_all_kswapds(), unless __GFP_NO_KSWAPD is passed, which is only done for hugepages and some type of i915 allocation. Do we want the opportunistic attempts from slab to wake up kswapds or do we pass the flag? 2) There will be another attempt on get_page_from_freelist() with different alloc_flags than in the fast path attempt. Without __GFP_WAIT (and also, again, __GFP_KSWAPD, since your commit b104a35d32, which is another subtle check for hugepage allocations btw), it will consider the allocation atomic and add ALLOC_HARDER flag, unless __GFP_NOMEMALLOC is in __gfp_flags - it seems it's generally not. It will also clear ALLOC_CPUSET, which was the concern of b104a35d32. However, if I look at __cpuset_node_allowed(), I see that it's always true for __GFP_THISNODE, which makes me question commit b104a35d32 in light of your patch 2/2 and generally the sanity of all these flags and my career choice. Ugh :) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-26 8:30 ` Vlastimil Babka @ 2015-02-27 3:09 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 3:09 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Thu, 26 Feb 2015, Vlastimil Babka wrote: > > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > > > GFP_THISNODE is a secret combination of gfp bits that have different > > behavior than expected. It is a combination of __GFP_THISNODE, > > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > > slowpath to fail without trying reclaim even though it may be used in > > combination with __GFP_WAIT. > > > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > > that really just wanted __GFP_THISNODE. The problem doesn't end there, > > however, because even it was a no-op for alloc_misplaced_dst_page(), > > which also sets __GFP_NORETRY and __GFP_NOWARN, and > > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > > a no-op in these cases since the page allocator special-cases > > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > > > It's time to just remove GFP_TRANSHUGE entirely. We leave __GFP_THISNODE > > ^THISNODE :) Although yes, it would be nice if we > could replace the GFP_TRANSHUGE magic checks as well. > Haha, I referenced GFP_TRANSHUGE twice here when I meant GFP_THISNODE, I must want to fix that up as well. > > to restrict an allocation to a local node, but remove GFP_TRANSHUGE and > > it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it > > wants to avoid reclaim. > > > > This allows the aforementioned functions to actually reclaim as they > > should. It also enables any future callers that want to do > > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. > > So, I agree with the intention, but this has some subtle implications that > should be mentioned/decided. The check for GFP_THISNODE in > __alloc_pages_slowpath() comes earlier than the check for __GFP_WAIT. So the > differences will be: > > 1) We will now call wake_all_kswapds(), unless __GFP_NO_KSWAPD is passed, which > is only done for hugepages and some type of i915 allocation. Do we want the > opportunistic attempts from slab to wake up kswapds or do we pass the flag? > > 2) There will be another attempt on get_page_from_freelist() with different > alloc_flags than in the fast path attempt. Without __GFP_WAIT (and also, again, > __GFP_KSWAPD, since your commit b104a35d32, which is another subtle check for > hugepage allocations btw), it will consider the allocation atomic and add > ALLOC_HARDER flag, unless __GFP_NOMEMALLOC is in __gfp_flags - it seems it's > generally not. It will also clear ALLOC_CPUSET, which was the concern of > b104a35d32. However, if I look at __cpuset_node_allowed(), I see that it's > always true for __GFP_THISNODE, which makes me question commit b104a35d32 in > light of your patch 2/2 and generally the sanity of all these flags and my > career choice. > Do we do either of these? gfp_exact_node() sets __GFP_THISNODE and clears __GFP_WAIT which will make the new conditional trigger immediately for NUMA configs. Existing callers of GFP_KERNEL | __GFP_THISNODE aren't impacted and net/openvswitch/flow.c is mentioned in the changelog as actually wanting GFP_NOWAIT | __GFP_THISNODE so that this early check still fails. > Ugh :) > Ugh indeed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-27 3:09 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 3:09 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Thu, 26 Feb 2015, Vlastimil Babka wrote: > > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > > > GFP_THISNODE is a secret combination of gfp bits that have different > > behavior than expected. It is a combination of __GFP_THISNODE, > > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > > slowpath to fail without trying reclaim even though it may be used in > > combination with __GFP_WAIT. > > > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > > that really just wanted __GFP_THISNODE. The problem doesn't end there, > > however, because even it was a no-op for alloc_misplaced_dst_page(), > > which also sets __GFP_NORETRY and __GFP_NOWARN, and > > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > > a no-op in these cases since the page allocator special-cases > > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > > > It's time to just remove GFP_TRANSHUGE entirely. We leave __GFP_THISNODE > > ^THISNODE :) Although yes, it would be nice if we > could replace the GFP_TRANSHUGE magic checks as well. > Haha, I referenced GFP_TRANSHUGE twice here when I meant GFP_THISNODE, I must want to fix that up as well. > > to restrict an allocation to a local node, but remove GFP_TRANSHUGE and > > it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it > > wants to avoid reclaim. > > > > This allows the aforementioned functions to actually reclaim as they > > should. It also enables any future callers that want to do > > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. > > So, I agree with the intention, but this has some subtle implications that > should be mentioned/decided. The check for GFP_THISNODE in > __alloc_pages_slowpath() comes earlier than the check for __GFP_WAIT. So the > differences will be: > > 1) We will now call wake_all_kswapds(), unless __GFP_NO_KSWAPD is passed, which > is only done for hugepages and some type of i915 allocation. Do we want the > opportunistic attempts from slab to wake up kswapds or do we pass the flag? > > 2) There will be another attempt on get_page_from_freelist() with different > alloc_flags than in the fast path attempt. Without __GFP_WAIT (and also, again, > __GFP_KSWAPD, since your commit b104a35d32, which is another subtle check for > hugepage allocations btw), it will consider the allocation atomic and add > ALLOC_HARDER flag, unless __GFP_NOMEMALLOC is in __gfp_flags - it seems it's > generally not. It will also clear ALLOC_CPUSET, which was the concern of > b104a35d32. However, if I look at __cpuset_node_allowed(), I see that it's > always true for __GFP_THISNODE, which makes me question commit b104a35d32 in > light of your patch 2/2 and generally the sanity of all these flags and my > career choice. > Do we do either of these? gfp_exact_node() sets __GFP_THISNODE and clears __GFP_WAIT which will make the new conditional trigger immediately for NUMA configs. Existing callers of GFP_KERNEL | __GFP_THISNODE aren't impacted and net/openvswitch/flow.c is mentioned in the changelog as actually wanting GFP_NOWAIT | __GFP_THISNODE so that this early check still fails. > Ugh :) > Ugh indeed. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-27 3:09 ` David Rientjes @ 2015-02-27 7:34 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-27 7:34 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 02/27/2015 04:09 AM, David Rientjes wrote: > On Thu, 26 Feb 2015, Vlastimil Babka wrote: > >> > to restrict an allocation to a local node, but remove GFP_TRANSHUGE and >> > it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it >> > wants to avoid reclaim. >> > >> > This allows the aforementioned functions to actually reclaim as they >> > should. It also enables any future callers that want to do >> > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The >> > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. >> >> So, I agree with the intention, but this has some subtle implications that >> should be mentioned/decided. The check for GFP_THISNODE in >> __alloc_pages_slowpath() comes earlier than the check for __GFP_WAIT. So the >> differences will be: >> >> 1) We will now call wake_all_kswapds(), unless __GFP_NO_KSWAPD is passed, which >> is only done for hugepages and some type of i915 allocation. Do we want the >> opportunistic attempts from slab to wake up kswapds or do we pass the flag? >> >> 2) There will be another attempt on get_page_from_freelist() with different >> alloc_flags than in the fast path attempt. Without __GFP_WAIT (and also, again, >> __GFP_KSWAPD, since your commit b104a35d32, which is another subtle check for >> hugepage allocations btw), it will consider the allocation atomic and add >> ALLOC_HARDER flag, unless __GFP_NOMEMALLOC is in __gfp_flags - it seems it's >> generally not. It will also clear ALLOC_CPUSET, which was the concern of >> b104a35d32. However, if I look at __cpuset_node_allowed(), I see that it's >> always true for __GFP_THISNODE, which makes me question commit b104a35d32 in >> light of your patch 2/2 and generally the sanity of all these flags and my >> career choice. >> > > Do we do either of these? gfp_exact_node() sets __GFP_THISNODE and clears > __GFP_WAIT which will make the new conditional trigger immediately for > NUMA configs. Oh, right. I missed the new trigger. My sanity and career is saved! Well, no... the flags are still a mess. Aren't GFP_TRANSHUGE | __GFP_THISNODE allocations still problematic after this patch and 2/2? Those do include __GFP_WAIT (unless !defrag). So with only patch 2/2 without 1/2 they would match GFP_THISNODE and bail out (not good for khugepaged at least...). With both patches they won't bail out and __GFP_NO_KSWAPD will prevent most of the stuff described above, including clearing ALLOC_CPUSET. But __cpuset_node_allowed() will allow it to allocate anywhere anyway thanks to the newly passed __GFP_THISNODE, which would be a regression of what b104a35d32 fixed... unless I'm missing something else that prevents it, which wouldn't surprise me at all. There's this outdated comment: * The __GFP_THISNODE placement logic is really handled elsewhere, * by forcibly using a zonelist starting at a specified node, and by * (in get_page_from_freelist()) refusing to consider the zones for * any node on the zonelist except the first. By the time any such * calls get to this routine, we should just shut up and say 'yes'. AFAIK the __GFP_THISNODE zonelist contains *only* zones from the single node and there's no other "refusing". And I don't really see why __GFP_THISNODE should have this exception, it feels to me like "well we shouldn't reach this but we are not sure, so let's play it safe". So maybe we could just remove this exception? I don't think any other user of __GFP_THISNODE | __GFP_WAIT user relies on this allowed memset violation? > Existing callers of GFP_KERNEL | __GFP_THISNODE aren't impacted and > net/openvswitch/flow.c is mentioned in the changelog as actually wanting > GFP_NOWAIT | __GFP_THISNODE so that this early check still fails. > >> Ugh :) >> > > Ugh indeed. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-27 7:34 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-27 7:34 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 02/27/2015 04:09 AM, David Rientjes wrote: > On Thu, 26 Feb 2015, Vlastimil Babka wrote: > >> > to restrict an allocation to a local node, but remove GFP_TRANSHUGE and >> > it's obscurity. Instead, we require that a caller clear __GFP_WAIT if it >> > wants to avoid reclaim. >> > >> > This allows the aforementioned functions to actually reclaim as they >> > should. It also enables any future callers that want to do >> > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The >> > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. >> >> So, I agree with the intention, but this has some subtle implications that >> should be mentioned/decided. The check for GFP_THISNODE in >> __alloc_pages_slowpath() comes earlier than the check for __GFP_WAIT. So the >> differences will be: >> >> 1) We will now call wake_all_kswapds(), unless __GFP_NO_KSWAPD is passed, which >> is only done for hugepages and some type of i915 allocation. Do we want the >> opportunistic attempts from slab to wake up kswapds or do we pass the flag? >> >> 2) There will be another attempt on get_page_from_freelist() with different >> alloc_flags than in the fast path attempt. Without __GFP_WAIT (and also, again, >> __GFP_KSWAPD, since your commit b104a35d32, which is another subtle check for >> hugepage allocations btw), it will consider the allocation atomic and add >> ALLOC_HARDER flag, unless __GFP_NOMEMALLOC is in __gfp_flags - it seems it's >> generally not. It will also clear ALLOC_CPUSET, which was the concern of >> b104a35d32. However, if I look at __cpuset_node_allowed(), I see that it's >> always true for __GFP_THISNODE, which makes me question commit b104a35d32 in >> light of your patch 2/2 and generally the sanity of all these flags and my >> career choice. >> > > Do we do either of these? gfp_exact_node() sets __GFP_THISNODE and clears > __GFP_WAIT which will make the new conditional trigger immediately for > NUMA configs. Oh, right. I missed the new trigger. My sanity and career is saved! Well, no... the flags are still a mess. Aren't GFP_TRANSHUGE | __GFP_THISNODE allocations still problematic after this patch and 2/2? Those do include __GFP_WAIT (unless !defrag). So with only patch 2/2 without 1/2 they would match GFP_THISNODE and bail out (not good for khugepaged at least...). With both patches they won't bail out and __GFP_NO_KSWAPD will prevent most of the stuff described above, including clearing ALLOC_CPUSET. But __cpuset_node_allowed() will allow it to allocate anywhere anyway thanks to the newly passed __GFP_THISNODE, which would be a regression of what b104a35d32 fixed... unless I'm missing something else that prevents it, which wouldn't surprise me at all. There's this outdated comment: * The __GFP_THISNODE placement logic is really handled elsewhere, * by forcibly using a zonelist starting at a specified node, and by * (in get_page_from_freelist()) refusing to consider the zones for * any node on the zonelist except the first. By the time any such * calls get to this routine, we should just shut up and say 'yes'. AFAIK the __GFP_THISNODE zonelist contains *only* zones from the single node and there's no other "refusing". And I don't really see why __GFP_THISNODE should have this exception, it feels to me like "well we shouldn't reach this but we are not sure, so let's play it safe". So maybe we could just remove this exception? I don't think any other user of __GFP_THISNODE | __GFP_WAIT user relies on this allowed memset violation? > Existing callers of GFP_KERNEL | __GFP_THISNODE aren't impacted and > net/openvswitch/flow.c is mentioned in the changelog as actually wanting > GFP_NOWAIT | __GFP_THISNODE so that this early check still fails. > >> Ugh :) >> > > Ugh indeed. > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-27 7:34 ` Vlastimil Babka @ 2015-02-27 22:03 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Fri, 27 Feb 2015, Vlastimil Babka wrote: > Oh, right. I missed the new trigger. My sanity and career is saved! > Haha. > Well, no... the flags are still a mess. Aren't GFP_TRANSHUGE | __GFP_THISNODE > allocations still problematic after this patch and 2/2? Those do include > __GFP_WAIT (unless !defrag). So with only patch 2/2 without 1/2 they would match > GFP_THISNODE and bail out (not good for khugepaged at least...). With both patches: if __GFP_WAIT isn't set, either for page fault or khugepaged, then we always exit immediately from __alloc_pages_slowpath(): we can't try reclaim or compaction. If __GFP_WAIT is set, then the new conditional fails, and the slowpath proceeds as we want it to with a zonelist that only includes local nodes because __GFP_THISNODE is set for node_zonelist() in alloc_pages_exact_node(). Those are the only zones that get_page_from_freelist() gets to iterate over. With only this patch: we still have the problem that is fixed with the second patch, thp is preferred on the node of choice but can be allocated from any other node for fallback because the allocations lack __GFP_THISNODE. > With both > patches they won't bail out and __GFP_NO_KSWAPD will prevent most of the stuff > described above, including clearing ALLOC_CPUSET. Yeah, ALLOC_CPUSET is never cleared for thp allocations because atomic == false for thp, regardless of this series. > But __cpuset_node_allowed() > will allow it to allocate anywhere anyway thanks to the newly passed > __GFP_THISNODE, which would be a regression of what b104a35d32 fixed... unless > I'm missing something else that prevents it, which wouldn't surprise me at all. > > There's this outdated comment: > > * The __GFP_THISNODE placement logic is really handled elsewhere, > * by forcibly using a zonelist starting at a specified node, and by > * (in get_page_from_freelist()) refusing to consider the zones for > * any node on the zonelist except the first. By the time any such > * calls get to this routine, we should just shut up and say 'yes'. > > AFAIK the __GFP_THISNODE zonelist contains *only* zones from the single node and > there's no other "refusing". Yes, __cpuset_node_allowed() is never called for a zone from any other node when __GFP_THISNODE is passed because of node_zonelist(). It's pointless to iterate over those zones since the allocation wants to fail instead of allocate on them. Do you see any issues with either patch 1/2 or patch 2/2 besides the s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? > And I don't really see why __GFP_THISNODE should > have this exception, it feels to me like "well we shouldn't reach this but we > are not sure, so let's play it safe". So maybe we could just remove this > exception? I don't think any other user of __GFP_THISNODE | __GFP_WAIT user > relies on this allowed memset violation? > Since this function was written, there were other callers to cpuset_{node,zone}_allowed_{soft,hard}wall() that may have required it. I looked at all the current callers of cpuset_zone_allowed() and they don't appear to need this "exception" (slub calls node_zonelist() itself for the iteration and slab never calls it for __GFP_THISNODE). So, yeah, I think it can be removed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-27 22:03 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Fri, 27 Feb 2015, Vlastimil Babka wrote: > Oh, right. I missed the new trigger. My sanity and career is saved! > Haha. > Well, no... the flags are still a mess. Aren't GFP_TRANSHUGE | __GFP_THISNODE > allocations still problematic after this patch and 2/2? Those do include > __GFP_WAIT (unless !defrag). So with only patch 2/2 without 1/2 they would match > GFP_THISNODE and bail out (not good for khugepaged at least...). With both patches: if __GFP_WAIT isn't set, either for page fault or khugepaged, then we always exit immediately from __alloc_pages_slowpath(): we can't try reclaim or compaction. If __GFP_WAIT is set, then the new conditional fails, and the slowpath proceeds as we want it to with a zonelist that only includes local nodes because __GFP_THISNODE is set for node_zonelist() in alloc_pages_exact_node(). Those are the only zones that get_page_from_freelist() gets to iterate over. With only this patch: we still have the problem that is fixed with the second patch, thp is preferred on the node of choice but can be allocated from any other node for fallback because the allocations lack __GFP_THISNODE. > With both > patches they won't bail out and __GFP_NO_KSWAPD will prevent most of the stuff > described above, including clearing ALLOC_CPUSET. Yeah, ALLOC_CPUSET is never cleared for thp allocations because atomic == false for thp, regardless of this series. > But __cpuset_node_allowed() > will allow it to allocate anywhere anyway thanks to the newly passed > __GFP_THISNODE, which would be a regression of what b104a35d32 fixed... unless > I'm missing something else that prevents it, which wouldn't surprise me at all. > > There's this outdated comment: > > * The __GFP_THISNODE placement logic is really handled elsewhere, > * by forcibly using a zonelist starting at a specified node, and by > * (in get_page_from_freelist()) refusing to consider the zones for > * any node on the zonelist except the first. By the time any such > * calls get to this routine, we should just shut up and say 'yes'. > > AFAIK the __GFP_THISNODE zonelist contains *only* zones from the single node and > there's no other "refusing". Yes, __cpuset_node_allowed() is never called for a zone from any other node when __GFP_THISNODE is passed because of node_zonelist(). It's pointless to iterate over those zones since the allocation wants to fail instead of allocate on them. Do you see any issues with either patch 1/2 or patch 2/2 besides the s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? > And I don't really see why __GFP_THISNODE should > have this exception, it feels to me like "well we shouldn't reach this but we > are not sure, so let's play it safe". So maybe we could just remove this > exception? I don't think any other user of __GFP_THISNODE | __GFP_WAIT user > relies on this allowed memset violation? > Since this function was written, there were other callers to cpuset_{node,zone}_allowed_{soft,hard}wall() that may have required it. I looked at all the current callers of cpuset_zone_allowed() and they don't appear to need this "exception" (slub calls node_zonelist() itself for the iteration and slab never calls it for __GFP_THISNODE). So, yeah, I think it can be removed. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-27 22:03 ` David Rientjes @ 2015-02-27 22:19 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-27 22:19 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 02/27/2015 11:03 PM, David Rientjes wrote: >> With both >> patches they won't bail out and __GFP_NO_KSWAPD will prevent most of the stuff >> described above, including clearing ALLOC_CPUSET. > > Yeah, ALLOC_CPUSET is never cleared for thp allocations because atomic == > false for thp, regardless of this series. > >> But __cpuset_node_allowed() >> will allow it to allocate anywhere anyway thanks to the newly passed >> __GFP_THISNODE, which would be a regression of what b104a35d32 fixed... unless >> I'm missing something else that prevents it, which wouldn't surprise me at all. >> >> There's this outdated comment: >> >> * The __GFP_THISNODE placement logic is really handled elsewhere, >> * by forcibly using a zonelist starting at a specified node, and by >> * (in get_page_from_freelist()) refusing to consider the zones for >> * any node on the zonelist except the first. By the time any such >> * calls get to this routine, we should just shut up and say 'yes'. >> >> AFAIK the __GFP_THISNODE zonelist contains *only* zones from the single node and >> there's no other "refusing". > > Yes, __cpuset_node_allowed() is never called for a zone from any other > node when __GFP_THISNODE is passed because of node_zonelist(). It's > pointless to iterate over those zones since the allocation wants to fail > instead of allocate on them. > > Do you see any issues with either patch 1/2 or patch 2/2 besides the > s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? Well, my point is, what if the node we are explicitly trying to allocate hugepage on, is in fact not allowed by our cpuset? This could happen in the page fault case, no? Although in a weird configuration when process can (and really gets scheduled to run) on a node where it is not allowed to allocate from... >> And I don't really see why __GFP_THISNODE should >> have this exception, it feels to me like "well we shouldn't reach this but we >> are not sure, so let's play it safe". So maybe we could just remove this >> exception? I don't think any other user of __GFP_THISNODE | __GFP_WAIT user >> relies on this allowed memset violation? >> > > Since this function was written, there were other callers to > cpuset_{node,zone}_allowed_{soft,hard}wall() that may have required it. I > looked at all the current callers of cpuset_zone_allowed() and they don't > appear to need this "exception" (slub calls node_zonelist() itself for the > iteration and slab never calls it for __GFP_THISNODE). So, yeah, I think > it can be removed. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-27 22:19 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-27 22:19 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 02/27/2015 11:03 PM, David Rientjes wrote: >> With both >> patches they won't bail out and __GFP_NO_KSWAPD will prevent most of the stuff >> described above, including clearing ALLOC_CPUSET. > > Yeah, ALLOC_CPUSET is never cleared for thp allocations because atomic == > false for thp, regardless of this series. > >> But __cpuset_node_allowed() >> will allow it to allocate anywhere anyway thanks to the newly passed >> __GFP_THISNODE, which would be a regression of what b104a35d32 fixed... unless >> I'm missing something else that prevents it, which wouldn't surprise me at all. >> >> There's this outdated comment: >> >> * The __GFP_THISNODE placement logic is really handled elsewhere, >> * by forcibly using a zonelist starting at a specified node, and by >> * (in get_page_from_freelist()) refusing to consider the zones for >> * any node on the zonelist except the first. By the time any such >> * calls get to this routine, we should just shut up and say 'yes'. >> >> AFAIK the __GFP_THISNODE zonelist contains *only* zones from the single node and >> there's no other "refusing". > > Yes, __cpuset_node_allowed() is never called for a zone from any other > node when __GFP_THISNODE is passed because of node_zonelist(). It's > pointless to iterate over those zones since the allocation wants to fail > instead of allocate on them. > > Do you see any issues with either patch 1/2 or patch 2/2 besides the > s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? Well, my point is, what if the node we are explicitly trying to allocate hugepage on, is in fact not allowed by our cpuset? This could happen in the page fault case, no? Although in a weird configuration when process can (and really gets scheduled to run) on a node where it is not allowed to allocate from... >> And I don't really see why __GFP_THISNODE should >> have this exception, it feels to me like "well we shouldn't reach this but we >> are not sure, so let's play it safe". So maybe we could just remove this >> exception? I don't think any other user of __GFP_THISNODE | __GFP_WAIT user >> relies on this allowed memset violation? >> > > Since this function was written, there were other callers to > cpuset_{node,zone}_allowed_{soft,hard}wall() that may have required it. I > looked at all the current callers of cpuset_zone_allowed() and they don't > appear to need this "exception" (slub calls node_zonelist() itself for the > iteration and slab never calls it for __GFP_THISNODE). So, yeah, I think > it can be removed. > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-27 22:19 ` Vlastimil Babka @ 2015-02-27 22:31 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Fri, 27 Feb 2015, Vlastimil Babka wrote: > > Do you see any issues with either patch 1/2 or patch 2/2 besides the > > s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? > > Well, my point is, what if the node we are explicitly trying to allocate > hugepage on, is in fact not allowed by our cpuset? This could happen in the page > fault case, no? Although in a weird configuration when process can (and really > gets scheduled to run) on a node where it is not allowed to allocate from... > If the process is running a node that is not allowed by the cpuset, then alloc_hugepage_vma() now fails with VM_FAULT_FALLBACK. That was the intended policy change of commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node"). [ alloc_hugepage_vma() should probably be using numa_mem_id() instead for memoryless node platforms. ] -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-27 22:31 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On Fri, 27 Feb 2015, Vlastimil Babka wrote: > > Do you see any issues with either patch 1/2 or patch 2/2 besides the > > s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? > > Well, my point is, what if the node we are explicitly trying to allocate > hugepage on, is in fact not allowed by our cpuset? This could happen in the page > fault case, no? Although in a weird configuration when process can (and really > gets scheduled to run) on a node where it is not allowed to allocate from... > If the process is running a node that is not allowed by the cpuset, then alloc_hugepage_vma() now fails with VM_FAULT_FALLBACK. That was the intended policy change of commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node"). [ alloc_hugepage_vma() should probably be using numa_mem_id() instead for memoryless node platforms. ] ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE 2015-02-27 22:31 ` David Rientjes @ 2015-02-27 22:52 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-27 22:52 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 27.2.2015 23:31, David Rientjes wrote: > On Fri, 27 Feb 2015, Vlastimil Babka wrote: > >>> Do you see any issues with either patch 1/2 or patch 2/2 besides the >>> s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? >> Well, my point is, what if the node we are explicitly trying to allocate >> hugepage on, is in fact not allowed by our cpuset? This could happen in the page >> fault case, no? Although in a weird configuration when process can (and really >> gets scheduled to run) on a node where it is not allowed to allocate from... >> > If the process is running a node that is not allowed by the cpuset, then > alloc_hugepage_vma() now fails with VM_FAULT_FALLBACK. That was the > intended policy change of commit 077fcf116c8c ("mm/thp: allocate > transparent hugepages on local node"). Ah, right, didn't realize that mempolicy also takes that into account. Thanks for removing the exception anyway. > > [ alloc_hugepage_vma() should probably be using numa_mem_id() instead for > memoryless node platforms. ] -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch 1/2] mm: remove GFP_THISNODE @ 2015-02-27 22:52 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-02-27 22:52 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Greg Thelen, linux-kernel, linux-mm, netdev, dev On 27.2.2015 23:31, David Rientjes wrote: > On Fri, 27 Feb 2015, Vlastimil Babka wrote: > >>> Do you see any issues with either patch 1/2 or patch 2/2 besides the >>> s/GFP_TRANSHUGE/GFP_THISNODE/ that is necessary on the changelog? >> Well, my point is, what if the node we are explicitly trying to allocate >> hugepage on, is in fact not allowed by our cpuset? This could happen in the page >> fault case, no? Although in a weird configuration when process can (and really >> gets scheduled to run) on a node where it is not allowed to allocate from... >> > If the process is running a node that is not allowed by the cpuset, then > alloc_hugepage_vma() now fails with VM_FAULT_FALLBACK. That was the > intended policy change of commit 077fcf116c8c ("mm/thp: allocate > transparent hugepages on local node"). Ah, right, didn't realize that mempolicy also takes that into account. Thanks for removing the exception anyway. > > [ alloc_hugepage_vma() should probably be using numa_mem_id() instead for > memoryless node platforms. ] ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <alpine.DEB.2.10.1502251621010.10303-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* [patch v2 1/3] mm: remove GFP_THISNODE 2015-02-26 0:23 ` David Rientjes (?) @ 2015-02-27 22:16 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:16 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0TnMu66kgdUjQ NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- v2: fix typos in commit message per Vlastimil include/linux/gfp.h | 10 ---------- mm/page_alloc.c | 22 ++++++---------------- mm/slab.c | 22 ++++++++++++++++++---- net/openvswitch/flow.c | 4 +++- 4 files changed, 27 insertions(+), 31 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -117,16 +117,6 @@ struct vm_area_struct; __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ __GFP_NO_KSWAPD) -/* - * GFP_THISNODE does not perform any reclaim, you most likely want to - * use __GFP_THISNODE to allocate from a given node without fallback! - */ -#ifdef CONFIG_NUMA -#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) -#else -#define GFP_THISNODE ((__force gfp_t)0) -#endif - /* This mask makes up all the page movable related flags */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer does not compensate for light reclaim */ if (!(gfp_mask & __GFP_FS)) goto out; - /* - * GFP_THISNODE contains __GFP_NORETRY and we never hit this. - * Sanity check for bare calls of __GFP_THISNODE, not real OOM. - * The caller should handle page allocation failure by itself if - * it specifies __GFP_THISNODE. - * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER. - */ + /* The OOM killer may not free memory on a specific node */ if (gfp_mask & __GFP_THISNODE) goto out; } @@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, } /* - * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and - * __GFP_NOWARN set) should not cause reclaim since the subsystem - * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim - * using a larger set of nodes after it has established that the - * allowed per node queues are empty and that nodes are - * over allocated. + * If this allocation cannot block and it is for a specific node, then + * fail early. There's no need to wakeup kswapd or retry for a + * speculative node-specific allocation. */ - if (IS_ENABLED(CONFIG_NUMA) && - (gfp_mask & GFP_THISNODE) == GFP_THISNODE) + if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait) goto nopage; retry: @@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, /* * Check the zones suitable for the gfp_mask contain at least one * valid zone. It's possible to have an empty zonelist as a result - * of GFP_THISNODE and a memoryless node + * of __GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) return NULL; diff --git a/mm/slab.c b/mm/slab.c --- a/mm/slab.c +++ b/mm/slab.c @@ -857,6 +857,11 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep, return NULL; } +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return flags; +} + #else /* CONFIG_NUMA */ static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); @@ -1023,6 +1028,15 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) return __cache_free_alien(cachep, objp, node, page_node); } + +/* + * Construct gfp mask to allocate from a specific node but do not invoke reclaim + * or warn about failures. + */ +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; +} #endif /* @@ -2825,7 +2839,7 @@ alloc_done: if (unlikely(!ac->avail)) { int x; force_grow: - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); @@ -3019,7 +3033,7 @@ retry: get_node(cache, nid) && get_node(cache, nid)->free_objects) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (obj) break; } @@ -3047,7 +3061,7 @@ retry: nid = page_to_nid(page); if (cache_grow(cache, flags, nid, page)) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (!obj) /* * Another processor may allocate the @@ -3118,7 +3132,7 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); if (x) goto retry; diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -100,7 +100,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, new_stats = kmem_cache_alloc_node(flow_stats_cache, - GFP_THISNODE | + GFP_NOWAIT | + __GFP_THISNODE | + __GFP_NOWARN | __GFP_NOMEMALLOC, node); if (likely(new_stats)) { ^ permalink raw reply [flat|nested] 50+ messages in thread
* [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-02-27 22:16 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:16 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> --- v2: fix typos in commit message per Vlastimil include/linux/gfp.h | 10 ---------- mm/page_alloc.c | 22 ++++++---------------- mm/slab.c | 22 ++++++++++++++++++---- net/openvswitch/flow.c | 4 +++- 4 files changed, 27 insertions(+), 31 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -117,16 +117,6 @@ struct vm_area_struct; __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ __GFP_NO_KSWAPD) -/* - * GFP_THISNODE does not perform any reclaim, you most likely want to - * use __GFP_THISNODE to allocate from a given node without fallback! - */ -#ifdef CONFIG_NUMA -#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) -#else -#define GFP_THISNODE ((__force gfp_t)0) -#endif - /* This mask makes up all the page movable related flags */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer does not compensate for light reclaim */ if (!(gfp_mask & __GFP_FS)) goto out; - /* - * GFP_THISNODE contains __GFP_NORETRY and we never hit this. - * Sanity check for bare calls of __GFP_THISNODE, not real OOM. - * The caller should handle page allocation failure by itself if - * it specifies __GFP_THISNODE. - * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER. - */ + /* The OOM killer may not free memory on a specific node */ if (gfp_mask & __GFP_THISNODE) goto out; } @@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, } /* - * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and - * __GFP_NOWARN set) should not cause reclaim since the subsystem - * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim - * using a larger set of nodes after it has established that the - * allowed per node queues are empty and that nodes are - * over allocated. + * If this allocation cannot block and it is for a specific node, then + * fail early. There's no need to wakeup kswapd or retry for a + * speculative node-specific allocation. */ - if (IS_ENABLED(CONFIG_NUMA) && - (gfp_mask & GFP_THISNODE) == GFP_THISNODE) + if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait) goto nopage; retry: @@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, /* * Check the zones suitable for the gfp_mask contain at least one * valid zone. It's possible to have an empty zonelist as a result - * of GFP_THISNODE and a memoryless node + * of __GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) return NULL; diff --git a/mm/slab.c b/mm/slab.c --- a/mm/slab.c +++ b/mm/slab.c @@ -857,6 +857,11 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep, return NULL; } +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return flags; +} + #else /* CONFIG_NUMA */ static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); @@ -1023,6 +1028,15 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) return __cache_free_alien(cachep, objp, node, page_node); } + +/* + * Construct gfp mask to allocate from a specific node but do not invoke reclaim + * or warn about failures. + */ +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; +} #endif /* @@ -2825,7 +2839,7 @@ alloc_done: if (unlikely(!ac->avail)) { int x; force_grow: - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); @@ -3019,7 +3033,7 @@ retry: get_node(cache, nid) && get_node(cache, nid)->free_objects) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (obj) break; } @@ -3047,7 +3061,7 @@ retry: nid = page_to_nid(page); if (cache_grow(cache, flags, nid, page)) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (!obj) /* * Another processor may allocate the @@ -3118,7 +3132,7 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); if (x) goto retry; diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -100,7 +100,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, new_stats = kmem_cache_alloc_node(flow_stats_cache, - GFP_THISNODE | + GFP_NOWAIT | + __GFP_THISNODE | + __GFP_NOWARN | __GFP_NOMEMALLOC, node); if (likely(new_stats)) { ^ permalink raw reply [flat|nested] 50+ messages in thread
* [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-02-27 22:16 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:16 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> --- v2: fix typos in commit message per Vlastimil include/linux/gfp.h | 10 ---------- mm/page_alloc.c | 22 ++++++---------------- mm/slab.c | 22 ++++++++++++++++++---- net/openvswitch/flow.c | 4 +++- 4 files changed, 27 insertions(+), 31 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -117,16 +117,6 @@ struct vm_area_struct; __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ __GFP_NO_KSWAPD) -/* - * GFP_THISNODE does not perform any reclaim, you most likely want to - * use __GFP_THISNODE to allocate from a given node without fallback! - */ -#ifdef CONFIG_NUMA -#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) -#else -#define GFP_THISNODE ((__force gfp_t)0) -#endif - /* This mask makes up all the page movable related flags */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer does not compensate for light reclaim */ if (!(gfp_mask & __GFP_FS)) goto out; - /* - * GFP_THISNODE contains __GFP_NORETRY and we never hit this. - * Sanity check for bare calls of __GFP_THISNODE, not real OOM. - * The caller should handle page allocation failure by itself if - * it specifies __GFP_THISNODE. - * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER. - */ + /* The OOM killer may not free memory on a specific node */ if (gfp_mask & __GFP_THISNODE) goto out; } @@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, } /* - * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and - * __GFP_NOWARN set) should not cause reclaim since the subsystem - * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim - * using a larger set of nodes after it has established that the - * allowed per node queues are empty and that nodes are - * over allocated. + * If this allocation cannot block and it is for a specific node, then + * fail early. There's no need to wakeup kswapd or retry for a + * speculative node-specific allocation. */ - if (IS_ENABLED(CONFIG_NUMA) && - (gfp_mask & GFP_THISNODE) == GFP_THISNODE) + if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait) goto nopage; retry: @@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, /* * Check the zones suitable for the gfp_mask contain at least one * valid zone. It's possible to have an empty zonelist as a result - * of GFP_THISNODE and a memoryless node + * of __GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) return NULL; diff --git a/mm/slab.c b/mm/slab.c --- a/mm/slab.c +++ b/mm/slab.c @@ -857,6 +857,11 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep, return NULL; } +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return flags; +} + #else /* CONFIG_NUMA */ static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); @@ -1023,6 +1028,15 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) return __cache_free_alien(cachep, objp, node, page_node); } + +/* + * Construct gfp mask to allocate from a specific node but do not invoke reclaim + * or warn about failures. + */ +static inline gfp_t gfp_exact_node(gfp_t flags) +{ + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; +} #endif /* @@ -2825,7 +2839,7 @@ alloc_done: if (unlikely(!ac->avail)) { int x; force_grow: - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); @@ -3019,7 +3033,7 @@ retry: get_node(cache, nid) && get_node(cache, nid)->free_objects) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (obj) break; } @@ -3047,7 +3061,7 @@ retry: nid = page_to_nid(page); if (cache_grow(cache, flags, nid, page)) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + gfp_exact_node(flags), nid); if (!obj) /* * Another processor may allocate the @@ -3118,7 +3132,7 @@ retry: must_grow: spin_unlock(&n->list_lock); - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); if (x) goto retry; diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -100,7 +100,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, new_stats = kmem_cache_alloc_node(flow_stats_cache, - GFP_THISNODE | + GFP_NOWAIT | + __GFP_THISNODE | + __GFP_NOWARN | __GFP_NOMEMALLOC, node); if (likely(new_stats)) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* [patch v2 2/3] mm, thp: really limit transparent hugepage allocation to local node 2015-02-27 22:16 ` David Rientjes @ 2015-02-27 22:17 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:17 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") restructured alloc_hugepage_vma() with the intent of only allocating transparent hugepages locally when there was not an effective interleave mempolicy. alloc_pages_exact_node() does not limit the allocation to the single node, however, but rather prefers it. This is because __GFP_THISNODE is not set which would cause the node-local nodemask to be passed. Without it, only a nodemask that prefers the local node is passed. Fix this by passing __GFP_THISNODE and falling back to small pages when the allocation fails. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") suffers from a similar problem for khugepaged, which is also fixed. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") Signed-off-by: David Rientjes <rientjes@google.com> --- mm/huge_memory.c | 9 +++++++-- mm/mempolicy.c | 3 ++- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2311,8 +2311,14 @@ static struct page struct vm_area_struct *vma, unsigned long address, int node) { + gfp_t flags; + VM_BUG_ON_PAGE(*hpage, *hpage); + /* Only allocate from the target node */ + flags = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) | + __GFP_THISNODE; + /* * Before allocating the hugepage, release the mmap_sem read lock. * The allocation can take potentially a long time if it involves @@ -2321,8 +2327,7 @@ static struct page */ up_read(&mm->mmap_sem); - *hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask( - khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER); + *hpage = alloc_pages_exact_node(node, flags, HPAGE_PMD_ORDER); if (unlikely(!*hpage)) { count_vm_event(THP_COLLAPSE_ALLOC_FAILED); *hpage = ERR_PTR(-ENOMEM); diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1985,7 +1985,8 @@ retry_cpuset: nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(node, *nmask)) { mpol_cond_put(pol); - page = alloc_pages_exact_node(node, gfp, order); + page = alloc_pages_exact_node(node, + gfp | __GFP_THISNODE, order); goto out; } } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* [patch v2 2/3] mm, thp: really limit transparent hugepage allocation to local node @ 2015-02-27 22:17 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:17 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") restructured alloc_hugepage_vma() with the intent of only allocating transparent hugepages locally when there was not an effective interleave mempolicy. alloc_pages_exact_node() does not limit the allocation to the single node, however, but rather prefers it. This is because __GFP_THISNODE is not set which would cause the node-local nodemask to be passed. Without it, only a nodemask that prefers the local node is passed. Fix this by passing __GFP_THISNODE and falling back to small pages when the allocation fails. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") suffers from a similar problem for khugepaged, which is also fixed. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") Signed-off-by: David Rientjes <rientjes@google.com> --- mm/huge_memory.c | 9 +++++++-- mm/mempolicy.c | 3 ++- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2311,8 +2311,14 @@ static struct page struct vm_area_struct *vma, unsigned long address, int node) { + gfp_t flags; + VM_BUG_ON_PAGE(*hpage, *hpage); + /* Only allocate from the target node */ + flags = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) | + __GFP_THISNODE; + /* * Before allocating the hugepage, release the mmap_sem read lock. * The allocation can take potentially a long time if it involves @@ -2321,8 +2327,7 @@ static struct page */ up_read(&mm->mmap_sem); - *hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask( - khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER); + *hpage = alloc_pages_exact_node(node, flags, HPAGE_PMD_ORDER); if (unlikely(!*hpage)) { count_vm_event(THP_COLLAPSE_ALLOC_FAILED); *hpage = ERR_PTR(-ENOMEM); diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1985,7 +1985,8 @@ retry_cpuset: nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(node, *nmask)) { mpol_cond_put(pol); - page = alloc_pages_exact_node(node, gfp, order); + page = alloc_pages_exact_node(node, + gfp | __GFP_THISNODE, order); goto out; } } ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <alpine.DEB.2.10.1502271416580.7225-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [patch v2 2/3] mm, thp: really limit transparent hugepage allocation to local node 2015-02-27 22:17 ` David Rientjes (?) @ 2015-03-02 13:47 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:47 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: dev-yBygre7rU0TnMu66kgdUjQ, Greg Thelen, netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Pekka Enberg, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Li Zefan, Mel Gorman, Johannes Weiner, Christoph Lameter, Joonsoo Kim On 02/27/2015 11:17 PM, David Rientjes wrote: > Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local > node") restructured alloc_hugepage_vma() with the intent of only > allocating transparent hugepages locally when there was not an effective > interleave mempolicy. > > alloc_pages_exact_node() does not limit the allocation to the single > node, however, but rather prefers it. This is because __GFP_THISNODE is > not set which would cause the node-local nodemask to be passed. Without > it, only a nodemask that prefers the local node is passed. > > Fix this by passing __GFP_THISNODE and falling back to small pages when > the allocation fails. > > Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target > node") suffers from a similar problem for khugepaged, which is also > fixed. > > Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") > Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 2/3] mm, thp: really limit transparent hugepage allocation to local node @ 2015-03-02 13:47 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:47 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 02/27/2015 11:17 PM, David Rientjes wrote: > Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local > node") restructured alloc_hugepage_vma() with the intent of only > allocating transparent hugepages locally when there was not an effective > interleave mempolicy. > > alloc_pages_exact_node() does not limit the allocation to the single > node, however, but rather prefers it. This is because __GFP_THISNODE is > not set which would cause the node-local nodemask to be passed. Without > it, only a nodemask that prefers the local node is passed. > > Fix this by passing __GFP_THISNODE and falling back to small pages when > the allocation fails. > > Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target > node") suffers from a similar problem for khugepaged, which is also > fixed. > > Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") > Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 2/3] mm, thp: really limit transparent hugepage allocation to local node @ 2015-03-02 13:47 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:47 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 02/27/2015 11:17 PM, David Rientjes wrote: > Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local > node") restructured alloc_hugepage_vma() with the intent of only > allocating transparent hugepages locally when there was not an effective > interleave mempolicy. > > alloc_pages_exact_node() does not limit the allocation to the single > node, however, but rather prefers it. This is because __GFP_THISNODE is > not set which would cause the node-local nodemask to be passed. Without > it, only a nodemask that prefers the local node is passed. > > Fix this by passing __GFP_THISNODE and falling back to small pages when > the allocation fails. > > Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target > node") suffers from a similar problem for khugepaged, which is also > fixed. > > Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") > Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* [patch v2 3/3] kernel, cpuset: remove exception for __GFP_THISNODE 2015-02-27 22:16 ` David Rientjes @ 2015-02-27 22:17 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:17 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so remove the obscure comment about it and its special-case exception. Signed-off-by: David Rientjes <rientjes@google.com> --- kernel/cpuset.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) diff --git a/kernel/cpuset.c b/kernel/cpuset.c --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2445,20 +2445,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs) * @node: is this an allowed node? * @gfp_mask: memory allocation flags * - * If we're in interrupt, yes, we can always allocate. If __GFP_THISNODE is - * set, yes, we can always allocate. If node is in our task's mems_allowed, - * yes. If it's not a __GFP_HARDWALL request and this node is in the nearest - * hardwalled cpuset ancestor to this task's cpuset, yes. If the task has been - * OOM killed and has access to memory reserves as specified by the TIF_MEMDIE - * flag, yes. + * If we're in interrupt, yes, we can always allocate. If @node is set in + * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this + * node is set in the nearest hardwalled cpuset ancestor to current's cpuset, + * yes. If current has access to memory reserves due to TIF_MEMDIE, yes. * Otherwise, no. * - * The __GFP_THISNODE placement logic is really handled elsewhere, - * by forcibly using a zonelist starting at a specified node, and by - * (in get_page_from_freelist()) refusing to consider the zones for - * any node on the zonelist except the first. By the time any such - * calls get to this routine, we should just shut up and say 'yes'. - * * GFP_USER allocations are marked with the __GFP_HARDWALL bit, * and do not allow allocations outside the current tasks cpuset * unless the task has been OOM killed as is marked TIF_MEMDIE. @@ -2494,7 +2486,7 @@ int __cpuset_node_allowed(int node, gfp_t gfp_mask) int allowed; /* is allocation in zone z allowed? */ unsigned long flags; - if (in_interrupt() || (gfp_mask & __GFP_THISNODE)) + if (in_interrupt()) return 1; if (node_isset(node, current->mems_allowed)) return 1; ^ permalink raw reply [flat|nested] 50+ messages in thread
* [patch v2 3/3] kernel, cpuset: remove exception for __GFP_THISNODE @ 2015-02-27 22:17 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-27 22:17 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so remove the obscure comment about it and its special-case exception. Signed-off-by: David Rientjes <rientjes@google.com> --- kernel/cpuset.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) diff --git a/kernel/cpuset.c b/kernel/cpuset.c --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2445,20 +2445,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs) * @node: is this an allowed node? * @gfp_mask: memory allocation flags * - * If we're in interrupt, yes, we can always allocate. If __GFP_THISNODE is - * set, yes, we can always allocate. If node is in our task's mems_allowed, - * yes. If it's not a __GFP_HARDWALL request and this node is in the nearest - * hardwalled cpuset ancestor to this task's cpuset, yes. If the task has been - * OOM killed and has access to memory reserves as specified by the TIF_MEMDIE - * flag, yes. + * If we're in interrupt, yes, we can always allocate. If @node is set in + * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this + * node is set in the nearest hardwalled cpuset ancestor to current's cpuset, + * yes. If current has access to memory reserves due to TIF_MEMDIE, yes. * Otherwise, no. * - * The __GFP_THISNODE placement logic is really handled elsewhere, - * by forcibly using a zonelist starting at a specified node, and by - * (in get_page_from_freelist()) refusing to consider the zones for - * any node on the zonelist except the first. By the time any such - * calls get to this routine, we should just shut up and say 'yes'. - * * GFP_USER allocations are marked with the __GFP_HARDWALL bit, * and do not allow allocations outside the current tasks cpuset * unless the task has been OOM killed as is marked TIF_MEMDIE. @@ -2494,7 +2486,7 @@ int __cpuset_node_allowed(int node, gfp_t gfp_mask) int allowed; /* is allocation in zone z allowed? */ unsigned long flags; - if (in_interrupt() || (gfp_mask & __GFP_THISNODE)) + if (in_interrupt()) return 1; if (node_isset(node, current->mems_allowed)) return 1; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 3/3] kernel, cpuset: remove exception for __GFP_THISNODE 2015-02-27 22:17 ` David Rientjes @ 2015-03-02 13:47 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:47 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 02/27/2015 11:17 PM, David Rientjes wrote: > Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so > remove the obscure comment about it and its special-case exception. > > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 3/3] kernel, cpuset: remove exception for __GFP_THISNODE @ 2015-03-02 13:47 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:47 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 02/27/2015 11:17 PM, David Rientjes wrote: > Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so > remove the obscure comment about it and its special-case exception. > > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-02-27 22:16 ` David Rientjes @ 2015-02-27 22:53 ` Christoph Lameter -1 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-02-27 22:53 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Fri, 27 Feb 2015, David Rientjes wrote: > +/* > + * Construct gfp mask to allocate from a specific node but do not invoke reclaim > + * or warn about failures. > + */ We should be triggering reclaim from slab allocations. Why would we not do this? Otherwise we will be going uselessly off node for slab allocations. > +static inline gfp_t gfp_exact_node(gfp_t flags) > +{ > + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; > +} > #endif Reclaim needs to be triggered. In particular zone reclaim was made to be triggered from slab allocations to create more room if needed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-02-27 22:53 ` Christoph Lameter 0 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-02-27 22:53 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Fri, 27 Feb 2015, David Rientjes wrote: > +/* > + * Construct gfp mask to allocate from a specific node but do not invoke reclaim > + * or warn about failures. > + */ We should be triggering reclaim from slab allocations. Why would we not do this? Otherwise we will be going uselessly off node for slab allocations. > +static inline gfp_t gfp_exact_node(gfp_t flags) > +{ > + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; > +} > #endif Reclaim needs to be triggered. In particular zone reclaim was made to be triggered from slab allocations to create more room if needed. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-02-27 22:53 ` Christoph Lameter @ 2015-02-28 3:21 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-28 3:21 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Fri, 27 Feb 2015, Christoph Lameter wrote: > > +/* > > + * Construct gfp mask to allocate from a specific node but do not invoke reclaim > > + * or warn about failures. > > + */ > > We should be triggering reclaim from slab allocations. Why would we not do > this? > > Otherwise we will be going uselessly off node for slab allocations. > > > +static inline gfp_t gfp_exact_node(gfp_t flags) > > +{ > > + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; > > +} > > #endif > > Reclaim needs to be triggered. In particular zone reclaim was made to be > triggered from slab allocations to create more room if needed. > This illustrates the precise need for a patch like this that removes GFP_THISNODE entirely: notice there's no functional change with this patch. GFP_THISNODE is __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY. The calls to ____cache_alloc_node() and cache_grow() modified by this patch in mm/slab.c that pass GFP_THISNODE get caught in the page allocator slowpath by this: if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) goto nopage; with today's kernel. In fact, there is no way for the slab allocator to currently allocate exactly on one node, allow reclaim, and avoid looping forever while suppressing the page allocation failure warning. The reason is because of how GFP_THISNODE is defined above. With this patch, it would be possible to modify gfp_exact_node() so that instead of doing return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; which has no functional change from today, it could be return flags | __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY; so that we _can_ do reclaim for that node and avoid looping forever when the allocation fails. These three flags are the exact same bits set in today's GFP_THISNODE and it is, I agree, what the slab allocator really wants to do in cache_grow(). But the conditional above is what short-circuits such an allocation and needs to be removed, which is what this patch does. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-02-28 3:21 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-02-28 3:21 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Vlastimil Babka, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Fri, 27 Feb 2015, Christoph Lameter wrote: > > +/* > > + * Construct gfp mask to allocate from a specific node but do not invoke reclaim > > + * or warn about failures. > > + */ > > We should be triggering reclaim from slab allocations. Why would we not do > this? > > Otherwise we will be going uselessly off node for slab allocations. > > > +static inline gfp_t gfp_exact_node(gfp_t flags) > > +{ > > + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; > > +} > > #endif > > Reclaim needs to be triggered. In particular zone reclaim was made to be > triggered from slab allocations to create more room if needed. > This illustrates the precise need for a patch like this that removes GFP_THISNODE entirely: notice there's no functional change with this patch. GFP_THISNODE is __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY. The calls to ____cache_alloc_node() and cache_grow() modified by this patch in mm/slab.c that pass GFP_THISNODE get caught in the page allocator slowpath by this: if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) goto nopage; with today's kernel. In fact, there is no way for the slab allocator to currently allocate exactly on one node, allow reclaim, and avoid looping forever while suppressing the page allocation failure warning. The reason is because of how GFP_THISNODE is defined above. With this patch, it would be possible to modify gfp_exact_node() so that instead of doing return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; which has no functional change from today, it could be return flags | __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY; so that we _can_ do reclaim for that node and avoid looping forever when the allocation fails. These three flags are the exact same bits set in today's GFP_THISNODE and it is, I agree, what the slab allocator really wants to do in cache_grow(). But the conditional above is what short-circuits such an allocation and needs to be removed, which is what this patch does. ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <alpine.DEB.2.10.1502271415510.7225-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-02-27 22:16 ` David Rientjes (?) @ 2015-03-02 13:46 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:46 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0TnMu66kgdUjQ On 02/27/2015 11:16 PM, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > GFP_THISNODE is a secret combination of gfp bits that have different > behavior than expected. It is a combination of __GFP_THISNODE, > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > slowpath to fail without trying reclaim even though it may be used in > combination with __GFP_WAIT. > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > that really just wanted __GFP_THISNODE. The problem doesn't end there, > however, because even it was a no-op for alloc_misplaced_dst_page(), > which also sets __GFP_NORETRY and __GFP_NOWARN, and > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > a no-op in these cases since the page allocator special-cases > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE > to restrict an allocation to a local node, but remove GFP_THISNODE and > its obscurity. Instead, we require that a caller clear __GFP_WAIT if it > wants to avoid reclaim. > > This allows the aforementioned functions to actually reclaim as they > should. It also enables any future callers that want to do > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. > > Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so > it is unchanged. > > Signed-off-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> So you've convinced me that this is a non-functional change for slab and a prerequisity for patch 2/3 which is the main point of this series for 4.0. That said, the new 'goto nopage' condition is still triggered by a combination of flag states, and the less we have those, the better for us IMHO. Looking at commit 952f3b51be which introduced this particular check using GFP_THISNODE, it seemed like it was a workaround to avoid triggering reclaim, without needing a new gfp flag. Nowadays, we have such flag called __GFP_NO_KSWAPD and as I explained in my reply to v1 (where I missed the new condition), passing the flag would already prevent wake_all_kswapds() and treating the allocation as atomic in gfp_to_alloc_flags(). So the whole difference would be another get_page_from_freelist() attempt (possibly less constrained than the fast path one) and then bail out on !wait. So it would be IMHO better for longer-term maintainability to have the relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these opportunistic allocation attempts, instead of having this subtle semantic difference attached to __GFP_THISNODE && !__GFP_WAIT. It would be probably too risky for 4.0. On the other hand, I don't think even this series is really urgent. It's true that patch 2/3 fixes two commits, including a 4.0 one, but those commits are already not regressions without the fix. But maybe current -rcX is low enough to proceed with this series and catch any regressions in time, allowing the larger cleanups later. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 13:46 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:46 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 02/27/2015 11:16 PM, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > GFP_THISNODE is a secret combination of gfp bits that have different > behavior than expected. It is a combination of __GFP_THISNODE, > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > slowpath to fail without trying reclaim even though it may be used in > combination with __GFP_WAIT. > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > that really just wanted __GFP_THISNODE. The problem doesn't end there, > however, because even it was a no-op for alloc_misplaced_dst_page(), > which also sets __GFP_NORETRY and __GFP_NOWARN, and > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > a no-op in these cases since the page allocator special-cases > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE > to restrict an allocation to a local node, but remove GFP_THISNODE and > its obscurity. Instead, we require that a caller clear __GFP_WAIT if it > wants to avoid reclaim. > > This allows the aforementioned functions to actually reclaim as they > should. It also enables any future callers that want to do > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. > > Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so > it is unchanged. > > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> So you've convinced me that this is a non-functional change for slab and a prerequisity for patch 2/3 which is the main point of this series for 4.0. That said, the new 'goto nopage' condition is still triggered by a combination of flag states, and the less we have those, the better for us IMHO. Looking at commit 952f3b51be which introduced this particular check using GFP_THISNODE, it seemed like it was a workaround to avoid triggering reclaim, without needing a new gfp flag. Nowadays, we have such flag called __GFP_NO_KSWAPD and as I explained in my reply to v1 (where I missed the new condition), passing the flag would already prevent wake_all_kswapds() and treating the allocation as atomic in gfp_to_alloc_flags(). So the whole difference would be another get_page_from_freelist() attempt (possibly less constrained than the fast path one) and then bail out on !wait. So it would be IMHO better for longer-term maintainability to have the relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these opportunistic allocation attempts, instead of having this subtle semantic difference attached to __GFP_THISNODE && !__GFP_WAIT. It would be probably too risky for 4.0. On the other hand, I don't think even this series is really urgent. It's true that patch 2/3 fixes two commits, including a 4.0 one, but those commits are already not regressions without the fix. But maybe current -rcX is low enough to proceed with this series and catch any regressions in time, allowing the larger cleanups later. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 13:46 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 13:46 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 02/27/2015 11:16 PM, David Rientjes wrote: > NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. > > GFP_THISNODE is a secret combination of gfp bits that have different > behavior than expected. It is a combination of __GFP_THISNODE, > __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator > slowpath to fail without trying reclaim even though it may be used in > combination with __GFP_WAIT. > > An example of the problem this creates: commit e97ca8e5b864 ("mm: fix > GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE > that really just wanted __GFP_THISNODE. The problem doesn't end there, > however, because even it was a no-op for alloc_misplaced_dst_page(), > which also sets __GFP_NORETRY and __GFP_NOWARN, and > migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT > is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is > a no-op in these cases since the page allocator special-cases > __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. > > It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE > to restrict an allocation to a local node, but remove GFP_THISNODE and > its obscurity. Instead, we require that a caller clear __GFP_WAIT if it > wants to avoid reclaim. > > This allows the aforementioned functions to actually reclaim as they > should. It also enables any future callers that want to do > __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The > rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. > > Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so > it is unchanged. > > Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> So you've convinced me that this is a non-functional change for slab and a prerequisity for patch 2/3 which is the main point of this series for 4.0. That said, the new 'goto nopage' condition is still triggered by a combination of flag states, and the less we have those, the better for us IMHO. Looking at commit 952f3b51be which introduced this particular check using GFP_THISNODE, it seemed like it was a workaround to avoid triggering reclaim, without needing a new gfp flag. Nowadays, we have such flag called __GFP_NO_KSWAPD and as I explained in my reply to v1 (where I missed the new condition), passing the flag would already prevent wake_all_kswapds() and treating the allocation as atomic in gfp_to_alloc_flags(). So the whole difference would be another get_page_from_freelist() attempt (possibly less constrained than the fast path one) and then bail out on !wait. So it would be IMHO better for longer-term maintainability to have the relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these opportunistic allocation attempts, instead of having this subtle semantic difference attached to __GFP_THISNODE && !__GFP_WAIT. It would be probably too risky for 4.0. On the other hand, I don't think even this series is really urgent. It's true that patch 2/3 fixes two commits, including a 4.0 one, but those commits are already not regressions without the fix. But maybe current -rcX is low enough to proceed with this series and catch any regressions in time, allowing the larger cleanups later. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-03-02 13:46 ` Vlastimil Babka @ 2015-03-02 15:46 ` Christoph Lameter -1 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-03-02 15:46 UTC (permalink / raw) To: Vlastimil Babka Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Mon, 2 Mar 2015, Vlastimil Babka wrote: > So it would be IMHO better for longer-term maintainability to have the > relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these > opportunistic allocation attempts, instead of having this subtle semantic You are thinking about an opportunistic allocation attempt in SLAB? AFAICT SLAB allocations should trigger reclaim. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 15:46 ` Christoph Lameter 0 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-03-02 15:46 UTC (permalink / raw) To: Vlastimil Babka Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Mon, 2 Mar 2015, Vlastimil Babka wrote: > So it would be IMHO better for longer-term maintainability to have the > relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these > opportunistic allocation attempts, instead of having this subtle semantic You are thinking about an opportunistic allocation attempt in SLAB? AFAICT SLAB allocations should trigger reclaim. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-03-02 15:46 ` Christoph Lameter @ 2015-03-02 16:02 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 16:02 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 03/02/2015 04:46 PM, Christoph Lameter wrote: > On Mon, 2 Mar 2015, Vlastimil Babka wrote: > >> So it would be IMHO better for longer-term maintainability to have the >> relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these >> opportunistic allocation attempts, instead of having this subtle semantic > > You are thinking about an opportunistic allocation attempt in SLAB? > > AFAICT SLAB allocations should trigger reclaim. > Well, let me quote your commit 952f3b51beb5: -------- commit 952f3b51beb592f3f1de15adcdef802fc086ea91 Author: Christoph Lameter <clameter@sgi.com> Date: Wed Dec 6 20:33:26 2006 -0800 [PATCH] GFP_THISNODE must not trigger global reclaim The intent of GFP_THISNODE is to make sure that an allocation occurs on a particular node. If this is not possible then NULL needs to be returned so that the caller can choose what to do next on its own (the slab allocator depends on that). However, GFP_THISNODE currently triggers reclaim before returning a failure (GFP_THISNODE means GFP_NORETRY is set). If we have over allocated a node then we will currently do some reclaim before returning NULL. The caller may want memory from other nodes before reclaim should be triggered. (If the caller wants reclaim then he can directly use __GFP_THISNODE instead). There is no flag to avoid reclaim in the page allocator and adding yet another GFP_xx flag would be difficult given that we are out of available flags. So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE, __GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before waking up kswapd. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5d123b3..dc8753b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1151,6 +1151,17 @@ restart: if (page) goto got_pg; + /* + * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and + * __GFP_NOWARN set) should not cause reclaim since the subsystem + * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim + * using a larger set of nodes after it has established that the + * allowed per node queues are empty and that nodes are + * over allocated. + */ + if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) + goto nopage; + for (z = zonelist->zones; *z; z++) wakeup_kswapd(*z, order); -------- So it seems to me that *at least some* allocations in slab are supposed to work like this? Of course it's possible that since 2006, more allocation sites in slab started passing GFP_THISNODE without realizing this semantics. In that case, such sites should be fixed. (I think David already mentioned some in this thread?) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 16:02 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 16:02 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 03/02/2015 04:46 PM, Christoph Lameter wrote: > On Mon, 2 Mar 2015, Vlastimil Babka wrote: > >> So it would be IMHO better for longer-term maintainability to have the >> relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these >> opportunistic allocation attempts, instead of having this subtle semantic > > You are thinking about an opportunistic allocation attempt in SLAB? > > AFAICT SLAB allocations should trigger reclaim. > Well, let me quote your commit 952f3b51beb5: -------- commit 952f3b51beb592f3f1de15adcdef802fc086ea91 Author: Christoph Lameter <clameter@sgi.com> Date: Wed Dec 6 20:33:26 2006 -0800 [PATCH] GFP_THISNODE must not trigger global reclaim The intent of GFP_THISNODE is to make sure that an allocation occurs on a particular node. If this is not possible then NULL needs to be returned so that the caller can choose what to do next on its own (the slab allocator depends on that). However, GFP_THISNODE currently triggers reclaim before returning a failure (GFP_THISNODE means GFP_NORETRY is set). If we have over allocated a node then we will currently do some reclaim before returning NULL. The caller may want memory from other nodes before reclaim should be triggered. (If the caller wants reclaim then he can directly use __GFP_THISNODE instead). There is no flag to avoid reclaim in the page allocator and adding yet another GFP_xx flag would be difficult given that we are out of available flags. So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE, __GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before waking up kswapd. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5d123b3..dc8753b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1151,6 +1151,17 @@ restart: if (page) goto got_pg; + /* + * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and + * __GFP_NOWARN set) should not cause reclaim since the subsystem + * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim + * using a larger set of nodes after it has established that the + * allowed per node queues are empty and that nodes are + * over allocated. + */ + if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) + goto nopage; + for (z = zonelist->zones; *z; z++) wakeup_kswapd(*z, order); -------- So it seems to me that *at least some* allocations in slab are supposed to work like this? Of course it's possible that since 2006, more allocation sites in slab started passing GFP_THISNODE without realizing this semantics. In that case, such sites should be fixed. (I think David already mentioned some in this thread?) ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-03-02 16:02 ` Vlastimil Babka @ 2015-03-02 16:08 ` Christoph Lameter -1 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-03-02 16:08 UTC (permalink / raw) To: Vlastimil Babka Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Mon, 2 Mar 2015, Vlastimil Babka wrote: > > You are thinking about an opportunistic allocation attempt in SLAB? > > > > AFAICT SLAB allocations should trigger reclaim. > > > > Well, let me quote your commit 952f3b51beb5: This was about global reclaim. Local reclaim is good and that can be done via zone_reclaim. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 16:08 ` Christoph Lameter 0 siblings, 0 replies; 50+ messages in thread From: Christoph Lameter @ 2015-03-02 16:08 UTC (permalink / raw) To: Vlastimil Babka Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Mon, 2 Mar 2015, Vlastimil Babka wrote: > > You are thinking about an opportunistic allocation attempt in SLAB? > > > > AFAICT SLAB allocations should trigger reclaim. > > > > Well, let me quote your commit 952f3b51beb5: This was about global reclaim. Local reclaim is good and that can be done via zone_reclaim. ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <alpine.DEB.2.11.1503021007030.6245-gkYfJU5Cukgdnm+yROfE0A@public.gmane.org>]
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-03-02 16:08 ` Christoph Lameter (?) @ 2015-03-02 16:23 ` Vlastimil Babka -1 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 16:23 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0TnMu66kgdUjQ On 03/02/2015 05:08 PM, Christoph Lameter wrote: > On Mon, 2 Mar 2015, Vlastimil Babka wrote: > >>> You are thinking about an opportunistic allocation attempt in SLAB? >>> >>> AFAICT SLAB allocations should trigger reclaim. >>> >> >> Well, let me quote your commit 952f3b51beb5: > > This was about global reclaim. Local reclaim is good and that can be > done via zone_reclaim. Right, so the patch is a functional change for zone_reclaim_mode == 1, where !__GFP_WAIT will prevent it. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 16:23 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 16:23 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 03/02/2015 05:08 PM, Christoph Lameter wrote: > On Mon, 2 Mar 2015, Vlastimil Babka wrote: > >>> You are thinking about an opportunistic allocation attempt in SLAB? >>> >>> AFAICT SLAB allocations should trigger reclaim. >>> >> >> Well, let me quote your commit 952f3b51beb5: > > This was about global reclaim. Local reclaim is good and that can be > done via zone_reclaim. Right, so the patch is a functional change for zone_reclaim_mode == 1, where !__GFP_WAIT will prevent it. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 16:23 ` Vlastimil Babka 0 siblings, 0 replies; 50+ messages in thread From: Vlastimil Babka @ 2015-03-02 16:23 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On 03/02/2015 05:08 PM, Christoph Lameter wrote: > On Mon, 2 Mar 2015, Vlastimil Babka wrote: > >>> You are thinking about an opportunistic allocation attempt in SLAB? >>> >>> AFAICT SLAB allocations should trigger reclaim. >>> >> >> Well, let me quote your commit 952f3b51beb5: > > This was about global reclaim. Local reclaim is good and that can be > done via zone_reclaim. Right, so the patch is a functional change for zone_reclaim_mode == 1, where !__GFP_WAIT will prevent it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE 2015-03-02 16:23 ` Vlastimil Babka @ 2015-03-02 20:40 ` David Rientjes -1 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-03-02 20:40 UTC (permalink / raw) To: Vlastimil Babka Cc: Christoph Lameter, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Mon, 2 Mar 2015, Vlastimil Babka wrote: > > > > You are thinking about an opportunistic allocation attempt in SLAB? > > > > > > > > AFAICT SLAB allocations should trigger reclaim. > > > > > > > > > > Well, let me quote your commit 952f3b51beb5: > > > > This was about global reclaim. Local reclaim is good and that can be > > done via zone_reclaim. > > Right, so the patch is a functional change for zone_reclaim_mode == 1, where > !__GFP_WAIT will prevent it. > My patch is not a functional change, get_page_from_freelist() handles zone_reclaim_mode == 1 properly in the page allocator fastpath. This patch only touches the slowpath. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [patch v2 1/3] mm: remove GFP_THISNODE @ 2015-03-02 20:40 ` David Rientjes 0 siblings, 0 replies; 50+ messages in thread From: David Rientjes @ 2015-03-02 20:40 UTC (permalink / raw) To: Vlastimil Babka Cc: Christoph Lameter, Andrew Morton, Pekka Enberg, Joonsoo Kim, Johannes Weiner, Mel Gorman, Pravin Shelar, Jarno Rajahalme, Li Zefan, Greg Thelen, linux-kernel, linux-mm, netdev, cgroups, dev On Mon, 2 Mar 2015, Vlastimil Babka wrote: > > > > You are thinking about an opportunistic allocation attempt in SLAB? > > > > > > > > AFAICT SLAB allocations should trigger reclaim. > > > > > > > > > > Well, let me quote your commit 952f3b51beb5: > > > > This was about global reclaim. Local reclaim is good and that can be > > done via zone_reclaim. > > Right, so the patch is a functional change for zone_reclaim_mode == 1, where > !__GFP_WAIT will prevent it. > My patch is not a functional change, get_page_from_freelist() handles zone_reclaim_mode == 1 properly in the page allocator fastpath. This patch only touches the slowpath. ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2015-03-02 20:40 UTC | newest]
Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-26 0:23 [patch 1/2] mm: remove GFP_THISNODE David Rientjes
2015-02-26 0:23 ` David Rientjes
2015-02-26 0:56 ` Christoph Lameter
2015-02-26 0:56 ` Christoph Lameter
2015-02-26 1:04 ` David Rientjes
2015-02-26 1:04 ` David Rientjes
2015-02-26 8:30 ` Vlastimil Babka
2015-02-26 8:30 ` Vlastimil Babka
2015-02-27 3:09 ` David Rientjes
2015-02-27 3:09 ` David Rientjes
2015-02-27 7:34 ` Vlastimil Babka
2015-02-27 7:34 ` Vlastimil Babka
2015-02-27 22:03 ` David Rientjes
2015-02-27 22:03 ` David Rientjes
2015-02-27 22:19 ` Vlastimil Babka
2015-02-27 22:19 ` Vlastimil Babka
2015-02-27 22:31 ` David Rientjes
2015-02-27 22:31 ` David Rientjes
2015-02-27 22:52 ` Vlastimil Babka
2015-02-27 22:52 ` Vlastimil Babka
[not found] ` <alpine.DEB.2.10.1502251621010.10303-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2015-02-27 22:16 ` [patch v2 1/3] " David Rientjes
2015-02-27 22:16 ` David Rientjes
2015-02-27 22:16 ` David Rientjes
2015-02-27 22:17 ` [patch v2 2/3] mm, thp: really limit transparent hugepage allocation to local node David Rientjes
2015-02-27 22:17 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1502271416580.7225-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2015-03-02 13:47 ` Vlastimil Babka
2015-03-02 13:47 ` Vlastimil Babka
2015-03-02 13:47 ` Vlastimil Babka
2015-02-27 22:17 ` [patch v2 3/3] kernel, cpuset: remove exception for __GFP_THISNODE David Rientjes
2015-02-27 22:17 ` David Rientjes
2015-03-02 13:47 ` Vlastimil Babka
2015-03-02 13:47 ` Vlastimil Babka
2015-02-27 22:53 ` [patch v2 1/3] mm: remove GFP_THISNODE Christoph Lameter
2015-02-27 22:53 ` Christoph Lameter
2015-02-28 3:21 ` David Rientjes
2015-02-28 3:21 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1502271415510.7225-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2015-03-02 13:46 ` Vlastimil Babka
2015-03-02 13:46 ` Vlastimil Babka
2015-03-02 13:46 ` Vlastimil Babka
2015-03-02 15:46 ` Christoph Lameter
2015-03-02 15:46 ` Christoph Lameter
2015-03-02 16:02 ` Vlastimil Babka
2015-03-02 16:02 ` Vlastimil Babka
2015-03-02 16:08 ` Christoph Lameter
2015-03-02 16:08 ` Christoph Lameter
[not found] ` <alpine.DEB.2.11.1503021007030.6245-gkYfJU5Cukgdnm+yROfE0A@public.gmane.org>
2015-03-02 16:23 ` Vlastimil Babka
2015-03-02 16:23 ` Vlastimil Babka
2015-03-02 16:23 ` Vlastimil Babka
2015-03-02 20:40 ` David Rientjes
2015-03-02 20:40 ` David Rientjes
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.