linux-mm.kvack.org archive mirror
* [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
@ 2013-12-11 18:09 Johannes Weiner
  2013-12-11 18:24 ` Rik van Riel
  2013-12-11 22:47 ` Mel Gorman
  0 siblings, 2 replies; 5+ messages in thread
From: Johannes Weiner @ 2013-12-11 18:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, linux-mm, linux-kernel

Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone.  It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.
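
For reference, GFP_MOVABLE_MASK covers both the movable and the
reclaimable allocation types; in include/linux/gfp.h it is defined as:

	#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)

so user memory and reclaim-accounted slab pages keep getting the
round-robin treatment, while all other kernel allocations stay local.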

Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f075ed0..f861d0257e90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & GFP_MOVABLE_MASK)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (zone_reclaim_mode &&
-- 
1.8.4.2


* Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-11 18:09 [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Johannes Weiner
@ 2013-12-11 18:24 ` Rik van Riel
  2013-12-11 22:47 ` Mel Gorman
  1 sibling, 0 replies; 5+ messages in thread
From: Rik van Riel @ 2013-12-11 18:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Mel Gorman, linux-mm, linux-kernel

On 12/11/2013 01:09 PM, Johannes Weiner wrote:
> Dave Hansen noted a regression in a microbenchmark that loops around
> open() and close() on an 8-node NUMA machine and bisected it down to
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> change forces the slab allocations of the file descriptor to spread
> out to all 8 nodes, causing remote references in the page allocator
> and slab.
> 
> The round-robin policy is only there to provide fairness among memory
> allocations that are reclaimed involuntarily based on pressure in each
> zone.  It does not make sense to apply it to unreclaimable kernel
> allocations that are freed manually, in this case instantly after the
> allocation, and incur the remote reference costs twice for no reason.
> 
> Only round-robin allocations that are usually freed through page
> reclaim or slab shrinking.
> 
> Bisected-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@kernel.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-11 18:09 [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Johannes Weiner
  2013-12-11 18:24 ` Rik van Riel
@ 2013-12-11 22:47 ` Mel Gorman
  2013-12-12  1:09   ` Johannes Weiner
  1 sibling, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2013-12-11 22:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, linux-mm, linux-kernel

On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> Dave Hansen noted a regression in a microbenchmark that loops around
> open() and close() on an 8-node NUMA machine and bisected it down to
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> change forces the slab allocations of the file descriptor to spread
> out to all 8 nodes, causing remote references in the page allocator
> and slab.
> 

The original patch was primarily concerned with the fair aging of LRU pages
of zones within a node. This patch uses GFP_MOVABLE_MASK, which includes
__GFP_RECLAIMABLE, meaning any slab created with SLAB_RECLAIM_ACCOUNT is still
getting the round-robin treatment. Those pages have a different lifecycle
to LRU pages, and the shrinkers are only node aware, not zone aware.
While I get that this patch probably helps this specific benchmark, was the
use of GFP_MOVABLE_MASK intentional, or did you mean to use __GFP_MOVABLE?
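
For context, the mapping from SLAB_RECLAIM_ACCOUNT to __GFP_RECLAIMABLE
happens when the slab allocators get their backing pages; paraphrased,
the relevant logic is roughly:

	/* reclaim-accounted caches allocate their backing pages as reclaimable */
	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
		flags |= __GFP_RECLAIMABLE;

which is why GFP_MOVABLE_MASK, containing __GFP_RECLAIMABLE, catches them.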

Looking at the original patch again, I think I made a major mistake when
reviewing it. Consider the effect of the following for NUMA machines:

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                                high_zoneidx, nodemask) {
		....
                if (alloc_flags & ALLOC_WMARK_LOW) {
                        if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
				continue;
                        if (zone_reclaim_mode &&
                            !zone_local(preferred_zone, zone))
                                continue;
		}


Enabling zone_reclaim_mode sucks badly for workloads that are not partitioned
to fit within NUMA nodes. Consequently, I expect the common case is that
it's disabled, either by default due to small NUMA distances or manually.

However, the effect of that block is that we allocate NR_ALLOC_BATCH
from local zones and then fall back to batch-allocating remote nodes! I bet
the numa_hit stats in /proc/vmstat have sucked recently. The original
problem was that the page allocator would try allocating from the
highest zone while kswapd reclaimed from it, causing LRU-aging problems.
The problem is not the same between nodes. How do you feel about dropping
the zone_reclaim_mode check above and only round-robining in batches between
zones on the local node?
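
Something like this untested sketch, i.e. the same block with the
zone_reclaim_mode test simply dropped:

		if (alloc_flags & ALLOC_WMARK_LOW) {
			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
				continue;
			/* round-robin only among zones of the local node */
			if (!zone_local(preferred_zone, zone))
				continue;
		}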

-- 
Mel Gorman
SUSE Labs


* Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-11 22:47 ` Mel Gorman
@ 2013-12-12  1:09   ` Johannes Weiner
  2013-12-12 13:18     ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2013-12-12  1:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, linux-mm, linux-kernel

On Wed, Dec 11, 2013 at 10:47:19PM +0000, Mel Gorman wrote:
> On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> > Dave Hansen noted a regression in a microbenchmark that loops around
> > open() and close() on an 8-node NUMA machine and bisected it down to
> > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> > change forces the slab allocations of the file descriptor to spread
> > out to all 8 nodes, causing remote references in the page allocator
> > and slab.
> > 
> 
> The original patch was primarily concerned with the fair aging of LRU pages
> of zones within a node. This patch uses GFP_MOVABLE_MASK, which includes
> __GFP_RECLAIMABLE, meaning any slab created with SLAB_RECLAIM_ACCOUNT is still
> getting the round-robin treatment. Those pages have a different lifecycle
> to LRU pages, and the shrinkers are only node aware, not zone aware.
> While I get that this patch probably helps this specific benchmark, was the
> use of GFP_MOVABLE_MASK intentional, or did you mean to use __GFP_MOVABLE?

It was intentional to spread SLAB_RECLAIM_ACCOUNT pages across all
allowed nodes evenly for the same aging fairness reason.

> Looking at the original patch again, I think I made a major mistake when
> reviewing it. Consider the effect of the following for NUMA machines:
> 
>         for_each_zone_zonelist_nodemask(zone, z, zonelist,
>                                                 high_zoneidx, nodemask) {
> 		....
>                 if (alloc_flags & ALLOC_WMARK_LOW) {
>                         if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> 				continue;
>                         if (zone_reclaim_mode &&
>                             !zone_local(preferred_zone, zone))
>                                 continue;
> 		}
> 
> 
> Enabling zone_reclaim_mode sucks badly for workloads that are not partitioned
> to fit within NUMA nodes. Consequently, I expect the common case is that
> it's disabled, either by default due to small NUMA distances or manually.
> 
> However, the effect of that block is that we allocate NR_ALLOC_BATCH
> from local zones and then fall back to batch-allocating remote nodes! I bet
> the numa_hit stats in /proc/vmstat have sucked recently. The original
> problem was that the page allocator would try allocating from the
> highest zone while kswapd reclaimed from it, causing LRU-aging problems.
> The problem is not the same between nodes. How do you feel about dropping
> the zone_reclaim_mode check above and only round-robining in batches between
> zones on the local node?

It might not be for anon, but it's the same problem for cache.  The
page allocator will fill all the nodes in the system before waking up
the kswapds.  It will utilize all nodes, just not evenly.

I know that at the node level staying local is often preferable to
full memory utilization, but I was under the assumption that
zone_reclaim_mode is there to express this preference.

My patch certainly makes this preference more aggressive in the sense
that there is no gray zone anymore.  There is no trying to stay local.
A block of memory is either not used at all, or used to the same
extent as any other block of the same size; that's the requirement
for fair aging.

That being said, the fairness concerns are primarily about file pages.
Should we exclude anon and slab pages entirely?  I'd still account for
them in the batches but only apply placement rules to page cache.
That should still leave us with roughly equal cache aging speeds in
all zones and nodes.
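
As an untested sketch of that idea, with a made-up __GFP_PAGECACHE flag
standing in for whatever would actually identify cache allocations:

		if (alloc_flags & ALLOC_WMARK_LOW) {
			/*
			 * Anon and slab still debit NR_ALLOC_BATCH when they
			 * allocate, but only page cache (hypothetical flag)
			 * is actively spread to the next zone in the cycle.
			 */
			if ((gfp_mask & __GFP_PAGECACHE) &&
			    zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
				continue;
		}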


* Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-12  1:09   ` Johannes Weiner
@ 2013-12-12 13:18     ` Mel Gorman
  0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2013-12-12 13:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, linux-mm, linux-kernel

On Wed, Dec 11, 2013 at 08:09:03PM -0500, Johannes Weiner wrote:
> On Wed, Dec 11, 2013 at 10:47:19PM +0000, Mel Gorman wrote:
> > On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> > > Dave Hansen noted a regression in a microbenchmark that loops around
> > > open() and close() on an 8-node NUMA machine and bisected it down to
> > > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> > > change forces the slab allocations of the file descriptor to spread
> > > out to all 8 nodes, causing remote references in the page allocator
> > > and slab.
> > > 
> > 
> > The original patch was primarily concerned with the fair aging of LRU pages
> > of zones within a node. This patch uses GFP_MOVABLE_MASK, which includes
> > __GFP_RECLAIMABLE, meaning any slab created with SLAB_RECLAIM_ACCOUNT is still
> > getting the round-robin treatment. Those pages have a different lifecycle
> > to LRU pages, and the shrinkers are only node aware, not zone aware.
> > While I get that this patch probably helps this specific benchmark, was the
> > use of GFP_MOVABLE_MASK intentional, or did you mean to use __GFP_MOVABLE?
> 
> It was intentional to spread SLAB_RECLAIM_ACCOUNT pages across all
> allowed nodes evenly for the same aging fairness reason.
> 

That could be argued either way; ultimately it'll come down to being
workload dependent.

> > Looking at the original patch again, I think I made a major mistake when
> > reviewing it. Consider the effect of the following for NUMA machines:
> > 
> >         for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >                                                 high_zoneidx, nodemask) {
> > 		....
> >                 if (alloc_flags & ALLOC_WMARK_LOW) {
> >                         if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> > 				continue;
> >                         if (zone_reclaim_mode &&
> >                             !zone_local(preferred_zone, zone))
> >                                 continue;
> > 		}
> > 
> > 
> > Enabling zone_reclaim_mode sucks badly for workloads that are not partitioned
> > to fit within NUMA nodes. Consequently, I expect the common case is that
> > it's disabled, either by default due to small NUMA distances or manually.
> > 
> > However, the effect of that block is that we allocate NR_ALLOC_BATCH
> > from local zones and then fall back to batch-allocating remote nodes! I bet
> > the numa_hit stats in /proc/vmstat have sucked recently. The original
> > problem was that the page allocator would try allocating from the
> > highest zone while kswapd reclaimed from it, causing LRU-aging problems.
> > The problem is not the same between nodes. How do you feel about dropping
> > the zone_reclaim_mode check above and only round-robining in batches between
> > zones on the local node?
> 
> It might not be for anon, but it's the same problem for cache.  The
> page allocator will fill all the nodes in the system before waking up
> the kswapds.  It will utilize all nodes, just not evenly.
> 

Which is a definite remote allocation penalty now versus a potential page
aging inversion problem later. That's a shitty tradeoff.

> I know that at the node level staying local is often preferable to
> full memory utilization, but I was under the assumption that
> zone_reclaim_mode is there to express this preference.
> 

It's the expected behaviour of MPOL_LOCAL, and the original patch
significantly altered that default behaviour in a way that should have
been flagged. The zone_reclaim_mode tunable has nasty side effects for
workloads that are not partitioned to fit within a NUMA node. I blame
myself here because that patch should have set off klaxons, but I was
blinded by those lovely performance figures.

> My patch certainly makes this preference more aggressive in the sense
> that there is no gray zone anymore.  There is no trying to stay local.

Which automatic NUMA balancing then comes in and stomps all over.

> A block of memory is either not used at all, or used to the same
> extent as any other block of the same size; that's the requirement
> for fair aging.
> 
> That being said, the fairness concerns are primarily about file pages.
> Should we exclude anon and slab pages entirely? 

File pages would certainly be the most noticeable. We may not want to
prematurely discard clean pages because of the IO cost of bringing them
back in, for example, among other penalties. Unfortunately, there is no
reliable way to distinguish these types of pages from allocator
context; we'd need another GFP flag. At least, an obvious method did
not immediately spring to mind.

With or without that flag, deciding how to treat them would lead to a
lot of hand waving and bullshit. I'll put together a small series that
makes this configurable, sets what I think are sensible defaults, and
we can figure out the way forward on that basis.

-- 
Mel Gorman
SUSE Labs


Thread overview: 5+ messages
2013-12-11 18:09 [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Johannes Weiner
2013-12-11 18:24 ` Rik van Riel
2013-12-11 22:47 ` Mel Gorman
2013-12-12  1:09   ` Johannes Weiner
2013-12-12 13:18     ` Mel Gorman
