Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@suse.de>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Rik van Riel <riel@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
Date: Wed, 11 Dec 2013 22:47:19 +0000	[thread overview]
Message-ID: <20131211224719.GE11295@suse.de> (raw)
In-Reply-To: <1386785356-19911-1-git-send-email-hannes@cmpxchg.org>

On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> Dave Hansen noted a regression in a microbenchmark that loops around
> open() and close() on an 8-node NUMA machine and bisected it down to
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> change forces the slab allocations of the file descriptor to spread
> out to all 8 nodes, causing remote references in the page allocator
> and slab.
> 

The original patch was primarily concerned with the fair aging of LRU pages
of zones within a node. This patch uses GFP_MOVABLE_MASK which includes
__GFP_RECLAIMABLE meaning any slab created with SLAB_RECLAIM_ACCOUNT is still
getting the round-robin treatment. Those pages have a different lifecycle
to LRU pages and the shrinkers are only node aware, not zone aware.
While I get this patch probably helps this specific benchmark, was the
use of GFP_MOVABLE_MASK intentional or did you mean to use __GFP_MOVABLE?

Looking at the original patch again I think I made a major mistake when
reviewing it. Considering the effect of the following for NUMA machines

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                                high_zoneidx, nodemask) {
		....
                if (alloc_flags & ALLOC_WMARK_LOW) {
                        if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
				continue;
                        if (zone_reclaim_mode &&
                            !zone_local(preferred_zone, zone))
                                continue;
		}

Enabling zone_reclaim_mode sucks badly for workloads that are not paritioned
to fit within NUMA nodes. Consequently, I expect the common case it that
it's disabled by default due to small NUMA distances or manually disabled.

However, the effect of that block is that we allocate NR_ALLOC_BATCH
from local zones then fallback to batch allocating remote nodes! I bet
the numa_hit stats in /proc/vmstat have sucked recently. The original
problem was because the page allocator would try allocating from the
highest zone while kswapd reclaimed from it causing LRU-aging problems.
The problem is not the same between nodes. How do you feel about dropping
the zone_reclaim_mode check above and only round-robin in batches between
zones on the local node?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Mel Gorman <mgorman@suse.de>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Rik van Riel <riel@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
Date: Wed, 11 Dec 2013 22:47:19 +0000	[thread overview]
Message-ID: <20131211224719.GE11295@suse.de> (raw)
In-Reply-To: <1386785356-19911-1-git-send-email-hannes@cmpxchg.org>

On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> Dave Hansen noted a regression in a microbenchmark that loops around
> open() and close() on an 8-node NUMA machine and bisected it down to
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> change forces the slab allocations of the file descriptor to spread
> out to all 8 nodes, causing remote references in the page allocator
> and slab.
> 

The original patch was primarily concerned with the fair aging of LRU pages
of zones within a node. This patch uses GFP_MOVABLE_MASK which includes
__GFP_RECLAIMABLE meaning any slab created with SLAB_RECLAIM_ACCOUNT is still
getting the round-robin treatment. Those pages have a different lifecycle
to LRU pages and the shrinkers are only node aware, not zone aware.
While I get this patch probably helps this specific benchmark, was the
use of GFP_MOVABLE_MASK intentional or did you mean to use __GFP_MOVABLE?

Looking at the original patch again I think I made a major mistake when
reviewing it. Considering the effect of the following for NUMA machines

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                                high_zoneidx, nodemask) {
		....
                if (alloc_flags & ALLOC_WMARK_LOW) {
                        if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
				continue;
                        if (zone_reclaim_mode &&
                            !zone_local(preferred_zone, zone))
                                continue;
		}

Enabling zone_reclaim_mode sucks badly for workloads that are not paritioned
to fit within NUMA nodes. Consequently, I expect the common case it that
it's disabled by default due to small NUMA distances or manually disabled.

However, the effect of that block is that we allocate NR_ALLOC_BATCH
from local zones then fallback to batch allocating remote nodes! I bet
the numa_hit stats in /proc/vmstat have sucked recently. The original
problem was because the page allocator would try allocating from the
highest zone while kswapd reclaimed from it causing LRU-aging problems.
The problem is not the same between nodes. How do you feel about dropping
the zone_reclaim_mode check above and only round-robin in batches between
zones on the local node?

-- 
Mel Gorman
SUSE Labs

next prev parent reply	other threads:[~2013-12-11 22:47 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-11 18:09 [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Johannes Weiner
2013-12-11 18:09 ` Johannes Weiner
2013-12-11 18:24 ` Rik van Riel
2013-12-11 18:24   ` Rik van Riel
2013-12-11 22:47 ` Mel Gorman [this message]
2013-12-11 22:47   ` Mel Gorman
2013-12-12  1:09   ` Johannes Weiner
2013-12-12  1:09     ` Johannes Weiner
2013-12-12 13:18     ` Mel Gorman
2013-12-12 13:18       ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131211224719.GE11295@suse.de \
    --to=mgorman@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.