From: Mel Gorman <mgorman@techsingularity.net>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linux-MM <linux-mm@kvack.org>, Rik van Riel <riel@surriel.com>,
Vlastimil Babka <vbabka@suse.cz>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
Date: Tue, 23 Feb 2016 21:58:59 +0000
Message-ID: <20160223215859.GO2854@techsingularity.net>
In-Reply-To: <20160223205915.GA10744@cmpxchg.org>
On Tue, Feb 23, 2016 at 12:59:15PM -0800, Johannes Weiner wrote:
> On Tue, Feb 23, 2016 at 08:19:32PM +0000, Mel Gorman wrote:
> > On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> > > On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > > > In many benchmarks, there is an obvious difference in the number of
> > > > allocations from each zone as the fair zone allocation policy is removed
> > > > towards the end of the series. For example, this is the allocation stats
> > > > when running blogbench, which showed no difference in headline performance:
> > > >
> > > >                   mmotm-20160209   nodelru-v2
> > > > DMA allocs                     0            0
> > > > DMA32 allocs             7218763       608067
> > > > Normal allocs           12701806     18821286
> > > > Movable allocs                 0            0
> > >
> > > According to the mmotm numbers, your DMA32 zone is over a third of
> > > available memory, yet in the nodelru-v2 kernel it sees only 3% of the
> > > allocations.
> >
> > In this case yes but blogbench is not scaled to memory size and is not
> > reclaim intensive. If you look, you'll see the total number of overall
> > allocations is very similar. During that test, there is a small amount of
> > kswapd scan activity (but not reclaim which is odd) at the start of the
> > test for nodelru but that's about it.
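
The shares and totals discussed here can be checked against the table with a few lines of C. This is only a sketch: share_percent() and the macro names are hypothetical, and the inputs are the DMA32 and Normal counts from the blogbench table (DMA and Movable are zero in both kernels).

```c
#include <assert.h>

/* Allocation counts copied from the blogbench table. */
#define MMOTM_DMA32	7218763UL
#define MMOTM_NORMAL	12701806UL
#define NODELRU_DMA32	608067UL
#define NODELRU_NORMAL	18821286UL

/* Hypothetical helper: percentage of all allocations that landed in one zone. */
double share_percent(unsigned long zone_allocs, unsigned long total_allocs)
{
	assert(total_allocs > 0);
	return 100.0 * (double)zone_allocs / (double)total_allocs;
}
```

With these inputs the DMA32 share comes out at roughly 36% on mmotm-20160209 and roughly 3% on nodelru-v2, while the totals (19920569 and 19429353) differ by about 2.5%.
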
>
> Yes, if fairness enforcement is now done by reclaim, then workloads
> without reclaim will show skewed placement as the Normal zone is again
> filled up first before moving on to the next zone.
>
> That is fine. But what about the balance in reclaiming workloads?
>
That is the key question -- whether node LRU reclaim renders the fair
zone allocation policy unnecessary.
> > > That's an insanely high level of aging inversion, where
> > > the lifetime of a cache entry is again highly dependent on placement.
> > >
> >
> > The aging is now independent of what zone the page was allocated from because
> > it's node-based LRU reclaim. That may mean that the occupancy of individual
> > zones is now different but it should only matter if there is a large number
> > of address-limited requests.
>
> The problem is that kswapd will stay awake and continuously draw
> subsequent allocations into a single zone, thus utilizing only a
> fraction of available memory.

Not quite. Look at prepare_kswapd_sleep() in the full series and it has this:

	for (i = 0; i <= classzone_idx; i++) {
		struct zone *zone = pgdat->node_zones + i;
		if (!populated_zone(zone))
			continue;
		if (zone_balanced(zone, order, 0, classzone_idx))
			return true;
	}

and balance_pgdat() has this:

	/* Only reclaim if there are no eligible zones */
	for (i = classzone_idx; i >= 0; i--) {
		zone = pgdat->node_zones + i;
		if (!populated_zone(zone))
			continue;
		if (!zone_balanced(zone, order, 0, classzone_idx)) {
			classzone_idx = i;
			break;
		}
	}

kswapd only stays awake until *one* balanced zone is available. That is
a key difference with the existing kswapd which balances all zones.
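
As a rough illustration of that difference, here is a toy userspace model, not kernel code. struct toy_zone and both helpers are hypothetical; the two flags merely stand in for populated_zone() and zone_balanced(), and the "all zones" variant is a simplification of what the existing kswapd does.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ZONES 4	/* DMA, DMA32, Normal, Movable */

/* Hypothetical stand-in for struct zone: only the two facts the loops test. */
struct toy_zone {
	bool populated;	/* models populated_zone() */
	bool balanced;	/* models zone_balanced() for the request */
};

/* Node-LRU series: kswapd may sleep once any eligible zone is balanced. */
bool toy_sleep_any_balanced(const struct toy_zone *zones, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < TOY_NR_ZONES);
	for (int i = 0; i <= classzone_idx; i++) {
		if (!zones[i].populated)
			continue;
		if (zones[i].balanced)
			return true;
	}
	return false;
}

/* Simplified existing behaviour: every populated eligible zone must be balanced. */
bool toy_sleep_all_balanced(const struct toy_zone *zones, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < TOY_NR_ZONES);
	for (int i = 0; i <= classzone_idx; i++) {
		if (zones[i].populated && !zones[i].balanced)
			return false;
	}
	return true;
}
```

With DMA and Normal balanced but DMA32 below its watermark, the first helper lets kswapd sleep for a Normal-classzone request while the second keeps it awake.
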
> DMA32-limited kswapd wakeups can
> reclaim cache in DMA32 continuously if the allocator continuously
> places new cache pages in that zone. It looks like that is what
> happened in the stutter benchmark.
>
There may be corner cases where we artificially wake kswapd at DMA32
instead of a higher zone. If that happens, it should be addressed so
that only GFP_DMA32 wakes and reclaims that zone.
> Sure, it doesn't matter in that benchmark, because the pages are used
> only once. But if it had an actual cache workingset bigger than DMA32
> but smaller than DMA32+Normal, it would be thrashing unnecessarily.
>
> If kswapd were truly balancing the pages in a node equally, regardless
> of zone placement, then in the long run we should see zone allocations
> converge to a share that is in proportion to each zone's size. As far
> as I can see, that is not quite happening yet.
>
Not quite either. The order in which kswapd reclaims is related to the age of
all pages in the node. Early in the lifetime of the system, that may be
ZONE_NORMAL initially until the other zones are populated. Ultimately
the balance of zones will be related to the age of the pages.
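
To make the age-based ordering concrete, a toy model of a single node-wide LRU is sketched below. It is an illustration only: struct toy_page, toy_reclaim_oldest() and the oldest-first array are hypothetical, and the model ignores active/inactive lists, page references and address-limited requests.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_ZONES 4	/* DMA, DMA32, Normal, Movable */

/* Hypothetical page record: only the zone the page was allocated from. */
struct toy_page {
	int zone_idx;
};

/*
 * Reclaim up to nr_to_reclaim pages from a node-wide LRU.
 * lru[] is ordered oldest first; reclaimed[] receives a per-zone tally.
 * Returns the number of pages reclaimed.
 */
size_t toy_reclaim_oldest(const struct toy_page *lru, size_t nr_pages,
			  size_t nr_to_reclaim, size_t reclaimed[TOY_NR_ZONES])
{
	size_t i, n = nr_to_reclaim < nr_pages ? nr_to_reclaim : nr_pages;

	for (i = 0; i < TOY_NR_ZONES; i++)
		reclaimed[i] = 0;

	/* Pages go strictly by age; the zone is only recorded, never consulted. */
	for (i = 0; i < n; i++) {
		assert(lru[i].zone_idx >= 0 && lru[i].zone_idx < TOY_NR_ZONES);
		reclaimed[lru[i].zone_idx]++;
	}
	return n;
}
```

In this model the per-zone tally is purely a consequence of which zone the oldest pages happen to sit in, which is the sense in which zone balance follows page age.
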
> > > The fact that this doesn't make a performance difference in the
> > > specific benchmarks you ran only proves just that: these specific
> > > benchmarks don't care. IMO, benchmarking is not enough here. If this
> > > is truly supposed to be unproblematic, then I think we need a reasoned
> > > explanation. I can't imagine how it possibly could be, though.
> > >
> >
> > The basic explanation is that reclaim is on a per-node basis and we
> > no longer balance all zones, just one that is necessary to satisfy the
> > original request that woke up kswapd.
> >
> > > If reclaim can't guarantee a balanced zone utilization then the
> > > allocator has to keep doing it. :(
> >
> > That's the key issue - the main reason balanced zone utilisation is
> > necessary is because we reclaim on a per-zone basis and we must avoid
> > page aging anomalies. If we balance such that one eligible zone is above
> > the watermark then it's less of a concern.
>
> Yes, but only if there can't be extended reclaim stretches that prefer
> the pages of a single zone. Yet it looks like this is still possible.
>
And that is a problem if a workload is dominated by allocations
requiring the lower zones. If that is the common case then it's a bust
and fair zone allocation policy is still required. That removes one
motivation from the series as it leaves some fatness in the page
allocator paths.
> I wonder if that were fixed by dropping patch 7/27?
Potentially yes although it would be preferred to avoid unnecessarily
waking kswapd for a lower zone. That could be enforced by modifying
wake_all_kswapd() to always wake based on the highest available zone in
a pgdat that is below the zone required by the allocation request.
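
The zone selection could look something like the sketch below. It is a userspace illustration under stated assumptions, not the real wake_all_kswapd(): toy_wake_classzone() and the populated[] array are hypothetical, and "below the zone required" is read here as "at or below".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ZONES 4	/* DMA, DMA32, Normal, Movable */

/*
 * Hypothetical helper: return the index of the highest populated zone at or
 * below requested_idx, or -1 if no such zone is populated.
 */
int toy_wake_classzone(const bool populated[TOY_NR_ZONES], int requested_idx)
{
	assert(requested_idx >= 0 && requested_idx < TOY_NR_ZONES);
	for (int i = requested_idx; i >= 0; i--) {
		if (populated[i])
			return i;
	}
	return -1;
}
```

On a node with no Movable zone, a request that allows Movable would wake kswapd for Normal, while a DMA32-limited request would still wake it for DMA32.
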
> Potentially it
> would need a bit more work than that. I.e. could we make kswapd
> balance only for the highest classzone in the system, and thus make
> address-limited allocations fend for themselves in direct reclaim?
>
That would be a side-effect of modifying wake_all_kswapd. Would shoving
that in alleviate your concerns?
--
Mel Gorman
SUSE Labs