Re: [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache

From: Johannes Weiner <hannes@cmpxchg.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Michal Hocko <mhocko@kernel.org>, Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache
Date: Fri, 25 Sep 2015 15:09:07 -0400
Message-ID: <20150925190907.GC16359@cmpxchg.org>
In-Reply-To: <1442832762-7247-8-git-send-email-mgorman@techsingularity.net>

On Mon, Sep 21, 2015 at 11:52:39AM +0100, Mel Gorman wrote:
> The zonelist cache (zlc) was introduced to skip over zones that were
> recently known to be full. This avoided expensive operations such as the
> cpuset checks, watermark calculations and zone_reclaim. The situation
> today is different and the complexity of zlc is harder to justify.
> 
> 1) The cpuset checks are no-ops unless a cpuset is active and in general
>    are a lot cheaper.
> 
> 2) zone_reclaim is now disabled by default and I suspect that was a large
>    source of the cost that zlc wanted to avoid. When it is enabled, it's
>    known to be a major source of stalling when nodes fill up and it's
>    unwise to hit every other user with the overhead.
> 
> 3) Watermark checks are expensive to calculate for high-order
>    allocation requests. Later patches in this series will reduce the cost
>    of the watermark checking.
> 
> 4) The most important issue is that in the current implementation it
>    is possible for a failed THP allocation to mark a zone full for order-0
>    allocations and cause a fallback to remote nodes.
> 
> The last issue could be addressed with additional complexity but as the
> benefit of zlc is questionable, it is better to remove it.  If stalls
> due to zone_reclaim are ever reported then an alternative would be to
> introduce deferring logic based on a timeout inside zone_reclaim itself
> and leave the page allocator fast paths alone.
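
To make issue 4 concrete, here is a toy userspace model, not the
removed kernel code. It assumes the cache keeps a single "full" bit
per zone with no record of which order failed. All names (toy_zone,
toy_zlc, toy_alloc) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

#define NR_ZONES 2	/* toy model: zone 0 is local, zone 1 is remote */

struct toy_zone {
	unsigned long free_pages;	/* free base pages in the zone */
	unsigned int max_free_order;	/* largest free contiguous block */
};

struct toy_zlc {
	/* One bit per zone; the order that failed is not remembered. */
	bool fullzones[NR_ZONES];
};

static bool zone_can_satisfy(const struct toy_zone *z, unsigned int order)
{
	return z->free_pages >= (1UL << order) && z->max_free_order >= order;
}

/* Returns the index of the zone that served the request, or -1. */
static int toy_alloc(struct toy_zone *zones, struct toy_zlc *zlc,
		     unsigned int order, bool use_zlc)
{
	for (int i = 0; i < NR_ZONES; i++) {
		if (use_zlc && zlc->fullzones[i])
			continue;	/* skipped without looking at the zone */
		if (zone_can_satisfy(&zones[i], order)) {
			/* Toy bookkeeping: fragmentation is not updated. */
			zones[i].free_pages -= 1UL << order;
			return i;
		}
		if (use_zlc)
			zlc->fullzones[i] = true;	/* full for every order */
	}
	return -1;
}
```

If zone 0 has plenty of free pages but no order-9 block, an order-9
request marks it full and is served from zone 1. A following order-0
request with the cache enabled is then also served from zone 1, while
the same request without the cache is served from zone 0.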
> 
> The impact on page-allocator microbenchmarks is negligible as they don't
> hit the paths where the zlc comes into play. Most page-reclaim related
> workloads showed no noticeable difference as a result of the removal.
> 
> The impact was noticeable in a workload called "stutter". One part uses a
> lot of anonymous memory, a second measures mmap latency and a third copies
> a large file. In an ideal world the latency application would not notice
> the mmap latency.  On a 2-node machine the results of this patch are
> 
> stutter
>                              4.3.0-rc1             4.3.0-rc1
>                               baseline              nozlc-v4
> Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
> 1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
> 2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
> 3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
> Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
> Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
> Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
> Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
> Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
> Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
> Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
> Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
> Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
> Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
> Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
> Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
> Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
> 
> Note that the maximum stall latency went from 24 seconds to 12 which is still
> bad but an improvement.  The mileage varies considerably: a 2-node machine on
> an earlier test went from 494 seconds to 47 seconds and a 4-node machine that
> tested an earlier version of this patch went from a worst case stall time of
> 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable
> as it is hammering the system and the mileage will vary between machines.
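
For anyone checking the table: the percentage column appears to be the
gain relative to the baseline, so a positive value means lower latency
with the patch. A minimal sketch of that arithmetic follows; the helper
name is hypothetical and not part of the benchmark tooling.

```c
/*
 * Percentage gain relative to the baseline run.
 * Positive means the patched kernel had lower latency.
 */
static double latency_gain_pct(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

Feeding in the Max row (24447.6375 and 12915.1356) gives roughly 47.17,
and the Max-99% row (67.3781 and 70.1140) gives roughly -4.06, matching
the figures in the table.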
> 
> There is a secondary impact with potentially more direct reclaim because
> zones are now being considered instead of being skipped by zlc. In this
> particular test run it did not occur so will not be described. However,
> in at least one test the following was observed
> 
> 1. Direct reclaim rates were higher. This was likely due to direct reclaim
>   being entered instead of the zlc disabling a zone and busy looping.
>   Busy looping may have the effect of allowing kswapd to make more
>   progress and in some cases may be better overall. If this is found then
>   the correct action is to put direct reclaimers to sleep on a waitqueue
>   and allow kswapd to make forward progress. Busy looping on the zlc is even
>   worse than when the allocator used to blindly call congestion_wait().
> 
> 2. There was higher swap activity as direct reclaim was active.
> 
> 3. Direct reclaim efficiency was lower. This is related to 1 as more
>   scanning activity also encountered more pages that could not be
>   immediately reclaimed.
> 
> In that case, the direct page scan and reclaim rates are noticeable but
> it is not considered a problem for a few reasons
> 
> 1. The test is primarily concerned with latency. The mmap attempts are also
>    faulted which means there are THP allocation requests. The ZLC could
>    cause zones to be disabled causing the process to busy loop instead
>    of reclaiming.  This looks like elevated direct reclaim activity but
>    it's the correct action to take based on what processes requested.
> 
> 2. The test hammers reclaim and compaction heavily. The number of successful
>    THP faults is highly variable but affects the reclaim stats. It's not a
>    realistic or reasonable measure of page reclaim activity.
> 
> 3. No other page-reclaim intensive workload that was tested showed a problem.
> 
> 4. If a workload is identified that benefitted from the busy looping then it
>    should be fixed by having direct reclaimers sleep on a wait queue until
>    woken by kswapd instead of busy looping. We had this class of problem before
>    when congestion_wait() with a fixed timeout was a brain-damaged decision
>    but happened to benefit some workloads.
> 
> If a workload is identified that relied on the zlc to busy loop then it
> should be fixed correctly and have a direct reclaimer sleep on a waitqueue
> until woken by kswapd.
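
The "sleep until woken by kswapd" idea can be sketched in userspace
with a pthread condition variable standing in for a kernel wait queue.
This is only an analogy under that assumption, not a proposed kernel
change; toy_node, toy_direct_reclaimer_wait and toy_kswapd are
hypothetical names, and the 16-page batch is arbitrary.

```c
#include <pthread.h>

struct toy_node {
	pthread_mutex_t lock;
	pthread_cond_t balanced;	/* stands in for a kernel wait queue */
	unsigned long free_pages;
	unsigned long watermark;
};

static void toy_node_init(struct toy_node *n, unsigned long free_pages,
			  unsigned long watermark)
{
	pthread_mutex_init(&n->lock, NULL);
	pthread_cond_init(&n->balanced, NULL);
	n->free_pages = free_pages;
	n->watermark = watermark;
}

/*
 * Direct reclaimer: sleep instead of busy looping. Returns the number
 * of free pages seen once the node is back above its watermark.
 */
static unsigned long toy_direct_reclaimer_wait(struct toy_node *n)
{
	unsigned long seen;

	pthread_mutex_lock(&n->lock);
	while (n->free_pages < n->watermark)
		pthread_cond_wait(&n->balanced, &n->lock);
	seen = n->free_pages;
	pthread_mutex_unlock(&n->lock);
	return seen;
}

/* Background reclaimer: free pages in batches, then wake all sleepers. */
static void *toy_kswapd(void *arg)
{
	struct toy_node *n = arg;

	pthread_mutex_lock(&n->lock);
	while (n->free_pages < n->watermark)
		n->free_pages += 16;	/* pretend a batch was reclaimed */
	pthread_cond_broadcast(&n->balanced);
	pthread_mutex_unlock(&n->lock);
	return NULL;
}
```

The waiter rechecks the watermark in a loop, so it behaves the same
whether the toy kswapd thread runs before or after it goes to sleep.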
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  74 -----------------
>  mm/page_alloc.c        | 212 -------------------------------------------------
>  2 files changed, 286 deletions(-)

This patch and its results look great!

And I agree, should this affect the balance between kswapd and direct
reclaim, it should be fixed explicitly and not rely on something as
unrelated as the zonelist cache.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

