From: Michal Hocko <mhocko@kernel.org>
To: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	hannes@cmpxchg.org, riel@surriel.com, mgorman@suse.de,
	willy@infradead.org, akpm@linux-foundation.org
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
Date: Tue, 3 Apr 2018 15:31:15 +0200	[thread overview]
Message-ID: <20180403133115.GA5501@dhcp22.suse.cz> (raw)
In-Reply-To: <1522661062-39745-2-git-send-email-buddy.lumpkin@oracle.com>

On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> Page replacement is handled in the Linux kernel in one of two ways:
> 
> 1) Asynchronously via kswapd
> 2) Synchronously, via direct reclaim
> 
> At page allocation time, the allocating task is immediately given a page
> from the zone free list, allowing it to go right back to whatever it was
> doing, most likely directly or indirectly executing business logic.
> 
> Just prior to satisfying the allocation, the number of free pages is
> checked to see if it has dropped to the zone low watermark, and if so,
> kswapd is woken. Kswapd then starts scanning for inactive pages to evict
> in order to make room for new page allocations. This background work
> allows tasks to continue allocating memory from their respective zone
> free lists without incurring any delay.
> 
> When the demand for free pages exceeds the rate at which kswapd can
> supply them, page allocation works differently. Once the allocating task
> finds that the number of free pages is at or below the zone min
> watermark, it no longer pulls pages from the free list. Instead, the
> task runs the same CPU-bound routines as kswapd to satisfy its own
> allocation by scanning and evicting pages. This is called direct reclaim.
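> 
> As a minimal illustration of the two paths, here is a small user-space
> model (the struct, helper names, and numbers below are made up for the
> example; the real logic lives in mm/page_alloc.c and mm/vmscan.c):
> 
>   #include <stdbool.h>
>   #include <stdio.h>
> 
>   struct zone_model {
>           long nr_free;     /* pages on the zone free list */
>           long wmark_min;   /* at/below this: allocator reclaims itself */
>           long wmark_low;   /* at/below this: kswapd is woken */
>   };
> 
>   /* Background reclaim: kswapd scans and evicts pages asynchronously. */
>   static void wake_kswapd_model(struct zone_model *z)
>   {
>           printf("free=%ld <= low=%ld: wake kswapd\n",
>                  z->nr_free, z->wmark_low);
>   }
> 
>   /* Direct reclaim: the allocating task scans and evicts pages itself. */
>   static void direct_reclaim_model(struct zone_model *z)
>   {
>           printf("free=%ld <= min=%ld: direct reclaim, task stalls\n",
>                  z->nr_free, z->wmark_min);
>           z->nr_free += 32;          /* pretend some pages were evicted */
>   }
> 
>   static bool alloc_page_model(struct zone_model *z)
>   {
>           if (z->nr_free <= z->wmark_min)
>                   direct_reclaim_model(z);  /* synchronous, adds latency */
>           else if (z->nr_free <= z->wmark_low)
>                   wake_kswapd_model(z);     /* async, no delay added here */
> 
>           if (z->nr_free <= 0)
>                   return false;             /* would head towards OOM */
>           z->nr_free--;              /* take a page off the free list */
>           return true;
>   }
> 
>   int main(void)
>   {
>           struct zone_model z = {
>                   .nr_free = 130, .wmark_min = 64, .wmark_low = 128,
>           };
>           for (int i = 0; i < 80; i++)
>                   alloc_page_model(&z);
>           return 0;
>   }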
> 
> The time spent performing a direct reclaim can be substantial, often
> tens to hundreds of milliseconds for small order-0 allocations and half
> a second or more for order-9 huge-page allocations (512 contiguous base
> pages, i.e. a 2MB huge page on x86-64). In fact, kswapd is not strictly
> required on a Linux system; it exists solely to optimize performance by
> preventing direct reclaims.
> 
> When the memory shortfall is large enough to trigger direct reclaims,
> they can occur in any task running on the system. A single aggressive
> memory-allocating task can set the stage for collateral damage to small
> tasks that rarely allocate additional memory. Consider the impact of
> injecting an additional 100ms of latency when nscd allocates memory to
> cache a DNS query.
> 
> Ten years ago, the presence of direct reclaims was a fairly reliable
> indicator that too much was being asked of a Linux system: kswapd was
> likely wasting time scanning pages that were ineligible for eviction,
> and adding RAM or reducing the working set size would usually make the
> problem go away. Since then, hardware has evolved to bring a new
> struggle for kswapd. Storage speeds have increased by orders of
> magnitude while CPU clock speeds have stayed the same or even decreased
> in exchange for more cores per package. This presents a throughput
> problem for a single-threaded kswapd that will only get worse with each
> new generation of hardware.

AFAIR we used to scale the number of kswapd workers many years ago, and
it just turned out not to be all that great. We have had a kswapd reclaim
window for quite some time now, and that allows tuning how proactive
kswapd should be.
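
For reference, a rough sketch of the window I mean (I assume this refers
to the watermark_scale_factor logic; the arithmetic below approximates
__setup_per_zone_wmarks() from memory and the numbers are only examples,
so do not treat it as a quote of the current code):

  #include <stdio.h>

  int main(void)
  {
          /* example zone: 16GB of 4k pages, min watermark of 16384 pages */
          unsigned long managed_pages = 4UL << 20;
          unsigned long wmark_min = 16384;
          unsigned long scale_factor = 10;  /* vm.watermark_scale_factor default */

          /*
           * The gap between the watermarks -- the window kswapd works
           * within -- grows with watermark_scale_factor (units of 0.01%
           * of the zone's managed pages). A larger window wakes kswapd
           * earlier and keeps it reclaiming for longer, so allocations
           * are less likely to ever reach the min watermark.
           */
          unsigned long gap = managed_pages * scale_factor / 10000;
          if (gap < wmark_min / 4)
                  gap = wmark_min / 4;

          printf("low  watermark = %lu pages\n", wmark_min + gap);
          printf("high watermark = %lu pages\n", wmark_min + 2 * gap);
          return 0;
  }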

Also, please note that direct reclaim is a way to throttle overly
aggressive memory consumers. The more we do in the background context,
the easier it becomes for them to allocate faster. So I am not really
sure that more background threads will solve the underlying problem; it
is just a matter of the memory hogs tuning themselves until we end up in
the very same situation, AFAICS. Moreover, the more they allocate, the
less CPU time _other_ (non-allocating) tasks will get.

> Test Details

I will have to study this more to comment.

[...]
> Increasing the number of kswapd threads improved throughput by ~50%
> while kernel-mode CPU utilization decreased or stayed the same, likely
> because fewer tasks were doing page replacement in parallel at any given
> time.

Well, isn't that just an effect of more work being done on behalf of
other workloads that might run alongside your tests (and which do not
really need to allocate a lot of memory)? In other words, how does the
patch behave with non-artificial, mixed workloads?

Please note that I am not saying that we absolutely have to stick with
the current single-thread-per-node implementation, but I would really
like to see more background on why we should allow heavy memory hogs to
allocate faster, or on how to prevent that. I would also be very
interested to see how the number of threads could be scaled based on how
the CPUs are utilized by other workloads.
-- 
Michal Hocko
SUSE Labs
