linux-mm.kvack.org archive mirror
From: Simon Jeons <simon.jeons@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: dormando <dormando@rydia.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>,
	Seiji Aguchi <seiji.aguchi@hds.com>,
	Satoru Moriya <satoru.moriya@hds.com>,
	Randy Dunlap <rdunlap@xenotime.net>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"lwoodman@redhat.com" <lwoodman@redhat.com>,
	"hughd@google.com" <hughd@google.com>, Mel Gorman <mel@csn.ul.ie>
Subject: Re: [PATCH] add extra free kbytes tunable
Date: Fri, 01 Mar 2013 17:31:47 +0800
Message-ID: <51307583.2020006@gmail.com>
In-Reply-To: <51307354.5000401@gmail.com>

On 03/01/2013 05:22 PM, Simon Jeons wrote:
> Hi Johannes,
>
> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>>> The problem is that adding this tunable will constrain future VM
>>>> implementations.  We will forever need to at least retain the
>>>> pseudo-file.  We will also need to make some effort to retain its
>>>> behaviour.
>>>>
>>>> It would of course be better to fix things so you don't need to tweak
>>>> VM internals to get acceptable behaviour.
>>> I sympathize with this. It's presently all that keeps us afloat though.
>>> I'll whine about it again later if nothing else pans out.
>>>
>>>> You said:
>>>>
>>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>>> : (used by page cache), scattered but frequent random io reads from 12+
>>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>>> : in a few different ways.
>>>> :
>>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>>> : thousand).
>>>> :
>>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>>> : (!) while pausing the system (Argh).
>>>> :
>>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>>> : kernel to allocate massive amounts of memory for retransmission
>>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>>> : often a ton) of new allocations from an application. This paired with 2)
>>>> : can cause the box to stall for 15+ seconds.
>>>>
>>>> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
>>>> go off and free 40G of pagecache.  Do you know what's actually in that
>>>> pagecache?  Large number of small files or small number of (very) large
>>>> files?
>>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>>> accessed via address. occasionally madvise (WILLNEED) applied to the
>>> address ranges before attempting to use them. There're a mix of other
>>> files but nothing significant. The mmap's are READONLY and writes are done
>>> via pwrite-ish functions.
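
For readers unfamiliar with the access pattern described above, here is a minimal sketch of it, not dormando's actual code. It assumes a small scratch file under /tmp standing in for the 200g+ files, and the function name mmap_read_pwrite_demo is made up for illustration. Reads go through a read-only shared mapping with an madvise(MADV_WILLNEED) hint, and writes go through pwrite() on the descriptor.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: write through the descriptor with pwrite() and read
 * the same bytes back through a read-only MAP_SHARED mapping.
 * Returns 0 on success, -1 on any failure.
 */
int mmap_read_pwrite_demo(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";  /* scratch file (assumption) */
    const char msg[] = "hello";
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t len;
    char *map;
    int fd, rc = -1;

    assert(pagesz > 0);
    len = 16 * (size_t)pagesz;              /* tiny stand-in for 200g+ */

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file lives until fd closes */

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    /* Read side: PROT_READ mapping, accessed by address. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Readahead hint, then a descriptor write, then a read via the map. */
    if (madvise(map, len, MADV_WILLNEED) == 0 &&
        pwrite(fd, msg, sizeof(msg), (off_t)pagesz) == (ssize_t)sizeof(msg) &&
        memcmp(map + pagesz, msg, sizeof(msg)) == 0)
        rc = 0;

    munmap(map, len);
    close(fd);
    return rc;
}
```

Every page touched through such a mapping is a mapped file page, which is the kind of page the reclaim discussion further down is about.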
>>>
>>> I could use some guidance on inspecting/tracing the problem. I've been
>>> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
>>>
>>> - The amount of memory freed back up is either a percentage of total
>>> memory or a percentage of free memory. (a machine with 48G of ram will
>>> "only" free up an extra 4-7g)
>>>
>>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>>> is applied with the application down. As it fills it seems to get itself
>>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>>> still apply to a stable instance.
>>>
>>> - Protecting the DMA32 zone with something like "1 1 32" into
>>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>>
>>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>>> hundred thousand pages before finding anything it actually wants to
>>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>>> start. It can take up to 3 seconds before kswapd starts actually
>>> reclaiming pages.
>>>
>>> - So far as I can tell we're almost exclusively using 0 order allocations.
>>> THP is disabled.
>>>
>>> There's not much dirty memory involved. It's not flushing out writes while
>>> reclaiming, it just kills off massive amount of cached memory.
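
The "low vmeff" remark above refers to the %vmeff column of sar -B, which the sysstat documentation describes as pgsteal / pgscan. A small sketch of that ratio follows. The function name is invented here, and returning 0 when nothing was scanned is a convention chosen for this sketch.

```c
#include <assert.h>

/*
 * Reclaim efficiency in percent: pages stolen (reclaimed) per page
 * scanned over the same interval. Hypothetical helper, not sysstat code.
 */
double vmeff_percent(unsigned long pgscan, unsigned long pgsteal)
{
    if (pgscan == 0)
        return 0.0;     /* nothing scanned: report 0 by convention */
    return 100.0 * (double)pgsteal / (double)pgscan;
}
```

For example, scanning 300000 pages and reclaiming only 3000 of them gives a vmeff of 1%, the kind of figure that matches "scan up to a few hundred thousand pages before finding anything it actually wants to reclaim".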
>> Mapped file pages have to get scanned twice before they are reclaimed
>> because we don't have enough usage information after the first scan.
>
> It seems that only VM_EXEC mapped file pages are protected.
> A possible issue in the page reclaim subsystem:
> static inline int page_is_file_cache(struct page *page)
> {
>     return !PageSwapBacked(page);
> }
> AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
> cache and cleared when it is removed from the swap cache. So anonymous
> pages which are reclaimed and added to the swap cache won't have this
> flag, and they will then be treated as

s/are/aren't/

> file-backed pages?  Is that a bug? Also, if __add_to_swap_cache adds the
> page to the radix tree successfully, it increments NR_FILE_PAGES. Why?
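
To make the question concrete, here is a toy user-space model of the helper quoted above. struct toy_page and toy_page_is_file_cache are invented names, and the model only shows how the predicate splits pages. It says nothing about when the kernel actually sets or clears PG_swapbacked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the one flag matters here. */
struct toy_page {
    bool swap_backed;   /* models PG_swapbacked */
};

/* Same logic as the quoted helper: a page without the flag is file cache. */
static inline int toy_page_is_file_cache(const struct toy_page *page)
{
    assert(page);
    return !page->swap_backed;
}
```

Under this predicate, any page that lacks the flag lands on the file LRU side, which is why the moment at which the flag is set matters for anonymous pages.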
>>
>> In your case, when you start this workload after a fresh boot or
>> dropping the caches, there will be 48G of mapped file pages that have
>> never been scanned before and that need to be looked at twice.
>>
>> Unfortunately, if kswapd does not make progress (and it won't for some
>> time at first), it will scan more and more aggressively with
>
> Why does kswapd not make progress for some time at first?
>
>> increasing scan priority.  And when the 48G of pages are finally
>> cycled, kswapd's scan window is a large percentage of your machine's
>> memory, and it will free every single page in it.
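
A rough sketch of how the scan window grows with scan priority. As far as I know, the reclaim code of this era sizes each pass as roughly the LRU list size shifted right by the current priority, starting from DEF_PRIORITY (12) and dropping toward 0 when no progress is made. The function below is a simplified model under that assumption, with 4 KiB pages and a single LRU list holding all 48G. It is not the kernel's get_scan_count().

```c
#include <assert.h>

#define DEF_PRIORITY 12   /* starting scan priority in the kernel */

/*
 * Simplified model: pages considered in one pass over an LRU list of
 * lru_pages pages at the given priority (lower priority value means a
 * more aggressive scan).
 */
unsigned long scan_window(unsigned long lru_pages, int priority)
{
    assert(priority >= 0 && priority <= DEF_PRIORITY);
    return lru_pages >> priority;
}
```

With 48G of 4 KiB pages (12582912 pages), the window is 3072 pages at priority 12, 3145728 pages (12G) at priority 2, and the whole list at priority 0. That is how an unbounded pass can end up freeing a large percentage of memory once the pages finally become reclaimable.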
>>
>> I think we should think about capping kswapd zone reclaim cycles just
>> as we do for direct reclaim.  It's a little ridiculous that it can run
>> unbounded and reclaim every page in a zone without ever checking back
>> against the watermark.  We still increase the scan window evenly when
>> we don't make forward progress, but we are more carefully inching zone
>> levels back toward the watermarks.
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c4883eb..8a4c446 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>           .may_unmap = 1,
>>           .may_swap = 1,
>>           /*
>> -         * kswapd doesn't want to be bailed out while reclaim. because
>> -         * we want to put equal scanning pressure on each zone.
>> +         * Even kswapd zone scans want to be bailed out after
>> +         * reclaiming a good chunk of pages.  It will just
>> +         * come back if the watermarks are still not met.
>>            */
>> -        .nr_to_reclaim = ULONG_MAX,
>> +        .nr_to_reclaim = SWAP_CLUSTER_MAX,
>>           .order = order,
>>           .target_mem_cgroup = NULL,
>>       };
>>
>> -- 
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
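
The effect of the one-line change in the patch above can be illustrated with a toy model. It assumes every page in the scan window is reclaimable and that reclaim proceeds in batches of SWAP_CLUSTER_MAX (32 in the kernel). toy_zone_pass is an invented name and the loop is far simpler than the real shrink code.

```c
#include <assert.h>
#include <limits.h>

#define SWAP_CLUSTER_MAX 32UL   /* kernel's reclaim batch size */

/*
 * Toy model of one zone pass: free pages in batches from a window of
 * reclaimable pages and bail out once nr_to_reclaim pages are freed.
 * Returns the number of pages freed.
 */
unsigned long toy_zone_pass(unsigned long window, unsigned long nr_to_reclaim)
{
    unsigned long freed = 0;

    assert(nr_to_reclaim > 0);
    while (window > 0 && freed < nr_to_reclaim) {
        unsigned long batch =
            window < SWAP_CLUSTER_MAX ? window : SWAP_CLUSTER_MAX;

        freed += batch;
        window -= batch;
    }
    return freed;
}
```

With nr_to_reclaim = ULONG_MAX the pass frees the entire window (1000000 pages in, 1000000 pages out). With nr_to_reclaim = SWAP_CLUSTER_MAX it stops after 32 pages, so the caller gets a chance to check the watermarks before going again.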


