From: Andrea Arcangeli <aarcange@redhat.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Izik Eidus <ieidus@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Chris Wright <chrisw@sous-sol.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	bpicco@redhat.com, Christoph Hellwig <hch@infradead.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Arnd Bergmann <arnd@arndb.de>
Subject: Re: [PATCH 25 of 31] transparent hugepage core
Date: Mon, 1 Feb 2010 14:27:04 +0100
Message-ID: <20100201132704.GG12034@random.random>
In-Reply-To: <20100129152939.GI7139@csn.ul.ie>

On Fri, Jan 29, 2010 at 03:29:39PM +0000, Mel Gorman wrote:
> In fact kswapd will get woken up if you fail a GFP_ATOMIC allocation.
> What I would expect to to happen is the first allocation would fail but
> kswapd would wake up and start reclaiming for order-9 (i.e. the huge page
> size). This might be a more hit-and-miss affair than you'd like though and
> would make performance predictions that bit harder.

Yeah, it turns out the kswapd behavior breaks it. In short, you get
huge swap storms even without __GFP_IO/FS/WAIT, i.e. with no direct
reclaim at all, because kswapd still gets woken. So I had to add this:

Subject: _GFP_NO_KSWAPD

From: Andrea Arcangeli <aarcange@redhat.com>

Transparent hugepage allocations must be allowed not to invoke kswapd or any
other kind of indirect reclaim (especially when the defrag sysfs control is
disabled). It's unacceptable to swap out anonymous pages (potentially
anonymous transparent hugepages) in order to create new transparent hugepages.
This is true for the MADV_HUGEPAGE areas too: it makes no sense to swap out a
kvm virtual machine, making it suffer an unbearable slowdown, just so that
another one with guest physical memory marked MADV_HUGEPAGE can run 30% faster
on memory intensive workloads. If a transparent hugepage allocation fails, the
slowdown is minor and there is total fallback, so kswapd should never be asked
to swap out memory to allow the high order allocation to succeed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -59,13 +59,15 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)0)
 #endif
 
+#define __GFP_NO_KSWAPD	((__force gfp_t)0x400000u)
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 22	/* Room for 22 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1829,7 +1829,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx);
+	if (!(gfp_mask & __GFP_NO_KSWAPD))
+		wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
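
As a side note on the gfp.h hunk: 0x400000 is bit 22, so the flag only
survives __GFP_BITS_MASK once __GFP_BITS_SHIFT is bumped from 22 to 23. A
minimal user-space sketch of that arithmetic and of the new wakeup test
follows. All model_* / MODEL_* names are made up for illustration and are not
kernel symbols; only the 0x400000 value and the two shift values come from the
patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's gfp_t; not the real type. */
typedef uint32_t model_gfp_t;

/* Value taken from the patch: bit 22. */
#define MODEL_GFP_NO_KSWAPD   ((model_gfp_t)0x400000u)
/* Old and new __GFP_BITS_SHIFT values from the patch. */
#define MODEL_BITS_SHIFT_OLD  22u
#define MODEL_BITS_SHIFT_NEW  23u

/* Same formula as __GFP_BITS_MASK: (1 << shift) - 1. */
model_gfp_t model_bits_mask(unsigned int shift)
{
	return (model_gfp_t)((1u << shift) - 1u);
}

/* True if every bit set in flags survives masking with the given shift. */
bool model_flag_fits(model_gfp_t flags, unsigned int shift)
{
	return (flags & model_bits_mask(shift)) == flags;
}

/* Mirrors the condition added in front of wake_all_kswapd(). */
bool model_should_wake_kswapd(model_gfp_t gfp_mask)
{
	return !(gfp_mask & MODEL_GFP_NO_KSWAPD);
}
```

With a shift of 22 the mask is 0x3fffff and drops the new bit; with 23 it is
0x7fffff and keeps it.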



I also added this for safety, because I don't want hugepage allocation
to eat from the reserved pfmemalloc pool:

Subject: don't alloc harder for gfp nomemalloc even if nowait

From: Andrea Arcangeli <aarcange@redhat.com>

Not worth throwing away the precious reserved free memory pool for allocations
that can fail gracefully (either through mempool or because they're transhuge
allocations later falling back to 4k allocations).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1773,7 +1773,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	/*
+	 * Not worth trying to allocate harder for __GFP_NOMEMALLOC
+	 * even if it can't schedule.
+	 */
+	if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) {
 		alloc_flags |= ALLOC_HARDER;
 		/*
 		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.


With these two patches and this:

#define GFP_TRANSHUGE  (__GFP_HARDWALL | __GFP_HIGHMEM |  \
	 __GFP_MOVABLE | __GFP_COMP | __GFP_NOMEMALLOC | \
	 __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD)

static inline struct page *alloc_hugepage(void)
{
	int defrag = transparent_hugepage_defrag();
	return alloc_pages(GFP_TRANSHUGE | (defrag ? __GFP_WAIT : 0),
			   HPAGE_PMD_ORDER);
}
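
To spell out what the two patches plus GFP_TRANSHUGE add up to, here is a
rough user-space model of the decisions, not kernel code. The MODEL_* bit
values are arbitrary placeholders (not the kernel's values) and the model_*
names are hypothetical; the three conditions are the ones visible in the
hunks above, plus the assumption that direct reclaim is only entered when the
allocation is allowed to wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for illustration only; NOT the kernel's flag values. */
enum {
	MODEL_WAIT       = 1u << 0,	/* stands in for __GFP_WAIT */
	MODEL_NOMEMALLOC = 1u << 1,	/* stands in for __GFP_NOMEMALLOC */
	MODEL_NO_KSWAPD  = 1u << 2,	/* stands in for __GFP_NO_KSWAPD */
};

/* What the allocator is allowed to do for a given mask (simplified). */
struct model_alloc_plan {
	bool may_direct_reclaim;	/* assumed: only if the caller can wait */
	bool wakes_kswapd;		/* first patch: skipped with NO_KSWAPD */
	bool alloc_harder;		/* second patch: never with NOMEMALLOC */
};

/* Models GFP_TRANSHUGE | (defrag ? __GFP_WAIT : 0), reclaim-related bits only. */
unsigned int model_transhuge_mask(bool defrag)
{
	unsigned int mask = MODEL_NOMEMALLOC | MODEL_NO_KSWAPD;

	if (defrag)
		mask |= MODEL_WAIT;
	return mask;
}

struct model_alloc_plan model_plan(unsigned int gfp_mask)
{
	bool wait = (gfp_mask & MODEL_WAIT) != 0;
	struct model_alloc_plan plan = {
		.may_direct_reclaim = wait,
		.wakes_kswapd = !(gfp_mask & MODEL_NO_KSWAPD),
		.alloc_harder = !wait && !(gfp_mask & MODEL_NOMEMALLOC),
	};

	return plan;
}
```

In this model, with defrag off the transhuge mask neither waits, nor wakes
kswapd, nor dips into the reserves, so a failed attempt just falls back to 4k
pages; with defrag on, only direct reclaim is added.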


It seems that leaving defrag off by default makes allocation much faster when
there is total fragmentation, as with NOWAIT we won't get into cache
reclaim. I also removed the differentiation between madvise/always in
the "defrag" knob because what is not ok for madvise is also not ok
for full transparency. It's not ok if a VM takes a long time to start up, etc.
If anything, we could have khugepaged default to defrag like in my
previous versions, but I don't want to risk shrinking the cache for no good
reason, so for now they all use the above alloc_hugepage as the main and only
allocation method for transhuge pages. This default now runs smoothly
all the time, with no apparent VM behavior change on my laptop with
"always" enabled (and the apps get hugepages some of the time).

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc6/transparent_hugepage-10
	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc6/transparent_hugepage-10.gz
