From: Andrea Arcangeli <aarcange@redhat.com>
To: Christoph Lameter <cl@linux-foundation.org>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
Izik Eidus <ieidus@redhat.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mel@csn.ul.ie>, Andi Kleen <andi@firstfloor.org>,
Dave Hansen <dave@linux.vnet.ibm.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Chris Wright <chrisw@sous-sol.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 00 of 30] Transparent Hugepage support #3
Date: Tue, 26 Jan 2010 17:11:20 +0100
Message-ID: <20100126161120.GN30452@random.random>
In-Reply-To: <alpine.DEB.2.00.1001260939050.23549@router.home>
On Tue, Jan 26, 2010 at 09:47:51AM -0600, Christoph Lameter wrote:
> I have to disable swap to be able to make use of these huge pages?
No.
> Just because your configuration did not split does not mean that there
> is a guarantee of them not splitting. You need to guarantee that the VM
> does not split them in order to be able to safely refer to them from
> code (like I/O paths).
No. O_DIRECT already works on those pages without splitting them.
There is no need to split them: just run 512 gups like you would be
doing if those weren't hugepages.
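To make the 512 figure concrete: a 2M hugepage on x86-64 covers 2M / 4k = 512 base pages, so a caller that pins page by page issues one gup per 4k subpage. The userspace sketch below only does that arithmetic. The constant and function names are illustrative assumptions, not kernel symbols.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative x86-64 sizes; these names are hypothetical, not kernel macros. */
#define BASE_PAGE_SIZE (4UL * 1024)          /* 4k base page */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* 2M hugepage */

/* Number of 4k subpages that make up one 2M hugepage. */
static unsigned long subpages_per_hugepage(void)
{
    return HUGE_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Number of per-4k gup calls needed to cover the byte range [addr, addr + len). */
static unsigned long gups_needed(unsigned long addr, unsigned long len)
{
    unsigned long first, last;

    if (len == 0)
        return 0;

    /* Precondition: the range must not wrap around the address space. */
    assert(addr <= ULONG_MAX - (len - 1));

    first = addr / BASE_PAGE_SIZE;              /* index of first page touched */
    last = (addr + len - 1) / BASE_PAGE_SIZE;   /* index of last page touched */
    return last - first + 1;
}
```

A buffer that is not 4k aligned touches one extra page, which is why the helper counts page indices rather than dividing the length.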
If your I/O can be interrupted then just use mmu notifier, call
gup_fast, and be notified if anything runs that splits the page.
Splitting the page doesn't mean relocating it, so DMA won't be able to
notice. So if you use mmu notifier just 1 gup + put_page will be
enough, exactly because with mmu notifier you won't need refcounting on
tail pages and head pages at all!
If you don't have a longstanding mapping and a way to synchronously
interrupt the visibility of hugepages from your device, then likely
you work with small dma sizes like storage and networking do, and
a gup on each 4k page will be fine.
> Earlier you stated that reclaim can remove 4k pieces of huge pages after a
> split. How does gup keep the huge pages stable while doing I/O? Does gup
> submit 512 pointers to 4k chunks or 1 pointer to a 2M chunk?
gup works as it does now: you just write code that works today on a
fragmented hugepage, and it'll still work. So you need to run 512 gup_fast
to be sure all 4k fragments are stable. But if you can use mmu
notifier, just one gup_fast(&head_page) followed by put_page(head_page)
will be enough after you're registered.
I'm unsure exactly what you need to do that won't be feasible with mmu
notifier and 1 gup or 512 gup.
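The two strategies can be contrasted with a userspace toy model. This is not kernel code and does not use the real gup_fast or mmu notifier APIs: every type and function name below is hypothetical, and the refcount is a plain int standing in for the kernel's page refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBPAGES 512   /* 4k subpages in one 2M hugepage */

/* Toy stand-in for a page with a refcount; not the kernel's struct page. */
struct toy_page {
    int refcount;
};

struct toy_hugepage {
    struct toy_page sub[SUBPAGES];   /* sub[0] plays the head page */
    bool split;                      /* set once the hugepage is split */
};

/* Hypothetical notifier: invalidate() runs when the toy hugepage is split. */
struct toy_notifier {
    void (*invalidate)(struct toy_notifier *n);
    bool device_mapping_valid;
};

/* Strategy 1: pin every 4k subpage, as a caller without a notifier would. */
static int pin_all_subpages(struct toy_hugepage *hp)
{
    int pins = 0;

    for (size_t i = 0; i < SUBPAGES; i++) {
        hp->sub[i].refcount++;
        pins++;
    }
    return pins;
}

static void unpin_all_subpages(struct toy_hugepage *hp)
{
    for (size_t i = 0; i < SUBPAGES; i++) {
        assert(hp->sub[i].refcount > 0);
        hp->sub[i].refcount--;
    }
}

/*
 * Strategy 2: look up the head page once, mark the device mapping valid,
 * and drop the pin immediately. The notifier, not the refcount, is what
 * keeps the device mapping coherent afterwards.
 */
static struct toy_page *lookup_with_notifier(struct toy_hugepage *hp,
                                             struct toy_notifier *n)
{
    struct toy_page *head = &hp->sub[0];

    head->refcount++;                 /* models the single gup */
    n->device_mapping_valid = true;   /* device now maps the hugepage */
    assert(head->refcount > 0);
    head->refcount--;                 /* models the put_page */
    return head;
}

/* Callback that tears down the modelled device mapping. */
static void toy_invalidate(struct toy_notifier *n)
{
    n->device_mapping_valid = false;
}

/* Splitting the toy hugepage notifies the registered user, if any. */
static void toy_split(struct toy_hugepage *hp, struct toy_notifier *n)
{
    hp->split = true;
    if (n && n->invalidate)
        n->invalidate(n);
}
```

In the model the first strategy leaves 512 elevated refcounts for the lifetime of the mapping, while the second leaves none and learns about the split through the callback.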
> This implementation seems to only address the TLB pressure issue
> but not the scaling issue that arises because we have to handle data in
> 4k chunks (512 4k pointers instead of one 2M pointer). Scaling is not
> addressed because complex fallback logic sabotages a basic benefit of
> huge pages.
Scaling is addressed for everything, including collapsing hugepages
back after swapin if they were fragmented because of it. Furthermore
we want to remove split_huge_page from as many paths as possible, but
Rome wasn't built in a day. We need to stabilize and stress this code
now, then we include it, and extend it to tmpfs and pagecache.
Note a malloc(3G)+memset(3G) takes >5sec with lockdep without
transparent hugepage, or <2sec after "echo always >enabled". TLB
pressure is irrelevant in that workload, which spends all its time
allocating pages and clearing them through the kernel direct
mapping. Your idea that this is only taking care of TLB pressure is
totally wrong and I already posted benchmarks as proof (the difference
becomes extreme the moment you enable lockdep and all the little locks
become more costly, so avoiding 512 page faults and doing a single call
to alloc_pages(order=9) speeds up the workload by more than 100%).
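A minimal userspace sketch of that kind of measurement follows. It is not the benchmark that was posted: the function name is hypothetical, and the caller picks the size (3G in the example above). In mainline kernels the knob is /sys/kernel/mm/transparent_hugepage/enabled, and the path in this patchset revision is assumed to be the same.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Allocate `bytes`, write to every byte so each page is faulted in, and
 * return the elapsed wall-clock time in seconds, or -1.0 if malloc fails.
 * Run it once with transparent hugepage disabled and once after
 * "echo always >enabled" to compare the page fault cost.
 */
static double time_alloc_and_clear(size_t bytes)
{
    struct timespec start, end;
    volatile char sink;
    char *buf;

    assert(bytes > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    buf = malloc(bytes);
    if (!buf)
        return -1.0;
    /* A nonzero fill keeps the compiler from folding this into calloc. */
    memset(buf, 1, bytes);
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Read one byte back to discourage the compiler from eliding the memset. */
    sink = buf[bytes - 1];
    (void)sink;
    free(buf);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Timing malloc and memset together matters because the cost being measured is page faults and page clearing, not TLB misses on already mapped memory.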
> > performance and functionality than what my patch delivers already
> > (ok swapping will be a little more efficient if done through 2M I/O
> > but swap performance isn't so critical). Our objective is to over time
> > eliminate the need of split_huge_page. khugepaged will remain required
>
> Ok then establish some way to make these huge pages stable.
Again: register into mmu notifier, call gup_fast, then put_page, and
you're done. 1 op, and just 3 cachelines for pgd, pud and pmd to get to
the page.
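The three lookups can be illustrated by decomposing a virtual address the way x86-64 4-level paging does when a 2M page is mapped at the pmd level: 9 index bits each for pgd, pud and pmd, and the low 21 bits as the offset inside the hugepage. This is a userspace sketch for user addresses only, and the macro and struct names are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants for x86-64 4-level paging (not kernel macros). */
#define WALK_LEVEL_BITS 9u    /* 512 entries per table */
#define WALK_PMD_SHIFT  21u   /* one pmd entry covers 2M */
#define WALK_PUD_SHIFT  30u   /* one pud entry covers 1G */
#define WALK_PGD_SHIFT  39u   /* one pgd entry covers 512G */
#define WALK_INDEX_MASK ((1u << WALK_LEVEL_BITS) - 1u)

struct walk_indices {
    unsigned int pgd;     /* index into the top-level table */
    unsigned int pud;     /* index into the pud table */
    unsigned int pmd;     /* index into the pmd table; entry maps the 2M page */
    uint64_t offset;      /* byte offset inside the 2M page */
};

/* Split a user virtual address into the three table indices and the offset. */
static struct walk_indices hugepage_walk(uint64_t vaddr)
{
    struct walk_indices w;

    /* Simplification: only lower-half (user) addresses below 2^47. */
    assert((vaddr >> 47) == 0);

    w.pgd = (unsigned int)(vaddr >> WALK_PGD_SHIFT) & WALK_INDEX_MASK;
    w.pud = (unsigned int)(vaddr >> WALK_PUD_SHIFT) & WALK_INDEX_MASK;
    w.pmd = (unsigned int)(vaddr >> WALK_PMD_SHIFT) & WALK_INDEX_MASK;
    w.offset = vaddr & ((UINT64_C(1) << WALK_PMD_SHIFT) - 1);
    return w;
}
```

With 4k pages the walk needs a fourth lookup at the pte level, so stopping at the pmd is one table read fewer per translation.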
> That all depends on what you mean by guarantee I guess.
mmu notifier is a must if the mapping is longstanding or you'll lock
the ram. It's also a lot more efficient than doing 512 gup_fast, which
would achieve the same effect but is evil against the VM (it locks the
user virtual memory in ram) and requires 512 gups instead of just 1.