From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Marcelo Tosatti <mtosatti@redhat.com>,
Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
Izik Eidus <ieidus@redhat.com>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
Dave Hansen <dave@linux.vnet.ibm.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Christoph Lameter <cl@linux-foundation.org>,
Chris Wright <chrisw@sous-sol.org>,
bpicco@redhat.com,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
Arnd Bergmann <arnd@arndb.de>,
"Michael S. Tsirkin" <mst@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #15
Date: Fri, 26 Mar 2010 17:36:55 +0000
Message-ID: <20100326173655.GC2024@csn.ul.ie>
In-Reply-To: <patchbomb.1269622804@v2.random>

On Fri, Mar 26, 2010 at 06:00:04PM +0100, Andrea Arcangeli wrote:
> Hello,
>
> this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries
> in split_huge_page (at practically zero cost, so I didn't need to add a fake
> feature flag and it's a lot safer to do it this way just in case).
> split_large_page in change_page_attr has the same issue too, but I've no idea
> how to fix it there because the pmd cannot be marked non present at any given
> time as change_page_attr may be running on ram below 640k and that is the same
> pmd where the kernel .text resides. However I doubt it'll ever be a practical
> problem. Other cpus also have a lot of warnings and risks in allowing
> simultaneous TLB entries of different size.
>
> Johannes also sent a cute optimization to split split_huge_page_vma/mm he
> converted those in a single split_huge_page_pmd and in addition he also sent
> native support for hugepages in both mincore and mprotect. Which shows how
> deep he already understands the whole huge_memory.c and its usage in the
> callers. Seeing significant contributions like this I think further confirms
> this is the way to go. Thanks a lot Johannes.
>
> The ability to bisect before the mincore and mprotect native implementations
> is one of the huge benefits of this approach. The hardest of all will be to
> add swap native support to 2M pages later (as it involves to make the
> swapcache 2M capable and that in turn means it explodes more than the rest all
> over the pagecache code) but I think first we've other priorities:
>
> 1) merge memory compaction
Testing V6 at the moment.
> 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> compaction is capable of relocating slab entries in-use (correct me if I'm
> wrong, I think it's impossible as long as the slab entries are mapped by 2M
> pages and not 4k ptes like vmalloc). So the idea is that we should have the
Correct, slab pages currently cannot migrate. Fragmentation within slab
is minimised by anti-fragmentation by distinguishing between reclaimable
and unreclaimable slab and grouping them appropriately. The objective is
to put all the unmovable pages in as few 2M (or 4M or 16M) pages as
possible. If min_free_kbytes is tuned as hugeadm
--recommended-min_free_kbytes suggests, this works pretty well.
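As a rough illustration of the grouping described above, here is a toy
userspace model, not the kernel's implementation. Every name in it
(toy_block, toy_alloc_page, the TOY_* constants) is hypothetical, and the
512-pages-per-block figure assumes 4k base pages and 2M blocks. Each block
is tagged with one mobility type, and an allocation prefers a block already
tagged with its own type before claiming a free one.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGES_PER_BLOCK 512  /* 2M block of 4k pages (assumed sizes) */
#define TOY_NR_BLOCKS 8          /* arbitrary size for the toy zone */

/* Hypothetical mobility types; TOY_FREE marks a block nobody has claimed. */
enum toy_type { TOY_FREE = 0, TOY_UNMOVABLE, TOY_RECLAIMABLE, TOY_MOVABLE };

struct toy_block {
	enum toy_type type; /* mobility type this block is grouped under */
	int used;           /* 4k pages handed out from this block */
};

/* Hand out one 4k page of the given type; returns block index or -1. */
static int toy_alloc_page(struct toy_block *blocks, size_t nr,
			  enum toy_type type)
{
	size_t i;

	assert(type != TOY_FREE);

	/* Prefer a block already grouped under this type that has room. */
	for (i = 0; i < nr; i++) {
		if (blocks[i].type == type &&
		    blocks[i].used < TOY_PAGES_PER_BLOCK) {
			blocks[i].used++;
			return (int)i;
		}
	}
	/* Otherwise claim a wholly free block for this type. */
	for (i = 0; i < nr; i++) {
		if (blocks[i].type == TOY_FREE) {
			blocks[i].type = type;
			blocks[i].used = 1;
			return (int)i;
		}
	}
	return -1; /* a real allocator would fall back or reclaim here */
}

/* Interleave n unmovable and n movable allocations, then count how many
 * blocks ended up holding unmovable pages. */
static int toy_unmovable_blocks_after(int n)
{
	struct toy_block blocks[TOY_NR_BLOCKS] = { { TOY_FREE, 0 } };
	int i, count = 0;

	for (i = 0; i < n; i++) {
		toy_alloc_page(blocks, TOY_NR_BLOCKS, TOY_UNMOVABLE);
		toy_alloc_page(blocks, TOY_NR_BLOCKS, TOY_MOVABLE);
	}
	for (i = 0; i < TOY_NR_BLOCKS; i++)
		if (blocks[i].type == TOY_UNMOVABLE)
			count++;
	return count;
}
```

In this model, 600 unmovable pages interleaved with 600 movable ones end up
in two blocks rather than being spread across the zone, which is the effect
the grouping is after.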
> slab allocate 2M if it fails, 1M if it fails 512k etc... until it fallbacks
> to 4k. Otherwise the slab will fragment the memory badly by allocating with
> alloc_page().
Again, if min_free_kbytes is tuned appropriately, anti-frag should
mitigate most of the fragmentation-related damage.

On the notion of having a 2M front slab allocator, SLUB is not far off
being capable of such a thing but there are risks. If a 2M page is
dedicated to a slab, then other slabs will need their own 2M pages.
Overall memory usage grows and you end up worse off.

If you suggest that slab uses 2M pages and breaks them up for slabs, you
are very close to what anti-frag already does. The difference might be
that slab would guarantee that the 2M page is only used for slab. Again,
you could force this situation with anti-frag but the decision was made
to allow a certain amount of fragmentation to avoid the memory overhead
of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets
much of what you need.

Arguably, min_free_kbytes should be tuned appropriately once it's detected
that huge pages are in use. It would not be hard at all, we just don't do it.
Stronger guarantees on layout are possible but not done today because of
the cost.
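For reference, the descending-order fallback quoted above (try 2M, then 1M,
then 512k, down to 4k) might look roughly like the sketch below. It is a
userspace toy, not SLUB or buddy allocator code. The free-list bitmask and
all toy_* names are assumptions made for illustration, as are the 4k base
page and order-9 huge page sizes.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12      /* 4k base pages (assumption) */
#define TOY_HPAGE_PMD_ORDER 9  /* 2M = 4k << 9 (assumption) */

/* Toy buddy state: bit k of free_mask is set when the order-k free list
 * is non-empty. A request of a given order can be met from that order or
 * by splitting any larger free chunk. */
static int toy_can_alloc(unsigned int free_mask, int order)
{
	assert(order >= 0 && order <= 31);
	return (free_mask >> order) != 0;
}

/* Try 2M first, then halve the request until something fits.
 * Returns the order that was granted, or -1 if even 4k is unavailable. */
static int toy_alloc_descending(unsigned int free_mask)
{
	int order;

	for (order = TOY_HPAGE_PMD_ORDER; order >= 0; order--)
		if (toy_can_alloc(free_mask, order))
			return order;
	return -1;
}

/* Size in bytes of a chunk of the given order. */
static unsigned long toy_order_bytes(int order)
{
	assert(order >= 0 && order <= TOY_HPAGE_PMD_ORDER);
	return 1UL << (TOY_PAGE_SHIFT + order);
}
```

Note that each failed attempt in the loop stands in for a separate trip into
the allocator, which is the repeated-call cost the quoted text goes on to
raise.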
> Basically the buddy allocator will guarantee the slab will
> generate as much fragmentation as possible because it does its best to keep the
> high order pages for who asks for them.
Again, already does this up to a point. rmqueue_fallback() could refuse to
break up small contiguous pages for slab to force better layout in terms of
fragmentation but it costs heavily when memory is low because you now have to
reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.
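To make the cost concrete, here is a deliberately simplified counting model,
not the real fallback path. The function name and its parameters are
hypothetical. It counts how many unmovable 4k requests cannot be met from
free memory, and so would force reclaim or relocation, with and without
permission to take pages from movable blocks.

```c
#include <assert.h>

/* Count requests that cannot be satisfied from free pages.
 * free_unmovable / free_movable: free 4k pages in blocks of each type.
 * allow_fallback: non-zero lets unmovable requests take movable pages. */
static int toy_reclaims_needed(int free_unmovable, int free_movable,
			       int requests, int allow_fallback)
{
	int reclaims = 0;

	assert(free_unmovable >= 0 && free_movable >= 0 && requests >= 0);

	while (requests-- > 0) {
		if (free_unmovable > 0)
			free_unmovable--;   /* served from its own group */
		else if (allow_fallback && free_movable > 0)
			free_movable--;     /* served by breaking into another group */
		else
			reclaims++;         /* nothing usable: reclaim or relocate */
	}
	return reclaims;
}
```

With 2 free unmovable pages, 10 free movable pages and 5 requests, the
permissive policy needs no reclaim, while the strict policy needs 3 even
though free memory is available.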
> Probably the fallback should
> happen inside the buddy allocator instead of calling alloc_pages
> repeatedly, that should avoid taking a flood of locks. Basically
> the buddy should give the worst possible fragmentation effect to users that
> should be relocated, while the other users that cannot be relocated and
> only use 4k pages will better use a front allocator on top of alloc_pages.
> Something like alloc_page_not_relocatable() that will do its stuff
> internally and try to keep those in the same 2M pages.
Sounds very similar to anti-frag again.
> This alone should
> help tremendously and I think it's orthogonal to the memory compaction of
> the relocatable stuff. Or maybe we should just live with a large chunk of
> the memory not being relocatable,
You could force such a situation by always having X number of lower blocks
MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those
areas. You'd need to do some juggling with counters and watermarks. It's not
impossible and I considered doing it when anti-fragmentation was introduced
but again, there was insufficient data to support such a move.
> but I like this idea because it's more
> dynamic and it won't have fixed rule "limit the slab to 0-1g range". And
> it'd tend to try to keep fragmentation down even if we spill over the 1G
> range. (1g is purely made up number)
> 3) teach ksm to merge hugepages. I talked about this with Izik and we agree
> the current ksm tree algorithm will be the best at that compared to ksm
> algorithms.
>
>
> To run KVM on top on this and take advantage of hugepages you need a few liner
> patch I posted to qemu-devel to take care of aligning the start of the guest
> memory so that the guest physical address and host virtual address will have
> the same subpage numbers.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15.gz
>
> It'd be nice to have this merged in -mm.
>
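On the alignment requirement mentioned in the quoted text, the arithmetic
can be sketched as below. This is not the qemu patch itself. It assumes 4k
base pages, 2M huge pages, and that guest physical memory starts at a
2M-aligned guest physical address. The toy_* helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                   /* 4k base pages (assumption) */
#define TOY_HPAGE_SIZE (UINT64_C(1) << 21)  /* 2M huge pages (assumption) */

/* Round addr up to the next multiple of align (align must be a power of 2). */
static uint64_t toy_align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Index (0..511) of the 4k page that addr falls in within its 2M region. */
static unsigned int toy_subpage_nr(uint64_t addr)
{
	return (unsigned int)((addr & (TOY_HPAGE_SIZE - 1)) >> TOY_PAGE_SHIFT);
}
```

If the host virtual base of guest memory is rounded up with toy_align_up(),
then for any offset into guest memory the host virtual address and the guest
physical address land on the same subpage number, so one 2M host mapping can
back one 2M guest mapping.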
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab