From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Izik Eidus <ieidus@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Chris Wright <chrisw@sous-sol.org>,
	bpicco@redhat.com,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Arnd Bergmann <arnd@arndb.de>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH 00 of 41] Transparent Hugepage Support #15
Date: Fri, 26 Mar 2010 21:09:23 +0000
Message-ID: <20100326210923.GD2024@csn.ul.ie>
In-Reply-To: <20100326180701.GC5825@random.random>

On Fri, Mar 26, 2010 at 07:07:01PM +0100, Andrea Arcangeli wrote:
> On Fri, Mar 26, 2010 at 05:36:55PM +0000, Mel Gorman wrote:
> > Correct, slab pages currently cannot migrate. Fragmentation within slab
> > is minimised by anti-fragmentation by distinguishing between reclaimable
> > and unreclaimable slab and grouping them appropriately. The objective is
> > to put all the unmovable pages in as few 2M (or 4M or 16M) pages as
> > possible. If min_free_kbytes is tuned as hugeadm
> > --recommended-min_free_kbytes suggests, this works pretty well.
> 
> Awesome. So this feature is already part of your memory compaction
> code?

No, anti-fragmentation has been in the kernel for a long time. hugeadm
(part of libhugetlbfs) has supported --recommended-min_free_kbytes for
some time as well.

> As you may have noticed I didn't start looking deep on your code
> yet.
> 
> > Again, if min_free_kbytes is tuned appropriately, anti-frag should
> > mitigate most of the fragmentation-related damage.
> 
> I don't see the relation of why this logic should be connected to
> min_free_kbytes. Maybe I'll get it if I read the code. But
> min_free_kbytes is about the PF_MEMALLOC pool and GFP_ATOMIC memory. I
> can't see any connection between the min_free_kbytes setting and
> trying to keep all non-relocatable entries in the same HPAGE_PMD_SIZEd
> pages.
> 

Anti-fragmentation groups pages within pageblocks that are the size of
the default huge page. Each pageblock has a migratetype and the free
lists are also split by migratetype. If there isn't a free page of the
appropriate type, rmqueue_fallback() selects an alternative free list to
allocate from. Each one of these "fallback" events potentially makes
fragmentation worse.
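
For illustration only, the fallback behaviour described above can be
modelled with a toy allocator in Python. This is not kernel code: the
class name, the fallback order and the data layout are all assumptions
made for the sketch. It only shows why a fallback puts a page of the
"wrong" type into a pageblock.

```python
from collections import defaultdict

# Hypothetical fallback preference per migratetype (an assumption for
# this sketch; the kernel's real fallback table may differ).
FALLBACKS = {
    "unmovable": ("reclaimable", "movable"),
    "reclaimable": ("unmovable", "movable"),
    "movable": ("reclaimable", "unmovable"),
}


class ToyAllocator:
    """Free lists keyed by migratetype; entries are (pageblock_id, page)."""

    def __init__(self, blocks):
        # blocks maps pageblock_id -> (migratetype, number_of_free_pages)
        self.free = defaultdict(list)
        for block_id, (mtype, nr_free) in blocks.items():
            self.free[mtype].extend((block_id, i) for i in range(nr_free))
        self.fallback_events = 0

    def alloc(self, mtype):
        """Return (pageblock_id, page), or None if nothing is free."""
        if self.free[mtype]:
            return self.free[mtype].pop()
        for other in FALLBACKS[mtype]:
            if self.free[other]:
                # The page comes from a pageblock of another type, so
                # that pageblock now holds mixed migratetypes.
                self.fallback_events += 1
                return self.free[other].pop()
        return None


# Pageblock 0 is unmovable with 2 free pages, pageblock 1 is movable.
toy = ToyAllocator({0: ("unmovable", 2), 1: ("movable", 4)})
pages = [toy.alloc("unmovable") for _ in range(3)]
print(pages, toy.fallback_events)
```

The third unmovable request finds its own list empty and is served from
the movable pageblock, which is the kind of event that keeping more
pages free is meant to make rare.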

Using --recommended-min_free_kbytes keeps a number of pages free such that
these "fallback" events are severely reduced because there is typically a
page free of the appropriate type located in the correct pageblock.

If you are very curious, you can use the mm_page_alloc_extfrag trace
event to monitor fragmentation-related events. Part of the event reports
"fragmenting=", which indicates whether the fallback is severe in terms
of fragmentation or not.
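
As a rough sketch of how such trace output could be summarised, the
Python below counts events and how many of them report fragmenting=1.
The sample line is hypothetical: only the event name and the
"fragmenting=" field come from the text above, and the other fields and
their layout are assumptions that may not match a given kernel.

```python
import re


def parse_extfrag(line):
    """Return the key=value fields of one mm_page_alloc_extfrag line.

    Returns None for lines that belong to other events.
    """
    if "mm_page_alloc_extfrag" not in line:
        return None
    return dict(re.findall(r"(\w+)=(\S+)", line))


def summarize(lines):
    """Return (total_events, events_with_fragmenting_set)."""
    total = severe = 0
    for line in lines:
        fields = parse_extfrag(line)
        if fields is None:
            continue
        total += 1
        if fields.get("fragmenting") == "1":
            severe += 1
    return total, severe


# Hypothetical trace line, for demonstration only.
sample = (
    "  bash-1234  [000]  100.000000: mm_page_alloc_extfrag: "
    "page=0xabc pfn=4096 alloc_order=0 fallback_order=3 "
    "fragmenting=1"
)
print(summarize([sample, "some unrelated trace line"]))
```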

> > On the notion of having a 2M front slab allocator, SLUB is not far off
> > being capable of such a thing but there are risks. If a 2M page is
> > dedicated to a slab, then other slabs will need their own 2M pages.
> > Overall memory usage grows and you end up worse off.
> >
> > If you suggest that slab uses 2M pages and breaks them up for slabs, you
> > are very close to what anti-frag already does. The difference might be
> 
> That's exactly what I meant yes. Doing it per-slab would be useless.
> 
> The idea was for slub to simply call alloc_page_not_relocatable(order)

If you don't specify migratetype-related GFP flags, it's assumed to be
UNMOVABLE.

> instead of alloc_page() every time it allocates an order <=
> HPAGE_PMD_ORDER. That means this 2M page would be shared for _all_
> slabs, otherwise it wouldn't work.
> 

I still think anti-frag is already doing most of what you suggest. Slab
should already be using UNMOVABLE blocks (See /proc/pagetypeinfo for how
the pageblocks are being used).
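
A minimal sketch of reading the per-type pageblock counts follows. It
assumes the file contains a "Number of blocks type" header followed by
one "Node N, zone NAME" row per zone; the sample text and its numbers
are made up, and the exact columns vary between kernel versions.

```python
def parse_block_counts(text):
    """Map (node, zone) -> {migratetype: number_of_pageblocks}."""
    counts = {}
    types = None
    for line in text.splitlines():
        if line.startswith("Number of blocks type"):
            # Column names follow the four header words.
            types = line.split()[4:]
            continue
        if types and line.startswith("Node"):
            head, _, tail = line.partition("zone")
            node = int(head.replace("Node", "").strip(" ,"))
            parts = tail.split()
            zone, values = parts[0], [int(v) for v in parts[1:]]
            counts[(node, zone)] = dict(zip(types, values))
    return counts


# Made-up sample in the assumed layout; on a real system the text would
# come from open("/proc/pagetypeinfo").read().
SAMPLE = """\
Page block order: 9
Pages per block:  512

Number of blocks type     Unmovable  Reclaimable      Movable
Node 0, zone      DMA            1            0            6
Node 0, zone   Normal           40           12          450
"""

counts = parse_block_counts(SAMPLE)
print(counts[(0, "Normal")]["Unmovable"])
```

Watching the Unmovable column over time gives a rough idea of how many
pageblocks slab and other unmovable allocations have spread into.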

> The page freeing could even go back in the buddy initially. So the max
> waste would be 2M per cpu of ram (the front page has to be per-cpu to
> perform).
> 
> > that slab would guarantee that the 2M page is only used for slab. Again,
> > you could force this situation with anti-frag but the decision was made
> > to allow a certain amount of fragmentation to avoid the memory overhead
> > of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets
> > much of what you need.
> 
> Well if this 2M page is shared by other not relocatable entities
> that might be even better in some scenario (maybe worse in others)

The 2M page is today being shared with other unmovable (what you call
not relocatable) pages. The scenario where it potentially gets worse is
where there is a weird mix of pagetable and slab allocations. This will
push up the number of blocks used for unmovable pages to some extent.

> but
> I'm totally fine with a more elaborate approach. Clearly some driver
> could also start to call alloc_pages_not_relocatable() and then it'd
> also share the same memory as slab. I think it has to be a
> universally available feature, just like you implemented. Except right
> now the main problem is slab so that's the first user for sure ;).
> 

Right now, allocations are assumed to be unmovable unless otherwise specified.
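
The default can be sketched as follows. The flag bit values and names
here are illustrative assumptions, not the kernel's real constants; the
point is only that an allocation with no migratetype-related flag ends
up treated as unmovable.

```python
# Illustrative bit values only (assumed for this sketch).
GFP_RECLAIMABLE = 0x1
GFP_MOVABLE = 0x2

MIGRATE_UNMOVABLE = "unmovable"
MIGRATE_RECLAIMABLE = "reclaimable"
MIGRATE_MOVABLE = "movable"


def migratetype_for(gfp_flags):
    """Pick a migratetype from allocation flags, unmovable by default."""
    if gfp_flags & GFP_MOVABLE:
        return MIGRATE_MOVABLE
    if gfp_flags & GFP_RECLAIMABLE:
        return MIGRATE_RECLAIMABLE
    # No migratetype hint was given, so assume the page cannot move.
    return MIGRATE_UNMOVABLE


print(migratetype_for(0))
print(migratetype_for(GFP_MOVABLE))
```

Under this model a dedicated "not relocatable" allocation call would
add nothing, because a plain allocation already lands in the unmovable
group.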

> > Arguably, min_free_kbytes should be tuned appropriately once it's detected
> > that huge pages are in use. It would not be hard at all, we just don't do it.
> > 
> > Stronger guarantees on layout are possible but not done today because of
> > the cost.
> 
> Could you elaborate what "guarantees of layout" means?
> 

The ideal would be that the fewest possible pageblocks are in use and
that each pageblock contains only pages of a single migratetype.

One "guaranteed layout" would be that pageblocks only ever contain pages
of a given type but this would potentially require a full 2M of data to
be relocated or reclaimed to satisfy a new allocation. It would also
cause problems with atomics. It would be great from a fragmentation
perspective but suck otherwise.

> > 
> > >    Basically the buddy allocator will guarantee the slab will
> > >    generate as much fragmentation as possible because it does its best to keep the
> > >    high order pages for who asks for them.
> > 
> > Again, already does this up to a point. rmqueue_fallback() could refuse to
> > break up small contiguous pages for slab to force better layout in terms of
> > fragmentation but it costs heavily when memory is low because you now have to
> > reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.
> 
> I guess this will require a sysfs control.

It would also be a new feature. With memory compaction, the page
allocator will compact memory to satisfy a high-order allocation but it
doesn't compact memory to avoid mixing pageblocks.

> Do you have a
> /sys/kernel/mm/defrag directory or something? If hugepages are
> absolutely mandatory (like with hypervisor-only usage) it is worth
> invoking memory compaction to satisfy what i call "front allocator"
> and give a full 2M page to slab instead of using the already available
> fragment. And to rmqueue-fallback only if defrag fails.
> 

There is a proc entry and a sysfs entry that allow compacting either all
of memory or a single node, but I'd be surprised if either was
required. When a new machine starts up, it should start
direct-compacting memory to get the huge pages it needs.
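
For reference, a sketch of triggering compaction by hand is below. The
paths are an assumption: they are not named in this message and
correspond to the compact_memory sysctl and the per-node "compact"
file as I understand the compaction patches to provide them. Writing
to the real files needs root; the "root" parameter is a hypothetical
hook so the helper can be pointed at a scratch directory.

```python
import os


def compaction_trigger_path(node=None, root="/"):
    """Path of the file that triggers compaction (assumed locations)."""
    if node is None:
        # Compact all of memory.
        return os.path.join(root, "proc/sys/vm/compact_memory")
    # Compact a single NUMA node.
    return os.path.join(root, f"sys/devices/system/node/node{node}/compact")


def trigger_compaction(node=None, root="/"):
    """Write 1 to the trigger file and return the path that was used."""
    path = compaction_trigger_path(node, root)
    with open(path, "w") as f:
        f.write("1\n")
    return path


print(compaction_trigger_path())
print(compaction_trigger_path(node=1))
```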

> > Sounds very similar to anti-frag again.
> 
> Indeed.
> 
> > You could force such a situation by always having X number of lower blocks
> > MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those
> > areas. You'd need to do some juggling with counters and watermarks. It's not
> > impossible and I considered doing it when anti-fragmentation was introduced
> > but again, there was insufficient data to support such a move.
> 
> Agreed. I also like a more dynamic approach, the whole idea of
> transparent hugepage is that the admin does nothing, no reservation,
> and in this case no decision of how much memory to be
> MIGRATE_UNMOVABLE.
> 
> Looking forward to see transparent hugepage taking full advantage of
> your patchset!
> 

Same here.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
