* [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

Hello,

this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries in split_huge_page (at practically zero cost, so I didn't need to add a fake feature flag and it's a lot safer to do it this way just in case). split_large_page in change_page_attr has the same issue too, but I've no idea how to fix it there because the pmd cannot be marked non-present at any given time, as change_page_attr may be running on ram below 640k and that is the same pmd where the kernel .text resides. However I doubt it'll ever be a practical problem. Other CPUs also have a lot of warnings and risks in allowing simultaneous TLB entries of different sizes.

Johannes also sent a cute optimization to split_huge_page_vma/mm: he converted those into a single split_huge_page_pmd, and in addition he also sent native support for hugepages in both mincore and mprotect, which shows how deeply he already understands the whole huge_memory.c and its usage in the callers. Seeing significant contributions like this further confirms, I think, that this is the way to go. Thanks a lot Johannes.

The ability to bisect before the mincore and mprotect native implementations is one of the huge benefits of this approach. The hardest of all will be to add native swap support for 2M pages later (as it involves making the swapcache 2M capable, and that in turn means it explodes all over the pagecache code), but I think first we have other priorities:

1) merge memory compaction
2) write a HPAGE_PMD_ORDER front slab allocator
3) teach ksm to merge hugepages
* [PATCH 01 of 41] define MADV_HUGEPAGE 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli ` (11 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Define MADV_HUGEPAGE. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> --- diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h --- a/arch/alpha/include/asm/mman.h +++ b/arch/alpha/include/asm/mman.h @@ -53,6 +53,8 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h --- a/arch/mips/include/asm/mman.h +++ b/arch/mips/include/asm/mman.h @@ -77,6 +77,8 @@ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ #define MADV_HWPOISON 100 /* poison a page for testing */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h --- a/arch/parisc/include/asm/mman.h +++ b/arch/parisc/include/asm/mman.h @@ -59,6 +59,8 @@ #define MADV_MERGEABLE 65 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 67 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 #define MAP_VARIABLE 0 diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h --- a/arch/xtensa/include/asm/mman.h +++ b/arch/xtensa/include/asm/mman.h @@ -83,6 +83,8 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h --- a/include/asm-generic/mman-common.h +++ b/include/asm-generic/mman-common.h @@ -45,7 +45,7 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ -#define MADV_HUGEPAGE 15 /* Worth backing with hugepages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ /* compatibility flags */ #define MAP_FILE 0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
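For context, a minimal userspace sketch of how the new advice flag would be requested once the header defines it (the fallback define of 14 below is an assumption taken from the generic header in the patch; the constant otherwise comes from <sys/mman.h>):

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value, matching the generic header above */
#endif

#define LEN (64UL * 1024 * 1024)

int main(void)
{
	/* anonymous region that is a natural candidate for 2M backing */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* tell the kernel this range is worth backing with hugepages */
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");	/* e.g. EINVAL without THP */
	return 0;
}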
* [PATCH 02 of 41] compound_lock 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli ` (10 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Add a new compound_lock() needed to serialize put_page against __split_huge_page_refcount(). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> --- diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -13,6 +13,7 @@ #include <linux/debug_locks.h> #include <linux/mm_types.h> #include <linux/range.h> +#include <linux/bit_spinlock.h> struct mempolicy; struct anon_vma; @@ -297,6 +298,20 @@ static inline int is_vmalloc_or_module_a } #endif +static inline void compound_lock(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bit_spin_lock(PG_compound_lock, &page->flags); +#endif +} + +static inline void compound_unlock(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bit_spin_unlock(PG_compound_lock, &page->flags); +#endif +} + static inline struct page *compound_head(struct page *page) { if (unlikely(PageTail(page))) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -108,6 +108,9 @@ enum pageflags { #ifdef CONFIG_MEMORY_FAILURE PG_hwpoison, /* hardware poisoned page. Don't touch */ #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + PG_compound_lock, +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -399,6 +402,12 @@ static inline void __ClearPageTail(struc #define __PG_MLOCKED 0 #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define __PG_COMPOUND_LOCK (1 << PG_compound_lock) +#else +#define __PG_COMPOUND_LOCK 0 +#endif + /* * Flags checked when a page is freed. Pages being freed should not have * these flags set. It they are, there is a problem. @@ -408,7 +417,8 @@ static inline void __ClearPageTail(struc 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON) + 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ + __PG_COMPOUND_LOCK) /* * Flags checked when a page is prepped for return by the page allocator. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
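As a rough sketch of the serialization this lock is meant to provide (the function below is hypothetical and not part of the series; the real user is __split_huge_page_refcount), a reader of the tail-page refcounts would bracket the walk with compound_lock/compound_unlock so a concurrent put_page() cannot race with it:

/* hypothetical illustration only, not the actual splitting code */
static void snapshot_tail_counts(struct page *head, int *counts)
{
	int i;

	/* excludes a concurrent put_page() racing with the refcount walk */
	compound_lock(head);
	for (i = 1; i < (1 << compound_order(head)); i++)
		counts[i] = atomic_read(&(head + i)->_count);
	compound_unlock(head);
}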
* [PATCH 03 of 41] alter compound get_page/put_page 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli ` (9 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Alter compound get_page/put_page to keep references on subpages too, in order to allow __split_huge_page_refcount to split an hugepage even while subpages have been pinned by one of the get_user_pages() variants. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> --- diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c --- a/arch/powerpc/mm/gup.c +++ b/arch/powerpc/mm/gup.c @@ -16,6 +16,16 @@ #ifdef __HAVE_ARCH_PTE_SPECIAL +static inline void pin_huge_page_tail(struct page *page) +{ + /* + * __split_huge_page_refcount() cannot run + * from under us. + */ + VM_BUG_ON(atomic_read(&page->_count) < 0); + atomic_inc(&page->_count); +} + /* * The performance critical leaf functions are made noinline otherwise gcc * inlines everything into a single function which results in too much @@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t put_page(page); return 0; } + if (PageTail(page)) + pin_huge_page_tail(page); pages[*nr] = page; (*nr)++; diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -105,6 +105,16 @@ static inline void get_head_page_multipl atomic_add(nr, &page->_count); } +static inline void pin_huge_page_tail(struct page *page) +{ + /* + * __split_huge_page_refcount() cannot run + * from under us. + */ + VM_BUG_ON(atomic_read(&page->_count) < 0); + atomic_inc(&page->_count); +} + static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; + if (PageTail(page)) + pin_huge_page_tail(page); (*nr)++; page++; refs++; diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -326,9 +326,17 @@ static inline int page_count(struct page static inline void get_page(struct page *page) { - page = compound_head(page); - VM_BUG_ON(atomic_read(&page->_count) == 0); + VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page)); atomic_inc(&page->_count); + if (unlikely(PageTail(page))) { + /* + * This is safe only because + * __split_huge_page_refcount can't run under + * get_page(). 
+ */ + VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0); + atomic_inc(&page->first_page->_count); + } } static inline struct page *virt_to_head_page(const void *x) diff --git a/mm/swap.c b/mm/swap.c --- a/mm/swap.c +++ b/mm/swap.c @@ -55,17 +55,82 @@ static void __page_cache_release(struct del_page_from_lru(zone, page); spin_unlock_irqrestore(&zone->lru_lock, flags); } +} + +static void __put_single_page(struct page *page) +{ + __page_cache_release(page); free_hot_cold_page(page, 0); } +static void __put_compound_page(struct page *page) +{ + compound_page_dtor *dtor; + + __page_cache_release(page); + dtor = get_compound_page_dtor(page); + (*dtor)(page); +} + static void put_compound_page(struct page *page) { - page = compound_head(page); - if (put_page_testzero(page)) { - compound_page_dtor *dtor; - - dtor = get_compound_page_dtor(page); - (*dtor)(page); + if (unlikely(PageTail(page))) { + /* __split_huge_page_refcount can run under us */ + struct page *page_head = page->first_page; + smp_rmb(); + if (likely(PageTail(page) && get_page_unless_zero(page_head))) { + if (unlikely(!PageHead(page_head))) { + /* PageHead is cleared after PageTail */ + smp_rmb(); + VM_BUG_ON(PageTail(page)); + goto out_put_head; + } + /* + * Only run compound_lock on a valid PageHead, + * after having it pinned with + * get_page_unless_zero() above. + */ + smp_mb(); + /* page_head wasn't a dangling pointer */ + compound_lock(page_head); + if (unlikely(!PageTail(page))) { + /* __split_huge_page_refcount run before us */ + compound_unlock(page_head); + VM_BUG_ON(PageHead(page_head)); + out_put_head: + if (put_page_testzero(page_head)) + __put_single_page(page_head); + out_put_single: + if (put_page_testzero(page)) + __put_single_page(page); + return; + } + VM_BUG_ON(page_head != page->first_page); + /* + * We can release the refcount taken by + * get_page_unless_zero now that + * split_huge_page_refcount is blocked on the + * compound_lock. + */ + if (put_page_testzero(page_head)) + VM_BUG_ON(1); + /* __split_huge_page_refcount will wait now */ + VM_BUG_ON(atomic_read(&page->_count) <= 0); + atomic_dec(&page->_count); + VM_BUG_ON(atomic_read(&page_head->_count) <= 0); + compound_unlock(page_head); + if (put_page_testzero(page_head)) + __put_compound_page(page_head); + } else { + /* page_head is a dangling pointer */ + VM_BUG_ON(PageTail(page)); + goto out_put_single; + } + } else if (put_page_testzero(page)) { + if (PageHead(page)) + __put_compound_page(page); + else + __put_single_page(page); } } @@ -74,7 +139,7 @@ void put_page(struct page *page) if (unlikely(PageCompound(page))) put_compound_page(page); else if (put_page_testzero(page)) - __page_cache_release(page); + __put_single_page(page); } EXPORT_SYMBOL(put_page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
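A hedged sketch of the kind of caller this refcounting change protects: a driver-style path (names made up) that pins a user buffer with get_user_pages_fast() and later drops the pins with put_page(). With transparent hugepages, some pages[i] may be tail pages of a 2M page, and the extra tail-page reference taken by gup is what keeps the later put_page() and a concurrent __split_huge_page_refcount() consistent:

/* illustrative only: pin a user buffer, do I/O, then release the pins */
static int demo_pin_and_release(unsigned long uaddr, int nr_pages)
{
	struct page **pages;
	int i, pinned;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* with THP, pages[i] may be a tail page of a hugepage */
	pinned = get_user_pages_fast(uaddr, nr_pages, 1, pages);

	/* ... hand pages[] to the block layer / DMA engine here ... */

	for (i = 0; i < pinned; i++)
		put_page(pages[i]);	/* drops the tail pin and the head pin taken by gup */
	kfree(pages);
	return pinned < 0 ? pinned : 0;
}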
* [PATCH 04 of 41] update futex compound knowledge
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

From: Andrea Arcangeli <aarcange@redhat.com>

Futex code is smarter than most other gup_fast O_DIRECT code and knows about the compound internals. However, now doing a put_page(head_page) will not release the pin on the tail page taken by gup-fast, leading to all sorts of refcounting bugchecks.

Getting a stable head_page is a little tricky. "page_head = page" is there because if this is not a tail page it's also the page_head. Only if this is a tail page is compound_head called; otherwise it's guaranteed unnecessary. And if it's a tail page, compound_head has to run atomically inside the irq-disabled section of __get_user_pages_fast before returning; otherwise ->first_page won't be a stable pointer.

Disabling irqs before __get_user_pages_fast and re-enabling them after running compound_head is needed because if __get_user_pages_fast returns 1, it means the huge pmd is established and cannot go away from under us. pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait for local_irq_enable before the IPI delivery can return. This means __split_huge_page_refcount can't be running from under us, and in turn when we run compound_head(page) we're not reading a dangling pointer from tailpage->first_page.

Then, after we get to a stable head page, we are always safe to call compound_lock, and after taking the compound lock on the head page we can finally re-check whether the page returned by gup-fast is still a tail page, in which case we're set and we didn't need to split the hugepage in order to take a futex on it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -218,7 +218,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page;
+	struct page *page, *page_head;
 	int err;
 
 	/*
@@ -250,10 +250,53 @@ again:
 	if (err < 0)
 		return err;
 
-	page = compound_head(page);
-	lock_page(page);
-	if (!page->mapping) {
-		unlock_page(page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page_head = page;
+	if (unlikely(PageTail(page))) {
+		put_page(page);
+		/* serialize against __split_huge_page_splitting() */
+		local_irq_disable();
+		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+			page_head = compound_head(page);
+			/*
+			 * page_head is valid pointer but we must pin
+			 * it before taking the PG_lock and/or
+			 * PG_compound_lock. The moment we re-enable
+			 * irqs __split_huge_page_splitting() can
+			 * return and the head page can be freed from
+			 * under us. We can't take the PG_lock and/or
+			 * PG_compound_lock on a page that could be
+			 * freed from under us.
+			 */
+			if (page != page_head)
+				get_page(page_head);
+			local_irq_enable();
+		} else {
+			local_irq_enable();
+			goto again;
+		}
+	}
+#else
+	page_head = compound_head(page);
+	if (page != page_head)
+		get_page(page_head);
+#endif
+
+	lock_page(page_head);
+	if (unlikely(page_head != page)) {
+		compound_lock(page_head);
+		if (unlikely(!PageTail(page))) {
+			compound_unlock(page_head);
+			unlock_page(page_head);
+			put_page(page_head);
+			put_page(page);
+			goto again;
+		}
+	}
+	if (!page_head->mapping) {
+		unlock_page(page_head);
+		if (page_head != page)
+			put_page(page_head);
 		put_page(page);
 		goto again;
 	}
@@ -265,19 +308,25 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page_head)) {
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page->mapping->host;
-		key->shared.pgoff = page->index;
+		key->shared.inode = page_head->mapping->host;
+		key->shared.pgoff = page_head->index;
 	}
 
 	get_futex_key_refs(key);
 
-	unlock_page(page);
+	unlock_page(page_head);
+	if (page != page_head) {
+		VM_BUG_ON(!PageTail(page));
+		/* releasing compound_lock after page_lock won't matter */
+		compound_unlock(page_head);
+		put_page(page_head);
+	}
 	put_page(page);
 	return 0;
 }
* [PATCH 05 of 41] fix bad_page to show the real reason the page is bad 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (3 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli ` (7 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> page_count shows the count of the head page, but the actual check is done on the tail page, so show what is really being checked. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5291,7 +5291,7 @@ void dump_page(struct page *page) { printk(KERN_ALERT "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n", - page, page_count(page), page_mapcount(page), + page, atomic_read(&page->_count), page_mapcount(page), page->mapping, page->index); dump_page_flags(page->flags); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 06 of 41] clear compound mapping 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (4 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli ` (6 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Clear compound mapping for anonymous compound pages like it already happens for regular anonymous pages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -629,6 +629,8 @@ static void __free_pages_ok(struct page trace_mm_page_free_direct(page, order); kmemcheck_free_shadow(page, order); + if (PageAnon(page)) + page->mapping = NULL; for (i = 0 ; i < (1 << order) ; ++i) bad += free_pages_check(page + i); if (bad) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 07 of 41] add native_set_pmd_at 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (5 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli ` (5 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Used by paravirt and not paravirt set_pmd_at. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -528,6 +528,12 @@ static inline void native_set_pte_at(str native_set_pte(ptep, pte); } +static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp , pmd_t pmd) +{ + native_set_pmd(pmdp, pmd); +} + #ifndef CONFIG_PARAVIRT /* * Rules for using pte_update - it must be called after any PTE update which -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 08 of 41] add pmd paravirt ops 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (6 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli ` (4 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Paravirt ops pmd_update/pmd_update_defer/pmd_set_at. Not all might be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs pmd_update_defer), but this is to keep full simmetry with pte paravirt ops, which looks cleaner and simpler from a common code POV. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -440,6 +440,11 @@ static inline void pte_update(struct mm_ { PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep); } +static inline void pmd_update(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp); +} static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -447,6 +452,12 @@ static inline void pte_update_defer(stru PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep); } +static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp); +} + static inline pte_t __pte(pteval_t val) { pteval_t ret; @@ -548,6 +559,18 @@ static inline void set_pte_at(struct mm_ PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmd) +{ + if (sizeof(pmdval_t) > sizeof(long)) + /* 5 arg words */ + pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd); + else + PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd); +} +#endif + static inline void set_pmd(pmd_t *pmdp, pmd_t pmd) { pmdval_t val = native_pmd_val(pmd); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -266,10 +266,16 @@ struct pv_mmu_ops { void (*set_pte_at)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval); void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval); + void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmdval); void (*pte_update)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); + void (*pmd_update)(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp); + void (*pmd_update_defer)(struct mm_struct *mm, + unsigned long addr, pmd_t *pmdp); pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); diff 
--git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = { .set_pte = native_set_pte, .set_pte_at = native_set_pte_at, .set_pmd = native_set_pmd, + .set_pmd_at = native_set_pmd_at, .pte_update = paravirt_nop, .pte_update_defer = paravirt_nop, + .pmd_update = paravirt_nop, + .pmd_update_defer = paravirt_nop, .ptep_modify_prot_start = __ptep_modify_prot_start, .ptep_modify_prot_commit = __ptep_modify_prot_commit, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 09 of 41] no paravirt version of pmd ops 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (7 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli ` (3 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> No paravirt version of set_pmd_at/pmd_update/pmd_update_defer. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -33,6 +33,7 @@ extern struct list_head pgd_list; #else /* !CONFIG_PARAVIRT */ #define set_pte(ptep, pte) native_set_pte(ptep, pte) #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) +#define set_pmd_at(mm, addr, pmdp, pmd) native_set_pmd_at(mm, addr, pmdp, pmd) #define set_pte_atomic(ptep, pte) \ native_set_pte_atomic(ptep, pte) @@ -57,6 +58,8 @@ extern struct list_head pgd_list; #define pte_update(mm, addr, ptep) do { } while (0) #define pte_update_defer(mm, addr, ptep) do { } while (0) +#define pmd_update(mm, addr, ptep) do { } while (0) +#define pmd_update_defer(mm, addr, ptep) do { } while (0) #define pgd_val(x) native_pgd_val(x) #define __pgd(x) native_make_pgd(x) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 10 of 41] export maybe_mkwrite
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

From: Andrea Arcangeli <aarcange@redhat.com>

huge_memory.c needs it too, when it falls back to copying hugepages into regular fragmented pages if hugepage allocation fails during COW.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -390,6 +390,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
+ * servicing faults for write access. In the normal case, do always want
+ * pte_mkwrite. But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2031,19 +2031,6 @@ static inline int pte_unmap_same(struct
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
- * servicing faults for write access. In the normal case, do always want
- * pte_mkwrite. But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*
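For reference, the exported helper is used the same way the existing COW fault path uses it; a fallback in huge_memory.c that copies a hugepage into 4k pages would presumably build each pte along these lines (sketch only, not the actual huge_memory.c code):

/* sketch: building a pte for a freshly copied 4k page during fallback COW */
static void demo_set_cow_pte(struct mm_struct *mm, struct vm_area_struct *vma,
			     unsigned long address, pte_t *ptep,
			     struct page *new_page)
{
	pte_t entry = mk_pte(new_page, vma->vm_page_prot);

	/* write-enable only if the vma itself allows writes */
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	set_pte_at(mm, address, ptep, entry);
}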
* [PATCH 11 of 41] comment reminder in destroy_compound_page 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (9 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent on its internal behavior. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -334,6 +334,7 @@ void prep_compound_page(struct page *pag } } +/* update __split_huge_page_refcount if you change this function */ static int destroy_compound_page(struct page *page, unsigned long order) { int i; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 12 of 41] config_transparent_hugepage 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (10 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Add config option. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -287,3 +287,17 @@ config NOMMU_INITIAL_TRIM_EXCESS of 1 says that all excess pages should be trimmed. See Documentation/nommu-mmap.txt for more information. + +config TRANSPARENT_HUGEPAGE + bool "Transparent Hugepage support" if EMBEDDED + depends on X86_64 + default y + help + Transparent Hugepages allows the kernel to use huge pages and + huge tlb transparently to the applications whenever possible. + This feature can improve computing performance to certain + applications by speeding up page faults during memory + allocation, by reducing the number of tlb misses and by speeding + up the pagetable walking. + + If memory constrained on embedded, you may want to say N. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 13 of 41] special pmd_trans_* functions
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

From: Andrea Arcangeli <aarcange@redhat.com>

These return 0 at compile time when the config option is disabled, allowing gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the export of those functions has to be visible to gcc, but they won't be required at link time and huge_memory.o need not be built at all).

_PAGE_BIT_UNUSED1 is never used on pmds, only on ptes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -168,6 +168,19 @@ extern void cleanup_highmap(void);
 #define kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK)
 
 #define __HAVE_ARCH_PTE_SAME
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
 #define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -344,6 +344,11 @@ extern void untrack_pfn_vma(struct vm_ar
 			unsigned long size);
 #endif
 
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_trans_huge(pmd) 0
+#define pmd_trans_splitting(pmd) 0
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_GENERIC_PGTABLE_H */
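A small sketch of how callers are expected to use the new predicates (an illustrative helper, not from the series). With CONFIG_TRANSPARENT_HUGEPAGE disabled both macros are literal 0, so gcc collapses the whole huge-pmd branch, which is the point of the generic fallbacks above:

/* illustrative only: how a pagetable walker would test a pmd */
static int demo_pmd_is_huge(pmd_t *pmd)
{
	if (pmd_trans_huge(*pmd)) {
		if (pmd_trans_splitting(*pmd))
			return -EAGAIN;	/* split in progress: caller waits and retries */
		return 1;		/* handle the pmd as one 2M mapping */
	}
	return 0;			/* regular pmd: descend to the pte level */
}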
* [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 17:00 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

Hello,

this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries in split_huge_page (at practically zero cost, so I didn't need to add a fake feature flag and it's a lot safer to do it this way just in case). split_large_page in change_page_attr has the same issue too, but I've no idea how to fix it there because the pmd cannot be marked non-present at any given time, as change_page_attr may be running on ram below 640k and that is the same pmd where the kernel .text resides. However I doubt it'll ever be a practical problem. Other CPUs also have a lot of warnings and risks in allowing simultaneous TLB entries of different sizes.

Johannes also sent a cute optimization to split_huge_page_vma/mm: he converted those into a single split_huge_page_pmd, and in addition he also sent native support for hugepages in both mincore and mprotect, which shows how deeply he already understands the whole huge_memory.c and its usage in the callers. Seeing significant contributions like this further confirms, I think, that this is the way to go. Thanks a lot Johannes.

The ability to bisect before the mincore and mprotect native implementations is one of the huge benefits of this approach. The hardest of all will be to add native swap support for 2M pages later (as it involves making the swapcache 2M capable, and that in turn means it explodes more than the rest all over the pagecache code), but I think first we have other priorities:

1) Merge memory compaction.

2) Write a HPAGE_PMD_ORDER front slab allocator. I don't think memory compaction is capable of relocating in-use slab entries (correct me if I'm wrong; I think it's impossible as long as the slab entries are mapped by 2M pages and not 4k ptes like vmalloc). So the idea is that we should have the slab allocate 2M, if that fails 1M, if that fails 512k, etc., until it falls back to 4k. Otherwise the slab will fragment the memory badly by allocating with alloc_page(). Basically the buddy allocator will guarantee the slab generates as much fragmentation as possible, because it does its best to keep the high-order pages for whoever asks for them. Probably the fallback should happen inside the buddy allocator instead of calling alloc_pages repeatedly; that should avoid taking a flood of locks. Basically the buddy should give the worst possible fragmentation effect to users that should be relocated, while the other users that cannot be relocated and only use 4k pages will be better off using a front allocator on top of alloc_pages. Something like alloc_page_not_relocatable() that will do its stuff internally and try to keep those in the same 2M pages. This alone should help tremendously, and I think it's orthogonal to the memory compaction of the relocatable stuff. Or maybe we should just live with a large chunk of the memory not being relocatable, but I like this idea because it's more dynamic and it won't have a fixed rule like "limit the slab to the 0-1G range". And it'd tend to try to keep fragmentation down even if we spill over the 1G range. (1G is a purely made-up number.)

3) Teach ksm to merge hugepages. I talked about this with Izik and we agree the current ksm tree algorithm will be the best at that compared to other ksm algorithms.

To run KVM on top of this and take advantage of hugepages you need a few-liner patch I posted to qemu-devel to take care of aligning the start of the guest memory, so that the guest physical address and host virtual address will have the same subpage numbers.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15.gz

It'd be nice to have this merged in -mm.

Thanks,
Andrea
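The qemu change referred to above is essentially an alignment trick; a hedged sketch of the idea (not the actual qemu-devel patch) is to over-allocate guest RAM and round its start up to a 2M boundary, so guest-physical and host-virtual addresses share the same offset within a 2M page:

#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/* illustrative only: allocate guest RAM so that its start is 2M aligned */
static void *alloc_guest_ram(size_t ram_size)
{
	/* over-allocate by one hugepage so the pointer can be rounded up;
	 * the unaligned head and tail are left mapped for brevity */
	uintptr_t p = (uintptr_t)mmap(NULL, ram_size + HPAGE_SIZE,
				      PROT_READ | PROT_WRITE,
				      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == (uintptr_t)MAP_FAILED)
		return NULL;
	return (void *)((p + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
}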
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Mel Gorman @ 2010-03-26 17:36 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 06:00:04PM +0100, Andrea Arcangeli wrote:
> Hello,
>
> this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries
> in split_huge_page (at pratically zero cost, so I didn't need to add a fake
> feature flag and it's a lot safer to do it this way just in case).
> split_large_page in change_page_attr has the same issue too, but I've no idea
> how to fix it there because the pmd cannot be marked non present at any given
> time as change_page_attr may be running on ram below 640k and that is the same
> pmd where the kernel .text resides. However I doubt it'll ever be a practical
> problem. Other cpus also has a lot of warnings and risks in allowing
> simultaneous TLB entries of different size.
>
> Johannes also sent a cute optimization to split split_huge_page_vma/mm he
> converted those in a single split_huge_page_pmd and in addition he also sent
> native support for hugepages in both mincore and mprotect. Which shows how
> deep he already understands the whole huge_memory.c and its usage in the
> callers. Seeing significant contributions like this I think further confirms
> this is the way to go. Thanks a lot Johannes.
>
> The ability to bisect before the mincore and mprotect native implementations
> is one of the huge benefits of this approach. The hardest of all will be to
> add swap native support to 2M pages later (as it involves to make the
> swapcache 2M capable and that in turn means it expodes more than the rest all
> over the pagecache code) but I think first we've other priorities:
>
> 1) merge memory compaction

Testing V6 at the moment.

> 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> compaction is capable of relocating slab entries in-use (correct me if I'm
> wrong, I think it's impossible as long as the slab entries are mapped by 2M
> pages and not 4k ptes like vmalloc). So the idea is that we should have the

Correct, slab pages currently cannot migrate. Fragmentation within slab is minimised by anti-fragmentation by distinguishing between reclaimable and unreclaimable slab and grouping them appropriately. The objective is to put all the unmovable pages in as few 2M (or 4M or 16M) pages as possible. If min_free_kbytes is tuned as hugeadm --recommended-min_free_kbytes suggests, this works pretty well.

> slab allocate 2M if it fails, 1M if it fails 512k etc... until it fallbacks
> to 4k. Otherwise the slab will fragment the memory badly by allocating with
> alloc_page().

Again, if min_free_kbytes is tuned appropriately, anti-frag should mitigate most of the fragmentation-related damage.

On the notion of having a 2M front slab allocator, SLUB is not far off being capable of such a thing but there are risks. If a 2M page is dedicated to a slab, then other slabs will need their own 2M pages. Overall memory usage grows and you end up worse off.

If you suggest that slab uses 2M pages and breaks them up for slabs, you are very close to what anti-frag already does. The difference might be that slab would guarantee that the 2M page is only used for slab. Again, you could force this situation with anti-frag but the decision was made to allow a certain amount of fragmentation to avoid the memory overhead of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets much of what you need.

Arguably, min_free_kbytes should be tuned appropriately once it's detected that huge pages are in use. It would not be hard at all, we just don't do it.

Stronger guarantees on layout are possible but not done today because of the cost.

> Basically the buddy allocator will guarantee the slab will
> generate as much fragement as possible because it does its best to keep the
> high order pages for who asks for them.

Again, already does this up to a point. rmqueue_fallback() could refuse to break up small contiguous pages for slab to force better layout in terms of fragmentation, but it costs heavily when memory is low because you now have to reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.

> Probably the fallback should
> happen inside the buddy allocator instead of calling alloc_pages
> repeteadly, that should avoid taking a flood of locks. Basically
> the buddy should give the worst possible fragmentation effect to users that
> should be relocated, while the other users that cannot be relocated and
> only use 4k pages will better use a front allocator on top of alloc_pages.
> Something like alloc_page_not_relocatable() that will do its stuff
> internally and try to keep those in the same 2M pages.

Sounds very similar to anti-frag again.

> This alone should
> help tremendously and I think it's orthogonal to the memory compaction of
> the relocatable stuff. Or maybe we should just live with a large chunk of
> the memory not being relocatable,

You could force such a situation by always having X number of lower blocks MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those areas. You'd need to do some juggling with counters and watermarks. It's not impossible, and I considered doing it when anti-fragmentation was introduced, but again, there was insufficient data to support such a move.

> but I like this idea because it's more
> dynamic and it won't have fixed rule "limit the slab to 0-1g range". And
> it'd tend to try to keep fragmentation down even if we spill over the 1G
> range. (1g is purely made up number)
> 3) teach ksm to merge hugepages. I talked about this with Izik and we agree
> the current ksm tree algorithm will be the best at that compared to ksm
> algorithms.
>
> To run KVM on top on this and take advantage of hugepages you need a few liner
> patch I posted to qemu-devel to take care of aligning the start of the guest
> memory so that the guest physical address and host virtual address will have
> the same subpage numbers.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15.gz
>
> I'd be nice to have this merged in -mm.

-- 
Mel Gorman
Part-time Phd Student, Linux Technology Center
University of Limerick, IBM Dublin Software Lab
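For illustration of the grouping Mel describes, a tiny sketch (assuming the mainline gfp flags of that era): the presence or absence of __GFP_MOVABLE in the gfp mask is what steers an allocation into movable or unmovable pageblocks:

/* illustrative only: the migratetype hint travels in the gfp mask */
static struct page *demo_alloc(int for_user_data)
{
	if (for_user_data)
		/* __GFP_MOVABLE (part of GFP_HIGHUSER_MOVABLE): grouped into movable pageblocks */
		return alloc_page(GFP_HIGHUSER_MOVABLE);
	/* no movability flag: treated as unmovable and packed into unmovable pageblocks */
	return alloc_page(GFP_KERNEL);
}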
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 18:07 UTC
To: Mel Gorman
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 05:36:55PM +0000, Mel Gorman wrote:
> Correct, slab pages currently cannot migrate. Fragmentation within slab
> is minimised by anti-fragmentation by distinguishing between reclaimable
> and unreclaimable slab and grouping them appropriately. The objective is
> to put all the unmovable pages in as few 2M (or 4M or 16M) pages as
> possible. If min_free_kbytes is tuned as hugeadm
> --recommended-min_free_kbytes suggests, this works pretty well.

Awesome. So this feature is already part of your memory compaction
code? As you may have noticed, I haven't started looking deeply at your
code yet.

> Again, if min_free_kbytes is tuned appropriately, anti-frag should
> mitigate most of the fragmentation-related damage.

I don't see why this logic should be connected to min_free_kbytes.
Maybe I'll get it once I read the code, but min_free_kbytes is about
the PF_MEMALLOC pool and GFP_ATOMIC memory. I can't see any connection
between the min_free_kbytes setting and trying to keep all
non-relocatable entries in the same HPAGE_PMD_SIZEd pages.

> On the notion of having a 2M front slab allocator, SLUB is not far off
> being capable of such a thing but there are risks. If a 2M page is
> dedicated to a slab, then other slabs will need their own 2M pages.
> Overall memory usage grows and you end up worse off.
>
> If you suggest that slab uses 2M pages and breaks them up for slabs, you
> are very close to what anti-frag already does. The difference might be

That's exactly what I meant, yes. Doing it per-slab would be useless.

The idea was for slub to simply call alloc_page_not_relocatable(order)
instead of alloc_page() every time it allocates an order <=
HPAGE_PMD_ORDER. That means this 2M page would be shared for _all_
slabs, otherwise it wouldn't work. The page freeing could even go back
to the buddy initially. So the maximum waste would be 2M of ram per cpu
(the front page has to be per-cpu to perform).

> that slab would guarantee that the 2M page is only used for slab. Again,
> you could force this situation with anti-frag but the decision was made
> to allow a certain amount of fragmentation to avoid the memory overhead
> of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets
> much of what you need.

Well, if this 2M page is shared by other not-relocatable entities that
might be even better in some scenarios (maybe worse in others), but I'm
totally fine with a more elaborate approach. Clearly some driver could
also start to call alloc_pages_not_relocatable() and then it'd also
share the same memory as slab. I think it has to be a universally
available feature, just like you implemented. Except right now the main
problem is slab, so that's the first user for sure ;).

> Arguably, min_free_kbytes should be tuned appropriately once it's detected
> that huge pages are in use.

It would not be hard at all, we just don't do it.

> Stronger guarantees on layout are possible but not done today because of
> the cost.

Could you elaborate on what "guarantees of layout" means?

> > Basically the buddy allocator will guarantee the slab will
> > generate as much fragmentation as possible because it does its best to keep the
> > high order pages for whoever asks for them.
>
> Again, already does this up to a point. rmqueue_fallback() could refuse to
> break up small contiguous pages for slab to force better layout in terms of
> fragmentation but it costs heavily when memory is low because you now have to
> reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.

I guess this will require a sysfs control. Do you have a
/sys/kernel/mm/defrag directory or something? If hugepages are
absolutely mandatory (like with hypervisor-only usage) it is worth
invoking memory compaction to satisfy what I call the "front allocator"
and give a full 2M page to slab instead of using the already available
fragment, and to fall back to rmqueue_fallback() only if defrag fails.

> Sounds very similar to anti-frag again.

Indeed.

> You could force such a situation by always having X number of lower blocks
> MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those
> areas. You'd need to do some juggling with counters and watermarks. It's not
> impossible and I considered doing it when anti-fragmentation was introduced
> but again, there was insufficient data to support such a move.

Agreed. I also like a more dynamic approach: the whole idea of
transparent hugepage is that the admin does nothing, no reservation,
and in this case no decision about how much memory should be
MIGRATE_UNMOVABLE.

Looking forward to seeing transparent hugepage take full advantage of
your patchset!

Thanks,
Andrea
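For illustration, here is a minimal user-space sketch of the "front allocator" idea being proposed above: small not-relocatable allocations are carved out of one shared 2M block, and the generic allocator is only used once that block is exhausted. This is a toy model, not kernel code; alloc_page_not_relocatable() does not exist, and the helper names, the single global block (instead of a per-cpu one), and the chunk sizes below are all made up for the example.

/*
 * Toy model (user space, not the kernel) of a shared 2M "front" block
 * for not-relocatable allocations, with fallback to the generic
 * allocator. All names and sizes here are illustrative assumptions.
 */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

#define FRONT_SIZE (2UL * 1024 * 1024)  /* one 2M "front" block */
#define CHUNK      4096UL               /* hand out 4k-sized pieces */

static unsigned char *front_block;      /* shared by all callers */
static size_t front_used;               /* bytes already handed out */

/* stand-in for the hypothetical alloc_page_not_relocatable() */
static void *alloc_not_relocatable(size_t size)
{
    /* round the request up to whole chunks */
    size = (size + CHUNK - 1) & ~(CHUNK - 1);

    if (!front_block) {
        /* grab one 2M-aligned block up front */
        if (posix_memalign((void **)&front_block, FRONT_SIZE, FRONT_SIZE))
            return NULL;
        front_used = 0;
    }

    if (front_used + size <= FRONT_SIZE) {
        void *p = front_block + front_used;
        front_used += size;
        return p;               /* served from the shared 2M block */
    }

    /* front block exhausted: fall back to the generic allocator */
    return malloc(size);
}

int main(void)
{
    unsigned char *a = alloc_not_relocatable(4096);
    unsigned char *b = alloc_not_relocatable(16384);

    printf("a at offset %zu in the front block\n", (size_t)(a - front_block));
    printf("b at offset %zu in the front block\n", (size_t)(b - front_block));
    return 0;
}

The point of the sketch is only that every not-relocatable user shares the same 2M region, so the worst-case waste is bounded by one front block (per cpu, in the proposal).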
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Mel Gorman @ 2010-03-26 21:09 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 07:07:01PM +0100, Andrea Arcangeli wrote:
> On Fri, Mar 26, 2010 at 05:36:55PM +0000, Mel Gorman wrote:
> > Correct, slab pages currently cannot migrate. Fragmentation within slab
> > is minimised by anti-fragmentation by distinguishing between reclaimable
> > and unreclaimable slab and grouping them appropriately. The objective is
> > to put all the unmovable pages in as few 2M (or 4M or 16M) pages as
> > possible. If min_free_kbytes is tuned as hugeadm
> > --recommended-min_free_kbytes suggests, this works pretty well.
>
> Awesome. So this feature is already part of your memory compaction
> code?

No, anti-fragmentation has been in for a long time. hugeadm (part of
libhugetlbfs) has supported --recommended-min_free_kbytes for some time
as well.

> As you may have noticed, I haven't started looking deeply at your
> code yet.
>
> > Again, if min_free_kbytes is tuned appropriately, anti-frag should
> > mitigate most of the fragmentation-related damage.
>
> I don't see why this logic should be connected to min_free_kbytes.
> Maybe I'll get it once I read the code, but min_free_kbytes is about
> the PF_MEMALLOC pool and GFP_ATOMIC memory. I can't see any connection
> between the min_free_kbytes setting and trying to keep all
> non-relocatable entries in the same HPAGE_PMD_SIZEd pages.
>

Anti-fragmentation groups pages within pageblocks that are the size of
the default huge page. Blocks can have different migratetypes and the
free lists are also based on types. If there isn't a free page of the
appropriate type, rmqueue_fallback() selects an alternative list to
allocate from. Each one of these "fallback" events potentially
increases the level of fragmentation.

Using --recommended-min_free_kbytes keeps enough pages free that these
"fallback" events are severely reduced, because there is typically a
free page of the appropriate type located in the correct pageblock.

If you are very curious, you can use the mm_page_alloc_extfrag trace
event to monitor fragmentation-related events. Part of the event
reports "fragmenting=", which indicates whether the fallback is severe
in terms of fragmentation or not.

> > On the notion of having a 2M front slab allocator, SLUB is not far off
> > being capable of such a thing but there are risks. If a 2M page is
> > dedicated to a slab, then other slabs will need their own 2M pages.
> > Overall memory usage grows and you end up worse off.
> >
> > If you suggest that slab uses 2M pages and breaks them up for slabs, you
> > are very close to what anti-frag already does. The difference might be
>
> That's exactly what I meant, yes. Doing it per-slab would be useless.
>
> The idea was for slub to simply call alloc_page_not_relocatable(order)

If you don't specify migratetype-related GFP flags, it's assumed to be
UNMOVABLE.

> instead of alloc_page() every time it allocates an order <=
> HPAGE_PMD_ORDER. That means this 2M page would be shared for _all_
> slabs, otherwise it wouldn't work.
>

I still think anti-frag is already doing most of what you suggest. Slab
should already be using UNMOVABLE blocks (see /proc/pagetypeinfo for
how the pageblocks are being used).

> The page freeing could even go back to the buddy initially. So the
> maximum waste would be 2M of ram per cpu (the front page has to be
> per-cpu to perform).
>
> > that slab would guarantee that the 2M page is only used for slab. Again,
> > you could force this situation with anti-frag but the decision was made
> > to allow a certain amount of fragmentation to avoid the memory overhead
> > of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets
> > much of what you need.
>
> Well, if this 2M page is shared by other not-relocatable entities that
> might be even better in some scenarios (maybe worse in others)

The 2M page is today being shared with other unmovable (what you call
not relocatable) pages. The scenario where it potentially gets worse is
where there is a weird mix of pagetable and slab allocations. This will
push up the number of blocks used for unmovable pages to some extent.

> but I'm totally fine with a more elaborate approach. Clearly some
> driver could also start to call alloc_pages_not_relocatable() and then
> it'd also share the same memory as slab. I think it has to be a
> universally available feature, just like you implemented. Except right
> now the main problem is slab, so that's the first user for sure ;).
>

Right now, allocations are assumed to be unmovable unless otherwise
specified.

> > Arguably, min_free_kbytes should be tuned appropriately once it's detected
> > that huge pages are in use.
>
> It would not be hard at all, we just don't do it.
>
> > Stronger guarantees on layout are possible but not done today because of
> > the cost.
>
> Could you elaborate on what "guarantees of layout" means?
>

The ideal would be that the fewest number of pageblocks are in use and
each pageblock only contains pages of a specific migratetype. One
"guaranteed layout" would be that pageblocks only ever contain pages of
a given type, but this would potentially require a full 2M of data to
be relocated or reclaimed to satisfy a new allocation. It would also
cause problems with atomic allocations. It would be great from a
fragmentation perspective but would suck otherwise.

> > > Basically the buddy allocator will guarantee the slab will
> > > generate as much fragmentation as possible because it does its best to keep the
> > > high order pages for whoever asks for them.
> >
> > Again, already does this up to a point. rmqueue_fallback() could refuse to
> > break up small contiguous pages for slab to force better layout in terms of
> > fragmentation but it costs heavily when memory is low because you now have to
> > reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.
>
> I guess this will require a sysfs control.

It would also be a new feature. With memory compaction, the page
allocator will compact memory to satisfy a high-order allocation but it
doesn't compact memory to avoid mixing pageblocks.

> Do you have a /sys/kernel/mm/defrag directory or something? If
> hugepages are absolutely mandatory (like with hypervisor-only usage)
> it is worth invoking memory compaction to satisfy what I call the
> "front allocator" and give a full 2M page to slab instead of using the
> already available fragment, and to fall back to rmqueue_fallback()
> only if defrag fails.
>

There is a proc entry and a sysfs entry that allow compacting either
all of memory or on a per-node basis, but I'd be surprised if it was
required. When a new machine starts up, it should start
direct-compacting memory to get the huge pages it needs.

> > Sounds very similar to anti-frag again.
>
> Indeed.
>
> > You could force such a situation by always having X number of lower blocks
> > MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those
> > areas. You'd need to do some juggling with counters and watermarks. It's not
> > impossible and I considered doing it when anti-fragmentation was introduced
> > but again, there was insufficient data to support such a move.
>
> Agreed. I also like a more dynamic approach: the whole idea of
> transparent hugepage is that the admin does nothing, no reservation,
> and in this case no decision about how much memory should be
> MIGRATE_UNMOVABLE.
>
> Looking forward to seeing transparent hugepage take full advantage of
> your patchset!
>

Same here.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
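The grouping Mel describes hinges on deriving a migratetype from the GFP flags of each allocation, with "no mobility flag" meaning unmovable. The following is a simplified, user-space sketch of that mapping, not the kernel's actual code: the flag values are toy stand-ins, and only the three migratetypes discussed in this thread are modelled.

/*
 * Simplified sketch of gfp-flags -> migratetype grouping as described
 * above. Flag values are invented for the example; only the decision
 * logic (unspecified == UNMOVABLE) mirrors the behaviour discussed.
 */
#include <stdio.h>

#define __GFP_RECLAIMABLE 0x01u   /* toy stand-in value */
#define __GFP_MOVABLE     0x02u   /* toy stand-in value */

enum migratetype {
    MIGRATE_UNMOVABLE,    /* e.g. slab, page tables: cannot be relocated */
    MIGRATE_RECLAIMABLE,  /* can be freed under pressure */
    MIGRATE_MOVABLE,      /* can be migrated, e.g. anonymous/page cache */
};

static enum migratetype gfp_to_migratetype(unsigned int gfp_flags)
{
    if (gfp_flags & __GFP_MOVABLE)
        return MIGRATE_MOVABLE;
    if (gfp_flags & __GFP_RECLAIMABLE)
        return MIGRATE_RECLAIMABLE;
    /* no mobility hint given: assume the page can never move */
    return MIGRATE_UNMOVABLE;
}

int main(void)
{
    printf("plain allocation          -> %d (UNMOVABLE)\n",
           gfp_to_migratetype(0));
    printf("__GFP_RECLAIMABLE alloc   -> %d (RECLAIMABLE)\n",
           gfp_to_migratetype(__GFP_RECLAIMABLE));
    printf("__GFP_MOVABLE alloc       -> %d (MOVABLE)\n",
           gfp_to_migratetype(__GFP_MOVABLE));
    return 0;
}

This is why Mel can say slab already lands in unmovable pageblocks without any new alloc_page_not_relocatable() call: an allocation that passes no mobility flag is grouped as unmovable by default.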
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Christoph Lameter @ 2010-03-26 18:00 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, 26 Mar 2010, Andrea Arcangeli wrote:

> 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> compaction is capable of relocating slab entries in-use (correct me if I'm
> wrong, I think it's impossible as long as the slab entries are mapped by 2M

SLUB is capable of using huge pages. Specify slub_min_order=9 on boot
and it will make the kernel use huge pages.

> pages and not 4k ptes like vmalloc). So the idea is that we should have the
> slab allocate 2M, if that fails 1M, if that fails 512k etc... until it falls
> back to 4k. Otherwise the slab will fragment the memory badly by allocating
> with alloc_page(). Basically the buddy allocator will guarantee the slab will
> generate as much fragmentation as possible because it does its best to keep the
> high order pages for whoever asks for them. Probably the fallback should

Fallback is another issue. SLUB can handle various orders of pages in
the same slab cache and already implements fallback to order 0. To
implement a scheme as you suggest here would not require any changes to
data structures but only to the slab allocation functions. See
allocate_slab() in mm/slub.c.
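For readers unfamiliar with the pattern Christoph points at, here is a user-space sketch of the kind of fallback allocate_slab() implements: try the preferred (possibly huge) page order first and, if that fails, accept the minimum order the cache can live with. This is an illustration only, not a copy of mm/slub.c; try_alloc_pages() and allocate_slab_like() are invented names, and the "high orders always fail" rule merely simulates a fragmented system.

/*
 * Sketch of a two-step order fallback: preferred order first, then the
 * minimum acceptable order. Helper names and failure rule are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* pretend allocations above order 2 always fail (fragmented system) */
static void *try_alloc_pages(unsigned int order)
{
    if (order > 2)
        return NULL;
    return malloc(PAGE_SIZE << order);
}

static void *allocate_slab_like(unsigned int preferred_order,
                                unsigned int min_order,
                                unsigned int *got_order)
{
    void *page = try_alloc_pages(preferred_order);

    if (!page && preferred_order > min_order) {
        /* fallback path: accept a smaller contiguous region */
        page = try_alloc_pages(min_order);
        if (page)
            preferred_order = min_order;
    }
    if (page)
        *got_order = preferred_order;
    return page;
}

int main(void)
{
    unsigned int order;
    /* order 9 is 2M with 4k pages, like booting with slub_min_order=9 */
    void *slab = allocate_slab_like(9, 0, &order);

    if (slab)
        printf("got a slab of order %u (%lu bytes)\n",
               order, PAGE_SIZE << order);
    free(slab);
    return 0;
}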
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 18:23 UTC
To: Christoph Lameter
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 01:00:12PM -0500, Christoph Lameter wrote:
> On Fri, 26 Mar 2010, Andrea Arcangeli wrote:
>
> > 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> > compaction is capable of relocating slab entries in-use (correct me if I'm
> > wrong, I think it's impossible as long as the slab entries are mapped by 2M
>
> SLUB is capable of using huge pages. Specify slub_min_order=9 on boot
> and it will make the kernel use huge pages.
>
> > pages and not 4k ptes like vmalloc). So the idea is that we should have the
> > slab allocate 2M, if that fails 1M, if that fails 512k etc... until it falls
> > back to 4k. Otherwise the slab will fragment the memory badly by allocating
> > with alloc_page(). Basically the buddy allocator will guarantee the slab will
> > generate as much fragmentation as possible because it does its best to keep the
> > high order pages for whoever asks for them. Probably the fallback should
>
> Fallback is another issue. SLUB can handle various orders of pages in
> the same slab cache and already implements fallback to order 0. To
> implement a scheme as you suggest here would not require any changes to
> data structures but only to the slab allocation functions. See
> allocate_slab() in mm/slub.c.

Thanks for the information! Luckily it seems Mel has already taken care
of this part in his patchset. But in my view this feature should be
available outside of SLUB/SLAB and potentially available to drivers and
such. SLUB having this embedded is nice to know!!!

BTW, unfortunately according to tons of measurements done so far, SLUB
is too slow on most workstations and small/mid servers (usually single
digit but in some cases even double digit percentage slowdowns
depending on the workload; hackbench tends to stress it the most). It's
a tradeoff between avoiding wasting tons of ram on 1024-way systems and
running fast. Either that or something's wrong with the SLUB
implementation (and I'm talking about 2.6.32, not earlier code). I'd
also like to save memory, so it'd be great if SLUB could be fixed to
perform faster!
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Christoph Lameter @ 2010-03-26 18:44 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, 26 Mar 2010, Andrea Arcangeli wrote:

> BTW, unfortunately according to tons of measurements done so far, SLUB
> is too slow on most workstations and small/mid servers (usually single
> digit but in some cases even double digit percentage slowdowns
> depending on the workload; hackbench tends to stress it the most). It's
> a tradeoff between avoiding wasting tons of ram on 1024-way systems and
> running fast. Either that or something's wrong with the SLUB
> implementation (and I'm talking about 2.6.32, not earlier code). I'd
> also like to save memory, so it'd be great if SLUB could be fixed to
> perform faster!

The SLUB fastpath is the fastest there is. Problems arise because of
locality constraints in SLUB. SLAB can throw gobs of memory at it to
guarantee a high cache hit rate, but to cover all angles on NUMA it has
to throw the gobs multiple times.

The weakness is SLUB's free functions, which free the object directly
to the slab page instead of running through a series of caching
structures. If frees occur to locally dispersed objects then SLUB is at
a disadvantage, since it is hitting cold cache lines for metadata on
free.

On the other hand, SLUB hands out objects in a locality-aware fashion
and not randomly from everywhere like SLAB. This is certainly good to
reduce TLB pressure. Huge pages would accelerate SLUB since more
objects can be served from the same page than before.
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 19:34 UTC
To: Christoph Lameter
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 01:44:23PM -0500, Christoph Lameter wrote:
> TLB pressure. Huge pages would accelerate SLUB since more objects can be
> served from the same page than before.

Agreed. I see it falls back to 4k instead of gradually going down, but
that was my point: doing the fallback by re-entering alloc_pages N
times without internal buddy support would be fairly inefficient. This
is why I suggest this logic be outside of slab/slub; in theory even
slab could be a bit faster thanks to large TLBs on newly allocated slab
objects. I hope Mel's code already takes care of all of this.
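To make the contrast with the two-step fallback sketched earlier concrete, here is a user-space sketch of the gradual fallback Andrea suggests: walk down from the huge order one step at a time (2M, 1M, 512k, ... 4k) and take the first size that succeeds. Again an illustration under invented names, not a kernel API, and the simulated failure threshold is arbitrary.

/*
 * Sketch of a gradual order-by-order fallback rather than jumping
 * straight from the huge order to order 0. Names are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* pretend nothing above order 3 (32k) is available */
static void *try_alloc_pages(unsigned int order)
{
    if (order > 3)
        return NULL;
    return malloc(PAGE_SIZE << order);
}

/* walk down from the huge order one step at a time until something fits */
static void *alloc_front_chunk(unsigned int max_order, unsigned int *got_order)
{
    for (int order = (int)max_order; order >= 0; order--) {
        void *p = try_alloc_pages((unsigned int)order);
        if (p) {
            *got_order = (unsigned int)order;
            return p;
        }
    }
    return NULL;
}

int main(void)
{
    unsigned int order;
    void *chunk = alloc_front_chunk(9, &order);  /* 9 == 2M with 4k pages */

    if (chunk)
        printf("fell back to order %u (%lu bytes)\n",
               order, PAGE_SIZE << order);
    free(chunk);
    return 0;
}

Andrea's point is that doing this loop by repeatedly re-entering the page allocator from the slab side would be inefficient, which is why he wants the logic provided centrally rather than inside each slab allocator.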
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Christoph Lameter @ 2010-03-26 19:55 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, 26 Mar 2010, Andrea Arcangeli wrote:

> On Fri, Mar 26, 2010 at 01:44:23PM -0500, Christoph Lameter wrote:
> > TLB pressure. Huge pages would accelerate SLUB since more objects can be
> > served from the same page than before.
>
> Agreed. I see it falls back to 4k instead of gradually going down, but
> that was my point: doing the fallback by re-entering alloc_pages N
> times without internal buddy support would be fairly inefficient. This
> is why I suggest this logic be outside of slab/slub; in theory even
> slab could be a bit faster thanks to large TLBs on newly allocated slab
> objects. I hope Mel's code already takes care of all of this.

SLAB's queueing system has an inevitable garbling effect on memory
references. The larger the queues, the larger that effect becomes.

We already have internal buddy support in the page allocator. Mel's
defrag approach groups them together.
Thread overview: 23+ messages (newest: 2010-03-26 21:09 UTC)

2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli

-- strict thread matches above, loose matches on Subject: below --

2010-03-26 17:00 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli
2010-03-26 17:36 ` Mel Gorman
2010-03-26 18:07 ` Andrea Arcangeli
2010-03-26 21:09 ` Mel Gorman
2010-03-26 18:00 ` Christoph Lameter
2010-03-26 18:23 ` Andrea Arcangeli
2010-03-26 18:44 ` Christoph Lameter
2010-03-26 19:34 ` Andrea Arcangeli
2010-03-26 19:55 ` Christoph Lameter