* [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

Hello,

this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries in split_huge_page (at practically zero cost, so I didn't need to add a fake feature flag and it's a lot safer to do it this way just in case). split_large_page in change_page_attr has the same issue too, but I've no idea how to fix it there because the pmd cannot be marked non-present at any given time, as change_page_attr may be running on ram below 640k and that is the same pmd where the kernel .text resides. However I doubt it'll ever be a practical problem. Other CPUs also have a lot of warnings and risks in allowing simultaneous TLB entries of different sizes.

Johannes also sent a cute optimization to split_huge_page_vma/mm: he converted those into a single split_huge_page_pmd, and in addition he also sent native support for hugepages in both mincore and mprotect, which shows how deeply he already understands the whole huge_memory.c and its usage in the callers. Seeing significant contributions like this further confirms, I think, that this is the way to go. Thanks a lot Johannes.

The ability to bisect before the mincore and mprotect native implementations is one of the huge benefits of this approach. The hardest of all will be to add native swap support for 2M pages later (as it involves making the swapcache 2M capable, and that in turn means it explodes all over the pagecache code), but I think first we have other priorities:

1) merge memory compaction
2) write a HPAGE_PMD_ORDER front slab allocator
3) teach ksm to merge hugepages
* [PATCH 01 of 41] define MADV_HUGEPAGE 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli ` (11 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Define MADV_HUGEPAGE. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> --- diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h --- a/arch/alpha/include/asm/mman.h +++ b/arch/alpha/include/asm/mman.h @@ -53,6 +53,8 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h --- a/arch/mips/include/asm/mman.h +++ b/arch/mips/include/asm/mman.h @@ -77,6 +77,8 @@ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ #define MADV_HWPOISON 100 /* poison a page for testing */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h --- a/arch/parisc/include/asm/mman.h +++ b/arch/parisc/include/asm/mman.h @@ -59,6 +59,8 @@ #define MADV_MERGEABLE 65 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 67 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 #define MAP_VARIABLE 0 diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h --- a/arch/xtensa/include/asm/mman.h +++ b/arch/xtensa/include/asm/mman.h @@ -83,6 +83,8 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h --- a/include/asm-generic/mman-common.h +++ b/include/asm-generic/mman-common.h @@ -45,7 +45,7 @@ #define MADV_MERGEABLE 12 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */ -#define MADV_HUGEPAGE 15 /* Worth backing with hugepages */ +#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ /* compatibility flags */ #define MAP_FILE 0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
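For context, a minimal userspace sketch of how the new advice flag would be requested once the header defines it (the fallback define of 14 below is an assumption taken from the generic header in the patch; the constant otherwise comes from <sys/mman.h>):

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value, matching the generic header above */
#endif

#define LEN (64UL * 1024 * 1024)

int main(void)
{
	/* anonymous region that is a natural candidate for 2M backing */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* tell the kernel this range is worth backing with hugepages */
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");	/* e.g. EINVAL without THP */
	return 0;
}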
* [PATCH 02 of 41] compound_lock 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli ` (10 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Add a new compound_lock() needed to serialize put_page against __split_huge_page_refcount(). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> --- diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -13,6 +13,7 @@ #include <linux/debug_locks.h> #include <linux/mm_types.h> #include <linux/range.h> +#include <linux/bit_spinlock.h> struct mempolicy; struct anon_vma; @@ -297,6 +298,20 @@ static inline int is_vmalloc_or_module_a } #endif +static inline void compound_lock(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bit_spin_lock(PG_compound_lock, &page->flags); +#endif +} + +static inline void compound_unlock(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bit_spin_unlock(PG_compound_lock, &page->flags); +#endif +} + static inline struct page *compound_head(struct page *page) { if (unlikely(PageTail(page))) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -108,6 +108,9 @@ enum pageflags { #ifdef CONFIG_MEMORY_FAILURE PG_hwpoison, /* hardware poisoned page. Don't touch */ #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + PG_compound_lock, +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -399,6 +402,12 @@ static inline void __ClearPageTail(struc #define __PG_MLOCKED 0 #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define __PG_COMPOUND_LOCK (1 << PG_compound_lock) +#else +#define __PG_COMPOUND_LOCK 0 +#endif + /* * Flags checked when a page is freed. Pages being freed should not have * these flags set. It they are, there is a problem. @@ -408,7 +417,8 @@ static inline void __ClearPageTail(struc 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON) + 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ + __PG_COMPOUND_LOCK) /* * Flags checked when a page is prepped for return by the page allocator. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
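As a rough sketch of the serialization this lock is meant to provide (the function below is hypothetical and not part of the series; the real user is __split_huge_page_refcount), a reader of the tail-page refcounts would bracket the walk with compound_lock/compound_unlock so a concurrent put_page() cannot race with it:

/* hypothetical illustration only, not the actual splitting code */
static void snapshot_tail_counts(struct page *head, int *counts)
{
	int i;

	/* excludes a concurrent put_page() racing with the refcount walk */
	compound_lock(head);
	for (i = 1; i < (1 << compound_order(head)); i++)
		counts[i] = atomic_read(&(head + i)->_count);
	compound_unlock(head);
}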
* [PATCH 03 of 41] alter compound get_page/put_page 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli ` (9 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Alter compound get_page/put_page to keep references on subpages too, in order to allow __split_huge_page_refcount to split an hugepage even while subpages have been pinned by one of the get_user_pages() variants. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> --- diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c --- a/arch/powerpc/mm/gup.c +++ b/arch/powerpc/mm/gup.c @@ -16,6 +16,16 @@ #ifdef __HAVE_ARCH_PTE_SPECIAL +static inline void pin_huge_page_tail(struct page *page) +{ + /* + * __split_huge_page_refcount() cannot run + * from under us. + */ + VM_BUG_ON(atomic_read(&page->_count) < 0); + atomic_inc(&page->_count); +} + /* * The performance critical leaf functions are made noinline otherwise gcc * inlines everything into a single function which results in too much @@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t put_page(page); return 0; } + if (PageTail(page)) + pin_huge_page_tail(page); pages[*nr] = page; (*nr)++; diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -105,6 +105,16 @@ static inline void get_head_page_multipl atomic_add(nr, &page->_count); } +static inline void pin_huge_page_tail(struct page *page) +{ + /* + * __split_huge_page_refcount() cannot run + * from under us. + */ + VM_BUG_ON(atomic_read(&page->_count) < 0); + atomic_inc(&page->_count); +} + static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p do { VM_BUG_ON(compound_head(page) != head); pages[*nr] = page; + if (PageTail(page)) + pin_huge_page_tail(page); (*nr)++; page++; refs++; diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -326,9 +326,17 @@ static inline int page_count(struct page static inline void get_page(struct page *page) { - page = compound_head(page); - VM_BUG_ON(atomic_read(&page->_count) == 0); + VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page)); atomic_inc(&page->_count); + if (unlikely(PageTail(page))) { + /* + * This is safe only because + * __split_huge_page_refcount can't run under + * get_page(). 
+ */ + VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0); + atomic_inc(&page->first_page->_count); + } } static inline struct page *virt_to_head_page(const void *x) diff --git a/mm/swap.c b/mm/swap.c --- a/mm/swap.c +++ b/mm/swap.c @@ -55,17 +55,82 @@ static void __page_cache_release(struct del_page_from_lru(zone, page); spin_unlock_irqrestore(&zone->lru_lock, flags); } +} + +static void __put_single_page(struct page *page) +{ + __page_cache_release(page); free_hot_cold_page(page, 0); } +static void __put_compound_page(struct page *page) +{ + compound_page_dtor *dtor; + + __page_cache_release(page); + dtor = get_compound_page_dtor(page); + (*dtor)(page); +} + static void put_compound_page(struct page *page) { - page = compound_head(page); - if (put_page_testzero(page)) { - compound_page_dtor *dtor; - - dtor = get_compound_page_dtor(page); - (*dtor)(page); + if (unlikely(PageTail(page))) { + /* __split_huge_page_refcount can run under us */ + struct page *page_head = page->first_page; + smp_rmb(); + if (likely(PageTail(page) && get_page_unless_zero(page_head))) { + if (unlikely(!PageHead(page_head))) { + /* PageHead is cleared after PageTail */ + smp_rmb(); + VM_BUG_ON(PageTail(page)); + goto out_put_head; + } + /* + * Only run compound_lock on a valid PageHead, + * after having it pinned with + * get_page_unless_zero() above. + */ + smp_mb(); + /* page_head wasn't a dangling pointer */ + compound_lock(page_head); + if (unlikely(!PageTail(page))) { + /* __split_huge_page_refcount run before us */ + compound_unlock(page_head); + VM_BUG_ON(PageHead(page_head)); + out_put_head: + if (put_page_testzero(page_head)) + __put_single_page(page_head); + out_put_single: + if (put_page_testzero(page)) + __put_single_page(page); + return; + } + VM_BUG_ON(page_head != page->first_page); + /* + * We can release the refcount taken by + * get_page_unless_zero now that + * split_huge_page_refcount is blocked on the + * compound_lock. + */ + if (put_page_testzero(page_head)) + VM_BUG_ON(1); + /* __split_huge_page_refcount will wait now */ + VM_BUG_ON(atomic_read(&page->_count) <= 0); + atomic_dec(&page->_count); + VM_BUG_ON(atomic_read(&page_head->_count) <= 0); + compound_unlock(page_head); + if (put_page_testzero(page_head)) + __put_compound_page(page_head); + } else { + /* page_head is a dangling pointer */ + VM_BUG_ON(PageTail(page)); + goto out_put_single; + } + } else if (put_page_testzero(page)) { + if (PageHead(page)) + __put_compound_page(page); + else + __put_single_page(page); } } @@ -74,7 +139,7 @@ void put_page(struct page *page) if (unlikely(PageCompound(page))) put_compound_page(page); else if (put_page_testzero(page)) - __page_cache_release(page); + __put_single_page(page); } EXPORT_SYMBOL(put_page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
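A hedged sketch of the kind of caller this refcounting change protects: a driver-style path (names made up) that pins a user buffer with get_user_pages_fast() and later drops the pins with put_page(). With transparent hugepages, some pages[i] may be tail pages of a 2M page, and the extra tail-page reference taken by gup is what keeps the later put_page() and a concurrent __split_huge_page_refcount() consistent:

/* illustrative only: pin a user buffer, do I/O, then release the pins */
static int demo_pin_and_release(unsigned long uaddr, int nr_pages)
{
	struct page **pages;
	int i, pinned;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* with THP, pages[i] may be a tail page of a hugepage */
	pinned = get_user_pages_fast(uaddr, nr_pages, 1, pages);

	/* ... hand pages[] to the block layer / DMA engine here ... */

	for (i = 0; i < pinned; i++)
		put_page(pages[i]);	/* drops the tail pin and the head pin taken by gup */
	kfree(pages);
	return pinned < 0 ? pinned : 0;
}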
* [PATCH 04 of 41] update futex compound knowledge
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

From: Andrea Arcangeli <aarcange@redhat.com>

Futex code is smarter than most other gup_fast O_DIRECT code and knows about the compound internals. However, now doing a put_page(head_page) will not release the pin on the tail page taken by gup-fast, leading to all sorts of refcounting bugchecks.

Getting a stable head_page is a little tricky. "page_head = page" is there because if this is not a tail page it's also the page_head. Only if this is a tail page is compound_head called; otherwise it's guaranteed unnecessary. And if it's a tail page, compound_head has to run atomically inside the irq-disabled section of __get_user_pages_fast before returning; otherwise ->first_page won't be a stable pointer.

Disabling irqs before __get_user_pages_fast and re-enabling them after running compound_head is needed because if __get_user_pages_fast returns 1, it means the huge pmd is established and cannot go away from under us. pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait for local_irq_enable before the IPI delivery can return. This means __split_huge_page_refcount can't be running from under us, and in turn when we run compound_head(page) we're not reading a dangling pointer from tailpage->first_page.

Then, after we get to a stable head page, we are always safe to call compound_lock, and after taking the compound lock on the head page we can finally re-check whether the page returned by gup-fast is still a tail page, in which case we're set and we didn't need to split the hugepage in order to take a futex on it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -218,7 +218,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page;
+	struct page *page, *page_head;
 	int err;
 
 	/*
@@ -250,10 +250,53 @@ again:
 	if (err < 0)
 		return err;
 
-	page = compound_head(page);
-	lock_page(page);
-	if (!page->mapping) {
-		unlock_page(page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page_head = page;
+	if (unlikely(PageTail(page))) {
+		put_page(page);
+		/* serialize against __split_huge_page_splitting() */
+		local_irq_disable();
+		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+			page_head = compound_head(page);
+			/*
+			 * page_head is valid pointer but we must pin
+			 * it before taking the PG_lock and/or
+			 * PG_compound_lock. The moment we re-enable
+			 * irqs __split_huge_page_splitting() can
+			 * return and the head page can be freed from
+			 * under us. We can't take the PG_lock and/or
+			 * PG_compound_lock on a page that could be
+			 * freed from under us.
+			 */
+			if (page != page_head)
+				get_page(page_head);
+			local_irq_enable();
+		} else {
+			local_irq_enable();
+			goto again;
+		}
+	}
+#else
+	page_head = compound_head(page);
+	if (page != page_head)
+		get_page(page_head);
+#endif
+
+	lock_page(page_head);
+	if (unlikely(page_head != page)) {
+		compound_lock(page_head);
+		if (unlikely(!PageTail(page))) {
+			compound_unlock(page_head);
+			unlock_page(page_head);
+			put_page(page_head);
+			put_page(page);
+			goto again;
+		}
+	}
+	if (!page_head->mapping) {
+		unlock_page(page_head);
+		if (page_head != page)
+			put_page(page_head);
 		put_page(page);
 		goto again;
 	}
@@ -265,19 +308,25 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page_head)) {
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page->mapping->host;
-		key->shared.pgoff = page->index;
+		key->shared.inode = page_head->mapping->host;
+		key->shared.pgoff = page_head->index;
 	}
 
 	get_futex_key_refs(key);
 
-	unlock_page(page);
+	unlock_page(page_head);
+	if (page != page_head) {
+		VM_BUG_ON(!PageTail(page));
+		/* releasing compound_lock after page_lock won't matter */
+		compound_unlock(page_head);
+		put_page(page_head);
+	}
 	put_page(page);
 	return 0;
 }
* [PATCH 05 of 41] fix bad_page to show the real reason the page is bad 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (3 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli ` (7 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> page_count shows the count of the head page, but the actual check is done on the tail page, so show what is really being checked. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5291,7 +5291,7 @@ void dump_page(struct page *page) { printk(KERN_ALERT "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n", - page, page_count(page), page_mapcount(page), + page, atomic_read(&page->_count), page_mapcount(page), page->mapping, page->index); dump_page_flags(page->flags); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 06 of 41] clear compound mapping 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (4 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli ` (6 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Clear compound mapping for anonymous compound pages like it already happens for regular anonymous pages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -629,6 +629,8 @@ static void __free_pages_ok(struct page trace_mm_page_free_direct(page, order); kmemcheck_free_shadow(page, order); + if (PageAnon(page)) + page->mapping = NULL; for (i = 0 ; i < (1 << order) ; ++i) bad += free_pages_check(page + i); if (bad) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 07 of 41] add native_set_pmd_at 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (5 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli ` (5 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Used by paravirt and not paravirt set_pmd_at. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -528,6 +528,12 @@ static inline void native_set_pte_at(str native_set_pte(ptep, pte); } +static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp , pmd_t pmd) +{ + native_set_pmd(pmdp, pmd); +} + #ifndef CONFIG_PARAVIRT /* * Rules for using pte_update - it must be called after any PTE update which -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 08 of 41] add pmd paravirt ops 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (6 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli ` (4 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Paravirt ops pmd_update/pmd_update_defer/pmd_set_at. Not all might be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs pmd_update_defer), but this is to keep full simmetry with pte paravirt ops, which looks cleaner and simpler from a common code POV. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -440,6 +440,11 @@ static inline void pte_update(struct mm_ { PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep); } +static inline void pmd_update(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp); +} static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -447,6 +452,12 @@ static inline void pte_update_defer(stru PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep); } +static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp) +{ + PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp); +} + static inline pte_t __pte(pteval_t val) { pteval_t ret; @@ -548,6 +559,18 @@ static inline void set_pte_at(struct mm_ PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmd) +{ + if (sizeof(pmdval_t) > sizeof(long)) + /* 5 arg words */ + pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd); + else + PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd); +} +#endif + static inline void set_pmd(pmd_t *pmdp, pmd_t pmd) { pmdval_t val = native_pmd_val(pmd); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -266,10 +266,16 @@ struct pv_mmu_ops { void (*set_pte_at)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval); void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval); + void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp, pmd_t pmdval); void (*pte_update)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); + void (*pmd_update)(struct mm_struct *mm, unsigned long addr, + pmd_t *pmdp); + void (*pmd_update_defer)(struct mm_struct *mm, + unsigned long addr, pmd_t *pmdp); pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); diff 
--git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = { .set_pte = native_set_pte, .set_pte_at = native_set_pte_at, .set_pmd = native_set_pmd, + .set_pmd_at = native_set_pmd_at, .pte_update = paravirt_nop, .pte_update_defer = paravirt_nop, + .pmd_update = paravirt_nop, + .pmd_update_defer = paravirt_nop, .ptep_modify_prot_start = __ptep_modify_prot_start, .ptep_modify_prot_commit = __ptep_modify_prot_commit, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 09 of 41] no paravirt version of pmd ops 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (7 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli ` (3 subsequent siblings) 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> No paravirt version of set_pmd_at/pmd_update/pmd_update_defer. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -33,6 +33,7 @@ extern struct list_head pgd_list; #else /* !CONFIG_PARAVIRT */ #define set_pte(ptep, pte) native_set_pte(ptep, pte) #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) +#define set_pmd_at(mm, addr, pmdp, pmd) native_set_pmd_at(mm, addr, pmdp, pmd) #define set_pte_atomic(ptep, pte) \ native_set_pte_atomic(ptep, pte) @@ -57,6 +58,8 @@ extern struct list_head pgd_list; #define pte_update(mm, addr, ptep) do { } while (0) #define pte_update_defer(mm, addr, ptep) do { } while (0) +#define pmd_update(mm, addr, ptep) do { } while (0) +#define pmd_update_defer(mm, addr, ptep) do { } while (0) #define pgd_val(x) native_pgd_val(x) #define __pgd(x) native_make_pgd(x) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 10 of 41] export maybe_mkwrite
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

From: Andrea Arcangeli <aarcange@redhat.com>

huge_memory.c needs it too, when it falls back to copying hugepages into regular fragmented pages if hugepage allocation fails during COW.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -390,6 +390,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
+ * servicing faults for write access. In the normal case, do always want
+ * pte_mkwrite. But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2031,19 +2031,6 @@ static inline int pte_unmap_same(struct
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
- * servicing faults for write access. In the normal case, do always want
- * pte_mkwrite. But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*
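For reference, the exported helper is used the same way the existing COW fault path uses it; a fallback in huge_memory.c that copies a hugepage into 4k pages would presumably build each pte along these lines (sketch only, not the actual huge_memory.c code):

/* sketch: building a pte for a freshly copied 4k page during fallback COW */
static void demo_set_cow_pte(struct mm_struct *mm, struct vm_area_struct *vma,
			     unsigned long address, pte_t *ptep,
			     struct page *new_page)
{
	pte_t entry = mk_pte(new_page, vma->vm_page_prot);

	/* write-enable only if the vma itself allows writes */
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	set_pte_at(mm, address, ptep, entry);
}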
* [PATCH 11 of 41] comment reminder in destroy_compound_page 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (9 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent on its internal behavior. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -334,6 +334,7 @@ void prep_compound_page(struct page *pag } } +/* update __split_huge_page_refcount if you change this function */ static int destroy_compound_page(struct page *page, unsigned long order) { int i; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 12 of 41] config_transparent_hugepage 2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli ` (10 preceding siblings ...) 2010-03-26 16:48 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli @ 2010-03-26 16:48 ` Andrea Arcangeli 2010-03-26 16:48 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli 12 siblings, 0 replies; 23+ messages in thread From: Andrea Arcangeli @ 2010-03-26 16:48 UTC (permalink / raw) To: linux-mm, Andrew Morton Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner From: Andrea Arcangeli <aarcange@redhat.com> Add config option. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> --- diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -287,3 +287,17 @@ config NOMMU_INITIAL_TRIM_EXCESS of 1 says that all excess pages should be trimmed. See Documentation/nommu-mmap.txt for more information. + +config TRANSPARENT_HUGEPAGE + bool "Transparent Hugepage support" if EMBEDDED + depends on X86_64 + default y + help + Transparent Hugepages allows the kernel to use huge pages and + huge tlb transparently to the applications whenever possible. + This feature can improve computing performance to certain + applications by speeding up page faults during memory + allocation, by reducing the number of tlb misses and by speeding + up the pagetable walking. + + If memory constrained on embedded, you may want to say N. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 13 of 41] special pmd_trans_* functions
From: Andrea Arcangeli @ 2010-03-26 16:48 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

From: Andrea Arcangeli <aarcange@redhat.com>

These return 0 at compile time when the config option is disabled, allowing gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the export of those functions has to be visible to gcc, but they won't be required at link time and huge_memory.o need not be built at all).

_PAGE_BIT_UNUSED1 is never used on pmds, only on ptes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -168,6 +168,19 @@ extern void cleanup_highmap(void);
 #define kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK)
 
 #define __HAVE_ARCH_PTE_SAME
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
 #define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -344,6 +344,11 @@ extern void untrack_pfn_vma(struct vm_ar
 			unsigned long size);
 #endif
 
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_trans_huge(pmd) 0
+#define pmd_trans_splitting(pmd) 0
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_GENERIC_PGTABLE_H */
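A small sketch of how callers are expected to use the new predicates (an illustrative helper, not from the series). With CONFIG_TRANSPARENT_HUGEPAGE disabled both macros are literal 0, so gcc collapses the whole huge-pmd branch, which is the point of the generic fallbacks above:

/* illustrative only: how a pagetable walker would test a pmd */
static int demo_pmd_is_huge(pmd_t *pmd)
{
	if (pmd_trans_huge(*pmd)) {
		if (pmd_trans_splitting(*pmd))
			return -EAGAIN;	/* split in progress: caller waits and retries */
		return 1;		/* handle the pmd as one 2M mapping */
	}
	return 0;			/* regular pmd: descend to the pte level */
}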
* [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 17:00 UTC
To: linux-mm, Andrew Morton
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

Hello,

this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries in split_huge_page (at practically zero cost, so I didn't need to add a fake feature flag and it's a lot safer to do it this way just in case). split_large_page in change_page_attr has the same issue too, but I've no idea how to fix it there because the pmd cannot be marked non-present at any given time, as change_page_attr may be running on ram below 640k and that is the same pmd where the kernel .text resides. However I doubt it'll ever be a practical problem. Other CPUs also have a lot of warnings and risks in allowing simultaneous TLB entries of different sizes.

Johannes also sent a cute optimization to split_huge_page_vma/mm: he converted those into a single split_huge_page_pmd, and in addition he also sent native support for hugepages in both mincore and mprotect, which shows how deeply he already understands the whole huge_memory.c and its usage in the callers. Seeing significant contributions like this further confirms, I think, that this is the way to go. Thanks a lot Johannes.

The ability to bisect before the mincore and mprotect native implementations is one of the huge benefits of this approach. The hardest of all will be to add native swap support for 2M pages later (as it involves making the swapcache 2M capable, and that in turn means it explodes more than the rest all over the pagecache code), but I think first we have other priorities:

1) Merge memory compaction.

2) Write a HPAGE_PMD_ORDER front slab allocator. I don't think memory compaction is capable of relocating in-use slab entries (correct me if I'm wrong; I think it's impossible as long as the slab entries are mapped by 2M pages and not 4k ptes like vmalloc). So the idea is that we should have the slab allocate 2M, if that fails 1M, if that fails 512k, etc., until it falls back to 4k. Otherwise the slab will fragment the memory badly by allocating with alloc_page(). Basically the buddy allocator will guarantee the slab generates as much fragmentation as possible, because it does its best to keep the high-order pages for whoever asks for them. Probably the fallback should happen inside the buddy allocator instead of calling alloc_pages repeatedly; that should avoid taking a flood of locks. Basically the buddy should give the worst possible fragmentation effect to users that should be relocated, while the other users that cannot be relocated and only use 4k pages will be better off using a front allocator on top of alloc_pages. Something like alloc_page_not_relocatable() that will do its stuff internally and try to keep those in the same 2M pages. This alone should help tremendously, and I think it's orthogonal to the memory compaction of the relocatable stuff. Or maybe we should just live with a large chunk of the memory not being relocatable, but I like this idea because it's more dynamic and it won't have a fixed rule like "limit the slab to the 0-1G range". And it'd tend to try to keep fragmentation down even if we spill over the 1G range. (1G is a purely made-up number.)

3) Teach ksm to merge hugepages. I talked about this with Izik and we agree the current ksm tree algorithm will be the best at that compared to other ksm algorithms.

To run KVM on top of this and take advantage of hugepages you need a few-liner patch I posted to qemu-devel to take care of aligning the start of the guest memory, so that the guest physical address and host virtual address will have the same subpage numbers.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15.gz

It'd be nice to have this merged in -mm.

Thanks,
Andrea
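The qemu change referred to above is essentially an alignment trick; a hedged sketch of the idea (not the actual qemu-devel patch) is to over-allocate guest RAM and round its start up to a 2M boundary, so guest-physical and host-virtual addresses share the same offset within a 2M page:

#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/* illustrative only: allocate guest RAM so that its start is 2M aligned */
static void *alloc_guest_ram(size_t ram_size)
{
	/* over-allocate by one hugepage so the pointer can be rounded up;
	 * the unaligned head and tail are left mapped for brevity */
	uintptr_t p = (uintptr_t)mmap(NULL, ram_size + HPAGE_SIZE,
				      PROT_READ | PROT_WRITE,
				      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == (uintptr_t)MAP_FAILED)
		return NULL;
	return (void *)((p + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
}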
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Mel Gorman @ 2010-03-26 17:36 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 06:00:04PM +0100, Andrea Arcangeli wrote:
> Hello,
>
> this fixes a potential issue with regard to simultaneous 4k and 2M TLB entries
> in split_huge_page (at pratically zero cost, so I didn't need to add a fake
> feature flag and it's a lot safer to do it this way just in case).
> split_large_page in change_page_attr has the same issue too, but I've no idea
> how to fix it there because the pmd cannot be marked non present at any given
> time as change_page_attr may be running on ram below 640k and that is the same
> pmd where the kernel .text resides. However I doubt it'll ever be a practical
> problem. Other cpus also has a lot of warnings and risks in allowing
> simultaneous TLB entries of different size.
>
> Johannes also sent a cute optimization to split split_huge_page_vma/mm he
> converted those in a single split_huge_page_pmd and in addition he also sent
> native support for hugepages in both mincore and mprotect. Which shows how
> deep he already understands the whole huge_memory.c and its usage in the
> callers. Seeing significant contributions like this I think further confirms
> this is the way to go. Thanks a lot Johannes.
>
> The ability to bisect before the mincore and mprotect native implementations
> is one of the huge benefits of this approach. The hardest of all will be to
> add swap native support to 2M pages later (as it involves to make the
> swapcache 2M capable and that in turn means it expodes more than the rest all
> over the pagecache code) but I think first we've other priorities:
>
> 1) merge memory compaction

Testing V6 at the moment.

> 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> compaction is capable of relocating slab entries in-use (correct me if I'm
> wrong, I think it's impossible as long as the slab entries are mapped by 2M
> pages and not 4k ptes like vmalloc). So the idea is that we should have the

Correct, slab pages currently cannot migrate. Fragmentation within slab is minimised by anti-fragmentation by distinguishing between reclaimable and unreclaimable slab and grouping them appropriately. The objective is to put all the unmovable pages in as few 2M (or 4M or 16M) pages as possible. If min_free_kbytes is tuned as hugeadm --recommended-min_free_kbytes suggests, this works pretty well.

> slab allocate 2M if it fails, 1M if it fails 512k etc... until it fallbacks
> to 4k. Otherwise the slab will fragment the memory badly by allocating with
> alloc_page().

Again, if min_free_kbytes is tuned appropriately, anti-frag should mitigate most of the fragmentation-related damage.

On the notion of having a 2M front slab allocator, SLUB is not far off being capable of such a thing but there are risks. If a 2M page is dedicated to a slab, then other slabs will need their own 2M pages. Overall memory usage grows and you end up worse off.

If you suggest that slab uses 2M pages and breaks them up for slabs, you are very close to what anti-frag already does. The difference might be that slab would guarantee that the 2M page is only used for slab. Again, you could force this situation with anti-frag but the decision was made to allow a certain amount of fragmentation to avoid the memory overhead of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets much of what you need.

Arguably, min_free_kbytes should be tuned appropriately once it's detected that huge pages are in use. It would not be hard at all, we just don't do it.

Stronger guarantees on layout are possible but not done today because of the cost.

> Basically the buddy allocator will guarantee the slab will
> generate as much fragement as possible because it does its best to keep the
> high order pages for who asks for them.

Again, already does this up to a point. rmqueue_fallback() could refuse to break up small contiguous pages for slab to force better layout in terms of fragmentation, but it costs heavily when memory is low because you now have to reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.

> Probably the fallback should
> happen inside the buddy allocator instead of calling alloc_pages
> repeteadly, that should avoid taking a flood of locks. Basically
> the buddy should give the worst possible fragmentation effect to users that
> should be relocated, while the other users that cannot be relocated and
> only use 4k pages will better use a front allocator on top of alloc_pages.
> Something like alloc_page_not_relocatable() that will do its stuff
> internally and try to keep those in the same 2M pages.

Sounds very similar to anti-frag again.

> This alone should
> help tremendously and I think it's orthogonal to the memory compaction of
> the relocatable stuff. Or maybe we should just live with a large chunk of
> the memory not being relocatable,

You could force such a situation by always having X number of lower blocks MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those areas. You'd need to do some juggling with counters and watermarks. It's not impossible, and I considered doing it when anti-fragmentation was introduced, but again, there was insufficient data to support such a move.

> but I like this idea because it's more
> dynamic and it won't have fixed rule "limit the slab to 0-1g range". And
> it'd tend to try to keep fragmentation down even if we spill over the 1G
> range. (1g is purely made up number)
> 3) teach ksm to merge hugepages. I talked about this with Izik and we agree
> the current ksm tree algorithm will be the best at that compared to ksm
> algorithms.
>
> To run KVM on top on this and take advantage of hugepages you need a few liner
> patch I posted to qemu-devel to take care of aligning the start of the guest
> memory so that the guest physical address and host virtual address will have
> the same subpage numbers.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-15.gz
>
> I'd be nice to have this merged in -mm.

-- 
Mel Gorman
Part-time Phd Student, Linux Technology Center
University of Limerick, IBM Dublin Software Lab
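For illustration of the grouping Mel describes, a tiny sketch (assuming the mainline gfp flags of that era): the presence or absence of __GFP_MOVABLE in the gfp mask is what steers an allocation into movable or unmovable pageblocks:

/* illustrative only: the migratetype hint travels in the gfp mask */
static struct page *demo_alloc(int for_user_data)
{
	if (for_user_data)
		/* __GFP_MOVABLE (part of GFP_HIGHUSER_MOVABLE): grouped into movable pageblocks */
		return alloc_page(GFP_HIGHUSER_MOVABLE);
	/* no movability flag: treated as unmovable and packed into unmovable pageblocks */
	return alloc_page(GFP_KERNEL);
}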
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 18:07 UTC
To: Mel Gorman
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 05:36:55PM +0000, Mel Gorman wrote:
> Correct, slab pages currently cannot migrate. Fragmentation within slab
> is minimised by anti-fragmentation by distinguishing between reclaimable
> and unreclaimable slab and grouping them appropriately. The objective is
> to put all the unmovable pages in as few 2M (or 4M or 16M) pages as
> possible. If min_free_kbytes is tuned as hugeadm
> --recommended-min_free_kbytes suggests, this works pretty well.

Awesome. So this feature is already part of your memory compaction
code? As you may have noticed, I haven't started looking deeply at your
code yet.

> Again, if min_free_kbytes is tuned appropriately, anti-frag should
> mitigate most of the fragmentation-related damage.

I don't see why this logic should be connected to min_free_kbytes.
Maybe I'll get it once I read the code, but min_free_kbytes is about
the PF_MEMALLOC pool and GFP_ATOMIC memory. I can't see any connection
between the min_free_kbytes setting and trying to keep all
non-relocatable entries in the same HPAGE_PMD_SIZEd pages.

> On the notion of having a 2M front slab allocator, SLUB is not far off
> being capable of such a thing but there are risks. If a 2M page is
> dedicated to a slab, then other slabs will need their own 2M pages.
> Overall memory usage grows and you end up worse off.
>
> If you suggest that slab uses 2M pages and breaks them up for slabs, you
> are very close to what anti-frag already does. The difference might be

That's exactly what I meant, yes. Doing it per-slab would be useless.

The idea was for slub to simply call alloc_page_not_relocatable(order)
instead of alloc_page() every time it allocates an order <=
HPAGE_PMD_ORDER. That means this 2M page would be shared for _all_
slabs, otherwise it wouldn't work. The page freeing could even go back
to the buddy initially. So the maximum waste would be 2M of ram per cpu
(the front page has to be per-cpu to perform).

> that slab would guarantee that the 2M page is only used for slab. Again,
> you could force this situation with anti-frag but the decision was made
> to allow a certain amount of fragmentation to avoid the memory overhead
> of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets
> much of what you need.

Well, if this 2M page is shared by other not-relocatable entities that
might be even better in some scenarios (maybe worse in others), but I'm
totally fine with a more elaborate approach. Clearly some driver could
also start to call alloc_pages_not_relocatable() and then it'd also
share the same memory as slab. I think it has to be a universally
available feature, just like you implemented. Except right now the main
problem is slab, so that's the first user for sure ;).

> Arguably, min_free_kbytes should be tuned appropriately once it's detected
> that huge pages are in use.

It would not be hard at all, we just don't do it.

> Stronger guarantees on layout are possible but not done today because of
> the cost.

Could you elaborate on what "guarantees of layout" means?

> > Basically the buddy allocator will guarantee the slab will
> > generate as much fragmentation as possible because it does its best to keep the
> > high order pages for whoever asks for them.
>
> Again, already does this up to a point. rmqueue_fallback() could refuse to
> break up small contiguous pages for slab to force better layout in terms of
> fragmentation but it costs heavily when memory is low because you now have to
> reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.

I guess this will require a sysfs control. Do you have a
/sys/kernel/mm/defrag directory or something? If hugepages are
absolutely mandatory (like with hypervisor-only usage) it is worth
invoking memory compaction to satisfy what I call the "front allocator"
and give a full 2M page to slab instead of using the already available
fragment, and to fall back to rmqueue_fallback() only if defrag fails.

> Sounds very similar to anti-frag again.

Indeed.

> You could force such a situation by always having X number of lower blocks
> MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those
> areas. You'd need to do some juggling with counters and watermarks. It's not
> impossible and I considered doing it when anti-fragmentation was introduced
> but again, there was insufficient data to support such a move.

Agreed. I also like a more dynamic approach: the whole idea of
transparent hugepage is that the admin does nothing, no reservation,
and in this case no decision about how much memory should be
MIGRATE_UNMOVABLE.

Looking forward to seeing transparent hugepage take full advantage of
your patchset!

Thanks,
Andrea
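For illustration, here is a minimal user-space sketch of the "front allocator" idea being proposed above: small not-relocatable allocations are carved out of one shared 2M block, and the generic allocator is only used once that block is exhausted. This is a toy model, not kernel code; alloc_page_not_relocatable() does not exist, and the helper names, the single global block (instead of a per-cpu one), and the chunk sizes below are all made up for the example.

/*
 * Toy model (user space, not the kernel) of a shared 2M "front" block
 * for not-relocatable allocations, with fallback to the generic
 * allocator. All names and sizes here are illustrative assumptions.
 */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

#define FRONT_SIZE (2UL * 1024 * 1024)  /* one 2M "front" block */
#define CHUNK      4096UL               /* hand out 4k-sized pieces */

static unsigned char *front_block;      /* shared by all callers */
static size_t front_used;               /* bytes already handed out */

/* stand-in for the hypothetical alloc_page_not_relocatable() */
static void *alloc_not_relocatable(size_t size)
{
    /* round the request up to whole chunks */
    size = (size + CHUNK - 1) & ~(CHUNK - 1);

    if (!front_block) {
        /* grab one 2M-aligned block up front */
        if (posix_memalign((void **)&front_block, FRONT_SIZE, FRONT_SIZE))
            return NULL;
        front_used = 0;
    }

    if (front_used + size <= FRONT_SIZE) {
        void *p = front_block + front_used;
        front_used += size;
        return p;               /* served from the shared 2M block */
    }

    /* front block exhausted: fall back to the generic allocator */
    return malloc(size);
}

int main(void)
{
    unsigned char *a = alloc_not_relocatable(4096);
    unsigned char *b = alloc_not_relocatable(16384);

    printf("a at offset %zu in the front block\n", (size_t)(a - front_block));
    printf("b at offset %zu in the front block\n", (size_t)(b - front_block));
    return 0;
}

The point of the sketch is only that every not-relocatable user shares the same 2M region, so the worst-case waste is bounded by one front block (per cpu, in the proposal).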
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Mel Gorman @ 2010-03-26 21:09 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 07:07:01PM +0100, Andrea Arcangeli wrote:
> On Fri, Mar 26, 2010 at 05:36:55PM +0000, Mel Gorman wrote:
> > Correct, slab pages currently cannot migrate. Fragmentation within slab
> > is minimised by anti-fragmentation by distinguishing between reclaimable
> > and unreclaimable slab and grouping them appropriately. The objective is
> > to put all the unmovable pages in as few 2M (or 4M or 16M) pages as
> > possible. If min_free_kbytes is tuned as hugeadm
> > --recommended-min_free_kbytes suggests, this works pretty well.
>
> Awesome. So this feature is already part of your memory compaction
> code?

No, anti-fragmentation has been in for a long time. hugeadm (part of
libhugetlbfs) has supported --recommended-min_free_kbytes for some time
as well.

> As you may have noticed, I haven't started looking deeply at your
> code yet.
>
> > Again, if min_free_kbytes is tuned appropriately, anti-frag should
> > mitigate most of the fragmentation-related damage.
>
> I don't see why this logic should be connected to min_free_kbytes.
> Maybe I'll get it once I read the code, but min_free_kbytes is about
> the PF_MEMALLOC pool and GFP_ATOMIC memory. I can't see any connection
> between the min_free_kbytes setting and trying to keep all
> non-relocatable entries in the same HPAGE_PMD_SIZEd pages.
>

Anti-fragmentation groups pages within pageblocks that are the size of
the default huge page. Blocks can have different migratetypes and the
free lists are also based on types. If there isn't a free page of the
appropriate type, rmqueue_fallback() selects an alternative list to
allocate from. Each one of these "fallback" events potentially
increases the level of fragmentation.

Using --recommended-min_free_kbytes keeps enough pages free that these
"fallback" events are severely reduced, because there is typically a
free page of the appropriate type located in the correct pageblock.

If you are very curious, you can use the mm_page_alloc_extfrag trace
event to monitor fragmentation-related events. Part of the event
reports "fragmenting=", which indicates whether the fallback is severe
in terms of fragmentation or not.

> > On the notion of having a 2M front slab allocator, SLUB is not far off
> > being capable of such a thing but there are risks. If a 2M page is
> > dedicated to a slab, then other slabs will need their own 2M pages.
> > Overall memory usage grows and you end up worse off.
> >
> > If you suggest that slab uses 2M pages and breaks them up for slabs, you
> > are very close to what anti-frag already does. The difference might be
>
> That's exactly what I meant, yes. Doing it per-slab would be useless.
>
> The idea was for slub to simply call alloc_page_not_relocatable(order)

If you don't specify migratetype-related GFP flags, it's assumed to be
UNMOVABLE.

> instead of alloc_page() every time it allocates an order <=
> HPAGE_PMD_ORDER. That means this 2M page would be shared for _all_
> slabs, otherwise it wouldn't work.
>

I still think anti-frag is already doing most of what you suggest. Slab
should already be using UNMOVABLE blocks (see /proc/pagetypeinfo for
how the pageblocks are being used).

> The page freeing could even go back to the buddy initially. So the
> maximum waste would be 2M of ram per cpu (the front page has to be
> per-cpu to perform).
>
> > that slab would guarantee that the 2M page is only used for slab. Again,
> > you could force this situation with anti-frag but the decision was made
> > to allow a certain amount of fragmentation to avoid the memory overhead
> > of such a thing. Again, tuning min_free_kbytes + anti-fragmentation gets
> > much of what you need.
>
> Well, if this 2M page is shared by other not-relocatable entities that
> might be even better in some scenarios (maybe worse in others)

The 2M page is today being shared with other unmovable (what you call
not relocatable) pages. The scenario where it potentially gets worse is
where there is a weird mix of pagetable and slab allocations. This will
push up the number of blocks used for unmovable pages to some extent.

> but I'm totally fine with a more elaborate approach. Clearly some
> driver could also start to call alloc_pages_not_relocatable() and then
> it'd also share the same memory as slab. I think it has to be a
> universally available feature, just like you implemented. Except right
> now the main problem is slab, so that's the first user for sure ;).
>

Right now, allocations are assumed to be unmovable unless otherwise
specified.

> > Arguably, min_free_kbytes should be tuned appropriately once it's detected
> > that huge pages are in use.
>
> It would not be hard at all, we just don't do it.
>
> > Stronger guarantees on layout are possible but not done today because of
> > the cost.
>
> Could you elaborate on what "guarantees of layout" means?
>

The ideal would be that the fewest number of pageblocks are in use and
each pageblock only contains pages of a specific migratetype. One
"guaranteed layout" would be that pageblocks only ever contain pages of
a given type, but this would potentially require a full 2M of data to
be relocated or reclaimed to satisfy a new allocation. It would also
cause problems with atomic allocations. It would be great from a
fragmentation perspective but would suck otherwise.

> > > Basically the buddy allocator will guarantee the slab will
> > > generate as much fragmentation as possible because it does its best to keep the
> > > high order pages for whoever asks for them.
> >
> > Again, already does this up to a point. rmqueue_fallback() could refuse to
> > break up small contiguous pages for slab to force better layout in terms of
> > fragmentation but it costs heavily when memory is low because you now have to
> > reclaim (or relocate) more pages than necessary to satisfy anti-fragmentation.
>
> I guess this will require a sysfs control.

It would also be a new feature. With memory compaction, the page
allocator will compact memory to satisfy a high-order allocation but it
doesn't compact memory to avoid mixing pageblocks.

> Do you have a /sys/kernel/mm/defrag directory or something? If
> hugepages are absolutely mandatory (like with hypervisor-only usage)
> it is worth invoking memory compaction to satisfy what I call the
> "front allocator" and give a full 2M page to slab instead of using the
> already available fragment, and to fall back to rmqueue_fallback()
> only if defrag fails.
>

There is a proc entry and a sysfs entry that allow compacting either
all of memory or on a per-node basis, but I'd be surprised if it was
required. When a new machine starts up, it should start
direct-compacting memory to get the huge pages it needs.

> > Sounds very similar to anti-frag again.
>
> Indeed.
>
> > You could force such a situation by always having X number of lower blocks
> > MIGRATE_UNMOVABLE and forcing a situation where fallback never happens to those
> > areas. You'd need to do some juggling with counters and watermarks. It's not
> > impossible and I considered doing it when anti-fragmentation was introduced
> > but again, there was insufficient data to support such a move.
>
> Agreed. I also like a more dynamic approach: the whole idea of
> transparent hugepage is that the admin does nothing, no reservation,
> and in this case no decision about how much memory should be
> MIGRATE_UNMOVABLE.
>
> Looking forward to seeing transparent hugepage take full advantage of
> your patchset!
>

Same here.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
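The grouping Mel describes hinges on deriving a migratetype from the GFP flags of each allocation, with "no mobility flag" meaning unmovable. The following is a simplified, user-space sketch of that mapping, not the kernel's actual code: the flag values are toy stand-ins, and only the three migratetypes discussed in this thread are modelled.

/*
 * Simplified sketch of gfp-flags -> migratetype grouping as described
 * above. Flag values are invented for the example; only the decision
 * logic (unspecified == UNMOVABLE) mirrors the behaviour discussed.
 */
#include <stdio.h>

#define __GFP_RECLAIMABLE 0x01u   /* toy stand-in value */
#define __GFP_MOVABLE     0x02u   /* toy stand-in value */

enum migratetype {
    MIGRATE_UNMOVABLE,    /* e.g. slab, page tables: cannot be relocated */
    MIGRATE_RECLAIMABLE,  /* can be freed under pressure */
    MIGRATE_MOVABLE,      /* can be migrated, e.g. anonymous/page cache */
};

static enum migratetype gfp_to_migratetype(unsigned int gfp_flags)
{
    if (gfp_flags & __GFP_MOVABLE)
        return MIGRATE_MOVABLE;
    if (gfp_flags & __GFP_RECLAIMABLE)
        return MIGRATE_RECLAIMABLE;
    /* no mobility hint given: assume the page can never move */
    return MIGRATE_UNMOVABLE;
}

int main(void)
{
    printf("plain allocation          -> %d (UNMOVABLE)\n",
           gfp_to_migratetype(0));
    printf("__GFP_RECLAIMABLE alloc   -> %d (RECLAIMABLE)\n",
           gfp_to_migratetype(__GFP_RECLAIMABLE));
    printf("__GFP_MOVABLE alloc       -> %d (MOVABLE)\n",
           gfp_to_migratetype(__GFP_MOVABLE));
    return 0;
}

This is why Mel can say slab already lands in unmovable pageblocks without any new alloc_page_not_relocatable() call: an allocation that passes no mobility flag is grouped as unmovable by default.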
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Christoph Lameter @ 2010-03-26 18:00 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, 26 Mar 2010, Andrea Arcangeli wrote:

> 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> compaction is capable of relocating slab entries in-use (correct me if I'm
> wrong, I think it's impossible as long as the slab entries are mapped by 2M

SLUB is capable of using huge pages. Specify slub_min_order=9 on boot
and it will make the kernel use huge pages.

> pages and not 4k ptes like vmalloc). So the idea is that we should have the
> slab allocate 2M, if that fails 1M, if that fails 512k etc... until it falls
> back to 4k. Otherwise the slab will fragment the memory badly by allocating
> with alloc_page(). Basically the buddy allocator will guarantee the slab will
> generate as much fragmentation as possible because it does its best to keep the
> high order pages for whoever asks for them. Probably the fallback should

Fallback is another issue. SLUB can handle various orders of pages in
the same slab cache and already implements fallback to order 0. To
implement a scheme as you suggest here would not require any changes to
data structures but only to the slab allocation functions. See
allocate_slab() in mm/slub.c.
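For readers unfamiliar with the pattern Christoph points at, here is a user-space sketch of the kind of fallback allocate_slab() implements: try the preferred (possibly huge) page order first and, if that fails, accept the minimum order the cache can live with. This is an illustration only, not a copy of mm/slub.c; try_alloc_pages() and allocate_slab_like() are invented names, and the "high orders always fail" rule merely simulates a fragmented system.

/*
 * Sketch of a two-step order fallback: preferred order first, then the
 * minimum acceptable order. Helper names and failure rule are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* pretend allocations above order 2 always fail (fragmented system) */
static void *try_alloc_pages(unsigned int order)
{
    if (order > 2)
        return NULL;
    return malloc(PAGE_SIZE << order);
}

static void *allocate_slab_like(unsigned int preferred_order,
                                unsigned int min_order,
                                unsigned int *got_order)
{
    void *page = try_alloc_pages(preferred_order);

    if (!page && preferred_order > min_order) {
        /* fallback path: accept a smaller contiguous region */
        page = try_alloc_pages(min_order);
        if (page)
            preferred_order = min_order;
    }
    if (page)
        *got_order = preferred_order;
    return page;
}

int main(void)
{
    unsigned int order;
    /* order 9 is 2M with 4k pages, like booting with slub_min_order=9 */
    void *slab = allocate_slab_like(9, 0, &order);

    if (slab)
        printf("got a slab of order %u (%lu bytes)\n",
               order, PAGE_SIZE << order);
    free(slab);
    return 0;
}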
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 18:23 UTC
To: Christoph Lameter
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 01:00:12PM -0500, Christoph Lameter wrote:
> On Fri, 26 Mar 2010, Andrea Arcangeli wrote:
>
> > 2) writing a HPAGE_PMD_ORDER front slab allocator. I don't think memory
> > compaction is capable of relocating slab entries in-use (correct me if I'm
> > wrong, I think it's impossible as long as the slab entries are mapped by 2M
>
> SLUB is capable of using huge pages. Specify slub_min_order=9 on boot
> and it will make the kernel use huge pages.
>
> > pages and not 4k ptes like vmalloc). So the idea is that we should have the
> > slab allocate 2M, if that fails 1M, if that fails 512k etc... until it falls
> > back to 4k. Otherwise the slab will fragment the memory badly by allocating
> > with alloc_page(). Basically the buddy allocator will guarantee the slab will
> > generate as much fragmentation as possible because it does its best to keep the
> > high order pages for whoever asks for them. Probably the fallback should
>
> Fallback is another issue. SLUB can handle various orders of pages in
> the same slab cache and already implements fallback to order 0. To
> implement a scheme as you suggest here would not require any changes to
> data structures but only to the slab allocation functions. See
> allocate_slab() in mm/slub.c.

Thanks for the information! Luckily it seems Mel has already taken care
of this part in his patchset. But in my view this feature should be
available outside of SLUB/SLAB and potentially available to drivers and
such. SLUB having this embedded is nice to know!!!

BTW, unfortunately according to tons of measurements done so far, SLUB
is too slow on most workstations and small/mid servers (usually single
digit but in some cases even double digit percentage slowdowns
depending on the workload; hackbench tends to stress it the most). It's
a tradeoff between avoiding wasting tons of ram on 1024-way systems and
running fast. Either that or something's wrong with the SLUB
implementation (and I'm talking about 2.6.32, not earlier code). I'd
also like to save memory, so it'd be great if SLUB could be fixed to
perform faster!
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Christoph Lameter @ 2010-03-26 18:44 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, 26 Mar 2010, Andrea Arcangeli wrote:

> BTW, unfortunately according to tons of measurements done so far, SLUB
> is too slow on most workstations and small/mid servers (usually single
> digit but in some cases even double digit percentage slowdowns
> depending on the workload; hackbench tends to stress it the most). It's
> a tradeoff between avoiding wasting tons of ram on 1024-way systems and
> running fast. Either that or something's wrong with the SLUB
> implementation (and I'm talking about 2.6.32, not earlier code). I'd
> also like to save memory, so it'd be great if SLUB could be fixed to
> perform faster!

The SLUB fastpath is the fastest there is. Problems arise because of
locality constraints in SLUB. SLAB can throw gobs of memory at it to
guarantee a high cache hit rate, but to cover all angles on NUMA it has
to throw the gobs multiple times.

The weakness is SLUB's free functions, which free the object directly
to the slab page instead of running through a series of caching
structures. If frees occur to locally dispersed objects then SLUB is at
a disadvantage, since it is hitting cold cache lines for metadata on
free.

On the other hand, SLUB hands out objects in a locality-aware fashion
and not randomly from everywhere like SLAB. This is certainly good to
reduce TLB pressure. Huge pages would accelerate SLUB since more
objects can be served from the same page than before.
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Andrea Arcangeli @ 2010-03-26 19:34 UTC
To: Christoph Lameter
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, Mar 26, 2010 at 01:44:23PM -0500, Christoph Lameter wrote:
> TLB pressure. Huge pages would accelerate SLUB since more objects can be
> served from the same page than before.

Agreed. I see it falls back to 4k instead of gradually going down, but
that was my point: doing the fallback by re-entering alloc_pages N
times without internal buddy support would be fairly inefficient. This
is why I suggest this logic be outside of slab/slub; in theory even
slab could be a bit faster thanks to large TLBs on newly allocated slab
objects. I hope Mel's code already takes care of all of this.
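To make the contrast with the two-step fallback sketched earlier concrete, here is a user-space sketch of the gradual fallback Andrea suggests: walk down from the huge order one step at a time (2M, 1M, 512k, ... 4k) and take the first size that succeeds. Again an illustration under invented names, not a kernel API, and the simulated failure threshold is arbitrary.

/*
 * Sketch of a gradual order-by-order fallback rather than jumping
 * straight from the huge order to order 0. Names are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* pretend nothing above order 3 (32k) is available */
static void *try_alloc_pages(unsigned int order)
{
    if (order > 3)
        return NULL;
    return malloc(PAGE_SIZE << order);
}

/* walk down from the huge order one step at a time until something fits */
static void *alloc_front_chunk(unsigned int max_order, unsigned int *got_order)
{
    for (int order = (int)max_order; order >= 0; order--) {
        void *p = try_alloc_pages((unsigned int)order);
        if (p) {
            *got_order = (unsigned int)order;
            return p;
        }
    }
    return NULL;
}

int main(void)
{
    unsigned int order;
    void *chunk = alloc_front_chunk(9, &order);  /* 9 == 2M with 4k pages */

    if (chunk)
        printf("fell back to order %u (%lu bytes)\n",
               order, PAGE_SIZE << order);
    free(chunk);
    return 0;
}

Andrea's point is that doing this loop by repeatedly re-entering the page allocator from the slab side would be inefficient, which is why he wants the logic provided centrally rather than inside each slab allocator.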
* Re: [PATCH 00 of 41] Transparent Hugepage Support #15
From: Christoph Lameter @ 2010-03-26 19:55 UTC
To: Andrea Arcangeli
Cc: linux-mm, Andrew Morton, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner

On Fri, 26 Mar 2010, Andrea Arcangeli wrote:

> On Fri, Mar 26, 2010 at 01:44:23PM -0500, Christoph Lameter wrote:
> > TLB pressure. Huge pages would accelerate SLUB since more objects can be
> > served from the same page than before.
>
> Agreed. I see it falls back to 4k instead of gradually going down, but
> that was my point: doing the fallback by re-entering alloc_pages N
> times without internal buddy support would be fairly inefficient. This
> is why I suggest this logic be outside of slab/slub; in theory even
> slab could be a bit faster thanks to large TLBs on newly allocated slab
> objects. I hope Mel's code already takes care of all of this.

SLAB's queueing system has an inevitable garbling effect on memory
references. The larger the queues, the larger that effect becomes.

We already have internal buddy support in the page allocator. Mel's
defrag approach groups them together.
Thread overview: 23+ messages (newest: 2010-03-26 21:09 UTC)

2010-03-26 16:48 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli
2010-03-26 16:48 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli

-- strict thread matches above, loose matches on Subject: below --

2010-03-26 17:00 [PATCH 00 of 41] Transparent Hugepage Support #15 Andrea Arcangeli
2010-03-26 17:36 ` Mel Gorman
2010-03-26 18:07 ` Andrea Arcangeli
2010-03-26 21:09 ` Mel Gorman
2010-03-26 18:00 ` Christoph Lameter
2010-03-26 18:23 ` Andrea Arcangeli
2010-03-26 18:44 ` Christoph Lameter
2010-03-26 19:34 ` Andrea Arcangeli
2010-03-26 19:55 ` Christoph Lameter