* [PATCH 01 of 31] define MADV_HUGEPAGE
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 20:30 ` Arnd Bergmann
2010-01-28 14:33 ` [PATCH 02 of 31] compound_lock Andrea Arcangeli
` (30 subsequent siblings)
31 siblings, 1 reply; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Define MADV_HUGEPAGE.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
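A minimal userspace sketch of how the new advice is meant to be used
(assuming the usual madvise(2) prototype; the helper name and the fallback
define are illustrative, and kernels without this patch simply fail the call
with EINVAL for the unknown advice):

	#include <sys/mman.h>
	#include <stddef.h>

	#ifndef MADV_HUGEPAGE
	#define MADV_HUGEPAGE 14	/* value from asm-generic/mman-common.h */
	#endif

	/* Hint that [p, p+len) is worth backing with hugepages. */
	static int hint_hugepage(void *p, size_t len)
	{
		return madvise(p, len, MADV_HUGEPAGE);
	}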
diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
--- a/arch/alpha/include/asm/mman.h
+++ b/arch/alpha/include/asm/mman.h
@@ -53,6 +53,8 @@
#define MADV_MERGEABLE 12 /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */
+#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */
+
/* compatibility flags */
#define MAP_FILE 0
diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
--- a/arch/mips/include/asm/mman.h
+++ b/arch/mips/include/asm/mman.h
@@ -77,6 +77,8 @@
#define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */
#define MADV_HWPOISON 100 /* poison a page for testing */
+#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */
+
/* compatibility flags */
#define MAP_FILE 0
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -59,6 +59,8 @@
#define MADV_MERGEABLE 65 /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */
+#define MADV_HUGEPAGE 67 /* Worth backing with hugepages */
+
/* compatibility flags */
#define MAP_FILE 0
#define MAP_VARIABLE 0
diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
--- a/arch/xtensa/include/asm/mman.h
+++ b/arch/xtensa/include/asm/mman.h
@@ -83,6 +83,8 @@
#define MADV_MERGEABLE 12 /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */
+#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */
+
/* compatibility flags */
#define MAP_FILE 0
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -45,6 +45,8 @@
#define MADV_MERGEABLE 12 /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */
+#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */
+
/* compatibility flags */
#define MAP_FILE 0
--
* Re: [PATCH 01 of 31] define MADV_HUGEPAGE
2010-01-28 14:33 ` [PATCH 01 of 31] define MADV_HUGEPAGE Andrea Arcangeli
@ 2010-01-28 20:30 ` Arnd Bergmann
0 siblings, 0 replies; 50+ messages in thread
From: Arnd Bergmann @ 2010-01-28 20:30 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh
On Thursday 28 January 2010, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Define MADV_HUGEPAGE.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
--
* [PATCH 02 of 31] compound_lock
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 01 of 31] define MADV_HUGEPAGE Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 03 of 31] alter compound get_page/put_page Andrea Arcangeli
` (29 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
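For reference, a rough sketch of the pattern the lock is meant to support;
the caller context here is hypothetical, the real users are introduced by
the later put_page and futex patches:

	/*
	 * While PG_compound_lock is held on the head page,
	 * __split_huge_page_refcount() cannot redistribute the compound
	 * page's refcounts, so head/tail counts can be checked and
	 * adjusted consistently.
	 */
	compound_lock(page_head);
	/* ... transfer or re-check tail page references here ... */
	compound_unlock(page_head);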
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,6 +12,7 @@
#include <linux/prio_tree.h>
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
+#include <linux/bit_spinlock.h>
struct mempolicy;
struct anon_vma;
@@ -294,6 +295,20 @@ static inline int is_vmalloc_or_module_a
}
#endif
+static inline void compound_lock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ bit_spin_lock(PG_compound_lock, &page->flags);
+#endif
+}
+
+static inline void compound_unlock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ bit_spin_unlock(PG_compound_lock, &page->flags);
+#endif
+}
+
static inline struct page *compound_head(struct page *page)
{
if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,9 @@ enum pageflags {
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ PG_compound_lock,
+#endif
__NR_PAGEFLAGS,
/* Filesystems */
@@ -399,6 +402,12 @@ static inline void __ClearPageTail(struc
#define __PG_MLOCKED 0
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define __PG_COMPOUND_LOCK (1 << PG_compound_lock)
+#else
+#define __PG_COMPOUND_LOCK 0
+#endif
+
/*
* Flags checked when a page is freed. Pages being freed should not have
* these flags set. It they are, there is a problem.
@@ -408,7 +417,8 @@ static inline void __ClearPageTail(struc
1 << PG_private | 1 << PG_private_2 | \
1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
+ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+ __PG_COMPOUND_LOCK)
/*
* Flags checked when a page is prepped for return by the page allocator.
--
* [PATCH 03 of 31] alter compound get_page/put_page
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 01 of 31] define MADV_HUGEPAGE Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 02 of 31] compound_lock Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 04 of 31] update futex compound knowledge Andrea Arcangeli
` (28 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Alter compound get_page/put_page to also keep references on the subpages, so
that __split_huge_page_refcount can split a hugepage even while subpages have
been pinned by one of the get_user_pages() variants.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
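The resulting refcounting contract, sketched from the point of view of a
gup-fast user (the surrounding code is illustrative, not part of the patch):

	struct page *tail = pages[i];	/* possibly a hugepage tail page */

	get_page(tail);		/* raises tail->_count and, for a tail page,
				 * also tail->first_page->_count */
	/* ... O_DIRECT style I/O against the page ... */
	put_page(tail);		/* releases both references, whether or not
				 * __split_huge_page_refcount() split the
				 * hugepage in the meantime */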
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,6 +16,16 @@
#ifdef __HAVE_ARCH_PTE_SPECIAL
+static inline void pin_huge_page_tail(struct page *page)
+{
+ /*
+ * __split_huge_page_refcount() cannot run
+ * from under us.
+ */
+ VM_BUG_ON(atomic_read(&page->_count) < 0);
+ atomic_inc(&page->_count);
+}
+
/*
* The performance critical leaf functions are made noinline otherwise gcc
* inlines everything into a single function which results in too much
@@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t
put_page(page);
return 0;
}
+ if (PageTail(page))
+ pin_huge_page_tail(page);
pages[*nr] = page;
(*nr)++;
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -105,6 +105,16 @@ static inline void get_head_page_multipl
atomic_add(nr, &page->_count);
}
+static inline void pin_huge_page_tail(struct page *page)
+{
+ /*
+ * __split_huge_page_refcount() cannot run
+ * from under us.
+ */
+ VM_BUG_ON(atomic_read(&page->_count) < 0);
+ atomic_inc(&page->_count);
+}
+
static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
@@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
+ if (PageTail(page))
+ pin_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -323,9 +323,17 @@ static inline int page_count(struct page
static inline void get_page(struct page *page)
{
- page = compound_head(page);
- VM_BUG_ON(atomic_read(&page->_count) == 0);
+ VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
atomic_inc(&page->_count);
+ if (unlikely(PageTail(page))) {
+ /*
+ * This is safe only because
+ * __split_huge_page_refcount can't run under
+ * get_page().
+ */
+ VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+ atomic_inc(&page->first_page->_count);
+ }
}
static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,17 +55,82 @@ static void __page_cache_release(struct
del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
+}
+
+static void __put_single_page(struct page *page)
+{
+ __page_cache_release(page);
free_hot_page(page);
}
+static void __put_compound_page(struct page *page)
+{
+ compound_page_dtor *dtor;
+
+ __page_cache_release(page);
+ dtor = get_compound_page_dtor(page);
+ (*dtor)(page);
+}
+
static void put_compound_page(struct page *page)
{
- page = compound_head(page);
- if (put_page_testzero(page)) {
- compound_page_dtor *dtor;
-
- dtor = get_compound_page_dtor(page);
- (*dtor)(page);
+ if (unlikely(PageTail(page))) {
+ /* __split_huge_page_refcount can run under us */
+ struct page *page_head = page->first_page;
+ smp_rmb();
+ if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+ if (unlikely(!PageHead(page_head))) {
+ /* PageHead is cleared after PageTail */
+ smp_rmb();
+ VM_BUG_ON(PageTail(page));
+ goto out_put_head;
+ }
+ /*
+ * Only run compound_lock on a valid PageHead,
+ * after having it pinned with
+ * get_page_unless_zero() above.
+ */
+ smp_mb();
+ /* page_head wasn't a dangling pointer */
+ compound_lock(page_head);
+ if (unlikely(!PageTail(page))) {
+ /* __split_huge_page_refcount run before us */
+ compound_unlock(page_head);
+ VM_BUG_ON(PageHead(page_head));
+ out_put_head:
+ if (put_page_testzero(page_head))
+ __put_single_page(page_head);
+ out_put_single:
+ if (put_page_testzero(page))
+ __put_single_page(page);
+ return;
+ }
+ VM_BUG_ON(page_head != page->first_page);
+ /*
+ * We can release the refcount taken by
+ * get_page_unless_zero now that
+ * split_huge_page_refcount is blocked on the
+ * compound_lock.
+ */
+ if (put_page_testzero(page_head))
+ VM_BUG_ON(1);
+ /* __split_huge_page_refcount will wait now */
+ VM_BUG_ON(atomic_read(&page->_count) <= 0);
+ atomic_dec(&page->_count);
+ VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+ compound_unlock(page_head);
+ if (put_page_testzero(page_head))
+ __put_compound_page(page_head);
+ } else {
+ /* page_head is a dangling pointer */
+ VM_BUG_ON(PageTail(page));
+ goto out_put_single;
+ }
+ } else if (put_page_testzero(page)) {
+ if (PageHead(page))
+ __put_compound_page(page);
+ else
+ __put_single_page(page);
}
}
@@ -74,7 +139,7 @@ void put_page(struct page *page)
if (unlikely(PageCompound(page)))
put_compound_page(page);
else if (put_page_testzero(page))
- __page_cache_release(page);
+ __put_single_page(page);
}
EXPORT_SYMBOL(put_page);
--
* [PATCH 04 of 31] update futex compound knowledge
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (2 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 03 of 31] alter compound get_page/put_page Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 16:11 ` Mel Gorman
2010-01-28 14:33 ` [PATCH 05 of 31] fix bad_page to show the real reason the page is bad Andrea Arcangeli
` (27 subsequent siblings)
31 siblings, 1 reply; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Futex code is smarter than most other gup_fast O_DIRECT code and knows about
the compound internals. However, doing a put_page(head_page) now will not
release the pin on the tail page taken by gup-fast, leading to all sorts of
refcounting bugchecks. Getting a stable head_page is a little tricky.
page_head = page is there because if this is not a tail page it's also the
page_head. compound_head is called only if this is a tail page; otherwise it
is guaranteed unnecessary. And if it is a tail page, compound_head has to run
atomically inside the irq-disabled section around __get_user_pages_fast
before returning. Otherwise ->first_page won't be a stable pointer.
Disabling irqs before __get_user_pages_fast and re-enabling them after running
compound_head is needed because if __get_user_pages_fast returns == 1, it
means the huge pmd is established and cannot go away from under us.
pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait
for local_irq_enable before the IPI delivery can return. This means
__split_huge_page_refcount can't be running from under us, and in turn when we
run compound_head(page) we're not reading a dangling pointer from
tailpage->first_page. Then once we have a stable head page, we are always safe
to call compound_lock, and after taking the compound lock on the head page we
can finally re-check whether the page returned by gup-fast is still a tail
page, in which case we're set and we didn't need to split the hugepage in
order to take a futex on it.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
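The ordering argument above, condensed into a sketch (function bodies elided,
purely illustrative):

	/*
	 * gup-fast side (this patch):
	 *	local_irq_disable();
	 *	__get_user_pages_fast(address, 1, 1, &page);	returns 1
	 *	page_head = compound_head(page);
	 *	local_irq_enable();
	 *
	 * splitting side:
	 *	pmdp_splitting_flush_notify() sends a TLB-flush IPI that
	 *	cannot complete until the gup-fast side re-enables irqs, so
	 *	__split_huge_page_refcount() cannot run while they are
	 *	disabled and tailpage->first_page stays a stable pointer.
	 */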
diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -218,7 +218,7 @@ get_futex_key(u32 __user *uaddr, int fsh
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
- struct page *page;
+ struct page *page, *page_head;
int err;
/*
@@ -250,10 +250,36 @@ again:
if (err < 0)
return err;
- page = compound_head(page);
- lock_page(page);
- if (!page->mapping) {
- unlock_page(page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ page_head = page;
+ if (unlikely(PageTail(page))) {
+ put_page(page);
+ /* serialize against __split_huge_page_splitting() */
+ local_irq_disable();
+ if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+ page_head = compound_head(page);
+ local_irq_enable();
+ } else {
+ local_irq_enable();
+ goto again;
+ }
+ }
+#else
+ page_head = compound_head(page);
+#endif
+
+ lock_page(page_head);
+ if (unlikely(page_head != page)) {
+ compound_lock(page_head);
+ if (unlikely(!PageTail(page))) {
+ compound_unlock(page_head);
+ unlock_page(page_head);
+ put_page(page);
+ goto again;
+ }
+ }
+ if (!page_head->mapping) {
+ unlock_page(page_head);
put_page(page);
goto again;
}
@@ -265,19 +291,21 @@ again:
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
*/
- if (PageAnon(page)) {
+ if (PageAnon(page_head)) {
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
} else {
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
- key->shared.inode = page->mapping->host;
- key->shared.pgoff = page->index;
+ key->shared.inode = page_head->mapping->host;
+ key->shared.pgoff = page_head->index;
}
get_futex_key_refs(key);
- unlock_page(page);
+ if (unlikely(PageTail(page)))
+ compound_unlock(page_head);
+ unlock_page(page_head);
put_page(page);
return 0;
}
--
* Re: [PATCH 04 of 31] update futex compound knowledge
2010-01-28 14:33 ` [PATCH 04 of 31] update futex compound knowledge Andrea Arcangeli
@ 2010-01-28 16:11 ` Mel Gorman
2010-02-01 7:45 ` Andrea Arcangeli
0 siblings, 1 reply; 50+ messages in thread
From: Mel Gorman @ 2010-01-28 16:11 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 03:33:18PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Futex code is smarter than most other gup_fast O_DIRECT code and knows about
> the compound internals. However now doing a put_page(head_page) will not
> release the pin on the tail page taken by gup-fast, leading to all sort of
> refcounting bugchecks. Getting a stable head_page is a little tricky.
>
> page_head = page is there because if this is not a tail page it's also the
> page_head. Only in case this is a tail page, compound_head is called, otherwise
> it's guaranteed unnecessary. And if it's a tail page compound_head has to run
> atomically inside irq disabled section __get_user_pages_fast before returning.
> Otherwise ->first_page won't be a stable pointer.
>
> Disableing irq before __get_user_page_fast and releasing irq after running
> compound_head is needed because if __get_user_page_fast returns == 1, it means
> the huge pmd is established and cannot go away from under us.
> pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait
> for local_irq_enable before the IPI delivery can return. This means
> __split_huge_page_refcount can't be running from under us, and in turn when we
> run compound_head(page) we're not reading a dangling pointer from
> tailpage->first_page. Then after we get to stable head page, we are always safe
> to call compound_lock and after taking the compound lock on head page we can
> finally re-check if the page returned by gup-fast is still a tail page. in
> which case we're set and we didn't need to split the hugepage in order to take
> a futex on it.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
When I rebased my tree, I found that the get_user_pages_fast() write argument
is 1 unconditionally now, where it wasn't on 2.6.32, so sorry about that
confusion.
Acked-by: Mel Gorman <mel@csn.ul.ie>
> ---
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -218,7 +218,7 @@ get_futex_key(u32 __user *uaddr, int fsh
> {
> unsigned long address = (unsigned long)uaddr;
> struct mm_struct *mm = current->mm;
> - struct page *page;
> + struct page *page, *page_head;
> int err;
>
> /*
> @@ -250,10 +250,36 @@ again:
> if (err < 0)
> return err;
>
> - page = compound_head(page);
> - lock_page(page);
> - if (!page->mapping) {
> - unlock_page(page);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + page_head = page;
> + if (unlikely(PageTail(page))) {
> + put_page(page);
> + /* serialize against __split_huge_page_splitting() */
> + local_irq_disable();
> + if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> + page_head = compound_head(page);
> + local_irq_enable();
> + } else {
> + local_irq_enable();
> + goto again;
> + }
> + }
> +#else
> + page_head = compound_head(page);
> +#endif
> +
> + lock_page(page_head);
> + if (unlikely(page_head != page)) {
> + compound_lock(page_head);
> + if (unlikely(!PageTail(page))) {
> + compound_unlock(page_head);
> + unlock_page(page_head);
> + put_page(page);
> + goto again;
> + }
> + }
> + if (!page_head->mapping) {
> + unlock_page(page_head);
> put_page(page);
> goto again;
> }
> @@ -265,19 +291,21 @@ again:
> * it's a read-only handle, it's expected that futexes attach to
> * the object not the particular process.
> */
> - if (PageAnon(page)) {
> + if (PageAnon(page_head)) {
> key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
> key->private.mm = mm;
> key->private.address = address;
> } else {
> key->both.offset |= FUT_OFF_INODE; /* inode-based key */
> - key->shared.inode = page->mapping->host;
> - key->shared.pgoff = page->index;
> + key->shared.inode = page_head->mapping->host;
> + key->shared.pgoff = page_head->index;
> }
>
> get_futex_key_refs(key);
>
> - unlock_page(page);
> + if (unlikely(PageTail(page)))
> + compound_unlock(page_head);
> + unlock_page(page_head);
> put_page(page);
> return 0;
> }
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [PATCH 04 of 31] update futex compound knowledge
2010-01-28 16:11 ` Mel Gorman
@ 2010-02-01 7:45 ` Andrea Arcangeli
0 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-02-01 7:45 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
There was a little problem in the futex code: basically we can't take the
PG_lock on a page that isn't pinned. This should fix it. I never hit the
problem in practice. Still untested, just wanted to update you on this one.
diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -258,6 +258,18 @@ again:
local_irq_disable();
if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
page_head = compound_head(page);
+ /*
+ * page_head is valid pointer but we must pin
+ * it before taking the PG_lock and/or
+ * PG_compound_lock. The moment we re-enable
+ * irqs __split_huge_page_splitting() can
+ * return and the head page can be freed from
+ * under us. We can't take the PG_lock and/or
+ * PG_compound_lock on a page that could be
+ * freed from under us.
+ */
+ if (page != page_head)
+ get_page(page_head);
local_irq_enable();
} else {
local_irq_enable();
@@ -266,6 +278,8 @@ again:
}
#else
page_head = compound_head(page);
+ if (page != page_head)
+ get_page(page_head);
#endif
lock_page(page_head);
@@ -274,12 +288,15 @@ again:
if (unlikely(!PageTail(page))) {
compound_unlock(page_head);
unlock_page(page_head);
+ put_page(page_head);
put_page(page);
goto again;
}
}
if (!page_head->mapping) {
unlock_page(page_head);
+ if (page_head != page)
+ put_page(page_head);
put_page(page);
goto again;
}
@@ -303,9 +320,13 @@ again:
get_futex_key_refs(key);
- if (unlikely(PageTail(page)))
+ unlock_page(page_head);
+ if (page != page_head) {
+ VM_BUG_ON(!PageTail(page));
+ /* releasing compound_lock after page_lock won't matter */
compound_unlock(page_head);
- unlock_page(page_head);
+ put_page(page_head);
+ }
put_page(page);
return 0;
}
--
* [PATCH 05 of 31] fix bad_page to show the real reason the page is bad
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (3 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 04 of 31] update futex compound knowledge Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 06 of 31] clear compound mapping Andrea Arcangeli
` (26 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
page_count shows the count of the head page, but the actual check is done on
the tail page, so show what is really being checked.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -265,7 +265,7 @@ static void bad_page(struct page *page)
current->comm, page_to_pfn(page));
printk(KERN_ALERT
"page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n",
- page, (void *)page->flags, page_count(page),
+ page, (void *)page->flags, atomic_read(&page->_count),
page_mapcount(page), page->mapping, page->index);
dump_stack();
--
* [PATCH 06 of 31] clear compound mapping
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (4 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 05 of 31] fix bad_page to show the real reason the page is bad Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 07 of 31] add native_set_pmd_at Andrea Arcangeli
` (25 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Clear the compound mapping for anonymous compound pages, as already happens
for regular anonymous pages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -584,6 +584,8 @@ static void __free_pages_ok(struct page
kmemcheck_free_shadow(page, order);
+ if (PageAnon(page))
+ page->mapping = NULL;
for (i = 0 ; i < (1 << order) ; ++i)
bad += free_pages_check(page + i);
if (bad)
--
* [PATCH 07 of 31] add native_set_pmd_at
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (5 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 06 of 31] clear compound mapping Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 08 of 31] add pmd paravirt ops Andrea Arcangeli
` (24 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Used by both the paravirt and non-paravirt set_pmd_at.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -528,6 +528,12 @@ static inline void native_set_pte_at(str
native_set_pte(ptep, pte);
}
+static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp , pmd_t pmd)
+{
+ native_set_pmd(pmdp, pmd);
+}
+
#ifndef CONFIG_PARAVIRT
/*
* Rules for using pte_update - it must be called after any PTE update which
--
* [PATCH 08 of 31] add pmd paravirt ops
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (6 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 07 of 31] add native_set_pmd_at Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 09 of 31] no paravirt version of pmd ops Andrea Arcangeli
` (23 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add the pmd_update/pmd_update_defer/set_pmd_at paravirt ops. Not all of them
may be necessary (VMware needs pmd_update, Xen needs set_pmd_at, nobody needs
pmd_update_defer), but this keeps full symmetry with the pte paravirt ops,
which looks cleaner and simpler from a common code POV.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -449,6 +449,11 @@ static inline void pte_update(struct mm_
{
PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep);
}
+static inline void pmd_update(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp)
+{
+ PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp);
+}
static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
@@ -456,6 +461,12 @@ static inline void pte_update_defer(stru
PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
}
+static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp)
+{
+ PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp);
+}
+
static inline pte_t __pte(pteval_t val)
{
pteval_t ret;
@@ -557,6 +568,18 @@ static inline void set_pte_at(struct mm_
PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp, pmd_t pmd)
+{
+ if (sizeof(pmdval_t) > sizeof(long))
+ /* 5 arg words */
+ pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
+ else
+ PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+}
+#endif
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
pmdval_t val = native_pmd_val(pmd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -266,10 +266,16 @@ struct pv_mmu_ops {
void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval);
void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+ void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp, pmd_t pmdval);
void (*pte_update)(struct mm_struct *mm, unsigned long addr,
pte_t *ptep);
void (*pte_update_defer)(struct mm_struct *mm,
unsigned long addr, pte_t *ptep);
+ void (*pmd_update)(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp);
+ void (*pmd_update_defer)(struct mm_struct *mm,
+ unsigned long addr, pmd_t *pmdp);
pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr,
pte_t *ptep);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = {
.set_pte = native_set_pte,
.set_pte_at = native_set_pte_at,
.set_pmd = native_set_pmd,
+ .set_pmd_at = native_set_pmd_at,
.pte_update = paravirt_nop,
.pte_update_defer = paravirt_nop,
+ .pmd_update = paravirt_nop,
+ .pmd_update_defer = paravirt_nop,
.ptep_modify_prot_start = __ptep_modify_prot_start,
.ptep_modify_prot_commit = __ptep_modify_prot_commit,
--
* [PATCH 09 of 31] no paravirt version of pmd ops
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (7 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 08 of 31] add pmd paravirt ops Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 10 of 31] export maybe_mkwrite Andrea Arcangeli
` (22 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -33,6 +33,7 @@ extern struct list_head pgd_list;
#else /* !CONFIG_PARAVIRT */
#define set_pte(ptep, pte) native_set_pte(ptep, pte)
#define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte)
+#define set_pmd_at(mm, addr, pmdp, pmd) native_set_pmd_at(mm, addr, pmdp, pmd)
#define set_pte_atomic(ptep, pte) \
native_set_pte_atomic(ptep, pte)
@@ -57,6 +58,8 @@ extern struct list_head pgd_list;
#define pte_update(mm, addr, ptep) do { } while (0)
#define pte_update_defer(mm, addr, ptep) do { } while (0)
+#define pmd_update(mm, addr, ptep) do { } while (0)
+#define pmd_update_defer(mm, addr, ptep) do { } while (0)
#define pgd_val(x) native_pgd_val(x)
#define __pgd(x) native_make_pgd(x)
--
* [PATCH 10 of 31] export maybe_mkwrite
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (8 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 09 of 31] no paravirt version of pmd ops Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 11 of 31] comment reminder in destroy_compound_page Andrea Arcangeli
` (21 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
huge_memory.c needs it too, when it falls back to copying hugepages into
regular fragmented pages if hugepage allocation fails during COW.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
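For context, the usual write-fault pattern it is used in, roughly as in
mm/memory.c's do_wp_page(); the hugepage COW fallback is assumed to do the
same for each regular page it copies into:

	pte_t entry;

	entry = mk_pte(new_page, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	set_pte_at(mm, address, page_table, entry);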
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -387,6 +387,19 @@ static inline void set_compound_order(st
}
/*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
+ * servicing faults for write access. In the normal case, do always want
+ * pte_mkwrite. But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_WRITE))
+ pte = pte_mkwrite(pte);
+ return pte;
+}
+
+/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
* zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1943,19 +1943,6 @@ static inline int pte_unmap_same(struct
return same;
}
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
- * servicing faults for write access. In the normal case, do always want
- * pte_mkwrite. But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
- if (likely(vma->vm_flags & VM_WRITE))
- pte = pte_mkwrite(pte);
- return pte;
-}
-
static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
{
/*
--
* [PATCH 11 of 31] comment reminder in destroy_compound_page
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (9 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 10 of 31] export maybe_mkwrite Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 12 of 31] config_transparent_hugepage Andrea Arcangeli
` (20 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent
on its internal behavior.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -311,6 +311,7 @@ void prep_compound_page(struct page *pag
}
}
+/* update __split_huge_page_refcount if you change this function */
static int destroy_compound_page(struct page *page, unsigned long order)
{
int i;
--
* [PATCH 12 of 31] config_transparent_hugepage
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (10 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 11 of 31] comment reminder in destroy_compound_page Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 13 of 31] special pmd_trans_* functions Andrea Arcangeli
` (19 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add config option.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -283,3 +283,17 @@ config NOMMU_INITIAL_TRIM_EXCESS
of 1 says that all excess pages should be trimmed.
See Documentation/nommu-mmap.txt for more information.
+
+config TRANSPARENT_HUGEPAGE
+ bool "Transparent Hugepage support" if EMBEDDED
+ depends on X86_64
+ default y
+ help
+ Transparent Hugepages allows the kernel to use huge pages and
+ huge tlb transparently to the applications whenever possible.
+ This feature can improve computing performance to certain
+ applications by speeding up page faults during memory
+ allocation, by reducing the number of tlb misses and by speeding
+ up the pagetable walking.
+
+ If memory constrained on embedded, you may want to say N.
--
* [PATCH 13 of 31] special pmd_trans_* functions
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (11 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 12 of 31] config_transparent_hugepage Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 14 of 31] add pmd mangling generic functions Andrea Arcangeli
` (18 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
These return 0 at compile time when the config option is disabled, allowing
gcc to eliminate the transparent hugepage function calls at compile time
without additional #ifdefs (only the declarations of those functions have to
be visible to gcc, but they won't be required at link time and huge_memory.o
need not be built at all).
_PAGE_BIT_UNUSED1 is never used on a pmd, only on a pte, so it can be reused
as _PAGE_BIT_SPLITTING.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
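The effect on callers, as a sketch; the hugepage handler named in the branch
is only a placeholder for the entry points added later in the series:

	if (pmd_trans_huge(*pmd)) {
		/* hugepage path, e.g. a do_huge_pmd_*() handler */
	} else {
		/* regular 4k pte path */
	}

With CONFIG_TRANSPARENT_HUGEPAGE=n, pmd_trans_huge() is the constant 0, so
gcc discards the whole branch and no symbol from the (unbuilt) huge_memory.o
is referenced at link time.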
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -168,6 +168,19 @@ extern void cleanup_highmap(void);
#define kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK)
#define __HAVE_ARCH_PTE_SAME
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
/* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
#define __HAVE_ARCH_PTE_SPECIAL
#ifdef CONFIG_KMEMCHECK
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -344,6 +344,11 @@ extern void untrack_pfn_vma(struct vm_ar
unsigned long size);
#endif
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_trans_huge(pmd) 0
+#define pmd_trans_splitting(pmd) ({ BUG(); 0; })
+#endif
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_GENERIC_PGTABLE_H */
--
* [PATCH 14 of 31] add pmd mangling generic functions
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (12 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 13 of 31] special pmd_trans_* functions Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 15 of 31] add pmd mangling functions to x86 Andrea Arcangeli
` (17 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Some of these are needed only to build, and are not actually used on archs
that don't support transparent hugepages. Others, like pmdp_clear_flush, are
used by x86 too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -25,6 +25,26 @@
})
#endif
+#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+ ({ \
+ int __changed = !pmd_same(*(__pmdp), __entry); \
+ VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \
+ if (__changed) { \
+ set_pmd_at((__vma)->vm_mm, __address, __pmdp, \
+ __entry); \
+ flush_tlb_range(__vma, __address, \
+ (__address) + HPAGE_PMD_SIZE); \
+ } \
+ __changed; \
+ })
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+ ({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define ptep_test_and_clear_young(__vma, __address, __ptep) \
({ \
@@ -39,6 +59,25 @@
})
#endif
+#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp) \
+({ \
+ pmd_t __pmd = *(__pmdp); \
+ int r = 1; \
+ if (!pmd_young(__pmd)) \
+ r = 0; \
+ else \
+ set_pmd_at((__vma)->vm_mm, (__address), \
+ (__pmdp), pmd_mkold(__pmd)); \
+ r; \
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp) \
+ ({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young(__vma, __address, __ptep) \
({ \
@@ -50,6 +89,24 @@
})
#endif
+#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_clear_flush_young(__vma, __address, __pmdp) \
+({ \
+ int __young; \
+ VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \
+ __young = pmdp_test_and_clear_young(__vma, __address, __pmdp); \
+ if (__young) \
+ flush_tlb_range(__vma, __address, \
+ (__address) + HPAGE_PMD_SIZE); \
+ __young; \
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_clear_flush_young(__vma, __address, __pmdp) \
+ ({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear(__mm, __address, __ptep) \
({ \
@@ -59,6 +116,20 @@
})
#endif
+#ifndef __HAVE_ARCH_PMDP_GET_AND_CLEAR
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_get_and_clear(__mm, __address, __pmdp) \
+({ \
+ pmd_t __pmd = *(__pmdp); \
+ pmd_clear((__mm), (__address), (__pmdp)); \
+ __pmd; \
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_get_and_clear(__mm, __address, __pmdp) \
+ ({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
#define ptep_get_and_clear_full(__mm, __address, __ptep, __full) \
({ \
@@ -90,6 +161,22 @@ do { \
})
#endif
+#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_clear_flush(__vma, __address, __pmdp) \
+({ \
+ pmd_t __pmd; \
+ VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \
+ __pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp); \
+ flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+ __pmd; \
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_clear_flush(__vma, __address, __pmdp) \
+ ({ BUG(); 0 })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
struct mm_struct;
static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
@@ -99,10 +186,45 @@ static inline void ptep_set_wrprotect(st
}
#endif
+#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp)
+{
+ pmd_t old_pmd = *pmdp;
+ set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_set_wrprotect(mm, address, pmdp) BUG()
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_splitting_flush(__vma, __address, __pmdp) \
+({ \
+ pmd_t __pmd = pmd_mksplitting(*(__pmdp)); \
+ VM_BUG_ON((__address) & ~HPAGE_PMD_MASK); \
+ set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd); \
+ /* tlb flush only to serialize against gup-fast */ \
+ flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_splitting_flush(__vma, __address, __pmdp) BUG()
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
#define pte_same(A,B) (pte_val(A) == pte_val(B))
#endif
+#ifndef __HAVE_ARCH_PMD_SAME
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_same(A,B) (pmd_val(A) == pmd_val(B))
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmd_same(A,B) ({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
#ifndef __HAVE_ARCH_PAGE_TEST_DIRTY
#define page_test_dirty(page) (0)
#endif
@@ -347,6 +469,9 @@ extern void untrack_pfn_vma(struct vm_ar
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd) 0
#define pmd_trans_splitting(pmd) ({ BUG(); 0; })
+#ifndef __HAVE_ARCH_PMD_WRITE
+#define pmd_write(pmd) ({ BUG(); 0; })
+#endif /* __HAVE_ARCH_PMD_WRITE */
#endif
#endif /* !__ASSEMBLY__ */
--
* [PATCH 15 of 31] add pmd mangling functions to x86
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (13 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 14 of 31] add pmd mangling generic functions Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 16 of 31] bail out gup_fast on splitting pmd Andrea Arcangeli
` (16 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add the needed pmd mangling functions, in symmetry with their pte
counterparts. pmdp_splitting_flush is the only one with no pte counterpart:
it's needed to serialize the VM against split_huge_page. It atomically sets
the splitting bit in the same way pmdp_clear_flush_young atomically clears
the accessed bit (and both need to flush the tlb to make it effective, which
must happen synchronously for pmdp_splitting_flush).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
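For illustration, how these helpers are expected to compose when
huge_memory.c later materializes a writable huge pmd (sketch only, local
variable names assumed):

	pmd_t entry;

	entry = mk_pmd(page, vma->vm_page_prot);
	entry = pmd_mkhuge(pmd_mkdirty(pmd_mkwrite(entry)));
	set_pmd_at(mm, haddr, pmd, entry);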
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -300,15 +300,15 @@ pmd_t *populate_extra_pmd(unsigned long
pte_t *populate_extra_pte(unsigned long vaddr);
#endif /* __ASSEMBLY__ */
+#ifndef __ASSEMBLY__
+#include <linux/mm_types.h>
+
#ifdef CONFIG_X86_32
# include "pgtable_32.h"
#else
# include "pgtable_64.h"
#endif
-#ifndef __ASSEMBLY__
-#include <linux/mm_types.h>
-
static inline int pte_none(pte_t pte)
{
return !pte.pte;
@@ -351,7 +351,7 @@ static inline unsigned long pmd_page_vad
* Currently stuck as a macro due to indirect forward reference to
* linux/mmzone.h's __section_mem_map_addr() definition:
*/
-#define pmd_page(pmd) pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
+#define pmd_page(pmd) pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
/*
* the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -72,6 +72,19 @@ static inline pte_t native_ptep_get_and_
#endif
}
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+#ifdef CONFIG_SMP
+ return native_make_pmd(xchg(&xp->pmd, 0));
+#else
+ /* native_local_pmdp_get_and_clear,
+ but duplicated because of cyclic dependency */
+ pmd_t ret = *xp;
+ native_pmd_clear(NULL, 0, xp);
+ return ret;
+#endif
+}
+
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
*pmdp = pmd;
@@ -181,6 +194,98 @@ static inline int pmd_trans_huge(pmd_t p
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))
+
+#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp,
+ pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
+
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp)
+{
+ pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+ pmd_update(mm, addr, pmdp);
+ return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pmd_t *pmdp)
+{
+ clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
+ pmd_update(mm, addr, pmdp);
+}
+
+static inline int pmd_young(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+ pmdval_t v = native_pmd_val(pmd);
+
+ return native_make_pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+ pmdval_t v = native_pmd_val(pmd);
+
+ return native_make_pmd(v & ~clear);
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_RW);
+}
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -288,6 +288,25 @@ int ptep_set_access_flags(struct vm_area
return changed;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp,
+ pmd_t entry, int dirty)
+{
+ int changed = !pmd_same(*pmdp, entry);
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ if (changed && dirty) {
+ *pmdp = entry;
+ pmd_update_defer(vma->vm_mm, address, pmdp);
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ }
+
+ return changed;
+}
+#endif
+
int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
@@ -303,6 +322,23 @@ int ptep_test_and_clear_young(struct vm_
return ret;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp)
+{
+ int ret = 0;
+
+ if (pmd_young(*pmdp))
+ ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+ (unsigned long *) &pmdp->pmd);
+
+ if (ret)
+ pmd_update(vma->vm_mm, addr, pmdp);
+
+ return ret;
+}
+#endif
+
int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
@@ -315,6 +351,36 @@ int ptep_clear_flush_young(struct vm_are
return young;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ int young;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ young = pmdp_test_and_clear_young(vma, address, pmdp);
+ if (young)
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+
+ return young;
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ int set;
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
+ (unsigned long *)&pmdp->pmd);
+ if (set) {
+ pmd_update(vma->vm_mm, address, pmdp);
+ /* need tlb flush only to serialize against gup-fast */
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ }
+}
+#endif
+
/**
* reserve_top_address - reserves a hole in the top of kernel address space
* @reserve - size of hole to reserve
* [PATCH 16 of 31] bail out gup_fast on splitting pmd
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (14 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 15 of 31] add pmd mangling functions to x86 Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 17 of 31] pte alloc trans splitting Andrea Arcangeli
` (15 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Force gup_fast to take the slow path and block if the pmd is splitting, not
only if it's none.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -160,7 +160,18 @@ static int gup_pmd_range(pud_t pud, unsi
pmd_t pmd = *pmdp;
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ /*
+ * The pmd_trans_splitting() check below explains why
+ * pmdp_splitting_flush has to flush the tlb, to stop
+ * this gup-fast code from running while we set the
+ * splitting bit in the pmd. Returning zero will take
+ * the slow path that will call wait_split_huge_page()
+ * if the pmd is still in splitting state. gup-fast
+ * can't because it has irq disabled and
+ * wait_split_huge_page() would never return as the
+ * tlb flush IPI wouldn't run.
+ */
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
* [PATCH 17 of 31] pte alloc trans splitting
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (15 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 16 of 31] bail out gup_fast on splitting pmd Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 18 of 31] add pmd mmu_notifier helpers Andrea Arcangeli
` (14 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
pte alloc routines must wait for split_huge_page if the pmd is not
present but not none (i.e. pmd_trans_splitting). The additional
branches are optimized away at compile time by pmd_trans_splitting if
the config option is off. However we must pass the vma down in order
to know which anon_vma lock to wait for.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -955,7 +955,8 @@ static inline int __pmd_alloc(struct mm_
int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
#endif
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long address);
int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
/*
@@ -1024,12 +1025,14 @@ static inline void pgtable_page_dtor(str
pte_unmap(pte); \
} while (0)
-#define pte_alloc_map(mm, pmd, address) \
- ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
- NULL: pte_offset_map(pmd, address))
+#define pte_alloc_map(mm, vma, pmd, address) \
+ ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, \
+ pmd, address))? \
+ NULL: pte_offset_map(pmd, address))
#define pte_alloc_map_lock(mm, pmd, address, ptlp) \
- ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+ ((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL, \
+ pmd, address))? \
NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
#define pte_alloc_kernel(pmd, address) \
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -324,9 +324,11 @@ void free_pgtables(struct mmu_gather *tl
}
}
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long address)
{
pgtable_t new = pte_alloc_one(mm, address);
+ int wait_split_huge_page;
if (!new)
return -ENOMEM;
@@ -346,14 +348,18 @@ int __pte_alloc(struct mm_struct *mm, pm
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
spin_lock(&mm->page_table_lock);
- if (!pmd_present(*pmd)) { /* Has another populated it ? */
+ wait_split_huge_page = 0;
+ if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
mm->nr_ptes++;
pmd_populate(mm, pmd, new);
new = NULL;
- }
+ } else if (unlikely(pmd_trans_splitting(*pmd)))
+ wait_split_huge_page = 1;
spin_unlock(&mm->page_table_lock);
if (new)
pte_free(mm, new);
+ if (wait_split_huge_page)
+ wait_split_huge_page(vma->anon_vma, pmd);
return 0;
}
@@ -366,10 +372,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
smp_wmb(); /* See comment in __pte_alloc */
spin_lock(&init_mm.page_table_lock);
- if (!pmd_present(*pmd)) { /* Has another populated it ? */
+ if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
new = NULL;
- }
+ } else
+ VM_BUG_ON(pmd_trans_splitting(*pmd));
spin_unlock(&init_mm.page_table_lock);
if (new)
pte_free_kernel(&init_mm, new);
@@ -3020,7 +3027,7 @@ int handle_mm_fault(struct mm_struct *mm
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
- pte = pte_alloc_map(mm, pmd, address);
+ pte = pte_alloc_map(mm, vma, pmd, address);
if (!pte)
return VM_FAULT_OOM;
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -48,7 +48,8 @@ static pmd_t *get_old_pmd(struct mm_stru
return pmd;
}
-static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
@@ -63,7 +64,7 @@ static pmd_t *alloc_new_pmd(struct mm_st
if (!pmd)
return NULL;
- if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
+ if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr))
return NULL;
return pmd;
@@ -148,7 +149,7 @@ unsigned long move_page_tables(struct vm
old_pmd = get_old_pmd(vma->vm_mm, old_addr);
if (!old_pmd)
continue;
- new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+ new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
next = (new_addr + PMD_SIZE) & PMD_MASK;
* [PATCH 18 of 31] add pmd mmu_notifier helpers
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (16 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 17 of 31] pte alloc trans splitting Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 19 of 31] clear page compound Andrea Arcangeli
` (13 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add mmu notifier helpers to handle pmd huge operations.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr
__pte; \
})
+#define pmdp_clear_flush_notify(__vma, __address, __pmdp) \
+({ \
+ pmd_t __pmd; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \
+ mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \
+ (__address)+HPAGE_PMD_SIZE);\
+ __pmd = pmdp_clear_flush(___vma, ___address, __pmdp); \
+ mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \
+ (__address)+HPAGE_PMD_SIZE); \
+ __pmd; \
+})
+
+#define pmdp_splitting_flush_notify(__vma, __address, __pmdp) \
+({ \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \
+ mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \
+ (__address)+HPAGE_PMD_SIZE);\
+ pmdp_splitting_flush(___vma, ___address, __pmdp); \
+ mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \
+ (__address)+HPAGE_PMD_SIZE); \
+})
+
#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
({ \
int __young; \
@@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr
__young; \
})
+#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = pmdp_clear_flush_young(___vma, ___address, __pmdp); \
+ __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
+ ___address); \
+ __young; \
+})
+
#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
({ \
struct mm_struct *___mm = __mm; \
@@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr
}
#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
#define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_splitting_flush_notify pmdp_splitting_flush
#define set_pte_at_notify set_pte_at
#endif /* CONFIG_MMU_NOTIFIER */
* [PATCH 19 of 31] clear page compound
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (17 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 18 of 31] add pmd mmu_notifier helpers Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 20 of 31] add pmd_huge_pte to mm_struct Andrea Arcangeli
` (12 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
split_huge_page must transform a compound page to a regular page and needs
ClearPageCompound.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -349,7 +349,7 @@ static inline void set_page_writeback(st
* tests can be used in performance sensitive paths. PageCompound is
* generally not used in hot code paths.
*/
-__PAGEFLAG(Head, head)
+__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
static inline int PageCompound(struct page *page)
@@ -357,6 +357,13 @@ static inline int PageCompound(struct pa
return page->flags & ((1L << PG_head) | (1L << PG_tail));
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+ BUG_ON(!PageHead(page));
+ ClearPageHead(page);
+}
+#endif
#else
/*
* Reduce page flag use as much as possible by overlapping
@@ -394,6 +401,14 @@ static inline void __ClearPageTail(struc
page->flags &= ~PG_head_tail_mask;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+ BUG_ON((page->flags & PG_head_tail_mask) != (1L << PG_compound));
+ clear_bit(PG_compound, &page->flags);
+}
+#endif
+
#endif /* !PAGEFLAGS_EXTENDED */
#ifdef CONFIG_MMU
* [PATCH 20 of 31] add pmd_huge_pte to mm_struct
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (18 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 19 of 31] clear page compound Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 21 of 31] split_huge_page_mm/vma Andrea Arcangeli
` (11 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
This increases the size of the mm struct a bit, but it is needed to preallocate
one pte for each hugepage so that split_huge_page will not require a fail path.
Guarantee of success is a fundamental property of split_huge_page, to avoid
decreasing swapping reliability and to avoid adding -ENOMEM fail paths that
would otherwise force the hugepage-unaware VM code to learn rolling back in the
middle of its pte mangling operations (if anything we need it to learn
handling pmd_trans_huge natively rather than being capable of rollback). When
split_huge_page runs, a pte page is needed for the split to succeed, to map the
newly split regular pages with regular ptes. This way all existing VM code
remains backwards compatible by just adding a split_huge_page* one-liner. The
memory waste of those preallocated ptes is negligible, so it is worth it.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
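For reference, a rough sketch of how the preallocated pte page gets deposited at
hugepage fault time and withdrawn again when split_huge_page needs it
(illustration only: the helper names below are made up, the series itself
exposes the withdraw side as get_pmd_huge_pte(); the only things taken from this
patch are mm->pmd_huge_pte and its page_table_lock protection):

static void deposit_huge_pte(struct mm_struct *mm, pgtable_t pgtable)
{
	assert_spin_locked(&mm->page_table_lock);
	/* chain any extra preallocated pte pages off the first one */
	if (!mm->pmd_huge_pte)
		INIT_LIST_HEAD(&pgtable->lru);
	else
		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
	mm->pmd_huge_pte = pgtable;
}

static pgtable_t withdraw_huge_pte(struct mm_struct *mm)
{
	pgtable_t pgtable = mm->pmd_huge_pte;

	assert_spin_locked(&mm->page_table_lock);
	if (list_empty(&pgtable->lru))
		mm->pmd_huge_pte = NULL;
	else {
		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
					      struct page, lru);
		list_del(&pgtable->lru);
	}
	return pgtable;
}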
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -291,6 +291,9 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+#endif
};
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -498,6 +498,9 @@ void __mmdrop(struct mm_struct *mm)
mm_free_pgd(mm);
destroy_context(mm);
mmu_notifier_mm_destroy(mm);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ VM_BUG_ON(mm->pmd_huge_pte);
+#endif
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
@@ -638,6 +641,10 @@ struct mm_struct *dup_mm(struct task_str
mm->token_priority = 0;
mm->last_interval = 0;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ mm->pmd_huge_pte = NULL;
+#endif
+
if (!mm_init(mm, tsk))
goto fail_nomem;
* [PATCH 21 of 31] split_huge_page_mm/vma
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (19 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 20 of 31] add pmd_huge_pte to mm_struct Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 22 of 31] split_huge_page paging Andrea Arcangeli
` (10 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
split_huge_page_mm/vma compat code. Each one of those callers would need to be
expanded to hundreds of lines of complex code without a fully reliable
split_huge_page_mm/vma functionality.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
+ split_huge_page_mm(mm, 0xA0000, pmd);
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -446,6 +446,7 @@ static inline int check_pmd_range(struct
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ split_huge_page_vma(vma, pmd);
if (pmd_none_or_clear_bad(pmd))
continue;
if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -132,6 +132,7 @@ static long do_mincore(unsigned long add
if (pud_none_or_clear_bad(pud))
goto none_mapped;
pmd = pmd_offset(pud, addr);
+ split_huge_page_vma(vma, pmd);
if (pmd_none_or_clear_bad(pmd))
goto none_mapped;
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -89,6 +89,7 @@ static inline void change_pmd_range(stru
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ split_huge_page_mm(mm, addr, pmd);
if (pmd_none_or_clear_bad(pmd))
continue;
change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,6 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
return NULL;
pmd = pmd_offset(pud, addr);
+ split_huge_page_mm(mm, addr, pmd);
if (pmd_none_or_clear_bad(pmd))
return NULL;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -34,6 +34,7 @@ static int walk_pmd_range(pud_t *pud, un
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ split_huge_page_mm(walk->mm, addr, pmd);
if (pmd_none_or_clear_bad(pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
* [PATCH 22 of 31] split_huge_page paging
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (20 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 21 of 31] split_huge_page_mm/vma Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 23 of 31] clear_copy_huge_page Andrea Arcangeli
` (9 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Paging logic that splits the page before it is unmapped and added to swap, to
ensure backwards compatibility with the legacy swap code. Eventually swap
should natively page out the hugepages, to increase performance and decrease
seeking and fragmentation of swap space. swapoff can just skip over huge pmds as
they cannot be part of swap yet. In add_to_swap be careful to split the page
only if we got a valid swap entry, so we don't split hugepages when swap is full.
In theory we could split pages before isolating them during the lru scan, but
for khugepaged to be safe I'm relying on either mmap_sem write mode or
PG_lock being taken, so split_huge_page has to run either with mmap_sem in
read/write mode or with PG_lock taken. Calling it from isolate_lru_page would
make locking more complicated, and in addition split_huge_page would deadlock
if called by __isolate_lru_page because it has to take the lru lock to add the
tail pages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -378,6 +378,8 @@ static void collect_procs_anon(struct pa
struct task_struct *tsk;
struct anon_vma *av;
+ if (unlikely(split_huge_page(page)))
+ return;
read_lock(&tasklist_lock);
av = page_lock_anon_vma(page);
if (av == NULL) /* Not actually mapped anymore */
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1174,6 +1174,7 @@ int try_to_unmap(struct page *page, enum
int ret;
BUG_ON(!PageLocked(page));
+ BUG_ON(PageTransHuge(page));
if (unlikely(PageKsm(page)))
ret = try_to_unmap_ksm(page, flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -156,6 +156,12 @@ int add_to_swap(struct page *page)
if (!entry.val)
return 0;
+ if (unlikely(PageTransHuge(page)))
+ if (unlikely(split_huge_page(page))) {
+ swapcache_free(entry, NULL);
+ return 0;
+ }
+
/*
* Radix-tree node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -905,6 +905,8 @@ static inline int unuse_pmd_range(struct
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (unlikely(pmd_trans_huge(*pmd)))
+ continue;
if (pmd_none_or_clear_bad(pmd))
continue;
ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
* [PATCH 23 of 31] clear_copy_huge_page
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (21 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 22 of 31] split_huge_page paging Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 24 of 31] kvm mmu transparent hugepage support Andrea Arcangeli
` (8 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Move the copy/clear_huge_page functions to common code to share between
hugetlb.c and huge_memory.c.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1386,5 +1386,14 @@ extern void shake_page(struct page *p, i
extern atomic_long_t mce_bad_pages;
extern int soft_offline_page(struct page *page, int flags);
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_huge_page(struct page *page,
+ unsigned long addr,
+ unsigned int pages_per_huge_page);
+extern void copy_huge_page(struct page *dst, struct page *src,
+ unsigned long addr, struct vm_area_struct *vma,
+ unsigned int pages_per_huge_page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -385,70 +385,6 @@ static int vma_has_reserves(struct vm_ar
return 0;
}
-static void clear_gigantic_page(struct page *page,
- unsigned long addr, unsigned long sz)
-{
- int i;
- struct page *p = page;
-
- might_sleep();
- for (i = 0; i < sz/PAGE_SIZE; i++, p = mem_map_next(p, page, i)) {
- cond_resched();
- clear_user_highpage(p, addr + i * PAGE_SIZE);
- }
-}
-static void clear_huge_page(struct page *page,
- unsigned long addr, unsigned long sz)
-{
- int i;
-
- if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
- clear_gigantic_page(page, addr, sz);
- return;
- }
-
- might_sleep();
- for (i = 0; i < sz/PAGE_SIZE; i++) {
- cond_resched();
- clear_user_highpage(page + i, addr + i * PAGE_SIZE);
- }
-}
-
-static void copy_gigantic_page(struct page *dst, struct page *src,
- unsigned long addr, struct vm_area_struct *vma)
-{
- int i;
- struct hstate *h = hstate_vma(vma);
- struct page *dst_base = dst;
- struct page *src_base = src;
- might_sleep();
- for (i = 0; i < pages_per_huge_page(h); ) {
- cond_resched();
- copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
-
- i++;
- dst = mem_map_next(dst, dst_base, i);
- src = mem_map_next(src, src_base, i);
- }
-}
-static void copy_huge_page(struct page *dst, struct page *src,
- unsigned long addr, struct vm_area_struct *vma)
-{
- int i;
- struct hstate *h = hstate_vma(vma);
-
- if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
- copy_gigantic_page(dst, src, addr, vma);
- return;
- }
-
- might_sleep();
- for (i = 0; i < pages_per_huge_page(h); i++) {
- cond_resched();
- copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
- }
-}
-
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
@@ -2334,7 +2270,8 @@ retry_avoidcopy:
return -PTR_ERR(new_page);
}
- copy_huge_page(new_page, old_page, address, vma);
+ copy_huge_page(new_page, old_page, address, vma,
+ pages_per_huge_page(h));
__SetPageUptodate(new_page);
/*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3396,3 +3396,73 @@ void might_fault(void)
}
EXPORT_SYMBOL(might_fault);
#endif
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+static void clear_gigantic_page(struct page *page,
+ unsigned long addr,
+ unsigned int pages_per_huge_page)
+{
+ int i;
+ struct page *p = page;
+
+ might_sleep();
+ for (i = 0; i < pages_per_huge_page;
+ i++, p = mem_map_next(p, page, i)) {
+ cond_resched();
+ clear_user_highpage(p, addr + i * PAGE_SIZE);
+ }
+}
+void clear_huge_page(struct page *page,
+ unsigned long addr, unsigned int pages_per_huge_page)
+{
+ int i;
+
+ if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+ clear_gigantic_page(page, addr, pages_per_huge_page);
+ return;
+ }
+
+ might_sleep();
+ for (i = 0; i < pages_per_huge_page; i++) {
+ cond_resched();
+ clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+ }
+}
+
+static void copy_gigantic_page(struct page *dst, struct page *src,
+ unsigned long addr,
+ struct vm_area_struct *vma,
+ unsigned int pages_per_huge_page)
+{
+ int i;
+ struct page *dst_base = dst;
+ struct page *src_base = src;
+ might_sleep();
+ for (i = 0; i < pages_per_huge_page; ) {
+ cond_resched();
+ copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
+
+ i++;
+ dst = mem_map_next(dst, dst_base, i);
+ src = mem_map_next(src, src_base, i);
+ }
+}
+void copy_huge_page(struct page *dst, struct page *src,
+ unsigned long addr, struct vm_area_struct *vma,
+ unsigned int pages_per_huge_page)
+{
+ int i;
+
+ if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+ copy_gigantic_page(dst, src, addr, vma, pages_per_huge_page);
+ return;
+ }
+
+ might_sleep();
+ for (i = 0; i < pages_per_huge_page; i++) {
+ cond_resched();
+ copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE,
+ vma);
+ }
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
* [PATCH 24 of 31] kvm mmu transparent hugepage support
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (22 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 23 of 31] clear_copy_huge_page Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 25 of 31] transparent hugepage core Andrea Arcangeli
` (7 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Marcelo Tosatti <mtosatti@redhat.com>
This should work for both hugetlbfs and transparent hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -489,6 +489,15 @@ static int host_mapping_level(struct kvm
out:
up_read(¤t->mm->mmap_sem);
+ /* check for transparent hugepages */
+ if (page_size == PAGE_SIZE) {
+ struct page *page = gfn_to_page(kvm, gfn);
+
+ if (!is_error_page(page) && PageHead(page))
+ page_size = KVM_HPAGE_SIZE(2);
+ kvm_release_page_clean(page);
+ }
+
for (i = PT_PAGE_TABLE_LEVEL;
i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
if (page_size >= KVM_HPAGE_SIZE(i))
* [PATCH 25 of 31] transparent hugepage core
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (23 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 24 of 31] kvm mmu transparent hugepage support Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 17:57 ` Mel Gorman
2010-01-28 14:33 ` [PATCH 26 of 31] madvise(MADV_HUGEPAGE) Andrea Arcangeli
` (6 subsequent siblings)
31 siblings, 1 reply; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
non-empty)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal failures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guests not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
the linux guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...). Even more important is that the
tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).
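(For reference, the 24/19/15 figures above are consistent with simply counting
the cachelines touched by the two-dimensional NPT/EPT pagetable walk: with 4
guest levels and 4 host levels the walk touches up to (4+1)*(4+1)-1 = 24 lines,
host 2M mappings drop one host level giving (4+1)*(3+1)-1 = 19, and 2M mappings
on both sides give (3+1)*(3+1)-1 = 15.)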
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
way possible. We want hugepages and hugetlb to be used in a way that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).
The most important design choice is: always fall back to a 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...
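Concretely the fault path then looks roughly like this (a simplified sketch
built on the helpers declared below in huge_mm.h, not the literal code of this
patch):

	/* sketch of the pmd level of handle_mm_fault */
	pmd = pmd_alloc(mm, pud, address);
	if (!pmd)
		return VM_FAULT_OOM;
	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma) &&
	    !vma->vm_ops)	/* anonymous vma */
		/*
		 * Try to map a hugepage directly in the pmd; internally
		 * this falls back to the regular 4k pte path if the
		 * hugepage allocation fails, so userland never notices.
		 */
		return do_huge_pmd_anonymous_page(mm, vma, address,
						  pmd, flags);
	pte = pte_alloc_map(mm, vma, pmd, address);
	if (!pte)
		return VM_FAULT_OOM;
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);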
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages, and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page could return -ENOMEM (instead of the current void)
we'd need to roll back the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
and incremental, and it'll just be a "harmless" addition later if this
initial part is agreed upon. It should also be noted that, locking-wise,
replacing regular pages with hugepages is going to be very easy
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) and we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
slightest sign of trouble, and we can try again later. collapse_huge_page
will be similar to how KSM works, and madvise(MADV_HUGEPAGE) will
work similarly to madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to one of three values, "always", "madvise" or "never", which
mean respectively that hugepages are always used, only used inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls whether the hugepage
allocation should defrag memory aggressively "always", only inside "madvise"
regions, or "never".
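For example, with "madvise" selected an application opts a region in like this
(a minimal userspace sketch; the MADV_HUGEPAGE fallback define is only needed
until the libc headers pick up the value added by patch 01 of this series, and
error handling is trimmed):

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* generic value from patch 01 */
#endif

int main(void)
{
	size_t size = 256UL * 1024 * 1024;	/* 256M anonymous region */
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* mark the vma as worth backing with hugepages */
	if (madvise(p, size, MADV_HUGEPAGE))
		return 1;
	/* page faults in this region will now try hugepages first */
	memset(p, 0, size);
	return 0;
}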
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would always result in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point
of view. In short there is no current existing way to serialize the
O_DIRECT final put_page against split_huge_page_refcount so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by
the time gup_fast returns so...). And I didn't want to impact all
gup/gup_fast users for now; maybe if we change the gup interface
substantially we can avoid this locking. I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.
If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage so we can't avoid it to
solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.
Swap and oom work fine (well, just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, which didn't have a range check, but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.
NOTE: in some cases, if the L2 cache is small, this may slow things down and
waste memory during COWs, because 4M of memory is accessed in a single
fault instead of 8k (the payoff is that after the COW the program can run
faster). So we might want to switch copy_huge_page (and
clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache thrashing with a full prefault
logic that would COW in 8k/16k/32k/64k up to 1M chunks (I can send the
patches that fully implemented prefault) but I concluded they're not
worth it: they add huge additional complexity and they remove all tlb
benefits until the full hugepage has been faulted in, to save a little bit of
memory and some cache during app startup, and they still don't substantially
improve the cache thrashing during startup if the prefault happens in >4k
chunks. One reason is that the copied 4k pte entries are still mapped on a
perfectly cache-colored hugepage, so the thrashing is the worst one can generate
in those copies (COWs of 4k pages aren't as well colored, so they thrash
less, but again this results in software running faster after the page fault).
Those prefault patches allowed things like a pte page where post-COW entries
pointed to local 4k regular anon pages and the not-yet-COWed pte entries were
pointing into the middle of some hugepage mapped read-only. If it doesn't pay
off substantially with today's hardware it will pay off even less in the future
with larger L2 caches, and the prefault logic would bloat the VM a lot. For
embedded systems, transparent_hugepage can be disabled during boot with sysfs
or with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages to madvise regions), which will
ensure not a single hugepage is allocated at boot time. It is simple enough to
just disable transparent hugepages globally and let them be
allocated selectively by applications in MADV_HUGEPAGE regions (both at page
fault time and, if enabled, via collapse_huge_page through the kernel
daemon).
This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like power
have certain tlb limits that prevent mixing different page sizes in the same
regions, so they will not fit in this framework, which requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen to
match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
sizes like 1GByte that cannot be hoped to be found unfragmented after a
certain system uptime and that would be very expensive to defragment with
relocation, so they require reservation. hugetlbfs is the "reservation way"; the
point of transparent hugepages is not to have any reservation at all but to
maximize the use of cache and hugepages at all times automatically.
Some performance results:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,128 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+ PAGE_CHECK_ADDRESS_PMD_FLAG,
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ enum page_check_address_pmd_flag flag);
+
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define transparent_hugepage_enabled(__vma) \
+ (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
+ (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma) \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
+ (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow() \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
+extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
+extern int split_huge_page(struct page *page);
+#define split_huge_page_mm(__mm, __addr, __pmd) \
+ do { \
+ if (unlikely(pmd_trans_huge(*(__pmd)))) \
+ __split_huge_page_mm(__mm, __addr, __pmd); \
+ } while (0)
+#define split_huge_page_vma(__vma, __pmd) \
+ do { \
+ if (unlikely(pmd_trans_huge(*(__pmd)))) \
+ __split_huge_page_vma(__vma, __pmd); \
+ } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+ do { \
+ smp_mb(); \
+ spin_unlock_wait(&(__anon_vma)->lock); \
+ smp_mb(); \
+ VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
+ pmd_trans_huge(*(__pmd))); \
+ } while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+ VM_BUG_ON(PageTail(page));
+ return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define transparent_hugepage_enabled(__vma) 0
+#define transparent_hugepage_defrag(__vma) 0
+#define transparent_hugepage_debug_cow() 0
+
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+ return 0;
+}
+#define split_huge_page_mm(__mm, __addr, __pmd) \
+ do { } while (0)
+#define split_huge_page_vma(__vma, __pmd) \
+ do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+ do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void
#define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
#define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
+#endif
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -234,6 +237,7 @@ struct inode;
* files which need it (119 of them)
*/
#include <linux/page-flags.h>
+#include <linux/huge_mm.h>
/*
* Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
}
static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+ struct list_head *head)
+{
+ list_add(&page->lru, head);
+ __inc_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
- list_add(&page->lru, &zone->lru[l].list);
- __inc_zone_state(zone, NR_LRU_BASE + l);
- mem_cgroup_add_lru_list(page, l);
+ __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
}
static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
extern void __lru_cache_add(struct page *, enum lru_list lru);
extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+ struct page *page, struct page *page_tail);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,849 @@
+/*
+ * Copyright (C) 2009 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+ (1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag enabled,
+ enum transparent_hugepage_flag req_madv)
+{
+ if (test_bit(enabled, &transparent_hugepage_flags)) {
+ VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+ return sprintf(buf, "[always] madvise never\n");
+ } else if (test_bit(req_madv, &transparent_hugepage_flags))
+ return sprintf(buf, "always [madvise] never\n");
+ else
+ return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag enabled,
+ enum transparent_hugepage_flag req_madv)
+{
+ if (!memcmp("always", buf,
+ min(sizeof("always")-1, count))) {
+ set_bit(enabled, &transparent_hugepage_flags);
+ clear_bit(req_madv, &transparent_hugepage_flags);
+ } else if (!memcmp("madvise", buf,
+ min(sizeof("madvise")-1, count))) {
+ clear_bit(enabled, &transparent_hugepage_flags);
+ set_bit(req_madv, &transparent_hugepage_flags);
+ } else if (!memcmp("never", buf,
+ min(sizeof("never")-1, count))) {
+ clear_bit(enabled, &transparent_hugepage_flags);
+ clear_bit(req_madv, &transparent_hugepage_flags);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+ __ATTR(enabled, 0644, enabled_show, enabled_store);
+
+/*
+ * Currently uses __GFP_REPEAT during allocation. Should be
+ * implemented using page migration and real defrag algorithms in
+ * future VM.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+ __ATTR(defrag, 0644, defrag_show, defrag_store);
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t single_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag flag)
+{
+ if (test_bit(flag, &transparent_hugepage_flags))
+ return sprintf(buf, "[yes] no\n");
+ else
+ return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag flag)
+{
+ if (!memcmp("yes", buf,
+ min(sizeof("yes")-1, count))) {
+ set_bit(flag, &transparent_hugepage_flags);
+ } else if (!memcmp("no", buf,
+ min(sizeof("no")-1, count))) {
+ clear_bit(flag, &transparent_hugepage_flags);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static ssize_t debug_cow_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+ __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+ &enabled_attr.attr,
+ &defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+ &debug_cow_attr.attr,
+#endif
+ NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+ .attrs = hugepage_attr,
+ .name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init ksm_init(void)
+{
+#ifdef CONFIG_SYSFS
+ int err;
+
+ err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+ if (err)
+ printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+ return 0;
+}
+module_init(ksm_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+ if (!str)
+ return 0;
+ transparent_hugepage_flags = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+ struct mm_struct *mm)
+{
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ /* FIFO */
+ if (!mm->pmd_huge_pte)
+ INIT_LIST_HEAD(&pgtable->lru);
+ else
+ list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+ mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_WRITE))
+ pmd = pmd_mkwrite(pmd);
+ return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ struct page *page,
+ unsigned long haddr)
+{
+ int ret = 0;
+ pgtable_t pgtable;
+
+ VM_BUG_ON(!PageCompound(page));
+ pgtable = pte_alloc_one(mm, address);
+ if (unlikely(!pgtable)) {
+ put_page(page);
+ return VM_FAULT_OOM;
+ }
+
+ clear_huge_page(page, haddr, HPAGE_PMD_NR);
+ __SetPageUptodate(page);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to prevent the clear_huge_page writes
+ * from becoming visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_none(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ put_page(page);
+ pte_free(mm, pgtable);
+ } else {
+ pmd_t entry;
+ entry = mk_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+ page_add_new_anon_rmap(page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ add_mm_counter(mm, anon_rss, HPAGE_PMD_NR);
+ spin_unlock(&mm->page_table_lock);
+ }
+
+ return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+ return alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+ (defrag ? __GFP_REPEAT : 0)|__GFP_NOWARN,
+ HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ pte_t *pte;
+
+ if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ page = alloc_hugepage(transparent_hugepage_defrag(vma));
+ if (unlikely(!page))
+ goto out;
+
+ return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
+ page, haddr);
+ }
+out:
+ pte = pte_alloc_map(mm, vma, pmd, address);
+ if (!pte)
+ return VM_FAULT_OOM;
+ return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma)
+{
+ struct page *src_page;
+ pmd_t pmd;
+ pgtable_t pgtable;
+ int ret;
+
+ ret = -ENOMEM;
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;
+
+ spin_lock(&dst_mm->page_table_lock);
+ spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+ ret = -EAGAIN;
+ pmd = *src_pmd;
+ if (unlikely(!pmd_trans_huge(pmd)))
+ goto out_unlock;
+ if (unlikely(pmd_trans_splitting(pmd))) {
+ /* split huge page running from under us */
+ spin_unlock(&src_mm->page_table_lock);
+ spin_unlock(&dst_mm->page_table_lock);
+
+ wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ goto out;
+ }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON(!PageHead(src_page));
+ get_page(src_page);
+ page_dup_rmap(src_page);
+ add_mm_counter(dst_mm, anon_rss, HPAGE_PMD_NR);
+
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ pmd = pmd_mkold(pmd_wrprotect(pmd));
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ prepare_pmd_huge_pte(pgtable, dst_mm);
+
+ ret = 0;
+out_unlock:
+ spin_unlock(&src_mm->page_table_lock);
+ spin_unlock(&dst_mm->page_table_lock);
+out:
+ return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+ pgtable_t pgtable;
+
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ /* FIFO */
+ pgtable = mm->pmd_huge_pte;
+ if (list_empty(&pgtable->lru))
+ mm->pmd_huge_pte = NULL;
+ else {
+ mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ struct page, lru);
+ list_del(&pgtable->lru);
+ }
+ return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ pmd_t *pmd, pmd_t orig_pmd,
+ struct page *page,
+ unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int ret = 0, i;
+ struct page **pages;
+
+ pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+ GFP_KERNEL);
+ if (unlikely(!pages)) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+ vma, address);
+ if (unlikely(!pages[i])) {
+ while (--i >= 0)
+ put_page(pages[i]);
+ kfree(pages);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_free_pages;
+ else
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ copy_user_highpage(pages[i], page + i,
+ haddr + PAGE_SHIFT*i, vma);
+ __SetPageUptodate(pages[i]);
+ cond_resched();
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_free_pages;
+ else
+ put_page(page);
+
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = mk_pte(pages[i], vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(pages[i], vma, haddr);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ kfree(pages);
+
+ mm->nr_ptes++;
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ page_remove_rmap(page);
+ spin_unlock(&mm->page_table_lock);
+
+ ret |= VM_FAULT_WRITE;
+ put_page(page);
+
+out:
+ return ret;
+
+out_free_pages:
+ spin_unlock(&mm->page_table_lock);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ put_page(pages[i]);
+ kfree(pages);
+ goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+ int ret = 0;
+ struct page *page, *new_page;
+ unsigned long haddr;
+
+ VM_BUG_ON(!vma->anon_vma);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_unlock;
+
+ page = pmd_page(orig_pmd);
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ haddr = address & HPAGE_PMD_MASK;
+ if (page_mapcount(page) == 1) {
+ pmd_t entry;
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache(vma, address, entry);
+ ret |= VM_FAULT_WRITE;
+ goto out_unlock;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+ if (unlikely(transparent_hugepage_debug_cow()) && new_page) {
+ put_page(new_page);
+ new_page = NULL;
+ }
+ if (unlikely(!new_page))
+ return do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+
+ copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ __SetPageUptodate(new_page);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to prevent the copy_huge_page writes from
+ * becoming visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ put_page(new_page);
+ else {
+ pmd_t entry;
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ page_add_new_anon_rmap(new_page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache(vma, address, entry);
+ page_remove_rmap(page);
+ put_page(page);
+ ret |= VM_FAULT_WRITE;
+ }
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page = NULL;
+
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ if (flags & FOLL_WRITE && !pmd_write(*pmd))
+ goto out;
+
+ page = pmd_page(*pmd);
+ VM_BUG_ON(!PageHead(page));
+ if (flags & FOLL_TOUCH) {
+ pmd_t _pmd;
+ /*
+ * We should set the dirty bit only for FOLL_WRITE but
+ * for now the dirty bit in the pmd is meaningless.
+ * And if the dirty bit will become meaningful and
+ * we'll only set it with FOLL_WRITE, an atomic
+ * set_bit will be required on the pmd to set the
+ * young bit, instead of the current set_pmd_at.
+ */
+ _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+ }
+ page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+ VM_BUG_ON(!PageCompound(page));
+ if (flags & FOLL_GET)
+ get_page(page);
+
+out:
+ return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd)
+{
+ int ret = 0;
+
+ spin_lock(&tlb->mm->page_table_lock);
+ if (likely(pmd_trans_huge(*pmd))) {
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&tlb->mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma,
+ pmd);
+ } else {
+ struct page *page;
+ pgtable_t pgtable;
+ pgtable = get_pmd_huge_pte(tlb->mm);
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, anon_rss, -HPAGE_PMD_NR);
+ spin_unlock(&tlb->mm->page_table_lock);
+ VM_BUG_ON(!PageHead(page));
+ tlb_remove_page(tlb, page);
+ pte_free(tlb->mm, pgtable);
+ ret = 1;
+ }
+ } else
+ spin_unlock(&tlb->mm->page_table_lock);
+
+ return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ enum page_check_address_pmd_flag flag)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd, *ret = NULL;
+
+ if (address & ~HPAGE_PMD_MASK)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ goto out;
+ VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+ pmd_trans_splitting(*pmd));
+ if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+ VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+ !pmd_trans_splitting(*pmd));
+ ret = pmd;
+ }
+out:
+ return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd;
+ int ret = 0;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+ if (pmd) {
+ /*
+ * We can't temporarily set the pmd to null in order
+ * to split it, the pmd must remain marked huge at all
+ * times or the VM won't take the pmd_trans_huge paths
+ * and it won't wait on the anon_vma->lock to
+ * serialize against split_huge_page*.
+ */
+ pmdp_splitting_flush_notify(vma, address, pmd);
+ ret = 1;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+ int i;
+ unsigned long head_index = page->index;
+ struct zone *zone = page_zone(page);
+
+ /* prevent PageLRU from going away from under us, and freeze lru stats */
+ spin_lock_irq(&zone->lru_lock);
+ compound_lock(page);
+
+ for (i = 1; i < HPAGE_PMD_NR; i++) {
+ struct page *page_tail = page + i;
+
+ /* tail_page->_count cannot change */
+ atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ BUG_ON(page_count(page) <= 0);
+ atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+ BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+ /* after clearing PageTail the gup refcount can be released */
+ smp_mb();
+
+ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ page_tail->flags |= (page->flags &
+ ((1L << PG_referenced) |
+ (1L << PG_swapbacked) |
+ (1L << PG_mlocked) |
+ (1L << PG_uptodate)));
+ page_tail->flags |= (1L << PG_dirty);
+
+ /*
+ * 1) clear PageTail before overwriting first_page
+ * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+ */
+ smp_wmb();
+
+ /*
+ * __split_huge_page_splitting() already set the
+ * splitting bit in all pmd that could map this
+ * hugepage, that will ensure no CPU can alter the
+ * mapcount on the head page. The mapcount is only
+ * accounted in the head page and it has to be
+ * transferred to all tail pages in the code below. So
+ * for this code to be safe, the mapcount can't change
+ * during the split. But that doesn't mean userland can't
+ * keep changing and reading the page contents while
+ * we transfer the mapcount, so the pmd splitting
+ * status is achieved setting a reserved bit in the
+ * pmd, not by clearing the present bit.
+ */
+ BUG_ON(page_mapcount(page_tail));
+ page_tail->_mapcount = page->_mapcount;
+
+ BUG_ON(page_tail->mapping);
+ page_tail->mapping = page->mapping;
+
+ page_tail->index = ++head_index;
+
+ BUG_ON(!PageAnon(page_tail));
+ BUG_ON(!PageUptodate(page_tail));
+ BUG_ON(!PageDirty(page_tail));
+ BUG_ON(!PageSwapBacked(page_tail));
+
+ lru_add_page_tail(zone, page, page_tail);
+
+ put_page(page_tail);
+ }
+
+ ClearPageCompound(page);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+
+ BUG_ON(page_count(page) <= 0);
+}
+
+static int __split_huge_page_map(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd, _pmd;
+ int ret = 0, i;
+ pgtable_t pgtable;
+ unsigned long haddr;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+ if (pmd) {
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+ i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ BUG_ON(PageCompound(page+i));
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+ else
+ BUG_ON(page_mapcount(page) != 1);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+
+ mm->nr_ptes++;
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ ret = 1;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+/* must be called with anon_vma->lock held */
+static void __split_huge_page(struct page *page,
+ struct anon_vma *anon_vma)
+{
+ int mapcount, mapcount2;
+ struct vm_area_struct *vma;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mapcount = 0;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long addr = vma_address(page, vma);
+ if (addr == -EFAULT)
+ continue;
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page);
+
+ mapcount2 = 0;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long addr = vma_address(page, vma);
+ if (addr == -EFAULT)
+ continue;
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+ BUG_ON(mapcount != mapcount2);
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
+{
+ struct page *page;
+ struct anon_vma *anon_vma;
+ struct mm_struct *mm;
+
+ BUG_ON(vma->vm_flags & VM_HUGETLB);
+
+ mm = vma->vm_mm;
+
+ anon_vma = vma->anon_vma;
+
+ spin_lock(&anon_vma->lock);
+ BUG_ON(pmd_trans_splitting(*pmd));
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ spin_unlock(&anon_vma->lock);
+ return;
+ }
+ page = pmd_page(*pmd);
+ spin_unlock(&mm->page_table_lock);
+
+ __split_huge_page(page, anon_vma);
+
+ spin_unlock(&anon_vma->lock);
+ BUG_ON(pmd_trans_huge(*pmd));
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_mm(struct mm_struct *mm,
+ unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma->vm_start > address);
+ BUG_ON(vma->vm_mm != mm);
+
+ __split_huge_page_vma(vma, pmd);
+}
+
+int split_huge_page(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ int ret = 1;
+
+ BUG_ON(!PageAnon(page));
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ goto out;
+ ret = 0;
+ if (!PageCompound(page))
+ goto out_unlock;
+
+ BUG_ON(!PageSwapBacked(page));
+ __split_huge_page(page, anon_vma);
+
+ BUG_ON(PageCompound(page));
+out_unlock:
+ page_unlock_anon_vma(anon_vma);
+out:
+ return ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -647,9 +647,9 @@ out_set_pte:
return 0;
}
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
@@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*src_pmd)) {
+ int err;
+ err = copy_huge_pmd(dst_mm, src_mm,
+ dst_pmd, src_pmd, addr, vma);
+ if (err == -ENOMEM)
+ return -ENOMEM;
+ if (!err)
+ continue;
+ /* fall through */
+ }
if (pmd_none_or_clear_bad(src_pmd))
continue;
if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next-addr != HPAGE_PMD_SIZE)
+ split_huge_page_vma(vma, pmd);
+ else if (zap_huge_pmd(tlb, vma, pmd)) {
+ (*zap_work)--;
+ continue;
+ }
+ /* fall through */
+ }
if (pmd_none_or_clear_bad(pmd)) {
(*zap_work)--;
continue;
@@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
goto no_page_table;
- if (pmd_huge(*pmd)) {
+ if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
BUG_ON(flags & FOLL_GET);
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if (pmd_trans_huge(*pmd)) {
+ spin_lock(&mm->page_table_lock);
+ if (likely(pmd_trans_huge(*pmd))) {
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ } else {
+ page = follow_trans_huge_pmd(mm, address,
+ pmd, flags);
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+ } else
+ spin_unlock(&mm->page_table_lock);
+ /* fall through */
+ }
if (unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct
pmd = pmd_offset(pud, pg);
if (pmd_none(*pmd))
return i ? : -EFAULT;
+ VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, pg);
if (pte_none(*pte)) {
pte_unmap(pte);
@@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static inline int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags)
{
pte_t entry;
spinlock_t *ptl;
@@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
+ if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+ if (!vma->vm_ops)
+ return do_huge_pmd_anonymous_page(mm, vma, address,
+ pmd, flags);
+ } else {
+ pmd_t orig_pmd = *pmd;
+ barrier();
+ if (pmd_trans_huge(orig_pmd)) {
+ if (flags & FAULT_FLAG_WRITE &&
+ !pmd_write(orig_pmd) &&
+ !pmd_trans_splitting(orig_pmd))
+ return do_huge_pmd_wp_page(mm, vma, address,
+ pmd, orig_pmd);
+ return 0;
+ }
+ }
pte = pte_alloc_map(mm, vma, pmd, address);
if (!pte)
return VM_FAULT_OOM;
@@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
goto out;
pmd = pmd_offset(pud, address);
+ VM_BUG_ON(pmd_trans_huge(*pmd));
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto out;
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
+#include <linux/hugetlb.h>
#include <asm/tlbflush.h>
@@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
* Returns virtual address or -EFAULT if page's index/offset is not
* within the range mapped the @vma.
*/
-static inline unsigned long
+inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
- pte_t *pte;
- spinlock_t *ptl;
int referenced = 0;
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
/*
* Don't want to elevate referenced for mlocked page that gets this far,
* in order that it progresses to try_to_unmap and is moved to the
* unevictable list.
*/
if (vma->vm_flags & VM_LOCKED) {
- *mapcount = 1; /* break early from loop */
+ *mapcount = 0; /* break early from loop */
*vm_flags |= VM_LOCKED;
- goto out_unmap;
- }
-
- if (ptep_clear_flush_young_notify(vma, address, pte)) {
- /*
- * Don't treat a reference through a sequentially read
- * mapping as such. If the page has been used in
- * another mapping, we will catch it; if this other
- * mapping is already gone, the unmap path will have
- * set PG_referenced or activated the page.
- */
- if (likely(!VM_SequentialReadHint(vma)))
- referenced++;
+ goto out;
}
/* Pretend the page is referenced if the task has the
@@ -380,9 +363,39 @@ int page_referenced_one(struct page *pag
rwsem_is_locked(&mm->mmap_sem))
referenced++;
-out_unmap:
+ if (unlikely(PageTransHuge(page))) {
+ pmd_t *pmd;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_FLAG);
+ if (pmd && !pmd_trans_splitting(*pmd) &&
+ pmdp_clear_flush_young_notify(vma, address, pmd))
+ referenced++;
+ spin_unlock(&mm->page_table_lock);
+ } else {
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = page_check_address(page, mm, address, &ptl, 0);
+ if (!pte)
+ goto out;
+
+ if (ptep_clear_flush_young_notify(vma, address, pte)) {
+ /*
+ * Don't treat a reference through a sequentially read
+ * mapping as such. If the page has been used in
+ * another mapping, we will catch it; if this other
+ * mapping is already gone, the unmap path will have
+ * set PG_referenced or activated the page.
+ */
+ if (likely(!VM_SequentialReadHint(vma)))
+ referenced++;
+ }
+ pte_unmap_unlock(pte, ptl);
+ }
+
(*mapcount)--;
- pte_unmap_unlock(pte, ptl);
if (referenced)
*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -461,6 +461,43 @@ void __pagevec_release(struct pagevec *p
EXPORT_SYMBOL(__pagevec_release);
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone* zone,
+ struct page *page, struct page *page_tail)
+{
+ int active;
+ enum lru_list lru;
+ const int file = 0;
+ struct list_head *head;
+
+ VM_BUG_ON(!PageHead(page));
+ VM_BUG_ON(PageCompound(page_tail));
+ VM_BUG_ON(PageLRU(page_tail));
+ VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+ SetPageLRU(page_tail);
+
+ if (page_evictable(page_tail, NULL)) {
+ if (PageActive(page)) {
+ SetPageActive(page_tail);
+ active = 1;
+ lru = LRU_ACTIVE_ANON;
+ } else {
+ active = 0;
+ lru = LRU_INACTIVE_ANON;
+ }
+ update_page_reclaim_stat(zone, page_tail, file, active);
+ if (likely(PageLRU(page)))
+ head = page->lru.prev;
+ else
+ head = &zone->lru[lru].list;
+ __add_page_to_lru_list(zone, page_tail, lru, head);
+ } else {
+ SetPageUnevictable(page_tail);
+ add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+ }
+}
+
/*
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.
--
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 14:33 ` [PATCH 25 of 31] transparent hugepage core Andrea Arcangeli
@ 2010-01-28 17:57 ` Mel Gorman
2010-01-28 18:05 ` Rik van Riel
2010-01-28 22:36 ` Andrea Arcangeli
0 siblings, 2 replies; 50+ messages in thread
From: Mel Gorman @ 2010-01-28 17:57 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
Sorry for the long delay getting to this patch. I ran out of beans the
first time around. Unlike Rik, I can't handle 31 patches in one sitting.
On Thu, Jan 28, 2010 at 03:33:39PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
>
> 1) hugepages have to be swappable or the guest physical memory remains
> locked in RAM and can't be paged out to swap
>
It occurs to me that this infrastructure should be reusable to allow
optional swapping of hugetlbfs. I haven't investigated the possibility properly
but it should be doable as a mount option with maybe a boot-parameter for
shared memory.
> 2) if a hugepage allocation fails, regular pages should be allocated
> instead and mixed in the same vma without any failure and without
> userland noticing
>
> 3) if some task quits and more hugepages become available in the
> buddy, guest physical memory backed by regular pages should be
> relocated on hugepages automatically in regions under
> madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
> kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
> not null)
>
> 4) avoidance of reservation and maximization of use of hugepages whenever
> possible. Reservation (needed to avoid runtime fatal failures) may be ok for
> 1 machine with 1 database with 1 database cache with 1 database cache size
> known at boot time. It's definitely not feasible with a virtualization
> hypervisor usage like RHEV-H that runs an unknown number of virtual machines
> with an unknown size of each virtual machine with an unknown amount of
> pagecache that could be potentially useful in the host for guest not using
> O_DIRECT (aka cache=off).
>
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest use this patch (though the guest will limit the additional
> speedup to anonymous regions only for now...). Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
>
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
>
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
>
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split a
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
>
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be a "harmless" addition later if this
> initial part is agreed upon. It also should be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> minimal sign of trouble and we can try again later. collapse_huge_page
> will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
> work similar to madvise(MADV_MERGEABLE).
>
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
>
As I write this, I haven't read the patch but I suspect the default will
depend on how mmap_sem is used. For example, if "always" means that down_read
is called for longer periods of time then workloads that mmap() heavily or
otherwise depend on down_write could suffer.
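
On the userland side of the "madvise" setting, the opt-in is presumably just
a madvise() call on the (ideally hugepage-aligned) range before it is first
touched. A minimal sketch of what I have in mind, assuming 2M hugepages on
x86-64 and the asm-generic MADV_HUGEPAGE value from patch 01 (parisc uses a
different number, so real code should take the define from the headers):

#include <sys/mman.h>
#include <stdlib.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE	14	/* asm-generic value from patch 01 */
#endif

#define HPAGE_SIZE	(2UL * 1024 * 1024)

void *alloc_hugepage_hinted(size_t size)
{
	void *p;

	/* align so the range can actually be backed by a huge pmd */
	if (posix_memalign(&p, HPAGE_SIZE, size))
		return NULL;
	/* tell the kernel this range is worth backing with hugepages */
	madvise(p, size, MADV_HUGEPAGE);
	return p;
}

If that is roughly how it is meant to be used, it would be worth spelling out
in the documentation along with the sysfs knobs.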
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would result always in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no current existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge on the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now, maybe if we change the gup interface
> substantially we can avoid this locking, I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
>
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage so we can't avoid it to
> solve the race of put_page against split_huge_page_refcount to achieve
> a complete hugepage feature for KVM.
>
> Swap and oom works fine (well just like with regular pages ;).
I think that's the first time I've ever heard OOM handling described as
"fine" :)
> MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, that didn't have a range check but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care of a range and there's just the pmd young bit to
> check in that case.
>
> NOTE: in some cases, if the L2 cache is small, this may slow things down and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to non-temporal stores. I also extensively
> researched ways to avoid this cache thrashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it and they add huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup,
FWIW, having read the available papers on transparent support, I agree that
prefault logic is unlikely to be a proven win. The complexity and overhead of
prefaulting are guaranteed and easy to measure. The benefits due to huge page
usage are a "maybe", harder to prove and depend on the workload and hardware.
Using hugetlbfs is also recognised to have significant costs during startup,
due to faulting in pages larger than 4K. In benchmarking,
it can show up as huge pages hurting performance if the benchmark is too
short-lived.
> but they still don't improve
> substantially the cache-trashing during startup if the prefault happens in >4k
> chunks.
> One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the thrashing is the worst one can generate
> in those copies (cow of 4k page copies aren't so well colored so they thrash
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't payoff
> substantially with todays hardware it will payoff even less in the future with
> larger l2 caches, and the prefault logic would bloat the VM a lot. If one is
> embedded, transparent_hugepage can be disabled at runtime with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE region (both at page
> fault time, and if enabled with the collapse_huge_page too through the kernel
> daemon).
>
> This patch supports only hugepages mapped in the pmd, archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevents mixing different page size in the same
> regions so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be hoped to be found not fragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so requiring reservation. hugetlbfs is the "reservation way", the
> point of transparent hugepages is not to have any reservation at all and
> maximizing the use of cache and hugepages at all times automatically.
>
> Some performance results:
>
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
<plug>
could have done
hugectl --heap ./largepage3
too if your version of libhugetlbfs was recent enough to have the
utilities packed.
</plug>
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
>
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
> #define SIZE (3UL*1024*1024*1024)
>
> int main()
> {
> char *p = malloc(SIZE), *p2;
> struct timeval before, after;
>
> gettimeofday(&before, NULL);
> memset(p, 0, SIZE);
> gettimeofday(&after, NULL);
> printf("memset page fault %Lu\n",
> (after.tv_sec-before.tv_sec)*1000000UL +
> after.tv_usec-before.tv_usec);
>
> gettimeofday(&before, NULL);
> memset(p, 0, SIZE);
> gettimeofday(&after, NULL);
> printf("memset tlb miss %Lu\n",
> (after.tv_sec-before.tv_sec)*1000000UL +
> after.tv_usec-before.tv_usec);
>
> gettimeofday(&before, NULL);
> memset(p, 0, SIZE);
> gettimeofday(&after, NULL);
> printf("memset second tlb miss %Lu\n",
> (after.tv_sec-before.tv_sec)*1000000UL +
> after.tv_usec-before.tv_usec);
>
> gettimeofday(&before, NULL);
> for (p2 = p; p2 < p+SIZE; p2 += 4096)
> *p2 = 0;
> gettimeofday(&after, NULL);
> printf("random access tlb miss %Lu\n",
> (after.tv_sec-before.tv_sec)*1000000UL +
> after.tv_usec-before.tv_usec);
>
> gettimeofday(&before, NULL);
> for (p2 = p; p2 < p+SIZE; p2 += 4096)
> *p2 = 0;
> gettimeofday(&after, NULL);
> printf("random access second tlb miss %Lu\n",
> (after.tv_sec-before.tv_sec)*1000000UL +
> after.tv_usec-before.tv_usec);
>
> return 0;
> }
> ============
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,128 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> + struct vm_area_struct *vma);
> +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> + unsigned long addr,
> + pmd_t *pmd,
> + unsigned int flags);
> +extern int zap_huge_pmd(struct mmu_gather *tlb,
> + struct vm_area_struct *vma,
> + pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> + TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
Will need Documentation/ updates at some point explaining these flags
and how they apply to the transparent_hugepage= boot parameter.
It's not overly important but as these are flags, it would have been
more readable to just define them as 1, 2, 4, 8 etc rather than using
1<<TRANSPARENT_HUGEPAGE_X in so many places. i.e. similar to how GFP
flags are defined and used. I know you use test_bit on these later but
it's not clear you need the locked checks and could just use
"if (flags & whatever)"
> +
> +enum page_check_address_pmd_flag {
> + PAGE_CHECK_ADDRESS_PMD_FLAG,
> + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> + struct mm_struct *mm,
> + unsigned long address,
> + enum page_check_address_pmd_flag flag);
> +
> +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
> +#define HPAGE_PMD_MASK HPAGE_MASK
> +#define HPAGE_PMD_SIZE HPAGE_SIZE
> +
Should these be defined by the architecture? I ask because your patch notes
that architectures supporting huge pages at levels other than the PMD will
have issues. On IA-64 for example, HPAGE_SHIFT can be set as a kernel boot
parameter.
That said, these definitions work for the architecture that *is* supported.
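If they ever do need to be arch-overridable, the usual fallback-define
pattern would probably be enough, e.g. (sketch only):

/* arch code may provide its own HPAGE_PMD_* before this point */
#ifndef HPAGE_PMD_SHIFT
#define HPAGE_PMD_SHIFT	HPAGE_SHIFT
#define HPAGE_PMD_MASK	HPAGE_MASK
#define HPAGE_PMD_SIZE	HPAGE_SIZE
#endif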
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define transparent_hugepage_enabled(__vma) \
> + (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> + (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma) \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
> + (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow() \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd,
> + struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> + struct vm_area_struct *vma, unsigned long address,
> + pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
> + pmd_t *pmd);
> +extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
> +extern int split_huge_page(struct page *page);
> +#define split_huge_page_mm(__mm, __addr, __pmd) \
> + do { \
> + if (unlikely(pmd_trans_huge(*(__pmd)))) \
> + __split_huge_page_mm(__mm, __addr, __pmd); \
> + } while (0)
I'm not sure what the current popular thing is but ...
__pmd is used in this #define twice. Hypothetically, if the third
parameter passed to this function had side-effects (e.g. pmd++), then
the expectation of the caller is that it happens once but in reality it
happens twice due to the use of #define.
For this reason, I prefer to see static inlines instead of #defines where
parameters appear more than once. Just because you do not have stupid
callers doesn't mean that someone else will add one for you in the
future.
> +#define split_huge_page_vma(__vma, __pmd) \
> + do { \
> + if (unlikely(pmd_trans_huge(*(__pmd)))) \
> + __split_huge_page_vma(__vma, __pmd); \
> + } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd) \
> + do { \
> + smp_mb(); \
> + spin_unlock_wait(&(__anon_vma)->lock); \
> + smp_mb(); \
> + VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
> + pmd_trans_huge(*(__pmd))); \
> + } while (0)
Barriers without comments are doomed to stupid questions. Can you add a
comment on what this barrier is protecting against?
I *think* it's because spin unlocking is not a barrier (although the exact
details escape me) so presumably spin_unlock_wait() isn't one either. In
this case, you have to be sure that reads/writes to that lock that occurred
since you called pmd_trans_splitting() have happened. Am I close?
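
Even just carrying the answer in the code would help, something like the
following (wording based on my guess above, so correct it if I have it wrong):

/*
 * spin_unlock_wait() is not guaranteed to imply a full memory barrier,
 * so explicitly order the pmd accesses before and after it against the
 * sampling of the anon_vma lock.
 */
#define wait_split_huge_page(__anon_vma, __pmd)			\
	do {							\
		smp_mb();					\
		spin_unlock_wait(&(__anon_vma)->lock);		\
		smp_mb();					\
		VM_BUG_ON(pmd_trans_splitting(*(__pmd)) ||	\
			  pmd_trans_huge(*(__pmd)));		\
	} while (0)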
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> +#if HPAGE_PMD_ORDER > MAX_ORDER
> +#error "hugepages can't be allocated by the buddy allocator"
> +#endif
> +
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +static inline int PageTransHuge(struct page *page)
> +{
> + VM_BUG_ON(PageTail(page));
> + return PageHead(page);
> +}
> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define transparent_hugepage_enabled(__vma) 0
> +#define transparent_hugepage_defrag(__vma) 0
> +#define transparent_hugepage_debug_cow() 0
> +
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> + return 0;
> +}
> +#define split_huge_page_mm(__mm, __addr, __pmd) \
> + do { } while (0)
> +#define split_huge_page_vma(__vma, __pmd) \
> + do { } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd) \
> + do { } while (0)
> +#define PageTransHuge(page) 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void
> #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
> #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
> #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
> +#if BITS_PER_LONG > 32
> +#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
> +#endif
>
#ifdef CONFIG_TRANSPARENT_HUGEPAGE ?
Because of the use of the page flag bits, I believe you are restricted to
64 bit anyway.
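i.e. something like (sketch):

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && BITS_PER_LONG > 32
#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
#endif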
> #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
> #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -234,6 +237,7 @@ struct inode;
> * files which need it (119 of them)
> */
> #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>
> /*
> * Methods to modify the page usage count.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
> }
>
> static inline void
> +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> + struct list_head *head)
> +{
> + list_add(&page->lru, head);
> + __inc_zone_state(zone, NR_LRU_BASE + l);
> + mem_cgroup_add_lru_list(page, l);
> +}
> +
> +static inline void
> add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> {
> - list_add(&page->lru, &zone->lru[l].list);
> - __inc_zone_state(zone, NR_LRU_BASE + l);
> - mem_cgroup_add_lru_list(page, l);
> + __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
> }
>
> static inline void
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa
> /* linux/mm/swap.c */
> extern void __lru_cache_add(struct page *, enum lru_list lru);
> extern void lru_cache_add_lru(struct page *, enum lru_list lru);
> +extern void lru_add_page_tail(struct zone* zone,
> + struct page *page, struct page *page_tail);
> extern void activate_page(struct page *);
> extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c
> @@ -0,0 +1,849 @@
> +/*
> + * Copyright (C) 2009 Red Hat, Inc.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> + (1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf,
> + enum transparent_hugepage_flag enabled,
> + enum transparent_hugepage_flag req_madv)
> +{
> + if (test_bit(enabled, &transparent_hugepage_flags)) {
> + VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> + return sprintf(buf, "[always] madvise never\n");
> + } else if (test_bit(req_madv, &transparent_hugepage_flags))
> + return sprintf(buf, "always [madvise] never\n");
> + else
> + return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count,
> + enum transparent_hugepage_flag enabled,
> + enum transparent_hugepage_flag req_madv)
> +{
> + if (!memcmp("always", buf,
> + min(sizeof("always")-1, count))) {
> + set_bit(enabled, &transparent_hugepage_flags);
> + clear_bit(req_madv, &transparent_hugepage_flags);
> + } else if (!memcmp("madvise", buf,
> + min(sizeof("madvise")-1, count))) {
> + clear_bit(enabled, &transparent_hugepage_flags);
> + set_bit(req_madv, &transparent_hugepage_flags);
> + } else if (!memcmp("never", buf,
> + min(sizeof("never")-1, count))) {
> + clear_bit(enabled, &transparent_hugepage_flags);
> + clear_bit(req_madv, &transparent_hugepage_flags);
> + } else
> + return -EINVAL;
> +
> + return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return double_flag_show(kobj, attr, buf,
> + TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return double_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> + __ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +/*
> + * Currently uses __GFP_REPEAT during allocation. Should be
> + * implemented using page migration and real defrag algorithms in
> + * future VM.
> + */
> +static ssize_t defrag_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return double_flag_show(kobj, attr, buf,
> + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return double_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> + __ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t single_flag_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf,
> + enum transparent_hugepage_flag flag)
> +{
> + if (test_bit(flag, &transparent_hugepage_flags))
> + return sprintf(buf, "[yes] no\n");
> + else
> + return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count,
> + enum transparent_hugepage_flag flag)
> +{
> + if (!memcmp("yes", buf,
> + min(sizeof("yes")-1, count))) {
> + set_bit(flag, &transparent_hugepage_flags);
> + } else if (!memcmp("no", buf,
> + min(sizeof("no")-1, count))) {
> + clear_bit(flag, &transparent_hugepage_flags);
> + } else
> + return -EINVAL;
> +
> + return count;
> +}
> +
> +static ssize_t debug_cow_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return single_flag_show(kobj, attr, buf,
> + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return single_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> + __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> + &enabled_attr.attr,
> + &defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> + &debug_cow_attr.attr,
> +#endif
> + NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> + .attrs = hugepage_attr,
> + .name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init ksm_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> + int err;
> +
> + err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> + if (err)
> + printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> + return 0;
> +}
> +module_init(ksm_init)
> +
ksm_init.... I'm not seeing the connection to KSM. I suspect you cut&pasted
from ksm.c there and forgot to rename it. It's not important as the static
avoids collisions but it looks odd.
> +static int __init setup_transparent_hugepage(char *str)
> +{
> + if (!str)
> + return 0;
> + transparent_hugepage_flags = simple_strtoul(str, &str, 0);
> + return 1;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
The parameters are never sanity checked. This means that flags that should be
mutually exclusive can be set at the same time. e.g. TRANSPARENT_HUGEPAGE_FLAG
and TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG.
I didn't check how important this is, but perhaps you should be using the
same helper functions as used in sysfs.
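e.g. a keyword parser along these lines (rough untested sketch; alternatively
the sysfs double_flag_store() helper could be reused directly):

static int __init setup_transparent_hugepage(char *str)
{
	int ret = 0;

	if (!str)
		goto out;
	if (!strcmp(str, "always")) {
		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
			&transparent_hugepage_flags);
		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
			  &transparent_hugepage_flags);
		ret = 1;
	} else if (!strcmp(str, "madvise")) {
		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
			  &transparent_hugepage_flags);
		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
			&transparent_hugepage_flags);
		ret = 1;
	} else if (!strcmp(str, "never")) {
		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
			  &transparent_hugepage_flags);
		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
			  &transparent_hugepage_flags);
		ret = 1;
	}
out:
	if (!ret)
		printk(KERN_WARNING
		       "transparent_hugepage= cannot parse, ignored\n");
	return ret;
}
__setup("transparent_hugepage=", setup_transparent_hugepage);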
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> + struct mm_struct *mm)
> +{
> + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> + /* FIFO */
> + if (!mm->pmd_huge_pte)
> + INIT_LIST_HEAD(&pgtable->lru);
> + else
> + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> + mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> + if (likely(vma->vm_flags & VM_WRITE))
> + pmd = pmd_mkwrite(pmd);
> + return pmd;
> +}
> +
> +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + struct page *page,
> + unsigned long haddr)
> +{
> + int ret = 0;
> + pgtable_t pgtable;
> +
> + VM_BUG_ON(!PageCompound(page));
> + pgtable = pte_alloc_one(mm, address);
> + if (unlikely(!pgtable)) {
> + put_page(page);
> + return VM_FAULT_OOM;
> + }
> +
> + clear_huge_page(page, haddr, HPAGE_PMD_NR);
> + __SetPageUptodate(page);
> +
> + /*
> + * spin_lock() below is not the equivalent of smp_wmb(), so
> + * this is needed to avoid the clear_huge_page writes to
> + * become visible after the set_pmd_at() write.
> + */
> + smp_wmb();
> +
I'm not seeing the equivalent barrier in do_anonymous_page() between
when the page is zero'd and the PTE inserted. What am I missing?
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_none(*pmd))) {
> + spin_unlock(&mm->page_table_lock);
> + put_page(page);
> + pte_free(mm, pgtable);
> + } else {
> + pmd_t entry;
> + entry = mk_pmd(page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + entry = pmd_mkhuge(entry);
> + page_add_new_anon_rmap(page, vma, haddr);
> + set_pmd_at(mm, haddr, pmd, entry);
> + prepare_pmd_huge_pte(pgtable, mm);
> + add_mm_counter(mm, anon_rss, HPAGE_PMD_NR);
> + spin_unlock(&mm->page_table_lock);
> + }
> +
> + return ret;
> +}
> +
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> + return alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
> + (defrag ? __GFP_REPEAT : 0)|__GFP_NOWARN,
> + HPAGE_PMD_ORDER);
> +}
> +
> +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + unsigned int flags)
> +{
> + struct page *page;
> + unsigned long haddr = address & HPAGE_PMD_MASK;
> + pte_t *pte;
> +
> + if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> + if (unlikely(anon_vma_prepare(vma)))
> + return VM_FAULT_OOM;
> + page = alloc_hugepage(transparent_hugepage_defrag(vma));
> + if (unlikely(!page))
> + goto out;
> +
Something to consider here performance-wise when transparent-hugepage is
defaulted to "always".
alloc_hugepage() is potentially very expensive. You could enter direct
reclaim, lumpy reclaim, wakeup kswapd etc. If you want to optimistically
use huge pages, it might have less impact for !defrag to imply !__GFP_WAIT.
On the other hand, if memory compaction is merged (my test machines are
still tied up, hence no release since), one would be happy for it to compact,
but not necessarily enter direct reclaim.
It's a tricky one.....
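For illustration, the "!defrag implies !__GFP_WAIT" idea would look roughly
like this (sketch only, not a proposal for the final gfp mix):

static inline struct page *alloc_hugepage(int defrag)
{
	/*
	 * Sketch: only allow blocking (direct reclaim, kswapd wakeup)
	 * when defrag is requested, otherwise fail fast and let the
	 * caller fall back to 4k pages.
	 */
	gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_COMP | __GFP_NOWARN;

	if (defrag)
		gfp_mask |= __GFP_REPEAT;
	else
		gfp_mask &= ~__GFP_WAIT;

	return alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
}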
> + return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
> + page, haddr);
> + }
> +out:
> + pte = pte_alloc_map(mm, vma, pmd, address);
> + if (!pte)
> + return VM_FAULT_OOM;
> + return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> + struct vm_area_struct *vma)
> +{
> + struct page *src_page;
> + pmd_t pmd;
> + pgtable_t pgtable;
> + int ret;
> +
> + ret = -ENOMEM;
> + pgtable = pte_alloc_one(dst_mm, addr);
> + if (unlikely(!pgtable))
> + goto out;
> +
> + spin_lock(&dst_mm->page_table_lock);
> + spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> + ret = -EAGAIN;
> + pmd = *src_pmd;
> + if (unlikely(!pmd_trans_huge(pmd)))
> + goto out_unlock;
> + if (unlikely(pmd_trans_splitting(pmd))) {
> + /* split huge page running from under us */
> + spin_unlock(&src_mm->page_table_lock);
> + spin_unlock(&dst_mm->page_table_lock);
> +
> + wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> + goto out;
> + }
> + src_page = pmd_page(pmd);
> + VM_BUG_ON(!PageHead(src_page));
> + get_page(src_page);
> + page_dup_rmap(src_page);
> + add_mm_counter(dst_mm, anon_rss, HPAGE_PMD_NR);
> +
> + pmdp_set_wrprotect(src_mm, addr, src_pmd);
> + pmd = pmd_mkold(pmd_wrprotect(pmd));
> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> + prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> + ret = 0;
> +out_unlock:
> + spin_unlock(&src_mm->page_table_lock);
> + spin_unlock(&dst_mm->page_table_lock);
> +out:
> + return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> + pgtable_t pgtable;
> +
> + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> + /* FIFO */
> + pgtable = mm->pmd_huge_pte;
> + if (list_empty(&pgtable->lru))
> + mm->pmd_huge_pte = NULL;
> + else {
> + mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> + struct page, lru);
> + list_del(&pgtable->lru);
> + }
> + return pgtable;
> +}
> +
> +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address,
> + pmd_t *pmd, pmd_t orig_pmd,
> + struct page *page,
> + unsigned long haddr)
> +{
> + pgtable_t pgtable;
> + pmd_t _pmd;
> + int ret = 0, i;
> + struct page **pages;
> +
> + pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> + GFP_KERNEL);
Is kzalloc really necessary? It's fixed-size and you initialise it so
why not just kmalloc()?
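i.e. simply (sketch; every slot is written by the loop below and the error
path frees the array anyway):

	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
			GFP_KERNEL);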
> + if (unlikely(!pages)) {
> + ret |= VM_FAULT_OOM;
> + goto out;
> + }
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> + vma, address);
> + if (unlikely(!pages[i])) {
> + while (--i >= 0)
> + put_page(pages[i]);
> + kfree(pages);
> + ret |= VM_FAULT_OOM;
> + goto out;
> + }
> + }
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + goto out_free_pages;
> + else
> + get_page(page);
> + spin_unlock(&mm->page_table_lock);
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + copy_user_highpage(pages[i], page + i,
> + haddr + PAGE_SHIFT*i, vma);
> + __SetPageUptodate(pages[i]);
> + cond_resched();
> + }
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + goto out_free_pages;
> + else
> + put_page(page);
> +
> + pmdp_clear_flush_notify(vma, haddr, pmd);
> + /* leave pmd empty until pte is filled */
> +
> + pgtable = get_pmd_huge_pte(mm);
> + pmd_populate(mm, &_pmd, pgtable);
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> + entry = mk_pte(pages[i], vma->vm_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + page_add_new_anon_rmap(pages[i], vma, haddr);
> + pte = pte_offset_map(&_pmd, haddr);
> + VM_BUG_ON(!pte_none(*pte));
> + set_pte_at(mm, haddr, pte, entry);
> + pte_unmap(pte);
> + }
> + kfree(pages);
> +
> + mm->nr_ptes++;
> + smp_wmb(); /* make pte visible before pmd */
> + pmd_populate(mm, pmd, pgtable);
> + page_remove_rmap(page);
> + spin_unlock(&mm->page_table_lock);
> +
> + ret |= VM_FAULT_WRITE;
> + put_page(page);
> +
> +out:
> + return ret;
> +
> +out_free_pages:
> + spin_unlock(&mm->page_table_lock);
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + put_page(pages[i]);
> + kfree(pages);
> + goto out;
> +}
> +
> +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> + int ret = 0;
> + struct page *page, *new_page;
> + unsigned long haddr;
> +
> + VM_BUG_ON(!vma->anon_vma);
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + goto out_unlock;
> +
> + page = pmd_page(orig_pmd);
> + VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> + haddr = address & HPAGE_PMD_MASK;
> + if (page_mapcount(page) == 1) {
> + pmd_t entry;
> + entry = pmd_mkyoung(orig_pmd);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
> + update_mmu_cache(vma, address, entry);
> + ret |= VM_FAULT_WRITE;
> + goto out_unlock;
> + }
> + spin_unlock(&mm->page_table_lock);
> +
> + new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
> + if (unlikely(transparent_hugepage_debug_cow()) && new_page) {
> + put_page(new_page);
> + new_page = NULL;
> + }
> + if (unlikely(!new_page))
> + return do_huge_pmd_wp_page_fallback(mm, vma, address,
> + pmd, orig_pmd, page, haddr);
> +
> + copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> + __SetPageUptodate(new_page);
> +
> + /*
> + * spin_lock() below is not the equivalent of smp_wmb(), so
> + * this is needed to avoid the copy_huge_page writes to become
> + * visible after the set_pmd_at() write.
> + */
> + smp_wmb();
> +
You do that here but not for the wp_page_fallback in the same type of
setup. Can you point out the obvious thing I'm missing?
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + put_page(new_page);
> + else {
> + pmd_t entry;
> + entry = mk_pmd(new_page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + entry = pmd_mkhuge(entry);
> + pmdp_clear_flush_notify(vma, haddr, pmd);
> + page_add_new_anon_rmap(new_page, vma, haddr);
> + set_pmd_at(mm, haddr, pmd, entry);
> + update_mmu_cache(vma, address, entry);
> + page_remove_rmap(page);
> + put_page(page);
> + ret |= VM_FAULT_WRITE;
> + }
> +out_unlock:
> + spin_unlock(&mm->page_table_lock);
> + return ret;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> + unsigned long addr,
> + pmd_t *pmd,
> + unsigned int flags)
> +{
> + struct page *page = NULL;
> +
> + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> + if (flags & FOLL_WRITE && !pmd_write(*pmd))
> + goto out;
> +
> + page = pmd_page(*pmd);
> + VM_BUG_ON(!PageHead(page));
> + if (flags & FOLL_TOUCH) {
> + pmd_t _pmd;
> + /*
> + * We should set the dirty bit only for FOLL_WRITE but
> + * for now the dirty bit in the pmd is meaningless.
> + * And if the dirty bit will become meaningful and
> + * we'll only set it with FOLL_WRITE, an atomic
> + * set_bit will be required on the pmd to set the
> + * young bit, instead of the current set_pmd_at.
> + */
> + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> + set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
> + }
> + page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> + VM_BUG_ON(!PageCompound(page));
> + if (flags & FOLL_GET)
> + get_page(page);
> +
> +out:
> + return page;
> +}
> +
> +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> + pmd_t *pmd)
> +{
> + int ret = 0;
> +
> + spin_lock(&tlb->mm->page_table_lock);
> + if (likely(pmd_trans_huge(*pmd))) {
> + if (unlikely(pmd_trans_splitting(*pmd))) {
> + spin_unlock(&tlb->mm->page_table_lock);
> + wait_split_huge_page(vma->anon_vma,
> + pmd);
> + } else {
> + struct page *page;
> + pgtable_t pgtable;
> + pgtable = get_pmd_huge_pte(tlb->mm);
> + page = pmd_page(*pmd);
> + pmd_clear(pmd);
> + page_remove_rmap(page);
> + VM_BUG_ON(page_mapcount(page) < 0);
> + add_mm_counter(tlb->mm, anon_rss, -HPAGE_PMD_NR);
> + spin_unlock(&tlb->mm->page_table_lock);
> + VM_BUG_ON(!PageHead(page));
> + tlb_remove_page(tlb, page);
> + pte_free(tlb->mm, pgtable);
> + ret = 1;
> + }
> + } else
> + spin_unlock(&tlb->mm->page_table_lock);
> +
> + return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> + struct mm_struct *mm,
> + unsigned long address,
> + enum page_check_address_pmd_flag flag)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd, *ret = NULL;
> +
> + if (address & ~HPAGE_PMD_MASK)
> + goto out;
> +
> + pgd = pgd_offset(mm, address);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, address);
> + if (pmd_none(*pmd))
> + goto out;
> + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> + pmd_trans_splitting(*pmd));
> + if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
> + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> + !pmd_trans_splitting(*pmd));
> + ret = pmd;
> + }
> +out:
> + return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pmd_t *pmd;
> + int ret = 0;
> +
> + spin_lock(&mm->page_table_lock);
> + pmd = page_check_address_pmd(page, mm, address,
> + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> + if (pmd) {
> + /*
> + * We can't temporarily set the pmd to null in order
> + * to split it, the pmd must remain marked huge at all
> + * times or the VM won't take the pmd_trans_huge paths
> + * and it won't wait on the anon_vma->lock to
> + * serialize against split_huge_page*.
> + */
heh, that is an understatement for the problems that could occur if you set
the PMD to NULL.
> + pmdp_splitting_flush_notify(vma, address, pmd);
> + ret = 1;
> + }
> + spin_unlock(&mm->page_table_lock);
> +
> + return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> + int i;
> + unsigned long head_index = page->index;
> + struct zone *zone = page_zone(page);
> +
> + /* prevent PageLRU to go away from under us, and freeze lru stats */
hmm, it's not a "now" thing but I suspect it would be preferable to isolate
from the LRU, release the LRU lock, do what you need to do and then add the
pages in batch back onto the LRU.
> + spin_lock_irq(&zone->lru_lock);
> + compound_lock(page);
> +
> + for (i = 1; i < HPAGE_PMD_NR; i++) {
> + struct page *page_tail = page + i;
> +
> + /* tail_page->_count cannot change */
> + atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> + BUG_ON(page_count(page) <= 0);
> + atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> + BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> + /* after clearing PageTail the gup refcount can be released */
> + smp_mb();
> +
> + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + page_tail->flags |= (page->flags &
> + ((1L << PG_referenced) |
> + (1L << PG_swapbacked) |
> + (1L << PG_mlocked) |
> + (1L << PG_uptodate)));
> + page_tail->flags |= (1L << PG_dirty);
> +
> + /*
> + * 1) clear PageTail before overwriting first_page
> + * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> + */
> + smp_wmb();
> +
You have the LRU taken with an interrupt-safe lock. Who else could be
manipulating the struct page to make this barrier necessary?
> + /*
> + * __split_huge_page_splitting() already set the
> + * splitting bit in all pmd that could map this
> + * hugepage, that will ensure no CPU can alter the
> + * mapcount on the head page. The mapcount is only
> + * accounted in the head page and it has to be
> + * transferred to all tail pages in the below code. So
> + * for this code to be safe, the split the mapcount
> + * can't change. But that doesn't mean userland can't
> + * keep changing and reading the page contents while
> + * we transfer the mapcount, so the pmd splitting
> + * status is achieved setting a reserved bit in the
> + * pmd, not by clearing the present bit.
> + */
> + BUG_ON(page_mapcount(page_tail));
> + page_tail->_mapcount = page->_mapcount;
> +
> + BUG_ON(page_tail->mapping);
> + page_tail->mapping = page->mapping;
> +
> + page_tail->index = ++head_index;
> +
> + BUG_ON(!PageAnon(page_tail));
> + BUG_ON(!PageUptodate(page_tail));
> + BUG_ON(!PageDirty(page_tail));
> + BUG_ON(!PageSwapBacked(page_tail));
> +
> + lru_add_page_tail(zone, page, page_tail);
> +
> + put_page(page_tail);
> + }
> +
> + ClearPageCompound(page);
> + compound_unlock(page);
> + spin_unlock_irq(&zone->lru_lock);
> +
> + BUG_ON(page_count(page) <= 0);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pmd_t *pmd, _pmd;
> + int ret = 0, i;
> + pgtable_t pgtable;
> + unsigned long haddr;
> +
> + spin_lock(&mm->page_table_lock);
> + pmd = page_check_address_pmd(page, mm, address,
> + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> + if (pmd) {
> + pgtable = get_pmd_huge_pte(mm);
> + pmd_populate(mm, &_pmd, pgtable);
> +
> + for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> + i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> + BUG_ON(PageCompound(page+i));
> + entry = mk_pte(page + i, vma->vm_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (!pmd_write(*pmd))
> + entry = pte_wrprotect(entry);
> + else
> + BUG_ON(page_mapcount(page) != 1);
> + if (!pmd_young(*pmd))
> + entry = pte_mkold(entry);
> + pte = pte_offset_map(&_pmd, haddr);
> + BUG_ON(!pte_none(*pte));
> + set_pte_at(mm, haddr, pte, entry);
> + pte_unmap(pte);
> + }
> +
> + mm->nr_ptes++;
> + smp_wmb(); /* make pte visible before pmd */
> + pmd_populate(mm, pmd, pgtable);
> + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + ret = 1;
> + }
> + spin_unlock(&mm->page_table_lock);
> +
> + return ret;
> +}
> +
> +/* must be called with anon_vma->lock hold */
> +static void __split_huge_page(struct page *page,
> + struct anon_vma *anon_vma)
> +{
> + int mapcount, mapcount2;
> + struct vm_area_struct *vma;
> +
> + BUG_ON(!PageHead(page));
> + BUG_ON(PageTail(page));
> +
> + mapcount = 0;
> + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> + unsigned long addr = vma_address(page, vma);
> + if (addr == -EFAULT)
> + continue;
> + mapcount += __split_huge_page_splitting(page, vma, addr);
> + }
> + BUG_ON(mapcount != page_mapcount(page));
> +
> + __split_huge_page_refcount(page);
> +
> + mapcount2 = 0;
> + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> + unsigned long addr = vma_address(page, vma);
> + if (addr == -EFAULT)
> + continue;
> + mapcount2 += __split_huge_page_map(page, vma, addr);
> + }
> + BUG_ON(mapcount != mapcount2);
> +}
> +
> +/* must run with mmap_sem to prevent vma to go away */
held for read?
Either way, this is what I was thinking of: workloads that mmap/munmap
heavily could potentially suffer with transparent_hugepage set to "always"
by default.
I don't think it's a *severe* problem as such but when documentation exists
on this feature, it is something worth mentioning.
> +void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
> +{
> + struct page *page;
> + struct anon_vma *anon_vma;
> + struct mm_struct *mm;
> +
> + BUG_ON(vma->vm_flags & VM_HUGETLB);
> +
> + mm = vma->vm_mm;
> +
> + anon_vma = vma->anon_vma;
> +
> + spin_lock(&anon_vma->lock);
> + BUG_ON(pmd_trans_splitting(*pmd));
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_trans_huge(*pmd))) {
> + spin_unlock(&mm->page_table_lock);
> + spin_unlock(&anon_vma->lock);
> + return;
> + }
> + page = pmd_page(*pmd);
> + spin_unlock(&mm->page_table_lock);
> +
> + __split_huge_page(page, anon_vma);
> +
> + spin_unlock(&anon_vma->lock);
> + BUG_ON(pmd_trans_huge(*pmd));
> +}
> +
> +/* must run with mmap_sem to prevent vma to go away */
> +void __split_huge_page_mm(struct mm_struct *mm,
> + unsigned long address,
> + pmd_t *pmd)
> +{
> + struct vm_area_struct *vma;
> +
> + vma = find_vma(mm, address);
> + BUG_ON(vma->vm_start > address);
> + BUG_ON(vma->vm_mm != mm);
> +
> + __split_huge_page_vma(vma, pmd);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> + struct anon_vma *anon_vma;
> + int ret = 1;
> +
> + BUG_ON(!PageAnon(page));
> + anon_vma = page_lock_anon_vma(page);
> + if (!anon_vma)
> + goto out;
> + ret = 0;
> + if (!PageCompound(page))
> + goto out_unlock;
> +
> + BUG_ON(!PageSwapBacked(page));
> + __split_huge_page(page, anon_vma);
> +
> + BUG_ON(PageCompound(page));
> +out_unlock:
> + page_unlock_anon_vma(anon_vma);
> +out:
> + return ret;
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -647,9 +647,9 @@ out_set_pte:
> return 0;
> }
>
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> - unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> {
> pte_t *orig_src_pte, *orig_dst_pte;
> pte_t *src_pte, *dst_pte;
> @@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct
> src_pmd = pmd_offset(src_pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*src_pmd)) {
> + int err;
> + err = copy_huge_pmd(dst_mm, src_mm,
> + dst_pmd, src_pmd, addr, vma);
> + if (err == -ENOMEM)
> + return -ENOMEM;
> + if (!err)
> + continue;
> + /* fall through */
> + }
> if (pmd_none_or_clear_bad(src_pmd))
> continue;
> if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
> pmd = pmd_offset(pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*pmd)) {
> + if (next-addr != HPAGE_PMD_SIZE)
> + split_huge_page_vma(vma, pmd);
> + else if (zap_huge_pmd(tlb, vma, pmd)) {
> + (*zap_work)--;
> + continue;
> + }
> + /* fall through */
> + }
> if (pmd_none_or_clear_bad(pmd)) {
> (*zap_work)--;
> continue;
> @@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
> pmd = pmd_offset(pud, address);
> if (pmd_none(*pmd))
> goto no_page_table;
> - if (pmd_huge(*pmd)) {
> + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> BUG_ON(flags & FOLL_GET);
> page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> goto out;
> }
> + if (pmd_trans_huge(*pmd)) {
> + spin_lock(&mm->page_table_lock);
> + if (likely(pmd_trans_huge(*pmd))) {
> + if (unlikely(pmd_trans_splitting(*pmd))) {
> + spin_unlock(&mm->page_table_lock);
> + wait_split_huge_page(vma->anon_vma, pmd);
> + } else {
> + page = follow_trans_huge_pmd(mm, address,
> + pmd, flags);
> + spin_unlock(&mm->page_table_lock);
> + goto out;
> + }
> + } else
> + spin_unlock(&mm->page_table_lock);
> + /* fall through */
> + }
> if (unlikely(pmd_bad(*pmd)))
> goto no_page_table;
>
> @@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct
> pmd = pmd_offset(pud, pg);
> if (pmd_none(*pmd))
> return i ? : -EFAULT;
> + VM_BUG_ON(pmd_trans_huge(*pmd));
> pte = pte_offset_map(pmd, pg);
> if (pte_none(*pte)) {
> pte_unmap(pte);
> @@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
> * but allow concurrent faults), and pte mapped but not yet locked.
> * We return with mmap_sem still held, but pte unmapped and unlocked.
> */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> - struct vm_area_struct *vma, unsigned long address,
> - pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> + struct vm_area_struct *vma, unsigned long address,
> + pte_t *pte, pmd_t *pmd, unsigned int flags)
> {
> pte_t entry;
> spinlock_t *ptl;
> @@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
> pmd = pmd_alloc(mm, pud, address);
> if (!pmd)
> return VM_FAULT_OOM;
> + if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> + if (!vma->vm_ops)
> + return do_huge_pmd_anonymous_page(mm, vma, address,
> + pmd, flags);
> + } else {
> + pmd_t orig_pmd = *pmd;
> + barrier();
> + if (pmd_trans_huge(orig_pmd)) {
> + if (flags & FAULT_FLAG_WRITE &&
> + !pmd_write(orig_pmd) &&
> + !pmd_trans_splitting(orig_pmd))
> + return do_huge_pmd_wp_page(mm, vma, address,
> + pmd, orig_pmd);
> + return 0;
> + }
> + }
> pte = pte_alloc_map(mm, vma, pmd, address);
> if (!pte)
> return VM_FAULT_OOM;
> @@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
> goto out;
>
> pmd = pmd_offset(pud, address);
> + VM_BUG_ON(pmd_trans_huge(*pmd));
> if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> goto out;
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -56,6 +56,7 @@
> #include <linux/memcontrol.h>
> #include <linux/mmu_notifier.h>
> #include <linux/migrate.h>
> +#include <linux/hugetlb.h>
>
> #include <asm/tlbflush.h>
>
> @@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
> * Returns virtual address or -EFAULT if page's index/offset is not
> * within the range mapped the @vma.
> */
> -static inline unsigned long
> +inline unsigned long
> vma_address(struct page *page, struct vm_area_struct *vma)
> {
> pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
> unsigned long *vm_flags)
> {
> struct mm_struct *mm = vma->vm_mm;
> - pte_t *pte;
> - spinlock_t *ptl;
> int referenced = 0;
>
> - pte = page_check_address(page, mm, address, &ptl, 0);
> - if (!pte)
> - goto out;
> -
> /*
> * Don't want to elevate referenced for mlocked page that gets this far,
> * in order that it progresses to try_to_unmap and is moved to the
> * unevictable list.
> */
> if (vma->vm_flags & VM_LOCKED) {
> - *mapcount = 1; /* break early from loop */
> + *mapcount = 0; /* break early from loop */
> *vm_flags |= VM_LOCKED;
> - goto out_unmap;
> - }
> -
> - if (ptep_clear_flush_young_notify(vma, address, pte)) {
> - /*
> - * Don't treat a reference through a sequentially read
> - * mapping as such. If the page has been used in
> - * another mapping, we will catch it; if this other
> - * mapping is already gone, the unmap path will have
> - * set PG_referenced or activated the page.
> - */
> - if (likely(!VM_SequentialReadHint(vma)))
> - referenced++;
> + goto out;
> }
>
> /* Pretend the page is referenced if the task has the
> @@ -380,9 +363,39 @@ int page_referenced_one(struct page *pag
> rwsem_is_locked(&mm->mmap_sem))
> referenced++;
>
> -out_unmap:
> + if (unlikely(PageTransHuge(page))) {
> + pmd_t *pmd;
> +
> + spin_lock(&mm->page_table_lock);
> + pmd = page_check_address_pmd(page, mm, address,
> + PAGE_CHECK_ADDRESS_PMD_FLAG);
> + if (pmd && !pmd_trans_splitting(*pmd) &&
> + pmdp_clear_flush_young_notify(vma, address, pmd))
> + referenced++;
> + spin_unlock(&mm->page_table_lock);
> + } else {
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + pte = page_check_address(page, mm, address, &ptl, 0);
> + if (!pte)
> + goto out;
> +
> + if (ptep_clear_flush_young_notify(vma, address, pte)) {
> + /*
> + * Don't treat a reference through a sequentially read
> + * mapping as such. If the page has been used in
> + * another mapping, we will catch it; if this other
> + * mapping is already gone, the unmap path will have
> + * set PG_referenced or activated the page.
> + */
> + if (likely(!VM_SequentialReadHint(vma)))
> + referenced++;
> + }
> + pte_unmap_unlock(pte, ptl);
> + }
> +
> (*mapcount)--;
> - pte_unmap_unlock(pte, ptl);
>
> if (referenced)
> *vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -461,6 +461,43 @@ void __pagevec_release(struct pagevec *p
>
> EXPORT_SYMBOL(__pagevec_release);
>
> +/* used by __split_huge_page_refcount() */
> +void lru_add_page_tail(struct zone* zone,
> + struct page *page, struct page *page_tail)
> +{
> + int active;
> + enum lru_list lru;
> + const int file = 0;
> + struct list_head *head;
> +
> + VM_BUG_ON(!PageHead(page));
> + VM_BUG_ON(PageCompound(page_tail));
> + VM_BUG_ON(PageLRU(page_tail));
> + VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
> +
> + SetPageLRU(page_tail);
> +
> + if (page_evictable(page_tail, NULL)) {
> + if (PageActive(page)) {
> + SetPageActive(page_tail);
> + active = 1;
> + lru = LRU_ACTIVE_ANON;
> + } else {
> + active = 0;
> + lru = LRU_INACTIVE_ANON;
> + }
> + update_page_reclaim_stat(zone, page_tail, file, active);
> + if (likely(PageLRU(page)))
> + head = page->lru.prev;
> + else
> + head = &zone->lru[lru].list;
> + __add_page_to_lru_list(zone, page_tail, lru, head);
> + } else {
> + SetPageUnevictable(page_tail);
> + add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> + }
> +}
> +
> /*
> * Add the passed pages to the LRU, then drop the caller's refcount
> * on them. Reinitialises the caller's pagevec.
>
Broadly speaking, this was a lot more understandable than I was expecting
and I did not find any major snags or difficulties. However, I've also run
out of beans again, so I'll be taking another break before moving on to the
rest of the set :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 17:57 ` Mel Gorman
@ 2010-01-28 18:05 ` Rik van Riel
2010-01-28 18:07 ` Mel Gorman
2010-01-28 22:36 ` Andrea Arcangeli
1 sibling, 1 reply; 50+ messages in thread
From: Rik van Riel @ 2010-01-28 18:05 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On 01/28/2010 12:57 PM, Mel Gorman wrote:
> Sorry for the long delay getting to this patch. I ran out of beans the
> first time around. Unlike Rik, I can't handle 31 patches in one sitting.
>
> On Thu, Jan 28, 2010 at 03:33:39PM +0100, Andrea Arcangeli wrote:
>> From: Andrea Arcangeli<aarcange@redhat.com>
>>
>> Lately I've been working to make KVM use hugepages transparently
>> without the usual restrictions of hugetlbfs. Some of the restrictions
>> I'd like to see removed:
>>
>> 1) hugepages have to be swappable or the guest physical memory remains
>> locked in RAM and can't be paged out to swap
>>
>
> It occurs to me that this infrastructure should be reusable to allow
> optional swapping of hugetlbfs. I haven't investigated the possibility properly
> but it should be doable as a mount option with maybe a boot-parameter for
> shared memory.
I agree that would be nice. However, as you noticed above this
patch set is quite large already. Merging the infrastructure from
hugetlb and the anonymous hugepages is probably better done in a
follow up patch series, since the two are pretty different beasts
at this point.
--
All rights reversed.
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 18:05 ` Rik van Riel
@ 2010-01-28 18:07 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-01-28 18:07 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 01:05:26PM -0500, Rik van Riel wrote:
> On 01/28/2010 12:57 PM, Mel Gorman wrote:
>> Sorry for the long delay getting to this patch. I ran out of beans the
>> first time around. Unlike Rik, I can't handle 31 patches in one sitting.
>>
>> On Thu, Jan 28, 2010 at 03:33:39PM +0100, Andrea Arcangeli wrote:
>>> From: Andrea Arcangeli<aarcange@redhat.com>
>>>
>>> Lately I've been working to make KVM use hugepages transparently
>>> without the usual restrictions of hugetlbfs. Some of the restrictions
>>> I'd like to see removed:
>>>
>>> 1) hugepages have to be swappable or the guest physical memory remains
>>> locked in RAM and can't be paged out to swap
>>>
>>
>> It occurs to me that this infrastructure should be reusable to allow
>> optional swapping of hugetlbfs. I haven't investigated the possibility properly
>> but it should be doable as a mount option with maybe a boot-parameter for
>> shared memory.
>
> I agree that would be nice. However, as you noticed above this
> patch set is quite large already. Merging the infrastructure from
> hugetlb and the anonymous hugepages is probably better done in a
> follow up patch series, since the two are pretty different beasts
> at this point.
>
Fully agreed on all counts.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 17:57 ` Mel Gorman
2010-01-28 18:05 ` Rik van Riel
@ 2010-01-28 22:36 ` Andrea Arcangeli
2010-01-28 22:43 ` Andrea Arcangeli
` (2 more replies)
1 sibling, 3 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 22:36 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 05:57:54PM +0000, Mel Gorman wrote:
> It occurs to me that this infrastructure should be reusable to allow
> optional swapping of hugetlbfs. I haven't investigated the possibility properly
> but it should be doable as a mount option with maybe a boot-parameter for
> shared memory.
I thought it was a higher priority to back tmpfs with this, since tmpfs is
mounted by default and used by more apps automatically. I'm not sure it
makes sense to back hugetlbfs with this, given the lengths hugetlbfs went
to in order to avoid the regular VM fault paths while getting closer and
closer in features to the default VM paths, but I'm not against it either;
just a thought.
> As I write this, I haven't read the patch but I suspect the default will
> depend on how mmap_sem is used. For example, if "always" means that down_read
> is called for longer periods of time then workloads that mmap() heavily or
> otherwise depend on down_write could suffer.
Note, page fault time goes down from >5 sec to <2 sec with lockdep and
debug enabled (otherwise it's only 50% faster). So mmap_sem hold time will
be lower with "always". COWs are the only corner case that is a bit slower
on my systems, because I have 2M of cache and that thrashes on the 4M of
copying per fault... So even for totally short-lived mappings, mmap_sem
hold time can decrease by 33% and the app can run 50% faster (or 100%
faster with lockdep) even without any mmap_sem contention at all. But for
other apps the cache thrashing may slow things down a bit.
The main reason for the "always" default is to be sure everybody is
stressing it. It can also be merged in mainline with a default of
madvise/madvise/madvise, that is fine by me; for now it has to stay at
"always" in the enabled setting. Leaving "always" as the default, like on
the laptop where I'm writing this, isn't insane, but it's not a 100%
guaranteed speedup in all cases because of the cache thrashing. We also
have to change copy/clear_huge_page to use non-temporal stores; that might
help.
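For reference, with the madvise default an app that wants hugepages only
has to do something like this (hypothetical userland example; it needs
headers that define MADV_HUGEPAGE, added by this series):

#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;		/* e.g. a 256M arena */
	void *p;

	/* 2M-aligned so the region can be backed by huge pmds */
	if (posix_memalign(&p, 2UL << 20, len))
		return 1;
	/* opt in, since the global default is "madvise", not "always" */
	madvise(p, len, MADV_HUGEPAGE);
	/* ... use the arena ... */
	return 0;
}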
> I think that's the first time I've ever heard OOM handling described as
> "fine" :)
Eheeh ;)
> FWIW, having read the available papers on transparent support, I agree that
> prefault logic is unlikely to be a proven win. The complexity and overhead of
> prefaulting are guaranteed and easy to measure. The benefits due to huge page
> usage are a "maybe", harder to prove and depend on the workload and hardware.
Full agreement. My prototype benchmark results showed it's a very gray
area. I want something that runs userland as fast as it can get in the
best case, not just 10/20/30% faster with the only benefit being that it
thrashes the cache less. Because even if it thrashes the cache less, it
will still thrash more of it than with the default 4k page size... and we
don't get enough benefit in return (notably zero TLB benefit). I want the
full benefit in return immediately and the maximum speedup for the best
case; the worst case is still going to run slower even with prefault. Plus
we can disable it and selectively enable it if one has L2 cache issues.
And the complexity of the code isn't even remotely comparable...
> Using hugetlbfs with huge pages is also recognised to have significant costs
> during startup which is faulting in pages larger than 4K. In benchmarking,
> it can show up as huge pages hurting performance if the benchmark is too
> short-lived.
We'll see how it goes with this one.
> Will need Documentation/ updates at some point explaining these flags
> and how they apply to the transparent_hugepage= boot parameter.
I expect admins to fiddle with sysfs from boot-time scripts. I intended
transparent_hugepage= more for =0 and similar low-level usage if people
can't boot or something, so I don't see a need for documentation other
than the =0 setting, but we can document it.
> It's not overly important but as these are flags, it would have been
> more readable to just define them as 1, 2, 4, 8 etc rather than using
> 1<<TRANSPARENT_HUGEPAGE_X in so many places. i.e. similar to how GFP
> flags are defined and used. I know you use test_bit on these later but
> it's not clear you need the locked checks and could just use
> "if (flags & whatever)"
test_bit is as lockless and as out-of-order as (flags & whatever). So only
the setting of the default value is annoying, but all the other places
testing it are already as efficient and as short as possible, I think. The
writing of the bits has to be atomic right now, and it's faster than having
to take locks (but writing is unimportant and extremely rare, which is why
it's __read_mostly).
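To illustrate the read side (hypothetical helper, not in the patch; just a
sketch):

static inline int thp_always_enabled(void)
{
	/*
	 * Read side: test_bit() on a single word is just a load plus a
	 * mask, so it is as lockless and as reorderable as the
	 * open-coded test.  Only the writers (set_bit/clear_bit in the
	 * sysfs handlers) need atomicity.
	 */
	return test_bit(TRANSPARENT_HUGEPAGE_FLAG,
			&transparent_hugepage_flags);
	/*
	 * Equivalent open-coded form:
	 *	return !!(transparent_hugepage_flags &
	 *		  (1<<TRANSPARENT_HUGEPAGE_FLAG));
	 */
}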
> Should these be defined by the architecture? I ask because your patch notes
> that architectures supporting huge pages at levels other than the PMD will
> have issues. On IA-64 for example, HPAGE_SHIFT can be set as a kernel boot
> parameter.
>
> That said, these definitions work for the architecture that *is* supported.
There is no risk because nothing is ever going to use HPAGE_PMD_* if
TRANSPARENT_HUGEPAGE is not set, so as long as it builds... It's not
inside the #ifdef because of this:
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE)
split_huge_page_vma(vma, pmd);
else if (zap_huge_pmd(tlb, vma, pmd)) {
(*zap_work)--;
continue;
}
/* fall through */
}
but pmd_trans_huge is defined to 0 at compile time, so HPAGE_PMD_SIZE
can be random. I guess I'll change it to this to avoid any possible
future risk:
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -42,11 +42,11 @@ extern pmd_t *page_check_address_pmd(str
unsigned long address,
enum page_check_address_pmd_flag flag);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT HPAGE_SHIFT
#define HPAGE_PMD_MASK HPAGE_MASK
#define HPAGE_PMD_SIZE HPAGE_SIZE
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define transparent_hugepage_enabled(__vma) \
(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
(transparent_hugepage_flags & \
@@ -116,6 +116,10 @@ static inline int PageTransCompound(stru
return PageCompound(page);
}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
+#define HPAGE_PMD_MASK ({ BUG(); 0; })
+#define HPAGE_PMD_SIZE ({ BUG(); 0; })
+
#define transparent_hugepage_enabled(__vma) 0
#define transparent_hugepage_defrag(__vma) 0
#define transparent_hugepage_debug_cow() 0
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define transparent_hugepage_enabled(__vma) \
> > + (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
> > + (transparent_hugepage_flags & \
> > + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> > + (__vma)->vm_flags & VM_HUGEPAGE))
> > +#define transparent_hugepage_defrag(__vma) \
> > + (transparent_hugepage_flags & \
> > + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
> > + (transparent_hugepage_flags & \
> > + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
> > + (__vma)->vm_flags & VM_HUGEPAGE))
> > +#ifdef CONFIG_DEBUG_VM
> > +#define transparent_hugepage_debug_cow() \
> > + (transparent_hugepage_flags & \
> > + (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> > +#else /* CONFIG_DEBUG_VM */
> > +#define transparent_hugepage_debug_cow() 0
> > +#endif /* CONFIG_DEBUG_VM */
> > +
> > +extern unsigned long transparent_hugepage_flags;
> > +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > + pmd_t *dst_pmd, pmd_t *src_pmd,
> > + struct vm_area_struct *vma,
> > + unsigned long addr, unsigned long end);
> > +extern int handle_pte_fault(struct mm_struct *mm,
> > + struct vm_area_struct *vma, unsigned long address,
> > + pte_t *pte, pmd_t *pmd, unsigned int flags);
> > +extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
> > + pmd_t *pmd);
> > +extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
> > +extern int split_huge_page(struct page *page);
> > +#define split_huge_page_mm(__mm, __addr, __pmd) \
> > + do { \
> > + if (unlikely(pmd_trans_huge(*(__pmd)))) \
> > + __split_huge_page_mm(__mm, __addr, __pmd); \
> > + } while (0)
>
> I'm not sure what the current popular thing is but ...
>
> __pmd is used in this #define twice. Hypothetically, if the third
> parameter passed to this function had side-effects (e.g. pmd++), then
> the expectation of the caller is that it happens once but in reality it
> happens twice due to the use of #define.
>
> For this reason, I prefer to see static inlines instead of #defines where
> parameters appear more than once. Just because you do not have stupid
> callers doesn't mean that someone else will add one for you in the
> future.
The problem is it won't all build as static inlines: things like anon_vma
aren't defined in that context. But you're right to spot that problem!
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -82,21 +82,24 @@ extern void __split_huge_page_vma(struct
extern int split_huge_page(struct page *page);
#define split_huge_page_mm(__mm, __addr, __pmd) \
do { \
- if (unlikely(pmd_trans_huge(*(__pmd)))) \
- __split_huge_page_mm(__mm, __addr, __pmd); \
+ pmd_t *____pmd = __pmd; \
+ if (unlikely(pmd_trans_huge(*(____pmd)))) \
+ __split_huge_page_mm(__mm, __addr, ____pmd); \
} while (0)
#define split_huge_page_vma(__vma, __pmd) \
do { \
- if (unlikely(pmd_trans_huge(*(__pmd)))) \
- __split_huge_page_vma(__vma, __pmd); \
+ pmd_t *____pmd = __pmd; \
+ if (unlikely(pmd_trans_huge(*(____pmd)))) \
+ __split_huge_page_vma(__vma, ____pmd); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
+ pmd_t *____pmd = __pmd; \
smp_mb(); \
spin_unlock_wait(&(__anon_vma)->lock); \
smp_mb(); \
- VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
- pmd_trans_huge(*(__pmd))); \
+ VM_BUG_ON(pmd_trans_splitting(*(____pmd)) || \
+ pmd_trans_huge(*(____pmd))); \
} while (0)
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > +#define wait_split_huge_page(__anon_vma, __pmd) \
> > + do { \
> > + smp_mb(); \
> > + spin_unlock_wait(&(__anon_vma)->lock); \
> > + smp_mb(); \
> > + VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
> > + pmd_trans_huge(*(__pmd))); \
> > + } while (0)
>
> Barriers without comments are doomed to stupid questions. Can you add a
> comment on what this barrier is protecting against?
The reason is that spin_unlock_wait() has nothing to do with
spin_unlock(): it's just a while loop over a cast-to-volatile field, with
no barrier in the assembly. So the CPU can reorder anything around it and
it's almost worthless without a memory barrier around it.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
+ pmd_t *____pmd = __pmd; \
+ /* \
+ * spin_unlock_wait() is just a loop in C and so the \
+ * CPU can reorder anything around it. \
+ */ \
smp_mb(); \
spin_unlock_wait(&(__anon_vma)->lock); \
smp_mb(); \
- VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
- pmd_trans_huge(*(__pmd))); \
+ VM_BUG_ON(pmd_trans_splitting(*(____pmd)) || \
+ pmd_trans_huge(*(____pmd))); \
} while (0)
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
Now, thinking about it more, I think the first smp_mb is unnecessary. Any
code that could read or write stuff before the spin_unlock_wait will be
even safer after it. All that matters is that the code _after_ the
spin_unlock_wait doesn't run before it. So I'll remove the first smp_mb()
of the two. I guess I was a little too paranoid, but it's surely better to
have one barrier too many than one too few... eheh ;)
> I *think* it's because spin unlocking is not a barrier (although the exact
> details escape me) so presumably spin_unlock_wait() isn't one either. In
> this case, you have to be sure that reads/writes to that lock that occured
> since you called pmd_trans_splitting() have happened. Am I close?
Correct. spin_unlock() would be a barrier that only prevents the stuff
above it from moving down; it doesn't prevent the pmd_trans checks from
moving up. In this case spin_unlock_wait() is not an SMP barrier at all.
> > +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > +#if HPAGE_PMD_ORDER > MAX_ORDER
> > +#error "hugepages can't be allocated by the buddy allocator"
> > +#endif
> > +
> > +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> > +static inline int PageTransHuge(struct page *page)
> > +{
> > + VM_BUG_ON(PageTail(page));
> > + return PageHead(page);
> > +}
> > +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +#define transparent_hugepage_enabled(__vma) 0
> > +#define transparent_hugepage_defrag(__vma) 0
> > +#define transparent_hugepage_debug_cow() 0
> > +
> > +#define transparent_hugepage_flags 0UL
> > +static inline int split_huge_page(struct page *page)
> > +{
> > + return 0;
> > +}
> > +#define split_huge_page_mm(__mm, __addr, __pmd) \
> > + do { } while (0)
> > +#define split_huge_page_vma(__vma, __pmd) \
> > + do { } while (0)
> > +#define wait_split_huge_page(__anon_vma, __pmd) \
> > + do { } while (0)
> > +#define PageTransHuge(page) 0
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> > +#endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void
> > #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
> > #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
> > #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
> > +#if BITS_PER_LONG > 32
> > +#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
> > +#endif
> >
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE ?
>
> Because of the use of the page flag bits, I believe you are restricted to
> 64 bit anyway.
I'm restricted to 32 bits on x86. I expect that in the future VM_ flags
will get added in 64-bit-only versions, so I didn't want to pollute it or
make it special with CONFIG_TRANSPARENT_HUGEPAGE. In addition this isn't
allocated as an enum, so it's not as if an #ifdef would save anything
(plus I put it in the 64-bit space to avoid bothering anybody in the
32-bit space ;)
> > +#endif
> > + return 0;
> > +}
> > +module_init(ksm_init)
> > +
>
> ksm_init.... I'm not seeing the connection to KSM. I suspect you cut&pasted
> from ksm.c there and forgot to rename it. It's not important as the static
> avoids collisions but it looks odd.
Yep, and not just that: the entire mm_slot allocation and hash is
cut-and-pasted; search for ksm ;). The next thought should be to unify it
in a separate common .c file. Luckily for me there's a small difference
between the two mm_slot implementations that gives me an excuse for a
cut-and-pasted version to save RAM in the object size, unless somebody
finds a way to shrink the size of the ksm mm_slot, which is unlikely
because it works differently and needs an rmap chain (something I don't
need here).
> > +static int __init setup_transparent_hugepage(char *str)
> > +{
> > + if (!str)
> > + return 0;
> > + transparent_hugepage_flags = simple_strtoul(str, &str, 0);
> > + return 1;
> > +}
> > +__setup("transparent_hugepage=", setup_transparent_hugepage);
> > +
>
> The parameters are never sanity checked. This means that flags that should be
> mutually exclusive can be set at the same time. e.g. TRANSPARENT_HUGEPAGE_FLAG
> and TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG.
>
> I didn't check how important this is, but perhaps you should be using the
> same helper functions as used in sysfs.
I'm aware of that; my view is that if you own the system you may as well
"cp /dev/zero /dev/mem". If we want to make the API user-friendly we
should support a more user-friendly command line option than a raw number.
The important thing is that no app should use this thing; it's meant for
extreme cases only.
> > +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> > + struct mm_struct *mm)
> > +{
> > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> > +
> > + /* FIFO */
> > + if (!mm->pmd_huge_pte)
> > + INIT_LIST_HEAD(&pgtable->lru);
> > + else
> > + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> > + mm->pmd_huge_pte = pgtable;
> > +}
> > +
> > +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> > +{
> > + if (likely(vma->vm_flags & VM_WRITE))
> > + pmd = pmd_mkwrite(pmd);
> > + return pmd;
> > +}
> > +
> > +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > + unsigned long address, pmd_t *pmd,
> > + struct page *page,
> > + unsigned long haddr)
> > +{
> > + int ret = 0;
> > + pgtable_t pgtable;
> > +
> > + VM_BUG_ON(!PageCompound(page));
> > + pgtable = pte_alloc_one(mm, address);
> > + if (unlikely(!pgtable)) {
> > + put_page(page);
> > + return VM_FAULT_OOM;
> > + }
> > +
> > + clear_huge_page(page, haddr, HPAGE_PMD_NR);
> > + __SetPageUptodate(page);
> > +
> > + /*
> > + * spin_lock() below is not the equivalent of smp_wmb(), so
> > + * this is needed to avoid the clear_huge_page writes to
> > + * become visible after the set_pmd_at() write.
> > + */
> > + smp_wmb();
> > +
>
> I'm not seeing the equivalent barrier in do_anonymous_page() between
> when the page is zero'd and the PTE inserted. What am I missing?
Good point. The only explanation I can find is page_add_new_anon_rmap(),
which already takes the zone lru_lock and is a full barrier.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -219,13 +219,6 @@ static int __do_huge_pmd_anonymous_page(
clear_huge_page(page, haddr, HPAGE_PMD_NR);
__SetPageUptodate(page);
- /*
- * spin_lock() below is not the equivalent of smp_wmb(), so
- * this is needed to avoid the clear_huge_page writes to
- * become visible after the set_pmd_at() write.
- */
- smp_wmb();
-
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_none(*pmd))) {
spin_unlock(&mm->page_table_lock);
@@ -236,6 +229,12 @@ static int __do_huge_pmd_anonymous_page(
entry = mk_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
+ /*
+ * The spinlocking to take the lru_lock inside
+ * page_add_new_anon_rmap() acts as a full memory
+ * barrier to be sure clear_huge_page writes become
+ * visible after the set_pmd_at() write.
+ */
page_add_new_anon_rmap(page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
prepare_pmd_huge_pte(pgtable, mm);
> > + if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> > + if (unlikely(anon_vma_prepare(vma)))
> > + return VM_FAULT_OOM;
> > + page = alloc_hugepage(transparent_hugepage_defrag(vma));
> > + if (unlikely(!page))
> > + goto out;
> > +
>
> Something to consider here performance-wise when transparent-hugepage is
> defaulted to "always".
>
> alloc_hugepage() is potentially very expensive. You could enter direct
> reclaim, lumpy reclaim, wakeup kswapd etc. If you want to optimistically
> use huge pages, it might have less impact for !defrag to imply !__GFP_WAIT.
> On the other hand, if memory compaction is merged (my test machines are
> still tied up, hence no release since), one would be happy for it to compact,
> but not necessarily enter direct reclaim.
>
> It's a tricky one.....
Agreed, but I think GFP_ATOMIC is basically never going to get anything
if there's cache. This shrinks a bit of cache every time and it won't
wake up kswapd (if it does, it's a VM bug; kswapd should kick in only if
we fall below the mid watermark or whatever the equivalent hysteresis is).
> > + pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> > + GFP_KERNEL);
>
> Is kzalloc really necessary? It's fixed-size and you initialise it so
> why not just kmalloc()?
Nice optimization! I guess I improved the failure path since then; in
some ancient version the kzalloc likely was needed.
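Something like this should be enough now (just a sketch; the surrounding
error handling is assumed to stay as in the patch):

	/*
	 * Every slot of the array is written by the copy loop before it
	 * is read, so zeroing it up front is wasted work.
	 */
	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR, GFP_KERNEL);
	if (unlikely(!pages))
		goto out; /* same OOM path the kzalloc version already takes */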
> > + if (unlikely(!new_page))
> > + return do_huge_pmd_wp_page_fallback(mm, vma, address,
> > + pmd, orig_pmd, page, haddr);
> > +
> > + copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> > + __SetPageUptodate(new_page);
> > +
> > + /*
> > + * spin_lock() below is not the equivalent of smp_wmb(), so
> > + * this is needed to avoid the copy_huge_page writes to become
> > + * visible after the set_pmd_at() write.
> > + */
> > + smp_wmb();
> > +
>
> You do that here but not for the wp_page_fallback in the same type of
> setup. Can you point out the obvious thing I'm missing?
I'll remove it from here too.
Note that the fallback version has it, between the last set_pte_at and
the pmd_populate, and that's needed I think (even page_remove_rmap only
does atomic_dec_and_test, which won't act as a barrier if the count
doesn't reach zero I think). The kfree I don't like to rely on for that.
> > +static void __split_huge_page_refcount(struct page *page)
> > +{
> > + int i;
> > + unsigned long head_index = page->index;
> > + struct zone *zone = page_zone(page);
> > +
> > + /* prevent PageLRU to go away from under us, and freeze lru stats */
>
> hmm, it's not a "now" thing but I suspect it would be preferable to isolate
> from the LRU, release the LRU lock, do what you need to do and then add the
> pages in batch back onto the LRU.
We can't isolate it from the lru if it's already isolated, but we have to
be able to split it anyway. Maybe we can change it so that a hugepage is
always guaranteed to have PageLRU set, but split_huge_page has to take
the lru lock to add tail pages to the lru, and the way the VM isolates
the pages isn't friendly to that (to provide that guarantee I would have
to implement a split_huge_page that works inside the lru_lock, which
may be feasible). Having to enforce that guarantee looked more risky
at the moment; my version can handle pages inside or outside the lru
equally well, but I'm not against changing it to enforce that a
hugepage is always in the lru.
> > + spin_lock_irq(&zone->lru_lock);
> > + compound_lock(page);
> > +
> > + for (i = 1; i < HPAGE_PMD_NR; i++) {
> > + struct page *page_tail = page + i;
> > +
> > + /* tail_page->_count cannot change */
> > + atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> > + BUG_ON(page_count(page) <= 0);
> > + atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> > + BUG_ON(atomic_read(&page_tail->_count) <= 0);
> > +
> > + /* after clearing PageTail the gup refcount can be released */
> > + smp_mb();
> > +
> > + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > + page_tail->flags |= (page->flags &
> > + ((1L << PG_referenced) |
> > + (1L << PG_swapbacked) |
> > + (1L << PG_mlocked) |
> > + (1L << PG_uptodate)));
> > + page_tail->flags |= (1L << PG_dirty);
> > +
> > + /*
> > + * 1) clear PageTail before overwriting first_page
> > + * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> > + */
> > + smp_wmb();
> > +
>
> You have the LRU taken with an interrupt-safe lock. Who else could be
> manipulating the struct page to make this barrier necessary?
This is the barrier that ensures PageTail is cleared before first_page
is overwritten (by the lines just below). We discussed this in the
other mail.
>
> > + /*
> > + * __split_huge_page_splitting() already set the
> > + * splitting bit in all pmd that could map this
> > + * hugepage, that will ensure no CPU can alter the
> > + * mapcount on the head page. The mapcount is only
> > + * accounted in the head page and it has to be
> > + * transferred to all tail pages in the below code. So
> > + * for this code to be safe, the split the mapcount
> > + * can't change. But that doesn't mean userland can't
> > + * keep changing and reading the page contents while
> > + * we transfer the mapcount, so the pmd splitting
> > + * status is achieved setting a reserved bit in the
> > + * pmd, not by clearing the present bit.
> > + */
> > + BUG_ON(page_mapcount(page_tail));
> > + page_tail->_mapcount = page->_mapcount;
> > +
> > + BUG_ON(page_tail->mapping);
> > + page_tail->mapping = page->mapping;
> > +
> > + page_tail->index = ++head_index;
> > +
> > + BUG_ON(!PageAnon(page_tail));
> > + BUG_ON(!PageUptodate(page_tail));
> > + BUG_ON(!PageDirty(page_tail));
> > + BUG_ON(!PageSwapBacked(page_tail));
> > +
> > + lru_add_page_tail(zone, page, page_tail);
> > +
> > + put_page(page_tail);
> > + }
> > +
> > + ClearPageCompound(page);
> > + compound_unlock(page);
> > + spin_unlock_irq(&zone->lru_lock);
> > +
> > + BUG_ON(page_count(page) <= 0);
> > +}
> > + int mapcount, mapcount2;
> > + struct vm_area_struct *vma;
> > +
> > + BUG_ON(!PageHead(page));
> > + BUG_ON(PageTail(page));
> > +
> > + mapcount = 0;
> > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> > + unsigned long addr = vma_address(page, vma);
> > + if (addr == -EFAULT)
> > + continue;
> > + mapcount += __split_huge_page_splitting(page, vma, addr);
> > + }
> > + BUG_ON(mapcount != page_mapcount(page));
> > +
> > + __split_huge_page_refcount(page);
> > +
> > + mapcount2 = 0;
> > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> > + unsigned long addr = vma_address(page, vma);
> > + if (addr == -EFAULT)
> > + continue;
> > + mapcount2 += __split_huge_page_map(page, vma, addr);
> > + }
> > + BUG_ON(mapcount != mapcount2);
> > +}
> > +
> > +/* must run with mmap_sem to prevent vma to go away */
>
> held for read?
Read or write, both are ok; this is why I didn't specify it.
> Broadly speaking, this was a lot more understandable than I was expecting
> and I did not find any major snags or difficulties. However, I've also run
> out of beans again so I'll be taking another break before moving on to the
> rest of the set :)
Awesome :).
Please start running #8 on your laptop or workstation or both ;) If
you've got i915, with modeset=1 or KMS=y in .config, we need to track
down what leaves a pte_special on an anonymous vma (likely gem, as it
goes away with modeset=0, but I don't see where). It has to be driver
related, but it's not guaranteed that the bug also happens without my
patch. The kernel is stable even when khugepaged complains about this
pte_special every 10 sec.
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 22:36 ` Andrea Arcangeli
@ 2010-01-28 22:43 ` Andrea Arcangeli
2010-01-29 0:00 ` Andrea Arcangeli
2010-01-29 15:29 ` Mel Gorman
2 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 22:43 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 11:36:53PM +0100, Andrea Arcangeli wrote:
> do { \
> - if (unlikely(pmd_trans_huge(*(__pmd)))) \
> - __split_huge_page_mm(__mm, __addr, __pmd); \
> + pmd_t ____pmd = __pmd; \
> + if (unlikely(pmd_trans_huge(*(____pmd)))) \
> + __split_huge_page_mm(__mm, __addr, ____pmd); \
> } while (0)
then the parentheses should be moved from ____pmd to __pmd, of course...
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 22:36 ` Andrea Arcangeli
2010-01-28 22:43 ` Andrea Arcangeli
@ 2010-01-29 0:00 ` Andrea Arcangeli
2010-01-29 15:29 ` Mel Gorman
2 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-29 0:00 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 11:36:53PM +0100, Andrea Arcangeli wrote:
> able to split it anyway. Maybe we can change it so that a hugepage is
> always guaranteed to have PageLRU set, but split_huge_page has to take
The most important benefit of not having enforced that guarantee is
that young bits will be checked on isolated pages (so with PageLRU not
set) while they are still mapped by a pmd_trans_huge pmd. This way the
qemu pte side and the kvm spte side will both keep using hugetlb and
pmd_trans_huge in the pmd/spmd (as long as at least one of the pmds
and spmds has the young bit set).
As I said to Robin in a different thread, it may not be necessary to
call pmdp_flush_splitting_notify; pmdp_flush_splitting may be enough
if the mmu notifier users learn that, when they map an spmd with PSE
set, they may also get an invalidate at a later time when the page no
longer has PageHead/Tail/compound set. The current way is conceptually
simpler and less risky because by the time we clear PageTail/compound
in split_huge_page_refcount the secondary mapping is gone just after
the primary mapping (a ksm minor fault will wait for the splitting bit
to go away like any other regular page fault from other threads).
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-28 22:36 ` Andrea Arcangeli
2010-01-28 22:43 ` Andrea Arcangeli
2010-01-29 0:00 ` Andrea Arcangeli
@ 2010-01-29 15:29 ` Mel Gorman
2010-01-29 18:59 ` Andrea Arcangeli
` (2 more replies)
2 siblings, 3 replies; 50+ messages in thread
From: Mel Gorman @ 2010-01-29 15:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 11:36:53PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 28, 2010 at 05:57:54PM +0000, Mel Gorman wrote:
> > It occurs to me that this infrastructure should be reusable to allow
> > optional swapping of hugetlbfs. I haven't investigated the possibility properly
> > but it should be doable as a mount option with maybe a boot-parameter for
> > shared memory.
>
I thought it was higher priority to back tmpfs with this, since tmpfs is
mounted by default and used by more apps automatically.
There isn't much pressure to merge hugetlbfs and tmpfs at the moment
although it would have some advantages. Even if tmpfs is mounted by
default, it does not necessarily mean that it would be usable with huge
pages. Minimally, page size would have to be determined on a per-file
basis and there isn't much in the way of clean interfaces for doing such
a thing.
> I'm not sure if it
> makes sense to back hugetlbfs with this given the extra length it went
> to avoid the regular VM fault paths while getting closer and closer in
> features to the default VM paths but I'm not against it either, just a
> thought.
>
Will keep it in mind.
> > As I write this, I haven't read the patch but I suspect the default will
> > depend on how mmap_sem is used. For example, if "always" means that down_read
> > is called for longer periods of time then workloads that mmap() heavily or
> > otherwise depend on down_write could suffer.
>
> Note, page fault time goes down from >5 sec to <2 sec with lockdep and
> debug (otherwise it's only 50% faster). So mmap_sem hold time will be
> lower with "always". COWs are the only corner case that is a bit
> slower on my systems, because I've got a 2M cache and that thrashes on
> the 4M copy per fault... So even for totally short-lived apps, mmap_sem
> hold time can decrease by 33% and the app run 50% faster (or 100% faster
> with lockdep) even without any mmap_sem contention at all. But for other
> apps the cache thrashing may slow things down a bit.
>
I was more concerned about the hold times during promotion of huge pages
and splitting because it's a new source of mmap_sem hold times. Time
will tell if it's a real problem or not.
> The main reason for "always" default is to be sure everybody is
> stressing it. It can also be merged in mainline with default
> madvise/madvise/madvise, that is fine by me, for now it has to be
> always > enabled.
That's fair.
> Leaving "always" as a default like in my laptop
> where I'm writing this, isn't insane but it's not a 100% guaranteed
> speedup always because of the cache trashing. We've also to change the
> copy/clear_huge to use non temporal stores, that might help.
>
> > I think that's the first time I've ever heard OOM handling described as
> > "fine" :)
>
> Eheeh ;)
>
> > FWIW, having read the available papers on transparent support, I agree that
> > prefault logic is unlikely to be a proven win. The complexity and overhead of
> > prefaulting are guaranteed and easy to measure. The benefits due to huge page
> > usage are a "maybe", harder to prove and depend on the workload and hardware.
>
> Full agreement. My prototype benchmark results showed it's a very gray
> area. I want something that runs userland as fast as it can get in the
> best case, not just 10/20/30% faster with the only benefit that it
> thrashes less cache. Because even if it thrashes the cache less, it will
> still thrash more of it than with the default 4k page size... and we
> don't get enough benefit in return (notably zero tlb benefits). I want
> the full benefit in return immediately and max speedup for the best
> case; the worst case is still going to run slower even with
> prefault. Plus we can disable it and selectively enable it if one has L2
> cache issues. And the complexity of the code isn't even remotely
> comparable...
>
> > Using hugetlbfs with huge pages is also recognised to have significant costs
> > during startup which is faulting in pages larger than 4K. In benchmarking,
> > it can show up as huge pages hurting performance if the benchmark is too
> > short-lived.
>
> We'll see how it goes with this one.
>
> > Will need Documentation/ updates at some point explaining these flags
> > and how they apply to the transparent_hugepage= boot parameter.
>
> I expect admins to fiddle with /sys via boot time scripts;
> transparent_hugepage= I intended more for =0 and similar low-level usage
> if people can't boot or something, so I don't see a need for
> documentation other than the =0 setting, but we can document it.
>
Maybe make it no_transparent_hugepage then with the expectation that
init scripts do the proper twiddling.
> > It's not overly important but as these are flags, it would have been
> > more readable to just define them as 1, 2, 4, 8 etc rather than using
> > 1<<TRANSPARENT_HUGEPAGE_X in so many places. i.e. similar to how GFP
> > flags are defined and used. I know you use test_bit on these later but
> > it's not clear you need the locked checks and could just use
> > "if (flags & whatever)"
>
> test_bit is as lockless and as out-of-order as (flag & whatever). So
> only the setting of the default value is annoying, but all other places
> testing it are already as efficient and as short as possible I think.
> The writing of the bits has to be atomic right now, and it's faster than
> having to take locks (but writing is unimportant and extremely rare;
> this is why the variable is __read_mostly).
>
Ok.
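For anyone reading along, the two read-side forms really are equivalent;
a minimal sketch using the flag names from the patch (the local variable
is made up for illustration):

	unsigned long use_thp;

	/* bitops helper, lockless read */
	use_thp = test_bit(TRANSPARENT_HUGEPAGE_FLAG,
			   &transparent_hugepage_flags);

	/* open-coded mask test, same cost on the read side */
	use_thp = !!(transparent_hugepage_flags &
		     (1UL << TRANSPARENT_HUGEPAGE_FLAG));

The atomic set_bit()/clear_bit() are only needed on the write side (the
sysfs handlers); the readers don't care which form is used.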
> > Should these be defined by the architecture? I ask because your patch notes
> > that architectures supporting huge pages at levels other than the PMD will
> > have issues. On IA-64 for example, HPAGE_SHIFT can be set as a kernel boot
> > parameter.
> >
> > That said, these definitions work for the architecture that *is* supported.
>
> There is no risk because nothing is going to ever use HPAGE_PMD_* if
> TRANSPARENT_HUGEPAGE is not set. So as long as it builds... It's not
> inside #ifdef because of this:
>
> next = pmd_addr_end(addr, end);
> if (pmd_trans_huge(*pmd)) {
> if (next-addr != HPAGE_PMD_SIZE)
> split_huge_page_vma(vma, pmd);
> else if (zap_huge_pmd(tlb, vma, pmd)) {
> (*zap_work)--;
> continue;
> }
> /* fall through */
> }
>
> but pmd_trans_huge is defined to 0 at compile time, so HPAGE_PMD_SIZE
> can be random. I guess I'll change it to this to avoid any possible
> future risk:
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -42,11 +42,11 @@ extern pmd_t *page_check_address_pmd(str
> unsigned long address,
> enum page_check_address_pmd_flag flag);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define HPAGE_PMD_SHIFT HPAGE_SHIFT
> #define HPAGE_PMD_MASK HPAGE_MASK
> #define HPAGE_PMD_SIZE HPAGE_SIZE
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define transparent_hugepage_enabled(__vma) \
> (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
> (transparent_hugepage_flags & \
> @@ -116,6 +116,10 @@ static inline int PageTransCompound(stru
> return PageCompound(page);
> }
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
> +#define HPAGE_PMD_MASK ({ BUG(); 0; })
> +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
> +
> #define transparent_hugepage_enabled(__vma) 0
> #define transparent_hugepage_defrag(__vma) 0
> #define transparent_hugepage_debug_cow() 0
>
Seems reasonable.
>
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +#define transparent_hugepage_enabled(__vma) \
> > > + (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
> > > + (transparent_hugepage_flags & \
> > > + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> > > + (__vma)->vm_flags & VM_HUGEPAGE))
> > > +#define transparent_hugepage_defrag(__vma) \
> > > + (transparent_hugepage_flags & \
> > > + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
> > > + (transparent_hugepage_flags & \
> > > + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
> > > + (__vma)->vm_flags & VM_HUGEPAGE))
> > > +#ifdef CONFIG_DEBUG_VM
> > > +#define transparent_hugepage_debug_cow() \
> > > + (transparent_hugepage_flags & \
> > > + (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> > > +#else /* CONFIG_DEBUG_VM */
> > > +#define transparent_hugepage_debug_cow() 0
> > > +#endif /* CONFIG_DEBUG_VM */
> > > +
> > > +extern unsigned long transparent_hugepage_flags;
> > > +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > + pmd_t *dst_pmd, pmd_t *src_pmd,
> > > + struct vm_area_struct *vma,
> > > + unsigned long addr, unsigned long end);
> > > +extern int handle_pte_fault(struct mm_struct *mm,
> > > + struct vm_area_struct *vma, unsigned long address,
> > > + pte_t *pte, pmd_t *pmd, unsigned int flags);
> > > +extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
> > > + pmd_t *pmd);
> > > +extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
> > > +extern int split_huge_page(struct page *page);
> > > +#define split_huge_page_mm(__mm, __addr, __pmd) \
> > > + do { \
> > > + if (unlikely(pmd_trans_huge(*(__pmd)))) \
> > > + __split_huge_page_mm(__mm, __addr, __pmd); \
> > > + } while (0)
> >
> > I'm not sure what the current popular thing is but ...
> >
> > __pmd is using in this #define twice. Hypothetically, if the third
> > parameter passed to this function had side-effects (e.g. pmd++), then
> > the expectation of the caller is that it happens once but in reality it
> > happens twice due to the use of #define.
> >
> > For this reason, I prefer to see static inlines instead of #defines where
> > parameters appear more than once. Just because you do not have stupid
> > callers doesn't mean that someone else will add one for you in the
> > future.
>
> The problem is that it won't all build without this; things like
> anon_vma aren't defined.
Gack, didn't spot that.
> But you're right spotting that problem!
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -82,21 +82,24 @@ extern void __split_huge_page_vma(struct
> extern int split_huge_page(struct page *page);
> #define split_huge_page_mm(__mm, __addr, __pmd) \
> do { \
> - if (unlikely(pmd_trans_huge(*(__pmd)))) \
> - __split_huge_page_mm(__mm, __addr, __pmd); \
> + pmd_t ____pmd = __pmd; \
> + if (unlikely(pmd_trans_huge(*(____pmd)))) \
> + __split_huge_page_mm(__mm, __addr, ____pmd); \
> } while (0)
> #define split_huge_page_vma(__vma, __pmd) \
> do { \
> - if (unlikely(pmd_trans_huge(*(__pmd)))) \
> - __split_huge_page_vma(__vma, __pmd); \
> + pmd_t ____pmd = __pmd; \
> + if (unlikely(pmd_trans_huge(*(____pmd)))) \
> + __split_huge_page_vma(__vma, ____pmd); \
> } while (0)
> #define wait_split_huge_page(__anon_vma, __pmd) \
> do { \
> + pmd_t ____pmd = __pmd; \
> smp_mb(); \
> spin_unlock_wait(&(__anon_vma)->lock); \
> smp_mb(); \
> - VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
> - pmd_trans_huge(*(__pmd))); \
> + VM_BUG_ON(pmd_trans_splitting(*(____pmd)) || \
> + pmd_trans_huge(*(____pmd))); \
> } while (0)
> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
That's much better - I don't spot any other nasty side-effects.
>
> > > +#define wait_split_huge_page(__anon_vma, __pmd) \
> > > + do { \
> > > + smp_mb(); \
> > > + spin_unlock_wait(&(__anon_vma)->lock); \
> > > + smp_mb(); \
> > > + VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
> > > + pmd_trans_huge(*(__pmd))); \
> > > + } while (0)
> >
> > Barriers without comments are doomed to stupid questions. Can you add a
> > comment on what this barrier is protecting against?
>
> The reason is that spin_unlock_wait has nothing to do with
> spin_unlock; it's a while loop on a casted volatile field, with nothing
> in assembly. So the CPU can move anything around it and it's almost
> worthless without a memory barrier around it.
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> #define wait_split_huge_page(__anon_vma, __pmd) \
> do { \
> + pmd_t ____pmd = __pmd; \
> + /* \
> + * spin_unlock_wait() is just a loop in C and so the \
> + * CPU can reorder anything around it. \
> + */ \
> smp_mb(); \
> spin_unlock_wait(&(__anon_vma)->lock); \
> smp_mb(); \
> - VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
> - pmd_trans_huge(*(__pmd))); \
> + VM_BUG_ON(pmd_trans_splitting(*(____pmd)) || \
> + pmd_trans_huge(*(____pmd))); \
> } while (0)
> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
Thanks for the comment.
> Thinking about it more, I think the first smp_mb is unnecessary. Any
> code that could read or write stuff before the spin_unlock_wait will be
> even safer after it. All that matters is that the code _after_ the
> spin_unlock_wait doesn't run before it. So I'll remove the first
> smp_mb() of the two. I guess I was a little too paranoid, but it's
> surely better to have one barrier too many than one too few... eheh ;)
>
heh, as long as you don't mind the inevitable questions that come along
every time a barrier is stuck anywhere :)
> > I *think* it's because spin unlocking is not a barrier (although the exact
> > details escape me) so presumably spin_unlock_wait() isn't one either. In
> > this case, you have to be sure that reads/writes to that lock that occurred
> > since you called pmd_trans_splitting() have happened. Am I close?
>
> Correct, spin_unlock would be a barrier that only prevents earlier
> accesses from moving down; it doesn't prevent the pmd_trans reads from
> moving up. In this case spin_unlock_wait is not an SMP barrier at all.
>
Once again, thanks for the explanation.
> > > +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > > +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > > +#if HPAGE_PMD_ORDER > MAX_ORDER
> > > +#error "hugepages can't be allocated by the buddy allocator"
> > > +#endif
> > > +
> > > +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> > > +static inline int PageTransHuge(struct page *page)
> > > +{
> > > + VM_BUG_ON(PageTail(page));
> > > + return PageHead(page);
> > > +}
> > > +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > > +#define transparent_hugepage_enabled(__vma) 0
> > > +#define transparent_hugepage_defrag(__vma) 0
> > > +#define transparent_hugepage_debug_cow() 0
> > > +
> > > +#define transparent_hugepage_flags 0UL
> > > +static inline int split_huge_page(struct page *page)
> > > +{
> > > + return 0;
> > > +}
> > > +#define split_huge_page_mm(__mm, __addr, __pmd) \
> > > + do { } while (0)
> > > +#define split_huge_page_vma(__vma, __pmd) \
> > > + do { } while (0)
> > > +#define wait_split_huge_page(__anon_vma, __pmd) \
> > > + do { } while (0)
> > > +#define PageTransHuge(page) 0
> > > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > > +
> > > +#endif /* _LINUX_HUGE_MM_H */
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void
> > > #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
> > > #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
> > > #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
> > > +#if BITS_PER_LONG > 32
> > > +#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
> > > +#endif
> > >
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE ?
> >
> > Because of the use of the page flag bits, I believe you are restricted to
> > 64 bit anyway.
>
> I'm restricted to 32 bits on x86. I expect that in the future more VM_
> flags will get added as 64bit-only, so I didn't want to pollute it or
> make it special with CONFIG_TRANSPARENT_HUGEPAGE. In addition this isn't
> allocated as an enum, so it's not like an #ifdef would save anything
> (plus I placed it in the 64bit-only space to avoid bothering anybody in
> the 32bit space ;)
>
> > > +#endif
> > > + return 0;
> > > +}
> > > +module_init(ksm_init)
> > > +
> >
> > ksm_init.... I'm not seeing the connection to KSM. I suspect you cut&pasted
> > from ksm.c there and forgot to rename it. It's not important as the static
> > avoids collisions but it looks odd.
>
> Yep, not just that, the entire mm_slot allocation and hash is
> cut-and-pasted, search for ksm ;). The next thought should be to unify
> it in a common .c file. Luckily for me there are some small differences
> between the two mm_slot implementations, which gives me an excuse for a
> cut-and-pasted version that saves RAM in the object size, unless
> somebody finds a way to shrink the ksm mm_slot, which is unlikely
> because it works differently and needs an rmap chain (something I don't
> need here).
>
> > > +static int __init setup_transparent_hugepage(char *str)
> > > +{
> > > + if (!str)
> > > + return 0;
> > > + transparent_hugepage_flags = simple_strtoul(str, &str, 0);
> > > + return 1;
> > > +}
> > > +__setup("transparent_hugepage=", setup_transparent_hugepage);
> > > +
> >
> > The parameters are never sanity checked. This means that flags that should be
> > mutually exclusive can be set at the same time. e.g. TRANSPARENT_HUGEPAGE_FLAG
> > and TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG.
> >
> > I didn't check how important this is, but perhaps you should be using the
> > same helper functions as used in sysfs.
>
> I'm aware of that; my view is that if you own the system you might as
> well "cp /dev/zero /dev/mem". If we want to make the API user friendly
> we should support a more user-friendly command line option than a raw
> number. The important thing is that no app should use this thing; it's
> meant for extreme cases only.
>
In that case, I'd suggest simply no_transparent_hugepage with the
expectation that it's only used to get around early-boot problems that
might crop up. While the opportunity for a user to hang themselves is
always good for chuckles, there is little point giving them more rope
than they need :)
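Roughly like this, I'd imagine (just a sketch of the idea, not a tested
patch against this series):

	static int __init setup_no_transparent_hugepage(char *str)
	{
		/* wipe every TRANSPARENT_HUGEPAGE_* flag so nothing is enabled */
		transparent_hugepage_flags = 0;
		return 1;
	}
	__setup("no_transparent_hugepage", setup_no_transparent_hugepage);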
> > > +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> > > + struct mm_struct *mm)
> > > +{
> > > + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> > > +
> > > + /* FIFO */
> > > + if (!mm->pmd_huge_pte)
> > > + INIT_LIST_HEAD(&pgtable->lru);
> > > + else
> > > + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> > > + mm->pmd_huge_pte = pgtable;
> > > +}
> > > +
> > > +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> > > +{
> > > + if (likely(vma->vm_flags & VM_WRITE))
> > > + pmd = pmd_mkwrite(pmd);
> > > + return pmd;
> > > +}
> > > +
> > > +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> > > + struct vm_area_struct *vma,
> > > + unsigned long address, pmd_t *pmd,
> > > + struct page *page,
> > > + unsigned long haddr)
> > > +{
> > > + int ret = 0;
> > > + pgtable_t pgtable;
> > > +
> > > + VM_BUG_ON(!PageCompound(page));
> > > + pgtable = pte_alloc_one(mm, address);
> > > + if (unlikely(!pgtable)) {
> > > + put_page(page);
> > > + return VM_FAULT_OOM;
> > > + }
> > > +
> > > + clear_huge_page(page, haddr, HPAGE_PMD_NR);
> > > + __SetPageUptodate(page);
> > > +
> > > + /*
> > > + * spin_lock() below is not the equivalent of smp_wmb(), so
> > > + * this is needed to avoid the clear_huge_page writes to
> > > + * become visible after the set_pmd_at() write.
> > > + */
> > > + smp_wmb();
> > > +
> >
> > I'm not seeing the equivalent barrier in do_anonymous_page() between
> > when the page is zero'd and the PTE inserted. What am I missing?
>
> Good point. Only explanation I can find is page_add_new_anon_rmap that
> already takes the zone_lru lock and is a full barrier.
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -219,13 +219,6 @@ static int __do_huge_pmd_anonymous_page(
> clear_huge_page(page, haddr, HPAGE_PMD_NR);
> __SetPageUptodate(page);
>
> - /*
> - * spin_lock() below is not the equivalent of smp_wmb(), so
> - * this is needed to avoid the clear_huge_page writes to
> - * become visible after the set_pmd_at() write.
> - */
> - smp_wmb();
> -
> spin_lock(&mm->page_table_lock);
> if (unlikely(!pmd_none(*pmd))) {
> spin_unlock(&mm->page_table_lock);
> @@ -236,6 +229,12 @@ static int __do_huge_pmd_anonymous_page(
> entry = mk_pmd(page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> entry = pmd_mkhuge(entry);
> + /*
> + * The spinlocking to take the lru_lock inside
> + * page_add_new_anon_rmap() acts as a full memory
> + * barrier to be sure clear_huge_page writes become
> + * visible after the set_pmd_at() write.
> + */
> page_add_new_anon_rmap(page, vma, haddr);
> set_pmd_at(mm, haddr, pmd, entry);
> prepare_pmd_huge_pte(pgtable, mm);
>
That looks a lot better. One less barrier, is just as safe and is consistent
with the normal fault paths.
>
> > > + if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> > > + if (unlikely(anon_vma_prepare(vma)))
> > > + return VM_FAULT_OOM;
> > > + page = alloc_hugepage(transparent_hugepage_defrag(vma));
> > > + if (unlikely(!page))
> > > + goto out;
> > > +
> >
> > Something to consider here performance-wise when transparent-hugepage is
> > defaulted to "always".
> >
> > alloc_hugepage() is potentially very expensive. You could enter direct
> > reclaim, lumpy reclaim, wakeup kswapd etc. If you want to optimistically
> > use huge pages, it might have less impact for !defrag to imply !__GFP_WAIT.
> > On the other hand, if memory compaction is merged (my test machines are
> > still tied up, hence no release since), one would be happy for it to compact,
> > but not necessarily enter direct reclaim.
> >
> > It's a tricky one.....
>
> Agreed, but I think GFP_ATOMIC is basically never going to get anything
> if there's cache. This shrinks a bit of cache every time and it won't
> wake up kswapd (if it does, it's a VM bug; kswapd should kick in only if
> we fall below the mid watermark or whatever the equivalent hysteresis is).
>
In fact kswapd will get woken up if you fail a GFP_ATOMIC allocation.
> What I would expect to happen is that the first allocation would fail but
kswapd would wake up and start reclaiming for order-9 (i.e. the huge page
size). This might be a more hit-and-miss affair than you'd like though and
would make performance predictions that bit harder.
> > > + pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> > > + GFP_KERNEL);
> >
> > Is kzalloc really necessary? It's fixed-size and you initialise it so
> > why not just kmalloc()?
>
> Nice optimization! I guess I improved the failure path since then; in
> some ancient version the kzalloc likely was needed.
>
> > > + if (unlikely(!new_page))
> > > + return do_huge_pmd_wp_page_fallback(mm, vma, address,
> > > + pmd, orig_pmd, page, haddr);
> > > +
> > > + copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> > > + __SetPageUptodate(new_page);
> > > +
> > > + /*
> > > + * spin_lock() below is not the equivalent of smp_wmb(), so
> > > + * this is needed to avoid the copy_huge_page writes to become
> > > + * visible after the set_pmd_at() write.
> > > + */
> > > + smp_wmb();
> > > +
> >
> > You do that here but not for the wp_page_fallback in the same type of
> > setup. Can you point out the obvious thing I'm missing?
>
> I'll remove it from here too.
>
> Note that the fallback version has it, between the last set_pte_at and
Bah.
*slaps self*
> the pmd_populate, and that's needed I think (even page_remove_rmap only
> does atomic_dec_and_test, which won't act as a barrier if the count
> doesn't reach zero I think). The kfree I don't like to rely on for that.
>
I don't think you can rely on it. I agree that barrier makes more sense
so that the last set_pte_at is visible before the PMD is.
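To spell out the placement being discussed (a rough sketch of the
fallback path; the loop details and variable names here are assumptions,
not code quoted from the patch):

	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
		pte_t entry = mk_pte(pages[i], vma->vm_page_prot);
		set_pte_at(mm, haddr, pte + i, entry);
	}
	/*
	 * The 4k ptes written above must be visible before the regular
	 * pmd that points at them replaces the huge pmd.
	 */
	smp_wmb();
	pmd_populate(mm, pmd, pgtable);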
> > > +static void __split_huge_page_refcount(struct page *page)
> > > +{
> > > + int i;
> > > + unsigned long head_index = page->index;
> > > + struct zone *zone = page_zone(page);
> > > +
> > > + /* prevent PageLRU to go away from under us, and freeze lru stats */
> >
> > hmm, it's not a "now" thing but I suspect it would be preferable to isolate
> > from the LRU, release the LRU lock, do what you need to do and then add the
> > pages in batch back onto the LRU.
>
> We can't isolate it from the lru if it's already isolated, but we have
> to be able to split it anyway. Maybe we can change it so that a hugepage
> is always guaranteed to have PageLRU set, but split_huge_page has to
> take the lru lock to add tail pages to the lru, and the way the VM
> isolates the pages isn't friendly to that (to provide that guarantee I
> would have to implement a split_huge_page that works inside the
> lru_lock, which may be feasible). Having to enforce that guarantee
> looked more risky at the moment; my version can handle pages inside or
> outside the lru equally well, but I'm not against changing it to enforce
> that a hugepage is always in the lru.
>
It's certainly a future thing. I think it's unlikely that a workload will
be identified that is reclaim-heavy and manages to suffer slowdowns because
huge pages were being split a lot. I don't think it's worth worrying
about right now.
> > > + spin_lock_irq(&zone->lru_lock);
> > > + compound_lock(page);
> > > +
> > > + for (i = 1; i < HPAGE_PMD_NR; i++) {
> > > + struct page *page_tail = page + i;
> > > +
> > > + /* tail_page->_count cannot change */
> > > + atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> > > + BUG_ON(page_count(page) <= 0);
> > > + atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> > > + BUG_ON(atomic_read(&page_tail->_count) <= 0);
> > > +
> > > + /* after clearing PageTail the gup refcount can be released */
> > > + smp_mb();
> > > +
> > > + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > + page_tail->flags |= (page->flags &
> > > + ((1L << PG_referenced) |
> > > + (1L << PG_swapbacked) |
> > > + (1L << PG_mlocked) |
> > > + (1L << PG_uptodate)));
> > > + page_tail->flags |= (1L << PG_dirty);
> > > +
> > > + /*
> > > + * 1) clear PageTail before overwriting first_page
> > > + * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> > > + */
> > > + smp_wmb();
> > > +
> >
> > You have the LRU taken with an interrupt-safe lock. Who else could be
> > manipulating the struct page to make this barrier necessary?
>
> This is the barrier that ensures PageTail is cleared before first_page
> is overwritten (by the lines just below). We discussed this in the
> other mail.
>
*slaps self*
> >
> > > + /*
> > > + * __split_huge_page_splitting() already set the
> > > + * splitting bit in all pmd that could map this
> > > + * hugepage, that will ensure no CPU can alter the
> > > + * mapcount on the head page. The mapcount is only
> > > + * accounted in the head page and it has to be
> > > + * transferred to all tail pages in the below code. So
> > > + * for this code to be safe, the split the mapcount
> > > + * can't change. But that doesn't mean userland can't
> > > + * keep changing and reading the page contents while
> > > + * we transfer the mapcount, so the pmd splitting
> > > + * status is achieved setting a reserved bit in the
> > > + * pmd, not by clearing the present bit.
> > > + */
> > > + BUG_ON(page_mapcount(page_tail));
> > > + page_tail->_mapcount = page->_mapcount;
> > > +
> > > + BUG_ON(page_tail->mapping);
> > > + page_tail->mapping = page->mapping;
> > > +
> > > + page_tail->index = ++head_index;
> > > +
> > > + BUG_ON(!PageAnon(page_tail));
> > > + BUG_ON(!PageUptodate(page_tail));
> > > + BUG_ON(!PageDirty(page_tail));
> > > + BUG_ON(!PageSwapBacked(page_tail));
> > > +
> > > + lru_add_page_tail(zone, page, page_tail);
> > > +
> > > + put_page(page_tail);
> > > + }
> > > +
> > > + ClearPageCompound(page);
> > > + compound_unlock(page);
> > > + spin_unlock_irq(&zone->lru_lock);
> > > +
> > > + BUG_ON(page_count(page) <= 0);
> > > +}
> > > + int mapcount, mapcount2;
> > > + struct vm_area_struct *vma;
> > > +
> > > + BUG_ON(!PageHead(page));
> > > + BUG_ON(PageTail(page));
> > > +
> > > + mapcount = 0;
> > > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> > > + unsigned long addr = vma_address(page, vma);
> > > + if (addr == -EFAULT)
> > > + continue;
> > > + mapcount += __split_huge_page_splitting(page, vma, addr);
> > > + }
> > > + BUG_ON(mapcount != page_mapcount(page));
> > > +
> > > + __split_huge_page_refcount(page);
> > > +
> > > + mapcount2 = 0;
> > > + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> > > + unsigned long addr = vma_address(page, vma);
> > > + if (addr == -EFAULT)
> > > + continue;
> > > + mapcount2 += __split_huge_page_map(page, vma, addr);
> > > + }
> > > + BUG_ON(mapcount != mapcount2);
> > > +}
> > > +
> > > +/* must run with mmap_sem to prevent vma to go away */
> >
> > held for read?
>
> Read or write, both are ok; this is why I didn't specify it.
>
> > Broadly speaking, this was a lot more understandable than I was expecting
> > and I did not find any major snags or difficulties. However, I've also run
> > out of beans again so I'll be taking another break before moving on to the
> > rest of the set :)
>
> Awesome :).
>
> Please start running #8 on your laptop or workstation or both ;) If
> you've got i915, with modeset=1 or KMS=y in .config, we need to track
> down what leaves a pte_special on an anonymous vma (likely gem, as it
> goes away with modeset=0, but I don't see where). It has to be driver
> related, but it's not guaranteed that the bug also happens without my
> patch. The kernel is stable even when khugepaged complains about this
> pte_special every 10 sec.
>
Unfortunately, I don't have an i915, but I'll be testing the patchset on
the laptop over the weekend.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-29 15:29 ` Mel Gorman
@ 2010-01-29 18:59 ` Andrea Arcangeli
2010-01-31 20:24 ` Andrea Arcangeli
2010-02-01 13:27 ` Andrea Arcangeli
2 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-29 18:59 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Fri, Jan 29, 2010 at 03:29:39PM +0000, Mel Gorman wrote:
> In that case, I'd suggest simply no_transparent_hugepage with the
> expectation that it's only used to get around early-boot problems that
> might crop up. While the opportunity for a user to hang themselves is
> always good for chuckles, there is little point giving them more rope
> than they need :)
I tend to agree.
> Unfortunately, I don't have an i915, but I'll be testing the patchset on
> the laptop over the weekend.
To debug this I sent this patch, which makes khugepaged a readonly
thing, a pure pagetable scanner/validator, and asked for a boot with
transparent_hugepage=16 (which only enables the crippled-down version
of khugepaged that will never create transhuge pages). This way no
hugepage will ever be generated, and no huge pmd either. So if i915
still trips on this, we'll know it's not my fault. Otherwise we have to
dig deeper into the patch.
The i915 bug triggers at boot time only with i915.modeset=1, but it
doesn't trigger on my laptop even with i915 KMS=y. It starts before X.
So I think the explanation is that some splash screen (which I don't
have) is mmapping the DRM framebuffer and the DRM framebuffer code
corrupts some pte to point to the agpgart area. Otherwise it means some
hugepage is confusing it, but nothing related to i915 should _ever_ care
about hugepages, because hugepages are only generated when vm_ops ==
NULL, and it must never happen that remap_pfn_range or vm_insert_pfn or
anything like that is called on a vma with a non-NULL vm_ops.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1552,8 +1552,10 @@ static int khugepaged_scan_pmd(struct mm
if (page_count(page) != 1)
goto out_unmap;
}
+#if 0
if (referenced)
ret = 1;
+#endif
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret) {
I think this is also a must-add for mainline, in addition to the above
debugging trick (this will be the next step if khugepaged crippled down
in readonly mode with transparent_hugepage=16 still shows the bug
triggering, basically declaring transparent hugepage innocent; if
hugepage is innocent then the below will have a chance to expose the
bug).
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1678,6 +1680,7 @@ int vm_insert_pfn(struct vm_area_struct
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+ BUG_ON(!vma->vm_ops);
if (addr < vma->vm_start || addr >= vma->vm_end)
return -EFAULT;
@@ -1697,6 +1700,7 @@ int vm_insert_mixed(struct vm_area_struc
unsigned long pfn)
{
BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+ BUG_ON(!vma->vm_ops);
if (addr < vma->vm_start || addr >= vma->vm_end)
return -EFAULT;
@@ -1828,6 +1832,7 @@ int remap_pfn_range(struct vm_area_struc
return -EINVAL;
vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
+ BUG_ON(!vma->vm_ops);
err = track_pfn_vma_new(vma, &prot, pfn, PAGE_ALIGN(size));
if (err) {
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-29 15:29 ` Mel Gorman
2010-01-29 18:59 ` Andrea Arcangeli
@ 2010-01-31 20:24 ` Andrea Arcangeli
2010-02-01 13:27 ` Andrea Arcangeli
2 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-31 20:24 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Fri, Jan 29, 2010 at 03:29:39PM +0000, Mel Gorman wrote:
> Unfortunately, I don't have an i915, but I'll be testing the patchset on
> the laptop over the weekend.
Never mind, it's fixed. It was only cosmetic. It triggered on vesa too;
it just needed a large enough framebuffer. In short, I didn't stop the
scan of the vma at vm_end & HPAGE_PMD_MASK, so I started scanning a pmd
that wasn't fully included in the current vma, while passing the current
vma to the pmd scan as if the whole pmd belonged to it. The real vma
mapping of the framebuffer was fine.
The reason this went away with "echo never > transparent_hugepage/enabled"
while leaving khugepaged enabled in readonly mode is that the mm_slot
isn't allocated and khugepaged doesn't track vmas that don't ask for
hugepages, so the registration logic of the mm is the same for madvise
and always... of course.
This was the fix, and I'll submit #9 with it.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1657,9 +1657,14 @@ static unsigned int khugepaged_scan_mm_s
}
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
+ if (khugepaged_scan.address > hend) {
+ khugepaged_scan.address = hend + HPAGE_PMD_SIZE;
+ progress++;
+ continue;
+ }
BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
- while (khugepaged_scan.address < vma->vm_end) {
+ while (khugepaged_scan.address < hend) {
int ret;
cond_resched();
if (unlikely(khugepaged_test_exit(mm)))
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-29 15:29 ` Mel Gorman
2010-01-29 18:59 ` Andrea Arcangeli
2010-01-31 20:24 ` Andrea Arcangeli
@ 2010-02-01 13:27 ` Andrea Arcangeli
2010-02-01 13:53 ` Mel Gorman
2 siblings, 1 reply; 50+ messages in thread
From: Andrea Arcangeli @ 2010-02-01 13:27 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Fri, Jan 29, 2010 at 03:29:39PM +0000, Mel Gorman wrote:
> In fact kswapd will get woken up if you fail a GFP_ATOMIC allocation.
> What I would expect to to happen is the first allocation would fail but
> kswapd would wake up and start reclaiming for order-9 (i.e. the huge page
> size). This might be a more hit-and-miss affair than you'd like though and
> would make performance predictions that bit harder.
yeah, it turns out the kswapd behavior breaks it. In short you get
huge swap storms even without __GFP_IO/FS/WAIT in the direct
reclaim. So I had to add this:
Subject: _GFP_NO_KSWAPD
From: Andrea Arcangeli <aarcange@redhat.com>
Transparent hugepage allocations must be allowed not to invoke kswapd or any
other kind of indirect reclaim (especially when the defrag sysfs control is
disabled). It's unacceptable to swap out anonymous pages (potentially
anonymous transparent hugepages) in order to create new transparent hugepages.
This is true for the MADV_HUGEPAGE areas too (swapping out a kvm virtual
machine, and so having it suffer an unbearable slowdown, just so another one
with guest physical memory marked MADV_HUGEPAGE can run 30% faster on memory
intensive workloads, makes no sense). If a transparent hugepage allocation
fails, the slowdown is minor and there is a complete fallback, so kswapd
should never be asked to swap out memory to allow the high order allocation
to succeed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -59,13 +59,15 @@ struct vm_area_struct;
#define __GFP_NOTRACK ((__force gfp_t)0)
#endif
+#define __GFP_NO_KSWAPD ((__force gfp_t)0x400000u)
+
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 22 /* Room for 22 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 23 /* Room for 23 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1829,7 +1829,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
goto nopage;
restart:
- wake_all_kswapd(order, zonelist, high_zoneidx);
+ if (!(gfp_mask & __GFP_NO_KSWAPD))
+ wake_all_kswapd(order, zonelist, high_zoneidx);
/*
* OK, we're below the kswapd watermark and have kicked background
I also added this for safety, because I don't want hugepage allocation
to eat from the reserved pfmemalloc pool:
Subject: don't alloc harder for gfp nomemalloc even if nowait
From: Andrea Arcangeli <aarcange@redhat.com>
Not worth throwing away the precious reserved free memory pool for allocations
that can fail gracefully (either through mempool or because they're transhuge
allocations later falling back to 4k allocations).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1773,7 +1773,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
*/
alloc_flags |= (gfp_mask & __GFP_HIGH);
- if (!wait) {
+ /*
+ * Not worth trying to allocate harder for __GFP_NOMEMALLOC
+ * even if it can't schedule.
+ */
+ if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) {
alloc_flags |= ALLOC_HARDER;
/*
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
With these two patches and this:
#define GFP_TRANSHUGE (__GFP_HARDWALL | __GFP_HIGHMEM | \
__GFP_MOVABLE | __GFP_COMP | __GFP_NOMEMALLOC | \
__GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD)
static inline struct page *alloc_hugepage(void)
{
int defrag = transparent_hugepage_defrag();
return alloc_pages(GFP_TRANSHUGE | (defrag ? __GFP_WAIT : 0),
HPAGE_PMD_ORDER);
}
It seems leaving defrag off by default makes allocation much faster when
there is total fragmentation, as with NOWAIT we won't get into cache
reclaim. I also removed the differentiation between madvise/always in
the "defrag" knob, because what is not ok for madvise is also not ok
for full transparency: it's not ok if the VM takes a long time to start
up, etc. If anything, we could have khugepaged default to defrag like in
my previous versions, but I don't want to risk shrinking cache for no
good reason, so for now they all use the above alloc_hugepage as the
main and only allocation method for transhuge pages. This default now
works fluidly all the time, with no apparent VM behavior change on my
laptop with "always" enabled (and the apps get the hugepages sometimes).
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc6/transparent_hugepage-10
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc6/transparent_hugepage-10.gz
* Re: [PATCH 25 of 31] transparent hugepage core
2010-02-01 13:27 ` Andrea Arcangeli
@ 2010-02-01 13:53 ` Mel Gorman
0 siblings, 0 replies; 50+ messages in thread
From: Mel Gorman @ 2010-02-01 13:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Mon, Feb 01, 2010 at 02:27:04PM +0100, Andrea Arcangeli wrote:
> On Fri, Jan 29, 2010 at 03:29:39PM +0000, Mel Gorman wrote:
> > In fact kswapd will get woken up if you fail a GFP_ATOMIC allocation.
> > What I would expect to to happen is the first allocation would fail but
> > kswapd would wake up and start reclaiming for order-9 (i.e. the huge page
> > size). This might be a more hit-and-miss affair than you'd like though and
> > would make performance predictions that bit harder.
>
> yeah, it turns out the kswapd behavior breaks it. In short you get
> huge swap storms even without __GFP_IO/FS/WAIT in the direct
> reclaim. So I had to add this:
>
That's pretty much what I was expecting. Even if the caller does not
allow IO, FS or WAIT, that doesn't stop kswapd doing all the work
indirectly.
> Subject: _GFP_NO_KSWAPD
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Transparent hugepage allocations must be allowed not to invoke kswapd or any
> other kind of indirect reclaim (especially when the defrag sysfs control is
> disabled). It's unacceptable to swap out anonymous pages (potentially
> anonymous transparent hugepages) in order to create new transparent hugepages.
> This is true for the MADV_HUGEPAGE areas too (swapping out a kvm virtual
> machine, and so having it suffer an unbearable slowdown, just so another one
> with guest physical memory marked MADV_HUGEPAGE can run 30% faster on memory
> intensive workloads, makes no sense). If a transparent hugepage allocation
> fails, the slowdown is minor and there is a complete fallback, so kswapd
> should never be asked to swap out memory to allow the high order allocation
> to succeed.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -59,13 +59,15 @@ struct vm_area_struct;
> #define __GFP_NOTRACK ((__force gfp_t)0)
> #endif
>
> +#define __GFP_NO_KSWAPD ((__force gfp_t)0x400000u)
> +
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> * allocations that simply cannot be supported (e.g. page tables).
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 22 /* Room for 22 __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 23 /* Room for 23 __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1829,7 +1829,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
> goto nopage;
>
> restart:
> - wake_all_kswapd(order, zonelist, high_zoneidx);
> + if (!(gfp_mask & __GFP_NO_KSWAPD))
> + wake_all_kswapd(order, zonelist, high_zoneidx);
>
> /*
> * OK, we're below the kswapd watermark and have kicked background
>
>
>
> I also added this for safety, because I don't want hugepage allocation
> to eat from the reserved pfmemalloc pool:
>
> Subject: don't alloc harder for gfp nomemalloc even if nowait
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> It's not worth throwing away the precious reserved free memory pool for
> allocations that can fail gracefully (either because they go through a
> mempool or because they're transhuge allocations that later fall back to
> 4k allocations).
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1773,7 +1773,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> */
> alloc_flags |= (gfp_mask & __GFP_HIGH);
>
> - if (!wait) {
> + /*
> + * Not worth trying to allocate harder for __GFP_NOMEMALLOC
> + * even if it can't schedule.
> + */
> + if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) {
> alloc_flags |= ALLOC_HARDER;
> /*
> * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
>
>
> With these two patches and this:
>
> #define GFP_TRANSHUGE (__GFP_HARDWALL | __GFP_HIGHMEM | \
> __GFP_MOVABLE | __GFP_COMP | __GFP_NOMEMALLOC | \
> __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD)
>
> static inline struct page *alloc_hugepage(void)
> {
> int defrag = transparent_hugepage_defrag();
> return alloc_pages(GFP_TRANSHUGE | (defrag ? __GFP_WAIT : 0),
> HPAGE_PMD_ORDER);
> }
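
As a rough sketch (assuming the GFP_TRANSHUGE and HPAGE_PMD_ORDER definitions
quoted above; the helper name below is hypothetical), this is the allocation
pattern the new flags are meant to keep cheap: try the huge order without
waking kswapd or dipping into the reserves, and fall back to a base page on
failure.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch only: opportunistic huge allocation with graceful 4k fallback. */
static struct page *try_hugepage_then_base(gfp_t extra_gfp)
{
	struct page *page;

	/* cheap attempt: no kswapd wakeup, no dipping into pfmemalloc */
	page = alloc_pages(GFP_TRANSHUGE | extra_gfp, HPAGE_PMD_ORDER);
	if (page)
		return page;
	/* total fallback: a single 4k page */
	return alloc_pages(GFP_HIGHUSER_MOVABLE | extra_gfp, 0);
}
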
>
>
> It seems leaving defrag off by default makes allocation much faster when
> there is total fragmentation, as with NOWAIT we won't get into cache
> reclaim. I also removed the differentiation between madvise/always in
> the "defrag" knob because what is not ok for madvise is also not ok
> for full transparency. It's not ok if a VM takes a long time to start up etc...
> If anything we could have khugepaged default to defrag like in my
> previous versions, but I don't want to risk shrinking the cache for no good
> reason, so for now they all use the above alloc_hugepage as the main and only
> allocation method for transhuge pages. This default now works fluidly
> all the time, with no apparent VM behavior change on my laptop with
> "always" enabled (and the apps get the hugepages some of the time).
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc6/transparent_hugepage-10
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc6/transparent_hugepage-10.gz
>
I'll take a closer look again during the next round of review, but glancing
through these patches, no other bad things related to kswapd spring
to mind.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* [PATCH 26 of 31] madvise(MADV_HUGEPAGE)
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (24 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 25 of 31] transparent hugepage core Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 27 of 31] pmd_trans_huge migrate bugcheck Andrea Arcangeli
` (5 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add the madvise MADV_HUGEPAGE hint to mark regions where hugepage backing is
important. Return -EINVAL if the vma is not anonymous, or if the feature
isn't built into the kernel. Never silently return success.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,6 +101,7 @@ extern int split_huge_page(struct page *
#endif
extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+extern int hugepage_madvise(unsigned long *vm_flags);
static inline int PageTransHuge(struct page *page)
{
VM_BUG_ON(PageTail(page));
@@ -123,6 +124,11 @@ static inline int split_huge_page(struct
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
#define PageTransHuge(page) 0
+static inline int hugepage_madvise(unsigned long *vm_flags)
+{
+ BUG_ON(0);
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -465,9 +465,11 @@ int do_huge_pmd_wp_page(struct mm_struct
put_page(new_page);
new_page = NULL;
}
- if (unlikely(!new_page))
- return do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
+ if (unlikely(!new_page)) {
+ ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+ goto out;
+ }
copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);
@@ -497,6 +499,7 @@ int do_huge_pmd_wp_page(struct mm_struct
}
out_unlock:
spin_unlock(&mm->page_table_lock);
+out:
return ret;
}
@@ -847,3 +850,19 @@ out_unlock:
out:
return ret;
}
+
+int hugepage_madvise(unsigned long *vm_flags)
+{
+ /*
+ * Be somewhat over-protective like KSM for now!
+ */
+ if (*vm_flags & (VM_HUGEPAGE | VM_SHARED | VM_MAYSHARE |
+ VM_PFNMAP | VM_IO | VM_DONTEXPAND |
+ VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
+ VM_MIXEDMAP | VM_SAO))
+ return -EINVAL;
+
+ *vm_flags |= VM_HUGEPAGE;
+
+ return 0;
+}
diff --git a/mm/madvise.c b/mm/madvise.c
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a
if (error)
goto out;
break;
+ case MADV_HUGEPAGE:
+ error = hugepage_madvise(&new_flags);
+ if (error)
+ goto out;
+ break;
}
if (new_flags == vma->vm_flags) {
@@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior)
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ case MADV_HUGEPAGE:
+#endif
return 1;
default:
--
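
As a usage illustration (not part of the patch), a minimal userspace program
that asks for hugepage backing on an anonymous mapping could look like the
sketch below; the MADV_HUGEPAGE fallback define uses the generic value from
patch 01 in case the installed headers lack it.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* generic value added by patch 01 */
#endif

int main(void)
{
	size_t len = 8UL << 20;	/* 8MB of anonymous memory */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* fails with EINVAL if the vma isn't anonymous or THP isn't built in */
	if (madvise(p, len, MADV_HUGEPAGE))
		fprintf(stderr, "madvise(MADV_HUGEPAGE): %s\n", strerror(errno));
	memset(p, 0, len);	/* touch the region so it gets faulted in */
	return 0;
}
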
* [PATCH 27 of 31] pmd_trans_huge migrate bugcheck
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (25 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 26 of 31] madvise(MADV_HUGEPAGE) Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 28 of 31] memcg compound Andrea Arcangeli
` (4 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
No pmd_trans_huge should ever materialize in areas covered by migration ptes,
because we split the hugepage before the migration ptes are instantiated.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -107,6 +107,10 @@ static inline int PageTransHuge(struct p
VM_BUG_ON(PageTail(page));
return PageHead(page);
}
+static inline int PageTransCompound(struct page *page)
+{
+ return PageCompound(page);
+}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define transparent_hugepage_enabled(__vma) 0
#define transparent_hugepage_defrag(__vma) 0
@@ -124,6 +128,7 @@ static inline int split_huge_page(struct
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
#define PageTransHuge(page) 0
+#define PageTransCompound(page) 0
static inline int hugepage_madvise(unsigned long *vm_flags)
{
BUG_ON(0);
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -99,6 +99,7 @@ static int remove_migration_pte(struct p
goto out;
pmd = pmd_offset(pud, addr);
+ VM_BUG_ON(pmd_trans_huge(*pmd));
if (!pmd_present(*pmd))
goto out;
@@ -819,6 +820,10 @@ static int do_move_page_to_node_array(st
if (PageReserved(page) || PageKsm(page))
goto put_and_set;
+ if (unlikely(PageTransCompound(page)))
+ if (unlikely(split_huge_page(page)))
+ goto put_and_set;
+
pp->page = page;
err = page_to_nid(page);
--
* [PATCH 28 of 31] memcg compound
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (26 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 27 of 31] pmd_trans_huge migrate bugcheck Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 29 of 31] memcg huge memory Andrea Arcangeli
` (3 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Teach memcg to charge/uncharge compound pages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -4,6 +4,10 @@ NOTE: The Memory Resource Controller has
to as the memory controller in this document. Do not confuse memory controller
used here with the memory controller that is used in hardware.
+NOTE: When this documentation refers to PAGE_SIZE, it actually means the
+real size of the page being accounted, which is bigger than PAGE_SIZE
+for compound pages.
+
Salient features
a. Enable control of Anonymous, Page Cache (mapped and unmapped) and
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1401,8 +1401,8 @@ static int __cpuinit memcg_stock_cpu_cal
* oom-killer can be invoked.
*/
static int __mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcg,
- bool oom, struct page *page)
+ gfp_t gfp_mask, struct mem_cgroup **memcg,
+ bool oom, struct page *page, int page_size)
{
struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -1415,6 +1415,9 @@ static int __mem_cgroup_try_charge(struc
return 0;
}
+ if (PageTransHuge(page))
+ csize = page_size;
+
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
@@ -1439,8 +1442,9 @@ static int __mem_cgroup_try_charge(struc
int ret = 0;
unsigned long flags = 0;
- if (consume_stock(mem))
- goto charged;
+ if (!PageTransHuge(page))
+ if (consume_stock(mem))
+ goto charged;
ret = res_counter_charge(&mem->res, csize, &fail_res);
if (likely(!ret)) {
@@ -1460,7 +1464,7 @@ static int __mem_cgroup_try_charge(struc
res);
/* reduce request size and retry */
- if (csize > PAGE_SIZE) {
+ if (csize > page_size) {
csize = PAGE_SIZE;
continue;
}
@@ -1491,7 +1495,7 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
- if (csize > PAGE_SIZE)
+ if (csize > page_size)
refill_stock(mem, csize - PAGE_SIZE);
charged:
/*
@@ -1512,12 +1516,12 @@ nomem:
* This function is for that and do uncharge, put css's refcnt.
* gotten by try_charge().
*/
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, int page_size)
{
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, page_size);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, page_size);
}
css_put(&mem->css);
}
@@ -1575,8 +1579,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
*/
static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- enum charge_type ctype)
+ struct page_cgroup *pc,
+ enum charge_type ctype,
+ int page_size)
{
/* try_charge() can return NULL to *memcg, taking care of it. */
if (!mem)
@@ -1585,7 +1590,7 @@ static void __mem_cgroup_commit_charge(s
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
- mem_cgroup_cancel_charge(mem);
+ mem_cgroup_cancel_charge(mem, page_size);
return;
}
@@ -1722,7 +1727,8 @@ static int mem_cgroup_move_parent(struct
goto put;
parent = mem_cgroup_from_cont(pcg);
- ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
+ PAGE_SIZE);
if (ret || !parent)
goto put_back;
@@ -1730,7 +1736,7 @@ static int mem_cgroup_move_parent(struct
if (!ret)
css_put(&parent->css); /* drop extra refcnt by try_charge() */
else
- mem_cgroup_cancel_charge(parent); /* does css_put */
+ mem_cgroup_cancel_charge(parent, PAGE_SIZE); /* does css_put */
put_back:
putback_lru_page(page);
put:
@@ -1752,6 +1758,10 @@ static int mem_cgroup_charge_common(stru
struct mem_cgroup *mem;
struct page_cgroup *pc;
int ret;
+ int page_size = PAGE_SIZE;
+
+ if (PageTransHuge(page))
+ page_size <<= compound_order(page);
pc = lookup_page_cgroup(page);
/* can happen at boot */
@@ -1760,11 +1770,12 @@ static int mem_cgroup_charge_common(stru
prefetchw(pc);
mem = memcg;
- ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page,
+ page_size);
if (ret || !mem)
return ret;
- __mem_cgroup_commit_charge(mem, pc, ctype);
+ __mem_cgroup_commit_charge(mem, pc, ctype, page_size);
return 0;
}
@@ -1773,8 +1784,6 @@ int mem_cgroup_newpage_charge(struct pag
{
if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
- return 0;
/*
* If already mapped, we don't have to account.
* If page cache, page->mapping has address_space.
@@ -1787,7 +1796,7 @@ int mem_cgroup_newpage_charge(struct pag
if (unlikely(!mm))
mm = &init_mm;
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
static void
@@ -1880,14 +1889,14 @@ int mem_cgroup_try_charge_swapin(struct
if (!mem)
goto charge_cur_mm;
*ptr = mem;
- ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
+ ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page, PAGE_SIZE);
/* drop extra refcnt from tryget */
css_put(&mem->css);
return ret;
charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
- return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
+ return __mem_cgroup_try_charge(mm, mask, ptr, true, page, PAGE_SIZE);
}
static void
@@ -1903,7 +1912,7 @@ __mem_cgroup_commit_charge_swapin(struct
cgroup_exclude_rmdir(&ptr->css);
pc = lookup_page_cgroup(page);
mem_cgroup_lru_del_before_commit_swapcache(page);
- __mem_cgroup_commit_charge(ptr, pc, ctype);
+ __mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
mem_cgroup_lru_add_after_commit_swapcache(page);
/*
* Now swap is on-memory. This means this page may be
@@ -1952,11 +1961,12 @@ void mem_cgroup_cancel_charge_swapin(str
return;
if (!mem)
return;
- mem_cgroup_cancel_charge(mem);
+ mem_cgroup_cancel_charge(mem, PAGE_SIZE);
}
static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
+ int page_size)
{
struct memcg_batch_info *batch = NULL;
bool uncharge_memsw = true;
@@ -1989,14 +1999,14 @@ __do_uncharge(struct mem_cgroup *mem, co
if (batch->memcg != mem)
goto direct_uncharge;
/* remember freed charge and uncharge it later */
- batch->bytes += PAGE_SIZE;
+ batch->bytes += page_size;
if (uncharge_memsw)
- batch->memsw_bytes += PAGE_SIZE;
+ batch->memsw_bytes += page_size;
return;
direct_uncharge:
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, page_size);
if (uncharge_memsw)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, page_size);
return;
}
@@ -2009,6 +2019,10 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
+ int page_size = PAGE_SIZE;
+
+ if (PageTransHuge(page))
+ page_size <<= compound_order(page);
if (mem_cgroup_disabled())
return NULL;
@@ -2048,7 +2062,7 @@ __mem_cgroup_uncharge_common(struct page
}
if (!mem_cgroup_is_root(mem))
- __do_uncharge(mem, ctype);
+ __do_uncharge(mem, ctype, page_size);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -2217,7 +2231,7 @@ int mem_cgroup_prepare_migration(struct
if (mem) {
ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
- page);
+ page, PAGE_SIZE);
css_put(&mem->css);
}
*ptr = mem;
@@ -2260,7 +2274,7 @@ void mem_cgroup_end_migration(struct mem
* __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
* So, double-counting is effectively avoided.
*/
- __mem_cgroup_commit_charge(mem, pc, ctype);
+ __mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
/*
* Both of oldpage and newpage are still under lock_page().
--
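
For a concrete feel of the page_size arithmetic introduced above, a standalone
sketch (assuming x86_64's 4k PAGE_SIZE and order-9 hugepages) computes the
charge for one transparent hugepage:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;	/* PAGE_SIZE on x86_64 */
	int compound_order = 9;		/* HPAGE_PMD_ORDER on x86_64 */

	/* same shift the patch applies: page_size <<= compound_order(page) */
	page_size <<= compound_order;
	printf("memcg charge per THP: %lu bytes (%lu kB)\n",
	       page_size, page_size >> 10);
	return 0;
}
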
* [PATCH 29 of 31] memcg huge memory
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (27 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 28 of 31] memcg compound Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 30 of 31] transparent hugepage vmstat Andrea Arcangeli
` (2 subsequent siblings)
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add memcg charge/uncharge to hugepage faults in huge_memory.c.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -212,6 +212,7 @@ static int __do_huge_pmd_anonymous_page(
VM_BUG_ON(!PageCompound(page));
pgtable = pte_alloc_one(mm, address);
if (unlikely(!pgtable)) {
+ mem_cgroup_uncharge_page(page);
put_page(page);
return VM_FAULT_OOM;
}
@@ -229,6 +230,7 @@ static int __do_huge_pmd_anonymous_page(
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_none(*pmd))) {
spin_unlock(&mm->page_table_lock);
+ mem_cgroup_uncharge_page(page);
put_page(page);
pte_free(mm, pgtable);
} else {
@@ -267,6 +269,10 @@ int do_huge_pmd_anonymous_page(struct mm
page = alloc_hugepage(transparent_hugepage_defrag(vma));
if (unlikely(!page))
goto out;
+ if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+ put_page(page);
+ goto out;
+ }
return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
page, haddr);
@@ -367,9 +373,15 @@ static int do_huge_pmd_wp_page_fallback(
for (i = 0; i < HPAGE_PMD_NR; i++) {
pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
vma, address);
- if (unlikely(!pages[i])) {
- while (--i >= 0)
+ if (unlikely(!pages[i] ||
+ mem_cgroup_newpage_charge(pages[i], mm,
+ GFP_KERNEL))) {
+ if (pages[i])
put_page(pages[i]);
+ while (--i >= 0) {
+ mem_cgroup_uncharge_page(pages[i]);
+ put_page(pages[i]);
+ }
kfree(pages);
ret |= VM_FAULT_OOM;
goto out;
@@ -428,8 +440,10 @@ out:
out_free_pages:
spin_unlock(&mm->page_table_lock);
- for (i = 0; i < HPAGE_PMD_NR; i++)
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ mem_cgroup_uncharge_page(pages[i]);
put_page(pages[i]);
+ }
kfree(pages);
goto out;
}
@@ -471,6 +485,11 @@ int do_huge_pmd_wp_page(struct mm_struct
goto out;
}
+ if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+ put_page(new_page);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);
@@ -482,9 +501,10 @@ int do_huge_pmd_wp_page(struct mm_struct
smp_wmb();
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+ mem_cgroup_uncharge_page(new_page);
put_page(new_page);
- else {
+ } else {
pmd_t entry;
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
--
* [PATCH 30 of 31] transparent hugepage vmstat
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (28 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 29 of 31] memcg huge memory Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 14:33 ` [PATCH 31 of 31] khugepaged Andrea Arcangeli
2010-01-28 17:05 ` [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add hugepage stat information to /proc/vmstat and /proc/meminfo.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,6 +101,9 @@ static int meminfo_proc_show(struct seq_
#ifdef CONFIG_MEMORY_FAILURE
"HardwareCorrupted: %5lu kB\n"
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ "AnonHugePages: %8lu kB\n"
+#endif
,
K(i.totalram),
K(i.freeram),
@@ -151,6 +154,10 @@ static int meminfo_proc_show(struct seq_
#ifdef CONFIG_MEMORY_FAILURE
,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+#endif
);
hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,7 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
+ NR_ANON_TRANSPARENT_HUGEPAGES,
NR_VM_ZONE_STAT_ITEMS };
/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -720,6 +720,9 @@ static void __split_huge_page_refcount(s
put_page(page_tail);
}
+ __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+
ClearPageCompound(page);
compound_unlock(page);
spin_unlock_irq(&zone->lru_lock);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -687,8 +687,13 @@ void page_add_anon_rmap(struct page *pag
struct vm_area_struct *vma, unsigned long address)
{
int first = atomic_inc_and_test(&page->_mapcount);
- if (first)
- __inc_zone_page_state(page, NR_ANON_PAGES);
+ if (first) {
+ if (!PageTransHuge(page))
+ __inc_zone_page_state(page, NR_ANON_PAGES);
+ else
+ __inc_zone_page_state(page,
+ NR_ANON_TRANSPARENT_HUGEPAGES);
+ }
if (unlikely(PageKsm(page)))
return;
@@ -716,7 +721,10 @@ void page_add_new_anon_rmap(struct page
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
- __inc_zone_page_state(page, NR_ANON_PAGES);
+ if (!PageTransHuge(page))
+ __inc_zone_page_state(page, NR_ANON_PAGES);
+ else
+ __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
__page_set_anon_rmap(page, vma, address);
if (page_evictable(page, vma))
lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -763,7 +771,11 @@ void page_remove_rmap(struct page *page)
}
if (PageAnon(page)) {
mem_cgroup_uncharge_page(page);
- __dec_zone_page_state(page, NR_ANON_PAGES);
+ if (!PageTransHuge(page))
+ __dec_zone_page_state(page, NR_ANON_PAGES);
+ else
+ __dec_zone_page_state(page,
+ NR_ANON_TRANSPARENT_HUGEPAGES);
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
mem_cgroup_update_file_mapped(page, -1);
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -655,6 +655,9 @@ static const char * const vmstat_text[]
"numa_local",
"numa_other",
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ "nr_anon_transparent_hugepages",
+#endif
#ifdef CONFIG_VM_EVENT_COUNTERS
"pgpgin",
--
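
A small userspace sketch (not part of the patch) that reads the new counter,
for example to check whether an MADV_HUGEPAGE region actually got huge backing:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "AnonHugePages:", 14)) {
			fputs(line, stdout);	/* e.g. "AnonHugePages:   409600 kB" */
			break;
		}
	}
	fclose(f);
	return 0;
}
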
* [PATCH 31 of 31] khugepaged
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (29 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 30 of 31] transparent hugepage vmstat Andrea Arcangeli
@ 2010-01-28 14:33 ` Andrea Arcangeli
2010-01-28 17:05 ` [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 14:33 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
From: Andrea Arcangeli <aarcange@redhat.com>
Add khugepaged to relocate fragmented pages into hugepages if new hugepages
become available. (This is independent of the defrag logic that will have to
make new hugepages available.)
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -25,6 +25,8 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+ TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
#ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
#endif
@@ -50,12 +52,14 @@ extern pmd_t *page_check_address_pmd(str
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
(__vma)->vm_flags & VM_HUGEPAGE))
-#define transparent_hugepage_defrag(__vma) \
+#define __transparent_hugepage_defrag(__in_madv) \
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
- (__vma)->vm_flags & VM_HUGEPAGE))
+ (__in_madv)))
+#define transparent_hugepage_defrag(__vma) \
+ __transparent_hugepage_defrag((__vma)->vm_flags & VM_HUGEPAGE)
#ifdef CONFIG_DEBUG_VM
#define transparent_hugepage_debug_cow() \
(transparent_hugepage_flags & \
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
new file mode 100644
--- /dev/null
+++ b/include/linux/khugepaged.h
@@ -0,0 +1,32 @@
+#ifndef _LINUX_KHUGEPAGED_H
+#define _LINUX_KHUGEPAGED_H
+
+#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_exit(struct mm_struct *mm);
+
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+ return __khugepaged_enter(mm);
+ return 0;
+}
+
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+ if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+ __khugepaged_exit(mm);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ return 0;
+}
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -482,6 +482,7 @@ extern int get_dumpable(struct mm_struct
#endif
/* leave room for more dump flags */
#define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
+#define MMF_VM_HUGEPAGE 17 /* set when VM_HUGEPAGE is set on vma */
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,6 +65,7 @@
#include <linux/perf_event.h>
#include <linux/posix-timers.h>
#include <linux/user-return-notifier.h>
+#include <linux/khugepaged.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -306,6 +307,9 @@ static int dup_mmap(struct mm_struct *mm
retval = ksm_fork(mm, oldmm);
if (retval)
goto out;
+ retval = khugepaged_fork(mm, oldmm);
+ if (retval)
+ goto out;
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
struct file *file;
@@ -515,6 +519,7 @@ void mmput(struct mm_struct *mm)
if (atomic_dec_and_test(&mm->mm_users)) {
exit_aio(mm);
ksm_exit(mm);
+ khugepaged_exit(mm); /* must run before exit_mmap */
exit_mmap(mm);
set_mm_exe_file(mm, NULL);
if (!list_empty(&mm->mmlist)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -12,14 +12,119 @@
#include <linux/mmu_notifier.h>
#include <linux/rmap.h>
#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
unsigned long transparent_hugepage_flags __read_mostly =
- (1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+ (1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
+ (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG);
+
+/* by default scan 16 pmds every 10 seconds */
+static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*16;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_defrag_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+
+static int khugepaged(void *none);
+static int mm_slots_hash_init(void);
+static int ksm_slab_init(void);
+static void ksm_slab_free(void);
+
+#define MM_SLOTS_HASH_HEADS 1024
+static struct hlist_head *mm_slots_hash __read_mostly;
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+ struct hlist_node hash;
+ struct list_head mm_node;
+ struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+ struct list_head mm_head;
+ struct mm_slot *mm_slot;
+ unsigned long address;
+} khugepaged_scan = {
+ .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+#define khugepaged_enabled() \
+ (transparent_hugepage_flags & \
+ ((1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG) | \
+ (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG)))
+#define khugepaged_always() \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG))
+#define khugepaged_req_madv() \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG))
+
+static int start_khugepaged(void)
+{
+ int err = 0;
+ if (khugepaged_enabled()) {
+ int wakeup;
+ if (unlikely(!mm_slot_cache || !mm_slots_hash)) {
+ err = -ENOMEM;
+ goto out;
+ }
+ mutex_lock(&khugepaged_mutex);
+ if (!khugepaged_thread)
+ khugepaged_thread = kthread_run(khugepaged, NULL,
+ "khugepaged");
+ if (unlikely(IS_ERR(khugepaged_thread))) {
+ clear_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+ &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
+ &transparent_hugepage_flags);
+ printk(KERN_ERR
+ "khugepaged: kthread_run(khugepaged) failed\n");
+ err = PTR_ERR(khugepaged_thread);
+ khugepaged_thread = NULL;
+ }
+ wakeup = !list_empty(&khugepaged_scan.mm_head);
+ mutex_unlock(&khugepaged_mutex);
+ if (wakeup)
+ wake_up_interruptible(&khugepaged_wait);
+ } else
+ /* wakeup to exit */
+ wake_up_interruptible(&khugepaged_wait);
+out:
+ return err;
+}
#ifdef CONFIG_SYSFS
+
+static void wakeup_khugepaged(void)
+{
+ mutex_lock(&khugepaged_mutex);
+ if (khugepaged_thread)
+ wake_up_process(khugepaged_thread);
+ mutex_unlock(&khugepaged_mutex);
+}
+
static ssize_t double_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag enabled,
@@ -153,20 +258,168 @@ static struct attribute *hugepage_attr[]
static struct attribute_group hugepage_attr_group = {
.attrs = hugepage_attr,
- .name = "transparent_hugepage",
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long msecs;
+ int err;
+
+ err = strict_strtoul(buf, 10, &msecs);
+ if (err || msecs > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_scan_sleep_millisecs = msecs;
+ wakeup_khugepaged();
+
+ return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+ __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+ scan_sleep_millisecs_store);
+
+static ssize_t defrag_sleep_millisecs_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_defrag_sleep_millisecs);
+}
+
+static ssize_t defrag_sleep_millisecs_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long msecs;
+ int err;
+
+ err = strict_strtoul(buf, 10, &msecs);
+ if (err || msecs > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_defrag_sleep_millisecs = msecs;
+ wakeup_khugepaged();
+
+ return count;
+}
+static struct kobj_attribute defrag_sleep_millisecs_attr =
+ __ATTR(defrag_sleep_millisecs, 0644, defrag_sleep_millisecs_show,
+ defrag_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long pages;
+
+ err = strict_strtoul(buf, 10, &pages);
+ if (err || !pages || pages > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_pages_to_scan = pages;
+
+ return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+ __ATTR(pages_to_scan, 0644, pages_to_scan_show,
+ pages_to_scan_store);
+
+static ssize_t khugepaged_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG);
+}
+static ssize_t khugepaged_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG);
+ if (ret > 0) {
+ int err = start_khugepaged();
+ if (err)
+ ret = err;
+ }
+ return ret;
+}
+static struct kobj_attribute khugepaged_enabled_attr =
+ __ATTR(enabled, 0644, khugepaged_enabled_show,
+ khugepaged_enabled_store);
+
+static struct attribute *khugepaged_attr[] = {
+ &khugepaged_enabled_attr.attr,
+ &pages_to_scan_attr.attr,
+ &scan_sleep_millisecs_attr.attr,
+ &defrag_sleep_millisecs_attr.attr,
+ NULL,
+};
+
+static struct attribute_group khugepaged_attr_group = {
+ .attrs = khugepaged_attr,
+ .name = "khugepaged",
};
#endif /* CONFIG_SYSFS */
static int __init ksm_init(void)
{
+ int err;
#ifdef CONFIG_SYSFS
- int err;
+ static struct kobject *hugepage_kobj;
- err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+ err = -ENOMEM;
+ hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
+ if (unlikely(!hugepage_kobj)) {
+ printk(KERN_ERR "hugepage: failed kobject create\n");
+ goto out;
+ }
+
+ err = sysfs_create_group(hugepage_kobj, &hugepage_attr_group);
+ if (err) {
+ printk(KERN_ERR "hugepage: failed to register hugepage group\n");
+ goto out;
+ }
+
+ err = sysfs_create_group(hugepage_kobj, &khugepaged_attr_group);
+ if (err) {
+ printk(KERN_ERR "hugepage: failed to register khugepaged group\n");
+ goto out;
+ }
+#endif
+
+ err = ksm_slab_init();
if (err)
- printk(KERN_ERR "hugepage: register sysfs failed\n");
-#endif
- return 0;
+ goto out;
+
+ err = mm_slots_hash_init();
+ if (err) {
+ ksm_slab_free();
+ goto out;
+ }
+
+ start_khugepaged();
+
+out:
+ return err;
}
module_init(ksm_init)
@@ -255,6 +508,11 @@ static inline struct page *alloc_hugepag
HPAGE_PMD_ORDER);
}
+static inline struct page *alloc_hugepage_defrag(void)
+{
+ return alloc_hugepage(1);
+}
+
int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
@@ -266,6 +524,12 @@ int do_huge_pmd_anonymous_page(struct mm
if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
+ if (unlikely(!test_bit(MMF_VM_HUGEPAGE, &mm->flags)))
+ if (khugepaged_always() ||
+ (khugepaged_req_madv() &&
+ vma->vm_flags & VM_HUGEPAGE))
+ if (__khugepaged_enter(mm))
+ return VM_FAULT_OOM;
page = alloc_hugepage(transparent_hugepage_defrag(vma));
if (unlikely(!page))
goto out;
@@ -889,3 +1153,655 @@ int hugepage_madvise(unsigned long *vm_f
return 0;
}
+
+static int __init ksm_slab_init(void)
+{
+ mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+ sizeof(struct mm_slot),
+ __alignof__(struct mm_slot), 0, NULL);
+ if (!mm_slot_cache)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void __init ksm_slab_free(void)
+{
+ kmem_cache_destroy(mm_slot_cache);
+ mm_slot_cache = NULL;
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+ if (!mm_slot_cache) /* initialization failed */
+ return NULL;
+ return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+ kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static int __init mm_slots_hash_init(void)
+{
+ mm_slots_hash = kzalloc(MM_SLOTS_HASH_HEADS * sizeof(struct hlist_head),
+ GFP_KERNEL);
+ if (!mm_slots_hash)
+ return -ENOMEM;
+ return 0;
+}
+
+#if 0
+static void __init mm_slots_hash_free(void)
+{
+ kfree(mm_slots_hash);
+ mm_slots_hash = NULL;
+}
+#endif
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ struct hlist_head *bucket;
+ struct hlist_node *node;
+
+ bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+ % MM_SLOTS_HASH_HEADS];
+ hlist_for_each_entry(mm_slot, node, bucket, hash) {
+ if (mm == mm_slot->mm)
+ return mm_slot;
+ }
+ return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+ struct mm_slot *mm_slot)
+{
+ struct hlist_head *bucket;
+
+ bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+ % MM_SLOTS_HASH_HEADS];
+ mm_slot->mm = mm;
+ hlist_add_head(&mm_slot->hash, bucket);
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ int wakeup;
+
+ mm_slot = alloc_mm_slot();
+ if (!mm_slot)
+ return -ENOMEM;
+
+ spin_lock(&khugepaged_mm_lock);
+ insert_to_mm_slots_hash(mm, mm_slot);
+ /*
+ * Insert just behind the scanning cursor, to let the area settle
+ * down a little.
+ */
+ wakeup = list_empty(&khugepaged_scan.mm_head);
+ list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+ set_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ spin_unlock(&khugepaged_mm_lock);
+
+ atomic_inc(&mm->mm_count);
+ if (wakeup)
+ wake_up_interruptible(&khugepaged_wait);
+
+ return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ int free = 0;
+
+ spin_lock(&khugepaged_mm_lock);
+ mm_slot = get_mm_slot(mm);
+ if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+ hlist_del(&mm_slot->hash);
+ list_del(&mm_slot->mm_node);
+ free = 1;
+ }
+
+ if (free) {
+ clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ spin_unlock(&khugepaged_mm_lock);
+ free_mm_slot(mm_slot);
+ mmdrop(mm);
+ } else if (mm_slot) {
+ spin_unlock(&khugepaged_mm_lock);
+ /*
+ * This is required to serialize against
+ * khugepaged_test_exit() (which is guaranteed to run
+ * under mmap sem read mode). Stop here (after we
+ * return, all pagetables will be destroyed) until
+ * khugepaged has finished working on the pagetables
+ * under the mmap_sem.
+ */
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ } else
+ spin_unlock(&khugepaged_mm_lock);
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+ return atomic_read(&mm->mm_users) == 0;
+}
+
+static void release_pte_page(struct page *page)
+{
+ /* 0 stands for page_is_file_cache(page) == false */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ unlock_page(page);
+ putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+ while (--_pte >= pte)
+ release_pte_page(pte_page(*_pte));
+}
+
+static void release_all_pte_pages(pte_t *pte)
+{
+ release_pte_pages(pte, pte + HPAGE_PMD_NR);
+}
+
+static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *pte)
+{
+ struct page *page;
+ pte_t *_pte;
+ int referenced = 0, isolated = 0;
+ for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+ _pte++, address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (!pte_present(pteval) || !pte_write(pteval)) {
+ release_pte_pages(pte, _pte);
+ goto out;
+ }
+ /* If no mapped pte is young, don't collapse the page */
+ if (pte_young(pteval))
+ referenced = 1;
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page)) {
+ release_pte_pages(pte, _pte);
+ goto out;
+ }
+ VM_BUG_ON(PageCompound(page));
+ BUG_ON(!PageAnon(page));
+ VM_BUG_ON(!PageSwapBacked(page));
+
+ /* cannot use mapcount: can't collapse if there's a gup pin */
+ if (page_count(page) != 1) {
+ release_pte_pages(pte, _pte);
+ goto out;
+ }
+ /*
+ * We can do it before isolate_lru_page because the
+ * page can't be freed from under us. NOTE: PG_lock
+ * seems entirely unnecessary, but when in doubt this is
+ * safer. If proven unnecessary it can be removed.
+ */
+ if (!trylock_page(page)) {
+ release_pte_pages(pte, _pte);
+ goto out;
+ }
+ /*
+ * Isolate the page to avoid collapsing an hugepage
+ * currently in use by the VM.
+ */
+ if (isolate_lru_page(page)) {
+ unlock_page(page);
+ release_pte_pages(pte, _pte);
+ goto out;
+ }
+ /* 0 stands for page_is_file_cache(page) == false */
+ inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+ }
+ if (unlikely(!referenced))
+ release_all_pte_pages(pte);
+ else
+ isolated = 1;
+out:
+ return isolated;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ spinlock_t *ptl)
+{
+ pte_t *_pte;
+ for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+ struct page *src_page = pte_page(*_pte);
+ copy_user_highpage(page, src_page, address, vma);
+ VM_BUG_ON(page_mapcount(src_page) != 1);
+ VM_BUG_ON(page_count(src_page) != 2);
+ release_pte_page(src_page);
+ /*
+ * ptl mostly unnecessary, but preempt has to be disabled
+ * to update the per-cpu stats inside page_remove_rmap().
+ */
+ spin_lock(ptl);
+ /* paravirt calls inside pte_clear here are superfluous */
+ pte_clear(vma->vm_mm, address, _pte);
+ page_remove_rmap(src_page);
+ spin_unlock(ptl);
+ free_page_and_swap_cache(src_page);
+
+ address += PAGE_SIZE;
+ page++;
+ }
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+ unsigned long address,
+ struct page **hpage)
+{
+ struct vm_area_struct *vma;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd, _pmd;
+ pte_t *pte;
+ pgtable_t pgtable;
+ struct page *new_page;
+ spinlock_t *ptl;
+ int isolated;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ VM_BUG_ON(!*hpage);
+
+ /*
+ * Prevent all access to pagetables with the exception of
+ * gup_fast later handled by the ptep_clear_flush and the VM
+ * handled by the anon_vma lock + PG_lock.
+ */
+ down_write(&mm->mmap_sem);
+ if (unlikely(khugepaged_test_exit(mm)))
+ goto out;
+
+ vma = find_vma(mm, address);
+ if (vma->vm_start > address)
+ goto out;
+
+ if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always())
+ goto out;
+
+ if (!vma->anon_vma || vma->vm_ops || vma->vm_file)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ /* pmd can't go away or become huge under us */
+ if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+ goto out;
+
+ /* stop anon_vma rmap pagetable access */
+ spin_lock(&vma->anon_vma->lock);
+
+ pte = pte_offset_map(pmd, address);
+ ptl = pte_lockptr(mm, pmd);
+
+ spin_lock(&mm->page_table_lock); /* probably unnecessary */
+ /* after this gup_fast can't run anymore */
+ _pmd = pmdp_clear_flush_notify(vma, address, pmd);
+ spin_unlock(&mm->page_table_lock);
+
+ spin_lock(ptl);
+ isolated = __collapse_huge_page_isolate(vma, address, pte);
+ spin_unlock(ptl);
+ pte_unmap(pte);
+
+ if (unlikely(!isolated)) {
+ spin_lock(&mm->page_table_lock);
+ BUG_ON(!pmd_none(*pmd));
+ set_pmd_at(mm, address, pmd, _pmd);
+ spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->anon_vma->lock);
+ goto out;
+ }
+
+ /*
+ * All pages are isolated and locked so anon_vma rmap
+ * can't run anymore.
+ */
+ spin_unlock(&vma->anon_vma->lock);
+
+ new_page = *hpage;
+ __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+ __SetPageUptodate(new_page);
+ pgtable = pmd_pgtable(_pmd);
+ VM_BUG_ON(page_count(pgtable) != 1);
+ VM_BUG_ON(page_mapcount(pgtable) != 0);
+
+ _pmd = mk_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+ _pmd = pmd_mkhuge(_pmd);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the copy_huge_page writes to become
+ * visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ BUG_ON(!pmd_none(*pmd));
+ page_add_new_anon_rmap(new_page, vma, address);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache(vma, address, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ mm->nr_ptes--;
+ spin_unlock(&mm->page_table_lock);
+
+ *hpage = NULL;
+out:
+ up_write(&mm->mmap_sem);
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ struct page **hpage)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte, *_pte;
+ int ret = 0, referenced = 0;
+ struct page *page;
+ unsigned long _address;
+ spinlock_t *ptl;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+ goto out;
+
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
+ _pte++, _address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (!pte_present(pteval) || !pte_write(pteval))
+ goto out_unmap;
+ if (pte_young(pteval))
+ referenced = 1;
+ page = vm_normal_page(vma, _address, pteval);
+ if (unlikely(!page))
+ goto out_unmap;
+ VM_BUG_ON(PageCompound(page));
+ if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
+ goto out_unmap;
+ /* cannot use mapcount: can't collapse if there's a gup pin */
+ if (page_count(page) != 1)
+ goto out_unmap;
+ }
+ if (referenced)
+ ret = 1;
+out_unmap:
+ pte_unmap_unlock(pte, ptl);
+ if (ret) {
+ up_read(&mm->mmap_sem);
+ collapse_huge_page(mm, address, hpage);
+ }
+out:
+ return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+ struct mm_struct *mm = mm_slot->mm;
+
+ VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+ if (khugepaged_test_exit(mm)) {
+ /* free mm_slot */
+ hlist_del(&mm_slot->hash);
+ list_del(&mm_slot->mm_node);
+ clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ free_mm_slot(mm_slot);
+ mmdrop(mm);
+ }
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+ struct page **hpage)
+{
+ struct mm_slot *mm_slot;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int progress = 0;
+
+ VM_BUG_ON(!pages);
+ VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+ if (khugepaged_scan.mm_slot)
+ mm_slot = khugepaged_scan.mm_slot;
+ else {
+ mm_slot = list_entry(khugepaged_scan.mm_head.next,
+ struct mm_slot, mm_node);
+ khugepaged_scan.address = 0;
+ khugepaged_scan.mm_slot = mm_slot;
+ }
+ spin_unlock(&khugepaged_mm_lock);
+
+ mm = mm_slot->mm;
+ down_read(&mm->mmap_sem);
+ if (unlikely(khugepaged_test_exit(mm)))
+ vma = NULL;
+ else
+ vma = find_vma(mm, khugepaged_scan.address);
+
+ progress++;
+ for (; vma; vma = vma->vm_next) {
+ unsigned long hstart, hend;
+
+ cond_resched();
+ if (unlikely(khugepaged_test_exit(mm))) {
+ progress++;
+ break;
+ }
+
+ if (!(vma->vm_flags & VM_HUGEPAGE) &&
+ !khugepaged_always()) {
+ progress++;
+ continue;
+ }
+ if (!vma->anon_vma || vma->vm_ops || vma->vm_file) {
+ khugepaged_scan.address = vma->vm_end;
+ progress++;
+ continue;
+ }
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (hstart >= hend) {
+ progress++;
+ continue;
+ }
+ if (khugepaged_scan.address < hstart)
+ khugepaged_scan.address = hstart;
+ BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+ while (khugepaged_scan.address < vma->vm_end) {
+ int ret;
+ cond_resched();
+ if (unlikely(khugepaged_test_exit(mm)))
+ goto breakouterloop;
+
+ ret = khugepaged_scan_pmd(mm, vma,
+ khugepaged_scan.address,
+ hpage);
+ /* move to next address */
+ khugepaged_scan.address += HPAGE_PMD_SIZE;
+ progress += HPAGE_PMD_NR;
+ if (ret)
+ /* we released mmap_sem so break loop */
+ goto breakouterloop_mmap_sem;
+ if (progress >= pages)
+ goto breakouterloop;
+ }
+ }
+breakouterloop:
+ up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+ spin_lock(&khugepaged_mm_lock);
+ BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+ /*
+ * Release the current mm_slot if this mm is about to die, or
+ * if we scanned all vmas of this mm.
+ */
+ if (khugepaged_test_exit(mm) || !vma) {
+ /*
+ * Make sure that if mm_users is reaching zero while
+ * khugepaged runs here, khugepaged_exit will find
+ * mm_slot not pointing to the exiting mm.
+ */
+ if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+ khugepaged_scan.mm_slot = list_entry(
+ mm_slot->mm_node.next,
+ struct mm_slot, mm_node);
+ khugepaged_scan.address = 0;
+ } else
+ khugepaged_scan.mm_slot = NULL;
+
+ collect_mm_slot(mm_slot);
+ }
+
+ return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+ return !list_empty(&khugepaged_scan.mm_head) &&
+ khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+ return !list_empty(&khugepaged_scan.mm_head) ||
+ !khugepaged_enabled();
+}
+
+static void khugepaged_do_scan(struct page **hpage)
+{
+ unsigned int progress = 0, pass_through_head = 0;
+ unsigned int pages = khugepaged_pages_to_scan;
+
+ barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+ while (progress < pages) {
+ cond_resched();
+
+ if (!*hpage) {
+ *hpage = alloc_hugepage_defrag();
+ if (unlikely(!*hpage))
+ break;
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ if (!khugepaged_scan.mm_slot)
+ pass_through_head++;
+ if (khugepaged_has_work() &&
+ pass_through_head < 2)
+ progress += khugepaged_scan_mm_slot(pages - progress,
+ hpage);
+ else
+ progress = pages;
+ spin_unlock(&khugepaged_mm_lock);
+ }
+}
+
+static struct page *khugepaged_alloc_hugepage(void)
+{
+ struct page *hpage;
+
+ do {
+ hpage = alloc_hugepage_defrag();
+ if (!hpage) {
+ if (!khugepaged_defrag_sleep_millisecs)
+ continue;
+ schedule_timeout_interruptible(
+ msecs_to_jiffies(
+ khugepaged_defrag_sleep_millisecs));
+ }
+ } while (unlikely(!hpage) &&
+ likely(khugepaged_enabled()));
+ return hpage;
+}
+
+static void khugepaged_loop(void)
+{
+ struct page *hpage;
+
+ while (likely(khugepaged_enabled())) {
+ hpage = khugepaged_alloc_hugepage();
+ if (unlikely(!hpage))
+ break;
+
+ khugepaged_do_scan(&hpage);
+ if (hpage)
+ put_page(hpage);
+ if (khugepaged_has_work()) {
+ if (!khugepaged_scan_sleep_millisecs)
+ continue;
+ schedule_timeout_interruptible(
+ msecs_to_jiffies(
+ khugepaged_scan_sleep_millisecs));
+ } else if (khugepaged_enabled())
+ wait_event_interruptible(khugepaged_wait,
+ khugepaged_wait_event());
+ }
+}
+
+static int khugepaged(void *none)
+{
+ struct mm_slot *mm_slot;
+
+ for (;;) {
+ BUG_ON(khugepaged_thread != current);
+ khugepaged_loop();
+ BUG_ON(khugepaged_thread != current);
+
+ mutex_lock(&khugepaged_mutex);
+ if (!khugepaged_enabled())
+ break;
+ mutex_unlock(&khugepaged_mutex);
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ mm_slot = khugepaged_scan.mm_slot;
+ khugepaged_scan.mm_slot = NULL;
+ if (mm_slot)
+ collect_mm_slot(mm_slot);
+ spin_unlock(&khugepaged_mm_lock);
+
+ khugepaged_thread = NULL;
+ mutex_unlock(&khugepaged_mutex);
+
+ return 0;
+}
--
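
As a usage sketch (illustrative only; the values written are arbitrary
examples), the knobs this patch exposes under
/sys/kernel/mm/transparent_hugepage/khugepaged/ can be tuned from userspace
like this:

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	const char *dir = "/sys/kernel/mm/transparent_hugepage/khugepaged/";
	char path[128];

	/* scan more pages per pass (the default is 16 pmds worth) */
	snprintf(path, sizeof(path), "%spages_to_scan", dir);
	write_knob(path, "8192");

	/* wake khugepaged every second instead of every 10 seconds */
	snprintf(path, sizeof(path), "%sscan_sleep_millisecs", dir);
	write_knob(path, "1000");
	return 0;
}
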
* Re: [PATCH 00 of 31] Transparent Hugepage support #8
2010-01-28 14:33 [PATCH 00 of 31] Transparent Hugepage support #8 Andrea Arcangeli
` (30 preceding siblings ...)
2010-01-28 14:33 ` [PATCH 31 of 31] khugepaged Andrea Arcangeli
@ 2010-01-28 17:05 ` Andrea Arcangeli
31 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 17:05 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro, Balbir Singh,
Arnd Bergmann
On Thu, Jan 28, 2010 at 03:33:14PM +0100, Andrea Arcangeli wrote:
> I suggest to try it (especially if you use i915_gem, as I need to know if
> anybody else can reproduce the khugepaged warning with pte_special set) and
If you test the #8 submission, please add i915.modeline=1 or, better, set
CONFIG_DRM_I915_KMS=y; I was just told that makes the problem go away.
quilt:
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc5/transparent_hugepage-8/
I also provide a monolithic patch that makes it easier to apply on
top of upstream, if somebody wants to help test this.
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.33-rc5/transparent_hugepage-8.bz2
--
* [PATCH 25 of 31] transparent hugepage core
2010-01-26 13:51 [PATCH 00 of 31] Transparent Hugepage support #7 Andrea Arcangeli
@ 2010-01-26 13:52 ` Andrea Arcangeli
2010-01-26 22:34 ` Rik van Riel
0 siblings, 1 reply; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 13:52 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro
From: Andrea Arcangeli <aarcange@redhat.com>
Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated onto hugepages automatically, in regions under
madvise(MADV_HUGEPAGE) (ideally event driven, by waking up the
kernel daemon when the order=HPAGE_PMD_SHIFT-PAGE_SHIFT free list becomes
non-empty)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal failures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guests not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
the linux guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...). Even more important is that the
tlb miss handler is much slower on a NPT/EPT guest than in a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there were no significant speedup in the tlb-miss runtime).
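(For reference, the 24/19/15 figures above match the usual two-dimensional
walk cost model, where a guest walk of g levels nested inside a host walk of
h levels costs g*h + g + h memory accesses; the snippet below only
illustrates that arithmetic, it is not code from this series.)

static int nested_walk_accesses(int guest_levels, int host_levels)
{
	/* each guest pagetable level needs a full host walk plus the
	   access to the guest entry itself, and the final guest
	   physical address needs one more host walk */
	return guest_levels * host_levels + guest_levels + host_levels;
}
/*
 * nested_walk_accesses(4, 4) == 24	4k pages everywhere
 * nested_walk_accesses(4, 3) == 19	2M pages in the hypervisor only
 * nested_walk_accesses(3, 3) == 15	2M pages in hypervisor and guest
 */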
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
possible way. We want hugepages and hugetlb to be used in a way so that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).
The most important design choice is: always fallback to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page would return -ENOMEM (instead of the current void)
we'd need to rollback the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
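As a concrete illustration of the one-liner plug mentioned above, a
pagetable walker that only understands ptes can be made safe against huge
pmds along these lines (the walker itself is made up for the example; the
real plugs in this patch are in the copy/zap/follow paths below):

	static void example_walk_pmd(struct vm_area_struct *vma, pmd_t *pmd,
				     unsigned long addr, unsigned long end)
	{
		/* split a huge pmd back to ptes; this is a no-op if the pmd
		   isn't huge, and it cannot fail, so no error handling is
		   needed here */
		split_huge_page_vma(vma, pmd);
		if (pmd_none_or_clear_bad(pmd))
			return;
		/* ... existing pte_offset_map_lock() loop over [addr, end) ... */
	}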
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
and incremental and it'll just be a "harmless" addition later if this
initial part is agreed upon. It also should be noted that locking-wise
replacing regular pages with hugepages is going to be very easy
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) if we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
minimal sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
work similarly to madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to three values "always", "madvise", "never" which
mean respectively that hugepages are always used, or only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
allocation should defrag memory aggressively "always", only inside "madvise"
regions, or "never".
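For completeness, a minimal userland sketch of opting a region in under the
"madvise" policy; the fallback define is only for illustration in case the
libc headers don't carry MADV_HUGEPAGE yet:

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* numeric value used by the kernel headers */
#endif

#define LEN (512UL*1024*1024)

int main(void)
{
	void *p = mmap(NULL, LEN, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* hint that this range is worth backing with hugepages */
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	return 0;
}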
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point
of view. In short there is no current existing way to serialize the
O_DIRECT final put_page against split_huge_page_refcount so I had to
invent a new one (O_DIRECT loses knowledge on the mapping status by
the time gup_fast returns so...). And I didn't want to impact all
gup/gup_fast users for now, maybe if we change the gup interface
substantially we can avoid this locking, I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.
If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage, so we can't simply avoid it;
the race of put_page against split_huge_page_refcount has to be solved to
achieve a complete hugepage feature for KVM.
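(Purely to illustrate the alternative dismissed above, and not implemented
anywhere in this series, the interface being discussed would look roughly
like the declaration below; the exact signature is hypothetical.)

/* hypothetical: gup_fast variant taking explicit FOLL_* flags, so mmu
 * notifier users like KVM could drop FOLL_GET and keep the existing
 * compound page refcounting untouched */
int get_user_pages_fast_flags(unsigned long start, int nr_pages,
			      unsigned int foll_flags, struct page **pages);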
Swap and oom work fine (well, just like with regular pages ;). The MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, which didn't have a range check, but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.
NOTE: in some cases if the L2 cache is small, this may slow down and
waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch the copy_huge_page (and
clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache thrashing with a full prefault
logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
patches that fully implemented prefault) but I concluded they're not
worth it: they add huge additional complexity and they remove all tlb
benefits until the full hugepage has been faulted in, to save a little bit of
memory and some cache during app startup, but they still don't improve
substantially the cache-thrashing during startup if the prefault happens in >4k
chunks. One reason is that those 4k pte entries copied are still mapped on a
perfectly cache-colored hugepage, so the thrashing is the worst one can generate
in those copies (cow of 4k page copies aren't so well colored so they thrash
less, but again this results in software running faster after the page fault).
Those prefault patches allowed things like a pte where post-cow pages were
local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
the middle of some hugepage mapped read-only. If it doesn't pay off
substantially with today's hardware it will pay off even less in the future with
larger l2 caches, and the prefault logic would bloat the VM a lot. If one is
embedded, transparent_hugepage can be disabled during boot with sysfs or with
the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) which will
ensure not a single hugepage is allocated at boot time. It is simple enough to
just disable transparent hugepage globally and let transparent hugepages be
allocated selectively by applications in the MADV_HUGEPAGE regions (both at page
fault time, and if enabled, with collapse_huge_page too through the kernel
daemon).
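(To make the numeric boot parameter above less magic: the string is parsed
directly into transparent_hugepage_flags by the __setup handler further
down, so each bit maps to one enum transparent_hugepage_flag entry, in the
order the enum is declared in huge_mm.h below.)

/*
 * transparent_hugepage= value	flag					effect
 *  1 (bit 0)	TRANSPARENT_HUGEPAGE_FLAG			enabled=always
 *  2 (bit 1)	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG		enabled=madvise
 *  4 (bit 2)	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG		defrag=always
 *  8 (bit 3)	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG	defrag=madvise
 *
 * e.g. transparent_hugepage=0 disables everything, =2 restricts
 * hugepages to madvise regions, and =5 matches the compiled-in default.
 */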
This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like power
have certain tlb limits that prevent mixing different page sizes in the same
regions, so they will not fit in this framework, which requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen to
match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
sizes like 1GByte that cannot be expected to be found unfragmented after a
certain system uptime and that would be very expensive to defragment with
relocation, so requiring reservation. hugetlbfs is the "reservation way"; the
point of transparent hugepages is not to have any reservation at all and to
maximize the use of cache and hugepages at all times automatically.
Some performance result:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	if (!p)
		return 1;

	/* first pass: page fault cost */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	/* second and third passes: memory is already faulted in */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	/* touch one byte per 4k page to stress the TLB */
	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);
	return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,128 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+ PAGE_CHECK_ADDRESS_PMD_FLAG,
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ enum page_check_address_pmd_flag flag);
+
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define transparent_hugepage_enabled(__vma) \
+ (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
+ (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma) \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
+ (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow() \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
+extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
+extern int split_huge_page(struct page *page);
+#define split_huge_page_mm(__mm, __addr, __pmd) \
+ do { \
+ if (unlikely(pmd_trans_huge(*(__pmd)))) \
+ __split_huge_page_mm(__mm, __addr, __pmd); \
+ } while (0)
+#define split_huge_page_vma(__vma, __pmd) \
+ do { \
+ if (unlikely(pmd_trans_huge(*(__pmd)))) \
+ __split_huge_page_vma(__vma, __pmd); \
+ } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+ do { \
+ smp_mb(); \
+ spin_unlock_wait(&(__anon_vma)->lock); \
+ smp_mb(); \
+ VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
+ pmd_trans_huge(*(__pmd))); \
+ } while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+ VM_BUG_ON(PageTail(page));
+ return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define transparent_hugepage_enabled(__vma) 0
+#define transparent_hugepage_defrag(__vma) 0
+#define transparent_hugepage_debug_cow() 0
+
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+ return 0;
+}
+#define split_huge_page_mm(__mm, __addr, __pmd) \
+ do { } while (0)
+#define split_huge_page_vma(__vma, __pmd) \
+ do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+ do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void
#define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
#define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
+#endif
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -234,6 +237,7 @@ struct inode;
* files which need it (119 of them)
*/
#include <linux/page-flags.h>
+#include <linux/huge_mm.h>
/*
* Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
}
static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+ struct list_head *head)
+{
+ list_add(&page->lru, head);
+ __inc_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
- list_add(&page->lru, &zone->lru[l].list);
- __inc_zone_state(zone, NR_LRU_BASE + l);
- mem_cgroup_add_lru_list(page, l);
+ __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
}
static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
extern void __lru_cache_add(struct page *, enum lru_list lru);
extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+ struct page *page, struct page *page_tail);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,848 @@
+/*
+ * Copyright (C) 2009 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+ (1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag enabled,
+ enum transparent_hugepage_flag req_madv)
+{
+ if (test_bit(enabled, &transparent_hugepage_flags)) {
+ VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+ return sprintf(buf, "[always] madvise never\n");
+ } else if (test_bit(req_madv, &transparent_hugepage_flags))
+ return sprintf(buf, "always [madvise] never\n");
+ else
+ return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag enabled,
+ enum transparent_hugepage_flag req_madv)
+{
+ if (!memcmp("always", buf,
+ min(sizeof("always")-1, count))) {
+ set_bit(enabled, &transparent_hugepage_flags);
+ clear_bit(req_madv, &transparent_hugepage_flags);
+ } else if (!memcmp("madvise", buf,
+ min(sizeof("madvise")-1, count))) {
+ clear_bit(enabled, &transparent_hugepage_flags);
+ set_bit(req_madv, &transparent_hugepage_flags);
+ } else if (!memcmp("never", buf,
+ min(sizeof("never")-1, count))) {
+ clear_bit(enabled, &transparent_hugepage_flags);
+ clear_bit(req_madv, &transparent_hugepage_flags);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+ __ATTR(enabled, 0644, enabled_show, enabled_store);
+
+/*
+ * Currently uses __GFP_REPEAT during allocation. Should be
+ * implemented using page migration and real defrag algorithms in
+ * future VM.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+ __ATTR(defrag, 0644, defrag_show, defrag_store);
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t single_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag flag)
+{
+ if (test_bit(flag, &transparent_hugepage_flags))
+ return sprintf(buf, "[yes] no\n");
+ else
+ return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag flag)
+{
+ if (!memcmp("yes", buf,
+ min(sizeof("yes")-1, count))) {
+ set_bit(flag, &transparent_hugepage_flags);
+ } else if (!memcmp("no", buf,
+ min(sizeof("no")-1, count))) {
+ clear_bit(flag, &transparent_hugepage_flags);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static ssize_t debug_cow_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+ __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+ &enabled_attr.attr,
+ &defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+ &debug_cow_attr.attr,
+#endif
+ NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+ .attrs = hugepage_attr,
+ .name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init ksm_init(void)
+{
+#ifdef CONFIG_SYSFS
+ int err;
+
+ err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+ if (err)
+ printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+ return 0;
+}
+module_init(ksm_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+ if (!str)
+ return 0;
+ transparent_hugepage_flags = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+ struct mm_struct *mm)
+{
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ /* FIFO */
+ if (!mm->pmd_huge_pte)
+ INIT_LIST_HEAD(&pgtable->lru);
+ else
+ list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+ mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_WRITE))
+ pmd = pmd_mkwrite(pmd);
+ return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ struct page *page,
+ unsigned long haddr)
+{
+ int ret = 0;
+ pgtable_t pgtable;
+
+ VM_BUG_ON(!PageCompound(page));
+ pgtable = pte_alloc_one(mm, address);
+ if (unlikely(!pgtable)) {
+ put_page(page);
+ return VM_FAULT_OOM;
+ }
+
+ clear_huge_page(page, haddr, HPAGE_PMD_NR);
+ __SetPageUptodate(page);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the clear_huge_page writes to
+ * become visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_none(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ put_page(page);
+ pte_free(mm, pgtable);
+ } else {
+ pmd_t entry;
+ entry = mk_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+ page_add_new_anon_rmap(page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ spin_unlock(&mm->page_table_lock);
+ }
+
+ return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+ return alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+ (defrag ? __GFP_REPEAT : 0)|__GFP_NOWARN,
+ HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ pte_t *pte;
+
+ if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ page = alloc_hugepage(transparent_hugepage_defrag(vma));
+ if (unlikely(!page))
+ goto out;
+
+ return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
+ page, haddr);
+ }
+out:
+ pte = pte_alloc_map(mm, vma, pmd, address);
+ if (!pte)
+ return VM_FAULT_OOM;
+ return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma)
+{
+ struct page *src_page;
+ pmd_t pmd;
+ pgtable_t pgtable;
+ int ret;
+
+ ret = -ENOMEM;
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;
+
+ spin_lock(&dst_mm->page_table_lock);
+ spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+ ret = -EAGAIN;
+ pmd = *src_pmd;
+ if (unlikely(!pmd_trans_huge(pmd)))
+ goto out_unlock;
+ if (unlikely(pmd_trans_splitting(pmd))) {
+ /* split huge page running from under us */
+ spin_unlock(&src_mm->page_table_lock);
+ spin_unlock(&dst_mm->page_table_lock);
+
+ wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ goto out;
+ }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON(!PageHead(src_page));
+ get_page(src_page);
+ page_dup_rmap(src_page);
+ add_mm_counter(dst_mm, anon_rss, HPAGE_PMD_NR);
+
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ pmd = pmd_mkold(pmd_wrprotect(pmd));
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ prepare_pmd_huge_pte(pgtable, dst_mm);
+
+ ret = 0;
+out_unlock:
+ spin_unlock(&src_mm->page_table_lock);
+ spin_unlock(&dst_mm->page_table_lock);
+out:
+ return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+ pgtable_t pgtable;
+
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ /* FIFO */
+ pgtable = mm->pmd_huge_pte;
+ if (list_empty(&pgtable->lru))
+ mm->pmd_huge_pte = NULL;
+ else {
+ mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ struct page, lru);
+ list_del(&pgtable->lru);
+ }
+ return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ pmd_t *pmd, pmd_t orig_pmd,
+ struct page *page,
+ unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int ret = 0, i;
+ struct page **pages;
+
+ pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+ GFP_KERNEL);
+ if (unlikely(!pages)) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+ vma, address);
+ if (unlikely(!pages[i])) {
+ while (--i >= 0)
+ put_page(pages[i]);
+ kfree(pages);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_free_pages;
+ else
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ copy_user_highpage(pages[i], page + i,
+ haddr + PAGE_SHIFT*i, vma);
+ __SetPageUptodate(pages[i]);
+ cond_resched();
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_free_pages;
+ else
+ put_page(page);
+
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = mk_pte(pages[i], vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(pages[i], vma, haddr);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ kfree(pages);
+
+ mm->nr_ptes++;
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ page_remove_rmap(page);
+ spin_unlock(&mm->page_table_lock);
+
+ ret |= VM_FAULT_WRITE;
+ put_page(page);
+
+out:
+ return ret;
+
+out_free_pages:
+ spin_unlock(&mm->page_table_lock);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ put_page(pages[i]);
+ kfree(pages);
+ goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+ int ret = 0;
+ struct page *page, *new_page;
+ unsigned long haddr;
+
+ VM_BUG_ON(!vma->anon_vma);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_unlock;
+
+ page = pmd_page(orig_pmd);
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ haddr = address & HPAGE_PMD_MASK;
+ if (page_mapcount(page) == 1) {
+ pmd_t entry;
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache(vma, address, entry);
+ ret |= VM_FAULT_WRITE;
+ goto out_unlock;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+ if (unlikely(transparent_hugepage_debug_cow()) && new_page) {
+ put_page(new_page);
+ new_page = NULL;
+ }
+ if (unlikely(!new_page))
+ return do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+
+ copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ __SetPageUptodate(new_page);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the copy_huge_page writes to become
+ * visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ put_page(new_page);
+ else {
+ pmd_t entry;
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ page_add_new_anon_rmap(new_page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache(vma, address, entry);
+ page_remove_rmap(page);
+ put_page(page);
+ ret |= VM_FAULT_WRITE;
+ }
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page = NULL;
+
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ if (flags & FOLL_WRITE && !pmd_write(*pmd))
+ goto out;
+
+ page = pmd_page(*pmd);
+ VM_BUG_ON(!PageHead(page));
+ if (flags & FOLL_TOUCH) {
+ pmd_t _pmd;
+ /*
+ * We should set the dirty bit only for FOLL_WRITE but
+ * for now the dirty bit in the pmd is meaningless.
+ * And if the dirty bit will become meaningful and
+ * we'll only set it with FOLL_WRITE, an atomic
+ * set_bit will be required on the pmd to set the
+ * young bit, instead of the current set_pmd_at.
+ */
+ _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+ }
+ page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+ VM_BUG_ON(!PageCompound(page));
+ if (flags & FOLL_GET)
+ get_page(page);
+
+out:
+ return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd)
+{
+ int ret = 0;
+
+ spin_lock(&tlb->mm->page_table_lock);
+ if (likely(pmd_trans_huge(*pmd))) {
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&tlb->mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma,
+ pmd);
+ } else {
+ struct page *page;
+ pgtable_t pgtable;
+ pgtable = get_pmd_huge_pte(tlb->mm);
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, anon_rss, -HPAGE_PMD_NR);
+ spin_unlock(&tlb->mm->page_table_lock);
+ VM_BUG_ON(!PageHead(page));
+ tlb_remove_page(tlb, page);
+ pte_free(tlb->mm, pgtable);
+ ret = 1;
+ }
+ } else
+ spin_unlock(&tlb->mm->page_table_lock);
+
+ return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ enum page_check_address_pmd_flag flag)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd, *ret = NULL;
+
+ if (address & ~HPAGE_PMD_MASK)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ goto out;
+ VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+ pmd_trans_splitting(*pmd));
+ if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+ VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+ !pmd_trans_splitting(*pmd));
+ ret = pmd;
+ }
+out:
+ return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd;
+ int ret = 0;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+ if (pmd) {
+ /*
+ * We can't temporarily set the pmd to null in order
+ * to split it, the pmd must remain marked huge at all
+ * times or the VM won't take the pmd_trans_huge paths
+ * and it won't wait on the anon_vma->lock to
+ * serialize against split_huge_page*.
+ */
+ pmdp_splitting_flush_notify(vma, address, pmd);
+ ret = 1;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+ int i;
+ unsigned long head_index = page->index;
+ struct zone *zone = page_zone(page);
+
+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irq(&zone->lru_lock);
+ compound_lock(page);
+
+ for (i = 1; i < HPAGE_PMD_NR; i++) {
+ struct page *page_tail = page + i;
+
+ /* tail_page->_count cannot change */
+ atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ BUG_ON(page_count(page) <= 0);
+ atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+ BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+ /* after clearing PageTail the gup refcount can be released */
+ smp_mb();
+
+ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ page_tail->flags |= (page->flags &
+ ((1L << PG_referenced) |
+ (1L << PG_swapbacked) |
+ (1L << PG_mlocked) |
+ (1L << PG_uptodate)));
+ page_tail->flags |= (1L << PG_dirty);
+
+ /*
+ * 1) clear PageTail before overwriting first_page
+ * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+ */
+ smp_wmb();
+
+ /*
+ * __split_huge_page_splitting() already set the
+ * splitting bit in all pmd that could map this
+ * hugepage, that will ensure no CPU can alter the
+ * mapcount on the head page. The mapcount is only
+ * accounted in the head page and it has to be
+ * transferred to all tail pages in the below code. So
+ * for this code to be safe, the split the mapcount
+ * can't change. But that doesn't mean userland can't
+ * keep changing and reading the page contents while
+ * we transfer the mapcount, so the pmd splitting
+ * status is achieved setting a reserved bit in the
+ * pmd, not by clearing the present bit.
+ */
+ BUG_ON(page_mapcount(page_tail));
+ page_tail->_mapcount = page->_mapcount;
+
+ BUG_ON(page_tail->mapping);
+ page_tail->mapping = page->mapping;
+
+ page_tail->index = ++head_index;
+
+ BUG_ON(!PageAnon(page_tail));
+ BUG_ON(!PageUptodate(page_tail));
+ BUG_ON(!PageDirty(page_tail));
+ BUG_ON(!PageSwapBacked(page_tail));
+
+ lru_add_page_tail(zone, page, page_tail);
+
+ put_page(page_tail);
+ }
+
+ ClearPageCompound(page);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+
+ BUG_ON(page_count(page) <= 0);
+}
+
+static int __split_huge_page_map(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd, _pmd;
+ int ret = 0, i;
+ pgtable_t pgtable;
+ unsigned long haddr;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+ if (pmd) {
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+ i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ BUG_ON(PageCompound(page+i));
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+ else
+ BUG_ON(page_mapcount(page) != 1);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+
+ mm->nr_ptes++;
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ ret = 1;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+/* must be called with anon_vma->lock held */
+static void __split_huge_page(struct page *page,
+ struct anon_vma *anon_vma)
+{
+ int mapcount, mapcount2;
+ struct vm_area_struct *vma;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mapcount = 0;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long addr = vma_address(page, vma);
+ if (addr == -EFAULT)
+ continue;
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page);
+
+ mapcount2 = 0;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long addr = vma_address(page, vma);
+ if (addr == -EFAULT)
+ continue;
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+ BUG_ON(mapcount != mapcount2);
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
+{
+ struct page *page;
+ struct anon_vma *anon_vma;
+ struct mm_struct *mm;
+
+ BUG_ON(vma->vm_flags & VM_HUGETLB);
+
+ mm = vma->vm_mm;
+
+ anon_vma = vma->anon_vma;
+
+ spin_lock(&anon_vma->lock);
+ BUG_ON(pmd_trans_splitting(*pmd));
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ spin_unlock(&anon_vma->lock);
+ return;
+ }
+ page = pmd_page(*pmd);
+ spin_unlock(&mm->page_table_lock);
+
+ __split_huge_page(page, anon_vma);
+
+ spin_unlock(&anon_vma->lock);
+ BUG_ON(pmd_trans_huge(*pmd));
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_mm(struct mm_struct *mm,
+ unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma->vm_start > address);
+ BUG_ON(vma->vm_mm != mm);
+
+ __split_huge_page_vma(vma, pmd);
+}
+
+int split_huge_page(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ int ret = 1;
+
+ BUG_ON(!PageAnon(page));
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ goto out;
+ ret = 0;
+ if (!PageCompound(page))
+ goto out_unlock;
+
+ BUG_ON(!PageSwapBacked(page));
+ __split_huge_page(page, anon_vma);
+
+ BUG_ON(PageCompound(page));
+out_unlock:
+ page_unlock_anon_vma(anon_vma);
+out:
+ return ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -647,9 +647,9 @@ out_set_pte:
return 0;
}
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
@@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*src_pmd)) {
+ int err;
+ err = copy_huge_pmd(dst_mm, src_mm,
+ dst_pmd, src_pmd, addr, vma);
+ if (err == -ENOMEM)
+ return -ENOMEM;
+ if (!err)
+ continue;
+ /* fall through */
+ }
if (pmd_none_or_clear_bad(src_pmd))
continue;
if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next-addr != HPAGE_PMD_SIZE)
+ split_huge_page_vma(vma, pmd);
+ else if (zap_huge_pmd(tlb, vma, pmd)) {
+ (*zap_work)--;
+ continue;
+ }
+ /* fall through */
+ }
if (pmd_none_or_clear_bad(pmd)) {
(*zap_work)--;
continue;
@@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
goto no_page_table;
- if (pmd_huge(*pmd)) {
+ if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
BUG_ON(flags & FOLL_GET);
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if (pmd_trans_huge(*pmd)) {
+ spin_lock(&mm->page_table_lock);
+ if (likely(pmd_trans_huge(*pmd))) {
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ } else {
+ page = follow_trans_huge_pmd(mm, address,
+ pmd, flags);
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+ } else
+ spin_unlock(&mm->page_table_lock);
+ /* fall through */
+ }
if (unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct
pmd = pmd_offset(pud, pg);
if (pmd_none(*pmd))
return i ? : -EFAULT;
+ VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, pg);
if (pte_none(*pte)) {
pte_unmap(pte);
@@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static inline int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags)
{
pte_t entry;
spinlock_t *ptl;
@@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
+ if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+ if (!vma->vm_ops)
+ return do_huge_pmd_anonymous_page(mm, vma, address,
+ pmd, flags);
+ } else {
+ pmd_t orig_pmd = *pmd;
+ barrier();
+ if (pmd_trans_huge(orig_pmd)) {
+ if (flags & FAULT_FLAG_WRITE &&
+ !pmd_write(orig_pmd) &&
+ !pmd_trans_splitting(orig_pmd))
+ return do_huge_pmd_wp_page(mm, vma, address,
+ pmd, orig_pmd);
+ return 0;
+ }
+ }
pte = pte_alloc_map(mm, vma, pmd, address);
if (!pte)
return VM_FAULT_OOM;
@@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
goto out;
pmd = pmd_offset(pud, address);
+ VM_BUG_ON(pmd_trans_huge(*pmd));
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto out;
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
+#include <linux/hugetlb.h>
#include <asm/tlbflush.h>
@@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
* Returns virtual address or -EFAULT if page's index/offset is not
* within the range mapped the @vma.
*/
-static inline unsigned long
+inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
- pte_t *pte;
- spinlock_t *ptl;
int referenced = 0;
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
/*
* Don't want to elevate referenced for mlocked page that gets this far,
* in order that it progresses to try_to_unmap and is moved to the
* unevictable list.
*/
if (vma->vm_flags & VM_LOCKED) {
- *mapcount = 1; /* break early from loop */
+ *mapcount = 0; /* break early from loop */
*vm_flags |= VM_LOCKED;
- goto out_unmap;
- }
-
- if (ptep_clear_flush_young_notify(vma, address, pte)) {
- /*
- * Don't treat a reference through a sequentially read
- * mapping as such. If the page has been used in
- * another mapping, we will catch it; if this other
- * mapping is already gone, the unmap path will have
- * set PG_referenced or activated the page.
- */
- if (likely(!VM_SequentialReadHint(vma)))
- referenced++;
+ goto out;
}
/* Pretend the page is referenced if the task has the
@@ -380,9 +363,39 @@ int page_referenced_one(struct page *pag
rwsem_is_locked(&mm->mmap_sem))
referenced++;
-out_unmap:
+ if (unlikely(PageTransHuge(page))) {
+ pmd_t *pmd;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_FLAG);
+ if (pmd && !pmd_trans_splitting(*pmd) &&
+ pmdp_clear_flush_young_notify(vma, address, pmd))
+ referenced++;
+ spin_unlock(&mm->page_table_lock);
+ } else {
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = page_check_address(page, mm, address, &ptl, 0);
+ if (!pte)
+ goto out;
+
+ if (ptep_clear_flush_young_notify(vma, address, pte)) {
+ /*
+ * Don't treat a reference through a sequentially read
+ * mapping as such. If the page has been used in
+ * another mapping, we will catch it; if this other
+ * mapping is already gone, the unmap path will have
+ * set PG_referenced or activated the page.
+ */
+ if (likely(!VM_SequentialReadHint(vma)))
+ referenced++;
+ }
+ pte_unmap_unlock(pte, ptl);
+ }
+
(*mapcount)--;
- pte_unmap_unlock(pte, ptl);
if (referenced)
*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -461,6 +461,43 @@ void __pagevec_release(struct pagevec *p
EXPORT_SYMBOL(__pagevec_release);
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone* zone,
+ struct page *page, struct page *page_tail)
+{
+ int active;
+ enum lru_list lru;
+ const int file = 0;
+ struct list_head *head;
+
+ VM_BUG_ON(!PageHead(page));
+ VM_BUG_ON(PageCompound(page_tail));
+ VM_BUG_ON(PageLRU(page_tail));
+ VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+ SetPageLRU(page_tail);
+
+ if (page_evictable(page_tail, NULL)) {
+ if (PageActive(page)) {
+ SetPageActive(page_tail);
+ active = 1;
+ lru = LRU_ACTIVE_ANON;
+ } else {
+ active = 0;
+ lru = LRU_INACTIVE_ANON;
+ }
+ update_page_reclaim_stat(zone, page_tail, file, active);
+ if (likely(PageLRU(page)))
+ head = page->lru.prev;
+ else
+ head = &zone->lru[lru].list;
+ __add_page_to_lru_list(zone, page_tail, lru, head);
+ } else {
+ SetPageUnevictable(page_tail);
+ add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+ }
+}
+
/*
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.
* Re: [PATCH 25 of 31] transparent hugepage core
2010-01-26 13:52 ` [PATCH 25 of 31] transparent hugepage core Andrea Arcangeli
@ 2010-01-26 22:34 ` Rik van Riel
0 siblings, 0 replies; 50+ messages in thread
From: Rik van Riel @ 2010-01-26 22:34 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro
On 01/26/2010 08:52 AM, Andrea Arcangeli wrote:
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* [PATCH 25 of 31] transparent hugepage core
2010-01-25 17:18 [PATCH 00 of 31] Transparent Hugepage support #6 Andrea Arcangeli
@ 2010-01-25 17:19 ` Andrea Arcangeli
0 siblings, 0 replies; 50+ messages in thread
From: Andrea Arcangeli @ 2010-01-25 17:19 UTC (permalink / raw)
To: linux-mm
Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins,
Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen, Dave Hansen,
Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, Andrew Morton,
bpicco, Christoph Hellwig, KOSAKI Motohiro
From: Andrea Arcangeli <aarcange@redhat.com>
Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated onto hugepages automatically, in regions under
madvise(MADV_HUGEPAGE) (ideally event driven, by waking up the
kernel daemon when the order=HPAGE_PMD_SHIFT-PAGE_SHIFT free list becomes
non-empty)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal failures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guests not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
the linux guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...). Even more important is that the
tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
possible way. We want hugepages and hugetlb to be used in a way so that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).
The most important design choice is: always fallback to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page would return -ENOMEM (instead of the current void)
we'd need to rollback the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
and incremental and it'll just be a "harmless" addition later if this
initial part is agreed upon. It also should be noted that locking-wise
replacing regular pages with hugepages is going to be very easy if
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) if we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
minimal sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
work similar to madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to one of three values, "always", "madvise" or
"never", meaning respectively that hugepages are always used, used only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls whether the
hugepage allocation should defrag memory aggressively: "always", only inside
"madvise" regions, or "never".
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use the mmu notifier,
like O_DIRECT) that runs against a concurrent __split_huge_page_refcount
instead was a pain to serialize in a way that always results in a coherent
page count for both tail and head. I think my locking solution, with the
compound_lock taken only after first_page is valid and the page is still a
PageHead, should be safe, but it surely needs review from an SMP race point
of view. In short, there is no existing way to serialize the
final O_DIRECT put_page against __split_huge_page_refcount, so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by
the time gup_fast returns, so...). And I didn't want to impact all
gup/gup_fast users for now; maybe if we change the gup interface
substantially we can avoid this locking. I admit I didn't think too
much about that, because changing the gup unpinning interface would be
invasive.
If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) that KVM (and any other mmu
notifier user) would call without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody has registered itself in the current
task's mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage, so we can't avoid solving
the race between put_page and __split_huge_page_refcount if we want a
complete hugepage feature for KVM.
Swap and oom work fine (well, just like with regular pages ;). The MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, which didn't have a range check; but I think KVM will be
fine, because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice that gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.
NOTE: in some cases, if the L2 cache is small, this may slow things down and
waste memory during COWs, because 4M of memory is accessed in a single
fault instead of 8k (the payoff is that after the COW the program can run
faster). So we might want to switch copy_huge_page (and
clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache thrashing with full prefault
logic that would COW in 8k/16k/32k/64k chunks up to 1M (I can send those
patches that fully implemented prefault), but I concluded they're not
worth it: they add a huge amount of additional complexity and they remove all
tlb benefits until the full hugepage has been faulted in, all to save a little
bit of memory and some cache during app startup; and they still don't
substantially improve the cache thrashing during startup if the prefault
happens in >4k chunks. One reason is that those copied 4k pte entries are
still mapped on a perfectly cache-colored hugepage, so the thrashing is the
worst one can generate in those copies (COWs of 4k pages aren't as well
colored so they thrash less, but again this results in software running faster
after the page fault). Those prefault patches allowed things like a pagetable
where the post-COW pages were local 4k regular anon pages and the not-yet-COWed
pte entries were pointing into the middle of some hugepage mapped read-only.
If it doesn't pay off substantially with today's hardware, it will pay off
even less in the future with larger L2 caches, and the prefault logic would
bloat the VM a lot. If one is embedded, transparent_hugepage can be disabled
via sysfs or with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages to madvise regions), which will
ensure not a single hugepage is allocated at boot time. It is simple enough to
just disable transparent hugepages globally and let transparent hugepages be
allocated selectively by applications in MADV_HUGEPAGE regions (both at page
fault time, and, if enabled, through the collapse_huge_page kernel daemon
too).
This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages won't fit in this patch alone. Also, some archs like power
have certain tlb limits that prevent mixing different page sizes in the same
region, so they won't fit in this framework, which requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen
to match the hardware limits. hugetlbfs also remains a perfect fit for
hugepage sizes like 1GByte, which cannot be hoped to remain unfragmented after
a certain system uptime and which would be very expensive to defragment with
relocation, and so require reservation. hugetlbfs is the "reservation way";
the point of transparent hugepages is to have no reservation at all and to
maximize the use of cache and hugepages automatically at all times.
Some performance results:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);
	return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,124 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+ PAGE_CHECK_ADDRESS_PMD_FLAG,
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ enum page_check_address_pmd_flag flag);
+
+#define transparent_hugepage_enabled(__vma) \
+ (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
+ (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma) \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) || \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
+ (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow() \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
+extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
+extern int split_huge_page(struct page *page);
+#define split_huge_page_mm(__mm, __addr, __pmd) \
+ do { \
+ if (unlikely(pmd_trans_huge(*(__pmd)))) \
+ __split_huge_page_mm(__mm, __addr, __pmd); \
+ } while (0)
+#define split_huge_page_vma(__vma, __pmd) \
+ do { \
+ if (unlikely(pmd_trans_huge(*(__pmd)))) \
+ __split_huge_page_vma(__vma, __pmd); \
+ } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+ do { \
+ smp_mb(); \
+ spin_unlock_wait(&(__anon_vma)->lock); \
+ smp_mb(); \
+ VM_BUG_ON(pmd_trans_splitting(*(__pmd)) || \
+ pmd_trans_huge(*(__pmd))); \
+ } while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+ VM_BUG_ON(PageTail(page));
+ return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+ return 0;
+}
+#define split_huge_page_mm(__mm, __addr, __pmd) \
+ do { } while (0)
+#define split_huge_page_vma(__vma, __pmd) \
+ do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+ do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void
#define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
#define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
+#endif
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -234,6 +237,7 @@ struct inode;
* files which need it (119 of them)
*/
#include <linux/page-flags.h>
+#include <linux/huge_mm.h>
/*
* Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
}
static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+ struct list_head *head)
+{
+ list_add(&page->lru, head);
+ __inc_zone_state(zone, NR_LRU_BASE + l);
+ mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
- list_add(&page->lru, &zone->lru[l].list);
- __inc_zone_state(zone, NR_LRU_BASE + l);
- mem_cgroup_add_lru_list(page, l);
+ __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
}
static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
extern void __lru_cache_add(struct page *, enum lru_list lru);
extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+ struct page *page, struct page *page_tail);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,848 @@
+/*
+ * Copyright (C) 2009 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+ (1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag enabled,
+ enum transparent_hugepage_flag req_madv)
+{
+ if (test_bit(enabled, &transparent_hugepage_flags)) {
+ VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+ return sprintf(buf, "[always] madvise never\n");
+ } else if (test_bit(req_madv, &transparent_hugepage_flags))
+ return sprintf(buf, "always [madvise] never\n");
+ else
+ return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag enabled,
+ enum transparent_hugepage_flag req_madv)
+{
+ if (!memcmp("always", buf,
+ min(sizeof("always")-1, count))) {
+ set_bit(enabled, &transparent_hugepage_flags);
+ clear_bit(req_madv, &transparent_hugepage_flags);
+ } else if (!memcmp("madvise", buf,
+ min(sizeof("madvise")-1, count))) {
+ clear_bit(enabled, &transparent_hugepage_flags);
+ set_bit(req_madv, &transparent_hugepage_flags);
+ } else if (!memcmp("never", buf,
+ min(sizeof("never")-1, count))) {
+ clear_bit(enabled, &transparent_hugepage_flags);
+ clear_bit(req_madv, &transparent_hugepage_flags);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+ __ATTR(enabled, 0644, enabled_show, enabled_store);
+
+/*
+ * Currently uses __GFP_REPEAT during allocation. Should be
+ * implemented using page migration and real defrag algorithms in
+ * future VM.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return double_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return double_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+ TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+ __ATTR(defrag, 0644, defrag_show, defrag_store);
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t single_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag flag)
+{
+ if (test_bit(flag, &transparent_hugepage_flags))
+ return sprintf(buf, "[yes] no\n");
+ else
+ return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag flag)
+{
+ if (!memcmp("yes", buf,
+ min(sizeof("yes")-1, count))) {
+ set_bit(flag, &transparent_hugepage_flags);
+ } else if (!memcmp("no", buf,
+ min(sizeof("no")-1, count))) {
+ clear_bit(flag, &transparent_hugepage_flags);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static ssize_t debug_cow_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+ __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+ &enabled_attr.attr,
+ &defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+ &debug_cow_attr.attr,
+#endif
+ NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+ .attrs = hugepage_attr,
+ .name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init ksm_init(void)
+{
+#ifdef CONFIG_SYSFS
+ int err;
+
+ err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+ if (err)
+ printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+ return 0;
+}
+module_init(ksm_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+ if (!str)
+ return 0;
+ transparent_hugepage_flags = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+ struct mm_struct *mm)
+{
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ /* FIFO */
+ if (!mm->pmd_huge_pte)
+ INIT_LIST_HEAD(&pgtable->lru);
+ else
+ list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+ mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_WRITE))
+ pmd = pmd_mkwrite(pmd);
+ return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ struct page *page,
+ unsigned long haddr)
+{
+ int ret = 0;
+ pgtable_t pgtable;
+
+ VM_BUG_ON(!PageCompound(page));
+ pgtable = pte_alloc_one(mm, address);
+ if (unlikely(!pgtable)) {
+ put_page(page);
+ return VM_FAULT_OOM;
+ }
+
+ clear_huge_page(page, haddr, HPAGE_PMD_NR);
+ __SetPageUptodate(page);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the clear_huge_page writes to
+ * become visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_none(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ put_page(page);
+ pte_free(mm, pgtable);
+ } else {
+ pmd_t entry;
+ entry = mk_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+ page_add_new_anon_rmap(page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ spin_unlock(&mm->page_table_lock);
+ }
+
+ return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+ return alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+ (defrag ? __GFP_REPEAT : 0)|__GFP_NOWARN,
+ HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ pte_t *pte;
+
+ if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ page = alloc_hugepage(transparent_hugepage_defrag(vma));
+ if (unlikely(!page))
+ goto out;
+
+ return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
+ page, haddr);
+ }
+out:
+ pte = pte_alloc_map(mm, vma, pmd, address);
+ if (!pte)
+ return VM_FAULT_OOM;
+ return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma)
+{
+ struct page *src_page;
+ pmd_t pmd;
+ pgtable_t pgtable;
+ int ret;
+
+ ret = -ENOMEM;
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;
+
+ spin_lock(&dst_mm->page_table_lock);
+ spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+ ret = -EAGAIN;
+ pmd = *src_pmd;
+ if (unlikely(!pmd_trans_huge(pmd)))
+ goto out_unlock;
+ if (unlikely(pmd_trans_splitting(pmd))) {
+ /* split huge page running from under us */
+ spin_unlock(&src_mm->page_table_lock);
+ spin_unlock(&dst_mm->page_table_lock);
+
+ wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ goto out;
+ }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON(!PageHead(src_page));
+ get_page(src_page);
+ page_dup_rmap(src_page);
+ add_mm_counter(dst_mm, anon_rss, HPAGE_PMD_NR);
+
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ pmd = pmd_mkold(pmd_wrprotect(pmd));
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ prepare_pmd_huge_pte(pgtable, dst_mm);
+
+ ret = 0;
+out_unlock:
+ spin_unlock(&src_mm->page_table_lock);
+ spin_unlock(&dst_mm->page_table_lock);
+out:
+ return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+ pgtable_t pgtable;
+
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ /* FIFO */
+ pgtable = mm->pmd_huge_pte;
+ if (list_empty(&pgtable->lru))
+ mm->pmd_huge_pte = NULL;
+ else {
+ mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ struct page, lru);
+ list_del(&pgtable->lru);
+ }
+ return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ pmd_t *pmd, pmd_t orig_pmd,
+ struct page *page,
+ unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int ret = 0, i;
+ struct page **pages;
+
+ pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+ GFP_KERNEL);
+ if (unlikely(!pages)) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+ vma, address);
+ if (unlikely(!pages[i])) {
+ while (--i >= 0)
+ put_page(pages[i]);
+ kfree(pages);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_free_pages;
+ else
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ copy_user_highpage(pages[i], page + i,
+ haddr + PAGE_SHIFT*i, vma);
+ __SetPageUptodate(pages[i]);
+ cond_resched();
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_free_pages;
+ else
+ put_page(page);
+
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = mk_pte(pages[i], vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(pages[i], vma, haddr);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ kfree(pages);
+
+ mm->nr_ptes++;
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ page_remove_rmap(page);
+ spin_unlock(&mm->page_table_lock);
+
+ ret |= VM_FAULT_WRITE;
+ put_page(page);
+
+out:
+ return ret;
+
+out_free_pages:
+ spin_unlock(&mm->page_table_lock);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ put_page(pages[i]);
+ kfree(pages);
+ goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+ int ret = 0;
+ struct page *page, *new_page;
+ unsigned long haddr;
+
+ VM_BUG_ON(!vma->anon_vma);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ goto out_unlock;
+
+ page = pmd_page(orig_pmd);
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ haddr = address & HPAGE_PMD_MASK;
+ if (page_mapcount(page) == 1) {
+ pmd_t entry;
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache(vma, address, entry);
+ ret |= VM_FAULT_WRITE;
+ goto out_unlock;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+ if (unlikely(transparent_hugepage_debug_cow()) && new_page) {
+ put_page(new_page);
+ new_page = NULL;
+ }
+ if (unlikely(!new_page))
+ return do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+
+ copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ __SetPageUptodate(new_page);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the copy_huge_page writes to become
+ * visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ put_page(new_page);
+ else {
+ pmd_t entry;
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ page_add_new_anon_rmap(new_page, vma, haddr);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache(vma, address, entry);
+ page_remove_rmap(page);
+ put_page(page);
+ ret |= VM_FAULT_WRITE;
+ }
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page = NULL;
+
+ VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+ if (flags & FOLL_WRITE && !pmd_write(*pmd))
+ goto out;
+
+ page = pmd_page(*pmd);
+ VM_BUG_ON(!PageHead(page));
+ if (flags & FOLL_TOUCH) {
+ pmd_t _pmd;
+ /*
+ * We should set the dirty bit only for FOLL_WRITE but
+ * for now the dirty bit in the pmd is meaningless.
+ * And if the dirty bit will become meaningful and
+ * we'll only set it with FOLL_WRITE, an atomic
+ * set_bit will be required on the pmd to set the
+ * young bit, instead of the current set_pmd_at.
+ */
+ _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+ }
+ page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+ VM_BUG_ON(!PageCompound(page));
+ if (flags & FOLL_GET)
+ get_page(page);
+
+out:
+ return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd)
+{
+ int ret = 0;
+
+ spin_lock(&tlb->mm->page_table_lock);
+ if (likely(pmd_trans_huge(*pmd))) {
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&tlb->mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma,
+ pmd);
+ } else {
+ struct page *page;
+ pgtable_t pgtable;
+ pgtable = get_pmd_huge_pte(tlb->mm);
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, anon_rss, -HPAGE_PMD_NR);
+ spin_unlock(&tlb->mm->page_table_lock);
+ VM_BUG_ON(!PageHead(page));
+ tlb_remove_page(tlb, page);
+ pte_free(tlb->mm, pgtable);
+ ret = 1;
+ }
+ } else
+ spin_unlock(&tlb->mm->page_table_lock);
+
+ return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+ struct mm_struct *mm,
+ unsigned long address,
+ enum page_check_address_pmd_flag flag)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd, *ret = NULL;
+
+ if (address & ~HPAGE_PMD_MASK)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ goto out;
+ VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+ pmd_trans_splitting(*pmd));
+ if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+ VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+ !pmd_trans_splitting(*pmd));
+ ret = pmd;
+ }
+out:
+ return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd;
+ int ret = 0;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+ if (pmd) {
+ /*
+ * We can't temporarily set the pmd to null in order
+ * to split it, the pmd must remain marked huge at all
+ * times or the VM won't take the pmd_trans_huge paths
+ * and it won't wait on the anon_vma->lock to
+ * serialize against split_huge_page*.
+ */
+ pmdp_splitting_flush_notify(vma, address, pmd);
+ ret = 1;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+ int i;
+ unsigned long head_index = page->index;
+ struct zone *zone = page_zone(page);
+
+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irq(&zone->lru_lock);
+ compound_lock(page);
+
+ for (i = 1; i < HPAGE_PMD_NR; i++) {
+ struct page *page_tail = page + i;
+
+ /* tail_page->_count cannot change */
+ atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+ BUG_ON(page_count(page) <= 0);
+ atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+ BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+ /* after clearing PageTail the gup refcount can be released */
+ smp_mb();
+
+ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ page_tail->flags |= (page->flags &
+ ((1L << PG_referenced) |
+ (1L << PG_swapbacked) |
+ (1L << PG_mlocked) |
+ (1L << PG_uptodate)));
+ page_tail->flags |= (1L << PG_dirty);
+
+ /*
+ * 1) clear PageTail before overwriting first_page
+ * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+ */
+ smp_wmb();
+
+ /*
+ * __split_huge_page_splitting() already set the
+ * splitting bit in all pmd that could map this
+ * hugepage, that will ensure no CPU can alter the
+ * mapcount on the head page. The mapcount is only
+ * accounted in the head page and it has to be
+ * transferred to all tail pages in the below code. So
+ * for this code to be safe, the split the mapcount
+ * can't change. But that doesn't mean userland can't
+ * keep changing and reading the page contents while
+ * we transfer the mapcount, so the pmd splitting
+ * status is achieved setting a reserved bit in the
+ * pmd, not by clearing the present bit.
+ */
+ BUG_ON(page_mapcount(page_tail));
+ page_tail->_mapcount = page->_mapcount;
+
+ BUG_ON(page_tail->mapping);
+ page_tail->mapping = page->mapping;
+
+ page_tail->index = ++head_index;
+
+ BUG_ON(!PageAnon(page_tail));
+ BUG_ON(!PageUptodate(page_tail));
+ BUG_ON(!PageDirty(page_tail));
+ BUG_ON(!PageSwapBacked(page_tail));
+
+ lru_add_page_tail(zone, page, page_tail);
+
+ put_page(page_tail);
+ }
+
+ ClearPageCompound(page);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+
+ BUG_ON(page_count(page) <= 0);
+}
+
+static int __split_huge_page_map(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd, _pmd;
+ int ret = 0, i;
+ pgtable_t pgtable;
+ unsigned long haddr;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+ if (pmd) {
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+ i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ BUG_ON(PageCompound(page+i));
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+ else
+ BUG_ON(page_mapcount(page) != 1);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+
+ mm->nr_ptes++;
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ ret = 1;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+/* must be called with anon_vma->lock hold */
+static void __split_huge_page(struct page *page,
+ struct anon_vma *anon_vma)
+{
+ int mapcount, mapcount2;
+ struct vm_area_struct *vma;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mapcount = 0;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long addr = vma_address(page, vma);
+ if (addr == -EFAULT)
+ continue;
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page);
+
+ mapcount2 = 0;
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long addr = vma_address(page, vma);
+ if (addr == -EFAULT)
+ continue;
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+ BUG_ON(mapcount != mapcount2);
+}
+
+/* must run with mmap_sem to prevent vma to go away */
+void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
+{
+ struct page *page;
+ struct anon_vma *anon_vma;
+ struct mm_struct *mm;
+
+ BUG_ON(vma->vm_flags & VM_HUGETLB);
+
+ mm = vma->vm_mm;
+
+ anon_vma = vma->anon_vma;
+
+ spin_lock(&anon_vma->lock);
+ BUG_ON(pmd_trans_splitting(*pmd));
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ spin_unlock(&anon_vma->lock);
+ return;
+ }
+ page = pmd_page(*pmd);
+ spin_unlock(&mm->page_table_lock);
+
+ __split_huge_page(page, anon_vma);
+
+ spin_unlock(&anon_vma->lock);
+ BUG_ON(pmd_trans_huge(*pmd));
+}
+
+/* must run with mmap_sem to prevent vma to go away */
+void __split_huge_page_mm(struct mm_struct *mm,
+ unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address + HPAGE_PMD_SIZE - 1);
+ BUG_ON(vma->vm_start > address);
+ BUG_ON(vma->vm_mm != mm);
+
+ __split_huge_page_vma(vma, pmd);
+}
+
+int split_huge_page(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ int ret = 1;
+
+ BUG_ON(!PageAnon(page));
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ goto out;
+ ret = 0;
+ if (!PageCompound(page))
+ goto out_unlock;
+
+ BUG_ON(!PageSwapBacked(page));
+ __split_huge_page(page, anon_vma);
+
+ BUG_ON(PageCompound(page));
+out_unlock:
+ page_unlock_anon_vma(anon_vma);
+out:
+ return ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -647,9 +647,9 @@ out_set_pte:
return 0;
}
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
@@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*src_pmd)) {
+ int err;
+ err = copy_huge_pmd(dst_mm, src_mm,
+ dst_pmd, src_pmd, addr, vma);
+ if (err == -ENOMEM)
+ return -ENOMEM;
+ if (!err)
+ continue;
+ /* fall through */
+ }
if (pmd_none_or_clear_bad(src_pmd))
continue;
if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next-addr != HPAGE_PMD_SIZE)
+ __split_huge_page_vma(vma, pmd);
+ else if (zap_huge_pmd(tlb, vma, pmd)) {
+ (*zap_work)--;
+ continue;
+ }
+ /* fall through */
+ }
if (pmd_none_or_clear_bad(pmd)) {
(*zap_work)--;
continue;
@@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
goto no_page_table;
- if (pmd_huge(*pmd)) {
+ if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
BUG_ON(flags & FOLL_GET);
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if (pmd_trans_huge(*pmd)) {
+ spin_lock(&mm->page_table_lock);
+ if (likely(pmd_trans_huge(*pmd))) {
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ } else {
+ page = follow_trans_huge_pmd(mm, address,
+ pmd, flags);
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+ } else
+ spin_unlock(&mm->page_table_lock);
+ /* fall through */
+ }
if (unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct
pmd = pmd_offset(pud, pg);
if (pmd_none(*pmd))
return i ? : -EFAULT;
+ VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, pg);
if (pte_none(*pte)) {
pte_unmap(pte);
@@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static inline int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags)
{
pte_t entry;
spinlock_t *ptl;
@@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
+ if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+ if (!vma->vm_ops)
+ return do_huge_pmd_anonymous_page(mm, vma, address,
+ pmd, flags);
+ } else {
+ pmd_t orig_pmd = *pmd;
+ barrier();
+ if (pmd_trans_huge(orig_pmd)) {
+ if (flags & FAULT_FLAG_WRITE &&
+ !pmd_write(orig_pmd) &&
+ !pmd_trans_splitting(orig_pmd))
+ return do_huge_pmd_wp_page(mm, vma, address,
+ pmd, orig_pmd);
+ return 0;
+ }
+ }
pte = pte_alloc_map(mm, vma, pmd, address);
if (!pte)
return VM_FAULT_OOM;
@@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
goto out;
pmd = pmd_offset(pud, address);
+ VM_BUG_ON(pmd_trans_huge(*pmd));
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto out;
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
+#include <linux/hugetlb.h>
#include <asm/tlbflush.h>
@@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
* Returns virtual address or -EFAULT if page's index/offset is not
* within the range mapped the @vma.
*/
-static inline unsigned long
+inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
- pte_t *pte;
- spinlock_t *ptl;
int referenced = 0;
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
/*
* Don't want to elevate referenced for mlocked page that gets this far,
* in order that it progresses to try_to_unmap and is moved to the
* unevictable list.
*/
if (vma->vm_flags & VM_LOCKED) {
- *mapcount = 1; /* break early from loop */
+ *mapcount = 0; /* break early from loop */
*vm_flags |= VM_LOCKED;
- goto out_unmap;
- }
-
- if (ptep_clear_flush_young_notify(vma, address, pte)) {
- /*
- * Don't treat a reference through a sequentially read
- * mapping as such. If the page has been used in
- * another mapping, we will catch it; if this other
- * mapping is already gone, the unmap path will have
- * set PG_referenced or activated the page.
- */
- if (likely(!VM_SequentialReadHint(vma)))
- referenced++;
+ goto out;
}
/* Pretend the page is referenced if the task has the
@@ -380,9 +363,39 @@ int page_referenced_one(struct page *pag
rwsem_is_locked(&mm->mmap_sem))
referenced++;
-out_unmap:
+ if (unlikely(PageTransHuge(page))) {
+ pmd_t *pmd;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = page_check_address_pmd(page, mm, address,
+ PAGE_CHECK_ADDRESS_PMD_FLAG);
+ if (pmd && !pmd_trans_splitting(*pmd) &&
+ pmdp_clear_flush_young_notify(vma, address, pmd))
+ referenced++;
+ spin_unlock(&mm->page_table_lock);
+ } else {
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = page_check_address(page, mm, address, &ptl, 0);
+ if (!pte)
+ goto out;
+
+ if (ptep_clear_flush_young_notify(vma, address, pte)) {
+ /*
+ * Don't treat a reference through a sequentially read
+ * mapping as such. If the page has been used in
+ * another mapping, we will catch it; if this other
+ * mapping is already gone, the unmap path will have
+ * set PG_referenced or activated the page.
+ */
+ if (likely(!VM_SequentialReadHint(vma)))
+ referenced++;
+ }
+ pte_unmap_unlock(pte, ptl);
+ }
+
(*mapcount)--;
- pte_unmap_unlock(pte, ptl);
if (referenced)
*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -461,6 +461,43 @@ void __pagevec_release(struct pagevec *p
EXPORT_SYMBOL(__pagevec_release);
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone* zone,
+ struct page *page, struct page *page_tail)
+{
+ int active;
+ enum lru_list lru;
+ const int file = 0;
+ struct list_head *head;
+
+ VM_BUG_ON(!PageHead(page));
+ VM_BUG_ON(PageCompound(page_tail));
+ VM_BUG_ON(PageLRU(page_tail));
+ VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+ SetPageLRU(page_tail);
+
+ if (page_evictable(page_tail, NULL)) {
+ if (PageActive(page)) {
+ SetPageActive(page_tail);
+ active = 1;
+ lru = LRU_ACTIVE_ANON;
+ } else {
+ active = 0;
+ lru = LRU_INACTIVE_ANON;
+ }
+ update_page_reclaim_stat(zone, page_tail, file, active);
+ if (likely(PageLRU(page)))
+ head = page->lru.prev;
+ else
+ head = &zone->lru[lru].list;
+ __add_page_to_lru_list(zone, page_tail, lru, head);
+ } else {
+ SetPageUnevictable(page_tail);
+ add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+ }
+}
+
/*
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.