linux-mm.kvack.org archive mirror
From: Ingo Molnar <mingo@kernel.org>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Paul Turner <pjt@google.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>
Subject: [PATCH 09/27] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides
Date: Mon, 19 Nov 2012 03:14:26 +0100	[thread overview]
Message-ID: <1353291284-2998-10-git-send-email-mingo@kernel.org> (raw)
In-Reply-To: <1353291284-2998-1-git-send-email-mingo@kernel.org>

This patch is based on the "mm/mpol: Create special PROT_NONE
infrastructure" patch and its variants, written by multiple people:

   Hugh Dickins <hughd@google.com>
   Johannes Weiner <hannes@cmpxchg.org>
   Peter Zijlstra <a.p.zijlstra@chello.nl>

I have reworked the code so significantly that I had to
drop the acks and signoffs.

In order to facilitate lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant; the resulting 'spurious'
protection faults can then be used to drive our migrations.
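
In schematic form, the flow this infrastructure is meant to enable
looks roughly like this (a simplified sketch, not literal code -- the
real fault paths are do_numa_page() and do_huge_pmd_numa_page() below,
and migrate_misplaced_page() is only stubbed out in this patch):

  /* 1) Scanner side: make a range of PTEs temporarily inaccessible: */
  change_prot_numa(vma, start, end);  /* transient PROT_NONE variant */

  /* 2) Fault side: recognize and resolve the 'spurious' fault: */
  if (pte_numa(vma, entry)) {
          int node = mpol_misplaced(page, vma, address);

          if (node != -1 && node != page_to_nid(page))
                  migrate_misplaced_page(page, node);  /* lazy migration */
          else
                  entry = pte_modify(entry, vma->vm_page_prot); /* restore access */

          task_numa_fault(page_nid, last_cpu, 1);  /* feed placement stats */
  }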

Pages that already had an effective PROT_NONE mapping will not
generate these 'spurious' faults, for the simple reason that we cannot
distinguish them from NUMA-hinting entries by their protection bits
alone -- see pte_numa().

This isn't a problem, since genuine PROT_NONE mappings (and possibly
PROT_WRITE mappings with dirty tracking) are either unused or rare
enough that we don't care about their placement.
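
For example, a mapping that user-space itself made inaccessible ends
up with protection bits identical to our NUMA-hinting entries, so
pte_numa() cannot (and does not need to) tell them apart:

  /* User-space illustration of a genuine PROT_NONE mapping: */
  void *p = mmap(NULL, 4096, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  /*
   * Here vma->vm_page_prot already encodes PROT_NONE, so the
   * pte_same(pte, pte_modify(pte, vma->vm_page_prot)) check in the
   * generic pte_numa() is true and pte_numa() returns false -- and in
   * practice such accesses are rejected as ordinary protection faults
   * before pte_numa() is ever consulted.
   */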

Architectures can set the CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT Kconfig
variable, in which case they get the PROT_NONE variant. Alternatively
they can provide the basic primitives themselves:

  bool pte_numa(struct vm_area_struct *vma, pte_t pte);
  pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
  bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
  pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot);
  unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);

[ This non-generic angle is untested though. ]
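
As a rough illustration only -- not part of this patch, and the names
are made up for the example -- an architecture with a spare software
bit in its PTE format (called _PAGE_NUMA here) could provide the
PTE-level primitives along these lines, using x86-style flag helpers:

  /* Hypothetical arch override, assuming a free software PTE bit: */
  static inline bool pte_numa(struct vm_area_struct *vma, pte_t pte)
  {
          return pte_flags(pte) & _PAGE_NUMA;
  }

  static inline pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
  {
          /* Mark the entry and make the next access fault: */
          return pte_clear_flags(pte_set_flags(pte, _PAGE_NUMA), _PAGE_PRESENT);
  }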

Original-Idea-by: Rik van Riel <riel@redhat.com>
Also-From: Johannes Weiner <hannes@cmpxchg.org>
Also-From: Hugh Dickins <hughd@google.com>
Also-From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/pgtable.h |  55 ++++++++++++++
 include/linux/huge_mm.h       |  12 ++++
 include/linux/mempolicy.h     |   6 ++
 include/linux/migrate.h       |   5 ++
 include/linux/mm.h            |   5 ++
 include/linux/sched.h         |   2 +
 init/Kconfig                  |  22 ++++++
 mm/Makefile                   |   1 +
 mm/huge_memory.c              | 162 ++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h                 |   5 +-
 mm/memcontrol.c               |   7 +-
 mm/memory.c                   |  85 ++++++++++++++++++++--
 mm/migrate.c                  |   2 +-
 mm/mprotect.c                 |   7 --
 mm/numa.c                     |  73 +++++++++++++++++++
 15 files changed, 430 insertions(+), 19 deletions(-)
 create mode 100644 mm/numa.c

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..d03d0a8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -537,6 +537,61 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 }
 
 /*
+ * Is this pte used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern bool pte_numa(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_numa
+static inline bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+	return false;
+}
+# endif
+#endif
+
+/*
+ * Turn a pte into a NUMA entry:
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_mknuma
+static inline pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+	return pte;
+}
+# endif
+#endif
+
+/*
+ * Is this pmd used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+#else
+# ifndef pmd_numa
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+	return false;
+}
+# endif
+#endif
+
+/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+	return newprot;
+}
+#endif
+
+/*
  * This is a noop if Transparent Hugepage Support is not built into
  * the kernel. Otherwise it is equivalent to
  * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..7f5a552 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -197,4 +197,16 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_NUMA_BALANCING_HUGEPAGE
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags, pmd_t orig_pmd);
+#else
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					 unsigned long address, pmd_t *pmd,
+					 unsigned int flags, pmd_t orig_pmd)
+{
+}
+#endif
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..f329306 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -324,4 +324,10 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 }
 
 #endif /* CONFIG_NUMA */
+
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..afd9af1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -64,4 +64,9 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define fail_migrate_page NULL
 
 #endif /* CONFIG_MIGRATION */
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5fc1d46..246375c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1559,6 +1559,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern unsigned long
+change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+#endif
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1581a0..a0a2808 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1575,6 +1575,8 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+static inline void task_numa_fault(int node, int cpu, int pages) { }
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..f36c83d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,28 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# Helper Kconfig switches to express compound feature dependencies
+# and thus make the .h/.c code more readable:
+#
+config NUMA_BALANCING_HUGEPAGE
+	bool
+	default y
+	depends on NUMA_BALANCING
+	depends on TRANSPARENT_HUGEPAGE
+
+config ARCH_USES_NUMA_GENERIC_PGPROT
+	bool
+	default y
+	depends on ARCH_WANTS_NUMA_GENERIC_PGPROT
+	depends on NUMA_BALANCING
+
+config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+	bool
+	default y
+	depends on ARCH_USES_NUMA_GENERIC_PGPROT
+	depends on TRANSPARENT_HUGEPAGE
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..26f7574 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT) += numa.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..814e3ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -725,6 +726,165 @@ out:
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Handle a NUMA fault: check whether we should migrate and
+ * mark it accessible again.
+ */
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   unsigned int flags, pmd_t entry)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct mem_cgroup *memcg = NULL;
+	struct page *new_page;
+	struct page *page = NULL;
+	int last_cpu;
+	int node = -1;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry)))
+		goto unlock;
+
+	if (unlikely(pmd_trans_splitting(entry))) {
+		spin_unlock(&mm->page_table_lock);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return;
+	}
+
+	page = pmd_page(entry);
+	if (page) {
+		int page_nid = page_to_nid(page);
+
+		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+		last_cpu = page_last_cpu(page);
+
+		get_page(page);
+		/*
+		 * Note that migrating pages shared by others is safe, since
+		 * get_user_pages() or GUP fast would have to fault this page
+		 * present before they could proceed, and we are holding the
+		 * pagetable lock here and are mindful of pmd races below.
+		 */
+		node = mpol_misplaced(page, vma, haddr);
+		if (node != -1 && node != page_nid)
+			goto migrate;
+	}
+
+fixup:
+	/* change back to regular protection */
+	entry = pmd_modify(entry, vma->vm_page_prot);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
+	spin_unlock(&mm->page_table_lock);
+	if (page) {
+		task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
+		put_page(page);
+	}
+	return;
+
+migrate:
+	spin_unlock(&mm->page_table_lock);
+
+	lock_page(page);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+		unlock_page(page);
+		put_page(page);
+		return;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_pages_node(node,
+	    (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+	if (!new_page)
+		goto alloc_fail;
+
+	if (isolate_lru_page(page)) {	/* Does an implicit get_page() */
+		put_page(new_page);
+		goto alloc_fail;
+	}
+
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
+
+	/* anon mapping, we can simply copy page->mapping to the new page: */
+	new_page->mapping = page->mapping;
+	new_page->index = page->index;
+
+	migrate_page_copy(new_page, page);
+
+	WARN_ON(PageLRU(new_page));
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+
+		/* Reverse changes made by migrate_page_copy() */
+		if (TestClearPageActive(new_page))
+			SetPageActive(page);
+		if (TestClearPageUnevictable(new_page))
+			SetPageUnevictable(page);
+		mlock_migrate_page(page, new_page);
+
+		unlock_page(new_page);
+		put_page(new_page);		/* Free it */
+
+		unlock_page(page);
+		putback_lru_page(page);
+		put_page(page);			/* Drop the local reference */
+		return;
+	}
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+	entry = mk_pmd(new_page, vma->vm_page_prot);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = pmd_mkhuge(entry);
+
+	page_add_new_anon_rmap(new_page, vma, haddr);
+
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
+	spin_unlock(&mm->page_table_lock);
+
+	task_numa_fault(node, last_cpu, HPAGE_PMD_NR);
+
+	unlock_page(new_page);
+	unlock_page(page);
+	put_page(page);			/* Drop the rmap reference */
+	put_page(page);			/* Drop the LRU isolation reference */
+	put_page(page);			/* Drop the local reference */
+	return;
+
+alloc_fail:
+	unlock_page(page);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		put_page(page);
+		page = NULL;
+		goto unlock;
+	}
+	goto fixup;
+}
+#endif
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
@@ -1363,6 +1523,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(vma, *pmd))
+				entry = pte_mknuma(vma, entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..b84d571 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -212,11 +212,12 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 {
 	if (TestClearPageMlocked(page)) {
 		unsigned long flags;
+		int nr_pages = hpage_nr_pages(page);
 
 		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
+		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		SetPageMlocked(newpage);
-		__inc_zone_page_state(newpage, NR_MLOCK);
+		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
 		local_irq_restore(flags);
 	}
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
diff --git a/mm/memory.c b/mm/memory.c
index 24d3a4a..b9bb15c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3437,6 +3438,69 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pmd_t *pmd,
+			unsigned int flags, pte_t entry)
+{
+	struct page *page = NULL;
+	int node, page_nid = -1;
+	int last_cpu = -1;
+	spinlock_t *ptl;
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, entry)))
+		goto out_unlock;
+
+	page = vm_normal_page(vma, address, entry);
+	if (page) {
+		get_page(page);
+		page_nid = page_to_nid(page);
+		last_cpu = page_last_cpu(page);
+		node = mpol_misplaced(page, vma, address);
+		if (node != -1 && node != page_nid)
+			goto migrate;
+	}
+
+out_pte_upgrade_unlock:
+	flush_cache_page(vma, address, pte_pfn(entry));
+
+	ptep_modify_prot_start(mm, address, ptep);
+	entry = pte_modify(entry, vma->vm_page_prot);
+	ptep_modify_prot_commit(mm, address, ptep, entry);
+
+	/* No TLB flush needed because we upgraded the PTE */
+
+	update_mmu_cache(vma, address, ptep);
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+out:
+	if (page) {
+		task_numa_fault(page_nid, last_cpu, 1);
+		put_page(page);
+	}
+
+	return 0;
+
+migrate:
+	pte_unmap_unlock(ptep, ptl);
+
+	if (!migrate_misplaced_page(page, node)) {
+		page_nid = node;
+		goto out;
+	}
+
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_same(*ptep, entry)) {
+		put_page(page);
+		page = NULL;
+		goto out_unlock;
+	}
+
+	goto out_pte_upgrade_unlock;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3475,6 +3539,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(vma, entry))
+		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3539,13 +3606,16 @@ retry:
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
-		int ret;
+		int ret = 0;
 
 		barrier();
-		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+		if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(vma, orig_pmd)) {
+				do_huge_pmd_numa_page(mm, vma, address, pmd,
+						      flags, orig_pmd);
+			}
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3555,12 +3625,13 @@ retry:
 				 */
 				if (unlikely(ret & VM_FAULT_OOM))
 					goto retry;
-				return ret;
 			}
-			return 0;
+
+			return ret;
 		}
 	}
 
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..4ba45f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -407,7 +407,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	if (PageHuge(page))
+	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
 		copy_highpage(newpage, page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7c3628a..6ff2d5e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,13 +28,6 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
-	return newprot;
-}
-#endif
-
 static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
diff --git a/mm/numa.c b/mm/numa.c
new file mode 100644
index 0000000..8d18800
--- /dev/null
+++ b/mm/numa.c
@@ -0,0 +1,73 @@
+/*
+ * Generic NUMA page table entry support. This code reuses
+ * PROT_NONE: an architecture can choose to use its own
+ * implementation, by setting CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
+ * and not setting CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT.
+ */
+#include <linux/mm.h>
+
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+	/*
+	 * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+	 */
+	vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+
+	return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+	/*
+	 * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+	 * "normal" vma->vm_page_prot protections.  Genuine PROT_NONE
+	 * VMAs should never get here, because the fault handling code
+	 * will notice that the VMA has no read or write permissions.
+	 *
+	 * This means we cannot get 'special' PROT_NONE faults from genuine
+	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+	 * tracking.
+	 *
+	 * Neither case is really interesting for our current use though so we
+	 * don't care.
+	 */
+	if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+		return false;
+
+	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
+pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+	return pte_modify(pte, vma_prot_none(vma));
+}
+
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+	/*
+	 * See pte_numa() above
+	 */
+	if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+		return false;
+
+	return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+#endif
+
+/*
+ * The scheduler uses this function to mark a range of virtual
+ * memory inaccessible to user-space, for the purposes of probing
+ * the composition of the working set.
+ *
+ * The resulting page faults will be demultiplexed into:
+ *
+ *    mm/memory.c::do_numa_page()
+ *    mm/huge_memory.c::do_huge_pmd_numa_page()
+ *
+ * This generic version simply uses PROT_NONE.
+ */
+unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	return change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
-- 
1.7.11.7

Thread overview: 101+ messages
2012-11-19  2:14 [PATCH 00/27] Latest numa/core release, v16 Ingo Molnar
2012-11-19  2:14 ` [PATCH 01/27] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
2012-11-19  2:14 ` [PATCH 02/27] x86/mm: Only do a local tlb flush " Ingo Molnar
2012-11-19  2:14 ` [PATCH 03/27] x86/mm: Introduce pte_accessible() Ingo Molnar
2012-11-19  2:14 ` [PATCH 04/27] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
2012-11-19  2:14 ` [PATCH 05/27] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
2012-11-19  2:14 ` [PATCH 06/27] mm: Count the number of pages affected in change_protection() Ingo Molnar
2012-11-19  2:14 ` [PATCH 07/27] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
2012-11-19  2:14 ` [PATCH 08/27] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
2012-11-19  2:14 ` Ingo Molnar [this message]
2012-11-19  2:14 ` [PATCH 10/27] sched: Make find_busiest_queue() a method Ingo Molnar
2012-11-19  2:14 ` [PATCH 11/27] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
2012-11-19  2:14 ` [PATCH 12/27] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
2012-11-19  2:14 ` [PATCH 13/27] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
2012-11-19  2:14 ` [PATCH 14/27] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
2012-11-19  2:14 ` [PATCH 15/27] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
2012-11-19  2:14 ` [PATCH 16/27] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
2012-11-19  2:14 ` [PATCH 17/27] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
2012-11-19  2:14 ` [PATCH 18/27] sched: Add adaptive NUMA affinity support Ingo Molnar
2012-11-19  2:14 ` [PATCH 19/27] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
2012-11-19  2:14 ` [PATCH 20/27] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
2012-11-19  2:14 ` [PATCH 21/27] sched: Implement slow start for working set sampling Ingo Molnar
2012-11-19  2:14 ` [PATCH 22/27] sched, numa, mm: Interleave shared tasks Ingo Molnar
2012-11-19  2:14 ` [PATCH 23/27] sched: Implement NUMA scanning backoff Ingo Molnar
2012-11-19  2:14 ` [PATCH 24/27] sched: Improve convergence Ingo Molnar
2012-11-19  2:14 ` [PATCH 25/27] sched: Introduce staged average NUMA faults Ingo Molnar
2012-11-19  2:14 ` [PATCH 26/27] sched: Track groups of shared tasks Ingo Molnar
2012-11-19  2:14 ` [PATCH 27/27] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
2012-11-19 16:29 ` [PATCH 00/27] Latest numa/core release, v16 Mel Gorman
2012-11-19 19:13   ` Ingo Molnar
2012-11-19 21:18     ` Mel Gorman
2012-11-19 22:36       ` Ingo Molnar
2012-11-19 23:00         ` Mel Gorman
2012-11-20  0:41           ` Rik van Riel
2012-11-21 10:58             ` Mel Gorman
2012-11-20  1:02         ` Linus Torvalds
2012-11-20  7:17           ` Ingo Molnar
2012-11-20  7:37             ` David Rientjes
2012-11-20  7:48               ` Ingo Molnar
2012-11-20  8:01               ` Ingo Molnar
2012-11-20  8:11                 ` David Rientjes
2012-11-21 11:14               ` Mel Gorman
2012-11-20 10:20             ` Mel Gorman
2012-11-20 10:47               ` Mel Gorman
2012-11-20 15:29             ` [PATCH] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones Ingo Molnar
2012-11-20 16:09               ` [PATCH, v2] " Ingo Molnar
2012-11-20 16:31                 ` Rik van Riel
2012-11-20 16:52                   ` Ingo Molnar
2012-11-21 12:08                     ` Mel Gorman
2012-11-21  8:12                   ` Ingo Molnar
2012-11-21  2:41                 ` David Rientjes
2012-11-21  9:34                   ` Ingo Molnar
2012-11-21 11:40                 ` Mel Gorman
2012-11-23  1:26                 ` Alex Shi
2012-11-20 17:56               ` numa/core regressions fixed - more testers wanted Ingo Molnar
2012-11-21  1:54                 ` Andrew Theurer
2012-11-21  3:22                   ` Rik van Riel
2012-11-21  4:10                     ` Hugh Dickins
2012-11-21 17:59                       ` Andrew Theurer
2012-11-21 11:52                   ` Mel Gorman
2012-11-21 22:15                     ` Andrew Theurer
2012-11-21  3:33                 ` David Rientjes
2012-11-21  9:38                   ` Ingo Molnar
2012-11-21 11:06                   ` Ingo Molnar
2012-11-21  8:39                 ` Alex Shi
2012-11-22  1:21                   ` Ingo Molnar
2012-11-23 13:31                     ` Ingo Molnar
2012-11-23 15:23                       ` Alex Shi
2012-11-26  2:11                       ` Alex Shi
2012-11-28 14:21                         ` Alex Shi
2012-11-20 10:40         ` [PATCH 00/27] Latest numa/core release, v16 Ingo Molnar
2012-11-20 11:40           ` Mel Gorman
2012-11-21 10:38     ` Mel Gorman
2012-11-21 19:37       ` Andrea Arcangeli
2012-11-21 19:56         ` Mel Gorman
2012-11-19 20:07   ` Ingo Molnar
2012-11-19 21:37     ` Mel Gorman
2012-11-20  0:50   ` David Rientjes
2012-11-20  1:05     ` David Rientjes
2012-11-20  6:00       ` Ingo Molnar
2012-11-20  6:20         ` David Rientjes
2012-11-20  7:44           ` Ingo Molnar
2012-11-20  7:48             ` Paul Turner
2012-11-20  8:20             ` David Rientjes
2012-11-20  9:06               ` Ingo Molnar
2012-11-20  9:41                 ` [patch] x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it Ingo Molnar
2012-11-20 23:01                   ` Andy Lutomirski
2012-11-21  0:43                   ` David Rientjes
2012-11-20 12:02                 ` [PATCH] x86/mm: Don't flush the TLB on #WP pmd fixups Ingo Molnar
2012-11-20 12:31                   ` Ingo Molnar
2012-11-21 11:47                     ` Mel Gorman
2012-11-21  1:22                   ` David Rientjes
2012-11-21 17:02                 ` [PATCH 00/27] Latest numa/core release, v16 Linus Torvalds
2012-11-21 17:10                   ` Ingo Molnar
2012-11-21 17:20                     ` Ingo Molnar
2012-11-22  4:31                       ` David Rientjes
2012-11-21 17:40                     ` Ingo Molnar
2012-11-21 22:04                     ` Linus Torvalds
2012-11-21 22:46                       ` Ingo Molnar
2012-11-21 17:45                   ` Rik van Riel
2012-11-21 18:04                   ` Ingo Molnar
