* [PATCH 01/10] mm: pagewalk: consolidate vma->vm_start checks
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 02/10] mm: pagewalk: always skip hugetlbfs except when explicitly handled Dave Hansen
` (9 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
We check vma->vm_start against the address being walked twice.
Consolidate the locations down to a single one.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/mm/pagewalk.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff -puN mm/pagewalk.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled mm/pagewalk.c
--- a/mm/pagewalk.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled 2014-06-02 14:20:18.938791410 -0700
+++ b/mm/pagewalk.c 2014-06-02 14:20:18.941791545 -0700
@@ -192,13 +192,12 @@ int walk_page_range(unsigned long addr,
* - VM_PFNMAP vma's
*/
vma = find_vma(walk->mm, addr);
- if (vma) {
+ if (vma && (vma->vm_start <= addr)) {
/*
* There are no page structures backing a VM_PFNMAP
* range, so do not allow split_huge_page_pmd().
*/
- if ((vma->vm_start <= addr) &&
- (vma->vm_flags & VM_PFNMAP)) {
+ if (vma->vm_flags & VM_PFNMAP) {
next = vma->vm_end;
pgd = pgd_offset(walk->mm, next);
continue;
@@ -209,8 +208,7 @@ int walk_page_range(unsigned long addr,
* architecture and we can't handled it in the same
* manner as non-huge pages.
*/
- if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
- is_vm_hugetlb_page(vma)) {
+ if (walk->hugetlb_entry && is_vm_hugetlb_page(vma)) {
if (vma->vm_end < next)
next = vma->vm_end;
/*
_
* [PATCH 02/10] mm: pagewalk: always skip hugetlbfs except when explicitly handled
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
2014-06-02 21:36 ` [PATCH 01/10] mm: pagewalk: consolidate vma->vm_start checks Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 03/10] mm: pagewalk: have generic code keep track of VMA Dave Hansen
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The short story:
The walk_page_range() code is fragile for hugetlbfs VMAs. Each
walker instance must either exclude hugetlbfs from being walked,
or add a ->hugetlb_entry handler. If this is not done, the code
will go off the rails and start clearing huge page table entries.
This patch removes that requirement on the walkers. They can
merrily call walk_page_range() on hugetlbfs areas, and those
areas will simply be skipped inside the page walker code if they
have not set up a handler.
This makes the code more robust, shorter, and makes it more
intuitive to write a page table walker. Yay.
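The new behavior can be modeled in plain userspace C. This is a sketch with made-up names (struct model_walk and friends are not the kernel API): a walker that early-returns when no hugetlb handler is registered, instead of falling through into code that mangles the page tables.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model (not kernel code): the walk skips hugetlb "VMAs"
 * when no hugetlb_entry handler is registered, instead of
 * corrupting state.  Names are illustrative only. */
struct model_walk {
	int (*hugetlb_entry)(unsigned long addr, void *private);
	void *private;
};

/* Mirrors the fix: bail out early when no handler is set. */
static int model_walk_hugetlb_range(struct model_walk *walk,
				    unsigned long addr)
{
	if (!walk->hugetlb_entry)
		return 0;	/* silently skip, as the patch does */
	return walk->hugetlb_entry(addr, walk->private);
}

/* A trivial handler that just counts how often it was invoked. */
static int count_entry(unsigned long addr, void *private)
{
	(void)addr;
	++*(int *)private;
	return 0;
}
```

With no handler set, the range is skipped without error; once a handler is installed, it is called as before.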
Long story:
I was looking at the page walker code and thought I found a bug.
If the walker hits a hugetlbfs VMA where walk->hugetlb_entry was
not set, it would fall through the if(), and then clear out the
pgd thinking it was bad.
This essentially means that *EVERY* page walker has to *KNOW* to
either exclude hugetlbfs VMAs, or set a ->hugetlb_entry handler.
The good news is that all 9 users of walk_page_range() do this
implicitly or explicitly. The bad news is that it took me an
hour to convince myself of this, and future walk_page_range()
instances are vulnerable to making this mistake. I think the
madvise() use was probably just lucky (details below).
Here's the code trimmed down.  Note what happens if we have an
is_vm_hugetlb_page() VMA, !walk->hugetlb_entry, and a huge page
entry in 'pgd' (or in any of the lower levels).
int walk_page_range(unsigned long addr, unsigned long end, ...
{
...
vma = find_vma(walk->mm, addr);
if (vma) {
if (walk->hugetlb_entry && is_vm_hugetlb_page(vma)) {
walk_hugetlb_range(vma, addr, next, walk);
...
continue;
}
}
if (pgd_none_or_clear_bad(pgd)) {
There are currently 9 users of walk_page_range(). They handle
hugetlbfs pages in 5 ways:
/proc/$pid/smaps:
/proc/$pid/clear_refs:
cgroup precharge:
cgroup move charge:
checks the VMA explicitly for hugetlbfs and skips it; does not set
->hugetlb_entry (this patch removes the now-unnecessary
hugetlbfs checks for these)
openrisc dma alloc:
works on kernel memory, so no hugetlbfs; the arch does not
even support hugetlbfs anyway
powerpc subpage protection:
uses arch-specific is_hugepage_only_range() check
/proc/$pid/pagemap:
/proc/$pid/numa_map:
sets ->hugetlb_entry
(these are unaffected by this patch)
MADV_WILLNEED:
does not set ->hugetlb_entry
only called via:
madvise_willneed() {
if (!vma->vm_file)
force_swapin_readahead(...) {
walk_page_range(...)
}
}
That !vma->vm_file check just _happens_ to cover hugetlbfs
vmas since they are always file-backed (or at least have
vma->vm_file set as far as I can tell)
(this case is unaffected by this patch)
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/fs/proc/task_mmu.c | 4 +---
b/mm/memcontrol.c | 4 ----
b/mm/pagewalk.c | 5 ++++-
3 files changed, 5 insertions(+), 8 deletions(-)
diff -puN fs/proc/task_mmu.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 2014-06-02 14:20:19.210803615 -0700
+++ b/fs/proc/task_mmu.c 2014-06-02 14:20:19.218803974 -0700
@@ -590,7 +590,7 @@ static int show_smap(struct seq_file *m,
memset(&mss, 0, sizeof mss);
mss.vma = vma;
/* mmap_sem is held in m_start */
- if (vma->vm_mm && !is_vm_hugetlb_page(vma))
+ if (vma->vm_mm)
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
show_map_vma(m, vma, is_pid);
@@ -829,8 +829,6 @@ static ssize_t clear_refs_write(struct f
mmu_notifier_invalidate_range_start(mm, 0, -1);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cp.vma = vma;
- if (is_vm_hugetlb_page(vma))
- continue;
/*
* Writing 1 to /proc/pid/clear_refs affects all pages.
*
diff -puN mm/memcontrol.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 mm/memcontrol.c
--- a/mm/memcontrol.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 2014-06-02 14:20:19.212803706 -0700
+++ b/mm/memcontrol.c 2014-06-02 14:20:19.220804064 -0700
@@ -6821,8 +6821,6 @@ static unsigned long mem_cgroup_count_pr
.mm = mm,
.private = vma,
};
- if (is_vm_hugetlb_page(vma))
- continue;
walk_page_range(vma->vm_start, vma->vm_end,
&mem_cgroup_count_precharge_walk);
}
@@ -7087,8 +7085,6 @@ retry:
.mm = mm,
.private = vma,
};
- if (is_vm_hugetlb_page(vma))
- continue;
ret = walk_page_range(vma->vm_start, vma->vm_end,
&mem_cgroup_move_charge_walk);
if (ret)
diff -puN mm/pagewalk.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 mm/pagewalk.c
--- a/mm/pagewalk.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 2014-06-02 14:20:19.214803794 -0700
+++ b/mm/pagewalk.c 2014-06-02 14:20:19.220804064 -0700
@@ -115,6 +115,9 @@ static int walk_hugetlb_range(struct vm_
pte_t *pte;
int err = 0;
+ if (!walk->hugetlb_entry)
+ return 0;
+
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
@@ -208,7 +211,7 @@ int walk_page_range(unsigned long addr,
* architecture and we can't handled it in the same
* manner as non-huge pages.
*/
- if (walk->hugetlb_entry && is_vm_hugetlb_page(vma)) {
+ if (is_vm_hugetlb_page(vma)) {
if (vma->vm_end < next)
next = vma->vm_end;
/*
_
* [PATCH 03/10] mm: pagewalk: have generic code keep track of VMA
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
2014-06-02 21:36 ` [PATCH 01/10] mm: pagewalk: consolidate vma->vm_start checks Dave Hansen
2014-06-02 21:36 ` [PATCH 02/10] mm: pagewalk: always skip hugetlbfs except when explicitly handled Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 04/10] mm: pagewalk: add page walker for mincore() Dave Hansen
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
7 of the 9 page walkers need the VMA and pass it in some
way through mm_walk->private.  Let's add it to the page walker
infrastructure.
This will increase the number of find_vma() calls, but the VMA
cache should help us out pretty nicely here. This is also quite
easy to optimize if this turns out to be an issue by skipping the
find_vma() call if 'addr' is still within our current
mm_walk->vma.
/proc/$pid/numa_map:
/proc/$pid/smaps:
lots of stuff including vma (vma is a drop in the bucket)
in a struct
/proc/$pid/clear_refs:
passes vma plus an enum in a struct
/proc/$pid/pagemap:
openrisc:
no VMA
MADV_WILLNEED:
walk->private is set to vma
cgroup precharge:
walk->private is set to vma
cgroup move charge:
walk->private is set to vma
powerpc subpages:
walk->private is set to vma
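The tracking semantics described above can be sketched as a small userspace model (hypothetical names, not the kernel's types), including the find_vma() inexactness the walker has to compensate for:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of walk->vma tracking.  model_find_vma() mimics
 * find_vma(): it returns the first VMA with vm_end > addr, which may
 * lie entirely past the range being walked. */
struct model_vma {
	unsigned long vm_start, vm_end;
	struct model_vma *vm_next;	/* sorted, ascending */
};

struct model_walk {
	struct model_vma *mm;	/* VMA list standing in for the mm */
	struct model_vma *vma;	/* tracked like walk->vma */
};

static struct model_vma *model_find_vma(struct model_vma *list,
					unsigned long addr)
{
	for (; list; list = list->vm_next)
		if (addr < list->vm_end)
			return list;
	return NULL;
}

static void model_walk_update_vma(unsigned long addr, unsigned long end,
				  struct model_walk *walk)
{
	struct model_vma *new_vma = model_find_vma(walk->mm, addr);

	/* find_vma() is inexact: drop a VMA entirely past this range */
	if (new_vma && new_vma->vm_start >= end)
		new_vma = NULL;
	walk->vma = new_vma;
}
```

Walking a hole between two VMAs leaves the tracked VMA NULL; a range only partially covered yields the first covering VMA.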
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/powerpc/mm/subpage-prot.c | 3 --
b/fs/proc/task_mmu.c | 25 ++++++---------------
b/include/linux/mm.h | 1
b/mm/madvise.c | 3 --
b/mm/memcontrol.c | 4 +--
b/mm/pagewalk.c | 45 ++++++++++++++++++++++++++++++++++-----
6 files changed, 52 insertions(+), 29 deletions(-)
diff -puN arch/powerpc/mm/subpage-prot.c~page-walker-pass-vma arch/powerpc/mm/subpage-prot.c
--- a/arch/powerpc/mm/subpage-prot.c~page-walker-pass-vma 2014-06-02 14:20:19.524817706 -0700
+++ b/arch/powerpc/mm/subpage-prot.c 2014-06-02 14:20:19.536818243 -0700
@@ -134,7 +134,7 @@ static void subpage_prot_clear(unsigned
static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
- struct vm_area_struct *vma = walk->private;
+ struct vm_area_struct *vma = walk->vma;
split_huge_page_pmd(vma, addr, pmd);
return 0;
}
@@ -163,7 +163,6 @@ static void subpage_mark_vma_nohuge(stru
if (vma->vm_start >= (addr + len))
break;
vma->vm_flags |= VM_NOHUGEPAGE;
- subpage_proto_walk.private = vma;
walk_page_range(vma->vm_start, vma->vm_end,
&subpage_proto_walk);
vma = vma->vm_next;
diff -puN fs/proc/task_mmu.c~page-walker-pass-vma fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~page-walker-pass-vma 2014-06-02 14:20:19.526817794 -0700
+++ b/fs/proc/task_mmu.c 2014-06-02 14:20:19.537818287 -0700
@@ -424,7 +424,6 @@ const struct file_operations proc_tid_ma
#ifdef CONFIG_PROC_PAGE_MONITOR
struct mem_size_stats {
- struct vm_area_struct *vma;
unsigned long resident;
unsigned long shared_clean;
unsigned long shared_dirty;
@@ -443,7 +442,7 @@ static void smaps_pte_entry(pte_t ptent,
unsigned long ptent_size, struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
- struct vm_area_struct *vma = mss->vma;
+ struct vm_area_struct *vma = walk->vma;
pgoff_t pgoff = linear_page_index(vma, addr);
struct page *page = NULL;
int mapcount;
@@ -495,7 +494,7 @@ static int smaps_pte_range(pmd_t *pmd, u
struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
- struct vm_area_struct *vma = mss->vma;
+ struct vm_area_struct *vma = walk->vma;
pte_t *pte;
spinlock_t *ptl;
@@ -588,7 +587,6 @@ static int show_smap(struct seq_file *m,
};
memset(&mss, 0, sizeof mss);
- mss.vma = vma;
/* mmap_sem is held in m_start */
if (vma->vm_mm)
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
@@ -712,7 +710,6 @@ enum clear_refs_types {
};
struct clear_refs_private {
- struct vm_area_struct *vma;
enum clear_refs_types type;
};
@@ -748,7 +745,7 @@ static int clear_refs_pte_range(pmd_t *p
unsigned long end, struct mm_walk *walk)
{
struct clear_refs_private *cp = walk->private;
- struct vm_area_struct *vma = cp->vma;
+ struct vm_area_struct *vma = walk->vma;
pte_t *pte, ptent;
spinlock_t *ptl;
struct page *page;
@@ -828,7 +825,6 @@ static ssize_t clear_refs_write(struct f
if (type == CLEAR_REFS_SOFT_DIRTY)
mmu_notifier_invalidate_range_start(mm, 0, -1);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
- cp.vma = vma;
/*
* Writing 1 to /proc/pid/clear_refs affects all pages.
*
@@ -1073,15 +1069,11 @@ static int pagemap_hugetlb_range(pte_t *
struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
- struct vm_area_struct *vma;
int err = 0;
int flags2;
pagemap_entry_t pme;
- vma = find_vma(walk->mm, addr);
- WARN_ON_ONCE(!vma);
-
- if (vma && (vma->vm_flags & VM_SOFTDIRTY))
+ if (walk->vma && (walk->vma->vm_flags & VM_SOFTDIRTY))
flags2 = __PM_SOFT_DIRTY;
else
flags2 = 0;
@@ -1241,7 +1233,6 @@ const struct file_operations proc_pagema
#ifdef CONFIG_NUMA
struct numa_maps {
- struct vm_area_struct *vma;
unsigned long pages;
unsigned long anon;
unsigned long active;
@@ -1317,11 +1308,11 @@ static int gather_pte_stats(pmd_t *pmd,
md = walk->private;
- if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;
- page = can_gather_numa_stats(huge_pte, md->vma, addr);
+ page = can_gather_numa_stats(huge_pte, walk->vma, addr);
if (page)
gather_stats(page, md, pte_dirty(huge_pte),
HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1333,7 +1324,7 @@ static int gather_pte_stats(pmd_t *pmd,
return 0;
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
- struct page *page = can_gather_numa_stats(*pte, md->vma, addr);
+ struct page *page = can_gather_numa_stats(*pte, walk->vma, addr);
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1392,8 +1383,6 @@ static int show_numa_map(struct seq_file
/* Ensure we start with an empty set of numa_maps statistics. */
memset(md, 0, sizeof(*md));
- md->vma = vma;
-
walk.hugetlb_entry = gather_hugetbl_stats;
walk.pmd_entry = gather_pte_stats;
walk.private = md;
diff -puN include/linux/mm.h~page-walker-pass-vma include/linux/mm.h
--- a/include/linux/mm.h~page-walker-pass-vma 2014-06-02 14:20:19.528817884 -0700
+++ b/include/linux/mm.h 2014-06-02 14:20:19.538818332 -0700
@@ -1118,6 +1118,7 @@ struct mm_walk {
unsigned long addr, unsigned long next,
struct mm_walk *walk);
struct mm_struct *mm;
+ struct vm_area_struct *vma;
void *private;
};
diff -puN mm/madvise.c~page-walker-pass-vma mm/madvise.c
--- a/mm/madvise.c~page-walker-pass-vma 2014-06-02 14:20:19.529817929 -0700
+++ b/mm/madvise.c 2014-06-02 14:20:19.539818378 -0700
@@ -139,7 +139,7 @@ static int swapin_walk_pmd_entry(pmd_t *
unsigned long end, struct mm_walk *walk)
{
pte_t *orig_pte;
- struct vm_area_struct *vma = walk->private;
+ struct vm_area_struct *vma = walk->vma;
unsigned long index;
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
@@ -176,7 +176,6 @@ static void force_swapin_readahead(struc
struct mm_walk walk = {
.mm = vma->vm_mm,
.pmd_entry = swapin_walk_pmd_entry,
- .private = vma,
};
walk_page_range(start, end, &walk);
diff -puN mm/memcontrol.c~page-walker-pass-vma mm/memcontrol.c
--- a/mm/memcontrol.c~page-walker-pass-vma 2014-06-02 14:20:19.532818064 -0700
+++ b/mm/memcontrol.c 2014-06-02 14:20:19.541818468 -0700
@@ -6786,7 +6786,7 @@ static int mem_cgroup_count_precharge_pt
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- struct vm_area_struct *vma = walk->private;
+ struct vm_area_struct *vma = walk->vma;
pte_t *pte;
spinlock_t *ptl;
@@ -6962,7 +6962,7 @@ static int mem_cgroup_move_charge_pte_ra
struct mm_walk *walk)
{
int ret = 0;
- struct vm_area_struct *vma = walk->private;
+ struct vm_area_struct *vma = walk->vma;
pte_t *pte;
spinlock_t *ptl;
enum mc_target_type target_type;
diff -puN mm/pagewalk.c~page-walker-pass-vma mm/pagewalk.c
--- a/mm/pagewalk.c~page-walker-pass-vma 2014-06-02 14:20:19.533818109 -0700
+++ b/mm/pagewalk.c 2014-06-02 14:20:19.542818513 -0700
@@ -3,6 +3,38 @@
#include <linux/sched.h>
#include <linux/hugetlb.h>
+
+/*
+ * The VMA which applies to the current place in the
+ * page walk is tracked in walk->vma. If there is
+ * no VMA covering the current area (when in a pte_hole)
+ * walk->vma will be NULL.
+ *
+ * If the area being walked is covered by more than one
+ * VMA, then the first one will be set in walk->vma.
+ * Additional VMAs can be found by walking the VMA sibling
+ * list, or by calling this function or find_vma() directly.
+ *
+ * In a situation where the area being walked is not
+ * entirely covered by a VMA, the _first_ VMA which covers
+ * part of the area will be set in walk->vma.
+ */
+static void walk_update_vma(unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *new_vma = find_vma(walk->mm, addr);
+
+ /*
+ * find_vma() is not exact and returns the next VMA
+ * ending after addr. The vma we found may be outside
+ * the range which we are walking, so clear it if so.
+ */
+ if (new_vma && new_vma->vm_start >= end)
+ new_vma = NULL;
+
+ walk->vma = new_vma;
+}
+
static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -15,6 +47,7 @@ static int walk_pte_range(pmd_t *pmd, un
if (err)
break;
addr += PAGE_SIZE;
+ walk_update_vma(addr, addr + PAGE_SIZE, walk);
if (addr == end)
break;
pte++;
@@ -35,6 +68,7 @@ static int walk_pmd_range(pud_t *pud, un
do {
again:
next = pmd_addr_end(addr, end);
+ walk_update_vma(addr, next, walk);
if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
@@ -79,6 +113,7 @@ static int walk_pud_range(pgd_t *pgd, un
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
+ walk_update_vma(addr, next, walk);
if (pud_none_or_clear_bad(pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
@@ -105,10 +140,10 @@ static unsigned long hugetlb_entry_end(s
return boundary < end ? boundary : end;
}
-static int walk_hugetlb_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
+static int walk_hugetlb_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
+ struct vm_area_struct *vma = walk->vma;
struct hstate *h = hstate_vma(vma);
unsigned long next;
unsigned long hmask = huge_page_mask(h);
@@ -187,14 +222,14 @@ int walk_page_range(unsigned long addr,
struct vm_area_struct *vma = NULL;
next = pgd_addr_end(addr, end);
-
+ walk_update_vma(addr, next, walk);
/*
* This function was not intended to be vma based.
* But there are vma special cases to be handled:
* - hugetlb vma's
* - VM_PFNMAP vma's
*/
- vma = find_vma(walk->mm, addr);
+ vma = walk->vma;
if (vma && (vma->vm_start <= addr)) {
/*
* There are no page structures backing a VM_PFNMAP
@@ -219,7 +254,7 @@ int walk_page_range(unsigned long addr,
* so walk through hugetlb entries within a
* given vma.
*/
- err = walk_hugetlb_range(vma, addr, next, walk);
+ err = walk_hugetlb_range(addr, next, walk);
if (err)
break;
pgd = pgd_offset(walk->mm, next);
_
* [PATCH 04/10] mm: pagewalk: add page walker for mincore()
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
2014-06-02 21:36 ` [PATCH 03/10] mm: pagewalk: have generic code keep track of VMA Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 05/10] mm: mincore: clean up hugetlbfs handling (part 1) Dave Hansen
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
This converts the sys_mincore() code over to use the
walk_page_range() infrastructure. This provides some pretty
nice code savings.
Note that the (now removed) comment:
/*
* Huge pages are always in RAM for now, but
* theoretically it needs to be checked.
*/
is bogus and has been for years. We started demand-faulting them
long ago. Thank goodness theory matters!
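The shape of the conversion can be sketched with a userspace model (all names illustrative, not the mincore implementation itself): the hand-rolled pgd/pud/pmd loops collapse into two callbacks that a generic walker dispatches to, one for present entries and one for holes.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGES 8

/* Userspace model of the mincore() conversion: the generic walker
 * drives the loop and sends each page either to a pte handler or to
 * a hole handler, which fill the output vector the way
 * mincore_pte_range()/mincore_unmapped_range() do. */
struct model_walk {
	const unsigned char *mapped;	/* 1 if a "pte" exists */
	const unsigned char *present;	/* 1 if that pte is resident */
	unsigned char *vec;		/* output, one byte per page */
};

static int model_pte_entry(unsigned long pgidx, struct model_walk *walk)
{
	walk->vec[pgidx] = walk->present[pgidx];
	return 0;
}

static int model_pte_hole(unsigned long pgidx, struct model_walk *walk)
{
	walk->vec[pgidx] = 0;	/* unmapped pages are "not in core" */
	return 0;
}

/* Stand-in for walk_page_range(): dispatch, don't open-code levels. */
static void model_walk_range(struct model_walk *walk)
{
	for (unsigned long i = 0; i < MODEL_PAGES; i++) {
		if (walk->mapped[i])
			model_pte_entry(i, walk);
		else
			model_pte_hole(i, walk);
	}
}
```

The caller only supplies the two callbacks and the result buffer; the four nested page-table loops of the old code become the walker's problem.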
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/mm/mincore.c | 123 +++++++++++++++++----------------------------------------
1 file changed, 37 insertions(+), 86 deletions(-)
diff -puN mm/mincore.c~mincore-page-walker-0 mm/mincore.c
--- a/mm/mincore.c~mincore-page-walker-0 2014-06-02 14:20:19.879833634 -0700
+++ b/mm/mincore.c 2014-06-02 14:20:19.883833814 -0700
@@ -19,38 +19,29 @@
#include <asm/uaccess.h>
#include <asm/pgtable.h>
-static void mincore_hugetlb_page_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
- unsigned char *vec)
+static int mincore_hugetlb_page_range(pte_t *ptep, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
- struct hstate *h;
-
- h = hstate_vma(vma);
+ unsigned char *vec = walk->private;
while (1) {
- unsigned char present;
- pte_t *ptep;
- /*
- * Huge pages are always in RAM for now, but
- * theoretically it needs to be checked.
- */
- ptep = huge_pte_offset(current->mm,
- addr & huge_page_mask(h));
- present = ptep && !huge_pte_none(huge_ptep_get(ptep));
+ int present = !huge_pte_none(huge_ptep_get(ptep));
while (1) {
*vec = present;
vec++;
addr += PAGE_SIZE;
if (addr == end)
- return;
+ return 0;
/* check hugepage border */
- if (!(addr & ~huge_page_mask(h)))
+ if (!(addr & hmask))
break;
}
}
#else
BUG();
#endif
+ return 0;
}
/*
@@ -94,10 +85,11 @@ static unsigned char mincore_page(struct
return present;
}
-static void mincore_unmapped_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
- unsigned char *vec)
+static int mincore_unmapped_range(unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
+ struct vm_area_struct *vma = walk->vma;
+ unsigned char *vec = walk->private;
unsigned long nr = (end - addr) >> PAGE_SHIFT;
int i;
@@ -111,27 +103,35 @@ static void mincore_unmapped_range(struc
for (i = 0; i < nr; i++)
vec[i] = 0;
}
+ return 0;
}
-static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end,
- unsigned char *vec)
+static int mincore_pte_range(pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
+ struct vm_area_struct *vma = walk->vma;
+ unsigned char *vec = walk->private;
unsigned long next;
spinlock_t *ptl;
pte_t *ptep;
+ if (pmd_trans_huge(*pmd)) {
+ int success = mincore_huge_pmd(vma, pmd, addr, end, vec);
+ if (success)
+ return 0;
+ /* the trans huge pmd just split, handle as small */
+ }
+
ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
pte_t pte = *ptep;
pgoff_t pgoff;
next = addr + PAGE_SIZE;
- if (pte_none(pte))
- mincore_unmapped_range(vma, addr, next, vec);
- else if (pte_present(pte))
+ if (pte_present(pte)) {
*vec = 1;
- else if (pte_file(pte)) {
+ } else if (pte_file(pte)) {
pgoff = pte_to_pgoff(pte);
*vec = mincore_page(vma->vm_file->f_mapping, pgoff);
} else { /* pte is a swap entry */
@@ -154,67 +154,21 @@ static void mincore_pte_range(struct vm_
vec++;
} while (ptep++, addr = next, addr != end);
pte_unmap_unlock(ptep - 1, ptl);
-}
-
-static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr, unsigned long end,
- unsigned char *vec)
-{
- unsigned long next;
- pmd_t *pmd;
-
- pmd = pmd_offset(pud, addr);
- do {
- next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd)) {
- if (mincore_huge_pmd(vma, pmd, addr, next, vec)) {
- vec += (next - addr) >> PAGE_SHIFT;
- continue;
- }
- /* fall through */
- }
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- mincore_unmapped_range(vma, addr, next, vec);
- else
- mincore_pte_range(vma, pmd, addr, next, vec);
- vec += (next - addr) >> PAGE_SHIFT;
- } while (pmd++, addr = next, addr != end);
-}
-
-static void mincore_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
- unsigned long addr, unsigned long end,
- unsigned char *vec)
-{
- unsigned long next;
- pud_t *pud;
-
- pud = pud_offset(pgd, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- mincore_unmapped_range(vma, addr, next, vec);
- else
- mincore_pmd_range(vma, pud, addr, next, vec);
- vec += (next - addr) >> PAGE_SHIFT;
- } while (pud++, addr = next, addr != end);
+ return 0;
}
static void mincore_page_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
- unsigned long next;
- pgd_t *pgd;
-
- pgd = pgd_offset(vma->vm_mm, addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- mincore_unmapped_range(vma, addr, next, vec);
- else
- mincore_pud_range(vma, pgd, addr, next, vec);
- vec += (next - addr) >> PAGE_SHIFT;
- } while (pgd++, addr = next, addr != end);
+ struct mm_walk mincore_walk = {
+ .pte_hole = mincore_unmapped_range,
+ .pmd_entry = mincore_pte_range,
+ .hugetlb_entry = mincore_hugetlb_page_range,
+ .private = vec,
+ .mm = vma->vm_mm,
+ };
+ walk_page_range(vma->vm_start, vma->vm_end, &mincore_walk);
}
/*
@@ -233,10 +187,7 @@ static long do_mincore(unsigned long add
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
- if (is_vm_hugetlb_page(vma))
- mincore_hugetlb_page_range(vma, addr, end, vec);
- else
- mincore_page_range(vma, addr, end, vec);
+ mincore_page_range(vma, addr, end, vec);
return (end - addr) >> PAGE_SHIFT;
}
_
* [PATCH 05/10] mm: mincore: clean up hugetlbfs handling (part 1)
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
2014-06-02 21:36 ` [PATCH 04/10] mm: pagewalk: add page walker for mincore() Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 06/10] mm: mincore: clean up hugetlbfs handler (part 2) Dave Hansen
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The ->hugetlb_entry functions are only called _via_ the page
walker.  I don't see this changing any time soon.  The
page walker only calls walk->hugetlb_entry() under an
#ifdef CONFIG_HUGETLB_PAGE.
With this in place, I think putting BUG()s in the
->hugetlb_entry handlers is a bit like wearing a belt and
suspenders.
This axes the BUG() from the mincore ->hugetlb_entry along
with the #ifdef. The compiler is more than smart enough
to do the right thing when it sees:
if (1)
return;
// unreachable
The only downside here is that we now need some header stubs
for huge_pte_none() / huge_ptep_get().
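The compiler argument can be demonstrated with a userspace analogue (names are made up; the kernel equivalents are the is_vm_hugetlb_page() check and the WARN_ONCE() stubs): a constant-false predicate in front of the stub calls makes them unreachable, so the stubs only need to exist to keep the build happy.

```c
#include <assert.h>

/* Userspace model of the stub pattern (illustrative names).  When
 * the feature is compiled out, is_model_huge_page() is a constant 0,
 * so the stub call below is unreachable and the compiler can discard
 * it, which is "as good as an explicit ifdef". */
#ifndef MODEL_HUGETLB_PAGE
static inline int is_model_huge_page(void)
{
	return 0;	/* feature compiled out */
}
static inline int model_huge_pte_none(int *ran_stub)
{
	/* would WARN_ONCE() in the kernel; must never actually run */
	*ran_stub = 1;
	return 1;
}
#endif

static int model_hugetlb_handler(int *ran_stub)
{
	/* constant-false when the feature is disabled */
	if (!is_model_huge_page())
		return 0;
	return model_huge_pte_none(ran_stub);
}
```

The handler compiles and links against the stub, but the stub never executes when the feature is off.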
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/include/linux/hugetlb.h | 10 ++++++++++
b/mm/mincore.c | 9 +++++----
2 files changed, 15 insertions(+), 4 deletions(-)
diff -puN include/linux/hugetlb.h~cleanup-hugetlbfs-mincore-1 include/linux/hugetlb.h
--- a/include/linux/hugetlb.h~cleanup-hugetlbfs-mincore-1 2014-06-02 14:20:20.144845525 -0700
+++ b/include/linux/hugetlb.h 2014-06-02 14:20:20.149845750 -0700
@@ -458,6 +458,16 @@ static inline spinlock_t *huge_pte_lockp
{
return &mm->page_table_lock;
}
+static inline int huge_pte_none(pte_t pte)
+{
+ WARN_ONCE(1, "%s() called when hugetlbfs disabled", __func__);
+ return 1;
+}
+static inline pte_t huge_ptep_get(pte_t *pte)
+{
+ WARN_ONCE(1, "%s() called when hugetlbfs disabled", __func__);
+ return __pte(0);
+}
#endif /* CONFIG_HUGETLB_PAGE */
static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff -puN mm/mincore.c~cleanup-hugetlbfs-mincore-1 mm/mincore.c
--- a/mm/mincore.c~cleanup-hugetlbfs-mincore-1 2014-06-02 14:20:20.146845615 -0700
+++ b/mm/mincore.c 2014-06-02 14:20:20.149845750 -0700
@@ -23,8 +23,12 @@ static int mincore_hugetlb_page_range(pt
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
-#ifdef CONFIG_HUGETLB_PAGE
unsigned char *vec = walk->private;
+
+ /* This is as good as an explicit ifdef */
+ if (!is_vm_hugetlb_page(walk->vma))
+ return 0;
+
while (1) {
int present = !huge_pte_none(huge_ptep_get(ptep));
while (1) {
@@ -38,9 +42,6 @@ static int mincore_hugetlb_page_range(pt
break;
}
}
-#else
- BUG();
-#endif
return 0;
}
_
* [PATCH 06/10] mm: mincore: clean up hugetlbfs handler (part 2)
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
2014-06-02 21:36 ` [PATCH 05/10] mm: mincore: clean up hugetlbfs handling (part 1) Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 07/10] mm: pagewalk: kill check for hugetlbfs inside /proc pagemap code Dave Hansen
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The walk_page_range() code calls into the ->hugetlb_entry
handler once for each huge page table entry.  This means that
addr and end are always within the same huge page.  (Well, end is
not technically _within_ it, because it is exclusive.)
The outer while() loop in mincore_hugetlb_page_range() appears to
be designed to work if we crossed a huge page boundary to a new
huge pte and 'present' changed. However, that is impossible for
two reasons:
1. The above-mentioned walk_page_range() restriction
2. We never move ptep
So the outer while() along with the check for crossing the end of
the huge page boundary (which is impossible) make no sense. Once
we peel it off, it's clear that we can just fold the 'return'
into the loop condition.
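A userspace model of the simplified handler (illustrative names) shows why a single lookup suffices: with addr and end confined to one huge page, one present/absent value applies to every small page in the range.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 0x1000UL

/* Userspace model of the simplified mincore hugetlb handler: a
 * single present/absent lookup covers the whole [addr, end) range,
 * because the walker never hands us a range crossing a huge-page
 * boundary.  Returns the number of vector bytes filled. */
static unsigned long model_mincore_hugetlb(int present, unsigned long addr,
					   unsigned long end,
					   unsigned char *vec)
{
	unsigned long filled = 0;

	while (addr < end) {	/* the old outer loop folded away */
		*vec++ = (unsigned char)present;
		addr += MODEL_PAGE_SIZE;
		filled++;
	}
	return filled;
}
```

One lookup, one flat loop, no huge-page-border check needed.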
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/mm/mincore.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff -puN mm/mincore.c~cleanup-hugetlbfs-mincore-2 mm/mincore.c
--- a/mm/mincore.c~cleanup-hugetlbfs-mincore-2 2014-06-02 14:20:20.426858178 -0700
+++ b/mm/mincore.c 2014-06-02 14:20:20.430858359 -0700
@@ -24,23 +24,17 @@ static int mincore_hugetlb_page_range(pt
struct mm_walk *walk)
{
unsigned char *vec = walk->private;
+ int present;
/* This is as good as an explicit ifdef */
if (!is_vm_hugetlb_page(walk->vma))
return 0;
- while (1) {
- int present = !huge_pte_none(huge_ptep_get(ptep));
- while (1) {
- *vec = present;
- vec++;
- addr += PAGE_SIZE;
- if (addr == end)
- return 0;
- /* check hugepage border */
- if (!(addr & hmask))
- break;
- }
+ present = !huge_pte_none(huge_ptep_get(ptep));
+ while (addr < end) {
+ *vec = present;
+ vec++;
+ addr += PAGE_SIZE;
}
return 0;
}
_
* [PATCH 07/10] mm: pagewalk: kill check for hugetlbfs inside /proc pagemap code
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
2014-06-02 21:36 ` [PATCH 06/10] mm: mincore: clean up hugetlbfs handler (part 2) Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 08/10] mm: pagewalk: add locked pte walker Dave Hansen
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The pagemap code does not call the normal pte handlers for
hugetlbfs areas; they are handled exclusively by
->hugetlb_entry, so remove the redundant check for it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/fs/proc/task_mmu.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff -puN fs/proc/task_mmu.c~do-not-check-for-hugetlbfs-inside-pagemap-walker fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~do-not-check-for-hugetlbfs-inside-pagemap-walker 2014-06-02 14:20:20.693870160 -0700
+++ b/fs/proc/task_mmu.c 2014-06-02 14:20:20.697870340 -0700
@@ -1033,8 +1033,7 @@ static int pagemap_pte_range(pmd_t *pmd,
/* check that 'vma' actually covers this address,
* and that it isn't a huge page vma */
- if (vma && (vma->vm_start <= addr) &&
- !is_vm_hugetlb_page(vma)) {
+ if (vma && (vma->vm_start <= addr)) {
pte = pte_offset_map(pmd, addr);
pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
/* unmap before userspace copy */
_
^ permalink raw reply [flat|nested] 15+ messages in thread

* [PATCH 08/10] mm: pagewalk: add locked pte walker
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
` (6 preceding siblings ...)
2014-06-02 21:36 ` [PATCH 07/10] mm: pagewalk: kill check for hugetlbfs inside /proc pagemap code Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 09/10] mm: pagewalk: use new locked walker for /proc/pid/smaps Dave Hansen
` (2 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Neither the locking nor the splitting logic needed for
transparent huge pages is trivial. We end up having to teach
each of the page walkers about it individually, and have the same
pattern copied across several of them.
This patch introduces a new handler: ->locked_single_entry. It
does two things: it handles the page table locking, including the
difference between pmds and ptes, and it lets you have a single
handler for large and small pages.
This greatly simplifies the handlers. For now, I have only
converted two of the walk_page_range() users; I believe at least
a few more of them can be converted going forward.
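A userspace sketch of the idea (illustrative names only, not the real
mm_walk API, and a plain flag standing in for the page table lock):
the core walker takes and drops the lock and decides huge-vs-small,
and the handler sees one (addr, size) pair per entry.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE  4096UL
#define TOY_HPAGE_SIZE (512 * TOY_PAGE_SIZE)

/* Toy model of ->locked_single_entry; names are illustrative. */
struct toy_walk {
	int (*locked_single_entry)(unsigned long addr, unsigned long size,
				   struct toy_walk *walk);
	bool locked;		/* stands in for the page table lock */
	unsigned long bytes_seen;
};

static int toy_walk_pmd(struct toy_walk *walk, unsigned long addr,
			bool is_trans_huge)
{
	unsigned long end = addr + TOY_HPAGE_SIZE;
	int ret = 0;

	walk->locked = true;	/* "take" the page table lock */
	if (is_trans_huge) {
		/* one callback covering the whole huge page */
		ret = walk->locked_single_entry(addr, TOY_HPAGE_SIZE, walk);
	} else {
		/* one callback per small page */
		for (; addr < end && !ret; addr += TOY_PAGE_SIZE)
			ret = walk->locked_single_entry(addr, TOY_PAGE_SIZE,
							walk);
	}
	walk->locked = false;	/* "drop" the lock */
	return ret;
}

static int toy_handler(unsigned long addr, unsigned long size,
		       struct toy_walk *walk)
{
	(void)addr;
	assert(walk->locked);		/* handler always runs under the lock */
	walk->bytes_seen += size;	/* size distinguishes huge from small */
	return 0;
}
```

The payoff is that a handler like this one no longer contains any
locking or pmd_trans_huge() logic of its own.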
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/include/linux/mm.h | 7 +++++++
b/mm/pagewalk.c | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)
diff -puN include/linux/mm.h~mm-pagewalk-add-locked-walker include/linux/mm.h
--- a/include/linux/mm.h~mm-pagewalk-add-locked-walker 2014-06-02 14:20:20.963882275 -0700
+++ b/include/linux/mm.h 2014-06-02 14:20:20.969882545 -0700
@@ -1096,6 +1096,11 @@ void unmap_vmas(struct mmu_gather *tlb,
* pmd_trans_huge() pmds. They may simply choose to
* split_huge_page() instead of handling it explicitly.
* @pte_entry: if set, called for each non-empty PTE (4th-level) entry
+ * @locked_single_entry: if set, called for each pmd or pte entry. The
+ * page table lock for the entry is also acquired
+ * such that the handler does not have to worry
+ * about the entry disappearing (or being split in
+ * the case of a pmd_trans_huge).
* @pte_hole: if set, called for each hole at all levels
* @hugetlb_entry: if set, called for each hugetlb entry
* *Caution*: The caller must hold mmap_sem() if @hugetlb_entry
@@ -1112,6 +1117,8 @@ struct mm_walk {
unsigned long next, struct mm_walk *walk);
int (*pte_entry)(pte_t *pte, unsigned long addr,
unsigned long next, struct mm_walk *walk);
+ int (*locked_single_entry)(pte_t *pte, unsigned long addr,
+ unsigned long pte_size, struct mm_walk *walk);
int (*pte_hole)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
diff -puN mm/pagewalk.c~mm-pagewalk-add-locked-walker mm/pagewalk.c
--- a/mm/pagewalk.c~mm-pagewalk-add-locked-walker 2014-06-02 14:20:20.965882364 -0700
+++ b/mm/pagewalk.c 2014-06-02 14:20:20.969882545 -0700
@@ -57,6 +57,40 @@ static int walk_pte_range(pmd_t *pmd, un
return err;
}
+static int walk_single_entry_locked(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ int ret = 0;
+ struct vm_area_struct *vma = walk->vma;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ret = walk->locked_single_entry((pte_t *)pmd, addr,
+ HPAGE_PMD_SIZE, walk);
+ spin_unlock(ptl);
+ return ret;
+ }
+
+ /*
+ * See pmd_none_or_trans_huge_or_clear_bad() for a
+ * description of the races we are avoiding with this.
+ * Note that this essentially acts as if the pmd were
+ * NULL (empty).
+ */
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ret = walk->locked_single_entry(pte, addr, PAGE_SIZE, walk);
+ if (ret)
+ break;
+ }
+ pte_unmap_unlock(pte - 1, ptl);
+ return ret;
+}
+
static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -77,6 +111,17 @@ again:
continue;
}
/*
+ * A ->locked_single_entry must be able to handle
+ * arbitrary (well, pmd- or pte-sized) entries, and
+ * it replaces the ->pmd_entry/->pte_entry path for
+ * this pmd entirely.
+ */
+ if (walk->locked_single_entry) {
+ err = walk_single_entry_locked(pmd, addr, next, walk);
+ if (err)
+ break;
+ continue;
+ }
+
+ /*
* This implies that each ->pmd_entry() handler
* needs to know about pmd_trans_huge() pmds
*/
_
^ permalink raw reply [flat|nested] 15+ messages in thread

* [PATCH 09/10] mm: pagewalk: use new locked walker for /proc/pid/smaps
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
` (7 preceding siblings ...)
2014-06-02 21:36 ` [PATCH 08/10] mm: pagewalk: add locked pte walker Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
2014-06-02 21:36 ` [PATCH 10/10] mm: pagewalk: use locked walker for /proc/pid/numa_maps Dave Hansen
[not found] ` <1401745925-l651h3s9@n-horiguchi@ah.jp.nec.com>
10 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The diffstat tells the story here. Using the new walker function
greatly simplifies the code. One side-effect here is that we'll
call cond_resched() more often than we did before. It used to be
called once per pte page, but now it's called on every pte.
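To put a number on that side-effect: with 4kB pages and 512 ptes per
pte page (x86-64), a fully mapped 2MB range used to trigger one
cond_resched() and now triggers 512. A toy model of the two loops
(hypothetical names, counting calls instead of rescheduling):

```c
#include <assert.h>

#define TOY_PAGE_SIZE    4096UL
#define TOY_PTRS_PER_PTE 512UL	/* pte entries per pte page on x86-64 */

static unsigned long resched_calls;
static void toy_cond_resched(void) { resched_calls++; }

/* Old style: one cond_resched() after walking a whole pte page. */
static void old_walk(unsigned long addr, unsigned long end)
{
	for (; addr < end; addr += TOY_PAGE_SIZE)
		;	/* per-pte work */
	toy_cond_resched();
}

/* New style: the per-entry handler calls cond_resched() every time. */
static void new_walk(unsigned long addr, unsigned long end)
{
	for (; addr < end; addr += TOY_PAGE_SIZE)
		toy_cond_resched();
}
```

That is 512x more calls, though cond_resched() is cheap when no
reschedule is pending.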
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/fs/proc/task_mmu.c | 27 +++++----------------------
1 file changed, 5 insertions(+), 22 deletions(-)
diff -puN fs/proc/task_mmu.c~mm-pagewalk-use-locked-walker-smaps fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~mm-pagewalk-use-locked-walker-smaps 2014-06-02 14:20:21.247895019 -0700
+++ b/fs/proc/task_mmu.c 2014-06-02 14:20:21.251895198 -0700
@@ -490,32 +490,15 @@ static void smaps_pte_entry(pte_t ptent,
}
}
-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+static int smaps_single_locked(pte_t *pte, unsigned long addr, unsigned long size,
struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
- struct vm_area_struct *vma = walk->vma;
- pte_t *pte;
- spinlock_t *ptl;
-
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
- spin_unlock(ptl);
+
+ if (size == HPAGE_PMD_SIZE)
mss->anonymous_thp += HPAGE_PMD_SIZE;
- return 0;
- }
- if (pmd_trans_unstable(pmd))
- return 0;
- /*
- * The mmap_sem held all the way back in m_start() is what
- * keeps khugepaged out of here and from collapsing things
- * in here.
- */
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; pte++, addr += PAGE_SIZE)
- smaps_pte_entry(*pte, addr, PAGE_SIZE, walk);
- pte_unmap_unlock(pte - 1, ptl);
+ smaps_pte_entry(*pte, addr, size, walk);
cond_resched();
return 0;
}
@@ -581,7 +564,7 @@ static int show_smap(struct seq_file *m,
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
- .pmd_entry = smaps_pte_range,
+ .locked_single_entry = smaps_single_locked,
.mm = vma->vm_mm,
.private = &mss,
};
_
^ permalink raw reply [flat|nested] 15+ messages in thread

* [PATCH 10/10] mm: pagewalk: use locked walker for /proc/pid/numa_maps
2014-06-02 21:36 [PATCH 00/10] mm: pagewalk: huge page cleanups and VMA passing Dave Hansen
` (8 preceding siblings ...)
2014-06-02 21:36 ` [PATCH 09/10] mm: pagewalk: use new locked walker for /proc/pid/smaps Dave Hansen
@ 2014-06-02 21:36 ` Dave Hansen
[not found] ` <1401745925-l651h3s9@n-horiguchi@ah.jp.nec.com>
10 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2014-06-02 21:36 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, kirill.shutemov, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Same deal as the last one. Lots of code savings using the new
walker function.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/fs/proc/task_mmu.c | 39 ++++++++-------------------------------
1 file changed, 8 insertions(+), 31 deletions(-)
diff -puN fs/proc/task_mmu.c~mm-pagewalk-use-locked-walker-numa_maps fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~mm-pagewalk-use-locked-walker-numa_maps 2014-06-02 14:20:21.518907178 -0700
+++ b/fs/proc/task_mmu.c 2014-06-02 14:20:21.522907359 -0700
@@ -1280,41 +1280,18 @@ static struct page *can_gather_numa_stat
return page;
}
-static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
+static int gather_stats_locked(pte_t *pte, unsigned long addr,
+ unsigned long size, struct mm_walk *walk)
{
- struct numa_maps *md;
- spinlock_t *ptl;
- pte_t *orig_pte;
- pte_t *pte;
-
- md = walk->private;
+ struct numa_maps *md = walk->private;
+ struct page *page = can_gather_numa_stats(*pte, walk->vma, addr);
- if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) {
- pte_t huge_pte = *(pte_t *)pmd;
- struct page *page;
-
- page = can_gather_numa_stats(huge_pte, walk->vma, addr);
- if (page)
- gather_stats(page, md, pte_dirty(huge_pte),
- HPAGE_PMD_SIZE/PAGE_SIZE);
- spin_unlock(ptl);
- return 0;
- }
+ if (page)
+ gather_stats(page, md, pte_dirty(*pte), size/PAGE_SIZE);
- if (pmd_trans_unstable(pmd))
- return 0;
- orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- do {
- struct page *page = can_gather_numa_stats(*pte, walk->vma, addr);
- if (!page)
- continue;
- gather_stats(page, md, pte_dirty(*pte), 1);
-
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(orig_pte, ptl);
return 0;
}
+
#ifdef CONFIG_HUGETLB_PAGE
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
@@ -1366,7 +1343,7 @@ static int show_numa_map(struct seq_file
memset(md, 0, sizeof(*md));
walk.hugetlb_entry = gather_hugetbl_stats;
- walk.pmd_entry = gather_pte_stats;
+ walk.locked_single_entry = gather_stats_locked;
walk.private = md;
walk.mm = mm;
_
^ permalink raw reply [flat|nested] 15+ messages in thread

[parent not found: <1401745925-l651h3s9@n-horiguchi@ah.jp.nec.com>]