linux-mm.kvack.org archive mirror
* [PATCH 0/8] avoid allocation in show_numa_map()
@ 2011-04-27 23:35 Stephen Wilson
  2011-04-27 23:35 ` [PATCH 1/8] mm: export get_vma_policy() Stephen Wilson
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

Recently a concern was raised[1] that performing an allocation while holding a
reference on a task's mm could lead to a stalemate in the OOM killer.  The
concern was specific to the goings-on in /proc.  Hugh Dickins described the
issue as follows:

    ...imagine what happens if the system is out of memory, and the mm
    we're looking at is selected for killing by the OOM killer: while we
    wait in __get_free_page for more memory, no memory is freed from the
    selected mm because it cannot reach exit_mmap while we hold that
    reference.

The primary goal of this series is to eliminate repeated allocation/free cycles
currently happening in show_numa_map() while we hold a reference to an mm.

The strategy is to perform the allocation once when /proc/pid/numa_maps is
opened, before a reference on the target task's mm is taken.
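
In rough outline (patch 8 below has the real version), the fix is to embed
the statistics buffer in the seq_file private data that is allocated at
open() time:

	/* sketch only -- see patch 8 for the actual code */
	struct numa_maps_private {
		struct proc_maps_private proc_maps;
		struct numa_maps md;	/* stats buffer, allocated once at open() */
	};

show_numa_map() then memset()s the embedded buffer for each vma instead of
doing a kzalloc()/kfree() round trip per call.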

Unfortunately, show_numa_map() is implemented in mm/mempolicy.c while the
primary procfs implementation lives in fs/proc/task_mmu.c.  This makes
clean cooperation between show_numa_map() and the other seq_file operations
(start(), stop(), etc.) difficult.


Patches 1-5 convert show_numa_map() to use the generic walk_page_range()
functionality instead of the mempolicy.c-specific page table walking logic.
Also, get_vma_policy() is exported.  This makes the show_numa_map()
implementation independent of mempolicy.c.
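
The heart of the conversion, lifted (modulo style) from patch 2 below:

	struct mm_walk walk = {
		.hugetlb_entry	= gather_hugetbl_stats,
		.pte_entry	= gather_pte_stats,
		.private	= md,
		.mm		= mm,
	};

	walk_page_range(vma->vm_start, vma->vm_end, &walk);

Each callback picks up the per-vma struct numa_maps via walk->private and
feeds every interesting pte into gather_stats().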

Patch 6 moves show_numa_map() and supporting routines over to
fs/proc/task_mmu.c.

Finally, patches 7 and 8 provide minor cleanup and eliminate the troublesome
allocation.

 
Please note that moving show_numa_map() into fs/proc/task_mmu.c essentially
reverts 1a75a6c825 and 48fce3429d.  Also, please see the discussion at [2].  My
main justifications for moving the code back into task_mmu.c are:

  - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.
    
  - The need to export ad hoc info like struct proc_maps_private is
    eliminated.


These patches are based on v2.6.39-rc5.


Please note that this series is VERY LIGHTLY TESTED.  I have been using
CONFIG_NUMA_EMU=y thus far as I will not have access to a real NUMA system for
another week or two.


--
steve


[1] lkml.org/lkml/2011/4/25/496
[2] marc.info/?t=113149255100001&r=1&w=2


Stephen Wilson (8):
      mm: export get_vma_policy()
      mm: use walk_page_range() instead of custom page table walking code
      mm: remove MPOL_MF_STATS
      mm: make gather_stats() type-safe and remove forward declaration
      mm: remove check_huge_range()
      mm: proc: move show_numa_map() to fs/proc/task_mmu.c
      proc: make struct proc_maps_private truly private
      proc: allocate storage for numa_maps statistics once


 fs/proc/internal.h        |    8 ++
 fs/proc/task_mmu.c        |  190 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/mempolicy.h |    3 +
 include/linux/proc_fs.h   |    8 --
 mm/mempolicy.c            |  164 +--------------------------------------
 5 files changed, 200 insertions(+), 173 deletions(-)



* [PATCH 1/8] mm: export get_vma_policy()
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:39   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code Stephen Wilson
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

In commit 48fce3429df84a94766fbbc845fa8450d0715b48, get_vma_policy() was
marked static as all clients were local to mempolicy.c.

However, the decision to generate /proc/pid/numa_maps in the NUMA memory
policy code rather than in the procfs subsystem introduces an artificial
interdependency between the two systems.  Exporting get_vma_policy() once
again is the first step in cleaning up this interdependency.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 include/linux/mempolicy.h |    3 +++
 mm/mempolicy.c            |    2 +-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 31ac26c..c2f6032 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -199,6 +199,9 @@ void mpol_free_shared_policy(struct shared_policy *p);
 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 					    unsigned long idx);
 
+struct mempolicy *get_vma_policy(struct task_struct *tsk,
+		struct vm_area_struct *vma, unsigned long addr);
+
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
 extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 959a8b8..5bfb03e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1489,7 +1489,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct task_struct *task,
+struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = task->mempolicy;
-- 
1.7.3.5


* [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
  2011-04-27 23:35 ` [PATCH 1/8] mm: export get_vma_policy() Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:38   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 3/8] mm: remove MPOL_MF_STATS Stephen Wilson
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

In the specific case of show_numa_map(), the custom page table walking
logic implemented in mempolicy.c does not provide any special service
beyond that provided by walk_page_range().

Also, converting show_numa_map() to use the generic routine decouples
the function from mempolicy.c, allowing it to be moved out of the mm
subsystem and into fs/proc.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 mm/mempolicy.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5bfb03e..dfe27e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
 	md->node[page_to_nid(page)]++;
 }
 
+static int gather_pte_stats(pte_t *pte, unsigned long addr,
+		unsigned long pte_size, struct mm_walk *walk)
+{
+	struct page *page;
+
+	if (pte_none(*pte))
+		return 0;
+
+	page = pte_page(*pte);
+	if (!page)
+		return 0;
+
+	gather_stats(page, walk->private, pte_dirty(*pte));
+	return 0;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 static void check_huge_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end,
@@ -2597,12 +2613,35 @@ static void check_huge_range(struct vm_area_struct *vma,
 		gather_stats(page, md, pte_dirty(*ptep));
 	}
 }
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+		unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+	struct page *page;
+
+	if (pte_none(*pte))
+		return 0;
+
+	page = pte_page(*pte);
+	if (!page)
+		return 0;
+
+	gather_stats(page, walk->private, pte_dirty(*pte));
+	return 0;
+}
+
 #else
 static inline void check_huge_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end,
 		struct numa_maps *md)
 {
 }
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+		unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+	return 0;
+}
 #endif
 
 /*
@@ -2615,6 +2654,7 @@ int show_numa_map(struct seq_file *m, void *v)
 	struct numa_maps *md;
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
+	struct mm_walk walk = {};
 	struct mempolicy *pol;
 	int n;
 	char buffer[50];
@@ -2626,6 +2666,11 @@ int show_numa_map(struct seq_file *m, void *v)
 	if (!md)
 		return 0;
 
+	walk.hugetlb_entry = gather_hugetbl_stats;
+	walk.pte_entry = gather_pte_stats;
+	walk.private = md;
+	walk.mm = mm;
+
 	pol = get_vma_policy(priv->task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 	mpol_cond_put(pol);
@@ -2642,13 +2687,7 @@ int show_numa_map(struct seq_file *m, void *v)
 		seq_printf(m, " stack");
 	}
 
-	if (is_vm_hugetlb_page(vma)) {
-		check_huge_range(vma, vma->vm_start, vma->vm_end, md);
-		seq_printf(m, " huge");
-	} else {
-		check_pgd_range(vma, vma->vm_start, vma->vm_end,
-			&node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
-	}
+	walk_page_range(vma->vm_start, vma->vm_end, &walk);
 
 	if (!md->pages)
 		goto out;
-- 
1.7.3.5


* [PATCH 3/8] mm: remove MPOL_MF_STATS
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
  2011-04-27 23:35 ` [PATCH 1/8] mm: export get_vma_policy() Stephen Wilson
  2011-04-27 23:35 ` [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:44   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration Stephen Wilson
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

Mapping statistics in a NUMA environment are now computed using the
generic walk_page_range() logic.  Remove the old/equivalent
functionality.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 mm/mempolicy.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index dfe27e3..63c0d69 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -99,7 +99,6 @@
 /* Internal flags */
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
-#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
 
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sn_cache;
@@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
 			continue;
 
-		if (flags & MPOL_MF_STATS)
-			gather_stats(page, private, pte_dirty(*pte));
-		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
 			migrate_page_add(page, private, flags);
 		else
 			break;
@@ -2572,6 +2569,7 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
 		unsigned long pte_size, struct mm_walk *walk)
 {
 	struct page *page;
+	int nid;
 
 	if (pte_none(*pte))
 		return 0;
@@ -2580,6 +2578,10 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
 	if (!page)
 		return 0;
 
+	nid = page_to_nid(page);
+	if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+		return 0;
+
 	gather_stats(page, walk->private, pte_dirty(*pte));
 	return 0;
 }
-- 
1.7.3.5


* [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
                   ` (2 preceding siblings ...)
  2011-04-27 23:35 ` [PATCH 3/8] mm: remove MPOL_MF_STATS Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:45   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 5/8] mm: remove check_huge_range() Stephen Wilson
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

Improve the prototype of gather_stats() to take a struct numa_maps as
argument instead of a generic void *.  Update all callers to make the
required type explicit.

Since gather_stats() is not needed before its definition and is
scheduled to be moved out of mempolicy.c, the declaration is removed as
well.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 mm/mempolicy.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 63c0d69..d4c0b8d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -456,7 +456,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	},
 };
 
-static void gather_stats(struct page *, void *, int pte_dirty);
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
@@ -2538,9 +2537,8 @@ struct numa_maps {
 	unsigned long node[MAX_NUMNODES];
 };
 
-static void gather_stats(struct page *page, void *private, int pte_dirty)
+static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
 {
-	struct numa_maps *md = private;
 	int count = page_mapcount(page);
 
 	md->pages++;
@@ -2568,6 +2566,7 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
 static int gather_pte_stats(pte_t *pte, unsigned long addr,
 		unsigned long pte_size, struct mm_walk *walk)
 {
+	struct numa_maps *md;
 	struct page *page;
 	int nid;
 
@@ -2582,7 +2581,8 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
 	if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
 		return 0;
 
-	gather_stats(page, walk->private, pte_dirty(*pte));
+	md = walk->private;
+	gather_stats(page, md, pte_dirty(*pte));
 	return 0;
 }
 
@@ -2619,6 +2619,7 @@ static void check_huge_range(struct vm_area_struct *vma,
 static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
+	struct numa_maps *md;
 	struct page *page;
 
 	if (pte_none(*pte))
@@ -2628,7 +2629,8 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
 	if (!page)
 		return 0;
 
-	gather_stats(page, walk->private, pte_dirty(*pte));
+	md = walk->private;
+	gather_stats(page, md, pte_dirty(*pte));
 	return 0;
 }
 
-- 
1.7.3.5


* [PATCH 5/8] mm: remove check_huge_range()
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
                   ` (3 preceding siblings ...)
  2011-04-27 23:35 ` [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:46   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c Stephen Wilson
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

This function has been superseded by gather_hugetbl_stats() and is no
longer needed.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 mm/mempolicy.c |   35 -----------------------------------
 1 files changed, 0 insertions(+), 35 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d4c0b8d..c5a4342 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2587,35 +2587,6 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void check_huge_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end,
-		struct numa_maps *md)
-{
-	unsigned long addr;
-	struct page *page;
-	struct hstate *h = hstate_vma(vma);
-	unsigned long sz = huge_page_size(h);
-
-	for (addr = start; addr < end; addr += sz) {
-		pte_t *ptep = huge_pte_offset(vma->vm_mm,
-						addr & huge_page_mask(h));
-		pte_t pte;
-
-		if (!ptep)
-			continue;
-
-		pte = *ptep;
-		if (pte_none(pte))
-			continue;
-
-		page = pte_page(pte);
-		if (!page)
-			continue;
-
-		gather_stats(page, md, pte_dirty(*ptep));
-	}
-}
-
 static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
@@ -2635,12 +2606,6 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
 }
 
 #else
-static inline void check_huge_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end,
-		struct numa_maps *md)
-{
-}
-
 static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
-- 
1.7.3.5


* [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
                   ` (4 preceding siblings ...)
  2011-04-27 23:35 ` [PATCH 5/8] mm: remove check_huge_range() Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:49   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 7/8] proc: make struct proc_maps_private truly private Stephen Wilson
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.

  - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

  - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

  - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 fs/proc/task_mmu.c |  170 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/mempolicy.c     |  168 ---------------------------------------------------
 2 files changed, 168 insertions(+), 170 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2e7addf..9f069d2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -856,8 +856,174 @@ const struct file_operations proc_pagemap_operations = {
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);
 
+struct numa_maps {
+	unsigned long pages;
+	unsigned long anon;
+	unsigned long active;
+	unsigned long writeback;
+	unsigned long mapcount_max;
+	unsigned long dirty;
+	unsigned long swapcache;
+	unsigned long node[MAX_NUMNODES];
+};
+
+static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
+{
+	int count = page_mapcount(page);
+
+	md->pages++;
+	if (pte_dirty || PageDirty(page))
+		md->dirty++;
+
+	if (PageSwapCache(page))
+		md->swapcache++;
+
+	if (PageActive(page) || PageUnevictable(page))
+		md->active++;
+
+	if (PageWriteback(page))
+		md->writeback++;
+
+	if (PageAnon(page))
+		md->anon++;
+
+	if (count > md->mapcount_max)
+		md->mapcount_max = count;
+
+	md->node[page_to_nid(page)]++;
+}
+
+static int gather_pte_stats(pte_t *pte, unsigned long addr,
+		unsigned long pte_size, struct mm_walk *walk)
+{
+	struct numa_maps *md;
+	struct page *page;
+	int nid;
+
+	if (pte_none(*pte))
+		return 0;
+
+	page = pte_page(*pte);
+	if (!page)
+		return 0;
+
+	nid = page_to_nid(page);
+	if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+		return 0;
+
+	md = walk->private;
+	gather_stats(page, md, pte_dirty(*pte));
+	return 0;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+		unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+	struct numa_maps *md;
+	struct page *page;
+
+	if (pte_none(*pte))
+		return 0;
+
+	page = pte_page(*pte);
+	if (!page)
+		return 0;
+
+	md = walk->private;
+	gather_stats(page, md, pte_dirty(*pte));
+	return 0;
+}
+
+#else
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+		unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+	return 0;
+}
+#endif
+
+/*
+ * Display pages allocated per node and memory policy via /proc.
+ */
+static int show_numa_map(struct seq_file *m, void *v)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma = v;
+	struct numa_maps *md;
+	struct file *file = vma->vm_file;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mm_walk walk = {};
+	struct mempolicy *pol;
+	int n;
+	char buffer[50];
+
+	if (!mm)
+		return 0;
+
+	md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
+	if (!md)
+		return 0;
+
+	walk.hugetlb_entry = gather_hugetbl_stats;
+	walk.pte_entry = gather_pte_stats;
+	walk.private = md;
+	walk.mm = mm;
+
+	pol = get_vma_policy(priv->task, vma, vma->vm_start);
+	mpol_to_str(buffer, sizeof(buffer), pol, 0);
+	mpol_cond_put(pol);
+
+	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+
+	if (file) {
+		seq_printf(m, " file=");
+		seq_path(m, &file->f_path, "\n\t= ");
+	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
+		seq_printf(m, " heap");
+	} else if (vma->vm_start <= mm->start_stack &&
+			vma->vm_end >= mm->start_stack) {
+		seq_printf(m, " stack");
+	}
+
+	walk_page_range(vma->vm_start, vma->vm_end, &walk);
+
+	if (!md->pages)
+		goto out;
+
+	if (md->anon)
+		seq_printf(m, " anon=%lu", md->anon);
+
+	if (md->dirty)
+		seq_printf(m, " dirty=%lu", md->dirty);
+
+	if (md->pages != md->anon && md->pages != md->dirty)
+		seq_printf(m, " mapped=%lu", md->pages);
+
+	if (md->mapcount_max > 1)
+		seq_printf(m, " mapmax=%lu", md->mapcount_max);
+
+	if (md->swapcache)
+		seq_printf(m, " swapcache=%lu", md->swapcache);
+
+	if (md->active < md->pages && !is_vm_hugetlb_page(vma))
+		seq_printf(m, " active=%lu", md->active);
+
+	if (md->writeback)
+		seq_printf(m, " writeback=%lu", md->writeback);
+
+	for_each_node_state(n, N_HIGH_MEMORY)
+		if (md->node[n])
+			seq_printf(m, " N%d=%lu", n, md->node[n]);
+out:
+	seq_putc(m, '\n');
+	kfree(md);
+
+	if (m->count < m->size)
+		m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
+	return 0;
+}
 static const struct seq_operations proc_pid_numa_maps_op = {
         .start  = m_start,
         .next   = m_next,
@@ -876,4 +1042,4 @@ const struct file_operations proc_numa_maps_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif
+#endif /* CONFIG_NUMA */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c5a4342..e7fb9d2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2525,171 +2525,3 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
 	}
 	return p - buffer;
 }
-
-struct numa_maps {
-	unsigned long pages;
-	unsigned long anon;
-	unsigned long active;
-	unsigned long writeback;
-	unsigned long mapcount_max;
-	unsigned long dirty;
-	unsigned long swapcache;
-	unsigned long node[MAX_NUMNODES];
-};
-
-static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
-{
-	int count = page_mapcount(page);
-
-	md->pages++;
-	if (pte_dirty || PageDirty(page))
-		md->dirty++;
-
-	if (PageSwapCache(page))
-		md->swapcache++;
-
-	if (PageActive(page) || PageUnevictable(page))
-		md->active++;
-
-	if (PageWriteback(page))
-		md->writeback++;
-
-	if (PageAnon(page))
-		md->anon++;
-
-	if (count > md->mapcount_max)
-		md->mapcount_max = count;
-
-	md->node[page_to_nid(page)]++;
-}
-
-static int gather_pte_stats(pte_t *pte, unsigned long addr,
-		unsigned long pte_size, struct mm_walk *walk)
-{
-	struct numa_maps *md;
-	struct page *page;
-	int nid;
-
-	if (pte_none(*pte))
-		return 0;
-
-	page = pte_page(*pte);
-	if (!page)
-		return 0;
-
-	nid = page_to_nid(page);
-	if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
-		return 0;
-
-	md = walk->private;
-	gather_stats(page, md, pte_dirty(*pte));
-	return 0;
-}
-
-#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
-	struct numa_maps *md;
-	struct page *page;
-
-	if (pte_none(*pte))
-		return 0;
-
-	page = pte_page(*pte);
-	if (!page)
-		return 0;
-
-	md = walk->private;
-	gather_stats(page, md, pte_dirty(*pte));
-	return 0;
-}
-
-#else
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
-	return 0;
-}
-#endif
-
-/*
- * Display pages allocated per node and memory policy via /proc.
- */
-int show_numa_map(struct seq_file *m, void *v)
-{
-	struct proc_maps_private *priv = m->private;
-	struct vm_area_struct *vma = v;
-	struct numa_maps *md;
-	struct file *file = vma->vm_file;
-	struct mm_struct *mm = vma->vm_mm;
-	struct mm_walk walk = {};
-	struct mempolicy *pol;
-	int n;
-	char buffer[50];
-
-	if (!mm)
-		return 0;
-
-	md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
-	if (!md)
-		return 0;
-
-	walk.hugetlb_entry = gather_hugetbl_stats;
-	walk.pte_entry = gather_pte_stats;
-	walk.private = md;
-	walk.mm = mm;
-
-	pol = get_vma_policy(priv->task, vma, vma->vm_start);
-	mpol_to_str(buffer, sizeof(buffer), pol, 0);
-	mpol_cond_put(pol);
-
-	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
-
-	if (file) {
-		seq_printf(m, " file=");
-		seq_path(m, &file->f_path, "\n\t= ");
-	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
-		seq_printf(m, " heap");
-	} else if (vma->vm_start <= mm->start_stack &&
-			vma->vm_end >= mm->start_stack) {
-		seq_printf(m, " stack");
-	}
-
-	walk_page_range(vma->vm_start, vma->vm_end, &walk);
-
-	if (!md->pages)
-		goto out;
-
-	if (md->anon)
-		seq_printf(m," anon=%lu",md->anon);
-
-	if (md->dirty)
-		seq_printf(m," dirty=%lu",md->dirty);
-
-	if (md->pages != md->anon && md->pages != md->dirty)
-		seq_printf(m, " mapped=%lu", md->pages);
-
-	if (md->mapcount_max > 1)
-		seq_printf(m, " mapmax=%lu", md->mapcount_max);
-
-	if (md->swapcache)
-		seq_printf(m," swapcache=%lu", md->swapcache);
-
-	if (md->active < md->pages && !is_vm_hugetlb_page(vma))
-		seq_printf(m," active=%lu", md->active);
-
-	if (md->writeback)
-		seq_printf(m," writeback=%lu", md->writeback);
-
-	for_each_node_state(n, N_HIGH_MEMORY)
-		if (md->node[n])
-			seq_printf(m, " N%d=%lu", n, md->node[n]);
-out:
-	seq_putc(m, '\n');
-	kfree(md);
-
-	if (m->count < m->size)
-		m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
-	return 0;
-}
-- 
1.7.3.5


* [PATCH 7/8] proc: make struct proc_maps_private truly private
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
                   ` (5 preceding siblings ...)
  2011-04-27 23:35 ` [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  7:51   ` KOSAKI Motohiro
  2011-04-27 23:35 ` [PATCH 8/8] proc: allocate storage for numa_maps statistics once Stephen Wilson
  2011-05-04 23:10 ` [PATCH 0/8] avoid allocation in show_numa_map() Andrew Morton
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps,
there is no need to export struct proc_maps_private to the world.  Move
it to fs/proc/internal.h instead.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 fs/proc/internal.h      |    8 ++++++++
 include/linux/proc_fs.h |    8 --------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c03e8d3..3c39863 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -119,3 +119,11 @@ struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
  */
 int proc_readdir(struct file *, void *, filldir_t);
 struct dentry *proc_lookup(struct inode *, struct dentry *, struct nameidata *);
+
+struct proc_maps_private {
+	struct pid *pid;
+	struct task_struct *task;
+#ifdef CONFIG_MMU
+	struct vm_area_struct *tail_vma;
+#endif
+};
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 838c114..55ff594 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -286,12 +286,4 @@ static inline struct net *PDE_NET(struct proc_dir_entry *pde)
 	return pde->parent->data;
 }
 
-struct proc_maps_private {
-	struct pid *pid;
-	struct task_struct *task;
-#ifdef CONFIG_MMU
-	struct vm_area_struct *tail_vma;
-#endif
-};
-
 #endif /* _LINUX_PROC_FS_H */
-- 
1.7.3.5


* [PATCH 8/8] proc: allocate storage for numa_maps statistics once
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
                   ` (6 preceding siblings ...)
  2011-04-27 23:35 ` [PATCH 7/8] proc: make struct proc_maps_private truly private Stephen Wilson
@ 2011-04-27 23:35 ` Stephen Wilson
  2011-05-09  8:24   ` KOSAKI Motohiro
  2011-05-04 23:10 ` [PATCH 0/8] avoid allocation in show_numa_map() Andrew Morton
  8 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-04-27 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel

In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not
a candidate for stack allocation.
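
(For a rough sense of scale: with CONFIG_NODES_SHIFT=10, the node[MAX_NUMNODES]
array alone is 1024 * sizeof(unsigned long) = 8KB on 64-bit.)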

Instead of going through a kmalloc()+kfree() cycle each time show_numa_map()
is invoked, perform the allocation just once when /proc/pid/numa_maps is
opened.

Performing the allocation when numa_maps is opened, and thus before a
reference to the target task's mm is taken, eliminates a potential
stalemate condition in the OOM killer, as originally described by Hugh
Dickins:

  ... imagine what happens if the system is out of memory, and the mm
  we're looking at is selected for killing by the OOM killer: while
  we wait in __get_free_page for more memory, no memory is freed
  from the selected mm because it cannot reach exit_mmap while we hold
  that reference.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
 fs/proc/task_mmu.c |   36 +++++++++++++++++++++++++++---------
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9f069d2..1ca3a00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -868,6 +868,11 @@ struct numa_maps {
 	unsigned long node[MAX_NUMNODES];
 };
 
+struct numa_maps_private {
+	struct proc_maps_private proc_maps;
+	struct numa_maps md;
+};
+
 static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
 {
 	int count = page_mapcount(page);
@@ -949,9 +954,10 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
  */
 static int show_numa_map(struct seq_file *m, void *v)
 {
-	struct proc_maps_private *priv = m->private;
+	struct numa_maps_private *numa_priv = m->private;
+	struct proc_maps_private *proc_priv = &numa_priv->proc_maps;
 	struct vm_area_struct *vma = v;
-	struct numa_maps *md;
+	struct numa_maps *md = &numa_priv->md;
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_walk walk = {};
@@ -962,16 +968,15 @@ static int show_numa_map(struct seq_file *m, void *v)
 	if (!mm)
 		return 0;
 
-	md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
-	if (!md)
-		return 0;
+	/* Ensure we start with an empty set of numa_maps statistics. */
+	memset(md, 0, sizeof(*md));
 
 	walk.hugetlb_entry = gather_hugetbl_stats;
 	walk.pte_entry = gather_pte_stats;
 	walk.private = md;
 	walk.mm = mm;
 
-	pol = get_vma_policy(priv->task, vma, vma->vm_start);
+	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 	mpol_cond_put(pol);
 
@@ -1018,12 +1023,12 @@ static int show_numa_map(struct seq_file *m, void *v)
 			seq_printf(m, " N%d=%lu", n, md->node[n]);
 out:
 	seq_putc(m, '\n');
-	kfree(md);
 
 	if (m->count < m->size)
-		m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
+		m->version = (vma != proc_priv->tail_vma) ? vma->vm_start : 0;
 	return 0;
 }
+
 static const struct seq_operations proc_pid_numa_maps_op = {
         .start  = m_start,
         .next   = m_next,
@@ -1033,7 +1038,20 @@ static const struct seq_operations proc_pid_numa_maps_op = {
 
 static int numa_maps_open(struct inode *inode, struct file *file)
 {
-	return do_maps_open(inode, file, &proc_pid_numa_maps_op);
+	struct numa_maps_private *priv;
+	int ret = -ENOMEM;
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (priv) {
+		priv->proc_maps.pid = proc_pid(inode);
+		ret = seq_open(file, &proc_pid_numa_maps_op);
+		if (!ret) {
+			struct seq_file *m = file->private_data;
+			m->private = priv;
+		} else {
+			kfree(priv);
+		}
+	}
+	return ret;
 }
 
 const struct file_operations proc_numa_maps_operations = {
-- 
1.7.3.5


* Re: [PATCH 0/8] avoid allocation in show_numa_map()
  2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
                   ` (7 preceding siblings ...)
  2011-04-27 23:35 ` [PATCH 8/8] proc: allocate storage for numa_maps statistics once Stephen Wilson
@ 2011-05-04 23:10 ` Andrew Morton
  2011-05-05  2:37   ` Stephen Wilson
  8 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2011-05-04 23:10 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: Alexander Viro, KOSAKI Motohiro, Hugh Dickins, David Rientjes,
	linux-mm, linux-kernel, Jeremy Fitzhardinge

On Wed, 27 Apr 2011 19:35:41 -0400
Stephen Wilson <wilsons@start.ca> wrote:

> Recently a concern was raised[1] that performing an allocation while holding a
> reference on a task's mm could lead to a stalemate in the OOM killer.  The
> concern was specific to the goings-on in /proc.  Hugh Dickins described the
> issue as follows:
> 
>     ...imagine what happens if the system is out of memory, and the mm
>     we're looking at is selected for killing by the OOM killer: while we
>     wait in __get_free_page for more memory, no memory is freed from the
>     selected mm because it cannot reach exit_mmap while we hold that
>     reference.
> 
> The primary goal of this series is to eliminate repeated allocation/free cycles
> currently happening in show_numa_map() while we hold a reference to an mm.
> 
> The strategy is to perform the allocation once when /proc/pid/numa_maps is
> opened, before a reference on the target task's mm is taken.
> 
> Unfortunately, show_numa_map() is implemented in mm/mempolicy.c while the
> primary procfs implementation lives in fs/proc/task_mmu.c.  This makes
> clean cooperation between show_numa_map() and the other seq_file operations
> (start(), stop(), etc.) difficult.
> 
> 
> Patches 1-5 convert show_numa_map() to use the generic walk_page_range()
> functionality instead of the mempolicy.c-specific page table walking logic.
> Also, get_vma_policy() is exported.  This makes the show_numa_map()
> implementation independent of mempolicy.c.
> 
> Patch 6 moves show_numa_map() and supporting routines over to
> fs/proc/task_mmu.c.
> 
> Finally, patches 7 and 8 provide minor cleanup and eliminate the troublesome
> allocation.
> 
> Please note that moving show_numa_map() into fs/proc/task_mmu.c essentially
> reverts 1a75a6c825 and 48fce3429d.  Also, please see the discussion at [2].  My
> main justifications for moving the code back into task_mmu.c are:
> 
>   - Having the show() operation "miles away" from the corresponding
>     seq_file iteration operations is a maintenance burden.
> 
>   - The need to export ad hoc info like struct proc_maps_private is
>     eliminated.
> 
> 
> These patches are based on v2.6.39-rc5.

The patches look reasonable.  It would be nice to get some more review
happening (poke).

> 
> Please note that this series is VERY LIGHTLY TESTED.  I have been using
> CONFIG_NUMA_EMU=y thus far as I will not have access to a real NUMA system for
> another week or two.

"lightly tested" evokes fear, but the patches don't look too scary to
me.

Did you look at using apply_to_page_range()?

I'm trying to remember why we're carrying both walk_page_range() and
apply_to_page_range() but can't immediately think of a reason.

There's also an apply_to_page_range_batch() in -mm but that code is
broken on PPC and not much is happening with it, so it will probably go
away again.



* Re: [PATCH 0/8] avoid allocation in show_numa_map()
  2011-05-04 23:10 ` [PATCH 0/8] avoid allocation in show_numa_map() Andrew Morton
@ 2011-05-05  2:37   ` Stephen Wilson
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen Wilson @ 2011-05-05  2:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Stephen Wilson, Alexander Viro, KOSAKI Motohiro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel, Jeremy Fitzhardinge

On Wed, May 04, 2011 at 04:10:20PM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2011 19:35:41 -0400
> Stephen Wilson <wilsons@start.ca> wrote:
> 
> > Recently a concern was raised[1] that performing an allocation while holding a
> > reference on a task's mm could lead to a stalemate in the OOM killer.  The
> > concern was specific to the goings-on in /proc.  Hugh Dickins described the
> > issue as follows:
> > 
> >     ...imagine what happens if the system is out of memory, and the mm
> >     we're looking at is selected for killing by the OOM killer: while we
> >     wait in __get_free_page for more memory, no memory is freed from the
> >     selected mm because it cannot reach exit_mmap while we hold that
> >     reference.
> > 
> > The primary goal of this series is to eliminate repeated allocation/free cycles
> > currently happening in show_numa_map() while we hold a reference to an mm.
> > 
> > The strategy is to perform the allocation once when /proc/pid/numa_maps is
> > opened, before a reference on the target task's mm is taken.
> > 
> > Unfortunately, show_numa_map() is implemented in mm/mempolicy.c while the
> > primary procfs implementation lives in fs/proc/task_mmu.c.  This makes
> > clean cooperation between show_numa_map() and the other seq_file operations
> > (start(), stop(), etc.) difficult.
> > 
> > 
> > Patches 1-5 convert show_numa_map() to use the generic walk_page_range()
> > functionality instead of the mempolicy.c-specific page table walking logic.
> > Also, get_vma_policy() is exported.  This makes the show_numa_map()
> > implementation independent of mempolicy.c.
> > 
> > Patch 6 moves show_numa_map() and supporting routines over to
> > fs/proc/task_mmu.c.
> > 
> > Finally, patches 7 and 8 provide minor cleanup and eliminate the troublesome
> > allocation.
> > 
> > Please note that moving show_numa_map() into fs/proc/task_mmu.c essentially
> > reverts 1a75a6c825 and 48fce3429d.  Also, please see the discussion at [2].  My
> > main justifications for moving the code back into task_mmu.c are:
> > 
> >   - Having the show() operation "miles away" from the corresponding
> >     seq_file iteration operations is a maintenance burden.
> > 
> >   - The need to export ad hoc info like struct proc_maps_private is
> >     eliminated.
> > 
> > 
> > These patches are based on v2.6.39-rc5.
> 
> The patches look reasonable.  It would be nice to get some more review
> happening (poke).

If anyone would like me to resend the series please let me know.

> > 
> > Please note that this series is VERY LIGHTLY TESTED.  I have been using
> > CONFIG_NUMA_EMU=y thus far as I will not have access to a real NUMA system for
> > another week or two.
> 
> "lightly tested" evokes fear, but the patches don't look too scary to
> me.

Indeed.  I hope to have some real hardware to test the patches on by
the end of the week; fingers crossed.  Will certainly address any
issues that come up at that time. 


> Did you look at using apply_to_page_range()?

I did not look into it deeply, no.  The main reason for using
walk_page_range() was that it supports hugetlb VMAs in the same way as
was done in mempolicy.c's check_huge_range().  The algorithm was a very
natural fit so I ran with it.
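
For reference, the two entry points as of 2.6.39 look roughly like this
(paraphrased from include/linux/mm.h, so double-check the exact signatures):

	/* per-level callbacks via struct mm_walk, including ->hugetlb_entry */
	int walk_page_range(unsigned long addr, unsigned long end,
			struct mm_walk *walk);

	/* one pte-level callback, no hugetlb hook */
	int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
			unsigned long size, pte_fn_t fn, void *data);

If I remember correctly, apply_to_page_range() also allocates page tables to
fill holes in the range -- it is aimed at populating mappings -- whereas
walk_page_range() only visits what is already there, which is the behaviour
a /proc reader wants.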


> I'm trying to remember why we're carrying both walk_page_range() and
> apply_to_page_range() but can't immediately think of a reason.
>
> There's also an apply_to_page_range_batch() in -mm but that code is
> broken on PPC and not much is happening with it, so it will probably go
> away again.


-- 
steve


* Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code
  2011-04-27 23:35 ` [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code Stephen Wilson
@ 2011-05-09  7:38   ` KOSAKI Motohiro
  2011-05-09 19:36     ` Stephen Wilson
  0 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:38 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

Hello,

sorry for the long delay.

> In the specific case of show_numa_map(), the custom page table walking
> logic implemented in mempolicy.c does not provide any special service
> beyond that provided by walk_page_range().
> 
> Also, converting show_numa_map() to use the generic routine decouples
> the function from mempolicy.c, allowing it to be moved out of the mm
> subsystem and into fs/proc.
> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>
> ---
>  mm/mempolicy.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 files changed, 46 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5bfb03e..dfe27e3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
>  	md->node[page_to_nid(page)]++;
>  }
>  
> +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> +		unsigned long pte_size, struct mm_walk *walk)
> +{
> +	struct page *page;
> +
> +	if (pte_none(*pte))
> +		return 0;
> +
> +	page = pte_page(*pte);
> +	if (!page)
> +		return 0;

The original check_pte_range() has the following logic.

        orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
        do {
                struct page *page;
                int nid;

                if (!pte_present(*pte))
                        continue;
                page = vm_normal_page(vma, addr, *pte);
                if (!page)
                        continue;
                /*
                 * vm_normal_page() filters out zero pages, but there might
                 * still be PageReserved pages to skip, perhaps in a VDSO.
                 * And we cannot move PageKsm pages sensibly or safely yet.
                 */
                if (PageReserved(page) || PageKsm(page))
                        continue;
                gather_stats(page, private, pte_dirty(*pte));

Why did you drop a lot of these checks?  Is that safe?

The other parts look good to me.



* Re: [PATCH 1/8] mm: export get_vma_policy()
  2011-04-27 23:35 ` [PATCH 1/8] mm: export get_vma_policy() Stephen Wilson
@ 2011-05-09  7:39   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:39 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 31ac26c..c2f6032 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -199,6 +199,9 @@ void mpol_free_shared_policy(struct shared_policy *p);
>  struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  					    unsigned long idx);
>  
> +struct mempolicy *get_vma_policy(struct task_struct *tsk,
> +		struct vm_area_struct *vma, unsigned long addr);
> +
>  extern void numa_default_policy(void);
>  extern void numa_policy_init(void);
>  extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 959a8b8..5bfb03e 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1489,7 +1489,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct task_struct *task,
> +struct mempolicy *get_vma_policy(struct task_struct *task,
>  		struct vm_area_struct *vma, unsigned long addr)

Looks reasonable to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





* Re: [PATCH 3/8] mm: remove MPOL_MF_STATS
  2011-04-27 23:35 ` [PATCH 3/8] mm: remove MPOL_MF_STATS Stephen Wilson
@ 2011-05-09  7:44   ` KOSAKI Motohiro
  2011-05-09 19:39     ` Stephen Wilson
  0 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:44 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> Mapping statistics in a NUMA environment are now computed using the
> generic walk_page_range() logic.  Remove the old/equivalent
> functionality.
> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>
> ---
>  mm/mempolicy.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index dfe27e3..63c0d69 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -99,7 +99,6 @@
>  /* Internal flags */
>  #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
>  #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
> -#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
>  
>  static struct kmem_cache *policy_cache;
>  static struct kmem_cache *sn_cache;
> @@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
>  			continue;
>  
> -		if (flags & MPOL_MF_STATS)
> -			gather_stats(page, private, pte_dirty(*pte));
> -		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> +		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
>  			migrate_page_add(page, private, flags);
>  		else
>  			break;

This hunk looks good to me.


> @@ -2572,6 +2569,7 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
>  		unsigned long pte_size, struct mm_walk *walk)
>  {
>  	struct page *page;
> +	int nid;
>  
>  	if (pte_none(*pte))
>  		return 0;
> @@ -2580,6 +2578,10 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
>  	if (!page)
>  		return 0;
>  
> +	nid = page_to_nid(page);
> +	if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
> +		return 0;
> +
>  	gather_stats(page, walk->private, pte_dirty(*pte));
>  	return 0;

However, this hunk should be moved into patch [2/8], because 1) that keeps
the series bisectable, and 2) the description says "Remove the old/equivalent
functionality", but this hunk adds new functionality.




* Re: [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration
  2011-04-27 23:35 ` [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration Stephen Wilson
@ 2011-05-09  7:45   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:45 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> Improve the prototype of gather_stats() to take a struct numa_maps as
> argument instead of a generic void *.  Update all callers to make the
> required type explicit.
> 
> Since gather_stats() is not needed before its definition and is
> scheduled to be moved out of mempolicy.c, the declaration is removed as
> well.
> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




* Re: [PATCH 5/8] mm: remove check_huge_range()
  2011-04-27 23:35 ` [PATCH 5/8] mm: remove check_huge_range() Stephen Wilson
@ 2011-05-09  7:46   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:46 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> This function has been superseded by gather_hugetbl_stats() and is no
> longer needed.
> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>
> ---
>  mm/mempolicy.c |   35 -----------------------------------
>  1 files changed, 0 insertions(+), 35 deletions(-)

Looks good.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





* Re: [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c
  2011-04-27 23:35 ` [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c Stephen Wilson
@ 2011-05-09  7:49   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:49 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
> issues.
> 
>   - Having the show() operation "miles away" from the corresponding
>     seq_file iteration operations is a maintenance burden.
> 
>   - The need to export ad hoc info like struct proc_maps_private is
>     eliminated.
> 
>   - The implementation of show_numa_map() can be improved in a simple
>     manner by cooperating with the other seq_file operations (start,
>     stop, etc) -- something that would be messy to do without this
>     change.
> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>
> ---
>  fs/proc/task_mmu.c |  170 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  mm/mempolicy.c     |  168 ---------------------------------------------------
>  2 files changed, 168 insertions(+), 170 deletions(-)

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





* Re: [PATCH 7/8] proc: make struct proc_maps_private truly private
  2011-04-27 23:35 ` [PATCH 7/8] proc: make struct proc_maps_private truly private Stephen Wilson
@ 2011-05-09  7:51   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  7:51 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
> there is no need to export struct proc_maps_private to the world.  Move
> it to fs/proc/internal.h instead.
> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>
> ---
>  fs/proc/internal.h      |    8 ++++++++
>  include/linux/proc_fs.h |    8 --------
>  2 files changed, 8 insertions(+), 8 deletions(-)

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 8/8] proc: allocate storage for numa_maps statistics once
  2011-04-27 23:35 ` [PATCH 8/8] proc: allocate storage for numa_maps statistics once Stephen Wilson
@ 2011-05-09  8:24   ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-09  8:24 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

>  static int numa_maps_open(struct inode *inode, struct file *file)
>  {
> -	return do_maps_open(inode, file, &proc_pid_numa_maps_op);
> +	struct numa_maps_private *priv;
> +	int ret = -ENOMEM;
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (priv) {
> +		priv->proc_maps.pid = proc_pid(inode);
> +		ret = seq_open(file, &proc_pid_numa_maps_op);
> +		if (!ret) {
> +			struct seq_file *m = file->private_data;
> +			m->private = priv;
> +		} else {
> +			kfree(priv);
> +		}
> +	}
> +	return ret;
>  }

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code
  2011-05-09  7:38   ` KOSAKI Motohiro
@ 2011-05-09 19:36     ` Stephen Wilson
  2011-05-10  0:20       ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen Wilson @ 2011-05-09 19:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Stephen Wilson, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

On Mon, May 09, 2011 at 04:38:49PM +0900, KOSAKI Motohiro wrote:
> Hello,
> 
> sorry for the long delay.

Please, no apologies.  Thank you for the review!

> > In the specific case of show_numa_map(), the custom page table walking
> > logic implemented in mempolicy.c does not provide any special service
> > beyond that provided by walk_page_range().
> > 
> > Also, converting show_numa_map() to use the generic routine decouples
> > the function from mempolicy.c, allowing it to be moved out of the mm
> > subsystem and into fs/proc.
> > 
> > Signed-off-by: Stephen Wilson <wilsons@start.ca>
> > ---
> >  mm/mempolicy.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 files changed, 46 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 5bfb03e..dfe27e3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> >  	md->node[page_to_nid(page)]++;
> >  }
> >  
> > +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > +		unsigned long pte_size, struct mm_walk *walk)
> > +{
> > +	struct page *page;
> > +
> > +	if (pte_none(*pte))
> > +		return 0;
> > +
> > +	page = pte_page(*pte);
> > +	if (!page)
> > +		return 0;
> 
> The original check_pte_range() has the following logic.
> 
>         orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         do {
>                 struct page *page;
>                 int nid;
> 
>                 if (!pte_present(*pte))
>                         continue;
>                 page = vm_normal_page(vma, addr, *pte);
>                 if (!page)
>                         continue;
>                 /*
>                  * vm_normal_page() filters out zero pages, but there might
>                  * still be PageReserved pages to skip, perhaps in a VDSO.
>                  * And we cannot move PageKsm pages sensibly or safely yet.
>                  */
>                 if (PageReserved(page) || PageKsm(page))
>                         continue;
>                 gather_stats(page, private, pte_dirty(*pte));
> 
> Why did you drop so many of these checks? Is it safe?

I must have been confused.  For one, walk_page_range() does not even
lock the pmd entry when iterating over the pte's.  I completely
overlooked that fact and so with that, the series is totally broken.

I am currently testing a slightly reworked set based on the following
variation.  When finished I will send v2 of the series which will
address all issues raised so far.

Thanks again for the review!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/8] mm: remove MPOL_MF_STATS
  2011-05-09  7:44   ` KOSAKI Motohiro
@ 2011-05-09 19:39     ` Stephen Wilson
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen Wilson @ 2011-05-09 19:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Stephen Wilson, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

On Mon, May 09, 2011 at 04:44:24PM +0900, KOSAKI Motohiro wrote:
> > Mapping statistics in a NUMA environment is now computed using the
> > generic walk_page_range() logic.  Remove the old/equivalent
> > functionality.
> > 
> > Signed-off-by: Stephen Wilson <wilsons@start.ca>
> > ---
> >  mm/mempolicy.c |   10 ++++++----
> >  1 files changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index dfe27e3..63c0d69 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -99,7 +99,6 @@
> >  /* Internal flags */
> >  #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
> >  #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
> > -#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
> >  
> >  static struct kmem_cache *policy_cache;
> >  static struct kmem_cache *sn_cache;
> > @@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
> >  			continue;
> >  
> > -		if (flags & MPOL_MF_STATS)
> > -			gather_stats(page, private, pte_dirty(*pte));
> > -		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> > +		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> >  			migrate_page_add(page, private, flags);
> >  		else
> >  			break;
> 
> This hunk looks good to me.
> 
> 
> > @@ -2572,6 +2569,7 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
> >  		unsigned long pte_size, struct mm_walk *walk)
> >  {
> >  	struct page *page;
> > +	int nid;
> >  
> >  	if (pte_none(*pte))
> >  		return 0;
> > @@ -2580,6 +2578,10 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
> >  	if (!page)
> >  		return 0;
> >  
> > +	nid = page_to_nid(page);
> > +	if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
> > +		return 0;
> > +
> >  	gather_stats(page, walk->private, pte_dirty(*pte));
> >  	return 0;
> 
> However, this hunk should be moved into patch [2/8], because 1) it would
> keep the series bisectable, and 2) the description says "Remove the
> old/equivalent functionality." but this hunk adds new functionality.

Absolutely.  Will move into the proper location and resend the whole
series.

Thanks again,

-- 
steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code
  2011-05-09 19:36     ` Stephen Wilson
@ 2011-05-10  0:20       ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2011-05-10  0:20 UTC (permalink / raw)
  To: Stephen Wilson
  Cc: kosaki.motohiro, Andrew Morton, Alexander Viro, Hugh Dickins,
	David Rientjes, linux-mm, linux-kernel

> On Mon, May 09, 2011 at 04:38:49PM +0900, KOSAKI Motohiro wrote:
> > Hello,
> > 
> > sorry for the long delay.
> 
> Please, no apologies.  Thank you for the review!
> 
> > > In the specific case of show_numa_map(), the custom page table walking
> > > logic implemented in mempolicy.c does not provide any special service
> > > beyond that provided by walk_page_range().
> > > 
> > > Also, converting show_numa_map() to use the generic routine decouples
> > > the function from mempolicy.c, allowing it to be moved out of the mm
> > > subsystem and into fs/proc.
> > > 
> > > Signed-off-by: Stephen Wilson <wilsons@start.ca>
> > > ---
> > >  mm/mempolicy.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> > >  1 files changed, 46 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 5bfb03e..dfe27e3 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> > >  	md->node[page_to_nid(page)]++;
> > >  }
> > >  
> > > +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > > +		unsigned long pte_size, struct mm_walk *walk)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	if (pte_none(*pte))
> > > +		return 0;
> > > +
> > > +	page = pte_page(*pte);
> > > +	if (!page)
> > > +		return 0;
> > 
> > The original check_pte_range() has the following logic.
> > 
> >         orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >         do {
> >                 struct page *page;
> >                 int nid;
> > 
> >                 if (!pte_present(*pte))
> >                         continue;
> >                 page = vm_normal_page(vma, addr, *pte);
> >                 if (!page)
> >                         continue;
> >                 /*
> >                  * vm_normal_page() filters out zero pages, but there might
> >                  * still be PageReserved pages to skip, perhaps in a VDSO.
> >                  * And we cannot move PageKsm pages sensibly or safely yet.
> >                  */
> >                 if (PageReserved(page) || PageKsm(page))
> >                         continue;
> >                 gather_stats(page, private, pte_dirty(*pte));
> > 
> > Why did you drop so many of these checks? Is it safe?
> 
> I must have been confused.  For one, walk_page_range() does not even
> lock the pmd entry when iterating over the pte's.  I completely
> overlooked that fact and so with that, the series is totally broken.
> 
> I am currently testing a slightly reworked set based on the following
> variation.  When finished I will send v2 of the series which will
> address all issues raised so far.
> 
> Thanks again for the review!
> 
> 
> 
> From 013a1e0fc96f8370339209f16d81df4ded40dbf2 Mon Sep 17 00:00:00 2001
> From: Stephen Wilson <wilsons@start.ca>
> Date: Mon, 9 May 2011 14:39:27 -0400
> Subject: [PATCH] mm: use walk_page_range() instead of custom page table
>  walking code
> 
> Converting show_numa_map() to use the generic routine decouples
> the function from mempolicy.c, allowing it to be moved out of the mm
> subsystem and into fs/proc.
> 
> Also, include KSM pages in /proc/pid/numa_maps statistics.  The pagewalk
> logic implemented by check_pte_range() failed to account for such pages
> as they were not applicable to the page migration case.

Seems like a very reasonable change.

> 
> Signed-off-by: Stephen Wilson <wilsons@start.ca>
> ---
>  mm/mempolicy.c |   75 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 files changed, 68 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5bfb03e..945e85d 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2531,6 +2531,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
>  }
>  
>  struct numa_maps {
> +	struct vm_area_struct *vma;
>  	unsigned long pages;
>  	unsigned long anon;
>  	unsigned long active;
> @@ -2568,6 +2569,41 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
>  	md->node[page_to_nid(page)]++;
>  }
>  
> +static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
> +		unsigned long end, struct mm_walk *walk)
> +{
> +	struct numa_maps *md;
> +	spinlock_t *ptl;
> +	pte_t *orig_pte;
> +	pte_t *pte;
> +
> +	md = walk->private;
> +	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	do {
> +		struct page *page;
> +		int nid;
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		page = vm_normal_page(md->vma, addr, *pte);
> +		if (!page)
> +			continue;
> +
> +		if (PageReserved(page))
> +			continue;
> +
> +		nid = page_to_nid(page);
> +		if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
> +			continue;
> +
> +		gather_stats(page, md, pte_dirty(*pte));
> +
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	pte_unmap_unlock(orig_pte, ptl);
> +	return 0;
> +}


Looks completely good.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Thank you for the great work!




^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2011-05-10  0:20 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
2011-04-27 23:35 ` [PATCH 1/8] mm: export get_vma_policy() Stephen Wilson
2011-05-09  7:39   ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code Stephen Wilson
2011-05-09  7:38   ` KOSAKI Motohiro
2011-05-09 19:36     ` Stephen Wilson
2011-05-10  0:20       ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 3/8] mm: remove MPOL_MF_STATS Stephen Wilson
2011-05-09  7:44   ` KOSAKI Motohiro
2011-05-09 19:39     ` Stephen Wilson
2011-04-27 23:35 ` [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration Stephen Wilson
2011-05-09  7:45   ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 5/8] mm: remove check_huge_range() Stephen Wilson
2011-05-09  7:46   ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c Stephen Wilson
2011-05-09  7:49   ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 7/8] proc: make struct proc_maps_private truly private Stephen Wilson
2011-05-09  7:51   ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 8/8] proc: allocate storage for numa_maps statistics once Stephen Wilson
2011-05-09  8:24   ` KOSAKI Motohiro
2011-05-04 23:10 ` [PATCH 0/8] avoid allocation in show_numa_map() Andrew Morton
2011-05-05  2:37   ` Stephen Wilson

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).