From: Stephen Wilson <wilsons@start.ca>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stephen Wilson <wilsons@start.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Hugh Dickins <hughd@google.com>,
David Rientjes <rientjes@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code
Date: Mon, 9 May 2011 15:36:50 -0400 [thread overview]
Message-ID: <20110509193650.GA2865@wicker.gateway.2wire.net> (raw)
In-Reply-To: <20110509164034.164C.A69D9226@jp.fujitsu.com>
On Mon, May 09, 2011 at 04:38:49PM +0900, KOSAKI Motohiro wrote:
> Hello,
>
> sorry for the long delay.
Please, no apologies. Thank you for the review!
> > In the specific case of show_numa_map(), the custom page table walking
> > logic implemented in mempolicy.c does not provide any special service
> > beyond that provided by walk_page_range().
> >
> > Also, converting show_numa_map() to use the generic routine decouples
> > the function from mempolicy.c, allowing it to be moved out of the mm
> > subsystem and into fs/proc.
> >
> > Signed-off-by: Stephen Wilson <wilsons@start.ca>
> > ---
> > mm/mempolicy.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 files changed, 46 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 5bfb03e..dfe27e3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> > md->node[page_to_nid(page)]++;
> > }
> >
> > +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > + unsigned long pte_size, struct mm_walk *walk)
> > +{
> > + struct page *page;
> > +
> > + if (pte_none(*pte))
> > + return 0;
> > +
> > + page = pte_page(*pte);
> > + if (!page)
> > + return 0;
>
> original check_pte_range() has following logic.
>
> orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> do {
> struct page *page;
> int nid;
>
> if (!pte_present(*pte))
> continue;
> page = vm_normal_page(vma, addr, *pte);
> if (!page)
> continue;
> /*
> * vm_normal_page() filters out zero pages, but there might
> * still be PageReserved pages to skip, perhaps in a VDSO.
> * And we cannot move PageKsm pages sensibly or safely yet.
> */
> if (PageReserved(page) || PageKsm(page))
> continue;
> gather_stats(page, private, pte_dirty(*pte));
>
> Why did you drop a lot of check? Is it safe?
I must have been confused. For one, walk_page_range() does not even
lock the pmd entry when iterating over the pte's. I completely
overlooked that fact and so with that, the series is totally broken.
I am currently testing a slightly reworked set based on the following
variation. When finished I will send v2 of the series which will
address all issues raised so far.
Thanks again for the review!
>From 013a1e0fc96f8370339209f16d81df4ded40dbf2 Mon Sep 17 00:00:00 2001
From: Stephen Wilson <wilsons@start.ca>
Date: Mon, 9 May 2011 14:39:27 -0400
Subject: [PATCH] mm: use walk_page_range() instead of custom page table
walking code
Converting show_numa_map() to use the generic routine decouples
the function from mempolicy.c, allowing it to be moved out of the mm
subsystem and into fs/proc.
Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk
logic implemented by check_pte_range() failed to account for such pages
as they were not applicable to the page migration case.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
---
mm/mempolicy.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 68 insertions(+), 7 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5bfb03e..945e85d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2531,6 +2531,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
}
struct numa_maps {
+ struct vm_area_struct *vma;
unsigned long pages;
unsigned long anon;
unsigned long active;
@@ -2568,6 +2569,41 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
md->node[page_to_nid(page)]++;
}
+static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ pte_t *pte;
+
+ md = walk->private;
+ orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ do {
+ struct page *page;
+ int nid;
+
+ if (!pte_present(*pte))
+ continue;
+
+ page = vm_normal_page(md->vma, addr, *pte);
+ if (!page)
+ continue;
+
+ if (PageReserved(page))
+ continue;
+
+ nid = page_to_nid(page);
+ if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ continue;
+
+ gather_stats(page, md, pte_dirty(*pte));
+
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ pte_unmap_unlock(orig_pte, ptl);
+ return 0;
+}
+
#ifdef CONFIG_HUGETLB_PAGE
static void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
@@ -2597,12 +2633,35 @@ static void check_huge_range(struct vm_area_struct *vma,
gather_stats(page, md, pte_dirty(*ptep));
}
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ gather_stats(page, walk->private, pte_dirty(*pte));
+ return 0;
+}
+
#else
static inline void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct numa_maps *md)
{
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
#endif
/*
@@ -2615,6 +2674,7 @@ int show_numa_map(struct seq_file *m, void *v)
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {};
struct mempolicy *pol;
int n;
char buffer[50];
@@ -2626,6 +2686,13 @@ int show_numa_map(struct seq_file *m, void *v)
if (!md)
return 0;
+ md->vma = vma;
+
+ walk.hugetlb_entry = gather_hugetbl_stats;
+ walk.pmd_entry = gather_pte_stats;
+ walk.private = md;
+ walk.mm = mm;
+
pol = get_vma_policy(priv->task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
@@ -2642,13 +2709,7 @@ int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " stack");
}
- if (is_vm_hugetlb_page(vma)) {
- check_huge_range(vma, vma->vm_start, vma->vm_end, md);
- seq_printf(m, " huge");
- } else {
- check_pgd_range(vma, vma->vm_start, vma->vm_end,
- &node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
- }
+ walk_page_range(vma->vm_start, vma->vm_end, &walk);
if (!md->pages)
goto out;
--
1.7.4.4
WARNING: multiple messages have this Message-ID (diff)
From: Stephen Wilson <wilsons@start.ca>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stephen Wilson <wilsons@start.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Hugh Dickins <hughd@google.com>,
David Rientjes <rientjes@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code
Date: Mon, 9 May 2011 15:36:50 -0400 [thread overview]
Message-ID: <20110509193650.GA2865@wicker.gateway.2wire.net> (raw)
In-Reply-To: <20110509164034.164C.A69D9226@jp.fujitsu.com>
On Mon, May 09, 2011 at 04:38:49PM +0900, KOSAKI Motohiro wrote:
> Hello,
>
> sorry for the long delay.
Please, no apologies. Thank you for the review!
> > In the specific case of show_numa_map(), the custom page table walking
> > logic implemented in mempolicy.c does not provide any special service
> > beyond that provided by walk_page_range().
> >
> > Also, converting show_numa_map() to use the generic routine decouples
> > the function from mempolicy.c, allowing it to be moved out of the mm
> > subsystem and into fs/proc.
> >
> > Signed-off-by: Stephen Wilson <wilsons@start.ca>
> > ---
> > mm/mempolicy.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 files changed, 46 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 5bfb03e..dfe27e3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> > md->node[page_to_nid(page)]++;
> > }
> >
> > +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > + unsigned long pte_size, struct mm_walk *walk)
> > +{
> > + struct page *page;
> > +
> > + if (pte_none(*pte))
> > + return 0;
> > +
> > + page = pte_page(*pte);
> > + if (!page)
> > + return 0;
>
> original check_pte_range() has following logic.
>
> orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> do {
> struct page *page;
> int nid;
>
> if (!pte_present(*pte))
> continue;
> page = vm_normal_page(vma, addr, *pte);
> if (!page)
> continue;
> /*
> * vm_normal_page() filters out zero pages, but there might
> * still be PageReserved pages to skip, perhaps in a VDSO.
> * And we cannot move PageKsm pages sensibly or safely yet.
> */
> if (PageReserved(page) || PageKsm(page))
> continue;
> gather_stats(page, private, pte_dirty(*pte));
>
> Why did you drop a lot of check? Is it safe?
I must have been confused. For one, walk_page_range() does not even
lock the pmd entry when iterating over the pte's. I completely
overlooked that fact and so with that, the series is totally broken.
I am currently testing a slightly reworked set based on the following
variation. When finished I will send v2 of the series which will
address all issues raised so far.
Thanks again for the review!
next prev parent reply other threads:[~2011-05-09 19:37 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-04-27 23:35 [PATCH 0/8] avoid allocation in show_numa_map() Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-04-27 23:35 ` [PATCH 1/8] mm: export get_vma_policy() Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:39 ` KOSAKI Motohiro
2011-05-09 7:39 ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:38 ` KOSAKI Motohiro
2011-05-09 7:38 ` KOSAKI Motohiro
2011-05-09 19:36 ` Stephen Wilson [this message]
2011-05-09 19:36 ` Stephen Wilson
2011-05-10 0:20 ` KOSAKI Motohiro
2011-05-10 0:20 ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 3/8] mm: remove MPOL_MF_STATS Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:44 ` KOSAKI Motohiro
2011-05-09 7:44 ` KOSAKI Motohiro
2011-05-09 19:39 ` Stephen Wilson
2011-05-09 19:39 ` Stephen Wilson
2011-04-27 23:35 ` [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:45 ` KOSAKI Motohiro
2011-05-09 7:45 ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 5/8] mm: remove check_huge_range() Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:46 ` KOSAKI Motohiro
2011-05-09 7:46 ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:49 ` KOSAKI Motohiro
2011-05-09 7:49 ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 7/8] proc: make struct proc_maps_private truly private Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 7:51 ` KOSAKI Motohiro
2011-05-09 7:51 ` KOSAKI Motohiro
2011-04-27 23:35 ` [PATCH 8/8] proc: allocate storage for numa_maps statistics once Stephen Wilson
2011-04-27 23:35 ` Stephen Wilson
2011-05-09 8:24 ` KOSAKI Motohiro
2011-05-09 8:24 ` KOSAKI Motohiro
2011-05-04 23:10 ` [PATCH 0/8] avoid allocation in show_numa_map() Andrew Morton
2011-05-04 23:10 ` Andrew Morton
2011-05-05 2:37 ` Stephen Wilson
2011-05-05 2:37 ` Stephen Wilson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110509193650.GA2865@wicker.gateway.2wire.net \
--to=wilsons@start.ca \
--cc=akpm@linux-foundation.org \
--cc=hughd@google.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=rientjes@google.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.