* [PATCH 1/11] maps3: add proportional set size accounting in smaps
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
@ 2007-10-15 22:25 ` Matt Mackall
2007-10-15 23:36 ` David Rientjes
2007-10-15 22:25 ` [PATCH 2/11] maps3: introduce task_size_of for all arches Matt Mackall
` (9 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:25 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing
it. So if a process has 1000 pages all to itself, and 1000 shared with one
other process, its PSS will be 1500.
- lwn.net: "ELC: How much memory are applications really using?"
The PSS proposed by Matt Mackall is a very nice metric for measuring a
process's memory footprint. So collect and export it via
/proc/<pid>/smaps.
Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
job. They are comprehensive tools. But for PSS, let's do it the simple
way.
Cc: John Berthels <jjberthels@gmail.com>
Cc: Bernardo Innocenti <bernie@codewiz.org>
Cc: Padraig Brady <P@draigBrady.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Hugh Dickins <hugh@veritas.com>
---
fs/proc/task_mmu.c | 29 ++++++++++++++++++++++++++++-
1 files changed, 28 insertions(+), 1 deletion(-)
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-14 13:35:31.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-14 13:36:56.000000000 -0500
@@ -122,6 +122,27 @@ struct mem_size_stats
unsigned long private_clean;
unsigned long private_dirty;
unsigned long referenced;
+
+ /*
+ * Proportional Set Size(PSS): my share of RSS.
+ *
+ * PSS of a process is the count of pages it has in memory, where each
+ * page is divided by the number of processes sharing it. So if a
+ * process has 1000 pages all to itself, and 1000 shared with one other
+ * process, its PSS will be 1500. - Matt Mackall, lwn.net
+ */
+ u64 pss;
+ /*
+ * To keep (accumulated) division errors low, we adopt 64bit pss and
+ * use some low bits for division errors. So (pss >> PSS_DIV_BITS)
+ * would be the real byte count.
+ *
+ * A shift of 12 before division means(assuming 4K page size):
+ * - 1M 3-user-pages add up to 8KB errors;
+ * - supports mapcount up to 2^24, or 16M;
+ * - supports PSS up to 2^52 bytes, or 4PB.
+ */
+#define PSS_DIV_BITS 12
};
struct pmd_walker {
@@ -195,6 +216,7 @@ static int show_map_internal(struct seq_
seq_printf(m,
"Size: %8lu kB\n"
"Rss: %8lu kB\n"
+ "Pss: %8lu kB\n"
"Shared_Clean: %8lu kB\n"
"Shared_Dirty: %8lu kB\n"
"Private_Clean: %8lu kB\n"
@@ -202,6 +224,7 @@ static int show_map_internal(struct seq_
"Referenced: %8lu kB\n",
(vma->vm_end - vma->vm_start) >> 10,
mss->resident >> 10,
+ (unsigned long)(mss->pss >> (10 + PSS_DIV_BITS)),
mss->shared_clean >> 10,
mss->shared_dirty >> 10,
mss->private_clean >> 10,
@@ -226,6 +249,7 @@ static void smaps_pte_range(struct vm_ar
pte_t *pte, ptent;
spinlock_t *ptl;
struct page *page;
+ int mapcount;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -242,16 +266,19 @@ static void smaps_pte_range(struct vm_ar
/* Accumulate the size in pages that have been accessed. */
if (pte_young(ptent) || PageReferenced(page))
mss->referenced += PAGE_SIZE;
- if (page_mapcount(page) >= 2) {
+ mapcount = page_mapcount(page);
+ if (mapcount >= 2) {
if (pte_dirty(ptent))
mss->shared_dirty += PAGE_SIZE;
else
mss->shared_clean += PAGE_SIZE;
+ mss->pss += (PAGE_SIZE << PSS_DIV_BITS) / mapcount;
} else {
if (pte_dirty(ptent))
mss->private_dirty += PAGE_SIZE;
else
mss->private_clean += PAGE_SIZE;
+ mss->pss += (PAGE_SIZE << PSS_DIV_BITS);
}
}
pte_unmap_unlock(pte - 1, ptl);
* Re: [PATCH 1/11] maps3: add proportional set size accounting in smaps
2007-10-15 22:25 ` [PATCH 1/11] maps3: add proportional set size accounting in smaps Matt Mackall
@ 2007-10-15 23:36 ` David Rientjes
2007-10-16 0:18 ` Matt Mackall
0 siblings, 1 reply; 39+ messages in thread
From: David Rientjes @ 2007-10-15 23:36 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Matt Mackall wrote:
> Index: l/fs/proc/task_mmu.c
> ===================================================================
> --- l.orig/fs/proc/task_mmu.c 2007-10-14 13:35:31.000000000 -0500
> +++ l/fs/proc/task_mmu.c 2007-10-14 13:36:56.000000000 -0500
> @@ -122,6 +122,27 @@ struct mem_size_stats
> unsigned long private_clean;
> unsigned long private_dirty;
> unsigned long referenced;
> +
> + /*
> + * Proportional Set Size(PSS): my share of RSS.
> + *
> + * PSS of a process is the count of pages it has in memory, where each
> + * page is divided by the number of processes sharing it. So if a
> + * process has 1000 pages all to itself, and 1000 shared with one other
> + * process, its PSS will be 1500. - Matt Mackall, lwn.net
> + */
> + u64 pss;
> + /*
> + * To keep (accumulated) division errors low, we adopt 64bit pss and
> + * use some low bits for division errors. So (pss >> PSS_DIV_BITS)
> + * would be the real byte count.
> + *
> + * A shift of 12 before division means(assuming 4K page size):
> + * - 1M 3-user-pages add up to 8KB errors;
> + * - supports mapcount up to 2^24, or 16M;
> + * - supports PSS up to 2^52 bytes, or 4PB.
> + */
> +#define PSS_DIV_BITS 12
> };
>
I know this gets moved again in the eighth patch of the series, but the
#define still has no place inside the struct definition.
The pss is going to need accessor functions, preferably inlined, and the
comment adjusted stating that all accesses should be through those
functions and not directly to the mem_size_stats struct.
static inline u64 pss_up(unsigned long pss)
{
return pss << PSS_DIV_BITS;
}
static inline unsigned long pss_down(u64 pss)
{
return pss >> PSS_DIV_BITS;
}
* Re: [PATCH 1/11] maps3: add proportional set size accounting in smaps
2007-10-15 23:36 ` David Rientjes
@ 2007-10-16 0:18 ` Matt Mackall
2007-10-16 2:24 ` David Rientjes
0 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-16 0:18 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, Oct 15, 2007 at 04:36:38PM -0700, David Rientjes wrote:
> On Mon, 15 Oct 2007, Matt Mackall wrote:
>
> > Index: l/fs/proc/task_mmu.c
> > ===================================================================
> > --- l.orig/fs/proc/task_mmu.c 2007-10-14 13:35:31.000000000 -0500
> > +++ l/fs/proc/task_mmu.c 2007-10-14 13:36:56.000000000 -0500
> > @@ -122,6 +122,27 @@ struct mem_size_stats
> > unsigned long private_clean;
> > unsigned long private_dirty;
> > unsigned long referenced;
> > +
> > + /*
> > + * Proportional Set Size(PSS): my share of RSS.
> > + *
> > + * PSS of a process is the count of pages it has in memory, where each
> > + * page is divided by the number of processes sharing it. So if a
> > + * process has 1000 pages all to itself, and 1000 shared with one other
> > + * process, its PSS will be 1500. - Matt Mackall, lwn.net
> > + */
> > + u64 pss;
> > + /*
> > + * To keep (accumulated) division errors low, we adopt 64bit pss and
> > + * use some low bits for division errors. So (pss >> PSS_DIV_BITS)
> > + * would be the real byte count.
> > + *
> > + * A shift of 12 before division means(assuming 4K page size):
> > + * - 1M 3-user-pages add up to 8KB errors;
> > + * - supports mapcount up to 2^24, or 16M;
> > + * - supports PSS up to 2^52 bytes, or 4PB.
> > + */
> > +#define PSS_DIV_BITS 12
> > };
> >
>
> I know this gets moved again in the eighth patch of the series, but the
> #define still has no place inside the struct definition.
Agreed.
> The pss is going to need accessor functions, preferably inlined, and the
> comment adjusted stating that all accesses should be through those
> functions and not directly to the mem_size_stats struct.
>
> static inline u64 pss_up(unsigned long pss)
> {
> return pss << PSS_DIV_BITS;
> }
>
> static inline unsigned long pss_down(u64 pss)
> {
> return pss >> PSS_DIV_BITS;
> }
I think that's overkill for something that has exactly one use of each.
--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH 1/11] maps3: add proportional set size accounting in smaps
2007-10-16 0:18 ` Matt Mackall
@ 2007-10-16 2:24 ` David Rientjes
0 siblings, 0 replies; 39+ messages in thread
From: David Rientjes @ 2007-10-16 2:24 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Matt Mackall wrote:
> > The pss is going to need accessor functions, preferably inlined, and the
> > comment adjusted stating that all accesses should be through those
> > functions and not directly to the mem_size_stats struct.
> >
> > static inline u64 pss_up(unsigned long pss)
> > {
> > return pss << PSS_DIV_BITS;
> > }
> >
> > static inline unsigned long pss_down(u64 pss)
> > {
> > return pss >> PSS_DIV_BITS;
> > }
>
> I think that's overkill for something that has exactly one use of each.
>
There's no overkill at all; the current uses already apply these
bitshifts, so there's no overhead in using an inlined function
instead.
To access the pss correctly, these bitshifts are required because the
decision was made to use the lower PSS_DIV_BITS for rounding. Thus, you
need to include accessor functions so that the field is always accessed
correctly, now and in the future.
David
* [PATCH 2/11] maps3: introduce task_size_of for all arches
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
2007-10-15 22:25 ` [PATCH 1/11] maps3: add proportional set size accounting in smaps Matt Mackall
@ 2007-10-15 22:25 ` Matt Mackall
2007-10-15 23:45 ` David Rientjes
2007-10-15 22:26 ` [PATCH 3/11] maps3: move is_swap_pte Matt Mackall
` (8 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:25 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
For the /proc/<pid>/pagemap code[1], we need to be able to query how
much virtual address space a particular task has. The trick is
that we do it through /proc and can't use TASK_SIZE since it
references "current" on some arches. The process opening the
/proc file might be a 32-bit process opening a 64-bit process's
pagemap file.
x86_64 already has a TASK_SIZE_OF() macro:
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
I'd like to have that for other architectures. So, add it
for all the architectures that actually use "current" in
their TASK_SIZE. For the others, just add a quick #define
in sched.h to use plain old TASK_SIZE.
1. http://www.linuxworld.com/news/2007/042407-kernel.html
- MIPS portion from Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Matt Mackall <mpm@selenic.com>
---
lxc-dave/include/asm-ia64/processor.h | 3 ++-
lxc-dave/include/asm-mips/processor.h | 4 ++++
lxc-dave/include/asm-parisc/processor.h | 3 ++-
lxc-dave/include/asm-powerpc/processor.h | 4 +++-
lxc-dave/include/asm-s390/processor.h | 2 ++
lxc-dave/include/linux/sched.h | 4 ++++
6 files changed, 17 insertions(+), 3 deletions(-)
Index: l/include/asm-ia64/processor.h
===================================================================
--- l.orig/include/asm-ia64/processor.h 2007-10-09 17:37:58.000000000 -0500
+++ l/include/asm-ia64/processor.h 2007-10-10 11:46:30.000000000 -0500
@@ -31,7 +31,8 @@
* each (assuming 8KB page size), for a total of 8TB of user virtual
* address space.
*/
-#define TASK_SIZE (current->thread.task_size)
+#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
+#define TASK_SIZE TASK_SIZE_OF(current)
/*
* This decides where the kernel will search for a free chunk of vm
Index: l/include/asm-mips/processor.h
===================================================================
--- l.orig/include/asm-mips/processor.h 2007-10-09 17:37:58.000000000 -0500
+++ l/include/asm-mips/processor.h 2007-10-10 11:46:30.000000000 -0500
@@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
* space during mmap's.
*/
#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk) \
+ (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
#endif
#ifdef CONFIG_64BIT
@@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
#define TASK_UNMAPPED_BASE \
(test_thread_flag(TIF_32BIT_ADDR) ? \
PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk) \
+ (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
#endif
#define NUM_FPU_REGS 32
Index: l/include/asm-parisc/processor.h
===================================================================
--- l.orig/include/asm-parisc/processor.h 2007-10-09 17:36:49.000000000 -0500
+++ l/include/asm-parisc/processor.h 2007-10-10 11:46:30.000000000 -0500
@@ -32,7 +32,8 @@
#endif
#define current_text_addr() ({ void *pc; current_ia(pc); pc; })
-#define TASK_SIZE (current->thread.task_size)
+#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
+#define TASK_SIZE (current->thread.task_size)
#define TASK_UNMAPPED_BASE (current->thread.map_base)
#define DEFAULT_TASK_SIZE32 (0xFFF00000UL)
Index: l/include/asm-powerpc/processor.h
===================================================================
--- l.orig/include/asm-powerpc/processor.h 2007-10-09 17:37:58.000000000 -0500
+++ l/include/asm-powerpc/processor.h 2007-10-10 11:46:30.000000000 -0500
@@ -99,7 +99,9 @@ extern struct task_struct *last_task_use
*/
#define TASK_SIZE_USER32 (0x0000000100000000UL - (1*PAGE_SIZE))
-#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
+#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
+ TASK_SIZE_USER32 : TASK_SIZE_USER64)
+#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
TASK_SIZE_USER32 : TASK_SIZE_USER64)
/* This decides where the kernel will search for a free chunk of vm
Index: l/include/asm-s390/processor.h
===================================================================
--- l.orig/include/asm-s390/processor.h 2007-10-09 17:37:58.000000000 -0500
+++ l/include/asm-s390/processor.h 2007-10-10 11:46:30.000000000 -0500
@@ -75,6 +75,8 @@ extern struct task_struct *last_task_use
# define TASK_SIZE (test_thread_flag(TIF_31BIT) ? \
(0x80000000UL) : (0x40000000000UL))
+# define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_31BIT) ? \
+ (0x80000000UL) : (0x40000000000UL))
# define TASK_UNMAPPED_BASE (TASK_SIZE / 2)
# define DEFAULT_TASK_SIZE (0x40000000000UL)
Index: l/include/linux/sched.h
===================================================================
--- l.orig/include/linux/sched.h 2007-10-09 17:37:59.000000000 -0500
+++ l/include/linux/sched.h 2007-10-10 11:46:30.000000000 -0500
@@ -1880,6 +1880,10 @@ static inline void inc_syscw(struct task
}
#endif
+#ifndef TASK_SIZE_OF
+#define TASK_SIZE_OF(tsk) TASK_SIZE
+#endif
+
#endif /* __KERNEL__ */
#endif
* Re: [PATCH 2/11] maps3: introduce task_size_of for all arches
2007-10-15 22:25 ` [PATCH 2/11] maps3: introduce task_size_of for all arches Matt Mackall
@ 2007-10-15 23:45 ` David Rientjes
2007-10-16 0:36 ` Dave Hansen
0 siblings, 1 reply; 39+ messages in thread
From: David Rientjes @ 2007-10-15 23:45 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Matt Mackall wrote:
> Index: l/include/asm-mips/processor.h
> ===================================================================
> --- l.orig/include/asm-mips/processor.h 2007-10-09 17:37:58.000000000 -0500
> +++ l/include/asm-mips/processor.h 2007-10-10 11:46:30.000000000 -0500
> @@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
> * space during mmap's.
> */
> #define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk) \
> + (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
> #endif
>
> #ifdef CONFIG_64BIT
> @@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
> #define TASK_UNMAPPED_BASE \
> (test_thread_flag(TIF_32BIT_ADDR) ? \
> PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk) \
> + (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
> #endif
>
> #define NUM_FPU_REGS 32
These need to use test_tsk_thread_flag(tsk, TIF_32BIT_ADDR).
> Index: l/include/asm-parisc/processor.h
> ===================================================================
> --- l.orig/include/asm-parisc/processor.h 2007-10-09 17:36:49.000000000 -0500
> +++ l/include/asm-parisc/processor.h 2007-10-10 11:46:30.000000000 -0500
> @@ -32,7 +32,8 @@
> #endif
> #define current_text_addr() ({ void *pc; current_ia(pc); pc; })
>
> -#define TASK_SIZE (current->thread.task_size)
> +#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
> +#define TASK_SIZE (current->thread.task_size)
> #define TASK_UNMAPPED_BASE (current->thread.map_base)
>
> #define DEFAULT_TASK_SIZE32 (0xFFF00000UL)
TASK_SIZE_OF() should be defined in terms of TASK_SIZE, just like it is
for ia64.
> Index: l/include/asm-powerpc/processor.h
> ===================================================================
> --- l.orig/include/asm-powerpc/processor.h 2007-10-09 17:37:58.000000000 -0500
> +++ l/include/asm-powerpc/processor.h 2007-10-10 11:46:30.000000000 -0500
> @@ -99,7 +99,9 @@ extern struct task_struct *last_task_use
> */
> #define TASK_SIZE_USER32 (0x0000000100000000UL - (1*PAGE_SIZE))
>
> -#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
> +#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
> + TASK_SIZE_USER32 : TASK_SIZE_USER64)
> +#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
> TASK_SIZE_USER32 : TASK_SIZE_USER64)
>
Same.
> /* This decides where the kernel will search for a free chunk of vm
> Index: l/include/asm-s390/processor.h
> ===================================================================
> --- l.orig/include/asm-s390/processor.h 2007-10-09 17:37:58.000000000 -0500
> +++ l/include/asm-s390/processor.h 2007-10-10 11:46:30.000000000 -0500
> @@ -75,6 +75,8 @@ extern struct task_struct *last_task_use
>
> # define TASK_SIZE (test_thread_flag(TIF_31BIT) ? \
> (0x80000000UL) : (0x40000000000UL))
> +# define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_31BIT) ? \
> + (0x80000000UL) : (0x40000000000UL))
> # define TASK_UNMAPPED_BASE (TASK_SIZE / 2)
> # define DEFAULT_TASK_SIZE (0x40000000000UL)
>
Same.
* Re: [PATCH 2/11] maps3: introduce task_size_of for all arches
2007-10-15 23:45 ` David Rientjes
@ 2007-10-16 0:36 ` Dave Hansen
2007-10-16 2:26 ` David Rientjes
0 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-16 0:36 UTC (permalink / raw)
To: David Rientjes
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
David,
All of your comments looked pretty valid to me. I've refreshed that
patch.
I haven't even compile-tested this so there may be some fat fingering
somewhere. I'll run compile tests on it now.
-- Dave
For the /proc/<pid>/pagemap code[1], we need to be able to query how
much virtual address space a particular task has. The trick is
that we do it through /proc and can't use TASK_SIZE since it
references "current" on some arches. The process opening the
/proc file might be a 32-bit process opening a 64-bit process's
pagemap file.
x86_64 already has a TASK_SIZE_OF() macro:
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
I'd like to have that for other architectures. So, add it
for all the architectures that actually use "current" in
their TASK_SIZE. For the others, just add a quick #define
in sched.h to use plain old TASK_SIZE.
1. http://www.linuxworld.com/news/2007/042407-kernel.html
- MIPS portion from Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Matt Mackall <mpm@selenic.com>
---
lxc-dave/include/asm-ia64/processor.h | 3 ++-
lxc-dave/include/asm-mips/processor.h | 4 ++++
lxc-dave/include/asm-parisc/processor.h | 3 ++-
lxc-dave/include/asm-powerpc/processor.h | 3 ++-
lxc-dave/include/asm-s390/processor.h | 3 ++-
lxc-dave/include/linux/sched.h | 4 ++++
6 files changed, 16 insertions(+), 4 deletions(-)
diff -puN include/asm-ia64/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-ia64/processor.h
--- lxc/include/asm-ia64/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-ia64/processor.h 2007-10-15 17:29:22.000000000 -0700
@@ -31,7 +31,8 @@
* each (assuming 8KB page size), for a total of 8TB of user virtual
* address space.
*/
-#define TASK_SIZE (current->thread.task_size)
+#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
+#define TASK_SIZE TASK_SIZE_OF(current)
/*
* This decides where the kernel will search for a free chunk of vm
diff -puN include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-mips/processor.h
--- lxc/include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-mips/processor.h 2007-10-15 17:34:12.000000000 -0700
@@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
* space during mmap's.
*/
#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk) \
+ (test_tsk_thread_flag(tak, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
#endif
#ifdef CONFIG_64BIT
@@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
#define TASK_UNMAPPED_BASE \
(test_thread_flag(TIF_32BIT_ADDR) ? \
PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk) \
+ (test_tsk_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
#endif
#define NUM_FPU_REGS 32
diff -puN include/asm-parisc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-parisc/processor.h
--- lxc/include/asm-parisc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-parisc/processor.h 2007-10-15 17:31:39.000000000 -0700
@@ -32,7 +32,8 @@
#endif
#define current_text_addr() ({ void *pc; current_ia(pc); pc; })
-#define TASK_SIZE (current->thread.task_size)
+#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
+#define TASK_SIZE TASK_SIZE_OF(current)
#define TASK_UNMAPPED_BASE (current->thread.map_base)
#define DEFAULT_TASK_SIZE32 (0xFFF00000UL)
diff -puN include/asm-powerpc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-powerpc/processor.h
--- lxc/include/asm-powerpc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-powerpc/processor.h 2007-10-15 17:32:00.000000000 -0700
@@ -99,8 +99,9 @@ extern struct task_struct *last_task_use
*/
#define TASK_SIZE_USER32 (0x0000000100000000UL - (1*PAGE_SIZE))
-#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
+#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
TASK_SIZE_USER32 : TASK_SIZE_USER64)
+#define TASK_SIZE TASK_SIZE_OF(current)
/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
diff -puN include/asm-s390/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-s390/processor.h
--- lxc/include/asm-s390/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-s390/processor.h 2007-10-15 17:32:31.000000000 -0700
@@ -73,8 +73,9 @@ extern struct task_struct *last_task_use
#else /* __s390x__ */
-# define TASK_SIZE (test_thread_flag(TIF_31BIT) ? \
+# define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_31BIT) ? \
(0x80000000UL) : (0x40000000000UL))
+# define TASK_SIZE TASK_SIZE_OF(current)
# define TASK_UNMAPPED_BASE (TASK_SIZE / 2)
# define DEFAULT_TASK_SIZE (0x40000000000UL)
diff -puN include/linux/sched.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/linux/sched.h
--- lxc/include/linux/sched.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/linux/sched.h 2007-10-15 17:32:42.000000000 -0700
@@ -1953,6 +1953,10 @@ static inline void inc_syscw(struct task
}
#endif
+#ifndef TASK_SIZE_OF
+#define TASK_SIZE_OF(tsk) TASK_SIZE
+#endif
+
#endif /* __KERNEL__ */
#endif
_
* Re: [PATCH 2/11] maps3: introduce task_size_of for all arches
2007-10-16 0:36 ` Dave Hansen
@ 2007-10-16 2:26 ` David Rientjes
2007-10-16 17:18 ` maps3: introduce task_size_of for all arches (updated v4) Dave Hansen
0 siblings, 1 reply; 39+ messages in thread
From: David Rientjes @ 2007-10-16 2:26 UTC (permalink / raw)
To: Dave Hansen
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Dave Hansen wrote:
> diff -puN include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-mips/processor.h
> --- lxc/include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
> +++ lxc-dave/include/asm-mips/processor.h 2007-10-15 17:34:12.000000000 -0700
> @@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
> * space during mmap's.
> */
> #define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk) \
> + (test_tsk_thread_flag(tak, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
> #endif
>
> #ifdef CONFIG_64BIT
tak needs to be tsk.
> @@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
> #define TASK_UNMAPPED_BASE \
> (test_thread_flag(TIF_32BIT_ADDR) ? \
> PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk) \
> + (test_tsk_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
> #endif
>
> #define NUM_FPU_REGS 32
test_tsk_thread_flag() takes two arguments.
* maps3: introduce task_size_of for all arches (updated v4)
2007-10-16 2:26 ` David Rientjes
@ 2007-10-16 17:18 ` Dave Hansen
2007-10-16 17:25 ` David Rientjes
0 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-16 17:18 UTC (permalink / raw)
To: David Rientjes
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
The following replaces the earlier patches sent. It should address
David Rientjes's comments, and has been compile-tested on all the
architectures that it touches, save for parisc.
----
For the /proc/<pid>/pagemap code[1], we need to be able to query how
much virtual address space a particular task has. The trick is
that we do it through /proc and can't use TASK_SIZE since it
references "current" on some arches. The process opening the
/proc file might be a 32-bit process opening a 64-bit process's
pagemap file.
x86_64 already has a TASK_SIZE_OF() macro:
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
I'd like to have that for other architectures. So, add it
for all the architectures that actually use "current" in
their TASK_SIZE. For the others, just add a quick #define
in sched.h to use plain old TASK_SIZE.
1. http://www.linuxworld.com/news/2007/042407-kernel.html
- MIPS portion from Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Matt Mackall <mpm@selenic.com>
---
lxc-dave/include/asm-ia64/processor.h | 3 ++-
lxc-dave/include/asm-mips/processor.h | 4 ++++
lxc-dave/include/asm-parisc/processor.h | 3 ++-
lxc-dave/include/asm-powerpc/processor.h | 3 ++-
lxc-dave/include/asm-s390/processor.h | 3 ++-
lxc-dave/include/linux/sched.h | 4 ++++
6 files changed, 16 insertions(+), 4 deletions(-)
diff -puN include/asm-ia64/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-ia64/processor.h
--- lxc/include/asm-ia64/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-ia64/processor.h 2007-10-15 17:29:22.000000000 -0700
@@ -31,7 +31,8 @@
* each (assuming 8KB page size), for a total of 8TB of user virtual
* address space.
*/
-#define TASK_SIZE (current->thread.task_size)
+#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
+#define TASK_SIZE TASK_SIZE_OF(current)
/*
* This decides where the kernel will search for a free chunk of vm
diff -puN include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-mips/processor.h
--- lxc/include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-mips/processor.h 2007-10-16 09:26:44.000000000 -0700
@@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
* space during mmap's.
*/
#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk) \
+ (test_tsk_thread_flag(tsk, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
#endif
#ifdef CONFIG_64BIT
@@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
#define TASK_UNMAPPED_BASE \
(test_thread_flag(TIF_32BIT_ADDR) ? \
PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk) \
+ (test_tsk_thread_flag(tsk, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
#endif
#define NUM_FPU_REGS 32
diff -puN include/asm-parisc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-parisc/processor.h
--- lxc/include/asm-parisc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-parisc/processor.h 2007-10-15 17:31:39.000000000 -0700
@@ -32,7 +32,8 @@
#endif
#define current_text_addr() ({ void *pc; current_ia(pc); pc; })
-#define TASK_SIZE (current->thread.task_size)
+#define TASK_SIZE_OF(tsk) ((tsk)->thread.task_size)
+#define TASK_SIZE TASK_SIZE_OF(current)
#define TASK_UNMAPPED_BASE (current->thread.map_base)
#define DEFAULT_TASK_SIZE32 (0xFFF00000UL)
diff -puN include/asm-powerpc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-powerpc/processor.h
--- lxc/include/asm-powerpc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-powerpc/processor.h 2007-10-15 17:32:00.000000000 -0700
@@ -99,8 +99,9 @@ extern struct task_struct *last_task_use
*/
#define TASK_SIZE_USER32 (0x0000000100000000UL - (1*PAGE_SIZE))
-#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
+#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
TASK_SIZE_USER32 : TASK_SIZE_USER64)
+#define TASK_SIZE TASK_SIZE_OF(current)
/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
diff -puN include/asm-s390/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-s390/processor.h
--- lxc/include/asm-s390/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/asm-s390/processor.h 2007-10-15 17:32:31.000000000 -0700
@@ -73,8 +73,9 @@ extern struct task_struct *last_task_use
#else /* __s390x__ */
-# define TASK_SIZE (test_thread_flag(TIF_31BIT) ? \
+# define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_31BIT) ? \
(0x80000000UL) : (0x40000000000UL))
+# define TASK_SIZE TASK_SIZE_OF(current)
# define TASK_UNMAPPED_BASE (TASK_SIZE / 2)
# define DEFAULT_TASK_SIZE (0x40000000000UL)
diff -puN include/linux/sched.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/linux/sched.h
--- lxc/include/linux/sched.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.000000000 -0700
+++ lxc-dave/include/linux/sched.h 2007-10-15 17:32:42.000000000 -0700
@@ -1953,6 +1953,10 @@ static inline void inc_syscw(struct task
}
#endif
+#ifndef TASK_SIZE_OF
+#define TASK_SIZE_OF(tsk) TASK_SIZE
+#endif
+
#endif /* __KERNEL__ */
#endif
_
-- Dave
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: maps3: introduce task_size_of for all arches (updated v4)
2007-10-16 17:18 ` maps3: introduce task_size_of for all arches (updated v4) Dave Hansen
@ 2007-10-16 17:25 ` David Rientjes
0 siblings, 0 replies; 39+ messages in thread
From: David Rientjes @ 2007-10-16 17:25 UTC (permalink / raw)
To: Dave Hansen
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Tue, 16 Oct 2007, Dave Hansen wrote:
> For the /proc/<pid>/pagemap code[1], we need to able to query how
> much virtual address space a particular task has. The trick is
> that we do it through /proc and can't use TASK_SIZE since it
> references "current" on some arches. The process opening the
> /proc file might be a 32-bit process opening a 64-bit process's
> pagemap file.
>
> x86_64 already has a TASK_SIZE_OF() macro:
>
> #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
>
> I'd like to have that for other architectures. So, add it
> for all the architectures that actually use "current" in
> their TASK_SIZE. For the others, just add a quick #define
> in sched.h to use plain old TASK_SIZE.
>
> 1. http://www.linuxworld.com/news/2007/042407-kernel.html
>
> - MIPS portion from Ralf Baechle <ralf@linux-mips.org>
>
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
> Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: David Rientjes <rientjes@google.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 3/11] maps3: move is_swap_pte
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
2007-10-15 22:25 ` [PATCH 1/11] maps3: add proportional set size accounting in smaps Matt Mackall
2007-10-15 22:25 ` [PATCH 2/11] maps3: introduce task_size_of for all arches Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:26 ` [PATCH 4/11] maps3: introduce a generic page walker Matt Mackall
` (7 subsequent siblings)
10 siblings, 0 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
Move the is_swap_pte helper function to swapops.h for use by the pagemap code.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Index: l/include/linux/swapops.h
===================================================================
--- l.orig/include/linux/swapops.h 2007-10-09 17:36:25.000000000 -0500
+++ l/include/linux/swapops.h 2007-10-10 11:46:34.000000000 -0500
@@ -42,6 +42,12 @@ static inline pgoff_t swp_offset(swp_ent
return entry.val & SWP_OFFSET_MASK(entry);
}
+/* check whether a pte points to a swap entry */
+static inline int is_swap_pte(pte_t pte)
+{
+ return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
+}
+
/*
* Convert the arch-dependent pte representation of a swp_entry_t into an
* arch-independent swp_entry_t.
Index: l/mm/migrate.c
===================================================================
--- l.orig/mm/migrate.c 2007-10-09 17:37:59.000000000 -0500
+++ l/mm/migrate.c 2007-10-10 11:46:34.000000000 -0500
@@ -114,11 +114,6 @@ int putback_lru_pages(struct list_head *
return count;
}
-static inline int is_swap_pte(pte_t pte)
-{
- return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
-}
-
/*
* Restore a potential migration pte to a working pte entry
*/
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 4/11] maps3: introduce a generic page walker
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (2 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 3/11] maps3: move is_swap_pte Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:40 ` Jeremy Fitzhardinge
2007-10-16 4:58 ` David Rientjes
2007-10-15 22:26 ` [PATCH 5/11] maps3: use pagewalker in clear_refs and smaps Matt Mackall
` (6 subsequent siblings)
10 siblings, 2 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
Introduce a general page table walker
Signed-off-by: Matt Mackall <mpm@selenic.com>
Index: l/include/linux/mm.h
===================================================================
--- l.orig/include/linux/mm.h 2007-10-09 17:37:59.000000000 -0500
+++ l/include/linux/mm.h 2007-10-10 11:46:37.000000000 -0500
@@ -773,6 +773,17 @@ unsigned long unmap_vmas(struct mmu_gath
struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
+
+struct mm_walk {
+ int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
+ int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
+ int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
+ int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
+ int (*pte_hole) (unsigned long, unsigned long, void *);
+};
+
+int walk_page_range(struct mm_struct *, unsigned long addr, unsigned long end,
+ struct mm_walk *walk, void *private);
void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
Index: l/mm/Makefile
===================================================================
--- l.orig/mm/Makefile 2007-10-09 17:37:59.000000000 -0500
+++ l/mm/Makefile 2007-10-10 11:46:37.000000000 -0500
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o
+ vmalloc.o pagewalk.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
Index: l/mm/pagewalk.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ l/mm/pagewalk.c 2007-10-10 11:46:37.000000000 -0500
@@ -0,0 +1,120 @@
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/sched.h>
+
+static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk, void *private)
+{
+ pte_t *pte;
+ int err = 0;
+
+ pte = pte_offset_map(pmd, addr);
+ do {
+ err = walk->pte_entry(pte, addr, addr, private);
+ if (err)
+ break;
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+
+ pte_unmap(pte);
+ return err;
+}
+
+static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+ struct mm_walk *walk, void *private)
+{
+ pmd_t *pmd;
+ unsigned long next;
+ int err = 0;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none_or_clear_bad(pmd)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, private);
+ if (err)
+ break;
+ continue;
+ }
+ if (walk->pmd_entry)
+ err = walk->pmd_entry(pmd, addr, next, private);
+ if (!err && walk->pte_entry)
+ err = walk_pte_range(pmd, addr, next, walk, private);
+ if (err)
+ break;
+ } while (pmd++, addr = next, addr != end);
+
+ return err;
+}
+
+static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk, void *private)
+{
+ pud_t *pud;
+ unsigned long next;
+ int err = 0;
+
+ pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none_or_clear_bad(pud)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, private);
+ if (err)
+ break;
+ continue;
+ }
+ if (walk->pud_entry)
+ err = walk->pud_entry(pud, addr, next, private);
+ if (!err && (walk->pmd_entry || walk->pte_entry))
+ err = walk_pmd_range(pud, addr, next, walk, private);
+ if (err)
+ break;
+ } while (pud++, addr = next, addr != end);
+
+ return err;
+}
+
+/*
+ * walk_page_range - walk a memory map's page tables with a callback
+ * @mm - memory map to walk
+ * @addr - starting address
+ * @end - ending address
+ * @walk - set of callbacks to invoke for each level of the tree
+ * @private - private data passed to the callback function
+ *
+ * Recursively walk the page table for the memory area in a VMA, calling
+ * a callback for every bottom-level (PTE) page table.
+ */
+int walk_page_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk, void *private)
+{
+ pgd_t *pgd;
+ unsigned long next;
+ int err = 0;
+
+ if (addr >= end)
+ return err;
+
+ pgd = pgd_offset(mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none_or_clear_bad(pgd)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, private);
+ if (err)
+ break;
+ continue;
+ }
+ if (walk->pgd_entry)
+ err = walk->pgd_entry(pgd, addr, next, private);
+ if (!err &&
+ (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
+ err = walk_pud_range(pgd, addr, next, walk, private);
+ if (err)
+ return err;
+ } while (pgd++, addr = next, addr != end);
+
+ return err;
+}
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 4/11] maps3: introduce a generic page walker
2007-10-15 22:26 ` [PATCH 4/11] maps3: introduce a generic page walker Matt Mackall
@ 2007-10-15 22:40 ` Jeremy Fitzhardinge
2007-10-15 23:05 ` Dave Hansen
2007-10-15 23:30 ` Matt Mackall
2007-10-16 4:58 ` David Rientjes
1 sibling, 2 replies; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2007-10-15 22:40 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
David Rientjes, Fengguang Wu
Matt Mackall wrote:
> Introduce a general page table walker
>
Definitely approve in principle, but some comments:
> Signed-off-by: Matt Mackall <mpm@selenic.com>
>
> Index: l/include/linux/mm.h
> ===================================================================
> --- l.orig/include/linux/mm.h 2007-10-09 17:37:59.000000000 -0500
> +++ l/include/linux/mm.h 2007-10-10 11:46:37.000000000 -0500
> @@ -773,6 +773,17 @@ unsigned long unmap_vmas(struct mmu_gath
> struct vm_area_struct *start_vma, unsigned long start_addr,
> unsigned long end_addr, unsigned long *nr_accounted,
> struct zap_details *);
> +
> +struct mm_walk {
> + int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
> + int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
> + int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
> + int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
> + int (*pte_hole) (unsigned long, unsigned long, void *);
> +};
>
It would be nice to have some clue about when each of these functions
is called (depth first? pre or post order?), and what their params
are. Does it call a callback for folded pagetable levels?
Can pte_hole be used to create new mappings while we're traversing the
pagetable? Apparently not, because it continues after calling it.
> +
> +int walk_page_range(struct mm_struct *, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private);
> void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
> unsigned long end, unsigned long floor, unsigned long ceiling);
> void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
> Index: l/mm/Makefile
> ===================================================================
> --- l.orig/mm/Makefile 2007-10-09 17:37:59.000000000 -0500
> +++ l/mm/Makefile 2007-10-10 11:46:37.000000000 -0500
> @@ -5,7 +5,7 @@
> mmu-y := nommu.o
> mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
> mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> - vmalloc.o
> + vmalloc.o pagewalk.o
>
> obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
> page_alloc.o page-writeback.o pdflush.o \
> Index: l/mm/pagewalk.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ l/mm/pagewalk.c 2007-10-10 11:46:37.000000000 -0500
> @@ -0,0 +1,120 @@
> +#include <linux/mm.h>
> +#include <linux/highmem.h>
> +#include <linux/sched.h>
> +
> +static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pte_t *pte;
> + int err = 0;
> +
> + pte = pte_offset_map(pmd, addr);
> + do {
> + err = walk->pte_entry(pte, addr, addr, private);
>
Should this be (pte, addr, addr+PAGE_SIZE, private)? Is the second addr
argument for the address range being mapped by this thing? Why pass
addr twice?
> + if (err)
> + break;
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + pte_unmap(pte);
> + return err;
> +}
> +
> +static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> + int err = 0;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> + if (pmd_none_or_clear_bad(pmd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pmd_entry)
> + err = walk->pmd_entry(pmd, addr, next, private);
> + if (!err && walk->pte_entry)
> + err = walk_pte_range(pmd, addr, next, walk, private);
> + if (err)
> + break;
> + } while (pmd++, addr = next, addr != end);
> +
> + return err;
> +}
> +
> +static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pud_t *pud;
> + unsigned long next;
> + int err = 0;
> +
> + pud = pud_offset(pgd, addr);
> + do {
> + next = pud_addr_end(addr, end);
> + if (pud_none_or_clear_bad(pud)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pud_entry)
> + err = walk->pud_entry(pud, addr, next, private);
> + if (!err && (walk->pmd_entry || walk->pte_entry))
> + err = walk_pmd_range(pud, addr, next, walk, private);
> + if (err)
> + break;
> + } while (pud++, addr = next, addr != end);
> +
> + return err;
> +}
> +
> +/*
> + * walk_page_range - walk a memory map's page tables with a callback
> + * @mm - memory map to walk
> + * @addr - starting address
> + * @end - ending address
> + * @walk - set of callbacks to invoke for each level of the tree
> + * @private - private data passed to the callback function
> + *
> + * Recursively walk the page table for the memory area in a VMA, calling
> + * a callback for every bottom-level (PTE) page table.
>
It calls a callback for every level of the pagetable.
> + */
> +int walk_page_range(struct mm_struct *mm,
> + unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pgd_t *pgd;
> + unsigned long next;
> + int err = 0;
> +
> + if (addr >= end)
> + return err;
> +
> + pgd = pgd_offset(mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none_or_clear_bad(pgd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pgd_entry)
> + err = walk->pgd_entry(pgd, addr, next, private);
> + if (!err &&
> + (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
> + err = walk_pud_range(pgd, addr, next, walk, private);
> + if (err)
> + return err;
> + } while (pgd++, addr = next, addr != end);
> +
> + return err;
> +}
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 4/11] maps3: introduce a generic page walker
2007-10-15 22:40 ` Jeremy Fitzhardinge
@ 2007-10-15 23:05 ` Dave Hansen
2007-10-15 23:20 ` Jeremy Fitzhardinge
2007-10-15 23:30 ` Matt Mackall
1 sibling, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-15 23:05 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
David Rientjes, Fengguang Wu, ADAM G. LITKE [imap]
On Mon, 2007-10-15 at 15:40 -0700, Jeremy Fitzhardinge wrote:
> Can pte_hole be used to create new mappings while we're traversing the
> pagetable? Apparently not, because it continues after calling it.
For now, we should probably document that these functions assume that
the appropriate locks are held, and that there are no changes being made
to the pagetables as we walk.
However, I can see that people might want to use these in the future for
establishing ptes. Perhaps a special code coming back from the
->pte_hole() function could indicate changes were made to the
pagetables. I guess we could at least retry part of the loop where the
hole call was made, like:
+int walk_page_range(struct mm_struct *mm,...
+{
...
+ pgd = pgd_offset(mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none_or_clear_bad(pgd)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, private);
if (err == -EAGAIN) { // or whatever we want
pgd--;
err = 0;
}
+ if (err)
+ break;
+ continue;
+ }
That wouldn't allow changes behind the walker, but it should allow them
in the range that was walked by the ->pte_hole() function.
-- Dave
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 4/11] maps3: introduce a generic page walker
2007-10-15 23:05 ` Dave Hansen
@ 2007-10-15 23:20 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2007-10-15 23:20 UTC (permalink / raw)
To: Dave Hansen
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
David Rientjes, Fengguang Wu, ADAM G. LITKE [imap]
Dave Hansen wrote:
> For now, we should probably document that these functions assume that
> the appropriate locks are held, and that there are no changes being made
> to the pagetables as we walk.
>
> However, I can see that people might want to use these in the future for
> establishing ptes. Perhaps a special code coming back from the
> ->pte_hole() function could indicate changes were made to the
> pagetables. I guess we could at least retry part of the loop where the
> hole call was made, like:
>
Yes. We already have apply_to_page_range(), which has the side effect
of creating the page range in order to apply a function to it. It would
be nice to be able to replicate its functionality with this page walker
so we can have just one.
> +int walk_page_range(struct mm_struct *mm,...
> +{
> ...
> + pgd = pgd_offset(mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none_or_clear_bad(pgd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> if (err == -EAGAIN) { // or whatever we want
> pgd--;
> err = 0;
> }
> + if (err)
> + break;
> + continue;
> + }
>
> That wouldn't allow changes behind the walker, but it should allow them
> in the range that was walked by the ->pte_hole() function.
>
Yep.
J
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 4/11] maps3: introduce a generic page walker
2007-10-15 22:40 ` Jeremy Fitzhardinge
2007-10-15 23:05 ` Dave Hansen
@ 2007-10-15 23:30 ` Matt Mackall
1 sibling, 0 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 23:30 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
David Rientjes, Fengguang Wu
On Mon, Oct 15, 2007 at 03:40:27PM -0700, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > Introduce a general page table walker
> >
>
> Definitely approve in principle, but some comments:
>
> > Signed-off-by: Matt Mackall <mpm@selenic.com>
> >
> > Index: l/include/linux/mm.h
> > ===================================================================
> > --- l.orig/include/linux/mm.h 2007-10-09 17:37:59.000000000 -0500
> > +++ l/include/linux/mm.h 2007-10-10 11:46:37.000000000 -0500
> > @@ -773,6 +773,17 @@ unsigned long unmap_vmas(struct mmu_gath
> > struct vm_area_struct *start_vma, unsigned long start_addr,
> > unsigned long end_addr, unsigned long *nr_accounted,
> > struct zap_details *);
> > +
> > +struct mm_walk {
> > + int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
> > + int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
> > + int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
> > + int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
> > + int (*pte_hole) (unsigned long, unsigned long, void *);
> > +};
> >
>
> It would be nice to have some clue about when each of these functions
> is called (depth first? pre or post order?), and what their params
> are. Does it call a callback for folded pagetable levels?
>
> Can pte_hole be used to create new mappings while we're traversing the
> pagetable? Apparently not, because it continues after calling it.
>
> > +
> > +int walk_page_range(struct mm_struct *, unsigned long addr, unsigned long end,
> > + struct mm_walk *walk, void *private);
> > void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
> > unsigned long end, unsigned long floor, unsigned long ceiling);
> > void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
> > Index: l/mm/Makefile
> > ===================================================================
> > --- l.orig/mm/Makefile 2007-10-09 17:37:59.000000000 -0500
> > +++ l/mm/Makefile 2007-10-10 11:46:37.000000000 -0500
> > @@ -5,7 +5,7 @@
> > mmu-y := nommu.o
> > mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
> > mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> > - vmalloc.o
> > + vmalloc.o pagewalk.o
> >
> > obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
> > page_alloc.o page-writeback.o pdflush.o \
> > Index: l/mm/pagewalk.c
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ l/mm/pagewalk.c 2007-10-10 11:46:37.000000000 -0500
> > @@ -0,0 +1,120 @@
> > +#include <linux/mm.h>
> > +#include <linux/highmem.h>
> > +#include <linux/sched.h>
> > +
> > +static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > + struct mm_walk *walk, void *private)
> > +{
> > + pte_t *pte;
> > + int err = 0;
> > +
> > + pte = pte_offset_map(pmd, addr);
> > + do {
> > + err = walk->pte_entry(pte, addr, addr, private);
> >
>
> Should this be (pte, addr, addr+PAGE_SIZE, private)?
Probably - the pattern is [start, end). Either that or we should have
one arg.
> > +/*
> > + * walk_page_range - walk a memory map's page tables with a callback
> > + * @mm - memory map to walk
> > + * @addr - starting address
> > + * @end - ending address
> > + * @walk - set of callbacks to invoke for each level of the tree
> > + * @private - private data passed to the callback function
> > + *
> > + * Recursively walk the page table for the memory area in a VMA, calling
> > + * a callback for every bottom-level (PTE) page table.
> >
>
> It calls a callback for every level of the pagetable.
Oops.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 4/11] maps3: introduce a generic page walker
2007-10-15 22:26 ` [PATCH 4/11] maps3: introduce a generic page walker Matt Mackall
2007-10-15 22:40 ` Jeremy Fitzhardinge
@ 2007-10-16 4:58 ` David Rientjes
1 sibling, 0 replies; 39+ messages in thread
From: David Rientjes @ 2007-10-16 4:58 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Matt Mackall wrote:
> Introduce a general page table walker
>
> Signed-off-by: Matt Mackall <mpm@selenic.com>
>
> Index: l/include/linux/mm.h
> ===================================================================
> --- l.orig/include/linux/mm.h 2007-10-09 17:37:59.000000000 -0500
> +++ l/include/linux/mm.h 2007-10-10 11:46:37.000000000 -0500
> @@ -773,6 +773,17 @@ unsigned long unmap_vmas(struct mmu_gath
> struct vm_area_struct *start_vma, unsigned long start_addr,
> unsigned long end_addr, unsigned long *nr_accounted,
> struct zap_details *);
> +
> +struct mm_walk {
> + int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
> + int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
> + int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
> + int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
> + int (*pte_hole) (unsigned long, unsigned long, void *);
> +};
> +
> +int walk_page_range(struct mm_struct *, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private);
The struct mm_walk * can be qualified as const.
> Index: l/mm/pagewalk.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ l/mm/pagewalk.c 2007-10-10 11:46:37.000000000 -0500
> @@ -0,0 +1,120 @@
> +#include <linux/mm.h>
> +#include <linux/highmem.h>
> +#include <linux/sched.h>
> +
> +static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pte_t *pte;
> + int err = 0;
> +
> + pte = pte_offset_map(pmd, addr);
> + do {
> + err = walk->pte_entry(pte, addr, addr, private);
> + if (err)
> + break;
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + pte_unmap(pte);
> + return err;
> +}
> +
> +static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> + int err = 0;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> + if (pmd_none_or_clear_bad(pmd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pmd_entry)
> + err = walk->pmd_entry(pmd, addr, next, private);
> + if (!err && walk->pte_entry)
> + err = walk_pte_range(pmd, addr, next, walk, private);
> + if (err)
> + break;
> + } while (pmd++, addr = next, addr != end);
> +
> + return err;
> +}
> +
> +static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pud_t *pud;
> + unsigned long next;
> + int err = 0;
> +
> + pud = pud_offset(pgd, addr);
> + do {
> + next = pud_addr_end(addr, end);
> + if (pud_none_or_clear_bad(pud)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pud_entry)
> + err = walk->pud_entry(pud, addr, next, private);
> + if (!err && (walk->pmd_entry || walk->pte_entry))
> + err = walk_pmd_range(pud, addr, next, walk, private);
> + if (err)
> + break;
> + } while (pud++, addr = next, addr != end);
> +
> + return err;
> +}
> +
> +/*
> + * walk_page_range - walk a memory map's page tables with a callback
> + * @mm - memory map to walk
> + * @addr - starting address
> + * @end - ending address
> + * @walk - set of callbacks to invoke for each level of the tree
> + * @private - private data passed to the callback function
> + *
> + * Recursively walk the page table for the memory area in a VMA, calling
> + * a callback for every bottom-level (PTE) page table.
> + */
> +int walk_page_range(struct mm_struct *mm,
> + unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pgd_t *pgd;
> + unsigned long next;
> + int err = 0;
> +
> + if (addr >= end)
> + return err;
unlikely?
> +
> + pgd = pgd_offset(mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none_or_clear_bad(pgd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pgd_entry)
> + err = walk->pgd_entry(pgd, addr, next, private);
> + if (!err &&
> + (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
> + err = walk_pud_range(pgd, addr, next, walk, private);
> + if (err)
> + return err;
break instead.
> + } while (pgd++, addr = next, addr != end);
> +
> + return err;
> +}
Should walk_page_range be exported?
Is it trivial to convert ioremap to use this new pte walker?
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 5/11] maps3: use pagewalker in clear_refs and smaps
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (3 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 4/11] maps3: introduce a generic page walker Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-16 5:03 ` David Rientjes
2007-10-15 22:26 ` [PATCH 6/11] maps3: simplify interdependence of maps " Matt Mackall
` (5 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
Use the generic pagewalker for smaps and clear_refs
Signed-off-by: Matt Mackall <mpm@selenic.com>
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-14 13:36:56.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-14 13:37:08.000000000 -0500
@@ -116,6 +116,7 @@ static void pad_len_spaces(struct seq_fi
struct mem_size_stats
{
+ struct vm_area_struct *vma;
unsigned long resident;
unsigned long shared_clean;
unsigned long shared_dirty;
@@ -145,13 +146,6 @@ struct mem_size_stats
#define PSS_DIV_BITS 12
};
-struct pmd_walker {
- struct vm_area_struct *vma;
- void *private;
- void (*action)(struct vm_area_struct *, pmd_t *, unsigned long,
- unsigned long, void *);
-};
-
static int show_map_internal(struct seq_file *m, void *v, struct mem_size_stats *mss)
{
struct proc_maps_private *priv = m->private;
@@ -241,11 +235,11 @@ static int show_map(struct seq_file *m,
return show_map_internal(m, v, NULL);
}
-static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end,
- void *private)
+static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ void *private)
{
struct mem_size_stats *mss = private;
+ struct vm_area_struct *vma = mss->vma;
pte_t *pte, ptent;
spinlock_t *ptl;
struct page *page;
@@ -283,12 +277,13 @@ static void smaps_pte_range(struct vm_ar
}
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+ return 0;
}
-static void clear_refs_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end,
- void *private)
+static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, void *private)
{
+ struct vm_area_struct *vma = private;
pte_t *pte, ptent;
spinlock_t *ptl;
struct page *page;
@@ -309,71 +304,10 @@ static void clear_refs_pte_range(struct
}
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+ return 0;
}
-static inline void walk_pmd_range(struct pmd_walker *walker, pud_t *pud,
- unsigned long addr, unsigned long end)
-{
- pmd_t *pmd;
- unsigned long next;
-
- for (pmd = pmd_offset(pud, addr); addr != end;
- pmd++, addr = next) {
- next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
- continue;
- walker->action(walker->vma, pmd, addr, next, walker->private);
- }
-}
-
-static inline void walk_pud_range(struct pmd_walker *walker, pgd_t *pgd,
- unsigned long addr, unsigned long end)
-{
- pud_t *pud;
- unsigned long next;
-
- for (pud = pud_offset(pgd, addr); addr != end;
- pud++, addr = next) {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- continue;
- walk_pmd_range(walker, pud, addr, next);
- }
-}
-
-/*
- * walk_page_range - walk the page tables of a VMA with a callback
- * @vma - VMA to walk
- * @action - callback invoked for every bottom-level (PTE) page table
- * @private - private data passed to the callback function
- *
- * Recursively walk the page table for the memory area in a VMA, calling
- * a callback for every bottom-level (PTE) page table.
- */
-static inline void walk_page_range(struct vm_area_struct *vma,
- void (*action)(struct vm_area_struct *,
- pmd_t *, unsigned long,
- unsigned long, void *),
- void *private)
-{
- unsigned long addr = vma->vm_start;
- unsigned long end = vma->vm_end;
- struct pmd_walker walker = {
- .vma = vma,
- .private = private,
- .action = action,
- };
- pgd_t *pgd;
- unsigned long next;
-
- for (pgd = pgd_offset(vma->vm_mm, addr); addr != end;
- pgd++, addr = next) {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- walk_pud_range(&walker, pgd, addr, next);
- }
-}
+static struct mm_walk smaps_walk = { .pmd_entry = smaps_pte_range };
static int show_smap(struct seq_file *m, void *v)
{
@@ -381,11 +315,15 @@ static int show_smap(struct seq_file *m,
struct mem_size_stats mss;
memset(&mss, 0, sizeof mss);
+ mss.vma = vma;
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
- walk_page_range(vma, smaps_pte_range, &mss);
+ walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
+ &smaps_walk, &mss);
return show_map_internal(m, v, &mss);
}
+static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
+
void clear_refs_smap(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -393,7 +331,8 @@ void clear_refs_smap(struct mm_struct *m
down_read(&mm->mmap_sem);
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
- walk_page_range(vma, clear_refs_pte_range, NULL);
+ walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
+ &clear_refs_walk, vma);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
}
* [PATCH 6/11] maps3: simplify interdependence of maps and smaps
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (4 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 5/11] maps3: use pagewalker in clear_refs and smaps Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:26 ` [PATCH 7/11] maps3: move clear_refs code to task_mmu.c Matt Mackall
` (4 subsequent siblings)
10 siblings, 0 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
From: Matt Mackall <mpm@selenic.com>
This pulls the smaps statistics display code out of show_map_internal and
into show_smap, where it belongs; show_map_internal then collapses back into
show_map.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-14 13:37:08.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-14 13:38:43.000000000 -0500
@@ -146,7 +146,7 @@ struct mem_size_stats
#define PSS_DIV_BITS 12
};
-static int show_map_internal(struct seq_file *m, void *v, struct mem_size_stats *mss)
+static int show_map(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
struct task_struct *task = priv->task;
@@ -206,35 +206,11 @@ static int show_map_internal(struct seq_
}
seq_putc(m, '\n');
- if (mss)
- seq_printf(m,
- "Size: %8lu kB\n"
- "Rss: %8lu kB\n"
- "Pss: %8lu kB\n"
- "Shared_Clean: %8lu kB\n"
- "Shared_Dirty: %8lu kB\n"
- "Private_Clean: %8lu kB\n"
- "Private_Dirty: %8lu kB\n"
- "Referenced: %8lu kB\n",
- (vma->vm_end - vma->vm_start) >> 10,
- mss->resident >> 10,
- (unsigned long)(mss->pss >> (10 + PSS_DIV_BITS)),
- mss->shared_clean >> 10,
- mss->shared_dirty >> 10,
- mss->private_clean >> 10,
- mss->private_dirty >> 10,
- mss->referenced >> 10);
-
if (m->count < m->size) /* vma is copied successfully */
m->version = (vma != get_gate_vma(task))? vma->vm_start: 0;
return 0;
}
-static int show_map(struct seq_file *m, void *v)
-{
- return show_map_internal(m, v, NULL);
-}
-
static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
void *private)
{
@@ -313,13 +289,37 @@ static int show_smap(struct seq_file *m,
{
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
+ int ret;
memset(&mss, 0, sizeof mss);
mss.vma = vma;
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&smaps_walk, &mss);
- return show_map_internal(m, v, &mss);
+
+ ret = show_map(m, v);
+ if (ret)
+ return ret;
+
+ seq_printf(m,
+ "Size: %8lu kB\n"
+ "Rss: %8lu kB\n"
+ "Pss: %8lu kB\n"
+ "Shared_Clean: %8lu kB\n"
+ "Shared_Dirty: %8lu kB\n"
+ "Private_Clean: %8lu kB\n"
+ "Private_Dirty: %8lu kB\n"
+ "Referenced: %8lu kB\n",
+ (vma->vm_end - vma->vm_start) >> 10,
+ mss.resident >> 10,
+ (unsigned long)(mss.pss >> (10 + PSS_DIV_BITS)),
+ mss.shared_clean >> 10,
+ mss.shared_dirty >> 10,
+ mss.private_clean >> 10,
+ mss.private_dirty >> 10,
+ mss.referenced >> 10);
+
+ return ret;
}
static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
* [PATCH 7/11] maps3: move clear_refs code to task_mmu.c
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (5 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 6/11] maps3: simplify interdependence of maps " Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-16 5:11 ` David Rientjes
2007-10-15 22:26 ` [PATCH 8/11] maps3: regroup task_mmu by interface Matt Mackall
` (3 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
From: Matt Mackall <mpm@selenic.com>
This puts all the clear_refs code where it belongs and probably lets things
compile on MMU-less systems as well.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Index: l/fs/proc/base.c
===================================================================
--- l.orig/fs/proc/base.c 2007-10-14 13:35:08.000000000 -0500
+++ l/fs/proc/base.c 2007-10-14 13:39:00.000000000 -0500
@@ -713,42 +713,6 @@ static const struct file_operations proc
.write = oom_adjust_write,
};
-#ifdef CONFIG_MMU
-static ssize_t clear_refs_write(struct file *file, const char __user *buf,
- size_t count, loff_t *ppos)
-{
- struct task_struct *task;
- char buffer[PROC_NUMBUF], *end;
- struct mm_struct *mm;
-
- memset(buffer, 0, sizeof(buffer));
- if (count > sizeof(buffer) - 1)
- count = sizeof(buffer) - 1;
- if (copy_from_user(buffer, buf, count))
- return -EFAULT;
- if (!simple_strtol(buffer, &end, 0))
- return -EINVAL;
- if (*end == '\n')
- end++;
- task = get_proc_task(file->f_path.dentry->d_inode);
- if (!task)
- return -ESRCH;
- mm = get_task_mm(task);
- if (mm) {
- clear_refs_smap(mm);
- mmput(mm);
- }
- put_task_struct(task);
- if (end - buffer == 0)
- return -EIO;
- return end - buffer;
-}
-
-static struct file_operations proc_clear_refs_operations = {
- .write = clear_refs_write,
-};
-#endif
-
#ifdef CONFIG_AUDITSYSCALL
#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
Index: l/fs/proc/internal.h
===================================================================
--- l.orig/fs/proc/internal.h 2007-10-14 13:35:08.000000000 -0500
+++ l/fs/proc/internal.h 2007-10-14 13:39:00.000000000 -0500
@@ -49,11 +49,7 @@ extern int proc_pid_statm(struct task_st
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
-
-extern const struct file_operations proc_maps_operations;
-extern const struct file_operations proc_numa_maps_operations;
-extern const struct file_operations proc_smaps_operations;
-
+extern const struct file_operations proc_clear_refs_operations;
void free_proc_entry(struct proc_dir_entry *de);
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-14 13:38:43.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-14 13:39:00.000000000 -0500
@@ -324,19 +324,47 @@ static int show_smap(struct seq_file *m,
static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
-void clear_refs_smap(struct mm_struct *mm)
+static ssize_t clear_refs_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
{
+ struct task_struct *task;
+ char buffer[13], *end;
+ struct mm_struct *mm;
struct vm_area_struct *vma;
- down_read(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next)
- if (vma->vm_mm && !is_vm_hugetlb_page(vma))
- walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
- &clear_refs_walk, vma);
- flush_tlb_mm(mm);
- up_read(&mm->mmap_sem);
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ if (!simple_strtol(buffer, &end, 0))
+ return -EINVAL;
+ if (*end == '\n')
+ end++;
+ task = get_proc_task(file->f_path.dentry->d_inode);
+ if (!task)
+ return -ESRCH;
+ mm = get_task_mm(task);
+ if (mm) {
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next)
+ if (!is_vm_hugetlb_page(vma))
+ walk_page_range(mm, vma->vm_start, vma->vm_end,
+ &clear_refs_walk, vma);
+ flush_tlb_mm(mm);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ }
+ put_task_struct(task);
+ if (end - buffer == 0)
+ return -EIO;
+ return end - buffer;
}
+const struct file_operations proc_clear_refs_operations = {
+ .write = clear_refs_write,
+};
+
static void *m_start(struct seq_file *m, loff_t *pos)
{
struct proc_maps_private *priv = m->private;
Index: l/include/linux/proc_fs.h
===================================================================
--- l.orig/include/linux/proc_fs.h 2007-10-14 13:35:08.000000000 -0500
+++ l/include/linux/proc_fs.h 2007-10-14 13:39:00.000000000 -0500
@@ -116,7 +116,6 @@ int proc_pid_readdir(struct file * filp,
unsigned long task_vsize(struct mm_struct *);
int task_statm(struct mm_struct *, int *, int *, int *, int *);
char *task_mem(struct mm_struct *, char *);
-void clear_refs_smap(struct mm_struct *mm);
struct proc_dir_entry *de_get(struct proc_dir_entry *de);
void de_put(struct proc_dir_entry *de);
* Re: [PATCH 7/11] maps3: move clear_refs code to task_mmu.c
2007-10-15 22:26 ` [PATCH 7/11] maps3: move clear_refs code to task_mmu.c Matt Mackall
@ 2007-10-16 5:11 ` David Rientjes
0 siblings, 0 replies; 39+ messages in thread
From: David Rientjes @ 2007-10-16 5:11 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Matt Mackall wrote:
> Index: l/fs/proc/task_mmu.c
> ===================================================================
> --- l.orig/fs/proc/task_mmu.c 2007-10-14 13:38:43.000000000 -0500
> +++ l/fs/proc/task_mmu.c 2007-10-14 13:39:00.000000000 -0500
> @@ -324,19 +324,47 @@ static int show_smap(struct seq_file *m,
>
> static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
>
> -void clear_refs_smap(struct mm_struct *mm)
> +static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> {
> + struct task_struct *task;
> + char buffer[13], *end;
The #define for PROC_NUMBUF will need to be moved from fs/proc/base.c to
include/linux/proc_fs.h and used here instead of hardcoding it.
* [PATCH 8/11] maps3: regroup task_mmu by interface
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (6 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 7/11] maps3: move clear_refs code to task_mmu.c Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:26 ` [PATCH 9/11] maps3: add /proc/pid/pagemap interface Matt Mackall
` (2 subsequent siblings)
10 siblings, 0 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
From: Matt Mackall <mpm@selenic.com>
Reorder source so that all the code and data for each interface is together.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-14 13:42:11.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-14 18:07:26.000000000 -0500
@@ -114,37 +114,121 @@ static void pad_len_spaces(struct seq_fi
seq_printf(m, "%*c", len, ' ');
}
-struct mem_size_stats
+static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
{
- struct vm_area_struct *vma;
- unsigned long resident;
- unsigned long shared_clean;
- unsigned long shared_dirty;
- unsigned long private_clean;
- unsigned long private_dirty;
- unsigned long referenced;
+ if (vma && vma != priv->tail_vma) {
+ struct mm_struct *mm = vma->vm_mm;
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ }
+}
+
+static void *m_start(struct seq_file *m, loff_t *pos)
+{
+ struct proc_maps_private *priv = m->private;
+ unsigned long last_addr = m->version;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma, *tail_vma = NULL;
+ loff_t l = *pos;
+
+ /* Clear the per syscall fields in priv */
+ priv->task = NULL;
+ priv->tail_vma = NULL;
/*
- * Proportional Set Size(PSS): my share of RSS.
- *
- * PSS of a process is the count of pages it has in memory, where each
- * page is divided by the number of processes sharing it. So if a
- * process has 1000 pages all to itself, and 1000 shared with one other
- * process, its PSS will be 1500. - Matt Mackall, lwn.net
+ * We remember last_addr rather than next_addr to hit with
+ * mmap_cache most of the time. We have zero last_addr at
+ * the beginning and also after lseek. We will have -1 last_addr
+ * after the end of the vmas.
*/
- u64 pss;
+
+ if (last_addr == -1UL)
+ return NULL;
+
+ priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
+ if (!priv->task)
+ return NULL;
+
+ mm = get_task_mm(priv->task);
+ if (!mm)
+ return NULL;
+
+ priv->tail_vma = tail_vma = get_gate_vma(priv->task);
+ down_read(&mm->mmap_sem);
+
+ /* Start with last addr hint */
+ if (last_addr && (vma = find_vma(mm, last_addr))) {
+ vma = vma->vm_next;
+ goto out;
+ }
+
/*
- * To keep (accumulated) division errors low, we adopt 64bit pss and
- * use some low bits for division errors. So (pss >> PSS_DIV_BITS)
- * would be the real byte count.
- *
- * A shift of 12 before division means(assuming 4K page size):
- * - 1M 3-user-pages add up to 8KB errors;
- * - supports mapcount up to 2^24, or 16M;
- * - supports PSS up to 2^52 bytes, or 4PB.
+ * Check the vma index is within the range and do
+ * sequential scan until m_index.
*/
-#define PSS_DIV_BITS 12
-};
+ vma = NULL;
+ if ((unsigned long)l < mm->map_count) {
+ vma = mm->mmap;
+ while (l-- && vma)
+ vma = vma->vm_next;
+ goto out;
+ }
+
+ if (l != mm->map_count)
+ tail_vma = NULL; /* After gate vma */
+
+out:
+ if (vma)
+ return vma;
+
+ /* End of vmas has been reached */
+ m->version = (tail_vma != NULL)? 0: -1UL;
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return tail_vma;
+}
+
+static void *m_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+ struct vm_area_struct *tail_vma = priv->tail_vma;
+
+ (*pos)++;
+ if (vma && (vma != tail_vma) && vma->vm_next)
+ return vma->vm_next;
+ vma_stop(priv, vma);
+ return (vma != tail_vma)? tail_vma: NULL;
+}
+
+static void m_stop(struct seq_file *m, void *v)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+
+ vma_stop(priv, vma);
+ if (priv->task)
+ put_task_struct(priv->task);
+}
+
+static int do_maps_open(struct inode *inode, struct file *file,
+ struct seq_operations *ops)
+{
+ struct proc_maps_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->pid = proc_pid(inode);
+ ret = seq_open(file, ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = priv;
+ } else {
+ kfree(priv);
+ }
+ }
+ return ret;
+}
static int show_map(struct seq_file *m, void *v)
{
@@ -211,6 +295,57 @@ static int show_map(struct seq_file *m,
return 0;
}
+static struct seq_operations proc_pid_maps_op = {
+ .start = m_start,
+ .next = m_next,
+ .stop = m_stop,
+ .show = show_map
+};
+
+static int maps_open(struct inode *inode, struct file *file)
+{
+ return do_maps_open(inode, file, &proc_pid_maps_op);
+}
+
+const struct file_operations proc_maps_operations = {
+ .open = maps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+struct mem_size_stats
+{
+ struct vm_area_struct *vma;
+ unsigned long resident;
+ unsigned long shared_clean;
+ unsigned long shared_dirty;
+ unsigned long private_clean;
+ unsigned long private_dirty;
+ unsigned long referenced;
+
+ /*
+ * Proportional Set Size(PSS): my share of RSS.
+ *
+ * PSS of a process is the count of pages it has in memory, where each
+ * page is divided by the number of processes sharing it. So if a
+ * process has 1000 pages all to itself, and 1000 shared with one other
+ * process, its PSS will be 1500. - Matt Mackall, lwn.net
+ */
+ u64 pss;
+ /*
+ * To keep (accumulated) division errors low, we adopt 64bit pss and
+ * use some low bits for division errors. So (pss >> PSS_DIV_BITS)
+ * would be the real byte count.
+ *
+ * A shift of 12 before division means(assuming 4K page size):
+ * - 1M 3-user-pages add up to 8KB errors;
+ * - supports mapcount up to 2^24, or 16M;
+ * - supports PSS up to 2^52 bytes, or 4PB.
+ */
+#define PSS_DIV_BITS 12
+};
+
static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
void *private)
{
@@ -256,33 +391,6 @@ static int smaps_pte_range(pmd_t *pmd, u
return 0;
}
-static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, void *private)
-{
- struct vm_area_struct *vma = private;
- pte_t *pte, ptent;
- spinlock_t *ptl;
- struct page *page;
-
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; pte++, addr += PAGE_SIZE) {
- ptent = *pte;
- if (!pte_present(ptent))
- continue;
-
- page = vm_normal_page(vma, addr, ptent);
- if (!page)
- continue;
-
- /* Clear accessed and referenced bits. */
- ptep_test_and_clear_young(vma, addr, pte);
- ClearPageReferenced(page);
- }
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
- return 0;
-}
-
static struct mm_walk smaps_walk = { .pmd_entry = smaps_pte_range };
static int show_smap(struct seq_file *m, void *v)
@@ -322,6 +430,52 @@ static int show_smap(struct seq_file *m,
return ret;
}
+static struct seq_operations proc_pid_smaps_op = {
+ .start = m_start,
+ .next = m_next,
+ .stop = m_stop,
+ .show = show_smap
+};
+
+static int smaps_open(struct inode *inode, struct file *file)
+{
+ return do_maps_open(inode, file, &proc_pid_smaps_op);
+}
+
+const struct file_operations proc_smaps_operations = {
+ .open = smaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, void *private)
+{
+ struct vm_area_struct *vma = private;
+ pte_t *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ /* Clear accessed and referenced bits. */
+ ptep_test_and_clear_young(vma, addr, pte);
+ ClearPageReferenced(page);
+ }
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
static ssize_t clear_refs_write(struct file *file, const char __user *buf,
@@ -365,148 +519,6 @@ const struct file_operations proc_clear_
.write = clear_refs_write,
};
-static void *m_start(struct seq_file *m, loff_t *pos)
-{
- struct proc_maps_private *priv = m->private;
- unsigned long last_addr = m->version;
- struct mm_struct *mm;
- struct vm_area_struct *vma, *tail_vma = NULL;
- loff_t l = *pos;
-
- /* Clear the per syscall fields in priv */
- priv->task = NULL;
- priv->tail_vma = NULL;
-
- /*
- * We remember last_addr rather than next_addr to hit with
- * mmap_cache most of the time. We have zero last_addr at
- * the beginning and also after lseek. We will have -1 last_addr
- * after the end of the vmas.
- */
-
- if (last_addr == -1UL)
- return NULL;
-
- priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
- if (!priv->task)
- return NULL;
-
- mm = get_task_mm(priv->task);
- if (!mm)
- return NULL;
-
- priv->tail_vma = tail_vma = get_gate_vma(priv->task);
- down_read(&mm->mmap_sem);
-
- /* Start with last addr hint */
- if (last_addr && (vma = find_vma(mm, last_addr))) {
- vma = vma->vm_next;
- goto out;
- }
-
- /*
- * Check the vma index is within the range and do
- * sequential scan until m_index.
- */
- vma = NULL;
- if ((unsigned long)l < mm->map_count) {
- vma = mm->mmap;
- while (l-- && vma)
- vma = vma->vm_next;
- goto out;
- }
-
- if (l != mm->map_count)
- tail_vma = NULL; /* After gate vma */
-
-out:
- if (vma)
- return vma;
-
- /* End of vmas has been reached */
- m->version = (tail_vma != NULL)? 0: -1UL;
- up_read(&mm->mmap_sem);
- mmput(mm);
- return tail_vma;
-}
-
-static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
-{
- if (vma && vma != priv->tail_vma) {
- struct mm_struct *mm = vma->vm_mm;
- up_read(&mm->mmap_sem);
- mmput(mm);
- }
-}
-
-static void *m_next(struct seq_file *m, void *v, loff_t *pos)
-{
- struct proc_maps_private *priv = m->private;
- struct vm_area_struct *vma = v;
- struct vm_area_struct *tail_vma = priv->tail_vma;
-
- (*pos)++;
- if (vma && (vma != tail_vma) && vma->vm_next)
- return vma->vm_next;
- vma_stop(priv, vma);
- return (vma != tail_vma)? tail_vma: NULL;
-}
-
-static void m_stop(struct seq_file *m, void *v)
-{
- struct proc_maps_private *priv = m->private;
- struct vm_area_struct *vma = v;
-
- vma_stop(priv, vma);
- if (priv->task)
- put_task_struct(priv->task);
-}
-
-static struct seq_operations proc_pid_maps_op = {
- .start = m_start,
- .next = m_next,
- .stop = m_stop,
- .show = show_map
-};
-
-static struct seq_operations proc_pid_smaps_op = {
- .start = m_start,
- .next = m_next,
- .stop = m_stop,
- .show = show_smap
-};
-
-static int do_maps_open(struct inode *inode, struct file *file,
- struct seq_operations *ops)
-{
- struct proc_maps_private *priv;
- int ret = -ENOMEM;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- if (priv) {
- priv->pid = proc_pid(inode);
- ret = seq_open(file, ops);
- if (!ret) {
- struct seq_file *m = file->private_data;
- m->private = priv;
- } else {
- kfree(priv);
- }
- }
- return ret;
-}
-
-static int maps_open(struct inode *inode, struct file *file)
-{
- return do_maps_open(inode, file, &proc_pid_maps_op);
-}
-
-const struct file_operations proc_maps_operations = {
- .open = maps_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = seq_release_private,
-};
-
#ifdef CONFIG_NUMA
extern int show_numa_map(struct seq_file *m, void *v);
@@ -541,14 +553,3 @@ const struct file_operations proc_numa_m
};
#endif
-static int smaps_open(struct inode *inode, struct file *file)
-{
- return do_maps_open(inode, file, &proc_pid_smaps_op);
-}
-
-const struct file_operations proc_smaps_operations = {
- .open = smaps_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = seq_release_private,
-};
* [PATCH 9/11] maps3: add /proc/pid/pagemap interface
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (7 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 8/11] maps3: regroup task_mmu by interface Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:26 ` [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces Matt Mackall
2007-10-15 22:26 ` [PATCH 11/11] maps3: make page monitoring /proc file optional Matt Mackall
10 siblings, 0 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
From: Matt Mackall <mpm@selenic.com>
This interface provides, for each page in an address space, a mapping to its
physical page frame number, allowing precise determination of which pages are
mapped and which pages are shared between processes.
New in this version:
- headers gone again (as recommended by Dave Hansen and Alan Cox)
- 64-bit entries (as per discussion with Andi Kleen)
- swap pte information exported (from Dave Hansen)
- page walker callback for holes (from Dave Hansen)
- direct put_user I/O (as suggested by Rusty Russell)
This patch folds in cleanups and swap PTE support from Dave Hansen
<haveblue@us.ibm.com>.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Index: l/fs/proc/base.c
===================================================================
--- l.orig/fs/proc/base.c 2007-10-14 13:39:00.000000000 -0500
+++ l/fs/proc/base.c 2007-10-15 17:18:09.000000000 -0500
@@ -635,7 +635,7 @@ out_no_task:
}
#endif
-static loff_t mem_lseek(struct file * file, loff_t offset, int orig)
+loff_t mem_lseek(struct file * file, loff_t offset, int orig)
{
switch (orig) {
case 0:
@@ -2034,6 +2034,7 @@ static const struct pid_entry tgid_base_
#ifdef CONFIG_MMU
REG("clear_refs", S_IWUSR, clear_refs),
REG("smaps", S_IRUGO, smaps),
+ REG("pagemap", S_IRUSR, pagemap),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, attr_dir),
@@ -2320,6 +2321,7 @@ static const struct pid_entry tid_base_s
#ifdef CONFIG_MMU
REG("clear_refs", S_IWUSR, clear_refs),
REG("smaps", S_IRUGO, smaps),
+ REG("pagemap", S_IRUSR, pagemap),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, attr_dir),
Index: l/fs/proc/internal.h
===================================================================
--- l.orig/fs/proc/internal.h 2007-10-14 13:39:00.000000000 -0500
+++ l/fs/proc/internal.h 2007-10-15 17:18:09.000000000 -0500
@@ -45,11 +45,13 @@ extern int proc_tid_stat(struct task_str
extern int proc_tgid_stat(struct task_struct *, char *);
extern int proc_pid_status(struct task_struct *, char *);
extern int proc_pid_statm(struct task_struct *, char *);
+extern loff_t mem_lseek(struct file * file, loff_t offset, int orig);
extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
extern const struct file_operations proc_smaps_operations;
extern const struct file_operations proc_clear_refs_operations;
+extern const struct file_operations proc_pagemap_operations;
void free_proc_entry(struct proc_dir_entry *de);
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-14 18:07:26.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-15 17:18:09.000000000 -0500
@@ -5,7 +5,10 @@
#include <linux/highmem.h>
#include <linux/ptrace.h>
#include <linux/pagemap.h>
+#include <linux/ptrace.h>
#include <linux/mempolicy.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#include <asm/elf.h>
#include <asm/uaccess.h>
@@ -519,6 +522,202 @@ const struct file_operations proc_clear_
.write = clear_refs_write,
};
+struct pagemapread {
+ char __user *out, *end;
+};
+
+#define PM_ENTRY_BYTES sizeof(u64)
+#define PM_RESERVED_BITS 3
+#define PM_RESERVED_OFFSET (64 - PM_RESERVED_BITS)
+#define PM_RESERVED_MASK (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
+#define PM_SPECIAL(nr) (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
+#define PM_NOT_PRESENT PM_SPECIAL(1LL)
+#define PM_SWAP PM_SPECIAL(2LL)
+#define PM_END_OF_BUFFER 1
+
+static int add_to_pagemap(unsigned long addr, u64 pfn,
+ struct pagemapread *pm)
+{
+ /*
+ * Make sure there's room in the buffer for an
+ * entire entry. Otherwise, only copy part of
+ * the pfn.
+ */
+ if (pm->out + PM_ENTRY_BYTES >= pm->end) {
+ if(copy_to_user(pm->out, &pfn, pm->end - pm->out))
+ return -EFAULT;
+ pm->out = pm->end;
+ return PM_END_OF_BUFFER;
+ }
+
+ if(put_user(pfn, pm->out))
+ return -EFAULT;
+ pm->out += PM_ENTRY_BYTES;
+ return 0;
+}
+
+static int pagemap_pte_hole(unsigned long start, unsigned long end,
+ void *private)
+{
+ struct pagemapread *pm = private;
+ unsigned long addr;
+ int err = 0;
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
+ err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+ if (err)
+ break;
+ }
+ return err;
+}
+
+u64 swap_pte_to_pagemap_entry(pte_t pte)
+{
+ swp_entry_t e = pte_to_swp_entry(pte);
+ return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
+}
+
+static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ void *private)
+{
+ struct pagemapread *pm = private;
+ pte_t *pte;
+ int err = 0;
+
+ pte = pte_offset_map(pmd, addr);
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ u64 pfn = PM_NOT_PRESENT;
+ if (is_swap_pte(*pte))
+ pfn = swap_pte_to_pagemap_entry(*pte);
+ else if (pte_present(*pte))
+ pfn = pte_pfn(*pte);
+ err = add_to_pagemap(addr, pfn, pm);
+ if (err)
+ return err;
+ }
+ pte_unmap(pte - 1);
+
+ cond_resched();
+
+ return err;
+}
+
+static struct mm_walk pagemap_walk =
+{
+ .pmd_entry = pagemap_pte_range,
+ .pte_hole = pagemap_pte_hole
+};
+
+/*
+ * /proc/pid/pagemap - an array mapping virtual pages to pfns
+ *
+ * For each page in the address space, this file contains one 64-bit
+ * entry representing the corresponding physical page frame number
+ * (PFN) if the page is present. If there is a swap entry for the
+ * physical page, then an encoding of the swap file number and the
+ * page's offset into the swap file are returned. If no page is
+ * present at all, PM_NOT_PRESENT is returned. This allows determining
+ * precisely which pages are mapped (or in swap) and comparing mapped
+ * pages between processes.
+ *
+ * Efficient users of this interface will use /proc/pid/maps to
+ * determine which areas of memory are actually mapped and llseek to
+ * skip over unmapped regions.
+ */
+static ssize_t pagemap_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ struct page **pages, *page;
+ unsigned long uaddr, uend;
+ struct mm_struct *mm;
+ struct pagemapread pm;
+ int pagecount;
+ int ret = -ESRCH;
+
+ if (!task)
+ goto out;
+
+ ret = -EACCES;
+ if (!ptrace_may_attach(task))
+ goto out;
+
+ ret = -EINVAL;
+ /* file position must be aligned */
+ if (*ppos % PM_ENTRY_BYTES)
+ goto out;
+
+ ret = 0;
+ mm = get_task_mm(task);
+ if (!mm)
+ goto out;
+
+ ret = -ENOMEM;
+ uaddr = (unsigned long)buf & PAGE_MASK;
+ uend = (unsigned long)(buf + count);
+ pagecount = (PAGE_ALIGN(uend) - uaddr) / PAGE_SIZE;
+ pages = kmalloc(pagecount * sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ goto out_task;
+
+ down_read(&current->mm->mmap_sem);
+ ret = get_user_pages(current, current->mm, uaddr, pagecount,
+ 1, 0, pages, NULL);
+ up_read(&current->mm->mmap_sem);
+
+ if (ret < 0)
+ goto out_free;
+
+ pm.out = buf;
+ pm.end = buf + count;
+
+ if (!ptrace_may_attach(task)) {
+ ret = -EIO;
+ } else {
+ unsigned long src = *ppos;
+ unsigned long svpfn = src / PM_ENTRY_BYTES;
+ unsigned long start_vaddr = svpfn << PAGE_SHIFT;
+ unsigned long end_vaddr = TASK_SIZE_OF(task);
+
+ /* watch out for wraparound */
+ if (svpfn > TASK_SIZE_OF(task) >> PAGE_SHIFT)
+ start_vaddr = end_vaddr;
+
+ /*
+ * The odds are that this will stop walking way
+ * before end_vaddr, because the length of the
+ * user buffer is tracked in "pm", and the walk
+ * will stop when we hit the end of the buffer.
+ */
+ ret = walk_page_range(mm, start_vaddr, end_vaddr,
+ &pagemap_walk, &pm);
+ if (ret == PM_END_OF_BUFFER)
+ ret = 0;
+ /* don't need mmap_sem for these, but this looks cleaner */
+ *ppos += pm.out - buf;
+ if (!ret)
+ ret = pm.out - buf;
+ }
+
+ for (; pagecount; pagecount--) {
+ page = pages[pagecount-1];
+ if (!PageReserved(page))
+ SetPageDirty(page);
+ page_cache_release(page);
+ }
+ mmput(mm);
+out_free:
+ kfree(pages);
+out_task:
+ put_task_struct(task);
+out:
+ return ret;
+}
+
+const struct file_operations proc_pagemap_operations = {
+ .llseek = mem_lseek, /* borrow this */
+ .read = pagemap_read,
+};
+
#ifdef CONFIG_NUMA
extern int show_numa_map(struct seq_file *m, void *v);
@@ -552,4 +751,3 @@ const struct file_operations proc_numa_m
.release = seq_release_private,
};
#endif
-
^ permalink raw reply [flat|nested] 39+ messages in thread

* [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (8 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 9/11] maps3: add /proc/pid/pagemap interface Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:48 ` Dave Hansen
2007-10-15 22:26 ` [PATCH 11/11] maps3: make page monitoring /proc file optional Matt Mackall
10 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
From: Matt Mackall <mpm@selenic.com>
This makes physical page map counts available to userspace. Together
with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
monitor memory usage on a per-page basis.
[bunk@stusta.de: make struct proc_kpagemap static]
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Index: l/fs/proc/proc_misc.c
===================================================================
--- l.orig/fs/proc/proc_misc.c 2007-10-09 17:37:57.000000000 -0500
+++ l/fs/proc/proc_misc.c 2007-10-10 11:46:50.000000000 -0500
@@ -46,6 +46,7 @@
#include <linux/vmalloc.h>
#include <linux/crash_dump.h>
#include <linux/pid_namespace.h>
+#include <linux/bootmem.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/io.h>
@@ -656,6 +657,106 @@ static const struct file_operations proc
};
#endif
+#define KPMSIZE sizeof(u64)
+#define KPMMASK (KPMSIZE - 1)
+/* /proc/kpagecount - an array exposing page counts
+ *
+ * Each entry is a u64 representing the corresponding
+ * physical page count.
+ */
+static ssize_t kpagecount_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 pcount;
+
+ if (!access_ok(VERIFY_WRITE, buf, count))
+ return -EFAULT;
+
+ pfn = src / KPMSIZE;
+ count = min_t(size_t, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EIO;
+
+ while (count > 0) {
+ ppage = pfn_to_page(pfn++);
+ if (!ppage)
+ pcount = 0;
+ else
+ pcount = atomic_read(&ppage->_count);
+
+ if (put_user(pcount, out++)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static struct file_operations proc_kpagecount_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecount_read,
+};
+
+/* /proc/kpageflags - an array exposing page flags
+ *
+ * Each entry is a u64 representing the corresponding
+ * physical page flags.
+ */
+static ssize_t kpageflags_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 pflags;
+
+ if (!access_ok(VERIFY_WRITE, buf, count))
+ return -EFAULT;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EIO;
+
+ while (count > 0) {
+ ppage = pfn_to_page(pfn++);
+ if (!ppage)
+ pflags = 0;
+ else
+ pflags = ppage->flags;
+
+ if (put_user(pflags, out++)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static struct file_operations proc_kpageflags_operations = {
+ .llseek = mem_lseek,
+ .read = kpageflags_read,
+};
+
struct proc_dir_entry *proc_root_kcore;
void create_seq_entry(char *name, mode_t mode, const struct file_operations *f)
@@ -735,6 +836,8 @@ void __init proc_misc_init(void)
(size_t)high_memory - PAGE_OFFSET + PAGE_SIZE;
}
#endif
+ create_seq_entry("kpagecount", S_IRUSR, &proc_kpagecount_operations);
+ create_seq_entry("kpageflags", S_IRUSR, &proc_kpageflags_operations);
#ifdef CONFIG_PROC_VMCORE
proc_vmcore = create_proc_entry("vmcore", S_IRUSR, NULL);
if (proc_vmcore)
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-15 22:26 ` [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces Matt Mackall
@ 2007-10-15 22:48 ` Dave Hansen
2007-10-15 23:11 ` Matt Mackall
0 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-15 22:48 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
> From: Matt Mackall <mpm@selenic.com>
>
> This makes physical page map counts available to userspace. Together
> with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
> monitor memory usage on a per-page basis.
...
> + while (count > 0) {
> + ppage = pfn_to_page(pfn++);
> + if (!ppage)
> + pflags = 0;
> + else
> + pflags = ppage->flags;
> +
This one makes me worry a little bit. Are we sure that this won't
expose a wee bit too much to userspace?
I can see it making sense to clear the page refs, then inspect whether
the page has been referenced again. But, I worry that people are going
to start doing things like read NUMA, SPARSEMEM, or other internal
information out of these.
I've seen quite a few patches lately that do creative things with these
*cough*clameter*cough*, and I worry that they're too fluid to get
exposed to userspace.
Could we just have /proc/kpagereferenced? Is there a legitimate need
for other flags to be visible?
-- Dave
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-15 22:48 ` Dave Hansen
@ 2007-10-15 23:11 ` Matt Mackall
2007-10-15 23:34 ` Dave Hansen
0 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 23:11 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, Oct 15, 2007 at 03:48:33PM -0700, Dave Hansen wrote:
> On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
> > From: Matt Mackall <mpm@selenic.com>
> >
> > This makes physical page map counts available to userspace. Together
> > with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
> > monitor memory usage on a per-page basis.
> ...
> > + while (count > 0) {
> > + ppage = pfn_to_page(pfn++);
> > + if (!ppage)
> > + pflags = 0;
> > + else
> > + pflags = ppage->flags;
> > +
>
> This one makes me worry a little bit. Are we sure that this won't
> expose a wee bit too much to userspace?
>
> I can see it making sense to clear the page refs, then inspect whether
> the page has been referenced again. But, I worry that people are going
> to start doing things like read NUMA, SPARSEMEM, or other internal
> information out of these.
Hmm, I would have thought you'd find the NUMA bits especially interesting.
Being able to, say, colorize a process' memory map by what nodes its
pages land on could be very telling.
> I've seen quite a few patches lately that do creative things with these
> *cough*clameter*cough*, and I worry that they're too fluid to get
> exposed to userspace.
That is a concern. In general, I think getting too cute with page
flags and struct page in general is a bad idea because the rules here
are already so complex/fragile/confusing/underdocumented, but there's
definitely a lot of pressure in that direction.
> Could we just have /proc/kpagereferenced? Is there a legitimate need
> for other flags to be visible?
Referenced, dirty, uptodate, lru, active, slab, writeback, reclaim,
and buddy all look like they might be interesting to me from the point
of view of watching what's happening in the VM graphically in
real-time.
For instance, watching the slab bit I can watch a 'find /' fill up
huge swaths of contiguous dcache memory, then get fragmented to hell
and never recover when I do a large userspace malloc. In other words,
this thing actually lets you see all the crap that happens in the VM
that we usually handwave about.
--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-15 23:11 ` Matt Mackall
@ 2007-10-15 23:34 ` Dave Hansen
2007-10-16 0:35 ` Matt Mackall
0 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-15 23:34 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, 2007-10-15 at 18:11 -0500, Matt Mackall wrote:
> > Could we just have /proc/kpagereferenced? Is there a legitimate need
> > for other flags to be visible?
>
> Referenced, dirty, uptodate, lru, active, slab, writeback, reclaim,
> and buddy all look like they might be interesting to me from the point
> of view of watching what's happening in the VM graphically in
> real-time.
This is true, but it forces a lot of logic from the kernel to be run in
userspace to figure out what is going on. Looking at mainline today:
#define PG_reclaim 17 /* To be reclaimed asap */
...
#define PG_readahead PG_reclaim /* Reminder to do async read-ahead */
All of a sudden, to figure out which flag it actually is, we need to
have all of the logic that the kernel does.
Does this establish a fixed user<->kernel ABI that will keep us from
doing this in the future:
-#define PG_slab 7 /* slab debug (Suparna wants this) */
+#define PG_slab 14 /* slab debug (Suparna wants this) */
Or, even something like this:
-#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
+#define PageSlab(page) (!PageLRU(page) && !PageHighmem(page))
If we actually had several (or even still one file) that exposed this
state, independent of the actual content of page->flags, I think we'd be
better off. I think that's the difference between a fun, super-useful
debugging feature and one that can stay in mainline and have
applications stay using it (without breaking) for a long time.
The flags you listed are things that I would imagine will always exist,
logically. But, we might not always have a specific page flag for pages
under writeback or in the buddy list for that matter. PG_buddy isn't
that old. Perhaps that would be better abstracted to something like
page_in_main_allocator().
-- Dave
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-15 23:34 ` Dave Hansen
@ 2007-10-16 0:35 ` Matt Mackall
2007-10-16 0:49 ` Dave Hansen
0 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-16 0:35 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, Oct 15, 2007 at 04:34:57PM -0700, Dave Hansen wrote:
> On Mon, 2007-10-15 at 18:11 -0500, Matt Mackall wrote:
> > > Could we just have /proc/kpagereferenced? Is there a legitimate need
> > > for other flags to be visible?
> >
> > Referenced, dirty, uptodate, lru, active, slab, writeback, reclaim,
> > and buddy all look like they might be interesting to me from the point
> > of view of watching what's happening in the VM graphically in
> > real-time.
>
> This is true, but it forces a lot of logic from the kernel to be run in
> userspace to figure out what is going on. Looking at mainline today:
>
> #define PG_reclaim 17 /* To be reclaimed asap */
> ...
> #define PG_readahead PG_reclaim /* Reminder to do async read-ahead */
>
> All of a sudden, to figure out which flag it actually is, we need to
> have all of the logic that the kernel does.
>
> Does this establish a fixed user<->kernel ABI that will keep us from
> doing this in the future:
>
> -#define PG_slab 7 /* slab debug (Suparna wants this) */
> +#define PG_slab 14 /* slab debug (Suparna wants this) */
>
> Or, even something like this:
>
> -#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
> +#define PageSlab(page) (!PageLRU(page) && !PageHighmem(page))
Yeah, there are a bunch of flags that aren't mutually exclusive and we
could probably recover a few.
> If we actually had several (or even still one file) that exposed this
> state, independent of the actual content of page->flags, I think we'd be
> better off. I think that's the difference between a fun, super-useful
> debugging feature and one that can stay in mainline and have
> applications stay using it (without breaking) for a long time.
>
> The flags you listed are things that I would imagine will always exist,
> logically. But, we might not always have a specific page flag for pages
> under writeback or in the buddy list for that matter. PG_buddy isn't
> that old. Perhaps that would be better abstracted to something like
> page_in_main_allocator().
Perhaps we need something like:
flags = page->flags;
userflags =
FLAG_BIT(USER_REFERENCED, flags & PG_referenced) |
...
etc. for the flags we want to export. This will let us change to
FLAG_BIT(USER_SLAB, PageSlab(page)) |
if we make a virtual slab bit.
And it shows up in grep.
Unfortunately, i386 test_bit is an asm inline and not a macro so we
can't hope for the compiler to fold up a bunch of identity bit
mappings for us.
--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-16 0:35 ` Matt Mackall
@ 2007-10-16 0:49 ` Dave Hansen
2007-10-16 0:58 ` Matt Mackall
0 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-16 0:49 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, 2007-10-15 at 19:35 -0500, Matt Mackall wrote:
> Perhaps we need something like:
>
> flags = page->flags;
> userflags =
> FLAG_BIT(USER_REFERENCED, flags & PG_referenced) |
> ...
>
> etc. for the flags we want to export. This will let us change to
>
> FLAG_BIT(USER_SLAB, PageSlab(page)) |
>
> if we make a virtual slab bit.
Yeah, that looks like a pretty sane scheme. Do we want to be any more
abstract about it? Perhaps instead of USER_SLAB, it should be
USER_KERNEL_INTERNAL, or USER_KERNEL_USE. The slab itself is going away
as we speak. :)
> And it shows up in grep.
>
> Unfortunately, i386 test_bit is an asm inline and not a macro so we
> can't hope for the compiler to fold up a bunch of identity bit
> mappings for us.
For the bits that we want to export, we could also add the unoptimized
access functions for any that don't already have them:
#define __ClearPageReserved(page) __clear_bit(PG_reserved, &(page)->flags)
Anybody changing bit behavior will certainly go check all of the
callers, such as ClearPageReserved() *and* __ClearPageReserved().
-- Dave
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-16 0:49 ` Dave Hansen
@ 2007-10-16 0:58 ` Matt Mackall
2007-10-16 1:07 ` Dave Hansen
0 siblings, 1 reply; 39+ messages in thread
From: Matt Mackall @ 2007-10-16 0:58 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, Oct 15, 2007 at 05:49:10PM -0700, Dave Hansen wrote:
> On Mon, 2007-10-15 at 19:35 -0500, Matt Mackall wrote:
> > Perhaps we need something like:
> >
> > flags = page->flags;
> > userflags =
> > FLAG_BIT(USER_REFERENCED, flags & PG_referenced) |
> > ...
> >
> > etc. for the flags we want to export. This will let us change to
> >
> > FLAG_BIT(USER_SLAB, PageSlab(page)) |
> >
> > if we make a virtual slab bit.
>
> Yeah, that looks like a pretty sane scheme. Do we want to be any more
> abstract about it? Perhaps instead of USER_SLAB, it should be
> USER_KERNEL_INTERNAL, or USER_KERNEL_USE. The slab itself is going away
> as we speak. :)
Perhaps. SLUB is still "a slab-based allocator". SLOB isn't, but I
intend to start making it use PG_slab shortly anyway.
> > And it shows up in grep.
> >
> > Unfortunately, i386 test_bit is an asm inline and not a macro so we
> > can't hope for the compiler to fold up a bunch of identity bit
> > mappings for us.
>
> We could also Yeah, that looks like a pretty sane scheme. Do we want to
> be any more abstract about it? Perhaps instead of USER_SLAB, it should
> be USER_KERNEL_INTERNAL, or USER_KERNEL_USE. The slab itself is going
> away as we speak.
>
> For the bits that we want to export, we could also add the unoptimized
> access functions for any that don't already have them:
>
> #define __ClearPageReserved(page) __clear_bit(PG_reserved, &(page)->flags)
Confused. Why are we interested in clear?
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces
2007-10-16 0:58 ` Matt Mackall
@ 2007-10-16 1:07 ` Dave Hansen
0 siblings, 0 replies; 39+ messages in thread
From: Dave Hansen @ 2007-10-16 1:07 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, 2007-10-15 at 19:58 -0500, Matt Mackall wrote:
>
> > For the bits that we want to export, we could also add the unoptimized
> > access functions for any that don't already have them:
> >
> > #define __ClearPageReserved(page) __clear_bit(PG_reserved, &(page)->flags)
>
> Confused. Why are we interested in clear?
We're not. I just grabbed a random line to show the non-atomic
accessors. Any actual one we'd need to add would be:
#define __PageBuddy(page) __test_bit(PG_buddy, &(page)->flags)
It looks like we don't have any of these non-atomic ones for plain
__PageFoo(). So, we'd have to add them for each one that we wanted.
Still not much work, and still satisfies the "grep test". :)
-- Dave
* [PATCH 11/11] maps3: make page monitoring /proc file optional
2007-10-15 22:25 [PATCH 0/11] maps3: pagemap monitoring v3 Matt Mackall
` (9 preceding siblings ...)
2007-10-15 22:26 ` [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces Matt Mackall
@ 2007-10-15 22:26 ` Matt Mackall
2007-10-15 22:49 ` Dave Hansen
2007-10-16 5:25 ` David Rientjes
10 siblings, 2 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-15 22:26 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Dave Hansen, Rusty Russell, Jeremy Fitzhardinge, David Rientjes,
Fengguang Wu
Make /proc page monitoring configurable
This puts the following files under an embedded config option:
/proc/pid/clear_refs
/proc/pid/smaps
/proc/pid/pagemap
/proc/kpagecount
/proc/kpageflags
Signed-off-by: Matt Mackall <mpm@selenic.com>
Index: l/fs/proc/base.c
===================================================================
--- l.orig/fs/proc/base.c 2007-10-15 17:18:09.000000000 -0500
+++ l/fs/proc/base.c 2007-10-15 17:18:16.000000000 -0500
@@ -2031,7 +2031,7 @@ static const struct pid_entry tgid_base_
LNK("exe", exe),
REG("mounts", S_IRUGO, mounts),
REG("mountstats", S_IRUSR, mountstats),
-#ifdef CONFIG_MMU
+#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, clear_refs),
REG("smaps", S_IRUGO, smaps),
REG("pagemap", S_IRUSR, pagemap),
@@ -2318,7 +2318,7 @@ static const struct pid_entry tid_base_s
LNK("root", root),
LNK("exe", exe),
REG("mounts", S_IRUGO, mounts),
-#ifdef CONFIG_MMU
+#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, clear_refs),
REG("smaps", S_IRUGO, smaps),
REG("pagemap", S_IRUSR, pagemap),
Index: l/fs/proc/proc_misc.c
===================================================================
--- l.orig/fs/proc/proc_misc.c 2007-10-15 17:18:13.000000000 -0500
+++ l/fs/proc/proc_misc.c 2007-10-15 17:18:16.000000000 -0500
@@ -657,6 +657,7 @@ static const struct file_operations proc
};
#endif
+#ifdef CONFIG_PROC_PAGE_MONITOR
#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
/* /proc/kpagecount - an array exposing page counts
@@ -756,6 +757,7 @@ static struct file_operations proc_kpage
.llseek = mem_lseek,
.read = kpageflags_read,
};
+#endif /* CONFIG_PROC_PAGE_MONITOR */
struct proc_dir_entry *proc_root_kcore;
@@ -836,8 +838,10 @@ void __init proc_misc_init(void)
(size_t)high_memory - PAGE_OFFSET + PAGE_SIZE;
}
#endif
+#ifdef CONFIG_PROC_PAGE_MONITOR
create_seq_entry("kpagecount", S_IRUSR, &proc_kpagecount_operations);
create_seq_entry("kpageflags", S_IRUSR, &proc_kpageflags_operations);
+#endif
#ifdef CONFIG_PROC_VMCORE
proc_vmcore = create_proc_entry("vmcore", S_IRUSR, NULL);
if (proc_vmcore)
Index: l/fs/proc/task_mmu.c
===================================================================
--- l.orig/fs/proc/task_mmu.c 2007-10-15 17:18:09.000000000 -0500
+++ l/fs/proc/task_mmu.c 2007-10-15 17:18:16.000000000 -0500
@@ -317,6 +317,7 @@ const struct file_operations proc_maps_o
.release = seq_release_private,
};
+#ifdef CONFIG_PROC_PAGE_MONITOR
struct mem_size_stats
{
struct vm_area_struct *vma;
@@ -717,6 +718,7 @@ const struct file_operations proc_pagema
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
};
+#endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA
extern int show_numa_map(struct seq_file *m, void *v);
Index: l/init/Kconfig
===================================================================
--- l.orig/init/Kconfig 2007-10-14 13:35:07.000000000 -0500
+++ l/init/Kconfig 2007-10-15 17:18:16.000000000 -0500
@@ -571,6 +571,15 @@ config SLOB
endchoice
+config PROC_PAGE_MONITOR
+ default y
+ bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
+ help
+ Various /proc files exist to monitor process memory utilization:
+ /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
+ /proc/kpagecount, and /proc/kpageflags. Disabling these
+ interfaces will reduce the size of the kernel by approximately 4kb.
+
endmenu # General setup
config RT_MUTEXES
Index: l/mm/Makefile
===================================================================
--- l.orig/mm/Makefile 2007-10-14 13:37:07.000000000 -0500
+++ l/mm/Makefile 2007-10-15 17:18:16.000000000 -0500
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o
+ vmalloc.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
@@ -13,6 +13,7 @@ obj-y := bootmem.o filemap.o mempool.o
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
$(mmu-y)
+obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
* Re: [PATCH 11/11] maps3: make page monitoring /proc file optional
2007-10-15 22:26 ` [PATCH 11/11] maps3: make page monitoring /proc file optional Matt Mackall
@ 2007-10-15 22:49 ` Dave Hansen
2007-10-15 22:51 ` Jeremy Fitzhardinge
2007-10-16 5:25 ` David Rientjes
1 sibling, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2007-10-15 22:49 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Rusty Russell, Jeremy Fitzhardinge,
David Rientjes, Fengguang Wu
On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
>
> +config PROC_PAGE_MONITOR
> + default y
> + bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
> + help
> + Various /proc files exist to monitor process memory utilization:
> + /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
> + /proc/kpagecount, and /proc/kpageflags. Disabling these
> + interfaces will reduce the size of the kernel by approximately 4kb.
How about pulling the EMBEDDED off there? I certainly want it for
non-embedded reasons. ;)
-- Dave
* Re: [PATCH 11/11] maps3: make page monitoring /proc file optional
2007-10-15 22:49 ` Dave Hansen
@ 2007-10-15 22:51 ` Jeremy Fitzhardinge
2007-10-16 0:03 ` Rusty Russell
0 siblings, 1 reply; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2007-10-15 22:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Matt Mackall, Andrew Morton, linux-kernel, Rusty Russell,
David Rientjes, Fengguang Wu
Dave Hansen wrote:
> On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
>
>> +config PROC_PAGE_MONITOR
>> + default y
>> + bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
>> + help
>> + Various /proc files exist to monitor process memory utilization:
>> + /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
>> + /proc/kpagecount, and /proc/kpageflags. Disabling these
>> + interfaces will reduce the size of the kernel by approximately 4kb.
>>
>
> How about pulling the EMBEDDED off there? I certainly want it for
> non-embedded reasons. ;)
That means it will only bother asking you if you've set EMBEDDED;
otherwise it's always on.
J
* Re: [PATCH 11/11] maps3: make page monitoring /proc file optional
2007-10-15 22:51 ` Jeremy Fitzhardinge
@ 2007-10-16 0:03 ` Rusty Russell
2007-10-16 0:20 ` Matt Mackall
0 siblings, 1 reply; 39+ messages in thread
From: Rusty Russell @ 2007-10-16 0:03 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Dave Hansen, Matt Mackall, Andrew Morton, linux-kernel,
David Rientjes, Fengguang Wu
On Tuesday 16 October 2007 08:51:17 Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
> > On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
> >> +config PROC_PAGE_MONITOR
> >> + default y
> >> + bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS &&
> >> MMU + help
> >> + Various /proc files exist to monitor process memory
> >> utilization: + /proc/pid/smaps, /proc/pid/clear_refs,
> >> /proc/pid/pagemap, + /proc/kpagecount, and /proc/kpageflags.
> >> Disabling these + interfaces will reduce the size of the kernel
> >> by approximately 4kb.
> >
> > How about pulling the EMBEDDED off there? I certainly want it for
> > non-embedded reasons. ;)
>
> That means it will only bother asking you if you've set EMBEDDED;
> otherwise its always on.
But it's at the least confusing. Surely this option should depend on MMU and
PROC_FS, and the prompt depend on EMBEDDED?
That might be implied by the Kconfig layout, but AFAICT this patch removed the
explicit MMU dependency.
Rusty.
* Re: [PATCH 11/11] maps3: make page monitoring /proc file optional
2007-10-16 0:03 ` Rusty Russell
@ 2007-10-16 0:20 ` Matt Mackall
0 siblings, 0 replies; 39+ messages in thread
From: Matt Mackall @ 2007-10-16 0:20 UTC (permalink / raw)
To: Rusty Russell
Cc: Jeremy Fitzhardinge, Dave Hansen, Andrew Morton, linux-kernel,
David Rientjes, Fengguang Wu
On Tue, Oct 16, 2007 at 10:03:39AM +1000, Rusty Russell wrote:
> On Tuesday 16 October 2007 08:51:17 Jeremy Fitzhardinge wrote:
> > Dave Hansen wrote:
> > > On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
> > >> +config PROC_PAGE_MONITOR
> > >> + default y
> > >> + bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS &&
> > >> MMU + help
> > >> + Various /proc files exist to monitor process memory
> > >> utilization: + /proc/pid/smaps, /proc/pid/clear_refs,
> > >> /proc/pid/pagemap, + /proc/kpagecount, and /proc/kpageflags.
> > >> Disabling these + interfaces will reduce the size of the kernel
> > >> by approximately 4kb.
> > >
> > > How about pulling the EMBEDDED off there? I certainly want it for
> > > non-embedded reasons. ;)
> >
> > That means it will only bother asking you if you've set EMBEDDED;
> > otherwise its always on.
>
> But it's at the least confusing. Surely this option should depend on MMU and
> PROC_FS, and the prompt depend on EMBEDDED?
>
> That might be implied by the Kconfig layout, but AFAICT this patch removed the
> explicit MMU dependency.
>
> Rusty.
Wasn't this your patch? You're right, it ought to say "depends PROC_FS
&& MMU". Will fix.
--
Mathematics is the supreme nostalgia of our time.
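Following Rusty's suggestion, the fixed-up stanza would presumably read something like this (a sketch of the promised follow-up, not the final merged text):

```kconfig
config PROC_PAGE_MONITOR
	default y
	depends on PROC_FS && MMU
	bool "Enable /proc page monitoring" if EMBEDDED
	help
	  Various /proc files exist to monitor process memory utilization:
	  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
	  /proc/kpagecount, and /proc/kpageflags. Disabling these
	  interfaces will reduce the size of the kernel by approximately 4kb.
```

Here the dependencies gate the option itself, while EMBEDDED only controls whether the prompt is shown.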
* Re: [PATCH 11/11] maps3: make page monitoring /proc file optional
2007-10-15 22:26 ` [PATCH 11/11] maps3: make page monitoring /proc file optional Matt Mackall
2007-10-15 22:49 ` Dave Hansen
@ 2007-10-16 5:25 ` David Rientjes
1 sibling, 0 replies; 39+ messages in thread
From: David Rientjes @ 2007-10-16 5:25 UTC (permalink / raw)
To: Matt Mackall
Cc: Andrew Morton, linux-kernel, Dave Hansen, Rusty Russell,
Jeremy Fitzhardinge, Fengguang Wu
On Mon, 15 Oct 2007, Matt Mackall wrote:
> Index: l/init/Kconfig
> ===================================================================
> --- l.orig/init/Kconfig 2007-10-14 13:35:07.000000000 -0500
> +++ l/init/Kconfig 2007-10-15 17:18:16.000000000 -0500
> @@ -571,6 +571,15 @@ config SLOB
>
> endchoice
>
> +config PROC_PAGE_MONITOR
> + default y
> + bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
> + help
> + Various /proc files exist to monitor process memory utilization:
> + /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
> + /proc/kpagecount, and /proc/kpageflags. Disabling these
> + interfaces will reduce the size of the kernel by approximately 4kb.
> +
> endmenu # General setup
>
> config RT_MUTEXES
It's probably better not to include the text size savings since it will
most likely be outdated at some time in the future.