* Page Fault Scalability patch V19 [1/4]: pte_cmpxchg and CONFIG_ATOMIC_TABLE_OPS
2005-03-09 20:13 Page Fault Scalability patch V19 [0/4]: Overview Christoph Lameter
@ 2005-03-09 20:13 ` Christoph Lameter
2005-03-09 23:01 ` Andi Kleen
2005-03-09 20:13 ` Page Fault Scalability patch V19 [2/4]: Abstract mm_struct counter operations Christoph Lameter
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 20:13 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-ia64, Christoph Lameter
The current way of updating ptes in the Linux vm includes first clearing
a pte before setting it to another value. The clearing is performed while
holding the page_table_lock to ensure that the entry will not be modified
by the CPU directly (clearing the pte clears the present bit),
by an arch-specific interrupt handler, or by another page fault handler
running on another CPU. This approach is necessary for some
architectures that cannot perform atomic updates of page table entries.
If a page table entry is cleared, then a second CPU may generate a page fault
for that entry. The fault handler on the second CPU will then attempt to
acquire the page_table_lock and wait until the first CPU has completed
updating the page table entry. It will then discover that everything is
OK and simply do nothing (apart from incrementing the counters for a minor
fault and marking the page as accessed again).
However, most architectures actually support atomic operations on page
table entries. Using them would allow a page table entry to be updated
in a single atomic operation instead of being written twice. There would
also be no danger of generating a spurious page fault on other CPUs.
The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte for locking type purposes.
Atomic operations may be enabled in the kernel configuration on
i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode.
Generic atomic definitions for ptep_xchg and ptep_cmpxchg
have been provided based on the existing xchg() and cmpxchg() functions
that already work atomically on many platforms. It is very
easy to implement this for any architecture by adding the appropriate
definitions to arch/xx/Kconfig.
The provided generic atomic functions may be overridden as usual by defining
the appropriate __HAVE_ARCH_xxx constant and providing an implementation.
My aim of reducing the use of the page_table_lock in the page fault handler
relies on a pte never being clear while it is in use, even when the
page_table_lock is not held. Clearing a pte before setting it to another
value could result in a situation in which a fault generated by
another CPU installs a pte that is then immediately overwritten by
the first CPU setting the pte to a valid value again. This patch is
important for future work on reducing the use of spinlocks in the vm.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-02-24 19:42:12.000000000 -0800
@@ -575,11 +575,6 @@ static int try_to_unmap_one(struct page
/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);
if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -594,11 +589,15 @@ static int try_to_unmap_one(struct page
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);
+ /* Move the dirty bit to the physical page now that the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
acct_update_integrals();
page_remove_rmap(page);
@@ -691,15 +690,15 @@ static void try_to_unmap_cluster(unsigne
if (ptep_clear_flush_young(vma, address, pte))
continue;
- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_clear_flush(vma, address, pte);
- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-24 19:42:12.000000000 -0800
@@ -502,14 +502,18 @@ static void zap_pte_range(struct mmu_gat
page->index > details->last_index))
continue;
}
- pte = ptep_get_and_clear(ptep);
- tlb_remove_tlb_entry(tlb, ptep, address+offset);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ pte = ptep_get_and_clear(ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address+offset);
continue;
+ }
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
address+offset) != page->index)
- set_pte(ptep, pgoff_to_pte(page->index));
+ pte = ptep_xchg(ptep, pgoff_to_pte(page->index));
+ else
+ pte = ptep_get_and_clear(ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address+offset);
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
Index: linux-2.6.10/mm/mprotect.c
===================================================================
--- linux-2.6.10.orig/mm/mprotect.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/mprotect.c 2005-02-24 19:42:12.000000000 -0800
@@ -48,12 +48,16 @@ change_pte_range(pmd_t *pmd, unsigned lo
if (pte_present(*pte)) {
pte_t entry;
- /* Avoid an SMP race with hardware updated dirty/clean
- * bits by wiping the pte and then setting the new pte
- * into place.
- */
- entry = ptep_get_and_clear(pte);
- set_pte(pte, pte_modify(entry, newprot));
+ /* Deal with a potential SMP race with hardware/arch
+ * interrupt updating dirty/clean bits through the use
+ * of ptep_cmpxchg.
+ */
+ do {
+ entry = *pte;
+ } while (!ptep_cmpxchg(pte,
+ entry,
+ pte_modify(entry, newprot)
+ ));
}
address += PAGE_SIZE;
pte++;
Index: linux-2.6.10/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/pgtable.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-generic/pgtable.h 2005-02-24 19:42:12.000000000 -0800
@@ -102,6 +102,92 @@ static inline pte_t ptep_get_and_clear(p
})
#endif
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__ptep, __pteval) \
+ __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__ptep,__oldval,__newval) \
+ (cmpxchg(&pte_val(*(__ptep)), \
+ pte_val(__oldval), \
+ pte_val(__newval) \
+ ) == pte_val(__oldval) \
+ )
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_get_and_clear(__ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to insure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__ptep, __old, __new) \
+({ \
+ pte_t prev = ptep_get_and_clear(__ptep); \
+ int r = pte_val(prev) == pte_val(__old); \
+ set_pte(__ptep, r ? (__new) : prev); \
+ r; \
+})
+#endif
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(pte_t *ptep)
{
Index: linux-2.6.10/arch/ia64/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/ia64/Kconfig 2005-02-24 19:41:28.000000000 -0800
+++ linux-2.6.10/arch/ia64/Kconfig 2005-02-24 19:42:12.000000000 -0800
@@ -272,6 +272,17 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.
+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && EXPERIMENTAL
+ help
+ Atomic page table operations allow page faults
+ without the use of (or with reduced use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is particularly
+ useful for high CPU counts and for processes that use
+ large amounts of memory.
+
config HAVE_DEC_LOCK
bool
depends on (SMP || PREEMPT)
Index: linux-2.6.10/arch/i386/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/i386/Kconfig 2005-02-24 19:41:28.000000000 -0800
+++ linux-2.6.10/arch/i386/Kconfig 2005-02-24 19:42:12.000000000 -0800
@@ -868,6 +868,17 @@ config HAVE_DEC_LOCK
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y
+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && X86_CMPXCHG && EXPERIMENTAL && !X86_PAE
+ help
+ Atomic page table operations allow page faults
+ without the use of (or with reduced use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is particularly
+ useful for high CPU counts and for processes that use
+ large amounts of memory.
+
# turning this on wastes a bunch of space.
# Summit needs it only when NUMA is on
config BOOT_IOREMAP
Index: linux-2.6.10/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.10.orig/arch/x86_64/Kconfig 2005-02-24 19:41:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/Kconfig 2005-02-24 19:42:12.000000000 -0800
@@ -240,6 +240,17 @@ config PREEMPT
Say Y here if you are feeling brave and building a kernel for a
desktop, embedded or real-time system. Say N if you are unsure.
+config ATOMIC_TABLE_OPS
+ bool "Atomic Page Table Operations (EXPERIMENTAL)"
+ depends on SMP && EXPERIMENTAL
+ help
+ Atomic page table operations allow page faults
+ without the use of (or with reduced use of) spinlocks
+ and allow greater concurrency for a task with multiple
+ threads in the page fault handler. This is particularly
+ useful for high CPU counts and for processes that use
+ large amounts of memory.
+
config PREEMPT_BKL
bool "Preempt The Big Kernel Lock"
depends on PREEMPT
* Re: Page Fault Scalability patch V19 [1/4]: pte_cmpxchg and CONFIG_ATOMIC_TABLE_OPS
2005-03-09 20:13 ` Page Fault Scalability patch V19 [1/4]: pte_cmpxchg and CONFIG_ATOMIC_TABLE_OPS Christoph Lameter
@ 2005-03-09 23:01 ` Andi Kleen
2005-03-09 23:06 ` Christoph Lameter
0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2005-03-09 23:01 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
>
> Atomic operations may be enabled in the kernel configuration on
> i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode.
> Generic atomic definitions for ptep_xchg and ptep_cmpxchg
> have been provided based on the existing xchg() and cmpxchg() functions
> that already work atomically on many platforms. It is very
I'm curious - do you have any micro benchmarks on i386 or x86-64 systems
about the difference between spin_lock(ptl) access; spin_unlock(ptl);
and cmpxchg ?
cmpxchg can be quite slow, with bad luck it could be slower than
the spinlocks.
A P4 would be good to benchmark this because it seems to be the worst
case.
-Andi
* Re: Page Fault Scalability patch V19 [1/4]: pte_cmpxchg and CONFIG_ATOMIC_TABLE_OPS
2005-03-09 23:01 ` Andi Kleen
@ 2005-03-09 23:06 ` Christoph Lameter
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 23:06 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-ia64, linux-kernel
On Thu, 10 Mar 2005, Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
> >
> > Atomic operations may be enabled in the kernel configuration on
> > i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode.
> > Generic atomic definitions for ptep_xchg and ptep_cmpxchg
> > have been provided based on the existing xchg() and cmpxchg() functions
> > that already work atomically on many platforms. It is very
>
> I'm curious - do you have any micro benchmarks on i386 or x86-64 systems
> about the difference between spin_lock(ptl) access; spin_unlock(ptl);
> and cmpxchg ?
There is a benchmark for UP on
http://oss.sgi.com/projects/page_fault_performance.
> cmpxchg can be quite slow, with bad luck it could be slower than
> the spinlocks.
Spinlocks also require atomic operations, such as a lock decb on i386, in
order to acquire the lock. And the page_table_lock is acquired two
times in the page fault handler. For the spinlock approach to be faster,
two spinlock acquisition/release pairs would have to be faster than one cmpxchg.
> A P4 would be good to benchmark this because it seems to be the worst
> case.
The numbers on the webpage are for an AMD64. But I can try
to get some testing done on a P4 too.
* Page Fault Scalability patch V19 [2/4]: Abstract mm_struct counter operations
2005-03-09 20:13 Page Fault Scalability patch V19 [0/4]: Overview Christoph Lameter
2005-03-09 20:13 ` Page Fault Scalability patch V19 [1/4]: pte_cmpxchg and CONFIG_ATOMIC_TABLE_OPS Christoph Lameter
@ 2005-03-09 20:13 ` Christoph Lameter
2005-03-09 20:13 ` Page Fault Scalability patch V19 [3/4]: Drop use of page_table_lock in handle_mm_fault Christoph Lameter
2005-03-09 20:13 ` Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page Christoph Lameter
3 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 20:13 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-ia64, Christoph Lameter
This patch extracts all the operations on rss into definitions in
include/linux/sched.h. All rss operations are performed through
the following three macros:
get_mm_counter(mm, member) -> Obtain the value of a counter
set_mm_counter(mm, member, value) -> Set the value of a counter
update_mm_counter(mm, member, value) -> Add a value to a counter
The simple definitions provided in this patch result in no change to
the generated code.
With this patch it becomes easier to add new counters, and it is possible
to redefine the method of counter handling (e.g. the page fault scalability
patches may want to use atomic operations or a split rss).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-02-24 19:41:49.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-02-24 19:42:17.000000000 -0800
@@ -203,6 +203,10 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct vm_area_struct *area);
extern void arch_unmap_area_topdown(struct vm_area_struct *area);
+#define set_mm_counter(mm, member, value) (mm)->member = (value)
+#define get_mm_counter(mm, member) ((mm)->member)
+#define update_mm_counter(mm, member, value) (mm)->member += (value)
+#define MM_COUNTER_T unsigned long
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
@@ -219,7 +223,7 @@ struct mm_struct {
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables and some counters */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -229,9 +233,13 @@ struct mm_struct {
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
+ /* Special counters protected by the page_table_lock */
+ MM_COUNTER_T rss;
+ MM_COUNTER_T anon_rss;
+
unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
unsigned dumpable:1;
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-02-24 19:42:12.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-02-24 19:42:17.000000000 -0800
@@ -313,9 +313,9 @@ copy_one_pte(struct mm_struct *dst_mm,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
get_page(page);
- dst_mm->rss++;
+ update_mm_counter(dst_mm, rss, 1);
if (PageAnon(page))
- dst_mm->anon_rss++;
+ update_mm_counter(dst_mm, anon_rss, 1);
set_pte(dst_pte, pte);
page_dup_rmap(page);
}
@@ -517,7 +517,7 @@ static void zap_pte_range(struct mmu_gat
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
- tlb->mm->anon_rss--;
+ update_mm_counter(tlb->mm, anon_rss, -1);
else if (pte_young(pte))
mark_page_accessed(page);
tlb->freed++;
@@ -1340,13 +1340,14 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
- if (PageAnon(old_page))
- mm->anon_rss--;
+ if (PageAnon(old_page))
+ update_mm_counter(mm, anon_rss, -1);
if (PageReserved(old_page)) {
- ++mm->rss;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
} else
+
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1750,7 +1751,7 @@ static int do_swap_page(struct mm_struct
if (vm_swap_full())
remove_exclusive_swap_page(page);
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
@@ -1817,7 +1818,7 @@ do_anonymous_page(struct mm_struct *mm,
spin_unlock(&mm->page_table_lock);
goto out;
}
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
@@ -1935,7 +1936,7 @@ retry:
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
if (!PageReserved(new_page))
- ++mm->rss;
+ update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
@@ -2262,8 +2263,10 @@ void update_mem_hiwater(void)
struct task_struct *tsk = current;
if (tsk->mm) {
- if (tsk->mm->hiwater_rss < tsk->mm->rss)
- tsk->mm->hiwater_rss = tsk->mm->rss;
+ unsigned long rss = get_mm_counter(tsk->mm, rss);
+
+ if (tsk->mm->hiwater_rss < rss)
+ tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
tsk->mm->hiwater_vm = tsk->mm->total_vm;
}
Index: linux-2.6.10/mm/rmap.c
===================================================================
--- linux-2.6.10.orig/mm/rmap.c 2005-02-24 19:42:12.000000000 -0800
+++ linux-2.6.10/mm/rmap.c 2005-02-24 19:42:17.000000000 -0800
@@ -258,7 +258,7 @@ static int page_referenced_one(struct pa
pte_t *pte;
int referenced = 0;
- if (!mm->rss)
+ if (!get_mm_counter(mm, rss))
goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -437,7 +437,7 @@ void page_add_anon_rmap(struct page *pag
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);
- vma->vm_mm->anon_rss++;
+ update_mm_counter(vma->vm_mm, anon_rss, 1);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
@@ -510,7 +510,7 @@ static int try_to_unmap_one(struct page
pte_t pteval;
int ret = SWAP_AGAIN;
- if (!mm->rss)
+ if (!get_mm_counter(mm, rss))
goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -591,14 +591,14 @@ static int try_to_unmap_one(struct page
}
pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- mm->anon_rss--;
+ update_mm_counter(mm, anon_rss, -1);
} else
pteval = ptep_clear_flush(vma, address, pte);
/* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
acct_update_integrals();
page_remove_rmap(page);
page_cache_release(page);
@@ -705,7 +705,7 @@ static void try_to_unmap_cluster(unsigne
page_remove_rmap(page);
page_cache_release(page);
acct_update_integrals();
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
(*mapcount)--;
}
@@ -804,7 +804,7 @@ static int try_to_unmap_file(struct page
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
+ while (get_mm_counter(vma->vm_mm, rss) &&
cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
Index: linux-2.6.10/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.10.orig/fs/proc/task_mmu.c 2005-02-24 19:41:44.000000000 -0800
+++ linux-2.6.10/fs/proc/task_mmu.c 2005-02-24 19:42:17.000000000 -0800
@@ -24,7 +24,7 @@ char *task_mem(struct mm_struct *mm, cha
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_mm_counter(mm, rss) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -39,11 +39,13 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ int rss = get_mm_counter(mm, rss);
+
+ *shared = rss - get_mm_counter(mm, anon_rss);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = mm->rss;
+ *resident = rss;
return mm->total_vm;
}
Index: linux-2.6.10/mm/mmap.c
===================================================================
--- linux-2.6.10.orig/mm/mmap.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/mmap.c 2005-02-24 19:42:17.000000000 -0800
@@ -2000,7 +2000,7 @@ void exit_mmap(struct mm_struct *mm)
vma = mm->mmap;
mm->mmap = mm->mmap_cache = NULL;
mm->mm_rb = RB_ROOT;
- mm->rss = 0;
+ set_mm_counter(mm, rss, 0);
mm->total_vm = 0;
mm->locked_vm = 0;
Index: linux-2.6.10/kernel/fork.c
===================================================================
--- linux-2.6.10.orig/kernel/fork.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/kernel/fork.c 2005-02-24 19:42:17.000000000 -0800
@@ -174,8 +174,8 @@ static inline int dup_mmap(struct mm_str
mm->mmap_cache = NULL;
mm->free_area_cache = oldmm->mmap_base;
mm->map_count = 0;
- mm->rss = 0;
- mm->anon_rss = 0;
+ set_mm_counter(mm, rss, 0);
+ set_mm_counter(mm, anon_rss, 0);
cpus_clear(mm->cpu_vm_mask);
mm->mm_rb = RB_ROOT;
rb_link = &mm->mm_rb.rb_node;
@@ -471,7 +471,7 @@ static int copy_mm(unsigned long clone_f
if (retval)
goto free_pt;
- mm->hiwater_rss = mm->rss;
+ mm->hiwater_rss = get_mm_counter(mm,rss);
mm->hiwater_vm = mm->total_vm;
good_mm:
Index: linux-2.6.10/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-generic/tlb.h 2005-02-24 19:41:46.000000000 -0800
+++ linux-2.6.10/include/asm-generic/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -88,11 +88,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
int freed = tlb->freed;
struct mm_struct *mm = tlb->mm;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);
if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
tlb_flush_mmu(tlb, start, end);
/* keep the page table cache within bounds */
Index: linux-2.6.10/fs/binfmt_flat.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_flat.c 2004-12-24 13:33:47.000000000 -0800
+++ linux-2.6.10/fs/binfmt_flat.c 2005-02-24 19:42:17.000000000 -0800
@@ -650,7 +650,7 @@ static int load_flat_file(struct linux_b
current->mm->start_brk = datapos + data_len + bss_len;
current->mm->brk = (current->mm->start_brk + 3) & ~3;
current->mm->context.end_brk = memp + ksize((void *) memp) - stack_len;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
}
if (flags & FLAT_FLAG_KTRACE)
Index: linux-2.6.10/fs/exec.c
===================================================================
--- linux-2.6.10.orig/fs/exec.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/exec.c 2005-02-24 19:42:17.000000000 -0800
@@ -326,7 +326,7 @@ void install_arg_page(struct vm_area_str
pte_unmap(pte);
goto out;
}
- mm->rss++;
+ update_mm_counter(mm, rss, 1);
lru_cache_add_active(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
page, vma->vm_page_prot))));
Index: linux-2.6.10/fs/binfmt_som.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_som.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_som.c 2005-02-24 19:42:17.000000000 -0800
@@ -259,7 +259,7 @@ load_som_binary(struct linux_binprm * bp
create_som_tables(bprm);
current->mm->start_stack = bprm->p;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
#if 0
printk("(start_brk) %08lx\n" , (unsigned long) current->mm->start_brk);
Index: linux-2.6.10/mm/fremap.c
===================================================================
--- linux-2.6.10.orig/mm/fremap.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/fremap.c 2005-02-24 19:42:17.000000000 -0800
@@ -39,7 +39,7 @@ static inline void zap_pte(struct mm_str
set_page_dirty(page);
page_remove_rmap(page);
page_cache_release(page);
- mm->rss--;
+ update_mm_counter(mm, rss, -1);
}
}
} else {
@@ -92,7 +92,7 @@ int install_page(struct mm_struct *mm, s
zap_pte(mm, vma, addr, pte);
- mm->rss++;
+ update_mm_counter(mm,rss, 1);
flush_icache_page(vma, page);
set_pte(pte, mk_pte(page, prot));
page_add_file_rmap(page);
Index: linux-2.6.10/mm/swapfile.c
===================================================================
--- linux-2.6.10.orig/mm/swapfile.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/swapfile.c 2005-02-24 19:42:17.000000000 -0800
@@ -432,7 +432,7 @@ static void
unuse_pte(struct vm_area_struct *vma, unsigned long address, pte_t *dir,
swp_entry_t entry, struct page *page)
{
- vma->vm_mm->rss++;
+ update_mm_counter(vma->vm_mm, rss, 1);
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, address);
Index: linux-2.6.10/fs/binfmt_aout.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_aout.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_aout.c 2005-02-24 19:42:17.000000000 -0800
@@ -317,7 +317,7 @@ static int load_aout_binary(struct linux
(current->mm->start_brk = N_BSSADDR(ex));
current->mm->free_area_cache = current->mm->mmap_base;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/mm/hugetlbpage.c 2005-02-24 19:41:29.000000000 -0800
+++ linux-2.6.10/arch/ia64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -73,7 +73,7 @@ set_huge_pte (struct mm_struct *mm, stru
{
pte_t entry;
- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -116,7 +116,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -246,7 +246,7 @@ void unmap_hugepage_range(struct vm_area
put_page(page);
pte_clear(pte);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, - ((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}
Index: linux-2.6.10/fs/binfmt_elf.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_elf.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_elf.c 2005-02-24 19:42:17.000000000 -0800
@@ -764,7 +764,7 @@ static int load_elf_binary(struct linux_
/* Do this so that we can load the interpreter, if need be. We will
change some of these later */
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->free_area_cache = current->mm->mmap_base;
retval = setup_arg_pages(bprm, STACK_TOP, executable_stack);
if (retval < 0) {
Index: linux-2.6.10/include/asm-ia64/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/tlb.h 2005-02-24 19:41:47.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -161,11 +161,11 @@ tlb_finish_mmu (struct mmu_gather *tlb,
{
unsigned long freed = tlb->freed;
struct mm_struct *mm = tlb->mm;
- unsigned long rss = mm->rss;
+ unsigned long rss = get_mm_counter(mm, rss);
if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
/*
* Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
* tlb->end_addr.
Index: linux-2.6.10/include/asm-arm/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/tlb.h 2005-02-24 19:41:45.000000000 -0800
+++ linux-2.6.10/include/asm-arm/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -54,11 +54,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
struct mm_struct *mm = tlb->mm;
unsigned long freed = tlb->freed;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);
if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
if (freed) {
flush_tlb_mm(mm);
Index: linux-2.6.10/include/asm-arm26/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/tlb.h 2005-02-24 19:41:45.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -37,11 +37,11 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
{
struct mm_struct *mm = tlb->mm;
unsigned long freed = tlb->freed;
- int rss = mm->rss;
+ int rss = get_mm_counter(mm, rss);
if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
if (freed) {
flush_tlb_mm(mm);
Index: linux-2.6.10/include/asm-sparc64/tlb.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/tlb.h 2005-02-24 19:41:48.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/tlb.h 2005-02-24 19:42:17.000000000 -0800
@@ -80,11 +80,11 @@ static inline void tlb_finish_mmu(struct
{
unsigned long freed = mp->freed;
struct mm_struct *mm = mp->mm;
- unsigned long rss = mm->rss;
+ unsigned long rss = get_mm_counter(mm, rss);
if (rss < freed)
freed = rss;
- mm->rss = rss - freed;
+ update_mm_counter(mm, rss, -freed);
tlb_flush_mmu(mp);
Index: linux-2.6.10/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/hugetlbpage.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;
- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}
Index: linux-2.6.10/arch/x86_64/ia32/ia32_aout.c
===================================================================
--- linux-2.6.10.orig/arch/x86_64/ia32/ia32_aout.c 2005-02-24 19:41:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/ia32/ia32_aout.c 2005-02-24 19:42:17.000000000 -0800
@@ -313,7 +313,7 @@ static int load_aout_binary(struct linux
(current->mm->start_brk = N_BSSADDR(ex));
current->mm->free_area_cache = TASK_UNMAPPED_BASE;
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/ppc64/mm/hugetlbpage.c 2005-02-24 19:41:32.000000000 -0800
+++ linux-2.6.10/arch/ppc64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -153,7 +153,7 @@ static void set_huge_pte(struct mm_struc
{
pte_t entry;
- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -315,7 +315,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
set_pte(dst_pte, entry);
addr += HPAGE_SIZE;
@@ -425,7 +425,7 @@ void unmap_hugepage_range(struct vm_area
put_page(page);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_pending();
}
Index: linux-2.6.10/arch/sh64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sh64/mm/hugetlbpage.c 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/arch/sh64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -62,7 +62,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;
- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -115,7 +115,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -206,7 +206,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}
Index: linux-2.6.10/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/hugetlbpage.c 2005-02-24 19:41:32.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -67,7 +67,7 @@ static void set_huge_pte(struct mm_struc
unsigned long i;
pte_t entry;
- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access)
entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
@@ -120,7 +120,7 @@ int copy_hugetlb_page_range(struct mm_st
pte_val(entry) += PAGE_SIZE;
dst_pte++;
}
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -211,7 +211,7 @@ void unmap_hugepage_range(struct vm_area
pte++;
}
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}
Index: linux-2.6.10/arch/mips/kernel/irixelf.c
===================================================================
--- linux-2.6.10.orig/arch/mips/kernel/irixelf.c 2005-02-24 19:41:29.000000000 -0800
+++ linux-2.6.10/arch/mips/kernel/irixelf.c 2005-02-24 19:42:17.000000000 -0800
@@ -692,7 +692,7 @@ static int load_irix_binary(struct linux
/* Do this so that we can load the interpreter, if need be. We will
* change some of these later.
*/
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT);
current->mm->start_stack = bprm->p;
Index: linux-2.6.10/arch/m68k/atari/stram.c
===================================================================
--- linux-2.6.10.orig/arch/m68k/atari/stram.c 2005-02-24 19:41:29.000000000 -0800
+++ linux-2.6.10/arch/m68k/atari/stram.c 2005-02-24 19:42:17.000000000 -0800
@@ -635,7 +635,7 @@ static inline void unswap_pte(struct vm_
set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
swap_free(entry);
get_page(page);
- ++vma->vm_mm->rss;
+ update_mm_counter(vma->vm_mm, rss, 1);
}
static inline void unswap_pmd(struct vm_area_struct * vma, pmd_t *dir,
Index: linux-2.6.10/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/hugetlbpage.c 2005-02-24 19:41:28.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/hugetlbpage.c 2005-02-24 19:42:17.000000000 -0800
@@ -46,7 +46,7 @@ static void set_huge_pte(struct mm_struc
{
pte_t entry;
- mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
if (write_access) {
entry =
pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
@@ -86,7 +86,7 @@ int copy_hugetlb_page_range(struct mm_st
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ update_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
addr += HPAGE_SIZE;
}
return 0;
@@ -222,7 +222,7 @@ void unmap_hugepage_range(struct vm_area
page = pte_page(pte);
put_page(page);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
+ update_mm_counter(mm, rss, -((end - start) >> PAGE_SHIFT));
flush_tlb_range(vma, start, end);
}
Index: linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/kernel/binfmt_aout32.c 2005-02-24 19:41:32.000000000 -0800
+++ linux-2.6.10/arch/sparc64/kernel/binfmt_aout32.c 2005-02-24 19:42:17.000000000 -0800
@@ -241,7 +241,7 @@ static int load_aout32_binary(struct lin
current->mm->brk = ex.a_bss +
(current->mm->start_brk = N_BSSADDR(ex));
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
current->mm->mmap = NULL;
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
Index: linux-2.6.10/fs/proc/array.c
===================================================================
--- linux-2.6.10.orig/fs/proc/array.c 2005-02-24 19:41:44.000000000 -0800
+++ linux-2.6.10/fs/proc/array.c 2005-02-24 19:42:17.000000000 -0800
@@ -423,7 +423,7 @@ static int do_task_stat(struct task_stru
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.10/fs/binfmt_elf_fdpic.c
===================================================================
--- linux-2.6.10.orig/fs/binfmt_elf_fdpic.c 2005-02-24 19:41:43.000000000 -0800
+++ linux-2.6.10/fs/binfmt_elf_fdpic.c 2005-02-24 19:42:17.000000000 -0800
@@ -299,7 +299,7 @@ static int load_elf_fdpic_binary(struct
/* do this so that we can load the interpreter, if need be
* - we will change some of these later
*/
- current->mm->rss = 0;
+ set_mm_counter(current->mm, rss, 0);
#ifdef CONFIG_MMU
retval = setup_arg_pages(bprm, current->mm->start_stack, executable_stack);
Index: linux-2.6.10/mm/nommu.c
===================================================================
--- linux-2.6.10.orig/mm/nommu.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/mm/nommu.c 2005-02-24 19:42:17.000000000 -0800
@@ -962,10 +962,11 @@ void arch_unmap_area(struct vm_area_stru
void update_mem_hiwater(void)
{
struct task_struct *tsk = current;
if (likely(tsk->mm)) {
+ unsigned long rss = get_mm_counter(tsk->mm, rss);
+
- if (tsk->mm->hiwater_rss < tsk->mm->rss)
- tsk->mm->hiwater_rss = tsk->mm->rss;
+ if (tsk->mm->hiwater_rss < rss)
+ tsk->mm->hiwater_rss = rss;
if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
tsk->mm->hiwater_vm = tsk->mm->total_vm;
}
Index: linux-2.6.10/kernel/acct.c
===================================================================
--- linux-2.6.10.orig/kernel/acct.c 2005-02-24 19:41:50.000000000 -0800
+++ linux-2.6.10/kernel/acct.c 2005-02-24 19:42:17.000000000 -0800
@@ -544,7 +544,7 @@ void acct_update_integrals(void)
if (delta == 0)
return;
tsk->acct_stimexpd = tsk->stime;
- tsk->acct_rss_mem1 += delta * tsk->mm->rss;
+ tsk->acct_rss_mem1 += delta * get_mm_counter(tsk->mm, rss);
tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
}
}
* Page Fault Scalability patch V19 [3/4]: Drop use of page_table_lock in handle_mm_fault
2005-03-09 20:13 Page Fault Scalabilty patch V19 [0/4]: Overview Christoph Lameter
2005-03-09 20:13 ` Page Fault Scalability patch V19 [1/4]: pte_cmpxchg and CONFIG_ATOMIC_TABLE_OPS Christoph Lameter
2005-03-09 20:13 ` Page Fault Scalability patch V19 [2/4]: Abstract mm_struct counter operations Christoph Lameter
@ 2005-03-09 20:13 ` Christoph Lameter
2005-03-09 20:13 ` Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page Christoph Lameter
3 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 20:13 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-ia64, Christoph Lameter
The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock
is reacquired, the handler checks whether the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.
The following patch removes the first acquisition of the page_table_lock in
the page fault handler and uses atomic operations on the page table instead.
using atomic pte operations is begun with
page_table_atomic_start(struct mm_struct *)
and ends with
page_table_atomic_stop(struct mm_struct *)
Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).
The atomic operations ptep_xchg and ptep_cmpxchg only work on the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definition of higher level atomic operations is
included.
This patch depends on the pte_cmpxchg patch being applied first and will
only remove the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:
1. Updating of access bits (handle_mm_fault)
2. Anonymous read faults (do_anonymous_page)
The page_table_lock is still acquired when creating a new pte for an anonymous
write fault, and therefore the rss problems that were addressed by splitting
rss into the task structure do not yet occur.
The patch also adds some diagnostics: it counts the number of cmpxchg
failures (useful to verify that the patch works as intended) and the number of
page faults that led to no change in the page table. The statistics may be
viewed via /proc/meminfo
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-04 08:25:22.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-04 12:10:18.000000000 -0800
@@ -36,6 +36,8 @@
* (Gerhard.Wichert@pdb.siemens.de)
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005 Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
*/
#include <linux/kernel_stat.h>
@@ -1687,8 +1689,7 @@ void swapin_readahead(swp_entry_t entry,
}
/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and have started atomic pte operations
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1700,15 +1701,14 @@ static int do_swap_page(struct mm_struct
int ret = VM_FAULT_MINOR;
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1731,8 +1731,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1782,63 +1781,76 @@ out:
}
/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held and atomic pte operations started.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;
- /* Read-only mapping of ZERO_PAGE. */
- entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ if (unlikely(!write_access)) {
- /* ..except if it's a write access */
- if (write_access) {
- /* Allocate our own private page. */
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+
+ /*
+ * If the cmpxchg fails then another fault may be
+ * generated that may then be successful
+ */
+
+ if (ptep_cmpxchg(page_table, orig_entry, entry))
+ update_mmu_cache(vma, addr, entry);
+ else
+ inc_page_state(cmpxchg_fail_anon_read);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_zeroed_user_highpage(vma, addr);
- if (!page)
- goto no_mem;
+ return VM_FAULT_MINOR;
+ }
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);
+ page_table_atomic_stop(mm);
- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- update_mm_counter(mm, rss, 1);
- acct_update_integrals();
- update_mem_hiwater();
- entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
- vma->vm_page_prot)),
- vma);
- lru_cache_add_active(page);
- SetPageReferenced(page);
- page_add_anon_rmap(page, vma, addr);
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ page = alloc_zeroed_user_highpage(vma, addr);
+ if (!page)
+ return VM_FAULT_OOM;
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
+ vma->vm_page_prot)),
+ vma);
+
+ spin_lock(&mm->page_table_lock);
+
+ if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ inc_page_state(cmpxchg_fail_anon_write);
+ return VM_FAULT_MINOR;
}
- set_pte(page_table, entry);
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ page_add_anon_rmap(page, vma, addr);
+ lru_cache_add_active(page);
+ update_mm_counter(mm, rss, 1);
+ acct_update_integrals();
+ update_mem_hiwater();
+ SetPageReferenced(page);
+ update_mmu_cache(vma, addr, entry);
pte_unmap(page_table);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, addr, entry);
spin_unlock(&mm->page_table_lock);
-out:
+
return VM_FAULT_MINOR;
-no_mem:
- return VM_FAULT_OOM;
}
/*
@@ -1850,12 +1862,12 @@ no_mem:
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and atomic pte operations started.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1866,9 +1878,9 @@ do_no_page(struct mm_struct *mm, struct
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1976,7 +1988,7 @@ oom:
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1989,13 +2001,13 @@ static int do_file_page(struct mm_struct
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}
- pgoff = pte_to_pgoff(*pte);
+ pgoff = pte_to_pgoff(entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -2014,49 +2026,56 @@ static int do_file_page(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to make sure that this case is handled properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;
entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
- return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ if (!pte_write(entry)) {
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+ /* do_wp_page really needs the page table lock badly */
+ spin_lock(&mm->page_table_lock);
+ if (pte_same(entry, *pte))
+#endif
+ return do_wp_page(mm, vma, address, pte, pmd, entry);
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+ pte_unmap(pte);
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+#endif
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+
+ /*
+ * If the cmpxchg fails then we will get another fault which
+ * has another chance of successfully updating the page table entry.
+ */
+ if (ptep_cmpxchg(pte, entry, new_entry)) {
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, entry);
+ } else
+ inc_page_state(cmpxchg_fail_flag_update);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
+ if (pte_val(new_entry) == pte_val(entry))
+ inc_page_state(spurious_page_faults);
return VM_FAULT_MINOR;
}
@@ -2075,33 +2094,73 @@ int handle_mm_fault(struct mm_struct *mm
inc_page_state(pgfault);
- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We try to rely on the mmap_sem and the SMP-safe atomic PTE updates
+ * to synchronize with kswapd. However, the arch may fall back
+ * in page_table_atomic_start to the page table lock.
+ *
+ * We may be able to avoid taking and releasing the page_table_lock
+ * for the p??_alloc functions through atomic operations so we
+ * duplicate the functionality of pmd_alloc, pud_alloc and
+ * pte_alloc_map here.
*/
+ page_table_atomic_start(mm);
pgd = pgd_offset(mm, address);
- spin_lock(&mm->page_table_lock);
+ if (unlikely(pgd_none(*pgd))) {
+ pud_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pud_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pud_free(new);
+ }
- pud = pud_alloc(mm, pgd, address);
- if (!pud)
- goto oom;
-
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
- goto oom;
-
- pte = pte_alloc_map(mm, pmd, address);
- if (!pte)
- goto oom;
+ pud = pud_offset(pgd, address);
+ if (unlikely(pud_none(*pud))) {
+ pmd_t *new;
+
+ page_table_atomic_stop(mm);
+ new = pmd_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
- return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+ page_table_atomic_start(mm);
+
+ if (!pud_test_and_populate(mm, pud, new))
+ pmd_free(new);
+ }
- oom:
- spin_unlock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ pmd = pmd_offset(pud, address);
+ if (unlikely(!pmd_present(*pmd))) {
+ struct page *new;
+
+ page_table_atomic_stop(mm);
+ new = pte_alloc_one(mm, address);
+
+ if (!new)
+ return VM_FAULT_OOM;
+
+ page_table_atomic_start(mm);
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else {
+ inc_page_state(nr_page_table_pages);
+ mm->nr_ptes++;
+ }
+ }
+
+ pte = pte_offset_map(pmd, address);
+ return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}
#ifndef __ARCH_HAS_4LEVEL_HACK
Index: linux-2.6.11/include/asm-generic/pgtable-nopud.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopud.h 2005-03-01 23:38:12.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable-nopud.h 2005-03-04 10:03:36.000000000 -0800
@@ -25,8 +25,14 @@ static inline int pgd_bad(pgd_t pgd) {
static inline int pgd_present(pgd_t pgd) { return 1; }
static inline void pgd_clear(pgd_t *pgd) { }
#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
#define pgd_populate(mm, pgd, pud) do { } while (0)
+
+#define __HAVE_ARCH_PGD_TEST_AND_POPULATE
+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+{
+ return 1;
+}
+
/*
* (puds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
Index: linux-2.6.11/include/asm-generic/pgtable-nopmd.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable-nopmd.h 2005-03-01 23:37:49.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable-nopmd.h 2005-03-04 10:03:36.000000000 -0800
@@ -29,6 +29,11 @@ static inline void pud_clear(pud_t *pud)
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))
#define pud_populate(mm, pmd, pte) do { } while (0)
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+static inline int pud_test_and_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+ return 1;
+}
/*
* (pmds are folded into puds so this doesn't get actually called,
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h 2005-03-03 10:20:56.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable.h 2005-03-04 10:03:36.000000000 -0800
@@ -105,8 +105,14 @@ static inline pte_t ptep_get_and_clear(p
#ifdef CONFIG_ATOMIC_TABLE_OPS
/*
- * The architecture does support atomic table operations.
- * Thus we may provide generic atomic ptep_xchg and ptep_cmpxchg using
+ * The architecture does support atomic table operations and
+ * all operations on page table entries must always be atomic.
+ *
+ * This means that the kernel will never encounter a partially updated
+ * page table entry.
+ *
+ * Since the architecture does support atomic table operations, we
+ * may provide generic atomic ptep_xchg and ptep_cmpxchg using
* cmpxchg and xchg.
*/
#ifndef __HAVE_ARCH_PTEP_XCHG
@@ -132,6 +138,65 @@ static inline pte_t ptep_get_and_clear(p
})
#endif
+/*
+ * page_table_atomic_start and page_table_atomic_stop may be used to
+ * define special measures that an arch needs to guarantee atomic
+ * operations outside of a spinlock. In the case that an arch does
+ * not support atomic page table operations we will fall back to the
+ * page table lock.
+ */
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_START
+#define page_table_atomic_start(mm) do { } while (0)
+#endif
+
+#ifndef __HAVE_ARCH_PAGE_TABLE_ATOMIC_STOP
+#define page_table_atomic_stop(mm) do { } while (0)
+#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These simply acquire the page_table_lock for
+ * synchronization. An architecture may override these generic
+ * functions to provide atomic populate functions to make these
+ * more effective.
+ */
+
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&(__mm)->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&(__mm)->page_table_lock); \
+ __rc; \
+})
+#endif
+
#else
/*
@@ -142,6 +207,11 @@ static inline pte_t ptep_get_and_clear(p
* short time frame. This means that the page_table_lock must be held
* to avoid a page fault that would install a new entry.
*/
+
+/* Fall back to the page table lock to synchronize page table access */
+#define page_table_atomic_start(mm) spin_lock(&(mm)->page_table_lock)
+#define page_table_atomic_stop(mm) spin_unlock(&(mm)->page_table_lock)
+
#ifndef __HAVE_ARCH_PTEP_XCHG
#define ptep_xchg(__ptep, __pteval) \
({ \
@@ -186,6 +256,41 @@ static inline pte_t ptep_get_and_clear(p
r; \
})
#endif
+
+/*
+ * Fallback functions for atomic population of higher page table
+ * structures. These rely on the page_table_lock being held.
+ */
+#ifndef __HAVE_ARCH_PGD_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pud) \
+({ \
+ int __rc; \
+ __rc = pgd_none(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pud); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define pud_test_and_populate(__mm, __pud, __pmd) \
+({ \
+ int __rc; \
+ __rc = pud_none(*(__pud)); \
+ if (__rc) pud_populate(__mm, __pud, __pmd); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ __rc; \
+})
+#endif
+
#endif
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
Index: linux-2.6.11/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgtable.h 2005-03-01 23:37:53.000000000 -0800
+++ linux-2.6.11/include/asm-ia64/pgtable.h 2005-03-04 10:03:36.000000000 -0800
@@ -554,6 +554,8 @@ do { \
#define FIXADDR_USER_START GATE_ADDR
#define FIXADDR_USER_END (GATE_ADDR + 2*PERCPU_PAGE_SIZE)
+#define __HAVE_ARCH_PUD_TEST_AND_POPULATE
+#define __HAVE_ARCH_PMD_TEST_AND_POPULATE
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -561,7 +563,7 @@ do { \
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
-#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable.h>
#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.11/include/linux/page-flags.h
===================================================================
--- linux-2.6.11.orig/include/linux/page-flags.h 2005-03-01 23:38:13.000000000 -0800
+++ linux-2.6.11/include/linux/page-flags.h 2005-03-04 10:03:36.000000000 -0800
@@ -131,6 +131,17 @@ struct page_state {
unsigned long allocstall; /* direct reclaim calls */
unsigned long pgrotated; /* pages rotated to tail of the LRU */
+
+ /* Low level counters */
+ unsigned long spurious_page_faults; /* Faults with no ops */
+ unsigned long cmpxchg_fail_flag_update; /* cmpxchg failures for pte flag update */
+ unsigned long cmpxchg_fail_flag_reuse; /* cmpxchg failures when cow reuse of pte */
+ unsigned long cmpxchg_fail_anon_read; /* cmpxchg failures on anonymous read */
+ unsigned long cmpxchg_fail_anon_write; /* cmpxchg failures on anonymous write */
+
+ /* rss deltas for the current executing thread */
+ long rss;
+ long anon_rss;
};
extern void get_page_state(struct page_state *ret);
Index: linux-2.6.11/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.11.orig/fs/proc/proc_misc.c 2005-03-01 23:37:49.000000000 -0800
+++ linux-2.6.11/fs/proc/proc_misc.c 2005-03-04 10:03:36.000000000 -0800
@@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;
- get_page_state(&ps);
+ get_full_page_state(&ps);
get_zone_counts(&active, &inactive, &free);
/*
@@ -168,7 +168,12 @@ static int meminfo_read_proc(char *page,
"PageTables: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+ "Spurious page faults : %8lu\n"
+ "cmpxchg fail flag update: %8lu\n"
+ "cmpxchg fail COW reuse : %8lu\n"
+ "cmpxchg fail anon read : %8lu\n"
+ "cmpxchg fail anon write : %8lu\n",
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -191,7 +196,12 @@ static int meminfo_read_proc(char *page,
K(ps.nr_page_table_pages),
VMALLOC_TOTAL >> 10,
vmi.used >> 10,
- vmi.largest_chunk >> 10
+ vmi.largest_chunk >> 10,
+ ps.spurious_page_faults,
+ ps.cmpxchg_fail_flag_update,
+ ps.cmpxchg_fail_flag_reuse,
+ ps.cmpxchg_fail_anon_read,
+ ps.cmpxchg_fail_anon_write
);
len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.11/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.11.orig/include/asm-ia64/pgalloc.h 2005-03-01 23:37:31.000000000 -0800
+++ linux-2.6.11/include/asm-ia64/pgalloc.h 2005-03-04 10:03:36.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)
+/* Empty entries of PMD and PUD */
+#define PMD_NONE 0
+#define PUD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -82,6 +86,13 @@ pud_populate (struct mm_struct *mm, pud_
pud_val(*pud_entry) = __pa(pmd);
}
+/* Atomic populate */
+static inline int
+pud_test_and_populate (struct mm_struct *mm, pud_t *pud_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pud_entry,__pa(pmd), PUD_NONE) == PUD_NONE;
+}
+
static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
{
@@ -127,6 +138,14 @@ pmd_populate (struct mm_struct *mm, pmd_
pmd_val(*pmd_entry) = page_to_phys(pte);
}
+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
^ permalink raw reply	[flat|nested] 13+ messages in thread

* Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 20:13 Page Fault Scalability patch V19 [0/4]: Overview Christoph Lameter
` (2 preceding siblings ...)
2005-03-09 20:13 ` Page Fault Scalability patch V19 [3/4]: Drop use of page_table_lock in handle_mm_fault Christoph Lameter
@ 2005-03-09 20:13 ` Christoph Lameter
2005-03-09 22:56 ` Andi Kleen
3 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 20:13 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-ia64, Christoph Lameter
Do not use the page_table_lock in do_anonymous_page. This will significantly
increase the parallelism in the page fault handler in SMP systems. The patch
also modifies the definitions of _mm_counter functions so that rss and anon_rss
become atomic.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-09 10:43:28.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-09 10:43:29.000000000 -0800
@@ -1825,12 +1825,12 @@ do_anonymous_page(struct mm_struct *mm,
vma->vm_page_prot)),
vma);
- spin_lock(&mm->page_table_lock);
+ page_table_atomic_start(mm);
if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
inc_page_state(cmpxchg_fail_anon_write);
return VM_FAULT_MINOR;
}
@@ -1848,7 +1848,7 @@ do_anonymous_page(struct mm_struct *mm,
SetPageReferenced(page);
update_mmu_cache(vma, addr, entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ page_table_atomic_stop(mm);
return VM_FAULT_MINOR;
}
Index: linux-2.6.11/include/linux/sched.h
===================================================================
--- linux-2.6.11.orig/include/linux/sched.h 2005-03-09 10:43:26.000000000 -0800
+++ linux-2.6.11/include/linux/sched.h 2005-03-09 10:43:29.000000000 -0800
@@ -203,10 +203,26 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct vm_area_struct *area);
extern void arch_unmap_area_topdown(struct vm_area_struct *area);
+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * Atomic page table operations require that the counters are also
+ * incremented atomically
+*/
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
+#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
+#define MM_COUNTER_T atomic_t
+
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
#define set_mm_counter(mm, member, value) (mm)->member = (value)
#define get_mm_counter(mm, member) ((mm)->member)
#define update_mm_counter(mm, member, value) (mm)->member += (value)
#define MM_COUNTER_T unsigned long
+#endif
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 20:13 ` Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page Christoph Lameter
@ 2005-03-09 22:56 ` Andi Kleen
2005-03-09 23:02 ` Christoph Lameter
0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2005-03-09 22:56 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
> Do not use the page_table_lock in do_anonymous_page. This will significantly
> increase the parallelism in the page fault handler in SMP systems. The patch
> also modifies the definitions of _mm_counter functions so that rss and anon_rss
> become atomic.
I still think it's a bad idea to add arbitrary process size limits like this:
>
> +#ifdef CONFIG_ATOMIC_TABLE_OPS
> +/*
> + * Atomic page table operations require that the counters are also
> + * incremented atomically
> +*/
> +#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
> +#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
> +#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
> +#define MM_COUNTER_T atomic_t
Can you use atomic64_t on 64bit systems at least?
-Andi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 22:56 ` Andi Kleen
@ 2005-03-09 23:02 ` Christoph Lameter
2005-03-09 23:14 ` Andi Kleen
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 23:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-ia64, linux-kernel
On Wed, 9 Mar 2005, Andi Kleen wrote:
> I still think it's a bad idea to add arbitrary process size limits like this:
The limit is pretty high: 2^31*PAGE_SIZE bytes. For the standard 4k
pagesize this will be >8TB.
> >
> > +#ifdef CONFIG_ATOMIC_TABLE_OPS
> > +/*
> > + * Atomic page table operations require that the counters are also
> > + * incremented atomically
> > +*/
> > +#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
> > +#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
> > +#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
> > +#define MM_COUNTER_T atomic_t
>
> Can you use atomic64_t on 64bit systems at least?
If atomic64_t is available on all 64 bit systems then it's no problem.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 23:02 ` Christoph Lameter
@ 2005-03-09 23:14 ` Andi Kleen
2005-03-09 23:17 ` Christoph Lameter
0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2005-03-09 23:14 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
> If atomic64_t is available on all 64 bit systems then it's no problem.
Most of them have it already. parisc64/ppc64/sh64 are missing it,
but I assume they will catch up quickly.
-Andi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 23:14 ` Andi Kleen
@ 2005-03-09 23:17 ` Christoph Lameter
2005-03-09 23:21 ` Andi Kleen
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 23:17 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-ia64, linux-kernel
On Wed, 10 Mar 2005, Andi Kleen wrote:
> > If atomic64_t is available on all 64 bit systems then it's no problem.
>
> Most of them have it already. parisc64/ppc64/sh64 are missing it,
> but I assume they will catch up quickly.
Changing the type for the counters is possible by changing only the
definition of MM_COUNTER_T in include/linux/sched.h. I would prefer to wait
until atomic64_t is available on all 64 bit platforms before making that
part of this patch.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 23:17 ` Christoph Lameter
@ 2005-03-09 23:21 ` Andi Kleen
2005-03-09 23:32 ` Christoph Lameter
0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2005-03-09 23:21 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
On Wed, Mar 09, 2005 at 03:17:10PM -0800, Christoph Lameter wrote:
> On Wed, 10 Mar 2005, Andi Kleen wrote:
>
> > > If atomic64_t is available on all 64 bit systems then it's no problem.
> >
> > Most of them have it already. parisc64/ppc64/sh64 are missing it,
> > but I assume they will catch up quickly.
>
> Changing the type for the counters is possible by changing only the
> definition of MM_COUNTER_T in include/linux/sched.h. I would prefer to wait
> until atomic64_t is available on all 64 bit platforms before making that
> part of this patch.
Well, they will not move until someone uses it (especially parisc
and sh64 which always are quite out of sync in mainline). ppc64
usually moves quickly.
But adding arbitrary limits like this even temporarily is imho
a bad idea.
-Andi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Page Fault Scalability patch V19 [4/4]: Drop use of page_table_lock in do_anonymous_page
2005-03-09 23:21 ` Andi Kleen
@ 2005-03-09 23:32 ` Christoph Lameter
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2005-03-09 23:32 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-ia64, linux-kernel
On Wed, 10 Mar 2005, Andi Kleen wrote:
> > Changing the type for the counters is possible by changing only the
> > definition of MM_COUNTER_T in include/linux/sched.h. I would prefer to wait
> > until atomic64_t is available on all 64 bit platforms before making that
> > part of this patch.
>
> Well, they will not move until someone uses it (especially parisc
> and sh64 which always are quite out of sync in mainline). ppc64
> usually moves quickly.
Hmm. I could add that with

#ifdef ATOMIC64_INIT
#define MM_COUNTER_T atomic64_t
#else
#define MM_COUNTER_T atomic_t
#endif
> But adding arbitrary limits like this even temporarily is imho
> a bad idea.
Hmmm yes this could actually develop into an issue for us. Columbia has
20 Terabytes of memory (some rumors have it that it gets up to 500TB but
maybe that was just a journalist). But Columbia has only 1TB per 512-CPU
cluster addressable directly. So even the biggest box in existence
right now will be fine with V19.
But V20 will definitely support atomic64.
^ permalink raw reply [flat|nested] 13+ messages in thread