Linux-mm Archive on lore.kernel.org


* [Patch:000/005] wait_table and zonelist initializing for memory hotadd
From: Yasunori Goto @ 2006-04-11 11:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel ML, linux-mm, Yasunori Goto

Hi.

These patches initialize the wait_table and update the zonelists
for memory hot-add.
They can be used when a new node/zone becomes available.
When an empty zone becomes non-empty through memory hot-add,
its wait_table must be initialized and the zonelists must be updated.

  ex) x86-64 is a good example of new zone addition.
      - The system boots with all memory below the 4G address.
        All of the memory is in ZONE_DMA32.
      - Then memory above 4G is hot-added. It becomes ZONE_NORMAL, but
        the wait table of ZONE_NORMAL is not initialized at this time.
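
To give a feel for what sizing a zone's wait_table involves, here is a
minimal user-space sketch. It is not taken from this series: the name
wait_table_entries_sketch() and the three constants are assumptions chosen
for illustration, loosely modeled on the wait_table_hash_nr_entries() helper
named in the change log. The point is only that a zone with zero pages at
boot would get the smallest table, while the same zone after hot-add may
deserve a much larger one.

```c
#include <assert.h>

/* Assumed constants for illustration only. */
#define SK_PAGES_PER_WAITQUEUE	256UL
#define SK_WAIT_TABLE_MIN	4UL
#define SK_WAIT_TABLE_MAX	4096UL

/*
 * Pick a number of hash entries for a zone of zone_pages pages:
 * one wait queue per SK_PAGES_PER_WAITQUEUE pages, rounded up to a
 * power of two and clamped to [SK_WAIT_TABLE_MIN, SK_WAIT_TABLE_MAX].
 */
static unsigned long wait_table_entries_sketch(unsigned long zone_pages)
{
	unsigned long queues = zone_pages / SK_PAGES_PER_WAITQUEUE;
	unsigned long size = 1;

	while (size < queues)
		size <<= 1;
	if (size > SK_WAIT_TABLE_MAX)
		size = SK_WAIT_TABLE_MAX;
	if (size < SK_WAIT_TABLE_MIN)
		size = SK_WAIT_TABLE_MIN;

	/* The result is always a power of two, so it can be used as a hash mask + 1. */
	assert((size & (size - 1)) == 0);
	return size;
}
```

With these assumed constants, an empty zone gets 4 entries and a zone of
262144 pages (1G with 4K pages) gets 1024.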

This patch is for 2.6.17-rc1-mm2.

Please apply.

----------------------------
Change log from v1 of wait_table init and build_zonelist.
  - update for 2.6.17-rc1-mm2.
  - add comment for wait_table hash entries.
  - change name wait_table_size() -> wait_table_hash_nr_entries()

Change log from v4 of node hot-add.
  - wait_table and build_zonelists updating are picked up.
  - update for 2.6.17-rc1-mm1.
  - change allocation for wait_table from kmalloc() to vmalloc().
    vmalloc() is enough for it.

V4 of post is here.
<description>
http://marc.theaimsgroup.com/?l=linux-mm&w=2&r=1&s=memory+hotplug+node+v.4&q=b
<patches>
http://marc.theaimsgroup.com/?l=linux-mm&w=2&r=1&s=memory+hotplug+node+v.4.&q=b

-- 
Yasunori Goto 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC] [PATCH] support for oom_die
From: KAMEZAWA Hiroyuki @ 2006-04-11  5:29 UTC (permalink / raw)
  To: linux-mm

Hi,

This patch adds oom_die, a feature to panic at OOM.

I think the 2.6 kernel is very robust against OOM situations, but they
sometimes occur. Yes, oom_kill works well enough to get out of an OOM
situation, *when* the system wants to survive.

First, crash-dump is merged (to -mm?). So panic at OOM can be a method to
preserve *all* information at OOM. The current OOM killer kills processes with
SIGKILL, which doesn't preserve any information about the OOM situation. Only
the message log tells us something, and we have to imagine what happened.

Second, consider a clustering system that has a failover node replacement
system. Because oom_killer tends to kill the system slowly, one process at a
time, detecting the OOM and deciding whether to fail over tends to be
difficult (as far as I know). Panic at OOM is useful in such a system because
the failover system can replace the node immediately.

I'm sorry if this kind of discussion has been settled in the past.

-Kame
==
This patch adds the oom_die sysctl under sys.vm.

When oom_die==1, the system panics at out_of_memory instead of killing some
process. In some environments, I think panic is more useful than kill.

For example:
(1) When a host is a node of a clustering system and panics at OOM,
    the failover system can detect the out-of-memory panic easily and
    immediately. It can quickly replace the node with another node.

(2) When the system is equipped with crash dump, out-of-memory will cause a
    crash dump. While oom_killer cannot preserve enough information to
    determine the reason for the OOM, a crash dump can preserve *all*
    information. We can track it down.
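
The decision this sysctl introduces can be modeled in user space as below.
This is only a sketch under assumptions: the enum and function names are
made up for illustration, and the two constrained cases are assumed from
context, since the hunk in this patch only shows the check being added under
CONSTRAINT_NONE. As in the patch, any nonzero value of the flag selects the
panic path.

```c
#include <assert.h>

/* Hypothetical stand-ins for the allocation constraint kinds. */
enum oom_constraint_sketch {
	SK_CONSTRAINT_NONE,
	SK_CONSTRAINT_CPUSET,		/* assumed, not shown in the hunk */
	SK_CONSTRAINT_MEMORY_POLICY	/* assumed, not shown in the hunk */
};

enum oom_action_sketch {
	SK_OOM_KILL_TASK,
	SK_OOM_PANIC
};

/* Panic only for an unconstrained OOM with the flag set; otherwise kill. */
static enum oom_action_sketch
oom_decide_sketch(int oom_die, enum oom_constraint_sketch constraint)
{
	if (constraint == SK_CONSTRAINT_NONE && oom_die)
		return SK_OOM_PANIC;
	return SK_OOM_KILL_TASK;
}

/* Small self-check covering the three interesting cases. */
int oom_decide_demo(void)
{
	assert(oom_decide_sketch(0, SK_CONSTRAINT_NONE) == SK_OOM_KILL_TASK);
	assert(oom_decide_sketch(1, SK_CONSTRAINT_NONE) == SK_OOM_PANIC);
	assert(oom_decide_sketch(1, SK_CONSTRAINT_CPUSET) == SK_OOM_KILL_TASK);
	return 0;
}
```

If this reading of the hunk is right, an OOM confined to a cpuset or memory
policy would still go through the killer even with oom_die set.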

Signed-Off-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Index: linux-2.6.17-rc1-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/kernel/sysctl.c
+++ linux-2.6.17-rc1-mm2/kernel/sysctl.c
@@ -60,6 +60,7 @@ extern int proc_nr_files(ctl_table *tabl
 extern int C_A_D;
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
+extern int sysctl_oom_die;
 extern int max_threads;
 extern int sysrq_enabled;
 extern int core_uses_pid;
@@ -718,6 +719,14 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= VM_OOM_DIE,
+		.procname	= "oom_die",
+		.data		= &sysctl_oom_die,
+		.maxlen		= sizeof(sysctl_oom_die),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= VM_OVERCOMMIT_RATIO,
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
Index: linux-2.6.17-rc1-mm2/mm/oom_kill.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/mm/oom_kill.c
+++ linux-2.6.17-rc1-mm2/mm/oom_kill.c
@@ -23,7 +23,7 @@
 #include <linux/cpuset.h>
 
 /* #define DEBUG */
-
+int sysctl_oom_die = 0;
 /**
  * oom_badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -290,6 +290,12 @@ static struct mm_struct *oom_kill_proces
 	return oom_kill_task(p, message);
 }
 
+
+static void oom_die(void)
+{
+	panic("Panic: out of memory: oom_die is selected.");
+}
+
 /**
  * oom_kill - kill the "best" process when we run out of memory
  *
@@ -331,6 +337,8 @@ void out_of_memory(struct zonelist *zone
 
 	case CONSTRAINT_NONE:
 retry:
+		if (sysctl_oom_die)
+			oom_die();
 		/*
 		 * Rambo mode: Shoot down a process and hope it solves whatever
 		 * issues we may have.


* Re: [RFC/PATCH] Shared Page Tables [0/2]
From: Dave McCracken @ 2006-04-10 20:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Linux Kernel Mailing List, Linux Memory Management,
	Adam Litke, wli
In-Reply-To: <Pine.LNX.4.64.0604101320100.24029@schroedinger.engr.sgi.com>

--On Monday, April 10, 2006 13:20:59 -0700 Christoph Lameter
<clameter@sgi.com> wrote:

>> The lock changes to hugetlb are only to support sharing of pmd pages when
>> they contain hugetlb pages.  They just substitute the struct page lock
>> for the page_table_lock, and are only about 30 lines of code.  Is this
>> really worth separating out?
> 
> Ia64 does not use pmd pages for huge pages. It relies instead on a 
> separate region. I wonder if this works on IA64.

Sharing of hugetlb page tables is enabled on a per-architecture basis, so
if ia64 doesn't use pmd pages we shouldn't try to enable it.  If it's not
enabled all the locking in hugetlb resolves to using page_table_lock, so
the original semantics will be preserved.
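
A rough user-space model of that fallback, for illustration only: the
struct and function names are hypothetical, and SKETCH_PTSHARE_HUGEPAGE
stands in for the per-architecture config option. With the option undefined,
the lock lookup ignores the page table page and hands back the per-mm lock.

```c
#include <assert.h>

struct lock_sketch { int locked; };
struct mm_sketch { struct lock_sketch page_table_lock; };
struct ptpage_sketch { struct lock_sketch ptl; };

/* Define SKETCH_PTSHARE_HUGEPAGE to model an architecture that opts in. */
#ifdef SKETCH_PTSHARE_HUGEPAGE
static struct lock_sketch *
hugepte_lockptr_sketch(struct mm_sketch *mm, struct ptpage_sketch *pt)
{
	(void)mm;
	return &pt->ptl;		/* lock lives with the shared table */
}
#else
static struct lock_sketch *
hugepte_lockptr_sketch(struct mm_sketch *mm, struct ptpage_sketch *pt)
{
	(void)pt;
	return &mm->page_table_lock;	/* original per-mm locking */
}
#endif

/* Self-check for the default (sharing disabled) build. */
int hugepte_lockptr_demo(void)
{
	struct mm_sketch mm = { { 0 } };
	struct ptpage_sketch pt = { { 0 } };

#ifndef SKETCH_PTSHARE_HUGEPAGE
	assert(hugepte_lockptr_sketch(&mm, &pt) == &mm.page_table_lock);
#else
	assert(hugepte_lockptr_sketch(&mm, &pt) == &pt.ptl);
#endif
	return 0;
}
```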

Dave McCracken


* Re: [RFC/PATCH] Shared Page Tables [0/2]
From: Christoph Lameter @ 2006-04-10 20:20 UTC (permalink / raw)
  To: Dave McCracken
  Cc: Hugh Dickins, Linux Kernel Mailing List, Linux Memory Management,
	Adam Litke, wli
In-Reply-To: <200ED4FEFEB8AA8427120DE7@[10.1.1.4]>

On Mon, 10 Apr 2006, Dave McCracken wrote:

> The lock changes to hugetlb are only to support sharing of pmd pages when
> they contain hugetlb pages.  They just substitute the struct page lock for
> the page_table_lock, and are only about 30 lines of code.  Is this really
> worth separating out?

Ia64 does not use pmd pages for huge pages. It relies instead on a 
separate region. I wonder if this works on IA64.


* Re: Page Migration: Make do_swap_page redo the fault
From: Christoph Lameter @ 2006-04-10 20:19 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: akpm, Lee Schermerhorn, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604101933400.26478@blonde.wat.veritas.com>

On Mon, 10 Apr 2006, Hugh Dickins wrote:

> I have now checked through, and I'm relieved to conclude that neither
> of those other two PageSwapCache rechecks are necessary; and the rules
> are much as before.

Note that removing the check in do_swap_page only works because
remove_from_swap() changes the pte. Without that pte change,
do_swap_page could retrieve the old page via the swap map. It would wait
until page migration finished and then find that the page is
not in the pagecache anymore. Note that Lee Schermerhorn's lazy page
migration may rely on disabling remove_from_swap() for his migration
scheme. Lee? Looks like we are putting new barriers in front of you?

> In the try_to_unuse case, it's quite possible that !PageSwapCache there,
> because of a racing delete_from_swap_cache; but that case is correctly
> handled in the code that follows.

Ah. I see a later check 

if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) {

> So I believe we can safely remove these other two
> "Page migration has occured" blocks - can't we?

Hmmm... The increased count is also an argument against having to check 
for the race in do_swap_page(). So maybe Lee's lazy migration patchset 
should also be fine without these checks and there is actually no need
to rely on the ptes not being the same.


Remove two unnecessary PageSwapCache checks. The page refcount is raised,
and therefore page migration cannot occur in either function.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2006-04-08 11:14:20.000000000 -0700
+++ linux-2.6/mm/shmem.c	2006-04-10 13:13:43.000000000 -0700
@@ -1079,14 +1079,6 @@ repeat:
 			page_cache_release(swappage);
 			goto repeat;
 		}
-		if (!PageSwapCache(swappage)) {
-			/* Page migration has occured */
-			shmem_swp_unmap(entry);
-			spin_unlock(&info->lock);
-			unlock_page(swappage);
-			page_cache_release(swappage);
-			goto repeat;
-		}
 		if (PageWriteback(swappage)) {
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2006-04-02 21:55:26.000000000 -0700
+++ linux-2.6/mm/swapfile.c	2006-04-10 13:13:01.000000000 -0700
@@ -751,12 +751,6 @@ again:
 		wait_on_page_locked(page);
 		wait_on_page_writeback(page);
 		lock_page(page);
-		if (!PageSwapCache(page)) {
-			/* Page migration has occured */
-			unlock_page(page);
-			page_cache_release(page);
-			goto again;
-		}
 		wait_on_page_writeback(page);
 
 		/*


* Re: [RFC/PATCH] Shared Page Tables [0/2]
From: Dave McCracken @ 2006-04-10 20:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Linux Kernel Mailing List, Linux Memory Management,
	Adam Litke, wli
In-Reply-To: <Pine.LNX.4.64.0604101020230.22947@schroedinger.engr.sgi.com>

--On Monday, April 10, 2006 10:22:34 -0700 Christoph Lameter
<clameter@sgi.com> wrote:

>> Here's a new cut of the shared page table patch.  I divided it into
>> two patches.  The first one just fleshes out the
>> pxd_page/pxd_page_kernel macros across the architectures.  The
>> second one is the main patch.
>> (...)
> 
> Could you break out the locking changes to huge pages?

The lock changes to hugetlb are only to support sharing of pmd pages when
they contain hugetlb pages.  They just substitute the struct page lock for
the page_table_lock, and are only about 30 lines of code.  Is this really
worth separating out?

Dave McCracken


* Re: [RFC/PATCH] Shared Page Tables [1/2]
From: Dave McCracken @ 2006-04-10 19:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Linux Kernel Mailing List, Linux Memory Management
In-Reply-To: <1144695296.31255.16.camel@localhost.localdomain>

--On Monday, April 10, 2006 11:54:56 -0700 Dave Hansen
<haveblue@us.ibm.com> wrote:

>> Complete the macro definitions for pxd_page/pxd_page_kernel 
> 
> Could you explain a bit why these are needed for shared page tables?

The existing definitions define pte_page and pmd_page to return the struct
page for the pfn contained in that entry, and pmd_page_kernel returns the
kernel virtual address of it.  However, pud_page and pgd_page are defined
to return the kernel virtual address.  There are no macros that return the
struct page.

No one actually uses any of the pud_page and pgd_page macros (other than
one reference in the same include file).  After some discussion on the list
the last time I posted the patches, we agreed that changing pud_page and
pgd_page to be consistent with pmd_page is the best solution.  We also
agreed that I should go ahead and propagate that change across all
architectures even though not all of them currently support shared page
tables.  This patch is the result of that work.
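
To make the distinction concrete, here is a toy user-space model. Everything
in it is hypothetical (the sk_ arrays, the entry layout, the names); it only
illustrates the convention described above, where the _page form returns the
struct page for the pfn in an entry and the _page_kernel form returns the
virtual address of the same page.

```c
#include <assert.h>

#define SK_PAGE_SHIFT	12
#define SK_NR_PAGES	8UL

struct page_sketch { int unused; };

/* Toy stand-ins for the page array and the memory it describes. */
static struct page_sketch sk_mem_map[SK_NR_PAGES];
static unsigned char sk_memory[SK_NR_PAGES << SK_PAGE_SHIFT];

/* A page table entry: pfn in the high bits, flag bits below SK_PAGE_SHIFT. */
typedef struct { unsigned long val; } pxd_sketch_t;

static unsigned long pxd_pfn_sketch(pxd_sketch_t e)
{
	unsigned long pfn = e.val >> SK_PAGE_SHIFT;

	assert(pfn < SK_NR_PAGES);
	return pfn;
}

/* Returns the struct page describing the next-level table. */
static struct page_sketch *pxd_page_sketch(pxd_sketch_t e)
{
	return &sk_mem_map[pxd_pfn_sketch(e)];
}

/* Returns the virtual address of the next-level table itself. */
static void *pxd_page_kernel_sketch(pxd_sketch_t e)
{
	return &sk_memory[pxd_pfn_sketch(e) << SK_PAGE_SHIFT];
}
```

Both helpers start from the same pfn, so having every level offer both forms
under consistent names lets generic code pick whichever it needs.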

Dave McCracken


* Re: [RFC/PATCH] Shared Page Tables [1/2]
From: Dave Hansen @ 2006-04-10 18:54 UTC (permalink / raw)
  To: Dave McCracken
  Cc: Hugh Dickins, Linux Kernel Mailing List, Linux Memory Management
In-Reply-To: <1144685591.570.36.camel@wildcat.int.mccr.org>

On Mon, 2006-04-10 at 11:13 -0500, Dave McCracken wrote:
> Complete the macro definitions for pxd_page/pxd_page_kernel 

Could you explain a bit why these are needed for shared page tables?

-- Dave


* Re: Page Migration: Make do_swap_page redo the fault
From: Hugh Dickins @ 2006-04-10 18:54 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604090357350.5312@blonde.wat.veritas.com>

On Sun, 9 Apr 2006, Hugh Dickins wrote:
> On Sat, 8 Apr 2006, Christoph Lameter wrote:
> > 
> > Those two checks were added for migration together with the one we 
> > are removing now. Sounds like you think they additionally fix some other 
> > race conditions?
> 
> But I do have to worry then.  I'd missed the addition of those checks:
> if they really are necessary, then the rules have changed in two
> tricky areas I now need to re-understand.  It'll take me a while.

I have now checked through, and I'm relieved to conclude that neither
of those other two PageSwapCache rechecks are necessary; and the rules
are much as before.

In the try_to_unuse case, it's quite possible that !PageSwapCache there,
because of a racing delete_from_swap_cache; but that case is correctly
handled in the code that follows.

In the shmem_getpage case, info->lock is held to ensure that a racing
shmem_getpage or shmem_unuse_inode can't change it to !PageSwapCache.

In neither case can page migration interfere, because we're holding a
reference on the page: acquired within find_get_page's tree_lock (or
in the initial page allocation before add_to_swap_cache).

migrate_page_remove_references is careful to check page_count against
nr_refs within the tree_lock, and back out if page_count is raised.
If it didn't do so, most uses of find_get_page would be unsafe.
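
The rule can be sketched in a single-threaded user-space model. The names
below are hypothetical and there is no real locking here; in the kernel both
sides run under the tree_lock, which is what makes the comparison reliable.

```c
#include <assert.h>

struct mig_page_sketch { int count; };

/* Lookup side: take a reference before using the page. */
static void get_page_sketch(struct mig_page_sketch *p)
{
	p->count++;
}

static void put_page_sketch(struct mig_page_sketch *p)
{
	assert(p->count > 0);
	p->count--;
}

/*
 * Migration side: proceed only if the page holds exactly the references
 * migration expects (nr_refs). Returns 0 to proceed, -1 to back out.
 */
static int try_migrate_sketch(const struct mig_page_sketch *p, int nr_refs)
{
	if (p->count != nr_refs)
		return -1;	/* someone else holds a reference */
	return 0;
}
```

A caller that has done the equivalent of find_get_page raises the count, so
migration backs out for as long as that reference is held.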

So I believe we can safely remove these other two
"Page migration has occured" blocks - can't we?

Hugh


* Re: [RFC/PATCH] Shared Page Tables [0/2]
From: Christoph Lameter @ 2006-04-10 17:22 UTC (permalink / raw)
  To: Dave McCracken
  Cc: Hugh Dickins, Linux Kernel Mailing List, Linux Memory Management,
	Adam Litke, wli
In-Reply-To: <1144685588.570.35.camel@wildcat.int.mccr.org>

On Mon, 10 Apr 2006, Dave McCracken wrote:

> Here's a new cut of the shared page table patch.  I divided it into
> two patches.  The first one just fleshes out the
> pxd_page/pxd_page_kernel macros across the architectures.  The
> second one is the main patch.
> 
> This version of the patch should address the concerns Hugh raised.
> Hugh, I'd appreciate your feedback again.  Did I get everything?
> 
> These patches apply against 2.6.17-rc1.

Could you break out the locking changes to huge pages?


* [RFC/PATCH] Shared Page Tables [2/2]
From: Dave McCracken @ 2006-04-10 16:13 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Kernel Mailing List, Linux Memory Management

Share page tables when possible.
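
As a reading aid for the share-count helpers in include/linux/ptshare.h
below, here is a toy user-space model. It is a sketch under assumptions: the
names are hypothetical, and it assumes the raw counter starts at -1 and that
the reported mapcount is the raw value plus one, which is how I understand
_mapcount to behave.

```c
#include <assert.h>

/* Toy page table page carrying only the share counter. */
struct pt_page_sketch { int mapcount_raw; };

static void pt_page_init_sketch(struct pt_page_sketch *p)
{
	p->mapcount_raw = -1;	/* assumed initial value */
}

static int pt_mapcount_sketch(const struct pt_page_sketch *p)
{
	return p->mapcount_raw + 1;
}

/* Mirrors the "mapcount > 0" test used by pt_is_shared() in the patch. */
static int pt_is_shared_sketch(const struct pt_page_sketch *p)
{
	return pt_mapcount_sketch(p) > 0;
}

static void pt_increment_share_sketch(struct pt_page_sketch *p)
{
	p->mapcount_raw++;
}

static void pt_decrement_share_sketch(struct pt_page_sketch *p)
{
	assert(p->mapcount_raw >= 0);	/* never drop below the initial -1 */
	p->mapcount_raw--;
}
```

Under these assumptions a freshly initialized table is unshared, and a single
increment is enough to make it count as shared.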

Signed-off-by: Dave McCracken <dmccr@us.ibm.com>

----

 arch/i386/Kconfig             |   12 +
 arch/s390/Kconfig             |   22 +
 arch/x86_64/Kconfig           |   22 +
 include/asm-generic/pgtable.h |   31 ++
 include/linux/mm.h            |   12 -
 include/linux/ptshare.h       |  175 +++++++++++++++
 include/linux/rmap.h          |    2 
 include/linux/sched.h         |    2 
 mm/Makefile                   |    1 
 mm/filemap_xip.c              |    3 
 mm/fremap.c                   |    6 
 mm/hugetlb.c                  |   54 +++-
 mm/memory.c                   |   39 ++-
 mm/mmap.c                     |    3 
 mm/mprotect.c                 |   10 
 mm/mremap.c                   |    7 
 mm/ptshare.c                  |  463 ++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                     |   21 +
 18 files changed, 846 insertions(+), 39 deletions(-)

----

--- 2.6.17-rc1-macro/./arch/i386/Kconfig	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./arch/i386/Kconfig	2006-04-10 08:46:01.000000000 -0500
@@ -512,6 +512,18 @@ config X86_PAE
 	depends on HIGHMEM64G
 	default y
 
+config PTSHARE
+	bool "Share page tables"
+	default y
+	help
+	  Turn on sharing of page tables between processes for large shared
+	  memory regions.
+
+config PTSHARE_PTE
+	bool
+	depends on PTSHARE
+	default y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
--- 2.6.17-rc1-macro/./arch/s390/Kconfig	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./arch/s390/Kconfig	2006-04-10 08:46:01.000000000 -0500
@@ -218,6 +218,28 @@ config WARN_STACK_SIZE
 
 source "mm/Kconfig"
 
+config PTSHARE
+	bool "Share page tables"
+	default y
+	help
+	  Turn on sharing of page tables between processes for large shared
+	  memory regions.
+
+menu "Page table levels to share"
+	depends on PTSHARE
+
+config PTSHARE_PTE
+	bool "Bottom level table (PTE)"
+	depends on PTSHARE
+	default y
+
+config PTSHARE_PMD
+	bool "Middle level table (PMD)"
+	depends on PTSHARE && 64BIT
+	default y
+
+endmenu
+
 comment "I/O subsystem configuration"
 
 config MACHCHK_WARNING
--- 2.6.17-rc1-macro/./arch/x86_64/Kconfig	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./arch/x86_64/Kconfig	2006-04-10 08:46:01.000000000 -0500
@@ -302,6 +302,28 @@ config NUMA_EMU
 	  into virtual nodes when booted with "numa=fake=N", where N is the
 	  number of nodes. This is only useful for debugging.
 
+config PTSHARE
+	bool "Share page tables"
+	default y
+	help
+	  Turn on sharing of page tables between processes for large shared
+	  memory regions.
+
+config PTSHARE_PTE
+	bool
+	depends on PTSHARE
+	default y
+
+config PTSHARE_PMD
+	bool
+	depends on PTSHARE
+	default y
+
+config PTSHARE_HUGEPAGE
+	bool
+	depends on PTSHARE && PTSHARE_PMD
+	default y
+
 config ARCH_DISCONTIGMEM_ENABLE
        bool
        depends on NUMA
--- 2.6.17-rc1-macro/./include/asm-generic/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./include/asm-generic/pgtable.h	2006-04-10 08:46:01.000000000 -0500
@@ -127,6 +127,16 @@ do {									\
 })
 #endif
 
+#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH_ALL
+#define ptep_clear_flush_all(__vma, __address, __ptep)			\
+({									\
+	pte_t __pte;							\
+	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
+	flush_tlb_all();				\
+	__pte;								\
+})
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 struct mm_struct;
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
@@ -173,6 +183,27 @@ static inline void ptep_set_wrprotect(st
 #endif
 
 /*
+ * Some architectures might need flushes when higher levels of page table
+ * are unshared.
+ */
+
+#ifndef __HAVE_ARCH_PMD_CLEAR_FLUSH
+#define pmd_clear_flush(__mm, __addr, __pmd)				\
+({									\
+	pmd_clear(__pmd);						\
+	flush_tlb_all();						\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PUD_CLEAR_FLUSH
+#define pud_clear_flush(__mm, __addr, __pud)				\
+({									\
+	pud_clear(__pud);						\
+	flush_tlb_all();						\
+})
+#endif
+
+/*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
--- 2.6.17-rc1-macro/./include/linux/mm.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./include/linux/mm.h	2006-04-10 08:46:01.000000000 -0500
@@ -166,6 +166,8 @@ extern unsigned int kobjsize(const void 
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_TRANSITION	0x04000000	/* The vma is in transition (mprotect, mremap, etc) */
+#define VM_PTSHARE	0x08000000	/* This vma has shared one or more page tables */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -242,7 +244,7 @@ struct page {
 						 * see PAGE_MAPPING_ANON below.
 						 */
 	    };
-#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+#if (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) || defined(CONFIG_PTSHARE)
 	    spinlock_t ptl;
 #endif
 	};
@@ -815,19 +817,19 @@ static inline pmd_t *pmd_alloc(struct mm
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
-#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+#if (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) || defined(CONFIG_PTSHARE)
 /*
  * We tuck a spinlock to guard each pagetable page into its struct page,
  * at page->private, with BUILD_BUG_ON to make sure that this will not
  * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
  * When freeing, reset page->mapping so free_pages_check won't complain.
  */
-#define __pte_lockptr(page)	&((page)->ptl)
+#define __pt_lockptr(page)	&((page)->ptl)
 #define pte_lock_init(_page)	do {					\
-	spin_lock_init(__pte_lockptr(_page));				\
+	spin_lock_init(__pt_lockptr(_page));				\
 } while (0)
 #define pte_lock_deinit(page)	((page)->mapping = NULL)
-#define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
+#define pte_lockptr(mm, pmd)	({(void)(mm); __pt_lockptr(pmd_page(*(pmd)));})
 #else
 /*
  * We use mm->page_table_lock to guard all pagetable pages of the mm.
--- 2.6.17-rc1-macro/./include/linux/rmap.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./include/linux/rmap.h	2006-04-10 08:46:01.000000000 -0500
@@ -98,7 +98,7 @@ void remove_from_swap(struct page *page)
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-				unsigned long, spinlock_t **);
+				unsigned long, spinlock_t **, int *);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
--- 2.6.17-rc1-macro/./include/linux/sched.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./include/linux/sched.h	2006-04-10 08:46:01.000000000 -0500
@@ -254,7 +254,7 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct mm_struct *, unsigned long);
 extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
 
-#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+#if (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) || defined(CONFIG_PTSHARE)
 /*
  * The mm counters are not protected by its page_table_lock,
  * so must be incremented atomically.
--- 2.6.17-rc1-macro/./include/linux/ptshare.h	1969-12-31 18:00:00.000000000 -0600
+++ 2.6.17-rc1-shpt/./include/linux/ptshare.h	2006-04-10 08:46:01.000000000 -0500
@@ -0,0 +1,175 @@
+#ifndef _LINUX_PTSHARE_H
+#define _LINUX_PTSHARE_H
+
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2005
+ *
+ * Author: Dave McCracken <dmccr@us.ibm.com>
+ */
+
+#undef	PT_DEBUG
+
+#ifdef CONFIG_PTSHARE
+static inline int pt_is_shared(struct page *page)
+{
+	return (page_mapcount(page) > 0);
+}
+
+static inline void pt_increment_share(struct page *page)
+{
+	atomic_inc(&page->_mapcount);
+}
+
+static inline void pt_decrement_share(struct page *page)
+{
+	atomic_dec(&page->_mapcount);
+}
+
+static inline int pt_vmashared(struct vm_area_struct *vma) {
+	return vma->vm_flags & VM_PTSHARE;
+}
+extern void pt_unshare_range(struct vm_area_struct *vma, unsigned long address,
+			     unsigned long end);
+#else /* CONFIG_PTSHARE */
+#define pt_is_shared(page)	(0)
+#define pt_increment_share(page)
+#define pt_decrement_share(page)
+#define pt_vmashared(vma) 0
+#define pt_unshare_range(vma, address, end)
+#endif /* CONFIG_PTSHARE */
+
+#ifdef CONFIG_PTSHARE_PTE
+static inline int pt_is_shared_pte(pmd_t pmdval)
+{
+	struct page *page;
+
+	page = pmd_page(pmdval);
+	return pt_is_shared(page);
+}
+
+static inline void pt_increment_pte(pmd_t pmdval)
+{
+	struct page *page;
+
+	page = pmd_page(pmdval);
+	pt_increment_share(page);
+}
+
+static inline void pt_decrement_pte(pmd_t pmdval)
+{
+	struct page *page;
+
+	page = pmd_page(pmdval);
+	pt_decrement_share(page);
+}
+
+extern pte_t * pt_share_pte(struct vm_area_struct *vma, unsigned long address,
+			    pmd_t *pmd);
+extern int pt_check_unshare_pte(struct mm_struct *mm, unsigned long address,
+				pmd_t *pmd);
+#else /* CONFIG_PTSHARE_PTE */
+static inline int pt_is_shared_pte(pmd_t pmdval)
+{
+	return 0;
+}
+static inline int pt_check_unshare_pte(struct mm_struct *mm, unsigned long address,
+				       pmd_t *pmd)
+{
+	return 0;
+}
+#define pt_increment_pte(pmdval)
+#define pt_decrement_pte(pmdval)
+#define pt_share_pte(vma, address, pmd) pte_alloc_map(vma->vm_mm, pmd, address)
+#endif /* CONFIG_PTSHARE_PTE */
+
+#ifdef CONFIG_PTSHARE_PMD
+static inline int pt_is_shared_pmd(pud_t pudval)
+{
+	struct page *page;
+
+	page = pud_page(pudval);
+	return pt_is_shared(page);
+}
+
+static inline void pt_increment_pmd(pud_t pudval)
+{
+	struct page *page;
+
+	page = pud_page(pudval);
+	pt_increment_share(page);
+}
+
+static inline void pt_decrement_pmd(pud_t pudval)
+{
+	struct page *page;
+
+	page = pud_page(pudval);
+	pt_decrement_share(page);
+}
+extern pmd_t * pt_share_pmd(struct vm_area_struct *vma, unsigned long address,
+			    pud_t *pud);
+extern int pt_check_unshare_pmd(struct mm_struct *mm, unsigned long address,
+				pud_t *pud);
+#else /* CONFIG_PTSHARE_PMD */
+static inline int pt_is_shared_pmd(pud_t pudval)
+{
+	return 0;
+}
+static inline int pt_check_unshare_pmd(struct mm_struct *mm, unsigned long address,
+				       pud_t *pud)
+{
+	return 0;
+}
+#define pt_increment_pmd(pudval)
+#define pt_decrement_pmd(pudval)
+#define pt_share_pmd(vma, address, pud) pmd_alloc(vma->vm_mm, pud, address)
+#endif /* CONFIG_PTSHARE_PMD */
+
+#ifdef CONFIG_PTSHARE_HUGEPAGE
+extern pte_t *pt_share_hugepage(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address);
+extern void pt_unshare_huge_range(struct vm_area_struct *vma, unsigned long address,
+				  unsigned long end);
+#else
+#define pt_share_hugepage(mm, vma, address)	huge_pte_alloc(mm, address)
+#define pt_unshare_huge_range(vma, address, end)
+#endif	/* CONFIG_PTSHARE_HUGEPAGE */
+
+/*
+ *  Locking macros...
+ *  All levels of page table at or above the level(s) we share use page_table_lock.
+ *  Each level below the share level uses the pt_lockptr in struct page in the level
+ *  above.
+ */
+
+#ifdef CONFIG_PTSHARE_PMD
+#define pmd_lock_init(pmd)	do {					\
+	spin_lock_init(__pt_lockptr(virt_to_page(pmd)));		\
+} while (0)
+#define pmd_lock_deinit(pmd)	((virt_to_page(pmd))->mapping = NULL)
+#define pmd_lockptr(mm, pmd)	({(void)(mm); __pt_lockptr(virt_to_page(pmd));})
+#else
+#define pmd_lock_init(pmd)	do {} while (0)
+#define pmd_lock_deinit(pmd)	do {} while (0)
+#define pmd_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
+#endif
+#ifdef CONFIG_PTSHARE_HUGEPAGE
+#define hugepte_lockptr(mm, pte)	pmd_lockptr(mm, pte)
+#else
+#define hugepte_lockptr(mm, pte)	({(void)(pte); &(mm)->page_table_lock;})
+#endif
+#endif /* _LINUX_PTSHARE_H */
--- 2.6.17-rc1-macro/./mm/Makefile	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/Makefile	2006-04-10 08:47:12.000000000 -0500
@@ -23,4 +23,5 @@ obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_PTSHARE) += ptshare.o
 
--- 2.6.17-rc1-macro/./mm/filemap_xip.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/filemap_xip.c	2006-04-10 08:46:01.000000000 -0500
@@ -174,6 +174,7 @@ __xip_unmap (struct address_space * mapp
 	unsigned long address;
 	pte_t *pte;
 	pte_t pteval;
+	int shared;
 	spinlock_t *ptl;
 	struct page *page;
 
@@ -184,7 +185,7 @@ __xip_unmap (struct address_space * mapp
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 		page = ZERO_PAGE(address);
-		pte = page_check_address(page, mm, address, &ptl);
+		pte = page_check_address(page, mm, address, &ptl, &shared);
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
--- 2.6.17-rc1-macro/./mm/fremap.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/fremap.c	2006-04-10 08:46:01.000000000 -0500
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/ptshare.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -193,6 +194,10 @@ asmlinkage long sys_remap_file_pages(uns
 				has_write_lock = 1;
 				goto retry;
 			}
+			vma->vm_flags |= VM_TRANSITION;
+			if (pt_vmashared(vma))
+				pt_unshare_range(vma, vma->vm_start, vma->vm_end);
+
 			mapping = vma->vm_file->f_mapping;
 			spin_lock(&mapping->i_mmap_lock);
 			flush_dcache_mmap_lock(mapping);
@@ -201,6 +206,7 @@ asmlinkage long sys_remap_file_pages(uns
 			vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 			flush_dcache_mmap_unlock(mapping);
 			spin_unlock(&mapping->i_mmap_lock);
+			vma->vm_flags &= ~VM_TRANSITION;
 		}
 
 		err = vma->vm_ops->populate(vma, start, size,
--- 2.6.17-rc1-macro/./mm/hugetlb.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/hugetlb.c	2006-04-10 09:05:39.000000000 -0500
@@ -11,6 +11,7 @@
 #include <linux/highmem.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
+#include <linux/ptshare.h>
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
@@ -437,6 +438,7 @@ int copy_hugetlb_page_range(struct mm_st
 	struct page *ptepage;
 	unsigned long addr;
 	int cow;
+	spinlock_t *src_ptl, *dst_ptl;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
@@ -447,8 +449,10 @@ int copy_hugetlb_page_range(struct mm_st
 		dst_pte = huge_pte_alloc(dst, addr);
 		if (!dst_pte)
 			goto nomem;
-		spin_lock(&dst->page_table_lock);
-		spin_lock(&src->page_table_lock);
+		dst_ptl = hugepte_lockptr(dst, dst_pte);
+		src_ptl = hugepte_lockptr(src, src_pte);
+		spin_lock(dst_ptl);
+		spin_lock(src_ptl);
 		if (!pte_none(*src_pte)) {
 			if (cow)
 				ptep_set_wrprotect(src, addr, src_pte);
@@ -458,8 +462,8 @@ int copy_hugetlb_page_range(struct mm_st
 			add_mm_counter(dst, file_rss, HPAGE_SIZE / PAGE_SIZE);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
-		spin_unlock(&dst->page_table_lock);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
 	}
 	return 0;
 
@@ -475,12 +479,14 @@ void unmap_hugepage_range(struct vm_area
 	pte_t *ptep;
 	pte_t pte;
 	struct page *page;
+	spinlock_t *ptl = NULL, *new_ptl;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
-	spin_lock(&mm->page_table_lock);
+	if (pt_vmashared(vma))
+		pt_unshare_huge_range(vma, start, end);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -490,6 +496,13 @@ void unmap_hugepage_range(struct vm_area
 		if (!ptep)
 			continue;
 
+		new_ptl = hugepte_lockptr(mm, ptep);
+		if (new_ptl != ptl) {
+			if (ptl)
+				spin_unlock(ptl);
+			ptl = new_ptl;
+			spin_lock(ptl);
+		}
 		pte = huge_ptep_get_and_clear(mm, address, ptep);
 		if (pte_none(pte))
 			continue;
@@ -499,12 +512,15 @@ void unmap_hugepage_range(struct vm_area
 		add_mm_counter(mm, file_rss, (int) -(HPAGE_SIZE / PAGE_SIZE));
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	if (ptl)
+		spin_unlock(ptl);
+
 	flush_tlb_range(vma, start, end);
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, pte_t pte)
+			unsigned long address, pte_t *ptep, pte_t pte,
+		       spinlock_t *ptl)
 {
 	struct page *old_page, *new_page;
 	int avoidcopy;
@@ -527,9 +543,9 @@ static int hugetlb_cow(struct mm_struct 
 		return VM_FAULT_OOM;
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	copy_huge_page(new_page, old_page, address);
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 
 	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
 	if (likely(pte_same(*ptep, pte))) {
@@ -553,6 +569,7 @@ int hugetlb_no_page(struct mm_struct *mm
 	struct page *page;
 	struct address_space *mapping;
 	pte_t new_pte;
+	spinlock_t *ptl;
 
 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
@@ -590,7 +607,8 @@ retry:
 			lock_page(page);
 	}
 
-	spin_lock(&mm->page_table_lock);
+	ptl = hugepte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
 	if (idx >= size)
 		goto backout;
@@ -606,16 +624,16 @@ retry:
 
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, new_pte);
+		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, ptl);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	unlock_page(page);
 out:
 	return ret;
 
 backout:
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	hugetlb_put_quota(mapping);
 	unlock_page(page);
 	put_page(page);
@@ -628,9 +646,10 @@ int hugetlb_fault(struct mm_struct *mm, 
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
+	spinlock_t *ptl;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 
-	ptep = huge_pte_alloc(mm, address);
+	ptep = pt_share_hugepage(mm, vma, address & HPAGE_MASK);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
@@ -649,12 +668,13 @@ int hugetlb_fault(struct mm_struct *mm, 
 
 	ret = VM_FAULT_MINOR;
 
-	spin_lock(&mm->page_table_lock);
+	ptl = hugepte_lockptr(mm, ptep);
+	spin_lock(ptl);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, *ptep)))
 		if (write_access && !pte_write(entry))
-			ret = hugetlb_cow(mm, vma, address, ptep, entry);
-	spin_unlock(&mm->page_table_lock);
+			ret = hugetlb_cow(mm, vma, address, ptep, entry, ptl);
+	spin_unlock(ptl);
 	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
--- 2.6.17-rc1-macro/./mm/memory.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/memory.c	2006-04-10 08:46:01.000000000 -0500
@@ -48,6 +48,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/ptshare.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -144,6 +145,10 @@ static inline void free_pmd_range(struct
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
+		if (pt_is_shared_pte(*pmd)) {
+			if (pt_check_unshare_pte(tlb->mm, addr, pmd))
+				continue;
+		}
 		free_pte_range(tlb, pmd);
 	} while (pmd++, addr = next, addr != end);
 
@@ -160,6 +165,7 @@ static inline void free_pmd_range(struct
 
 	pmd = pmd_offset(pud, start);
 	pud_clear(pud);
+	pmd_lock_deinit(pmd);
 	pmd_free_tlb(tlb, pmd);
 }
 
@@ -177,6 +183,10 @@ static inline void free_pud_range(struct
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (pt_is_shared_pmd(*pud)) {
+			if (pt_check_unshare_pmd(tlb->mm, addr, pud))
+				continue;
+		}
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
 	} while (pud++, addr = next, addr != end);
 
@@ -301,20 +311,21 @@ void free_pgtables(struct mmu_gather **t
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
 	struct page *new = pte_alloc_one(mm, address);
+	spinlock_t *ptl = pmd_lockptr(mm, pmd);
+
 	if (!new)
 		return -ENOMEM;
 
 	pte_lock_init(new);
-	spin_lock(&mm->page_table_lock);
+	spin_lock(ptl);
 	if (pmd_present(*pmd)) {	/* Another has populated it */
 		pte_lock_deinit(new);
 		pte_free(new);
 	} else {
-		mm->nr_ptes++;
 		inc_page_state(nr_page_table_pages);
 		pmd_populate(mm, pmd, new);
 	}
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(ptl);
 	return 0;
 }
 
@@ -581,7 +592,7 @@ int copy_page_range(struct mm_struct *ds
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+	if (!(vma->vm_flags & (VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
 		if (!vma->anon_vma)
 			return 0;
 	}
@@ -613,6 +624,10 @@ static unsigned long zap_pte_range(struc
 	int file_rss = 0;
 	int anon_rss = 0;
 
+	if (pt_is_shared_pte(*pmd)) {
+		if (pt_check_unshare_pte(mm, addr, pmd))
+			return end;
+	}
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	do {
 		pte_t ptent = *pte;
@@ -694,6 +709,10 @@ static inline unsigned long zap_pmd_rang
 	unsigned long next;
 
 	pmd = pmd_offset(pud, addr);
+	if (pt_is_shared_pmd(*pud)) {
+		if (pt_check_unshare_pmd(tlb->mm, addr, pud))
+			return end;
+	}
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd)) {
@@ -2272,10 +2291,10 @@ int __handle_mm_fault(struct mm_struct *
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
 		return VM_FAULT_OOM;
-	pmd = pmd_alloc(mm, pud, address);
+	pmd = pt_share_pmd(vma, address, pud);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, pmd, address);
+	pte = pt_share_pte(vma, address, pmd);
 	if (!pte)
 		return VM_FAULT_OOM;
 
@@ -2326,13 +2345,17 @@ int __pmd_alloc(struct mm_struct *mm, pu
 #ifndef __ARCH_HAS_4LEVEL_HACK
 	if (pud_present(*pud))		/* Another has populated it */
 		pmd_free(new);
-	else
+	else {
+		pmd_lock_init(new);
 		pud_populate(mm, pud, new);
+	}
 #else
 	if (pgd_present(*pud))		/* Another has populated it */
 		pmd_free(new);
-	else
+	else {
+		pmd_lock_init(new);
 		pgd_populate(mm, pud, new);
+	}
 #endif /* __ARCH_HAS_4LEVEL_HACK */
 	spin_unlock(&mm->page_table_lock);
 	return 0;
--- 2.6.17-rc1-macro/./mm/mmap.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/mmap.c	2006-04-10 08:46:01.000000000 -0500
@@ -1706,6 +1706,8 @@ int split_vma(struct mm_struct * mm, str
 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;
 
+	new->vm_flags &= ~VM_TRANSITION;
+
 	if (new_below)
 		new->vm_end = addr;
 	else {
@@ -1941,7 +1943,6 @@ void exit_mmap(struct mm_struct *mm)
 	while (vma)
 		vma = remove_vma(vma);
 
-	BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
 }
 
 /* Insert vm structure into process list sorted by address
--- 2.6.17-rc1-macro/./mm/mprotect.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/mprotect.c	2006-04-10 08:46:01.000000000 -0500
@@ -19,6 +19,7 @@
 #include <linux/mempolicy.h>
 #include <linux/personality.h>
 #include <linux/syscalls.h>
+#include <linux/ptshare.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -115,6 +116,10 @@ mprotect_fixup(struct vm_area_struct *vm
 		return 0;
 	}
 
+	vma->vm_flags |= VM_TRANSITION;
+	if (pt_vmashared(vma))
+		pt_unshare_range(vma, start, end);
+
 	/*
 	 * If we make a private mapping writable we increase our commit;
 	 * but (without finer accounting) cannot reduce our commit if we
@@ -126,8 +131,10 @@ mprotect_fixup(struct vm_area_struct *vm
 	if (newflags & VM_WRITE) {
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
 			charged = nrpages;
-			if (security_vm_enough_memory(charged))
+			if (security_vm_enough_memory(charged)) {
+				vma->vm_flags &= ~VM_TRANSITION;
 				return -ENOMEM;
+			}
 			newflags |= VM_ACCOUNT;
 		}
 	}
@@ -175,6 +182,7 @@ success:
 	return 0;
 
 fail:
+	vma->vm_flags &= ~VM_TRANSITION;
 	vm_unacct_memory(charged);
 	return error;
 }
--- 2.6.17-rc1-macro/./mm/mremap.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/mremap.c	2006-04-10 08:46:01.000000000 -0500
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/ptshare.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -178,6 +179,10 @@ static unsigned long move_vma(struct vm_
 	if (!new_vma)
 		return -ENOMEM;
 
+	vma->vm_flags |= VM_TRANSITION;
+	if (pt_vmashared(vma))
+		pt_unshare_range(vma, old_addr, old_addr + old_len);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
 	if (moved_len < old_len) {
 		/*
@@ -221,6 +226,8 @@ static unsigned long move_vma(struct vm_
 	}
 	mm->hiwater_vm = hiwater_vm;
 
+	vma->vm_flags &= ~VM_TRANSITION;
+
 	/* Restore VM_ACCOUNT if one or two pieces of vma left */
 	if (excess) {
 		vma->vm_flags |= VM_ACCOUNT;
--- 2.6.17-rc1-macro/./mm/rmap.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-shpt/./mm/rmap.c	2006-04-10 08:46:01.000000000 -0500
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/ptshare.h>
 
 #include <asm/tlbflush.h>
 
@@ -286,7 +287,8 @@ unsigned long page_address_in_vma(struct
  * On success returns with pte mapped and locked.
  */
 pte_t *page_check_address(struct page *page, struct mm_struct *mm,
-			  unsigned long address, spinlock_t **ptlp)
+			  unsigned long address, spinlock_t **ptlp,
+			  int *shared)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -306,6 +308,9 @@ pte_t *page_check_address(struct page *p
 	if (!pmd_present(*pmd))
 		return NULL;
 
+	if (pt_is_shared_pmd(*pud))
+		(*shared)++;
+
 	pte = pte_offset_map(pmd, address);
 	/* Make a quick check before getting the lock */
 	if (!pte_present(*pte)) {
@@ -313,6 +318,9 @@ pte_t *page_check_address(struct page *p
 		return NULL;
 	}
 
+	if (pt_is_shared_pte(*pmd))
+		(*shared)++;
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
@@ -333,6 +341,7 @@ static int page_referenced_one(struct pa
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
 	pte_t *pte;
+	int shared = 0;
 	spinlock_t *ptl;
 	int referenced = 0;
 
@@ -340,7 +349,7 @@ static int page_referenced_one(struct pa
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
+	pte = page_check_address(page, mm, address, &ptl, &shared);
 	if (!pte)
 		goto out;
 
@@ -584,6 +593,7 @@ static int try_to_unmap_one(struct page 
 	unsigned long address;
 	pte_t *pte;
 	pte_t pteval;
+	int shared = 0;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 
@@ -591,7 +601,7 @@ static int try_to_unmap_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
+	pte = page_check_address(page, mm, address, &ptl, &shared);
 	if (!pte)
 		goto out;
 
@@ -609,7 +619,10 @@ static int try_to_unmap_one(struct page 
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	if (shared)
+		pteval = ptep_clear_flush_all(vma, address, pte);
+	else
+		pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
--- 2.6.17-rc1-macro/./mm/ptshare.c	1969-12-31 18:00:00.000000000 -0600
+++ 2.6.17-rc1-shpt/./mm/ptshare.c	2006-04-10 08:46:01.000000000 -0500
@@ -0,0 +1,463 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2005
+ *
+ * Author: Dave McCracken <dmccr@us.ibm.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/prio_tree.h>
+#include <linux/mm.h>
+#include <linux/ptshare.h>
+
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+
+#define VM_PGEND(vma)	((((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT) - 1)
+
+#define	VMFLAG_COMPARE	(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)
+
+static int pt_shareable_vma(struct vm_area_struct *vma)
+{
+	/* We can't share anonymous memory */
+	if (!vma->vm_file)
+		return 0;
+
+	/* No sharing of nonlinear areas or vmas in transition */
+	if (vma->vm_flags & (VM_NONLINEAR|VM_TRANSITION|VM_PFNMAP|VM_INSERTPAGE))
+		return 0;
+
+	/* Only share shared mappings or read-only mappings */
+	if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) == VM_WRITE)
+		return 0;
+
+	/* If it's smaller than the smallest shareable unit, don't bother
+	   calling it shareable */
+	if ((vma->vm_end - vma->vm_start) < PMD_SIZE)
+		return 0;
+
+	return 1;
+}
+
+#ifdef CONFIG_PTSHARE_PTE
+static inline void pt_unshare_pte(struct mm_struct *mm, pmd_t *pmd,
+				  unsigned long address)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	if (pmd_present(*pmd)) {
+		page = pmd_page(*pmd);
+		ptl = __pt_lockptr(page);
+		spin_lock(ptl);
+		if (pt_is_shared(page)) {
+#ifdef PT_DEBUG
+			printk(KERN_DEBUG "Unsharing pte page at address 0x%lx\n",
+			       address);
+#endif
+			pt_decrement_share(page);
+			pmd_clear_flush(mm, address, pmd);
+		}
+		spin_unlock(ptl);
+	}
+}
+#endif
+
+#ifndef __PAGETABLE_PMD_FOLDED
+static void pt_unshare_pmd(struct mm_struct *mm, pud_t *pud, unsigned long address,
+			   unsigned long end, int hugepage)
+{
+	if (!pud_present(*pud))
+		return;
+
+#ifdef CONFIG_PTSHARE_PMD
+	{
+		struct page *page;
+		spinlock_t *ptl;
+
+		page = pud_page(*pud);
+		ptl = __pt_lockptr(page);
+		spin_lock(ptl);
+		if (pt_is_shared(page)) {
+#ifdef PT_DEBUG
+			printk(KERN_DEBUG "Unsharing pmd page at address 0x%lx\n",
+			       address);
+#endif
+			pt_decrement_share(page);
+			spin_unlock(ptl);
+			pud_clear_flush(mm, address, pud);
+			return;
+		}
+		spin_unlock(ptl);
+	}
+#endif
+#ifdef CONFIG_PTSHARE_PTE
+	if (hugepage)
+		return;
+
+	{
+		pmd_t *pmd;
+
+		pmd = pmd_offset(pud, address);
+		end = pud_addr_end(address, end);
+		while (address <= end) {
+			pt_unshare_pte(mm, pmd, address);
+			pmd++;
+			address += PMD_SIZE;
+		}
+	}
+#endif
+}
+
+#ifndef __PAGETABLE_PUD_FOLDED
+static void pt_unshare_pud(struct mm_struct *mm, pgd_t *pgd, unsigned long address,
+			   unsigned long end, int hugepage)
+{
+	pud_t *pud;
+
+	if (!pgd_present(*pgd))
+		return;
+
+	pud = pud_offset(pgd, address);
+	end = pgd_addr_end(address, end);
+	while (address <= end) {
+		pt_unshare_pmd(mm, pud, address, end, hugepage);
+		pud++;
+		address += PUD_SIZE;
+	}
+}
+#endif /* __PAGETABLE_PUD_FOLDED */
+#endif /* __PAGETABLE_PMD_FOLDED */
+
+static void pt_unshare_pgd(struct mm_struct *mm, unsigned long address,
+			   unsigned long end, int hugepage)
+{
+	pgd_t *pgd;
+
+	pgd = pgd_offset(mm, address);
+
+	spin_lock(&mm->page_table_lock);
+	while (address <= end) {
+#ifdef __PAGETABLE_PMD_FOLDED
+		if (!hugepage)
+			pt_unshare_pte(mm, (pmd_t *)pgd, address);
+#else
+#ifdef __PAGETABLE_PUD_FOLDED
+		pt_unshare_pmd(mm, (pud_t *)pgd, address, end, hugepage);
+#else
+		pt_unshare_pud(mm, pgd, address, end, hugepage);
+#endif
+#endif
+		pgd++;
+		address += PGDIR_SIZE;
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+
+void pt_unshare_range(struct vm_area_struct *vma, unsigned long address,
+		      unsigned long end)
+{
+	struct address_space *mapping;
+
+	mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+	spin_lock(&mapping->i_mmap_lock);
+
+	pt_unshare_pgd(vma->vm_mm, address, end, 0);
+
+	spin_unlock(&mapping->i_mmap_lock);
+
+	/* If we unshare the entire vma it's safe to clear the share flag */
+	if (vma->vm_start == address &&
+	    vma->vm_end == end)
+		vma->vm_flags &= ~VM_PTSHARE;
+}
+
+static struct vm_area_struct *next_shareable_vma(struct vm_area_struct *vma,
+						 struct vm_area_struct *svma,
+						 struct prio_tree_iter *iter)
+{
+	while ((svma = vma_prio_tree_next(svma, iter))) {
+		if ((svma != vma) &&
+		    ((vma->vm_flags&VMFLAG_COMPARE) == (svma->vm_flags&VMFLAG_COMPARE)) &&
+		    (vma->vm_start == svma->vm_start) &&
+		    (vma->vm_end == svma->vm_end) &&
+		    (vma->vm_pgoff == svma->vm_pgoff))
+			break;
+	}
+	return svma;
+}
+
+#ifdef CONFIG_PTSHARE_PTE
+static int pt_shareable_pte(struct vm_area_struct *vma, unsigned long address)
+{
+	unsigned long base = address & PMD_MASK;
+	unsigned long end = base + (PMD_SIZE-1);
+
+	if (pt_shareable_vma(vma) &&
+	    (vma->vm_start <= base) &&
+	    (vma->vm_end >= end))
+		return 1;
+
+	return 0;
+}
+
+pte_t *pt_share_pte(struct vm_area_struct *vma, unsigned long address, pmd_t *pmd)
+{
+	struct prio_tree_iter iter;
+	struct vm_area_struct *svma = NULL;
+	pgd_t *spgd, spgde;
+	pud_t *spud, spude;
+	pmd_t *spmd, spmde;
+	pte_t *pte;
+	struct page *page;
+	struct address_space *mapping;
+	spinlock_t *ptl;
+
+	pmd_clear(&spmde);
+	page = NULL;
+	if (pmd_none(*pmd) &&
+	    pt_shareable_pte(vma, address)) {
+#ifdef PT_DEBUG
+		printk(KERN_DEBUG "Looking for shareable pte page at address 0x%lx\n",
+		       address);
+#endif
+		mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+		spin_lock(&mapping->i_mmap_lock);
+		prio_tree_iter_init(&iter, &mapping->i_mmap,
+				    vma->vm_pgoff, VM_PGEND(vma));
+
+		while ((svma = next_shareable_vma(vma, svma, &iter))) {
+			spgd = pgd_offset(svma->vm_mm, address);
+			spgde = *spgd;
+			if (pgd_none(spgde))
+				continue;
+
+			spud = pud_offset(&spgde, address);
+			spude = *spud;
+			if (pud_none(spude))
+				continue;
+
+			spmd = pmd_offset(&spude, address);
+			spmde = *spmd;
+			if (pmd_none(spmde))
+				continue;
+
+			/* Found a shareable page */
+			page = pmd_page(spmde);
+			pt_increment_share(page);
+			break;
+		}
+		spin_unlock(&mapping->i_mmap_lock);
+		if (pmd_present(spmde)) {
+			ptl = pmd_lockptr(vma->vm_mm, pmd);
+			spin_lock(ptl);
+			if (pmd_none(*pmd)) {
+#ifdef PT_DEBUG
+				printk(KERN_DEBUG "Sharing pte page at address 0x%lx\n",
+				       address);
+#endif
+				pmd_populate(vma->vm_mm, pmd, page);
+				/* Both vmas now have shared pt */
+				vma->vm_flags |= VM_PTSHARE;
+				svma->vm_flags |= VM_PTSHARE;
+			} else {
+				/* Oops, already mapped... undo it */
+				pt_decrement_share(page);
+			}
+			spin_unlock(ptl);
+		}
+
+	}
+	pte = pte_alloc_map(vma->vm_mm, pmd, address);
+
+	return pte;
+}
+int pt_check_unshare_pte(struct mm_struct *mm, unsigned long address, pmd_t *pmd)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	page = pmd_page(*pmd);
+	ptl = __pt_lockptr(page);
+	spin_lock(ptl);
+	/* Doublecheck now that we hold the lock */
+	if (pt_is_shared(page)) {
+#ifdef PT_DEBUG
+		printk(KERN_DEBUG "Unsharing pte at address 0x%lx\n",
+		       address);
+#endif
+		pt_decrement_share(page);
+		spin_unlock(ptl);
+		pmd_clear_flush(mm, address, pmd);
+		return 1;
+	}
+	spin_unlock(ptl);
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_PTSHARE_PMD
+static int pt_shareable_pmd(struct vm_area_struct *vma,
+		 unsigned long address)
+{
+	unsigned long base = address & PUD_MASK;
+	unsigned long end = base + (PUD_SIZE-1);
+
+	if (pt_shareable_vma(vma) &&
+	    (vma->vm_start <= base) &&
+	    (vma->vm_end >= end))
+		return 1;
+
+	return 0;
+}
+
+pmd_t *pt_share_pmd(struct vm_area_struct *vma, unsigned long address, pud_t *pud)
+{
+	struct prio_tree_iter iter;
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *svma = NULL;
+	pgd_t *spgd, spgde;
+	pud_t *spud, spude;
+	pmd_t *pmd;
+	struct page *page;
+	struct address_space *mapping;
+
+	pud_clear(&spude);
+	page = NULL;
+	if (pud_none(*pud) &&
+	    pt_shareable_pmd(vma, address)) {
+#ifdef PT_DEBUG
+		printk(KERN_DEBUG "Looking for shareable pmd page at address 0x%lx\n",
+		       address);
+#endif
+		mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+
+		spin_lock(&mapping->i_mmap_lock);
+		prio_tree_iter_init(&iter, &mapping->i_mmap,
+				    vma->vm_pgoff, VM_PGEND(vma));
+
+		while ((svma = next_shareable_vma(vma, svma, &iter))) {
+			spgd = pgd_offset(svma->vm_mm, address);
+			spgde = *spgd;
+			if (pgd_none(spgde))
+				continue;
+
+			spud = pud_offset(spgd, address);
+			spude = *spud;
+			if (pud_none(spude))
+				continue;
+
+			/* Found a shareable page */
+			page = pud_page(spude);
+			pt_increment_share(page);
+			break;
+		}
+		spin_unlock(&mapping->i_mmap_lock);
+		if (pud_present(spude)) {
+			spin_lock(&mm->page_table_lock);
+			if (pud_none(*pud)) {
+#ifdef PT_DEBUG
+				printk(KERN_DEBUG "Sharing pmd page at address 0x%lx\n",
+				       address);
+#endif
+				pud_populate(mm, pud,
+					     (pmd_t *)pud_page_kernel(spude));
+				/* These vmas now have shared pt */
+				vma->vm_flags |= VM_PTSHARE;
+				svma->vm_flags |= VM_PTSHARE;
+			} else {
+				/* Oops, already mapped... undo it */
+				pt_decrement_share(page);
+			}
+			spin_unlock(&mm->page_table_lock);
+		}
+	}
+	pmd = pmd_alloc(mm, pud, address);
+
+	return pmd;
+}
+int pt_check_unshare_pmd(struct mm_struct *mm, unsigned long address, pud_t *pud)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	page = pud_page(*pud);
+	ptl = __pt_lockptr(page);
+	spin_lock(ptl);
+	/* Doublecheck now that we hold the lock */
+	if (pt_is_shared(page)) {
+#ifdef PT_DEBUG
+		printk(KERN_DEBUG "Unsharing pmd at address 0x%lx\n",
+		       address);
+#endif
+		pt_decrement_share(page);
+		spin_unlock(ptl);
+		pud_clear_flush(mm, address, pud);
+		return 1;
+	}
+	spin_unlock(ptl);
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_PTSHARE_HUGEPAGE
+
+void pt_unshare_huge_range(struct vm_area_struct *vma, unsigned long address,
+			   unsigned long end)
+{
+	struct address_space *mapping;
+
+	mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+	spin_lock(&mapping->i_mmap_lock);
+
+#ifdef CONFIG_PTSHARE_HUGEPAGE_PTE
+	pt_unshare_pgd(vma->vm_mm, address, end, 0);
+#else
+	pt_unshare_pgd(vma->vm_mm, address, end, 1);
+#endif
+
+	spin_unlock(&mapping->i_mmap_lock);
+
+	/* If we unshare the entire vma it's safe to clear the share flag */
+	if (vma->vm_start >= address &&
+	    vma->vm_end <= end)
+		vma->vm_flags &= ~VM_PTSHARE;
+}
+
+pte_t *pt_share_hugepage(struct mm_struct *mm, struct vm_area_struct *vma,
+			 unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (!pud)
+		return NULL;
+	pmd = pt_share_pmd(vma, address, pud);
+	if (!pmd)
+		return NULL;
+#ifdef CONFIG_PTSHARE_HUGEPAGE_PTE
+	pte = pt_share_pte(vma, address, pmd);
+#else
+	pte = (pte_t *)pmd;
+#endif
+	return pte;
+}
+#endif



^ permalink raw reply

* [RFC/PATCH] Shared Page Tables [1/2]
From: Dave McCracken @ 2006-04-10 16:13 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Kernel Mailing List, Linux Memory Management

Complete the macro definitions for pxd_page/pxd_page_kernel
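
The naming split this patch completes can be illustrated with a small user-space sketch. It is not taken from the patch: `pxd_t`, `mk_pxd`, `toy_ram` and the flat `mem_map` array are hypothetical stand-ins, and real architectures compute both values differently. The assumed convention is that `pxd_page()` yields the `struct page *` of the table an entry points to, while `pxd_page_kernel()` yields the kernel virtual address of that same table.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define NR_PAGES	8

/* Toy page descriptor; the real struct page has many more fields. */
struct page { unsigned long flags; };

/* Hypothetical stand-ins for physical memory and the page array. */
static unsigned char toy_ram[NR_PAGES * PAGE_SIZE];
static struct page mem_map[NR_PAGES];

/* Generic "pxd" entry: a physical address plus low flag bits. */
typedef struct { unsigned long pxd; } pxd_t;
#define pxd_val(x)	((x).pxd)

/* Address of the lower-level table, as the kernel would dereference it. */
#define pxd_page_kernel(e) \
	((uintptr_t)(toy_ram + (pxd_val(e) & PAGE_MASK)))

/* struct page describing the lower-level table. */
#define pxd_page(e)	(mem_map + (pxd_val(e) >> PAGE_SHIFT))

/* Build an entry that points at page frame pfn and carries flag bits. */
static pxd_t mk_pxd(unsigned long pfn, unsigned long flags)
{
	pxd_t e = { (pfn << PAGE_SHIFT) | (flags & ~PAGE_MASK) };
	return e;
}
```

With an entry built by `mk_pxd(3, 0x7)`, `pxd_page()` resolves to `&mem_map[3]` and `pxd_page_kernel()` to the start of the fourth page of `toy_ram`. Code that walks the table wants the address form, while code that needs per-page state such as a lock or a share count wants the `struct page` form.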

Signed-off-by: Dave McCracken <dmccr@us.ibm.com>

----

 arch/sparc/mm/srmmu.c               |    2 +-
 arch/sparc/mm/sun4c.c               |    2 +-
 arch/x86_64/mm/fault.c              |    4 ++--
 include/asm-alpha/mmzone.h          |    1 +
 include/asm-alpha/pgtable.h         |    5 +++--
 include/asm-generic/4level-fixup.h  |    4 ++++
 include/asm-ia64/pgtable.h          |   10 ++++++----
 include/asm-m32r/pgtable-2level.h   |    6 +++++-
 include/asm-m68k/motorola_pgtable.h |    1 +
 include/asm-mips/pgtable-64.h       |    6 ++++--
 include/asm-parisc/pgtable.h        |    5 +++--
 include/asm-powerpc/pgtable-4k.h    |    5 +++--
 include/asm-powerpc/pgtable.h       |    5 +++--
 include/asm-ppc/pgtable.h           |    2 +-
 include/asm-s390/pgtable.h          |    2 ++
 include/asm-sh/pgtable-2level.h     |    5 ++++-
 include/asm-sh64/pgtable.h          |    4 +++-
 include/asm-sparc/pgtable.h         |    4 ++--
 include/asm-sparc64/pgtable.h       |    5 +++--
 include/asm-um/pgtable-3level.h     |    5 +++--
 include/asm-x86_64/pgtable.h        |   12 ++++++------
 21 files changed, 61 insertions(+), 34 deletions(-)

----
--- 2.6.17-rc1/./arch/sparc/mm/srmmu.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./arch/sparc/mm/srmmu.c	2006-04-10 08:40:18.000000000 -0500
@@ -2176,7 +2176,7 @@ void __init ld_mmu_srmmu(void)
 
 	BTFIXUPSET_CALL(pte_pfn, srmmu_pte_pfn, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(pmd_page, srmmu_pmd_page, BTFIXUPCALL_NORM);
-	BTFIXUPSET_CALL(pgd_page, srmmu_pgd_page, BTFIXUPCALL_NORM);
+	BTFIXUPSET_CALL(pgd_page_kernel, srmmu_pgd_page, BTFIXUPCALL_NORM);
 
 	BTFIXUPSET_SETHI(none_mask, 0xF0000000);
 
--- 2.6.17-rc1/./arch/sparc/mm/sun4c.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./arch/sparc/mm/sun4c.c	2006-04-10 08:40:18.000000000 -0500
@@ -2281,5 +2281,5 @@ void __init ld_mmu_sun4c(void)
 
 	/* These should _never_ get called with two level tables. */
 	BTFIXUPSET_CALL(pgd_set, sun4c_pgd_set, BTFIXUPCALL_NOP);
-	BTFIXUPSET_CALL(pgd_page, sun4c_pgd_page, BTFIXUPCALL_RETO0);
+	BTFIXUPSET_CALL(pgd_page_kernel, sun4c_pgd_page, BTFIXUPCALL_RETO0);
 }
--- 2.6.17-rc1/./arch/x86_64/mm/fault.c	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./arch/x86_64/mm/fault.c	2006-04-10 08:40:18.000000000 -0500
@@ -160,7 +160,7 @@ void dump_pagetable(unsigned long addres
 	printk("PGD %lx ", pgd_val(*pgd));
 	if (!pgd_present(*pgd)) goto ret; 
 
-	pud = __pud_offset_k((pud_t *)pgd_page(*pgd), address);
+	pud = __pud_offset_k((pud_t *)pgd_page_kernel(*pgd), address);
 	if (bad_address(pud)) goto bad;
 	printk("PUD %lx ", pud_val(*pud));
 	if (!pud_present(*pud))	goto ret;
@@ -274,7 +274,7 @@ static int vmalloc_fault(unsigned long a
 	pud_ref = pud_offset(pgd_ref, address);
 	if (pud_none(*pud_ref))
 		return -1;
-	if (pud_none(*pud) || pud_page(*pud) != pud_page(*pud_ref))
+	if (pud_none(*pud) || pud_page_kernel(*pud) != pud_page_kernel(*pud_ref))
 		BUG();
 	pmd = pmd_offset(pud, address);
 	pmd_ref = pmd_offset(pud_ref, address);
--- 2.6.17-rc1/./include/asm-alpha/mmzone.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-alpha/mmzone.h	2006-04-10 08:40:18.000000000 -0500
@@ -76,6 +76,7 @@ PLAT_NODE_DATA_LOCALNR(unsigned long p, 
 #define VALID_PAGE(page)	(((page) - mem_map) < max_mapnr)
 
 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> 32))
+#define pgd_page(pgd)		(pfn_to_page(pgd_val(pgd) >> 32))
 #define pte_pfn(pte)		(pte_val(pte) >> 32)
 
 #define mk_pte(page, pgprot)						     \
--- 2.6.17-rc1/./include/asm-alpha/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-alpha/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -238,9 +238,10 @@ pmd_page_kernel(pmd_t pmd)
 
 #ifndef CONFIG_DISCONTIGMEM
 #define pmd_page(pmd)	(mem_map + ((pmd_val(pmd) & _PFN_MASK) >> 32))
+#define pgd_page(pgd)	(mem_map + ((pgd_val(pgd) & _PFN_MASK) >> 32))
 #endif
 
-extern inline unsigned long pgd_page(pgd_t pgd)
+extern inline unsigned long pgd_page_kernel(pgd_t pgd)
 { return PAGE_OFFSET + ((pgd_val(pgd) & _PFN_MASK) >> (32-PAGE_SHIFT)); }
 
 extern inline int pte_none(pte_t pte)		{ return !pte_val(pte); }
@@ -294,7 +295,7 @@ extern inline pte_t pte_mkyoung(pte_t pt
 /* Find an entry in the second-level page table.. */
 extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
 {
-	return (pmd_t *) pgd_page(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+	return (pmd_t *) pgd_page_kernel(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
 }
 
 /* Find an entry in the third-level page table.. */
--- 2.6.17-rc1/./include/asm-generic/4level-fixup.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-generic/4level-fixup.h	2006-04-10 08:40:18.000000000 -0500
@@ -21,6 +21,10 @@
 #define pud_present(pud)		1
 #define pud_ERROR(pud)			do { } while (0)
 #define pud_clear(pud)			pgd_clear(pud)
+#define pud_val(pud)			pgd_val(pud)
+#define pud_populate(mm, pud, pmd)	pgd_populate(mm, pud, pmd)
+#define pud_page(pud)			pgd_page(pud)
+#define pud_page_kernel(pud)		pgd_page_kernel(pud)
 
 #undef pud_free_tlb
 #define pud_free_tlb(tlb, x)            do { } while (0)
--- 2.6.17-rc1/./include/asm-ia64/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-ia64/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -283,14 +283,16 @@ ia64_phys_addr_valid (unsigned long addr
 #define pud_bad(pud)			(!ia64_phys_addr_valid(pud_val(pud)))
 #define pud_present(pud)		(pud_val(pud) != 0UL)
 #define pud_clear(pudp)			(pud_val(*(pudp)) = 0UL)
-#define pud_page(pud)			((unsigned long) __va(pud_val(pud) & _PFN_MASK))
+#define pud_page_kernel(pud)		((unsigned long) __va(pud_val(pud) & _PFN_MASK))
+#define pud_page(pud)			virt_to_page((pud_val(pud) + PAGE_OFFSET))
 
 #ifdef CONFIG_PGTABLE_4
 #define pgd_none(pgd)			(!pgd_val(pgd))
 #define pgd_bad(pgd)			(!ia64_phys_addr_valid(pgd_val(pgd)))
 #define pgd_present(pgd)		(pgd_val(pgd) != 0UL)
 #define pgd_clear(pgdp)			(pgd_val(*(pgdp)) = 0UL)
-#define pgd_page(pgd)			((unsigned long) __va(pgd_val(pgd) & _PFN_MASK))
+#define pgd_page_kernel(pgd)		((unsigned long) __va(pgd_val(pgd) & _PFN_MASK))
+#define pgd_page(pgd)			virt_to_page((pgd_val(pgd) + PAGE_OFFSET))
 #endif
 
 /*
@@ -363,12 +365,12 @@ pgd_offset (struct mm_struct *mm, unsign
 #ifdef CONFIG_PGTABLE_4
 /* Find an entry in the second-level page table.. */
 #define pud_offset(dir,addr) \
-	((pud_t *) pgd_page(*(dir)) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1)))
+	((pud_t *) pgd_page_kernel(*(dir)) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1)))
 #endif
 
 /* Find an entry in the third-level page table.. */
 #define pmd_offset(dir,addr) \
-	((pmd_t *) pud_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
+	((pmd_t *) pud_page_kernel(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
 
 /*
  * Find an entry in the third-level page table.  This looks more complicated than it
--- 2.6.17-rc1/./include/asm-m32r/pgtable-2level.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-m32r/pgtable-2level.h	2006-04-10 08:40:18.000000000 -0500
@@ -53,9 +53,13 @@ static inline int pgd_present(pgd_t pgd)
 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = pmdval)
 #define set_pgd(pgdptr, pgdval) (*(pgdptr) = pgdval)
 
-#define pgd_page(pgd) \
+#define pgd_page_kernel(pgd) \
 ((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
 
+#ifndef CONFIG_DISCONTIGMEM
+#define pgd_page(pgd)	(mem_map + ((pgd_val(pgd) >> PAGE_SHIFT) - PFN_BASE))
+#endif /* !CONFIG_DISCONTIGMEM */
+
 static inline pmd_t *pmd_offset(pgd_t * dir, unsigned long address)
 {
 	return (pmd_t *) dir;
--- 2.6.17-rc1/./include/asm-m68k/motorola_pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-m68k/motorola_pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -151,6 +151,7 @@ static inline void pgd_set(pgd_t *pgdp, 
 #define pgd_bad(pgd)		((pgd_val(pgd) & _DESCTYPE_MASK) != _PAGE_TABLE)
 #define pgd_present(pgd)	(pgd_val(pgd) & _PAGE_TABLE)
 #define pgd_clear(pgdp)		({ pgd_val(*pgdp) = 0; })
+#define pgd_page(pgd)		(mem_map + ((unsigned long)(__va(pgd_val(pgd)) - PAGE_OFFSET) >> PAGE_SHIFT))
 
 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))
--- 2.6.17-rc1/./include/asm-mips/pgtable-64.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-mips/pgtable-64.h	2006-04-10 08:40:18.000000000 -0500
@@ -179,15 +179,17 @@ static inline void pud_clear(pud_t *pudp
 /* to find an entry in a page-table-directory */
 #define pgd_offset(mm,addr)	((mm)->pgd + pgd_index(addr))
 
-static inline unsigned long pud_page(pud_t pud)
+static inline unsigned long pud_page_kernel(pud_t pud)
 {
 	return pud_val(pud);
 }
+#define pud_phys(pud)		(pud_val(pud) - PAGE_OFFSET)
+#define pud_page(pud)		(pfn_to_page(pud_phys(pud) >> PAGE_SHIFT))
 
 /* Find an entry in the second-level page table.. */
 static inline pmd_t *pmd_offset(pud_t * pud, unsigned long address)
 {
-	return (pmd_t *) pud_page(*pud) + pmd_index(address);
+	return (pmd_t *) pud_page_kernel(*pud) + pmd_index(address);
 }
 
 /* Find an entry in the third-level page table.. */
--- 2.6.17-rc1/./include/asm-parisc/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-parisc/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -295,7 +295,8 @@ static inline void pmd_clear(pmd_t *pmd)
 
 
 #if PT_NLEVELS == 3
-#define pgd_page(pgd) ((unsigned long) __va(pgd_address(pgd)))
+#define pgd_page_kernel(pgd) ((unsigned long) __va(pgd_address(pgd)))
+#define pgd_page(pgd)	virt_to_page((void *)pgd_page_kernel(pgd))
 
 /* For 64 bit we have three level tables */
 
@@ -396,7 +397,7 @@ extern inline pte_t pte_modify(pte_t pte
 
 #if PT_NLEVELS == 3
 #define pmd_offset(dir,address) \
-((pmd_t *) pgd_page(*(dir)) + (((address)>>PMD_SHIFT) & (PTRS_PER_PMD-1)))
+((pmd_t *) pgd_page_kernel(*(dir)) + (((address)>>PMD_SHIFT) & (PTRS_PER_PMD-1)))
 #else
 #define pmd_offset(dir,addr) ((pmd_t *) dir)
 #endif
--- 2.6.17-rc1/./include/asm-powerpc/pgtable-4k.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-powerpc/pgtable-4k.h	2006-04-10 08:40:18.000000000 -0500
@@ -86,10 +86,11 @@
 #define pgd_bad(pgd)		(pgd_val(pgd) == 0)
 #define pgd_present(pgd)	(pgd_val(pgd) != 0)
 #define pgd_clear(pgdp)		(pgd_val(*(pgdp)) = 0)
-#define pgd_page(pgd)		(pgd_val(pgd) & ~PGD_MASKED_BITS)
+#define pgd_page_kernel(pgd)	(pgd_val(pgd) & ~PGD_MASKED_BITS)
+#define pgd_page(pgd)		virt_to_page(pgd_page_kernel(pgd))
 
 #define pud_offset(pgdp, addr)	\
-  (((pud_t *) pgd_page(*(pgdp))) + \
+  (((pud_t *) pgd_page_kernel(*(pgdp))) + \
     (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1)))
 
 #define pud_ERROR(e) \
--- 2.6.17-rc1/./include/asm-powerpc/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-powerpc/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -206,7 +206,8 @@ static inline pte_t pfn_pte(unsigned lon
 				 || (pud_val(pud) & PUD_BAD_BITS))
 #define pud_present(pud)	(pud_val(pud) != 0)
 #define pud_clear(pudp)		(pud_val(*(pudp)) = 0)
-#define pud_page(pud)		(pud_val(pud) & ~PUD_MASKED_BITS)
+#define pud_page_kernel(pud)	(pud_val(pud) & ~PUD_MASKED_BITS)
+#define pud_page(pud)		virt_to_page(pud_page_kernel(pud))
 
 #define pgd_set(pgdp, pudp)	({pgd_val(*(pgdp)) = (unsigned long)(pudp);})
 
@@ -220,7 +221,7 @@ static inline pte_t pfn_pte(unsigned lon
 #define pgd_offset(mm, address)	 ((mm)->pgd + pgd_index(address))
 
 #define pmd_offset(pudp,addr) \
-  (((pmd_t *) pud_page(*(pudp))) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
+  (((pmd_t *) pud_page_kernel(*(pudp))) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
 
 #define pte_offset_kernel(dir,addr) \
   (((pte_t *) pmd_page_kernel(*(dir))) + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)))
--- 2.6.17-rc1/./include/asm-ppc/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-ppc/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -527,7 +527,7 @@ static inline int pgd_bad(pgd_t pgd)		{ 
 static inline int pgd_present(pgd_t pgd)	{ return 1; }
 #define pgd_clear(xp)				do { } while (0)
 
-#define pgd_page(pgd) \
+#define pgd_page_kernel(pgd) \
 	((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
 
 /*
--- 2.6.17-rc1/./include/asm-s390/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-s390/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -685,6 +685,8 @@ static inline pte_t mk_pte_phys(unsigned
 
 #define pgd_page_kernel(pgd) (pgd_val(pgd) & PAGE_MASK)
 
+#define pgd_page(pgd) (mem_map+(pgd_val(pgd) >> PAGE_SHIFT))
+
 /* to find an entry in a page-table-directory */
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
 #define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
--- 2.6.17-rc1/./include/asm-sh/pgtable-2level.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-sh/pgtable-2level.h	2006-04-10 08:40:18.000000000 -0500
@@ -50,9 +50,12 @@ static inline void pgd_clear (pgd_t * pg
 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = pmdval)
 #define set_pgd(pgdptr, pgdval) (*(pgdptr) = pgdval)
 
-#define pgd_page(pgd) \
+#define pgd_page_kernel(pgd) \
 ((unsigned long) __va(pgd_val(pgd) & PAGE_MASK))
 
+#define pgd_page(pgd) \
+	(phys_to_page(pgd_val(pgd)))
+
 static inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
 {
 	return (pmd_t *) dir;
--- 2.6.17-rc1/./include/asm-sh64/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-sh64/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -191,7 +191,9 @@ static inline int pgd_bad(pgd_t pgd)		{ 
 #endif
 
 
-#define pgd_page(pgd_entry)	((unsigned long) (pgd_val(pgd_entry) & PAGE_MASK))
+#define pgd_page_kernel(pgd_entry)	((unsigned long) (pgd_val(pgd_entry) & PAGE_MASK))
+#define pgd_page(pgd)	(virt_to_page(pgd_val(pgd)))
+
 
 /*
  * PMD defines. Middle level.
--- 2.6.17-rc1/./include/asm-sparc/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-sparc/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -144,10 +144,10 @@ extern unsigned long empty_zero_page;
 /*
  */
 BTFIXUPDEF_CALL_CONST(struct page *, pmd_page, pmd_t)
-BTFIXUPDEF_CALL_CONST(unsigned long, pgd_page, pgd_t)
+BTFIXUPDEF_CALL_CONST(unsigned long, pgd_page_kernel, pgd_t)
 
 #define pmd_page(pmd) BTFIXUP_CALL(pmd_page)(pmd)
-#define pgd_page(pgd) BTFIXUP_CALL(pgd_page)(pgd)
+#define pgd_page_kernel(pgd) BTFIXUP_CALL(pgd_page_kernel)(pgd)
 
 BTFIXUPDEF_SETHI(none_mask)
 BTFIXUPDEF_CALL_CONST(int, pte_present, pte_t)
--- 2.6.17-rc1/./include/asm-sparc64/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-sparc64/pgtable.h	2006-04-10 08:44:41.000000000 -0500
@@ -631,8 +631,9 @@ static inline unsigned long pte_present(
 #define __pmd_page(pmd)		\
 	((unsigned long) __va((((unsigned long)pmd_val(pmd))<<11UL)))
 #define pmd_page(pmd) 			virt_to_page((void *)__pmd_page(pmd))
-#define pud_page(pud)		\
+#define pud_page_kernel(pud)		\
 	((unsigned long) __va((((unsigned long)pud_val(pud))<<11UL)))
+#define pud_page(pud) 			virt_to_page((void *)pud_page_kernel(pud))
 #define pmd_none(pmd)			(!pmd_val(pmd))
 #define pmd_bad(pmd)			(0)
 #define pmd_present(pmd)		(pmd_val(pmd) != 0U)
@@ -654,7 +655,7 @@ static inline unsigned long pte_present(
 
 /* Find an entry in the second-level page table.. */
 #define pmd_offset(pudp, address)	\
-	((pmd_t *) pud_page(*(pudp)) + \
+	((pmd_t *) pud_page_kernel(*(pudp)) + \
 	 (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1)))
 
 /* Find an entry in the third-level page table.. */
--- 2.6.17-rc1/./include/asm-um/pgtable-3level.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-um/pgtable-3level.h	2006-04-10 08:40:18.000000000 -0500
@@ -74,11 +74,12 @@ extern inline void pud_clear (pud_t *pud
         set_pud(pud, __pud(0));
 }
 
-#define pud_page(pud) \
+#define pud_page(pud) phys_to_page(pud_val(pud) & PAGE_MASK)
+#define pud_page_kernel(pud) \
 	((struct page *) __va(pud_val(pud) & PAGE_MASK))
 
 /* Find an entry in the second-level page table.. */
-#define pmd_offset(pud, address) ((pmd_t *) pud_page(*(pud)) + \
+#define pmd_offset(pud, address) ((pmd_t *) pud_page_kernel(*(pud)) + \
 			pmd_index(address))
 
 static inline unsigned long pte_pfn(pte_t pte)
--- 2.6.17-rc1/./include/asm-x86_64/pgtable.h	2006-04-02 22:22:10.000000000 -0500
+++ 2.6.17-rc1-macro/./include/asm-x86_64/pgtable.h	2006-04-10 08:40:18.000000000 -0500
@@ -101,9 +101,6 @@ static inline void pgd_clear (pgd_t * pg
 	set_pgd(pgd, __pgd(0));
 }
 
-#define pud_page(pud) \
-((unsigned long) __va(pud_val(pud) & PHYSICAL_PAGE_MASK))
-
 #define ptep_get_and_clear(mm,addr,xp)	__pte(xchg(&(xp)->pte, 0))
 
 struct mm_struct;
@@ -326,7 +323,8 @@ static inline int pmd_large(pmd_t pte) {
 /*
  * Level 4 access.
  */
-#define pgd_page(pgd) ((unsigned long) __va((unsigned long)pgd_val(pgd) & PTE_MASK))
+#define pgd_page_kernel(pgd) ((unsigned long) __va((unsigned long)pgd_val(pgd) & PTE_MASK))
+#define pgd_page(pgd)		(pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT))
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
 #define pgd_offset(mm, addr) ((mm)->pgd + pgd_index(addr))
 #define pgd_offset_k(address) (init_level4_pgt + pgd_index(address))
@@ -335,8 +333,10 @@ static inline int pmd_large(pmd_t pte) {
 
 /* PUD - Level3 access */
 /* to find an entry in a page-table-directory. */
+#define pud_page_kernel(pud) ((unsigned long) __va(pud_val(pud) & PHYSICAL_PAGE_MASK))
+#define pud_page(pud)		(pfn_to_page(pud_val(pud) >> PAGE_SHIFT))
 #define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-#define pud_offset(pgd, address) ((pud_t *) pgd_page(*(pgd)) + pud_index(address))
+#define pud_offset(pgd, address) ((pud_t *) pgd_page_kernel(*(pgd)) + pud_index(address))
 #define pud_offset_k(pgd, addr) pud_offset(pgd, addr)
 #define pud_present(pud) (pud_val(pud) & _PAGE_PRESENT)
 
@@ -350,7 +350,7 @@ static inline pud_t *__pud_offset_k(pud_
 #define pmd_page(pmd)		(pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
 
 #define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
-#define pmd_offset(dir, address) ((pmd_t *) pud_page(*(dir)) + \
+#define pmd_offset(dir, address) ((pmd_t *) pud_page_kernel(*(dir)) + \
 			pmd_index(address))
 #define pmd_none(x)	(!pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)



* [RFC/PATCH] Shared Page Tables [0/2]
From: Dave McCracken @ 2006-04-10 16:13 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux Kernel Mailing List, Linux Memory Management

Here's a new cut of the shared page table patch.  I divided it into
two patches.  The first one just fleshes out the
pxd_page/pxd_page_kernel macros across the architectures.  The
second one is the main patch.

This version of the patch should address the concerns Hugh raised.
Hugh, I'd appreciate your feedback again.  Did I get everything?

These patches apply against 2.6.17-rc1.

Dave McCracken
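
The split between pxd_page() and pxd_page_kernel() that the first patch
fills in can be pictured with a small user-space model. This is only a
sketch under stated assumptions: the toy_ names, the 16-page "physical
memory" and the flat mem_map array are invented for illustration and are
not any architecture's real definitions. The point it shows is that
pgd_page_kernel() yields an address usable as a pointer to the next-level
table, while pgd_page() yields the struct page describing that same table
page.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))
#define TOY_NR_PAGES   16

/* Toy stand-ins: a tiny "physical memory" and its page descriptor array. */
struct page { int unused; };
static unsigned char toy_phys_mem[TOY_NR_PAGES * TOY_PAGE_SIZE];
static struct page mem_map[TOY_NR_PAGES];

typedef struct { unsigned long pgd; } pgd_t;
#define pgd_val(x) ((x).pgd)

/* Models __va(): physical address -> address in the toy direct mapping. */
static void *toy_va(unsigned long phys)
{
	return toy_phys_mem + phys;
}

/* Address of the next-level table, suitable for casting to pud_t or pmd_t. */
static uintptr_t pgd_page_kernel(pgd_t pgd)
{
	return (uintptr_t)toy_va(pgd_val(pgd) & TOY_PAGE_MASK);
}

/* The struct page that describes the same next-level table page. */
static struct page *pgd_page(pgd_t pgd)
{
	unsigned long pfn = pgd_val(pgd) >> TOY_PAGE_SHIFT;

	assert(pfn < TOY_NR_PAGES);
	return mem_map + pfn;
}

/* Models virt_to_page() for the toy direct mapping. */
static struct page *toy_virt_to_page(uintptr_t vaddr)
{
	return mem_map + ((vaddr - (uintptr_t)toy_phys_mem) >> TOY_PAGE_SHIFT);
}
```

In this model the two accessors stay consistent with each other:
toy_virt_to_page(pgd_page_kernel(e)) and pgd_page(e) name the same
descriptor, which is the relationship several of the per-architecture
definitions in the patch rely on.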


* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
From: Andi Kleen @ 2006-04-09  7:01 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

On Friday 07 April 2006 22:18, Lee Schermerhorn wrote:
> This is a reposting of the migrate-on-fault series, against
> the 2.6.17-rc1-mm1 tree.  I would love to get some feedback on 
> these patches--especially regarding criteria for getting them
> into the mm tree for wider testing.

The biggest criterion would be some numbers showing that it actually
helps for something and doesn't break performance in other workloads.

For me it seems rather risky.

-Andi


* Re: Page Migration: Make do_swap_page redo the fault
From: Hugh Dickins @ 2006-04-09  3:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604081430280.17911@schroedinger.engr.sgi.com>

On Sat, 8 Apr 2006, Christoph Lameter wrote:
> On Sat, 8 Apr 2006, Hugh Dickins wrote:
> > 
> > Sure, those are long standing checks, necessary long before migration
> > came on the scene; whereas the check in do_swap_page was recently added
> > just for a page migration case, and now turns out to be redundant.
> 
> Those two checks were added for migration together with the one we 
> are removing now. Sounds like you think they additionally fix some other 
> race conditions?

Of course, you're right - sorry.  Whatever was I looking at,
to get it so confidently wrong?  Dunno: scary.

But I do have to worry then.  I'd missed the addition of those checks:
if they really are necessary, then the rules have changed in two
tricky areas I now need to re-understand.  It'll take me a while.

Thanks for setting me straight.

Hugh


* Re: Page Migration: Make do_swap_page redo the fault
From: Christoph Lameter @ 2006-04-08 21:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: akpm, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604082022170.12196@blonde.wat.veritas.com>

On Sat, 8 Apr 2006, Hugh Dickins wrote:

> On Sat, 8 Apr 2006, Christoph Lameter wrote:
> > 
> > Hmmm... There are still two other checks for !PageSwapCache after 
> > obtaining a page lock in shmem_getpage() and in try_to_unuse(). 
> > However, both are getting to the page via the swap maps. So we need to 
> > keep those.
> 
> Sure, those are long standing checks, necessary long before migration
> came on the scene; whereas the check in do_swap_page was recently added
> just for a page migration case, and now turns out to be redundant.

Those two checks were added for migration together with the one we 
are removing now. Sounds like you think they additionally fix some other 
race conditions?

The check we are discussing only becomes unnecessary if the swap ptes are 
replaced by regular ptes. The swap pte would refer to the old page from 
which the SwapCache bit was cleared. This is dependent on remove_from_swap 
always functioning properly which happened pretty late in the 2.6.16 
cycle.

Here is the description from V9 of the direct migration patchset which 
introduced the 3 checks for PageSwapCache():

Check for PageSwapCache after looking up and locking a swap page.

The page migration code may change a swap pte to point to a different page
under lock_page().

If that happens then the vm must retry the lookup operation in the swap
space to find the correct page number. There are a couple of locations
in the VM where a lock_page() is done on a swap page. In these locations
we need to check afterwards if the page was migrated. If the page was
migrated then the old page that was looked up before was freed and no
longer has the PageSwapCache bit set.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
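
The lock-then-recheck pattern described in that changelog can be sketched
in a single-threaded user-space model. Everything here is hypothetical
scaffolding (the toy_ structures, the slot array standing in for the swap
space, and the toy_race_target hook that simulates a migration slipping in
between lookup and lock); it is not the code from mm/memory.c, mm/shmem.c
or mm/swapfile.c. It only illustrates why a cleared PageSwapCache bit
after lock_page() means the lookup has to be repeated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool locked;      /* models the page lock */
	bool swap_cache;  /* models the PageSwapCache bit */
};

#define TOY_NR_SLOTS 4

/* Toy swap space: slot index -> page currently backing that swap entry. */
static struct toy_page *toy_swap_space[TOY_NR_SLOTS];

/* If non-NULL, one simulated migration into this page runs before locking. */
static struct toy_page *toy_race_target;

/* Models migration: the old page loses the bit, the new page takes over. */
static void toy_migrate(unsigned int slot, struct toy_page *newpage)
{
	struct toy_page *old = toy_swap_space[slot];

	old->swap_cache = false;
	newpage->swap_cache = true;
	toy_swap_space[slot] = newpage;
}

/* Look up the page for a slot, lock it, and retry if it went stale. */
static struct toy_page *toy_lookup_and_lock(unsigned int slot)
{
	assert(slot < TOY_NR_SLOTS);

	for (;;) {
		struct toy_page *page = toy_swap_space[slot];

		if (!page)
			return NULL;

		/* Window between lookup and lock in which migration may run. */
		if (toy_race_target) {
			toy_migrate(slot, toy_race_target);
			toy_race_target = NULL;
		}

		page->locked = true;          /* lock_page() */
		if (page->swap_cache)
			return page;          /* still the page we looked up */

		page->locked = false;         /* unlock_page(), then look up again */
	}
}
```

With one migration injected, the first pass locks the stale page, sees the
bit cleared, drops the lock, and the second pass returns the replacement
page locked.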


* Re: Page Migration: Make do_swap_page redo the fault
From: Hugh Dickins @ 2006-04-08 19:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604081058290.16914@schroedinger.engr.sgi.com>

On Sat, 8 Apr 2006, Christoph Lameter wrote:
> 
> Hmmm... There are still two other checks for !PageSwapCache after 
> obtaining a page lock in shmem_getpage() and in try_to_unuse(). 
> However, both are getting to the page via the swap maps. So we need to 
> keep those.

Sure, those are long standing checks, necessary long before migration
came on the scene; whereas the check in do_swap_page was recently added
just for a page migration case, and now turns out to be redundant.

Hugh


* Re: Page Migration: Make do_swap_page redo the fault
From: Christoph Lameter @ 2006-04-08 18:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: akpm, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604081312200.14441@blonde.wat.veritas.com>

On Sat, 8 Apr 2006, Hugh Dickins wrote:

> > do_swap_page may interpret an invalid swap entry without this patch 
> > because we do not reload the pte if we are looping back. The page 
> > migration code may already have reused the swap entry referenced by our
> > local swp_entry.
> 
> Wouldn't you better just remove that !PageSwapCache "Page migration has
> occured" block?  Isn't that case already dealt with by the old !pte_same
> check below it?

Right. Since we now replace the swap ptes with ptes pointing to pages 
before unlocking the page this is no longer necessary (if the ptes 
contents are checked later). That of course means that remove_from_swap() 
must always succeed.

Hmmm... There are still two other checks for !PageSwapCache after 
obtaining a page lock in shmem_getpage() and in try_to_unuse(). 
However, both are getting to the page via the swap maps. So we need to 
keep those.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-04-02 21:55:26.000000000 -0700
+++ linux-2.6/mm/memory.c	2006-04-08 11:08:33.000000000 -0700
@@ -1903,12 +1903,6 @@ again:
 
 	mark_page_accessed(page);
 	lock_page(page);
-	if (!PageSwapCache(page)) {
-		/* Page migration has occured */
-		unlock_page(page);
-		page_cache_release(page);
-		goto again;
-	}
 
 	/*
 	 * Back out if somebody else already faulted in this pte.


* Re: Page Migration: Make do_swap_page redo the fault
From: Hugh Dickins @ 2006-04-08 12:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-mm
In-Reply-To: <Pine.LNX.4.64.0604032228150.24182@schroedinger.engr.sgi.com>

On Mon, 3 Apr 2006, Christoph Lameter wrote:

> It is better to redo the complete fault if do_swap_page() finds
> that the page is not in PageSwapCache() because the page migration
> code may have replaced the swap pte already with a pte pointing
> to valid memory.
> 
> do_swap_page may interpret an invalid swap entry without this patch 
> because we do not reload the pte if we are looping back. The page 
> migration code may already have reused the swap entry referenced by our
> local swp_entry.

Wouldn't you better just remove that !PageSwapCache "Page migration has
occured" block?  Isn't that case already dealt with by the old !pte_same
check below it?

Hugh


* Re: [PATCH] mm: limit lowmem_reserve
From: Nick Piggin @ 2006-04-08  1:25 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <200604081101.06066.kernel@kolivas.org>

Con Kolivas wrote:
> On Saturday 08 April 2006 10:55, Nick Piggin wrote:
> 
>>Con Kolivas wrote:
>>
>>>On Friday 07 April 2006 22:40, Nick Piggin wrote:
>>>
>>>>How would zone_watermark_ok always fail though?
>>>
>>>Withdrew this patch a while back; ignore
>>
>>Well, whether or not that particular patch is a good idea, it
>>is definitely a bug if zone_watermark_ok could ever always
>>fail due to lowmem reserve and we should fix it.
> 
> 
> Ok. I think I presented enough information for why I thought zone_watermark_ok 
> would fail (for ZONE_DMA). With 16MB ZONE_DMA and a vmsplit of 3GB we have a 
> lowmem_reserve of 12MB. It's pretty hard to keep that much ZONE_DMA free, I 
> don't think I've ever seen that much free on my ZONE_DMA on an ordinary 
> desktop without any particular ZONE_DMA users. Changing the tunable can make 
> the lowmem_reserve larger than ZONE_DMA is on any vmsplit too as far as I 
> understand the ratio.
> 

Umm, for ZONE_DMA allocations, ZONE_DMA isn't a lower zone. So that
12MB protection should never come into it (unless it is buggy?).
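
The arithmetic behind the two positions can be sketched as follows. This
is a simplified two-zone model with assumptions stated up front: the toy_
names are invented, only order-0 allocations are considered, and the
check is reduced to "free pages must exceed pages_min plus the reserve
for the zone class the allocation was aimed at". The real
zone_watermark_ok() also accounts for allocation order and flags. The
ratio of 256 is assumed here because it reproduces the 12MB figure quoted
above.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_NR_ZONES };

struct toy_zone {
	unsigned long present_pages;
	unsigned long free_pages;
	unsigned long pages_min;
	/* Pages held back from allocations that could have used zone [i]. */
	unsigned long lowmem_reserve[TOY_NR_ZONES];
};

/* ZONE_DMA reserves (pages in the higher zone) / ratio against fallbacks. */
static void toy_setup_lowmem_reserve(struct toy_zone *zones,
				     unsigned long dma_ratio)
{
	assert(dma_ratio > 0);

	zones[TOY_ZONE_DMA].lowmem_reserve[TOY_ZONE_DMA] = 0;
	zones[TOY_ZONE_DMA].lowmem_reserve[TOY_ZONE_NORMAL] =
		zones[TOY_ZONE_NORMAL].present_pages / dma_ratio;

	/* Nothing sits above ZONE_NORMAL in this model, so it reserves nothing. */
	zones[TOY_ZONE_NORMAL].lowmem_reserve[TOY_ZONE_DMA] = 0;
	zones[TOY_ZONE_NORMAL].lowmem_reserve[TOY_ZONE_NORMAL] = 0;
}

/* Order-0 watermark check for an allocation whose preferred zone is given. */
static bool toy_watermark_ok(const struct toy_zone *z, int classzone_idx)
{
	return z->free_pages > z->pages_min + z->lowmem_reserve[classzone_idx];
}
```

With 4KB pages, a 16MB ZONE_DMA (4096 pages) and a ZONE_NORMAL of 782336
pages (3GB minus 16MB), a ratio of 256 gives a reserve of 3056 pages,
just under 12MB. That reserve only applies when the allocation was aimed
at ZONE_NORMAL and fell back; an allocation aimed at ZONE_DMA indexes
lowmem_reserve[TOY_ZONE_DMA], which is zero. With a ratio of 180 the
reserve (4346 pages) exceeds the whole of ZONE_DMA, so fallback
allocations can never pass the check there, while ZONE_DMA allocations
themselves are unaffected.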

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] mm: limit lowmem_reserve
From: Con Kolivas @ 2006-04-08  1:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <443709F1.90906@yahoo.com.au>

On Saturday 08 April 2006 10:55, Nick Piggin wrote:
> Con Kolivas wrote:
> > On Friday 07 April 2006 22:40, Nick Piggin wrote:
> >>How would zone_watermark_ok always fail though?
> >
> > Withdrew this patch a while back; ignore
>
> Well, whether or not that particular patch is a good idea, it
> is definitely a bug if zone_watermark_ok could ever always
> fail due to lowmem reserve and we should fix it.

Ok. I think I presented enough information for why I thought zone_watermark_ok 
would fail (for ZONE_DMA). With 16MB ZONE_DMA and a vmsplit of 3GB we have a 
lowmem_reserve of 12MB. It's pretty hard to keep that much ZONE_DMA free, I 
don't think I've ever seen that much free on my ZONE_DMA on an ordinary 
desktop without any particular ZONE_DMA users. Changing the tunable can make 
the lowmem_reserve larger than ZONE_DMA is on any vmsplit too as far as I 
understand the ratio.

-- 
-ck


* Re: [PATCH] mm: limit lowmem_reserve
From: Nick Piggin @ 2006-04-08  0:55 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <200604081015.44771.kernel@kolivas.org>

Con Kolivas wrote:
> On Friday 07 April 2006 22:40, Nick Piggin wrote:
> 

>>How would zone_watermark_ok always fail though?
> 
> 
> Withdrew this patch a while back; ignore
> 

Well, whether or not that particular patch is a good idea, it
is definitely a bug if zone_watermark_ok could ever always
fail due to lowmem reserve and we should fix it.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] mm: limit lowmem_reserve
From: Con Kolivas @ 2006-04-08  0:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <44365DC2.1010806@yahoo.com.au>

On Friday 07 April 2006 22:40, Nick Piggin wrote:
> Con Kolivas wrote:
> > On Friday 07 April 2006 16:25, Nick Piggin wrote:
> >>Con Kolivas wrote:
> >>>It is possible with a low enough lowmem_reserve ratio to make
> >>>zone_watermark_ok always fail if the lower_zone is small enough.
> >>
> >>I don't see how this would happen?
> >
> > 3GB lowmem and a reserve ratio of 180 is enough to do it.
>
> How would zone_watermark_ok always fail though?

Withdrew this patch a while back; ignore

-- 
-ck


* get_xip_page
From: Jared Hulbert @ 2006-04-07 23:23 UTC (permalink / raw)
  To: linux-mm

What is the "create" parameter in the get_xip_page function used for?
If create = 1, does it actually create a sector and return a pointer
to it? Under what situation is create set to 1 while calling
get_xip_page? Is there any difference for a RO file system? Is it used
for a COW?


* Re: [PATCH 2.6.17-rc1-mm1 9/9] AutoPage Migration - V0.2 - hook automigration to migrate-on-fault
From: Lee Schermerhorn @ 2006-04-07 20:45 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 9/9 hook automigration to migrate-on-fault

Add a /sys/kernel/migration control--auto_migrate_lazy--to use 
migrate-on-fault for auto-migration.

Modify migrate_to_node() to just unmap the eligible pages
via migrate_pages_unmap_only() when MPOL_MF_LAZY flag is set.

This patch depends on the "migrate-on-fault" patch series that
defines the MPOL_MF_LAZY flag and the migrate_pages_unmap_only()
function.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/mempolicy.c	2006-03-23 16:50:30.000000000 -0500
+++ linux-2.6.16-mm1/mm/mempolicy.c	2006-03-23 16:50:36.000000000 -0500
@@ -635,7 +635,11 @@ int migrate_to_node(struct mm_struct *mm
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
 	if (!list_empty(&pagelist)) {
-		err = migrate_pages_to(&pagelist, NULL, dest);
+		if (flags & MPOL_MF_LAZY)
+			err = migrate_pages_unmap_only(&pagelist);
+		else
+			err = migrate_pages_to(&pagelist, NULL, dest);
+
 		if (!list_empty(&pagelist))
 			putback_lru_pages(&pagelist);
 	}
@@ -744,6 +748,9 @@ void auto_migrate_task_memory(void)
 	 */
 	BUG_ON(!mm);
 
+	if (auto_migrate_lazy)
+		flags |= MPOL_MF_LAZY;
+
 	/*
 	 * Pass destination node as source node plus 'INVERT flag:
 	 *    Migrate all pages NOT on destination node.
@@ -1000,7 +1007,6 @@ out:
 	return err;
 }
 
-
 /* Retrieve NUMA policy */
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
Index: linux-2.6.16-mm1/mm/migrate.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/migrate.c	2006-03-23 16:50:30.000000000 -0500
+++ linux-2.6.16-mm1/mm/migrate.c	2006-03-23 16:50:36.000000000 -0500
@@ -129,6 +129,37 @@ static ssize_t migrate_max_mapcount_stor
 }
 MIGRATION_ATTR_RW(migrate_max_mapcount);
 
+/*
+ * auto_migrate_lazy:  use "lazy migration"--i.e., migration-on-fault--
+ * for scheduler driven task memory migration.
+ */
+int auto_migrate_lazy = 0;
+
+static int __init set_auto_migrate_lazy(char *str)
+{
+	get_option(&str, &auto_migrate_lazy);
+	return 1;
+}
+
+__setup("auto_migrate_lazy", set_auto_migrate_lazy);
+
+static ssize_t auto_migrate_lazy_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "auto_migrate_lazy %s\n",
+			auto_migrate_lazy ? "on" : "off");
+}
+static ssize_t auto_migrate_lazy_store(struct subsystem *subsys,
+				      const char *page, size_t count)
+{
+        unsigned long n = simple_strtoul(page, NULL, 10);
+	if (n)
+		auto_migrate_lazy = 1;
+	else
+		auto_migrate_lazy = 0;
+        return count;
+}
+MIGRATION_ATTR_RW(auto_migrate_lazy);
+
 decl_subsys(migration, NULL, NULL);
 EXPORT_SYMBOL(migration_subsys);
 
@@ -136,6 +167,7 @@ static struct attribute *migration_attrs
 	&auto_migrate_enable_attr.attr,
 	&auto_migrate_interval_attr.attr,
 	&migrate_max_mapcount_attr.attr,
+	&auto_migrate_lazy_attr.attr,
 	NULL
 };
 
Index: linux-2.6.16-mm1/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.16-mm1.orig/include/linux/auto-migrate.h	2006-03-23 16:50:30.000000000 -0500
+++ linux-2.6.16-mm1/include/linux/auto-migrate.h	2006-03-23 16:50:36.000000000 -0500
@@ -21,6 +21,7 @@ extern unsigned long auto_migrate_interv
 #define AUTO_MIGRATE_INTERVAL_MAX (300*HZ)
 
 extern unsigned int migrate_max_mapcount;
+extern int auto_migrate_lazy;
 
 #ifdef _LINUX_SCHED_H	/* only used where this is defined */
 static inline void check_internode_migration(task_t *task, int dest_cpu)
@@ -101,6 +102,7 @@ out:
 
 #define check_migrate_pending()		/* NOTHING */
 #define migrate_max_mapcount (1)
+#define auto_migrate_lazy (0)
 
 #endif	/* CONFIG_MIGRATION */
 


