linux-mm.kvack.org archive mirror
* RFC/POC Make Page Tables Relocatable
@ 2007-10-25 15:16 Ross Biro
From: Ross Biro @ 2007-10-25 15:16 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 3640 bytes --]

[ The attached patch is Proof of Concept (POC) code only. It only
works on x86_64, it only supports the slab allocator, it only
relocates the lowest level of page tables, it's less efficient than it
should be, and I'm convinced the locking is deficient.  It does work
well enough to play around with, though. The patch is a unified diff
against a clean 2.6.23.]

I'd like to propose 4 somewhat interdependent code changes.

1) Add a separate meta-data allocation to the slab and slub allocators
and allocate full pages through kmem_cache_alloc instead of get_page.
The primary motivation is that we could shrink struct page by
allocating whole pages with kmem_cache_alloc and putting the
supporting data in the meta-data area instead of in struct page. The
downside is that we might end up using more memory because of
alignment issues.  I believe we can keep the code as efficient as the
current code by allocating many pages at once with known alignment and
locating the meta data in the first few pages.  The meta data for a
page could then be found by something like (page_address & mask) +
((page_address >> foo) & mask) * meta_data_size + offset, which should
be just as fast as the current calculation.  This is different from
the proof of concept implementation.  I also believe this would reduce
kernel memory fragmentation.
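
As a concrete illustration of the lookup (this is not what the POC
does; BLOCK_ORDER, struct page_meta, and the shift/mask values are all
assumptions made up for the example):

    /* Pages are handed out from blocks of 2^BLOCK_ORDER naturally
     * aligned pages, with the meta data packed into the first page(s)
     * of each block.  Real code would also add an offset to skip the
     * pages that hold the meta data itself. */
    struct page_meta {              /* illustrative stand-in */
            void *owner;
    };

    #define BLOCK_ORDER     9
    #define BLOCK_SIZE      (PAGE_SIZE << BLOCK_ORDER)
    #define BLOCK_MASK      (~(BLOCK_SIZE - 1))

    static inline struct page_meta *page_to_meta(unsigned long page_address)
    {
            unsigned long block = page_address & BLOCK_MASK;
            unsigned long index = (page_address & ~BLOCK_MASK) >> PAGE_SHIFT;

            return (struct page_meta *)(block +
                                        index * sizeof(struct page_meta));
    }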

2) Add support for relocating memory allocated via kmem_cache_alloc.
When a cache is created, optional relocation information can be
provided.  If a relocation function is provided, caches can be
defragmented and overall memory consumption can be reduced.
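
In the POC this is exposed as kmem_cache_create_relocatable().
Roughly, from the attached patch (the parameter names follow the POC
implementation):

    struct kmem_cache *
    kmem_cache_create_relocatable(const char *name, size_t size, size_t align,
            unsigned long flags,
            void (*ctor)(void *, struct kmem_cache *, unsigned long),
            int (*relocator)(void *source_obj, void *target_obj,
                             struct kmem_cache *cachep,
                             unsigned long relocator_private,
                             unsigned long object_size),
            unsigned long relocator_private,
            size_t meta_data_size);

The relocator copies the object to target_obj and returns
RELOCATE_SUCCESS, RELOCATE_SUCCESS_RCU (the old object must not be
reused until after an RCU grace period), or RELOCATE_FAILURE.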

3) Create a handle struct for holding references to memory that might
be moved out from under you.  This is one of those things that looks
really good on paper, but in practice isn't very useful.  While I'm
sure there are a few cases in /sysfs and /proc where handles could be
put to good use, in general the overhead involved does not justify
their use.  I worry that they could become a fad and that people will
start using them when they should not be used.  The reason for
including them is that they are really good for setting up synthetic
tests for relocating memory.
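
For a sense of how that works, here is a rough usage sketch of the
handle API in the attached POC (struct foo and foo_cache are made up
for the example):

    struct foo {
            int value;
            struct khandle_target target;
    };

    struct kmem_cache *foo_cache;
    struct khandle *h;
    struct foo *f;

    foo_cache = handle_cache_create("foo_cache", 0, struct foo, target,
                                    sizeof(struct foo), 0, NULL);
    h = alloc_handle(foo_cache, GFP_KERNEL);

    f = deref_handle(h, struct foo, target);  /* pins the object */
    f->value = 42;                            /* it cannot move here */
    put_handle_ref(h);                        /* it may be relocated again */

    put_handle(h, struct foo, target, foo_cache);  /* drop the last reference */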

And finally, the real reason for doing all of the above:

4) Modify pte_alloc/free and friends to use kmem_cache_alloc and make
page tables relocatable. I believe this would go a long way towards
keeping kernel memory from fragmenting.  The biggest downside is the
number of tlb flushes involved.  The POC code uses RCU to free the old
copies of the page tables, which should reduce the flushes.  However,
it blindly flushes the tlbs on all of the cpus, when it really only
needs to flush the tlb on any cpu using the mm in question.  I believe
that by only flushing the tlbs on cpus actually using the mm in
question, we can reduce the flushes to an acceptable level.  One
alternative is to create an RCU class for tlb flushes, so that the old
table only gets freed after all the cpus have flushed their tlbs.
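
In outline, the deferred free in the POC looks like this (a condensed
sketch of relocate_pmd()/rcu_free_pmd() from the attached patch, not
the exact code):

    /* The pmd is pointed at the new copy immediately; the old copy is
     * only returned to the allocator after a grace period, so cpus
     * still walking it keep seeing valid (if stale) entries. */
    static void rcu_free_old_pte(struct rcu_head *head)
    {
            struct page_table_metadata *md =
                    container_of(head, struct page_table_metadata, head);

            kmem_cache_free(md->cachep, md->obj);
    }

    /* ... inside the relocation callback, after pmd_populate() has
     * installed the new copy: */
    md->obj = source_obj;           /* the now-unused old copy */
    md->cachep = pte_cache;
    call_rcu(&md->head, rcu_free_old_pte);
    maybe_flush_tlb_mm(mm);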

I believe that the above opens the doors to shrinking struct page and
greatly reducing kernel memory fragmentation with the only real
downside being an increase in code complexity and a possible increase
in memory usage if we are not careful.  I'm willing to code all of
this, but I'd like to get others' opinions on what's appropriate and
what's already being done.

With the exception of tlb flushes and meta data location, I believe
the POC code demonstrates how I intend to solve most of the problems
that will be encountered.  One thing I am worried about is the
performance impact of the changes, and I would like pointers to any
micro benchmarks that might be relevant.

    Ross

[-- Attachment #2: pte-relocate-poc.patch --]
[-- Type: application/octet-stream, Size: 52520 bytes --]

diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/Documentation/vm/locking lsrc/prodkernel/2.6.23/Documentation/vm/locking
--- linux-2.6.23/Documentation/vm/locking	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/Documentation/vm/locking	2007-10-24 07:08:52.000000000 -0700
@@ -83,6 +83,10 @@
 vmtruncate) does not lose sending ipi's to cloned threads that might 
 be spawned underneath it and go to user mode to drag in pte's into tlbs.
 
+With the new page table relocation code, whenever the page_table_lock
+is grabbed, the page tables must be rewalked to make sure that the
+table you are looking at has not been moved out from under you.
+
 swap_lock
 --------------
 The swap devices are chained in priority order from the "swap_list" header. 
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/arch/i386/mm/hugetlbpage.c lsrc/prodkernel/2.6.23/arch/i386/mm/hugetlbpage.c
--- linux-2.6.23/arch/i386/mm/hugetlbpage.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/arch/i386/mm/hugetlbpage.c	2007-10-24 07:08:52.000000000 -0700
@@ -87,6 +87,8 @@
 		goto out;
 
 	spin_lock(&mm->page_table_lock);
+	pud = walk_page_table_pud(mm, addr);
+	BUG_ON(!pud);
 	if (pud_none(*pud))
 		pud_populate(mm, pud, (unsigned long) spte & PAGE_MASK);
 	else
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/arch/x86_64/kernel/smp.c lsrc/prodkernel/2.6.23/arch/x86_64/kernel/smp.c
--- linux-2.6.23/arch/x86_64/kernel/smp.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/arch/x86_64/kernel/smp.c	2007-10-24 07:08:52.000000000 -0700
@@ -233,6 +233,8 @@
 	cpu_mask = mm->cpu_vm_mask;
 	cpu_clear(smp_processor_id(), cpu_mask);
 
+	mm->need_flush = 0;
+
 	if (current->active_mm == mm) {
 		if (current->mm)
 			local_flush_tlb();
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/asm-x86_64/pgalloc.h lsrc/prodkernel/2.6.23/include/asm-x86_64/pgalloc.h
--- linux-2.6.23/include/asm-x86_64/pgalloc.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/asm-x86_64/pgalloc.h	2007-10-24 07:08:52.000000000 -0700
@@ -5,6 +5,16 @@
 #include <linux/threads.h>
 #include <linux/mm.h>
 
+struct page_table_metadata {
+	struct rcu_head head;
+	void *obj;
+	struct kmem_cache *cachep;
+	struct mm_struct *mm;
+	unsigned long addr;
+	unsigned long csum;
+	spinlock_t md_lock;
+};
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pud_populate(mm, pud, pmd) \
@@ -84,6 +94,8 @@
 	free_page((unsigned long)pgd);
 }
 
+extern struct kmem_cache *pte_cache;
+
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
 	return (pte_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
@@ -91,9 +103,28 @@
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
+#if 0
 	void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
 	if (!p)
 		return NULL;
+#else
+	void *p;
+	struct page_table_metadata *md;
+
+	p = kmem_cache_alloc(pte_cache, GFP_KERNEL|__GFP_REPEAT);
+	if (!p)
+		return NULL;
+	clear_page(p);
+	md = (struct page_table_metadata *)kmem_cache_get_metadata(pte_cache, p);
+	md->addr = address;
+	md->mm = mm;
+	md->csum = (unsigned long)mm ^ address;
+	spin_lock_init(&md->md_lock);
+
+	atomic_inc(&mm->mm_count);
+	
+#endif
+
 	return virt_to_page(p);
 }
 
@@ -103,15 +134,40 @@
 static inline void pte_free_kernel(pte_t *pte)
 {
 	BUG_ON((unsigned long)pte & (PAGE_SIZE-1));
-	free_page((unsigned long)pte); 
+	free_page((unsigned long)pte);
 }
 
 static inline void pte_free(struct page *pte)
 {
+#if 0
 	__free_page(pte);
-} 
+#else
+	struct page_table_metadata *md;
+	struct mm_struct *mm;
+	unsigned long flags;
+
+	md = (struct page_table_metadata *)kmem_cache_get_metadata(pte_cache, page_address(pte));
+
+	spin_lock_irqsave(&md->md_lock, flags);
+
+	BUG_ON(	md->csum != ((unsigned long)(md->mm) ^ (md->addr)));
+
+	mm = md->mm;
+	md->mm = NULL;
+	md->addr = 0;
+	md->csum = 0;
+
+	spin_unlock_irqrestore(&md->md_lock, flags);
+
+	if (mm)
+	   mmdrop(mm); 
+
+	kmem_cache_free(pte_cache, page_address(pte));
+
+#endif
+}
 
-#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pte_free_tlb(tlb,pte) pte_free(pte)
 
 #define __pmd_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 #define __pud_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/asm-x86_64/pgtable.h lsrc/prodkernel/2.6.23/include/asm-x86_64/pgtable.h
--- linux-2.6.23/include/asm-x86_64/pgtable.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/asm-x86_64/pgtable.h	2007-10-24 07:08:52.000000000 -0700
@@ -122,6 +122,7 @@
 
 #define pte_pgprot(a)	(__pgprot((a).pte & ~PHYSICAL_PAGE_MASK))
 
+
 #endif /* !__ASSEMBLY__ */
 
 #define PMD_SIZE	(_AC(1,UL) << PMD_SHIFT)
@@ -421,6 +422,51 @@
 #define	kc_offset_to_vaddr(o) \
    (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))
 
+#include <linux/sched.h>
+static inline pgd_t *walk_page_table_pgd(struct mm_struct *mm,
+					  unsigned long addr) {
+	return pgd_offset(mm, addr);
+}
+
+static inline pud_t *walk_page_table_pud(struct mm_struct *mm,
+					 unsigned long addr) {
+	pgd_t *pgd;
+	pgd = walk_page_table_pgd(mm, addr);
+	BUG_ON(!pgd);
+	return pud_offset(pgd, addr);
+}
+
+static inline pmd_t *walk_page_table_pmd(struct mm_struct *mm,
+					 unsigned long addr) {
+	pud_t *pud;
+	pud = walk_page_table_pud(mm, addr);
+	//BUG_ON(!pud);
+	if (!pud) {
+		printk (KERN_DEBUG "walk_page_table_pmd: pud is NULL\n");
+		return NULL;
+	}
+
+	return  pmd_offset(pud, addr);
+}
+
+static inline pte_t *walk_page_table_pte(struct mm_struct *mm,
+					 unsigned long addr) {
+	pmd_t *pmd;
+	pmd = walk_page_table_pmd(mm, addr);
+	BUG_ON(!pmd);
+	return pte_offset_map(pmd, addr);
+}
+
+static inline pmd_t *walk_page_table_kernel_pmd(unsigned long addr) {
+	return walk_page_table_pmd(&init_mm, addr);
+}
+
+static inline pte_t *walk_page_table_huge_pte(struct mm_struct *mm,
+					      unsigned long addr) {
+	return (pte_t *)walk_page_table_pmd(mm, addr);
+}
+
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/asm-x86_64/tlbflush.h lsrc/prodkernel/2.6.23/include/asm-x86_64/tlbflush.h
--- linux-2.6.23/include/asm-x86_64/tlbflush.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/asm-x86_64/tlbflush.h	2007-10-24 07:10:00.000000000 -0700
@@ -46,8 +46,9 @@
 
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
-	if (mm == current->active_mm)
+	if (mm == current->active_mm) {
 		__flush_tlb();
+	}
 }
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
@@ -60,8 +61,10 @@
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 	unsigned long start, unsigned long end)
 {
-	if (vma->vm_mm == current->active_mm)
+	if (vma->vm_mm == current->active_mm) {
+		vma->vm_mm->need_flush = 0;
 		__flush_tlb();
+	}
 }
 
 #else
@@ -106,4 +109,11 @@
 	   by the normal TLB flushing algorithms. */
 }
 
+static inline void maybe_flush_tlb_mm(struct mm_struct *mm) {
+	if (mm->need_flush) {
+		mm->need_flush = 0;
+		flush_tlb_all();
+	}
+}
+
 #endif /* _X8664_TLBFLUSH_H */
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/handle.h lsrc/prodkernel/2.6.23/include/linux/handle.h
--- linux-2.6.23/include/linux/handle.h	1969-12-31 16:00:00.000000000 -0800
+++ lsrc/prodkernel/2.6.23/include/linux/handle.h	2007-10-24 08:04:46.000000000 -0700
@@ -0,0 +1,127 @@
+/* linux/handle.h
+ * Written by Ross Biro, 2007 (rossb@google.com)
+ *
+ * Copyright (C) 2007 Google Inc.
+ * See Copying File.
+ *
+ */
+
+#ifndef _LINUX_HANDLE_H
+#define _LINUX_HANDLE_H
+
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <asm/atomic.h>
+
+struct khandle_target {
+	struct khandle *handle;
+	atomic_t deref_count;
+	atomic_t handle_ref_count;
+	atomic_t generation_count;
+};
+
+struct khandle {
+	struct khandle_target *target;
+	spinlock_t relocation_lock;
+};
+
+int relocate_handle(void *source_obj, void *target_obj,
+		    struct kmem_cache *cachep,
+		    unsigned long handle_target_offset,
+		    unsigned long object_size);
+
+extern struct kmem_cache *handle_cache;
+
+static inline struct khandle *alloc_handle(struct kmem_cache *cachep,
+					   unsigned long flags) {
+	void *obj = kmem_cache_alloc(cachep, flags);
+	struct khandle *handle;
+	if (obj == NULL) {
+		return NULL;
+	}
+
+	handle = kmem_cache_alloc(handle_cache, flags);
+	if (handle == NULL) {
+		kmem_cache_free(cachep, obj);
+		return NULL;
+	}
+
+	spin_lock_init(&handle->relocation_lock);
+	handle->target = obj + kmem_cachep_relocator_private(cachep);
+
+	/* The constructor must make sure these are set up
+	 * properly.
+	 */
+	atomic_inc(&handle->target->generation_count);
+	atomic_dec(&handle->target->deref_count);
+	atomic_inc(&handle->target->handle_ref_count);
+
+	handle->target->handle = handle;
+
+	printk ("alloc_handle target->deref_count=%d\n",
+		atomic_read(&handle->target->deref_count));
+
+	return handle;
+}
+
+/* Any cache using handles *must* have a constructor, and that
+ * constructor must call this one.  This means that SLAB_POISON will not
+ * work with any handles.
+ */
+void generic_handle_ctor(void *, struct kmem_cache *, unsigned long);
+
+#define handle_cache_create(name, flags, type, member, size, align, ctor)\
+    kmem_cache_create_relocatable(name, size, align, flags,		\
+	ctor?:generic_handle_ctor, relocate_handle,	 		\
+	offsetof(type, member), 0)
+
+/**
+ * deref_handle get the pointer for this handle.
+ * @handle:	a ptr to the struct khandle.
+ * @type:	the type of the struct this points to.
+ * @member:	the name of the khandle_target within the struct.
+ */
+#define deref_handle(handle, type, member) \
+    (type *)_deref_handle(handle, offsetof(type, member))
+
+static inline void *_deref_handle(struct khandle *handle,
+				  unsigned long offset) {
+        unsigned long flags;
+	void *obj;
+	spin_lock_irqsave(&handle->relocation_lock, flags);
+	obj = handle->target - offset;
+	atomic_inc(&handle->target->deref_count);
+	spin_unlock_irqrestore(&handle->relocation_lock, flags);
+	return obj;
+}
+
+#define put_handle_ref(handle) do {					\
+        atomic_dec(&handle->target->deref_count);			\
+} while (0)
+
+#define get_handle(handle) do {						\
+	atomic_inc(&handle->target->handle_ref_count);			\
+} while (0)
+
+#define put_handle(h, type, member, cachep) do {			\
+	if (atomic_dec_and_test(&h->target->handle_ref_count)) {	\
+		unsigned long flags;					\
+		type *obj;						\
+		spin_lock_irqsave(&h->relocation_lock, flags);	\
+		obj = container_of(h->target, type, member);	\
+                h->target->handle = NULL;				\
+		wmb();							\
+		atomic_inc(&h->target->deref_count);		\
+		spin_unlock_irqrestore(&h->relocation_lock, flags);\
+		kmem_cache_free(cachep, obj);				\
+		kmem_cache_free(handle_cache, h);			\
+		h = NULL;						\
+	} 								\
+} while (0)
+
+
+
+
+
+
+#endif /* _LINUX_HANDLE_H */
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/mm.h lsrc/prodkernel/2.6.23/include/linux/mm.h
--- linux-2.6.23/include/linux/mm.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/mm.h	2007-10-24 07:08:52.000000000 -0700
@@ -935,6 +935,7 @@
 	pte_t *__pte = pte_offset_map(pmd, address);	\
 	*(ptlp) = __ptl;				\
 	spin_lock(__ptl);				\
+	__pte = walk_page_table_pte(mm, address);	\
 	__pte;						\
 })
 
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/mmzone.h lsrc/prodkernel/2.6.23/include/linux/mmzone.h
--- linux-2.6.23/include/linux/mmzone.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/mmzone.h	2007-10-24 07:08:52.000000000 -0700
@@ -18,7 +18,7 @@
 
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_FORCE_MAX_ZONEORDER
-#define MAX_ORDER 11
+#define MAX_ORDER 14
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/sched.h lsrc/prodkernel/2.6.23/include/linux/sched.h
--- linux-2.6.23/include/linux/sched.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/sched.h	2007-10-24 07:08:52.000000000 -0700
@@ -432,6 +432,7 @@
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+	int			need_flush;
 };
 
 struct sighand_struct {
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/slab.h lsrc/prodkernel/2.6.23/include/linux/slab.h
--- linux-2.6.23/include/linux/slab.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/slab.h	2007-10-24 08:33:59.000000000 -0700
@@ -29,6 +29,7 @@
 #define SLAB_DESTROY_BY_RCU	0x00080000UL	/* Defer freeing slabs to RCU */
 #define SLAB_MEM_SPREAD		0x00100000UL	/* Spread some memory over cpuset */
 #define SLAB_TRACE		0x00200000UL	/* Trace allocations and frees */
+#define SLAB_HUGE_PAGE		0x00400000UL    /* Always use at least huge page size pages for this slab. */
 
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
@@ -49,15 +50,42 @@
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
-struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
-			unsigned long,
-			void (*)(void *, struct kmem_cache *, unsigned long));
+struct kmem_cache *kmem_cache_create_relocatable(const char *, size_t, size_t,
+  			unsigned long,
+  			void (*)(void *, struct kmem_cache *, unsigned long),
+			int (*)(void *, void *, struct kmem_cache *,
+				unsigned long, unsigned long),
+			unsigned long, size_t);
+
+unsigned long kmem_cachep_relocator_private(struct kmem_cache *);
+
+static inline
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+			size_t align,
+			unsigned long flags,
+				     void (*ctor)(void *, struct kmem_cache *, unsigned long)) {
+	return kmem_cache_create_relocatable(name, size, align, flags, ctor, NULL, 0, 0);
+}
+
+void test_defrag(struct kmem_cache *);
+
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+void kmem_compute_stats(struct kmem_cache *cachep,
+			unsigned long *full_slabs,
+			unsigned long *partial_slabs,
+			unsigned long *partial_objs,
+			unsigned long *free_slabs,
+			char **error);
+void *kmem_cache_get_metadata(const struct kmem_cache *, void *);
+
+#define RELOCATE_SUCCESS_RCU 1
+#define RELOCATE_SUCCESS 0
+#define RELOCATE_FAILURE -1
 
 /*
  * Please use this macro to create slab caches. Simply specify the
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/Makefile lsrc/prodkernel/2.6.23/mm/Makefile
--- linux-2.6.23/mm/Makefile	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/Makefile	2007-10-24 07:08:52.000000000 -0700
@@ -9,7 +9,7 @@
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
-			   readahead.o swap.o truncate.o vmscan.o \
+			   readahead.o swap.o truncate.o vmscan.o handle.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   $(mmu-y)
 
@@ -29,4 +29,4 @@
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-
+obj-m += handle_test.o
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/handle.c lsrc/prodkernel/2.6.23/mm/handle.c
--- linux-2.6.23/mm/handle.c	1969-12-31 16:00:00.000000000 -0800
+++ lsrc/prodkernel/2.6.23/mm/handle.c	2007-10-24 07:34:43.000000000 -0700
@@ -0,0 +1,129 @@
+/* mm/handle.c
+ * Written by Ross Biro, 2007 (rossb@google.com)
+ *
+ * Copyright (C) 2007 Google Inc.
+ * See Copying File.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/handle.h>
+#include <linux/module.h>
+
+struct kmem_cache *handle_cache;
+
+EXPORT_SYMBOL_GPL(handle_cache);
+
+/*
+ * This function handles the relocation.  The guarantee we have is that
+ * the source and target objects will not vanish underneath us.
+ * However, they might get recycled.  So we have to be careful to get
+ * the handle pointer.  The caller has appropriate locks to make sure
+ * that two different threads don't try to relocate the same object at
+ * the same time.
+ */
+int relocate_handle(void *source_obj, void *target_obj,
+		    struct kmem_cache *cachep,
+		    unsigned long handle_target_offset,
+		    unsigned long object_size) {
+	struct khandle_target *handle_target = source_obj +
+			handle_target_offset;
+
+	struct khandle *handle;
+	unsigned long flags;
+	int generation = atomic_read(&handle_target->generation_count);
+
+
+	if (atomic_read(&handle_target->deref_count)) {
+		printk (KERN_DEBUG "relocate_handle: handle in use (%d).\n",
+			atomic_read(&handle_target->deref_count));
+		printk (KERN_DEBUG "handle_target_offset = %d\n",
+			handle_target_offset);
+		return 1;
+	}
+
+	atomic_inc(&handle_target->handle_ref_count);
+	handle = handle_target->handle;
+
+	/* we need to make sure that the atomic_inc completed,
+	   and the atomic read is not using a cached (even by the
+	   compiler) value. */
+	mb();
+
+	/* Make sure the handle didn't vanish underneath us while
+	   we were grabbing it. */
+	if (handle == NULL || atomic_read(&handle_target->deref_count)) {
+		atomic_dec(&handle_target->handle_ref_count);
+		printk (KERN_DEBUG "relocate_handle: handle in use after grabbing.\n");
+		return 1;
+	}
+
+
+	/*
+	 * At this point, we know that the handle is valid and the
+	 * object cannot be recycled while we are looking at it.
+	 * We know this because the recycling code increments the ref
+	 * count, and we have a ref count of 0.  Plus we incremented
+	 * the ref count of the handle, so it cannot drop to 0 either.
+	 */
+
+	spin_lock_irqsave(&handle->relocation_lock, flags);
+
+	/* Now check the deref count one last time.  If it's still 0,
+	   then we have exclusive access to the object.
+	*/
+
+	if (atomic_read(&handle_target->deref_count)) {
+		spin_unlock_irqrestore(&handle->relocation_lock, flags);
+		atomic_dec(&handle_target->handle_ref_count);
+		printk (KERN_DEBUG "relocate_handle: handle in use after lock.\n");
+		return 1;
+	}
+
+	/* Make sure we have the correct handle. */
+	if (generation != atomic_read(&handle_target->generation_count)) {
+		spin_unlock_irqrestore(&handle->relocation_lock, flags);
+		atomic_dec(&handle->target->handle_ref_count);
+		printk (KERN_DEBUG
+			"relocate_handle: handle generation changed.\n");
+		return 1;
+	}
+
+	/* Now we've got the object.  Do a shallow copy. */
+	memcpy (target_obj, source_obj, object_size);
+
+	/* We adjust the handle */
+	handle->target = target_obj + handle_target_offset;
+
+	/* Release the locks.  The object has been moved. */
+	spin_unlock_irqrestore(&handle->relocation_lock, flags);
+	atomic_dec(&handle->target->handle_ref_count);
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(relocate_handle);
+
+void generic_handle_ctor(void *obj, struct kmem_cache *cachep,
+			 unsigned long unused) {
+	struct khandle_target *target = obj +
+			kmem_cachep_relocator_private(cachep);
+	atomic_set(&target->generation_count, 0);
+	/* We have a pointer right now, so the handle has been
+	 * dereferenced even though it doesn't really exist yet.
+	 */
+	atomic_set(&target->deref_count, 1);
+	atomic_set(&target->handle_ref_count, 0);
+
+}
+
+EXPORT_SYMBOL_GPL(generic_handle_ctor);
+
+static int __init handle_init(void) {
+	handle_cache = kmem_cache_create("handle_cache",
+					 sizeof(struct khandle),
+					 0, 0, NULL);
+	return 0;
+}
+
+module_init(handle_init);
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/handle_test.c lsrc/prodkernel/2.6.23/mm/handle_test.c
--- linux-2.6.23/mm/handle_test.c	1969-12-31 16:00:00.000000000 -0800
+++ lsrc/prodkernel/2.6.23/mm/handle_test.c	2007-10-24 07:08:52.000000000 -0700
@@ -0,0 +1,140 @@
+/* mm/handle_test.c
+ * Written by Ross Biro, 2007 (rossb@google.com)
+ *
+ * Copyright (C) 2007 Google Inc.
+ * See Copying File.
+ *
+ * This file is for a module that exercises the handle systems
+ * and runs a bunch of unit tests.
+ *
+ * Loading the module should execute all the tests.  If all goes
+ * resonably well, the module should just clean up after itself and
+ * be ready to unload.  If not, anything could go wrong, after all it's
+ * a kernel module.
+ */
+
+#include <linux/kernel.h>
+#include <linux/handle.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("A test module for the handle code.");
+
+struct handle_test1 {
+	char filler0[11];
+	struct khandle_target target;
+	char filler1[11];
+};
+
+#define TEST1_SIZE PAGE_SIZE * 4
+
+struct khandle *test1_handles[TEST1_SIZE];
+struct handle_test1 *test1_ptrs[ARRAY_SIZE(test1_handles)];
+
+static int __init handle_test(void) {
+	int i;
+	struct kmem_cache *test1_cache = NULL;
+	char *error = NULL;
+	unsigned long full_slabs_before;
+	unsigned long partial_slabs_before;
+	unsigned long partial_objs_before;
+	unsigned long free_slabs_before;
+	unsigned long full_slabs_after;
+	unsigned long partial_slabs_after;
+	unsigned long partial_objs_after;
+	unsigned long free_slabs_after;
+
+	test1_cache = handle_cache_create("handle_test1", 0,
+					  struct handle_test1, target,
+					  sizeof(struct handle_test1),
+					  0, NULL);
+
+	if (test1_cache == NULL) {
+		printk (KERN_DEBUG "handle_test: Unable to allocate cache_test1");
+		goto test_failed;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(test1_handles); i++) {
+		test1_handles[i] = alloc_handle(test1_cache, GFP_KERNEL);
+	}
+
+	kmem_compute_stats(test1_cache, &full_slabs_before,
+			   &partial_slabs_before, &partial_objs_before,
+			   &free_slabs_before, &error);
+
+
+	printk (KERN_DEBUG "before: free %d partial %d full %d\n",
+		free_slabs_before, partial_slabs_before,
+		full_slabs_before);
+
+	/* Now fragment the crap out of the thing. */
+	for (i = ARRAY_SIZE(test1_handles) - 1 ; i >= 0; i--) {
+		if (i & 7) {
+			put_handle(test1_handles[i],
+				   struct handle_test1,
+				   target, test1_cache);
+			test1_ptrs[i] = NULL;
+			test1_handles[i] = NULL;
+		}
+	}
+
+	kmem_compute_stats(test1_cache, &full_slabs_before,
+			   &partial_slabs_before, &partial_objs_before,
+			   &free_slabs_before, &error);
+
+	/* Force some defrag. */
+	for (i = 0; i < partial_slabs_before; i++) {
+		test_defrag(test1_cache);
+	}
+
+	if (signal_pending(current)) {
+		printk (KERN_DEBUG "handle_test: Abandoning test due to signal.\n");
+		goto test_failed;
+	}
+
+	kmem_compute_stats(test1_cache, &full_slabs_after,
+			   &partial_slabs_after, &partial_objs_after,
+			   &free_slabs_after, &error);
+
+	/* We should either have more free slabs, or fewer total slabs. */
+	if (free_slabs_after <= free_slabs_before &&
+	    free_slabs_after + partial_slabs_after + full_slabs_after >=
+	    free_slabs_before + partial_slabs_before + full_slabs_before) {
+		printk (KERN_DEBUG "handle_test: test 1 failed. "
+			"Memory was not freed\n");
+		printk (KERN_DEBUG "before: free %d partial %d full %d\n",
+			free_slabs_before, partial_slabs_before,
+			full_slabs_before);
+		printk (KERN_DEBUG "after: free %d partial %d full %d\n",
+			free_slabs_after, partial_slabs_after,
+			full_slabs_after);
+		goto test_failed;
+	}
+
+
+
+ test_failed:
+	for (i = 0; i < ARRAY_SIZE(test1_handles); i++) {
+		if (test1_ptrs[i])
+			put_handle_ref(test1_handles[i]);
+		if (test1_handles[i])
+			put_handle(test1_handles[i], struct handle_test1,
+				   target, test1_cache);
+	}
+
+	kmem_cache_destroy(test1_cache);
+
+	return 0;
+
+}
+
+static void __exit
+handle_test_exit(void)
+{
+	return;
+}
+
+
+module_init(handle_test);
+module_exit(handle_test_exit);
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/hugetlb.c lsrc/prodkernel/2.6.23/mm/hugetlb.c
--- linux-2.6.23/mm/hugetlb.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/hugetlb.c	2007-10-24 07:15:28.000000000 -0700
@@ -378,7 +378,12 @@
 		if (!dst_pte)
 			goto nomem;
 		spin_lock(&dst->page_table_lock);
+		dst_pte = walk_page_table_huge_pte(dst, addr);
+		BUG_ON(!dst_pte);
 		spin_lock(&src->page_table_lock);
+		src_pte = walk_page_table_huge_pte(src, addr);
+		BUG_ON(!src_pte);
+
 		if (!pte_none(*src_pte)) {
 			if (cow)
 				ptep_set_wrprotect(src, addr, src_pte);
@@ -561,6 +566,9 @@
 
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
+
+ 	ptep = walk_page_table_huge_pte(mm, address);
+
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
@@ -609,6 +617,9 @@
 	ret = 0;
 
 	spin_lock(&mm->page_table_lock);
+	ptep = walk_page_table_huge_pte(mm, address);
+	BUG_ON(!ptep);
+
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, *ptep)))
 		if (write_access && !pte_write(entry))
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/memory.c lsrc/prodkernel/2.6.23/mm/memory.c
--- linux-2.6.23/mm/memory.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/memory.c	2007-10-24 07:44:53.000000000 -0700
@@ -69,6 +69,8 @@
 EXPORT_SYMBOL(mem_map);
 #endif
 
+struct kmem_cache *pte_cache;
+
 unsigned long num_physpages;
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
@@ -306,6 +308,8 @@
 
 	pte_lock_init(new);
 	spin_lock(&mm->page_table_lock);
+	pmd = walk_page_table_pmd(mm, address);
+	BUG_ON(!pmd);
 	if (pmd_present(*pmd)) {	/* Another has populated it */
 		pte_lock_deinit(new);
 		pte_free(new);
@@ -325,6 +329,8 @@
 		return -ENOMEM;
 
 	spin_lock(&init_mm.page_table_lock);
+	pmd = walk_page_table_kernel_pmd(address);
+	BUG_ON(!pmd);
 	if (pmd_present(*pmd))		/* Another has populated it */
 		pte_free_kernel(new);
 	else
@@ -506,6 +512,11 @@
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	arch_enter_lazy_mmu_mode();
 
+	src_pte = walk_page_table_pte(src_mm, addr);
+	BUG_ON(!src_pte);
+	dst_pte = walk_page_table_pte(dst_mm, addr);
+	BUG_ON(!dst_pte);
+
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -2483,7 +2494,8 @@
  * a struct_page backing it
  *
  * As this is called only for pages that do not currently exist, we
- * do not need to flush old virtual caches or the TLB.
+ * do not need to flush old virtual caches or the TLB, unless someone
+ * else has left the page table cache in an unknown state.
  *
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2603,6 +2615,8 @@
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
+	pte = walk_page_table_pte(mm, address);
+
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
 	if (write_access) {
@@ -2625,6 +2639,7 @@
 		if (write_access)
 			flush_tlb_page(vma, address);
 	}
+	maybe_flush_tlb_mm(mm);
 unlock:
 	pte_unmap_unlock(pte, ptl);
 	return 0;
@@ -2674,6 +2689,8 @@
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
+	pgd = walk_page_table_pgd(mm, address);
+	BUG_ON(!pgd);
 	if (pgd_present(*pgd))		/* Another has populated it */
 		pud_free(new);
 	else
@@ -2695,6 +2712,8 @@
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
+	pud = walk_page_table_pud(mm, address);
+	BUG_ON(!pud);
 #ifndef __ARCH_HAS_4LEVEL_HACK
 	if (pud_present(*pud))		/* Another has populated it */
 		pmd_free(new);
@@ -2867,3 +2886,218 @@
 	return buf - old_buf;
 }
 EXPORT_SYMBOL_GPL(access_process_vm);
+
+/* We need to use RCU to clean up the page tables because many read
+   accesses do not grab the lock and they are in the page fault fast
+   path, so we don't want to touch them. We flush the page tables
+   right away, but we don't flush the pages until the rcu callback.
+   We can get away with this since the old page is still valid and
+   anybody that modifies the new one will have to flush the pages
+   anyway. We can't wait to flush the page tables themselves since
+   if we fault in a page, the fault code will only modify the new
+   page tables, but if the cpu is looking at the old ones, it will
+   continue to fault on the old page table while the fault handler will
+   see the new page tables and not know what is going on.  It appears that
+   there is only one architecture where flush_tlb_pgtables is not a no-op,
+   so it doesn't hurt much to do it here.  We might lose some accessed bit
+   updates, but we can live with that.
+ */
+
+int relocate_pgd(void *source_obj, void *target_obj,
+			 struct kmem_cache *cachep,
+			 unsigned long unused,
+ 			 unsigned long object_size) {
+	struct mm_struct *mm;
+	struct page_table_metadata *md;
+	unsigned long addr;
+
+	/* get the mm so we can lock it and the entry pointing to this
+	   page table. */
+	md = (struct page_table_metadata *)kmem_cache_get_metadata(cachep,
+								   source_obj);
+	if (!md)
+		return RELOCATE_FAILURE;
+
+	mm = md->mm;
+	addr = md->addr;
+
+	/* irqs are off when this function is called. */
+	spin_lock(&mm->page_table_lock);
+	memcpy(target_obj, source_obj, object_size);
+	pgd_populate(mm, pgd_offset(mm, addr), target_obj);
+	flush_tlb_pgtables(mm, md->addr, md->addr + (1UL << PGDIR_SHIFT) - 1);
+	mm->need_flush = 1;
+ 	spin_unlock(&mm->page_table_lock);
+	return RELOCATE_SUCCESS_RCU;
+}
+
+int relocate_pud(void *source_obj, void *target_obj,
+		 struct kmem_cache *cachep,
+		 unsigned long unused,
+		 unsigned long object_size) {
+	struct mm_struct *mm;
+	struct page_table_metadata *md;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	/* get the mm so we can lock it and the entry pointing to this
+	   page table. */
+	md = (struct page_table_metadata *)
+			kmem_cache_get_metadata(cachep, source_obj);
+
+	if (!md)
+		return RELOCATE_FAILURE;
+
+	mm = md->mm;
+	addr = md->addr;
+
+	/* irqs are off when this function is called. */
+	spin_lock(&mm->page_table_lock);
+
+	pgd = pgd_offset(mm, addr);
+
+	if (!pgd_none(*pgd) && pgd_present(*pgd)) {
+		pud = pud_offset(pgd, addr);
+		if (!pud_none(*pud) && pud_present(*pud)) {
+			memcpy(target_obj, source_obj, object_size);
+			pud_populate(mm, pud, target_obj);
+			flush_tlb_pgtables(mm, addr,
+					   addr + (1 << PUD_SHIFT) - 1);
+			spin_unlock(&mm->page_table_lock);
+			return RELOCATE_SUCCESS_RCU;
+		}
+	}
+
+	mm->need_flush = 1;
+	spin_unlock(&mm->page_table_lock);
+	return RELOCATE_FAILURE;
+}
+
+static void rcu_free_pmd(struct rcu_head *head) {
+	struct page_table_metadata *md = 
+			(struct page_table_metadata *)head;
+	BUG_ON(!md->mm);
+	BUG_ON(md->addr);
+	BUG_ON(!md->cachep);
+	BUG_ON(!md->obj);
+
+	/* maybe_flush_tlb_mm(md->mm);
+	   mmdrop(md->mm); */
+	kmem_cache_free(md->cachep, md->obj);
+}
+
+static int relocate_pmd(void *source_obj, void *target_obj,
+			struct kmem_cache *cachep,
+			unsigned long unused,
+			unsigned long object_size) {
+	struct mm_struct *mm;
+	struct page_table_metadata *md;
+	unsigned long addr;
+ 	pmd_t *pmd;
+	unsigned long flags;
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+	struct page *target_page = virt_to_page(target_obj);
+	struct page *source_page = virt_to_page(source_obj);
+#endif
+
+	/* get the mm so we can lock it and the entry pointing to this
+ 	   page table. */
+	md = (struct page_table_metadata *)
+			kmem_cache_get_metadata(cachep, source_obj);
+
+	/*
+	printk (KERN_DEBUG "relocate_pmd source=%p mm=%p addr=0x%lx object_size=%d\n",
+		source_obj, mm, addr, object_size);
+        printk (KERN_DEBUG "md=%p mm=%p addr=0x%lx csum=0x%lx\n",
+	md, mm, addr, md->csum);
+	*/
+	BUG_ON(md->csum != ((unsigned long)(md->mm) ^ (md->addr)));
+
+	if (!md->mm || !md->addr || md->mm == &init_mm) {
+		return RELOCATE_FAILURE;
+	}
+
+	if (md->addr >= PAGE_OFFSET) {
+		printk (KERN_INFO "attempted to relocate kernel page.\n");
+		return RELOCATE_FAILURE;
+	}
+ 
+	spin_lock_irqsave(&md->md_lock, flags);
+
+	mm = md->mm;
+	addr = md->addr;
+
+	/* Make sure the mm does not go away. */
+	if (mm && addr)
+		atomic_inc(&mm->mm_count);
+
+	spin_unlock_irqrestore(&md->md_lock, flags);
+
+	if (!mm || !addr)
+		return RELOCATE_FAILURE;
+
+	/* irqs are off when this function is called. */
+	spin_lock_irqsave(&mm->page_table_lock, flags);
+
+	pmd = walk_page_table_pmd(mm, addr);
+	if (pmd && !virt_addr_valid(pmd)) {
+		printk (KERN_WARNING "walk_page_table_pmd returned %p which is not valid.\n", pmd);
+	}
+
+	if (pmd && 
+	    pmd_page_vaddr(*pmd) == (unsigned long)source_obj) {
+		memcpy (kmem_cache_get_metadata(cachep,
+						target_obj),
+			md, sizeof(*md));
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+		spin_lock_init(&target_page->ptl);
+		spin_lock(&source_page->ptl);
+#endif
+
+		memcpy(target_obj, source_obj, object_size);
+		pmd_populate(NULL, pmd, virt_to_page(target_obj));
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+		spin_unlock(&source_page->ptl);
+#endif
+	flush_tlb_pgtables(mm, addr,
+			   addr + (1 << PMD_SHIFT)
+				   - 1);
+
+		mm->need_flush = 1;
+		md->addr = 0;
+		md->csum = (unsigned long)mm;
+		spin_unlock_irqrestore(&mm->page_table_lock, flags);
+
+
+		//printk (KERN_DEBUG "relocate_pmd: successfully relocated pte (%p)\n", source_obj);
+		/* Don't drop the MM, we have an extra copy of it so
+		   we know what mm to flush when we drop the page. */
+		md->obj = source_obj;
+		md->cachep = cachep;
+		call_rcu(&md->head, rcu_free_pmd);
+		maybe_flush_tlb_mm(mm);
+		mmdrop(mm);
+
+		return RELOCATE_SUCCESS_RCU;
+	}
+
+	spin_unlock_irqrestore(&mm->page_table_lock, flags);
+	mmdrop(mm);
+	return RELOCATE_FAILURE;
+}
+
+static int __init page_table_cache_init(void)
+{
+	pte_cache = kmem_cache_create_relocatable("pte", PAGE_SIZE,
+						  PAGE_SIZE, SLAB_HUGE_PAGE,
+						  NULL,
+						  relocate_pmd, 0,
+						  sizeof(struct page_table_metadata));
+	BUG_ON(!pte_cache);
+	return 0;
+}
+
+module_init(page_table_cache_init);
+
+
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/rmap.c lsrc/prodkernel/2.6.23/mm/rmap.c
--- linux-2.6.23/mm/rmap.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/rmap.c	2007-10-24 07:08:52.000000000 -0700
@@ -254,6 +254,8 @@
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
+
+	pte = walk_page_table_pte(mm, address);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
 		return pte;
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/slab.c lsrc/prodkernel/2.6.23/mm/slab.c
--- linux-2.6.23/mm/slab.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/slab.c	2007-10-24 07:54:14.000000000 -0700
@@ -140,7 +140,7 @@
 #define	REDZONE_ALIGN		max(BYTES_PER_WORD, __alignof__(unsigned long long))
 
 #ifndef cache_line_size
-#define cache_line_size()	L1_CACHE_BYTES
+#define cache_line_size()	L1_CACHE_BYTES
 #endif
 
 #ifndef ARCH_KMALLOC_MINALIGN
@@ -178,12 +178,14 @@
 			 SLAB_CACHE_DMA | \
 			 SLAB_STORE_USER | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
+			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
+			 SLAB_HUGE_PAGE)
 #else
 # define CREATE_MASK	(SLAB_HWCACHE_ALIGN | \
 			 SLAB_CACHE_DMA | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
+			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
+                         SLAB_HUGE_PAGE)
 #endif
 
 /*
@@ -225,6 +227,8 @@
 	unsigned int inuse;	/* num of objs active in slab */
 	kmem_bufctl_t free;
 	unsigned short nodeid;
+	unsigned long meta_data_start;
+	unsigned long meta_data_end;
 };
 
 /*
@@ -410,6 +414,11 @@
 	/* constructor func */
 	void (*ctor) (void *, struct kmem_cache *, unsigned long);
 
+	/* relocator function */
+	int (*relocator) (void *, void *, struct kmem_cache *,
+			  unsigned long, unsigned long);
+	unsigned long relocator_private;
+
 /* 5) cache creation/removal */
 	const char *name;
 	struct list_head next;
@@ -431,6 +440,9 @@
 	atomic_t freehit;
 	atomic_t freemiss;
 #endif
+
+	size_t meta_data_size;
+
 #if DEBUG
 	/*
 	 * If debugging is enabled, then the allocator can add additional
@@ -673,6 +685,15 @@
 	.name = "kmem_cache",
 };
 
+/* allocators might need this.  In particular, the handle allocator
+ * uses this to locate the handle_target.
+ */
+unsigned long kmem_cachep_relocator_private(struct kmem_cache *cachep) {
+	return cachep->relocator_private;
+}
+
+EXPORT_SYMBOL_GPL(kmem_cachep_relocator_private);
+
 #define BAD_ALIEN_MAGIC 0x01020304ul
 
 #ifdef CONFIG_LOCKDEP
@@ -798,9 +819,14 @@
 	return __find_general_cachep(size, gfpflags);
 }
 
-static size_t slab_mgmt_size(size_t nr_objs, size_t align)
+static size_t slab_mgmt_size(size_t nr_objs, size_t align,
+			     size_t meta_data_size)
 {
-	return ALIGN(sizeof(struct slab)+nr_objs*sizeof(kmem_bufctl_t), align);
+	size_t res1, res2;
+	res1 = sizeof(struct slab)+nr_objs*sizeof(kmem_bufctl_t)+
+			nr_objs*meta_data_size;
+	res2 = ALIGN(res1, align);
+	return res2;
 }
 
 /*
@@ -808,7 +834,7 @@
  */
 static void cache_estimate(unsigned long gfporder, size_t buffer_size,
 			   size_t align, int flags, size_t *left_over,
-			   unsigned int *num)
+			   unsigned int *num, size_t meta_data_size)
 {
 	int nr_objs;
 	size_t mgmt_size;
@@ -845,21 +871,31 @@
 		 * into account.
 		 */
 		nr_objs = (slab_size - sizeof(struct slab)) /
-			  (buffer_size + sizeof(kmem_bufctl_t));
+			  (buffer_size + sizeof(kmem_bufctl_t) +
+			   meta_data_size);
 
 		/*
 		 * This calculated number will be either the right
 		 * amount, or one greater than what we want.
 		 */
-		if (slab_mgmt_size(nr_objs, align) + nr_objs*buffer_size
+		if (slab_mgmt_size(nr_objs, align, meta_data_size) +
+		    nr_objs*buffer_size
 		       > slab_size)
 			nr_objs--;
 
 		if (nr_objs > SLAB_LIMIT)
 			nr_objs = SLAB_LIMIT;
 
-		mgmt_size = slab_mgmt_size(nr_objs, align);
+		mgmt_size = slab_mgmt_size(nr_objs, align,
+					   meta_data_size);
 	}
+
+	if (meta_data_size != 0) {
+		printk (KERN_INFO "cache_estimate: mgmt_size = %d, "
+			"nr_objs=%d, meta_data_size=%d\n", mgmt_size,
+			nr_objs, meta_data_size);
+	}
+
 	*num = nr_objs;
 	*left_over = slab_size - nr_objs*buffer_size - mgmt_size;
 }
@@ -1463,15 +1499,17 @@
 
 	for (order = 0; order < MAX_ORDER; order++) {
 		cache_estimate(order, cache_cache.buffer_size,
-			cache_line_size(), 0, &left_over, &cache_cache.num);
+			cache_line_size(), 0, &left_over, &cache_cache.num,
+			       0);
 		if (cache_cache.num)
 			break;
 	}
 	BUG_ON(!cache_cache.num);
 	cache_cache.gfporder = order;
 	cache_cache.colour = left_over / cache_cache.colour_off;
-	cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
-				      sizeof(struct slab), cache_line_size());
+	cache_cache.slab_size = slab_mgmt_size(cache_cache.num,
+					       cache_line_size(),
+					       cache_cache.meta_data_size);
 
 	/* 2+3) create the kmalloc caches */
 	sizes = malloc_sizes;
@@ -1993,22 +2031,25 @@
 	size_t left_over = 0;
 	int gfporder;
 
-	for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
+	for (gfporder = (flags & SLAB_HUGE_PAGE)?HUGETLB_PAGE_ORDER:0;
+	     gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
 		unsigned int num;
 		size_t remainder;
 
-		cache_estimate(gfporder, size, align, flags, &remainder, &num);
+		cache_estimate(gfporder, size, align, flags, &remainder, &num,
+			       cachep->meta_data_size);
 		if (!num)
 			continue;
 
-		if (flags & CFLGS_OFF_SLAB) {
+		if (flags & CFLGS_OFF_SLAB && cachep->num) {
 			/*
 			 * Max number of objs-per-slab for caches which
 			 * use off-slab slabs. Needed to avoid a possible
 			 * looping condition in cache_grow().
 			 */
 			offslab_limit = size - sizeof(struct slab);
-			offslab_limit /= sizeof(kmem_bufctl_t);
+			offslab_limit /= (sizeof(kmem_bufctl_t) +
+					  cachep->meta_data_size);
 
  			if (num > offslab_limit)
 				break;
@@ -2125,9 +2166,13 @@
  * as davem.
  */
 struct kmem_cache *
-kmem_cache_create (const char *name, size_t size, size_t align,
-	unsigned long flags,
-	void (*ctor)(void*, struct kmem_cache *, unsigned long))
+kmem_cache_create_relocatable (const char *name, size_t size, size_t align,
+       unsigned long flags,
+       void (*ctor)(void*, struct kmem_cache *, unsigned long),
+       int (*relocator)(void*, void*, struct kmem_cache *,
+			unsigned long, unsigned long),
+       unsigned long relocator_private,
+       size_t  meta_data_size)
 {
 	size_t left_over, slab_size, ralign;
 	struct kmem_cache *cachep = NULL, *pc;
@@ -2260,6 +2305,14 @@
 	if (!cachep)
 		goto oops;
 
+	/* Need this early to compute slab size properly. */
+	cachep->meta_data_size = meta_data_size;
+
+	if (meta_data_size) {
+		printk (KERN_INFO "kmem_cache_create meta_data_size=%d\n",
+			meta_data_size);
+	}
+
 #if DEBUG
 	cachep->obj_size = size;
 
@@ -2314,9 +2367,10 @@
 		cachep = NULL;
 		goto oops;
 	}
-	slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t)
-			  + sizeof(struct slab), align);
 
+	slab_size = slab_mgmt_size(cachep->num,
+				   align,
+				   cachep->meta_data_size);
 	/*
 	 * If the slab has been placed off-slab, and we have enough space then
 	 * move it on-slab. This is at the expense of any extra colouring.
@@ -2328,8 +2382,8 @@
 
 	if (flags & CFLGS_OFF_SLAB) {
 		/* really off slab. No need for manual alignment */
-		slab_size =
-		    cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
+		slab_size = slab_mgmt_size(cachep->num, 1,
+					   cachep->meta_data_size);
 	}
 
 	cachep->colour_off = cache_line_size();
@@ -2358,6 +2412,8 @@
 	}
 	cachep->ctor = ctor;
 	cachep->name = name;
+	cachep->relocator = relocator;
+	cachep->relocator_private = relocator_private;
 
 	if (setup_cpu_cache(cachep)) {
 		__kmem_cache_destroy(cachep);
@@ -2374,7 +2430,7 @@
 	mutex_unlock(&cache_chain_mutex);
 	return cachep;
 }
-EXPORT_SYMBOL(kmem_cache_create);
+EXPORT_SYMBOL(kmem_cache_create_relocatable);
 
 #if DEBUG
 static void check_irq_off(void)
@@ -2582,6 +2638,12 @@
  * kmem_find_general_cachep till the initialization is complete.
  * Hence we cannot have slabp_cache same as the original cache.
  */
+
+static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp)
+{
+	return (kmem_bufctl_t *) (slabp + 1);
+}
+
 static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
 				   int colour_off, gfp_t local_flags,
 				   int nodeid)
@@ -2598,18 +2660,51 @@
 		slabp = objp + colour_off;
 		colour_off += cachep->slab_size;
 	}
+
+	memset(slabp, 0, cachep->slab_size);
+
 	slabp->inuse = 0;
 	slabp->colouroff = colour_off;
 	slabp->s_mem = objp + colour_off;
 	slabp->nodeid = nodeid;
+	slabp->meta_data_start = (unsigned long)slab_bufctl(slabp) +
+			sizeof(kmem_bufctl_t)*cachep->num;
+	slabp->meta_data_end = slabp->meta_data_start + cachep->meta_data_size * cachep->num;
 	return slabp;
 }
 
-static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp)
-{
-	return (kmem_bufctl_t *) (slabp + 1);
+static inline void *slab_meta_data(const struct kmem_cache *cachep,
+				   struct slab *slabp) {
+	return (void *)slab_bufctl(slabp) +
+			cachep->num * sizeof(kmem_bufctl_t);
 }
 
+void *kmem_cache_get_metadata(const struct kmem_cache *cache,
+			      void *obj) {
+	if (cache->meta_data_size == 0) {
+		return NULL;
+	} else {
+		struct slab *slab = virt_to_slab(obj);
+		int ind = obj_to_index(cache, slab, obj);
+		void *ret;
+
+		ret = slab_meta_data(cache, slab) +
+				ind * cache->meta_data_size;
+		
+		if ((unsigned long)ret < slab->meta_data_start ||
+		    (unsigned long)ret >= slab->meta_data_end) {
+			printk (KERN_ERR "kmem_cache_get_metadata: Bad ret ind=%d ret=%p slab=%p\n", ind, ret, slab);
+		}
+
+		BUG_ON((unsigned long)ret < slab->meta_data_start);
+		BUG_ON((unsigned long)ret >= slab->meta_data_end);
+
+		return ret;
+	}
+}
+
+
+
 static void cache_init_objs(struct kmem_cache *cachep,
 			    struct slab *slabp)
 {
@@ -2681,8 +2776,10 @@
 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
 	WARN_ON(slabp->nodeid != nodeid);
 #endif
+	slab_bufctl(slabp)[slabp->free] = BUFCTL_ACTIVE;
 	slabp->free = next;
 
+
 	return objp;
 }
 
@@ -4013,6 +4110,143 @@
 	}
 }
 
+/*
+ * Attempt to take the next to be reused slab and free it up.
+ */
+void defrag_cache_node(struct kmem_cache *cachep, int node) {
+	struct kmem_list3 *l3 = cachep->nodelists[node];
+	struct slab *slabp;
+	kmem_bufctl_t *ctlp;
+	int i;
+	void *targetp = NULL;
+
+	slabp = list_entry(l3->slabs_partial.next,
+			   struct slab, list);
+
+	/* maybe this will be clear by the next time around. */
+	list_del(&slabp->list);
+	list_add_tail(&slabp->list, &l3->slabs_partial);
+
+
+	for (i = 0; i < cachep->num; i++) {
+		/* This risks using up the hot cpu pages on things
+		 * that are old and stale.
+		 */
+		if (targetp == NULL) {
+			/*
+			 * We risk thrashing on the spin lock, but what
+			 * else can we do?  We need to be able to allocate
+			 * new objects.
+			 */
+			spin_unlock(&l3->list_lock);
+			targetp = kmem_cache_alloc_node(cachep,
+						GFP_ATOMIC & ~GFP_THISNODE,
+						node);
+			spin_lock(&l3->list_lock);
+			if (targetp == NULL) {
+				printk (KERN_DEBUG
+					"defrag_cache_node: Couldn't allocate target.\n");
+				/* WTF? Couldn't get memory. */
+				break;
+			}
+
+		}
+
+		if (unlikely(list_empty(&l3->slabs_partial))) {
+			printk (KERN_DEBUG "defrag_cache_node: partial list empty.\n");
+			break;
+		}
+
+		/*
+		 * This may not be the same slab as we saw last time,
+		 * but that is a risk we will just have to take.
+		 * Things should still be consolidated, but we likely
+		 * won't free anything in this pass.
+		 */
+		slabp = list_entry(l3->slabs_partial.prev,
+				   struct slab, list);
+
+		ctlp = slab_bufctl(slabp);
+
+		if (ctlp[i] == BUFCTL_ACTIVE) {
+			void *objp = index_to_obj(cachep, slabp, i);
+
+			/* The relocator is responsible for making sure
+			 * the object doesn't disappear out from
+			 * under it.  The memory itself won't be freed,
+			 * but the object might be on the cpu hot list and
+			 * might be reused.
+			 */
+			int rel = cachep->relocator(objp, targetp, cachep,
+						    cachep->relocator_private,
+						    obj_size(cachep));
+			switch (rel) {
+				case RELOCATE_SUCCESS_RCU:
+					/* We've moved the copy, but we
+					 * can't free the old one right away
+					 * because it might still be in use.
+					 */
+					/*printk (KERN_DEBUG
+						"defrag_cache_node: "
+						"RCU success\n"); */
+					targetp = NULL;
+					break;
+
+				case RELOCATE_SUCCESS:
+					/* relocation succeeded. objp is now
+					 * free.  targetp is used.
+					 */
+					/*printk (KERN_DEBUG "defrag_cache_node: "
+					  "relocated object %d.\n", i); */
+					targetp = NULL;
+					ctlp[i] = slabp->free;
+					slabp->free = i;
+					l3->free_objects++;
+					slabp->inuse--;
+					if (slabp->inuse == 0) {
+						list_del(&slabp->list);
+						if (l3->free_objects > l3->free_limit){
+							l3->free_objects -=
+									cachep->num;
+							slab_destroy(cachep, slabp);
+						} else {
+							list_add(&slabp->list,
+								 &l3->slabs_free);
+						}
+						goto done;
+					}
+					break;
+
+				case RELOCATE_FAILURE:
+					/*printk (KERN_DEBUG
+						"defrag_cache_node: "
+						"failed to relocate "
+						"object %d.\n", i); */
+					break;
+
+				default:
+					printk (KERN_DEBUG
+						"defrag_cache_node: "
+						"unknown result %d\n",
+						rel);
+					break;
+			}
+		} else {
+			/*printk (KERN_DEBUG "defrag_cache_node: "
+			  "object %d not active.\n", i); */
+
+		}
+
+	}
+
+ done:
+	if (targetp) {
+		spin_unlock(&l3->list_lock);
+		kmem_cache_free(cachep, targetp);
+		spin_lock(&l3->list_lock);
+	}
+}
+
 /**
  * cache_reap - Reclaim memory from caches.
  * @w: work descriptor
@@ -4037,6 +4271,7 @@
 		/* Give up. Setup the next iteration. */
 		goto out;
 
+
 	list_for_each_entry(searchp, &cache_chain, next) {
 		check_irq_on();
 
@@ -4047,6 +4282,28 @@
 		 */
 		l3 = searchp->nodelists[node];
 
+		/* See if it's worth trying to free up a slab by moving all of
+		 * its entries to other slabs. There is a pretty good
+		 * chance that if the oldest partial slab has less than
+		 * 1/4 of the total free objects then we can reallocate them.
+		 * However, don't try if the slab is more than 50% full.
+		 */
+		if (unlikely(searchp->relocator)) {
+			spin_lock_irq(&l3->list_lock);
+			if (!list_empty(&l3->slabs_partial)) {
+				struct slab *slabp =
+				    list_entry(l3->slabs_partial.next,
+					       struct slab, list);
+				if (slabp->inuse <
+				    l3->free_objects / 4 &&
+				    slabp->inuse <
+				    searchp->num / 2) {
+					defrag_cache_node(searchp, node);
+				}
+			}
+			spin_unlock_irq(&l3->list_lock);
+		}
+
 		reap_alien(searchp, l3);
 
 		drain_array(searchp, l3, cpu_cache_get(searchp), 0, node);
@@ -4082,6 +4339,25 @@
 	schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_CPUC));
 }
 
+
+void test_defrag(struct kmem_cache *searchp) {
+	struct kmem_list3 *l3;
+	int node = numa_node_id();
+	BUG_ON(!searchp->relocator);
+
+	l3 = searchp->nodelists[node];
+
+	spin_lock_irq(&l3->list_lock);
+	if (unlikely(list_empty(&l3->slabs_partial))) {
+		printk (KERN_DEBUG "test_defrag: no partial slabs.\n");
+	} else {
+		defrag_cache_node(searchp, node);
+	}
+	spin_unlock_irq(&l3->list_lock);
+}
+
+EXPORT_SYMBOL_GPL(test_defrag);
+
 #ifdef CONFIG_PROC_FS
 
 static void print_slabinfo_header(struct seq_file *m)
@@ -4128,19 +4404,26 @@
 	mutex_unlock(&cache_chain_mutex);
 }
 
-static int s_show(struct seq_file *m, void *p)
-{
-	struct kmem_cache *cachep = list_entry(p, struct kmem_cache, next);
+void kmem_compute_stats(struct kmem_cache *cachep,
+		       unsigned long *full_slabs,
+		       unsigned long *partial_slabs,
+		       unsigned long *partial_objs,
+		       unsigned long *free_slabs,
+		       char **error) {
 	struct slab *slabp;
 	unsigned long active_objs;
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
 	const char *name;
-	char *error = NULL;
 	int node;
 	struct kmem_list3 *l3;
 
+	*full_slabs = 0;
+	*partial_slabs = 0;
+	*partial_objs = 0;
+	*free_slabs = 0;
+
 	active_objs = 0;
 	num_slabs = 0;
 	for_each_online_node(node) {
@@ -4153,30 +4436,55 @@
 
 		list_for_each_entry(slabp, &l3->slabs_full, list) {
 			if (slabp->inuse != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
+				*error = "slabs_full accounting error";
+			(*full_slabs)++;
 		}
 		list_for_each_entry(slabp, &l3->slabs_partial, list) {
 			if (slabp->inuse == cachep->num && !error)
-				error = "slabs_partial inuse accounting error";
+				*error =
+				    "slabs_partial inuse accounting error";
 			if (!slabp->inuse && !error)
-				error = "slabs_partial/inuse accounting error";
-			active_objs += slabp->inuse;
-			active_slabs++;
+				*error =
+				    "slabs_partial/inuse accounting error";
+			*partial_objs += slabp->inuse;
+			(*partial_slabs)++;
 		}
 		list_for_each_entry(slabp, &l3->slabs_free, list) {
 			if (slabp->inuse && !error)
-				error = "slabs_free/inuse accounting error";
-			num_slabs++;
+				*error = "slabs_free/inuse accounting error";
+			(*free_slabs)++;
 		}
-		free_objects += l3->free_objects;
-		if (l3->shared)
-			shared_avail += l3->shared->avail;
 
 		spin_unlock_irq(&l3->list_lock);
 	}
-	num_slabs += active_slabs;
+}
+
+/* Useful for tests. */
+EXPORT_SYMBOL_GPL(kmem_compute_stats);
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct kmem_cache *cachep = p;
+	unsigned long active_objs;
+	unsigned long num_objs;
+	unsigned long active_slabs;
+	unsigned long num_slabs, free_objects, shared_avail;
+	const char *name;
+	char *error = NULL;
+	unsigned long full_slabs;
+	unsigned long partial_slabs;
+	unsigned long partial_objs;
+	unsigned long free_slabs;
+
+	kmem_compute_stats(cachep,  &full_slabs, &partial_slabs, &partial_objs,
+			   &free_slabs, &error);
+
+	active_objs = full_slabs * cachep->num + partial_objs;
+	num_slabs = full_slabs + partial_slabs + free_slabs;
+	active_slabs = full_slabs + partial_slabs;
+	shared_avail = partial_slabs * cachep->num - partial_objs;
+	free_objects = free_slabs * cachep->num + shared_avail;
+
 	num_objs = num_slabs * cachep->num;
 	if (num_objs - active_objs != free_objects && !error)
 		error = "free_objects accounting error";
