RFC/POC Make Page Tables Relocatable

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* RFC/POC Make Page Tables Relocatable
@ 2007-10-25 15:16 Ross Biro
  2007-10-25 16:46 ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Biro @ 2007-10-25 15:16 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 3640 bytes --]

[ The attached patch is Proof of Concept (POC) code only. It only
works on x86_64, it only supports the slab allocator, it only
relocates the lowest level of page tables, it's less efficient that it
should be, and I'm convinced the locking is deficient.  It does work
well enough to play around with though. The patch is a unified diff
against a clean 2.6.23.]

I'd like to propose 4 somewhat interdependent code changes.

1) Add a separate meta-data allocation to the slab and slub allocator
and allocate full pages through kmem_cache_alloc instead of get_page.
The primary motivation of this is that we could shrink struct page by
using kmem_cache_alloc to allocate whole pages and put the supported
data in the meta_data area instead of struct page. The downside is
that we might end up using more memory because of alignment issues.  I
believe we can keep the code as efficient as the current code  by
allocating many pages at once with known alignment and locating the
meta data in the first few pages.  Then locating the meta data for a
page by page_address & mask + (page_address >> foo) & mask *
meta_data_size + offset. Which should be just as fast as the current
calculation.  This is different than the proof of concept
implementation.  I also believe this would reduce kernel memory
fragmentation.

2) Add support for relocating memory allocated via kmem_cache_alloc.
When a cache is created, optional relocation information can be
provided.  If a relocation function is provided, caches can be
defragmented and overall memory consumption can be reduced.

3) Create a handle struct for holding references to memory that might
be moved out from under you.  This is one of those things that looks
really good on paper, but in practice isn't very useful.  While I'm
sure there are a few case in /syfs and /proc where handles could be
put to good use, in general the overhead involved does not justify
their use.  I worry that they could become a fad and that  people will
start using them when they should not be used.  The reason for
including them is that they are really good for setting up synthetic
tests for relocating memory.

and finally the real reason for doing all of the above.

4) Modify pte_alloc/free and friends to use kmem_cache_alloc and make
page tables relocatable. I believe this would go a long way towards
keeping kernel memory from fragmenting.  The biggest down side is the
number of tlb flushes involved.  The POC code uses RCU to free the old
copies of the page tables, which should reduce the flushes.  However,
it blindly flushes the tlbs on all of the cpus, when it really only
needs to flush the tlb on any cpu using the mm in question.  I believe
that by only flushing the tlbs on cpus actually using the mm in
question, we can reduce the flushes to an acceptable level.  One
alternative is to create an RCU class for tlb flushes, so that the old
table only gets freed after all the cpus have flushed their tlbs.

I believe that the above opens the doors to shrinking struct page and
greatly reducing kernel memory fragmentation with the only real
downside being an increase in code complexity and a possible increase
in memory usage if we are not careful.  I'm willing to code all of
this, but I'd like to get others opinions on what's appropriate and
what's already being done.

With the exception of tlb flushes and meta data location, I believe
the POC code demonstrates how I intend to solve most of the problems
that will be encountered.  One thing I am worried about is the
performance impact of the changes and I would like pointers to any
micro benchmarks that might be relevant.

    Ross

[-- Attachment #2: pte-relocate-poc.patch --]
[-- Type: application/octet-stream, Size: 52520 bytes --]

diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/Documentation/vm/locking lsrc/prodkernel/2.6.23/Documentation/vm/locking
--- linux-2.6.23/Documentation/vm/locking	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/Documentation/vm/locking	2007-10-24 07:08:52.000000000 -0700
@@ -83,6 +83,10 @@
 vmtruncate) does not lose sending ipi's to cloned threads that might 
 be spawned underneath it and go to user mode to drag in pte's into tlbs.
 
+With the new page table relocation code, whenever the page_table_lock
+is grabbed, the page tables must be rewalked to make sure that the
+table you are looking at has not been moved out from under you.
+
 swap_lock
 --------------
 The swap devices are chained in priority order from the "swap_list" header. 
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/arch/i386/mm/hugetlbpage.c lsrc/prodkernel/2.6.23/arch/i386/mm/hugetlbpage.c
--- linux-2.6.23/arch/i386/mm/hugetlbpage.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/arch/i386/mm/hugetlbpage.c	2007-10-24 07:08:52.000000000 -0700
@@ -87,6 +87,8 @@
 		goto out;
 
 	spin_lock(&mm->page_table_lock);
+	pud = walk_page_table_pud(mm, addr);
+	BUG_ON(!pud);
 	if (pud_none(*pud))
 		pud_populate(mm, pud, (unsigned long) spte & PAGE_MASK);
 	else
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/arch/x86_64/kernel/smp.c lsrc/prodkernel/2.6.23/arch/x86_64/kernel/smp.c
--- linux-2.6.23/arch/x86_64/kernel/smp.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/arch/x86_64/kernel/smp.c	2007-10-24 07:08:52.000000000 -0700
@@ -233,6 +233,8 @@
 	cpu_mask = mm->cpu_vm_mask;
 	cpu_clear(smp_processor_id(), cpu_mask);
 
+	mm->need_flush = 0;
+
 	if (current->active_mm == mm) {
 		if (current->mm)
 			local_flush_tlb();
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/asm-x86_64/pgalloc.h lsrc/prodkernel/2.6.23/include/asm-x86_64/pgalloc.h
--- linux-2.6.23/include/asm-x86_64/pgalloc.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/asm-x86_64/pgalloc.h	2007-10-24 07:08:52.000000000 -0700
@@ -5,6 +5,16 @@
 #include <linux/threads.h>
 #include <linux/mm.h>
 
+struct page_table_metadata {
+	struct rcu_head head;
+	void *obj;
+	struct kmem_cache *cachep;
+	struct mm_struct *mm;
+	unsigned long addr;
+	unsigned long csum;
+	spinlock_t md_lock;
+};
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pud_populate(mm, pud, pmd) \
@@ -84,6 +94,8 @@
 	free_page((unsigned long)pgd);
 }
 
+extern struct kmem_cache *pte_cache;
+
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
 	return (pte_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
@@ -91,9 +103,28 @@
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
+#if 0
 	void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
 	if (!p)
 		return NULL;
+#else
+	void *p;
+	struct page_table_metadata *md;
+
+	p = kmem_cache_alloc(pte_cache, GFP_KERNEL|__GFP_REPEAT);
+	if (!p)
+		return NULL;
+	clear_page(p);
+	md = (struct page_table_metadata *)kmem_cache_get_metadata(pte_cache, p);
+	md->addr = address;
+	md->mm = mm;
+	md->csum = (unsigned long)mm ^ address;
+	spin_lock_init(&md->md_lock);
+
+	atomic_inc(&mm->mm_count);
+	
+#endif
+
 	return virt_to_page(p);
 }
 
@@ -103,15 +134,40 @@
 static inline void pte_free_kernel(pte_t *pte)
 {
 	BUG_ON((unsigned long)pte & (PAGE_SIZE-1));
-	free_page((unsigned long)pte); 
+	free_page((unsigned long)pte);
 }
 
 static inline void pte_free(struct page *pte)
 {
+#if 0
 	__free_page(pte);
-} 
+#else
+	struct page_table_metadata *md;
+	struct mm_struct *mm;
+	unsigned long flags;
+
+	md = (struct page_table_metadata *)kmem_cache_get_metadata(pte_cache, page_address(pte));
+
+	spin_lock_irqsave(&md->md_lock, flags);
+
+	BUG_ON(	md->csum != ((unsigned long)(md->mm) ^ (md->addr)));
+
+	mm = md->mm;
+	md->mm = NULL;
+	md->addr = 0;
+	md->csum = 0;
+
+	spin_unlock_irqrestore(&md->md_lock, flags);
+
+	if (mm)
+	   mmdrop(mm); 
+
+	kmem_cache_free(pte_cache, page_address(pte));
+
+#endif
+}
 
-#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pte_free_tlb(tlb,pte) pte_free(pte)
 
 #define __pmd_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 #define __pud_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/asm-x86_64/pgtable.h lsrc/prodkernel/2.6.23/include/asm-x86_64/pgtable.h
--- linux-2.6.23/include/asm-x86_64/pgtable.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/asm-x86_64/pgtable.h	2007-10-24 07:08:52.000000000 -0700
@@ -122,6 +122,7 @@
 
 #define pte_pgprot(a)	(__pgprot((a).pte & ~PHYSICAL_PAGE_MASK))
 
+
 #endif /* !__ASSEMBLY__ */
 
 #define PMD_SIZE	(_AC(1,UL) << PMD_SHIFT)
@@ -421,6 +422,51 @@
 #define	kc_offset_to_vaddr(o) \
    (((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))
 
+#include <linux/sched.h>
+static inline pgd_t *walk_page_table_pgd(struct mm_struct *mm,
+					  unsigned long addr) {
+	return pgd_offset(mm, addr);
+}
+
+static inline pud_t *walk_page_table_pud(struct mm_struct *mm,
+					 unsigned long addr) {
+	pgd_t *pgd;
+	pgd = walk_page_table_pgd(mm, addr);
+	BUG_ON(!pgd);
+	return pud_offset(pgd, addr);
+}
+
+static inline pmd_t *walk_page_table_pmd(struct mm_struct *mm,
+					 unsigned long addr) {
+	pud_t *pud;
+	pud = walk_page_table_pud(mm, addr);
+	//BUG_ON(!pud);
+	if (!pud) {
+		printk (KERN_DEBUG "walk_page_table_pmd: pud is NULL\n");
+		return NULL;
+	}
+
+	return  pmd_offset(pud, addr);
+}
+
+static inline pte_t *walk_page_table_pte(struct mm_struct *mm,
+					 unsigned long addr) {
+	pmd_t *pmd;
+	pmd = walk_page_table_pmd(mm, addr);
+	BUG_ON(!pmd);
+	return pte_offset_map(pmd, addr);
+}
+
+static inline pmd_t *walk_page_table_kernel_pmd(unsigned long addr) {
+	return walk_page_table_pmd(&init_mm, addr);
+}
+
+static inline pte_t *walk_page_table_huge_pte(struct mm_struct *mm,
+					      unsigned long addr) {
+	return (pte_t *)walk_page_table_pmd(mm, addr);
+}
+
+
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/asm-x86_64/tlbflush.h lsrc/prodkernel/2.6.23/include/asm-x86_64/tlbflush.h
--- linux-2.6.23/include/asm-x86_64/tlbflush.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/asm-x86_64/tlbflush.h	2007-10-24 07:10:00.000000000 -0700
@@ -46,8 +46,9 @@
 
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
-	if (mm == current->active_mm)
+	if (mm == current->active_mm) {
 		__flush_tlb();
+	}
 }
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
@@ -60,8 +61,10 @@
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 	unsigned long start, unsigned long end)
 {
-	if (vma->vm_mm == current->active_mm)
+	if (vma->vm_mm == current->active_mm) {
+		vma->vm_mm->need_flush = 0;
 		__flush_tlb();
+	}
 }
 
 #else
@@ -106,4 +109,11 @@
 	   by the normal TLB flushing algorithms. */
 }
 
+static inline void maybe_flush_tlb_mm(struct mm_struct *mm) {
+	if (mm->need_flush) {
+		mm->need_flush = 0;
+		flush_tlb_all();
+	}
+}
+
 #endif /* _X8664_TLBFLUSH_H */
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/handle.h lsrc/prodkernel/2.6.23/include/linux/handle.h
--- linux-2.6.23/include/linux/handle.h	1969-12-31 16:00:00.000000000 -0800
+++ lsrc/prodkernel/2.6.23/include/linux/handle.h	2007-10-24 08:04:46.000000000 -0700
@@ -0,0 +1,127 @@
+/* linux/handle.h
+ * Written by Ross Biro, 2007 (rossb@google.com)
+ *
+ * Copyright (C) 2007 Google Inc.
+ * See Copying File.
+ *
+ */
+
+#ifndef _LINUX_HANDLE_H
+#define _LINUX_HANDLE_H
+
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <asm/atomic.h>
+
+struct khandle_target {
+	struct khandle *handle;
+	atomic_t deref_count;
+	atomic_t handle_ref_count;
+	atomic_t generation_count;
+};
+
+struct khandle {
+	struct khandle_target *target;
+	spinlock_t relocation_lock;
+};
+
+int relocate_handle(void *source_obj, void *target_obj,
+		    struct kmem_cache *cachep,
+		    unsigned long handle_target_offset,
+		    unsigned long object_size);
+
+extern struct kmem_cache *handle_cache;
+
+static inline struct khandle *alloc_handle(struct kmem_cache *cachep,
+					   unsigned long flags) {
+	void *obj = kmem_cache_alloc(cachep, flags);
+	struct khandle *handle;
+	if (obj == NULL) {
+		return NULL;
+	}
+
+	handle = kmem_cache_alloc(handle_cache, flags);
+	if (handle == NULL) {
+		kmem_cache_free(cachep, obj);
+		return NULL;
+	}
+
+	spin_lock_init(&handle->relocation_lock);
+	handle->target = obj + kmem_cachep_relocator_private(cachep);
+
+	/* The constructor must make sure these are set up
+	 * properly.
+	 */
+	atomic_inc(&handle->target->generation_count);
+	atomic_dec(&handle->target->deref_count);
+	atomic_inc(&handle->target->handle_ref_count);
+
+	handle->target->handle = handle;
+
+	printk ("alloc_handle target->deref_count=%d\n",
+		atomic_read(&handle->target->deref_count));
+
+	return handle;
+}
+
+/* Any constructor for a cache using handles *must* have a constructor and
+ * must call this construtor.  This means that SLAB_POISON will not
+ * work with any handles.
+ */
+void generic_handle_ctor(void *, struct kmem_cache *, unsigned long);
+
+#define handle_cache_create(name, flags, type, member, size, align, ctor)\
+    kmem_cache_create_relocatable(name, size, align, flags,		\
+	ctor?:generic_handle_ctor, relocate_handle,	 		\
+	offsetof(type, member), 0)
+
+/**
+ * deref_handle get the pointer for this handle.
+ * @handle:	a ptr to the struct khandle.
+ * @type:	the type of the struct this points to.
+ * @member:	the name of the khandle_target within the struct.
+ */
+#define deref_handle(handle, type, member) \
+    (type *)_deref_handle(handle, offsetof(type, member))
+
+static inline void *_deref_handle(struct khandle *handle,
+				  unsigned long offset) {
+        unsigned long flags;
+	void *obj;
+	spin_lock_irqsave(&handle->relocation_lock, flags);
+	obj = handle->target - offset;
+	atomic_inc(&handle->target->deref_count);
+	spin_unlock_irqrestore(&handle->relocation_lock, flags);
+	return obj;
+}
+
+#define put_handle_ref(handle) do {					\
+        atomic_dec(&handle->target->deref_count);			\
+} while (0)
+
+#define get_handle(handle) do {						\
+	atomic_inc(&handle->target->handle_ref_count);			\
+} while (0)
+
+#define put_handle(h, type, member, cachep) do {			\
+	if (atomic_dec_and_test(&h->target->handle_ref_count)) {	\
+		unsigned long flags;					\
+		type *obj;						\
+		spin_lock_irqsave(&h->relocation_lock, flags);	\
+		obj = container_of(h->target, type, member);	\
+                h->target->handle = NULL;				\
+		wmb();							\
+		atomic_inc(&h->target->deref_count);		\
+		spin_unlock_irqrestore(&h->relocation_lock, flags);\
+		kmem_cache_free(cachep, obj);				\
+		kmem_cache_free(handle_cache, h);			\
+		h = NULL;						\
+	} 								\
+} while (0)
+
+
+
+
+
+
+#endif /* _LINUX_HANDLE_H */
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/mm.h lsrc/prodkernel/2.6.23/include/linux/mm.h
--- linux-2.6.23/include/linux/mm.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/mm.h	2007-10-24 07:08:52.000000000 -0700
@@ -935,6 +935,7 @@
 	pte_t *__pte = pte_offset_map(pmd, address);	\
 	*(ptlp) = __ptl;				\
 	spin_lock(__ptl);				\
+	__pte = walk_page_table_pte(mm, address);	\
 	__pte;						\
 })
 
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/mmzone.h lsrc/prodkernel/2.6.23/include/linux/mmzone.h
--- linux-2.6.23/include/linux/mmzone.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/mmzone.h	2007-10-24 07:08:52.000000000 -0700
@@ -18,7 +18,7 @@
 
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_FORCE_MAX_ZONEORDER
-#define MAX_ORDER 11
+#define MAX_ORDER 14
 #else
 #define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
 #endif
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/sched.h lsrc/prodkernel/2.6.23/include/linux/sched.h
--- linux-2.6.23/include/linux/sched.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/sched.h	2007-10-24 07:08:52.000000000 -0700
@@ -432,6 +432,7 @@
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+	int			need_flush;
 };
 
 struct sighand_struct {
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/include/linux/slab.h lsrc/prodkernel/2.6.23/include/linux/slab.h
--- linux-2.6.23/include/linux/slab.h	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/include/linux/slab.h	2007-10-24 08:33:59.000000000 -0700
@@ -29,6 +29,7 @@
 #define SLAB_DESTROY_BY_RCU	0x00080000UL	/* Defer freeing slabs to RCU */
 #define SLAB_MEM_SPREAD		0x00100000UL	/* Spread some memory over cpuset */
 #define SLAB_TRACE		0x00200000UL	/* Trace allocations and frees */
+#define SLAB_HUGE_PAGE		0x00400000UL    /* Always use at least huge page size pages for this slab. */
 
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
@@ -49,15 +50,42 @@
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
-struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
-			unsigned long,
-			void (*)(void *, struct kmem_cache *, unsigned long));
+struct kmem_cache *kmem_cache_create_relocatable(const char *, size_t, size_t,
+  			unsigned long,
+  			void (*)(void *, struct kmem_cache *, unsigned long),
+			int (*)(void *, void *, struct kmem_cache *,
+				unsigned long, unsigned long),
+			unsigned long, size_t);
+
+unsigned long kmem_cachep_relocator_private(struct kmem_cache *);
+
+static inline
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+			size_t align,
+			unsigned long flags,
+				     void (*ctor)(void *, struct kmem_cache *, unsigned long)) {
+	return kmem_cache_create_relocatable(name, size, align, flags, ctor, NULL, 0, 0);
+}
+
+void test_defrag(struct kmem_cache *);
+
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+void kmem_compute_stats(struct kmem_cache *cachep,
+			unsigned long *full_slabs,
+			unsigned long *partial_slabs,
+			unsigned long *partial_objs,
+			unsigned long *free_slabs,
+			char **error);
+void *kmem_cache_get_metadata(const struct kmem_cache *, void *);
+
+#define RELOCATE_SUCCESS_RCU 1
+#define RELOCATE_SUCCESS 0
+#define RELOCATE_FAILURE -1
 
 /*
  * Please use this macro to create slab caches. Simply specify the
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/Makefile lsrc/prodkernel/2.6.23/mm/Makefile
--- linux-2.6.23/mm/Makefile	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/Makefile	2007-10-24 07:08:52.000000000 -0700
@@ -9,7 +9,7 @@
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
-			   readahead.o swap.o truncate.o vmscan.o \
+			   readahead.o swap.o truncate.o vmscan.o handle.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   $(mmu-y)
 
@@ -29,4 +29,4 @@
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-
+obj-m += handle_test.o
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/handle.c lsrc/prodkernel/2.6.23/mm/handle.c
--- linux-2.6.23/mm/handle.c	1969-12-31 16:00:00.000000000 -0800
+++ lsrc/prodkernel/2.6.23/mm/handle.c	2007-10-24 07:34:43.000000000 -0700
@@ -0,0 +1,129 @@
+/* mm/handle.c
+ * Written by Ross Biro, 2007 (rossb@google.com)
+ *
+ * Copyright (C) 2007 Google Inc.
+ * See Copying File.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/handle.h>
+#include <linux/module.h>
+
+struct kmem_cache *handle_cache;
+
+EXPORT_SYMBOL_GPL(handle_cache);
+
+/*
+ * This function handles the relocation.  Guarentee we have is that
+ * the source and target objects will not vanish underneath us.
+ * However, they might get recycled.  So we have to be careful to get
+ * the handle pointer.  The caller has appropriate locks to make sure
+ * that two different threads don't try to relocate the same object at
+ * the same time.
+ */
+int relocate_handle(void *source_obj, void *target_obj,
+		    struct kmem_cache *cachep,
+		    unsigned long handle_target_offset,
+		    unsigned long object_size) {
+	struct khandle_target *handle_target = source_obj +
+			handle_target_offset;
+
+	struct khandle *handle;
+	unsigned long flags;
+	int generation = atomic_read(&handle_target->generation_count);;
+
+
+	if (atomic_read(&handle_target->deref_count)) {
+		printk (KERN_DEBUG "relocate_handle: handle in use (%d).\n",
+			atomic_read(&handle_target->deref_count));
+		printk (KERN_DEBUG "handle_target_offset = %d\n",
+			handle_target_offset);
+		return 1;
+	}
+
+	atomic_inc(&handle_target->handle_ref_count);
+	handle = handle_target->handle;
+
+	/* we need to make sure that the atomic_inc completed,
+	   and the atomic read is not using a cached (even by the
+	   compiler) value. */
+	mb();
+
+	/* Make sure the handle didn't vanish underneath us while
+	   we were grabbing it. */
+	if (handle == NULL || atomic_read(&handle_target->deref_count)) {
+		atomic_dec(&handle_target->handle_ref_count);
+		printk (KERN_DEBUG "relocate_handle: handle in use after grabbing.\n");
+		return 1;
+	}
+
+
+	/*
+	 * At this point, we know that the handle is valid and the
+	 * object cannot be recycled while we are looking at it.
+	 * We know this because the recycling code increments the ref
+	 * count, and we have a ref count of 0.  Plus we incremented
+	 * the ref count of the handle, so it cannot drop to 0 either.
+	 */
+
+	spin_lock_irqsave(&handle->relocation_lock, flags);
+
+	/* Now check the deref count one last time.  If it's still 0,
+	   then we have exclusive access to the object.
+	*/
+
+	if (atomic_read(&handle_target->deref_count)) {
+		spin_unlock_irqrestore(&handle->relocation_lock, flags);
+		atomic_dec(&handle_target->handle_ref_count);
+		printk (KERN_DEBUG "relocate_handle: handle in use after lock.\n");
+		return 1;
+	}
+
+	/* Make sure we have the correct handle. */
+	if (generation != atomic_read(&handle_target->generation_count)) {
+		spin_unlock_irqrestore(&handle->relocation_lock, flags);
+		atomic_dec(&handle->target->handle_ref_count);
+		printk (KERN_DEBUG
+			"relocate_handle: handle generation changed.\n");
+		return 1;
+	}
+
+	/* Now we've got the object.  Do a shallow copy. */
+	memcpy (target_obj, source_obj, object_size);
+
+	/* We adjust the handle */
+	handle->target = target_obj + handle_target_offset;
+
+	/* Release the locks.  The object has been moved. */
+	spin_unlock_irqrestore(&handle->relocation_lock, flags);
+	atomic_dec(&handle->target->handle_ref_count);
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(relocate_handle);
+
+void generic_handle_ctor(void *obj, struct kmem_cache *cachep,
+			 unsigned long unused) {
+	struct khandle_target *target = obj +
+			kmem_cachep_relocator_private(cachep);
+	atomic_set(&target->generation_count, 0);
+	/* We have a pointer right now, so the handle has been
+	 * dereferenced even though it doesn't really exist yet.
+	 */
+	atomic_set(&target->deref_count, 1);
+	atomic_set(&target->handle_ref_count, 0);
+
+}
+
+EXPORT_SYMBOL_GPL(generic_handle_ctor);
+
+static int __init handle_init(void) {
+	handle_cache = kmem_cache_create("handle_cache",
+					 sizeof(struct khandle),
+					 0, 0, NULL);
+	return 0;
+}
+
+module_init(handle_init);
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/handle_test.c lsrc/prodkernel/2.6.23/mm/handle_test.c
--- linux-2.6.23/mm/handle_test.c	1969-12-31 16:00:00.000000000 -0800
+++ lsrc/prodkernel/2.6.23/mm/handle_test.c	2007-10-24 07:08:52.000000000 -0700
@@ -0,0 +1,140 @@
+/* mm/handle_test.c
+ * Written by Ross Biro, 2007 (rossb@google.com)
+ *
+ * Copyright (C) 2007 Google Inc.
+ * See Copying File.
+ *
+ * This file is for a module that exercises the handle systems
+ * and runs a bunch of unit tests.
+ *
+ * Loading the module should execute all the tests.  If all goes
+ * resonably well, the module should just clean up after itself and
+ * be ready to unload.  If not, anything could go wrong, after all it's
+ * a kernel module.
+ */
+
+#include <linux/kernel.h>
+#include <linux/handle.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("A test module for the handle code.");
+
+struct handle_test1 {
+	char filler0[11];
+	struct khandle_target target;
+	char filler1[11];
+};
+
+#define TEST1_SIZE PAGE_SIZE * 4
+
+struct khandle *test1_handles[TEST1_SIZE];
+struct handle_test1 *test1_ptrs[ARRAY_SIZE(test1_handles)];
+
+static int __init handle_test(void) {
+	int i;
+	struct kmem_cache *test1_cache = NULL;
+	char *error = NULL;
+	unsigned long full_slabs_before;
+	unsigned long partial_slabs_before;
+	unsigned long partial_objs_before;
+	unsigned long free_slabs_before;
+	unsigned long full_slabs_after;
+	unsigned long partial_slabs_after;
+	unsigned long partial_objs_after;
+	unsigned long free_slabs_after;
+
+	test1_cache = handle_cache_create("handle_test1", 0,
+					  struct handle_test1, target,
+					  sizeof(struct handle_test1),
+					  0, NULL);
+
+	if (test1_cache == NULL) {
+		printk (KERN_DEBUG "handle_test: Unable to allocate cache_test1");
+		goto test_failed;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(test1_handles); i++) {
+		test1_handles[i] = alloc_handle(test1_cache, GFP_KERNEL);
+	}
+
+	kmem_compute_stats(test1_cache, &full_slabs_before,
+			   &partial_slabs_before, &partial_objs_before,
+			   &free_slabs_before, &error);
+
+
+	printk (KERN_DEBUG "before: free %d partial %d full %d\n",
+		free_slabs_before, partial_slabs_before,
+		full_slabs_before);
+
+	/* Now fragment the crap out of the thing. */
+	for (i = ARRAY_SIZE(test1_handles) - 1 ; i >= 0; i--) {
+		if (i & 7) {
+			put_handle(test1_handles[i],
+				   struct handle_test1,
+				   target, test1_cache);
+			test1_ptrs[i] = NULL;
+			test1_handles[i] = NULL;
+		}
+	}
+
+	kmem_compute_stats(test1_cache, &full_slabs_before,
+			   &partial_slabs_before, &partial_objs_before,
+			   &free_slabs_before, &error);
+
+	/* Force some defrag. */
+	for (i = 0; i < partial_slabs_before; i++) {
+		test_defrag(test1_cache);
+	}
+
+	if (signal_pending(current)) {
+		printk (KERN_DEBUG "handle_test: Abbandonning test due to signal.\n");
+		goto test_failed;
+	}
+
+	kmem_compute_stats(test1_cache, &full_slabs_after,
+			   &partial_slabs_after, &partial_objs_after,
+			   &free_slabs_after, &error);
+
+	/* We should either have more free slabs, or fewer total slabs. */
+	if (free_slabs_after <= free_slabs_before &&
+	    free_slabs_after + partial_slabs_after + full_slabs_after >=
+	    free_slabs_before + partial_slabs_before + full_slabs_before) {
+		printk (KERN_DEBUG "handle_test: test 1 failed. "
+			"Memory was not freed\n");
+		printk (KERN_DEBUG "before: free %d partial %d full %d\n",
+			free_slabs_before, partial_slabs_before,
+			full_slabs_before);
+		printk (KERN_DEBUG "after: free %d partial %d full %d\n",
+			free_slabs_after, partial_slabs_after,
+			full_slabs_after);
+		goto test_failed;
+	}
+
+
+
+ test_failed:
+	for (i = 0; i < ARRAY_SIZE(test1_handles); i++) {
+		if (test1_ptrs[i])
+			put_handle_ref(test1_handles[i]);
+		if (test1_handles[i])
+			put_handle(test1_handles[i], struct handle_test1,
+				   target, test1_cache);
+	}
+
+	kmem_cache_destroy(test1_cache);
+
+	return 0;
+
+}
+
+static void __exit
+handle_test_exit(void)
+{
+	return;
+}
+
+
+module_init(handle_test);
+module_exit(handle_test_exit);
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/hugetlb.c lsrc/prodkernel/2.6.23/mm/hugetlb.c
--- linux-2.6.23/mm/hugetlb.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/hugetlb.c	2007-10-24 07:15:28.000000000 -0700
@@ -378,7 +378,12 @@
 		if (!dst_pte)
 			goto nomem;
 		spin_lock(&dst->page_table_lock);
+		dst_pte = walk_page_table_huge_pte(dst, addr);
+		BUG_ON(!dst_pte);
 		spin_lock(&src->page_table_lock);
+		src_pte = walk_page_table_huge_pte(src, addr);
+		BUG_ON(!src_pte);
+
 		if (!pte_none(*src_pte)) {
 			if (cow)
 				ptep_set_wrprotect(src, addr, src_pte);
@@ -561,6 +566,9 @@
 
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
+
+ 	ptep = walk_page_table_huge_pte(mm, address);
+
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
@@ -609,6 +617,9 @@
 	ret = 0;
 
 	spin_lock(&mm->page_table_lock);
+	ptep = walk_page_table_huge_pte(mm, address);
+	BUG_ON(!ptep);
+
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, *ptep)))
 		if (write_access && !pte_write(entry))
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/memory.c lsrc/prodkernel/2.6.23/mm/memory.c
--- linux-2.6.23/mm/memory.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/memory.c	2007-10-24 07:44:53.000000000 -0700
@@ -69,6 +69,8 @@
 EXPORT_SYMBOL(mem_map);
 #endif
 
+struct kmem_cache *pte_cache;
+
 unsigned long num_physpages;
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
@@ -306,6 +308,8 @@
 
 	pte_lock_init(new);
 	spin_lock(&mm->page_table_lock);
+	pmd = walk_page_table_pmd(mm, address);
+	BUG_ON(!pmd);
 	if (pmd_present(*pmd)) {	/* Another has populated it */
 		pte_lock_deinit(new);
 		pte_free(new);
@@ -325,6 +329,8 @@
 		return -ENOMEM;
 
 	spin_lock(&init_mm.page_table_lock);
+	pmd = walk_page_table_kernel_pmd(address);
+	BUG_ON(!pmd);
 	if (pmd_present(*pmd))		/* Another has populated it */
 		pte_free_kernel(new);
 	else
@@ -506,6 +512,11 @@
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	arch_enter_lazy_mmu_mode();
 
+	src_pte = walk_page_table_pte(src_mm, addr);
+	BUG_ON(!src_pte);
+	dst_pte = walk_page_table_pte(dst_mm, addr);
+	BUG_ON(!dst_pte);
+
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -2483,7 +2494,8 @@
  * a struct_page backing it
  *
  * As this is called only for pages that do not currently exist, we
- * do not need to flush old virtual caches or the TLB.
+ * do not need to flush old virtual caches or the TLB, unless someone
+ * else has left the page table cache in an unknown state.
  *
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2603,6 +2615,8 @@
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
+	pte = walk_page_table_pte(mm, address);
+
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
 	if (write_access) {
@@ -2625,6 +2639,7 @@
 		if (write_access)
 			flush_tlb_page(vma, address);
 	}
+	maybe_flush_tlb_mm(mm);
 unlock:
 	pte_unmap_unlock(pte, ptl);
 	return 0;
@@ -2674,6 +2689,8 @@
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
+	pgd = walk_page_table_pgd(mm, address);
+	BUG_ON(!pgd);
 	if (pgd_present(*pgd))		/* Another has populated it */
 		pud_free(new);
 	else
@@ -2695,6 +2712,8 @@
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
+	pud = walk_page_table_pud(mm, address);
+	BUG_ON(!pud);
 #ifndef __ARCH_HAS_4LEVEL_HACK
 	if (pud_present(*pud))		/* Another has populated it */
 		pmd_free(new);
@@ -2867,3 +2886,218 @@
 	return buf - old_buf;
 }
 EXPORT_SYMBOL_GPL(access_process_vm);
+
+/* We need to use RCU to clean up the page tables because many read
+   accesses do not grab the lock and they are in the page fault fast
+   path, so we don't want to touch them. We flush the page tables
+   right away, but we don't flush the pages until the rcu callback.
+   We can get away with this since the old page is still valid and
+   anybody that modifies the new one will have to flush the pages
+   anyway. We can't wait to flush the page tables themselves since
+   if we fault in a page, the fault code will only modify the new
+   page tables, but if the cpu is looking at the old ones, it will
+   continue to fault on the old page table while the fault handler will
+   see the new page tables and not know what is going on.  It appears that
+   there is only one architecture where flush_tlb_pgtables is not a no-op,
+   so it doesn't hurt much to do it here.  We might lose some accessed bit
+   updates, but we can live with that.
+ */
+
+int relocate_pgd(void *source_obj, void *target_obj,
+			 struct kmem_cache *cachep,
+			 unsigned long unused,
+ 			 unsigned long object_size) {
+	struct mm_struct *mm;
+	struct page_table_metadata *md;
+	unsigned long addr;
+
+	/* get the mm so we can lock it and the entry pointing to this
+	   page table. */
+	md = (struct page_table_metadata *)kmem_cache_get_metadata(cachep,
+								   source_obj);
+	if (!md)
+		return RELOCATE_FAILURE;
+
+	mm = md->mm;
+	addr = md->addr;
+
+	/* irqs are off when this function is called. */
+	spin_lock(&mm->page_table_lock);
+	memcpy(target_obj, source_obj, object_size);
+	pgd_populate(mm, pgd_offset(mm, addr), target_obj);
+	flush_tlb_pgtables(mm, md->addr, md->addr + (1UL << PGDIR_SHIFT) - 1);
+	mm->need_flush = 1;
+ 	spin_unlock(&mm->page_table_lock);
+	return RELOCATE_SUCCESS_RCU;
+}
+
+int relocate_pud(void *source_obj, void *target_obj,
+		 struct kmem_cache *cachep,
+		 unsigned long unused,
+		 unsigned long object_size) {
+	struct mm_struct *mm;
+	struct page_table_metadata *md;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	/* get the mm so we can lock it and the entry pointing to this
+	   page table. */
+	md = (struct page_table_metadata *)
+			kmem_cache_get_metadata(cachep, source_obj);
+
+	if (!md)
+		return RELOCATE_FAILURE;
+
+	mm = md->mm;
+	addr = md->addr;
+
+	/* irqs are off when this function is called. */
+	spin_lock(&mm->page_table_lock);
+
+	pgd = pgd_offset(mm, addr);
+
+	if (!pgd_none(*pgd) && pgd_present(*pgd)) {
+		pud = pud_offset(pgd, addr);
+		if (!pud_none(*pud) && pud_present(*pud)) {
+			memcpy(target_obj, source_obj, object_size);
+			pud_populate(mm, pud, target_obj);
+			flush_tlb_pgtables(mm, addr,
+					   addr + (1 << PUD_SHIFT) - 1);
+			spin_unlock(&mm->page_table_lock);
+			return RELOCATE_SUCCESS_RCU;
+		}
+	}
+
+	mm->need_flush = 1;
+	spin_unlock(&mm->page_table_lock);
+	return RELOCATE_FAILURE;
+}
+
+static void rcu_free_pmd(struct rcu_head *head) {
+	struct page_table_metadata *md = 
+			(struct page_table_metadata *)head;
+	BUG_ON(!md->mm);
+	BUG_ON(md->addr);
+	BUG_ON(!md->cachep);
+	BUG_ON(!md->obj);
+
+	/* maybe_flush_tlb_mm(md->mm);
+	   mmdrop(md->mm); */
+	kmem_cache_free(md->cachep, md->obj);
+}
+
+static int relocate_pmd(void *source_obj, void *target_obj,
+			struct kmem_cache *cachep,
+			unsigned long unused,
+			unsigned long object_size) {
+	struct mm_struct *mm;
+	struct page_table_metadata *md;
+	unsigned long addr;
+ 	pmd_t *pmd;
+	unsigned long flags;
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+	struct page *target_page = virt_to_page(target_obj);
+	struct page *source_page = virt_to_page(source_obj);
+#endif
+
+	/* get the mm so we can lock it and the entry pointing to this
+ 	   page table. */
+	md = (struct page_table_metadata *)
+			kmem_cache_get_metadata(cachep, source_obj);
+
+	/*
+	printk (KERN_DEBUG "relocate_pmd source=%p mm=%p addr=0x%lx object_size=%d\n",
+		source_obj, mm, addr, object_size);
+        printk (KERN_DEBUG "md=%p mm=%p addr=0x%lx csum=0x%lx\n",
+	md, mm, addr, md->csum);
+	*/
+	BUG_ON(md->csum != ((unsigned long)(md->mm) ^ (md->addr)));
+
+	if (!md->mm || !md->addr || md->mm == &init_mm) {
+		return RELOCATE_FAILURE;
+	}
+
+	if (md->addr >= PAGE_OFFSET) {
+		printk (KERN_INFO "attempted to relocate kernel page.\n");
+		return RELOCATE_FAILURE;
+	}
+ 
+	spin_lock_irqsave(&md->md_lock, flags);
+
+	mm = md->mm;
+	addr = md->addr;
+
+	/* Make sure the mm does not go away. */
+	if (mm && addr)
+		atomic_inc(&mm->mm_count);
+
+	spin_unlock_irqrestore(&md->md_lock, flags);
+
+	if (!mm || !addr)
+		return RELOCATE_FAILURE;
+
+	/* irqs are off when this function is called. */
+	spin_lock_irqsave(&mm->page_table_lock, flags);
+
+	pmd = walk_page_table_pmd(mm, addr);
+	if (pmd && !virt_addr_valid(pmd)) {
+		printk (KERN_WARNING "walk_page_table_pmd returned %p which is not valid.\n", pmd);
+	}
+
+	if (pmd && 
+	    pmd_page_vaddr(*pmd) == (unsigned long)source_obj) {
+		memcpy (kmem_cache_get_metadata(cachep,
+						target_obj),
+			md, sizeof(*md));
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+		spin_lock_init(&target_page->ptl);
+		spin_lock(&source_page->ptl);
+#endif
+
+		memcpy(target_obj, source_obj, object_size);
+		pmd_populate(NULL, pmd, virt_to_page(target_obj));
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+		spin_unlock(&source_page->ptl);
+#endif
+	flush_tlb_pgtables(mm, addr,
+			   addr + (1 << PMD_SHIFT)
+				   - 1);
+
+		mm->need_flush = 1;
+		md->addr = 0;
+		md->csum = (unsigned long)mm;
+		spin_unlock_irqrestore(&mm->page_table_lock, flags);
+
+
+		//printk (KERN_DEBUG "relocate_pmd: succesfully relocated pte (%p)\n", source_obj);
+		/* Don't drop the MM, we have an extra copy of it so 
+	   we know what mm to flush when we drop the page. */
+		md->obj = source_obj;
+		md->cachep = cachep;
+		call_rcu(&md->head, rcu_free_pmd);
+		maybe_flush_tlb_mm(mm);
+		mmdrop(mm);
+
+		return RELOCATE_SUCCESS_RCU;
+	}
+
+	spin_unlock_irqrestore(&mm->page_table_lock, flags);
+	mmdrop(mm);
+	return RELOCATE_FAILURE;
+}
+
+static int __init page_table_cache_init(void)
+{
+	pte_cache = kmem_cache_create_relocatable("pte", PAGE_SIZE,
+						  PAGE_SIZE, SLAB_HUGE_PAGE,
+						  NULL,
+						  relocate_pmd, 0,
+						  sizeof(struct page_table_metadata));
+	BUG_ON(!pte_cache);
+	return 0;
+}
+
+module_init(page_table_cache_init);
+
+
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/rmap.c lsrc/prodkernel/2.6.23/mm/rmap.c
--- linux-2.6.23/mm/rmap.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/rmap.c	2007-10-24 07:08:52.000000000 -0700
@@ -254,6 +254,8 @@
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
+
+	pte = walk_page_table_pte(mm, address);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
 		return pte;
diff --unidirectional-new-file -x'*~' -x'.*' -ur linux-2.6.23/mm/slab.c lsrc/prodkernel/2.6.23/mm/slab.c
--- linux-2.6.23/mm/slab.c	2007-10-09 13:31:38.000000000 -0700
+++ lsrc/prodkernel/2.6.23/mm/slab.c	2007-10-24 07:54:14.000000000 -0700
@@ -140,7 +140,7 @@
 #define	REDZONE_ALIGN		max(BYTES_PER_WORD, __alignof__(unsigned long long))
 
 #ifndef cache_line_size
-#define cache_line_size()	L1_CACHE_BYTES
+#Define cache_line_size()	L1_CACHE_BYTES
 #endif
 
 #ifndef ARCH_KMALLOC_MINALIGN
@@ -178,12 +178,14 @@
 			 SLAB_CACHE_DMA | \
 			 SLAB_STORE_USER | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
+			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
+			 SLAB_HUGE_PAGE)
 #else
 # define CREATE_MASK	(SLAB_HWCACHE_ALIGN | \
 			 SLAB_CACHE_DMA | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
+			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
+                         SLAB_HUGE_PAGE)
 #endif
 
 /*
@@ -225,6 +227,8 @@
 	unsigned int inuse;	/* num of objs active in slab */
 	kmem_bufctl_t free;
 	unsigned short nodeid;
+	unsigned long meta_data_start;
+	unsigned long meta_data_end;
 };
 
 /*
@@ -410,6 +414,11 @@
 	/* constructor func */
 	void (*ctor) (void *, struct kmem_cache *, unsigned long);
 
+	/* reloactor function */
+	int (*relocator) (void *, void *, struct kmem_cache *,
+			  unsigned long, unsigned long);
+	unsigned long relocator_private;
+
 /* 5) cache creation/removal */
 	const char *name;
 	struct list_head next;
@@ -431,6 +440,9 @@
 	atomic_t freehit;
 	atomic_t freemiss;
 #endif
+
+	size_t meta_data_size;
+
 #if DEBUG
 	/*
 	 * If debugging is enabled, then the allocator can add additional
@@ -673,6 +685,15 @@
 	.name = "kmem_cache",
 };
 
+/* allocators might need this.  In particular, the handle allocator
+ * uses this to locate the handle_target.
+ */
+unsigned long kmem_cachep_relocator_private(struct kmem_cache *cachep) {
+	return cachep->relocator_private;
+}
+
+EXPORT_SYMBOL_GPL(kmem_cachep_relocator_private);
+
 #define BAD_ALIEN_MAGIC 0x01020304ul
 
 #ifdef CONFIG_LOCKDEP
@@ -798,9 +819,14 @@
 	return __find_general_cachep(size, gfpflags);
 }
 
-static size_t slab_mgmt_size(size_t nr_objs, size_t align)
+static size_t slab_mgmt_size(size_t nr_objs, size_t align,
+			     size_t meta_data_size)
 {
-	return ALIGN(sizeof(struct slab)+nr_objs*sizeof(kmem_bufctl_t), align);
+	size_t res1, res2;
+	res1 = sizeof(struct slab)+nr_objs*sizeof(kmem_bufctl_t)+
+			nr_objs*meta_data_size;
+	res2 = ALIGN(res1, align);
+	return res2;
 }
 
 /*
@@ -808,7 +834,7 @@
  */
 static void cache_estimate(unsigned long gfporder, size_t buffer_size,
 			   size_t align, int flags, size_t *left_over,
-			   unsigned int *num)
+			   unsigned int *num, size_t meta_data_size)
 {
 	int nr_objs;
 	size_t mgmt_size;
@@ -845,21 +871,31 @@
 		 * into account.
 		 */
 		nr_objs = (slab_size - sizeof(struct slab)) /
-			  (buffer_size + sizeof(kmem_bufctl_t));
+			  (buffer_size + sizeof(kmem_bufctl_t) +
+			   meta_data_size);
 
 		/*
 		 * This calculated number will be either the right
 		 * amount, or one greater than what we want.
 		 */
-		if (slab_mgmt_size(nr_objs, align) + nr_objs*buffer_size
+		if (slab_mgmt_size(nr_objs, align, meta_data_size) +
+		    nr_objs*buffer_size
 		       > slab_size)
 			nr_objs--;
 
 		if (nr_objs > SLAB_LIMIT)
 			nr_objs = SLAB_LIMIT;
 
-		mgmt_size = slab_mgmt_size(nr_objs, align);
+		mgmt_size = slab_mgmt_size(nr_objs, align,
+					   meta_data_size);
 	}
+
+	if (meta_data_size != 0) {
+		printk (KERN_INFO "cache_estimate: mgmt_size = %d, "
+			"nr_objs=%d, meta_data_size=%d\n", mgmt_size,
+			nr_objs, meta_data_size);
+	}
+
 	*num = nr_objs;
 	*left_over = slab_size - nr_objs*buffer_size - mgmt_size;
 }
@@ -1463,15 +1499,17 @@
 
 	for (order = 0; order < MAX_ORDER; order++) {
 		cache_estimate(order, cache_cache.buffer_size,
-			cache_line_size(), 0, &left_over, &cache_cache.num);
+			cache_line_size(), 0, &left_over, &cache_cache.num,
+			       0);
 		if (cache_cache.num)
 			break;
 	}
 	BUG_ON(!cache_cache.num);
 	cache_cache.gfporder = order;
 	cache_cache.colour = left_over / cache_cache.colour_off;
-	cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
-				      sizeof(struct slab), cache_line_size());
+	cache_cache.slab_size = slab_mgmt_size(cache_cache.num,
+					       cache_line_size(),
+					       cache_cache.meta_data_size);
 
 	/* 2+3) create the kmalloc caches */
 	sizes = malloc_sizes;
@@ -1993,22 +2031,25 @@
 	size_t left_over = 0;
 	int gfporder;
 
-	for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
+	for (gfporder = (flags & SLAB_HUGE_PAGE)?HUGETLB_PAGE_ORDER:0;
+	     gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
 		unsigned int num;
 		size_t remainder;
 
-		cache_estimate(gfporder, size, align, flags, &remainder, &num);
+		cache_estimate(gfporder, size, align, flags, &remainder, &num,
+			       cachep->meta_data_size);
 		if (!num)
 			continue;
 
-		if (flags & CFLGS_OFF_SLAB) {
+		if (flags & CFLGS_OFF_SLAB && cachep->num) {
 			/*
 			 * Max number of objs-per-slab for caches which
 			 * use off-slab slabs. Needed to avoid a possible
 			 * looping condition in cache_grow().
 			 */
 			offslab_limit = size - sizeof(struct slab);
-			offslab_limit /= sizeof(kmem_bufctl_t);
+			offslab_limit /= (sizeof(kmem_bufctl_t) +
+					  cachep->meta_data_size);
 
  			if (num > offslab_limit)
 				break;
@@ -2125,9 +2166,13 @@
  * as davem.
  */
 struct kmem_cache *
-kmem_cache_create (const char *name, size_t size, size_t align,
-	unsigned long flags,
-	void (*ctor)(void*, struct kmem_cache *, unsigned long))
+kmem_cache_create_relocatable (const char *name, size_t size, size_t align,
+       unsigned long flags,
+       void (*ctor)(void*, struct kmem_cache *, unsigned long),
+       int (*relocator)(void*, void*, struct kmem_cache *,
+			unsigned long, unsigned long),
+       unsigned long relocator_private,
+       size_t  meta_data_size)
 {
 	size_t left_over, slab_size, ralign;
 	struct kmem_cache *cachep = NULL, *pc;
@@ -2260,6 +2305,14 @@
 	if (!cachep)
 		goto oops;
 
+	/* Need this early to compute slab size properly. */
+	cachep->meta_data_size = meta_data_size;
+
+	if (meta_data_size) {
+		printk (KERN_INFO "kmem_cache_create meta_data_size=%d\n",
+			meta_data_size);
+	}
+
 #if DEBUG
 	cachep->obj_size = size;
 
@@ -2314,9 +2367,10 @@
 		cachep = NULL;
 		goto oops;
 	}
-	slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t)
-			  + sizeof(struct slab), align);
 
+	slab_size = slab_mgmt_size(cachep->num,
+				   align,
+				   cachep->meta_data_size);
 	/*
 	 * If the slab has been placed off-slab, and we have enough space then
 	 * move it on-slab. This is at the expense of any extra colouring.
@@ -2328,8 +2382,8 @@
 
 	if (flags & CFLGS_OFF_SLAB) {
 		/* really off slab. No need for manual alignment */
-		slab_size =
-		    cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
+		slab_size = slab_mgmt_size(cachep->num, 1,
+					   cachep->meta_data_size);
 	}
 
 	cachep->colour_off = cache_line_size();
@@ -2358,6 +2412,8 @@
 	}
 	cachep->ctor = ctor;
 	cachep->name = name;
+	cachep->relocator = relocator;
+	cachep->relocator_private = relocator_private;
 
 	if (setup_cpu_cache(cachep)) {
 		__kmem_cache_destroy(cachep);
@@ -2374,7 +2430,7 @@
 	mutex_unlock(&cache_chain_mutex);
 	return cachep;
 }
-EXPORT_SYMBOL(kmem_cache_create);
+EXPORT_SYMBOL(kmem_cache_create_relocatable);
 
 #if DEBUG
 static void check_irq_off(void)
@@ -2582,6 +2638,12 @@
  * kmem_find_general_cachep till the initialization is complete.
  * Hence we cannot have slabp_cache same as the original cache.
  */
+
+static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp)
+{
+	return (kmem_bufctl_t *) (slabp + 1);
+}
+
 static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
 				   int colour_off, gfp_t local_flags,
 				   int nodeid)
@@ -2598,18 +2660,51 @@
 		slabp = objp + colour_off;
 		colour_off += cachep->slab_size;
 	}
+
+	memset(slabp, 0, cachep->slab_size);
+
 	slabp->inuse = 0;
 	slabp->colouroff = colour_off;
 	slabp->s_mem = objp + colour_off;
 	slabp->nodeid = nodeid;
+	slabp->meta_data_start = (unsigned long)slab_bufctl(slabp) +
+			sizeof(kmem_bufctl_t)*cachep->num;
+	slabp->meta_data_end = slabp->meta_data_start + cachep->meta_data_size * cachep->num;
 	return slabp;
 }
 
-static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp)
-{
-	return (kmem_bufctl_t *) (slabp + 1);
+static inline void *slab_meta_data(const struct kmem_cache *cachep,
+				   struct slab *slabp) {
+	return (void *)slab_bufctl(slabp) +
+			cachep->num * sizeof(kmem_bufctl_t);
 }
 
+void *kmem_cache_get_metadata(const struct kmem_cache *cache,
+			      void *obj) {
+	if (cache->meta_data_size == 0) {
+		return NULL;
+	} else {
+		struct slab *slab = virt_to_slab(obj);
+		int ind = obj_to_index(cache, slab, obj);
+		void *ret;
+
+		ret = slab_meta_data(cache, slab) +
+				ind * cache->meta_data_size;
+		
+		if ((unsigned long)ret < slab->meta_data_start ||
+		    (unsigned long)ret >= slab->meta_data_end) {
+			printk (KERN_ERR "kmem_cache_get_metadata: Bad ret ind=%d ret=%p slab=%p\n", ind, ret, slab);
+		}
+
+		BUG_ON((unsigned long)ret < slab->meta_data_start);
+		BUG_ON((unsigned long)ret >= slab->meta_data_end);
+
+		return ret;
+	}
+}
+
+
+
 static void cache_init_objs(struct kmem_cache *cachep,
 			    struct slab *slabp)
 {
@@ -2681,8 +2776,10 @@
 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
 	WARN_ON(slabp->nodeid != nodeid);
 #endif
+	slab_bufctl(slabp)[slabp->free] = BUFCTL_ACTIVE;
 	slabp->free = next;
 
+
 	return objp;
 }
 
@@ -4013,6 +4110,143 @@
 	}
 }
 
+/*
+ * Attempt to take the next to be reused slab and free it up.
+ */
+void defrag_cache_node(struct kmem_cache *cachep, int node) {
+	struct kmem_list3 *l3 = cachep->nodelists[node];
+	struct slab *slabp;
+	kmem_bufctl_t *ctlp;
+	int i;
+	void *targetp = NULL;
+
+	slabp = list_entry(l3->slabs_partial.next,
+			   struct slab, list);
+
+	/* maybe this will be clear by the next time arround. */
+	list_del(&slabp->list);
+	list_add_tail(&slabp->list, &l3->slabs_partial);
+
+
+	for (i = 0; i < cachep->num; i++) {
+		/* This risks using up the hot cpu pages on things
+		 * that are old and stale.
+		 */
+		if (targetp == NULL) {
+			/*
+			 * We risk thrashing on the spin lock, but what
+			 * else can we do?  We need to be able to allocate
+			 * new objects.
+			 */
+			spin_unlock(&l3->list_lock);
+			targetp = kmem_cache_alloc_node(cachep,
+						GFP_ATOMIC & ~GFP_THISNODE,
+						node);
+			spin_lock(&l3->list_lock);
+			if (targetp == NULL) {
+				printk (KERN_DEBUG
+					"defrage_cache_node: Couldn't allocate target.\n");
+				/* WTF? Couldn't get memory. */
+				break;
+			}
+
+		}
+
+		if (unlikely(list_empty(&l3->slabs_partial))) {
+			printk (KERN_DEBUG "defrag_cache_node: partial list empty.\n");
+			break;
+		}
+
+		/*
+		 * This may not be the same slab as we saw last time,
+		 * but that is a risk we will just have to take.
+		 * Things should still be consolidated, but we likely
+		 * won't free anything in this pass.
+		 */
+		slabp = list_entry(l3->slabs_partial.prev,
+				   struct slab, list);
+
+		ctlp = slab_bufctl(slabp);
+
+		if (ctlp[i] == BUFCTL_ACTIVE) {
+			void *objp = index_to_obj(cachep, slabp, i);
+
+			/* The relocator is responsible for making sure
+			 * the object doesn't disapeer from out from
+			 * under it.  The memory itself won't be freed,
+			 * but the object might be on the cpu hot list and
+			 * might be reused.
+			 */
+			int rel = cachep->relocator(objp, targetp, cachep,
+						    cachep->relocator_private,
+						    obj_size(cachep));
+			switch (rel) {
+				case RELOCATE_SUCCESS_RCU:
+					/* We've moved the copy, but we
+					 * can't free the old one right away
+					 * because it might still be in use.
+					 */
+					/*printk (KERN_DEBUG
+						"defrag_cache_node: "
+						"RCU success\n"); */
+					targetp = NULL;
+					break;
+
+				case RELOCATE_SUCCESS:
+					/* relocation succeeded. objp is now
+					 * free.  tragetp is used.
+					 */
+					/*printk (KERN_DEBUG "defrag_cache_node: "
+					  "relocated object %d.\n", i); */
+					targetp = NULL;
+					ctlp[i] = slabp->free;
+					slabp->free = i;
+					l3->free_objects++;
+					slabp->inuse--;
+					if (slabp->inuse == 0) {
+						list_del(&slabp->list);
+						if (l3->free_objects > l3->free_limit){
+							l3->free_objects -=
+									cachep->num;
+							slab_destroy(cachep, slabp);
+						} else {
+							list_add(&slabp->list,
+								 &l3->slabs_free);
+						}
+						goto done;
+					}
+					break;
+
+				case RELOCATE_FAILURE:
+					/*printk (KERN_DEBUG
+						"defrag_cache_node: "
+						"faled to relocated "
+						"object %d.\n", i); */
+					break;
+
+				default:
+					printk (KERN_DEBUG
+						"defrag_cache_node: "
+						"unknown result %d\n",
+						rel);
+					break;
+			}
+		} else {
+			/*printk (KERN_DEBUG "defrag_cache_node: "
+			  "object %d not active.\n", i); */
+
+		}
+
+	}
+
+ done:
+	if (targetp) {
+		spin_unlock(&l3->list_lock);
+		kmem_cache_free(cachep, targetp);
+		spin_lock(&l3->list_lock);
+	}
+}
+
 /**
  * cache_reap - Reclaim memory from caches.
  * @w: work descriptor
@@ -4037,6 +4271,7 @@
 		/* Give up. Setup the next iteration. */
 		goto out;
 
+
 	list_for_each_entry(searchp, &cache_chain, next) {
 		check_irq_on();
 
@@ -4047,6 +4282,28 @@
 		 */
 		l3 = searchp->nodelists[node];
 
+		/* See if it's worth trying to free up a slab by moving all of
+		 * it's entries to other slabs. There is a pretty good
+		 * chance that if the oldest partial slab has less than
+		 * 1/4 of the total free objects then we can reallocate them.
+		 * However, don't try if the slab is more than 50% full.
+		 */
+		if (unlikely(searchp->relocator)) {
+			spin_lock_irq(&l3->list_lock);
+			if (!list_empty(&l3->slabs_partial)) {
+				struct slab *slabp =
+				    list_entry(l3->slabs_partial.next,
+					       struct slab, list);
+				if (slabp->inuse <
+				    l3->free_objects / 4 &&
+				    slabp->inuse <
+				    searchp->num / 2) {
+					defrag_cache_node(searchp, node);
+				}
+			}
+			spin_unlock_irq(&l3->list_lock);
+		}
+
 		reap_alien(searchp, l3);
 
 		drain_array(searchp, l3, cpu_cache_get(searchp), 0, node);
@@ -4082,6 +4339,25 @@
 	schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_CPUC));
 }
 
+
+void test_defrag(struct kmem_cache *searchp) {
+	struct kmem_list3 *l3;
+	int node = numa_node_id();
+	BUG_ON(!searchp->relocator);
+
+	l3 = searchp->nodelists[node];
+
+	spin_lock_irq(&l3->list_lock);
+	if (unlikely(list_empty(&l3->slabs_partial))) {
+		printk (KERN_DEBUG "test_defrag: no partial slabs.\n");
+	} else {
+		defrag_cache_node(searchp, node);
+	}
+	spin_unlock_irq(&l3->list_lock);
+}
+
+EXPORT_SYMBOL_GPL(test_defrag);
+
 #ifdef CONFIG_PROC_FS
 
 static void print_slabinfo_header(struct seq_file *m)
@@ -4128,19 +4404,26 @@
 	mutex_unlock(&cache_chain_mutex);
 }
 
-static int s_show(struct seq_file *m, void *p)
-{
-	struct kmem_cache *cachep = list_entry(p, struct kmem_cache, next);
+void kmem_compute_stats(struct kmem_cache *cachep,
+		       unsigned long *full_slabs,
+		       unsigned long *partial_slabs,
+		       unsigned long *partial_objs,
+		       unsigned long *free_slabs,
+		       char **error) {
 	struct slab *slabp;
 	unsigned long active_objs;
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
 	const char *name;
-	char *error = NULL;
 	int node;
 	struct kmem_list3 *l3;
 
+	*full_slabs = 0;
+	*partial_slabs = 0;
+	*partial_objs = 0;
+	*free_slabs = 0;
+
 	active_objs = 0;
 	num_slabs = 0;
 	for_each_online_node(node) {
@@ -4153,30 +4436,55 @@
 
 		list_for_each_entry(slabp, &l3->slabs_full, list) {
 			if (slabp->inuse != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
+				*error = "slabs_full accounting error";
+			(*full_slabs)++;
 		}
 		list_for_each_entry(slabp, &l3->slabs_partial, list) {
 			if (slabp->inuse == cachep->num && !error)
-				error = "slabs_partial inuse accounting error";
+				*error =
+				    "slabs_partial inuse accounting error";
 			if (!slabp->inuse && !error)
-				error = "slabs_partial/inuse accounting error";
-			active_objs += slabp->inuse;
-			active_slabs++;
+				*error =
+				    "slabs_partial/inuse accounting error";
+			*partial_objs += slabp->inuse;
+			(*partial_slabs)++;
 		}
 		list_for_each_entry(slabp, &l3->slabs_free, list) {
 			if (slabp->inuse && !error)
-				error = "slabs_free/inuse accounting error";
-			num_slabs++;
+				*error = "slabs_free/inuse accounting error";
+			(*free_slabs)++;
 		}
-		free_objects += l3->free_objects;
-		if (l3->shared)
-			shared_avail += l3->shared->avail;
 
 		spin_unlock_irq(&l3->list_lock);
 	}
-	num_slabs += active_slabs;
+}
+
+/* Useful for tests. */
+EXPORT_SYMBOL_GPL(kmem_compute_stats);
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct kmem_cache *cachep = p;
+	unsigned long active_objs;
+	unsigned long num_objs;
+	unsigned long active_slabs;
+	unsigned long num_slabs, free_objects, shared_avail;
+	const char *name;
+	char *error = NULL;
+	unsigned long full_slabs;
+	unsigned long partial_slabs;
+	unsigned long partial_objs;
+	unsigned long free_slabs;
+
+	kmem_compute_stats(cachep,  &full_slabs, &partial_slabs, &partial_objs,
+			   &free_slabs, &error);
+
+	active_objs = full_slabs * cachep->num + partial_objs;
+	num_slabs = full_slabs + partial_slabs + free_slabs;
+	active_slabs = full_slabs + partial_slabs;
+	shared_avail = partial_slabs * cachep->num - partial_objs;
+	free_objects = free_slabs * cachep->num + shared_avail;
+
 	num_objs = num_slabs * cachep->num;
 	if (num_objs - active_objs != free_objects && !error)
 		error = "free_objects accounting error";

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 15:16 RFC/POC Make Page Tables Relocatable Ross Biro
@ 2007-10-25 16:46 ` Dave Hansen
  2007-10-25 17:40   ` Ross Biro
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 16:46 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm

On Thu, 2007-10-25 at 11:16 -0400, Ross Biro wrote:
> 1) Add a separate meta-data allocation to the slab and slub allocator
> and allocate full pages through kmem_cache_alloc instead of get_page.
> The primary motivation of this is that we could shrink struct page by
> using kmem_cache_alloc to allocate whole pages and put the supported
> data in the meta_data area instead of struct page. 

The idea seems cool, but I think I'm missing a lot of your motivation
here.

First of all, which meta-data, exactly, is causing 'struct page' to be
larger than it could be?  Which meta-data can be moved?

> 2) Add support for relocating memory allocated via kmem_cache_alloc.
> When a cache is created, optional relocation information can be
> provided.  If a relocation function is provided, caches can be
> defragmented and overall memory consumption can be reduced.

We may truly need this some day, but I'm not sure we need it for
pagetables.  If I were a stupid, naive kernel developer and I wanted to
get a pte page back, I might simply hold the page table lock, walk the
pagetables to the pmd, lock and invalidate the pmd, copy the pagetable
contents into a new page, update the pmd, and be on my merry way.  Why
doesn't this work?  I'm just fishing for a good explanation why we need
all the slab silliness.

I applaud you for posting early and posting often, but there is an
absolute ton of code in your patch.  For your subsequent postings, I'd
highly recommend trying to break it up in some logical ways.  Your 4
steps would be an excellent start.

You might also want to run checkpatch.pl on your patch.  It has some
style issues that also need to get worked out.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 16:46 ` Dave Hansen
@ 2007-10-25 17:40   ` Ross Biro
  2007-10-25 18:08     ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Biro @ 2007-10-25 17:40 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm

On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> On Thu, 2007-10-25 at 11:16 -0400, Ross Biro wrote:
> > 1) Add a separate meta-data allocation to the slab and slub allocator
> > and allocate full pages through kmem_cache_alloc instead of get_page.
> > The primary motivation of this is that we could shrink struct page by
> > using kmem_cache_alloc to allocate whole pages and put the supported
> > data in the meta_data area instead of struct page.
>
> The idea seems cool, but I think I'm missing a lot of your motivation
> here.
>
> First of all, which meta-data, exactly, is causing 'struct page' to be
> larger than it could be?  Which meta-data can be moved?

Almost all of it.  Most of struct page isn't about the kernel manging
pages in general, but about managing particular types of pages.
Although it's been cleaned up over the years, there are still
some things:

        union {
                atomic_t _mapcount;     /* Count of ptes mapped in mms,
                                         * to show when page is mapped
                                         * & limit reverse map searches.
                                         */
                struct {        /* SLUB uses */
                        short unsigned int inuse;
                        short unsigned int offset;
                };
        };

mapcount is only used when the page is mapped via a pte, while the
other part is only used when the page is part of a SLUB cache.
Neither of which is always true and not 100% needed as part of struct
page.  There is just currently no better place to put them.  The rest
of the unions don't really belong in struct page.  Similarly the lru
list only applies to pages which could go on the lru list.  So why not
make a better place to put them.

>
> > 2) Add support for relocating memory allocated via kmem_cache_alloc.
> > When a cache is created, optional relocation information can be
> > provided.  If a relocation function is provided, caches can be
> > defragmented and overall memory consumption can be reduced.
>
> We may truly need this some day, but I'm not sure we need it for
> pagetables.  If I were a stupid, naive kernel developer and I wanted to

I chose to start with page tables because I figured they would be the
hardest to properly relocate.

> get a pte page back, I might simply hold the page table lock, walk the
> pagetables to the pmd, lock and invalidate the pmd, copy the pagetable
> contents into a new page, update the pmd, and be on my merry way.  Why
> doesn't this work?  I'm just fishing for a good explanation why we need
> all the slab silliness.

This would almost work, but to do it properly, you find you'll need
some more locks and a couple of extra pointers and such.  With out all
the slab silliness you would need to add them to struct page. It would
have needlessly bloated struct page hence the previous change.  I've
also managed to convince myself that using the slab/slub allocator
will tend to clump the page tables together which should reduce
fragmentation and make more memory available for huge pages.  In fact,
I've got this idea that by using slab/slub, we can stop allocating
individual pages and only allocate huge pages on systems that have
them.

>
> I applaud you for posting early and posting often, but there is an
> absolute ton of code in your patch.  For your subsequent postings, I'd
> highly recommend trying to break it up in some logical ways.  Your 4
> steps would be an excellent start.

I don't think any of the four changes stand on their own, but only
when you see them together.  If there is enough agreement in principle
to go forward, then for real patches you are correct.   Remember, that
patch was only meant as a proof of concept.

> You might also want to run checkpatch.pl on your patch.  It has some
> style issues that also need to get worked out.

That patch isn't meant to be applied, but is there because it's easier
to point to code to try to explain what I'm mean than to explain in
words.  I didn't think a few style issues would be an issue.  And just
to reiterate, if you actually use the code I posted, you get what you
deserve.  It was only meant to illustrate what I'm trying to say.

    Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 17:40   ` Ross Biro
@ 2007-10-25 18:08     ` Dave Hansen
  2007-10-25 18:44       ` Ross Biro
  2007-10-26 16:10       ` Mel Gorman
  0 siblings, 2 replies; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 18:08 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm, Mel Gorman

On Thu, 2007-10-25 at 13:40 -0400, Ross Biro wrote: 
> On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > On Thu, 2007-10-25 at 11:16 -0400, Ross Biro wrote:
> > > 1) Add a separate meta-data allocation to the slab and slub allocator
> > > and allocate full pages through kmem_cache_alloc instead of get_page.
> > > The primary motivation of this is that we could shrink struct page by
> > > using kmem_cache_alloc to allocate whole pages and put the supported
> > > data in the meta_data area instead of struct page.
> >
> > The idea seems cool, but I think I'm missing a lot of your motivation
> > here.
> >
> > First of all, which meta-data, exactly, is causing 'struct page' to be
> > larger than it could be?  Which meta-data can be moved?
> 
> Almost all of it.  Most of struct page isn't about the kernel manging
> pages in general, but about managing particular types of pages.
> Although it's been cleaned up over the years, there are still
> some things:
> 
>         union {
>                 atomic_t _mapcount;     /* Count of ptes mapped in mms,
>                                          * to show when page is mapped
>                                          * & limit reverse map searches.
>                                          */
>                 struct {        /* SLUB uses */
>                         short unsigned int inuse;
>                         short unsigned int offset;
>                 };
>         };
> 
> mapcount is only used when the page is mapped via a pte, while the
> other part is only used when the page is part of a SLUB cache.
> Neither of which is always true and not 100% needed as part of struct
> page.  There is just currently no better place to put them.  The rest
> of the unions don't really belong in struct page.  Similarly the lru
> list only applies to pages which could go on the lru list.  So why not
> make a better place to put them.

Right, but we're talking about pagetable pages here, right?  What fields
in 'struct page' are used by pagetable pages, but will allow 'struct
page' to shrink in size if pagetables pages stop using them?

On a more general note: so it's all about saving memory in the end?
Making 'struct page' smaller?  If I were you, I'd be very conerned about
the pathological cases.  We may get the lru pointers out of 'struct
page', so we'll need some parallel lookup to get from physical page to
LRU, right?   Although the bootup footprint of mem_map[] (and friends)
smaller, what happens on a machine with virtually all its memory used by
pages on the LRU (which I would guess is actually quite common).  Will
the memory footprint even be close to the two pointers per physical page
that it cost us for the current implementation?

That doesn't even consider the runtime overhead of such a scheme.  Right
now, if you touch any part of 'struct page' on a 32-bit machine, you
generally bring the entire thing into a single cacheline.  Every other
subsequent access is essentially free.  Any ideas what the ballpark
number of cachelines are that would have to be brought in with another
lookup method for 'struct page' to lru?

I dunno.  I'm highly skeptical this can work.

I've heard rumors in the past that the Windows' 'struct page' is much
smaller than the Linux one.  But, I've also heard that this weighs
heavily in other areas such as page reclamation.  Could be _completely_
bogus, but it might be worth a search or two to see if there have been
any papers on the subject.  

> > get a pte page back, I might simply hold the page table lock, walk the
> > pagetables to the pmd, lock and invalidate the pmd, copy the pagetable
> > contents into a new page, update the pmd, and be on my merry way.  Why
> > doesn't this work?  I'm just fishing for a good explanation why we need
> > all the slab silliness.
> 
> This would almost work, but to do it properly, you find you'll need
> some more locks and a couple of extra pointers and such.

Could you be specific?

> With out all
> the slab silliness you would need to add them to struct page. It would
> have needlessly bloated struct page hence the previous change.  I've
> also managed to convince myself that using the slab/slub allocator
> will tend to clump the page tables together which should reduce
> fragmentation and make more memory available for huge pages.  In fact,
> I've got this idea that by using slab/slub, we can stop allocating
> individual pages and only allocate huge pages on systems that have
> them.

You may want to have a talk with Mel about memory fragmentation, and
whether there is any lower hanging fruit (cc'd). :)

> > You might also want to run checkpatch.pl on your patch.  It has some
> > style issues that also need to get worked out.
> 
> That patch isn't meant to be applied, but is there because it's easier
> to point to code to try to explain what I'm mean than to explain in
> words.  I didn't think a few style issues would be an issue.  And just
> to reiterate, if you actually use the code I posted, you get what you
> deserve.  It was only meant to illustrate what I'm trying to say.

In general, the reason to run such a script (and to have coding
standards in the first place) is so that others can more easily read
your code.  The posted patch is hard to understand in some areas because
of indenting bracketing.  If you'd like people to read, review, and give
suggestions on what they see, I'd suggest trying to make it as easy as
possible to understand.

Check out Documentation/CodingStyle.  

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 18:08     ` Dave Hansen
@ 2007-10-25 18:44       ` Ross Biro
  2007-10-25 18:47         ` Dave Hansen
  2007-10-25 19:23         ` Dave Hansen
  2007-10-26 16:10       ` Mel Gorman
  1 sibling, 2 replies; 17+ messages in thread
From: Ross Biro @ 2007-10-25 18:44 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, Mel Gorman

On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> Right, but we're talking about pagetable pages here, right?  What fields
> in 'struct page' are used by pagetable pages, but will allow 'struct
> page' to shrink in size if pagetables pages stop using them?

At the moment we are only talking about page tables, but I hope in the
future to do more.  Perhaps page tables were a bad place to start, but
like I said I thought they would be the hardest, and hence a good
place to start.

>
> On a more general note: so it's all about saving memory in the end?

Sort of. As you pointed out, right now struct page is pretty well
tuned, but it's also not easy to add something.  If I need an extra
pointer or two to do something, then I more or less totally trash all
sorts of efficiencies we are getting now.  It's about having
flexibility without losing efficeincy.

>
> I dunno.  I'm highly skeptical this can work.

Skepticism is good.  I think I can pull it off, and possibly even make
it more efficient.  The big gotcha is going to be caching issues if we
end up bouncing back and forth between struct page and the new meta
data, we might end up being toast.  On the other hand, if we are
looking at multiple pages and we fit multiple smaller structures into
a cache line, we might still win.  This is why I asked for some micro
benchmarks.  I figured people would send the ones they feel are most
likely to fail.

Remember, this change doesn't stand on it's own.  In a vacuum, I don't
think this change is worth doing at all.  But it enables the other
changes and a lot more going forward.

> > > get a pte page back, I might simply hold the page table lock, walk the
> > > pagetables to the pmd, lock and invalidate the pmd, copy the pagetable
> > > contents into a new page, update the pmd, and be on my merry way.  Why
> > > doesn't this work?  I'm just fishing for a good explanation why we need
> > > all the slab silliness.
> >
> > This would almost work, but to do it properly, you find you'll need
> > some more locks and a couple of extra pointers and such.
>
> Could you be specific?

Well to go quickly from an arbitrary page that happens to be part of a
page table to the appropriate mm to get a lock, I had to store a
pointer to the mm.  Then I also needed to know where the particular
page fit into the page table tree.  Once I had those, it turned out I
needed a spinlock to protect them to deallocate the page with out
racing against the relocation.  I think I could have used the ptl lock
struct page, but I wasn't really clear on it when I started.

So I needed 2 pointers which I could have squeezed into struct page
somewhere, but then what about when I needed a third or forth pointer
to make something else work well?  I'm pretty sure I can clean up some
of the tlb flushing and make all levels of the page tables relocatable
with out a problem by adding another flag.  Of course, I could put a
flag into the page flags, but it doesn't take long to run out of flag
space.  The meta data change we are talking about above is to make the
code flexible enough to support things like this with out killing
performance.

Your argument against the meta data change above is that it will kill
performance.  I don't think so, but I could be wrong.  However, if the
only objection is that it will kill performance, then it's worth doing
and running some benchmarks.  If it turns out I'm correct and it's a
win or not a big loss from a performance point of view, then it goes
in.  If not, it doesn't.

>
> You may want to have a talk with Mel about memory fragmentation, and
> whether there is any lower hanging fruit (cc'd). :)

I usually like to go for the high hanging fruit with the idea if I do
that well, the low hanging fruit becomes a cake walk.  However, any
input on this is welcome.

>
> your code.  The posted patch is hard to understand in some areas because
> of indenting bracketing.  If you'd like people to read, review, and give
> suggestions on what they see, I'd suggest trying to make it as easy as

I'm sorry about that.  It must have happened when I hand applied the
patch to 2.6.23 (it was developed under 2.6.22).  I should have had
emacs reflow all the changes after deleting all the +'s that diff
sticks in front of the lines.

    Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 18:44       ` Ross Biro
@ 2007-10-25 18:47         ` Dave Hansen
  2007-10-25 19:23         ` Dave Hansen
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 18:47 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm, Mel Gorman

On Thu, 2007-10-25 at 14:44 -0400, Ross Biro wrote:
> This is why I asked for some micro
> benchmarks.  I figured people would send the ones they feel are most
> likely to fail.

lmbench has some fault speed tests that are pretty sensitive to things
like changes in the allocator.  You might take a look at those.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 18:44       ` Ross Biro
  2007-10-25 18:47         ` Dave Hansen
@ 2007-10-25 19:23         ` Dave Hansen
  2007-10-25 19:53           ` Ross Biro
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 19:23 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm, Mel Gorman

On Thu, 2007-10-25 at 14:44 -0400, Ross Biro wrote:
> On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > > This would almost work, but to do it properly, you find you'll need
> > > some more locks and a couple of extra pointers and such.
> >
> > Could you be specific?
> 
> Well to go quickly from an arbitrary page that happens to be part of a
> page table to the appropriate mm to get a lock, I had to store a
> pointer to the mm.

Hold on a sec there.  You don't *have* to. :)

With the pagetable page you can go examine ptes.  From the ptes, you can
get the 'struct page' for the mapped page.  From there, you can get the
anon_vma and at least the list of mms that _could_ map the page.  You
get virtual addresses from the linear_page_index() or offset in the
mapping from page->index and vma->vm_pgoff and vm_start.  That should
make the search a bit more reasonable.

Slow, yes.  But, we're already talking about reclaim paths here.  

> Then I also needed to know where the particular
> page fit into the page table tree.  Once I had those, it turned out I
> needed a spinlock to protect them to deallocate the page with out
> racing against the relocation.  I think I could have used the ptl lock
> struct page, but I wasn't really clear on it when I started.
> 
> So I needed 2 pointers which I could have squeezed into struct page
> somewhere, but then what about when I needed a third or forth pointer
> to make something else work well?

I think you started out with the assumption that we needed out of page
metadata and then started adding more reasons that we needed it.  I
seriously doubt that you really and truly *NEED* four new fields in
'struct page'. :)

My guys says that this is way too complicated to be pursued in this
form.  But, don't listen to me.  You don't have to convince _me_.

If you want to pursue this, I'd concentrate on breaking your patch up in
to manageable pieces.  Don't forget diffstats at the top of your patch,
too.  If I were to start breaking this patch up, I'd probably start with
these things, but probably not in this order.  If you do it right,
you'll end up with even more pieces than this.

1. add support to slab for object relocation
2. add support to slab for object metadata
3. allocate pte pages from the slab (yet again)
4. add metadata for pagetable pages (this can be distinct from the other
   patches, and a simple implementation might just stick it in 'struct
   page' to make it easy to review at first)
5. add and use walk_page_table_*() functions
6. add need_flush tracking to the mm
7. add minimum base page size requirements to the slab
8. add relocation handles
9. your test module
10. rcu for freeing pagetables and tlb flushing
11. actual pagetable relocation code

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 19:23         ` Dave Hansen
@ 2007-10-25 19:53           ` Ross Biro
  2007-10-25 19:56             ` Dave Hansen
                               ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ross Biro @ 2007-10-25 19:53 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, Mel Gorman

On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> With the pagetable page you can go examine ptes.  From the ptes, you can
> get the 'struct page' for the mapped page.  From there, you can get the

Definitely worth considering.

> I think you started out with the assumption that we needed out of page
> metadata and then started adding more reasons that we needed it.  I
> seriously doubt that you really and truly *NEED* four new fields in
> 'struct page'. :)

I didn't start off with that assumption.  Originally I intended to add
what I needed to struct page and not worry about it.  However, it
quickly became apparent that while doable, it wouldn't be clean and
that in fact we do have a flexibility issue when it comes to mucking
with something that touches struct page.  That's when I thought up the
meta data thing.

I think it's worth pursuing, but if your suggestion above works, then
it can be totally independent of these changes and I can possibly
substantially shrink struct page when I do the change.  If it all
works well, then it would be self motivating.

>
> My guys says that this is way too complicated to be pursued in this
> form.  But, don't listen to me.  You don't have to convince _me_.

At this point, I'm more interested if anyone has any objections in
principle to the overall thing.  If so, and they are legitimate, then
it's not worth pursuing.  If not, then I'll start.  However, I
disagree with your order.  I'm thinking more like:

1) Support for relocation.
2) Support for handles
3) Test module.

These three work together and give a framework for validating the
relocation code with out causing too much trouble.  The only problem
is that they are mostly useless on their own.

Then the page table related code, using your suggestion above
(provided I can get it to work.  I'm worried about the page table
being freed while I'm trying to figure out what mm it belongs to.)
I'll break this into small chunks.

Finally the metadata code.

Thanks for your input.

     Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 19:53           ` Ross Biro
@ 2007-10-25 19:56             ` Dave Hansen
  2007-10-25 19:58             ` Ross Biro
  2007-10-25 20:00             ` Dave Hansen
  2 siblings, 0 replies; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 19:56 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm, Mel Gorman

On Thu, 2007-10-25 at 15:53 -0400, Ross Biro wrote:
> 
> > My guys says that this is way too complicated to be pursued in this
> > form.  But, don't listen to me.  You don't have to convince _me_.

Wow.  My fingers no workee today.  "My gut says"...  I don't have a
bunch of guys sitting around telling me things (only in my head).

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 19:53           ` Ross Biro
  2007-10-25 19:56             ` Dave Hansen
@ 2007-10-25 19:58             ` Ross Biro
  2007-10-25 20:15               ` Dave Hansen
  2007-10-25 20:00             ` Dave Hansen
  2 siblings, 1 reply; 17+ messages in thread
From: Ross Biro @ 2007-10-25 19:58 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, Mel Gorman

On 10/25/07, Ross Biro <rossb@google.com> wrote:
> On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > With the pagetable page you can go examine ptes.  From the ptes, you can
> > get the 'struct page' for the mapped page.  From there, you can get the
>
> Definitely worth considering.

Now I remember.  At least in the slab allocator, the relocation code
must hold an important spinlock while the relocation occurs.  Maybe I
can get around that, but maybe not.  If not, that could be a
fundamental problem, but at least it prevents doing long searches.

    Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 19:58             ` Ross Biro
@ 2007-10-25 20:15               ` Dave Hansen
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 20:15 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm

On Thu, 2007-10-25 at 15:58 -0400, Ross Biro wrote:
> On 10/25/07, Ross Biro <rossb@google.com> wrote:
> > On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > > With the pagetable page you can go examine ptes.  From the ptes, you can
> > > get the 'struct page' for the mapped page.  From there, you can get the
> >
> > Definitely worth considering.
> 
> Now I remember.  At least in the slab allocator, the relocation code
> must hold an important spinlock while the relocation occurs.  Maybe I
> can get around that, but maybe not.  If not, that could be a
> fundamental problem, but at least it prevents doing long searches.

"important spinlock" isn't really precise enough for me to understand
what you are talking about, make any arguments for or against it, or
suggest alternatives. :(

If the slab is truly a constraint, perhaps you should consider alternate
mechanisms, or fix the slab instead.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 19:53           ` Ross Biro
  2007-10-25 19:56             ` Dave Hansen
  2007-10-25 19:58             ` Ross Biro
@ 2007-10-25 20:00             ` Dave Hansen
  2007-10-25 20:10               ` Ross Biro
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 20:00 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm, Mel Gorman

On Thu, 2007-10-25 at 15:53 -0400, Ross Biro wrote:
> On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > My guys says that this is way too complicated to be pursued in this
> > form.  But, don't listen to me.  You don't have to convince _me_.
> 
> At this point, I'm more interested if anyone has any objections in
> principle to the overall thing.  If so, and they are legitimate, then
> it's not worth pursuing.  If not, then I'll start.  However, I
> disagree with your order.

Me too!  I just ran through your patch and wrote ideas as I saw them in
your patch order.  I bet they need to be done in much different orders
in reality. 

> I'm thinking more like:
> 
> 1) Support for relocation.

Generic slab relocation, right?

> 2) Support for handles

I've heard these handles are more or less what some other UNIXes do.
That doesn't give it points in my book.  :)

> 3) Test module.
> 
> These three work together and give a framework for validating the
> relocation code with out causing too much trouble.  The only problem
> is that they are mostly useless on their own.

Useless on their own is actually OK.  Patches series are often useless
up until patch 943/943.

> Then the page table related code, using your suggestion above
> (provided I can get it to work.  I'm worried about the page table
> being freed while I'm trying to figure out what mm it belongs to.)
> I'll break this into small chunks.

How would it get freed?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 20:00             ` Dave Hansen
@ 2007-10-25 20:10               ` Ross Biro
  2007-10-25 20:20                 ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Biro @ 2007-10-25 20:10 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, Mel Gorman

On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> How would it get freed?
>

The process exists or ummaps the range of memory.  The relocation code
is likely called on a different cpu in the node and currently has no
way to pin the data in memory.  Perhaps finding a way to pin the page
would help the other locking issues, so it might solve lots of
problems.

    Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 20:10               ` Ross Biro
@ 2007-10-25 20:20                 ` Dave Hansen
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Hansen @ 2007-10-25 20:20 UTC (permalink / raw)
  To: Ross Biro; +Cc: linux-mm, Mel Gorman

On Thu, 2007-10-25 at 16:10 -0400, Ross Biro wrote:
> On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > How would it get freed?
> 
> The process exists or ummaps the range of memory.  The relocation code
> is likely called on a different cpu in the node and currently has no
> way to pin the data in memory.  Perhaps finding a way to pin the page
> would help the other locking issues, so it might solve lots of
> problems.

Taking a simple reference count on the page will keep it from getting
freed.  It won't keep it from getting _unused_, but it will against
getting actually freed back to the allocator.

But, if you get to this point and you have a page and the only person
with a reference to it is you, it _should_ be completely empty of pte
entries.  They were all cleared at zap_pte_range() time.

You need other mechanisms in place, anyway, to keep ptes from being
instantiated or shot down while you're doing the copy itself.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-25 18:08     ` Dave Hansen
  2007-10-25 18:44       ` Ross Biro
@ 2007-10-26 16:10       ` Mel Gorman
  2007-10-26 16:51         ` Ross Biro
  1 sibling, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2007-10-26 16:10 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Ross Biro, linux-mm, Mel Gorman

On (25/10/07 11:08), Dave Hansen didst pronounce:
> On Thu, 2007-10-25 at 13:40 -0400, Ross Biro wrote: 
> > On 10/25/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> > > On Thu, 2007-10-25 at 11:16 -0400, Ross Biro wrote:
> > > > 1) Add a separate meta-data allocation to the slab and slub allocator
> > > > and allocate full pages through kmem_cache_alloc instead of get_page.
> > > > The primary motivation of this is that we could shrink struct page by
> > > > using kmem_cache_alloc to allocate whole pages and put the supported
> > > > data in the meta_data area instead of struct page.
> > >
> > > The idea seems cool, but I think I'm missing a lot of your motivation
> > > here.
> > >
> > > First of all, which meta-data, exactly, is causing 'struct page' to be
> > > larger than it could be?  Which meta-data can be moved?
> > 
> > Almost all of it.  Most of struct page isn't about the kernel manging
> > pages in general, but about managing particular types of pages.
> > Although it's been cleaned up over the years, there are still
> > some things:
> > 
> >         union {
> >                 atomic_t _mapcount;     /* Count of ptes mapped in mms,
> >                                          * to show when page is mapped
> >                                          * & limit reverse map searches.
> >                                          */
> >                 struct {        /* SLUB uses */
> >                         short unsigned int inuse;
> >                         short unsigned int offset;
> >                 };
> >         };
> > 
> > mapcount is only used when the page is mapped via a pte, while the
> > other part is only used when the page is part of a SLUB cache.
> > Neither of which is always true and not 100% needed as part of struct
> > page.  There is just currently no better place to put them.  The rest
> > of the unions don't really belong in struct page.  Similarly the lru
> > list only applies to pages which could go on the lru list.  So why not
> > make a better place to put them.
> 
> Right, but we're talking about pagetable pages here, right?  What fields
> in 'struct page' are used by pagetable pages, but will allow 'struct
> page' to shrink in size if pagetables pages stop using them?
> 
> On a more general note: so it's all about saving memory in the end?
> Making 'struct page' smaller?  If I were you, I'd be very conerned about
> the pathological cases.  We may get the lru pointers out of 'struct
> page', so we'll need some parallel lookup to get from physical page to
> LRU, right?   Although the bootup footprint of mem_map[] (and friends)
> smaller, what happens on a machine with virtually all its memory used by
> pages on the LRU (which I would guess is actually quite common).  Will
> the memory footprint even be close to the two pointers per physical page
> that it cost us for the current implementation?
> 
> That doesn't even consider the runtime overhead of such a scheme.  Right
> now, if you touch any part of 'struct page' on a 32-bit machine, you
> generally bring the entire thing into a single cacheline.  Every other
> subsequent access is essentially free.  Any ideas what the ballpark
> number of cachelines are that would have to be brought in with another
> lookup method for 'struct page' to lru?
> 
> I dunno.  I'm highly skeptical this can work.
> 
> I've heard rumors in the past that the Windows' 'struct page' is much
> smaller than the Linux one.  But, I've also heard that this weighs
> heavily in other areas such as page reclamation.  Could be _completely_
> bogus, but it might be worth a search or two to see if there have been
> any papers on the subject.  
> 
> > > get a pte page back, I might simply hold the page table lock, walk the
> > > pagetables to the pmd, lock and invalidate the pmd, copy the pagetable
> > > contents into a new page, update the pmd, and be on my merry way.  Why
> > > doesn't this work?  I'm just fishing for a good explanation why we need
> > > all the slab silliness.
> > 
> > This would almost work, but to do it properly, you find you'll need
> > some more locks and a couple of extra pointers and such.
> 
> Could you be specific?
> 
> > With out all
> > the slab silliness you would need to add them to struct page. It would
> > have needlessly bloated struct page hence the previous change.  I've
> > also managed to convince myself that using the slab/slub allocator
> > will tend to clump the page tables together which should reduce
> > fragmentation and make more memory available for huge pages.  In fact,
> > I've got this idea that by using slab/slub, we can stop allocating
> > individual pages and only allocate huge pages on systems that have
> > them.
> 
> You may want to have a talk with Mel about memory fragmentation, and
> whether there is any lower hanging fruit (cc'd). :)
> 

I suspect this might be overkill from a memory fragmentation
perspective. When grouping pages by mobility, page table pages are
currently considered MIGRATE_UNMOVABLE. From what I have seen, they are
by far the most common unmovable allocation. If they were relocatable
with the standard page migration mechanism, they could be considered
MIGRATE_MOVABLE and external fragmentation would be easier to content
with.

I haven't looked closely enough at this patch to know if moving page
table pages with page migration is the aim or not.

However, using huge pages just all slabs does not feel like a great
idea. There will be a lot memory wasted due to internal fragmentation
and systems with less memory are not going to want to commit a hugepage
for a small slab allocation.

> > > You might also want to run checkpatch.pl on your patch.  It has some
> > > style issues that also need to get worked out.
> > 
> > That patch isn't meant to be applied, but is there because it's easier
> > to point to code to try to explain what I'm mean than to explain in
> > words.  I didn't think a few style issues would be an issue.  And just
> > to reiterate, if you actually use the code I posted, you get what you
> > deserve.  It was only meant to illustrate what I'm trying to say.
> 
> In general, the reason to run such a script (and to have coding
> standards in the first place) is so that others can more easily read
> your code.  The posted patch is hard to understand in some areas because
> of indenting bracketing.  If you'd like people to read, review, and give
> suggestions on what they see, I'd suggest trying to make it as easy as
> possible to understand.
> 
> Check out Documentation/CodingStyle.  
> 
> -- Dave
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-26 16:10       ` Mel Gorman
@ 2007-10-26 16:51         ` Ross Biro
  2007-10-26 17:11           ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Biro @ 2007-10-26 16:51 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Dave Hansen, linux-mm, Mel Gorman

On 10/26/07, Mel Gorman <mel@skynet.ie> wrote:
> I suspect this might be overkill from a memory fragmentation
> perspective. When grouping pages by mobility, page table pages are
> currently considered MIGRATE_UNMOVABLE. From what I have seen, they are

I may be being dense, but the page migration code looks to me like it
just moves pages in a process from one node to another node with no
effort to touch the page tables.  It would be easy to hook the code I
wrote into the page migration code, what I don't understand is when
the page tables should be migrated?  Only when the whole process is
being migrated?  When all the pages pointed to a page table are being
migrated?  When any page pointed to by the page table is being
migrated?

    Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC/POC Make Page Tables Relocatable
  2007-10-26 16:51         ` Ross Biro
@ 2007-10-26 17:11           ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2007-10-26 17:11 UTC (permalink / raw)
  To: Ross Biro; +Cc: Dave Hansen, linux-mm, Mel Gorman

On (26/10/07 12:51), Ross Biro didst pronounce:
> On 10/26/07, Mel Gorman <mel@skynet.ie> wrote:
> > I suspect this might be overkill from a memory fragmentation
> > perspective. When grouping pages by mobility, page table pages are
> > currently considered MIGRATE_UNMOVABLE. From what I have seen, they are
> 
> I may be being dense, but the page migration code looks to me like it
> just moves pages in a process from one node to another node with no
> effort to touch the page tables. 

Exactly, if it was able to move arbitrary pagetable pages too, it would
be useful. Page migrations traditional case is to move pages between
nodes but memory hot-remove also uses it to move pages around a zone and
there has been at least one other case which I'm coming to.

> It would be easy to hook the code I
> wrote into the page migration code, what I don't understand is when
> the page tables should be migrated? 

>From an external fragmentation point of view, they would be moved when a
high-order allocation failued. Patches exist that do this sort of thing
under the title "Memory Compaction" but they are not merged because they
don't have a demonstratable use-case yet[1].

> Only when the whole process is
> being migrated?  When all the pages pointed to a page table are being
> migrated?  When any page pointed to by the page table is being
> migrated?
> 

If it was external fragmentation you were dealing with, a pagetable apge
would be moved once it was found to be preventing a high-order (e.gh.
hugepage) allocation from succeeding.

[1] Intuitively, the use case would be that a hugepage allocation
    happened faster when moving pages around than reclaiming them.
    This situation does not happen often enough to justify the 
    complexity of the code though.

-- 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2007-10-26 17:11 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-25 15:16 RFC/POC Make Page Tables Relocatable Ross Biro
2007-10-25 16:46 ` Dave Hansen
2007-10-25 17:40   ` Ross Biro
2007-10-25 18:08     ` Dave Hansen
2007-10-25 18:44       ` Ross Biro
2007-10-25 18:47         ` Dave Hansen
2007-10-25 19:23         ` Dave Hansen
2007-10-25 19:53           ` Ross Biro
2007-10-25 19:56             ` Dave Hansen
2007-10-25 19:58             ` Ross Biro
2007-10-25 20:15               ` Dave Hansen
2007-10-25 20:00             ` Dave Hansen
2007-10-25 20:10               ` Ross Biro
2007-10-25 20:20                 ` Dave Hansen
2007-10-26 16:10       ` Mel Gorman
2007-10-26 16:51         ` Ross Biro
2007-10-26 17:11           ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).