[RFC] - Kernel text replication on IA64


* [RFC] - Kernel text replication on IA64
From: Jack Steiner @ 2006-04-20 13:53 UTC
  To: linux-ia64; +Cc: lee.schermerhorn, clameter, linux-mm

There was a question about the effects of kernel text replication last
month.  I was curious so I resurrected an old trillian patch (Tony Luck's)
& got it working again. Here is the preliminary patch & some data about
the benefit.

This is still a work-in-progress. I have not concluded whether I think the
patch is beneficial. Please take a look. Comments are appreciated.

Note that one piece is missing from the patch. It is currently
incompatible with kprobes. That is easy to fix if we decide to go forward
with the patch.  For now, make sure that CONFIG_KPROBES is not selected.
Kdb breakpoints will not work, either.  (But, then, kdb breakpoints don't
really work anyway).

----------------

Here is a summary that shows the benefit of kernel text replication on a
few selected microbenchmarks.

The tests were run on a kernel that supports kernel text replication as a
boottime option. The first column shows the time (in usec) to run the
microbenchmark when text replication is disabled. The second column is the
same kernel but text replication was enabled at boot time.

Each test supports an option to select whether to run the test with a
"hot" or "cold" cache.

If "hot" is selected, the test is run multiple times in a tight loop.
Because the microbenchmarks have a small cache footprint, replication is
not expected to help if caches are hot. The rate of i-cache misses for
kernel code should be low.

If "cold" is selected, all caches are flushed between each iteration of
the loop. This removes any cached kernel code from the caches & will
increase the time of the next system call(s). When kernel text replication
is enabled, the caches are refilled from the local node, which has lower
latency than refilling from node 0, as happens when replication is
disabled.  The cache flush time is not included in the times but does have
a small residual impact on timing (less than a usec).
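
As a rough illustration of the hot/cold procedure described above, the
timing loop could look like the C sketch below. This is not the harness
used for the measurements. The names evict_caches(), bench_usec(),
null_syscall() and EVICT_BYTES are made up for this example, and writing
through a large buffer from user space only approximates a real cache
flush.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Assumed size: comfortably larger than the 1.5MB L3 of the test cpus. */
#define EVICT_BYTES	(8u * 1024 * 1024)

static volatile unsigned char *evict_buf;

/* Approximate a cache flush by writing one byte per 64-byte stride. */
static void evict_caches(void)
{
	size_t i;

	if (!evict_buf)
		evict_buf = malloc(EVICT_BYTES);
	if (!evict_buf)
		return;		/* no buffer: fall back to a hot run */
	for (i = 0; i < EVICT_BYTES; i += 64)
		evict_buf[i] = (unsigned char)i;
}

static double now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Stand-in for the "null" test: a system call that does almost nothing. */
static void null_syscall(void)
{
	(void)syscall(SYS_getpid);
}

/*
 * Mean time per call in usec. When cold is non-zero the eviction pass
 * runs before every call but outside the timed region.
 */
static double bench_usec(void (*fn)(void), int iters, int cold)
{
	double total = 0.0, t0;
	int i;

	for (i = 0; i < iters; i++) {
		if (cold)
			evict_caches();
		t0 = now_usec();
		fn();
		total += now_usec() - t0;
	}
	return total / iters;
}
```

bench_usec(null_syscall, 1000, 0) would give a hot-cache figure and
bench_usec(null_syscall, 1000, 1) a cold-cache one. Timing each call
separately adds clock overhead to every sample, so absolute values from a
sketch like this are not comparable to the numbers in this mail.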

All tests were run on a 12 cpu (Itanium2, 900 MHz, 1.5MB L3), 6 node
system. All cpus are idle with the exception of the cpu running the test.

Tests run on node 0 (as expected) show no improvement when text
replication is enabled. Tests run on other nodes show a significant
improvement when replication is enabled.

Note that these are microbenchmarks. The effect on real applications has
not been determined. Applications that have small cache footprints or
applications that are mostly cpu bound in user code are not expected to
show significant improvement with kernel text replication. In addition,
the improvement from replication is expected to grow as system size
or system activity increases.

Enabling replication reserves 1 additional DTLB entry for kernel code.
This reduces the number of DTLB entries that is available for user code.
There is the potential that this could impact some applications.
Additional measurements are still needed.



------------------------------------------------------
  Cold cache. Running on node 3 of 6 node system

                         NoRep        Rep   %improvement
null                :    0.894 :    0.812 :         9.17
forkexit            :  521.518 :  416.467 :        20.14
openclose           :  106.683 :   75.000 :        29.70
pid                 :    2.577 :    2.356 :         8.58
time                :   17.882 :   11.693 :        34.61
gettimeofday        :   17.523 :   11.695 :        33.26


------------------------------------------------------
   Hot cache. Running on node 3 of 6 node system

                         NoRep        Rep   %improvement
null                :    0.044 :    0.044 :         0.00
forkexit            :  162.019 :  151.927 :         6.23
openclose           :    8.445 :    8.128 :         3.75
pid                 :    0.067 :    0.067 :         0.00
time                :    1.110 :    1.100 :         0.90
gettimeofday        :    1.079 :    1.074 :         0.46
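
The %improvement column in both tables is consistent with the relative
reduction in time, (NoRep - Rep) / NoRep * 100. A small C sketch of that
arithmetic follows; pct_improvement() is a name chosen for this example
only.

```c
#include <assert.h>

/* Percent reduction in time going from norep_usec to rep_usec. */
static double pct_improvement(double norep_usec, double rep_usec)
{
	return (norep_usec - rep_usec) / norep_usec * 100.0;
}
```

For example, pct_improvement(0.894, 0.812) is about 9.17, which matches
the "null" row of the cold-cache table.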







 arch/ia64/Kconfig              |    7 ++
 arch/ia64/kernel/head.S        |   89 +++++++++++++++++++++++++++++
 arch/ia64/kernel/mca_asm.S     |   13 +++-
 arch/ia64/kernel/setup.c       |    6 +
 arch/ia64/kernel/smpboot.c     |    2 
 arch/ia64/kernel/vmlinux.lds.S |   71 +++++++++++++----------
 arch/ia64/mm/init.c            |  125 ++++++++++++++++++++++++++++++++++++++---
 include/asm-ia64/kregs.h       |    7 +-
 include/asm-ia64/numa.h        |   10 +++
 include/asm-ia64/pgtable.h     |    1 
 include/asm-ia64/system.h      |    1 
 11 files changed, 294 insertions(+), 38 deletions(-)



Index: linux/arch/ia64/kernel/head.S
===================================================================
--- linux.orig/arch/ia64/kernel/head.S	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/head.S	2006-04-18 14:36:14.895436090 -0500
@@ -247,6 +247,20 @@ start_ap:
 	;;
 	itr.d dtr[r16]=r18
 	;;
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+	mov r16=IA64_TR_KERNEL_DATA
+	movl r17=KERNEL_DATA_START
+	movl r18=PAGE_KERNEL
+	;;
+	or r18=r2,r18
+	mov cr.ifa=r17
+	;;
+	srlz.i
+	;;
+	itr.d dtr[r16]=r18
+	;;
+#endif
+
 	srlz.i
 
 	/*
@@ -1218,4 +1232,79 @@ tlb_purge_done:
 END(ia64_jump_to_sal)
 #endif /* CONFIG_HOTPLUG_CPU */
 
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+
+#define PSR_BITS_TO_CLEAR						\
+	(IA64_PSR_I | IA64_PSR_IT | IA64_PSR_DT | IA64_PSR_RT |		\
+	 IA64_PSR_DD | IA64_PSR_SS | IA64_PSR_RI | IA64_PSR_ED |	\
+	 IA64_PSR_DFL | IA64_PSR_DFH)
+
+#define PSR_BITS_TO_SET							\
+	(IA64_PSR_BN)
+
+/*
+ * ccNUMA systems bring up all cpus running from the copy of the
+ * kernel that elilo loaded into memory.  Processors that find that
+ * they are not using the kernel text/rodata that is on their local
+ * node can use this routine to reset their TLB mappings to point
+ * at the correct copy.
+ *
+ * This is like the magic trick where you pull a table cloth out
+ * from under a table covered with plates, glasses and silverware,
+ * except in this version we slide an identical tablecloth in to
+ * replace the one we pulled out.
+ *
+ * Inputs:
+ *	in0 = virtual address of local node copy to be mapped
+ */
+
+GLOBAL_ENTRY(remap_kernel_text)
+	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+	alloc loc1=ar.pfs,8,5,7,0
+	mov loc0=rp
+	.body
+	;;
+	mov loc4=ar.rsc			// save RSE configuration
+	mov ar.rsc=0			// put RSE in enforced lazy, LE mode
+	tpa in0=in0
+	;;
+	movl r16=PSR_BITS_TO_CLEAR
+	mov loc3=psr			// save processor status word
+	movl r17=PSR_BITS_TO_SET
+	;;
+	or loc3=loc3,r17
+	;;
+	andcm r16=loc3,r16		// get psr with IT, DT, and RT bits cleared
+	br.call.sptk.few rp=ia64_switch_mode_phys
+.ret3:
+	rsm psr.i | psr.ic
+	movl r24=KERNEL_START
+	movl r25=KERNEL_TR_PAGE_SHIFT<<2
+	movl r21=PAGE_KERNELRX
+	mov r22=IA64_TR_KERNEL
+	;;
+	ptr.i r24,r25			// purge old code mapping
+	;;
+	srlz.i
+	;;
+	mov cr.ifa=r24
+	mov cr.itir=r25
+	or in0=r21,in0
+	;;
+	srlz.i
+	;;
+	itr.i itr[r22]=in0
+	;;
+	srlz.i
+	;;
+	mov r16=loc3
+	br.call.sptk.few rp=ia64_switch_mode_virt // return to virtual mode
+.ret4:	mov ar.rsc=loc4			// restore RSE configuration
+	mov ar.pfs=loc1
+	mov rp=loc0
+	br.ret.sptk.few rp
+END(remap_kernel_text)
+#endif /* CONFIG_KERNEL_TEXT_REPLICATION */
+
 #endif /* CONFIG_SMP */
Index: linux/arch/ia64/kernel/smpboot.c
===================================================================
--- linux.orig/arch/ia64/kernel/smpboot.c	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/smpboot.c	2006-04-18 14:36:14.895436090 -0500
@@ -402,6 +402,8 @@ smp_callin (void)
 
 	smp_setup_percpu_timer();
 
+	check_remap_kernel_text(cpuid);
+
 	ia64_mca_cmc_vector_setup();	/* Setup vector on AP */
 
 #ifdef CONFIG_PERFMON
Index: linux/arch/ia64/mm/init.c
===================================================================
--- linux.orig/arch/ia64/mm/init.c	2006-04-18 14:35:34.407443859 -0500
+++ linux/arch/ia64/mm/init.c	2006-04-18 15:22:15.766146875 -0500
@@ -60,6 +60,88 @@ EXPORT_SYMBOL(zero_page_memmap_ptr);
 #define MAX_PGT_FREES_PER_PASS		16L
 #define PGT_FRACTION_OF_NODE_MEM	16
 
+static unsigned long free_mem_range (void *, void *);
+
+#ifdef	CONFIG_KERNEL_TEXT_REPLICATION
+/*
+ * Set ktreplicate to 0 to disable kernel text replication.
+ */
+static int ktreplicate=1;
+
+static int __init replicate_setup(char *str)
+{
+	get_option(&str, &ktreplicate);
+	return 1;
+}
+
+__setup("ktreplicate=", replicate_setup);
+
+
+/*
+ * Addresses of per-node copies of kernel text/readonly-data
+ */
+static void *kcopybase[MAX_NUMNODES];
+
+/*
+ * Remap the kernel text for this cpu if a closer copy
+ * is available.
+ */
+void
+check_remap_kernel_text(int cpuid)
+{
+	if (kcopybase[node_cpuid[cpuid].nid])
+		remap_kernel_text(kcopybase[node_cpuid[cpuid].nid]);
+}
+
+/*
+ * Make properly aligned copies of kernel text and read-only
+ * data on other nodes.
+ */
+void replicate_kernel(void)
+{
+	extern void *_start_replicate, *_end_replicate;
+	void *kstart = &_start_replicate;
+	void *kend = &_end_replicate;
+	struct page *page;
+	int nid, length, copies = 1;
+	void *addr;
+	int kloadnode;
+	int cpuid = smp_processor_id();
+
+	kloadnode = paddr_to_nid(ia64_tpa(&kcopybase[0]));
+	kcopybase[kloadnode] = ia64_imva(&_start_replicate);
+	kcopybase[node_cpuid[cpuid].nid] = ia64_imva(&_start_replicate);
+
+	if (ktreplicate) {
+		length = kend - kstart;
+		for_each_online_node(nid) {
+			if (nid == kloadnode || nr_cpus_node(nid) == 0)
+				continue;
+
+			page = alloc_pages_node(nid, GFP_KERNEL, get_order(KERNEL_TR_PAGE_SIZE));
+			if (!page) {
+				printk("Could not replicate kernel to node %d\n", nid);
+				continue;
+			}
+			addr = page_address(page);
+			free_mem_range(addr + length, addr + KERNEL_TR_PAGE_SIZE);
+			kcopybase[nid] = addr;
+			memcpy(addr, &_start_replicate, length);
+			copies++;
+		}
+		printk("Replicated kernel to %d nodes\n", copies);
+	} else {
+		printk("Kernel text replication is disabled\n");
+	}
+
+	/*
+	 * Make kernel text read-only. We do this even if replication is disabled.
+	 */
+	check_remap_kernel_text(cpuid);
+
+}
+#endif	/* CONFIG_KERNEL_TEXT_REPLICATION */
+
 static inline long
 max_pgt_pages(void)
 {
@@ -194,22 +276,51 @@ ia64_init_addr_space (void)
 	}
 }
 
-void
-free_initmem (void)
+static unsigned long
+free_mem_range (void *addr, void *eaddr)
 {
-	unsigned long addr, eaddr;
+	unsigned long pages_freed = 0;
 
-	addr = (unsigned long) ia64_imva(__init_begin);
-	eaddr = (unsigned long) ia64_imva(__init_end);
 	while (addr < eaddr) {
 		ClearPageReserved(virt_to_page(addr));
 		init_page_count(virt_to_page(addr));
-		free_page(addr);
+		free_page((u64)addr);
 		++totalram_pages;
 		addr += PAGE_SIZE;
+		++pages_freed;
+	}
+	return pages_freed;
+}
+
+void
+free_initmem (void)
+{
+	void *addr, *eaddr;
+	unsigned long pages_freed = 0;
+	extern char __init_data_begin[], __init_data_end[];
+	int nid;
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+	for_each_online_node(nid) {
+		if (!kcopybase[nid])
+			continue;
+		addr = kcopybase[nid] + ((u64) ia64_imva(&__init_begin) & (KERNEL_TR_PAGE_SIZE-1));
+		eaddr = kcopybase[nid] + ((u64) ia64_imva(&__init_end) & (KERNEL_TR_PAGE_SIZE-1));
+		pages_freed += free_mem_range(addr, eaddr);
 	}
+#else
+	addr = ia64_imva(__init_begin);
+	eaddr = ia64_imva(__init_end);
+	pages_freed += free_mem_range(addr, eaddr);
+#endif
+
+	addr = ia64_imva(__init_data_begin);
+	eaddr = ia64_imva(__init_data_end);
+	pages_freed += free_mem_range(addr, eaddr);
+
 	printk(KERN_INFO "Freeing unused kernel memory: %ldkB freed\n",
-	       (__init_end - __init_begin) >> 10);
+	       (pages_freed << PAGE_SHIFT) >> 10);
+
 }
 
 void __init
Index: linux/arch/ia64/kernel/mca_asm.S
===================================================================
--- linux.orig/arch/ia64/kernel/mca_asm.S	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/mca_asm.S	2006-04-18 14:36:14.899435695 -0500
@@ -96,8 +96,15 @@ ia64_do_tlb_purge:
 	mov r18=KERNEL_TR_PAGE_SHIFT<<2
 	;;
 	ptr.i r16, r18
-	ptr.d r16, r18
 	;;
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+	movl r17=KERNEL_DATA_START
+	;;
+	ptr.d r17, r18
+	;;
+#endif
+
 	srlz.i
 	;;
 	srlz.d
@@ -192,6 +199,10 @@ ia64_reload_tr:
 	;;
         itr.i itr[r16]=r18
 	;;
+	movl r17=KERNEL_DATA_START
+	;;
+	mov cr.ifa=r17
+	;;
         itr.d dtr[r16]=r18
         ;;
 	srlz.i
Index: linux/arch/ia64/kernel/setup.c
===================================================================
--- linux.orig/arch/ia64/kernel/setup.c	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/setup.c	2006-04-18 14:36:14.899435695 -0500
@@ -887,6 +887,12 @@ check_bugs (void)
 {
 	ia64_patch_mckinley_e9((unsigned long) __start___mckinley_e9_bundles,
 			       (unsigned long) __end___mckinley_e9_bundles);
+
+	/*
+	 * This really doesn't belong here but this is the last arch-specific
+	 * callout before starting cpus. Need a better place for this.
+	 */
+	replicate_kernel();
 }
 
 static int __init run_dmi_scan(void)
Index: linux/include/asm-ia64/system.h
===================================================================
--- linux.orig/include/asm-ia64/system.h	2006-04-18 14:35:34.407443859 -0500
+++ linux/include/asm-ia64/system.h	2006-04-18 14:36:14.903435299 -0500
@@ -26,6 +26,7 @@
  * - 0xa000000000000000+3*PERCPU_PAGE_SIZE remain unmapped (guard page)
  */
 #define KERNEL_START		 (GATE_ADDR+0x100000000)
+#define KERNEL_DATA_START	 (GATE_ADDR+0x180000000)
 #define PERCPU_ADDR		(-PERCPU_PAGE_SIZE)
 
 #ifndef __ASSEMBLY__
Index: linux/arch/ia64/kernel/vmlinux.lds.S
===================================================================
--- linux.orig/arch/ia64/kernel/vmlinux.lds.S	2006-04-18 14:35:34.403444255 -0500
+++ linux/arch/ia64/kernel/vmlinux.lds.S	2006-04-18 15:38:28.645844817 -0500
@@ -39,6 +39,7 @@ SECTIONS
   code : { } :code
   . = KERNEL_START;
 
+  _start_replicate = .;
   _text = .;
   _stext = .;
 
@@ -79,8 +80,30 @@ SECTIONS
 	  __stop___mca_table = .;
 	}
 
-  /* Global data */
-  _data = .;
+  .data.patch.vtop : AT(ADDR(.data.patch.vtop) - LOAD_OFFSET)
+	{
+	  __start___vtop_patchlist = .;
+	  *(.data.patch.vtop)
+	  __end___vtop_patchlist = .;
+	}
+
+  .data.patch.mckinley_e9 : AT(ADDR(.data.patch.mckinley_e9) - LOAD_OFFSET)
+	{
+	  __start___mckinley_e9_bundles = .;
+	  *(.data.patch.mckinley_e9)
+	  __end___mckinley_e9_bundles = .;
+	}
+
+#if defined(CONFIG_IA64_GENERIC)
+  /* Machine Vector */
+  . = ALIGN(16);
+  .machvec : AT(ADDR(.machvec) - LOAD_OFFSET)
+	{
+	  machvec_start = .;
+	  *(.machvec)
+	  machvec_end = .;
+	}
+#endif
 
   /* Unwind info & table: */
   . = ALIGN(8);
@@ -98,7 +121,7 @@ SECTIONS
   .opd : AT(ADDR(.opd) - LOAD_OFFSET)
 	{ *(.opd) }
 
-  /* Initialization code and data: */
+  /* Initialization code: */
 
   . = ALIGN(PAGE_SIZE);
   __init_begin = .;
@@ -109,6 +132,21 @@ SECTIONS
 	  _einittext = .;
 	}
 
+  . = ALIGN(PAGE_SIZE);
+  __init_end = .;
+  _end_replicate = .;
+
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+#undef LOAD_OFFSET
+#define LOAD_OFFSET	(KERNEL_DATA_START - KERNEL_TR_PAGE_SIZE)
+. = KERNEL_DATA_START + (. - KERNEL_START);
+#endif
+
+  /* Global read/write data */
+  _data = .;
+
+  /* Initialization data: */
+  __init_data_begin = .;
   .init.data : AT(ADDR(.init.data) - LOAD_OFFSET)
 	{ *(.init.data) }
 
@@ -139,31 +177,6 @@ SECTIONS
 	  __initcall_end = .;
 	}
 
-  .data.patch.vtop : AT(ADDR(.data.patch.vtop) - LOAD_OFFSET)
-	{
-	  __start___vtop_patchlist = .;
-	  *(.data.patch.vtop)
-	  __end___vtop_patchlist = .;
-	}
-
-  .data.patch.mckinley_e9 : AT(ADDR(.data.patch.mckinley_e9) - LOAD_OFFSET)
-	{
-	  __start___mckinley_e9_bundles = .;
-	  *(.data.patch.mckinley_e9)
-	  __end___mckinley_e9_bundles = .;
-	}
-
-#if defined(CONFIG_IA64_GENERIC)
-  /* Machine Vector */
-  . = ALIGN(16);
-  .machvec : AT(ADDR(.machvec) - LOAD_OFFSET)
-	{
-	  machvec_start = .;
-	  *(.machvec)
-	  machvec_end = .;
-	}
-#endif
-
    __con_initcall_start = .;
   .con_initcall.init : AT(ADDR(.con_initcall.init) - LOAD_OFFSET)
 	{ *(.con_initcall.init) }
@@ -173,7 +186,7 @@ SECTIONS
 	{ *(.security_initcall.init) }
   __security_initcall_end = .;
   . = ALIGN(PAGE_SIZE);
-  __init_end = .;
+  __init_data_end = .;
 
   /* The initial task and kernel stack */
   .data.init_task : AT(ADDR(.data.init_task) - LOAD_OFFSET)
Index: linux/include/asm-ia64/kregs.h
===================================================================
--- linux.orig/include/asm-ia64/kregs.h	2006-04-18 14:35:34.407443859 -0500
+++ linux/include/asm-ia64/kregs.h	2006-04-18 14:36:14.903435299 -0500
@@ -27,11 +27,16 @@
 /*
  * Translation registers:
  */
-#define IA64_TR_KERNEL		0	/* itr0, dtr0: maps kernel image (code & data) */
+#define IA64_TR_KERNEL		0	/* itr0, dtr0: maps kernel image (code & readonly data) */
+					/*    also maps RW data if text replication is not enabled */
 #define IA64_TR_PALCODE		1	/* itr1: maps PALcode as required by EFI */
 #define IA64_TR_PERCPU_DATA	1	/* dtr1: percpu data */
 #define IA64_TR_CURRENT_STACK	2	/* dtr2: maps kernel's memory- & register-stacks */
 
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+#define IA64_TR_KERNEL_DATA	3	/* dtr3: maps kernel's global data */
+#endif
+
 /* Processor status register bits: */
 #define IA64_PSR_BE_BIT		1
 #define IA64_PSR_UP_BIT		2
Index: linux/include/asm-ia64/numa.h
===================================================================
--- linux.orig/include/asm-ia64/numa.h	2006-04-18 14:35:34.407443859 -0500
+++ linux/include/asm-ia64/numa.h	2006-04-18 14:36:14.903435299 -0500
@@ -71,4 +71,14 @@ extern int paddr_to_nid(unsigned long pa
 
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_KERNEL_TEXT_REPLICATION
+extern void remap_kernel_text(void *);
+extern void check_remap_kernel_text(int);
+extern void replicate_kernel(void);
+#else
+#define remap_kernel_text(p)
+#define check_remap_kernel_text(c)
+#define replicate_kernel()
+#endif
+
 #endif /* _ASM_IA64_NUMA_H */
Index: linux/include/asm-ia64/pgtable.h
===================================================================
--- linux.orig/include/asm-ia64/pgtable.h	2006-04-18 14:35:34.411443463 -0500
+++ linux/include/asm-ia64/pgtable.h	2006-04-18 14:36:14.903435299 -0500
@@ -146,6 +146,7 @@
 #define PAGE_COPY_EXEC	__pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RX)
 #define PAGE_GATE	__pgprot(__ACCESS_BITS | _PAGE_PL_0 | _PAGE_AR_X_RX)
 #define PAGE_KERNEL	__pgprot(__DIRTY_BITS  | _PAGE_PL_0 | _PAGE_AR_RWX)
+#define PAGE_KERNELR	__pgprot(__ACCESS_BITS | _PAGE_PL_0 | _PAGE_AR_R)
 #define PAGE_KERNELRX	__pgprot(__ACCESS_BITS | _PAGE_PL_0 | _PAGE_AR_RX)
 
 # ifndef __ASSEMBLY__
Index: linux/arch/ia64/Kconfig
===================================================================
--- linux.orig/arch/ia64/Kconfig	2006-04-18 14:35:34.407443859 -0500
+++ linux/arch/ia64/Kconfig	2006-04-18 14:36:14.907434903 -0500
@@ -260,6 +260,13 @@ config NR_CPUS
 	  than 64 will cause the use of a CPU mask array, causing a small
 	  performance hit.
 
+config KERNEL_TEXT_REPLICATION
+	bool "Kernel text replication"
+	depends on NUMA
+	default n
+	help
+	  Say Y if you want to eeplicate kernel text on each node of a NUMA system.
+
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
 	depends on SMP && EXPERIMENTAL
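
The address arithmetic used by free_initmem() and replicate_kernel() in
the patch above can be sketched in user-space C as follows. This is only
an illustration. GRANULE_SIZE stands in for KERNEL_TR_PAGE_SIZE and is
assumed to be 64MB here, and addr_in_copy() and unused_tail() are
hypothetical helpers, not functions from the patch. The sketch assumes
that the kernel's virtual start address is aligned to the granule size
and that the replicated image fits in one granule.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 64MB; stands in for KERNEL_TR_PAGE_SIZE. */
#define GRANULE_SIZE	(1ULL << 26)

/*
 * Address inside a node-local copy that corresponds to kernel virtual
 * address kaddr: keep the offset within the granule, swap the base.
 */
static uint64_t addr_in_copy(uint64_t copy_base, uint64_t kaddr)
{
	return copy_base + (kaddr & (GRANULE_SIZE - 1));
}

/*
 * Bytes of the granule left over after an image of image_len bytes.
 * This is the tail that replicate_kernel() gives back to the allocator.
 */
static uint64_t unused_tail(uint64_t image_len)
{
	return GRANULE_SIZE - image_len;
}
```

With a copy at 0xe000003000000000 (an address picked for the example), a
kernel address 0x500000 bytes past KERNEL_START maps to
0xe000003000500000 in that copy.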

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [RFC] - Kernel text replication on IA64
From: Luck, Tony @ 2006-04-20 16:41 UTC
  To: Jack Steiner; +Cc: linux-ia64, lee.schermerhorn, clameter, linux-mm

On Thu, Apr 20, 2006 at 08:53:16AM -0500, Jack Steiner wrote:

> There was a question about the effects of kernel text replication last
> month.  I was curious so I resurrected an old trillian patch (Tony Luck's)
> & got it working again. Here is the preliminary patch & some data about
> the benefit.

It's not *that* old ... just from Atlas days, not from Trillian.  Google
carbon dating shows old versions of this patch from around the August
2002 time frame (against 2.4.19).

> All tests were run on a 12 cpu (Itanium2, 900 MHz, 1.5MB L3), 6 node
> system. All cpus are idle with the exception of the cpu running the test.

Presumably results would be even better on a bigger system where the
distance from the test node to node 0 is even bigger.

On truly huge systems does node0 (or the interconnects leading to it) ever
suffer measurable slowdown from the instruction fetch traffic coming from
the other 255 (or more) nodes?

> Enabling replication reserves 1 additional DTLB entry for kernel code.
> This reduces the number of DTLB entries that is available for user code.
> There is the potential that this could impact some applications.
> Additional measurements are still needed.

Ken's recent patch to free up the DTLB that is currently used for per-cpu
data would mitigate this (though I'm sure he'll be unamused if I blow the
1.6% gain he saw on his transaction processing benchmark on this :-)

> ------------------------------------------------------
>   Cold cache. Running on node 3 of 6 node system
> 
>                          NoRep        Rep   %improvement
> null                :    0.894 :    0.812 :         9.17
> forkexit            :  521.518 :  416.467 :        20.14
> openclose           :  106.683 :   75.000 :        29.70
> pid                 :    2.577 :    2.356 :         8.58
> time                :   17.882 :   11.693 :        34.61
> gettimeofday        :   17.523 :   11.695 :        33.26

Those are some pretty nice numbers.

> ------------------------------------------------------
>    Hot cache. Running on node 3 of 6 node system
> 
>                          NoRep        Rep   %improvement
> forkexit            :  162.019 :  151.927 :         6.23
> openclose           :    8.445 :    8.128 :         3.75

These ones are good too.  But a bit surprising ... it implies that we
are still seeing significant kernel-code i-cache misses even in a
micro-benchmark tight loop.  Montecito (with the big L2 icache) should
be better here (and so see less improvement with replicated text).

> +	  Say Y if you want to eeplicate kernel text on each node of a NUMA system.

s/eeplicate/replicate/

-Tony


* RE: [RFC] - Kernel text replication on IA64
From: Chen, Kenneth W @ 2006-04-20 17:48 UTC
  To: Luck, Tony, Jack Steiner; +Cc: linux-ia64, lee.schermerhorn, clameter, linux-mm

Luck, Tony wrote on Thursday, April 20, 2006 9:41 AM
> On Thu, Apr 20, 2006 at 08:53:16AM -0500, Jack Steiner wrote:
> > Enabling replication reserves 1 additional DTLB entry for kernel code.
> > This reduces the number of DTLB entries that is available for user code.
> > There is the potential that this could impact some applications.
> > Additional measurements are still needed.
> 
> Ken's recent patch to free up the DTLB that is currently used for per-cpu
> data would mitigate this (though I'm sure he'll be unamused if I blow the
> 1.6% gain he saw on his transaction processing benchmark on this :-)

How much benefit is there in having the readonly section replicated?  Do you
really have to use two DTRs - one to map the readonly and one to map rw?

What about just replicating text so we don't need to burn an extra DTR?

- Ken

