* [patch 0/2] Immediate Values - jump patching update
@ 2008-04-28  3:34 Mathieu Desnoyers
  2008-04-28  3:34 ` [patch 1/2] Immediate Values - jump liveliness Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-04-28  3:34 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, H. Peter Anvin

Hi Ingo,

Here is the update to the jump patching optimization, taking care of Peter's
comments about register liveliness and instruction reuse by gcc optimizations.
A good thing: it actually simplifies the code. Unfortunately, it adds 3 bytes
to the instructions in i-cache, because I now have to use a 5-byte mov
instruction so I can replace it with a 5-byte jump. Therefore, 9 bytes are
added to rather small functions (5-byte mov + 2-byte test + 2-byte conditional
branch) and 13 bytes are added to larger functions, which need a 6-byte
conditional branch at the branch site.

Instead of having to execute a sequence of nop, nop and jump, we now only have
to execute the near jump, which jumps either to the address following the
conditional branch or to the target address of the conditional branch,
depending on the immediate value variable state.

Thanks to Peter for the review.

Those two patches apply on top of sched-devel.git.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 1/2] Immediate Values - jump liveliness
  2008-04-28  3:34 [patch 0/2] Immediate Values - jump patching update Mathieu Desnoyers
@ 2008-04-28  3:34 ` Mathieu Desnoyers
  2008-04-28  3:34 ` [patch 2/2] Markers - use imv_cond " Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-04-28  3:34 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, H. Peter Anvin; +Cc: Mathieu Desnoyers

[-- Attachment #1: immediate-values-jump-liveliness.patch --]
[-- Type: text/plain, Size: 22656 bytes --]

Check for limited liveliness of ZF and %eax in the search pattern.
Declare a 5-byte mov in the inline assembly stub and patch it into a 5-byte
jump. We would not want to modify an instruction which gcc reuses for other
purposes (it could jump into the middle of the pattern), but changing our own
mov is OK, because we know that the liveliness of %eax is limited to our
pattern (mov, test, branch). It is therefore OK to change:

mov, test, cond. branch
for
jump, test, cond. branch

where the jump has the same behavior as the conditional branch.
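
As a sketch of the hazard this guards against, consider a hypothetical
sequence gcc would be free to emit if %eax stayed live past the pattern:

	mov  $0,%eax		/* imv_cond() read, our patch site */
	test %eax,%eax
	jne  2f
	mov  %eax,(%edx)	/* gcc reusing %eax after the branch */

Patching the mov into a jmp over the test and branch would feed garbage into
that later use of %eax (and, similarly, other code could branch into the
middle of the pattern). The imv_cond_end() markers record the two epilogue
addresses, so the pattern detection only patches when both the fall-through
address and the branch target are found in the __imv_cond_end table, i.e.
when %eax and ZF provably die there.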

This applies on top of the "Immediate Values - jump" patch (and should be
folded into that patch).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/immediate.c       |  228 +++++++++++++++++---------------------
 include/asm-generic/vmlinux.lds.h |    4 
 include/asm-powerpc/immediate.h   |    1 
 include/asm-x86/immediate.h       |   58 ++++++---
 include/linux/immediate.h         |    4 
 include/linux/module.h            |    8 +
 kernel/immediate.c                |   38 ++++++
 kernel/module.c                   |   23 +++
 8 files changed, 226 insertions(+), 138 deletions(-)

Index: linux-2.6-sched-devel/include/asm-powerpc/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-powerpc/immediate.h	2008-04-26 10:53:53.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-powerpc/immediate.h	2008-04-26 10:54:44.000000000 -0400
@@ -69,6 +69,7 @@ struct __imv {
 	})
 
 #define imv_cond(name)	imv_read(name)
+#define imv_cond_end()
 
 extern int arch_imv_update(const struct __imv *imv, int early);
 
Index: linux-2.6-sched-devel/include/asm-x86/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-x86/immediate.h	2008-04-26 10:53:53.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-x86/immediate.h	2008-04-26 15:38:37.000000000 -0400
@@ -20,8 +20,8 @@ struct __imv {
 				 * Pointer to the memory location of the
 				 * immediate value within the instruction.
 				 */
-	int  jmp_off;		/* offset for jump target */
-	unsigned char size;	/* Type size. */
+	unsigned char var_size;	/* Type size of variable. */
+	unsigned char size;	/* Type size of immediate value. */
 	unsigned char insn_size;/* Instruction size. */
 } __attribute__ ((packed));
 
@@ -58,8 +58,7 @@ struct __imv {
 				".previous\n\t"				\
 				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".int 0\n\t"				\
-				".byte %c2, (2b-1b)\n\t"		\
+				".byte %c2, %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				"mov $0,%0\n\t"				\
 				"3:\n\t"				\
@@ -76,8 +75,7 @@ struct __imv {
 				".previous\n\t"				\
 				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".int 0\n\t"				\
-				".byte %c2, (2b-1b)\n\t"		\
+				".byte %c2, %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
 				"mov $0,%0\n\t"				\
@@ -98,8 +96,7 @@ struct __imv {
 				".previous\n\t"				\
 				".section __imv,\"aw\",@progbits\n\t"	\
 				_ASM_PTR "%c1, (3f)-%c2\n\t"		\
-				".int 0\n\t"				\
-				".byte %c2, (2b-1b)\n\t"		\
+				".byte %c2, %c2, (2b-1b)\n\t"		\
 				".previous\n\t"				\
 				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
 				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
@@ -113,33 +110,60 @@ struct __imv {
 	})
 
 /*
- * Uses %al.
- * size is 0.
- * Use in if (unlikely(imv_cond(var)))
+ * Uses %eax.
+ * immediate value size is declared as 0.
+ * Use in
+ * if (unlikely(imv_cond(var))) {
+ *   imv_cond_end();
+ *   ...
+ * } else {
+ *   imv_cond_end();
+ *   ...
+ * }
  * Given a char as argument.
+ * If the expected code pattern ensuring correct liveliness of ZF and %eax
+ * isn't met, fall back on a standard immediate value.
+ * Patches the 5-byte mov into an e9 XX XX XX XX (near jump).
+ * Note : on fallback, the 4-byte immediate value is patched with the
+ * 1-byte variable.
  */
 #define imv_cond(name)							\
 	({								\
-		__typeof__(name##__imv) value;				\
-		BUILD_BUG_ON(sizeof(value) > 1);			\
+		uint32_t value;						\
+		BUILD_BUG_ON(sizeof(__typeof__(name##__imv)) > 1);	\
 		asm (".section __discard,\"\",@progbits\n\t"		\
 			"1:\n\t"					\
 			"mov $0,%0\n\t"					\
 			"2:\n\t"					\
 			".previous\n\t"					\
 			".section __imv,\"aw\",@progbits\n\t"		\
-			_ASM_PTR "%c1, (3f)-1\n\t"			\
-			".int 0\n\t"					\
-			".byte %c2, (2b-1b)\n\t"			\
+			_ASM_PTR "%c1, (3f)-%c2\n\t"			\
+			".byte %c3, 0, (2b-1b)\n\t"			\
 			".previous\n\t"					\
 			"mov $0,%0\n\t"					\
 			"3:\n\t"					\
 			: "=a" (value)					\
 			: "i" (&name##__imv),				\
-			  "i" (0));					\
+			  "i" (sizeof(value)),				\
+			  "i" (sizeof(__typeof__(name##__imv))));	\
 		value;							\
 	})
 
+/*
+ * Make sure the %eax register and ZF are not live anymore at the current
+ * address, which is declared in the __imv_cond_end section.
+ * All asm statements clobber the flags; add a "cc" clobber just to be sure.
+ * Clobbers %eax.
+ */
+#define imv_cond_end()							\
+	do {								\
+		asm (".section __imv_cond_end,\"a\",@progbits\n\t"	\
+				_ASM_PTR "1f\n\t"			\
+				".previous\n\t"				\
+				"1:\n\t"				\
+				: : : "eax", "cc");			\
+	} while (0)
+
 extern int arch_imv_update(struct __imv *imv, int early);
 
 #endif /* _ASM_X86_IMMEDIATE_H */
Index: linux-2.6-sched-devel/include/linux/immediate.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/immediate.h	2008-04-26 10:53:56.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/immediate.h	2008-04-26 10:54:44.000000000 -0400
@@ -37,6 +37,9 @@ extern void imv_update_range(struct __im
 extern void imv_unref_core_init(void);
 extern void imv_unref(struct __imv *begin, struct __imv *end, void *start,
 		unsigned long size);
+extern int _is_imv_cond_end(unsigned long *begin, unsigned long *end,
+		unsigned long addr1, unsigned long addr2);
+extern int is_imv_cond_end(unsigned long addr1, unsigned long addr2);
 
 #else
 
@@ -59,6 +62,7 @@ extern void imv_unref(struct __imv *begi
  * Reads the value of @name.
  */
 #define imv_cond(name)			_imv_read(name)
+#define imv_cond_end()
 
 /**
  * imv_set - set immediate variable (with locking)
Index: linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-generic/vmlinux.lds.h	2008-04-26 10:53:53.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h	2008-04-26 10:54:44.000000000 -0400
@@ -63,6 +63,10 @@
 		VMLINUX_SYMBOL(__start_rodata) = .;			\
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
+		. = ALIGN(8);						\
+		VMLINUX_SYMBOL(__start___imv_cond_end) = .;		\
+		*(__imv_cond_end) /* Immediate condition end pointers */\
+		VMLINUX_SYMBOL(__stop___imv_cond_end) = .;		\
 		*(__markers_strings)	/* Markers: strings */		\
 	}								\
 									\
Index: linux-2.6-sched-devel/include/linux/module.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/module.h	2008-04-26 10:53:56.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/module.h	2008-04-26 10:54:44.000000000 -0400
@@ -359,6 +359,8 @@ struct module
 #ifdef CONFIG_IMMEDIATE
 	struct __imv *immediate;
 	unsigned int num_immediate;
+	unsigned long *immediate_cond_end;
+	unsigned int num_immediate_cond_end;
 #endif
 #ifdef CONFIG_MARKERS
 	struct marker *markers;
@@ -581,6 +583,7 @@ static inline void module_update_markers
 #if defined(CONFIG_MODULES) && defined(CONFIG_IMMEDIATE)
 extern void _module_imv_update(void);
 extern void module_imv_update(void);
+extern int is_imv_cond_end_module(unsigned long addr1, unsigned long addr2);
 #else
 static inline void _module_imv_update(void)
 {
@@ -588,6 +591,11 @@ static inline void _module_imv_update(vo
 static inline void module_imv_update(void)
 {
 }
+static inline int is_imv_cond_end_module(unsigned long addr1,
+		unsigned long addr2)
+{
+	return 0;
+}
 #endif
 
 struct device_driver;
Index: linux-2.6-sched-devel/kernel/immediate.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/immediate.c	2008-04-26 10:53:56.000000000 -0400
+++ linux-2.6-sched-devel/kernel/immediate.c	2008-04-26 10:54:44.000000000 -0400
@@ -29,6 +29,8 @@ static int imv_early_boot_complete;
 
 extern struct __imv __start___imv[];
 extern struct __imv __stop___imv[];
+extern unsigned long __start___imv_cond_end[];
+extern unsigned long __stop___imv_cond_end[];
 
 /*
  * imv_mutex nests inside module_mutex. imv_mutex protects builtin
@@ -103,6 +105,42 @@ void imv_unref_core_init(void)
 		(unsigned long)__init_end - (unsigned long)__init_begin);
 }
 
+int _is_imv_cond_end(unsigned long *begin, unsigned long *end,
+		unsigned long addr1, unsigned long addr2)
+{
+	unsigned long *iter;
+	int found = 0;
+
+	for (iter = begin; iter < end; iter++) {
+		if (*iter == addr1)	/* deals with addr1 == addr2 */
+			found++;
+		if (*iter == addr2)
+			found++;
+		if (found == 2)
+			return 1;
+	}
+	return 0;
+}
+
+/**
+ * is_imv_cond_end
+ *
+ * Check if the two given addresses are located in the immediate value condition
+ * end table. Addresses should be in the same object.
+ * The module mutex should be held when calling this function for non-core
+ * addresses.
+ */
+int is_imv_cond_end(unsigned long addr1, unsigned long addr2)
+{
+	if (core_kernel_text(addr1)) {
+		return _is_imv_cond_end(__start___imv_cond_end,
+			__stop___imv_cond_end, addr1, addr2);
+	} else {
+		return is_imv_cond_end_module(addr1, addr2);
+	}
+	return 0;
+}
+
 void __init imv_init_complete(void)
 {
 	imv_early_boot_complete = 1;
Index: linux-2.6-sched-devel/kernel/module.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/module.c	2008-04-26 10:53:56.000000000 -0400
+++ linux-2.6-sched-devel/kernel/module.c	2008-04-26 10:54:44.000000000 -0400
@@ -1720,6 +1720,7 @@ static struct module *load_module(void _
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
 	unsigned int immediateindex;
+	unsigned int immediatecondendindex;
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
 	struct module *mod;
@@ -1819,6 +1820,8 @@ static struct module *load_module(void _
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
 	immediateindex = find_sec(hdr, sechdrs, secstrings, "__imv");
+	immediatecondendindex = find_sec(hdr, sechdrs, secstrings,
+		"__imv_cond_end");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1981,6 +1984,11 @@ static struct module *load_module(void _
 	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
 	mod->num_immediate =
 		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+	mod->immediate_cond_end =
+		(void *)sechdrs[immediatecondendindex].sh_addr;
+	mod->num_immediate_cond_end =
+		sechdrs[immediatecondendindex].sh_size
+			/ sizeof(*mod->immediate_cond_end);
 #endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
@@ -2645,4 +2653,19 @@ void module_imv_update(void)
 	mutex_unlock(&module_mutex);
 }
 EXPORT_SYMBOL_GPL(module_imv_update);
+
+/**
+ * is_imv_cond_end_module
+ *
+ * Check if the two given addresses are located in the immediate value condition
+ * end table. Addresses should be in the same object.
+ * The module mutex should be held.
+ */
+int is_imv_cond_end_module(unsigned long addr1, unsigned long addr2)
+{
+	struct module *mod = __module_text_address(addr1);
+	return _is_imv_cond_end(mod->immediate_cond_end,
+		mod->immediate_cond_end + mod->num_immediate_cond_end,
+		addr1, addr2);
+}
 #endif
Index: linux-2.6-sched-devel/arch/x86/kernel/immediate.c
===================================================================
--- linux-2.6-sched-devel.orig/arch/x86/kernel/immediate.c	2008-04-26 10:53:41.000000000 -0400
+++ linux-2.6-sched-devel/arch/x86/kernel/immediate.c	2008-04-26 15:43:57.000000000 -0400
@@ -80,10 +80,7 @@
 #include <asm/cacheflush.h>
 
 #define BREAKPOINT_INSTRUCTION  0xcc
-#define JMP_REL8		0xeb
 #define JMP_REL32		0xe9
-#define INSN_NOP1		0x90
-#define INSN_NOP2		0x89, 0xf6
 #define BREAKPOINT_INS_LEN	1
 #define NR_NOPS			10
 
@@ -156,22 +153,14 @@ static int imv_notifier(struct notifier_
 		if (args->regs->ip == target_after_int3) {
 			/* deal with non-relocatable jmp instructions */
 			switch (*(uint8_t *)bypass_eip) {
-			case JMP_REL8: /* eb cb       jmp rel8 */
-				args->regs->ip +=
-					*(signed char *)(bypass_eip + 1) + 1;
-				return NOTIFY_STOP;
 			case JMP_REL32: /* e9 cw    jmp rel16 (valid on ia32) */
 					/* e9 cd    jmp rel32 */
+					/* Skip the 4 bytes (jump offset)
+					 * following the breakpoint. */
 				args->regs->ip +=
 					*(int *)(bypass_eip + 1) + 4;
+				/* The bypass won't be executed. */
 				return NOTIFY_STOP;
-			case INSN_NOP1:
-				/* deal with insertion of nop + jmp_rel32 */
-				if (*((uint8_t *)bypass_eip + 1) == JMP_REL32) {
-					args->regs->ip +=
-						*(int *)(bypass_eip + 2) + 5;
-					return NOTIFY_STOP;
-				}
 			}
 			preempt_disable();
 			args->regs->ip = bypass_eip;
@@ -197,38 +186,62 @@ static struct notifier_block imv_notify 
 static inline int detect_mov_test_jne(uint8_t *addr, uint8_t **opcode,
 		uint8_t **jmp_offset, int *offset_len)
 {
+	uint8_t *addr1, *addr2;
+
 	printk_dbg(KERN_DEBUG "Trying at %p %hx %hx %hx %hx %hx %hx\n",
 		addr, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
-	/* b0 cb    movb cb,%al */
-	if (addr[0] != 0xb0)
+	/*
+	 * b8 imm32        mov  imm32,%eax or
+	 * e9 XX XX XX XX  jmp  rel32 (patched instruction)
+	 */
+	if (addr[0] != 0xb8 && addr[0] != JMP_REL32)
 		return -1;
-	/* 84 c0    test %al,%al */
-	if (addr[2] != 0x84 || addr[3] != 0xc0)
+	/* 85 c0    test %eax,%eax */
+	if (addr[5] != 0x85 || addr[6] != 0xc0)
 		return -1;
-	printk_dbg(KERN_DEBUG "Found test %%al,%%al at %p\n", addr + 2);
-	switch (addr[4]) {
+	printk_dbg(KERN_DEBUG "Found test %%eax,%%eax at %p\n", addr + 5);
+	switch (addr[7]) {
 	case 0x75: /* 75 cb       jne rel8 */
-		printk_dbg(KERN_DEBUG "Found jne rel8 at %p\n", addr + 4);
-		*opcode = addr + 4;
-		*jmp_offset = addr + 5;
+		printk_dbg(KERN_DEBUG "Found jne rel8 at %p\n", addr + 7);
+		*opcode = addr + 7;
+		*jmp_offset = addr + 8;
 		*offset_len = 1;
-		return 0;
+		/* addr1 is the address following the branch instruction */
+		addr1 = addr + 9;
+		/* addr2 is the branch target */
+		addr2 = addr1 + *(signed char *)(addr + 8);
+		break;
 	case 0x0f:
-		switch (addr[5]) {
+		switch (addr[8]) {
 		case 0x85:	 /* 0F 85 cw    jne rel16 (valid on ia32) */
 				 /* 0F 85 cd    jne rel32 */
 			printk_dbg(KERN_DEBUG "Found jne rel16/32 at %p\n",
-				addr + 5);
-			*opcode = addr + 4;
-			*jmp_offset = addr + 6;
+				addr + 8);
+			*opcode = addr + 7;
+			*jmp_offset = addr + 9;
 			*offset_len = 4;
-			return 0;
+			/*
+			 * addr1 is the address following the branch
+			 * instruction
+			 */
+			addr1 = addr + 13;
+			/* addr2 is the branch target */
+			addr2 = addr1 + *(int *)(addr + 9);
+			break;
 		default:
 			return -1;
 		}
 		break;
 	default: return -1;
 	}
+	/*
+	 * Now check that the pattern is squeezed between the mov instruction
+	 * and the two epilogues (branch taken and not taken), which ensures
+	 * that %eax and ZF liveliness is limited to our instructions.
+	 */
+	if (!is_imv_cond_end((unsigned long)addr1, (unsigned long)addr2))
+		return -1;
+	return 0;
 }
 
 /*
@@ -238,36 +251,60 @@ static inline int detect_mov_test_jne(ui
 static inline int detect_mov_test_je(uint8_t *addr, uint8_t **opcode,
 		uint8_t **jmp_offset, int *offset_len)
 {
-	/* b0 cb    movb cb,%al */
-	if (addr[0] != 0xb0)
+	uint8_t *addr1, *addr2;
+
+	/*
+	 * b8 imm32        mov  imm32,%eax or
+	 * e9 XX XX XX XX  jmp  rel32 (patched instruction)
+	 */
+	if (addr[0] != 0xb8 && addr[0] != JMP_REL32)
 		return -1;
-	/* 84 c0    test %al,%al */
-	if (addr[2] != 0x84 || addr[3] != 0xc0)
+	/* 85 c0    test %eax,%eax */
+	if (addr[5] != 0x85 || addr[6] != 0xc0)
 		return -1;
-	printk_dbg(KERN_DEBUG "Found test %%al,%%al at %p\n", addr + 2);
-	switch (addr[4]) {
+	printk_dbg(KERN_DEBUG "Found test %%eax,%%eax at %p\n", addr + 5);
+	switch (addr[7]) {
 	case 0x74: /* 74 cb       je rel8 */
-		printk_dbg(KERN_DEBUG "Found je rel8 at %p\n", addr + 4);
-		*opcode = addr + 4;
-		*jmp_offset = addr + 5;
+		printk_dbg(KERN_DEBUG "Found je rel8 at %p\n", addr + 7);
+		*opcode = addr + 7;
+		*jmp_offset = addr + 8;
 		*offset_len = 1;
-		return 0;
+		/* addr1 is the address following the branch instruction */
+		addr1 = addr + 9;
+		/* addr2 is the branch target */
+		addr2 = addr1 + *(signed char *)(addr + 8);
+		break;
 	case 0x0f:
-		switch (addr[5]) {
+		switch (addr[8]) {
 		case 0x84:	 /* 0F 84 cw    je rel16 (valid on ia32) */
 				 /* 0F 84 cd    je rel32 */
 			printk_dbg(KERN_DEBUG "Found je rel16/32 at %p\n",
-				addr + 5);
-			*opcode = addr + 4;
-			*jmp_offset = addr + 6;
+				addr + 8);
+			*opcode = addr + 7;
+			*jmp_offset = addr + 9;
 			*offset_len = 4;
-			return 0;
+			/*
+			 * addr1 is the address following the branch
+			 * instruction
+			 */
+			addr1 = addr + 13;
+			/* addr2 is the branch target */
+			addr2 = addr1 + *(int *)(addr + 9);
+			break;
 		default:
 			return -1;
 		}
 		break;
 	default: return -1;
 	}
+	/*
+	 * Now check that the pattern is squeezed between the mov instruction
+	 * and the two epilogues (branch taken and not taken), which ensures
+	 * that %eax and ZF liveliness is limited to our instructions.
+	 */
+	if (!is_imv_cond_end((unsigned long)addr1, (unsigned long)addr2))
+		return -1;
+	return 0;
 }
 
 static int static_early;
@@ -366,95 +403,42 @@ static int patch_jump_target(struct __im
 	uint8_t *opcode, *jmp_offset;
 	int offset_len;
 	int mov_test_j_found = 0;
+	unsigned long logicvar;
 
 	if (!detect_mov_test_jne((uint8_t *)imv->imv - 1,
 			&opcode, &jmp_offset, &offset_len)) {
-		imv->insn_size = 1;	/* positive logic */
+		logicvar = imv->var;	/* positive logic */
 		mov_test_j_found = 1;
 	} else if (!detect_mov_test_je((uint8_t *)imv->imv - 1,
 			&opcode, &jmp_offset, &offset_len)) {
-		imv->insn_size = 0;	/* negative logic */
+		logicvar = !imv->var;	/* negative logic */
 		mov_test_j_found = 1;
 	}
 
 	if (mov_test_j_found) {
-		int logicvar = imv->insn_size ? imv->var : !imv->var;
 		int newoff;
 
 		if (offset_len == 1) {
-			imv->jmp_off = *(signed char *)jmp_offset;
-			/* replace with JMP_REL8 opcode. */
-			replace_instruction_safe(opcode,
-				((unsigned char[]){ JMP_REL8,
-				(logicvar ? (signed char)imv->jmp_off : 0) }),
-				2);
+			if (logicvar)
+				newoff = *(signed char *)jmp_offset;
+			else
+				newoff = 0;
+			newoff += 4; /* jump over test, branch */
 		} else {
-			/* replace with nop and JMP_REL16/32 opcode.
-			 * It's ok to shrink an instruction, never ok to
-			 * grow it afterward. */
-			imv->jmp_off = *(int *)jmp_offset;
-			newoff = logicvar ? (int)imv->jmp_off : 0;
-			replace_instruction_safe(opcode,
-				((unsigned char[]){ INSN_NOP1, JMP_REL32,
-				((unsigned char *)&newoff)[0],
-				((unsigned char *)&newoff)[1],
-				((unsigned char *)&newoff)[2],
-				((unsigned char *)&newoff)[3] }),
-				6);
+			if (logicvar)
+				newoff = *(int *)jmp_offset;
+			else
+				newoff = 0;
+			newoff += 8; /*
+				      * jump over test (2 bytes)
+				      * and branch (6 bytes).
+				      */
 		}
-		/* now we can get rid of the movb */
-		replace_instruction_safe((uint8_t *)imv->imv - 1,
-			((unsigned char[]){ INSN_NOP2 }),
-			2);
-		/* now we can get rid of the testb */
-		replace_instruction_safe((uint8_t *)imv->imv + 1,
-			((unsigned char[]){ INSN_NOP2 }),
-			2);
-		/* remember opcode + 1 to enable the JMP_REL patching */
-		if (offset_len == 1)
-			imv->imv = (unsigned long)opcode + 1;
-		else
-			imv->imv = (unsigned long)opcode + 2;	/* skip nop */
-		return 0;
-
-	}
-
-	if (*((uint8_t *)imv->imv - 1) == JMP_REL8) {
-		int logicvar = imv->insn_size ? imv->var : !imv->var;
-
-		printk_dbg(KERN_DEBUG "Found JMP_REL8 at %p\n",
-			((uint8_t *)imv->imv - 1));
 		/* Speed up by skipping if not changed */
-		if (logicvar) {
-			if (*(int8_t *)imv->imv == (int8_t)imv->jmp_off)
-				return 0;
-		} else {
-			if (*(int8_t *)imv->imv == 0)
-				return 0;
-		}
-
-		replace_instruction_safe((uint8_t *)imv->imv - 1,
-			((unsigned char[]){ JMP_REL8,
-			(logicvar ? (signed char)imv->jmp_off : 0) }),
-			2);
-		return 0;
-	}
-
-	if (*((uint8_t *)imv->imv - 1) == JMP_REL32) {
-		int logicvar = imv->insn_size ? imv->var : !imv->var;
-		int newoff = logicvar ? (int)imv->jmp_off : 0;
-
-		printk_dbg(KERN_DEBUG "Found JMP_REL32 at %p, update with %x\n",
-			((uint8_t *)imv->imv - 1), newoff);
-		/* Speed up by skipping if not changed */
-		if (logicvar) {
-			if (*(int *)imv->imv == (int)imv->jmp_off)
-				return 0;
-		} else {
-			if (*(int *)imv->imv == 0)
-				return 0;
-		}
-
+		if (*(uint8_t *)(imv->imv - 1) == JMP_REL32 &&
+				*(int *)imv->imv == newoff)
+			return 0;
+		/* replace with a 5 bytes jump */
 		replace_instruction_safe((uint8_t *)imv->imv - 1,
 			((unsigned char[]){ JMP_REL32,
 			((unsigned char *)&newoff)[0],
@@ -464,7 +448,6 @@ static int patch_jump_target(struct __im
 			5);
 		return 0;
 	}
-
 	/* Nothing known found. */
 	return -1;
 }
@@ -498,7 +481,7 @@ __kprobes int arch_imv_update(struct __i
 				"Jump target fallback at %lX, nr fail %d\n",
 				imv->imv, ++nr_fail);
 #endif
-			imv->size = 1;
+			imv->size = 4;
 		} else {
 #ifdef DEBUG_IMMEDIATE
 			static int nr_success;
@@ -533,7 +516,8 @@ __kprobes int arch_imv_update(struct __i
 	 * If the variable and the instruction have the same value, there is
 	 * nothing to do.
 	 */
-	switch (imv->size) {
+	BUG_ON(imv->var_size > imv->size);
+	switch (imv->var_size) {
 	case 1:	if (*(uint8_t *)imv->imv == *(uint8_t *)imv->var)
 			return 0;
 		break;
@@ -552,7 +536,9 @@ __kprobes int arch_imv_update(struct __i
 	}
 
 	memcpy(buf, (uint8_t *)insn, opcode_size);
-	memcpy(&buf[opcode_size], (void *)imv->var, imv->size);
+	memcpy(&buf[opcode_size], (void *)imv->var, imv->var_size);
+	/* pad MSBs with 0 */
+	memset(&buf[opcode_size + imv->var_size], 0, imv->size - imv->var_size);
 	replace_instruction_safe((uint8_t *)insn, buf, imv->insn_size);
 
 	return 0;

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [patch 2/2] Markers - use imv_cond jump liveliness
  2008-04-28  3:34 [patch 0/2] Immediate Values - jump patching update Mathieu Desnoyers
  2008-04-28  3:34 ` [patch 1/2] Immediate Values - jump liveliness Mathieu Desnoyers
@ 2008-04-28  3:34 ` Mathieu Desnoyers
  2008-04-28 12:48 ` [patch 0/2] Immediate Values - jump patching update Ingo Molnar
  2008-04-28 17:21 ` H. Peter Anvin
  3 siblings, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-04-28  3:34 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, H. Peter Anvin; +Cc: Mathieu Desnoyers

[-- Attachment #1: markers-use-imv-jump-liveliness.patch --]
[-- Type: text/plain, Size: 1294 bytes --]

Provide the imv_cond_end() macro to mark the end of ZF and %eax liveliness
for imv_cond.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/linux/marker.h |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6-sched-devel/include/linux/marker.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/marker.h	2008-04-25 13:46:14.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/marker.h	2008-04-25 13:47:11.000000000 -0400
@@ -76,10 +76,13 @@ struct marker {
 		{ __mark_empty_function, NULL}, NULL };			\
 		__mark_check_format(format, ## args);			\
 		if (!generic) {						\
-			if (unlikely(imv_cond(__mark_##name.state)))	\
+			if (unlikely(imv_cond(__mark_##name.state))) {	\
+				imv_cond_end();				\
 				(*__mark_##name.call)			\
 					(&__mark_##name, call_private,	\
 					## args);			\
+			} else						\
+				imv_cond_end();				\
 		} else {						\
 			if (unlikely(_imv_read(__mark_##name.state)))	\
 				(*__mark_##name.call)			\

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28  3:34 [patch 0/2] Immediate Values - jump patching update Mathieu Desnoyers
  2008-04-28  3:34 ` [patch 1/2] Immediate Values - jump liveliness Mathieu Desnoyers
  2008-04-28  3:34 ` [patch 2/2] Markers - use imv_cond " Mathieu Desnoyers
@ 2008-04-28 12:48 ` Ingo Molnar
  2008-04-28 14:35   ` Mathieu Desnoyers
  2008-04-28 17:21 ` H. Peter Anvin
  3 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-04-28 12:48 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, H. Peter Anvin


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Hi Ingo,
> 
> Here is the update to the jump patching optimization, taking care of 
> Peter's comments about register liveliness and instruction reuse by 
> gcc optimizations. A good thing: it actually simplifies the code. 
> Unfortunately, it adds 3 bytes to the instructions in i-cache, because 
> I now have to use a 5-byte mov instruction so I can replace it with a 
> 5-byte jump. Therefore, 9 bytes are added to rather small functions 
> (5-byte mov + 2-byte test + 2-byte conditional branch) and 13 bytes 
> are added to larger functions, which need a 6-byte conditional branch 
> at the branch site.
> 
> Instead of having to execute a sequence of nop, nop and jump, we now 
> only have to execute the near jump, which jumps either to the address 
> following the conditional branch or to the target address of the 
> conditional branch, depending on the immediate value variable state.
> 
> Thanks to Peter for the review.

thanks Mathieu, i've queued them up for more testing. Your previous 
queue already looked good here so i pushed it out into sched-devel.git 
as you probably noticed.

Sidenote, without trying to bikeshed paint this issue too much: are we 
absolutely sure that (on modern CPU architectures) such a short jump is 
better than just 2-3 NOPs in a sequence? It's a minor (sub-cycle) detail 
in any case.

	Ingo


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 12:48 ` [patch 0/2] Immediate Values - jump patching update Ingo Molnar
@ 2008-04-28 14:35   ` Mathieu Desnoyers
  0 siblings, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-04-28 14:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, H. Peter Anvin

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Hi Ingo,
> > 
> > Here is the update to the jump patching optimization, taking care of 
> > Peter's comments about register liveliness and instruction reuse by 
> > gcc optimizations. A good thing: it actually simplifies the code. 
> > Unfortunately, it adds 3 bytes to the instructions in i-cache, because 
> > I now have to use a 5-byte mov instruction so I can replace it with a 
> > 5-byte jump. Therefore, 9 bytes are added to rather small functions 
> > (5-byte mov + 2-byte test + 2-byte conditional branch) and 13 bytes 
> > are added to larger functions, which need a 6-byte conditional branch 
> > at the branch site.
> > 
> > Instead of having to execute a sequence of nop, nop and jump, we now 
> > only have to execute the near jump, which jumps either to the address 
> > following the conditional branch or to the target address of the 
> > conditional branch, depending on the immediate value variable state.
> > 
> > Thanks to Peter for the review.
> 
> thanks Mathieu, i've queued them up for more testing. Your previous 
> queue already looked good here so i pushed it out into sched-devel.git 
> as you probably noticed.
> 

Yes, thanks,

> Sidenote, without trying to bikeshed paint this issue too much: are we 
> absolutely sure that (on modern CPU architectures) such a short jump is 
> better than just 2-3 NOPs in a sequence? It's a minor (sub-cycle) detail 
> in any case.
> 
> 	Ingo

Let's see:

Paraphrasing the
"Intel® 64 and IA-32 Architectures Optimization Reference Manual" :
http://www.intel.com/design/processor/manuals/248966.pdf

3.5.1.8 Using NOPs

The 1-byte nop, xchg %eax,%eax, is treated specially by the CPU. It
takes only one µop and does not have register dependencies. However,
what we would need here is likely a 5-byte nop.

The generic 5-byte nop found in asm-x86/nops.h is:

#define GENERIC_NOP1 ".byte 0x90\n"
#define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

In Intel's guide, they propose a single instruction:
NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)

(0F 1F 44 00 00H)

This would be required for correct code patching, since we cannot
safely turn a 1+4-byte instruction sequence into a single 5-byte
call without the risk of preempted threads returning into the
middle of the instruction. Since the 5-byte nop does not seem to be
available on architectures below P6, I think this would be a backward
compatibility problem.
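
Concretely, the hazard with the two-instruction GENERIC_NOP5 (a sketch;
the call offset bytes are hypothetical):

	1:	90			nop	<- thread preempted here
	2:	8d 74 26 00		lea 0x0(%esi,%eiz,1),%esi

	/* concurrently patched to: */
	1:	e8 XX XX XX XX		call <probe>

A preempted thread resuming at 2: would land in the middle of the new
call instruction's rel32 operand and execute garbage.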

The K8 5-byte nop also uses a combination of K8_NOP3 and K8_NOP2, so this
is not only a backward compatibility concern, but also a concern for
current AMD compatibility (same issue for the K7 5-byte nop).

However, if we forget about this issue and take for granted that we can
use a single nop instruction that would only take a single µop and have
no external effect (that would be the ideal P6-and-up world), the
closest comparison with the jmp instruction I found is the jcc
(conditional branch) instruction in the same manual, Appendix C, where
they list jcc as having 0 latency when not taken and a bandwidth
of 0.5, compared to a latency of 1 and bandwidth of 0.33 to 0.5 for nop.

So, just there, a single jmp instruction would seem to have a lower
latency than the nop. (0 cycle vs 1 cycle)

Another interesting read in this document is about conditional
branches:

3.4.1 Optimizing the Front End

Some interesting guide lines :
- Eliminate branches whenever possible
- Separate branches so that they occur no more frequently than every 3 µops
  where possible

3.4.1.1 Eliminating branches improves performance because:
- It reduces the number of BTB (branch target buffer) entries.
  However, conditional branches which are never taken do not consume BTB
  resources (that's interesting).

3.4.1.3 Static Prediction

Basically, if there is no BTB entry for a branch: a forward target is
statically predicted not taken, and a backward target (a loop) is
statically predicted taken (no surprise here).

Given that the unlikely code is normally at the bottom of functions, the
static prediction would correctly predict the not taken case. So,
disabled markers would not consume any BTB entry, but each enabled
marker would consume a BTB entry.

So, in the end, it looks like a non-taken jcc and a jmp have
the same impact on the system, except that the non-taken jcc must be
preceded by the movb (latency 1, bw 0.5 to 1) and test (latency
1, bw 0.33 to 0.5) instructions, for a total of 2 latency cycles for
movb+test+jcc vs 0 latency cycles for the jmp.

Therefore, jmp wins over jcc in the disabled case (latency 0 vs 2) and
in the enabled case even more, since it won't consume any BTB entry.

Oh, and for NOPs vs jmp/jcc, I think not having a single-instruction
5-byte nop on every architecture (below P6, and on AMD) just disqualifies
it. Even if it did qualify, jmp wins the latency game 0 to 1.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28  3:34 [patch 0/2] Immediate Values - jump patching update Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2008-04-28 12:48 ` [patch 0/2] Immediate Values - jump patching update Ingo Molnar
@ 2008-04-28 17:21 ` H. Peter Anvin
  2008-04-28 20:25   ` Ingo Molnar
  3 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-28 17:21 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel

Mathieu Desnoyers wrote:
> 
> Thanks to Peter for the review.
> 

Just in case someone gets the wrong idea...

I still think this is the completely wrong approach.

I'd use stronger terms, but Al Viro would sue me for copyright infringement.

	-hpa


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 17:21 ` H. Peter Anvin
@ 2008-04-28 20:25   ` Ingo Molnar
  2008-04-28 21:03     ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-04-28 20:25 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Mathieu Desnoyers, akpm, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

> Mathieu Desnoyers wrote:
>>
>> Thanks to Peter for the review.
>
> Just in case someone gets the wrong idea...
>
> I still think this is the completely wrong approach.

hm, can it result in a broken kernel? If yes, how? Or are your 
objections at a higher level?

i actually like that we end up with something rather NOP-ish, with some 
mild 'ambient' impact due to the extra constraints that state value 
visibility brings with it. The in-source impact of the markers is 
minimal, especially with Peter Zijlstra's wrappers. The scheduler markers 
at least are also expected to stay pretty stable as well.

the syscall markers should be done less intrusively - and i think they 
can be done less intrusively.

but we need to get past this current impasse, the optimizations that 
Mathieu has done to markers are pretty impressive so far. The overhead 
is not zero, but it gets quite close to it and the SystemTap and kprobes 
people are happy with it as well.

> I'd use stronger terms, but Al Viro would sue me for copyright 
> infringement.

it would clearly fall under the Fair Use Doctrine and perhaps also under 
the Doctrine of Severe Necessities, so don't be shy!

	Ingo


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 20:25   ` Ingo Molnar
@ 2008-04-28 21:03     ` H. Peter Anvin
  2008-04-28 22:11       ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-28 21:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mathieu Desnoyers, akpm, linux-kernel

Ingo Molnar wrote:
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
>> Mathieu Desnoyers wrote:
>>> Thanks to Peter for the review.
>> Just in case someone gets the wrong idea...
>>
>> I still think this is the completely wrong approach.
> 
> hm, can it result in a broken kernel? If yes, how? Or are your 
> objections at a higher level?
> 

My objections are higher level: I believe the current code is (a) 
painfully complex, and I'd rather not see it in the kernel, and (b) the 
wrong thing anyway.

Put a 5-byte nop in as the marker, and patch it with a call instruction, 
out of line, to a collector function.

	-hpa


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 21:03     ` H. Peter Anvin
@ 2008-04-28 22:11       ` Ingo Molnar
  2008-04-28 22:25         ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-04-28 22:11 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Mathieu Desnoyers, akpm, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

>>> I still think this is the completely wrong approach.
>>
>> hm, can it result in a broken kernel? If yes, how? Or are your 
>> objections at a higher level?
>
> My objections are higher level: I believe the current code is (a) 
> painfully complex, and I'd rather not see it in the kernel, and (b) 
> the wrong thing anyway.
>
> Put a 5-byte nop in as the marker, and patch it with a call 
> instruction, out of line, to a collector function.

the counter-argument was that, by specific analysis of sched.o, this results 
in slower code. The reason is that the "function call parameter 
preparation" halo around that 5-byte patch site is larger than the 
single conditional branch to an out-of-line place in the current 
function.

i.e. the current optimized marker approach does roughly this:

  [ .... fastpath head ....       ]
  [ immediate value instruction   ]  --->
  [ branch instruction            ]  ---> these two get NOP-ed out
  [ .... fastpath tail ....       ]
  [ ............................. ]
  [ ... offline area ............ ]
  [ ... parameter preparation ... ]
  [ ... marker call ............. ]

your proposed 5-byte call NOP approach (which btw. was what i proposed 
multiple times in the past 2 years) would do this:

  [ .... fastpath head ......     ]
  [ ... parameter preparation ... ]
  [ ....   5-byte CALL .......... ]  ---> NOP-ed out
  [ .... fastpath tail .......... ]
  [ ............................. ]

in the first case we have little "marker parameter/value preparation" 
cost: it all happens in the 'offline area' _by GCC_. I.e. the fastpath 
is relatively undisturbed.

in the latter case, the whole 'parameter preparation' phase has to happen 
around the 5-byte CALL site, in the fastpath. This, in the specific 
assembly-level analysis of sched.o, was shown by Mathieu to be a 
pessimisation. We are better off inserting that conditional and 
letting gcc generate the call than forcing it into the middle of the 
fastpath - even if we end up NOP-ing out the call.

wrt. complexity i agree with you - if the current optimization cannot be 
made correctly we have to fall back to a simpler variant, even if it's 
slower.

	Ingo


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 22:11       ` Ingo Molnar
@ 2008-04-28 22:25         ` H. Peter Anvin
  2008-04-28 22:44           ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-28 22:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mathieu Desnoyers, akpm, linux-kernel

Ingo Molnar wrote:
> 
>   [ .... fastpath head ....       ]
>   [ immediate value instruction   ]  --->
>   [ branch instruction            ]  ---> these two get NOP-ed out
>   [ .... fastpath tail ....       ]
>   [ ............................. ]
>   [ ... offline area ............ ]
>   [ ... parameter preparation ... ]
>   [ ... marker call ............. ]
> 
> your proposed 5-byte call NOP approach (which btw. was what i proposed 
> multiple times in the past 2 years) would do this:
> 
>   [ .... fastpath head ......     ]
>   [ ... parameter preparation ... ]
>   [ ....   5-byte CALL .......... ]  ---> NOP-ed out
>   [ .... fastpath tail .......... ]
>   [ ............................. ]
> 
> in the first case we have little "marker parameter/value preparation" 
> cost: it all happens in the 'offline area' _by GCC_. I.e. the fastpath 
> is relatively undisturbed.

No, that's not what I'm proposing at all (and would indeed be bogus.)

What I'm proposing is:

 >   [ .... fastpath head ......     ]
 >   [ ....   5-byte CALL .......... ]  ---> NOP-ed out
 >   [ .... fastpath tail .......... ]
 >   [ ............................. ]

The call site is created with an asm() statement as opposed to a gcc 
function call; it is up to the logging function to take the state and 
mangle it into whatever format it wants to; the debugging information 
(e.g. DWARF) should tell it all it needs to know about how the 
register/memory state maps onto the C state.  This mapping can either be 
done online, with a small piece of dynamic code, or offline (although 
offline makes it tricky to know what memory items to gather.)

For pieces of data which would be dead at the call site, one can add "g" 
annotations to the asm() statement to force them to be live without 
forcing them to be present in any particular location.
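
Something along these lines (a sketch only; the macro and section names
are invented, and the P6-style 5-byte nop is just one possible choice):

	#define TRACE_SITE_2(arg1, arg2)				\
		asm volatile("661:\n\t"					\
			/* 5-byte single-insn nop, patched to a call */	\
			".byte 0x0f,0x1f,0x44,0x00,0x00\n\t"		\
			".section __trace_sites,\"a\",@progbits\n\t"	\
			_ASM_PTR "661b\n\t"				\
			".previous"					\
			: : "g" (arg1), "g" (arg2))

The "g" constraints keep arg1 and arg2 live without pinning them to any
particular location; the tracer recovers where they ended up from the
debug info.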

	-hpa


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 22:25         ` H. Peter Anvin
@ 2008-04-28 22:44           ` Ingo Molnar
  2008-04-28 23:06             ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-04-28 22:44 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Mathieu Desnoyers, akpm, linux-kernel, Frank Ch. Eigler


* H. Peter Anvin <hpa@zytor.com> wrote:

> What I'm proposing is:
>
> >   [ .... fastpath head ......     ]
> >   [ ....   5-byte CALL .......... ]  ---> NOP-ed out
> >   [ .... fastpath tail .......... ]
> >   [ ............................. ]
>
> The call site is created with an asm() statement as opposed to a gcc 
> function call; it is up to the logging function to take the state and 
> mangle it into whatever format it wants to; the debugging information 
> (e.g. DWARF) should tell it all it needs to know about how the 
> register/memory state maps onto the C state.  This mapping can either 
> be done online, with a small piece of dynamic code, or offline 
> (although offline makes it tricky to know what memory items to gather.)

that would be rather impractical as we'd force DEBUG_INFO builds on 
everyone (it's HUGE) just to do some trivial tracing. Look at the ftrace 
plugin usage model - it wants to be widely available and easy to use.

	Ingo


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 22:44           ` Ingo Molnar
@ 2008-04-28 23:06             ` H. Peter Anvin
  2008-04-29  0:47               ` Frank Ch. Eigler
  2008-04-29  1:46               ` Mathieu Desnoyers
  0 siblings, 2 replies; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-28 23:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mathieu Desnoyers, akpm, linux-kernel, Frank Ch. Eigler

Ingo Molnar wrote:
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
>> What I'm proposing is:
>>
>>>   [ .... fastpath head ......     ]
>>>   [ ....   5-byte CALL .......... ]  ---> NOP-ed out
>>>   [ .... fastpath tail .......... ]
>>>   [ ............................. ]
>> The call site is created with an asm() statement as opposed to a gcc 
>> function call; it is up to the logging function to take the state and 
>> mangle it into whatever format it wants to; the debugging information 
>> (e.g. DWARF) should tell it all it needs to know about how the 
>> register/memory state maps onto the C state.  This mapping can either 
>> be done online, with a small piece of dynamic code, or offline 
>> (although offline makes it tricky to know what memory items to gather.)
> 
> that would be rather impractical as we'd force DEBUG_INFO builds on 
> everyone (it's HUGE) just to do some trivial tracing. Look at the ftrace 
> plugin usage model - it wants to be widely available and easy to use.
> 

Otherwise you're forcing everyone to take the cost of additional cache 
footprint, plus optimizer interference, just because they might want to 
possibly do some trivial tracing.  DEBUG_INFO is The Right Thing for 
this, as it carries all the information you may want in a well-defined 
format.  You don't necessarily have to keep all this information around, 
of course; you can distill out the information for the trace sites at 
compile time and keep a tracer information file, after which you can 
strip the information.

There is actually yet another alternative, though, which is to build the 
tracer at compile time.  The tricky part of this is that it almost 
requires inserting a preprocessor before the assembler, and using really 
ugly asm() macros to extract the very information that the debug format 
is designed explicitly to extract!

Personally, I think requiring DEBUG_INFO is a helluva lot less ugly than 
these branch hacks.

	-hpa


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 23:06             ` H. Peter Anvin
@ 2008-04-29  0:47               ` Frank Ch. Eigler
  2008-04-29  1:08                 ` H. Peter Anvin
  2008-04-29  1:46               ` Mathieu Desnoyers
  1 sibling, 1 reply; 22+ messages in thread
From: Frank Ch. Eigler @ 2008-04-29  0:47 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ingo Molnar, Mathieu Desnoyers, akpm, linux-kernel

Hi -

On Mon, Apr 28, 2008 at 04:06:14PM -0700, H. Peter Anvin wrote:
> [...]
> >>The call site is created with an asm() statement as opposed to a gcc 
> >>function call; it is up to the logging function to take the state and 
> >>mangle it into whatever format it wants to; the debugging information 
> >>(e.g. DWARF) should tell it all it needs to know [...]
> >
> >that would be rather impractical as we'd force DEBUG_INFO builds on 
> >everyone (it's HUGE) just to do some trivial tracing. Look at the ftrace 
> >plugin usage model - it wants to be widely available and easy to use.
> 
> Otherwise you're forcing everyone to take the cost of additional cache 
> footprint, plus optimizer interference, just because they might want to 
> possibly do some trivial tracing.  

The intent is for the tracing not to be trivial but useful.

> DEBUG_INFO is The Right Thing for this, as it carries all the
> information you may want in a well-defined format.  [...]

This would require either that DWARF processing be done far after
kernel build, and thus the kernel cannot be self-sufficient /
introspective without user-space assistance (like firmware); or that
the DWARF data extraction (and systemtap-like $context-variable code
generation) be part of the kernel build itself.  It *might* be
workable.

At least one complication though is that in the case of markers,
tracing parameter evaluation is itself conditional (and placed out of
the hot path due to -freorder-blocks).  With your suggested kind of
assembly ("g" constraints for all the expressions), those expressions
would be evaluated unconditionally, just to make them live somewhere.
That unconditional evaluation can easily entail memory reads and
dependent arithmetic, which could swamp the savings of eliminating the
marker-style conditional branch.
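
For instance, with a hypothetical marker argument such as:

	trace_mark(foo_event, "quux %d", foo->bar[baz]->quux);

the "g"-constraint form has to perform those three dependent loads on the
fast path even when tracing is disabled, whereas the current conditional
lets gcc sink the whole evaluation into the cold block.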

- FChE


* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29  0:47               ` Frank Ch. Eigler
@ 2008-04-29  1:08                 ` H. Peter Anvin
  2008-04-29 12:08                   ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-29  1:08 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Ingo Molnar, Mathieu Desnoyers, akpm, linux-kernel

Frank Ch. Eigler wrote:
> This would require either that DWARF processing be done far after
> kernel build, and thus the kernel cannot be self-sufficient /
> introspective without user-space assistance (like firmware); or that
> the DWARF data extraction (and systemtap-like $context-variable code
> generation) be part of the kernel build itself.  It *might* be
> workable.

DWARF processing done after the kernel build isn't really a problem, 
although it requires a large vmlinux(.gz) file to be kept around.

As part of the kernel build is probably better, at least in the case of 
static markers.

> At least one complication though is that in the case of markers,
> tracing parameter evaluation is itself conditional (and placed out of
> the hot path due to -freorder-blocks).  With your suggested kind of
> assembly ("g" constraints for all the expressions), those expressions
> would be evaluated unconditionally, just to make them live somewhere.
> That unconditional evaluation can easily entail memory reads and
> dependent arithmetic, which could swamp the savings of eliminating the
> marker-style conditional branch.

Well, it depends a bit on what kind of expressions you put in there.
You don't really want to put *expressions* in there as much as you want 
to put *data* references in there, although, of course, if you have 
something like "foo->bar[baz]->quux" then it's easy to trip upon.

One advantage of using the debugging information in this manner is that 
you can write the actual markers long after the kernel has been compiled, as 
long as the data happens to be available at the call site.

Once again, the *proper* way to do this is with compiler support.  That 
way we can do patched branches and out-of-line everything with impunity.

	-hpa



* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-28 23:06             ` H. Peter Anvin
  2008-04-29  0:47               ` Frank Ch. Eigler
@ 2008-04-29  1:46               ` Mathieu Desnoyers
  2008-04-29  2:07                 ` H. Peter Anvin
  1 sibling, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-04-29  1:46 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ingo Molnar, akpm, linux-kernel, Frank Ch. Eigler

* H. Peter Anvin (hpa@zytor.com) wrote:
> Ingo Molnar wrote:
>> * H. Peter Anvin <hpa@zytor.com> wrote:
>>> What I'm proposing is:
>>>
>>>>   [ .... fastpath head ......     ]
>>>>   [ ....   5-byte CALL .......... ]  ---> NOP-ed out
>>>>   [ .... fastpath tail .......... ]
>>>>   [ ............................. ]
>>> The call site is created with an asm() statement as opposed to a gcc 
>>> function call; it is up to the logging function to take the state and 
>>> mangle it into whatever format it wants to; the debugging information 
>>> (e.g. DWARF) should tell it all it needs to know about how the 
>>> register/memory state maps onto the C state.  This mapping can either be 
>>> done online, with a small piece of dynamic code, or offline (although 
>>> offline makes it tricky to know what memory items to gather.)
>> that would be rather impractical as we'd force DEBUG_INFO builds on everyone 
>> (it's HUGE) just to do some trivial tracing. Look at the ftrace plugin 
>> usage model - it wants to be widely available and easy to use.
>
> Otherwise you're forcing everyone to take the cost of additional cache 
> footprint, plus optimizer interference, just because they might want to 
> possibly do some trivial tracing.  DEBUG_INFO is The Right Thing for this, 
> as it carries all the information you may want in a well-defined format.  
> You don't necessarily have to keep all this information around, of course; 
> you can distill out the information for the trace sites at compile time and 
> keep a tracer information file, after which you can strip the information.
>
> There is actually yet another alternative, though, which is to build the 
> tracer at compile time.  The tricky part of this is that it almost requires 
> inserting a preprocessor before the assembler, and use really ugly asm() 
> macros to extract the very information that the debug format is designed 
> explicitly to extract!
>
> Personally, I think requiring DEBUG_INFO is a helluva lot less ugly than 
> these branch hacks.
>
> 	-hpa

Peter, do you have something like the following code in mind?

I think the main difference between the code snippet down here and the
markers is that markers rely on the compiler to generate the stack
setup, and keep this code a little bit closer to the function than what
I propose here, where I put the stack setup code in a "farfaraway"
section.  Moreover, markers are much simpler than what I show here.
And actually, markers can be deployed portably, with
architecture-specific optimizations refined later; this would have to
be implemented all up front for any traced architecture. In addition,
dealing with weird types like unsigned long long can become a pain.
Also, given that we are asking the compiler to keep some variables
live, I would be tempted to embed this in a block controlled by an
if() statement (a conditional branch, like I use for the markers) so we
don't have to pay the penalty of populating the registers when they are
not live at the marker site.

Here is the toy program :

#include <stdio.h>

#define SAVE_REGS   \
	"pushl %%eax\n\t" \
	"pushl %%ebp\n\t" \
	"pushl %%edi\n\t" \
	"pushl %%esi\n\t" \
	"pushl %%edx\n\t" \
	"pushl %%ecx\n\t" \
	"pushl %%ebx\n\t"

#define RESTORE_REGS \
	"popl %%ebx\n\t"	\
	"popl %%ecx\n\t"	\
	"popl %%edx\n\t"	\
	"popl %%esi\n\t"	\
	"popl %%edi\n\t"	\
	"popl %%ebp\n\t"	\
	"popl %%eax\n\t"

#define asmlinkage __attribute__((regparm(0)))
#define _ASM_PTR ".long "

struct marker_info {
	const char *name;
	const char *fields;
	unsigned char nr_args;
	unsigned char type_sizes[];
};

#define trace_mark3(name, fields, arg1, arg2, arg3)			\
	do {								\
		static struct marker_info info = {			\
			#name,						\
			fields,						\
			3, { sizeof(arg1), sizeof(arg2), sizeof(arg3) } }; \
		asm (".section __marker_info, \"a\",@progbits\n\t"	\
			_ASM_PTR "%c0, 1f, 2f\n\t"			\
			".previous\n\t"					\
			".section __farfarawaycode,\"ax\",@progbits\n\t"\
			"1:\n\t"					\
			SAVE_REGS	/* Save caller-saved registers */\
			"pushl %3\n\t"					\
			"pushl %2\n\t"					\
			"pushl %1\n\t"					\
			"pushl %0\n\t"					\
			"call do_kernel_trace\n\t"			\
			"addl $(4*(1+3)), %%esp\n\t"			\
			RESTORE_REGS					\
			"ret\n\t"					\
			".previous\n\t"					\
			"2:\n\t"					\
			"call 1b\n\t"	/* TEST */			\
			".byte 0xe9\n\t"/* jmp near 0x0 (5-bytes 1 insn nop) */\
			".int 0x0\n\t"	/* patched to "call ...\n\t" */	\
			: : "i" (&info),				\
			"g" (arg1), "g" (arg2), "g" (arg3));		\
	} while (0)

/*
 * Patching address "2f" with a call 1f. All the information required is in 
 * the __marker_info section/table.
 */

asmlinkage void do_kernel_trace(const struct marker_info *info,
	 unsigned long arg1, unsigned long arg2, unsigned long arg3)
{
	unsigned long *args = &arg1;
	int i;
	printf("%s \n", info->name);
	for (i = 0; i < info->nr_args; i++) {
		switch (info->type_sizes[i]) {
		case 1:
			printf("%hu ", (unsigned char)args[i]);
			break;
		case 2:
			printf("%hu ", (unsigned short)args[i]);
			break;
		case 4:
			printf("%u ", (unsigned int)args[i]);
			break;
		case 8:
			printf("%llu ", (unsigned long long)args[i]);
			break;
		}
	}
	printf("\n");
}

unsigned long long gg = -1ULL;  /* this guy does not work (pushl only
                                   pushes the low 32 bits). No warning
                                   generated. */

void f(void)
{
	char y = 4;
	trace_mark3(test_event, "name1 name2 name3", y, gg, 6);
}

int main(void)
{
	f();
	return 0;
}
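
As a follow-up to the comment above do_kernel_trace(), here is a
minimal sketch of what the patching side could look like. The
marker_entry layout mirrors the three pointers (&info, 1f, 2f) emitted
into the __marker_info section; the names, and the hand-waving about
synchronization, are hypothetical rather than part of the proposal
above:

#include <stdint.h>
#include <string.h>

struct marker_entry {
	const struct marker_info *info;
	unsigned char *target;	/* "1f": start of the farfaraway stub */
	unsigned char *site;	/* "2f" area: the patchable 5-byte insn */
};

/*
 * Rewrite the 5-byte "jmp near 0" placeholder (0xe9 0x00000000, a
 * single-instruction nop) into "call <stub>" (0xe8 rel32).  site must
 * point at the 0xe9 byte.  A real kernel implementation would also
 * need to synchronize all CPUs (stop_machine or a breakpoint-based
 * scheme) before modifying live code.
 */
static void marker_arm(const struct marker_entry *e)
{
	unsigned char *insn = e->site;
	int32_t rel = (int32_t)(e->target - (insn + 5));

	insn[0] = 0xe8;				/* call rel32 opcode */
	memcpy(insn + 1, &rel, sizeof(rel));	/* displacement */
}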


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29  1:46               ` Mathieu Desnoyers
@ 2008-04-29  2:07                 ` H. Peter Anvin
  2008-04-29 12:18                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-29  2:07 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Ingo Molnar, akpm, linux-kernel, Frank Ch. Eigler

Mathieu Desnoyers wrote:
> 
> Peter, do you have something like the following code in mind?
> 

Basically yes, although I was suggesting using a per-site dynamic piece of 
code.  Data items may not necessarily be in registers.

> I think the main differences between the code snippet below and the
> markers are that markers rely on the compiler to generate the stack
> setup, and keep this code a little closer to the function than what I
> propose here, where I put the stack setup code in a "farfaraway"
> section.  Moreover, markers are much simpler than what I show here.
> And actually, markers can be deployed portably, with
> architecture-specific optimizations refined later; this would have to
> be implemented up front for every traced architecture. In addition,
> dealing with weird types like unsigned long long can become a pain.
> Also, since we are asking the compiler to keep some variables live in
> registers, I would be tempted to embed this in a block controlled by an
> if() statement (a conditional branch, like I use for the markers), so
> we don't have to pay the penalty of populating the registers when they
> are not live at the marker site.

We're requesting to keep them *alive*, but not necessarily in registers 
(that would be an "r" constraint.)

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29  1:08                 ` H. Peter Anvin
@ 2008-04-29 12:08                   ` Ingo Molnar
  2008-05-14 14:53                     ` Pavel Machek
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2008-04-29 12:08 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Frank Ch. Eigler, Mathieu Desnoyers, akpm, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

>> At least one complication though is that in the case of markers, 
>> tracing parameter evaluation is itself conditional (and placed out of 
>> the hot path due to -freorder-blocks).  With your suggested kind of 
>> assembly ("g" constraints for all the expressions), those expressions 
>> would be evaluated unconditionally, just to make them live somewhere. 
>> That unconditional evaluation can easily entail memory reads and 
>> dependent arithmetic, which could swamp the savings of eliminating 
>> the marker-style conditional branch.
>
> Well, it depends a bit on what kind of expressions you put in there. 
> You don't really want to put *expressions* in there as much as you 
> want to put *data* references in there, although, of course, if you 
> have something like "foo->bar[baz]->quux" then it's easy to trip upon.

and that's exactly what was tripped upon in sched.o and analyzed.

Furthermore, the suggestion of doing this exclusively within the DWARF2 
space - besides the not particularly minor complication of it not being 
implemented yet - is:

 - quite substantially complex on its own

 - would make Linux instrumentation dependent on all sorts of DWARF2
   details which we had our 'fun' with before. (I proffer that that's
   more fragile than any code patching can ever be.)

 - if done self-sufficiently (i.e. if a kernel image can be used to
   trace things, which i believe any usable kernel tracer must offer),
   it would, with the current debug info format, enlarge the kernel RAM
   image by quite a substantial amount of unswappable kernel memory. 

But i would not mind such a scheme at all (it is in essence SystemTap 
integrated into the core kernel) - in fact if your scheme gets 
implemented then the current marker facilities could be ported to that 
scheme transparently.

So i don't see how it can be a loss to stick with the current markers for 
the time being. If the super-optimized runtime patching in this thread 
is deemed too fragile we simply won't do that and live with the (small) 
runtime overhead that current markers have.

This is a sane plan IMO: basically all the instrumentation folks agree 
with it (SystemTap, LTTNG, kprobes, ftrace), the scheduler folks agree 
with it as well, and we'd like to move on.

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29  2:07                 ` H. Peter Anvin
@ 2008-04-29 12:18                   ` Mathieu Desnoyers
  2008-04-29 15:35                     ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-04-29 12:18 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ingo Molnar, akpm, linux-kernel, Frank Ch. Eigler

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> Peter, do you have something like the following code in mind?
>
> Basically yes, although I was suggesting using a per-site dynamic piece of 
> code.  Data items may not necessarily be in registers.
>
>> I think the main differences between the code snippet below and the
>> markers are that markers rely on the compiler to generate the stack
>> setup, and keep this code a little closer to the function than what I
>> propose here, where I put the stack setup code in a "farfaraway"
>> section.  Moreover, markers are much simpler than what I show here.
>> And actually, markers can be deployed portably, with
>> architecture-specific optimizations refined later; this would have to
>> be implemented up front for every traced architecture. In addition,
>> dealing with weird types like unsigned long long can become a pain.
>> Also, since we are asking the compiler to keep some variables live in
>> registers, I would be tempted to embed this in a block controlled by an
>> if() statement (a conditional branch, like I use for the markers), so
>> we don't have to pay the penalty of populating the registers when they
>> are not live at the marker site.
>
> We're requesting to keep them *alive*, but not necessarily in registers 
> (that would be an "r" constraint.)
>

Interesting. Actually, I use the "g" constraint in the code I showed
you, so it might be more acceptable to put it in the fast path without
requiring registers to be populated. The only cases that would generate
additional code would probably be arguments like:

/* multiple pointer dereference */
trace_mark(evname, "argname", ptr->stuff[index]->...);

/*
 * having to do some work to prepare the variable (calling a macro or
 * inline function which does more than just a pointer deref).
 */
trace_mark(evname, "argname", get_real_valueof(variable));

/* constants */
trace_mark(evname, "argname", 10000);

Those cases won't add code to the critical path with my current markers,
but they would with the inline assembly "g" constraint approach. Looking
at the "mm" instrumentation, where page_to_pfn, swp_offset and
get_swap_info_struct are used, makes me think this would not be such a
rare case.
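
To make the contrast concrete, here is a hedged sketch (marker_armed
and do_trace() are made-up placeholders, and page_to_pfn() stands in
for any non-trivial helper; this is not the actual marker API):

struct page;	/* opaque here */
extern int marker_armed;
extern void do_trace(const char *name, unsigned long pfn);
extern unsigned long page_to_pfn(struct page *page);

/*
 * Marker style: page_to_pfn() is only evaluated inside the unlikely
 * branch, so the fast path pays a test and a conditional branch, not
 * the argument computation.
 */
void marker_style(struct page *page)
{
	if (__builtin_expect(marker_armed, 0))
		do_trace("mm_event", page_to_pfn(page));
}

/*
 * "g"-constraint style: the expression must be evaluated up front just
 * to satisfy the asm operand, whether or not tracing is active.
 */
void constraint_style(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);	/* always computed */

	asm volatile("" : : "g" (pfn));		/* keeps the value alive */
}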

I would also like to point out that maintaining a _separated_ piece of
code for each instrumentation site which would heavily depend on the
inner kernel data structures seems like a maintenance nightmare. This is
why I am trying to get the instrumented site to export the meaningful
data, self-described, in a standardized way. We can then simply hook on
all the instrumented sites to either perform some in-kernel analysis on
the data (ftrace) or to export that to a userspace trace analyzer
(LTTV analyzing LTTng traces).

I would be happy with a solution that doesn't depend on this gigantic
DWARF information and can be included in the kernel build process. See,
I think tracing is, primarily, a facility that the kernel should provide
to users so they can tune and find problems in their own applications.
From this POV, it would make sense to consider tracing as part of the
kernel code itself, not as a separated, kernel debugging oriented piece
of code. If you require per-site dynamic pieces of code, you are only
adding to the complexity of such a tracer. Actually, an active tracer
would trash the i-cache quite heavily due to these per-site pieces of
code. Given that users want a tracer that disturbs the normal system
behavior as little as possible, I don't think this "per-site" pieces
of code approach is that good.

So, in terms of complexity added to the kernel, i-cache impact of an
active tracer and maintainability, I think the register-constraining
assembly isn't such a good approach. And why would we do that? The real
contention point here seems to be to remove a few bytes from an unlikely
block. I think I should paste my reply to Ingo about d-cache, i-cache
and TLB impact of such code:

<quote>
> > I have not seen any counter argument for the in-depth analysis of the 
> > instruction cache impact of the optimized markers I've done. Arguing 
> > that the markers are "bloated" based only on "size kernel/sched.o" 
> > output is a bit misleading.
>
> uhm, i'm not sure what you mean - how else would you quantify bloat than
> to look at the size of the affected subsystem?
>
>       Ingo

Data cache bloat inspection:
If you use the "size" output, it will take into account all the data
placed in special sections. At link time, these sections are put
together far from the actual cache hot kernel data.

Instruction cache bloat inspection:
If a code region is placed with cache-cold instructions (unlikely
branches), it should not increase the cache impact: although we might
use one more cache line, it won't often be loaded into cache, because
all the code that shares this cache line is unlikely.

TLB entries bloat:
If code is added in unlikely branches, the instruction size increase
could increase the number of TLB entries required to map the cache-hot
code. However, in our case, adding 10 (hot) + 50 (cold) bytes to the
scheduler code per optimized marker would require 68 markers to occupy a
whole 4kB TLB entry. Statistically, we could suppose that adding less
than 34 markers to the scheduler should not use any supplementary TLB
entry.  Adding 3 markers is therefore very unlikely to increase the TLB
impact. Given we have about 1024 TLB entries, adding 1/25th of a TLB
entry to the cache hot kernel instructions should not matter much,
especially since it might be absorbed by alignment.

And since the kernel core code is placed in "Huge TLB pages" on many
architectures nowadays, I really don't think the impact of a few bytes
out of 4MB is significant.

I therefore think that looking only at code size is misleading when
considering the cache impact of markers, since they have been designed
to put the bytes as far away as possible from cache-hot memory.
</quote>

Mathieu

> 	-hpa

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29 12:18                   ` Mathieu Desnoyers
@ 2008-04-29 15:35                     ` H. Peter Anvin
  2008-05-04 14:54                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2008-04-29 15:35 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Ingo Molnar, akpm, linux-kernel, Frank Ch. Eigler

Mathieu Desnoyers wrote:
> I would also like to point out that maintaining a _separated_ piece of
> code for each instrumentation site which would heavily depend on the
> inner kernel data structures seems like a maintenance nightmare.

Obviously doing this by hand is insane.  That was not my thought.

> I would be happy with a solution that doesn't depend on this gigantic
> DWARF information and can be included in the kernel build process. See,
> I think tracing is, primarily, a facility that the kernel should provide
> to users so they can tune and find problems in their own applications.
> From this POV, it would make sense to consider tracing as part of the
> kernel code itself, not as a separated, kernel debugging oriented piece
> of code. If you require per-site dynamic pieces of code, you are only
> adding to the complexity of such a tracer. Actually, an active tracer
> would trash the i-cache quite heavily due to these per-site pieces of
> code. Given that users want a tracer that disturbs the normal system
> behavior as little as possible, I don't think this "per-site" pieces
> of code approach is that good.

That's funny, given that's exactly what you have now.

DWARF information is the way you get this stuff out of the compiler. 
That is what it's *there for*.  If you don't want to keep it around you 
can distill out the information you need and then remove it.  However, 
as I have said about six times now:

THE RIGHT WAY TO DO THIS IS WITH COMPILER SUPPORT.

All these problems are because you're trying to do something behind the 
back of the compiler, but not *completely* so.

> Instruction cache bloat inspection:
> If a code region is placed with cache-cold instructions (unlikely
> branches), it should not increase the cache impact: although we might
> use one more cache line, it won't often be loaded into cache, because
> all the code that shares this cache line is unlikely.

This is somewhat nice in theory; I've found that gcc tends to interlace 
them pretty heavily and so the cache interference even of gcc "out of 
line" code is sizable.  Furthermore, modern CPUs often speculatively 
fetch *both* branches of a conditional.

This is actually the biggest motivation for patching static branches.

> I therefore think that looking only at code size is misleading when
> considering the cache impact of markers, since they have been designed
> to put the bytes as far away as possible from cache-hot memory.

Nice theory.  Doesn't work in practice as long as you rely on gcc unlikely().

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29 15:35                     ` H. Peter Anvin
@ 2008-05-04 14:54                       ` Mathieu Desnoyers
  2008-05-04 21:05                         ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2008-05-04 14:54 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ingo Molnar, akpm, linux-kernel, Frank Ch. Eigler

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> I would also like to point out that maintaining a _separated_ piece of
>> code for each instrumentation site which would heavily depend on the
>> inner kernel data structures seems like a maintenance nightmare.
>
> Obviously doing this by hand is insane.  That was not my thought.
>

Great :)

>> I would be happy with a solution that doesn't depend on this gigantic
>> DWARF information and can be included in the kernel build process. See,
>> I think tracing is, primarily, a facility that the kernel should provide
>> to users so they can tune and find problems in their own applications.
>> From this POV, it would make sense to consider tracing as part of the
>> kernel code itself, not as a separated, kernel debugging oriented piece
>> of code. If you require per-site dynamic pieces of code, you are only
>> adding to the complexity of such a tracer. Actually, an active tracer
>> would trash the i-cache quite heavily due to these per-site pieces of
>> code. Given that users want a tracer that disturbs as little as
>> possible the normal system behavior, I don't think this "per-site"
>> pieces of code approach is that good.
>
> That's funny, given that's exactly what you have now.
>

The per-site pieces of code are only there to do the stack setup. I
really wonder if we could do this more efficiently from DWARF info.

> DWARF information is the way you get this stuff out of the compiler. That 
> is what it's *there for*.  If you don't want to keep it around you can 
> distill out the information you need and then remove it.  However, as I 
> have said about six times now:

About DWARF: I agree with Ingo that we might not want to depend on
information that is normally only expected to be correct for debug use
in a piece of infrastructure that is not limited to debugging
situations. Continuous performance monitoring is one of the use cases I
have in mind.

Moreover, depending on DWARF info requires us to write
architecture-specific code from the beginning. The markers are designed
in such a way that any given new architecture can use the "architecture
agnostic" version of the markers, and then later implement the
optimizations. With about 27 architectures supported by the Linux
kernel, I think this approach makes sense. The number of years it took
to port something as "simple" as kprobes to 8 out of 27 architectures
speaks for itself.

>
> THE RIGHT WAY TO DO THIS IS WITH COMPILER SUPPORT.
>

We totally agree on this about the jump-patching optimization. If the
jump-patching approach I proposed is too far-fetched, and if reading a
variable from memory at each tracing site is too expensive, I would
propose to use the standard "immediate values" flavor until gcc gives us
that kind of support for patchable jump instructions.
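
For reference, the standard flavor boils down to an ordinary memory
read of a rarely-written variable, along these lines (a sketch with
made-up names, not the actual imv_* interface):

extern void tracer(void);		/* hypothetical slow path */
static unsigned char tracing_enabled;	/* hypothetical flag variable */

/* Generic (architecture-agnostic) flavor: a real memory read. */
#define imv_read_fallback(var)	(var)

void instrumented_site(void)
{
	if (imv_read_fallback(tracing_enabled))
		tracer();
}

/*
 * The optimized flavor would instead emit a "mov $imm, %reg" whose
 * immediate operand is patched when tracing is (de)activated, turning
 * the test into a pure instruction-stream operation with no
 * data-cache access.
 */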

> All these problems are because you're trying to do something behind the back 
> of the compiler, but not *completely* so.
>

Using the compiler for the markers (I am not talking about immediate
values, which is an optimization) is what gives us the ability to do an
architecture-agnostic version. The 19 architectures which still lack
kprobes support tell me that it isn't such a bad way to go.

>> Instruction cache bloat inspection:
>> If a code region is placed with cache-cold instructions (unlikely
>> branches), it should not increase the cache impact: although we might
>> use one more cache line, it won't often be loaded into cache, because
>> all the code that shares this cache line is unlikely.
>
> This is somewhat nice in theory; I've found that gcc tends to interlace 
> them pretty heavily and so the cache interference even of gcc "out of line" 
> code is sizable.

Following your own suggestion, why don't we fix gcc and make it
interleave unlikely blocks less heavily with hot blocks?

> Furthermore, modern CPUs often speculatively fetch *both* 
> branches of a conditional.
>
> This is actually the biggest motivation for patching static branches.
>

Agreed. I'd like to find some info about which microarchitectures you
have in mind. Intel Core 2?


>> I therefore think that looking only at code size is misleading when
>> considering the cache impact of markers, since they have been designed
>> to put the bytes as far away as possible from cache-hot memory.
>
> Nice theory.  Doesn't work in practice as long as you rely on gcc 
> unlikely().
>
> 	-hpa

Let's fix gcc! ;)

Cheers,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-05-04 14:54                       ` Mathieu Desnoyers
@ 2008-05-04 21:05                         ` H. Peter Anvin
  0 siblings, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2008-05-04 21:05 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Ingo Molnar, akpm, linux-kernel, Frank Ch. Eigler

Mathieu Desnoyers wrote:
> 
> Following your own suggestion, why don't we fix gcc and make it
> interleave unlikely blocks less heavily with hot blocks?
> 

Doing this with compiler support is definitely The Right Thing, so I 
think this is the best way.

>> Furthermore, modern CPUs often speculatively fetch *both* 
>> branches of a conditional.
>>
>> This is actually the biggest motivation for patching static branches.
> 
> Agreed. I'd like to find some info about which microarchitectures you
> have in mind. Intel Core 2?

Not sure about Core 2, although Core 2 can definitely go down the 
wrong branch on a mispredict.

> Let's fix gcc! ;)

Sounds great :)

	-hpa


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch 0/2] Immediate Values - jump patching update
  2008-04-29 12:08                   ` Ingo Molnar
@ 2008-05-14 14:53                     ` Pavel Machek
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2008-05-14 14:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Frank Ch. Eigler, Mathieu Desnoyers, akpm,
	linux-kernel

Hi!

> >> At least one complication though is that in the case of markers, 
> >> tracing parameter evaluation is itself conditional (and placed out of 
> >> the hot path due to -freorder-blocks).  With your suggested kind of 
> >> assembly ("g" constraints for all the expressions), those expressions 
> >> would be evaluated unconditionally, just to make them live somewhere. 
> >> That unconditional evaluation can easily entail memory reads and 
> >> dependent arithmetic, which could swamp the savings of eliminating 
> >> the marker-style conditional branch.
> >
> > Well, it depends a bit on what kind of expressions you put in there. 
> > You don't really want to put *expressions* in there as much as you 
> > want to put *data* references in there, although, of course, if you 
> > have something like "foo->bar[baz]->quux" then it's easy to trip upon.
> 
> and that's exactly what was tripped upon in sched.o and analyzed.
> 
> Furthermore, the suggestion of doing this exclusively within the DWARF2 
> space - besides the not particularly minor complication of it not being 
> implemented yet - is:
> 
>  - quite substantially complex on its own
> 
>  - would make Linux instrumentation dependent on all sorts of DWARF2
>    details which we had our 'fun' with before. (I proffer that that's
>    more fragile than any code patching can ever be.)

>  - if done self-sufficiently (i.e. if a kernel image can be used to
>    trace things, which i believe any usable kernel tracer must offer),
>    it would, with the current debug info format, enlarge the kernel RAM
>    image by quite a substantial amount of unswappable kernel memory. 

I am not sure self-sufficiency is a good goal here.

If tracing becomes part of kernel-user ABI, we are in big trouble...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2008-04-28  3:34 [patch 0/2] Immediate Values - jump patching update Mathieu Desnoyers
2008-04-28  3:34 ` [patch 1/2] Immediate Values - jump liveliness Mathieu Desnoyers
2008-04-28  3:34 ` [patch 2/2] Markers - use imv_cond " Mathieu Desnoyers
2008-04-28 12:48 ` [patch 0/2] Immediate Values - jump patching update Ingo Molnar
2008-04-28 14:35   ` Mathieu Desnoyers
2008-04-28 17:21 ` H. Peter Anvin
2008-04-28 20:25   ` Ingo Molnar
2008-04-28 21:03     ` H. Peter Anvin
2008-04-28 22:11       ` Ingo Molnar
2008-04-28 22:25         ` H. Peter Anvin
2008-04-28 22:44           ` Ingo Molnar
2008-04-28 23:06             ` H. Peter Anvin
2008-04-29  0:47               ` Frank Ch. Eigler
2008-04-29  1:08                 ` H. Peter Anvin
2008-04-29 12:08                   ` Ingo Molnar
2008-05-14 14:53                     ` Pavel Machek
2008-04-29  1:46               ` Mathieu Desnoyers
2008-04-29  2:07                 ` H. Peter Anvin
2008-04-29 12:18                   ` Mathieu Desnoyers
2008-04-29 15:35                     ` H. Peter Anvin
2008-05-04 14:54                       ` Mathieu Desnoyers
2008-05-04 21:05                         ` H. Peter Anvin
