public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch 0/7] Immediate Values
@ 2007-09-17 18:42 Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel

Hi Andrew,

Here are the updated immediate values, following recent discussions.

It depends on the text edit lock patches.

It applies to 2.6.23-rc4-mm1, in the following order:

immediate-values-architecture-independent-code.patch
immediate-values-kconfig-embedded.patch
immediate-values-move-kprobes-i386-restore-interrupt-to-kdebug-h.patch
immediate-values-i386-optimization.patch
immediate-values-powerpc-optimization.patch
immediate-values-documentation.patch
profiling-use-immediate-values.patch

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  2007-09-18 17:47   ` Denys Vlasenko
  2007-09-17 18:42 ` [patch 2/7] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 17446 bytes --]

Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _immediate_read() version, which uses standard global
variables, and optimized per architecture immediate_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__immediate" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

Changelog:

- section __immediate is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove immediate_*_t types, add DECLARE_IMMEDIATE() and DEFINE_IMMEDIATE().
  - immediate_read(&var) becomes immediate_read(var) because of this.
- Adding a new EXPORT_IMMEDIATE_SYMBOL(_GPL).
- remove immediate_if(). Should use if (unlikely(immediate_read(var))) instead.
  - Wait until we have gcc support before we add the immediate_if macro, since
    its form may have to change.
- Document immediate_set_early(). This design is chosen over immediate_init()
  because:
    - We can mark the function __init so it is freed after boot.
    - Module code patching can also be done by the "normal" function. This is
      preferred, even if it can be slightly slower, because update performance
      does not matter.
    - immediate_init() could confuse users because we can actually "initialize"
      an immediate value to a certain value, which will be used as initial
      value. e.g.:
        static DEFINE_IMMEDIATE(int, var) = 10;

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/asm-generic/vmlinux.lds.h |    7 ++
 include/linux/immediate.h         |  133 ++++++++++++++++++++++++++++++++++++++
 include/linux/module.h            |   17 ++++
 init/main.c                       |    2 
 kernel/Makefile                   |    1 
 kernel/immediate.c                |   92 ++++++++++++++++++++++++++
 kernel/module.c                   |   49 ++++++++++++++
 7 files changed, 301 insertions(+)

Index: linux-2.6-lttng/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/immediate.h	2007-09-17 13:38:32.000000000 -0400
@@ -0,0 +1,133 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+#include <asm/immediate.h>
+
+/**
+ * immediate_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define immediate_set(name, i)						\
+	do {								\
+		name##__immediate = (i);				\
+		core_immediate_update();				\
+		module_immediate_update();				\
+	} while (0)
+
+
+/**
+ * _immediate_set - set immediate variable (without locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Must be called with module_mutex held.
+ */
+#define _immediate_set(name, i)						\
+	do {								\
+		name##__immediate = (i);				\
+		core_immediate_update();				\
+		_module_immediate_update();				\
+	} while (0)
+
+/**
+ * immediate_set_early - set immediate variable at early boot
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Should be used for updates at early boot, when only
+ * one CPU is active and interrupts are disabled.
+ */
+#define immediate_set_early(name, i)					\
+	do {								\
+		name##__immediate = (i);				\
+		immediate_update_early();				\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_immediate_update(void);
+extern void immediate_update_range(const struct __immediate *begin,
+	const struct __immediate *end);
+extern void immediate_update_early(void);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define immediate_read(name)		_immediate_read(name)
+
+/**
+ * immediate_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define immediate_set(name, i)		(name##__immediate = (i))
+
+/**
+ * _immediate_set - set immediate variable (without locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Must be called with module_mutex held.
+ */
+#define _immediate_set(name, i)		immediate_set(name, i)
+
+/**
+ * immediate_set_early - set immediate variable at early boot
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Should be used for early boot updates.
+ */
+#define immediate_set_early(name, i)	immediate_set(name, i)
+
+/*
+ * Internal update functions.
+ */
+static inline void immediate_update_early(void)
+{ }
+#endif
+
+#define DECLARE_IMMEDIATE(type, name) extern __typeof__(type) name##__immediate
+#define DEFINE_IMMEDIATE(type, name)  __typeof__(type) name##__immediate
+
+#define EXPORT_IMMEDIATE_SYMBOL(name) EXPORT_SYMBOL(name##__immediate)
+#define EXPORT_IMMEDIATE_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__immediate)
+
+/**
+ * _immediate_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _immediate_read(name)		(name##__immediate)
+
+#endif
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
@@ -122,6 +122,13 @@
 		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
 	}								\
 									\
+	/* Immediate values: pointers */				\
+	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start___immediate) = .;		\
+		*(__immediate)						\
+		VMLINUX_SYMBOL(__stop___immediate) = .;			\
+	}								\
+									\
 	/* Kernel symbol table: strings */				\
         __ksymtab_strings : AT(ADDR(__ksymtab_strings) - LOAD_OFFSET) {	\
 		*(__ksymtab_strings)					\
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/include/linux/module.h	2007-09-17 13:35:50.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/stringify.h>
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
+#include <linux/immediate.h>
 #include <asm/local.h>
 
 #include <asm/module.h>
@@ -370,6 +371,11 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+
+#ifdef CONFIG_IMMEDIATE
+	const struct __immediate *immediate;
+	unsigned int num_immediate;
+#endif
 };
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
@@ -473,6 +479,9 @@ int unregister_module_notifier(struct no
 
 extern void print_modules(void);
 
+extern void _module_immediate_update(void);
+extern void module_immediate_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -572,6 +581,14 @@ static inline void print_modules(void)
 {
 }
 
+static inline void _module_immediate_update(void)
+{
+}
+
+static inline void module_immediate_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2007-09-17 13:35:51.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -1717,6 +1718,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	struct module *mod;
 	long err = 0;
 	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
@@ -1813,6 +1815,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__immediate");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1963,6 +1966,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -2028,6 +2036,12 @@ static struct module *load_module(void _
 		 goto nomodsectinfo;
 #endif
 
+#ifdef CONFIG_IMMEDIATE
+	if (!mod->taints)
+		immediate_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2629,3 +2643,38 @@ EXPORT_SYMBOL(module_remove_driver);
 void struct_module(struct module *mod) { return; }
 EXPORT_SYMBOL(struct_module);
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_immediate_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_immediate_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		immediate_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_immediate_update);
+
+/**
+ * module_immediate_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_immediate_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_immediate_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_immediate_update);
+#endif
Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/immediate.c	2007-09-17 13:25:22.000000000 -0400
@@ -0,0 +1,92 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+
+extern const struct __immediate __start___immediate[];
+extern const struct __immediate __stop___immediate[];
+
+/*
+ * immediate_mutex nests inside module_mutex. immediate_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(immediate_mutex);
+
+/**
+ * immediate_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Sets a range of immediates to a enabled state : set the enable bit.
+ */
+void immediate_update_range(const struct __immediate *begin,
+		const struct __immediate *end)
+{
+	const struct __immediate *iter;
+	int ret;
+
+	for (iter = begin; iter < end; iter++) {
+		mutex_lock(&immediate_mutex);
+		kernel_text_lock();
+		ret = arch_immediate_update(iter);
+		kernel_text_unlock();
+		if (ret)
+			printk(KERN_WARNING "Invalid immediate value. "
+					    "Variable at %p, "
+					    "instruction at %p, size %lu\n",
+					    (void*)iter->immediate,
+					    (void*)iter->var, iter->size);
+		mutex_unlock(&immediate_mutex);
+	}
+}
+EXPORT_SYMBOL_GPL(immediate_update_range);
+
+/**
+ * immediate_update - update all immediate values in the kernel
+ * @lock: should a module_mutex be taken ?
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_immediate_update(void)
+{
+	/* Core kernel immediates */
+	immediate_update_range(__start___immediate, __stop___immediate);
+}
+EXPORT_SYMBOL_GPL(core_immediate_update);
+
+static void __init immediate_update_early_range(const struct __immediate *begin,
+		const struct __immediate *end)
+{
+	const struct __immediate *iter;
+
+	for (iter = begin; iter < end; iter++)
+		arch_immediate_update_early(iter);
+}
+
+/**
+ * immediate_update_early - Update immediate values at boot time
+ *
+ * Update the immediate values to the state of the variables they refer to. It
+ * is done before SMP is active, at the very beginning of start_kernel().
+ */
+void __init immediate_update_early(void)
+{
+	immediate_update_early_range(__start___immediate, __stop___immediate);
+}
Index: linux-2.6-lttng/init/main.c
===================================================================
--- linux-2.6-lttng.orig/init/main.c	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/init/main.c	2007-09-17 13:25:22.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/device.h>
 #include <linux/kthread.h>
+#include <linux/immediate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -525,6 +526,7 @@ asmlinkage void __init start_kernel(void
 	unwind_init();
 	lockdep_init();
 	container_init_early();
+	immediate_update_early();
 
 	local_irq_disable();
 	early_boot_irqs_off();
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/kernel/Makefile	2007-09-17 13:35:50.000000000 -0400
@@ -61,6 +61,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 2/7] Immediate Values - Kconfig menu in EMBEDDED
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  2007-09-17 20:55   ` Randy Dunlap
  2007-09-17 18:42 ` [patch 3/7] Immediate Values - Move Kprobes i386 restore_interrupt to kdebug.h Mathieu Desnoyers
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Adrian Bunk, Andi Kleen, Alexey Dobriyan,
	Christoph Hellwig

[-- Attachment #1: immediate-values-kconfig-embedded.patch --]
[-- Type: text/plain, Size: 2647 bytes --]

Immediate values provide a way to use dynamic code patching to update variables
sitting within the instruction stream. It saves caches lines normally used by
static read mostly variables. Enable it by default, but let users disable it
through the EMBEDDED menu with the "Disable immediate values" submenu entry.

Note: Since I think that I really should let embedded systems developers using
RO memory the option to disable the immediate values, I choose to leave this
menu option there, in the EMBEDDED menu. Also, the "CONFIG_IMMEDIATE" makes
sense because we want to compile out all the immediate code when we decide not
to use optimized immediate values at all (it removes otherwise unused code).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Christoph Hellwig <hch@infradead.org>
---
 init/Kconfig |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2007-09-04 11:53:59.000000000 -0400
+++ linux-2.6-lttng/init/Kconfig	2007-09-04 12:12:03.000000000 -0400
@@ -407,6 +407,17 @@ config CC_OPTIMIZE_FOR_SIZE
 config SYSCTL
 	bool
 
+config IMMEDIATE
+	default y if !DISABLE_IMMEDIATE
+	depends on X86_32 || PPC || PPC64
+	bool
+	help
+	  Immediate values are used as read mostly variables that are rarely
+	  updated. They use code patching to modify the values inscribed in the
+	  instruction stream. It provides a way to save precious cache lines
+	  that would otherwise have to be used by these variables. Can be
+	  disabled through the EMBEDDED menu.
+
 menuconfig EMBEDDED
 	bool "Configure standard kernel features (for small systems)"
 	help
@@ -645,6 +656,16 @@ config PROC_KPAGEMAP
           information on page-level memory usage. Disabling this interface
           will reduce the size of the kernel by around 600 bytes.
 
+config DISABLE_IMMEDIATE
+	default y if EMBEDDED
+	bool "Disable immediate values" if EMBEDDED
+	depends on X86_32 || PPC || PPC64
+	help
+	  Disable code patching based immediate values for embedded systems. It
+	  consumes slightly more memory and requires to modify the instruction
+	  stream each time a variable is updated. Should really be disabled for
+	  embedded systems with read-only text.
+
 endmenu		# General setup
 
 config RT_MUTEXES

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 3/7] Immediate Values - Move Kprobes i386 restore_interrupt to kdebug.h
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 2/7] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 4/7] Immediate Values - i386 Optimization Mathieu Desnoyers
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli, Christoph Hellwig,
	prasanna, anil.s.keshavamurthy, davem

[-- Attachment #1: immediate-values-move-kprobes-i386-restore-interrupt-to-kdebug-h.patch --]
[-- Type: text/plain, Size: 2329 bytes --]

Since the breakpoint handler is useful both to kprobes and immediate values, it
makes sense to make the required restore_interrupt() available through
asm-i386/kdebug.h.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: prasanna@in.ibm.com
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
---
 include/asm-i386/kdebug.h  |   12 ++++++++++++
 include/asm-i386/kprobes.h |    9 ---------
 2 files changed, 12 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng/include/asm-i386/kdebug.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-i386/kdebug.h	2007-09-04 11:52:37.000000000 -0400
+++ linux-2.6-lttng/include/asm-i386/kdebug.h	2007-09-04 12:12:06.000000000 -0400
@@ -6,6 +6,9 @@
  * from x86_64 architecture.
  */
 
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
 struct pt_regs;
 
 /* Grossly misnamed. */
@@ -25,4 +28,13 @@ enum die_val {
 	DIE_PAGE_FAULT_NO_CONTEXT,
 };
 
+/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
+ * if necessary, before executing the original int3/1 (trap) handler.
+ */
+static inline void restore_interrupts(struct pt_regs *regs)
+{
+	if (regs->eflags & IF_MASK)
+		local_irq_enable();
+}
+
 #endif
Index: linux-2.6-lttng/include/asm-i386/kprobes.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-i386/kprobes.h	2007-09-04 11:53:26.000000000 -0400
+++ linux-2.6-lttng/include/asm-i386/kprobes.h	2007-09-04 12:12:06.000000000 -0400
@@ -79,15 +79,6 @@ struct kprobe_ctlblk {
 	struct prev_kprobe prev_kprobe;
 };
 
-/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
- * if necessary, before executing the original int3/1 (trap) handler.
- */
-static inline void restore_interrupts(struct pt_regs *regs)
-{
-	if (regs->eflags & IF_MASK)
-		local_irq_enable();
-}
-
 extern int kprobe_exceptions_notify(struct notifier_block *self,
 				    unsigned long val, void *data);
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 4/7] Immediate Values - i386 Optimization
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2007-09-17 18:42 ` [patch 3/7] Immediate Values - Move Kprobes i386 restore_interrupt to kdebug.h Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  2007-09-18  6:04   ` Borislav Petkov
  2007-09-17 18:42 ` [patch 5/7] Immediate Values - Powerpc Optimization Mathieu Desnoyers
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig

[-- Attachment #1: immediate-values-i386-optimization.patch --]
[-- Type: text/plain, Size: 16430 bytes --]

i386 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
---
 arch/i386/kernel/Makefile    |    1 
 arch/i386/kernel/immediate.c |  307 +++++++++++++++++++++++++++++++++++++++++++
 arch/i386/kernel/traps.c     |    8 -
 include/asm-i386/immediate.h |   82 +++++++++++
 4 files changed, 394 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/include/asm-i386/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-i386/immediate.h	2007-09-17 13:38:52.000000000 -0400
@@ -0,0 +1,82 @@
+#ifndef _ASM_I386_IMMEDIATE_H
+#define _ASM_I386_IMMEDIATE_H
+
+/*
+ * Immediate values. i386 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+struct __immediate {
+	long var;		/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	long immediate;		/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	long size;		/* Type size. */
+};
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @var.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
+ * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
+ * fall back on a memory read.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm (	".section __immediate, \"a\", @progbits;\n\t" \
+					".long %1, (0f)+1, 1;\n\t"	\
+					".previous;\n\t"		\
+					"0:\n\t"			\
+					"mov %2,%0;\n\t"		\
+				: "=r" (value)				\
+				: "m" (name##__immediate),		\
+				  "i" (0));				\
+			break;						\
+		case 2:							\
+			asm (	".section __immediate, \"a\", @progbits;\n\t" \
+					".long %1, (0f)+2, 2;\n\t"	\
+					".previous;\n\t"		\
+					"1:\n\t"			\
+					".align 2;\n\t"			\
+					"0:\n\t"			\
+					"mov %2,%0;\n\t"		\
+				: "=r" (value)				\
+				: "m" (name##__immediate),		\
+				  "i" (0));				\
+			break;						\
+		case 4:							\
+			asm (	".section __immediate, \"a\", @progbits;\n\t" \
+					".long %1, (0f)+1, 4;\n\t"	\
+					".previous;\n\t"		\
+					"1:\n\t"			\
+					".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
+					"0:\n\t"			\
+					"mov %2,%0;\n\t"		\
+				: "=r" (value)				\
+				: "m" (name##__immediate),		\
+				  "i" (0));				\
+			break;						\
+		default:value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_I386_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/i386/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/i386/kernel/Makefile	2007-09-17 13:37:58.000000000 -0400
+++ linux-2.6-lttng/arch/i386/kernel/Makefile	2007-09-17 13:38:44.000000000 -0400
@@ -37,6 +37,7 @@ obj-$(CONFIG_MODULES)		+= module.o
 obj-y				+= sysenter.o vsyscall.o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat.o
 obj-$(CONFIG_EFI) 		+= efi.o efi_stub.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault.o
 obj-$(CONFIG_KGDB)		+= kgdb.o kgdb-jmp.o
 obj-$(CONFIG_VM86)		+= vm86.o
Index: linux-2.6-lttng/arch/i386/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/i386/kernel/immediate.c	2007-09-17 13:38:44.000000000 -0400
@@ -0,0 +1,307 @@
+/*
+ * Immediate Value - i386 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micorops resulting from instruction interpretation -
+ * cannot guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discoverers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel applies not so well in kernel context: it
+ * would imply disabling interrupts and looping on every CPUs while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no cpu could possibly still have the out-of-date copy of the instruction,
+ * modify the now unused 2nd byte of the instruction, and then put back the
+ * original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/notifier.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define BREAKPOINT_INS_LEN	1
+#define NR_NOPS			8
+
+static long target_after_int3;		/* EIP of the target after the int3 */
+static long bypass_eip;			/* EIP of the bypass. */
+static long bypass_after_int3;		/* EIP after the end-of-bypass int3 */
+static long after_immediate;		/*
+					 * EIP where to resume after the
+					 * single-stepping.
+					 */
+
+/*
+ * Size of the movl instruction (without the immediate value) in bytes.
+ * The 2 bytes load immediate has a 66H prefix, which makes the opcode 2 bytes
+ * wide.
+ */
+static inline size_t _immediate_get_insn_size(long size)
+{
+	switch (size) {
+		case 1: return 1;
+		case 2: return 2;
+		case 4: return 1;
+		default: BUG();
+	};
+}
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * nop nop nop nop nop nop nop nop int3
+ * The nops are the target replaced by the instruction to single-step.
+ */
+static inline void _immediate_bypass(long *bypassaddr, long *breaknextaddr)
+{
+		asm volatile (	"jmp 2f;\n\t"
+				"0:\n\t"
+				".space 8, 0x90;\n\t"
+				"1:\n\t"
+				"int3;\n\t"
+				"2:\n\t"
+				"movl $(0b),%0;\n\t"
+				"movl $((1b)+1),%1;\n\t"
+				: "=r" (*bypassaddr),
+				  "=r" (*breaknextaddr) : );
+}
+
+static void immediate_synchronize_core(void *info)
+{
+	sync_core();	/* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass users have
+ * ended.
+ */
+static int immediate_notifier(struct notifier_block *nb,
+	unsigned long val, void *data)
+{
+	enum die_val die_val = (enum die_val) val;
+	struct die_args *args = data;
+
+	if (!args->regs || user_mode_vm(args->regs))
+		return NOTIFY_DONE;
+
+	if (die_val == DIE_INT3) {
+		if (args->regs->eip == target_after_int3) {
+			preempt_disable();
+			args->regs->eip = bypass_eip;
+			return NOTIFY_STOP;
+		} else if (args->regs->eip == bypass_after_int3) {
+			args->regs->eip = after_immediate;
+			preempt_enable();
+			return NOTIFY_STOP;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block immediate_notify = {
+	.notifier_call = immediate_notifier,
+	.priority = 0x7fffffff,	/* we need to be notified first */
+};
+
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+__kprobes int arch_immediate_update(const struct __immediate *immediate)
+{
+	int ret;
+	size_t insn_size = _immediate_get_insn_size(immediate->size);
+	long insn = immediate->immediate - insn_size;
+	long len;
+	unsigned long cr0;
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	if (unlikely(*(unsigned char*)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void*)immediate->immediate,
+				    (void*)immediate->var, immediate->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+		case 1:	if (*(uint8_t*)immediate->immediate
+					== *(uint8_t*)immediate->var)
+				return 0;
+			break;
+		case 2:	if (*(uint16_t*)immediate->immediate
+					== *(uint16_t*)immediate->var)
+				return 0;
+			break;
+		case 4:	if (*(uint32_t*)immediate->immediate
+					== *(uint32_t*)immediate->var)
+				return 0;
+			break;
+		default:return -EINVAL;
+	}
+
+	_immediate_bypass(&bypass_eip, &bypass_after_int3);
+
+	after_immediate = immediate->immediate + immediate->size;
+
+	/*
+	 * Using the _early variants because nobody is executing the
+	 * bypass code while we patch it. It is protected by the
+	 * immediate_mutex. Since we modify the instructions non atomically (for
+	 * nops), we have to use the _early variant.
+	 * We must however deal with the WP flag in cr0 by ourself.
+	 */
+	kernel_wp_save(cr0);
+	text_poke_early((void*)bypass_eip, (void*)insn,
+			insn_size + immediate->size);
+	/*
+	 * Fill the rest with nops.
+	 */
+	len = NR_NOPS - immediate->size - insn_size;
+	add_nops((void*)(bypass_eip + insn_size + immediate->size), len);
+	kernel_wp_restore(cr0);
+
+	target_after_int3 = insn + BREAKPOINT_INS_LEN;
+	/* register_die_notifier has memory barriers */
+	register_die_notifier(&immediate_notify);
+	/* The breakpoint will single-step the bypass */
+	text_poke((void*)insn,
+		INIT_ARRAY(unsigned char, BREAKPOINT_INSTRUCTION, 1), 1);
+	wmb();
+	/*
+	 * Execute serializing instruction on each CPU.
+	 * Acts as a memory barrier.
+	 */
+	ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
+	BUG_ON(ret != 0);
+
+	text_poke((void*)(insn + insn_size), (void*)immediate->var,
+			immediate->size);
+	wmb();
+	text_poke((void*)insn, (unsigned char*)bypass_eip, 1);
+		/*
+		 * Wait for all int3 handlers to end
+		 * (interrupts are disabled in int3).
+		 * This CPU is clearly not in a int3 handler,
+		 * because int3 handler is not preemptible and
+		 * there cannot be any more int3 handler called
+		 * for this site, because we placed the original
+		 * instruction back.
+		 * synchronize_sched has memory barriers.
+		 */
+	synchronize_sched();
+	unregister_die_notifier(&immediate_notify);
+	/* unregister_die_notifier has memory barriers */
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ */
+void __init arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+		case 1:	if (*(uint8_t*)immediate->immediate
+					== *(uint8_t*)immediate->var)
+				return;
+			break;
+		case 2:	if (*(uint16_t*)immediate->immediate
+					== *(uint16_t*)immediate->var)
+				return;
+			break;
+		case 4:	if (*(uint32_t*)immediate->immediate
+					== *(uint32_t*)immediate->var)
+				return;
+			break;
+		default:return;
+	}
+	memcpy((void*)immediate->immediate, (void*)immediate->var,
+			immediate->size);
+}
Index: linux-2.6-lttng/arch/i386/kernel/traps.c
===================================================================
--- linux-2.6-lttng.orig/arch/i386/kernel/traps.c	2007-09-17 13:37:59.000000000 -0400
+++ linux-2.6-lttng/arch/i386/kernel/traps.c	2007-09-17 13:38:44.000000000 -0400
@@ -601,7 +601,7 @@ fastcall void do_##name(struct pt_regs *
 }
 
 DO_VM86_ERROR_INFO( 0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->eip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
 DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
 #endif
 DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
@@ -843,14 +843,14 @@ void restart_nmi(void)
 	acpi_nmi_enable();
 }
 
-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
 fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
 		return;
-	/* This is an interrupt gate, because kprobes wants interrupts
-	disabled.  Normal trap handlers don't. */
+	/* This is an interrupt gate, because kprobes and immediate values wants
+	 * interrupts disabled. Normal trap handlers don't. */
 	restore_interrupts(regs);
 	do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
 }

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 5/7] Immediate Values - Powerpc Optimization
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2007-09-17 18:42 ` [patch 4/7] Immediate Values - i386 Optimization Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 6/7] Immediate Values - Documentation Mathieu Desnoyers
  2007-09-17 18:42 ` [patch 7/7] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
  6 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers, Christoph Hellwig

[-- Attachment #1: immediate-values-powerpc-optimization.patch --]
[-- Type: text/plain, Size: 6946 bytes --]

PowerPC optimization of the immediate values which uses a li instruction,
patched with an immediate value.

Changelog:
- Put immediate_set and _immediate_set in the architecture independent header.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Christoph Hellwig <hch@infradead.org>
---
 arch/powerpc/kernel/Makefile    |    1 
 arch/powerpc/kernel/immediate.c |  103 ++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/immediate.h |   73 ++++++++++++++++++++++++++++
 3 files changed, 177 insertions(+)

Index: linux-2.6-lttng/include/asm-powerpc/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-powerpc/immediate.h	2007-09-17 13:39:02.000000000 -0400
@@ -0,0 +1,73 @@
+#ifndef _ASM_POWERPC_IMMEDIATE_H
+#define _ASM_POWERPC_IMMEDIATE_H
+
+/*
+ * Immediate values. PowerPC architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm-compat.h>
+
+struct __immediate {
+	long var;		/* Identifier variable of the immediate value */
+	long immediate;		/*
+				 * Pointer to the memory location that holds
+				 * the immediate value within the load immediate
+				 * instruction.
+				 */
+	long size;		/* Type size. */
+};
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 bytes update will be atomic by aligning the immediate
+ * value. Use a normal memory read for the 4 bytes immediate because there is no
+ * way to atomically update it without using a seqlock read side, which would
+ * cost more in term of total i-cache and d-cache space than a simple memory
+ * read.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm (	".section __immediate, \"a\", @progbits;\n\t" \
+					PPC_LONG "%1, ((0f)+3), 1;\n\t"	\
+					".previous;\n\t"		\
+					"0:\n\t"			\
+					"li %0,%2;\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (0));				\
+			break;						\
+		case 2:							\
+			asm (	".section __immediate, \"a\", @progbits;\n\t" \
+					PPC_LONG "%1, ((0f)+2), 2;\n\t"	\
+					".previous;\n\t"		\
+					".align 2\n\t"			\
+					"0:\n\t"			\
+					"li %0,%2;\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (0));				\
+			break;						\
+		default:						\
+			value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_POWERPC_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/powerpc/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/kernel/Makefile	2007-09-17 13:37:03.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/kernel/Makefile	2007-09-17 13:38:54.000000000 -0400
@@ -97,3 +97,4 @@ obj-$(CONFIG_PPC64)		+= $(obj64-y)
 
 extra-$(CONFIG_PPC_FPU)		+= fpu.o
 extra-$(CONFIG_PPC64)		+= entry_64.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
Index: linux-2.6-lttng/arch/powerpc/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/powerpc/kernel/immediate.c	2007-09-17 13:38:54.000000000 -0400
@@ -0,0 +1,103 @@
+/*
+ * Powerpc optimized immediate values enabling/disabling.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/string.h>
+#include <linux/kprobes.h>
+#include <asm/cacheflush.h>
+#include <asm/page.h>
+
+#define LI_OPCODE_LEN	2
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+int arch_immediate_update(const struct __immediate *immediate)
+{
+#ifdef CONFIG_KPROBES
+	kprobe_opcode_t *insn;
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	switch (immediate->size) {
+		case 1:	/* The uint8_t points to the 3rd byte of the
+			 * instruction */
+			insn = (void*)(immediate->immediate - 1 - LI_OPCODE_LEN);
+			break;
+		case 2:	insn = (void*)(immediate->immediate - LI_OPCODE_LEN);
+			break;
+		default:
+		return -EINVAL;
+	}
+
+	if (unlikely(*insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void*)immediate->immediate,
+				    (void*)immediate->var, immediate->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+		case 1:	if (*(uint8_t*)immediate->immediate
+					== *(uint8_t*)immediate->var)
+				return 0;
+			break;
+		case 2:	if (*(uint16_t*)immediate->immediate
+					== *(uint16_t*)immediate->var)
+				return 0;
+			break;
+		default:return -EINVAL;
+	}
+	memcpy((void*)immediate->immediate, (void*)immediate->var,
+			immediate->size);
+	flush_icache_range((unsigned long)immediate->immediate,
+		(unsigned long)immediate->immediate + immediate->size);
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ * We can use flush_icache_range, since the cpu identification has been done in
+ * the early_init stage.
+ */
+void __init arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+		case 1:	if (*(uint8_t*)immediate->immediate
+					== *(uint8_t*)immediate->var)
+				return;
+			break;
+		case 2:	if (*(uint16_t*)immediate->immediate
+					== *(uint16_t*)immediate->var)
+				return;
+			break;
+		default:return;
+	}
+	memcpy((void*)immediate->immediate, (void*)immediate->var,
+			immediate->size);
+	flush_icache_range((unsigned long)immediate->immediate,
+		(unsigned long)immediate->immediate + immediate->size);
+}

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 6/7] Immediate Values - Documentation
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2007-09-17 18:42 ` [patch 5/7] Immediate Values - Powerpc Optimization Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  2007-09-17 20:55   ` Randy Dunlap
  2007-09-17 18:42 ` [patch 7/7] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
  6 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: immediate-values-documentation.patch --]
[-- Type: text/plain, Size: 9369 bytes --]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 Documentation/immediate.txt |  228 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 228 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2007-09-17 14:03:09.000000000 -0400
@@ -0,0 +1,228 @@
+		        Using the Immediate Values
+
+			    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+* Purpose of immediate values
+
+An immediate value is used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+immediate_if() as condition surrounding the code.
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMMEDIATE(char, this_immediate);
+EXPORT_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use immediate_set(&this_immediate) to set the immediate value.
+
+Use immediate_read(&this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _immediate_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+The prefered idiom to dynamically enable compiled-in code is to use
+immediate_if (&this_immediate), which may eventually use gcc improvements to
+provide a jump instruction patching based condition instead of a immediate value
+feeding a conditional jump. You should use _immediate_if () instead of
+immediate_if () in functions marked __init or __exit.
+
+immediate_set_early() should be used only at early kernel boot time, before SMP
+is activated. It provides no locking, does not take care of clearing the
+write-protection bits when touching executable pages and does not take care or
+SMP errata on the i386 and x86_64 architectures. By not doing so, it can update
+the immediate values very early at kernel boot time and require the minimalistic
+infrastructure available at that point.
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMMEDIATE(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+* Performance improvement
+
+* Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 markers sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sort of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
+
+
+* Markers impact under heavy memory load
+
+Running a kernel with my LTTng instrumentation set, in a test that
+generates memory pressure (from userspace) by trashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+This test is done in user-space, so there are some delays due to IRQs
+coming and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results: Linear cache trashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" trash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 trashing
+surrounding a getppid() call.
+
+- Markers compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+markers than with markers. I explain it by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+markers are compiled in or out. It seems that it aligns the function's
+data better to compile-in the markers in this case.
+
+But since the interesting comparison is between the immediate values and
+global variables based markers, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based markers (2 markers) adds 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with less iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and withou
+memory pressure. (sorry, my system is not setup to execute syscall_trace this
+time, but it will make the point anyway).
+
+No memory pressure
+Reading timestamps:     150.92 cycles,     std dev.    1.01 cycles
+getppid:               1462.09 cycles,     std dev.   18.87 cycles
+
+With memory pressure
+Reading timestamps:     578.22 cycles,     std dev.  269.51 cycles
+getppid:              17113.33 cycles,     std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+                       644.09 cycles,      std dev.   11.39 cycles
+L1 cache hit based branch:
+                        88.16 cycles,      std dev.    1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09±11.39 - 88.16±1.35 = 555.93±11.46 cycles
+
+Getppid without memory pressure:
+1462.09±18.87 - 150.92±1.01 = 1311.17±18.90 cycles
+
+Getppid with memory pressure:
+17113.33±1655.92 - 578.22±269.51 = 16535.11±1677.71 cycles
+
+Therefore, if we add 2 markers not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * 555.93±12.74 = 1111.86±25.48 cycles
+
+Therefore,
+
+1111.86±25.48 / 16535.11±1677.71 = 0.0672
+ relative error: sqrt(((25.48/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1040
+ absolute error: 0.1040 * 0.0672 = 0.0070
+
+Therefore: 0.0672±0.0070 * 100% = 6.72±0.70 %
+
+We can therefore affirm that adding 2 markers to getppid, on a system with high
+memory pressure, would have a performance hit of at least 6.0% on the system
+call time, all within the uncertainty limits of these tests. The same applies to
+other kernel code paths. The smaller those code paths are, the highest the
+impact ratio will be.
+
+Therefore, not only is it interesting to use the immediate values to dynamically
+activate dormant code such as the markers, but I think it should also be
+considered as a replacement for many of the "read mostly" static variables.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 7/7] Scheduler Profiling - Use Immediate Values
  2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2007-09-17 18:42 ` [patch 6/7] Immediate Values - Documentation Mathieu Desnoyers
@ 2007-09-17 18:42 ` Mathieu Desnoyers
  6 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-17 18:42 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: profiling-use-immediate-values.patch --]
[-- Type: text/plain, Size: 5612 bytes --]

Use immediate values with lower d-cache hit in optimized version as a
condition for scheduler profiling call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 drivers/kvm/svm.c       |    4 +---
 drivers/kvm/vmx.c       |    3 +--
 include/linux/profile.h |    5 +++--
 kernel/profile.c        |   22 +++++++++++-----------
 4 files changed, 16 insertions(+), 18 deletions(-)

Index: linux-2.6-lttng/kernel/profile.c
===================================================================
--- linux-2.6-lttng.orig/kernel/profile.c	2007-09-14 10:00:17.000000000 -0400
+++ linux-2.6-lttng/kernel/profile.c	2007-09-14 10:11:12.000000000 -0400
@@ -42,8 +42,8 @@ static int (*timer_hook)(struct pt_regs 
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
-int prof_on __read_mostly;
-EXPORT_SYMBOL_GPL(prof_on);
+DEFINE_IMMEDIATE(char, prof_on) __read_mostly;
+EXPORT_IMMEDIATE_SYMBOL_GPL(prof_on);
 
 static cpumask_t prof_cpu_mask = CPU_MASK_ALL;
 #ifdef CONFIG_SMP
@@ -60,7 +60,7 @@ static int __init profile_setup(char * s
 	int par;
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
-		prof_on = SLEEP_PROFILING;
+		immediate_set_early(prof_on, SLEEP_PROFILING);
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
 		if (get_option(&str, &par))
@@ -69,7 +69,7 @@ static int __init profile_setup(char * s
 			"kernel sleep profiling enabled (shift: %ld)\n",
 			prof_shift);
 	} else if (!strncmp(str, schedstr, strlen(schedstr))) {
-		prof_on = SCHED_PROFILING;
+		immediate_set_early(prof_on, SCHED_PROFILING);
 		if (str[strlen(schedstr)] == ',')
 			str += strlen(schedstr) + 1;
 		if (get_option(&str, &par))
@@ -78,7 +78,7 @@ static int __init profile_setup(char * s
 			"kernel schedule profiling enabled (shift: %ld)\n",
 			prof_shift);
 	} else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
-		prof_on = KVM_PROFILING;
+		immediate_set_early(prof_on, KVM_PROFILING);
 		if (str[strlen(kvmstr)] == ',')
 			str += strlen(kvmstr) + 1;
 		if (get_option(&str, &par))
@@ -88,7 +88,7 @@ static int __init profile_setup(char * s
 			prof_shift);
 	} else if (get_option(&str, &par)) {
 		prof_shift = par;
-		prof_on = CPU_PROFILING;
+		immediate_set_early(prof_on, CPU_PROFILING);
 		printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
 			prof_shift);
 	}
@@ -99,7 +99,7 @@ __setup("profile=", profile_setup);
 
 void __init profile_init(void)
 {
-	if (!prof_on) 
+	if (!_immediate_read(prof_on))
 		return;
  
 	/* only text is profiled */
@@ -288,7 +288,7 @@ void profile_hits(int type, void *__pc, 
 	int i, j, cpu;
 	struct profile_hit *hits;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
 	i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
@@ -398,7 +398,7 @@ void profile_hits(int type, void *__pc, 
 {
 	unsigned long pc;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = ((unsigned long)__pc - (unsigned long)_stext) >> prof_shift;
 	atomic_add(nr_hits, &prof_buffer[min(pc, prof_len - 1)]);
@@ -555,7 +555,7 @@ static int __init create_hash_tables(voi
 	}
 	return 0;
 out_cleanup:
-	prof_on = 0;
+	immediate_set_early(prof_on, 0);
 	smp_mb();
 	on_each_cpu(profile_nop, NULL, 0, 1);
 	for_each_online_cpu(cpu) {
@@ -582,7 +582,7 @@ static int __init create_proc_profile(vo
 {
 	struct proc_dir_entry *entry;
 
-	if (!prof_on)
+	if (!_immediate_read(prof_on))
 		return 0;
 	if (create_hash_tables())
 		return -1;
Index: linux-2.6-lttng/include/linux/profile.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/profile.h	2007-09-14 10:00:17.000000000 -0400
+++ linux-2.6-lttng/include/linux/profile.h	2007-09-14 10:00:30.000000000 -0400
@@ -7,10 +7,11 @@
 #include <linux/init.h>
 #include <linux/cpumask.h>
 #include <linux/cache.h>
+#include <linux/immediate.h>
 
 #include <asm/errno.h>
 
-extern int prof_on __read_mostly;
+DECLARE_IMMEDIATE(char, prof_on) __read_mostly;
 
 #define CPU_PROFILING	1
 #define SCHED_PROFILING	2
@@ -38,7 +39,7 @@ static inline void profile_hit(int type,
 	/*
 	 * Speedup for the common (no profiling enabled) case:
 	 */
-	if (unlikely(prof_on == type))
+	if (unlikely(immediate_read(prof_on) == type))
 		profile_hits(type, ip, 1);
 }
 
Index: linux-2.6-lttng/drivers/kvm/svm.c
===================================================================
--- linux-2.6-lttng.orig/drivers/kvm/svm.c	2007-09-14 10:00:17.000000000 -0400
+++ linux-2.6-lttng/drivers/kvm/svm.c	2007-09-14 10:00:30.000000000 -0400
@@ -1570,9 +1570,7 @@ again:
 	/*
 	 * Profile KVM exit RIPs:
 	 */
-	if (unlikely(prof_on == KVM_PROFILING))
-		profile_hit(KVM_PROFILING,
-			(void *)(unsigned long)svm->vmcb->save.rip);
+	profile_hit(KVM_PROFILING, (void *)(unsigned long)svm->vmcb->save.rip);
 
 	stgi();
 
Index: linux-2.6-lttng/drivers/kvm/vmx.c
===================================================================
--- linux-2.6-lttng.orig/drivers/kvm/vmx.c	2007-09-14 10:00:17.000000000 -0400
+++ linux-2.6-lttng/drivers/kvm/vmx.c	2007-09-14 10:00:30.000000000 -0400
@@ -2234,8 +2234,7 @@ again:
 	/*
 	 * Profile KVM exit RIPs:
 	 */
-	if (unlikely(prof_on == KVM_PROFILING))
-		profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
+	profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
 
 	r = kvm_handle_exit(kvm_run, vcpu);
 	if (r > 0) {

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/7] Immediate Values - Kconfig menu in EMBEDDED
  2007-09-17 18:42 ` [patch 2/7] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
@ 2007-09-17 20:55   ` Randy Dunlap
  0 siblings, 0 replies; 29+ messages in thread
From: Randy Dunlap @ 2007-09-17 20:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Adrian Bunk, Andi Kleen, Alexey Dobriyan,
	Christoph Hellwig

On Mon, 17 Sep 2007 14:42:26 -0400 Mathieu Desnoyers wrote:

>  init/Kconfig |   21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> Index: linux-2.6-lttng/init/Kconfig
> ===================================================================
> --- linux-2.6-lttng.orig/init/Kconfig	2007-09-04 11:53:59.000000000 -0400
> +++ linux-2.6-lttng/init/Kconfig	2007-09-04 12:12:03.000000000 -0400
> @@ -407,6 +407,17 @@ config CC_OPTIMIZE_FOR_SIZE
>  config SYSCTL
>  	bool
>  
> +config IMMEDIATE
> +	default y if !DISABLE_IMMEDIATE
> +	depends on X86_32 || PPC || PPC64
> +	bool
> +	help
> +	  Immediate values are used as read mostly variables that are rarely

                                       read-mostly

> +	  updated. They use code patching to modify the values inscribed in the
> +	  instruction stream. It provides a way to save precious cache lines

They .... (then) It ...
so choose one of the other.  I think I prefer "They".

> +	  that would otherwise have to be used by these variables. Can be
> +	  disabled through the EMBEDDED menu.
> +
>  menuconfig EMBEDDED
>  	bool "Configure standard kernel features (for small systems)"
>  	help

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 6/7] Immediate Values - Documentation
  2007-09-17 18:42 ` [patch 6/7] Immediate Values - Documentation Mathieu Desnoyers
@ 2007-09-17 20:55   ` Randy Dunlap
  2007-09-18 13:13     ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: Randy Dunlap @ 2007-09-17 20:55 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel

On Mon, 17 Sep 2007 14:42:30 -0400 Mathieu Desnoyers wrote:

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> ---
>  Documentation/immediate.txt |  228 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 228 insertions(+)
> 
> Index: linux-2.6-lttng/Documentation/immediate.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/Documentation/immediate.txt	2007-09-17 14:03:09.000000000 -0400
> @@ -0,0 +1,228 @@
> +		        Using the Immediate Values
> +
> +			    Mathieu Desnoyers
> +
> +
> +This document introduces Immediate Values and their use.
> +
> +* Purpose of immediate values
> +
[snip]
> +
> +If you have to read the immediate values from a function declared as __init or
> +__exit, you should explicitly use _immediate_read(), which will fall back on a
> +global variable read. Failing to do so will leave a reference to the __init
> +section after it is freed (it would generate a modpost warning).
> +
> +The prefered idiom to dynamically enable compiled-in code is to use

       preferred

> +immediate_if (&this_immediate), which may eventually use gcc improvements to
> +provide a jump instruction patching based condition instead of a immediate value

                                                                  an

> +feeding a conditional jump. You should use _immediate_if () instead of
> +immediate_if () in functions marked __init or __exit.
> +
> +immediate_set_early() should be used only at early kernel boot time, before SMP
> +is activated. It provides no locking, does not take care of clearing the
> +write-protection bits when touching executable pages and does not take care or

                                                           does not take care of ??

> +SMP errata on the i386 and x86_64 architectures. By not doing so, it can update
> +the immediate values very early at kernel boot time and require the minimalistic

                                                          require only the

> +infrastructure available at that point.
> +
> +You can choose to set an initial static value to the immediate by using, for
> +instance:
> +
> +DEFINE_IMMEDIATE(long, myptr) = 10;
> +
> +
[snip]
> +
> +(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
> +For a 0.004% slowdown.
> +
> +If we plan to use this for memory allocation, spinlock, and all sort of

                                                              all sorts of

> +very high event rate tracing, we can assume it will execute 10 to 100
> +times more sites per second, which brings us to 0.4% slowdown with the
> +instruction stream based test compared to 29% slowdown with the memory
> +load based test on a system with high memory pressure.
> +
> +
[snip]
> +
> +
> +- Test redone with less iterations, but with error estimates
> +
> +10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
> +syscall trace inactive, comparing the case with memory pressure and withou

                                                  last word:  without

> +memory pressure. (sorry, my system is not setup to execute syscall_trace this
> +time, but it will make the point anyway).
> +
[snip]
> +
> +
> +Therefore, not only is it interesting to use the immediate values to dynamically
> +activate dormant code such as the markers, but I think it should also be
> +considered as a replacement for many of the "read mostly" static variables.

                                               "read-mostly"

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 4/7] Immediate Values - i386 Optimization
  2007-09-17 18:42 ` [patch 4/7] Immediate Values - i386 Optimization Mathieu Desnoyers
@ 2007-09-18  6:04   ` Borislav Petkov
  0 siblings, 0 replies; 29+ messages in thread
From: Borislav Petkov @ 2007-09-18  6:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig

On Mon, Sep 17, 2007 at 02:42:28PM -0400, Mathieu Desnoyers wrote:
> i386 optimization of the immediate values which uses a movl with code patching
> to set/unset the value used to populate the register used as variable source.
> 
> Changelog:
> - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
>   non atomic writes to a code region only touched by us (nobody can execute it
>   since we are protected by the immediate_mutex).
> - Put immediate_set and _immediate_set in the architecture independent header.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: Andi Kleen <ak@muc.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Chuck Ebbert <cebbert@redhat.com>
> CC: Christoph Hellwig <hch@infradead.org>
> ---
>  arch/i386/kernel/Makefile    |    1 
>  arch/i386/kernel/immediate.c |  307 +++++++++++++++++++++++++++++++++++++++++++
>  arch/i386/kernel/traps.c     |    8 -
>  include/asm-i386/immediate.h |   82 +++++++++++
>  4 files changed, 394 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6-lttng/include/asm-i386/immediate.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/include/asm-i386/immediate.h	2007-09-17 13:38:52.000000000 -0400
> @@ -0,0 +1,82 @@
> +#ifndef _ASM_I386_IMMEDIATE_H
> +#define _ASM_I386_IMMEDIATE_H
> +
> +/*
> + * Immediate values. i386 architecture optimizations.
> + *
> + * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> + *
> + * This file is released under the GPLv2.
> + * See the file COPYING for more details.
> + */
> +
> +struct __immediate {
> +	long var;		/* Pointer to the identifier variable of the
> +				 * immediate value
> +				 */
> +	long immediate;		/*
> +				 * Pointer to the memory location of the
> +				 * immediate value within the instruction.
> +				 */
> +	long size;		/* Type size. */
> +};
> +
> +/**
> + * immediate_read - read immediate variable
> + * @name: immediate value name
> + *
> + * Reads the value of @var.
s/var/name/ ?
> + * Optimized version of the immediate.
> + * Do not use in __init and __exit functions. Use _immediate_read() instead.
> + * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
> + * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
> + * fall back on a memory read.
> + */
> +#define immediate_read(name)						\
> +	({								\
> +		__typeof__(name##__immediate) value;			\
> +		switch (sizeof(value)) {				\
> +		case 1:							\
> +			asm (	".section __immediate, \"a\", @progbits;\n\t" \
> +					".long %1, (0f)+1, 1;\n\t"	\
> +					".previous;\n\t"		\
> +					"0:\n\t"			\
> +					"mov %2,%0;\n\t"		\
> +				: "=r" (value)				\
> +				: "m" (name##__immediate),		\
> +				  "i" (0));				\
> +			break;						\
> +		case 2:							\
> +			asm (	".section __immediate, \"a\", @progbits;\n\t" \
> +					".long %1, (0f)+2, 2;\n\t"	\
> +					".previous;\n\t"		\
> +					"1:\n\t"			\
> +					".align 2;\n\t"			\
> +					"0:\n\t"			\
> +					"mov %2,%0;\n\t"		\
> +				: "=r" (value)				\
> +				: "m" (name##__immediate),		\
> +				  "i" (0));				\
> +			break;						\
> +		case 4:							\
> +			asm (	".section __immediate, \"a\", @progbits;\n\t" \
> +					".long %1, (0f)+1, 4;\n\t"	\
> +					".previous;\n\t"		\
> +					"1:\n\t"			\
> +					".org (1b)+(3-((1b)%%4)), 0x90;\n\t" \
> +					"0:\n\t"			\
> +					"mov %2,%0;\n\t"		\
> +				: "=r" (value)				\
> +				: "m" (name##__immediate),		\
> +				  "i" (0));				\
> +			break;						\
> +		default:value = name##__immediate;			\
> +			break;						\
> +		};							\
> +		value;							\
> +	})
> +
> +extern int arch_immediate_update(const struct __immediate *immediate);
> +extern void arch_immediate_update_early(const struct __immediate *immediate);
> +
> +#endif /* _ASM_I386_IMMEDIATE_H */
> Index: linux-2.6-lttng/arch/i386/kernel/Makefile
> ===================================================================
> --- linux-2.6-lttng.orig/arch/i386/kernel/Makefile	2007-09-17 13:37:58.000000000 -0400
> +++ linux-2.6-lttng/arch/i386/kernel/Makefile	2007-09-17 13:38:44.000000000 -0400
> @@ -37,6 +37,7 @@ obj-$(CONFIG_MODULES)		+= module.o
>  obj-y				+= sysenter.o vsyscall.o
>  obj-$(CONFIG_ACPI_SRAT) 	+= srat.o
>  obj-$(CONFIG_EFI) 		+= efi.o efi_stub.o
> +obj-$(CONFIG_IMMEDIATE)		+= immediate.o
>  obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault.o
>  obj-$(CONFIG_KGDB)		+= kgdb.o kgdb-jmp.o
>  obj-$(CONFIG_VM86)		+= vm86.o
> Index: linux-2.6-lttng/arch/i386/kernel/immediate.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/arch/i386/kernel/immediate.c	2007-09-17 13:38:44.000000000 -0400
> @@ -0,0 +1,307 @@
> +/*
> + * Immediate Value - i386 architecture specific code.
> + *
> + * Rationale
> + *
> + * Required because of :
> + * - Erratum 49 fix for Intel PIII.
> + * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
> + *   Centrino Duo Processor Technology Specification Update, AH33.
> + *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
> + *   Instruction Execution Results.
> + *
> + * Permits immediate value modification by XMC with correct serialization.
> + *
> + * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
> + * location that has preemption enabled because it involves no temporary or
> + * reused data structure.
> + *
> + * Quoting Richard J Moore, source of the information motivating this
> + * implementation which differs from the one proposed by Intel which is not
> + * suitable for kernel context (does not support NMI and would require disabling
> + * interrupts on every CPU for a long period) :
> + *
> + * "There is another issue to consider when looking into using probes other
> + * then int3:
> + *
> + * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
> + * practice of modifying code on one processor where another has prefetched
> + * the unmodified version of the code. Intel states that unpredictable general
> + * protection faults may result if a synchronizing instruction (iret, int,
> + * int3, cpuid, etc ) is not executed on the second processor before it
> + * executes the pre-fetched out-of-date copy of the instruction.
> + *
> + * When we became aware of this I had a long discussion with Intel's
> + * microarchitecture guys. It turns out that the reason for this erratum
> + * (which incidentally Intel does not intend to fix) is because the trace
> + * cache - the stream of micorops resulting from instruction interpretation -
> + * cannot guaranteed to be valid. Reading between the lines I assume this
> + * issue arises because of optimization done in the trace cache, where it is
> + * no longer possible to identify the original instruction boundaries. If the
> + * CPU discoverers that the trace cache has been invalidated because of
> + * unsynchronized cross-modification then instruction execution will be
> + * aborted with a GPF. Further discussion with Intel revealed that replacing
> + * the first opcode byte with an int3 would not be subject to this erratum.
> + *
> + * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
> + *
> + * Overall design
> + *
> + * The algorithm proposed by Intel applies not so well in kernel context: it
> + * would imply disabling interrupts and looping on every CPUs while modifying
> + * the code and would not support instrumentation of code called from interrupt
> + * sources that cannot be disabled.
> + *
> + * Therefore, we use a different algorithm to respect Intel's erratum (see the
> + * quoted discussion above). We make sure that no CPU sees an out-of-date copy
> + * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
> + * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
> + * execute a sync_core(), to make sure that even when the breakpoint is removed,
> + * no cpu could possibly still have the out-of-date copy of the instruction,
> + * modify the now unused 2nd byte of the instruction, and then put back the
> + * original 1st byte of the instruction.
> + *
> + * It has exactly the same intent as the algorithm proposed by Intel, but
> + * it has less side-effects, scales better and supports NMI, SMI and MCE.
> + *
> + * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> + */
> +
> +#include <linux/notifier.h>
> +#include <linux/preempt.h>
> +#include <linux/smp.h>
> +#include <linux/notifier.h>
> +#include <linux/module.h>
> +#include <linux/immediate.h>
> +#include <linux/kdebug.h>
> +#include <linux/rcupdate.h>
> +#include <linux/kprobes.h>
> +
> +#include <asm/cacheflush.h>
> +
> +#define BREAKPOINT_INSTRUCTION  0xcc
> +#define BREAKPOINT_INS_LEN	1
> +#define NR_NOPS			8
> +
> +static long target_after_int3;		/* EIP of the target after the int3 */
> +static long bypass_eip;			/* EIP of the bypass. */
> +static long bypass_after_int3;		/* EIP after the end-of-bypass int3 */
> +static long after_immediate;		/*
> +					 * EIP where to resume after the
> +					 * single-stepping.
> +					 */
> +
> +/*
> + * Size of the movl instruction (without the immediate value) in bytes.
> + * The 2 bytes load immediate has a 66H prefix, which makes the opcode 2 bytes
> + * wide.
> + */
> +static inline size_t _immediate_get_insn_size(long size)
> +{
> +	switch (size) {
> +		case 1: return 1;
> +		case 2: return 2;
> +		case 4: return 1;
> +		default: BUG();
> +	};
> +}
> +
> +/*
> + * Internal bypass used during value update. The bypass is skipped by the
> + * function in which it is inserted.
> + * No need to be aligned because we exclude readers from the site during
> + * update.
> + * Layout is:
> + * nop nop nop nop nop nop nop nop int3
> + * The nops are the target replaced by the instruction to single-step.
> + */
> +static inline void _immediate_bypass(long *bypassaddr, long *breaknextaddr)
> +{
> +		asm volatile (	"jmp 2f;\n\t"
> +				"0:\n\t"
> +				".space 8, 0x90;\n\t"
> +				"1:\n\t"
> +				"int3;\n\t"
> +				"2:\n\t"
> +				"movl $(0b),%0;\n\t"
> +				"movl $((1b)+1),%1;\n\t"
> +				: "=r" (*bypassaddr),
> +				  "=r" (*breaknextaddr) : );
> +}
> +
> +static void immediate_synchronize_core(void *info)
> +{
> +	sync_core();	/* use cpuid to stop speculative execution */
> +}
> +
> +/*
> + * The eip value points right after the breakpoint instruction, in the second
> + * byte of the movl.
> + * Disable preemption in the bypass to make sure no thread will be preempted in
> + * it. We can then use synchronize_sched() to make sure every bypass users have
> + * ended.
> + */
> +static int immediate_notifier(struct notifier_block *nb,
> +	unsigned long val, void *data)
> +{
> +	enum die_val die_val = (enum die_val) val;
> +	struct die_args *args = data;
> +
> +	if (!args->regs || user_mode_vm(args->regs))
> +		return NOTIFY_DONE;
> +
> +	if (die_val == DIE_INT3) {
> +		if (args->regs->eip == target_after_int3) {
> +			preempt_disable();
> +			args->regs->eip = bypass_eip;
> +			return NOTIFY_STOP;
> +		} else if (args->regs->eip == bypass_after_int3) {
> +			args->regs->eip = after_immediate;
> +			preempt_enable();
> +			return NOTIFY_STOP;
> +		}
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block immediate_notify = {
> +	.notifier_call = immediate_notifier,
> +	.priority = 0x7fffffff,	/* we need to be notified first */
> +};
> +
> +
> +/**
> + * arch_immediate_update - update one immediate value
> + * @immediate: pointer of type const struct __immediate to update
> + *
> + * Update one immediate value. Must be called with immediate_mutex held.
> + */
> +__kprobes int arch_immediate_update(const struct __immediate *immediate)
> +{
> +	int ret;
> +	size_t insn_size = _immediate_get_insn_size(immediate->size);
> +	long insn = immediate->immediate - insn_size;
> +	long len;
> +	unsigned long cr0;
> +
> +#ifdef CONFIG_KPROBES
> +	/*
> +	 * Fail if a kprobe has been set on this instruction.
> +	 * (TODO: we could eventually do better and modify all the (possibly
> +	 * nested) kprobes for this site if kprobes had an API for this.
> +	 */
> +	if (unlikely(*(unsigned char*)insn == BREAKPOINT_INSTRUCTION)) {
> +		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
> +				    "Variable at %p, "
> +				    "instruction at %p, size %lu\n",
> +				    (void*)immediate->immediate,
> +				    (void*)immediate->var, immediate->size);
> +		return -EBUSY;
> +	}
> +#endif
> +
> +	/*
> +	 * If the variable and the instruction have the same value, there is
> +	 * nothing to do.
> +	 */
> +	switch (immediate->size) {
> +		case 1:	if (*(uint8_t*)immediate->immediate
> +					== *(uint8_t*)immediate->var)
> +				return 0;
> +			break;
> +		case 2:	if (*(uint16_t*)immediate->immediate
> +					== *(uint16_t*)immediate->var)
> +				return 0;
> +			break;
> +		case 4:	if (*(uint32_t*)immediate->immediate
> +					== *(uint32_t*)immediate->var)
> +				return 0;
> +			break;
> +		default:return -EINVAL;
> +	}
> +
> +	_immediate_bypass(&bypass_eip, &bypass_after_int3);
> +
> +	after_immediate = immediate->immediate + immediate->size;
> +
> +	/*
> +	 * Using the _early variants because nobody is executing the
> +	 * bypass code while we patch it. It is protected by the
> +	 * immediate_mutex. Since we modify the instructions non atomically (for
> +	 * nops), we have to use the _early variant.
> +	 * We must however deal with the WP flag in cr0 by ourself.
> +	 */
> +	kernel_wp_save(cr0);
> +	text_poke_early((void*)bypass_eip, (void*)insn,
> +			insn_size + immediate->size);
> +	/*
> +	 * Fill the rest with nops.
> +	 */
> +	len = NR_NOPS - immediate->size - insn_size;
> +	add_nops((void*)(bypass_eip + insn_size + immediate->size), len);
> +	kernel_wp_restore(cr0);
> +
> +	target_after_int3 = insn + BREAKPOINT_INS_LEN;
> +	/* register_die_notifier has memory barriers */
> +	register_die_notifier(&immediate_notify);
> +	/* The breakpoint will single-step the bypass */
> +	text_poke((void*)insn,
> +		INIT_ARRAY(unsigned char, BREAKPOINT_INSTRUCTION, 1), 1);
> +	wmb();
> +	/*
> +	 * Execute serializing instruction on each CPU.
> +	 * Acts as a memory barrier.
> +	 */
> +	ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
> +	BUG_ON(ret != 0);
> +
> +	text_poke((void*)(insn + insn_size), (void*)immediate->var,
> +			immediate->size);
> +	wmb();
> +	text_poke((void*)insn, (unsigned char*)bypass_eip, 1);
> +		/*
> +		 * Wait for all int3 handlers to end
> +		 * (interrupts are disabled in int3).
> +		 * This CPU is clearly not in a int3 handler,
> +		 * because int3 handler is not preemptible and
> +		 * there cannot be any more int3 handler called
> +		 * for this site, because we placed the original
> +		 * instruction back.
> +		 * synchronize_sched has memory barriers.
> +		 */
> +	synchronize_sched();
> +	unregister_die_notifier(&immediate_notify);
> +	/* unregister_die_notifier has memory barriers */
> +	return 0;
> +}
> +
> +/**
> + * arch_immediate_update_early - update one immediate value at boot time
> + * @immediate: pointer of type const struct __immediate to update
> + *
> + * Update one immediate value at boot time.
> + */
> +void __init arch_immediate_update_early(const struct __immediate *immediate)
> +{
> +	/*
> +	 * If the variable and the instruction have the same value, there is
> +	 * nothing to do.
> +	 */
> +	switch (immediate->size) {
> +		case 1:	if (*(uint8_t*)immediate->immediate
> +					== *(uint8_t*)immediate->var)
> +				return;
> +			break;
> +		case 2:	if (*(uint16_t*)immediate->immediate
> +					== *(uint16_t*)immediate->var)
> +				return;
> +			break;
> +		case 4:	if (*(uint32_t*)immediate->immediate
> +					== *(uint32_t*)immediate->var)
> +				return;
> +			break;
> +		default:return;
> +	}
> +	memcpy((void*)immediate->immediate, (void*)immediate->var,
> +			immediate->size);
> +}
> Index: linux-2.6-lttng/arch/i386/kernel/traps.c
> ===================================================================
> --- linux-2.6-lttng.orig/arch/i386/kernel/traps.c	2007-09-17 13:37:59.000000000 -0400
> +++ linux-2.6-lttng/arch/i386/kernel/traps.c	2007-09-17 13:38:44.000000000 -0400
> @@ -601,7 +601,7 @@ fastcall void do_##name(struct pt_regs *
>  }
>  
>  DO_VM86_ERROR_INFO( 0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->eip)
> -#ifndef CONFIG_KPROBES
> +#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
>  DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
>  #endif
>  DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
> @@ -843,14 +843,14 @@ void restart_nmi(void)
>  	acpi_nmi_enable();
>  }
>  
> -#ifdef CONFIG_KPROBES
> +#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
>  fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
>  {
>  	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
>  			== NOTIFY_STOP)
>  		return;
> -	/* This is an interrupt gate, because kprobes wants interrupts
> -	disabled.  Normal trap handlers don't. */
> +	/* This is an interrupt gate, because kprobes and immediate values wants
> +	 * interrupts disabled. Normal trap handlers don't. */
>  	restore_interrupts(regs);
>  	do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
>  }
> 
> -- 
> Mathieu Desnoyers
> Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Regards/Gruß,
    Boris.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 6/7] Immediate Values - Documentation
  2007-09-17 20:55   ` Randy Dunlap
@ 2007-09-18 13:13     ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-18 13:13 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: akpm, linux-kernel

* Randy Dunlap (randy.dunlap@oracle.com) wrote:
> On Mon, 17 Sep 2007 14:42:30 -0400 Mathieu Desnoyers wrote:
> 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > ---
> >  Documentation/immediate.txt |  228 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 228 insertions(+)
> > 
> > Index: linux-2.6-lttng/Documentation/immediate.txt
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/Documentation/immediate.txt	2007-09-17 14:03:09.000000000 -0400
> > @@ -0,0 +1,228 @@
> > +		        Using the Immediate Values
> > +
> > +			    Mathieu Desnoyers
> > +
> > +
> > +This document introduces Immediate Values and their use.
> > +
> > +* Purpose of immediate values
> > +

Updated, thanks. I also noted that I did not refresh this patch when
sending it.. a repost is needed (removed the immediate_if, which is now
gone, and changed all immediate_read(&this_immediate) for
immediate_read(this_immediate).

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-17 18:42 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2007-09-18 17:47   ` Denys Vlasenko
  2007-09-18 17:59     ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Vlasenko @ 2007-09-18 17:47 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel

On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> @@ -122,6 +122,13 @@
>  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
>  	}								\
>  									\
> +	/* Immediate values: pointers */				\
> +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> +		*(__immediate)						\
> +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> +	}								\
> +									\

Why do you need an output section for that? IOW: will this work too?

.data : ... { 
...

		VMLINUX_SYMBOL(__start___immediate) = .;		\
		*(__immediate)						\
		VMLINUX_SYMBOL(__stop___immediate) = .;			\
...
}


> Index: linux-2.6-lttng/kernel/module.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/module.c	2007-09-17 13:25:06.000000000 -0400
> +++ linux-2.6-lttng/kernel/module.c	2007-09-17 13:35:51.000000000 -0400
> @@ -33,6 +33,7 @@
>  #include <linux/cpu.h>
>  #include <linux/moduleparam.h>
>  #include <linux/errno.h>
> +#include <linux/immediate.h>
>  #include <linux/err.h>
>  #include <linux/vermagic.h>
>  #include <linux/notifier.h>
> @@ -1717,6 +1718,7 @@ static struct module *load_module(void _
>  	unsigned int unusedcrcindex;
>  	unsigned int unusedgplindex;
>  	unsigned int unusedgplcrcindex;
> +	unsigned int immediateindex;
>  	struct module *mod;
>  	long err = 0;
>  	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
> @@ -1813,6 +1815,7 @@ static struct module *load_module(void _
>  #ifdef ARCH_UNWIND_SECTION_NAME
>  	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
>  #endif
> +	immediateindex = find_sec(hdr, sechdrs, secstrings, "__immediate");


Do you need to frame immediateindex by #ifdef CONFIG_IMMEDIATE / #endif?
--
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-18 17:47   ` Denys Vlasenko
@ 2007-09-18 17:59     ` Mathieu Desnoyers
  2007-09-18 20:01       ` Denys Vlasenko
  0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-18 17:59 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: akpm, linux-kernel

* Denys Vlasenko (vda.linux@googlemail.com) wrote:
> On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> > @@ -122,6 +122,13 @@
> >  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
> >  	}								\
> >  									\
> > +	/* Immediate values: pointers */				\
> > +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> > +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > +		*(__immediate)						\
> > +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > +	}								\
> > +									\
> 
> Why do you need an output section for that? IOW: will this work too?
> 
> .data : ... { 
> ...
> 
> 		VMLINUX_SYMBOL(__start___immediate) = .;		\
> 		*(__immediate)						\
> 		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> ...
> }
> 

This last one could cause alignment problems. We either have to use the
proper ALIGN() before the section, or let AT(ADDR(__immediate) -
LOAD_OFFSET) take care of it. I prefer the latter.

> 
> > Index: linux-2.6-lttng/kernel/module.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/module.c	2007-09-17 13:25:06.000000000 -0400
> > +++ linux-2.6-lttng/kernel/module.c	2007-09-17 13:35:51.000000000 -0400
> > @@ -33,6 +33,7 @@
> >  #include <linux/cpu.h>
> >  #include <linux/moduleparam.h>
> >  #include <linux/errno.h>
> > +#include <linux/immediate.h>
> >  #include <linux/err.h>
> >  #include <linux/vermagic.h>
> >  #include <linux/notifier.h>
> > @@ -1717,6 +1718,7 @@ static struct module *load_module(void _
> >  	unsigned int unusedcrcindex;
> >  	unsigned int unusedgplindex;
> >  	unsigned int unusedgplcrcindex;
> > +	unsigned int immediateindex;
> >  	struct module *mod;
> >  	long err = 0;
> >  	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
> > @@ -1813,6 +1815,7 @@ static struct module *load_module(void _
> >  #ifdef ARCH_UNWIND_SECTION_NAME
> >  	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
> >  #endif
> > +	immediateindex = find_sec(hdr, sechdrs, secstrings, "__immediate");
> 
> 
> Do you need to frame immediateindex by #ifdef CONFIG_IMMEDIATE / #endif?

I could, but I have been told to leave the immediateindex there even though
immediate values are configured out. I guess visual hideousness is the
main argument there.

Mathieu

> --
> vda

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-18 17:59     ` Mathieu Desnoyers
@ 2007-09-18 20:01       ` Denys Vlasenko
  2007-09-18 20:47         ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: Denys Vlasenko @ 2007-09-18 20:01 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel

On Tuesday 18 September 2007 18:59, Mathieu Desnoyers wrote:
> * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> > > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> > > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> > > @@ -122,6 +122,13 @@
> > >  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
> > >  	}								\
> > >  									\
> > > +	/* Immediate values: pointers */				\
> > > +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> > > +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > +		*(__immediate)						\
> > > +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > +	}								\
> > > +									\
> > 
> > Why do you need an output section for that? IOW: will this work too?
> > 
> > .data : ... { 
> > ...
> > 
> > 		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > 		*(__immediate)						\
> > 		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > ...
> > }
> > 
> 
> This last one could cause alignment problems. We either have to use the
> proper ALIGN() before the section, or let AT(ADDR(__immediate) -
> LOAD_OFFSET) take care of it. I prefer the latter.

This adds yet another output section in vmlinux, and there is
no tools which need that. We already have 30+ sections there while we need ~20.

I am trying to fix the mess. Please don't add to it.

Re alignment: (1) do you really realy REALLY need it? Last I checked,
i386 was handling unaligned accesses just fine; and
(2) this works:

		. = ALIGN(4)
 		VMLINUX_SYMBOL(__start___immediate) = .;		\
 		*(__immediate)						\
 		VMLINUX_SYMBOL(__stop___immediate) = .;			\


--
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-18 20:01       ` Denys Vlasenko
@ 2007-09-18 20:47         ` Mathieu Desnoyers
  2007-09-19  8:45           ` Denys Vlasenko
  2007-09-19  8:57           ` Denys Vlasenko
  0 siblings, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-18 20:47 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: akpm, linux-kernel

* Denys Vlasenko (vda.linux@googlemail.com) wrote:
> On Tuesday 18 September 2007 18:59, Mathieu Desnoyers wrote:
> > * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > > On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> > > > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > > > ===================================================================
> > > > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> > > > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> > > > @@ -122,6 +122,13 @@
> > > >  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
> > > >  	}								\
> > > >  									\
> > > > +	/* Immediate values: pointers */				\
> > > > +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> > > > +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > +		*(__immediate)						\
> > > > +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > +	}								\
> > > > +									\
> > > 
> > > Why do you need an output section for that? IOW: will this work too?
> > > 
> > > .data : ... { 
> > > ...
> > > 
> > > 		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > 		*(__immediate)						\
> > > 		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > ...
> > > }
> > > 
> > 
> > This last one could cause alignment problems. We either have to use the
> > proper ALIGN() before the section, or let AT(ADDR(__immediate) -
> > LOAD_OFFSET) take care of it. I prefer the latter.
> 
> This adds yet another output section in vmlinux, and there is
> no tools which need that. We already have 30+ sections there while we need ~20.
> 
> I am trying to fix the mess. Please don't add to it.
> 
> Re alignment: (1) do you really realy REALLY need it? Last I checked,
> i386 was handling unaligned accesses just fine; and
> (2) this works:
> 
> 		. = ALIGN(4)
>  		VMLINUX_SYMBOL(__start___immediate) = .;		\
>  		*(__immediate)						\
>  		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> 
> 

Alignment: I need the __start___immediate and __stop___immediate values
to be at the same alignment as the *(__immediate) content, or else we
end up thinking that padding is data.

. = ALIGN(4) works fine as long as the structure within the section is
not bigger or equal to 32 bytes: gcc has the habit to align 32 bytes
structure on 32 bytes multiples. The safest way I found to do it is to
declare the section as I do: it will cause no breakage if anybody append
data to the structure.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-18 21:07 [patch 0/7] Immediate Values for 2.6.23-rc6-mm1 Mathieu Desnoyers
@ 2007-09-18 21:07 ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-18 21:07 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 17446 bytes --]

Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _immediate_read() version, which uses standard global
variables, and optimized per architecture immediate_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__immediate" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

Changelog:

- section __immediate is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove immediate_*_t types, add DECLARE_IMMEDIATE() and DEFINE_IMMEDIATE().
  - immediate_read(&var) becomes immediate_read(var) because of this.
- Adding a new EXPORT_IMMEDIATE_SYMBOL(_GPL).
- remove immediate_if(). Should use if (unlikely(immediate_read(var))) instead.
  - Wait until we have gcc support before we add the immediate_if macro, since
    its form may have to change.
- Document immediate_set_early(). This design is chosen over immediate_init()
  because:
    - We can mark the function __init so it is freed after boot.
    - Module code patching can also be done by the "normal" function. This is
      preferred, even if it can be slightly slower, because update performance
      does not matter.
    - immediate_init() could confuse users because we can actually "initialize"
      an immediate value to a certain value, which will be used as initial
      value. e.g.:
        static DEFINE_IMMEDIATE(int, var) = 10;

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/asm-generic/vmlinux.lds.h |    7 ++
 include/linux/immediate.h         |  133 ++++++++++++++++++++++++++++++++++++++
 include/linux/module.h            |   17 ++++
 init/main.c                       |    2 
 kernel/Makefile                   |    1 
 kernel/immediate.c                |   92 ++++++++++++++++++++++++++
 kernel/module.c                   |   49 ++++++++++++++
 7 files changed, 301 insertions(+)

Index: linux-2.6-lttng/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/immediate.h	2007-09-17 13:38:32.000000000 -0400
@@ -0,0 +1,133 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+#include <asm/immediate.h>
+
+/**
+ * immediate_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define immediate_set(name, i)						\
+	do {								\
+		name##__immediate = (i);				\
+		core_immediate_update();				\
+		module_immediate_update();				\
+	} while (0)
+
+
+/**
+ * _immediate_set - set immediate variable (without locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Must be called with module_mutex held.
+ */
+#define _immediate_set(name, i)						\
+	do {								\
+		name##__immediate = (i);				\
+		core_immediate_update();				\
+		_module_immediate_update();				\
+	} while (0)
+
+/**
+ * immediate_set_early - set immediate variable at early boot
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Should be used for updates at early boot, when only
+ * one CPU is active and interrupts are disabled.
+ */
+#define immediate_set_early(name, i)					\
+	do {								\
+		name##__immediate = (i);				\
+		immediate_update_early();				\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_immediate_update(void);
+extern void immediate_update_range(const struct __immediate *begin,
+	const struct __immediate *end);
+extern void immediate_update_early(void);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define immediate_read(name)		_immediate_read(name)
+
+/**
+ * immediate_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define immediate_set(name, i)		(name##__immediate = (i))
+
+/**
+ * _immediate_set - set immediate variable (without locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Must be called with module_mutex held.
+ */
+#define _immediate_set(name, i)		immediate_set(name, i)
+
+/**
+ * immediate_set_early - set immediate variable at early boot
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name. Should be used for early boot updates.
+ */
+#define immediate_set_early(name, i)	immediate_set(name, i)
+
+/*
+ * Internal update functions.
+ */
+static inline void immediate_update_early(void)
+{ }
+#endif
+
+#define DECLARE_IMMEDIATE(type, name) extern __typeof__(type) name##__immediate
+#define DEFINE_IMMEDIATE(type, name)  __typeof__(type) name##__immediate
+
+#define EXPORT_IMMEDIATE_SYMBOL(name) EXPORT_SYMBOL(name##__immediate)
+#define EXPORT_IMMEDIATE_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__immediate)
+
+/**
+ * _immediate_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _immediate_read(name)		(name##__immediate)
+
+#endif
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
@@ -122,6 +122,13 @@
 		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
 	}								\
 									\
+	/* Immediate values: pointers */				\
+	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start___immediate) = .;		\
+		*(__immediate)						\
+		VMLINUX_SYMBOL(__stop___immediate) = .;			\
+	}								\
+									\
 	/* Kernel symbol table: strings */				\
         __ksymtab_strings : AT(ADDR(__ksymtab_strings) - LOAD_OFFSET) {	\
 		*(__ksymtab_strings)					\
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/include/linux/module.h	2007-09-17 13:35:50.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/stringify.h>
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
+#include <linux/immediate.h>
 #include <asm/local.h>
 
 #include <asm/module.h>
@@ -370,6 +371,11 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+
+#ifdef CONFIG_IMMEDIATE
+	const struct __immediate *immediate;
+	unsigned int num_immediate;
+#endif
 };
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
@@ -473,6 +479,9 @@ int unregister_module_notifier(struct no
 
 extern void print_modules(void);
 
+extern void _module_immediate_update(void);
+extern void module_immediate_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -572,6 +581,14 @@ static inline void print_modules(void)
 {
 }
 
+static inline void _module_immediate_update(void)
+{
+}
+
+static inline void module_immediate_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2007-09-17 13:35:51.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -1717,6 +1718,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	struct module *mod;
 	long err = 0;
 	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
@@ -1813,6 +1815,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__immediate");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1963,6 +1966,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -2028,6 +2036,12 @@ static struct module *load_module(void _
 		 goto nomodsectinfo;
 #endif
 
+#ifdef CONFIG_IMMEDIATE
+	if (!mod->taints)
+		immediate_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2629,3 +2643,38 @@ EXPORT_SYMBOL(module_remove_driver);
 void struct_module(struct module *mod) { return; }
 EXPORT_SYMBOL(struct_module);
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_immediate_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_immediate_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		immediate_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_immediate_update);
+
+/**
+ * module_immediate_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_immediate_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_immediate_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_immediate_update);
+#endif
Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/immediate.c	2007-09-17 13:25:22.000000000 -0400
@@ -0,0 +1,92 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+
+extern const struct __immediate __start___immediate[];
+extern const struct __immediate __stop___immediate[];
+
+/*
+ * immediate_mutex nests inside module_mutex. immediate_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(immediate_mutex);
+
+/**
+ * immediate_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Sets a range of immediates to a enabled state : set the enable bit.
+ */
+void immediate_update_range(const struct __immediate *begin,
+		const struct __immediate *end)
+{
+	const struct __immediate *iter;
+	int ret;
+
+	for (iter = begin; iter < end; iter++) {
+		mutex_lock(&immediate_mutex);
+		kernel_text_lock();
+		ret = arch_immediate_update(iter);
+		kernel_text_unlock();
+		if (ret)
+			printk(KERN_WARNING "Invalid immediate value. "
+					    "Variable at %p, "
+					    "instruction at %p, size %lu\n",
+					    (void*)iter->immediate,
+					    (void*)iter->var, iter->size);
+		mutex_unlock(&immediate_mutex);
+	}
+}
+EXPORT_SYMBOL_GPL(immediate_update_range);
+
+/**
+ * immediate_update - update all immediate values in the kernel
+ * @lock: should a module_mutex be taken ?
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_immediate_update(void)
+{
+	/* Core kernel immediates */
+	immediate_update_range(__start___immediate, __stop___immediate);
+}
+EXPORT_SYMBOL_GPL(core_immediate_update);
+
+static void __init immediate_update_early_range(const struct __immediate *begin,
+		const struct __immediate *end)
+{
+	const struct __immediate *iter;
+
+	for (iter = begin; iter < end; iter++)
+		arch_immediate_update_early(iter);
+}
+
+/**
+ * immediate_update_early - Update immediate values at boot time
+ *
+ * Update the immediate values to the state of the variables they refer to. It
+ * is done before SMP is active, at the very beginning of start_kernel().
+ */
+void __init immediate_update_early(void)
+{
+	immediate_update_early_range(__start___immediate, __stop___immediate);
+}
Index: linux-2.6-lttng/init/main.c
===================================================================
--- linux-2.6-lttng.orig/init/main.c	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/init/main.c	2007-09-17 13:25:22.000000000 -0400
@@ -56,6 +56,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/device.h>
 #include <linux/kthread.h>
+#include <linux/immediate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -525,6 +526,7 @@ asmlinkage void __init start_kernel(void
 	unwind_init();
 	lockdep_init();
 	container_init_early();
+	immediate_update_early();
 
 	local_irq_disable();
 	early_boot_irqs_off();
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile	2007-09-17 13:25:06.000000000 -0400
+++ linux-2.6-lttng/kernel/Makefile	2007-09-17 13:35:50.000000000 -0400
@@ -61,6 +61,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-18 20:47         ` Mathieu Desnoyers
@ 2007-09-19  8:45           ` Denys Vlasenko
  2007-09-19  8:57           ` Denys Vlasenko
  1 sibling, 0 replies; 29+ messages in thread
From: Denys Vlasenko @ 2007-09-19  8:45 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel

On Tuesday 18 September 2007 21:47, Mathieu Desnoyers wrote:
> * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > On Tuesday 18 September 2007 18:59, Mathieu Desnoyers wrote:
> > > * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > > > On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> > > > > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > > > > ===================================================================
> > > > > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> > > > > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> > > > > @@ -122,6 +122,13 @@
> > > > >  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
> > > > >  	}								\
> > > > >  									\
> > > > > +	/* Immediate values: pointers */				\
> > > > > +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> > > > > +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > > +		*(__immediate)						\
> > > > > +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > > +	}								\
> > > > > +									\
> > > > 
> > > > Why do you need an output section for that? IOW: will this work too?
> > > > 
> > > > .data : ... { 
> > > > ...
> > > > 
> > > > 		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > 		*(__immediate)						\
> > > > 		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > ...
> > > > }
> > > > 
> > > 
> > > This last one could cause alignment problems. We either have to use the
> > > proper ALIGN() before the section, or let AT(ADDR(__immediate) -
> > > LOAD_OFFSET) take care of it. I prefer the latter.
> > 
> > This adds yet another output section in vmlinux, and there is
> > no tools which need that. We already have 30+ sections there while we need ~20.
> > 
> > I am trying to fix the mess. Please don't add to it.
> > 
> > Re alignment: (1) do you really realy REALLY need it? Last I checked,
> > i386 was handling unaligned accesses just fine; and
> > (2) this works:
> > 
> > 		. = ALIGN(4)
> >  		VMLINUX_SYMBOL(__start___immediate) = .;		\
> >  		*(__immediate)						\
> >  		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > 
> > 
> 
> Alignment: I need the __start___immediate and __stop___immediate values
> to be at the same alignment as the *(__immediate) content, or else we
> end up thinking that padding is data.
> 
> . = ALIGN(4) works fine as long as the structure within the section is
> not bigger or equal to 32 bytes: gcc has the habit to align 32 bytes
> structure on 32 bytes multiples. The safest way I found to do it is to

Yes, I'm painfully aware of that. gcc is too damn happy to align stuff.

> declare the section as I do: it will cause no breakage if anybody append
> data to the structure.

You can actively fight gcc's sadistic alignment tendencies instead:

struct __immediate {
       long var;               /* Identifier variable of the immediate value */
       long immediate;         /*
                                * Pointer to the memory location that holds
                                * the immediate value within the load immediate
                                * instruction.
                                */
       long size;              /* Type size. */
} __attribute__ ((aligned(sizeof(long))));  <================= HERE

Kernel is already using this technique a lot. Try

grep -r '^\} *__attribute__ *.*aligned' .
--
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-18 20:47         ` Mathieu Desnoyers
  2007-09-19  8:45           ` Denys Vlasenko
@ 2007-09-19  8:57           ` Denys Vlasenko
  2007-09-19 11:27             ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Denys Vlasenko @ 2007-09-19  8:57 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel

On Tuesday 18 September 2007 21:47, Mathieu Desnoyers wrote:
> * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > On Tuesday 18 September 2007 18:59, Mathieu Desnoyers wrote:
> > > * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > > > On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> > > > > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > > > > ===================================================================
> > > > > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> > > > > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> > > > > @@ -122,6 +122,13 @@
> > > > >  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
> > > > >  	}								\
> > > > >  									\
> > > > > +	/* Immediate values: pointers */				\
> > > > > +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> > > > > +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > > +		*(__immediate)						\
> > > > > +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > > +	}								\
> > > > > +									\
> > > > 
> > > > Why do you need an output section for that? IOW: will this work too?
> > > > 
> > > > .data : ... { 
> > > > ...
> > > > 
> > > > 		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > 		*(__immediate)						\
> > > > 		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > ...
> > > > }
> > > > 
> > > 
> > > This last one could cause alignment problems. We either have to use the
> > > proper ALIGN() before the section, or let AT(ADDR(__immediate) -
> > > LOAD_OFFSET) take care of it. I prefer the latter.
> > 
> > This adds yet another output section in vmlinux, and there is
> > no tools which need that. We already have 30+ sections there while we need ~20.
> > 
> > I am trying to fix the mess. Please don't add to it.
> > 
> > Re alignment: (1) do you really realy REALLY need it? Last I checked,
> > i386 was handling unaligned accesses just fine; and
> > (2) this works:
> > 
> > 		. = ALIGN(4)
> >  		VMLINUX_SYMBOL(__start___immediate) = .;		\
> >  		*(__immediate)						\
> >  		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > 
> > 
> 
> Alignment: I need the __start___immediate and __stop___immediate values
> to be at the same alignment as the *(__immediate) content, or else we
> end up thinking that padding is data.
> 
> . = ALIGN(4) works fine as long as the structure within the section is
> not bigger or equal to 32 bytes: gcc has the habit to align 32 bytes
> structure on 32 bytes multiples. The safest way I found to do it is to
> declare the section as I do: it will cause no breakage if anybody append
> data to the structure.

If your structure will be padded by gcc, then this:

+#define immediate_read(name)                                           \
+       ({                                                              \
+               __typeof__(name##__immediate) value;                    \
+               switch (sizeof(value)) {                                \
+               case 1:                                                 \
+                       asm (   ".section __immediate, \"a\", @progbits;\n\t" \
+                                       ".long %1, (0f)+1, 1;\n\t"      \
+                                       ".previous;\n\t"                \
+                                       "0:\n\t"                        \
+                                       "mov %2,%0;\n\t"                \
+                               : "=r" (value)                          \
+                               : "m" (name##__immediate),              \
+                                 "i" (0));                             \
+                       break;                                          \

will produce wrongly-sized "struct __immediate" (truncated one),
since gcc has no idea that you are building struct __immediate there,
and here:

+void immediate_update_range(const struct __immediate *begin,
+               const struct __immediate *end)
+{
+       const struct __immediate *iter;
+       int ret;
+
+       for (iter = begin; iter < end; iter++) {
+               mutex_lock(&immediate_mutex);
+               kernel_text_lock();
+               ret = arch_immediate_update(iter);
+               kernel_text_unlock();
+               if (ret)
+                       printk(KERN_WARNING "Invalid immediate value. "
+                                           "Variable at %p, "
+                                           "instruction at %p, size %lu\n",
+                                           (void*)iter->immediate,
+                                           (void*)iter->var, iter->size);
+               mutex_unlock(&immediate_mutex);
+       }
+}

iter++ will go off rails.
--
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2007-09-19  8:57           ` Denys Vlasenko
@ 2007-09-19 11:27             ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-09-19 11:27 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: akpm, linux-kernel

* Denys Vlasenko (vda.linux@googlemail.com) wrote:
> On Tuesday 18 September 2007 21:47, Mathieu Desnoyers wrote:
> > * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > > On Tuesday 18 September 2007 18:59, Mathieu Desnoyers wrote:
> > > > * Denys Vlasenko (vda.linux@googlemail.com) wrote:
> > > > > On Monday 17 September 2007 19:42, Mathieu Desnoyers wrote:
> > > > > > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > > > > > ===================================================================
> > > > > > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-09-17 13:25:06.000000000 -0400
> > > > > > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-09-17 13:35:50.000000000 -0400
> > > > > > @@ -122,6 +122,13 @@
> > > > > >  		VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .;	\
> > > > > >  	}								\
> > > > > >  									\
> > > > > > +	/* Immediate values: pointers */				\
> > > > > > +	__immediate : AT(ADDR(__immediate) - LOAD_OFFSET) {		\
> > > > > > +		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > > > +		*(__immediate)						\
> > > > > > +		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > > > +	}								\
> > > > > > +									\
> > > > > 
> > > > > Why do you need an output section for that? IOW: will this work too?
> > > > > 
> > > > > .data : ... { 
> > > > > ...
> > > > > 
> > > > > 		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > > > > 		*(__immediate)						\
> > > > > 		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > > > ...
> > > > > }
> > > > > 
> > > > 
> > > > This last one could cause alignment problems. We either have to use the
> > > > proper ALIGN() before the section, or let AT(ADDR(__immediate) -
> > > > LOAD_OFFSET) take care of it. I prefer the latter.
> > > 
> > > This adds yet another output section in vmlinux, and there is
> > > no tools which need that. We already have 30+ sections there while we need ~20.
> > > 
> > > I am trying to fix the mess. Please don't add to it.
> > > 
> > > Re alignment: (1) do you really realy REALLY need it? Last I checked,
> > > i386 was handling unaligned accesses just fine; and
> > > (2) this works:
> > > 
> > > 		. = ALIGN(4)
> > >  		VMLINUX_SYMBOL(__start___immediate) = .;		\
> > >  		*(__immediate)						\
> > >  		VMLINUX_SYMBOL(__stop___immediate) = .;			\
> > > 
> > > 
> > 
> > Alignment: I need the __start___immediate and __stop___immediate values
> > to be at the same alignment as the *(__immediate) content, or else we
> > end up thinking that padding is data.
> > 
> > . = ALIGN(4) works fine as long as the structure within the section is
> > not bigger or equal to 32 bytes: gcc has the habit to align 32 bytes
> > structure on 32 bytes multiples. The safest way I found to do it is to
> > declare the section as I do: it will cause no breakage if anybody append
> > data to the structure.
> 
> If your structure will be padded by gcc, then this:
> 
> +#define immediate_read(name)                                           \
> +       ({                                                              \
> +               __typeof__(name##__immediate) value;                    \
> +               switch (sizeof(value)) {                                \
> +               case 1:                                                 \
> +                       asm (   ".section __immediate, \"a\", @progbits;\n\t" \
> +                                       ".long %1, (0f)+1, 1;\n\t"      \
> +                                       ".previous;\n\t"                \
> +                                       "0:\n\t"                        \
> +                                       "mov %2,%0;\n\t"                \
> +                               : "=r" (value)                          \
> +                               : "m" (name##__immediate),              \
> +                                 "i" (0));                             \
> +                       break;                                          \
> 
> will produce wrongly-sized "struct __immediate" (truncated one),
> since gcc has no idea that you are building struct __immediate there,
> and here:
> 
> +void immediate_update_range(const struct __immediate *begin,
> +               const struct __immediate *end)
> +{
> +       const struct __immediate *iter;
> +       int ret;
> +
> +       for (iter = begin; iter < end; iter++) {
> +               mutex_lock(&immediate_mutex);
> +               kernel_text_lock();
> +               ret = arch_immediate_update(iter);
> +               kernel_text_unlock();
> +               if (ret)
> +                       printk(KERN_WARNING "Invalid immediate value. "
> +                                           "Variable at %p, "
> +                                           "instruction at %p, size %lu\n",
> +                                           (void*)iter->immediate,
> +                                           (void*)iter->var, iter->size);
> +               mutex_unlock(&immediate_mutex);
> +       }
> +}
> 
> iter++ will go off rails.

You are right. It's ok here since we are actually smaller than 32 bytes,
but I should force the structure alignment so that if the structure
grows, the assembly declaration follows. I'll go for the gcc attribute
and then we can remove the section declaration.

Mathieu

> --
> vda

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/7] Immediate Values - Architecture Independent Code
  2007-12-06  2:07 [patch 0/7] Immediate Values (redux) for 2.6.24-rc4-git3 Mathieu Desnoyers
@ 2007-12-06  2:07 ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2007-12-06  2:07 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Rusty Russell

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 18558 bytes --]

Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _imv_read() version, which uses standard global
variables, and optimized per architecture imv_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__imv" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

Changelog:

- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
  - imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
  - Wait until we have gcc support before we add the imv_if macro, since
    its form may have to change.
- Dont't declare the __imv section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface : remove imv_set_early, keep track of kernel boot
  status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
  architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
---
 include/asm-generic/vmlinux.lds.h |    3 
 include/linux/immediate.h         |   94 +++++++++++++++++++
 include/linux/module.h            |   16 +++
 init/main.c                       |    8 +
 kernel/Makefile                   |    1 
 kernel/immediate.c                |  187 ++++++++++++++++++++++++++++++++++++++
 kernel/module.c                   |   50 +++++++++-
 7 files changed, 358 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/immediate.h	2007-11-28 09:32:04.000000000 -0500
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+} __attribute__ ((packed));
+
+#include <asm/immediate.h>
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)						\
+	do {								\
+		name##__imv = (i);					\
+		core_imv_update();					\
+		module_imv_update();					\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_imv_update(void);
+extern void imv_update_range(const struct __imv *begin,
+	const struct __imv *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_read(name)			_imv_read(name)
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)		(name##__imv = (i))
+
+static inline void core_imv_update(void) { }
+static inline void module_imv_update(void) { }
+
+#endif
+
+#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv
+#define DEFINE_IMV(type, name)  __typeof__(type) name##__imv
+
+#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv)
+#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv)
+
+/**
+ * _imv_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _imv_read(name)		(name##__imv)
+
+#endif
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h	2007-11-28 09:31:51.000000000 -0500
+++ linux-2.6-lttng/include/linux/module.h	2007-11-28 09:32:04.000000000 -0500
@@ -15,6 +15,7 @@
 #include <linux/stringify.h>
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
+#include <linux/immediate.h>
 #include <linux/marker.h>
 #include <asm/local.h>
 
@@ -355,6 +356,10 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+#ifdef CONFIG_IMMEDIATE
+	const struct __imv *immediate;
+	unsigned int num_immediate;
+#endif
 #ifdef CONFIG_MARKERS
 	struct marker *markers;
 	unsigned int num_markers;
@@ -464,6 +469,9 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -568,6 +576,14 @@ static inline void module_update_markers
 {
 }
 
+static inline void _module_imv_update(void)
+{
+}
+
+static inline void module_imv_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2007-11-28 09:31:51.000000000 -0500
+++ linux-2.6-lttng/kernel/module.c	2007-11-28 09:32:04.000000000 -0500
@@ -33,6 +33,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -1675,6 +1676,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
 	struct module *mod;
@@ -1773,6 +1775,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__imv");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1924,6 +1927,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -1991,11 +1999,16 @@ static struct module *load_module(void _
 
 	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
+	if (!mod->taints) {
 #ifdef CONFIG_MARKERS
-	if (!mod->taints)
 		marker_update_probe_range(mod->markers,
 			mod->markers + mod->num_markers);
 #endif
+#ifdef CONFIG_IMMEDIATE
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+	}
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2601,3 +2614,38 @@ void module_update_markers(void)
 	mutex_unlock(&module_mutex);
 }
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_imv_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_imv_update);
+
+/**
+ * module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_imv_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_imv_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_imv_update);
+#endif
Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/immediate.c	2007-11-28 09:32:04.000000000 -0500
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+
+#include <asm/cacheflush.h>
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int imv_early_boot_complete;
+
+extern const struct __imv __start___imv[];
+extern const struct __imv __stop___imv[];
+
+/*
+ * imv_mutex nests inside module_mutex. imv_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(imv_mutex);
+
+static atomic_t wait_sync;
+
+struct ipi_loop_data {
+	long value;
+	const struct __imv *imv;
+} loop_data;
+
+static void ipi_busy_loop(void *arg)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	atomic_dec(&wait_sync);
+	do {
+		/* Make sure the wait_sync gets re-read */
+		smp_mb();
+	} while (atomic_read(&wait_sync) > loop_data.value);
+	atomic_dec(&wait_sync);
+	do {
+		/* Make sure the wait_sync gets re-read */
+		smp_mb();
+	} while (atomic_read(&wait_sync) > 0);
+	/*
+	 * Issuing a synchronizing instruction must be done on each CPU before
+	 * reenabling interrupts after modifying an instruction. Required by
+	 * Intel's errata.
+	 */
+	sync_core();
+	flush_icache_range(loop_data.imv->imv,
+		loop_data.imv->imv + loop_data.imv->size);
+	local_irq_restore(flags);
+}
+
+/**
+ * apply_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+static int apply_imv_update(const struct __imv *imv)
+{
+	unsigned long flags;
+	long online_cpus;
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv
+				== *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv
+				== *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv
+				== *(uint32_t *)imv->var)
+			return 0;
+		break;
+	case 8:	if (*(uint64_t *)imv->imv
+				== *(uint64_t *)imv->var)
+			return 0;
+		break;
+	default:return -EINVAL;
+	}
+
+	if (imv_early_boot_complete) {
+		kernel_text_lock();
+		lock_cpu_hotplug();
+		online_cpus = num_online_cpus();
+		atomic_set(&wait_sync, 2 * online_cpus);
+		loop_data.value = online_cpus;
+		loop_data.imv = imv;
+		smp_call_function(ipi_busy_loop, NULL, 1, 0);
+		local_irq_save(flags);
+		atomic_dec(&wait_sync);
+		do {
+			/* Make sure the wait_sync gets re-read */
+			smp_mb();
+		} while (atomic_read(&wait_sync) > online_cpus);
+		text_poke((void *)imv->imv, (void *)imv->var,
+				imv->size);
+		/*
+		 * Make sure the modified instruction is seen by all CPUs before
+		 * we continue (visible to other CPUs and local interrupts).
+		 */
+		wmb();
+		atomic_dec(&wait_sync);
+		flush_icache_range(imv->imv,
+				imv->imv + imv->size);
+		local_irq_restore(flags);
+		unlock_cpu_hotplug();
+		kernel_text_unlock();
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+				imv->size);
+	return 0;
+}
+
+/**
+ * imv_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Updates a range of immediates.
+ */
+void imv_update_range(const struct __imv *begin,
+		const struct __imv *end)
+{
+	const struct __imv *iter;
+	int ret;
+	for (iter = begin; iter < end; iter++) {
+		mutex_lock(&imv_mutex);
+		ret = apply_imv_update(iter);
+		if (imv_early_boot_complete && ret)
+			printk(KERN_WARNING
+				"Invalid immediate value. "
+				"Variable at %p, "
+				"instruction at %p, size %hu\n",
+				(void *)iter->imv,
+				(void *)iter->var, iter->size);
+		mutex_unlock(&imv_mutex);
+	}
+}
+EXPORT_SYMBOL_GPL(imv_update_range);
+
+/**
+ * imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_imv_update(void)
+{
+	/* Core kernel imvs */
+	imv_update_range(__start___imv, __stop___imv);
+}
+EXPORT_SYMBOL_GPL(core_imv_update);
+
+void __init imv_init_complete(void)
+{
+	imv_early_boot_complete = 1;
+}
Index: linux-2.6-lttng/init/main.c
===================================================================
--- linux-2.6-lttng.orig/init/main.c	2007-11-28 09:27:34.000000000 -0500
+++ linux-2.6-lttng/init/main.c	2007-11-28 09:32:04.000000000 -0500
@@ -57,6 +57,7 @@
 #include <linux/device.h>
 #include <linux/kthread.h>
 #include <linux/sched.h>
+#include <linux/immediate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -101,6 +102,11 @@ static inline void mark_rodata_ro(void) 
 #ifdef CONFIG_TC
 extern void tc_init(void);
 #endif
+#ifdef CONFIG_IMMEDIATE
+extern void imv_init_complete(void);
+#else
+static inline void imv_init_complete(void) { }
+#endif
 
 enum system_states system_state;
 EXPORT_SYMBOL(system_state);
@@ -518,6 +524,7 @@ asmlinkage void __init start_kernel(void
 	unwind_init();
 	lockdep_init();
 	cgroup_init_early();
+	core_imv_update();
 
 	local_irq_disable();
 	early_boot_irqs_off();
@@ -639,6 +646,7 @@ asmlinkage void __init start_kernel(void
 	cpuset_init();
 	taskstats_init_early();
 	delayacct_init();
+	imv_init_complete();
 
 	check_bugs();
 
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile	2007-11-28 09:27:34.000000000 -0500
+++ linux-2.6-lttng/kernel/Makefile	2007-11-28 09:32:04.000000000 -0500
@@ -56,6 +56,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 obj-$(CONFIG_MARKERS) += marker.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-11-28 09:27:34.000000000 -0500
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-11-28 09:32:04.000000000 -0500
@@ -25,6 +25,9 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
+		VMLINUX_SYMBOL(__start___imv) = .;			\
+		*(__imv)		/* Immediate values: pointers */ \
+		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-02 21:08 [patch 0/7] Immediate Values Mathieu Desnoyers
@ 2008-02-02 21:08 ` Mathieu Desnoyers
  2008-02-26 22:52   ` Jason Baron
  0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-02-02 21:08 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel; +Cc: Mathieu Desnoyers, Rusty Russell

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 18592 bytes --]

Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _imv_read() version, which uses standard global
variables, and optimized per architecture imv_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__imv" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

Changelog:

- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
  - imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
  - Wait until we have gcc support before we add the imv_if macro, since
    its form may have to change.
- Dont't declare the __imv section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface : remove imv_set_early, keep track of kernel boot
  status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
  architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
---
 include/asm-generic/vmlinux.lds.h |    3 
 include/linux/immediate.h         |   94 +++++++++++++++++++
 include/linux/module.h            |   16 +++
 init/main.c                       |    8 +
 kernel/Makefile                   |    1 
 kernel/immediate.c                |  187 ++++++++++++++++++++++++++++++++++++++
 kernel/module.c                   |   50 +++++++++-
 7 files changed, 358 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/immediate.h	2008-02-02 15:54:04.000000000 -0500
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+} __attribute__ ((packed));
+
+#include <asm/immediate.h>
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)						\
+	do {								\
+		name##__imv = (i);					\
+		core_imv_update();					\
+		module_imv_update();					\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_imv_update(void);
+extern void imv_update_range(const struct __imv *begin,
+	const struct __imv *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_read(name)			_imv_read(name)
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)		(name##__imv = (i))
+
+static inline void core_imv_update(void) { }
+static inline void module_imv_update(void) { }
+
+#endif
+
+#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv
+#define DEFINE_IMV(type, name)  __typeof__(type) name##__imv
+
+#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv)
+#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv)
+
+/**
+ * _imv_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _imv_read(name)		(name##__imv)
+
+#endif
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h	2008-02-02 15:54:04.000000000 -0500
+++ linux-2.6-lttng/include/linux/module.h	2008-02-02 15:54:04.000000000 -0500
@@ -15,6 +15,7 @@
 #include <linux/stringify.h>
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
+#include <linux/immediate.h>
 #include <linux/marker.h>
 #include <asm/local.h>
 
@@ -355,6 +356,10 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+#ifdef CONFIG_IMMEDIATE
+	const struct __imv *immediate;
+	unsigned int num_immediate;
+#endif
 #ifdef CONFIG_MARKERS
 	struct marker *markers;
 	unsigned int num_markers;
@@ -467,6 +472,9 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -572,6 +580,14 @@ static inline void module_update_markers
 {
 }
 
+static inline void _module_imv_update(void)
+{
+}
+
+static inline void module_imv_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2008-02-02 15:54:04.000000000 -0500
+++ linux-2.6-lttng/kernel/module.c	2008-02-02 15:54:04.000000000 -0500
@@ -33,6 +33,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -1716,6 +1717,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
 	struct module *mod;
@@ -1814,6 +1816,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__imv");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1965,6 +1968,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -2032,11 +2040,16 @@ static struct module *load_module(void _
 
 	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
+	if (!(mod->taints & TAINT_FORCED_MODULE)) {
 #ifdef CONFIG_MARKERS
-	if (!(mod->taints & TAINT_FORCED_MODULE))
 		marker_update_probe_range(mod->markers,
 			mod->markers + mod->num_markers);
 #endif
+#ifdef CONFIG_IMMEDIATE
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+	}
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2573,3 +2586,38 @@ void module_update_markers(void)
 	mutex_unlock(&module_mutex);
 }
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_imv_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_imv_update);
+
+/**
+ * module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_imv_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_imv_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_imv_update);
+#endif
Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/immediate.c	2008-02-02 15:57:23.000000000 -0500
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+
+#include <asm/cacheflush.h>
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int imv_early_boot_complete;
+
+extern const struct __imv __start___imv[];
+extern const struct __imv __stop___imv[];
+
+/*
+ * imv_mutex nests inside module_mutex. imv_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(imv_mutex);
+
+static atomic_t wait_sync;
+
+struct ipi_loop_data {
+	long value;
+	const struct __imv *imv;
+} loop_data;
+
+static void ipi_busy_loop(void *arg)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	atomic_dec(&wait_sync);
+	do {
+		/* Make sure the wait_sync gets re-read */
+		smp_mb();
+	} while (atomic_read(&wait_sync) > loop_data.value);
+	atomic_dec(&wait_sync);
+	do {
+		/* Make sure the wait_sync gets re-read */
+		smp_mb();
+	} while (atomic_read(&wait_sync) > 0);
+	/*
+	 * Issuing a synchronizing instruction must be done on each CPU before
+	 * reenabling interrupts after modifying an instruction. Required by
+	 * Intel's errata.
+	 */
+	sync_core();
+	flush_icache_range(loop_data.imv->imv,
+		loop_data.imv->imv + loop_data.imv->size);
+	local_irq_restore(flags);
+}
+
+/**
+ * apply_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+static int apply_imv_update(const struct __imv *imv)
+{
+	unsigned long flags;
+	long online_cpus;
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv
+				== *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv
+				== *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv
+				== *(uint32_t *)imv->var)
+			return 0;
+		break;
+	case 8:	if (*(uint64_t *)imv->imv
+				== *(uint64_t *)imv->var)
+			return 0;
+		break;
+	default:return -EINVAL;
+	}
+
+	if (imv_early_boot_complete) {
+		kernel_text_lock();
+		get_online_cpus();
+		online_cpus = num_online_cpus();
+		atomic_set(&wait_sync, 2 * online_cpus);
+		loop_data.value = online_cpus;
+		loop_data.imv = imv;
+		smp_call_function(ipi_busy_loop, NULL, 1, 0);
+		local_irq_save(flags);
+		atomic_dec(&wait_sync);
+		do {
+			/* Make sure the wait_sync gets re-read */
+			smp_mb();
+		} while (atomic_read(&wait_sync) > online_cpus);
+		text_poke((void *)imv->imv, (void *)imv->var,
+				imv->size);
+		/*
+		 * Make sure the modified instruction is seen by all CPUs before
+		 * we continue (visible to other CPUs and local interrupts).
+		 */
+		wmb();
+		atomic_dec(&wait_sync);
+		flush_icache_range(imv->imv,
+				imv->imv + imv->size);
+		local_irq_restore(flags);
+		put_online_cpus();
+		kernel_text_unlock();
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+				imv->size);
+	return 0;
+}
+
+/**
+ * imv_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Updates a range of immediates.
+ */
+void imv_update_range(const struct __imv *begin,
+		const struct __imv *end)
+{
+	const struct __imv *iter;
+	int ret;
+	for (iter = begin; iter < end; iter++) {
+		mutex_lock(&imv_mutex);
+		ret = apply_imv_update(iter);
+		if (imv_early_boot_complete && ret)
+			printk(KERN_WARNING
+				"Invalid immediate value. "
+				"Variable at %p, "
+				"instruction at %p, size %hu\n",
+				(void *)iter->imv,
+				(void *)iter->var, iter->size);
+		mutex_unlock(&imv_mutex);
+	}
+}
+EXPORT_SYMBOL_GPL(imv_update_range);
+
+/**
+ * imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_imv_update(void)
+{
+	/* Core kernel imvs */
+	imv_update_range(__start___imv, __stop___imv);
+}
+EXPORT_SYMBOL_GPL(core_imv_update);
+
+void __init imv_init_complete(void)
+{
+	imv_early_boot_complete = 1;
+}
Index: linux-2.6-lttng/init/main.c
===================================================================
--- linux-2.6-lttng.orig/init/main.c	2008-02-02 15:33:24.000000000 -0500
+++ linux-2.6-lttng/init/main.c	2008-02-02 15:54:04.000000000 -0500
@@ -57,6 +57,7 @@
 #include <linux/device.h>
 #include <linux/kthread.h>
 #include <linux/sched.h>
+#include <linux/immediate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -101,6 +102,11 @@ static inline void mark_rodata_ro(void) 
 #ifdef CONFIG_TC
 extern void tc_init(void);
 #endif
+#ifdef CONFIG_IMMEDIATE
+extern void imv_init_complete(void);
+#else
+static inline void imv_init_complete(void) { }
+#endif
 
 enum system_states system_state;
 EXPORT_SYMBOL(system_state);
@@ -522,6 +528,7 @@ asmlinkage void __init start_kernel(void
 	unwind_init();
 	lockdep_init();
 	cgroup_init_early();
+	core_imv_update();
 
 	local_irq_disable();
 	early_boot_irqs_off();
@@ -645,6 +652,7 @@ asmlinkage void __init start_kernel(void
 	cpuset_init();
 	taskstats_init_early();
 	delayacct_init();
+	imv_init_complete();
 
 	check_bugs();
 
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile	2008-02-02 15:33:24.000000000 -0500
+++ linux-2.6-lttng/kernel/Makefile	2008-02-02 15:54:04.000000000 -0500
@@ -63,6 +63,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 obj-$(CONFIG_MARKERS) += marker.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2008-02-02 15:33:24.000000000 -0500
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2008-02-02 15:54:04.000000000 -0500
@@ -61,6 +61,9 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
+		VMLINUX_SYMBOL(__start___imv) = .;			\
+		*(__imv)		/* Immediate values: pointers */ \
+		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-02 21:08 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2008-02-26 22:52   ` Jason Baron
  2008-02-26 23:12     ` Mathieu Desnoyers
  2008-02-27 19:05     ` Mathieu Desnoyers
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Baron @ 2008-02-26 22:52 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell

On Sat, Feb 02, 2008 at 04:08:29PM -0500, Mathieu Desnoyers wrote:
> Changelog:
> 
> - section __imv is already SHF_ALLOC
> - Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
>   the if (immediateindex) is unnecessary here.
> - Remove module_mutex usage: depend on functions implemented in module.c for
>   that.

hi,

In testing this patch, i've run across a deadlock...apply_imv_update() can get
called again before, ipi_busy_loop() has had a chance to finsh, and set 
wait_sync back to its initial value. This causes ipi_busy_loop() to get stuck
indefinitely and the subsequent apply_imv_update() hangs. I've shown this 
deadlock below using nmi_watchdog=1 in item 1). 

After hitting this deadlock i modified apply_imv_update() to wait for 
ipi_busy_loop(), to finish, however this resulted in a 3 way deadlock. Process
A held a read lock on tasklist_lock, then process B called apply_imv_update().
Process A received the IPI and begins executing ipi_busy_loop(). Then process
C takes a write lock irq on the task list lock, before receiving the IPI. Thus,
process A holds up process C, and C can't get an IPI b/c interrupts are
disabled. i believe this is an inherent problem in using smp_call_function, in
that one can't synchronize the processes on each other...you can reproduce
these issues using the test module below item 2)

In order to address these issues, i've modified stop_machine_run() to take an
new third parameter, RUN_ALL, to allow stop_machine_run() to run a function
on all cpus, item 3). I then modified kernel/immediate.c to use this new 
infrastructure item 4). the code in immediate.c simplifies quite a bit. This
new code has been robust through all testing thus far.

thanks,

-Jason

1)

 NMI Watchdog detected LOCKUP on CPU 3
CPU 3
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #1
RIP: 0010:[<ffffffff8106e7cc>]  [<ffffffff8106e7cc>] ipi_busy_loop+0x19/0x41
RSP: 0018:ffff81007fbdff70  EFLAGS: 00000002
RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff81007fbd2930
RDX: ffffffff81491b80 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff81007fbdff78 R08: ffff81007fbdff78 R09: ffff81007dd91e80
R10: ffff810001030b80 R11: 0000000000000246 R12: ffffffff8106e7b3
R13: 0000000000000000 R14: ffffffff81491a80 R15: ffff810001030a80
FS:  0000000000000000(0000) GS:ffff81007f801c80(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000025e8d68 CR3: 000000007549c000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff81007fbd6000, task ffff81007fbd2930)
Stack:  0000000000000000 ffff81007fbdffa8 ffffffff8101ca4e ffff81007fbdffa8
 ffffffff8100b0e2 0000000000000003 0000000000000040 ffff81007fbd7e60
 ffffffff8100cac6 ffff81007fbd7e60 <EOI>  ffff81007fbd7ee8 0000000000000246
Call Trace:
 <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
 [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
 [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
 <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
 [<ffffffff8100a723>] ? enter_idle+0x22/0x24
 [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
 [<ffffffff81270a35>] ? start_secondary+0x3ba/0x3c6


---[ end trace 738433437d960ebc ]---
NMI Watchdog detected LOCKUP on CPU 2
CPU 2
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 9704, comm: toggle-writer Not tainted 2.6.24-git12markers #1
RIP: 0010:[<ffffffff8101c717>]  [<ffffffff8101c717>] __smp_call_function_mask+0x9f/0xc1
RSP: 0018:ffff81003ac35da8  EFLAGS: 00000297
RAX: 00000000000008fc RBX: 000000000000000b RCX: 00000000000008fc
RDX: 00000000000008fc RSI: 00000000000000fc RDI: 000000000000000b
RBP: ffff81003ac35e08 R08: ffffffff8840504f R09: 00007f01f07696f0
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8106e7b3
FS:  00007f01f07696f0(0000) GS:ffff81007f801980(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f01f0796000 CR3: 000000006bcdf000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process toggle-writer (pid: 9704, threadinfo ffff81003ac34000, task ffff81003ad14930)
Stack:  ffffffff8106e7b3 0000000000000000 0000000000000002 00003fff00000000
 000000000000000b ffffffff81075c03 00000010001280d2 0000000000000000
 0000000000000000 ffffffff8106e7b3 000000000000000f 0000000000000005
Call Trace:
 [<ffffffff8106e7b3>] ? ipi_busy_loop+0x0/0x41
 [<ffffffff81075c03>] ? __alloc_pages+0x8e/0x337
 [<ffffffff8106e7b3>] ? ipi_busy_loop+0x0/0x41
 [<ffffffff8101c783>] smp_call_function_mask+0x4a/0x59
 [<ffffffff8101c7ab>] smp_call_function+0x19/0x1b
 [<ffffffff8106e703>] imv_update_range+0xd7/0x16e
 [<ffffffff8105494e>] _module_imv_update+0x35/0x54
 [<ffffffff81054982>] module_imv_update+0x15/0x23
 [<ffffffff884050c4>] :toggle_tester:proc_toggle_write+0x4c/0x64
 [<ffffffff810d64a4>] proc_reg_write+0x7b/0x96
 [<ffffffff81099f72>] vfs_write+0xae/0x157
 [<ffffffff8109a53f>] sys_write+0x47/0x70
 [<ffffffff8100c0e9>] tracesys+0xdc/0xe1

Code: f8 48 3b 5d c0 48 8b 05 28 1a 41 00 75 0a bf fc 00 00 00 ff 50 38 eb 0f be fc 00 00 00 48
---[ end trace 738433437d960ebc ]---
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #1
RIP: 0010:[<ffffffff8106e7cc>]  [<ffffffff8106e7cc>] ipi_busy_loop+0x19/0x41
RSP: 0018:ffff81007fb6ff70  EFLAGS: 00000002
RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff81007fb5f260
RDX: ffffffff81491b80 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff81007fb6ff78 R08: ffff81007fb6ff78 R09: ffff810001016700
R10: ffff810001018b80 R11: 0000000000000001 R12: ffffffff8106e7b3
R13: 0000000000000000 R14: ffffffff81491a80 R15: ffff810001018a80
FS:  0000000000000000(0000) GS:ffff81007f801680(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000031fac9ac20 CR3: 00000000778e6000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff81007fb6a000, task ffff81007fb5f260)
Stack:  0000000000000000 ffff81007fb6ffa8 ffffffff8101ca4e ffff81007fb6ffa8
 ffffffff8100b0e2 0000000000000001 0000000000000040 ffff81007fb6be60
 ffffffff8100cac6 ffff81007fb6be60 <EOI>  ffff81007fb6bee8 0000000000000001
Call Trace:
 <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
 [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
 [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
 <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
 [<ffffffff8100a723>] ? enter_idle+0x22/0x24
 [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
 [<ffffffff81270a35>] ? start_secondary+0x3ba/0x3c6

Code: 81 48 c7 c7 02 fd 36 81 48 89 e5 e8 7b fe ff ff c9 c3 55 48 89 e5 53 9c 5e fa f0 ff 0d 82
---[ end trace 738433437d960ebc ]---
NMI Watchdog detected LOCKUP on CPU 0
Kernel panic - not syncing: Aiee, killing interrupt handler!
CPU 0
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #1
RIP: 0010:[<ffffffff8106e7e6>]  [<ffffffff8106e7e6>] ipi_busy_loop+0x33/0x41
RSP: 0018:ffffffff81498f70  EFLAGS: 00000006
RAX: 0000000000000004 RBX: 0000000000000000 RCX: ffffffff81394620
RDX: ffffffff81491b80 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffffffff81498f78 R08: ffffffff81498f78 R09: ffff810062cc1e98
R10: ffff81000100cb80 R11: ffff810062cc1d58 R12: ffffffff8106e7b3
R13: 0000000000000000 R14: ffffffff81466640 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff813d2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000031fac6f060 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81434000, task ffffffff81394620)
Stack:  0000000000000000 ffffffff81498fa8 ffffffff8101ca4e ffffffff8100b0e2
 ffffffff8100b0e2 0000000000000000 ffffffffffffffff ffffffff81435ec0
 ffffffff8100cac6 ffffffff81435ec0 <EOI>  ffffffff81435f48 ffff810062cc1d58
Call Trace:
 <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
 [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
 [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
 [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
 <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
 [<ffffffff8100a723>] ? enter_idle+0x22/0x24
 [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
 [<ffffffff81266ee6>] ? rest_init+0x5a/0x5c

Code: f0 ff 0d 82 6b 49 00 0f ae f0 48 63 05 78 6b 49 00 48 3b 05 5d 6b 49 00 7f ed f0 ff 0d 68
---[ end trace 738433437d960ebc ]---


2)


--- /dev/null
+++ b/samples/markers/toggle-tester.c
@@ -0,0 +1,90 @@
+/* probe-example.c
+ *
+ * toggle module to test immedidate infrastructure.
+ *
+ * (C) Copyright 2007 Jason Baron <jbaron@redhat.com>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/marker.h>
+#include <asm/atomic.h>
+#include <linux/seq_file.h>
+#include <linux/fs.h>
+#include <linux/proc_fs.h>
+
+static DECLARE_MUTEX(toggle_mutex);
+DEFINE_IMV(char, toggle_value) = 0;
+
+static ssize_t proc_toggle_write(struct file *file, const char __user *buf,
+                                size_t length, loff_t *ppos)
+{
+	int i;
+
+	down(&toggle_mutex);
+	i = imv_read(toggle_value);
+	imv_set(toggle_value, !i);
+	up(&toggle_mutex);
+
+	return 0;
+}
+
+static int proc_toggle_show(struct seq_file *s, void *p)
+{
+	int i;
+
+	down(&toggle_mutex);        
+        i = imv_read(toggle_value);
+	up(&toggle_mutex);
+
+	seq_printf(s, "%u\n", i);
+
+        return 0;
+}
+
+
+static int proc_toggle_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, proc_toggle_show, NULL);
+}
+
+
+static const struct file_operations toggle_operations = {
+	.open           = proc_toggle_open,
+	.read           = seq_read,
+	.write          = proc_toggle_write,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+	
+
+struct proc_dir_entry *pde;
+
+static int __init toggle_init(void)
+{
+	
+	pde = create_proc_entry("toggle", 0, NULL);
+	if (!pde)
+		return -ENOMEM;
+	pde->proc_fops = &toggle_operations;
+
+
+	return 0;
+}
+
+static void __exit toggle_fini(void)
+{
+	remove_proc_entry("toggle", &proc_root);
+	
+}
+
+module_init(toggle_init);
+module_exit(toggle_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jason Baron");
+MODULE_DESCRIPTION("testing immediate implementation");


3)


diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 5bfc553..1ab1c5b 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -8,6 +8,9 @@
 #include <asm/system.h>
 
 #if defined(CONFIG_STOP_MACHINE) && defined(CONFIG_SMP)
+
+#define RUN_ALL ~0U
+
 /**
  * stop_machine_run: freeze the machine on all CPUs and run this function
  * @fn: the function to run
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 51b5ee5..e6ee46f 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -23,9 +23,17 @@ enum stopmachine_state {
 	STOPMACHINE_WAIT,
 	STOPMACHINE_PREPARE,
 	STOPMACHINE_DISABLE_IRQ,
+	STOPMACHINE_RUN,
 	STOPMACHINE_EXIT,
 };
 
+struct stop_machine_data {
+	int (*fn)(void *);
+	void *data;
+	struct completion done;
+	int run_all;
+} smdata;
+
 static enum stopmachine_state stopmachine_state;
 static unsigned int stopmachine_num_threads;
 static atomic_t stopmachine_thread_ack;
@@ -35,6 +43,7 @@ static int stopmachine(void *cpu)
 {
 	int irqs_disabled = 0;
 	int prepared = 0;
+	int ran = 0;
 
 	set_cpus_allowed(current, cpumask_of_cpu((int)(long)cpu));
 
@@ -59,6 +68,11 @@ static int stopmachine(void *cpu)
 			prepared = 1;
 			smp_mb(); /* Must read state first. */
 			atomic_inc(&stopmachine_thread_ack);
+		} else if (stopmachine_state == STOPMACHINE_RUN && !ran) {
+			smdata.fn(smdata.data);
+			ran = 1;
+			smp_mb(); /* Must read state first. */
+			atomic_inc(&stopmachine_thread_ack);
 		}
 		/* Yield in first stage: migration threads need to
 		 * help our sisters onto their CPUs. */
@@ -136,12 +150,10 @@ static void restart_machine(void)
 	preempt_enable_no_resched();
 }
 
-struct stop_machine_data
+static void run_other_cpus(void)
 {
-	int (*fn)(void *);
-	void *data;
-	struct completion done;
-};
+	stopmachine_set_state(STOPMACHINE_RUN);
+}
 
 static int do_stop(void *_smdata)
 {
@@ -151,6 +163,8 @@ static int do_stop(void *_smdata)
 	ret = stop_machine();
 	if (ret == 0) {
 		ret = smdata->fn(smdata->data);
+		if (smdata->run_all)
+			run_other_cpus();
 		restart_machine();
 	}
 
@@ -170,17 +184,16 @@ static int do_stop(void *_smdata)
 struct task_struct *__stop_machine_run(int (*fn)(void *), void *data,
 				       unsigned int cpu)
 {
-	struct stop_machine_data smdata;
 	struct task_struct *p;
 
+	down(&stopmachine_mutex);
 	smdata.fn = fn;
 	smdata.data = data;
+	smdata.run_all = (cpu == RUN_ALL) ? 1 : 0;
 	init_completion(&smdata.done);
-
-	down(&stopmachine_mutex);
-
+	smp_wmb();
 	/* If they don't care which CPU fn runs on, bind to any online one. */
-	if (cpu == NR_CPUS)
+	if (cpu == NR_CPUS || cpu == RUN_ALL)
 		cpu = raw_smp_processor_id();
 
 	p = kthread_create(do_stop, &smdata, "kstopmachine");


4)


diff --git a/kernel/immediate.c b/kernel/immediate.c
index 4c36a89..39ec13e 100644
--- a/kernel/immediate.c
+++ b/kernel/immediate.c
@@ -20,6 +20,7 @@
 #include <linux/immediate.h>
 #include <linux/memory.h>
 #include <linux/cpu.h>
+#include <linux/stop_machine.h>
 
 #include <asm/cacheflush.h>
 
@@ -30,6 +31,23 @@ static int imv_early_boot_complete;
 
 extern const struct __imv __start___imv[];
 extern const struct __imv __stop___imv[];
+int started;
+
+int stop_machine_imv_update(void *imv_ptr)
+{
+	struct __imv *imv = imv_ptr;
+
+	if (!started) {
+		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
+		started = 1;
+		smp_wmb();
+	} else
+		sync_core();
+
+	flush_icache_range(imv->imv, imv->imv + imv->size);
+
+	return 0;
+}
 
 /*
  * imv_mutex nests inside module_mutex. imv_mutex protects builtin
@@ -37,38 +55,6 @@ extern const struct __imv __stop___imv[];
  */
 static DEFINE_MUTEX(imv_mutex);
 
-static atomic_t wait_sync;
-
-struct ipi_loop_data {
-	long value;
-	const struct __imv *imv;
-} loop_data;
-
-static void ipi_busy_loop(void *arg)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	atomic_dec(&wait_sync);
-	do {
-		/* Make sure the wait_sync gets re-read */
-		smp_mb();
-	} while (atomic_read(&wait_sync) > loop_data.value);
-	atomic_dec(&wait_sync);
-	do {
-		/* Make sure the wait_sync gets re-read */
-		smp_mb();
-	} while (atomic_read(&wait_sync) > 0);
-	/*
-	 * Issuing a synchronizing instruction must be done on each CPU before
-	 * reenabling interrupts after modifying an instruction. Required by
-	 * Intel's errata.
-	 */
-	sync_core();
-	flush_icache_range(loop_data.imv->imv,
-		loop_data.imv->imv + loop_data.imv->size);
-	local_irq_restore(flags);
-}
 
 /**
  * apply_imv_update - update one immediate value
@@ -82,9 +68,6 @@ static void ipi_busy_loop(void *arg)
  */
 static int apply_imv_update(const struct __imv *imv)
 {
-	unsigned long flags;
-	long online_cpus;
-
 	/*
 	 * If the variable and the instruction have the same value, there is
 	 * nothing to do.
@@ -111,30 +94,8 @@ static int apply_imv_update(const struct __imv *imv)
 
 	if (imv_early_boot_complete) {
 		kernel_text_lock();
-		get_online_cpus();
-		online_cpus = num_online_cpus();
-		atomic_set(&wait_sync, 2 * online_cpus);
-		loop_data.value = online_cpus;
-		loop_data.imv = imv;
-		smp_call_function(ipi_busy_loop, NULL, 1, 0);
-		local_irq_save(flags);
-		atomic_dec(&wait_sync);
-		do {
-			/* Make sure the wait_sync gets re-read */
-			smp_mb();
-		} while (atomic_read(&wait_sync) > online_cpus);
-		text_poke((void *)imv->imv, (void *)imv->var,
-				imv->size);
-		/*
-		 * Make sure the modified instruction is seen by all CPUs before
-		 * we continue (visible to other CPUs and local interrupts).
-		 */
-		wmb();
-		atomic_dec(&wait_sync);
-		flush_icache_range(imv->imv,
-				imv->imv + imv->size);
-		local_irq_restore(flags);
-		put_online_cpus();
+		started = 0;
+		stop_machine_run(stop_machine_imv_update, (void *)imv, RUN_ALL);
 		kernel_text_unlock();
 	} else
 		text_poke_early((void *)imv->imv, (void *)imv->var,

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-26 22:52   ` Jason Baron
@ 2008-02-26 23:12     ` Mathieu Desnoyers
  2008-02-26 23:34       ` Mathieu Desnoyers
  2008-02-27 17:01       ` Jason Baron
  2008-02-27 19:05     ` Mathieu Desnoyers
  1 sibling, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-02-26 23:12 UTC (permalink / raw)
  To: Jason Baron; +Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell, Jan Kiszka

* Jason Baron (jbaron@redhat.com) wrote:
> On Sat, Feb 02, 2008 at 04:08:29PM -0500, Mathieu Desnoyers wrote:
> > Changelog:
> > 
> > - section __imv is already SHF_ALLOC
> > - Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
> >   the if (immediateindex) is unnecessary here.
> > - Remove module_mutex usage: depend on functions implemented in module.c for
> >   that.
> 
> hi,
> 
> In testing this patch, i've run across a deadlock...apply_imv_update() can get
> called again before, ipi_busy_loop() has had a chance to finsh, and set 
> wait_sync back to its initial value. This causes ipi_busy_loop() to get stuck
> indefinitely and the subsequent apply_imv_update() hangs. I've shown this 
> deadlock below using nmi_watchdog=1 in item 1). 
> 

Hrm, yes, Jan pointed out the exact same problem in my ltt-test-tsc TSC
test module in LTTng a few days ago. His fix implied to add another
barrier upon which the smp_call_function() caller must wait for the ipis
to finish. Since this immediate value code does the same I did in my
ltt-test-tsc code, the same fix will likely apply.

I'll cook something.

Hrm, does your stop machine implementation disable interrupts while the
CPUs busy loop ?

> After hitting this deadlock i modified apply_imv_update() to wait for 
> ipi_busy_loop(), to finish, however this resulted in a 3 way deadlock. Process
> A held a read lock on tasklist_lock, then process B called apply_imv_update().
> Process A received the IPI and begins executing ipi_busy_loop(). Then process
> C takes a write lock irq on the task list lock, before receiving the IPI. Thus,
> process A holds up process C, and C can't get an IPI b/c interrupts are
> disabled. i believe this is an inherent problem in using smp_call_function, in
> that one can't synchronize the processes on each other...you can reproduce
> these issues using the test module below item 2)
> 
> In order to address these issues, i've modified stop_machine_run() to take an
> new third parameter, RUN_ALL, to allow stop_machine_run() to run a function
> on all cpus, item 3). I then modified kernel/immediate.c to use this new 
> infrastructure item 4). the code in immediate.c simplifies quite a bit. This
> new code has been robust through all testing thus far.
> 
> thanks,
> 
> -Jason
> 
> 1)
> 
>  NMI Watchdog detected LOCKUP on CPU 3
> CPU 3
> Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #1
> RIP: 0010:[<ffffffff8106e7cc>]  [<ffffffff8106e7cc>] ipi_busy_loop+0x19/0x41
> RSP: 0018:ffff81007fbdff70  EFLAGS: 00000002
> RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff81007fbd2930
> RDX: ffffffff81491b80 RSI: 0000000000000046 RDI: 0000000000000000
> RBP: ffff81007fbdff78 R08: ffff81007fbdff78 R09: ffff81007dd91e80
> R10: ffff810001030b80 R11: 0000000000000246 R12: ffffffff8106e7b3
> R13: 0000000000000000 R14: ffffffff81491a80 R15: ffff810001030a80
> FS:  0000000000000000(0000) GS:ffff81007f801c80(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000000025e8d68 CR3: 000000007549c000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffff81007fbd6000, task ffff81007fbd2930)
> Stack:  0000000000000000 ffff81007fbdffa8 ffffffff8101ca4e ffff81007fbdffa8
>  ffffffff8100b0e2 0000000000000003 0000000000000040 ffff81007fbd7e60
>  ffffffff8100cac6 ffff81007fbd7e60 <EOI>  ffff81007fbd7ee8 0000000000000246
> Call Trace:
>  <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
>  [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
>  [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
>  <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
>  [<ffffffff8100a723>] ? enter_idle+0x22/0x24
>  [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
>  [<ffffffff81270a35>] ? start_secondary+0x3ba/0x3c6
> 
> 
> ---[ end trace 738433437d960ebc ]---
> NMI Watchdog detected LOCKUP on CPU 2
> CPU 2
> Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 9704, comm: toggle-writer Not tainted 2.6.24-git12markers #1
> RIP: 0010:[<ffffffff8101c717>]  [<ffffffff8101c717>] __smp_call_function_mask+0x9f/0xc1
> RSP: 0018:ffff81003ac35da8  EFLAGS: 00000297
> RAX: 00000000000008fc RBX: 000000000000000b RCX: 00000000000008fc
> RDX: 00000000000008fc RSI: 00000000000000fc RDI: 000000000000000b
> RBP: ffff81003ac35e08 R08: ffffffff8840504f R09: 00007f01f07696f0
> R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8106e7b3
> FS:  00007f01f07696f0(0000) GS:ffff81007f801980(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007f01f0796000 CR3: 000000006bcdf000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process toggle-writer (pid: 9704, threadinfo ffff81003ac34000, task ffff81003ad14930)
> Stack:  ffffffff8106e7b3 0000000000000000 0000000000000002 00003fff00000000
>  000000000000000b ffffffff81075c03 00000010001280d2 0000000000000000
>  0000000000000000 ffffffff8106e7b3 000000000000000f 0000000000000005
> Call Trace:
>  [<ffffffff8106e7b3>] ? ipi_busy_loop+0x0/0x41
>  [<ffffffff81075c03>] ? __alloc_pages+0x8e/0x337
>  [<ffffffff8106e7b3>] ? ipi_busy_loop+0x0/0x41
>  [<ffffffff8101c783>] smp_call_function_mask+0x4a/0x59
>  [<ffffffff8101c7ab>] smp_call_function+0x19/0x1b
>  [<ffffffff8106e703>] imv_update_range+0xd7/0x16e
>  [<ffffffff8105494e>] _module_imv_update+0x35/0x54
>  [<ffffffff81054982>] module_imv_update+0x15/0x23
>  [<ffffffff884050c4>] :toggle_tester:proc_toggle_write+0x4c/0x64
>  [<ffffffff810d64a4>] proc_reg_write+0x7b/0x96
>  [<ffffffff81099f72>] vfs_write+0xae/0x157
>  [<ffffffff8109a53f>] sys_write+0x47/0x70
>  [<ffffffff8100c0e9>] tracesys+0xdc/0xe1
> 
> Code: f8 48 3b 5d c0 48 8b 05 28 1a 41 00 75 0a bf fc 00 00 00 ff 50 38 eb 0f be fc 00 00 00 48
> ---[ end trace 738433437d960ebc ]---
> NMI Watchdog detected LOCKUP on CPU 1
> CPU 1
> Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #1
> RIP: 0010:[<ffffffff8106e7cc>]  [<ffffffff8106e7cc>] ipi_busy_loop+0x19/0x41
> RSP: 0018:ffff81007fb6ff70  EFLAGS: 00000002
> RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff81007fb5f260
> RDX: ffffffff81491b80 RSI: 0000000000000046 RDI: 0000000000000000
> RBP: ffff81007fb6ff78 R08: ffff81007fb6ff78 R09: ffff810001016700
> R10: ffff810001018b80 R11: 0000000000000001 R12: ffffffff8106e7b3
> R13: 0000000000000000 R14: ffffffff81491a80 R15: ffff810001018a80
> FS:  0000000000000000(0000) GS:ffff81007f801680(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000031fac9ac20 CR3: 00000000778e6000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffff81007fb6a000, task ffff81007fb5f260)
> Stack:  0000000000000000 ffff81007fb6ffa8 ffffffff8101ca4e ffff81007fb6ffa8
>  ffffffff8100b0e2 0000000000000001 0000000000000040 ffff81007fb6be60
>  ffffffff8100cac6 ffff81007fb6be60 <EOI>  ffff81007fb6bee8 0000000000000001
> Call Trace:
>  <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
>  [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
>  [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
>  <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
>  [<ffffffff8100a723>] ? enter_idle+0x22/0x24
>  [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
>  [<ffffffff81270a35>] ? start_secondary+0x3ba/0x3c6
> 
> Code: 81 48 c7 c7 02 fd 36 81 48 89 e5 e8 7b fe ff ff c9 c3 55 48 89 e5 53 9c 5e fa f0 ff 0d 82
> ---[ end trace 738433437d960ebc ]---
> NMI Watchdog detected LOCKUP on CPU 0
> Kernel panic - not syncing: Aiee, killing interrupt handler!
> CPU 0
> Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #1
> RIP: 0010:[<ffffffff8106e7e6>]  [<ffffffff8106e7e6>] ipi_busy_loop+0x33/0x41
> RSP: 0018:ffffffff81498f70  EFLAGS: 00000006
> RAX: 0000000000000004 RBX: 0000000000000000 RCX: ffffffff81394620
> RDX: ffffffff81491b80 RSI: 0000000000000046 RDI: 0000000000000000
> RBP: ffffffff81498f78 R08: ffffffff81498f78 R09: ffff810062cc1e98
> R10: ffff81000100cb80 R11: ffff810062cc1d58 R12: ffffffff8106e7b3
> R13: 0000000000000000 R14: ffffffff81466640 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffffffff813d2000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000031fac6f060 CR3: 0000000000201000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff81434000, task ffffffff81394620)
> Stack:  0000000000000000 ffffffff81498fa8 ffffffff8101ca4e ffffffff8100b0e2
>  ffffffff8100b0e2 0000000000000000 ffffffffffffffff ffffffff81435ec0
>  ffffffff8100cac6 ffffffff81435ec0 <EOI>  ffffffff81435f48 ffff810062cc1d58
> Call Trace:
>  <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
>  [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
>  [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
>  [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
>  <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
>  [<ffffffff8100a723>] ? enter_idle+0x22/0x24
>  [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
>  [<ffffffff81266ee6>] ? rest_init+0x5a/0x5c
> 
> Code: f0 ff 0d 82 6b 49 00 0f ae f0 48 63 05 78 6b 49 00 48 3b 05 5d 6b 49 00 7f ed f0 ff 0d 68
> ---[ end trace 738433437d960ebc ]---
> 
> 
> 2)
> 
> 
> --- /dev/null
> +++ b/samples/markers/toggle-tester.c
> @@ -0,0 +1,90 @@
> +/* probe-example.c
> + *
> + * toggle module to test immedidate infrastructure.
> + *
> + * (C) Copyright 2007 Jason Baron <jbaron@redhat.com>
> + *
> + * This file is released under the GPLv2.
> + * See the file COPYING for more details.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/marker.h>
> +#include <asm/atomic.h>
> +#include <linux/seq_file.h>
> +#include <linux/fs.h>
> +#include <linux/proc_fs.h>
> +
> +static DECLARE_MUTEX(toggle_mutex);
> +DEFINE_IMV(char, toggle_value) = 0;
> +
> +static ssize_t proc_toggle_write(struct file *file, const char __user *buf,
> +                                size_t length, loff_t *ppos)
> +{
> +	int i;
> +
> +	down(&toggle_mutex);
> +	i = imv_read(toggle_value);
> +	imv_set(toggle_value, !i);
> +	up(&toggle_mutex);
> +
> +	return 0;
> +}
> +
> +static int proc_toggle_show(struct seq_file *s, void *p)
> +{
> +	int i;
> +
> +	down(&toggle_mutex);        
> +        i = imv_read(toggle_value);
> +	up(&toggle_mutex);
> +
> +	seq_printf(s, "%u\n", i);
> +
> +        return 0;
> +}
> +
> +
> +static int proc_toggle_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, proc_toggle_show, NULL);
> +}
> +
> +
> +static const struct file_operations toggle_operations = {
> +	.open           = proc_toggle_open,
> +	.read           = seq_read,
> +	.write          = proc_toggle_write,
> +	.llseek         = seq_lseek,
> +	.release        = single_release,
> +};
> +	
> +
> +struct proc_dir_entry *pde;
> +
> +static int __init toggle_init(void)
> +{
> +	
> +	pde = create_proc_entry("toggle", 0, NULL);
> +	if (!pde)
> +		return -ENOMEM;
> +	pde->proc_fops = &toggle_operations;
> +
> +
> +	return 0;
> +}
> +
> +static void __exit toggle_fini(void)
> +{
> +	remove_proc_entry("toggle", &proc_root);
> +	
> +}
> +
> +module_init(toggle_init);
> +module_exit(toggle_fini);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Jason Baron");
> +MODULE_DESCRIPTION("testing immediate implementation");
> 
> 
> 3)
> 
> 
> diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
> index 5bfc553..1ab1c5b 100644
> --- a/include/linux/stop_machine.h
> +++ b/include/linux/stop_machine.h
> @@ -8,6 +8,9 @@
>  #include <asm/system.h>
>  
>  #if defined(CONFIG_STOP_MACHINE) && defined(CONFIG_SMP)
> +
> +#define RUN_ALL ~0U
> +
>  /**
>   * stop_machine_run: freeze the machine on all CPUs and run this function
>   * @fn: the function to run
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 51b5ee5..e6ee46f 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -23,9 +23,17 @@ enum stopmachine_state {
>  	STOPMACHINE_WAIT,
>  	STOPMACHINE_PREPARE,
>  	STOPMACHINE_DISABLE_IRQ,
> +	STOPMACHINE_RUN,
>  	STOPMACHINE_EXIT,
>  };
>  
> +struct stop_machine_data {
> +	int (*fn)(void *);
> +	void *data;
> +	struct completion done;
> +	int run_all;
> +} smdata;
> +
>  static enum stopmachine_state stopmachine_state;
>  static unsigned int stopmachine_num_threads;
>  static atomic_t stopmachine_thread_ack;
> @@ -35,6 +43,7 @@ static int stopmachine(void *cpu)
>  {
>  	int irqs_disabled = 0;
>  	int prepared = 0;
> +	int ran = 0;
>  
>  	set_cpus_allowed(current, cpumask_of_cpu((int)(long)cpu));
>  
> @@ -59,6 +68,11 @@ static int stopmachine(void *cpu)
>  			prepared = 1;
>  			smp_mb(); /* Must read state first. */
>  			atomic_inc(&stopmachine_thread_ack);
> +		} else if (stopmachine_state == STOPMACHINE_RUN && !ran) {
> +			smdata.fn(smdata.data);
> +			ran = 1;
> +			smp_mb(); /* Must read state first. */
> +			atomic_inc(&stopmachine_thread_ack);
>  		}
>  		/* Yield in first stage: migration threads need to
>  		 * help our sisters onto their CPUs. */
> @@ -136,12 +150,10 @@ static void restart_machine(void)
>  	preempt_enable_no_resched();
>  }
>  
> -struct stop_machine_data
> +static void run_other_cpus(void)
>  {
> -	int (*fn)(void *);
> -	void *data;
> -	struct completion done;
> -};
> +	stopmachine_set_state(STOPMACHINE_RUN);
> +}
>  
>  static int do_stop(void *_smdata)
>  {
> @@ -151,6 +163,8 @@ static int do_stop(void *_smdata)
>  	ret = stop_machine();
>  	if (ret == 0) {
>  		ret = smdata->fn(smdata->data);
> +		if (smdata->run_all)
> +			run_other_cpus();
>  		restart_machine();
>  	}
>  
> @@ -170,17 +184,16 @@ static int do_stop(void *_smdata)
>  struct task_struct *__stop_machine_run(int (*fn)(void *), void *data,
>  				       unsigned int cpu)
>  {
> -	struct stop_machine_data smdata;
>  	struct task_struct *p;
>  
> +	down(&stopmachine_mutex);
>  	smdata.fn = fn;
>  	smdata.data = data;
> +	smdata.run_all = (cpu == RUN_ALL) ? 1 : 0;
>  	init_completion(&smdata.done);
> -
> -	down(&stopmachine_mutex);
> -
> +	smp_wmb();
>  	/* If they don't care which CPU fn runs on, bind to any online one. */
> -	if (cpu == NR_CPUS)
> +	if (cpu == NR_CPUS || cpu == RUN_ALL)
>  		cpu = raw_smp_processor_id();
>  
>  	p = kthread_create(do_stop, &smdata, "kstopmachine");
> 
> 
> 4)
> 
> 
> diff --git a/kernel/immediate.c b/kernel/immediate.c
> index 4c36a89..39ec13e 100644
> --- a/kernel/immediate.c
> +++ b/kernel/immediate.c
> @@ -20,6 +20,7 @@
>  #include <linux/immediate.h>
>  #include <linux/memory.h>
>  #include <linux/cpu.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/cacheflush.h>
>  
> @@ -30,6 +31,23 @@ static int imv_early_boot_complete;
>  
>  extern const struct __imv __start___imv[];
>  extern const struct __imv __stop___imv[];
> +int started;
> +
> +int stop_machine_imv_update(void *imv_ptr)
> +{
> +	struct __imv *imv = imv_ptr;
> +
> +	if (!started) {
> +		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
> +		started = 1;
> +		smp_wmb();
> +	} else
> +		sync_core();
> +
> +	flush_icache_range(imv->imv, imv->imv + imv->size);
> +
> +	return 0;
> +}
>  
>  /*
>   * imv_mutex nests inside module_mutex. imv_mutex protects builtin
> @@ -37,38 +55,6 @@ extern const struct __imv __stop___imv[];
>   */
>  static DEFINE_MUTEX(imv_mutex);
>  
> -static atomic_t wait_sync;
> -
> -struct ipi_loop_data {
> -	long value;
> -	const struct __imv *imv;
> -} loop_data;
> -
> -static void ipi_busy_loop(void *arg)
> -{
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
> -	atomic_dec(&wait_sync);
> -	do {
> -		/* Make sure the wait_sync gets re-read */
> -		smp_mb();
> -	} while (atomic_read(&wait_sync) > loop_data.value);
> -	atomic_dec(&wait_sync);
> -	do {
> -		/* Make sure the wait_sync gets re-read */
> -		smp_mb();
> -	} while (atomic_read(&wait_sync) > 0);
> -	/*
> -	 * Issuing a synchronizing instruction must be done on each CPU before
> -	 * reenabling interrupts after modifying an instruction. Required by
> -	 * Intel's errata.
> -	 */
> -	sync_core();
> -	flush_icache_range(loop_data.imv->imv,
> -		loop_data.imv->imv + loop_data.imv->size);
> -	local_irq_restore(flags);
> -}
>  
>  /**
>   * apply_imv_update - update one immediate value
> @@ -82,9 +68,6 @@ static void ipi_busy_loop(void *arg)
>   */
>  static int apply_imv_update(const struct __imv *imv)
>  {
> -	unsigned long flags;
> -	long online_cpus;
> -
>  	/*
>  	 * If the variable and the instruction have the same value, there is
>  	 * nothing to do.
> @@ -111,30 +94,8 @@ static int apply_imv_update(const struct __imv *imv)
>  
>  	if (imv_early_boot_complete) {
>  		kernel_text_lock();
> -		get_online_cpus();
> -		online_cpus = num_online_cpus();
> -		atomic_set(&wait_sync, 2 * online_cpus);
> -		loop_data.value = online_cpus;
> -		loop_data.imv = imv;
> -		smp_call_function(ipi_busy_loop, NULL, 1, 0);
> -		local_irq_save(flags);
> -		atomic_dec(&wait_sync);
> -		do {
> -			/* Make sure the wait_sync gets re-read */
> -			smp_mb();
> -		} while (atomic_read(&wait_sync) > online_cpus);
> -		text_poke((void *)imv->imv, (void *)imv->var,
> -				imv->size);
> -		/*
> -		 * Make sure the modified instruction is seen by all CPUs before
> -		 * we continue (visible to other CPUs and local interrupts).
> -		 */
> -		wmb();
> -		atomic_dec(&wait_sync);
> -		flush_icache_range(imv->imv,
> -				imv->imv + imv->size);
> -		local_irq_restore(flags);
> -		put_online_cpus();
> +		started = 0;
> +		stop_machine_run(stop_machine_imv_update, (void *)imv, RUN_ALL);
>  		kernel_text_unlock();
>  	} else
>  		text_poke_early((void *)imv->imv, (void *)imv->var,

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-26 23:12     ` Mathieu Desnoyers
@ 2008-02-26 23:34       ` Mathieu Desnoyers
  2008-02-27 16:44         ` Jason Baron
  2008-02-27 17:01       ` Jason Baron
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-02-26 23:34 UTC (permalink / raw)
  To: Jason Baron; +Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell, Jan Kiszka

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
> > On Sat, Feb 02, 2008 at 04:08:29PM -0500, Mathieu Desnoyers wrote:
> > > Changelog:
> > > 
> > > - section __imv is already SHF_ALLOC
> > > - Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
> > >   the if (immediateindex) is unnecessary here.
> > > - Remove module_mutex usage: depend on functions implemented in module.c for
> > >   that.
> > 
> > hi,
> > 
> > In testing this patch, i've run across a deadlock...apply_imv_update() can get
> > called again before, ipi_busy_loop() has had a chance to finsh, and set 
> > wait_sync back to its initial value. This causes ipi_busy_loop() to get stuck
> > indefinitely and the subsequent apply_imv_update() hangs. I've shown this 
> > deadlock below using nmi_watchdog=1 in item 1). 
> > 
> 
> Hrm, yes, Jan pointed out the exact same problem in my ltt-test-tsc TSC
> test module in LTTng a few days ago. His fix implied to add another
> barrier upon which the smp_call_function() caller must wait for the ipis
> to finish. Since this immediate value code does the same I did in my
> ltt-test-tsc code, the same fix will likely apply.
> 
> I'll cook something.
> 

This should work. Untested for now. Can you give it a try ?

Fix Immediate Deadlock

Jason Baron <jbaron@redhat.com> :

In testing this patch, i've run across a deadlock...apply_imv_update() can get
called again before, ipi_busy_loop() has had a chance to finsh, and set
wait_sync back to its initial value. This causes ipi_busy_loop() to get stuck
indefinitely and the subsequent apply_imv_update() hangs. I've shown this
deadlock below using nmi_watchdog=1 in item 1).

Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> :

Hrm, yes, Jan pointed out the exact same problem in my ltt-test-tsc TSC
test module in LTTng a few days ago. His fix implied to add another
barrier upon which the smp_call_function() caller must wait for the ipis
to finish. Since this immediate value code does the same I did in my
ltt-test-tsc code, the same fix will likely apply.

Thanks to Jason Baron for finding this bug and proposing an initial
implementation.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Jason Baron <jbaron@redhat.com>
---
 kernel/immediate.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- linux-2.6-lttng.orig/kernel/immediate.c	2008-02-26 18:16:10.000000000 -0500
+++ linux-2.6-lttng/kernel/immediate.c	2008-02-26 18:29:32.000000000 -0500
@@ -53,12 +53,12 @@ static void ipi_busy_loop(void *arg)
 	do {
 		/* Make sure the wait_sync gets re-read */
 		smp_mb();
-	} while (atomic_read(&wait_sync) > loop_data.value);
+	} while (atomic_read(&wait_sync) > 2 * loop_data.value);
 	atomic_dec(&wait_sync);
 	do {
 		/* Make sure the wait_sync gets re-read */
 		smp_mb();
-	} while (atomic_read(&wait_sync) > 0);
+	} while (atomic_read(&wait_sync) > loop_data.value);
 	/*
 	 * Issuing a synchronizing instruction must be done on each CPU before
 	 * reenabling interrupts after modifying an instruction. Required by
@@ -67,6 +67,7 @@ static void ipi_busy_loop(void *arg)
 	sync_core();
 	flush_icache_range(loop_data.imv->imv,
 		loop_data.imv->imv + loop_data.imv->size);
+	atomic_dec(&wait_sync);
 	local_irq_restore(flags);
 }
 
@@ -113,7 +114,7 @@ static int apply_imv_update(const struct
 		kernel_text_lock();
 		get_online_cpus();
 		online_cpus = num_online_cpus();
-		atomic_set(&wait_sync, 2 * online_cpus);
+		atomic_set(&wait_sync, 3 * online_cpus);
 		loop_data.value = online_cpus;
 		loop_data.imv = imv;
 		smp_call_function(ipi_busy_loop, NULL, 1, 0);
@@ -122,7 +123,7 @@ static int apply_imv_update(const struct
 		do {
 			/* Make sure the wait_sync gets re-read */
 			smp_mb();
-		} while (atomic_read(&wait_sync) > online_cpus);
+		} while (atomic_read(&wait_sync) > 2 * online_cpus);
 		text_poke((void *)imv->imv, (void *)imv->var,
 				imv->size);
 		/*
@@ -133,6 +134,12 @@ static int apply_imv_update(const struct
 		atomic_dec(&wait_sync);
 		flush_icache_range(imv->imv,
 				imv->imv + imv->size);
+		/*
+		 * Wait until all other CPUs are done so that we don't overwrite
+		 * loop_data or wait_sync prematurely.
+		 */
+		while (unlikely(atomic_read(&wait_sync) > 1))
+			cpu_relax();
 		local_irq_restore(flags);
 		put_online_cpus();
 		kernel_text_unlock();

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-26 23:34       ` Mathieu Desnoyers
@ 2008-02-27 16:44         ` Jason Baron
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Baron @ 2008-02-27 16:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell, Jan Kiszka

On Tue, Feb 26, 2008 at 06:34:45PM -0500, Mathieu Desnoyers wrote:
> > > In testing this patch, i've run across a deadlock...apply_imv_update() can get
> > > called again before, ipi_busy_loop() has had a chance to finsh, and set 
> > > wait_sync back to its initial value. This causes ipi_busy_loop() to get stuck
> > > indefinitely and the subsequent apply_imv_update() hangs. I've shown this 
> > > deadlock below using nmi_watchdog=1 in item 1). 
> > > 
> > 
> > Hrm, yes, Jan pointed out the exact same problem in my ltt-test-tsc TSC
> > test module in LTTng a few days ago. His fix implied to add another
> > barrier upon which the smp_call_function() caller must wait for the ipis
> > to finish. Since this immediate value code does the same I did in my
> > ltt-test-tsc code, the same fix will likely apply.
> > 
> > I'll cook something.
> > 
> 
> This should work. Untested for now. Can you give it a try ?
> 

this patch results in the subsequent 3 way deadlock that I described in the
previous mail. smp_call_function() can not be used with a function that 
attempts to rendezvous cpus in the manner being done here. The patch I posted 
in the previous mail addresses these limitations on smp_call_functions(). trace
of the lockup using this patch is shown below. 

thanks,

-Jason


NMI Watchdog detected LOCKUP on CPU 2
CPU 2
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 11233, comm: make Not tainted 2.6.24-git12markers #2
RIP: 0010:[<ffffffff811248d9>]  [<ffffffff811248d9>] __write_lock_failed+0x9/0x20
RSP: 0018:ffff81006e025e30  EFLAGS: 00000087
RAX: ffff81006ece8b58 RBX: 0000000000000000 RCX: ffff81006e1f0ec8
RDX: 0000000000000011 RSI: fe60000000000000 RDI: ffffffff813d4000
RBP: ffff81006e025e38 R08: ffff81006e1f0e90 R09: 00000000ffffffff
R10: ffff810072c45e98 R11: 0000000000004111 R12: 00000000fffffff4
R13: ffff81006ece8930 R14: ffff81006818b260 R15: 0000000000000000
FS:  00002b14168b66f0(0000) GS:ffff81007f801980(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000106f000 CR3: 00000000680d2000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process make (pid: 11233, threadinfo ffff81006e024000, task ffff81006818b260)
Stack:  ffffffff8127536a ffff81006e025ed8 ffffffff81034260 ffff81006e1f0e80
 0000000000000000 0000000000000000 ffff81006e025f58 00007fff94222fe0
 0000000000004111 ffff81006ece8930 0000000000000000 ffff81006ec6f040
Call Trace:
 [<ffffffff8127536a>] ? _write_lock_irq+0x13/0x15
 [<ffffffff81034260>] copy_process+0xf7c/0x1477
 [<ffffffff81034870>] do_fork+0x75/0x20a
 [<ffffffff8100c0e9>] ? tracesys+0xdc/0xe1
 [<ffffffff8100a3c6>] sys_vfork+0x20/0x22
 [<ffffffff8100c287>] ptregscall_common+0x67/0xb0

Code: e9 07 48 89 11 31 c0 c3 48 83 e9 07 eb 00 48 c7 c0 f2 ff ff ff c3 90 90 90 90 90 90 90 90
---[ end trace 6a2bbda3f47e95fd ]---
NMI Watchdog detected LOCKUP on CPU 3
CPU 3
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 3258, comm: toggle-writer Not tainted 2.6.24-git12markers #2
RIP: 0010:[<ffffffff8101c717>]  [<ffffffff8101c717>] __smp_call_function_mask+0x9f/0xc1
RSP: 0018:ffff81006ec51da8  EFLAGS: 00000297
RAX: 00000000000008fc RBX: 0000000000000007 RCX: 000000007688fa00
RDX: 00000000000008fc RSI: 00000000000000fc RDI: 0000000000000007
RBP: ffff81006ec51e08 R08: ffffffff8840504f R09: 00007fa6423566f0
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff8106e7c3
FS:  00007fa6423566f0(0000) GS:ffff81007f801c80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000031fb603080 CR3: 000000006ec49000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process toggle-writer (pid: 3258, threadinfo ffff81006ec50000, task ffff81007899a000)
Stack:  ffffffff8106e7c3 0000000000000000 0000000300000002 ffffffff00000000
 0000000000000007 0000000000000292 ffff81007899a000 0000000000000000
 0000000000000000 ffffffff8106e7c3 000000000000000f 0000000000000005
Call Trace:
 [<ffffffff8106e7c3>] ? ipi_busy_loop+0x0/0x55
 [<ffffffff8106e7c3>] ? ipi_busy_loop+0x0/0x55
 [<ffffffff8101c783>] smp_call_function_mask+0x4a/0x59
 [<ffffffff8101c7ab>] smp_call_function+0x19/0x1b
 [<ffffffff8106e702>] imv_update_range+0xd6/0x17e
 [<ffffffff8105494e>] _module_imv_update+0x35/0x54
 [<ffffffff81054982>] module_imv_update+0x15/0x23
 [<ffffffff884050c4>] :toggle_tester:proc_toggle_write+0x4c/0x64
 [<ffffffff810d64c8>] proc_reg_write+0x7b/0x96
 [<ffffffff81099f96>] vfs_write+0xae/0x157
 [<ffffffff8109a563>] sys_write+0x47/0x70
 [<ffffffff8100c0e9>] tracesys+0xdc/0xe1

Code: f8 48 3b 5d c0 48 8b 05 28 1a 41 00 75 0a bf fc 00 00 00 ff 50 38 eb 0f be fc 00 00 00 48
---[ end trace 6a2bbda3f47e95fd ]---
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 0, comm: swapper Not tainted 2.6.24-git12markers #2
RIP: 0010:[<ffffffff8106e7dc>]  [<ffffffff8106e7dc>] ipi_busy_loop+0x19/0x55
RSP: 0018:ffff81007fb6ff70  EFLAGS: 00000002
RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff81007fb5f260
RDX: 000000000000000a RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff81007fb6ff78 R08: ffff81007fb6ff78 R09: 0000000000000002
R10: ffff810001018b80 R11: ffff810072dddc18 R12: ffffffff8106e7c3
R13: 0000000000000000 R14: ffffffff81491a80 R15: ffff810001018a80
FS:  0000000000000000(0000) GS:ffff81007f801680(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000031fac9afa0 CR3: 0000000072cfb000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff81007fb6a000, task ffff81007fb5f260)
Stack:  0000000000000000 ffff81007fb6ffa8 ffffffff8101ca4e ffff81007fb6ffa8
 ffffffff8100b0e2 0000000000000001 0000000000000040 ffff81007fb6be60
 ffffffff8100cac6 ffff81007fb6be60 <EOI>  ffff81007fb6bee8 ffff810072dddc18
Call Trace:
 <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
 [<ffffffff8100b0e2>] ? mwait_idle+0x0/0x4a
 [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
 <EOI>  [<ffffffff8100b127>] ? mwait_idle+0x45/0x4a
 [<ffffffff8100a723>] ? enter_idle+0x22/0x24
 [<ffffffff8100b06f>] ? cpu_idle+0x97/0xc1
 [<ffffffff81270a55>] ? start_secondary+0x3ba/0x3c6

Code: 81 48 c7 c7 02 fd 36 81 48 89 e5 e8 6b fe ff ff c9 c3 55 48 89 e5 53 9c 5e fa f0 ff 0d 72
---[ end trace 6a2bbda3f47e95fd ]---
NMI Watchdog detected LOCKUP on CPU 0
Kernel panic - not syncing: Aiee, killing interrupt handler!
CPU 0
Modules linked in: toggle_tester ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf]Pid: 5, comm: watchdog/0 Not tainted 2.6.24-git12markers #2
RIP: 0010:[<ffffffff8106e7e9>]  [<ffffffff8106e7e9>] ipi_busy_loop+0x26/0x55
RSP: 0018:ffffffff81498f70  EFLAGS: 00000002
RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffffffff81394620
RDX: 000000000000000a RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffffffff81498f78 R08: ffffffff81498f78 R09: ffff81000100cbe0
R10: ffff81007fb59e20 R11: 0000000000000001 R12: ffffffff8106e7c3
R13: 0000000000000000 R14: 00000000000000f0 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff813d2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006bb374 CR3: 000000006ec88000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process watchdog/0 (pid: 5, threadinfo ffff81007fb58000, task ffff81007fb54930)
Stack:  0000000000000000 ffffffff81498fa8 ffffffff8101ca4e ffffffff884059c8
 ffff81007e4d3260 ffff81007e4d3260 00000000000003b2 ffff81007fb59e60
 ffffffff8100cac6 ffff81007fb59e60 <EOI>  ffff81007fb59f20 0000000000000001
Call Trace:
 <IRQ>  [<ffffffff8101ca4e>] smp_call_function_interrupt+0x48/0x71
 [<ffffffff8100cac6>] call_function_interrupt+0x66/0x70
 <EOI>  [<ffffffff81069685>] ? watchdog+0xd5/0x1c8
 [<ffffffff810695b0>] ? watchdog+0x0/0x1c8
 [<ffffffff81047fd3>] ? kthread+0x49/0x76
 [<ffffffff8100cd88>] ? child_rip+0xa/0x12
 [<ffffffff81047f8a>] ? kthread+0x0/0x76
 [<ffffffff8100cd7e>] ? child_rip+0x0/0x12


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-26 23:12     ` Mathieu Desnoyers
  2008-02-26 23:34       ` Mathieu Desnoyers
@ 2008-02-27 17:01       ` Jason Baron
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Baron @ 2008-02-27 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell, Jan Kiszka

On Tue, Feb 26, 2008 at 06:12:48PM -0500, Mathieu Desnoyers wrote:
> Hrm, does your stop machine implementation disable interrupts while the
> CPUs busy loop ?
> 

yes. this is how stop_machine is implemented...and I believe is consistent with
the way in which your algorithm disables irqs. The logic to me, is we don't
want any cpus to see the kernel text in an inconsistent state via an irq.

-Jason


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-26 22:52   ` Jason Baron
  2008-02-26 23:12     ` Mathieu Desnoyers
@ 2008-02-27 19:05     ` Mathieu Desnoyers
  2008-02-28 16:50       ` Jason Baron
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2008-02-27 19:05 UTC (permalink / raw)
  To: Jason Baron; +Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell

* Jason Baron (jbaron@redhat.com) wrote:
> On Sat, Feb 02, 2008 at 04:08:29PM -0500, Mathieu Desnoyers wrote:
> > Changelog:
> > 
> > - section __imv is already SHF_ALLOC
> > - Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
> >   the if (immediateindex) is unnecessary here.
> > - Remove module_mutex usage: depend on functions implemented in module.c for
> >   that.
> 
> hi,
> 
> In testing this patch, i've run across a deadlock...apply_imv_update() can get
> called again before, ipi_busy_loop() has had a chance to finsh, and set 
> wait_sync back to its initial value. This causes ipi_busy_loop() to get stuck
> indefinitely and the subsequent apply_imv_update() hangs. I've shown this 
> deadlock below using nmi_watchdog=1 in item 1). 
> 
> After hitting this deadlock i modified apply_imv_update() to wait for 
> ipi_busy_loop(), to finish, however this resulted in a 3 way deadlock. Process
> A held a read lock on tasklist_lock, then process B called apply_imv_update().
> Process A received the IPI and begins executing ipi_busy_loop(). Then process
> C takes a write lock irq on the task list lock, before receiving the IPI. Thus,
> process A holds up process C, and C can't get an IPI b/c interrupts are
> disabled. i believe this is an inherent problem in using smp_call_function, in
> that one can't synchronize the processes on each other...you can reproduce
> these issues using the test module below item 2)
> 
> In order to address these issues, i've modified stop_machine_run() to take an
> new third parameter, RUN_ALL, to allow stop_machine_run() to run a function
> on all cpus, item 3). I then modified kernel/immediate.c to use this new 
> infrastructure item 4). the code in immediate.c simplifies quite a bit. This
> new code has been robust through all testing thus far.
> 

Ok, I see why you did it that way. Comments follow inline.

> thanks,
> 
> -Jason
> 
> 1)
[...]
 
> 2)
> 
[...]
> 
> 3)
> 
> 
> diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
> index 5bfc553..1ab1c5b 100644
> --- a/include/linux/stop_machine.h
> +++ b/include/linux/stop_machine.h
> @@ -8,6 +8,9 @@
>  #include <asm/system.h>
>  
>  #if defined(CONFIG_STOP_MACHINE) && defined(CONFIG_SMP)
> +
> +#define RUN_ALL ~0U
> +
>  /**
>   * stop_machine_run: freeze the machine on all CPUs and run this function
>   * @fn: the function to run
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 51b5ee5..e6ee46f 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -23,9 +23,17 @@ enum stopmachine_state {
>  	STOPMACHINE_WAIT,
>  	STOPMACHINE_PREPARE,
>  	STOPMACHINE_DISABLE_IRQ,
> +	STOPMACHINE_RUN,
>  	STOPMACHINE_EXIT,
>  };
>  
> +struct stop_machine_data {
> +	int (*fn)(void *);
> +	void *data;
> +	struct completion done;
> +	int run_all;
> +} smdata;
> +

Why do we now have to declare this static ? Can we pass it as a pointer
to stopmachine instead ?

>  static enum stopmachine_state stopmachine_state;
>  static unsigned int stopmachine_num_threads;
>  static atomic_t stopmachine_thread_ack;
> @@ -35,6 +43,7 @@ static int stopmachine(void *cpu)
>  {
>  	int irqs_disabled = 0;
>  	int prepared = 0;
> +	int ran = 0;
>  
>  	set_cpus_allowed(current, cpumask_of_cpu((int)(long)cpu));
>  
> @@ -59,6 +68,11 @@ static int stopmachine(void *cpu)
>  			prepared = 1;
>  			smp_mb(); /* Must read state first. */
>  			atomic_inc(&stopmachine_thread_ack);
> +		} else if (stopmachine_state == STOPMACHINE_RUN && !ran) {
> +			smdata.fn(smdata.data);
> +			ran = 1;
> +			smp_mb(); /* Must read state first. */
> +			atomic_inc(&stopmachine_thread_ack);
>  		}
>  		/* Yield in first stage: migration threads need to
>  		 * help our sisters onto their CPUs. */
> @@ -136,12 +150,10 @@ static void restart_machine(void)
>  	preempt_enable_no_resched();
>  }
>  
> -struct stop_machine_data
> +static void run_other_cpus(void)
>  {
> -	int (*fn)(void *);
> -	void *data;
> -	struct completion done;
> -};
> +	stopmachine_set_state(STOPMACHINE_RUN);

Hrm, the semantic of STOPMACHINE_RUN is a bit weird :
- The CPU where the do_stop thread runs will first execute (alone) the
  callback.
- Then, all the other CPUs will execute the callback concurrently.

Given that you use a "started" boolean in the callback, which is ok as
long as there is no concurrent modification (correct given the current
semantic, but fragile), I would tend to document that the first time the
callback is called, it is ran alone, without concurrency, and then all
the other callbacks are ran concurrently.


> +}
>  
>  static int do_stop(void *_smdata)
>  {
> @@ -151,6 +163,8 @@ static int do_stop(void *_smdata)
>  	ret = stop_machine();
>  	if (ret == 0) {
>  		ret = smdata->fn(smdata->data);
> +		if (smdata->run_all)
> +			run_other_cpus();
>  		restart_machine();
>  	}
>  
> @@ -170,17 +184,16 @@ static int do_stop(void *_smdata)
>  struct task_struct *__stop_machine_run(int (*fn)(void *), void *data,
>  				       unsigned int cpu)
>  {
> -	struct stop_machine_data smdata;
>  	struct task_struct *p;
>  
> +	down(&stopmachine_mutex);
>  	smdata.fn = fn;
>  	smdata.data = data;
> +	smdata.run_all = (cpu == RUN_ALL) ? 1 : 0;
>  	init_completion(&smdata.done);
> -
> -	down(&stopmachine_mutex);
> -
> +	smp_wmb();
>  	/* If they don't care which CPU fn runs on, bind to any online one. */
> -	if (cpu == NR_CPUS)
> +	if (cpu == NR_CPUS || cpu == RUN_ALL)
>  		cpu = raw_smp_processor_id();
>  
>  	p = kthread_create(do_stop, &smdata, "kstopmachine");
> 
> 
> 4)
> 
> 
> diff --git a/kernel/immediate.c b/kernel/immediate.c
> index 4c36a89..39ec13e 100644
> --- a/kernel/immediate.c
> +++ b/kernel/immediate.c
> @@ -20,6 +20,7 @@
>  #include <linux/immediate.h>
>  #include <linux/memory.h>
>  #include <linux/cpu.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/cacheflush.h>
>  
> @@ -30,6 +31,23 @@ static int imv_early_boot_complete;
>  
>  extern const struct __imv __start___imv[];
>  extern const struct __imv __stop___imv[];
> +int started;
> +
> +int stop_machine_imv_update(void *imv_ptr)
> +{
> +	struct __imv *imv = imv_ptr;
> +
> +	if (!started) {
> +		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
> +		started = 1;
> +		smp_wmb();

missing mb() comment here.

> +	} else
> +		sync_core();
> +

Note : we really want the sync_core()s to be executed after the
text_poke. This is ok given the implicit RUN_ALL semantic, but I guess
it should be documented in stop_machine that the first callback is
executed alone before all the others.

Thanks!

Mathieu

> +	flush_icache_range(imv->imv, imv->imv + imv->size);
> +
> +	return 0;
> +}
>  
>  /*
>   * imv_mutex nests inside module_mutex. imv_mutex protects builtin
> @@ -37,38 +55,6 @@ extern const struct __imv __stop___imv[];
>   */
>  static DEFINE_MUTEX(imv_mutex);
>  
> -static atomic_t wait_sync;
> -
> -struct ipi_loop_data {
> -	long value;
> -	const struct __imv *imv;
> -} loop_data;
> -
> -static void ipi_busy_loop(void *arg)
> -{
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
> -	atomic_dec(&wait_sync);
> -	do {
> -		/* Make sure the wait_sync gets re-read */
> -		smp_mb();
> -	} while (atomic_read(&wait_sync) > loop_data.value);
> -	atomic_dec(&wait_sync);
> -	do {
> -		/* Make sure the wait_sync gets re-read */
> -		smp_mb();
> -	} while (atomic_read(&wait_sync) > 0);
> -	/*
> -	 * Issuing a synchronizing instruction must be done on each CPU before
> -	 * reenabling interrupts after modifying an instruction. Required by
> -	 * Intel's errata.
> -	 */
> -	sync_core();
> -	flush_icache_range(loop_data.imv->imv,
> -		loop_data.imv->imv + loop_data.imv->size);
> -	local_irq_restore(flags);
> -}
>  
>  /**
>   * apply_imv_update - update one immediate value
> @@ -82,9 +68,6 @@ static void ipi_busy_loop(void *arg)
>   */
>  static int apply_imv_update(const struct __imv *imv)
>  {
> -	unsigned long flags;
> -	long online_cpus;
> -
>  	/*
>  	 * If the variable and the instruction have the same value, there is
>  	 * nothing to do.
> @@ -111,30 +94,8 @@ static int apply_imv_update(const struct __imv *imv)
>  
>  	if (imv_early_boot_complete) {
>  		kernel_text_lock();
> -		get_online_cpus();
> -		online_cpus = num_online_cpus();
> -		atomic_set(&wait_sync, 2 * online_cpus);
> -		loop_data.value = online_cpus;
> -		loop_data.imv = imv;
> -		smp_call_function(ipi_busy_loop, NULL, 1, 0);
> -		local_irq_save(flags);
> -		atomic_dec(&wait_sync);
> -		do {
> -			/* Make sure the wait_sync gets re-read */
> -			smp_mb();
> -		} while (atomic_read(&wait_sync) > online_cpus);
> -		text_poke((void *)imv->imv, (void *)imv->var,
> -				imv->size);
> -		/*
> -		 * Make sure the modified instruction is seen by all CPUs before
> -		 * we continue (visible to other CPUs and local interrupts).
> -		 */
> -		wmb();
> -		atomic_dec(&wait_sync);
> -		flush_icache_range(imv->imv,
> -				imv->imv + imv->size);
> -		local_irq_restore(flags);
> -		put_online_cpus();
> +		started = 0;
> +		stop_machine_run(stop_machine_imv_update, (void *)imv, RUN_ALL);
>  		kernel_text_unlock();
>  	} else
>  		text_poke_early((void *)imv->imv, (void *)imv->var,

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/7] Immediate Values - Architecture Independent Code
  2008-02-27 19:05     ` Mathieu Desnoyers
@ 2008-02-28 16:50       ` Jason Baron
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Baron @ 2008-02-28 16:50 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, Ingo Molnar, linux-kernel, Rusty Russell

On Wed, Feb 27, 2008 at 02:05:19PM -0500, Mathieu Desnoyers wrote:
> > +struct stop_machine_data {
> > +	int (*fn)(void *);
> > +	void *data;
> > +	struct completion done;
> > +	int run_all;
> > +} smdata;
> > +
> 
> Why do we now have to declare this static ? Can we pass it as a pointer
> to stopmachine instead ?
> 

Since the other cpus need to access 'fn', i made smdata a global.
stop_machine_run() is system-wide, so we only need one of these...

thanks,

-Jason


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-02-28 16:51 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-17 18:42 [patch 0/7] Immediate Values Mathieu Desnoyers
2007-09-17 18:42 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2007-09-18 17:47   ` Denys Vlasenko
2007-09-18 17:59     ` Mathieu Desnoyers
2007-09-18 20:01       ` Denys Vlasenko
2007-09-18 20:47         ` Mathieu Desnoyers
2007-09-19  8:45           ` Denys Vlasenko
2007-09-19  8:57           ` Denys Vlasenko
2007-09-19 11:27             ` Mathieu Desnoyers
2007-09-17 18:42 ` [patch 2/7] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
2007-09-17 20:55   ` Randy Dunlap
2007-09-17 18:42 ` [patch 3/7] Immediate Values - Move Kprobes i386 restore_interrupt to kdebug.h Mathieu Desnoyers
2007-09-17 18:42 ` [patch 4/7] Immediate Values - i386 Optimization Mathieu Desnoyers
2007-09-18  6:04   ` Borislav Petkov
2007-09-17 18:42 ` [patch 5/7] Immediate Values - Powerpc Optimization Mathieu Desnoyers
2007-09-17 18:42 ` [patch 6/7] Immediate Values - Documentation Mathieu Desnoyers
2007-09-17 20:55   ` Randy Dunlap
2007-09-18 13:13     ` Mathieu Desnoyers
2007-09-17 18:42 ` [patch 7/7] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
  -- strict thread matches above, loose matches on Subject: below --
2007-09-18 21:07 [patch 0/7] Immediate Values for 2.6.23-rc6-mm1 Mathieu Desnoyers
2007-09-18 21:07 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2007-12-06  2:07 [patch 0/7] Immediate Values (redux) for 2.6.24-rc4-git3 Mathieu Desnoyers
2007-12-06  2:07 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2008-02-02 21:08 [patch 0/7] Immediate Values Mathieu Desnoyers
2008-02-02 21:08 ` [patch 1/7] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2008-02-26 22:52   ` Jason Baron
2008-02-26 23:12     ` Mathieu Desnoyers
2008-02-26 23:34       ` Mathieu Desnoyers
2008-02-27 16:44         ` Jason Baron
2008-02-27 17:01       ` Jason Baron
2008-02-27 19:05     ` Mathieu Desnoyers
2008-02-28 16:50       ` Jason Baron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox