* [PATCH] [1/48] x86_64: fix x86_64-mm-sched-clock-share
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
@ 2007-04-29 10:52 ` Andi Kleen
2007-04-29 10:52 ` [PATCH] [2/48] i386: Rewrite sched_clock Andi Kleen
` (45 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: Andrew Morton, Andi Kleen, Dave Jones, patches, linux-kernel
From: Andrew Morton <akpm@linux-foundation.org>
Fix for the following patch. Provide dummy cpufreq functions when
CPUFREQ is not compiled in.
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
include/linux/cpufreq.h | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
Index: linux/include/linux/cpufreq.h
===================================================================
--- linux.orig/include/linux/cpufreq.h
+++ linux/include/linux/cpufreq.h
@@ -32,7 +32,15 @@
* CPUFREQ NOTIFIER INTERFACE *
*********************************************************************/
+#ifdef CONFIG_CPU_FREQ
int cpufreq_register_notifier(struct notifier_block *nb, unsigned int list);
+#else
+static inline int cpufreq_register_notifier(struct notifier_block *nb,
+ unsigned int list)
+{
+ return 0;
+}
+#endif
int cpufreq_unregister_notifier(struct notifier_block *nb, unsigned int list);
#define CPUFREQ_TRANSITION_NOTIFIER (0)
@@ -261,17 +269,22 @@ int cpufreq_set_policy(struct cpufreq_po
int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
int cpufreq_update_policy(unsigned int cpu);
-/* query the current CPU frequency (in kHz). If zero, cpufreq couldn't detect it */
-unsigned int cpufreq_get(unsigned int cpu);
-/* query the last known CPU freq (in kHz). If zero, cpufreq couldn't detect it */
+/*
+ * query the last known CPU freq (in kHz). If zero, cpufreq couldn't detect it
+ */
#ifdef CONFIG_CPU_FREQ
unsigned int cpufreq_quick_get(unsigned int cpu);
+unsigned int cpufreq_get(unsigned int cpu);
#else
static inline unsigned int cpufreq_quick_get(unsigned int cpu)
{
return 0;
}
+static inline unsigned int cpufreq_get(unsigned int cpu)
+{
+ return 0;
+}
#endif
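With the stubs above, generic code can register a cpufreq transition
notifier unconditionally and still build with CONFIG_CPU_FREQ=n, where
the call collapses to an inline returning 0. A minimal sketch of such a
caller (hypothetical names, not part of this patch):

#include <linux/cpufreq.h>
#include <linux/init.h>
#include <linux/notifier.h>

static int example_freq_event(struct notifier_block *nb, unsigned long event,
			      void *data)
{
	/* React to CPU frequency transitions here. */
	return NOTIFY_DONE;
}

static struct notifier_block example_freq_notifier = {
	.notifier_call = example_freq_event,
};

static int __init example_init(void)
{
	/* Safe with or without CONFIG_CPU_FREQ after this patch. */
	return cpufreq_register_notifier(&example_freq_notifier,
					 CPUFREQ_TRANSITION_NOTIFIER);
}
core_initcall(example_init);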
* [PATCH] [2/48] i386: Rewrite sched_clock
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: patches, linux-kernel
Move it into its own file for easy sharing.
Do everything per CPU. This avoids problems with TSCs that
tick at different frequencies on different CPUs.
Resync properly on cpufreq changes. The CPU frequency is unstable
around a frequency change, so fall back to a backing clock during
this period.
Hopefully the TSC will now work on all systems, except when there isn't
a physical TSC.
And from Jeremy Fitzhardinge <jeremy@goop.org>:
Three cleanups there:
- change "instable" -> "unstable"
- it's better to use get_cpu_var for getting this cpu's variables
- change cycles_2_ns to do the full computation rather than just the
  tsc->ns scaling. It's a simpler interface, and it makes the function
  more generally useful.
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/Makefile | 3
arch/i386/kernel/sched-clock.c | 213 +++++++++++++++++++++++++++++++++++++++++
arch/i386/kernel/tsc.c | 62 -----------
3 files changed, 215 insertions(+), 63 deletions(-)
Index: linux/arch/i386/kernel/sched-clock.c
===================================================================
--- /dev/null
+++ linux/arch/i386/kernel/sched-clock.c
@@ -0,0 +1,213 @@
+/* A fast clock for the scheduler. */
+#include <linux/init.h>
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/init.h>
+#include <asm/tsc.h>
+#include <asm/cpufeature.h>
+#include <asm/timer.h>
+
+/*
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ * basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by george@mvista.com) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ * into a shift.
+ *
+ * We can use khz divisor instead of mhz to keep a better precision, since
+ * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ * (mathieu.desnoyers@polymtl.ca)
+ *
+ * -johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+
+struct sc_data {
+ unsigned cyc2ns_scale;
+ unsigned unstable;
+ unsigned long long sync_base; /* TSC or jiffies at syncpoint*/
+ unsigned long long ns_base; /* nanoseconds at sync point */
+ unsigned long long last_val; /* Last returned value */
+};
+
+static DEFINE_PER_CPU(struct sc_data, sc_data) =
+ { .unstable = 1, .sync_base = INITIAL_JIFFIES };
+
+static inline u64 cycles_2_ns(struct sc_data *sc, u64 cyc)
+{
+ u64 ns;
+
+ cyc -= sc->sync_base;
+ ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+ ns += sc->ns_base;
+
+ return ns;
+}
+
+/*
+ * Scheduler clock - returns current time in nanosec units.
+ * All data is local to the CPU.
+ * The values are approximately[1] monotonic local to a CPU, but not
+ * between CPUs. There might be also an occasionally random error,
+ * but not too bad. Between CPUs the values can be non monotonic.
+ *
+ * [1] no attempt to stop CPU instruction reordering, which can hit
+ * in a 100 instruction window or so.
+ *
+ * The clock can be in two states: stable and unstable.
+ * When it is stable we use the TSC per CPU.
+ * When it is unstable we use jiffies as fallback.
+ * stable->unstable->stable transitions can happen regularly
+ * during CPU frequency changes.
+ * There is special code to avoid having the clock jump backwards
+ * when we switch from TSC to jiffies, which needs to keep some state
+ * per CPU. This state is protected against parallel state changes
+ * with interrupts off.
+ */
+unsigned long long sched_clock(void)
+{
+ unsigned long long r;
+ struct sc_data *sc = &get_cpu_var(sc_data);
+
+ if (sc->unstable) {
+ unsigned long flags;
+ r = (jiffies_64 - sc->sync_base) * (1000000000 / HZ);
+ r += sc->ns_base;
+ local_irq_save(flags);
+ /* last_val is used to avoid non-monotonicity on a
+ stable->unstable transition. Make sure the time
+ never goes before the last value returned by
+ the TSC clock */
+ if (r <= sc->last_val)
+ r = sc->last_val + 1;
+ sc->last_val = r;
+ local_irq_restore(flags);
+ } else {
+ get_scheduled_cycles(r);
+ r = cycles_2_ns(sc, r);
+ sc->last_val = r;
+ }
+
+ put_cpu_var(sc_data);
+
+ return r;
+}
+
+/* Resync with new CPU frequency */
+static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
+{
+ sc->sync_base = jiffies;
+ if (!cpu_has_tsc) {
+ sc->unstable = 1;
+ return;
+ }
+ /* Handle nesting, but when we're zero multiple calls in a row
+ are ok too and not a bug */
+ if (sc->unstable > 0)
+ sc->unstable--;
+ /* RED-PEN protect with seqlock? I hope that's not needed
+ because sched_clock callers should be able to tolerate small
+ errors. */
+ sc->ns_base = ktime_to_ns(ktime_get());
+ get_scheduled_cycles(sc->sync_base);
+ sc->cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / newfreq;
+}
+
+static void call_r_s_f(void *arg)
+{
+ struct cpufreq_freqs *freq = arg;
+ unsigned f = freq->new;
+ if (!f)
+ f = cpufreq_get(freq->cpu);
+ if (!f)
+ f = tsc_khz;
+ resync_sc_freq(&per_cpu(sc_data, freq->cpu), f);
+}
+
+static void call_r_s_f_here(void *arg)
+{
+ struct cpufreq_freqs f = { .cpu = get_cpu(), .new = 0 };
+ call_r_s_f(&f);
+ put_cpu();
+}
+
+static int sc_freq_event(struct notifier_block *nb, unsigned long event,
+ void *data)
+{
+ struct cpufreq_freqs *freq = data;
+ int cpu = get_cpu();
+ struct sc_data *sc = &per_cpu(sc_data, cpu);
+
+ if (cpu_has(&cpu_data[cpu], X86_FEATURE_CONSTANT_TSC))
+ goto out;
+ if (freq->old == freq->new)
+ goto out;
+
+ switch (event) {
+ case CPUFREQ_SUSPENDCHANGE:
+ /* Mark TSC unstable during suspend/resume */
+ case CPUFREQ_PRECHANGE:
+ /* Mark TSC as unstable until cpu frequency change is done
+ because we don't know when exactly it will change.
+ unstable is used as a counter to guard against races
+ between the cpu frequency notifiers and normal resyncs */
+ sc->unstable++;
+ break;
+ case CPUFREQ_RESUMECHANGE:
+ case CPUFREQ_POSTCHANGE:
+ /* Frequency change or resume is done -- update everything and
+ mark TSC as stable again. */
+ if (cpu == freq->cpu)
+ resync_sc_freq(sc, freq->new);
+ else
+ smp_call_function_single(freq->cpu, call_r_s_f,
+ freq, 0, 1);
+ break;
+ }
+out:
+ put_cpu();
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block sc_freq_notifier = {
+ .notifier_call = sc_freq_event
+};
+
+static int __cpuinit
+sc_cpu_event(struct notifier_block *self, unsigned long event, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ if (event == CPU_ONLINE) {
+ struct cpufreq_freqs f = { .cpu = cpu, .new = 0 };
+ smp_call_function_single(cpu, call_r_s_f, &f, 0, 1);
+ }
+ return NOTIFY_DONE;
+}
+
+static __init int init_sched_clock(void)
+{
+ /* On a race between the various events the initialization might be
+ done multiple times, but that is handled */
+ cpufreq_register_notifier(&sc_freq_notifier,
+ CPUFREQ_TRANSITION_NOTIFIER);
+ hotcpu_notifier(sc_cpu_event, 0);
+ on_each_cpu(call_r_s_f_here, NULL, 0, 0);
+ return 0;
+}
+core_initcall(init_sched_clock);
+
Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -62,62 +62,6 @@ static inline int check_tsc_unstable(voi
return tsc_unstable;
}
-/* Accellerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- * basic equation:
- * ns = cycles / (freq / ns_per_sec)
- * ns = cycles * (ns_per_sec / freq)
- * ns = cycles * (10^9 / (cpu_khz * 10^3))
- * ns = cycles * (10^6 / cpu_khz)
- *
- * Then we use scaling math (suggested by george@mvista.com) to get:
- * ns = cycles * (10^6 * SC / cpu_khz) / SC
- * ns = cycles * cyc2ns_scale / SC
- *
- * And since SC is a constant power of two, we can convert the div
- * into a shift.
- *
- * We can use khz divisor instead of mhz to keep a better percision, since
- * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- * (mathieu.desnoyers@polymtl.ca)
- *
- * -johnstul@us.ibm.com "math is hard, lets go shopping!"
- */
-static unsigned long cyc2ns_scale __read_mostly;
-
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
-{
- cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
-}
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
- return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-}
-
-/*
- * Scheduler clock - returns current time in nanosec units.
- */
-unsigned long long sched_clock(void)
-{
- unsigned long long this_offset;
-
- /*
- * Fall back to jiffies if there's no TSC available:
- */
- if (unlikely(!tsc_enabled))
- /* No locking but a rare wrong value is not a big deal: */
- return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
-
- /* read the Time Stamp Counter: */
- get_scheduled_cycles(this_offset);
-
- /* return the value in ns */
- return cycles_2_ns(this_offset);
-}
-
unsigned long native_calculate_cpu_khz(void)
{
unsigned long long start, end;
@@ -228,11 +172,6 @@ time_cpufreq_notifier(struct notifier_bl
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
- set_cyc2ns_scale(cpu_khz);
- /*
- * TSC based sched_clock turns
- * to junk w/ cpufreq
- */
mark_tsc_unstable();
}
}
@@ -371,7 +310,6 @@ void __init tsc_init(void)
(unsigned long)cpu_khz / 1000,
(unsigned long)cpu_khz % 1000);
- set_cyc2ns_scale(cpu_khz);
use_tsc_delay();
/* Check and install the TSC clocksource */
Index: linux/arch/i386/kernel/Makefile
===================================================================
--- linux.orig/arch/i386/kernel/Makefile
+++ linux/arch/i386/kernel/Makefile
@@ -7,7 +7,8 @@ extra-y := head.o init_task.o vmlinux.ld
obj-y := process.o signal.o entry.o traps.o irq.o \
ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
pci-dma.o i386_ksyms.o i387.o bootflag.o e820.o\
- quirks.o i8237.o topology.o alternative.o i8253.o tsc.o
+ quirks.o i8237.o topology.o alternative.o i8253.o tsc.o \
+ sched-clock.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += cpu/
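The fixed-point scaling in cycles_2_ns() above can be checked in
isolation. A stand-alone user-space sketch, assuming a hypothetical
2 GHz TSC (cpu_khz = 2000000); not part of the patch:

#include <stdio.h>
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10	/* 2^10, as in sched-clock.c */

int main(void)
{
	uint64_t cpu_khz = 2000000;	/* assumed 2 GHz clock */
	/* cyc2ns_scale = 10^6 * 2^10 / cpu_khz, fits in 32 bits */
	uint64_t scale = (1000000ULL << CYC2NS_SCALE_FACTOR) / cpu_khz;
	uint64_t cyc = 2000000000ULL;	/* one second worth of cycles */
	uint64_t ns = (cyc * scale) >> CYC2NS_SCALE_FACTOR;

	/* scale = 512, so ns = cyc * 512 / 1024 = cyc / 2 = 10^9 */
	printf("scale=%llu: %llu cycles -> %llu ns\n",
	       (unsigned long long)scale,
	       (unsigned long long)cyc,
	       (unsigned long long)ns);
	return 0;
}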
* [PATCH] [3/48] x86_64: Use new shared sched_clock in x86-64 too
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: patches, linux-kernel
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/Makefile | 3 ++-
arch/x86_64/kernel/time.c | 1 -
arch/x86_64/kernel/tsc.c | 28 ----------------------------
include/asm-x86_64/timer.h | 1 +
include/asm-x86_64/timex.h | 1 -
5 files changed, 3 insertions(+), 31 deletions(-)
Index: linux/arch/x86_64/kernel/Makefile
===================================================================
--- linux.orig/arch/x86_64/kernel/Makefile
+++ linux/arch/x86_64/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y := process.o signal.o entry.o trap
ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \
x8664_ksyms.o i387.o syscall.o vsyscall.o \
setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
- pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o
+ pci-dma.o pci-nommu.o alternative.o hpet.o tsc.o sched-clock.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-$(CONFIG_X86_MCE) += mce.o therm_throt.o
@@ -57,3 +57,4 @@ i8237-y += ../../i386/kernel/i8237.o
msr-$(subst m,y,$(CONFIG_X86_MSR)) += ../../i386/kernel/msr.o
alternative-y += ../../i386/kernel/alternative.o
pcspeaker-y += ../../i386/kernel/pcspeaker.o
+sched-clock-y += ../../i386/kernel/sched-clock.o
Index: linux/arch/x86_64/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86_64/kernel/tsc.c
+++ linux/arch/x86_64/kernel/tsc.c
@@ -16,32 +16,6 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int tsc_khz;
EXPORT_SYMBOL(tsc_khz);
-static unsigned int cyc2ns_scale __read_mostly;
-
-void set_cyc2ns_scale(unsigned long khz)
-{
- cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
-
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
- return (cyc * cyc2ns_scale) >> NS_SCALE;
-}
-
-unsigned long long sched_clock(void)
-{
- unsigned long a = 0;
-
- /* Could do CPU core sync here. Opteron can execute rdtsc speculatively,
- * which means it is not completely exact and may not be monotonous
- * between CPUs. But the errors should be too small to matter for
- * scheduling purposes.
- */
-
- rdtscll(a);
- return cycles_2_ns(a);
-}
-
static int tsc_unstable;
static inline int check_tsc_unstable(void)
@@ -114,8 +88,6 @@ static int time_cpufreq_notifier(struct
mark_tsc_unstable();
}
- set_cyc2ns_scale(tsc_khz_ref);
-
return 0;
}
Index: linux/include/asm-x86_64/timer.h
===================================================================
--- /dev/null
+++ linux/include/asm-x86_64/timer.h
@@ -0,0 +1 @@
+#define get_scheduled_cycles(x) rdtscll(x)
Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -404,7 +404,6 @@ void __init time_init(void)
else
vgetcpu_mode = VGETCPU_LSL;
- set_cyc2ns_scale(tsc_khz);
printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n",
cpu_khz / 1000, cpu_khz % 1000);
init_tsc_clocksource();
Index: linux/include/asm-x86_64/timex.h
===================================================================
--- linux.orig/include/asm-x86_64/timex.h
+++ linux/include/asm-x86_64/timex.h
@@ -28,5 +28,4 @@ extern int read_current_timer(unsigned l
#define US_SCALE 32 /* 2^32, arbitralrily chosen */
extern void mark_tsc_unstable(void);
-extern void set_cyc2ns_scale(unsigned long khz);
#endif
* [PATCH] [4/48] x86_64: Don't disable basic block reordering
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: patches, linux-kernel
When compiling with -Os (which is the default) the compiler defaults to
it anyway. And with -O2 it probably generates somewhat better (although
also larger) code.
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/Makefile | 3 ---
1 file changed, 3 deletions(-)
Index: linux/arch/x86_64/Makefile
===================================================================
--- linux.orig/arch/x86_64/Makefile
+++ linux/arch/x86_64/Makefile
@@ -41,9 +41,6 @@ cflags-y += -mno-red-zone
cflags-y += -mcmodel=kernel
cflags-y += -pipe
cflags-kernel-$(CONFIG_REORDER) += -ffunction-sections
-# this makes reading assembly source easier, but produces worse code
-# actually it makes the kernel smaller too.
-cflags-y += -fno-reorder-blocks
cflags-y += -Wno-sign-compare
cflags-y += -fno-asynchronous-unwind-tables
ifneq ($(CONFIG_DEBUG_INFO),y)
* [PATCH] [5/48] x86_64: Allow sys_uselib unconditionally
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: patches, linux-kernel
Previously it wasn't enabled when binfmt_aout was built as a module.
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/ia32/ia32entry.S | 4 ----
1 file changed, 4 deletions(-)
Index: linux/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux.orig/arch/x86_64/ia32/ia32entry.S
+++ linux/arch/x86_64/ia32/ia32entry.S
@@ -481,11 +481,7 @@ ia32_sys_call_table:
.quad sys_symlink
.quad sys_lstat
.quad sys_readlink /* 85 */
-#ifdef CONFIG_IA32_AOUT
.quad sys_uselib
-#else
- .quad quiet_ni_syscall
-#endif
.quad sys_swapon
.quad sys_reboot
.quad compat_sys_old_readdir
* [PATCH] [6/48] x86_64: Minor white space cleanup in traps.c
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: patches, linux-kernel
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/traps.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -426,8 +426,7 @@ void show_registers(struct pt_regs *regs
const int cpu = smp_processor_id();
struct task_struct *cur = cpu_pda(cpu)->pcurrent;
- rsp = regs->rsp;
-
+ rsp = regs->rsp;
printk("CPU %d ", cpu);
__show_regs(regs);
printk("Process %s (pid: %d, threadinfo %p, task %p)\n",
@@ -438,7 +437,6 @@ void show_registers(struct pt_regs *regs
* time of the fault..
*/
if (in_kernel) {
-
printk("Stack: ");
_show_stack(NULL, regs, (unsigned long*)rsp);
* [PATCH] [7/48] x86_64: Set HASHDIST_DEFAULT to 1 for x86_64 NUMA
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: Ravikiran G Thirumalai, Andi Kleen, patches, linux-kernel
From: Ravikiran G Thirumalai <kiran@scalex86.org>
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA.
Forcing the kernel to use node interleaved vmalloc instead of bootmem for
the system hashtable memory (alloc_large_system_hash) reduces the memory
imbalance on node 0 by around 40MB on an 8-node x86_64 NUMA box:
Before the following patch, on bootup of a 8 node box:
Node 0 MemTotal: 3407488 kB
Node 0 MemFree: 3206296 kB
Node 0 MemUsed: 201192 kB
Node 0 Active: 7012 kB
Node 0 Inactive: 512 kB
Node 0 Dirty: 0 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 1912 kB
Node 0 Mapped: 420 kB
Node 0 AnonPages: 5612 kB
Node 0 PageTables: 468 kB
Node 0 NFS_Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 Slab: 5408 kB
Node 0 SReclaimable: 644 kB
Node 0 SUnreclaim: 4764 kB
After the patch (or using hashdist=1 on the kernel command line):
Node 0 MemTotal: 3407488 kB
Node 0 MemFree: 3247608 kB
Node 0 MemUsed: 159880 kB
Node 0 Active: 3012 kB
Node 0 Inactive: 616 kB
Node 0 Dirty: 0 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 2424 kB
Node 0 Mapped: 380 kB
Node 0 AnonPages: 1200 kB
Node 0 PageTables: 396 kB
Node 0 NFS_Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 Slab: 6304 kB
Node 0 SReclaimable: 1596 kB
Node 0 SUnreclaim: 4708 kB
I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA
since x86_64 has no dearth of vmalloc space? Or maybe enable hash
distribution for all 64bit NUMA arches? The following patch does it only
for x86_64.
I ran an HPC MPI benchmark -- 'Ansys wingsolid', which takes up quite a
bit of memory and uses up TLB entries. This was on a 4-way, 2-socket
Tyan AMD box (non vsmp), with 8G total memory (4G per node).
The results with and without hash distribution are:
1. Vanilla - runtime of 1188.000s
2. With hashdist=1 runtime of 1154.000s
Oprofile output for the duration of run is:
1. Vanilla:
CPU: AMD64 processors, speed 2411.16 MHz (estimated)
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
mask of 0x00 (No unit mask) count 500
samples % app name symbol name
163054 6.5513 libansys1.so MultiFront::decompose(int, int,
Elemset *, int *, int, int, int)
162061 6.5114 libansys3.so blockSaxpy6L_fd
162042 6.5107 libansys3.so blockInnerProduct6L_fd
156286 6.2794 libansys3.so maxb33_
87879 3.5309 libansys1.so elmatrixmultpcg_
84857 3.4095 libansys4.so saxpy_pcg
58637 2.3560 libansys4.so .st4560
46612 1.8728 libansys4.so .st4282
43043 1.7294 vmlinux-t copy_user_generic_string
41326 1.6604 libansys3.so blockSaxpyBackSolve6L_fd
41288 1.6589 libansys3.so blockInnerProductBackSolve6L_fd
2. With hashdist=1
CPU: AMD64 processors, speed 2411.13 MHz (estimated)
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
mask of 0x00 (No unit mask) count 500
samples % app name symbol name
162993 6.9814 libansys1.so MultiFront::decompose(int, int,
Elemset *, int *, int, int, int)
160799 6.8874 libansys3.so blockInnerProduct6L_fd
160459 6.8729 libansys3.so blockSaxpy6L_fd
156018 6.6826 libansys3.so maxb33_
84700 3.6279 libansys4.so saxpy_pcg
83434 3.5737 libansys1.so elmatrixmultpcg_
58074 2.4875 libansys4.so .st4560
46000 1.9703 libansys4.so .st4282
41166 1.7632 libansys3.so blockSaxpyBackSolve6L_fd
41033 1.7575 libansys3.so blockInnerProductBackSolve6L_fd
35762 1.5318 libansys1.so inner_product_sub
35591 1.5245 libansys1.so inner_product_sub2
28259 1.2104 libansys4.so addVectors
Signed-off-by: Pravin B. Shelar <pravin.shelar@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/bootmem.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux/include/linux/bootmem.h
===================================================================
--- linux.orig/include/linux/bootmem.h
+++ linux/include/linux/bootmem.h
@@ -122,9 +122,9 @@ extern void *alloc_large_system_hash(con
#define HASH_EARLY 0x00000001 /* Allocating during early boot? */
/* Only NUMA needs hash distribution.
- * IA64 is known to have sufficient vmalloc space.
+ * IA64 and x86_64 have sufficient vmalloc space.
*/
-#if defined(CONFIG_NUMA) && defined(CONFIG_IA64)
+#if defined(CONFIG_NUMA) && (defined(CONFIG_IA64) || defined(CONFIG_X86_64))
#define HASHDIST_DEFAULT 1
#else
#define HASHDIST_DEFAULT 0
* [PATCH] [8/48] i386: modpost apic related warning fixes
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: Vivek Goyal, Andi Kleen, Len Brown, patches, linux-kernel
From: Vivek Goyal <vgoyal@in.ibm.com>
o Modpost generates warnings for i386 if compiled with CONFIG_RELOCATABLE=y
WARNING: vmlinux - Section mismatch: reference to .init.text:find_unisys_acpi_oem_table from .text between 'acpi_madt_oem_check' (at offset 0xc0101eda) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:acpi_get_table_header_early from .text between 'acpi_madt_oem_check' (at offset 0xc0101ef0) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'acpi_madt_oem_check' (at offset 0xc0101f2e) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:setup_unisys from .text between 'acpi_madt_oem_check' (at offset 0xc0101f37) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'mps_oem_check' (at offset 0xc0101ec7) and 'acpi_madt_oem_check'
WARNING: vmlinux - Section mismatch: reference to .init.text:es7000_sw_apic from .text between 'enable_apic_mode' (at offset 0xc0101f48) and 'check_apicid_present'
o Some functions which are inline (acpi_madt_oem_check) are not inlined
by the compiler because they are accessed through function pointers.
These functions end up in the .text section and in turn access __init
functions, hence modpost generates warnings (see the sketch after this
list).
o Do not inline acpi_madt_oem_check; instead make it __init.
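The pattern behind these warnings, reduced to a sketch with hypothetical
names (not code from this patch):

#include <linux/init.h>

static int __init parse_oem_table(void)	/* lives in .init.text */
{
	return 0;
}

/* Without __init the caller lands in .text, and modpost warns because
 * it could reach parse_oem_table() after init memory has been freed.
 * Marking the out-of-line caller __init -- as this patch does for
 * acpi_madt_oem_check -- moves the reference into .init.text too. */
static int __init oem_check(char *oem_id, char *oem_table_id)
{
	return parse_oem_table();
}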
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/mach-generic/es7000.c | 41 ++++++++++++++++++++++++++++
include/asm-i386/mach-es7000/mach_apic.h | 7 ----
include/asm-i386/mach-es7000/mach_mpparse.h | 32 ---------------------
scripts/mod/modpost.c | 1
4 files changed, 42 insertions(+), 39 deletions(-)
Index: linux/arch/i386/mach-generic/es7000.c
===================================================================
--- linux.orig/arch/i386/mach-generic/es7000.c
+++ linux/arch/i386/mach-generic/es7000.c
@@ -25,4 +25,45 @@ static int probe_es7000(void)
return 0;
}
+extern void es7000_sw_apic(void);
+static void __init enable_apic_mode(void)
+{
+ es7000_sw_apic();
+ return;
+}
+
+static __init int mps_oem_check(struct mp_config_table *mpc, char *oem,
+ char *productid)
+{
+ if (mpc->mpc_oemptr) {
+ struct mp_config_oemtable *oem_table =
+ (struct mp_config_oemtable *)mpc->mpc_oemptr;
+ if (!strncmp(oem, "UNISYS", 6))
+ return parse_unisys_oem((char *)oem_table);
+ }
+ return 0;
+}
+
+#ifdef CONFIG_ACPI
+/* Hook from generic ACPI tables.c */
+static int __init acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+ unsigned long oem_addr;
+ if (!find_unisys_acpi_oem_table(&oem_addr)) {
+ if (es7000_check_dsdt())
+ return parse_unisys_oem((char *)oem_addr);
+ else {
+ setup_unisys();
+ return 1;
+ }
+ }
+ return 0;
+}
+#else
+static int __init acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+ return 0;
+}
+#endif
+
struct genapic apic_es7000 = APIC_INIT("es7000", probe_es7000);
Index: linux/include/asm-i386/mach-es7000/mach_apic.h
===================================================================
--- linux.orig/include/asm-i386/mach-es7000/mach_apic.h
+++ linux/include/asm-i386/mach-es7000/mach_apic.h
@@ -73,13 +73,6 @@ static inline void init_apic_ldr(void)
apic_write_around(APIC_LDR, val);
}
-extern void es7000_sw_apic(void);
-static inline void enable_apic_mode(void)
-{
- es7000_sw_apic();
- return;
-}
-
extern int apic_version [MAX_APICS];
static inline void setup_apic_routing(void)
{
Index: linux/include/asm-i386/mach-es7000/mach_mpparse.h
===================================================================
--- linux.orig/include/asm-i386/mach-es7000/mach_mpparse.h
+++ linux/include/asm-i386/mach-es7000/mach_mpparse.h
@@ -18,18 +18,6 @@ extern int parse_unisys_oem (char *oempt
extern int find_unisys_acpi_oem_table(unsigned long *oem_addr);
extern void setup_unisys(void);
-static inline int mps_oem_check(struct mp_config_table *mpc, char *oem,
- char *productid)
-{
- if (mpc->mpc_oemptr) {
- struct mp_config_oemtable *oem_table =
- (struct mp_config_oemtable *)mpc->mpc_oemptr;
- if (!strncmp(oem, "UNISYS", 6))
- return parse_unisys_oem((char *)oem_table);
- }
- return 0;
-}
-
#ifdef CONFIG_ACPI
static inline int es7000_check_dsdt(void)
@@ -41,26 +29,6 @@ static inline int es7000_check_dsdt(void
return 1;
return 0;
}
-
-/* Hook from generic ACPI tables.c */
-static inline int acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- unsigned long oem_addr;
- if (!find_unisys_acpi_oem_table(&oem_addr)) {
- if (es7000_check_dsdt())
- return parse_unisys_oem((char *)oem_addr);
- else {
- setup_unisys();
- return 1;
- }
- }
- return 0;
-}
-#else
-static inline int acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- return 0;
-}
#endif
#endif /* __ASM_MACH_MPPARSE_H */
Index: linux/scripts/mod/modpost.c
===================================================================
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -606,6 +606,7 @@ static int secref_whitelist(const char *
"_probe",
"_probe_one",
"_console",
+ "apic_es7000",
NULL
};
* [PATCH] [9/48] i386: make struct vmi_ops static
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: Adrian Bunk, Andi Kleen, Zachary Amsden, patches, linux-kernel
From: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/vmi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/arch/i386/kernel/vmi.c
===================================================================
--- linux.orig/arch/i386/kernel/vmi.c
+++ linux/arch/i386/kernel/vmi.c
@@ -56,7 +56,7 @@ static int disable_noidle;
static int disable_vmi_timer;
/* Cached VMI operations */
-struct {
+static struct {
void (*cpuid)(void /* non-c */);
void (*_set_ldt)(u32 selector);
void (*set_tr)(u32 selector);
* [PATCH] [10/48] i386: type cast clean up for find_next_zero_bit
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: Ken Chen, patches, linux-kernel
From: "Ken Chen" <kenchen@google.com>
Clean up an unneeded type cast by properly declaring the data type.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/lib/bitops.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux/arch/i386/lib/bitops.c
===================================================================
--- linux.orig/arch/i386/lib/bitops.c
+++ linux/arch/i386/lib/bitops.c
@@ -43,7 +43,7 @@ EXPORT_SYMBOL(find_next_bit);
*/
int find_next_zero_bit(const unsigned long *addr, int size, int offset)
{
- unsigned long * p = ((unsigned long *) addr) + (offset >> 5);
+ const unsigned long *p = addr + (offset >> 5);
int set = 0, bit = offset & 31, res;
if (bit) {
@@ -64,7 +64,7 @@ int find_next_zero_bit(const unsigned lo
/*
* No zero yet, search remaining full bytes for a zero
*/
- res = find_first_zero_bit (p, size - 32 * (p - (unsigned long *) addr));
+ res = find_first_zero_bit(p, size - 32 * (p - addr));
return (offset + set + res);
}
EXPORT_SYMBOL(find_next_zero_bit);
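The word/bit split used here is plain pointer arithmetic once the
declaration is const-correct. A user-space sketch, assuming i386's
32-bit unsigned long (not part of the patch):

#include <stdio.h>

int main(void)
{
	const unsigned long bits[4] = { 0 };	/* 128 bits on i386 */
	int offset = 70;
	/* offset >> 5 selects the 32-bit word, offset & 31 the bit. */
	const unsigned long *p = bits + (offset >> 5);
	int bit = offset & 31;

	/* No cast needed: p carries the same const qualification. */
	printf("bit %d lives in word %ld at position %d\n",
	       offset, (long)(p - bits), bit);
	return 0;
}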
* [PATCH] [11/48] i386: workaround for a -Wmissing-prototypes warning
From: Andi Kleen @ 2007-04-29 10:52 UTC (permalink / raw)
To: Adrian Bunk, Andi Kleen, patches, linux-kernel
From: Adrian Bunk <bunk@stusta.de>
Work around a warning with -Wmissing-prototypes in
arch/i386/kernel/asm-offsets.c.
The warning isn't gcc's fault - asm-offsets.c is simply a special file.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/asm-offsets.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux/arch/i386/kernel/asm-offsets.c
===================================================================
--- linux.orig/arch/i386/kernel/asm-offsets.c
+++ linux/arch/i386/kernel/asm-offsets.c
@@ -25,6 +25,9 @@
#define OFFSET(sym, str, mem) \
DEFINE(sym, offsetof(struct str, mem));
+/* workaround for a warning with -Wmissing-prototypes */
+void foo(void);
+
void foo(void)
{
OFFSET(SIGCONTEXT_eax, sigcontext, eax);
* [PATCH] [12/48] x86: Log reason why TSC was marked unstable
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: john stultz, Ingo Molnar, Thomas Gleixner, Andi Kleen, patches,
linux-kernel
From: john stultz <johnstul@us.ibm.com>
Change mark_tsc_unstable() so it takes a string argument, which holds the
reason the TSC was marked unstable.
This is then displayed the first time mark_tsc_unstable is called.
This should help us better debug why the TSC was marked unstable on certain
systems and allow us to make sure we're not being overly paranoid when
throwing out this troublesome clocksource.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/cpu/cyrix.c | 2 +-
arch/i386/kernel/tsc.c | 5 +++--
arch/x86_64/kernel/time.c | 2 +-
arch/x86_64/kernel/tsc.c | 5 +++--
arch/x86_64/kernel/tsc_sync.c | 2 +-
drivers/acpi/processor_idle.c | 4 ++--
include/asm-i386/mach-summit/mach_mpparse.h | 4 ++--
include/asm-i386/tsc.h | 2 +-
include/asm-x86_64/timex.h | 2 +-
9 files changed, 15 insertions(+), 13 deletions(-)
Index: linux/arch/i386/kernel/cpu/cyrix.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/cyrix.c
+++ linux/arch/i386/kernel/cpu/cyrix.c
@@ -279,7 +279,7 @@ static void __cpuinit init_cyrix(struct
*/
if (vendor == PCI_VENDOR_ID_CYRIX &&
(device == PCI_DEVICE_ID_CYRIX_5510 || device == PCI_DEVICE_ID_CYRIX_5520))
- mark_tsc_unstable();
+ mark_tsc_unstable("cyrix 5510/5520 detected");
}
#endif
c->x86_cache_size=16; /* Yep 16K integrated cache thats it */
Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -172,7 +172,7 @@ time_cpufreq_notifier(struct notifier_bl
ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
- mark_tsc_unstable();
+ mark_tsc_unstable("cpufreq changes");
}
}
}
@@ -220,11 +220,12 @@ static struct clocksource clocksource_ts
CLOCK_SOURCE_MUST_VERIFY,
};
-void mark_tsc_unstable(void)
+void mark_tsc_unstable(char *reason)
{
if (!tsc_unstable) {
tsc_unstable = 1;
tsc_enabled = 0;
+ printk("Marking TSC unstable due to: %s.\n", reason);
/* Can be called before registration */
if (clocksource_tsc.mult)
clocksource_change_rating(&clocksource_tsc, 0);
Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -397,7 +397,7 @@ void __init time_init(void)
cpu_khz = tsc_calibrate_cpu_khz();
if (unsynchronized_tsc())
- mark_tsc_unstable();
+ mark_tsc_unstable("TSCs unsynchronized");
if (cpu_has(&boot_cpu_data, X86_FEATURE_RDTSCP))
vgetcpu_mode = VGETCPU_RDTSCP;
Index: linux/arch/x86_64/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86_64/kernel/tsc.c
+++ linux/arch/x86_64/kernel/tsc.c
@@ -85,7 +85,7 @@ static int time_cpufreq_notifier(struct
tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
- mark_tsc_unstable();
+ mark_tsc_unstable("cpufreq changes");
}
return 0;
@@ -171,10 +171,11 @@ static struct clocksource clocksource_ts
.vread = vread_tsc,
};
-void mark_tsc_unstable(void)
+void mark_tsc_unstable(char *reason)
{
if (!tsc_unstable) {
tsc_unstable = 1;
+ printk("Marking TSC unstable due to %s\n", reason);
/* Change only the rating, when not registered */
if (clocksource_tsc.mult)
clocksource_change_rating(&clocksource_tsc, 0);
Index: linux/arch/x86_64/kernel/tsc_sync.c
===================================================================
--- linux.orig/arch/x86_64/kernel/tsc_sync.c
+++ linux/arch/x86_64/kernel/tsc_sync.c
@@ -138,7 +138,7 @@ void __cpuinit check_tsc_sync_source(int
printk("\n");
printk(KERN_WARNING "Measured %Ld cycles TSC warp between CPUs,"
" turning off TSC clock.\n", max_warp);
- mark_tsc_unstable();
+ mark_tsc_unstable("check_tsc_sync_source failed");
nr_warps = 0;
max_warp = 0;
last_tsc = 0;
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -483,7 +483,7 @@ static void acpi_processor_idle(void)
#ifdef CONFIG_GENERIC_TIME
/* TSC halts in C2, so notify users */
- mark_tsc_unstable();
+ mark_tsc_unstable("possible TSC halt in C2");
#endif
/* Re-enable interrupts */
local_irq_enable();
@@ -525,7 +525,7 @@ static void acpi_processor_idle(void)
#ifdef CONFIG_GENERIC_TIME
/* TSC halts in C3, so notify users */
- mark_tsc_unstable();
+ mark_tsc_unstable("TSC halts in C3");
#endif
/* Re-enable interrupts */
local_irq_enable();
Index: linux/include/asm-i386/mach-summit/mach_mpparse.h
===================================================================
--- linux.orig/include/asm-i386/mach-summit/mach_mpparse.h
+++ linux/include/asm-i386/mach-summit/mach_mpparse.h
@@ -30,7 +30,7 @@ static inline int mps_oem_check(struct m
(!strncmp(productid, "VIGIL SMP", 9)
|| !strncmp(productid, "EXA", 3)
|| !strncmp(productid, "RUTHLESS SMP", 12))){
- mark_tsc_unstable();
+ mark_tsc_unstable("Summit based system");
use_cyclone = 1; /*enable cyclone-timer*/
setup_summit();
return 1;
@@ -44,7 +44,7 @@ static inline int acpi_madt_oem_check(ch
if (!strncmp(oem_id, "IBM", 3) &&
(!strncmp(oem_table_id, "SERVIGIL", 8)
|| !strncmp(oem_table_id, "EXA", 3))){
- mark_tsc_unstable();
+ mark_tsc_unstable("Summit based system");
use_cyclone = 1; /*enable cyclone-timer*/
setup_summit();
return 1;
Index: linux/include/asm-i386/tsc.h
===================================================================
--- linux.orig/include/asm-i386/tsc.h
+++ linux/include/asm-i386/tsc.h
@@ -53,7 +53,7 @@ static __always_inline cycles_t get_cycl
}
extern void tsc_init(void);
-extern void mark_tsc_unstable(void);
+extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern void init_tsc_clocksource(void);
Index: linux/include/asm-x86_64/timex.h
===================================================================
--- linux.orig/include/asm-x86_64/timex.h
+++ linux/include/asm-x86_64/timex.h
@@ -27,5 +27,5 @@ extern int read_current_timer(unsigned l
#define NS_SCALE 10 /* 2^10, carefully chosen */
#define US_SCALE 32 /* 2^32, arbitralrily chosen */
-extern void mark_tsc_unstable(void);
+extern void mark_tsc_unstable(char *);
#endif
* [PATCH] [13/48] x86_64: fix ia32_binfmt.c build error
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Ralf Baechle, Andi Kleen, patches, linux-kernel
From: Ralf Baechle <ralf@linux-mips.org>
Reorder code to avoid multiple inclusion of elf.h.
#undef several symbols to avoid build errors over redefinitions.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/x86_64/ia32/ia32_binfmt.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
Index: linux/arch/x86_64/ia32/ia32_binfmt.c
===================================================================
--- linux.orig/arch/x86_64/ia32/ia32_binfmt.c
+++ linux/arch/x86_64/ia32/ia32_binfmt.c
@@ -5,6 +5,11 @@
* This tricks binfmt_elf.c into loading 32bit binaries using lots
* of ugly preprocessor tricks. Talk about very very poor man's inheritance.
*/
+#define __ASM_X86_64_ELF_H 1
+
+#undef ELF_CLASS
+#define ELF_CLASS ELFCLASS32
+
#include <linux/types.h>
#include <linux/stddef.h>
#include <linux/rwsem.h>
@@ -50,9 +55,6 @@ struct elf_phdr;
#undef ELF_ARCH
#define ELF_ARCH EM_386
-#undef ELF_CLASS
-#define ELF_CLASS ELFCLASS32
-
#define ELF_DATA ELFDATA2LSB
#define USE_ELF_CORE_DUMP 1
@@ -136,7 +138,7 @@ struct elf_prpsinfo
#define user user32
-#define __ASM_X86_64_ELF_H 1
+#undef elf_read_implies_exec
#define elf_read_implies_exec(ex, executable_stack) (executable_stack != EXSTACK_DISABLE_X)
//#include <asm/ia32.h>
#include <linux/elf.h>
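The underlying trick: pre-defining a header's include guard makes a
later #include of that header a no-op, so replacement definitions win.
A reduced, runnable sketch with hypothetical guard and macro names (the
real file predefines __ASM_X86_64_ELF_H and overrides ELF_CLASS):

#include <stdio.h>

#define MY_ELFCLASS32 1
#define MY_ELFCLASS64 2

/* The trick: claim the native header was already included... */
#define NATIVE_ELF_H 1
/* ...and provide the 32-bit value the generic code should see. */
#define MY_ELF_CLASS MY_ELFCLASS32

/* Stand-in for the native header's contents, guard and all. Because
 * NATIVE_ELF_H is already defined, this block contributes nothing. */
#ifndef NATIVE_ELF_H
#define NATIVE_ELF_H 1
#define MY_ELF_CLASS MY_ELFCLASS64
#endif

int main(void)
{
	/* Generic "loader" code compiled after the overrides: */
	printf("ELF class seen: %d\n", MY_ELF_CLASS);	/* prints 1 */
	return 0;
}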
* [PATCH] [14/48] x86_64: remove extra smp_processor_id calling
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Yinghai Lu, Eric W. Biederman, Andi Kleen, patches, linux-kernel
From: "Yinghai Lu" <yinghai.lu@amd.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/io_apic.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
Index: linux/arch/x86_64/kernel/io_apic.c
===================================================================
--- linux.orig/arch/x86_64/kernel/io_apic.c
+++ linux/arch/x86_64/kernel/io_apic.c
@@ -1407,8 +1407,7 @@ static void irq_complete_move(unsigned i
vector = ~get_irq_regs()->orig_rax;
me = smp_processor_id();
- if ((vector == cfg->vector) &&
- cpu_isset(smp_processor_id(), cfg->domain)) {
+ if ((vector == cfg->vector) && cpu_isset(me, cfg->domain)) {
cpumask_t cleanup_mask;
cpus_and(cleanup_mask, cfg->old_domain, cpu_online_map);
* [PATCH] [15/48] x86_64: make simnow_init() static
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Adrian Bunk, patches, linux-kernel
From: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/early_printk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/arch/x86_64/kernel/early_printk.c
===================================================================
--- linux.orig/arch/x86_64/kernel/early_printk.c
+++ linux/arch/x86_64/kernel/early_printk.c
@@ -175,7 +175,7 @@ static noinline long simnow(long cmd, lo
return ret;
}
-void __init simnow_init(char *str)
+static void __init simnow_init(char *str)
{
char *fn = "klog";
if (*str == '=')
* [PATCH] [16/48] i386: vmi_pmd_clear() static
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Adrian Bunk, patches, linux-kernel
From: Adrian Bunk <bunk@stusta.de>
This patch makes the needlessly global vmi_pmd_clear() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/vmi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/arch/i386/kernel/vmi.c
===================================================================
--- linux.orig/arch/i386/kernel/vmi.c
+++ linux/arch/i386/kernel/vmi.c
@@ -516,7 +516,7 @@ static void vmi_pte_clear(struct mm_stru
vmi_ops.set_pte(pte, ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
}
-void vmi_pmd_clear(pmd_t *pmd)
+static void vmi_pmd_clear(pmd_t *pmd)
{
const pte_t pte = { 0 };
vmi_check_page_type(__pa(pmd) >> PAGE_SHIFT, VMI_PAGE_PMD);
* [PATCH] [18/48] x86_64: configurable fake numa node sizes
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: David Rientjes, Andi Kleen, Paul Jackson, Christoph Lameter,
patches, linux-kernel
From: David Rientjes <rientjes@google.com>
Extends the numa=fake x86_64 command-line option to allow for configurable
node sizes. These nodes can be used in conjunction with cpusets for coarse
memory resource management.
The old command-line option is still supported:
numa=fake=32 gives 32 fake NUMA nodes, ignoring the NUMA setup of the
actual machine.
But now you may configure your system for the node sizes of your choice:
numa=fake=2*512,1024,2*256
gives two 512M nodes, one 1024M node, two 256M nodes, and
the rest of system memory to a sixth node.
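A stand-alone reduction of how such a size list expands, mirroring the
parsing loop added below (user-space sketch, not the kernel code):

#include <stdio.h>
#include <ctype.h>

static void parse_fake(const char *s)
{
	unsigned int num = 0, coeff = 1;

	for (;; s++) {
		if (*s && isdigit((unsigned char)*s)) {
			num = num * 10 + (*s - '0');
			continue;
		}
		if (*s == '*')
			coeff = num;	/* "N*" repeats the next size N times */
		if (!*s || *s == ',') {
			unsigned int i;

			for (i = 0; i < coeff; i++)
				printf("node of %u MB\n", num);
			coeff = 1;
		}
		if (!*s)
			break;
		num = 0;
	}
}

int main(void)
{
	/* Prints 512, 512, 1024, 256, 256; the remaining RAM would
	 * become a sixth node. */
	parse_fake("2*512,1024,2*256");
	return 0;
}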
The existing hash function is maintained to support the various node sizes
that are possible with this implementation.
Each node of the same size receives roughly the same amount of available
pages, regardless of any reserved memory within its address range. The
total number of available pages on the system is calculated and divided
by the number of equal nodes to allocate. These nodes are then
dynamically allocated and their borders extended until their number of
available pages reaches the required size.
Configurable node sizes are recommended when used in conjunction with
cpusets for memory control because they eliminate the overhead
associated with scanning the zonelists of many smaller full nodes in
page_alloc().
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/x86_64/boot-options.txt | 8 -
arch/x86_64/mm/numa.c | 259 +++++++++++++++++++---------------
include/asm-x86_64/mmzone.h | 2
3 files changed, 160 insertions(+), 109 deletions(-)
Index: linux/Documentation/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86_64/boot-options.txt
+++ linux/Documentation/x86_64/boot-options.txt
@@ -149,7 +149,13 @@ NUMA
numa=noacpi Don't parse the SRAT table for NUMA setup
- numa=fake=X Fake X nodes and ignore NUMA setup of the actual machine.
+ numa=fake=CMDLINE
+ If a number, fakes CMDLINE nodes and ignores NUMA setup of the
+ actual machine. Otherwise, system memory is configured
+ depending on the sizes and coefficients listed. For example:
+ numa=fake=2*512,1024,4*256
+ gives two 512M nodes, a 1024M node, and four 256M nodes. The
+ remaining system RAM is allocated to an additional node.
numa=hotadd=percent
Only allow hotadd memory to preallocate page structures upto
Index: linux/arch/x86_64/mm/numa.c
===================================================================
--- linux.orig/arch/x86_64/mm/numa.c
+++ linux/arch/x86_64/mm/numa.c
@@ -273,125 +273,172 @@ void __init numa_init_array(void)
#ifdef CONFIG_NUMA_EMU
/* Numa emulation */
-int numa_fake __initdata = 0;
+#define E820_ADDR_HOLE_SIZE(start, end) \
+ (e820_hole_size((start) >> PAGE_SHIFT, (end) >> PAGE_SHIFT) << \
+ PAGE_SHIFT)
+char *cmdline __initdata;
/*
- * This function is used to find out if the start and end correspond to
- * different zones.
+ * Setups up nid to range from addr to addr + size. If the end boundary is
+ * greater than max_addr, then max_addr is used instead. The return value is 0
+ * if there is additional memory left for allocation past addr and -1 otherwise.
+ * addr is adjusted to be at the end of the node.
*/
-int zone_cross_over(unsigned long start, unsigned long end)
+static int __init setup_node_range(int nid, struct bootnode *nodes, u64 *addr,
+ u64 size, u64 max_addr)
{
- if ((start < (MAX_DMA32_PFN << PAGE_SHIFT)) &&
- (end >= (MAX_DMA32_PFN << PAGE_SHIFT)))
- return 1;
- return 0;
+ int ret = 0;
+ nodes[nid].start = *addr;
+ *addr += size;
+ if (*addr >= max_addr) {
+ *addr = max_addr;
+ ret = -1;
+ }
+ nodes[nid].end = *addr;
+ node_set_online(nid);
+ printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n", nid,
+ nodes[nid].start, nodes[nid].end,
+ (nodes[nid].end - nodes[nid].start) >> 20);
+ return ret;
}
-static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
+/*
+ * Splits num_nodes nodes up equally starting at node_start. The return value
+ * is the number of nodes split up and addr is adjusted to be at the end of the
+ * last node allocated.
+ */
+static int __init split_nodes_equally(struct bootnode *nodes, u64 *addr,
+ u64 max_addr, int node_start,
+ int num_nodes)
{
- int i, big;
- struct bootnode nodes[MAX_NUMNODES];
- unsigned long sz, old_sz;
- unsigned long hole_size;
- unsigned long start, end;
- unsigned long max_addr = (end_pfn << PAGE_SHIFT);
-
- start = (start_pfn << PAGE_SHIFT);
- hole_size = e820_hole_size(start, max_addr);
- sz = (max_addr - start - hole_size) / numa_fake;
-
- /* Kludge needed for the hash function */
+ unsigned int big;
+ u64 size;
+ int i;
- old_sz = sz;
+ if (num_nodes <= 0)
+ return -1;
+ if (num_nodes > MAX_NUMNODES)
+ num_nodes = MAX_NUMNODES;
+ size = (max_addr - *addr - E820_ADDR_HOLE_SIZE(*addr, max_addr)) /
+ num_nodes;
/*
- * Round down to the nearest FAKE_NODE_MIN_SIZE.
+ * Calculate the number of big nodes that can be allocated as a result
+ * of consolidating the leftovers.
*/
- sz &= FAKE_NODE_MIN_HASH_MASK;
+ big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * num_nodes) /
+ FAKE_NODE_MIN_SIZE;
- /*
- * We ensure that each node is at least 64MB big. Smaller than this
- * size can cause VM hiccups.
- */
- if (sz == 0) {
- printk(KERN_INFO "Not enough memory for %d nodes. Reducing "
- "the number of nodes\n", numa_fake);
- numa_fake = (max_addr - start - hole_size) / FAKE_NODE_MIN_SIZE;
- printk(KERN_INFO "Number of fake nodes will be = %d\n",
- numa_fake);
- sz = FAKE_NODE_MIN_SIZE;
+ /* Round down to nearest FAKE_NODE_MIN_SIZE. */
+ size &= FAKE_NODE_MIN_HASH_MASK;
+ if (!size) {
+ printk(KERN_ERR "Not enough memory for each node. "
+ "NUMA emulation disabled.\n");
+ return -1;
}
- /*
- * Find out how many nodes can get an extra NODE_MIN_SIZE granule.
- * This logic ensures the extra memory gets distributed among as many
- * nodes as possible (as compared to one single node getting all that
- * extra memory.
- */
- big = ((old_sz - sz) * numa_fake) / FAKE_NODE_MIN_SIZE;
- printk(KERN_INFO "Fake node Size: %luMB hole_size: %luMB big nodes: "
- "%d\n",
- (sz >> 20), (hole_size >> 20), big);
- memset(&nodes,0,sizeof(nodes));
- end = start;
- for (i = 0; i < numa_fake; i++) {
- /*
- * In case we are not able to allocate enough memory for all
- * the nodes, we reduce the number of fake nodes.
- */
- if (end >= max_addr) {
- numa_fake = i - 1;
- break;
- }
- start = nodes[i].start = end;
- /*
- * Final node can have all the remaining memory.
- */
- if (i == numa_fake-1)
- sz = max_addr - start;
- end = nodes[i].start + sz;
- /*
- * Fir "big" number of nodes get extra granule.
- */
+
+ for (i = node_start; i < num_nodes + node_start; i++) {
+ u64 end = *addr + size;
if (i < big)
end += FAKE_NODE_MIN_SIZE;
/*
- * Iterate over the range to ensure that this node gets at
- * least sz amount of RAM (excluding holes)
+ * The final node can have the remaining system RAM. Other
+ * nodes receive roughly the same amount of available pages.
*/
- while ((end - start - e820_hole_size(start, end)) < sz) {
- end += FAKE_NODE_MIN_SIZE;
- if (end >= max_addr)
- break;
+ if (i == num_nodes + node_start - 1)
+ end = max_addr;
+ else
+ while (end - *addr - E820_ADDR_HOLE_SIZE(*addr, end) <
+ size) {
+ end += FAKE_NODE_MIN_SIZE;
+ if (end > max_addr) {
+ end = max_addr;
+ break;
+ }
+ }
+ if (setup_node_range(i, nodes, addr, end - *addr, max_addr) < 0)
+ break;
+ }
+ return i - node_start + 1;
+}
+
+/*
+ * Sets up the system RAM area from start_pfn to end_pfn according to the
+ * numa=fake command-line option.
+ */
+static int __init numa_emulation(unsigned long start_pfn, unsigned long end_pfn)
+{
+ struct bootnode nodes[MAX_NUMNODES];
+ u64 addr = start_pfn << PAGE_SHIFT;
+ u64 max_addr = end_pfn << PAGE_SHIFT;
+ unsigned int coeff;
+ unsigned int num = 0;
+ int num_nodes = 0;
+ u64 size;
+ int i;
+
+ memset(&nodes, 0, sizeof(nodes));
+ /*
+ * If the numa=fake command-line is just a single number N, split the
+ * system RAM into N fake nodes.
+ */
+ if (!strchr(cmdline, '*') && !strchr(cmdline, ',')) {
+ num_nodes = split_nodes_equally(nodes, &addr, max_addr, 0,
+ simple_strtol(cmdline, NULL, 0));
+ if (num_nodes < 0)
+ return num_nodes;
+ goto out;
+ }
+
+ /* Parse the command line. */
+ for (coeff = 1; ; cmdline++) {
+ if (*cmdline && isdigit(*cmdline)) {
+ num = num * 10 + *cmdline - '0';
+ continue;
}
- /*
- * Look at the next node to make sure there is some real memory
- * to map. Bad things happen when the only memory present
- * in a zone on a fake node is IO hole.
- */
- while (e820_hole_size(end, end + FAKE_NODE_MIN_SIZE) > 0) {
- if (zone_cross_over(start, end + sz)) {
- end = (MAX_DMA32_PFN << PAGE_SHIFT);
- break;
+ if (*cmdline == '*')
+ coeff = num;
+ if (!*cmdline || *cmdline == ',') {
+ /*
+ * Round down to the nearest FAKE_NODE_MIN_SIZE.
+ * Command-line coefficients are in megabytes.
+ */
+ size = ((u64)num << 20) & FAKE_NODE_MIN_HASH_MASK;
+ if (size) {
+ for (i = 0; i < coeff; i++, num_nodes++)
+ if (setup_node_range(num_nodes, nodes,
+ &addr, size, max_addr) < 0)
+ goto done;
+ coeff = 1;
}
- if (end >= max_addr)
- break;
- end += FAKE_NODE_MIN_SIZE;
}
- if (end > max_addr)
- end = max_addr;
- nodes[i].end = end;
- printk(KERN_INFO "Faking node %d at %016Lx-%016Lx (%LuMB)\n",
- i,
- nodes[i].start, nodes[i].end,
- (nodes[i].end - nodes[i].start) >> 20);
- node_set_online(i);
- }
- memnode_shift = compute_hash_shift(nodes, numa_fake);
- if (memnode_shift < 0) {
- memnode_shift = 0;
- printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
- return -1;
- }
- for_each_online_node(i) {
+ if (!*cmdline)
+ break;
+ num = 0;
+ }
+done:
+ if (!num_nodes)
+ return -1;
+ /* Fill remainder of system RAM with a final node, if appropriate. */
+ if (addr < max_addr) {
+ setup_node_range(num_nodes, nodes, &addr, max_addr - addr,
+ max_addr);
+ num_nodes++;
+ }
+out:
+ memnode_shift = compute_hash_shift(nodes, num_nodes);
+ if (memnode_shift < 0) {
+ memnode_shift = 0;
+ printk(KERN_ERR "No NUMA hash function found. NUMA emulation "
+ "disabled.\n");
+ return -1;
+ }
+
+ /*
+ * We need to vacate all active ranges that may have been registered by
+ * SRAT.
+ */
+ remove_all_active_ranges();
+ for_each_online_node(i) {
e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
nodes[i].end >> PAGE_SHIFT);
setup_node_bootmem(i, nodes[i].start, nodes[i].end);
@@ -399,14 +446,15 @@ static int __init numa_emulation(unsigne
numa_init_array();
return 0;
}
-#endif
+#undef E820_ADDR_HOLE_SIZE
+#endif /* CONFIG_NUMA_EMU */
void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
{
int i;
#ifdef CONFIG_NUMA_EMU
- if (numa_fake && !numa_emulation(start_pfn, end_pfn))
+ if (cmdline && !numa_emulation(start_pfn, end_pfn))
return;
#endif
@@ -486,11 +534,8 @@ static __init int numa_setup(char *opt)
if (!strncmp(opt,"off",3))
numa_off = 1;
#ifdef CONFIG_NUMA_EMU
- if(!strncmp(opt, "fake=", 5)) {
- numa_fake = simple_strtoul(opt+5,NULL,0); ;
- if (numa_fake >= MAX_NUMNODES)
- numa_fake = MAX_NUMNODES;
- }
+ if (!strncmp(opt, "fake=", 5))
+ cmdline = opt + 5;
#endif
#ifdef CONFIG_ACPI_NUMA
if (!strncmp(opt,"noacpi",6))
Index: linux/include/asm-x86_64/mmzone.h
===================================================================
--- linux.orig/include/asm-x86_64/mmzone.h
+++ linux/include/asm-x86_64/mmzone.h
@@ -49,7 +49,7 @@ extern int pfn_valid(unsigned long pfn);
#ifdef CONFIG_NUMA_EMU
#define FAKE_NODE_MIN_SIZE (64*1024*1024)
-#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1ul))
+#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1uL))
#endif
#endif
* [PATCH] [19/48] x86_64: split remaining fake nodes equally
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (16 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [18/48] x86_64: configurable fake numa node sizes Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [20/48] x86_64: fixed size remaining fake nodes Andi Kleen
` (28 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: David Rientjes, Andi Kleen, Paul Jackson, Christoph Lameter,
patches, linux-kernel
From: David Rientjes <rientjes@google.com>
Extends the numa=fake x86_64 command-line option to split the remaining
system memory into equal-sized nodes.
For example:
numa=fake=2*512,4* gives two 512M nodes and the remaining system
memory is split into four approximately equal
chunks.
This is beneficial for systems where the exact size of RAM is unknown or not
necessarily relevant, but the granularity with which nodes shall be allocated
is known.
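As a worked example (sizes hypothetical, not taken from the patch): on an
8192M machine, numa=fake=2*512,4* first carves out two 512M nodes and hands
the remaining 7168M to split_nodes_equally(), yielding four nodes of roughly
1792M each (sizes are rounded to the 64M FAKE_NODE_MIN_SIZE granularity, with
any leftover consolidated into "big" nodes).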
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/x86_64/boot-options.txt | 4 +++-
arch/x86_64/mm/numa.c | 22 ++++++++++++++++++----
2 files changed, 21 insertions(+), 5 deletions(-)
Index: linux/Documentation/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86_64/boot-options.txt
+++ linux/Documentation/x86_64/boot-options.txt
@@ -155,7 +155,9 @@ NUMA
depending on the sizes and coefficients listed. For example:
numa=fake=2*512,1024,4*256
gives two 512M nodes, a 1024M node, and four 256M nodes. The
- remaining system RAM is allocated to an additional node.
+ remaining system RAM is allocated to an additional node. If
+ the last character of CMDLINE is a *, the remaining system RAM
+ is instead divided up equally among its coefficient.
numa=hotadd=percent
Only allow hotadd memory to preallocate page structures upto
Index: linux/arch/x86_64/mm/numa.c
===================================================================
--- linux.orig/arch/x86_64/mm/numa.c
+++ linux/arch/x86_64/mm/numa.c
@@ -418,11 +418,25 @@ static int __init numa_emulation(unsigne
done:
if (!num_nodes)
return -1;
- /* Fill remainder of system RAM with a final node, if appropriate. */
+ /* Fill remainder of system RAM, if appropriate. */
if (addr < max_addr) {
- setup_node_range(num_nodes, nodes, &addr, max_addr - addr,
- max_addr);
- num_nodes++;
+ switch (*(cmdline - 1)) {
+ case '*':
+ /* Split remaining nodes into coeff chunks */
+ if (coeff <= 0)
+ break;
+ num_nodes += split_nodes_equally(nodes, &addr, max_addr,
+ num_nodes, coeff);
+ break;
+ case ',':
+ /* Do not allocate remaining system RAM */
+ break;
+ default:
+ /* Give one final node */
+ setup_node_range(num_nodes, nodes, &addr,
+ max_addr - addr, max_addr);
+ num_nodes++;
+ }
}
out:
memnode_shift = compute_hash_shift(nodes, num_nodes);
* [PATCH] [20/48] x86_64: fixed size remaining fake nodes
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (17 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [19/48] x86_64: split remaining fake nodes equally Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [21/48] x86: remove constant_tsc reporting from /proc/cpuinfo' power flags Andi Kleen
` (27 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: David Rientjes, Andi Kleen, Paul Jackson, Christoph Lameter,
patches, linux-kernel
From: David Rientjes <rientjes@google.com>
Extends the numa=fake x86_64 command-line option to split the remaining system
memory into nodes of fixed size. Any leftover memory is allocated to a final
node unless the command-line ends with a comma.
For example:
numa=fake=2*512,*128 gives two 512M nodes and the remaining system
memory is split into nodes of 128M each.
This is beneficial for systems where the exact size of RAM is unknown or not
necessarily relevant, but the size of the remaining nodes to be allocated is
known based on their capacity for resource management.
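Again with hypothetical sizes: on a 2048M machine, numa=fake=2*512,*128 carves
out two 512M nodes and then split_nodes_by_size() fills the remaining 1024M
with eight 128M nodes; had the remainder not been a multiple of 128M, the
final node would have absorbed the leftover, which is why the comment in the
patch notes it can be asymmetric.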
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/x86_64/boot-options.txt | 14 ++++++----
arch/x86_64/mm/numa.c | 47 ++++++++++++++++++++++++++--------
2 files changed, 46 insertions(+), 15 deletions(-)
Index: linux/Documentation/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86_64/boot-options.txt
+++ linux/Documentation/x86_64/boot-options.txt
@@ -153,11 +153,15 @@ NUMA
If a number, fakes CMDLINE nodes and ignores NUMA setup of the
actual machine. Otherwise, system memory is configured
depending on the sizes and coefficients listed. For example:
- numa=fake=2*512,1024,4*256
- gives two 512M nodes, a 1024M node, and four 256M nodes. The
- remaining system RAM is allocated to an additional node. If
- the last character of CMDLINE is a *, the remaining system RAM
- is instead divided up equally among its coefficient.
+ numa=fake=2*512,1024,4*256,*128
+ gives two 512M nodes, a 1024M node, four 256M nodes, and the
+ rest split into 128M chunks. If the last character of CMDLINE
+ is a *, the remaining memory is divided up equally among its
+ coefficient:
+ numa=fake=2*512,2*
+ gives two 512M nodes and the rest split into two nodes.
+ Otherwise, the remaining system RAM is allocated to an
+ additional node.
numa=hotadd=percent
Only allow hotadd memory to preallocate page structures upto
Index: linux/arch/x86_64/mm/numa.c
===================================================================
--- linux.orig/arch/x86_64/mm/numa.c
+++ linux/arch/x86_64/mm/numa.c
@@ -362,6 +362,21 @@ static int __init split_nodes_equally(st
}
/*
+ * Splits the remaining system RAM into chunks of size. The remaining memory is
+ * always assigned to a final node and can be asymmetric. Returns the number of
+ * nodes split.
+ */
+static int __init split_nodes_by_size(struct bootnode *nodes, u64 *addr,
+ u64 max_addr, int node_start, u64 size)
+{
+ int i = node_start;
+ size = (size << 20) & FAKE_NODE_MIN_HASH_MASK;
+ while (!setup_node_range(i++, nodes, addr, size, max_addr))
+ ;
+ return i - node_start;
+}
+
+/*
* Sets up the system RAM area from start_pfn to end_pfn according to the
* numa=fake command-line option.
*/
@@ -370,9 +385,10 @@ static int __init numa_emulation(unsigne
struct bootnode nodes[MAX_NUMNODES];
u64 addr = start_pfn << PAGE_SHIFT;
u64 max_addr = end_pfn << PAGE_SHIFT;
- unsigned int coeff;
- unsigned int num = 0;
int num_nodes = 0;
+ int coeff_flag;
+ int coeff = -1;
+ int num = 0;
u64 size;
int i;
@@ -390,29 +406,34 @@ static int __init numa_emulation(unsigne
}
/* Parse the command line. */
- for (coeff = 1; ; cmdline++) {
+ for (coeff_flag = 0; ; cmdline++) {
if (*cmdline && isdigit(*cmdline)) {
num = num * 10 + *cmdline - '0';
continue;
}
- if (*cmdline == '*')
- coeff = num;
+ if (*cmdline == '*') {
+ if (num > 0)
+ coeff = num;
+ coeff_flag = 1;
+ }
if (!*cmdline || *cmdline == ',') {
+ if (!coeff_flag)
+ coeff = 1;
/*
* Round down to the nearest FAKE_NODE_MIN_SIZE.
* Command-line coefficients are in megabytes.
*/
size = ((u64)num << 20) & FAKE_NODE_MIN_HASH_MASK;
- if (size) {
+ if (size)
for (i = 0; i < coeff; i++, num_nodes++)
if (setup_node_range(num_nodes, nodes,
&addr, size, max_addr) < 0)
goto done;
- coeff = 1;
- }
+ if (!*cmdline)
+ break;
+ coeff_flag = 0;
+ coeff = -1;
}
- if (!*cmdline)
- break;
num = 0;
}
done:
@@ -420,6 +441,12 @@ done:
return -1;
/* Fill remainder of system RAM, if appropriate. */
if (addr < max_addr) {
+ if (coeff_flag && coeff < 0) {
+ /* Split remaining nodes into num-sized chunks */
+ num_nodes += split_nodes_by_size(nodes, &addr, max_addr,
+ num_nodes, num);
+ goto out;
+ }
switch (*(cmdline - 1)) {
case '*':
/* Split remaining nodes into coeff chunks */
* [PATCH] [21/48] x86: remove constant_tsc reporting from /proc/cpuinfo' power flags
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (18 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [20/48] x86_64: fixed size remaining fake nodes Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [22/48] x86_64: fake numa for cpusets document Andi Kleen
` (26 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Joerg Roedel, patches, linux-kernel
From: Joerg Roedel <joerg.roedel@amd.com>
Remove the reporting of the constant_tsc flag from the "power management"
field in /proc/cpuinfo. The NULL value there was replaced by "" because
the former would result in a printout of [8] if the flag is set.
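For context, the /proc/cpuinfo power-flags loop this relies on looks roughly
like the following (paraphrased, not the exact kernel source); a NULL table
entry falls through to the bracketed-bit-number case, while an empty string
prints nothing visible:

	for (i = 0; i < 32; i++)
		if (c->x86_power & (1 << i)) {
			if (i < ARRAY_SIZE(x86_power_flags) && x86_power_flags[i])
				seq_printf(m, "%s%s", x86_power_flags[i][0] ? " " : "",
					   x86_power_flags[i]);
			else
				seq_printf(m, " [%d]", i);
		}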
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/cpu/proc.c | 3 +--
arch/x86_64/kernel/setup.c | 5 ++---
2 files changed, 3 insertions(+), 5 deletions(-)
Index: linux/arch/i386/kernel/cpu/proc.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/proc.c
+++ linux/arch/i386/kernel/cpu/proc.c
@@ -72,8 +72,7 @@ static int show_cpuinfo(struct seq_file
"stc",
"100mhzsteps",
"hwpstate",
- NULL,
- NULL, /* constant_tsc - moved to flags */
+ "", /* constant_tsc - moved to flags */
/* nothing */
};
struct cpuinfo_x86 *c = v;
Index: linux/arch/x86_64/kernel/setup.c
===================================================================
--- linux.orig/arch/x86_64/kernel/setup.c
+++ linux/arch/x86_64/kernel/setup.c
@@ -979,9 +979,8 @@ static int show_cpuinfo(struct seq_file
"stc",
"100mhzsteps",
"hwpstate",
- NULL, /* tsc invariant mapped to constant_tsc */
- NULL,
- /* nothing */ /* constant_tsc - moved to flags */
+ "", /* tsc invariant mapped to constant_tsc */
+ /* nothing */
};
* [PATCH] [22/48] x86_64: fake numa for cpusets document
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (19 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [21/48] x86: remove constant_tsc reporting from /proc/cpuinfo' power flags Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [23/48] i386: VDSO_PRELINK warning fix Andi Kleen
` (25 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: David Rientjes, Andi Kleen, Paul Jackson, Christoph Lameter,
patches, linux-kernel
From: David Rientjes <rientjes@google.com>
Create a document to explain how to use numa=fake in conjunction with cpusets
for coarse memory resource management.
An attempt to get more awareness and testing for this feature.
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/x86_64/fake-numa-for-cpusets | 66 +++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
Index: linux/Documentation/x86_64/fake-numa-for-cpusets
===================================================================
--- /dev/null
+++ linux/Documentation/x86_64/fake-numa-for-cpusets
@@ -0,0 +1,66 @@
+Using numa=fake and CPUSets for Resource Management
+Written by David Rientjes <rientjes@cs.washington.edu>
+
+This document describes how the numa=fake x86_64 command-line option can be used
+in conjunction with cpusets for coarse memory management. Using this feature,
+you can create fake NUMA nodes that represent contiguous chunks of memory and
+assign them to cpusets and their attached tasks. This is a way of limiting the
+amount of system memory that is available to a certain class of tasks.
+
+For more information on the features of cpusets, see Documentation/cpusets.txt.
+There are a number of different configurations you can use for your needs. For
+more information on the numa=fake command line option and its various ways of
+configuring fake nodes, see Documentation/x86_64/boot-options.txt.
+
+For the purposes of this introduction, we'll assume a very primitive NUMA
+emulation setup of "numa=fake=4*512,". This will split our system memory into
+four equal chunks of 512M each that we can now use to assign to cpusets. As
+you become more familiar with using this combination for resource control,
+you'll determine a better setup to minimize the number of nodes you have to deal
+with.
+
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
+
+ Faking node 0 at 0000000000000000-0000000020000000 (512MB)
+ Faking node 1 at 0000000020000000-0000000040000000 (512MB)
+ Faking node 2 at 0000000040000000-0000000060000000 (512MB)
+ Faking node 3 at 0000000060000000-0000000080000000 (512MB)
+ ...
+ On node 0 totalpages: 130975
+ On node 1 totalpages: 131072
+ On node 2 totalpages: 131072
+ On node 3 totalpages: 131072
+
+Now following the instructions for mounting the cpusets filesystem from
+Documentation/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
+address spaces) to individual cpusets:
+
+ [root@xroads /]# mkdir exampleset
+ [root@xroads /]# mount -t cpuset none exampleset
+ [root@xroads /]# mkdir exampleset/ddset
+ [root@xroads /]# cd exampleset/ddset
+ [root@xroads /exampleset/ddset]# echo 0-1 > cpus
+ [root@xroads /exampleset/ddset]# echo 0-1 > mems
+
+Now this cpuset, 'ddset', will only be allowed access to fake nodes 0 and 1 for
+memory allocations (1G).
+
+You can now assign tasks to these cpusets to limit the memory resources
+available to them according to the fake nodes assigned as mems:
+
+ [root@xroads /exampleset/ddset]# echo $$ > tasks
+ [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G &
+ [1] 13425
+
+Notice the difference between the system memory usage as reported by
+/proc/meminfo between the restricted cpuset case above and the unrestricted
+case (i.e. running the same 'dd' command without assigning it to a fake NUMA
+cpuset):
+ Unrestricted Restricted
+ MemTotal: 3091900 kB 3091900 kB
+ MemFree: 42113 kB 1513236 kB
+
+This allows for coarse memory management for the tasks you assign to particular
+cpusets. Since cpusets can form a hierarchy, you can create some pretty
+interesting combinations of use-cases for various classes of tasks for your
+memory management needs.
* [PATCH] [23/48] i386: VDSO_PRELINK warning fix
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (20 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [22/48] x86_64: fake numa for cpusets document Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [24/48] i386: Initialize esp0 properly all the time Andi Kleen
` (24 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Andrew Morton, Rusty Russell, Andi Kleen, patches, linux-kernel
From: Andrew Morton <akpm@linux-foundation.org>
The lguest patches somehow managed to trigger this:
In file included from arch/i386/lguest/lguest.c:38:
include/asm/asm-offsets.h:67:1: warning: "VDSO_PRELINK" redefined
In file included from include/linux/elf.h:7,
from include/linux/module.h:15,
from include/linux/device.h:21,
from include/linux/interrupt.h:15,
from arch/i386/lguest/lguest.c:27:
include/asm/elf.h:140:1: warning: this is the location of the previous definition
I assume that using the same identifier twice was a bad idea..
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/asm-offsets.c | 2 +-
arch/i386/kernel/vsyscall.lds.S | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
Index: linux/arch/i386/kernel/asm-offsets.c
===================================================================
--- linux.orig/arch/i386/kernel/asm-offsets.c
+++ linux/arch/i386/kernel/asm-offsets.c
@@ -97,7 +97,7 @@ void foo(void)
sizeof(struct tss_struct));
DEFINE(PAGE_SIZE_asm, PAGE_SIZE);
- DEFINE(VDSO_PRELINK, VDSO_PRELINK);
+ DEFINE(VDSO_PRELINK_asm, VDSO_PRELINK);
OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
Index: linux/arch/i386/kernel/vsyscall.lds.S
===================================================================
--- linux.orig/arch/i386/kernel/vsyscall.lds.S
+++ linux/arch/i386/kernel/vsyscall.lds.S
@@ -7,7 +7,7 @@
SECTIONS
{
- . = VDSO_PRELINK + SIZEOF_HEADERS;
+ . = VDSO_PRELINK_asm + SIZEOF_HEADERS;
.hash : { *(.hash) } :text
.gnu.hash : { *(.gnu.hash) }
@@ -21,7 +21,7 @@ SECTIONS
For the layouts to match, we need to skip more than enough
space for the dynamic symbol table et al. If this amount
is insufficient, ld -shared will barf. Just increase it here. */
- . = VDSO_PRELINK + 0x400;
+ . = VDSO_PRELINK_asm + 0x400;
.text : { *(.text) } :text =0x90909090
.note : { *(.note.*) } :text :note
* [PATCH] [24/48] i386: Initialize esp0 properly all the time
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (21 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [23/48] i386: VDSO_PRELINK warning fix Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [25/48] x86_64: Introduce load_TLS to the "for" loop Andi Kleen
` (23 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
Whenever we schedule, __switch_to calls load_esp0 which does:
tss->esp0 = thread->esp0;
This is never initialized for the initial thread (ie "swapper"), so when we're
scheduling that, we end up setting esp0 to 0. This is fine: the swapper never
leaves ring 0, so this field is never used.
lguest, however, gets upset that we're trying to use an unmapped page as our
kernel stack. Rather than work around it there, let's initialize it.
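(For reference: the value used below is the top of the initial kernel stack.
x86 stacks grow downwards, so the first esp is the address just past the
init_stack array, i.e. sizeof(init_stack) + (long)&init_stack.)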
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/asm-i386/processor.h | 1 +
1 file changed, 1 insertion(+)
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -421,6 +421,7 @@ struct thread_struct {
};
#define INIT_THREAD { \
+ .esp0 = sizeof(init_stack) + (long)&init_stack, \
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
* [PATCH] [25/48] x86_64: Introduce load_TLS to the "for" loop.
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (22 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [24/48] i386: Initialize esp0 properly all the time Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [26/48] x86_64: Clarify CONFIG_REORDER explanation Andi Kleen
` (22 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
GCC (4.1 at least) unrolls it anyway, but I can't believe this code
was ever justifiable. (I've also submitted a patch which cleans up
i386, which is even uglier).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/asm-x86_64/desc.h | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
Index: linux/include/asm-x86_64/desc.h
===================================================================
--- linux.orig/include/asm-x86_64/desc.h
+++ linux/include/asm-x86_64/desc.h
@@ -135,16 +135,13 @@ static inline void set_ldt_desc(unsigned
(info)->useable == 0 && \
(info)->lm == 0)
-#if TLS_SIZE != 24
-# error update this code.
-#endif
-
static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
{
+ unsigned int i;
u64 *gdt = (u64 *)(cpu_gdt(cpu) + GDT_ENTRY_TLS_MIN);
- gdt[0] = t->tls_array[0];
- gdt[1] = t->tls_array[1];
- gdt[2] = t->tls_array[2];
+
+ for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
+ gdt[i] = t->tls_array[i];
}
/*
* [PATCH] [26/48] x86_64: Clarify CONFIG_REORDER explanation
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (23 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [25/48] x86_64: Introduce load_TLS to the "for" loop Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [27/48] i386: Allow i386 crash kernels to handle x86_64 dumps Andi Kleen
` (21 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
if (1 && X) => if (X).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/x86_64/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux/arch/x86_64/Kconfig
===================================================================
--- linux.orig/arch/x86_64/Kconfig
+++ linux/arch/x86_64/Kconfig
@@ -665,8 +665,8 @@ config REORDER
default n
help
This option enables the toolchain to reorder functions for a more
- optimal TLB usage. If you have pretty much any version of binutils,
- this can increase your kernel build time by roughly one minute.
+ optimal TLB usage. This will slow your kernel build by
+ roughly one minute.
config K8_NB
def_bool y
* [PATCH] [27/48] i386: Allow i386 crash kernels to handle x86_64 dumps
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (24 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [26/48] x86_64: Clarify CONFIG_REORDER explanation Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [28/48] i386: prevent ACPI quirk warning mass spamming in logs Andi Kleen
` (20 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Ian Campbell, Horms, Magnus Damm, Eric W. Biederman, Andi Kleen,
patches, linux-kernel
From: Ian Campbell <ian.campbell@xensource.com>
The specific case I am encountering is kdump under Xen with a 64 bit
hypervisor and 32 bit kernel/userspace. The dump created is 64 bit due to
the hypervisor but the dump kernel is 32 bit for maximum compatibility.
It's possibly less likely to be useful in a purely native scenario but I
see no reason to disallow it.
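(With these hunks, vmcore_elf_check_arch() on i386 accepts both native cores
via elf_check_arch() and EM_X86_64 cores via the new cross macro; every other
architecture keeps rejecting foreign dumps because the cross hook defaults
to 0.)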
[akpm@linux-foundation.org: build fix]
Signed-off-by: Ian Campbell <ian.campbell@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Horms <horms@verge.net.au>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/proc/vmcore.c | 2 +-
include/asm-i386/kexec.h | 3 +++
include/linux/crash_dump.h | 8 ++++++++
3 files changed, 12 insertions(+), 1 deletion(-)
Index: linux/fs/proc/vmcore.c
===================================================================
--- linux.orig/fs/proc/vmcore.c
+++ linux/fs/proc/vmcore.c
@@ -514,7 +514,7 @@ static int __init parse_crash_elf64_head
/* Do some basic Verification. */
if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
(ehdr.e_type != ET_CORE) ||
- !elf_check_arch(&ehdr) ||
+ !vmcore_elf_check_arch(&ehdr) ||
ehdr.e_ident[EI_CLASS] != ELFCLASS64 ||
ehdr.e_ident[EI_VERSION] != EV_CURRENT ||
ehdr.e_version != EV_CURRENT ||
Index: linux/include/asm-i386/kexec.h
===================================================================
--- linux.orig/include/asm-i386/kexec.h
+++ linux/include/asm-i386/kexec.h
@@ -42,6 +42,9 @@
/* The native architecture */
#define KEXEC_ARCH KEXEC_ARCH_386
+/* We can also handle crash dumps from 64 bit kernel. */
+#define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
+
#define MAX_NOTE_BYTES 1024
/* CPU does not save ss and esp on stack if execution is already
Index: linux/include/linux/crash_dump.h
===================================================================
--- linux.orig/include/linux/crash_dump.h
+++ linux/include/linux/crash_dump.h
@@ -14,5 +14,13 @@ extern ssize_t copy_oldmem_page(unsigned
extern const struct file_operations proc_vmcore_operations;
extern struct proc_dir_entry *proc_vmcore;
+/* Architecture code defines this if there are other possible ELF
+ * machine types, e.g. on bi-arch capable hardware. */
+#ifndef vmcore_elf_check_arch_cross
+#define vmcore_elf_check_arch_cross(x) 0
+#endif
+
+#define vmcore_elf_check_arch(x) (elf_check_arch(x) || vmcore_elf_check_arch_cross(x))
+
#endif /* CONFIG_CRASH_DUMP */
#endif /* LINUX_CRASHDUMP_H */
* [PATCH] [28/48] i386: prevent ACPI quirk warning mass spamming in logs
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (25 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [27/48] i386: Allow i386 crash kernels to handle x86_64 dumps Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [29/48] x86: add command line length to boot protocol Andi Kleen
` (19 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Thierry Vignaud, Andi Kleen, Len Brown, patches, linux-kernel
From: Thierry Vignaud <tvignaud@mandriva.com>
The following patch prevents this warning from being displayed again and again
(e.g. nine times on my NForce2 motherboard) and thus improves the
signal-to-noise ratio in the logs.
The ATI quirk below probably needs a similar "fix" but I don't have
the hardware to test.
Btw arch/x86_64/kernel/early-quirks.c::nvidia_bugs() would probably need to
be synced (but I don't have an x86_64 NVidia motherboard to boot test it).
Still, it shows the usefulness of the recent x86 merge thread.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Thierry Vignaud <tvignaud@mandriva.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/acpi/earlyquirk.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux/arch/i386/kernel/acpi/earlyquirk.c
===================================================================
--- linux.orig/arch/i386/kernel/acpi/earlyquirk.c
+++ linux/arch/i386/kernel/acpi/earlyquirk.c
@@ -21,11 +21,14 @@ static int __init nvidia_hpet_check(stru
static int __init check_bridge(int vendor, int device)
{
+ static int warned;
#ifdef CONFIG_ACPI
/* According to Nvidia all timer overrides are bogus unless HPET
is enabled. */
if (!acpi_use_timer_override && vendor == PCI_VENDOR_ID_NVIDIA) {
- if (acpi_table_parse(ACPI_SIG_HPET, nvidia_hpet_check)) {
+ if (!warned && acpi_table_parse(ACPI_SIG_HPET,
+ nvidia_hpet_check)) {
+ warned = 1;
acpi_skip_timer_override = 1;
printk(KERN_INFO "Nvidia board "
"detected. Ignoring ACPI "
* [PATCH] [29/48] x86: add command line length to boot protocol
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (26 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [28/48] i386: prevent ACPI quirk warning mass spamming in logs Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [30/48] i386: Use per-cpu variables for GDT, PDA Andi Kleen
` (18 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Bernhard Walle, Alon Bar-Lev, Andi Kleen, patches, linux-kernel
From: Bernhard Walle <bwalle@suse.de>
Because the command line was increased to 2048 characters after 2.6.21, it's
not possible for boot loaders and userspace tools to determine the length
of the command line the kernel can understand. The benefit of knowing the
length is that users can be warned if the command line is too long,
which prevents surprises if things don't work after bootup.
This patch updates the boot protocol to contain a field called
"cmdline_size" that contain the length of the command line (excluding the
terminating zero).
The patch also adds missing fields (of protocol version 2.05) to the x86_64
setup code.
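To illustrate the consumer side, here is a minimal sketch (not part of the
patch; names made up) of how a boot loader could pick up the new field, using
the header offsets listed in boot.txt:

	#include <stdint.h>
	#include <string.h>

	/* Maximum command-line length (excluding the trailing NUL) for a
	 * kernel image whose real-mode header starts at "hdr". */
	static uint32_t max_cmdline_len(const uint8_t *hdr)
	{
		uint16_t version;
		uint32_t size;

		if (memcmp(hdr + 0x202, "HdrS", 4) != 0)
			return 255;		/* pre-2.00 protocol */
		memcpy(&version, hdr + 0x206, 2);
		if (version < 0x0206)
			return 255;		/* cmdline_size not present */
		memcpy(&size, hdr + 0x238, 4);	/* cmdline_size field */
		return size;
	}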
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Alon Bar-Lev <alon.barlev@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/i386/boot.txt | 23 +++++++++++++++++------
arch/i386/boot/setup.S | 7 ++++++-
arch/x86_64/boot/setup.S | 7 ++++++-
3 files changed, 29 insertions(+), 8 deletions(-)
Index: linux/Documentation/i386/boot.txt
===================================================================
--- linux.orig/Documentation/i386/boot.txt
+++ linux/Documentation/i386/boot.txt
@@ -2,7 +2,7 @@
----------------------------
H. Peter Anvin <hpa@zytor.com>
- Last update 2007-01-26
+ Last update 2007-03-06
On the i386 platform, the Linux kernel uses a rather complicated boot
convention. This has evolved partially due to historical aspects, as
@@ -35,9 +35,13 @@ Protocol 2.03: (Kernel 2.4.18-pre1) Expl
initrd address available to the bootloader.
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
+
Protocol 2.05: (Kernel 2.6.20) Make protected mode kernel relocatable.
Introduce relocatable_kernel and kernel_alignment fields.
+Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
+ the boot command line
+
**** MEMORY LAYOUT
@@ -133,6 +137,8 @@ Offset Proto Name Meaning
022C/4 2.03+ initrd_addr_max Highest legal initrd address
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not
+0235/3 N/A pad2 Unused
+0238/4 2.06+ cmdline_size Maximum size of the kernel command line
(1) For backwards compatibility, if the setup_sects field contains 0, the
real value is 4.
@@ -233,6 +239,12 @@ filled out, however:
if your ramdisk is exactly 131072 bytes long and this field is
0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
+ cmdline_size:
+ The maximum size of the command line without the terminating
+ zero. This means that the command line can contain at most
+ cmdline_size characters. With protocol version 2.05 and
+ earlier, the maximum size was 255.
+
**** THE KERNEL COMMAND LINE
@@ -241,11 +253,10 @@ loader to communicate with the kernel.
relevant to the boot loader itself, see "special command line options"
below.
-The kernel command line is a null-terminated string currently up to
-255 characters long, plus the final null. A string that is too long
-will be automatically truncated by the kernel, a boot loader may allow
-a longer command line to be passed to permit future kernels to extend
-this limit.
+The kernel command line is a null-terminated string. The maximum
+length can be retrieved from the field cmdline_size. Before protocol
+version 2.06, the maximum was 255 characters. A string that is too
+long will be automatically truncated by the kernel.
If the boot protocol version is 2.02 or later, the address of the
kernel command line is given by the header field cmd_line_ptr (see
Index: linux/arch/i386/boot/setup.S
===================================================================
--- linux.orig/arch/i386/boot/setup.S
+++ linux/arch/i386/boot/setup.S
@@ -52,6 +52,7 @@
#include <asm/boot.h>
#include <asm/e820.h>
#include <asm/page.h>
+#include <asm/setup.h>
/* Signature words to ensure LILO loaded us right */
#define SIG1 0xAA55
@@ -81,7 +82,7 @@ start:
# This is the setup header, and it must start at %cs:2 (old 0x9020:2)
.ascii "HdrS" # header signature
- .word 0x0205 # header version number (>= 0x0105)
+ .word 0x0206 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
@@ -171,6 +172,10 @@ relocatable_kernel: .byte 0
pad2: .byte 0
pad3: .word 0
+cmdline_size: .long COMMAND_LINE_SIZE-1 #length of the command line,
+ #added with boot protocol
+ #version 2.06
+
trampoline: call start_of_setup
.align 16
# The offset at this point is 0x240
Index: linux/arch/x86_64/boot/setup.S
===================================================================
--- linux.orig/arch/x86_64/boot/setup.S
+++ linux/arch/x86_64/boot/setup.S
@@ -51,6 +51,7 @@
#include <asm/boot.h>
#include <asm/e820.h>
#include <asm/page.h>
+#include <asm/setup.h>
/* Signature words to ensure LILO loaded us right */
#define SIG1 0xAA55
@@ -80,7 +81,7 @@ start:
# This is the setup header, and it must start at %cs:2 (old 0x9020:2)
.ascii "HdrS" # header signature
- .word 0x0205 # header version number (>= 0x0105)
+ .word 0x0206 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
@@ -165,6 +166,10 @@ relocatable_kernel: .byte 0
pad2: .byte 0
pad3: .word 0
+cmdline_size: .long COMMAND_LINE_SIZE-1 #length of the command line,
+ #added with boot protocol
+ #version 2.06
+
trampoline: call start_of_setup
.align 16
# The offset at this point is 0x240
* [PATCH] [30/48] i386: Use per-cpu variables for GDT, PDA
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (27 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [29/48] x86: add command line length to boot protocol Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [31/48] i386: Use per-cpu GDT immediately upon boot Andi Kleen
` (17 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
Allocating PDA and GDT at boot is a pain. Using simple per-cpu variables adds
happiness (although we need the GDT page-aligned for Xen, which we do in a
followup patch).
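The conversion is mechanical for users of these symbols; roughly (names as in
the diff below):

	/* before: pointer array filled by boot-time allocation */
	struct i386_pda *pda = _cpu_pda[cpu];
	/* after: a static per-cpu variable, no allocation needed */
	struct i386_pda *pda = &per_cpu(_cpu_pda, cpu);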
[akpm@linux-foundation.org: build fix]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/cpu/common.c | 94 ++++-------------------------------
arch/i386/kernel/smpboot.c | 21 -------
arch/i386/mach-voyager/voyager_smp.c | 10 ---
include/asm-generic/percpu.h | 1
include/asm-i386/desc.h | 1
include/asm-i386/pda.h | 7 +-
include/asm-i386/processor.h | 2
7 files changed, 21 insertions(+), 115 deletions(-)
Index: linux/arch/i386/kernel/cpu/common.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/common.c
+++ linux/arch/i386/kernel/cpu/common.c
@@ -25,8 +25,10 @@
DEFINE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr);
EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);
-struct i386_pda *_cpu_pda[NR_CPUS] __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
+DEFINE_PER_CPU(struct desc_struct, cpu_gdt[GDT_ENTRIES]);
+
+DEFINE_PER_CPU(struct i386_pda, _cpu_pda);
+EXPORT_PER_CPU_SYMBOL(_cpu_pda);
static int cachesize_override __cpuinitdata = -1;
static int disable_x86_fxsr __cpuinitdata;
@@ -609,52 +611,6 @@ struct pt_regs * __devinit idle_regs(str
return regs;
}
-static __cpuinit int alloc_gdt(int cpu)
-{
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
- struct desc_struct *gdt;
- struct i386_pda *pda;
-
- gdt = (struct desc_struct *)cpu_gdt_descr->address;
- pda = cpu_pda(cpu);
-
- /*
- * This is a horrible hack to allocate the GDT. The problem
- * is that cpu_init() is called really early for the boot CPU
- * (and hence needs bootmem) but much later for the secondary
- * CPUs, when bootmem will have gone away
- */
- if (NODE_DATA(0)->bdata->node_bootmem_map) {
- BUG_ON(gdt != NULL || pda != NULL);
-
- gdt = alloc_bootmem_pages(PAGE_SIZE);
- pda = alloc_bootmem(sizeof(*pda));
- /* alloc_bootmem(_pages) panics on failure, so no check */
-
- memset(gdt, 0, PAGE_SIZE);
- memset(pda, 0, sizeof(*pda));
- } else {
- /* GDT and PDA might already have been allocated if
- this is a CPU hotplug re-insertion. */
- if (gdt == NULL)
- gdt = (struct desc_struct *)get_zeroed_page(GFP_KERNEL);
-
- if (pda == NULL)
- pda = kmalloc_node(sizeof(*pda), GFP_KERNEL, cpu_to_node(cpu));
-
- if (unlikely(!gdt || !pda)) {
- free_pages((unsigned long)gdt, 0);
- kfree(pda);
- return 0;
- }
- }
-
- cpu_gdt_descr->address = (unsigned long)gdt;
- cpu_pda(cpu) = pda;
-
- return 1;
-}
-
/* Initial PDA used by boot CPU */
struct i386_pda boot_pda = {
._pda = &boot_pda,
@@ -670,31 +626,17 @@ static inline void set_kernel_fs(void)
asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_PDA) : "memory");
}
-/* Initialize the CPU's GDT and PDA. The boot CPU does this for
- itself, but secondaries find this done for them. */
-__cpuinit int init_gdt(int cpu, struct task_struct *idle)
+/* Initialize the CPU's GDT and PDA. This is either the boot CPU doing itself
+ (still using cpu_gdt_table), or a CPU doing it for a secondary which
+ will soon come up. */
+__cpuinit void init_gdt(int cpu, struct task_struct *idle)
{
struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
- struct desc_struct *gdt;
- struct i386_pda *pda;
-
- /* For non-boot CPUs, the GDT and PDA should already have been
- allocated. */
- if (!alloc_gdt(cpu)) {
- printk(KERN_CRIT "CPU%d failed to allocate GDT or PDA\n", cpu);
- return 0;
- }
-
- gdt = (struct desc_struct *)cpu_gdt_descr->address;
- pda = cpu_pda(cpu);
-
- BUG_ON(gdt == NULL || pda == NULL);
+ struct desc_struct *gdt = per_cpu(cpu_gdt, cpu);
+ struct i386_pda *pda = &per_cpu(_cpu_pda, cpu);
- /*
- * Initialize the per-CPU GDT with the boot GDT,
- * and set up the GDT descriptor:
- */
memcpy(gdt, cpu_gdt_table, GDT_SIZE);
+ cpu_gdt_descr->address = (unsigned long)gdt;
cpu_gdt_descr->size = GDT_SIZE - 1;
pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
@@ -706,17 +648,12 @@ __cpuinit int init_gdt(int cpu, struct t
pda->_pda = pda;
pda->cpu_number = cpu;
pda->pcurrent = idle;
-
- return 1;
}
void __cpuinit cpu_set_gdt(int cpu)
{
struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
- /* Reinit these anyway, even if they've already been done (on
- the boot CPU, this will transition from the boot gdt+pda to
- the real ones). */
load_gdt(cpu_gdt_descr);
set_kernel_fs();
}
@@ -804,13 +741,8 @@ void __cpuinit cpu_init(void)
struct task_struct *curr = current;
/* Set up the real GDT and PDA, so we can transition from the
- boot versions. */
- if (!init_gdt(cpu, curr)) {
- /* failed to allocate something; not much we can do... */
- for (;;)
- local_irq_enable();
- }
-
+ boot_gdt_table & boot_pda. */
+ init_gdt(cpu, curr);
cpu_set_gdt(cpu);
_cpu_init(cpu, curr);
}
Index: linux/arch/i386/kernel/smpboot.c
===================================================================
--- linux.orig/arch/i386/kernel/smpboot.c
+++ linux/arch/i386/kernel/smpboot.c
@@ -808,13 +808,7 @@ static int __cpuinit do_boot_cpu(int api
if (IS_ERR(idle))
panic("failed fork for CPU %d", cpu);
- /* Pre-allocate and initialize the CPU's GDT and PDA so it
- doesn't have to do any memory allocation during the
- delicate CPU-bringup phase. */
- if (!init_gdt(cpu, idle)) {
- printk(KERN_INFO "Couldn't allocate GDT/PDA for CPU %d\n", cpu);
- return -1; /* ? */
- }
+ init_gdt(cpu, idle);
idle->thread.eip = (unsigned long) start_secondary;
/* start_eip had better be page-aligned! */
@@ -940,7 +934,6 @@ static int __cpuinit __smp_prepare_cpu(i
DECLARE_COMPLETION_ONSTACK(done);
struct warm_boot_cpu_info info;
int apicid, ret;
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
apicid = x86_cpu_to_apicid[cpu];
if (apicid == BAD_APICID) {
@@ -948,18 +941,6 @@ static int __cpuinit __smp_prepare_cpu(i
goto exit;
}
- /*
- * the CPU isn't initialized at boot time, allocate gdt table here.
- * cpu_init will initialize it
- */
- if (!cpu_gdt_descr->address) {
- cpu_gdt_descr->address = get_zeroed_page(GFP_KERNEL);
- if (!cpu_gdt_descr->address)
- printk(KERN_CRIT "CPU%d failed to allocate GDT\n", cpu);
- ret = -ENOMEM;
- goto exit;
- }
-
info.complete = &done;
info.apicid = apicid;
info.cpu = cpu;
Index: linux/arch/i386/mach-voyager/voyager_smp.c
===================================================================
--- linux.orig/arch/i386/mach-voyager/voyager_smp.c
+++ linux/arch/i386/mach-voyager/voyager_smp.c
@@ -580,15 +580,7 @@ do_boot_cpu(__u8 cpu)
/* init_tasks (in sched.c) is indexed logically */
stack_start.esp = (void *) idle->thread.esp;
- /* Pre-allocate and initialize the CPU's GDT and PDA so it
- doesn't have to do any memory allocation during the
- delicate CPU-bringup phase. */
- if (!init_gdt(cpu, idle)) {
- printk(KERN_INFO "Couldn't allocate GDT/PDA for CPU %d\n", cpu);
- cpucount--;
- return;
- }
-
+ init_gdt(cpu, idle);
irq_ctx_init(cpu);
/* Note: Don't modify initial ss override */
Index: linux/include/asm-generic/percpu.h
===================================================================
--- linux.orig/include/asm-generic/percpu.h
+++ linux/include/asm-generic/percpu.h
@@ -1,6 +1,7 @@
#ifndef _ASM_GENERIC_PERCPU_H_
#define _ASM_GENERIC_PERCPU_H_
#include <linux/compiler.h>
+#include <linux/threads.h>
#define __GENERIC_PER_CPU
#ifdef CONFIG_SMP
Index: linux/include/asm-i386/desc.h
===================================================================
--- linux.orig/include/asm-i386/desc.h
+++ linux/include/asm-i386/desc.h
@@ -22,6 +22,7 @@ struct Xgt_desc_struct {
extern struct Xgt_desc_struct idt_descr;
DECLARE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr);
+DECLARE_PER_CPU(struct desc_struct, cpu_gdt[GDT_ENTRIES]);
extern struct Xgt_desc_struct early_gdt_descr;
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
Index: linux/include/asm-i386/pda.h
===================================================================
--- linux.orig/include/asm-i386/pda.h
+++ linux/include/asm-i386/pda.h
@@ -8,6 +8,7 @@
#include <linux/stddef.h>
#include <linux/types.h>
+#include <asm/percpu.h>
struct i386_pda
{
@@ -18,10 +19,8 @@ struct i386_pda
struct pt_regs *irq_regs;
};
-extern struct i386_pda *_cpu_pda[];
-
-#define cpu_pda(i) (_cpu_pda[i])
-
+DECLARE_PER_CPU(struct i386_pda, _cpu_pda);
+#define cpu_pda(i) (&per_cpu(_cpu_pda, (i)))
#define pda_offset(field) offsetof(struct i386_pda, field)
extern void __bad_pda_field(void);
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -743,7 +743,7 @@ extern unsigned long boot_option_idle_ov
extern void enable_sep_cpu(void);
extern int sysenter_setup(void);
-extern int init_gdt(int cpu, struct task_struct *idle);
+extern void init_gdt(int cpu, struct task_struct *idle);
extern void cpu_set_gdt(int);
extern void secondary_cpu_init(void);
* [PATCH] [31/48] i386: Use per-cpu GDT immediately upon boot
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (28 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [30/48] i386: Use per-cpu variables for GDT, PDA Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [32/48] i386: clean up cpu_init() Andi Kleen
` (16 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
Now that we are no longer dynamically allocating the GDT, we don't need the
"cpu_gdt_table" at all: we can switch straight from "boot_gdt_table" to the
per-cpu GDT. This means initializing the cpu_gdt array in C.
The boot CPU uses the per-cpu var directly, then in smp_prepare_cpus() it
switches to the per-cpu copy just allocated. For secondary CPUs, the
early_gdt_descr is set to point directly to their per-cpu copy.
For UP the code is very simple: it keeps using the "per-cpu" GDT as per SMP,
but we never have to move.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/cpu/common.c | 72 +++++++++++++----------------------
arch/i386/kernel/head.S | 55 --------------------------
arch/i386/kernel/smpboot.c | 59 ++++++++++++++++++++++------
arch/i386/mach-voyager/voyager_smp.c | 6 --
include/asm-i386/desc.h | 2
include/asm-i386/processor.h | 1
6 files changed, 75 insertions(+), 120 deletions(-)
Index: linux/arch/i386/kernel/cpu/common.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/common.c
+++ linux/arch/i386/kernel/cpu/common.c
@@ -25,7 +25,33 @@
DEFINE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr);
EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);
-DEFINE_PER_CPU(struct desc_struct, cpu_gdt[GDT_ENTRIES]);
+DEFINE_PER_CPU(struct desc_struct, cpu_gdt[GDT_ENTRIES]) = {
+ [GDT_ENTRY_KERNEL_CS] = { 0x0000ffff, 0x00cf9a00 },
+ [GDT_ENTRY_KERNEL_DS] = { 0x0000ffff, 0x00cf9200 },
+ [GDT_ENTRY_DEFAULT_USER_CS] = { 0x0000ffff, 0x00cffa00 },
+ [GDT_ENTRY_DEFAULT_USER_DS] = { 0x0000ffff, 0x00cff200 },
+ /*
+ * Segments used for calling PnP BIOS have byte granularity.
+ * The code segments and data segments have fixed 64k limits,
+ * the transfer segment sizes are set at run time.
+ */
+ [GDT_ENTRY_PNPBIOS_CS32] = { 0x0000ffff, 0x00409a00 },/* 32-bit code */
+ [GDT_ENTRY_PNPBIOS_CS16] = { 0x0000ffff, 0x00009a00 },/* 16-bit code */
+ [GDT_ENTRY_PNPBIOS_DS] = { 0x0000ffff, 0x00009200 }, /* 16-bit data */
+ [GDT_ENTRY_PNPBIOS_TS1] = { 0x00000000, 0x00009200 },/* 16-bit data */
+ [GDT_ENTRY_PNPBIOS_TS2] = { 0x00000000, 0x00009200 },/* 16-bit data */
+ /*
+ * The APM segments have byte granularity and their bases
+ * are set at run time. All have 64k limits.
+ */
+ [GDT_ENTRY_APMBIOS_BASE] = { 0x0000ffff, 0x00409a00 },/* 32-bit code */
+ /* 16-bit code */
+ [GDT_ENTRY_APMBIOS_BASE+1] = { 0x0000ffff, 0x00009a00 },
+ [GDT_ENTRY_APMBIOS_BASE+2] = { 0x0000ffff, 0x00409200 }, /* data */
+
+ [GDT_ENTRY_ESPFIX_SS] = { 0x00000000, 0x00c09200 },
+ [GDT_ENTRY_PDA] = { 0x00000000, 0x00c09200 }, /* set in setup_pda */
+};
DEFINE_PER_CPU(struct i386_pda, _cpu_pda);
EXPORT_PER_CPU_SYMBOL(_cpu_pda);
@@ -618,46 +644,6 @@ struct i386_pda boot_pda = {
.pcurrent = &init_task,
};
-static inline void set_kernel_fs(void)
-{
- /* Set %fs for this CPU's PDA. Memory clobber is to create a
- barrier with respect to any PDA operations, so the compiler
- doesn't move any before here. */
- asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_PDA) : "memory");
-}
-
-/* Initialize the CPU's GDT and PDA. This is either the boot CPU doing itself
- (still using cpu_gdt_table), or a CPU doing it for a secondary which
- will soon come up. */
-__cpuinit void init_gdt(int cpu, struct task_struct *idle)
-{
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
- struct desc_struct *gdt = per_cpu(cpu_gdt, cpu);
- struct i386_pda *pda = &per_cpu(_cpu_pda, cpu);
-
- memcpy(gdt, cpu_gdt_table, GDT_SIZE);
- cpu_gdt_descr->address = (unsigned long)gdt;
- cpu_gdt_descr->size = GDT_SIZE - 1;
-
- pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
- (u32 *)&gdt[GDT_ENTRY_PDA].b,
- (unsigned long)pda, sizeof(*pda) - 1,
- 0x80 | DESCTYPE_S | 0x2, 0); /* present read-write data segment */
-
- memset(pda, 0, sizeof(*pda));
- pda->_pda = pda;
- pda->cpu_number = cpu;
- pda->pcurrent = idle;
-}
-
-void __cpuinit cpu_set_gdt(int cpu)
-{
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
-
- load_gdt(cpu_gdt_descr);
- set_kernel_fs();
-}
-
/* Common CPU init for both boot and secondary CPUs */
static void __cpuinit _cpu_init(int cpu, struct task_struct *curr)
{
@@ -740,10 +726,6 @@ void __cpuinit cpu_init(void)
int cpu = smp_processor_id();
struct task_struct *curr = current;
- /* Set up the real GDT and PDA, so we can transition from the
- boot_gdt_table & boot_pda. */
- init_gdt(cpu, curr);
- cpu_set_gdt(cpu);
_cpu_init(cpu, curr);
}
Index: linux/arch/i386/kernel/head.S
===================================================================
--- linux.orig/arch/i386/kernel/head.S
+++ linux/arch/i386/kernel/head.S
@@ -599,7 +599,7 @@ idt_descr:
.word 0 # 32 bit align gdt_desc.address
ENTRY(early_gdt_descr)
.word GDT_ENTRIES*8-1
- .long cpu_gdt_table
+ .long per_cpu__cpu_gdt /* Overwritten for secondary CPUs */
/*
* The boot_gdt_table must mirror the equivalent in setup.S and is
@@ -610,56 +610,3 @@ ENTRY(boot_gdt_table)
.fill GDT_ENTRY_BOOT_CS,8,0
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
-
-/*
- * The Global Descriptor Table contains 32 quadwords, per-CPU.
- */
- .align L1_CACHE_BYTES
-ENTRY(cpu_gdt_table)
- .quad 0x0000000000000000 /* NULL descriptor */
- .quad 0x0000000000000000 /* 0x0b reserved */
- .quad 0x0000000000000000 /* 0x13 reserved */
- .quad 0x0000000000000000 /* 0x1b reserved */
- .quad 0x0000000000000000 /* 0x20 unused */
- .quad 0x0000000000000000 /* 0x28 unused */
- .quad 0x0000000000000000 /* 0x33 TLS entry 1 */
- .quad 0x0000000000000000 /* 0x3b TLS entry 2 */
- .quad 0x0000000000000000 /* 0x43 TLS entry 3 */
- .quad 0x0000000000000000 /* 0x4b reserved */
- .quad 0x0000000000000000 /* 0x53 reserved */
- .quad 0x0000000000000000 /* 0x5b reserved */
-
- .quad 0x00cf9a000000ffff /* 0x60 kernel 4GB code at 0x00000000 */
- .quad 0x00cf92000000ffff /* 0x68 kernel 4GB data at 0x00000000 */
- .quad 0x00cffa000000ffff /* 0x73 user 4GB code at 0x00000000 */
- .quad 0x00cff2000000ffff /* 0x7b user 4GB data at 0x00000000 */
-
- .quad 0x0000000000000000 /* 0x80 TSS descriptor */
- .quad 0x0000000000000000 /* 0x88 LDT descriptor */
-
- /*
- * Segments used for calling PnP BIOS have byte granularity.
- * The code segments and data segments have fixed 64k limits,
- * the transfer segment sizes are set at run time.
- */
- .quad 0x00409a000000ffff /* 0x90 32-bit code */
- .quad 0x00009a000000ffff /* 0x98 16-bit code */
- .quad 0x000092000000ffff /* 0xa0 16-bit data */
- .quad 0x0000920000000000 /* 0xa8 16-bit data */
- .quad 0x0000920000000000 /* 0xb0 16-bit data */
-
- /*
- * The APM segments have byte granularity and their bases
- * are set at run time. All have 64k limits.
- */
- .quad 0x00409a000000ffff /* 0xb8 APM CS code */
- .quad 0x00009a000000ffff /* 0xc0 APM CS 16 code (16 bit) */
- .quad 0x004092000000ffff /* 0xc8 APM DS data */
-
- .quad 0x00c0920000000000 /* 0xd0 - ESPFIX SS */
- .quad 0x00cf92000000ffff /* 0xd8 - PDA */
- .quad 0x0000000000000000 /* 0xe0 - unused */
- .quad 0x0000000000000000 /* 0xe8 - unused */
- .quad 0x0000000000000000 /* 0xf0 - unused */
- .quad 0x0000000000000000 /* 0xf8 - GDT entry 31: double-fault TSS */
-
Index: linux/arch/i386/kernel/smpboot.c
===================================================================
--- linux.orig/arch/i386/kernel/smpboot.c
+++ linux/arch/i386/kernel/smpboot.c
@@ -440,12 +440,6 @@ static void __cpuinit start_secondary(vo
void __devinit initialize_secondary(void)
{
/*
- * switch to the per CPU GDT we already set up
- * in do_boot_cpu()
- */
- cpu_set_gdt(current_thread_info()->cpu);
-
- /*
* We don't actually need to load the full TSS,
* basically just the stack pointer and the eip.
*/
@@ -787,6 +781,32 @@ static inline struct task_struct * alloc
#define alloc_idle_task(cpu) fork_idle(cpu)
#endif
+/* Initialize the CPU's GDT. This is either the boot CPU doing itself
+ (still using the master per-cpu area), or a CPU doing it for a
+ secondary which will soon come up. */
+static __cpuinit void init_gdt(int cpu, struct task_struct *idle)
+{
+ struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
+ struct desc_struct *gdt = per_cpu(cpu_gdt, cpu);
+ struct i386_pda *pda = &per_cpu(_cpu_pda, cpu);
+
+ cpu_gdt_descr->address = (unsigned long)gdt;
+ cpu_gdt_descr->size = GDT_SIZE - 1;
+
+ pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
+ (u32 *)&gdt[GDT_ENTRY_PDA].b,
+ (unsigned long)pda, sizeof(*pda) - 1,
+ 0x80 | DESCTYPE_S | 0x2, 0); /* present read-write data segment */
+
+ memset(pda, 0, sizeof(*pda));
+ pda->_pda = pda;
+ pda->cpu_number = cpu;
+ pda->pcurrent = idle;
+}
+
+/* Defined in head.S */
+extern struct Xgt_desc_struct early_gdt_descr;
+
static int __cpuinit do_boot_cpu(int apicid, int cpu)
/*
* NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -809,6 +829,8 @@ static int __cpuinit do_boot_cpu(int api
panic("failed fork for CPU %d", cpu);
init_gdt(cpu, idle);
+ early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
+ start_pda = cpu_pda(cpu);
idle->thread.eip = (unsigned long) start_secondary;
/* start_eip had better be page-aligned! */
@@ -1161,13 +1183,26 @@ void __init smp_prepare_cpus(unsigned in
smp_boot_cpus(max_cpus);
}
-void __devinit smp_prepare_boot_cpu(void)
+/* Current gdt points %fs at the "master" per-cpu area: after this,
+ * it's on the real one. */
+static inline void switch_to_new_gdt(void)
{
- cpu_set(smp_processor_id(), cpu_online_map);
- cpu_set(smp_processor_id(), cpu_callout_map);
- cpu_set(smp_processor_id(), cpu_present_map);
- cpu_set(smp_processor_id(), cpu_possible_map);
- per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
+ load_gdt(&per_cpu(cpu_gdt_descr, smp_processor_id()));
+ asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_PDA) : "memory");
+}
+
+void __init smp_prepare_boot_cpu(void)
+{
+ unsigned int cpu = smp_processor_id();
+
+ init_gdt(cpu, current);
+ switch_to_new_gdt();
+
+ cpu_set(cpu, cpu_online_map);
+ cpu_set(cpu, cpu_callout_map);
+ cpu_set(cpu, cpu_present_map);
+ cpu_set(cpu, cpu_possible_map);
+ __get_cpu_var(cpu_state) = CPU_ONLINE;
}
#ifdef CONFIG_HOTPLUG_CPU
Index: linux/arch/i386/mach-voyager/voyager_smp.c
===================================================================
--- linux.orig/arch/i386/mach-voyager/voyager_smp.c
+++ linux/arch/i386/mach-voyager/voyager_smp.c
@@ -765,12 +765,6 @@ initialize_secondary(void)
#endif
/*
- * switch to the per CPU GDT we already set up
- * in do_boot_cpu()
- */
- cpu_set_gdt(current_thread_info()->cpu);
-
- /*
* We don't actually need to load the full TSS,
* basically just the stack pointer and the eip.
*/
Index: linux/include/asm-i386/desc.h
===================================================================
--- linux.orig/include/asm-i386/desc.h
+++ linux/include/asm-i386/desc.h
@@ -12,8 +12,6 @@
#include <asm/mmu.h>
-extern struct desc_struct cpu_gdt_table[GDT_ENTRIES];
-
struct Xgt_desc_struct {
unsigned short size;
unsigned long address __attribute__((packed));
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -743,7 +743,6 @@ extern unsigned long boot_option_idle_ov
extern void enable_sep_cpu(void);
extern int sysenter_setup(void);
-extern void init_gdt(int cpu, struct task_struct *idle);
extern void cpu_set_gdt(int);
extern void secondary_cpu_init(void);
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [32/48] i386: clean up cpu_init()
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (29 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [31/48] i386: Use per-cpu GDT immediately upon boot Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [33/48] i386: Rename boot_gdt_table to boot_gdt Andi Kleen
` (15 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
We now have cpu_init() and secondary_cpu_init() doing nothing but calling
_cpu_init() with the same arguments. Rename _cpu_init() to cpu_init() and use
it as a replacement for secondary_cpu_init().
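Schematically, the change is (a sketch of the call structure, not the
literal diff below):
	/* before: two trivial wrappers around one worker */
	void cpu_init(void)           { _cpu_init(smp_processor_id(), current); }
	void secondary_cpu_init(void) { _cpu_init(smp_processor_id(), current); }
	/* after: a single cpu_init() used on both boot and secondary paths */
	void cpu_init(void)
	{
		int cpu = smp_processor_id();
		struct task_struct *curr = current;
		/* ... former _cpu_init() body ... */
	}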
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/cpu/common.c | 34 +++++++++-------------------------
arch/i386/kernel/smpboot.c | 8 ++++----
include/asm-i386/processor.h | 2 +-
3 files changed, 14 insertions(+), 30 deletions(-)
Index: linux/arch/i386/kernel/cpu/common.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/common.c
+++ linux/arch/i386/kernel/cpu/common.c
@@ -644,9 +644,16 @@ struct i386_pda boot_pda = {
.pcurrent = &init_task,
};
-/* Common CPU init for both boot and secondary CPUs */
-static void __cpuinit _cpu_init(int cpu, struct task_struct *curr)
+/*
+ * cpu_init() initializes state that is per-CPU. Some data is already
+ * initialized (naturally) in the bootstrap process, such as the GDT
+ * and IDT. We reload them nevertheless, this function acts as a
+ * 'CPU state barrier', nothing should get across.
+ */
+void __cpuinit cpu_init(void)
{
+ int cpu = smp_processor_id();
+ struct task_struct *curr = current;
struct tss_struct * t = &per_cpu(init_tss, cpu);
struct thread_struct *thread = &curr->thread;
@@ -706,29 +713,6 @@ static void __cpuinit _cpu_init(int cpu,
mxcsr_feature_mask_init();
}
-/* Entrypoint to initialize secondary CPU */
-void __cpuinit secondary_cpu_init(void)
-{
- int cpu = smp_processor_id();
- struct task_struct *curr = current;
-
- _cpu_init(cpu, curr);
-}
-
-/*
- * cpu_init() initializes state that is per-CPU. Some data is already
- * initialized (naturally) in the bootstrap process, such as the GDT
- * and IDT. We reload them nevertheless, this function acts as a
- * 'CPU state barrier', nothing should get across.
- */
-void __cpuinit cpu_init(void)
-{
- int cpu = smp_processor_id();
- struct task_struct *curr = current;
-
- _cpu_init(cpu, curr);
-}
-
#ifdef CONFIG_HOTPLUG_CPU
void __cpuinit cpu_uninit(void)
{
Index: linux/arch/i386/kernel/smpboot.c
===================================================================
--- linux.orig/arch/i386/kernel/smpboot.c
+++ linux/arch/i386/kernel/smpboot.c
@@ -378,14 +378,14 @@ set_cpu_sibling_map(int cpu)
static void __cpuinit start_secondary(void *unused)
{
/*
- * Don't put *anything* before secondary_cpu_init(), SMP
- * booting is too fragile that we want to limit the
- * things done here to the most necessary things.
+ * Don't put *anything* before cpu_init(), SMP booting is too
+ * fragile that we want to limit the things done here to the
+ * most necessary things.
*/
#ifdef CONFIG_VMI
vmi_bringup();
#endif
- secondary_cpu_init();
+ cpu_init();
preempt_disable();
smp_callin();
while (!cpu_isset(smp_processor_id(), smp_commenced_mask))
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -744,6 +744,6 @@ extern void enable_sep_cpu(void);
extern int sysenter_setup(void);
extern void cpu_set_gdt(int);
-extern void secondary_cpu_init(void);
+extern void cpu_init(void);
#endif /* __ASM_I386_PROCESSOR_H */
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [33/48] i386: Rename boot_gdt_table to boot_gdt
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (30 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [32/48] i386: clean up cpu_init() Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [34/48] i386: rationalize paravirt wrappers Andi Kleen
` (14 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Sebastien Dugue, Andi Kleen, patches, linux-kernel
From: Sebastien Dugue <sebastien.dugue@bull.net>
Rename boot_gdt_table to boot_gdt to avoid the duplicate T(able).
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/head.S | 9 ++++-----
arch/i386/kernel/trampoline.S | 12 ++++++------
2 files changed, 10 insertions(+), 11 deletions(-)
Index: linux/arch/i386/kernel/head.S
===================================================================
--- linux.orig/arch/i386/kernel/head.S
+++ linux/arch/i386/kernel/head.S
@@ -147,8 +147,7 @@ page_pde_offset = (__PAGE_OFFSET >> 20);
/*
* Non-boot CPU entry point; entered from trampoline.S
* We can't lgdt here, because lgdt itself uses a data segment, but
- * we know the trampoline has already loaded the boot_gdt_table GDT
- * for us.
+ * we know the trampoline has already loaded the boot_gdt for us.
*
* If cpu hotplug is not supported then this code can go in init section
* which will be freed later
@@ -588,7 +587,7 @@ fault_msg:
.word 0 # 32 bit align gdt_desc.address
boot_gdt_descr:
.word __BOOT_DS+7
- .long boot_gdt_table - __PAGE_OFFSET
+ .long boot_gdt - __PAGE_OFFSET
.word 0 # 32-bit align idt_desc.address
idt_descr:
@@ -602,11 +601,11 @@ ENTRY(early_gdt_descr)
.long per_cpu__cpu_gdt /* Overwritten for secondary CPUs */
/*
- * The boot_gdt_table must mirror the equivalent in setup.S and is
+ * The boot_gdt must mirror the equivalent in setup.S and is
* used only for booting.
*/
.align L1_CACHE_BYTES
-ENTRY(boot_gdt_table)
+ENTRY(boot_gdt)
.fill GDT_ENTRY_BOOT_CS,8,0
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
Index: linux/arch/i386/kernel/trampoline.S
===================================================================
--- linux.orig/arch/i386/kernel/trampoline.S
+++ linux/arch/i386/kernel/trampoline.S
@@ -29,7 +29,7 @@
*
* TYPE VALUE
* R_386_32 startup_32_smp
- * R_386_32 boot_gdt_table
+ * R_386_32 boot_gdt
*/
#include <linux/linkage.h>
@@ -62,8 +62,8 @@ r_base = .
* to 32 bit.
*/
- lidtl boot_idt - r_base # load idt with 0, 0
- lgdtl boot_gdt - r_base # load gdt with whatever is appropriate
+ lidtl boot_idt_descr - r_base # load idt with 0, 0
+ lgdtl boot_gdt_descr - r_base # load gdt with whatever is appropriate
xor %ax, %ax
inc %ax # protected mode (PE) bit
@@ -73,11 +73,11 @@ r_base = .
# These need to be in the same 64K segment as the above;
# hence we don't use the boot_gdt_descr defined in head.S
-boot_gdt:
+boot_gdt_descr:
.word __BOOT_DS + 7 # gdt limit
- .long boot_gdt_table-__PAGE_OFFSET # gdt base
+ .long boot_gdt - __PAGE_OFFSET # gdt base
-boot_idt:
+boot_idt_descr:
.word 0 # idt limit = 0
.long 0 # idt base = 0L
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [34/48] i386: rationalize paravirt wrappers
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (31 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [33/48] i386: Rename boot_gdt_table to boot_gdt Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [35/48] x86: tighten kernel image page access rights Andi Kleen
` (13 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, Avi Kivity, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
paravirt.c used to implement native versions of all low-level
functions. It is far cleaner to expose the native versions in the
headers as inline native_XXX functions and, if !CONFIG_PARAVIRT, to
simply #define XXX native_XXX.
There are several nice side effects:
1) write_dt_entry() now takes the correct "struct Xgt_desc_struct *"
not "void *".
2) load_TLS is a for loop again, rather than manually unrolled code
guarded by a #error in case the bounds ever change.
3) Macros become inlines, with type checking.
4) Access to the native versions is trivial for KVM, lguest, Xen and
others who might want it.
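For illustration, the resulting header pattern looks roughly like this
(a simplified sketch using clts(); the full set of functions is in the
diff below):
	/* native version always visible, an inline with type checking */
	static inline void native_clts(void)
	{
		asm volatile ("clts");
	}
	#ifdef CONFIG_PARAVIRT
	#include <asm/paravirt.h>	/* clts() dispatches via paravirt_ops */
	#else
	#define clts() (native_clts())
	#endif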
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/paravirt.c | 293 -------------------------------------------
include/asm-i386/desc.h | 82 ++++++++----
include/asm-i386/io.h | 15 +-
include/asm-i386/irqflags.h | 61 ++++++--
include/asm-i386/msr.h | 165 ++++++++++++++++--------
include/asm-i386/paravirt.h | 17 +-
include/asm-i386/processor.h | 94 ++++++++++---
include/asm-i386/system.h | 137 ++++++++++++--------
8 files changed, 389 insertions(+), 475 deletions(-)
Index: linux/arch/i386/kernel/paravirt.c
===================================================================
--- linux.orig/arch/i386/kernel/paravirt.c
+++ linux/arch/i386/kernel/paravirt.c
@@ -93,294 +93,11 @@ static unsigned native_patch(u8 type, u1
return insn_len;
}
-static unsigned long native_get_debugreg(int regno)
-{
- unsigned long val = 0; /* Damn you, gcc! */
-
- switch (regno) {
- case 0:
- asm("movl %%db0, %0" :"=r" (val)); break;
- case 1:
- asm("movl %%db1, %0" :"=r" (val)); break;
- case 2:
- asm("movl %%db2, %0" :"=r" (val)); break;
- case 3:
- asm("movl %%db3, %0" :"=r" (val)); break;
- case 6:
- asm("movl %%db6, %0" :"=r" (val)); break;
- case 7:
- asm("movl %%db7, %0" :"=r" (val)); break;
- default:
- BUG();
- }
- return val;
-}
-
-static void native_set_debugreg(int regno, unsigned long value)
-{
- switch (regno) {
- case 0:
- asm("movl %0,%%db0" : /* no output */ :"r" (value));
- break;
- case 1:
- asm("movl %0,%%db1" : /* no output */ :"r" (value));
- break;
- case 2:
- asm("movl %0,%%db2" : /* no output */ :"r" (value));
- break;
- case 3:
- asm("movl %0,%%db3" : /* no output */ :"r" (value));
- break;
- case 6:
- asm("movl %0,%%db6" : /* no output */ :"r" (value));
- break;
- case 7:
- asm("movl %0,%%db7" : /* no output */ :"r" (value));
- break;
- default:
- BUG();
- }
-}
-
void init_IRQ(void)
{
paravirt_ops.init_IRQ();
}
-static void native_clts(void)
-{
- asm volatile ("clts");
-}
-
-static unsigned long native_read_cr0(void)
-{
- unsigned long val;
- asm volatile("movl %%cr0,%0\n\t" :"=r" (val));
- return val;
-}
-
-static void native_write_cr0(unsigned long val)
-{
- asm volatile("movl %0,%%cr0": :"r" (val));
-}
-
-static unsigned long native_read_cr2(void)
-{
- unsigned long val;
- asm volatile("movl %%cr2,%0\n\t" :"=r" (val));
- return val;
-}
-
-static void native_write_cr2(unsigned long val)
-{
- asm volatile("movl %0,%%cr2": :"r" (val));
-}
-
-static unsigned long native_read_cr3(void)
-{
- unsigned long val;
- asm volatile("movl %%cr3,%0\n\t" :"=r" (val));
- return val;
-}
-
-static void native_write_cr3(unsigned long val)
-{
- asm volatile("movl %0,%%cr3": :"r" (val));
-}
-
-static unsigned long native_read_cr4(void)
-{
- unsigned long val;
- asm volatile("movl %%cr4,%0\n\t" :"=r" (val));
- return val;
-}
-
-static unsigned long native_read_cr4_safe(void)
-{
- unsigned long val;
- /* This could fault if %cr4 does not exist */
- asm("1: movl %%cr4, %0 \n"
- "2: \n"
- ".section __ex_table,\"a\" \n"
- ".long 1b,2b \n"
- ".previous \n"
- : "=r" (val): "0" (0));
- return val;
-}
-
-static void native_write_cr4(unsigned long val)
-{
- asm volatile("movl %0,%%cr4": :"r" (val));
-}
-
-static unsigned long native_save_fl(void)
-{
- unsigned long f;
- asm volatile("pushfl ; popl %0":"=g" (f): /* no input */);
- return f;
-}
-
-static void native_restore_fl(unsigned long f)
-{
- asm volatile("pushl %0 ; popfl": /* no output */
- :"g" (f)
- :"memory", "cc");
-}
-
-static void native_irq_disable(void)
-{
- asm volatile("cli": : :"memory");
-}
-
-static void native_irq_enable(void)
-{
- asm volatile("sti": : :"memory");
-}
-
-static void native_safe_halt(void)
-{
- asm volatile("sti; hlt": : :"memory");
-}
-
-static void native_halt(void)
-{
- asm volatile("hlt": : :"memory");
-}
-
-static void native_wbinvd(void)
-{
- asm volatile("wbinvd": : :"memory");
-}
-
-static unsigned long long native_read_msr(unsigned int msr, int *err)
-{
- unsigned long long val;
-
- asm volatile("2: rdmsr ; xorl %0,%0\n"
- "1:\n\t"
- ".section .fixup,\"ax\"\n\t"
- "3: movl %3,%0 ; jmp 1b\n\t"
- ".previous\n\t"
- ".section __ex_table,\"a\"\n"
- " .align 4\n\t"
- " .long 2b,3b\n\t"
- ".previous"
- : "=r" (*err), "=A" (val)
- : "c" (msr), "i" (-EFAULT));
-
- return val;
-}
-
-static int native_write_msr(unsigned int msr, unsigned long long val)
-{
- int err;
- asm volatile("2: wrmsr ; xorl %0,%0\n"
- "1:\n\t"
- ".section .fixup,\"ax\"\n\t"
- "3: movl %4,%0 ; jmp 1b\n\t"
- ".previous\n\t"
- ".section __ex_table,\"a\"\n"
- " .align 4\n\t"
- " .long 2b,3b\n\t"
- ".previous"
- : "=a" (err)
- : "c" (msr), "0" ((u32)val), "d" ((u32)(val>>32)),
- "i" (-EFAULT));
- return err;
-}
-
-static unsigned long long native_read_tsc(void)
-{
- unsigned long long val;
- asm volatile("rdtsc" : "=A" (val));
- return val;
-}
-
-static unsigned long long native_read_pmc(void)
-{
- unsigned long long val;
- asm volatile("rdpmc" : "=A" (val));
- return val;
-}
-
-static void native_load_tr_desc(void)
-{
- asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
-}
-
-static void native_load_gdt(const struct Xgt_desc_struct *dtr)
-{
- asm volatile("lgdt %0"::"m" (*dtr));
-}
-
-static void native_load_idt(const struct Xgt_desc_struct *dtr)
-{
- asm volatile("lidt %0"::"m" (*dtr));
-}
-
-static void native_store_gdt(struct Xgt_desc_struct *dtr)
-{
- asm ("sgdt %0":"=m" (*dtr));
-}
-
-static void native_store_idt(struct Xgt_desc_struct *dtr)
-{
- asm ("sidt %0":"=m" (*dtr));
-}
-
-static unsigned long native_store_tr(void)
-{
- unsigned long tr;
- asm ("str %0":"=r" (tr));
- return tr;
-}
-
-static void native_load_tls(struct thread_struct *t, unsigned int cpu)
-{
-#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
- C(0); C(1); C(2);
-#undef C
-}
-
-static inline void native_write_dt_entry(void *dt, int entry, u32 entry_low, u32 entry_high)
-{
- u32 *lp = (u32 *)((char *)dt + entry*8);
- lp[0] = entry_low;
- lp[1] = entry_high;
-}
-
-static void native_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
-{
- native_write_dt_entry(dt, entrynum, low, high);
-}
-
-static void native_write_gdt_entry(void *dt, int entrynum, u32 low, u32 high)
-{
- native_write_dt_entry(dt, entrynum, low, high);
-}
-
-static void native_write_idt_entry(void *dt, int entrynum, u32 low, u32 high)
-{
- native_write_dt_entry(dt, entrynum, low, high);
-}
-
-static void native_load_esp0(struct tss_struct *tss,
- struct thread_struct *thread)
-{
- tss->esp0 = thread->esp0;
-
- /* This can only happen when SEP is enabled, no need to test "SEP"arately */
- if (unlikely(tss->ss1 != thread->sysenter_cs)) {
- tss->ss1 = thread->sysenter_cs;
- wrmsr(MSR_IA32_SYSENTER_CS, thread->sysenter_cs, 0);
- }
-}
-
-static void native_io_delay(void)
-{
- asm volatile("outb %al,$0x80");
-}
-
static void native_flush_tlb(void)
{
__native_flush_tlb();
@@ -517,8 +234,8 @@ struct paravirt_ops paravirt_ops = {
.safe_halt = native_safe_halt,
.halt = native_halt,
.wbinvd = native_wbinvd,
- .read_msr = native_read_msr,
- .write_msr = native_write_msr,
+ .read_msr = native_read_msr_safe,
+ .write_msr = native_write_msr_safe,
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
.get_scheduled_cycles = native_read_tsc,
@@ -531,9 +248,9 @@ struct paravirt_ops paravirt_ops = {
.store_idt = native_store_idt,
.store_tr = native_store_tr,
.load_tls = native_load_tls,
- .write_ldt_entry = native_write_ldt_entry,
- .write_gdt_entry = native_write_gdt_entry,
- .write_idt_entry = native_write_idt_entry,
+ .write_ldt_entry = write_dt_entry,
+ .write_gdt_entry = write_dt_entry,
+ .write_idt_entry = write_dt_entry,
.load_esp0 = native_load_esp0,
.set_iopl_mask = native_set_iopl_mask,
Index: linux/include/asm-i386/desc.h
===================================================================
--- linux.orig/include/asm-i386/desc.h
+++ linux/include/asm-i386/desc.h
@@ -57,45 +57,33 @@ static inline void pack_gate(__u32 *a, _
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
-#define load_TR_desc() __asm__ __volatile__("ltr %w0"::"q" (GDT_ENTRY_TSS*8))
-
-#define load_gdt(dtr) __asm__ __volatile("lgdt %0"::"m" (*dtr))
-#define load_idt(dtr) __asm__ __volatile("lidt %0"::"m" (*dtr))
+#define load_TR_desc() native_load_tr_desc()
+#define load_gdt(dtr) native_load_gdt(dtr)
+#define load_idt(dtr) native_load_idt(dtr)
#define load_tr(tr) __asm__ __volatile("ltr %0"::"m" (tr))
#define load_ldt(ldt) __asm__ __volatile("lldt %0"::"m" (ldt))
-#define store_gdt(dtr) __asm__ ("sgdt %0":"=m" (*dtr))
-#define store_idt(dtr) __asm__ ("sidt %0":"=m" (*dtr))
-#define store_tr(tr) __asm__ ("str %0":"=m" (tr))
+#define store_gdt(dtr) native_store_gdt(dtr)
+#define store_idt(dtr) native_store_idt(dtr)
+#define store_tr(tr) (tr = native_store_tr())
#define store_ldt(ldt) __asm__ ("sldt %0":"=m" (ldt))
-#if TLS_SIZE != 24
-# error update this code.
-#endif
-
-static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
-{
-#define C(i) get_cpu_gdt_table(cpu)[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i]
- C(0); C(1); C(2);
-#undef C
-}
+#define load_TLS(t, cpu) native_load_tls(t, cpu)
+#define set_ldt native_set_ldt
#define write_ldt_entry(dt, entry, a, b) write_dt_entry(dt, entry, a, b)
#define write_gdt_entry(dt, entry, a, b) write_dt_entry(dt, entry, a, b)
#define write_idt_entry(dt, entry, a, b) write_dt_entry(dt, entry, a, b)
+#endif
-static inline void write_dt_entry(void *dt, int entry, __u32 entry_a, __u32 entry_b)
+static inline void write_dt_entry(struct desc_struct *dt,
+ int entry, u32 entry_low, u32 entry_high)
{
- __u32 *lp = (__u32 *)((char *)dt + entry*8);
- *lp = entry_a;
- *(lp+1) = entry_b;
+ dt[entry].a = entry_low;
+ dt[entry].b = entry_high;
}
-#define set_ldt native_set_ldt
-#endif /* CONFIG_PARAVIRT */
-
-static inline fastcall void native_set_ldt(const void *addr,
- unsigned int entries)
+static inline void native_set_ldt(const void *addr, unsigned int entries)
{
if (likely(entries == 0))
__asm__ __volatile__("lldt %w0"::"q" (0));
@@ -111,6 +99,48 @@ static inline fastcall void native_set_l
}
}
+
+static inline void native_load_tr_desc(void)
+{
+ asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
+}
+
+static inline void native_load_gdt(const struct Xgt_desc_struct *dtr)
+{
+ asm volatile("lgdt %0"::"m" (*dtr));
+}
+
+static inline void native_load_idt(const struct Xgt_desc_struct *dtr)
+{
+ asm volatile("lidt %0"::"m" (*dtr));
+}
+
+static inline void native_store_gdt(struct Xgt_desc_struct *dtr)
+{
+ asm ("sgdt %0":"=m" (*dtr));
+}
+
+static inline void native_store_idt(struct Xgt_desc_struct *dtr)
+{
+ asm ("sidt %0":"=m" (*dtr));
+}
+
+static inline unsigned long native_store_tr(void)
+{
+ unsigned long tr;
+ asm ("str %0":"=r" (tr));
+ return tr;
+}
+
+static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
+{
+ unsigned int i;
+ struct desc_struct *gdt = get_cpu_gdt_table(cpu);
+
+ for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
+ gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i];
+}
+
static inline void _set_gate(int gate, unsigned int type, void *addr, unsigned short seg)
{
__u32 a, b;
Index: linux/include/asm-i386/io.h
===================================================================
--- linux.orig/include/asm-i386/io.h
+++ linux/include/asm-i386/io.h
@@ -250,19 +250,22 @@ static inline void flush_write_buffers(v
#endif /* __KERNEL__ */
+static inline void native_io_delay(void)
+{
+ asm volatile("outb %%al,$0x80" : : : "memory");
+}
+
#if defined(CONFIG_PARAVIRT)
#include <asm/paravirt.h>
#else
-#define __SLOW_DOWN_IO "outb %%al,$0x80;"
-
static inline void slow_down_io(void) {
- __asm__ __volatile__(
- __SLOW_DOWN_IO
+ native_io_delay();
#ifdef REALLY_SLOW_IO
- __SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO
+ native_io_delay();
+ native_io_delay();
+ native_io_delay();
#endif
- : : );
}
#endif
Index: linux/include/asm-i386/irqflags.h
===================================================================
--- linux.orig/include/asm-i386/irqflags.h
+++ linux/include/asm-i386/irqflags.h
@@ -10,6 +10,42 @@
#ifndef _ASM_IRQFLAGS_H
#define _ASM_IRQFLAGS_H
+#ifndef __ASSEMBLY__
+static inline unsigned long native_save_fl(void)
+{
+ unsigned long f;
+ asm volatile("pushfl ; popl %0":"=g" (f): /* no input */);
+ return f;
+}
+
+static inline void native_restore_fl(unsigned long f)
+{
+ asm volatile("pushl %0 ; popfl": /* no output */
+ :"g" (f)
+ :"memory", "cc");
+}
+
+static inline void native_irq_disable(void)
+{
+ asm volatile("cli": : :"memory");
+}
+
+static inline void native_irq_enable(void)
+{
+ asm volatile("sti": : :"memory");
+}
+
+static inline void native_safe_halt(void)
+{
+ asm volatile("sti; hlt": : :"memory");
+}
+
+static inline void native_halt(void)
+{
+ asm volatile("hlt": : :"memory");
+}
+#endif /* __ASSEMBLY__ */
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
@@ -17,35 +53,22 @@
static inline unsigned long __raw_local_save_flags(void)
{
- unsigned long flags;
-
- __asm__ __volatile__(
- "pushfl ; popl %0"
- : "=g" (flags)
- : /* no input */
- );
-
- return flags;
+ return native_save_fl();
}
static inline void raw_local_irq_restore(unsigned long flags)
{
- __asm__ __volatile__(
- "pushl %0 ; popfl"
- : /* no output */
- :"g" (flags)
- :"memory", "cc"
- );
+ native_restore_fl(flags);
}
static inline void raw_local_irq_disable(void)
{
- __asm__ __volatile__("cli" : : : "memory");
+ native_irq_disable();
}
static inline void raw_local_irq_enable(void)
{
- __asm__ __volatile__("sti" : : : "memory");
+ native_irq_enable();
}
/*
@@ -54,7 +77,7 @@ static inline void raw_local_irq_enable(
*/
static inline void raw_safe_halt(void)
{
- __asm__ __volatile__("sti; hlt" : : : "memory");
+ native_safe_halt();
}
/*
@@ -63,7 +86,7 @@ static inline void raw_safe_halt(void)
*/
static inline void halt(void)
{
- __asm__ __volatile__("hlt": : :"memory");
+ native_halt();
}
/*
Index: linux/include/asm-i386/msr.h
===================================================================
--- linux.orig/include/asm-i386/msr.h
+++ linux/include/asm-i386/msr.h
@@ -1,6 +1,74 @@
#ifndef __ASM_MSR_H
#define __ASM_MSR_H
+#include <asm/errno.h>
+
+static inline unsigned long long native_read_msr(unsigned int msr)
+{
+ unsigned long long val;
+
+ asm volatile("rdmsr" : "=A" (val) : "c" (msr));
+ return val;
+}
+
+static inline unsigned long long native_read_msr_safe(unsigned int msr,
+ int *err)
+{
+ unsigned long long val;
+
+ asm volatile("2: rdmsr ; xorl %0,%0\n"
+ "1:\n\t"
+ ".section .fixup,\"ax\"\n\t"
+ "3: movl %3,%0 ; jmp 1b\n\t"
+ ".previous\n\t"
+ ".section __ex_table,\"a\"\n"
+ " .align 4\n\t"
+ " .long 2b,3b\n\t"
+ ".previous"
+ : "=r" (*err), "=A" (val)
+ : "c" (msr), "i" (-EFAULT));
+
+ return val;
+}
+
+static inline void native_write_msr(unsigned int msr, unsigned long long val)
+{
+ asm volatile("wrmsr" : : "c" (msr), "A"(val));
+}
+
+static inline int native_write_msr_safe(unsigned int msr,
+ unsigned long long val)
+{
+ int err;
+ asm volatile("2: wrmsr ; xorl %0,%0\n"
+ "1:\n\t"
+ ".section .fixup,\"ax\"\n\t"
+ "3: movl %4,%0 ; jmp 1b\n\t"
+ ".previous\n\t"
+ ".section __ex_table,\"a\"\n"
+ " .align 4\n\t"
+ " .long 2b,3b\n\t"
+ ".previous"
+ : "=a" (err)
+ : "c" (msr), "0" ((u32)val), "d" ((u32)(val>>32)),
+ "i" (-EFAULT));
+ return err;
+}
+
+static inline unsigned long long native_read_tsc(void)
+{
+ unsigned long long val;
+ asm volatile("rdtsc" : "=A" (val));
+ return val;
+}
+
+static inline unsigned long long native_read_pmc(void)
+{
+ unsigned long long val;
+ asm volatile("rdpmc" : "=A" (val));
+ return val;
+}
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
@@ -11,22 +79,20 @@
* pointer indirection), this allows gcc to optimize better
*/
-#define rdmsr(msr,val1,val2) \
- __asm__ __volatile__("rdmsr" \
- : "=a" (val1), "=d" (val2) \
- : "c" (msr))
-
-#define wrmsr(msr,val1,val2) \
- __asm__ __volatile__("wrmsr" \
- : /* no outputs */ \
- : "c" (msr), "a" (val1), "d" (val2))
-
-#define rdmsrl(msr,val) do { \
- unsigned long l__,h__; \
- rdmsr (msr, l__, h__); \
- val = l__; \
- val |= ((u64)h__<<32); \
-} while(0)
+#define rdmsr(msr,val1,val2) \
+ do { \
+ unsigned long long __val = native_read_msr(msr); \
+ val1 = __val; \
+ val2 = __val >> 32; \
+ } while(0)
+
+#define wrmsr(msr,val1,val2) \
+ native_write_msr(msr, ((unsigned long long)val2 << 32) | val1)
+
+#define rdmsrl(msr,val) \
+ do { \
+ (val) = native_read_msr(msr); \
+ } while(0)
static inline void wrmsrl (unsigned long msr, unsigned long long val)
{
@@ -37,50 +103,41 @@ static inline void wrmsrl (unsigned long
}
/* wrmsr with exception handling */
-#define wrmsr_safe(msr,a,b) ({ int ret__; \
- asm volatile("2: wrmsr ; xorl %0,%0\n" \
- "1:\n\t" \
- ".section .fixup,\"ax\"\n\t" \
- "3: movl %4,%0 ; jmp 1b\n\t" \
- ".previous\n\t" \
- ".section __ex_table,\"a\"\n" \
- " .align 4\n\t" \
- " .long 2b,3b\n\t" \
- ".previous" \
- : "=a" (ret__) \
- : "c" (msr), "0" (a), "d" (b), "i" (-EFAULT));\
- ret__; })
+#define wrmsr_safe(msr,val1,val2) \
+ (native_write_msr_safe(msr, ((unsigned long long)val2 << 32) | val1))
/* rdmsr with exception handling */
-#define rdmsr_safe(msr,a,b) ({ int ret__; \
- asm volatile("2: rdmsr ; xorl %0,%0\n" \
- "1:\n\t" \
- ".section .fixup,\"ax\"\n\t" \
- "3: movl %4,%0 ; jmp 1b\n\t" \
- ".previous\n\t" \
- ".section __ex_table,\"a\"\n" \
- " .align 4\n\t" \
- " .long 2b,3b\n\t" \
- ".previous" \
- : "=r" (ret__), "=a" (*(a)), "=d" (*(b)) \
- : "c" (msr), "i" (-EFAULT));\
- ret__; })
-
-#define rdtsc(low,high) \
- __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))
-
-#define rdtscl(low) \
- __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")
+#define rdmsr_safe(msr,p1,p2) \
+ ({ \
+ int __err; \
+ unsigned long long __val = native_read_msr_safe(msr, &__err);\
+ (*p1) = __val; \
+ (*p2) = __val >> 32; \
+ __err; \
+ })
+
+#define rdtsc(low,high) \
+ do { \
+ u64 _l = native_read_tsc(); \
+ (low) = (u32)_l; \
+ (high) = _l >> 32; \
+ } while(0)
+
+#define rdtscl(low) \
+ do { \
+ (low) = native_read_tsc(); \
+ } while(0)
-#define rdtscll(val) \
- __asm__ __volatile__("rdtsc" : "=A" (val))
+#define rdtscll(val) ((val) = native_read_tsc())
#define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
-#define rdpmc(counter,low,high) \
- __asm__ __volatile__("rdpmc" \
- : "=a" (low), "=d" (high) \
- : "c" (counter))
+#define rdpmc(counter,low,high) \
+ do { \
+ u64 _l = native_read_pmc(); \
+ low = (u32)_l; \
+ high = _l >> 32; \
+ } while(0)
#endif /* !CONFIG_PARAVIRT */
#ifdef CONFIG_SMP
Index: linux/include/asm-i386/paravirt.h
===================================================================
--- linux.orig/include/asm-i386/paravirt.h
+++ linux/include/asm-i386/paravirt.h
@@ -29,6 +29,7 @@ struct thread_struct;
struct Xgt_desc_struct;
struct tss_struct;
struct mm_struct;
+struct desc_struct;
struct paravirt_ops
{
unsigned int kernel_rpl;
@@ -105,14 +106,13 @@ struct paravirt_ops
void (*set_ldt)(const void *desc, unsigned entries);
unsigned long (*store_tr)(void);
void (*load_tls)(struct thread_struct *t, unsigned int cpu);
- void (*write_ldt_entry)(void *dt, int entrynum,
- u32 low, u32 high);
- void (*write_gdt_entry)(void *dt, int entrynum,
- u32 low, u32 high);
- void (*write_idt_entry)(void *dt, int entrynum,
- u32 low, u32 high);
- void (*load_esp0)(struct tss_struct *tss,
- struct thread_struct *thread);
+ void (*write_ldt_entry)(struct desc_struct *,
+ int entrynum, u32 low, u32 high);
+ void (*write_gdt_entry)(struct desc_struct *,
+ int entrynum, u32 low, u32 high);
+ void (*write_idt_entry)(struct desc_struct *,
+ int entrynum, u32 low, u32 high);
+ void (*load_esp0)(struct tss_struct *tss, struct thread_struct *t);
void (*set_iopl_mask)(unsigned mask);
@@ -232,6 +232,7 @@ static inline void halt(void)
#define get_kernel_rpl() (paravirt_ops.kernel_rpl)
+/* These should all do BUG_ON(_err), but our headers are too tangled. */
#define rdmsr(msr,val1,val2) do { \
int _err; \
u64 _l = paravirt_ops.read_msr(msr,&_err); \
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -147,7 +147,7 @@ static inline void detect_ht(struct cpui
#define X86_EFLAGS_VIP 0x00100000 /* Virtual Interrupt Pending */
#define X86_EFLAGS_ID 0x00200000 /* CPUID detection flag */
-static inline fastcall void native_cpuid(unsigned int *eax, unsigned int *ebx,
+static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx)
{
/* ecx is often an input as well as an output. */
@@ -545,13 +545,7 @@ static inline void rep_nop(void)
#define cpu_relax() rep_nop()
-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else
-#define paravirt_enabled() 0
-#define __cpuid native_cpuid
-
-static inline void load_esp0(struct tss_struct *tss, struct thread_struct *thread)
+static inline void native_load_esp0(struct tss_struct *tss, struct thread_struct *thread)
{
tss->esp0 = thread->esp0;
/* This can only happen when SEP is enabled, no need to test "SEP"arately */
@@ -561,24 +555,60 @@ static inline void load_esp0(struct tss_
}
}
-/*
- * These special macros can be used to get or set a debugging register
- */
-#define get_debugreg(var, register) \
- __asm__("movl %%db" #register ", %0" \
- :"=r" (var))
-#define set_debugreg(value, register) \
- __asm__("movl %0,%%db" #register \
- : /* no output */ \
- :"r" (value))
-#define set_iopl_mask native_set_iopl_mask
-#endif /* CONFIG_PARAVIRT */
+static inline unsigned long native_get_debugreg(int regno)
+{
+ unsigned long val = 0; /* Damn you, gcc! */
+
+ switch (regno) {
+ case 0:
+ asm("movl %%db0, %0" :"=r" (val)); break;
+ case 1:
+ asm("movl %%db1, %0" :"=r" (val)); break;
+ case 2:
+ asm("movl %%db2, %0" :"=r" (val)); break;
+ case 3:
+ asm("movl %%db3, %0" :"=r" (val)); break;
+ case 6:
+ asm("movl %%db6, %0" :"=r" (val)); break;
+ case 7:
+ asm("movl %%db7, %0" :"=r" (val)); break;
+ default:
+ BUG();
+ }
+ return val;
+}
+
+static inline void native_set_debugreg(int regno, unsigned long value)
+{
+ switch (regno) {
+ case 0:
+ asm("movl %0,%%db0" : /* no output */ :"r" (value));
+ break;
+ case 1:
+ asm("movl %0,%%db1" : /* no output */ :"r" (value));
+ break;
+ case 2:
+ asm("movl %0,%%db2" : /* no output */ :"r" (value));
+ break;
+ case 3:
+ asm("movl %0,%%db3" : /* no output */ :"r" (value));
+ break;
+ case 6:
+ asm("movl %0,%%db6" : /* no output */ :"r" (value));
+ break;
+ case 7:
+ asm("movl %0,%%db7" : /* no output */ :"r" (value));
+ break;
+ default:
+ BUG();
+ }
+}
/*
* Set IOPL bits in EFLAGS from given mask
*/
-static fastcall inline void native_set_iopl_mask(unsigned mask)
+static inline void native_set_iopl_mask(unsigned mask)
{
unsigned int reg;
__asm__ __volatile__ ("pushfl;"
@@ -591,6 +621,28 @@ static fastcall inline void native_set_i
: "i" (~X86_EFLAGS_IOPL), "r" (mask));
}
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define paravirt_enabled() 0
+#define __cpuid native_cpuid
+
+static inline void load_esp0(struct tss_struct *tss, struct thread_struct *thread)
+{
+ native_load_esp0(tss, thread);
+}
+
+/*
+ * These special macros can be used to get or set a debugging register
+ */
+#define get_debugreg(var, register) \
+ (var) = native_get_debugreg(register)
+#define set_debugreg(value, register) \
+ native_set_debugreg(register, value)
+
+#define set_iopl_mask native_set_iopl_mask
+#endif /* CONFIG_PARAVIRT */
+
/*
* Generic CPUID function
* clear %ecx since some cpus (Cyrix MII) do not set or clear %ecx
Index: linux/include/asm-i386/system.h
===================================================================
--- linux.orig/include/asm-i386/system.h
+++ linux/include/asm-i386/system.h
@@ -88,65 +88,96 @@ __asm__ __volatile__ ("movw %%dx,%1\n\t"
#define savesegment(seg, value) \
asm volatile("mov %%" #seg ",%0":"=rm" (value))
-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else
-#define read_cr0() ({ \
- unsigned int __dummy; \
- __asm__ __volatile__( \
- "movl %%cr0,%0\n\t" \
- :"=r" (__dummy)); \
- __dummy; \
-})
-#define write_cr0(x) \
- __asm__ __volatile__("movl %0,%%cr0": :"r" (x))
-#define read_cr2() ({ \
- unsigned int __dummy; \
- __asm__ __volatile__( \
- "movl %%cr2,%0\n\t" \
- :"=r" (__dummy)); \
- __dummy; \
-})
-#define write_cr2(x) \
- __asm__ __volatile__("movl %0,%%cr2": :"r" (x))
+static inline void native_clts(void)
+{
+ asm volatile ("clts");
+}
-#define read_cr3() ({ \
- unsigned int __dummy; \
- __asm__ ( \
- "movl %%cr3,%0\n\t" \
- :"=r" (__dummy)); \
- __dummy; \
-})
-#define write_cr3(x) \
- __asm__ __volatile__("movl %0,%%cr3": :"r" (x))
+static inline unsigned long native_read_cr0(void)
+{
+ unsigned long val;
+ asm volatile("movl %%cr0,%0\n\t" :"=r" (val));
+ return val;
+}
-#define read_cr4() ({ \
- unsigned int __dummy; \
- __asm__( \
- "movl %%cr4,%0\n\t" \
- :"=r" (__dummy)); \
- __dummy; \
-})
-#define read_cr4_safe() ({ \
- unsigned int __dummy; \
- /* This could fault if %cr4 does not exist */ \
- __asm__("1: movl %%cr4, %0 \n" \
- "2: \n" \
- ".section __ex_table,\"a\" \n" \
- ".long 1b,2b \n" \
- ".previous \n" \
- : "=r" (__dummy): "0" (0)); \
- __dummy; \
-})
-#define write_cr4(x) \
- __asm__ __volatile__("movl %0,%%cr4": :"r" (x))
+static inline void native_write_cr0(unsigned long val)
+{
+ asm volatile("movl %0,%%cr0": :"r" (val));
+}
+
+static inline unsigned long native_read_cr2(void)
+{
+ unsigned long val;
+ asm volatile("movl %%cr2,%0\n\t" :"=r" (val));
+ return val;
+}
-#define wbinvd() \
- __asm__ __volatile__ ("wbinvd": : :"memory")
+static inline void native_write_cr2(unsigned long val)
+{
+ asm volatile("movl %0,%%cr2": :"r" (val));
+}
+
+static inline unsigned long native_read_cr3(void)
+{
+ unsigned long val;
+ asm volatile("movl %%cr3,%0\n\t" :"=r" (val));
+ return val;
+}
+
+static inline void native_write_cr3(unsigned long val)
+{
+ asm volatile("movl %0,%%cr3": :"r" (val));
+}
+
+static inline unsigned long native_read_cr4(void)
+{
+ unsigned long val;
+ asm volatile("movl %%cr4,%0\n\t" :"=r" (val));
+ return val;
+}
+
+static inline unsigned long native_read_cr4_safe(void)
+{
+ unsigned long val;
+ /* This could fault if %cr4 does not exist */
+ asm("1: movl %%cr4, %0 \n"
+ "2: \n"
+ ".section __ex_table,\"a\" \n"
+ ".long 1b,2b \n"
+ ".previous \n"
+ : "=r" (val): "0" (0));
+ return val;
+}
+
+static inline void native_write_cr4(unsigned long val)
+{
+ asm volatile("movl %0,%%cr4": :"r" (val));
+}
+
+static inline void native_wbinvd(void)
+{
+ asm volatile("wbinvd": : :"memory");
+}
+
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define read_cr0() (native_read_cr0())
+#define write_cr0(x) (native_write_cr0(x))
+#define read_cr2() (native_read_cr2())
+#define write_cr2(x) (native_write_cr2(x))
+#define read_cr3() (native_read_cr3())
+#define write_cr3(x) (native_write_cr3(x))
+#define read_cr4() (native_read_cr4())
+#define read_cr4_safe() (native_read_cr4_safe())
+#define write_cr4(x) (native_write_cr4(x))
+#define wbinvd() (native_wbinvd())
/* Clear the 'TS' bit */
-#define clts() __asm__ __volatile__ ("clts")
+#define clts() (native_clts())
+
#endif/* CONFIG_PARAVIRT */
/* Set the 'TS' bit */
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [35/48] x86: tighten kernel image page access rights
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (32 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [34/48] i386: rationalize paravirt wrappers Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [36/48] i386: get rid of unused variables Andi Kleen
` (12 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: patches, linux-kernel
On x86-64, kernel memory freed after init can be entirely unmapped instead
of just getting 'poisoned' by overwriting with a debug pattern.
On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table
can also be write-protected.
Compared to the first version, this one also prevents re-creating
mappings in the kernel image range on x86-64 once they have been
removed. This, together with the original changes, prevents temporarily
having inconsistent mappings when cacheability attributes are being
changed on such pages (e.g. from AGP code). While on i386 such duplicate
mappings don't exist, the same change is done there, too, both for
consistency and because checking pte_present() before using various other
pte_XXX functions is a requirement anyway. At the same time, the i386
code is adjusted to use pte_huge() instead of open-coding this.
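The heart of the i386 change is the new early-out in
__change_page_attr(); trimmed and annotated from the hunk below:
	kpte = lookup_address(address);
	if (!kpte)
		return -EINVAL;
	if (!pte_present(*kpte))	/* mapping was removed earlier */
		return 0;		/* nothing to do */
	...
	if (!pte_huge(*kpte))		/* instead of testing _PAGE_PSE by hand */
		set_pte_atomic(kpte, mk_pte(page, prot));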
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/vmlinux.lds.S | 4 ++--
arch/i386/mm/init.c | 27 ++++++++++++++++++++-------
arch/i386/mm/pageattr.c | 6 ++++--
arch/x86_64/kernel/head.S | 1 -
arch/x86_64/kernel/vmlinux.lds.S | 5 +++--
arch/x86_64/mm/init.c | 25 ++++++++++++++++---------
arch/x86_64/mm/pageattr.c | 18 +++++++++++++-----
include/asm-i386/pgtable.h | 2 ++
include/linux/poison.h | 3 ---
9 files changed, 60 insertions(+), 31 deletions(-)
Index: linux/arch/i386/kernel/vmlinux.lds.S
===================================================================
--- linux.orig/arch/i386/kernel/vmlinux.lds.S
+++ linux/arch/i386/kernel/vmlinux.lds.S
@@ -61,8 +61,6 @@ SECTIONS
__stop___ex_table = .;
}
- RODATA
-
BUG_TABLE
. = ALIGN(4);
@@ -72,6 +70,8 @@ SECTIONS
__tracedata_end = .;
}
+ RODATA
+
/* writeable */
. = ALIGN(4096);
.data : AT(ADDR(.data) - LOAD_OFFSET) { /* Data */
Index: linux/arch/i386/mm/init.c
===================================================================
--- linux.orig/arch/i386/mm/init.c
+++ linux/arch/i386/mm/init.c
@@ -22,6 +22,7 @@
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/pfn.h>
#include <linux/poison.h>
#include <linux/bootmem.h>
#include <linux/slab.h>
@@ -751,13 +752,25 @@ static int noinline do_test_wp_bit(void)
void mark_rodata_ro(void)
{
- unsigned long addr = (unsigned long)__start_rodata;
+ unsigned long start = PFN_ALIGN(_text);
+ unsigned long size = PFN_ALIGN(_etext) - start;
- for (; addr < (unsigned long)__end_rodata; addr += PAGE_SIZE)
- change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RO);
-
- printk("Write protecting the kernel read-only data: %uk\n",
- (__end_rodata - __start_rodata) >> 10);
+#ifdef CONFIG_HOTPLUG_CPU
+ /* It must still be possible to apply SMP alternatives. */
+ if (num_possible_cpus() <= 1)
+#endif
+ {
+ change_page_attr(virt_to_page(start),
+ size >> PAGE_SHIFT, PAGE_KERNEL_RX);
+ printk("Write protecting the kernel text: %luk\n", size >> 10);
+ }
+
+ start += size;
+ size = (unsigned long)__end_rodata - start;
+ change_page_attr(virt_to_page(start),
+ size >> PAGE_SHIFT, PAGE_KERNEL_RO);
+ printk("Write protecting the kernel read-only data: %luk\n",
+ size >> 10);
/*
* change_page_attr() requires a global_flush_tlb() call after it.
@@ -781,7 +794,7 @@ void free_init_pages(char *what, unsigne
__free_page(page);
totalram_pages++;
}
- printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
+ printk(KERN_INFO "Freeing %s: %luk freed\n", what, (end - begin) >> 10);
}
void free_initmem(void)
Index: linux/arch/i386/mm/pageattr.c
===================================================================
--- linux.orig/arch/i386/mm/pageattr.c
+++ linux/arch/i386/mm/pageattr.c
@@ -140,9 +140,11 @@ __change_page_attr(struct page *page, pg
kpte = lookup_address(address);
if (!kpte)
return -EINVAL;
+ if (!pte_present(*kpte))
+ return 0;
kpte_page = virt_to_page(kpte);
if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
- if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ if (!pte_huge(*kpte)) {
set_pte_atomic(kpte, mk_pte(page, prot));
} else {
pgprot_t ref_prot;
@@ -158,7 +160,7 @@ __change_page_attr(struct page *page, pg
kpte_page = split;
}
page_private(kpte_page)++;
- } else if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ } else if (!pte_huge(*kpte)) {
set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
Index: linux/arch/x86_64/kernel/head.S
===================================================================
--- linux.orig/arch/x86_64/kernel/head.S
+++ linux/arch/x86_64/kernel/head.S
@@ -280,7 +280,6 @@ early_idt_ripmsg:
.balign PAGE_SIZE
ENTRY(stext)
-ENTRY(_stext)
#define NEXT_PAGE(name) \
.balign PAGE_SIZE; \
Index: linux/arch/x86_64/kernel/vmlinux.lds.S
===================================================================
--- linux.orig/arch/x86_64/kernel/vmlinux.lds.S
+++ linux/arch/x86_64/kernel/vmlinux.lds.S
@@ -29,6 +29,7 @@ SECTIONS
.text : AT(ADDR(.text) - LOAD_OFFSET) {
/* First the code that has to be first for bootstrapping */
*(.bootstrap.text)
+ _stext = .;
/* Then all the functions that are "hot" in profiles, to group them
onto the same hugetlb entry */
#include "functionlist"
@@ -50,10 +51,10 @@ SECTIONS
__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { *(__ex_table) }
__stop___ex_table = .;
- RODATA
-
BUG_TABLE
+ RODATA
+
. = ALIGN(PAGE_SIZE); /* Align data segment to page size boundary */
/* Data */
.data : AT(ADDR(.data) - LOAD_OFFSET) {
Index: linux/arch/x86_64/mm/init.c
===================================================================
--- linux.orig/arch/x86_64/mm/init.c
+++ linux/arch/x86_64/mm/init.c
@@ -22,6 +22,7 @@
#include <linux/bootmem.h>
#include <linux/proc_fs.h>
#include <linux/pci.h>
+#include <linux/pfn.h>
#include <linux/poison.h>
#include <linux/dma-mapping.h>
#include <linux/module.h>
@@ -563,21 +564,23 @@ void free_init_pages(char *what, unsigne
if (begin >= end)
return;
- printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
+ printk(KERN_INFO "Freeing %s: %luk freed\n", what, (end - begin) >> 10);
for (addr = begin; addr < end; addr += PAGE_SIZE) {
struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
ClearPageReserved(page);
init_page_count(page);
memset(page_address(page), POISON_FREE_INITMEM, PAGE_SIZE);
+ if (addr >= __START_KERNEL_map)
+ change_page_attr_addr(addr, 1, __pgprot(0));
__free_page(page);
totalram_pages++;
}
+ if (addr > __START_KERNEL_map)
+ global_flush_tlb();
}
void free_initmem(void)
{
- memset(__initdata_begin, POISON_FREE_INITDATA,
- __initdata_end - __initdata_begin);
free_init_pages("unused kernel memory",
__pa_symbol(&__init_begin),
__pa_symbol(&__init_end));
@@ -587,14 +590,18 @@ void free_initmem(void)
void mark_rodata_ro(void)
{
- unsigned long addr = (unsigned long)__va(__pa_symbol(&__start_rodata));
- unsigned long end = (unsigned long)__va(__pa_symbol(&__end_rodata));
+ unsigned long start = PFN_ALIGN(__va(__pa_symbol(&_stext))), size;
- for (; addr < end; addr += PAGE_SIZE)
- change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);
+#ifdef CONFIG_HOTPLUG_CPU
+ /* It must still be possible to apply SMP alternatives. */
+ if (num_possible_cpus() > 1)
+ start = PFN_ALIGN(__va(__pa_symbol(&_etext)));
+#endif
+ size = (unsigned long)__va(__pa_symbol(&__end_rodata)) - start;
+ change_page_attr_addr(start, size >> PAGE_SHIFT, PAGE_KERNEL_RO);
- printk ("Write protecting the kernel read-only data: %luk\n",
- (__end_rodata - __start_rodata) >> 10);
+ printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
+ size >> 10);
/*
* change_page_attr_addr() requires a global_flush_tlb() call after it.
Index: linux/arch/x86_64/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86_64/mm/pageattr.c
+++ linux/arch/x86_64/mm/pageattr.c
@@ -126,7 +126,7 @@ __change_page_attr(unsigned long address
struct page *kpte_page;
pgprot_t ref_prot2;
kpte = lookup_address(address);
- if (!kpte) return 0;
+ if (!kpte || !pte_present(*kpte)) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (!pte_huge(*kpte)) {
@@ -179,16 +179,24 @@ __change_page_attr(unsigned long address
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
unsigned long phys_base_pfn = __pa_symbol(__START_KERNEL_map) >> PAGE_SHIFT;
- int err = 0;
+ int err = 0, kernel_map = 0;
int i;
+ if (address >= __START_KERNEL_map
+ && address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
+ address = (unsigned long)__va(__pa(address));
+ kernel_map = 1;
+ }
+
down_write(&init_mm.mmap_sem);
for (i = 0; i < numpages; i++, address += PAGE_SIZE) {
unsigned long pfn = __pa(address) >> PAGE_SHIFT;
- err = __change_page_attr(address, pfn, prot, PAGE_KERNEL);
- if (err)
- break;
+ if (!kernel_map || pte_present(pfn_pte(0, prot))) {
+ err = __change_page_attr(address, pfn, prot, PAGE_KERNEL);
+ if (err)
+ break;
+ }
/* Handle kernel mapping too which aliases part of the
* lowmem */
if ((pfn >= phys_base_pfn) &&
Index: linux/include/asm-i386/pgtable.h
===================================================================
--- linux.orig/include/asm-i386/pgtable.h
+++ linux/include/asm-i386/pgtable.h
@@ -159,6 +159,7 @@ void paging_init(void);
extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
+#define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW)
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -166,6 +167,7 @@ extern unsigned long long __PAGE_KERNEL,
#define PAGE_KERNEL __pgprot(__PAGE_KERNEL)
#define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO)
#define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC)
+#define PAGE_KERNEL_RX __pgprot(__PAGE_KERNEL_RX)
#define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE)
#define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE)
#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
Index: linux/include/linux/poison.h
===================================================================
--- linux.orig/include/linux/poison.h
+++ linux/include/linux/poison.h
@@ -26,9 +26,6 @@
/********** arch/$ARCH/mm/init.c **********/
#define POISON_FREE_INITMEM 0xcc
-/********** arch/x86_64/mm/init.c **********/
-#define POISON_FREE_INITDATA 0xba
-
/********** arch/ia64/hp/common/sba_iommu.c **********/
/*
* arch/ia64/hp/common/sba_iommu.c uses a 16-byte poison string with a
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [36/48] i386: get rid of unused variables
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (33 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [35/48] x86: tighten kernel image page access rights Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [37/48] i386: ignore vgacon if hardware not present Andi Kleen
` (11 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Parag Warudkar, patches, linux-kernel
From: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/apm.c | 7 -------
1 file changed, 7 deletions(-)
Index: linux/arch/i386/kernel/apm.c
===================================================================
--- linux.orig/arch/i386/kernel/apm.c
+++ linux/arch/i386/kernel/apm.c
@@ -384,13 +384,6 @@ static int ignore_sys_suspend;
static int ignore_normal_resume;
static int bounce_interval __read_mostly = DEFAULT_BOUNCE_INTERVAL;
-#ifdef CONFIG_APM_RTC_IS_GMT
-# define clock_cmos_diff 0
-# define got_clock_diff 1
-#else
-static long clock_cmos_diff;
-static int got_clock_diff;
-#endif
static int debug __read_mostly;
static int smp __read_mostly;
static int apm_disabled = -1;
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [37/48] i386: ignore vgacon if hardware not present
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (34 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [36/48] i386: get rid of unused variables Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 12:17 ` Antonino A. Daplas
2007-04-29 10:53 ` [PATCH] [38/48] x86_64: Remove unused stext symbol Andi Kleen
` (10 subsequent siblings)
46 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
On Thu, 2007-03-29 at 12:36 +0200, Andi Kleen wrote:
> On Thu, Mar 29, 2007 at 05:46:48PM +1000, Rusty Russell wrote:
> > (Did this fall through the cracks? I don't see it in -mm. It's
> > standalone, and saves some silly code in lguest and presumably others).
>
> Normally it should go to some console maintainer?
Hmm, but who?
> Ok I can add it.
Thanks. While you're in a patch-applying mood, how about this?
Cheers,
Rusty.
==
Use X86_EFLAGS_IF in irqflags.h.
Move X86_EFLAGS_IF et al out to a new header: processor-flags.h, so we
can include it from irqflags.h and use it in raw_irqs_disabled_flags().
As a side-effect, we could now use these flags in .S files.
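For example, assembly code could then test the interrupt flag by name
rather than with a magic constant (hypothetical usage, not part of this
patch):
	#include <asm/processor-flags.h>
	pushfl
	popl %eax
	testl $X86_EFLAGS_IF, %eax	/* were interrupts enabled? */
	jz .Lirqs_were_off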
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
---
include/asm-i386/irqflags.h | 3 ++-
include/asm-i386/processor-flags.h | 26 ++++++++++++++++++++++++++
include/asm-i386/processor.h | 22 +---------------------
3 files changed, 29 insertions(+), 22 deletions(-)
Index: linux/include/asm-i386/processor-flags.h
===================================================================
--- /dev/null
+++ linux/include/asm-i386/processor-flags.h
@@ -0,0 +1,26 @@
+#ifndef __ASM_I386_PROCESSOR_FLAGS_H
+#define __ASM_I386_PROCESSOR_FLAGS_H
+/* Various flags defined: can be included from assembler. */
+
+/*
+ * EFLAGS bits
+ */
+#define X86_EFLAGS_CF 0x00000001 /* Carry Flag */
+#define X86_EFLAGS_PF 0x00000004 /* Parity Flag */
+#define X86_EFLAGS_AF 0x00000010 /* Auxillary carry Flag */
+#define X86_EFLAGS_ZF 0x00000040 /* Zero Flag */
+#define X86_EFLAGS_SF 0x00000080 /* Sign Flag */
+#define X86_EFLAGS_TF 0x00000100 /* Trap Flag */
+#define X86_EFLAGS_IF 0x00000200 /* Interrupt Flag */
+#define X86_EFLAGS_DF 0x00000400 /* Direction Flag */
+#define X86_EFLAGS_OF 0x00000800 /* Overflow Flag */
+#define X86_EFLAGS_IOPL 0x00003000 /* IOPL mask */
+#define X86_EFLAGS_NT 0x00004000 /* Nested Task */
+#define X86_EFLAGS_RF 0x00010000 /* Resume Flag */
+#define X86_EFLAGS_VM 0x00020000 /* Virtual Mode */
+#define X86_EFLAGS_AC 0x00040000 /* Alignment Check */
+#define X86_EFLAGS_VIF 0x00080000 /* Virtual Interrupt Flag */
+#define X86_EFLAGS_VIP 0x00100000 /* Virtual Interrupt Pending */
+#define X86_EFLAGS_ID 0x00200000 /* CPUID detection flag */
+
+#endif /* __ASM_I386_PROCESSOR_FLAGS_H */
Index: linux/include/asm-i386/irqflags.h
===================================================================
--- linux.orig/include/asm-i386/irqflags.h
+++ linux/include/asm-i386/irqflags.h
@@ -9,6 +9,7 @@
*/
#ifndef _ASM_IRQFLAGS_H
#define _ASM_IRQFLAGS_H
+#include <asm/processor-flags.h>
#ifndef __ASSEMBLY__
static inline unsigned long native_save_fl(void)
@@ -119,7 +120,7 @@ static inline unsigned long __raw_local_
static inline int raw_irqs_disabled_flags(unsigned long flags)
{
- return !(flags & (1 << 9));
+ return !(flags & X86_EFLAGS_IF);
}
static inline int raw_irqs_disabled(void)
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -21,6 +21,7 @@
#include <asm/percpu.h>
#include <linux/cpumask.h>
#include <linux/init.h>
+#include <asm/processor-flags.h>
/* flag for disabling the tsc */
extern int tsc_disable;
@@ -126,27 +127,6 @@ extern void detect_ht(struct cpuinfo_x86
static inline void detect_ht(struct cpuinfo_x86 *c) {}
#endif
-/*
- * EFLAGS bits
- */
-#define X86_EFLAGS_CF 0x00000001 /* Carry Flag */
-#define X86_EFLAGS_PF 0x00000004 /* Parity Flag */
-#define X86_EFLAGS_AF 0x00000010 /* Auxillary carry Flag */
-#define X86_EFLAGS_ZF 0x00000040 /* Zero Flag */
-#define X86_EFLAGS_SF 0x00000080 /* Sign Flag */
-#define X86_EFLAGS_TF 0x00000100 /* Trap Flag */
-#define X86_EFLAGS_IF 0x00000200 /* Interrupt Flag */
-#define X86_EFLAGS_DF 0x00000400 /* Direction Flag */
-#define X86_EFLAGS_OF 0x00000800 /* Overflow Flag */
-#define X86_EFLAGS_IOPL 0x00003000 /* IOPL mask */
-#define X86_EFLAGS_NT 0x00004000 /* Nested Task */
-#define X86_EFLAGS_RF 0x00010000 /* Resume Flag */
-#define X86_EFLAGS_VM 0x00020000 /* Virtual Mode */
-#define X86_EFLAGS_AC 0x00040000 /* Alignment Check */
-#define X86_EFLAGS_VIF 0x00080000 /* Virtual Interrupt Flag */
-#define X86_EFLAGS_VIP 0x00100000 /* Virtual Interrupt Pending */
-#define X86_EFLAGS_ID 0x00200000 /* CPUID detection flag */
-
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx)
{
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] [37/48] i386: ignore vgacon if hardware not present
2007-04-29 10:53 ` [PATCH] [37/48] i386: ignore vgacon if hardware not present Andi Kleen
@ 2007-04-29 12:17 ` Antonino A. Daplas
2007-04-29 13:24 ` Andi Kleen
0 siblings, 1 reply; 55+ messages in thread
From: Antonino A. Daplas @ 2007-04-29 12:17 UTC (permalink / raw)
To: Andi Kleen; +Cc: Rusty Russell, patches, linux-kernel
On Sun, 2007-04-29 at 12:53 +0200, Andi Kleen wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> On Thu, 2007-03-29 at 12:36 +0200, Andi Kleen wrote:
> > On Thu, Mar 29, 2007 at 05:46:48PM +1000, Rusty Russell wrote:
> > > (Did this fall through the cracks? I don't see it in -mm. It's
> > > standalone, and saves some silly code in lguest and presumably others).
> >
> > Normally it should go to some console maintainer?
>
> Hmm, but who?
>
> > Ok I can add it.
>
> Thanks. While you're in a patch-applying mood, how about this?
>
> Cheers,
> Rusty.
> ==
> Use X86_EFLAGS_IF in irqflags.h.
>
> Move X86_EFLAGS_IF et al out to a new header: processor-flags.h, so we
> can include it from irqflags.h and use it in raw_irqs_disabled_flags().
>
> As a side-effect, we could now use these flags in .S files.
>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Signed-off-by: Andi Kleen <ak@suse.de>
>
> ---
> include/asm-i386/irqflags.h | 3 ++-
> include/asm-i386/processor-flags.h | 26 ++++++++++++++++++++++++++
> include/asm-i386/processor.h | 22 +---------------------
> 3 files changed, 29 insertions(+), 22 deletions(-)
The subject does not reflect the content of the patch :-). And I believe
ignore-vgacon-if-hardware-not-present is already in mainline.
Tony
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] [37/48] i386: ignore vgacon if hardware not present
2007-04-29 12:17 ` Antonino A. Daplas
@ 2007-04-29 13:24 ` Andi Kleen
2007-04-29 14:10 ` Antonino A. Daplas
0 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 13:24 UTC (permalink / raw)
To: Antonino A. Daplas; +Cc: Rusty Russell, patches, linux-kernel
> The subject does not reflect the content of the patch :-). And I believe
> ignore-vgacon-if-hardware-not-present is already in mainline.
Patch dropped, thanks
-Andi
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] [37/48] i386: ignore vgacon if hardware not present
2007-04-29 13:24 ` Andi Kleen
@ 2007-04-29 14:10 ` Antonino A. Daplas
2007-04-29 14:16 ` Andi Kleen
0 siblings, 1 reply; 55+ messages in thread
From: Antonino A. Daplas @ 2007-04-29 14:10 UTC (permalink / raw)
To: Andi Kleen; +Cc: Rusty Russell, patches, linux-kernel
On Sun, 2007-04-29 at 15:24 +0200, Andi Kleen wrote:
> > The subject does not reflect the content of the patch :-). And I believe
> > ignore-vgacon-if-hardware-not-present is already in mainline.
>
> Patch dropped thanks
Note that this is the patch which has the above title. So I don't know
whether you have to drop the patch or just change the title to an appropriate
one.
Tony
The patch titled
ignore vgacon if hardware not present
has been removed from the -mm tree. Its filename was
ignore-vgacon-if-hardware-not-present.patch
This patch was dropped because it was merged into mainline or a
subsystem tree
------------------------------------------------------
Subject: ignore vgacon if hardware not present
From: Gerd Hoffmann <kraxel@suse.de>
Avoid trying to set up vgacon if there's no vga hardware present.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alan <alan@lxorguk.ukuu.org.uk>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
drivers/video/console/vgacon.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletion(-)
diff -puN drivers/video/console/vgacon.c~ignore-vgacon-if-hardware-not-present drivers/video/console/vgacon.c
--- a/drivers/video/console/vgacon.c~ignore-vgacon-if-hardware-not-present
+++ a/drivers/video/console/vgacon.c
@@ -371,7 +371,8 @@ static const char *vgacon_startup(void)
}
/* VGA16 modes are not handled by VGACON */
- if ((ORIG_VIDEO_MODE == 0x0D) || /* 320x200/4 */
+ if ((ORIG_VIDEO_MODE == 0x00) || /* SCREEN_INFO not initialized */
+ (ORIG_VIDEO_MODE == 0x0D) || /* 320x200/4 */
(ORIG_VIDEO_MODE == 0x0E) || /* 640x200/4 */
(ORIG_VIDEO_MODE == 0x10) || /* 640x350/4 */
(ORIG_VIDEO_MODE == 0x12) || /* 640x480/4 */
> -Andi
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] [37/48] i386: ignore vgacon if hardware not present
2007-04-29 14:10 ` Antonino A. Daplas
@ 2007-04-29 14:16 ` Andi Kleen
2007-04-29 17:16 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 14:16 UTC (permalink / raw)
To: Antonino A. Daplas; +Cc: Rusty Russell, patches, linux-kernel
On Sunday 29 April 2007 16:10:38 Antonino A. Daplas wrote:
> On Sun, 2007-04-29 at 15:24 +0200, Andi Kleen wrote:
> > > The subject does not reflect the content of the patch :-). And I believe
> > > ignore-vgacon-if-hardware-not-present is already in mainline.
> >
> > Patch dropped thanks
>
> Note that this is the patch which has the above title. So I don't know
> whether you have to drop the patch or just change the title to an appropriate
> one.
I dropped the patch completely because you said it was already in mainline
or queued elsewhere. Was that wrong?
-Andi
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] [37/48] i386: ignore vgacon if hardware not present
2007-04-29 14:16 ` Andi Kleen
@ 2007-04-29 17:16 ` Jeremy Fitzhardinge
2007-04-29 17:39 ` Andi Kleen
0 siblings, 1 reply; 55+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-29 17:16 UTC (permalink / raw)
To: Andi Kleen
Cc: Antonino A. Daplas, Rusty Russell, patches, linux-kernel,
Andrew Morton
Andi Kleen wrote:
> I dropped the patch completely because you said it was already in mainline
> or queued elsewhere. Was that wrong?
>
I think both patches are actually needed. The -mm message Antonino
quoted probably refers to the fact that the patch is in your tree, so
Andrew isn't carrying it separately.
The patch called "ignore-vgacon-if-hardware-not-present" on
firstfloor.org is what the name suggests. It appears to be still
needed, because it still applies.
The patch you've posted here with the subject "i386: ignore vgacon if
hardware not present" is actually i386-eflags-header, but it appears to
have a bad first line, which is why it is posted with the wrong subject.
J
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [38/48] x86_64: Remove unused stext symbol
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (35 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [37/48] i386: ignore vgacon if hardware not present Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [39/48] i386: remove the APM_RTC_IS_GMT config option Andi Kleen
` (9 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: patches, linux-kernel
suggested by Jan Beulich
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/head.S | 1 -
1 file changed, 1 deletion(-)
Index: linux/arch/x86_64/kernel/head.S
===================================================================
--- linux.orig/arch/x86_64/kernel/head.S
+++ linux/arch/x86_64/kernel/head.S
@@ -279,7 +279,6 @@ early_idt_ripmsg:
.asciz "RIP %s\n"
.balign PAGE_SIZE
-ENTRY(stext)
#define NEXT_PAGE(name) \
.balign PAGE_SIZE; \
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [39/48] i386: remove the APM_RTC_IS_GMT config option.
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (36 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [38/48] x86_64: Remove unused stext symbol Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [40/48] x86_64: use lru instead of page->index and page->private for pgd lists management Andi Kleen
` (8 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Parag Warudkar, patches, linux-kernel
From: "Parag Warudkar" <parag.warudkar@gmail.com>
Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/Kconfig | 13 -------------
1 file changed, 13 deletions(-)
Index: linux/arch/i386/Kconfig
===================================================================
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -1029,19 +1029,6 @@ config APM_DISPLAY_BLANK
backlight at all, or it might print a lot of errors to the console,
especially if you are using gpm.
-config APM_RTC_IS_GMT
- bool "RTC stores time in GMT"
- depends on APM
- help
- Say Y here if your RTC (Real Time Clock a.k.a. hardware clock)
- stores the time in GMT (Greenwich Mean Time). Say N if your RTC
- stores localtime.
-
- It is in fact recommended to store GMT in your RTC, because then you
- don't have to worry about daylight savings time changes. The only
- reason not to use GMT in your RTC is if you also run a broken OS
- that doesn't understand GMT.
-
config APM_ALLOW_INTS
bool "Allow interrupts during APM BIOS calls"
depends on APM
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [40/48] x86_64: use lru instead of page->index and page->private for pgd lists management.
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (37 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [39/48] i386: remove the APM_RTC_IS_GMT config option Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [41/48] x86: sys_ioperm() prototype cleanup Andi Kleen
` (7 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Christoph Lameter, Andi Kleen, patches, linux-kernel
From: Christoph Lameter <clameter@sgi.com>
x86_64 currently simulates a list using the index and private fields of the
page struct. It seems that the code was inherited from i386. But x86_64 does
not use the slab to allocate pgds and pmds etc., so the lru field is not
used by the slab and is therefore available.
This patch uses standard list operations on page->lru to implement pgd
tracking.
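For illustration, a minimal userspace re-implementation of the list_head idiom (toy code, not <linux/list.h>): the node is embedded in the object, the list is circular, and deletion needs neither the list head nor the hand-maintained pprev back-pointer the old code kept in page->private.

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* stand-in for struct page, using .lru the way the patch does */
struct page { int id; struct list_head lru; };

int main(void)
{
	struct list_head pgd_list = LIST_HEAD_INIT(pgd_list);
	struct page a = { .id = 1 }, b = { .id = 2 };
	struct list_head *pos;

	list_add(&a.lru, &pgd_list);
	list_add(&b.lru, &pgd_list);
	list_del(&a.lru);	/* no head pointer or pprev bookkeeping */

	for (pos = pgd_list.next; pos != &pgd_list; pos = pos->next)
		printf("page %d\n", container_of(pos, struct page, lru)->id);
	return 0;
}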
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/x86_64/mm/fault.c | 5 ++---
include/asm-x86_64/pgalloc.h | 14 +++-----------
include/asm-x86_64/pgtable.h | 2 +-
3 files changed, 6 insertions(+), 15 deletions(-)
Index: linux/arch/x86_64/mm/fault.c
===================================================================
--- linux.orig/arch/x86_64/mm/fault.c
+++ linux/arch/x86_64/mm/fault.c
@@ -585,7 +585,7 @@ do_sigbus:
}
DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
+LIST_HEAD(pgd_list);
void vmalloc_sync_all(void)
{
@@ -605,8 +605,7 @@ void vmalloc_sync_all(void)
if (pgd_none(*pgd_ref))
continue;
spin_lock(&pgd_lock);
- for (page = pgd_list; page;
- page = (struct page *)page->index) {
+ list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
pgd = (pgd_t *)page_address(page) + pgd_index(address);
if (pgd_none(*pgd))
Index: linux/include/asm-x86_64/pgalloc.h
===================================================================
--- linux.orig/include/asm-x86_64/pgalloc.h
+++ linux/include/asm-x86_64/pgalloc.h
@@ -44,24 +44,16 @@ static inline void pgd_list_add(pgd_t *p
struct page *page = virt_to_page(pgd);
spin_lock(&pgd_lock);
- page->index = (pgoff_t)pgd_list;
- if (pgd_list)
- pgd_list->private = (unsigned long)&page->index;
- pgd_list = page;
- page->private = (unsigned long)&pgd_list;
+ list_add(&page->lru, &pgd_list);
spin_unlock(&pgd_lock);
}
static inline void pgd_list_del(pgd_t *pgd)
{
- struct page *next, **pprev, *page = virt_to_page(pgd);
+ struct page *page = virt_to_page(pgd);
spin_lock(&pgd_lock);
- next = (struct page *)page->index;
- pprev = (struct page **)page->private;
- *pprev = next;
- if (next)
- next->private = (unsigned long)pprev;
+ list_del(&page->lru);
spin_unlock(&pgd_lock);
}
Index: linux/include/asm-x86_64/pgtable.h
===================================================================
--- linux.orig/include/asm-x86_64/pgtable.h
+++ linux/include/asm-x86_64/pgtable.h
@@ -410,7 +410,7 @@ static inline pte_t pte_modify(pte_t pte
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
+extern struct list_head pgd_list;
void vmalloc_sync_all(void);
extern int kern_addr_valid(unsigned long addr);
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [41/48] x86: sys_ioperm() prototype cleanup
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (38 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [40/48] x86_64: use lru instead of page->index and page->private for pgd lists management Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [42/48] x86: remove UNEXPECTED_IO_APIC() Andi Kleen
` (6 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Adrian Bunk, patches, linux-kernel
From: Adrian Bunk <bunk@stusta.de>
- there's no reason for duplicating the prototype from
include/linux/syscalls.h in include/asm-x86_64/unistd.h
- every file should #include the headers containing the prototypes for
its global functions
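As a toy illustration of the second point (hypothetical names; a single file stands in for header plus source): once the prototype and the definition are seen in the same translation unit, the compiler verifies they match.

/* normally pulled in via #include <linux/syscalls.h> */
long do_ioperm(unsigned long from, unsigned long num, int turn_on);

long do_ioperm(unsigned long from, unsigned long num, int turn_on)
{
	(void)from; (void)num; (void)turn_on;	/* toy body */
	/* a definition whose signature drifted from the prototype
	 * above is now a hard compile error instead of a silent
	 * mismatch across files */
	return 0;
}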
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/ioport.c | 1 +
arch/x86_64/kernel/ioport.c | 1 +
include/asm-x86_64/unistd.h | 1 -
3 files changed, 2 insertions(+), 1 deletion(-)
Index: linux/include/asm-x86_64/unistd.h
===================================================================
--- linux.orig/include/asm-x86_64/unistd.h
+++ linux/include/asm-x86_64/unistd.h
@@ -655,7 +655,6 @@ __SYSCALL(__NR_move_pages, sys_move_page
#include <asm/ptrace.h>
asmlinkage long sys_iopl(unsigned int level, struct pt_regs *regs);
-asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on);
struct sigaction;
asmlinkage long sys_rt_sigaction(int sig,
const struct sigaction __user *act,
Index: linux/arch/i386/kernel/ioport.c
===================================================================
--- linux.orig/arch/i386/kernel/ioport.c
+++ linux/arch/i386/kernel/ioport.c
@@ -16,6 +16,7 @@
#include <linux/stddef.h>
#include <linux/slab.h>
#include <linux/thread_info.h>
+#include <linux/syscalls.h>
/* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */
static void set_bitmap(unsigned long *bitmap, unsigned int base, unsigned int extent, int new_value)
Index: linux/arch/x86_64/kernel/ioport.c
===================================================================
--- linux.orig/arch/x86_64/kernel/ioport.c
+++ linux/arch/x86_64/kernel/ioport.c
@@ -16,6 +16,7 @@
#include <linux/stddef.h>
#include <linux/slab.h>
#include <linux/thread_info.h>
+#include <linux/syscalls.h>
/* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */
static void set_bitmap(unsigned long *bitmap, unsigned int base, unsigned int extent, int new_value)
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [42/48] x86: remove UNEXPECTED_IO_APIC()
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (39 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [41/48] x86: sys_ioperm() prototype cleanup Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [43/48] x86_64: fix vtime() vsyscall Andi Kleen
` (5 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Adrian Bunk, patches, linux-kernel
From: Adrian Bunk <bunk@stusta.de>
Many years ago, UNEXPECTED_IO_APIC() contained printk()'s (but nothing more).
Now that it has been completely empty for years, we may as well remove it.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/i386/kernel/io_apic.c | 30 ------------------------------
arch/x86_64/kernel/io_apic.c | 28 ----------------------------
2 files changed, 58 deletions(-)
Index: linux/arch/i386/kernel/io_apic.c
===================================================================
--- linux.orig/arch/i386/kernel/io_apic.c
+++ linux/arch/i386/kernel/io_apic.c
@@ -1403,10 +1403,6 @@ static void __init setup_ExtINT_IRQ0_pin
enable_8259A_irq(0);
}
-static inline void UNEXPECTED_IO_APIC(void)
-{
-}
-
void __init print_IO_APIC(void)
{
int apic, i;
@@ -1446,34 +1442,12 @@ void __init print_IO_APIC(void)
printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID);
printk(KERN_DEBUG "....... : Delivery Type: %X\n", reg_00.bits.delivery_type);
printk(KERN_DEBUG "....... : LTS : %X\n", reg_00.bits.LTS);
- if (reg_00.bits.ID >= get_physical_broadcast())
- UNEXPECTED_IO_APIC();
- if (reg_00.bits.__reserved_1 || reg_00.bits.__reserved_2)
- UNEXPECTED_IO_APIC();
printk(KERN_DEBUG ".... register #01: %08X\n", reg_01.raw);
printk(KERN_DEBUG "....... : max redirection entries: %04X\n", reg_01.bits.entries);
- if ( (reg_01.bits.entries != 0x0f) && /* older (Neptune) boards */
- (reg_01.bits.entries != 0x17) && /* typical ISA+PCI boards */
- (reg_01.bits.entries != 0x1b) && /* Compaq Proliant boards */
- (reg_01.bits.entries != 0x1f) && /* dual Xeon boards */
- (reg_01.bits.entries != 0x22) && /* bigger Xeon boards */
- (reg_01.bits.entries != 0x2E) &&
- (reg_01.bits.entries != 0x3F)
- )
- UNEXPECTED_IO_APIC();
printk(KERN_DEBUG "....... : PRQ implemented: %X\n", reg_01.bits.PRQ);
printk(KERN_DEBUG "....... : IO APIC version: %04X\n", reg_01.bits.version);
- if ( (reg_01.bits.version != 0x01) && /* 82489DX IO-APICs */
- (reg_01.bits.version != 0x10) && /* oldest IO-APICs */
- (reg_01.bits.version != 0x11) && /* Pentium/Pro IO-APICs */
- (reg_01.bits.version != 0x13) && /* Xeon IO-APICs */
- (reg_01.bits.version != 0x20) /* Intel P64H (82806 AA) */
- )
- UNEXPECTED_IO_APIC();
- if (reg_01.bits.__reserved_1 || reg_01.bits.__reserved_2)
- UNEXPECTED_IO_APIC();
/*
* Some Intel chipsets with IO APIC VERSION of 0x1? don't have reg_02,
@@ -1483,8 +1457,6 @@ void __init print_IO_APIC(void)
if (reg_01.bits.version >= 0x10 && reg_02.raw != reg_01.raw) {
printk(KERN_DEBUG ".... register #02: %08X\n", reg_02.raw);
printk(KERN_DEBUG "....... : arbitration: %02X\n", reg_02.bits.arbitration);
- if (reg_02.bits.__reserved_1 || reg_02.bits.__reserved_2)
- UNEXPECTED_IO_APIC();
}
/*
@@ -1496,8 +1468,6 @@ void __init print_IO_APIC(void)
reg_03.raw != reg_01.raw) {
printk(KERN_DEBUG ".... register #03: %08X\n", reg_03.raw);
printk(KERN_DEBUG "....... : Boot DT : %X\n", reg_03.bits.boot_DT);
- if (reg_03.bits.__reserved_1)
- UNEXPECTED_IO_APIC();
}
printk(KERN_DEBUG ".... IRQ redirection table:\n");
Index: linux/arch/x86_64/kernel/io_apic.c
===================================================================
--- linux.orig/arch/x86_64/kernel/io_apic.c
+++ linux/arch/x86_64/kernel/io_apic.c
@@ -907,10 +907,6 @@ static void __init setup_ExtINT_IRQ0_pin
enable_8259A_irq(0);
}
-void __init UNEXPECTED_IO_APIC(void)
-{
-}
-
void __apicdebuginit print_IO_APIC(void)
{
int apic, i;
@@ -946,40 +942,16 @@ void __apicdebuginit print_IO_APIC(void)
printk(KERN_DEBUG "IO APIC #%d......\n", mp_ioapics[apic].mpc_apicid);
printk(KERN_DEBUG ".... register #00: %08X\n", reg_00.raw);
printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID);
- if (reg_00.bits.__reserved_1 || reg_00.bits.__reserved_2)
- UNEXPECTED_IO_APIC();
printk(KERN_DEBUG ".... register #01: %08X\n", *(int *)®_01);
printk(KERN_DEBUG "....... : max redirection entries: %04X\n", reg_01.bits.entries);
- if ( (reg_01.bits.entries != 0x0f) && /* older (Neptune) boards */
- (reg_01.bits.entries != 0x17) && /* typical ISA+PCI boards */
- (reg_01.bits.entries != 0x1b) && /* Compaq Proliant boards */
- (reg_01.bits.entries != 0x1f) && /* dual Xeon boards */
- (reg_01.bits.entries != 0x22) && /* bigger Xeon boards */
- (reg_01.bits.entries != 0x2E) &&
- (reg_01.bits.entries != 0x3F) &&
- (reg_01.bits.entries != 0x03)
- )
- UNEXPECTED_IO_APIC();
printk(KERN_DEBUG "....... : PRQ implemented: %X\n", reg_01.bits.PRQ);
printk(KERN_DEBUG "....... : IO APIC version: %04X\n", reg_01.bits.version);
- if ( (reg_01.bits.version != 0x01) && /* 82489DX IO-APICs */
- (reg_01.bits.version != 0x02) && /* 82801BA IO-APICs (ICH2) */
- (reg_01.bits.version != 0x10) && /* oldest IO-APICs */
- (reg_01.bits.version != 0x11) && /* Pentium/Pro IO-APICs */
- (reg_01.bits.version != 0x13) && /* Xeon IO-APICs */
- (reg_01.bits.version != 0x20) /* Intel P64H (82806 AA) */
- )
- UNEXPECTED_IO_APIC();
- if (reg_01.bits.__reserved_1 || reg_01.bits.__reserved_2)
- UNEXPECTED_IO_APIC();
if (reg_01.bits.version >= 0x10) {
printk(KERN_DEBUG ".... register #02: %08X\n", reg_02.raw);
printk(KERN_DEBUG "....... : arbitration: %02X\n", reg_02.bits.arbitration);
- if (reg_02.bits.__reserved_1 || reg_02.bits.__reserved_2)
- UNEXPECTED_IO_APIC();
}
printk(KERN_DEBUG ".... IRQ redirection table:\n");
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [43/48] x86_64: fix vtime() vsyscall
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (40 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [42/48] x86: remove UNEXPECTED_IO_APIC() Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [44/48] x86_64: vsyscall_gtod_data diet and vgettimeofday() fix Andi Kleen
` (4 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Eric Dumazet, patches, linux-kernel
From: Eric Dumazet <dada1@cosmosbay.com>
There is a tiny probability that the return value from vtime(time_t *t) is
different from the value stored in *t.
Using a temporary variable solves the problem and gives faster code.
17: 48 85 ff test %rdi,%rdi
1a: 48 8b 05 00 00 00 00 mov 0(%rip),%rax # __vsyscall_gtod_data.wall_time_tv.tv_sec
21: 74 03 je 26
23: 48 89 07 mov %rax,(%rdi)
26: c9 leaveq
27: c3 retq
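A simplified userspace sketch of the race being fixed (a toy variable stands in for the vsyscall data):

#include <time.h>

/* updated asynchronously by the timer interrupt in the real kernel */
volatile time_t wall_time_sec;

/* racy: two separate loads of the shared value */
time_t vtime_racy(time_t *t)
{
	if (t)
		*t = wall_time_sec;	/* load #1 */
	return wall_time_sec;		/* load #2 may see a newer second */
}

/* fixed: a single load feeds both the store and the return value */
time_t vtime_fixed(time_t *t)
{
	time_t result = wall_time_sec;
	if (t)
		*t = result;
	return result;
}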
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/vsyscall.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
Index: linux/arch/x86_64/kernel/vsyscall.c
===================================================================
--- linux.orig/arch/x86_64/kernel/vsyscall.c
+++ linux/arch/x86_64/kernel/vsyscall.c
@@ -156,11 +156,13 @@ int __vsyscall(0) vgettimeofday(struct t
* unlikely */
time_t __vsyscall(1) vtime(time_t *t)
{
+ time_t result;
if (unlikely(!__vsyscall_gtod_data.sysctl_enabled))
return time_syscall(t);
- else if (t)
- *t = __vsyscall_gtod_data.wall_time_tv.tv_sec;
- return __vsyscall_gtod_data.wall_time_tv.tv_sec;
+ result = __vsyscall_gtod_data.wall_time_tv.tv_sec;
+ if (t)
+ *t = result;
+ return result;
}
/* Fast way to get current CPU and node.
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [44/48] x86_64: vsyscall_gtod_data diet and vgettimeofday() fix
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (41 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [43/48] x86_64: fix vtime() vsyscall Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [45/48] x86_64: Inhibit machine from asserting an NMI when doing Alt-SysRq-M operation Andi Kleen
` (3 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Eric Dumazet, patches, linux-kernel
From: Eric Dumazet <dada1@cosmosbay.com>
Current vsyscall_gtod_data is large (3 or 4 cache lines dirtied at timer
interrupt). We can shrink it to exactly 64 bytes (1 cache line on AMD64).
Instead of copying a whole struct clocksource, we copy only the needed fields.
I deleted an unused field: offset_base.
This patch also fixes one oddity in vgettimeofday(): it can return a timeval
with tv_usec = 1000000. Maybe not a bug, but why not do the right thing?
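A toy userspace sketch of the normalization (simplified from the patch): reduce nanoseconds below NSEC_PER_SEC before converting, so tv_usec can never reach 1000000.

#include <stdio.h>

#define NSEC_PER_SEC	1000000000UL
#define NSEC_PER_USEC	1000UL

struct timeval { long tv_sec; long tv_usec; };

void to_timeval(long sec, unsigned long nsec, struct timeval *tv)
{
	while (nsec >= NSEC_PER_SEC) {	/* carry whole seconds first */
		sec += 1;
		nsec -= NSEC_PER_SEC;
	}
	tv->tv_sec = sec;
	tv->tv_usec = nsec / NSEC_PER_USEC;	/* always 0..999999 */
}

int main(void)
{
	struct timeval tv;

	to_timeval(100, 1000000500UL, &tv);	/* 1.0000005 s worth of nsec */
	printf("%ld.%06ld\n", tv.tv_sec, tv.tv_usec);	/* 101.000000 */
	return 0;
}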
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/vsyscall.c | 53 ++++++++++++++++++++++++++++--------------
1 file changed, 36 insertions(+), 17 deletions(-)
Index: linux/arch/x86_64/kernel/vsyscall.c
===================================================================
--- linux.orig/arch/x86_64/kernel/vsyscall.c
+++ linux/arch/x86_64/kernel/vsyscall.c
@@ -51,13 +51,28 @@
asm("" : "=r" (v) : "0" (x)); \
((v - VSYSCALL_FIRST_PAGE) + __pa_symbol(&__vsyscall_0)); })
+/*
+ * vsyscall_gtod_data contains data that is :
+ * - readonly from vsyscalls
+ * - written by timer interrupt or sysctl (/proc/sys/kernel/vsyscall64)
+ * Try to keep this structure as small as possible to avoid cache line ping pongs
+ */
struct vsyscall_gtod_data_t {
- seqlock_t lock;
- int sysctl_enabled;
- struct timeval wall_time_tv;
+ seqlock_t lock;
+
+ /* open coded 'struct timespec' */
+ time_t wall_time_sec;
+ u32 wall_time_nsec;
+
+ int sysctl_enabled;
struct timezone sys_tz;
- cycle_t offset_base;
- struct clocksource clock;
+ struct { /* extract of a clocksource struct */
+ cycle_t (*vread)(void);
+ cycle_t cycle_last;
+ cycle_t mask;
+ u32 mult;
+ u32 shift;
+ } clock;
};
int __vgetcpu_mode __section_vgetcpu_mode;
@@ -73,9 +88,13 @@ void update_vsyscall(struct timespec *wa
write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags);
/* copy vsyscall data */
- vsyscall_gtod_data.clock = *clock;
- vsyscall_gtod_data.wall_time_tv.tv_sec = wall_time->tv_sec;
- vsyscall_gtod_data.wall_time_tv.tv_usec = wall_time->tv_nsec/1000;
+ vsyscall_gtod_data.clock.vread = clock->vread;
+ vsyscall_gtod_data.clock.cycle_last = clock->cycle_last;
+ vsyscall_gtod_data.clock.mask = clock->mask;
+ vsyscall_gtod_data.clock.mult = clock->mult;
+ vsyscall_gtod_data.clock.shift = clock->shift;
+ vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec;
+ vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec;
vsyscall_gtod_data.sys_tz = sys_tz;
write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
}
@@ -110,7 +129,8 @@ static __always_inline long time_syscall
static __always_inline void do_vgettimeofday(struct timeval * tv)
{
cycle_t now, base, mask, cycle_delta;
- unsigned long seq, mult, shift, nsec_delta;
+ unsigned seq;
+ unsigned long mult, shift, nsec;
cycle_t (*vread)(void);
do {
seq = read_seqbegin(&__vsyscall_gtod_data.lock);
@@ -126,21 +146,20 @@ static __always_inline void do_vgettimeo
mult = __vsyscall_gtod_data.clock.mult;
shift = __vsyscall_gtod_data.clock.shift;
- *tv = __vsyscall_gtod_data.wall_time_tv;
-
+ tv->tv_sec = __vsyscall_gtod_data.wall_time_sec;
+ nsec = __vsyscall_gtod_data.wall_time_nsec;
} while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
/* calculate interval: */
cycle_delta = (now - base) & mask;
/* convert to nsecs: */
- nsec_delta = (cycle_delta * mult) >> shift;
+ nsec += (cycle_delta * mult) >> shift;
- /* convert to usecs and add to timespec: */
- tv->tv_usec += nsec_delta / NSEC_PER_USEC;
- while (tv->tv_usec > USEC_PER_SEC) {
+ while (nsec >= NSEC_PER_SEC) {
tv->tv_sec += 1;
- tv->tv_usec -= USEC_PER_SEC;
+ nsec -= NSEC_PER_SEC;
}
+ tv->tv_usec = nsec / NSEC_PER_USEC;
}
int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
@@ -159,7 +178,7 @@ time_t __vsyscall(1) vtime(time_t *t)
time_t result;
if (unlikely(!__vsyscall_gtod_data.sysctl_enabled))
return time_syscall(t);
- result = __vsyscall_gtod_data.wall_time_tv.tv_sec;
+ result = __vsyscall_gtod_data.wall_time_sec;
if (t)
*t = result;
return result;
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [45/48] x86_64: Inhibit machine from asserting an NMI when doing Alt-SysRq-M operation.
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (42 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [44/48] x86_64: vsyscall_gtod_data diet and vgettimeofday() fix Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [46/48] x86_64: adjust EDID retrieval Andi Kleen
` (2 subsequent siblings)
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Konrad Rzeszutek, patches, linux-kernel
From: Konrad Rzeszutek <konrad@darnok.org>
This patch touches the NMI watchdog every MAX_ORDER_NR_PAGES pages
to inhibit the machine from triggering an NMI while the CPUs
are locked. This situation happens on boxes with more
than 64 CPUs and 128GB of RAM when Alt-SysRq-m is performed.
It has been successfully tested for regression on uni, 2, 4, 8,
32, and 64 CPU boxes with various memory configurations.
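The pattern itself, as a userspace toy (hypothetical names; TOUCH_INTERVAL stands in for MAX_ORDER_NR_PAGES): poke the watchdog once per interval so the hot loop only pays for a cheap modulo test.

#include <stdio.h>

#define TOUCH_INTERVAL 4096

void touch_watchdog(void)	/* stand-in for touch_nmi_watchdog() */
{
	/* resets the "this CPU is stuck" timestamp in the real kernel */
}

void walk_pages(unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		if (i % TOUCH_INTERVAL == 0)	/* rare branch */
			touch_watchdog();
		/* ... per-page accounting work ... */
	}
}

int main(void)
{
	walk_pages(1UL << 20);
	printf("done\n");
	return 0;
}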
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/mm/init.c | 6 ++++++
1 file changed, 6 insertions(+)
Index: linux/arch/x86_64/mm/init.c
===================================================================
--- linux.orig/arch/x86_64/mm/init.c
+++ linux/arch/x86_64/mm/init.c
@@ -27,6 +27,7 @@
#include <linux/dma-mapping.h>
#include <linux/module.h>
#include <linux/memory_hotplug.h>
+#include <linux/nmi.h>
#include <asm/processor.h>
#include <asm/system.h>
@@ -73,6 +74,11 @@ void show_mem(void)
for_each_online_pgdat(pgdat) {
for (i = 0; i < pgdat->node_spanned_pages; ++i) {
+ /* this loop can take a while with 256 GB and 4k pages
+ so update the NMI watchdog */
+ if (unlikely(i % MAX_ORDER_NR_PAGES == 0)) {
+ touch_nmi_watchdog();
+ }
page = pfn_to_page(pgdat->node_start_pfn + i);
total++;
if (PageReserved(page))
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [46/48] x86_64: adjust EDID retrieval
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (43 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [45/48] x86_64: Inhibit machine from asserting an NMI when doing Alt-SysRq-M operation Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 12:18 ` Antonino A. Daplas
2007-04-29 10:53 ` [PATCH] [47/48] x86_64: Fix "Section mismatch" compile warning Andi Kleen
2007-04-29 10:53 ` [PATCH] [48/48] i386: cleanup GDT Access Andi Kleen
46 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Jan Beulich, patches, linux-kernel
From: "Jan Beulich" <jbeulich@novell.com>
commit 5e518d7672dea4cd7c60871e40d0490c52f01d13 did the same change to
i386's variant.
With this change, i386's and x86-64's versions are identical, raising
the question whether the x86-64 one should go (just like there's only
one instance of edd.S).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/boot/video.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/arch/x86_64/boot/video.S
===================================================================
--- linux.orig/arch/x86_64/boot/video.S
+++ linux/arch/x86_64/boot/video.S
@@ -1977,7 +1977,7 @@ store_edid:
movw $0x4f15, %ax # do VBE/DDC
movw $0x01, %bx
movw $0x00, %cx
- movw $0x01, %dx
+ movw $0x00, %dx
movw $0x140, %di
int $0x10
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH] [46/48] x86_64: adjust EDID retrieval
2007-04-29 10:53 ` [PATCH] [46/48] x86_64: adjust EDID retrieval Andi Kleen
@ 2007-04-29 12:18 ` Antonino A. Daplas
0 siblings, 0 replies; 55+ messages in thread
From: Antonino A. Daplas @ 2007-04-29 12:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: Jan Beulich, patches, linux-kernel
On Sun, 2007-04-29 at 12:53 +0200, Andi Kleen wrote:
> From: "Jan Beulich" <jbeulich@novell.com>
> commit 5e518d7672dea4cd7c60871e40d0490c52f01d13 did the same change to
> i386's variant.
>
> With this change, i386's and x86-64's versions are identical, raising
> the question whether the x86-64 one should go (just like there's only
> one instance of edd.S).
>
> Signed-off-by: Jan Beulich <jbeulich@novell.com>
> Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Antonino Daplas <adaplas@gmail.com>
Tony
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [47/48] x86_64: Fix "Section mismatch" compile warning
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (44 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [46/48] x86_64: adjust EDID retrieval Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
2007-04-29 10:53 ` [PATCH] [48/48] i386: cleanup GDT Access Andi Kleen
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Bernhard Walle, patches, linux-kernel
From: Bernhard Walle <bwalle@suse.de>
Fix "Section mismatch" warnings in arch/x86_64/kernel/time.c
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86_64/kernel/time.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -328,7 +328,7 @@ static unsigned int __init pit_calibrate
#define PIT_MODE 0x43
#define PIT_CH0 0x40
-static void __init __pit_init(int val, u8 mode)
+static void __pit_init(int val, u8 mode)
{
unsigned long flags;
@@ -344,12 +344,12 @@ void __init pit_init(void)
__pit_init(LATCH, 0x34); /* binary, mode 2, LSB/MSB, ch 0 */
}
-void __init pit_stop_interrupt(void)
+void pit_stop_interrupt(void)
{
__pit_init(0, 0x30); /* mode 0 */
}
-void __init stop_timer_interrupt(void)
+void stop_timer_interrupt(void)
{
char *name;
if (hpet_address) {
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH] [48/48] i386: cleanup GDT Access
2007-04-29 10:52 [PATCH] [0/48] x86 candidate patches for review III: various stuff Andi Kleen
` (45 preceding siblings ...)
2007-04-29 10:53 ` [PATCH] [47/48] x86_64: Fix "Section mismatch" compile warning Andi Kleen
@ 2007-04-29 10:53 ` Andi Kleen
46 siblings, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2007-04-29 10:53 UTC (permalink / raw)
To: Rusty Russell, Andi Kleen, patches, linux-kernel
From: Rusty Russell <rusty@rustcorp.com.au>
Now that we have an explicit per-cpu GDT variable, we don't need to keep the
descriptors around to use them to find the GDT: expose cpu_gdt directly.
We could go further and make load_gdt() pack the descriptor for us, or even
assume it means "load the current cpu's GDT" which is what it always does.
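A simplified sketch of that direction (toy types and a stub loader, not the kernel headers): the pseudo-descriptor is packed on the stack from the table's address and size whenever it is needed, instead of being stored per cpu.

#include <stdio.h>

struct desc_struct { unsigned int a, b; };

struct Xgt_desc_struct {
	unsigned short size;
	unsigned long address;
} __attribute__((packed));

#define GDT_ENTRIES 32
#define GDT_SIZE (GDT_ENTRIES * sizeof(struct desc_struct))

struct desc_struct cpu_gdt[GDT_ENTRIES];	/* per-cpu in the kernel */

void load_gdt(const struct Xgt_desc_struct *descr)
{
	/* the kernel executes lgdt here; just show what would be loaded */
	printf("lgdt base=%p limit=%u\n",
	       (void *)descr->address, (unsigned)descr->size);
}

void switch_to_new_gdt(void)
{
	struct Xgt_desc_struct gdt_descr;

	gdt_descr.address = (unsigned long)cpu_gdt;
	gdt_descr.size = GDT_SIZE - 1;	/* hardware limit is size - 1 */
	load_gdt(&gdt_descr);
}

int main(void)
{
	switch_to_new_gdt();
	return 0;
}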
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/i386/kernel/cpu/common.c | 4 +---
arch/i386/kernel/efi.c | 16 ++++++++--------
arch/i386/kernel/entry.S | 3 +--
arch/i386/kernel/smpboot.c | 12 ++++++------
arch/i386/kernel/traps.c | 4 +---
include/asm-i386/desc.h | 7 ++-----
6 files changed, 19 insertions(+), 27 deletions(-)
Index: linux/arch/i386/kernel/cpu/common.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/common.c
+++ linux/arch/i386/kernel/cpu/common.c
@@ -22,9 +22,6 @@
#include "cpu.h"
-DEFINE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr);
-EXPORT_PER_CPU_SYMBOL(cpu_gdt_descr);
-
DEFINE_PER_CPU(struct desc_struct, cpu_gdt[GDT_ENTRIES]) = {
[GDT_ENTRY_KERNEL_CS] = { 0x0000ffff, 0x00cf9a00 },
[GDT_ENTRY_KERNEL_DS] = { 0x0000ffff, 0x00cf9200 },
@@ -52,6 +49,7 @@ DEFINE_PER_CPU(struct desc_struct, cpu_g
[GDT_ENTRY_ESPFIX_SS] = { 0x00000000, 0x00c09200 },
[GDT_ENTRY_PDA] = { 0x00000000, 0x00c09200 }, /* set in setup_pda */
};
+EXPORT_PER_CPU_SYMBOL_GPL(cpu_gdt);
DEFINE_PER_CPU(struct i386_pda, _cpu_pda);
EXPORT_PER_CPU_SYMBOL(_cpu_pda);
Index: linux/arch/i386/kernel/efi.c
===================================================================
--- linux.orig/arch/i386/kernel/efi.c
+++ linux/arch/i386/kernel/efi.c
@@ -69,13 +69,11 @@ static void efi_call_phys_prelog(void) _
{
unsigned long cr4;
unsigned long temp;
- struct Xgt_desc_struct *cpu_gdt_descr;
+ struct Xgt_desc_struct gdt_descr;
spin_lock(&efi_rt_lock);
local_irq_save(efi_rt_eflags);
- cpu_gdt_descr = &per_cpu(cpu_gdt_descr, 0);
-
/*
* If I don't have PSE, I should just duplicate two entries in page
* directory. If I have PSE, I just need to duplicate one entry in
@@ -105,17 +103,19 @@ static void efi_call_phys_prelog(void) _
*/
local_flush_tlb();
- cpu_gdt_descr->address = __pa(cpu_gdt_descr->address);
- load_gdt(cpu_gdt_descr);
+ gdt_descr.address = __pa(get_cpu_gdt_table(0));
+ gdt_descr.size = GDT_SIZE - 1;
+ load_gdt(&gdt_descr);
}
static void efi_call_phys_epilog(void) __releases(efi_rt_lock)
{
unsigned long cr4;
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, 0);
+ struct Xgt_desc_struct gdt_descr;
- cpu_gdt_descr->address = (unsigned long)__va(cpu_gdt_descr->address);
- load_gdt(cpu_gdt_descr);
+ gdt_descr.address = (unsigned long)get_cpu_gdt_table(0);
+ gdt_descr.size = GDT_SIZE - 1;
+ load_gdt(&gdt_descr);
cr4 = read_cr4();
Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -561,8 +561,7 @@ END(syscall_badsys)
#define FIXUP_ESPFIX_STACK \
/* since we are on a wrong stack, we cant make it a C code :( */ \
movl %fs:PDA_cpu, %ebx; \
- PER_CPU(cpu_gdt_descr, %ebx); \
- movl GDS_address(%ebx), %ebx; \
+ PER_CPU(cpu_gdt, %ebx); \
GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \
addl %esp, %eax; \
pushl $__KERNEL_DS; \
Index: linux/arch/i386/kernel/smpboot.c
===================================================================
--- linux.orig/arch/i386/kernel/smpboot.c
+++ linux/arch/i386/kernel/smpboot.c
@@ -786,13 +786,9 @@ static inline struct task_struct * alloc
secondary which will soon come up. */
static __cpuinit void init_gdt(int cpu, struct task_struct *idle)
{
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
- struct desc_struct *gdt = per_cpu(cpu_gdt, cpu);
+ struct desc_struct *gdt = get_cpu_gdt_table(cpu);
struct i386_pda *pda = &per_cpu(_cpu_pda, cpu);
- cpu_gdt_descr->address = (unsigned long)gdt;
- cpu_gdt_descr->size = GDT_SIZE - 1;
-
pack_descriptor((u32 *)&gdt[GDT_ENTRY_PDA].a,
(u32 *)&gdt[GDT_ENTRY_PDA].b,
(unsigned long)pda, sizeof(*pda) - 1,
@@ -1187,7 +1183,11 @@ void __init smp_prepare_cpus(unsigned in
* it's on the real one. */
static inline void switch_to_new_gdt(void)
{
- load_gdt(&per_cpu(cpu_gdt_descr, smp_processor_id()));
+ struct Xgt_desc_struct gdt_descr;
+
+ gdt_descr.address = (long)get_cpu_gdt_table(smp_processor_id());
+ gdt_descr.size = GDT_SIZE - 1;
+ load_gdt(&gdt_descr);
asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_PDA) : "memory");
}
Index: linux/arch/i386/kernel/traps.c
===================================================================
--- linux.orig/arch/i386/kernel/traps.c
+++ linux/arch/i386/kernel/traps.c
@@ -1030,9 +1030,7 @@ fastcall void do_spurious_interrupt_bug(
fastcall unsigned long patch_espfix_desc(unsigned long uesp,
unsigned long kesp)
{
- int cpu = smp_processor_id();
- struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
- struct desc_struct *gdt = (struct desc_struct *)cpu_gdt_descr->address;
+ struct desc_struct *gdt = __get_cpu_var(cpu_gdt);
unsigned long base = (kesp - uesp) & -THREAD_SIZE;
unsigned long new_kesp = kesp - base;
unsigned long lim_pages = (new_kesp | (THREAD_SIZE - 1)) >> PAGE_SHIFT;
Index: linux/include/asm-i386/desc.h
===================================================================
--- linux.orig/include/asm-i386/desc.h
+++ linux/include/asm-i386/desc.h
@@ -18,16 +18,13 @@ struct Xgt_desc_struct {
unsigned short pad;
} __attribute__ ((packed));
-extern struct Xgt_desc_struct idt_descr;
-DECLARE_PER_CPU(struct Xgt_desc_struct, cpu_gdt_descr);
DECLARE_PER_CPU(struct desc_struct, cpu_gdt[GDT_ENTRIES]);
-extern struct Xgt_desc_struct early_gdt_descr;
-
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
- return (struct desc_struct *)per_cpu(cpu_gdt_descr, cpu).address;
+ return per_cpu(cpu_gdt, cpu);
}
+extern struct Xgt_desc_struct idt_descr;
extern struct desc_struct idt_table[];
extern void set_intr_gate(unsigned int irq, void * addr);
^ permalink raw reply [flat|nested] 55+ messages in thread