* [RFC 00/15] x86_64: Optimize percpu accesses
@ 2008-07-09 16:51 Mike Travis
2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis
0 siblings, 19 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
This patchset provides the following:
* Cleanup: Fix early references to cpumask_of_cpu(0)
Provides an early cpumask_of_cpu(0) usable before the cpumask_of_cpu_map
is allocated and initialized.
* Generic: Percpu infrastructure to rebase the per cpu area to zero
This makes it possible to access percpu variables through a local
register instead of going through a table on node 0 to find the
cpu-specific offsets. It would also allow atomic operations on
percpu variables, reducing the locking required.
Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
generic code that the arch has this new basing.
(Note: split into two patches, one to rebase percpu variables at 0,
and the second to actually use %gs as the base for percpu variables.)
* x86_64: Fold pda into per cpu area
Declare the pda as a per cpu variable. This moves the pda
area to an address accessible by the x86_64 per cpu macros.
Subtracting __per_cpu_start makes the offset relative to the
beginning of the per cpu area. Since %gs points to the pda,
it then also points to the per cpu variables, which can be
accessed thusly:
%gs:[&per_cpu_xxxx - __per_cpu_start]
* x86_64: Rebase per cpu variables to zero
Take advantage of the zero-based per cpu area provided above.
Then we can directly use the x86_32 percpu operations. x86_32
offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
to the pda and the per cpu area thereby allowing access to the
pda with the x86_64 pda operations and access to the per cpu
variables using x86_32 percpu operations.
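The addressing scheme sketched in the bullets above can be modeled in plain C. The following is a user-space illustration, not kernel code, and all names and sizes (MODEL_NCPUS, counter_off, etc.) are invented stand-ins: each cpu's area is a copy of the initial percpu image, and with the section based at zero a variable's link-time address equals its offset into the area, so one per-cpu base pointer (the role %gs plays on x86_64) plus the symbol reaches the right copy.

```c
#include <string.h>

/* User-space model of zero-based percpu areas (illustrative names only). */
#define MODEL_NCPUS     4
#define MODEL_AREA_SIZE 64

/* Stands in for the initial percpu section image (__per_cpu_load). */
static unsigned char init_area[MODEL_AREA_SIZE];

/* One copy per cpu, as setup_per_cpu_areas() would allocate. */
static unsigned char cpu_area[MODEL_NCPUS][MODEL_AREA_SIZE];

/* With the section linked at 0, a percpu variable's "address" is
 * simply its offset into the area; 16 here is an arbitrary example. */
static const unsigned long counter_off = 16;

static void model_setup_areas(void)
{
	int cpu;

	/* Copy the initial image into each cpu's private area. */
	for (cpu = 0; cpu < MODEL_NCPUS; cpu++)
		memcpy(cpu_area[cpu], init_area, MODEL_AREA_SIZE);
}

/* Model of %gs:&per_cpu_var(counter): per-cpu base + zero-based offset. */
static unsigned int *model_per_cpu_counter(int cpu)
{
	return (unsigned int *)(cpu_area[cpu] + counter_off);
}
```

Writing through the returned pointer for one cpu leaves the other cpus' copies untouched, which is the property the segment-based addressing relies on.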
Based on linux-2.6.tip/master with following patches applied:
[PATCH 1/1] x86: Add check for node passed to node_to_cpumask V3
[PATCH 1/1] x86: Change _node_to_cpumask_ptr to return const ptr
[PATCH 1/1] sched: Reduce stack size in isolated_cpu_setup()
[PATCH 1/1] kthread: Reduce stack pressure in create_kthread and kthreadd
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
--
^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 01/15] x86_64: Cleanup early setup_percpu references
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: cleanup_percpu --]
[-- Type: text/plain, Size: 5644 bytes --]
* Initialize the cpumask_of_cpu_map to contain a cpumask for cpu 0
in the initdata section. This allows references before the real
cpumask_of_cpu_map is set up, avoiding possible null pointer
dereference panics.
* Ruggedize some other calls to prevent mishaps from early calls,
particularly in non-critical functions.
* Cleanup DEBUG_PER_CPU_MAPS usages and some comments.
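The first bullet can be sketched with a minimal model (assumed shapes; the kernel's cpumask_t is NR_CPUS bits wide and the real map is later reallocated with alloc_bootmem): a static one-entry map with only the cpu-0 bit set is installed as the initial pointer, so an early cpumask_of_cpu(0) lookup hits valid initdata instead of a NULL pointer.

```c
/* Illustrative single-word cpumask; the kernel's cpumask_t is wider. */
typedef struct { unsigned long bits[1]; } model_cpumask_t;

/* Static initial map: entry 0 has only cpu 0's bit set. */
static model_cpumask_t model_initial_map[1] = { { { 1UL } } };

/* Early users read through this pointer before the bootmem-allocated
 * map replaces it. */
static model_cpumask_t *model_cpumask_of_cpu_map = model_initial_map;

/* Model of cpu_isset(): test one bit in a mask. */
static int model_cpu_isset(int cpu, const model_cpumask_t *mask)
{
	return (int)((mask->bits[0] >> cpu) & 1UL);
}
```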
Based on linux-2.6.tip/master
Signed-off-by: Mike Travis <travis@sgi.com>
---
arch/x86/kernel/setup_percpu.c | 73 ++++++++++++++++++++++++++++-------------
1 file changed, 51 insertions(+), 22 deletions(-)
--- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c
+++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c
@@ -15,6 +15,12 @@
#include <asm/apicdef.h>
#include <asm/highmem.h>
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+# define DBG(x...) printk(KERN_DEBUG x)
+#else
+# define DBG(x...)
+#endif
+
#ifdef CONFIG_X86_LOCAL_APIC
unsigned int num_processors;
unsigned disabled_cpus __cpuinitdata;
@@ -27,31 +33,39 @@ EXPORT_SYMBOL(boot_cpu_physical_apicid);
physid_mask_t phys_cpu_present_map;
#endif
-/* map cpu index to physical APIC ID */
+/*
+ * Map cpu index to physical APIC ID
+ */
DEFINE_EARLY_PER_CPU(u16, x86_cpu_to_apicid, BAD_APICID);
DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid, BAD_APICID);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
-#define X86_64_NUMA 1
+#define X86_64_NUMA 1 /* (used later) */
-/* map cpu index to node index */
+/*
+ * Map cpu index to node index
+ */
DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);
-/* which logical CPUs are on which nodes */
+/*
+ * Which logical CPUs are on which nodes
+ */
cpumask_t *node_to_cpumask_map;
EXPORT_SYMBOL(node_to_cpumask_map);
-/* setup node_to_cpumask_map */
+/*
+ * Setup node_to_cpumask_map
+ */
static void __init setup_node_to_cpumask_map(void);
#else
static inline void setup_node_to_cpumask_map(void) { }
#endif
-#if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_SMP)
+#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
/*
* Copy data used in early init routines from the initial arrays to the
* per cpu data areas. These arrays then become expendable and the
@@ -81,16 +95,25 @@ static void __init setup_per_cpu_maps(vo
}
#ifdef CONFIG_HAVE_CPUMASK_OF_CPU_MAP
-cpumask_t *cpumask_of_cpu_map __read_mostly;
+/*
+ * Configure an initial cpumask_of_cpu(0) for early users
+ */
+static cpumask_t initial_cpumask_of_cpu_map __initdata = (cpumask_t) { {
+ [BITS_TO_LONGS(NR_CPUS)-1] = 1
+} };
+cpumask_t *cpumask_of_cpu_map __read_mostly =
+ (cpumask_t *)&initial_cpumask_of_cpu_map;
EXPORT_SYMBOL(cpumask_of_cpu_map);
-/* requires nr_cpu_ids to be initialized */
+/* Requires nr_cpu_ids to be initialized. */
static void __init setup_cpumask_of_cpu(void)
{
int i;
/* alloc_bootmem zeroes memory */
cpumask_of_cpu_map = alloc_bootmem_low(sizeof(cpumask_t) * nr_cpu_ids);
+ DBG("cpumask_of_cpu_map %p\n", cpumask_of_cpu_map);
+
for (i = 0; i < nr_cpu_ids; i++)
cpu_set(i, cpumask_of_cpu_map[i]);
}
@@ -197,9 +220,10 @@ void __init setup_per_cpu_areas(void)
per_cpu_offset(cpu) = ptr - __per_cpu_start;
memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+ DBG("PERCPU: cpu %4d %p\n", cpu, ptr);
}
- printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n",
+ printk(KERN_INFO "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n",
NR_CPUS, nr_cpu_ids, nr_node_ids);
/* Setup percpu data maps */
@@ -221,6 +245,7 @@ void __init setup_per_cpu_areas(void)
* Requires node_possible_map to be valid.
*
* Note: node_to_cpumask() is not valid until after this is done.
+ * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
*/
static void __init setup_node_to_cpumask_map(void)
{
@@ -236,9 +261,7 @@ static void __init setup_node_to_cpumask
/* allocate the map */
map = alloc_bootmem_low(nr_node_ids * sizeof(cpumask_t));
-
- Dprintk(KERN_DEBUG "Node to cpumask map at %p for %d nodes\n",
- map, nr_node_ids);
+ DBG("node_to_cpumask_map at %p for %d nodes\n", map, nr_node_ids);
/* node_to_cpumask() will now work */
node_to_cpumask_map = map;
@@ -248,17 +271,23 @@ void __cpuinit numa_set_node(int cpu, in
{
int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
- if (cpu_pda(cpu) && node != NUMA_NO_NODE)
- cpu_pda(cpu)->nodenumber = node;
-
- if (cpu_to_node_map)
+ /* early setting, no percpu area yet */
+ if (cpu_to_node_map) {
cpu_to_node_map[cpu] = node;
+ return;
+ }
- else if (per_cpu_offset(cpu))
- per_cpu(x86_cpu_to_node_map, cpu) = node;
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ if (cpu >= nr_cpu_ids || !per_cpu_offset(cpu)) {
+ printk(KERN_ERR "numa_set_node: invalid cpu# (%d)\n", cpu);
+ dump_stack();
+ return;
+ }
+#endif
+ per_cpu(x86_cpu_to_node_map, cpu) = node;
- else
- Dprintk(KERN_INFO "Setting node for non-present cpu %d\n", cpu);
+ if (node != NUMA_NO_NODE)
+ cpu_pda(cpu)->nodenumber = node;
}
void __cpuinit numa_clear_node(int cpu)
@@ -275,7 +304,7 @@ void __cpuinit numa_add_cpu(int cpu)
void __cpuinit numa_remove_cpu(int cpu)
{
- cpu_clear(cpu, node_to_cpumask_map[cpu_to_node(cpu)]);
+ cpu_clear(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]);
}
#else /* CONFIG_DEBUG_PER_CPU_MAPS */
@@ -285,7 +314,7 @@ void __cpuinit numa_remove_cpu(int cpu)
*/
static void __cpuinit numa_set_cpumask(int cpu, int enable)
{
- int node = cpu_to_node(cpu);
+ int node = early_cpu_to_node(cpu);
cpumask_t *mask;
char buf[64];
--
* [RFC 02/15] x86_64: Fold pda into per cpu area
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 22:02 ` Eric W. Biederman
2008-07-09 16:51 ` [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs Mike Travis
18 siblings, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: zero_based_fold --]
[-- Type: text/plain, Size: 15082 bytes --]
WARNING: there is still a FIXME in this patch (see arch/x86/kernel/acpi/sleep.c)
* Declare the pda as a per cpu variable.
* Make the x86_64 per cpu area start at zero.
* Relocate the initial pda and per_cpu(gdt_page) in head_64.S for the
boot cpu (0). For secondary cpus, do_boot_cpu() sets up the correct
initial pda and gdt_page pointer.
* Initialize per_cpu_offset to point to static pda in the per_cpu area
(@ __per_cpu_load).
* After allocation of the per cpu area for the boot cpu (0), reload the
gdt page pointer.
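The pda-first layout described above can be modeled in a few lines of user-space C (a hedged sketch: struct fields and names are invented stand-ins for struct x8664_pda, not the real definitions). Because the pda is forced to be the first object in each per-cpu area, the area's base address doubles as a pointer to that cpu's pda, and data_offset can record the base itself:

```c
/* Cut-down stand-in for struct x8664_pda (fields invented). */
struct model_pda {
	unsigned long data_offset;	/* base of this cpu's percpu area */
	int nodenumber;
};

#define MODEL_PDA_NCPUS 2

/* Each per-cpu area begins with the pda, as DEFINE_PER_CPU_FIRST
 * arranges in the patch. */
static unsigned char model_area[MODEL_PDA_NCPUS][128];

/* cpu_pda(cpu) == &per_cpu(pda, cpu): just the area base, at offset 0. */
static struct model_pda *model_cpu_pda(int cpu)
{
	return (struct model_pda *)model_area[cpu];
}

/* Record the area base in the pda, as setup_per_cpu_areas() does
 * with pda->data_offset. */
static void model_pda_init(int cpu)
{
	model_cpu_pda(cpu)->data_offset = (unsigned long)model_area[cpu];
}
```

This is why loading MSR_GS_BASE with the per-cpu base is enough to make both the pda operations and the percpu operations work through %gs.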
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
arch/x86/Kconfig | 3 +
arch/x86/kernel/acpi/sleep.c | 9 +++
arch/x86/kernel/cpu/common_64.c | 4 -
arch/x86/kernel/head64.c | 24 +--------
arch/x86/kernel/head_64.S | 45 ++++++++++++++++--
arch/x86/kernel/setup_percpu.c | 94 +++++++++++++++++----------------------
arch/x86/kernel/smpboot.c | 52 ---------------------
arch/x86/kernel/vmlinux_64.lds.S | 1
include/asm-x86/desc.h | 5 ++
include/asm-x86/pda.h | 3 -
include/asm-x86/percpu.h | 13 -----
include/asm-x86/trampoline.h | 1
12 files changed, 112 insertions(+), 142 deletions(-)
--- linux-2.6.tip.orig/arch/x86/Kconfig
+++ linux-2.6.tip/arch/x86/Kconfig
@@ -129,6 +129,9 @@ config HAVE_SETUP_PER_CPU_AREA
config HAVE_CPUMASK_OF_CPU_MAP
def_bool X86_64_SMP
+config HAVE_ZERO_BASED_PER_CPU
+ def_bool X86_64_SMP
+
config ARCH_HIBERNATION_POSSIBLE
def_bool y
depends on !SMP || !X86_VOYAGER
--- linux-2.6.tip.orig/arch/x86/kernel/acpi/sleep.c
+++ linux-2.6.tip/arch/x86/kernel/acpi/sleep.c
@@ -89,6 +89,15 @@ int acpi_save_state_mem(void)
#ifdef CONFIG_SMP
stack_start.sp = temp_stack + 4096;
#endif
+ /*
+ * FIXME: with zero-based percpu variables, the pda and gdt_page
+ * addresses must be offset by the base of this cpu's percpu area.
+ * Where/how should we do this?
+ *
+ * for secondary cpu startup in smpboot.c:do_boot_cpu() this is done:
+ * early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
+ * initial_pda = (unsigned long)get_cpu_pda(cpu);
+ */
initial_code = (unsigned long)wakeup_long64;
saved_magic = 0x123456789abcdef0;
#endif /* CONFIG_64BIT */
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/common_64.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/common_64.c
@@ -423,8 +423,8 @@ __setup("clearcpuid=", setup_disablecpui
cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
-struct x8664_pda **_cpu_pda __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
--- linux-2.6.tip.orig/arch/x86/kernel/head64.c
+++ linux-2.6.tip/arch/x86/kernel/head64.c
@@ -25,20 +25,6 @@
#include <asm/e820.h>
#include <asm/bios_ebda.h>
-/* boot cpu pda */
-static struct x8664_pda _boot_cpu_pda __read_mostly;
-
-#ifdef CONFIG_SMP
-/*
- * We install an empty cpu_pda pointer table to indicate to early users
- * (numa_set_node) that the cpu_pda pointer table for cpus other than
- * the boot cpu is not yet setup.
- */
-static struct x8664_pda *__cpu_pda[NR_CPUS] __initdata;
-#else
-static struct x8664_pda *__cpu_pda[NR_CPUS] __read_mostly;
-#endif
-
static void __init zap_identity_mappings(void)
{
pgd_t *pgd = pgd_offset_k(0UL);
@@ -91,6 +77,10 @@ void __init x86_64_start_kernel(char * r
/* Cleanup the over mapped high alias */
cleanup_highmap();
+ /* Initialize boot cpu_pda data */
+ /* (See head_64.S for earlier pda/gdt initialization) */
+ pda_init(0);
+
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
#ifdef CONFIG_EARLY_PRINTK
set_intr_gate(i, &early_idt_handlers[i]);
@@ -102,12 +92,6 @@ void __init x86_64_start_kernel(char * r
early_printk("Kernel alive\n");
- _cpu_pda = __cpu_pda;
- cpu_pda(0) = &_boot_cpu_pda;
- pda_init(0);
-
- early_printk("Kernel really alive\n");
-
copy_bootdata(__va(real_mode_data));
reserve_early(__pa_symbol(&_text), __pa_symbol(&_end), "TEXT DATA BSS");
--- linux-2.6.tip.orig/arch/x86/kernel/head_64.S
+++ linux-2.6.tip/arch/x86/kernel/head_64.S
@@ -12,6 +12,7 @@
#include <linux/linkage.h>
#include <linux/threads.h>
#include <linux/init.h>
+#include <asm/asm-offsets.h>
#include <asm/desc.h>
#include <asm/segment.h>
#include <asm/pgtable.h>
@@ -203,7 +204,27 @@ ENTRY(secondary_startup_64)
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt early_gdt_descr(%rip)
+
+#ifdef CONFIG_SMP
+ /*
+ * For zero-based percpu variables, the base (__per_cpu_load) must
+ * be added to the offset of per_cpu__gdt_page. This is only needed
+ * for the boot cpu but we can't do this prior to secondary_startup_64.
+ * So we use a NULL gdt adrs to indicate that we are starting up the
+ * boot cpu and not the secondary cpus. do_boot_cpu() will fixup
+ * the gdt adrs for those cpus.
+ */
+#define PER_CPU_GDT_PAGE 0
+ movq early_gdt_descr_base(%rip), %rax
+ testq %rax, %rax
+ jnz 1f
+ movq $__per_cpu_load, %rax
+ addq $per_cpu__gdt_page, %rax
+ movq %rax, early_gdt_descr_base(%rip)
+#else
+#define PER_CPU_GDT_PAGE per_cpu__gdt_page
+#endif
+1: lgdt early_gdt_descr(%rip)
/* set up data segments. actually 0 would do too */
movl $__KERNEL_DS,%eax
@@ -220,14 +241,21 @@ ENTRY(secondary_startup_64)
movl %eax,%gs
/*
- * Setup up a dummy PDA. this is just for some early bootup code
- * that does in_interrupt()
+ * Setup up the real PDA.
+ *
+ * For SMP, the boot cpu (0) uses the static pda which is the first
+ * element in the percpu area (@__per_cpu_load). This pda is moved
+ * to the real percpu area once that is allocated. Secondary cpus
+ * will use the initial_pda value setup in do_boot_cpu().
*/
movl $MSR_GS_BASE,%ecx
- movq $empty_zero_page,%rax
+ movq initial_pda(%rip), %rax
movq %rax,%rdx
shrq $32,%rdx
wrmsr
+#ifdef CONFIG_SMP
+ movq %rax, %gs:pda_data_offset
+#endif
/* esi is pointer to real mode structure with interesting info.
pass it to C */
@@ -250,6 +278,12 @@ ENTRY(secondary_startup_64)
.align 8
ENTRY(initial_code)
.quad x86_64_start_kernel
+ ENTRY(initial_pda)
+#ifdef CONFIG_SMP
+ .quad __per_cpu_load # Overwritten for secondary CPUs
+#else
+ .quad per_cpu__pda
+#endif
__FINITDATA
ENTRY(stack_start)
@@ -394,7 +428,8 @@ NEXT_PAGE(level2_spare_pgt)
.globl early_gdt_descr
early_gdt_descr:
.word GDT_ENTRIES*8-1
- .quad per_cpu__gdt_page
+early_gdt_descr_base:
+ .quad PER_CPU_GDT_PAGE # Overwritten for secondary CPUs
ENTRY(phys_base)
/* This must match the first entry in level2_kernel_pgt */
--- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c
+++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c
@@ -14,6 +14,7 @@
#include <asm/mpspec.h>
#include <asm/apicdef.h>
#include <asm/highmem.h>
+#include <asm/desc.h>
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
# define DBG(x...) printk(KERN_DEBUG x)
@@ -119,63 +120,26 @@ static void __init setup_cpumask_of_cpu(
static inline void setup_cpumask_of_cpu(void) { }
#endif
-#ifdef CONFIG_X86_32
/*
- * Great future not-so-futuristic plan: make i386 and x86_64 do it
- * the same way
+ * Pointers to per cpu areas for each cpu
*/
+#ifndef CONFIG_HAVE_ZERO_BASED_PER_CPU
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);
-static inline void setup_cpu_pda_map(void) { }
-
-#elif !defined(CONFIG_SMP)
-static inline void setup_cpu_pda_map(void) { }
-
-#else /* CONFIG_SMP && CONFIG_X86_64 */
+#else
/*
- * Allocate cpu_pda pointer table and array via alloc_bootmem.
+ * Initialize percpu offset for boot cpu (0) to static percpu area
+ * for referencing very early in kernel startup.
*/
-static void __init setup_cpu_pda_map(void)
-{
- char *pda;
- struct x8664_pda **new_cpu_pda;
- unsigned long size;
- int cpu;
-
- size = roundup(sizeof(struct x8664_pda), cache_line_size());
-
- /* allocate cpu_pda array and pointer table */
- {
- unsigned long tsize = nr_cpu_ids * sizeof(void *);
- unsigned long asize = size * (nr_cpu_ids - 1);
-
- tsize = roundup(tsize, cache_line_size());
- new_cpu_pda = alloc_bootmem(tsize + asize);
- pda = (char *)new_cpu_pda + tsize;
- }
-
- /* initialize pointer table to static pda's */
- for_each_possible_cpu(cpu) {
- if (cpu == 0) {
- /* leave boot cpu pda in place */
- new_cpu_pda[0] = cpu_pda(0);
- continue;
- }
- new_cpu_pda[cpu] = (struct x8664_pda *)pda;
- new_cpu_pda[cpu]->in_bootmem = 1;
- pda += size;
- }
-
- /* point to new pointer table */
- _cpu_pda = new_cpu_pda;
-}
+unsigned long __per_cpu_offset[NR_CPUS] __read_mostly = {
+ [0] = (unsigned long)__per_cpu_load
+};
+EXPORT_SYMBOL(__per_cpu_offset);
#endif
/*
- * Great future plan:
- * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data.
- * Always point %gs to its beginning
+ * Allocate and initialize the per cpu areas which include the PDAs.
*/
void __init setup_per_cpu_areas(void)
{
@@ -193,9 +157,6 @@ void __init setup_per_cpu_areas(void)
nr_cpu_ids = num_processors;
#endif
- /* Setup cpu_pda map */
- setup_cpu_pda_map();
-
/* Copy section for each CPU (we discard the original) */
size = PERCPU_ENOUGH_ROOM;
printk(KERN_INFO "PERCPU: Allocating %zd bytes of per cpu data\n",
@@ -215,10 +176,41 @@ void __init setup_per_cpu_areas(void)
else
ptr = alloc_bootmem_pages_node(NODE_DATA(node), size);
#endif
+ /* Initialize each cpu's per_cpu area and save pointer */
+ memcpy(ptr, __per_cpu_load, __per_cpu_size);
per_cpu_offset(cpu) = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
DBG("PERCPU: cpu %4d %p\n", cpu, ptr);
+
+#ifdef CONFIG_X86_64
+ /*
+ * Note the boot cpu (0) has been using the static per_cpu load
+ * area for it's pda. We need to zero out the pdas for the
+ * other cpus that are coming online.
+ *
+ * Additionally, for the boot cpu the gdt page must be reloaded
+ * as we moved it from the static per cpu area to the newly
+ * allocated area.
+ */
+ {
+ /* We rely on the fact that pda is the first element */
+ struct x8664_pda *pda = (struct x8664_pda *)ptr;
+
+ if (cpu) {
+ memset(pda, 0, sizeof(*pda));
+ pda->data_offset = (unsigned long)ptr;
+ } else {
+ struct desc_ptr gdt_descr = early_gdt_descr;
+
+ pda->data_offset = (unsigned long)ptr;
+ gdt_descr.address =
+ (unsigned long)get_cpu_gdt_table(0);
+ native_load_gdt(&gdt_descr);
+ pda_init(0);
+ }
+
+ }
+#endif
}
printk(KERN_INFO "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n",
--- linux-2.6.tip.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6.tip/arch/x86/kernel/smpboot.c
@@ -762,45 +762,6 @@ static void __cpuinit do_fork_idle(struc
complete(&c_idle->done);
}
-#ifdef CONFIG_X86_64
-/*
- * Allocate node local memory for the AP pda.
- *
- * Must be called after the _cpu_pda pointer table is initialized.
- */
-static int __cpuinit get_local_pda(int cpu)
-{
- struct x8664_pda *oldpda, *newpda;
- unsigned long size = sizeof(struct x8664_pda);
- int node = cpu_to_node(cpu);
-
- if (cpu_pda(cpu) && !cpu_pda(cpu)->in_bootmem)
- return 0;
-
- oldpda = cpu_pda(cpu);
- newpda = kmalloc_node(size, GFP_ATOMIC, node);
- if (!newpda) {
- printk(KERN_ERR "Could not allocate node local PDA "
- "for CPU %d on node %d\n", cpu, node);
-
- if (oldpda)
- return 0; /* have a usable pda */
- else
- return -1;
- }
-
- if (oldpda) {
- memcpy(newpda, oldpda, size);
- if (!after_bootmem)
- free_bootmem((unsigned long)oldpda, size);
- }
-
- newpda->in_bootmem = 0;
- cpu_pda(cpu) = newpda;
- return 0;
-}
-#endif /* CONFIG_X86_64 */
-
static int __cpuinit do_boot_cpu(int apicid, int cpu)
/*
* NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -818,16 +779,6 @@ static int __cpuinit do_boot_cpu(int api
};
INIT_WORK(&c_idle.work, do_fork_idle);
-#ifdef CONFIG_X86_64
- /* Allocate node local memory for AP pdas */
- if (cpu > 0) {
- boot_error = get_local_pda(cpu);
- if (boot_error)
- goto restore_state;
- /* if can't get pda memory, can't start cpu */
- }
-#endif
-
alternatives_smp_switch(1);
c_idle.idle = get_idle_for_cpu(cpu);
@@ -865,6 +816,7 @@ do_rest:
#else
cpu_pda(cpu)->pcurrent = c_idle.idle;
clear_tsk_thread_flag(c_idle.idle, TIF_FORK);
+ initial_pda = (unsigned long)get_cpu_pda(cpu);
#endif
early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
initial_code = (unsigned long)start_secondary;
@@ -940,8 +892,6 @@ do_rest:
}
}
-restore_state:
-
if (boot_error) {
/* Try to put things back the way they were before ... */
numa_remove_cpu(cpu); /* was set by numa_add_cpu */
--- linux-2.6.tip.orig/arch/x86/kernel/vmlinux_64.lds.S
+++ linux-2.6.tip/arch/x86/kernel/vmlinux_64.lds.S
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
_proxy_pda = 1;
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
+ percpu PT_LOAD FLAGS(7); /* RWE */
data PT_LOAD FLAGS(7); /* RWE */
user PT_LOAD FLAGS(7); /* RWE */
data.init PT_LOAD FLAGS(7); /* RWE */
--- linux-2.6.tip.orig/include/asm-x86/desc.h
+++ linux-2.6.tip/include/asm-x86/desc.h
@@ -41,6 +41,11 @@ static inline struct desc_struct *get_cp
#ifdef CONFIG_X86_64
+static inline struct x8664_pda *get_cpu_pda(unsigned int cpu)
+{
+ return &per_cpu(pda, cpu);
+}
+
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
unsigned dpl, unsigned ist, unsigned seg)
{
--- linux-2.6.tip.orig/include/asm-x86/pda.h
+++ linux-2.6.tip/include/asm-x86/pda.h
@@ -37,10 +37,9 @@ struct x8664_pda {
unsigned irq_spurious_count;
} ____cacheline_aligned_in_smp;
-extern struct x8664_pda **_cpu_pda;
extern void pda_init(int);
-#define cpu_pda(i) (_cpu_pda[i])
+#define cpu_pda(cpu) (&per_cpu(pda, cpu))
/*
* There is no fast way to get the base address of the PDA, all the accesses
--- linux-2.6.tip.orig/include/asm-x86/percpu.h
+++ linux-2.6.tip/include/asm-x86/percpu.h
@@ -3,20 +3,11 @@
#ifdef CONFIG_X86_64
#include <linux/compiler.h>
-
-/* Same as asm-generic/percpu.h, except that we store the per cpu offset
- in the PDA. Longer term the PDA and every per cpu variable
- should be just put into a single section and referenced directly
- from %gs */
-
-#ifdef CONFIG_SMP
#include <asm/pda.h>
-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
+/* Same as asm-generic/percpu.h */
+#ifdef CONFIG_SMP
#define __my_cpu_offset read_pda(data_offset)
-
-#define per_cpu_offset(x) (__per_cpu_offset(x))
-
#endif
#include <asm-generic/percpu.h>
--- linux-2.6.tip.orig/include/asm-x86/trampoline.h
+++ linux-2.6.tip/include/asm-x86/trampoline.h
@@ -12,6 +12,7 @@ extern unsigned char *trampoline_base;
extern unsigned long init_rsp;
extern unsigned long initial_code;
+extern unsigned long initial_pda;
#define TRAMPOLINE_BASE 0x6000
extern unsigned long setup_trampoline(void);
--
* [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis
2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops Mike Travis
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: zero_based_use_gs --]
[-- Type: text/plain, Size: 2261 bytes --]
* Since %gs points to the pda, it also points to the per cpu
variables, which can be accessed thusly:
%gs:[&per_cpu_xxxx - __per_cpu_start]
... and since __per_cpu_start == 0, this reduces to:
%gs:&per_cpu_var(xxx)
which becomes the optimized effective address.
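The address arithmetic above can be checked with pure C (no kernel dependencies; the numeric values below are arbitrary examples): the generic form is seg_base + (sym - __per_cpu_start), and with __per_cpu_start == 0 it collapses to seg_base + sym, which is exactly the %gs:&per_cpu_var(xxx) operand.

```c
/* Effective address of a percpu variable: segment base plus the
 * variable's offset from the start of the percpu section. */
static unsigned long model_effaddr(unsigned long seg_base,
				  unsigned long sym,
				  unsigned long per_cpu_start)
{
	return seg_base + (sym - per_cpu_start);
}
```

With a nonzero section start (the x86_32 %fs case) the subtraction is needed; with a zero-based section the symbol value already is the offset, so the two forms compute the same address.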
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/percpu.h | 36 +++++++++++++-----------------------
1 file changed, 13 insertions(+), 23 deletions(-)
--- linux-2.6.tip.orig/include/asm-x86/percpu.h
+++ linux-2.6.tip/include/asm-x86/percpu.h
@@ -5,15 +5,19 @@
#include <linux/compiler.h>
#include <asm/pda.h>
-/* Same as asm-generic/percpu.h */
+/* Same as asm-generic/percpu.h, except we use %gs as a segment offset. */
#ifdef CONFIG_SMP
#define __my_cpu_offset read_pda(data_offset)
+#define __percpu_seg "%%gs:"
+#else
+#define __percpu_seg ""
#endif
+
#include <asm-generic/percpu.h>
DECLARE_PER_CPU(struct x8664_pda, pda);
-#else /* CONFIG_X86_64 */
+#else /* !CONFIG_X86_64 */
#ifdef __ASSEMBLY__
@@ -42,36 +46,23 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
#else /* ...!ASSEMBLY */
-/*
- * PER_CPU finds an address of a per-cpu variable.
- *
- * Args:
- * var - variable name
- * cpu - 32bit register containing the current CPU number
- *
- * The resulting address is stored in the "cpu" argument.
- *
- * Example:
- * PER_CPU(cpu_gdt_descr, %ebx)
- */
#ifdef CONFIG_SMP
-
#define __my_cpu_offset x86_read_percpu(this_cpu_off)
-
-/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */
#define __percpu_seg "%%fs:"
-
-#else /* !SMP */
-
+#else
#define __percpu_seg ""
-
-#endif /* SMP */
+#endif
#include <asm-generic/percpu.h>
/* We can use this directly for local CPU (faster). */
DECLARE_PER_CPU(unsigned long, this_cpu_off);
+#endif /* __ASSEMBLY__ */
+#endif /* !CONFIG_X86_64 */
+
+#ifndef __ASSEMBLY__
+
/* For arch-specific code, we can use direct single-insn ops (they
* don't give an lvalue though). */
extern void __bad_percpu_size(void);
@@ -206,7 +197,6 @@ do { \
percpu_cmpxchg_op(per_cpu_var(var), old, new)
#endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */
#ifdef CONFIG_SMP
--
* [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
2008-07-09 16:51 ` [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu() Mike Travis
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: zero_based_replace_cpu_pda --]
[-- Type: text/plain, Size: 6292 bytes --]
* Replace cpu_pda(i) references with per_cpu(pda, i).
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
arch/x86/kernel/cpu/common_64.c | 2 +-
arch/x86/kernel/irq_64.c | 36 ++++++++++++++++++++----------------
arch/x86/kernel/nmi.c | 2 +-
arch/x86/kernel/setup_percpu.c | 2 +-
arch/x86/kernel/smpboot.c | 2 +-
arch/x86/kernel/traps_64.c | 11 +++++++----
6 files changed, 31 insertions(+), 24 deletions(-)
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/common_64.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/common_64.c
@@ -477,7 +477,7 @@ __setup("noexec32=", nonx32_setup);
void pda_init(int cpu)
{
- struct x8664_pda *pda = cpu_pda(cpu);
+ struct x8664_pda *pda = &per_cpu(pda, cpu);
/* Setup up data that may be needed in __get_free_pages early */
asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
--- linux-2.6.tip.orig/arch/x86/kernel/irq_64.c
+++ linux-2.6.tip/arch/x86/kernel/irq_64.c
@@ -115,39 +115,43 @@ skip:
} else if (i == NR_IRQS) {
seq_printf(p, "NMI: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->__nmi_count);
+ seq_printf(p, "%10u ", per_cpu(pda.__nmi_count, j));
seq_printf(p, " Non-maskable interrupts\n");
seq_printf(p, "LOC: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->apic_timer_irqs);
+ seq_printf(p, "%10u ", per_cpu(pda.apic_timer_irqs, j));
seq_printf(p, " Local timer interrupts\n");
#ifdef CONFIG_SMP
seq_printf(p, "RES: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->irq_resched_count);
+ seq_printf(p, "%10u ",
+ per_cpu(pda.irq_resched_count, j));
seq_printf(p, " Rescheduling interrupts\n");
seq_printf(p, "CAL: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->irq_call_count);
+ seq_printf(p, "%10u ", per_cpu(pda.irq_call_count, j));
seq_printf(p, " function call interrupts\n");
seq_printf(p, "TLB: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->irq_tlb_count);
+ seq_printf(p, "%10u ", per_cpu(pda.irq_tlb_count, j));
seq_printf(p, " TLB shootdowns\n");
#endif
#ifdef CONFIG_X86_MCE
seq_printf(p, "TRM: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->irq_thermal_count);
+ seq_printf(p, "%10u ",
+ per_cpu(pda.irq_thermal_count, j));
seq_printf(p, " Thermal event interrupts\n");
seq_printf(p, "THR: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->irq_threshold_count);
+ seq_printf(p, "%10u ",
+ per_cpu(pda.irq_threshold_count, j));
seq_printf(p, " Threshold APIC interrupts\n");
#endif
seq_printf(p, "SPU: ");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", cpu_pda(j)->irq_spurious_count);
+ seq_printf(p, "%10u ",
+ per_cpu(pda.irq_spurious_count, j));
seq_printf(p, " Spurious interrupts\n");
seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
}
@@ -159,19 +163,19 @@ skip:
*/
u64 arch_irq_stat_cpu(unsigned int cpu)
{
- u64 sum = cpu_pda(cpu)->__nmi_count;
+ u64 sum = per_cpu(pda.__nmi_count, cpu);
- sum += cpu_pda(cpu)->apic_timer_irqs;
+ sum += per_cpu(pda.apic_timer_irqs, cpu);
#ifdef CONFIG_SMP
- sum += cpu_pda(cpu)->irq_resched_count;
- sum += cpu_pda(cpu)->irq_call_count;
- sum += cpu_pda(cpu)->irq_tlb_count;
+ sum += per_cpu(pda.irq_resched_count, cpu);
+ sum += per_cpu(pda.irq_call_count, cpu);
+ sum += per_cpu(pda.irq_tlb_count, cpu);
#endif
#ifdef CONFIG_X86_MCE
- sum += cpu_pda(cpu)->irq_thermal_count;
- sum += cpu_pda(cpu)->irq_threshold_count;
+ sum += per_cpu(pda.irq_thermal_count, cpu);
+ sum += per_cpu(pda.irq_threshold_count, cpu);
#endif
- sum += cpu_pda(cpu)->irq_spurious_count;
+ sum += per_cpu(pda.irq_spurious_count, cpu);
return sum;
}
--- linux-2.6.tip.orig/arch/x86/kernel/nmi.c
+++ linux-2.6.tip/arch/x86/kernel/nmi.c
@@ -61,7 +61,7 @@ static int endflag __initdata;
static inline unsigned int get_nmi_count(int cpu)
{
#ifdef CONFIG_X86_64
- return cpu_pda(cpu)->__nmi_count;
+ return per_cpu(pda.__nmi_count, cpu);
#else
return nmi_count(cpu);
#endif
--- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c
+++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c
@@ -279,7 +279,7 @@ void __cpuinit numa_set_node(int cpu, in
per_cpu(x86_cpu_to_node_map, cpu) = node;
if (node != NUMA_NO_NODE)
- cpu_pda(cpu)->nodenumber = node;
+ per_cpu(pda.nodenumber, cpu) = node;
}
void __cpuinit numa_clear_node(int cpu)
--- linux-2.6.tip.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6.tip/arch/x86/kernel/smpboot.c
@@ -814,7 +814,7 @@ do_rest:
/* Stack for startup_32 can be just as for start_secondary onwards */
irq_ctx_init(cpu);
#else
- cpu_pda(cpu)->pcurrent = c_idle.idle;
+ per_cpu(pda.pcurrent, cpu) = c_idle.idle;
clear_tsk_thread_flag(c_idle.idle, TIF_FORK);
initial_pda = (unsigned long)get_cpu_pda(cpu);
#endif
--- linux-2.6.tip.orig/arch/x86/kernel/traps_64.c
+++ linux-2.6.tip/arch/x86/kernel/traps_64.c
@@ -265,7 +265,8 @@ void dump_trace(struct task_struct *tsk,
const struct stacktrace_ops *ops, void *data)
{
const unsigned cpu = get_cpu();
- unsigned long *irqstack_end = (unsigned long*)cpu_pda(cpu)->irqstackptr;
+ unsigned long *irqstack_end =
+ (unsigned long *)per_cpu(pda.irqstackptr, cpu);
unsigned used = 0;
struct thread_info *tinfo;
@@ -399,8 +400,10 @@ _show_stack(struct task_struct *tsk, str
unsigned long *stack;
int i;
const int cpu = smp_processor_id();
- unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr);
- unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE);
+ unsigned long *irqstack_end =
+ (unsigned long *)per_cpu(pda.irqstackptr, cpu);
+ unsigned long *irqstack =
+ (unsigned long *)(irqstack_end - IRQSTACKSIZE);
// debugging aid: "show_stack(NULL, NULL);" prints the
// back trace for this cpu.
@@ -464,7 +467,7 @@ void show_registers(struct pt_regs *regs
int i;
unsigned long sp;
const int cpu = smp_processor_id();
- struct task_struct *cur = cpu_pda(cpu)->pcurrent;
+ struct task_struct *cur = __get_cpu_var(pda.pcurrent);
u8 *ip;
unsigned int code_prologue = code_bytes * 43 / 64;
unsigned int code_len = code_bytes;
--
^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu().
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (3 preceding siblings ...)
2008-07-09 16:51 ` [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h Mike Travis
` (13 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: zero_based_replace_pda_ops --]
[-- Type: text/plain, Size: 7201 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
* Remove unused field (in_bootmem) from the pda.
* One pda op (test_and_clear_bit_pda) cannot be easily removed,
but since the pda is the first element in the per cpu area,
it can be left in place as is.
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
arch/x86/kernel/apic_64.c | 4 ++--
arch/x86/kernel/cpu/mcheck/mce_amd_64.c | 2 +-
arch/x86/kernel/cpu/mcheck/mce_intel_64.c | 2 +-
arch/x86/kernel/nmi.c | 3 ++-
arch/x86/kernel/process_64.c | 12 ++++++------
arch/x86/kernel/smp.c | 4 ++--
arch/x86/kernel/time_64.c | 2 +-
arch/x86/kernel/tlb_64.c | 12 ++++++------
arch/x86/kernel/traps_64.c | 2 +-
arch/x86/kernel/x8664_ksyms_64.c | 2 --
arch/x86/xen/smp.c | 2 +-
11 files changed, 23 insertions(+), 24 deletions(-)
--- linux-2.6.tip.orig/arch/x86/kernel/apic_64.c
+++ linux-2.6.tip/arch/x86/kernel/apic_64.c
@@ -457,7 +457,7 @@ static void local_apic_timer_interrupt(v
/*
* the NMI deadlock-detector uses this.
*/
- add_pda(apic_timer_irqs, 1);
+ x86_inc_percpu(pda.apic_timer_irqs);
evt->event_handler(evt);
}
@@ -965,7 +965,7 @@ asmlinkage void smp_spurious_interrupt(v
if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))
ack_APIC_irq();
- add_pda(irq_spurious_count, 1);
+ x86_inc_percpu(pda.irq_spurious_count);
irq_exit();
}
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
@@ -237,7 +237,7 @@ asmlinkage void mce_threshold_interrupt(
}
}
out:
- add_pda(irq_threshold_count, 1);
+ x86_inc_percpu(pda.irq_threshold_count);
irq_exit();
}
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
@@ -26,7 +26,7 @@ asmlinkage void smp_thermal_interrupt(vo
if (therm_throt_process(msr_val & 1))
mce_log_therm_throt_event(smp_processor_id(), msr_val);
- add_pda(irq_thermal_count, 1);
+ x86_inc_percpu(pda.irq_thermal_count);
irq_exit();
}
--- linux-2.6.tip.orig/arch/x86/kernel/nmi.c
+++ linux-2.6.tip/arch/x86/kernel/nmi.c
@@ -82,7 +82,8 @@ static inline int mce_in_progress(void)
static inline unsigned int get_timer_irqs(int cpu)
{
#ifdef CONFIG_X86_64
- return read_pda(apic_timer_irqs) + read_pda(irq0_irqs);
+ return x86_read_percpu(pda.apic_timer_irqs) +
+ x86_read_percpu(pda.irq0_irqs);
#else
return per_cpu(irq_stat, cpu).apic_timer_irqs +
per_cpu(irq_stat, cpu).irq0_irqs;
--- linux-2.6.tip.orig/arch/x86/kernel/process_64.c
+++ linux-2.6.tip/arch/x86/kernel/process_64.c
@@ -66,7 +66,7 @@ void idle_notifier_register(struct notif
void enter_idle(void)
{
- write_pda(isidle, 1);
+ x86_write_percpu(pda.isidle, 1);
atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
}
@@ -410,7 +410,7 @@ start_thread(struct pt_regs *regs, unsig
load_gs_index(0);
regs->ip = new_ip;
regs->sp = new_sp;
- write_pda(oldrsp, new_sp);
+ x86_write_percpu(pda.oldrsp, new_sp);
regs->cs = __USER_CS;
regs->ss = __USER_DS;
regs->flags = 0x200;
@@ -646,11 +646,11 @@ __switch_to(struct task_struct *prev_p,
/*
* Switch the PDA and FPU contexts.
*/
- prev->usersp = read_pda(oldrsp);
- write_pda(oldrsp, next->usersp);
- write_pda(pcurrent, next_p);
+ prev->usersp = x86_read_percpu(pda.oldrsp);
+ x86_write_percpu(pda.oldrsp, next->usersp);
+ x86_write_percpu(pda.pcurrent, next_p);
- write_pda(kernelstack,
+ x86_write_percpu(pda.kernelstack,
(unsigned long)task_stack_page(next_p) + THREAD_SIZE - PDA_STACKOFFSET);
#ifdef CONFIG_CC_STACKPROTECTOR
/*
--- linux-2.6.tip.orig/arch/x86/kernel/smp.c
+++ linux-2.6.tip/arch/x86/kernel/smp.c
@@ -295,7 +295,7 @@ void smp_reschedule_interrupt(struct pt_
#ifdef CONFIG_X86_32
__get_cpu_var(irq_stat).irq_resched_count++;
#else
- add_pda(irq_resched_count, 1);
+ x86_inc_percpu(pda.irq_resched_count);
#endif
}
@@ -320,7 +320,7 @@ void smp_call_function_interrupt(struct
#ifdef CONFIG_X86_32
__get_cpu_var(irq_stat).irq_call_count++;
#else
- add_pda(irq_call_count, 1);
+ x86_inc_percpu(pda.irq_call_count);
#endif
irq_exit();
--- linux-2.6.tip.orig/arch/x86/kernel/time_64.c
+++ linux-2.6.tip/arch/x86/kernel/time_64.c
@@ -46,7 +46,7 @@ EXPORT_SYMBOL(profile_pc);
static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
{
- add_pda(irq0_irqs, 1);
+ x86_inc_percpu(pda.irq0_irqs);
global_clock_event->event_handler(global_clock_event);
--- linux-2.6.tip.orig/arch/x86/kernel/tlb_64.c
+++ linux-2.6.tip/arch/x86/kernel/tlb_64.c
@@ -62,9 +62,9 @@ static DEFINE_PER_CPU(union smp_flush_st
*/
void leave_mm(int cpu)
{
- if (read_pda(mmu_state) == TLBSTATE_OK)
+ if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK)
BUG();
- cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
+ cpu_clear(cpu, x86_read_percpu(pda.active_mm)->cpu_vm_mask);
load_cr3(swapper_pg_dir);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -142,8 +142,8 @@ asmlinkage void smp_invalidate_interrupt
* BUG();
*/
- if (f->flush_mm == read_pda(active_mm)) {
- if (read_pda(mmu_state) == TLBSTATE_OK) {
+ if (f->flush_mm == x86_read_percpu(pda.active_mm)) {
+ if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK) {
if (f->flush_va == TLB_FLUSH_ALL)
local_flush_tlb();
else
@@ -154,7 +154,7 @@ asmlinkage void smp_invalidate_interrupt
out:
ack_APIC_irq();
cpu_clear(cpu, f->flush_cpumask);
- add_pda(irq_tlb_count, 1);
+ x86_inc_percpu(pda.irq_tlb_count);
}
void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
@@ -269,7 +269,7 @@ static void do_flush_tlb_all(void *info)
unsigned long cpu = smp_processor_id();
__flush_tlb_all();
- if (read_pda(mmu_state) == TLBSTATE_LAZY)
+ if (x86_read_percpu(pda.mmu_state) == TLBSTATE_LAZY)
leave_mm(cpu);
}
--- linux-2.6.tip.orig/arch/x86/kernel/traps_64.c
+++ linux-2.6.tip/arch/x86/kernel/traps_64.c
@@ -878,7 +878,7 @@ asmlinkage notrace __kprobes void
do_nmi(struct pt_regs *regs, long error_code)
{
nmi_enter();
- add_pda(__nmi_count, 1);
+ x86_inc_percpu(pda.__nmi_count);
if (!ignore_nmis)
default_do_nmi(regs);
nmi_exit();
--- linux-2.6.tip.orig/arch/x86/kernel/x8664_ksyms_64.c
+++ linux-2.6.tip/arch/x86/kernel/x8664_ksyms_64.c
@@ -58,5 +58,3 @@ EXPORT_SYMBOL(__memcpy);
EXPORT_SYMBOL(empty_zero_page);
EXPORT_SYMBOL(init_level4_pgt);
EXPORT_SYMBOL(load_gs_index);
-
-EXPORT_SYMBOL(_proxy_pda);
--- linux-2.6.tip.orig/arch/x86/xen/smp.c
+++ linux-2.6.tip/arch/x86/xen/smp.c
@@ -68,7 +68,7 @@ static irqreturn_t xen_reschedule_interr
#ifdef CONFIG_X86_32
__get_cpu_var(irq_stat).irq_resched_count++;
#else
- add_pda(irq_resched_count, 1);
+ x86_inc_percpu(pda.irq_resched_count);
#endif
return IRQ_HANDLED;
--
* [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (4 preceding siblings ...)
2008-07-09 16:51 ` [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu() Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h Mike Travis
` (12 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_current_h --]
[-- Type: text/plain, Size: 993 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/current.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- linux-2.6.tip.orig/include/asm-x86/current.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/current.h 2008-07-01 10:49:13.764285143 -0700
@@ -17,12 +17,13 @@ static __always_inline struct task_struc
#ifndef __ASSEMBLY__
#include <asm/pda.h>
+#include <asm/percpu.h>
struct task_struct;
static __always_inline struct task_struct *get_current(void)
{
- return read_pda(pcurrent);
+ return x86_read_percpu(pda.pcurrent);
}
#else /* __ASSEMBLY__ */
--
* [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (5 preceding siblings ...)
2008-07-09 16:51 ` [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h Mike Travis
` (11 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_hardirq_64_h --]
[-- Type: text/plain, Size: 1238 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/hardirq_64.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- linux-2.6.tip.orig/include/asm-x86/hardirq_64.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/hardirq_64.h 2008-07-01 10:49:14.000299503 -0700
@@ -11,12 +11,12 @@
#define __ARCH_IRQ_STAT 1
-#define local_softirq_pending() read_pda(__softirq_pending)
+#define local_softirq_pending() x86_read_percpu(pda.__softirq_pending)
#define __ARCH_SET_SOFTIRQ_PENDING 1
-#define set_softirq_pending(x) write_pda(__softirq_pending, (x))
-#define or_softirq_pending(x) or_pda(__softirq_pending, (x))
+#define set_softirq_pending(x) x86_write_percpu(pda.__softirq_pending, (x))
+#define or_softirq_pending(x) x86_or_percpu(pda.__softirq_pending, (x))
extern void ack_bad_irq(unsigned int irq);
--
* [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (6 preceding siblings ...)
2008-07-09 16:51 ` [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h Mike Travis
` (10 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_mmu_context_64_h --]
[-- Type: text/plain, Size: 1831 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/mmu_context_64.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- linux-2.6.tip.orig/include/asm-x86/mmu_context_64.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/mmu_context_64.h 2008-07-01 10:49:14.220312889 -0700
@@ -20,8 +20,8 @@ void destroy_context(struct mm_struct *m
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
#ifdef CONFIG_SMP
- if (read_pda(mmu_state) == TLBSTATE_OK)
- write_pda(mmu_state, TLBSTATE_LAZY);
+ if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK)
+ x86_write_percpu(pda.mmu_state, TLBSTATE_LAZY);
#endif
}
@@ -33,8 +33,8 @@ static inline void switch_mm(struct mm_s
/* stop flush ipis for the previous mm */
cpu_clear(cpu, prev->cpu_vm_mask);
#ifdef CONFIG_SMP
- write_pda(mmu_state, TLBSTATE_OK);
- write_pda(active_mm, next);
+ x86_write_percpu(pda.mmu_state, TLBSTATE_OK);
+ x86_write_percpu(pda.active_mm, next);
#endif
cpu_set(cpu, next->cpu_vm_mask);
load_cr3(next->pgd);
@@ -44,8 +44,8 @@ static inline void switch_mm(struct mm_s
}
#ifdef CONFIG_SMP
else {
- write_pda(mmu_state, TLBSTATE_OK);
- if (read_pda(active_mm) != next)
+ x86_write_percpu(pda.mmu_state, TLBSTATE_OK);
+ if (x86_read_percpu(pda.active_mm) != next)
BUG();
if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
/* We were in lazy tlb mode and leave_mm disabled
--
* [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (7 preceding siblings ...)
2008-07-09 16:51 ` [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h Mike Travis
` (9 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_percpu_h --]
[-- Type: text/plain, Size: 948 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/percpu.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-2.6.tip.orig/include/asm-x86/percpu.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/percpu.h 2008-07-01 10:49:14.432325788 -0700
@@ -7,7 +7,7 @@
/* Same as asm-generic/percpu.h, except we use %gs as a segment offset. */
#ifdef CONFIG_SMP
-#define __my_cpu_offset read_pda(data_offset)
+#define __my_cpu_offset (x86_read_percpu(pda.data_offset))
#define __percpu_seg "%%gs:"
#else
#define __percpu_seg ""
--
* [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (8 preceding siblings ...)
2008-07-09 16:51 ` [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h Mike Travis
` (8 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_smp_h --]
[-- Type: text/plain, Size: 958 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/smp.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-2.6.tip.orig/include/asm-x86/smp.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/smp.h 2008-07-01 10:49:14.676340634 -0700
@@ -134,7 +134,7 @@ DECLARE_PER_CPU(int, cpu_number);
extern int safe_smp_processor_id(void);
#elif defined(CONFIG_X86_64_SMP)
-#define raw_smp_processor_id() read_pda(cpunumber)
+#define raw_smp_processor_id() x86_read_percpu(pda.cpunumber)
#define stack_smp_processor_id() \
({ \
--
* [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (9 preceding siblings ...)
2008-07-09 16:51 ` [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h Mike Travis
` (7 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_stackprotector_h --]
[-- Type: text/plain, Size: 912 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/stackprotector.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-2.6.tip.orig/include/asm-x86/stackprotector.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/stackprotector.h 2008-07-01 10:49:14.920355480 -0700
@@ -32,7 +32,7 @@ static __always_inline void boot_init_st
canary += tsc + (tsc << 32UL);
current->stack_canary = canary;
- write_pda(stack_canary, canary);
+ x86_write_percpu(pda.stack_canary, canary);
}
#endif
--
* [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (10 preceding siblings ...)
2008-07-09 16:51 ` [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h Mike Travis
` (6 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_thread_info_h --]
[-- Type: text/plain, Size: 1014 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/thread_info.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- linux-2.6.tip.orig/include/asm-x86/thread_info.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/thread_info.h 2008-07-01 10:49:15.172370813 -0700
@@ -200,7 +200,8 @@ static inline struct thread_info *curren
static inline struct thread_info *current_thread_info(void)
{
struct thread_info *ti;
- ti = (void *)(read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE);
+ ti = (void *)(x86_read_percpu(pda.kernelstack) +
+ PDA_STACKOFFSET - THREAD_SIZE);
return ti;
}
--
* [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (11 preceding siblings ...)
2008-07-09 16:51 ` [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 14/15] x86_64: Remove xxx_pda() operations Mike Travis
` (5 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: incl_pda_ops_include_asm-x86_topology_h --]
[-- Type: text/plain, Size: 1001 bytes --]
* It is now possible to use percpu operations for pda access
since the pda is in the percpu area. Drop the pda operations.
Thus:
read_pda --> x86_read_percpu
write_pda --> x86_write_percpu
add_pda (+1) --> x86_inc_percpu
or_pda --> x86_or_percpu
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/topology.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-2.6.tip.orig/include/asm-x86/topology.h 2008-07-01 10:41:33.000000000 -0700
+++ linux-2.6.tip/include/asm-x86/topology.h 2008-07-01 10:49:15.420385902 -0700
@@ -77,7 +77,7 @@ extern cpumask_t *node_to_cpumask_map;
DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map);
/* Returns the number of the current Node. */
-#define numa_node_id() read_pda(nodenumber)
+#define numa_node_id() x86_read_percpu(pda.nodenumber)
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
extern int cpu_to_node(int cpu);
--
* [RFC 14/15] x86_64: Remove xxx_pda() operations
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (12 preceding siblings ...)
2008-07-09 16:51 ` [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 16:51 ` [RFC 15/15] x86_64: Remove cpu_pda() macro Mike Travis
` (4 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: zero_based_remove_pda_ops --]
[-- Type: text/plain, Size: 3591 bytes --]
* As there are no more references to the xxx_pda() ops, remove them.
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/pda.h | 77 ++++----------------------------------------------
1 file changed, 7 insertions(+), 70 deletions(-)
--- linux-2.6.tip.orig/include/asm-x86/pda.h
+++ linux-2.6.tip/include/asm-x86/pda.h
@@ -21,7 +21,7 @@ struct x8664_pda {
offset 40!!! */
char *irqstackptr;
short nodenumber; /* number of current node (32k max) */
- short in_bootmem; /* pda lives in bootmem */
+ short unused1; /* unused */
unsigned int __softirq_pending;
unsigned int __nmi_count; /* number of NMI on this CPUs */
short mmu_state;
@@ -42,12 +42,6 @@ extern void pda_init(int);
#define cpu_pda(cpu) (&per_cpu(pda, cpu))
/*
- * There is no fast way to get the base address of the PDA, all the accesses
- * have to mention %fs/%gs. So it needs to be done this Torvaldian way.
- */
-extern void __bad_pda_field(void) __attribute__((noreturn));
-
-/*
* proxy_pda doesn't actually exist, but tell gcc it is accessed for
* all PDA accesses so it gets read/write dependencies right.
*/
@@ -55,69 +49,11 @@ extern struct x8664_pda _proxy_pda;
#define pda_offset(field) offsetof(struct x8664_pda, field)
-#define pda_to_op(op, field, val) \
-do { \
- typedef typeof(_proxy_pda.field) T__; \
- if (0) { T__ tmp__; tmp__ = (val); } /* type checking */ \
- switch (sizeof(_proxy_pda.field)) { \
- case 2: \
- asm(op "w %1,%%gs:%c2" : \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i"(pda_offset(field))); \
- break; \
- case 4: \
- asm(op "l %1,%%gs:%c2" : \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i" (pda_offset(field))); \
- break; \
- case 8: \
- asm(op "q %1,%%gs:%c2": \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i"(pda_offset(field))); \
- break; \
- default: \
- __bad_pda_field(); \
- } \
-} while (0)
-
-#define pda_from_op(op, field) \
-({ \
- typeof(_proxy_pda.field) ret__; \
- switch (sizeof(_proxy_pda.field)) { \
- case 2: \
- asm(op "w %%gs:%c1,%0" : \
- "=r" (ret__) : \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- case 4: \
- asm(op "l %%gs:%c1,%0": \
- "=r" (ret__): \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- case 8: \
- asm(op "q %%gs:%c1,%0": \
- "=r" (ret__) : \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- default: \
- __bad_pda_field(); \
- } \
- ret__; \
-})
-
-#define read_pda(field) pda_from_op("mov", field)
-#define write_pda(field, val) pda_to_op("mov", field, val)
-#define add_pda(field, val) pda_to_op("add", field, val)
-#define sub_pda(field, val) pda_to_op("sub", field, val)
-#define or_pda(field, val) pda_to_op("or", field, val)
-
-/* This is not atomic against other CPUs -- CPU preemption needs to be off */
+/*
+ * This is not atomic against other CPUs -- CPU preemption needs to be off
+ * NOTE: This relies on the fact that the cpu_pda is the *first* field in
+ * the per cpu area. Move it and you'll need to change this.
+ */
#define test_and_clear_bit_pda(bit, field) \
({ \
int old__; \
@@ -127,6 +63,7 @@ do { \
old__; \
})
+
#endif
#define PDA_STACKOFFSET (5*8)
--
* [RFC 15/15] x86_64: Remove cpu_pda() macro
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (13 preceding siblings ...)
2008-07-09 16:51 ` [RFC 14/15] x86_64: Remove xxx_pda() operations Mike Travis
@ 2008-07-09 16:51 ` Mike Travis
2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin
` (3 subsequent siblings)
18 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
[-- Attachment #1: zero_based_remove_cpu_pda --]
[-- Type: text/plain, Size: 611 bytes --]
* As there are no more references to cpu_pda(), remove it.
Based on linux-2.6.tip/master
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
include/asm-x86/pda.h | 2 --
1 file changed, 2 deletions(-)
--- linux-2.6.tip.orig/include/asm-x86/pda.h
+++ linux-2.6.tip/include/asm-x86/pda.h
@@ -39,8 +39,6 @@ struct x8664_pda {
extern void pda_init(int);
-#define cpu_pda(cpu) (&per_cpu(pda, cpu))
-
/*
* proxy_pda doesn't actually exist, but tell gcc it is accessed for
* all PDA accesses so it gets read/write dependencies right.
--
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (14 preceding siblings ...)
2008-07-09 16:51 ` [RFC 15/15] x86_64: Remove cpu_pda() macro Mike Travis
@ 2008-07-09 17:19 ` H. Peter Anvin
2008-07-09 17:40 ` Mike Travis
2008-07-09 17:44 ` Jeremy Fitzhardinge
2008-07-09 17:27 ` Jeremy Fitzhardinge
` (2 subsequent siblings)
18 siblings, 2 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 17:19 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
Hi Mike,
Did the suspected linker bug issue ever get resolved?
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (15 preceding siblings ...)
2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin
@ 2008-07-09 17:27 ` Jeremy Fitzhardinge
2008-07-09 17:39 ` Christoph Lameter
2008-07-09 18:00 ` Mike Travis
2008-07-09 19:28 ` Ingo Molnar
2008-07-09 20:00 ` Eric W. Biederman
18 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 17:27 UTC (permalink / raw)
To: Mike Travis
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Mike Travis wrote:
> This patchset provides the following:
>
> * Cleanup: Fix early references to cpumask_of_cpu(0)
>
> Provides an early cpumask_of_cpu(0) usable before the cpumask_of_cpu_map
> is allocated and initialized.
>
> * Generic: Percpu infrastructure to rebase the per cpu area to zero
>
> This provides for the capability of accessing the percpu variables
> using a local register instead of having to go through a table
> on node 0 to find the cpu-specific offsets. It also would allow
> atomic operations on percpu variables to reduce required locking.
> Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
> generic code that the arch has this new basing.
>
> (Note: split into two patches, one to rebase percpu variables at 0,
> and the second to actually use %gs as the base for percpu variables.)
>
> * x86_64: Fold pda into per cpu area
>
> Declare the pda as a per cpu variable. This will move the pda
> area to an address accessible by the x86_64 per cpu macros.
> Subtraction of __per_cpu_start will make the offset based from
> the beginning of the per cpu area. Since %gs is pointing to the
> pda, it will then also point to the per cpu variables and can be
> accessed thusly:
>
> %gs:[&per_cpu_xxxx - __per_cpu_start]
>
> * x86_64: Rebase per cpu variables to zero
>
> Take advantage of the zero-based per cpu area provided above.
> Then we can directly use the x86_32 percpu operations. x86_32
> offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
> to the pda and the per cpu area thereby allowing access to the
> pda with the x86_64 pda operations and access to the per cpu
> variables using x86_32 percpu operations.
The bulk of this series is pda_X to x86_X_percpu conversion. This seems
like pointless churn to me; there's nothing inherently wrong with the
pda_X interfaces, and doing this transformation doesn't get us any
closer to unifying 32 and 64 bit.
I think we should start devolving things out of the pda in the other
direction: make a series where each patch takes a member of struct
x8664_pda, converts it to a per-cpu variable (where possible, the same
one that 32-bit uses), and updates all the references accordingly. When
the pda is as empty as it can be, we can look at removing the
pda-specific interfaces.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:27 ` Jeremy Fitzhardinge
@ 2008-07-09 17:39 ` Christoph Lameter
2008-07-09 17:51 ` Jeremy Fitzhardinge
2008-07-09 18:02 ` Mike Travis
2008-07-09 18:00 ` Mike Travis
1 sibling, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 17:39 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> The bulk of this series is pda_X to x86_X_percpu conversion. This seems
> like pointless churn to me; there's nothing inherently wrong with the
> pda_X interfaces, and doing this transformation doesn't get us any
> closer to unifying 32 and 64 bit.
What is the point of the pda_X interface? It does not exist on 32 bit.
The pda wastes the GS segment register on a small memory area. This patchset
makes the GS segment usable to reach all of the per cpu area by placing
the pda into the per cpu area. Thus the pda_X interface becomes obsolete
and the 32 bit per cpu stuff becomes usable under 64 bit unifying both
architectures.
> I think we should start devolving things out of the pda in the other
> direction: make a series where each patch takes a member of struct
> x8664_pda, converts it to a per-cpu variable (where possible, the same
> one that 32-bit uses), and updates all the references accordingly. When
> the pda is as empty as it can be, we can look at removing the
> pda-specific interfaces.
This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin
@ 2008-07-09 17:40 ` Mike Travis
2008-07-09 17:42 ` H. Peter Anvin
2008-07-09 17:44 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 17:40 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
H. Peter Anvin wrote:
> Hi Mike,
>
> Did the suspected linker bug issue ever get resolved?
>
> -hpa
Hi Peter,
I was not able to figure out how the two versions of the same
kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently,
I'm sticking with gcc-4.2.4 as it boots much farther.
There is still a problem, though if I bump THREAD_ORDER, the
problem goes away and everything I've tested so far boots
up fine.
We tried to install a later gcc (4.3.1) that might have the
"GCC_HAS_SP" flag but our sys admin reported:
The 4.3.1 version gives me errors on the make. I had to
pre-install gmp and mpfr, but, I still get errors on the make.
I think that was the latest he found on the GNU/GCC site.
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:40 ` Mike Travis
@ 2008-07-09 17:42 ` H. Peter Anvin
2008-07-09 18:05 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 17:42 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
Mike Travis wrote:
> H. Peter Anvin wrote:
>> Hi Mike,
>>
>> Did the suspected linker bug issue ever get resolved?
>>
>> -hpa
>
> Hi Peter,
>
> I was not able to figure out how the two versions of the same
> kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently,
> I'm sticking with gcc-4.2.4 as it boots much farther.
>
> There still is a problem where if I bump THREAD_ORDER, the
> problem goes away and everything so far that I've tested boots
> up fine.
>
> We tried to install a later gcc (4.3.1) that might have the
> "GCC_HAS_SP" flag but our sys admin reported:
>
> The 4.3.1 version gives me errors on the make. I had to
> pre-install gmp and mpfr, but, I still get errors on the make.
>
> I think that was the latest he found on the GNU/GCC site.
>
We have seen miscompilations with gcc 4.3.0 at least; not sure about 4.3.1.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin
2008-07-09 17:40 ` Mike Travis
@ 2008-07-09 17:44 ` Jeremy Fitzhardinge
2008-07-09 18:09 ` Mike Travis
2008-07-25 15:49 ` Mike Travis
1 sibling, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 17:44 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel
H. Peter Anvin wrote:
> Did the suspected linker bug issue ever get resolved?
I don't believe so. I think Mike is getting very early crashes
depending on some combination of gcc, linker and kernel config. Or
something.
This fragility makes me very nervous. It seems hard enough to get this
stuff working with current tools; making it work over the whole range of
supported tools looks like it's going to be hard.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:39 ` Christoph Lameter
@ 2008-07-09 17:51 ` Jeremy Fitzhardinge
2008-07-09 18:14 ` Mike Travis
2008-07-09 18:02 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 17:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> What is the point of the pda_X interface? It does not exist on 32 bit.
> The pda wastes the GS segment register on a small memory area. This patchset
> makes the GS segment usable to reach all of the per cpu area by placing
> the pda into the per cpu area. Thus the pda_X interface becomes obsolete
> and the 32 bit per cpu stuff becomes usable under 64 bit unifying both
> architectures.
>
I think we agree on the desired outcome. I just disagree with the path
to getting there.
>> I think we should start devolving things out of the pda in the other
>> direction: make a series where each patch takes a member of struct
>> x8664_pda, converts it to a per-cpu variable (where possible, the same
>> one that 32-bit uses), and updates all the references accordingly. When
>> the pda is as empty as it can be, we can look at removing the
>> pda-specific interfaces.
>>
>
> This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately.
>
No, it's not churn doing it object at a time. If you convert
pda.pcurrent into a percpu current_task variable, then at one stroke
you've 1) shrunk the pda, 2) unified with i386. If you go through the
process of converting all the read_pda(pcurrent) references into
x86_read_percpu(pda.pcurrent) then that's a pure churn patch. It
doesn't get rid of the pda variable, it doesn't unify with i386. All it
does is remove a reference to a macro which was fairly inoffensive in
the first place.
Once the pda has shrunk as much as it can (which means removing everything
except stack_canary, I think), then remove all the X_pda macros, since
there won't be any users anyway.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:27 ` Jeremy Fitzhardinge
2008-07-09 17:39 ` Christoph Lameter
@ 2008-07-09 18:00 ` Mike Travis
2008-07-09 19:05 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:00 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> This patchset provides the following:
>>
>> * Cleanup: Fix early references to cpumask_of_cpu(0)
>>
>> Provides an early cpumask_of_cpu(0) usable before the
>> cpumask_of_cpu_map
>> is allocated and initialized.
>>
>> * Generic: Percpu infrastructure to rebase the per cpu area to zero
>>
>> This provides for the capability of accessing the percpu variables
>> using a local register instead of having to go through a table
>> on node 0 to find the cpu-specific offsets. It also would allow
>> atomic operations on percpu variables to reduce required locking.
>> Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
>> generic code that the arch has this new basing.
>>
>> (Note: split into two patches, one to rebase percpu variables at 0,
>> and the second to actually use %gs as the base for percpu variables.)
>>
>> * x86_64: Fold pda into per cpu area
>>
>> Declare the pda as a per cpu variable. This will move the pda
>> area to an address accessible by the x86_64 per cpu macros.
>> Subtraction of __per_cpu_start will make the offset based from
>> the beginning of the per cpu area. Since %gs is pointing to the
>> pda, it will then also point to the per cpu variables and can be
>> accessed thusly:
>>
>> %gs:[&per_cpu_xxxx - __per_cpu_start]
>>
>> * x86_64: Rebase per cpu variables to zero
>>
>> Take advantage of the zero-based per cpu area provided above.
>> Then we can directly use the x86_32 percpu operations. x86_32
>> offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
>> to the pda and the per cpu area thereby allowing access to the
>> pda with the x86_64 pda operations and access to the per cpu
>> variables using x86_32 percpu operations.
>
> The bulk of this series is pda_X to x86_X_percpu conversion. This seems
> like pointless churn to me; there's nothing inherently wrong with the
> pda_X interfaces, and doing this transformation doesn't get us any
> closer to unifying 32 and 64 bit.
>
> I think we should start devolving things out of the pda in the other
> direction: make a series where each patch takes a member of struct
> x8664_pda, converts it to a per-cpu variable (where possible, the same
> one that 32-bit uses), and updates all the references accordingly. When
> the pda is as empty as it can be, we can look at removing the
> pda-specific interfaces.
>
> J
I did compartmentalize the changes so they were in separate patches,
and in particular, by separating the changes to the include files, I
was able to zero in on some problems much more easily.
But I have no objections to leaving the cpu_pda ops in place and then,
as you're suggesting, extract and modify the fields as appropriate.
Another approach would be to leave the changes from XXX_pda() to
x86_percpu_XXX in place, and then do patches that simply change
pda.VAR to VAR.
In any case I would like to get this version working first, before
attempting that rewrite, as that won't change the generated code.
Btw, while I've got your attention... ;-), there's some code in
arch/x86/xen/smp.c:xen_smp_prepare_boot_cpu() that should be looked
at more closely for zero-based per_cpu__gdt_page:
make_lowmem_page_readwrite(&per_cpu__gdt_page);
(I wasn't sure how to deal with this but I suspect the __percpu_offset[]
or __per_cpu_load should be added to it.)
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:39 ` Christoph Lameter
2008-07-09 17:51 ` Jeremy Fitzhardinge
@ 2008-07-09 18:02 ` Mike Travis
2008-07-09 18:13 ` Christoph Lameter
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> The bulk of this series is pda_X to x86_X_percpu conversion. This seems
>> like pointless churn to me; there's nothing inherently wrong with the
>> pda_X interfaces, and doing this transformation doesn't get us any
>> closer to unifying 32 and 64 bit.
>
> What is the point of the pda_X interface? It does not exist on 32 bit.
> The pda wastes the GS segment register on a small memory area. This patchset
> makes the GS segment usable to reach all of the per cpu area by placing
> the pda into the per cpu area. Thus the pda_X interface becomes obsolete
> and the 32 bit per cpu stuff becomes usable under 64 bit unifying both
> architectures.
>
>
>> I think we should start devolving things out of the pda in the other
>> direction: make a series where each patch takes a member of struct
>> x8664_pda, converts it to a per-cpu variable (where possible, the same
>> one that 32-bit uses), and updates all the references accordingly. When
>> the pda is as empty as it can be, we can look at removing the
>> pda-specific interfaces.
>
> This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately.
I think Jeremy's point is that by removing the pda struct entirely, the
references to the fields can be the same for both x86_32 and x86_64.
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:42 ` H. Peter Anvin
@ 2008-07-09 18:05 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:05 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
H. Peter Anvin wrote:
> Mike Travis wrote:
>> H. Peter Anvin wrote:
>>> Hi Mike,
>>>
>>> Did the suspected linker bug issue ever get resolved?
>>>
>>> -hpa
>>
>> Hi Peter,
>>
>> I was not able to figure out how the two versions of the same
>> kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently,
>> I'm sticking with gcc-4.2.4 as it boots much farther.
>>
>> There still is a problem where if I bump THREAD_ORDER, the
>> problem goes away and everything so far that I've tested boots
>> up fine.
>>
>> We tried to install a later gcc (4.3.1) that might have the
>> "GCC_HAS_SP" flag but our sys admin reported:
>>
>> The 4.3.1 version gives me errors on the make. I had to
>> pre-install gmp and mpfr, but, I still get errors on the make.
>>
>> I think that was the latest he found on the GNU/GCC site.
>>
>
> We have seen miscompilations with gcc 4.3.0 at least; not sure about 4.3.1.
>
> -hpa
Hmm, I wonder how the CONFIG_CC_STACKPROTECTOR was tested? Could it
be a config option for building GCC itself?
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:44 ` Jeremy Fitzhardinge
@ 2008-07-09 18:09 ` Mike Travis
2008-07-09 18:30 ` H. Peter Anvin
2008-07-09 19:34 ` Ingo Molnar
2008-07-25 15:49 ` Mike Travis
1 sibling, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:09 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Did the suspected linker bug issue ever get resolved?
>
> I don't believe so. I think Mike is getting very early crashes
> depending on some combination of gcc, linker and kernel config. Or
> something.
Yes, and unfortunately, since SGI does not do its own compilers any
more (they were MIPS compilers), there's no one here who really
understands the internals of the compile tools.
>
> This fragility makes me very nervous. It seems hard enough to get this
> stuff working with current tools; making it work over the whole range of
> supported tools looks like its going to be hard.
(me too ;-)
Once I get a solid version working with (at least) gcc-4.2.4, then
regression testing with older tools will be easier, or at least a
table of results can be produced.
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:02 ` Mike Travis
@ 2008-07-09 18:13 ` Christoph Lameter
2008-07-09 18:26 ` Jeremy Fitzhardinge
` (2 more replies)
0 siblings, 3 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 18:13 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel
Mike Travis wrote:
> I think Jeremy's point is that by removing the pda struct entirely, the
> references to the fields can be the same for both x86_32 and x86_64.
That is going to be difficult. The GS register is tied up for the pda area
as long as you have it. And you cannot get rid of the pda because of the library
compatibility issues. We would break binary compatibility if we got rid of the pda.
If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:51 ` Jeremy Fitzhardinge
@ 2008-07-09 18:14 ` Mike Travis
2008-07-09 18:22 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:14 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
...
>
> Once the pda has shrunk as much as it can (which remove everything
> except stack_canary, I think), then remove all the X_pda macros, since
> there won't be any users anyway.
>
> J
You bring up a good point here. Since the stack_canary has to be 20
(or is that 0x20?) bytes from %gs, it sounds like we'll still
need a pda struct of some sort. And zero padding before that seems
counter-productive, so perhaps taking a poll of the most used pda
(or percpu) variables and keeping them in the same cache line would
be more useful?
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:14 ` Mike Travis
@ 2008-07-09 18:22 ` Jeremy Fitzhardinge
2008-07-09 18:31 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 18:22 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
> ...
>
>> Once the pda has shrunk as much as it can (which remove everything
>> except stack_canary, I think), then remove all the X_pda macros, since
>> there won't be any users anyway.
>>
>> J
>>
>
> You bring up a good point here. Since the stack_canary has to be 20
> (or is that 0x20?) bytes from %gs, then it sounds like we'll still
> need a pda struct of some sort. And zero padding before that seems
> counter-productive, so perhaps taking a poll of the most used pda
> (or percpu) variables and keeping them in the same cache line would
> be more useful?
The offset is 40 (decimal). I don't see any particular problem with
putting zero padding in there; we can get rid of it if
CONFIG_STACK_PROTECTOR is off.
The trouble with reusing the space is that it's going to be things like
"current" which are the best candidates for going there - but if you do
that you lose i386 unification (unless you play some tricks where you
arrange to make the percpu variables land there while still appearing to
be normal percpu vars).
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:13 ` Christoph Lameter
@ 2008-07-09 18:26 ` Jeremy Fitzhardinge
2008-07-09 18:34 ` Christoph Lameter
2008-07-09 18:27 ` Mike Travis
2008-07-09 18:31 ` H. Peter Anvin
2 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 18:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> That is going to be difficult. The GS register is tied up for the pda area
> as long as you have it. And you cannot get rid of the pda because of the library
> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>
> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
>
Eh? Yes they will. That's the whole point of making the pda a percpu
variable itself. You can use %gs:<small> to get to the pda, and
%gs:<larger> to get to percpu variables. Converting pda->percpu will
just have the effect of increasing the %gs offset into the percpu space.
This project isn't interesting to me unless per-cpu variables are
directly accessible off %gs.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:13 ` Christoph Lameter
2008-07-09 18:26 ` Jeremy Fitzhardinge
@ 2008-07-09 18:27 ` Mike Travis
2008-07-09 18:46 ` Jeremy Fitzhardinge
2008-07-09 18:31 ` H. Peter Anvin
2 siblings, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:27 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Mike Travis wrote:
>
>> I think Jeremy's point is that by removing the pda struct entirely, the
>> references to the fields can be the same for both x86_32 and x86_64.
>
> That is going to be difficult. The GS register is tied up for the pda area
> as long as you have it. And you cannot get rid of the pda because of the library
> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>
> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
>
> It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same.
Is there a comprehensive list of these library accesses to variables
offset from %gs, or is it only the "stack_canary"?
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:09 ` Mike Travis
@ 2008-07-09 18:30 ` H. Peter Anvin
2008-07-09 19:34 ` Ingo Molnar
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 18:30 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> Did the suspected linker bug issue ever get resolved?
>> I don't believe so. I think Mike is getting very early crashes
>> depending on some combination of gcc, linker and kernel config. Or
>> something.
>
> Yes and unfortunately since SGI does not do it's own compilers any
> more (they were MIPS compilers), there's no one here that really
> understands the internals of the compile tools.
>
A bummer, too, since that compiler lives on as the Pathscale compiler...
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:13 ` Christoph Lameter
2008-07-09 18:26 ` Jeremy Fitzhardinge
2008-07-09 18:27 ` Mike Travis
@ 2008-07-09 18:31 ` H. Peter Anvin
2 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 18:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Mike Travis wrote:
>
>> I think Jeremy's point is that by removing the pda struct entirely, the
>> references to the fields can be the same for both x86_32 and x86_64.
>
> That is going to be difficult. The GS register is tied up for the pda area
> as long as you have it. And you cannot get rid of the pda because of the library
> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>
We're talking about the kernel here... who gives a hoot about binary
compatibility?
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:22 ` Jeremy Fitzhardinge
@ 2008-07-09 18:31 ` Mike Travis
2008-07-09 19:08 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:31 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
...
> The trouble with reusing the space is that it's going to be things like
> "current" which are the best candidates for going there - but if you do
> that you lose i386 unification (unless you play some tricks where you
> arrange to make the percpu variables land there while still appearing to
> be normal percpu vars).
>
> J
One more approach... ;-) Once the pda and percpu vars are in the same
area, could the pda be used for both 32 and 64...?
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:26 ` Jeremy Fitzhardinge
@ 2008-07-09 18:34 ` Christoph Lameter
2008-07-09 18:37 ` H. Peter Anvin
2008-07-09 18:48 ` Jeremy Fitzhardinge
0 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 18:34 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> Eh? Yes they will. That's the whole point of making the pda a percpu
> variable itself. You can use %gs:<small> to get to the pda, and
> %gs:<larger> to get to percpu variables. Converting pda->percpu will
> just have the effect of increasing the %gs offset into the percpu space.
Right, that is what this patchset does.
> This project isn't interesting to me unless per-cpu variables are
> directly accessible off %gs.
Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS).
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:34 ` Christoph Lameter
@ 2008-07-09 18:37 ` H. Peter Anvin
2008-07-09 18:48 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 18:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Mike Travis, Ingo Molnar, Andrew Morton,
Eric W. Biederman, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Eh? Yes they will. That's the whole point of making the pda a percpu
>> variable itself. You can use %gs:<small> to get to the pda, and
>> %gs:<larger> to get to percpu variables. Converting pda->percpu will
>> just have the effect of increasing the %gs offset into the percpu space.
>
> Right that is what this patchset does.
>
>> This project isn't interesting to me unless per-cpu variables are
>> directly accessible off %gs.
>
> Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS).
>
I don't understand the "individual members" requirement here.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:27 ` Mike Travis
@ 2008-07-09 18:46 ` Jeremy Fitzhardinge
2008-07-09 20:22 ` Eric W. Biederman
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 18:46 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Mike Travis wrote:
> Christoph Lameter wrote:
>
>> Mike Travis wrote:
>>
>>
>>> I think Jeremy's point is that by removing the pda struct entirely, the
>>> references to the fields can be the same for both x86_32 and x86_64.
>>>
>> That is going to be difficult. The GS register is tied up for the pda area
>> as long as you have it. And you cannot get rid of the pda because of the library
>> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>>
>> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
>>
>> It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same.
>>
>
> Is there a comprehensive list of these library accesses to variables
> offset from %gs, or is it only the "stack_canary"?
It's just the stack canary. It isn't library accesses; it's the code
gcc generates:
foo: subq $152, %rsp
movq %gs:40, %rax
movq %rax, 136(%rsp)
...
movq 136(%rsp), %rdx
xorq %gs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
addq $152, %rsp
.p2align 4,,4
ret
There are two irritating things here:
One is that the kernel supports -fstack-protector for x86-64, which
forces us into all these contortions in the first place. We don't
support stack-protector for 32-bit (gcc does), and things are much easier.
The other somewhat orthogonal irritation is the fixed "40". If they'd
generated %gs:__gcc_stack_canary, then we could alias that to a per-cpu
variable like anything else and the whole problem would go away - and we
could support stack-protector on 32-bit with no problems (and normal
usermode could define __gcc_stack_canary to be a weak symbol with value
"40" (20 on 32-bit) for backwards compatibility).
I'm close to proposing that we run a post-processor over the generated
assembly to perform the %gs:40 -> %gs:__gcc_stack_canary transformation
and deal with it that way.
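[Editorial note: the canary protocol in the assembly above can be modeled in portable C roughly as follows. This is a hypothetical userspace sketch, not kernel code: `__stack_chk_guard` and `frame_intact` are illustrative names, the constant stands in for the per-thread value gcc reads from %gs:40, and an adjacent struct member models the stack slot.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the per-thread guard value gcc loads from %gs:40. */
static const uint64_t __stack_chk_guard = 0xdeadbeef42424242ULL;

/*
 * Model of a -fstack-protector frame: the prologue copies the guard
 * into a slot just past the buffer (movq %gs:40, %rax; movq %rax,
 * 136(%rsp)); the epilogue re-checks it (xorq %gs:40, %rdx) and calls
 * __stack_chk_fail on mismatch. Returns 1 if the frame is intact,
 * 0 if the canary was clobbered by an over-long copy.
 */
static int frame_intact(const void *src, size_t n)
{
    struct {
        char buf[16];      /* the overflow target */
        uint64_t canary;   /* the guarded slot above it */
    } frame;

    frame.canary = __stack_chk_guard;
    memcpy(&frame, src, n);            /* n > 16 runs into the canary */

    return frame.canary == __stack_chk_guard;
}
```

A copy that stays within the buffer leaves the canary untouched; one that overruns it changes the slot and is caught at function exit.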
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:34 ` Christoph Lameter
2008-07-09 18:37 ` H. Peter Anvin
@ 2008-07-09 18:48 ` Jeremy Fitzhardinge
2008-07-09 18:53 ` Christoph Lameter
1 sibling, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 18:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS).
>
I have no objections to parts 1-3 of the patchset. It's just 4-15,
which does the mechanical conversion.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:48 ` Jeremy Fitzhardinge
@ 2008-07-09 18:53 ` Christoph Lameter
2008-07-09 19:07 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 18:53 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> Christoph Lameter wrote:
>> Maybe I misunderstood but it seems that you proposed to convert
>> individual members of the pda structure (which uses GS) to per cpu
>> variables (which without this patchset cannot use GS).
>>
>
> I have no objections to parts 1-3 of the patchset. It's just 4-15,
> which does the mechanical conversion.
Well yes, I agree these could be better if the fields were moved out of the pda
structure itself, but then it won't be mechanical anymore and will require more
review. But these are an important step because they allow us to get rid of the
pda operations that do not exist for 32 bit.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:00 ` Mike Travis
@ 2008-07-09 19:05 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 19:05 UTC (permalink / raw)
To: Mike Travis
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Mike Travis wrote:
> I did compartmentalize the changes so they were in separate patches,
> and in particular, by separating the changes to the include files, I
> was able to zero in on some problems much easier.
>
> But I have no objections to leaving the cpu_pda ops in place and then,
> as you're suggesting, extract and modify the fields as appropriate.
>
> Another approach would be to leave the changes from XXX_pda() to
> x86_percpu_XXX in place, and do the patches with simply changing
> pda.VAR to VAR .)
>
Yes, but that's still two patches where one would do. If I'm going to
go through the effort of reconciling your percpu patches with my code,
I'd like to be able to remove some #ifdef CONFIG_X86_64s in the process.
> In any case I would like to get this version working first, before
> attempting that rewrite, as that won't change the generated code.
>
Well, as far as I can tell the real meat of the series is in 1-3 and the
rest is fluff. If just applying 1-3 works, then everything else should too.
> Btw, while I've got your attention... ;-), there's some code in
> arch/x86/xen/smp.c:xen_smp_prepare_boot_cpu() that should be looked
> at closer for zero-based per_cpu__gdt_page:
>
> make_lowmem_page_readwrite(&per_cpu__gdt_page);
>
> (I wasn't sure how to deal with this but I suspect the __percpu_offset[]
> or __per_cpu_load should be added to it.)
Already fixed in the mass of patches I posted yesterday. I turned it
into &per_cpu_var(gdt_page).
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:53 ` Christoph Lameter
@ 2008-07-09 19:07 ` Jeremy Fitzhardinge
2008-07-09 19:12 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 19:07 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Well yes I agree these could be better if the fields would be moved out of the pda
> structure itself but then it wont be mechanical anymore and require more
> review.
Yes, but they'll have more value. And if you do it as one variable per
patch, then it should be easy to bisect should any problems arise.
> But these are an important step because they allow us to get rid of the
> pda operations that do not exist for 32 bit.
>
No, they don't help at all, because they convert X_pda(Y) (which doesn't
exist) into x86_X_percpu(pda.Y) (which also doesn't exist). They don't
remove any #ifdef CONFIG_X86_64's. If you're going to tromp all over
the source, you may as well do the most useful thing in the first step.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:31 ` Mike Travis
@ 2008-07-09 19:08 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 19:08 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Mike Travis wrote:
> One more approach... ;-) Once the pda and percpu vars are in the same
> area, then could the pda be used for both 32 and 64...?
>
The i386 code works quite reliably thanks ;)
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:07 ` Jeremy Fitzhardinge
@ 2008-07-09 19:12 ` Christoph Lameter
2008-07-09 19:32 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 19:12 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> No, they don't help at all, because they convert X_pda(Y) (which doesn't
> exist) into x86_X_percpu(pda.Y) (which also doesn't exist). They don't
> remove any #ifdef CONFIG_X86_64's. If you're going to tromp all over
> the source, you may as well do the most useful thing in the first step.
Well they help in the sense that the patches get rid of the special X_pda(Y) operations.
x86_X_percpu will then exist under 32 bit and 64 bit.
What is remaining is the task to rename
pda.Y -> Z
in order to make variable references the same under both arches. Presumably the Z is the corresponding 32 bit variable. There are likely a number of cases where the transformation
is trivial if we just identify the corresponding 32 bit equivalent.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (16 preceding siblings ...)
2008-07-09 17:27 ` Jeremy Fitzhardinge
@ 2008-07-09 19:28 ` Ingo Molnar
2008-07-09 20:55 ` Mike Travis
2008-07-09 20:00 ` Eric W. Biederman
18 siblings, 1 reply; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 19:28 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
* Mike Travis <travis@sgi.com> wrote:
> * x86_64: Rebase per cpu variables to zero
>
> Take advantage of the zero-based per cpu area provided above. Then
> we can directly use the x86_32 percpu operations. x86_32 offsets
> %fs by __per_cpu_start. x86_64 has %gs pointing directly to the
> pda and the per cpu area thereby allowing access to the pda with
> the x86_64 pda operations and access to the per cpu variables
> using x86_32 percpu operations.
hm, have the binutils (or gcc) problems with this been resolved?
If common binutils versions miscompile the kernel with this feature then
i guess we cannot just unconditionally enable it. (My hope is that it's
not necessarily a binutils bug but some broken assumption of the kernel
somewhere.)
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:12 ` Christoph Lameter
@ 2008-07-09 19:32 ` Jeremy Fitzhardinge
2008-07-09 19:41 ` Ingo Molnar
2008-07-09 19:44 ` Christoph Lameter
0 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 19:32 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> Well they help in the sense that the patches get rid of the special X_pda(Y) operations.
> x86_X_percpu will then exist under 32 bit and 64 bit.
>
> What is remaining is the task to rename
>
> pda.Y -> Z
>
> in order to make variable references the same under both arches. Presumably the Z is the corresponding 32 bit variable. There are likely a number of cases where the transformation
> is trivial if we just identify the corresponding 32 bit equivalent.
>
Yes, I understand that, but it's still pointless churn. The
intermediate step is no improvement over what was there before, and
isn't any closer to the desired final result.
Once you've made the pda a percpu variable, and redefined all the X_pda
macros in terms of x86_X_percpu, then there's no need to touch all the
usage sites until you're *actually* going to unify something. Touching
them all just because you find "X_pda" unsightly doesn't help anyone.
Ideally every site you touch will remove a #ifdef CONFIG_X86_64, or make
two as-yet unified pieces of code closer to unification.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:09 ` Mike Travis
2008-07-09 18:30 ` H. Peter Anvin
@ 2008-07-09 19:34 ` Ingo Molnar
2008-07-09 19:44 ` H. Peter Anvin
` (2 more replies)
1 sibling, 3 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 19:34 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, H. Peter Anvin, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
* Mike Travis <travis@sgi.com> wrote:
> > This fragility makes me very nervous. It seems hard enough to get
> > this stuff working with current tools; making it work over the whole
> > range of supported tools looks like its going to be hard.
>
> (me too ;-)
>
> Once I get a solid version working with (at least) gcc-4.2.4, then
> regression testing with older tools will be easier, or at least a
> table of results can be produced.
the problem is, we cannot just put it even into tip/master if there's no
short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
me otherwise, for series of thousands of randomly built kernels.
can we just leave out the zero-based percpu stuff safely and could i
test the rest of your series - or are there dependencies? I think
zero-based percpu, while nice in theory, is probably just a very small
positive effect so it's not a life or death issue. (or is there any
deeper, semantic reason why we'd want it?)
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:32 ` Jeremy Fitzhardinge
@ 2008-07-09 19:41 ` Ingo Molnar
2008-07-09 19:45 ` H. Peter Anvin
` (2 more replies)
2008-07-09 19:44 ` Christoph Lameter
1 sibling, 3 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 19:41 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Mike Travis, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>> What is remaining is the task to rename
>>
>> pda.Y -> Z
>>
>> in order to make variable references the same under both arches.
>> Presumably the Z is the corresponding 32 bit variable. There are
>> likely a number of cases where the transformation is trivial if we
>> just identify the corresponding 32 bit equivalent.
>
> Yes, I understand that, but it's still pointless churn. The
> intermediate step is no improvement over what was there before, and
> isn't any closer to the desired final result.
>
> Once you've made the pda a percpu variable, and redefined all the
> X_pda macros in terms of x86_X_percpu, then there's no need to touch
> all the usage sites until you're *actually* going to unify something.
> Touching them all just because you find "X_pda" unsightly doesn't help
> anyone. Ideally every site you touch will remove a #ifdef
> CONFIG_X86_64, or make two as-yet unified pieces of code closer to
> unification.
that makes sense. Does everyone agree on #1-#2-#3 and then gradual
elimination of most pda members (without going through an intermediate
renaming of pda members) being the way to go?
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:32 ` Jeremy Fitzhardinge
2008-07-09 19:41 ` Ingo Molnar
@ 2008-07-09 19:44 ` Christoph Lameter
2008-07-09 19:48 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 19:44 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> Yes, I understand that, but it's still pointless churn. The
> intermediate step is no improvement over what was there before, and
> isn't any closer to the desired final result.
No, it's not pointless churn. We actually eliminate all the pda operations and use the per_cpu operations both on 32 and 64 bit. That is unification.
We would be glad if you could contribute the patches to get rid of the pda.xxx references. I do not think that either Mike or I have the 32 bit expertise needed to do that step. We went as far as we could. The patches are touching all the points of interest so locating the lines to fix should be easy.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:34 ` Ingo Molnar
@ 2008-07-09 19:44 ` H. Peter Anvin
2008-07-09 20:26 ` Adrian Bunk
2008-07-09 21:03 ` Mike Travis
2008-07-09 21:23 ` Jeremy Fitzhardinge
2 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 19:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mike Travis, Jeremy Fitzhardinge, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
Ingo Molnar wrote:
>
> the problem is, we cannot just put it even into tip/master if there's no
> short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
> me otherwise, for series of thousands of randomly built kernels.
>
4.2.3 is fine; he was using 4.2.0 before, and as far as I know, 4.2.0
and 4.2.1 are known broken for the kernel.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:41 ` Ingo Molnar
@ 2008-07-09 19:45 ` H. Peter Anvin
2008-07-09 19:52 ` Christoph Lameter
2008-07-09 21:05 ` Mike Travis
2 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 19:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Christoph Lameter, Mike Travis,
Andrew Morton, Eric W. Biederman, Jack Steiner, linux-kernel
Ingo Molnar wrote:
>
> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> elimination of most pda members (without going through an intermediate
> renaming of pda members) being the way to go?
>
Works for me.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:44 ` Christoph Lameter
@ 2008-07-09 19:48 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 19:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Christoph Lameter wrote:
> We would be glad if you could contribute the patches to get rid of the pda.xxx references. I do not think that either Mike or I have the 32 bit expertise needed to do that step. We went as far as we could. The patches are touching all the points of interest so locating the lines to fix should be easy.
>
Yes, I'd be happy to contribute.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:41 ` Ingo Molnar
2008-07-09 19:45 ` H. Peter Anvin
@ 2008-07-09 19:52 ` Christoph Lameter
2008-07-09 20:00 ` Ingo Molnar
2008-07-09 21:05 ` Mike Travis
2 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 19:52 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Mike Travis, Andrew Morton,
Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> elimination of most pda members (without going through an intermediate
> renaming of pda members) being the way to go?
I think we all agree on 1-2-3.
The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X replaced by the proper percpu names for 32 bit.
With Jeremy's approach we would be doing two steps at once (getting rid of pda ops plus unifying the variable names between 32 and 64 bit). Maybe more difficult to verify correctness. The removal of the pda ops is a pretty straightforward conversion.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
` (17 preceding siblings ...)
2008-07-09 19:28 ` Ingo Molnar
@ 2008-07-09 20:00 ` Eric W. Biederman
2008-07-09 20:05 ` Jeremy Fitzhardinge
` (3 more replies)
18 siblings, 4 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 20:00 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
I just took a quick look at how stack_protector works on x86_64. Unless there is
some deep kernel magic that changes the segment register to %gs from the ABI-specified
%fs, CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
Further, -fstack-protector-all only seems to protect against buffer overflows and
thus corruption of the stack, not stack overflows. So it doesn't appear especially
useful.
So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure out
how to use a zero based percpu area.
That should allow us to make the current pda a per cpu variable, and use %gs with
a large offset to access the per cpu area. And since it is only the per cpu accesses
and the pda accesses that will change we should not need to fight toolchain issues
and other weirdness. The linked binary can remain the same.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:52 ` Christoph Lameter
@ 2008-07-09 20:00 ` Ingo Molnar
2008-07-09 20:09 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 20:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Mike Travis, Andrew Morton,
Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel
* Christoph Lameter <cl@linux-foundation.org> wrote:
> Ingo Molnar wrote:
>
> > that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> > elimination of most pda members (without going through an
> > intermediate renaming of pda members) being the way to go?
>
> I think we all agree on 1-2-3.
>
> The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X
> replaced by the proper percpu names for 32 bit.
>
> With Jeremy's approach we would be doing two steps at once (getting
> rid of pda ops plus unifying the variable names between 32 and 64
> bit). Maybe more difficult to verify correctness. The removal of the
> pda ops is a pretty straighforward conversion.
Yes, but there's nothing magic about pda variables versus percpu
variables. We should be able to do the pda -> unified step just as much
as we can do a percpu -> unified step. We can think of pda as a funky,
pre-percpu-era relic.
The only thing that percpu really offers over pda is its familiarity.
read_pda() has the per-cpu-ness embedded in it, which is nasty with
regard to tracking preemption properties, etc.
So converting to percpu would bring us CONFIG_PREEMPT_DEBUG=y checking
to those ex-pda variables. Today if a read_pda() (or anything but
pcurrent) is done in a non-preempt region that's likely a bug - but
nothing warns about it.
So in that light 4-15 might make some sense in standardizing all these
accesses and making sure it all fits into an existing, familiar API
world, with no register level assumptions and assembly (and ABI) ties,
which is instrumented as well, with explicit smp_processor_id()
dependencies, etc.
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:00 ` Eric W. Biederman
@ 2008-07-09 20:05 ` Jeremy Fitzhardinge
2008-07-09 20:15 ` Ingo Molnar
2008-07-09 20:07 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, Ingo Molnar, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
> I just took a quick look at how stack_protector works on x86_64. Unless there is
> some deep kernel magic that changes the segment register to %gs from the ABI specified
> %fs CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
>
-mcmodel=kernel switches it to using %gs.
> Further, -fstack-protector-all only seems to protect against buffer overflows and
> thus corruption of the stack, not stack overflows. So it doesn't appear especially
> useful.
>
It's a bit useful. But at the cost of preventing a pile of more useful
unification work, not to mention making all access to per-cpu variables
more expensive.
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure out
> how to use a zero based percpu area.
>
Yes, please.
> That should allow us to make the current pda a per cpu variable, and use %gs with
> a large offset to access the per cpu area. And since it is only the per cpu accesses
> and the pda accesses that will change we should not need to fight toolchain issues
> and other weirdness. The linked binary can remain the same.
>
Yes, and it would be functionally identical to the 32-bit code.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:00 ` Eric W. Biederman
2008-07-09 20:05 ` Jeremy Fitzhardinge
@ 2008-07-09 20:07 ` Ingo Molnar
2008-07-09 20:11 ` Jeremy Fitzhardinge
2008-07-09 20:14 ` Arjan van de Ven
2008-07-09 21:39 ` Mike Travis
3 siblings, 1 reply; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 20:07 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel, Arjan van de Ven
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> I just took a quick look at how stack_protector works on x86_64.
> Unless there is some deep kernel magic that changes the segment
> register to %gs from the ABI specified %fs CC_STACKPROTECTOR is
> totally broken on x86_64. We access our pda through %gs.
>
> Further, -fstack-protector-all only seems to protect against buffer
> overflows and thus corruption of the stack, not stack overflows. So
> it doesn't appear especially useful.
CC_STACKPROTECTOR, as fixed in -tip, can catch the splice exploit, and
there's no known breakage in it.
Deep stack recursion itself is not really interesting (as that cannot
arbitrarily be triggered by attackers in most cases). Random overflows of
buffers on the stackframe are a lot more common, and that's what
stackprotector protects against.
( Note that deep stack recursion can be caught via another debug
mechanism, ftrace's mcount approach. )
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying
> to figure out how to use a zero based percpu area.
Note that the zero-based percpu problems are completely unrelated to
stackprotector. I was able to hit them with a stackprotector-disabled
gcc-4.2.3 environment.
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:00 ` Ingo Molnar
@ 2008-07-09 20:09 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Christoph Lameter, Mike Travis, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> * Christoph Lameter <cl@linux-foundation.org> wrote:
>
>
>> Ingo Molnar wrote:
>>
>>
>>> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
>>> elimination of most pda members (without going through an
>>> intermediate renaming of pda members) being the way to go?
>>>
>> I think we all agree on 1-2-3.
>>
>> The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X
>> replaced by the proper percpu names for 32 bit.
>>
>> With Jeremy's approach we would be doing two steps at once (getting
>> rid of pda ops plus unifying the variable names between 32 and 64
>> bit). Maybe more difficult to verify correctness. The removal of the
>> pda ops is a pretty straighforward conversion.
>>
>
> Yes, but there's nothing magic about pda variables versus percpu
> variables. We should be able to do the pda -> unified step just as much
> as we can do a percpu -> unified step. We can think of pda as a funky,
> pre-percpu-era relic.
>
> The only thing that percpu really offers over pda is its familarity.
> read_pda() has the per-cpu-ness embedded in it, which is nasty with
> regard to tracking preemption properties, etc.
>
> So converting to percpu would bring us CONFIG_PREEMPT_DEBUG=y checking
> to those ex-pda variables. Today if a read_pda() (or anything but
> pcurrent) is done in a non-preempt region that's likely a bug - but
> nothing warns about it.
>
> So in that light 4-15 might make some sense in standardizing all these
> accesses and making sure it all fits into an existing, familar API
> world, with no register level assumptions and assembly (and ABI) ties,
> which is instrumented as well, with explicit smp_processor_id()
> dependencies, etc.
>
Yeah, but doing
#define read_pda(x) x86_read_percpu(x)
gives you all that anyway. Though because x86_X_percpu and X_pda are
guaranteed to be atomic with respect to preemption, it's actually
reasonable to use them with preemption enabled.
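[Editorial note: that aliasing can be sketched like this in a userspace model. The macro names mirror the ones discussed in the thread, but the bodies are illustrative stand-ins: plain global variables replace the %gs-relative accesses, so this shows only the shape of the unification, not the kernel's actual implementation.]

```c
#include <assert.h>

/* Userspace stand-ins for the per-cpu accessors (illustrative only). */
#define DEFINE_PER_CPU(type, name) type per_cpu__##name
#define x86_read_percpu(var)       (per_cpu__##var)
#define x86_write_percpu(var, val) (per_cpu__##var = (val))

/* Once the pda is itself a per-cpu variable, the old pda accessors
 * reduce to thin wrappers over the shared percpu ops, and no
 * 64-bit-only machinery remains at the usage sites. */
#define read_pda(field)       x86_read_percpu(field)
#define write_pda(field, val) x86_write_percpu(field, val)

/* An ex-pda field declared as an ordinary per-cpu variable. */
DEFINE_PER_CPU(int, cpunumber);
```

Callers using either spelling then compile to the same access, which is what lets the usage sites stay untouched until a real unification pass.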
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:07 ` Ingo Molnar
@ 2008-07-09 20:11 ` Jeremy Fitzhardinge
2008-07-09 20:18 ` Christoph Lameter
2008-07-09 20:39 ` Arjan van de Ven
0 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:11 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel, Arjan van de Ven
Ingo Molnar wrote:
> Note that the zero-based percpu problems are completely unrelated to
> stackprotector. I was able to hit them with a stackprotector-disabled
> gcc-4.2.3 environment.
The only reason we need to keep a zero-based pda is to support
stack-protector. If we drop it, we can drop the pda - and its
special zero-based properties - entirely.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:00 ` Eric W. Biederman
2008-07-09 20:05 ` Jeremy Fitzhardinge
2008-07-09 20:07 ` Ingo Molnar
@ 2008-07-09 20:14 ` Arjan van de Ven
2008-07-09 20:33 ` Eric W. Biederman
2008-07-09 21:39 ` Mike Travis
3 siblings, 1 reply; 190+ messages in thread
From: Arjan van de Ven @ 2008-07-09 20:14 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
On Wed, 09 Jul 2008 13:00:19 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> I just took a quick look at how stack_protector works on x86_64.
> Unless there is some deep kernel magic that changes the segment
> register to %gs from the ABI specified %fs CC_STACKPROTECTOR is
> totally broken on x86_64. We access our pda through %gs.
and so does gcc in kernel mode.
>
> Further, -fstack-protector-all only seems to protect against buffer
> overflows and thus corruption of the stack, not stack overflows. So
> it doesn't appear especially useful.
stopping buffer overflows and other return address corruption is not
useful? Excuse me?
>
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying
> to figure out how to use a zero based percpu area.
So why don't we NOT do that and fix instead what you're trying to do?
>
> That should allow us to make the current pda a per cpu variable, and
> use %gs with a large offset to access the per cpu area.
and what does that gain us?
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:05 ` Jeremy Fitzhardinge
@ 2008-07-09 20:15 ` Ingo Molnar
0 siblings, 0 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 20:15 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>> Further, -fstack-protector-all only seems to protect against buffer
>> overflows and thus corruption of the stack, not stack overflows. So
>> it doesn't appear especially useful.
>
> It's a bit useful. But at the cost of preventing a pile of more
> useful unification work, not to mention making all access to per-cpu
> variables more expensive.
well, stackprotector is near zero maintenance trouble. It mostly binds
in places that are fundamentally non-unifiable anyway. (nobody is going
to unify the assembly code in switch_to())
i had zero-based percpu problems (early crashes) with a 4.2.3 gcc that
had -fstack-protector compiled out, so there's no connection there.
In its fixed form in tip/core/stackprotector it can catch the splice
exploit which makes it quite a bit useful. It would be rather silly to
not offer that feature.
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:11 ` Jeremy Fitzhardinge
@ 2008-07-09 20:18 ` Christoph Lameter
2008-07-09 20:33 ` Jeremy Fitzhardinge
2008-07-09 20:35 ` H. Peter Anvin
2008-07-09 20:39 ` Arjan van de Ven
1 sibling, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 20:18 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
>> Note that the zero-based percpu problems are completely unrelated to
>> stackprotector. I was able to hit them with a stackprotector-disabled
>> gcc-4.2.3 environment.
>
> The only reason we need to keep a zero-based pda is to support
> stack-protector. If we drop it, we can drop the pda - and its
> special zero-based properties - entirely.
Another reason to use a zero based per cpu area is to limit the offset
range. Limiting the offset range in turn allows limiting the size of the
generated instructions, because the offset is encoded as part of the
instruction. It is also easier to handle, since __per_cpu_start does not
figure in the calculation of the offsets.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 18:46 ` Jeremy Fitzhardinge
@ 2008-07-09 20:22 ` Eric W. Biederman
2008-07-09 20:35 ` Jeremy Fitzhardinge
2008-07-09 21:10 ` Arjan van de Ven
0 siblings, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 20:22 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> It's just the stack canary. It isn't library accesses; it's the code gcc
> generates:
>
> foo: subq $152, %rsp
> movq %gs:40, %rax
> movq %rax, 136(%rsp)
> ...
> movq 136(%rsp), %rdx
> xorq %gs:40, %rdx
> je .L3
> call __stack_chk_fail
> .L3:
> addq $152, %rsp
> .p2align 4,,4
> ret
>
>
> There are two irritating things here:
>
> One is that the kernel supports -fstack-protector for x86-64, which forces us
> into all these contortions in the first place. We don't support stack-protector
> for 32-bit (gcc does), and things are much easier.
How does gcc know to use %gs instead of the usual %fs for accessing
the stack protector variable? My older gcc-4.1.x on ubuntu always uses %fs.
> The other somewhat orthogonal irritation is the fixed "40". If they'd generated
> %gs:__gcc_stack_canary, then we could alias that to a per-cpu variable like
> anything else and the whole problem would go away - and we could support
> stack-protector on 32-bit with no problems (and normal usermode could define
> __gcc_stack_canary to be a weak symbol with value "40" (20 on 32-bit) for
> backwards compatibility).
>
> I'm close to proposing that we run a post-processor over the generated assembly
> to perform the %gs:40 -> %gs:__gcc_stack_canary transformation and deal with it
> that way.
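The quoted post-processor amounts to a single textual rewrite over the generated .s files - in practice most likely a sed step in kbuild, e.g. s/%gs:40/%gs:__gcc_stack_canary/g. A toy C model of that transform (the symbol name is taken from the proposal above; the helper is illustrative, not actual kernel build code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite every hard-coded "%gs:40" canary reference into the symbolic
 * "%gs:__gcc_stack_canary" form, so the canary could be aliased to an
 * ordinary per-cpu variable. Sketch only; a real build step would run
 * something like sed over each generated .s file instead. */
static void rewrite_canary(char *line, size_t bufsz)
{
    static const char from[] = "%gs:40";
    static const char to[]   = "%gs:__gcc_stack_canary";
    char out[512] = "";
    const char *src = line;
    char *p;

    while ((p = strstr(src, from)) != NULL) {
        strncat(out, src, (size_t)(p - src));  /* copy text before the match */
        strcat(out, to);                       /* substitute the symbolic form */
        src = p + strlen(from);                /* continue after the match */
    }
    strcat(out, src);                          /* copy the tail */
    snprintf(line, bufsz, "%s", out);
}
```

Fed "movq %gs:40, %rax" this yields "movq %gs:__gcc_stack_canary, %rax", matching the rewrite described above.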
Or we could do something completely evil. And use the other segment
register for the stack canary.
I think the unification is valid and useful, and that trying to keep
that stupid stack canary working is currently more trouble than it is
worth.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:44 ` H. Peter Anvin
@ 2008-07-09 20:26 ` Adrian Bunk
0 siblings, 0 replies; 190+ messages in thread
From: Adrian Bunk @ 2008-07-09 20:26 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Ingo Molnar, Mike Travis, Jeremy Fitzhardinge, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
On Wed, Jul 09, 2008 at 03:44:51PM -0400, H. Peter Anvin wrote:
> Ingo Molnar wrote:
>>
>> the problem is, we cannot just put it even into tip/master if there's
>> no short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid
>> for me otherwise, for series of thousands of randomly built kernels.
>>
>
> 4.2.3 is fine; he was using 4.2.0 before, and as far as I know, 4.2.0
> and 4.2.1 are known broken for the kernel.
Not sure where your knowledge comes from, but the ones I try to get
blacklisted due to known gcc bugs are 4.1.0 and 4.1.1.
On a larger picture, we officially support gcc >= 3.2, and if any kernel
change triggers a bug with e.g. gcc 3.2.3 that's technically a
regression in the kernel...
> -hpa
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:14 ` Arjan van de Ven
@ 2008-07-09 20:33 ` Eric W. Biederman
2008-07-09 21:01 ` Ingo Molnar
0 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 20:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Mike Travis, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
Arjan van de Ven <arjan@infradead.org> writes:
> On Wed, 09 Jul 2008 13:00:19 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>>
>> I just took a quick look at how stack_protector works on x86_64.
>> Unless there is some deep kernel magic that changes the segment
>> register to %gs from the ABI specified %fs CC_STACKPROTECTOR is
>> totally broken on x86_64. We access our pda through %gs.
>
> and so does gcc in kernel mode.
Some gcc's in kernel mode. The one I tested with doesn't.
>> Further -fstack-protector-all only seems to protect against buffer
>> overflows and thus corruption of the stack, not stack overflows. So
>> it doesn't appear especially useful.
>
> stopping buffer overflows and other return address corruption is not
> useful? Excuse me?
Stopping buffer overflows and return address corruption is useful. Simply
catching and panic'ing the machine when they occur is less useful. We aren't
perfect, but we have a pretty good track record of handling this with
old-fashioned methods.
>> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR? Stop trying
>> to figure out how to use a zero based percpu area.
>
> So why don't we NOT do that and instead fix what you're trying to do?
So our choices are:
fix -fstack-protector to not use a hard coded offset, or
fix gcc/ld to not miscompile the kernel at random times, which prevents us
from booting when we add a segment with an address at 0.
-fstack-protector does not use the TLS ABI and instead uses nasty hard-coded
magic, and that is why it is a problem. Otherwise we could easily support it.
>> That should allow us to make the current pda a per cpu variable, and
>> use %gs with a large offset to access the per cpu area.
>
> and what does that gain us?
A faster more maintainable kernel.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:18 ` Christoph Lameter
@ 2008-07-09 20:33 ` Jeremy Fitzhardinge
2008-07-09 20:42 ` H. Peter Anvin
2008-07-09 21:25 ` Christoph Lameter
2008-07-09 20:35 ` H. Peter Anvin
1 sibling, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Ingo Molnar wrote:
>>
>>> Note that the zero-based percpu problems are completely unrelated to
>>> stackprotector. I was able to hit them with a stackprotector-disabled
>>> gcc-4.2.3 environment.
>>>
>> The only reason we need to keep a zero-based pda is to support
>> stack-protector. If we drop it, we can drop the pda - and its
>> special zero-based properties - entirely.
>>
>
>
> Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range in turn allows limiting the size of the generated instructions, because the offset is encoded as part of the instruction.
No, it makes no difference. %gs:X always has a 32-bit offset in the
instruction, regardless of how big X is:
mov %eax, %gs:0
mov %eax, %gs:0x1234567
->
0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567
> It is also easier to handle, since __per_cpu_start does not figure
> in the calculation of the offsets.
>
No, you do it the same as i386. You set the segment base to be
percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo
directly. You can use rip-relative addressing to make it a smaller
addressing mode too:
0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:18 ` Christoph Lameter
2008-07-09 20:33 ` Jeremy Fitzhardinge
@ 2008-07-09 20:35 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 20:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
>
> Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range in turn allows limiting the size of the generated instructions, because the offset is encoded as part of the instruction. It is also easier to handle, since __per_cpu_start does not figure
> in the calculation of the offsets.
>
No, that makes no difference. There is no short-offset form that
doesn't involve a register (ignoring the 16-bit 67h form on 32 bits.)
For 64 bits, you want to keep the offsets within %rip±2 GB, or you will
have relocation overflows for %rip-based forms; for absolute forms you
have to be in the range 0-4 GB. The %rip-based forms are shorter, and
I'm pretty sure they're the ones we currently generate. Since we base
the kernel at 0xffffffff80000000 (-2 GB) this means a zero-based offset
is actively wrong, and only works by accident (since the first
CONFIG_PHYSICAL_START bytes of that space are unused.)
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:22 ` Eric W. Biederman
@ 2008-07-09 20:35 ` Jeremy Fitzhardinge
2008-07-09 20:53 ` Eric W. Biederman
2008-07-09 21:10 ` Arjan van de Ven
1 sibling, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:35 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
> How does gcc know to use %gs instead of the usual %fs for accessing
> the stack protector variable? My older gcc-4.1.x on ubuntu always uses %fs.
>
-mcmodel=kernel
> Or we could do something completely evil. And use the other segment
> register for the stack canary.
>
That would still require gcc changes, so it doesn't help much.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:11 ` Jeremy Fitzhardinge
2008-07-09 20:18 ` Christoph Lameter
@ 2008-07-09 20:39 ` Arjan van de Ven
2008-07-09 20:44 ` H. Peter Anvin
2008-07-09 20:46 ` Jeremy Fitzhardinge
1 sibling, 2 replies; 190+ messages in thread
From: Arjan van de Ven @ 2008-07-09 20:39 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
On Wed, 09 Jul 2008 13:11:03 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Ingo Molnar wrote:
> > Note that the zero-based percpu problems are completely unrelated
> > to stackprotector. I was able to hit them with a
> > stackprotector-disabled gcc-4.2.3 environment.
>
>> The only reason we need to keep a zero-based pda is to support
>> stack-protector. If we drop it, we can drop the pda - and its
>> special zero-based properties - entirely.
what's wrong with zero based btw?
do they stop us from using gcc's __thread keyword for per cpu variables
or something? (*that* would be a nice feature)
or does it stop us from putting the per cpu variables starting from
offset 4096 onwards?
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:33 ` Jeremy Fitzhardinge
@ 2008-07-09 20:42 ` H. Peter Anvin
2008-07-09 20:48 ` Jeremy Fitzhardinge
2008-07-09 21:25 ` Christoph Lameter
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 20:42 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
>
> No, you do it the same as i386. You set the segment base to be
> percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo
> directly. You can use rip-relative addressing to make it a smaller
> addressing mode too:
>
> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
>
Thinking about this some more, I don't know if it would make sense to
put the x86-64 stack canary at the *end* of the percpu area, and
otherwise use negative offsets. That would make sure they were readily
reachable from %rip-based references from within the kernel text area.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:39 ` Arjan van de Ven
@ 2008-07-09 20:44 ` H. Peter Anvin
2008-07-09 20:50 ` Jeremy Fitzhardinge
2008-07-09 20:46 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 20:44 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Jeremy Fitzhardinge, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
Arjan van de Ven wrote:
> On Wed, 09 Jul 2008 13:11:03 -0700
> Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>> Ingo Molnar wrote:
>>> Note that the zero-based percpu problems are completely unrelated
>>> to stackprotector. I was able to hit them with a
>>> stackprotector-disabled gcc-4.2.3 environment.
>> The only reason we need to keep a zero-based pda is to support
>> stack-protector. If we drop it, we can drop the pda - and its
>> special zero-based properties - entirely.
>
> what's wrong with zero based btw?
>
Two problems:
1. it means pda references are invalid if their offsets are ever more
than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)
2. some vague hints of a linker bug.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:39 ` Arjan van de Ven
2008-07-09 20:44 ` H. Peter Anvin
@ 2008-07-09 20:46 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:46 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
Arjan van de Ven wrote:
> what's wrong with zero based btw?
>
Nothing in principle. In practice it's triggering an amazing variety of
toolchain bugs.
> do they stop us from using gcc's __thread keyword for per cpu variables
> or something? (*that* would be a nice feature)
>
The powerpc guys tried it, and it doesn't work. per-cpu is not
semantically equivalent to per-thread. If you have a function in which
you refer to a percpu variable and then have a preemptable section in
the middle followed by another reference to the same percpu variable,
it's hard to stop gcc from caching a reference to the old tls variable,
even though we may have switched cpus in the meantime.
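The hazard described above can be modelled without a compiler at all: if the address produced by a per-cpu accessor is cached across a point where the task may migrate, later accesses hit the old cpu's copy. A small simulation (two fake cpus and a fake accessor; all names here are illustrative, not real kernel interfaces):

```c
#include <assert.h>

/* Two simulated cpus, each with its own copy of a per-cpu counter. */
static long percpu_counter[2];
static int current_cpu;

/* The accessor must be re-evaluated at every use; caching its result is
 * exactly what a compiler may do with a __thread variable's address. */
#define this_cpu_ptr(var) (&(var)[current_cpu])

/* Returns 1 if the stale-cached-pointer bug manifests. */
static int demo_stale_cache(void)
{
    long *cached;

    percpu_counter[0] = percpu_counter[1] = 0;

    current_cpu = 0;
    cached = this_cpu_ptr(percpu_counter);  /* cached, like a TLS address */
    (*cached)++;                            /* hits cpu 0's copy: correct */

    current_cpu = 1;                        /* preemption + migration here */
    (*cached)++;                            /* still hits cpu 0's copy: wrong cpu */

    return percpu_counter[0] == 2 && percpu_counter[1] == 0;
}
```

Re-evaluating this_cpu_ptr() after the "migration" would do the right thing; the problem is stopping the compiler from hoisting and reusing the first evaluation.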
Also, we explicitly use the other segment register in kernel mode, to
avoid segment register switches where possible. Even with
-mcmodel=kernel, gcc generates %fs references to tls variables.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:42 ` H. Peter Anvin
@ 2008-07-09 20:48 ` Jeremy Fitzhardinge
2008-07-09 21:06 ` Eric W. Biederman
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:48 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
H. Peter Anvin wrote:
> Thinking about this some more, I don't know if it would make sense to
> put the x86-64 stack canary at the *end* of the percpu area, and
> otherwise use negative offsets. That would make sure they were
> readily reachable from %rip-based references from within the kernel
> text area.
If we can move the canary then a whole pile of options open up. But the
problem is that we can't.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:44 ` H. Peter Anvin
@ 2008-07-09 20:50 ` Jeremy Fitzhardinge
2008-07-09 21:12 ` H. Peter Anvin
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 20:50 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
H. Peter Anvin wrote:
> 1. it means pda references are invalid if their offsets are ever more
> than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)
Why?
As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE
0 - putting the percpu variables as the first thing in the kernel - and
relocating on load? That would avoid having to make a special PT_LOAD
segment at 0. Hm, would that result in the pda and the boot params
getting mushed together?
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:35 ` Jeremy Fitzhardinge
@ 2008-07-09 20:53 ` Eric W. Biederman
2008-07-09 21:03 ` Ingo Molnar
2008-07-09 21:16 ` H. Peter Anvin
0 siblings, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 20:53 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Eric W. Biederman, Mike Travis, Christoph Lameter, Ingo Molnar,
Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel
Jeremy Fitzhardinge <jeremy@goop.org> writes:
>> Or we could do something completely evil. And use the other segment
>> register for the stack canary.
>>
>
> That would still require gcc changes, so it doesn't help much.
We could use %fs for the per cpu variables. Then we could set %gs to whatever
we wanted to sync up with gcc.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:28 ` Ingo Molnar
@ 2008-07-09 20:55 ` Mike Travis
2008-07-09 21:12 ` Ingo Molnar
0 siblings, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-09 20:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> * Mike Travis <travis@sgi.com> wrote:
>
>> * x86_64: Rebase per cpu variables to zero
>>
>> Take advantage of the zero-based per cpu area provided above. Then
>> we can directly use the x86_32 percpu operations. x86_32 offsets
>> %fs by __per_cpu_start. x86_64 has %gs pointing directly to the
>> pda and the per cpu area thereby allowing access to the pda with
>> the x86_64 pda operations and access to the per cpu variables
>> using x86_32 percpu operations.
>
> hm, have the binutils (or gcc) problems with this been resolved?
>
> If common binutils versions miscompile the kernel with this feature then
> i guess we cannot just unconditionally enable it. (My hope is that it's
> not necessarily a binutils bug but some broken assumption of the kernel
> somewhere.)
>
> Ingo
Currently I'm using gcc-4.2.4. Which are you using?
I labeled it "RFC" as it does not quite work without THREAD_ORDER bumped to 2.
And I believe the stack overflow is happening because of some interrupt
routine as it does not happen on our simulator.
After that is taken care of, I'll start regression testing earlier compilers.
I think someone mentioned that gcc-2.something was the minimum required...?
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:33 ` Eric W. Biederman
@ 2008-07-09 21:01 ` Ingo Molnar
0 siblings, 0 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 21:01 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Arjan van de Ven, Mike Travis, Jeremy Fitzhardinge, Andrew Morton,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> Arjan van de Ven <arjan@infradead.org> writes:
>
> > On Wed, 09 Jul 2008 13:00:19 -0700
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> >>
> >> I just took a quick look at how stack_protector works on x86_64.
> >> Unless there is some deep kernel magic that changes the segment
> >> register to %gs from the ABI specified %fs CC_STACKPROTECTOR is
> >> totally broken on x86_64. We access our pda through %gs.
> >
> > and so does gcc in kernel mode.
>
> Some gcc's in kernel mode. The one I tested with doesn't.
yes - stackprotector-enabled distros build with a kernel-mode-enabled gcc.
> >> Further -fstack-protector-all only seems to protect against buffer
> >> overflows and thus corruption of the stack, not stack overflows.
> >> So it doesn't appear especially useful.
> >
> > stopping buffer overflows and other return address corruption is not
> > useful? Excuse me?
>
> Stopping buffer overflows and return address corruption is useful.
> Simply catching and panic'ing the machine when they occur is less
> useful. [...]
why? I personally prefer an informative panic in an overflow-suspect
piece of code instead of a guest root on my machine.
I think you miss one of the fundamental security aspects here. The panic
is not there just to inform the administrator - although it certainly
has such a role.
It is mainly there to _deter_ attackers from experimenting with certain
exploits.
For the more sophisticated attackers (not the script kiddies - the ones
who can do serious economic harm) their exploits and their attack
vectors are their main assets. They want their exploits to work on the
next target as well, and they want to be as stealthy as possible.
For a script kiddie crashing a box is not a big issue - they work with
public exploits.
This means that serious attackers will only use an attack if its
effects are 100% deterministic - they won't risk something like a 50%/50%
chance of a crash (or even a 10% chance of a crash). Some of the most
sophisticated kernel exploits i've seen had like 80% of their code
complexity in making sure that they don't crash the target box. They were
more resilient than a lot of kernel code we have.
> [...] We aren't perfect but we have a pretty good track record of
> handling this with old fashioned methods.
That's your opinion. A valid counter point is that more layers of
defense, in a fundamentally fragile area (buffers on the stack, return
addresses), cannot hurt. If you've got a firewall that is only 10% busy
even under peak load it's a valid option to spend some CPU cycles on
preventive measures.
A firewall _itself_ is huge overhead already - so there's absolutely no
valid technical reason to forbid a firewall from having something like
stackprotector built into its kernel. (and into most of its userspace)
We could have caught the vmsplice exploit as well with stackprotector -
but our security QA was not strong enough to keep it from slowly
regressing (without anyone noticing). That's fixed now in
tip/core/stackprotector.
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:34 ` Ingo Molnar
2008-07-09 19:44 ` H. Peter Anvin
@ 2008-07-09 21:03 ` Mike Travis
2008-07-09 21:23 ` Jeremy Fitzhardinge
2 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 21:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, H. Peter Anvin, Andrew Morton,
Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> * Mike Travis <travis@sgi.com> wrote:
>
>>> This fragility makes me very nervous. It seems hard enough to get
>>> this stuff working with current tools; making it work over the whole
>>> range of supported tools looks like its going to be hard.
>> (me too ;-)
>>
>> Once I get a solid version working with (at least) gcc-4.2.4, then
>> regression testing with older tools will be easier, or at least a
>> table of results can be produced.
>
> the problem is, we cannot just put it even into tip/master if there's no
> short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
> me otherwise, for series of thousands of randomly built kernels.
Great, I'll request we load gcc-4.2.3 on our devel server.
>
> can we just leave out the zero-based percpu stuff safely and could i
> test the rest of your series - or are there dependencies? I think
> zero-based percpu, while nice in theory, is probably just a very small
> positive effect so it's not a life or death issue. (or is there any
> deeper, semantic reason why we'd want it?)
I sort of assumed that zero-based would not make it into 2.6.26-rcX,
and no, reaching 4096 cpus does not require it. The other patches
I've been submitting are more general and will fix possible panics
(like stack overflows, etc.)
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:53 ` Eric W. Biederman
@ 2008-07-09 21:03 ` Ingo Molnar
2008-07-09 21:16 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 21:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter,
Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
> >> Or we could do something completely evil. And use the other segment
> >> register for the stack canary.
> >>
> >
> > That would still require gcc changes, so it doesn't help much.
>
> We could use %fs for the per cpu variables. Then we could set %gs to
> whatever we wanted to sync up with gcc
one problem is, there's no SWAPFS instruction.
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:41 ` Ingo Molnar
2008-07-09 19:45 ` H. Peter Anvin
2008-07-09 19:52 ` Christoph Lameter
@ 2008-07-09 21:05 ` Mike Travis
2 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 21:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Christoph Lameter, Andrew Morton,
Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>>> What is remaining is the task to rename
>>>
>>> pda.Y -> Z
>>>
>>> in order to make variable references the same under both arches.
>>> Presumably the Z is the corresponding 32 bit variable. There are
>>> likely a number of cases where the transformation is trivial if we
>>> just identify the corresponding 32 bit equivalent.
>> Yes, I understand that, but it's still pointless churn. The
>> intermediate step is no improvement over what was there before, and
>> isn't any closer to the desired final result.
>>
>> Once you've made the pda a percpu variable, and redefined all the
>> X_pda macros in terms of x86_X_percpu, then there's no need to touch
>> all the usage sites until you're *actually* going to unify something.
>> Touching them all just because you find "X_pda" unsightly doesn't help
>> anyone. Ideally every site you touch will remove a #ifdef
>> CONFIG_X86_64, or make two as-yet unified pieces of code closer to
>> unification.
>
> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> elimination of most pda members (without going through an intermediate
> renaming of pda members) being the way to go?
>
> Ingo
This is fine with me... not much more work required to go "all the way"... ;-)
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:48 ` Jeremy Fitzhardinge
@ 2008-07-09 21:06 ` Eric W. Biederman
2008-07-09 21:16 ` H. Peter Anvin
2008-07-09 21:20 ` Jeremy Fitzhardinge
0 siblings, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 21:06 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Christoph Lameter, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> H. Peter Anvin wrote:
>> Thinking about this some more, I don't know if it would make sense to put the
>> x86-64 stack canary at the *end* of the percpu area, and otherwise use
>> negative offsets. That would make sure they were readily reachable from
>> %rip-based references from within the kernel text area.
>
> If we can move the canary then a whole pile of options open up. But the problem
> is that we can't.
But we can pick an arbitrary point for %gs to point at.
Hmm. This whole thing is even sillier than I thought.
Why can't we access per cpu vars as:
%gs:(per_cpu__var - __per_cpu_start) ?
If we can subtract constants and let the linker perform that resolution
at link time, a zero based per cpu segment becomes a moot issue.
We may need to change the definition of PERCPU in vmlinux.lds.h to
#define PERCPU(align) \
. = ALIGN(align); \
- __per_cpu_start = .; \
.data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \
+ __per_cpu_start = .; \
*(.data.percpu) \
*(.data.percpu.shared_aligned) \
+ __per_cpu_end = .; \
+ }
- } \
- __per_cpu_end = .;
So that the linker knows __per_cpu_start and __per_cpu_end are in the same section
but otherwise it sounds entirely reasonable. Just slightly trickier math at link
time.
Eric
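The address arithmetic behind the suggestion above is easy to sanity-check in isolation: with %gs pointing at one cpu's copy of the per cpu area, the link-time constant (per_cpu__var - __per_cpu_start) lands on the right slot of that copy regardless of where the section itself was linked. A sketch with stand-in addresses (the arrays and the 24-byte offset are purely illustrative):

```c
#include <assert.h>

/* Stand-ins for the link-time layout: __per_cpu_start and a variable
 * somewhere inside the .data.percpu section. */
static char linked_image[64];   /* addresses as the linker sees them */
static char cpu_copy[64];       /* one cpu's per-cpu area (the %gs base) */

/* Address a "%gs:(per_cpu__var - __per_cpu_start)" access would reach. */
static char *gs_access(char *gs_base, const char *per_cpu_var,
                       const char *per_cpu_start)
{
    long offset = per_cpu_var - per_cpu_start;  /* constant, folded at link time */
    return gs_base + offset;
}
```

With per_cpu_var at linked_image + 24, gs_access(cpu_copy, linked_image + 24, linked_image) resolves to cpu_copy + 24 - each cpu's %gs base selects that cpu's slot, with no zero-based segment needed.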
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:22 ` Eric W. Biederman
2008-07-09 20:35 ` Jeremy Fitzhardinge
@ 2008-07-09 21:10 ` Arjan van de Ven
2008-07-09 23:20 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: Arjan van de Ven @ 2008-07-09 21:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Ingo Molnar,
Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel
On Wed, 09 Jul 2008 13:22:06 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
> > It's just the stack canary. It isn't library accesses; it's the
> > code gcc generates:
> >
> > foo: subq $152, %rsp
> > movq %gs:40, %rax
> > movq %rax, 136(%rsp)
> > ...
> > movq 136(%rsp), %rdx
> > xorq %gs:40, %rdx
> > je .L3
> > call __stack_chk_fail
> > .L3:
> > addq $152, %rsp
> > .p2align 4,,4
> > ret
> >
> >
> > There are two irritating things here:
> >
> > One is that the kernel supports -fstack-protector for x86-64, which
> > forces us into all these contortions in the first place. We don't
> > support stack-protector for 32-bit (gcc does), and things are much
> > easier.
>
> How does gcc know to use %gs instead of the usual %fs for accessing
> the stack protector variable? My older gcc-4.1.x on ubuntu always
> uses %fs.
ubuntu broke gcc (they don't want to have per-package compiler flags, so
they patch stuff into gcc instead).
> I think the unification is valid and useful, and that trying to keep
> that stupid stack canary working is currently more trouble than it is
> worth.
I think that "unification over everything" is stupid, especially if it
removes useful features.
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:50 ` Jeremy Fitzhardinge
@ 2008-07-09 21:12 ` H. Peter Anvin
2008-07-09 21:26 ` Jeremy Fitzhardinge
2008-07-09 22:10 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 21:12 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> 1. it means pda references are invalid if their offsets are ever more
>> than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)
>
> Why?
>
> As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE
> 0 - putting the percpu variables as the first thing in the kernel - and
> relocating on load? That would avoid having to make a special PT_LOAD
> segment at 0. Hm, would that result in the pda and the boot params
> getting mushed together?
>
CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically
we should make it 16 MB by default (currently 2 MB), to keep the DMA
zone clear.
Either way, I really suspect that the right thing to do is to use
negative offsets, with the possible exception of a handful of things (40
bytes or less, perhaps like current) which can get small positive
offsets and end up in the "super hot" cacheline.
The sucky part is that I don't believe GNU ld has native support for a
"hanging down" section (one which has a fixed endpoint rather than a
starting point), so it requires extra magic around the link (or finding
some way to do it with linker script functions.) Let me see if I can
cook up something in linker script that would actually work.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:55 ` Mike Travis
@ 2008-07-09 21:12 ` Ingo Molnar
0 siblings, 0 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 21:12 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
* Mike Travis <travis@sgi.com> wrote:
> After that is taken care of, I'll start regression testing earlier
> compilers. I think someone mentioned that gcc-2.something was the
> minimum required...?
i think the current official minimum is around gcc-3.2 [2.x is out of the
question because we have a few feature dependencies on gcc-3.x] - but i
stopped using it because it miscompiles the kernel so often. 4.0 was
really bad due to large stack footprint. The 4.3.x series miscompiles
the kernel too in certain situations - there was a high-rising
kerneloops.org crash recently in ext3.
So in general, 'too new' is bad because it has new regressions, 'too
old' is bad because it has unfixed old regressions. Somewhere in the
middle, 4.2.x-ish, seems to be pretty robust in practice.
Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:06 ` Eric W. Biederman
@ 2008-07-09 21:16 ` H. Peter Anvin
2008-07-09 21:20 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 21:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Christoph Lameter, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
>
> But we can pick an arbitrary point where %gs points at.
>
> Hmm. This whole thing is even sillier than I thought.
> Why can't we access per cpu vars as:
> %gs:(per_cpu__var - __per_cpu_start) ?
>
> If we can subtract constants and allow the linker to perform that resolution
> at link time, a zero based per cpu segment becomes a moot issue.
>
And then we're back here again!
Supposedly the linker buggers up, although we don't have conclusive
evidence...
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:53 ` Eric W. Biederman
2008-07-09 21:03 ` Ingo Molnar
@ 2008-07-09 21:16 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 21:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Ingo Molnar,
Andrew Morton, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>>> Or we could do something completely evil. And use the other segment
>>> register for the stack canary.
>>>
>> That would still require gcc changes, so it doesn't help much.
>
> We could use %fs for the per cpu variables. Then we could set %gs to whatever
> we wanted to sync up with gcc
No swapfs instruction, and extra performance penalty because %fs is used
in userspace.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:06 ` Eric W. Biederman
2008-07-09 21:16 ` H. Peter Anvin
@ 2008-07-09 21:20 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 21:20 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Christoph Lameter, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
> But we can pick an arbitrary point where %gs points at.
>
> Hmm. This whole thing is even sillier than I thought.
> Why can't we access per cpu vars as:
> %gs:(per_cpu__var - __per_cpu_start) ?
>
Because there's no linker reloc for doing subtraction (or addition) of
two symbols.
> If we can subtract constants and allow the linker to perform that resolution
> at link time, a zero based per cpu segment becomes a moot issue.
>
They're not constants; they're symbols.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 19:34 ` Ingo Molnar
2008-07-09 19:44 ` H. Peter Anvin
2008-07-09 21:03 ` Mike Travis
@ 2008-07-09 21:23 ` Jeremy Fitzhardinge
2 siblings, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 21:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mike Travis, H. Peter Anvin, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> * Mike Travis <travis@sgi.com> wrote:
>
>
>>> This fragility makes me very nervous. It seems hard enough to get
>>> this stuff working with current tools; making it work over the whole
>>> range of supported tools looks like its going to be hard.
>>>
>> (me too ;-)
>>
>> Once I get a solid version working with (at least) gcc-4.2.4, then
>> regression testing with older tools will be easier, or at least a
>> table of results can be produced.
>>
>
> the problem is, we cannot just put it even into tip/master if there's no
> short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
> me otherwise, for series of thousands of randomly built kernels.
>
> can we just leave out the zero-based percpu stuff safely and could i
> test the rest of your series - or are there dependencies? I think
> zero-based percpu, while nice in theory, is probably just a very small
> positive effect so it's not a life or death issue. (or is there any
> deeper, semantic reason why we'd want it?)
>
I'm looking forward to using it, because I can make the Xen vcpu
structure a percpu variable shared with the hypervisor. This means
something like an interrupt disable becomes a simple "movb
$1,%gs:per_cpu__xen_vcpu_event_mask". If access to percpu variables is
indirect (ie, two instructions) I need to disable preemption which makes
the whole thing much more complex, and too big to inline. There are
other cases where preemption-safe access to percpu variables is useful
as well.
My view, which is admittedly very one-sided, is that all this brokenness
is forced on us by gcc's stack-protector brokenness. My preferred
approach would be to fix -fstack-protector by eliminating the
requirement for small offsets from %gs. With that in place we could
support it without needing a pda. In the meantime, we could either
support stack-protector or direct access to percpu variables. Either
way, we don't need to worry about making zero-based percpu work.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:33 ` Jeremy Fitzhardinge
2008-07-09 20:42 ` H. Peter Anvin
@ 2008-07-09 21:25 ` Christoph Lameter
2008-07-09 21:36 ` H. Peter Anvin
` (2 more replies)
1 sibling, 3 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-09 21:25 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
> No, it makes no difference. %gs:X always has a 32-bit offset in the
> instruction, regardless of how big X is:
>
> mov %eax, %gs:0
> mov %eax, %gs:0x1234567
> ->
> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567
The processor itself supports smaller offsets.
Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation.
>> It also is easier to handle since __per_cpu_start does not figure
>> in the calculation of the offsets.
>>
>
> No, you do it the same as i386. You set the segment base to be
> percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo
> directly. You can use rip-relative addressing to make it a smaller
> addressing mode too:
>
> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:12 ` H. Peter Anvin
@ 2008-07-09 21:26 ` Jeremy Fitzhardinge
2008-07-09 21:37 ` H. Peter Anvin
2008-07-09 22:10 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 21:26 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
H. Peter Anvin wrote:
> Either way, I really suspect that the right thing to do is to use
> negative offsets, with the possible exception of a handful of things
> (40 bytes or less, perhaps like current) which can get small positive
> offsets and end up in the "super hot" cacheline.
>
> The sucky part is that I don't believe GNU ld has native support for a
> "hanging down" section (one which has a fixed endpoint rather than a
> starting point), so it requires extra magic around the link (or
> finding some way to do it with linker script functions.)
If you're going to do another linker pass, you could have a script to
extract all the percpu symbols and generate a set of derived zero-based
ones and then link against that.
Or generate a vmlinux with relocations and "relocate" all the percpu
symbols down to 0.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:25 ` Christoph Lameter
@ 2008-07-09 21:36 ` H. Peter Anvin
2008-07-09 21:41 ` Jeremy Fitzhardinge
2008-07-09 22:22 ` Eric W. Biederman
2 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 21:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> No, it makes no difference. %gs:X always has a 32-bit offset in the
>> instruction, regardless of how big X is:
>>
>> mov %eax, %gs:0
>> mov %eax, %gs:0x1234567
>> ->
>> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
>> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567
>
> The processor itself supports smaller offsets.
No, it doesn't, unless you have a base register. There is no naked
disp8 form, and disp16 is only available in 16- or 32-bit mode (and in
32-bit form it requires a 67h prefix.)
> Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation.
The offset is either ±2 GB from the segment register, or ±2 GB from the
segment register plus %rip. The latter is more efficient.
The processor *does* permit a 64-bit absolute form, which can be used
with a segment register, but that one is hideously restricted (only move
to/from %rax) and bloated (10 bytes!)
>> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
>
> RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area.
Not from the per cpu area, but from the linked address of the per cpu
area (the segment register base can point anywhere.)
In our case that means between -2 GB and a smallish positive value (I believe
it is guaranteed to be 2 MB or more.)
Being able to use %rip-relative forms would save a byte per reference,
which is valuable.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:26 ` Jeremy Fitzhardinge
@ 2008-07-09 21:37 ` H. Peter Anvin
0 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 21:37 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
Jeremy Fitzhardinge wrote:
>
> If you're going to do another linker pass, you could have a script to
> extract all the percpu symbols and generate a set of derived zero-based
> ones and then link against that.
>
> Or generate a vmlinux with relocations and "relocate" all the percpu
> symbols down to 0.
>
Yeah, I'd hate to have to go to either of those lengths though.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 20:00 ` Eric W. Biederman
` (2 preceding siblings ...)
2008-07-09 20:14 ` Arjan van de Ven
@ 2008-07-09 21:39 ` Mike Travis
2008-07-09 21:47 ` Jeremy Fitzhardinge
2008-07-09 21:55 ` Eric W. Biederman
3 siblings, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 21:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
> I just took a quick look at how stack_protector works on x86_64. Unless there is
> some deep kernel magic that changes the segment register to %gs from the ABI-specified
> %fs, CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
>
> Further, -fstack-protector-all only seems to protect against buffer overflows and
> thus corruption of the stack, not stack overflows. So it doesn't appear especially
> useful.
>
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure
> out how to use a zero based percpu area?
>
> That should allow us to make the current pda a per cpu variable, and use %gs with
> a large offset to access the per cpu area. And since it is only the per cpu accesses
> and the pda accesses that will change, we should not need to fight toolchain issues
> and other weirdness. The linked binary can remain the same.
>
> Eric
Hi Eric,
There is one pda op that I was not able to remove. It can most likely be recoded,
but it was a bit over my expertise. The "pda_offset(field)" can probably be
replaced with "per_cpu_var(field)" [per_cpu__##field], but I wasn't sure what to
do about "_proxy_pda.field".
include/asm-x86/pda.h:
/*
* This is not atomic against other CPUs -- CPU preemption needs to be off
* NOTE: This relies on the fact that the cpu_pda is the *first* field in
* the per cpu area. Move it and you'll need to change this.
*/
#define test_and_clear_bit_pda(bit, field) \
({ \
int old__; \
asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0" \
: "=r" (old__), "+m" (_proxy_pda.field) \
: "dIr" (bit), "i" (pda_offset(field)) : "memory");\
old__; \
})
And there is only one reference to it.
arch/x86/kernel/process_64.c:
static void __exit_idle(void)
{
if (test_and_clear_bit_pda(0, isidle) == 0)
return;
atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
}
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:25 ` Christoph Lameter
2008-07-09 21:36 ` H. Peter Anvin
@ 2008-07-09 21:41 ` Jeremy Fitzhardinge
2008-07-09 22:22 ` Eric W. Biederman
2 siblings, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 21:41 UTC (permalink / raw)
To: Christoph Lameter
Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>
>> No, it makes no difference. %gs:X always has a 32-bit offset in the
>> instruction, regardless of how big X is:
>>
>> mov %eax, %gs:0
>> mov %eax, %gs:0x1234567
>> ->
>> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
>> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567
>>
>
> The processor itself supports smaller offsets.
>
Not in 64-bit mode. In 32-bit mode you can use the addr16 prefix, but
that would only save a byte per use (and I doubt it's a fast-path in the
processor).
> Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation.
>
No. The %gs base is a full 64-bit value you can put anywhere in the
address space. So long as your percpu data is within 2G of that point
you can get to it directly.
>> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
>>
>
> RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area.
>
It means the percpu symbols must be within 2G of your code. We can't
compile the kernel any other way (there's no -mcmodel=large-kernel).
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:39 ` Mike Travis
@ 2008-07-09 21:47 ` Jeremy Fitzhardinge
2008-07-09 21:55 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 21:47 UTC (permalink / raw)
To: Mike Travis
Cc: Eric W. Biederman, Ingo Molnar, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Mike Travis wrote:
> Eric W. Biederman wrote:
>
>> I just took a quick look at how stack_protector works on x86_64. Unless there is
>> some deep kernel magic that changes the segment register to %gs from the ABI-specified
>> %fs, CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
>>
>> Further, -fstack-protector-all only seems to protect against buffer overflows and
>> thus corruption of the stack, not stack overflows. So it doesn't appear especially
>> useful.
>>
>> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure
>> out how to use a zero based percpu area?
>>
>> That should allow us to make the current pda a per cpu variable, and use %gs with
>> a large offset to access the per cpu area. And since it is only the per cpu accesses
>> and the pda accesses that will change, we should not need to fight toolchain issues
>> and other weirdness. The linked binary can remain the same.
>>
>> Eric
>>
>
> Hi Eric,
>
> There is one pda op that I was not able to remove. It can most likely be recoded,
> but it was a bit over my expertise. The "pda_offset(field)" can probably be
> replaced with "per_cpu_var(field)" [per_cpu__##field], but I wasn't sure what to
> do about "_proxy_pda.field".
>
> include/asm-x86/pda.h:
>
> /*
> * This is not atomic against other CPUs -- CPU preemption needs to be off
> * NOTE: This relies on the fact that the cpu_pda is the *first* field in
> * the per cpu area. Move it and you'll need to change this.
> */
> #define test_and_clear_bit_pda(bit, field) \
> ({ \
> int old__; \
> asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0" \
> : "=r" (old__), "+m" (_proxy_pda.field) \
> : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
>
asm volatile("btr %2,%%gs:%1\n\tsbbl %0,%0" \
: "=r" (old__), "+m" (per_cpu_var(var)) \
: "dIr" (bit) : "memory");\
but it barely seems worthwhile if we really can't use test_and_clear_bit.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:39 ` Mike Travis
2008-07-09 21:47 ` Jeremy Fitzhardinge
@ 2008-07-09 21:55 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 21:55 UTC (permalink / raw)
To: Mike Travis
Cc: Eric W. Biederman, Jeremy Fitzhardinge, Ingo Molnar,
Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner,
linux-kernel
Mike Travis <travis@sgi.com> writes:
> Hi Eric,
>
> There is one pda op that I was not able to remove. It can most likely be
> recoded, but it was a bit over my expertise. The "pda_offset(field)" can
> probably be replaced with "per_cpu_var(field)" [per_cpu__##field], but I
> wasn't sure what to do about "_proxy_pda.field".
If you notice, we never use %1. My reading would be we just have the +m
there to tell the compiler we may be changing the field. So just
a reference to the per_cpu_var directly should be sufficient. Although
"memory" may actually be enough.
Eric
* Re: [RFC 02/15] x86_64: Fold pda into per cpu area
2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis
@ 2008-07-09 22:02 ` Eric W. Biederman
2008-07-13 17:54 ` Ingo Molnar
0 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 22:02 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
Mike Travis <travis@sgi.com> writes:
> WARNING: there is still a FIXME in this patch (see arch/x86/kernel/acpi/sleep.c)
>
> * Declare the pda as a per cpu variable.
>
> * Make the x86_64 per cpu area start at zero.
>
> * Relocate the initial pda and per_cpu(gdt_page) in head_64.S for the
> boot cpu (0). For secondary cpus, do_boot_cpu() sets up the correct
> initial pda and gdt_page pointer.
>
> * Initialize per_cpu_offset to point to static pda in the per_cpu area
> (@ __per_cpu_load).
>
> * After allocation of the per cpu area for the boot cpu (0), reload the
> gdt page pointer.
>
> Based on linux-2.6.tip/master
Given that we have not yet understood the weird failure case, this patch needs
to be split in two:
- make the current per cpu variable section zero based.
- move the pda into the per cpu variable section.
There are too many variables at present in the reported failure cases to
guess what is really going on.
We can not optimize the per cpu variable accesses until the pda moves,
but we can easily test for linker and toolchain bugs with the zero
based pda segment itself.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:12 ` H. Peter Anvin
2008-07-09 21:26 ` Jeremy Fitzhardinge
@ 2008-07-09 22:10 ` Eric W. Biederman
2008-07-09 22:23 ` H. Peter Anvin
1 sibling, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 22:10 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
"H. Peter Anvin" <hpa@zytor.com> writes:
> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> 1. it means pda references are invalid if their offsets are ever more than
>>> CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)
>>
>> Why?
>>
>> As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE 0 -
>> putting the percpu variables as the first thing in the kernel - and relocating
>> on load? That would avoid having to make a special PT_LOAD segment at 0. Hm,
>> would that result in the pda and the boot params getting mushed together?
>>
>
> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we
> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear.
Also on x86_64 CONFIG_PHYSICAL_START is irrelevant, as the kernel text segment
is linked at a fixed address -2G and the option only determines the virtual
to physical address mapping.
That said the idea may not be too far off.
Potentially we could put the percpu area at our fixed -2G address and then
we have a constant (instead of an address) we could subtract from this address.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:25 ` Christoph Lameter
2008-07-09 21:36 ` H. Peter Anvin
2008-07-09 21:41 ` Jeremy Fitzhardinge
@ 2008-07-09 22:22 ` Eric W. Biederman
2008-07-09 22:32 ` Jeremy Fitzhardinge
2 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 22:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Ingo Molnar, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter <cl@linux-foundation.org> writes:
> Note also that the 32 bit offset size limits the offset that can be added to the
> segment register. You need to place the per cpu area either in the last
> 2G of the address space or in the first 2G. The zero based approach removes that
> limitation.
Good point. Which means that fundamentally we need to come up with a special
linker segment or some other way to guarantee that the offsets we use for per
cpu variables is within 2G of the segment register.
Which means that my idea of using the technique we use on x86_32 will not work.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 22:10 ` Eric W. Biederman
@ 2008-07-09 22:23 ` H. Peter Anvin
2008-07-09 23:54 ` Eric W. Biederman
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-09 22:23 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
>>>
>> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we
>> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear.
>
> Also on x86_64 CONFIG_PHYSICAL_START is irrelevant, as the kernel text segment
> is linked at a fixed address -2G and the option only determines the virtual
> to physical address mapping.
>
No, it's not irrelevant; we currently base the kernel at virtual address
-2 GB (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the
proper alignment for large pages.
Now, it probably wouldn't hurt moving KERNEL_IMAGE_START up a bit to
have low positive values safer to use.
> That said the idea may not be too far off.
>
> Potentially we could put the percpu area at our fixed -2G address and then
> we have a constant (instead of an address) we could subtract from this address.
We can't put it at -2 GB since the offset +40 for the stack sentinel is
hard-coded into gcc. This leaves growing upward from +48 (or another
small positive number), or growing down from zero (or +40) as realistic
options.
Unfortunately, GNU ld doesn't handle grow-down sections at all.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 22:22 ` Eric W. Biederman
@ 2008-07-09 22:32 ` Jeremy Fitzhardinge
2008-07-09 23:36 ` Eric W. Biederman
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 22:32 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
> Christoph Lameter <cl@linux-foundation.org> writes:
>
>
>> Note also that the 32 bit offset size limits the offset that can be added to the
>> segment register. You need to place the per cpu area either in the last
>> 2G of the address space or in the first 2G. The zero based approach removes that
>> limitation.
>>
>
> Good point. Which means that fundamentally we need to come up with a special
> linker segment or some other way to guarantee that the offsets we use for per
> cpu variables is within 2G of the segment register.
>
> Which means that my idea of using the technique we use on x86_32 will not work.
No, the compiler memory model we use guarantees that everything will be
within 2G of each other. The linker will spew loudly if that's not the
case.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 21:10 ` Arjan van de Ven
@ 2008-07-09 23:20 ` Eric W. Biederman
0 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 23:20 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Ingo Molnar,
Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel
Arjan van de Ven <arjan@infradead.org> writes:
>> I think the unification is valid and useful, and that trying to keep
>> that stupid stack canary working is currently more trouble than it is
>> worth.
>
> I think that "unification over everything" is stupid, especially if it
> removes useful features.
After looking at this some more, any solution that actually works will
enable us to make the stack canary work, as we have a 32bit offset to
deal with. So there is no point in killing the feature.
That said, I have no sympathy for a thread local variable that is
compiled as an absolute symbol instead of using the proper thread
local markup. The implementation of -fstack-protector, however useful,
still appears to be a nasty hack, ignoring decades of best practice in
how to implement things.
Do you have a clue who we need to bug on the gcc team to get the
compiler to implement a proper TLS version of -fstack-protector?
- Unification over everything is stupid.
- Interesting features that disregard decades of implementation experience
are also stupid.
Since we know that the stack_canary is always a part of the
executable (being a fundamental part of glibc and libpthreads etc.),
we can use the local exec model for tls storage. The local exec model
means the compiler should be able to output code such as
"movq %fs:stack_canary@tpoff, %rax" to read the stack canary in user space.
Instead it emits the much more stupid "movq %fs:40, %rax", not even
letting the linker have a say in the placement of the variable.
So we either need to update the gcc code to do something proper, or
someone needs to update the sysv tls abi spec so %fs:40 joins %fs:0 in
the ranks of magic addresses in thread local storage, so that other
compilers can reliably use offset 40 and no one will have an excuse
for changing it in the future. Frankly I think updating the ABI is
the wrong solution, but at least it would document this stupidity.
Does code compiled with -fstack-protector even fail to run with a gcc
that does not implement a thread local variable at %fs:40? Or does it
just silently break?
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 22:32 ` Jeremy Fitzhardinge
@ 2008-07-09 23:36 ` Eric W. Biederman
2008-07-10 0:19 ` H. Peter Anvin
2008-07-10 0:23 ` Jeremy Fitzhardinge
0 siblings, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 23:36 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge <jeremy@goop.org> writes:
>> Which means that my idea of using the technique we use on x86_32 will not
> work.
>
> No, the compiler memory model we use guarantees that everything will be within
> 2G of each other. The linker will spew loudly if that's not the case.
The per cpu area is at least theoretically dynamically allocated, and we
really want to put it in cpu-local memory, which means on any reasonable
NUMA machine the per cpu areas should be all over the box.
So with an arbitrary 64-bit address in %gs there is no guarantee of anything.
Grr. Except you are correct. We have to guarantee that the offsets we have
chosen at compile time still work, and we know all of the compile-time offsets
will be in the -2G range, so they are all 32-bit numbers (negative 32-bit
numbers, to be sure). That trivially leaves us with everything working except
the nasty hard-coded decimal 40.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 22:23 ` H. Peter Anvin
@ 2008-07-09 23:54 ` Eric W. Biederman
2008-07-10 16:22 ` Mike Travis
` (2 more replies)
0 siblings, 3 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 23:54 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
"H. Peter Anvin" <hpa@zytor.com> writes:
> Eric W. Biederman wrote:
>>>>
>>> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we
>>> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear.
>>
>> Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment
>> is linked at a fixed address -2G and the option only determines the virtual
>> to physical address mapping.
>>
>
> No, it's not irrelevant; we currently base the kernel at virtual address -2 GB
> (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper
> alignment for large pages.
Ugh. That is silly. We obviously need to restrict CONFIG_PHYSICAL_START to the
aligned choices, but -2G is better aligned than anything else we can do virtually.
For the 32-bit code we need to play some of those games because it doesn't have
its own magic chunk of the address space to live in.
>> That said the idea may not be too far off.
>>
>> Potentially we could put the percpu area at our fixed -2G address and then
>> we have a constant (instead of an address) we could subtract from this
> address.
>
> We can't put it at -2 GB since the offset +40 for the stack sentinel is
> hard-coded into gcc. This leaves growing upward from +48 (or another small
> positive number), or growing down from zero (or +40) as realistic options.
I was thinking everything except that access would be done as:
%gs:var - -2G, aka
%gs:var - START_KERNEL,
so that everything was a small 32-bit number that the linker and the compiler
can resolve. The trick is to put the stack canary at 40 decimal.
I was just trying to find a compile-time known location for the start of the
percpu area so we could subtract it off.
Unless the linker just winds up overflowing in the subtraction and doing hideous
things to us. Although that should be pretty easy to spot and to test for at
build time.
-2G has the interesting distinction that we might get away with just dropping the
high bits.
> Unfortunately, GNU ld handles grow-down not at all.
Another alternative that almost fares better than a segment with
a base of zero is a base of -32K or so. The only trouble is that would
get us back to manually managing the per cpu area size.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 23:36 ` Eric W. Biederman
@ 2008-07-10 0:19 ` H. Peter Anvin
2008-07-10 0:24 ` Jeremy Fitzhardinge
2008-07-10 0:23 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 0:19 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Christoph Lameter, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>>> Which means that my idea of using the technique we use on x86_32 will not
>> work.
>>
>> No, the compiler memory model we use guarantees that everything will be within
>> 2G of each other. The linker will spew loudly if that's not the case.
>
> The per cpu area is at least theoretically dynamically allocated. And we
> really want to put it in cpu local memory. Which means on any reasonable
> NUMA machine the per cpu areas should be all over the box.
>
> So with an arbitrary 64-bit address in %gs there is no guarantee of anything.
>
That doesn't matter in the slightest.
> Grr. Except you are correct. We have to guarantee that the offsets we have
> chosen at compile time still work. And we know all of the compile time offsets
> will be in the -2G range. So they are all 32bit numbers. Negative 32bit
> numbers to be sure. That trivially leaves us with everything working except
> the nasty hard coded decimal 40.
The *offsets* have to be in the proper range, but the %gs_base is an
arbitrary 64-bit number.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 23:36 ` Eric W. Biederman
2008-07-10 0:19 ` H. Peter Anvin
@ 2008-07-10 0:23 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 0:23 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton,
H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>
>>> Which means that my idea of using the technique we use on x86_32 will not
>>>
>> work.
>>
>> No, the compiler memory model we use guarantees that everything will be within
>> 2G of each other. The linker will spew loudly if that's not the case.
>>
>
> The per cpu area is at least theoretically dynamically allocated. And we
> really want to put it in cpu local memory. Which means on any reasonable
> NUMA machine the per cpu areas should be all over the box.
>
Yes, but that doesn't matter in the slightest. The effective address
will be within 2G of the base; the base can be anywhere.
> So with an arbitrary 64-bit address in %gs there is no guarantee of anything.
>
> Grr. Except you are correct. We have to guarantee that the offsets we have
> chosen at compile time still work. And we know all of the compile time offsets
> will be in the -2G range. So they are all 32bit numbers. Negative 32bit
> numbers to be sure. That trivially leaves us with everything working except
> the nasty hard coded decimal 40.
>
Right.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 0:19 ` H. Peter Anvin
@ 2008-07-10 0:24 ` Jeremy Fitzhardinge
2008-07-10 14:14 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 0:24 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Eric W. Biederman, Christoph Lameter, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
H. Peter Anvin wrote:
> Eric W. Biederman wrote:
>> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>>
>>>> Which means that my idea of using the technique we use on x86_32
>>>> will not
>>> work.
>>>
>>> No, the compiler memory model we use guarantees that everything will
>>> be within
>>> 2G of each other. The linker will spew loudly if that's not the case.
>>
>> The per cpu area is at least theoretically dynamically allocated.
>> And we
>> really want to put it in cpu local memory. Which means on any
>> reasonable
>> NUMA machine the per cpu areas should be all over the box.
>>
>> So with an arbitrary 64-bit address in %gs there is no guarantee
>> of anything.
>>
>
> That doesn't matter in the slightest.
Creepy, get out of my brain.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 0:24 ` Jeremy Fitzhardinge
@ 2008-07-10 14:14 ` Christoph Lameter
2008-07-10 14:26 ` H. Peter Anvin
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 14:14 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
With the zero-based approach you do not have a relative address anymore. We are basically creating a new absolute address space where we place variables starting at zero.
This means that we are fully independent of the placement of the percpu segment.
The loader may place the per cpu segment with the initialized variables anywhere. We just need to set GS correctly for the boot cpu. We always need to refer to the per cpu variables
via GS, or by adding the per cpu variable's offset to __per_cpu_offset[] (which is now badly named, because it points directly to the start of the percpu segment for each processor).
So there is no 2G limitation on the distance between the code and the percpu segment anymore. The 2G limitation still exists for the *size* of the per cpu segment: if we go beyond 2G in defined per cpu variables then the per cpu addresses will wrap.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 14:14 ` Christoph Lameter
@ 2008-07-10 14:26 ` H. Peter Anvin
2008-07-10 15:26 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 14:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> With the zero based approach you do not have a relative address anymore. We are basically creating a new absolute address space where we place variables starting at zero.
>
> This means that we are fully independent from the placement of the percpu segment.
>
> The loader may place the per cpu segment with the initialized variables anywhere. We just need to set GS correctly for the boot cpu. We always need to refer to the per cpu variables
> via GS or by adding the per cpu offset to the __per_cpu_offset[] (which is now badly named because it points directly to the start of the percpu segment for each processor).
>
> So there is no 2G limitation on the distance between the code and the percpu segment anymore. The 2G limitation still exists for the *size* of the per cpu segment. If we go beyond 2G in defined per cpu variables then the per cpu addresses will wrap.
Okay, this is getting somewhat annoying. Several people now have missed
the point.
No one has talked about the actual placement of the percpu segment data.
Using RIP-based references, however, is *cheaper* than using absolute
references. For RIP-based references to be valid, the *offsets*
need to be in the range [-2 GB + CONFIG_PHYSICAL_START ...
CONFIG_PHYSICAL_START). This is similar to the constraint on absolute
references, where the *offsets* have to be in the range [-2 GB, 2 GB).
None of this affects the absolute positioning of the data. The final
address is determined by:
fs_base + rip + offset
or
fs_base + offset
... respectively. fs_base is an arbitrary 64-bit number; rip (in the
kernel) is in the range [-2 GB + CONFIG_PHYSICAL_START, 0), and offset
is in the range [-2 GB, 2 GB).
(The high end of the rip range above is slightly too wide.)
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 14:26 ` H. Peter Anvin
@ 2008-07-10 15:26 ` Christoph Lameter
2008-07-10 15:42 ` H. Peter Anvin
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 15:26 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
H. Peter Anvin wrote:
> Noone has talked about the actual placement of the percpu segment data.
But the placement of the percpu segment data is a problem because of the way we
currently have the linker calculate offsets. I have had kernel configurations where I changed the placement of the percpu segment, leading to linker failures because the percpu segment was not within 2G of the code segment!
This is a particular problem if we have a large number of processors (like 4096) that each require a sizable segment of virtual address space up there for the per cpu allocator.
> None of this affects the absolute positioning of the data. The final
> address are determined by:
>
> fs_base + rip + offset
> or
> fs_base + offset
>
> ... respectively. fs_base is an arbitrary 64-bit number; rip (in the
> kernel) is in the range [-2 GB + CONFIG_PHYSICAL_START, 0), and offset
> is in the range [-2 GB, 2 GB).
Well, the zero-based approach results in this always becoming
gs_base + absolute address in the per cpu segment
Why are RIP-based references cheaper? The offset to the per cpu segment is certainly more than can fit into 16 bits.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 15:26 ` Christoph Lameter
@ 2008-07-10 15:42 ` H. Peter Anvin
2008-07-10 16:24 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 15:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
>
> Well the zero based results in this becoming always
>
> gs_base + absolute address in per cpu segment
You can do it either way. For RIP-based, you have to worry about the
possible range of the RIP register when referencing. Currently, even
for "make allyesconfig" the per cpu segment is a lot smaller than the
minimum value for CONFIG_PHYSICAL_START (2 MB), so there is no issue,
but there is a distinct lack of wiggle room, which can be resolved
either by using negative offsets or by moving the kernel text area up a
bit from -2 GB.
> Why are RIP based references cheaper? The offset to the per cpu segment is certainly more than what can be fit into 16 bits.
Where are you getting 16 bits from?!?! *There are no 16-bit offsets in
64-bit mode, period, full stop.*
RIP-based references are cheaper because the x86-64 architects chose to
optimize RIP-based references over absolute references. Therefore
RIP-based references are encodable with only a MODR/M byte, whereas
absolute references require a SIB byte as well -- longer instruction,
possibly a less optimized path through the CPU, and *definitely*
something that gets exercised less in the linker.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 23:54 ` Eric W. Biederman
@ 2008-07-10 16:22 ` Mike Travis
2008-07-10 16:25 ` H. Peter Anvin
` (2 more replies)
2008-07-10 17:57 ` H. Peter Anvin
2008-07-10 18:08 ` H. Peter Anvin
2 siblings, 3 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 16:22 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner,
linux-kernel, Rusty Russell
Eric W. Biederman wrote:
...
> Another alternative that almost fares better then a segment with
> a base of zero is a base of -32K or so. Only trouble that would get us
> manually managing the per cpu area size again.
One thing to remember is that the eventual goal is implementing the cpu_alloc
functions, which I think we've agreed have to be "growable". This means that
the addresses will need to be virtual, to allow the same offsets for all cpus.
The patchset I have uses 2 MB pages. This "little" twist might figure into the
implementation issues that are being discussed.
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 15:42 ` H. Peter Anvin
@ 2008-07-10 16:24 ` Christoph Lameter
2008-07-10 16:33 ` H. Peter Anvin
2008-07-10 17:26 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 16:24 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
H. Peter Anvin wrote:
> but there is a distinct lack of wiggle room, which can be resolved
> either by using negative offsets, or by moving the kernel text area up a
> bit from -2 GB.
Let's say we reserve 256 MB of cpu alloc space per processor.
On a system with 4k processors this will result in the need for 1 TB of virtual address space for per cpu areas (note that there may be more processors in the future). Preferably we would calculate the address of the per cpu area by
PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()
instead of looking it up in a table, because that will save a memory access on per_cpu().
The first percpu area would ideally be the per cpu segment generated by the linker.
How would that fit into the address map? In particular the 2G distance between code and the first per cpu area must not be violated unless we go to a zero based approach.
Maybe there is another way of arranging things that would allow for this?
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:22 ` Mike Travis
@ 2008-07-10 16:25 ` H. Peter Anvin
2008-07-10 16:35 ` Christoph Lameter
2008-07-10 17:20 ` Mike Travis
2008-07-10 17:07 ` Jeremy Fitzhardinge
2008-07-10 18:48 ` Eric W. Biederman
2 siblings, 2 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 16:25 UTC (permalink / raw)
To: Mike Travis
Cc: Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner,
linux-kernel, Rusty Russell
Mike Travis wrote:
> Eric W. Biederman wrote:
> ...
>> Another alternative that almost fares better then a segment with
>> a base of zero is a base of -32K or so. Only trouble that would get us
>> manually managing the per cpu area size again.
>
> One thing to remember is the eventual goal is implementing the cpu_alloc
> functions which I think we've agreed has to be "growable". This means that
> the addresses will need to be virtual to allow the same offsets for all cpus.
> The patchset I have uses 2Mb pages. This "little" twist might figure into the
> implementation issues that are being discussed.
>
No, since the *addresses* can be arbitrary. The current issue is about
*offsets.*
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:24 ` Christoph Lameter
@ 2008-07-10 16:33 ` H. Peter Anvin
2008-07-10 16:45 ` Christoph Lameter
2008-07-10 17:26 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 16:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> but there is a distinct lack of wiggle room, which can be resolved
>> either by using negative offsets, or by moving the kernel text area up a
>> bit from -2 GB.
>
> Lets say we reserve 256MB of cpu alloc space per processor.
>
> On a system with 4k processors this will result in the need for 1TB virtual address space for per cpu areas (note that there may be more processors in the future). Preferably we would calculate the address of the per cpu area by
>
> PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()
>
> instead of looking it up in a table because that will save a memory access on per_cpu().
It will, but it might still be a net loss due to higher load on the TLB
(you're effectively using the TLB to do the table lookup for you.) On
the other hand, Mike points out that once we move away from fixed-sized
segments we pretty much have to use virtual addresses anyway(*).
> The first percpu area would ideally be the per cpu segment generated by the linker.
>
> How would that fit into the address map? In particular the 2G distance between code and the first per cpu area must not be violated unless we go to a zero based approach.
If with "zero-based" you mean "nonzero gs_base for the boot CPU" then
yes, you're right.
Note again that that is completely orthogonal to RIP-based versus absolute.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:25 ` H. Peter Anvin
@ 2008-07-10 16:35 ` Christoph Lameter
2008-07-10 16:39 ` H. Peter Anvin
2008-07-10 17:20 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 16:35 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mike Travis, Eric W. Biederman, Jeremy Fitzhardinge,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
H. Peter Anvin wrote:
> No, since the *addresses* can be arbitrary. The current issue is about
> *offsets.*
Well those are intimately connected.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:35 ` Christoph Lameter
@ 2008-07-10 16:39 ` H. Peter Anvin
2008-07-10 16:47 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 16:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Eric W. Biederman, Jeremy Fitzhardinge,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> No, since the *addresses* can be arbitrary. The current issue is about
>> *offsets.*
>
> Well those are intimately connected.
Not really, since gs_base is an arbitrary 64-bit pointer.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:33 ` H. Peter Anvin
@ 2008-07-10 16:45 ` Christoph Lameter
2008-07-10 17:33 ` Jeremy Fitzhardinge
2008-07-10 17:53 ` H. Peter Anvin
0 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 16:45 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
H. Peter Anvin wrote:
> It will, but it might still be a net loss due to higher load on the TLB
> (you're effectively using the TLB to do the table lookup for you.) On
> the other hand, Mike points out that once we move away from fixed-sized
> segments we pretty much have to use virtual addresses anyway(*).
There will be no additional overhead since the memory is already mapped 1:1 using 2 MB TLB entries and we want to use the same mapping for the percpu areas. This is similar to the vmemmap solution.
>> The first percpu area would ideally be the per cpu segment generated
>> by the linker.
>>
>> How would that fit into the address map? In particular the 2G distance
>> between code and the first per cpu area must not be violated unless we
>> go to a zero based approach.
>
> If with "zero-based" you mean "nonzero gs_base for the boot CPU" then
> yes, you're right.
>
> Note again that that is completely orthogonal to RIP-based versus absolute.
?? The distance to the per cpu area for cpu 0 is larger than 2G. The kernel won't link with RIP-based addresses. You would have to place the per cpu areas 1 TB before the kernel text.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:39 ` H. Peter Anvin
@ 2008-07-10 16:47 ` Christoph Lameter
2008-07-10 17:21 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 16:47 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mike Travis, Eric W. Biederman, Jeremy Fitzhardinge,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
H. Peter Anvin wrote:
> Christoph Lameter wrote:
>> H. Peter Anvin wrote:
>>
>>> No, since the *addresses* can be arbitrary. The current issue is about
>>> *offsets.*
>>
>> Well those are intimately connected.
>
> Not really, since gs_base is an arbitrary 64-bit pointer.
The current scheme ties the offsets to kernel code addresses.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:22 ` Mike Travis
2008-07-10 16:25 ` H. Peter Anvin
@ 2008-07-10 17:07 ` Jeremy Fitzhardinge
2008-07-10 17:12 ` Christoph Lameter
2008-07-10 17:41 ` Mike Travis
2008-07-10 18:48 ` Eric W. Biederman
2 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 17:07 UTC (permalink / raw)
To: Mike Travis
Cc: Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Mike Travis wrote:
> One thing to remember is the eventual goal is implementing the cpu_alloc
> functions which I think we've agreed has to be "growable". This means that
> the addresses will need to be virtual to allow the same offsets for all cpus.
> The patchset I have uses 2Mb pages. This "little" twist might figure into the
> implementation issues that are being discussed.
You want to virtually map the percpu area? How and when would it get
extended?
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:07 ` Jeremy Fitzhardinge
@ 2008-07-10 17:12 ` Christoph Lameter
2008-07-10 17:25 ` Jeremy Fitzhardinge
2008-07-10 17:41 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 17:12 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Eric W. Biederman, H. Peter Anvin, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge wrote:
> You want to virtually map the percpu area? How and when would it get
> extended?
It would get extended when cpu_alloc() is called and the allocator finds that there is no per cpu memory available.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:25 ` H. Peter Anvin
2008-07-10 16:35 ` Christoph Lameter
@ 2008-07-10 17:20 ` Mike Travis
1 sibling, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 17:20 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner,
linux-kernel, Rusty Russell
H. Peter Anvin wrote:
> Mike Travis wrote:
>> Eric W. Biederman wrote:
>> ...
>>> Another alternative that almost fares better then a segment with
>>> a base of zero is a base of -32K or so. Only trouble that would get us
>>> manually managing the per cpu area size again.
>>
>> One thing to remember is the eventual goal is implementing the cpu_alloc
>> functions which I think we've agreed has to be "growable". This means
>> that
>> the addresses will need to be virtual to allow the same offsets for
>> all cpus.
>> The patchset I have uses 2Mb pages. This "little" twist might figure
>> into the
>> implementation issues that are being discussed.
>>
>
> No, since the *addresses* can be arbitrary. The current issue is about
> *offsets.*
>
> -hpa
Ok, thanks for clearing that up. I just didn't want us to drop the ball
trying to make that double play... ;-)
Cheers,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:47 ` Christoph Lameter
@ 2008-07-10 17:21 ` Jeremy Fitzhardinge
2008-07-10 17:31 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 17:21 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Mike Travis, Eric W. Biederman, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Rusty Russell
Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> Christoph Lameter wrote:
>>
>>> H. Peter Anvin wrote:
>>>
>>>
>>>> No, since the *addresses* can be arbitrary. The current issue is about
>>>> *offsets.*
>>>>
>>> Well those are intimately connected.
>>>
>> Not really, since gs_base is an arbitrary 64-bit pointer.
>>
>
> The current scheme ties the offsets to kernel code addresses.
>
This is getting very frustrating. We've been going around and around on
this point, what, 5 or 6 times at least.
The base address of the percpu area and the offsets from that base are
completely independent values.
The offset is limited to 2G. The 2G limit applies regardless of how you
compute your effective address. It doesn't matter if it's absolute. It
doesn't matter if it's rip-relative. It doesn't matter if it's
zero-based. Small absolute addresses generate exactly the same form as
large absolute addresses. There is no 8-bit or 16-bit address mode.
The base is arbitrary. It can be any canonical address at all. It has
no effect on how you compute your offset.
The addressing modes:
* ABS
* off(%rip)
are exactly equivalent in what offsets they can generate, so long as *at
link time* the percpu *symbols* are within 2G of the code addressing
them. *After* the addressing mode has generated an effective address
(by whatever means it likes), the %gs: override applies the segment
base, which can therefore offset the effective address to anywhere at all.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:12 ` Christoph Lameter
@ 2008-07-10 17:25 ` Jeremy Fitzhardinge
2008-07-10 17:34 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 17:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Eric W. Biederman, H. Peter Anvin, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Rusty Russell
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>
>> You want to virtually map the percpu area? How and when would it get
>> extended?
>>
>
> It would get extended when cpu_alloc() is called and the allocator finds that there is no per cpu memory available.
>
Which, I take it, allocates percpu memory. Presumably it would have the
same caveats as vmalloc memory with respect to accessing it during fault
handlers and NMI handlers.
How would cpu_alloc() actually get used? It doesn't make much sense for
general code, since we don't have the notion of a percpu pointer to
memory (vs a pointer to percpu memory). Is the intended use for
allocating percpu memory in modules? What other uses?
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:24 ` Christoph Lameter
2008-07-10 16:33 ` H. Peter Anvin
@ 2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
` (2 more replies)
1 sibling, 3 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 17:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter <cl@linux-foundation.org> writes:
> H. Peter Anvin wrote:
>
>> but there is a distinct lack of wiggle room, which can be resolved
>> either by using negative offsets, or by moving the kernel text area up a
>> bit from -2 GB.
>
> Lets say we reserve 256MB of cpu alloc space per processor.
First off, right now reserving more than about 64KB is ridiculous. We rightly
don't have that many per cpu variables.
> On a system with 4k processors this will result in the need for 1TB virtual
> address space for per cpu areas (note that there may be more processors in the
> future). Preferably we would calculate the address of the per cpu area by
>
> PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()
>
> instead of looking it up in a table because that will save a memory access on
> per_cpu().
???? Optimizing per_cpu seems to be the wrong path. If you want to go fast, you
access the data on the cpu you start out on.
> The first percpu area would ideally be the per cpu segment generated by the
> linker.
>
> How would that fit into the address map? In particular the 2G distance between
> code and the first per cpu area must not be violated unless we go to a zero
> based approach.
>
> Maybe there is another way of arranging things that would allow for this?
Yes. Start with a patch that doesn't have freaky failures that can't be understood
or bisected because the patch is too big. The only reason we are having a conversation
about alternative implementations is that the current implementation has weird,
random, incomprehensible failures. The most likely culprit is playing with
the linker, but it could be something else.
So please REFACTOR the patch that changes things to DO ONE THING PER PATCH.
Eric
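For illustration, the table-free address calculation Christoph proposes above amounts to the following sketch; the constants and names here are hypothetical stand-ins, not values from the patchset:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout constants for illustration only. */
#define PERCPU_START_ADDRESS 0xffffff0000000000ULL
#define PERCPU_SIZE          (256ULL << 20)   /* 256MB per processor */

/* Table-free scheme: the per cpu base is computed from the cpu number,
 * saving the memory access that a lookup table would require. */
static uint64_t percpu_base(unsigned int cpu)
{
    return PERCPU_START_ADDRESS + PERCPU_SIZE * (uint64_t)cpu;
}
```

With 4096 processors at 256MB each, this reserves 2^12 * 2^28 = 2^40 bytes, i.e. the 1TB of virtual address space mentioned in the quoted message.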
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:21 ` Jeremy Fitzhardinge
@ 2008-07-10 17:31 ` Christoph Lameter
2008-07-10 17:48 ` Jeremy Fitzhardinge
2008-07-10 18:00 ` H. Peter Anvin
0 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 17:31 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Mike Travis, Eric W. Biederman, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge wrote:
>
> The base address of the percpu area and the offsets from that base are
> completely independent values.
Definitely.
> The addressing modes:
>
> * ABS
> * off(%rip)
>
> Are exactly equivalent in what offsets they can generate, so long as *at
> link time* the percpu *symbols* are within 2G of the code addressing
> them. *After* the addressing mode has generated an effective address
> (by whatever means it likes), the %gs: override applies the segment
> base, which can therefore offset the effective address to anywhere at all.
Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu, and later we set up additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas.
However, that does not work if one calculates the virtual address instead of looking up a physical address.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:45 ` Christoph Lameter
@ 2008-07-10 17:33 ` Jeremy Fitzhardinge
2008-07-10 17:42 ` Christoph Lameter
2008-07-10 17:53 ` H. Peter Anvin
1 sibling, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 17:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>
>> It will, but it might still be a net loss due to higher load on the TLB
>> (you're effectively using the TLB to do the table lookup for you.) On
>> the other hand, Mike points out that once we move away from fixed-sized
>> segments we pretty much have to use virtual addresses anyway(*).
>>
>
> There will be no additional overhead since the memory is already mapped 1:1 using 2MB TLBs and we want to use the same for the percpu areas. This is similar to the vmemmap solution.
>
>
>>> The first percpu area would ideally be the per cpu segment generated
>>> by the linker.
>>>
>>> How would that fit into the address map? In particular the 2G distance
>>> between code and the first per cpu area must not be violated unless we
>>> go to a zero based approach.
>>>
>> If with "zero-based" you mean "nonzero gs_base for the boot CPU" then
>> yes, you're right.
>>
>> Note again that that is completely orthogonal to RIP-based versus absolute.
>>
>
> ?? The distance to the per cpu area for cpu 0 is larger than 2G. The kernel won't link with RIP-based addresses. You would have to place the per cpu areas 1TB before the kernel text.
If %gs:0 points to start of your percpu area, then all the offsets off
%gs are going to be no larger than the amount of percpu memory you
have. The gs base itself can be any 64-bit address, so it doesn't
matter where it is within overall kernel memory. Using zero-based
percpu area means that you must set a non-zero %gs base before you can
access the percpu area.
If the layout of the percpu area is done by the linker by packing all
the percpu variables into one section, then any address computation
using a percpu variable symbol will generate an offset which is
appropriate to apply to a %gs: addressing mode.
The nice thing about the non-zero-based scheme i386 uses is that setting
gs-base to zero means that percpu variable accesses go directly to the
prototype percpu data area, which simplifies boot time setup (which is
doubly awkward on 32-bit because you need to generate a GDT entry rather
than just load an MSR as you do in 64-bit).
J
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:25 ` Jeremy Fitzhardinge
@ 2008-07-10 17:34 ` Christoph Lameter
0 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 17:34 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, Eric W. Biederman, H. Peter Anvin, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge wrote:
> Christoph Lameter wrote:
>> Jeremy Fitzhardinge wrote:
>>
>>
>>> You want to virtually map the percpu area? How and when would it get
>>> extended?
>>>
>>
>> It would get extended when cpu_alloc() is called and the allocator
>> finds that there is no per cpu memory available.
>>
>
> Which, I take it, allocates percpu memory. It would have the same
> caveats as vmalloc memory with respect to accessing it during fault
> handlers and NMI handlers.
Right. One would not want to allocate per cpu memory in those contexts. The current allocpercpu() functions already have those restrictions.
> How would cpu_alloc() actually get used? It doesn't make much sense for
> general code, since we don't have the notion of a percpu pointer to
> memory (vs a pointer to percpu memory). Is the intended use for
> allocating percpu memory in modules? What other uses?
Argh. Do I have to re-explain all of this again? Please look at the latest cpu_alloc patchset and the related discussions.
The cpu_alloc patchset introduces the concept of a pointer to percpu memory. Or you could call it an offset into the percpu segment that can be treated (in a restricted way) like a pointer...
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:26 ` Eric W. Biederman
@ 2008-07-10 17:38 ` Christoph Lameter
2008-07-10 19:11 ` Mike Travis
2008-07-10 19:12 ` Eric W. Biederman
2008-07-10 17:46 ` Mike Travis
2008-07-10 17:51 ` H. Peter Anvin
2 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 17:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
> First off right now reserving more than about 64KB is ridiculous. We rightly
> don't have that many per cpu variables.
We do. The case has been made numerous times that we need at least several megabytes of per cpu memory in case someone creates gazillions of IP tunnels, etc.
>> instead of looking it up in a table because that will save a memory access on
>> per_cpu().
>
> ???? Optimizing per_cpu seems to be the wrong path. If you want to go fast you
> access the data on the cpu you start out on.
Yes, most arches provide specialized registers for local per cpu variable access. There are cases, though, in which you have to access another processor's cpu space.
>> Maybe there is another way of arranging things that would allow for this?
>
> Yes. Start with a patch that doesn't have freaky failures that can't be understood
> or bisected because the patch is too big. The only reason we are having a conversation
The patches are reasonably small. The problem that Mike seems to have is early boot debugging.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:07 ` Jeremy Fitzhardinge
2008-07-10 17:12 ` Christoph Lameter
@ 2008-07-10 17:41 ` Mike Travis
2008-07-10 18:01 ` H. Peter Anvin
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-10 17:41 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> One thing to remember is the eventual goal is implementing the cpu_alloc
>> functions which I think we've agreed has to be "growable". This means
>> that
>> the addresses will need to be virtual to allow the same offsets for
>> all cpus.
>> The patchset I have uses 2Mb pages. This "little" twist might figure
>> into the
>> implementation issues that are being discussed.
>
> You want to virtually map the percpu area? How and when would it get
> extended?
>
> J
CPU_ALLOC(), or some such means. This is to replace the percpu allocator
in modules.c:
Subject: cpu alloc: The allocator
The per cpu allocator allows dynamic allocation of memory on all
processors simultaneously. A bitmap is used to track used areas.
The allocator implements tight packing to reduce the cache footprint
and increase speed since cacheline contention is typically not a concern
for memory mainly used by a single cpu. Small objects will fill up gaps
left by larger allocations that required alignments.
The size of the cpu_alloc area can be changed via make menuconfig.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
and:
Subject: cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
Remove the builtin per cpu allocator from modules.c and use cpu_alloc instead.
The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is
determined by CONFIG_CPU_AREA_SIZE. PERCPU_ENOUGH_ROOMs default was 8k.
CONFIG_CPU_AREA_SIZE defaults to 32k. Thus we have more space to load modules.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
The discussion that followed was very emphatic that the size of the space should
not be fixed, but instead be dynamically growable. Since the offset needs to be
fixed for each cpu, virtual addressing is (I think) the only way to go. The use of a
2MB page just conserves map entries. (Of course, if we just reserved 2MB in the
first place it might not need to be virtual...? But the concern was for systems
with hundreds of (say) network interfaces using even more than 2MB.)
Thanks,
Mike
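The bitmap-tracked, tightly packed allocation scheme that the quoted patch description refers to can be sketched roughly as follows. This is a toy model with made-up sizes, not the actual cpu_alloc code; `UNITS`, `cpu_alloc_units`, and `cpu_free_units` are invented names for illustration:

```c
#include <string.h>

/* Toy first-fit allocator over a fixed-size per cpu area, with one
 * bitmap entry per allocation unit. Tight packing means a small
 * object can reuse a gap left by a freed larger allocation. */
#define UNITS 64
static unsigned char used[UNITS];   /* 1 = unit in use */

static int cpu_alloc_units(int n)
{
    for (int start = 0; start + n <= UNITS; start++) {
        int free_run = 1;
        for (int i = 0; i < n; i++)
            if (used[start + i]) { free_run = 0; break; }
        if (free_run) {
            memset(used + start, 1, n);
            return start;            /* offset into the per cpu area */
        }
    }
    return -1;   /* exhausted: the real allocator would need to grow */
}

static void cpu_free_units(int start, int n)
{
    memset(used + start, 0, n);
}
```

The returned offset is the same on every cpu, which is exactly why the discussion above turns on keeping per cpu areas at a fixed stride (or fixed virtual layout) from one another.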
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:33 ` Jeremy Fitzhardinge
@ 2008-07-10 17:42 ` Christoph Lameter
2008-07-10 17:53 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 17:42 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
> If %gs:0 points to start of your percpu area, then all the offsets off
> %gs are going to be no larger than the amount of percpu memory you
> have. The gs base itself can be any 64-bit address, so it doesn't
> matter where it is within overall kernel memory. Using zero-based
> percpu area means that you must set a non-zero %gs base before you can
> access the percpu area.
Correct.
> If the layout of the percpu area is done by the linker by packing all
> the percpu variables into one section, then any address computation
> using a percpu variable symbol will generate an offset which is
> appropriate to apply to a %gs: addressing mode.
Of course.
> The nice thing about the non-zero-based scheme i386 uses is that setting
> gs-base to zero means that percpu variables accesses get directly to the
> prototype percpu data area, which simplifies boot time setup (which is
> doubly awkward on 32-bit because you need to generate a GDT entry rather
> than just load an MSR as you do in 64-bit).
Great, but it causes trouble in other ways, as discussed. It's best to consistently access
per cpu variables using the segment registers.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
@ 2008-07-10 17:46 ` Mike Travis
2008-07-10 17:51 ` H. Peter Anvin
2 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 17:46 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, H. Peter Anvin, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Eric W. Biederman wrote:
...
>
> So please REFACTOR the patch that changes things to DO ONE THING PER PATCH.
>
> Eric
Working feverishly on this exact thing... ;-)
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:31 ` Christoph Lameter
@ 2008-07-10 17:48 ` Jeremy Fitzhardinge
2008-07-10 18:00 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 17:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Mike Travis, Eric W. Biederman, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Rusty Russell
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> The base address of the percpu area and the offsets from that base are
>> completely independent values.
>>
>
> Definitely.
>
>
>
>> The addressing modes:
>>
>> * ABS
>> * off(%rip)
>>
>> Are exactly equivalent in what offsets they can generate, so long as *at
>> link time* the percpu *symbols* are within 2G of the code addressing
>> them. *After* the addressing mode has generated an effective address
>> (by whatever means it likes), the %gs: override applies the segment
>> base, which can therefore offset the effective address to anywhere at all.
>>
>
> Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu, and later we set up additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas.
>
Yes, but the offset is the same either way. When you want a cpu to
refer to its own percpu memory, regardless of where it is in memory, you
just reload the gs base. The offsets are the same everywhere, and are
computed by the linker without knowledge or reference to where the
final address will end up.
In other words, at source level:
a = x86_read_percpu(foo)
will generate
mov %gs:percpu__foo, %rax
where the linker decides the value of percpu__foo, which can be up to
4G. Or if we use rip-relative:
mov %gs:percpu__foo(%rip), %rax
we end up with the same result, except that the generated instruction is
a bit more compact.
In the final generated assembly, it ends up being a hardcoded constant
address. Say, 0x7838.
Now if we allocate cpu 43 percpu data at 0xfffffffff7198000, we load %gs
base with that value, and then the instruction is still
mov %gs:0x7838, %rax
and the computed address will be 0xfffffffff7198000 + 0x7838 =
0xfffffffff719f838.
And cpu 62 has its percpu data at 0xffffffffe3819000, and the
instruction is still
mov %gs:0x7838, %rax
and the computed address for its version of percpu__foo is
0xffffffffe3819000 + 0x7838 = 0xffffffffe3820838.
Note that it doesn't matter how you decide to place the percpu data, so
long as you can load the address into the %gs base.
> However, that does not work if one calculates the virtual address instead of looking up a physical address.
>
Calculate a virtual address for what? Physical address for what? If
you have a large virtual region allocating 256M of percpu space, er, per
cpu, then you just load %gs base with percpu_region_base + cpuid *
256M. It has no effect on the instructions accessing that percpu space.
J
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
2008-07-10 17:46 ` Mike Travis
@ 2008-07-10 17:51 ` H. Peter Anvin
2008-07-10 19:09 ` Eric W. Biederman
2 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 17:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Eric W. Biederman wrote:
> Christoph Lameter <cl@linux-foundation.org> writes:
>
>> H. Peter Anvin wrote:
>>
>>> but there is a distinct lack of wiggle room, which can be resolved
>>> either by using negative offsets, or by moving the kernel text area up a
>>> bit from -2 GB.
>> Lets say we reserve 256MB of cpu alloc space per processor.
>
> First off right now reserving more than about 64KB is ridiculous. We rightly
> don't have that many per cpu variables.
Almost half a megabyte in current allyesconfig, and that is not
including dynamic allocations at all.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:42 ` Christoph Lameter
@ 2008-07-10 17:53 ` Jeremy Fitzhardinge
2008-07-10 17:55 ` H. Peter Anvin
2008-07-10 20:52 ` Christoph Lameter
0 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
>> The nice thing about the non-zero-based scheme i386 uses is that setting
>> gs-base to zero means that percpu variables accesses get directly to the
>> prototype percpu data area, which simplifies boot time setup (which is
>> doubly awkward on 32-bit because you need to generate a GDT entry rather
>> than just load an MSR as you do in 64-bit).
>>
>
> Great but it causes trouble in other ways as discussed.
What other trouble? It works fine.
> It's best to consistently access
> per cpu variables using the segment registers.
>
It is, but initially the segment base is 0, so just using a percpu
variable does something sensible from the start with no special setup.
J
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:45 ` Christoph Lameter
2008-07-10 17:33 ` Jeremy Fitzhardinge
@ 2008-07-10 17:53 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
>
> There will be no additional overhead since the memory is already mapped 1:1 using 2MB TLBs and we want to use the same for the percpu areas. This is similar to the vmemmap solution.
>
THAT sounds strange. If you're using dedicated virtual maps (which is
what you're responding to here) then you will *always* have additional
TLB pressure. Furthermore, if you use 2 MB pages, you:
a) can only allocate full 2 MB pages, which is expensive for the static
users and difficult for the dynamic users;
b) increase pressure in the relatively small 2 MB TLB.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:53 ` Jeremy Fitzhardinge
@ 2008-07-10 17:55 ` H. Peter Anvin
2008-07-10 20:52 ` Christoph Lameter
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 17:55 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
>
>> It's best to consistently access
>> per cpu variables using the segment registers.
>
> It is, but initially the segment base is 0, so just using a percpu
> variable does something sensible from the start with no special setup.
>
That's easy enough to fix -- synthesizing a GDT entry is trivial enough
-- but since it currently works, and since the offsets on 32 bits can
reach the full address space anyway, there is no reason to change.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 23:54 ` Eric W. Biederman
2008-07-10 16:22 ` Mike Travis
@ 2008-07-10 17:57 ` H. Peter Anvin
2008-07-10 18:08 ` H. Peter Anvin
2 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 17:57 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
>>>
>> No, it's not irrelevant; we currently base the kernel at virtual address -2 GB
>> (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper
>> alignment for large pages.
>
> Ugh. That is silly. We need to restrict CONFIG_PHYSICAL_START to the aligned
> choices, obviously. But -2G is better aligned than anything else we can do virtually.
>
You may think it's silly, but it's actually an advantage in this case.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:31 ` Christoph Lameter
2008-07-10 17:48 ` Jeremy Fitzhardinge
@ 2008-07-10 18:00 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 18:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Mike Travis, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Christoph Lameter wrote:
>
> Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu, and later we set up additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas.
>
> However, that does not work if one calculates the virtual address instead of looking up a physical address.
>
As far as the linker is concerned, there are two address spaces: VMA,
which is the offset, and LMA, which is the physical address at which to
load. The linker doesn't give a flying hoot about the virtual address,
since it's completely irrelevant as far as it is concerned; it's nothing
but a kernel-internal abstraction.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:41 ` Mike Travis
@ 2008-07-10 18:01 ` H. Peter Anvin
2008-07-10 20:51 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 18:01 UTC (permalink / raw)
To: Mike Travis
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner,
linux-kernel, Rusty Russell
Mike Travis wrote:
>
> The discussion that followed was very emphatic that the size of the space should
> not be fixed, but instead be dynamically growable. Since the offset needs to be
> fixed for each cpu, then virtual (I think) is the only way to go. The use of a
> 2MB page just conserves map entries. (Of course, if we just reserved 2MB in the
> first place it might not need to be virtual...? But the concern was for systems
> with hundreds of (say) network interfaces using even more than 2MB.)
>
I'm much more concerned about wasting an average of 1 MB of memory per CPU.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 23:54 ` Eric W. Biederman
2008-07-10 16:22 ` Mike Travis
2008-07-10 17:57 ` H. Peter Anvin
@ 2008-07-10 18:08 ` H. Peter Anvin
2 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 18:08 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel
Eric W. Biederman wrote:
>
> Another alternative that almost fares better than a segment with
> a base of zero is a base of -32K or so. The only trouble is that it would get us
> manually managing the per cpu area size again.
>
Yes, an extra link pass would be better than that. I have tried, and
none of the clever things I tried actually works, since GNU ld has
pretty much no way to get it to reveal its information ahead of time.
However, I want to explore the details of the supposed toolchain issue;
it might be just a simple tweak to the way things are done now to fix it.
Clearly, given the stack protector ABI, we want %gs:40 to be usable and
free.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:22 ` Mike Travis
2008-07-10 16:25 ` H. Peter Anvin
2008-07-10 17:07 ` Jeremy Fitzhardinge
@ 2008-07-10 18:48 ` Eric W. Biederman
2008-07-10 18:54 ` Jeremy Fitzhardinge
2 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 18:48 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Arjan van de Ven,
Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner,
linux-kernel, Rusty Russell
Mike Travis <travis@sgi.com> writes:
> Eric W. Biederman wrote:
> ...
>> Another alternative that almost fares better than a segment with
>> a base of zero is a base of -32K or so. The only trouble is that it would get us
>> manually managing the per cpu area size again.
>
> One thing to remember is the eventual goal is implementing the cpu_alloc
> functions which I think we've agreed has to be "growable". This means that
> the addresses will need to be virtual to allow the same offsets for all cpus.
> The patchset I have uses 2Mb pages. This "little" twist might figure into the
> implementation issues that are being discussed.
I had not heard that.
However, if you are going to use 2MB pages, you might as well just use a
physical address at the start of a node. 2MB is so much larger than
the size of the per cpu memory we need today that it isn't even funny.
To get 32K I had to round up on my current system, and honestly it is
important that per cpu data stay relatively small as otherwise the system
won't have memory to use for anything interesting.
I just took a quick look at our alloc_percpu calls. At first glance
they all appear to be for relatively small data structures. So we can
just about get away with doing what we do today for modules for everything.
The question is what to do when we fill up our preallocated size for percpu
data.
I think we can get away with just simply realloc'ing the percpu area
on each cpu. No fancy table manipulations required. Just update
the base pointer in %gs and in someplace global.
Using virtual addresses really requires 4K pages, so we
can benefit from non-contiguous allocations. I just can't imagine
the per cpu area getting up to 2MB in size, where you would need
multiple 2MB pages. That is a huge jump from the 32KB I see today.
For the rest, I have mostly been making a list of things we can do
that could work. A zero-based percpu area is great if you can
eliminate it from suspicion in your weird random failures.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 18:48 ` Eric W. Biederman
@ 2008-07-10 18:54 ` Jeremy Fitzhardinge
2008-07-10 19:18 ` Eric W. Biederman
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 18:54 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Eric W. Biederman wrote:
> I think we can get away with just simply realloc'ing the percpu area
> on each cpu. No fancy table manipulations required. Just update
> the base pointer in %gs and in someplace global.
>
It's perfectly legitimate to take the address of a percpu variable and
store it somewhere. We can't move them around.
J
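The constraint Jeremy states here can be made concrete with a small, purely illustrative simulation (the names below are made up; `percpu_addr` stands in for taking `&per_cpu(var, cpu)`): once code has saved the absolute address of a per-cpu object, that address must remain valid, so the area cannot be realloc'd and moved.

```c
/* Two fixed "per-cpu" copies of a counter, one per simulated cpu.
 * Hypothetical userspace model, not kernel code. */
static long counter[2];

/* Like &per_cpu(counter, cpu): yields an absolute address that a
 * caller may legitimately store anywhere for later use. */
static long *percpu_addr(int cpu)
{
    return &counter[cpu];
}
```

A saved pointer keeps working only because `counter` never moves; relocating the per-cpu area would leave every such stored address dangling.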
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:51 ` H. Peter Anvin
@ 2008-07-10 19:09 ` Eric W. Biederman
2008-07-10 19:18 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 19:09 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
"H. Peter Anvin" <hpa@zytor.com> writes:
> Almost half a megabyte in current allyesconfig, and that is not including
> dynamic allocations at all.
Ouch! This starts to make me sad that I removed the arbitrary cap on static
percpu memory.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:38 ` Christoph Lameter
@ 2008-07-10 19:11 ` Mike Travis
2008-07-10 19:12 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 19:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, H. Peter Anvin, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Christoph Lameter wrote:
...
> The patches are reasonably small. The problem that Mike seems to have is early boot debugging.
Note that the early boot debugging problems go away with gcc-4.2.4. My
problem now is (what I, perhaps incorrectly, believe to be) a stack overflow
with NR_CPUS=4096 and a specific random config file.
Btw, I've completed the first half of splitting the zero_based_fold into
zero_based_only and fold_pda_into_percpu and am testing that now.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:38 ` Christoph Lameter
2008-07-10 19:11 ` Mike Travis
@ 2008-07-10 19:12 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 19:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter <cl@linux-foundation.org> writes:
> The patches are reasonably small. The problem that Mike seems to have is early
> boot debugging.
I didn't say small. I said do one thing at a time.
Mike is addressing this.
Fundamentally the problem is that we are seeing weird failures and there is not
enough granularity in the patches to test which part of the patch the failure
comes from.
The fact that the failures happen early in boot just makes them harder to debug.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 18:54 ` Jeremy Fitzhardinge
@ 2008-07-10 19:18 ` Eric W. Biederman
2008-07-10 19:56 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 19:18 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> Eric W. Biederman wrote:
>> I think we can get away with just simply realloc'ing the percpu area
>> on each cpu. No fancy table manipulations required. Just update
>> the base pointer in %gs and in someplace global.
>>
>
> It's perfectly legitimate to take the address of a percpu variable and store it
> somewhere. We can't move them around.
Really? I guess there are cases where that makes sense. It is a pretty
rare case though, especially when you are not talking about doing it temporarily
with preemption disabled. There are few enough users of the API that I think we can
certainly explore the cost of forbidding, in the general case, storing the
address of a percpu variable.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:09 ` Eric W. Biederman
@ 2008-07-10 19:18 ` Mike Travis
2008-07-10 19:32 ` H. Peter Anvin
2008-07-10 20:17 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 19:18 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Eric W. Biederman wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>> Almost half a megabyte in current allyesconfig, and that is not including
>> dynamic allocations at all.
>
> Ouch! This start to make me sad I removed the arbitrary cap on static
> percpu memory.
>
> Eric
The biggest growth came from moving all the xxx[NR_CPUS] arrays into
the per cpu area. So you free up a huge amount of unused memory when
the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
real soon now, ??? future?
Mike
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:18 ` Mike Travis
@ 2008-07-10 19:32 ` H. Peter Anvin
2008-07-10 23:37 ` Mike Travis
2008-07-10 20:17 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 19:32 UTC (permalink / raw)
To: Mike Travis
Cc: Eric W. Biederman, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis wrote:
>
> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> the per cpu area. So you free up a huge amount of unused memory when
> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
> real soon now, ??? future?
>
Even (or perhaps especially) so, allocating the percpu area in 2 MB
increments is a total nonstarter. It hurts the small, common
configurations way too much. For SGI, it's probably fine.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:18 ` Eric W. Biederman
@ 2008-07-10 19:56 ` Jeremy Fitzhardinge
2008-07-10 20:22 ` Eric W. Biederman
2008-07-10 20:25 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 19:56 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>
>> Eric W. Biederman wrote:
>>
>>> I think we can get away with just simply realloc'ing the percpu area
>>> on each cpu. No fancy table manipulations required. Just update
>>> the base pointer in %gs and in someplace global.
>>>
>>>
>> It's perfectly legitimate to take the address of a percpu variable and store it
>> somewhere. We can't move them around.
>>
>
> Really? I guess there are cases where that makes sense. It is a pretty
> rare case though, especially when you are not talking about doing it temporarily
> with preemption disabled. There are few enough users of the API that I think we can
> certainly explore the cost of forbidding, in the general case, storing the
> address of a percpu variable.
>
No, that sounds like a bad idea. For one, how would you enforce it?
How would you check for it? It's one of those things that would mostly
work and then fail very rarely.
Secondly, I depend on it. I register a percpu structure with Xen to
share per-vcpu specific information (interrupt mask, time info, runstate
stats, etc).
J
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:18 ` Mike Travis
2008-07-10 19:32 ` H. Peter Anvin
@ 2008-07-10 20:17 ` Eric W. Biederman
2008-07-10 20:24 ` Ingo Molnar
2008-07-11 1:39 ` Mike Travis
1 sibling, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 20:17 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis <travis@sgi.com> writes:
> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> the per cpu area. So you free up a huge amount of unused memory when
> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
> real soon now, ??? future?
Hmm. Do you know how big a role kernel_stat plays?
It is a per cpu structure that is sized via NR_IRQS, and NR_IRQS is sized by NR_CPUS.
So ultimately the amount of memory taken up is NR_CPUS*NR_CPUS*32 or so.
I have a patch I wrote long ago that addresses that specific nasty configuration
by moving the per cpu irq counters into an array reachable from struct irq_desc.
The next step which I did not get to (but is interesting from a scaling perspective)
was to start dynamically allocating the irq structures.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:56 ` Jeremy Fitzhardinge
@ 2008-07-10 20:22 ` Eric W. Biederman
2008-07-10 20:54 ` Jeremy Fitzhardinge
2008-07-11 6:59 ` Rusty Russell
2008-07-10 20:25 ` Eric W. Biederman
1 sibling, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 20:22 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> No, that sounds like a bad idea. For one, how would you enforce it? How would
> you check for it? It's one of those things that would mostly work and then fail
> very rarely.
Well, the easiest way would be to avoid letting people take the address of
per cpu memory, and just provide macros to read/write it. We are 90% of the
way there already, so it isn't a big jump.
> Secondly, I depend on it. I register a percpu structure with Xen to share
> per-vcpu specific information (interrupt mask, time info, runstate stats, etc).
Well even virtual allocation is likely to break the Xen sharing case as you
would at least need to compute the physical address and pass it to Xen.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:17 ` Eric W. Biederman
@ 2008-07-10 20:24 ` Ingo Molnar
2008-07-10 21:33 ` Eric W. Biederman
2008-07-11 1:39 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: Ingo Molnar @ 2008-07-10 20:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, H. Peter Anvin, Christoph Lameter,
Jeremy Fitzhardinge, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> Mike Travis <travis@sgi.com> writes:
>
>
> > The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> > the per cpu area. So you free up a huge amount of unused memory
> > when the NR_CPUS count starts getting into the ozone layer. 4k now,
> > 16k real soon now, ??? future?
>
> Hmm. Do you know how big a role kernel_stat plays?
>
> It is a per cpu structure that is sized via NR_IRQS, and NR_IRQS is
> sized by NR_CPUS. So ultimately the amount of memory taken up is
> NR_CPUS*NR_CPUS*32 or so.
>
> I have a patch I wrote long ago, that addresses that specific nasty
> configuration by moving the per cpu irq counters into pointer
> available from struct irq_desc.
>
> The next step which I did not get to (but is interesting from a
> scaling perspective) was to start dynamically allocating the irq
> structures.
/me willing to test & babysit any test-patch in that area ...
this is a big problem and it's getting worse quadratically ;-)
Ingo
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:56 ` Jeremy Fitzhardinge
2008-07-10 20:22 ` Eric W. Biederman
@ 2008-07-10 20:25 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 20:25 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> Secondly, I depend on it. I register a percpu structure with Xen to share
> per-vcpu specific information (interrupt mask, time info, runstate stats, etc).
Note that I expect something like this to be interesting in the context of
per cpu device queues. However, except possibly for Xen, that implies allocating
DMA-addressable memory and going through that API, which will keep device drivers
from using per cpu memory that way even if they allocate something for each cpu.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 18:01 ` H. Peter Anvin
@ 2008-07-10 20:51 ` Christoph Lameter
2008-07-10 20:58 ` H. Peter Anvin
0 siblings, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 20:51 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
H. Peter Anvin wrote:
> I'm much more concerned about wasting an average of 1 MB of memory per CPU.
Well, all the memory that is now allocated via allocpercpu() will be allocated from that 2MB segment. And cpu_alloc packs variables densely. The current slab allocations try to avoid sharing cachelines, which wastes lots of memory for every allocation on each and every processor.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:53 ` Jeremy Fitzhardinge
2008-07-10 17:55 ` H. Peter Anvin
@ 2008-07-10 20:52 ` Christoph Lameter
2008-07-10 20:58 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 20:52 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
> What other trouble? It works fine.
Somehow you performed a mind wipe to get rid of all memory of the earlier messages?
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:22 ` Eric W. Biederman
@ 2008-07-10 20:54 ` Jeremy Fitzhardinge
2008-07-11 6:59 ` Rusty Russell
1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 20:54 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar,
Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel,
Rusty Russell
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>
>> No, that sounds like a bad idea. For one, how would you enforce it? How would
>> you check for it? It's one of those things that would mostly work and then fail
>> very rarely.
>>
>
> Well, the easiest way would be to avoid letting people take the address of
> per cpu memory, and just provide macros to read/write it. We are 90% of the
> way there already, so it isn't a big jump.
>
Well, the x86_X_percpu api is there. But per_cpu() and get_cpu_var()
both explicitly return lvalues which can have their addresses taken.
>> Secondly, I depend on it. I register a percpu structure with Xen to share
>> per-vcpu specific information (interrupt mask, time info, runstate stats, etc).
>>
>
> Well even virtual allocation is likely to break the Xen sharing case as you
> would at least need to compute the physical address and pass it to Xen.
>
Right. At the moment it assumes that the percpu variable is in the
linear mapping, but it could easily do a pagetable walk if necessary.
J
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:52 ` Christoph Lameter
@ 2008-07-10 20:58 ` Jeremy Fitzhardinge
2008-07-10 21:03 ` H. Peter Anvin
2008-07-10 21:05 ` Christoph Lameter
0 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 20:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>
>> What other trouble? It works fine.
>>
>
> Somehow you performed a mind wipe to get rid of all memory of the earlier messages?
>
Percpu on i386 hasn't been a point of discussion. It works fine, and
has been working fine for a long time. The same mechanism would work
fine on x86-64. Its only "issue" is that it doesn't support the broken
gcc abi for stack-protector.
The problem is all zero-based percpu on x86-64.
J
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:51 ` Christoph Lameter
@ 2008-07-10 20:58 ` H. Peter Anvin
2008-07-10 21:07 ` Christoph Lameter
0 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 20:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> I'm much more concerned about wasting an average of 1 MB of memory per CPU.
>
> Well, all the memory that is now allocated via allocpercpu() will be allocated from that 2MB segment. And cpu_alloc packs variables densely. The current slab allocations try to avoid sharing cachelines, which wastes lots of memory for every allocation on each and every processor.
>
And how much is that, especially on *small* systems?
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:58 ` Jeremy Fitzhardinge
@ 2008-07-10 21:03 ` H. Peter Anvin
2008-07-11 0:55 ` Mike Travis
2008-07-10 21:05 ` Christoph Lameter
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 21:03 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
>
> Percpu on i386 hasn't been a point of discussion. It works fine, and
> has been working fine for a long time. The same mechanism would work
> fine on x86-64. Its only "issue" is that it doesn't support the broken
> gcc abi for stack-protector.
>
> The problem is all zero-based percpu on x86-64.
>
Well, x86-64 has *two* issues: limited range of offsets (regardless of
whether we use RIP-relative addressing or not), and the stack-protector ABI.
I'm still trying to reproduce Mike's setup, but I suspect it can be
switched to RIP-relative for the fixed-offset (static) stuff; for the
dynamic stuff it's all via pointers anyway so the offsets don't matter.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:58 ` Jeremy Fitzhardinge
2008-07-10 21:03 ` H. Peter Anvin
@ 2008-07-10 21:05 ` Christoph Lameter
2008-07-10 21:22 ` Eric W. Biederman
2008-07-10 21:29 ` H. Peter Anvin
1 sibling, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 21:05 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Jeremy Fitzhardinge wrote:
> Percpu on i386 hasn't been a point of discussion. It works fine, and
> has been working fine for a long time. The same mechanism would work
> fine on x86-64. Its only "issue" is that it doesn't support the broken
> gcc abi for stack-protector.
Well, that is one thing; then there are the scaling issues, the support of the new cpu allocator, new arch-independent cpu operations, etc.
> The problem is all zero-based percpu on x86-64.
The zero based stuff will enable a lot of things. Please have a look at the cpu_alloc patchsets.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:58 ` H. Peter Anvin
@ 2008-07-10 21:07 ` Christoph Lameter
2008-07-10 21:11 ` H. Peter Anvin
2008-07-10 21:26 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 21:07 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
H. Peter Anvin wrote:
> And how much is that, especially on *small* systems?
i386?
i386 uses 4K mappings. Only a few cpus are supported, and ZONE_NORMAL memory is scarce, so the per cpu areas really cannot get that big. See the cpu_alloc patchsets for i386.
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:07 ` Christoph Lameter
@ 2008-07-10 21:11 ` H. Peter Anvin
2008-07-11 15:32 ` Christoph Lameter
2008-07-10 21:26 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 21:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> And how much is that, especially on *small* systems?
>
> i386?
>
> i386 uses 4K mappings. There are just a few cpus supported, there is scarcity of ZONE_NORMAL memory so the per cpu areas really cannot get that big. See the cpu_alloc patchsets for i386.
>
No, not i386. x86-64.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:05 ` Christoph Lameter
@ 2008-07-10 21:22 ` Eric W. Biederman
2008-07-10 21:29 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 21:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, H. Peter Anvin, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter <cl@linux-foundation.org> writes:
> Jeremy Fitzhardinge wrote:
>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the broken
>> gcc abi for stack-protector.
>
> Well, that is one thing; then there are the scaling issues, the support of the new
> cpu allocator, new arch-independent cpu operations, etc.
>
>> The problem is all zero-based percpu on x86-64.
>
> The zero based stuff will enable a lot of things. Please have a look at the
> cpu_alloc patchsets.
Christoph, again: the reason we are balking at the zero based percpu
area is NOT because it is zero based. It is because systems with it
patched in don't work reliably.
The bottom line is that if the tools don't support a clever idea, we can't use it.
Hopefully the problem can be root-caused and we can use a zero based percpu area.
There are several ways we can achieve that.
Further, any design that depends on a zero based percpu area can work with a
contiguous percpu area at an offset, so we should not be breaking whatever design
you have for the percpu allocator.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:07 ` Christoph Lameter
2008-07-10 21:11 ` H. Peter Anvin
@ 2008-07-10 21:26 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 21:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Mike Travis, Jeremy Fitzhardinge,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Christoph Lameter <cl@linux-foundation.org> writes:
> H. Peter Anvin wrote:
>
>> And how much is that, especially on *small* systems?
>
> i386?
>
> i386 uses 4K mappings. There are just a few cpus supported, there is scarcity of
> ZONE_NORMAL memory so the per cpu areas really cannot get that big. See the
> cpu_alloc patchsets for i386.
i386 is fundamentally resource constrained. However, x86_32 should support a
strict superset of the machines the x86_64 kernel supports.
Because it is resource constrained in the lowmem zone, you should not
be able to bring up all of the cpus on a huge cpu box. But you should still
be able to boot and run the kernel. So for percpu data we have effectively
the same size constraints.
Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:05 ` Christoph Lameter
2008-07-10 21:22 ` Eric W. Biederman
@ 2008-07-10 21:29 ` H. Peter Anvin
2008-07-11 0:12 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 21:29 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven
Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the broken
>> gcc abi for stack-protector.
>
> Well, that is one thing; then there are the scaling issues, the support of the new cpu allocator, new arch-independent cpu operations, etc.
>
>> The problem is all zero-based percpu on x86-64.
>
> The zero based stuff will enable a lot of things. Please have a look at the cpu_alloc patchsets.
>
No argument that this work is worthwhile. The main issues on the table are
the particular choice of offsets and the handling of the virtual space
-- I believe 2 MB mappings are too large, except perhaps as an option.
-hpa
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:24 ` Ingo Molnar
@ 2008-07-10 21:33 ` Eric W. Biederman
0 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 21:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mike Travis, H. Peter Anvin, Christoph Lameter,
Jeremy Fitzhardinge, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Ingo Molnar <mingo@elte.hu> writes:
> /me willing to test & babysit any test-patch in that area ...
>
> this is a big problem and it's getting worse quadratically ;-)
>
Well here is a copy of my old patch to get things started.
It isn't where I'm working right now so I don't have time to rebase
the patch, but the same logic should still apply.
----
>From e02f708c0eca6708c8f79824717705379e982fe3 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <ebiederm@xmission.com>
Date: Tue, 13 Feb 2007 02:42:50 -0700
Subject: [PATCH] genirq: Kill the percpu NR_IRQS sized array in kstat.
In struct kernel_stat, which has one instance per cpu, we keep a
count of how many times each irq has occurred on that cpu. Given
that we don't usually use all of our irqs, this is very wasteful
of space, and in particular of percpu space.
This patch replaces that array, on all architectures that use
GENERIC_HARD_IRQS, with a pointer in struct irq_desc to an array indexed by cpu.
The array is allocated at boot time, after the cpu_possible_map has been
generated, and is only large enough to hold the largest possible cpu index.
Assuming the common case of dense cpu numbers this consumes roughly the
same amount of space as the current mechanism and removes the NR_IRQS
sized array.
The only immediate win is to get these counts out of the limited size
percpu areas.
Shortly I will make the need for NR_IRQS sized arrays obsolete, allowing
a single kernel to support huge numbers of irqs and still be efficient
on small machines.
With the removal of the NR_IRQS sized arrays this patch will be a clear
size win in space consumption for small machines.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
arch/alpha/kernel/irq.c | 2 +-
arch/alpha/kernel/irq_alpha.c | 2 +-
arch/arm/kernel/irq.c | 2 +-
arch/avr32/kernel/irq.c | 2 +-
arch/cris/kernel/irq.c | 2 +-
arch/frv/kernel/irq.c | 2 +-
arch/i386/kernel/io_apic.c | 2 +-
arch/i386/kernel/irq.c | 2 +-
arch/i386/mach-visws/visws_apic.c | 2 +-
arch/ia64/kernel/irq.c | 2 +-
arch/ia64/kernel/irq_ia64.c | 4 ++--
arch/m32r/kernel/irq.c | 2 +-
arch/mips/au1000/common/time.c | 4 ++--
arch/mips/kernel/irq.c | 2 +-
arch/mips/kernel/time.c | 4 ++--
arch/mips/sgi-ip22/ip22-int.c | 2 +-
arch/mips/sgi-ip22/ip22-time.c | 4 ++--
arch/mips/sgi-ip27/ip27-timer.c | 2 +-
arch/mips/sibyte/bcm1480/smp.c | 2 +-
arch/mips/sibyte/sb1250/irq.c | 2 +-
arch/mips/sibyte/sb1250/smp.c | 2 +-
arch/parisc/kernel/irq.c | 2 +-
arch/powerpc/kernel/irq.c | 2 +-
arch/ppc/amiga/amiints.c | 4 ++--
arch/ppc/amiga/cia.c | 2 +-
arch/ppc/amiga/ints.c | 4 ++--
arch/sh/kernel/irq.c | 2 +-
arch/sparc64/kernel/irq.c | 4 ++--
arch/sparc64/kernel/smp.c | 2 +-
arch/um/kernel/irq.c | 2 +-
arch/x86_64/kernel/irq.c | 6 +-----
arch/xtensa/kernel/irq.c | 2 +-
fs/proc/proc_misc.c | 2 +-
include/linux/irq.h | 4 ++++
include/linux/kernel_stat.h | 20 +++++++++++++++++---
init/main.c | 1 +
kernel/irq/chip.c | 15 +++++----------
kernel/irq/handle.c | 29 +++++++++++++++++++++++++++--
38 files changed, 94 insertions(+), 59 deletions(-)
diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index 3659af8..8e0af05 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -88,7 +88,7 @@ show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(irq));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[irq]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[irq].chip->typename);
seq_printf(p, " %c%s",
diff --git a/arch/alpha/kernel/irq_alpha.c b/arch/alpha/kernel/irq_alpha.c
index e16aeb6..2c0852c 100644
--- a/arch/alpha/kernel/irq_alpha.c
+++ b/arch/alpha/kernel/irq_alpha.c
@@ -64,7 +64,7 @@ do_entInt(unsigned long type, unsigned long vector,
smp_percpu_timer_interrupt(regs);
cpu = smp_processor_id();
if (cpu != boot_cpuid) {
- kstat_cpu(cpu).irqs[RTC_IRQ]++;
+ irq_desc[RTC_IRQ].kstat_irqs[cpu]++;
} else {
handle_irq(RTC_IRQ);
}
diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index e101846..db79c4c 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -76,7 +76,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%3d: ", i);
for_each_present_cpu(cpu)
- seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu));
seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-");
seq_printf(p, " %s", action->name);
for (action = action->next; action; action = action->next)
diff --git a/arch/avr32/kernel/irq.c b/arch/avr32/kernel/irq.c
index fd31124..7cddf0a 100644
--- a/arch/avr32/kernel/irq.c
+++ b/arch/avr32/kernel/irq.c
@@ -56,7 +56,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%3d: ", i);
for_each_online_cpu(cpu)
- seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu));
seq_printf(p, " %8s", irq_desc[i].chip->name ? : "-");
seq_printf(p, " %s", action->name);
for (action = action->next; action; action = action->next)
diff --git a/arch/cris/kernel/irq.c b/arch/cris/kernel/irq.c
index 903ea62..9d7c1d7 100644
--- a/arch/cris/kernel/irq.c
+++ b/arch/cris/kernel/irq.c
@@ -66,7 +66,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/arch/frv/kernel/irq.c b/arch/frv/kernel/irq.c
index 87f360a..ff6579f 100644
--- a/arch/frv/kernel/irq.c
+++ b/arch/frv/kernel/irq.c
@@ -75,7 +75,7 @@ int show_interrupts(struct seq_file *p, void *v)
if (action) {
seq_printf(p, "%3d: ", i);
for_each_present_cpu(cpu)
- seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu));
seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-");
seq_printf(p, " %s", action->name);
for (action = action->next;
diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
index edcc849..c660c8b 100644
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -488,7 +488,7 @@ static void do_irq_balance(void)
if ( package_index == i )
IRQ_DELTA(package_index,j) = 0;
/* Determine the total count per processor per IRQ */
- value_now = (unsigned long) kstat_cpu(i).irqs[j];
+ value_now = (unsigned long) kstat_irqs_cpu(j, i);
/* Determine the activity per processor per IRQ */
delta = value_now - LAST_CPU_IRQ(i,j);
diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c
index eeb29af..0a30abc 100644
--- a/arch/i386/kernel/irq.c
+++ b/arch/i386/kernel/irq.c
@@ -279,7 +279,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %8s", irq_desc[i].chip->name);
seq_printf(p, "-%-8s", irq_desc[i].name);
diff --git a/arch/i386/mach-visws/visws_apic.c b/arch/i386/mach-visws/visws_apic.c
index 38c2b13..0d153eb 100644
--- a/arch/i386/mach-visws/visws_apic.c
+++ b/arch/i386/mach-visws/visws_apic.c
@@ -240,7 +240,7 @@ static irqreturn_t piix4_master_intr(int irq, void *dev_id)
/*
* handle this 'virtual interrupt' as a Cobalt one now.
*/
- kstat_cpu(smp_processor_id()).irqs[realirq]++;
+ desc->kstat_irqs[smp_processor_id()]++;
if (likely(desc->action != NULL))
handle_IRQ_event(realirq, desc->action);
diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c
index ce49c85..06edbb8 100644
--- a/arch/ia64/kernel/irq.c
+++ b/arch/ia64/kernel/irq.c
@@ -73,7 +73,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j) {
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
}
#endif
seq_printf(p, " %14s", irq_desc[i].chip->name);
diff --git a/arch/ia64/kernel/irq_ia64.c b/arch/ia64/kernel/irq_ia64.c
index 456f57b..be1dd6e 100644
--- a/arch/ia64/kernel/irq_ia64.c
+++ b/arch/ia64/kernel/irq_ia64.c
@@ -181,7 +181,7 @@ ia64_handle_irq (ia64_vector vector, struct pt_regs *regs)
ia64_srlz_d();
while (vector != IA64_SPURIOUS_INT_VECTOR) {
if (unlikely(IS_RESCHEDULE(vector)))
- kstat_this_cpu.irqs[vector]++;
+ kstat_irqs_this_cpu(&irq_desc[vector])++;
else {
ia64_setreg(_IA64_REG_CR_TPR, vector);
ia64_srlz_d();
@@ -228,7 +228,7 @@ void ia64_process_pending_intr(void)
*/
while (vector != IA64_SPURIOUS_INT_VECTOR) {
if (unlikely(IS_RESCHEDULE(vector)))
- kstat_this_cpu.irqs[vector]++;
+ kstat_irqs_this_cpu(&irq_desc[vector])++;
else {
struct pt_regs *old_regs = set_irq_regs(NULL);
diff --git a/arch/m32r/kernel/irq.c b/arch/m32r/kernel/irq.c
index f8d8650..4fb85b2 100644
--- a/arch/m32r/kernel/irq.c
+++ b/arch/m32r/kernel/irq.c
@@ -52,7 +52,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/arch/mips/au1000/common/time.c b/arch/mips/au1000/common/time.c
index fa1c62f..c2a084e 100644
--- a/arch/mips/au1000/common/time.c
+++ b/arch/mips/au1000/common/time.c
@@ -81,13 +81,13 @@ void mips_timer_interrupt(void)
int irq = 63;
irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
if (r4k_offset == 0)
goto null;
do {
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
do_timer(1);
#ifndef CONFIG_SMP
update_process_times(user_mode(get_irq_regs()));
diff --git a/arch/mips/kernel/irq.c b/arch/mips/kernel/irq.c
index 2fe4c86..c2cae91 100644
--- a/arch/mips/kernel/irq.c
+++ b/arch/mips/kernel/irq.c
@@ -115,7 +115,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->name);
seq_printf(p, " %s", action->name);
diff --git a/arch/mips/kernel/time.c b/arch/mips/kernel/time.c
index e5e56bd..0a829e2 100644
--- a/arch/mips/kernel/time.c
+++ b/arch/mips/kernel/time.c
@@ -204,7 +204,7 @@ asmlinkage void ll_timer_interrupt(int irq)
int r2 = cpu_has_mips_r2;
irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
/*
* Suckage alert:
@@ -228,7 +228,7 @@ asmlinkage void ll_local_timer_interrupt(int irq)
{
irq_enter();
if (smp_processor_id() != 0)
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
/* we keep interrupt disabled all the time */
local_timer_interrupt(irq, NULL);
diff --git a/arch/mips/sgi-ip22/ip22-int.c b/arch/mips/sgi-ip22/ip22-int.c
index b454924..382a8a5 100644
--- a/arch/mips/sgi-ip22/ip22-int.c
+++ b/arch/mips/sgi-ip22/ip22-int.c
@@ -164,7 +164,7 @@ static void indy_buserror_irq(void)
int irq = SGI_BUSERR_IRQ;
irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
ip22_be_interrupt(irq);
irq_exit();
}
diff --git a/arch/mips/sgi-ip22/ip22-time.c b/arch/mips/sgi-ip22/ip22-time.c
index 2055547..0cd6887 100644
--- a/arch/mips/sgi-ip22/ip22-time.c
+++ b/arch/mips/sgi-ip22/ip22-time.c
@@ -182,7 +182,7 @@ void indy_8254timer_irq(void)
char c;
irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
printk(KERN_ALERT "Oops, got 8254 interrupt.\n");
ArcRead(0, &c, 1, &cnt);
ArcEnterInteractiveMode();
@@ -194,7 +194,7 @@ void indy_r4k_timer_interrupt(void)
int irq = SGI_TIMER_IRQ;
irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
timer_interrupt(irq, NULL);
irq_exit();
}
diff --git a/arch/mips/sgi-ip27/ip27-timer.c b/arch/mips/sgi-ip27/ip27-timer.c
index 8c3c78c..592449c 100644
--- a/arch/mips/sgi-ip27/ip27-timer.c
+++ b/arch/mips/sgi-ip27/ip27-timer.c
@@ -106,7 +106,7 @@ again:
if (LOCAL_HUB_L(PI_RT_COUNT) >= ct_cur[cpu])
goto again;
- kstat_this_cpu.irqs[irq]++; /* kstat only for bootcpu? */
+ irq_desc[irq].kstat_irqs[cpu]++; /* kstat only for bootcpu? */
if (cpu == 0)
do_timer(1);
diff --git a/arch/mips/sibyte/bcm1480/smp.c b/arch/mips/sibyte/bcm1480/smp.c
index bf32827..a070238 100644
--- a/arch/mips/sibyte/bcm1480/smp.c
+++ b/arch/mips/sibyte/bcm1480/smp.c
@@ -93,7 +93,7 @@ void bcm1480_mailbox_interrupt(void)
int cpu = smp_processor_id();
unsigned int action;
- kstat_this_cpu.irqs[K_BCM1480_INT_MBOX_0_0]++;
+ irq_desc[K_BCM1480_INT_MBOX_0_0].kstat_irqs[cpu]++;
/* Load the mailbox register to figure out what we're supposed to do */
action = (__raw_readq(mailbox_0_regs[cpu]) >> 48) & 0xffff;
diff --git a/arch/mips/sibyte/sb1250/irq.c b/arch/mips/sibyte/sb1250/irq.c
index 1482394..fb7d77f 100644
--- a/arch/mips/sibyte/sb1250/irq.c
+++ b/arch/mips/sibyte/sb1250/irq.c
@@ -390,7 +390,7 @@ static void sb1250_kgdb_interrupt(void)
* host to stop the break, since we would see another
* interrupt on the end-of-break too)
*/
- kstat_this_cpu.irqs[kgdb_irq]++;
+ kstat_irqs_this_cpu(&irq_desc[kgdb_irq])++;
mdelay(500);
duart_out(R_DUART_CMD, V_DUART_MISC_CMD_RESET_BREAK_INT |
M_DUART_RX_EN | M_DUART_TX_EN);
diff --git a/arch/mips/sibyte/sb1250/smp.c b/arch/mips/sibyte/sb1250/smp.c
index c38e1f3..54c6164 100644
--- a/arch/mips/sibyte/sb1250/smp.c
+++ b/arch/mips/sibyte/sb1250/smp.c
@@ -81,7 +81,7 @@ void sb1250_mailbox_interrupt(void)
int cpu = smp_processor_id();
unsigned int action;
- kstat_this_cpu.irqs[K_INT_MBOX_0]++;
+ irq_desc[K_INT_MBOX_0].kstat_irqs[cpu]++;
/* Load the mailbox register to figure out what we're supposed to do */
action = (____raw_readq(mailbox_regs[cpu]) >> 48) & 0xffff;
diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c
index b39c5b9..c222bbd 100644
--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -192,7 +192,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%3d: ", i);
#ifdef CONFIG_SMP
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#else
seq_printf(p, "%10u ", kstat_irqs(i));
#endif
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 919fbf5..0be818a 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -189,7 +189,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%3d: ", i);
#ifdef CONFIG_SMP
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#else
seq_printf(p, "%10u ", kstat_irqs(i));
#endif /* CONFIG_SMP */
diff --git a/arch/ppc/amiga/amiints.c b/arch/ppc/amiga/amiints.c
index 265fcd3..3dc8651 100644
--- a/arch/ppc/amiga/amiints.c
+++ b/arch/ppc/amiga/amiints.c
@@ -184,7 +184,7 @@ inline void amiga_do_irq(int irq, struct pt_regs *fp)
irq_desc_t *desc = irq_desc + irq;
struct irqaction *action = desc->action;
- kstat_cpu(0).irqs[irq]++;
+ desc->kstat_irqs[0]++;
action->handler(irq, action->dev_id, fp);
}
@@ -193,7 +193,7 @@ void amiga_do_irq_list(int irq, struct pt_regs *fp)
irq_desc_t *desc = irq_desc + irq;
struct irqaction *action;
- kstat_cpu(0).irqs[irq]++;
+ desc->kstat_irqs[0]++;
amiga_custom.intreq = ami_intena_vals[irq];
diff --git a/arch/ppc/amiga/cia.c b/arch/ppc/amiga/cia.c
index 9558f2f..33faf2d 100644
--- a/arch/ppc/amiga/cia.c
+++ b/arch/ppc/amiga/cia.c
@@ -146,7 +146,7 @@ static void cia_handler(int irq, void *dev_id, struct pt_regs *fp)
amiga_custom.intreq = base->int_mask;
for (i = 0; i < CIA_IRQS; i++, irq++) {
if (ints & 1) {
- kstat_cpu(0).irqs[irq]++;
+ desc->kstat_irqs[0]++;
action = desc->action;
action->handler(irq, action->dev_id, fp);
}
diff --git a/arch/ppc/amiga/ints.c b/arch/ppc/amiga/ints.c
index 083a174..84ec6cb 100644
--- a/arch/ppc/amiga/ints.c
+++ b/arch/ppc/amiga/ints.c
@@ -128,7 +128,7 @@ asmlinkage void process_int(unsigned long vec, struct pt_regs *fp)
{
if (vec >= VEC_INT1 && vec <= VEC_INT7 && !MACH_IS_BVME6000) {
vec -= VEC_SPUR;
- kstat_cpu(0).irqs[vec]++;
+ irq_desc[vec].kstat_irqs[0]++;
irq_list[vec].handler(vec, irq_list[vec].dev_id, fp);
} else {
if (mach_process_int)
@@ -147,7 +147,7 @@ int m68k_get_irq_list(struct seq_file *p, void *v)
if (mach_default_handler) {
for (i = 0; i < SYS_IRQS; i++) {
seq_printf(p, "auto %2d: %10u ", i,
- i ? kstat_cpu(0).irqs[i] : num_spurious);
+ i ? kstat_irqs_cpu(i, 0) : num_spurious);
seq_puts(p, " ");
seq_printf(p, "%s\n", irq_list[i].devname);
}
diff --git a/arch/sh/kernel/irq.c b/arch/sh/kernel/irq.c
index 67be2b6..e9c739a 100644
--- a/arch/sh/kernel/irq.c
+++ b/arch/sh/kernel/irq.c
@@ -52,7 +52,7 @@ int show_interrupts(struct seq_file *p, void *v)
goto unlock;
seq_printf(p, "%3d: ",i);
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
seq_printf(p, " %14s", irq_desc[i].chip->name);
seq_printf(p, "-%-8s", irq_desc[i].name);
seq_printf(p, " %s", action->name);
diff --git a/arch/sparc64/kernel/irq.c b/arch/sparc64/kernel/irq.c
index b5ff3ee..4a436a7 100644
--- a/arch/sparc64/kernel/irq.c
+++ b/arch/sparc64/kernel/irq.c
@@ -154,7 +154,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %9s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
@@ -605,7 +605,7 @@ void timer_irq(int irq, struct pt_regs *regs)
old_regs = set_irq_regs(regs);
irq_enter();
- kstat_this_cpu.irqs[0]++;
+ irq_desc[0].kstat_irqs[0]++;
timer_interrupt(irq, NULL);
irq_exit();
diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index fc99f7b..155703b 100644
--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -1212,7 +1212,7 @@ void smp_percpu_timer_interrupt(struct pt_regs *regs)
irq_enter();
if (cpu == boot_cpu_id) {
- kstat_this_cpu.irqs[0]++;
+ irq_desc[0].kstat_irqs[cpu]++;
timer_tick_interrupt(regs);
}
diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c
index 50a288b..fa16410 100644
--- a/arch/um/kernel/irq.c
+++ b/arch/um/kernel/irq.c
@@ -61,7 +61,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 9fe2e28..beefb89 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -69,12 +69,8 @@ int show_interrupts(struct seq_file *p, void *v)
if (!action)
goto skip;
seq_printf(p, "%3d: ",i);
-#ifndef CONFIG_SMP
- seq_printf(p, "%10u ", kstat_irqs(i));
-#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
-#endif
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
seq_printf(p, " %8s", irq_desc[i].chip->name);
seq_printf(p, "-%-8s", irq_desc[i].name);
diff --git a/arch/xtensa/kernel/irq.c b/arch/xtensa/kernel/irq.c
index c9ea73b..c35e271 100644
--- a/arch/xtensa/kernel/irq.c
+++ b/arch/xtensa/kernel/irq.c
@@ -99,7 +99,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c
index e2c4c0a..21be453 100644
--- a/fs/proc/proc_misc.c
+++ b/fs/proc/proc_misc.c
@@ -472,7 +472,7 @@ static int show_stat(struct seq_file *p, void *v)
softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
for (j = 0 ; j < NR_IRQS ; j++)
- sum += kstat_cpu(i).irqs[j];
+ sum += kstat_irqs_cpu(j, i);
}
seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu\n",
diff --git a/include/linux/irq.h b/include/linux/irq.h
index bb78ab9..9c61fd7 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -156,6 +156,7 @@ struct irq_desc {
void *handler_data;
void *chip_data;
struct irqaction *action; /* IRQ action list */
+ unsigned int *kstat_irqs;
unsigned int status; /* IRQ status */
unsigned int depth; /* nested irq disables */
@@ -178,6 +179,9 @@ struct irq_desc {
extern struct irq_desc irq_desc[NR_IRQS];
+#define kstat_irqs_this_cpu(DESC) \
+ ((DESC)->kstat_irqs[smp_processor_id()])
+
/*
* Migration helpers for obsolete names, they will go away:
*/
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 43e895f..0c8f650 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,7 +27,9 @@ struct cpu_usage_stat {
struct kernel_stat {
struct cpu_usage_stat cpustat;
+#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
+#endif
};
DECLARE_PER_CPU(struct kernel_stat, kstat);
@@ -38,15 +40,27 @@ DECLARE_PER_CPU(struct kernel_stat, kstat);
extern unsigned long long nr_context_switches(void);
+#ifndef CONFIG_GENERIC_HARDIRQS
+static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
+{
+ return kstat_cpu(cpu).irqs[irq];
+}
+static inline void init_kstat_irqs(void) {}
+#else
+extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
+extern void init_kstat_irqs(void);
+#endif /* CONFIG_GENERIC_HARDIRQS */
+
/*
* Number of interrupts per specific IRQ source, since bootup
*/
-static inline int kstat_irqs(int irq)
+static inline unsigned int kstat_irqs(unsigned int irq)
{
- int cpu, sum = 0;
+ unsigned int sum = 0;
+ int cpu;
for_each_possible_cpu(cpu)
- sum += kstat_cpu(cpu).irqs[irq];
+ sum += kstat_irqs_cpu(irq, cpu);
return sum;
}
diff --git a/init/main.c b/init/main.c
index a92989e..23f1c64 100644
--- a/init/main.c
+++ b/init/main.c
@@ -559,6 +559,7 @@ asmlinkage void __init start_kernel(void)
sort_main_extable();
trap_init();
rcu_init();
+ init_kstat_irqs();
init_IRQ();
pidhash_init();
init_timers();
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index f83d691..7896286 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -288,13 +288,12 @@ handle_simple_irq(unsigned int irq, struct irq_desc *desc)
{
struct irqaction *action;
irqreturn_t action_ret;
- const unsigned int cpu = smp_processor_id();
spin_lock(&desc->lock);
if (unlikely(desc->status & IRQ_INPROGRESS))
goto out_unlock;
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
action = desc->action;
if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
@@ -332,7 +331,6 @@ out_unlock:
void fastcall
handle_level_irq(unsigned int irq, struct irq_desc *desc)
{
- unsigned int cpu = smp_processor_id();
struct irqaction *action;
irqreturn_t action_ret;
@@ -342,7 +340,7 @@ handle_level_irq(unsigned int irq, struct irq_desc *desc)
if (unlikely(desc->status & IRQ_INPROGRESS))
goto out_unlock;
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
/*
* If its disabled or no action available
@@ -383,7 +381,6 @@ out_unlock:
void fastcall
handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc)
{
- unsigned int cpu = smp_processor_id();
struct irqaction *action;
irqreturn_t action_ret;
@@ -393,7 +390,7 @@ handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc)
goto out;
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
/*
* If its disabled or no action available
@@ -442,8 +439,6 @@ out:
void fastcall
handle_edge_irq(unsigned int irq, struct irq_desc *desc)
{
- const unsigned int cpu = smp_processor_id();
-
spin_lock(&desc->lock);
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -460,7 +455,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc)
goto out_unlock;
}
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
/* Start handling the irq */
desc->chip->ack(irq);
@@ -516,7 +511,7 @@ handle_percpu_irq(unsigned int irq, struct irq_desc *desc)
{
irqreturn_t action_ret;
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
if (desc->chip->ack)
desc->chip->ack(irq);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index aff1f0f..27cf665 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -15,6 +15,7 @@
#include <linux/random.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
+#include <linux/bootmem.h>
#include "internals.h"
@@ -30,7 +31,7 @@ void fastcall
handle_bad_irq(unsigned int irq, struct irq_desc *desc)
{
print_irq_desc(irq, desc);
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
ack_bad_irq(irq);
}
@@ -170,7 +171,7 @@ fastcall unsigned int __do_IRQ(unsigned int irq)
struct irqaction *action;
unsigned int status;
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
if (CHECK_IRQ_PER_CPU(desc->status)) {
irqreturn_t action_ret;
@@ -269,3 +270,27 @@ void early_init_irq_lock_class(void)
}
#endif
+
+__init void init_kstat_irqs(void)
+{
+ unsigned entries = 0, cpu;
+ unsigned int irq;
+ unsigned bytes;
+
+ /* Compute the worst case size of a per cpu array */
+ for_each_possible_cpu(cpu)
+ if (cpu >= entries)
+ entries = cpu + 1;
+
+ /* Compute how many bytes we need per irq and allocate them */
+ bytes = entries*sizeof(unsigned int);
+ for (irq = 0; irq < NR_IRQS; irq++)
+ irq_desc[irq].kstat_irqs = alloc_bootmem(bytes);
+}
+
+unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
+{
+ struct irq_desc *desc = irq_desc + irq;
+ return desc->kstat_irqs[cpu];
+}
+EXPORT_SYMBOL(kstat_irqs_cpu);
--
1.5.0.g53756
^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:32 ` H. Peter Anvin
@ 2008-07-10 23:37 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 23:37 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Eric W. Biederman, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
H. Peter Anvin wrote:
> Mike Travis wrote:
>>
>> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
>> the per cpu area. So you free up a huge amount of unused memory when
>> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
>> real soon now, ??? future?
>>
>
> Even (or perhaps especially) so, allocating the percpu area in 2 MB
> increments is a total nonstarter. It hurts the small, common
> configurations way too much. For SGI, it's probably fine.
>
> -hpa
Yes, "right-sizing" the kernel for systems from 512M laptops to 4k cpu
systems with memory totally maxed out, using only a binary distribution
has proven tricky (at best... ;-)
One alternative was to only allocate a chunk similar in size to
PERCPU_ENOUGH_ROOM and allow for startup options to create a bigger
space if needed. Though the realloc idea has some merit if we can
validate the non-use of pointers to specific cpu's percpu vars.
(CPU_ALLOC contains a function to dereference a percpu offset.)
The discussion started at:
http://marc.info/?l=linux-kernel&m=121212026716085&w=4
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:29 ` H. Peter Anvin
@ 2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
` (2 more replies)
0 siblings, 3 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-11 0:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
H. Peter Anvin wrote:
...
> -- I believe 2 MB mappings are too large, except perhaps as an option.
>
> -hpa
Hmm, that might be the way to go.... At boot up time determine the
size of the system in terms of cpu count and memory available and
attempt to do the right thing, with startup options to override the
internal choices... ?
(Surely a system that has a "gazillion ip tunnels" could modify its
kernel start options... ;-)
Unfortunately, we can't use a MODULE to support different options unless
we change how the kernel starts up (would need to mount the root fs
before starting secondary cpus.)
Btw, the "zero_based_only" patch (w/o the pda folded into the percpu
area) gets to the point shown below. Dropping NR_CPUS from 4096 to 256
clears up the error. So except for the "stack overflow" message I got
yesterday, the result is the same. As soon as I get a chance, I'll try
it out with gcc-4.2.0 to see if it changed the boot up problem.
Thanks,
Mike
[ 0.096000] ACPI: Core revision 20080321
[ 0.108889] Parsing all Control Methods:
[ 0.116198] Table [DSDT](id 0001) - 364 Objects with 40 Devices 109 Methods 20 Regions
[ 0.124000] Parsing all Control Methods:
[ 0.128000] Table [SSDT](id 0002) - 43 Objects with 0 Devices 16 Methods 0 Regions
[ 0.132000] tbxface-0598 [02] tb_load_namespace : ACPI Tables successfully acquired
[ 0.148000] evxfevnt-0091 [02] enable : Transition to ACPI mode successful
[ 0.200000] CPU0: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz stepping 07
[ 0.211685] Using local APIC timer interrupts.
[ 0.220000] APIC timer calibration result 20781901
[ 0.224000] Detected 20.781 MHz APIC timer.
[ 0.228000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 0.228000] IP: [<0000000000000000>]
[ 0.228000] PGD 0
[ 0.228000] Oops: 0010 [1] SMP
[ 0.228000] CPU 0
[ 0.228000] Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip-ingo-test-0701-00208-g79a4d68-dirty #7
[ 0.228000] RIP: 0010:[<0000000000000000>] [<0000000000000000>]
[ 0.228000] RSP: 0000:ffff81022ed1fe18 EFLAGS: 00010286
[ 0.228000] RAX: 0000000000000000 RBX: ffff81022ed1fe84 RCX: ffffffff80d0de80
[ 0.228000] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffffff80d0de80
[ 0.228000] RBP: ffff81022ed1fe50 R08: ffff81022ed1fe84 R09: ffffffff80e28ae0
[ 0.228000] R10: ffff81022ed1fe80 R11: ffff81022ed39188 R12: 00000000ffffffff
[ 0.228000] R13: ffffffff80d0de40 R14: 0000000000000001 R15: 0000000000000003
[ 0.228000] FS: 0000000000000000(0000) GS:ffffffff80de69c0(0000) knlGS:0000000000000000
[ 0.228000] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 0.228000] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[ 0.228000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.228000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.228000] Process swapper (pid: 1, threadinfo ffff81022ed10000, task ffff81022ec93100)
[ 0.228000] Stack: ffffffff8024f34c 0000000000000000 0000000000000001 0000000000000001
[ 0.228000] 00000000fffffff0 ffffffff80e8e4e0 0000000000092fd0 ffff81022ed1fe60
[ 0.228000] ffffffff8024f3cc ffff81022ed1fea0 ffffffff808ef5b8 0000000000000008
[ 0.228000] Call Trace:
[ 0.228000] [<ffffffff8024f34c>] ? notifier_call_chain+0x38/0x60
[ 0.228000] [<ffffffff8024f3cc>] __raw_notifier_call_chain+0xe/0x10
[ 0.228000] [<ffffffff808ef5b8>] cpu_up+0xa8/0x138
[ 0.228000] [<ffffffff80e4d9b9>] kernel_init+0xdf/0x327
[ 0.228000] [<ffffffff8020d4b8>] child_rip+0xa/0x12
[ 0.228000] [<ffffffff8020c955>] ? restore_args+0x0/0x30
[ 0.228000] [<ffffffff80e4d8da>] ? kernel_init+0x0/0x327
[ 0.228000] [<ffffffff8020d4ae>] ? child_rip+0x0/0x12
[ 0.228000]
[ 0.228000]
[ 0.228000] Code: Bad RIP value.
[ 0.228000] RIP [<0000000000000000>]
[ 0.228000] RSP <ffff81022ed1fe18>
[ 0.228000] CR2: 0000000000000000
[ 0.232000] ---[ end trace a7919e7f17c0a725 ]---
[ 0.236000] Kernel panic - not syncing: Attempted to kill init!
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:12 ` Mike Travis
@ 2008-07-11 0:14 ` H. Peter Anvin
2008-07-11 0:58 ` Mike Travis
2008-07-11 0:42 ` Eric W. Biederman
2008-07-11 15:36 ` Christoph Lameter
2 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-11 0:14 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis wrote:
>
> Hmm, that might be the way to go.... At boot up time determine the
> size of the system in terms of cpu count and memory available and
> attempt to do the right thing, with startup options to override the
> internal choices... ?
>
> (Surely a system that has a "gazillion ip tunnels" could modify its
> kernel start options... ;-)
>
> Unfortunately, we can't use a MODULE to support different options unless
> we change how the kernel starts up (would need to mount the root fs
> before starting secondary cpus.)
>
Using a module doesn't make any sense anyway. This is more what the
kernel command line is for.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
@ 2008-07-11 0:42 ` Eric W. Biederman
2008-07-11 15:36 ` Christoph Lameter
2 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-11 0:42 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis <travis@sgi.com> writes:
> Btw, the "zero_based_only" patch (w/o the pda folded into the percpu
> area) gets to the point shown below. Dropping NR_CPUS from 4096 to 256
> clears up the error. So except for the "stack overflow" message I got
> yesterday, the result is the same. As soon as I get a chance, I'll try
> it out with gcc-4.2.0 to see if it changed the boot up problem.
Thanks, that seems to confirm the suspicion that it is the zero-based percpu
segment that is causing problems.
In this case it appears that the notifier block for one of the cpu notifiers
got stomped, or never got initialized.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:03 ` H. Peter Anvin
@ 2008-07-11 0:55 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-11 0:55 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Christoph Lameter, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the
>> broken gcc abi for stack-protector.
>>
>> The problem is all zero-based percpu on x86-64.
>>
>
> Well, x86-64 has *two* issues: limited range of offsets (regardless of
> if we do RIP-relative or not), and the stack-protector ABI.
>
> I'm still trying to reproduce Mike's setup, but I suspect it can be
> switched to RIP-relative for the fixed-offset (static) stuff; for the
> dynamic stuff it's all via pointers anyway so the offsets don't matter.
>
> -hpa
I'm rebuilding my tip tree now, that should bring it up to date. I'll
repost patches #1 to (currently) #4 shortly.
I'm looking at some code that does not have patches I sent in like 4 to 5
months ago (acpi/NR_CPUS related changes).
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:14 ` H. Peter Anvin
@ 2008-07-11 0:58 ` Mike Travis
2008-07-11 1:41 ` H. Peter Anvin
2008-07-11 15:37 ` Christoph Lameter
0 siblings, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-11 0:58 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
H. Peter Anvin wrote:
> Mike Travis wrote:
>>
>> Hmm, that might be the way to go.... At boot up time determine the
>> size of the system in terms of cpu count and memory available and
>> attempt to do the right thing, with startup options to override the
>> internal choices... ?
>>
>> (Surely a system that has a "gazillion ip tunnels" could modify its
>> kernel start options... ;-)
>>
>> Unfortunately, we can't use a MODULE to support different options unless
>> we change how the kernel starts up (would need to mount the root fs
>> before starting secondary cpus.)
>>
>
> Using a module doesn't make any sense anyway. This is more what the
> kernel command line is for.
>
> -hpa
I was thinking that supporting virtual percpu addresses would take a fair
amount of code that, if living in a MODULE, wouldn't impact small systems.
But it seems to be not worth the effort... ;-)
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:17 ` Eric W. Biederman
2008-07-10 20:24 ` Ingo Molnar
@ 2008-07-11 1:39 ` Mike Travis
2008-07-11 2:57 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-11 1:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Eric W. Biederman wrote:
> Mike Travis <travis@sgi.com> writes:
>
>
>> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
>> the per cpu area. So you free up a huge amount of unused memory when
>> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
>> real soon now, ??? future?
>
> Hmm. Do you know how big a role kernel_stat plays.
>
> It is a per cpu structure that is sized via NR_IRQS. NR_IRQS is by NR_CPUS.
> So ultimately the amount of memory take up is NR_CPUS*NR_CPUS*32 or so.
>
> I have a patch I wrote long ago, that addresses that specific nasty configuration
> by moving the per cpu irq counters into pointer available from struct irq_desc.
>
> The next step which I did not get to (but is interesting from a scaling perspective)
> was to start dynamically allocating the irq structures.
>
> Eric
If you could dig that up, that would be great. Another engr here at SGI
took that task off my hands and he's been able to do a few things to reduce
the "# irqs" but irq_desc is still one of the bigger static arrays (>256k).
(There was some discussion a while back on this very subject.)
The top data users are:
====== Data (-l 500)
1 - ingo-test-0701-256
2 - 4k-defconfig
3 - ingo-test-0701
.1. .2. .3. ..final..
1048576 -917504 +917504 1048576 . __log_buf(.bss)
262144 -262144 +262144 262144 . gl_hash_table(.bss)
122360 -122360 +122360 122360 . g_bitstream(.data)
119756 -119756 +119756 119756 . init_data(.rodata)
89760 -89760 +89760 89760 . o2net_nodes(.bss)
76800 -76800 +614400 614400 +700% early_node_map(.data)
44548 -44548 +44548 44548 . typhoon_firmware_image(.rodata)
43008 +215040 . 258048 +500% irq_desc(.data.cacheline_aligned)
42768 -42768 +42768 42768 . s_firmLoad(.data)
41184 -41184 +41184 41184 . saa7134_boards(.data)
38912 -38912 +38912 38912 . dabusb(.bss)
34804 -34804 +34804 34804 . g_Firmware(.data)
32768 -32768 +32768 32768 . read_buffers(.bss)
19968 -19968 +159744 159744 +700% initkmem_list3(.init.data)
18041 -18041 +18041 18041 . OperationalCodeImage_GEN1(.data)
16507 -16507 +16507 16507 . OperationalCodeImage_GEN2(.data)
16464 -16464 +16464 16464 . ipw_geos(.rodata)
16388 +114688 -114688 16388 . map_pid_to_cmdline(.bss)
16384 -16384 +16384 16384 . gl_hash_locks(.bss)
16384 +245760 . 262144 +1500% boot_pageset(.bss)
16128 +215040 . 231168 +1333% irq_cfg(.data.read_mostly)
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:58 ` Mike Travis
@ 2008-07-11 1:41 ` H. Peter Anvin
2008-07-11 15:37 ` Christoph Lameter
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-11 1:41 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis wrote:
>
> I was thinking that supporting virtual percpu addresses would take a fair
> amount of code that, if living in a MODULE, wouldn't impact small systems.
> But it seems to be not worth the effort... ;-)
>
No, and it seems pretty toxic. If we're doing virtual, they should
almost certainly always be virtual (except perhaps on UP.) Page size is
a separate issue.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 1:39 ` Mike Travis
@ 2008-07-11 2:57 ` Eric W. Biederman
0 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-11 2:57 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis <travis@sgi.com> writes:
> If you could dig that up, that would be great. Another engr here at SGI
> took that task off my hands and he's been able to do a few things to reduce
> the "# irqs" but irq_desc is still one of the bigger static arrays (>256k).
So I posted the part I had completed, which takes the NR_IRQS array out of kernel_stat.
Here are my mental notes on how to handle the rest.
Also, if you will notice, on x86_64 everything that is per irq is in irq_cfg,
which explains why irq_cfg grows. We have those crazy, almost useless
bitmaps of which cpu we want to direct irqs to in the irq configuration, so that
doesn't help.
The arrays sized by NR_IRQS are in:
drivers/char/random.c:static struct timer_rand_state *irq_timer_state[NR_IRQS];
looks like it should go in irq_desc (it's a generic feature).
drivers/pcmcia/pcmcia_resource.c:static u8 pcmcia_used_irq[NR_IRQS];
That number should be 16, possibly 32 for sanity, not NR_IRQS.
drivers/net/hamradio/scc.c:static struct irqflags { unsigned char used : 1; } Ivec[NR_IRQS];
drivers/serial/68328serial.c:struct m68k_serial *IRQ_ports[NR_IRQS];
drivers/serial/8250.c:static struct irq_info irq_lists[NR_IRQS];
drivers/serial/m32r_sio.c:static struct irq_info irq_lists[NR_IRQS];
These are all drivers and should allocate a proper per-irq structure like every other driver.
drivers/xen/events.c:static struct packed_irq irq_info[NR_IRQS];
drivers/xen/events.c:static int irq_bindcount[NR_IRQS];
For all intents and purposes this is another architecture; it should be fixed up
at some point.
The interfaces from include/linux/interrupt.h that take an irq number are
slow path.
So it is just a matter of writing an irq_descp(irq) that takes an irq
number and returns an irq_desc. The definition would go something
like:
#ifndef CONFIG_DYNAMIC_NR_IRQ
#define irq_descp(irq) \
	(((irq) >= 0 && (irq) < NR_IRQS) ? (irq_desc + (irq)) : NULL)
#else
struct irq_desc *irq_descp(int irq)
{
	struct irq_desc *desc, *found = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(desc, &irq_list, list) {
		if (desc->irq == irq) {
			found = desc;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}
#endif
Then the generic irq code just needs to use irq_descp throughout,
and the arch code needs to allocate/free irq_descs and add them to the
list, with say:
int add_irq_desc(int irq, struct irq_desc *desc)
{
	struct irq_desc *old;
	int error = -EINVAL;

	spin_lock(&irq_list_lock);
	old = irq_descp(irq);
	if (old)
		goto out;
	list_add_rcu(&desc->list, &irq_list);
	error = 0;
out:
	spin_unlock(&irq_list_lock);
	return error;
}
With the architecture picking the irq number, it can be stable and have meaning
to users.
Starting from that direction it isn't too hard and it should yield timely results.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:22 ` Eric W. Biederman
2008-07-10 20:54 ` Jeremy Fitzhardinge
@ 2008-07-11 6:59 ` Rusty Russell
1 sibling, 0 replies; 190+ messages in thread
From: Rusty Russell @ 2008-07-11 6:59 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jeremy Fitzhardinge, Mike Travis, H. Peter Anvin,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter,
Jack Steiner, linux-kernel
On Friday 11 July 2008 06:22:52 Eric W. Biederman wrote:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
> > No, that sounds like a bad idea. For one, how would you enforce it? How
> > would you check for it? It's one of those things that would mostly work
> > and then fail very rarely.
>
> Well the easiest way would be to avoid letting people take the address
> of per cpu memory, and just provide macros to read/write it. We are 90% of
> the way there already so it isn't a big jump.
Hi Eric,
I decided against that originally, but we can revisit that decision. But
it would *not* be easy. Try it on kernel/sched.c which uses per-cpu "struct
rq".
Perhaps we could limit dynamically allocated per-cpu mem this way though...
Rusty.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:11 ` H. Peter Anvin
@ 2008-07-11 15:32 ` Christoph Lameter
2008-07-11 16:07 ` H. Peter Anvin
2008-07-11 16:57 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-11 15:32 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
H. Peter Anvin wrote:
> Christoph Lameter wrote:
>> H. Peter Anvin wrote:
>>
>>> And how much is that, especially on *small* systems?
>>
>> i386?
>>
>> i386 uses 4K mappings. There are just a few cpus supported, there is
>> scarcity of ZONE_NORMAL memory so the per cpu areas really cannot get
>> that big. See the cpu_alloc patchsets for i386.
>>
>
> No, not i386. x86-64.
x86_64 are small systems? For 64 bit use one would expect 4GB of memory.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
2008-07-11 0:42 ` Eric W. Biederman
@ 2008-07-11 15:36 ` Christoph Lameter
2 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-11 15:36 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis wrote:
> H. Peter Anvin wrote:
> ...
>> -- I believe 2 MB mappings are too large, except perhaps as an option.
>>
>> -hpa
>
> Hmm, that might be the way to go.... At boot up time determine the
> size of the system in terms of cpu count and memory available and
> attempt to do the right thing, with startup options to override the
> internal choices... ?
Ok. That is an extension of the static per cpu area scenario already supported by cpu_alloc. Should not be too difficult to implement.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:58 ` Mike Travis
2008-07-11 1:41 ` H. Peter Anvin
@ 2008-07-11 15:37 ` Christoph Lameter
1 sibling, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-11 15:37 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven
Mike Travis wrote:
> I was thinking that supporting virtual percpu addresses would take a fair
> amount of code that, if living in a MODULE, wouldn't impact small systems.
> But it seems to be not worth the effort... ;-)
For base page mappings that logic is already provided by the vmalloc subsystem. For 2M mappings the vmemmap logic can be used.
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 15:32 ` Christoph Lameter
@ 2008-07-11 16:07 ` H. Peter Anvin
2008-07-11 16:57 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-11 16:07 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Christoph Lameter wrote:
>>>
>> No, not i386. x86-64.
>
> x86_64 are small systems? For 64 bit use one would expect 4GB of memory.
Hardly so. i386 has too many other limitations; for one thing, it
starts having performance problems (needing HIGHMEM) at less than 1 GB.
Then there are the additional registers, etc.
-hpa
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 15:32 ` Christoph Lameter
2008-07-11 16:07 ` H. Peter Anvin
@ 2008-07-11 16:57 ` Eric W. Biederman
2008-07-11 17:10 ` H. Peter Anvin
1 sibling, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-11 16:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Mike Travis, Jeremy Fitzhardinge,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
> x86_64 are small systems? For 64 bit use one would expect 4GB of memory.
Anything over 1G where 32bit runs out of lowmem starts to be a win.
Plus sometimes it just doesn't make sense to use a 32bit kernel.
Expecting 4GB of real memory seems silly.
Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 16:57 ` Eric W. Biederman
@ 2008-07-11 17:10 ` H. Peter Anvin
0 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-11 17:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Mike Travis, Jeremy Fitzhardinge,
Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner,
linux-kernel, Rusty Russell
Eric W. Biederman wrote:
>> x86_64 are small systems? For 64 bit use one would expect 4GB of memory.
>
> Anything over 1G where 32bit runs out of lowmem starts to be a win.
> Plus sometimes it just doesn't make sense to use a 32bit kernel.
>
> Expecting 4GB of real memory seems silly.
Especially since people set up VM instances small and then grow them.
They can have *weird* CPU-to-RAM ratios; I have heard of 8 VCPUs and
24 MB of RAM in production.
-hpa
* Re: [RFC 02/15] x86_64: Fold pda into per cpu area
2008-07-09 22:02 ` Eric W. Biederman
@ 2008-07-13 17:54 ` Ingo Molnar
2008-07-14 14:24 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: Ingo Molnar @ 2008-07-13 17:54 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mike Travis, Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> Mike Travis <travis@sgi.com> writes:
>
> > WARNING: there is still a FIXME in this patch (see arch/x86/kernel/acpi/sleep.c)
> >
> > * Declare the pda as a per cpu variable.
> >
> > * Make the x86_64 per cpu area start at zero.
> >
> > * Relocate the initial pda and per_cpu(gdt_page) in head_64.S for the
> > boot cpu (0). For secondary cpus, do_boot_cpu() sets up the correct
> > initial pda and gdt_page pointer.
> >
> > * Initialize per_cpu_offset to point to static pda in the per_cpu area
> > (@ __per_cpu_load).
> >
> > * After allocation of the per cpu area for the boot cpu (0), reload the
> > gdt page pointer.
> >
> > Based on linux-2.6.tip/master
>
> Given that we have not yet understood the weird failure case, this patch needs
> to be split in two.
> - make the current per cpu variable section zero based.
> - Move the pda into the per cpu variable section.
>
> There are too many variables at present in the reported failure cases to
> guess what is really going on.
>
> We can not optimize the per cpu variable accesses until the pda moves
> but we can easily test for linker and tool chain bugs with zero
> based pda segment itself.
agreed, a patch of this gravity and with a diffstat:
12 files changed, 112 insertions(+), 142 deletions(-)
is indeed too large. Test failures that get bisected to this patch will
still cause people to guess about which aspect of the large patch caused
the problem.
Ingo
* Re: [RFC 02/15] x86_64: Fold pda into per cpu area
2008-07-13 17:54 ` Ingo Molnar
@ 2008-07-14 14:24 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-14 14:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric W. Biederman, Jeremy Fitzhardinge, Andrew Morton,
H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel
Ingo Molnar wrote:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
...
>> Given that we have not yet understood the weird failure case, this patch needs
>> to be split in two.
>> - make the current per cpu variable section zero based.
>> - Move the pda into the per cpu variable section.
>>
>> There are too many variables at present in the reported failure cases to
>> guess what is really going on.
>>
>> We can not optimize the per cpu variable accesses until the pda moves
>> but we can easily test for linker and tool chain bugs with zero
>> based pda segment itself.
>
> agreed, a patch of this gravity and with a diffstat:
>
> 12 files changed, 112 insertions(+), 142 deletions(-)
>
> is indeed too large. Test failures that get bisected to this patch will
> still cause people to guess about which aspect of the large patch caused
> the problem.
>
> Ingo
That split has been done and I've sent it to Jeremy and Peter for further
review.
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-09 17:44 ` Jeremy Fitzhardinge
2008-07-09 18:09 ` Mike Travis
@ 2008-07-25 15:49 ` Mike Travis
2008-07-25 16:08 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-25 15:49 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins
Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Did the suspected linker bug issue ever get resolved?
>
> I don't believe so. I think Mike is getting very early crashes
> depending on some combination of gcc, linker and kernel config. Or
> something.
>
> This fragility makes me very nervous. It seems hard enough to get this
> stuff working with current tools; making it work over the whole range of
> supported tools looks like its going to be hard.
>
> J
FYI, I think it was a combination of errors that was causing my problems.
In any case, I've successfully compiled and booted Ingo's pesky config
config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and 4.2.4
on both Intel and AMD boxes. (As well as a variety of other configs.)
I think Hugh's change adding "text" to the BUILD_IRQ macro might also have
helped, since interrupts seemed to have always been involved in the panics.
That might explain the variability across gcc versions.
Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-25 15:49 ` Mike Travis
@ 2008-07-25 16:08 ` Jeremy Fitzhardinge
2008-07-25 16:46 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-25 16:08 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins
Mike Travis wrote:
> FYI, I think it was a combination of errors that was causing my problems.
> In any case, I've successfully compiled and booted Ingo's pesky config
> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and 4.2.4
> on both Intel and AMD boxes. (As well as a variety of other configs.)
>
Good. What compilers have you tested? Will it work over the complete
supported range?
> I think Hugh's change adding "text" to the BUILD_IRQ macro might also have
> helped, since interrupts seemed to have always been involved in the panics.
> That might explain the variability across gcc versions.
Yes, indeed. That was a nasty one, and will be very sensitive to the
exact order gcc decided to emit things.
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-25 16:08 ` Jeremy Fitzhardinge
@ 2008-07-25 16:46 ` Mike Travis
2008-07-25 16:58 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-25 16:46 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins
Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> FYI, I think it was a combination of errors that was causing my problems.
>> In any case, I've successfully compiled and booted Ingo's pesky config
>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and
>> 4.2.4
>> on both Intel and AMD boxes. (As well as a variety of other configs.)
>>
>
> Good. What compilers have you tested? Will it work over the complete
> supported range?
The oldest gcc I have available is 3.4.3 (and I just tried that one and
it worked.) So that one and the ones listed above I've verified using
Ingo's test config. (All other testing I'm using 4.2.3.)
>
>> I think Hugh's change adding "text" to the BUILD_IRQ macro might also
>> have
>> helped, since interrupts seemed to have always been involved in the
>> panics.
>> That might explain the variability across gcc versions.
>
> Yes, indeed. That was a nasty one, and will be very sensitive to the
> exact order gcc decided to emit things.
>
> J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-25 16:46 ` Mike Travis
@ 2008-07-25 16:58 ` Jeremy Fitzhardinge
2008-07-25 18:12 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-25 16:58 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins
Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
>
>> Mike Travis wrote:
>>
>>> FYI, I think it was a combination of errors that was causing my problems.
>>> In any case, I've successfully compiled and booted Ingo's pesky config
>>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and
>>> 4.2.4
>>> on both Intel and AMD boxes. (As well as a variety of other configs.)
>>>
>>>
>> Good. What compilers have you tested? Will it work over the complete
>> supported range?
>>
>
> The oldest gcc I have available is 3.4.3 (and I just tried that one and
> it worked.) So that one and the ones listed above I've verified using
> Ingo's test config. (All other testing I'm using 4.2.3.)
>
OK. I've got a range of toolchains on various test machines around
here, so I'll give it a spin. Are your most recent changes in tip.git yet?
J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-25 16:58 ` Jeremy Fitzhardinge
@ 2008-07-25 18:12 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-25 18:12 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins
Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> Jeremy Fitzhardinge wrote:
>>
>>> Mike Travis wrote:
>>>
>>>> FYI, I think it was a combination of errors that was causing my
>>>> problems.
>>>> In any case, I've successfully compiled and booted Ingo's pesky config
>>>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and
>>>> 4.2.4
>>>> on both Intel and AMD boxes. (As well as a variety of other configs.)
>>>>
>>> Good. What compilers have you tested? Will it work over the complete
>>> supported range?
>>>
>>
>> The oldest gcc I have available is 3.4.3 (and I just tried that one and
>> it worked.) So that one and the ones listed above I've verified using
>> Ingo's test config. (All other testing I'm using 4.2.3.)
>>
>
> OK. I've got a range of toolchains on various test machines around
> here, so I'll give it a spin. Are your most recent changes in tip.git yet?
>
> J
Almost there...
end of thread, other threads:[~2008-07-25 18:13 UTC | newest]
Thread overview: 190+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis
2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis
2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis
2008-07-09 22:02 ` Eric W. Biederman
2008-07-13 17:54 ` Ingo Molnar
2008-07-14 14:24 ` Mike Travis
2008-07-09 16:51 ` [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs Mike Travis
2008-07-09 16:51 ` [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops Mike Travis
2008-07-09 16:51 ` [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu() Mike Travis
2008-07-09 16:51 ` [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h Mike Travis
2008-07-09 16:51 ` [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h Mike Travis
2008-07-09 16:51 ` [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h Mike Travis
2008-07-09 16:51 ` [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h Mike Travis
2008-07-09 16:51 ` [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h Mike Travis
2008-07-09 16:51 ` [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h Mike Travis
2008-07-09 16:51 ` [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h Mike Travis
2008-07-09 16:51 ` [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h Mike Travis
2008-07-09 16:51 ` [RFC 14/15] x86_64: Remove xxx_pda() operations Mike Travis
2008-07-09 16:51 ` [RFC 15/15] x86_64: Remove cpu_pda() macro Mike Travis
2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin
2008-07-09 17:40 ` Mike Travis
2008-07-09 17:42 ` H. Peter Anvin
2008-07-09 18:05 ` Mike Travis
2008-07-09 17:44 ` Jeremy Fitzhardinge
2008-07-09 18:09 ` Mike Travis
2008-07-09 18:30 ` H. Peter Anvin
2008-07-09 19:34 ` Ingo Molnar
2008-07-09 19:44 ` H. Peter Anvin
2008-07-09 20:26 ` Adrian Bunk
2008-07-09 21:03 ` Mike Travis
2008-07-09 21:23 ` Jeremy Fitzhardinge
2008-07-25 15:49 ` Mike Travis
2008-07-25 16:08 ` Jeremy Fitzhardinge
2008-07-25 16:46 ` Mike Travis
2008-07-25 16:58 ` Jeremy Fitzhardinge
2008-07-25 18:12 ` Mike Travis
2008-07-09 17:27 ` Jeremy Fitzhardinge
2008-07-09 17:39 ` Christoph Lameter
2008-07-09 17:51 ` Jeremy Fitzhardinge
2008-07-09 18:14 ` Mike Travis
2008-07-09 18:22 ` Jeremy Fitzhardinge
2008-07-09 18:31 ` Mike Travis
2008-07-09 19:08 ` Jeremy Fitzhardinge
2008-07-09 18:02 ` Mike Travis
2008-07-09 18:13 ` Christoph Lameter
2008-07-09 18:26 ` Jeremy Fitzhardinge
2008-07-09 18:34 ` Christoph Lameter
2008-07-09 18:37 ` H. Peter Anvin
2008-07-09 18:48 ` Jeremy Fitzhardinge
2008-07-09 18:53 ` Christoph Lameter
2008-07-09 19:07 ` Jeremy Fitzhardinge
2008-07-09 19:12 ` Christoph Lameter
2008-07-09 19:32 ` Jeremy Fitzhardinge
2008-07-09 19:41 ` Ingo Molnar
2008-07-09 19:45 ` H. Peter Anvin
2008-07-09 19:52 ` Christoph Lameter
2008-07-09 20:00 ` Ingo Molnar
2008-07-09 20:09 ` Jeremy Fitzhardinge
2008-07-09 21:05 ` Mike Travis
2008-07-09 19:44 ` Christoph Lameter
2008-07-09 19:48 ` Jeremy Fitzhardinge
2008-07-09 18:27 ` Mike Travis
2008-07-09 18:46 ` Jeremy Fitzhardinge
2008-07-09 20:22 ` Eric W. Biederman
2008-07-09 20:35 ` Jeremy Fitzhardinge
2008-07-09 20:53 ` Eric W. Biederman
2008-07-09 21:03 ` Ingo Molnar
2008-07-09 21:16 ` H. Peter Anvin
2008-07-09 21:10 ` Arjan van de Ven
2008-07-09 23:20 ` Eric W. Biederman
2008-07-09 18:31 ` H. Peter Anvin
2008-07-09 18:00 ` Mike Travis
2008-07-09 19:05 ` Jeremy Fitzhardinge
2008-07-09 19:28 ` Ingo Molnar
2008-07-09 20:55 ` Mike Travis
2008-07-09 21:12 ` Ingo Molnar
2008-07-09 20:00 ` Eric W. Biederman
2008-07-09 20:05 ` Jeremy Fitzhardinge
2008-07-09 20:15 ` Ingo Molnar
2008-07-09 20:07 ` Ingo Molnar
2008-07-09 20:11 ` Jeremy Fitzhardinge
2008-07-09 20:18 ` Christoph Lameter
2008-07-09 20:33 ` Jeremy Fitzhardinge
2008-07-09 20:42 ` H. Peter Anvin
2008-07-09 20:48 ` Jeremy Fitzhardinge
2008-07-09 21:06 ` Eric W. Biederman
2008-07-09 21:16 ` H. Peter Anvin
2008-07-09 21:20 ` Jeremy Fitzhardinge
2008-07-09 21:25 ` Christoph Lameter
2008-07-09 21:36 ` H. Peter Anvin
2008-07-09 21:41 ` Jeremy Fitzhardinge
2008-07-09 22:22 ` Eric W. Biederman
2008-07-09 22:32 ` Jeremy Fitzhardinge
2008-07-09 23:36 ` Eric W. Biederman
2008-07-10 0:19 ` H. Peter Anvin
2008-07-10 0:24 ` Jeremy Fitzhardinge
2008-07-10 14:14 ` Christoph Lameter
2008-07-10 14:26 ` H. Peter Anvin
2008-07-10 15:26 ` Christoph Lameter
2008-07-10 15:42 ` H. Peter Anvin
2008-07-10 16:24 ` Christoph Lameter
2008-07-10 16:33 ` H. Peter Anvin
2008-07-10 16:45 ` Christoph Lameter
2008-07-10 17:33 ` Jeremy Fitzhardinge
2008-07-10 17:42 ` Christoph Lameter
2008-07-10 17:53 ` Jeremy Fitzhardinge
2008-07-10 17:55 ` H. Peter Anvin
2008-07-10 20:52 ` Christoph Lameter
2008-07-10 20:58 ` Jeremy Fitzhardinge
2008-07-10 21:03 ` H. Peter Anvin
2008-07-11 0:55 ` Mike Travis
2008-07-10 21:05 ` Christoph Lameter
2008-07-10 21:22 ` Eric W. Biederman
2008-07-10 21:29 ` H. Peter Anvin
2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
2008-07-11 0:58 ` Mike Travis
2008-07-11 1:41 ` H. Peter Anvin
2008-07-11 15:37 ` Christoph Lameter
2008-07-11 0:42 ` Eric W. Biederman
2008-07-11 15:36 ` Christoph Lameter
2008-07-10 17:53 ` H. Peter Anvin
2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
2008-07-10 19:11 ` Mike Travis
2008-07-10 19:12 ` Eric W. Biederman
2008-07-10 17:46 ` Mike Travis
2008-07-10 17:51 ` H. Peter Anvin
2008-07-10 19:09 ` Eric W. Biederman
2008-07-10 19:18 ` Mike Travis
2008-07-10 19:32 ` H. Peter Anvin
2008-07-10 23:37 ` Mike Travis
2008-07-10 20:17 ` Eric W. Biederman
2008-07-10 20:24 ` Ingo Molnar
2008-07-10 21:33 ` Eric W. Biederman
2008-07-11 1:39 ` Mike Travis
2008-07-11 2:57 ` Eric W. Biederman
2008-07-10 0:23 ` Jeremy Fitzhardinge
2008-07-09 20:35 ` H. Peter Anvin
2008-07-09 20:39 ` Arjan van de Ven
2008-07-09 20:44 ` H. Peter Anvin
2008-07-09 20:50 ` Jeremy Fitzhardinge
2008-07-09 21:12 ` H. Peter Anvin
2008-07-09 21:26 ` Jeremy Fitzhardinge
2008-07-09 21:37 ` H. Peter Anvin
2008-07-09 22:10 ` Eric W. Biederman
2008-07-09 22:23 ` H. Peter Anvin
2008-07-09 23:54 ` Eric W. Biederman
2008-07-10 16:22 ` Mike Travis
2008-07-10 16:25 ` H. Peter Anvin
2008-07-10 16:35 ` Christoph Lameter
2008-07-10 16:39 ` H. Peter Anvin
2008-07-10 16:47 ` Christoph Lameter
2008-07-10 17:21 ` Jeremy Fitzhardinge
2008-07-10 17:31 ` Christoph Lameter
2008-07-10 17:48 ` Jeremy Fitzhardinge
2008-07-10 18:00 ` H. Peter Anvin
2008-07-10 17:20 ` Mike Travis
2008-07-10 17:07 ` Jeremy Fitzhardinge
2008-07-10 17:12 ` Christoph Lameter
2008-07-10 17:25 ` Jeremy Fitzhardinge
2008-07-10 17:34 ` Christoph Lameter
2008-07-10 17:41 ` Mike Travis
2008-07-10 18:01 ` H. Peter Anvin
2008-07-10 20:51 ` Christoph Lameter
2008-07-10 20:58 ` H. Peter Anvin
2008-07-10 21:07 ` Christoph Lameter
2008-07-10 21:11 ` H. Peter Anvin
2008-07-11 15:32 ` Christoph Lameter
2008-07-11 16:07 ` H. Peter Anvin
2008-07-11 16:57 ` Eric W. Biederman
2008-07-11 17:10 ` H. Peter Anvin
2008-07-10 21:26 ` Eric W. Biederman
2008-07-10 18:48 ` Eric W. Biederman
2008-07-10 18:54 ` Jeremy Fitzhardinge
2008-07-10 19:18 ` Eric W. Biederman
2008-07-10 19:56 ` Jeremy Fitzhardinge
2008-07-10 20:22 ` Eric W. Biederman
2008-07-10 20:54 ` Jeremy Fitzhardinge
2008-07-11 6:59 ` Rusty Russell
2008-07-10 20:25 ` Eric W. Biederman
2008-07-10 17:57 ` H. Peter Anvin
2008-07-10 18:08 ` H. Peter Anvin
2008-07-09 20:46 ` Jeremy Fitzhardinge
2008-07-09 20:14 ` Arjan van de Ven
2008-07-09 20:33 ` Eric W. Biederman
2008-07-09 21:01 ` Ingo Molnar
2008-07-09 21:39 ` Mike Travis
2008-07-09 21:47 ` Jeremy Fitzhardinge
2008-07-09 21:55 ` Eric W. Biederman