* [RFC 00/15] x86_64: Optimize percpu accesses
@ 2008-07-09 16:51 Mike Travis
2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis
` (18 more replies)
0 siblings, 19 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin,
Christoph Lameter, Jack Steiner, linux-kernel
This patchset provides the following:
* Cleanup: Fix early references to cpumask_of_cpu(0)
Provides an early cpumask_of_cpu(0) usable before the cpumask_of_cpu_map
is allocated and initialized.
* Generic: Percpu infrastructure to rebase the per cpu area to zero
This allows percpu variables to be accessed through a local
register instead of going through a table on node 0 to find the
cpu-specific offsets. It also allows atomic operations on percpu
variables, reducing the locking required.
Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
generic code that the arch has this new basing.
(Note: split into two patches, one to rebase percpu variables at 0,
and the second to actually use %gs as the base for percpu variables.)
* x86_64: Fold pda into per cpu area
Declare the pda as a per cpu variable. This will move the pda
area to an address accessible by the x86_64 per cpu macros.
Subtracting __per_cpu_start makes the offset relative to the
beginning of the per cpu area. Since %gs points to the pda, it
then also points to the per cpu variables, which can be
accessed as:
%gs:[&per_cpu_xxxx - __per_cpu_start]
* x86_64: Rebase per cpu variables to zero
Take advantage of the zero-based per cpu area provided above.
Then we can directly use the x86_32 percpu operations. x86_32
offsets %fs by __per_cpu_start; x86_64 has %gs pointing directly
to the pda and the per cpu area, thereby allowing access to the
pda with the x86_64 pda operations and to the per cpu variables
with the x86_32 percpu operations.
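The offset arithmetic the cover letter describes can be sketched in plain C. This is a userspace simulation, not kernel code: the array, sizes, and the nonzero "old" __per_cpu_start value below are invented stand-ins for the real link-time layout, and the final base+offset add is what becomes a single %gs-relative instruction on x86_64.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS   4
#define AREA_SIZE 256

/* stand-in for the per-cpu copies of the .data.percpu section */
static unsigned char percpu_area[NR_CPUS][AREA_SIZE];

/* old scheme: percpu symbols are linked at addresses starting at a
 * nonzero __per_cpu_start, so every access must rebase the symbol */
static const uintptr_t old_per_cpu_start = 0x1000;

static void *old_access(int cpu, uintptr_t var_addr)
{
    /* two steps: fetch this cpu's base, then subtract __per_cpu_start */
    uintptr_t base = (uintptr_t)percpu_area[cpu];
    return (void *)(base + (var_addr - old_per_cpu_start));
}

static void *zero_based_access(int cpu, uintptr_t var_offset)
{
    /* zero-based scheme: the symbol value already IS the offset from
     * the segment base, so base + symbol suffices */
    uintptr_t base = (uintptr_t)percpu_area[cpu];
    return (void *)(base + var_offset);
}
```

Both paths resolve to the same byte for the same variable; zero-basing just removes the subtraction (and, in the kernel, the table lookup behind it).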
Based on linux-2.6.tip/master with following patches applied:
[PATCH 1/1] x86: Add check for node passed to node_to_cpumask V3
[PATCH 1/1] x86: Change _node_to_cpumask_ptr to return const ptr
[PATCH 1/1] sched: Reduce stack size in isolated_cpu_setup()
[PATCH 1/1] kthread: Reduce stack pressure in create_kthread and kthreadd
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
--
^ permalink raw reply [flat|nested] 190+ messages in thread* [RFC 01/15] x86_64: Cleanup early setup_percpu references 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis ` (17 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: cleanup_percpu --] [-- Type: text/plain, Size: 5644 bytes --] * Initialize the cpumask_of_cpu_map to contain a cpumask for cpu 0 in the initdata section. This allows references before the real cpumask_of_cpu_map is setup avoiding possible null pointer deref panics. * Ruggedize some other calls to prevent mishaps from early calls, pariticularly in non-critical functions: * Cleanup DEBUG_PER_CPU_MAPS usages and some comments. Based on linux-2.6.tip/master Signed-off-by: Mike Travis <travis@sgi.com> --- arch/x86/kernel/setup_percpu.c | 73 ++++++++++++++++++++++++++++------------- 1 file changed, 51 insertions(+), 22 deletions(-) --- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c +++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c @@ -15,6 +15,12 @@ #include <asm/apicdef.h> #include <asm/highmem.h> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS +# define DBG(x...) printk(KERN_DEBUG x) +#else +# define DBG(x...) 
+#endif + #ifdef CONFIG_X86_LOCAL_APIC unsigned int num_processors; unsigned disabled_cpus __cpuinitdata; @@ -27,31 +33,39 @@ EXPORT_SYMBOL(boot_cpu_physical_apicid); physid_mask_t phys_cpu_present_map; #endif -/* map cpu index to physical APIC ID */ +/* + * Map cpu index to physical APIC ID + */ DEFINE_EARLY_PER_CPU(u16, x86_cpu_to_apicid, BAD_APICID); DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid, BAD_APICID); EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid); EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid); #if defined(CONFIG_NUMA) && defined(CONFIG_X86_64) -#define X86_64_NUMA 1 +#define X86_64_NUMA 1 /* (used later) */ -/* map cpu index to node index */ +/* + * Map cpu index to node index + */ DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE); EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map); -/* which logical CPUs are on which nodes */ +/* + * Which logical CPUs are on which nodes + */ cpumask_t *node_to_cpumask_map; EXPORT_SYMBOL(node_to_cpumask_map); -/* setup node_to_cpumask_map */ +/* + * Setup node_to_cpumask_map + */ static void __init setup_node_to_cpumask_map(void); #else static inline void setup_node_to_cpumask_map(void) { } #endif -#if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_SMP) +#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA /* * Copy data used in early init routines from the initial arrays to the * per cpu data areas. These arrays then become expendable and the @@ -81,16 +95,25 @@ static void __init setup_per_cpu_maps(vo } #ifdef CONFIG_HAVE_CPUMASK_OF_CPU_MAP -cpumask_t *cpumask_of_cpu_map __read_mostly; +/* + * Configure an initial cpumask_of_cpu(0) for early users + */ +static cpumask_t initial_cpumask_of_cpu_map __initdata = (cpumask_t) { { + [BITS_TO_LONGS(NR_CPUS)-1] = 1 +} }; +cpumask_t *cpumask_of_cpu_map __read_mostly = + (cpumask_t *)&initial_cpumask_of_cpu_map; EXPORT_SYMBOL(cpumask_of_cpu_map); -/* requires nr_cpu_ids to be initialized */ +/* Requires nr_cpu_ids to be initialized. 
*/ static void __init setup_cpumask_of_cpu(void) { int i; /* alloc_bootmem zeroes memory */ cpumask_of_cpu_map = alloc_bootmem_low(sizeof(cpumask_t) * nr_cpu_ids); + DBG("cpumask_of_cpu_map %p\n", cpumask_of_cpu_map); + for (i = 0; i < nr_cpu_ids; i++) cpu_set(i, cpumask_of_cpu_map[i]); } @@ -197,9 +220,10 @@ void __init setup_per_cpu_areas(void) per_cpu_offset(cpu) = ptr - __per_cpu_start; memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start); + DBG("PERCPU: cpu %4d %p\n", cpu, ptr); } - printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n", + printk(KERN_INFO "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n", NR_CPUS, nr_cpu_ids, nr_node_ids); /* Setup percpu data maps */ @@ -221,6 +245,7 @@ void __init setup_per_cpu_areas(void) * Requires node_possible_map to be valid. * * Note: node_to_cpumask() is not valid until after this is done. + * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.) */ static void __init setup_node_to_cpumask_map(void) { @@ -236,9 +261,7 @@ static void __init setup_node_to_cpumask /* allocate the map */ map = alloc_bootmem_low(nr_node_ids * sizeof(cpumask_t)); - - Dprintk(KERN_DEBUG "Node to cpumask map at %p for %d nodes\n", - map, nr_node_ids); + DBG("node_to_cpumask_map at %p for %d nodes\n", map, nr_node_ids); /* node_to_cpumask() will now work */ node_to_cpumask_map = map; @@ -248,17 +271,23 @@ void __cpuinit numa_set_node(int cpu, in { int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); - if (cpu_pda(cpu) && node != NUMA_NO_NODE) - cpu_pda(cpu)->nodenumber = node; - - if (cpu_to_node_map) + /* early setting, no percpu area yet */ + if (cpu_to_node_map) { cpu_to_node_map[cpu] = node; + return; + } - else if (per_cpu_offset(cpu)) - per_cpu(x86_cpu_to_node_map, cpu) = node; +#ifdef CONFIG_DEBUG_PER_CPU_MAPS + if (cpu >= nr_cpu_ids || !per_cpu_offset(cpu)) { + printk(KERN_ERR "numa_set_node: invalid cpu# (%d)\n", cpu); + dump_stack(); + return; + } +#endif + per_cpu(x86_cpu_to_node_map, cpu) = node; - else - 
Dprintk(KERN_INFO "Setting node for non-present cpu %d\n", cpu); + if (node != NUMA_NO_NODE) + cpu_pda(cpu)->nodenumber = node; } void __cpuinit numa_clear_node(int cpu) @@ -275,7 +304,7 @@ void __cpuinit numa_add_cpu(int cpu) void __cpuinit numa_remove_cpu(int cpu) { - cpu_clear(cpu, node_to_cpumask_map[cpu_to_node(cpu)]); + cpu_clear(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]); } #else /* CONFIG_DEBUG_PER_CPU_MAPS */ @@ -285,7 +314,7 @@ void __cpuinit numa_remove_cpu(int cpu) */ static void __cpuinit numa_set_cpumask(int cpu, int enable) { - int node = cpu_to_node(cpu); + int node = early_cpu_to_node(cpu); cpumask_t *mask; char buf[64]; -- ^ permalink raw reply [flat|nested] 190+ messages in thread
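The trick patch 01 relies on, a statically initialized cpumask usable before the bootmem-allocated map exists, can be sketched with a designated initializer. This is a simplified illustration with made-up types and sizes, not the kernel's exact cpumask_t layout or the patch's exact initializer:

```c
#include <assert.h>
#include <limits.h>

#define NR_CPUS 64
#define BITS_TO_LONGS(n) \
    (((n) + CHAR_BIT * sizeof(long) - 1) / (CHAR_BIT * sizeof(long)))

/* simplified stand-in for the kernel's cpumask_t */
typedef struct { unsigned long bits[BITS_TO_LONGS(NR_CPUS)]; } cpumask_t;

/* static mask with only cpu 0's bit set: safe for very early callers
 * of cpumask_of_cpu(0), before the real map is allocated */
static cpumask_t initial_cpumask_of_cpu0 = { .bits = { [0] = 1UL } };

static int cpu_isset_bit(int cpu, const cpumask_t *m)
{
    int bpl = CHAR_BIT * sizeof(long);
    return (int)((m->bits[cpu / bpl] >> (cpu % bpl)) & 1);
}
```

Because the initializer lives in static storage, no allocation has to happen before the first reference, which is exactly what avoids the early null-pointer dereference the patch describes.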
* [RFC 02/15] x86_64: Fold pda into per cpu area 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis 2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 22:02 ` Eric W. Biederman 2008-07-09 16:51 ` [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs Mike Travis ` (16 subsequent siblings) 18 siblings, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: zero_based_fold --] [-- Type: text/plain, Size: 15082 bytes --] WARNING: there is still a FIXME in this patch (see arch/x86/kernel/acpi/sleep.c) * Declare the pda as a per cpu variable. * Make the x86_64 per cpu area start at zero. * Relocate the initial pda and per_cpu(gdt_page) in head_64.S for the boot cpu (0). For secondary cpus, do_boot_cpu() sets up the correct initial pda and gdt_page pointer. * Initialize per_cpu_offset to point to static pda in the per_cpu area (@ __per_cpu_load). * After allocation of the per cpu area for the boot cpu (0), reload the gdt page pointer. 
Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- arch/x86/Kconfig | 3 + arch/x86/kernel/acpi/sleep.c | 9 +++ arch/x86/kernel/cpu/common_64.c | 4 - arch/x86/kernel/head64.c | 24 +-------- arch/x86/kernel/head_64.S | 45 ++++++++++++++++-- arch/x86/kernel/setup_percpu.c | 94 +++++++++++++++++---------------------- arch/x86/kernel/smpboot.c | 52 --------------------- arch/x86/kernel/vmlinux_64.lds.S | 1 include/asm-x86/desc.h | 5 ++ include/asm-x86/pda.h | 3 - include/asm-x86/percpu.h | 13 ----- include/asm-x86/trampoline.h | 1 12 files changed, 112 insertions(+), 142 deletions(-) --- linux-2.6.tip.orig/arch/x86/Kconfig +++ linux-2.6.tip/arch/x86/Kconfig @@ -129,6 +129,9 @@ config HAVE_SETUP_PER_CPU_AREA config HAVE_CPUMASK_OF_CPU_MAP def_bool X86_64_SMP +config HAVE_ZERO_BASED_PER_CPU + def_bool X86_64_SMP + config ARCH_HIBERNATION_POSSIBLE def_bool y depends on !SMP || !X86_VOYAGER --- linux-2.6.tip.orig/arch/x86/kernel/acpi/sleep.c +++ linux-2.6.tip/arch/x86/kernel/acpi/sleep.c @@ -89,6 +89,15 @@ int acpi_save_state_mem(void) #ifdef CONFIG_SMP stack_start.sp = temp_stack + 4096; #endif + /* + * FIXME: with zero-based percpu variables, the pda and gdt_page + * addresses must be offset by the base of this cpu's percpu area. + * Where/how should we do this? 
+ * + * for secondary cpu startup in smpboot.c:do_boot_cpu() this is done: + * early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu); + * initial_pda = (unsigned long)get_cpu_pda(cpu); + */ initial_code = (unsigned long)wakeup_long64; saved_magic = 0x123456789abcdef0; #endif /* CONFIG_64BIT */ --- linux-2.6.tip.orig/arch/x86/kernel/cpu/common_64.c +++ linux-2.6.tip/arch/x86/kernel/cpu/common_64.c @@ -423,8 +423,8 @@ __setup("clearcpuid=", setup_disablecpui cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE; -struct x8664_pda **_cpu_pda __read_mostly; -EXPORT_SYMBOL(_cpu_pda); +DEFINE_PER_CPU_FIRST(struct x8664_pda, pda); +EXPORT_PER_CPU_SYMBOL(pda); struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table }; --- linux-2.6.tip.orig/arch/x86/kernel/head64.c +++ linux-2.6.tip/arch/x86/kernel/head64.c @@ -25,20 +25,6 @@ #include <asm/e820.h> #include <asm/bios_ebda.h> -/* boot cpu pda */ -static struct x8664_pda _boot_cpu_pda __read_mostly; - -#ifdef CONFIG_SMP -/* - * We install an empty cpu_pda pointer table to indicate to early users - * (numa_set_node) that the cpu_pda pointer table for cpus other than - * the boot cpu is not yet setup. 
- */ -static struct x8664_pda *__cpu_pda[NR_CPUS] __initdata; -#else -static struct x8664_pda *__cpu_pda[NR_CPUS] __read_mostly; -#endif - static void __init zap_identity_mappings(void) { pgd_t *pgd = pgd_offset_k(0UL); @@ -91,6 +77,10 @@ void __init x86_64_start_kernel(char * r /* Cleanup the over mapped high alias */ cleanup_highmap(); + /* Initialize boot cpu_pda data */ + /* (See head_64.S for earlier pda/gdt initialization) */ + pda_init(0); + for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) { #ifdef CONFIG_EARLY_PRINTK set_intr_gate(i, &early_idt_handlers[i]); @@ -102,12 +92,6 @@ void __init x86_64_start_kernel(char * r early_printk("Kernel alive\n"); - _cpu_pda = __cpu_pda; - cpu_pda(0) = &_boot_cpu_pda; - pda_init(0); - - early_printk("Kernel really alive\n"); - copy_bootdata(__va(real_mode_data)); reserve_early(__pa_symbol(&_text), __pa_symbol(&_end), "TEXT DATA BSS"); --- linux-2.6.tip.orig/arch/x86/kernel/head_64.S +++ linux-2.6.tip/arch/x86/kernel/head_64.S @@ -12,6 +12,7 @@ #include <linux/linkage.h> #include <linux/threads.h> #include <linux/init.h> +#include <asm/asm-offsets.h> #include <asm/desc.h> #include <asm/segment.h> #include <asm/pgtable.h> @@ -203,7 +204,27 @@ ENTRY(secondary_startup_64) * addresses where we're currently running on. We have to do that here * because in 32bit we couldn't load a 64bit linear address. */ - lgdt early_gdt_descr(%rip) + +#ifdef CONFIG_SMP + /* + * For zero-based percpu variables, the base (__per_cpu_load) must + * be added to the offset of per_cpu__gdt_page. This is only needed + * for the boot cpu but we can't do this prior to secondary_startup_64. + * So we use a NULL gdt adrs to indicate that we are starting up the + * boot cpu and not the secondary cpus. do_boot_cpu() will fixup + * the gdt adrs for those cpus. 
+ */ +#define PER_CPU_GDT_PAGE 0 + movq early_gdt_descr_base(%rip), %rax + testq %rax, %rax + jnz 1f + movq $__per_cpu_load, %rax + addq $per_cpu__gdt_page, %rax + movq %rax, early_gdt_descr_base(%rip) +#else +#define PER_CPU_GDT_PAGE per_cpu__gdt_page +#endif +1: lgdt early_gdt_descr(%rip) /* set up data segments. actually 0 would do too */ movl $__KERNEL_DS,%eax @@ -220,14 +241,21 @@ ENTRY(secondary_startup_64) movl %eax,%gs /* - * Setup up a dummy PDA. this is just for some early bootup code - * that does in_interrupt() + * Setup up the real PDA. + * + * For SMP, the boot cpu (0) uses the static pda which is the first + * element in the percpu area (@__per_cpu_load). This pda is moved + * to the real percpu area once that is allocated. Secondary cpus + * will use the initial_pda value setup in do_boot_cpu(). */ movl $MSR_GS_BASE,%ecx - movq $empty_zero_page,%rax + movq initial_pda(%rip), %rax movq %rax,%rdx shrq $32,%rdx wrmsr +#ifdef CONFIG_SMP + movq %rax, %gs:pda_data_offset +#endif /* esi is pointer to real mode structure with interesting info. pass it to C */ @@ -250,6 +278,12 @@ ENTRY(secondary_startup_64) .align 8 ENTRY(initial_code) .quad x86_64_start_kernel + ENTRY(initial_pda) +#ifdef CONFIG_SMP + .quad __per_cpu_load # Overwritten for secondary CPUs +#else + .quad per_cpu__pda +#endif __FINITDATA ENTRY(stack_start) @@ -394,7 +428,8 @@ NEXT_PAGE(level2_spare_pgt) .globl early_gdt_descr early_gdt_descr: .word GDT_ENTRIES*8-1 - .quad per_cpu__gdt_page +early_gdt_descr_base: + .quad PER_CPU_GDT_PAGE # Overwritten for secondary CPUs ENTRY(phys_base) /* This must match the first entry in level2_kernel_pgt */ --- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c +++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c @@ -14,6 +14,7 @@ #include <asm/mpspec.h> #include <asm/apicdef.h> #include <asm/highmem.h> +#include <asm/desc.h> #ifdef CONFIG_DEBUG_PER_CPU_MAPS # define DBG(x...) 
printk(KERN_DEBUG x) @@ -119,63 +120,26 @@ static void __init setup_cpumask_of_cpu( static inline void setup_cpumask_of_cpu(void) { } #endif -#ifdef CONFIG_X86_32 /* - * Great future not-so-futuristic plan: make i386 and x86_64 do it - * the same way + * Pointers to per cpu areas for each cpu */ +#ifndef CONFIG_HAVE_ZERO_BASED_PER_CPU unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; EXPORT_SYMBOL(__per_cpu_offset); -static inline void setup_cpu_pda_map(void) { } - -#elif !defined(CONFIG_SMP) -static inline void setup_cpu_pda_map(void) { } - -#else /* CONFIG_SMP && CONFIG_X86_64 */ +#else /* - * Allocate cpu_pda pointer table and array via alloc_bootmem. + * Initialize percpu offset for boot cpu (0) to static percpu area + * for referencing very early in kernel startup. */ -static void __init setup_cpu_pda_map(void) -{ - char *pda; - struct x8664_pda **new_cpu_pda; - unsigned long size; - int cpu; - - size = roundup(sizeof(struct x8664_pda), cache_line_size()); - - /* allocate cpu_pda array and pointer table */ - { - unsigned long tsize = nr_cpu_ids * sizeof(void *); - unsigned long asize = size * (nr_cpu_ids - 1); - - tsize = roundup(tsize, cache_line_size()); - new_cpu_pda = alloc_bootmem(tsize + asize); - pda = (char *)new_cpu_pda + tsize; - } - - /* initialize pointer table to static pda's */ - for_each_possible_cpu(cpu) { - if (cpu == 0) { - /* leave boot cpu pda in place */ - new_cpu_pda[0] = cpu_pda(0); - continue; - } - new_cpu_pda[cpu] = (struct x8664_pda *)pda; - new_cpu_pda[cpu]->in_bootmem = 1; - pda += size; - } - - /* point to new pointer table */ - _cpu_pda = new_cpu_pda; -} +unsigned long __per_cpu_offset[NR_CPUS] __read_mostly = { + [0] = (unsigned long)__per_cpu_load +}; +EXPORT_SYMBOL(__per_cpu_offset); #endif /* - * Great future plan: - * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data. - * Always point %gs to its beginning + * Allocate and initialize the per cpu areas which include the PDAs. 
*/ void __init setup_per_cpu_areas(void) { @@ -193,9 +157,6 @@ void __init setup_per_cpu_areas(void) nr_cpu_ids = num_processors; #endif - /* Setup cpu_pda map */ - setup_cpu_pda_map(); - /* Copy section for each CPU (we discard the original) */ size = PERCPU_ENOUGH_ROOM; printk(KERN_INFO "PERCPU: Allocating %zd bytes of per cpu data\n", @@ -215,10 +176,41 @@ void __init setup_per_cpu_areas(void) else ptr = alloc_bootmem_pages_node(NODE_DATA(node), size); #endif + /* Initialize each cpu's per_cpu area and save pointer */ + memcpy(ptr, __per_cpu_load, __per_cpu_size); per_cpu_offset(cpu) = ptr - __per_cpu_start; - memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start); DBG("PERCPU: cpu %4d %p\n", cpu, ptr); + +#ifdef CONFIG_X86_64 + /* + * Note the boot cpu (0) has been using the static per_cpu load + * area for it's pda. We need to zero out the pdas for the + * other cpus that are coming online. + * + * Additionally, for the boot cpu the gdt page must be reloaded + * as we moved it from the static per cpu area to the newly + * allocated area. + */ + { + /* We rely on the fact that pda is the first element */ + struct x8664_pda *pda = (struct x8664_pda *)ptr; + + if (cpu) { + memset(pda, 0, sizeof(*pda)); + pda->data_offset = (unsigned long)ptr; + } else { + struct desc_ptr gdt_descr = early_gdt_descr; + + pda->data_offset = (unsigned long)ptr; + gdt_descr.address = + (unsigned long)get_cpu_gdt_table(0); + native_load_gdt(&gdt_descr); + pda_init(0); + } + + } +#endif } printk(KERN_INFO "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n", --- linux-2.6.tip.orig/arch/x86/kernel/smpboot.c +++ linux-2.6.tip/arch/x86/kernel/smpboot.c @@ -762,45 +762,6 @@ static void __cpuinit do_fork_idle(struc complete(&c_idle->done); } -#ifdef CONFIG_X86_64 -/* - * Allocate node local memory for the AP pda. - * - * Must be called after the _cpu_pda pointer table is initialized. 
- */ -static int __cpuinit get_local_pda(int cpu) -{ - struct x8664_pda *oldpda, *newpda; - unsigned long size = sizeof(struct x8664_pda); - int node = cpu_to_node(cpu); - - if (cpu_pda(cpu) && !cpu_pda(cpu)->in_bootmem) - return 0; - - oldpda = cpu_pda(cpu); - newpda = kmalloc_node(size, GFP_ATOMIC, node); - if (!newpda) { - printk(KERN_ERR "Could not allocate node local PDA " - "for CPU %d on node %d\n", cpu, node); - - if (oldpda) - return 0; /* have a usable pda */ - else - return -1; - } - - if (oldpda) { - memcpy(newpda, oldpda, size); - if (!after_bootmem) - free_bootmem((unsigned long)oldpda, size); - } - - newpda->in_bootmem = 0; - cpu_pda(cpu) = newpda; - return 0; -} -#endif /* CONFIG_X86_64 */ - static int __cpuinit do_boot_cpu(int apicid, int cpu) /* * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad @@ -818,16 +779,6 @@ static int __cpuinit do_boot_cpu(int api }; INIT_WORK(&c_idle.work, do_fork_idle); -#ifdef CONFIG_X86_64 - /* Allocate node local memory for AP pdas */ - if (cpu > 0) { - boot_error = get_local_pda(cpu); - if (boot_error) - goto restore_state; - /* if can't get pda memory, can't start cpu */ - } -#endif - alternatives_smp_switch(1); c_idle.idle = get_idle_for_cpu(cpu); @@ -865,6 +816,7 @@ do_rest: #else cpu_pda(cpu)->pcurrent = c_idle.idle; clear_tsk_thread_flag(c_idle.idle, TIF_FORK); + initial_pda = (unsigned long)get_cpu_pda(cpu); #endif early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu); initial_code = (unsigned long)start_secondary; @@ -940,8 +892,6 @@ do_rest: } } -restore_state: - if (boot_error) { /* Try to put things back the way they were before ... 
*/ numa_remove_cpu(cpu); /* was set by numa_add_cpu */ --- linux-2.6.tip.orig/arch/x86/kernel/vmlinux_64.lds.S +++ linux-2.6.tip/arch/x86/kernel/vmlinux_64.lds.S @@ -16,6 +16,7 @@ jiffies_64 = jiffies; _proxy_pda = 1; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ + percpu PT_LOAD FLAGS(7); /* RWE */ data PT_LOAD FLAGS(7); /* RWE */ user PT_LOAD FLAGS(7); /* RWE */ data.init PT_LOAD FLAGS(7); /* RWE */ --- linux-2.6.tip.orig/include/asm-x86/desc.h +++ linux-2.6.tip/include/asm-x86/desc.h @@ -41,6 +41,11 @@ static inline struct desc_struct *get_cp #ifdef CONFIG_X86_64 +static inline struct x8664_pda *get_cpu_pda(unsigned int cpu) +{ + return &per_cpu(pda, cpu); +} + static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func, unsigned dpl, unsigned ist, unsigned seg) { --- linux-2.6.tip.orig/include/asm-x86/pda.h +++ linux-2.6.tip/include/asm-x86/pda.h @@ -37,10 +37,9 @@ struct x8664_pda { unsigned irq_spurious_count; } ____cacheline_aligned_in_smp; -extern struct x8664_pda **_cpu_pda; extern void pda_init(int); -#define cpu_pda(i) (_cpu_pda[i]) +#define cpu_pda(cpu) (&per_cpu(pda, cpu)) /* * There is no fast way to get the base address of the PDA, all the accesses --- linux-2.6.tip.orig/include/asm-x86/percpu.h +++ linux-2.6.tip/include/asm-x86/percpu.h @@ -3,20 +3,11 @@ #ifdef CONFIG_X86_64 #include <linux/compiler.h> - -/* Same as asm-generic/percpu.h, except that we store the per cpu offset - in the PDA. 
Longer term the PDA and every per cpu variable - should be just put into a single section and referenced directly - from %gs */ - -#ifdef CONFIG_SMP #include <asm/pda.h> -#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset) +/* Same as asm-generic/percpu.h */ +#ifdef CONFIG_SMP #define __my_cpu_offset read_pda(data_offset) - -#define per_cpu_offset(x) (__per_cpu_offset(x)) - #endif #include <asm-generic/percpu.h> --- linux-2.6.tip.orig/include/asm-x86/trampoline.h +++ linux-2.6.tip/include/asm-x86/trampoline.h @@ -12,6 +12,7 @@ extern unsigned char *trampoline_base; extern unsigned long init_rsp; extern unsigned long initial_code; +extern unsigned long initial_pda; #define TRAMPOLINE_BASE 0x6000 extern unsigned long setup_trampoline(void); -- ^ permalink raw reply [flat|nested] 190+ messages in thread
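Patch 02's setup_per_cpu_areas() depends on one invariant: the pda is the first element of the percpu area, so the freshly allocated area pointer can be cast straight to the pda and data_offset stored through it. A minimal userspace sketch of that invariant (fake_pda and its fields are invented, not the kernel's struct x8664_pda):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fake_pda {
    unsigned long data_offset;   /* must stay the first field */
    int cpunumber;
};

#define AREA 128
static unsigned char percpu_copy[AREA];   /* one cpu's allocated area */

/* mimics the patch: zero the pda at the start of the area, then record
 * the area's own address as this cpu's percpu offset */
static struct fake_pda *init_cpu_area(unsigned char *ptr)
{
    /* valid only because the pda is the first percpu variable */
    struct fake_pda *pda = (struct fake_pda *)ptr;

    memset(pda, 0, sizeof(*pda));
    pda->data_offset = (uintptr_t)ptr;
    return pda;
}
```

If anything were linked ahead of the pda in the percpu section, the cast would silently corrupt it, which is why the kernel patch uses a DEFINE_PER_CPU_FIRST-style placement.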
* Re: [RFC 02/15] x86_64: Fold pda into per cpu area 2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis @ 2008-07-09 22:02 ` Eric W. Biederman 2008-07-13 17:54 ` Ingo Molnar 0 siblings, 1 reply; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 22:02 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Mike Travis <travis@sgi.com> writes: > WARNING: there is still a FIXME in this patch (see arch/x86/kernel/acpi/sleep.c) > > * Declare the pda as a per cpu variable. > > * Make the x86_64 per cpu area start at zero. > > * Relocate the initial pda and per_cpu(gdt_page) in head_64.S for the > boot cpu (0). For secondary cpus, do_boot_cpu() sets up the correct > initial pda and gdt_page pointer. > > * Initialize per_cpu_offset to point to static pda in the per_cpu area > (@ __per_cpu_load). > > * After allocation of the per cpu area for the boot cpu (0), reload the > gdt page pointer. > > Based on linux-2.6.tip/master Given that we have not yet understood the weird failure case. This patch needs to be split in two. - make the current per cpu variable section zero based. - Move the pda into the per cpu variable section. There are too many variables at present the reported failure cases to guess what is really going on. We can not optimize the per cpu variable accesses until the pda moves but we can easily test for linker and tool chain bugs with zero based pda segment itself. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 02/15] x86_64: Fold pda into per cpu area 2008-07-09 22:02 ` Eric W. Biederman @ 2008-07-13 17:54 ` Ingo Molnar 2008-07-14 14:24 ` Mike Travis 0 siblings, 1 reply; 190+ messages in thread From: Ingo Molnar @ 2008-07-13 17:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel * Eric W. Biederman <ebiederm@xmission.com> wrote: > Mike Travis <travis@sgi.com> writes: > > > WARNING: there is still a FIXME in this patch (see arch/x86/kernel/acpi/sleep.c) > > > > * Declare the pda as a per cpu variable. > > > > * Make the x86_64 per cpu area start at zero. > > > > * Relocate the initial pda and per_cpu(gdt_page) in head_64.S for the > > boot cpu (0). For secondary cpus, do_boot_cpu() sets up the correct > > initial pda and gdt_page pointer. > > > > * Initialize per_cpu_offset to point to static pda in the per_cpu area > > (@ __per_cpu_load). > > > > * After allocation of the per cpu area for the boot cpu (0), reload the > > gdt page pointer. > > > > Based on linux-2.6.tip/master > > Given that we have not yet understood the weird failure case. This patch needs > to be split in two. > - make the current per cpu variable section zero based. > - Move the pda into the per cpu variable section. > > There are too many variables at present the reported failure cases to > guess what is really going on. > > We can not optimize the per cpu variable accesses until the pda moves > but we can easily test for linker and tool chain bugs with zero > based pda segment itself. agreed, a patch of this gravity and with a diffstat: 12 files changed, 112 insertions(+), 142 deletions(-) is indeed too large. Test failures that get bisected to this patch will still cause people to guess about which aspect of the large patch caused the problem. Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 02/15] x86_64: Fold pda into per cpu area 2008-07-13 17:54 ` Ingo Molnar @ 2008-07-14 14:24 ` Mike Travis 0 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-14 14:24 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Ingo Molnar wrote: > * Eric W. Biederman <ebiederm@xmission.com> wrote: ... >> Given that we have not yet understood the weird failure case. This patch needs >> to be split in two. >> - make the current per cpu variable section zero based. >> - Move the pda into the per cpu variable section. >> >> There are too many variables at present the reported failure cases to >> guess what is really going on. >> >> We can not optimize the per cpu variable accesses until the pda moves >> but we can easily test for linker and tool chain bugs with zero >> based pda segment itself. > > agreed, a patch of this gravity and with a diffstat: > > 12 files changed, 112 insertions(+), 142 deletions(-) > > is indeed too large. Test failures that get bisected to this patch will > still cause people to guess about which aspect of the large patch caused > the problem. > > Ingo That split has been done and I've sent it to Jeremy and Peter for further review. Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis 2008-07-09 16:51 ` [RFC 01/15] x86_64: Cleanup early setup_percpu references Mike Travis 2008-07-09 16:51 ` [RFC 02/15] x86_64: Fold pda into per cpu area Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops Mike Travis ` (15 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: zero_based_use_gs --] [-- Type: text/plain, Size: 2261 bytes --] * Since %gs is pointing to the pda, it will then also point to the per cpu variables and can be accessed thusly: %gs:[&per_cpu_xxxx - __per_cpu_start] ... and since __per_cpu_start == 0 then: %gs:&per_cpu_var(xxx) becomes the optimized effective address. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/percpu.h | 36 +++++++++++++----------------------- 1 file changed, 13 insertions(+), 23 deletions(-) --- linux-2.6.tip.orig/include/asm-x86/percpu.h +++ linux-2.6.tip/include/asm-x86/percpu.h @@ -5,15 +5,19 @@ #include <linux/compiler.h> #include <asm/pda.h> -/* Same as asm-generic/percpu.h */ +/* Same as asm-generic/percpu.h, except we use %gs as a segment offset. */ #ifdef CONFIG_SMP #define __my_cpu_offset read_pda(data_offset) +#define __percpu_seg "%%gs:" +#else +#define __percpu_seg "" #endif + #include <asm-generic/percpu.h> DECLARE_PER_CPU(struct x8664_pda, pda); -#else /* CONFIG_X86_64 */ +#else /* !CONFIG_X86_64 */ #ifdef __ASSEMBLY__ @@ -42,36 +46,23 @@ DECLARE_PER_CPU(struct x8664_pda, pda); #else /* ...!ASSEMBLY */ -/* - * PER_CPU finds an address of a per-cpu variable. 
- * - * Args: - * var - variable name - * cpu - 32bit register containing the current CPU number - * - * The resulting address is stored in the "cpu" argument. - * - * Example: - * PER_CPU(cpu_gdt_descr, %ebx) - */ #ifdef CONFIG_SMP - #define __my_cpu_offset x86_read_percpu(this_cpu_off) - -/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */ #define __percpu_seg "%%fs:" - -#else /* !SMP */ - +#else #define __percpu_seg "" - -#endif /* SMP */ +#endif #include <asm-generic/percpu.h> /* We can use this directly for local CPU (faster). */ DECLARE_PER_CPU(unsigned long, this_cpu_off); +#endif /* __ASSEMBLY__ */ +#endif /* !CONFIG_X86_64 */ + +#ifndef __ASSEMBLY__ + /* For arch-specific code, we can use direct single-insn ops (they * don't give an lvalue though). */ extern void __bad_percpu_size(void); @@ -206,7 +197,6 @@ do { \ percpu_cmpxchg_op(per_cpu_var(var), old, new) #endif /* !__ASSEMBLY__ */ -#endif /* !CONFIG_X86_64 */ #ifdef CONFIG_SMP -- ^ permalink raw reply [flat|nested] 190+ messages in thread
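Patch 03's macros build the asm operand by pasting __percpu_seg onto the variable's mangled symbol name; with __per_cpu_start == 0 that prefixed symbol is the whole effective address. The string pasting alone can be shown in isolation (a sketch of the preprocessor mechanics only; the real macros feed the result into inline asm, where %% becomes a single %):

```c
#include <assert.h>
#include <string.h>

#define __percpu_seg "%%gs:"
#define STR(x) #x

/* builds the operand text "%%gs:per_cpu__<var>" at compile time,
 * the same way the kernel macros splice segment prefix and symbol */
#define PERCPU_OPERAND(var) __percpu_seg "per_cpu__" STR(var)

static const char *counter_operand(void)
{
    return PERCPU_OPERAND(counter);
}
```

The point of the zero-based rework is that this one pasted operand replaces the old read of data_offset followed by an indexed access.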
* [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (2 preceding siblings ...) 2008-07-09 16:51 ` [RFC 03/15] x86_64: Reference zero-based percpu variables offset from gs Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu() Mike Travis ` (14 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: zero_based_replace_cpu_pda --] [-- Type: text/plain, Size: 6292 bytes --] * Replace cpu_pda(i) references with percpu(pda, i). Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- arch/x86/kernel/cpu/common_64.c | 2 +- arch/x86/kernel/irq_64.c | 36 ++++++++++++++++++++---------------- arch/x86/kernel/nmi.c | 2 +- arch/x86/kernel/setup_percpu.c | 2 +- arch/x86/kernel/smpboot.c | 2 +- arch/x86/kernel/traps_64.c | 11 +++++++---- 6 files changed, 31 insertions(+), 24 deletions(-) --- linux-2.6.tip.orig/arch/x86/kernel/cpu/common_64.c +++ linux-2.6.tip/arch/x86/kernel/cpu/common_64.c @@ -477,7 +477,7 @@ __setup("noexec32=", nonx32_setup); void pda_init(int cpu) { - struct x8664_pda *pda = cpu_pda(cpu); + struct x8664_pda *pda = &per_cpu(pda, cpu); /* Setup up data that may be needed in __get_free_pages early */ asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); --- linux-2.6.tip.orig/arch/x86/kernel/irq_64.c +++ linux-2.6.tip/arch/x86/kernel/irq_64.c @@ -115,39 +115,43 @@ skip: } else if (i == NR_IRQS) { seq_printf(p, "NMI: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->__nmi_count); + seq_printf(p, "%10u ", per_cpu(pda.__nmi_count, j)); seq_printf(p, " Non-maskable interrupts\n"); 
seq_printf(p, "LOC: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->apic_timer_irqs); + seq_printf(p, "%10u ", per_cpu(pda.apic_timer_irqs, j)); seq_printf(p, " Local timer interrupts\n"); #ifdef CONFIG_SMP seq_printf(p, "RES: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->irq_resched_count); + seq_printf(p, "%10u ", + per_cpu(pda.irq_resched_count, j)); seq_printf(p, " Rescheduling interrupts\n"); seq_printf(p, "CAL: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->irq_call_count); + seq_printf(p, "%10u ", per_cpu(pda.irq_call_count, j)); seq_printf(p, " function call interrupts\n"); seq_printf(p, "TLB: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->irq_tlb_count); + seq_printf(p, "%10u ", per_cpu(pda.irq_tlb_count, j)); seq_printf(p, " TLB shootdowns\n"); #endif #ifdef CONFIG_X86_MCE seq_printf(p, "TRM: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->irq_thermal_count); + seq_printf(p, "%10u ", + per_cpu(pda.irq_thermal_count, j)); seq_printf(p, " Thermal event interrupts\n"); seq_printf(p, "THR: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->irq_threshold_count); + seq_printf(p, "%10u ", + per_cpu(pda.irq_threshold_count, j)); seq_printf(p, " Threshold APIC interrupts\n"); #endif seq_printf(p, "SPU: "); for_each_online_cpu(j) - seq_printf(p, "%10u ", cpu_pda(j)->irq_spurious_count); + seq_printf(p, "%10u ", + per_cpu(pda.irq_spurious_count, j)); seq_printf(p, " Spurious interrupts\n"); seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); } @@ -159,19 +163,19 @@ skip: */ u64 arch_irq_stat_cpu(unsigned int cpu) { - u64 sum = cpu_pda(cpu)->__nmi_count; + u64 sum = per_cpu(pda.__nmi_count, cpu); - sum += cpu_pda(cpu)->apic_timer_irqs; + sum += per_cpu(pda.apic_timer_irqs, cpu); #ifdef CONFIG_SMP - sum += cpu_pda(cpu)->irq_resched_count; - sum += cpu_pda(cpu)->irq_call_count; - sum += cpu_pda(cpu)->irq_tlb_count; + sum += per_cpu(pda.irq_resched_count, cpu); + sum 
+= per_cpu(pda.irq_call_count, cpu); + sum += per_cpu(pda.irq_tlb_count, cpu); #endif #ifdef CONFIG_X86_MCE - sum += cpu_pda(cpu)->irq_thermal_count; - sum += cpu_pda(cpu)->irq_threshold_count; + sum += per_cpu(pda.irq_thermal_count, cpu); + sum += per_cpu(pda.irq_threshold_count, cpu); #endif - sum += cpu_pda(cpu)->irq_spurious_count; + sum += per_cpu(pda.irq_spurious_count, cpu); return sum; } --- linux-2.6.tip.orig/arch/x86/kernel/nmi.c +++ linux-2.6.tip/arch/x86/kernel/nmi.c @@ -61,7 +61,7 @@ static int endflag __initdata; static inline unsigned int get_nmi_count(int cpu) { #ifdef CONFIG_X86_64 - return cpu_pda(cpu)->__nmi_count; + return per_cpu(pda.__nmi_count, cpu); #else return nmi_count(cpu); #endif --- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c +++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c @@ -279,7 +279,7 @@ void __cpuinit numa_set_node(int cpu, in per_cpu(x86_cpu_to_node_map, cpu) = node; if (node != NUMA_NO_NODE) - cpu_pda(cpu)->nodenumber = node; + per_cpu(pda.nodenumber, cpu) = node; } void __cpuinit numa_clear_node(int cpu) --- linux-2.6.tip.orig/arch/x86/kernel/smpboot.c +++ linux-2.6.tip/arch/x86/kernel/smpboot.c @@ -814,7 +814,7 @@ do_rest: /* Stack for startup_32 can be just as for start_secondary onwards */ irq_ctx_init(cpu); #else - cpu_pda(cpu)->pcurrent = c_idle.idle; + per_cpu(pda.pcurrent, cpu) = c_idle.idle; clear_tsk_thread_flag(c_idle.idle, TIF_FORK); initial_pda = (unsigned long)get_cpu_pda(cpu); #endif --- linux-2.6.tip.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.tip/arch/x86/kernel/traps_64.c @@ -265,7 +265,8 @@ void dump_trace(struct task_struct *tsk, const struct stacktrace_ops *ops, void *data) { const unsigned cpu = get_cpu(); - unsigned long *irqstack_end = (unsigned long*)cpu_pda(cpu)->irqstackptr; + unsigned long *irqstack_end = + (unsigned long *)per_cpu(pda.irqstackptr, cpu); unsigned used = 0; struct thread_info *tinfo; @@ -399,8 +400,10 @@ _show_stack(struct task_struct *tsk, str unsigned long *stack; int i; 
const int cpu = smp_processor_id(); - unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr); - unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE); + unsigned long *irqstack_end = + (unsigned long *)per_cpu(pda.irqstackptr, cpu); + unsigned long *irqstack = + (unsigned long *)(irqstack_end - IRQSTACKSIZE); // debugging aid: "show_stack(NULL, NULL);" prints the // back trace for this cpu. @@ -464,7 +467,7 @@ void show_registers(struct pt_regs *regs int i; unsigned long sp; const int cpu = smp_processor_id(); - struct task_struct *cur = cpu_pda(cpu)->pcurrent; + struct task_struct *cur = __get_cpu_var(pda.pcurrent); u8 *ip; unsigned int code_prologue = code_bytes * 43 / 64; unsigned int code_len = code_bytes; -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu(). 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (3 preceding siblings ...) 2008-07-09 16:51 ` [RFC 04/15] x86_64: Replace cpu_pda ops with percpu ops Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h Mike Travis ` (13 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: zero_based_replace_pda_ops --] [-- Type: text/plain, Size: 7201 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu * Remove unused field (in_bootmem) from the pda. * One pda op (test_and_clear_bit_pda) cannot be easily removed but since the pda is the first element in the per cpu area, then it can be left in place as is. 
Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- arch/x86/kernel/apic_64.c | 4 ++-- arch/x86/kernel/cpu/mcheck/mce_amd_64.c | 2 +- arch/x86/kernel/cpu/mcheck/mce_intel_64.c | 2 +- arch/x86/kernel/nmi.c | 3 ++- arch/x86/kernel/process_64.c | 12 ++++++------ arch/x86/kernel/smp.c | 4 ++-- arch/x86/kernel/time_64.c | 2 +- arch/x86/kernel/tlb_64.c | 12 ++++++------ arch/x86/kernel/traps_64.c | 2 +- arch/x86/kernel/x8664_ksyms_64.c | 2 -- arch/x86/xen/smp.c | 2 +- 11 files changed, 23 insertions(+), 24 deletions(-) --- linux-2.6.tip.orig/arch/x86/kernel/apic_64.c +++ linux-2.6.tip/arch/x86/kernel/apic_64.c @@ -457,7 +457,7 @@ static void local_apic_timer_interrupt(v /* * the NMI deadlock-detector uses this. */ - add_pda(apic_timer_irqs, 1); + x86_inc_percpu(pda.apic_timer_irqs); evt->event_handler(evt); } @@ -965,7 +965,7 @@ asmlinkage void smp_spurious_interrupt(v if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f))) ack_APIC_irq(); - add_pda(irq_spurious_count, 1); + x86_inc_percpu(pda.irq_spurious_count); irq_exit(); } --- linux-2.6.tip.orig/arch/x86/kernel/cpu/mcheck/mce_amd_64.c +++ linux-2.6.tip/arch/x86/kernel/cpu/mcheck/mce_amd_64.c @@ -237,7 +237,7 @@ asmlinkage void mce_threshold_interrupt( } } out: - add_pda(irq_threshold_count, 1); + x86_inc_percpu(pda.irq_threshold_count); irq_exit(); } --- linux-2.6.tip.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c +++ linux-2.6.tip/arch/x86/kernel/cpu/mcheck/mce_intel_64.c @@ -26,7 +26,7 @@ asmlinkage void smp_thermal_interrupt(vo if (therm_throt_process(msr_val & 1)) mce_log_therm_throt_event(smp_processor_id(), msr_val); - add_pda(irq_thermal_count, 1); + x86_inc_percpu(pda.irq_thermal_count); irq_exit(); } --- linux-2.6.tip.orig/arch/x86/kernel/nmi.c +++ linux-2.6.tip/arch/x86/kernel/nmi.c @@ -82,7 +82,8 @@ static inline int mce_in_progress(void) static inline unsigned int get_timer_irqs(int cpu) { #ifdef CONFIG_X86_64 - return 
read_pda(apic_timer_irqs) + read_pda(irq0_irqs); + return x86_read_percpu(pda.apic_timer_irqs) + + x86_read_percpu(pda.irq0_irqs); #else return per_cpu(irq_stat, cpu).apic_timer_irqs + per_cpu(irq_stat, cpu).irq0_irqs; --- linux-2.6.tip.orig/arch/x86/kernel/process_64.c +++ linux-2.6.tip/arch/x86/kernel/process_64.c @@ -66,7 +66,7 @@ void idle_notifier_register(struct notif void enter_idle(void) { - write_pda(isidle, 1); + x86_write_percpu(pda.isidle, 1); atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL); } @@ -410,7 +410,7 @@ start_thread(struct pt_regs *regs, unsig load_gs_index(0); regs->ip = new_ip; regs->sp = new_sp; - write_pda(oldrsp, new_sp); + x86_write_percpu(pda.oldrsp, new_sp); regs->cs = __USER_CS; regs->ss = __USER_DS; regs->flags = 0x200; @@ -646,11 +646,11 @@ __switch_to(struct task_struct *prev_p, /* * Switch the PDA and FPU contexts. */ - prev->usersp = read_pda(oldrsp); - write_pda(oldrsp, next->usersp); - write_pda(pcurrent, next_p); + prev->usersp = x86_read_percpu(pda.oldrsp); + x86_write_percpu(pda.oldrsp, next->usersp); + x86_write_percpu(pda.pcurrent, next_p); - write_pda(kernelstack, + x86_write_percpu(pda.kernelstack, (unsigned long)task_stack_page(next_p) + THREAD_SIZE - PDA_STACKOFFSET); #ifdef CONFIG_CC_STACKPROTECTOR /* --- linux-2.6.tip.orig/arch/x86/kernel/smp.c +++ linux-2.6.tip/arch/x86/kernel/smp.c @@ -295,7 +295,7 @@ void smp_reschedule_interrupt(struct pt_ #ifdef CONFIG_X86_32 __get_cpu_var(irq_stat).irq_resched_count++; #else - add_pda(irq_resched_count, 1); + x86_inc_percpu(pda.irq_resched_count); #endif } @@ -320,7 +320,7 @@ void smp_call_function_interrupt(struct #ifdef CONFIG_X86_32 __get_cpu_var(irq_stat).irq_call_count++; #else - add_pda(irq_call_count, 1); + x86_inc_percpu(pda.irq_call_count); #endif irq_exit(); --- linux-2.6.tip.orig/arch/x86/kernel/time_64.c +++ linux-2.6.tip/arch/x86/kernel/time_64.c @@ -46,7 +46,7 @@ EXPORT_SYMBOL(profile_pc); static irqreturn_t timer_event_interrupt(int irq, void 
*dev_id) { - add_pda(irq0_irqs, 1); + x86_inc_percpu(pda.irq0_irqs); global_clock_event->event_handler(global_clock_event); --- linux-2.6.tip.orig/arch/x86/kernel/tlb_64.c +++ linux-2.6.tip/arch/x86/kernel/tlb_64.c @@ -62,9 +62,9 @@ static DEFINE_PER_CPU(union smp_flush_st */ void leave_mm(int cpu) { - if (read_pda(mmu_state) == TLBSTATE_OK) + if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK) BUG(); - cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask); + cpu_clear(cpu, x86_read_percpu(pda.active_mm)->cpu_vm_mask); load_cr3(swapper_pg_dir); } EXPORT_SYMBOL_GPL(leave_mm); @@ -142,8 +142,8 @@ asmlinkage void smp_invalidate_interrupt * BUG(); */ - if (f->flush_mm == read_pda(active_mm)) { - if (read_pda(mmu_state) == TLBSTATE_OK) { + if (f->flush_mm == x86_read_percpu(pda.active_mm)) { + if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK) { if (f->flush_va == TLB_FLUSH_ALL) local_flush_tlb(); else @@ -154,7 +154,7 @@ asmlinkage void smp_invalidate_interrupt out: ack_APIC_irq(); cpu_clear(cpu, f->flush_cpumask); - add_pda(irq_tlb_count, 1); + x86_inc_percpu(pda.irq_tlb_count); } void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm, @@ -269,7 +269,7 @@ static void do_flush_tlb_all(void *info) unsigned long cpu = smp_processor_id(); __flush_tlb_all(); - if (read_pda(mmu_state) == TLBSTATE_LAZY) + if (x86_read_percpu(pda.mmu_state) == TLBSTATE_LAZY) leave_mm(cpu); } --- linux-2.6.tip.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.tip/arch/x86/kernel/traps_64.c @@ -878,7 +878,7 @@ asmlinkage notrace __kprobes void do_nmi(struct pt_regs *regs, long error_code) { nmi_enter(); - add_pda(__nmi_count, 1); + x86_inc_percpu(pda.__nmi_count); if (!ignore_nmis) default_do_nmi(regs); nmi_exit(); --- linux-2.6.tip.orig/arch/x86/kernel/x8664_ksyms_64.c +++ linux-2.6.tip/arch/x86/kernel/x8664_ksyms_64.c @@ -58,5 +58,3 @@ EXPORT_SYMBOL(__memcpy); EXPORT_SYMBOL(empty_zero_page); EXPORT_SYMBOL(init_level4_pgt); EXPORT_SYMBOL(load_gs_index); - 
-EXPORT_SYMBOL(_proxy_pda); --- linux-2.6.tip.orig/arch/x86/xen/smp.c +++ linux-2.6.tip/arch/x86/xen/smp.c @@ -68,7 +68,7 @@ static irqreturn_t xen_reschedule_interr #ifdef CONFIG_X86_32 __get_cpu_var(irq_stat).irq_resched_count++; #else - add_pda(irq_resched_count, 1); + x86_inc_percpu(pda.irq_resched_count); #endif return IRQ_HANDLED; -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (4 preceding siblings ...) 2008-07-09 16:51 ` [RFC 05/15] x86_64: Replace xxx_pda() operations with x86_xxx_percpu() Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h Mike Travis ` (12 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_current_h --] [-- Type: text/plain, Size: 993 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/current.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- linux-2.6.tip.orig/include/asm-x86/current.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/current.h 2008-07-01 10:49:13.764285143 -0700 @@ -17,12 +17,13 @@ static __always_inline struct task_struc #ifndef __ASSEMBLY__ #include <asm/pda.h> +#include <asm/percpu.h> struct task_struct; static __always_inline struct task_struct *get_current(void) { - return read_pda(pcurrent); + return x86_read_percpu(pda.pcurrent); } #else /* __ASSEMBLY__ */ -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (5 preceding siblings ...) 2008-07-09 16:51 ` [RFC 06/15] x86_64: Replace xxx_pda() operations in include_asm-x86_current_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h Mike Travis ` (11 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_hardirq_64_h --] [-- Type: text/plain, Size: 1238 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/hardirq_64.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) --- linux-2.6.tip.orig/include/asm-x86/hardirq_64.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/hardirq_64.h 2008-07-01 10:49:14.000299503 -0700 @@ -11,12 +11,12 @@ #define __ARCH_IRQ_STAT 1 -#define local_softirq_pending() read_pda(__softirq_pending) +#define local_softirq_pending() x86_read_percpu(pda.__softirq_pending) #define __ARCH_SET_SOFTIRQ_PENDING 1 -#define set_softirq_pending(x) write_pda(__softirq_pending, (x)) -#define or_softirq_pending(x) or_pda(__softirq_pending, (x)) +#define set_softirq_pending(x) x86_write_percpu(pda.__softirq_pending, (x)) +#define or_softirq_pending(x) x86_or_percpu(pda.__softirq_pending, (x)) extern void ack_bad_irq(unsigned int 
irq); -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (6 preceding siblings ...) 2008-07-09 16:51 ` [RFC 07/15] x86_64: Replace xxx_pda() operations in include_asm-x86_hardirq_64_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h Mike Travis ` (10 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_mmu_context_64_h --] [-- Type: text/plain, Size: 1831 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/mmu_context_64.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) --- linux-2.6.tip.orig/include/asm-x86/mmu_context_64.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/mmu_context_64.h 2008-07-01 10:49:14.220312889 -0700 @@ -20,8 +20,8 @@ void destroy_context(struct mm_struct *m static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { #ifdef CONFIG_SMP - if (read_pda(mmu_state) == TLBSTATE_OK) - write_pda(mmu_state, TLBSTATE_LAZY); + if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK) + x86_write_percpu(pda.mmu_state, TLBSTATE_LAZY); #endif } @@ -33,8 +33,8 @@ static inline void switch_mm(struct mm_s /* stop flush ipis for the previous mm */ cpu_clear(cpu, prev->cpu_vm_mask); #ifdef CONFIG_SMP - 
write_pda(mmu_state, TLBSTATE_OK); - write_pda(active_mm, next); + x86_write_percpu(pda.mmu_state, TLBSTATE_OK); + x86_write_percpu(pda.active_mm, next); #endif cpu_set(cpu, next->cpu_vm_mask); load_cr3(next->pgd); @@ -44,8 +44,8 @@ static inline void switch_mm(struct mm_s } #ifdef CONFIG_SMP else { - write_pda(mmu_state, TLBSTATE_OK); - if (read_pda(active_mm) != next) + x86_write_percpu(pda.mmu_state, TLBSTATE_OK); + if (x86_read_percpu(pda.active_mm) != next) BUG(); if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) { /* We were in lazy tlb mode and leave_mm disabled -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (7 preceding siblings ...) 2008-07-09 16:51 ` [RFC 08/15] x86_64: Replace xxx_pda() operations in include_asm-x86_mmu_context_64_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h Mike Travis ` (9 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_percpu_h --] [-- Type: text/plain, Size: 948 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/percpu.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-2.6.tip.orig/include/asm-x86/percpu.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/percpu.h 2008-07-01 10:49:14.432325788 -0700 @@ -7,7 +7,7 @@ /* Same as asm-generic/percpu.h, except we use %gs as a segment offset. */ #ifdef CONFIG_SMP -#define __my_cpu_offset read_pda(data_offset) +#define __my_cpu_offset (x86_read_percpu(pda.data_offset)) #define __percpu_seg "%%gs:" #else #define __percpu_seg "" -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (8 preceding siblings ...) 2008-07-09 16:51 ` [RFC 09/15] x86_64: Replace xxx_pda() operations in include_asm-x86_percpu_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h Mike Travis ` (8 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_smp_h --] [-- Type: text/plain, Size: 958 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/smp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-2.6.tip.orig/include/asm-x86/smp.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/smp.h 2008-07-01 10:49:14.676340634 -0700 @@ -134,7 +134,7 @@ DECLARE_PER_CPU(int, cpu_number); extern int safe_smp_processor_id(void); #elif defined(CONFIG_X86_64_SMP) -#define raw_smp_processor_id() read_pda(cpunumber) +#define raw_smp_processor_id() x86_read_percpu(pda.cpunumber) #define stack_smp_processor_id() \ ({ \ -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (9 preceding siblings ...) 2008-07-09 16:51 ` [RFC 10/15] x86_64: Replace xxx_pda() operations in include_asm-x86_smp_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h Mike Travis ` (7 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_stackprotector_h --] [-- Type: text/plain, Size: 912 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/stackprotector.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-2.6.tip.orig/include/asm-x86/stackprotector.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/stackprotector.h 2008-07-01 10:49:14.920355480 -0700 @@ -32,7 +32,7 @@ static __always_inline void boot_init_st canary += tsc + (tsc << 32UL); current->stack_canary = canary; - write_pda(stack_canary, canary); + x86_write_percpu(pda.stack_canary, canary); } #endif -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (10 preceding siblings ...) 2008-07-09 16:51 ` [RFC 11/15] x86_64: Replace xxx_pda() operations in include_asm-x86_stackprotector_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h Mike Travis ` (6 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_thread_info_h --] [-- Type: text/plain, Size: 1014 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/thread_info.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- linux-2.6.tip.orig/include/asm-x86/thread_info.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/thread_info.h 2008-07-01 10:49:15.172370813 -0700 @@ -200,7 +200,8 @@ static inline struct thread_info *curren static inline struct thread_info *current_thread_info(void) { struct thread_info *ti; - ti = (void *)(read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE); + ti = (void *)(x86_read_percpu(pda.kernelstack) + + PDA_STACKOFFSET - THREAD_SIZE); return ti; } -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (11 preceding siblings ...) 2008-07-09 16:51 ` [RFC 12/15] x86_64: Replace xxx_pda() operations in include_asm-x86_thread_info_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 14/15] x86_64: Remove xxx_pda() operations Mike Travis ` (5 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: incl_pda_ops_include_asm-x86_topology_h --] [-- Type: text/plain, Size: 1001 bytes --] * It is now possible to use percpu operations for pda access since the pda is in the percpu area. Drop the pda operations. Thus: read_pda --> x86_read_percpu write_pda --> x86_write_percpu add_pda (+1) --> x86_inc_percpu or_pda --> x86_or_percpu Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-2.6.tip.orig/include/asm-x86/topology.h 2008-07-01 10:41:33.000000000 -0700 +++ linux-2.6.tip/include/asm-x86/topology.h 2008-07-01 10:49:15.420385902 -0700 @@ -77,7 +77,7 @@ extern cpumask_t *node_to_cpumask_map; DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map); /* Returns the number of the current Node. */ -#define numa_node_id() read_pda(nodenumber) +#define numa_node_id() x86_read_percpu(pda.nodenumber) #ifdef CONFIG_DEBUG_PER_CPU_MAPS extern int cpu_to_node(int cpu); -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 14/15] x86_64: Remove xxx_pda() operations 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (12 preceding siblings ...) 2008-07-09 16:51 ` [RFC 13/15] x86_64: Replace xxx_pda() operations in include_asm-x86_topology_h Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 16:51 ` [RFC 15/15] x86_64: Remove cpu_pda() macro Mike Travis ` (4 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: zero_based_remove_pda_ops --] [-- Type: text/plain, Size: 3591 bytes --] * As there are no more references to xxx_pda() ops then remove them. Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/pda.h | 77 ++++---------------------------------------------- 1 file changed, 7 insertions(+), 70 deletions(-) --- linux-2.6.tip.orig/include/asm-x86/pda.h +++ linux-2.6.tip/include/asm-x86/pda.h @@ -21,7 +21,7 @@ struct x8664_pda { offset 40!!! */ char *irqstackptr; short nodenumber; /* number of current node (32k max) */ - short in_bootmem; /* pda lives in bootmem */ + short unused1; /* unused */ unsigned int __softirq_pending; unsigned int __nmi_count; /* number of NMI on this CPUs */ short mmu_state; @@ -42,12 +42,6 @@ extern void pda_init(int); #define cpu_pda(cpu) (&per_cpu(pda, cpu)) /* - * There is no fast way to get the base address of the PDA, all the accesses - * have to mention %fs/%gs. So it needs to be done this Torvaldian way. - */ -extern void __bad_pda_field(void) __attribute__((noreturn)); - -/* * proxy_pda doesn't actually exist, but tell gcc it is accessed for * all PDA accesses so it gets read/write dependencies right. 
*/ @@ -55,69 +49,11 @@ extern struct x8664_pda _proxy_pda; #define pda_offset(field) offsetof(struct x8664_pda, field) -#define pda_to_op(op, field, val) \ -do { \ - typedef typeof(_proxy_pda.field) T__; \ - if (0) { T__ tmp__; tmp__ = (val); } /* type checking */ \ - switch (sizeof(_proxy_pda.field)) { \ - case 2: \ - asm(op "w %1,%%gs:%c2" : \ - "+m" (_proxy_pda.field) : \ - "ri" ((T__)val), \ - "i"(pda_offset(field))); \ - break; \ - case 4: \ - asm(op "l %1,%%gs:%c2" : \ - "+m" (_proxy_pda.field) : \ - "ri" ((T__)val), \ - "i" (pda_offset(field))); \ - break; \ - case 8: \ - asm(op "q %1,%%gs:%c2": \ - "+m" (_proxy_pda.field) : \ - "ri" ((T__)val), \ - "i"(pda_offset(field))); \ - break; \ - default: \ - __bad_pda_field(); \ - } \ -} while (0) - -#define pda_from_op(op, field) \ -({ \ - typeof(_proxy_pda.field) ret__; \ - switch (sizeof(_proxy_pda.field)) { \ - case 2: \ - asm(op "w %%gs:%c1,%0" : \ - "=r" (ret__) : \ - "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ - break; \ - case 4: \ - asm(op "l %%gs:%c1,%0": \ - "=r" (ret__): \ - "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ - break; \ - case 8: \ - asm(op "q %%gs:%c1,%0": \ - "=r" (ret__) : \ - "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ - break; \ - default: \ - __bad_pda_field(); \ - } \ - ret__; \ -}) - -#define read_pda(field) pda_from_op("mov", field) -#define write_pda(field, val) pda_to_op("mov", field, val) -#define add_pda(field, val) pda_to_op("add", field, val) -#define sub_pda(field, val) pda_to_op("sub", field, val) -#define or_pda(field, val) pda_to_op("or", field, val) - -/* This is not atomic against other CPUs -- CPU preemption needs to be off */ +/* + * This is not atomic against other CPUs -- CPU preemption needs to be off + * NOTE: This relies on the fact that the cpu_pda is the *first* field in + * the per cpu area. Move it and you'll need to change this. 
+ */ #define test_and_clear_bit_pda(bit, field) \ ({ \ int old__; \ @@ -127,6 +63,7 @@ do { \ old__; \ }) + #endif #define PDA_STACKOFFSET (5*8) -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* [RFC 15/15] x86_64: Remove cpu_pda() macro 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (13 preceding siblings ...) 2008-07-09 16:51 ` [RFC 14/15] x86_64: Remove xxx_pda() operations Mike Travis @ 2008-07-09 16:51 ` Mike Travis 2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin ` (3 subsequent siblings) 18 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 16:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel [-- Attachment #1: zero_based_remove_cpu_pda --] [-- Type: text/plain, Size: 611 bytes --] * As there are no more references to cpu_pda() then remove it. Based on linux-2.6.tip/master Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Mike Travis <travis@sgi.com> --- include/asm-x86/pda.h | 2 -- 1 file changed, 2 deletions(-) --- linux-2.6.tip.orig/include/asm-x86/pda.h +++ linux-2.6.tip/include/asm-x86/pda.h @@ -39,8 +39,6 @@ struct x8664_pda { extern void pda_init(int); -#define cpu_pda(cpu) (&per_cpu(pda, cpu)) - /* * proxy_pda doesn't actually exist, but tell gcc it is accessed for * all PDA accesses so it gets read/write dependencies right. -- ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (14 preceding siblings ...) 2008-07-09 16:51 ` [RFC 15/15] x86_64: Remove cpu_pda() macro Mike Travis @ 2008-07-09 17:19 ` H. Peter Anvin 2008-07-09 17:40 ` Mike Travis 2008-07-09 17:44 ` Jeremy Fitzhardinge 2008-07-09 17:27 ` Jeremy Fitzhardinge ` (2 subsequent siblings) 18 siblings, 2 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 17:19 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel Hi Mike, Did the suspected linker bug issue ever get resolved? -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin @ 2008-07-09 17:40 ` Mike Travis 2008-07-09 17:42 ` H. Peter Anvin 2008-07-09 17:44 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 17:40 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel H. Peter Anvin wrote: > Hi Mike, > > Did the suspected linker bug issue ever get resolved? > > -hpa Hi Peter, I was not able to figure out how the two versions of the same kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently, I'm sticking with gcc-4.2.4 as it boots much farther. There still is a problem where if I bump THREAD_ORDER, the problem goes away and everything so far that I've tested boots up fine. We tried to install a later gcc (4.3.1) that might have the "GCC_HAS_SP" flag but our sys admin reported: The 4.3.1 version gives me errors on the make. I had to pre-install gmp and mpfr, but, I still get errors on the make. I think that was the latest he found on the GNU/GCC site. Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:40 ` Mike Travis @ 2008-07-09 17:42 ` H. Peter Anvin 2008-07-09 18:05 ` Mike Travis 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 17:42 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel Mike Travis wrote: > H. Peter Anvin wrote: >> Hi Mike, >> >> Did the suspected linker bug issue ever get resolved? >> >> -hpa > > Hi Peter, > > I was not able to figure out how the two versions of the same > kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently, > I'm sticking with gcc-4.2.4 as it boots much farther. > > There still is a problem where if I bump THREAD_ORDER, the > problem goes away and everything so far that I've tested boots > up fine. > > We tried to install a later gcc (4.3.1) that might have the > "GCC_HAS_SP" flag but our sys admin reported: > > The 4.3.1 version gives me errors on the make. I had to > pre-install gmp and mpfr, but, I still get errors on the make. > > I think that was the latest he found on the GNU/GCC site. > We have seen miscompilations with gcc 4.3.0 at least; not sure about 4.3.1. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:42 ` H. Peter Anvin @ 2008-07-09 18:05 ` Mike Travis 0 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 18:05 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel H. Peter Anvin wrote: > Mike Travis wrote: >> H. Peter Anvin wrote: >>> Hi Mike, >>> >>> Did the suspected linker bug issue ever get resolved? >>> >>> -hpa >> >> Hi Peter, >> >> I was not able to figure out how the two versions of the same >> kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently, >> I'm sticking with gcc-4.2.4 as it boots much farther. >> >> There still is a problem where if I bump THREAD_ORDER, the >> problem goes away and everything so far that I've tested boots >> up fine. >> >> We tried to install a later gcc (4.3.1) that might have the >> "GCC_HAS_SP" flag but our sys admin reported: >> >> The 4.3.1 version gives me errors on the make. I had to >> pre-install gmp and mpfr, but, I still get errors on the make. >> >> I think that was the latest he found on the GNU/GCC site. >> > > We have seen miscompilations with gcc 4.3.0 at least; not sure about 4.3.1. > > -hpa Hmm, I wonder how the CONFIG_CC_STACKPROTECTOR was tested? Could it be a config option for building GCC itself? Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin 2008-07-09 17:40 ` Mike Travis @ 2008-07-09 17:44 ` Jeremy Fitzhardinge 2008-07-09 18:09 ` Mike Travis 2008-07-25 15:49 ` Mike Travis 1 sibling, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 17:44 UTC (permalink / raw) To: H. Peter Anvin Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel H. Peter Anvin wrote: > Did the suspected linker bug issue ever get resolved? I don't believe so. I think Mike is getting very early crashes depending on some combination of gcc, linker and kernel config. Or something. This fragility makes me very nervous. It seems hard enough to get this stuff working with current tools; making it work over the whole range of supported tools looks like its going to be hard. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 17:44 ` Jeremy Fitzhardinge
@ 2008-07-09 18:09 ` Mike Travis
  2008-07-09 18:30 ` H. Peter Anvin
  2008-07-09 19:34 ` Ingo Molnar
  1 sibling, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 18:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
	Christoph Lameter, Jack Steiner, linux-kernel

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Did the suspected linker bug issue ever get resolved?
>
> I don't believe so.  I think Mike is getting very early crashes
> depending on some combination of gcc, linker and kernel config.  Or
> something.

Yes, and unfortunately since SGI does not do its own compilers any
more (they were MIPS compilers), there's no one here that really
understands the internals of the compile tools.

>
> This fragility makes me very nervous.  It seems hard enough to get this
> stuff working with current tools; making it work over the whole range of
> supported tools looks like its going to be hard.

(me too ;-)

Once I get a solid version working with (at least) gcc-4.2.4, then
regression testing with older tools will be easier, or at least a
table of results can be produced.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:09 ` Mike Travis @ 2008-07-09 18:30 ` H. Peter Anvin 2008-07-09 19:34 ` Ingo Molnar 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 18:30 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel Mike Travis wrote: > Jeremy Fitzhardinge wrote: >> H. Peter Anvin wrote: >>> Did the suspected linker bug issue ever get resolved? >> I don't believe so. I think Mike is getting very early crashes >> depending on some combination of gcc, linker and kernel config. Or >> something. > > Yes and unfortunately since SGI does not do it's own compilers any > more (they were MIPS compilers), there's no one here that really > understands the internals of the compile tools. > A bummer, too, since that compiler lives on as the Pathscale compiler... -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:09 ` Mike Travis 2008-07-09 18:30 ` H. Peter Anvin @ 2008-07-09 19:34 ` Ingo Molnar 2008-07-09 19:44 ` H. Peter Anvin ` (2 more replies) 1 sibling, 3 replies; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 19:34 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, H. Peter Anvin, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel * Mike Travis <travis@sgi.com> wrote: > > This fragility makes me very nervous. It seems hard enough to get > > this stuff working with current tools; making it work over the whole > > range of supported tools looks like its going to be hard. > > (me too ;-) > > Once I get a solid version working with (at least) gcc-4.2.4, then > regression testing with older tools will be easier, or at least a > table of results can be produced. the problem is, we cannot just put it even into tip/master if there's no short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for me otherwise, for series of thousands of randomly built kernels. can we just leave out the zero-based percpu stuff safely and could i test the rest of your series - or are there dependencies? I think zero-based percpu, while nice in theory, is probably just a very small positive effect so it's not a life or death issue. (or is there any deeper, semantic reason why we'd want it?) Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:34 ` Ingo Molnar @ 2008-07-09 19:44 ` H. Peter Anvin 2008-07-09 20:26 ` Adrian Bunk 2008-07-09 21:03 ` Mike Travis 2008-07-09 21:23 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 19:44 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Travis, Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel Ingo Molnar wrote: > > the problem is, we cannot just put it even into tip/master if there's no > short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for > me otherwise, for series of thousands of randomly built kernels. > 4.2.3 is fine; he was using 4.2.0 before, and as far as I know, 4.2.0 and 4.2.1 are known broken for the kernel. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:44 ` H. Peter Anvin @ 2008-07-09 20:26 ` Adrian Bunk 0 siblings, 0 replies; 190+ messages in thread From: Adrian Bunk @ 2008-07-09 20:26 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Mike Travis, Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel On Wed, Jul 09, 2008 at 03:44:51PM -0400, H. Peter Anvin wrote: > Ingo Molnar wrote: >> >> the problem is, we cannot just put it even into tip/master if there's >> no short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid >> for me otherwise, for series of thousands of randomly built kernels. >> > > 4.2.3 is fine; he was using 4.2.0 before, and as far as I know, 4.2.0 > and 4.2.1 are known broken for the kernel. Not sure where your knowledge comes from, but the ones I try to get blacklisted due to known gcc bugs are 4.1.0 and 4.1.1. On a larger picture, we officially support gcc >= 3.2, and if any kernel change triggers a bug with e.g. gcc 3.2.3 that's technically a regression in the kernel... > -hpa cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:34 ` Ingo Molnar 2008-07-09 19:44 ` H. Peter Anvin @ 2008-07-09 21:03 ` Mike Travis 2008-07-09 21:23 ` Jeremy Fitzhardinge 2 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 21:03 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, H. Peter Anvin, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel Ingo Molnar wrote: > * Mike Travis <travis@sgi.com> wrote: > >>> This fragility makes me very nervous. It seems hard enough to get >>> this stuff working with current tools; making it work over the whole >>> range of supported tools looks like its going to be hard. >> (me too ;-) >> >> Once I get a solid version working with (at least) gcc-4.2.4, then >> regression testing with older tools will be easier, or at least a >> table of results can be produced. > > the problem is, we cannot just put it even into tip/master if there's no > short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for > me otherwise, for series of thousands of randomly built kernels. Great, I'll request we load gcc-4.2.3 on our devel server. > > can we just leave out the zero-based percpu stuff safely and could i > test the rest of your series - or are there dependencies? I think > zero-based percpu, while nice in theory, is probably just a very small > positive effect so it's not a life or death issue. (or is there any > deeper, semantic reason why we'd want it?) I sort of assumed that zero-based would not make it into 2.6.26-rcX, and no, reaching 4096 cpus does not require it. The other patches I've been submitting are more general and will fix possible panics (like stack overflows, etc.) Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:34 ` Ingo Molnar 2008-07-09 19:44 ` H. Peter Anvin 2008-07-09 21:03 ` Mike Travis @ 2008-07-09 21:23 ` Jeremy Fitzhardinge 2 siblings, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 21:23 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Travis, H. Peter Anvin, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel Ingo Molnar wrote: > * Mike Travis <travis@sgi.com> wrote: > > >>> This fragility makes me very nervous. It seems hard enough to get >>> this stuff working with current tools; making it work over the whole >>> range of supported tools looks like its going to be hard. >>> >> (me too ;-) >> >> Once I get a solid version working with (at least) gcc-4.2.4, then >> regression testing with older tools will be easier, or at least a >> table of results can be produced. >> > > the problem is, we cannot just put it even into tip/master if there's no > short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for > me otherwise, for series of thousands of randomly built kernels. > > can we just leave out the zero-based percpu stuff safely and could i > test the rest of your series - or are there dependencies? I think > zero-based percpu, while nice in theory, is probably just a very small > positive effect so it's not a life or death issue. (or is there any > deeper, semantic reason why we'd want it?) > I'm looking forward to using it, because I can make the Xen vcpu structure a percpu variable shared with the hypervisor. This means something like a disable interrupt becomes a simple "movb $1,%gs:per_cpu__xen_vcpu_event_mask". If access to percpu variables is indirect (ie, two instructions) I need to disable preemption which makes the whole thing much more complex, and too big to inline. There are other cases where preemption-safe access to percpu variables is useful as well. 
My view, which is admittedly very one-sided, is that all this brokenness is forced on us by gcc's stack-protector brokenness. My preferred approach would be to fix -fstack-protector by eliminating the requirement for small offsets from %gs. With that in place we could support it without needing a pda. In the meantime, we could either support stack-protector or direct access to percpu variables. Either way, we don't need to worry about making zero-based percpu work. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 17:44 ` Jeremy Fitzhardinge
  2008-07-09 18:09 ` Mike Travis
@ 2008-07-25 15:49 ` Mike Travis
  2008-07-25 16:08 ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-25 15:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman,
	Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Did the suspected linker bug issue ever get resolved?
>
> I don't believe so.  I think Mike is getting very early crashes
> depending on some combination of gcc, linker and kernel config.  Or
> something.
>
> This fragility makes me very nervous.  It seems hard enough to get this
> stuff working with current tools; making it work over the whole range of
> supported tools looks like its going to be hard.
>
>    J

FYI, I think it was a combination of errors that was causing my problems.
In any case, I've successfully compiled and booted Ingo's pesky config
config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and 4.2.4
on both Intel and AMD boxes.  (As well as a variety of other configs.)

I think Hugh's change adding "text" to the BUILD_IRQ macro might also have
helped, since interrupts seemed to have always been involved in the panics.
That might explain the variability across gcc versions.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-25 15:49 ` Mike Travis @ 2008-07-25 16:08 ` Jeremy Fitzhardinge 2008-07-25 16:46 ` Mike Travis 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-25 16:08 UTC (permalink / raw) To: Mike Travis Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins Mike Travis wrote: > FYI, I think it was a combination of errors that was causing my problems. > In any case, I've successfully compiled and booted Ingo's pesky config > config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and 4.2.4 > on both Intel and AMD boxes. (As well as a variety of other configs.) > Good. What compilers have you tested? Will it work over the complete supported range? > I think Hugh's change adding "text" to the BUILD_IRQ macro might also have > helped, since interrupts seemed to have always been involved in the panics. > That might explain the variable-ness of the gcc version's. Yes, indeed. That was a nasty one, and will be very sensitive to the exact order gcc decided to emit things. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-25 16:08 ` Jeremy Fitzhardinge @ 2008-07-25 16:46 ` Mike Travis 2008-07-25 16:58 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-25 16:46 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins Jeremy Fitzhardinge wrote: > Mike Travis wrote: >> FYI, I think it was a combination of errors that was causing my problems. >> In any case, I've successfully compiled and booted Ingo's pesky config >> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and >> 4.2.4 >> on both Intel and AMD boxes. (As well as a variety of other configs.) >> > > Good. What compilers have you tested? Will it work over the complete > supported range? The oldest gcc I have available is 3.4.3 (and I just tried that one and it worked.) So that one and the ones listed above I've verified using Ingo's test config. (All other testing I'm using 4.2.3.) > >> I think Hugh's change adding "text" to the BUILD_IRQ macro might also >> have >> helped, since interrupts seemed to have always been involved in the >> panics. >> That might explain the variable-ness of the gcc version's. > > Yes, indeed. That was a nasty one, and will be very sensitive to the > exact order gcc decided to emit things. > > J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-25 16:46 ` Mike Travis @ 2008-07-25 16:58 ` Jeremy Fitzhardinge 2008-07-25 18:12 ` Mike Travis 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-25 16:58 UTC (permalink / raw) To: Mike Travis Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins Mike Travis wrote: > Jeremy Fitzhardinge wrote: > >> Mike Travis wrote: >> >>> FYI, I think it was a combination of errors that was causing my problems. >>> In any case, I've successfully compiled and booted Ingo's pesky config >>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and >>> 4.2.4 >>> on both Intel and AMD boxes. (As well as a variety of other configs.) >>> >>> >> Good. What compilers have you tested? Will it work over the complete >> supported range? >> > > The oldest gcc I have available is 3.4.3 (and I just tried that one and > it worked.) So that one and the ones listed above I've verified using > Ingo's test config. (All other testing I'm using 4.2.3.) > OK. I've got a range of toolchains on various test machines around here, so I'll give it a spin. Are your most recent changes in tip.git yet? J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-25 16:58 ` Jeremy Fitzhardinge @ 2008-07-25 18:12 ` Mike Travis 0 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-25 18:12 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Ingo Molnar, Andrew Morton, Eric W. Biederman, Christoph Lameter, Jack Steiner, linux-kernel, Hugh Dickins Jeremy Fitzhardinge wrote: > Mike Travis wrote: >> Jeremy Fitzhardinge wrote: >> >>> Mike Travis wrote: >>> >>>> FYI, I think it was a combination of errors that was causing my >>>> problems. >>>> In any case, I've successfully compiled and booted Ingo's pesky config >>>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc's 4.2.0, 4.2.3 and >>>> 4.2.4 >>>> on both Intel and AMD boxes. (As well as a variety of other configs.) >>>> >>> Good. What compilers have you tested? Will it work over the complete >>> supported range? >>> >> >> The oldest gcc I have available is 3.4.3 (and I just tried that one and >> it worked.) So that one and the ones listed above I've verified using >> Ingo's test config. (All other testing I'm using 4.2.3.) >> > > OK. I've got a range of toolchains on various test machines around > here, so I'll give it a spin. Are your most recent changes in tip.git yet? > > J Almost there... ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (15 preceding siblings ...) 2008-07-09 17:19 ` [RFC 00/15] x86_64: Optimize percpu accesses H. Peter Anvin @ 2008-07-09 17:27 ` Jeremy Fitzhardinge 2008-07-09 17:39 ` Christoph Lameter 2008-07-09 18:00 ` Mike Travis 2008-07-09 19:28 ` Ingo Molnar 2008-07-09 20:00 ` Eric W. Biederman 18 siblings, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 17:27 UTC (permalink / raw) To: Mike Travis Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Mike Travis wrote: > This patchset provides the following: > > * Cleanup: Fix early references to cpumask_of_cpu(0) > > Provides an early cpumask_of_cpu(0) usable before the cpumask_of_cpu_map > is allocated and initialized. > > * Generic: Percpu infrastructure to rebase the per cpu area to zero > > This provides for the capability of accessing the percpu variables > using a local register instead of having to go through a table > on node 0 to find the cpu-specific offsets. It also would allow > atomic operations on percpu variables to reduce required locking. > Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the > generic code that the arch has this new basing. > > (Note: split into two patches, one to rebase percpu variables at 0, > and the second to actually use %gs as the base for percpu variables.) > > * x86_64: Fold pda into per cpu area > > Declare the pda as a per cpu variable. This will move the pda > area to an address accessible by the x86_64 per cpu macros. > Subtraction of __per_cpu_start will make the offset based from > the beginning of the per cpu area. 
Since %gs is pointing to the > pda, it will then also point to the per cpu variables and can be > accessed thusly: > > %gs:[&per_cpu_xxxx - __per_cpu_start] > > * x86_64: Rebase per cpu variables to zero > > Take advantage of the zero-based per cpu area provided above. > Then we can directly use the x86_32 percpu operations. x86_32 > offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly > to the pda and the per cpu area thereby allowing access to the > pda with the x86_64 pda operations and access to the per cpu > variables using x86_32 percpu operations. The bulk of this series is pda_X to x86_X_percpu conversion. This seems like pointless churn to me; there's nothing inherently wrong with the pda_X interfaces, and doing this transformation doesn't get us any closer to unifying 32 and 64 bit. I think we should start devolving things out of the pda in the other direction: make a series where each patch takes a member of struct x8664_pda, converts it to a per-cpu variable (where possible, the same one that 32-bit uses), and updates all the references accordingly. When the pda is as empty as it can be, we can look at removing the pda-specific interfaces. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:27 ` Jeremy Fitzhardinge @ 2008-07-09 17:39 ` Christoph Lameter 2008-07-09 17:51 ` Jeremy Fitzhardinge 2008-07-09 18:02 ` Mike Travis 2008-07-09 18:00 ` Mike Travis 1 sibling, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 17:39 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > The bulk of this series is pda_X to x86_X_percpu conversion. This seems > like pointless churn to me; there's nothing inherently wrong with the > pda_X interfaces, and doing this transformation doesn't get us any > closer to unifying 32 and 64 bit. What is the point of the pda_X interface? It does not exist on 32 bit. The pda wastes the GS segment register on a small memory area. This patchset makes the GS segment usable to reach all of the per cpu area by placing the pda into the per cpu area. Thus the pda_X interface becomes obsolete and the 32 bit per cpu stuff becomes usable under 64 bit unifying both architectures. > I think we should start devolving things out of the pda in the other > direction: make a series where each patch takes a member of struct > x8664_pda, converts it to a per-cpu variable (where possible, the same > one that 32-bit uses), and updates all the references accordingly. When > the pda is as empty as it can be, we can look at removing the > pda-specific interfaces. This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:39 ` Christoph Lameter @ 2008-07-09 17:51 ` Jeremy Fitzhardinge 2008-07-09 18:14 ` Mike Travis 2008-07-09 18:02 ` Mike Travis 1 sibling, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 17:51 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > What is the point of the pda_X interface? It does not exist on 32 bit. > The pda wastes the GS segment register on a small memory area. This patchset > makes the GS segment usable to reach all of the per cpu area by placing > the pda into the per cpu area. Thus the pda_X interface becomes obsolete > and the 32 bit per cpu stuff becomes usable under 64 bit unifying both > architectures. > I think we agree on the desired outcome. I just disagree with the path to getting there. >> I think we should start devolving things out of the pda in the other >> direction: make a series where each patch takes a member of struct >> x8664_pda, converts it to a per-cpu variable (where possible, the same >> one that 32-bit uses), and updates all the references accordingly. When >> the pda is as empty as it can be, we can look at removing the >> pda-specific interfaces. >> > > This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately. > No, it's not churn doing it object at a time. If you convert pda.pcurrent into a percpu current_task variable, then at one stroke you've 1) shrunk the pda, 2) unified with i386. If you go through the process of converting all the read_pda(pcurrent) references into x86_read_percpu(pda.pcurrent) then that's a pure churn patch. It doesn't get rid of the pda variable, it doesn't unify with i386. 
All it does is remove a reference to a macro which was fairly inoffensive in the first place. Once the pda has shrunk as much as it can (which remove everything except stack_canary, I think), then remove all the X_pda macros, since there won't be any users anyway. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:51 ` Jeremy Fitzhardinge @ 2008-07-09 18:14 ` Mike Travis 2008-07-09 18:22 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 18:14 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: ... > > Once the pda has shrunk as much as it can (which remove everything > except stack_canary, I think), then remove all the X_pda macros, since > there won't be any users anyway. > > J You bring up a good point here. Since the stack_canary has to be 20 (or is that 0x20?) bytes from %gs, then it sounds like we'll still need a pda struct of some sort. And zero padding before that seems counter-productive, so perhaps taking a poll of the most used pda (or percpu) variables and keeping them in the same cache line would be more useful? Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:14 ` Mike Travis @ 2008-07-09 18:22 ` Jeremy Fitzhardinge 2008-07-09 18:31 ` Mike Travis 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 18:22 UTC (permalink / raw) To: Mike Travis Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Mike Travis wrote: > Jeremy Fitzhardinge wrote: > ... > >> Once the pda has shrunk as much as it can (which remove everything >> except stack_canary, I think), then remove all the X_pda macros, since >> there won't be any users anyway. >> >> J >> > > You bring up a good point here. Since the stack_canary has to be 20 > (or is that 0x20?) bytes from %gs, then it sounds like we'll still > need a pda struct of some sort. And zero padding before that seems > counter-productive, so perhaps taking a poll of the most used pda > (or percpu) variables and keeping them in the same cache line would > be more useful? The offset is 40 (decimal). I don't see any particular problem with putting zero padding in there; we can get rid of it if CONFIG_STACK_PROTECTOR is off. The trouble with reusing the space is that it's going to be things like "current" which are the best candidates for going there - but if you do that you lose i386 unification (unless you play some tricks where you arrange to make the percpu variables land there while still appearing to be normal percpu vars). J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:22 ` Jeremy Fitzhardinge @ 2008-07-09 18:31 ` Mike Travis 2008-07-09 19:08 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 18:31 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: ... > The trouble with reusing the space is that it's going to be things like > "current" which are the best candidates for going there - but if you do > that you lose i386 unification (unless you play some tricks where you > arrange to make the percpu variables land there while still appearing to > be normal percpu vars). > > J One more approach... ;-) Once the pda and percpu vars are in the same area, then could the pda be used for both 32 and 64...? ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:31 ` Mike Travis @ 2008-07-09 19:08 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 19:08 UTC (permalink / raw) To: Mike Travis Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Mike Travis wrote: > One more approach... ;-) Once the pda and percpu vars are in the same > area, then could the pda be used for both 32 and 64...? > The i386 code works quite reliably thanks ;) J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:39 ` Christoph Lameter 2008-07-09 17:51 ` Jeremy Fitzhardinge @ 2008-07-09 18:02 ` Mike Travis 2008-07-09 18:13 ` Christoph Lameter 1 sibling, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 18:02 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > >> The bulk of this series is pda_X to x86_X_percpu conversion. This seems >> like pointless churn to me; there's nothing inherently wrong with the >> pda_X interfaces, and doing this transformation doesn't get us any >> closer to unifying 32 and 64 bit. > > What is the point of the pda_X interface? It does not exist on 32 bit. > The pda wastes the GS segment register on a small memory area. This patchset > makes the GS segment usable to reach all of the per cpu area by placing > the pda into the per cpu area. Thus the pda_X interface becomes obsolete > and the 32 bit per cpu stuff becomes usable under 64 bit unifying both > architectures. > > >> I think we should start devolving things out of the pda in the other >> direction: make a series where each patch takes a member of struct >> x8664_pda, converts it to a per-cpu variable (where possible, the same >> one that 32-bit uses), and updates all the references accordingly. When >> the pda is as empty as it can be, we can look at removing the >> pda-specific interfaces. > > This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately. I think Jeremy's point is that by removing the pda struct entirely, the references to the fields can be the same for both x86_32 and x86_64. Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread

* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:02 ` Mike Travis @ 2008-07-09 18:13 ` Christoph Lameter 2008-07-09 18:26 ` Jeremy Fitzhardinge ` (2 more replies) 0 siblings, 3 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 18:13 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Mike Travis wrote: > I think Jeremy's point is that by removing the pda struct entirely, the > references to the fields can be the same for both x86_32 and x86_64. That is going to be difficult. The GS register is tied up for the pda area as long as you have it. And you cannot get rid of the pda because of the library compatibility issues. We would break binary compatibility if we got rid of the pda. If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications. It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:13 ` Christoph Lameter @ 2008-07-09 18:26 ` Jeremy Fitzhardinge 2008-07-09 18:34 ` Christoph Lameter 2008-07-09 18:27 ` Mike Travis 2008-07-09 18:31 ` H. Peter Anvin 2 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 18:26 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > That is going to be difficult. The GS register is tied up for the pda area > as long as you have it. And you cannot get rid of the pda because of the library > compatibility issues. We would break binary compatibility if we would get rid of the pda. > > If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications. > Eh? Yes they will. That's the whole point of making the pda a percpu variable itself. You can use %gs:<small> to get to the pda, and %gs:<larger> to get to percpu variables. Converting pda->percpu will just have the effect of increasing the %gs offset into the percpu space. This project isn't interesting to me unless per-cpu variables are directly accessible off %gs. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:26 ` Jeremy Fitzhardinge @ 2008-07-09 18:34 ` Christoph Lameter 2008-07-09 18:37 ` H. Peter Anvin 2008-07-09 18:48 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 18:34 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > Eh? Yes they will. That's the whole point of making the pda a percpu > variable itself. You can use %gs:<small> to get to the pda, and > %gs:<larger> to get to percpu variables. Converting pda->percpu will > just have the effect of increasing the %gs offset into the percpu space. Right that is what this patchset does. > This project isn't interesting to me unless per-cpu variables are > directly accessible off %gs. Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS). ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:34 ` Christoph Lameter @ 2008-07-09 18:37 ` H. Peter Anvin 2008-07-09 18:48 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 18:37 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, Jack Steiner, linux-kernel Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > >> Eh? Yes they will. That's the whole point of making the pda a percpu >> variable itself. You can use %gs:<small> to get to the pda, and >> %gs:<larger> to get to percpu variables. Converting pda->percpu will >> just have the effect of increasing the %gs offset into the percpu space. > > Right that is what this patchset does. > >> This project isn't interesting to me unless per-cpu variables are >> directly accessible off %gs. > > Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS). > I don't understand the "individual members" requirement here. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:34 ` Christoph Lameter 2008-07-09 18:37 ` H. Peter Anvin @ 2008-07-09 18:48 ` Jeremy Fitzhardinge 2008-07-09 18:53 ` Christoph Lameter 1 sibling, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 18:48 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS). > I have no objections to parts 1-3 of the patchset. It's just 4-15, which does the mechanical conversion. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:48 ` Jeremy Fitzhardinge @ 2008-07-09 18:53 ` Christoph Lameter 2008-07-09 19:07 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 18:53 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > Christoph Lameter wrote: >> Maybe I misunderstood but it seems that you proposed to convert >> individual members of the pda structure (which uses GS) to per cpu >> variables (which without this patchset cannot use GS). >> > > I have no objections to parts 1-3 of the patchset. It's just 4-15, > which does the mechanical conversion. Well yes I agree these could be better if the fields would be moved out of the pda structure itself, but then it won't be mechanical anymore and will require more review. But these are an important step because they allow us to get rid of the pda operations that do not exist for 32 bit. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:53 ` Christoph Lameter @ 2008-07-09 19:07 ` Jeremy Fitzhardinge 2008-07-09 19:12 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 19:07 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > Well yes I agree these could be better if the fields would be moved out of the pda > structure itself but then it wont be mechanical anymore and require more > review. Yes, but they'll have more value. And if you do it as one variable per patch, then it should be easy to bisect should any problems arise. > But these are an important step because they allow us to get rid of the > pda operations that do not exist for 32 bit. > No, they don't help at all, because they convert X_pda(Y) (which doesn't exist) into x86_X_percpu(pda.Y) (which also doesn't exist). They don't remove any #ifdef CONFIG_X86_64's. If you're going to tromp all over the source, you may as well do the most useful thing in the first step. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:07 ` Jeremy Fitzhardinge @ 2008-07-09 19:12 ` Christoph Lameter 2008-07-09 19:32 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 19:12 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > No, they don't help at all, because they convert X_pda(Y) (which doesn't > exist) into x86_X_percpu(pda.Y) (which also doesn't exist). They don't > remove any #ifdef CONFIG_X86_64's. If you're going to tromp all over > the source, you may as well do the most useful thing in the first step. Well they help in the sense that the patches get rid of the special X_pda(Y) operations. x86_X_percpu will then exist under 32 bit and 64 bit. What is remaining is the task to rename pda.Y -> Z in order to make variable references the same under both arches. Presumably the Z is the corresponding 32 bit variable. There are likely a number of cases where the transformation is trivial if we just identify the corresponding 32 bit equivalent. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:12 ` Christoph Lameter @ 2008-07-09 19:32 ` Jeremy Fitzhardinge 2008-07-09 19:41 ` Ingo Molnar 2008-07-09 19:44 ` Christoph Lameter 0 siblings, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 19:32 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > Well they help in the sense that the patches get rid of the special X_pda(Y) operations. > x86_X_percpu will then exist under 32 bit and 64 bit. > > What is remaining is the task to rename > > pda.Y -> Z > > in order to make variable references the same under both arches. Presumably the Z is the corresponding 32 bit variable. There are likely a number of cases where the transformation > is trivial if we just identify the corresponding 32 bit equivalent. > Yes, I understand that, but it's still pointless churn. The intermediate step is no improvement over what was there before, and isn't any closer to the desired final result. Once you've made the pda a percpu variable, and redefined all the X_pda macros in terms of x86_X_percpu, then there's no need to touch all the usage sites until you're *actually* going to unify something. Touching them all just because you find "X_pda" unsightly doesn't help anyone. Ideally every site you touch will remove a #ifdef CONFIG_X86_64, or make two as-yet unified pieces of code closer to unification. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:32 ` Jeremy Fitzhardinge @ 2008-07-09 19:41 ` Ingo Molnar 2008-07-09 19:45 ` H. Peter Anvin ` (2 more replies) 2008-07-09 19:44 ` Christoph Lameter 1 sibling, 3 replies; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 19:41 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Christoph Lameter, Mike Travis, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: >> What is remaining is the task to rename >> >> pda.Y -> Z >> >> in order to make variable references the same under both arches. >> Presumably the Z is the corresponding 32 bit variable. There are >> likely a number of cases where the transformation is trivial if we >> just identify the corresponding 32 bit equivalent. > > Yes, I understand that, but it's still pointless churn. The > intermediate step is no improvement over what was there before, and > isn't any closer to the desired final result. > > Once you've made the pda a percpu variable, and redefined all the > X_pda macros in terms of x86_X_percpu, then there's no need to touch > all the usage sites until you're *actually* going to unify something. > Touching them all just because you find "X_pda" unsightly doesn't help > anyone. Ideally every site you touch will remove a #ifdef > CONFIG_X86_64, or make two as-yet unified pieces of code closer to > unification. that makes sense. Does everyone agree on #1-#2-#3 and then gradual elimination of most pda members (without going through an intermediate renaming of pda members) being the way to go? Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:41 ` Ingo Molnar @ 2008-07-09 19:45 ` H. Peter Anvin 2008-07-09 19:52 ` Christoph Lameter 2008-07-09 21:05 ` Mike Travis 2 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 19:45 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, Christoph Lameter, Mike Travis, Andrew Morton, Eric W. Biederman, Jack Steiner, linux-kernel Ingo Molnar wrote: > > that makes sense. Does everyone agree on #1-#2-#3 and then gradual > elimination of most pda members (without going through an intermediate > renaming of pda members) being the way to go? > Works for me. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:41 ` Ingo Molnar 2008-07-09 19:45 ` H. Peter Anvin @ 2008-07-09 19:52 ` Christoph Lameter 2008-07-09 20:00 ` Ingo Molnar 2008-07-09 21:05 ` Mike Travis 2 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 19:52 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, Mike Travis, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Ingo Molnar wrote: > that makes sense. Does everyone agree on #1-#2-#3 and then gradual > elimination of most pda members (without going through an intermediate > renaming of pda members) being the way to go? I think we all agree on 1-2-3. The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X replaced by the proper percpu names for 32 bit. With Jeremy's approach we would be doing two steps at once (getting rid of pda ops plus unifying the variable names between 32 and 64 bit). Maybe more difficult to verify correctness. The removal of the pda ops is a pretty straightforward conversion. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:52 ` Christoph Lameter @ 2008-07-09 20:00 ` Ingo Molnar 2008-07-09 20:09 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 20:00 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Mike Travis, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel * Christoph Lameter <cl@linux-foundation.org> wrote: > Ingo Molnar wrote: > > > that makes sense. Does everyone agree on #1-#2-#3 and then gradual > > elimination of most pda members (without going through an > > intermediate renaming of pda members) being the way to go? > > I think we all agree on 1-2-3. > > The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X > replaced by the proper percpu names for 32 bit. > > With Jeremy's approach we would be doing two steps at once (getting > rid of pda ops plus unifying the variable names between 32 and 64 > bit). Maybe more difficult to verify correctness. The removal of the > pda ops is a pretty straightforward conversion. Yes, but there's nothing magic about pda variables versus percpu variables. We should be able to do the pda -> unified step just as much as we can do a percpu -> unified step. We can think of pda as a funky, pre-percpu-era relic. The only thing that percpu really offers over pda is its familiarity. read_pda() has the per-cpu-ness embedded in it, which is nasty with regard to tracking preemption properties, etc. So converting to percpu would bring us CONFIG_PREEMPT_DEBUG=y checking to those ex-pda variables. Today if a read_pda() (or anything but pcurrent) is done in a non-preempt region that's likely a bug - but nothing warns about it.
So in that light 4-15 might make some sense in standardizing all these accesses and making sure it all fits into an existing, familiar API world, with no register level assumptions and assembly (and ABI) ties, which is instrumented as well, with explicit smp_processor_id() dependencies, etc. Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:00 ` Ingo Molnar @ 2008-07-09 20:09 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:09 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Lameter, Mike Travis, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Ingo Molnar wrote: > * Christoph Lameter <cl@linux-foundation.org> wrote: > > >> Ingo Molnar wrote: >> >> >>> that makes sense. Does everyone agree on #1-#2-#3 and then gradual >>> elimination of most pda members (without going through an >>> intermediate renaming of pda members) being the way to go? >>> >> I think we all agree on 1-2-3. >> >> The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X >> replaced by the proper percpu names for 32 bit. >> >> With Jeremy's approach we would be doing two steps at once (getting >> rid of pda ops plus unifying the variable names between 32 and 64 >> bit). Maybe more difficult to verify correctness. The removal of the >> pda ops is a pretty straightforward conversion. >> > > Yes, but there's nothing magic about pda variables versus percpu > variables. We should be able to do the pda -> unified step just as much > as we can do a percpu -> unified step. We can think of pda as a funky, > pre-percpu-era relic. > > The only thing that percpu really offers over pda is its familiarity. > read_pda() has the per-cpu-ness embedded in it, which is nasty with > regard to tracking preemption properties, etc. > > So converting to percpu would bring us CONFIG_PREEMPT_DEBUG=y checking > to those ex-pda variables. Today if a read_pda() (or anything but > pcurrent) is done in a non-preempt region that's likely a bug - but > nothing warns about it.
> > So in that light 4-15 might make some sense in standardizing all these > accesses and making sure it all fits into an existing, familar API > world, with no register level assumptions and assembly (and ABI) ties, > which is instrumented as well, with explicit smp_processor_id() > dependencies, etc. > Yeah, but doing #define read_pda(x) x86_read_percpu(x) gives you all that anyway. Though because x86_X_percpu and X_pda are guaranteed to be atomic with respect to preemption, it's actually reasonable to use them with preemption enabled. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:41 ` Ingo Molnar 2008-07-09 19:45 ` H. Peter Anvin 2008-07-09 19:52 ` Christoph Lameter @ 2008-07-09 21:05 ` Mike Travis 2 siblings, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-09 21:05 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, Christoph Lameter, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Ingo Molnar wrote: > * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > >>> What is remaining is the task to rename >>> >>> pda.Y -> Z >>> >>> in order to make variable references the same under both arches. >>> Presumably the Z is the corresponding 32 bit variable. There are >>> likely a number of cases where the transformation is trivial if we >>> just identify the corresponding 32 bit equivalent. >> Yes, I understand that, but it's still pointless churn. The >> intermediate step is no improvement over what was there before, and >> isn't any closer to the desired final result. >> >> Once you've made the pda a percpu variable, and redefined all the >> X_pda macros in terms of x86_X_percpu, then there's no need to touch >> all the usage sites until you're *actually* going to unify something. >> Touching them all just because you find "X_pda" unsightly doesn't help >> anyone. Ideally every site you touch will remove a #ifdef >> CONFIG_X86_64, or make two as-yet unified pieces of code closer to >> unification. > > that makes sense. Does everyone agree on #1-#2-#3 and then gradual > elimination of most pda members (without going through an intermediate > renaming of pda members) being the way to go? > > Ingo This is fine with me... not much more work required to go "all the way"... ;-) ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:32 ` Jeremy Fitzhardinge 2008-07-09 19:41 ` Ingo Molnar @ 2008-07-09 19:44 ` Christoph Lameter 2008-07-09 19:48 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 19:44 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > Yes, I understand that, but it's still pointless churn. The > intermediate step is no improvement over what was there before, and > isn't any closer to the desired final result. No, it's not pointless churn. We actually eliminate all the pda operations and use the per_cpu operations both on 32 and 64 bit. That is unification. We would be glad if you could contribute the patches to get rid of the pda.xxx references. I do not think that either Mike or I have the 32 bit expertise needed to do that step. We went as far as we could. The patches are touching all the points of interest so locating the lines to fix should be easy. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:44 ` Christoph Lameter @ 2008-07-09 19:48 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 19:48 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > We would be glad if you could contribute the patches to get rid of the pda.xxx references. I do not think that either Mike or I have the 32 bit expertise needed to do that step. We went as far as we could. The patches are touching all the points of interest so locating the lines to fix should be easy. > Yes, I'd be happy to contribute. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:13 ` Christoph Lameter 2008-07-09 18:26 ` Jeremy Fitzhardinge @ 2008-07-09 18:27 ` Mike Travis 2008-07-09 18:46 ` Jeremy Fitzhardinge 2008-07-09 18:31 ` H. Peter Anvin 2 siblings, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 18:27 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Christoph Lameter wrote: > Mike Travis wrote: > >> I think Jeremy's point is that by removing the pda struct entirely, the >> references to the fields can be the same for both x86_32 and x86_64. > > That is going to be difficult. The GS register is tied up for the pda area > as long as you have it. And you cannot get rid of the pda because of the library > compatibility issues. We would break binary compatibility if we would get rid of the pda. > > If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications. > > It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same. Is there a comprehensive list of these library accesses to variables offset from %gs, or is it only the "stack_canary"? Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:27 ` Mike Travis @ 2008-07-09 18:46 ` Jeremy Fitzhardinge 2008-07-09 20:22 ` Eric W. Biederman 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 18:46 UTC (permalink / raw) To: Mike Travis Cc: Christoph Lameter, Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Jack Steiner, linux-kernel Mike Travis wrote: > Christoph Lameter wrote: > >> Mike Travis wrote: >> >> >>> I think Jeremy's point is that by removing the pda struct entirely, the >>> references to the fields can be the same for both x86_32 and x86_64. >>> >> That is going to be difficult. The GS register is tied up for the pda area >> as long as you have it. And you cannot get rid of the pda because of the library >> compatibility issues. We would break binary compatibility if we got rid of the pda. >> >> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications. >> >> It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same. >> > > Is there a comprehensive list of these library accesses to variables > offset from %gs, or is it only the "stack_canary"? It's just the stack canary. It isn't library accesses; it's the code gcc generates:

foo:	subq $152, %rsp
	movq %gs:40, %rax
	movq %rax, 136(%rsp)
	...
	movq 136(%rsp), %rdx
	xorq %gs:40, %rdx
	je .L3
	call __stack_chk_fail
.L3:
	addq $152, %rsp
	.p2align 4,,4
	ret

There are two irritating things here: One is that the kernel supports -fstack-protector for x86-64, which forces us into all these contortions in the first place. We don't support stack-protector for 32-bit (gcc does), and things are much easier.
The other somewhat orthogonal irritation is the fixed "40". If they'd generated %gs:__gcc_stack_canary, then we could alias that to a per-cpu variable like anything else and the whole problem would go away - and we could support stack-protector on 32-bit with no problems (and normal usermode could define __gcc_stack_canary to be a weak symbol with value "40" (20 on 32-bit) for backwards compatibility). I'm close to proposing that we run a post-processor over the generated assembly to perform the %gs:40 -> %gs:__gcc_stack_canary transformation and deal with it that way. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:46 ` Jeremy Fitzhardinge @ 2008-07-09 20:22 ` Eric W. Biederman 2008-07-09 20:35 ` Jeremy Fitzhardinge 2008-07-09 21:10 ` Arjan van de Ven 0 siblings, 2 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 20:22 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge <jeremy@goop.org> writes:
> It's just the stack canary. It isn't library accesses; it's the code gcc
> generates:
>
> foo:	subq $152, %rsp
> 	movq %gs:40, %rax
> 	movq %rax, 136(%rsp)
> 	...
> 	movq 136(%rsp), %rdx
> 	xorq %gs:40, %rdx
> 	je .L3
> 	call __stack_chk_fail
> .L3:
> 	addq $152, %rsp
> 	.p2align 4,,4
> 	ret
>
> There are two irritating things here:
>
> One is that the kernel supports -fstack-protector for x86-64, which forces us
> into all these contortions in the first place. We don't support stack-protector
> for 32-bit (gcc does), and things are much easier.

How does gcc know to use %gs instead of the usual %fs for accessing the stack protector variable? My older gcc-4.1.x on ubuntu always uses %fs.

> The other somewhat orthogonal irritation is the fixed "40". If they'd generated
> %gs:__gcc_stack_canary, then we could alias that to a per-cpu variable like
> anything else and the whole problem would go away - and we could support
> stack-protector on 32-bit with no problems (and normal usermode could define
> __gcc_stack_canary to be a weak symbol with value "40" (20 on 32-bit) for
> backwards compatibility).
>
> I'm close to proposing that we run a post-processor over the generated assembly
> to perform the %gs:40 -> %gs:__gcc_stack_canary transformation and deal with it
> that way.

Or we could do something completely evil. And use the other segment register for the stack canary.
I think the unification is valid and useful, and that trying to keep that stupid stack canary working is currently more trouble than it is worth. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:22 ` Eric W. Biederman @ 2008-07-09 20:35 ` Jeremy Fitzhardinge 2008-07-09 20:53 ` Eric W. Biederman 2008-07-09 21:10 ` Arjan van de Ven 1 sibling, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel Eric W. Biederman wrote: > How does gcc know to use %gs instead of the usual %fs for accessing > the stack protector variable? My older gcc-4.1.x on ubuntu always uses %fs. > -mcmodel=kernel > Or we could do something completely evil. And use the other segment > register for the stack canary. > That would still require gcc changes, so it doesn't help much. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:35 ` Jeremy Fitzhardinge @ 2008-07-09 20:53 ` Eric W. Biederman 2008-07-09 21:03 ` Ingo Molnar 2008-07-09 21:16 ` H. Peter Anvin 0 siblings, 2 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 20:53 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Eric W. Biederman, Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel Jeremy Fitzhardinge <jeremy@goop.org> writes: >> Or we could do something completely evil. And use the other segment >> register for the stack canary. >> > > That would still require gcc changes, so it doesn't help much. We could use %fs for the per cpu variables. Then we could set %gs to whatever we wanted to sync up with gcc. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:53 ` Eric W. Biederman @ 2008-07-09 21:03 ` Ingo Molnar 2008-07-09 21:16 ` H. Peter Anvin 1 sibling, 0 replies; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 21:03 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel * Eric W. Biederman <ebiederm@xmission.com> wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > > >> Or we could do something completely evil. And use the other segment > >> register for the stack canary. > >> > > > > That would still require gcc changes, so it doesn't help much. > > We could use %fs for the per cpu variables. Then we could set %gs to > whatever we wanted to sync up with gcc one problem is, there's no SWAPFS instruction. Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:53 ` Eric W. Biederman 2008-07-09 21:03 ` Ingo Molnar @ 2008-07-09 21:16 ` H. Peter Anvin 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 21:16 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel Eric W. Biederman wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > >>> Or we could do something completely evil. And use the other segment >>> register for the stack canary. >>> >> That would still require gcc changes, so it doesn't help much. > > We could use %fs for the per cpu variables. Then we could set %gs to whatever > we wanted to sync up with gcc No swapfs instruction, and extra performance penalty because %fs is used in userspace. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:22 ` Eric W. Biederman 2008-07-09 20:35 ` Jeremy Fitzhardinge @ 2008-07-09 21:10 ` Arjan van de Ven 2008-07-09 23:20 ` Eric W. Biederman 1 sibling, 1 reply; 190+ messages in thread From: Arjan van de Ven @ 2008-07-09 21:10 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel On Wed, 09 Jul 2008 13:22:06 -0700 ebiederm@xmission.com (Eric W. Biederman) wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > > > It's just the stack canary. It isn't library accesses; it's the > > code gcc generates: > > > > foo: subq $152, %rsp > > movq %gs:40, %rax > > movq %rax, 136(%rsp) > > ... > > movq 136(%rsp), %rdx > > xorq %gs:40, %rdx > > je .L3 > > call __stack_chk_fail > > .L3: > > addq $152, %rsp > > .p2align 4,,4 > > ret > > > > > > There are two irritating things here: > > > > One is that the kernel supports -fstack-protector for x86-64, which > > forces us into all these contortions in the first place. We don't > > support stack-protector for 32-bit (gcc does), and things are much > > easier. > > How does gcc know to use %gs instead of the usual %fs for accessing > the stack protector variable? My older gcc-4.1.x on ubuntu always > uses %fs. ubuntu broke gcc (they don't want to have compiler flags per package so patches stuff in gcc instead). > I think the unification is valid and useful, and that trying to keep > that stupid stack canary working is currently more trouble then it is > worth. I think that "unification over everything" is stupid, especially if it removes useful features. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:10 ` Arjan van de Ven @ 2008-07-09 23:20 ` Eric W. Biederman 0 siblings, 0 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 23:20 UTC (permalink / raw) To: Arjan van de Ven Cc: Jeremy Fitzhardinge, Mike Travis, Christoph Lameter, Ingo Molnar, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel Arjan van de Ven <arjan@infradead.org> writes: >> I think the unification is valid and useful, and that trying to keep >> that stupid stack canary working is currently more trouble then it is >> worth. > > I think that "unification over everything" is stupid, especially if it > removes useful features. After looking at this some more, any solution that actually works will enable us to make the stack canary work, as we have a 32-bit offset to deal with. So there is no point in killing the feature. That said, I have no sympathy for a thread local variable that is compiled as an absolute symbol instead of using the proper thread local markup. The implementation of -fstack-protector, however useful, still appears to be a nasty hack, ignoring decades of best practice in how to implement things. Do you have a clue who we need to bug on the gcc team to get the compiler to implement a proper TLS version of -fstack-protector? - Unification over everything is stupid. - Interesting features that disregard decades of implementation experience are also stupid. Since the stack canary is always a part of the executable (being a fundamental part of glibc and libpthreads etc.), we can use the local exec model for TLS storage. The local exec model means the compiler should be able to output code such as "movq %fs:stack_canary@tpoff, %rax" to read the stack canary in user space. Instead it emits the much more stupid "movq %fs:40, %rax", not even letting the linker have a say in the placement of the variable. 
So we either need to update the gcc code to do something proper or someone needs to update the sysv tls abi spec so %fs:40 joins %fs:0 in the ranks of magic addresses in thread local storage, so that other compilers can reliably use offset 40, and no one will have an excuse for changing it in the future. Frankly I think updating the ABI is the wrong solution, but at least it would document this stupidity. Does -fstack-protector compiled code even fail to run with a gcc that does not implement a thread local variable at %fs:40? Or does it just silently break? Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:13 ` Christoph Lameter 2008-07-09 18:26 ` Jeremy Fitzhardinge 2008-07-09 18:27 ` Mike Travis @ 2008-07-09 18:31 ` H. Peter Anvin 2 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 18:31 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, Eric W. Biederman, Jack Steiner, linux-kernel Christoph Lameter wrote: > Mike Travis wrote: > >> I think Jeremy's point is that by removing the pda struct entirely, the >> references to the fields can be the same for both x86_32 and x86_64. > > That is going to be difficult. The GS register is tied up for the pda area > as long as you have it. And you cannot get rid of the pda because of the library > compatibility issues. We would break binary compatibility if we would get rid of the pda. > We're talking about the kernel here... who gives a hoot about binary compatibility? -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 17:27 ` Jeremy Fitzhardinge 2008-07-09 17:39 ` Christoph Lameter @ 2008-07-09 18:00 ` Mike Travis 2008-07-09 19:05 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 18:00 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > Mike Travis wrote: >> This patchset provides the following: >> >> * Cleanup: Fix early references to cpumask_of_cpu(0) >> >> Provides an early cpumask_of_cpu(0) usable before the >> cpumask_of_cpu_map >> is allocated and initialized. >> >> * Generic: Percpu infrastructure to rebase the per cpu area to zero >> >> This provides for the capability of accessing the percpu variables >> using a local register instead of having to go through a table >> on node 0 to find the cpu-specific offsets. It also would allow >> atomic operations on percpu variables to reduce required locking. >> Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the >> generic code that the arch has this new basing. >> >> (Note: split into two patches, one to rebase percpu variables at 0, >> and the second to actually use %gs as the base for percpu variables.) >> >> * x86_64: Fold pda into per cpu area >> >> Declare the pda as a per cpu variable. This will move the pda >> area to an address accessible by the x86_64 per cpu macros. >> Subtraction of __per_cpu_start will make the offset based from >> the beginning of the per cpu area. Since %gs is pointing to the >> pda, it will then also point to the per cpu variables and can be >> accessed thusly: >> >> %gs:[&per_cpu_xxxx - __per_cpu_start] >> >> * x86_64: Rebase per cpu variables to zero >> >> Take advantage of the zero-based per cpu area provided above. >> Then we can directly use the x86_32 percpu operations. x86_32 >> offsets %fs by __per_cpu_start. 
x86_64 has %gs pointing directly >> to the pda and the per cpu area thereby allowing access to the >> pda with the x86_64 pda operations and access to the per cpu >> variables using x86_32 percpu operations. > > The bulk of this series is pda_X to x86_X_percpu conversion. This seems > like pointless churn to me; there's nothing inherently wrong with the > pda_X interfaces, and doing this transformation doesn't get us any > closer to unifying 32 and 64 bit. > > I think we should start devolving things out of the pda in the other > direction: make a series where each patch takes a member of struct > x8664_pda, converts it to a per-cpu variable (where possible, the same > one that 32-bit uses), and updates all the references accordingly. When > the pda is as empty as it can be, we can look at removing the > pda-specific interfaces. > > J I did compartmentalize the changes so they were in separate patches, and in particular, by separating the changes to the include files, I was able to zero in on some problems much easier. But I have no objections to leaving the cpu_pda ops in place and then, as you're suggesting, extract and modify the fields as appropriate. Another approach would be to leave the changes from XXX_pda() to x86_percpu_XXX in place, and do the patches with simply changing pda.VAR to VAR .) In any case I would like to get this version working first, before attempting that rewrite, as that won't change the generated code. Btw, while I've got your attention... ;-), there's some code in arch/x86/xen/smp.c:xen_smp_prepare_boot_cpu() that should be looked at closer for zero-based per_cpu__gdt_page: make_lowmem_page_readwrite(&per_cpu__gdt_page); (I wasn't sure how to deal with this but I suspect the __percpu_offset[] or __per_cpu_load should be added to it.) Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 18:00 ` Mike Travis @ 2008-07-09 19:05 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 19:05 UTC (permalink / raw) To: Mike Travis Cc: Ingo Molnar, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Mike Travis wrote: > I did compartmentalize the changes so they were in separate patches, > and in particular, by separating the changes to the include files, I > was able to zero in on some problems much easier. > > But I have no objections to leaving the cpu_pda ops in place and then, > as you're suggesting, extract and modify the fields as appropriate. > > Another approach would be to leave the changes from XXX_pda() to > x86_percpu_XXX in place, and do the patches with simply changing > pda.VAR to VAR .) > Yes, but that's still two patches where one would do. If I'm going to go through the effort of reconciling your percpu patches with my code, I'd like to be able to remove some #ifdef CONFIG_X86_64s in the process. > In any case I would like to get this version working first, before > attempting that rewrite, as that won't change the generated code. > Well, as far as I can tell the real meat of the series is in 1-3 and the rest is fluff. If just applying 1-3 works, then everything else should too. > Btw, while I've got your attention... ;-), there's some code in > arch/x86/xen/smp.c:xen_smp_prepare_boot_cpu() that should be looked > at closer for zero-based per_cpu__gdt_page: > > make_lowmem_page_readwrite(&per_cpu__gdt_page); > > (I wasn't sure how to deal with this but I suspect the __percpu_offset[] > or __per_cpu_load should be added to it.) Already fixed in the mass of patches I posted yesterday. I turned it into &per_cpu_var(gdt_page)). J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (16 preceding siblings ...) 2008-07-09 17:27 ` Jeremy Fitzhardinge @ 2008-07-09 19:28 ` Ingo Molnar 2008-07-09 20:55 ` Mike Travis 2008-07-09 20:00 ` Eric W. Biederman 18 siblings, 1 reply; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 19:28 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel * Mike Travis <travis@sgi.com> wrote: > * x86_64: Rebase per cpu variables to zero > > Take advantage of the zero-based per cpu area provided above. Then > we can directly use the x86_32 percpu operations. x86_32 offsets > %fs by __per_cpu_start. x86_64 has %gs pointing directly to the > pda and the per cpu area thereby allowing access to the pda with > the x86_64 pda operations and access to the per cpu variables > using x86_32 percpu operations. hm, have the binutils (or gcc) problems with this been resolved? If common binutils versions miscompile the kernel with this feature then i guess we cannot just unconditionally enable it. (My hope is that it's not necessarily a binutils bug but some broken assumption of the kernel somewhere.) Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 19:28 ` Ingo Molnar @ 2008-07-09 20:55 ` Mike Travis 2008-07-09 21:12 ` Ingo Molnar 0 siblings, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-09 20:55 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Ingo Molnar wrote: > * Mike Travis <travis@sgi.com> wrote: > >> * x86_64: Rebase per cpu variables to zero >> >> Take advantage of the zero-based per cpu area provided above. Then >> we can directly use the x86_32 percpu operations. x86_32 offsets >> %fs by __per_cpu_start. x86_64 has %gs pointing directly to the >> pda and the per cpu area thereby allowing access to the pda with >> the x86_64 pda operations and access to the per cpu variables >> using x86_32 percpu operations. > > hm, have the binutils (or gcc) problems with this been resolved? > > If common binutils versions miscompile the kernel with this feature then > i guess we cannot just unconditionally enable it. (My hope is that it's > not necessarily a binutils bug but some broken assumption of the kernel > somewhere.) > > Ingo Currently I'm using gcc-4.2.4. Which are you using? I labeled it "RFC" as it does not quite work without THREAD_ORDER bumped to 2. And I believe the stack overflow is happening because of some interrupt routine as it does not happen on our simulator. After that is taken care of, I'll start regression testing earlier compilers. I think someone mentioned that gcc-2.something was the minimum required...? Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:55 ` Mike Travis @ 2008-07-09 21:12 ` Ingo Molnar 0 siblings, 0 replies; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 21:12 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel * Mike Travis <travis@sgi.com> wrote: > After that is taken care of, I'll start regression testing earlier > compilers. I think someone mentioned that gcc-2.something was the > minimum required...? i think the current official minimum is around gcc-3.2 [2.x is out of question because we have a few feature dependencies on gcc-3.x] - but i stopped using it because it miscompiles the kernel so often. 4.0 was really bad due to large stack footprint. The 4.3.x series miscompiles the kernel too in certain situations - there was a high-rising kerneloops.org crash recently in ext3. So in general, 'too new' is bad because it has new regressions, 'too old' is bad because it has unfixed old regressions. Somewhere in the middle, 4.2.x-ish, seems to be pretty robust in practice. Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 16:51 [RFC 00/15] x86_64: Optimize percpu accesses Mike Travis ` (17 preceding siblings ...) 2008-07-09 19:28 ` Ingo Molnar @ 2008-07-09 20:00 ` Eric W. Biederman 2008-07-09 20:05 ` Jeremy Fitzhardinge ` (3 more replies) 18 siblings, 4 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 20:00 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel I just took a quick look at how stack_protector works on x86_64. Unless there is some deep kernel magic that changes the segment register to %gs from the ABI-specified %fs, CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs. Further, -fstack-protector-all only seems to detect buffer overflows, and thus corruption of the stack, not stack overflows. So it doesn't appear especially useful. So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure out how to use a zero-based percpu area? That should allow us to make the current pda a per cpu variable, and use %gs with a large offset to access the per cpu area. And since it is only the per cpu accesses and the pda accesses that will change, we should not need to fight toolchain issues and other weirdness. The linked binary can remain the same. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:00 ` Eric W. Biederman @ 2008-07-09 20:05 ` Jeremy Fitzhardinge 2008-07-09 20:15 ` Ingo Molnar 2008-07-09 20:07 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, Ingo Molnar, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Eric W. Biederman wrote: > I just took a quick look at how stack_protector works on x86_64. Unless there is > some deep kernel magic that changes the segment register to %gs from the ABI specified > %fs CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs. > -mcmodel=kernel switches it to using %gs. > Further -fstack-protector-all only seems to detect against buffer overflows and > thus corruption of the stack. Not stack overflows. So it doesn't appear especially > useful. > It's a bit useful. But at the cost of preventing a pile of more useful unification work, not to mention making all access to per-cpu variables more expensive. > So we don't we kill the broken CONFIG_CC_STACKPROTECTOR. Stop trying to figure out > how to use a zero based percpu area. > Yes, please. > That should allow us to make the current pda a per cpu variable, and use %gs with > a large offset to access the per cpu area. And since it is only the per cpu accesses > and the pda accesses that will change we should not need to fight toolchain issues > and other weirdness. The linked binary can remain the same. > Yes, and it would be functionally identical to the 32-bit code. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:05 ` Jeremy Fitzhardinge @ 2008-07-09 20:15 ` Ingo Molnar 0 siblings, 0 replies; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 20:15 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: >> Further -fstack-protector-all only seems to detect against buffer >> overflows and thus corruption of the stack. Not stack overflows. So >> it doesn't appear especially useful. > > It's a bit useful. But at the cost of preventing a pile of more > useful unification work, not to mention making all access to per-cpu > variables more expensive. well, stackprotector is near zero maintenance trouble. It mostly binds in places that are fundamentally non-unifiable anyway. (nobody is going to unify the assembly code in switch_to()) i had zero-based percpu problems (early crashes) with a 4.2.3 gcc that had --fstack-protect compiled out, so there's no connection there. In its fixed form in tip/core/stackprotector it can catch the splice exploit which makes it quite a bit useful. It would be rather silly to not offer that feature. Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:00 ` Eric W. Biederman 2008-07-09 20:05 ` Jeremy Fitzhardinge @ 2008-07-09 20:07 ` Ingo Molnar 2008-07-09 20:11 ` Jeremy Fitzhardinge 2008-07-09 20:14 ` Arjan van de Ven 2008-07-09 21:39 ` Mike Travis 3 siblings, 1 reply; 190+ messages in thread From: Ingo Molnar @ 2008-07-09 20:07 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel, Arjan van de Ven * Eric W. Biederman <ebiederm@xmission.com> wrote: > I just took a quick look at how stack_protector works on x86_64. > Unless there is some deep kernel magic that changes the segment > register to %gs from the ABI specified %fs CC_STACKPROTECTOR is > totally broken on x86_64. We access our pda through %gs. > > Further -fstack-protector-all only seems to detect against buffer > overflows and thus corruption of the stack. Not stack overflows. So > it doesn't appear especially useful. CC_STACKPROTECTOR, as fixed in -tip, can catch the splice exploit, and there's no known breakage in it. Deep stack recursion itself is not really interesting. (as that cannot arbitrarily be triggered by attackers in most cases) Random overflows of buffers on the stackframe is a lot more common, and that's what stackprotector protects against. ( Note that deep stack recursion can be caught via another debug mechanism, ftrace's mcount approach. ) > So we don't we kill the broken CONFIG_CC_STACKPROTECTOR. Stop trying > to figure out how to use a zero based percpu area. Note that the zero-based percpu problems are completely unrelated to stackprotector. I was able to hit them with a stackprotector-disabled gcc-4.2.3 environment. Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:07 ` Ingo Molnar @ 2008-07-09 20:11 ` Jeremy Fitzhardinge 2008-07-09 20:18 ` Christoph Lameter 2008-07-09 20:39 ` Arjan van de Ven 0 siblings, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:11 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel, Arjan van de Ven Ingo Molnar wrote: > Note that the zero-based percpu problems are completely unrelated to > stackprotector. I was able to hit them with a stackprotector-disabled > gcc-4.2.3 environment. The only reason we need to keep a zero-based pda is to support stack-protector. If we drop it, we can drop the pda - and its special zero-based properties - entirely. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:11 ` Jeremy Fitzhardinge @ 2008-07-09 20:18 ` Christoph Lameter 2008-07-09 20:33 ` Jeremy Fitzhardinge 2008-07-09 20:35 ` H. Peter Anvin 2008-07-09 20:39 ` Arjan van de Ven 1 sibling, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 20:18 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge wrote: > Ingo Molnar wrote: >> Note that the zero-based percpu problems are completely unrelated to >> stackprotector. I was able to hit them with a stackprotector-disabled >> gcc-4.2.3 environment. > > The only reason we need to keep a zero-based pda is to support > stack-protector. If we drop drop it, we can drop the pda - and its > special zero-based properties - entirely. Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range allows in turn to limit the size of the generated instructions because it is part of the instruction. It also is easier to handle since __per_cpu_start does not figure in the calculation of the offsets. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:18 ` Christoph Lameter @ 2008-07-09 20:33 ` Jeremy Fitzhardinge 2008-07-09 20:42 ` H. Peter Anvin 2008-07-09 21:25 ` Christoph Lameter 2008-07-09 20:35 ` H. Peter Anvin 1 sibling, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:33 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > >> Ingo Molnar wrote: >> >>> Note that the zero-based percpu problems are completely unrelated to >>> stackprotector. I was able to hit them with a stackprotector-disabled >>> gcc-4.2.3 environment. >>> >> The only reason we need to keep a zero-based pda is to support >> stack-protector. If we drop drop it, we can drop the pda - and its >> special zero-based properties - entirely. >> > > > Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range allows in turn to limit the size of the generated instructions because it is part of the instruction. No, it makes no difference. %gs:X always has a 32-bit offset in the instruction, regardless of how big X is: mov %eax, %gs:0 mov %eax, %gs:0x1234567 -> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567 > It also is easier to handle since __per_cpu_start does not figure > in the calculation of the offsets. > No, you do it the same as i386. You set the segment base to be percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo directly. You can use rip-relative addressing to make it a smaller addressing mode too: 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7 J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:33 ` Jeremy Fitzhardinge @ 2008-07-09 20:42 ` H. Peter Anvin 2008-07-09 20:48 ` Jeremy Fitzhardinge 2008-07-09 21:25 ` Christoph Lameter 1 sibling, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 20:42 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Christoph Lameter, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge wrote: > > No, you do it the same as i386. You set the segment base to be > percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo > directly. You can use rip-relative addressing to make it a smaller > addressing mode too: > > 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7 > Thinking about this some more, I don't know if it would make sense to put the x86-64 stack canary at the *end* of the percpu area, and otherwise use negative offsets. That would make sure they were readily reachable from %rip-based references from within the kernel text area. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:42 ` H. Peter Anvin @ 2008-07-09 20:48 ` Jeremy Fitzhardinge 2008-07-09 21:06 ` Eric W. Biederman 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:48 UTC (permalink / raw) To: H. Peter Anvin Cc: Christoph Lameter, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven H. Peter Anvin wrote: > Thinking about this some more, I don't know if it would make sense to > put the x86-64 stack canary at the *end* of the percpu area, and > otherwise use negative offsets. That would make sure they were > readily reachable from %rip-based references from within the kernel > text area. If we can move the canary then a whole pile of options open up. But the problem is that we can't. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:48 ` Jeremy Fitzhardinge @ 2008-07-09 21:06 ` Eric W. Biederman 2008-07-09 21:16 ` H. Peter Anvin 2008-07-09 21:20 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 21:06 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge <jeremy@goop.org> writes: > H. Peter Anvin wrote: >> Thinking about this some more, I don't know if it would make sense to put the >> x86-64 stack canary at the *end* of the percpu area, and otherwise use >> negative offsets. That would make sure they were readily reachable from >> %rip-based references from within the kernel text area. > > If we can move the canary then a whole pile of options open up. But the problem > is that we can't. But we can pick an arbitrary point where %gs points at. Hmm. This whole thing is even sillier than I thought. Why can't we access per cpu vars as: %gs:(per_cpu__var - __per_cpu_start) ? If we can subtract constants and allow the linker to perform that resolution at link time, a zero-based per cpu segment becomes a moot issue. We may need to change the definition of PERCPU in vmlinux.lds.h to #define PERCPU(align) \ . = ALIGN(align); \ - __per_cpu_start = .; \ .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \ + __per_cpu_start = .; \ *(.data.percpu) \ *(.data.percpu.shared_aligned) \ + __per_cpu_end = .; \ + } - } \ - __per_cpu_end = . So that the linker knows __per_cpu_start and __per_cpu_end are in the same section, but otherwise it sounds entirely reasonable. Just slightly trickier math at link time. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:06 ` Eric W. Biederman @ 2008-07-09 21:16 ` H. Peter Anvin 2008-07-09 21:20 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 21:16 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Eric W. Biederman wrote: > > But we can pick an arbitrary point where %gs points at. > > Hmm. This whole thing is even sillier than I thought. > Why can't we access per cpu vars as: > %gs:(per_cpu__var - __per_cpu_start) ? > > If we can subtract constants and let the linker perform that resolution > at link time, a zero-based per cpu segment becomes a moot issue. > And then we're back here again! Supposedly the linker buggers up, although we don't have conclusive evidence... -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:06 ` Eric W. Biederman 2008-07-09 21:16 ` H. Peter Anvin @ 2008-07-09 21:20 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 21:20 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Eric W. Biederman wrote: > But we can pick an arbitrary point where %gs points at. > > Hmm. This whole thing is even sillier than I thought. > Why can't we access per cpu vars as: > %gs:(per_cpu__var - __per_cpu_start) ? > Because there's no linker reloc for doing subtraction (or addition) of two symbols. > If we can subtract constants and let the linker perform that resolution > at link time, a zero-based per cpu segment becomes a moot issue. > They're not constants; they're symbols. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:33 ` Jeremy Fitzhardinge 2008-07-09 20:42 ` H. Peter Anvin @ 2008-07-09 21:25 ` Christoph Lameter 2008-07-09 21:36 ` H. Peter Anvin ` (2 more replies) 1 sibling, 3 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-09 21:25 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge wrote: > No, it makes no difference. %gs:X always has a 32-bit offset in the > instruction, regardless of how big X is: > > mov %eax, %gs:0 > mov %eax, %gs:0x1234567 > -> > 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0 > 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567 The processor itself supports smaller offsets. Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation. >> It also is easier to handle since __per_cpu_start does not figure >> in the calculation of the offsets. >> > > No, you do it the same as i386. You set the segment base to be > percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo > directly. You can use rip-relative addressing to make it a smaller > addressing mode too: > > 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7 RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:25 ` Christoph Lameter @ 2008-07-09 21:36 ` H. Peter Anvin 2008-07-09 21:41 ` Jeremy Fitzhardinge 2008-07-09 22:22 ` Eric W. Biederman 2 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 21:36 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > >> No, it makes no difference. %gs:X always has a 32-bit offset in the >> instruction, regardless of how big X is: >> >> mov %eax, %gs:0 >> mov %eax, %gs:0x1234567 >> -> >> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0 >> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567 > > The processor itself supports smaller offsets. No, it doesn't, unless you have a base register. There is no naked disp8 form, and disp16 is only available in 16- or 32-bit mode (and in 32-bit form it requires a 67h prefix.) > Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation. The offset is either ±2 GB from the segment register, or ±2 GB from the segment register plus %rip. The latter is more efficient. The processor *does* permit a 64-bit absolute form, which can be used with a segment register, but that one is hideously restricted (only move to/from %rax) and bloated (10 bytes!) >> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7 > > RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area. Not from the per cpu area, but from the linked address of the per cpu area (the segment register base can point anywhere.) In our case that means between -2 GB and a smallish positive value (I believe it is guaranteed to be 2 MB or more.)
Being able to use %rip-relative forms would save a byte per reference, which is valuable. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:25 ` Christoph Lameter 2008-07-09 21:36 ` H. Peter Anvin @ 2008-07-09 21:41 ` Jeremy Fitzhardinge 2008-07-09 22:22 ` Eric W. Biederman 2 siblings, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 21:41 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > > >> No, it makes no difference. %gs:X always has a 32-bit offset in the >> instruction, regardless of how big X is: >> >> mov %eax, %gs:0 >> mov %eax, %gs:0x1234567 >> -> >> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0 >> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567 >> > > The processor itself supports smaller offsets. > Not in 64-bit mode. In 32-bit mode you can use the addr16 prefix, but that would only save a byte per use (and I doubt it's a fast-path in the processor). > Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation. > No. The %gs base is a full 64-bit value you can put anywhere in the address space. So long as your percpu data is within 2G of that point you can get to it directly. >> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) >> > > RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area. > It means the percpu symbols must be within 2G of your code. We can't compile the kernel any other way (there's no -mcmodel=large-kernel). J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:25 ` Christoph Lameter 2008-07-09 21:36 ` H. Peter Anvin 2008-07-09 21:41 ` Jeremy Fitzhardinge @ 2008-07-09 22:22 ` Eric W. Biederman 2008-07-09 22:32 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 22:22 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Ingo Molnar, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter <cl@linux-foundation.org> writes: > Note also that the 32 bit offset size limits the offset that can be added to the > segment register. You need to place the per cpu area either in the last > 2G of the address space or in the first 2G. The zero based approach removes that > limitation. Good point. Which means that fundamentally we need to come up with a special linker segment or some other way to guarantee that the offsets we use for per cpu variables are within 2G of the segment register. Which means that my idea of using the technique we use on x86_32 will not work. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 22:22 ` Eric W. Biederman @ 2008-07-09 22:32 ` Jeremy Fitzhardinge 2008-07-09 23:36 ` Eric W. Biederman 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 22:32 UTC (permalink / raw) To: Eric W. Biederman Cc: Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Eric W. Biederman wrote: > Christoph Lameter <cl@linux-foundation.org> writes: > > >> Note also that the 32 bit offset size limits the offset that can be added to the >> segment register. You need to place the per cpu area either in the last >> 2G of the address space or in the first 2G. The zero based approach removes that >> limitation. >> > > Good point. Which means that fundamentally we need to come up with a special > linker segment or some other way to guarantee that the offsets we use for per > cpu variables are within 2G of the segment register. > > Which means that my idea of using the technique we use on x86_32 will not work. No, the compiler memory model we use guarantees that everything will be within 2G of each other. The linker will spew loudly if that's not the case. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 22:32 ` Jeremy Fitzhardinge @ 2008-07-09 23:36 ` Eric W. Biederman 2008-07-10 0:19 ` H. Peter Anvin 2008-07-10 0:23 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 23:36 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge <jeremy@goop.org> writes: >> Which means that my idea of using the technique we use on x86_32 will not > work. > > No, the compiler memory model we use guarantees that everything will be within > 2G of each other. The linker will spew loudly if that's not the case. The per cpu area is at least theoretically dynamically allocated. And we really want to put it in cpu local memory. Which means on any reasonable NUMA machine the per cpu areas should be all over the box. So with an arbitrary 64bit address in %gs there is no guarantee of anything. Grr. Except you are correct. We have to guarantee that the offsets we have chosen at compile time still work. And we know all of the compile time offsets will be in the -2G range. So they are all 32bit numbers. Negative 32bit numbers to be sure. That trivially leaves us with everything working except the nasty hard coded decimal 40. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 23:36 ` Eric W. Biederman @ 2008-07-10 0:19 ` H. Peter Anvin 2008-07-10 0:24 ` Jeremy Fitzhardinge 2008-07-10 0:23 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 0:19 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Eric W. Biederman wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > >>> Which means that my idea of using the technique we use on x86_32 will not >> work. >> >> No, the compiler memory model we use guarantees that everything will be within >> 2G of each other. The linker will spew loudly if that's not the case. > > The per cpu area is at least theoretically dynamically allocated. And we > really want to put it in cpu local memory. Which means on any reasonable > NUMA machine the per cpu areas should be all over the box. > > So with an arbitrary 64bit address in %gs there is no guarantee of anything. > That doesn't matter in the slightest. > Grr. Except you are correct. We have to guarantee that the offsets we have > chosen at compile time still work. And we know all of the compile time offsets > will be in the -2G range. So they are all 32bit numbers. Negative 32bit > numbers to be sure. That trivially leaves us with everything working except > the nasty hard coded decimal 40. The *offsets* have to be in the proper range, but the %gs base is an arbitrary 64-bit number. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 0:19 ` H. Peter Anvin @ 2008-07-10 0:24 ` Jeremy Fitzhardinge 2008-07-10 14:14 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 0:24 UTC (permalink / raw) To: H. Peter Anvin Cc: Eric W. Biederman, Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven H. Peter Anvin wrote: > Eric W. Biederman wrote: >> Jeremy Fitzhardinge <jeremy@goop.org> writes: >> >>>> Which means that my idea of using the technique we use on x86_32 >>>> will not >>> work. >>> >>> No, the compiler memory model we use guarantees that everything will >>> be within >>> 2G of each other. The linker will spew loudly if that's not the case. >> >> The per cpu area is at least theoretically dynamically allocated. >> And we >> really want to put it in cpu local memory. Which means on any >> reasonable >> NUMA machine the per cpu areas should be all over the box. >> >> So there is no guarantee that with an arbitrary 64bit address in %gs >> of anything. >> > > That doesn't matter in the slightest. Creepy, get out of my brain. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 0:24 ` Jeremy Fitzhardinge @ 2008-07-10 14:14 ` Christoph Lameter 2008-07-10 14:26 ` H. Peter Anvin 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 14:14 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven With the zero based approach you do not have a relative address anymore. We are basically creating a new absolute address space where we place variables starting at zero. This means that we are fully independent from the placement of the percpu segment. The loader may place the per cpu segment with the initialized variables anywhere. We just need to set GS correctly for the boot cpu. We always need to refer to the per cpu variables via GS or by adding the per cpu offset from __per_cpu_offset[] (which is now badly named because it points directly to the start of the percpu segment for each processor). So there is no 2G limitation on the distance between the code and the percpu segment anymore. The 2G limitation still exists for the *size* of the per cpu segment. If we go beyond 2G in defined per cpu variables then the per cpu addresses will wrap. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 14:14 ` Christoph Lameter @ 2008-07-10 14:26 ` H. Peter Anvin 2008-07-10 15:26 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 14:26 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > With the zero based approach you do not have a relative address anymore. We are basically creating a new absolute address space where we place variables starting at zero. > > This means that we are fully independent from the placement of the percpu segment. > > The loader may place the per cpu segment with the initialized variables anywhere. We just need to set GS correctly for the boot cpu. We always need to refer to the per cpu variables > via GS or by adding the per cpu offset from __per_cpu_offset[] (which is now badly named because it points directly to the start of the percpu segment for each processor). > > So there is no 2G limitation on the distance between the code and the percpu segment anymore. The 2G limitation still exists for the *size* of the per cpu segment. If we go beyond 2G in defined per cpu variables then the per cpu addresses will wrap. Okay, this is getting somewhat annoying. Several people now have missed the point. No one has talked about the actual placement of the percpu segment data. Using RIP-based references, however, is *cheaper* than using absolute references. For RIP-based references to be valid, the *offsets* need to be in the range [-2 GB + CONFIG_PHYSICAL_START ... CONFIG_PHYSICAL_START). This is similar to the constraint on absolute references, where the *offsets* have to be in the range [-2 GB, 2 GB). None of this affects the absolute positioning of the data. The final address is determined by: gs_base + rip + offset or gs_base + offset ... respectively.
gs_base is an arbitrary 64-bit number; rip (in the kernel) is in the range [-2 GB + CONFIG_PHYSICAL_START, 0), and offset is in the range [-2 GB, 2 GB). (The high end of the rip range above is slightly too wide.) -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 14:26 ` H. Peter Anvin @ 2008-07-10 15:26 ` Christoph Lameter 2008-07-10 15:42 ` H. Peter Anvin 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 15:26 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven H. Peter Anvin wrote: > No one has talked about the actual placement of the percpu segment data. But the placement of the percpu segment data is a problem because of the way we currently have the linker calculate offsets. I have had kernel configurations where I changed the placement of the percpu segment leading to linker failures because the percpu segment was not in 2G range of the code segment! This is a particular problem if we have a large number of processors (like 4096) that each require a sizable segment of virtual address space up there for the per cpu allocator. > None of this affects the absolute positioning of the data. The final > address is determined by: > > gs_base + rip + offset > or > gs_base + offset > > ... respectively. gs_base is an arbitrary 64-bit number; rip (in the > kernel) is in the range [-2 GB + CONFIG_PHYSICAL_START, 0), and offset > is in the range [-2 GB, 2 GB). Well, the zero-based approach results in this always becoming gs_base + absolute address in the per cpu segment. Why are RIP based references cheaper? The offset to the per cpu segment is certainly more than what can fit into 16 bits. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 15:26 ` Christoph Lameter @ 2008-07-10 15:42 ` H. Peter Anvin 2008-07-10 16:24 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 15:42 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > > Well the zero based results in this becoming always > > gs_base + absolute address in per cpu segment You can do either way. For RIP-based, you have to worry about the possible range for the RIP register when referencing. Currently, even for "make allyesconfig" the per cpu segment is a lot smaller than the minimum value for CONFIG_PHYSICAL_START (2 MB), so there is no issue, but there is a distinct lack of wiggle room, which can be resolved either by using negative offsets, or by moving the kernel text area up a bit from -2 GB. > Why are RIP based references cheaper? The offset to the per cpu segment is certainly more than what can be fit into 16 bits. Where are you getting 16 bits from?!?! *There are no 16-bit offsets in 64-bit mode, period, full stop.* RIP-based references are cheaper because the x86-64 architects chose to optimize RIP-based references over absolute references. Therefore RIP-based references are encodable with only a MODR/M byte, whereas absolute references require a SIB byte as well -- longer instruction, possibly a less optimized path through the CPU, and *definitely* something that gets exercised less in the linker. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 15:42 ` H. Peter Anvin @ 2008-07-10 16:24 ` Christoph Lameter 2008-07-10 16:33 ` H. Peter Anvin 2008-07-10 17:26 ` Eric W. Biederman 0 siblings, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 16:24 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven H. Peter Anvin wrote: > but there is a distinct lack of wiggle room, which can be resolved > either by using negative offsets, or by moving the kernel text area up a > bit from -2 GB. Let's say we reserve 256MB of cpu alloc space per processor. On a system with 4k processors this will result in the need for 1TB virtual address space for per cpu areas (note that there may be more processors in the future). Preferably we would calculate the address of the per cpu area by PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id() instead of looking it up in a table because that will save a memory access on per_cpu(). The first percpu area would ideally be the per cpu segment generated by the linker. How would that fit into the address map? In particular the 2G distance between code and the first per cpu area must not be violated unless we go to a zero based approach. Maybe there is another way of arranging things that would allow for this? ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:24 ` Christoph Lameter @ 2008-07-10 16:33 ` H. Peter Anvin 2008-07-10 16:45 ` Christoph Lameter 2008-07-10 17:26 ` Eric W. Biederman 1 sibling, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 16:33 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > H. Peter Anvin wrote: > >> but there is a distinct lack of wiggle room, which can be resolved >> either by using negative offsets, or by moving the kernel text area up a >> bit from -2 GB. > > Lets say we reserve 256MB of cpu alloc space per processor. > > On a system with 4k processors this will result in the need for 1TB virtual address space for per cpu areas (note that there may be more processors in the future). Preferably we would calculate the address of the per cpu area by > > PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id() > > instead of looking it up in a table because that will save a memory access on per_cpu(). It will, but it might still be a net loss due to higher load on the TLB (you're effectively using the TLB to do the table lookup for you.) On the other hand, Mike points out that once we move away from fixed-sized segments we pretty much have to use virtual addresses anyway(*). > The first percpu area would ideally be the per cpu segment generated by the linker. > > How would that fit into the address map? In particular the 2G distance between code and the first per cpu area must not be violated unless we go to a zero based approach. If with "zero-based" you mean "nonzero gs_base for the boot CPU" then yes, you're right. Note again that that is completely orthogonal to RIP-based versus absolute. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:33 ` H. Peter Anvin @ 2008-07-10 16:45 ` Christoph Lameter 2008-07-10 17:33 ` Jeremy Fitzhardinge 2008-07-10 17:53 ` H. Peter Anvin 0 siblings, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 16:45 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven H. Peter Anvin wrote: > It will, but it might still be a net loss due to higher load on the TLB > (you're effectively using the TLB to do the table lookup for you.) On > the other hand, Mike points out that once we move away from fixed-sized > segments we pretty much have to use virtual addresses anyway(*). There will be no additional overhead since the memory is already mapped 1-1 using 2MB TLBs and we want to use the same for the percpu areas. This is similar to the vmemmap solution. >> The first percpu area would ideally be the per cpu segment generated >> by the linker. >> >> How would that fit into the address map? In particular the 2G distance >> between code and the first per cpu area must not be violated unless we >> go to a zero based approach. > > If with "zero-based" you mean "nonzero gs_base for the boot CPU" then > yes, you're right. > > Note again that that is completely orthogonal to RIP-based versus absolute. ?? The distance to the per cpu area for cpu 0 is larger than 2G. The kernel won't link with RIP based addresses. You would have to place the per cpu areas 1TB before the kernel text. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:45 ` Christoph Lameter @ 2008-07-10 17:33 ` Jeremy Fitzhardinge 2008-07-10 17:42 ` Christoph Lameter 2008-07-10 17:53 ` H. Peter Anvin 1 sibling, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 17:33 UTC (permalink / raw) To: Christoph Lameter Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > H. Peter Anvin wrote: > > >> It will, but it might still be a net loss due to higher load on the TLB >> (you're effectively using the TLB to do the table lookup for you.) On >> the other hand, Mike points out that once we move away from fixed-sized >> segments we pretty much have to use virtual addresses anyway(*). >> > > There will be no additional overhead since the memory is already mapped 1-1 using 2MB TLBs and we want to use the same for the percpu areas. This is similar to the vmemmap solution. > > >>> The first percpu area would ideally be the per cpu segment generated >>> by the linker. >>> >>> How would that fit into the address map? In particular the 2G distance >>> between code and the first per cpu area must not be violated unless we >>> go to a zero based approach. >>> >> If with "zero-based" you mean "nonzero gs_base for the boot CPU" then >> yes, you're right. >> >> Note again that that is completely orthogonal to RIP-based versus absolute. >> > > ?? The distance to the per cpu area for cpu 0 is larger than 2G. The kernel won't link with RIP based addresses. You would have to place the per cpu areas 1TB before the kernel text. If %gs:0 points to the start of your percpu area, then all the offsets off %gs are going to be no larger than the amount of percpu memory you have. The gs base itself can be any 64-bit address, so it doesn't matter where it is within overall kernel memory.
Using a zero-based percpu area means that you must set a non-zero %gs base before you can access the percpu area. If the layout of the percpu area is done by the linker by packing all the percpu variables into one section, then any address computation using a percpu variable symbol will generate an offset which is appropriate to apply to a %gs: addressing mode. The nice thing about the non-zero-based scheme i386 uses is that setting gs-base to zero means that percpu variable accesses go directly to the prototype percpu data area, which simplifies boot time setup (which is doubly awkward on 32-bit because you need to generate a GDT entry rather than just load an MSR as you do in 64-bit). J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:33 ` Jeremy Fitzhardinge @ 2008-07-10 17:42 ` Christoph Lameter 2008-07-10 17:53 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 17:42 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge wrote: > If %gs:0 points to start of your percpu area, then all the offsets off > %gs are going to be no larger than the amount of percpu memory you > have. The gs base itself can be any 64-bit address, so it doesn't > matter where it is within overall kernel memory. Using zero-based > percpu area means that you must set a non-zero %gs base before you can > access the percpu area. Correct. > If the layout of the percpu area is done by the linker by packing all > the percpu variables into one section, then any address computation > using a percpu variable symbol will generate an offset which is > appropriate to apply to a %gs: addressing mode. Of course. > The nice thing about the non-zero-based scheme i386 uses is that setting > gs-base to zero means that percpu variables accesses get directly to the > prototype percpu data area, which simplifies boot time setup (which is > doubly awkward on 32-bit because you need to generate a GDT entry rather > than just load an MSR as you do in 64-bit). Great, but it causes trouble in other ways as discussed. It's best to consistently access per cpu variables using the segment registers. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:42 ` Christoph Lameter @ 2008-07-10 17:53 ` Jeremy Fitzhardinge 2008-07-10 17:55 ` H. Peter Anvin 2008-07-10 20:52 ` Christoph Lameter 0 siblings, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 17:53 UTC (permalink / raw) To: Christoph Lameter Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: >> The nice thing about the non-zero-based scheme i386 uses is that setting >> gs-base to zero means that percpu variables accesses get directly to the >> prototype percpu data area, which simplifies boot time setup (which is >> doubly awkward on 32-bit because you need to generate a GDT entry rather >> than just load an MSR as you do in 64-bit). >> > > Great, but it causes trouble in other ways as discussed. What other trouble? It works fine. > It's best to consistently access > per cpu variables using the segment registers. > It is, but initially the segment base is 0, so just using a percpu variable does something sensible from the start with no special setup. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:53 ` Jeremy Fitzhardinge @ 2008-07-10 17:55 ` H. Peter Anvin 2008-07-10 20:52 ` Christoph Lameter 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 17:55 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Christoph Lameter, Eric W. Biederman, Ingo Molnar, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Jeremy Fitzhardinge wrote: > >> Its best to consistently access >> per cpu variables using the segment registers. > > It is, but initially the segment base is 0, so just using a percpu > variable does something sensible from the start with no special setup. > That's easy enough to fix -- synthesizing a GDT entry is trivial enough -- but since it currently works, and since the offsets on 32 bits can reach the full address space anyway, there is no reason to change. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:53 ` Jeremy Fitzhardinge
@ 2008-07-10 20:52 ` Christoph Lameter
2008-07-10 20:58 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 20:52 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Jeremy Fitzhardinge wrote:

> What other trouble? It works fine.

Somehow you performed a mind wipe to get rid of all memory of the earlier messages?

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:52 ` Christoph Lameter
@ 2008-07-10 20:58 ` Jeremy Fitzhardinge
2008-07-10 21:03 ` H. Peter Anvin
2008-07-10 21:05 ` Christoph Lameter
0 siblings, 2 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-10 20:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> What other trouble? It works fine.
>>
>
> Somehow you performed a mind wipe to get rid of all memory of the earlier messages?
>

Percpu on i386 hasn't been a point of discussion. It works fine, and
has been working fine for a long time. The same mechanism would work
fine on x86-64. Its only "issue" is that it doesn't support the broken
gcc ABI for stack-protector.

The problem is all zero-based percpu on x86-64.

    J

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:58 ` Jeremy Fitzhardinge
@ 2008-07-10 21:03 ` H. Peter Anvin
2008-07-11 0:55 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 21:03 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Christoph Lameter, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Jeremy Fitzhardinge wrote:
>
> Percpu on i386 hasn't been a point of discussion. It works fine, and
> has been working fine for a long time. The same mechanism would work
> fine on x86-64. Its only "issue" is that it doesn't support the broken
> gcc ABI for stack-protector.
>
> The problem is all zero-based percpu on x86-64.
>

Well, x86-64 has *two* issues: limited range of offsets (regardless of
whether we do RIP-relative or not), and the stack-protector ABI.

I'm still trying to reproduce Mike's setup, but I suspect it can be
switched to RIP-relative for the fixed-offset (static) stuff; for the
dynamic stuff it's all via pointers anyway, so the offsets don't matter.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:03 ` H. Peter Anvin
@ 2008-07-11 0:55 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-11 0:55 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Jeremy Fitzhardinge, Christoph Lameter, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the
>> broken gcc ABI for stack-protector.
>>
>> The problem is all zero-based percpu on x86-64.
>>
>
> Well, x86-64 has *two* issues: limited range of offsets (regardless of
> whether we do RIP-relative or not), and the stack-protector ABI.
>
> I'm still trying to reproduce Mike's setup, but I suspect it can be
> switched to RIP-relative for the fixed-offset (static) stuff; for the
> dynamic stuff it's all via pointers anyway, so the offsets don't matter.
>
> -hpa

I'm rebuilding my tip tree now; that should bring it up to date. I'll
repost patches #1 to (currently) #4 shortly. I'm looking at some code
that does not have patches I sent in some 4 to 5 months ago
(acpi/NR_CPUS related changes).

Thanks,
Mike

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:58 ` Jeremy Fitzhardinge
2008-07-10 21:03 ` H. Peter Anvin
@ 2008-07-10 21:05 ` Christoph Lameter
2008-07-10 21:22 ` Eric W. Biederman
2008-07-10 21:29 ` H. Peter Anvin
1 sibling, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 21:05 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Jeremy Fitzhardinge wrote:

> Percpu on i386 hasn't been a point of discussion. It works fine, and
> has been working fine for a long time. The same mechanism would work
> fine on x86-64. Its only "issue" is that it doesn't support the broken
> gcc ABI for stack-protector.

Well, that is one thing; then there are the scaling issues, support for
the new cpu allocator, new arch-independent cpu operations, etc.

> The problem is all zero-based percpu on x86-64.

The zero-based stuff will enable a lot of things. Please have a look at
the cpu_alloc patchsets.

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:05 ` Christoph Lameter
@ 2008-07-10 21:22 ` Eric W. Biederman
2008-07-10 21:29 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 21:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, H. Peter Anvin, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Christoph Lameter <cl@linux-foundation.org> writes:

> Jeremy Fitzhardinge wrote:
>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the broken
>> gcc ABI for stack-protector.
>
> Well, that is one thing and then the scaling issues, and the support of the new
> cpu allocator, new arch independent cpu operations etc.
>
>> The problem is all zero-based percpu on x86-64.
>
> The zero based stuff will enable a lot of things. Please have a look at the
> cpu_alloc patchsets.

Christoph again. The reason we are balking at the zero-based percpu
area is NOT because it is zero based. It is because systems with it
patched in don't work reliably.

The bottom line is, if the tools don't support a clever idea we can't
use it.

Hopefully the problem can be root caused and we can use a zero-based
percpu area. There are several ways we can achieve that.

Further, any design that depends on a zero-based percpu can work with a
contiguous percpu with an offset, so we should not be breaking whatever
design you have for the percpu allocator.

Eric

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:05 ` Christoph Lameter
2008-07-10 21:22 ` Eric W. Biederman
@ 2008-07-10 21:29 ` H. Peter Anvin
2008-07-11 0:12 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 21:29 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the broken
>> gcc ABI for stack-protector.
>
> Well, that is one thing and then the scaling issues, and the support of the new cpu allocator, new arch independent cpu operations etc.
>
>> The problem is all zero-based percpu on x86-64.
>
> The zero based stuff will enable a lot of things. Please have a look at the cpu_alloc patchsets.
>

No argument this work is worthwhile. The main issues on the table are
the particular choice of offsets, and the handling of the virtual space
-- I believe 2 MB mappings are too large, except perhaps as an option.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 21:29 ` H. Peter Anvin
@ 2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
` (2 more replies)
0 siblings, 3 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-11 0:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

H. Peter Anvin wrote:
...
> -- I believe 2 MB mappings are too large, except perhaps as an option.
>
> -hpa

Hmm, that might be the way to go.... At boot-up time, determine the
size of the system in terms of cpu count and memory available and
attempt to do the right thing, with startup options to override the
internal choices... ?

(Surely a system that has a "gazillion ip tunnels" could modify its
kernel start options... ;-)

Unfortunately, we can't use a MODULE to support different options unless
we change how the kernel starts up (we would need to mount the root fs
before starting secondary cpus.)

Btw, the "zero_based_only" patch (w/o the pda folded into the percpu
area) gets to the point shown below. Dropping NR_CPUS from 4096 to 256
clears up the error. So except for the "stack overflow" message I got
yesterday, the result is the same. As soon as I get a chance, I'll try
it out with gcc-4.2.0 to see if it changes the boot-up problem.

Thanks,
Mike

[ 0.096000] ACPI: Core revision 20080321
[ 0.108889] Parsing all Control Methods:
[ 0.116198] Table [DSDT](id 0001) - 364 Objects with 40 Devices 109 Methods 20 Regions
[ 0.124000] Parsing all Control Methods:
[ 0.128000] Table [SSDT](id 0002) - 43 Objects with 0 Devices 16 Methods 0 Regions
[ 0.132000] tbxface-0598 [02] tb_load_namespace : ACPI Tables successfully acquired
[ 0.148000] evxfevnt-0091 [02] enable : Transition to ACPI mode successful
[ 0.200000] CPU0: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz stepping 07
[ 0.211685] Using local APIC timer interrupts.
[ 0.220000] APIC timer calibration result 20781901
[ 0.224000] Detected 20.781 MHz APIC timer.
[ 0.228000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 0.228000] IP: [<0000000000000000>]
[ 0.228000] PGD 0
[ 0.228000] Oops: 0010 [1] SMP
[ 0.228000] CPU 0
[ 0.228000] Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip-ingo-test-0701-00208-g79a4d68-dirty #7
[ 0.228000] RIP: 0010:[<0000000000000000>] [<0000000000000000>]
[ 0.228000] RSP: 0000:ffff81022ed1fe18 EFLAGS: 00010286
[ 0.228000] RAX: 0000000000000000 RBX: ffff81022ed1fe84 RCX: ffffffff80d0de80
[ 0.228000] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffffff80d0de80
[ 0.228000] RBP: ffff81022ed1fe50 R08: ffff81022ed1fe84 R09: ffffffff80e28ae0
[ 0.228000] R10: ffff81022ed1fe80 R11: ffff81022ed39188 R12: 00000000ffffffff
[ 0.228000] R13: ffffffff80d0de40 R14: 0000000000000001 R15: 0000000000000003
[ 0.228000] FS: 0000000000000000(0000) GS:ffffffff80de69c0(0000) knlGS:0000000000000000
[ 0.228000] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 0.228000] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[ 0.228000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.228000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.228000] Process swapper (pid: 1, threadinfo ffff81022ed10000, task ffff81022ec93100)
[ 0.228000] Stack: ffffffff8024f34c 0000000000000000 0000000000000001 0000000000000001
[ 0.228000] 00000000fffffff0 ffffffff80e8e4e0 0000000000092fd0 ffff81022ed1fe60
[ 0.228000] ffffffff8024f3cc ffff81022ed1fea0 ffffffff808ef5b8 0000000000000008
[ 0.228000] Call Trace:
[ 0.228000] [<ffffffff8024f34c>] ? notifier_call_chain+0x38/0x60
[ 0.228000] [<ffffffff8024f3cc>] __raw_notifier_call_chain+0xe/0x10
[ 0.228000] [<ffffffff808ef5b8>] cpu_up+0xa8/0x138
[ 0.228000] [<ffffffff80e4d9b9>] kernel_init+0xdf/0x327
[ 0.228000] [<ffffffff8020d4b8>] child_rip+0xa/0x12
[ 0.228000] [<ffffffff8020c955>] ? restore_args+0x0/0x30
[ 0.228000] [<ffffffff80e4d8da>] ? kernel_init+0x0/0x327
[ 0.228000] [<ffffffff8020d4ae>] ? child_rip+0x0/0x12
[ 0.228000]
[ 0.228000]
[ 0.228000] Code: Bad RIP value.
[ 0.228000] RIP [<0000000000000000>]
[ 0.228000] RSP <ffff81022ed1fe18>
[ 0.228000] CR2: 0000000000000000
[ 0.232000] ---[ end trace a7919e7f17c0a725 ]---
[ 0.236000] Kernel panic - not syncing: Attempted to kill init!

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:12 ` Mike Travis
@ 2008-07-11 0:14 ` H. Peter Anvin
2008-07-11 0:58 ` Mike Travis
2008-07-11 0:42 ` Eric W. Biederman
2008-07-11 15:36 ` Christoph Lameter
2 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-11 0:14 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Mike Travis wrote:
>
> Hmm, that might be the way to go.... At boot-up time, determine the
> size of the system in terms of cpu count and memory available and
> attempt to do the right thing, with startup options to override the
> internal choices... ?
>
> (Surely a system that has a "gazillion ip tunnels" could modify its
> kernel start options... ;-)
>
> Unfortunately, we can't use a MODULE to support different options unless
> we change how the kernel starts up (would need to mount the root fs
> before starting secondary cpus.)
>

Using a module doesn't make any sense anyway. This is more what the
kernel command line is for.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:14 ` H. Peter Anvin
@ 2008-07-11 0:58 ` Mike Travis
2008-07-11 1:41 ` H. Peter Anvin
2008-07-11 15:37 ` Christoph Lameter
0 siblings, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-11 0:58 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

H. Peter Anvin wrote:
> Mike Travis wrote:
>>
>> Hmm, that might be the way to go.... At boot-up time, determine the
>> size of the system in terms of cpu count and memory available and
>> attempt to do the right thing, with startup options to override the
>> internal choices... ?
>>
>> (Surely a system that has a "gazillion ip tunnels" could modify its
>> kernel start options... ;-)
>>
>> Unfortunately, we can't use a MODULE to support different options unless
>> we change how the kernel starts up (would need to mount the root fs
>> before starting secondary cpus.)
>>
>
> Using a module doesn't make any sense anyway. This is more what the
> kernel command line is for.
>
> -hpa

I was thinking that supporting virtual percpu addresses would take a fair
amount of code that, if living in a MODULE, wouldn't impact small systems.
But it seems to be not worth the effort... ;-)

Thanks,
Mike

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:58 ` Mike Travis
@ 2008-07-11 1:41 ` H. Peter Anvin
0 siblings, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-11 1:41 UTC (permalink / raw)
To: Mike Travis
Cc: Christoph Lameter, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Mike Travis wrote:
>
> I was thinking that supporting virtual percpu addresses would take a fair
> amount of code that, if living in a MODULE, wouldn't impact small systems.
> But it seems to be not worth the effort... ;-)
>

No, and it seems pretty toxic. If we're doing virtual, they should
almost certainly always be virtual (except perhaps on UP). Page size is
a separate issue.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:58 ` Mike Travis
2008-07-11 1:41 ` H. Peter Anvin
@ 2008-07-11 15:37 ` Christoph Lameter
1 sibling, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-11 15:37 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Mike Travis wrote:

> I was thinking that supporting virtual percpu addresses would take a fair
> amount of code that, if living in a MODULE, wouldn't impact small systems.
> But it seems to be not worth the effort... ;-)

For base page mappings, that logic is already provided by the vmalloc
subsystem. For 2M mappings, the vmemmap logic can be used.

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
@ 2008-07-11 0:42 ` Eric W. Biederman
2008-07-11 15:36 ` Christoph Lameter
2 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-11 0:42 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Mike Travis <travis@sgi.com> writes:

> Btw, the "zero_based_only" patch (w/o the pda folded into the percpu
> area) gets to the point shown below. Dropping NR_CPUS from 4096 to 256
> clears up the error. So except for the "stack overflow" message I got
> yesterday, the result is the same. As soon as I get a chance, I'll try
> it out with gcc-4.2.0 to see if it changed the boot up problem.

Thanks, that seems to confirm the suspicion that it is the zero-based
percpu segment that is causing problems.

In this case it appears that the notifier block for one of the cpu
notifiers got stomped, or never got initialized.

Eric

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 0:12 ` Mike Travis
2008-07-11 0:14 ` H. Peter Anvin
2008-07-11 0:42 ` Eric W. Biederman
@ 2008-07-11 15:36 ` Christoph Lameter
2 siblings, 0 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-11 15:36 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Eric W. Biederman,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Mike Travis wrote:
> H. Peter Anvin wrote:
> ...
>> -- I believe 2 MB mappings are too large, except perhaps as an option.
>>
>> -hpa
>
> Hmm, that might be the way to go.... At boot up time determine the
> size of the system in terms of cpu count and memory available and
> attempt to do the right thing, with startup options to override the
> internal choices... ?

Ok. That is an extension of the static per cpu area scenario already
supported by cpu_alloc. Should not be too difficult to implement.

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:45 ` Christoph Lameter
2008-07-10 17:33 ` Jeremy Fitzhardinge
@ 2008-07-10 17:53 ` H. Peter Anvin
1 sibling, 0 replies; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jeremy Fitzhardinge, Eric W. Biederman, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Christoph Lameter wrote:
>
> There will be no additional overhead since the memory is already mapped 1-1 using 2MB TLBs and we want to use the same for the percpu areas. This is similar to the vmemmap solution.
>

THAT sounds strange. If you're using dedicated virtual maps (which is
what you're responding to here) then you will *always* have additional
TLB pressure. Furthermore, if you use 2 MB pages, you:

a) can only allocate full 2 MB pages, which is expensive for the static
users and difficult for the dynamic users;
b) increase pressure in the relatively small 2 MB TLB.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 16:24 ` Christoph Lameter
2008-07-10 16:33 ` H. Peter Anvin
@ 2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
` (2 more replies)
1 sibling, 3 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 17:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Christoph Lameter <cl@linux-foundation.org> writes:

> H. Peter Anvin wrote:
>
>> but there is a distinct lack of wiggle room, which can be resolved
>> either by using negative offsets, or by moving the kernel text area up a
>> bit from -2 GB.
>
> Let's say we reserve 256MB of cpu alloc space per processor.

First off, right now reserving more than about 64KB is ridiculous. We
rightly don't have that many per cpu variables.

> On a system with 4k processors this will result in the need for 1TB virtual
> address space for per cpu areas (note that there may be more processors in the
> future). Preferably we would calculate the address of the per cpu area by
>
> PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()
>
> instead of looking it up in a table because that will save a memory access on
> per_cpu().

???? Optimizing per_cpu seems to be the wrong path. If you want to go
fast, you access the data on the cpu you start out on.

> The first percpu area would ideally be the per cpu segment generated by the
> linker.
>
> How would that fit into the address map? In particular the 2G distance between
> code and the first per cpu area must not be violated unless we go to a zero
> based approach.
>
> Maybe there is another way of arranging things that would allow for this?

Yes. Start with a patch that doesn't have freaky failures that can't be
understood or bisected because the patch is too big.

The only reason we are having a conversation about alternative
implementations is because the current implementation has weird, random,
incomprehensible failures. The most likely culprit is playing with the
linker. It could be something else.

So please REFACTOR the patch that changes things to DO ONE THING PER PATCH.

Eric

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:26 ` Eric W. Biederman
@ 2008-07-10 17:38 ` Christoph Lameter
2008-07-10 19:11 ` Mike Travis
2008-07-10 19:12 ` Eric W. Biederman
2 siblings, 2 replies; 190+ messages in thread
From: Christoph Lameter @ 2008-07-10 17:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Eric W. Biederman wrote:

> First off, right now reserving more than about 64KB is ridiculous. We
> rightly don't have that many per cpu variables.

We do. The case has been made numerous times that we need at least
several megabytes of per cpu memory in case someone creates gazillions
of ip tunnels, etc.

>> instead of looking it up in a table because that will save a memory access on
>> per_cpu().
>
> ???? Optimizing per_cpu seems to be the wrong path. If you want to go
> fast, you access the data on the cpu you start out on.

Yes, most arches provide specialized registers for local per cpu
variable access. There are cases, though, in which you have to access
another processor's cpu space.

>> Maybe there is another way of arranging things that would allow for this?
>
> Yes. Start with a patch that doesn't have freaky failures that can't be
> understood or bisected because the patch is too big.

The patches are reasonably small. The problem that Mike seems to have is
early boot debugging.

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:38 ` Christoph Lameter
@ 2008-07-10 19:11 ` Mike Travis
2008-07-10 19:12 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 19:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, H. Peter Anvin, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Christoph Lameter wrote:
...
> The patches are reasonably small. The problem that Mike seems to have is early boot debugging.

Note that the early boot debugging problems go away with gcc-4.2.4. My
problem now (what I [maybe incorrectly] believe) is stack overflow with
NR_CPUS=4096 and a specific random config file.

Btw, I've completed the first half of splitting the zero_based_fold into
zero_based_only and fold_pda_into_percpu and am testing that now.

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:38 ` Christoph Lameter
2008-07-10 19:11 ` Mike Travis
@ 2008-07-10 19:12 ` Eric W. Biederman
1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 19:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: H. Peter Anvin, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Christoph Lameter <cl@linux-foundation.org> writes:

> The patches are reasonably small. The problem that Mike seems to have is early
> boot debugging.

I didn't say small. I said do one thing at a time. Mike is addressing
this.

Fundamentally the problem is that we are seeing weird failures and there
is not enough granularity in the patches to test which part of the patch
the failure comes from. The fact that the failures happen early in boot
just makes it worse to debug.

Eric

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
@ 2008-07-10 17:46 ` Mike Travis
2008-07-10 17:51 ` H. Peter Anvin
2 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 17:46 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, H. Peter Anvin, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Eric W. Biederman wrote:
...
>
> So please REFACTOR the patch that changes things to DO ONE THING PER PATCH.
>
> Eric

Working feverishly on this exact thing... ;-)

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:26 ` Eric W. Biederman
2008-07-10 17:38 ` Christoph Lameter
2008-07-10 17:46 ` Mike Travis
@ 2008-07-10 17:51 ` H. Peter Anvin
2008-07-10 19:09 ` Eric W. Biederman
2 siblings, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 17:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Eric W. Biederman wrote:
> Christoph Lameter <cl@linux-foundation.org> writes:
>
>> H. Peter Anvin wrote:
>>
>>> but there is a distinct lack of wiggle room, which can be resolved
>>> either by using negative offsets, or by moving the kernel text area up a
>>> bit from -2 GB.
>>
>> Lets say we reserve 256MB of cpu alloc space per processor.
>
> First off right now reserving more than about 64KB is ridiculous. We rightly
> don't have that many per cpu variables.

Almost half a megabyte in current allyesconfig, and that is not
including dynamic allocations at all.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 17:51 ` H. Peter Anvin
@ 2008-07-10 19:09 ` Eric W. Biederman
2008-07-10 19:18 ` Mike Travis
0 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 19:09 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Christoph Lameter, Jeremy Fitzhardinge, Ingo Molnar, Mike Travis,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

"H. Peter Anvin" <hpa@zytor.com> writes:

> Almost half a megabyte in current allyesconfig, and that is not including
> dynamic allocations at all.

Ouch! This starts to make me sad I removed the arbitrary cap on static
percpu memory.

Eric

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:09 ` Eric W. Biederman
@ 2008-07-10 19:18 ` Mike Travis
2008-07-10 19:32 ` H. Peter Anvin
2008-07-10 20:17 ` Eric W. Biederman
0 siblings, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 19:18 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Eric W. Biederman wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>
>> Almost half a megabyte in current allyesconfig, and that is not including
>> dynamic allocations at all.
>
> Ouch! This start to make me sad I removed the arbitrary cap on static
> percpu memory.
>
> Eric

The biggest growth came from moving all the xxx[NR_CPUS] arrays into
the per cpu area. So you free up a huge amount of unused memory when
the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
real soon now, ??? future?

Mike

^ permalink raw reply [flat|nested] 190+ messages in thread
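[The memory effect Mike describes can be put in rough numbers: a static
xxx[NR_CPUS] array pays for every configured cpu slot up front, while a
percpu variable costs one copy per cpu actually present. The object size and
cpu counts below are illustrative assumptions, not measurements from any
particular kernel config.]

```c
#include <stddef.h>

#define NR_CPUS_CFG  4096u  /* distro kernel built for the biggest box */
#define ONLINE_CPUS  8u     /* a typical small machine running it      */

/* Old style: struct foo foo[NR_CPUS] -- all slots allocated statically. */
static size_t static_array_bytes(size_t per_cpu_obj)
{
    return (size_t)NR_CPUS_CFG * per_cpu_obj;
}

/* Percpu style: DEFINE_PER_CPU(struct foo, foo) -- one copy per cpu present. */
static size_t percpu_bytes(size_t per_cpu_obj)
{
    return (size_t)ONLINE_CPUS * per_cpu_obj;
}
```

With a 128-byte object, the static array burns 512 KB regardless of how many
cpus exist, versus 1 KB in the percpu layout on an 8-cpu box; that 512:1
ratio is the "unused memory freed up" when NR_CPUS is in the ozone layer.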
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:18 ` Mike Travis
@ 2008-07-10 19:32 ` H. Peter Anvin
2008-07-10 23:37 ` Mike Travis
1 sibling, 1 reply; 190+ messages in thread
From: H. Peter Anvin @ 2008-07-10 19:32 UTC (permalink / raw)
To: Mike Travis
Cc: Eric W. Biederman, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

Mike Travis wrote:
>
> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> the per cpu area. So you free up a huge amount of unused memory when
> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
> real soon now, ??? future?
>

Even (or perhaps especially) so, allocating the percpu area in 2 MB
increments is a total nonstarter. It hurts the small, common
configurations way too much. For SGI, it's probably fine.

    -hpa

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:32 ` H. Peter Anvin
@ 2008-07-10 23:37 ` Mike Travis
0 siblings, 0 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-10 23:37 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Eric W. Biederman, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel,
Arjan van de Ven

H. Peter Anvin wrote:
> Mike Travis wrote:
>>
>> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
>> the per cpu area. So you free up a huge amount of unused memory when
>> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
>> real soon now, ??? future?
>>
>
> Even (or perhaps especially) so, allocating the percpu area in 2 MB
> increments is a total nonstarter. It hurts the small, common
> configurations way too much. For SGI, it's probably fine.
>
> -hpa

Yes, "right-sizing" the kernel for systems from 512M laptops to 4k cpu
systems with memory totally maxed out, using only a binary distribution,
has proven tricky (at best... ;-)

One alternative was to only allocate a chunk similar in size to
PERCPU_ENOUGH_ROOM and allow for startup options to create a bigger
space if needed. Though the realloc idea has some merit if we can
validate the non-use of pointers to specific cpu's percpu vars.
(CPU_ALLOC contains a function to dereference a percpu offset.)

The discussion started at:

http://marc.info/?l=linux-kernel&m=121212026716085&w=4

Thanks,
Mike

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 19:18 ` Mike Travis
2008-07-10 19:32 ` H. Peter Anvin
@ 2008-07-10 20:17 ` Eric W. Biederman
2008-07-10 20:24 ` Ingo Molnar
2008-07-11 1:39 ` Mike Travis
1 sibling, 2 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 20:17 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Mike Travis <travis@sgi.com> writes:

> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> the per cpu area. So you free up a huge amount of unused memory when
> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
> real soon now, ??? future?

Hmm. Do you know how big a role kernel_stat plays?

It is a per cpu structure that is sized via NR_IRQS, and NR_IRQS itself
scales with NR_CPUS. So ultimately the amount of memory taken up is
NR_CPUS*NR_CPUS*32 or so.

I have a patch I wrote long ago that addresses that specific nasty
configuration by moving the per cpu irq counters into an array reached
through a pointer in struct irq_desc.

The next step, which I did not get to (but which is interesting from a
scaling perspective), was to start dynamically allocating the irq
structures.

Eric

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 20:17 ` Eric W. Biederman @ 2008-07-10 20:24 ` Ingo Molnar 2008-07-10 21:33 ` Eric W. Biederman 2008-07-11 1:39 ` Mike Travis 1 sibling, 1 reply; 190+ messages in thread From: Ingo Molnar @ 2008-07-10 20:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven * Eric W. Biederman <ebiederm@xmission.com> wrote: > Mike Travis <travis@sgi.com> writes: > > > > The biggest growth came from moving all the xxx[NR_CPUS] arrays into > > the per cpu area. So you free up a huge amount of unused memory > > when the NR_CPUS count starts getting into the ozone layer. 4k now, > > 16k real soon now, ??? future? > > Hmm. Do you know how big a role kernel_stat plays. > > It is a per cpu structure that is sized via NR_IRQS. NR_IRQS is by > NR_CPUS. So ultimately the amount of memory take up is > NR_CPUS*NR_CPUS*32 or so. > > I have a patch I wrote long ago, that addresses that specific nasty > configuration by moving the per cpu irq counters into pointer > available from struct irq_desc. > > The next step which I did not get to (but is interesting from a > scaling perspective) was to start dynamically allocating the irq > structures. /me willing to test & babysit any test-patch in that area ... this is a big problem and it's getting worse quadratically ;-) Ingo ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:24 ` Ingo Molnar
@ 2008-07-10 21:33 ` Eric W. Biederman
0 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-10 21:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mike Travis, H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Ingo Molnar <mingo@elte.hu> writes:

> /me willing to test & babysit any test-patch in that area ...
>
> this is a big problem and it's getting worse quadratically ;-)

Well, here is a copy of my old patch to get things started. It isn't
where I'm working right now, so I don't have time to rebase the patch,
but the same logic should still apply.

----
From e02f708c0eca6708c8f79824717705379e982fe3 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <ebiederm@xmission.com>
Date: Tue, 13 Feb 2007 02:42:50 -0700
Subject: [PATCH] genirq: Kill the percpu NR_IRQS sized array in kstat.

In struct kernel_stat, which has one instance per cpu, we keep a count
of how many times each irq has occurred on that cpu. Given that we
don't usually use all of our irqs, this is very wasteful of space, and
in particular of percpu space.

This patch replaces that array, on all architectures that use
GENERIC_HARD_IRQS, with a pointer in struct irq_desc to an array of
per-cpu counters. The array is allocated at boot time, after we have
generated the cpu_possible_map, and is only large enough to hold the
largest possible cpu index. Assuming the common case of dense cpu
numbers, this consumes roughly the same amount of space as the current
mechanism and removes the NR_IRQS sized array.

The only immediate win is to get these counts out of the limited-size
percpu areas. Shortly I will make the need for NR_IRQS sized arrays
obsolete, allowing a single kernel to support huge numbers of irqs and
still be efficient on small machines. With the removal of the NR_IRQS
sized arrays this patch will be a clear size win in space consumption
for small machines.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- arch/alpha/kernel/irq.c | 2 +- arch/alpha/kernel/irq_alpha.c | 2 +- arch/arm/kernel/irq.c | 2 +- arch/avr32/kernel/irq.c | 2 +- arch/cris/kernel/irq.c | 2 +- arch/frv/kernel/irq.c | 2 +- arch/i386/kernel/io_apic.c | 2 +- arch/i386/kernel/irq.c | 2 +- arch/i386/mach-visws/visws_apic.c | 2 +- arch/ia64/kernel/irq.c | 2 +- arch/ia64/kernel/irq_ia64.c | 4 ++-- arch/m32r/kernel/irq.c | 2 +- arch/mips/au1000/common/time.c | 4 ++-- arch/mips/kernel/irq.c | 2 +- arch/mips/kernel/time.c | 4 ++-- arch/mips/sgi-ip22/ip22-int.c | 2 +- arch/mips/sgi-ip22/ip22-time.c | 4 ++-- arch/mips/sgi-ip27/ip27-timer.c | 2 +- arch/mips/sibyte/bcm1480/smp.c | 2 +- arch/mips/sibyte/sb1250/irq.c | 2 +- arch/mips/sibyte/sb1250/smp.c | 2 +- arch/parisc/kernel/irq.c | 2 +- arch/powerpc/kernel/irq.c | 2 +- arch/ppc/amiga/amiints.c | 4 ++-- arch/ppc/amiga/cia.c | 2 +- arch/ppc/amiga/ints.c | 4 ++-- arch/sh/kernel/irq.c | 2 +- arch/sparc64/kernel/irq.c | 4 ++-- arch/sparc64/kernel/smp.c | 2 +- arch/um/kernel/irq.c | 2 +- arch/x86_64/kernel/irq.c | 6 +----- arch/xtensa/kernel/irq.c | 2 +- fs/proc/proc_misc.c | 2 +- include/linux/irq.h | 4 ++++ include/linux/kernel_stat.h | 20 +++++++++++++++++--- init/main.c | 1 + kernel/irq/chip.c | 15 +++++---------- kernel/irq/handle.c | 29 +++++++++++++++++++++++++++-- 38 files changed, 94 insertions(+), 59 deletions(-) diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c index 3659af8..8e0af05 100644 --- a/arch/alpha/kernel/irq.c +++ b/arch/alpha/kernel/irq.c @@ -88,7 +88,7 @@ show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(irq)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[irq]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[irq].chip->typename); seq_printf(p, " %c%s", diff --git a/arch/alpha/kernel/irq_alpha.c b/arch/alpha/kernel/irq_alpha.c index e16aeb6..2c0852c 100644 --- 
a/arch/alpha/kernel/irq_alpha.c +++ b/arch/alpha/kernel/irq_alpha.c @@ -64,7 +64,7 @@ do_entInt(unsigned long type, unsigned long vector, smp_percpu_timer_interrupt(regs); cpu = smp_processor_id(); if (cpu != boot_cpuid) { - kstat_cpu(cpu).irqs[RTC_IRQ]++; + irq_desc[RTC_IRQ].kstat_irqs[cpu]++; } else { handle_irq(RTC_IRQ); } diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c index e101846..db79c4c 100644 --- a/arch/arm/kernel/irq.c +++ b/arch/arm/kernel/irq.c @@ -76,7 +76,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%3d: ", i); for_each_present_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-"); seq_printf(p, " %s", action->name); for (action = action->next; action; action = action->next) diff --git a/arch/avr32/kernel/irq.c b/arch/avr32/kernel/irq.c index fd31124..7cddf0a 100644 --- a/arch/avr32/kernel/irq.c +++ b/arch/avr32/kernel/irq.c @@ -56,7 +56,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%3d: ", i); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %8s", irq_desc[i].chip->name ? 
: "-"); seq_printf(p, " %s", action->name); for (action = action->next; action; action = action->next) diff --git a/arch/cris/kernel/irq.c b/arch/cris/kernel/irq.c index 903ea62..9d7c1d7 100644 --- a/arch/cris/kernel/irq.c +++ b/arch/cris/kernel/irq.c @@ -66,7 +66,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); diff --git a/arch/frv/kernel/irq.c b/arch/frv/kernel/irq.c index 87f360a..ff6579f 100644 --- a/arch/frv/kernel/irq.c +++ b/arch/frv/kernel/irq.c @@ -75,7 +75,7 @@ int show_interrupts(struct seq_file *p, void *v) if (action) { seq_printf(p, "%3d: ", i); for_each_present_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %10s", irq_desc[i].chip->name ? 
: "-"); seq_printf(p, " %s", action->name); for (action = action->next; diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c index edcc849..c660c8b 100644 --- a/arch/i386/kernel/io_apic.c +++ b/arch/i386/kernel/io_apic.c @@ -488,7 +488,7 @@ static void do_irq_balance(void) if ( package_index == i ) IRQ_DELTA(package_index,j) = 0; /* Determine the total count per processor per IRQ */ - value_now = (unsigned long) kstat_cpu(i).irqs[j]; + value_now = (unsigned long) kstat_irqs_cpu(j, i); /* Determine the activity per processor per IRQ */ delta = value_now - LAST_CPU_IRQ(i,j); diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index eeb29af..0a30abc 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -279,7 +279,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %8s", irq_desc[i].chip->name); seq_printf(p, "-%-8s", irq_desc[i].name); diff --git a/arch/i386/mach-visws/visws_apic.c b/arch/i386/mach-visws/visws_apic.c index 38c2b13..0d153eb 100644 --- a/arch/i386/mach-visws/visws_apic.c +++ b/arch/i386/mach-visws/visws_apic.c @@ -240,7 +240,7 @@ static irqreturn_t piix4_master_intr(int irq, void *dev_id) /* * handle this 'virtual interrupt' as a Cobalt one now. 
*/ - kstat_cpu(smp_processor_id()).irqs[realirq]++; + desc->kstat_irqs[smp_processor_id()]++; if (likely(desc->action != NULL)) handle_IRQ_event(realirq, desc->action); diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c index ce49c85..06edbb8 100644 --- a/arch/ia64/kernel/irq.c +++ b/arch/ia64/kernel/irq.c @@ -73,7 +73,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) { - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); } #endif seq_printf(p, " %14s", irq_desc[i].chip->name); diff --git a/arch/ia64/kernel/irq_ia64.c b/arch/ia64/kernel/irq_ia64.c index 456f57b..be1dd6e 100644 --- a/arch/ia64/kernel/irq_ia64.c +++ b/arch/ia64/kernel/irq_ia64.c @@ -181,7 +181,7 @@ ia64_handle_irq (ia64_vector vector, struct pt_regs *regs) ia64_srlz_d(); while (vector != IA64_SPURIOUS_INT_VECTOR) { if (unlikely(IS_RESCHEDULE(vector))) - kstat_this_cpu.irqs[vector]++; + kstat_irqs_this_cpu(&irq_desc[vector])++; else { ia64_setreg(_IA64_REG_CR_TPR, vector); ia64_srlz_d(); @@ -228,7 +228,7 @@ void ia64_process_pending_intr(void) */ while (vector != IA64_SPURIOUS_INT_VECTOR) { if (unlikely(IS_RESCHEDULE(vector))) - kstat_this_cpu.irqs[vector]++; + kstat_irqs_this_cpu(&irq_desc[vector])++; else { struct pt_regs *old_regs = set_irq_regs(NULL); diff --git a/arch/m32r/kernel/irq.c b/arch/m32r/kernel/irq.c index f8d8650..4fb85b2 100644 --- a/arch/m32r/kernel/irq.c +++ b/arch/m32r/kernel/irq.c @@ -52,7 +52,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); diff --git a/arch/mips/au1000/common/time.c b/arch/mips/au1000/common/time.c index fa1c62f..c2a084e 100644 --- a/arch/mips/au1000/common/time.c +++ 
b/arch/mips/au1000/common/time.c @@ -81,13 +81,13 @@ void mips_timer_interrupt(void) int irq = 63; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; if (r4k_offset == 0) goto null; do { - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; do_timer(1); #ifndef CONFIG_SMP update_process_times(user_mode(get_irq_regs())); diff --git a/arch/mips/kernel/irq.c b/arch/mips/kernel/irq.c index 2fe4c86..c2cae91 100644 --- a/arch/mips/kernel/irq.c +++ b/arch/mips/kernel/irq.c @@ -115,7 +115,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->name); seq_printf(p, " %s", action->name); diff --git a/arch/mips/kernel/time.c b/arch/mips/kernel/time.c index e5e56bd..0a829e2 100644 --- a/arch/mips/kernel/time.c +++ b/arch/mips/kernel/time.c @@ -204,7 +204,7 @@ asmlinkage void ll_timer_interrupt(int irq) int r2 = cpu_has_mips_r2; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; /* * Suckage alert: @@ -228,7 +228,7 @@ asmlinkage void ll_local_timer_interrupt(int irq) { irq_enter(); if (smp_processor_id() != 0) - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; /* we keep interrupt disabled all the time */ local_timer_interrupt(irq, NULL); diff --git a/arch/mips/sgi-ip22/ip22-int.c b/arch/mips/sgi-ip22/ip22-int.c index b454924..382a8a5 100644 --- a/arch/mips/sgi-ip22/ip22-int.c +++ b/arch/mips/sgi-ip22/ip22-int.c @@ -164,7 +164,7 @@ static void indy_buserror_irq(void) int irq = SGI_BUSERR_IRQ; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; ip22_be_interrupt(irq); irq_exit(); } diff --git a/arch/mips/sgi-ip22/ip22-time.c b/arch/mips/sgi-ip22/ip22-time.c index 2055547..0cd6887 100644 --- a/arch/mips/sgi-ip22/ip22-time.c +++ 
b/arch/mips/sgi-ip22/ip22-time.c @@ -182,7 +182,7 @@ void indy_8254timer_irq(void) char c; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; printk(KERN_ALERT "Oops, got 8254 interrupt.\n"); ArcRead(0, &c, 1, &cnt); ArcEnterInteractiveMode(); @@ -194,7 +194,7 @@ void indy_r4k_timer_interrupt(void) int irq = SGI_TIMER_IRQ; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(&irq_desc[irq])++; timer_interrupt(irq, NULL); irq_exit(); } diff --git a/arch/mips/sgi-ip27/ip27-timer.c b/arch/mips/sgi-ip27/ip27-timer.c index 8c3c78c..592449c 100644 --- a/arch/mips/sgi-ip27/ip27-timer.c +++ b/arch/mips/sgi-ip27/ip27-timer.c @@ -106,7 +106,7 @@ again: if (LOCAL_HUB_L(PI_RT_COUNT) >= ct_cur[cpu]) goto again; - kstat_this_cpu.irqs[irq]++; /* kstat only for bootcpu? */ + irq_desc[irq].kstat_irqs[cpu]++; /* kstat only for bootcpu? */ if (cpu == 0) do_timer(1); diff --git a/arch/mips/sibyte/bcm1480/smp.c b/arch/mips/sibyte/bcm1480/smp.c index bf32827..a070238 100644 --- a/arch/mips/sibyte/bcm1480/smp.c +++ b/arch/mips/sibyte/bcm1480/smp.c @@ -93,7 +93,7 @@ void bcm1480_mailbox_interrupt(void) int cpu = smp_processor_id(); unsigned int action; - kstat_this_cpu.irqs[K_BCM1480_INT_MBOX_0_0]++; + irq_desc[K_BCM1480_INT_MBOX_0_0].kstat_irqs[cpu]++; /* Load the mailbox register to figure out what we're supposed to do */ action = (__raw_readq(mailbox_0_regs[cpu]) >> 48) & 0xffff; diff --git a/arch/mips/sibyte/sb1250/irq.c b/arch/mips/sibyte/sb1250/irq.c index 1482394..fb7d77f 100644 --- a/arch/mips/sibyte/sb1250/irq.c +++ b/arch/mips/sibyte/sb1250/irq.c @@ -390,7 +390,7 @@ static void sb1250_kgdb_interrupt(void) * host to stop the break, since we would see another * interrupt on the end-of-break too) */ - kstat_this_cpu.irqs[kgdb_irq]++; + kstat_irqs_this_cpu(&irq_desc[kgdb_irq])++; mdelay(500); duart_out(R_DUART_CMD, V_DUART_MISC_CMD_RESET_BREAK_INT | M_DUART_RX_EN | M_DUART_TX_EN); diff --git a/arch/mips/sibyte/sb1250/smp.c 
b/arch/mips/sibyte/sb1250/smp.c index c38e1f3..54c6164 100644 --- a/arch/mips/sibyte/sb1250/smp.c +++ b/arch/mips/sibyte/sb1250/smp.c @@ -81,7 +81,7 @@ void sb1250_mailbox_interrupt(void) int cpu = smp_processor_id(); unsigned int action; - kstat_this_cpu.irqs[K_INT_MBOX_0]++; + irq_desc[K_INT_MBOX_0].kstat_irqs[cpu]++; /* Load the mailbox register to figure out what we're supposed to do */ action = (____raw_readq(mailbox_regs[cpu]) >> 48) & 0xffff; diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c index b39c5b9..c222bbd 100644 --- a/arch/parisc/kernel/irq.c +++ b/arch/parisc/kernel/irq.c @@ -192,7 +192,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%3d: ", i); #ifdef CONFIG_SMP for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #else seq_printf(p, "%10u ", kstat_irqs(i)); #endif diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 919fbf5..0be818a 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -189,7 +189,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%3d: ", i); #ifdef CONFIG_SMP for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #else seq_printf(p, "%10u ", kstat_irqs(i)); #endif /* CONFIG_SMP */ diff --git a/arch/ppc/amiga/amiints.c b/arch/ppc/amiga/amiints.c index 265fcd3..3dc8651 100644 --- a/arch/ppc/amiga/amiints.c +++ b/arch/ppc/amiga/amiints.c @@ -184,7 +184,7 @@ inline void amiga_do_irq(int irq, struct pt_regs *fp) irq_desc_t *desc = irq_desc + irq; struct irqaction *action = desc->action; - kstat_cpu(0).irqs[irq]++; + desc->kstat_irqs[0]++; action->handler(irq, action->dev_id, fp); } @@ -193,7 +193,7 @@ void amiga_do_irq_list(int irq, struct pt_regs *fp) irq_desc_t *desc = irq_desc + irq; struct irqaction *action; - kstat_cpu(0).irqs[irq]++; + desc->kstat_irqs[0]++; amiga_custom.intreq = ami_intena_vals[irq]; diff 
--git a/arch/ppc/amiga/cia.c b/arch/ppc/amiga/cia.c index 9558f2f..33faf2d 100644 --- a/arch/ppc/amiga/cia.c +++ b/arch/ppc/amiga/cia.c @@ -146,7 +146,7 @@ static void cia_handler(int irq, void *dev_id, struct pt_regs *fp) amiga_custom.intreq = base->int_mask; for (i = 0; i < CIA_IRQS; i++, irq++) { if (ints & 1) { - kstat_cpu(0).irqs[irq]++; + desc->kstat_irqs[0]++; action = desc->action; action->handler(irq, action->dev_id, fp); } diff --git a/arch/ppc/amiga/ints.c b/arch/ppc/amiga/ints.c index 083a174..84ec6cb 100644 --- a/arch/ppc/amiga/ints.c +++ b/arch/ppc/amiga/ints.c @@ -128,7 +128,7 @@ asmlinkage void process_int(unsigned long vec, struct pt_regs *fp) { if (vec >= VEC_INT1 && vec <= VEC_INT7 && !MACH_IS_BVME6000) { vec -= VEC_SPUR; - kstat_cpu(0).irqs[vec]++; + irq_desc[vec].kstat_irqs[0]++; irq_list[vec].handler(vec, irq_list[vec].dev_id, fp); } else { if (mach_process_int) @@ -147,7 +147,7 @@ int m68k_get_irq_list(struct seq_file *p, void *v) if (mach_default_handler) { for (i = 0; i < SYS_IRQS; i++) { seq_printf(p, "auto %2d: %10u ", i, - i ? kstat_cpu(0).irqs[i] : num_spurious); + i ? 
kstat_irqs_cpu(i, 0) : num_spurious); seq_puts(p, " "); seq_printf(p, "%s\n", irq_list[i].devname); } diff --git a/arch/sh/kernel/irq.c b/arch/sh/kernel/irq.c index 67be2b6..e9c739a 100644 --- a/arch/sh/kernel/irq.c +++ b/arch/sh/kernel/irq.c @@ -52,7 +52,7 @@ int show_interrupts(struct seq_file *p, void *v) goto unlock; seq_printf(p, "%3d: ",i); for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); seq_printf(p, " %14s", irq_desc[i].chip->name); seq_printf(p, "-%-8s", irq_desc[i].name); seq_printf(p, " %s", action->name); diff --git a/arch/sparc64/kernel/irq.c b/arch/sparc64/kernel/irq.c index b5ff3ee..4a436a7 100644 --- a/arch/sparc64/kernel/irq.c +++ b/arch/sparc64/kernel/irq.c @@ -154,7 +154,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %9s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); @@ -605,7 +605,7 @@ void timer_irq(int irq, struct pt_regs *regs) old_regs = set_irq_regs(regs); irq_enter(); - kstat_this_cpu.irqs[0]++; + irq_desc[0].kstat_irqs[0]++; timer_interrupt(irq, NULL); irq_exit(); diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c index fc99f7b..155703b 100644 --- a/arch/sparc64/kernel/smp.c +++ b/arch/sparc64/kernel/smp.c @@ -1212,7 +1212,7 @@ void smp_percpu_timer_interrupt(struct pt_regs *regs) irq_enter(); if (cpu == boot_cpu_id) { - kstat_this_cpu.irqs[0]++; + irq_desc[0].kstat_irqs[cpu]++; timer_tick_interrupt(regs); } diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 50a288b..fa16410 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -61,7 +61,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u "
", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c index 9fe2e28..beefb89 100644 --- a/arch/x86_64/kernel/irq.c +++ b/arch/x86_64/kernel/irq.c @@ -69,12 +69,8 @@ int show_interrupts(struct seq_file *p, void *v) if (!action) goto skip; seq_printf(p, "%3d: ",i); -#ifndef CONFIG_SMP - seq_printf(p, "%10u ", kstat_irqs(i)); -#else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); -#endif + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); seq_printf(p, " %8s", irq_desc[i].chip->name); seq_printf(p, "-%-8s", irq_desc[i].name); diff --git a/arch/xtensa/kernel/irq.c b/arch/xtensa/kernel/irq.c index c9ea73b..c35e271 100644 --- a/arch/xtensa/kernel/irq.c +++ b/arch/xtensa/kernel/irq.c @@ -99,7 +99,7 @@ int show_interrupts(struct seq_file *p, void *v) seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); diff --git a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c index e2c4c0a..21be453 100644 --- a/fs/proc/proc_misc.c +++ b/fs/proc/proc_misc.c @@ -472,7 +472,7 @@ static int show_stat(struct seq_file *p, void *v) softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq); steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal); for (j = 0 ; j < NR_IRQS ; j++) - sum += kstat_cpu(i).irqs[j]; + sum += kstat_irqs_cpu(j, i); } seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu\n", diff --git a/include/linux/irq.h b/include/linux/irq.h index bb78ab9..9c61fd7 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -156,6 +156,7 @@ struct irq_desc { void *handler_data; void *chip_data; struct irqaction *action; /* IRQ action list */ + unsigned int *kstat_irqs; unsigned int status; /* IRQ status */ unsigned int depth; /* 
nested irq disables */ @@ -178,6 +179,9 @@ struct irq_desc { extern struct irq_desc irq_desc[NR_IRQS]; +#define kstat_irqs_this_cpu(DESC) \ + ((DESC)->kstat_irqs[smp_processor_id()]) + /* * Migration helpers for obsolete names, they will go away: */ diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 43e895f..0c8f650 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -27,7 +27,9 @@ struct cpu_usage_stat { struct kernel_stat { struct cpu_usage_stat cpustat; +#ifndef CONFIG_GENERIC_HARDIRQS unsigned int irqs[NR_IRQS]; +#endif }; DECLARE_PER_CPU(struct kernel_stat, kstat); @@ -38,15 +40,27 @@ DECLARE_PER_CPU(struct kernel_stat, kstat); extern unsigned long long nr_context_switches(void); +#ifndef CONFIG_GENERIC_HARDIRQS +static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu) +{ + return kstat_cpu(cpu).irqs[irq]; +} +static inline void init_kstat_irqs(void) {} +#else +extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu); +extern void init_kstat_irqs(void); +#endif /* CONFIG_GENERIC_HARDIRQS */ + /* * Number of interrupts per specific IRQ source, since bootup */ -static inline int kstat_irqs(int irq) +static inline unsigned int kstat_irqs(unsigned int irq) { - int cpu, sum = 0; + unsigned int sum = 0; + int cpu; for_each_possible_cpu(cpu) - sum += kstat_cpu(cpu).irqs[irq]; + sum += kstat_irqs_cpu(irq, cpu); return sum; } diff --git a/init/main.c b/init/main.c index a92989e..23f1c64 100644 --- a/init/main.c +++ b/init/main.c @@ -559,6 +559,7 @@ asmlinkage void __init start_kernel(void) sort_main_extable(); trap_init(); rcu_init(); + init_kstat_irqs(); init_IRQ(); pidhash_init(); init_timers(); diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index f83d691..7896286 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -288,13 +288,12 @@ handle_simple_irq(unsigned int irq, struct irq_desc *desc) { struct irqaction *action; irqreturn_t action_ret; - const unsigned int cpu = smp_processor_id(); 
spin_lock(&desc->lock); if (unlikely(desc->status & IRQ_INPROGRESS)) goto out_unlock; - kstat_cpu(cpu).irqs[irq]++; + kstat_irqs_this_cpu(desc)++; action = desc->action; if (unlikely(!action || (desc->status & IRQ_DISABLED))) { @@ -332,7 +331,6 @@ out_unlock: void fastcall handle_level_irq(unsigned int irq, struct irq_desc *desc) { - unsigned int cpu = smp_processor_id(); struct irqaction *action; irqreturn_t action_ret; @@ -342,7 +340,7 @@ handle_level_irq(unsigned int irq, struct irq_desc *desc) if (unlikely(desc->status & IRQ_INPROGRESS)) goto out_unlock; desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); - kstat_cpu(cpu).irqs[irq]++; + kstat_irqs_this_cpu(desc)++; /* * If its disabled or no action available @@ -383,7 +381,6 @@ out_unlock: void fastcall handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc) { - unsigned int cpu = smp_processor_id(); struct irqaction *action; irqreturn_t action_ret; @@ -393,7 +390,7 @@ handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc) goto out; desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); - kstat_cpu(cpu).irqs[irq]++; + kstat_irqs_this_cpu(desc)++; /* * If its disabled or no action available @@ -442,8 +439,6 @@ out: void fastcall handle_edge_irq(unsigned int irq, struct irq_desc *desc) { - const unsigned int cpu = smp_processor_id(); - spin_lock(&desc->lock); desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); @@ -460,7 +455,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc) goto out_unlock; } - kstat_cpu(cpu).irqs[irq]++; + kstat_irqs_this_cpu(desc)++; /* Start handling the irq */ desc->chip->ack(irq); @@ -516,7 +511,7 @@ handle_percpu_irq(unsigned int irq, struct irq_desc *desc) { irqreturn_t action_ret; - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(desc)++; if (desc->chip->ack) desc->chip->ack(irq); diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c index aff1f0f..27cf665 100644 --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -15,6 +15,7 @@ #include <linux/random.h> #include <linux/interrupt.h> 
#include <linux/kernel_stat.h> +#include <linux/bootmem.h> #include "internals.h" @@ -30,7 +31,7 @@ void fastcall handle_bad_irq(unsigned int irq, struct irq_desc *desc) { print_irq_desc(irq, desc); - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(desc)++; ack_bad_irq(irq); } @@ -170,7 +171,7 @@ fastcall unsigned int __do_IRQ(unsigned int irq) struct irqaction *action; unsigned int status; - kstat_this_cpu.irqs[irq]++; + kstat_irqs_this_cpu(desc)++; if (CHECK_IRQ_PER_CPU(desc->status)) { irqreturn_t action_ret; @@ -269,3 +270,27 @@ void early_init_irq_lock_class(void) } #endif + +__init void init_kstat_irqs(void) +{ + unsigned entries = 0, cpu; + unsigned int irq; + unsigned bytes; + + /* Compute the worst case size of a per cpu array */ + for_each_possible_cpu(cpu) + if (cpu >= entries) + entries = cpu + 1; + + /* Compute how many bytes we need per irq and allocate them */ + bytes = entries*sizeof(unsigned int); + for (irq = 0; irq < NR_IRQS; irq++) + irq_desc[irq].kstat_irqs = alloc_bootmem(bytes); +} + +unsigned int kstat_irqs_cpu(unsigned int irq, int cpu) +{ + struct irq_desc *desc = irq_desc + irq; + return desc->kstat_irqs[cpu]; +} +EXPORT_SYMBOL(kstat_irqs_cpu); -- 1.5.0.g53756 ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-10 20:17 ` Eric W. Biederman
2008-07-10 20:24 ` Ingo Molnar
@ 2008-07-11 1:39 ` Mike Travis
2008-07-11 2:57 ` Eric W. Biederman
1 sibling, 1 reply; 190+ messages in thread
From: Mike Travis @ 2008-07-11 1:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Eric W. Biederman wrote:
> Mike Travis <travis@sgi.com> writes:
>
>> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
>> the per cpu area. So you free up a huge amount of unused memory when
>> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
>> real soon now, ??? future?
>
> Hmm. Do you know how big a role kernel_stat plays?
>
> It is a per cpu structure that is sized via NR_IRQS, and NR_IRQS itself
> scales with NR_CPUS. So ultimately the amount of memory taken up is
> NR_CPUS*NR_CPUS*32 or so.
>
> I have a patch I wrote long ago that addresses that specific nasty
> configuration by moving the per cpu irq counters into an array reached
> through a pointer in struct irq_desc.
>
> The next step, which I did not get to (but which is interesting from a
> scaling perspective), was to start dynamically allocating the irq
> structures.
>
> Eric

If you could dig that up, that would be great. Another engr here at SGI
took that task off my hands and he's been able to do a few things to
reduce the "# irqs", but irq_desc is still one of the bigger static
arrays (>256k). (There was some discussion a while back on this very
subject.)

The top data users are:

====== Data (-l 500)
    1 - ingo-test-0701-256
    2 - 4k-defconfig
    3 - ingo-test-0701

    .1.       .2.       .3.   ..final..
1048576   -917504   +917504   1048576       .  __log_buf(.bss)
 262144   -262144   +262144    262144       .  gl_hash_table(.bss)
 122360   -122360   +122360    122360       .  g_bitstream(.data)
 119756   -119756   +119756    119756       .  init_data(.rodata)
  89760    -89760    +89760     89760       .  o2net_nodes(.bss)
  76800    -76800   +614400    614400   +700%  early_node_map(.data)
  44548    -44548    +44548     44548       .  typhoon_firmware_image(.rodata)
  43008   +215040         .    258048   +500%  irq_desc(.data.cacheline_aligned)
  42768    -42768    +42768     42768       .  s_firmLoad(.data)
  41184    -41184    +41184     41184       .  saa7134_boards(.data)
  38912    -38912    +38912     38912       .  dabusb(.bss)
  34804    -34804    +34804     34804       .  g_Firmware(.data)
  32768    -32768    +32768     32768       .  read_buffers(.bss)
  19968    -19968   +159744    159744   +700%  initkmem_list3(.init.data)
  18041    -18041    +18041     18041       .  OperationalCodeImage_GEN1(.data)
  16507    -16507    +16507     16507       .  OperationalCodeImage_GEN2(.data)
  16464    -16464    +16464     16464       .  ipw_geos(.rodata)
  16388   +114688   -114688     16388       .  map_pid_to_cmdline(.bss)
  16384    -16384    +16384     16384       .  gl_hash_locks(.bss)
  16384   +245760         .    262144  +1500%  boot_pageset(.bss)
  16128   +215040         .    231168  +1333%  irq_cfg(.data.read_mostly)

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
2008-07-11 1:39 ` Mike Travis
@ 2008-07-11 2:57 ` Eric W. Biederman
0 siblings, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-11 2:57 UTC (permalink / raw)
To: Mike Travis
Cc: H. Peter Anvin, Christoph Lameter, Jeremy Fitzhardinge,
Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven

Mike Travis <travis@sgi.com> writes:

> If you could dig that up, that would be great. Another engr here at SGI
> took that task off my hands and he's been able to do a few things to
> reduce the "# irqs", but irq_desc is still one of the bigger static
> arrays (>256k).

The part I had completed, which takes the NR_IRQS array out of
kernel_stat, I posted. Here are my mental notes on how to handle the
rest.

Also, if you look at x86_64, everything that is per irq lives in
irq_cfg, which explains why irq_cfg grows. We keep those crazy, almost
useless bitmaps of which cpu we want to direct irqs to in the irq
configuration, so that doesn't help.

The arrays sized by NR_IRQS are in:

drivers/char/random.c:static struct timer_rand_state *irq_timer_state[NR_IRQS];

This looks like it should go in irq_desc (it's a generic feature).

drivers/pcmcia/pcmcia_resource.c:static u8 pcmcia_used_irq[NR_IRQS];

That number should be 16, possibly 32 for sanity, not NR_IRQS.

drivers/net/hamradio/scc.c:static struct irqflags { unsigned char used : 1; } Ivec[NR_IRQS];
drivers/serial/68328serial.c:struct m68k_serial *IRQ_ports[NR_IRQS];
drivers/serial/8250.c:static struct irq_info irq_lists[NR_IRQS];
drivers/serial/m32r_sio.c:static struct irq_info irq_lists[NR_IRQS];

These are all drivers and should allocate a proper per irq structure
like every other driver.

drivers/xen/events.c:static struct packed_irq irq_info[NR_IRQS];
drivers/xen/events.c:static int irq_bindcount[NR_IRQS];

For all intents and purposes this is another architecture; that should
be fixed up at some point.
The interfaces from include/linux/interrupt.h that take an irq number are slow path. So it is just a matter of writing an irq_descp(irq) that returns the irq_desc for an irq number. The definition would go something like:

#ifndef CONFIG_DYNAMIC_NR_IRQ
#define irq_descp(irq) (((irq) >= 0 && (irq) < NR_IRQS) ? (irq_desc + (irq)) : NULL)
#else
struct irq_desc *irq_descp(int irq)
{
	struct irq_desc *desc, *found = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(desc, &irq_list, list) {
		if (desc->irq == irq) {
			found = desc;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}
#endif

Then the generic irq code just needs to use irq_descp() throughout. And the arch code needs to allocate/free irq_descs and add them to the list. With say:

int add_irq_desc(int irq, struct irq_desc *desc)
{
	struct irq_desc *old;
	int error = -EINVAL;

	spin_lock(&irq_list_lock);
	old = irq_descp(irq);
	if (old)
		goto out;
	list_add_rcu(&desc->list, &irq_list);
	error = 0;
out:
	spin_unlock(&irq_list_lock);
	return error;
}

With the architecture picking the irq number so it can be stable and have meaning to users. Starting from that direction it isn't too hard and it should yield timely results.

Eric
^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 23:36 ` Eric W. Biederman 2008-07-10 0:19 ` H. Peter Anvin @ 2008-07-10 0:23 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 0:23 UTC (permalink / raw) To: Eric W. Biederman Cc: Christoph Lameter, Ingo Molnar, Mike Travis, Andrew Morton, H. Peter Anvin, Jack Steiner, linux-kernel, Arjan van de Ven Eric W. Biederman wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > > >>> Which means that my idea of using the technique we use on x86_32 will not >>> >> work. >> >> No, the compiler memory model we use guarantees that everything will be within >> 2G of each other. The linker will spew loudly if that's not the case. >> > > The per cpu area is at least theoretically dynamically allocated. And we > really want to put it in cpu local memory. Which means on any reasonable > NUMA machine the per cpu areas should be all over the box. > Yes, but that doesn't matter in the slightest. The effective address will be within 2G of the base; the base can be anywhere. > So there is no guarantee that with an arbitrary 64bit address in %gs of anything. > > Grr. Except you are correct. We have to guarantee that the offsets we have > chosen at compile time still work. And we know all of the compile time offsets > will be in the -2G range. So they are all 32bit numbers. Negative 32bit > numbers to be sure. That trivially leaves us with everything working except > the nasty hard coded decimal 40. > Right. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:18 ` Christoph Lameter 2008-07-09 20:33 ` Jeremy Fitzhardinge @ 2008-07-09 20:35 ` H. Peter Anvin 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 20:35 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Jack Steiner, linux-kernel, Arjan van de Ven Christoph Lameter wrote: > > Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range allows in turn to limit the size of the generated instructions because it is part of the instruction. It also is easier to handle since __per_cpu_start does not figure > in the calculation of the offsets. > No, that makes no difference. There is no short-offset form that doesn't involve a register (ignoring the 16-bit 67h form on 32 bits.) For 64 bits, you want to keep the offsets within %rip±2 GB, or you will have relocation overflows for %rip-based forms; for absolute forms you have to be in the range 0-4 GB. The %rip-based forms are shorter, and I'm pretty sure they're the ones we currently generate. Since we base the kernel at 0xffffffff80000000 (-2 GB) this means a zero-based offset is actively wrong, and only works by accident (since the first CONFIG_PHYSICAL_START bytes of that space are unused.) -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:11 ` Jeremy Fitzhardinge 2008-07-09 20:18 ` Christoph Lameter @ 2008-07-09 20:39 ` Arjan van de Ven 2008-07-09 20:44 ` H. Peter Anvin 2008-07-09 20:46 ` Jeremy Fitzhardinge 1 sibling, 2 replies; 190+ messages in thread From: Arjan van de Ven @ 2008-07-09 20:39 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel On Wed, 09 Jul 2008 13:11:03 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ingo Molnar wrote: > > Note that the zero-based percpu problems are completely unrelated > > to stackprotector. I was able to hit them with a > > stackprotector-disabled gcc-4.2.3 environment. > > The only reason we need to keep a zero-based pda is to support > stack-protector. If we drop it, we can drop the pda - and its > special zero-based properties - entirely. what's wrong with zero based btw? do they stop us from using gcc's __thread keyword for per cpu variables or something? (*that* would be a nice feature) or does it stop us from putting the per cpu variables starting from offset 4096 onwards? -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:39 ` Arjan van de Ven @ 2008-07-09 20:44 ` H. Peter Anvin 2008-07-09 20:50 ` Jeremy Fitzhardinge 2008-07-09 20:46 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 20:44 UTC (permalink / raw) To: Arjan van de Ven Cc: Jeremy Fitzhardinge, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel Arjan van de Ven wrote: > On Wed, 09 Jul 2008 13:11:03 -0700 > Jeremy Fitzhardinge <jeremy@goop.org> wrote: > >> Ingo Molnar wrote: >>> Note that the zero-based percpu problems are completely unrelated >>> to stackprotector. I was able to hit them with a >>> stackprotector-disabled gcc-4.2.3 environment. >> The only reason we need to keep a zero-based pda is to support >> stack-protector. If we drop drop it, we can drop the pda - and its >> special zero-based properties - entirely. > > what's wrong with zero based btw? > Two problems: 1. it means pda references are invalid if their offsets are ever more than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...) 2. some vague hints of a linker bug. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:44 ` H. Peter Anvin @ 2008-07-09 20:50 ` Jeremy Fitzhardinge 2008-07-09 21:12 ` H. Peter Anvin 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:50 UTC (permalink / raw) To: H. Peter Anvin Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel H. Peter Anvin wrote: > 1. it means pda references are invalid if their offsets are ever more > than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...) Why? As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE 0 - putting the percpu variables as the first thing in the kernel - and relocating on load? That would avoid having to make a special PT_LOAD segment at 0. Hm, would that result in the pda and the boot params getting mushed together? J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:50 ` Jeremy Fitzhardinge @ 2008-07-09 21:12 ` H. Peter Anvin 2008-07-09 21:26 ` Jeremy Fitzhardinge 2008-07-09 22:10 ` Eric W. Biederman 0 siblings, 2 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 21:12 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > H. Peter Anvin wrote: >> 1. it means pda references are invalid if their offsets are ever more >> than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...) > > Why? > > As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE > 0 - putting the percpu variables as the first thing in the kernel - and > relocating on load? That would avoid having to make a special PT_LOAD > segment at 0. Hm, would that result in the pda and the boot params > getting mushed together? > CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear. Either way, I really suspect that the right thing to do is to use negative offsets, with the possible exception of a handful of things (40 bytes or less, perhaps like current) which can get small positive offsets and end up in the "super hot" cacheline. The sucky part is that I don't believe GNU ld has native support for a "hanging down" section (one which has a fixed endpoint rather than a starting point), so it requires extra magic around the link (or finding some way to do it with linker script functions.) Let me see if I can cook up something in linker script that would actually work. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
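[Editorial note: an untested sketch of how the "hanging down" layout hpa describes might be approximated in a GNU ld script. PERCPU_SIZE is a hypothetical, manually maintained size constant, and the section still grows upward from its start rather than hanging from a fixed endpoint, which is exactly the missing native support hpa points out.]

```ld
/* Untested sketch: make .data.percpu end near 0 by starting it at
 * -PERCPU_SIZE. PERCPU_SIZE is a hypothetical, hand-maintained upper
 * bound on the per cpu area size. */
PERCPU_SIZE = 0x8000;
SECTIONS
{
	. = 0 - PERCPU_SIZE;
	.data.percpu : {
		__per_cpu_start = .;
		*(.data.percpu)
		__per_cpu_end = .;
	}
	/* fail the link if the area outgrows the reserved size */
	ASSERT(SIZEOF(.data.percpu) <= PERCPU_SIZE, ".data.percpu overflow")
	/* ... remaining kernel sections ... */
}
```

Note the drawback: fixing the start instead of the end means the unused gap sits between __per_cpu_end and 0, and someone has to keep PERCPU_SIZE up to date by hand.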
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:12 ` H. Peter Anvin @ 2008-07-09 21:26 ` Jeremy Fitzhardinge 2008-07-09 21:37 ` H. Peter Anvin 2008-07-09 22:10 ` Eric W. Biederman 1 sibling, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 21:26 UTC (permalink / raw) To: H. Peter Anvin Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel H. Peter Anvin wrote: > Either way, I really suspect that the right thing to do is to use > negative offsets, with the possible exception of a handful of things > (40 bytes or less, perhaps like current) which can get small positive > offsets and end up in the "super hot" cacheline. > > The sucky part is that I don't believe GNU ld has native support for a > "hanging down" section (one which has a fixed endpoint rather than a > starting point), so it requires extra magic around the link (or > finding some way to do it with linker script functions.) If you're going to do another linker pass, you could have a script to extract all the percpu symbols and generate a set of derived zero-based ones and then link against that. Or generate a vmlinux with relocations and "relocate" all the percpu symbols down to 0. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:26 ` Jeremy Fitzhardinge @ 2008-07-09 21:37 ` H. Peter Anvin 0 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 21:37 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel Jeremy Fitzhardinge wrote: > > If you're going to do another linker pass, you could have a script to > extract all the percpu symbols and generate a set of derived zero-based > ones and then link against that. > > Or generate a vmlinux with relocations and "relocate" all the percpu > symbols down to 0. > Yeah, I'd hate to have to go to either of those lengths though. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 21:12 ` H. Peter Anvin 2008-07-09 21:26 ` Jeremy Fitzhardinge @ 2008-07-09 22:10 ` Eric W. Biederman 2008-07-09 22:23 ` H. Peter Anvin 1 sibling, 1 reply; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 22:10 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel "H. Peter Anvin" <hpa@zytor.com> writes: > Jeremy Fitzhardinge wrote: >> H. Peter Anvin wrote: >>> 1. it means pda references are invalid if their offsets are ever more than >>> CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...) >> >> Why? >> >> As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE 0 - >> putting the percpu variables as the first thing in the kernel - and relocating >> on load? That would avoid having to make a special PT_LOAD segment at 0. Hm, >> would that result in the pda and the boot params getting mushed together? >> > > CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we > should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear. Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment is linked at a fixed address -2G and the option only determines the virtual to physical address mapping. That said the idea may not be too far off. Potentially we could put the percpu area at our fixed -2G address and then we have a constant (instead of an address) we could subtract from this address. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 22:10 ` Eric W. Biederman @ 2008-07-09 22:23 ` H. Peter Anvin 2008-07-09 23:54 ` Eric W. Biederman 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-09 22:23 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel Eric W. Biederman wrote: >>> >> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we >> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear. > > Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment > is liked at a fixed address -2G and the option only determines the virtual > to physical address mapping. > No, it's not irrelevant; we currently base the kernel at virtual address -2 GB (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper alignment for large pages. Now, it probably wouldn't hurt moving KERNEL_IMAGE_START up a bit to have low positive values safer to use. > That said the idea may not be too far off. > > Potentially we could put the percpu area at our fixed -2G address and then > we have a constant (instead of an address) we could subtract from this address. We can't put it at -2 GB since the offset +40 for the stack sentinel is hard-coded into gcc. This leaves growing upward from +48 (or another small positive number), or growing down from zero (or +40) as realistic options. Unfortunately, GNU ld handles grow-down not at all. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 22:23 ` H. Peter Anvin @ 2008-07-09 23:54 ` Eric W. Biederman 2008-07-10 16:22 ` Mike Travis ` (2 more replies) 0 siblings, 3 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-09 23:54 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel "H. Peter Anvin" <hpa@zytor.com> writes: > Eric W. Biederman wrote: >>>> >>> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we >>> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear. >> >> Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment >> is linked at a fixed address -2G and the option only determines the virtual >> to physical address mapping. >> > > No, it's not irrelevant; we currently base the kernel at virtual address -2 GB > (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper > alignment for large pages. Ugh. That is silly. We need to restrict CONFIG_PHYSICAL_START to the aligned choices obviously. But -2G is better aligned than anything else we can do virtually. For the 32bit code we need to play some of those games because it doesn't have its own magic chunk of the address space to live in. >> That said the idea may not be too far off. >> >> Potentially we could put the percpu area at our fixed -2G address and then >> we have a constant (instead of an address) we could subtract from this > address. > > We can't put it at -2 GB since the offset +40 for the stack sentinel is > hard-coded into gcc. This leaves growing upward from +48 (or another small > positive number), or growing down from zero (or +40) as realistic options. I was thinking everything except that access would be done as: %gs:var - -2G aka %gs:var - START_KERNEL. So that everything was a small 32bit number that the linker and the compiler can resolve. 
The trick is to put the stack canary at 40 decimal. I was just trying to find a compile-time known location for the start of the percpu area so we could subtract it off. Unless the linker just winds up overflowing in the subtraction and doing hideous things to us. Although that should be pretty easy to spot and to test for at build time. -2G has the interesting distinction that we might get away with just dropping the high bits. > Unfortunately, GNU ld handles grow-down not at all. Another alternative that almost fares better than a segment with a base of zero is a base of -32K or so. The only trouble is that it would get us back to manually managing the per cpu area size again. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 23:54 ` Eric W. Biederman @ 2008-07-10 16:22 ` Mike Travis 2008-07-10 16:25 ` H. Peter Anvin ` (2 more replies) 2008-07-10 17:57 ` H. Peter Anvin 2008-07-10 18:08 ` H. Peter Anvin 2 siblings, 3 replies; 190+ messages in thread From: Mike Travis @ 2008-07-10 16:22 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Eric W. Biederman wrote: ... > Another alternative that almost fares better then a segment with > a base of zero is a base of -32K or so. Only trouble that would get us > manually managing the per cpu area size again. One thing to remember is the eventual goal is implementing the cpu_alloc functions which I think we've agreed has to be "growable". This means that the addresses will need to be virtual to allow the same offsets for all cpus. The patchset I have uses 2Mb pages. This "little" twist might figure into the implementation issues that are being discussed. Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:22 ` Mike Travis @ 2008-07-10 16:25 ` H. Peter Anvin 2008-07-10 16:35 ` Christoph Lameter 2008-07-10 17:20 ` Mike Travis 2008-07-10 17:07 ` Jeremy Fitzhardinge 2008-07-10 18:48 ` Eric W. Biederman 2 siblings, 2 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 16:25 UTC (permalink / raw) To: Mike Travis Cc: Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Mike Travis wrote: > Eric W. Biederman wrote: > ... >> Another alternative that almost fares better then a segment with >> a base of zero is a base of -32K or so. Only trouble that would get us >> manually managing the per cpu area size again. > > One thing to remember is the eventual goal is implementing the cpu_alloc > functions which I think we've agreed has to be "growable". This means that > the addresses will need to be virtual to allow the same offsets for all cpus. > The patchset I have uses 2Mb pages. This "little" twist might figure into the > implementation issues that are being discussed. > No, since the *addresses* can be arbitrary. The current issue is about *offsets.* -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:25 ` H. Peter Anvin @ 2008-07-10 16:35 ` Christoph Lameter 2008-07-10 16:39 ` H. Peter Anvin 2008-07-10 17:20 ` Mike Travis 1 sibling, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 16:35 UTC (permalink / raw) To: H. Peter Anvin Cc: Mike Travis, Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell H. Peter Anvin wrote: > No, since the *addresses* can be arbitrary. The current issue is about > *offsets.* Well those are intimately connected. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:35 ` Christoph Lameter @ 2008-07-10 16:39 ` H. Peter Anvin 2008-07-10 16:47 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 16:39 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > H. Peter Anvin wrote: > >> No, since the *addresses* can be arbitrary. The current issue is about >> *offsets.* > > Well those are intimately connected. Not really, since gs_base is an arbitrary 64-bit pointer. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:39 ` H. Peter Anvin @ 2008-07-10 16:47 ` Christoph Lameter 2008-07-10 17:21 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 16:47 UTC (permalink / raw) To: H. Peter Anvin Cc: Mike Travis, Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell H. Peter Anvin wrote: > Christoph Lameter wrote: >> H. Peter Anvin wrote: >> >>> No, since the *addresses* can be arbitrary. The current issue is about >>> *offsets.* >> >> Well those are intimately connected. > > Not really, since gs_base is an arbitrary 64-bit pointer. The current scheme ties the offsets to kernel code addresses. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:47 ` Christoph Lameter @ 2008-07-10 17:21 ` Jeremy Fitzhardinge 2008-07-10 17:31 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 17:21 UTC (permalink / raw) To: Christoph Lameter Cc: H. Peter Anvin, Mike Travis, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > H. Peter Anvin wrote: > >> Christoph Lameter wrote: >> >>> H. Peter Anvin wrote: >>> >>> >>>> No, since the *addresses* can be arbitrary. The current issue is about >>>> *offsets.* >>>> >>> Well those are intimately connected. >>> >> Not really, since gs_base is an arbitrary 64-bit pointer. >> > > The current scheme ties the offsets to kernel code addresses. > This is getting very frustrating. We've been going around and around on this point, what, 5 or 6 times at least. The base address of the percpu area and the offsets from that base are completely independent values. The offset is limited to 2G. The 2G limit applies regardless of how you compute your effective address. It doesn't matter if its absolute. It doesn't matter if it's rip-relative. It doesn't matter if it's zero-based. Small absolute addresses generate exactly the same form as large absolute addresses. There is no 8-bit or 16-bit address mode. The base is arbitrary. It can be any canonical address at all. It has no effect on how you compute your offset. The addressing modes: * ABS * off(%rip) Are exactly equivalent in what offsets they can generate, so long as *at link time* the percpu *symbols* are within 2G of the code addressing them. *After* the addressing mode has generated an effective address (by whatever means it likes), the %gs: override applies the segment base, which can therefore offset the effective address to anywhere at all. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:21 ` Jeremy Fitzhardinge @ 2008-07-10 17:31 ` Christoph Lameter 2008-07-10 17:48 ` Jeremy Fitzhardinge 2008-07-10 18:00 ` H. Peter Anvin 0 siblings, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 17:31 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Mike Travis, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge wrote: > > The base address of the percpu area and the offsets from that base are > completely independent values. Definitely. > The addressing modes: > > * ABS > * off(%rip) > > Are exactly equivalent in what offsets they can generate, so long as *at > link time* the percpu *symbols* are within 2G of the code addressing > them. *After* the addressing mode has generated an effective address > (by whatever means it likes), the %gs: override applies the segment > base, which can therefore offset the effective address to anywhere at all. Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu and later we setup other additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas. However, that does not work if one calculates the virtual address instead of looking up a physical address. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:31 ` Christoph Lameter @ 2008-07-10 17:48 ` Jeremy Fitzhardinge 2008-07-10 18:00 ` H. Peter Anvin 1 sibling, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 17:48 UTC (permalink / raw) To: Christoph Lameter Cc: H. Peter Anvin, Mike Travis, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > >> The base address of the percpu area and the offsets from that base are >> completely independent values. >> > > Definitely. > > > >> The addressing modes: >> >> * ABS >> * off(%rip) >> >> Are exactly equivalent in what offsets they can generate, so long as *at >> link time* the percpu *symbols* are within 2G of the code addressing >> them. *After* the addressing mode has generated an effective address >> (by whatever means it likes), the %gs: override applies the segment >> base, which can therefore offset the effective address to anywhere at all. >> > > Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu and later we setup other additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas. > Yes, but the offset is the same either way. When you want a cpu to refer to its own percpu memory, regardless of where it is in memory, you just reload the gs base. The offsets are the same everywhere, and are computed by the linker without knowledge or reference to where the final address will end up. In other words, at source level: a = x86_read_percpu(foo) will generate mov %gs:percpu__foo, %rax where the linker decides the value of percpu__foo, which can be up to 4G. Or if we use rip-relative: mov %gs:percpu__foo(%rip), %rax we end up with the same result, except that the generated instruction is a bit more compact. 
In the final generated assembly, it ends up being a hardcoded constant address. Say, 0x7838. Now if we allocate cpu 43 percpu data at 0xfffffffff7198000, we load %gs base with that value, and then the instruction is still mov %gs:0x7838, %rax and the computed address will be 0xfffffffff7198000 + 0x7838 = 0xfffffffff719f838. And cpu 62 has its percpu data at 0xffffffffe3819000, and the instruction is still mov %gs:0x7838, %rax and the computed address for its version of percpu__foo is 0xffffffffe3819000 + 0x7838 = 0xffffffffe3820838. Note that it doesn't matter how you decide to place the percpu data, so long as you can load the address into the %gs base. > However, that does not work if one calculates the virtual address instead of looking up a physical address. > Calculate a virtual address for what? Physical address for what? If you have a large virtual region allocating 256M of percpu space, er, per cpu, then you just load %gs base with percpu_region_base + cpuid * 256M. It has no effect on the instructions accessing that percpu space. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:31 ` Christoph Lameter 2008-07-10 17:48 ` Jeremy Fitzhardinge @ 2008-07-10 18:00 ` H. Peter Anvin 1 sibling, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 18:00 UTC (permalink / raw) To: Christoph Lameter Cc: Jeremy Fitzhardinge, Mike Travis, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > > Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu and later we setup other additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas. > > However, that does not work if one calculates the virtual address instead of looking up a physical address. > As far as the linker is concerned, there are two address spaces: VMA, which is the offset, and LMA, which is the physical address on where to load. The linker doesn't give a flying hoot about the virtual address, since it's completely irrelevant as far as it is concerned; it's nothing but a kernel-internal abstraction. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:25 ` H. Peter Anvin 2008-07-10 16:35 ` Christoph Lameter @ 2008-07-10 17:20 ` Mike Travis 1 sibling, 0 replies; 190+ messages in thread From: Mike Travis @ 2008-07-10 17:20 UTC (permalink / raw) To: H. Peter Anvin Cc: Eric W. Biederman, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell H. Peter Anvin wrote: > Mike Travis wrote: >> Eric W. Biederman wrote: >> ... >>> Another alternative that almost fares better then a segment with >>> a base of zero is a base of -32K or so. Only trouble that would get us >>> manually managing the per cpu area size again. >> >> One thing to remember is the eventual goal is implementing the cpu_alloc >> functions which I think we've agreed has to be "growable". This means >> that >> the addresses will need to be virtual to allow the same offsets for >> all cpus. >> The patchset I have uses 2Mb pages. This "little" twist might figure >> into the >> implementation issues that are being discussed. >> > > No, since the *addresses* can be arbitrary. The current issue is about > *offsets.* > > -hpa Ok, thanks for clearing that up. I just didn't want us to drop the ball trying to make that double play... ;-) Cheers, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:22 ` Mike Travis 2008-07-10 16:25 ` H. Peter Anvin @ 2008-07-10 17:07 ` Jeremy Fitzhardinge 2008-07-10 17:12 ` Christoph Lameter 2008-07-10 17:41 ` Mike Travis 2008-07-10 18:48 ` Eric W. Biederman 2 siblings, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 17:07 UTC (permalink / raw) To: Mike Travis Cc: Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Mike Travis wrote: > One thing to remember is the eventual goal is implementing the cpu_alloc > functions which I think we've agreed has to be "growable". This means that > the addresses will need to be virtual to allow the same offsets for all cpus. > The patchset I have uses 2Mb pages. This "little" twist might figure into the > implementation issues that are being discussed. You want to virtually map the percpu area? How and when would it get extended? J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:07 ` Jeremy Fitzhardinge @ 2008-07-10 17:12 ` Christoph Lameter 2008-07-10 17:25 ` Jeremy Fitzhardinge 2008-07-10 17:41 ` Mike Travis 1 sibling, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 17:12 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge wrote: > You want to virtually map the percpu area? How and when would it get > extended? It would get extended when cpu_alloc() is called and the allocator finds that there is no per cpu memory available. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:12 ` Christoph Lameter @ 2008-07-10 17:25 ` Jeremy Fitzhardinge 2008-07-10 17:34 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 17:25 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > Jeremy Fitzhardinge wrote: > > >> You want to virtually map the percpu area? How and when would it get >> extended? >> > > It would get extended when cpu_alloc() is called and the allocator finds that there is no per cpu memory available. > Which, I take it, allocates percpu memory. It would have the same caveats as vmalloc memory, with respect to accessing it during fault handlers and nmi handlers, I take it. How would cpu_alloc() actually get used? It doesn't make much sense for general code, since we don't have the notion of a percpu pointer to memory (vs a pointer to percpu memory). Is the intended use for allocating percpu memory in modules? What other uses? J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:25 ` Jeremy Fitzhardinge @ 2008-07-10 17:34 ` Christoph Lameter 0 siblings, 0 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 17:34 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge wrote: > Christoph Lameter wrote: >> Jeremy Fitzhardinge wrote: >> >> >>> You want to virtually map the percpu area? How and when would it get >>> extended? >>> >> >> It would get extended when cpu_alloc() is called and the allocator >> finds that there is no per cpu memory available. >> > > Which, I take it, allocates percpu memory. It would have the same > caveats as vmalloc memory, with respect to accessing it during fault > handlers and nmi handlers, I take it. Right. One would not want to allocate per cpu memory in those contexts. The current allocpercpu() functions already have those restrictions. > How would cpu_alloc() actually get used? It doesn't make much sense for > general code, since we don't have the notion of a percpu pointer to > memory (vs a pointer to percpu memory). Is the intended use for > allocating percpu memory in modules? What other uses? Argh. Do I have to reexplain all of this all over again? Please look at the latest cpu_alloc patchset and the related discussions. The cpu_alloc patchset introduces the concept of a pointer to percpu memory. Or you could call it an offset into the percpu segment that can be treated (in a restricted way) like a pointer.... ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:07 ` Jeremy Fitzhardinge 2008-07-10 17:12 ` Christoph Lameter @ 2008-07-10 17:41 ` Mike Travis 2008-07-10 18:01 ` H. Peter Anvin 1 sibling, 1 reply; 190+ messages in thread From: Mike Travis @ 2008-07-10 17:41 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Eric W. Biederman, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge wrote: > Mike Travis wrote: >> One thing to remember is the eventual goal is implementing the cpu_alloc >> functions which I think we've agreed has to be "growable". This means >> that >> the addresses will need to be virtual to allow the same offsets for >> all cpus. >> The patchset I have uses 2Mb pages. This "little" twist might figure >> into the >> implementation issues that are being discussed. > > You want to virtually map the percpu area? How and when would it get > extended? > > J CPU_ALLOC(), or some such means. This is to replace the percpu allocator in modules.c: Subject: cpu alloc: The allocator The per cpu allocator allows dynamic allocation of memory on all processors simultaneously. A bitmap is used to track used areas. The allocator implements tight packing to reduce the cache footprint and increase speed since cacheline contention is typically not a concern for memory mainly used by a single cpu. Small objects will fill up gaps left by larger allocations that required alignments. The size of the cpu_alloc area can be changed via make menuconfig. Signed-off-by: Christoph Lameter <clameter@sgi.com> and: Subject: cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Remove the builtin per cpu allocator from modules.c and use cpu_alloc instead. The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is determined by CONFIG_CPU_AREA_SIZE. PERCPU_ENOUGH_ROOMs default was 8k. CONFIG_CPU_AREA_SIZE defaults to 32k. 
Thus we have more space to load modules. Signed-off-by: Christoph Lameter <clameter@sgi.com> The discussion that followed was very emphatic that the size of the space should not be fixed, but instead be dynamically growable. Since the offset needs to be fixed for each cpu, then virtual (I think) is the only way to go. The use of a 2MB page just conserves map entries. (Of course, if we just reserved 2MB in the first place it might not need to be virtual...? But the concern was for systems with hundreds of (say) network interfaces using even more than 2MB.) Thanks, Mike ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 17:41 ` Mike Travis @ 2008-07-10 18:01 ` H. Peter Anvin 2008-07-10 20:51 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 18:01 UTC (permalink / raw) To: Mike Travis Cc: Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Mike Travis wrote: > > The discussion that followed was very emphatic that the size of the space should > not be fixed, but instead be dynamically growable. Since the offset needs to be > fixed for each cpu, then virtual (I think) is the only way to go. The use of a > 2MB page just conserves map entries. (Of course, if we just reserved 2MB in the > first place it might not need to be virtual...? But the concern was for systems > with hundreds of (say) network interfaces using even more than 2MB.) > I'm much more concerned about wasting an average of 1 MB of memory per CPU. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 18:01 ` H. Peter Anvin @ 2008-07-10 20:51 ` Christoph Lameter 2008-07-10 20:58 ` H. Peter Anvin 0 siblings, 1 reply; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 20:51 UTC (permalink / raw) To: H. Peter Anvin Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell H. Peter Anvin wrote: > I'm much more concerned about wasting an average of 1 MB of memory per CPU. Well all the memory that is now allocated via allocpercpu()s will be allocated from that 2MB segment. And cpu_alloc packs variables in a dense way. The current slab allocations try to avoid sharing cachelines which wastes lots of memory for every allocation for each and every processor. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 20:51 ` Christoph Lameter @ 2008-07-10 20:58 ` H. Peter Anvin 2008-07-10 21:07 ` Christoph Lameter 0 siblings, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 20:58 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > H. Peter Anvin wrote: > >> I'm much more concerned about wasting an average of 1 MB of memory per CPU. > > Well all the memory that is now allocated via allocpercpu()s will be allocated from that 2MB segment. And cpu_alloc packs variables in a dense way. The current slab allocations try to avoid sharing cachelines which wastes lots of memory for every allocation for each and every processor. > And how much is that, especially on *small* systems? -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 20:58 ` H. Peter Anvin @ 2008-07-10 21:07 ` Christoph Lameter 2008-07-10 21:11 ` H. Peter Anvin 2008-07-10 21:26 ` Eric W. Biederman 0 siblings, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-10 21:07 UTC (permalink / raw) To: H. Peter Anvin Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell H. Peter Anvin wrote: > And how much is that, especially on *small* systems? i386? i386 uses 4K mappings. There are just a few cpus supported, there is scarcity of ZONE_NORMAL memory so the per cpu areas really cannot get that big. See the cpu_alloc patchsets for i386. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 21:07 ` Christoph Lameter @ 2008-07-10 21:11 ` H. Peter Anvin 2008-07-11 15:32 ` Christoph Lameter 2008-07-10 21:26 ` Eric W. Biederman 1 sibling, 1 reply; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 21:11 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: > H. Peter Anvin wrote: > >> And how much is that, especially on *small* systems? > > i386? > > i386 uses 4K mappings. There are just a few cpus supported, there is scarcity of ZONE_NORMAL memory so the per cpu areas really cannot get that big. See the cpu_alloc patchsets for i386. > No, not i386. x86-64. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 21:11 ` H. Peter Anvin @ 2008-07-11 15:32 ` Christoph Lameter 2008-07-11 16:07 ` H. Peter Anvin 2008-07-11 16:57 ` Eric W. Biederman 0 siblings, 2 replies; 190+ messages in thread From: Christoph Lameter @ 2008-07-11 15:32 UTC (permalink / raw) To: H. Peter Anvin Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell H. Peter Anvin wrote: > Christoph Lameter wrote: >> H. Peter Anvin wrote: >> >>> And how much is that, especially on *small* systems? >> >> i386? >> >> i386 uses 4K mappings. There are just a few cpus supported, there is >> scarcity of ZONE_NORMAL memory so the per cpu areas really cannot get >> that big. See the cpu_alloc patchsets for i386. >> > > No, not i386. x86-64. x86_64 are small systems? For 64 bit use one would expect 4GB of memory. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-11 15:32 ` Christoph Lameter @ 2008-07-11 16:07 ` H. Peter Anvin 2008-07-11 16:57 ` Eric W. Biederman 0 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-11 16:07 UTC (permalink / raw) To: Christoph Lameter Cc: Mike Travis, Jeremy Fitzhardinge, Eric W. Biederman, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter wrote: >>> >> No, not i386. x86-64. > > x86_64 are small systems? For 64 bit use one would expect 4GB of memory. Hardly so. i386 has too many other limitations; for one thing, it starts having performance problems (needing HIGHMEM) at less than 1 GB. The additional registers, etc. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-11 15:32 ` Christoph Lameter 2008-07-11 16:07 ` H. Peter Anvin @ 2008-07-11 16:57 ` Eric W. Biederman 2008-07-11 17:10 ` H. Peter Anvin 1 sibling, 1 reply; 190+ messages in thread From: Eric W. Biederman @ 2008-07-11 16:57 UTC (permalink / raw) To: Christoph Lameter Cc: H. Peter Anvin, Mike Travis, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell > x86_64 are small systems? For 64 bit use one would expect 4GB of memory. Anything over 1G where 32bit runs out of lowmem starts to be a win. Plus sometimes it just doesn't make sense to use a 32bit kernel. Expecting 4GB of real memory seems silly. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-11 16:57 ` Eric W. Biederman @ 2008-07-11 17:10 ` H. Peter Anvin 0 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-11 17:10 UTC (permalink / raw) To: Eric W. Biederman Cc: Christoph Lameter, Mike Travis, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Eric W. Biederman wrote: >> x86_64 are small systems? For 64 bit use one would expect 4GB of memory. > > Anything over 1G where 32bit runs out of lowmem starts to be a win. > Plus sometimes it just doesn't make sense to use a 32bit kernel. > > Expecting 4GB of real memory seems silly. Especially since people set up VM instances small and then grow them. They can have *weird* CPU-to-RAM ratios; I have heard of 8 VCPUs and 24 MB of RAM in production. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 21:07 ` Christoph Lameter 2008-07-10 21:11 ` H. Peter Anvin @ 2008-07-10 21:26 ` Eric W. Biederman 1 sibling, 0 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-10 21:26 UTC (permalink / raw) To: Christoph Lameter Cc: H. Peter Anvin, Mike Travis, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Jack Steiner, linux-kernel, Rusty Russell Christoph Lameter <cl@linux-foundation.org> writes: > H. Peter Anvin wrote: > >> And how much is that, especially on *small* systems? > > i386? > > i386 uses 4K mappings. There are just a few cpus supported, there is scarcity of > ZONE_NORMAL memory so the per cpu areas really cannot get that big. See the > cpu_alloc patchsets for i386. i386 is fundamentally resource constrained. However x86_32 should support a strict superset of the machines the x86_64 kernel supports. Because it is resource constrained in the lowmem zone, you should not be able to bring up all of the cpus on a huge cpu box. But you should still be able to boot and run the kernel. So for percpu data we have effectively the same size constraints. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 16:22 ` Mike Travis 2008-07-10 16:25 ` H. Peter Anvin 2008-07-10 17:07 ` Jeremy Fitzhardinge @ 2008-07-10 18:48 ` Eric W. Biederman 2008-07-10 18:54 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 190+ messages in thread From: Eric W. Biederman @ 2008-07-10 18:48 UTC (permalink / raw) To: Mike Travis Cc: H. Peter Anvin, Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Mike Travis <travis@sgi.com> writes: > Eric W. Biederman wrote: > ... >> Another alternative that almost fares better then a segment with >> a base of zero is a base of -32K or so. Only trouble that would get us >> manually managing the per cpu area size again. > > One thing to remember is the eventual goal is implementing the cpu_alloc > functions which I think we've agreed has to be "growable". This means that > the addresses will need to be virtual to allow the same offsets for all cpus. > The patchset I have uses 2Mb pages. This "little" twist might figure into the > implementation issues that are being discussed. I had not heard that. However if you are going to use 2MB pages you might as well just use a physical address at the start of a node. 2MB is so much larger than the size of the per cpu memory we need today it isn't even funny. To get 32K I had to round up on my current system, and honestly it is important that per cpu data stay relatively small as otherwise the system won't have memory to use for anything interesting. I just took a quick look at our alloc_percpu calls. At a first glance they all appear to be for relatively small data structures. So we can just about get away with doing what we do today for modules for everything. The question is what to do when we fill up our preallocated size for percpu data. I think we can get away with just simply realloc'ing the percpu area on each cpu. No fancy table manipulations required. 
Just update the base pointer in %gs and in someplace global. If you do use virtual addresses, that really requires 4K pages, so we can benefit from non-contiguous allocations. I just can't imagine the per cpu area getting up to 2MB in size, where you would need multiple 2MB pages. That is a huge jump from the 32KB I see today. For the rest, mostly I have been making a list of things that we can do that could work. A zero based percpu area is great if you can eliminate it from suspicion of your weird random failures. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 18:48 ` Eric W. Biederman @ 2008-07-10 18:54 ` Jeremy Fitzhardinge 2008-07-10 19:18 ` Eric W. Biederman 0 siblings, 1 reply; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 18:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Eric W. Biederman wrote: > I think we can get away with just simply realloc'ing the percpu area > on each cpu. No fancy table manipulations required. Just update > the base pointer in %gs and in someplace global. > It's perfectly legitimate to take the address of a percpu variable and store it somewhere. We can't move them around. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 18:54 ` Jeremy Fitzhardinge @ 2008-07-10 19:18 ` Eric W. Biederman 2008-07-10 19:56 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 190+ messages in thread From: Eric W. Biederman @ 2008-07-10 19:18 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge <jeremy@goop.org> writes: > Eric W. Biederman wrote: >> I think we can get away with just simply realloc'ing the percpu area >> on each cpu. No fancy table manipulations required. Just update >> the base pointer in %gs and in someplace global. >> > > It's perfectly legitimate to take the address of a percpu variable and store it > somewhere. We can't move them around. Really? I guess there are cases where that makes sense. It is a pretty rare case though. Especially when you are not talking about doing it temporarily with preemption disabled. There are few enough users of the API I think we can certainly explore the cost of forbidding in the general case of storing the address of a percpu variable. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 19:18 ` Eric W. Biederman @ 2008-07-10 19:56 ` Jeremy Fitzhardinge 2008-07-10 20:22 ` Eric W. Biederman 2008-07-10 20:25 ` Eric W. Biederman 0 siblings, 2 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 19:56 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Eric W. Biederman wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > > >> Eric W. Biederman wrote: >> >>> I think we can get away with just simply realloc'ing the percpu area >>> on each cpu. No fancy table manipulations required. Just update >>> the base pointer in %gs and in someplace global. >>> >>> >> It's perfectly legitimate to take the address of a percpu variable and store it >> somewhere. We can't move them around. >> > > Really? I guess there are cases where that makes sense. It is a pretty > rare case though. Especially when you are not talking about doing it temporarily > with preemption disabled. There are few enough users of the API I think we can > certainly explore the cost of forbidding in the general case of storing the > address of a percpu variable. > No, that sounds like a bad idea. For one, how would you enforce it? How would you check for it? It's one of those things that would mostly work and then fail very rarely. Secondly, I depend on it. I register a percpu structure with Xen to share per-vcpu specific information (interrupt mask, time info, runstate stats, etc). J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 19:56 ` Jeremy Fitzhardinge @ 2008-07-10 20:22 ` Eric W. Biederman 2008-07-10 20:54 ` Jeremy Fitzhardinge 2008-07-11 6:59 ` Rusty Russell 2008-07-10 20:25 ` Eric W. Biederman 1 sibling, 2 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-10 20:22 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge <jeremy@goop.org> writes: > No, that sounds like a bad idea. For one, how would you enforce it? How would > you check for it? It's one of those things that would mostly work and then fail > very rarely. Well the easiest way would be to avoid the letting people take the address of per cpu memory, and just provide macros to read/write it. We are 90% of the way there already so it isn't a big jump. > Secondly, I depend on it. I register a percpu structure with Xen to share > per-vcpu specific information (interrupt mask, time info, runstate stats, etc). Well even virtual allocation is likely to break the Xen sharing case as you would at least need to compute the physical address and pass it to Xen. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 20:22 ` Eric W. Biederman @ 2008-07-10 20:54 ` Jeremy Fitzhardinge 2008-07-11 6:59 ` Rusty Russell 1 sibling, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-10 20:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Eric W. Biederman wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > > >> No, that sounds like a bad idea. For one, how would you enforce it? How would >> you check for it? It's one of those things that would mostly work and then fail >> very rarely. >> > > Well the easiest way would be to avoid the letting people take the address of > per cpu memory, and just provide macros to read/write it. We are 90% of the > way there already so it isn't a big jump. > Well, the x86_X_percpu api is there. But per_cpu() and get_cpu_var() both explicitly return lvalues which can have their addresses taken. >> Secondly, I depend on it. I register a percpu structure with Xen to share >> per-vcpu specific information (interrupt mask, time info, runstate stats, etc). >> > > Well even virtual allocation is likely to break the Xen sharing case as you > would at least need to compute the physical address and pass it to Xen. > Right. At the moment it assumes that the percpu variable is in the linear mapping, but it could easily do a pagetable walk if necessary. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 20:22 ` Eric W. Biederman 2008-07-10 20:54 ` Jeremy Fitzhardinge @ 2008-07-11 6:59 ` Rusty Russell 1 sibling, 0 replies; 190+ messages in thread From: Rusty Russell @ 2008-07-11 6:59 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel On Friday 11 July 2008 06:22:52 Eric W. Biederman wrote: > Jeremy Fitzhardinge <jeremy@goop.org> writes: > > No, that sounds like a bad idea. For one, how would you enforce it? How > > would you check for it? It's one of those things that would mostly work > > and then fail very rarely. > > Well the easiest way would be to avoid the letting people take the address > of per cpu memory, and just provide macros to read/write it. We are 90% of > the way there already so it isn't a big jump. Hi Eric, I decided against that originally, but we can revisit that decision. But it would *not* be easy. Try it on kernel/sched.c which uses per-cpu "struct rq". Perhaps we could limit dynamically allocated per-cpu mem this way though... Rusty. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-10 19:56 ` Jeremy Fitzhardinge 2008-07-10 20:22 ` Eric W. Biederman @ 2008-07-10 20:25 ` Eric W. Biederman 1 sibling, 0 replies; 190+ messages in thread From: Eric W. Biederman @ 2008-07-10 20:25 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Mike Travis, H. Peter Anvin, Arjan van de Ven, Ingo Molnar, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel, Rusty Russell Jeremy Fitzhardinge <jeremy@goop.org> writes: > Secondly, I depend on it. I register a percpu structure with Xen to share > per-vcpu specific information (interrupt mask, time info, runstate stats, etc). Note that I expect at least something like this is interesting in the context of per cpu device queues. However, except possibly for Xen, that implies allocating DMA addressable memory and going through that API, which will keep device drivers from using per cpu memory that way even if they are allocating something for each cpu. Eric ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 23:54 ` Eric W. Biederman 2008-07-10 16:22 ` Mike Travis @ 2008-07-10 17:57 ` H. Peter Anvin 2008-07-10 18:08 ` H. Peter Anvin 2 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 17:57 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel Eric W. Biederman wrote: >>> >> No, it's not irrelevant; we currently base the kernel at virtual address -2 GB >> (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper >> alignment for large pages. > > Ugh. That is silly. We need to restrict CONFIG_PHYSICAL_START to the aligned > choices obviously. But -2G is better aligned than anything else we can do virtually. > You may think it's silly, but it's actually an advantage in this case. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 23:54 ` Eric W. Biederman 2008-07-10 16:22 ` Mike Travis 2008-07-10 17:57 ` H. Peter Anvin @ 2008-07-10 18:08 ` H. Peter Anvin 2 siblings, 0 replies; 190+ messages in thread From: H. Peter Anvin @ 2008-07-10 18:08 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, Arjan van de Ven, Ingo Molnar, Mike Travis, Andrew Morton, Christoph Lameter, Jack Steiner, linux-kernel Eric W. Biederman wrote: > > Another alternative that almost fares better then a segment with > a base of zero is a base of -32K or so. Only trouble that would get us > manually managing the per cpu area size again. > Yes, an extra link pass would be better than that. I have tried, and none of the clever things I tried actually works, since GNU ld has pretty much no way to get it to reveal its information ahead of time. However, I want to explore the details of the supposed toolchain issue; it might be just a simple tweak to the way things are done now to fix it. Clearly, given the stack protector ABI, we want %gs:40 to be usable and free. -hpa ^ permalink raw reply [flat|nested] 190+ messages in thread
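For reference, the code gcc emits with -fstack-protector on x86-64 looks roughly like the following sketch. In userspace the canary lives at %fs:40 per the ABI; the point of hpa's remark is that the kernel, which runs with %gs as its per-cpu segment, must keep %gs:40 free for the same job:

```asm
func:
        ...
        movq    %gs:40, %rax        # load the canary from the fixed ABI slot
        movq    %rax, -8(%rbp)      # park a copy just below the frame data
        ...                         # function body (local buffers live here)
        movq    -8(%rbp), %rax
        xorq    %gs:40, %rax        # canary still intact?
        jne     .Lfail              # no: something overran a buffer
        leave
        ret
.Lfail:
        call    __stack_chk_fail    # mismatch path; never returns
```

The hard-coded 40 is why any scheme that rebases the percpu segment has to reserve that slot explicitly rather than letting the linker place a variable there.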
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:39 ` Arjan van de Ven 2008-07-09 20:44 ` H. Peter Anvin @ 2008-07-09 20:46 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 190+ messages in thread From: Jeremy Fitzhardinge @ 2008-07-09 20:46 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, Eric W. Biederman, Mike Travis, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel Arjan van de Ven wrote: > what's wrong with zero based btw? > Nothing in princple. In practice it's triggering an amazing variety of toolchain bugs. > do they stop us from using gcc's __thread keyword for per cpu variables > or something? (*that* would be a nice feature) > The powerpc guys tried it, and it doesn't work. per-cpu is not semantically equivalent to per-thread. If you have a function in which you refer to a percpu variable and then have a preemptable section in the middle followed by another reference to the same percpu variable, it's hard to stop gcc from caching a reference to the old tls variable, even though we may have switched cpus in the meantime. Also, we explicitly use the other segment register in kernel mode, to avoid segment register switches where possible. Even with -mcmodel=kernel, gcc generates %fs references to tls variables. J ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses 2008-07-09 20:00 ` Eric W. Biederman 2008-07-09 20:05 ` Jeremy Fitzhardinge 2008-07-09 20:07 ` Ingo Molnar @ 2008-07-09 20:14 ` Arjan van de Ven 2008-07-09 20:33 ` Eric W. Biederman 2008-07-09 21:39 ` Mike Travis 3 siblings, 1 reply; 190+ messages in thread From: Arjan van de Ven @ 2008-07-09 20:14 UTC (permalink / raw) To: Eric W. Biederman Cc: Mike Travis, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel On Wed, 09 Jul 2008 13:00:19 -0700 ebiederm@xmission.com (Eric W. Biederman) wrote: > > I just took a quick look at how stack_protector works on x86_64. > Unless there is some deep kernel magic that changes the segment > register to %gs from the ABI specified %fs CC_STACKPROTECTOR is > totally broken on x86_64. We access our pda through %gs. and so does gcc in kernel mode. > > Further -fstack-protector-all only seems to detect against buffer > overflows and thus corruption of the stack. Not stack overflows. So > it doesn't appear especially useful. stopping buffer overflows and other return address corruption is not useful? Excuse me? > > So why don't we kill the broken CONFIG_CC_STACKPROTECTOR. Stop trying > to figure out how to use a zero based percpu area. So why don't we NOT do that and fix instead what you're trying to do? > > That should allow us to make the current pda a per cpu variable, and > use %gs with a large offset to access the per cpu area. and what does that gain us? -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 20:14             ` Arjan van de Ven
@ 2008-07-09 20:33               ` Eric W. Biederman
  2008-07-09 21:01                 ` Ingo Molnar
  0 siblings, 1 reply; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 20:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Mike Travis, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
	H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel

Arjan van de Ven <arjan@infradead.org> writes:

> On Wed, 09 Jul 2008 13:00:19 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> I just took a quick look at how stack_protector works on x86_64.
>> Unless there is some deep kernel magic that changes the segment
>> register to %gs from the ABI specified %fs, CC_STACKPROTECTOR is
>> totally broken on x86_64.  We access our pda through %gs.
>
> and so does gcc in kernel mode.

Some gcc's in kernel mode.  The one I tested with doesn't.

>> Further -fstack-protector-all only seems to protect against buffer
>> overflows and thus corruption of the stack.  Not stack overflows.  So
>> it doesn't appear especially useful.
>
> stopping buffer overflows and other return address corruption is not
> useful? Excuse me?

Stopping buffer overflows and return address corruption is useful.
Simply catching them and panicking the machine when they occur is less
useful.  We aren't perfect, but we have a pretty good track record of
handling this with old fashioned methods.

>> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR.  Stop trying
>> to figure out how to use a zero based percpu area.
>
> So why don't we NOT do that and fix instead what you're trying to do?

So our choices are:
- fix -fstack-protector to not use a hard coded offset, or
- fix gcc/ld to not miscompile the kernel at random times, which
  prevents us from booting when we add a segment with an address at 0.

-fstack-protector does not use the TLS ABI and instead uses nasty hard
coded magic, and that is why it is a problem.  Otherwise we could easily
support it.

>> That should allow us to make the current pda a per cpu variable, and
>> use %gs with a large offset to access the per cpu area.
>
> and what does that gain us?

A faster, more maintainable kernel.

Eric
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 20:33               ` Eric W. Biederman
@ 2008-07-09 21:01                 ` Ingo Molnar
  0 siblings, 0 replies; 190+ messages in thread
From: Ingo Molnar @ 2008-07-09 21:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Arjan van de Ven, Mike Travis, Jeremy Fitzhardinge, Andrew Morton,
	H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel

* Eric W. Biederman <ebiederm@xmission.com> wrote:

> Arjan van de Ven <arjan@infradead.org> writes:
>
> > On Wed, 09 Jul 2008 13:00:19 -0700
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> >> I just took a quick look at how stack_protector works on x86_64.
> >> Unless there is some deep kernel magic that changes the segment
> >> register to %gs from the ABI specified %fs, CC_STACKPROTECTOR is
> >> totally broken on x86_64.  We access our pda through %gs.
> >
> > and so does gcc in kernel mode.
>
> Some gcc's in kernel mode.  The one I tested with doesn't.

yes - stackprotector enabled distros build with kernel mode enabled gcc.

> >> Further -fstack-protector-all only seems to protect against buffer
> >> overflows and thus corruption of the stack.  Not stack overflows.
> >> So it doesn't appear especially useful.
> >
> > stopping buffer overflows and other return address corruption is not
> > useful? Excuse me?
>
> Stopping buffer overflows and return address corruption is useful.
> Simply catching them and panicking the machine when they occur is less
> useful. [...]

why? I personally prefer an informative panic in an overflow-suspect
piece of code instead of a guest root on my machine.

I think you miss one of the fundamental security aspects here.  The
panic is not there just to inform the administrator - although it
certainly has such a role.  It is mainly there to _deter_ attackers from
experimenting with certain exploits.

For the more sophisticated attackers (not the script kiddies - the ones
who can do serious economic harm) their exploits and their attack
vectors are their main assets.  They want their exploits to work on the
next target as well, and they want to be as stealthy as possible.  For a
script kiddie crashing a box is not a big issue - they work with public
exploits.

This means that the serious attempts will only use an attack if its
effects are 100% deterministic - they won't risk something like a 50%/50%
chance of a crash (or even a 10% chance of a crash).  Some of the most
sophisticated kernel exploits i've seen had like 80% of their code
complexity in making sure that they don't crash the target box.  They
were more resilient code than a lot of kernel code we have.

> [...] We aren't perfect but we have a pretty good track record of
> handling this with old fashioned methods.

That's your opinion.  A valid counter point is that more layers of
defense, in a fundamentally fragile area (buffers on the stack, return
addresses), cannot hurt.

If you've got a firewall that is only 10% busy even under peak load,
it's a valid option to spend some CPU cycles on preventive measures.  A
firewall _itself_ is huge overhead already - so there's absolutely no
valid technical reason to forbid a firewall from having something like
stackprotector built into its kernel.  (and into most of its userspace)

We could have caught the vmsplice exploit as well with stackprotector -
but our security QA was not strong enough to keep it from slowly
regressing.  (without anyone noticing)  That's fixed now in
tip/core/stackprotector.

	Ingo
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 20:00           ` Eric W. Biederman
                              ` (2 preceding siblings ...)
  2008-07-09 20:14             ` Arjan van de Ven
@ 2008-07-09 21:39             ` Mike Travis
  2008-07-09 21:47               ` Jeremy Fitzhardinge
  2008-07-09 21:55               ` Eric W. Biederman
  3 siblings, 2 replies; 190+ messages in thread
From: Mike Travis @ 2008-07-09 21:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton, H. Peter Anvin,
	Christoph Lameter, Jack Steiner, linux-kernel

Eric W. Biederman wrote:
> I just took a quick look at how stack_protector works on x86_64.
> Unless there is some deep kernel magic that changes the segment
> register to %gs from the ABI specified %fs, CC_STACKPROTECTOR is
> totally broken on x86_64.  We access our pda through %gs.
>
> Further -fstack-protector-all only seems to protect against buffer
> overflows and thus corruption of the stack.  Not stack overflows.  So
> it doesn't appear especially useful.
>
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR.  Stop trying
> to figure out how to use a zero based percpu area.
>
> That should allow us to make the current pda a per cpu variable, and
> use %gs with a large offset to access the per cpu area.  And since it
> is only the per cpu accesses and the pda accesses that will change we
> should not need to fight toolchain issues and other weirdness.  The
> linked binary can remain the same.
>
> Eric

Hi Eric,

There is one pda op that I was not able to remove.  Most likely it can
be recoded, but it was a bit over my expertise.  Most likely the
"pda_offset(field)" can be replaced with "per_cpu_var(field)"
[per_cpu__##field], but I wasn't sure what to do about
"_proxy_pda.field".

include/asm-x86/pda.h:

/*
 * This is not atomic against other CPUs -- CPU preemption needs to be off
 * NOTE: This relies on the fact that the cpu_pda is the *first* field in
 * the per cpu area.  Move it and you'll need to change this.
 */
#define test_and_clear_bit_pda(bit, field)				\
({									\
	int old__;							\
	asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0"			\
		     : "=r" (old__), "+m" (_proxy_pda.field)		\
		     : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
	old__;								\
})

And there is only one reference to it.

arch/x86/kernel/process_64.c:

static void __exit_idle(void)
{
	if (test_and_clear_bit_pda(0, isidle) == 0)
		return;
	atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
}

Thanks,
Mike
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 21:39             ` Mike Travis
@ 2008-07-09 21:47               ` Jeremy Fitzhardinge
  2008-07-09 21:55               ` Eric W. Biederman
  1 sibling, 0 replies; 190+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-09 21:47 UTC (permalink / raw)
  To: Mike Travis
  Cc: Eric W. Biederman, Ingo Molnar, Andrew Morton, H. Peter Anvin,
	Christoph Lameter, Jack Steiner, linux-kernel

Mike Travis wrote:
> Eric W. Biederman wrote:
>
>> I just took a quick look at how stack_protector works on x86_64.
>> Unless there is some deep kernel magic that changes the segment
>> register to %gs from the ABI specified %fs, CC_STACKPROTECTOR is
>> totally broken on x86_64.  We access our pda through %gs.
>>
>> Further -fstack-protector-all only seems to protect against buffer
>> overflows and thus corruption of the stack.  Not stack overflows.  So
>> it doesn't appear especially useful.
>>
>> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR.  Stop trying
>> to figure out how to use a zero based percpu area.
>>
>> That should allow us to make the current pda a per cpu variable, and
>> use %gs with a large offset to access the per cpu area.  And since it
>> is only the per cpu accesses and the pda accesses that will change we
>> should not need to fight toolchain issues and other weirdness.  The
>> linked binary can remain the same.
>>
>> Eric
>
> There is one pda op that I was not able to remove.  Most likely it can
> be recoded, but it was a bit over my expertise.  Most likely the
> "pda_offset(field)" can be replaced with "per_cpu_var(field)"
> [per_cpu__##field], but I wasn't sure what to do about
> "_proxy_pda.field".
>
> /*
>  * This is not atomic against other CPUs -- CPU preemption needs to be off
>  * NOTE: This relies on the fact that the cpu_pda is the *first* field in
>  * the per cpu area.  Move it and you'll need to change this.
>  */
> #define test_and_clear_bit_pda(bit, field)				\
> ({									\
>	int old__;							\
>	asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0"			\
>		     : "=r" (old__), "+m" (_proxy_pda.field)		\
>		     : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
>	old__;								\
> })

With a zero-based percpu area the asm could become something like:

	asm volatile("btr %2,%%gs:%1\n\tsbbl %0,%0"		\
		     : "=r" (old__), "+m" (per_cpu_var(var))	\
		     : "dIr" (bit) : "memory");			\

but it barely seems worthwhile if we really can't use
test_and_clear_bit.

    J
* Re: [RFC 00/15] x86_64: Optimize percpu accesses
  2008-07-09 21:39             ` Mike Travis
  2008-07-09 21:47               ` Jeremy Fitzhardinge
@ 2008-07-09 21:55               ` Eric W. Biederman
  1 sibling, 0 replies; 190+ messages in thread
From: Eric W. Biederman @ 2008-07-09 21:55 UTC (permalink / raw)
  To: Mike Travis
  Cc: Eric W. Biederman, Jeremy Fitzhardinge, Ingo Molnar, Andrew Morton,
	H. Peter Anvin, Christoph Lameter, Jack Steiner, linux-kernel

Mike Travis <travis@sgi.com> writes:

> Hi Eric,
>
> There is one pda op that I was not able to remove.  Most likely it can
> be recoded, but it was a bit over my expertise.  Most likely the
> "pda_offset(field)" can be replaced with "per_cpu_var(field)"
> [per_cpu__##field], but I wasn't sure what to do about
> "_proxy_pda.field".

If you notice, we never use %1.  My reading would be that we just have
the "+m" there to tell the compiler we may be changing the field.  So a
direct reference to the per_cpu_var should be sufficient, although the
"memory" clobber may actually be enough on its own.

Eric