* [RFC][PATCH] vsyscall-gtod_B3 (0/3)
From: john stultz @ 2004-03-04 0:11 UTC
To: lkml
Cc: Andrea Arcangeli, Andi Kleen, Ulrich Drepper, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

All,
This is my port of the x86-64 vsyscall gettimeofday code to i386. This patch moves gettimeofday into userspace, so it can be called without the syscall overhead, greatly improving performance. This is important for any application, like a database, which heavily uses gettimeofday for time-stamping. It supports both the TSC and the IBM x44X cyclone time sources.

Example performance gain (using cyclone timesource):

int80 gettimeofday:
        gettimeofday ( 1665576us / 1000000runs ) = 1.665574us
sysenter gettimeofday:
        gettimeofday ( 1239215us / 1000000runs ) = 1.239214us
vsyscall gettimeofday:
        gettimeofday ( 875876us / 1000000runs ) = 0.875875us

I've broken the patch into three logical chunks for clarity and to make it easier to cherry-pick the desired bits.

o Part 1: Renames variables in timer_cyclone.c and timer_tsc.c to avoid conflicts in the global namespace.

o Part 2: Core vsyscall-gtod implementation.

o Part 3: vDSO hooks to avoid LD_PRELOADing or needing changes to glibc.

Please let me know if you have any comments or suggestions.

thanks
-john

Existing issues:
----------------
o Bad pointers cause segfaults, rather than -EFAULT.

Release History:
----------------
B2 -> B3:
o Broke the patch up into 3 chunks
o Added vsyscall-int80.S hooks (4G disables SEP)

B1 -> B2:
o No LD_PRELOADing or changes to userspace required!
o Removed hard-coded address in linker script

B0 -> B1:
o Cleaned up 4/4 split code, so no additional patch is needed.
o Fixed permissions on fixmapped cyclone page
o Improved alternate_instruction workaround
o Use NTP variables to avoid related time inconsistencies
o Minor code cleanups
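The figures above are average microseconds per call over 1,000,000 calls. The benchmark that produced them is not included in these messages, so what follows is only a hypothetical sketch of a comparable measurement on a Linux/glibc system. One loop calls libc's gettimeofday(), which on current systems is typically served from the vDSO without entering the kernel. The other forces a kernel entry through syscall(SYS_gettimeofday, ...). The function names are illustrative, and loop overhead is not subtracted.

```c
#define _GNU_SOURCE		/* for syscall() */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

/* Microseconds between two timevals. */
static long tv_diff_us(const struct timeval *a, const struct timeval *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_usec - a->tv_usec);
}

/* Average microseconds per call through libc's gettimeofday(), which may
 * use a user-space (vDSO) fast path depending on the system. */
double gtod_cost_libc_us(long runs)
{
	struct timeval start, end, tmp;
	long i;

	gettimeofday(&start, NULL);
	for (i = 0; i < runs; i++)
		gettimeofday(&tmp, NULL);
	gettimeofday(&end, NULL);
	return (double)tv_diff_us(&start, &end) / runs;
}

/* Same loop, but every call enters the kernel via syscall(2). */
double gtod_cost_syscall_us(long runs)
{
	struct timeval start, end, tmp;
	long i;

	gettimeofday(&start, NULL);
	for (i = 0; i < runs; i++)
		syscall(SYS_gettimeofday, &tmp, NULL);
	gettimeofday(&end, NULL);
	return (double)tv_diff_us(&start, &end) / runs;
}
```

Calling both with runs = 1000000 gives numbers in the same per-call form as the int80/sysenter/vsyscall lines above. Absolute values depend entirely on the hardware and kernel.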
* [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part1 (1/3)
From: john stultz @ 2004-03-04 0:12 UTC
To: lkml
Cc: Andrea Arcangeli, Andi Kleen, Ulrich Drepper, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

All,
This patch renames variables in timer_cyclone.c and timer_tsc.c to avoid conflicts in the global namespace. This allows the renamed variables to be used in the vsyscall-gtod implementation, and is a prerequisite for the vsyscall-gtod_B3-part2 patch.

thanks
-john

diff -Nru a/arch/i386/kernel/timers/timer_cyclone.c b/arch/i386/kernel/timers/timer_cyclone.c
--- a/arch/i386/kernel/timers/timer_cyclone.c	Mon Mar 1 13:15:32 2004
+++ b/arch/i386/kernel/timers/timer_cyclone.c	Mon Mar 1 13:15:32 2004
@@ -21,18 +21,17 @@
 extern spinlock_t i8253_lock;
 
 /* Number of usecs that the last interrupt was delayed */
-static int delay_at_last_interrupt;
+int cyclone_delay_at_last_interrupt;
 
 #define CYCLONE_CBAR_ADDR 0xFEB00CD0
 #define CYCLONE_PMCC_OFFSET 0x51A0
 #define CYCLONE_MPMC_OFFSET 0x51D0
 #define CYCLONE_MPCS_OFFSET 0x51A8
-#define CYCLONE_TIMER_FREQ 100000000
 #define CYCLONE_TIMER_MASK (((u64)1<<40)-1) /* 40 bit mask */
 int use_cyclone = 0;
 
-static u32* volatile cyclone_timer;	/* Cyclone MPMC0 register */
-static u32 last_cyclone_low;
+u32* volatile cyclone_timer;	/* Cyclone MPMC0 register */
+u32 last_cyclone_low;
 static u32 last_cyclone_high;
 static unsigned long long monotonic_base;
 static seqlock_t monotonic_lock = SEQLOCK_UNLOCKED;
@@ -57,7 +56,7 @@
 	spin_lock(&i8253_lock);
 	read_cyclone_counter(last_cyclone_low,last_cyclone_high);
 
-	/* read values for delay_at_last_interrupt */
+	/* read values for cyclone_delay_at_last_interrupt */
 	outb_p(0x00, 0x43);     /* latch the count ASAP */
 
 	count = inb_p(0x40);    /* read the latched count */
@@ -67,7 +66,7 @@
 	/* lost tick compensation */
 	delta = last_cyclone_low - delta;
 	delta /= (CYCLONE_TIMER_FREQ/1000000);
-	delta += delay_at_last_interrupt;
+	delta += cyclone_delay_at_last_interrupt;
 	lost = delta/(1000000/HZ);
 	delay = delta%(1000000/HZ);
 	if (lost >= 2)
@@ -78,16 +77,16 @@
 	monotonic_base += (this_offset - last_offset) & CYCLONE_TIMER_MASK;
 	write_sequnlock(&monotonic_lock);
 
-	/* calculate delay_at_last_interrupt */
+	/* calculate cyclone_delay_at_last_interrupt */
 	count = ((LATCH-1) - count) * TICK_SIZE;
-	delay_at_last_interrupt = (count + LATCH/2) / LATCH;
+	cyclone_delay_at_last_interrupt = (count + LATCH/2) / LATCH;
 
 
 	/* catch corner case where tick rollover occured
 	 * between cyclone and pit reads (as noted when
 	 * usec delta is > 90% # of usecs/tick)
 	 */
-	if (lost && abs(delay - delay_at_last_interrupt) > (900000/HZ))
+	if (lost && abs(delay - cyclone_delay_at_last_interrupt) > (900000/HZ))
 		jiffies_64++;
 }
 
@@ -96,7 +95,7 @@
 	u32 offset;
 
 	if(!cyclone_timer)
-		return delay_at_last_interrupt;
+		return cyclone_delay_at_last_interrupt;
 
 	/* Read the cyclone timer */
 	offset = cyclone_timer[0];
@@ -109,7 +108,7 @@
 	offset = offset/(CYCLONE_TIMER_FREQ/1000000);
 
 	/* our adjusted time offset in microseconds */
-	return delay_at_last_interrupt + offset;
+	return cyclone_delay_at_last_interrupt + offset;
 }
 
 static unsigned long long monotonic_clock_cyclone(void)
diff -Nru a/arch/i386/kernel/timers/timer_tsc.c b/arch/i386/kernel/timers/timer_tsc.c
--- a/arch/i386/kernel/timers/timer_tsc.c	Mon Mar 1 13:15:32 2004
+++ b/arch/i386/kernel/timers/timer_tsc.c	Mon Mar 1 13:15:32 2004
@@ -33,7 +33,7 @@
 
 static int use_tsc;
 /* Number of usecs that the last interrupt was delayed */
-static int delay_at_last_interrupt;
+int tsc_delay_at_last_interrupt;
 
 static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */
 static unsigned long last_tsc_high; /* msb 32 bits of Time Stamp Counter */
@@ -104,7 +104,7 @@
 		"0" (eax));
 
 	/* our adjusted time offset in microseconds */
-	return delay_at_last_interrupt + edx;
+	return tsc_delay_at_last_interrupt + edx;
 }
 
 static unsigned long long monotonic_clock_tsc(void)
@@ -223,7 +223,7 @@
 			"0" (eax));
 		delta = edx;
 	}
-	delta += delay_at_last_interrupt;
+	delta += tsc_delay_at_last_interrupt;
 	lost = delta/(1000000/HZ);
 	delay = delta%(1000000/HZ);
 	if (lost >= 2) {
@@ -248,15 +248,15 @@
 	monotonic_base += cycles_2_ns(this_offset - last_offset);
 	write_sequnlock(&monotonic_lock);
 
-	/* calculate delay_at_last_interrupt */
+	/* calculate tsc_delay_at_last_interrupt */
 	count = ((LATCH-1) - count) * TICK_SIZE;
-	delay_at_last_interrupt = (count + LATCH/2) / LATCH;
+	tsc_delay_at_last_interrupt = (count + LATCH/2) / LATCH;
 
 	/* catch corner case where tick rollover occured
 	 * between tsc and pit reads (as noted when
 	 * usec delta is > 90% # of usecs/tick)
 	 */
-	if (lost && abs(delay - delay_at_last_interrupt) > (900000/HZ))
+	if (lost && abs(delay - tsc_delay_at_last_interrupt) > (900000/HZ))
 		jiffies_64++;
 }
 
@@ -308,7 +308,7 @@
 	monotonic_base += cycles_2_ns(this_offset - last_offset);
 	write_sequnlock(&monotonic_lock);
 
-	/* calculate delay_at_last_interrupt */
+	/* calculate tsc_delay_at_last_interrupt */
 	/*
 	 * Time offset = (hpet delta) * ( usecs per HPET clock )
 	 *             = (hpet delta) * ( usecs per tick / HPET clocks per tick)
@@ -316,9 +316,9 @@
 	 * Where,
 	 * hpet_usec_quotient = (2^32 * usecs per tick)/HPET clocks per tick
 	 */
-	delay_at_last_interrupt = hpet_current - offset;
-	ASM_MUL64_REG(temp, delay_at_last_interrupt,
-			hpet_usec_quotient, delay_at_last_interrupt);
+	tsc_delay_at_last_interrupt = hpet_current - offset;
+	ASM_MUL64_REG(temp, tsc_delay_at_last_interrupt,
+			hpet_usec_quotient, tsc_delay_at_last_interrupt);
 }
 #endif
 
diff -Nru a/include/asm-i386/timer.h b/include/asm-i386/timer.h
--- a/include/asm-i386/timer.h	Mon Mar 1 13:15:32 2004
+++ b/include/asm-i386/timer.h	Mon Mar 1 13:15:32 2004
@@ -20,6 +20,7 @@
 };
 
 #define TICK_SIZE (tick_nsec / 1000)
+#define CYCLONE_TIMER_FREQ 100000000
 
 extern struct timer_opts* select_timer(void);
 extern void clock_fallback(void);
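Part 1 only renames variables, but its context lines show the cyclone arithmetic that the vsyscall code reuses. The counter runs at CYCLONE_TIMER_FREQ (100 MHz), so dividing a tick delta by CYCLONE_TIMER_FREQ/1000000 yields microseconds, and the lost-tick code splits that delta by the tick length 1000000/HZ. The following is a minimal user-space sketch of that arithmetic, assuming HZ=1000. EXAMPLE_HZ and the function names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CYCLONE_TIMER_FREQ 100000000	/* 100 MHz, as moved into timer.h */
#define EXAMPLE_HZ 1000			/* assumption: HZ=1000 */

/* Microseconds between two samples of the low 32 bits of the cyclone
 * counter. Unsigned subtraction tolerates a single 32-bit wrap. */
uint32_t cyclone_delta_us(uint32_t now_low, uint32_t last_low)
{
	uint32_t ticks = now_low - last_low;

	return ticks / (CYCLONE_TIMER_FREQ / 1000000);	/* 100 ticks per us */
}

/* Whole ticks contained in a delta: "lost = delta/(1000000/HZ)". */
uint32_t lost_ticks(uint32_t delta_us)
{
	return delta_us / (1000000 / EXAMPLE_HZ);
}

/* Remainder within the current tick: "delay = delta%(1000000/HZ)". */
uint32_t tick_remainder_us(uint32_t delta_us)
{
	return delta_us % (1000000 / EXAMPLE_HZ);
}
```

For example, samples 0xFFFFFF00 and then 0x00000100 are 512 counter ticks apart across the wrap, which truncates to 5 microseconds.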
* [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part2 (2/3)
From: john stultz @ 2004-03-04 0:13 UTC
To: lkml
Cc: Andrea Arcangeli, Andi Kleen, Ulrich Drepper, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

All,
This is the core vsyscall-gtod implementation. It depends on the vsyscall-gtod_B3-part1 patch.

thanks
-john

diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
--- a/arch/i386/Kconfig	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/Kconfig	Wed Mar 3 15:37:34 2004
@@ -435,6 +435,10 @@
 config HPET_EMULATE_RTC
 	def_bool HPET_TIMER && RTC=y
 
+config VSYSCALL_GTOD
+	depends on EXPERIMENTAL
+	bool "VSYSCALL gettimeofday() interface"
+
 config SMP
 	bool "Symmetric multi-processing support"
 	---help---
diff -Nru a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/Makefile	Wed Mar 3 15:37:34 2004
@@ -32,6 +32,7 @@
 obj-$(CONFIG_HPET_TIMER) 	+= time_hpet.o
 obj-$(CONFIG_EFI) 		+= efi.o efi_stub.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
+obj-$(CONFIG_VSYSCALL_GTOD)	+= vsyscall-gtod.o
 
 EXTRA_AFLAGS   := -traditional
 
diff -Nru a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
--- a/arch/i386/kernel/setup.c	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/setup.c	Wed Mar 3 15:37:34 2004
@@ -47,6 +47,7 @@
 #include <asm/sections.h>
 #include <asm/io_apic.h>
 #include <asm/ist.h>
+#include <asm/vsyscall-gtod.h>
 #include "setup_arch_pre.h"
 #include "mach_resources.h"
 
@@ -1159,6 +1160,7 @@
 	conswitchp = &dummy_con;
 #endif
 #endif
+	vsyscall_init();
 }
 
 #include "setup_arch_post.h"
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/time.c	Wed Mar 3 15:37:34 2004
@@ -393,5 +393,8 @@
 	cur_timer = select_timer();
 	printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name);
 
+	/* set vsyscall to use selected time source */
+	vsyscall_set_timesource(cur_timer->name);
+
 	time_init_hook();
 }
diff -Nru a/arch/i386/kernel/timers/timer.c b/arch/i386/kernel/timers/timer.c
--- a/arch/i386/kernel/timers/timer.c	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/timers/timer.c	Wed Mar 3 15:37:34 2004
@@ -2,6 +2,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <asm/timer.h>
+#include <asm/vsyscall-gtod.h>
 
 #ifdef CONFIG_HPET_TIMER
 /*
@@ -44,6 +45,9 @@
 void clock_fallback(void)
 {
 	cur_timer = &timer_pit;
+
+	/* set vsyscall to use selected time source */
+	vsyscall_set_timesource(cur_timer->name);
 }
 
 /* iterates through the list of timers, returning the first
diff -Nru a/arch/i386/kernel/timers/timer_cyclone.c b/arch/i386/kernel/timers/timer_cyclone.c
--- a/arch/i386/kernel/timers/timer_cyclone.c	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/timers/timer_cyclone.c	Wed Mar 3 15:37:34 2004
@@ -23,6 +23,13 @@
 /* Number of usecs that the last interrupt was delayed */
 int cyclone_delay_at_last_interrupt;
 
+/* FIXMAP flag */
+#ifdef CONFIG_VSYSCALL_GTOD
+#define PAGE_CYCLONE PAGE_KERNEL_VSYSCALL_NOCACHE
+#else
+#define PAGE_CYCLONE PAGE_KERNEL_NOCACHE
+#endif
+
 #define CYCLONE_CBAR_ADDR 0xFEB00CD0
 #define CYCLONE_PMCC_OFFSET 0x51A0
 #define CYCLONE_MPMC_OFFSET 0x51D0
@@ -192,7 +199,7 @@
 	/* map in cyclone_timer */
 	pageaddr = (base + CYCLONE_MPMC_OFFSET)&PAGE_MASK;
 	offset = (base + CYCLONE_MPMC_OFFSET)&(~PAGE_MASK);
-	set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+	__set_fixmap(FIX_CYCLONE_TIMER, pageaddr, PAGE_CYCLONE);
 	cyclone_timer = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
 	if(!cyclone_timer){
 		printk(KERN_ERR "Summit chipset: Could not find valid MPMC register.\n");
diff -Nru a/arch/i386/kernel/vmlinux.lds.S b/arch/i386/kernel/vmlinux.lds.S
--- a/arch/i386/kernel/vmlinux.lds.S	Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/vmlinux.lds.S	Wed Mar 3 15:37:34 2004
@@ -3,11 +3,12 @@
  */
 
 #include <asm-generic/vmlinux.lds.h>
-
+#include <linux/config.h>
+#include <asm/vsyscall-gtod.h>
+
 OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
 OUTPUT_ARCH(i386)
 ENTRY(startup_32)
-jiffies = jiffies_64;
 SECTIONS
 {
   . = 0xC0000000 + 0x100000;
@@ -47,6 +48,79 @@
   .data.cacheline_aligned : { *(.data.cacheline_aligned) }
 
   _edata = .;			/* End of data section */
+
+/* VSYSCALL_GTOD data */
+#ifdef CONFIG_VSYSCALL_GTOD
+
+  /* vsyscall entry */
+  . = ALIGN(64);
+  .data.cacheline_aligned : { *(.data.cacheline_aligned) }
+
+  .vsyscall_0 VSYSCALL_GTOD_START: AT ((LOADADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095)) { *(.vsyscall_0) }
+  __vsyscall_0 = LOADADDR(.vsyscall_0);
+
+
+  /* generic gtod variables */
+  . = ALIGN(64);
+  .vsyscall_timesource : AT ((LOADADDR(.vsyscall_0) + SIZEOF(.vsyscall_0) + 63) & ~(63)) { *(.vsyscall_timesource) }
+  vsyscall_timesource = LOADADDR(.vsyscall_timesource);
+
+  . = ALIGN(16);
+  .xtime_lock : AT ((LOADADDR(.vsyscall_timesource) + SIZEOF(.vsyscall_timesource) + 15) & ~(15)) { *(.xtime_lock) }
+  xtime_lock = LOADADDR(.xtime_lock);
+
+  . = ALIGN(16);
+  .xtime : AT ((LOADADDR(.xtime_lock) + SIZEOF(.xtime_lock) + 15) & ~(15)) { *(.xtime) }
+  xtime = LOADADDR(.xtime);
+
+  . = ALIGN(16);
+  .jiffies : AT ((LOADADDR(.xtime) + SIZEOF(.xtime) + 15) & ~(15)) { *(.jiffies) }
+  jiffies = LOADADDR(.jiffies);
+
+  . = ALIGN(16);
+  .wall_jiffies : AT ((LOADADDR(.jiffies) + SIZEOF(.jiffies) + 15) & ~(15)) { *(.wall_jiffies) }
+  wall_jiffies = LOADADDR(.wall_jiffies);
+
+  .sys_tz : AT (LOADADDR(.wall_jiffies) + SIZEOF(.wall_jiffies)) { *(.sys_tz) }
+  sys_tz = LOADADDR(.sys_tz);
+
+  /* NTP variables */
+  .tickadj : AT (LOADADDR(.sys_tz) + SIZEOF(.sys_tz)) { *(.tickadj) }
+  tickadj = LOADADDR(.tickadj);
+
+  .time_adjust : AT (LOADADDR(.tickadj) + SIZEOF(.tickadj)) { *(.time_adjust) }
+  time_adjust = LOADADDR(.time_adjust);
+
+  /* TSC variables*/
+  .last_tsc_low : AT (LOADADDR(.time_adjust) + SIZEOF(.time_adjust)) { *(.last_tsc_low) }
+  last_tsc_low = LOADADDR(.last_tsc_low);
+
+  .tsc_delay_at_last_interrupt : AT (LOADADDR(.last_tsc_low) + SIZEOF(.last_tsc_low)) { *(.tsc_delay_at_last_interrupt) }
+  tsc_delay_at_last_interrupt = LOADADDR(.tsc_delay_at_last_interrupt);
+
+  .fast_gettimeoffset_quotient : AT (LOADADDR(.tsc_delay_at_last_interrupt) + SIZEOF(.tsc_delay_at_last_interrupt)) { *(.fast_gettimeoffset_quotient) }
+  fast_gettimeoffset_quotient = LOADADDR(.fast_gettimeoffset_quotient);
+
+
+  /*cyclone values*/
+  .cyclone_timer : AT (LOADADDR(.fast_gettimeoffset_quotient) + SIZEOF(.fast_gettimeoffset_quotient)) { *(.cyclone_timer) }
+  cyclone_timer = LOADADDR(.cyclone_timer);
+
+  .last_cyclone_low : AT (LOADADDR(.cyclone_timer) + SIZEOF(.cyclone_timer)) { *(.last_cyclone_low) }
+  last_cyclone_low = LOADADDR(.last_cyclone_low);
+
+  .cyclone_delay_at_last_interrupt : AT (LOADADDR(.last_cyclone_low) + SIZEOF(.last_cyclone_low)) { *(.cyclone_delay_at_last_interrupt) }
+  cyclone_delay_at_last_interrupt = LOADADDR(.cyclone_delay_at_last_interrupt);
+
+
+  .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT (LOADADDR(.vsyscall_0) + 1024) { *(.vsyscall_1) }
+  . = LOADADDR(.vsyscall_0) + 4096;
+
+  jiffies_64 = jiffies;
+#else
+  jiffies = jiffies_64;
+#endif
+/* END of VSYSCALL_GTOD data*/
 
   . = ALIGN(8192);		/* init_task */
   .data.init_task : { *(.data.init_task) }
diff -Nru a/arch/i386/kernel/vsyscall-gtod.c b/arch/i386/kernel/vsyscall-gtod.c
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/arch/i386/kernel/vsyscall-gtod.c	Wed Mar 3 15:37:34 2004
@@ -0,0 +1,275 @@
+/*
+ *  linux/arch/i386/kernel/vsyscall-gtod.c
+ *
+ *  Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
+ *  Copyright (C) 2003,2004 John Stultz <johnstul@us.ibm.com> IBM
+ *
+ *  Thanks to hpa@transmeta.com for some useful hint.
+ *  Special thanks to Ingo Molnar for his early experience with
+ *  a different vsyscall implementation for Linux/IA32 and for the name.
+ *
+ *  vsyscall 0 is located at VSYSCALL_START, vsyscall 1 is located
+ *  at virtual address VSYSCALL_START+1024bytes etc...
+ *
+ *  Originally written for x86-64 by Andrea Arcangeli <andrea@suse.de>
+ *  Ported to i386 by John Stultz <johnstul@us.ibm.com>
+ */
+
+
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+
+#include <asm/vsyscall-gtod.h>
+#include <asm/pgtable.h>
+#include <asm/page.h>
+#include <asm/fixmap.h>
+#include <asm/msr.h>
+#include <asm/timer.h>
+#include <asm/system.h>
+#include <asm/unistd.h>
+#include <asm/errno.h>
+
+int errno;
+static inline _syscall2(int,gettimeofday,struct timeval *,tv,struct timezone *,tz);
+static int vsyscall_mapped = 0; /* flag variable for remap_vsyscall() */
+
+enum vsyscall_timesource_e vsyscall_timesource;
+enum vsyscall_timesource_e __vsyscall_timesource __section_vsyscall_timesource;
+
+/* readonly clones of generic time values */
+seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
+struct timespec __xtime __section_xtime;
+volatile unsigned long __jiffies __section_jiffies;
+unsigned long __wall_jiffies __section_wall_jiffies;
+struct timezone __sys_tz __section_sys_tz;
+/* readonly clones of ntp time variables */
+int __tickadj __section_tickadj;
+long __time_adjust __section_time_adjust;
+
+/* readonly clones of TSC timesource values*/
+unsigned long __last_tsc_low __section_last_tsc_low;
+int __tsc_delay_at_last_interrupt __section_tsc_delay_at_last_interrupt;
+unsigned long __fast_gettimeoffset_quotient __section_fast_gettimeoffset_quotient;
+
+/* readonly clones of cyclone timesource values*/
+u32* __cyclone_timer __section_cyclone_timer;	/* Cyclone MPMC0 register */
+u32 __last_cyclone_low __section_last_cyclone_low;
+int __cyclone_delay_at_last_interrupt __section_cyclone_delay_at_last_interrupt;
+
+
+static inline unsigned long vgettimeoffset_tsc(void)
+{
+	unsigned long eax, edx;
+
+	/* Read the Time Stamp Counter */
+	rdtsc(eax,edx);
+
+	/* .. relative to previous jiffy (32 bits is enough) */
+	eax -= __last_tsc_low;	/* tsc_low delta */
+
+	/*
+	 * Time offset = (tsc_low delta) * fast_gettimeoffset_quotient
+	 *             = (tsc_low delta) * (usecs_per_clock)
+	 *             = (tsc_low delta) * (usecs_per_jiffy / clocks_per_jiffy)
+	 *
+	 * Using a mull instead of a divl saves up to 31 clock cycles
+	 * in the critical path.
+	 */
+
+
+	__asm__("mull %2"
+		:"=a" (eax), "=d" (edx)
+		:"rm" (__fast_gettimeoffset_quotient),
+		 "0" (eax));
+
+	/* our adjusted time offset in microseconds */
+	return __tsc_delay_at_last_interrupt + edx;
+
+}
+
+static inline unsigned long vgettimeoffset_cyclone(void)
+{
+	u32 offset;
+
+	if (!__cyclone_timer)
+		return 0;
+
+	/* Read the cyclone timer */
+	offset = __cyclone_timer[0];
+
+	/* .. relative to previous jiffy */
+	offset = offset - __last_cyclone_low;
+
+	/* convert cyclone ticks to microseconds */
+	offset = offset/(CYCLONE_TIMER_FREQ/1000000);
+
+	/* our adjusted time offset in microseconds */
+	return __cyclone_delay_at_last_interrupt + offset;
+}
+
+static inline void do_vgettimeofday(struct timeval * tv)
+{
+	long sequence;
+	unsigned long usec, sec;
+	unsigned long lost;
+	unsigned long max_ntp_tick;
+
+	/* If we don't have a valid vsyscall time source,
+	 * just call gettimeofday()
+	 */
+	if (__vsyscall_timesource == VSYSCALL_GTOD_NONE) {
+		gettimeofday(tv, NULL);
+		return;
+	}
+
+
+	do {
+		sequence = read_seqbegin(&__xtime_lock);
+
+		/* Get the high-res offset */
+		if (__vsyscall_timesource == VSYSCALL_GTOD_CYCLONE)
+			usec = vgettimeoffset_cyclone();
+		else
+			usec = vgettimeoffset_tsc();
+
+		lost = __jiffies - __wall_jiffies;
+
+		/*
+		 * If time_adjust is negative then NTP is slowing the clock
+		 * so make sure not to go into next possible interval.
+		 * Better to lose some accuracy than have time go backwards..
+		 */
+		if (unlikely(__time_adjust < 0)) {
+			max_ntp_tick = (USEC_PER_SEC / HZ) - __tickadj;
+			usec = min(usec, max_ntp_tick);
+
+			if (lost)
+				usec += lost * max_ntp_tick;
+		}
+		else if (unlikely(lost))
+			usec += lost * (USEC_PER_SEC / HZ);
+
+		sec = __xtime.tv_sec;
+		usec += (__xtime.tv_nsec / 1000);
+
+	} while (read_seqretry(&__xtime_lock, sequence));
+
+	tv->tv_sec = sec + usec / 1000000;
+	tv->tv_usec = usec % 1000000;
+}
+
+static inline void do_get_tz(struct timezone * tz)
+{
+	long sequence;
+
+	do {
+		sequence = read_seqbegin(&__xtime_lock);
+
+		*tz = __sys_tz;
+
+	} while (read_seqretry(&__xtime_lock, sequence));
+}
+
+static int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
+{
+	if (tv)
+		do_vgettimeofday(tv);
+	if (tz)
+		do_get_tz(tz);
+	return 0;
+}
+
+static time_t __vsyscall(1) vtime(time_t * t)
+{
+	struct timeval tv;
+	vgettimeofday(&tv,NULL);
+	if (t)
+		*t = tv.tv_sec;
+	return tv.tv_sec;
+}
+
+static long __vsyscall(2) venosys_0(void)
+{
+	return -ENOSYS;
+}
+
+static long __vsyscall(3) venosys_1(void)
+{
+	return -ENOSYS;
+}
+
+
+void vsyscall_set_timesource(char* name)
+{
+	if (!strncmp(name, "tsc", 3))
+		vsyscall_timesource = VSYSCALL_GTOD_TSC;
+	else if (!strncmp(name, "cyclone", 7))
+		vsyscall_timesource = VSYSCALL_GTOD_CYCLONE;
+	else
+		vsyscall_timesource = VSYSCALL_GTOD_NONE;
+}
+
+
+static void __init map_vsyscall(void)
+{
+	unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - PAGE_OFFSET;
+
+	/* Initially we map the VSYSCALL page w/ PAGE_KERNEL permissions to
+	 * keep the alternate_instruction code from bombing out when it
+	 * changes the seq_lock memory barriers in vgettimeofday()
+	 */
+	__set_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE, physaddr_page0, PAGE_KERNEL);
+}
+
+static int __init remap_vsyscall(void)
+{
+	unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - PAGE_OFFSET;
+
+	if (!vsyscall_mapped)
+		return 0;
+
+	/* Remap the VSYSCALL page w/ PAGE_KERNEL_VSYSCALL permissions
+	 * after the alternate_instruction code has run
+	 */
+	clear_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE);
+	__set_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+
+	return 0;
+}
+
+int __init vsyscall_init(void)
+{
+	printk("VSYSCALL: consistency checks...");
+	if ((unsigned long) &vgettimeofday != VSYSCALL_ADDR(__NR_vgettimeofday)) {
+		printk("vgettimeofday link addr broken\n");
+		printk("VSYSCALL: vsyscall_init failed!\n");
+		return -EFAULT;
+	}
+	if ((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime)) {
+		printk("vtime link addr broken\n");
+		printk("VSYSCALL: vsyscall_init failed!\n");
+		return -EFAULT;
+	}
+	if (VSYSCALL_ADDR(0) != __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE)) {
+		printk("fixmap first vsyscall 0x%lx should be 0x%x\n",
+			__fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE),
+			VSYSCALL_ADDR(0));
+		printk("VSYSCALL: vsyscall_init failed!\n");
+		return -EFAULT;
+	}
+
+
+	printk("passed...mapping...");
+	map_vsyscall();
+	printk("done.\n");
+	vsyscall_mapped = 1;
+	printk("VSYSCALL: fixmap virt addr: 0x%lx\n",
+		__fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE));
+
+	return 0;
+}
+
+__initcall(remap_vsyscall);
diff -Nru a/include/asm-i386/fixmap.h b/include/asm-i386/fixmap.h
--- a/include/asm-i386/fixmap.h	Wed Mar 3 15:37:34 2004
+++ b/include/asm-i386/fixmap.h	Wed Mar 3 15:37:34 2004
@@ -18,6 +18,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/vsyscall-gtod.h>
 #ifdef CONFIG_HIGHMEM
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -44,6 +45,17 @@
 enum fixed_addresses {
 	FIX_HOLE,
 	FIX_VSYSCALL,
+#ifdef CONFIG_VSYSCALL_GTOD
+#ifndef CONFIG_X86_4G
+	FIX_VSYSCALL_GTOD_PAD,
+#endif /* !CONFIG_X86_4G */
+	FIX_VSYSCALL_GTOD_LAST_PAGE,
+	FIX_VSYSCALL_GTOD_FIRST_PAGE = FIX_VSYSCALL_GTOD_LAST_PAGE
+					+ VSYSCALL_GTOD_NUMPAGES - 1,
+#ifdef CONFIG_X86_4G
+	FIX_VSYSCALL_GTOD_4GALIGN,
+#endif /* CONFIG_X86_4G */
+#endif /* CONFIG_VSYSCALL_GTOD */
 #ifdef CONFIG_X86_LOCAL_APIC
 	FIX_APIC_BASE,	/* local (CPU) APIC) -- required for SMP or not */
 #endif
diff -Nru a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h	Wed Mar 3 15:37:34 2004
+++ b/include/asm-i386/pgtable.h	Wed Mar 3 15:37:34 2004
@@ -137,11 +137,15 @@
 #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
 #define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD)
 #define __PAGE_KERNEL_LARGE		(__PAGE_KERNEL | _PAGE_PSE)
+#define __PAGE_KERNEL_VSYSCALL \
+	(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
 
 #define PAGE_KERNEL		__pgprot(__PAGE_KERNEL)
 #define PAGE_KERNEL_RO		__pgprot(__PAGE_KERNEL_RO)
 #define PAGE_KERNEL_NOCACHE	__pgprot(__PAGE_KERNEL_NOCACHE)
 #define PAGE_KERNEL_LARGE	__pgprot(__PAGE_KERNEL_LARGE)
+#define PAGE_KERNEL_VSYSCALL	__pgprot(__PAGE_KERNEL_VSYSCALL)
+#define PAGE_KERNEL_VSYSCALL_NOCACHE	__pgprot(__PAGE_KERNEL_VSYSCALL|(__PAGE_KERNEL_RO | _PAGE_PCD))
 
 /*
  * The i386 can't do page protection for execute, and considers that
diff -Nru a/include/asm-i386/vsyscall-gtod.h b/include/asm-i386/vsyscall-gtod.h
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/vsyscall-gtod.h	Wed Mar 3 15:37:34 2004
@@ -0,0 +1,68 @@
+#ifndef _ASM_i386_VSYSCALL_GTOD_H_
+#define _ASM_i386_VSYSCALL_GTOD_H_
+
+#ifdef CONFIG_VSYSCALL_GTOD
+
+/* VSYSCALL_GTOD_START must be the same as
+ * __fix_to_virt(FIX_VSYSCALL_GTOD FIRST_PAGE)
+ * and must also be same as addr in vmlinux.lds.S */
+#define VSYSCALL_GTOD_START 0xffffc000
+#define VSYSCALL_GTOD_SIZE 1024
+#define VSYSCALL_GTOD_END (VSYSCALL_GTOD_START + PAGE_SIZE)
+#define VSYSCALL_GTOD_NUMPAGES \
+	((VSYSCALL_GTOD_END-VSYSCALL_GTOD_START) >> PAGE_SHIFT)
+#define VSYSCALL_ADDR(vsyscall_nr) \
+	(VSYSCALL_GTOD_START+VSYSCALL_GTOD_SIZE*(vsyscall_nr))
+
+#ifdef __KERNEL__
+#ifndef __ASSEMBLY__
+#include <linux/seqlock.h>
+
+#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+
+/* ReadOnly generic time value attributes*/
+#define __section_vsyscall_timesource __attribute__ ((unused, __section__ (".vsyscall_timesource")))
+#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock")))
+#define __section_xtime __attribute__ ((unused, __section__ (".xtime")))
+#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies")))
+#define __section_wall_jiffies __attribute__ ((unused, __section__ (".wall_jiffies")))
+#define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz")))
+
+/* ReadOnly NTP variables */
+#define __section_tickadj __attribute__ ((unused, __section__ (".tickadj")))
+#define __section_time_adjust __attribute__ ((unused, __section__ (".time_adjust")))
+
+
+/* ReadOnly TSC time value attributes*/
+#define __section_last_tsc_low __attribute__ ((unused, __section__ (".last_tsc_low")))
+#define __section_tsc_delay_at_last_interrupt __attribute__ ((unused, __section__ (".tsc_delay_at_last_interrupt")))
+#define __section_fast_gettimeoffset_quotient __attribute__ ((unused, __section__ (".fast_gettimeoffset_quotient")))
+
+/* ReadOnly Cyclone time value attributes*/
+#define __section_cyclone_timer __attribute__ ((unused, __section__ (".cyclone_timer")))
+#define __section_last_cyclone_low __attribute__ ((unused, __section__ (".last_cyclone_low")))
+#define __section_cyclone_delay_at_last_interrupt __attribute__ ((unused, __section__ (".cyclone_delay_at_last_interrupt")))
+
+enum vsyscall_num {
+	__NR_vgettimeofday,
+	__NR_vtime,
+};
+
+enum vsyscall_timesource_e {
+	VSYSCALL_GTOD_NONE,
+	VSYSCALL_GTOD_TSC,
+	VSYSCALL_GTOD_CYCLONE,
+};
+
+int vsyscall_init(void);
+void vsyscall_set_timesource(char* name);
+
+extern char __vsyscall_0;
+#endif /* __ASSEMBLY__ */
+#endif /* __KERNEL__ */
+#else /* CONFIG_VSYSCALL_GTOD */
+#define vsyscall_init()
+#define vsyscall_set_timesource(x)
+#endif /* CONFIG_VSYSCALL_GTOD */
+#endif /* _ASM_i386_VSYSCALL_GTOD_H_ */
+
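The arithmetic in vgettimeoffset_tsc() and do_vgettimeofday() can be checked outside the kernel. The following is a minimal sketch, assuming HZ=1000. The helper names and struct tv_model are hypothetical, and the seqlock retry loop and the hardware reads (rdtsc, the cyclone MMIO register) are deliberately left out. The "mull" in vgettimeoffset_tsc() multiplies the TSC delta by fast_gettimeoffset_quotient, which is microseconds-per-cycle scaled by 2^32, and keeps the high 32 bits of the product (left in %edx). A 64-bit multiply and shift does the same thing portably.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_HZ 1000			/* assumption: HZ=1000 */
#define EXAMPLE_USEC_PER_SEC 1000000UL

/* High 32 bits of delta * quotient, i.e. what "mull" leaves in %edx. */
uint32_t tsc_delta_to_usec(uint32_t tsc_delta, uint32_t quotient)
{
	return (uint32_t)(((uint64_t)tsc_delta * quotient) >> 32);
}

/* Model of the "lost" jiffies handling in do_vgettimeofday(): while NTP
 * slews the clock backwards (time_adjust < 0), each tick is capped at
 * USEC_PER_SEC/HZ - tickadj so the result never runs ahead. */
unsigned long adjust_usec(unsigned long usec, unsigned long lost,
			  long time_adjust, int tickadj)
{
	unsigned long max_ntp_tick;

	if (time_adjust < 0) {
		max_ntp_tick = (EXAMPLE_USEC_PER_SEC / EXAMPLE_HZ) - tickadj;
		if (usec > max_ntp_tick)	/* same as min(usec, max_ntp_tick) */
			usec = max_ntp_tick;
		if (lost)
			usec += lost * max_ntp_tick;
	} else if (lost) {
		usec += lost * (EXAMPLE_USEC_PER_SEC / EXAMPLE_HZ);
	}
	return usec;
}

/* Final normalisation: carry whole seconds out of the usec count. */
struct tv_model {
	unsigned long sec;
	unsigned long usec;
};

struct tv_model normalize_tv(unsigned long sec, unsigned long usec)
{
	struct tv_model tv = { sec + usec / EXAMPLE_USEC_PER_SEC,
			       usec % EXAMPLE_USEC_PER_SEC };
	return tv;
}
```

The multiply truncates. With a quotient of 4294967 (2^32/1000, i.e. a 1 GHz TSC), a delta of 1,000,000 cycles yields 999 rather than 1000 microseconds.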
* [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: john stultz @ 2004-03-04 0:14 UTC
To: lkml
Cc: Andrea Arcangeli, Andi Kleen, Ulrich Drepper, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

All,
This patch implements the somewhat controversial vDSO hooks for vsyscall-gtod. This makes LD_PRELOADing or changes to glibc unnecessary, but requires that the system have a sysenter-enabled glibc to see any performance benefit. However, the LD_PRELOAD method will still work with this patch.

This patch depends on both vsyscall-gtod_B3-part1 and vsyscall-gtod_B3-part2.

thanks
-john

diff -Nru a/arch/i386/kernel/vsyscall-int80.S b/arch/i386/kernel/vsyscall-int80.S
--- a/arch/i386/kernel/vsyscall-int80.S	Wed Mar 3 15:40:05 2004
+++ b/arch/i386/kernel/vsyscall-int80.S	Wed Mar 3 15:40:05 2004
@@ -1,3 +1,6 @@
+#include <linux/config.h>
+#include <asm/unistd.h>
+#include <asm/vsyscall-gtod.h>
 /*
  * Code for the vsyscall page.  This version uses the old int $0x80 method.
  */
@@ -7,8 +10,26 @@
 	.type __kernel_vsyscall,@function
 __kernel_vsyscall:
 .LSTART_vsyscall:
+#ifdef CONFIG_VSYSCALL_GTOD
+	cmp $__NR_gettimeofday, %eax
+	je .Lvgettimeofday
+#endif /* CONFIG_VSYSCALL_GTOD */
 	int $0x80
 	ret
+
+#ifdef CONFIG_VSYSCALL_GTOD
+/* vsyscall-gettimeofday code */
+.Lvgettimeofday:
+	pushl %edx
+	pushl %ecx
+	pushl %ebx
+	call VSYSCALL_GTOD_START
+	popl %ebx
+	popl %ecx
+	popl %edx
+	ret
+#endif /* CONFIG_VSYSCALL_GTOD */
+
 .LEND_vsyscall:
 	.size __kernel_vsyscall,.-.LSTART_vsyscall
 	.previous
diff -Nru a/arch/i386/kernel/vsyscall-sysenter.S b/arch/i386/kernel/vsyscall-sysenter.S
--- a/arch/i386/kernel/vsyscall-sysenter.S	Wed Mar 3 15:40:05 2004
+++ b/arch/i386/kernel/vsyscall-sysenter.S	Wed Mar 3 15:40:05 2004
@@ -1,3 +1,6 @@
+#include <linux/config.h>
+#include <asm/unistd.h>
+#include <asm/vsyscall-gtod.h>
 /*
  * Code for the vsyscall page.  This version uses the sysenter instruction.
  */
@@ -7,6 +10,10 @@
 	.type __kernel_vsyscall,@function
 __kernel_vsyscall:
 .LSTART_vsyscall:
+#ifdef CONFIG_VSYSCALL_GTOD
+	cmp $__NR_gettimeofday, %eax
+	je .Lvgettimeofday
+#endif /* CONFIG_VSYSCALL_GTOD */
 	push %ecx
 .Lpush_ecx:
 	push %edx
@@ -31,6 +38,20 @@
 	pop %ecx
 .Lpop_ecx:
 	ret
+
+#ifdef CONFIG_VSYSCALL_GTOD
+/* vsyscall-gettimeofday code */
+.Lvgettimeofday:
+	pushl %edx
+	pushl %ecx
+	pushl %ebx
+	call VSYSCALL_GTOD_START
+	popl %ebx
+	popl %ecx
+	popl %edx
+	ret
+#endif /* CONFIG_VSYSCALL_GTOD */
+
 .LEND_vsyscall:
 	.size __kernel_vsyscall,.-.LSTART_vsyscall
 	.previous
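The new entry code compares %eax against __NR_gettimeofday before doing anything else. On a match, it pushes %edx, %ecx and %ebx, so that the register arguments of the int $0x80 convention become cdecl stack arguments (tv first). It then calls the vsyscall page at VSYSCALL_GTOD_START. What follows is only a C model of that dispatch. The fake_* functions and counters are hypothetical stand-ins for the real entry paths, and 78 is __NR_gettimeofday on i386. The model also makes visible the cost raised in the first reply: every other syscall number now pays one extra compare and branch.

```c
#include <assert.h>

#define EXAMPLE_NR_GETTIMEOFDAY 78	/* __NR_gettimeofday on i386 */

static int fast_path_calls;
static int slow_path_calls;

/* Stand-in for "call VSYSCALL_GTOD_START" (user-space gettimeofday). */
static long fake_vgettimeofday(void)
{
	fast_path_calls++;
	return 0;
}

/* Stand-in for "int $0x80" / sysenter: the normal kernel entry. */
static long fake_kernel_entry(long nr)
{
	slow_path_calls++;
	return nr;
}

/* Model of the patched __kernel_vsyscall: one compare selects the fast
 * path; all other syscalls fall through to the kernel entry as before. */
long kernel_vsyscall_model(long nr)
{
	if (nr == EXAMPLE_NR_GETTIMEOFDAY)
		return fake_vgettimeofday();
	return fake_kernel_entry(nr);
}

int fast_calls(void) { return fast_path_calls; }
int slow_calls(void) { return slow_path_calls; }
```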
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Andrea Arcangeli @ 2004-03-04 0:55 UTC
To: john stultz
Cc: lkml, Andi Kleen, Ulrich Drepper, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, Mar 03, 2004 at 04:14:08PM -0800, john stultz wrote:
> All,
> 	This patch implements the somewhat controversial vDSO hooks for
> vsyscall-gtod. This makes LD_PRELOADing or changes to glibc unnecessary,

The reason it's controversial is just that it's micro-slowing down all syscalls to speed up gettimeofday, when you can avoid this kernel change completely and implement it at zero cost like in x86-64. glibc should simply call into the vsyscall directly.

Why don't you simply provide a patch against glibc, instead of proposing a patch against the kernel? Of course this patch will depend on your vsyscall patch on the kernel side, and that's fine. Another ELF bitflag can be used to tell glibc to use vgettimeofday or whatever, just like it happens with the sysenter vsyscall.

This is just like the kernel patches people propose when they get vmalloc LDT allocation failures, because they run with the i686 glibc instead of the only possibly supported i586 configuration. It makes no sense to hide a glibc inefficiency in the kernel when you can fix it in glibc and avoid the 4k LDT allocation completely, since nobody will ever call into pthread_create. It's not that wasting 4k of zone-normal per task is a good thing and wasting 64k of vmalloc per task is a bad thing. They're both bad things; you can only see the latter one unless you're a kernel hacker, so people actually think the kmalloc LDT thing is a bugfix, while it's just a bad band-aid (I mean, it's a good thing at large, but not as the fix for the vmalloc LDT failures).

I bet if the LDT allocation was as visible in /proc as the manager thread was visible with `ps` in linuxthreads, the LDT allocation would have been deferred to pthread_create in the first place.

As a matter of fact, I spent a few hours trying to fix up glibc some months ago, but the flood of #ifdefs and the fact that linuxthreads is dead made me desist. I will try again with NPTL, since it seems they didn't fix it (at least last time I checked the code, the LDT waste was still there).
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Ulrich Drepper @ 2004-03-04 2:16 UTC
To: Andrea Arcangeli
Cc: john stultz, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

Andrea Arcangeli wrote:

> This is just like the kernel patches people proposes when they get
> vmalloc LDT allocation failure, because they run with the i686 glibc
> instead of the only possibly supported i586 configuration. It makes no
> sense to hide a glibc inefficiency

You apparently still haven't gotten any clue since your whining the last time around. Absolute addresses are a fatal mistake.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: john stultz @ 2004-03-04 2:43 UTC
To: Ulrich Drepper
Cc: Andrea Arcangeli, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, 2004-03-03 at 18:16, Ulrich Drepper wrote:
> Andrea Arcangeli wrote:
>
> > This is just like the kernel patches people proposes when they get
> > vmalloc LDT allocation failure, because they run with the i686 glibc
> > instead of the only possibly supported i586 configuration. It makes no
> > sense to hide a glibc inefficiency
>
> You apparently still haven't gotten any clue since your whining the last
> time around. Absolute addresses are a fatal mistake.

Before we start up this larger debate again, might there be some short-term solution for my patch that would satisfy both of you?

If I understand the earlier arguments correctly, and we're going to have dynamically relocated segments at some point, then I agree that absolute addresses are going to cause problems. However, if I'm not mistaken, this problem already exists with the vsyscall-sysenter code, correct?

What is the plan for avoiding the absolute address issue there? If I implemented the vsyscall-gettimeofday code in a similar manner (as Andrea suggested), could the planned solution for vsyscall-sysenter be applied here as well?

thanks
-john
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Andrea Arcangeli @ 2004-03-04 3:14 UTC
To: john stultz
Cc: Ulrich Drepper, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, Mar 03, 2004 at 06:43:18PM -0800, john stultz wrote:
> On Wed, 2004-03-03 at 18:16, Ulrich Drepper wrote:
> > Andrea Arcangeli wrote:
> >
> > > This is just like the kernel patches people proposes when they get
> > > vmalloc LDT allocation failure, because they run with the i686 glibc
> > > instead of the only possibly supported i586 configuration. It makes no
> > > sense to hide a glibc inefficiency
> >
> > You apparently still haven't gotten any clue since your whining the last
> > time around. Absolute addresses are a fatal mistake.
>
> Before we start up this larger debate again, might there be some short
> term solution for my patch that would satisfy both of you?

Short-term solutions like this would be OK for a production release. However, the mainline source that will fork off into 2.7 should have the best design IMHO, and the same goes for glibc.

> If I understand the earlier arguments, if we're going to have the
> dynamically relocated segments at some point, I agree that absolute
> addresses are going to have problems. However, if I'm not mistaken, this
> problem already exists w/ the vsyscall-sysenter code, correct?

This is exactly my point. Because of this, the fixed-address trouble mentioned by Ulrich makes little sense to me (especially in reply to the LDT part). In practice the sysenter instruction is already at a fixed address in every 2.6 kernel out there. Yes, we could change that number without breaking glibc, but the attacker couldn't care less.

> What is the plan for avoiding the absolute address issue there? If I
> implemented the vsyscall-gettimeofday code in a similar manner (as
> Andrea suggested), could the planned solution for vsyscall-sysenter be
> applied here as well?

I think yes. On second thought, though, my preferred way is not to pass another variable address to userspace. That was the first solution that came to mind, and I wrote it just to show there's no real "fixed address" trouble. Fixed _offsets_ (not virtual addresses) are perfectly fine w.r.t. security, so we can just assume vgettimeofday sits at a fixed offset after the vsysenter code. This should give the most efficient code possible while keeping the address as flexible as vsysenter's (vgettimeofday will move together with vsysenter). A second value in the ELF may be a cleaner way to pass the vgettimeofday address, though, and I don't mind either way.

thanks.
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Ulrich Drepper @ 2004-03-04 8:09 UTC
To: john stultz
Cc: lkml

john stultz wrote:

> Before we start up this larger debate again, might there be some short
> term solution for my patch that would satisfy both of you?

I suggest the following:

- define a symbol __kernel_gettimeofday_offset in the vdso's symbol table. This should be an absolute symbol containing the offset of the gettimeofday implementation from the beginning of the vdso (the address passed up in the auxiliary vector)

- glibc can then use the equivalent of dlsym("__kernel_gettimeofday_offset"). If the symbol is not defined, it's not used (doh). If it is defined, the final function address is computed by adding the offset to the vdso address.

This ensures a direct jump and it still keeps the vdso relocatable without modifying the symbol table.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
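Ulrich's lookup can be sketched as follows. The vdso_sym table is a hypothetical simplification of the vDSO's real dynamic symbol table. A real lookup would walk the image's ELF symbol and string tables instead. vdso_base stands for the address the kernel passes up in the auxiliary vector. The symbol holds an offset rather than an address, so the same table stays valid wherever the vDSO is mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one entry in the vDSO symbol table. */
struct vdso_sym {
    const char *name;
    uintptr_t   value;  /* absolute symbol: an offset from the vDSO start */
};

/* Returns the address to call for gettimeofday, or 0 if the vDSO does not
 * define __kernel_gettimeofday_offset (the caller then uses the syscall). */
static uintptr_t resolve_vgettimeofday(uintptr_t vdso_base,
                                       const struct vdso_sym *syms, size_t n)
{
    assert(syms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(syms[i].name, "__kernel_gettimeofday_offset") == 0)
            return vdso_base + syms[i].value;  /* aux-vector base + offset */
    }
    return 0;
}
```

A return value of 0 tells the caller to fall back to the ordinary syscall path, which matches "if the symbol is not defined, it's not used". The same offset of 0x600 resolves to 0xffffe600 for a vDSO at 0xffffe000 and to 0xb7f00600 for one at 0xb7f00000.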
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: john stultz @ 2004-03-04 19:02 UTC
To: Ulrich Drepper
Cc: lkml

On Thu, 2004-03-04 at 00:09, Ulrich Drepper wrote:
> john stultz wrote:
>
> > Before we start up this larger debate again, might there be some short
> > term solution for my patch that would satisfy both of you?
>
> I suggest the following:
>
> - define a symbol __kernel_gettimeofday_offset in the vdso's symbol
> table. This should be an absolute symbol containing the offset of the
> gettimeofday implementation from the beginning of the vdso (the address
> passed up in the auxiliary vector)
>
> - glibc can then use the equivalent of
> dlsym("__kernel_gettimeofday_offset"). If the symbol is not defined,
> it's not used (doh). If it is defined, the final function address is
> computed by adding the offset to the vdso address.
>
>
> This ensures a direct jump and it still keeps the vdso relocatable
> without modifying the symbol table.

Excellent, this sounds similar to what Andrea was suggesting. I'll start working to implement this.

thanks for the momentary truce ;)
-john
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Andrea Arcangeli @ 2004-03-04 2:47 UTC
To: Ulrich Drepper
Cc: john stultz, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, Mar 03, 2004 at 06:16:52PM -0800, Ulrich Drepper wrote:
> Andrea Arcangeli wrote:
>
> > This is just like the kernel patches people proposes when they get
> > vmalloc LDT allocation failure, because they run with the i686 glibc
> > instead of the only possibly supported i586 configuration. It makes no
> > sense to hide a glibc inefficiency
>
> You apparently still haven't gotten any clue since your whining the last
> time around. Absolute addresses are a fatal mistake.

The LDT issue above has nothing to do with any address at all. It's all about deferring the LDT allocation until pthread_create, just as linuxthreads defers the creation of the manager thread until the first pthread_create.

The vsyscall part is the only thing related to "fixed addresses", and there you can pass the address of vgettimeofday via ELF or in any other way. I'm not forcing you to set up a fixed address, and I never spoke about fixed addresses (in fact I specified the ELF bit). You can do the same as with vsysenter: if you're fine with the way vsysenter works, you must be fine using the same way for vgettimeofday too. My only point (and the only reason I'm against this patch) is that glibc should call into vgettimeofday without passing through vsysenter. That in turn means glibc should have _knowledge_ of the existence of vgettimeofday. Otherwise every other regular syscall invoked through vsysenter has to pay for it. We'll probably never have more than two vsyscalls on x86, so wasting a few nanoseconds on a conditional jump at every vsysenter may not be measurable, but it doesn't look like the right design.

And sysenter is at a fixed address in 2.6 x86 too (it doesn't even change between different kernel compiles).
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: john stultz @ 2004-03-04 2:54 UTC
To: Andrea Arcangeli
Cc: Ulrich Drepper, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> change between different kernel compiles).

Actually, the 4G patch pushes vsysenter down a page, and glibc seems to handle this properly.

-john
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Andrea Arcangeli @ 2004-03-04 3:15 UTC
To: john stultz
Cc: Ulrich Drepper, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, Mar 03, 2004 at 06:54:49PM -0800, john stultz wrote:
> On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> > And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> > change between different kernel compiles).
>
> Actually, the 4G patch pushes vsysenter down a page, and glibc seems to
> handle this properly.

This is nice for x86 indeed. It has never been a concern on x86-64, since there's no need to move the address space there.

So in short, this means 64G machines using 4:4 will need two tries to get to the root shell, while every other box will succeed at the first try ;).
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Jakub Jelinek @ 2004-03-04 8:57 UTC
To: john stultz
Cc: Andrea Arcangeli, Ulrich Drepper, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Wed, Mar 03, 2004 at 06:54:49PM -0800, john stultz wrote:
> On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> > And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> > change between different kernel compiles).
>
> Actually, the 4G patch pushes vsysenter down a page, and glibc seems to
> handle this properly.

But the 4G/4G patch relinks the vDSO to the address it uses. This is no big problem for glibc, which of course doesn't use a hardcoded address but reads the AT_SYSINFO{,_EHDR} values the kernel passes to it.

The fixed vDSO location is a problem, though. Exploits certainly appreciate a fixed address at which they can, with high probability, enter the kernel.

Ingo Molnar recently wrote a patch to randomize the vDSO address on IA-32. Unfortunately it revealed some bugs in glibc where ld.so did not properly handle vDSOs linked at one address but mmapped at a different one. That handling is a must if the kernel wants to share one vDSO page across all processes. So the problem now is that if the kernel randomizes the vDSO, it will not even boot with glibcs >= 2003-04-22 and <= 2004-02-27.

There are 2 possible solutions for this IMHO:
1) Tell users of the glibcs that don't handle this that they must upgrade glibc before booting a newer kernel. Also add a kernel cmdline option to turn the vDSO off, so that a user can turn it off, upgrade glibc, and then use the vDSO again on the next boot.
2) Start using a different AT_SYSINFO_* value (just one is enough; ATM AT_SYSINFO is ((ElfNN_Ehdr *)AT_SYSINFO_EHDR)->e_entry) and stop using the old 2 values. This would mean old glibcs stop using the vDSO, but hey, it is just an optimization. An upgraded glibc would use the vDSO again.

Jakub
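The relocation Jakub describes comes down to adding a load bias whenever the vDSO is mapped somewhere other than its link address. Here is a minimal sketch that assumes a single loadable segment whose start is the ELF header. struct vdso_image and both functions are hypothetical names, and the addresses used are illustrative. When there is no randomization the bias is zero and the entry point equals the linked e_entry, which is the AT_SYSINFO relationship mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, pared-down description of a vDSO image. */
struct vdso_image {
    uintptr_t link_base;  /* address the image was linked at (first PT_LOAD p_vaddr) */
    uintptr_t e_entry;    /* ELF header entry point, a link-time address */
};

/* Load bias: how far the mapping moved from its link address.
 * Unsigned arithmetic wraps modulo 2^N, so a downward move also works. */
static uintptr_t vdso_load_bias(const struct vdso_image *img,
                                uintptr_t at_sysinfo_ehdr)
{
    assert(at_sysinfo_ehdr != 0);     /* 0 is treated here as "no vDSO" */
    return at_sysinfo_ehdr - img->link_base;
}

/* Translates a link-time address inside the image (a symbol value,
 * e_entry, ...) into its address in this process. */
static uintptr_t vdso_runtime_addr(const struct vdso_image *img,
                                   uintptr_t at_sysinfo_ehdr,
                                   uintptr_t link_addr)
{
    return link_addr + vdso_load_bias(img, at_sysinfo_ehdr);
}
```

Take an image linked at 0xffffe000 with e_entry 0xffffe400 but mapped at 0xb7f12000. vdso_runtime_addr() yields 0xb7f12400 for its entry point. A loader that ignored the bias would jump to the link-time address instead.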
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Andrea Arcangeli @ 2004-03-04 16:45 UTC
To: Jakub Jelinek
Cc: john stultz, Ulrich Drepper, lkml, Andi Kleen, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Thu, Mar 04, 2004 at 03:57:36AM -0500, Jakub Jelinek wrote:
> On Wed, Mar 03, 2004 at 06:54:49PM -0800, john stultz wrote:
> > On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> > > And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> > > change between different kernel compiles).
> >
> > Actually, the 4G patch pushes vsysenter down a page, and glibc seems to
> > handle this properly.
>
> But the 4G/4G patch relinks the vDSO to the address it uses, this is no
> big problem for glibc which of course doesn't use hardcoded address but
> reads AT_SYSINFO{,_EHDR} values kernel passes to it.
>
> But the fixed vDSO location is a problem, exploits certainly appreciate
> a fixed address at which they with high probability can enter the kernel.
>
> Ingo Molnar recently wrote a patch to randomize the vDSO address on
> IA-32. Unfortunately it revealed some bugs in glibc where ld.so did not

Do you have a link to the patch? (I don't see it on his homepage.) I'm just curious to see how much precious address space you're throwing at this randomization, and in turn how many tries are needed to brute-force it.

If you really care so much about randomization vs. performance, it would have been a lot better to implement vsysenter in a completely different way. The kernel would export some position-independent bytecode to userspace via a syscall, and glibc would load this code somewhere in the address space. That placement would be truly random, making it trivial to use intra-page offsets with byte granularity. The kernel would then export only data, not executables, in the address space, and the executable bytecode would be returned by a syscall. That is somewhat slower at startup but fairly secure, and it doesn't waste kernel or user address space.

The max-performance design of x86 vsysenter and x86-64 vgettimeofday simply doesn't fit your security objective best IMHO. It wasn't designed for that, and trying to randomize it with page offsets is a hack. Keeping a local copy of the vsyscall bytecode at different intra-page offsets is something sanely doable in userspace. Doing it in the kernel is hairy and unnatural, since it involves copies, replication of the vsyscall page etc. One might as well do the copy in a load_usyscall syscall when the dynamic linker is asked to run gettimeofday, so that this page can also be swapped out. What the kernel can do more or less naturally is share the same _physical_ vsyscall "page" and export it at a different address for each task. Without intra-page offset differences, that makes it trivial to brute-force. That is what Ingo implemented, I guess, but it's 4096 times * 3.5G faster to crack.
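Andrea's brute-force comparison comes down to counting possible placements. This rough sketch assumes a 3.5 GiB randomization window (his figure) with positions chosen uniformly. Page-granular placement offers window/4096 slots and byte-granular placement offers window slots, so an attacker needs up to 4096 times fewer guesses against the former. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096ULL

/* Number of distinct placements in a randomization window when the
 * start address must be a multiple of `granularity` bytes. */
static uint64_t placements(uint64_t window_bytes, uint64_t granularity)
{
    assert(granularity != 0);
    return window_bytes / granularity;
}

/* Whole bits of entropy: floor(log2(n)) for n >= 1. */
static unsigned floor_log2(uint64_t n)
{
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}
```

For a 3.5 GiB window, placements(window, PAGE_BYTES) is 917504 and floor_log2() reports 19 bits. placements(window, 1) is 3758096384, or 31 bits.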
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Jamie Lokier @ 2004-03-04 8:00 UTC
To: Andrea Arcangeli
Cc: john stultz, lkml, Andi Kleen, Ulrich Drepper, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

Andrea Arcangeli wrote:
> I will try again with NTPL since it seems they didn't fix it (at
> least last time I checked the code the LDT waste as still there).

Does NPTL use the LDT at all? sys_set_thread_area was created specifically so that the LDT isn't needed.

-- Jamie
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Jakub Jelinek @ 2004-03-04 8:37 UTC
To: Jamie Lokier
Cc: Andrea Arcangeli, john stultz, lkml, Andi Kleen, Ulrich Drepper, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Thu, Mar 04, 2004 at 08:00:56AM +0000, Jamie Lokier wrote:
> Andrea Arcangeli wrote:
> > I will try again with NTPL since it seems they didn't fix it (at
> > least last time I checked the code the LDT waste as still there).
>
> Does NPTL use the LDT at all? sys_set_thread_area was created
> specifically so that the LDT isn't needed.

It doesn't, and neither does LinuxThreads when run on a recent kernel (one that has set_thread_area).

Jakub
* Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)
From: Andrea Arcangeli @ 2004-03-04 17:48 UTC
To: Jakub Jelinek
Cc: Jamie Lokier, john stultz, lkml, Andi Kleen, Ulrich Drepper, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

On Thu, Mar 04, 2004 at 03:37:55AM -0500, Jakub Jelinek wrote:
> On Thu, Mar 04, 2004 at 08:00:56AM +0000, Jamie Lokier wrote:
> > Andrea Arcangeli wrote:
> > > I will try again with NTPL since it seems they didn't fix it (at
> > > least last time I checked the code the LDT waste as still there).
> >
> > Does NPTL use the LDT at all? sys_set_thread_area was created
> > specifically so that the LDT isn't needed.
>
> It doesn't and neither does LinuxThreads when run on a recent kernel
> (which has set_thread_area).

I really thought set_thread_area depended on an LDT being allocated first. I was wrong, sorry. The parameter passed has the same format as for modify_ldt (that's what fooled me), but it seems to be used only to write directly into the GDT. It's not an entry in the LDT but an entry in the GDT, and that GDT entry is rewritten at every thread switch (not at every mm switch). So as long as glibc doesn't allocate the LDT and only uses this logic, the problem I mentioned (with 2.4 kernels) is simply solved, and I really appreciate this.

The thing that scared me most, and that hurt me most when deferring the allocation to pthread_create with 2.4 kernels, is that at least in linuxthreads the LDT allocation spreads outside the linuxthreads package. IIRC it spreads into the dynamic linker, but I last looked a few months ago. I hope this is fixed in NPTL too.

Clearly the LDT had to be nuked somehow to get past a few thousand threads anyway. Avoiding the LDT allocation matters even more in real life, though, because many real apps don't use threads at all but link with pthreads by mistake, like /bin/ls.
* [RFC] vsyscall-gtod_test_B3.tar.gz
From: john stultz @ 2004-03-04 0:15 UTC
To: lkml
Cc: Andrea Arcangeli, Andi Kleen, Ulrich Drepper, Jamie Lokier, Martin J. Bligh, Wim Coekaerts, Joel Becker, Chris McDermott

All,
	This tarball shows an example of how to use the LD_PRELOAD method of calling vsyscall-gtod. This method is used in the absence of the vsyscall-gtod_B3-part3 patch, or on systems without a sysenter-enabled glibc.

The example provides a micro-benchmark that compares the performance of gettimeofday() when called normally versus through the LD_PRELOAD method.

thanks
-john

[-- Attachment #2: vsyscall-gtod_test_B3.tar.gz --]
[-- Type: application/x-compressed-tar, Size: 941 bytes --]
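The tarball's own benchmark is not reproduced here. The following is a hypothetical minimal sketch of the same kind of measurement, which times a loop of gettimeofday() calls using gettimeofday() itself. You could run one binary built around it normally, and again with LD_PRELOAD pointing at a vsyscall-gtod shim, to compare the two paths described above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed from *a to *b. */
static long long tv_diff_us(const struct timeval *a, const struct timeval *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000LL
         + (long long)(b->tv_usec - a->tv_usec);
}

/* Calls gettimeofday() `runs` times and returns the total elapsed
 * microseconds; divide by `runs` for an approximate per-call cost. */
static long long bench_gettimeofday(size_t runs)
{
    struct timeval start, end, scratch;

    assert(runs > 0);   /* a zero-run benchmark has no per-call figure */
    gettimeofday(&start, NULL);
    for (size_t i = 0; i < runs; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&end, NULL);
    return tv_diff_us(&start, &end);
}
```

Dividing bench_gettimeofday(1000000) by 1000000 gives an approximate per-call cost in microseconds. The figure includes the loop overhead and the two bracketing calls.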
Thread overview: 19+ messages
2004-03-04  0:11 [RFC][PATCH] vsyscall-gtod_B3 (0/3) john stultz
2004-03-04  0:12 ` [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part1 (1/3) john stultz
2004-03-04  0:13 ` [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part2 (2/3) john stultz
2004-03-04  0:14 ` [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3) john stultz
2004-03-04  0:55 ` Andrea Arcangeli
2004-03-04  2:16 ` Ulrich Drepper
2004-03-04  2:43 ` john stultz
2004-03-04  3:14 ` Andrea Arcangeli
2004-03-04  8:09 ` Ulrich Drepper
2004-03-04 19:02 ` john stultz
2004-03-04  2:47 ` Andrea Arcangeli
2004-03-04  2:54 ` john stultz
2004-03-04  3:15 ` Andrea Arcangeli
2004-03-04  8:57 ` Jakub Jelinek
2004-03-04 16:45 ` Andrea Arcangeli
2004-03-04  8:00 ` Jamie Lokier
2004-03-04  8:37 ` Jakub Jelinek
2004-03-04 17:48 ` Andrea Arcangeli
2004-03-04  0:15 ` [RFC] vsyscall-gtod_test_B3.tar.gz john stultz