* i386 PDA patches use of %gs @ 2006-09-12 7:35 Arjan van de Ven 2006-09-12 7:48 ` Jeremy Fitzhardinge 2006-09-13 1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge 0 siblings, 2 replies; 45+ messages in thread
From: Arjan van de Ven @ 2006-09-12 7:35 UTC (permalink / raw)
To: akpm, ak, mingo, Jeremy Fitzhardinge; +Cc: linux-kernel

Hi,

Userspace uses %gs for its per-thread data (and in modern linux versions
that means "all the time", errno is there for example). On x86-64 this is
the reason that the kernel uses the OTHER segment register; so for the PDA
patches this would mean using %fs and not %gs.

The advantage of this is very simple: %fs will be 0 for userspace most
of the time. Putting 0 in a segment register is cheap for the cpu,
putting anything else in is quite expensive (a LOT of security checks
need to happen). As such I would MUCH rather see that the i386 PDA
patches use %fs and not %gs...

Jeremy, is there a reason you're specifically using %gs and not %fs? If
not, would you mind a switch to using %fs instead?

Greetings,
   Arjan van de Ven

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com

^ permalink raw reply	[flat|nested] 45+ messages in thread
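For readers who want to see the userspace side of Arjan's point, here is a minimal sketch (not from the thread) showing that thread-local data is reached through a segment register on i386 glibc. The file and variable names are made up; the exact instruction sequence depends on the gcc/glibc version, so inspect the generated assembly rather than trusting the comments:

/* build: gcc -m32 -O2 -S tls-peek.c
   then look for %gs-prefixed memory operands in tls-peek.s */
#include <errno.h>

__thread int my_counter;	/* ordinary __thread variable */

int touch_tls(void)
{
	/* errno is per-thread in modern glibc; the caller goes through
	   __errno_location(), which uses the TLS segment internally */
	errno = 0;
	/* this access typically compiles to a %gs:-relative instruction
	   on i386 */
	return ++my_counter;
}

^ end of illustrative sketch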
* Re: i386 PDA patches use of %gs 2006-09-12 7:35 i386 PDA patches use of %gs Arjan van de Ven @ 2006-09-12 7:48 ` Jeremy Fitzhardinge 2006-09-12 7:56 ` Arjan van de Ven 2006-09-13 1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge 1 sibling, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-09-12 7:48 UTC (permalink / raw) To: Arjan van de Ven; +Cc: akpm, ak, mingo, linux-kernel Arjan van de Ven wrote: > Jeremy, is there a reason you're specifically using %gs and not %fs? If > not, would you mind a switch to using %fs instead? > The main reason for using %gs was to take advantage of gcc's TLS support. I intend to measure the cost of gs vs fs, and if there's a significant difference I'll switch. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-09-12 7:48 ` Jeremy Fitzhardinge @ 2006-09-12 7:56 ` Arjan van de Ven 2006-09-12 8:31 ` Jeremy Fitzhardinge 2006-11-15 11:27 ` [PATCH] i386-pda UP optimization Eric Dumazet 0 siblings, 2 replies; 45+ messages in thread From: Arjan van de Ven @ 2006-09-12 7:56 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: akpm, ak, mingo, linux-kernel On Tue, 2006-09-12 at 00:48 -0700, Jeremy Fitzhardinge wrote: > Arjan van de Ven wrote: > > Jeremy, is there a reason you're specifically using %gs and not %fs? If > > not, would you mind a switch to using %fs instead? > > > > The main reason for using %gs was to take advantage of gcc's TLS > support. I intend to measure the cost of gs vs fs, and if there's a > significant difference I'll switch. gcc can be fixed if needed. I don't see the kernel switching to use that any time soon though... ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-09-12 7:56 ` Arjan van de Ven @ 2006-09-12 8:31 ` Jeremy Fitzhardinge 2006-11-15 11:27 ` [PATCH] i386-pda UP optimization Eric Dumazet 1 sibling, 0 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-09-12 8:31 UTC (permalink / raw) To: Arjan van de Ven; +Cc: akpm, ak, mingo, linux-kernel Arjan van de Ven wrote: > gcc can be fixed if needed. I don't see the kernel switching to use that > any time soon though... I have a preliminary patch to implement per_cpu() in terms of __thread. Hm, my initial tests comparing reloading a NULL selector vs a real selector shows absolutely no measurable difference, on either a modern Core Duo, or an old P4... Admittedly this is with an artificial usermode test program, but I'd expect to see *some* difference if there's a difference. J -- /* gcc -o time-segops time-segops.c -O2 -Wall -lrt -fomit-frame-pointer -funroll-loops */ #include <stdio.h> #include <time.h> #define COUNT 10000000 static inline void sync(void) { int a,b,c,d; asm volatile("cpuid" : "=a" (a), "=b" (b), "=c" (c), "=d" (d) : "0" (0), "2" (0) : "memory"); } static void test_none(void) { int i; for(i = 0; i < COUNT; i++) { sync(); } } static void test_fs(void) { int i, ds; asm volatile("mov %%ds,%0" : "=r" (ds)); for(i = 0; i < COUNT; i++) { asm volatile("push %%fs; mov %0, %%fs; popl %%fs" : : "r" (ds)); sync(); } } static void test_gs(void) { int i, ds; asm volatile("mov %%ds,%0" : "=r" (ds)); for(i = 0; i < COUNT; i++) { asm volatile("push %%gs; mov %0, %%gs; popl %%gs" : : "r" (ds)); sync(); } } typedef void (*test_t)(void); static test_t tests[] = { test_none, test_fs, test_gs, NULL, }; int main() { int i; int ds, fs, gs; asm volatile("mov %%ds, %0; " "mov %%fs, %1; " "mov %%gs, %2" : "=r" (ds), "=r" (fs), "=r" (gs) : : "memory"); printf("fs=%x gs=%x\n", fs, gs); for(i = 0; tests[i]; i++) { struct timespec start, end; unsigned long long delta; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); (*tests[i])(); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); delta = (end.tv_sec * 1000000000ull + end.tv_nsec) - (start.tv_sec * 1000000000ull + start.tv_nsec); delta /= COUNT; printf("%lluns/iteration\n", delta); } return 0; } ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH] i386-pda UP optimization 2006-09-12 7:56 ` Arjan van de Ven 2006-09-12 8:31 ` Jeremy Fitzhardinge @ 2006-11-15 11:27 ` Eric Dumazet 2006-11-15 11:32 ` Andi Kleen ` (2 more replies) 1 sibling, 3 replies; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 11:27 UTC (permalink / raw)
To: akpm; +Cc: Arjan van de Ven, Jeremy Fitzhardinge, ak, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1397 bytes --]

Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile
results on Opteron machines.

I really think %gs prefixes can be expensive in some (most ?) cases, even if
the Intel/AMD docs say they are free.

I wrote this trivial user program to benchmark vfs_read()/vfs_write(), which
happen to use 'current' many times.

#include <unistd.h>
#include <errno.h>
int main()
{
	int i, fd[2];
	char c = 0;
	pipe(fd);
	for (i = 0; i < 10000000; i++) {
		errno = 0; // glibc also use %gs
		write(fd[1], &c, 1);
		read(fd[0], &c, 1);
	}
	return 0;
}

The best elapsed time I got for this program on 10 runs was : 12.811 s
(Intel(R) Pentium(R) M processor 1.60GHz)

With the attached patch, I got 12.212 s, and a kernel text size reduction of
3400 bytes.

I wish Jeremy give us patches for UP machines so that %gs can be let untouched
in entry.S (syscall entry/exit). A lot of ia32 machines are still using one
CPU.

Note : I dont have an x86_64 machine here, but I suspect a similar patch could
be done for x86_64 too.

Thank you

[PATCH] i386-pda UP optimization

On a !CONFIG_SMP machine, there is only one PDA (one CPU).

We can avoid %gs prefixes when reading/writing fields in the PDA.

This reduces kernel text size and also gives better performance.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: i386-pda-up.patch --]
[-- Type: text/plain, Size: 2182 bytes --]

--- linux-2.6.19-rc5-mm2/include/asm-i386/pda.h	2006-11-15 11:21:24.000000000 +0100
+++ linux-2.6.19-rc5-mm2-ed/include/asm-i386/pda.h	2006-11-15 11:23:49.000000000 +0100
@@ -91,10 +91,19 @@
 	((typeof(_proxy_pda.field) *)((unsigned char *)read_pda(_pda) +	\
 				      pda_offset(field)))
 
+#if defined(CONFIG_SMP)
 #define read_pda(field) pda_from_op("mov",field)
 #define write_pda(field,val) pda_to_op("mov",field,val)
 #define add_pda(field,val) pda_to_op("add",field,val)
 #define sub_pda(field,val) pda_to_op("sub",field,val)
 #define or_pda(field,val) pda_to_op("or",field,val)
+#else
+extern struct i386_pda boot_pda;
+#define read_pda(field) boot_pda.field
+#define write_pda(field,val) do { boot_pda.field = (val);} while (0)
+#define add_pda(field,val) do { boot_pda.field += (val);} while (0)
+#define sub_pda(field,val) do { boot_pda.field -= (val);} while (0)
+#define or_pda(field,val) do { boot_pda.field |= (val);} while (0)
+#endif
 
 #endif /* _I386_PDA_H */
--- linux-2.6.19-rc5-mm2/arch/i386/kernel/cpu/common.c	2006-11-15 11:21:25.000000000 +0100
+++ linux-2.6.19-rc5-mm2-ed/arch/i386/kernel/cpu/common.c	2006-11-15 11:45:09.000000000 +0100
@@ -609,6 +609,14 @@
 	return regs;
 }
 
+/* Initial PDA used by boot CPU */
+struct i386_pda boot_pda = {
+	._pda = &boot_pda,
+	.cpu_number = 0,
+	.pcurrent = &init_task,
+};
+EXPORT_SYMBOL(boot_pda);
+
 static __cpuinit int alloc_gdt(int cpu)
 {
 	struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
@@ -628,11 +636,10 @@
 		BUG_ON(gdt != NULL || pda != NULL);
 
 		gdt = alloc_bootmem_pages(PAGE_SIZE);
-		pda = alloc_bootmem(sizeof(*pda));
+		pda = &boot_pda;
 		/* alloc_bootmem(_pages) panics on failure, so no check */
 
 		memset(gdt, 0, PAGE_SIZE);
-		memset(pda, 0, sizeof(*pda));
 	} else {
 		/* GDT and PDA might already have been allocated if
 		   this is a CPU hotplug re-insertion. */
@@ -655,13 +662,6 @@
 	return 1;
 }
 
-/* Initial PDA used by boot CPU */
-struct i386_pda boot_pda = {
-	._pda = &boot_pda,
-	.cpu_number = 0,
-	.pcurrent = &init_task,
-};
-
 static inline void set_kernel_gs(void)
 {
 	/* Set %gs for this CPU's PDA.  Memory clobber is to create a

^ permalink raw reply	[flat|nested] 45+ messages in thread
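Before the thread continues, a minimal userspace sketch of the narrower question raised here: what a segment-override prefix itself costs, independent of syscall entry/exit. This is illustrative only and not a program from the thread; it assumes a flat 32-bit address space, uses %fs rather than %gs so glibc's TLS segment is left alone, and loads %fs once with the same selector %ds already holds, so the prefixed and unprefixed accesses touch the same memory:

/* gcc -m32 -O2 -o prefix-cost prefix-cost.c -lrt */
#include <stdio.h>
#include <time.h>

#define COUNT 100000000

static int counter;

static unsigned long long now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	unsigned long long t0, t1, t2;
	int i, ds;

	/* copy the flat data selector into %fs once; %fs:counter and
	   counter then name the same byte, only the encoding differs */
	asm volatile("mov %%ds, %0" : "=r" (ds));
	asm volatile("mov %0, %%fs" : : "r" (ds));

	t0 = now_ns();
	for (i = 0; i < COUNT; i++)
		asm volatile("addl $1, %0" : "+m" (counter));
	t1 = now_ns();
	for (i = 0; i < COUNT; i++)
		asm volatile("addl $1, %%fs:%0" : "+m" (counter));
	t2 = now_ns();

	printf("no prefix : %.2f ns/op\n", (double)(t1 - t0) / COUNT);
	printf("fs prefix : %.2f ns/op\n", (double)(t2 - t1) / COUNT);
	return 0;
}

This only measures the prefix on a hot, already-loaded segment register; it deliberately does not measure the selector reloads at kernel entry/exit, which is the separate cost Eric's entry.S experiments below go after.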
* Re: [PATCH] i386-pda UP optimization 2006-11-15 11:27 ` [PATCH] i386-pda UP optimization Eric Dumazet @ 2006-11-15 11:32 ` Andi Kleen 2006-11-15 17:20 ` Ingo Molnar 2006-11-15 17:52 ` Jeremy Fitzhardinge 2006-11-28 23:12 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 45+ messages in thread From: Andi Kleen @ 2006-11-15 11:32 UTC (permalink / raw) To: Eric Dumazet Cc: akpm, Arjan van de Ven, Jeremy Fitzhardinge, mingo, linux-kernel On Wednesday 15 November 2006 12:27, Eric Dumazet wrote: > Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile > results on Opteron machines. > > I really think %gs prefixes can be expensive in some (most ?) cases, even if > the Intel/AMD docs say they are free. They aren't free, just very cheap. > > With the attached patch, I got 12.212 s, and a kernel text size reduction of > 3400 bytes. Are the benchmark numbers stable? i.e. if you repeat them multiple times with reboots do you still get the same difference? -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 11:32 ` Andi Kleen @ 2006-11-15 17:20 ` Ingo Molnar 2006-11-15 17:24 ` Andi Kleen 2006-11-15 17:28 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 17:20 UTC (permalink / raw) To: Andi Kleen Cc: Eric Dumazet, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel * Andi Kleen <ak@suse.de> wrote: > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote: > > Seeing %gs prefixes used now by i386 port, I recalled seeing strange > > oprofile results on Opteron machines. > > > > I really think %gs prefixes can be expensive in some (most ?) cases, > > even if the Intel/AMD docs say they are free. > > They aren't free, just very cheap. Eric's test shows a 5% slowdown. That's far from cheap. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
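To make the 5% figure concrete, using only the numbers Eric posted and assuming his 10,000,000-iteration loop issues one write() and one read() per pass:

	(12.811 s - 12.212 s) / 12.811 s          ~= 4.7 %  overall
	0.599 s / (2 x 10,000,000 syscalls)       ~= 30 ns per syscall

so the slowdown Ingo refers to works out to roughly 30 ns of extra cost per system call on Eric's 1.6 GHz Pentium M.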
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:20 ` Ingo Molnar @ 2006-11-15 17:24 ` Andi Kleen 2006-11-15 17:46 ` Eric Dumazet 2006-11-15 17:28 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 45+ messages in thread From: Andi Kleen @ 2006-11-15 17:24 UTC (permalink / raw) To: Ingo Molnar Cc: Eric Dumazet, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel On Wednesday 15 November 2006 18:20, Ingo Molnar wrote: > > * Andi Kleen <ak@suse.de> wrote: > > > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote: > > > Seeing %gs prefixes used now by i386 port, I recalled seeing strange > > > oprofile results on Opteron machines. > > > > > > I really think %gs prefixes can be expensive in some (most ?) cases, > > > even if the Intel/AMD docs say they are free. > > > > They aren't free, just very cheap. > > Eric's test shows a 5% slowdown. That's far from cheap. I have my doubts about the accuracy of his test results. That is why I asked him to double check. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:24 ` Andi Kleen @ 2006-11-15 17:46 ` Eric Dumazet 2006-11-15 17:49 ` Ingo Molnar 2006-11-21 11:38 ` Eric Dumazet 0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 17:46 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On Wednesday 15 November 2006 18:24, Andi Kleen wrote:
> On Wednesday 15 November 2006 18:20, Ingo Molnar wrote:
> > * Andi Kleen <ak@suse.de> wrote:
> > > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> > > > Seeing %gs prefixes used now by i386 port, I recalled seeing strange
> > > > oprofile results on Opteron machines.
> > > >
> > > > I really think %gs prefixes can be expensive in some (most ?) cases,
> > > > even if the Intel/AMD docs say they are free.
> > >
> > > They aren't free, just very cheap.
> >
> > Eric's test shows a 5% slowdown. That's far from cheap.
>
> I have my doubts about the accuracy of his test results. That is why I
> asked him to double check.

Fair enough :)

I plan doing *lot* of tests as soon as possible (not possible during
daytime unfortunately, I miss a dev machine)

By the way, I tried this patch to avoid reload %gs at syscall start.
Since %gs is not anymore used inside kernel (after i386-pda UP optimization
is applied) :
We can let in %gs the User Program %gs value.
(I still force a reload of %gs before syscall exit of course)

Machine boots but freeze when init starts. Any idea ?

Thank you
Eric

[-- Attachment #2: entry.patch --]
[-- Type: text/plain, Size: 1398 bytes --]

--- linux-2.6.19-rc5-mm2/arch/i386/kernel/entry.S	2006-11-15 11:21:25.000000000 +0100
+++ linux-2.6.19-rc5-mm2-ed/arch/i386/kernel/entry.S	2006-11-15 18:40:53.000000000 +0100
@@ -97,6 +97,16 @@
 #define resume_userspace_sig resume_userspace
 #endif
 
+/*
+ * On UP, we dont need to change %gs since PDA accesses dont use %gs
+ */
+#if defined(CONFIG_SMP)
+#define LOAD_KERNEL_GS(reg)	movl $(__KERNEL_PDA), reg; \
+				movl reg, %gs
+#else
+#define LOAD_KERNEL_GS(reg)
+#endif
+
 #define SAVE_ALL \
 	cld; \
 	pushl %gs; \
@@ -132,8 +142,7 @@
 	movl $(__USER_DS), %edx; \
 	movl %edx, %ds; \
 	movl %edx, %es; \
-	movl $(__KERNEL_PDA), %edx; \
-	movl %edx, %gs
+	LOAD_KERNEL_GS(%edx);
 
 #define RESTORE_INT_REGS \
 	popl %ebx; \
@@ -544,9 +553,15 @@
 	jmp resume_userspace
 	CFI_ENDPROC
 
+#ifdef CONFIG_SMP
+# define GET_CPU_NUM(reg) movl %gs:PDA_cpu, reg;
+#else
+# define GET_CPU_NUM(reg)
+#endif
+
 #define FIXUP_ESPFIX_STACK \
 	/* since we are on a wrong stack, we cant make it a C code :( */ \
-	movl %gs:PDA_cpu, %ebx; \
+	GET_CPU_NUM(%ebx) \
 	PER_CPU(cpu_gdt_descr, %ebx); \
 	movl GDS_address(%ebx), %ebx; \
 	GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \
@@ -660,8 +675,7 @@
 	pushl %gs
 	CFI_ADJUST_CFA_OFFSET 4
 	/*CFI_REL_OFFSET gs, 0*/
-	movl $(__KERNEL_PDA), %ecx
-	movl %ecx, %gs
+	LOAD_KERNEL_GS(%ecx)
 	UNWIND_ESPFIX_STACK
 	popl %ecx
 	CFI_ADJUST_CFA_OFFSET -4

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:46 ` Eric Dumazet @ 2006-11-15 17:49 ` Ingo Molnar 2006-11-15 17:58 ` Eric Dumazet 2006-11-21 11:38 ` Eric Dumazet 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 17:49 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel * Eric Dumazet <dada1@cosmosbay.com> wrote: > Machine boots but freeze when init starts. Any idea ? probably caused by this: > +# define GET_CPU_NUM(reg) > #define FIXUP_ESPFIX_STACK \ > /* since we are on a wrong stack, we cant make it a C code :( */ \ > - movl %gs:PDA_cpu, %ebx; \ > + GET_CPU_NUM(%ebx) \ > PER_CPU(cpu_gdt_descr, %ebx); \ > movl GDS_address(%ebx), %ebx; \ %ebx very definitely wants to have a current CPU number loaded ;) Pick it up from the task struct. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:49 ` Ingo Molnar @ 2006-11-15 17:58 ` Eric Dumazet 2006-11-15 18:01 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 17:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andi Kleen, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel

On Wednesday 15 November 2006 18:49, Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> > Machine boots but freeze when init starts. Any idea ?
>
> probably caused by this:
>
> > +# define GET_CPU_NUM(reg)
> >
> >  #define FIXUP_ESPFIX_STACK \
> >  	/* since we are on a wrong stack, we cant make it a C code :( */ \
> > -	movl %gs:PDA_cpu, %ebx; \
> > +	GET_CPU_NUM(%ebx) \
> >  	PER_CPU(cpu_gdt_descr, %ebx); \
> >  	movl GDS_address(%ebx), %ebx; \
>
> %ebx very definitely wants to have a current CPU number loaded ;) Pick
> it up from the task struct.

Hum.... Are you sure ?

For UP we have this PER_CPU definition :

#define PER_CPU(var, cpu) \
	movl $per_cpu__/**/var, cpu;

You can see 'cpu' is a pure output, not an input value.

So I basically deleted the first instruction of this sequence :

	movl %gs:PDA_cpu, %ebx
	movl $per_cpu__cpu_gdt_descr, %ebx;

Did I miss something ?

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:58 ` Eric Dumazet @ 2006-11-15 18:01 ` Ingo Molnar 0 siblings, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:01 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel * Eric Dumazet <dada1@cosmosbay.com> wrote: > > > + GET_CPU_NUM(%ebx) \ > > > PER_CPU(cpu_gdt_descr, %ebx); \ > > > movl GDS_address(%ebx), %ebx; \ > > > > %ebx very definitely wants to have a current CPU number loaded ;) Pick > > it up from the task struct. > > Hum.... Are you sure ? > > For UP we have this PER_CPU definition : > > #define PER_CPU(var, cpu) \ > movl $per_cpu__/**/var, cpu; hm, you are right. No quick ideas then. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:46 ` Eric Dumazet 2006-11-15 17:49 ` Ingo Molnar @ 2006-11-21 11:38 ` Eric Dumazet 2006-11-21 21:42 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2006-11-21 11:38 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel On Wednesday 15 November 2006 18:46, Eric Dumazet wrote: > On Wednesday 15 November 2006 18:24, Andi Kleen wrote: > > On Wednesday 15 November 2006 18:20, Ingo Molnar wrote: > > > * Andi Kleen <ak@suse.de> wrote: > > > > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote: > > > > > Seeing %gs prefixes used now by i386 port, I recalled seeing > > > > > strange oprofile results on Opteron machines. > > > > > > > > > > I really think %gs prefixes can be expensive in some (most ?) > > > > > cases, even if the Intel/AMD docs say they are free. > > > > > > > > They aren't free, just very cheap. > > > > > > Eric's test shows a 5% slowdown. That's far from cheap. > > > > I have my doubts about the accuracy of his test results. That is why I > > asked him to double check. > > Fair enough :) > > I plan doing *lot* of tests as soon as possible (not possible during > daytime unfortunately, I miss a dev machine) > I did *lot* of reboots of my Dell D610 machine, with some trivial benchmarks using : pipe/write()/read, umask(), or getppid(), using or not oprofile. I managed to avoid reloading %gs in sysenter_entry . (avoiding the two instructions : movl $(__KERNEL_PDA), %edx; movl %edx, %gs I could not avoid reloading %gs in system_call, I dont know why, but modern glibc use sysenter so I dont care :) I confirm I got better results with my patched kernel in all tests I've done. umask : 12.64 s instead of 12.90 s getppid : 13.37 s instead of 13.72 s pipe/read/write : 9.10 s instead of 9.52 s (I got very different results in umask() bench, patching it not to use xchg(), since this instruction is expensive on x86 and really change oprofile results. I will submit a patch for this. Eric ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-21 11:38 ` Eric Dumazet @ 2006-11-21 21:42 ` Jeremy Fitzhardinge 2006-11-21 21:52 ` Andi Kleen 2006-11-21 21:58 ` Eric Dumazet 0 siblings, 2 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-21 21:42 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel Eric Dumazet wrote: > I did *lot* of reboots of my Dell D610 machine, with some trivial benchmarks > using : pipe/write()/read, umask(), or getppid(), using or not oprofile. > > I managed to avoid reloading %gs in sysenter_entry . > (avoiding the two instructions : movl $(__KERNEL_PDA), %edx; movl %edx, %gs > > I could not avoid reloading %gs in system_call, I dont know why, but modern > glibc use sysenter so I dont care :) > > I confirm I got better results with my patched kernel in all tests I've done. > > umask : 12.64 s instead of 12.90 s > getppid : 13.37 s instead of 13.72 s > pipe/read/write : 9.10 s instead of 9.52 s > > (I got very different results in umask() bench, patching it not to use xchg(), > since this instruction is expensive on x86 and really change oprofile > results. I will submit a patch for this. > Could you go into more detail about what you're actually measuring here? Is it 10,000,000 loops of the single syscall? pipe/read/write suggests that you're doing at least 2 syscalls per loop, but it takes the smallest elapsed time. What are you using as your time reference? Real time? Process time? For umask/getppid, assuming you're just running 1e7 iterations, you're seeing a difference of 25 and 35ns per iteration difference. I wonder why it would be different for different syscalls; I would expect it to be a constant overhead either way. Certainly these numbers are much larger than I saw when I benchmarked pda-vs-nopda using lmbench's null syscall (getppid) test; I saw an overall 9ns difference in null syscall time on my Core Duo run at 1GHz. What's your CPU and speed? One possibility is a cache miss on the gdt while reloading %gs. I've been planning on a patch to rearrange the gdt in order to pack all the commonly used segment descriptors into one or two cache lines so that all the segment register reloads can be done with a minimum of cache misses. It would be interesting for you to replace the: movl $(__KERNEL_PDA), %edx; movl %edx, %gs with an appropriate read of the gdt entry, hm, which is a bit complex to find. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-21 21:42 ` Jeremy Fitzhardinge @ 2006-11-21 21:52 ` Andi Kleen 2006-11-21 22:10 ` Jeremy Fitzhardinge 2006-11-21 21:58 ` Eric Dumazet 1 sibling, 1 reply; 45+ messages in thread From: Andi Kleen @ 2006-11-21 21:52 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Eric Dumazet, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel > For umask/getppid, assuming you're just running 1e7 iterations, you're > seeing a difference of 25 and 35ns per iteration difference. I wonder > why it would be different for different syscalls; I would expect it to > be a constant overhead either way. They got different numbers of current references? > Certainly these numbers are much > larger than I saw when I benchmarked pda-vs-nopda using lmbench's null > syscall (getppid) test; I saw an overall 9ns difference in null syscall > time on my Core Duo run at 1GHz. What's your CPU and speed? > > One possibility is a cache miss on the gdt while reloading %gs. I've On such micro benchmarks everything should be cache hot in theory (unless it's a system with really small cache) > been planning on a patch to rearrange the gdt in order to pack all the > commonly used segment descriptors into one or two cache lines so that > all the segment register reloads can be done with a minimum of cache > misses. It would be interesting for you to replace the: > > movl $(__KERNEL_PDA), %edx; movl %edx, %gs > > with an appropriate read of the gdt entry, hm, which is a bit complex to > find. On UP it could be hardcoded. And oprofile can be used to profile for cache misses. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-21 21:52 ` Andi Kleen @ 2006-11-21 22:10 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-21 22:10 UTC (permalink / raw) To: Andi Kleen Cc: Eric Dumazet, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel Andi Kleen wrote: >> For umask/getppid, assuming you're just running 1e7 iterations, you're >> seeing a difference of 25 and 35ns per iteration difference. I wonder >> why it would be different for different syscalls; I would expect it to >> be a constant overhead either way. >> > > They got different numbers of current references? > My understanding is that Eric has changed UP current (and other PDA ops) to not touch %gs at all, and the difference in reported times in due omitting the %gs load in entry.S (though %gs is still save/restored on the stack). > On such micro benchmarks everything should be cache hot in theory > (unless it's a system with really small cache) > Yes, that would be my thought too, but maybe there's excessive aliasing on one of the ways, but I think he's using a Pentium M which has a 8-way L1. >> been planning on a patch to rearrange the gdt in order to pack all the >> commonly used segment descriptors into one or two cache lines so that >> all the segment register reloads can be done with a minimum of cache >> misses. It would be interesting for you to replace the: >> >> movl $(__KERNEL_PDA), %edx; movl %edx, %gs >> >> with an appropriate read of the gdt entry, hm, which is a bit complex to >> find. >> > > On UP it could be hardcoded. And oprofile can be used to profile for cache misses. > Yes, assuming oprofile doesn't interfere with things too much. Actually, just counting cache miss events during the course of a syscall would be most interesting (ie, no need to sample). J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-21 21:42 ` Jeremy Fitzhardinge 2006-11-21 21:52 ` Andi Kleen @ 2006-11-21 21:58 ` Eric Dumazet 2006-11-21 23:12 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2006-11-21 21:58 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Andi Kleen, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel Jeremy Fitzhardinge a écrit : > Eric Dumazet wrote: >> I did *lot* of reboots of my Dell D610 machine, with some trivial benchmarks >> using : pipe/write()/read, umask(), or getppid(), using or not oprofile. >> >> I managed to avoid reloading %gs in sysenter_entry . >> (avoiding the two instructions : movl $(__KERNEL_PDA), %edx; movl %edx, %gs >> >> I could not avoid reloading %gs in system_call, I dont know why, but modern >> glibc use sysenter so I dont care :) >> >> I confirm I got better results with my patched kernel in all tests I've done. >> >> umask : 12.64 s instead of 12.90 s >> getppid : 13.37 s instead of 13.72 s >> pipe/read/write : 9.10 s instead of 9.52 s >> >> (I got very different results in umask() bench, patching it not to use xchg(), >> since this instruction is expensive on x86 and really change oprofile >> results. I will submit a patch for this. >> > > Could you go into more detail about what you're actually measuring > here? Is it 10,000,000 loops of the single syscall? pipe/read/write > suggests that you're doing at least 2 syscalls per loop, but it takes > the smallest elapsed time. for umask/getppid(), its a basic loop with 100.000.000 iterations for read/write(), loop with 10.000.000 iterations > > What are you using as your time reference? Real time? Process time? > elapsed time (/usr/bin/time ./prog) 10 runs, and the minimum time is taken. > For umask/getppid, assuming you're just running 1e7 iterations, you're > seeing a difference of 25 and 35ns per iteration difference. I wonder > why it would be different for different syscalls; I would expect it to > be a constant overhead either way. Certainly these numbers are much > larger than I saw when I benchmarked pda-vs-nopda using lmbench's null > syscall (getppid) test; I saw an overall 9ns difference in null syscall > time on my Core Duo run at 1GHz. What's your CPU and speed? Its a 1.6GHz Pentium-M CPU (Dell D610) > > One possibility is a cache miss on the gdt while reloading %gs. I've > been planning on a patch to rearrange the gdt in order to pack all the > commonly used segment descriptors into one or two cache lines so that > all the segment register reloads can be done with a minimum of cache > misses. It would be interesting for you to replace the: > > movl $(__KERNEL_PDA), %edx; movl %edx, %gs > > with an appropriate read of the gdt entry, hm, which is a bit complex to > find. > Hum... Do you mean a cache miss every time we do a syscall ? What could invalidate this cache exactly ? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-21 21:58 ` Eric Dumazet @ 2006-11-21 23:12 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-21 23:12 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel Eric Dumazet wrote: > for umask/getppid(), its a basic loop with 100.000.000 iterations Ah, OK, so there's about 2.5-3.5ns difference due to the instructions you removed. That's very much in line with that I saw in my measurements. > for read/write(), loop with 10.000.000 iterations 2 syscalls/iteration? It's interesting you measured about the same absolute time difference (.42s) even though you're doing 1/5th the number of syscalls. > elapsed time (/usr/bin/time ./prog) > 10 runs, and the minimum time is taken. Hm, but "time" measures user, system and real time. You used real time? > Hum... Do you mean a cache miss every time we do a syscall ? What > could invalidate this cache exactly ? Well, there might be a miss simply because the line got evicted. But as Andi pointed out, a hot benchmark like this is very unlikely to get any cache misses unless there's something very unfortunate happening. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:20 ` Ingo Molnar 2006-11-15 17:24 ` Andi Kleen @ 2006-11-15 17:28 ` Jeremy Fitzhardinge 2006-11-15 17:32 ` Ingo Molnar 2006-11-15 18:01 ` Arjan van de Ven 1 sibling, 2 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 17:28 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Eric Dumazet, akpm, Arjan van de Ven, linux-kernel Ingo Molnar wrote: > Eric's test shows a 5% slowdown. That's far from cheap. > It seems like an absurdly large difference. PDA references aren't all that common in the kernel; for the %gs prefix on PDA accesses to be causing a 5% overall difference in a test like this means that the prefixes would have to be costing hundreds or thousands of cycles, which seems absurd. Particularly since Eric's patch doesn't touch head.S, so the %gs save/restore is still being executed. Are we sure this isn't a cache layout issue? Eric, did you try evicting your executable from pagecache between runs to see if you get variation depending on what physical pages it gets put into? (Making several copies of the executable should have the same effect.) J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:28 ` Jeremy Fitzhardinge @ 2006-11-15 17:32 ` Ingo Molnar 2006-11-15 17:59 ` Jeremy Fitzhardinge 2006-11-15 18:01 ` Arjan van de Ven 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 17:32 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Andi Kleen, Eric Dumazet, akpm, Arjan van de Ven, linux-kernel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > > Eric's test shows a 5% slowdown. That's far from cheap. > > It seems like an absurdly large difference. PDA references aren't all > that common in the kernel; for the %gs prefix on PDA accesses to be > causing a 5% overall difference in a test like this means that the > prefixes would have to be costing hundreds or thousands of cycles, > which seems absurd. Particularly since Eric's patch doesn't touch > head.S, so the %gs save/restore is still being executed. i said this before: using segmentation tricks these days is /insane/. Segmentation is not for free, and it's not going to be cheap in the future. In fact, chances are that it will be /more/ expensive in the future, because sane OSs just make no use of them besides the trivial "they dont even exist" uses. so /at a minimum/, as i suggested it before, the kernel's segment use should not overlap that of glibc's. I.e. the kernel should use %fs, not %gs. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:32 ` Ingo Molnar @ 2006-11-15 17:59 ` Jeremy Fitzhardinge 2006-11-15 18:05 ` Eric Dumazet 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 17:59 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Eric Dumazet, akpm, Arjan van de Ven, linux-kernel Ingo Molnar wrote: > i said this before: using segmentation tricks these days is /insane/. > Segmentation is not for free, and it's not going to be cheap in the > future. In fact, chances are that it will be /more/ expensive in the > future, because sane OSs just make no use of them besides the trivial > "they dont even exist" uses. > Many, many systems use %fs/%gs to implement some kind of thread-local storage, and such usage is becoming more common; the PDA's use of it in the kernel is no different. I would agree that using all the obscure corners of segmentation is just asking for trouble, but using %gs as an address offset seems like something that's going to be efficient on x86 32/64 processors indefinitely. > so /at a minimum/, as i suggested it before, the kernel's segment use > should not overlap that of glibc's. I.e. the kernel should use %fs, not > %gs. Last time you raised this I did a pretty comprehensive set of tests which showed there was flat out zero difference between using %fs and %gs. There doesn't seem to be anything to the theory that reloading a null segment selector is in any way cheaper than loading a real selector. Did you find a problem in my methodology? J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:59 ` Jeremy Fitzhardinge @ 2006-11-15 18:05 ` Eric Dumazet 2006-11-15 18:28 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2006-11-15 18:05 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andi Kleen, akpm, Arjan van de Ven, linux-kernel On Wednesday 15 November 2006 18:59, Jeremy Fitzhardinge wrote: > Ingo Molnar wrote: > > i said this before: using segmentation tricks these days is /insane/. > > Segmentation is not for free, and it's not going to be cheap in the > > future. In fact, chances are that it will be /more/ expensive in the > > future, because sane OSs just make no use of them besides the trivial > > "they dont even exist" uses. > > Many, many systems use %fs/%gs to implement some kind of thread-local > storage, and such usage is becoming more common; the PDA's use of it in > the kernel is no different. I would agree that using all the obscure > corners of segmentation is just asking for trouble, but using %gs as an > address offset seems like something that's going to be efficient on x86 > 32/64 processors indefinitely. > > > so /at a minimum/, as i suggested it before, the kernel's segment use > > should not overlap that of glibc's. I.e. the kernel should use %fs, not > > %gs. > > Last time you raised this I did a pretty comprehensive set of tests > which showed there was flat out zero difference between using %fs and > %gs. There doesn't seem to be anything to the theory that reloading a > null segment selector is in any way cheaper than loading a real > selector. Did you find a problem in my methodology? I have the feeling (most probably wrong, but I prefer to speak than keeping this for myself) that the cost of segment load is delayed up to the first use of a segment selector. Sort of a lazy reload... I had this crazy idea while looking at oprofile numbers ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 18:05 ` Eric Dumazet @ 2006-11-15 18:28 ` Jeremy Fitzhardinge 2006-11-15 18:31 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 18:28 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Andi Kleen, akpm, Arjan van de Ven, linux-kernel Eric Dumazet wrote: > I have the feeling (most probably wrong, but I prefer to speak than keeping > this for myself) that the cost of segment load is delayed up to the first use > of a segment selector. Sort of a lazy reload... > Probably not too much, since the load itself has to raise a fault if there's any problem with the segment itself, and once it is loaded you can change the underlying descriptor without affecting the segment register. Even if it were lazy, that would only make the first %gs use a bit slow, and shouldn't affect the subsequent ones. However, when I measured segment register use timings, I didn't see any dramatic costs associated with segment register use which would account for a 5% hit in your benchmark. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 18:28 ` Jeremy Fitzhardinge @ 2006-11-15 18:31 ` Ingo Molnar 0 siblings, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:31 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Eric Dumazet, Andi Kleen, akpm, Arjan van de Ven, linux-kernel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > [...] However, when I measured segment register use timings, I didn't > see any dramatic costs associated with segment register use which > would account for a 5% hit in your benchmark. if by that measurement you mean time-segops.c, i dont think it correctly measures 'mixed' use of different selector values for the same %gs segment selector. And that's what i suggested for you to measure in September, and that's what Eric's testcase triggers too. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
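One way to read Ingo's objection is that the interesting case is a segment register that keeps flip-flopping between two different selector values, the way %gs alternates between the user TLS selector and __KERNEL_PDA on every kernel entry/exit. A userspace approximation follows; it is illustrative only, cannot reproduce the privilege transition, and all names are this sketch's own. It sets up a second flat data segment with set_thread_area() and alternates it with the %ds selector in %fs:

/* gcc -m32 -O2 -o mixed-sel mixed-sel.c -lrt */
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>		/* struct user_desc */

#define COUNT 50000000

int main(void)
{
	struct user_desc tls;
	struct timespec a, b;
	unsigned int sel_ds, sel_tls;
	static int dummy;
	long long ns;
	int i;

	/* allocate a fresh GDT TLS slot: base 0, 4GB limit, 32-bit data */
	memset(&tls, 0, sizeof(tls));
	tls.entry_number = -1;		/* let the kernel pick a slot */
	tls.limit = 0xfffff;
	tls.seg_32bit = 1;
	tls.limit_in_pages = 1;
	tls.useable = 1;
	if (syscall(SYS_set_thread_area, &tls) != 0) {
		perror("set_thread_area");
		return 1;
	}
	sel_tls = tls.entry_number * 8 + 3;	/* GDT, RPL 3 */
	asm volatile("mov %%ds, %0" : "=r" (sel_ds));

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &a);
	for (i = 0; i < COUNT; i++) {
		/* alternate two different non-null selectors in %fs,
		   touching memory through each, kernel-entry style */
		asm volatile("mov %1, %%fs; addl $1, %%fs:%0"
			     : "+m" (dummy) : "r" (sel_tls));
		asm volatile("mov %1, %%fs; addl $1, %%fs:%0"
			     : "+m" (dummy) : "r" (sel_ds));
	}
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &b);

	ns = (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
	printf("%lld ns / iteration (two reloads each)\n", ns / COUNT);
	return 0;
}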
* Re: [PATCH] i386-pda UP optimization 2006-11-15 17:28 ` Jeremy Fitzhardinge 2006-11-15 17:32 ` Ingo Molnar @ 2006-11-15 18:01 ` Arjan van de Ven 2006-11-15 18:24 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 45+ messages in thread From: Arjan van de Ven @ 2006-11-15 18:01 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ingo Molnar, Andi Kleen, Eric Dumazet, akpm, linux-kernel On Wed, 2006-11-15 at 09:28 -0800, Jeremy Fitzhardinge wrote: > Ingo Molnar wrote: > > Eric's test shows a 5% slowdown. That's far from cheap. > > > > It seems like an absurdly large difference. PDA references aren't all > that common in the kernel; for the %gs prefix on PDA accesses to be > causing a 5% overall difference in a test like this means that the > prefixes would have to be costing hundreds or thousands of cycles, which > seems absurd. Particularly since Eric's patch doesn't touch head.S, so > the %gs save/restore is still being executed. segment register accesses really are not cheap. Also really it'll be better to use the register userspace is not using, but we had that discussion before; could you remind me why you picked %gs in the first place? -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 18:01 ` Arjan van de Ven @ 2006-11-15 18:24 ` Jeremy Fitzhardinge 2006-11-15 19:06 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 18:24 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, Andi Kleen, Eric Dumazet, akpm, linux-kernel Arjan van de Ven wrote: > segment register accesses really are not cheap. > Also really it'll be better to use the register userspace is not using, > but we had that discussion before; could you remind me why you picked > %gs in the first place? > To leave open the possibility of using the compiler's TLS support in the kernel for percpu. I also measured the cost of reloading %gs vs %fs, and found no difference between reloading a null selector vs a non-null selector. J ^ permalink raw reply [flat|nested] 45+ messages in thread
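As background for what "the compiler's TLS support" buys: with __thread, gcc itself emits the segment-relative addressing, so a per-CPU accessor needs no hand-written inline asm at all. Below is a userspace model of the idea with hypothetical names; it is not Jeremy's actual percpu patch, and in the kernel the segment base would point at a per-CPU area rather than a pthread TCB:

/* gcc -m32 -O2 -S percpu-tls-model.c  -- the accesses below come out
   as %gs-relative loads/stores on i386, with no asm() in the source */
#include <stdio.h>

#define DEFINE_PER_CPU(type, name)	__thread type per_cpu__##name
#define __get_cpu_var(name)		per_cpu__##name

DEFINE_PER_CPU(unsigned long, nr_events);

static void count_event(void)
{
	__get_cpu_var(nr_events)++;	/* compiler picks the addressing */
}

int main(void)
{
	count_event();
	count_event();
	printf("nr_events = %lu\n", __get_cpu_var(nr_events));
	return 0;
}

The attraction is that, unlike asm-based read_pda()/write_pda() accessors, the compiler can cache, combine and reorder these accesses like ordinary memory operands.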
* Re: [PATCH] i386-pda UP optimization 2006-11-15 18:24 ` Jeremy Fitzhardinge @ 2006-11-15 19:06 ` Ingo Molnar 2006-11-17 0:24 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 19:06 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, Andi Kleen, Eric Dumazet, akpm, linux-kernel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Arjan van de Ven wrote: > > segment register accesses really are not cheap. > > Also really it'll be better to use the register userspace is not using, > > but we had that discussion before; could you remind me why you picked > > %gs in the first place? > > > > To leave open the possibility of using the compiler's TLS support in > the kernel for percpu. I also measured the cost of reloading %gs vs > %fs, and found no difference between reloading a null selector vs a > non-null selector. what point would there be in using it? It's not like the kernel could make use of the thread keyword anytime soon (it would need /all/ architectures to support it) ... and the kernel doesnt mind how the current per_cpu() primitives are implemented, via assembly or via C. In any case, it very much matters to see the precise cost of having the pda selector value in %gs versus %fs. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 19:06 ` Ingo Molnar @ 2006-11-17 0:24 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-17 0:24 UTC (permalink / raw) To: Ingo Molnar Cc: Arjan van de Ven, Andi Kleen, Eric Dumazet, akpm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 843 bytes --] Ingo Molnar wrote: > what point would there be in using it? It's not like the kernel could > make use of the thread keyword anytime soon (it would need /all/ > architectures to support it) ... The plan was to implement the x86 arch-specific percpu stuff to use it, since it allows gcc better optimisation opportunities. > and the kernel doesnt mind how the > current per_cpu() primitives are implemented, via assembly or via C. In > any case, it very much matters to see the precise cost of having the pda > selector value in %gs versus %fs. > Hm, well, unfortunately for me, there is a small but distinct advantage to using %fs rather than %gs (around 0-5ns per iteration). The notable exception being the "AMD-K6(tm) 3D+ Processor", where %gs is about 25% (15ns) faster. I'll revise the patches to use %fs and resubmit. J [-- Attachment #2: results-mixed.txt --] [-- Type: text/plain, Size: 3720 bytes --] "Genuine Intel(R) CPU T2400 @ 1.83GHz" @1000Mhz (6,14,8): ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME <none> with data selector: 0ns/iteration fs with data selector: 26ns/iteration gs with data selector: 30ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 26ns/iteration gs with LDT selector: 26ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 26ns/iteration gs with GDT selector: 30ns/iteration "Intel(R) Pentium(R) 4 CPU 1.80GHz" @1817.9Mhz (15,2,4): ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME <none> with data selector: 0ns/iteration fs with data selector: 33ns/iteration gs with data selector: 34ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 43ns/iteration gs with LDT selector: 52ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 33ns/iteration gs with GDT selector: 34ns/iteration "Intel(R) Celeron(R) CPU 2.40GHz" @2394.47Mhz (15,2,9): ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME <none> with data selector: 0ns/iteration fs with data selector: 20ns/iteration gs with data selector: 24ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 21ns/iteration gs with LDT selector: 26ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 21ns/iteration gs with GDT selector: 26ns/iteration "Pentium 75 - 200" @166.206Mhz (5,2,12): ds=7b fs=0 gs=33 ldt=f gdt=3b GTOD <none> with data selector: 1ns/iteration fs with data selector: 74ns/iteration gs with data selector: 75ns/iteration <none> with LDT selector: 1ns/iteration fs with LDT selector: 74ns/iteration gs with LDT selector: 75ns/iteration <none> with GDT selector: 1ns/iteration fs with GDT selector: 74ns/iteration gs with GDT selector: 74ns/iteration "AMD-K6(tm) 3D+ Processor" @451.105Mhz (5,9,1): ds=7b fs=0 gs=33 ldt=f gdt=3b GTOD <none> with data selector: 0ns/iteration fs with data selector: 59ns/iteration gs with data selector: 44ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 59ns/iteration gs with LDT selector: 44ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 59ns/iteration gs with GDT selector: 44ns/iteration "AMD Athlon(tm) XP 3000+" @2162.74Mhz (6,10,0): ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME <none> with data selector: 
0ns/iteration fs with data selector: 10ns/iteration gs with data selector: 11ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 11ns/iteration gs with LDT selector: 11ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 11ns/iteration gs with GDT selector: 11ns/iteration "AMD Athlon(tm) 64 Processor 3500+" @2210.23Mhz (15,31,0): ds=2b fs=0 gs=63 ldt=f gdt=6b GTOD <none> with data selector: 0ns/iteration fs with data selector: 11ns/iteration gs with data selector: 11ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 10ns/iteration gs with LDT selector: 11ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 10ns/iteration gs with GDT selector: 11ns/iteration "Pentium III (Coppermine)" @700Mhz (6,8,6): ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME <none> with data selector: 0ns/iteration fs with data selector: 38ns/iteration gs with data selector: 45ns/iteration <none> with LDT selector: 0ns/iteration fs with LDT selector: 39ns/iteration gs with LDT selector: 41ns/iteration <none> with GDT selector: 0ns/iteration fs with GDT selector: 39ns/iteration gs with GDT selector: 44ns/iteration ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 11:27 ` [PATCH] i386-pda UP optimization Eric Dumazet 2006-11-15 11:32 ` Andi Kleen @ 2006-11-15 17:52 ` Jeremy Fitzhardinge 2006-11-28 23:12 ` Jeremy Fitzhardinge 2 siblings, 0 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 17:52 UTC (permalink / raw) To: Eric Dumazet; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel Eric Dumazet wrote: > I wish Jeremy give us patches for UP machines so that %gs can be let untouched > in entry.S (syscall entry/exit). A lot of ia32 machines are still using one > CPU. > Unfortunately that would add cruft in a number of places. At the moment, context switch, ptrace and vm86 all assume entry.S has saved %gs into pt_regs, so they can treat it like any other register. If this were conditional, it would require multiple places to add #ifndef CONFIG_SMP code, which is not something I'd like to do without a good reason. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-15 11:27 ` [PATCH] i386-pda UP optimization Eric Dumazet 2006-11-15 11:32 ` Andi Kleen 2006-11-15 17:52 ` Jeremy Fitzhardinge @ 2006-11-28 23:12 ` Jeremy Fitzhardinge 2006-11-29 9:30 ` Eric Dumazet 2 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-28 23:12 UTC (permalink / raw) To: Eric Dumazet; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel Eric Dumazet wrote: > Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile > results on Opteron machines. Hi Eric, Could you try this patch out and see if it makes much performance difference for you. You should apply this on top of the %fs patch I posted earlier (and use the %fs patch as the baseline for your comparisons). Thanks, J Don't bother with segment references for UP PDA When compiled for UP, don't bother prefixing PDA references with a segment override. Also doesn't bother reloading the PDA segment register (though it still gets saved and restored, because the value is used elsewhere in the kernel, and the restore is necessary for correct context switches). I'm not very keen on the extra #ifdefs this adds, though I've tried to keep them minimal. Eric Dumazet reports small performance gains from similar patch however. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@muc.de> Cc: Eric Dumazet <dada1@cosmosbay.com> diff -r 022c29ea754e arch/i386/kernel/cpu/common.c --- a/arch/i386/kernel/cpu/common.c Tue Nov 21 18:54:56 2006 -0800 +++ b/arch/i386/kernel/cpu/common.c Wed Nov 22 01:54:02 2006 -0800 @@ -628,7 +628,11 @@ static __cpuinit int alloc_gdt(int cpu) BUG_ON(gdt != NULL || pda != NULL); gdt = alloc_bootmem_pages(PAGE_SIZE); +#ifdef CONFIG_SMP + pda = &boot_pda; +#else pda = alloc_bootmem(sizeof(*pda)); +#endif /* alloc_bootmem(_pages) panics on failure, so no check */ memset(gdt, 0, PAGE_SIZE); @@ -661,6 +665,10 @@ struct i386_pda boot_pda = { .cpu_number = 0, .pcurrent = &init_task, }; +#ifndef CONFIG_SMP +/* boot_pda is used for all PDA access in UP */ +EXPORT_SYMBOL(boot_pda); +#endif static inline void set_kernel_fs(void) { diff -r 022c29ea754e arch/i386/kernel/entry.S --- a/arch/i386/kernel/entry.S Tue Nov 21 18:54:56 2006 -0800 +++ b/arch/i386/kernel/entry.S Wed Nov 22 13:38:56 2006 -0800 @@ -97,6 +97,16 @@ 1: #define resume_userspace_sig resume_userspace #endif +#ifdef CONFIG_SMP +#define LOAD_PDA_SEG(reg) \ + movl $(__KERNEL_PDA), reg; \ + movl reg, %fs +#define CUR_CPU(reg) movl %fs:PDA_cpu, reg +#else +#define LOAD_PDA_SEG(reg) +#define CUR_CPU(reg) movl boot_pda+PDA_cpu, reg +#endif + #define SAVE_ALL \ cld; \ pushl %fs; \ @@ -132,8 +142,7 @@ 1: movl $(__USER_DS), %edx; \ movl %edx, %ds; \ movl %edx, %es; \ - movl $(__KERNEL_PDA), %edx; \ - movl %edx, %fs + LOAD_PDA_SEG(%edx) #define RESTORE_INT_REGS \ popl %ebx; \ @@ -546,7 +555,7 @@ syscall_badsys: #define FIXUP_ESPFIX_STACK \ /* since we are on a wrong stack, we cant make it a C code :( */ \ - movl %fs:PDA_cpu, %ebx; \ + CUR_CPU(%ebx); \ PER_CPU(cpu_gdt_descr, %ebx); \ movl GDS_address(%ebx), %ebx; \ GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \ diff -r 022c29ea754e include/asm-i386/pda.h --- a/include/asm-i386/pda.h Tue Nov 21 18:54:56 2006 -0800 +++ b/include/asm-i386/pda.h Wed Nov 22 02:35:24 2006 -0800 @@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[]; #define cpu_pda(i) (_cpu_pda[i]) +/* Use boot-time PDA for UP. For SMP we still need to declare it, but + it isn't used. 
*/ +extern struct i386_pda boot_pda; + +#ifdef CONFIG_SMP +#define PDA_REF "%%fs:%c[off]" +#else +#define PDA_REF "%[mem]" +#endif + #define pda_offset(field) offsetof(struct i386_pda, field) extern void __bad_pda_field(void); @@ -33,28 +43,31 @@ extern void __bad_pda_field(void); clobbers, so gcc can readily analyse them. */ extern struct i386_pda _proxy_pda; -#define pda_to_op(op,field,val) \ +#define pda_to_op(op,field,_val) \ do { \ typedef typeof(_proxy_pda.field) T__; \ - if (0) { T__ tmp__; tmp__ = (val); } \ + if (0) { T__ tmp__; tmp__ = (_val); } \ switch (sizeof(_proxy_pda.field)) { \ case 1: \ - asm(op "b %1,%%fs:%c2" \ - : "+m" (_proxy_pda.field) \ - :"ri" ((T__)val), \ - "i"(pda_offset(field))); \ + asm(op "b %[val]," PDA_REF \ + : "+m" (_proxy_pda.field), \ + [mem] "+m" (boot_pda.field) \ + : [val] "ri" ((T__)_val), \ + [off] "i" (pda_offset(field))); \ break; \ case 2: \ - asm(op "w %1,%%fs:%c2" \ - : "+m" (_proxy_pda.field) \ - :"ri" ((T__)val), \ - "i"(pda_offset(field))); \ + asm(op "w %[val]," PDA_REF \ + : "+m" (_proxy_pda.field), \ + [mem] "+m" (boot_pda.field) \ + : [val] "ri" ((T__)_val), \ + [off] "i" (pda_offset(field))); \ break; \ case 4: \ - asm(op "l %1,%%fs:%c2" \ - : "+m" (_proxy_pda.field) \ - :"ri" ((T__)val), \ - "i"(pda_offset(field))); \ + asm(op "l %[val]," PDA_REF \ + : "+m" (_proxy_pda.field), \ + [mem] "+m" (boot_pda.field) \ + : [val] "ri" ((T__)_val), \ + [off] "i" (pda_offset(field))); \ break; \ default: __bad_pda_field(); \ } \ @@ -65,22 +78,25 @@ extern struct i386_pda _proxy_pda; typeof(_proxy_pda.field) ret__; \ switch (sizeof(_proxy_pda.field)) { \ case 1: \ - asm(op "b %%fs:%c1,%0" \ - : "=r" (ret__) \ - : "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ + asm(op "b " PDA_REF ",%[ret]" \ + : [ret] "=r" (ret__) \ + : [off] "i" (pda_offset(field)), \ + "m" (_proxy_pda.field), \ + [mem] "m" (boot_pda.field)); \ break; \ case 2: \ - asm(op "w %%fs:%c1,%0" \ - : "=r" (ret__) \ - : "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ + asm(op "w " PDA_REF ",%[ret]" \ + : [ret] "=r" (ret__) \ + : [off] "i" (pda_offset(field)), \ + "m" (_proxy_pda.field), \ + [mem] "m" (boot_pda.field)); \ break; \ case 4: \ - asm(op "l %%fs:%c1,%0" \ - : "=r" (ret__) \ - : "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ + asm(op "l " PDA_REF ",%[ret]" \ + : [ret] "=r" (ret__) \ + : [off] "i" (pda_offset(field)), \ + "m" (_proxy_pda.field), \ + [mem] "m" (boot_pda.field)); \ break; \ default: __bad_pda_field(); \ } \ ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-28 23:12 ` Jeremy Fitzhardinge @ 2006-11-29 9:30 ` Eric Dumazet 2006-11-29 9:56 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2006-11-29 9:30 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel On Wednesday 29 November 2006 00:12, Jeremy Fitzhardinge wrote: > Hi Eric, > > Could you try this patch out and see if it makes much performance > difference for you. You should apply this on top of the %fs patch I > posted earlier (and use the %fs patch as the baseline for your > comparisons). Hi Jeremy I will try this as soon as possible, thank you. However I have some remarks browsing your patch. > +#ifdef CONFIG_SMP > +#define LOAD_PDA_SEG(reg) \ > + movl $(__KERNEL_PDA), reg; \ > + movl reg, %fs > +#define CUR_CPU(reg) movl %fs:PDA_cpu, reg > +#else > +#define LOAD_PDA_SEG(reg) > +#define CUR_CPU(reg) movl boot_pda+PDA_cpu, reg if !CONFIG_SMP, why even dereferencing boot_pda+PDA_cpu to get 0 ? and as PER_CPU(cpu_gdt_descr, %ebx) in !CONFIG_SMP doesnt need the a value in ebx, you can just do : #define CUR_CPU(reg) /* nothing */ > --- a/include/asm-i386/pda.h Tue Nov 21 18:54:56 2006 -0800 > +++ b/include/asm-i386/pda.h Wed Nov 22 02:35:24 2006 -0800 > @@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[]; > My patch was better IMHO : we dont need to force asm () instructions to perform regular C variable reading/writing in !CONFIG_SMP case. Using plain C allows compiler to generate a better code. Eric ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] i386-pda UP optimization 2006-11-29 9:30 ` Eric Dumazet @ 2006-11-29 9:56 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-29 9:56 UTC (permalink / raw) To: Eric Dumazet; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel Eric Dumazet wrote: > if !CONFIG_SMP, why even dereferencing boot_pda+PDA_cpu to get 0 ? > and as PER_CPU(cpu_gdt_descr, %ebx) in !CONFIG_SMP doesnt need the a value in > ebx, you can just do : > > #define CUR_CPU(reg) /* nothing */ > Yep. On the other hand, I think that's an incredibly rare path anyway, so it won't make any difference either way. >> --- a/include/asm-i386/pda.h Tue Nov 21 18:54:56 2006 -0800 >> +++ b/include/asm-i386/pda.h Wed Nov 22 02:35:24 2006 -0800 >> @@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[]; >> >> > > My patch was better IMHO : we dont need to force asm () instructions to > perform regular C variable reading/writing in !CONFIG_SMP case. > > Using plain C allows compiler to generate a better code. > Probably, but I'm interested in comparing apples with apples; how much do the actual segment prefixes make a difference? J ^ permalink raw reply [flat|nested] 45+ messages in thread
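Eric's "plain C allows better code" point, reduced to a toy example (illustrative only; struct and field names are stand-ins, and the kernel's real accessors are subtler since they are non-volatile and use a proxy operand to describe dependencies). With the plain-C form the compiler is free to merge the two loads; with one asm volatile per access it must emit both:

struct pda_model { int cpu_number; } boot_pda_model;

/* UP style in Eric's patch: ordinary memory accesses the optimizer
   can fold into a single load */
int twice_plain(void)
{
	return boot_pda_model.cpu_number + boot_pda_model.cpu_number;
}

/* accessor-per-access style: each volatile asm is an opaque
   instruction, so both reads are emitted */
int twice_asm(void)
{
	int a, b;
	asm volatile("movl %1, %0" : "=r" (a) : "m" (boot_pda_model.cpu_number));
	asm volatile("movl %1, %0" : "=r" (b) : "m" (boot_pda_model.cpu_number));
	return a + b;
}

Comparing the two with gcc -m32 -O2 -S shows the kind of optimizer freedom Eric is arguing for; Jeremy's counterpoint is that for benchmarking the %fs/%gs prefix itself, both kernels should keep the same accessor shape.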
* Re: i386 PDA patches use of %gs 2006-09-12 7:35 i386 PDA patches use of %gs Arjan van de Ven 2006-09-12 7:48 ` Jeremy Fitzhardinge @ 2006-09-13 1:00 ` Jeremy Fitzhardinge 2006-09-13 9:59 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-09-13 1:00 UTC (permalink / raw) To: Arjan van de Ven Cc: akpm, ak, mingo, linux-kernel, Michael.Fetterman, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 3131 bytes --]

Arjan van de Ven wrote:
> The advantage of this is very simple: %fs will be 0 for userspace most
> of the time. Putting 0 in a segment register is cheap for the cpu,
> putting anything else in is quite expensive (a LOT of security checks
> need to happen). As such I would MUCH rather see that the i386 PDA
> patches use %fs and not %gs...

Hi Arjan,

I spent some time trying to measure this, to see if there really is a
difference between loading a null selector vs a non-null one.  The short
answer is no: I couldn't measure any difference at all, on any CPU going
back to a P166, up to a current Core Duo machine.

I used a usermode test model of the entry.S code in order to make it
easier to test on more machines.  The basic inner loop is:

	push %segreg
	mov %selectorreg, %segreg
	add $1,%segreg:offset	# use the segment register
	pop %segreg

I also unrolled the loop to minimize the overhead from anything else.
This is clearly much more segment-register intensive than any real use,
so I'm hoping that this should exacerbate any performance differences.

I also tried to put cpuid in the loop in order to approximate the
synchronizing effects of taking an exception, but it didn't seem to make
much difference other than slowing everything down by a constant amount
(the cpuid slowdown swamped pretty much everything else on Intel CPUs,
but was much less intrusive on the Athlon64).

I tried the push/load/pop sequence with both %fs and %gs, where pop %fs
would result in a null selector load, and pop %gs would load the normal
userspace TLS selector.  I also tried loading 3 types of selector after
the push:

    * the normal usermode ds selector, on the grounds that the CPU might
      be more efficient in reloading a selector which is already in use
    * an ldt selector, which I thought might be slower since (at least
      conceptually) there's an indirection into a different descriptor
      table
    * and a gdt selector (the normally unused second TLS selector)

In general, I got identical results for all of these.  There were two
exceptions:

    * The 1.8 GHz P4 Northwood was slower loading the LDT selector as
      expected, and pop %fs was faster than pop %gs.  The GDT and data
      selector results were the same independent of %fs or %gs.
    * The AMD K6 was consistently *slower* with pop %fs; pop %gs was
      faster.  I didn't try reversing the uses of %fs and %gs to see if
      it was the null selector being slower, or some inherent slowness
      in using %fs.

It's possible I got something wrong, and I'm not really measuring what I
think I'm measuring.  The main thing that worries me about the results is
that they don't scale much at all in proportion to the clock speed.
Otherwise the results look sensible to me.  I'd appreciate it if people
could review the test program to see if I've overlooked something.

So, in summary, I don't think there's much point in switching to %fs.  I
may get around to confirming this by doing a %gs->%fs conversion patch,
but given these results that's at a fairly low priority.

I've attached my test program and results.
J

[-- Attachment #2: time-segops.c --]
[-- Type: text/x-csrc, Size: 5235 bytes --]

/* gcc -m32 -O3 -Wall -fomit-frame-pointer -funroll-loops -g -o time-segops time-segops.c -lrt */
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>
#include <ctype.h>
#include <asm/unistd.h>

#define GTOD	0
#define SYNC	0

#define COUNT	50000000

/* different glibc's call this different things, so define our own */
struct desc {
	unsigned int  entry_number;
	unsigned long base_addr;
	unsigned int  limit;
	unsigned int  seg_32bit:1;
	unsigned int  contents:2;
	unsigned int  read_exec_only:1;
	unsigned int  limit_in_pages:1;
	unsigned int  seg_not_present:1;
	unsigned int  useable:1;
};

/* These don't seem to be consistently defined in glibc */
static int set_thread_area(struct desc *desc)
{
	int ret;

	asm("int $0x80"
	    : "=a" (ret)
	    : "0" (__NR_set_thread_area), "b" (desc)
	    : "memory");

	if (ret < 0) {
		errno = -ret;
		ret = -1;
	}
	return ret;
}

static int modify_ldt(int func, struct desc *desc, int size)
{
	int ret;

	asm("int $0x80"
	    : "=a" (ret)
	    : "0" (__NR_modify_ldt), "b" (func), "c" (desc), "d" (size)
	    : "memory");

	if (ret < 0) {
		errno = -ret;
		ret = -1;
	}
	return ret;
}

static inline unsigned long long now(void)
{
#if GTOD
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000000ull + tv.tv_usec * 1000ull;
#else
	struct timespec ts;

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec * 1000000000ull + ts.tv_nsec;
#endif
}

/* Simulate an exception's effect on the pipeline? */
static inline void sync(void)
{
	if (0) {
		int a,b,c,d;
		asm volatile("cpuid"
			     : "=a" (a), "=b" (b), "=c" (c), "=d" (d)
			     : "0" (0), "2" (0)
			     : "memory");
	} else
		asm volatile("" : : : "memory");
}

static const char *test_none(int seg, int *offset)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		sync();
	}
	return "<none>";
}

static const char *test_fs(int seg, int *offset)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
			     : "+m" (*offset): "r" (seg) : "memory");
		sync();
	}
	return "fs";
}

static const char *test_gs(int seg, int *offset)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
			     : "+m" (*offset): "r" (seg) : "memory");
		sync();
	}
	return "gs";
}

typedef const char *(*test_t)(int, int *);

static const test_t tests[] = {
	test_none,
	test_fs,
	test_gs,
	NULL,
};

static int segment[1];

static void test(int seg, int *offset, const char *segdesc)
{
	int i;

	for(i = 0; tests[i]; i++) {
		unsigned long long start, end;
		unsigned long long delta;
		const char *t;

		start = now();
		t = (*tests[i])(seg, offset);
		end = now();
		delta = (end - start);

		printf(" %s with %s selector: %lluns/iteration\n",
		       t, segdesc, delta / COUNT);
	}
}

struct cpu {
	char modelname[100];
	int family, model, stepping;
	float speed;
};

static int cpu_details(struct cpu *cpu)
{
	FILE *fp = fopen("/proc/cpuinfo", "r");
	char buf[500];

	if (fp == NULL) {
		perror("open /proc/cpuinfo");
		return 0;
	}

	while(fgets(buf, sizeof(buf), fp) != NULL) {
		char *col = strchr(buf, ':');
		char *val;

		if (col == NULL)
			continue;
		val = col+1;
		while(*val == ' ')
			val++;
		col--;
		while(col > buf && isspace(*col))
			col--;
		col[1] = 0;

		col = strchr(val, '\n');
		if (col)
			*col = 0;

		//printf("name=%s val=%s\n", buf, val);
		if (strcmp(buf, "model name") == 0)
			strcpy(cpu->modelname, val);
		if (strcmp(buf, "cpu family") == 0)
			sscanf(val, "%d", &cpu->family);
		if (strcmp(buf, "model") == 0)
			sscanf(val, "%d", &cpu->model);
		if (strcmp(buf, "stepping") == 0)
			sscanf(val, "%d", &cpu->stepping);
		if (strcmp(buf, "cpu MHz") == 0)
			sscanf(val, "%f", &cpu->speed);

		if (strcmp(buf, "processor") == 0 && strcmp(val, "0") != 0)
			break;
	}

	fclose(fp);
	return 1;
}

int main()
{
	int ds, fs, gs;
	static struct desc desc = {
		.entry_number = 1,
		.base_addr = (unsigned long)segment,
		.limit = sizeof(segment)-1,
		.seg_32bit = 1,
		.contents = 0,
		.read_exec_only = 0,
		.limit_in_pages = 0,
		.seg_not_present = 0,
		.useable = 1,
	};
	int gdtseg, ldtseg;
	struct cpu cpu;
	float speed;

	if (!cpu_details(&cpu)) {
		printf("can't read CPU details");
		return 1;
	}
	speed = cpu.speed;

	if (modify_ldt(1, &desc, sizeof(desc)) == -1)
		perror("modify ldt");
	ldtseg = desc.entry_number * 8 | 4 | 3;

	desc.entry_number = -1;
	if (set_thread_area(&desc) == -1)
		perror("set_thread_area");
	gdtseg = desc.entry_number * 8 | 3;

	asm volatile("mov %%ds, %0; "
		     "mov %%fs, %1; "
		     "mov %%gs, %2"
		     : "=r" (ds), "=r" (fs), "=r" (gs)
		     :
		     : "memory");

	printf("\"%s\" @%gMhz (%d,%d,%d):\n",
	       cpu.modelname, cpu.speed, cpu.family, cpu.model, cpu.stepping);
	printf("ds=%x fs=%x gs=%x ldt=%x gdt=%x %s %s\n",
	       ds, fs, gs, ldtseg, gdtseg,
	       GTOD ? "GTOD" : "CPUTIME",
	       SYNC ? "SYNC" : "");

	test(ds, segment, "data");
	printf("\n");
	test(ldtseg, 0, "LDT");
	printf("\n");
	test(gdtseg, 0, "GDT");

	if (cpu_details(&cpu)) {
		if (speed != cpu.speed)
			printf("cpu speed changed %f->%f?! disable CPUFREQ\n",
			       speed, cpu.speed);
	}

	return 0;
}

[-- Attachment #3: results-nosync.txt --]
[-- Type: text/plain, Size: 3164 bytes --]

"Genuine Intel(R) CPU T2400 @ 1.83GHz" @1000Mhz (6,14,8):
fs=0 gs=33 ldt=f gdt=3b
 <none> with data selector: 0ns/iteration
 fs with data selector: 27ns/iteration
 gs with data selector: 28ns/iteration

 <none> with LDT selector: 0ns/iteration
 fs with LDT selector: 27ns/iteration
 gs with LDT selector: 28ns/iteration

 <none> with GDT selector: 0ns/iteration
 fs with GDT selector: 27ns/iteration
 gs with GDT selector: 28ns/iteration

"AMD Athlon(tm) 64 Processor 3500+" @1000Mhz (15,15,0):
fs=0 gs=63 ldt=f gdt=6b
 <none> with data selector: 0ns/iteration
 fs with data selector: 10ns/iteration
 gs with data selector: 10ns/iteration

 <none> with LDT selector: 0ns/iteration
 fs with LDT selector: 10ns/iteration
 gs with LDT selector: 10ns/iteration

 <none> with GDT selector: 0ns/iteration
 fs with GDT selector: 10ns/iteration
 gs with GDT selector: 10ns/iteration

"Intel(R) Pentium(R) 4 CPU 1.80GHz" @1817.91Mhz (15,2,4):
fs=0 gs=33 ldt=f gdt=3b
 <none> with data selector: 0ns/iteration
 fs with data selector: 30ns/iteration
 gs with data selector: 31ns/iteration

 <none> with LDT selector: 0ns/iteration
 fs with LDT selector: 40ns/iteration
 gs with LDT selector: 44ns/iteration

 <none> with GDT selector: 0ns/iteration
 fs with GDT selector: 30ns/iteration
 gs with GDT selector: 31ns/iteration

"Intel(R) Celeron(R) CPU 2.40GHz" @2394.47Mhz (15,2,9):
fs=0 gs=33 ldt=f gdt=3b
 <none> with data selector: 0ns/iteration
 fs with data selector: 27ns/iteration
 gs with data selector: 25ns/iteration

 <none> with LDT selector: 0ns/iteration
 fs with LDT selector: 25ns/iteration
 gs with LDT selector: 25ns/iteration

 <none> with GDT selector: 0ns/iteration
 fs with GDT selector: 24ns/iteration
 gs with GDT selector: 25ns/iteration

"Pentium 75 - 200" @166.213Mhz (5,2,12):
fs=0 gs=33 ldt=f gdt=3b
 <none> with data selector: 1ns/iteration
 fs with data selector: 57ns/iteration
 gs with data selector: 57ns/iteration

 <none> with LDT selector: 1ns/iteration
 fs with LDT selector: 57ns/iteration
 gs with LDT selector: 57ns/iteration

 <none> with GDT selector: 1ns/iteration
 fs with GDT selector: 57ns/iteration
 gs with GDT selector: 57ns/iteration

"AMD-K6(tm) 3D+ Processor" @451.105Mhz (5,9,1):
fs=0 gs=33 ldt=f gdt=3b
 <none> with data selector: 0ns/iteration
 fs with data selector: 57ns/iteration
 gs with data selector: 44ns/iteration

 <none> with LDT selector: 0ns/iteration
 fs with LDT selector: 57ns/iteration
 gs with LDT selector: 44ns/iteration

 <none> with GDT selector: 0ns/iteration
 fs with GDT selector: 57ns/iteration
 gs with GDT selector: 44ns/iteration

"Pentium III (Coppermine)" @700Mhz (6,8,6):
fs=0 gs=33 ldt=f gdt=3b
 <none> with data selector: 0ns/iteration
 fs with data selector: 46ns/iteration
 gs with data selector: 46ns/iteration

 <none> with LDT selector: 0ns/iteration
 fs with LDT selector: 46ns/iteration
 gs with LDT selector: 47ns/iteration

 <none> with GDT selector: 0ns/iteration
 fs with GDT selector: 46ns/iteration
 gs with GDT selector: 47ns/iteration

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-09-13 1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge @ 2006-09-13 9:59 ` Ingo Molnar 2006-09-13 16:17 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-09-13 9:59 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell

* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> [...] The basic inner loop is:
>
>	push %segreg
>	mov %selectorreg, %segreg
>	add $1,%segreg:offset	# use the segment register
>	pop %segreg

well, the most important thing i believe you didn't test: the effect of
mixing two descriptors on the _same_ selector: one %gs selector value
loaded and used by glibc, and another %gs selector value loaded and used
by the kernel, intermixed.  It's the mixing that causes the descriptor
cache reload.  (unless i missed some detail about your testcase)

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-09-13 9:59 ` Ingo Molnar @ 2006-09-13 16:17 ` Jeremy Fitzhardinge 2006-11-15 18:26 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-09-13 16:17 UTC (permalink / raw) To: Ingo Molnar Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell

Ingo Molnar wrote:
> well, the most important thing i believe you didn't test: the effect of
> mixing two descriptors on the _same_ selector: one %gs selector value
> loaded and used by glibc, and another %gs selector value loaded and used
> by the kernel, intermixed.  It's the mixing that causes the descriptor
> cache reload.  (unless i missed some detail about your testcase)

But it doesn't mix different descriptors on the same selector; the GDT is
initialized when the CPU is brought up, and is unchanged from then on.
The PDA descriptor is GDT entry 27 and the userspace TLS entries are 6-8,
so in the typical case %gs will alternate between 0x33 and 0xd8 as it
enters and leaves the kernel.

My test program does the same thing, except using GDT entries 6 and 7
(selectors 0x33 and 0x3b).

J

^ permalink raw reply	[flat|nested] 45+ messages in thread
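As an aside, the selector values quoted here follow directly from the x86 selector encoding (descriptor index times 8, plus the table-indicator and RPL bits). A quick worked sketch of the arithmetic, not part of the original mail:

	/* selector = (descriptor index << 3) | TI (4 = LDT) | RPL */
	#define GDT_SEL(idx, rpl)	(((idx) << 3) | (rpl))

	int tls_sel  = GDT_SEL(6, 3);	/* userspace TLS, entry 6, RPL 3: 6*8 + 3  = 0x33 */
	int tls2_sel = GDT_SEL(7, 3);	/* second TLS slot, entry 7, RPL 3: 7*8 + 3 = 0x3b */
	int pda_sel  = GDT_SEL(27, 0);	/* kernel PDA, entry 27, RPL 0:    27*8     = 0xd8 */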
* Re: i386 PDA patches use of %gs 2006-09-13 16:17 ` Jeremy Fitzhardinge @ 2006-11-15 18:26 ` Ingo Molnar 2006-11-15 18:29 ` Ingo Molnar 2006-11-15 18:39 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:26 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ingo Molnar wrote: > >well, the most important thing i believe you didnt test: the effect of > >mixing two descriptors on the _same_ selector: one %gs selector value > >loaded and used by glibc, and another %gs selector value loaded and used > >by the kernel, intermixed. It's the mixing that causes the descriptor > >cache reload. (unless i missed some detail about your testcase) > > But it doesn't mix different descriptors on the same selector; the GDT > is initialized when the CPU is brought up, and is unchanged from then > on. The PDA descriptor is GDT entry 27 and the userspace TLS entries > are 6-8, so in the typical case %gs will alternate between 0x33 and > 0xd8 as it enters and leaves the kernel. > > My test program does the same thing, except using GDT entries 6 and 7 > (selectors 0x33 and 0x3b). no, that's not what it does. It measures 50000000 switches of the _same_ selector value, without using any of the selectors in the loop itself. I.e. no mixing at all! But when the kernel and userspace uses %gs, it's the cost of switching between two selector values of %gs that has to be measured. Your code does not measure that at all, AFAICS. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:26 ` Ingo Molnar @ 2006-11-15 18:29 ` Ingo Molnar 2006-11-15 18:43 ` Jeremy Fitzhardinge 2006-11-15 18:39 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:29 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell

* Ingo Molnar <mingo@elte.hu> wrote:

> > My test program does the same thing, except using GDT entries 6 and
> > 7 (selectors 0x33 and 0x3b).
>
> no, that's not what it does. It measures 50000000 switches of the
> _same_ selector value, without using any of the selectors in the loop
> itself. I.e. no mixing at all! But when the kernel and userspace uses
> %gs, it's the cost of switching between two selector values of %gs
> that has to be measured. Your code does not measure that at all,
> AFAICS.

for example, your test_fs() code does:

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
			     : "+m" (*offset): "r" (seg) : "memory");
		sync();
	}

that loads (and uses) a single selector value for %fs, and doesn't do any
mixed use as far as i can see.

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:29 ` Ingo Molnar @ 2006-11-15 18:43 ` Jeremy Fitzhardinge 2006-11-15 18:44 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 18:43 UTC (permalink / raw) To: Ingo Molnar Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell

Ingo Molnar wrote:
> for example, your test_fs() code does:
>
>	for(i = 0; i < COUNT; i++) {
>		asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
>			     : "+m" (*offset): "r" (seg) : "memory");
>		sync();
>	}
>
> that loads (and uses) a single selector value for %fs, and doesn't do any
> mixed use as far as i can see.

I'm not sure what you're getting at.  Each loop iteration is analogous to
a user->kernel->user transition with respect to the
save/reload/use/restore pattern on the segment register.  In this case,
%fs starts as a null selector, gets reloaded with a non-null selector,
and then is restored to null.  Do you mean some other mixing?

J

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:43 ` Jeremy Fitzhardinge @ 2006-11-15 18:44 ` Ingo Molnar 0 siblings, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:44 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > > that loads (and uses) a single selector value for %fs, and doesnt do > > any mixed use as far as i can see. > > I'm not sure what you're getting at. Each loop iteration is analogous > to a user->kernel->user transition with respect to the > save/reload/use/restore pattern on the segment register. In this > case, %fs starts as a null selector, gets reloaded with a non NULL > selector, and then is restored to null. Do you mean some other > mixing? yeah, mixed use: i.e. set up /two/ selector values and load them into %gs and read+write memory through them. It might not change the results, but that's what i meant under 'mixed use'. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:26 ` Ingo Molnar 2006-11-15 18:29 ` Ingo Molnar @ 2006-11-15 18:39 ` Jeremy Fitzhardinge 2006-11-15 18:43 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 18:39 UTC (permalink / raw) To: Ingo Molnar Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell

Ingo Molnar wrote:
> no, that's not what it does. It measures 50000000 switches of the _same_
> selector value, without using any of the selectors in the loop itself.
> I.e. no mixing at all! But when the kernel and userspace uses %gs, it's
> the cost of switching between two selector values of %gs that has to be
> measured. Your code does not measure that at all, AFAICS.

I think you're misreading it.  This is the inner loop:

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
			     : "+m" (*offset): "r" (seg) : "memory");
		sync();
	}
	return "gs";

On entry, %gs will contain the normal usermode TLS selector.  "seg" is
another selector allocated with set_thread_area().  The asm pushes the
old %gs, loads the new one, uses a memory address via the new segment,
then restores the previous %gs.

So given this output:

	"Genuine Intel(R) CPU T2400 @ 1.83GHz" @1000Mhz (6,14,8):
	ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME
	[...]

The initial %fs and %gs are 0 and 0x33 respectively, and it is using 0x3b
as the other GDT selector (and 0xf as the other LDT selector).

J

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:39 ` Jeremy Fitzhardinge @ 2006-11-15 18:43 ` Ingo Molnar 2006-11-15 18:49 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:43 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell

* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> > no, that's not what it does. It measures 50000000 switches of the _same_
> > selector value, without using any of the selectors in the loop itself.
> > I.e. no mixing at all! But when the kernel and userspace uses %gs, it's
> > the cost of switching between two selector values of %gs that has to be
> > measured. Your code does not measure that at all, AFAICS.
>
> I think you're misreading it.  This is the inner loop:
>
>	for(i = 0; i < COUNT; i++) {
>		asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
>			     : "+m" (*offset): "r" (seg) : "memory");
>		sync();
>	}
>	return "gs";
>
> On entry, %gs will contain the normal usermode TLS selector.  "seg" is
> another selector allocated with set_thread_area().  The asm pushes the
> old %gs, loads the new one, uses a memory address via the new segment,
> then restores the previous %gs.

but it does not actually use the 'normal usermode TLS selector' - it only
loads it.

a meaningful test would be to allocate two selector values and load and
read+write memory through both of them.

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread
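A test along these lines could be added to the attached time-segops.c roughly as follows. This is only a sketch, assuming both selectors' descriptors are based at the test's segment[] array (as the program's LDT and GDT descriptors are), and reusing the program's COUNT and sync() helpers.

	static const char *test_gs_mixed(int seg1, int seg2, int *offset)
	{
		int i;

		for (i = 0; i < COUNT; i++) {
			/* load and dereference two different %gs selector values
			 * per iteration, restoring the original %gs afterwards */
			asm volatile("push %%gs; "
				     "mov %1, %%gs; addl $1, %%gs:%0; "
				     "mov %2, %%gs; addl $1, %%gs:%0; "
				     "popl %%gs"
				     : "+m" (*offset)
				     : "r" (seg1), "r" (seg2)
				     : "memory");
			sync();
		}
		return "gs (mixed)";
	}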
* Re: i386 PDA patches use of %gs 2006-11-15 18:43 ` Ingo Molnar @ 2006-11-15 18:49 ` Jeremy Fitzhardinge 2006-11-15 18:49 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 18:49 UTC (permalink / raw) To: Ingo Molnar Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell Ingo Molnar wrote: > but it does not actually use the 'normal usermode TLS selector' - it > only loads it. > > a meaningful test would be to allocate two selector values and load and > read+write memory through both of them. > Well, obviously in one case it would need to switch between null/non-null/null. But yes, good point about using the "usermode" %gs each iteration. I'll do some more tests. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:49 ` Jeremy Fitzhardinge @ 2006-11-15 18:49 ` Ingo Molnar 2006-11-15 19:00 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 18:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ingo Molnar wrote: > > but it does not actually use the 'normal usermode TLS selector' - it > > only loads it. > > > > a meaningful test would be to allocate two selector values and load and > > read+write memory through both of them. > > > > Well, obviously in one case it would need to switch between > null/non-null/null. But yes, good point about using the "usermode" > %gs each iteration. I'll do some more tests. i'd not even use glibc's %gs but set up two separate selectors. (that's a more controlled experiment - someone might run a non-TLS glibc, etc.) Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 18:49 ` Ingo Molnar @ 2006-11-15 19:00 ` Jeremy Fitzhardinge 2006-11-15 19:03 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Jeremy Fitzhardinge @ 2006-11-15 19:00 UTC (permalink / raw) To: Ingo Molnar Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell Ingo Molnar wrote: > i'd not even use glibc's %gs but set up two separate selectors. (that's > a more controlled experiment - someone might run a non-TLS glibc, etc.) > Well, in that case they probably don't care whether the kernel uses %fs or %gs ;) But either way, this doesn't have much bearing on Eric's test; we'd be only talking about a few ns per kernel exit, rather than 5% for read/write. J ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: i386 PDA patches use of %gs 2006-11-15 19:00 ` Jeremy Fitzhardinge @ 2006-11-15 19:03 ` Ingo Molnar 0 siblings, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2006-11-15 19:03 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman, Ian Campbell * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ingo Molnar wrote: > > i'd not even use glibc's %gs but set up two separate selectors. > > (that's a more controlled experiment - someone might run a non-TLS > > glibc, etc.) > > > > Well, in that case they probably don't care whether the kernel uses > %fs or %gs ;) > > But either way, this doesn't have much bearing on Eric's test; we'd be > only talking about a few ns per kernel exit, rather than 5% for > read/write. if the timings are different then it very much has bearing on the argument that i made against the current i386 PDA patchset, that mixed use segments are suboptimal. So i'm NAK-ing the i386 PDA patchset until this has been properly measured (and fixed if needed). Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread