* Faster getcpu() and sched_getcpu()
[not found] <af8810200809111648n55e05ac9g286fcd498690432f@mail.gmail.com>
@ 2008-09-23 19:09 ` Pardo
  2008-09-23 19:48 ` Fwd: " Pardo
  2008-09-28 16:42 ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Pardo @ 2008-09-23 19:09 UTC (permalink / raw)
To: linux-kernel, mbligh; +Cc: briangrant, odo, nil, jyasskin

[-- Attachment #1: Type: text/plain, Size: 8178 bytes --]

getcpu() returns a caller's current core number.  On 2.6.26 running on
x86_64, there are two VDSO implementations: store it in TSCP's AUX
register; or, if the processor does not support TSCP, store it in the
GDT's limit.  Dean Gaudet and Nathan Laredo also suggest using the IDT's
limit.  Call these GDT, TSCP, and SIDT.

The cost of reading the CPU number can be reduced significantly across a
variety of platforms.  Suggestions: eliminate the per-call architecture
check; use SIDT to hold the CPU and node number; cache the result; split
the VDSO into red-zone and no-red-zone areas; streamline the cache
checks in the getcpu() code; provide a specialized sched_getcpu().
Result: on various x86_64 platforms, reading the CPU number drops from
about 30-100 cycles to 4-21 cycles.

I do not yet have a patch.  I would like folks to (a) comment; and (b)
try the attached microbenchmark on various machines to see if there are
any machines where something is faster than SIDT.

TESTS AND DATA

I ran timing tests that "fake" the user-space instruction sequence for
various VDSO-based getcpu() and sched_getcpu() implementations.  I ran
the tests on seven kinds of Intel and AMD platforms.  Each sequence was
measured individually (rather than averaging N runs).  Best and median
costs of 1000 runs were recorded.  An empty sequence was also measured
and that cost subtracted from each of the other runs, so a reported
"20 cycles" is "20 cycles more than the empty sequence."

A first test is the "raw" cost of just the machine instructions to read
the special register.
SIDT holds the value offset by 0x1000 and the machine instruction saves
it to memory.  The SIDT cost reported here is conservative in that it
includes a load and a subtract which are sometimes eliminated in
getcpu()/sched_getcpu().  Note machine E is based on a
P4-microarchitecture processor, which is typically hard to measure
accurately, hence some reported costs for E are as low as 0 cycles.

        --- BEST ---      -- MEDIAN --
      GDT  TSCP  SIDT   GDT  TSCP  SIDT
  A    60    77    15    61    78    16
  B    45   N/A     9    54   N/A    18
  C    54   N/A     9    54   N/A    18
  D    49   N/A    14    56   N/A    14
  E    32   N/A     0    42   N/A    11
  F    74   N/A    17    74   N/A    18
  G    16    23    16    21    24    17

On all machines, SIDT is always fastest, often by 3x or more.  TSCP is
always slowest.

Current implementations (2.6.26/arch/x86/kernel/vsyscall_64.c and
2.6.26/arch/x86/vdso/vgetcpu.c) choose dynamically whether to use GDT or
TSCP, something like

    if (global)
        raw = GDT();
    else
        raw = TSCP();

A second test compares the cost of dispatch to GDT, dispatch to TSCP, or
using GDT, TSCP, or SIDT unconditionally.  This test mimics the
glibc/VDSO structure, where the benchmark calls an out-of-line "fake
glibc" routine that performs an indirect call to a "fake vdso" routine.

        ----------- BEST ----------    ---------- MEDIAN ---------
      *GDT  *TSCP  GDT  TSCP  SIDT   *GDT  *TSCP  GDT  TSCP  SIDT
  A     67     86   65    83    31     68     87   68    85    31
  B     72    N/A   72   N/A    18     81    N/A   81   N/A    27
  C     72    N/A   77   N/A    18     81    N/A   81   N/A    27
  D     77    N/A   77   N/A    21     77    N/A   77   N/A    28
  E     63    N/A   53   N/A     0     64    N/A   63   N/A    11
  F     99    N/A   98   N/A    17     99    N/A   98   N/A    21
  G     26     29   20    28    19     28     32   27    28    22

In these tests, TSCP is still significantly slower and SIDT is still
significantly faster, despite function call and conditional overheads.
Also, dispatch overhead is small.

A third test compares the cost of caching.  There are four variations:
never use a cache (like 2.6.26/arch/x86/vdso/vgetcpu.c); use a cache
(like 2.6.26/arch/x86/kernel/vsyscall_64.c) but pass in NULL; use a
cache and take a miss; and use a cache and take a hit.
The following are all SIDT implementations and measure the cost to read
and set the CPU number but not the node number.

        ------ BEST ------     ----- MEDIAN -----
      NONE  NULL  MISS  HIT   NONE  NULL  MISS  HIT
  A     31    32    29   10     31    32    30   10
  B     18    27    27    9     27    27    27   27
  C     18    18    27   18     27    27    27   27
  D     21    21    18   14     28    28    28   28
  E      0     0     0    0     11    11    11   11
  F     17    19    23   12     21    21    25   15
  G     19    22    24   12     22    23    26   13

In these tests, HIT is usually faster in the best case, and is sometimes
faster and never slower in the median case.  The savings from skipping
the cache test is usually small.  So even in cases where NULL is always
passed, the penalty is usually small.

A fourth test compares two "fake" sched_getcpu() implementations.  The
generic version (similar to
glibc-2.7/sysdeps/unix/sysv/linux/x86_64/sched_getcpu.S) calls the
general VDSO getcpu().  The specialized version tail-calls a specialized
VDSO sched_getcpu() similar to getcpu() but faster because various
checks are eliminated.  The following reports times for the cache-hit
case.

        ------- BEST ------     ------ MEDIAN -----
      GENERAL  SPECIALIZED   GENERAL  SPECIALIZED
  A        12            4        13            6
  B        18           18        18           18
  C        18           18        18           18
  D        14           14        21           21
  E         0            0        11           11
  F        18           11        19           11
  G        16            9        17            9

The specialized version is only faster on 3 of 7 machines, but is
roughly 2x faster in those cases.

The original getcpu() code always tests twice for the cache, and writes
it whether or not it changed.  Tests here use a slight rewrite that
re-tests and writes only on a cache miss:

    if (cache && (cache->blob[0] == (j = _jiffies))) {
        p = cache->blob[1];
    } else {
        p = ...;
        if (cache) {
            cache->blob[0] = j;
            cache->blob[1] = p;
        }
    }

It turns out this "streamlining" is of no benefit because GCC makes this
optimization anyway (but see below).

Code to measure and report raw GDT/TSCP/SIDT timings is attached.  I
have other test data and test code if it is useful.

ANALYSIS AND SUGGESTIONS

Caching is currently disabled for 2.6.26/arch/x86/vdso/vgetcpu.c.
getcpu()/sched_getcpu() performance is most important when they are used
very frequently, in which case the jiffy-based cache is effective.
Conversely, when calls are infrequent, cache miss overhead is small.
Recommendation: caching should be enabled (probably for all
architectures, not just x86-64).

Switching to SIDT everywhere looks promising for all machines measured
so far.  The SIDT instruction performs a memory store, which means the
VDSO needs to be split into red-zone/no-red-zone areas to avoid frame
overhead.  See, e.g., http://lkml.org/lkml/2007/1/14/27 for details and
http://arctic.org/~dean/patches/linux-2.6.20-vgetcpu-sidt.patch for an
old 2.6.20 patch.  Recommendation: measure relative costs on more
systems to see if SIDT is ever a worse choice.  Code to measure and
report GDT/TSCP/SIDT timings is attached.

If GDT or TSCP turns out to be faster on some machines, the binding
could be done via an indirect pointer set once on startup and used by
glibc to call the VDSO entry, rather than a conditional in
getcpu()/sched_getcpu() run on every call.  This would increase code
size and might cause cache fragmentation, but would improve prefetching
and reduce branch predictor pressure.  Recommendation: nothing now;
should GDT or TSCP turn out to be faster, this might deserve more study.

A specialized version of the VDSO code for sched_getcpu() is
substantially faster than calling getcpu().  It can be implemented
almost trivially by inlining getcpu()'s code in a second function.  It
adds about 75 bytes of x86-64 instructions to the VDSO code page.
Recommendation: probably useful for all architectures.

It turns out GCC is able to rewrite the existing code into the
"streamlined" form that only updates the cache when it changes.  So
while a rewrite won't change the performance, it might make the code
more obviously fast.
Recommendation: to keep people from looking at this repeatedly, recode
so it is "obviously" fast or add a comment indicating no further benefit
from a rewrite.

Comments?  More GDT/TSCP/SIDT performance numbers?

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: simple.c --]
[-- Type: text/x-csrc; name=simple.c, Size: 3822 bytes --]

#include <asm-x86_64/msr.h>
#include <stdio.h>
#include <stdlib.h>

#define noinline __attribute__((__noinline__))

static inline int empty() { return 1; }

static inline int gdt_limit()
{
	/* "segment" value is from Linux 2.6.26/include/asm-x86/segment.h */
	const int segment = (15 * 8 + 3);
	int limit;
	asm volatile("lsl %1,%0" : "=r" (limit) : "r" (segment));
	return limit;
}

static inline int idt_limit()
{
	struct {
		char pad[6];			/* Align accesses. */
		unsigned short limit;		/* 16b */
		unsigned long long address;	/* 64b */
	} idt;
	/* sidt stores the 2-byte limit, then the 8-byte base, at this address. */
	asm volatile("sidt %0" : "=m"(idt.limit));
	return idt.limit;
}

static inline int tscp_aux()
{
	int eax, edx, aux;
	asm volatile(".byte 0x0f,0x01,0xf9" : "=a" (eax), "=d" (edx), "=c" (aux));
	return aux;
}

static inline int cpuid_edx_val(unsigned int op)
{
	int eax, edx;
	asm("cpuid" : "=a" (eax), "=d" (edx) : "0" (op) : "bx", "cx");
	return edx;
}

inline int /*bool*/ have_tscp()
{
	return (cpuid_edx_val(0x80000001) & (1 << 27)) != 0;
}

typedef long long tsc;

noinline tsc now()
{
	unsigned int eax_lo, edx_hi;
	tsc now;
	asm volatile("rdtsc" : "=a" (eax_lo), "=d" (edx_hi));
	now = ((tsc)eax_lo) | ((tsc)(edx_hi) << 32);
	return now;
}

int tsc_sort_pred(const void *va, const void *vb)
{
	const tsc *a = (const tsc *)(va);
	const tsc *b = (const tsc *)(vb);
	return (*a > *b) - (*a < *b);	/* Avoid truncating a 64-bit difference to int. */
}

typedef enum which {
	EMPTY,		/* Must be first (0'th) to set base_cost. */
	GDT_LIMIT,
	IDT_LIMIT,
	TSCP_AUX,
} which;

volatile int g_sink;

static inline int/*bool*/ run_test(tsc *delta, int n, which test)
{
	int i;
	if ((test == TSCP_AUX) && !have_tscp())
		return 0;
	for (i = 0; i < n; ++i) {
		int val;	/* Written before read.  Really! */
		tsc stop;
		tsc start = now();
		asm volatile("nop" ::: "memory");
		switch (test) {
		case EMPTY: val = empty(); break;
		case GDT_LIMIT: val = gdt_limit(); break;
		case IDT_LIMIT: val = idt_limit(); break;
		case TSCP_AUX: val = tscp_aux(); break;
		}
		asm volatile("nop" ::: "memory");
		stop = now();
		g_sink = val;
		*delta++ = stop - start;
	}
	return 1;
}

noinline int/*bool*/ run_test_empty(tsc *delta, int n) { return run_test(delta, n, EMPTY); }
noinline int/*bool*/ run_test_gdt_limit(tsc *delta, int n) { return run_test(delta, n, GDT_LIMIT); }
noinline int/*bool*/ run_test_idt_limit(tsc *delta, int n) { return run_test(delta, n, IDT_LIMIT); }
noinline int/*bool*/ run_test_tscp_aux(tsc *delta, int n) { return run_test(delta, n, TSCP_AUX); }

typedef int/*bool*/ (*funcptr)(tsc *delta, int n);

/* Obfuscate pointers so various run_test*() cases do not get inlined. */
funcptr func[] = { run_test_empty, run_test_gdt_limit, run_test_idt_limit, run_test_tscp_aux };
const char *names[] = { "EMPTY", "GDT_LIMIT", "IDT_LIMIT", "TSCP_AUX" };

int main(int argc, char **argv)
{
	int t;
	const int N = 1000;
	tsc delta[N];
	tsc base_cost = 0;
	/* In principle can change 'func' so compiler cannot dispatch */
	printf("Starting tests...\n");
	for (t=0; t<=TSCP_AUX; ++t) {
		int ran = (*func[t])(delta, N);
		const char *name = names[t];
		if (!ran) {
			printf("Not-run: %s\n", name);
			continue;
		}
		{
			static int bin[] = { 5, 10, 20, 50 };
			int i;
			qsort(delta, N, sizeof(tsc), tsc_sort_pred);
			printf("Run: %s\ttotal-tests= %d\tbests: ", name, N);
			for (i = 0; i < 5 ; ++i)
				printf(" %2lld", delta[i] - base_cost);
			printf("\tmedians:");
			for (i = 0; i < sizeof(bin)/sizeof(bin[0]); ++i)
				printf(" %2d%%: %2lld", bin[i], delta[N * bin[i] / 100] - base_cost);
			printf("\n");
		}
		if (t == EMPTY)
			base_cost = delta[0];
	}
	return 0;
}

[-- Attachment #3: Makefile --]
[-- Type: application/octet-stream, Size: 178 bytes --]

CC = gcc
CFLAGS = -Wall -O3

all: simple simple.dis

simple: simple.c Makefile
	$(CC) $(CFLAGS) -o simple simple.c

simple.dis: simple
	objdump --disassemble simple > simple.dis

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Fwd: Faster getcpu() and sched_getcpu()
2008-09-23 19:09 ` Faster getcpu() and sched_getcpu() Pardo
@ 2008-09-23 19:48 ` Pardo
  2008-09-28 16:42 ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: Pardo @ 2008-09-23 19:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Martin Bligh, briangrant, odo, nil, jyasskin

[-- Attachment #1: Type: text/plain, Size: 8244 bytes --]

[Re-post as the test program attached before was an old one with a bad
field size.]

[The body and attachments repeat the parent message verbatim; the
corrected simple.c declares the idt limit field as "unsigned short"
(16 bits) rather than "unsigned int".]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-23 19:09 ` Faster getcpu() and sched_getcpu() Pardo
2008-09-23 19:48 ` Fwd: " Pardo
@ 2008-09-28 16:42 ` Andi Kleen
  2008-09-29  7:27   ` dean gaudet
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2008-09-28 16:42 UTC (permalink / raw)
To: Pardo; +Cc: linux-kernel, mbligh, briangrant, odo, nil, jyasskin

Pardo <pardo@google.com> writes:
>
> ANALYSIS AND SUGGESTIONS
>
> Caching is currently disabled for 2.6.26/arch/x86/vdso/vgetcpu.c.
> getcpu()/sched_getcpu() performance is most important when they are
> used very frequently, in which case the jiffy-based cache is
> effective. Conversely, when calls are infrequent, cache miss overhead
> is small. Recommendation: caching should be enabled (probably for all
> architectures, not just x86-64).

Without a vsyscall the cache probably doesn't make too much sense,
because once you're in the kernel reading the real CPU number is really
cheap.  I agree with you that the cache should be enabled on all vDSO
implementations (that is what my original code did).  Also the TSCP
version could probably go.

I'm still not sure why you say no redzone is that expensive?  Do you
have numbers?  I know it's a few instructions, but it shouldn't be that
expensive.

> A specialized version of the VDSO code for sched_getcpu() is
> substantially faster than calling getcpu().

Yes, unfortunately glibc didn't choose the same interface as the kernel
for this.  I still don't know why.  But now since we're in this mess,
specializing for the glibc implementation is probably a good idea.  Or
just add getcpu() to glibc :)

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-28 16:42 ` Andi Kleen
@ 2008-09-29  7:27 ` dean gaudet
  2008-09-29 14:54   ` Andi Kleen
  0 siblings, 1 reply; 9+ messages in thread
From: dean gaudet @ 2008-09-29  7:27 UTC (permalink / raw)
To: Andi Kleen; +Cc: Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

Andi Kleen wrote:
> I'm still not sure why you say no redzone is that expensive?  Do you
> have numbers?  I know it's a few instructions, but it shouldn't
> be that expensive.
>
>

it depends on the processor involved and the kernel config options --
i.e. if frame pointers are enabled then the stack frame guarantees a
store operation (push rbp) and on processors which do memops in-order
this delays the other memops in the vsyscall (i.e. testing the cache or
executing SIDT).  it was 2 or 3 cycles difference in most cases iirc.

-dean

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-29  7:27 ` dean gaudet
@ 2008-09-29 14:54 ` Andi Kleen
  2008-09-29 18:02   ` Pardo
  [not found]        ` <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Andi Kleen @ 2008-09-29 14:54 UTC (permalink / raw)
To: dean gaudet
Cc: Andi Kleen, Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

On Mon, Sep 29, 2008 at 12:27:09AM -0700, dean gaudet wrote:
> Andi Kleen wrote:
> > I'm still not sure why you say no redzone is that expensive?  Do you
> > have numbers?  I know it's a few instructions, but it shouldn't
> > be that expensive.
>
> it depends on the processor involved and the kernel config options --
> i.e. if frame pointers are enabled then the stack frame guarantees a
> store operation (push rbp) and on processors which do memops in-order
> this delays the other memops in the vsyscall (i.e. testing the cache or
> executing SIDT).  it was 2 or 3 cycles difference in most cases iirc.

Ok, frame pointers are always a performance disaster on some CPUs.
Perhaps they should just be unconditionally disabled for vsyscall.c and
the vdso.

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-29 14:54 ` Andi Kleen
@ 2008-09-29 18:02 ` Pardo
  [not found]      ` <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Pardo @ 2008-09-29 18:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: dean gaudet, linux-kernel, mbligh, briangrant, nil, jyasskin

>[Maybe disable frame pointers for vsyscall.c and the vdso?]

IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch
split vsyscall.c, creating a vsyscall_user.c for code which can run
without them.  Seem reasonable?

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>]
* Re: Faster getcpu() and sched_getcpu()
[not found] ` <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>
@ 2008-09-29 20:50 ` Andi Kleen
  [not found]      ` <48E14ECE.6080402@google.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2008-09-29 20:50 UTC (permalink / raw)
To: Pardo
Cc: Andi Kleen, dean gaudet, linux-kernel, mbligh, briangrant, nil, jyasskin

On Mon, Sep 29, 2008 at 11:01:26AM -0700, Pardo wrote:
> >[Maybe disable frame pointers for vsyscall.c and the vdso?]
>
> IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch split

I don't think it really needs it.

> vsyscall.c, creating a vsyscall_user.c for code which can run without them.
> Seem reasonable?

Seems unnecessarily complicated.

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <48E14ECE.6080402@google.com>]
* Re: Faster getcpu() and sched_getcpu()
[not found] ` <48E14ECE.6080402@google.com>
@ 2008-09-29 21:59 ` dean gaudet
  2008-09-29 22:07 ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: dean gaudet @ 2008-09-29 21:59 UTC (permalink / raw)
To: Andi Kleen; +Cc: Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

[attempting resend in plain-text only because thunderbird lost the
battle vs. vger]

dean gaudet wrote:
> Andi Kleen wrote:
>> On Mon, Sep 29, 2008 at 11:01:26AM -0700, Pardo wrote:
>>
>>>> [Maybe disable frame pointers for vsyscall.c and the vdso?]
>>>>
>>> IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch split
>>>
>>
>> I don't think it really needs it.
>>
>>
>>> vsyscall.c, creating a vsyscall_user.c for code which can run without them.
>>> Seem reasonable?
>>>
>>
>> Seems unnecessarily complicated.
>>
>
> i disagree that it's complicated to have two files, and disagree that
> it's unnecessary to have two files.
>
> userland code does not have the same limitations/conventions as kernel
> code.  the ABIs are completely different.
>
> -dean

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
[not found] ` <48E14ECE.6080402@google.com>
2008-09-29 21:59 ` dean gaudet
@ 2008-09-29 22:07 ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2008-09-29 22:07 UTC (permalink / raw)
To: dean gaudet
Cc: Andi Kleen, Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

On Mon, Sep 29, 2008 at 02:55:26PM -0700, dean gaudet wrote:
> Andi Kleen wrote:
> > On Mon, Sep 29, 2008 at 11:01:26AM -0700, Pardo wrote:
> >
> >>> [Maybe disable frame pointers for vsyscall.c and the vdso?]
> >>>
> >> IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch split
> >>
> >
> > I don't think it really needs it.
> >
> >
> >> vsyscall.c, creating a vsyscall_user.c for code which can run without them.
> >> Seem reasonable?
> >>
> >
> > Seems unnecessarily complicated.
> >
>
> i disagree that it's complicated to have two files, and disagree that
> it's unnecessary to have two files.

It's unnecessary to have frame pointers in the kernel functions, I
meant.  I agree with you that disabling redzone is needed for kernel
code, but without frame pointers (which are generally a bad idea for
performance and should not have been added to the 64-bit port ever)
redzone is also not particularly expensive, and it shouldn't be needed
to do anything complicated (like splitting files) just for the few
cycles.

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread