* Faster getcpu() and sched_getcpu()
[not found] <af8810200809111648n55e05ac9g286fcd498690432f@mail.gmail.com>
@ 2008-09-23 19:09 ` Pardo
  2008-09-23 19:48 ` Fwd: " Pardo
  2008-09-28 16:42 ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Pardo @ 2008-09-23 19:09 UTC (permalink / raw)
To: linux-kernel, mbligh; +Cc: briangrant, odo, nil, jyasskin

[-- Attachment #1: Type: text/plain, Size: 8178 bytes --]

getcpu() returns a caller's current core number.  On 2.6.26 running on
x86_64, there are two VDSO implementations: store it in TSCP's AUX
register; or, if the processor does not support TSCP, store it in the
GDT's limit.  Dean Gaudet and Nathan Laredo also suggest using the IDT's
limit.  Call these GDT, TSCP, and SIDT.

The cost of reading the CPU number can be reduced significantly across a
variety of platforms.  Suggestions: eliminate the per-call architecture
check; use SIDT to hold the CPU and node number; cache the result; split
the VDSO into red-zone and no-red-zone areas; streamline the cache
checks in the getcpu() code; provide a specialized sched_getcpu().
Result: on various x86_64 platforms, reading the CPU number drops from
about 30-100 cycles to 4-21 cycles.

I do not yet have a patch.  I would like folks to (a) comment; and (b)
try the attached microbenchmark on various machines to see if there are
any machines where something is faster than SIDT.

TESTS AND DATA

I ran timing tests that "fake" the user-space instruction sequence for
various VDSO-based getcpu() and sched_getcpu() implementations.  I ran
the tests on seven kinds of Intel and AMD platforms.  Each sequence was
measured individually (rather than averaging N runs).  Best and median
costs of 1000 runs were recorded.  An empty sequence was also measured
and that cost subtracted from each of the other runs, so a reported
"20 cycles" is "20 cycles more than the empty sequence."

A first test is the "raw" cost of just the machine instructions to read
the special register.
SIDT holds the value offset by 0x1000 and the machine instruction saves
it to memory.  The SIDT cost reported here is conservative in that it
includes a load and a subtract which are sometimes eliminated in
getcpu()/sched_getcpu().  Note machine E is based on a
P4-microarchitecture processor, which is typically hard to measure
accurately, hence some reported costs for E are as low as 0 cycles.

        --- BEST ---      -- MEDIAN --
      GDT  TSCP  SIDT   GDT  TSCP  SIDT
  A    60    77    15    61    78    16
  B    45   N/A     9    54   N/A    18
  C    54   N/A     9    54   N/A    18
  D    49   N/A    14    56   N/A    14
  E    32   N/A     0    42   N/A    11
  F    74   N/A    17    74   N/A    18
  G    16    23    16    21    24    17

On all machines, SIDT is always fastest, often by 3x or more.  TSCP is
always slowest.

Current implementations (2.6.26/arch/x86/kernel/vsyscall_64.c and
2.6.26/arch/x86/vdso/vgetcpu.c) choose dynamically whether to use GDT or
TSCP, something like

    if (global)
        raw = GDT();
    else
        raw = TSCP();

A second test compares the cost of dispatch to GDT, dispatch to TSCP, or
using GDT, TSCP, or SIDT unconditionally.  This test mimics the
glibc/VDSO structure, where the benchmark calls an out-of-line "fake
glibc" routine that performs an indirect call to a "fake vdso" routine.

        ----------- BEST ----------    ---------- MEDIAN ---------
      *GDT  *TSCP  GDT  TSCP  SIDT   *GDT  *TSCP  GDT  TSCP  SIDT
  A     67     86   65    83    31     68     87   68    85    31
  B     72    N/A   72   N/A    18     81    N/A   81   N/A    27
  C     72    N/A   77   N/A    18     81    N/A   81   N/A    27
  D     77    N/A   77   N/A    21     77    N/A   77   N/A    28
  E     63    N/A   53   N/A     0     64    N/A   63   N/A    11
  F     99    N/A   98   N/A    17     99    N/A   98   N/A    21
  G     26     29   20    28    19     28     32   27    28    22

In these tests, TSCP is still significantly slower and SIDT is still
significantly faster, despite function call and conditional overheads.
Also, dispatch overhead is small.

A third test compares the cost of caching.  There are four variations:
never use a cache (like 2.6.26/arch/x86/vdso/vgetcpu.c); use a cache
(like 2.6.26/arch/x86/kernel/vsyscall_64.c) but pass in NULL; use a
cache and take a miss; and use a cache and take a hit.
The following are all SIDT implementations and measure the cost to read
and set the CPU number but not the node number.

        ------ BEST ------     ----- MEDIAN -----
      NONE  NULL  MISS  HIT   NONE  NULL  MISS  HIT
  A     31    32    29   10     31    32    30   10
  B     18    27    27    9     27    27    27   27
  C     18    18    27   18     27    27    27   27
  D     21    21    18   14     28    28    28   28
  E      0     0     0    0     11    11    11   11
  F     17    19    23   12     21    21    25   15
  G     19    22    24   12     22    23    26   13

In these tests, HIT is usually faster in the best case, and is sometimes
faster and never slower in the median case.  The savings from skipping
the cache test is usually small.  So even in cases where NULL is always
passed, the penalty is usually small.

A fourth test compares two "fake" sched_getcpu() implementations.  The
generic version (similar to
glibc-2.7/sysdeps/unix/sysv/linux/x86_64/sched_getcpu.S) calls the
general VDSO getcpu().  The specialized version tail-calls a specialized
VDSO sched_getcpu() similar to getcpu() but faster because various
checks are eliminated.  The following reports times for the cache-hit
case.

        ------- BEST ------     ------ MEDIAN -----
      GENERAL  SPECIALIZED   GENERAL  SPECIALIZED
  A        12            4        13            6
  B        18           18        18           18
  C        18           18        18           18
  D        14           14        21           21
  E         0            0        11           11
  F        18           11        19           11
  G        16            9        17            9

The specialized version is only faster on 3 of 7 machines, but is
roughly 2x faster in those cases.

The original getcpu() code always tests twice for the cache, and writes
it whether or not it changed.  Tests here use a slight rewrite that
re-tests and writes only on a cache miss:

    if (cache && (cache->blob[0] == (j = _jiffies))) {
        p = cache->blob[1];
    } else {
        p = ...;
        if (cache) {
            cache->blob[0] = j;
            cache->blob[1] = p;
        }
    }

It turns out this "streamlining" is of no benefit because GCC makes this
optimization anyway (but see below).

Code to measure and report raw GDT/TSCP/SIDT timings is attached.  I
have other test data and test code if it is useful.

ANALYSIS AND SUGGESTIONS

Caching is currently disabled for 2.6.26/arch/x86/vdso/vgetcpu.c.
getcpu()/sched_getcpu() performance is most important when they are used
very frequently, in which case the jiffy-based cache is effective.
Conversely, when calls are infrequent, cache miss overhead is small.
Recommendation: caching should be enabled (probably for all
architectures, not just x86-64).

Switching to SIDT everywhere looks promising for all machines measured
so far.  The SIDT instruction performs a memory store, which means the
VDSO needs to be split into red-zone/no-red-zone areas to avoid frame
overhead.  See, e.g., http://lkml.org/lkml/2007/1/14/27 for details and
http://arctic.org/~dean/patches/linux-2.6.20-vgetcpu-sidt.patch for an
old 2.6.20 patch.  Recommendation: measure relative costs on more
systems to see if SIDT is ever a worse choice.  Code to measure and
report GDT/TSCP/SIDT timings is attached.

If GDT or TSCP turns out to be faster on some machines, the binding
could be done via an indirect pointer set once on startup and used by
glibc to call the VDSO entry, rather than a conditional in
getcpu()/sched_getcpu() run on every call.  This would increase code
size and might cause cache fragmentation, but would improve prefetching
and reduce branch predictor pressure.  Recommendation: nothing now;
should GDT or TSCP turn out to be faster, this might deserve more study.

A specialized version of the VDSO code for sched_getcpu() is
substantially faster than calling getcpu().  It can be implemented
almost trivially by inlining getcpu()'s code in a second function.  It
adds about 75 bytes of x86-64 instructions to the VDSO code page.
Recommendation: probably useful for all architectures.

It turns out GCC is able to rewrite the existing code into the
"streamlined" form that only updates the cache when it changes.  So
while a rewrite won't change the performance, it might make the code
more obviously fast.
Recommendation: to keep people from looking at this repeatedly, recode
so it is "obviously" fast or add a comment indicating no further benefit
from a rewrite.

Comments?  More GDT/TSCP/SIDT performance numbers?

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: simple.c --]
[-- Type: text/x-csrc; name=simple.c, Size: 3822 bytes --]

#include <asm-x86_64/msr.h>
#include <stdio.h>
#include <stdlib.h>

#define noinline __attribute__((__noinline__))

static inline int empty() { return 1; }

static inline int gdt_limit()
{
	/* "segment" value is from Linux 2.6.26/include/asm-x86/segment.h */
	const int segment = (15 * 8 + 3);
	int limit;
	asm volatile("lsl %1,%0" : "=r" (limit) : "r" (segment));
	return limit;
}

static inline int idt_limit()
{
	struct {
		char pad[6];			/* Align accesses. */
		unsigned short limit;		/* 16b */
		unsigned long long address;	/* 64b */
	} idt;
	/* sidt stores the 2-byte limit, then the 8-byte base, at this address. */
	asm volatile("sidt %0" : "=m"(idt.limit));
	return idt.limit;
}

static inline int tscp_aux()
{
	int eax, edx, aux;
	asm volatile(".byte 0x0f,0x01,0xf9" : "=a" (eax), "=d" (edx), "=c" (aux));
	return aux;
}

static inline int cpuid_edx_val(unsigned int op)
{
	int eax, edx;
	asm("cpuid" : "=a" (eax), "=d" (edx) : "0" (op) : "bx", "cx");
	return edx;
}

inline int /*bool*/ have_tscp()
{
	return (cpuid_edx_val(0x80000001) & (1 << 27)) != 0;
}

typedef long long tsc;

noinline tsc now()
{
	unsigned int eax_lo, edx_hi;
	tsc now;
	asm volatile("rdtsc" : "=a" (eax_lo), "=d" (edx_hi));
	now = ((tsc)eax_lo) | ((tsc)(edx_hi) << 32);
	return now;
}

int tsc_sort_pred(const void *va, const void *vb)
{
	const tsc *a = (const tsc *)(va);
	const tsc *b = (const tsc *)(vb);
	return (*a > *b) - (*a < *b);	/* Avoid truncating a 64-bit difference to int. */
}

typedef enum which {
	EMPTY,		/* Must be first (0'th) to set base_cost. */
	GDT_LIMIT,
	IDT_LIMIT,
	TSCP_AUX,
} which;

volatile int g_sink;

static inline int/*bool*/ run_test(tsc *delta, int n, which test)
{
	int i;
	if ((test == TSCP_AUX) && !have_tscp())
		return 0;
	for (i = 0; i < n; ++i) {
		int val;	/* Written before read.  Really! */
		tsc stop;
		tsc start = now();
		asm volatile("nop" ::: "memory");
		switch (test) {
		case EMPTY: val = empty(); break;
		case GDT_LIMIT: val = gdt_limit(); break;
		case IDT_LIMIT: val = idt_limit(); break;
		case TSCP_AUX: val = tscp_aux(); break;
		}
		asm volatile("nop" ::: "memory");
		stop = now();
		g_sink = val;
		*delta++ = stop - start;
	}
	return 1;
}

noinline int/*bool*/ run_test_empty(tsc *delta, int n) { return run_test(delta, n, EMPTY); }
noinline int/*bool*/ run_test_gdt_limit(tsc *delta, int n) { return run_test(delta, n, GDT_LIMIT); }
noinline int/*bool*/ run_test_idt_limit(tsc *delta, int n) { return run_test(delta, n, IDT_LIMIT); }
noinline int/*bool*/ run_test_tscp_aux(tsc *delta, int n) { return run_test(delta, n, TSCP_AUX); }

typedef int/*bool*/ (*funcptr)(tsc *delta, int n);

/* Obfuscate pointers so various run_test*() cases do not get inlined. */
funcptr func[] = { run_test_empty, run_test_gdt_limit, run_test_idt_limit, run_test_tscp_aux };
const char *names[] = { "EMPTY", "GDT_LIMIT", "IDT_LIMIT", "TSCP_AUX" };

int main(int argc, char **argv)
{
	int t;
	const int N = 1000;
	tsc delta[N];
	tsc base_cost = 0;
	/* In principle can change 'func' so compiler cannot dispatch */
	printf("Starting tests...\n");
	for (t=0; t<=TSCP_AUX; ++t) {
		int ran = (*func[t])(delta, N);
		const char *name = names[t];
		if (!ran) {
			printf("Not-run: %s\n", name);
			continue;
		}
		{
			static int bin[] = { 5, 10, 20, 50 };
			int i;
			qsort(delta, N, sizeof(tsc), tsc_sort_pred);
			printf("Run: %s\ttotal-tests= %d\tbests: ", name, N);
			for (i = 0; i < 5 ; ++i)
				printf(" %2lld", delta[i] - base_cost);
			printf("\tmedians:");
			for (i = 0; i < sizeof(bin)/sizeof(bin[0]); ++i)
				printf(" %2d%%: %2lld", bin[i], delta[N * bin[i] / 100] - base_cost);
			printf("\n");
		}
		if (t == EMPTY)
			base_cost = delta[0];
	}
	return 0;
}

[-- Attachment #3: Makefile --]
[-- Type: application/octet-stream, Size: 178 bytes --]

CC = gcc
CFLAGS = -Wall -O3

all: simple simple.dis

simple: simple.c Makefile
	$(CC) $(CFLAGS) -o simple simple.c

simple.dis: simple
	objdump --disassemble simple > simple.dis

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Fwd: Faster getcpu() and sched_getcpu()
2008-09-23 19:09 ` Faster getcpu() and sched_getcpu() Pardo
@ 2008-09-23 19:48 ` Pardo
  2008-09-28 16:42 ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: Pardo @ 2008-09-23 19:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Martin Bligh, briangrant, odo, nil, jyasskin

[-- Attachment #1: Type: text/plain, Size: 8244 bytes --]

[Re-post as the test program attached before was an old one with a bad
field size.]

[The body and attachments repeat the parent message verbatim; the
corrected simple.c declares the idt limit field as "unsigned short"
(16 bits) rather than "unsigned int".]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-23 19:09 ` Faster getcpu() and sched_getcpu() Pardo
2008-09-23 19:48 ` Fwd: " Pardo
@ 2008-09-28 16:42 ` Andi Kleen
  2008-09-29  7:27   ` dean gaudet
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2008-09-28 16:42 UTC (permalink / raw)
To: Pardo; +Cc: linux-kernel, mbligh, briangrant, odo, nil, jyasskin

Pardo <pardo@google.com> writes:
>
> ANALYSIS AND SUGGESTIONS
>
> Caching is currently disabled for 2.6.26/arch/x86/vdso/vgetcpu.c.
> getcpu()/sched_getcpu() performance is most important when they are
> used very frequently, in which case the jiffy-based cache is
> effective. Conversely, when calls are infrequent, cache miss overhead
> is small. Recommendation: caching should be enabled (probably for all
> architectures, not just x86-64).

Without a vsyscall the cache probably doesn't make too much sense,
because once you're in the kernel reading the real CPU number is really
cheap.  I agree with you that the cache should be enabled on all vDSO
implementations (that is what my original code did).  Also the TSCP
version could probably go.

I'm still not sure why you say no redzone is that expensive?  Do you
have numbers?  I know it's a few instructions, but it shouldn't be that
expensive.

> A specialized version of the VDSO code for sched_getcpu() is
> substantially faster than calling getcpu().

Yes, unfortunately glibc didn't choose the same interface as the kernel
for this.  I still don't know why.  But now since we're in this mess,
specializing for the glibc implementation is probably a good idea.  Or
just add getcpu() to glibc :)

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-28 16:42 ` Andi Kleen
@ 2008-09-29  7:27 ` dean gaudet
  2008-09-29 14:54   ` Andi Kleen
  0 siblings, 1 reply; 9+ messages in thread
From: dean gaudet @ 2008-09-29  7:27 UTC (permalink / raw)
To: Andi Kleen; +Cc: Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

Andi Kleen wrote:
> I'm still not sure why you say no redzone is that expensive?  Do you
> have numbers?  I know it's a few instructions, but it shouldn't
> be that expensive.
>
>

it depends on the processor involved and the kernel config options --
i.e. if frame pointers are enabled then the stack frame guarantees a
store operation (push rbp) and on processors which do memops in-order
this delays the other memops in the vsyscall (i.e. testing the cache or
executing SIDT).  it was 2 or 3 cycles difference in most cases iirc.

-dean

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-29  7:27 ` dean gaudet
@ 2008-09-29 14:54 ` Andi Kleen
  2008-09-29 18:02   ` Pardo
  [not found]        ` <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Andi Kleen @ 2008-09-29 14:54 UTC (permalink / raw)
To: dean gaudet
Cc: Andi Kleen, Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

On Mon, Sep 29, 2008 at 12:27:09AM -0700, dean gaudet wrote:
> Andi Kleen wrote:
> > I'm still not sure why you say no redzone is that expensive?  Do you
> > have numbers?  I know it's a few instructions, but it shouldn't
> > be that expensive.
>
> it depends on the processor involved and the kernel config options --
> i.e. if frame pointers are enabled then the stack frame guarantees a
> store operation (push rbp) and on processors which do memops in-order
> this delays the other memops in the vsyscall (i.e. testing the cache or
> executing SIDT).  it was 2 or 3 cycles difference in most cases iirc.

Ok, frame pointers are always a performance disaster on some CPUs.
Perhaps they should just be unconditionally disabled for vsyscall.c and
the vdso.

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
2008-09-29 14:54 ` Andi Kleen
@ 2008-09-29 18:02 ` Pardo
  [not found]      ` <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Pardo @ 2008-09-29 18:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: dean gaudet, linux-kernel, mbligh, briangrant, nil, jyasskin

>[Maybe disable frame pointers for vsyscall.c and the vdso?]

IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch
split vsyscall.c, creating a vsyscall_user.c for code which can run
without them.  Seem reasonable?

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>]
* Re: Faster getcpu() and sched_getcpu()
[not found] ` <af8810200809291101r6f3208beua36a4b2d3b5713eb@mail.gmail.com>
@ 2008-09-29 20:50 ` Andi Kleen
  [not found]      ` <48E14ECE.6080402@google.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2008-09-29 20:50 UTC (permalink / raw)
To: Pardo
Cc: Andi Kleen, dean gaudet, linux-kernel, mbligh, briangrant, nil, jyasskin

On Mon, Sep 29, 2008 at 11:01:26AM -0700, Pardo wrote:
> >[Maybe disable frame pointers for vsyscall.c and the vdso?]
>
> IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch split

I don't think it really needs it.

> vsyscall.c, creating a vsyscall_user.c for code which can run without them.
> Seem reasonable?

Seems unnecessarily complicated.

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <48E14ECE.6080402@google.com>]
* Re: Faster getcpu() and sched_getcpu()
[not found] ` <48E14ECE.6080402@google.com>
@ 2008-09-29 21:59 ` dean gaudet
  2008-09-29 22:07 ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: dean gaudet @ 2008-09-29 21:59 UTC (permalink / raw)
To: Andi Kleen; +Cc: Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

[attempting resend in plain-text only because thunderbird lost the
battle vs. vger]

dean gaudet wrote:
> Andi Kleen wrote:
>> On Mon, Sep 29, 2008 at 11:01:26AM -0700, Pardo wrote:
>>
>>>> [Maybe disable frame pointers for vsyscall.c and the vdso?]
>>>>
>>> IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch split
>>>
>>
>> I don't think it really needs it.
>>
>>
>>> vsyscall.c, creating a vsyscall_user.c for code which can run without them.
>>> Seem reasonable?
>>>
>>
>> Seems unnecessarily complicated.
>>
>
> i disagree that it's complicated to have two files, and disagree that
> it's unnecessary to have two files.
>
> userland code does not have the same limitations/conventions as kernel
> code.  the ABIs are completely different.
>
> -dean

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Faster getcpu() and sched_getcpu()
[not found] ` <48E14ECE.6080402@google.com>
2008-09-29 21:59 ` dean gaudet
@ 2008-09-29 22:07 ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2008-09-29 22:07 UTC (permalink / raw)
To: dean gaudet
Cc: Andi Kleen, Pardo, linux-kernel, mbligh, briangrant, nil, jyasskin

On Mon, Sep 29, 2008 at 02:55:26PM -0700, dean gaudet wrote:
> Andi Kleen wrote:
> > On Mon, Sep 29, 2008 at 11:01:26AM -0700, Pardo wrote:
> >
> >>> [Maybe disable frame pointers for vsyscall.c and the vdso?]
> >>>
> >> IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch split
> >>
> >
> > I don't think it really needs it.
> >
> >
> >> vsyscall.c, creating a vsyscall_user.c for code which can run without them.
> >> Seem reasonable?
> >>
> >
> > Seems unnecessarily complicated.
> >
>
> i disagree that it's complicated to have two files, and disagree that
> it's unnecessary to have two files.

It's unnecessary to have frame pointers in the kernel functions, I
meant.  I agree with you that disabling redzone is needed for kernel
code, but without frame pointers (which are generally a bad idea for
performance and should not have been added to the 64-bit port ever)
redzone is also not particularly expensive, and it shouldn't be needed
to do anything complicated (like splitting files) just for the few
cycles.

-Andi

--
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 9+ messages in thread