From: Andi Kleen <ak@suse.de>
To: discuss@x86-64.org, linux-kernel@vger.kernel.org
Cc: libc-alpha@sourceware.org, vojtech@suse.cz
Subject: FOR REVIEW: New x86-64 vsyscall vgetcpu()
Date: Wed, 14 Jun 2006 09:42:31 +0200 [thread overview]
Message-ID: <200606140942.31150.ak@suse.de> (raw)
I got several requests over the years to provide a fast way to get
the current CPU and node on x86-64. That is useful for a couple of things:
- The kernel gets a lot of benefit from using per-CPU data for better
cache locality and to avoid cache line bouncing. This is currently
not really possible for user programs. With a fast way to know the current
CPU, user space can use per-CPU data that is likely already in cache.
Locking is still needed of course - after all, the thread might switch
to a different CPU at any time - but at least the memory should already be
in cache, and locking on cached memory is much cheaper.
- For NUMA optimization in user space you really need to know the current
node to decide where to allocate memory from.
If you allocate a fresh page, the kernel will give you
one on the current node, but if you keep your own pools, as most programs
do, you need this information to select the right pool.
For single-threaded programs it is usually not a big issue, because they
tend to start on one node, allocate all their memory there, and eventually
use it there too. But for multithreaded programs, where threads can
run on different nodes, it is a much bigger problem to make sure each thread
can get node-local memory for best performance.
At first glance such a call still looks like a bad idea - after all, the kernel
can move a process to another CPU at any time, so any result of this call might
be stale as soon as it returns.
But on closer look it really makes sense:
- The scheduler has strong CPU affinity and usually keeps a process on the
same CPU, so switching CPUs is rare. That makes this a useful optimization.
The alternative is usually to bind the process to a specific CPU - then it
"knows" where it is - but that is nasty to use and
requires user configuration. The kernel can often make better decisions on
where to schedule, and doing it automatically makes it just work.
This cannot be done efficiently in pure user space: only the kernel can
translate local APIC IDs into Linux CPU numbers.
A regular syscall is too slow for this, so a vsyscall makes sense.
I have patches now in my tree from Vojtech
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/getcpu-vsyscall
(note doesn't apply on its own, needs earlier patches in the quilt set)
The prototype is

long vgetcpu(int *cpu, int *node, unsigned long *tcache)

cpu, if not NULL, receives the current CPU number.
node, if not NULL, receives the current node number.
tcache is a pointer to a two-element unsigned long array used as a cache;
it can also be NULL. It is described below.
The return value is always 0.
[I modified the prototype a bit over Vojtech's original implementation
to be more foolproof and add the caching mechanism]
Unfortunately all ways to get this information from the CPU are relatively
slow. The implementation uses RDTSCP on CPUs that support it and CPUID(1)
otherwise; both stall the pipeline and add noticeable overhead.
So I added a special caching mechanism. The idea is that if the call is a
little slow, user space would likely cache the information anyway. The problem
with caching is that you need a way to tell when the cached value is out of
date, and user space cannot do that on its own because it has no fast way to
read a timestamp.
But the x86-64 vsyscall implementation happens to have one: vgettimeofday()
already has access to jiffies, which can be used as a timestamp to
invalidate the cache. The vsyscall cannot cache the information by itself
though - it has no per-thread storage. The idea is that the user passes in a
TLS variable which is then used as the storage. With that the information
is at worst one jiffie out of date, which is good enough.
The contents of the cache are theoretically supposed to be opaque (although
I'm sure user programs will soon abuse them, because this will be such a
convenient way to get at jiffies ...). I've considered xoring the value with
a constant to make it clearly unusable as a timestamp, but that is probably
overkill (?). It might still be safer, because jiffies is unsafe to rely on
in user space - its unit might change.
The two-element array is slightly ugly - one open possibility is to replace
it with a structure. That shouldn't change the general semantics of the
call much though.
Some numbers: (the getpid is to compare syscall cost)
AMD RevF (with RDTSCP support):
getpid 162 cycles
vgetcpu 145 cycles
vgetcpu rdtscp 32 cycles
vgetcpu cached 14 cycles
Intel Pentium-D (Smithfield):
getpid 719 cycles
vgetcpu 535 cycles
vgetcpu cached 27 cycles
AMD RevE:
getpid 162 cycles
vgetcpu 185 cycles
vgetcpu cached 15 cycles
As you can see, CPUID(1) is always quite slow, but it still narrowly beats
the real syscall, except on the AMD E stepping. The difference there is very
small, and while it would have been possible to implement a third mode that
falls back to a real syscall, I ended up not doing so because it has some
other implications.
With the caching mechanism it really flies though, and should be fast enough
for most uses.
My eventual hope is that glibc will start using this to implement a NUMA-aware
malloc() in user space that preferably allocates node-local memory.
I would say that's the biggest gap we still have in "general purpose" NUMA
tuning on Linux. It will likely be useful for a lot of other scalable
code too.
Comments on the general mechanism are welcome. If someone is interested in using
this in user space for SMP or NUMA tuning please let me know.
I haven't quite made up my mind yet whether it's 2.6.18 material or not.
-Andi