* FOR REVIEW: New x86-64 vsyscall vgetcpu()
@ 2006-06-14 7:42 Andi Kleen
2006-06-14 10:47 ` Alan Cox
` (4 more replies)
0 siblings, 5 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-14 7:42 UTC (permalink / raw)
To: discuss, linux-kernel; +Cc: libc-alpha, vojtech
I got several requests over the years to provide a fast way to get
the current CPU and node on x86-64. That is useful for a couple of things:
- The kernel gets a lot of benefit from using per CPU data to get better
cache locality and avoid cache line bouncing. This is currently
not quite possible for user programs. With a fast way to know the current
CPU user space can use per CPU data that is likely in cache already.
Locking is still needed of course - after all the thread might switch
to a different CPU - but at least the memory should be already in cache
and locking on cached memory is much cheaper.
- For NUMA optimization in user space you really need to know the current
node to find out where to allocate memory from.
If you allocate a fresh page from the kernel the kernel will give you
one in the current node, but if you keep your own pools like most programs
do you need to know this to select the right pool.
On single threaded programs it is usually not a big issue because they
tend to start on one node, allocate all their memory there and then eventually
use it there too, but on multithreaded programs where threads can
run on different nodes it's a bigger problem to make sure the threads
can get node local memory for best performance.
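A rough sketch of the per CPU data idea from the first point, not part of the patch: counters are sharded by the CPU the thread is probably running on, and updated atomically because the thread may migrate between the lookup and the update. The names (`NSLOTS`, `counter_add`, `counter_sum`) are made up for illustration, and the getcpu system call is used as a stand-in for vgetcpu() on the assumption that the kernel running the sketch provides it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NSLOTS 64               /* hypothetical number of per CPU slots */

/* One counter per slot, aligned to its own cache line to avoid bouncing. */
struct slot {
    _Alignas(64) atomic_long count;
};

static struct slot slots[NSLOTS];

/* Slot of the CPU we are probably on; may be stale by the time it is used. */
static unsigned current_slot(void)
{
    unsigned cpu = 0;

    /* Stand-in for vgetcpu(): ask the kernel, fall back to slot 0. */
    if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
        cpu = 0;
    return cpu % NSLOTS;
}

/* Add to the (probably) local counter; atomic since we might have moved. */
void counter_add(long n)
{
    atomic_fetch_add_explicit(&slots[current_slot()].count, n,
                              memory_order_relaxed);
}

/* Sum all slots to get the global value. */
long counter_sum(void)
{
    long total = 0;

    for (size_t i = 0; i < NSLOTS; i++)
        total += atomic_load_explicit(&slots[i].count, memory_order_relaxed);
    return total;
}
```

A stale CPU number only costs a cache miss here, never correctness, which is the property the proposal relies on.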
At first look such a call still looks like a bad idea - after all the kernel can
switch a process at any time to other CPUs so any result of this call might
be wrong as soon as it returns.
But at a closer look it really makes sense:
- The kernel has strong thread affinity and usually keeps a process on the
same CPU. So switching CPUs is rare. This makes it a useful optimization.
The alternative is usually to bind the process to a specific CPU - then it
"knows" where it is - but the problem is that this is nasty to use and
requires user configuration. The kernel often can make better decisions on
where to schedule. And doing it automatically makes it just work.
This cannot be done effectively in user space because only the kernel
knows how to get this information from the CPUs because it requires
translating local APIC numbers to Linux CPU numbers.
Doing it in a syscall is too slow so doing it in a vsyscall makes sense.
I have patches now in my tree from Vojtech
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/getcpu-vsyscall
(note doesn't apply on its own, needs earlier patches in the quilt set)
The prototype is
long vgetcpu(int *cpu, int *node, unsigned long *tcache)
cpu gets the current CPU number if not NULL.
node gets the current node number if not NULL.
tcache is a pointer to a two element long array and can also be NULL. Described below.
Return is always 0.
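To make the calling convention concrete, here is a minimal sketch of a function with the same prototype. It is not the vsyscall itself: `my_vgetcpu` is a hypothetical name, it is backed by the getcpu system call (an assumption about the kernel the sketch runs on), and it ignores tcache.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical stand-in with the proposed prototype.
 * cpu and node are filled in only if they are not NULL.
 * tcache is accepted but unused in this sketch.
 */
long my_vgetcpu(int *cpu, int *node, unsigned long *tcache)
{
    unsigned c = 0, n = 0;

    (void)tcache;
    if (syscall(SYS_getcpu, &c, &n, NULL) != 0)
        return -1;              /* the proposal always returns 0; this is a syscall-backed sketch */
    if (cpu)
        *cpu = (int)c;
    if (node)
        *node = (int)n;
    return 0;
}
```

A caller that only wants the node would pass NULL for cpu, for example `my_vgetcpu(NULL, &node, NULL)`.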
[I modified the prototype a bit over Vojtech's original implementation
to be more foolproof and add the caching mechanism]
Unfortunately all ways to get this information from the CPU are still relatively slow.
The implementation uses RDTSCP on CPUs that support it and CPUID(1) otherwise.
Both stall the pipeline and add some overhead,
so I added a special caching mechanism. The idea is that if it's a little
slow then user space would likely cache the information anyways. The problem
with caching is that you need a way to find if it's out of date. User space
cannot do this because it doesn't have a fast way to access a time stamp.
But the x86-64 vsyscall implementation incidentally does - vgettimeofday()
already has access to jiffies, which can be used as a timestamp to
invalidate the cache. The vsyscall cannot cache this information by itself
though - it doesn't have any storage. The idea is that the user would pass a
TLS variable in there which is then used for storage. With that the information
can be at most one jiffy out of date, which is good enough.
The contents of the cache are theoretically supposed to be opaque (although I'm
sure user programs will soon abuse that because it will be such a convenient way
to get at jiffies ..). I've considered xoring it with a value to make it clear
that it's not meant for that, but that is probably overkill (?). It might still be safer because
jiffies is unsafe to use in user space because the unit might change.
The array is slightly ugly - one open possibility is to replace it with
a structure. Shouldn't make much difference to the general semantics of the syscall though.
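The caching scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the patch: the tick value is passed in as a parameter instead of being read from jiffies, the slow RDTSCP/CPUID path is replaced by a stub that also counts how often it runs, and `struct getcpu_cache`, `STAMP_KEY` and `cached_getcpu` are hypothetical names. It uses a structure instead of the two element array and xors the stored timestamp, as considered above.

```c
#include <stddef.h>

/* Caller-provided storage (a TLS variable in practice); opaque to callers. */
struct getcpu_cache {
    unsigned long stamp;        /* tick of the last refresh, xor-ed with a key */
    unsigned long cpu_node;     /* packed as (node << 16) | cpu */
};

#define STAMP_KEY 0x5a5a5a5aUL  /* hypothetical value that hides the raw tick */

/* Number of slow-path lookups, kept only to make the effect visible. */
static unsigned long slow_lookups;

/* Stub for the slow hardware query (RDTSCP or CPUID(1) in the proposal). */
static unsigned long slow_cpu_node(unsigned hw_cpu, unsigned hw_node)
{
    slow_lookups++;
    return ((unsigned long)hw_node << 16) | hw_cpu;
}

/*
 * now:            current tick, standing in for jiffies.
 * hw_cpu/hw_node: what the hardware would report at this moment.
 * A zero-filled cache is mistaken for valid only at the one tick equal
 * to STAMP_KEY; a real implementation would have the same kind of corner.
 */
long cached_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *c,
                   unsigned long now, unsigned hw_cpu, unsigned hw_node)
{
    unsigned long v;

    if (c && c->stamp == (now ^ STAMP_KEY)) {
        v = c->cpu_node;        /* same tick: reuse, possibly stale */
    } else {
        v = slow_cpu_node(hw_cpu, hw_node);
        if (c) {
            c->stamp = now ^ STAMP_KEY;
            c->cpu_node = v;
        }
    }
    if (cpu)
        *cpu = (unsigned)(v & 0xffff);
    if (node)
        *node = (unsigned)(v >> 16);
    return 0;
}
```

Within one tick a migrated thread keeps seeing the old CPU and node, which is the bounded staleness the proposal accepts in exchange for skipping the slow path.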
Some numbers (getpid is included to compare against the cost of a syscall):

AMD RevF (with RDTSCP support):
  getpid            162 cycles
  vgetcpu           145 cycles
  vgetcpu rdtscp     32 cycles
  vgetcpu cached     14 cycles

Intel Pentium-D (Smithfield):
  getpid            719 cycles
  vgetcpu           535 cycles
  vgetcpu cached     27 cycles

AMD RevE:
  getpid            162 cycles
  vgetcpu           185 cycles
  vgetcpu cached     15 cycles

As you can see CPUID(1) is always very slow, but it usually still narrowly wins
against the syscall, except on AMD E stepping. The difference
is very small there, and while it would have been possible to implement
a third mode for this that uses a real syscall, I ended up not doing so because it
has some other implications.
With the caching mechanism it really flies though and should be fast enough
for most uses.
My eventual hope is that glibc will start using this to implement a NUMA aware
malloc() in user space that preferably allocates local memory.
I would say that's the biggest gap we still have in "general purpose" NUMA tuning
on Linux. Of course it will likely be useful for a lot of other scalable
code too.
Comments on the general mechanism are welcome. If someone is interested in using
this in user space for SMP or NUMA tuning please let me know.
I haven't quite made up my mind yet if it's 2.6.18 material or not.
-Andi
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
@ 2006-06-14 10:47 ` Alan Cox
2006-06-14 14:54 ` Steve Munroe
` (3 subsequent siblings)
4 siblings, 0 replies; 69+ messages in thread
From: Alan Cox @ 2006-06-14 10:47 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech

On Wed, 2006-06-14 at 09:42 +0200, Andi Kleen wrote:
> Comments on the general mechanism are welcome. If someone is interested in using
> this in user space for SMP or NUMA tuning please let me know.

Will 2 words always be enough? It costs nothing to demand 8 or 16 ...
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
2006-06-14 10:47 ` Alan Cox
@ 2006-06-14 14:54 ` Steve Munroe
2006-06-15 23:17 ` Benjamin Herrenschmidt
[not found] ` <449029DB.7030505@redhat.com>
` (2 subsequent siblings)
4 siblings, 1 reply; 69+ messages in thread
From: Steve Munroe @ 2006-06-14 14:54 UTC (permalink / raw)
To: Andi Kleen, benh
Cc: discuss, libc-alpha, libc-alpha-owner, linux-kernel, vojtech

Andi Kleen <ak@suse.de> wrote on 06/14/2006 02:42:31 AM:

> I got several requests over the years to provide a fast way to get
> the current CPU and node on x86-64. That is useful for a couple of things:
>
> - The kernel gets a lot of benefit from using per CPU data to get better
> cache locality and avoid cache line bouncing. This is currently
> not quite possible for user programs. With a fast way to know the current
> CPU user space can use per CPU data that is likely in cache already.
> Locking is still needed of course - after all the thread might switch
> to a different CPU - but at least the memory should be already in cache
> and locking on cached memory is much cheaper.
>
> - For NUMA optimization in user space you really need to know the current
> node to find out where to allocate memory from.
> If you allocate a fresh page from the kernel the kernel will give you
> one in the current node, but if you keep your own pools like most programs
> do you need to know this to select the right pool.
> On single threaded programs it is usually not a big issue because they
> tend to start on one node, allocate all their memory there and then eventually
> use it there too, but on multithreaded programs where threads can
> run on different nodes it's a bigger problem to make sure the threads
> can get node local memory for best performance.

PowerPC has similar issues and could use VDSO/vsyscall to implement
vgetcpu() as well. So we should get Ben Herrenschmidt involved to ensure
that we have a cross platform solution.

> [...]

Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 14:54 ` Steve Munroe
@ 2006-06-15 23:17 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 69+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-15 23:17 UTC (permalink / raw)
To: Steve Munroe
Cc: Andi Kleen, discuss, libc-alpha, libc-alpha-owner, linux-kernel, vojtech

> PowerPC has similar issues and could use VDSO/vsyscall to implement
> vgetcpu() as well. So we should get Ben Herrenschmidt involved to ensure
> that we have a cross platform solution.

Except that I haven't yet found a way to pass the information to the
vdso... In the past, there used to be an SPRG that was readable by
userland that I could have used, but I can't see that working on recent
CPUs. The PIR isn't quite portable (though the vDSO can have per-cpu
model) and we don't quite know for sure what's in there, especially on
shared processor machines.

Any idea?

Ben.
^ permalink raw reply [flat|nested] 69+ messages in thread
[parent not found: <449029DB.7030505@redhat.com>]
[parent not found: <200606141752.02361.ak@suse.de>]
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
[not found] ` <200606141752.02361.ak@suse.de>
@ 2006-06-14 16:30 ` Ulrich Drepper
2006-06-14 17:34 ` [discuss] " Andi Kleen
0 siblings, 1 reply; 69+ messages in thread
From: Ulrich Drepper @ 2006-06-14 16:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech

Andi Kleen wrote:
> Eventually we'll need a dynamic format but I'll only add it
> for new calls that actually require it for security.
> vgetcpu doesn't need it.

Just introduce the vdso now, add all new vdso calls there. There is no
reason except laziness to continue with these moronic fixed addresses.
They only get in the way of address space layout change/optimizations.

And nobody said anything about breaking apps which use the fixed
addresses. That code can still be available. One should be able to turn
it off with setarch.

>>> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
>> Do you expect the value returned in *cpu and *node to require an error
>> value? If not, then why this fascination with signed types?
>
> Shouldn't make a difference.

If there is no reason for a signed type none should be used. It can
only lead to problems.

This reminds me: what are the values for the CPU number? Are they
continuous? Are they the same as those used in the affinity syscalls
(they better be)? With hotplug CPUs, are CPU numbers "recycled"?

>> And as for the cache: you definitely should use a length parameter.
>> We've seen in the past over and over again that implicit length
>> requirements sooner or later fail.
>
> No, the cache should be completely opaque to user space. It's just
> temporary space for the vsyscall which it cannot store for itself.
> I'll probably change it to a struct to make that clearer.
>
> length doesn't make sense for that use.

You didn't even try to understand what I said. Yes, in this one case
you might at this point in time only need two words. But

- this might change

- there might be other future functions in the vdso which need memory.
  It is a huge pain to provide more and more of these individual
  variables. Better allocate one chunk.

> If some other function needs a cache too it can define its own.
> I don't see any advantage of using a shared buffer.

I believe it that _you_ don't see it. Because the pain is in the libc.
The code to set up stack frames has to be adjusted for each new TLS
variable. It is better to do it once in a general way which is what I
suggested.

> I think you're misunderstanding the concept.

No, I understand perfectly. You don't get it because you don't want to
understand the userlevel side.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 16:30 ` Ulrich Drepper
@ 2006-06-14 17:34 ` Andi Kleen
0 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-14 17:34 UTC (permalink / raw)
To: discuss; +Cc: Ulrich Drepper, linux-kernel, libc-alpha, vojtech

On Wednesday 14 June 2006 18:30, Ulrich Drepper wrote:
> > Eventually we'll need a dynamic format but I'll only add it
> > for new calls that actually require it for security.
> > vgetcpu doesn't need it.
>
> Just introduce the vdso now, add all new vdso calls there. There is no
> reason except laziness to continue with these moronic fixed addresses.
> They only get in the way of address space layout change/optimizations.

The user address space size on x86-64 is final (barring the architecture
getting extended beyond 48-bit VA). We already use all positive space.
But the vsyscalls don't even live in user address space.

> >>> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
> >> Do you expect the value returned in *cpu and *node to require an error
> >> value? If not, then why this fascination with signed types?
> >
> > Shouldn't make a difference.
>
> If there is no reason for a signed type none should be used. It can
> only lead to problems.

OK, I can change it to unsigned if you feel that strongly about it.

> This reminds me: what are the values for the CPU number? Are they
> continuous? Are they the same as those used in the affinity syscalls
> (they better be)?

Yes of course.

> With hotplug CPUs, are CPU numbers "recycled"?

I think if the same CPU gets unplugged and replugged it should get the
same number. Otherwise new numbers should be allocated.

> Yes, in this one case
> you might at this point in time only need two words. But
>
> - this might change

Alan suggested adding some padding which probably makes sense, although
I frankly don't see the implementation changing. Variable length would
be clear overkill and I refuse to overdesign this.

> - there might be other future functions in the vdso which need memory.
> It is a huge pain to provide more and more of these individual
> variables. Better allocate one chunk.

Why is it a problem? It's just a __thread variable, isn't it?

> > If some other function needs a cache too it can define its own.
> > I don't see any advantage of using a shared buffer.
>
> I believe it that _you_ don't see it. Because the pain is in the libc.
> The code to set up stack frames has to be adjusted for each new TLS
> variable. It is better to do it once in a general way which is what I
> suggested.

Hmm, I thought user space could define arbitrary __thread variables of
its own. I certainly used that in some of my code. Why is it a problem
for the libc to do the same?

Anyways even if it's such a big problem you can put it all in one chunk
and partition it yourself given the fixed size. I don't think the kernel
code should concern itself about this.

-Andi
^ permalink raw reply [flat|nested] 69+ messages in thread
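As a rough illustration of the "one chunk, partitioned by user space" idea from the message above, assuming a fixed chunk size: the names, the 16-word size and the layout below are all hypothetical and not anything the kernel or glibc defines.

```c
#include <stddef.h>

#define TLS_CHUNK_WORDS    16   /* hypothetical fixed size of the chunk */
#define GETCPU_CACHE_WORDS 2    /* two words, as in the proposed prototype */

/* Hypothetical layout: the getcpu cache first, everything else after it. */
enum {
    GETCPU_CACHE_OFF = 0,
    FIRST_FREE_OFF   = GETCPU_CACHE_OFF + GETCPU_CACHE_WORDS
};

/* One per-thread chunk, set up once and zero-filled for every thread. */
static _Thread_local unsigned long tls_chunk[TLS_CHUNK_WORDS];

/* Area that would be passed as the tcache argument. */
unsigned long *getcpu_cache_area(void)
{
    return &tls_chunk[GETCPU_CACHE_OFF];
}

/* Hand out n words starting at off, or NULL if they overlap or do not fit. */
unsigned long *tls_chunk_area(size_t off, size_t n)
{
    if (off < (size_t)FIRST_FREE_OFF || off > TLS_CHUNK_WORDS ||
        n > TLS_CHUNK_WORDS - off)
        return NULL;
    return &tls_chunk[off];
}
```

With this layout the library adds a single TLS variable once, and later consumers only need a new offset constant.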
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
` (2 preceding siblings ...)
[not found] ` <449029DB.7030505@redhat.com>
@ 2006-06-15 18:44 ` Tony Luck
2006-06-16 6:22 ` Andi Kleen
2006-06-19 0:15 ` Paul Jackson
4 siblings, 1 reply; 69+ messages in thread
From: Tony Luck @ 2006-06-15 18:44 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech

On 6/14/06, Andi Kleen <ak@suse.de> wrote:
> But at a closer look it really makes sense:
> - The kernel has strong thread affinity and usually keeps a process on the
> same CPU. So switching CPUs is rare. This makes it an useful optimization.

Alternatively it means that this will almost always do the right thing, but
once in a while it won't, your application will happen to have been migrated
to a different cpu/node at the point it makes the call, and from then on
this instance will behave oddly (running slowly because it allocates most
of its memory on the wrong node). When you try to reproduce the problem,
the application will work normally.

> The alternative is usually to bind the process to a specific CPU - then it
> "know" where it is - but the problem is that this is nasty to use and
> requires user configuration. The kernel often can make better decisions on
> where to schedule. And doing it automatically makes it just work.

Another alternative would be to provide a mechanism for a process
to bind to the current cpu (whatever cpu that happens to be). Then
the kernel gets to make the smart placement decisions, and processes
that want to be bound somewhere (but don't really care exactly where)
have a way to meet their need. Perhaps a cpumask of all zeroes to a
sched_setaffinity call could be overloaded for this?

Or we can dig up some of the old virtual cpu/virtual node suggestions (we
will eventually need to do something like this, but most systems now
don't have enough cpus for this to make much sense yet).

-Tony
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-15 18:44 ` Tony Luck
@ 2006-06-16 6:22 ` Andi Kleen
2006-06-16 7:23 ` Gerd Hoffmann
2006-06-16 9:48 ` Jes Sorensen
0 siblings, 2 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 6:22 UTC (permalink / raw)
To: Tony Luck; +Cc: discuss, linux-kernel, libc-alpha, vojtech

On Thursday 15 June 2006 20:44, Tony Luck wrote:
> On 6/14/06, Andi Kleen <ak@suse.de> wrote:
> > But at a closer look it really makes sense:
> > - The kernel has strong thread affinity and usually keeps a process on the
> > same CPU. So switching CPUs is rare. This makes it an useful optimization.
>
> Alternatively it means that this will almost always do the right thing, but
> once in a while it won't, your application will happen to have been migrated
> to a different cpu/node at the point it makes the call, and from then on
> this instance will behave oddly (running slowly because it allocates most
> of its memory on the wrong node). When you try to reproduce the problem,
> the application will work normally.

That's inherent in NUMA. No good way around that. We have a similar
problem with caches because we don't color them. People have learned
to live with it.

> > The alternative is usually to bind the process to a specific CPU - then it
> > "know" where it is - but the problem is that this is nasty to use and
> > requires user configuration. The kernel often can make better decisions on
> > where to schedule. And doing it automatically makes it just work.
>
> Another alternative would be to provide a mechanism for a process
> to bind to the current cpu (whatever cpu that happens to be). Then
> the kernel gets to make the smart placement decisions, and processes
> that want to be bound somewhere (but don't really care exactly where)
> have a way to meet their need. Perhaps a cpumask of all zeroes to a
> sched_setaffinity call could be overloaded for this?

I tried something like this a few years ago and it just didn't work
(or rather usually ran slower). The scheduler would select a home node
at startup and then try to move the process there.

The problem is that not using a CPU costs you much more than whatever
overhead you get from using non local memory.

So by default filling the CPUs must be the highest priority and memory
policy cannot interfere with that.

-Andi
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 6:22 ` Andi Kleen
@ 2006-06-16 7:23 ` Gerd Hoffmann
2006-06-16 7:37 ` Andi Kleen
2006-06-16 9:48 ` Jes Sorensen
1 sibling, 1 reply; 69+ messages in thread
From: Gerd Hoffmann @ 2006-06-16 7:23 UTC (permalink / raw)
To: Andi Kleen; +Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech

Andi Kleen wrote:
>> Alternatively it means that this will almost always do the right thing, but
>> once in a while it won't, your application will happen to have been migrated
>> to a different cpu/node at the point it makes the call, and from then on
>> this instance will behave oddly (running slowly because it allocates most
>> of its memory on the wrong node). When you try to reproduce the problem,
>> the application will work normally.
>
> That's inherent in NUMA. No good way around that.

Hmm, maybe it makes sense to allow binding memory areas to threads
instead of nodes. That way the kernel may attempt to migrate the pages
to another node in case it migrates threads / processes. Either via
mbind(), or maybe better via madvise() to make clear it's a hint only.

cheers,
Gerd

--
Gerd Hoffmann <kraxel@suse.de>
http://www.suse.de/~kraxel/julika-dora.jpeg
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 7:23 ` Gerd Hoffmann
@ 2006-06-16 7:37 ` Andi Kleen
0 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 7:37 UTC (permalink / raw)
To: Gerd Hoffmann; +Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech

On Friday 16 June 2006 09:23, Gerd Hoffmann wrote:
> Andi Kleen wrote:
> >> Alternatively it means that this will almost always do the right thing, but
> >> once in a while it won't, your application will happen to have been migrated
> >> to a different cpu/node at the point it makes the call, and from then on
> >> this instance will behave oddly (running slowly because it allocates most
> >> of its memory on the wrong node). When you try to reproduce the problem,
> >> the application will work normally.
> >
> > That's inherent in NUMA. No good way around that.
>
> Hmm, maybe it makes sense to allow binding memory areas to threads
> instead of nodes. That way the kernel may attempt to migrate the pages
> to another node in case it migrates threads / processes. Either via
> mbind(), or maybe better via madvise() to make clear it's a hint only.

I haven't tried that but I have talked to others who tried to implement
automatic page migration and they say they couldn't make that work
(or rather make it a win) either.

-Andi
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 6:22 ` Andi Kleen
@ 2006-06-16 9:48 ` Jes Sorensen
2006-06-16 9:48 ` Jes Sorensen
1 sibling, 0 replies; 69+ messages in thread
From: Jes Sorensen @ 2006-06-16 9:48 UTC (permalink / raw)
To: Andi Kleen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:

Andi> On Thursday 15 June 2006 20:44, Tony Luck wrote:
>> Another alternative would be to provide a mechanism for a process
>> to bind to the current cpu (whatever cpu that happens to be). Then
>> the kernel gets to make the smart placement decisions, and
>> processes that want to be bound somewhere (but don't really care
>> exactly where) have a way to meet their need. Perhaps a cpumask of
>> all zeroes to a sched_setaffinity call could be overloaded for
>> this?

Andi> I tried something like this a few years ago and it just didn't
Andi> work (or rather ran usually slower) The scheduler would select a
Andi> home node at startup and then try to move the process there.

Andi> The problem is that not using a CPU costs you much more than
Andi> whatever overhead you get from using non local memory.

It all depends on your application and the type of system you are
running on. What you say applies to smaller cpu counts. However once
we see the upcoming larger count multi-core cpus become commonly
available, this is likely to change and become more like what is seen
today on larger NUMA systems.

In the scientific application space, there are two very common
groupings of jobs. One is simply a large threaded application with a
lot of intercommunication, often via MPI. In many cases one ends up
running a job on just a subset of the system, in which case you want
to see threads placed on the same node(s) to minimize internode
communication. It is desirable to force the other tasks on the
system (system daemons etc) onto other node(s) to reduce noise, and
there could also be space to run another parallel job on the remaining
node(s).

The other common case is to have jobs which spawn off a number of
threads that work together in groups (via OpenMP). In this case you
would like to have all your OpenMP threads placed on the same node for
similar reasons.

Not getting this right can result in significant loss of performance
for jobs which are highly memory bound or rely heavily on
intercommunication and synchronization.

Andi> So by default filling the CPUs must be the highest priority and
Andi> memory policy cannot interfere with that.

I really don't think this approach is going to solve the problem. As
Tony also points out, tasks will eventually migrate. The user needs to
tell the kernel where it wants to run the tasks rather than the kernel
telling the task where it is located. Only the application (or
developer/user) knows how the threads are expected to behave; doing
this automatically is almost never going to be optimal. Obviously the
user needs visibility of the topology of the machine to do so but that
should be available on any NUMA system through /proc or /sys.

In the scientific space the jobs are often run repeatedly with new
data sets every time, so it is worthwhile to spend the effort up front
to get the placement right. One-off runs are obviously something else
and there your method is going to be more beneficial.

IMHO, what we really need is a more advanced way for user applications
to hint at the kernel how to place its threads.

Cheers,
Jes
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 9:48 ` Jes Sorensen
@ 2006-06-16 10:09 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 10:09 UTC (permalink / raw)
To: Jes Sorensen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

> It all depends on your application and the type of system you are
> running on. What you say applies to smaller cpu counts. However once
> we see the upcoming larger count multi-core cpus become commonly
> available, this is likely to change and become more like what is seen
> today on larger NUMA systems.

Maybe. Maybe not.

>
> In the scientific application space, there are two very common
> groupings of jobs.

The scientific users just use pinned CPUs and seem to be happy with that.
They also have cheap slav^wgrade students to spend lots of time on
manual tuning. I'm not concerned about them.

If you already use CPU affinity you should already know where you are and
don't need this call at all. So this clearly isn't targeted for them.

Interesting is getting the best performance from general purpose
applications without any special tuning. For them I'm trying to improve
things.

Number one applications currently are databases and JVMs. I hope with
Wolfam's malloc work it will be useful for more applications too.

> Andi> So by default filling the CPUs must be the highest priority and
> Andi> memory policy cannot interfere with that.
>
> I really don't think this approach is going to solve the problem. As
> Tony also points out, tasks will eventually migrate.

Currently we don't solve this problem with the standard heuristics.
It can be solved with manual tuning (mempolicy, explicit CPU affinity)
but if you're doing that you're already outside the primary use
case of vgetcpu().

vgetcpu() is only trying to be an incremental improvement of the current
simple default local policy.

> The user needs to

Scientific users do that, but other users normally not. I doubt that
is going to change.

-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 10:09 ` Andi Kleen
@ 2006-06-16 11:02 ` Jes Sorensen
-1 siblings, 0 replies; 69+ messages in thread
From: Jes Sorensen @ 2006-06-16 11:02 UTC (permalink / raw)
To: Andi Kleen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Andi Kleen wrote:
>> In the scientific application space, there are two very common
>> groupings of jobs.
>
> The scientific users just use pinned CPUs and seem to be happy with that.
> They also have cheap slav^wgrade students to spend lots of time on
> manual tuning. I'm not concerned about them.

Do they? There are a lot of scientific sites out there which are not
universities or research organizations. They do not have free slave
labour at hand. A lot of users fall into this category, especially the
users with larger systems or large clusters (be it ia64, x86_64 or PPC).

> If you already use CPU affinity you should already know where you are and don't
> need this call at all.

Except that what's currently available isn't sufficient to do what is
needed.

> So this clearly isn't targeted for them.
>
> Interesting is getting the best performance from general purpose applications
> without any special tuning. For them I'm trying to improve things.

Well I am interested in getting the best performance for some of the
same applications, without having to modify them. The current affinity
support simply isn't sufficient for that. Placement has to be targeted
at launch time since thread implementations can change the layout etc.

> Number one applications currently are databases and JVMs. I hope with
> Wolfam's malloc work it will be useful for more applications too.

If you want this to work for general purpose applications, then how is
this new syscall going to help? If you expect application vendors to
code for it, that means few users will benefit.

>> I really don't think this approach is going to solve the problem. As
>> Tony also points out, tasks will eventually migrate.
>
> Currently we don't solve this problem with the standard heuristics.
> It can be solved with manual tuning (mempolicy, explicit CPU affinity)
> but if you're doing that you're already outside the primary use
> case of vgetcpu().

This is another area where the kernel could do better by possibly using
the cpumask to determine where it will allocate memory.

> vgetcpu() is only trying to be an incremental improvement of the current
> simple default local policy.

As Tony rightfully pointed out, tasks do migrate. By making this guess
initially and then expecting the application to run for a long time, you
will end up with it having zero or possibly a negative effect.

>> The user needs to
>
> Scientific users do that, but other users normally not. I doubt that
> is going to change.

I just use scientific users since that's where I have the most recent
detailed data from. Databases could well benefit from what I mentioned,
though the serious ones would want to look into using affinity support
explicitly in their code.

Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:02 ` Jes Sorensen
@ 2006-06-16 11:17 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 11:17 UTC (permalink / raw)
To: Jes Sorensen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

> The current affinity
> support simply isn't sufficient for that. Placement has to be targeted
> at launch time since thread implementations can change the layout etc.

I'm not sure how that's related to vgetcpu, but ok ...

In general if you want to affect placement below the process / shared
memory segment level you should change the application. Anything else
just results in a big messy and unreliable and fragile user command line
interface - a quick look at the respective Irix manpage should make that
clear.

> > Number one applications currently are databases and JVMs. I hope with
> > Wolfam's malloc work it will be useful for more applications too.
>
> If you want this to work for general purpose applications, then how is
> this new syscall going to help?

It will improve their malloc(). They don't know anything about NUMA,
but getting local memory will help them. They already get local
memory now from the kernel when they use big allocations, but
for smaller allocations it doesn't work because the kernel can't
give out anything smaller than a page. This would be solved
by a NUMA aware malloc, but it needs vgetcpu() for this if it
should work without fixed CPU affinity.

Basically it is just for extending the existing already used proven etc.
default local policy to sub pages. Also there might be other uses
of it too (like per CPU data), although I expect most use of that
in user space can be already done using TLS.

JVM and databases will use it too, but since they often use their
own allocators they will need to be modified.

> If you expect application vendors to
> code for it, that means few users will benefit.

Most applications use malloc()

> >> I really don't think this approach is going to solve the problem. As
> >> Tony also points out, tasks will eventually migrate.
> >
> > Currently we don't solve this problem with the standard heuristics.
> > It can be solved with manual tuning (mempolicy, explicit CPU affinity)
> > but if you're doing that you're already outside the primary use
> > case of vgetcpu().
>
> This is another area where the kernel could do better by possibly using
> the cpumask to determine where it will allocate memory.

Modify fallback lists based on cpu affinity?

Would get messy in the code because you couldn't easily precompute
them anymore. But cpusets already does this kind of, even though it has
a quite bad impact on fast paths.

Also what happens if the affinity mask is modified later?
From the high semantics point it is also a little dubious to mesh
them together. My feeling is that as a heuristic it is probably
dubious.

Also when you set cpu affinity you can as well set memory policy too.

>
> > vgetcpu() is only trying to be an incremental improvement of the current
> > simple default local policy.
>
> As Tony rightfully pointed out, tasks do migrate. By making this guess
> initially

The gamble is already there in the local policy. No change at all.
When you already got local memory you can use it better with
vgetcpu() though.

From our experience it works out in most cases though - in general
most benchmarks show better performance with simple local NUMA
policy than SMP mode or no policy.

In the cases where it doesn't you have to either eat the slow down
or use manual tuning.

> I just use scientific users since that's where I have the most recent
> detailed data from. Databases could well benefit from what I mentioned,
> though the serious ones would want to look into using affinity support
> explicitly in their code.

No exactly not - I got requests from "serious" databases to offer
vgetcpu() because affinity is too complicated to configure and manage.

It sounds like you want to solve NUMA world hunger here, not
concentrate on the specific small incremental improvement vgetcpu is
trying to offer.

I'm sure there is much research that could be done in the general NUMA
tuning area, but I would suggest making it research with numbers first
before trying to hack like this anything into the kernel without
a clear understanding first.

-Andi
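The NUMA aware malloc described above, one that extends the local policy to sub-page allocations, could be sketched as follows. This is an illustration under stated assumptions, not the allocator work referred to in the message: the getcpu system call stands in for the proposed vgetcpu(), and MAX_NODES, PAYLOAD_BYTES and every function name here are hypothetical. Blocks have a fixed size, and the refill path uses plain malloc() as a placeholder.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_NODES     8    /* hypothetical upper bound on NUMA nodes */
#define PAYLOAD_BYTES 64   /* fixed object size, to keep the sketch short */

struct block {
    struct block *next;    /* free-list link */
    unsigned home;         /* index of the pool that owns this block */
    _Alignas(16) char payload[PAYLOAD_BYTES];
};

struct node_pool {
    pthread_mutex_t lock;  /* still needed: the thread may migrate at any time */
    struct block *free_list;
};

static struct node_pool pools[MAX_NODES];

void pools_init(void)
{
    for (int i = 0; i < MAX_NODES; i++) {
        pthread_mutex_init(&pools[i].lock, NULL);
        pools[i].free_list = NULL;
    }
}

/* Stand-in for vgetcpu(): ask the kernel which node we are on right now.
 * The answer may be stale by the time it is used; it is only a hint. */
unsigned current_node(void)
{
    unsigned cpu = 0, node = 0;
    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
        return 0;
    return node % MAX_NODES;
}

/* Allocate one block from the pool of the given node. */
void *pool_alloc_on(unsigned node)
{
    assert(node < MAX_NODES);
    struct node_pool *p = &pools[node];

    pthread_mutex_lock(&p->lock);
    struct block *b = p->free_list;
    if (b)
        p->free_list = b->next;
    pthread_mutex_unlock(&p->lock);

    if (!b) {
        /* Placeholder refill. A real allocator would carve blocks out of
         * freshly mapped pages, which the default local policy places on
         * the node that first touches them. */
        b = malloc(sizeof *b);
        if (!b)
            return NULL;
        b->home = node;
    }
    return b->payload;
}

/* Allocate from the pool of whichever node the caller is running on. */
void *pool_alloc(void)
{
    return pool_alloc_on(current_node());
}

/* Return a block to the pool it came from, wherever the caller runs now. */
void pool_free(void *ptr)
{
    struct block *b =
        (struct block *)((char *)ptr - offsetof(struct block, payload));
    struct node_pool *p = &pools[b->home];

    pthread_mutex_lock(&p->lock);
    b->next = p->free_list;
    p->free_list = b;
    pthread_mutex_unlock(&p->lock);
}
```

Recording the home pool in each block means a block freed after a migration still goes back to the pool of the node it was allocated for, so pools do not slowly fill up with remote memory.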
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:17 ` Andi Kleen
@ 2006-06-16 11:58 ` Jes Sorensen
-1 siblings, 0 replies; 69+ messages in thread
From: Jes Sorensen @ 2006-06-16 11:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Andi Kleen wrote:
>> The current affinity
>> support simply isn't sufficient for that. Placement has to be targeted
>> at launch time since thread implementations can change the layout etc.
>
> I'm not sure how that's related to vgetcpu, but ok ...
>
> In general if you want to affect placement below the process / shared memory
> segment level you should change the application.

That would be great, except that a lot of these applications are
'standard' applications which people don't write themselves. Sometimes
the source code is no longer available. We could argue that people
should just rewrite their applications, but in reality this isn't what's
happening.

> It will improve their malloc(). They don't know anything about NUMA,
> but getting local memory will help them. They already get local
> memory now from the kernel when they use big allocations, but
> for smaller allocations it doesn't work because the kernel can't
> give out anything smaller than a page. This would be solved
> by a NUMA aware malloc, but it needs vgetcpu() for this if it
> should work without fixed CPU affinity.

I really don't see the benefit here. malloc already gets pages handed
down from the kernel which are node local due to them being assigned at
a first touch basis. I am not sure about glibc's malloc internals, but
rather than rely on a vgetcpu() call, all it really needs to do is to
keep a thread local pool which will automatically get its memory locally
through first touch usage. I don't see how a new syscall is going to
provide anything to malloc that it doesn't already have. What am I
missing?

> Basically it is just for extending the existing already used proven etc.
> default local policy to sub pages. Also there might be other uses
> of it too (like per CPU data), although I expect most use of that
> in user space can be already done using TLS.

The thread libraries already have their own thread local area which
should be allocated on the thread's own node if done right, which I
assume it is.

> JVM and databases will use it too, but since they often use their
> own allocators they will need to be modified.

I would assume the real databases to be smart enough to benefit from
things being first touch already. JVMs .... well who knows, can't say I
have a lot of faith in anything running in a JVM :)

>> If you expect application vendors to
>> code for it, that means few users will benefit.
>
> Most applications use malloc()

Which doesn't need the vgetcpu() call as far as I can see.

>> This is another area where the kernel could do better by possibly using
>> the cpumask to determine where it will allocate memory.
>
> Modify fallback lists based on cpu affinity?

It's a hint, not guaranteed placement. You have the same problem if you
try to allocate memory on a node and there's nothing left there.

> But cpusets already does this kind of, even though it has a quite
> bad impact on fast paths.
> Also what happens if the affinity mask is modified later?
> From the high semantics point it is also a little dubious to mesh
> them together. My feeling is that as a heuristic it is probably
> dubious.

If you migrate your app elsewhere, you should migrate the pages with it,
or not expect things to run with the local effect.

> The gamble is already there in the local policy. No change at all.
> When you already got local memory you can use it better with
> vgetcpu() though.
>
> From our experience it works out in most cases though - in general
> most benchmarks show better performance with simple local NUMA
> policy than SMP mode or no policy.

Could you share some information about the type of benchmarks?

>> I just use scientific users since that's where I have the most recent
>> detailed data from. Databases could well benefit from what I mentioned,
>> though the serious ones would want to look into using affinity support
>> explicitly in their code.
>
> No exactly not - I got requests from "serious" databases to offer
> vgetcpu() because affinity is too complicated to configure and manage.
>
> It sounds like you want to solve NUMA world hunger here, not
> concentrate on the specific small incremental improvement vgetcpu is trying
> to offer.

I don't really see the point in solving something half way when it can
be done better. Maybe the "serious" databases should open up and let us
know what the problem is they are hitting.

> I'm sure there is much research that could be done in the general NUMA
> tuning area, but I would suggest making it research with numbers first
> before trying to hack like this anything into the kernel without
> a clear understanding first.

Well I did spend a good chunk of time looking at some of this some time
ago and did speak a lot to one of my colleagues who actually runs
benchmarks using some of these tools to understand the impact. If
anything it seems that vgetcpu is the issue that is still in the
research stage.

Cheers,
Jes
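The thread local pool suggested above as an alternative to vgetcpu() might be sketched like this. It is a guess at what is meant, not glibc's actual malloc: each thread keeps its own free list and refills it from freshly mapped anonymous pages, so under the default local policy those pages land on whichever node the thread is running on when it first writes to them. CHUNK_BYTES, OBJ_BYTES and the function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_BYTES (64 * 1024)  /* refill granularity: a handful of pages */
#define OBJ_BYTES   64           /* fixed object size, to keep the sketch short */

struct obj {
    struct obj *next;            /* free-list link, stored inside the free object */
};

/* One free list per thread: no locking and no CPU lookup needed. */
static _Thread_local struct obj *tl_free_list;

/* Map a fresh chunk and thread it onto the calling thread's free list. */
static int tl_refill(void)
{
    char *chunk = mmap(NULL, CHUNK_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (chunk == MAP_FAILED)
        return -1;

    for (size_t off = 0; off + OBJ_BYTES <= CHUNK_BYTES; off += OBJ_BYTES) {
        struct obj *o = (struct obj *)(chunk + off);
        /* This write is the first touch: it faults the page in on the
         * node the thread happens to be running on at this moment. */
        o->next = tl_free_list;
        tl_free_list = o;
    }
    return 0;
}

void *tl_alloc(void)
{
    if (!tl_free_list && tl_refill() != 0)
        return NULL;
    struct obj *o = tl_free_list;
    tl_free_list = o->next;
    return o;
}

/* Note: an object released by a different thread than the one that
 * allocated it ends up on the releasing thread's list. */
void tl_release(void *ptr)
{
    assert(ptr != NULL);
    struct obj *o = ptr;
    o->next = tl_free_list;
    tl_free_list = o;
}
```

The pool never asks which CPU or node it is on; placement is decided once, at first touch, and stays with the thread's list even if the thread later migrates.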
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:58 ` Jes Sorensen
@ 2006-06-16 12:36 ` Zoltan Menyhart
-1 siblings, 0 replies; 69+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 12:36 UTC (permalink / raw)
To: Jes Sorensen
Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech,
linux-ia64

Just to make sure I understand it correctly...
Assuming I have allocated per CPU data (numa control, etc.) pointed at by:

	void *per_cpu[MAXCPUS];

Assuming a per CPU variable has got an "offset" in each per CPU data area.
Accessing this variable can be done as follows:

	err = vgetcpu(&my_cpu, ...);
	if (err) goto ....
	pointer = (typeof pointer) (per_cpu[my_cpu] + offset);
	// use "pointer"...

It is a hundred times longer than "__get_per_cpu(var)++".

As we do not know when we can be moved to another CPU,
"vgetcpu()" has to be called again after a "reasonably short" time.

My idea is to map the current task structure at an arch. dependent
virtual address into the user space (obviously in RO).

	#define current ((struct task_struct *) 0x...)

No more need for "vgetcpu()" at all.

The example above becomes:

	pointer = (typeof pointer) (per_cpu[current->thread_info.cpu] + offset);
	// use "pointer"...

As obtaining "pointer" does not cost much, it can be re-calculated at each
usage => no problem to know when to recheck it, there is less chance for
using the data of a neighbor.

Regards,

Zoltan Menyhart
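The per CPU access pattern in the message above can be tried from user space today along these lines. This is a sketch under assumptions: glibc's sched_getcpu() stands in for the proposed vgetcpu(), the per CPU variable is a simple counter, and the names percpu_slot, counter_inc() and counter_sum() are made up. Because the thread can be moved between looking up the CPU and using the slot, the update is atomic; the CPU number only decides which cache line is likely to be local.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define MAXCPUS 1024             /* hypothetical upper bound on CPU numbers */

/* One slot per CPU, padded to a cache line so that CPUs do not bounce
 * each other's slots (64 bytes is an assumption about the line size). */
struct percpu_slot {
    _Alignas(64) atomic_ulong counter;
};

static struct percpu_slot per_cpu[MAXCPUS];

/* Increment the counter slot of the CPU we are (probably) running on. */
void counter_inc(void)
{
    int cpu = sched_getcpu();    /* stand-in for vgetcpu(&my_cpu, ...) */
    if (cpu < 0)
        cpu = 0;                 /* lookup failed: fall back to slot 0 */

    /* The thread may already be on another CPU here, so the slot is
     * only a hint and the update itself must still be safe. */
    atomic_fetch_add_explicit(&per_cpu[cpu % MAXCPUS].counter, 1,
                              memory_order_relaxed);
}

/* Add up all slots to get the total. */
unsigned long counter_sum(void)
{
    unsigned long total = 0;
    for (int i = 0; i < MAXCPUS; i++)
        total += atomic_load_explicit(&per_cpu[i].counter,
                                      memory_order_relaxed);
    return total;
}
```

Looking up the CPU on every access, as the message proposes, keeps the window for using a neighbor's slot short; the atomic update covers the cases where a migration happens anyway.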
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:36 ` Zoltan Menyhart
@ 2006-06-16 12:41 ` Jes Sorensen
-1 siblings, 0 replies; 69+ messages in thread
From: Jes Sorensen @ 2006-06-16 12:41 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per CPU data (numa control, etc.) pointed at by:

I think you misunderstood - vgetcpu is for userland usage, not within
the kernel.

Cheers,
Jes

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:41 ` Jes Sorensen
@ 2006-06-16 12:48 ` Zoltan Menyhart
-1 siblings, 0 replies; 69+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 12:48 UTC (permalink / raw)
To: Jes Sorensen
Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Jes Sorensen wrote:
> Zoltan Menyhart wrote:
>
>>Just to make sure I understand it correctly...
>>Assuming I have allocated per CPU data (numa control, etc.) pointed at by:
>
>
> I think you misunderstood - vgetcpu is for userland usage, not within
> the kernel.
>
> Cheers,
> Jes
>

I did understand it as a user land stuff.
This is why I want to map the current task structure into the user space.
In user code, we could see the actual value of the "current->thread_info.cpu".
My "#define current ((struct task_struct *) 0x...)" is not the same as
the kernel's one.

Thanks,

Zoltan

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:48 ` Zoltan Menyhart
@ 2006-06-16 21:04 ` Chase Venters
-1 siblings, 0 replies; 69+ messages in thread
From: Chase Venters @ 2006-06-16 21:04 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Jes Sorensen, Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Zoltan Menyhart wrote:

> Jes Sorensen wrote:
>> Zoltan Menyhart wrote:
>>
>> > Just to make sure I understand it correctly...
>> > Assuming I have allocated per CPU data (numa control, etc.) pointed at
>> > by:
>>
>>
>> I think you misunderstood - vgetcpu is for userland usage, not within
>> the kernel.
>>
>> Cheers,
>> Jes
>>
> I did understand it as a user land stuff.
> This is why I want to map the current task structure into the user space.
> In user code, we could see the actual value of the
> "current->thread_info.cpu".
> My "#define current ((struct task_struct *) 0x...)" is not the same as
> the kernel's one.

I think it's probably best to leave most of the stuff in task_struct
private (ie, mapped in kernel only).

> Thanks,
>
> Zoltan

Thanks,
Chase

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:36 ` Zoltan Menyhart
@ 2006-06-16 14:56 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 14:56 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Friday 16 June 2006 14:36, Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per CPU data (numa control, etc.) pointed at by:
>
> 	void *per_cpu[MAXCPUS];

That is not how user space TLS works. It usually has a base a register.

>
> Assuming a per CPU variable has got an "offset" in each per CPU data area.
> Accessing this variable can be done as follows:
>
> 	err = vgetcpu(&my_cpu, ...);
> 	if (err)
> 		goto ....
> 	pointer = (typeof pointer) (per_cpu[my_cpu] + offset);
> 	// use "pointer"...
>
> It is hundred times more long than "__get_per_cpu(var)++".

14 cycles is not a 100 times longer.

> My idea is to map the current task structure at an arch. dependent
> virtual address into the user space (obviously in RO).
>
> 	#define current	((struct task_struct *) 0x...)

This means it cannot be cache colored (because you would need a static
offset) and you couldn't share task_structs on a page.

Also you would make task_struct part of the userland ABI which
seems like a very very bad idea to me. It means we couldn't change
it anymore.

-Andi

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 14:56 ` Andi Kleen
@ 2006-06-16 15:31 ` Zoltan Menyhart
-1 siblings, 0 replies; 69+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 15:31 UTC (permalink / raw)
To: Andi Kleen
Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Andi Kleen wrote:

> That is not how user space TLS works. It usually has a base a register.

Can you please give me a real life (simplified) example?

> This means it cannot be cache colored (because you would need a static
> offset) and you couldn't share task_structs on a page.

I do not see the problem. Can you explain please?
E.g. the scheduler pulls a task instead of the current one. The CPU
will see "current->thread_info.cpu"-s of all the tasks at the same
offset anyway.

> Also you would make task_struct part of the userland ABI which
> seems like a very very bad idea to me. It means we couldn't change
> it anymore.

We can make some wrapper, e.g.:

	user_per_cpu_var(name, offset)

"vgetcpu()" would also be added to the ABI which we couldn't change
easily either.

Thanks,

Zoltan

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:31 ` Zoltan Menyhart
@ 2006-06-16 15:37 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 15:37 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> Andi Kleen wrote:
>
> > That is not how user space TLS works. It usually has a base a register.
>
> Can you please give me a real life (simplified) example?

On x86-64 it's just %fs:offset. gcc is a bit dumb on this and usually
loads the base address from %fs:0 first.

>
> > This means it cannot be cache colored (because you would need a static
> > offset) and you couldn't share task_structs on a page.
>
> I do not see the problem.

Your scheme relies on task_struct fields being on a known offset in the page.
But slab cache coloring varies the offset to make the data spread
out better in the caches.

> Can you explain please?
> E.g. the scheduler pulls a task instead of the current one. The CPU
> will see "current->thread_info.cpu"-s of all the tasks at the same
> offset anyway.

It varies relative to the start of page. That was one of the bigger
wins relative to the task_struct in stack page of 2.4 had.

>
> > Also you would make task_struct part of the userland ABI which
> > seems like a very very bad idea to me. It means we couldn't change
> > it anymore.
>
> We can make some wrapper, e.g.:
>
> 	user_per_cpu_var(name, offset)

You would need to wrap everything and likely users would like task_struct
so much that they accessed it anyways without your wrappers.

> "vgetcpu()" would also be added to the ABI which we couldn't change
> easily either.

Yes, but it's a defined function. No different from a system call.

-Andi

^ permalink raw reply [flat|nested] 69+ messages in thread
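For readers who want the "real life (simplified) example" of user-space TLS
asked for above, a minimal sketch follows. It uses the GCC __thread storage
class; the names tls_hits, tls_bump() and tls_addr() are hypothetical. On
x86-64 Linux the compiler addresses such a variable relative to the %fs
base, as the message describes, so no per-thread table lookup appears in
the source.

```c
/* One private copy of this variable exists for every thread
 * (GCC/Clang __thread extension). */
static __thread long tls_hits;

/* Bump the calling thread's own copy. No lock is needed: another thread
 * can only reach this copy if it is explicitly handed its address. */
static long tls_bump(void)
{
	return ++tls_hits;
}

/* Address of the calling thread's copy; it differs from thread to thread. */
static long *tls_addr(void)
{
	return &tls_hits;
}
```

Compiling with gcc -O2 -S and reading the assembly for tls_bump() is one way
to see which addressing form a given toolchain actually emits.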
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:37 ` Andi Kleen
@ 2006-06-16 15:58 ` Jakub Jelinek
-1 siblings, 0 replies; 69+ messages in thread
From: Jakub Jelinek @ 2006-06-16 15:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, Jun 16, 2006 at 05:37:06PM +0200, Andi Kleen wrote:
> On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> > Andi Kleen wrote:
> >
> > > That is not how user space TLS works. It usually has a base a register.
> >
> > Can you please give me a real life (simplified) example?
>
> On x86-64 it's just %fs:offset. gcc is a bit dumb on this and usually
> loads the base address from %fs:0 first.

GCC is not dumb, unless you force it with -mno-tls-direct-seg-refs.
Guess you are bitten by SUSE GCC hack which makes -mno-tls-direct-seg-refs
the default (especially on x86-64 it is a really bad idea).

	Jakub

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:58 ` Jakub Jelinek
@ 2006-06-16 16:24 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 16:24 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Friday 16 June 2006 17:58, Jakub Jelinek wrote:
> On Fri, Jun 16, 2006 at 05:37:06PM +0200, Andi Kleen wrote:
> > On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> > > Andi Kleen wrote:
> > >
> > > > That is not how user space TLS works. It usually has a base a register.
> > >
> > > Can you please give me a real life (simplified) example?
> >
> > On x86-64 it's just %fs:offset. gcc is a bit dumb on this and usually
> > loads the base address from %fs:0 first.
>
> GCC is not dumb, unless you force it with -mno-tls-direct-seg-refs.
> Guess you are bitten by SUSE GCC hack which makes -mno-tls-direct-seg-refs
> the default (especially on x86-64 it is a really bad idea).

I apparently got indeed.

I wonder why it happened on x86-64 though - i thought there were no negative
offsets on x86-64 TLS.

-Andi

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 16:24 ` Andi Kleen
@ 2006-06-16 16:33 ` Jakub Jelinek
-1 siblings, 0 replies; 69+ messages in thread
From: Jakub Jelinek @ 2006-06-16 16:33 UTC (permalink / raw)
To: Andi Kleen
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, Jun 16, 2006 at 06:24:52PM +0200, Andi Kleen wrote:
> I wonder why it happened on x86-64 though - i thought there were no negative
> offsets on x86-64 TLS.

It uses negative offsets for __thread vars and positive are reserved for
implementation (i.e. glibc). But as %fs in 64-bit programs is just
msr 0xc0000100 base addition, with no segment limit, neither Xen nor VMWare
can play limit tricks with it.

	Jakub

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:31 ` Zoltan Menyhart
@ 2006-06-16 21:12 ` Chase Venters
-1 siblings, 0 replies; 69+ messages in thread
From: Chase Venters @ 2006-06-16 21:12 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Andi Kleen, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Zoltan Menyhart wrote:

> Andi Kleen wrote:
>
>> That is not how user space TLS works. It usually has a base a register.
>
> Can you please give me a real life (simplified) example?
>
>> This means it cannot be cache colored (because you would need a static
>> offset) and you couldn't share task_structs on a page.
>
> I do not see the problem. Can you explain please?
> E.g. the scheduler pulls a task instead of the current one. The CPU
> will see "current->thread_info.cpu"-s of all the tasks at the same
> offset anyway.

Memory maps have to fall on page boundaries for lots of various reasons.
Assuming a 16-word cache line, you've got plenty of spots you could align
task_struct to within a page. (That number of spots is actually
constrained by either sizeof(task_struct) or the number of colors).

The bottom line is that task_struct won't always be on a page boundary. If
it's not on a page boundary in the physical page frames, it's not going to
be on a page boundary in virtual memory either.

(Note also that if two task_structs shared a page, you'd have an
information leak. I'm not sure with sizeof(task_struct) and cache
alignment if task_structs are small enough for sharing, though. Definitely
on hugepages.)

Thanks,
Chase

^ permalink raw reply [flat|nested] 69+ messages in thread
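The coloring argument above can be made concrete with a small arithmetic
sketch. It is modeled on the textbook slab-coloring scheme, not on the
kernel's actual slab code, and every name and constant in it is an
assumption: a 4096-byte page, a 64-byte cache line (16 words of 4 bytes),
and the hypothetical helpers slab_colors() and slab_color_offset(). The
point it shows is that the first object in successive slabs starts at a
different cache-line-aligned offset, so a field inside the object has no
fixed offset relative to the start of the page.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES   4096u   /* assumed page size */
#define CACHE_LINE_BYTES    64u   /* assumed: 16 words of 4 bytes */

/* Number of distinct cache-line-aligned starting offsets ("colors") that
 * fit in the space left over after packing the objects into one page.
 * Returns 0 if the objects do not fit at all. */
static unsigned slab_colors(size_t obj_size, unsigned objs_per_slab)
{
	size_t used = obj_size * objs_per_slab;

	if (used > PAGE_SIZE_BYTES)
		return 0;
	return (unsigned)((PAGE_SIZE_BYTES - used) / CACHE_LINE_BYTES) + 1;
}

/* Offset of the first object in the n-th slab: colors are handed out
 * round-robin, so consecutive slabs start on different cache lines. */
static size_t slab_color_offset(unsigned slab_index, unsigned colors)
{
	if (colors == 0)
		return 0;
	return (size_t)(slab_index % colors) * CACHE_LINE_BYTES;
}
```

With two 1792-byte objects per page, 512 bytes are left over, which gives
nine possible starting offsets (0, 64, ..., 512) before the sequence wraps.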
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 14:56 ` Andi Kleen
@ 2006-06-16 15:36 ` Brent Casavant
-1 siblings, 0 replies; 69+ messages in thread
From: Brent Casavant @ 2006-06-16 15:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Andi Kleen wrote:

> On Friday 16 June 2006 14:36, Zoltan Menyhart wrote:

> > My idea is to map the current task structure at an arch. dependent
> > virtual address into the user space (obviously in RO).
> >
> > 	#define current	((struct task_struct *) 0x...)
>
> This means it cannot be cache colored (because you would need a static
> offset) and you couldn't share task_structs on a page.
>
> Also you would make task_struct part of the userland ABI which
> seems like a very very bad idea to me. It means we couldn't change
> it anymore.

To this last point, it might be more reasonable to map in a page that
contained a new structure with a stable ABI, which mirrored some of
the task_struct information, and likely other useful information as
needs are identified in the future. In any case, it would be hard
to beat a single memory read for performance.

Cache-coloring and kernel bookkeeping effects could be minimized if this
was provided as an mmaped page from a device driver, used only by
applications which care. This does work somewhat contrary to the idea
of getting support into glibc, unless glibc only used this capability
when asked to through some sort of environment variable or other
run-time configuration.

Brent

-- 
Brent Casavant                          All music is folk music. I ain't
bcasavan@sgi.com                        never heard a horse sing a song.
Silicon Graphics, Inc.                    -- Louis Armstrong

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:36 ` Brent Casavant
@ 2006-06-16 15:40 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 15:40 UTC (permalink / raw)
To: Brent Casavant
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

> To this last point, it might be more reasonable to map in a page that
> contained a new structure with a stable ABI, which mirrored some of
> the task_struct information, and likely other useful information as
> needs are identified in the future. In any case, it would be hard
> to beat a single memory read for performance.

That would mean making the context switch and possibly other
things slower.

In general you would need to make a very good case first that all this
complexity is worth it.

> Cache-coloring and kernel bookkeeping effects could be minimized if this
> was provided as an mmaped page from a device driver, used only by
> applications which care.

I don't see what difference that would make. You would still
have the fixed offset problem and doing things on demand often tends
to be even more complex.

-Andi (who thinks these proposals all sound very messy)

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:40 ` Andi Kleen
@ 2006-06-16 21:15 ` Chase Venters
-1 siblings, 0 replies; 69+ messages in thread
From: Chase Venters @ 2006-06-16 21:15 UTC (permalink / raw)
To: Andi Kleen
Cc: Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Andi Kleen wrote:

>
>> To this last point, it might be more reasonable to map in a page that
>> contained a new structure with a stable ABI, which mirrored some of
>> the task_struct information, and likely other useful information as
>> needs are identified in the future. In any case, it would be hard
>> to beat a single memory read for performance.
>
> That would mean making the context switch and possibly other
> things slower.
>
> In general you would need to make a very good case first that all this
> complexity is worth it.
>
>> Cache-coloring and kernel bookkeeping effects could be minimized if this
>> was provided as an mmaped page from a device driver, used only by
>> applications which care.
>
> I don't see what difference that would make. You would still
> have the fixed offset problem and doing things on demand often tends
> to be even more complex.
>
>
> -Andi (who thinks these proposals all sound very messy)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:40 ` Andi Kleen
@ 2006-06-16 21:19 ` Chase Venters
-1 siblings, 0 replies; 69+ messages in thread
From: Chase Venters @ 2006-06-16 21:19 UTC (permalink / raw)
To: Andi Kleen
Cc: Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

(Sorry for the empty reply! Pine over a laggy SSH connection is annoying
sometimes)

On Fri, 16 Jun 2006, Andi Kleen wrote:

>> To this last point, it might be more reasonable to map in a page that
>> contained a new structure with a stable ABI, which mirrored some of
>> the task_struct information, and likely other useful information as
>> needs are identified in the future. In any case, it would be hard
>> to beat a single memory read for performance.
>
> That would mean making the context switch and possibly other
> things slower.

Well, if every process had a page of its own, what would the context
switch overhead be?

But, I'm not advocating exporting anything. Though I sort of like the
vgetcpu() idea because I was working on a user-space slab allocator
recently and magazines could use vgetcpu() instead of pthread keys. (Also
means if threads > cpus I'd get better results).

Thanks,
Chase
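Editorial sketch of the magazine idea Chase mentions: choose a magazine by the current CPU rather than by a pthread key. This is only an illustration under stated assumptions. vgetcpu() was a proposal at the time of this thread, so the sketch substitutes glibc's sched_getcpu(); the names struct magazine, MAX_CPUS, MAG_SIZE and the helper functions are hypothetical and do not come from any allocator discussed here. As the original posting notes, the thread can move to another CPU right after the lookup, so a real allocator would still need a lock per magazine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>   /* sched_getcpu(), a glibc extension */
#include <stddef.h>

#define MAX_CPUS 64  /* assumption: illustrative upper bound */
#define MAG_SIZE 16  /* assumption: objects cached per magazine */

/* Hypothetical per-CPU cache of free objects. */
struct magazine {
    void  *objs[MAG_SIZE];
    size_t count;
    /* A real allocator would keep a lock here: the CPU number is only a hint. */
};

static struct magazine magazines[MAX_CPUS];

/* Map a CPU number to a magazine; a negative (error) value falls back to slot 0. */
struct magazine *magazine_for_cpu(int cpu)
{
    if (cpu < 0)
        cpu = 0;
    return &magazines[cpu % MAX_CPUS];
}

/* Pick the magazine for the CPU this thread is running on right now. */
struct magazine *current_magazine(void)
{
    return magazine_for_cpu(sched_getcpu());
}

/* Cache a freed object; returns 0 on success, -1 if the magazine is full. */
int magazine_put(struct magazine *m, void *obj)
{
    assert(m != NULL);
    if (m->count == MAG_SIZE)
        return -1;
    m->objs[m->count++] = obj;
    return 0;
}

/* Hand out a cached object, or NULL if the magazine is empty. */
void *magazine_get(struct magazine *m)
{
    assert(m != NULL);
    if (m->count == 0)
        return NULL;
    return m->objs[--m->count];
}
```

With more threads than CPUs, the number of magazines stays bounded by the CPU count, which is the effect Chase alludes to.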
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 21:19 ` Chase Venters
@ 2006-06-16 23:40 ` Brent Casavant
-1 siblings, 0 replies; 69+ messages in thread
From: Brent Casavant @ 2006-06-16 23:40 UTC (permalink / raw)
To: Chase Venters
Cc: Andi Kleen, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Chase Venters wrote:

> On Fri, 16 Jun 2006, Andi Kleen wrote:
>
> > > To this last point, it might be more reasonable to map in a page that
> > > contained a new structure with a stable ABI, which mirrored some of
> > > the task_struct information, and likely other useful information as
> > > needs are identified in the future. In any case, it would be hard
> > > to beat a single memory read for performance.
> >
> > That would mean making the context switch and possibly other
> > things slower.
>
> Well, if every process had a page of its own, what would the context switch
> overhead be?

Mostly copying the useful information into the read-only mapped page.

However, this doesn't have to be all that expensive. The particular
information we care about in this case only needs to be copied when
a task begins running on a CPU different from the one it last ran on.

In fact, on ia64 we already have something very similar to handle
certain I/O peculiarities on SN2.

http://marc.theaimsgroup.com/?l=linux-ia64&m=113831137712197&w=2

That work could form the basis for a low-impact method of exporting
the current CPU to user space via a read-only mapped page. I'll admit
to having zero knowledge of whether this would be workable on anything
other than ia64.

Thanks,
Brent

--
Brent Casavant                          All music is folk music. I ain't
bcasavan@sgi.com                        never heard a horse sing a song.
Silicon Graphics, Inc.                    -- Louis Armstrong
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 23:40 ` Brent Casavant
@ 2006-06-17 6:58 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-17 6:58 UTC (permalink / raw)
To: Brent Casavant
Cc: Chase Venters, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

> That work could form the basis for a low-impact method of exporting
> the current CPU to user space via a read-only mapped page. I'll admit
> to having zero knowledge of whether this would be workable on anything
> other than ia64.

On x86, per-CPU mappings are not really feasible. That is because
the CPU uses the Linux page tables directly, and to change
them per CPU you would need to fork them per CPU. That would
add so many complications that I don't even want to think
them all through ...

-andi
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 21:19 ` Chase Venters
@ 2006-06-17 6:55 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-17 6:55 UTC (permalink / raw)
To: discuss
Cc: Chase Venters, Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64

On Friday 16 June 2006 23:19, Chase Venters wrote:
> On Fri, 16 Jun 2006, Andi Kleen wrote:
> >> To this last point, it might be more reasonable to map in a page that
> >> contained a new structure with a stable ABI, which mirrored some of
> >> the task_struct information, and likely other useful information as
> >> needs are identified in the future. In any case, it would be hard
> >> to beat a single memory read for performance.
> >
> > That would mean making the context switch and possibly other
> > things slower.
>
> Well, if every process had a page of its own, what would the context
> switch overhead be?

For a process, zero; for a thread, quite high on x86, because you
would need per-CPU page tables. Doing that would be extremely
nasty because you would potentially need to allocate a new
set of page tables every time the process is scheduled to a new
CPU it hasn't run on before.

If you limit it to a process then you can't get the current CPU
from such a mapping because a process can run threaded on multiple
CPUs.

My reference was more to his suggestion of keeping a second version
of task_struct for export. That would require changing everything
in task_struct that is changed on switch_to and should be exported
in the other function too.

-Andi
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-17 6:55 ` Andi Kleen
@ 2006-06-19 8:42 ` Zoltan Menyhart
-1 siblings, 0 replies; 69+ messages in thread
From: Zoltan Menyhart @ 2006-06-19 8:42 UTC (permalink / raw)
To: Andi Kleen
Cc: discuss, Chase Venters, Brent Casavant, Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64

Brent Casavant wrote:

> To this last point, it might be more reasonable to map in a page that
> contained a new structure with a stable ABI, which mirrored some of
> the task_struct information, and likely other useful information as
> needs are identified in the future. In any case, it would be hard
> to beat a single memory read for performance.
>
> Cache-coloring and kernel bookkeeping effects could be minimized if this
> was provided as an mmaped page from a device driver, used only by
> applications which care. This does work somewhat contrary to the idea of
> getting support into glibc, unless glibc only used this capability when
> asked to through some sort of environment variable or other run-time
> configuration.

Quite O.K. for me.

Andi Kleen wrote:

>>Well, if every process had a page of its own, what would the context
>>switch overhead be?

> For a process, zero; for a thread, quite high on x86, because you
> would need per-CPU page tables. Doing that would be extremely
> nasty because you would potentially need to allocate a new
> set of page tables every time the process is scheduled to a new
> CPU it hasn't run on before.

Probably I have not explained it correctly:
- The "information page" (that includes the current CPU no.) is not a
  per-CPU page
- This page is just another page that is mapped at a "well known" user
  virtual address (for those who are interested)
- As you do not take any special action for each user page on a context
  switch, there is nothing to do to this page either
- The scheduler sometimes migrates a task; when it does, it updates
  the current CPU number on the "information page"

> My reference was more to his suggestion of keeping a second version
> of task_struct for export. That would require changing everything
> in task_struct that is changed on switch_to and should be exported
> in the other function too.

It depends on what else can be in this "information page".
As for the current CPU no., you need a single store on each task
migration.

Thanks,

Zoltan
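Editorial sketch of the "information page" Zoltan describes: one store when the task migrates, one load when user space asks where it is. This is a user-space model only. struct info_page and both functions are hypothetical names, not a kernel ABI, and the sketch tracks a single task; how such a page could describe each thread of a multithreaded process is exactly the point disputed in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical layout of the read-only page as user space would see it. */
struct info_page {
    _Atomic unsigned int cpu;  /* CPU the task was last moved to */
};

/* Scheduler side (modelled): a single store per task migration. */
void info_page_migrate(struct info_page *page, unsigned int new_cpu)
{
    assert(page != NULL);
    atomic_store_explicit(&page->cpu, new_cpu, memory_order_relaxed);
}

/* User side: a single load. The value can be stale as soon as it is read. */
unsigned int info_page_current_cpu(struct info_page *page)
{
    assert(page != NULL);
    return atomic_load_explicit(&page->cpu, memory_order_relaxed);
}
```

Nothing in the model runs on an ordinary context switch; only a migration touches the page, which is the cost argument Zoltan and Brent make.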
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 8:42 ` Zoltan Menyhart
@ 2006-06-19 8:54 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-19 8:54 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: discuss, Chase Venters, Brent Casavant, Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64

> Probably I have not explained it correctly:
> - The "information page" (that includes the current CPU no.) is not a
> per CPU page

If it isn't then you can't figure out the current CPU/node for a thread.

Anyways I think we're talking past each other. Your approach might
even work on ia64 (at least if you're willing to add a lot of cost
to the context switch). You presumably could implement vgetcpu()
internally with an approach like this (although with IA64's fast
EPC calls it seems a bit pointless).

It just won't work on x86.

-Andi
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:58 ` Jes Sorensen
@ 2006-06-16 14:54 ` Andi Kleen
-1 siblings, 0 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-16 14:54 UTC (permalink / raw)
To: discuss
Cc: Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64

> I really don't see the benefit here. malloc already gets pages handed
> down from the kernel which are node local due to them being assigned at
> a first touch basis. I am not sure about glibc's malloc internals, but
> rather than rely on a vgetcpu() call, all it really needs to do is to keep
> a thread local pool which will automatically get its thing locally
> through first touch usage.

That would add too much overhead on small systems. It's better to be
able to share the pools. vgetcpu allows that.

> > Basically it is just for extending the existing already used proven etc.
> > default local policy to sub pages. Also there might be other uses
> > of it too (like per CPU data), although I expect most use of that
> > in user space can be already done using TLS.
>
> The thread libraries already have their own thread local area which
> should be allocated on the thread's own node if done right, which I
> assume it is.

- The heap for small allocations is shared (although this can be tuned)
- When another thread does free() you need special handling to keep
  the item in the correct free lists.
  This is one of the tricky bits in the new kernel NUMA slab allocator
  too.

> > But cpusets already does this kind of, even though it has a quite
> > bad impact on fast paths.
> > Also what happens if the affinity mask is modified later?
> > From the high semantics point it is also a little dubious to mesh
> > them together. My feeling is that as a heuristic it is probably
> > dubious.
>
> If you migrate your app elsewhere, you should migrate the pages with it,
> or not expect things to run with the local effect.

That's too costly to do by default and you have no guarantee that it
will amortize.

> I don't really see the point in solving something half way when it can
> be done better. Maybe the "serious" databases should open up and let us
> know what the problem is they are hitting.

I see no indication of anything better so far from you. You only offered
static configuration instead, which, while in some cases better,
doesn't work in the general case.

-Andi
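Editorial sketch of the two points in Andi's list: pools shared per node rather than per thread, and special handling so that a free() from any thread returns the object to the free list of the node it came from. The sketch is an assumption-laden illustration, not glibc's or the kernel's allocator. struct block, struct node_pool, MAX_NODES and OBJ_SIZE are hypothetical, locking is omitted, and since vgetcpu() was still a proposal, the node lookup uses the later Linux getcpu system call through syscall().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>  /* SYS_getcpu */
#include <unistd.h>       /* syscall() */

#define MAX_NODES 8   /* assumption: illustrative upper bound */
#define OBJ_SIZE  64  /* assumption: fixed object size keeps the sketch short */

/* Header stored in front of every object: remembers its home node. */
struct block {
    unsigned int  home_node;
    struct block *next;
};

/* One shared pool per NUMA node (a real pool needs a lock). */
struct node_pool {
    struct block *free_list;
    size_t        free_count;
};

static struct node_pool pools[MAX_NODES];

/* Node the caller is running on right now, or 0 if the lookup fails. */
unsigned int current_node(void)
{
    unsigned int cpu = 0, node = 0;

    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
        return 0;
    return node % MAX_NODES;
}

/* Allocate from the given node's pool, falling back to malloc(). */
void *pool_alloc(unsigned int node)
{
    struct node_pool *p;
    struct block *b;

    node %= MAX_NODES;
    p = &pools[node];
    b = p->free_list;
    if (b != NULL) {
        p->free_list = b->next;
        p->free_count--;
    } else {
        b = malloc(sizeof *b + OBJ_SIZE);
        if (b == NULL)
            return NULL;
        b->home_node = node;
    }
    b->next = NULL;
    return b + 1;  /* the object starts right after the header */
}

/* Free from any thread: the header sends the object back to its home node. */
void pool_free(void *obj)
{
    struct block *b = (struct block *)obj - 1;
    struct node_pool *p;

    assert(b->home_node < MAX_NODES);
    p = &pools[b->home_node];
    b->next = p->free_list;
    p->free_list = b;
    p->free_count++;
}
```

A caller would typically write pool_alloc(current_node()); the node number is only a hint, since the thread may be moved right after the lookup.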
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 14:54 ` Andi Kleen
@ 2006-06-20 8:28 ` Jes Sorensen
-1 siblings, 0 replies; 69+ messages in thread
From: Jes Sorensen @ 2006-06-20 8:28 UTC (permalink / raw)
To: Andi Kleen
Cc: discuss, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64

Andi Kleen wrote:
>> I really don't see the benefit here. malloc already gets pages handed
>> down from the kernel which are node local due to them being assigned at
>> a first touch basis. I am not sure about glibc's malloc internals, but
>> rather than rely on a vgetcpu() call, all it really needs to do is to keep
>> a thread local pool which will automatically get its thing locally
>> through first touch usage.
>
> That would add too much overhead on small systems. It's better to be
> able to share the pools. vgetcpu allows that.

How do you expect to be able to share the pools? Or are you saying you
just want one page per numa node? Having a page per thread is not
noticeable, and for databases, which was your primary target usergroup,
I think it's fair to say it won't even be visible as noise.

>>> Basically it is just for extending the existing already used proven etc.
>>> default local policy to sub pages. Also there might be other uses
>>> of it too (like per CPU data), although I expect most use of that
>>> in user space can be already done using TLS.
>> The thread libraries already have their own thread local area which
>> should be allocated on the thread's own node if done right, which I
>> assume it is.
>
> - The heap for small allocations is shared (although this can be tuned)
> - When another thread does free() you need special handling to keep
> the item in the correct free lists
> This is one of the tricky bits in the new kernel NUMA slab allocator
> too.

It should be pretty easy to make the allocator aware of the per thread
regions based on the address.

>> If you migrate your app elsewhere, you should migrate the pages with it,
>> or not expect things to run with the local effect.
>
> That's too costly to do by default and you have no guarantee that it
> will amortize.

But if you don't migrate the pages with it, the numa aware allocation is
wasted anyway, whether you do it on a first-touch basis or using
vgetcpu.

>> I don't really see the point in solving something half way when it can
>> be done better. Maybe the "serious" databases should open up and let us
>> know what the problem is they are hitting.
>
> I see no indication of anything better so far from you. You only offered
> static configuration instead which while in some cases is better
> doesn't work in the general case.

Static configuration? I never said anything about that. I said that libc
should offer a memory pool per thread and have it created when it's
first touched by the thread. That solves exactly what you have described
so far, unless there is something else you also expect to benefit from
vgetcpu().

Jes
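Editorial sketch of Jes's remark that an allocator could find the owning per-thread region from the address alone. It assumes, purely for illustration, that every thread slot gets an equally sized region carved out of one contiguous arena; REGION_SIZE, MAX_THREADS, arena and both functions are hypothetical and are not how glibc lays out its heaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_SIZE 4096  /* assumption: bytes reserved per thread slot */
#define MAX_THREADS 4     /* assumption: number of thread slots */

/* One contiguous arena, split into equally sized per-thread regions. */
static unsigned char arena[MAX_THREADS * REGION_SIZE];

/* Start of the region belonging to a thread slot. */
void *region_base(int slot)
{
    assert(slot >= 0 && slot < MAX_THREADS);
    return arena + (size_t)slot * REGION_SIZE;
}

/* Owning thread slot of an address, or -1 if it lies outside the arena. */
int region_owner(const void *ptr)
{
    uintptr_t base = (uintptr_t)arena;
    uintptr_t addr = (uintptr_t)ptr;

    if (addr < base || addr >= base + sizeof arena)
        return -1;
    return (int)((addr - base) / REGION_SIZE);
}
```

A free() called by another thread could use region_owner() to hand the object back to the region it was allocated from, without asking which CPU or node it is currently running on.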
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
` (3 preceding siblings ...)
2006-06-15 18:44 ` Tony Luck
@ 2006-06-19 0:15 ` Paul Jackson
2006-06-19 8:21 ` Andi Kleen
4 siblings, 1 reply; 69+ messages in thread
From: Paul Jackson @ 2006-06-19 0:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech

Interesting - thanks Andi.

I had one of my colleagues at SGI lobby me hard for such a facility.
I'll see if I can get him on this thread to better explain what he
wanted it for.

Roughly, he was looking to support something resembling the kernel's
per-cpu data in userland library code for high performance scientific
number crunching, for things like statistics gathering and perhaps (not
sure of this) reducing locking costs.

I see "x86-64" in the Subject. I don't see why this facility is
arch-specific. Could it work on any arch, ia64 being the one of
interest to me?

I have some ignorance on your references to "CPUID(1)". I don't recall
what it is. The only command so named I find on my systems is a
Windows command from the year 1999. I doubt that's it.

You wrote:
> As you can see CPUID(1) is always very slow but

but I don't see any stats above the comment mentioning CPUID(1),
so ... er eh ... no I don't see.

--
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 0:15 ` Paul Jackson
@ 2006-06-19 8:21 ` Andi Kleen
2006-06-19 10:09 ` Paul Jackson
2006-06-21 1:18 ` Paul Jackson
0 siblings, 2 replies; 69+ messages in thread
From: Andi Kleen @ 2006-06-19 8:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: discuss, linux-kernel, libc-alpha, vojtech

On Monday 19 June 2006 02:15, Paul Jackson wrote:
>
> Roughly, he was looking to support something resembling the kernel's
> per-cpu data in userland library code for high performance scientific
> number crunching, for things like statistics gathering and perhaps (not
> sure of this) reduce locking costs.

While vgetcpu() can be used for this, most likely glibc TLS is already
good enough for this. So it will help, but I don't think it's the
primary motivation.

> I see "x86-64" in the Subject. I don't see why this facility is
> arch-specific. Could it work on any arch, ia64 being the one of
> interest to me?

The implementation is x86-64 specific and optimized for x86-64.
You could probably implement something with the same prototype for IA64
too, although the internal implementation will likely be very different
(there is nothing x86-64 specific in the prototype).

AFAIK ia64 supports fast system calls so it might be possible to do a
simple implementation without vsyscalls.

> I have some ignorance on your references to "CPUID(1)". I don't recall
> what it is. The only command so named I find on my systems is a

CPUID 1 is an x86 instruction that is one way to implement a user level
vgetcpu on x86.

-Andi
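For readers who, like Paul, have not met CPUID leaf 1: on x86 it returns the
executing CPU's initial APIC ID in bits 31..24 of EBX, which is what makes it
usable as a building block for a user-level vgetcpu. The C sketch below shows
only that raw read, using the __get_cpuid helper from the GCC/Clang <cpuid.h>
header. The function name cpuid1_initial_apic_id is hypothetical, and the
sketch makes no claim about how the proposed patch uses the instruction. An
APIC ID is not the kernel's CPU number, so a real implementation would still
need a mapping step, and the thread describes CPUID(1) as slow.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header, x86 only */
#endif

/* Returns the initial APIC ID (0..255) of the CPU this code happens to be
 * running on, or -1 if CPUID leaf 1 is unavailable or this is not x86.
 * The value can be stale as soon as it is returned, since the scheduler
 * may move the thread at any time. */
int cpuid1_initial_apic_id(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;

    unsigned int apic_id = ebx >> 24;   /* EBX[31:24] = initial APIC ID */
    assert(apic_id <= 0xff);
    return (int)apic_id;
#else
    return -1;
#endif
}
```

A caller would treat the result purely as a hint, for example to pick a
per-CPU slot that is likely to be in cache, and would still lock that slot.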
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 8:21 ` Andi Kleen
@ 2006-06-19 10:09 ` Paul Jackson
2006-06-21 1:18 ` Paul Jackson
1 sibling, 0 replies; 69+ messages in thread
From: Paul Jackson @ 2006-06-19 10:09 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech

Andi wrote:
> glibc TLS

Good idea - thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 8:21 ` Andi Kleen
2006-06-19 10:09 ` Paul Jackson
@ 2006-06-21 1:18 ` Paul Jackson
2006-06-21 1:21 ` Paul Jackson
1 sibling, 1 reply; 69+ messages in thread
From: Paul Jackson @ 2006-06-21 1:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech

Andi wrote:
> While vgetcpu() can be used for this, most likely glibc TLS is already
> good enough for this. So it will help, but I don't think it's the primary
> motivation.

Elsewhere on this thread, Jes wrote:
> ... libc
> should offer a memory pool per thread and have it created when it's
> first touched by the thread. That solves exactly what you have described
> so far unless there is something else you also expect to benefit from
> vgetcpu().

I don't see a reply from you (Andi) on Jes's comment.

Why can't Thread Local Storage (TLS) or other per-thread data be used
for a memory pool, as Jes suggests?

It seems to me that we don't need vgetcpu() at all. Instead, we
should make things that would use it per-thread, not per-cpu. If it
works for the statistics gathering you recommended I use TLS for, why
not for malloc pages as well?

That would seem to be a better abstraction anyway:

* A thread's cpu can be changed without notice, but a task's threads
  don't change unless the task intentionally does it.

* Two threads on the same cpu could collide on some per-cpu data,
  where they'd be find on per-thread data.

We already have user visibility into what cpu a task is executing on,
via the /proc/<pid>/stat file (39th field). That's slow, of course.

The main reason for speeding it up seems to be to make it useful in
critical places, turning it from an infrequently used debugging option
into a critical element of certain NUMA aware user level code.

I'd think we should not be introducing a new construct, a thread's
current cpu, as such a first class component unless:
1) we do so on all arches, and
2) we don't already have a better construct (TLS).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
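The slow path Paul mentions, reading the 39th field of /proc/<pid>/stat, can
be sketched in C as below. The function names stat_line_cpu and
proc_stat_last_cpu are hypothetical. The sketch assumes the stat layout
documented in proc(5), where field 2 (the command name) is parenthesised and
may itself contain spaces or parentheses, so parsing resumes after the last
')'. Note that /proc/self/stat describes the thread group leader; a
multithreaded program wanting its own thread's value would read the matching
/proc/self/task/<tid>/stat file instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parses one stat line (modified in place) and returns field 39, the CPU
 * the task last ran on, or -1 if the line is malformed. */
int stat_line_cpu(char *line)
{
    assert(line != NULL);

    /* Skip "pid (comm)": comm may contain spaces and ')' characters. */
    char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;
    p++;

    /* The first token after comm is field 3 (state); walk on to field 39. */
    char *save = NULL;
    char *tok = strtok_r(p, " \n", &save);
    for (int field = 3; tok != NULL && field < 39; field++)
        tok = strtok_r(NULL, " \n", &save);

    return tok != NULL ? atoi(tok) : -1;
}

/* Reads /proc/self/stat and returns its 39th field, or -1 on error. */
int proc_stat_last_cpu(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/self/stat", "r");
    if (f == NULL)
        return -1;

    size_t len = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[len] = '\0';

    return stat_line_cpu(buf);
}
```

Each call costs an open, a read and a close plus text parsing, which
illustrates why this route is only practical for occasional debugging.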
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-21 1:18 ` Paul Jackson
@ 2006-06-21 1:21 ` Paul Jackson
0 siblings, 0 replies; 69+ messages in thread
From: Paul Jackson @ 2006-06-21 1:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: ak, discuss, linux-kernel, libc-alpha, vojtech

Typo:
> where they'd be find on per-thread data.

s/find/fine/

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
end of thread, other threads:[~2006-06-21 1:22 UTC | newest]
Thread overview: 69+ messages
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
2006-06-14 10:47 ` Alan Cox
2006-06-14 14:54 ` Steve Munroe
2006-06-15 23:17 ` Benjamin Herrenschmidt
[not found] ` <449029DB.7030505@redhat.com>
[not found] ` <200606141752.02361.ak@suse.de>
2006-06-14 16:30 ` Ulrich Drepper
2006-06-14 17:34 ` [discuss] " Andi Kleen
2006-06-15 18:44 ` Tony Luck
2006-06-16 6:22 ` Andi Kleen
2006-06-16 7:23 ` Gerd Hoffmann
2006-06-16 7:37 ` Andi Kleen
2006-06-16 9:48 ` Jes Sorensen
2006-06-16 10:09 ` Andi Kleen
2006-06-16 11:02 ` Jes Sorensen
2006-06-16 11:17 ` Andi Kleen
2006-06-16 11:58 ` Jes Sorensen
2006-06-16 12:36 ` Zoltan Menyhart
2006-06-16 12:41 ` Jes Sorensen
2006-06-16 12:48 ` Zoltan Menyhart
2006-06-16 21:04 ` Chase Venters
2006-06-16 14:56 ` Andi Kleen
2006-06-16 15:31 ` Zoltan Menyhart
2006-06-16 15:37 ` Andi Kleen
2006-06-16 15:58 ` Jakub Jelinek
2006-06-16 16:24 ` Andi Kleen
2006-06-16 16:33 ` Jakub Jelinek
2006-06-16 21:12 ` Chase Venters
2006-06-16 15:36 ` Brent Casavant
2006-06-16 15:40 ` Andi Kleen
2006-06-16 21:15 ` Chase Venters
2006-06-16 21:19 ` Chase Venters
2006-06-16 23:40 ` Brent Casavant
2006-06-17 6:58 ` Andi Kleen
2006-06-17 6:55 ` [discuss] " Andi Kleen
2006-06-19 8:42 ` Zoltan Menyhart
2006-06-19 8:54 ` Andi Kleen
2006-06-16 14:54 ` Andi Kleen
2006-06-20 8:28 ` Jes Sorensen
2006-06-19 0:15 ` Paul Jackson
2006-06-19 8:21 ` Andi Kleen
2006-06-19 10:09 ` Paul Jackson
2006-06-21 1:18 ` Paul Jackson
2006-06-21 1:21 ` Paul Jackson