* FOR REVIEW: New x86-64 vsyscall vgetcpu()
@ 2006-06-14 7:42 Andi Kleen
2006-06-14 10:47 ` Alan Cox
` (4 more replies)
0 siblings, 5 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-14 7:42 UTC (permalink / raw)
To: discuss, linux-kernel; +Cc: libc-alpha, vojtech
I got several requests over the years to provide a fast way to get
the current CPU and node on x86-64. That is useful for a couple of things:
- The kernel gets a lot of benefit from using per-CPU data to get better
cache locality and avoid cache line bouncing. This is currently
not quite possible for user programs. With a fast way to know the current
CPU, user space can use per-CPU data that is likely in cache already.
Locking is still needed of course - after all the thread might switch
to a different CPU - but at least the memory should already be in cache
and locking on cached memory is much cheaper.
- For NUMA optimization in user space you really need to know the current
node to find out where to allocate memory from.
If you allocate a fresh page from the kernel the kernel will give you
one in the current node, but if you keep your own pools like most programs
do you need to know this to select the right pool.
On single threaded programs it is usually not a big issue because they
tend to start on one node, allocate all their memory there and then eventually
use it there too, but on multithreaded programs where threads can
run on different nodes it's a bigger problem to make sure the threads
can get node local memory for best performance.
At first glance such a call looks like a bad idea - after all the kernel can
switch the process to another CPU at any time, so any result of this call might
be wrong as soon as it returns.
But at a closer look it really makes sense:
- The kernel has strong thread affinity and usually keeps a process on the
same CPU. So switching CPUs is rare, which makes this a useful optimization.
The alternative is usually to bind the process to a specific CPU - then it
"knows" where it is - but the problem is that this is nasty to use and
requires user configuration. The kernel can often make better decisions on
where to schedule, and doing it automatically makes it just work.
This cannot be done effectively in user space because only the kernel
can obtain this information from the CPU: it requires
translating local APIC IDs to Linux CPU numbers.
Doing it in a syscall is too slow, so doing it in a vsyscall makes sense.
I have patches now in my tree from Vojtech
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/getcpu-vsyscall
(note doesn't apply on its own, needs earlier patches in the quilt set)
The prototype is
long vgetcpu(int *cpu, int *node, unsigned long *tcache)
cpu gets the current CPU number if not NULL.
node gets the current node number if not NULL.
tcache is a pointer to a two-element long array; it can also be NULL. Described below.
The return value is always 0.
[I modified the prototype a bit over Vojtech's original implementation
to be more foolproof and add the caching mechanism]
Unfortunately all ways to get this information from the CPU are relatively slow:
the vsyscall uses RDTSCP on CPUs that support it and CPUID(1) otherwise.
Both stall the pipeline and add some overhead,
so I added a special caching mechanism. The idea is that if it's a little
slow then user space would likely cache the information anyways. The problem
with caching is that you need a way to find if it's out of date. User space
cannot do this because it doesn't have a fast way to access a time stamp.
But the x86-64 vsyscall implementation happens to have one incidentally: vgettimeofday()
already has access to jiffies, which can be used as a timestamp to
invalidate the cache. The vsyscall cannot cache this information by itself
though - it doesn't have any storage. The idea is that the user would pass a
TLS variable in there which is then used for storage. With that the information
can be at most a jiffy out of date, which is good enough.
The contents of the cache are theoretically supposed to be opaque (although I'm
sure user programs will soon abuse that because it will be such a convenient way
to get at jiffies ..). I've considered XORing it with a constant to make that clear,
but that is probably overkill (?). It might still be safer, because
jiffies is unsafe to use in user space - the unit might change.
The array is slightly ugly - one open possibility is to replace it with
a structure. Shouldn't make much difference to the general semantics of the syscall though.
Some numbers: (the getpid is to compare syscall cost)
AMD RevF (with RDTSCP support):
getpid 162 cycles
vgetcpu 145 cycles
vgetcpu rdtscp 32 cycles
vgetcpu cached 14 cycles
Intel Pentium-D (Smithfield):
getpid 719 cycles
vgetcpu 535 cycles
vgetcpu cached 27 cycles
AMD RevE:
getpid 162 cycles
vgetcpu 185 cycles
vgetcpu cached 15 cycles
As you can see CPUID(1) is always very slow, but still usually narrowly wins
against the syscall, except on the AMD E stepping. The difference
there is very small, and while it would have been possible to implement
a third mode that uses a real syscall, I ended up not doing so because it
has some other implications.
With the caching mechanism it really flies though and should be fast enough
for most uses.
My eventual hope is that glibc will start using this to implement a NUMA-aware
malloc() in user space that preferably allocates local memory.
I would say that's the biggest gap we still have in "general purpose" NUMA tuning
on Linux. Of course it will be likely useful for a lot of other scalable
code too.
Comments on the general mechanism are welcome. If someone is interested in using
this in user space for SMP or NUMA tuning please let me know.
I haven't quite made up my mind yet whether it's 2.6.18 material or not.
-Andi
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
@ 2006-06-14 10:47 ` Alan Cox
2006-06-14 14:54 ` Steve Munroe
` (3 subsequent siblings)
4 siblings, 0 replies; 42+ messages in thread
From: Alan Cox @ 2006-06-14 10:47 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech
Ar Mer, 2006-06-14 am 09:42 +0200, ysgrifennodd Andi Kleen:
> Comments on the general mechanism are welcome. If someone is interested in using
> this in user space for SMP or NUMA tuning please let me know.
Will 2 words always be enough, it costs nothing to demand 8 or 16 ...
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
2006-06-14 10:47 ` Alan Cox
@ 2006-06-14 14:54 ` Steve Munroe
2006-06-15 23:17 ` Benjamin Herrenschmidt
[not found] ` <449029DB.7030505@redhat.com>
` (2 subsequent siblings)
4 siblings, 1 reply; 42+ messages in thread
From: Steve Munroe @ 2006-06-14 14:54 UTC (permalink / raw)
To: Andi Kleen, benh
Cc: discuss, libc-alpha, libc-alpha-owner, linux-kernel, vojtech
Andi Kleen <ak@suse.de> wrote on 06/14/2006 02:42:31 AM:
>
> I got several requests over the years to provide a fast way to get
> the current CPU and node on x86-64. That is useful for a couple of things:
>
> - The kernel gets a lot of benefit from using per CPU data to get better
> cache locality and avoid cache line bouncing. This is currently
> not quite possible for user programs. With a fast way to know the current
> CPU user space can use per CPU data that is likely in cache already.
> Locking is still needed of course - after all the thread might switch
> to a different CPU - but at least the memory should be already in cache
> and locking on cached memory is much cheaper.
>
> - For NUMA optimization in user space you really need to know the current
> node to find out where to allocate memory from.
> If you allocate a fresh page from the kernel the kernel will give you
> one in the current node, but if you keep your own pools like most programs
> do you need to know this to select the right pool.
> On single threaded programs it is usually not a big issue because they
> tend to start on one node, allocate all their memory there and then eventually
> use it there too, but on multithreaded programs where threads can
> run on different nodes it's a bigger problem to make sure the threads
> can get node local memory for best performance.
>
PowerPC has similar issues and could use the VDSO/vsyscall to implement
vgetcpu() as well. So we should get Ben Herrenschmidt involved to ensure
that we have a cross-platform solution.
> At first look such a call still looks like a bad idea - after all
> the kernel can
> switch a process at any time to other CPUs so any result of this call might
> be wrong as soon as it returns.
>
> But at a closer look it really makes sense:
> - The kernel has strong thread affinity and usually keeps a process on the
> same CPU. So switching CPUs is rare. This makes it an useful optimization.
>
> The alternative is usually to bind the process to a specific CPU - then it
> "know" where it is - but the problem is that this is nasty to use and
> requires user configuration. The kernel often can make better decisions on
> where to schedule. And doing it automatically makes it just work.
>
> This cannot be done effectively in user space because only the kernel
> knows how to get this information from the CPUs because it requires
> translating local APIC numbers to Linux CPU numbers.
>
> Doing it in a syscall is too slow so doing it in a vsyscall makes sense.
>
> I have patches now in my tree from Vojtech
> ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/getcpu-vsyscall
> (note doesn't apply on its own, needs earlier patches in the quilt set)
>
> The prototype is
>
> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
>
> cpu gets the current CPU number if not NULL.
> node gets the current node number if not NULL
> tcache is a pointer to a two element long array, can be also NULL.
> Described below.
> Return is always 0.
>
> [I modified the prototype a bit over Vojtech's original implementation
> to be more foolproof and add the caching mechanism]
>
> Unfortunately all ways to get this information from the CPU are
> still relatively slow:
> it supports RDTSCP on CPUs that support it and CPUID(1) otherwise.
> Unfortunately
> they both are relatively slow.
>
> They stall the pipeline and add some overhead
> so I added a special caching mechanism. The idea is that if it's a little
> slow then user space would likely cache the information anyways. The problem
> with caching is that you need a way to find if it's out of date. User space
> cannot do this because it doesn't have a fast way to access a time stamp.
>
> But the x86-64 vsyscall implementation happens to incidentally -
> vgettimeofday()
> already has access to jiffies, that can be just used as a timestamp to
> invalidate the cache. The vsyscall cannot cache this information by itself
> though - it doesn't have any storage. The idea is that the user would pass a
> TLS variable in there which is then used for storage. With that the
> information
> can be at best a jiffie out of date, which is good enough.
>
> The contents of the cache are theoretically supposed to be opaque
> (although I'm
> sure user programs will soon abuse that because it will such a
> convenient way
> to get at jiffies ..). I've considered xoring it with a value to make it clear
> it's not, but that is probably overkill (?). Might be still safer because
> jiffies is unsafe to use in user space because the unit might change.
>
> The array is slightly ugly - one open possibility is to replace it with
> a structure. Shouldn't make much difference to the general semantics
> of the syscall though.
>
> Some numbers: (the getpid is to compare syscall cost)
>
> AMD RevF (with RDTSCP support):
> getpid 162 cycles
> vgetcpu 145 cycles
> vgetcpu rdtscp 32 cycles
> vgetcpu cached 14 cycles
>
> Intel Pentium-D (Smithfield):
> getpid 719 cycles
> vgetcpu 535 cycles
> vgetcpu cached 27 cycles
>
> AMD RevE:
> getpid 162 cycles
> vgetcpu 185 cycles
> vgetcpu cached 15 cycles
>
> As you can see CPUID(1) is always very slow, but usually narrowly wins
> against the syscall still, except on AMD E stepping. The difference
> is very small there and while it would have been possible to implement
> a third mode for this that uses a real syscall I ended not too because it
> has some other implications.
>
> With the caching mechanism it really flies though and should be fast enough
> for most uses.
>
> My eventual hope is that glibc will be start using this to implement
> a NUMA aware
> malloc() in user space that tries to allocate local memory preferably.
> I would say that's the biggest gap we still have in "general
> purpose" NUMA tuning
> on Linux. Of course it will be likely useful for a lot of other scalable
> code too.
>
> Comments on the general mechanism are welcome. If someone is
> interested in using
> this in user space for SMP or NUMA tuning please let me know.
>
> I haven't quite made of my mind yet if it's 2.6.18 material or not.
>
Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
[not found] ` <200606141752.02361.ak@suse.de>
@ 2006-06-14 16:30 ` Ulrich Drepper
2006-06-14 17:34 ` [discuss] " Andi Kleen
0 siblings, 1 reply; 42+ messages in thread
From: Ulrich Drepper @ 2006-06-14 16:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech
Andi Kleen wrote:
> Eventually we'll need a dynamic format but I'll only add it
> for new calls that actually require it for security.
> vgetcpu doesn't need it.
Just introduce the vdso now, add all new vdso calls there. There is no
reason except laziness to continue with these moronic fixed addresses.
They only get in the way of address space layout change/optimizations.
And nobody said anything about breaking apps which use the fixed
addresses. That code can still be available. One should be able to
turn it off with setarch.
>>> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
>> Do you expect the value returned in *cpu and *node to require an error
>> value? If not, then why this fascination with signed types?
>
> Shouldn't make a difference.
If there is no reason for a signed type none should be used. It can
only lead to problems.
This reminds me: what are the values for the CPU number? Are they
contiguous? Are they the same as those used in the affinity syscalls
(they better be)? With hotplug CPUs, are CPU numbers "recycled"?
>> And as for the cache: you definitely should use a length parameter.
>> We've seen in the past over and over again that implicit length
>> requirements sooner or later fail.
>
> No, the cache should be completely opaque to user space. It's just
> temporary space for the vsyscall which it cannot store for itself.
> I'll probably change it to a struct to make that clearer.
>
> length doesn't make sense for that use.
You didn't even try to understand what I said. Yes, in this one case
you might at this point in time only need two words. But
- this might change
- there might be other future functions in the vdso which need memory.
It is a huge pain to provide more and more of these individual
variables. Better allocate one chunk.
> If some other function needs a cache too it can define its own.
> I don't see any advantage of using a shared buffer.
I believe it that _you_ don't see it. Because the pain is in the libc.
The code to set up stack frames has to be adjusted for each new TLS
variable. It is better to do it once in a general way which is what I
suggested.
> I think you're misunderstanding the concept.
No, I understand perfectly. You don't get it because you don't want to
understand the userlevel side.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 16:30 ` Ulrich Drepper
@ 2006-06-14 17:34 ` Andi Kleen
0 siblings, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-14 17:34 UTC (permalink / raw)
To: discuss; +Cc: Ulrich Drepper, linux-kernel, libc-alpha, vojtech
On Wednesday 14 June 2006 18:30, Ulrich Drepper wrote:
> > Eventually we'll need a dynamic format but I'll only add it
> > for new calls that actually require it for security.
> > vgetcpu doesn't need it.
>
> Just introduce the vdso now, add all new vdso calls there. There is no
> reason except laziness to continue with these moronic fixed addresses.
> They only get in the way of address space layout change/optimizations.
The user address space size on x86-64 is final (barring the architecture being
extended beyond 48-bit VA). We already use all positive
space. But the vsyscalls don't even live in user address space.
> >>> long vgetcpu(int *cpu, int *node, unsigned long *tcache)
> >> Do you expect the value returned in *cpu and *node to require an error
> >> value? If not, then why this fascination with signed types?
> >
> > Shouldn't make a difference.
>
> If there is no reason for a signed type none should be used. It can
> only lead to problems.
OK, I can change it to unsigned if you feel that strongly about it.
>
> This reminds me: what are the values for the CPU number? Are they
> contiguous? Are they the same as those used in the affinity syscalls
> (they better be)?
Yes of course.
> With hotplug CPUs, are CPU numbers "recycled"?
I think if the same CPU gets unplugged and replugged it should
get the same number. Otherwise new numbers should be allocated.
> Yes, in this one case
> you might at this point in time only need two words. But
>
> - this might change
Alan suggested adding some padding which probably
makes sense, although I frankly don't see the implementation
changing. Variable length would be clear overkill and I refuse
to overdesign this.
> - there might be other future functions in the vdso which need memory.
> It is a huge pain to provide more and more of these individual
> variables. Better allocate one chunk.
Why is it a problem? It's just a __thread variable, isn't it?
>
> > If some other function needs a cache too it can define its own.
> > I don't see any advantage of using a shared buffer.
>
> I believe it that _you_ don't see it. Because the pain is in the libc.
> The code to set up stack frames has to be adjusted for each new TLS
> variable. It is better to do it once in a general way which is what I
> suggested.
Hmm, I thought user space could define arbitrary __thread variables of its own. I certainly
used that in some of my code. Why is it a problem for the libc to do the same?
Anyways even if it's such a big problem you can put it all in
one chunk and partition it yourself given the fixed size. I don't think
the kernel code should concern itself about this.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
` (2 preceding siblings ...)
[not found] ` <449029DB.7030505@redhat.com>
@ 2006-06-15 18:44 ` Tony Luck
2006-06-16 6:22 ` Andi Kleen
2006-06-19 0:15 ` Paul Jackson
4 siblings, 1 reply; 42+ messages in thread
From: Tony Luck @ 2006-06-15 18:44 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech
On 6/14/06, Andi Kleen <ak@suse.de> wrote:
> But at a closer look it really makes sense:
> - The kernel has strong thread affinity and usually keeps a process on the
> same CPU. So switching CPUs is rare. This makes it an useful optimization.
Alternatively it means that this will almost always do the right thing, but
once in a while it won't, your application will happen to have been migrated
to a different cpu/node at the point it makes the call, and from then on
this instance will behave oddly (running slowly because it allocates most
of its memory on the wrong node). When you try to reproduce the problem,
the application will work normally.
> The alternative is usually to bind the process to a specific CPU - then it
> "know" where it is - but the problem is that this is nasty to use and
> requires user configuration. The kernel often can make better decisions on
> where to schedule. And doing it automatically makes it just work.
Another alternative would be to provide a mechanism for a process
to bind to the current cpu (whatever cpu that happens to be). Then
the kernel gets to make the smart placement decisions, and processes
that want to be bound somewhere (but don't really care exactly where)
have a way to meet their need. Perhaps a cpumask of all zeroes to a
sched_setaffinity call could be overloaded for this?
Or we can dig up some of the old virtual cpu/virtual node suggestions (we
will eventually need to do something like this, but most systems now don't
have enough cpus for this to make much sense yet).
-Tony
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 14:54 ` Steve Munroe
@ 2006-06-15 23:17 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-15 23:17 UTC (permalink / raw)
To: Steve Munroe
Cc: Andi Kleen, discuss, libc-alpha, libc-alpha-owner, linux-kernel,
vojtech
> PowerPC has similar issues and could use the VDSO/vsyscall to implement
> vgetcpu() as well. So we should get Ben Herrenschmidt involved to ensure
> that we have a cross platform solution.
Except that I haven't yet found a way to pass the information to the
vdso... in the past, there used to be an SPRG that was readable by
userland that I could have used but I can't see that working on recent
CPUs. The PIR isn't quite portable (though the vDSO can have per-CPU-model
code) and we don't quite know for sure what's in there, especially on
shared processor machines.
Any ideas?
Ben.
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-15 18:44 ` Tony Luck
@ 2006-06-16 6:22 ` Andi Kleen
2006-06-16 7:23 ` Gerd Hoffmann
2006-06-16 9:48 ` Jes Sorensen
0 siblings, 2 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 6:22 UTC (permalink / raw)
To: Tony Luck; +Cc: discuss, linux-kernel, libc-alpha, vojtech
On Thursday 15 June 2006 20:44, Tony Luck wrote:
> On 6/14/06, Andi Kleen <ak@suse.de> wrote:
> > But at a closer look it really makes sense:
> > - The kernel has strong thread affinity and usually keeps a process on the
> > same CPU. So switching CPUs is rare. This makes it an useful optimization.
>
> Alternatively it means that this will almost always do the right thing, but
> once in a while it won't, your application will happen to have been migrated
> to a different cpu/node at the point it makes the call, and from then on
> this instance will behave oddly (running slowly because it allocates most
> of its memory on the wrong node). When you try to reproduce the problem,
> the application will work normally.
That's inherent in NUMA. No good way around that.
We have a similar problem with caches because we don't color them. People
have learned to live with it.
> > The alternative is usually to bind the process to a specific CPU - then it
> > "know" where it is - but the problem is that this is nasty to use and
> > requires user configuration. The kernel often can make better decisions on
> > where to schedule. And doing it automatically makes it just work.
>
> Another alternative would be to provide a mechanism for a process
> to bind to the current cpu (whatever cpu that happens to be). Then
> the kernel gets to make the smart placement decisions, and processes
> that want to be bound somewhere (but don't really care exactly where)
> have a way to meet their need. Perhaps a cpumask of all zeroes to a
> sched_setaffinity call could be overloaded for this?
I tried something like this a few years ago and it just didn't work
(or rather usually ran slower). The scheduler would select a home node at startup and
then try to move the process there.
The problem is that not using a CPU costs you much more than whatever
overhead you get from using non local memory.
So by default filling the CPUs must be the highest priority and memory
policy cannot interfere with that.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 6:22 ` Andi Kleen
@ 2006-06-16 7:23 ` Gerd Hoffmann
2006-06-16 7:37 ` Andi Kleen
2006-06-16 9:48 ` Jes Sorensen
1 sibling, 1 reply; 42+ messages in thread
From: Gerd Hoffmann @ 2006-06-16 7:23 UTC (permalink / raw)
To: Andi Kleen; +Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech
Andi Kleen wrote:
>> Alternatively it means that this will almost always do the right thing, but
>> once in a while it won't, your application will happen to have been migrated
>> to a different cpu/node at the point it makes the call, and from then on
>> this instance will behave oddly (running slowly because it allocates most
>> of its memory on the wrong node). When you try to reproduce the problem,
>> the application will work normally.
>
> That's inherent in NUMA. No good way around that.
Hmm, maybe it makes sense to allow binding memory areas to threads
instead of nodes. That way the kernel may attempt to migrate the pages
to another node in case it migrates threads / processes. Either via
mbind(), or maybe better via madvise() to make clear it's a hint only.
cheers,
Gerd
--
Gerd Hoffmann <kraxel@suse.de>
http://www.suse.de/~kraxel/julika-dora.jpeg
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 7:23 ` Gerd Hoffmann
@ 2006-06-16 7:37 ` Andi Kleen
0 siblings, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 7:37 UTC (permalink / raw)
To: Gerd Hoffmann; +Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech
On Friday 16 June 2006 09:23, Gerd Hoffmann wrote:
> Andi Kleen wrote:
> >> Alternatively it means that this will almost always do the right thing, but
> >> once in a while it won't, your application will happen to have been migrated
> >> to a different cpu/node at the point it makes the call, and from then on
> >> this instance will behave oddly (running slowly because it allocates most
> >> of its memory on the wrong node). When you try to reproduce the problem,
> >> the application will work normally.
> >
> > That's inherent in NUMA. No good way around that.
>
> Hmm, maybe it makes sense to allow binding memory areas to threads
> instead of nodes. That way the kernel may attempt to migrate the pages
> to another node in case it migrates threads / processes. Either via
> mbind(), or maybe better via madvise() to make clear it's a hint only.
I haven't tried that but I have talked to others who tried to implement
automatic page migration and they say they couldn't make that work (or rather
make it a win) either.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 6:22 ` Andi Kleen
2006-06-16 7:23 ` Gerd Hoffmann
@ 2006-06-16 9:48 ` Jes Sorensen
2006-06-16 10:09 ` Andi Kleen
1 sibling, 1 reply; 42+ messages in thread
From: Jes Sorensen @ 2006-06-16 9:48 UTC (permalink / raw)
To: Andi Kleen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64
>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:
Andi> On Thursday 15 June 2006 20:44, Tony Luck wrote:
>> Another alternative would be to provide a mechanism for a process
>> to bind to the current cpu (whatever cpu that happens to be). Then
>> the kernel gets to make the smart placement decisions, and
>> processes that want to be bound somewhere (but don't really care
>> exactly where) have a way to meet their need. Perhaps a cpumask of
>> all zeroes to a sched_setaffinity call could be overloaded for
>> this?
Andi> I tried something like this a few years ago and it just didn't
Andi> work (or rather ran usually slower) The scheduler would select a
Andi> home node at startup and then try to move the process there.
Andi> The problem is that not using a CPU costs you much more than
Andi> whatever overhead you get from using non local memory.
It all depends on your application and the type of system you are
running on. What you say applies to smaller CPU counts. However, once
the upcoming larger-count multi-core CPUs become commonly
available, this is likely to change and become more like what is seen
today on larger NUMA systems.
In the scientific application space, there are two very common
groupings of jobs. One is simply a large threaded application with a
lot of intercommunication, often via MPI. In many cases one ends up
running a job on just a subset of the system, in which case you want
to see threads placed on the same node(s) to minimize internode
communication. It is desirable to force the other tasks on the
system (system daemons etc.) onto other node(s) to reduce noise, and
there could also be space to run another parallel job on the remaining
node(s).
The other common case is to have jobs which spawn off a number of
threads that work together in groups (via OpenMP). In this case you
would like to have all your OpenMP threads placed on the same node for
similar reasons.
Not getting this right can result in significant loss of performance
for jobs which are highly memory bound or rely heavily on
intercommunication and synchronization.
Andi> So by default filling the CPUs must be the highest priority and
Andi> memory policy cannot interfere with that.
I really don't think this approach is going to solve the problem. As
Tony also points out, tasks will eventually migrate. The user needs to
tell the kernel where it wants to run the tasks rather than the kernel
telling the task where it is located. Only the application (or
developer/user) knows how the threads are expected to behave, doing
this automatically is almost never going to be optimal. Obviously the
user needs visibility of the topology of the machine to do so but that
should be available on any NUMA system through /proc or /sys.
In the scientific space the jobs are often run repeatedly with new
data sets every time, so it is worthwhile to spend the effort up front
to get the placement right. One-off runs are obviously something else
and there your method is going to be more beneficial.
IMHO, what we really need is a more advanced way for user applications
to hint to the kernel how to place their threads.
Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 9:48 ` Jes Sorensen
@ 2006-06-16 10:09 ` Andi Kleen
2006-06-16 11:02 ` Jes Sorensen
0 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 10:09 UTC (permalink / raw)
To: Jes Sorensen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64
> It all depends on your application and the type of system you are
> running on. What you say applies to smaller cpu counts. However once
> we see the upcoming larger count multi-core cpus become commonly
> available, this is likely to change and become more like what is seen
> today on larger NUMA systems.
Maybe. Maybe not.
>
> In the scientific application space, there are two very common
> groupings of jobs.
The scientific users just use pinned CPUs and seem to be happy with that.
They also have cheap slav^wgrade students to spend lots of time on
manual tuning. I'm not concerned about them.
If you already use CPU affinity you should already know where you are and don't
need this call at all.
So this clearly isn't targetted for them.
What's interesting is getting the best performance from general purpose applications
without any special tuning. For them I'm trying to improve things.
Number one applications currently are databases and JVMs. I hope with
Wolfram's malloc work it will be useful for more applications too.
> Andi> So by default filling the CPUs must be the highest priority and
> Andi> memory policy cannot interfere with that.
>
> I really don't think this approach is going to solve the problem. As
> Tony also points out, tasks will eventually migrate.
Currently we don't solve this problem with the standard heuristics.
It can be solved with manual tuning (mempolicy, explicit CPU affinity)
but if you're doing that you're already outside the primary use
case of vgetcpu().
vgetcpu() is only trying to be an incremental improvement of the current
simple default local policy.
> The user needs to
Scientific users do that, but other users normally do not. I doubt that
is going to change.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 10:09 ` Andi Kleen
@ 2006-06-16 11:02 ` Jes Sorensen
2006-06-16 11:17 ` Andi Kleen
0 siblings, 1 reply; 42+ messages in thread
From: Jes Sorensen @ 2006-06-16 11:02 UTC (permalink / raw)
To: Andi Kleen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64
Andi Kleen wrote:
>> In the scientific application space, there are two very common
>> groupings of jobs.
>
> The scientific users just use pinned CPUs and seem to be happy with that.
> They also have cheap slav^wgrade students to spend lots of time on
> manual tuning. I'm not concerned about them.
Do they? There are a lot of scientific sites out there which are not
universities or research organizations. They do not have free slave
labour at hand. A lot of users fall into this category, especially the
users with larger systems or large clusters (be it ia64, x86_64 or PPC).
> If you already use CPU affinity you should already know where you are and don't
> need this call at all.
Except that what's currently available isn't sufficient to do what is
needed.
> So this clearly isn't targetted for them.
>
> Interesting is getting the best performance from general purpose applications
> without any special tuning. For them I'm trying to improve things.
Well I am interested in getting the best performance for some of the
same applications, without having to modify them. The current affinity
support simply isn't sufficient for that. Placement has to be targeted
at launch time since thread implementations can change the layout etc.
> Number one applications currently are databases and JVMs. I hope with
> Wolfam's malloc work it will be useful for more applications too.
If you want this to work for general purpose applications, then how is
this new syscall going to help? If you expect application vendors to
code for it, that means few users will benefit.
>> I really don't think this approach is going to solve the problem. As
>> Tony also points out, tasks will eventually migrate.
>
> Currently we don't solve this problem with the standard heuristics.
> It can be solved with manual tuning (mempolicy, explicit CPU affinity)
> but if you're doing that you're already out side the primary use
> case of vgetcpu().
This is another area where the kernel could do better by possibly using
the cpumask to determine where it will allocate memory.
> vgetcpu() is only trying to be a incremental improvement of the current
> simple default local policy.
As Tony rightfully pointed out, tasks do migrate. By making this guess
initially and then expecting the application to run for a long time,
you will end up with it having zero or possibly a negative effect.
>> The user needs to
>
> Scientific users do that, but other users normally not. I doubt that
> is going to change.
I just use scientific users since that's where I have the most recent
detailed data from. Databases could well benefit from what I mentioned,
though the serious ones would want to look into using affinity support
explicitly in their code.
Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:02 ` Jes Sorensen
@ 2006-06-16 11:17 ` Andi Kleen
2006-06-16 11:58 ` Jes Sorensen
0 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 11:17 UTC (permalink / raw)
To: Jes Sorensen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64
> The current affinity
> support simply isn't sufficient for that. Placement has to be targetted
> at launch time since thread implementations can change the layout etc.
I'm not sure how that's related to vgetcpu, but ok ...
In general if you want to affect placement below the process / shared memory
segment level you should change the application.
Anything else just results in a big, messy, unreliable and fragile user
command line interface - a quick look at the respective Irix manpage should
make that clear.
> > Number one applications currently are databases and JVMs. I hope with
> > Wolfam's malloc work it will be useful for more applications too.
>
> If you want this to work for general purpose applications, then how is
> this new syscall going to help?
It will improve their malloc(). They don't know anything about NUMA,
but getting local memory will help them. They already get local
memory now from the kernel when they use big allocations, but
for smaller allocations it doesn't work because the kernel can't
give out anything smaller than a page. This would be solved
by a NUMA aware malloc, but it needs vgetcpu() for this if it
should work without fixed CPU affinity.
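A rough sketch of what such a vgetcpu()-based malloc front end could look
like (illustrative only: sched_getcpu(), which glibc later grew, stands in
for the proposed vgetcpu(), and the pool layout, MAX_CPUS and the fallback
path are assumptions, not glibc's actual design):

```c
/* Per-CPU free-list front end for malloc(). The CPU number is only a
 * hint - the thread may migrate right after asking - so a lock is still
 * needed, but it will usually be on already-cached memory. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

#define MAX_CPUS 256

struct cpu_pool {
	pthread_mutex_t lock;   /* zero-initialized mutexes are valid on glibc */
	void *free_list;        /* singly linked list of same-sized chunks */
};

static struct cpu_pool pools[MAX_CPUS];

/* Pop a chunk from the current CPU's pool; fall back to malloc(), which
 * under the default local policy hands back node-local pages anyway. */
void *pool_alloc(size_t size)
{
	int cpu = sched_getcpu();
	if (cpu < 0 || cpu >= MAX_CPUS)
		cpu = 0;                         /* hint only, never trust blindly */
	struct cpu_pool *p = &pools[cpu];

	pthread_mutex_lock(&p->lock);
	void *obj = p->free_list;
	if (obj)
		p->free_list = *(void **)obj;    /* pop the head chunk */
	pthread_mutex_unlock(&p->lock);

	return obj ? obj : malloc(size < sizeof(void *) ? sizeof(void *) : size);
}

/* Push a chunk onto whatever CPU's pool we are on right now, so the
 * memory stays (probably) node local for the next local allocation. */
void pool_free(void *obj)
{
	int cpu = sched_getcpu();
	if (cpu < 0 || cpu >= MAX_CPUS)
		cpu = 0;
	struct cpu_pool *p = &pools[cpu];

	pthread_mutex_lock(&p->lock);
	*(void **)obj = p->free_list;        /* push onto the head */
	p->free_list = obj;
	pthread_mutex_unlock(&p->lock);
}
```

A real allocator would also have to deal with chunks freed from a
different CPU than they were allocated on; this sketch sidesteps that.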
Basically it is just extending the existing, already used and proven
default local policy to sub-page allocations. Also there might be other uses
of it too (like per CPU data), although I expect most use of that
in user space can be already done using TLS.
JVM and databases will use it too, but since they often use their
own allocators they will need to be modified.
> If you expect application vendors to
> code for it, that means few users will benefit.
Most applications use malloc().
> >> I really don't think this approach is going to solve the problem. As
> >> Tony also points out, tasks will eventually migrate.
> >
> > Currently we don't solve this problem with the standard heuristics.
> > It can be solved with manual tuning (mempolicy, explicit CPU affinity)
> > but if you're doing that you're already out side the primary use
> > case of vgetcpu().
>
> This is another area where the kernel could do better by possibly using
> the cpumask to determine where it will allocate memory.
Modify fallback lists based on cpu affinity?
Would get messy in the code because you couldn't easily precompute
them anymore.
But cpusets already kind of does this, even though it has quite a
bad impact on fast paths.
Also what happens if the affinity mask is modified later?
From the high semantics point it is also a little dubious to mesh
them together. My feeling is that as a heuristic it is probably
dubious.
Also when you set cpu affinity you can just as well set memory
policy with it.
>
> > vgetcpu() is only trying to be a incremental improvement of the current
> > simple default local policy.
>
> As Tony rightfully pointed out, tasks do migrate. By making this guess
> initially
The gamble is already there in the local policy. No change at all.
When you already got local memory you can use it better with
vgetcpu() though.
From our experience it works out in most cases though - in general
most benchmarks show better performance with simple local NUMA
policy than SMP mode or no policy.
In the cases where it doesn't you have to either eat the slow
down or use manual tuning.
> I just use scientific users since thats where I have the most recent
> detailed data from. Databases could well benefit from what I mentioned,
> though the serious ones would want to look into using affinity support
> explicitly in their code.
No, exactly not - I got requests from "serious" databases to offer
vgetcpu() because affinity is too complicated to configure and manage.
It sounds like you want to solve NUMA world hunger here, not
concentrate on the specific small incremental improvement vgetcpu is trying
to offer.
I'm sure there is much research that could be done in the general NUMA
tuning area, but I would suggest doing that research, with numbers,
before trying to hack anything like this into the kernel without
a clear understanding first.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:17 ` Andi Kleen
@ 2006-06-16 11:58 ` Jes Sorensen
2006-06-16 12:36 ` Zoltan Menyhart
2006-06-16 14:54 ` Andi Kleen
0 siblings, 2 replies; 42+ messages in thread
From: Jes Sorensen @ 2006-06-16 11:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64
Andi Kleen wrote:
>> The current affinity
>> support simply isn't sufficient for that. Placement has to be targetted
>> at launch time since thread implementations can change the layout etc.
>
> I'm not sure how that's related to vgetcpu, but ok ...
>
> In general if you want to affect placement below the process / shared memory
> segment level you should change the application.
That would be great, except that a lot of these applications are
'standard' applications which people don't write themselves.
Sometimes the source code is no longer available. We could argue that
people should just rewrite their applications, but in reality this isn't
what's happening.
> It will improve their malloc(). They don't know anything about NUMA,
> but getting local memory will help them. They already get local
> memory now from the kernel when they use big allocations, but
> for smaller allocations it doesn't work because the kernel can't
> give out anything smaller than a page. This would be solved
> by a NUMA aware malloc, but it needs vgetcpu() for this if it
> should work without fixed CPU affinity.
I really don't see the benefit here. malloc already gets pages handed
down from the kernel which are node local due to them being assigned on
a first-touch basis. I am not sure about glibc's malloc internals, but
rather than rely on a vgetcpu() call, all it really needs to do is keep
a thread-local pool which will automatically get its memory locally
through first-touch usage.
I don't see how a new syscall is going to provide anything to malloc
that it doesn't already have. What am I missing?
> Basically it is just for extending the existing already used proven etc.
> default local policy to sub pages. Also there might be other uses
> of it too (like per CPU data), although I expect most use of that
> in user space can be already done using TLS.
The thread libraries already have their own thread local area which
should be allocated on the thread's own node if done right, which I
assume it is.
> JVM and databases will use it too, but since they often use their
> own allocators they will need to be modified.
I would assume the real databases to be smart enough to benefit from
things being first touch already. JVMs .... well who knows, can't say
I have a lot of faith in anything running in a JVM :)
>> If you expect application vendors to
>> code for it, that means few users will benefit.
>
> Most applications use malloc()
Which doesn't need the vgetcpu() call as far as I can see.
>> This is another area where the kernel could do better by possibly using
>> the cpumask to determine where it will allocate memory.
>
> Modify fallback lists based on cpu affinity?
It's a hint, not guaranteed placement. You have the same problem if you
try to allocate memory on a node and there's nothing left there.
> But cpusets already does this kind of, even though it has a quite
> bad impact on fast paths.
> Also what happens if the affinity mask is modified later?
> From the high semantics point it is also a little dubious to mesh
> them together. My feeling is that as a heuristic it is probably
> dubious.
If you migrate your app elsewhere, you should migrate the pages with it,
or not expect things to run with the local effect.
> The gamble is already there in the local policy. No change at all.
> When you already got local memory you can use it better with
> vgetcpu() though.
>
> From our experience it works out in most cases though - in general
> most benchmarks show better performance with simple local NUMA
> policy than SMP mode or no policy.
Could you share some information about the type of benchmarks?
>> I just use scientific users since thats where I have the most recent
>> detailed data from. Databases could well benefit from what I mentioned,
>> though the serious ones would want to look into using affinity support
>> explicitly in their code.
>
> No exactly not - i got requests from "serious" databases to offer
> vgetcpu() because affinity is too complicated to configure and manage.
>
> It sounds like you want to solve NUMA world hunger here, not
> concentrate on the specific small incremental improvement vgetcpu is trying
> to offer.
I don't really see the point in solving something half way when it can
be done better. Maybe the "serious" databases should open up and let us
know what the problem is they are hitting.
> I'm sure there is much research that could be done in the general NUMA
> tuning area, but I would suggest making it research with numbers first
> before trying to hack like this anything into the kernel without
> a clear understanding first.
Well I did spend a good chunk of time looking at some of this some time
ago and did speak a lot to one of my colleagues who actually runs
benchmarks using some of these tools to understand the impact. If
anything it seems that vgetcpu is the issue that is still in the
research stage.
Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:58 ` Jes Sorensen
@ 2006-06-16 12:36 ` Zoltan Menyhart
2006-06-16 12:41 ` Jes Sorensen
2006-06-16 14:56 ` Andi Kleen
2006-06-16 14:54 ` Andi Kleen
1 sibling, 2 replies; 42+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 12:36 UTC (permalink / raw)
To: Jes Sorensen
Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech,
linux-ia64
Just to make sure I understand it correctly...
Assuming I have allocated per CPU data (numa control, etc.) pointed at by:
void *per_cpu[MAXCPUS];
Assuming a per CPU variable has got an "offset" in each per CPU data area.
Accessing this variable can be done as follows:
err = vgetcpu(&my_cpu, ...);
if (err)
goto ....
pointer = (typeof(pointer)) (per_cpu[my_cpu] + offset);
// use "pointer"...
It is a hundred times longer than "__get_per_cpu(var)++".
As we do not know when we can be moved to another CPU,
"vgetcpu()" has to be called again after a "reasonably short" time.
My idea is to map the current task structure at an arch. dependent
virtual address into the user space (obviously in RO).
#define current ((struct task_struct *) 0x...)
No more need for "vgetcpu()" at all. The example above becomes:
pointer = (typeof(pointer)) (per_cpu[current->thread_info.cpu] + offset);
// use "pointer"...
As obtaining "pointer" does not cost much, it can be re-calculated at
each use => no problem knowing when to recheck it, and there is less
chance of using the data of a neighbor.
Regards,
Zoltan Menyhart
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:36 ` Zoltan Menyhart
@ 2006-06-16 12:41 ` Jes Sorensen
2006-06-16 12:48 ` Zoltan Menyhart
2006-06-16 14:56 ` Andi Kleen
1 sibling, 1 reply; 42+ messages in thread
From: Jes Sorensen @ 2006-06-16 12:41 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech,
linux-ia64
Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per CPU data (numa control, etc.) pointed at by:
I think you misunderstood - vgetcpu is for userland usage, not within
the kernel.
Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:41 ` Jes Sorensen
@ 2006-06-16 12:48 ` Zoltan Menyhart
2006-06-16 21:04 ` Chase Venters
0 siblings, 1 reply; 42+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 12:48 UTC (permalink / raw)
To: Jes Sorensen
Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech,
linux-ia64
Jes Sorensen wrote:
> Zoltan Menyhart wrote:
>
>>Just to make sure I understand it correctly...
>>Assuming I have allocated per CPU data (numa control, etc.) pointed at by:
>
>
> I think you misunderstood - vgetcpu is for userland usage, not within
> the kernel.
>
> Cheers,
> Jes
>
I did understand it as user land stuff.
This is why I want to map the current task structure into the user space.
In user code, we could see the actual value of the "current->thread_info.cpu".
My "#define current ((struct task_struct *) 0x...)" is not the same as
the kernel's one.
Thanks,
Zoltan
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 11:58 ` Jes Sorensen
2006-06-16 12:36 ` Zoltan Menyhart
@ 2006-06-16 14:54 ` Andi Kleen
2006-06-20 8:28 ` Jes Sorensen
1 sibling, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 14:54 UTC (permalink / raw)
To: discuss
Cc: Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech,
linux-ia64
> I really don't see the benefit here. malloc already gets pages handed
> down from the kernel which are node local due to them being assigned at
> a first touch basis. I am not sure about glibc's malloc internals, but
> rather rely on a vgetcpu() call, all it really needs to do is to keep
> a thread local pool which will automatically get it's thing locally
> through first touch usage.
That would add too much overhead on small systems. It's better to be
able to share the pools. vgetcpu allows that.
> > Basically it is just for extending the existing already used proven etc.
> > default local policy to sub pages. Also there might be other uses
> > of it too (like per CPU data), although I expect most use of that
> > in user space can be already done using TLS.
>
> The thread libraries already have their own thread local area which
> should be allocated on the thread's own node if done right, which I
> assume it is.
- The heap for small allocations is shared (although this can be tuned)
- When another thread does free() you need special handling to keep
the item in the correct free lists
This is one of the tricky bits in the new kernel NUMA slab allocator
too.
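One way that remote-free bookkeeping can be sketched (an illustrative
structure only; the field names and the drain scheme are assumptions, not
the actual NUMA slab code): each object records its home pool, a same-CPU
free takes an uncontended fast path, and a cross-CPU free parks the object
on a locked remote list that the owner drains later.

```c
#include <pthread.h>
#include <stddef.h>

struct pool;

struct chunk {
	struct pool *home;          /* pool this chunk was allocated from */
	struct chunk *next;
};

struct pool {
	int cpu;                    /* owning CPU */
	struct chunk *local_free;   /* touched only from the owning CPU */
	pthread_mutex_t remote_lock;
	struct chunk *remote_free;  /* filled by frees from other CPUs */
};

void pool_init(struct pool *p, int cpu)
{
	p->cpu = cpu;
	p->local_free = NULL;
	p->remote_free = NULL;
	pthread_mutex_init(&p->remote_lock, NULL);
}

void chunk_free(struct chunk *c, int current_cpu)
{
	struct pool *p = c->home;

	if (p->cpu == current_cpu) {
		c->next = p->local_free;      /* fast path: no contention */
		p->local_free = c;
	} else {
		pthread_mutex_lock(&p->remote_lock);
		c->next = p->remote_free;     /* slow path: park it for the owner */
		p->remote_free = c;
		pthread_mutex_unlock(&p->remote_lock);
	}
}

/* The owner periodically drains parked remote frees into its local list. */
void pool_drain(struct pool *p)
{
	pthread_mutex_lock(&p->remote_lock);
	struct chunk *list = p->remote_free;
	p->remote_free = NULL;
	pthread_mutex_unlock(&p->remote_lock);

	while (list) {
		struct chunk *next = list->next;
		list->next = p->local_free;
		p->local_free = list;
		list = next;
	}
}
```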
> > But cpusets already does this kind of, even though it has a quite
> > bad impact on fast paths.
> > Also what happens if the affinity mask is modified later?
> > From the high semantics point it is also a little dubious to mesh
> > them together. My feeling is that as a heuristic it is probably
> > dubious.
>
> If you migrate your app elsewhere, you should migrate the pages with it,
> or not expect things to run with the local effect.
That's too costly to do by default and you have no guarantee that it will amortize.
> I don't really see the point in solving something half way when it can
> be done better. Maybe the "serious" databases should open up and let us
> know what the problem is they are hitting.
I see no indication of anything better so far from you. You only offered
static configuration instead, which, while better in some cases,
doesn't work in the general case.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:36 ` Zoltan Menyhart
2006-06-16 12:41 ` Jes Sorensen
@ 2006-06-16 14:56 ` Andi Kleen
2006-06-16 15:31 ` Zoltan Menyhart
2006-06-16 15:36 ` Brent Casavant
1 sibling, 2 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 14:56 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha,
vojtech, linux-ia64
On Friday 16 June 2006 14:36, Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per CPU data (numa control, etc.) pointed at by:
>
> void *per_cpu[MAXCPUS];
That is not how user space TLS works. It usually has a base register.
>
> Assuming a per CPU variable has got an "offset" in each per CPU data area.
> Accessing this variable can be done as follows:
>
> err = vgetcpu(&my_cpu, ...);
> if (err)
> goto ....
> pointer = (typeof pointer) (per_cpu[my_cpu] + offset);
> // use "pointer"...
>
> It is hundred times more long than "__get_per_cpu(var)++".
14 cycles is not 100 times longer.
> My idea is to map the current task structure at an arch. dependent
> virtual address into the user space (obviously in RO).
>
> #define current ((struct task_struct *) 0x...)
This means it cannot be cache colored (because you would need a static
offset) and you couldn't share task_structs on a page.
Also you would make task_struct part of the userland ABI which
seems like a very very bad idea to me. It means we couldn't change
it anymore.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 14:56 ` Andi Kleen
@ 2006-06-16 15:31 ` Zoltan Menyhart
2006-06-16 15:37 ` Andi Kleen
2006-06-16 21:12 ` Chase Venters
2006-06-16 15:36 ` Brent Casavant
1 sibling, 2 replies; 42+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 15:31 UTC (permalink / raw)
To: Andi Kleen
Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha,
vojtech, linux-ia64
Andi Kleen wrote:
> That is not how user space TLS works. It usually has a base register.
Can you please give me a real life (simplified) example?
> This means it cannot be cache colored (because you would need a static
> offset) and you couldn't share task_structs on a page.
I do not see the problem. Can you explain please?
E.g. the scheduler pulls a task instead of the current one. The CPU
will see "current->thread_info.cpu"-s of all the tasks at the same
offset anyway.
> Also you would make task_struct part of the userland ABI which
> seems like a very very bad idea to me. It means we couldn't change
> it anymore.
We can make some wrapper, e.g.:
user_per_cpu_var(name, offset)
"vgetcpu()" would also be added to the ABI which we couldn't change
easily either.
Thanks,
Zoltan
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 14:56 ` Andi Kleen
2006-06-16 15:31 ` Zoltan Menyhart
@ 2006-06-16 15:36 ` Brent Casavant
2006-06-16 15:40 ` Andi Kleen
1 sibling, 1 reply; 42+ messages in thread
From: Brent Casavant @ 2006-06-16 15:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
On Fri, 16 Jun 2006, Andi Kleen wrote:
> On Friday 16 June 2006 14:36, Zoltan Menyhart wrote:
> > My idea is to map the current task structure at an arch. dependent
> > virtual address into the user space (obviously in RO).
> >
> > #define current ((struct task_struct *) 0x...)
>
> This means it cannot be cache colored (because you would need a static
> offset) and you couldn't share task_structs on a page.
>
> Also you would make task_struct part of the userland ABI which
> seems like a very very bad idea to me. It means we couldn't change
> it anymore.
To this last point, it might be more reasonable to map in a page that
contained a new structure with a stable ABI, which mirrored some of
the task_struct information, and likely other useful information as
needs are identified in the future. In any case, it would be hard
to beat a single memory read for performance.
Cache-coloring and kernel bookkeeping effects could be minimized if this
was provided as an mmaped page from a device driver, used only by
applications which care. This does work somewhat contrary to the idea of
getting support into glibc, unless glibc only used this capability when
asked to through some sort of environment variable or other run-time
configuration.
Brent
--
Brent Casavant All music is folk music. I ain't
bcasavan@sgi.com never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:31 ` Zoltan Menyhart
@ 2006-06-16 15:37 ` Andi Kleen
2006-06-16 15:58 ` Jakub Jelinek
2006-06-16 21:12 ` Chase Venters
1 sibling, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 15:37 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha,
vojtech, linux-ia64
On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> Andi Kleen wrote:
>
> > That is not how user space TLS works. It usually has a base register.
>
> Can you please give me a real life (simplified) example?
On x86-64 it's just %fs:offset. gcc is a bit dumb on this and usually
loads the base address from %fs:0 first.
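For illustration, a __thread variable is all it takes at the source level;
whether the compiled access uses a direct %fs-relative operand or loads the
base from %fs:0 first depends on the -mtls-direct-seg-refs setting (the
comment describes the common x86-64 local-exec case, not a guarantee):

```c
/* One instance of "counter" per thread. On x86-64, gcc typically
 * addresses it relative to the %fs segment base in a single
 * instruction - no syscall and no table lookup. */
static __thread int counter;

int bump(void)
{
	return ++counter;   /* plain increment of a %fs-relative location */
}
```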
>
> > This means it cannot be cache colored (because you would need a static
> > offset) and you couldn't share task_structs on a page.
>
> I do not see the problem.
Your scheme relies on task_struct fields being on a known offset
in the page. But slab cache coloring varies the offset to make the data
spread out better in the caches.
> Can you explain please?
> E.g. the scheduler pulls a task instead of the current one. The CPU
> will see "current->thread_info.cpu"-s of all the tasks at the same
> offset anyway.
It varies relative to the start of the page.
That was one of the bigger wins over the task_struct-in-the-stack-page
scheme that 2.4 had.
>
> > Also you would make task_struct part of the userland ABI which
> > seems like a very very bad idea to me. It means we couldn't change
> > it anymore.
>
> We can make some wrapper, e.g.:
>
> user_per_cpu_var(name, offset)
You would need to wrap everything and likely users would like
task_struct so much that they would access it anyway without your wrappers.
> "vgetcpu()" would also be added to the ABI which we couldn't change
> easily either.
Yes, but it's a defined function. No different from a system call.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:36 ` Brent Casavant
@ 2006-06-16 15:40 ` Andi Kleen
2006-06-16 21:15 ` Chase Venters
2006-06-16 21:19 ` Chase Venters
0 siblings, 2 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 15:40 UTC (permalink / raw)
To: Brent Casavant
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
> To this last point, it might be more reasonable to map in a page that
> contained a new structure with a stable ABI, which mirrored some of
> the task_struct information, and likely other useful information as
> needs are identified in the future. In any case, it would be hard
> to beat a single memory read for performance.
That would mean making the context switch and possibly other
things slower.
In general you would need to make a very good case first that all this
complexity is worth it.
> Cache-coloring and kernel bookkeeping effects could be minimized if this
> was provided as an mmaped page from a device driver, used only by
> applications which care.
I don't see what difference that would make. You would still
have the fixed offset problem and doing things on demand often tends
to be even more complex.
-Andi (who thinks these proposals all sound very messy)
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:37 ` Andi Kleen
@ 2006-06-16 15:58 ` Jakub Jelinek
2006-06-16 16:24 ` Andi Kleen
0 siblings, 1 reply; 42+ messages in thread
From: Jakub Jelinek @ 2006-06-16 15:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
On Fri, Jun 16, 2006 at 05:37:06PM +0200, Andi Kleen wrote:
> On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> > Andi Kleen wrote:
> >
> > > That is not how user space TLS works. It usually has a base register.
> >
> > Can you please give me a real life (simplified) example?
>
> On x86-64 it's just %fs:offset. gcc is a bit dumb on this and usually
> loads the base address from %fs:0 first.
GCC is not dumb, unless you force it with -mno-tls-direct-seg-refs.
Guess you are bitten by the SUSE GCC hack which makes -mno-tls-direct-seg-refs
the default (especially on x86-64 it is a really bad idea).
Jakub
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:58 ` Jakub Jelinek
@ 2006-06-16 16:24 ` Andi Kleen
2006-06-16 16:33 ` Jakub Jelinek
0 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2006-06-16 16:24 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
On Friday 16 June 2006 17:58, Jakub Jelinek wrote:
> On Fri, Jun 16, 2006 at 05:37:06PM +0200, Andi Kleen wrote:
> > On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> > > Andi Kleen wrote:
> > >
> > > > That is not how user space TLS works. It usually has a base register.
> > >
> > > Can you please give me a real life (simplified) example?
> >
> > On x86-64 it's just %fs:offset. gcc is a bit dumb on this and usually
> > loads the base address from %fs:0 first.
>
> GCC is not dumb, unless you force it with -mno-tls-direct-seg-refs.
> Guess you are bitten by SUSE GCC hack which makes -mno-tls-direct-seg-refs
> the default (especially on x86-64 it is a really bad idea).
Apparently I did indeed.
I wonder why it happened on x86-64 though - I thought there were no negative
offsets on x86-64 TLS.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 16:24 ` Andi Kleen
@ 2006-06-16 16:33 ` Jakub Jelinek
0 siblings, 0 replies; 42+ messages in thread
From: Jakub Jelinek @ 2006-06-16 16:33 UTC (permalink / raw)
To: Andi Kleen
Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
On Fri, Jun 16, 2006 at 06:24:52PM +0200, Andi Kleen wrote:
> I wonder why it happened on x86-64 though - i thought there were no negative
> offsets on x86-64 TLS.
It uses negative offsets for __thread vars and positive ones are reserved
for the implementation (i.e. glibc). But as %fs in 64-bit programs is just
msr 0xc0000100 base addition, with no segment limit, neither Xen nor VMWare
can play limit tricks with it.
Jakub
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 12:48 ` Zoltan Menyhart
@ 2006-06-16 21:04 ` Chase Venters
0 siblings, 0 replies; 42+ messages in thread
From: Chase Venters @ 2006-06-16 21:04 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Jes Sorensen, Andi Kleen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
On Fri, 16 Jun 2006, Zoltan Menyhart wrote:
> Jes Sorensen wrote:
>> Zoltan Menyhart wrote:
>>
>> > Just to make sure I understand it correctly...
>> > Assuming I have allocated per CPU data (numa control, etc.) pointed at
>> > by:
>>
>>
>> I think you misunderstood - vgetcpu is for userland usage, not within
>> the kernel.
>>
>> Cheers,
>> Jes
>>
> I did understand it as a user land stuff.
> This is why I want to map the current task structure into the user space.
> In user code, we could see the actual value of the
> "current->thread_info.cpu".
> My "#define current ((struct task_struct *) 0x...)" is not the same as
> the kernel's one.
I think it's probably best to leave most of the stuff in task_struct
private (ie, mapped in kernel only).
> Thanks,
>
> Zoltan
Thanks,
Chase
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:31 ` Zoltan Menyhart
2006-06-16 15:37 ` Andi Kleen
@ 2006-06-16 21:12 ` Chase Venters
1 sibling, 0 replies; 42+ messages in thread
From: Chase Venters @ 2006-06-16 21:12 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: Andi Kleen, Jes Sorensen, Tony Luck, discuss, linux-kernel,
libc-alpha, vojtech, linux-ia64
On Fri, 16 Jun 2006, Zoltan Menyhart wrote:
> Andi Kleen wrote:
>
>> That is not how user space TLS works. It usually has a base register.
>
> Can you please give me a real life (simplified) example?
>
>> This means it cannot be cache colored (because you would need a static
>> offset) and you couldn't share task_structs on a page.
>
> I do not see the problem. Can you explain please?
> E.g. the scheduler pulls in a task other than the current one. The CPU
> will see the "current->thread_info.cpu" of every task at the same
> offset anyway.
Memory maps have to fall on page boundaries for a variety of reasons.
Assuming a 16-word cache line, you've got plenty of spots you could align
task_struct to within a page. (That number of spots is actually
constrained by either sizeof(task_struct) or the number of colors).
The bottom line is that task_struct won't always be on a page boundary. If
it's not on a page boundary in the physical page frames, it's not going to
be on a page boundary in virtual memory either.
(Note also that if two task_structs shared a page, you'd have an
information leak. I'm not sure, given sizeof(task_struct) and cache
alignment, whether task_structs are small enough for sharing, though.
Definitely on hugepages.)
Thanks,
Chase
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:40 ` Andi Kleen
@ 2006-06-16 21:15 ` Chase Venters
2006-06-16 21:19 ` Chase Venters
1 sibling, 0 replies; 42+ messages in thread
From: Chase Venters @ 2006-06-16 21:15 UTC (permalink / raw)
To: Andi Kleen
Cc: Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss,
linux-kernel, libc-alpha, vojtech, linux-ia64
On Fri, 16 Jun 2006, Andi Kleen wrote:
>
>> To this last point, it might be more reasonable to map in a page that
>> contained a new structure with a stable ABI, which mirrored some of
>> the task_struct information, and likely other useful information as
>> needs are identified in the future. In any case, it would be hard
>> to beat a single memory read for performance.
>
> That would mean making the context switch and possibly other
> things slower.
>
> In general you would need to make a very good case first that all this
> complexity is worth it.
>
>> Cache-coloring and kernel bookkeeping effects could be minimized if this
>> was provided as an mmaped page from a device driver, used only by
>> applications which care.
>
> I don't see what difference that would make. You would still
> have the fixed offset problem and doing things on demand often tends
> to be even more complex.
>
>
> -Andi (who thinks these proposals all sound very messy)
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 15:40 ` Andi Kleen
2006-06-16 21:15 ` Chase Venters
@ 2006-06-16 21:19 ` Chase Venters
2006-06-16 23:40 ` Brent Casavant
2006-06-17 6:55 ` [discuss] " Andi Kleen
1 sibling, 2 replies; 42+ messages in thread
From: Chase Venters @ 2006-06-16 21:19 UTC (permalink / raw)
To: Andi Kleen
Cc: Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss,
linux-kernel, libc-alpha, vojtech, linux-ia64
(Sorry for the empty reply! Pine over a laggy SSH connection is annoying
sometimes)
On Fri, 16 Jun 2006, Andi Kleen wrote:
>
>> To this last point, it might be more reasonable to map in a page that
>> contained a new structure with a stable ABI, which mirrored some of
>> the task_struct information, and likely other useful information as
>> needs are identified in the future. In any case, it would be hard
>> to beat a single memory read for performance.
>
> That would mean making the context switch and possibly other
> things slower.
Well, if every process had a page of its own, what would the context
switch overhead be?
But, I'm not advocating exporting anything. Though I sort of like the
vgetcpu() idea because I was working on a user-space slab allocator
recently and magazines could use vgetcpu() instead of pthread keys.
(It also means that if threads > CPUs I'd get better results.)
Thanks,
Chase
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 21:19 ` Chase Venters
@ 2006-06-16 23:40 ` Brent Casavant
2006-06-17 6:58 ` Andi Kleen
2006-06-17 6:55 ` [discuss] " Andi Kleen
1 sibling, 1 reply; 42+ messages in thread
From: Brent Casavant @ 2006-06-16 23:40 UTC (permalink / raw)
To: Chase Venters
Cc: Andi Kleen, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss,
linux-kernel, libc-alpha, vojtech, linux-ia64
On Fri, 16 Jun 2006, Chase Venters wrote:
> On Fri, 16 Jun 2006, Andi Kleen wrote:
>
> >
> > > To this last point, it might be more reasonable to map in a page that
> > > contained a new structure with a stable ABI, which mirrored some of
> > > the task_struct information, and likely other useful information as
> > > needs are identified in the future. In any case, it would be hard
> > > to beat a single memory read for performance.
> >
> > That would mean making the context switch and possibly other
> > things slower.
>
> Well, if every process had a page of its own, what would the context switch
> overhead be?
Mostly copying the useful information into the read-only mapped page.
However, this doesn't have to be all that expensive. The particular
information we care about in this case only needs to be copied when a
task begins running on a CPU different from the one it last ran on. In
fact, on ia64 we already have something very similar to handle certain
I/O peculiarities on SN2.
http://marc.theaimsgroup.com/?l=linux-ia64&m=113831137712197&w=2
That work could form the basis for a low-impact method of exporting
the current CPU to user space via a read-only mapped page. I'll admit
to having zero knowledge of whether this would be workable on anything
other than ia64.
Thanks,
Brent
--
Brent Casavant All music is folk music. I ain't
bcasavan@sgi.com never heard a horse sing a song.
Silicon Graphics, Inc. -- Louis Armstrong
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 21:19 ` Chase Venters
2006-06-16 23:40 ` Brent Casavant
@ 2006-06-17 6:55 ` Andi Kleen
2006-06-19 8:42 ` Zoltan Menyhart
1 sibling, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2006-06-17 6:55 UTC (permalink / raw)
To: discuss
Cc: Chase Venters, Brent Casavant, Zoltan Menyhart, Jes Sorensen,
Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64
On Friday 16 June 2006 23:19, Chase Venters wrote:
> On Fri, 16 Jun 2006, Andi Kleen wrote:
> >> To this last point, it might be more reasonable to map in a page that
> >> contained a new structure with a stable ABI, which mirrored some of
> >> the task_struct information, and likely other useful information as
> >> needs are identified in the future. In any case, it would be hard
> >> to beat a single memory read for performance.
> >
> > That would mean making the context switch and possibly other
> > things slower.
>
> Well, if every process had a page of its own, what would the context
> switch overhead be?
For a per-process page zero; for a per-thread page quite high on x86,
because you would need per-CPU page tables. Doing that would be extremely
nasty because you would potentially need to allocate a new
set of page tables every time the process is scheduled to a
CPU it hasn't run on before.
If you limit it to a process then you can't get the current CPU
from such a mapping because a process can run threaded on
multiple CPUs.
My reference was more to his suggestion of keeping a second version
of task_struct for export. That would require every field of task_struct
that is changed in switch_to() to be updated in the exported copy too.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 23:40 ` Brent Casavant
@ 2006-06-17 6:58 ` Andi Kleen
0 siblings, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-17 6:58 UTC (permalink / raw)
To: Brent Casavant
Cc: Chase Venters, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss,
linux-kernel, libc-alpha, vojtech, linux-ia64
> That work could form the basis for a low-impact method of exporting
> the current CPU to user space via a read-only mapped page. I'll admit
> to having zero knowledge of whether this would be workable on anything
> other than ia64.
On x86 per CPU mappings are not really feasible. That is because
the CPU uses the Linux page tables directly and to change them
per CPU you would need to fork them per CPU. That would add so many
complications that I don't even want to think them all through ...
-andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
` (3 preceding siblings ...)
2006-06-15 18:44 ` Tony Luck
@ 2006-06-19 0:15 ` Paul Jackson
2006-06-19 8:21 ` Andi Kleen
4 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2006-06-19 0:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech
Interesting - thanks Andi.
I had one of my colleagues at SGI lobby me hard for such a facility.
I'll see if I can get him on this thread to better explain what he
wanted it for.
Roughly, he was looking to support something resembling the kernel's
per-cpu data in userland library code for high performance scientific
number crunching, for things like statistics gathering and perhaps (not
sure of this) reduce locking costs.
I see "x86-64" in the Subject. I don't see why this facility is
arch-specific. Could it work on any arch, ia64 being the one of
interest to me?
I have some ignorance on your references to "CPUID(1)". I don't recall
what it is. The only command so named I find on my systems is a
Windows command from 1999. I doubt that's it. You wrote:
> As you can see CPUID(1) is always very slow
but I don't see any stats above the comment mentioning CPUID(1),
so ... er eh ... no I don't see.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 0:15 ` Paul Jackson
@ 2006-06-19 8:21 ` Andi Kleen
2006-06-19 10:09 ` Paul Jackson
2006-06-21 1:18 ` Paul Jackson
0 siblings, 2 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-19 8:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: discuss, linux-kernel, libc-alpha, vojtech
On Monday 19 June 2006 02:15, Paul Jackson wrote:
>
> Roughly, he was looking to support something resembling the kernel's
> per-cpu data in userland library code for high performance scientific
> number crunching, for things like statistics gathering and perhaps (not
> sure of this) reduce locking costs.
While vgetcpu() can be used for this, most likely glibc TLS is already
good enough. So it will help, but I don't think it's the primary
motivation.
> I see "x86-64" in the Subject. I don't see why this facility is
> arch-specific. Could it work on any arch, ia64 being the one of
> interest to me?
The implementation is x86-64 specific and optimized for x86-64. You could
probably implement something with the same prototype for IA64 too,
although the internal implementation will likely be very different
(there is nothing x86-64 specific in the prototype.)
AFAIK ia64 supports fast system calls so it might be possible to
do a simple implementation without vsyscalls.
> I have some ignorance on your references to "CPUID(1)". I don't recall
> what it is. The only command so named I find on my systems are a
CPUID leaf 1 is an x86 instruction that is one way to implement a
user-level vgetcpu on x86.
-Andi
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-17 6:55 ` [discuss] " Andi Kleen
@ 2006-06-19 8:42 ` Zoltan Menyhart
2006-06-19 8:54 ` Andi Kleen
0 siblings, 1 reply; 42+ messages in thread
From: Zoltan Menyhart @ 2006-06-19 8:42 UTC (permalink / raw)
To: Andi Kleen
Cc: discuss, Chase Venters, Brent Casavant, Jes Sorensen, Tony Luck,
linux-kernel, libc-alpha, vojtech, linux-ia64
Brent Casavant wrote:
> To this last point, it might be more reasonable to map in a page that
> contained a new structure with a stable ABI, which mirrored some of
> the task_struct information, and likely other useful information as
> needs are identified in the future. In any case, it would be hard
> to beat a single memory read for performance.
>
> Cache-coloring and kernel bookkeeping effects could be minimized if this
> was provided as an mmaped page from a device driver, used only by
> applications which care. This does work somewhat contrary to the idea of
> getting support into glibc, unless glibc only used this capability when
> asked to through some sort of environment variable or other run-time
> configuration.
Quite O.K. for me.
Andi Kleen wrote:
>>Well, if every process had a page of its own, what would the context
>>switch overhead be?
> For process zero, for thread quite high on x86 because you
> would need per CPU page tables. Doing that would be extremly
> nasty because you would potentially need to allocate a new
> set of page tables every time the process is scheduled to a new
> CPU it hasn't run on before.
Probably I have not explained it correctly:
- The "information page" (that includes the current CPU no.) is not a
per CPU page
- This page is just another page that is mapped at a "well known" user
virtual address (for those who are interested in)
- As you do not take any special action for each user page on context
switch, there is nothing to do for this page either
- The scheduler sometimes migrates a task, and then it updates the
current CPU number on the "information page"
> My reference was more to high suggestion of keeping a second version
> of task_struct for export. That would require changing everything
> in task struct that is changed on switch_to and should be exported
> in the other function too.
It depends on what else can be in this "information page".
As for the current CPU no., you need a single store on each task migration.
Thanks,
Zoltan
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 8:42 ` Zoltan Menyhart
@ 2006-06-19 8:54 ` Andi Kleen
0 siblings, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2006-06-19 8:54 UTC (permalink / raw)
To: Zoltan Menyhart
Cc: discuss, Chase Venters, Brent Casavant, Jes Sorensen, Tony Luck,
linux-kernel, libc-alpha, vojtech, linux-ia64
> Probably I have not explained it correctly:
> - The "information page" (that includes the current CPU no.) is not a
> per CPU page
If it isn't then you can't figure out the current CPU/node for a thread.
Anyway, I think we're talking past each other. Your approach might
even work on ia64 (at least if you're willing to add a lot of cost
to the context switch). You presumably could implement vgetcpu()
internally with an approach like this (although with IA64's fast
EPC calls it seems a bit pointless).
It just won't work on x86.
-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 8:21 ` Andi Kleen
@ 2006-06-19 10:09 ` Paul Jackson
2006-06-21 1:18 ` Paul Jackson
1 sibling, 0 replies; 42+ messages in thread
From: Paul Jackson @ 2006-06-19 10:09 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech
Andi wrote:
> glibc TLS
Good idea - thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-16 14:54 ` Andi Kleen
@ 2006-06-20 8:28 ` Jes Sorensen
0 siblings, 0 replies; 42+ messages in thread
From: Jes Sorensen @ 2006-06-20 8:28 UTC (permalink / raw)
To: Andi Kleen
Cc: discuss, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64
Andi Kleen wrote:
>> I really don't see the benefit here. malloc already gets pages handed
>> down from the kernel which are node local due to them being assigned on
>> a first-touch basis. I am not sure about glibc's malloc internals, but
>> rather than rely on a vgetcpu() call, all it really needs to do is keep
>> a thread-local pool which will automatically get its memory locally
>> through first-touch usage.
>
> That would add too much overhead on small systems. It's better to be
> able to share the pools. vgetcpu allows that.
How do you expect to be able to share the pools? Or are you saying you
just want one page per NUMA node? Having a page per thread is not noticeable,
and for databases, which were your primary target user group, I think it's
fair to say it won't even be visible as noise.
>>> Basically it is just for extending the existing already used proven etc.
>>> default local policy to sub pages. Also there might be other uses
>>> of it too (like per CPU data), although I expect most use of that
>>> in user space can be already done using TLS.
>> The thread libraries already have their own thread local area which
>> should be allocated on the thread's own node if done right, which I
>> assume it is.
>
> - The heap for small allocations is shared (although this can be tuned)
> - When another thread does free() you need special handling to keep
> the item in the correct free lists
> This is one of the tricky bits in the new kernel NUMA slab allocator
> too.
It should be pretty easy to make the allocator aware of the per thread
regions based on the address.
>> If you migrate your app elsewhere, you should migrate the pages with it,
>> or not expect things to run with the local effect.
>
> That's too costly to do by default and you have no guarantee that it will amortize.
But if you don't migrate the pages with it, the numa aware allocation is
wasted anyway, whether you do it on a first-touch basis or using
vgetcpu.
>> I don't really see the point in solving something half way when it can
>> be done better. Maybe the "serious" databases should open up and let us
>> know what the problem is they are hitting.
>
> I see no indication of anything better so far from you. You only offered
> static configuration instead which while in some cases is better
> doesn't work in the general case.
Static configuration? I never said anything about that, I said that libc
should offer a memory pool per thread and have it created when it's
first touched by the thread. That solves exactly what you have described
so far, unless there is something else you also expect to benefit from
vgetcpu().
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-19 8:21 ` Andi Kleen
2006-06-19 10:09 ` Paul Jackson
@ 2006-06-21 1:18 ` Paul Jackson
2006-06-21 1:21 ` Paul Jackson
1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2006-06-21 1:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: discuss, linux-kernel, libc-alpha, vojtech
Andi wrote:
> While vgetcpu() can be used for this most likely glibc TLS is already
> good enough for this. So it will help, but I don't think it's the primary
> motivation.
Elsewhere on this thread, Jes wrote:
> ... libc
> should offer a memory pool per thread and have it created when it's
> first touched by the thread. That solves exactly what you have described
> so far unless is something else you also expect to benefit from
> vgetcpu().
I don't see a reply from you (Andi) on Jes's comment.
Why can't Thread Local Storage (TLS) or other per-thread data be used
for a memory pool, as Jes suggests?
It seems to me that we don't need vgetcpu() at all. Instead, we should
make things that would use it per-thread, not per-cpu. If it works for
the statistics gathering you recommended I use TLS for, why not for
malloc pages as well?
That would seem to be a better abstraction anyway:
* A thread's CPU can be changed without notice, but a task's threads
don't change unless the task intentionally does it.
* Two threads on the same CPU could collide on some per-cpu
data, where they'd be find on per-thread data.
We already have user visibility into what cpu a task is executing on,
via the /proc/<pid>/stat file (39th field). That's slow, of course.
The main reason for speeding it up seems to be to make it useful in
performance-critical places, turning it from an infrequently used debugging
option into a critical element of certain NUMA-aware user-level code.
I'd think we should not be introducing a new construct, a thread's
current CPU, as such a first-class component unless:
1) we do so on all arches, and
2) we don't already have a better construct (TLS).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
2006-06-21 1:18 ` Paul Jackson
@ 2006-06-21 1:21 ` Paul Jackson
0 siblings, 0 replies; 42+ messages in thread
From: Paul Jackson @ 2006-06-21 1:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: ak, discuss, linux-kernel, libc-alpha, vojtech
Typo:
> where they'd be find on per-thread data.
s/find/fine/
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
end of thread, other threads:[~2006-06-21 1:22 UTC | newest]
Thread overview: 42+ messages
-- links below jump to the message on this page --
2006-06-14 7:42 FOR REVIEW: New x86-64 vsyscall vgetcpu() Andi Kleen
2006-06-14 10:47 ` Alan Cox
2006-06-14 14:54 ` Steve Munroe
2006-06-15 23:17 ` Benjamin Herrenschmidt
[not found] ` <449029DB.7030505@redhat.com>
[not found] ` <200606141752.02361.ak@suse.de>
2006-06-14 16:30 ` Ulrich Drepper
2006-06-14 17:34 ` [discuss] " Andi Kleen
2006-06-15 18:44 ` Tony Luck
2006-06-16 6:22 ` Andi Kleen
2006-06-16 7:23 ` Gerd Hoffmann
2006-06-16 7:37 ` Andi Kleen
2006-06-16 9:48 ` Jes Sorensen
2006-06-16 10:09 ` Andi Kleen
2006-06-16 11:02 ` Jes Sorensen
2006-06-16 11:17 ` Andi Kleen
2006-06-16 11:58 ` Jes Sorensen
2006-06-16 12:36 ` Zoltan Menyhart
2006-06-16 12:41 ` Jes Sorensen
2006-06-16 12:48 ` Zoltan Menyhart
2006-06-16 21:04 ` Chase Venters
2006-06-16 14:56 ` Andi Kleen
2006-06-16 15:31 ` Zoltan Menyhart
2006-06-16 15:37 ` Andi Kleen
2006-06-16 15:58 ` Jakub Jelinek
2006-06-16 16:24 ` Andi Kleen
2006-06-16 16:33 ` Jakub Jelinek
2006-06-16 21:12 ` Chase Venters
2006-06-16 15:36 ` Brent Casavant
2006-06-16 15:40 ` Andi Kleen
2006-06-16 21:15 ` Chase Venters
2006-06-16 21:19 ` Chase Venters
2006-06-16 23:40 ` Brent Casavant
2006-06-17 6:58 ` Andi Kleen
2006-06-17 6:55 ` [discuss] " Andi Kleen
2006-06-19 8:42 ` Zoltan Menyhart
2006-06-19 8:54 ` Andi Kleen
2006-06-16 14:54 ` Andi Kleen
2006-06-20 8:28 ` Jes Sorensen
2006-06-19 0:15 ` Paul Jackson
2006-06-19 8:21 ` Andi Kleen
2006-06-19 10:09 ` Paul Jackson
2006-06-21 1:18 ` Paul Jackson
2006-06-21 1:21 ` Paul Jackson