* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
       [not found] ` <200606160822.23898.ak@suse.de>
@ 2006-06-16  9:48   ` Jes Sorensen
  2006-06-16 10:09     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Jes Sorensen @ 2006-06-16 9:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:

Andi> On Thursday 15 June 2006 20:44, Tony Luck wrote:
>> Another alternative would be to provide a mechanism for a process
>> to bind to the current cpu (whatever cpu that happens to be). Then
>> the kernel gets to make the smart placement decisions, and
>> processes that want to be bound somewhere (but don't really care
>> exactly where) have a way to meet their need. Perhaps a cpumask of
>> all zeroes to a sched_setaffinity call could be overloaded for
>> this?

Andi> I tried something like this a few years ago and it just didn't
Andi> work (or rather ran usually slower). The scheduler would select
Andi> a home node at startup and then try to move the process there.
Andi> The problem is that not using a CPU costs you much more than
Andi> whatever overhead you get from using non-local memory.

It all depends on your application and the type of system you are
running on. What you say applies to smaller CPU counts. However, once
the upcoming larger-count multi-core CPUs become commonly available,
this is likely to change and become more like what is seen today on
larger NUMA systems.

In the scientific application space, there are two very common
groupings of jobs. One is simply a large threaded application with a
lot of intercommunication, often via MPI. In many cases one ends up
running a job on just a subset of the system, in which case you want
to see threads placed on the same node(s) to minimize internode
communication.
It is also desirable to force the other tasks on the system (system
daemons etc.) onto other node(s) to reduce noise, and there could also
be space to run another parallel job on the remaining node(s).

The other common case is to have jobs which spawn off a number of
threads that work together in groups (via OpenMP). In this case you
would like to have all your OpenMP threads placed on the same node for
similar reasons. Not getting this right can result in significant loss
of performance for jobs which are highly memory bound or rely heavily
on intercommunication and synchronization.

Andi> So by default filling the CPUs must be the highest priority and
Andi> memory policy cannot interfere with that.

I really don't think this approach is going to solve the problem. As
Tony also points out, tasks will eventually migrate. The user needs to
tell the kernel where it wants to run the tasks, rather than the
kernel telling the task where it is located. Only the application (or
developer/user) knows how the threads are expected to behave; doing
this automatically is almost never going to be optimal. Obviously the
user needs visibility of the topology of the machine to do so, but
that should be available on any NUMA system through /proc or /sys.

In the scientific space the jobs are often run repeatedly with new
data sets every time, so it is worthwhile to spend the effort up front
to get the placement right. One-off runs are obviously something else,
and there your method is going to be more beneficial.

IMHO, what we really need is a more advanced way for user applications
to hint to the kernel how to place their threads.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 27+ messages in thread
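[Editor's note: Tony's "bind to the current cpu" suggestion quoted above can be sketched with the affinity interfaces glibc later shipped; sched_getcpu() is essentially the wrapper the getcpu/vgetcpu work eventually turned into. This is an illustrative sketch, not code from the thread, and bind_to_current_cpu() is an invented name.]

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to whichever CPU it happens to run on right
 * now.  Note the inherent race: the thread may migrate between
 * sched_getcpu() and sched_setaffinity(); either way, after the call
 * it is firmly bound to the CPU named in the mask.
 * Returns the CPU we ended up bound to, or -1 on error. */
int bind_to_current_cpu(void)
{
	cpu_set_t set;
	int cpu = sched_getcpu();

	if (cpu < 0)
		return -1;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return cpu;
}
```

This is what Tony's overloaded all-zeroes cpumask would amount to, done from user space in two calls instead of one.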
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16  9:48   ` FOR REVIEW: New x86-64 vsyscall vgetcpu() Jes Sorensen
@ 2006-06-16 10:09     ` Andi Kleen
  2006-06-16 11:02       ` Jes Sorensen
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2006-06-16 10:09 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

> It all depends on your application and the type of system you are
> running on. What you say applies to smaller CPU counts. However, once
> the upcoming larger-count multi-core CPUs become commonly available,
> this is likely to change and become more like what is seen today on
> larger NUMA systems.

Maybe. Maybe not.

> In the scientific application space, there are two very common
> groupings of jobs.

The scientific users just use pinned CPUs and seem to be happy with
that. They also have cheap slav^wgrade students to spend lots of time
on manual tuning. I'm not concerned about them.

If you already use CPU affinity you should already know where you are
and don't need this call at all. So this clearly isn't targeted at
them.

Interesting is getting the best performance from general-purpose
applications without any special tuning. For them I'm trying to
improve things. Number one applications currently are databases and
JVMs. I hope with Wolfram's malloc work it will be useful for more
applications too.

> Andi> So by default filling the CPUs must be the highest priority and
> Andi> memory policy cannot interfere with that.
>
> I really don't think this approach is going to solve the problem. As
> Tony also points out, tasks will eventually migrate.

Currently we don't solve this problem with the standard heuristics. It
can be solved with manual tuning (mempolicy, explicit CPU affinity),
but if you're doing that you're already outside the primary use case
of vgetcpu().

vgetcpu() is only trying to be an incremental improvement of the
current simple default local policy.
> The user needs to

Scientific users do that, but other users normally don't. I doubt that
is going to change.

-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 10:09     ` Andi Kleen
@ 2006-06-16 11:02       ` Jes Sorensen
  2006-06-16 11:17         ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Jes Sorensen @ 2006-06-16 11:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Andi Kleen wrote:
>> In the scientific application space, there are two very common
>> groupings of jobs.
>
> The scientific users just use pinned CPUs and seem to be happy with
> that. They also have cheap slav^wgrade students to spend lots of time
> on manual tuning. I'm not concerned about them.

Do they? There are a lot of scientific sites out there which are not
universities or research organizations. They do not have free slave
labour at hand. A lot of users fall into this category, especially the
users with larger systems or large clusters (be it ia64, x86_64 or
PPC).

> If you already use CPU affinity you should already know where you are
> and don't need this call at all.

Except that what's currently available isn't sufficient to do what is
needed.

> So this clearly isn't targeted at them.
>
> Interesting is getting the best performance from general-purpose
> applications without any special tuning. For them I'm trying to
> improve things.

Well, I am interested in getting the best performance for some of the
same applications, without having to modify them. The current affinity
support simply isn't sufficient for that. Placement has to be targeted
at launch time, since thread implementations can change the layout
etc.

> Number one applications currently are databases and JVMs. I hope with
> Wolfram's malloc work it will be useful for more applications too.

If you want this to work for general-purpose applications, then how is
this new syscall going to help? If you expect application vendors to
code for it, that means few users will benefit.

>> I really don't think this approach is going to solve the problem.
>> As Tony also points out, tasks will eventually migrate.
>
> Currently we don't solve this problem with the standard heuristics.
> It can be solved with manual tuning (mempolicy, explicit CPU
> affinity), but if you're doing that you're already outside the
> primary use case of vgetcpu().

This is another area where the kernel could do better, by possibly
using the cpumask to determine where it will allocate memory.

> vgetcpu() is only trying to be an incremental improvement of the
> current simple default local policy.

As Tony rightfully pointed out, tasks do migrate. By making this guess
initially and then expecting the application to run for a long time,
you will end up with it having zero or possibly a negative effect.

>> The user needs to
>
> Scientific users do that, but other users normally don't. I doubt
> that is going to change.

I just use scientific users since that's where I have the most recent
detailed data from. Databases could well benefit from what I
mentioned, though the serious ones would want to look into using
affinity support explicitly in their code.

Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 11:02       ` Jes Sorensen
@ 2006-06-16 11:17         ` Andi Kleen
  2006-06-16 11:58           ` Jes Sorensen
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2006-06-16 11:17 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

> The current affinity support simply isn't sufficient for that.
> Placement has to be targeted at launch time, since thread
> implementations can change the layout etc.

I'm not sure how that's related to vgetcpu(), but ok ...

In general, if you want to affect placement below the process / shared
memory segment level, you should change the application. Anything else
just results in a big, messy, unreliable and fragile user command-line
interface - a quick look at the respective Irix manpage should make
that clear.

> > Number one applications currently are databases and JVMs. I hope
> > with Wolfram's malloc work it will be useful for more applications
> > too.
>
> If you want this to work for general-purpose applications, then how
> is this new syscall going to help?

It will improve their malloc(). They don't know anything about NUMA,
but getting local memory will help them. They already get local memory
now from the kernel when they use big allocations, but for smaller
allocations it doesn't work, because the kernel can't give out
anything smaller than a page. This would be solved by a NUMA-aware
malloc, but it needs vgetcpu() for this if it should work without
fixed CPU affinity.

Basically it is just extending the existing, already used and proven
default local policy to sub-page granularity. There might be other
uses for it too (like per-CPU data), although I expect most use of
that in user space can already be done using TLS.

JVMs and databases will use it too, but since they often use their own
allocators they will need to be modified.

> If you expect application vendors to code for it, that means few
> users will benefit.
Most applications use malloc().

> >> I really don't think this approach is going to solve the problem.
> >> As Tony also points out, tasks will eventually migrate.
> >
> > Currently we don't solve this problem with the standard heuristics.
> > It can be solved with manual tuning (mempolicy, explicit CPU
> > affinity), but if you're doing that you're already outside the
> > primary use case of vgetcpu().
>
> This is another area where the kernel could do better, by possibly
> using the cpumask to determine where it will allocate memory.

Modify fallback lists based on CPU affinity? That would get messy in
the code because you couldn't easily precompute them anymore. But
cpusets already does this, kind of, even though it has a quite bad
impact on fast paths. Also, what happens if the affinity mask is
modified later? From the semantics point of view it is also a little
dubious to mesh them together. My feeling is that as a heuristic it is
probably dubious. Also, when you set CPU affinity you can just as well
set memory policy with it.

> > vgetcpu() is only trying to be an incremental improvement of the
> > current simple default local policy.
>
> As Tony rightfully pointed out, tasks do migrate. By making this
> guess initially

The gamble is already there in the local policy. No change at all.
When you already got local memory you can use it better with
vgetcpu() though.

From our experience it works out in most cases - in general, most
benchmarks show better performance with a simple local NUMA policy
than SMP mode or no policy. In the cases where it doesn't, you have to
either eat the slowdown or use manual tuning.

> I just use scientific users since that's where I have the most
> recent detailed data from. Databases could well benefit from what I
> mentioned, though the serious ones would want to look into using
> affinity support explicitly in their code.
No, exactly not - I got requests from "serious" databases to offer
vgetcpu() because affinity is too complicated to configure and manage.

It sounds like you want to solve NUMA world hunger here, rather than
concentrate on the specific small incremental improvement vgetcpu() is
trying to offer.

I'm sure there is much research that could be done in the general NUMA
tuning area, but I would suggest making it research with numbers
first, before trying to hack anything like this into the kernel
without a clear understanding.

-Andi
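[Editor's note: the NUMA-aware malloc Andi describes above - per-CPU free lists indexed by a cheap "which CPU am I on?" call - can be sketched as below. sched_getcpu() stands in for the proposed vgetcpu(); the function names are invented, a single size class is assumed for brevity, and the lists are unlocked, so this is an illustration of the idea, not an allocator.]

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

#define MAX_CPUS   1024
#define BLOCK_SIZE 256		/* one size class, for brevity */

/* One free-list head per CPU: blocks freed on a CPU are handed out
 * again on that CPU, so sub-page allocations tend to stay local. */
static void *cpu_pool[MAX_CPUS];

static int cur_cpu(void)
{
	int cpu = sched_getcpu();	/* the role vgetcpu() plays: cheap,
					 * no kernel entry, may be stale */
	return (cpu < 0 || cpu >= MAX_CPUS) ? 0 : cpu;
}

void *numa_alloc_block(void)
{
	void **head = &cpu_pool[cur_cpu()];

	if (*head) {			/* reuse a block freed on this CPU */
		void *block = *head;
		*head = *(void **)block;
		return block;
	}
	return malloc(BLOCK_SIZE);	/* refill from the global allocator */
}

void numa_free_block(void *block)
{
	void **head = &cpu_pool[cur_cpu()];

	*(void **)block = *head;	/* push onto this CPU's free list */
	*head = block;
}
```

A stale CPU number here only costs locality, never correctness - which is exactly why an unsynchronized, possibly-outdated vgetcpu() answer is good enough for this use.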
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 11:17         ` Andi Kleen
@ 2006-06-16 11:58           ` Jes Sorensen
  2006-06-16 12:36             ` Zoltan Menyhart
  2006-06-16 14:54             ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Jes Sorensen @ 2006-06-16 11:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64

Andi Kleen wrote:
>> The current affinity support simply isn't sufficient for that.
>> Placement has to be targeted at launch time, since thread
>> implementations can change the layout etc.
>
> I'm not sure how that's related to vgetcpu(), but ok ...
>
> In general, if you want to affect placement below the process /
> shared memory segment level, you should change the application.

That would be great, except that a lot of these applications are
'standard' applications which people don't write themselves. Sometimes
the source code is no longer available. We could argue that people
should just rewrite their applications, but in reality this isn't
what's happening.

> It will improve their malloc(). They don't know anything about NUMA,
> but getting local memory will help them. They already get local
> memory now from the kernel when they use big allocations, but for
> smaller allocations it doesn't work, because the kernel can't give
> out anything smaller than a page. This would be solved by a
> NUMA-aware malloc, but it needs vgetcpu() for this if it should work
> without fixed CPU affinity.

I really don't see the benefit here. malloc already gets pages handed
down from the kernel which are node-local, due to them being assigned
on a first-touch basis. I am not sure about glibc's malloc internals,
but rather than rely on a vgetcpu() call, all it really needs to do is
keep a thread-local pool which will automatically end up local through
first-touch usage. I don't see how a new syscall is going to provide
anything to malloc that it doesn't already have. What am I missing?
> Basically it is just extending the existing, already used and proven
> default local policy to sub-page granularity. There might be other
> uses for it too (like per-CPU data), although I expect most use of
> that in user space can already be done using TLS.

The thread libraries already have their own thread-local area, which
should be allocated on the thread's own node if done right - which I
assume it is.

> JVMs and databases will use it too, but since they often use their
> own allocators they will need to be modified.

I would assume the real databases to be smart enough to benefit from
first touch already. JVMs... well, who knows; can't say I have a lot
of faith in anything running in a JVM :)

>> If you expect application vendors to code for it, that means few
>> users will benefit.
>
> Most applications use malloc()

Which doesn't need the vgetcpu() call, as far as I can see.

>> This is another area where the kernel could do better, by possibly
>> using the cpumask to determine where it will allocate memory.
>
> Modify fallback lists based on CPU affinity?

It's a hint, not guaranteed placement. You have the same problem if
you try to allocate memory on a node and there's nothing left there.

> But cpusets already does this, kind of, even though it has a quite
> bad impact on fast paths. Also, what happens if the affinity mask is
> modified later? From the semantics point of view it is also a little
> dubious to mesh them together. My feeling is that as a heuristic it
> is probably dubious.

If you migrate your app elsewhere, you should migrate the pages with
it, or not expect things to run with the local effect.

> The gamble is already there in the local policy. No change at all.
> When you already got local memory you can use it better with
> vgetcpu() though.
>
> From our experience it works out in most cases - in general, most
> benchmarks show better performance with a simple local NUMA policy
> than SMP mode or no policy.
Could you share some information about the type of benchmarks?

>> I just use scientific users since that's where I have the most
>> recent detailed data from. Databases could well benefit from what I
>> mentioned, though the serious ones would want to look into using
>> affinity support explicitly in their code.
>
> No, exactly not - I got requests from "serious" databases to offer
> vgetcpu() because affinity is too complicated to configure and
> manage.
>
> It sounds like you want to solve NUMA world hunger here, rather than
> concentrate on the specific small incremental improvement vgetcpu()
> is trying to offer.

I don't really see the point in solving something halfway when it can
be done better. Maybe the "serious" databases should open up and let
us know what the problem is they are hitting.

> I'm sure there is much research that could be done in the general
> NUMA tuning area, but I would suggest making it research with
> numbers first, before trying to hack anything like this into the
> kernel without a clear understanding.

Well, I did spend a good chunk of time looking at some of this a while
ago, and I did speak a lot to one of my colleagues who actually runs
benchmarks using some of these tools to understand the impact. If
anything, it seems that vgetcpu is the one still in the research
stage.

Cheers,
Jes
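[Editor's note: the first-touch behavior Jes relies on above can be illustrated as follows. alloc_local() is a made-up name, and whether the pages actually land on the local node depends on the default local policy being in effect; this is a hedged sketch, not anything from the thread.]

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* mmap() hands back anonymous pages that have no node yet; under the
 * default local policy the kernel picks a node for each page when it
 * is first written.  Faulting the pages in from the thread that will
 * use them therefore makes them node-local - no vgetcpu() needed. */
void *alloc_local(size_t bytes)
{
	void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memset(p, 0, bytes);	/* the "first touch": pages get a node here */
	return p;
}
```

This is exactly the mechanism a thread-local malloc pool exploits: as long as each thread first touches its own pool, the pool is local without the allocator ever asking which CPU it is on.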
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 11:58           ` Jes Sorensen
@ 2006-06-16 12:36             ` Zoltan Menyhart
  2006-06-16 12:41               ` Jes Sorensen
  2006-06-16 14:56               ` Andi Kleen
  2006-06-16 14:54             ` Andi Kleen
  1 sibling, 2 replies; 27+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 12:36 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha,
	vojtech, linux-ia64

Just to make sure I understand it correctly...

Assuming I have allocated per-CPU data (NUMA control, etc.) pointed at
by:

	void *per_cpu[MAXCPUS];

and assuming a per-CPU variable has got an "offset" in each per-CPU
data area, accessing this variable can be done as follows:

	err = vgetcpu(&my_cpu, ...);
	if (err)
		goto ...
	pointer = (typeof(pointer)) (per_cpu[my_cpu] + offset);
	/* use "pointer"... */

It is a hundred times longer than "__get_per_cpu(var)++". As we do not
know when we can be moved to another CPU, vgetcpu() has to be called
again after a "reasonably short" time.

My idea is to map the current task structure at an arch-dependent
virtual address into the user space (obviously RO):

	#define current ((struct task_struct *) 0x...)

No more need for vgetcpu() at all. The example above becomes:

	pointer = (typeof(pointer))
		(per_cpu[current->thread_info.cpu] + offset);
	/* use "pointer"... */

As obtaining "pointer" does not cost much, it can be re-calculated at
each usage => no problem knowing when to recheck it; there is less
chance of using the data of a neighbor.

Regards,

Zoltan Menyhart
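[Editor's note: Zoltan's first fragment, made runnable with sched_getcpu() standing in for the proposed vgetcpu(). The lazy calloc() of the per-CPU areas and the per_cpu_var() name are illustrative additions, not part of his sketch.]

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

#define MAXCPUS 1024

static void *per_cpu[MAXCPUS];	/* one data area per CPU */

/* Return a pointer to the per-CPU variable living at "offset" in the
 * current CPU's data area, or NULL on error.  As in Zoltan's example,
 * the result can be stale the moment we return, if we migrate. */
void *per_cpu_var(size_t offset)
{
	int my_cpu = sched_getcpu();	/* stand-in for vgetcpu() */

	if (my_cpu < 0 || my_cpu >= MAXCPUS)
		return NULL;
	if (!per_cpu[my_cpu])
		per_cpu[my_cpu] = calloc(1, 4096);	/* lazy setup */
	if (!per_cpu[my_cpu])
		return NULL;
	return (char *)per_cpu[my_cpu] + offset;
}
```

The length of this helper, next to the kernel's one-line __get_per_cpu(var)++, is precisely the cost comparison Zoltan is drawing.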
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 12:36             ` Zoltan Menyhart
@ 2006-06-16 12:41               ` Jes Sorensen
  2006-06-16 12:48                 ` Zoltan Menyhart
  0 siblings, 1 reply; 27+ messages in thread
From: Jes Sorensen @ 2006-06-16 12:41 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha,
	vojtech, linux-ia64

Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per-CPU data (NUMA control, etc.) pointed
> at by:

I think you misunderstood - vgetcpu is for userland usage, not within
the kernel.

Cheers,
Jes
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 12:41               ` Jes Sorensen
@ 2006-06-16 12:48                 ` Zoltan Menyhart
  2006-06-16 21:04                   ` Chase Venters
  0 siblings, 1 reply; 27+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 12:48 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Andi Kleen, Tony Luck, discuss, linux-kernel, libc-alpha,
	vojtech, linux-ia64

Jes Sorensen wrote:
> Zoltan Menyhart wrote:
>
>> Just to make sure I understand it correctly...
>> Assuming I have allocated per-CPU data (NUMA control, etc.) pointed
>> at by:
>
> I think you misunderstood - vgetcpu is for userland usage, not within
> the kernel.

I did understand it as user-land stuff. This is why I want to map the
current task structure into the user space. In user code, we could see
the actual value of "current->thread_info.cpu". My "#define current
((struct task_struct *) 0x...)" is not the same as the kernel's one.

Thanks,

Zoltan
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 12:48                 ` Zoltan Menyhart
@ 2006-06-16 21:04                   ` Chase Venters
  0 siblings, 0 replies; 27+ messages in thread
From: Chase Venters @ 2006-06-16 21:04 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Jes Sorensen, Andi Kleen, Tony Luck, discuss, linux-kernel,
	libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Zoltan Menyhart wrote:
> I did understand it as user-land stuff. This is why I want to map the
> current task structure into the user space. In user code, we could
> see the actual value of "current->thread_info.cpu". My "#define
> current ((struct task_struct *) 0x...)" is not the same as the
> kernel's one.

I think it's probably best to leave most of the stuff in task_struct
private (i.e., mapped in kernel only).

Thanks,
Chase
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 12:36             ` Zoltan Menyhart
  2006-06-16 12:41               ` Jes Sorensen
@ 2006-06-16 14:56               ` Andi Kleen
  2006-06-16 15:31                 ` Zoltan Menyhart
  2006-06-16 15:36                 ` Brent Casavant
  1 sibling, 2 replies; 27+ messages in thread
From: Andi Kleen @ 2006-06-16 14:56 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha,
	vojtech, linux-ia64

On Friday 16 June 2006 14:36, Zoltan Menyhart wrote:
> Just to make sure I understand it correctly...
> Assuming I have allocated per-CPU data (NUMA control, etc.) pointed
> at by:
>
>	void *per_cpu[MAXCPUS];

That is not how user-space TLS works. It usually has a base register.

> Assuming a per-CPU variable has got an "offset" in each per-CPU data
> area, accessing this variable can be done as follows:
>
>	err = vgetcpu(&my_cpu, ...);
>	if (err)
>		goto ...
>	pointer = (typeof(pointer)) (per_cpu[my_cpu] + offset);
>	/* use "pointer"... */
>
> It is a hundred times longer than "__get_per_cpu(var)++".

14 cycles is not 100 times longer.

> My idea is to map the current task structure at an arch-dependent
> virtual address into the user space (obviously RO):
>
>	#define current ((struct task_struct *) 0x...)

This means it cannot be cache colored (because you would need a static
offset) and you couldn't share task_structs on a page.

Also, you would make task_struct part of the userland ABI, which seems
like a very very bad idea to me. It means we couldn't change it
anymore.

-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 14:56               ` Andi Kleen
@ 2006-06-16 15:31                 ` Zoltan Menyhart
  2006-06-16 15:37                   ` Andi Kleen
  2006-06-16 21:12                   ` Chase Venters
  1 sibling, 2 replies; 27+ messages in thread
From: Zoltan Menyhart @ 2006-06-16 15:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha,
	vojtech, linux-ia64

Andi Kleen wrote:

> That is not how user-space TLS works. It usually has a base register.

Can you please give me a real-life (simplified) example?

> This means it cannot be cache colored (because you would need a
> static offset) and you couldn't share task_structs on a page.

I do not see the problem. Can you explain please? E.g. the scheduler
pulls in a task other than the current one. The CPU will see the
"current->thread_info.cpu" of every task at the same offset anyway.

> Also, you would make task_struct part of the userland ABI, which
> seems like a very very bad idea to me. It means we couldn't change
> it anymore.

We can make some wrapper, e.g.:

	user_per_cpu_var(name, offset)

"vgetcpu()" would also be added to the ABI, which we couldn't change
easily either.

Thanks,

Zoltan
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 15:31                 ` Zoltan Menyhart
@ 2006-06-16 15:37                   ` Andi Kleen
  2006-06-16 15:58                     ` Jakub Jelinek
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2006-06-16 15:37 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha,
	vojtech, linux-ia64

On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> Andi Kleen wrote:
>
>> That is not how user-space TLS works. It usually has a base
>> register.
>
> Can you please give me a real-life (simplified) example?

On x86-64 it's just %fs:offset. gcc is a bit dumb about this and
usually loads the base address from %fs:0 first.

>> This means it cannot be cache colored (because you would need a
>> static offset) and you couldn't share task_structs on a page.
>
> I do not see the problem.

Your scheme relies on task_struct fields being at a known offset in
the page. But slab cache coloring varies the offset to make the data
spread out better in the caches.

> Can you explain please? E.g. the scheduler pulls in a task other
> than the current one. The CPU will see the
> "current->thread_info.cpu" of every task at the same offset anyway.

It varies relative to the start of the page. That was one of the
bigger wins relative to the task-struct-in-the-stack-page scheme 2.4
had.

>> Also, you would make task_struct part of the userland ABI, which
>> seems like a very very bad idea to me. It means we couldn't change
>> it anymore.
>
> We can make some wrapper, e.g.:
>
>	user_per_cpu_var(name, offset)

You would need to wrap everything, and likely users would like
task_struct so much that they accessed it anyway without your
wrappers.

> "vgetcpu()" would also be added to the ABI, which we couldn't change
> easily either.

Yes, but it's a defined function. No different from a system call.

-Andi
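[Editor's note: what the %fs-relative TLS access Andi mentions looks like from C. A minimal sketch; on x86-64 with direct segment references the increment below compiles to a single %fs-relative instruction, with no call to load the base first.]

```c
#include <pthread.h>

/* Each thread gets its own copy; the compiler addresses it relative
 * to the thread register (%fs on x86-64), so no table lookup and no
 * vgetcpu()-style query is needed to find the right copy. */
static __thread long counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000; i++)
		counter++;		/* touches only this thread's copy */
	return (void *)counter;		/* hand our private total back */
}
```

This is the "most per-CPU use in user space can already be done using TLS" point from earlier in the thread: per-thread data needs no CPU number at all.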
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 15:37                   ` Andi Kleen
@ 2006-06-16 15:58                     ` Jakub Jelinek
  2006-06-16 16:24                       ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Jakub Jelinek @ 2006-06-16 15:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
	libc-alpha, vojtech, linux-ia64

On Fri, Jun 16, 2006 at 05:37:06PM +0200, Andi Kleen wrote:
> On Friday 16 June 2006 17:31, Zoltan Menyhart wrote:
> > Andi Kleen wrote:
> >
> > > That is not how user-space TLS works. It usually has a base
> > > register.
> >
> > Can you please give me a real-life (simplified) example?
>
> On x86-64 it's just %fs:offset. gcc is a bit dumb about this and
> usually loads the base address from %fs:0 first.

GCC is not dumb, unless you force it with -mno-tls-direct-seg-refs.
I guess you are bitten by the SUSE GCC hack which makes
-mno-tls-direct-seg-refs the default (especially on x86-64 it is a
really bad idea).

	Jakub
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 15:58                     ` Jakub Jelinek
@ 2006-06-16 16:24                       ` Andi Kleen
  2006-06-16 16:33                         ` Jakub Jelinek
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2006-06-16 16:24 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
	libc-alpha, vojtech, linux-ia64

On Friday 16 June 2006 17:58, Jakub Jelinek wrote:
> GCC is not dumb, unless you force it with -mno-tls-direct-seg-refs.
> I guess you are bitten by the SUSE GCC hack which makes
> -mno-tls-direct-seg-refs the default (especially on x86-64 it is a
> really bad idea).

Apparently I did indeed. I wonder why it happened on x86-64 though - I
thought there were no negative offsets in x86-64 TLS.

-Andi
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 16:24                       ` Andi Kleen
@ 2006-06-16 16:33                         ` Jakub Jelinek
  0 siblings, 0 replies; 27+ messages in thread
From: Jakub Jelinek @ 2006-06-16 16:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel,
	libc-alpha, vojtech, linux-ia64

On Fri, Jun 16, 2006 at 06:24:52PM +0200, Andi Kleen wrote:
> I wonder why it happened on x86-64 though - I thought there were no
> negative offsets in x86-64 TLS.

It uses negative offsets for __thread vars, and positive ones are
reserved for the implementation (i.e. glibc). But as %fs in 64-bit
programs is just an MSR 0xc0000100 base addition, with no segment
limit, neither Xen nor VMware can play limit tricks with it.

	Jakub
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu()
  2006-06-16 15:31                 ` Zoltan Menyhart
  2006-06-16 15:37                   ` Andi Kleen
@ 2006-06-16 21:12                   ` Chase Venters
  1 sibling, 0 replies; 27+ messages in thread
From: Chase Venters @ 2006-06-16 21:12 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Andi Kleen, Jes Sorensen, Tony Luck, discuss, linux-kernel,
	libc-alpha, vojtech, linux-ia64

On Fri, 16 Jun 2006, Zoltan Menyhart wrote:
> Andi Kleen wrote:
>
>> This means it cannot be cache colored (because you would need a
>> static offset) and you couldn't share task_structs on a page.
>
> I do not see the problem. Can you explain please? E.g. the scheduler
> pulls in a task other than the current one. The CPU will see the
> "current->thread_info.cpu" of every task at the same offset anyway.

Memory maps have to fall on page boundaries for lots of various
reasons. Assuming a 16-word cache line, you've got plenty of spots you
could align task_struct to within a page. (That number of spots is
actually constrained by either sizeof(task_struct) or the number of
colors.) The bottom line is that task_struct won't always be on a page
boundary. If it's not on a page boundary in the physical page frames,
it's not going to be on a page boundary in virtual memory either.

(Note also that if two task_structs shared a page, you'd have an
information leak. I'm not sure, given sizeof(task_struct) and cache
alignment, whether task_structs are small enough for sharing, though.
Definitely on hugepages.)

Thanks,
Chase
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 14:56 ` Andi Kleen 2006-06-16 15:31 ` Zoltan Menyhart @ 2006-06-16 15:36 ` Brent Casavant 2006-06-16 15:40 ` Andi Kleen 1 sibling, 1 reply; 27+ messages in thread From: Brent Casavant @ 2006-06-16 15:36 UTC (permalink / raw) To: Andi Kleen Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64 On Fri, 16 Jun 2006, Andi Kleen wrote: > On Friday 16 June 2006 14:36, Zoltan Menyhart wrote: > > My idea is to map the current task structure at an arch. dependent > > virtual address into the user space (obviously in RO). > > > > #define current ((struct task_struct *) 0x...) > > This means it cannot be cache colored (because you would need a static > offset) and you couldn't share task_structs on a page. > > Also you would make task_struct part of the userland ABI which > seems like a very very bad idea to me. It means we couldn't change > it anymore. To this last point, it might be more reasonable to map in a page that contained a new structure with a stable ABI, which mirrored some of the task_struct information, and likely other useful information as needs are identified in the future. In any case, it would be hard to beat a single memory read for performance. Cache-coloring and kernel bookkeeping effects could be minimized if this was provided as an mmaped page from a device driver, used only by applications which care. This does work somewhat contrary to the idea of getting support into glibc, unless glibc only used this capability when asked to through some sort of environment variable or other run-time configuration. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 15:36 ` Brent Casavant @ 2006-06-16 15:40 ` Andi Kleen 2006-06-16 21:15 ` Chase Venters 2006-06-16 21:19 ` Chase Venters 0 siblings, 2 replies; 27+ messages in thread From: Andi Kleen @ 2006-06-16 15:40 UTC (permalink / raw) To: Brent Casavant Cc: Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64 > To this last point, it might be more reasonable to map in a page that > contained a new structure with a stable ABI, which mirrored some of > the task_struct information, and likely other useful information as > needs are identified in the future. In any case, it would be hard > to beat a single memory read for performance. That would mean making the context switch and possibly other things slower. In general you would need to make a very good case first that all this complexity is worth it. > Cache-coloring and kernel bookkeeping effects could be minimized if this > was provided as an mmaped page from a device driver, used only by > applications which care. I don't see what difference that would make. You would still have the fixed offset problem and doing things on demand often tends to be even more complex. -Andi (who thinks these proposals all sound very messy) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 15:40 ` Andi Kleen @ 2006-06-16 21:15 ` Chase Venters 2006-06-16 21:19 ` Chase Venters 1 sibling, 0 replies; 27+ messages in thread From: Chase Venters @ 2006-06-16 21:15 UTC (permalink / raw) To: Andi Kleen Cc: Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64 On Fri, 16 Jun 2006, Andi Kleen wrote: > >> To this last point, it might be more reasonable to map in a page that >> contained a new structure with a stable ABI, which mirrored some of >> the task_struct information, and likely other useful information as >> needs are identified in the future. In any case, it would be hard >> to beat a single memory read for performance. > > That would mean making the context switch and possibly other > things slower. > > In general you would need to make a very good case first that all this > complexity is worth it. > >> Cache-coloring and kernel bookkeeping effects could be minimized if this >> was provided as an mmaped page from a device driver, used only by >> applications which care. > > I don't see what difference that would make. You would still > have the fixed offset problem and doing things on demand often tends > to be even more complex. > > > -Andi (who thinks these proposals all sound very messy) > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 15:40 ` Andi Kleen 2006-06-16 21:15 ` Chase Venters @ 2006-06-16 21:19 ` Chase Venters 2006-06-16 23:40 ` Brent Casavant 2006-06-17 6:55 ` [discuss] " Andi Kleen 1 sibling, 2 replies; 27+ messages in thread From: Chase Venters @ 2006-06-16 21:19 UTC (permalink / raw) To: Andi Kleen Cc: Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64 (Sorry for the empty reply! Pine over a laggy SSH connection is annoying sometimes) On Fri, 16 Jun 2006, Andi Kleen wrote: > >> To this last point, it might be more reasonable to map in a page that >> contained a new structure with a stable ABI, which mirrored some of >> the task_struct information, and likely other useful information as >> needs are identified in the future. In any case, it would be hard >> to beat a single memory read for performance. > > That would mean making the context switch and possibly other > things slower. Well, if every process had a page of its own, what would the context switch overhead be? But, I'm not advocating exporting anything. Though I sort of like the vgetcpu() idea because I was working on a user-space slab allocator recently and magazines could use vgetcpu() instead of pthread keys. (Also means if threads > cpus I'd get better results). Thanks, Chase ^ permalink raw reply [flat|nested] 27+ messages in thread
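The magazine idea Chase mentions can be sketched against the interface this discussion eventually produced: glibc exposes the vsyscall as `sched_getcpu()`. The returned CPU number is only a placement hint (the thread may migrate right after the call), so a real allocator would still need a per-magazine lock or restartable sequences; the sizes and structure below are invented for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>

#define MAX_CPUS  256
#define MAG_SLOTS 16

/* One magazine (a small cache of free objects) per CPU. */
struct magazine { void *objs[MAG_SLOTS]; int count; };
static struct magazine mags[MAX_CPUS];

static struct magazine *my_magazine(void) {
    int cpu = sched_getcpu();         /* cheap: vDSO, no real syscall */
    if (cpu < 0 || cpu >= MAX_CPUS)   /* old kernel or odd topology:  */
        cpu = 0;                      /* fall back to magazine 0      */
    return &mags[cpu];
}

void *mag_alloc(size_t size) {
    struct magazine *m = my_magazine();
    if (m->count > 0)
        return m->objs[--m->count];   /* fast path: likely CPU-local  */
    return malloc(size);              /* slow path: global allocator  */
}

void mag_free(void *p) {
    struct magazine *m = my_magazine();
    if (m->count < MAG_SLOTS)
        m->objs[m->count++] = p;      /* keep it cached for this CPU  */
    else
        free(p);
}
```

With more threads than CPUs this wins over a pthread-key-per-thread cache exactly as described above, since threads sharing a CPU also share a magazine.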
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 21:19 ` Chase Venters @ 2006-06-16 23:40 ` Brent Casavant 2006-06-17 6:58 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: Brent Casavant @ 2006-06-16 23:40 UTC (permalink / raw) To: Chase Venters Cc: Andi Kleen, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64 On Fri, 16 Jun 2006, Chase Venters wrote: > On Fri, 16 Jun 2006, Andi Kleen wrote: > > > > > > To this last point, it might be more reasonable to map in a page that > > > contained a new structure with a stable ABI, which mirrored some of > > > the task_struct information, and likely other useful information as > > > needs are identified in the future. In any case, it would be hard > > > to beat a single memory read for performance. > > > > That would mean making the context switch and possibly other > > things slower. > > Well, if every process had a page of its own, what would the context switch > overhead be? Mostly copying the useful information into the read-only mapped page. However, this doesn't have to be all that expensive. The particular information we care about in this case only needs to be copied when a task begins running on a CPU different from the one it last ran on. In fact, on ia64 we already have something very similar to handle certain I/O peculiarities on SN2. http://marc.theaimsgroup.com/?l=linux-ia64&m=113831137712197&w=2 That work could form the basis for a low-impact method of exporting the current CPU to user space via a read-only mapped page. I'll admit to having zero knowledge of whether this would be workable on anything other than ia64. Thanks, Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 23:40 ` Brent Casavant @ 2006-06-17 6:58 ` Andi Kleen 0 siblings, 0 replies; 27+ messages in thread From: Andi Kleen @ 2006-06-17 6:58 UTC (permalink / raw) To: Brent Casavant Cc: Chase Venters, Zoltan Menyhart, Jes Sorensen, Tony Luck, discuss, linux-kernel, libc-alpha, vojtech, linux-ia64 > That work could form the basis for a low-impact method of exporting > the current CPU to user space via a read-only mapped page. I'll admit > to having zero knowledge of whether this would be workable on anything > other than ia64. On x86 per CPU mappings are not really feasible. That is because the CPU uses the Linux page tables directly and to change them per CPU you would need to fork them per CPU. That would add so much complications that I don't even want to think them all through ... -andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 21:19 ` Chase Venters 2006-06-16 23:40 ` Brent Casavant @ 2006-06-17 6:55 ` Andi Kleen 2006-06-19 8:42 ` Zoltan Menyhart 1 sibling, 1 reply; 27+ messages in thread From: Andi Kleen @ 2006-06-17 6:55 UTC (permalink / raw) To: discuss Cc: Chase Venters, Brent Casavant, Zoltan Menyhart, Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64 On Friday 16 June 2006 23:19, Chase Venters wrote: > On Fri, 16 Jun 2006, Andi Kleen wrote: > >> To this last point, it might be more reasonable to map in a page that > >> contained a new structure with a stable ABI, which mirrored some of > >> the task_struct information, and likely other useful information as > >> needs are identified in the future. In any case, it would be hard > >> to beat a single memory read for performance. > > > > That would mean making the context switch and possibly other > > things slower. > > Well, if every process had a page of its own, what would the context > switch overhead be? For a process, zero; for a thread, quite high on x86 because you would need per CPU page tables. Doing that would be extremely nasty because you would potentially need to allocate a new set of page tables every time the process is scheduled to a new CPU it hasn't run on before. If you limit it to a process then you can't get the current CPU from such a mapping because a process can run threaded on multiple CPUs. My reference was more to his suggestion of keeping a second version of task_struct for export. That would require changing everything in task struct that is changed on switch_to and should be exported in the other function too. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-17 6:55 ` [discuss] " Andi Kleen @ 2006-06-19 8:42 ` Zoltan Menyhart 2006-06-19 8:54 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: Zoltan Menyhart @ 2006-06-19 8:42 UTC (permalink / raw) To: Andi Kleen Cc: discuss, Chase Venters, Brent Casavant, Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64 Brent Casavant wrote: > To this last point, it might be more reasonable to map in a page that > contained a new structure with a stable ABI, which mirrored some of > the task_struct information, and likely other useful information as > needs are identified in the future. In any case, it would be hard > to beat a single memory read for performance. > > Cache-coloring and kernel bookkeeping effects could be minimized if this > was provided as an mmaped page from a device driver, used only by > applications which care. This does work somewhat contrary to the idea of > getting support into glibc, unless glibc only used this capability when > asked to through some sort of environment variable or other run-time > configuration. Quite O.K. for me. Andi Kleen wrote: >>Well, if every process had a page of its own, what would the context >>switch overhead be? > For process zero, for thread quite high on x86 because you > would need per CPU page tables. Doing that would be extremly > nasty because you would potentially need to allocate a new > set of page tables every time the process is scheduled to a new > CPU it hasn't run on before. Probably I have not explained it correctly: - The "information page" (that includes the current CPU no.) 
is not a per CPU page - This page is just another page that is mapped at a "well known" user virtual address (for those who are interested in) - As you do not do any special action for each user page on context switch, there is nothing to do to this page either - The scheduler sometimes migrates a task, then it updates the current CPU number on the "information page" > My reference was more to his suggestion of keeping a second version > of task_struct for export. That would require changing everything > in task struct that is changed on switch_to and should be exported > in the other function too. It depends on what else can be in this "information page". As for the current CPU no., you need a single store on each task migration. Thanks, Zoltan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-19 8:42 ` Zoltan Menyhart @ 2006-06-19 8:54 ` Andi Kleen 0 siblings, 0 replies; 27+ messages in thread From: Andi Kleen @ 2006-06-19 8:54 UTC (permalink / raw) To: Zoltan Menyhart Cc: discuss, Chase Venters, Brent Casavant, Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64 > Probably I have not explained it correctly: > - The "information page" (that includes the current CPU no.) is not a > per CPU page If it isn't then you can't figure out the current CPU/node for a thread. Anyways I think we're talking past each other. Your approach might even work on ia64 (at least if you're willing to add a lot of cost to the context switch). You presumably could implement vgetcpu() internally with an approach like this (although with IA64's fast EPC calls it seems a bit pointless) It just won't work on x86. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 11:58 ` Jes Sorensen 2006-06-16 12:36 ` Zoltan Menyhart @ 2006-06-16 14:54 ` Andi Kleen 2006-06-20 8:28 ` Jes Sorensen 1 sibling, 1 reply; 27+ messages in thread From: Andi Kleen @ 2006-06-16 14:54 UTC (permalink / raw) To: discuss Cc: Jes Sorensen, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64 > I really don't see the benefit here. malloc already gets pages handed > down from the kernel which are node local due to them being assigned at > a first touch basis. I am not sure about glibc's malloc internals, but > rather rely on a vgetcpu() call, all it really needs to do is to keep > a thread local pool which will automatically get it's thing locally > through first touch usage. That would add too much overhead on small systems. It's better to be able to share the pools. vgetcpu allows that. > > Basically it is just for extending the existing already used proven etc. > > default local policy to sub pages. Also there might be other uses > > of it too (like per CPU data), although I expect most use of that > > in user space can be already done using TLS. > > The thread libraries already have their own thread local area which > should be allocated on the thread's own node if done right, which I > assume it is. - The heap for small allocations is shared (although this can be tuned) - When another thread does free() you need special handling to keep the item in the correct free lists This is one of the tricky bits in the new kernel NUMA slab allocator too. > > But cpusets already does this kind of, even though it has a quite > > bad impact on fast paths. > > Also what happens if the affinity mask is modified later? > > From the high semantics point it is also a little dubious to mesh > > them together. My feeling is that as a heuristic it is probably > > dubious. 
> > If you migrate your app elsewhere, you should migrate the pages with it, > or not expect things to run with the local effect. That's too costly to do by default and you have no guarantee that it will amortize. > I don't really see the point in solving something half way when it can > be done better. Maybe the "serious" databases should open up and let us > know what the problem is they are hitting. I see no indication of anything better so far from you. You only offered static configuration instead which while in some cases is better doesn't work in the general case. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [discuss] Re: FOR REVIEW: New x86-64 vsyscall vgetcpu() 2006-06-16 14:54 ` Andi Kleen @ 2006-06-20 8:28 ` Jes Sorensen 0 siblings, 0 replies; 27+ messages in thread From: Jes Sorensen @ 2006-06-20 8:28 UTC (permalink / raw) To: Andi Kleen Cc: discuss, Tony Luck, linux-kernel, libc-alpha, vojtech, linux-ia64 Andi Kleen wrote: >> I really don't see the benefit here. malloc already gets pages handed >> down from the kernel which are node local due to them being assigned at >> a first touch basis. I am not sure about glibc's malloc internals, but >> rather rely on a vgetcpu() call, all it really needs to do is to keep >> a thread local pool which will automatically get it's thing locally >> through first touch usage. > > That would add too much overhead on small systems. It's better to be > able to share the pools. vgetcpu allows that. How do you expect to be able to share the pools? Or are you saying you just want one page per numa node? Having a page per thread is not noticeable and for databases, which was your primary target usergroup, I think it's fair to say it won't even be visible as noise. >>> Basically it is just for extending the existing already used proven etc. >>> default local policy to sub pages. Also there might be other uses >>> of it too (like per CPU data), although I expect most use of that >>> in user space can be already done using TLS. >> The thread libraries already have their own thread local area which >> should be allocated on the thread's own node if done right, which I >> assume it is. > > - The heap for small allocations is shared (although this can be tuned) > - When another thread does free() you need special handling to keep > the item in the correct free lists > This is one of the tricky bits in the new kernel NUMA slab allocator > too. It should be pretty easy to make the allocator aware of the per thread regions based on the address.
>> If you migrate your app elsewhere, you should migrate the pages with it, >> or not expect things to run with the local effect. > > That's too costly to do by default and you have no guarantee that it will amortize. But if you don't migrate the pages with it, the numa aware allocation is wasted anyway, whether you do it on a first-touch basis or using vgetcpu. >> I don't really see the point in solving something half way when it can >> be done better. Maybe the "serious" databases should open up and let us >> know what the problem is they are hitting. > > I see no indication of anything better so far from you. You only offered > static configuration instead which while in some cases is better > doesn't work in the general case. Static configuration? I never said anything about that, I said that libc should offer a memory pool per thread and have it created when it's first touched by the thread. That solves exactly what you have described so far unless there is something else you also expect to benefit from vgetcpu(). Jes ^ permalink raw reply [flat|nested] 27+ messages in thread
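Jes's per-thread first-touch pool is straightforward to sketch. The sizes and names below are invented, and a real allocator would also need the cross-thread free() handling Andi raises (mapping an address back to its owning pool); the point here is only that the pool's pages are faulted in by the owning thread, so the kernel's default local policy places them on that thread's node.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE (1u << 20)   /* 1 MB per thread, invented number */

static __thread char  *pool;       /* lazily created, so the first  */
static __thread size_t pool_off;   /* touch is by the owning thread */

void *pool_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if (!pool) {
        pool = malloc(POOL_SIZE);
        if (pool)
            memset(pool, 0, POOL_SIZE);    /* the actual first touch */
        pool_off = 0;
    }
    if (!pool || pool_off + n > POOL_SIZE)
        return malloc(n);                  /* pool exhausted: fall back */
    void *p = pool + pool_off;
    pool_off += n;
    return p;
}
```

The bump-pointer body is deliberately simplistic; what matters for the argument is that none of it consults the current CPU at all, which is exactly Jes's objection to needing vgetcpu() for locality.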
end of thread, other threads: [~2006-06-20 8:28 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200606140942.31150.ak@suse.de>
[not found] ` <12c511ca0606151144i140c21e5w90dd948af9b536a4@mail.gmail.com>
[not found] ` <200606160822.23898.ak@suse.de>
2006-06-16 9:48 ` FOR REVIEW: New x86-64 vsyscall vgetcpu() Jes Sorensen
2006-06-16 10:09 ` Andi Kleen
2006-06-16 11:02 ` Jes Sorensen
2006-06-16 11:17 ` Andi Kleen
2006-06-16 11:58 ` Jes Sorensen
2006-06-16 12:36 ` Zoltan Menyhart
2006-06-16 12:41 ` Jes Sorensen
2006-06-16 12:48 ` Zoltan Menyhart
2006-06-16 21:04 ` Chase Venters
2006-06-16 14:56 ` Andi Kleen
2006-06-16 15:31 ` Zoltan Menyhart
2006-06-16 15:37 ` Andi Kleen
2006-06-16 15:58 ` Jakub Jelinek
2006-06-16 16:24 ` Andi Kleen
2006-06-16 16:33 ` Jakub Jelinek
2006-06-16 21:12 ` Chase Venters
2006-06-16 15:36 ` Brent Casavant
2006-06-16 15:40 ` Andi Kleen
2006-06-16 21:15 ` Chase Venters
2006-06-16 21:19 ` Chase Venters
2006-06-16 23:40 ` Brent Casavant
2006-06-17 6:58 ` Andi Kleen
2006-06-17 6:55 ` [discuss] " Andi Kleen
2006-06-19 8:42 ` Zoltan Menyhart
2006-06-19 8:54 ` Andi Kleen
2006-06-16 14:54 ` Andi Kleen
2006-06-20 8:28 ` Jes Sorensen