* idle task's task_t allocation on NUMA machines
@ 2005-08-18 14:08 Samuel Thibault
2005-08-18 15:18 ` Eric Dumazet
2005-08-18 18:27 ` Robin Holt
0 siblings, 2 replies; 9+ messages in thread
From: Samuel Thibault @ 2005-08-18 14:08 UTC (permalink / raw)
To: linux-kernel, lse-tech
Hi,
Currently, the task_t structure of the idle task is always allocated
on CPU0, hence on node 0: while booting, for each CPU, CPU 0 calls
fork_idle(), hence copy_process(), hence dup_task_struct(), hence
alloc_task_struct(), hence kmem_cache_alloc(), which picks up memory
from the allocation cache of the current CPU, i.e. on node 0.
This is a bad idea: every write needs to be written back to node 0 at
some point, so node 0 can get a bit busy, especially when other nodes
are idle.
A solution would be to add to copy_process(), dup_task_struct(),
alloc_task_struct() and kmem_cache_alloc() the node number on which
allocation should be performed. This might also be useful if performing
node load balancing at fork(): one could then allocate task_t directly
on the new node. It might also be useful when allocating data for
another node.
Regards,
Samuel
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: idle task's task_t allocation on NUMA machines
From: Eric Dumazet @ 2005-08-18 15:18 UTC (permalink / raw)
To: Samuel Thibault; +Cc: linux-kernel, lse-tech
Samuel Thibault a écrit :
> Hi,
>
> Currently, the task_t structure of the idle task is always allocated
> on CPU0, hence on node 0: while booting, for each CPU, CPU 0 calls
> fork_idle(), hence copy_process(), hence dup_task_struct(), hence
> alloc_task_struct(), hence kmem_cache_alloc(), which picks up memory
> from the allocation cache of the current CPU, i.e. on node 0.
>
> This is a bad idea: every write needs to be written back to node 0 at
> some point, so node 0 can get a bit busy, especially when other nodes
> are idle.
>
> A solution would be to add to copy_process(), dup_task_struct(),
> alloc_task_struct() and kmem_cache_alloc() the node number on which
> allocation should be performed. This might also be useful if performing
> node load balancing at fork(): one could then allocate task_t directly
> on the new node. It might also be useful when allocating data for
> another node.
>
> Regards,
> Samuel
An idle task should block itself, hence not touching its task_t structure very much.
I believe IRQ stacks are also allocated on node 0, that seems more serious.
Eric
* Re: idle task's task_t allocation on NUMA machines
From: Samuel Thibault @ 2005-08-18 15:39 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, lse-tech
Eric Dumazet, le Thu 18 Aug 2005 17:18:55 +0200, a écrit :
> An idle task should block itself, hence not touching its task_t structure
> very much.
Indeed, but I guess there are a lot of such little optimizations here
and there that could be relatively easily fixed, for a not-so-little
benefit.
> I believe IRQ stacks are also allocated on node 0, that seems more serious.
Such as this :)
Regards,
Samuel
* Re: idle task's task_t allocation on NUMA machines
From: Christoph Lameter @ 2005-08-18 17:20 UTC (permalink / raw)
To: Samuel Thibault; +Cc: Eric Dumazet, linux-kernel, lse-tech
On Thu, 18 Aug 2005, Samuel Thibault wrote:
> Indeed, but I guess there are a lot of such little optimizations here
> and there that could be relatively easily fixed, for a not-so-little
> benefit.
Get on it :-) I hope the kmalloc_node stuff etc. that was recently added is
enough for most structures. Note that there is a new rev of the slab
allocator in Andrew's tree that will make kmalloc_node as fast as kmalloc.
* Re: idle task's task_t allocation on NUMA machines
From: Robin Holt @ 2005-08-18 18:27 UTC (permalink / raw)
To: Samuel Thibault, linux-kernel, lse-tech; +Cc: Jack Steiner
On Thu, Aug 18, 2005 at 04:08:29PM +0200, Samuel Thibault wrote:
> A solution would be to add to copy_process(), dup_task_struct(),
> alloc_task_struct() and kmem_cache_alloc() the node number on which
> allocation should be performed. This might also be useful if performing
> node load balancing at fork(): one could then allocate task_t directly
> on the new node. It might also be useful when allocating data for
> another node.
Can this be abstracted some?
Let me start with some background. On our previous kernel release, SGI
made a kernel addition we called dplace. It has a userland piece and a
library which pass configuration information into a kernel driver.
Inside the kernel, we used Process Aggregates (pagg as found on
oss.sgi.com) to track children of a starting process and migrate them
to a desired cpu.
The problem we have with this method is that the callout to pagg happens
far too late after fork() to help with some of the more important
user structures, like page tables. We find that most processes have
their pgd and many parts of the pmd allocated remotely. Although it
is not a significant source of NUMA traffic, it does cause variability
in process run times which becomes exaggerated on larger MPI jobs which
rendezvous at a barrier.
It would be nice to be able to decide, early in fork(), on a destination
NUMA node and CPU list for the task. If this were done, then changing
allocation of structures like the task_t and page tables could be handled
on a case-by-case basis as we see benefit. Additionally, it would be
nice if we could make the placement decision logic provide a callback
so we could add tailored placement.
I realize this is a very vague sketch of what I think needs to be done,
but I am sort of in a rush right now and wanted to at least start the
discussion.
Thanks,
Robin
* Re: idle task's task_t allocation on NUMA machines
From: Samuel Thibault @ 2005-08-18 19:49 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, lse-tech
Eric Dumazet, le Thu 18 Aug 2005 17:18:55 +0200, a écrit :
> I believe IRQ stacks are also allocated on node 0, that seems more serious.
For the i386 architecture at least, yes: they are statically defined in
arch/i386/kernel/irq.c, while they could be per_cpu.
Regards,
Samuel
* Re: idle task's task_t allocation on NUMA machines
From: Samuel Thibault @ 2005-08-18 20:02 UTC (permalink / raw)
To: Eric Dumazet, linux-kernel, lse-tech
Samuel Thibault, le Thu 18 Aug 2005 21:49:41 +0200, a écrit :
> Eric Dumazet, le Thu 18 Aug 2005 17:18:55 +0200, a écrit :
> > I believe IRQ stacks are also allocated on node 0, that seems more serious.
>
> For the i386 architecture at least, yes: they are statically defined in
> arch/i386/kernel/irq.c, while they could be per_cpu.
Hum, but the per_cpu areas for i386 are not numa-aware... I'm wondering:
isn't the current x86_64 numa-aware implementation of per_cpu generic
enough for any architecture?
Regards,
Samuel
* Re: [Lse-tech] Re: idle task's task_t allocation on NUMA machines
From: Martin J. Bligh @ 2005-08-18 21:28 UTC (permalink / raw)
To: Samuel Thibault, Eric Dumazet, linux-kernel, lse-tech
--On Thursday, August 18, 2005 22:02:55 +0200 Samuel Thibault <samuel.thibault@ens-lyon.org> wrote:
> Samuel Thibault, le Thu 18 Aug 2005 21:49:41 +0200, a écrit :
>> Eric Dumazet, le Thu 18 Aug 2005 17:18:55 +0200, a écrit :
>> > I believe IRQ stacks are also allocated on node 0, that seems more serious.
>>
>> For the i386 architecture at least, yes: they are statically defined in
>> arch/i386/kernel/irq.c, while they could be per_cpu.
>
> Hum, but the per_cpu areas for i386 are not numa-aware... I'm wondering:
> isn't the current x86_64 numa-aware implementation of per_cpu generic
> enough for any architecture?
All ZONE_NORMAL on ia32 is on node 0, so I don't think it'll help.
M.
* Re: [Lse-tech] Re: idle task's task_t allocation on NUMA machines
From: Andi Kleen @ 2005-08-18 21:32 UTC (permalink / raw)
To: Samuel Thibault, Eric Dumazet, linux-kernel, lse-tech
On Thu, Aug 18, 2005 at 10:02:55PM +0200, Samuel Thibault wrote:
> Samuel Thibault, le Thu 18 Aug 2005 21:49:41 +0200, a écrit :
> > Eric Dumazet, le Thu 18 Aug 2005 17:18:55 +0200, a écrit :
> > > I believe IRQ stacks are also allocated on node 0, that seems more serious.
> >
> > For the i386 architecture at least, yes: they are statically defined in
> > arch/i386/kernel/irq.c, while they could be per_cpu.
>
> Hum, but the per_cpu areas for i386 are not numa-aware... I'm wondering:
> isn't the current x86_64 numa-aware implementation of per_cpu generic
> enough for any architecture?
Actually it's broken for many x86-64 configurations now that use SRAT because
we assign the nodes to CPUs only after this code runs. I was considering
removing it.
-Andi