public inbox for linux-kernel@vger.kernel.org
* More problems in setup_pcpu_remap()
@ 2009-04-02  4:31 David Miller
  2009-04-02  4:42 ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2009-04-02  4:31 UTC (permalink / raw)
  To: tj; +Cc: linux-kernel


The way this code is currently designed, it can exhaust all of the
VMALLOC address space on both x86 and x86_64, and then some.

It allocates PMD_SIZE * num_possible_cpus() of vmalloc space.

PMD_SIZE is 2MB and num_possible_cpus() can be up to 4096,
which can easily exceed (VMALLOC_END - VMALLOC_START).

Initially I had set out to implement sparc64 support for the new
per-cpu stuff, but it looks like I'm stuck finding bugs in the x86
implementation instead :-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More problems in setup_pcpu_remap()
  2009-04-02  4:31 More problems in setup_pcpu_remap() David Miller
@ 2009-04-02  4:42 ` Tejun Heo
  2009-04-02  4:52   ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2009-04-02  4:42 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel

Hello,

David Miller wrote:
> The way this code is currently designed, it can exhaust all of the
> VMALLOC address space on both x86 and x86_64, and then some.
> 
> It allocates PMD_SIZE * num_possible_cpus() of vmalloc space.
> 
> PMD_SIZE is 2MB and num_possible_cpus() can be up to 4096,
> which can easily exceed (VMALLOC_END - VMALLOC_START).
> 
> Initially I had set out to implement sparc64 support for the new
> per-cpu stuff, but it looks like I'm stuck finding bugs in the x86
> implementation instead :-)

Eh... sorry about that.  :-)

I guess we'll have to put a cap on how high possible cpus can be for
the remap allocator, e.g. if the single chunk size is over 20% of the
whole vmalloc area, don't use remap.  Does anyone have a good random %
number in mind?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More problems in setup_pcpu_remap()
  2009-04-02  4:42 ` Tejun Heo
@ 2009-04-02  4:52   ` David Miller
  2009-04-02  5:55     ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2009-04-02  4:52 UTC (permalink / raw)
  To: tj; +Cc: linux-kernel

From: Tejun Heo <tj@kernel.org>
Date: Thu, 02 Apr 2009 13:42:57 +0900

> I guess we'll have to put a cap on how high possible cpus can be for
> the remap allocator, e.g. if the single chunk size is over 20% of the
> whole vmalloc area, don't use remap.  Does anyone have a good random %
> number in mind?

I would suggest instead to rethink what this code is doing.

It would make more sense to carve up 2MB chunks into some-power-of-2
pieces and use that as the unit size.

You could retain the NUMA goals of this function, as well as the
ability to be using 2MB pages in the TLBs.

And consider that if the dynamic allocation part of this code triggers
even once, you'll end up eating twice as much VMALLOC space.

Using 2MB per cpu is just ridiculous, and really not even necessary.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More problems in setup_pcpu_remap()
  2009-04-02  4:52   ` David Miller
@ 2009-04-02  5:55     ` Tejun Heo
  2009-04-02  7:07       ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2009-04-02  5:55 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel

Hello, David.

David Miller wrote:
>> I guess we'll have to put a cap on how high possible cpus can be for
>> the remap allocator, e.g. if the single chunk size is over 20% of the
>> whole vmalloc area, don't use remap.  Does anyone have a good random %
>> number in mind?
> 
> I would suggest instead to rethink what this code is doing.

Actually, I've been looking at the numbers and I'm not sure the
concern is valid.  On x86_32, the practical maximum number of
processors would be around 16, so it will end up using 32MB, which
isn't nice, and it would probably be a good idea to introduce a
parameter to select which allocator to use, but still that's far from
consuming the whole VM area.  On x86_64, the vmalloc area is obscenely
large at 2^45, i.e. 32 terabytes.  Even with 4096 processors, a single
chunk is a measly 0.02%.

If it's a problem for other archs or extreme x86_32 configurations, we
can add some safety measures, but in general I don't think it is a
problem.

> It would make more sense to carve up 2MB chunks into some-power-of-2
> pieces and use that as the unit size.
>
> You could retain the NUMA goals of this function, as well as the
> ability to be using 2MB pages in the TLBs.

Can you please elaborate a bit?

> And consider that if the dynamic allocation part of this code triggers
> even once, you'll end up eating twice as much VMALLOC space.
> 
> Using 2MB per cpu is just ridiculous, and really not even necessary.

The focus at the moment is on using large pages for the first chunk to
reduce TLB pressure, not necessarily on actually using 2MB for each
unit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More problems in setup_pcpu_remap()
  2009-04-02  5:55     ` Tejun Heo
@ 2009-04-02  7:07       ` David Miller
  2009-04-02  7:22         ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2009-04-02  7:07 UTC (permalink / raw)
  To: tj; +Cc: linux-kernel

From: Tejun Heo <tj@kernel.org>
Date: Thu, 02 Apr 2009 14:55:48 +0900

> David Miller wrote:
> > It would make more sense to carve up 2MB chunks into some-power-of-2
> > pieces and use that as the unit size.
> >
> > You could retain the NUMA goals of this function, as well as the
> > ability to be using 2MB pages in the TLBs.
> 
> Can you please elaborate a bit?
> 
> > And consider that if the dynamic allocation part of this code triggers
> > even once, you'll end up eating twice as much VMALLOC space.
> > 
> > Using 2MB per cpu is just ridiculous, and really not even necessary.
> 
> The focus at the moment is on using large pages for the first chunk
> to reduce TLB pressure, not necessarily on actually using 2MB for
> each unit.

You'll get better TLB hit rates with my suggestion.

The idea is to carve up a 2MB page amongst consecutive cpus in
the same NUMA node.

With hyperthreading these cpus will be sharing the TLB, so you'll
get better TLB hit rates than the current code.

I'm going to do something like this on sparc64.  This x86 code is
severely demotivating me, so I'll stop looking at it now :-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More problems in setup_pcpu_remap()
  2009-04-02  7:07       ` David Miller
@ 2009-04-02  7:22         ` Tejun Heo
  0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2009-04-02  7:22 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel

Hello, David.

David Miller wrote:
>>> Using 2MB per cpu is just ridiculous, and really not even necessary.
>> The focus at the moment is on using large pages for the first chunk
>> to reduce TLB pressure, not necessarily on actually using 2MB for
>> each unit.
> 
> You'll get better TLB hit rates with my suggestion.
> 
> The idea is to carve up a 2MB page amongst consecutive cpus in
> the same NUMA node.
> 
> With hyperthreading these cpus will be sharing the TLB, so you'll
> get better TLB hit rates than the current code.

That sounds great.

> I'm going to do something like this on sparc64.  This x86 code is
> severely demotivating me, so I'll stop looking at it now :-)

Yeah, please go ahead and write severely motivating code.  I'll be
very motivated to apply it to x86.  :-)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-04-02  7:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-02  4:31 More problems in setup_pcpu_remap() David Miller
2009-04-02  4:42 ` Tejun Heo
2009-04-02  4:52   ` David Miller
2009-04-02  5:55     ` Tejun Heo
2009-04-02  7:07       ` David Miller
2009-04-02  7:22         ` Tejun Heo
