* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-07 12:13 UTC
To: Ingo Molnar
Cc: travis, Rusty Russell, Linux Kernel Mailing List, H. Peter Anvin,
    Andrew Morton, Eric Biederman, steiner, Hugh Dickins

(cc'ing people from the original thread and LKML as it seems to require
actual discussion.)

Hello, this thread started with me asking for help regarding the
zero-based percpu patches; the initial message is quoted below.

Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
>
>> Hello, Mike, Ingo.
>>
>> I was working on something which requires better dynamic per-cpu
>> performance and had been implementing it myself, but I ran into the
>> strange gcc stack protector ABI limitation; with Rusty's hint and
>> some googling I found out that Mike had already done the heavy
>> lifting.
>>
>> I read the "x86_64: Optimize percpu accesses" thread from July last
>> year, and it looks like it got stuck on a toolchain problem which
>> showed up as two symptoms (is one of the two resolved?):
>>
>> * Notifier call chain corruption
>>
>> * Stack overflow with the default stack size
>>
>> From the cpu_alloc thread from November, it seems Mike is quite
>> preoccupied, so I'm willing to give it a shot as it's blocking stuff
>> I have in queue. The problem is that I'm having trouble finding some
>> information:
>>
>> 1. Mike seems to have split the patch but hasn't posted the pieces.
>>
>> 2. Ingo's x86/percpu-zerobased branch doesn't contain any revision
>>    that isn't in the current upstream. Maybe the commits got lost
>>    during merges?
>>
>> 3. What failed, what got fixed, and how to reproduce the problem.
>>
>> So, can you please help me a bit? I'll be happy to forward-port the
>> patches if they have bit-rotted.
>
> hm, i zapped them two days ago, because they collided with Rusty's
> ongoing percpu-alloc work in his tree. Mike should be able to tell you
> what the plans are for the resurrection of those patches.

IIUC, Rusty is somewhat leaning toward limiting the per-cpu area and
using a static allocator. (right?) As I was trying to do more stuff
per-cpu (not putting a lot of stuff into the per-cpu area, but even
with small things a limited per-cpu area poses scalability problems),
cpu_alloc seems to fit the bill better.

Anyways, I think it's worthwhile to listen to what people have in mind
regarding how per-cpu stuff should proceed.

Thanks.

--
tejun
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-10 6:46 UTC
To: Tejun Heo
Cc: Ingo Molnar, travis, Linux Kernel Mailing List, H. Peter Anvin,
    Andrew Morton, Eric Biederman, steiner, Hugh Dickins, Christoph Lameter

On Wednesday 07 January 2009 22:43:25 Tejun Heo wrote:
> IIUC, Rusty is somewhat leaning toward limiting the per-cpu area and
> using a static allocator. (right?)

Not quite. Six years ago I didn't do "proper" dynamic per-cpu because
of this lack-of-expanding problem. I expected that I (or someone else)
would fix that and the current temporary solution would be replaced.
But Christoph showed that even in a limited form it can be used for
more than static per-cpu vars and such vars in modules. (It's also in
dire need of a cleanup, since there have been several abortive changes
made in the last few years.)

> As I was trying to do more stuff per-cpu (not putting a lot of stuff
> into the per-cpu area, but even with small things a limited per-cpu
> area poses scalability problems), cpu_alloc seems to fit the bill
> better.

Unfortunately cpu_alloc didn't solve this problem either.

We need to grow the areas, but for NUMA layouts it's non-trivial. I
don't like the idea of remapping: one TLB entry per page per cpu is
going to suck. Finding pages which are "congruent" with the original
percpu pages is more promising, but it will almost certainly need to
elbow pages out of the way to have a chance of succeeding on a real
system.

> Anyways, I think it's worthwhile to listen to what people have in
> mind regarding how per-cpu stuff should proceed.

Absolutely.

Thanks,
Rusty.
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-12 17:23 UTC
To: Rusty Russell
Cc: Tejun Heo, Ingo Molnar, travis, Linux Kernel Mailing List,
    H. Peter Anvin, Andrew Morton, Eric Biederman, steiner, Hugh Dickins

On Sat, 10 Jan 2009, Rusty Russell wrote:
> Unfortunately cpu_alloc didn't solve this problem either.
>
> We need to grow the areas, but for NUMA layouts it's non-trivial. I
> don't like the idea of remapping: one TLB entry per page per cpu is
> going to suck. Finding pages which are "congruent" with the original
> percpu pages is more promising, but it will almost certainly need to
> elbow pages out of the way to have a chance of succeeding on a real
> system.

An allocation automatically falls back to the nearest node on NUMA;
cpu_to_node() gives you the current node.

There are 2M TLB entries on x86_64. If we really get into a high-usage
scenario then the 2M entry makes sense. Average server memory sizes
are likely already way beyond 10G per box. The higher that goes, the
more reasonable the 2M TLB entry becomes.
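Node-local fallback needs no special machinery in the allocator; a
minimal sketch of grabbing a page close to a given cpu (GFP flags and
error handling pared down):

#include <linux/gfp.h>
#include <linux/topology.h>

/*
 * Allocate one page as close as possible to the cpu that will use it.
 * alloc_pages_node() falls back to the nearest node with free memory
 * if the requested node is exhausted.
 */
static struct page *alloc_percpu_page(int cpu)
{
        return alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
}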
* Re: regarding the x86_64 zero-based percpu patches
From: Eric W. Biederman @ 2009-01-12 17:44 UTC
To: Christoph Lameter
Cc: Rusty Russell, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Christoph Lameter <cl@linux-foundation.org> writes:
> On Sat, 10 Jan 2009, Rusty Russell wrote:
>> We need to grow the areas, but for NUMA layouts it's non-trivial. I
>> don't like the idea of remapping: one TLB entry per page per cpu is
>> going to suck. [...]
>
> An allocation automatically falls back to the nearest node on NUMA;
> cpu_to_node() gives you the current node.
>
> There are 2M TLB entries on x86_64. If we really get into a high-usage
> scenario then the 2M entry makes sense. Average server memory sizes
> are likely already way beyond 10G per box. The higher that goes, the
> more reasonable the 2M TLB entry becomes.

2M of per cpu data doesn't make sense, and likely indicates a design
flaw somewhere. It just doesn't make sense to have large amounts of
data allocated per cpu.

The most common user of per cpu data I am aware of is allocating one
word per cpu for counters.

What would be better is simply to:
- Require a lock to access another cpu's per cpu data.
- Do large-page allocations for the per cpu data.

At which point we could grow the per cpu data by simply reallocating
it on each cpu and updating the register that holds the base pointer.

Eric
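To make the reallocation idea concrete, a rough sketch under the
assumption that all cpus are quiesced (e.g. via stop_machine()) while
the copy runs; pcpu_base[] and set_percpu_base() are hypothetical
stand-ins, not an existing API:

#include <linux/slab.h>
#include <linux/cpumask.h>
#include <linux/string.h>
#include <linux/topology.h>

/* hypothetical: per-cpu base pointers, one per possible cpu */
extern void *pcpu_base[NR_CPUS];
/* hypothetical: rewrite the base register used for per cpu
 * addressing on the given cpu (the %gs base on x86_64) */
extern void set_percpu_base(int cpu, void *base);

static int grow_percpu_areas(size_t old_size, size_t new_size)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                void *new = kmalloc_node(new_size, GFP_KERNEL,
                                         cpu_to_node(cpu));
                if (!new)
                        return -ENOMEM; /* a real version would unwind */
                memcpy(new, pcpu_base[cpu], old_size);
                memset(new + old_size, 0, new_size - old_size);
                kfree(pcpu_base[cpu]);
                pcpu_base[cpu] = new;
                set_percpu_base(cpu, new);
        }
        return 0;
}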
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-12 19:00 UTC
To: Eric W. Biederman
Cc: Rusty Russell, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Mon, 12 Jan 2009, Eric W. Biederman wrote:
> 2M of per cpu data doesn't make sense, and likely indicates a design
> flaw somewhere. It just doesn't make sense to have large amounts of
> data allocated per cpu.

Some data is not small. MIB data is allocated per cpu, etc.

> What would be better is simply to:
> - Require a lock to access another cpu's per cpu data.
> - Do large-page allocations for the per cpu data.
>
> At which point we could grow the per cpu data by simply reallocating
> it on each cpu and updating the register that holds the base pointer.

If per cpu data areas have no fixed address, then you cannot use list
operations on per cpu data, nor can the address of per cpu variables
be stored anywhere. But maybe that is okay?
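To make the list problem concrete, a small made-up example (not from
the thread) of the kind of code that pins the area in place:

#include <linux/list.h>
#include <linux/percpu.h>

/* assume INIT_LIST_HEAD() has been run on each cpu's copy */
static DEFINE_PER_CPU(struct list_head, local_work);

/*
 * list_add() stores the absolute address of the per-cpu list head in
 * the new node's ->prev and ->next.  If the per cpu area is later
 * reallocated and copied, every queued node still points at the old
 * location -- exactly the "no fixed address" problem above.
 */
static void queue_local(struct list_head *node)
{
        list_add(node, &get_cpu_var(local_work));
        put_cpu_var(local_work);
}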
* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-13 0:33 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Hello, Eric.

Eric W. Biederman wrote:
>> There are 2M TLB entries on x86_64. If we really get into a
>> high-usage scenario then the 2M entry makes sense. Average server
>> memory sizes are likely already way beyond 10G per box. The higher
>> that goes, the more reasonable the 2M TLB entry becomes.
>
> 2M of per cpu data doesn't make sense, and likely indicates a design
> flaw somewhere. It just doesn't make sense to have large amounts of
> data allocated per cpu.

Why? On almost all large machines I've seen or heard of, memory size
scales way better than the number of cpus. Whether a certain usage
makes sense is surely debatable, but I can't imagine that all the use
cases where a 2MB percpu TLB entry could be useful would be senseless.

> The most common user of per cpu data I am aware of is allocating one
> word per cpu for counters.
>
> What would be better is simply to:
> - Require a lock to access another cpu's per cpu data.
> - Do large-page allocations for the per cpu data.
>
> At which point we could grow the per cpu data by simply reallocating
> it on each cpu and updating the register that holds the base pointer.

I don't think moving live objects is such a good idea, for the
following reasons.

1. Programming convenience is usually much more important than people
   think it is, even in the kernel. I think it's very likely that
   we'll have an unending stream of small feature requirements which
   step just outside the supported bounds, and ever-smarter
   workarounds, until the restriction is finally removed years later.

2. Moving live objects is inherently dangerous, and it won't happen
   often. Thinking about the possible subtle bugs is scary.

Thanks.

--
tejun
* Re: regarding the x86_64 zero-based percpu patches
From: Eric W. Biederman @ 2009-01-13 3:01 UTC
To: Tejun Heo
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Tejun Heo <tj@kernel.org> writes:
>> 2M of per cpu data doesn't make sense, and likely indicates a design
>> flaw somewhere. It just doesn't make sense to have large amounts of
>> data allocated per cpu.
>
> Why? On almost all large machines I've seen or heard of, memory size
> scales way better than the number of cpus. Whether a certain usage
> makes sense is surely debatable, but I can't imagine that all the use
> cases where a 2MB percpu TLB entry could be useful would be senseless.

Right, there are cases where you could hit 2MB, but they aren't likely
to be that common.

In particular, the common case is to allocate a single word of per cpu
data per allocation request. Getting to 2MB with 8-byte requests takes
262144 distinct allocations, which is a lot more than I expect to be
common any time soon.

So I figure reserving a 2MB TLB entry is not likely what we want in
the common case.

> I don't think moving live objects is such a good idea, for the
> following reasons.
>
> 1. Programming convenience is usually much more important than people
>    think it is, even in the kernel. [...]
>
> 2. Moving live objects is inherently dangerous, and it won't happen
>    often. Thinking about the possible subtle bugs is scary.

But the question is: what is per cpu memory? Per cpu memory is
something we can access quickly without creating cross-cpu cache line
contention.

Accessing that memory from other cpus implies we create that
contention, and that will be the slow path. We need cross-cpu access
for the rollup of statistics, but we clearly don't want to do it
often.

So I expect that most places where we want to store a pointer to per
cpu data will be bugs.

Per cpu memory is not something we ever want to use lightly, so as
long as the rules are clear we should be ok. And simply removing the
ability to take the address of per cpu data would make it impossible
to point into it. So I think it is worth a look, to see if we can move
live per cpu data, as that noticeably simplifies the problem of
growing a per cpu area in the rare case when we need to.

Eric
* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-13 3:14 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Hello, Eric.

Eric W. Biederman wrote:
> Right, there are cases where you could hit 2MB, but they aren't
> likely to be that common.
> [...]
> So I figure reserving a 2MB TLB entry is not likely what we want in
> the common case.

Yeap, it probably won't hit 2MB in common cases, but it still needs to
scale for uncommon cases, and if the 4K TLB pressure becomes too high
for those, promoting to a 2MB TLB entry makes sense. IIUC, that's what
Christoph Lameter is intending to do (haven't looked at the code yet
tho).

> But the question is: what is per cpu memory? Per cpu memory is
> something we can access quickly without creating cross-cpu cache line
> contention.
>
> [...]
>
> Per cpu memory is not something we ever want to use lightly, so as
> long as the rules are clear we should be ok. And simply removing the
> ability to take the address of per cpu data would make it impossible
> to point into it. So I think it is worth a look, to see if we can
> move live per cpu data, as that noticeably simplifies the problem of
> growing a per cpu area in the rare case when we need to.

I don't know. I think it's a dangerous thing which can be avoided. If
there's no other solution, then we might have to live with it, but I
don't see the winning benefit of such a design over per-cpu virtual
mapping.

Thanks.

--
tejun
* Re: regarding the x86_64 zero-based percpu patches
From: Eric W. Biederman @ 2009-01-13 4:07 UTC
To: Tejun Heo
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Tejun Heo <tj@kernel.org> writes:
> I don't know. I think it's a dangerous thing which can be avoided. If
> there's no other solution, then we might have to live with it, but I
> don't see the winning benefit of such a design over per-cpu virtual
> mapping.

It isn't incompatible with a per-cpu virtual mapping. It allows the
possibility of each cpu reusing the same chunk of virtual address
space for per cpu memory.

On x86_64 and other architectures with enough address space bits, it
allows us to share the large pages that we use for the normal memory
mapping with the ones for per cpu access.

I definitely think the work of combining the pda and the percpu areas
into a common area is worthwhile.

I think it would be nice if the percpu area could grow rather than
being a fixed size at boot time, but I'm not particularly convinced it
has to.

Eric
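For flavor, this is roughly what a combined, zero-based pda/percpu
access looks like on x86_64 -- a simplified sketch of the
segment-relative read (register-sized variables only; the real
patches differ in detail):

/*
 * Simplified sketch of a zero-based, %gs-relative per-cpu read.
 * Each cpu runs with its %gs base pointing at its own area, so the
 * same zero-based offset resolves to a different per-cpu location on
 * each cpu, with no address arithmetic and no extra TLB footprint.
 */
#define percpu_read(var)                        \
({                                              \
        typeof(var) ret__;                      \
        asm("mov %%gs:%1, %0"                   \
            : "=r" (ret__)                      \
            : "m" (var));                       \
        ret__;                                  \
})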
* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-14 3:58 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Hello, Eric.

Eric W. Biederman wrote:
> It isn't incompatible with a per-cpu virtual mapping. It allows the
> possibility of each cpu reusing the same chunk of virtual address
> space for per cpu memory.
>
> On x86_64 and other architectures with enough address space bits, it
> allows us to share the large pages that we use for the normal memory
> mapping with the ones for per cpu access.
>
> I definitely think the work of combining the pda and the percpu areas
> into a common area is worthwhile.

Yeah, it's gonna be necessary regardless of which way we go.

> I think it would be nice if the percpu area could grow rather than
> being a fixed size at boot time, but I'm not particularly convinced
> it has to.

The main problem is that the area needs to be congruent, which
basically mandates that it be contiguous. The three alternatives on
the table are... (a sketch of the second follows this list)

1. Just reserve memory from the get-go. Simplest. No additional TLB
   pressure, but memory is likely to be wasted and, more importantly,
   scalability suffers.

2. Reserve address space and map memory as necessary. We can be much
   more generous about reserving address space, especially on 64bit
   machines, and can probably mostly forget about the scalability
   issue there. However, getting things just right for
   address-space-constrained 32bit might not be too easy, but then
   again nothing really is scalable on 32bit these days, so we can
   probably live with a boot-time parameter or something. Another
   issue is added TLB pressure, as it's likely to consume 4K TLB
   entries in addition to the default kernel mapping's 2M TLB entries.
   The TLB pressure could mostly be avoided if the percpu area were
   sufficiently large to justify 2MB page allocation, but it isn't.

3. Do realloc(). This doesn't impose scalability issues or add to TLB
   pressure, but it does constrain how the percpu variables can be
   used and introduces the possibility of scary, once-in-a-blue-moon,
   never-reproducible bugs. Maybe that possibility can be reduced by
   putting some restrictions on the interface, but I don't know. It
   still scares me.

Hmm... IIUC, the biggest drawback of #2 is the added TLB pressure,
right? What if we reserve percpu allocations in 2MB chunks? ie. use
4k mappings but always allocate the percpu pages from aligned 2MB
chunks. That way it won't waste 2MB per cpu, and although it will use
additional 4K TLB entries, it will free up 2MB TLB entries.

Thanks.

--
tejun
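A rough shape for alternative 2, with a hypothetical map_range()
standing in for the real page-table plumbing (everything here is a
sketch, not an existing interface):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/topology.h>

/* hypothetical: install nr pages at vaddr in the kernel page tables */
extern int map_range(void *vaddr, struct page **pages, int nr);

/*
 * Populate [off, off + size) of each cpu's reserved address range
 * with node-local 4K pages.  Congruency holds because every cpu uses
 * the same offset into its own reserved region.
 */
static int pcpu_populate(void **cpu_base, size_t off, size_t size)
{
        int cpu, i, nr = PAGE_ALIGN(size) >> PAGE_SHIFT;

        for_each_possible_cpu(cpu) {
                for (i = 0; i < nr; i++) {
                        struct page *page;

                        page = alloc_pages_node(cpu_to_node(cpu),
                                                GFP_KERNEL, 0);
                        if (!page)
                                return -ENOMEM; /* real code would unwind */
                        if (map_range(cpu_base[cpu] + off + i * PAGE_SIZE,
                                      &page, 1))
                                return -ENOMEM;
                }
        }
        return 0;
}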
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-15 1:47 UTC
To: Tejun Heo
Cc: Eric W. Biederman, Christoph Lameter, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Wednesday 14 January 2009 14:28:56 Tejun Heo wrote:
> The main problem is that the area needs to be congruent, which
> basically mandates that it be contiguous.

I want to explore this assumption a little.

Logically, yes: if 50% of pages are free and we have 4096 cpus, the
chance that a page is free on all CPUs is 1 in 2^4095. But maybe such
systems are fine with 2M pages for per-cpu areas at boot?

And can page-mobility tricks help us make the odds reasonable here?
Only allowing movable pages in our expansion-of-percpu area?

Thanks,
Rusty.
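Spelling the arithmetic out (one way to read it): if a fraction f of
page frames is free on each cpu independently, one particular
congruent frame is free on all N cpus with probability f^N; fixing a
free frame on the first cpu still needs the matching frame free on the
other N-1, which at f = 1/2 and N = 4096 gives the 1 in 2^4095 above.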
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-15 1:49 UTC
To: Eric W. Biederman
Cc: Tejun Heo, Christoph Lameter, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Tuesday 13 January 2009 14:37:38 Eric W. Biederman wrote:
> It isn't incompatible with a per-cpu virtual mapping. It allows the
> possibility of each cpu reusing the same chunk of virtual address
> space for per cpu memory.

This can be done (IA64 does it today), but it's not generically
useful. You can use it to frob a few simple values, but it means you
can't store any pointers, and that just doesn't fly in general kernel
code.

> I think it would be nice if the percpu area could grow rather than
> being a fixed size at boot time, but I'm not particularly convinced
> it has to.

I used to be convinced it had to grow, but Christoph showed otherwise.
Nonetheless, it's an annoying restriction which is going to bite us in
the ass repeatedly as coders use per_cpu on random sizes.

Rusty.
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-15 20:26 UTC
To: Rusty Russell
Cc: Eric W. Biederman, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Thu, 15 Jan 2009, Rusty Russell wrote:
> On Tuesday 13 January 2009 14:37:38 Eric W. Biederman wrote:
> > It isn't incompatible with a per-cpu virtual mapping. It allows the
> > possibility of each cpu reusing the same chunk of virtual address
> > space for per cpu memory.
>
> This can be done (IA64 does it today), but it's not generically
> useful. You can use it to frob a few simple values, but it means you
> can't store any pointers, and that just doesn't fly in general kernel
> code.

Well, if we can have some surety that we are not going to store
pointers to percpu data anywhere, then this would work.

> I used to be convinced it had to grow, but Christoph showed
> otherwise. Nonetheless, it's an annoying restriction which is going
> to bite us in the ass repeatedly as coders use per_cpu on random
> sizes.

Not exactly. I implemented a minimal version that had only limited
use. I was fully intending to add further bloat at the end to make
the percpu areas dynamically extendable. Most of the early cpu_alloc
patchsets already include that code.
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-15 1:34 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Tuesday 13 January 2009 04:14:58 Eric W. Biederman wrote:
> 2M of per cpu data doesn't make sense, and likely indicates a design
> flaw somewhere. It just doesn't make sense to have large amounts of
> data allocated per cpu.
>
> The most common user of per cpu data I am aware of is allocating one
> word per cpu for counters.

This is why I did a brief audit. Here it is:

With x86/32 allyesconfig (trimmed a little, until it booted under kvm)
we have 37148 bytes of static percpu data, and 117228 bytes of dynamic
percpu data.

File and line                       Number   Size   Total
net/ipv4/af_inet.c:1287                 21   2048   43008
net/ipv4/af_inet.c:1290                 21   2048   43008
kernel/workqueue.c:819                  72    128    9216
net/ipv4/af_inet.c:1287                 48    128    6144
net/ipv4/af_inet.c:1290                 48    128    6144
net/ipv4/route.c:3258                    1   4096    4096
include/linux/genhd.h:271               72     40    2880
lib/percpu_counter.c:77                194      4     776
net/ipv4/af_inet.c:1287                  1    288     288
net/ipv4/af_inet.c:1290                  1    288     288
net/ipv4/af_inet.c:1287                  1    256     256
net/ipv4/af_inet.c:1290                  1    256     256
net/core/neighbour.c:1424                4     44     176
kernel/kexec.c:1143                      1    176     176
net/ipv4/af_inet.c:1287                  1    104     104
net/ipv4/af_inet.c:1290                  1    104     104
arch/x86/.../acpi-cpufreq.c:528         96      1      96
arch/x86/acpi/cstate.c:153               1     64      64
net/.../nf_conntrack_core.c:1209         1     60      60

Others: 178

This is why my patch series adds "big_percpu_alloc" (basically
identical to the current code) for the bigger/unbounded users.

I don't think moving per-cpu areas is going to fly. We do put complex
data structures in there, and you're going to need preempt_disable()
on all per-cpu ops on many archs to make it work (assuming you use
stop_machine to do the realloc). Even a rough audit quickly becomes
overwhelming: 20 of the first 1/4 of the DECLARE_PER_CPUs are
non-movable data structures.

Cheers,
Rusty.
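The split Rusty describes might look something like this at the API
level -- illustrative signatures only, not his actual patch:

#include <linux/types.h>

/* illustrative: bounded allocations served from the static per-cpu
 * area, addressed with the fast zero-based scheme */
extern void *small_percpu_alloc(size_t size, size_t align);
/* illustrative: bigger or unbounded users fall back to today's
 * pointer-indirection alloc_percpu()-style scheme */
extern void *big_percpu_alloc(size_t size, size_t align);

#define PCPU_SMALL_LIMIT 128    /* illustrative threshold, see table */

static void *my_percpu_alloc(size_t size, size_t align)
{
        if (size <= PCPU_SMALL_LIMIT)
                return small_percpu_alloc(size, align);
        return big_percpu_alloc(size, align);
}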
* Re: regarding the x86_64 zero-based percpu patches
From: Ingo Molnar @ 2009-01-15 13:55 UTC
To: Rusty Russell
Cc: Eric W. Biederman, Christoph Lameter, Tejun Heo, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

* Rusty Russell <rusty@rustcorp.com.au> wrote:

> This is why I did a brief audit. Here it is:
>
> With x86/32 allyesconfig (trimmed a little, until it booted under
> kvm) we have 37148 bytes of static percpu data, and 117228 bytes of
> dynamic percpu data.
>
> [...]
>
> This is why my patch series adds "big_percpu_alloc" (basically
> identical to the current code) for the bigger/unbounded users.
>
> I don't think moving per-cpu areas is going to fly. We do put complex
> data structures in there, and you're going to need preempt_disable()
> on all per-cpu ops on many archs to make it work (assuming you use
> stop_machine to do the realloc). Even a rough audit quickly becomes
> overwhelming: 20 of the first 1/4 of the DECLARE_PER_CPUs are
> non-movable data structures.

Why do we have to move them? Even on an allyesconfig the total ~150K
size seems to be peanuts - compared to the ~+4MB CONFIG_MAXSMP
.data/.bss bloat. I must be missing something ...

	Ingo
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-15 20:27 UTC
To: Rusty Russell
Cc: Eric W. Biederman, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Thu, 15 Jan 2009, Rusty Russell wrote:
> I don't think moving per-cpu areas is going to fly. We do put complex
> data structures in there, and you're going to need preempt_disable()
> on all per-cpu ops on many archs to make it work (assuming you use
> stop_machine to do the realloc). Even a rough audit quickly becomes
> overwhelming: 20 of the first 1/4 of the DECLARE_PER_CPUs are
> non-movable data structures.

Ok, then let's go for dynamically growing per-cpu areas using 2M
virtual mappings... At least on 64 bit that should be fine.