Re: [PATCH] percpu: add optimized generic percpu accessors

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] percpu: add optimized generic percpu accessors
       [not found]   ` <20090116001544.GA11073@elte.hu>
@ 2009-01-16  0:18     ` Herbert Xu
       [not found]       ` <200901170827.33729.rusty@rustcorp.com.au>
  0 siblings, 1 reply; 32+ messages in thread
From: Herbert Xu @ 2009-01-16  0:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, tj, hpa, brgerst, ebiederm, cl, rusty, travis, linux-kernel,
	steiner, hugh, David S. Miller, netdev

On Fri, Jan 16, 2009 at 01:15:44AM +0100, Ingo Molnar wrote:
>
> > So if you could design the API such that we have a variant of add/inc 
> > that automatically disables/enables preemption then we can optimise that 
> > away on x86.
> 
> Yeah. percpu_add(var, 1) does exactly that on x86.

Awesome! Sounds like we can finally do away with those nasty
hacks in the SNMP macros.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 32+ messages in thread

[parent not found: <200901170827.33729.rusty@rustcorp.com.au>]

* Re: [PATCH] percpu: add optimized generic percpu accessors
       [not found]       ` <200901170827.33729.rusty@rustcorp.com.au>
@ 2009-01-16 22:08         ` Ingo Molnar
       [not found]           ` <200901201328.24605.rusty@rustcorp.com.au>
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-01-16 22:08 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Herbert Xu, akpm, tj, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

* Rusty Russell <rusty@rustcorp.com.au> wrote:

> On Friday 16 January 2009 10:48:24 Herbert Xu wrote:
> > On Fri, Jan 16, 2009 at 01:15:44AM +0100, Ingo Molnar wrote:
> > >
> > > > So if you could design the API such that we have a variant of add/inc 
> > > > that automatically disables/enables preemption then we can optimise that 
> > > > away on x86.
> > > 
> > > Yeah. percpu_add(var, 1) does exactly that on x86.
> 
> <sigh>.  No it doesn't.

What do you mean by "No it doesn't". It does exactly what i claimed it 
does.

> It's really nice that everyone's excited about this, but it's more 
> complex than this.  Unf. I'm too busy preparing for linux.conf.au to 
> explain it all properly right now, but here's the highlights:
> 
> 1) This only works on static per-cpu vars.
>    - We are working on fixing this, but it's non-trivial for large allocs like
>      those in networking.  Small allocs, we have patches for.

How do difficulties of dynamic percpu-alloc make my above suggestion 
unsuitable for SNMP stats in networking? Most of those stats are not 
dynamically allocated - they are plain straightforward percpu variables.

Plus the majority of percpu usage is static - just like the majority of 
local variables is static, not dynamic. So any percpu-alloc complication 
is a non-issue.

> 2) The generic versions of these as posted by Tejun are unsuitable for
>    networking; they need to bh_disable.  That would make networking less
>    efficient than it is now for non-x86, and to be generic it would have
>    to be local_irq_save/restore anyway.

The generic versions will not be used on 95%+ of the active Linux systems 
out there, as they run on x86. If you worry about the remaining 5%, those 
can be optimized too.

> 3) local_t was designed to do exactly this: a fast cpu-local counter
>    implemented optimally for each arch.  For sparc64, doing a trivalue version
>    seems optimal, for s390 atomics, for x86 single-insn, for powerpc
>    irq_save/restore, etc.

But local_t does not actually solve this problem at all - because one 
still has to have per-cpu-ness.

> 4) Unfortunately, local_t has been extended beyond a simple counter, meaning
>    it now has more complex requirements (eg. Mathieu wants nmi-safe, even
>    though that's impossible on sparc and parisc, and percpu_counter wants
>    local_add_return, which makes trival less desirable).  These discussions
>    are on the back burner at the moment, but ongoing.

In reality local_t has almost zero users in the kernel - despite being 
with us at least since v2.6.12. That pretty much tells us all about its 
utility.

The thing is, local_t without proper percpu integration is a toothless 
tiger in the jungle. And our APIS do exactly that kind of integration and 
i expect them to be more popular than local_t. There's already a dozen 
usage sites of it in arch/x86.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

[parent not found: <200901201328.24605.rusty@rustcorp.com.au>]

* Re: [PATCH] percpu: add optimized generic percpu accessors
       [not found]           ` <200901201328.24605.rusty@rustcorp.com.au>
@ 2009-01-20  6:25             ` Tejun Heo
  2009-01-20 10:36               ` Ingo Molnar
       [not found]               ` <200901271213.18605.rusty@rustcorp.com.au>
  2009-01-20 10:40             ` Ingo Molnar
  1 sibling, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2009-01-20  6:25 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Rusty Russell wrote:
> The generic versions Tejun posted are not softirq safe, so not
> suitable for network counters. To figure out what semantics we
> really want we need to we must audit the users; I'm sorry I haven't
> finished that task (and won't until after the conference).

No, they're not.  They're preempt safe as mentioned in the comment and
is basically just generalization of the original x86 versions used by
x86_64 on SMP before pda and percpu areas were merged.  I agree that
it's something very close to local_t and it would be nice to see those
somehow unified (and I have patches which make use of local_t in my
queue waiting for dynamic percpu allocation).

Another question to ask is whether to keep using separate interfaces
for static and dynamic percpu variables or migrate to something which
can take both.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-20  6:25             ` Tejun Heo
@ 2009-01-20 10:36               ` Ingo Molnar
       [not found]               ` <200901271213.18605.rusty@rustcorp.com.au>
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2009-01-20 10:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers


* Tejun Heo <tj@kernel.org> wrote:

> Rusty Russell wrote:
> > The generic versions Tejun posted are not softirq safe, so not
> > suitable for network counters. To figure out what semantics we
> > really want we need to we must audit the users; I'm sorry I haven't
> > finished that task (and won't until after the conference).
> 
> No, they're not.  They're preempt safe as mentioned in the comment and 
> is basically just generalization of the original x86 versions used by 
> x86_64 on SMP before pda and percpu areas were merged.  I agree that 
> it's something very close to local_t and it would be nice to see those 
> somehow unified (and I have patches which make use of local_t in my 
> queue waiting for dynamic percpu allocation).
> 
> Another question to ask is whether to keep using separate interfaces for 
> static and dynamic percpu variables or migrate to something which can 
> take both.

Also, there's over 400 PER_CPU variable definitions in the kernel, while 
only about 40 dynamic percpu allocation usage sites. (in that i included 
the percpu_counter bits used by networking)

So our percpu code usage is on the static side, by a large margin.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

[parent not found: <200901271213.18605.rusty@rustcorp.com.au>]

* Re: [PATCH] percpu: add optimized generic percpu accessors
       [not found]               ` <200901271213.18605.rusty@rustcorp.com.au>
@ 2009-01-27  2:24                 ` Tejun Heo
  2009-01-27 13:13                   ` Ingo Molnar
                                     ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Tejun Heo @ 2009-01-27  2:24 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Hello, Rusty.

Rusty Russell wrote:

>> No, they're not. They're preempt safe as mentioned in the comment
>> and is basically just generalization of the original x86 versions
>> used by x86_64 on SMP before pda and percpu areas were merged. I
>> agree that it's something very close to local_t and it would be
>> nice to see those somehow unified (and I have patches which make
>> use of local_t in my queue waiting for dynamic percpu allocation).
> 
> Yes, which is one reason I dislike Ingo's patch:
> 1) Mine did just read because that covers the most common fast-path use
> and is easily atomic for word-sizes on all archs,
> 2) Didn't replace x86, just #defined generic one, so much less churn,
> 3) read_percpu_var and read_percpu_ptr variants following the convention
> reinforced by my other patches.
>
> Linus' tree had read/write/add/or counts at 22/13/0/0. Yours has
> more write usage, so I'm happy there, but still only one add and one
> or. If we assume that generic code will look a bit like that when
> converted, I'm not convinced that generic and/or/etc ops are worth
> it.

There actually were quite some places where atomic add ops would be
useful, especially the places where statistics are collected.  For
logical bitops, I don't think we'll have too many of them.

> If they are worth doing generically, should the ops be atomic? To
> extrapolate from x86 usages again, it seems to be happy with
> non-atomic (tho of course it is atomic on x86).

If atomic rw/add/sub are implementible on most archs (and judging from
local_t, I suppose it is), I think it should.  So that it can replace
local_t and we won't need something else again in the future.

>> Another question to ask is whether to keep using separate
>> interfaces for static and dynamic percpu variables or migrate to
>> something which can take both.
>
> Well, IA64 can do stuff with static percpus that it can't do with
> dynamic (assuming we get expanding dynamic percpu areas
> later). That's because they use TLB tricks for a static 64k per-cpu
> area, but this doesn't scale.  That might not be vital: abandoning
> that trick will mean they can't optimise read_percpu/read_percpu_var
> etc as much.

Isn't something like the following possible?

#define pcpu_read(ptr)						\
({								\
	if (__builtin_constant_p(ptr) &&			\
	    ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END)	\
		do 64k TLB trick for static pcpu;		\
	else							\
		do generic stuff;				\
})

> Tejun, any chance of you updating the tj-percpu tree? My current
> patches are against Linus's tree, and rebasing them on yours
> involves some icky merging.

If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects,
I'll do it tomorrow-ish (still big holiday here).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27  2:24                 ` Tejun Heo
@ 2009-01-27 13:13                   ` Ingo Molnar
  2009-01-27 23:07                     ` Tejun Heo
  2009-01-27 20:08                   ` Christoph Lameter
  2009-01-28 10:38                   ` Rusty Russell
  2 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-01-27 13:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers


* Tejun Heo <tj@kernel.org> wrote:

> > Tejun, any chance of you updating the tj-percpu tree? My current 
> > patches are against Linus's tree, and rebasing them on yours involves 
> > some icky merging.
> 
> If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects, 
> I'll do it tomorrow-ish (still big holiday here).

Sure - but please do it as a clean rebase ontop of latest, not as a 
conflict-merge, as i'd expect there to be quite many clashes.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27 13:13                   ` Ingo Molnar
@ 2009-01-27 23:07                     ` Tejun Heo
  2009-01-28  3:36                       ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2009-01-27 23:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
> 
>>> Tejun, any chance of you updating the tj-percpu tree? My current 
>>> patches are against Linus's tree, and rebasing them on yours involves 
>>> some icky merging.
>> If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects, 
>> I'll do it tomorrow-ish (still big holiday here).
> 
> Sure - but please do it as a clean rebase ontop of latest, not as a 
> conflict-merge, as i'd expect there to be quite many clashes.

Alrighty, re-basing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27 23:07                     ` Tejun Heo
@ 2009-01-28  3:36                       ` Tejun Heo
  2009-01-28  8:12                         ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2009-01-28  3:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Tejun Heo wrote:
> Ingo Molnar wrote:
>> * Tejun Heo <tj@kernel.org> wrote:
>>
>>>> Tejun, any chance of you updating the tj-percpu tree? My current 
>>>> patches are against Linus's tree, and rebasing them on yours involves 
>>>> some icky merging.
>>> If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects, 
>>> I'll do it tomorrow-ish (still big holiday here).
>> Sure - but please do it as a clean rebase ontop of latest, not as a 
>> conflict-merge, as i'd expect there to be quite many clashes.
> 
> Alrighty, re-basing.

Okay, rebased on top of the current master[1].

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu

The rebased commit ID is e7f538f3d2a6b3a0a632190a7d736730ee10909c.

There were several changes which didn't fit the rebased tree as they
really belong to other branches in the x86 tree.  I'll re-submit them
as patches against the appropriate branch.

Thanks.

-- 
tejun

[1] 5ee810072175042775e39bdd3eaaa68884c27805

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28  3:36                       ` Tejun Heo
@ 2009-01-28  8:12                         ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2009-01-28  8:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Tejun Heo wrote:
> Okay, rebased on top of the current master[1].
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu
> 
> The rebased commit ID is e7f538f3d2a6b3a0a632190a7d736730ee10909c.
> 
> There were several changes which didn't fit the rebased tree as they
> really belong to other branches in the x86 tree.  I'll re-submit them
> as patches against the appropriate branch.

Two patches against #stackprotector and one against #cpus4096 sent
out.  With these three patches, most of stuff has been preserved but
there are still few missing pieces which only make sense after trees
are merged.  Will be happy to drive it if the current branch looks
fine.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27  2:24                 ` Tejun Heo
  2009-01-27 13:13                   ` Ingo Molnar
@ 2009-01-27 20:08                   ` Christoph Lameter
  2009-01-27 21:47                     ` David Miller
  2009-01-28 10:38                   ` Rusty Russell
  2 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2009-01-27 20:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, Ingo Molnar, Herbert Xu, akpm, hpa, brgerst,
	ebiederm, travis, linux-kernel, steiner, hugh, David S. Miller,
	netdev, Mathieu Desnoyers

On Tue, 27 Jan 2009, Tejun Heo wrote:

> > later). That's because they use TLB tricks for a static 64k per-cpu
> > area, but this doesn't scale.  That might not be vital: abandoning
> > that trick will mean they can't optimise read_percpu/read_percpu_var
> > etc as much.

Why wont it scale? this is a separate TLB entry for each processor.

>
> Isn't something like the following possible?
>
> #define pcpu_read(ptr)						\
> ({								\
> 	if (__builtin_constant_p(ptr) &&			\
> 	    ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END)	\
> 		do 64k TLB trick for static pcpu;		\
> 	else							\
> 		do generic stuff;				\
> })

The TLB trick is just to access the percpu data at a fixed base. I.e.
	value = SHIFT_PERCPU_PTR(percpu_var, FIXED_ADDRESS);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27 20:08                   ` Christoph Lameter
@ 2009-01-27 21:47                     ` David Miller
  2009-01-27 22:47                       ` Rick Jones
  2009-01-28 16:45                       ` Christoph Lameter
  0 siblings, 2 replies; 32+ messages in thread
From: David Miller @ 2009-01-27 21:47 UTC (permalink / raw)
  To: cl
  Cc: tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis,
	linux-kernel, steiner, hugh, netdev, mathieu.desnoyers

From: Christoph Lameter <cl@linux-foundation.org>
Date: Tue, 27 Jan 2009 15:08:57 -0500 (EST)

> On Tue, 27 Jan 2009, Tejun Heo wrote:
> 
> > > later). That's because they use TLB tricks for a static 64k per-cpu
> > > area, but this doesn't scale.  That might not be vital: abandoning
> > > that trick will mean they can't optimise read_percpu/read_percpu_var
> > > etc as much.
> 
> Why wont it scale? this is a separate TLB entry for each processor.

The IA64 per-cpu TLB entry only covers 64k which makes use of it for
dynamic per-cpu stuff out of the question.  That's why it "doesn't
scale"

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27 21:47                     ` David Miller
@ 2009-01-27 22:47                       ` Rick Jones
  2009-01-28  0:17                         ` Luck, Tony
  2009-01-28 16:45                       ` Christoph Lameter
  1 sibling, 1 reply; 32+ messages in thread
From: Rick Jones @ 2009-01-27 22:47 UTC (permalink / raw)
  To: David Miller
  Cc: cl, tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm,
	travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers,
	linux-ia64@vger.kernel.org

David Miller wrote:
> From: Christoph Lameter <cl@linux-foundation.org>
> Date: Tue, 27 Jan 2009 15:08:57 -0500 (EST)
> 
> 
>>On Tue, 27 Jan 2009, Tejun Heo wrote:
>>
>>
>>>>later). That's because they use TLB tricks for a static 64k per-cpu
>>>>area, but this doesn't scale.  That might not be vital: abandoning
>>>>that trick will mean they can't optimise read_percpu/read_percpu_var
>>>>etc as much.
>>
>>Why wont it scale? this is a separate TLB entry for each processor.
> 
> 
> The IA64 per-cpu TLB entry only covers 64k which makes use of it for
> dynamic per-cpu stuff out of the question.  That's why it "doesn't
> scale"

I was asking around, and was told that on IA64 *harware* at least, in addition to 
supporting multiple page sizes (up to a GB IIRC), one can pin up to 8 or perhaps 
1/2 the TLB entries.

So, in theory if one were so inclined the special pinned per-CPU entry could 
either be more than one 64K entry, or a single, rather larger entry.

As a sanity check, I've cc'd linux-ia64.

rick jones

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27 22:47                       ` Rick Jones
@ 2009-01-28  0:17                         ` Luck, Tony
  2009-01-28 16:48                           ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2009-01-28  0:17 UTC (permalink / raw)
  To: Rick Jones, David Miller
  Cc: cl@linux-foundation.org, tj@kernel.org, rusty@rustcorp.com.au,
	mingo@elte.hu, herbert@gondor.apana.org.au,
	akpm@linux-foundation.org, hpa@zytor.com, brgerst@gmail.com,
	ebiederm@xmission.com, travis@sgi.com,
	linux-kernel@vger.kernel.org, steiner@sgi.com, hugh@veritas.com,
	netdev@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
	linux-ia64@vger.kernel.org

> I was asking around, and was told that on IA64 *harware* at least, in addition to
> supporting multiple page sizes (up to a GB IIRC), one can pin up to 8 or perhaps
> 1/2 the TLB entries.

The number of TLB entries that can be pinned is model specific (but the
minimum allowed is 8 each for code & data).  Page sizes supported also
vary by model, recent models go up to 4GB.

BUT ... we stopped pinning this entry in the TLB when benchmarks showed
that it was better to just insert this as a regular TLB entry which might
can be dropped to map something else.  So now there is code in the Alt-DTLB
miss handler (ivt.S) to insert the per-cpu mapping on an as needed basis.

> So, in theory if one were so inclined the special pinned per-CPU entry could
> either be more than one 64K entry, or a single, rather larger entry.

Managing a larger space could be done ... but at the expense of making
the Alt-DTLB miss handler do a memory lookup to find the physical address
of the per-cpu page needed (assuming that we allocate a bunch of random
physical pages for use as per-cpu space rather than a single contiguous
block of physical memory).

When do we know the total amount of per-cpu memory needed?
1) CONFIG time?
2) Boot time?
3) Arbitrary run time?

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28  0:17                         ` Luck, Tony
@ 2009-01-28 16:48                           ` Christoph Lameter
  2009-01-28 17:15                             ` Luck, Tony
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2009-01-28 16:48 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Rick Jones, David Miller, tj@kernel.org, rusty@rustcorp.com.au,
	mingo@elte.hu, herbert@gondor.apana.org.au,
	akpm@linux-foundation.org, hpa@zytor.com, brgerst@gmail.com,
	ebiederm@xmission.com, travis@sgi.com,
	linux-kernel@vger.kernel.org, steiner@sgi.com, hugh@veritas.com,
	netdev@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
	linux-ia64@vger.kernel.org

On Tue, 27 Jan 2009, Luck, Tony wrote:

> Managing a larger space could be done ... but at the expense of making
> the Alt-DTLB miss handler do a memory lookup to find the physical address
> of the per-cpu page needed (assuming that we allocate a bunch of random
> physical pages for use as per-cpu space rather than a single contiguous
> block of physical memory).

We cannot resize the area by using a single larger TLB entry?

> When do we know the total amount of per-cpu memory needed?
> 1) CONFIG time?

Would be easiest.

> 2) Boot time?

We could make the TLB entry size configurable with a kernel parameter

> 3) Arbitrary run time?

We could reserve a larger virtual space and switch TLB entries as needed?
We would need do get larger order pages to do this.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 16:48                           ` Christoph Lameter
@ 2009-01-28 17:15                             ` Luck, Tony
  0 siblings, 0 replies; 32+ messages in thread
From: Luck, Tony @ 2009-01-28 17:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rick Jones, David Miller, tj@kernel.org, rusty@rustcorp.com.au,
	mingo@elte.hu, herbert@gondor.apana.org.au,
	akpm@linux-foundation.org, hpa@zytor.com, brgerst@gmail.com,
	ebiederm@xmission.com, travis@sgi.com,
	linux-kernel@vger.kernel.org, steiner@sgi.com, hugh@veritas.com,
	netdev@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
	linux-ia64@vger.kernel.org

> > Managing a larger space could be done ... but at the expense of making
> > the Alt-DTLB miss handler do a memory lookup to find the physical address
> > of the per-cpu page needed (assuming that we allocate a bunch of random
> > physical pages for use as per-cpu space rather than a single contiguous
> > block of physical memory).
>
> We cannot resize the area by using a single larger TLB entry?

Yes we could ... the catch is that the supported TLB page sizes go
up by multiples of 4.  So if 64K is not enough the next stop is
at 256K, then 1M, then 4M, and so on.

That's why I asked when we know what total amount of per-cpu memory
is needed.  CONFIG time or boot time is good, because allocating higher
order pages is easy at boot time.  Arbitrary run time is bad because
we might have difficulty allocating a high order page for every cpu.

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27 21:47                     ` David Miller
  2009-01-27 22:47                       ` Rick Jones
@ 2009-01-28 16:45                       ` Christoph Lameter
  2009-01-28 20:47                         ` David Miller
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2009-01-28 16:45 UTC (permalink / raw)
  To: David Miller
  Cc: tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis,
	linux-kernel, steiner, hugh, netdev, mathieu.desnoyers

On Tue, 27 Jan 2009, David Miller wrote:

> > Why wont it scale? this is a separate TLB entry for each processor.
>
> The IA64 per-cpu TLB entry only covers 64k which makes use of it for
> dynamic per-cpu stuff out of the question.  That's why it "doesn't
> scale"

IA64 supports varying page sizes. You can use a 128k TLB entry etc.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 16:45                       ` Christoph Lameter
@ 2009-01-28 20:47                         ` David Miller
  0 siblings, 0 replies; 32+ messages in thread
From: David Miller @ 2009-01-28 20:47 UTC (permalink / raw)
  To: cl
  Cc: tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis,
	linux-kernel, steiner, hugh, netdev, mathieu.desnoyers

From: Christoph Lameter <cl@linux-foundation.org>
Date: Wed, 28 Jan 2009 11:45:28 -0500 (EST)

> On Tue, 27 Jan 2009, David Miller wrote:
> 
> > > Why wont it scale? this is a separate TLB entry for each processor.
> >
> > The IA64 per-cpu TLB entry only covers 64k which makes use of it for
> > dynamic per-cpu stuff out of the question.  That's why it "doesn't
> > scale"
> 
> IA64 supports varying page sizes. You can use a 128k TLB entry etc.

Good luck moving to a larger size dynamically at run time.

It really isn't a tenable solution.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-27  2:24                 ` Tejun Heo
  2009-01-27 13:13                   ` Ingo Molnar
  2009-01-27 20:08                   ` Christoph Lameter
@ 2009-01-28 10:38                   ` Rusty Russell
  2009-01-28 10:56                     ` Tejun Heo
  2009-01-28 16:50                     ` Christoph Lameter
  2 siblings, 2 replies; 32+ messages in thread
From: Rusty Russell @ 2009-01-28 10:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

On Tuesday 27 January 2009 12:54:27 Tejun Heo wrote:
> Hello, Rusty.

Hi Tejun!

> There actually were quite some places where atomic add ops would be
> useful, especially the places where statistics are collected.  For
> logical bitops, I don't think we'll have too many of them.

If the stats are only manipulated in one context, than an atomic requirement is overkill (and expensive on non-x86).

> > If they are worth doing generically, should the ops be atomic? To
> > extrapolate from x86 usages again, it seems to be happy with
> > non-atomic (tho of course it is atomic on x86).
> 
> If atomic rw/add/sub are implementible on most archs (and judging from
> local_t, I suppose it is), I think it should.  So that it can replace
> local_t and we won't need something else again in the future.

This is more like Christoph's CPU_OPS: they were special operators on normal per-cpu vars/ptrs.  Generic version was irqsave+op+irqrestore.

I actually like this idea, but Mathieu insists that the ops be NMI-safe, for ftrace.  Hence local_t needing to be atomic_t for generic code.

AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t
in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care).

Other than the shouting, I liked Christoph's system:
- CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
- _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
- __CPU_INC = not safe against anything (eg. per_cpu(i)++)

I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed.

> >> Another question to ask is whether to keep using separate
> >> interfaces for static and dynamic percpu variables or migrate to
> >> something which can take both.
> >
> > Well, IA64 can do stuff with static percpus that it can't do with
> > dynamic (assuming we get expanding dynamic percpu areas
> > later). That's because they use TLB tricks for a static 64k per-cpu
> > area, but this doesn't scale.  That might not be vital: abandoning
> > that trick will mean they can't optimise read_percpu/read_percpu_var
> > etc as much.
> 
> Isn't something like the following possible?
> 
> #define pcpu_read(ptr)						\
> ({								\
> 	if (__builtin_constant_p(ptr) &&			\
> 	    ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END)	\
> 		do 64k TLB trick for static pcpu;		\
> 	else							\
> 		do generic stuff;				\
> })

No, that will be "do generic stuff", since it's a link-time constant.  I don't know that this is a huge worry, to be honest.  We can leave the __ia64_per_cpu_var for their arch-specific code (I feel the same way about x86 to be honest).

> > Tejun, any chance of you updating the tj-percpu tree? My current
> > patches are against Linus's tree, and rebasing them on yours
> > involves some icky merging.
> 
> If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects,
> I'll do it tomorrow-ish (still big holiday here).

Ah, I did not realize that you celebrated Australia day :)

Cheers!
Rusty.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 10:38                   ` Rusty Russell
@ 2009-01-28 10:56                     ` Tejun Heo
  2009-01-29  2:06                       ` Rusty Russell
  2009-01-28 16:50                     ` Christoph Lameter
  1 sibling, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2009-01-28 10:56 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Hello,

Rusty Russell wrote:
> On Tuesday 27 January 2009 12:54:27 Tejun Heo wrote:
>> Hello, Rusty.
> 
> Hi Tejun!
> 
>> There actually were quite some places where atomic add ops would be
>> useful, especially the places where statistics are collected.  For
>> logical bitops, I don't think we'll have too many of them.
> 
> If the stats are only manipulated in one context, than an atomic
> requirement is overkill (and expensive on non-x86).

Yes, it is.  I was hoping it to be not more expensive on most archs.
It isn't on x86 at the very least but I don't know much about other
archs.

>>> If they are worth doing generically, should the ops be atomic? To
>>> extrapolate from x86 usages again, it seems to be happy with
>>> non-atomic (tho of course it is atomic on x86).
>> If atomic rw/add/sub are implementible on most archs (and judging from
>> local_t, I suppose it is), I think it should.  So that it can replace
>> local_t and we won't need something else again in the future.
> 
> This is more like Christoph's CPU_OPS: they were special operators
> on normal per-cpu vars/ptrs.  Generic version was
> irqsave+op+irqrestore.
>
> I actually like this idea, but Mathieu insists that the ops be
> NMI-safe, for ftrace.  Hence local_t needing to be atomic_t for
> generic code.
> 
> AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use
> atomic_t in ftrace (which isn't NMI safe on parisc or sparc/32
> anyway, but I don't think we care).

Requiring NMI-safeness is quite an exception, I suppose.  I don't
think we should design around it.  If it can be worked around one way
or the other, it should be fine.

> Other than the shouting, I liked Christoph's system:
> - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
> - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
> - __CPU_INC = not safe against anything (eg. per_cpu(i)++)
> 
> I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed.

I like local better too but no biggies one way or the other.

>>>> Another question to ask is whether to keep using separate
>>>> interfaces for static and dynamic percpu variables or migrate to
>>>> something which can take both.
>>> Well, IA64 can do stuff with static percpus that it can't do with
>>> dynamic (assuming we get expanding dynamic percpu areas
>>> later). That's because they use TLB tricks for a static 64k per-cpu
>>> area, but this doesn't scale.  That might not be vital: abandoning
>>> that trick will mean they can't optimise read_percpu/read_percpu_var
>>> etc as much.
>> Isn't something like the following possible?
>>
>> #define pcpu_read(ptr)						\
>> ({								\
>> 	if (__builtin_constant_p(ptr) &&			\
>> 	    ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END)	\
>> 		do 64k TLB trick for static pcpu;		\
>> 	else							\
>> 		do generic stuff;				\
>> })
> 
> No, that will be "do generic stuff", since it's a link-time
> constant.  I don't know that this is a huge worry, to be honest.  We
> can leave the __ia64_per_cpu_var for their arch-specific code (I
> feel the same way about x86 to be honest).

Yes, right.  Got confused there.  Hmmm... looks like what would work
there is "is it a lvalue?" test.  Well, anyways, if it isn't
necessary.

>>> Tejun, any chance of you updating the tj-percpu tree? My current
>>> patches are against Linus's tree, and rebasing them on yours
>>> involves some icky merging.
>> If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects,
>> I'll do it tomorrow-ish (still big holiday here).
> 
> Ah, I did not realize that you celebrated Australia day :)

Hey, didn't know Australia was founded on lunar New Year's day.
Nice. :-)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 10:56                     ` Tejun Heo
@ 2009-01-29  2:06                       ` Rusty Russell
  2009-01-31  6:11                         ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Rusty Russell @ 2009-01-29  2:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

On Wednesday 28 January 2009 21:26:34 Tejun Heo wrote:
> Hello,

Hi Tejun,

> Rusty Russell wrote:
> > If the stats are only manipulated in one context, than an atomic
> > requirement is overkill (and expensive on non-x86).
> 
> Yes, it is.  I was hoping it to be not more expensive on most archs.
> It isn't on x86 at the very least but I don't know much about other
> archs.

Hmm, you can garner this from the local_t stats which were flying around.
(see Re: local_add_return from me), or look in the preamble to
http://ozlabs.org/~rusty/kernel/rr-latest/misc:test-local_t.patch ).

Of course, if you want to be my hero, you could implement "soft" irq
disable for all archs, which would make this cheaper.

> > Other than the shouting, I liked Christoph's system:
> > - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
> > - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
> > - __CPU_INC = not safe against anything (eg. per_cpu(i)++)
> > 
> > I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed.
> 
> I like local better too but no biggies one way or the other.

Maybe kill local_t and take the name back.  I'll leave it to you...

> > Ah, I did not realize that you celebrated Australia day :)
> 
> Hey, didn't know Australia was founded on lunar New Year's day.
> Nice. :-)

That would have been cool, but no; first time in 76 years they matched tho.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-29  2:06                       ` Rusty Russell
@ 2009-01-31  6:11                         ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2009-01-31  6:11 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Hello, Rusty.

Rusty Russell wrote:
>> Rusty Russell wrote:
>>> If the stats are only manipulated in one context, than an atomic
>>> requirement is overkill (and expensive on non-x86).
>> Yes, it is.  I was hoping it to be not more expensive on most archs.
>> It isn't on x86 at the very least but I don't know much about other
>> archs.
> 
> Hmm, you can garner this from the local_t stats which were flying around.
> (see Re: local_add_return from me), or look in the preamble to
> http://ozlabs.org/~rusty/kernel/rr-latest/misc:test-local_t.patch ).

Ah... Great.

> Of course, if you want to be my hero, you could implement "soft" irq
> disable for all archs, which would make this cheaper.

I suppose you mean deferred execution of interrupt handlers for quick
atomicities.  Yeah, that would be nice for things like this.

>>> Other than the shouting, I liked Christoph's system:
>>> - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
>>> - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
>>> - __CPU_INC = not safe against anything (eg. per_cpu(i)++)
>>>
>>> I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed.
>> I like local better too but no biggies one way or the other.
> 
> Maybe kill local_t and take the name back.  I'll leave it to you...
> 
>>> Ah, I did not realize that you celebrated Australia day :)
>> Hey, didn't know Australia was founded on lunar New Year's day.
>> Nice. :-)
> 
> That would have been cool, but no; first time in 76 years they matched tho.

It was a joke.  :-)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 10:38                   ` Rusty Russell
  2009-01-28 10:56                     ` Tejun Heo
@ 2009-01-28 16:50                     ` Christoph Lameter
  2009-01-28 18:07                       ` Mathieu Desnoyers
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2009-01-28 16:50 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Tejun Heo, Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

On Wed, 28 Jan 2009, Rusty Russell wrote:

> AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t
> in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care).

Right.


> Other than the shouting, I liked Christoph's system:
> - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
> - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
> - __CPU_INC = not safe against anything (eg. per_cpu(i)++)
>
> I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed.

The term cpu is meaning multiple things at this point. So yes it may be
better to go with glibc naming of thread local space.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 16:50                     ` Christoph Lameter
@ 2009-01-28 18:07                       ` Mathieu Desnoyers
  2009-01-29 18:33                         ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Mathieu Desnoyers @ 2009-01-28 18:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, Tejun Heo, Ingo Molnar, Herbert Xu, akpm, hpa,
	brgerst, ebiederm, travis, linux-kernel, steiner, hugh,
	David S. Miller, netdev

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Wed, 28 Jan 2009, Rusty Russell wrote:
> 
> > AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t
> > in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care).
> 
> Right.
> 

Ideally this should be done transparently so we don't have to #ifdef
code around HAVE_NMISAFE_CPUOPS everywhere in the tracer. We might
consider declaring an intermediate type with this kind of #ifdef in the
tracer code if we are the only one user though.

> 
> > Other than the shouting, I liked Christoph's system:
> > - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
> > - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
> > - __CPU_INC = not safe against anything (eg. per_cpu(i)++)
> >
> > I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed.
> 
> The term cpu is meaning multiple things at this point. So yes it may be
> better to go with glibc naming of thread local space.
> 

However using "local" for "per-cpu" could be confusing with the glibc
naming of thread local space, because "per-thread" and "per-cpu"
variables are different from a synchronization POV and we can end up
needing both (e.g. a thread local variable can never be accessed by
another thread, but a cpu local variable could be accessed by a
different CPU due to scheduling).

I'm currently thinking about implementing user-space per-cpu tracing buffers,
and the last thing I'd like is to have a "local" semantic clash between
the kernel and glibc. Ideally, we could have something like :

Atomic safe primitives (emulated with irq disable if the architecture
does not have atomic primitives) :
- atomic_thread_inc()
  * current mainline local_t local_inc().
- atomic_cpu_inc()
  * Your proposed CPU_INC.

Non-safe against interrupts, but safe against preemption :
- thread_inc()
  * no preempt_disable involved, because this deals with per-thread
    variables :
    * Simple var++
- cpu_inc()
  * disables preemption, per_cpu(i)++, enables preemption

Not safe against preemption nor interrupts :
- _thread_inc()
  * maps to thread_inc()
- _cpu_inc()
  * simple per_cpu(i)++

So maybe we don't really need thread_inc(), _thread_inc() and _cpu_inc,
because they map to standard C operations, but we may find that in some
architectures that the atomic_cpu_inc() is faster than the per_cpu(i)++,
and in those cases it would make sense to map e.g. _cpu_inc() to
atomic_cpu_inc().

Also note that don't think adding _ and __ prefixes to the operations
makes it clear for the programmer and reviewer what is safe against
what. OMHO, it will just make the code more obscure. One level of
underscore is about the limit I think people can "know" what this
"unsafe" version of the primitive does.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-28 18:07                       ` Mathieu Desnoyers
@ 2009-01-29 18:33                         ` Christoph Lameter
  2009-01-29 18:48                           ` H. Peter Anvin
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2009-01-29 18:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Rusty Russell, Tejun Heo, Ingo Molnar, Herbert Xu, akpm, hpa,
	brgerst, ebiederm, travis, linux-kernel, steiner, hugh,
	David S. Miller, netdev

On Wed, 28 Jan 2009, Mathieu Desnoyers wrote:

> > The term cpu is meaning multiple things at this point. So yes it may be
> > better to go with glibc naming of thread local space.
> >
>
> However using "local" for "per-cpu" could be confusing with the glibc
> naming of thread local space, because "per-thread" and "per-cpu"
> variables are different from a synchronization POV and we can end up
> needing both (e.g. a thread local variable can never be accessed by
> another thread, but a cpu local variable could be accessed by a
> different CPU due to scheduling).

gcc/glibc support a __thread attribute to variables. As far as I can tell
this automatically makes gcc perform the relocation to the current
context using a segment register.

But its a weird ABI http://people.redhat.com/drepper/tls.pdf. After
reading that I agree that we should stay with the cpu ops and forget about
the local and thread stuff in gcc/glibc.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-29 18:33                         ` Christoph Lameter
@ 2009-01-29 18:48                           ` H. Peter Anvin
  0 siblings, 0 replies; 32+ messages in thread
From: H. Peter Anvin @ 2009-01-29 18:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Rusty Russell, Tejun Heo, Ingo Molnar,
	Herbert Xu, akpm, brgerst, ebiederm, travis, linux-kernel,
	steiner, hugh, David S. Miller, netdev

Christoph Lameter wrote:
> 
> gcc/glibc support a __thread attribute to variables. As far as I can tell
> this automatically makes gcc perform the relocation to the current
> context using a segment register.
> 
> But its a weird ABI http://people.redhat.com/drepper/tls.pdf. After
> reading that I agree that we should stay with the cpu ops and forget about
> the local and thread stuff in gcc/glibc.
> 

We have discussed this a number of times.  There are several issues with
it; one being that per-cpu != per-thread (we can switch CPU whereever
we're preempted), and another that many versions of gcc uses hardcoded
registers, in particular %fs on 64 bits, but the kernel *has* to use %gs
since there is no swapfs instruction.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
       [not found]           ` <200901201328.24605.rusty@rustcorp.com.au>
  2009-01-20  6:25             ` Tejun Heo
@ 2009-01-20 10:40             ` Ingo Molnar
  2009-01-21  5:52               ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-01-20 10:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Herbert Xu, akpm, tj, hpa, brgerst, ebiederm, cl, travis,
	linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

* Rusty Russell <rusty@rustcorp.com.au> wrote:

> On Saturday 17 January 2009 08:38:32 Ingo Molnar wrote:
> > How do difficulties of dynamic percpu-alloc make my above suggestion 
> > unsuitable for SNMP stats in networking? Most of those stats are not 
> > dynamically allocated - they are plain straightforward percpu variables.
> 
> No they're not. [...]

hm, i see this got changed recently - part of the networking stats went 
over to the lib/percpu_counter API recently.

The larger point still remains: the kernel dominantly uses static percpu 
variables by a margin of 10 to 1, so we cannot just brush away the static 
percpu variables and must concentrate on optimizing that side with 
priority. It's nice if the dynamic percpu-alloc side improves as well, of 
course.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-20 10:40             ` Ingo Molnar
@ 2009-01-21  5:52               ` Tejun Heo
  2009-01-21 10:05                 ` Ingo Molnar
  2009-01-21 11:21                 ` Eric W. Biederman
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2009-01-21  5:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Ingo Molnar wrote:
> The larger point still remains: the kernel dominantly uses static percpu 
> variables by a margin of 10 to 1, so we cannot just brush away the static 
> percpu variables and must concentrate on optimizing that side with 
> priority. It's nice if the dynamic percpu-alloc side improves as well, of 
> course.

Well, the infrequent usage of dynamic percpu allocation is in some
part due to the poor implementation, so it's sort of chicken and egg
problem.  I got into this percpu thing because I wanted a percpu
reference count which can be dynamically allocated and it sucked.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-21  5:52               ` Tejun Heo
@ 2009-01-21 10:05                 ` Ingo Molnar
  2009-01-21 11:21                 ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2009-01-21 10:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers


* Tejun Heo <tj@kernel.org> wrote:

> Ingo Molnar wrote:
> > The larger point still remains: the kernel dominantly uses static percpu 
> > variables by a margin of 10 to 1, so we cannot just brush away the static 
> > percpu variables and must concentrate on optimizing that side with 
> > priority. It's nice if the dynamic percpu-alloc side improves as well, of 
> > course.
> 
> Well, the infrequent usage of dynamic percpu allocation is in some part 
> due to the poor implementation, so it's sort of chicken and egg problem.  
> I got into this percpu thing because I wanted a percpu reference count 
> which can be dynamically allocated and it sucked.

Sure, but even static percpu sucked very much (it expanded to half a dozen 
or more instructions), and dynamic is _more_ complex. Anyway, it's getting 
fixed now :-)

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-21  5:52               ` Tejun Heo
  2009-01-21 10:05                 ` Ingo Molnar
@ 2009-01-21 11:21                 ` Eric W. Biederman
  2009-01-21 12:45                   ` Stephen Hemminger
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2009-01-21 11:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Rusty Russell, Herbert Xu, akpm, hpa, brgerst, cl,
	travis, linux-kernel, steiner, hugh, David S. Miller, netdev,
	Mathieu Desnoyers

Tejun Heo <tj@kernel.org> writes:

> Ingo Molnar wrote:
>> The larger point still remains: the kernel dominantly uses static percpu 
>> variables by a margin of 10 to 1, so we cannot just brush away the static 
>> percpu variables and must concentrate on optimizing that side with 
>> priority. It's nice if the dynamic percpu-alloc side improves as well, of 
>> course.
>
> Well, the infrequent usage of dynamic percpu allocation is in some
> part due to the poor implementation, so it's sort of chicken and egg
> problem.  I got into this percpu thing because I wanted a percpu
> reference count which can be dynamically allocated and it sucked.

Counters are our other special case, and counters are interesting
because they are individually very small.  I just looked and the vast
majority of the alloc_percpu users are counters.

I just did a rough count in include/linux/snmp.h  and I came
up with 171*2 counters.  At 8 bytes per counter that is roughly
2.7K.  Or about two thirds of a 4K page.

What makes this is a challenge is that those counters are per network
namespace, and there are no static limits on the number of network
namespaces.

If we push the system and allocate 1024 network namespaces we wind up
needing 2.7MB per cpu, just for the SNMP counters.  

Which nicely illustrates the principle that typically each individual
per cpu allocation is small, but with dynamic allocation we have the
challenge that number of allocations becomes unbounded and in some cases
could be large, while the typical per cpu size is likely to be very small.

I wonder if for the specific case of counters it might make sense to
simply optimize the per cpu allocator for machine word size allocations
and allocate each counter individually freeing us from the burden of
worrying about fragmentation.

The pain with the current alloc_percpu implementation when working
with counters is that it has to allocate an array with one entry
for each cpu to point at the per cpu data.  Which isn't especially efficient.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-21 11:21                 ` Eric W. Biederman
@ 2009-01-21 12:45                   ` Stephen Hemminger
  2009-01-21 14:13                     ` Eric W. Biederman
  2009-01-21 20:34                     ` David Miller
  0 siblings, 2 replies; 32+ messages in thread
From: Stephen Hemminger @ 2009-01-21 12:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Tejun Heo, Ingo Molnar, Rusty Russell, Herbert Xu, akpm, hpa,
	brgerst, cl, travis, linux-kernel, steiner, hugh, David S. Miller,
	netdev, Mathieu Desnoyers

On Wed, 21 Jan 2009 03:21:23 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Tejun Heo <tj@kernel.org> writes:
> 
> > Ingo Molnar wrote:
> >> The larger point still remains: the kernel dominantly uses static percpu 
> >> variables by a margin of 10 to 1, so we cannot just brush away the static 
> >> percpu variables and must concentrate on optimizing that side with 
> >> priority. It's nice if the dynamic percpu-alloc side improves as well, of 
> >> course.
> >
> > Well, the infrequent usage of dynamic percpu allocation is in some
> > part due to the poor implementation, so it's sort of chicken and egg
> > problem.  I got into this percpu thing because I wanted a percpu
> > reference count which can be dynamically allocated and it sucked.
> 
> Counters are our other special case, and counters are interesting
> because they are individually very small.  I just looked and the vast
> majority of the alloc_percpu users are counters.
> 
> I just did a rough count in include/linux/snmp.h  and I came
> up with 171*2 counters.  At 8 bytes per counter that is roughly
> 2.7K.  Or about two thirds of a 4K page.

This is crap. only a small fraction of these SNMP counters are
close enough to the hot path to deserve per-cpu treatment.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-21 12:45                   ` Stephen Hemminger
@ 2009-01-21 14:13                     ` Eric W. Biederman
  2009-01-21 20:34                     ` David Miller
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2009-01-21 14:13 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Tejun Heo, Ingo Molnar, Rusty Russell, Herbert Xu, akpm, hpa,
	brgerst, cl, travis, linux-kernel, steiner, hugh, David S. Miller,
	netdev, Mathieu Desnoyers

Stephen Hemminger <shemminger@vyatta.com> writes:

> This is crap. only a small fraction of these SNMP counters are
> close enough to the hot path to deserve per-cpu treatment.

Interesting.  I was just reading af_inet.c:ipv4_mib_init_net()
to get a feel for what a large consumer of percpu counters looks
like.

I expect the patterns I saw will hold even if the base size is much
smaller.

Your statement reinforces my point that our allocations in a per cpu
area are small and our dynamic per cpu allocator is not really
optimized for small allocations.

In most places what we want are 1-5 counters, which is max 40 bytes.
And our minimum allocation is a full cache line (64-128 bytes) per
allocation.

I'm wondering if dynamic per cpu allocations should work more like
slab.  Have a set of percpu base pointers for an area, and return an
area + offset in some convenient form.

Ideally we would just have one area (allowing us to keep the base
pointer in a register), but I'm not convinced that doesn't fall down
in the face of dynamic allocation.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] percpu: add optimized generic percpu accessors
  2009-01-21 12:45                   ` Stephen Hemminger
  2009-01-21 14:13                     ` Eric W. Biederman
@ 2009-01-21 20:34                     ` David Miller
  1 sibling, 0 replies; 32+ messages in thread
From: David Miller @ 2009-01-21 20:34 UTC (permalink / raw)
  To: shemminger
  Cc: ebiederm, tj, mingo, rusty, herbert, akpm, hpa, brgerst, cl,
	travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 21 Jan 2009 23:45:01 +1100

> This is crap. only a small fraction of these SNMP counters are
> close enough to the hot path to deserve per-cpu treatment.

Only if your router/firewall/webserver isn't hitting that code path
which bumps the counters you think aren't hot path.

It's a micro-DoS waiting to happen if we start trying to split
counters up into groups which matter for hot path processing
(and thus use per-cpu stuff) and those which don't (and thus
use atomics or whatever your idea is).

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2009-01-31  6:12 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20090115183942.GA6325@elte.hu>
     [not found] ` <20090116001200.GA9137@gondor.apana.org.au>
     [not found]   ` <20090116001544.GA11073@elte.hu>
2009-01-16  0:18     ` [PATCH] percpu: add optimized generic percpu accessors Herbert Xu
     [not found]       ` <200901170827.33729.rusty@rustcorp.com.au>
2009-01-16 22:08         ` Ingo Molnar
     [not found]           ` <200901201328.24605.rusty@rustcorp.com.au>
2009-01-20  6:25             ` Tejun Heo
2009-01-20 10:36               ` Ingo Molnar
     [not found]               ` <200901271213.18605.rusty@rustcorp.com.au>
2009-01-27  2:24                 ` Tejun Heo
2009-01-27 13:13                   ` Ingo Molnar
2009-01-27 23:07                     ` Tejun Heo
2009-01-28  3:36                       ` Tejun Heo
2009-01-28  8:12                         ` Tejun Heo
2009-01-27 20:08                   ` Christoph Lameter
2009-01-27 21:47                     ` David Miller
2009-01-27 22:47                       ` Rick Jones
2009-01-28  0:17                         ` Luck, Tony
2009-01-28 16:48                           ` Christoph Lameter
2009-01-28 17:15                             ` Luck, Tony
2009-01-28 16:45                       ` Christoph Lameter
2009-01-28 20:47                         ` David Miller
2009-01-28 10:38                   ` Rusty Russell
2009-01-28 10:56                     ` Tejun Heo
2009-01-29  2:06                       ` Rusty Russell
2009-01-31  6:11                         ` Tejun Heo
2009-01-28 16:50                     ` Christoph Lameter
2009-01-28 18:07                       ` Mathieu Desnoyers
2009-01-29 18:33                         ` Christoph Lameter
2009-01-29 18:48                           ` H. Peter Anvin
2009-01-20 10:40             ` Ingo Molnar
2009-01-21  5:52               ` Tejun Heo
2009-01-21 10:05                 ` Ingo Molnar
2009-01-21 11:21                 ` Eric W. Biederman
2009-01-21 12:45                   ` Stephen Hemminger
2009-01-21 14:13                     ` Eric W. Biederman
2009-01-21 20:34                     ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).