* Re: [PATCH] percpu: add optimized generic percpu accessors [not found] ` <20090116001544.GA11073@elte.hu> @ 2009-01-16 0:18 ` Herbert Xu [not found] ` <200901170827.33729.rusty@rustcorp.com.au> 0 siblings, 1 reply; 32+ messages in thread From: Herbert Xu @ 2009-01-16 0:18 UTC (permalink / raw) To: Ingo Molnar Cc: akpm, tj, hpa, brgerst, ebiederm, cl, rusty, travis, linux-kernel, steiner, hugh, David S. Miller, netdev On Fri, Jan 16, 2009 at 01:15:44AM +0100, Ingo Molnar wrote: > > > So if you could design the API such that we have a variant of add/inc > > that automatically disables/enables preemption then we can optimise that > > away on x86. > > Yeah. percpu_add(var, 1) does exactly that on x86. Awesome! Sounds like we can finally do away with those nasty hacks in the SNMP macros. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <200901170827.33729.rusty@rustcorp.com.au>]
* Re: [PATCH] percpu: add optimized generic percpu accessors [not found] ` <200901170827.33729.rusty@rustcorp.com.au> @ 2009-01-16 22:08 ` Ingo Molnar [not found] ` <200901201328.24605.rusty@rustcorp.com.au> 0 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2009-01-16 22:08 UTC (permalink / raw) To: Rusty Russell Cc: Herbert Xu, akpm, tj, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers * Rusty Russell <rusty@rustcorp.com.au> wrote: > On Friday 16 January 2009 10:48:24 Herbert Xu wrote: > > On Fri, Jan 16, 2009 at 01:15:44AM +0100, Ingo Molnar wrote: > > > > > > > So if you could design the API such that we have a variant of add/inc > > > > that automatically disables/enables preemption then we can optimise that > > > > away on x86. > > > > > > Yeah. percpu_add(var, 1) does exactly that on x86. > > <sigh>. No it doesn't. What do you mean by "No it doesn't". It does exactly what i claimed it does. > It's really nice that everyone's excited about this, but it's more > complex than this. Unf. I'm too busy preparing for linux.conf.au to > explain it all properly right now, but here's the highlights: > > 1) This only works on static per-cpu vars. > - We are working on fixing this, but it's non-trivial for large allocs like > those in networking. Small allocs, we have patches for. How do difficulties of dynamic percpu-alloc make my above suggestion unsuitable for SNMP stats in networking? Most of those stats are not dynamically allocated - they are plain straightforward percpu variables. Plus the majority of percpu usage is static - just like the majority of local variables is static, not dynamic. So any percpu-alloc complication is a non-issue. > 2) The generic versions of these as posted by Tejun are unsuitable for > networking; they need to bh_disable. That would make networking less > efficient than it is now for non-x86, and to be generic it would have > to be local_irq_save/restore anyway. The generic versions will not be used on 95%+ of the active Linux systems out there, as they run on x86. If you worry about the remaining 5%, those can be optimized too. > 3) local_t was designed to do exactly this: a fast cpu-local counter > implemented optimally for each arch. For sparc64, doing a trivalue version > seems optimal, for s390 atomics, for x86 single-insn, for powerpc > irq_save/restore, etc. But local_t does not actually solve this problem at all - because one still has to have per-cpu-ness. > 4) Unfortunately, local_t has been extended beyond a simple counter, meaning > it now has more complex requirements (eg. Mathieu wants nmi-safe, even > though that's impossible on sparc and parisc, and percpu_counter wants > local_add_return, which makes trival less desirable). These discussions > are on the back burner at the moment, but ongoing. In reality local_t has almost zero users in the kernel - despite being with us at least since v2.6.12. That pretty much tells us all about its utility. The thing is, local_t without proper percpu integration is a toothless tiger in the jungle. And our APIS do exactly that kind of integration and i expect them to be more popular than local_t. There's already a dozen usage sites of it in arch/x86. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <200901201328.24605.rusty@rustcorp.com.au>]
* Re: [PATCH] percpu: add optimized generic percpu accessors [not found] ` <200901201328.24605.rusty@rustcorp.com.au> @ 2009-01-20 6:25 ` Tejun Heo 2009-01-20 10:36 ` Ingo Molnar [not found] ` <200901271213.18605.rusty@rustcorp.com.au> 2009-01-20 10:40 ` Ingo Molnar 1 sibling, 2 replies; 32+ messages in thread From: Tejun Heo @ 2009-01-20 6:25 UTC (permalink / raw) To: Rusty Russell Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Rusty Russell wrote: > The generic versions Tejun posted are not softirq safe, so not > suitable for network counters. To figure out what semantics we > really want we need to we must audit the users; I'm sorry I haven't > finished that task (and won't until after the conference). No, they're not. They're preempt safe as mentioned in the comment and is basically just generalization of the original x86 versions used by x86_64 on SMP before pda and percpu areas were merged. I agree that it's something very close to local_t and it would be nice to see those somehow unified (and I have patches which make use of local_t in my queue waiting for dynamic percpu allocation). Another question to ask is whether to keep using separate interfaces for static and dynamic percpu variables or migrate to something which can take both. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-20 6:25 ` Tejun Heo @ 2009-01-20 10:36 ` Ingo Molnar [not found] ` <200901271213.18605.rusty@rustcorp.com.au> 1 sibling, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2009-01-20 10:36 UTC (permalink / raw) To: Tejun Heo Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers * Tejun Heo <tj@kernel.org> wrote: > Rusty Russell wrote: > > The generic versions Tejun posted are not softirq safe, so not > > suitable for network counters. To figure out what semantics we > > really want we need to we must audit the users; I'm sorry I haven't > > finished that task (and won't until after the conference). > > No, they're not. They're preempt safe as mentioned in the comment and > is basically just generalization of the original x86 versions used by > x86_64 on SMP before pda and percpu areas were merged. I agree that > it's something very close to local_t and it would be nice to see those > somehow unified (and I have patches which make use of local_t in my > queue waiting for dynamic percpu allocation). > > Another question to ask is whether to keep using separate interfaces for > static and dynamic percpu variables or migrate to something which can > take both. Also, there's over 400 PER_CPU variable definitions in the kernel, while only about 40 dynamic percpu allocation usage sites. (in that i included the percpu_counter bits used by networking) So our percpu code usage is on the static side, by a large margin. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <200901271213.18605.rusty@rustcorp.com.au>]
* Re: [PATCH] percpu: add optimized generic percpu accessors [not found] ` <200901271213.18605.rusty@rustcorp.com.au> @ 2009-01-27 2:24 ` Tejun Heo 2009-01-27 13:13 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Tejun Heo @ 2009-01-27 2:24 UTC (permalink / raw) To: Rusty Russell Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Hello, Rusty. Rusty Russell wrote: >> No, they're not. They're preempt safe as mentioned in the comment >> and is basically just generalization of the original x86 versions >> used by x86_64 on SMP before pda and percpu areas were merged. I >> agree that it's something very close to local_t and it would be >> nice to see those somehow unified (and I have patches which make >> use of local_t in my queue waiting for dynamic percpu allocation). > > Yes, which is one reason I dislike Ingo's patch: > 1) Mine did just read because that covers the most common fast-path use > and is easily atomic for word-sizes on all archs, > 2) Didn't replace x86, just #defined generic one, so much less churn, > 3) read_percpu_var and read_percpu_ptr variants following the convention > reinforced by my other patches. > > Linus' tree had read/write/add/or counts at 22/13/0/0. Yours has > more write usage, so I'm happy there, but still only one add and one > or. If we assume that generic code will look a bit like that when > converted, I'm not convinced that generic and/or/etc ops are worth > it. There actually were quite some places where atomic add ops would be useful, especially the places where statistics are collected. For logical bitops, I don't think we'll have too many of them. > If they are worth doing generically, should the ops be atomic? To > extrapolate from x86 usages again, it seems to be happy with > non-atomic (tho of course it is atomic on x86). If atomic rw/add/sub are implementible on most archs (and judging from local_t, I suppose it is), I think it should. So that it can replace local_t and we won't need something else again in the future. >> Another question to ask is whether to keep using separate >> interfaces for static and dynamic percpu variables or migrate to >> something which can take both. > > Well, IA64 can do stuff with static percpus that it can't do with > dynamic (assuming we get expanding dynamic percpu areas > later). That's because they use TLB tricks for a static 64k per-cpu > area, but this doesn't scale. That might not be vital: abandoning > that trick will mean they can't optimise read_percpu/read_percpu_var > etc as much. Isn't something like the following possible? #define pcpu_read(ptr) \ ({ \ if (__builtin_constant_p(ptr) && \ ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END) \ do 64k TLB trick for static pcpu; \ else \ do generic stuff; \ }) > Tejun, any chance of you updating the tj-percpu tree? My current > patches are against Linus's tree, and rebasing them on yours > involves some icky merging. If Ingo is okay with it, I'm fine with it too. Unless Ingo objects, I'll do it tomorrow-ish (still big holiday here). Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 2:24 ` Tejun Heo @ 2009-01-27 13:13 ` Ingo Molnar 2009-01-27 23:07 ` Tejun Heo 2009-01-27 20:08 ` Christoph Lameter 2009-01-28 10:38 ` Rusty Russell 2 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2009-01-27 13:13 UTC (permalink / raw) To: Tejun Heo Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers * Tejun Heo <tj@kernel.org> wrote: > > Tejun, any chance of you updating the tj-percpu tree? My current > > patches are against Linus's tree, and rebasing them on yours involves > > some icky merging. > > If Ingo is okay with it, I'm fine with it too. Unless Ingo objects, > I'll do it tomorrow-ish (still big holiday here). Sure - but please do it as a clean rebase ontop of latest, not as a conflict-merge, as i'd expect there to be quite many clashes. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 13:13 ` Ingo Molnar @ 2009-01-27 23:07 ` Tejun Heo 2009-01-28 3:36 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2009-01-27 23:07 UTC (permalink / raw) To: Ingo Molnar Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Ingo Molnar wrote: > * Tejun Heo <tj@kernel.org> wrote: > >>> Tejun, any chance of you updating the tj-percpu tree? My current >>> patches are against Linus's tree, and rebasing them on yours involves >>> some icky merging. >> If Ingo is okay with it, I'm fine with it too. Unless Ingo objects, >> I'll do it tomorrow-ish (still big holiday here). > > Sure - but please do it as a clean rebase ontop of latest, not as a > conflict-merge, as i'd expect there to be quite many clashes. Alrighty, re-basing. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 23:07 ` Tejun Heo @ 2009-01-28 3:36 ` Tejun Heo 2009-01-28 8:12 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2009-01-28 3:36 UTC (permalink / raw) To: Ingo Molnar Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Tejun Heo wrote: > Ingo Molnar wrote: >> * Tejun Heo <tj@kernel.org> wrote: >> >>>> Tejun, any chance of you updating the tj-percpu tree? My current >>>> patches are against Linus's tree, and rebasing them on yours involves >>>> some icky merging. >>> If Ingo is okay with it, I'm fine with it too. Unless Ingo objects, >>> I'll do it tomorrow-ish (still big holiday here). >> Sure - but please do it as a clean rebase ontop of latest, not as a >> conflict-merge, as i'd expect there to be quite many clashes. > > Alrighty, re-basing. Okay, rebased on top of the current master[1]. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu The rebased commit ID is e7f538f3d2a6b3a0a632190a7d736730ee10909c. There were several changes which didn't fit the rebased tree as they really belong to other branches in the x86 tree. I'll re-submit them as patches against the appropriate branch. Thanks. -- tejun [1] 5ee810072175042775e39bdd3eaaa68884c27805 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 3:36 ` Tejun Heo @ 2009-01-28 8:12 ` Tejun Heo 0 siblings, 0 replies; 32+ messages in thread From: Tejun Heo @ 2009-01-28 8:12 UTC (permalink / raw) To: Ingo Molnar Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Tejun Heo wrote: > Okay, rebased on top of the current master[1]. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu > > The rebased commit ID is e7f538f3d2a6b3a0a632190a7d736730ee10909c. > > There were several changes which didn't fit the rebased tree as they > really belong to other branches in the x86 tree. I'll re-submit them > as patches against the appropriate branch. Two patches against #stackprotector and one against #cpus4096 sent out. With these three patches, most of stuff has been preserved but there are still few missing pieces which only make sense after trees are merged. Will be happy to drive it if the current branch looks fine. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 2:24 ` Tejun Heo 2009-01-27 13:13 ` Ingo Molnar @ 2009-01-27 20:08 ` Christoph Lameter 2009-01-27 21:47 ` David Miller 2009-01-28 10:38 ` Rusty Russell 2 siblings, 1 reply; 32+ messages in thread From: Christoph Lameter @ 2009-01-27 20:08 UTC (permalink / raw) To: Tejun Heo Cc: Rusty Russell, Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers On Tue, 27 Jan 2009, Tejun Heo wrote: > > later). That's because they use TLB tricks for a static 64k per-cpu > > area, but this doesn't scale. That might not be vital: abandoning > > that trick will mean they can't optimise read_percpu/read_percpu_var > > etc as much. Why wont it scale? this is a separate TLB entry for each processor. > > Isn't something like the following possible? > > #define pcpu_read(ptr) \ > ({ \ > if (__builtin_constant_p(ptr) && \ > ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END) \ > do 64k TLB trick for static pcpu; \ > else \ > do generic stuff; \ > }) The TLB trick is just to access the percpu data at a fixed base. I.e. value = SHIFT_PERCPU_PTR(percpu_var, FIXED_ADDRESS); ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 20:08 ` Christoph Lameter @ 2009-01-27 21:47 ` David Miller 2009-01-27 22:47 ` Rick Jones 2009-01-28 16:45 ` Christoph Lameter 0 siblings, 2 replies; 32+ messages in thread From: David Miller @ 2009-01-27 21:47 UTC (permalink / raw) To: cl Cc: tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers From: Christoph Lameter <cl@linux-foundation.org> Date: Tue, 27 Jan 2009 15:08:57 -0500 (EST) > On Tue, 27 Jan 2009, Tejun Heo wrote: > > > > later). That's because they use TLB tricks for a static 64k per-cpu > > > area, but this doesn't scale. That might not be vital: abandoning > > > that trick will mean they can't optimise read_percpu/read_percpu_var > > > etc as much. > > Why wont it scale? this is a separate TLB entry for each processor. The IA64 per-cpu TLB entry only covers 64k which makes use of it for dynamic per-cpu stuff out of the question. That's why it "doesn't scale" ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 21:47 ` David Miller @ 2009-01-27 22:47 ` Rick Jones 2009-01-28 0:17 ` Luck, Tony 2009-01-28 16:45 ` Christoph Lameter 1 sibling, 1 reply; 32+ messages in thread From: Rick Jones @ 2009-01-27 22:47 UTC (permalink / raw) To: David Miller Cc: cl, tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers, linux-ia64@vger.kernel.org David Miller wrote: > From: Christoph Lameter <cl@linux-foundation.org> > Date: Tue, 27 Jan 2009 15:08:57 -0500 (EST) > > >>On Tue, 27 Jan 2009, Tejun Heo wrote: >> >> >>>>later). That's because they use TLB tricks for a static 64k per-cpu >>>>area, but this doesn't scale. That might not be vital: abandoning >>>>that trick will mean they can't optimise read_percpu/read_percpu_var >>>>etc as much. >> >>Why wont it scale? this is a separate TLB entry for each processor. > > > The IA64 per-cpu TLB entry only covers 64k which makes use of it for > dynamic per-cpu stuff out of the question. That's why it "doesn't > scale" I was asking around, and was told that on IA64 *harware* at least, in addition to supporting multiple page sizes (up to a GB IIRC), one can pin up to 8 or perhaps 1/2 the TLB entries. So, in theory if one were so inclined the special pinned per-CPU entry could either be more than one 64K entry, or a single, rather larger entry. As a sanity check, I've cc'd linux-ia64. rick jones ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 22:47 ` Rick Jones @ 2009-01-28 0:17 ` Luck, Tony 2009-01-28 16:48 ` Christoph Lameter 0 siblings, 1 reply; 32+ messages in thread From: Luck, Tony @ 2009-01-28 0:17 UTC (permalink / raw) To: Rick Jones, David Miller Cc: cl@linux-foundation.org, tj@kernel.org, rusty@rustcorp.com.au, mingo@elte.hu, herbert@gondor.apana.org.au, akpm@linux-foundation.org, hpa@zytor.com, brgerst@gmail.com, ebiederm@xmission.com, travis@sgi.com, linux-kernel@vger.kernel.org, steiner@sgi.com, hugh@veritas.com, netdev@vger.kernel.org, mathieu.desnoyers@polymtl.ca, linux-ia64@vger.kernel.org > I was asking around, and was told that on IA64 *harware* at least, in addition to > supporting multiple page sizes (up to a GB IIRC), one can pin up to 8 or perhaps > 1/2 the TLB entries. The number of TLB entries that can be pinned is model specific (but the minimum allowed is 8 each for code & data). Page sizes supported also vary by model, recent models go up to 4GB. BUT ... we stopped pinning this entry in the TLB when benchmarks showed that it was better to just insert this as a regular TLB entry which might can be dropped to map something else. So now there is code in the Alt-DTLB miss handler (ivt.S) to insert the per-cpu mapping on an as needed basis. > So, in theory if one were so inclined the special pinned per-CPU entry could > either be more than one 64K entry, or a single, rather larger entry. Managing a larger space could be done ... but at the expense of making the Alt-DTLB miss handler do a memory lookup to find the physical address of the per-cpu page needed (assuming that we allocate a bunch of random physical pages for use as per-cpu space rather than a single contiguous block of physical memory). When do we know the total amount of per-cpu memory needed? 1) CONFIG time? 2) Boot time? 3) Arbitrary run time? -Tony ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 0:17 ` Luck, Tony @ 2009-01-28 16:48 ` Christoph Lameter 2009-01-28 17:15 ` Luck, Tony 0 siblings, 1 reply; 32+ messages in thread From: Christoph Lameter @ 2009-01-28 16:48 UTC (permalink / raw) To: Luck, Tony Cc: Rick Jones, David Miller, tj@kernel.org, rusty@rustcorp.com.au, mingo@elte.hu, herbert@gondor.apana.org.au, akpm@linux-foundation.org, hpa@zytor.com, brgerst@gmail.com, ebiederm@xmission.com, travis@sgi.com, linux-kernel@vger.kernel.org, steiner@sgi.com, hugh@veritas.com, netdev@vger.kernel.org, mathieu.desnoyers@polymtl.ca, linux-ia64@vger.kernel.org On Tue, 27 Jan 2009, Luck, Tony wrote: > Managing a larger space could be done ... but at the expense of making > the Alt-DTLB miss handler do a memory lookup to find the physical address > of the per-cpu page needed (assuming that we allocate a bunch of random > physical pages for use as per-cpu space rather than a single contiguous > block of physical memory). We cannot resize the area by using a single larger TLB entry? > When do we know the total amount of per-cpu memory needed? > 1) CONFIG time? Would be easiest. > 2) Boot time? We could make the TLB entry size configurable with a kernel parameter > 3) Arbitrary run time? We could reserve a larger virtual space and switch TLB entries as needed? We would need do get larger order pages to do this. ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 16:48 ` Christoph Lameter @ 2009-01-28 17:15 ` Luck, Tony 0 siblings, 0 replies; 32+ messages in thread From: Luck, Tony @ 2009-01-28 17:15 UTC (permalink / raw) To: Christoph Lameter Cc: Rick Jones, David Miller, tj@kernel.org, rusty@rustcorp.com.au, mingo@elte.hu, herbert@gondor.apana.org.au, akpm@linux-foundation.org, hpa@zytor.com, brgerst@gmail.com, ebiederm@xmission.com, travis@sgi.com, linux-kernel@vger.kernel.org, steiner@sgi.com, hugh@veritas.com, netdev@vger.kernel.org, mathieu.desnoyers@polymtl.ca, linux-ia64@vger.kernel.org > > Managing a larger space could be done ... but at the expense of making > > the Alt-DTLB miss handler do a memory lookup to find the physical address > > of the per-cpu page needed (assuming that we allocate a bunch of random > > physical pages for use as per-cpu space rather than a single contiguous > > block of physical memory). > > We cannot resize the area by using a single larger TLB entry? Yes we could ... the catch is that the supported TLB page sizes go up by multiples of 4. So if 64K is not enough the next stop is at 256K, then 1M, then 4M, and so on. That's why I asked when we know what total amount of per-cpu memory is needed. CONFIG time or boot time is good, because allocating higher order pages is easy at boot time. Arbitrary run time is bad because we might have difficulty allocating a high order page for every cpu. -Tony ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 21:47 ` David Miller 2009-01-27 22:47 ` Rick Jones @ 2009-01-28 16:45 ` Christoph Lameter 2009-01-28 20:47 ` David Miller 1 sibling, 1 reply; 32+ messages in thread From: Christoph Lameter @ 2009-01-28 16:45 UTC (permalink / raw) To: David Miller Cc: tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers On Tue, 27 Jan 2009, David Miller wrote: > > Why wont it scale? this is a separate TLB entry for each processor. > > The IA64 per-cpu TLB entry only covers 64k which makes use of it for > dynamic per-cpu stuff out of the question. That's why it "doesn't > scale" IA64 supports varying page sizes. You can use a 128k TLB entry etc. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 16:45 ` Christoph Lameter @ 2009-01-28 20:47 ` David Miller 0 siblings, 0 replies; 32+ messages in thread From: David Miller @ 2009-01-28 20:47 UTC (permalink / raw) To: cl Cc: tj, rusty, mingo, herbert, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers From: Christoph Lameter <cl@linux-foundation.org> Date: Wed, 28 Jan 2009 11:45:28 -0500 (EST) > On Tue, 27 Jan 2009, David Miller wrote: > > > > Why wont it scale? this is a separate TLB entry for each processor. > > > > The IA64 per-cpu TLB entry only covers 64k which makes use of it for > > dynamic per-cpu stuff out of the question. That's why it "doesn't > > scale" > > IA64 supports varying page sizes. You can use a 128k TLB entry etc. Good luck moving to a larger size dynamically at run time. It really isn't a tenable solution. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-27 2:24 ` Tejun Heo 2009-01-27 13:13 ` Ingo Molnar 2009-01-27 20:08 ` Christoph Lameter @ 2009-01-28 10:38 ` Rusty Russell 2009-01-28 10:56 ` Tejun Heo 2009-01-28 16:50 ` Christoph Lameter 2 siblings, 2 replies; 32+ messages in thread From: Rusty Russell @ 2009-01-28 10:38 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers On Tuesday 27 January 2009 12:54:27 Tejun Heo wrote: > Hello, Rusty. Hi Tejun! > There actually were quite some places where atomic add ops would be > useful, especially the places where statistics are collected. For > logical bitops, I don't think we'll have too many of them. If the stats are only manipulated in one context, than an atomic requirement is overkill (and expensive on non-x86). > > If they are worth doing generically, should the ops be atomic? To > > extrapolate from x86 usages again, it seems to be happy with > > non-atomic (tho of course it is atomic on x86). > > If atomic rw/add/sub are implementible on most archs (and judging from > local_t, I suppose it is), I think it should. So that it can replace > local_t and we won't need something else again in the future. This is more like Christoph's CPU_OPS: they were special operators on normal per-cpu vars/ptrs. Generic version was irqsave+op+irqrestore. I actually like this idea, but Mathieu insists that the ops be NMI-safe, for ftrace. Hence local_t needing to be atomic_t for generic code. AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care). Other than the shouting, I liked Christoph's system: - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) - __CPU_INC = not safe against anything (eg. per_cpu(i)++) I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed. > >> Another question to ask is whether to keep using separate > >> interfaces for static and dynamic percpu variables or migrate to > >> something which can take both. > > > > Well, IA64 can do stuff with static percpus that it can't do with > > dynamic (assuming we get expanding dynamic percpu areas > > later). That's because they use TLB tricks for a static 64k per-cpu > > area, but this doesn't scale. That might not be vital: abandoning > > that trick will mean they can't optimise read_percpu/read_percpu_var > > etc as much. > > Isn't something like the following possible? > > #define pcpu_read(ptr) \ > ({ \ > if (__builtin_constant_p(ptr) && \ > ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END) \ > do 64k TLB trick for static pcpu; \ > else \ > do generic stuff; \ > }) No, that will be "do generic stuff", since it's a link-time constant. I don't know that this is a huge worry, to be honest. We can leave the __ia64_per_cpu_var for their arch-specific code (I feel the same way about x86 to be honest). > > Tejun, any chance of you updating the tj-percpu tree? My current > > patches are against Linus's tree, and rebasing them on yours > > involves some icky merging. > > If Ingo is okay with it, I'm fine with it too. Unless Ingo objects, > I'll do it tomorrow-ish (still big holiday here). Ah, I did not realize that you celebrated Australia day :) Cheers! Rusty. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 10:38 ` Rusty Russell @ 2009-01-28 10:56 ` Tejun Heo 2009-01-29 2:06 ` Rusty Russell 2009-01-28 16:50 ` Christoph Lameter 1 sibling, 1 reply; 32+ messages in thread From: Tejun Heo @ 2009-01-28 10:56 UTC (permalink / raw) To: Rusty Russell Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Hello, Rusty Russell wrote: > On Tuesday 27 January 2009 12:54:27 Tejun Heo wrote: >> Hello, Rusty. > > Hi Tejun! > >> There actually were quite some places where atomic add ops would be >> useful, especially the places where statistics are collected. For >> logical bitops, I don't think we'll have too many of them. > > If the stats are only manipulated in one context, than an atomic > requirement is overkill (and expensive on non-x86). Yes, it is. I was hoping it to be not more expensive on most archs. It isn't on x86 at the very least but I don't know much about other archs. >>> If they are worth doing generically, should the ops be atomic? To >>> extrapolate from x86 usages again, it seems to be happy with >>> non-atomic (tho of course it is atomic on x86). >> If atomic rw/add/sub are implementible on most archs (and judging from >> local_t, I suppose it is), I think it should. So that it can replace >> local_t and we won't need something else again in the future. > > This is more like Christoph's CPU_OPS: they were special operators > on normal per-cpu vars/ptrs. Generic version was > irqsave+op+irqrestore. > > I actually like this idea, but Mathieu insists that the ops be > NMI-safe, for ftrace. Hence local_t needing to be atomic_t for > generic code. > > AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use > atomic_t in ftrace (which isn't NMI safe on parisc or sparc/32 > anyway, but I don't think we care). Requiring NMI-safeness is quite an exception, I suppose. I don't think we should design around it. If it can be worked around one way or the other, it should be fine. > Other than the shouting, I liked Christoph's system: > - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) > - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) > - __CPU_INC = not safe against anything (eg. per_cpu(i)++) > > I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed. I like local better too but no biggies one way or the other. >>>> Another question to ask is whether to keep using separate >>>> interfaces for static and dynamic percpu variables or migrate to >>>> something which can take both. >>> Well, IA64 can do stuff with static percpus that it can't do with >>> dynamic (assuming we get expanding dynamic percpu areas >>> later). That's because they use TLB tricks for a static 64k per-cpu >>> area, but this doesn't scale. That might not be vital: abandoning >>> that trick will mean they can't optimise read_percpu/read_percpu_var >>> etc as much. >> Isn't something like the following possible? >> >> #define pcpu_read(ptr) \ >> ({ \ >> if (__builtin_constant_p(ptr) && \ >> ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END) \ >> do 64k TLB trick for static pcpu; \ >> else \ >> do generic stuff; \ >> }) > > No, that will be "do generic stuff", since it's a link-time > constant. I don't know that this is a huge worry, to be honest. We > can leave the __ia64_per_cpu_var for their arch-specific code (I > feel the same way about x86 to be honest). Yes, right. Got confused there. Hmmm... looks like what would work there is "is it a lvalue?" test. Well, anyways, if it isn't necessary. >>> Tejun, any chance of you updating the tj-percpu tree? My current >>> patches are against Linus's tree, and rebasing them on yours >>> involves some icky merging. >> If Ingo is okay with it, I'm fine with it too. Unless Ingo objects, >> I'll do it tomorrow-ish (still big holiday here). > > Ah, I did not realize that you celebrated Australia day :) Hey, didn't know Australia was founded on lunar New Year's day. Nice. :-) Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 10:56 ` Tejun Heo @ 2009-01-29 2:06 ` Rusty Russell 2009-01-31 6:11 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Rusty Russell @ 2009-01-29 2:06 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers On Wednesday 28 January 2009 21:26:34 Tejun Heo wrote: > Hello, Hi Tejun, > Rusty Russell wrote: > > If the stats are only manipulated in one context, than an atomic > > requirement is overkill (and expensive on non-x86). > > Yes, it is. I was hoping it to be not more expensive on most archs. > It isn't on x86 at the very least but I don't know much about other > archs. Hmm, you can garner this from the local_t stats which were flying around. (see Re: local_add_return from me), or look in the preamble to http://ozlabs.org/~rusty/kernel/rr-latest/misc:test-local_t.patch ). Of course, if you want to be my hero, you could implement "soft" irq disable for all archs, which would make this cheaper. > > Other than the shouting, I liked Christoph's system: > > - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) > > - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) > > - __CPU_INC = not safe against anything (eg. per_cpu(i)++) > > > > I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed. > > I like local better too but no biggies one way or the other. Maybe kill local_t and take the name back. I'll leave it to you... > > Ah, I did not realize that you celebrated Australia day :) > > Hey, didn't know Australia was founded on lunar New Year's day. > Nice. :-) That would have been cool, but no; first time in 76 years they matched tho. Thanks, Rusty. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-29 2:06 ` Rusty Russell @ 2009-01-31 6:11 ` Tejun Heo 0 siblings, 0 replies; 32+ messages in thread From: Tejun Heo @ 2009-01-31 6:11 UTC (permalink / raw) To: Rusty Russell Cc: Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Hello, Rusty. Rusty Russell wrote: >> Rusty Russell wrote: >>> If the stats are only manipulated in one context, than an atomic >>> requirement is overkill (and expensive on non-x86). >> Yes, it is. I was hoping it to be not more expensive on most archs. >> It isn't on x86 at the very least but I don't know much about other >> archs. > > Hmm, you can garner this from the local_t stats which were flying around. > (see Re: local_add_return from me), or look in the preamble to > http://ozlabs.org/~rusty/kernel/rr-latest/misc:test-local_t.patch ). Ah... Great. > Of course, if you want to be my hero, you could implement "soft" irq > disable for all archs, which would make this cheaper. I suppose you mean deferred execution of interrupt handlers for quick atomicities. Yeah, that would be nice for things like this. >>> Other than the shouting, I liked Christoph's system: >>> - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) >>> - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) >>> - __CPU_INC = not safe against anything (eg. per_cpu(i)++) >>> >>> I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed. >> I like local better too but no biggies one way or the other. > > Maybe kill local_t and take the name back. I'll leave it to you... > >>> Ah, I did not realize that you celebrated Australia day :) >> Hey, didn't know Australia was founded on lunar New Year's day. >> Nice. :-) > > That would have been cool, but no; first time in 76 years they matched tho. It was a joke. :-) Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 10:38 ` Rusty Russell 2009-01-28 10:56 ` Tejun Heo @ 2009-01-28 16:50 ` Christoph Lameter 2009-01-28 18:07 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Christoph Lameter @ 2009-01-28 16:50 UTC (permalink / raw) To: Rusty Russell Cc: Tejun Heo, Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers On Wed, 28 Jan 2009, Rusty Russell wrote: > AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t > in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care). Right. > Other than the shouting, I liked Christoph's system: > - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) > - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) > - __CPU_INC = not safe against anything (eg. per_cpu(i)++) > > I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed. The term cpu is meaning multiple things at this point. So yes it may be better to go with glibc naming of thread local space. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 16:50 ` Christoph Lameter @ 2009-01-28 18:07 ` Mathieu Desnoyers 2009-01-29 18:33 ` Christoph Lameter 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-01-28 18:07 UTC (permalink / raw) To: Christoph Lameter Cc: Rusty Russell, Tejun Heo, Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, David S. Miller, netdev * Christoph Lameter (cl@linux-foundation.org) wrote: > On Wed, 28 Jan 2009, Rusty Russell wrote: > > > AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t > > in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care). > > Right. > Ideally this should be done transparently so we don't have to #ifdef code around HAVE_NMISAFE_CPUOPS everywhere in the tracer. We might consider declaring an intermediate type with this kind of #ifdef in the tracer code if we are the only one user though. > > > Other than the shouting, I liked Christoph's system: > > - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) > > - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) > > - __CPU_INC = not safe against anything (eg. per_cpu(i)++) > > > > I prefer the name 'local' to the name 'cpu', but I'm not hugely fussed. > > The term cpu is meaning multiple things at this point. So yes it may be > better to go with glibc naming of thread local space. > However using "local" for "per-cpu" could be confusing with the glibc naming of thread local space, because "per-thread" and "per-cpu" variables are different from a synchronization POV and we can end up needing both (e.g. a thread local variable can never be accessed by another thread, but a cpu local variable could be accessed by a different CPU due to scheduling). I'm currently thinking about implementing user-space per-cpu tracing buffers, and the last thing I'd like is to have a "local" semantic clash between the kernel and glibc. Ideally, we could have something like : Atomic safe primitives (emulated with irq disable if the architecture does not have atomic primitives) : - atomic_thread_inc() * current mainline local_t local_inc(). - atomic_cpu_inc() * Your proposed CPU_INC. Non-safe against interrupts, but safe against preemption : - thread_inc() * no preempt_disable involved, because this deals with per-thread variables : * Simple var++ - cpu_inc() * disables preemption, per_cpu(i)++, enables preemption Not safe against preemption nor interrupts : - _thread_inc() * maps to thread_inc() - _cpu_inc() * simple per_cpu(i)++ So maybe we don't really need thread_inc(), _thread_inc() and _cpu_inc, because they map to standard C operations, but we may find that in some architectures that the atomic_cpu_inc() is faster than the per_cpu(i)++, and in those cases it would make sense to map e.g. _cpu_inc() to atomic_cpu_inc(). Also note that don't think adding _ and __ prefixes to the operations makes it clear for the programmer and reviewer what is safe against what. OMHO, it will just make the code more obscure. One level of underscore is about the limit I think people can "know" what this "unsafe" version of the primitive does. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-28 18:07 ` Mathieu Desnoyers @ 2009-01-29 18:33 ` Christoph Lameter 2009-01-29 18:48 ` H. Peter Anvin 0 siblings, 1 reply; 32+ messages in thread From: Christoph Lameter @ 2009-01-29 18:33 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Rusty Russell, Tejun Heo, Ingo Molnar, Herbert Xu, akpm, hpa, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, David S. Miller, netdev On Wed, 28 Jan 2009, Mathieu Desnoyers wrote: > > The term cpu is meaning multiple things at this point. So yes it may be > > better to go with glibc naming of thread local space. > > > > However using "local" for "per-cpu" could be confusing with the glibc > naming of thread local space, because "per-thread" and "per-cpu" > variables are different from a synchronization POV and we can end up > needing both (e.g. a thread local variable can never be accessed by > another thread, but a cpu local variable could be accessed by a > different CPU due to scheduling). gcc/glibc support a __thread attribute to variables. As far as I can tell this automatically makes gcc perform the relocation to the current context using a segment register. But its a weird ABI http://people.redhat.com/drepper/tls.pdf. After reading that I agree that we should stay with the cpu ops and forget about the local and thread stuff in gcc/glibc. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-29 18:33 ` Christoph Lameter @ 2009-01-29 18:48 ` H. Peter Anvin 0 siblings, 0 replies; 32+ messages in thread From: H. Peter Anvin @ 2009-01-29 18:48 UTC (permalink / raw) To: Christoph Lameter Cc: Mathieu Desnoyers, Rusty Russell, Tejun Heo, Ingo Molnar, Herbert Xu, akpm, brgerst, ebiederm, travis, linux-kernel, steiner, hugh, David S. Miller, netdev Christoph Lameter wrote: > > gcc/glibc support a __thread attribute to variables. As far as I can tell > this automatically makes gcc perform the relocation to the current > context using a segment register. > > But its a weird ABI http://people.redhat.com/drepper/tls.pdf. After > reading that I agree that we should stay with the cpu ops and forget about > the local and thread stuff in gcc/glibc. > We have discussed this a number of times. There are several issues with it; one being that per-cpu != per-thread (we can switch CPU whereever we're preempted), and another that many versions of gcc uses hardcoded registers, in particular %fs on 64 bits, but the kernel *has* to use %gs since there is no swapfs instruction. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors [not found] ` <200901201328.24605.rusty@rustcorp.com.au> 2009-01-20 6:25 ` Tejun Heo @ 2009-01-20 10:40 ` Ingo Molnar 2009-01-21 5:52 ` Tejun Heo 1 sibling, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2009-01-20 10:40 UTC (permalink / raw) To: Rusty Russell Cc: Herbert Xu, akpm, tj, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers * Rusty Russell <rusty@rustcorp.com.au> wrote: > On Saturday 17 January 2009 08:38:32 Ingo Molnar wrote: > > How do difficulties of dynamic percpu-alloc make my above suggestion > > unsuitable for SNMP stats in networking? Most of those stats are not > > dynamically allocated - they are plain straightforward percpu variables. > > No they're not. [...] hm, i see this got changed recently - part of the networking stats went over to the lib/percpu_counter API recently. The larger point still remains: the kernel dominantly uses static percpu variables by a margin of 10 to 1, so we cannot just brush away the static percpu variables and must concentrate on optimizing that side with priority. It's nice if the dynamic percpu-alloc side improves as well, of course. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-20 10:40 ` Ingo Molnar @ 2009-01-21 5:52 ` Tejun Heo 2009-01-21 10:05 ` Ingo Molnar 2009-01-21 11:21 ` Eric W. Biederman 0 siblings, 2 replies; 32+ messages in thread From: Tejun Heo @ 2009-01-21 5:52 UTC (permalink / raw) To: Ingo Molnar Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Ingo Molnar wrote: > The larger point still remains: the kernel dominantly uses static percpu > variables by a margin of 10 to 1, so we cannot just brush away the static > percpu variables and must concentrate on optimizing that side with > priority. It's nice if the dynamic percpu-alloc side improves as well, of > course. Well, the infrequent usage of dynamic percpu allocation is in some part due to the poor implementation, so it's sort of chicken and egg problem. I got into this percpu thing because I wanted a percpu reference count which can be dynamically allocated and it sucked. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-21 5:52 ` Tejun Heo @ 2009-01-21 10:05 ` Ingo Molnar 2009-01-21 11:21 ` Eric W. Biederman 1 sibling, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2009-01-21 10:05 UTC (permalink / raw) To: Tejun Heo Cc: Rusty Russell, Herbert Xu, akpm, hpa, brgerst, ebiederm, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers * Tejun Heo <tj@kernel.org> wrote: > Ingo Molnar wrote: > > The larger point still remains: the kernel dominantly uses static percpu > > variables by a margin of 10 to 1, so we cannot just brush away the static > > percpu variables and must concentrate on optimizing that side with > > priority. It's nice if the dynamic percpu-alloc side improves as well, of > > course. > > Well, the infrequent usage of dynamic percpu allocation is in some part > due to the poor implementation, so it's sort of chicken and egg problem. > I got into this percpu thing because I wanted a percpu reference count > which can be dynamically allocated and it sucked. Sure, but even static percpu sucked very much (it expanded to half a dozen or more instructions), and dynamic is _more_ complex. Anyway, it's getting fixed now :-) Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-21 5:52 ` Tejun Heo 2009-01-21 10:05 ` Ingo Molnar @ 2009-01-21 11:21 ` Eric W. Biederman 2009-01-21 12:45 ` Stephen Hemminger 1 sibling, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2009-01-21 11:21 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Rusty Russell, Herbert Xu, akpm, hpa, brgerst, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Tejun Heo <tj@kernel.org> writes: > Ingo Molnar wrote: >> The larger point still remains: the kernel dominantly uses static percpu >> variables by a margin of 10 to 1, so we cannot just brush away the static >> percpu variables and must concentrate on optimizing that side with >> priority. It's nice if the dynamic percpu-alloc side improves as well, of >> course. > > Well, the infrequent usage of dynamic percpu allocation is in some > part due to the poor implementation, so it's sort of chicken and egg > problem. I got into this percpu thing because I wanted a percpu > reference count which can be dynamically allocated and it sucked. Counters are our other special case, and counters are interesting because they are individually very small. I just looked and the vast majority of the alloc_percpu users are counters. I just did a rough count in include/linux/snmp.h and I came up with 171*2 counters. At 8 bytes per counter that is roughly 2.7K. Or about two thirds of a 4K page. What makes this is a challenge is that those counters are per network namespace, and there are no static limits on the number of network namespaces. If we push the system and allocate 1024 network namespaces we wind up needing 2.7MB per cpu, just for the SNMP counters. Which nicely illustrates the principle that typically each individual per cpu allocation is small, but with dynamic allocation we have the challenge that number of allocations becomes unbounded and in some cases could be large, while the typical per cpu size is likely to be very small. I wonder if for the specific case of counters it might make sense to simply optimize the per cpu allocator for machine word size allocations and allocate each counter individually freeing us from the burden of worrying about fragmentation. The pain with the current alloc_percpu implementation when working with counters is that it has to allocate an array with one entry for each cpu to point at the per cpu data. Which isn't especially efficient. Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-21 11:21 ` Eric W. Biederman @ 2009-01-21 12:45 ` Stephen Hemminger 2009-01-21 14:13 ` Eric W. Biederman 2009-01-21 20:34 ` David Miller 0 siblings, 2 replies; 32+ messages in thread From: Stephen Hemminger @ 2009-01-21 12:45 UTC (permalink / raw) To: Eric W. Biederman Cc: Tejun Heo, Ingo Molnar, Rusty Russell, Herbert Xu, akpm, hpa, brgerst, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers On Wed, 21 Jan 2009 03:21:23 -0800 ebiederm@xmission.com (Eric W. Biederman) wrote: > Tejun Heo <tj@kernel.org> writes: > > > Ingo Molnar wrote: > >> The larger point still remains: the kernel dominantly uses static percpu > >> variables by a margin of 10 to 1, so we cannot just brush away the static > >> percpu variables and must concentrate on optimizing that side with > >> priority. It's nice if the dynamic percpu-alloc side improves as well, of > >> course. > > > > Well, the infrequent usage of dynamic percpu allocation is in some > > part due to the poor implementation, so it's sort of chicken and egg > > problem. I got into this percpu thing because I wanted a percpu > > reference count which can be dynamically allocated and it sucked. > > Counters are our other special case, and counters are interesting > because they are individually very small. I just looked and the vast > majority of the alloc_percpu users are counters. > > I just did a rough count in include/linux/snmp.h and I came > up with 171*2 counters. At 8 bytes per counter that is roughly > 2.7K. Or about two thirds of a 4K page. This is crap. only a small fraction of these SNMP counters are close enough to the hot path to deserve per-cpu treatment. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-21 12:45 ` Stephen Hemminger @ 2009-01-21 14:13 ` Eric W. Biederman 2009-01-21 20:34 ` David Miller 1 sibling, 0 replies; 32+ messages in thread From: Eric W. Biederman @ 2009-01-21 14:13 UTC (permalink / raw) To: Stephen Hemminger Cc: Tejun Heo, Ingo Molnar, Rusty Russell, Herbert Xu, akpm, hpa, brgerst, cl, travis, linux-kernel, steiner, hugh, David S. Miller, netdev, Mathieu Desnoyers Stephen Hemminger <shemminger@vyatta.com> writes: > This is crap. only a small fraction of these SNMP counters are > close enough to the hot path to deserve per-cpu treatment. Interesting. I was just reading af_inet.c:ipv4_mib_init_net() to get a feel for what a large consumer of percpu counters looks like. I expect the patterns I saw will hold even if the base size is much smaller. Your statement reinforces my point that our allocations in a per cpu area are small and our dynamic per cpu allocator is not really optimized for small allocations. In most places what we want are 1-5 counters, which is max 40 bytes. And our minimum allocation is a full cache line (64-128 bytes) per allocation. I'm wondering if dynamic per cpu allocations should work more like slab. Have a set of percpu base pointers for an area, and return an area + offset in some convenient form. Ideally we would just have one area (allowing us to keep the base pointer in a register), but I'm not convinced that doesn't fall down in the face of dynamic allocation. Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] percpu: add optimized generic percpu accessors 2009-01-21 12:45 ` Stephen Hemminger 2009-01-21 14:13 ` Eric W. Biederman @ 2009-01-21 20:34 ` David Miller 1 sibling, 0 replies; 32+ messages in thread From: David Miller @ 2009-01-21 20:34 UTC (permalink / raw) To: shemminger Cc: ebiederm, tj, mingo, rusty, herbert, akpm, hpa, brgerst, cl, travis, linux-kernel, steiner, hugh, netdev, mathieu.desnoyers From: Stephen Hemminger <shemminger@vyatta.com> Date: Wed, 21 Jan 2009 23:45:01 +1100 > This is crap. only a small fraction of these SNMP counters are > close enough to the hot path to deserve per-cpu treatment. Only if your router/firewall/webserver isn't hitting that code path which bumps the counters you think aren't hot path. It's a micro-DoS waiting to happen if we start trying to split counters up into groups which matter for hot path processing (and thus use per-cpu stuff) and those which don't (and thus use atomics or whatever your idea is). ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2009-01-31 6:12 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20090115183942.GA6325@elte.hu>
[not found] ` <20090116001200.GA9137@gondor.apana.org.au>
[not found] ` <20090116001544.GA11073@elte.hu>
2009-01-16 0:18 ` [PATCH] percpu: add optimized generic percpu accessors Herbert Xu
[not found] ` <200901170827.33729.rusty@rustcorp.com.au>
2009-01-16 22:08 ` Ingo Molnar
[not found] ` <200901201328.24605.rusty@rustcorp.com.au>
2009-01-20 6:25 ` Tejun Heo
2009-01-20 10:36 ` Ingo Molnar
[not found] ` <200901271213.18605.rusty@rustcorp.com.au>
2009-01-27 2:24 ` Tejun Heo
2009-01-27 13:13 ` Ingo Molnar
2009-01-27 23:07 ` Tejun Heo
2009-01-28 3:36 ` Tejun Heo
2009-01-28 8:12 ` Tejun Heo
2009-01-27 20:08 ` Christoph Lameter
2009-01-27 21:47 ` David Miller
2009-01-27 22:47 ` Rick Jones
2009-01-28 0:17 ` Luck, Tony
2009-01-28 16:48 ` Christoph Lameter
2009-01-28 17:15 ` Luck, Tony
2009-01-28 16:45 ` Christoph Lameter
2009-01-28 20:47 ` David Miller
2009-01-28 10:38 ` Rusty Russell
2009-01-28 10:56 ` Tejun Heo
2009-01-29 2:06 ` Rusty Russell
2009-01-31 6:11 ` Tejun Heo
2009-01-28 16:50 ` Christoph Lameter
2009-01-28 18:07 ` Mathieu Desnoyers
2009-01-29 18:33 ` Christoph Lameter
2009-01-29 18:48 ` H. Peter Anvin
2009-01-20 10:40 ` Ingo Molnar
2009-01-21 5:52 ` Tejun Heo
2009-01-21 10:05 ` Ingo Molnar
2009-01-21 11:21 ` Eric W. Biederman
2009-01-21 12:45 ` Stephen Hemminger
2009-01-21 14:13 ` Eric W. Biederman
2009-01-21 20:34 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).