* Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 14:23 UTC
  To: linux-kernel, linux-arch
  Cc: Christoph Lameter, netdev, Netfilter Development Mailinglist

Hi

irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
(commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
netfilter fast path too, and eventually for the SNMP counters (to lower
their RAM needs by 50%).

Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.

It seems only the x86 arch uses a native and efficient implementation.

The others use irqsafe_cpu_generic_to_op() and its pair of
local_irq_save() / local_irq_restore().

Which other arches could use a native implementation?

What about defining a HAVE_FAST_IRQSAFE_ADD?

Thanks
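For reference, the generic fallback mentioned above looks roughly like this
(a sketch along the lines of that era's include/linux/percpu.h, not a
verbatim copy):

	#define irqsafe_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		unsigned long flags;					\
		/* mask interrupts so an irq handler on this cpu	\
		 * cannot interleave with the read-modify-write */	\
		local_irq_save(flags);					\
		*__this_cpu_ptr(&(pcp)) op val;				\
		local_irq_restore(flags);				\
	} while (0)

The local_irq_save()/local_irq_restore() pair is exactly the cost a native
implementation is meant to avoid.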
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 14:40 UTC
  To: Eric Dumazet
  Cc: linux-kernel, linux-arch, netdev, Netfilter Development Mailinglist

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
> (commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
> netfilter fast path too, and eventually for the SNMP counters (to lower
> their RAM needs by 50%).
>
> Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.
>
> It seems only the x86 arch uses a native and efficient implementation.

I have some old draft patches for IA64 that use fetchadd together with a
per-cpu virtual address range mapped differently for each processor. In
general, though, the problem with the other arches is that they have
neither instructions that avoid the expensive bus arbitration for a
cacheline nor segment overrides.

The segment-override effect, i.e. implicit relocation of the address into
the correct percpu area within one instruction, can also be achieved (as in
the IA64 case) if we have per-cpu page tables for the kernel and map the
same virtual address to different physical addresses on each cpu. Then the
address given to a percpu instruction is constant and the relocation is
performed implicitly by the MMU. Thus the relocation and the RMW operation
are "atomic": either both occur or neither does.

The other aspect is that the arch must have cheap increment (and other RMW)
instructions that avoid expensive in-cpu processing to establish a coherent
state of the cacheline.

> What about defining a HAVE_FAST_IRQSAFE_ADD?

Useful at some point, I think.
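To make the x86 case concrete, the native form boils down to a single
segment-prefixed instruction. A minimal sketch, simplified from the
percpu_add_op() pattern (the macro name here is made up; __percpu_arg(0)
expands to a %gs:/%fs: prefixed operand):

	/* var must be a DEFINE_PER_CPU variable so the symbol is
	 * relocated through the per-cpu segment base */
	#define my_irqsafe_add_4(var, val)			\
		asm("addl %1, "__percpu_arg(0)			\
		    : "+m" (var)				\
		    : "ri" ((u32)(val)))

Because the per-cpu relocation (via the segment base) and the add happen in
one instruction, an interrupt can only arrive before or after it, never in
the middle; that is the property the generic fallback has to buy with
local_irq_save()/local_irq_restore().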
* Re: Poll about irqsafe_cpu_add and others
  From: David Miller @ 2011-03-17 15:14 UTC
  To: eric.dumazet
  Cc: linux-kernel, linux-arch, cl, netdev, netfilter-devel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 Mar 2011 15:23:54 +0100

> Hi
>
> irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
> (commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
> netfilter fast path too, and eventually for the SNMP counters (to lower
> their RAM needs by 50%).
>
> Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.
>
> It seems only the x86 arch uses a native and efficient implementation.
>
> The others use irqsafe_cpu_generic_to_op() and its pair of
> local_irq_save() / local_irq_restore().
>
> Which other arches could use a native implementation?
>
> What about defining a HAVE_FAST_IRQSAFE_ADD?

I had been meaning to bring this up from another perspective.

In networking, we often only ever access objects in base or BH context.
Therefore, in the BH-context cases we can do plain counter bumps without
any of the special atomic or IRQ-disabling code at all.

This would help non-x86 architectures a lot, and depending upon how the
non-atomic case performs, perhaps x86 too.
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 15:18 UTC
  To: David Miller
  Cc: eric.dumazet, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, David Miller wrote:

> I had been meaning to bring this up from another perspective.
>
> In networking, we often only ever access objects in base or BH context.
> Therefore, in the BH-context cases we can do plain counter bumps without
> any of the special atomic or IRQ-disabling code at all.

We have the __ functions for that purpose. __this_cpu_inc(), for example,
falls back to a simple ++ operation if the arch cannot provide something
better. The irqsafe_xx operations are only needed when the context does not
provide any protection and there is the potential of the counter being
incremented from interrupt context.
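As a usage sketch (the mib[0]->mibs[field] access pattern is only
illustrative of the SNMP counters discussed below):

	/* hit only from softirq (BH) context: the context already excludes
	 * another updater on this cpu, so a plain per-cpu bump is enough */
	__this_cpu_inc(mib[0]->mibs[field]);

	/* may also be hit from hardirq or unknown context on this cpu:
	 * the irq-safe variant is needed; still a single instruction on
	 * x86, but local_irq_save()/restore() with the generic fallback */
	irqsafe_cpu_inc(mib[0]->mibs[field]);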
* Re: Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 16:53 UTC
  To: Christoph Lameter
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thursday, 17 March 2011 at 10:18 -0500, Christoph Lameter wrote:
> On Thu, 17 Mar 2011, David Miller wrote:
>
> > I had been meaning to bring this up from another perspective.
> >
> > In networking, we often only ever access objects in base or BH context.
> > Therefore, in the BH-context cases we can do plain counter bumps without
> > any of the special atomic or IRQ-disabling code at all.
>
> We have the __ functions for that purpose. __this_cpu_inc(), for example,
> falls back to a simple ++ operation if the arch cannot provide something
> better. The irqsafe_xx operations are only needed when the context does not
> provide any protection and there is the potential of the counter being
> incremented from interrupt context.

What David and I have in mind is to use one array per mib instead of two.
This is an old idea: https://patchwork.kernel.org/patch/15883/

When we know we run from BH context, we can use __this_cpu_inc(), but if we
don't know, or run from user/process context, we would need the irqsafe_inc
variant.

For x86 this maps to the same single instruction, but for other arches it
might be too expensive.

BTW, I think the following patch is possible, to save some text and some
useless tests (on 64-bit platforms at least):

size vmlinux.old vmlinux

Thanks

[PATCH] snmp: SNMP_UPD_PO_STATS_BH() always called from softirq

We don't need to test whether we run from softirq context.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
 include/net/snmp.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/snmp.h b/include/net/snmp.h
index 762e2ab..be2424d 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -149,8 +149,8 @@ struct linux_xfrm_mib {
 	} while (0)
 #define SNMP_UPD_PO_STATS_BH(mib, basefield, addend)	\
 	do { \
-		__typeof__(*mib[0]) *ptr = \
-			__this_cpu_ptr((mib)[!in_softirq()]); \
+		__typeof__(*mib[0]) *ptr = __this_cpu_ptr((mib)[0]);	\
+		\
 		ptr->mibs[basefield##PKTS]++; \
 		ptr->mibs[basefield##OCTETS] += addend;\
 	} while (0)
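For context, the one-array-per-mib idea would turn the non-BH increment into
something like this (a hypothetical sketch, not the patch that was actually
merged):

	/* single percpu array: safe from any context only because the
	 * increment itself is irq-safe */
	#define SNMP_INC_STATS(mib, field)		\
		irqsafe_cpu_inc(mib[0]->mibs[field])

which is why the cost of irqsafe_cpu_inc() on the non-x86 arches decides
whether dropping the second array is a win.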
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 17:46 UTC
  To: Eric Dumazet
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> When we know we run from BH context, we can use __this_cpu_inc(), but if we
> don't know, or run from user/process context, we would need the irqsafe_inc
> variant.

If the BH context is the only one where we care about performance, then that
is OK, I would think.
* Re: Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 18:29 UTC
  To: Christoph Lameter
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thursday, 17 March 2011 at 12:46 -0500, Christoph Lameter wrote:
> On Thu, 17 Mar 2011, Eric Dumazet wrote:
>
> > When we know we run from BH context, we can use __this_cpu_inc(), but if we
> > don't know, or run from user/process context, we would need the irqsafe_inc
> > variant.
>
> If the BH context is the only one where we care about performance, then that
> is OK, I would think.

Hmm... yes.

By the way, I noticed that:

	DECLARE_PER_CPU(u64, xt_u64);
	__this_cpu_add(xt_u64, 2);

translates to the following x86_32 code:

	mov    $xt_u64,%eax
	add    %fs:0x0,%eax
	addl   $0x2,(%eax)
	adcl   $0x0,0x4(%eax)

I wonder why we don't use:

	addl   $0x2,%fs:xt_u64
	adcl   $0x0,%fs:xt_u64+4
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 18:42 UTC
  To: Eric Dumazet
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> By the way, I noticed that:
>
> 	DECLARE_PER_CPU(u64, xt_u64);
> 	__this_cpu_add(xt_u64, 2);
>
> translates to the following x86_32 code:
>
> 	mov    $xt_u64,%eax
> 	add    %fs:0x0,%eax
> 	addl   $0x2,(%eax)
> 	adcl   $0x0,0x4(%eax)
>
> I wonder why we don't use:
>
> 	addl   $0x2,%fs:xt_u64
> 	adcl   $0x0,%fs:xt_u64+4

The compiler is fed the following:

	*__this_cpu_ptr(xt_u64) += 2;

__this_cpu_ptr makes that:

	*(xt_u64 + __my_cpu_offset) += 2;

So the compiler calculates the address first and then increments through it.

The compiler could optimize this, I think. I wonder why that does not happen.
* Re: Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 18:55 UTC
  To: Christoph Lameter
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thursday, 17 March 2011 at 13:42 -0500, Christoph Lameter wrote:
> On Thu, 17 Mar 2011, Eric Dumazet wrote:
>
> > By the way, I noticed that:
> >
> > 	DECLARE_PER_CPU(u64, xt_u64);
> > 	__this_cpu_add(xt_u64, 2);
> >
> > translates to the following x86_32 code:
> >
> > 	mov    $xt_u64,%eax
> > 	add    %fs:0x0,%eax
> > 	addl   $0x2,(%eax)
> > 	adcl   $0x0,0x4(%eax)
> >
> > I wonder why we don't use:
> >
> > 	addl   $0x2,%fs:xt_u64
> > 	adcl   $0x0,%fs:xt_u64+4
>
> The compiler is fed the following:
>
> 	*__this_cpu_ptr(xt_u64) += 2;
>
> __this_cpu_ptr makes that:
>
> 	*(xt_u64 + __my_cpu_offset) += 2;
>
> So the compiler calculates the address first and then increments through it.
>
> The compiler could optimize this, I think. I wonder why that does not happen.

The compiler really is forced to compute the address, that's why.

Hmm, I think we should not fall back to the generic ops, but tweak
percpu_add_op() along these lines:

percpu_add_op() {
	...
	case 8:
#ifdef CONFIG_X86_64_SMP
		if (pao_ID__ == 1)					\
			asm("incq "__percpu_arg(0) : "+m" (var));	\
		else if (pao_ID__ == -1)				\
			asm("decq "__percpu_arg(0) : "+m" (var));	\
		else							\
			asm("addq %1, "__percpu_arg(0)			\
			    : "+m" (var)				\
			    : "re" ((pao_T__)(val)));			\
		break;							\
#else
		asm("addl %1, "__percpu_arg(0)				\
		    : "+m" (var)					\
		    : "ri" ((u32)(val)));				\
		asm("adcl %1, "__percpu_arg(0)				\
		    : "+m" (*((u32 *)&(var) + 1))			\
		    : "ri" ((u32)((val) >> 32)));			\
		break;							\
#endif
	...
}
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 19:21 UTC
  To: Eric Dumazet
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> > > I wonder why we don't use:
> > >
> > > 	addl   $0x2,%fs:xt_u64
> > > 	adcl   $0x0,%fs:xt_u64+4
> >
> > The compiler is fed the following:
> >
> > 	*__this_cpu_ptr(xt_u64) += 2;
> >
> > __this_cpu_ptr makes that:
> >
> > 	*(xt_u64 + __my_cpu_offset) += 2;
> >
> > So the compiler calculates the address first and then increments through it.
> >
> > The compiler could optimize this, I think. I wonder why that does not happen.
>
> The compiler really is forced to compute the address, that's why.
>
> Hmm, I think we should not fall back to the generic ops, but tweak
> percpu_add_op() along these lines:

percpu_add_op() is not used here. This is a 64-bit operation on a 32-bit
machine, so we fall back to __this_cpu_generic_to_op():

#define __this_cpu_generic_to_op(pcp, val, op)				\
do {									\
	*__this_cpu_ptr(&(pcp)) op val;					\
} while (0)
* Re: Poll about irqsafe_cpu_add and others
  From: Benjamin Herrenschmidt @ 2011-03-18 6:56 UTC
  To: Eric Dumazet
  Cc: linux-kernel, linux-arch, Christoph Lameter, netdev, Netfilter Development Mailinglist

On Thu, 2011-03-17 at 15:23 +0100, Eric Dumazet wrote:
> Hi
>
> irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
> (commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
> netfilter fast path too, and eventually for the SNMP counters (to lower
> their RAM needs by 50%).
>
> Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.
>
> It seems only the x86 arch uses a native and efficient implementation.
>
> The others use irqsafe_cpu_generic_to_op() and its pair of
> local_irq_save() / local_irq_restore().
>
> Which other arches could use a native implementation?
>
> What about defining a HAVE_FAST_IRQSAFE_ADD?

On powerpc we use the generic one, and it's fast :-)

Well, at least on ppc64, because we do lazy irq disabling:
disabling/enabling interrupts is basically just poking a byte in a per-cpu
data structure. (There's a tad more work on enable in case an interrupt
actually occurred and we ended up really disabling, but that's not the
common case.) Overall it's faster than using the atomic ops, which have to
go all the way to the L2 cache.

So if you define the above, please set it on powerpc despite the fact that
we use the generic implementation.

Cheers,
Ben.
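A rough conceptual sketch of the lazy-disable scheme Ben describes (the
names my_cpu_state, soft_enabled, irq_happened and replay_pending_irq are
illustrative of the idea, not the actual ppc64 fields or code):

	/* "disabling" interrupts only records the fact in a per-cpu byte */
	static inline unsigned long lazy_irq_save(void)
	{
		unsigned long was_enabled = my_cpu_state->soft_enabled;

		my_cpu_state->soft_enabled = 0;	/* just a store, no MSR write */
		return was_enabled;
	}

	static inline void lazy_irq_restore(unsigned long was_enabled)
	{
		my_cpu_state->soft_enabled = was_enabled;

		/* if a hardware interrupt arrived while soft-disabled, it was
		 * noted and left hard-disabled; replay it now (uncommon path) */
		if (was_enabled && my_cpu_state->irq_happened)
			replay_pending_irq();
	}

With that, the generic irqsafe_cpu_add() fallback costs roughly two per-cpu
stores plus the add itself, which is why it is already cheap on ppc64.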
Thread overview: 11+ messages

2011-03-17 14:23 Poll about irqsafe_cpu_add and others  Eric Dumazet
2011-03-17 14:40 ` Christoph Lameter
2011-03-17 15:14 ` David Miller
2011-03-17 15:18   ` Christoph Lameter
2011-03-17 16:53     ` Eric Dumazet
2011-03-17 17:46       ` Christoph Lameter
2011-03-17 18:29         ` Eric Dumazet
2011-03-17 18:42           ` Christoph Lameter
2011-03-17 18:55             ` Eric Dumazet
2011-03-17 19:21               ` Christoph Lameter
2011-03-18  6:56 ` Benjamin Herrenschmidt