* Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 14:23 UTC
  To: linux-kernel, linux-arch
  Cc: Christoph Lameter, netdev, Netfilter Development Mailinglist

Hi

irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
(commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
netfilter fast path too, and eventually for the SNMP counters (to lower
their RAM needs by 50%).

Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.

It seems only the x86 arch uses a native and efficient implementation.

The others use irqsafe_cpu_generic_to_op() and its pair of
local_irq_save() / local_irq_restore().

Which other arches could use a native implementation?

What about defining a HAVE_FAST_IRQSAFE_ADD?

Thanks
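For reference, the generic fallback mentioned above looks roughly like this
(a sketch along the lines of that era's include/linux/percpu.h, not a
verbatim copy):

	#define irqsafe_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		unsigned long flags;					\
		/* mask interrupts so an irq handler on this cpu	\
		 * cannot interleave with the read-modify-write */	\
		local_irq_save(flags);					\
		*__this_cpu_ptr(&(pcp)) op val;				\
		local_irq_restore(flags);				\
	} while (0)

The local_irq_save()/local_irq_restore() pair is exactly the cost a native
implementation is meant to avoid.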
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 14:40 UTC
  To: Eric Dumazet
  Cc: linux-kernel, linux-arch, netdev, Netfilter Development Mailinglist

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
> (commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
> netfilter fast path too, and eventually for the SNMP counters (to lower
> their RAM needs by 50%).
>
> Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.
>
> It seems only the x86 arch uses a native and efficient implementation.

I have some old draft patches for IA64 that use fetchadd together with a
per-cpu virtual address range mapped differently for each processor. In
general, though, the problem with the other arches is that they have
neither instructions that avoid the expensive bus arbitration for a
cacheline nor segment overrides.

The segment-override effect, i.e. implicit relocation of the address into
the correct percpu area within one instruction, can also be achieved (as in
the IA64 case) if we have per-cpu page tables for the kernel and map the
same virtual address to different physical addresses on each cpu. Then the
address given to a percpu instruction is constant and the relocation is
performed implicitly by the MMU. Thus the relocation and the RMW operation
are "atomic": either both occur or neither does.

The other aspect is that the arch must have cheap increment (and other RMW)
instructions that avoid expensive in-cpu processing to establish a coherent
state of the cacheline.

> What about defining a HAVE_FAST_IRQSAFE_ADD?

Useful at some point, I think.
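To make the x86 case concrete, the native form boils down to a single
segment-prefixed instruction. A minimal sketch, simplified from the
percpu_add_op() pattern (the macro name here is made up; __percpu_arg(0)
expands to a %gs:/%fs: prefixed operand):

	/* var must be a DEFINE_PER_CPU variable so the symbol is
	 * relocated through the per-cpu segment base */
	#define my_irqsafe_add_4(var, val)			\
		asm("addl %1, "__percpu_arg(0)			\
		    : "+m" (var)				\
		    : "ri" ((u32)(val)))

Because the per-cpu relocation (via the segment base) and the add happen in
one instruction, an interrupt can only arrive before or after it, never in
the middle; that is the property the generic fallback has to buy with
local_irq_save()/local_irq_restore().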
* Re: Poll about irqsafe_cpu_add and others
  From: David Miller @ 2011-03-17 15:14 UTC
  To: eric.dumazet
  Cc: linux-kernel, linux-arch, cl, netdev, netfilter-devel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 Mar 2011 15:23:54 +0100

> Hi
>
> irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
> (commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
> netfilter fast path too, and eventually for the SNMP counters (to lower
> their RAM needs by 50%).
>
> Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.
>
> It seems only the x86 arch uses a native and efficient implementation.
>
> The others use irqsafe_cpu_generic_to_op() and its pair of
> local_irq_save() / local_irq_restore().
>
> Which other arches could use a native implementation?
>
> What about defining a HAVE_FAST_IRQSAFE_ADD?

I had been meaning to bring this up from another perspective.

In networking, we often only ever access objects in base or BH context.
Therefore, in the BH-context cases we can do plain counter bumps without
any of the special atomic or IRQ-disabling code at all.

This would help non-x86 architectures a lot, and depending upon how the
non-atomic case performs, perhaps x86 too.
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 15:18 UTC
  To: David Miller
  Cc: eric.dumazet, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, David Miller wrote:

> I had been meaning to bring this up from another perspective.
>
> In networking, we often only ever access objects in base or BH context.
> Therefore, in the BH-context cases we can do plain counter bumps without
> any of the special atomic or IRQ-disabling code at all.

We have the __ functions for that purpose. __this_cpu_inc(), for example,
falls back to a simple ++ operation if the arch cannot provide something
better. The irqsafe_xx operations are only needed when the context does not
provide any protection and there is the potential of the counter being
incremented from interrupt context.
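As a usage sketch (the mib[0]->mibs[field] access pattern is only
illustrative of the SNMP counters discussed below):

	/* hit only from softirq (BH) context: the context already excludes
	 * another updater on this cpu, so a plain per-cpu bump is enough */
	__this_cpu_inc(mib[0]->mibs[field]);

	/* may also be hit from hardirq or unknown context on this cpu:
	 * the irq-safe variant is needed; still a single instruction on
	 * x86, but local_irq_save()/restore() with the generic fallback */
	irqsafe_cpu_inc(mib[0]->mibs[field]);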
* Re: Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 16:53 UTC
  To: Christoph Lameter
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thursday, 17 March 2011 at 10:18 -0500, Christoph Lameter wrote:
> On Thu, 17 Mar 2011, David Miller wrote:
>
> > I had been meaning to bring this up from another perspective.
> >
> > In networking, we often only ever access objects in base or BH context.
> > Therefore, in the BH-context cases we can do plain counter bumps without
> > any of the special atomic or IRQ-disabling code at all.
>
> We have the __ functions for that purpose. __this_cpu_inc(), for example,
> falls back to a simple ++ operation if the arch cannot provide something
> better. The irqsafe_xx operations are only needed when the context does not
> provide any protection and there is the potential of the counter being
> incremented from interrupt context.

What David and I have in mind is to use one array per mib instead of two.
This is an old idea: https://patchwork.kernel.org/patch/15883/

When we know we run from BH context, we can use __this_cpu_inc(), but if we
don't know, or run from user/process context, we would need the irqsafe_inc
variant.

For x86 this maps to the same single instruction, but for other arches it
might be too expensive.

BTW, I think the following patch is possible, to save some text and some
useless tests (on 64-bit platforms at least):

size vmlinux.old vmlinux

Thanks

[PATCH] snmp: SNMP_UPD_PO_STATS_BH() always called from softirq

We don't need to test whether we run from softirq context.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
 include/net/snmp.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/snmp.h b/include/net/snmp.h
index 762e2ab..be2424d 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -149,8 +149,8 @@ struct linux_xfrm_mib {
 	} while (0)
 #define SNMP_UPD_PO_STATS_BH(mib, basefield, addend)	\
 	do { \
-		__typeof__(*mib[0]) *ptr = \
-			__this_cpu_ptr((mib)[!in_softirq()]); \
+		__typeof__(*mib[0]) *ptr = __this_cpu_ptr((mib)[0]);	\
+		\
 		ptr->mibs[basefield##PKTS]++; \
 		ptr->mibs[basefield##OCTETS] += addend;\
 	} while (0)
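For context, the one-array-per-mib idea would turn the non-BH increment into
something like this (a hypothetical sketch, not the patch that was actually
merged):

	/* single percpu array: safe from any context only because the
	 * increment itself is irq-safe */
	#define SNMP_INC_STATS(mib, field)		\
		irqsafe_cpu_inc(mib[0]->mibs[field])

which is why the cost of irqsafe_cpu_inc() on the non-x86 arches decides
whether dropping the second array is a win.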
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 17:46 UTC
  To: Eric Dumazet
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> When we know we run from BH context, we can use __this_cpu_inc(), but if we
> don't know, or run from user/process context, we would need the irqsafe_inc
> variant.

If the BH context is the only one where we care about performance, then that
is OK, I would think.
* Re: Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 18:29 UTC
  To: Christoph Lameter
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thursday, 17 March 2011 at 12:46 -0500, Christoph Lameter wrote:
> On Thu, 17 Mar 2011, Eric Dumazet wrote:
>
> > When we know we run from BH context, we can use __this_cpu_inc(), but if we
> > don't know, or run from user/process context, we would need the irqsafe_inc
> > variant.
>
> If the BH context is the only one where we care about performance, then that
> is OK, I would think.

Hmm... yes.

By the way, I noticed that:

	DECLARE_PER_CPU(u64, xt_u64);
	__this_cpu_add(xt_u64, 2);

translates to the following x86_32 code:

	mov    $xt_u64,%eax
	add    %fs:0x0,%eax
	addl   $0x2,(%eax)
	adcl   $0x0,0x4(%eax)

I wonder why we don't use:

	addl   $0x2,%fs:xt_u64
	adcl   $0x0,%fs:xt_u64+4
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 18:42 UTC
  To: Eric Dumazet
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> By the way, I noticed that:
>
> 	DECLARE_PER_CPU(u64, xt_u64);
> 	__this_cpu_add(xt_u64, 2);
>
> translates to the following x86_32 code:
>
> 	mov    $xt_u64,%eax
> 	add    %fs:0x0,%eax
> 	addl   $0x2,(%eax)
> 	adcl   $0x0,0x4(%eax)
>
> I wonder why we don't use:
>
> 	addl   $0x2,%fs:xt_u64
> 	adcl   $0x0,%fs:xt_u64+4

The compiler is fed the following:

	*__this_cpu_ptr(xt_u64) += 2;

__this_cpu_ptr makes that:

	*(xt_u64 + __my_cpu_offset) += 2;

So the compiler calculates the address first and then increments through it.

The compiler could optimize this, I think. I wonder why that does not happen.
* Re: Poll about irqsafe_cpu_add and others
  From: Eric Dumazet @ 2011-03-17 18:55 UTC
  To: Christoph Lameter
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thursday, 17 March 2011 at 13:42 -0500, Christoph Lameter wrote:
> On Thu, 17 Mar 2011, Eric Dumazet wrote:
>
> > By the way, I noticed that:
> >
> > 	DECLARE_PER_CPU(u64, xt_u64);
> > 	__this_cpu_add(xt_u64, 2);
> >
> > translates to the following x86_32 code:
> >
> > 	mov    $xt_u64,%eax
> > 	add    %fs:0x0,%eax
> > 	addl   $0x2,(%eax)
> > 	adcl   $0x0,0x4(%eax)
> >
> > I wonder why we don't use:
> >
> > 	addl   $0x2,%fs:xt_u64
> > 	adcl   $0x0,%fs:xt_u64+4
>
> The compiler is fed the following:
>
> 	*__this_cpu_ptr(xt_u64) += 2;
>
> __this_cpu_ptr makes that:
>
> 	*(xt_u64 + __my_cpu_offset) += 2;
>
> So the compiler calculates the address first and then increments through it.
>
> The compiler could optimize this, I think. I wonder why that does not happen.

The compiler really is forced to compute the address, that's why.

Hmm, I think we should not fall back to the generic ops, but tweak
percpu_add_op() along these lines:

percpu_add_op() {
	...
	case 8:
#ifdef CONFIG_X86_64_SMP
		if (pao_ID__ == 1)					\
			asm("incq "__percpu_arg(0) : "+m" (var));	\
		else if (pao_ID__ == -1)				\
			asm("decq "__percpu_arg(0) : "+m" (var));	\
		else							\
			asm("addq %1, "__percpu_arg(0)			\
			    : "+m" (var)				\
			    : "re" ((pao_T__)(val)));			\
		break;							\
#else
		asm("addl %1, "__percpu_arg(0)				\
		    : "+m" (var)					\
		    : "ri" ((u32)(val)));				\
		asm("adcl %1, "__percpu_arg(0)				\
		    : "+m" (*((u32 *)&(var) + 1))			\
		    : "ri" ((u32)((val) >> 32)));			\
		break;							\
#endif
	...
}
* Re: Poll about irqsafe_cpu_add and others
  From: Christoph Lameter @ 2011-03-17 19:21 UTC
  To: Eric Dumazet
  Cc: David Miller, linux-kernel, linux-arch, netdev, netfilter-devel

On Thu, 17 Mar 2011, Eric Dumazet wrote:

> > > I wonder why we don't use:
> > >
> > > 	addl   $0x2,%fs:xt_u64
> > > 	adcl   $0x0,%fs:xt_u64+4
> >
> > The compiler is fed the following:
> >
> > 	*__this_cpu_ptr(xt_u64) += 2;
> >
> > __this_cpu_ptr makes that:
> >
> > 	*(xt_u64 + __my_cpu_offset) += 2;
> >
> > So the compiler calculates the address first and then increments through it.
> >
> > The compiler could optimize this, I think. I wonder why that does not happen.
>
> The compiler really is forced to compute the address, that's why.
>
> Hmm, I think we should not fall back to the generic ops, but tweak
> percpu_add_op() along these lines:

percpu_add_op() is not used here. This is a 64-bit operation on a 32-bit
machine, so we fall back to __this_cpu_generic_to_op():

#define __this_cpu_generic_to_op(pcp, val, op)				\
do {									\
	*__this_cpu_ptr(&(pcp)) op val;					\
} while (0)
* Re: Poll about irqsafe_cpu_add and others
  From: Benjamin Herrenschmidt @ 2011-03-18 6:56 UTC
  To: Eric Dumazet
  Cc: linux-kernel, linux-arch, Christoph Lameter, netdev, Netfilter Development Mailinglist

On Thu, 2011-03-17 at 15:23 +0100, Eric Dumazet wrote:
> Hi
>
> irqsafe_cpu_{dec|inc} have been used in the network stack since 2.6.37
> (commit 29b4433d991c88), and I would like to use irqsafe_cpu_add() in the
> netfilter fast path too, and eventually for the SNMP counters (to lower
> their RAM needs by 50%).
>
> Initial support for the irqsafe_ operations was added by Christoph in 2.6.34.
>
> It seems only the x86 arch uses a native and efficient implementation.
>
> The others use irqsafe_cpu_generic_to_op() and its pair of
> local_irq_save() / local_irq_restore().
>
> Which other arches could use a native implementation?
>
> What about defining a HAVE_FAST_IRQSAFE_ADD?

On powerpc we use the generic one, and it's fast :-)

Well, at least on ppc64, because we do lazy irq disabling:
disabling/enabling interrupts is basically just poking a byte in a per-cpu
data structure. (There's a tad more work on enable in case an interrupt
actually occurred and we ended up really disabling, but that's not the
common case.) Overall it's faster than using the atomic ops, which have to
go all the way to the L2 cache.

So if you define the above, please set it on powerpc despite the fact that
we use the generic implementation.

Cheers,
Ben.
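A rough conceptual sketch of the lazy-disable scheme Ben describes (the
names my_cpu_state, soft_enabled, irq_happened and replay_pending_irq are
illustrative of the idea, not the actual ppc64 fields or code):

	/* "disabling" interrupts only records the fact in a per-cpu byte */
	static inline unsigned long lazy_irq_save(void)
	{
		unsigned long was_enabled = my_cpu_state->soft_enabled;

		my_cpu_state->soft_enabled = 0;	/* just a store, no MSR write */
		return was_enabled;
	}

	static inline void lazy_irq_restore(unsigned long was_enabled)
	{
		my_cpu_state->soft_enabled = was_enabled;

		/* if a hardware interrupt arrived while soft-disabled, it was
		 * noted and left hard-disabled; replay it now (uncommon path) */
		if (was_enabled && my_cpu_state->irq_happened)
			replay_pending_irq();
	}

With that, the generic irqsafe_cpu_add() fallback costs roughly two per-cpu
stores plus the add itself, which is why it is already cheap on ppc64.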
Thread overview: 11+ messages

2011-03-17 14:23 Poll about irqsafe_cpu_add and others  Eric Dumazet
2011-03-17 14:40 ` Christoph Lameter
2011-03-17 15:14 ` David Miller
2011-03-17 15:18   ` Christoph Lameter
2011-03-17 16:53     ` Eric Dumazet
2011-03-17 17:46       ` Christoph Lameter
2011-03-17 18:29         ` Eric Dumazet
2011-03-17 18:42           ` Christoph Lameter
2011-03-17 18:55             ` Eric Dumazet
2011-03-17 19:21               ` Christoph Lameter
2011-03-18  6:56 ` Benjamin Herrenschmidt