* ipsec smp scalability and cpu use fairness (softirqs)
From: Timo Teras
Date: 2013-08-12 13:01 UTC
To: netdev

Hi,

I've recently been doing some IPsec benchmarking, and analyzing what
happens when the system runs out of CPU power. The setup is a DMVPN
gateway (gre+xfrm+opennhrp) with traffic in the forward path. The
systems I have been using are VIA Nano (Padlock AES/SHA acceleration)
and Intel Xeon (AES-NI and SSSE3 SHA1) based. In both setups the
crypto happens synchronously, using special opcodes or an assembly
implementation of the algorithm.

It seems that the combination of softirq, NAPI and synchronous crypto
causes two problems:

1. Single-core systems that run out of CPU power are overwhelmed in an
uncontrollable manner. As the softirq is doing the heavy lifting, the
userland processes are starved first. This can cause the userland IKE
daemon to starve and lose tunnels when it is unable to answer liveness
checks. The quick workaround is to set up traffic shaping for the
encrypted traffic.

2. On multicore (6-12 core) systems, it is not easy to distribute the
IPsec work to multiple cores, as a softirq is sticky to the CPU where
it was raised. The IPsec decryption/encryption is done synchronously
in the NAPI poll loop, so throughput is limited by one CPU. If the NIC
supports multiple queues and balancing on the ESP SPI, we can use that
to get some parallelism.

Fundamentally, both problems arise because synchronous crypto happens
in softirq context. I'm wondering if it would make sense to execute
the synchronous crypto in a low-priority per-xfrm_state workqueue or
similar.

Any suggestions or comments?
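A quick way to observe the single-CPU concentration described in problem 2 is to compare per-CPU NET_RX counts in /proc/softirqs. The sketch below is illustrative only (the sample numbers are made up, not from the thread); it parses a /proc/softirqs-style table and computes how much of the NET_RX work landed on one CPU:

```python
# Hedged illustration (not from the original mail): parse /proc/softirqs-style
# text to see whether NET_RX softirq work (and with it the synchronous IPsec
# crypto) is concentrated on a single CPU.

def parse_softirqs(text):
    """Return {softirq_name: [per-cpu counts]} from /proc/softirqs-style text."""
    lines = text.strip().splitlines()
    result = {}
    for line in lines[1:]:              # first line is the "CPU0 CPU1 ..." header
        name, _, counts = line.partition(":")
        result[name.strip()] = [int(c) for c in counts.split()]
    return result

# Made-up sample resembling an overloaded single-queue receiver:
sample = """\
            CPU0       CPU1       CPU2       CPU3
  NET_TX:   1200        310        295        280
  NET_RX: 9800000      12000      11500      11800
"""

stats = parse_softirqs(sample)
rx = stats["NET_RX"]
share = rx[0] / sum(rx)   # fraction of NET_RX handled by CPU0
```

On a real box one would feed `open("/proc/softirqs").read()` in; a `share` near 1.0 matches the symptom Timo describes.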
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Andrew Collins
Date: 2013-08-12 21:58 UTC
To: Timo Teras; Cc: netdev

On Mon, Aug 12, 2013 at 7:01 AM, Timo Teras <timo.teras@iki.fi> wrote:
> 1. Single core systems that are going out of cpu power, are
> overwhelmed in uncontrollable manner. As softirq is doing the heavy
> lifting, the user land processes are starved first. This can cause
> userland IKE daemon to starve and lose tunnels when it is unable to
> answer liveness checks. The quick workaround is to setup traffic
> shaping for the encrypted traffic.

Which kernel version are you on? I've found I've had better behavior
since:

commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jan 10 15:26:34 2013 -0800

    softirq: reduce latencies

as it bails out of lengthy softirq processing much earlier, along with
tuning "netdev_budget" to avoid cycling for too long in the NAPI poll.

> 2. On multicore (6-12 cores) systems, it would appear that it is not
> easy to distribute the ipsec to multiple cores, as softirq is sticky
> to the cpu where it was raised. The ipsec decryption/encryption is
> done synchronously in the napi poll loop, and the throughput is
> limited by one cpu. If the NIC supports multiple queues and balancing
> with ESP SPI, we can use that to get some parallelism.

Although it's highly use-case dependent, I've had good luck using RPS.
I'm testing as an IPsec router, however, not with an endpoint on the
host itself, so it processes nearly all IPsec traffic in receive
context.

Andrew Collins
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Timo Teras
Date: 2013-08-13 6:23 UTC
To: Andrew Collins; Cc: netdev

On Mon, 12 Aug 2013 15:58:41 -0600
Andrew Collins <bsderandrew@gmail.com> wrote:
> Which kernel version are you on? I've found I've had better behavior
> since:
>
> commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Thu Jan 10 15:26:34 2013 -0800
>
>     softirq: reduce latencies
>
> as it bails from lengthy softirq processing much earlier, along with
> tuning "netdev_budget" to avoid cycling for too long in the NAPI
> poll.

The user process starvation observations are originally from 3.3/3.4
kernels, and I have not yet retested properly with newer ones; I am
currently starting upgrades to 3.10. That commit looks like it will
directly fix most of the single-core starvation issues.

I think netdev_budget mostly affects latencies for other softirqs,
since the rx softirq will practically always be active under this kind
of stress. It can also still cause problems that encrypted and
non-encrypted packets go through the same queues: when we are out of
CPU, we can start dropping even non-encrypted packets early.

> > 2. On multicore (6-12 cores) systems, it would appear that it is
> > not easy to distribute the ipsec to multiple cores, as softirq is
> > sticky to the cpu where it was raised.
>
> Although it's highly usecase dependent, I've had good luck using
> RPS. I'm testing as an ipsec router however, not with an endpoint
> on the host itself, so it processes nearly all ipsec traffic in
> receive context.

Yes, RPS will help in many scenarios, but not all. The flow dissector
knows only IP/TCP/UDP/GRE, not ESP. So as long as traffic is spread
between different IP addresses, it gets distributed. But if I have a
lot of traffic between two nodes, either with different ESP SPIs
(different gatewayed subnets) or even with the same SPI, it won't be.
For my scenario it will usually even be the same SPI. So even if the
flow dissector learns ESP and uses the SPI in the hash, I'd still need
a way to balance traffic across multiple SAs.

I guess the place where I'd want to see the distribution to cores is
the crypto_aead_*() calls. In fact, it seems the code infrastructure
for this already exists: crypto/cryptd.c. It appears to need manual
configuration, and only a few places, e.g. the AES-NI GCM parts, use
it.

I'm wondering if it would make sense to patch net/xfrm/xfrm_algo.c to
use cryptd? Or at least have a Kconfig or sysctl option to make it do
so.

- Timo
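Timo's point about RPS and ESP can be sketched with a toy steering function (illustrative only, not the kernel's flow dissector or hash): if only the address pair is hashed, or if all traffic shares one SPI, every packet of a two-host tunnel maps to the same CPU, and only spreading traffic over multiple SAs lets an SPI-aware hash help:

```python
# Toy model (not kernel code) of RPS-style CPU steering by flow hash.
import zlib

NUM_CPUS = 4

def steer(saddr, daddr, spi=None):
    """Pick a CPU from a hash over the flow keys (SPI included only if given)."""
    key = f"{saddr}-{daddr}-{spi}".encode()
    return zlib.crc32(key) % NUM_CPUS

# Two gateways, one SA: every packet of the flow maps to a single CPU,
# so one core does all the decryption no matter how many cores exist.
cpus_one_sa = {steer("10.0.0.1", "10.0.0.2", spi=0x1001) for _ in range(1000)}

# Two gateways, many SAs: an SPI-aware hash can spread the work --
# which is why Timo needs a way to balance traffic across multiple SAs.
cpus_many_sas = {steer("10.0.0.1", "10.0.0.2", spi=s) for s in range(64)}
```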
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Steffen Klassert
Date: 2013-08-13 7:46 UTC
To: Timo Teras; Cc: Andrew Collins, netdev

On Tue, Aug 13, 2013 at 09:23:12AM +0300, Timo Teras wrote:
>
> For my scenario it will be usually even same SPI. So even if flow
> dissector learns ESP and uses SPI in hash, I'd need a way to balance
> traffic to multiple SAs.
>
> I guess the place where I'd want to see the distribution to cores is
> crypto_aead_*() calls. In fact, it seems there's code infrastructure
> already for it: crypto/cryptd.c. Seems it needs to be manually
> configured and only few places e.g. aesni gcm parts use it.
>
> I'm wondering if it'd make sense to patch net/xfrm/xfrm_algo.c to use
> cryptd? Or at least have a Kconfig or sysctl option make it do so.

It is possible to configure the used crypto algorithm from userspace
with the crypto user configuration API, see crypto/crypto_user.c.

I wrote a tool that uses this API some time ago. It is still a bit
rudimentary, but it does the job. You can find it at:
https://sourceforge.net/projects/crconf/

Also, if you want parallelism, you could use the pcrypt algorithm. It
sends the crypto requests asynchronously, round robin, to a
configurable set of CPUs. Finally, it takes care to bring the served
crypto requests back into the order they were submitted in, to avoid
packet reordering.

Currently we have only one systemwide workqueue for encryption and one
for decryption, so all IPsec packets are sent to the same workqueue
regardless of which state they use.

I have patches that make it possible to configure a separate workqueue
for each state, or to group some states onto a specific workqueue.
These patches are still unpublished because they have not had much
testing yet, but I could send them, after some polishing, for review
or testing if you are interested.
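Steffen's description of pcrypt — asynchronous round-robin dispatch, with completions forced back into submission order to avoid packet reordering — can be sketched as a toy model (this is not the actual padata/pcrypt code, just the scheme it implements):

```python
# Toy model of pcrypt-style parallel crypto: dispatch requests round robin,
# let workers complete them in arbitrary order, then serialize the results
# back into strict submission order before handing them onward.
import random

def pcrypt_model(requests, num_cpus=4, seed=0):
    # Dispatch: tag each request with a sequence number and a target CPU.
    tagged = [(seq, seq % num_cpus, req) for seq, req in enumerate(requests)]

    # Completion: workers finish in arbitrary order (simulated by shuffling).
    rng = random.Random(seed)
    completed = tagged[:]
    rng.shuffle(completed)

    # Serialization: buffer out-of-order completions, emit in sequence order.
    pending, next_seq, out = {}, 0, []
    for seq, _cpu, req in completed:
        pending[seq] = req
        while next_seq in pending:
            out.append(pending.pop(next_seq))
            next_seq += 1
    return out

packets = [f"pkt{i}" for i in range(10)]
# Output order matches submission order despite out-of-order completion.
```

The serialization stage is the part that keeps ESP sequence numbers monotonic on the wire even though several CPUs encrypt concurrently.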
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Timo Teras
Date: 2013-08-13 7:57 UTC
To: Steffen Klassert; Cc: Andrew Collins, netdev

On Tue, 13 Aug 2013 09:46:14 +0200
Steffen Klassert <steffen.klassert@secunet.com> wrote:
> It is possible to configure the used crypto algorithm from userspace
> with the crypto user configuration API, see crypto/crypto_user.c.
>
> I wrote a tool that uses this API some time ago, it is still
> a bit rudimentary but it does the job. You can find it at:
> https://sourceforge.net/projects/crconf/

Exactly what I was looking for! Thanks!

> Also, if you want parallelism, you could use the pcrypt algorithm.
> It sends the crypto requests asynchronously round robin to a
> configurable set of cpus. Finally it takes care to bring the
> served crypto requests back into the order they were submitted
> to avoid packet reordering.

Right. Looks like this helps a lot.

Perhaps it would also be worth experimenting with RPS-type hash-based
CPU selection?

> Currently we have only one systemwide workqueue for encryption
> and one for decryption. So all IPsec packets are sent to the same
> workqueue, regardless which state they use.
>
> I have patches that make it possible to configure a separate
> workqueue for each state or to group some states to a specific
> workqueue. These patches are still unpublished because they
> have not much testing yet, but I could send them after some
> polishing for review or testing if you are interested.

Yes, I'd be interested. Thanks!
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Steffen Klassert
Date: 2013-08-13 10:45 UTC
To: Timo Teras; Cc: Andrew Collins, netdev

On Tue, Aug 13, 2013 at 10:57:57AM +0300, Timo Teras wrote:
> Right. Looks like this helps a lot.
>
> Perhaps it would be worth to experiment also with RPS type hash
> based cpu selection?

Actually, this was the reason why I started to write the patches
mentioned below. The idea was to use a combination of flow-based and
inner-flow parallelization.

On bigger NUMA machines it does not make much sense to use all cores
for parallelization. The performance depends too much on the actual
topology; moving crypto requests to another NUMA node can even reduce
performance. So I wanted to use RPS-type hash-based CPU selection to
choose the node for a given flow, and then use pcrypt to parallelize
that flow on the chosen node.

> > Currently we have only one systemwide workqueue for encryption
> > and one for decryption. So all IPsec packets are sent to the same
> > workqueue, regardless which state they use.
> >
> > I have patches that make it possible to configure a separate
> > workqueue for each state or to group some states to a specific
> > workqueue. These patches are still unpublished because they
> > have not much testing yet, but I could send them after some
> > polishing for review or testing if you are interested.
>
> Yes, I'd be interested.

Ok, I'll send them. It may take some days to rebase and polish.
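Steffen's two-level idea — hash each flow to a NUMA node so its crypto never crosses nodes, then parallelize only within that node — can be sketched as follows (a hypothetical illustration; the 2x4 topology and helper names are made up, not from the patches):

```python
# Hedged sketch of flow-based node selection combined with inner-flow
# parallelization: an RPS-style hash pins a flow to one NUMA node, and
# requests for that flow are then spread round robin over that node's
# CPUs only, avoiding cross-node traffic for the crypto state.
import itertools
import zlib

# Hypothetical topology: 2 nodes x 4 CPUs (illustrative, not measured).
NODES = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def node_for_flow(saddr, daddr, spi):
    """Flow hash -> NUMA node (keeps a flow's crypto on one node)."""
    return zlib.crc32(f"{saddr}/{daddr}/{spi:#x}".encode()) % len(NODES)

def make_dispatcher(saddr, daddr, spi):
    """Return a callable yielding the CPU for each successive request."""
    cpus = NODES[node_for_flow(saddr, daddr, spi)]
    rr = itertools.cycle(cpus)
    return lambda: next(rr)

dispatch = make_dispatcher("10.0.0.1", "10.0.0.2", 0x1001)
cpus_used = {dispatch() for _ in range(100)}
node = node_for_flow("10.0.0.1", "10.0.0.2", 0x1001)
# All requests of this flow stay on one node, but use all of its CPUs.
```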
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Timo Teras
Date: 2013-08-13 11:33 UTC
To: Steffen Klassert; Cc: Andrew Collins, netdev

On Tue, 13 Aug 2013 12:45:48 +0200
Steffen Klassert <steffen.klassert@secunet.com> wrote:
> Actually, this was the reason why I started to write the below
> mentioned patches. The idea behind that was to use a combination of
> flow based and inner flow parallelization.
>
> On bigger NUMA machines it does not make much sense to use all
> cores for parallelization. The performance depends too much on the
> actual topology. Moving crypto requests to another NUMA node can
> even reduce performance. So I wanted to use RPS type hash based
> cpu selection to choose the node for a given flow and then use
> pcrypt to parallelize this flow on the chosen node.

Excellent.

I've now been playing with pcrypt. It does not seem to give a
significant boost in throughput. I've set up the cpumasks properly,
and top says the work is distributed to the appropriate kworkers, but
for some reason throughput does not get any better. I've tested with
iperf in both UDP and TCP modes, with various numbers of threads.

Are there any more synchronization points for a single SA that might
limit throughput? I've been testing with auth hmac(sha1), enc
cbc(aes) - according to the metrics the CPUs are still largely idle
instead of processing more data for better throughput. aes-gcm
(without pcrypt) achieves better throughput, even saturating my test
box's links.

Any pointers on what to test, or how to pinpoint the bottleneck?

I also tried enabling RPS on the gre device, but it did not seem to
make any significant difference either.

> > > Currently we have only one systemwide workqueue for encryption
> > > and one for decryption. So all IPsec packets are sent to the
> > > same workqueue, regardless which state they use.
> > >
> > > I have patches that make it possible to configure a separate
> > > workqueue for each state or to group some states to a specific
> > > workqueue. These patches are still unpublished because they
> > > have not much testing yet, but I could send them after some
> > > polishing for review or testing if you are interested.
> >
> > Yes, I'd be interested.
>
> Ok, I'll send them. May take some days to rebase and polish.

Thanks.
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Steffen Klassert
Date: 2013-08-13 11:56 UTC
To: Timo Teras; Cc: Andrew Collins, netdev

On Tue, Aug 13, 2013 at 02:33:25PM +0300, Timo Teras wrote:
>
> I've been now playing with pcrypt. It seems to not give significant
> boost in throughput. I've set up the cpumasks properly, and top says
> the work is distributed to the appropriate kworkers, but for some
> reason throughput does not get any better. I've tested with iperf in
> both UDP and TCP modes, with various numbers of threads.
>
> Is there any more synchronization point for a single SA that might
> limit throughput? I've been testing with auth hmac(sha1), enc
> cbc(aes) - according to the metrics the CPUs are still largely idle
> instead of processing more data for better throughput. aes-gcm
> (without pcrypt) achieves better throughput, even saturating my test
> box's links.
>
> Any pointers what to test, or to pinpoint the bottleneck?

The only pitfall that comes to my mind is that pcrypt must be
instantiated before inserting the states. Your /proc/crypto should
show something like:

name         : authenc(hmac(sha1),cbc(aes))
driver       : pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))
module       : pcrypt
priority     : 2100
refcnt       : 1
selftest     : passed
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 20
geniv        : <built-in>

Here pcrypt is instantiated, i.e. all new IPsec states (that do
hmac-sha1, cbc-aes) will use it, and adding new states increases the
refcount.

I'll do some tests with current net-next on my own tomorrow and let
you know about the results.
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Timo Teras
Date: 2013-08-13 12:41 UTC
To: Steffen Klassert; Cc: Andrew Collins, netdev

On Tue, 13 Aug 2013 13:56:52 +0200
Steffen Klassert <steffen.klassert@secunet.com> wrote:
> The only pitfall that comes to my mind is that pcrypt must be
> instantiated before inserting the states. Your /proc/crypto
> should show something like:
> ...
> I'll do some tests with current net-next on my own tomorrow and let
> you know about the results.

Yes, I've got pcrypt there. Apparently I had some of the CPU bindings
wrong; now that they are fixed, it's looking a lot better. But it
seems that the ksoftirqd on one of the CPUs becomes the first
bottleneck. I'll try to figure out why.

Thanks for all the info so far; I will continue experimenting here
too.
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Steffen Klassert
Date: 2013-08-20 6:19 UTC
To: Timo Teras; Cc: Andrew Collins, netdev

On Tue, Aug 13, 2013 at 03:41:02PM +0300, Timo Teras wrote:
> Yes, I've got pcrypt there. Apparently I had some of the cpu
> bindings not right, so now it's looking a lot better. But it seems
> that ksoftirqd on one of the CPUs becomes first bottleneck. I'll try
> to figure out why.
>
> Thanks on all the info so far, will continue experimenting here too.

Here are the promised test results:

I used my test boxes with two nodes (Intel Xeon X5550 @ 2.67GHz) and
all cores utilized (16 logical cores). I did box-to-box iperf IPsec
tunnel tests with the crypto algorithm:

pcrypt(authenc(hmac(sha1-ssse3),cbc(aes-asm)))

Throughput is at 1.70 Gbit/s.

The same test without pcrypt, i.e. crypto algorithm:

authenc(hmac(sha1-ssse3),cbc(aes-asm))

Throughput is at 560 Mbit/s.

Unfortunately I can't do forwarding tests, as I have only two 10 Gbit
NICs. It would be nice if I could get forwarding test results from
somewhere.
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Timo Teras
Date: 2013-08-20 6:39 UTC
To: Steffen Klassert; Cc: Andrew Collins, netdev

On Tue, 20 Aug 2013 08:19:14 +0200
Steffen Klassert <steffen.klassert@secunet.com> wrote:
> Here are the promised test results:
>
> I used my test boxes with two nodes (Intel Xeon X5550 @ 2.67GHz) and
> all cores utilized (16 logical cores). I did box-to-box iperf IPsec
> tunnel tests with the crypto algorithm:
>
> pcrypt(authenc(hmac(sha1-ssse3),cbc(aes-asm)))
>
> Throughput is at 1.70 Gbit/s.
>
> The same test without pcrypt, i.e. crypto algorithm:
>
> authenc(hmac(sha1-ssse3),cbc(aes-asm))
>
> Throughput is at 560 Mbit/s.
>
> Unfortunately I can't do forwarding tests, I have only two 10 Gbit
> NICs. Would be nice if I could get forwarding test results from
> somewhere.

I got basically the same results. (I managed to get 2.5 Gbit/s after
some cpumask experimenting.)

At this point it seems that one CPU core peaks at 100% softirq; it
appears to be the NIC rx softirq. I am curious why it takes so much
CPU, because plain TCP at 10 Gbit/s does not take much CPU at all. So
even though pcrypt is used, it seems to add considerable overhead in
the softirq rx path. I wonder if it's the pcrypt dispatch overhead or
some generic ipsec/gre overhead, perhaps a locking issue. I should
profile it.

Thanks.
* Re: ipsec smp scalability and cpu use fairness (softirqs)
From: Steffen Klassert
Date: 2013-08-20 6:17 UTC
To: Timo Teras; Cc: Andrew Collins, netdev

On Tue, Aug 13, 2013 at 10:57:57AM +0300, Timo Teras wrote:
> On Tue, 13 Aug 2013 09:46:14 +0200
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> > Currently we have only one systemwide workqueue for encryption
> > and one for decryption. So all IPsec packets are sent to the same
> > workqueue, regardless which state they use.
> >
> > I have patches that make it possible to configure a separate
> > workqueue for each state or to group some states to a specific
> > workqueue. These patches are still unpublished because they
> > have not much testing yet, but I could send them after some
> > polishing for review or testing if you are interested.
>
> Yes, I'd be interested.

I've pushed the patches to

git://git.kernel.org/pub/scm/linux/kernel/git/klassert/linux-stk.git net-next-pcrypt

Steffen Klassert (9):
  crypto: api - Add crypto_tfm_has_alg helper
  xfrm: Add a netlink attribute for crypto algorithm drivers
  esp4: Use the crypto algorithm driver name if present
  esp6: Use the crypto algorithm driver name if present
  crypto: Support for multi instance algorithms
  pcrypt: handle errors from crypto_register_template
  crypto: pcrypt - Add support for request backlog
  crypto: pcrypt - Add the padata related informations to the instance context
  crypto: pcrypt - Support for multiple padata instances

 crypto/algapi.c           |    3 +-
 crypto/api.c              |   15 ++
 crypto/pcrypt.c           |  489 +++++++++++++++++++++++++++++++++++----------
 include/linux/crypto.h    |    7 +
 include/net/xfrm.h        |    2 +
 include/uapi/linux/xfrm.h |    5 +
 net/ipv4/esp4.c           |   33 ++-
 net/ipv6/esp6.c           |   33 ++-
 net/xfrm/xfrm_user.c      |    8 +
 9 files changed, 482 insertions(+), 113 deletions(-)

This is a combined patchset of networking and crypto changes. I merged
them and pushed them to a git repo so I don't need to bother the
netdev and crypto lists with these early-stage patches.

The networking changes add the possibility to choose the crypto
algorithm driver on a per-SA basis. I've attached the necessary
iproute2 patch to this mail.

The crypto changes are a general pcrypt update. They add the
possibility to build multiple instances of pcrypt, such that each SA
can have its own pcrypt instance.

There is one unrelated patch in the patchset:

  crypto: pcrypt - Add support for request backlog

It should not interfere with the other patches; it was just too much
pain to rebase without it.

Comments on the patchset and test results are very welcome!

The patch below adds an iproute2 option to configure the crypto driver
per SA:

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 20 Aug 2013 07:13:51 +0200
Subject: [PATCH] iproute2: Add an option to configure the crypto driver
 on a per-SA basis

---
 include/linux/xfrm.h | 5 +++++
 ip/xfrm_state.c      | 7 +++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/xfrm.h b/include/linux/xfrm.h
index 341c3c9..4520008 100644
--- a/include/linux/xfrm.h
+++ b/include/linux/xfrm.h
@@ -116,6 +116,10 @@ struct xfrm_algo_aead {
 	char		alg_key[0];
 };
 
+struct xfrm_algo_driver {
+	char		driver_name[64];
+};
+
 struct xfrm_stats {
 	__u32	replay_window;
 	__u32	replay;
@@ -298,6 +302,7 @@ enum xfrm_attr_type_t {
 	XFRMA_TFCPAD,		/* __u32 */
 	XFRMA_REPLAY_ESN_VAL,	/* struct xfrm_replay_esn */
 	XFRMA_SA_EXTRA_FLAGS,	/* __u32 */
+	XFRMA_ALG_DRIVER,	/* struct xfrm_algo_driver */
 
 	__XFRMA_MAX
 #define XFRMA_MAX (__XFRMA_MAX - 1)
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 389942c..b7d413d 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -274,6 +274,7 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 		char			buf[RTA_BUF_SIZE];
 	} req;
 	struct xfrm_replay_state replay;
+	struct xfrm_algo_driver driver;
 	char *idp = NULL;
 	char *aeadop = NULL;
 	char *ealgop = NULL;
@@ -290,6 +291,7 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 
 	memset(&req, 0, sizeof(req));
 	memset(&replay, 0, sizeof(replay));
+	memset(&driver, 0, sizeof(driver));
 	memset(&ctx, 0, sizeof(ctx));
 
 	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(req.xsinfo));
@@ -392,6 +394,11 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 			xfrm_sctx_parse((char *)&ctx.str, context, &ctx.sctx);
 			addattr_l(&req.n, sizeof(req.buf), XFRMA_SEC_CTX,
 				  (void *)&ctx, ctx.sctx.len);
+		} else if (strcmp(*argv, "crypto-driver") == 0) {
+			NEXT_ARG();
+			strncpy(driver.driver_name, *argv, sizeof(driver.driver_name));
+			addattr_l(&req.n, sizeof(req.buf), XFRMA_ALG_DRIVER,
+				  (void *)&driver, sizeof(driver));
 		} else {
 			/* try to assume ALGO */
 			int type = xfrm_algotype_getbyname(*argv);
-- 
1.7.9.5
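With the iproute2 patch applied, selecting a driver per SA would presumably look something like the following. This is a hypothetical invocation, not from the thread: the `crypto-driver` keyword and the pcrypt driver name come from the patches above, while the addresses, SPI and keys are placeholders:

```shell
# Hypothetical usage of the patched iproute2 (all values are placeholders):
ip xfrm state add src 192.0.2.1 dst 192.0.2.2 proto esp spi 0x1001 \
    mode tunnel \
    auth 'hmac(sha1)' 0x0123456789abcdef0123456789abcdef01234567 \
    enc 'cbc(aes)' 0x0123456789abcdef0123456789abcdef \
    crypto-driver 'pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))'
```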