* [PATCH 0/2] Parallel crypto/IPsec v7
@ 2009-12-18 12:20 Steffen Klassert
  2009-12-18 12:20 ` [PATCH 1/2] padata: generic parallelization/serialization interface Steffen Klassert
  2009-12-18 12:21 ` [PATCH 2/2] crypto: pcrypt - Add pcrypt crypto parallelization wrapper Steffen Klassert
  0 siblings, 2 replies; 6+ messages in thread
From: Steffen Klassert @ 2009-12-18 12:20 UTC
  To: Herbert Xu, David Miller; +Cc: linux-crypto

This patchset adds the 'pcrypt' parallel crypto template. With this template it
is possible to process the crypto requests of a transform in parallel without
request reordering. This is particularly interesting for IPsec.

The parallel crypto template is based on the 'padata' generic
parallelization/serialization method. With this method, data objects can
be processed in parallel, starting at a given point. After serialization,
the parallelized data objects are returned in the same order they had
before parallelization. For IPsec, this makes it possible to run the
expensive parts in parallel without reordering packets.

IPsec forwarding tests with two quad-core machines (Intel Core 2 Quad Q6600)
and an EXFO FTB-400 packet blazer showed the following results:

In all tests, I used smp_affinity to pin the interrupts of the network cards
to different CPUs.

linux-2.6.33-rc1 (64 bit)
Packetsize: 1420 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 325 Mbit/s
unidirectional throughput without packet loss: 325 Mbit/s

linux-2.6.33-rc1 (64 bit)
Packetsize: 128 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 100 Mbit/s
unidirectional throughput without packet loss: 125 Mbit/s


linux-2.6.33-rc1 with padata/pcrypt (64 bit)
Packetsize: 1420 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 650 Mbit/s
unidirectional throughput without packet loss: 850 Mbit/s

linux-2.6.33-rc1 with padata/pcrypt (64 bit)
Packetsize: 128 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 100 Mbit/s
unidirectional throughput without packet loss: 125 Mbit/s


So the performance gain on large packets is quite good, but on small packets
the throughput with and without the workqueue-based parallelization is almost
the same in my test environment.
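Expressed as ratios of the loss-free throughput with and without padata/pcrypt,
using the figures above:

```latex
\begin{aligned}
\text{1420 byte, bidirectional:}\quad  & \frac{2 \times 650}{2 \times 325} = 2.0 \\
\text{1420 byte, unidirectional:}\quad & \frac{850}{325} \approx 2.6 \\
\text{128 byte, both directions:}\quad & \frac{2 \times 100}{2 \times 100} = \frac{125}{125} = 1.0
\end{aligned}
```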

Changes from v6:

- Rework padata to use workqueues instead of softirqs for
  parallelization/serialization

- Add a cyclic sequence number pattern; this makes resetting the padata
  serialization logic on sequence number overrun unnecessary.

- Adapt pcrypt to the changed padata interface.

- Rebase onto linux-2.6.33-rc1

Steffen

* Re: workqueue thing
@ 2009-12-21 13:53 Arjan van de Ven
  2009-12-21 14:19 ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Arjan van de Ven @ 2009-12-21 13:53 UTC
  To: Tejun Heo
  Cc: Jens Axboe, Andi Kleen, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On 12/21/2009 14:22, Tejun Heo wrote:
> Hello,
>
> On 12/21/2009 08:11 PM, Arjan van de Ven wrote:
>> I don't mind a good and clean design; and for sure sharing thread
>> pools into one pool is really good.  But if I have to choose between
>> a complex "how to deal with deadlocks" algorithm, versus just
>> running some more threads in the pool, I'll pick the latter.
>
> The deadlock avoidance algorithm is pretty simple.  It creates a new
> worker when everything is blocked.  If the attempt to create a new
> worker blocks, it calls in dedicated workers to ensure allocation path
> is not blocked.  It's not that complex.

I'm just wondering whether even that is overkill. I suspect you can do entirely without the scheduler intrusion
and just make a new thread for each work item, with some hysteresis:
* threads should stay around for a bit before dying (you do that)
* after some minimum number of threads (say 4 per CPU), you wait, say, 0.1 seconds before deciding it's time
   to spawn more threads, to smooth out spikes of very short-lived work.

Wouldn't that be a lot simpler than asking the scheduler whether they are all blocked? If they are all
very busy churning CPU (say, doing raid6 work or btrfs checksumming), I suspect you would still want
more threads.



Thread overview: 6+ messages
2009-12-18 12:20 [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
2009-12-18 12:20 ` [PATCH 1/2] padata: generic parallelization/serialization interface Steffen Klassert
2009-12-18 12:21 ` [PATCH 2/2] crypto: pcrypt - Add pcrypt crypto parallelization wrapper Steffen Klassert
  -- strict thread matches above, loose matches on Subject: below --
2009-12-21 13:53 workqueue thing Arjan van de Ven
2009-12-21 14:19 ` Tejun Heo
2009-12-21 15:19   ` Arjan van de Ven
2009-12-22  0:00     ` Tejun Heo
2009-12-22 11:10       ` Peter Zijlstra
2009-12-22 17:20         ` Linus Torvalds
2009-12-22 17:47           ` Peter Zijlstra
2009-12-23  3:37             ` Tejun Heo
2009-12-23  6:52               ` Herbert Xu
2009-12-23  8:00                 ` Steffen Klassert
2009-12-23  8:01                   ` [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
2010-01-07  5:39                     ` Herbert Xu
2010-01-16  9:44                       ` David Miller
