Linux cryptographic layer development
* [PATCH 1/2] crypto: twofish - disable AVX2 implementation
@ 2013-06-02 16:51 Jussi Kivilinna
  2013-06-02 16:51 ` [PATCH 2/2] crypto: blowfish " Jussi Kivilinna
From: Jussi Kivilinna @ 2013-06-02 16:51 UTC
  To: linux-crypto; +Cc: Herbert Xu, David S. Miller

It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
workload (tested on Core i5-4570) and causes twofish_avx2 to be significantly
slower than twofish_avx. So disable the AVX2 implementation to avoid
performance regressions.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
---
 crypto/Kconfig |    1 +
 1 file changed, 1 insertion(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index d1ca631..678a6ed 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1318,6 +1318,7 @@ config CRYPTO_TWOFISH_AVX_X86_64
 config CRYPTO_TWOFISH_AVX2_X86_64
 	tristate "Twofish cipher algorithm (x86_64/AVX2)"
 	depends on X86 && 64BIT
+	depends on BROKEN
 	select CRYPTO_ALGAPI
 	select CRYPTO_CRYPTD
 	select CRYPTO_ABLK_HELPER_X86


* [PATCH 2/2] crypto: blowfish - disable AVX2 implementation
  2013-06-02 16:51 [PATCH 1/2] crypto: twofish - disable AVX2 implementation Jussi Kivilinna
@ 2013-06-02 16:51 ` Jussi Kivilinna
  2013-06-05  8:34   ` Herbert Xu
From: Jussi Kivilinna @ 2013-06-02 16:51 UTC
  To: linux-crypto; +Cc: Herbert Xu, David S. Miller

It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
workload (tested on Core i5-4570) and causes blowfish-avx2 to be significantly
slower than blowfish-amd64. So disable the AVX2 implementation to avoid
performance regressions.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
---
 crypto/Kconfig |    1 +
 1 file changed, 1 insertion(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 678a6ed..8ca52c5 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -842,6 +842,7 @@ config CRYPTO_BLOWFISH_X86_64
 config CRYPTO_BLOWFISH_AVX2_X86_64
 	tristate "Blowfish cipher algorithm (x86_64/AVX2)"
 	depends on X86 && 64BIT
+	depends on BROKEN
 	select CRYPTO_ALGAPI
 	select CRYPTO_CRYPTD
 	select CRYPTO_ABLK_HELPER_X86


* Re: [PATCH 2/2] crypto: blowfish - disable AVX2 implementation
  2013-06-02 16:51 ` [PATCH 2/2] crypto: blowfish " Jussi Kivilinna
@ 2013-06-05  8:34   ` Herbert Xu
  2013-06-05 12:26     ` Jussi Kivilinna
From: Herbert Xu @ 2013-06-05  8:34 UTC
  To: Jussi Kivilinna; +Cc: linux-crypto, David S. Miller

On Sun, Jun 02, 2013 at 07:51:52PM +0300, Jussi Kivilinna wrote:
> It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
> workload (tested on Core i5-4570) and causes blowfish-avx2 to be significantly
> slower than blowfish-amd64. So disable the AVX2 implementation to avoid
> performance regressions.
> 
> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Both patches applied to crypto.  I presume you're working on
a more permanent solution on this?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 2/2] crypto: blowfish - disable AVX2 implementation
  2013-06-05  8:34   ` Herbert Xu
@ 2013-06-05 12:26     ` Jussi Kivilinna
From: Jussi Kivilinna @ 2013-06-05 12:26 UTC
  To: Herbert Xu; +Cc: linux-crypto, David S. Miller

On 05.06.2013 11:34, Herbert Xu wrote:
> On Sun, Jun 02, 2013 at 07:51:52PM +0300, Jussi Kivilinna wrote:
>> It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
>> workload (tested on Core i5-4570) and causes blowfish-avx2 to be significantly
>> slower than blowfish-amd64. So disable the AVX2 implementation to avoid
>> performance regressions.
>>
>> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
> 
> Both patches applied to crypto.  I presume you're working on
> a more permanent solution on this?

Yes, I've been looking for a solution. The problem is that I assumed vgather would be quicker than emulating gather with vpextr/vpinsr instructions. But it appears that vgather has about the same speed as a group of vpextr/vpinsr instructions doing the gather manually. So doing

    asm volatile(
	"vpgatherdd %%xmm0, (%[ptr], %%xmm8, 4), %%xmm9;        \n\t"
	"vpcmpeqd %%xmm0, %%xmm0, %%xmm0; /* reset mask */      \n\t"
	"vpgatherdd %%xmm0, (%[ptr], %%xmm9, 4), %%xmm8;        \n\t"
	"vpcmpeqd %%xmm0, %%xmm0, %%xmm0;                       \n\t"
	:: [ptr] "r" (&mem[0]) : "memory", "xmm0", "xmm8", "xmm9"
    );

in a loop is slightly _slower_ than manually extracting and inserting values with

    asm volatile(
        "vmovd       %%xmm8, %%eax;                             \n\t"
        "vpextrd $1, %%xmm8, %%edx;                             \n\t"
        "vmovd       (%[ptr], %%rax, 4), %%xmm10;               \n\t"
        "vpextrd $2, %%xmm8, %%eax;                             \n\t"
        "vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10;      \n\t"
        "vpextrd $3, %%xmm8, %%edx;                             \n\t"
        "vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10;      \n\t"
        "vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm9;       \n\t"

        "vmovd       %%xmm9, %%eax;                             \n\t"
        "vpextrd $1, %%xmm9, %%edx;                             \n\t"
        "vmovd       (%[ptr], %%rax, 4), %%xmm10;               \n\t"
        "vpextrd $2, %%xmm9, %%eax;                             \n\t"
        "vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10;      \n\t"
        "vpextrd $3, %%xmm9, %%edx;                             \n\t"
        "vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10;      \n\t"
        "vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm8;       \n\t"
        :: [ptr] "r" (&mem[0]) : "memory", "eax", "edx", "xmm8", "xmm9", "xmm10"
    );

vpextr/vpinsr cannot be used with 256-bit wide ymm registers, so 'vinserti128/vextracti128' are needed, which makes the manual gather about the same speed as vpgatherdd.
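
For reference, both snippets above run the same dependent chain of table look-ups: each result vector becomes the index vector for the next gather. A minimal scalar sketch of that behaviour in portable C follows. The names gather4_model and chained_gather4 are made up for illustration, and the sketch assumes an all-ones mask and in-range indices. It only models what is computed and says nothing about the speed of either variant.

```c
#include <stdint.h>

/* Scalar model of a 4-lane dword gather with an all-ones mask:
 * every lane loads table[idx[lane]]. Indices are assumed in range. */
static void gather4_model(uint32_t dst[4], const uint32_t *table,
                          const uint32_t idx[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = table[idx[i]];
}

/* Dependent chain as in the benchmark: the result of one gather is
 * used as the index vector of the next one. */
static void chained_gather4(uint32_t v[4], const uint32_t *table, int rounds)
{
    uint32_t tmp[4];

    for (int r = 0; r < rounds; r++) {
        gather4_model(tmp, table, v);
        for (int i = 0; i < 4; i++)
            v[i] = tmp[i];
    }
}
```

Two rounds of chained_gather4 correspond to one pass through either asm block (xmm8 to xmm9 and back to xmm8).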

Now the block cipher implementations need to use all bytes of the vector register for table look-ups, and the way this is done in the AVX implementation of Twofish (move data from the vector register to general-purpose registers, handle byte extraction and table look-ups there, and move the processed data back to the vector register) is about two to three times faster than the current AVX2 implementation using vgather.
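
The general-purpose-register approach described above can be sketched in plain C, as an illustration only. The helper names lookup4_xor and lookup_lanes and the flat 4 x 256 table layout are assumptions made for this example; this is not the kernel's Twofish code. Each 32-bit lane is split into bytes with shifts and masks, each byte indexes its own table, and the four results are combined with XOR.

```c
#include <stdint.h>

/* Hypothetical helper: split a 32-bit word into its four bytes, look
 * each byte up in its own 256-entry table and XOR the results.
 * 'tables' is assumed to hold 4 * 256 entries, table k at k * 256. */
static uint32_t lookup4_xor(const uint32_t *tables, uint32_t x)
{
    return tables[0 * 256 + (x & 0xff)] ^
           tables[1 * 256 + ((x >> 8) & 0xff)] ^
           tables[2 * 256 + ((x >> 16) & 0xff)] ^
           tables[3 * 256 + (x >> 24)];
}

/* Process the lanes of a vector one by one in scalar code, standing in
 * for "move to general-purpose registers, look up, move back". */
static void lookup_lanes(uint32_t lanes[], int n, const uint32_t *tables)
{
    for (int i = 0; i < n; i++)
        lanes[i] = lookup4_xor(tables, lanes[i]);
}
```

All address computation stays in scalar registers here, so no gather instruction is involved.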

Blowfish does not do much processing in addition to table look-ups, so there is not much that can be done. With Twofish, the table look-ups are the most computationally heavy part, and I don't think that the wider vector registers in the other parts are going to give much of a boost. So the permanent solution is likely to be a revert.
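
To show how little work Blowfish does besides the look-ups, here is a minimal sketch of its round function F in C. The function name blowfish_f and the flat layout of the four S-boxes in one array are choices made for this example, not taken from the kernel sources. Per 32-bit input there are four table loads, two additions and one XOR.

```c
#include <stdint.h>

/* Blowfish round function F, with the four 256-entry S-boxes stored
 * back to back in 's' (a layout assumed for this sketch):
 *   F(x) = ((S0[a] + S1[b]) ^ S2[c]) + S3[d]
 * where a is the most significant byte of x and d the least. */
static uint32_t blowfish_f(const uint32_t *s, uint32_t x)
{
    uint32_t a = x >> 24;
    uint32_t b = (x >> 16) & 0xff;
    uint32_t c = (x >> 8) & 0xff;
    uint32_t d = x & 0xff;

    return ((s[a] + s[256 + b]) ^ s[512 + c]) + s[768 + d];
}
```

The four loads dominate the cost of a round, so wider vector registers have little other work to speed up.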

-Jussi

> 
> Thanks,
> 
