Re: [PATCH 0/2] xor: enable auto-vectorization in Clang

linux-crypto.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Nathan Chancellor <nathan@kernel.org>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	Nick Desaulniers <ndesaulniers@google.com>
Subject: Re: [PATCH 0/2] xor: enable auto-vectorization in Clang
Date: Thu, 27 Jan 2022 14:36:57 -0700	[thread overview]
Message-ID: <YfMQeRVkNMHr81P9@dev-arch.archlinux-ax161> (raw)
In-Reply-To: <20220127081227.2430-1-ardb@kernel.org>

Hi Ard,

On Thu, Jan 27, 2022 at 09:12:25AM +0100, Ard Biesheuvel wrote:
> Update the xor_blocks() prototypes so that the compiler understands that
> the inputs always refer to distinct regions of memory. This is implied
> by the existing implementations, as they use different granularities for
> the load/xor/store loops.
> 
> With that, we can fix the ARM/Clang version, which refuses to SIMD
> vectorize otherwise, and throws a spurious warning related to the GCC
> version being incompatible.
> 
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Nathan Chancellor <nathan@kernel.org>
> 
> Ard Biesheuvel (2):
>   lib/xor: make xor prototypes more friendely to compiler vectorization
>   crypto: arm/xor - make vectorized C code Clang-friendly

I tested multi_v7_defconfig + CONFIG_BTRFS=y (to get
CONFIG_XOR_BLOCKS=y) in QEMU 6.2.0 (10 boots) and the xor neon code gets
faster according to do_xor_speed():

mainline @ 626b2dda7651:

[    2.591449]    neon            :  1166 MB/sec
[    2.579454]    neon            :  1118 MB/sec
[    2.589061]    neon            :  1163 MB/sec
[    2.581827]    neon            :  1167 MB/sec
[    2.599079]    neon            :  1166 MB/sec
[    2.579252]    neon            :  1147 MB/sec
[    2.582637]    neon            :  1168 MB/sec
[    2.582872]    neon            :  1164 MB/sec
[    2.570671]    neon            :  1167 MB/sec
[    2.571830]    neon            :  1166 MB/sec

mainline @ 626b2dda7651 with series:

[    2.570227]    neon            :  1238 MB/sec
[    2.571642]    neon            :  1237 MB/sec
[    2.580370]    neon            :  1234 MB/sec
[    2.581966]    neon            :  1238 MB/sec
[    2.582313]    neon            :  1236 MB/sec
[    2.572291]    neon            :  1238 MB/sec
[    2.570625]    neon            :  1233 MB/sec
[    2.571897]    neon            :  1234 MB/sec
[    2.589616]    neon            :  1228 MB/sec
[    2.582449]    neon            :  1236 MB/sec

This series is currently broken for powerpc [1], as the functions in
arch/powerpc/lib/xor_vmx.c were not updated.

arch/powerpc/lib/xor_vmx.c:52:6: error: conflicting types for '__xor_altivec_2'
void __xor_altivec_2(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:9:6: note: previous declaration is here
void __xor_altivec_2(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:70:6: error: conflicting types for '__xor_altivec_3'
void __xor_altivec_3(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:11:6: note: previous declaration is here
void __xor_altivec_3(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:92:6: error: conflicting types for '__xor_altivec_4'
void __xor_altivec_4(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:14:6: note: previous declaration is here
void __xor_altivec_4(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:119:6: error: conflicting types for '__xor_altivec_5'
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:18:6: note: previous declaration is here
void __xor_altivec_5(unsigned long bytes, unsigned long * __restrict p1,
     ^
4 errors generated.

If I fix that up [2], it builds and resolves an instance of
-Wframe-larger-than= in the xor altivec code, as seen with
pmac32_defconfig.

Before this series:

arch/powerpc/lib/xor_vmx.c:119:6: error: stack frame size (1232) exceeds limit (1024) in '__xor_altivec_5' [-Werror,-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
     ^
1 error generated.

After this patch (with CONFIG_FRAME_WARN=100 and
CONFIG_PPC_DISABLE_WERROR=y):

arch/powerpc/lib/xor_vmx.c:52:6: warning: stack frame size (128) exceeds limit (100) in '__xor_altivec_2' [-Wframe-larger-than]
void __xor_altivec_2(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:71:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_3' [-Wframe-larger-than]
void __xor_altivec_3(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:95:6: warning: stack frame size (144) exceeds limit (100) in '__xor_altivec_4' [-Wframe-larger-than]
void __xor_altivec_4(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:124:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_5' [-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes,
     ^
4 warnings generated.

There is a similar performance gain as ARM according to do_xor_speed():

Before:

   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   219 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec

After:

   altivec         :   278 MB/sec
   altivec         :   276 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec

I did also build test arm64 and x86_64 and saw no errors. I did runtime
test arm64 for improvements and did not see any, which is good, since I
take that as meaning it was working fine before and there is no
regression.

Once the build error is fixed, consider this series:

Tested-by: Nathan Chancellor <nathan@kernel.org>

[1]: https://lore.kernel.org/r/202112310646.kuh2pXiG-lkp@intel.com/
[2]: https://github.com/ClangBuiltLinux/linux/issues/563#issuecomment-1005175153

Cheers,
Nathan

     prev parent reply	other threads:[~2022-01-27 21:37 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-27  8:12 [PATCH 0/2] xor: enable auto-vectorization in Clang Ard Biesheuvel
2022-01-27  8:12 ` [PATCH 1/2] lib/xor: make xor prototypes more friendely to compiler vectorization Ard Biesheuvel
2022-02-05  4:26   ` Herbert Xu
2022-01-27  8:12 ` [PATCH 2/2] crypto: arm/xor - make vectorized C code Clang-friendly Ard Biesheuvel
2022-01-27 21:36 ` Nathan Chancellor [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YfMQeRVkNMHr81P9@dev-arch.archlinux-ax161 \
    --to=nathan@kernel.org \
    --cc=ardb@kernel.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=linux-crypto@vger.kernel.org \
    --cc=ndesaulniers@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).