From: Nathan Chancellor <nathan@kernel.org>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
Nick Desaulniers <ndesaulniers@google.com>
Subject: Re: [PATCH 0/2] xor: enable auto-vectorization in Clang
Date: Thu, 27 Jan 2022 14:36:57 -0700 [thread overview]
Message-ID: <YfMQeRVkNMHr81P9@dev-arch.archlinux-ax161> (raw)
In-Reply-To: <20220127081227.2430-1-ardb@kernel.org>
Hi Ard,
On Thu, Jan 27, 2022 at 09:12:25AM +0100, Ard Biesheuvel wrote:
> Update the xor_blocks() prototypes so that the compiler understands that
> the inputs always refer to distinct regions of memory. This is implied
> by the existing implementations, as they use different granularities for
> the load/xor/store loops.
>
> With that, we can fix the ARM/Clang version, which refuses to SIMD
> vectorize otherwise, and throws a spurious warning related to the GCC
> version being incompatible.
>
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Nathan Chancellor <nathan@kernel.org>
>
> Ard Biesheuvel (2):
> lib/xor: make xor prototypes more friendely to compiler vectorization
> crypto: arm/xor - make vectorized C code Clang-friendly
I tested multi_v7_defconfig + CONFIG_BTRFS=y (to get
CONFIG_XOR_BLOCKS=y) in QEMU 6.2.0 (10 boots) and the xor neon code gets
faster according to do_xor_speed():
mainline @ 626b2dda7651:
[ 2.591449] neon : 1166 MB/sec
[ 2.579454] neon : 1118 MB/sec
[ 2.589061] neon : 1163 MB/sec
[ 2.581827] neon : 1167 MB/sec
[ 2.599079] neon : 1166 MB/sec
[ 2.579252] neon : 1147 MB/sec
[ 2.582637] neon : 1168 MB/sec
[ 2.582872] neon : 1164 MB/sec
[ 2.570671] neon : 1167 MB/sec
[ 2.571830] neon : 1166 MB/sec
mainline @ 626b2dda7651 with series:
[ 2.570227] neon : 1238 MB/sec
[ 2.571642] neon : 1237 MB/sec
[ 2.580370] neon : 1234 MB/sec
[ 2.581966] neon : 1238 MB/sec
[ 2.582313] neon : 1236 MB/sec
[ 2.572291] neon : 1238 MB/sec
[ 2.570625] neon : 1233 MB/sec
[ 2.571897] neon : 1234 MB/sec
[ 2.589616] neon : 1228 MB/sec
[ 2.582449] neon : 1236 MB/sec
This series is currently broken for powerpc [1], as the functions in
arch/powerpc/lib/xor_vmx.c were not updated.
arch/powerpc/lib/xor_vmx.c:52:6: error: conflicting types for '__xor_altivec_2'
void __xor_altivec_2(unsigned long bytes, unsigned long *v1_in,
^
arch/powerpc/lib/xor_vmx.h:9:6: note: previous declaration is here
void __xor_altivec_2(unsigned long bytes, unsigned long * __restrict p1,
^
arch/powerpc/lib/xor_vmx.c:70:6: error: conflicting types for '__xor_altivec_3'
void __xor_altivec_3(unsigned long bytes, unsigned long *v1_in,
^
arch/powerpc/lib/xor_vmx.h:11:6: note: previous declaration is here
void __xor_altivec_3(unsigned long bytes, unsigned long * __restrict p1,
^
arch/powerpc/lib/xor_vmx.c:92:6: error: conflicting types for '__xor_altivec_4'
void __xor_altivec_4(unsigned long bytes, unsigned long *v1_in,
^
arch/powerpc/lib/xor_vmx.h:14:6: note: previous declaration is here
void __xor_altivec_4(unsigned long bytes, unsigned long * __restrict p1,
^
arch/powerpc/lib/xor_vmx.c:119:6: error: conflicting types for '__xor_altivec_5'
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
^
arch/powerpc/lib/xor_vmx.h:18:6: note: previous declaration is here
void __xor_altivec_5(unsigned long bytes, unsigned long * __restrict p1,
^
4 errors generated.
If I fix that up [2], it builds and resolves an instance of
-Wframe-larger-than= in the xor altivec code, as seen with
pmac32_defconfig.
Before this series:
arch/powerpc/lib/xor_vmx.c:119:6: error: stack frame size (1232) exceeds limit (1024) in '__xor_altivec_5' [-Werror,-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
^
1 error generated.
After this patch (with CONFIG_FRAME_WARN=100 and
CONFIG_PPC_DISABLE_WERROR=y):
arch/powerpc/lib/xor_vmx.c:52:6: warning: stack frame size (128) exceeds limit (100) in '__xor_altivec_2' [-Wframe-larger-than]
void __xor_altivec_2(unsigned long bytes,
^
arch/powerpc/lib/xor_vmx.c:71:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_3' [-Wframe-larger-than]
void __xor_altivec_3(unsigned long bytes,
^
arch/powerpc/lib/xor_vmx.c:95:6: warning: stack frame size (144) exceeds limit (100) in '__xor_altivec_4' [-Wframe-larger-than]
void __xor_altivec_4(unsigned long bytes,
^
arch/powerpc/lib/xor_vmx.c:124:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_5' [-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes,
^
4 warnings generated.
There is a similar performance gain as ARM according to do_xor_speed():
Before:
altivec : 222 MB/sec
altivec : 222 MB/sec
altivec : 222 MB/sec
altivec : 219 MB/sec
altivec : 222 MB/sec
altivec : 222 MB/sec
altivec : 222 MB/sec
altivec : 222 MB/sec
altivec : 222 MB/sec
altivec : 222 MB/sec
After:
altivec : 278 MB/sec
altivec : 276 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
altivec : 278 MB/sec
I did also build test arm64 and x86_64 and saw no errors. I did runtime
test arm64 for improvements and did not see any, which is good, since I
take that as meaning it was working fine before and there is no
regression.
Once the build error is fixed, consider this series:
Tested-by: Nathan Chancellor <nathan@kernel.org>
[1]: https://lore.kernel.org/r/202112310646.kuh2pXiG-lkp@intel.com/
[2]: https://github.com/ClangBuiltLinux/linux/issues/563#issuecomment-1005175153
Cheers,
Nathan
prev parent reply other threads:[~2022-01-27 21:37 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-27 8:12 [PATCH 0/2] xor: enable auto-vectorization in Clang Ard Biesheuvel
2022-01-27 8:12 ` [PATCH 1/2] lib/xor: make xor prototypes more friendely to compiler vectorization Ard Biesheuvel
2022-02-05 4:26 ` Herbert Xu
2022-01-27 8:12 ` [PATCH 2/2] crypto: arm/xor - make vectorized C code Clang-friendly Ard Biesheuvel
2022-01-27 21:36 ` Nathan Chancellor [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YfMQeRVkNMHr81P9@dev-arch.archlinux-ax161 \
--to=nathan@kernel.org \
--cc=ardb@kernel.org \
--cc=herbert@gondor.apana.org.au \
--cc=linux-crypto@vger.kernel.org \
--cc=ndesaulniers@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).