linux-mips.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
       [not found] <20210514100106.3404011-1-arnd@kernel.org>
@ 2021-05-14 10:00 ` Arnd Bergmann
  2021-05-17 21:53   ` Eric Biggers
  0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2021-05-14 10:00 UTC (permalink / raw)
  To: linux-arch
  Cc: Linus Torvalds, Vineet Gupta, Arnd Bergmann, Russell King,
	Herbert Xu, David S. Miller, Thomas Bogendoerfer,
	linux-arm-kernel, linux-kernel, linux-crypto, linux-mips

From: Arnd Bergmann <arnd@arndb.de>

As found by Vineet Gupta and Linus Torvalds, gcc has somewhat unexpected
behavior when faced with overlapping unaligned pointers. The kernel's
unaligned/access-ok.h header technically invokes undefined behavior
that happens to usually work on the architectures using it, but if the
compiler optimizes code based on the assumption that undefined behavior
doesn't happen, it can create output that actually causes data corruption.

A related problem was previously found on 32-bit ARMv7, where most
instructions can be used on unaligned data, but 64-bit ldrd/strd causes
an exception. The workaround was to always use the unaligned/le_struct.h
helper instead of unaligned/access-ok.h, in commit 1cce91dfc8f7 ("ARM:
8715/1: add a private asm/unaligned.h").

The same solution should work on all other architectures as well, so
remove the access-ok.h variant and use the other one unconditionally on
all architectures, picking either the big-endian or little-endian version.

With this, the arm specific header can be removed as well, and the
only file including linux/unaligned/access_ok.h gets moved to including
the normal file.

Fortunately, this made almost no difference to the object code produced
by gcc-11. On x86, s390, powerpc, and arc, the resulting binary appears
to be identical to the previous version, while on arm64 and m68k there
are minimal differences that looks like an optimization pass went into
a different direction, usually using fewer stack spills on the new
version.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
---
 arch/arm/include/asm/unaligned.h    | 25 -----------
 arch/mips/crypto/crc32-mips.c       |  2 +-
 include/asm-generic/unaligned.h     | 13 +-----
 include/linux/unaligned/access_ok.h | 68 -----------------------------
 4 files changed, 3 insertions(+), 105 deletions(-)
 delete mode 100644 arch/arm/include/asm/unaligned.h
 delete mode 100644 include/linux/unaligned/access_ok.h

diff --git a/arch/arm/include/asm/unaligned.h b/arch/arm/include/asm/unaligned.h
deleted file mode 100644
index 3c5248fb4cdc..000000000000
--- a/arch/arm/include/asm/unaligned.h
+++ /dev/null
@@ -1,25 +0,0 @@
-#ifndef __ASM_ARM_UNALIGNED_H
-#define __ASM_ARM_UNALIGNED_H
-
-/*
- * We generally want to set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on ARMv6+,
- * but we don't want to use linux/unaligned/access_ok.h since that can lead
- * to traps on unaligned stm/ldm or strd/ldrd.
- */
-#include <asm/byteorder.h>
-
-#if defined(__LITTLE_ENDIAN)
-# include <linux/unaligned/le_struct.h>
-# include <linux/unaligned/generic.h>
-# define get_unaligned	__get_unaligned_le
-# define put_unaligned	__put_unaligned_le
-#elif defined(__BIG_ENDIAN)
-# include <linux/unaligned/be_struct.h>
-# include <linux/unaligned/generic.h>
-# define get_unaligned	__get_unaligned_be
-# define put_unaligned	__put_unaligned_be
-#else
-# error need to define endianess
-#endif
-
-#endif /* __ASM_ARM_UNALIGNED_H */
diff --git a/arch/mips/crypto/crc32-mips.c b/arch/mips/crypto/crc32-mips.c
index faa88a6a74c0..0a03529cf317 100644
--- a/arch/mips/crypto/crc32-mips.c
+++ b/arch/mips/crypto/crc32-mips.c
@@ -8,13 +8,13 @@
  * Copyright (C) 2018 MIPS Tech, LLC
  */
 
-#include <linux/unaligned/access_ok.h>
 #include <linux/cpufeature.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/string.h>
 #include <asm/mipsregs.h>
+#include <asm/unaligned.h>
 
 #include <crypto/internal/hash.h>
 
diff --git a/include/asm-generic/unaligned.h b/include/asm-generic/unaligned.h
index d79df721ae60..36bf03aaa674 100644
--- a/include/asm-generic/unaligned.h
+++ b/include/asm-generic/unaligned.h
@@ -8,22 +8,13 @@
  */
 #include <asm/byteorder.h>
 
-/* Set by the arch if it can handle unaligned accesses in hardware. */
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-# include <linux/unaligned/access_ok.h>
-#endif
-
 #if defined(__LITTLE_ENDIAN)
-# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-#  include <linux/unaligned/le_struct.h>
-# endif
+# include <linux/unaligned/le_struct.h>
 # include <linux/unaligned/generic.h>
 # define get_unaligned	__get_unaligned_le
 # define put_unaligned	__put_unaligned_le
 #elif defined(__BIG_ENDIAN)
-# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-#  include <linux/unaligned/be_struct.h>
-# endif
+# include <linux/unaligned/be_struct.h>
 # include <linux/unaligned/generic.h>
 # define get_unaligned	__get_unaligned_be
 # define put_unaligned	__put_unaligned_be
diff --git a/include/linux/unaligned/access_ok.h b/include/linux/unaligned/access_ok.h
deleted file mode 100644
index 167aa849c0ce..000000000000
--- a/include/linux/unaligned/access_ok.h
+++ /dev/null
@@ -1,68 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_UNALIGNED_ACCESS_OK_H
-#define _LINUX_UNALIGNED_ACCESS_OK_H
-
-#include <linux/kernel.h>
-#include <asm/byteorder.h>
-
-static __always_inline u16 get_unaligned_le16(const void *p)
-{
-	return le16_to_cpup((__le16 *)p);
-}
-
-static __always_inline u32 get_unaligned_le32(const void *p)
-{
-	return le32_to_cpup((__le32 *)p);
-}
-
-static __always_inline u64 get_unaligned_le64(const void *p)
-{
-	return le64_to_cpup((__le64 *)p);
-}
-
-static __always_inline u16 get_unaligned_be16(const void *p)
-{
-	return be16_to_cpup((__be16 *)p);
-}
-
-static __always_inline u32 get_unaligned_be32(const void *p)
-{
-	return be32_to_cpup((__be32 *)p);
-}
-
-static __always_inline u64 get_unaligned_be64(const void *p)
-{
-	return be64_to_cpup((__be64 *)p);
-}
-
-static __always_inline void put_unaligned_le16(u16 val, void *p)
-{
-	*((__le16 *)p) = cpu_to_le16(val);
-}
-
-static __always_inline void put_unaligned_le32(u32 val, void *p)
-{
-	*((__le32 *)p) = cpu_to_le32(val);
-}
-
-static __always_inline void put_unaligned_le64(u64 val, void *p)
-{
-	*((__le64 *)p) = cpu_to_le64(val);
-}
-
-static __always_inline void put_unaligned_be16(u16 val, void *p)
-{
-	*((__be16 *)p) = cpu_to_be16(val);
-}
-
-static __always_inline void put_unaligned_be32(u32 val, void *p)
-{
-	*((__be32 *)p) = cpu_to_be32(val);
-}
-
-static __always_inline void put_unaligned_be64(u64 val, void *p)
-{
-	*((__be64 *)p) = cpu_to_be64(val);
-}
-
-#endif /* _LINUX_UNALIGNED_ACCESS_OK_H */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-14 10:00 ` [PATCH v2 07/13] asm-generic: unaligned always use struct helpers Arnd Bergmann
@ 2021-05-17 21:53   ` Eric Biggers
  2021-05-18  7:25     ` Arnd Bergmann
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Biggers @ 2021-05-17 21:53 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arch, Linus Torvalds, Vineet Gupta, Arnd Bergmann,
	Russell King, Herbert Xu, David S. Miller, Thomas Bogendoerfer,
	linux-arm-kernel, linux-kernel, linux-crypto, linux-mips

On Fri, May 14, 2021 at 12:00:55PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> As found by Vineet Gupta and Linus Torvalds, gcc has somewhat unexpected
> behavior when faced with overlapping unaligned pointers. The kernel's
> unaligned/access-ok.h header technically invokes undefined behavior
> that happens to usually work on the architectures using it, but if the
> compiler optimizes code based on the assumption that undefined behavior
> doesn't happen, it can create output that actually causes data corruption.
> 
> A related problem was previously found on 32-bit ARMv7, where most
> instructions can be used on unaligned data, but 64-bit ldrd/strd causes
> an exception. The workaround was to always use the unaligned/le_struct.h
> helper instead of unaligned/access-ok.h, in commit 1cce91dfc8f7 ("ARM:
> 8715/1: add a private asm/unaligned.h").
> 
> The same solution should work on all other architectures as well, so
> remove the access-ok.h variant and use the other one unconditionally on
> all architectures, picking either the big-endian or little-endian version.

FYI, gcc 10 had a bug where it miscompiled code that uses "packed structs" to
copy between overlapping unaligned pointers
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94994).

I'm not sure whether the kernel will run into that or not, and gcc has since
fixed it.  But it's worth mentioning, especially since the issue mentioned in
this commit sounds very similar (overlapping unaligned pointers), and both
involved implementations of DEFLATE decompression.

Anyway, partly due to the above, in userspace I now only use memcpy() to
implement {get,put}_unaligned_*, since these days it seems to be compiled
optimally and have the least amount of problems.

I wonder if the kernel should do the same, or whether there are still cases
where memcpy() isn't compiled optimally.  armv6/7 used to be one such case, but
it was fixed in gcc 6.

- Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-17 21:53   ` Eric Biggers
@ 2021-05-18  7:25     ` Arnd Bergmann
  2021-05-18 14:56       ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2021-05-18  7:25 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-arch, Linus Torvalds, Vineet Gupta, Russell King,
	Herbert Xu, David S. Miller, Thomas Bogendoerfer, Linux ARM,
	Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER

On Mon, May 17, 2021 at 11:53 PM Eric Biggers <ebiggers@kernel.org> wrote:
> On Fri, May 14, 2021 at 12:00:55PM +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> >
> > As found by Vineet Gupta and Linus Torvalds, gcc has somewhat unexpected
> > behavior when faced with overlapping unaligned pointers. The kernel's
> > unaligned/access-ok.h header technically invokes undefined behavior
> > that happens to usually work on the architectures using it, but if the
> > compiler optimizes code based on the assumption that undefined behavior
> > doesn't happen, it can create output that actually causes data corruption.
> >
> > A related problem was previously found on 32-bit ARMv7, where most
> > instructions can be used on unaligned data, but 64-bit ldrd/strd causes
> > an exception. The workaround was to always use the unaligned/le_struct.h
> > helper instead of unaligned/access-ok.h, in commit 1cce91dfc8f7 ("ARM:
> > 8715/1: add a private asm/unaligned.h").
> >
> > The same solution should work on all other architectures as well, so
> > remove the access-ok.h variant and use the other one unconditionally on
> > all architectures, picking either the big-endian or little-endian version.
>
> FYI, gcc 10 had a bug where it miscompiled code that uses "packed structs" to
> copy between overlapping unaligned pointers
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94994).

Thank you for pointing this out

> I'm not sure whether the kernel will run into that or not, and gcc has since
> fixed it.  But it's worth mentioning, especially since the issue mentioned in
> this commit sounds very similar (overlapping unaligned pointers), and both
> involved implementations of DEFLATE decompression.

I tried reproducing this on the kernel deflate code with the kernel.org gcc-10.1
and gcc-10.3 crosstool versions but couldn't quite get there with Vineet's
preprocessed source https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Trying with both the original get_unaligned() version in there and the
packed-struct
variant, I get the same output from gcc-10.1 and gcc-10.3 when I compile those
myself for arc hs4x , but it's rather different from the output that Vineet got
and I don't know how to spot whether the problem exists in any of those
versions.

> Anyway, partly due to the above, in userspace I now only use memcpy() to
> implement {get,put}_unaligned_*, since these days it seems to be compiled
> optimally and have the least amount of problems.
>
> I wonder if the kernel should do the same, or whether there are still cases
> where memcpy() isn't compiled optimally.  armv6/7 used to be one such case, but
> it was fixed in gcc 6.

It would have to be memmove(), not memcpy() in this case, right?
My feeling is that if gcc-4.9 and gcc-5 produce correct but slightly slower
code, we can live with that, unlike the possibility of gcc-10.{1,2} producing
incorrect code.

Since the new asm/unaligned.h has a single implementation across all
architectures, we could probably fall back to a memmove based version for
the compilers affected by the 94994 bug,  but I'd first need to have a better
way to test regarding whether given combination of asm/unaligned.h and
compiler version runs into this bug.

I have checked your reproducer and confirmed that it does affect x86_64
gcc-10.1 -O3 with my proposed version of asm-generic/unaligned.h, but
does not trigger on any other version (4.9 though 9.3, 10.3 or 11.1), and not
on -O2 or "-O3 -mno-sse" builds or on arm64, but that doesn't necessarily
mean it's safe on these.

        Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18  7:25     ` Arnd Bergmann
@ 2021-05-18 14:56       ` Linus Torvalds
  2021-05-18 15:41         ` Arnd Bergmann
  2021-05-18 21:14         ` David Laight
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2021-05-18 14:56 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Eric Biggers, linux-arch, Vineet Gupta, Russell King, Herbert Xu,
	David S. Miller, Thomas Bogendoerfer, Linux ARM,
	Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER

On Tue, May 18, 2021 at 12:27 AM Arnd Bergmann <arnd@kernel.org> wrote:
> >
> > I wonder if the kernel should do the same, or whether there are still cases
> > where memcpy() isn't compiled optimally.  armv6/7 used to be one such case, but
> > it was fixed in gcc 6.
>
> It would have to be memmove(), not memcpy() in this case, right?

No, it would simply be something like

  #define __get_unaligned_t(type, ptr) \
        ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })

  #define get_unaligned(ptr) \
        __get_unaligned_t(typeof(*(ptr)), ptr)

but honestly, the likelihood that the compiler generates something
horrible (possibly because of KASAN etc) is uncomfortably high.

I'd prefer the __packed thing. We don't actually use -O3, and it's
considered a bad idea, and the gcc bug is as such less likely than
just  the above generating unacceptable code (we have several cases
where "bad code generation" ends up being an actual bug, since we
depend on inlining and depend on some code sequences not generating
calls etc).

But I hate how gcc is buggy in so many places here, and the
straightforward thing is made to explicitly not work.

I absolutely despise compiler people who think it's ok to generate
known bad code based on pointless "undefined behavior" arguments - and
then those same clever optimizations break even when you do things
properly.  It's basically intellectual dishonesty - doing known
fragile things, blaming the user when it breaks, but then not
acknowledging that the fragile shit they did was broken even when the
user bent over backwards.

                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18 14:56       ` Linus Torvalds
@ 2021-05-18 15:41         ` Arnd Bergmann
  2021-05-18 16:12           ` Linus Torvalds
  2021-05-18 21:14         ` David Laight
  1 sibling, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2021-05-18 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Biggers, linux-arch, Vineet Gupta, Russell King, Herbert Xu,
	David S. Miller, Thomas Bogendoerfer, Linux ARM,
	Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER

On Tue, May 18, 2021 at 4:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, May 18, 2021 at 12:27 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > >
> > > I wonder if the kernel should do the same, or whether there are still cases
> > > where memcpy() isn't compiled optimally.  armv6/7 used to be one such case, but
> > > it was fixed in gcc 6.
> >
> > It would have to be memmove(), not memcpy() in this case, right?
>
> No, it would simply be something like
>
>   #define __get_unaligned_t(type, ptr) \
>         ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })
>
>   #define get_unaligned(ptr) \
>         __get_unaligned_t(typeof(*(ptr)), ptr)
>
> but honestly, the likelihood that the compiler generates something
> horrible (possibly because of KASAN etc) is uncomfortably high.
>
> I'd prefer the __packed thing. We don't actually use -O3, and it's
> considered a bad idea, and the gcc bug is as such less likely than
> just  the above generating unacceptable code (we have several cases
> where "bad code generation" ends up being an actual bug, since we
> depend on inlining and depend on some code sequences not generating
> calls etc).

I think the important question is whether we know that the bug that Eric
pointed to can only happen with -O3, or whether it is something in
gcc-10.1 that got triggered by -O3 -msse on x86-64 but could equally
well get triggered on some other architecture without -O3 or
vector instructions enabled.

From the gcc fix at
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=9fa5b473b5b8e289b
it looks like this code path is entered when compiling with
-ftree-loop-vectorize, which is documented as

'-ftree-loop-vectorize'
     Perform loop vectorization on trees.  This flag is enabled by
     default at '-O3' and by '-ftree-vectorize', '-fprofile-use', and
     '-fauto-profile'.

-ftree-vectorize is set in arch/arm/lib/xor-neon.c
-O3 is set for the lz4 and zstd compression helpers and for wireguard.

To be on the safe side, we could pass -fno-tree-loop-vectorize along
with -O3 on the affected gcc versions, or use a bigger hammer
(not use -O3 at all, always set -fno-tree-loop-vectorize, ...).

        Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18 15:41         ` Arnd Bergmann
@ 2021-05-18 16:12           ` Linus Torvalds
  2021-05-18 18:09             ` Jason A. Donenfeld
  2021-05-18 20:51             ` Arnd Bergmann
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2021-05-18 16:12 UTC (permalink / raw)
  To: Arnd Bergmann, Jason A. Donenfeld
  Cc: Eric Biggers, linux-arch, Vineet Gupta, Russell King, Herbert Xu,
	David S. Miller, Thomas Bogendoerfer, Linux ARM,
	Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER

On Tue, May 18, 2021 at 5:42 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> To be on the safe side, we could pass -fno-tree-loop-vectorize along
> with -O3 on the affected gcc versions, or use a bigger hammer
> (not use -O3 at all, always set -fno-tree-loop-vectorize, ...).

I personally think -O3 in general is unsafe.

It has historically been horribly buggy. It's gotten better, but this
case clearly shows that "gotten better" really isn't that high of a
bar.

Very few projects use -O3, which is obviously part of why it's buggy.
But the other part of why it's buggy is that vectorization is simply
very complicated, and honestly, judging by the last report the gcc
people don't care about being careful. They literally are ok with
knowingly generating an off-by-one range check, because "it's
undefined behavior".

With that kind of mentality, I'm not personally all that inclined to
say "sure, use -O3". We know it has bugs even for the well-defined
cases.

> -O3 is set for the lz4 and zstd compression helpers and for wireguard.

I'm actually surprised wireguard would use -O3. Yes, performance is
important. But for wireguard, correctness is certainly important too.
Maybe Jason isn't aware of just how bad gcc -O3 has historically been?

And -O3 has often generated _slower_ code, in addition to the bugs.
It's not like it's a situation where "-O3 is obviously better than
-O2". There's a reason -O2 is the default.

And that tends to be even more true in the kernel than in many user
space programs (ie smaller loops, generally much higher I$ miss rates
etc).

Jason? How big of a deal is that -O3 for wireguard wrt the normal -O2?
There are known buggy gcc versions that aren't ancient.

Of the other cases, that xor-neon.c case actually makes sense. For
that file, it literally exists _only_ to get a vectorized version of
the trivial xor_8regs loop. It's one of the (very very few) cases of
vectorization we actually want. And in that case, we might even want
to make things easier - and more explicit - for the compiler by making
the xor_8regs loops use "restrict" pointers.

That neon case actually wants and needs that tree-vectorization to
DTRT. But maybe it doesn't need the actual _loop_ vectorization? The
xor_8regs code is literally using hand-unrolled loops already, exactly
to make it as simple as possible for the compiler (but the lack of
"restrict" pointers means that it's not all that simple after all, and
I assume the compiler generates conditionals for the NEON case?

lz4 is questionable - yes, upstream lh4 seems to use -O3 (good), but
it also very much uses unaligned accesses, which is where the gcc bug
hits. I doubt that it really needs or wants the loop vectorization.

zstd looks very similar to lz4.

End result: at a minimum, I'd suggest using
"-fno-tree-loop-vectorize", although somebody should check that NEON
case.

And I still think that using O3 for anything halfway complicated
should be considered odd and need some strong numbers to enable.

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18 16:12           ` Linus Torvalds
@ 2021-05-18 18:09             ` Jason A. Donenfeld
  2021-05-18 20:51             ` Arnd Bergmann
  1 sibling, 0 replies; 10+ messages in thread
From: Jason A. Donenfeld @ 2021-05-18 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnd Bergmann, Eric Biggers, linux-arch, Vineet Gupta,
	Russell King, Herbert Xu, David S. Miller, Thomas Bogendoerfer,
	Linux ARM, Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER

Hi Linus,

On Tue, May 18, 2021 at 6:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I'm actually surprised wireguard would use -O3. Yes, performance is
> important. But for wireguard, correctness is certainly important too.
> Maybe Jason isn't aware of just how bad gcc -O3 has historically been?
> Jason? How big of a deal is that -O3 for wireguard wrt the normal -O2?
> There are known buggy gcc versions that aren't ancient.

My impression has always been that O3 might sometimes generate slower
code, but not that it generates buggy code so commonly. Thanks for
letting me know.

I have a habit of compulsively run IDA Pro after making changes (brain
damage from too many years as a "security person" or something), to
see what the compiler did, and I've just been doing that with O3 since
the beginning of the project, so that's what I wound up optimizing
for. Or sometimes I'll work little things out in Godbolt's compiler
explorer. It's not like it matters much most of the time, but
sometimes I enjoy the golf. Anyway, I've never noticed it producing
any clearly wrong code compared to O2. But I'm obviously not testing
on all compilers or on all architectures. So if you think there's
danger lurking somewhere, it seems reasonable to change that to O2.

Comparing gcc 11's output between O2 and O3, it looks like the primary
difference is that the constant propagation is much less aggressive
with O2, and less inlining in general also means that some stores and
loads to local variables across static function calls aren't being
coalesced. A few null checks are removed too, where the compiler can
prove them away.

So while I've never seen issues with that code under O3, I don't see a
super compelling speed up anywhere either, but rather a bunch of
places that may or may not be theoretically faster or slower on some
system, maybe. I can queue up a patch for the next wireguard series I
send to Dave.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18 16:12           ` Linus Torvalds
  2021-05-18 18:09             ` Jason A. Donenfeld
@ 2021-05-18 20:51             ` Arnd Bergmann
  2021-05-18 21:31               ` Eric Biggers
  1 sibling, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2021-05-18 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, Eric Biggers, linux-arch, Vineet Gupta,
	Russell King, Herbert Xu, David S. Miller, Thomas Bogendoerfer,
	Linux ARM, Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER, Nobuhiro Iwamatsu

On Tue, May 18, 2021 at 6:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, May 18, 2021 at 5:42 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> Of the other cases, that xor-neon.c case actually makes sense. For
> that file, it literally exists _only_ to get a vectorized version of
> the trivial xor_8regs loop. It's one of the (very very few) cases of
> vectorization we actually want. And in that case, we might even want
> to make things easier - and more explicit - for the compiler by making
> the xor_8regs loops use "restrict" pointers.
>
> That neon case actually wants and needs that tree-vectorization to
> DTRT. But maybe it doesn't need the actual _loop_ vectorization? The
> xor_8regs code is literally using hand-unrolled loops already, exactly
> to make it as simple as possible for the compiler (but the lack of
> "restrict" pointers means that it's not all that simple after all, and
> I assume the compiler generates conditionals for the NEON case?

Right, I think there is an ongoing debate over how to best handle this
one in clang, since that does not do any vectorization for this file
unless the pointers are marked "restrict". As far as I remember, there
are some callers that want to do the xor in place though, which
means restrict is wrong.

> lz4 is questionable - yes, upstream lh4 seems to use -O3 (good), but
> it also very much uses unaligned accesses, which is where the gcc bug
> hits. I doubt that it really needs or wants the loop vectorization.

I ran some limited speed tests with the lz4 sources that come with Ubuntu,
using gcc-10.3 on an AMD Zen1 Threadripper with 10GB of /dev/urandom
input.
This package patches the sources to use -O2 and no vectorization,
which turns out to be the fastest combination for my stupid test as well.

The results are barely above noise, but it appears that  -O2
-ftree-loop-vectorize
makes it slightly slower than just -O2, while -O3 is even slower than
that regardless of -fno-tree-loop-vectorize/-ftree-loop-vectorize.

I see that Nobuhiro Iwamatsu (Cc'd) changed the Debian lz4 package to
use -O2, but I don't see an explanation for it. I also see that the lz4 sources
force -O2 on ppc64 because -O3 causes a 30% slowdown from vectorization.
The kernel version is missing the bit that does this.

> zstd looks very similar to lz4.

> End result: at a minimum, I'd suggest using
> "-fno-tree-loop-vectorize", although somebody should check that NEON
> case.

> And I still think that using O3 for anything halfway complicated
> should be considered odd and need some strong numbers to enable.

Agreed. I think there is a fairly strong case for just using -O2 on lz4
and backport that to stable.
Searching for lz4 bugs with -O3 also finds several reports including
one that I sent myself:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702

I see that user space zstd is built with -O3 in Debian, but it the changelog
also lists "Improved : better speed on clang and gcc -O2, thanks to Eric
Biggers", so maybe Eric has some useful ideas on whether we should
just use -O2 for the in-kernel version.

        Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18 14:56       ` Linus Torvalds
  2021-05-18 15:41         ` Arnd Bergmann
@ 2021-05-18 21:14         ` David Laight
  1 sibling, 0 replies; 10+ messages in thread
From: David Laight @ 2021-05-18 21:14 UTC (permalink / raw)
  To: 'Linus Torvalds', Arnd Bergmann
  Cc: Eric Biggers, linux-arch, Vineet Gupta, Russell King, Herbert Xu,
	David S. Miller, Thomas Bogendoerfer, Linux ARM,
	Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER

From: Linus Torvalds
> Sent: 18 May 2021 15:56
> 
> On Tue, May 18, 2021 at 12:27 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > >
> > > I wonder if the kernel should do the same, or whether there are still cases
> > > where memcpy() isn't compiled optimally.  armv6/7 used to be one such case, but
> > > it was fixed in gcc 6.
> >
> > It would have to be memmove(), not memcpy() in this case, right?
> 
> No, it would simply be something like
> 
>   #define __get_unaligned_t(type, ptr) \
>         ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })

You still need something to ensure that gcc can't assume that
'ptr' has an aligned type.
If there is an 'int *ptr' visible in the call chain no amount
of (void *) casts will make gcc forget the alignment.
So the memcpy() will get converted to an aligned load-store pair.
(This has always caused grief on sparc.)

A cast though (long) might be enough, as might a cast to a __packed
struct pointer type.
Using a union of the two pointer types might be ok - but might
generate a store/load to stack.
An alternative is an asm statement with input and output of different
pointer types but using the same register for both.
That ought to force the compile to forget any tracked type
and value.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers
  2021-05-18 20:51             ` Arnd Bergmann
@ 2021-05-18 21:31               ` Eric Biggers
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Biggers @ 2021-05-18 21:31 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Jason A. Donenfeld, linux-arch, Vineet Gupta,
	Russell King, Herbert Xu, David S. Miller, Thomas Bogendoerfer,
	Linux ARM, Linux Kernel Mailing List,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	open list:BROADCOM NVRAM DRIVER, Nobuhiro Iwamatsu

On Tue, May 18, 2021 at 10:51:23PM +0200, Arnd Bergmann wrote:
> 
> > zstd looks very similar to lz4.
> 
> > End result: at a minimum, I'd suggest using
> > "-fno-tree-loop-vectorize", although somebody should check that NEON
> > case.
> 
> > And I still think that using O3 for anything halfway complicated
> > should be considered odd and need some strong numbers to enable.
> 
> Agreed. I think there is a fairly strong case for just using -O2 on lz4
> and backport that to stable.
> Searching for lz4 bugs with -O3 also finds several reports including
> one that I sent myself:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702
> 
> I see that user space zstd is built with -O3 in Debian, but it the changelog
> also lists "Improved : better speed on clang and gcc -O2, thanks to Eric
> Biggers", so maybe Eric has some useful ideas on whether we should
> just use -O2 for the in-kernel version.
> 

In my opinion, -O2 is a good default even for compression code.  I generally
don't see any benefit from -O3 in compression code I've written.

That being said, -O2 is what I usually use during development.  Other people
could write code that relies on -O3 to be optimized well.

The Makefiles for lz4 and zstd use -O3 by default, which is a little concerning.
I do expect that they're still well-written enough to do well with -O2 too, but
it would require doing benchmarks to tell for sure.  (As Arnd noted, it happens
that I did do such benchmarks on zstd about 5 years ago, and I found an issue
where some functions weren't marked inline when they should be, causing them to
be inlined at -O3 but not at -O2.  That got fixed.)

- Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-05-18 21:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20210514100106.3404011-1-arnd@kernel.org>
2021-05-14 10:00 ` [PATCH v2 07/13] asm-generic: unaligned always use struct helpers Arnd Bergmann
2021-05-17 21:53   ` Eric Biggers
2021-05-18  7:25     ` Arnd Bergmann
2021-05-18 14:56       ` Linus Torvalds
2021-05-18 15:41         ` Arnd Bergmann
2021-05-18 16:12           ` Linus Torvalds
2021-05-18 18:09             ` Jason A. Donenfeld
2021-05-18 20:51             ` Arnd Bergmann
2021-05-18 21:31               ` Eric Biggers
2021-05-18 21:14         ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).