linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
@ 2025-06-05 16:47 Mateusz Guzik
  2025-06-05 18:32 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mateusz Guzik @ 2025-06-05 16:47 UTC (permalink / raw)
  To: torvalds; +Cc: mingo, x86, linux-kernel, Mateusz Guzik

gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
comes with a significant penalty on CPUs without the respective fast
short ops bits (FSRM/FSRS).

Another point is that even uarchs with FSRM don't necessarily have FSRS (Ice
Lake and Sapphire Rapids don't).

More importantly, rep movsq is not fast even if FSRM is present.

The issue got reported to upstream gcc, but no progress was made and it
looks like nothing will happen for the foreseeable future (see links
1-3).

In the meantime performance is left on the table. Here is a sample result from
compiling a hello world program in a loop (in compilations/s):

Sapphire Rapids:
before:	979
after:	997 (+1.8%)

AMD EPYC 9R14:
before:	808
after:	815 (+0.8%)

So this is very much visible outside of a microbenchmark setting.

The workload is very page-fault heavy, and the inlined copy lands in sync_regs():
<+0>:     endbr64
<+4>:     mov    %gs:0x22ca5d4(%rip),%rax        # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>:    mov    %rdi,%rsi
<+15>:    sub    $0xa8,%rax
<+21>:    cmp    %rdi,%rax
<+24>:    je     0xffffffff82244a55 <sync_regs+37>
<+26>:    mov    $0x15,%ecx
<+31>:    mov    %rax,%rdi
<+34>:    rep movsq %ds:(%rsi),%es:(%rdi)
<+37>:    jmp    0xffffffff82256ba0 <__x86_return_thunk>

When microbenchmarking page faults, perf top shows:

before:
  22.07%  [kernel]                  [k] asm_exc_page_fault
  12.83%  pf_processes              [.] testcase
  11.81%  [kernel]                  [k] sync_regs

after:
  26.06%  [kernel]                  [k] asm_exc_page_fault
  13.18%  pf_processes              [.] testcase
  [..]
   0.91%  [kernel]              [k] sync_regs

A massive reduction in execution time of the routine.

Link 1: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Link 2: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
Link 3: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
Link 4: https://lore.kernel.org/oe-lkp/202504181042.54ea2b8a-lkp@intel.com/
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
v2:
- only do it if not building with CONFIG_X86_NATIVE_CPU

Hi Linus,

RFC for the patch was posted here:
https://lore.kernel.org/all/xmzxiwno5q3ordgia55wyqtjqbefxpami5wevwltcto52fehbv@ul44rsesp4kw/

You rejected it on two grounds:
- this should be handled by gcc itself -- agreed, but based on the
  interaction in the bugzillas I filed for them I don't believe this will
  happen any time soon (if ever, to be frank)
- messing with local optimization flags -- perhaps ifdefing on
  CONFIG_X86_NATIVE_CPU would be good enough? if not, the thing can be
  hidden behind an option (default Y) so interested parties can whack it

See the commit message for perf numbers. It would be a shame not to get
these wins just because gcc is too stubborn.

While I completely understand not liking compiler-specific hacks, I
believe I made a good enough case for rolling with them here.

That said, if you don't see any justification to get something of this
sort in, I'm dropping the matter.

cheers

 arch/x86/Makefile | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 1913d342969b..9eb75bd7c81d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,31 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
     endif
 endif
 
+ifdef CONFIG_CC_IS_GCC
+ifndef CONFIG_X86_NATIVE_CPU
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops of sizes known at compilation time gcc quickly resorts to issuing rep
+# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
+# latency and it is faster to issue regular stores (even in loops) to handle
+# small buffers.
+#
+# This of course comes at an expense in i-cache footprint: bloat-o-meter
+# reported a 0.23% size increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues a few movs and in
+# the worst case creates an unrolled loop of 4 x 8-byte stores.
+#
+# The upper limit was chosen semi-arbitrarily, as uarchs differ wildly in the
+# threshold past which rep-prefixed ops become faster; 256 is the lowest
+# common denominator. This should be fixed in the compiler.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+endif
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-05 16:47 [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops Mateusz Guzik
@ 2025-06-05 18:32 ` Linus Torvalds
  2025-06-05 19:00 ` Peter Zijlstra
  2025-06-09 21:19 ` David Laight
  2 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2025-06-05 18:32 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: mingo, x86, linux-kernel

On Thu, 5 Jun 2025 at 09:47, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> comes with a significant penalty on CPUs without the respective fast
> short ops bits (FSRM/FSRS).

I have said this before, and I'll say it again: I do not want random
crazy internal compiler tuning flags in the kernel sources.

We've had them before with things like inline limits, and it's
absolutely horrendous.

If you believe in this so much, add it to your gcc spec file. Or
continue to push gcc code improvement.

But this is not in any way kernel-specific, and I do not want to have
random "compiler internal modification flags" for code generation.

We want to have much higher-level things like "-O2" and "-march=xyz"
for optimization.

Now, for *correctness* issues like instruction choices, we will do odd
low-level internal flags like "don't use AVX", or
"-fno-strict-overflow" that are fixing ABI issues or bugs in the
language definition. So it's not like we don't ever do low-level
internal implementation compiler flags, but not for random
microarchitecture tuning.

           Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-05 16:47 [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops Mateusz Guzik
  2025-06-05 18:32 ` Linus Torvalds
@ 2025-06-05 19:00 ` Peter Zijlstra
  2025-06-06  6:13   ` Uros Bizjak
                     ` (2 more replies)
  2025-06-09 21:19 ` David Laight
  2 siblings, 3 replies; 18+ messages in thread
From: Peter Zijlstra @ 2025-06-05 19:00 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: torvalds, mingo, x86, linux-kernel, ubizjak

On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> comes with a significant penalty on CPUs without the respective fast
> short ops bits (FSRM/FSRS).

I don't suppose there's a magic compiler toggle to make it emit prefix
padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
right?

Something like:

   2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)

because if we can get the compilers to do this; then I can get objtool
to collect all these locations and then we can runtime patch them to be:

   call rep_movs_alternative / rep_stos_alternative

or whatever other crap we want really.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-05 19:00 ` Peter Zijlstra
@ 2025-06-06  6:13   ` Uros Bizjak
  2025-06-06  6:27     ` Uros Bizjak
  2025-06-06  7:27   ` Uros Bizjak
  2025-06-06 15:36   ` Mateusz Guzik
  2 siblings, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2025-06-06  6:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > comes with a significant penalty on CPUs without the respective fast
> > short ops bits (FSRM/FSRS).
>
> I don't suppose there's a magic compiler toggle to make it emit prefix
> padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> right?
>
> Something like:
>
>    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)

This won't fly, because gas complains:

z.s: Assembler messages:
z.s:1: Error: same type of prefix used twice

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-06  6:13   ` Uros Bizjak
@ 2025-06-06  6:27     ` Uros Bizjak
  2025-06-06  7:20       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Uros Bizjak @ 2025-06-06  6:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Fri, Jun 6, 2025 at 8:13 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > > comes with a significant penalty on CPUs without the respective fast
> > > short ops bits (FSRM/FSRS).
> >
> > I don't suppose there's a magic compiler toggle to make it emit prefix
> > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > right?
> >
> > Something like:
> >
> >    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
>
> This won't fly, because gas complains:
>
> z.s: Assembler messages:
> z.s:1: Error: same type of prefix used twice

However, it is possible to use " cs ; cs ; cs ; rep movsb". We can add
a compile flag to the compiler, and it will be able to emit the
desired sequence.
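
For illustration (a sketch only, not from any patch; the wrapper name is
made up), such a sequence can already be spelled from C by hand:

	/*
	 * Each standalone "cs" statement emits a 0x2e prefix byte, so this
	 * assembles to 2e 2e 2e f3 a4: the same 5 bytes as a rel32 call,
	 * leaving room to later patch in something like
	 * "call rep_movs_alternative" as suggested upthread.
	 */
	static inline void padded_rep_movsb(void *dst, const void *src,
					    unsigned long len)
	{
		asm volatile("cs; cs; cs; rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (len)
			     : : "memory");
	}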

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-06  6:27     ` Uros Bizjak
@ 2025-06-06  7:20       ` Peter Zijlstra
  2025-06-09 21:09         ` David Laight
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2025-06-06  7:20 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Fri, Jun 06, 2025 at 08:27:39AM +0200, Uros Bizjak wrote:
> On Fri, Jun 6, 2025 at 8:13 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > > > comes with a significant penalty on CPUs without the respective fast
> > > > short ops bits (FSRM/FSRS).
> > >
> > > I don't suppose there's a magic compiler toggle to make it emit prefix
> > > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > > right?
> > >
> > > Something like:
> > >
> > >    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> >
> > This won't fly, because gas complains:
> >
> > z.s: Assembler messages:
> > z.s:1: Error: same type of prefix used twice
> 
> However, it is possible to use " cs ; cs ; cs ; rep movsb". 

Heh, the way I encoded it was:

	.byte 0x2e, 0x2e, 0x2e
	rep movsb %ds:(%rsi), %es:(%rdi)

GCC compiled it, and then objdump grokked it (although it outputs one CS
too few). Your variant is much nicer though.

> We can add a compile flag to the compiler, and it will be able to emit
> the desired sequence.

Thanks; Linus, this would be acceptable?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-05 19:00 ` Peter Zijlstra
  2025-06-06  6:13   ` Uros Bizjak
@ 2025-06-06  7:27   ` Uros Bizjak
  2025-06-06  8:19     ` Peter Zijlstra
  2025-06-08 20:51     ` David Laight
  2025-06-06 15:36   ` Mateusz Guzik
  2 siblings, 2 replies; 18+ messages in thread
From: Uros Bizjak @ 2025-06-06  7:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > comes with a significant penalty on CPUs without the respective fast
> > short ops bits (FSRM/FSRS).
>
> I don't suppose there's a magic compiler toggle to make it emit prefix
> padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> right?
>
> Something like:
>
>    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
>
> because if we can get the compilers to do this; then I can get objtool
> to collect all these locations and then we can runtime patch them to be:
>
>    call rep_movs_alternative / rep_stos_alternative
>
> or whatever other crap we want really.

BTW: You can achieve the same effect by using -mstringop-strategy=libcall

Please consider the following testcase:

--cut here--
struct a { int r[40]; };
struct a foo (struct a b) { return b; }
--cut here--

By default, the compiler emits SSE copy (-O2):

foo:
.LFB0:
       .cfi_startproc
       movdqu  8(%rsp), %xmm0
       movq    %rdi, %rax
       movups  %xmm0, (%rdi)
       movdqu  24(%rsp), %xmm0
       movups  %xmm0, 16(%rdi)
       ...
       movdqu  152(%rsp), %xmm0
       movups  %xmm0, 144(%rdi)
       ret

but kernel doesn't enable SSE, so the compiler falls back to (-O2 -mno-sse):

foo:
       movq    8(%rsp), %rdx
       movq    %rdi, %rax
       leaq    8(%rdi), %rdi
       leaq    8(%rsp), %rsi
       movq    %rax, %rcx
       movq    %rdx, -8(%rdi)
       movq    160(%rsp), %rdx
       movq    %rdx, 144(%rdi)
       andq    $-8, %rdi
       subq    %rdi, %rcx
       subq    %rcx, %rsi
       addl    $160, %ecx
       shrl    $3, %ecx
       rep movsq
       ret

Please note the code that aligns pointers before "rep movsq". You
don't want this code when "rep movsq" is substituted by a call to some
optimized "rep movsb" type memcpy.

Let's use "rep movsb" (-O2 -mno-sse -mtune=skylake):

foo:
       leaq    8(%rsp), %rsi
       movl    $160, %ecx
       movq    %rdi, %rax
       rep movsb
       ret

much better. Now, to use function call instead of "rep movsb", you can
use -mstringop-strategy=libcall instead of some objtool substitution
(-O2 -mno-sse -mstringop-strategy=libcall):

foo:
       subq    $8, %rsp
       movl    $160, %edx
       leaq    16(%rsp), %rsi
       call    memcpy
       addq    $8, %rsp
       ret

Which IMO is what you are looking for.

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-06  7:27   ` Uros Bizjak
@ 2025-06-06  8:19     ` Peter Zijlstra
  2025-06-08 20:51     ` David Laight
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2025-06-06  8:19 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Fri, Jun 06, 2025 at 09:27:07AM +0200, Uros Bizjak wrote:

> Now, to use function call instead of "rep movsb", you can
> use -mstringop-strategy=libcall instead of some objtool substitution
> (-O2 -mno-sse -mstringop-strategy=libcall):
> 
> foo:
>        subq    $8, %rsp
>        movl    $160, %edx
>        leaq    16(%rsp), %rsi
>        call    memcpy
>        addq    $8, %rsp
>        ret
> 
> Which IMO is what you are looking for.

Well, no, because I can't replace the 'call memcpy' with "cs;cs;cs;rep
movsb" on good machines.

We want the "rep movsb" / "rep stosb" interface so we can actually use
those instructions when they're good, but then fall back to an
out-of-line function (with the same calling convention as the
instruction) when they're not so very good.
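
For reference, a sketch of the shape I mean (simplified from memory of the
existing user-copy helper, so details may be off; it only builds inside the
kernel tree):

	#include <asm/alternative.h>	/* ALTERNATIVE(), ALT_NOT() */
	#include <asm/cpufeatures.h>	/* X86_FEATURE_FSRM */
	#include <asm/asm.h>		/* ASM_CALL_CONSTRAINT */

	static inline void copy_fsrm_or_call(void *to, const void *from,
					     unsigned long len)
	{
		/*
		 * "rep movsb" when the CPU has FSRM, otherwise a call to the
		 * out-of-line helper, which takes and clobbers the same
		 * %rdi/%rsi/%rcx the instruction itself would.
		 */
		asm volatile(ALTERNATIVE("rep movsb",
					 "call rep_movs_alternative",
					 ALT_NOT(X86_FEATURE_FSRM))
			     : "+D" (to), "+S" (from), "+c" (len),
			       ASM_CALL_CONSTRAINT
			     : : "memory", "rax");
	}

The missing piece is getting the compiler to emit its *inlined* copies in
that same patchable shape.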



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-05 19:00 ` Peter Zijlstra
  2025-06-06  6:13   ` Uros Bizjak
  2025-06-06  7:27   ` Uros Bizjak
@ 2025-06-06 15:36   ` Mateusz Guzik
  2025-06-06 17:59     ` Linus Torvalds
  2 siblings, 1 reply; 18+ messages in thread
From: Mateusz Guzik @ 2025-06-06 15:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: torvalds, mingo, x86, linux-kernel, ubizjak

On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > comes with a significant penalty on CPUs without the respective fast
> > short ops bits (FSRM/FSRS).
>
> I don't suppose there's a magic compiler toggle to make it emit prefix
> padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> right?
>
> Something like:
>
>    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
>
> because if we can get the compilers to do this; then I can get objtool
> to collect all these locations and then we can runtime patch them to be:
>
>    call rep_movs_alternative / rep_stos_alternative
>
> or whatever other crap we want really.

Even if inlining patchable rep mov/stos is the long term solution, you
still want the compiler to emit regular stores for sizes up to 16 or
so bytes or you are losing out.
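
For instance (a made-up example in the style of the testcase upthread; the
exact output depends on the compiler version), take a 16-byte clear:

--cut here--
struct s { long a, b; };
void zap (struct s *p) { *p = (struct s){ 0, 0 }; }
--cut here--

With -O2 -mno-sse this should compile to two plain stores, with no call and
no string op in sight:

zap:
	movq	$0, (%rdi)
	movq	$0, 8(%rdi)
	ret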

So you would still need custom flags to dictate the policy, which is
the part being objected to here.

Per other mail in this thread, gcc as-is does not do what you need anyway.

So if any patches would have to be written, how about fix up gcc to
emit better inline asm for these to begin with. ;)

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-06 15:36   ` Mateusz Guzik
@ 2025-06-06 17:59     ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2025-06-06 17:59 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: Peter Zijlstra, mingo, x86, linux-kernel, ubizjak

On Fri, 6 Jun 2025 at 08:36, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> So if any patches would have to be written, how about fix up gcc to
> emit better inline asm for these to begin with. ;)

I really want to reiterate that these kinds of "I want specific
behavior" things can be done with custom spec files.

And I wouldn't object very much to having some kernel build support
for people doing custom spec-files themselves.

Because I suspect we could do some of our gcc-specific flags by using
spec files explicitly in the kernel, instead of the tens of flags we
add by hand to the command line.

Now, maybe things have changed - it's been years (decades?) since I
actually played with gcc spec files to do things - but it used to be a
convenient, if rather obscure, way to have site-specific gcc behavior.

I do remember that it was a pain to just add some incremental spec
file, and I think you practically speaking have to do things like "use
'gcc -dumpspecs' to dump the system specs, modify them, and then use 'gcc
-specs modified.specs' to use them".

Maybe that has been made simpler over the years.
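
For what it's worth, a minimal sketch of what such an add-on specs file
could look like (syntax recalled from the gcc docs and untested; "my.specs"
is just a placeholder name, and the flags are the ones from the patch):

--cut here--
*self_spec:
+ -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
--cut here--

used as "gcc -specs=my.specs ...", which would keep this kind of tuning out
of the kernel Makefiles entirely.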

            Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-06  7:27   ` Uros Bizjak
  2025-06-06  8:19     ` Peter Zijlstra
@ 2025-06-08 20:51     ` David Laight
  2025-06-09  6:04       ` Uros Bizjak
  1 sibling, 1 reply; 18+ messages in thread
From: David Laight @ 2025-06-08 20:51 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Peter Zijlstra, Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Fri, 6 Jun 2025 09:27:07 +0200
Uros Bizjak <ubizjak@gmail.com> wrote:

> On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:  
> > > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > > comes with a significant penalty on CPUs without the respective fast
> > > short ops bits (FSRM/FSRS).  
> >
> > I don't suppose there's a magic compiler toggle to make it emit prefix
> > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > right?
> >
> > Something like:
> >
> >    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> >
> > because if we can get the compilers to do this; then I can get objtool
> > to collect all these locations and then we can runtime patch them to be:
> >
> >    call rep_movs_alternative / rep_stos_alternative
> >
> > or whatever other crap we want really.  
> 
> BTW: You can achieve the same effect by using -mstringop-strategy=libcall
> 
> Please consider the following testcase:
> 
> --cut here--
> struct a { int r[40]; };
> struct a foo (struct a b) { return b; }
> --cut here--
> 
> By default, the compiler emits SSE copy (-O2):
> 
> foo:
> .LFB0:
>        .cfi_startproc
>        movdqu  8(%rsp), %xmm0
>        movq    %rdi, %rax
>        movups  %xmm0, (%rdi)
>        movdqu  24(%rsp), %xmm0
>        movups  %xmm0, 16(%rdi)
>        ...
>        movdqu  152(%rsp), %xmm0
>        movups  %xmm0, 144(%rdi)
>        ret
> 
> but kernel doesn't enable SSE, so the compiler falls back to (-O2 -mno-sse):
> 
> foo:
>        movq    8(%rsp), %rdx
>        movq    %rdi, %rax
>        leaq    8(%rdi), %rdi
>        leaq    8(%rsp), %rsi
>        movq    %rax, %rcx
>        movq    %rdx, -8(%rdi)
>        movq    160(%rsp), %rdx
>        movq    %rdx, 144(%rdi)
>        andq    $-8, %rdi
>        subq    %rdi, %rcx
>        subq    %rcx, %rsi
>        addl    $160, %ecx
>        shrl    $3, %ecx
>        rep movsq
>        ret
> 
> Please note the code that aligns pointers before "rep movsq".

Do you ever want it?
From what I remember of benchmarking 'rep movsb' even on Ivy bridge the
alignment makes almost no difference to throughput.
I don't have any old zen cpu though.
On zen5 pretty much the only thing that matters is cache-line aligning
the destination buffer - but there are some strange oddities.

I need to revisit my 'rep movsb' benchmarks though.
If you make %cx depend on the initial timestamp (cx = cx + (timestamp & zero)
will do it), and then make the final timestamp read depend on a result of the
copy (easiest if using the performance counters), you should get a pretty true
value for the setup cost (it's pretty much impossible if you try to synchronise
with lfence or mfence).

	David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-08 20:51     ` David Laight
@ 2025-06-09  6:04       ` Uros Bizjak
  2025-06-09 16:38         ` Linus Torvalds
  2025-06-09 19:04         ` David Laight
  0 siblings, 2 replies; 18+ messages in thread
From: Uros Bizjak @ 2025-06-09  6:04 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Sun, Jun 8, 2025 at 10:51 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Fri, 6 Jun 2025 09:27:07 +0200
> Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > > > comes with a significant penalty on CPUs without the respective fast
> > > > short ops bits (FSRM/FSRS).
> > >
> > > I don't suppose there's a magic compiler toggle to make it emit prefix
> > > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > > right?
> > >
> > > Something like:
> > >
> > >    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> > >
> > > because if we can get the compilers to do this; then I can get objtool
> > > to collect all these locations and then we can runtime patch them to be:
> > >
> > >    call rep_movs_alternative / rep_stos_alternative
> > >
> > > or whatever other crap we want really.
> >
> > BTW: You can achieve the same effect by using -mstringop-strategy=libcall
> >
> > Please consider the following testcase:
> >
> > --cut here--
> > struct a { int r[40]; };
> > struct a foo (struct a b) { return b; }
> > --cut here--
> >
> > By default, the compiler emits SSE copy (-O2):
> >
> > foo:
> > .LFB0:
> >        .cfi_startproc
> >        movdqu  8(%rsp), %xmm0
> >        movq    %rdi, %rax
> >        movups  %xmm0, (%rdi)
> >        movdqu  24(%rsp), %xmm0
> >        movups  %xmm0, 16(%rdi)
> >        ...
> >        movdqu  152(%rsp), %xmm0
> >        movups  %xmm0, 144(%rdi)
> >        ret
> >
> > but kernel doesn't enable SSE, so the compiler falls back to (-O2 -mno-sse):
> >
> > foo:
> >        movq    8(%rsp), %rdx
> >        movq    %rdi, %rax
> >        leaq    8(%rdi), %rdi
> >        leaq    8(%rsp), %rsi
> >        movq    %rax, %rcx
> >        movq    %rdx, -8(%rdi)
> >        movq    160(%rsp), %rdx
> >        movq    %rdx, 144(%rdi)
> >        andq    $-8, %rdi
> >        subq    %rdi, %rcx
> >        subq    %rcx, %rsi
> >        addl    $160, %ecx
> >        shrl    $3, %ecx
> >        rep movsq
> >        ret
> >
> > Please note the code that aligns pointers before "rep movsq".
>
> Do you ever want it?
> From what I remember of benchmarking 'rep movsb' even on Ivy bridge the
> alignment makes almost no difference to throughput.

Please note that the instruction is "rep movsQ", it moves 64bit
quantities. The alignment is needed to align data to the 64-bit
boundary.

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-09  6:04       ` Uros Bizjak
@ 2025-06-09 16:38         ` Linus Torvalds
  2025-06-09 19:25           ` Linus Torvalds
  2025-06-09 19:04         ` David Laight
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2025-06-09 16:38 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: David Laight, Peter Zijlstra, Mateusz Guzik, mingo, x86,
	linux-kernel

On Sun, 8 Jun 2025 at 23:04, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Please note that the instruction is "rep movsQ", it moves 64bit
> quantities. The alignment is needed to align data to the 64-bit
> boundary.

On real code, the cost of aligning things can be worse than just doing
it, and that's particularly true when inlining things.

When you clear 8 bytes on x86, you don't align things. You just write
a single 'movq'.

For the kernel, I$ misses are a very real thing, and the cost of
alignment is often the fact that you made things three times bigger
than they needed to be, and that it might make you uninline things.

Function calls are quite expensive - partly because of all the horrid
CPU bug workarounds (that are often *MORE* costly than any microcode
overhead of 'rep movs', which some people don't seem to have realized)
- but also because nonlinear code is simply a *big* hit when you don't
have good I$ behavior.

Benchmarks often don't show those effects. The benchmarks that have
big enough I$ footprints for those issues to show up are sadly not the
ones that compiler or library people use, which then results in fancy
memcpy routines that are actually really bad in real life.

That said, I at one point had a local patch that did a "memcpy_a4/a8"
for when the source and destination types were aligned and the right
size and also had the size as a hint (ie it called "memcpy_a8_large"
when types were 8-byte aligned and larger than 256 bytes iirc), so
that we could then do the right thing and avoid alignment and size
checks when we had enough of a clue that it's likely a bad idea (note
the "likely" - particularly for user copies the type may be aligned,
but user space might have placed it unaligned anyway).

And honestly, that's what I'd like to see gcc generate natively: a
call instruction with the "rep movsb" semantics (so %rsi/%rdi/%rcx
arguments, and they get clobbered).

That way, we could rewrite it in place and just replace it with "rep
movsb" when we know the hardware is good at it (or - as mentioned -
when we know the hardware is particularly bad at function calls: the
retpoline crap really is horrendously expensive with forced branch
mispredicts).

And then have a simple library routine (or a couple) for the other
cases. I'd like gcc to give the alignment hint and size hints too,
even if I suspect that we'd just make all versions be aliases to the
generic case, because once it's a function call, the rest tends to be
in the noise.

What gcc has now for memcpy/memset is complicated and largely useless.
I think it has been done for all the wrong reasons (ie spec type
benchmarking where you optimize for a known target CPU, which is bogus
garbage).

               Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-09  6:04       ` Uros Bizjak
  2025-06-09 16:38         ` Linus Torvalds
@ 2025-06-09 19:04         ` David Laight
  1 sibling, 0 replies; 18+ messages in thread
From: David Laight @ 2025-06-09 19:04 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Peter Zijlstra, Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Mon, 9 Jun 2025 08:04:34 +0200
Uros Bizjak <ubizjak@gmail.com> wrote:

> On Sun, Jun 8, 2025 at 10:51 PM David Laight
> <david.laight.linux@gmail.com> wrote:
..
> > Do you ever want it?
> > From what I remember of benchmarking 'rep movsb' even on Ivy bridge the
> > alignment makes almost no difference to throughput.  
> 
> Please note that the instruction is "rep movsQ", it moves 64bit
> quantities. The alignment is needed to align data to the 64-bit
> boundary.

No it isn't, there is no requirement to align the data for 'rep movsq'.
Even a naive cpu will do misaligned transfers quite happily.
The worst that ought to happen is each memory access being split in two.
Since it is likely that copies will be aligned (or so short it doesn't
matter) the alignment code is just a waste of time.

Even length checks to decide the algorithm have a cost - and can kill overall
performance if the copies are often short.
You really do need the software to give a compile-time hint of the likely
length and use that to select the algorithm.

I need to check Sandy Bridge (I've got one with a recent Debian installed),
but even on Ivy Bridge 'rep movsq' is pretty much identical to 'rep movsb'
with the count multiplied by 8.

The fixed/setup costs do vary by CPU, but the per-byte costs for moderate
(a few kB - fitting in the D-cache) copies were the same for all the Intel
CPUs I had to hand at the time.
The only thing that mattered was cache-line aligning %rdi - that doubled throughput.

	David


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-09 16:38         ` Linus Torvalds
@ 2025-06-09 19:25           ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2025-06-09 19:25 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: David Laight, Peter Zijlstra, Mateusz Guzik, mingo, x86,
	linux-kernel

On Mon, 9 Jun 2025 at 09:38, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What gcc has now for memcpy/memset is complicated and largely useless.

Just to clarify: I'm talking about the "pick between rep movs and
library call" parts of the gcc code. That's the part that then ends up
being about very random choices that can't be done well statically
because the exact choices depend very much on microarchitecture.

What is absolutely *not* useless is when the compiler decides to just
do the memcpy entirely by hand using regular mov instructions.

That's the main reason we end up no longer having our own memcpy
inlines and helpers - that and the fact that structure assignments etc
mean that we can't catch 'memcpy()' in the general case anyway.

So the whole "I'm turning this small and known-size memcpy into just X
pairs of 'mov' instructions" is a big deal. That part I love.

It's the "call to library routine or use string instructions" that I
don't like, and that I think the kernel would be better off picking
dynamically at boot time with instruction rewriting.

But to do a good job at that, we'd need that memcpy call to have the
string instruction semantics (including, very much, same clobber
rules).

And I do think we'd want to have hints as to size and alignment
because the whole "compiler knew about those, but then turned it into
a single special library call so that we can no longer optimize for
small/large/alignment cases" is sad.

So what I'd love to see is that if we have a

        large_struct_dest = large_struct_source;

then gcc would generate

        leaq dest,%rdi // or whatever
        leaq src,%rsi // again - obviously this will depend
        movl $size,%ecx
        call rep_movsb_large_aligned

so that we can take that target information into account when we rewrite it.

For example, on *some* microarchitectures, we'd decide to just always
replace all those calls with 'rep movsb', simply because the uarch is
known to be good at it.

But in *other* cases, we might only do it when we know the copy is
large (thus the need for a size hint).

And we might even be able to then turn that

        movl $size,%ecx
        call rep_movsb_large_aligned

pattern into

        movl $size/8,%ecx
        rep movsq

on older architectures that do better at 'movsq' than at 'movsb', but
have slow function calls due to retpoline crap.

Admittedly I don't think anybody has the energy to do those kinds of
bigger rewrites, but I think it would be good to have the _option_ if
somebody gets excited about it.

              Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-06  7:20       ` Peter Zijlstra
@ 2025-06-09 21:09         ` David Laight
  2025-06-15 16:41           ` Uros Bizjak
  0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2025-06-09 21:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Uros Bizjak, Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Fri, 6 Jun 2025 09:20:29 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jun 06, 2025 at 08:27:39AM +0200, Uros Bizjak wrote:
> > On Fri, Jun 6, 2025 at 8:13 AM Uros Bizjak <ubizjak@gmail.com> wrote:  
> > >
> > > On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:  
> > > >
> > > > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:  
> > > > > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > > > > comes with a significant penalty on CPUs without the respective fast
> > > > > short ops bits (FSRM/FSRS).  
> > > >
> > > > I don't suppose there's a magic compiler toggle to make it emit prefix
> > > > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > > > right?
> > > >
> > > > Something like:
> > > >
> > > >    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)  
> > >
> > > This won't fly, because gas complains:
> > >
> > > z.s: Assembler messages:
> > > z.s:1: Error: same type of prefix used twice  
> > 
> > However, it is possible to use " cs ; cs ; cs ; rep movsb".   
> 
> Heh, the way I encoded it was:
> 
> 	.byte 0x2e, 0x2e, 0x2e
> 	rep movsb %ds:(%rsi), %es:(%rdi)
> 
> GCC compiled it, and then objdump grokked it (although it outputs one CS
> too few). Your variant is much nicer though.
> 
> > We can add a compile flag to the compiler, and it will be able to emit
> > the desired sequence.

You want the compiler to use 'rep movsw', 'rep movsl' or 'rep movsq' if it knows
the size is a multiple of 2, 4 or 8.
Then you can substitute a suitable function.
But replacing the 'rep movsb' with a call is much more flexible.

You do need to allow for the REX prefix (rep movsq is f3 48 a5).

I also suspect it would be neater to repeat the 'rep' prefix instead of
using the 'cs' prefix.

> 
> Thanks; Linus, this would be acceptable?

Any compiler change isn't going to be useful in the short term.
(Unless the kernel sources start including gcc.)

	David


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-05 16:47 [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops Mateusz Guzik
  2025-06-05 18:32 ` Linus Torvalds
  2025-06-05 19:00 ` Peter Zijlstra
@ 2025-06-09 21:19 ` David Laight
  2 siblings, 0 replies; 18+ messages in thread
From: David Laight @ 2025-06-09 21:19 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: torvalds, mingo, x86, linux-kernel

On Thu,  5 Jun 2025 18:47:33 +0200
Mateusz Guzik <mjguzik@gmail.com> wrote:

> gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> comes with a significant penalty on CPUs without the respective fast
> short ops bits (FSRM/FSRS).
> 
> Another point is that even uarchs with FSRM don't necessarily have FSRS (Ice
> Lake and Sapphire Rapids don't).
> 
> More importantly, rep movsq is not fast even if FSRM is present.

Which architecture is that?
I got exactly the same timings for 'rep movsb' and 'rep movsq' when
I did some tests on Intel CPUs going back to Ivy Bridge.

I do need to redo them though; I've worked out how to time them
without using mfence/lfence, and that should give a reasonable
estimate of the setup cost.
(I can measure the data dependency of a single divide...)

	David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
  2025-06-09 21:09         ` David Laight
@ 2025-06-15 16:41           ` Uros Bizjak
  0 siblings, 0 replies; 18+ messages in thread
From: Uros Bizjak @ 2025-06-15 16:41 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Mateusz Guzik, torvalds, mingo, x86, linux-kernel

On Mon, Jun 9, 2025 at 11:09 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Fri, 6 Jun 2025 09:20:29 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Fri, Jun 06, 2025 at 08:27:39AM +0200, Uros Bizjak wrote:
> > > On Fri, Jun 6, 2025 at 8:13 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > > > > gcc is over eager to use rep movsq/stosq (starts above 40 bytes), which
> > > > > > comes with a significant penalty on CPUs without the respective fast
> > > > > > short ops bits (FSRM/FSRS).
> > > > >
> > > > > I don't suppose there's a magic compiler toggle to make it emit prefix
> > > > > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > > > > right?
> > > > >
> > > > > Something like:
> > > > >
> > > > >    2e 2e 2e f3 a4          cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> > > >
> > > > This won't fly, because gas complains:
> > > >
> > > > z.s: Assembler messages:
> > > > z.s:1: Error: same type of prefix used twice
> > >
> > > However, it is possible to use " cs ; cs ; cs ; rep movsb".
> >
> > Heh, the way I encoded it was:
> >
> >       .byte 0x2e, 0x2e, 0x2e
> >       rep movsb %ds:(%rsi), %es:(%rdi)
> >
> > GCC compiled it, and then objdump grokked it (although it outputs one CS
> > too few). Your variant is much nicer though.
> >
> > > We can add a compile flag to the compiler, and it will be able to emit
> > > the desired sequence.
>
> You want the compiler to use 'rep movsw', 'rep movsl' or 'rep movsq' if it knows
> the size is a multiple of 2, 4 or 8.
> Then you can substitute a suitable function.
> But replacing the 'rep movsb' with a call is much more flexible.
>
> You do need to allow for the REX prefix (rep movsq is f3 48 a5).
>
> I also suspect it would be neater to repeat the 'rep' prefix instead of
> using the 'cs' prefix.
>
> >
> > Thanks; Linus, this would be acceptable?
>
> Any compiler change isn't going to be useful in the short term.
> (Unless the kernel sources start including gcc.)

The discussion has moved to the gcc mailing lists. Could interested
parties please continue the discussion at [1]?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2025-June/686813.html

Uros.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread

Thread overview: 18+ messages
2025-06-05 16:47 [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops Mateusz Guzik
2025-06-05 18:32 ` Linus Torvalds
2025-06-05 19:00 ` Peter Zijlstra
2025-06-06  6:13   ` Uros Bizjak
2025-06-06  6:27     ` Uros Bizjak
2025-06-06  7:20       ` Peter Zijlstra
2025-06-09 21:09         ` David Laight
2025-06-15 16:41           ` Uros Bizjak
2025-06-06  7:27   ` Uros Bizjak
2025-06-06  8:19     ` Peter Zijlstra
2025-06-08 20:51     ` David Laight
2025-06-09  6:04       ` Uros Bizjak
2025-06-09 16:38         ` Linus Torvalds
2025-06-09 19:25           ` Linus Torvalds
2025-06-09 19:04         ` David Laight
2025-06-06 15:36   ` Mateusz Guzik
2025-06-06 17:59     ` Linus Torvalds
2025-06-09 21:19 ` David Laight
