Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for inlined ops

From: David Laight <david.laight.linux@gmail.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	mingo@redhat.com, x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
Date: Sun, 13 Apr 2025 19:20:11 +0100
Message-ID: <20250413192011.3e083d33@pumpkin>
In-Reply-To: <CAGudoHFvPqE=Sby-ttn1ar8b+abj15X2jX3FvgY3ca_TRqoc-Q@mail.gmail.com>

On Sun, 13 Apr 2025 12:27:08 +0200
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Wed, Apr 2, 2025 at 6:27 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Wed, Apr 2, 2025 at 6:22 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:  
> > >
> > > On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@gmail.com> wrote:  
> > > >
> > > >
> > > > +ifdef CONFIG_CC_IS_GCC
> > > > +#
> > > > +# Inline memcpy and memset handling policy for gcc.
> > > > +#
> > > > +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> > > > +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> > > > +# latency and it is faster to issue regular stores (even if in loops) to handle
> > > > +# small buffers.
> > > > +#
> > > > +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> > > > +# reported 0.23% increase for enabling these.
> > > > +#
> > > > +# We inline up to 256 bytes, which in the best case issues few movs, in the
> > > > +# worst case creates a 4 * 8 store loop.
> > > > +#
> > > > +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> > > > +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> > > > +# common denominator. Someone(tm) should revisit this from time to time.
> > > > +#
> > > > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > > +endif  
> > >
> > > Please make this a gcc bug-report instead - I really don't want to
> > > have random compiler-specific tuning options in the kernel.
> > >
> > > Because that whole memcpy-strategy thing is something that gets tuned
> > > by a lot of other compiler options (ie -march and different versions).
> > >  
> >
> > Ok.  
> 
> So I reported this upstream:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
> 
> And found some other problems in the meantime:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
> 
> Looks like this particular bit was persisting for quite some time now.
> 
> I also confirmed there is a benefit on AMD CPUs.

Is that a benefit of doing 'rep movsb' or a benefit of not doing it?

It also depends very much on the actual cpu.
I think zen5 cpus are faster (at running 'rep movsb') than earlier ones.
But someone needs to run the same test on a range of cpus.

I've found a 'cunning plan' to actually measure instruction clock times.
While 'mfence' will wait for all the instructions to complete, it is
horribly expensive.
The trick is to use data dependencies and the 'pmc' cycle counter.
So something like:
volatile int always_zero;
...
	int zero = always_zero;
	start = rdpmc(reg_no);
	updated = do_rep_movsb(dst, src, count + (start & zero));
	end = rdpmc(reg_no + (updated & zero));
	elapsed = end - start;
So the cpu has to execute the rdpmc() either side of the code
being tested.
For 'rep_movsb' it might be reasonable to use the updated address (or count),
but you could read back the last memory location to get a true execution time.
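
A minimal sketch of the pattern above as compilable C, in case it helps
anyone try it. The names time_rep_movsb(), rdpmc_counter(), clock_counter()
and the counter_fn type are hypothetical and not from any kernel source.
It assumes GCC/Clang inline asm; the counter is passed in as a function
because rdpmc faults in user space unless the kernel has enabled it and a
counter has been programmed, so clock_counter() is only a coarse stand-in
for exercising the harness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Always zero at run time, but the compiler cannot prove it. */
static volatile int always_zero;

/* Hypothetical counter interface: returns a tick count for 'reg_no'. */
typedef uint64_t (*counter_fn)(unsigned int reg_no);

#if defined(__x86_64__) || defined(__i386__)
/* Real counter read; needs user-space rdpmc enabled and a programmed PMC. */
static inline uint64_t rdpmc_counter(unsigned int reg_no)
{
	uint32_t lo, hi;

	__asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(reg_no));
	return ((uint64_t)hi << 32) | lo;
}
#endif

/* Coarse portable stand-in, only useful for checking the plumbing. */
static inline uint64_t clock_counter(unsigned int reg_no)
{
	(void)reg_no;
	return (uint64_t)clock();
}

/* Copy 'count' bytes and return the updated destination address. */
static size_t do_rep_movsb(void *dst, const void *src, size_t count)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("rep movsb"
			 : "+D"(dst), "+S"(src), "+c"(count)
			 : : "memory");
	return (size_t)(uintptr_t)dst;
#else
	memcpy(dst, src, count);
	return (size_t)(uintptr_t)dst + count;
#endif
}

static uint64_t time_rep_movsb(counter_fn read_counter, unsigned int reg_no,
			       void *dst, const void *src, size_t count)
{
	size_t zero = (size_t)always_zero;
	uint64_t start, end;
	size_t updated;

	start = read_counter(reg_no);
	/* The copy length depends on 'start', so it cannot begin early. */
	updated = do_rep_movsb(dst, src, count + ((size_t)start & zero));
	/* The counter index depends on 'updated', so this read waits for it. */
	end = read_counter(reg_no + (unsigned int)(updated & zero));
	return end - start;
}
```

With a usable counter the call would look something like
time_rep_movsb(rdpmc_counter, reg_no, dst, src, len), repeated many times
with the minimum taken. The value includes the cost of the counter reads
themselves, so an empty run is needed as a baseline.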

I've not tried to time memcpy() loops that way, but for arithmetic you
can measure the data-dependent clock count for divide.

	David