public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Kees Cook <kees@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: "kernel test robot" <oliver.sang@intel.com>,
	oe-lkp@lists.linux.dev, lkp@intel.com,
	linux-kernel@vger.kernel.org,
	"Thomas Weißschuh" <linux@weissschuh.net>,
	"Nilay Shroff" <nilay@linux.ibm.com>,
	"Yury Norov" <yury.norov@gmail.com>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	linux-hardening@vger.kernel.org
Subject: Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
Date: Fri, 10 Jan 2025 08:58:16 -0800	[thread overview]
Message-ID: <202501100853.CC2A15B6D@keescook> (raw)
In-Reply-To: <CAGudoHHymwdk9ChLK=JcPtRD8BSTRDG9B78yOzcRoBGWFzumng@mail.gmail.com>

On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > > >
> > > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > Well that is unexpected. There should be no binary output difference
> > > > > with that patch. I will investigate...
> > > >
> > > > It looks like hiding the size value from GCC has the side-effect of
> > > > breaking memcpy inlining in many places. I would expect this to make
> > > > things _slower_, though. O_o
> >
> > I think it's disabling value-range-based inlining, I'm trying to
> > construct some tests...
> >
> > > This depends on what was emitted in place and what CPU is executing it.
> > >
> > > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > > not have FSRM and the size is low enough, then such code can indeed be
> > > slower than suffering a call to memcpy (which does not issue rep mov).
> > >
> > > I had seen gcc go to great pains to align a buffer for rep movsq even
> > > when it was guaranteed to not be necessary for example.
> > >
> > > Can you disasm an example affected spot?
> >
> > I tried to find the most self-contained example I could, and I ended up
> > with:
> >
> > static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
> >                                      const struct in6_addr *daddr,
> >                                      const void *post, unsigned char pfx)
> > {
> >         memcpy(dst, daddr, pfx);
> >         memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> > }
> >
> 
> Well I did what I should have from the get go and took the liberty of
> looking at the profile.
> 
>          %stddev     %change         %stddev
>              \          |                \
> [snip]
>       0.00            +6.5        6.54 ± 66%
> perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
> 
> Disassembling copy_page_from_iter_atomic *prior* to the change indeed
> reveals rep movsq as I suspected (second to last instruction):
> 
> <+919>:   mov    (%rax),%rdx
> <+922>:   lea    0x8(%rsi),%rdi
> <+926>:   and    $0xfffffffffffffff8,%rdi
> <+930>:   mov    %rdx,(%rsi)
> <+933>:   mov    %r8d,%edx
> <+936>:   mov    -0x8(%rax,%rdx,1),%rcx
> <+941>:   mov    %rcx,-0x8(%rsi,%rdx,1)
> <+946>:   sub    %rdi,%rsi
> <+949>:   mov    %rsi,%rdx
> <+952>:   sub    %rsi,%rax
> <+955>:   lea    (%r8,%rdx,1),%ecx
> <+959>:   mov    %rax,%rsi
> <+962>:   shr    $0x3,%ecx
> <+965>:   rep movsq %ds:(%rsi),%es:(%rdi)
> <+968>:   jmp    0xffffffff819157c5 <copy_page_from_iter_atomic+869>
> 
> With the reported patch this is a call to memcpy.
> 
> This is the guy:
> static __always_inline
> size_t memcpy_from_iter(void *iter_from, size_t progress,
>                         size_t len, void *to, void *priv2)
> {
>         memcpy(to + progress, iter_from, len);
>         return 0;
> }

Thanks for looking at this case!

> 
> I don't know what the specific bench is doing, I'm assuming passed
> values were low enough that the overhead of spinning up rep movsq took
> over.
> 
> gcc should retain the ability to optimize this, except it needs to be
> convinced to not emit rep movsq for variable sizes (and instead call
> memcpy).
> 
> For user memory access there is a bunch of hackery to inline rep mov
> for CPUs where it does not suck for small sizes (see
> rep_movs_alternative). Someone(tm) should port it over to memcpy
> handling as well.
> 
> The expected state would be that for sizes known at compilation time
> it rolls with movs as needed (no rep), otherwise emits the magic rep
> movs/memcpy invocation, except for when it would be tail-called.
> 
> In your ipv6_rpl_addr_decompress example gcc went a little crazy,
> which I mentioned does happen. However, most of the time it is doing a
> good job instead and a now generated call to memcpy should make things
> slower. I presume these spots are merely not being benchmarked here.
> Note that going from inline movs (no rep) to a call to memcpy which
> does movs (again no rep) comes with a "mere" function call overhead,
> which is a different beast than spinning up rep movs on CPUs without
> FSRM.
> 
> That is to say, contrary to the report above, I believe the change is
> in fact a regression which just so happened to make things faster for
> a specific case. The unintended speed up can be achieved without
> regressing anything else by taming the craziness.

How do we best make sense of the perf report? Even in the iter case
above, it looks like a perf improvement?

The fortify change lets GCC still inline compile-time-constant sizes, so
that's good. But it seems to force all the "in a given range" cases into
calls.

> Reading the commit log I don't know what the way out is, perhaps you
> could rope in some gcc folk to ask? Screwing with optimization to not
> see a warning is definitely not the best option.

Yeah, if we do need to revert this, I'm going to need another way to
silence the GCC value-range checker for memcpy...

-- 
Kees Cook

  reply	other threads:[~2025-01-10 16:58 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-09  6:57 [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement kernel test robot
2025-01-09 16:51 ` Kees Cook
2025-01-09 20:38   ` Kees Cook
2025-01-09 20:52     ` Mateusz Guzik
2025-01-09 21:12       ` Kees Cook
2025-01-09 22:01         ` Mateusz Guzik
2025-01-10 16:58           ` Kees Cook [this message]
2025-01-10 19:14             ` Mateusz Guzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=202501100853.CC2A15B6D@keescook \
    --to=kees@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-hardening@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@weissschuh.net \
    --cc=lkp@intel.com \
    --cc=mjguzik@gmail.com \
    --cc=nilay@linux.ibm.com \
    --cc=oe-lkp@lists.linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=yury.norov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox