From: Kees Cook <kees@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: "kernel test robot" <oliver.sang@intel.com>,
oe-lkp@lists.linux.dev, lkp@intel.com,
linux-kernel@vger.kernel.org,
"Thomas Weißschuh" <linux@weissschuh.net>,
"Nilay Shroff" <nilay@linux.ibm.com>,
"Yury Norov" <yury.norov@gmail.com>,
"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
linux-hardening@vger.kernel.org
Subject: Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
Date: Fri, 10 Jan 2025 08:58:16 -0800 [thread overview]
Message-ID: <202501100853.CC2A15B6D@keescook> (raw)
In-Reply-To: <CAGudoHHymwdk9ChLK=JcPtRD8BSTRDG9B78yOzcRoBGWFzumng@mail.gmail.com>
On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > > >
> > > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > Well that is unexpected. There should be no binary output difference
> > > > > with that patch. I will investigate...
> > > >
> > > > It looks like hiding the size value from GCC has the side-effect of
> > > > breaking memcpy inlining in many places. I would expect this to make
> > > > things _slower_, though. O_o
> >
> > I think it's disabling value-range-based inlining, I'm trying to
> > construct some tests...
> >
> > > This depends on what was emitted in place and what CPU is executing it.
> > >
> > > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > > not have FSRM and the size is low enough, then such code can indeed be
> > > slower than suffering a call to memcpy (which does not issue rep mov).
> > >
> > > I had seen gcc go to great pains to align a buffer for rep movsq even
> > > when it was guaranteed to not be necessary for example.
> > >
> > > Can you disasm an example affected spot?
> >
> > I tried to find the most self-contained example I could, and I ended up
> > with:
> >
> > static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
> > const struct in6_addr *daddr,
> > const void *post, unsigned char pfx)
> > {
> > memcpy(dst, daddr, pfx);
> > memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> > }
> >
>
> Well I did what I should have from the get go and took the liberty of
> looking at the profile.
>
> %stddev %change %stddev
> \ | \
> [snip]
> 0.00 +6.5 6.54 ± 66%
> perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
>
> Disassembling copy_page_from_iter_atomic *prior* to the change indeed
> reveals rep movsq as I suspected (second to last instruction):
>
> <+919>: mov (%rax),%rdx
> <+922>: lea 0x8(%rsi),%rdi
> <+926>: and $0xfffffffffffffff8,%rdi
> <+930>: mov %rdx,(%rsi)
> <+933>: mov %r8d,%edx
> <+936>: mov -0x8(%rax,%rdx,1),%rcx
> <+941>: mov %rcx,-0x8(%rsi,%rdx,1)
> <+946>: sub %rdi,%rsi
> <+949>: mov %rsi,%rdx
> <+952>: sub %rsi,%rax
> <+955>: lea (%r8,%rdx,1),%ecx
> <+959>: mov %rax,%rsi
> <+962>: shr $0x3,%ecx
> <+965>: rep movsq %ds:(%rsi),%es:(%rdi)
> <+968>: jmp 0xffffffff819157c5 <copy_page_from_iter_atomic+869>
>
> With the reported patch this is a call to memcpy.
>
> This is the guy:
> static __always_inline
> size_t memcpy_from_iter(void *iter_from, size_t progress,
> size_t len, void *to, void *priv2)
> {
> memcpy(to + progress, iter_from, len);
> return 0;
> }
Thanks for looking at this case!
>
> I don't know what the specific bench is doing, I'm assuming passed
> values were low enough that the overhead of spinning up rep movsq took
> over.
>
> gcc should retain the ability to optimize this, except it needs to be
> convinced to not emit rep movsq for variable sizes (and instead call
> memcpy).
>
> For user memory access there is a bunch of hackery to inline rep mov
> for CPUs where it does not suck for small sizes (see
> rep_movs_alternative). Someone(tm) should port it over to memcpy
> handling as well.
>
> The expected state would be that for sizes known at compilation time
> it rolls with movs as needed (no rep), otherwise emits the magic rep
> movs/memcpy invocation, except for when it would be tail-called.
>
> In your ipv6_rpl_addr_decompress example gcc went a little crazy,
> which I mentioned does happen. However, most of the time it is doing a
> good job instead and a now generated call to memcpy should make things
> slower. I presume these spots are merely not being benchmarked here.
> Note that going from inline movs (no rep) to a call to memcpy which
> does movs (again no rep) comes with a "mere" function call overhead,
> which is a different beast than spinning up rep movs on CPUs without
> FSRM.
>
> That is to say, contrary to the report above, I believe the change is
> in fact a regression which just so happened to make things faster for
> a specific case. The unintended speed up can be achieved without
> regressing anything else by taming the craziness.
How do we best make sense of the perf report? Even in the iter case
above, it looks like a perf improvement?
The fortify change lets GCC still inline compile-time-constant sizes, so
that's good. But it seems to force all the "in a given range" cases into
calls.
> Reading the commit log I don't know what the way out is, perhaps you
> could rope in some gcc folk to ask? Screwing with optimization to not
> see a warning is definitely not the best option.
Yeah, if we do need to revert this, I'm going to need another way to
silence the GCC value-range checker for memcpy...
--
Kees Cook
next prev parent reply other threads:[~2025-01-10 16:58 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-09 6:57 [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement kernel test robot
2025-01-09 16:51 ` Kees Cook
2025-01-09 20:38 ` Kees Cook
2025-01-09 20:52 ` Mateusz Guzik
2025-01-09 21:12 ` Kees Cook
2025-01-09 22:01 ` Mateusz Guzik
2025-01-10 16:58 ` Kees Cook [this message]
2025-01-10 19:14 ` Mateusz Guzik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=202501100853.CC2A15B6D@keescook \
--to=kees@kernel.org \
--cc=gregkh@linuxfoundation.org \
--cc=linux-hardening@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@weissschuh.net \
--cc=lkp@intel.com \
--cc=mjguzik@gmail.com \
--cc=nilay@linux.ibm.com \
--cc=oe-lkp@lists.linux.dev \
--cc=oliver.sang@intel.com \
--cc=yury.norov@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.