Re: [PATCH] lib/string: Bring optimized memcmp from glibc

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nikolay Borisov <nborisov@suse.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nikolay Borisov <nborisov@suse.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH] lib/string: Bring optimized memcmp from glibc
Date: Thu, 22 Jul 2021 20:03:22 +0300	[thread overview]
Message-ID: <c8262efc-e97a-4672-0cc1-90c778eecd94@suse.com> (raw)
In-Reply-To: <CAHk-=whTNJyEMmCpNh5FeT+bJzSE2CYDPWyDO12G4q39d1jn1g@mail.gmail.com>



On 22.07.21 г. 19:40, Linus Torvalds wrote:
> On Thu, Jul 22, 2021 at 4:28 AM Nikolay Borisov
> <n.borisov.lkml@gmail.com> wrote:
>>
>> This one also works, tested only on x86-64. Looking at the perf diff:
>>
>>     30.44%    -28.66%  [kernel.vmlinux]         [k] memcmp
> 
> Ok.
> 
> So the one that doesn't even bother to align is
> 
>     30.44%    -29.38%  [kernel.vmlinux]         [k] memcmp
> 
> and the one that aligns the first one is
> 
>     30.44%    -28.66%  [kernel.vmlinux]         [k] memcmp
> 
> and the difference between the two is basically in the noise:
> 
>      1.05%     +0.72%  [kernel.vmlinux]     [k] memcmp
> 
> but the first one does seem to be slightly better.
> 
>> Now on a more practical note, IIUC your 2nd version makes sense if the
>> cost of doing a one unaligned access in the loop body is offset by the
>> fact we are doing a native word-sized comparison, right?
> 
> So honestly, the reason the first one seems to beat the second one is
> that the cost of unaligned accesses on modern x86 is basically
> epsilon.
> 
> For all normal unaligned accesses there simply is no cost at all.
> There is a _tiny_ cost when the unaligned access crosses a cacheline
> access boundary (which on older CPU's is every 32 bytes, on modern
> ones it's 64 bytes). And then there is another slightly bigger cost
> when the unaligned access actually crosses a page boundary.
> 
> But even those non-zero cost cases are basically in the noise, because
> under most circumstances they will be hidden by any out-of-order
> engine, and completely dwarfed by the _real_ costs which are branch
> mispredicts and cache misses.
> 
> So on the whole, unaligned accesses are basically no cost at all. You
> almost have to have unusual code sequences for them to be even
> measurable.
> 
> So that second patch that aligns one of the sources is basically only
> extra overhead for no real advantage. The cost of it is probably one
> more branch mispredict, and possibly a cycle or two for the extra
> instructions.
> 
> Which is why the first one wins: it's simpler, and the extra work the
> second one does is basically not worth it on x86. Plus I suspect your
> test-case was all aligned anyway to begin with, so the extra work is
> _doubly_ pointless.
> 
> I suspect the second patch would be worthwhile if
> 
>  (a) there really were a lot of strings that weren't aligned (likelihood: low)
> 
>  (b) other microarchitectures that do worse on unaligned accesses -
> some microarchitectures spend extra cycles on _any_ unaligned accesses
> even if they don't cross cache access boundaries etc.
> 
> and I can see (b) happening quite easily. You just won't see it on a
> modern x86-64 setup.
> 
> I suspect we should start with the first version. It's not only better
> on x86, but it's simpler, and it's guarded by that
> 
>     #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> 
> so it's fundamentally "safer" on architectures that are just horrible
> about unaligned accesses.
> 
> Now, admittedly I don't personally even care about such architectures,
> and because we use "get_unaligned()", the compiler should always
> generate code that doesn't absolutely suck for bad architectures, but
> considering how long we've gone with the completely brainlessly simple
> "byte at a time" version without anybody even noticing, I think a
> minimal change is a better change.
> 
> That said, I'm not convinced I want to apply even that minimal first
> patch outside of the merge window.
> 
> So would you mind reminding me about this patch the next merge window?
> Unless there was some big extrernal reason why the performance of
> memcmp() mattered to you so much (ie some user that actually noticed
> and complained) and we really want to prioritize this..

I will do my best and hope I don't forget. OTOH there isn't anything
pressing it's something I found while looking at fidedupe's performance
and not even the major contributor but still, it looks sensible to fix
it now, that I have a workload at hand which clearly demonstrates
positive impact and can easily measure any changes.
> 
>               Linus
>

next prev parent reply	other threads:[~2021-07-22 17:03 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-21 13:59 [PATCH] lib/string: Bring optimized memcmp from glibc Nikolay Borisov
2021-07-21 14:32 ` Christoph Hellwig
2021-07-21 14:35   ` Nikolay Borisov
2021-07-21 14:42     ` Christoph Hellwig
2021-07-21 14:51       ` Matthew Wilcox
2021-07-21 15:17         ` Peter.Enderborg
2021-07-21 15:34           ` Matthew Wilcox
2021-07-21 15:39             ` Peter.Enderborg
2021-07-21 18:00 ` Linus Torvalds
2021-07-21 18:17   ` Nikolay Borisov
2021-07-21 18:45     ` Linus Torvalds
2021-07-21 18:55       ` Linus Torvalds
2021-07-21 19:26       ` Linus Torvalds
2021-07-22 11:28         ` Nikolay Borisov
2021-07-22 16:40           ` Linus Torvalds
2021-07-22 17:03             ` Nikolay Borisov [this message]
2021-08-26  9:03             ` Nikolay Borisov
2021-07-22  8:28       ` Nikolay Borisov
2021-07-23 14:02       ` David Laight
2021-07-21 20:10   ` David Sterba
2021-07-21 20:27     ` Linus Torvalds
2021-07-22  5:54       ` Nikolay Borisov
2021-07-28 20:12 ` Florian Weimer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c8262efc-e97a-4672-0cc1-90c778eecd94@suse.com \
    --to=nborisov@suse.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ndesaulniers@google.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.