Re: [PATCH next] string: Optimise strlen()

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH next] string: Optimise strlen()
       [not found] <20260327195737.89537-1-david.laight.linux@gmail.com>
@ 2026-03-27 20:37 ` Linus Torvalds
  2026-03-27 22:49   ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2026-03-27 20:37 UTC (permalink / raw)
  To: david.laight.linux
  Cc: Andrew Morton, Kees Cook, Andy Shevchenko, linux-kernel,
	linux-hardening

On Fri, 27 Mar 2026 at 12:57, <david.laight.linux@gmail.com> wrote:
>
> Using 'byte masking' is faster for longer strings - the break-even point
> is around 56 bytes on the same Zen-5 (there is much larger overhead, then
> it runs at 16 bytes in 3 clocks).

What byte masking approach did you actually use?

We have 'lib/strnlen_user.c', which is actually the only strlen() in
the kernel that I've really ever seen in profiles (it shows up for
execve() with lots of arguments).

That has tons of extra overhead due to the whole user access setuip,
but the core loop should be pretty good with that has_zero() thing.

I do agree that we shouldn't use 'rep scas'. It goes back to the
*very* original linux kernel sources, though, and I've never seen it
in profiles because very few things in the kernel actually use strings
a lot.

               Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH next] string: Optimise strlen()
  2026-03-27 20:37 ` [PATCH next] string: Optimise strlen() Linus Torvalds
@ 2026-03-27 22:49   ` David Laight
  2026-03-28  0:29     ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2026-03-27 22:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Kees Cook, Andy Shevchenko, linux-kernel,
	linux-hardening

On Fri, 27 Mar 2026 13:37:29 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 27 Mar 2026 at 12:57, <david.laight.linux@gmail.com> wrote:
> >
> > Using 'byte masking' is faster for longer strings - the break-even point
> > is around 56 bytes on the same Zen-5 (there is much larger overhead, then
> > it runs at 16 bytes in 3 clocks).  
> 
> What byte masking approach did you actually use?

This is the code I was testing.
It does aligned accesses - I did measure it without the alignment
code, made little/no difference.

The OPTIMIZER_HIDE_VAR() is needed to stop gcc generating different
64bit constants and to make it generate the constant in a sane way
(especially on architectures with only 16bit immediates).

size_t strlen_longs(const char *s)
{       
        unsigned int off = (unsigned long)s % sizeof (long);
        const unsigned long *p = (void *)(s - off);
        unsigned long ones = 0x01010101ul;
        unsigned long val;
        unsigned long mask;
        int first = 1;
                
        OPTIMIZER_HIDE_VAR(ones);
        ones |= ones << 16 << 16;
     
        mask = (~0ul >> 8) >> 8 * (sizeof (long) - 1 - off);

// I've just realised that might be better as:
//	mask = ones >> 1 + 8 * (sizeof (long) - 1 - off);
// which has the right properties and stops the compiler generating
// 0x00ffffffffffffff

        val = *p | mask;        
                
        do {    
                if (!first)
                        val = *++p;
                first = 0;
                mask = (val - ones) & ~val & (ones << 7);
        } while (!mask);

        off = (__builtin_ffsl(mask) - 1)/8;
        
        return (const char *)p + off - s;
}

That loop is the one that compiled best, ISTR it has a 'spare'
register move in it ('first' gets optimised out).

On many BE systems doing a byteswapping memory read may be best.

> We have 'lib/strnlen_user.c', which is actually the only strlen() in
> the kernel that I've really ever seen in profiles (it shows up for
> execve() with lots of arguments).
> 
> That has tons of extra overhead due to the whole user access setup,
> but the core loop should be pretty good with that has_zero() thing.

I've not measured strnlen(), but it wouldn't surprise me if argv[]
processing wouldn't be faster with something like the strlen() in this
patch.
After all arguments are usually relatively short.

If you were going to use the above then both 'ones' and 'ones << 7'
need so be calculated once and kept in registers.

> I do agree that we shouldn't use 'rep scas'. It goes back to the
> *very* original linux kernel sources, though, and I've never seen it
> in profiles because very few things in the kernel actually use strings
> a lot.

True, and most are short.
strscpy() is next on the list...

And the arm64 strlen() has special code to optimise crossing page boundaries.
God knows how slow it is on your typical 10 character string.

	David

> 
>                Linus


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH next] string: Optimise strlen()
  2026-03-27 22:49   ` David Laight
@ 2026-03-28  0:29     ` Linus Torvalds
  2026-03-28 11:08       ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2026-03-28  0:29 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, Kees Cook, Andy Shevchenko, linux-kernel,
	linux-hardening

On Fri, 27 Mar 2026 at 15:49, David Laight <david.laight.linux@gmail.com> wrote:
>
> I've not measured strnlen(), but it wouldn't surprise me if argv[]
> processing wouldn't be faster with something like the strlen() in this
> patch.
> After all arguments are usually relatively short.

No, we used to do those a byte at a time. It was not great. execve()
strings are often actually long because of filenames and environment
variables.

The trivial cases don't even matter, because all the cost of execve()
are elsewhere for those cases.

But the cases where the strings *do* matter, they are many and long.

Again - where did you actually see costs in real profiles? We really
don't tend to have strings in the kernel. The strings we do have are
from user space, and by the time they are in the kernel they typically
aren't C strings any more (or, with pathnames, have other termination
than just the NUL character).

           Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH next] string: Optimise strlen()
  2026-03-28  0:29     ` Linus Torvalds
@ 2026-03-28 11:08       ` David Laight
  2026-03-28 19:16         ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2026-03-28 11:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Kees Cook, Andy Shevchenko, linux-kernel,
	linux-hardening

On Fri, 27 Mar 2026 17:29:21 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 27 Mar 2026 at 15:49, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > I've not measured strnlen(), but it wouldn't surprise me if argv[]
> > processing wouldn't be faster with something like the strlen() in this
> > patch.
> > After all arguments are usually relatively short.  
> 
> No, we used to do those a byte at a time. It was not great. execve()
> strings are often actually long because of filenames and environment
> variables.
> 
> The trivial cases don't even matter, because all the cost of execve()
> are elsewhere for those cases.
> 
> But the cases where the strings *do* matter, they are many and long.

Is that the strncpy_from_user() path?

> 
> Again - where did you actually see costs in real profiles? We really
> don't tend to have strings in the kernel. The strings we do have are
> from user space, and by the time they are in the kernel they typically
> aren't C strings any more (or, with pathnames, have other termination
> than just the NUL character).

I started looking at this because someone was trying to write the 'bit-masking'
version for (possibly) RISC-V and I deciding that they weren't making a good
job of it and that it probably wasn't worth while (since x86-64 just uses
the byte code).
So I did some measurements - mostly to prove it was a waste of time.
The slight unroll of the C versions was a 'I wonder what effect it would have'
change, I was surprised it doubled throughput.
For such a small change it seemed worthwhile.

I've just run the tests on an old Haswell box (I don't have any recent
Intel cpu).
The 'byte' loop (and the unrolled once one) execute at 1 byte/clock.
The 'masking longs' loop runs at 4 bytes/clock but matches the byte loop
for 16 bytes (mostly because the byte loop is slower).

The 'elephant in the room' is the glibc code.
That must be using AVX and manages to beat 32 bytes/clock on my zen-5
(slightly under 32 bytes/clock on the Haswell).

That makes me think about the exec path, the fp registers need to be
trashed which should make them usable in the kernel (even without
disabling preemption).
You might need them reloaded after being preempted - I'm guessing that
is done in the 'return to user' path these days, rather than trapping
on first the access?

From what I remember kernel_fpu_end() got changed so that it didn't
restore the registers - so pretty much just enables preemption.
Which means that subsequent kernel_fpu_begin() are also cheap.
But code can't request 'kernel_fpu_begin(if_cheap)' and use the
fallback slow path if the registers would have to be saved.
That could speed up some code for 'medium length' buffers.

	David

> 
>            Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH next] string: Optimise strlen()
  2026-03-28 11:08       ` David Laight
@ 2026-03-28 19:16         ` Linus Torvalds
  2026-03-28 21:47           ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2026-03-28 19:16 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, Kees Cook, Andy Shevchenko, linux-kernel,
	linux-hardening

On Sat, 28 Mar 2026 at 04:08, David Laight <david.laight.linux@gmail.com> wrote:
>
> On Fri, 27 Mar 2026 17:29:21 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > The trivial cases don't even matter, because all the cost of execve()
> > are elsewhere for those cases.
> >
> > But the cases where the strings *do* matter, they are many and long.
>
> Is that the strncpy_from_user() path?

No. For annoying reasons, execve() mainly uses "strnlen_user()"
followed by "copy_from_user()".

See fs/exec.v: copy_strings().

The reason is that it needs to know the size of the string before it
can start copying it, because the destination address will depend on
it.

And yes, it's racy, and yes, if y ou modify the arguments or the
environment while an exevbe() is going on, you get what you deserve
(but it's not a security issue, it's just a "resulting argv[] array is
odd", but you could have made it odd in the first place, so whatever).

It would be lovely to be able to od it in one go and not walk the
source string twice, but that's sadly not how the execve() interface
works (or somebody would need to come up with a clever trick).

The main user of strncpy_from_user() is the path copying: see the
'getname' variations in fs/namei.c.

And sometimes pathnames are short, but we had a semi-recent discussion
about the distribution of pathname lengths due to some allocation
optimizations recently:

    https://lore.kernel.org/all/CAGudoHEMjWCOLEp+TdKLjuguHEKn9+e+aZwfKyK_sYpTZY8HRg@mail.gmail.com/

so while short names are common, longer names aren't *uncommon*, and
and loads that use them tend to keep using them.

We ended up aiming for ~128 bytes for the initial allocation
(EMBEDDED_NAME_MAX is 168 in one common config) for that reason.

Don't get me wrong: there are certainly many other users of
strnlen_user() and strncpy_from_user(), but the ones I've seen in any
half-way normal loads are those two: execve() and pathname copying.

> I started looking at this because someone was trying to write the 'bit-masking'
> version for (possibly) RISC-V and I deciding that they weren't making a good
> job of it and that it probably wasn't worth while (since x86-64 just uses
> the byte code).

Ok.

I do think that in user space, strlen() and friends can be absolutely
critical for some loads, because the C string model is horrible.

But in the kernel, I really don't think any of this matters. Our
strlen() is bad not because it's bad - it's bad because nobody really
should *care*.

Some of our "rep scas" users have been kept around exactly because
absolutely nobody cares, and it's a cute remnant of a very naive young
Linus who was using them because he was trying to learn things about
his new i80386 CPU, and started a whole small hobby project as a
result...

               Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH next] string: Optimise strlen()
  2026-03-28 19:16         ` Linus Torvalds
@ 2026-03-28 21:47           ` David Laight
  0 siblings, 0 replies; 6+ messages in thread
From: David Laight @ 2026-03-28 21:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Kees Cook, Andy Shevchenko, linux-kernel,
	linux-hardening

On Sat, 28 Mar 2026 12:16:52 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 28 Mar 2026 at 04:08, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > On Fri, 27 Mar 2026 17:29:21 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:  
> > > The trivial cases don't even matter, because all the cost of execve()
> > > are elsewhere for those cases.
> > >
> > > But the cases where the strings *do* matter, they are many and long.  
> >
> > Is that the strncpy_from_user() path?  
> 
> No. For annoying reasons, execve() mainly uses "strnlen_user()"
> followed by "copy_from_user()".
> 
> See fs/exec.v: copy_strings().
> 
> The reason is that it needs to know the size of the string before it
> can start copying it, because the destination address will depend on
> it.
> 
> And yes, it's racy, and yes, if y ou modify the arguments or the
> environment while an exevbe() is going on, you get what you deserve
> (but it's not a security issue, it's just a "resulting argv[] array is
> odd", but you could have made it odd in the first place, so whatever).
> 
> It would be lovely to be able to od it in one go and not walk the
> source string twice, but that's sadly not how the execve() interface
> works (or somebody would need to come up with a clever trick).

That sounds like a challenge :-)

> 
> The main user of strncpy_from_user() is the path copying: see the
> 'getname' variations in fs/namei.c.
> 
> And sometimes pathnames are short, but we had a semi-recent discussion
> about the distribution of pathname lengths due to some allocation
> optimizations recently:
> 
>     https://lore.kernel.org/all/CAGudoHEMjWCOLEp+TdKLjuguHEKn9+e+aZwfKyK_sYpTZY8HRg@mail.gmail.com/
> 
> so while short names are common, longer names aren't *uncommon*, and
> and loads that use them tend to keep using them.
> 
> We ended up aiming for ~128 bytes for the initial allocation
> (EMBEDDED_NAME_MAX is 168 in one common config) for that reason.
> 
> Don't get me wrong: there are certainly many other users of
> strnlen_user() and strncpy_from_user(), but the ones I've seen in any
> half-way normal loads are those two: execve() and pathname copying.
> 
> > I started looking at this because someone was trying to write the 'bit-masking'
> > version for (possibly) RISC-V and I deciding that they weren't making a good
> > job of it and that it probably wasn't worth while (since x86-64 just uses
> > the byte code).  
> 
> Ok.
> 
> I do think that in user space, strlen() and friends can be absolutely
> critical for some loads, because the C string model is horrible.
> 
> But in the kernel, I really don't think any of this matters. Our
> strlen() is bad not because it's bad - it's bad because nobody really
> should *care*.

You've said that before - which is why I dissuaded the RISC-V people
from writing a cache-destroying strlen().
Actually strscpy() is also optimised for long strings - that is now
being used all over the place (I think/hope most constant strings get
converted to memcpy()), I suspect the typical length is 10 bytes!
That probably wants de-optimising :-)

The 'one size fits all' for the string functions doesn't help.
If the source contained a constant 'hint' for a typical size then
an appropriate algorithm could be picked. 

> Some of our "rep scas" users have been kept around exactly because
> absolutely nobody cares, and it's a cute remnant of a very naive young
> Linus who was using them because he was trying to learn things about
> his new i80386 CPU, and started a whole small hobby project as a
> result...

When you wrote those they weren't that bad at all.
The 386 book has 'rep scas' as 5+8n (as does the 286 book) - much faster
than the equivalent instructions.
I am surprised they survived the P4.

In 1990 I was writing driver code for multi-cpu sparc systems (not solaris)
and worrying about TSO and the store buffer.
(and seeing the cpu stall for 150 clocks while the mmu did a page table walk.)

	David


> 
>                Linus


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-28 21:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260327195737.89537-1-david.laight.linux@gmail.com>
2026-03-27 20:37 ` [PATCH next] string: Optimise strlen() Linus Torvalds
2026-03-27 22:49   ` David Laight
2026-03-28  0:29     ` Linus Torvalds
2026-03-28 11:08       ` David Laight
2026-03-28 19:16         ` Linus Torvalds
2026-03-28 21:47           ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox