From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail-wm1-f50.google.com (mail-wm1-f50.google.com [209.85.128.50])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9C3E84A23
	for <linux-kernel@vger.kernel.org>; Sun, 19 Apr 2026 10:41:20 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.50
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1776595281; cv=none; b=KKSmU3+7DbGSncv9cbjITTLH/WCo+siaYiBv1w1lWnjGI2MKVTMdp+rFFK+Gv5VtEkbmCVFSusbxMkvh3jULaC7oUeYyiajWykqDzz2vYiuhkPPpXvAZW22koflTz2uviIUR7cv8SX6KC+hD/LomhcG0xf2hRgmnMxLbX7oOPeY=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1776595281; c=relaxed/simple;
	bh=qS6jpmxPn0ciU4hAsvRwK00rIYLp0V6xTjVqSfHBlQU=;
	h=Date:From:To:Subject:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type; b=mz+pFIB/g+nlxHacDSg77dStIlhzunc6wORZyLqa4EoEkOGt39Y76J//R9Jb/cIyPSuUNCqJDkT5noh4yo06WCY6xHBr+SdAm5G5MRFoWwhi1wxOxPYqpZsdb6bGUCnSjrmmvF9aLeprX+5MLb+WR8XXGWsxv9sYzE+nVdCk+xg=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=rIepnjRy; arc=none smtp.client-ip=209.85.128.50
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="rIepnjRy"
Received: by mail-wm1-f50.google.com with SMTP id 5b1f17b1804b1-488a88aeec9so33032475e9.2
        for <linux-kernel@vger.kernel.org>; Sun, 19 Apr 2026 03:41:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20251104; t=1776595279; x=1777200079; darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:subject:to:from:date:from:to:cc:subject:date:message-id
         :reply-to;
        bh=SewYd26qMSqsJ3PdBJdB1FU0Fv0N/RaIvQ2whLAWfGQ=;
        b=rIepnjRy33BSUCRbiBUu1zmJbkUmoIlmq7WoBhJCBKxwFvni3KBJCXAvZ+dmLwBEU/
         VFivMEldX+wkiLA87zEtLT4+/S0QK2tgLU/WG1McLa6OH1hQKqrJV9ZX6y4jdGO5bcjb
         KqoikEZ5QUQZlHucRP3RKCbztYjbNPDJ2I6oFpcBBpGcatxieTDFt0D69mEVk0cSYcOa
         5li4qH9L2n4Rx9g6v6hWI4tWSKgBLnDexSi0W0Tzj5Zx5Q+u/LUhQhFmHouWDZa3baYG
         XNDywxEoQsYIB6mdM8NT+T00L/fioc18rLAc2jrIBSDkOhwNz6bYpVz0B7FsGDbhD4tN
         l8YA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1776595279; x=1777200079;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:subject:to:from:date:x-gm-gg:x-gm-message-state:from:to
         :cc:subject:date:message-id:reply-to;
        bh=SewYd26qMSqsJ3PdBJdB1FU0Fv0N/RaIvQ2whLAWfGQ=;
        b=fWh6w5feK+i1gP4mI+NWqME4HkduS5ulfk+f3Cy5bjgGcfRlBnuhp3zked6F67iFe4
         DsQjkud90IJRwANfNSdnf1CDlGNst5l0HJbmAW+FDLqtyN2PhoKfvtZ+GbUKbhcTxZak
         Pg0DCiLYxr+P7TDp+OTCihyWRoTm0JXnOO2XnbbajyIOoP4Cvifw0DC8S5q0u1+X5J8T
         AZIOS/AY0iWpZsY+pDLOJjn5gXHnwSs6J93102VppKeDQnEyD8+wbXZwXjs8iXMJ4vkA
         5Oxm8OvtTkDCbASayoVZ+evzKw5i/Hx1QknnDYw5B81zXmGhhb4LAIp+qBJ5Qoukpjyo
         Ekyw==
X-Forwarded-Encrypted: i=1; AFNElJ9T06nj6wZotieZ32UULFfUYQq2HbUpjgm4tO01bsDpV/7m6SKWnXrfeim4iu3DBSEAEqDu0ahSutVI+us=@vger.kernel.org
X-Gm-Message-State: AOJu0Yw5b0CgwX/JLBDyDJfAmjt5UnwMJJrJWo/Q8k8mDnqaMC1eLj1/
	T4VR+lS+P/pfuP4FABU3tA2F5AOPNAa1sBWgo5vnHZNlgqcR3f2AtwWQ
X-Gm-Gg: AeBDieu8MoAWjA4+BOmf0+ZFlsrjDtaoBxOMbu/yA4p5bQUbHfHb0/a0ZbY07CItEx4
	EeJaZRs2rtuqRDocPTkPdpINuFPQ4H9sP9chm/n96JShGxtIt5vFeRCQBurmrnY/zwOQwAJbCPR
	X21lxhky3NnQkwYS0cfpxOzsZv5EiFFf+NnQ5hUagRoIWylSa2hfhyj9q1dABtw9JnYRDe8FJK6
	kVDQ9dAWikXl8CTw3FSDpOYdz6dEjdho+DNdK1XIVHEI9FamWEdKZznPr0WMtIT8jaTBnwmEVQk
	EKYUtqBaDz4FM7bOogctCpXKUSYYgbBQ3WuL0j8Kvw7Ixyk3/PKiZ6JxrT4l6/Q7l+DlRc5BqZ6
	Dqu3+FcpyowLxnWqdUJaRz5R2tI2kLlmUwgfV543jeyiEJArSA6MJ3KuTtxRB0J/IjH0VO3Ff9l
	qvpnpwbUjHbJc8rx39YwzblZ1htPN0lShWpFIgFs51zwTPGfFsNdbtS4Xgue3HNFIYio6KSaDlI
	+k=
X-Received: by 2002:a05:600d:10:b0:489:1b0c:8b43 with SMTP id 5b1f17b1804b1-4891b0c8c48mr4142765e9.1.1776595278820;
        Sun, 19 Apr 2026 03:41:18 -0700 (PDT)
Received: from pumpkin (82-69-66-36.dsl.in-addr.zen.co.uk. [82.69.66.36])
        by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-43fe4cc2cacsm19934215f8f.13.2026.04.19.03.41.18
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sun, 19 Apr 2026 03:41:18 -0700 (PDT)
Date: Sun, 19 Apr 2026 11:41:17 +0100
From: David Laight <david.laight.linux@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>, Kees Cook <kees@kernel.org>,
 Andy Shevchenko <andy@kernel.org>, linux-kernel@vger.kernel.org,
 linux-hardening@vger.kernel.org, Linus Torvalds
 <torvalds@linux-foundation.org>
Subject: Re: [PATCH next] string: Optimise strlen()
Message-ID: <20260419114117.7cf50b2b@pumpkin>
In-Reply-To: <20260327195737.89537-1-david.laight.linux@gmail.com>
References: <20260327195737.89537-1-david.laight.linux@gmail.com>
X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; arm-unknown-linux-gnueabihf)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Fri, 27 Mar 2026 19:57:37 +0000
david.laight.linux@gmail.com wrote:

> From: David Laight <david.laight.linux@gmail.com>
> 
> Unrolling the loop once significantly improves performance on some CPU.
> Userspace testing on a Zen-5 shows it runs at two bytes/clock rather than
> one byte/clock with only a marginal additional overhead.

I hate benchmarks.

I've finally got around to looking at this again (on x86-64).
I changed the order of the 'single byte' and 'two byte' tests and the
'two byte' loop slowed down massively - to pretty much the same speed
as the 'single byte' loop.
gcc had swapped over the two functions in the object file.
Swapping the order changed the alignment of the loop top between odd and
even multiples of 16 (this alignment is disabled in the kernel to avoid
bloat).
The loop in the 'two byte' code is 17 bytes. In the slow case the loop
top is aligned to an odd multiple of 16, so the last byte is in a different
32 byte code block - which is presumably slow.
Changing the two 'cmpb $0, mem' to (say) 'cmpb %cl, mem' would reduce
the loop to 15 bytes, so it wouldn't cross a 16 byte boundary.
(The 'single byte' loop doesn't cross a 16 byte boundary in the test program.)
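
For reference, a rough sketch of the two loops being compared. The patch
itself isn't quoted in this mail, so the function names and the exact shape
of the code are assumptions for illustration, not the submitted version:

```c
#include <stddef.h>

/* Hypothetical name: baseline loop, one byte tested per iteration. */
static size_t strlen_byte(const char *s)
{
	const char *p = s;

	while (*p)
		p++;
	return p - s;
}

/* Hypothetical name: loop unrolled once, two byte tests per iteration. */
static size_t strlen_unrolled(const char *s)
{
	const char *p = s;

	for (;;) {
		if (!p[0])
			return p - s;
		/* p[1] is only read once p[0] is known to be non-zero. */
		if (!p[1])
			return p + 1 - s;
		p += 2;
	}
}
```

The two 'cmpb $0, mem' instructions mentioned above would correspond to the
two byte tests in the unrolled loop, although what the compiler actually
emits depends on the version and options.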

The kernel build I just looked at has strlen() aligned to a 16 byte
boundary with the branch crossing the next 16 byte boundary.
So, if the same is true as in my test program, strlen() will run a
lot slower on 50% of kernel builds.
(And other CPUs may have costs associated with the 16 byte boundary.)
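
The boundary arithmetic can be checked with a small helper. This is only a
sketch: crosses_boundary() is a made-up name, and the offsets used with it
are illustrative rather than taken from a real build:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: does a code sequence of 'len' bytes starting at
 * address 'start' spill over a 'block'-byte boundary?
 * 'block' is assumed to be a power of two and 'len' non-zero.
 */
static bool crosses_boundary(uintptr_t start, unsigned int len,
			     unsigned int block)
{
	uintptr_t last = start + len - 1;	/* address of the final byte */

	return (start / block) != (last / block);
}
```

With a 17 byte loop, crosses_boundary(16, 17, 32) is true (the last byte
lands in the next 32 byte block) while crosses_boundary(0, 17, 32) is false.
A 15 byte loop starting on a 16 byte boundary never crosses the next one.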

Mostly this means that however hard you try you are guaranteed to
lose somewhere :-(

> 
> Using 'byte masking' is faster for longer strings - the break-even point
> is around 56 bytes on the same Zen-5 (there is much larger overhead, then
> it runs at 16 bytes in 3 clocks).
> But the majority of kernel calls won't be near that length.
> There will also be extra overhead for big-endian systems and those
> without a fast ffs().

I've had a further thought on that as well.

The 'byte masking' code is somewhat larger (112 bytes rather than 32 or 48).
While the extra overhead is ~20 clocks, that is less than the 'branch
mispredict' penalty that the byte loop suffers every time the length
changes.
So for randomly changing lengths I'm beginning to think the 'byte mask'
version is better.

I ran the code on a Haswell a while back; the break-even length there was
also somewhat shorter (I'm remembering 32 bytes).

This all means the byte masking code may actually be sensible, provided:
- LE, or BE with a byte-swapping memory read.
- fast ffsl()
- 64bit
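
A minimal sketch of the 'byte masking' idea under exactly those conditions
(64bit, little-endian, fast bit scan). This is not the code that was
benchmarked: strlen_mask() is a made-up name, it uses the well-known
(v - 0x01..) & ~v & 0x80.. zero-byte test, and __builtin_ctzll() (a
gcc/clang builtin) stands in for ffsl(). It also reads whole aligned words,
so it can read up to 7 bytes beyond the terminator. That cannot fault
because an aligned word never crosses a page, but it is outside what ISO C
allows:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES	0x0101010101010101ULL
#define HIGHS	0x8080808080808080ULL

/* Hypothetical name. Assumes a 64bit little-endian CPU. */
static size_t strlen_mask(const char *s)
{
	const char *p = s;
	uint64_t v, zero;

	/* Byte loop until p is 8 byte aligned. */
	while ((uintptr_t)p & 7) {
		if (!*p)
			return p - s;
		p++;
	}

	for (;;) {
		/* Aligned 8 byte load; may include bytes past the '\0'. */
		memcpy(&v, p, sizeof(v));
		/* Lowest set bit is the 0x80 bit of the first zero byte. */
		zero = (v - ONES) & ~v & HIGHS;
		if (zero)
			break;
		p += sizeof(v);
	}

	/* Bit index / 8 gives the byte index on little-endian. */
	return p - s + (__builtin_ctzll(zero) >> 3);
}
```

The setup before the word loop and the bit scan after it are where the
fixed overhead comes from. On big-endian the word would need byte swapping
(or a count of leading zeros and a slightly different mask) to find the
first zero byte.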

	David