linux-um archives
From: David Laight <David.Laight@ACULAB.COM>
To: 'Yu-Jen Chang' <arthurchang09@gmail.com>,
	"ak@linux.intel.com" <ak@linux.intel.com>,
	"jdike@linux.intel.com" <jdike@linux.intel.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"bp@alien8.de" <bp@alien8.de>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-hardening@vger.kernel.org"
	<linux-hardening@vger.kernel.org>,
	"richard@nod.at" <richard@nod.at>,
	"anton.ivanov@cambridgegreys.com"
	<anton.ivanov@cambridgegreys.com>,
	"johannes@sipsolutions.net" <johannes@sipsolutions.net>,
	"linux-um@lists.infradead.org" <linux-um@lists.infradead.org>,
	"jserv@ccns.ncku.edu.tw" <jserv@ccns.ncku.edu.tw>
Subject: RE: [PATCH 1/2] x86/lib: Optimize memchr()
Date: Mon, 30 May 2022 08:09:57 +0000	[thread overview]
Message-ID: <8b480c85f53c4f3b88bf99ba585e8768@AcuMS.aculab.com> (raw)
In-Reply-To: <20220528081236.3020-2-arthurchang09@gmail.com>

From: Yu-Jen Chang
> Sent: 28 May 2022 09:13
> 
> The original assembly version of memchr() is implemented with
> the byte-wise comparing technique, which does not fully
> use the 64-bit registers of x86_64 CPUs. We use word-wise
> comparing so that 8 characters can be compared at the same time
> on x86_64 CPUs. First we align the input and then use word-wise
> comparing to find the first 64-bit word that contains the target.
> Secondly, we compare every byte in the word and get the output.
> 
> We create two files to measure the performance. The first file
> contains on average 10 characters ahead of the target character.
> The second file contains at least 1000 characters ahead of the
> target character. Our implementation of memchr() is slightly
> better in the first test and nearly 4x faster than the original
> implementation in the second test.
> 
> Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> ---
>  arch/x86/include/asm/string_64.h |  3 ++
>  arch/x86/lib/Makefile            |  1 +
>  arch/x86/lib/string_64.c         | 78 ++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+)
>  create mode 100644 arch/x86/lib/string_64.c
> 
...
> diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> new file mode 100644
> index 000000000..4e067d5be
> --- /dev/null
> +++ b/arch/x86/lib/string_64.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/string.h>
> +#include <linux/export.h>
> +#include <linux/align.h>
> +
> +/* How many bytes are loaded each iteration of the word copy loop */
> +#define LBLOCKSIZE (sizeof(long))
> +
> +#ifdef __HAVE_ARCH_MEMCHR
> +
> +void *memchr(const void *cs, int c, size_t length)
> +{
> +	const unsigned char *src = (const unsigned char *)cs, d = c;

You don't need the cast.

> +
> +	while (!IS_ALIGNED((long)src, sizeof(long))) {
> +		if (!length--)
> +			return NULL;
> +		if (*src == d)
> +			return (void *)src;
> +		src++;
> +	}

There is no point aligning the address.
In tests I've done, misaligned reads don't even take an extra
clock - even when the CPU is doing two reads per clock.
Even if they did, this code isn't memory limited.

> +	if (length >= LBLOCKSIZE) {
> +		unsigned long mask = d << 8 | d;
> +		unsigned int i = 32;
> +		long xor, data;
> +		const long consta = 0xFEFEFEFEFEFEFEFF,
> +			   constb = 0x8080808080808080;
> +
> +		/*
> +		 * Create a 8-bytes mask for word-wise comparing.
> +		 * For example, a mask for 'a' is 0x6161616161616161.
> +		 */
> +
> +		mask |= mask << 16;
> +		for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> +			mask |= mask << i;

Given that consta/constb only support 64-bit words, why the loop?
Just do mask |= mask << 32.
I'd also put all 3 mask calculations together - not hide one
in the initialiser.
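
That suggestion amounts to something like this (a standalone sketch;
the helper name is made up, and it assumes a 64-bit unsigned long):

```c
#include <assert.h>

/* Build a 64-bit word with byte @d repeated in every byte position,
 * using three explicit shifts instead of a loop.  Assumes 64-bit
 * unsigned long. */
static unsigned long byte_broadcast(unsigned char d)
{
	unsigned long mask = d;

	mask |= mask << 8;
	mask |= mask << 16;
	mask |= mask << 32;	/* no loop needed for a fixed 64-bit width */
	return mask;		/* equivalently: d * 0x0101010101010101UL */
}
```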

> +		/*
> +		 * We perform word-wise comparing with following operation:
> +		 *	1. Perform xor on the long word @src and @mask
> +		 *	   and put into @xor.
> +		 *	2. Add @xor with @consta.
> +		 *	3. ~@xor & @constb.
> +		 *	4. Perform & with the result of step 2 and 3.
> +		 *
> +		 * Step 1 creates a byte which is 0 in the long word if
> +		 * there is at least one target byte in it.
> +		 *
> +		 * Step 2 to Step 4 find if there is a byte with 0 in
> +		 * the long word.
> +		 */
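
For reference, the four quoted steps can be written out in plain C
(a hypothetical helper, for illustration only - note that consta is
simply -0x0101010101010101, so step 2 is the classic subtract-one
zero-byte test):

```c
#include <assert.h>

/* Plain-C sketch of the quoted steps: the result is nonzero iff
 * @word contains at least one byte equal to the byte repeated in
 * @mask.  Assumes 64-bit unsigned long. */
static unsigned long has_target_byte(unsigned long word, unsigned long mask)
{
	const unsigned long consta = 0xFEFEFEFEFEFEFEFFUL; /* == -0x0101010101010101 */
	const unsigned long constb = 0x8080808080808080UL;
	unsigned long x = word ^ mask;	/* step 1: matching bytes become 0 */

	return (x + consta) & ~x & constb; /* steps 2-4: detect a zero byte */
}
```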
> +		asm volatile("1:\n\t"
> +			     "movq (%0),%1\n\t"
> +			     "xorq %6,%1\n\t"
> +			     "lea (%1,%4), %2\n\t"
> +			     "notq %1\n\t"
> +			     "andq %5,%1\n\t"
> +			     "testq %1,%2\n\t"
> +			     "jne 2f\n\t"
> +			     "add $8,%0\n\t"
> +			     "sub $8,%3\n\t"
> +			     "cmp $7,%3\n\t"
> +			     "ja 1b\n\t"
> +			     "2:\n\t"
> +			     : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)

Why constrain src to %rdi?

> +			     : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> +			       "1"(xor), "2"(data), "3"(length)

Use "+r" in the outputs instead of re-specifying the arguments.
I'd also suggest using named arguments - much easier to read.

> +			     : "memory", "cc");

Doesn't the compiler generate much the same code from plain C?
You should also be able to code the loop without needing the add,
sub and cmp at the end.
If you use negative offsets from the end of the buffer
the loop bottom can be a single add and jnz.
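
In C the negative-offset idea could look roughly like this (a
hypothetical helper that scans whole words only; the name is made up
and it assumes a 64-bit unsigned long):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index backwards from the end of the last full word so the loop
 * bottom is a single add-and-branch: off += 8, loop while off != 0.
 * Scans exactly @nwords words of @src for byte @d. */
static const unsigned char *find_byte_words(const unsigned char *src,
					    size_t nwords, unsigned char d)
{
	const unsigned char *end = src + nwords * sizeof(long);
	long off = -(long)(nwords * sizeof(long));
	const unsigned long mask = 0x0101010101010101UL * d;
	const unsigned long consta = 0xFEFEFEFEFEFEFEFFUL;
	const unsigned long constb = 0x8080808080808080UL;

	while (off) {
		unsigned long x;

		memcpy(&x, end + off, sizeof(x));	/* unaligned-safe load */
		x ^= mask;
		if ((x + consta) & ~x & constb) {
			/* the word holds the target: locate the byte */
			for (int i = 0; i < (int)sizeof(long); i++)
				if (end[off + i] == d)
					return end + off + i;
		}
		off += (long)sizeof(long);	/* single add; branch while nonzero */
	}
	return NULL;
}
```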

	David

> +	}
> +
> +	while (length--) {
> +		if (*src == d)
> +			return (void *)src;
> +		src++;
> +	}
> +	return NULL;
> +}
> +EXPORT_SYMBOL(memchr);
> +#endif
> --
> 2.25.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Thread overview: 10+ messages
2022-05-28  8:12 [PATCH 0/2] x86: Optimize memchr() for x86-64 Yu-Jen Chang
2022-05-28  8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
2022-05-28 16:41   ` Tao Zhou
2022-05-29 12:05     ` arthur chang arthur
2022-05-30  8:09   ` David Laight [this message]
2022-06-01  5:58     ` Yu-Jen Chang
2022-06-01  8:25       ` David Laight
2022-06-06  3:25         ` Yu-Jen Chang
2022-05-28  8:12 ` [PATCH 2/2] x86/um: Use x86_64-optimized memchr Yu-Jen Chang
2022-05-29  1:10 ` [PATCH 0/2] x86: Optimize memchr() for x86-64 Andi Kleen
