From: David Laight <David.Laight@ACULAB.COM>
To: 'Yu-Jen Chang' <arthurchang09@gmail.com>
Cc: "ak@linux.intel.com" <ak@linux.intel.com>,
"jdike@linux.intel.com" <jdike@linux.intel.com>,
"tglx@linutronix.de" <tglx@linutronix.de>,
"mingo@redhat.com" <mingo@redhat.com>,
"bp@alien8.de" <bp@alien8.de>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"x86@kernel.org" <x86@kernel.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"keescook@chromium.org" <keescook@chromium.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-hardening@vger.kernel.org"
<linux-hardening@vger.kernel.org>,
"richard@nod.at" <richard@nod.at>,
"anton.ivanov@cambridgegreys.com"
<anton.ivanov@cambridgegreys.com>,
"johannes@sipsolutions.net" <johannes@sipsolutions.net>,
"linux-um@lists.infradead.org" <linux-um@lists.infradead.org>,
"jserv@ccns.ncku.edu.tw" <jserv@ccns.ncku.edu.tw>
Subject: RE: [PATCH 1/2] x86/lib: Optimize memchr()
Date: Wed, 1 Jun 2022 08:25:47 +0000 [thread overview]
Message-ID: <5afc275656764ee6818693dfd39df764@AcuMS.aculab.com> (raw)
In-Reply-To: <CAD4RrFOHaE0NKjPHw3RaUSrD6vZwnpHbY01bcO3emkoia2of6Q@mail.gmail.com>
From: Yu-Jen Chang
> Sent: 01 June 2022 06:59
>
> David Laight <David.Laight@aculab.com> 於 2022年5月30日 週一 下午4:10寫道:
> >
> > From: Yu-Jen Chang
> > > Sent: 28 May 2022 09:13
> > >
> > > The original assembly version of memchr() is implemented with
> > > the byte-wise comparing technique, which does not fully
> > > use 64-bits registers in x86_64 CPU. We use word-wide
> > > comparing so that 8 characters can be compared at the same time
> > > on x86_64 CPU. First we align the input and then use word-wise
> > > comparing to find the first 64-bit word that contain the target.
> > > Secondly, we compare every byte in the word and get the output.
> > >
> > > We create two files to measure the performance. The first file
> > > contains on average 10 characters ahead the target character.
> > > The second file contains at least 1000 characters ahead the
> > > target character. Our implementation of “memchr()” is slightly
> > > better in the first test and nearly 4x faster than the orginal
> > > implementation in the second test.
> > >
> > > Signed-off-by: Yu-Jen Chang <arthurchang09@gmail.com>
> > > Signed-off-by: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > > ---
> > > arch/x86/include/asm/string_64.h | 3 ++
> > > arch/x86/lib/Makefile | 1 +
> > > arch/x86/lib/string_64.c | 78 ++++++++++++++++++++++++++++++++
> > > 3 files changed, 82 insertions(+)
> > > create mode 100644 arch/x86/lib/string_64.c
> > >
> > ...
> > > diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c
> > > new file mode 100644
> > > index 000000000..4e067d5be
> > > --- /dev/null
> > > +++ b/arch/x86/lib/string_64.c
> > > @@ -0,0 +1,78 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <linux/string.h>
> > > +#include <linux/export.h>
> > > +#include <linux/align.h>
> > > +
> > > +/* How many bytes are loaded each iteration of the word copy loop */
> > > +#define LBLOCKSIZE (sizeof(long))
> > > +
> > > +#ifdef __HAVE_ARCH_MEMCHR
> > > +
> > > +void *memchr(const void *cs, int c, size_t length)
> > > +{
> > > + const unsigned char *src = (const unsigned char *)cs, d = c;
> >
> > You don't need the cast.
> >
> > > +
> > > + while (!IS_ALIGNED((long)src, sizeof(long))) {
> > > + if (!length--)
> > > + return NULL;
> > > + if (*src == d)
> > > + return (void *)src;
> > > + src++;
> > > + }
> >
> > There is no point aligning the address.
> > On tests I've done misaligned reads don't even take an extra
> > clock - even if you get the cpu doing two reads/clock.
> > Even if they did the code isn't memory limited.
> >
> > > + if (length >= LBLOCKSIZE) {
> > > + unsigned long mask = d << 8 | d;
> > > + unsigned int i = 32;
> > > + long xor, data;
> > > + const long consta = 0xFEFEFEFEFEFEFEFF,
> > > + constb = 0x8080808080808080;
> > > +
> > > + /*
> > > + * Create a 8-bytes mask for word-wise comparing.
> > > + * For example, a mask for 'a' is 0x6161616161616161.
> > > + */
> > > +
> > > + mask |= mask << 16;
> > > + for (i = 32; i < LBLOCKSIZE * 8; i <<= 1)
> > > + mask |= mask << i;
> >
> > Given that consta/b only support 64 bit why the loop.
> > Just do mask |= mask << 32.
> > I'd also put all 3 calculations together - not hide one
> > in the initialiser.
> >
> > > + /*
> > > + * We perform word-wise comparing with following operation:
> > > + * 1. Perform xor on the long word @src and @mask
> > > + * and put into @xor.
> > > + * 2. Add @xor with @consta.
> > > + * 3. ~@xor & @constb.
> > > + * 4. Perform & with the result of step 2 and 3.
> > > + *
> > > + * Step 1 creates a byte which is 0 in the long word if
> > > + * there is at least one target byte in it.
> > > + *
> > > + * Step 2 to Step 4 find if there is a byte with 0 in
> > > + * the long word.
> > > + */
> > > + asm volatile("1:\n\t"
> > > + "movq (%0),%1\n\t"
> > > + "xorq %6,%1\n\t"
> > > + "lea (%1,%4), %2\n\t"
> > > + "notq %1\n\t"
> > > + "andq %5,%1\n\t"
> > > + "testq %1,%2\n\t"
> > > + "jne 2f\n\t"
> > > + "add $8,%0\n\t"
> > > + "sub $8,%3\n\t"
> > > + "cmp $7,%3\n\t"
> > > + "ja 1b\n\t"
> > > + "2:\n\t"
> > > + : "=D"(src), "=r"(xor), "=r"(data), "=r"(length)
> >
> > Why constrain src to %rdi?
>
> At first I try to use some instructions related to %rdi, but I realize
> that I won't use these instructions. It is unnecessary to constrain
> src to %rdi.
>
> >
> > > + : "r"(consta), "r"(constb), "r"(mask), "0"(src),
> > > + "1"(xor), "2"(data), "3"(length)
> >
> > Use "+r" in the outputs instead of respecifying the args.
> > I'd also suggest using named arguments - much easier to read.
> >
> > > + : "memory", "cc");
> >
> > Doesn't the compiler generate much the same code?
> > You should also be able to code without needing add, sub and cmp
> > at the end of the loop.
> > If you use negative offsets from the end of the buffer
> > the loop can be a single add and jnz.
> >
> > David
> >
> > > + }
> > > +
> > > + while (length--) {
> > > + if (*src == d)
> > > + return (void *)src;
> > > + src++;
> > > + }
> > > + return NULL;
> > > +}
> > > +EXPORT_SYMBOL(memchr);
> > > +#endif
> > > --
> > > 2.25.1
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
>
> I remove the aligning address part. On my tests the performance are similar.
> Here I rewrite the assembly using named arguments and I reduce one instruction
> in the loop by adding two parameters, which are 'end' and 'dst'.
> 'end' stores the
> address of the end of the string. 'dst' stores the address of the end
> of word-wise
> comparison. As a result, when 'src' is equal to 'dst', the number of remaining
> characters is less than 8. The following while loop will find if the
> target character is
> in these remaining characters.
>
> On my test the performance is similar with the my original implementation. Only
> a little bit fast when going through a very long string, which contains 128*1024
> characters and the target character is near the end of the string.
>
> I also explain how consta and constb work clearly in the comments. Hope that it
> helps understanding.
>
> The following code is what I change.
>
> void *memchr(const void *cs, int c, size_t length)
> {
> const unsigned char *src = (const unsigned char *)cs;
> const unsigned char *end = src + length;
>
> if (length >= LBLOCKSIZE) {
> unsigned long mask = c << 8 | c;
That is wrong if 'c' is outside 0..255.
I suspect it is best to at least allow -128..-1.
> long xor, data;
> const long consta = 0xFEFEFEFEFEFEFEFF,
> constb = 0x8080808080808080;
> const unsigned char *dst = (const unsigned char *)src +
> (length & 0xFFFFFFFFFFFFFFF8);
>
> /*
> * Create a 8-bytes mask for word-wise comparing.
> * For example, a mask for 'a' is 0x6161616161616161.
> */
>
> mask |= mask << 16;
> mask |= mask << 32;
> /*
> * We perform word-wise comparing with following operation:
> * 1. Perform xor on the long word @src and @mask
> * and put into @xor.
> * 2. Add @xor with @consta.
> * 3. ~@xor & @constb.
> * 4. Perform & with the result of step 2 and 3.
> *
> * If there is a zero byte in @xor, step 2 turns it into
> * 0xFF. Then step 3 and 4 turn it into 0x80.
> *
> * If there is a none-zero byte in @xor, let k
> * (0 <= k <= 7) be the lowest 1 in this byte. The lowest
> * k bits are 0. After step 2, the byte ends in a single
> * bit of value 0. Step 3 and 4 turns this byte into k
> * bits of 1, which is 2^k - 1, at first. Then & @constb
> * makes it into 0.
> *
> * Step 2 to Step 4 find if there is a byte with 0 in
> * the long word.
> */
> asm volatile("1:\n\t"
> "movq (%[src]),%[xor]\n\t"
> "xorq %[mask],%[xor]\n\t"
> "lea (%[xor],%[const_a]), %[tmp]\n\t"
> "notq %[xor]\n\t"
> "andq %[const_b],%[xor]\n\t"
> "testq %[xor],%[tmp]\n\t"
> "jnz 2f\n\t"
> "add $8,%[src]\n\t"
> "cmp %[src], %[dst]\n\t"
> "ja 1b\n\t"
> "2:\n\t"
> :
> [src] "+r"(src), [xor] "+r"(xor), [tmp] "+r"(data)
> : [const_a] "r"(consta), [const_b] "r"(constb),
> [mask] "r"(mask), [dst] "r"(dst)
> : "memory", "cc");
> }
>
> while (src <= end) {
> if (*src == d)
I think you mean 'c'.
> return (void *)src;
> src++;
> }
> return NULL;
> }
>
> Thanks,
> Yu-Jen Chang
Gcc compiles this C to the same loop and is easier to read.
Valid on all LE 64bit systems.
void *memchr(const void *p, int c, unsigned long length)
{
unsigned long mask, val;
const void *end = p + length;
c &= 0xff;
if (p <= end - 8) {
mask = c | c << 8;
mask |= mask << 16;
mask |= mask << 32;
for (; p <= end - 8; p += 8) {
val = *(unsigned long *)p ^ mask;
if ((val + 0xfefefefefefefeffu) & (~val & 0x8080808080808080u))
break;
}
}
for (; p < end; p++)
if (*(unsigned char *)p == c)
return p;
return NULL;
}
See https://godbolt.org/z/6rqTqfEsx
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um
next prev parent reply other threads:[~2022-06-01 8:26 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-28 8:12 [PATCH 0/2] x86: Optimize memchr() for x86-64 Yu-Jen Chang
2022-05-28 8:12 ` [PATCH 1/2] x86/lib: Optimize memchr() Yu-Jen Chang
2022-05-28 16:41 ` Tao Zhou
2022-05-29 12:05 ` arthur chang arthur
2022-05-30 8:09 ` David Laight
2022-06-01 5:58 ` Yu-Jen Chang
2022-06-01 8:25 ` David Laight [this message]
2022-06-06 3:25 ` Yu-Jen Chang
2022-05-28 8:12 ` [PATCH 2/2] x86/um: Use x86_64-optimized memchr Yu-Jen Chang
2022-05-29 1:10 ` [PATCH 0/2] x86: Optimize memchr() for x86-64 Andi Kleen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5afc275656764ee6818693dfd39df764@AcuMS.aculab.com \
--to=david.laight@aculab.com \
--cc=ak@linux.intel.com \
--cc=anton.ivanov@cambridgegreys.com \
--cc=arthurchang09@gmail.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=hpa@zytor.com \
--cc=jdike@linux.intel.com \
--cc=johannes@sipsolutions.net \
--cc=jserv@ccns.ncku.edu.tw \
--cc=keescook@chromium.org \
--cc=linux-hardening@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-um@lists.infradead.org \
--cc=mingo@redhat.com \
--cc=richard@nod.at \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox