* Re: Help on memchr() EGLIBC assembly code
[not found] ` <20090713211723.GE10110@hall.aurel32.net>
@ 2009-07-13 22:16 ` Matt Turner
2009-07-13 22:24 ` Aurelien Jarno
2009-07-15 19:48 ` Richard Henderson
0 siblings, 2 replies; 9+ messages in thread
From: Matt Turner @ 2009-07-13 22:16 UTC (permalink / raw)
To: Aurelien Jarno
Cc: Carlos O'Donell, debian-alpha, debian-glibc, Ivan Kokshaysky,
Richard Henderson, linux-alpha
On Mon, Jul 13, 2009 at 6:17 PM, Aurelien Jarno<aurelien@aurel32.net> wrote:
> On Mon, Jul 13, 2009 at 02:24:00PM -0400, Carlos O'Donell wrote:
>> On Mon, Jul 13, 2009 at 1:31 PM, Aurelien Jarno<aurelien@aurel32.net> wrote:
>> > With a lot of patches (E)GLIBC 2.10 builds on alpha, but it fails on the
>> > testsuite for the memchr() function, which is an optimized assembly code
>> > on alpha. Unfortunately I don't speak alpha assembly very well, so help
>> > is needed.
>> >
>> > The problem is that the memchr() function on alpha uses prefetch, which
>> > can cause a page boundary to be crossed, while the standards (POSIX and
>> > C99) says it should stop when a match is found.
>> >
>> > I have built a small testcase (see file attached), which contains the
>> > code to trigger the bug and the assembly code of the memchr() function,
>> > copied from EGLIBC.
>> >
>> > It would be nice if someone can fix the assembly code so that the
>> > prefetching does not create memory faults. Thanks in advance.
>>
>> If you remove:
>> ./sysdeps/alpha/alphaev6/memchr.S
>> ./sysdeps/alpha/memchr.S
>> and allow the build to fallback on string/memchr.c do the tests pass?
>>
>> You can always add memchr.S routines back in a later release if you
>> need an immediate workaround.
>
> Yeah, it works, and that's actually my plan if I get no answer. But I
> would really like to see this fixed now, as otherwise, it will most
> probably stay like that indefinitely.
>
> --
> Aurelien Jarno GPG: 1024D/F1BCDB73
> aurelien@aurel32.net http://www.aurel32.net
>
I'm copying Richard Henderson and Ivan Kokshayshy on this, as they are
without a doubt the most knowledgeable people about this sort of
thing.
As an aside, please make sure these two bugs are fixed in eglibc (I
don't know if they exist in eglibc, but I do realize that you filed
them against glibc yourself -- so this is just a reminder).
http://sources.redhat.com/bugzilla/show_bug.cgi?id=5350
http://sources.redhat.com/bugzilla/show_bug.cgi?id=5400
Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-alpha" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-13 22:16 ` Help on memchr() EGLIBC assembly code Matt Turner
@ 2009-07-13 22:24 ` Aurelien Jarno
2009-07-15 19:48 ` Richard Henderson
1 sibling, 0 replies; 9+ messages in thread
From: Aurelien Jarno @ 2009-07-13 22:24 UTC (permalink / raw)
To: Matt Turner
Cc: Carlos O'Donell, debian-alpha, debian-glibc, Ivan Kokshaysky,
Richard Henderson, linux-alpha
On Mon, Jul 13, 2009 at 07:16:16PM -0300, Matt Turner wrote:
> On Mon, Jul 13, 2009 at 6:17 PM, Aurelien Jarno<aurelien@aurel32.net> wrote:
> > On Mon, Jul 13, 2009 at 02:24:00PM -0400, Carlos O'Donell wrote:
> >> On Mon, Jul 13, 2009 at 1:31 PM, Aurelien Jarno<aurelien@aurel32.net> wrote:
> >> > With a lot of patches (E)GLIBC 2.10 builds on alpha, but it fails on the
> >> > testsuite for the memchr() function, which is an optimized assembly code
> >> > on alpha. Unfortunately I don't speak alpha assembly very well, so help
> >> > is needed.
> >> >
> >> > The problem is that the memchr() function on alpha uses prefetch, which
> >> > can cause a page boundary to be crossed, while the standards (POSIX and
> >> > C99) says it should stop when a match is found.
> >> >
> >> > I have built a small testcase (see file attached), which contains the
> >> > code to trigger the bug and the assembly code of the memchr() function,
> >> > copied from EGLIBC.
> >> >
> >> > It would be nice if someone can fix the assembly code so that the
> >> > prefetching does not create memory faults. Thanks in advance.
> >>
> >> If you remove:
> >> ./sysdeps/alpha/alphaev6/memchr.S
> >> ./sysdeps/alpha/memchr.S
> >> and allow the build to fallback on string/memchr.c do the tests pass?
> >>
> >> You can always add memchr.S routines back in a later release if you
> >> need an immediate workaround.
> >
> > Yeah, it works, and that's actually my plan if I get no answer. But I
> > would really like to see this fixed now, as otherwise, it will most
> > probably stay like that indefinitely.
> >
> > --
> > Aurelien Jarno GPG: 1024D/F1BCDB73
> > aurelien@aurel32.net http://www.aurel32.net
> >
>
> I'm copying Richard Henderson and Ivan Kokshayshy on this, as they are
> without a doubt the most knowledgeable people about this sort of
> thing.
Thanks.
> As an aside, please make sure these two bugs are fixed in eglibc (I
> don't know if they exist in eglibc, but I do realize that you filed
> them against glibc yourself -- so this is just a reminder).
>
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=5350
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=5400
>
I have already put a first set of patches needed for alpha in a git
repository [1], and sent a pull request on the libc-ports mailing list.
I hope it will work better than filling bug reports. If it also fails,
I'll send them to EGLIBC.
[1]
http://git.aurel32.net/?p=glibc-ports.git;a=shortlog;h=refs/heads/alpha-for-upstream
--
Aurelien Jarno GPG: 1024D/F1BCDB73
aurelien@aurel32.net http://www.aurel32.net
--
To UNSUBSCRIBE, email to debian-glibc-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-13 22:16 ` Help on memchr() EGLIBC assembly code Matt Turner
2009-07-13 22:24 ` Aurelien Jarno
@ 2009-07-15 19:48 ` Richard Henderson
2009-07-19 14:29 ` Aurelien Jarno
1 sibling, 1 reply; 9+ messages in thread
From: Richard Henderson @ 2009-07-15 19:48 UTC (permalink / raw)
To: Matt Turner
Cc: Aurelien Jarno, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha
On 07/13/2009 03:16 PM, Matt Turner forwarded:
>>>> The problem is that the memchr() function on alpha uses prefetch, which
>>>> can cause a page boundary to be crossed, while the standards (POSIX and
>>>> C99) says it should stop when a match is found.
That's not supposed to matter -- faults from prefetch are supposed to be
ignored; see do_page_fault:
/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
(or is suppressed by the PALcode). Support that for older CPUs
by ignoring such an instruction. */
if (cause == 0) {
unsigned int insn;
__get_user(insn, (unsigned int __user *)regs->pc);
if ((insn >> 21 & 0x1f) == 0x1f &&
/* ldq ldl ldt lds ldg ldf ldwu ldbu */
(1ul << (insn >> 26) & 0x30f00001400ul)) {
regs->pc += 4;
return;
}
}
Can you figure out why that kernel code isn't working? I no longer have
working alpha hw...
r~
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-15 19:48 ` Richard Henderson
@ 2009-07-19 14:29 ` Aurelien Jarno
2009-07-26 23:45 ` Aurelien Jarno
0 siblings, 1 reply; 9+ messages in thread
From: Aurelien Jarno @ 2009-07-19 14:29 UTC (permalink / raw)
To: Richard Henderson
Cc: Matt Turner, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha
On Wed, Jul 15, 2009 at 12:48:02PM -0700, Richard Henderson wrote:
> On 07/13/2009 03:16 PM, Matt Turner forwarded:
>>>>> The problem is that the memchr() function on alpha uses prefetch, which
>>>>> can cause a page boundary to be crossed, while the standards (POSIX and
>>>>> C99) says it should stop when a match is found.
>
> That's not supposed to matter -- faults from prefetch are supposed to be
> ignored; see do_page_fault:
The problem is that the "prefech" is not done with $31, but using $1 and
$3. It is called "prefetch" in the code, but it is more like "read a value
in advance".
--
Aurelien Jarno GPG: 1024D/F1BCDB73
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-19 14:29 ` Aurelien Jarno
@ 2009-07-26 23:45 ` Aurelien Jarno
2009-07-27 9:29 ` Aurelien Jarno
2009-07-30 0:24 ` Richard Henderson
0 siblings, 2 replies; 9+ messages in thread
From: Aurelien Jarno @ 2009-07-26 23:45 UTC (permalink / raw)
To: Richard Henderson
Cc: Matt Turner, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha
On Sun, Jul 19, 2009 at 04:29:33PM +0200, Aurelien Jarno wrote:
> On Wed, Jul 15, 2009 at 12:48:02PM -0700, Richard Henderson wrote:
> > On 07/13/2009 03:16 PM, Matt Turner forwarded:
> >>>>> The problem is that the memchr() function on alpha uses prefetch, which
> >>>>> can cause a page boundary to be crossed, while the standards (POSIX and
> >>>>> C99) says it should stop when a match is found.
> >
> > That's not supposed to matter -- faults from prefetch are supposed to be
> > ignored; see do_page_fault:
>
> The problem is that the "prefech" is not done with $31, but using $1 and
> $3. It is called "prefetch" in the code, but it is more like "read a value
> in advance".
>
Knowing that $31 could be used for prefetch, I have modified the
assembly code from memchr.S to use it. It passes all the testsuite.
Comments are welcome. Then I'll do the alphaev6 version.
diff --git a/sysdeps/alpha/memchr.S b/sysdeps/alpha/memchr.S
index 5d713d5..87c7fb1 100644
--- a/sysdeps/alpha/memchr.S
+++ b/sysdeps/alpha/memchr.S
@@ -119,7 +119,7 @@ $first_quad:
# At least one byte left to process.
- ldq t0, 8(v0) # e0 :
+ ldq zero, 8(v0) # e0 : prefetch next quad
subq t4, 1, a2 # .. e1 :
addq v0, 8, v0 #-e0 :
@@ -138,19 +138,19 @@ $first_quad:
# At least three quads remain to be accessed
- mov t0, t3 # e0 : move prefetched value to correct reg
-
.align 4
$unrolled_loop:
- ldq t0, 8(v0) #-e0 : prefetch t0
- xor a1, t3, t1 # .. e1 :
- cmpbge zero, t1, t1 # e0 :
- bne t1, $found_it # .. e1 :
+ ldq t0, 0(v0) # e0 : load quad
+ xor a1, t0, t1 # .. e1 :
+ ldq zero, 8(v0) # e0 : prefetch next quad
+ cmpbge zero, t1, t1 # .. e1:
+ bne t1, $found_it # e0 :
- addq v0, 8, v0 #-e0 :
+ addq v0, 8, v0 # e1 :
$odd_quad_count:
+ ldq t0, 0(v0) # e0 : load quad
xor a1, t0, t1 # .. e1 :
- ldq t3, 8(v0) # e0 : prefetch t3
+ ldq zero, 8(v0) # e0 : prefetch next quad
cmpbge zero, t1, t1 # .. e1 :
addq v0, 8, t5 #-e0 :
bne t1, $found_it # .. e1 :
@@ -159,8 +159,8 @@ $odd_quad_count:
addq v0, 8, v0 # .. e1 :
bne t5, $unrolled_loop #-e1 :
- mov t3, t0 # e0 : move prefetched value into t0
-$final: subq t4, v0, a2 # .. e1 : a2 <- number of bytes left to do
+$final: ldq t0, 0(v0) # e0 : load last quad
+ subq t4, v0, a2 # .. e1 : a2 <- number of bytes left to do
bne a2, $last_quad # e1 :
$not_found:
--
Aurelien Jarno GPG: 1024D/F1BCDB73
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-26 23:45 ` Aurelien Jarno
@ 2009-07-27 9:29 ` Aurelien Jarno
2009-07-30 0:24 ` Richard Henderson
1 sibling, 0 replies; 9+ messages in thread
From: Aurelien Jarno @ 2009-07-27 9:29 UTC (permalink / raw)
To: Richard Henderson
Cc: Matt Turner, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha
On Mon, Jul 27, 2009 at 01:45:06AM +0200, Aurelien Jarno wrote:
> On Sun, Jul 19, 2009 at 04:29:33PM +0200, Aurelien Jarno wrote:
> > On Wed, Jul 15, 2009 at 12:48:02PM -0700, Richard Henderson wrote:
> > > On 07/13/2009 03:16 PM, Matt Turner forwarded:
> > >>>>> The problem is that the memchr() function on alpha uses prefetch, which
> > >>>>> can cause a page boundary to be crossed, while the standards (POSIX and
> > >>>>> C99) says it should stop when a match is found.
> > >
> > > That's not supposed to matter -- faults from prefetch are supposed to be
> > > ignored; see do_page_fault:
> >
> > The problem is that the "prefech" is not done with $31, but using $1 and
> > $3. It is called "prefetch" in the code, but it is more like "read a value
> > in advance".
> >
>
> Knowing that $31 could be used for prefetch, I have modified the
> assembly code from memchr.S to use it. It passes all the testsuite.
>
> Comments are welcome. Then I'll do the alphaev6 version.
Here is the alphaev6 version:
--- a/sysdeps/alpha/alphaev6/memchr.S
+++ b/sysdeps/alpha/alphaev6/memchr.S
@@ -127,7 +127,7 @@ $first_quad:
cmpbge $31, $1, $2 # E :
bne $2, $found_it # U :
# At least one byte left to process.
- ldq $1, 8($0) # L :
+ ldq $31, 8($0) # L :
subq $5, 1, $18 # E : U L U L
addq $0, 8, $0 # E :
@@ -143,38 +143,38 @@ $first_quad:
and $4, 8, $4 # E : odd number of quads?
bne $4, $odd_quad_count # U :
# At least three quads remain to be accessed
- mov $1, $4 # E : L U L U : move prefetched value to correct reg
+ nop # E : L U L U : move prefetched value to correct reg
.align 4
$unrolled_loop:
- ldq $1, 8($0) # L : prefetch $1
- xor $17, $4, $2 # E :
- cmpbge $31, $2, $2 # E :
- bne $2, $found_it # U : U L U L
+ ldq $1, 0($0) # L : load quad
+ xor $17, $1, $2 # E :
+ ldq $31, 8($0) # L : prefetch next quad
+ cmpbge $31, $2, $2 # E : U L U L
+ bne $2, $found_it # U :
addq $0, 8, $0 # E :
nop # E :
nop # E :
- nop # E :
$odd_quad_count:
+ ldq $1, 0($0) # L : load quad
xor $17, $1, $2 # E :
- ldq $4, 8($0) # L : prefetch $4
+ ldq $31, 8($0) # L : prefetch $4
cmpbge $31, $2, $2 # E :
- addq $0, 8, $6 # E :
+ addq $0, 8, $6 # E :
bne $2, $found_it # U :
cmpult $6, $18, $6 # E :
addq $0, 8, $0 # E :
- nop # E :
bne $6, $unrolled_loop # U :
- mov $4, $1 # E : move prefetched value into $1
nop # E :
nop # E :
-
-$final: subq $5, $0, $18 # E : $18 <- number of bytes left to do
nop # E :
+
+$final: ldq $1, 0($0) # L : load last quad
+ subq $5, $0, $18 # E : $18 <- number of bytes left to do
nop # E :
bne $18, $last_quad # U :
--
Aurelien Jarno GPG: 1024D/F1BCDB73
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-26 23:45 ` Aurelien Jarno
2009-07-27 9:29 ` Aurelien Jarno
@ 2009-07-30 0:24 ` Richard Henderson
2009-07-30 16:29 ` Aurelien Jarno
1 sibling, 1 reply; 9+ messages in thread
From: Richard Henderson @ 2009-07-30 0:24 UTC (permalink / raw)
To: Aurelien Jarno
Cc: Matt Turner, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha
[-- Attachment #1: Type: text/plain, Size: 1388 bytes --]
On 07/26/2009 04:45 PM, Aurelien Jarno wrote:
> Knowing that $31 could be used for prefetch, I have modified the
> assembly code from memchr.S to use it. It passes all the testsuite.
>
This isn't intended to be a prefetch instruction, it's
meant to be fetching the data for the next word. I.e.
we've unrolled the loop and there's at least 8 bytes
left in the search.
Note the
# At least two quads remain to be accessed.
comment. At that point we're supposed to be more
than 16 bytes away from the end of the input buffer.
Actually, the confusion I see is farther upthread:
> > >>>>> The problem is that the memchr() function on alpha uses
> prefetch, which
> > >>>>> can cause a page boundary to be crossed, while the standards
> (POSIX and
> > >>>>> C99) says it should stop when a match is found.
I didn't realize this when I wrote the function.
The entire function should be rewritten, since there's little
point in using a prefetch instruction that close to the load.
Prefetch instructions should be used to move data into the L1
cache, not hide the 3 cycle load delay. Thus a prefetch, if
used, should be several cache lines ahead, not just a single word.
Perhaps a better solution would be to read words until we get
cacheline aligned, then read the entire line into 8 registers,
prefetch the next line, then process each register one by one.
Try this.
r~
[-- Attachment #2: memchr.c --]
[-- Type: text/x-csrc, Size: 3718 bytes --]
typedef unsigned long word;
#define ldq(X) (*(const word *)(X))
#define ldq_u(X) (*(const word *)((X) & -8))
#define cmpbge __builtin_alpha_cmpbge
#define extql __builtin_alpha_extql
#define extqh __builtin_alpha_extqh
#define insbl __builtin_alpha_insbl
#define insqh __builtin_alpha_insqh
#define zap __builtin_alpha_zap
#define unlikely(X) __builtin_expect ((X), 0)
#define prefetch(X) __builtin_prefetch ((void *)(X), 0)
#define find(X, Y) cmpbge (0, (X) ^ (Y))
void *memchr (const void *xs, int xc, word n)
{
word s = (word)xs;
word c;
word current, found;
word s_align;
if (unlikely (n == 0))
return 0;
current = ldq_u (s);
/* Replicate low byte of C into all bytes. */
{
word t = insbl (xc, 1); /* 000000c0 */
c = (xc & 0xff) | t; /* 000000cc */
c = (c << 16) | c; /* 0000cccc */
c = (c << 32) | c; /* cccccccc */
}
if (unlikely (n < 9))
goto only_quad;
/* Align the source, and decrement the count by the number
of bytes searched in the first word. */
s_align = s & -8;
n += (s & 7);
n -= 8;
/* Deal with misalignment in the first word for the comparison. */
{
word mask = insqh (-1, s);
found = cmpbge (0, (current ^ c) | mask);
if (found)
goto found_it;
}
s_align += 8;
/* If the block is sufficiently large, align to cacheline (minimum 64-bytes),
prefetch the next line, and read entire cachelines at a time. */
if (unlikely (n >= 256))
{
prefetch (s_align + 64);
while (s_align & 63)
{
current = ldq (s_align);
found = find (current, c);
if (found)
goto found_it;
s_align += 8;
n -= 8;
}
while (n > 64)
{
word c0, c1, c2, c3, c4, c5, c6, c7;
prefetch (s_align + 64);
c0 = ldq (s_align + 0*8);
c1 = ldq (s_align + 1*8);
c2 = ldq (s_align + 2*8);
c3 = ldq (s_align + 3*8);
c4 = ldq (s_align + 4*8);
c5 = ldq (s_align + 5*8);
c6 = ldq (s_align + 6*8);
c7 = ldq (s_align + 7*8);
found = find (c0, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c1, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c2, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c3, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c4, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c5, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c6, c);
if (unlikely (found))
goto found_it;
s_align += 8;
found = find (c7, c);
if (unlikely (found))
goto found_it;
s_align += 8;
n -= 64;
}
}
/* Quadword aligned loop. */
while (1)
{
current = ldq (s_align);
if (n < 8)
goto last_quad;
found = find (current, c);
if (found)
goto found_it;
s_align += 8;
n -= 8;
}
only_quad:
{
word end = zap (n, 0x80) - 1;
word last = extqh (ldq_u (s + end), s);
word first = extql (current, s);
current = first | last;
/* We've read one word and aligned it in the register. Thus the
bit offset in current is relative to S. */
s_align = s;
}
last_quad:
{
word mask = (-1ul >> -n);
found = find (current, c) & mask;
if (found == 0)
return 0;
}
found_it:
{
word offset;
#ifdef __alpha_cix__
offset = __builtin_alpha_cttz (found);
#else
/* Extract LSB. */
found &= -found;
/* Binary search for the LSB. */
offset = (found & 0x0f ? 0 : 4);
offset += (found & 0x33 ? 0 : 2);
offset += (found & 0x55 ? 0 : 1);
#endif
return (void *)(s_align + offset);
}
}
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-30 0:24 ` Richard Henderson
@ 2009-07-30 16:29 ` Aurelien Jarno
2009-07-31 23:25 ` Richard Henderson
0 siblings, 1 reply; 9+ messages in thread
From: Aurelien Jarno @ 2009-07-30 16:29 UTC (permalink / raw)
To: Richard Henderson
Cc: Matt Turner, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha
On Wed, Jul 29, 2009 at 05:24:59PM -0700, Richard Henderson wrote:
> On 07/26/2009 04:45 PM, Aurelien Jarno wrote:
>> Knowing that $31 could be used for prefetch, I have modified the
>> assembly code from memchr.S to use it. It passes all the testsuite.
>>
>
> This isn't intended to be a prefetch instruction, it's
> meant to be fetching the data for the next word. I.e.
> we've unrolled the loop and there's at least 8 bytes
> left in the search.
>
> Note the
>
> # At least two quads remain to be accessed.
>
> comment. At that point we're supposed to be more
> than 16 bytes away from the end of the input buffer.
>
> Actually, the confusion I see is farther upthread:
>
>> > >>>>> The problem is that the memchr() function on alpha uses
>> prefetch, which
>> > >>>>> can cause a page boundary to be crossed, while the standards
>> (POSIX and
>> > >>>>> C99) says it should stop when a match is found.
>
> I didn't realize this when I wrote the function.
>
> The entire function should be rewritten, since there's little
> point in using a prefetch instruction that close to the load.
> Prefetch instructions should be used to move data into the L1
> cache, not hide the 3 cycle load delay. Thus a prefetch, if
> used, should be several cache lines ahead, not just a single word.
>
> Perhaps a better solution would be to read words until we get
> cacheline aligned, then read the entire line into 8 registers,
> prefetch the next line, then process each register one by one.
>
> Try this.
>
Thanks for this patch I have tried it, and it does not have the original
problem I have reported. Unfortunately it does not pass the glibc
testsuite. I'll try to debug the problem later (I don't own an alpha
machine, and need to have internet access to debug it remotely).
--
Aurelien Jarno GPG: 1024D/F1BCDB73
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Help on memchr() EGLIBC assembly code
2009-07-30 16:29 ` Aurelien Jarno
@ 2009-07-31 23:25 ` Richard Henderson
0 siblings, 0 replies; 9+ messages in thread
From: Richard Henderson @ 2009-07-31 23:25 UTC (permalink / raw)
To: Aurelien Jarno
Cc: Matt Turner, Carlos O'Donell, debian-alpha, debian-glibc,
Ivan Kokshaysky, linux-alpha, Richard Henderson
[-- Attachment #1: Type: text/plain, Size: 419 bytes --]
On 07/30/2009 09:29 AM, Aurelien Jarno wrote:
> Thanks for this patch I have tried it, and it does not have the original
> problem I have reported. Unfortunately it does not pass the glibc
> testsuite. I'll try to debug the problem later (I don't own an alpha
> machine, and need to have internet access to debug it remotely).
>
This one passes the test suite (on x86, with replacements for the alpha
builtins).
r~
[-- Attachment #2: memchr.c --]
[-- Type: text/x-csrc, Size: 4709 bytes --]
typedef unsigned long word;
#ifdef __alpha__
#define cmpbeq0(X) __builtin_alpha_cmpbge(0, (X))
#define extql __builtin_alpha_extql
#define extqh __builtin_alpha_extqh
#define insbl __builtin_alpha_insbl
#define zap __builtin_alpha_zap
#else
static word
cmpbeq0(word y)
{
word i, r = 0;
for (i = 0; i < 8; ++i)
{
unsigned char yc = (y >> i*8);
if (yc == 0)
r |= 1 << i;
}
return r;
}
static word
extql(word x, word y)
{
return x >> ((y & 7) * 8);
}
static word
extqh(word x, word y)
{
return x << (64 - ((y & 7) * 8));
}
static word
zap(word x, word y)
{
word i, mask = 0;
for (i = 0; i < 8; ++i)
if (y & (1 << i))
mask |= (word)0xff << (i * 8);
return x & ~mask;
}
static word
insbl(word x, word y)
{
word byte_mask = 1ul << (y & 7);
word byte_loc = (y & 7) * 8;
return zap (x << byte_loc, ~byte_mask);
}
#endif
static inline word
ldq_u(word s)
{
return *(const word *)(s & -8);
}
#define unlikely(X) __builtin_expect ((X), 0)
#define prefetch(X) __builtin_prefetch ((void *)(X), 0)
#define find(X, Y) cmpbeq0 ((X) ^ (Y))
void *
memchr (const void *xs, int xc, word n)
{
const word *s_align;
const word s = (word) xs;
word t, current, found;
if (unlikely (n == 0))
return 0;
current = ldq_u (s);
/* Replicate low byte of XC into all bytes of C. */
t = insbl (xc, 1); /* 000000c0 */
t = (xc & 0xff) | t; /* 000000cc */
t = (t << 16) | t; /* 0000cccc */
const word c = (t << 32) | t; /* cccccccc */
/* When N <= 8, we can perform the entire comparison in one go.
Load the N bytes into the low part of CURRENT, via an unaligned
quadword load sequence, and treat it as the last aligned word read. */
if (unlikely (n <= 8))
{
/* Tweak the standard unaligned quadword load sequence by issuing
the second ldq_u at (s + n - 1) instead of (s + 8 - 1). This
avoids crossing a page boundary when S+N doesn't. */
word last = extqh (ldq_u (s + n - 1), s);
word first = extql (current, s);
current = first | last;
s_align = (const word *) s;
goto last_quad;
}
/* Align the source, and decrement the count by the number
of bytes searched in the first word. */
s_align = (const word *)(s & -8);
n += (s & 7);
/* Deal with misalignment in the first word for the comparison. */
{
word mask = -1ul << (s & 7);
found = find (current, c) & mask;
if (unlikely (found))
goto found_it;
}
s_align++;
n -= 8;
/* If the block is sufficiently large, align to cacheline and prefetch. */
if (unlikely (n >= 256))
{
/* Prefetch 3 cache lines beyond the one we're working on. */
prefetch (s_align + 8);
prefetch (s_align + 16);
prefetch (s_align + 24);
while ((word)s_align & 63)
{
current = *s_align;
found = find (current, c);
if (found)
goto found_it;
s_align++;
n -= 8;
}
/* Within each cacheline, advance the load for the next word
before the test for the previous word is complete. This
allows us to hide the 3 cycle L1 cache load latency. We
only perform this advance load within a cacheline to prevent
reading across page boundary. */
#define CACHELINE_LOOP \
do { \
word i, next = s_align[0]; \
for (i = 0; i < 7; ++i) \
{ \
current = next; \
next = s_align[1]; \
found = find (current, c); \
if (unlikely (found)) \
goto found_it; \
s_align++; \
} \
current = next; \
found = find (current, c); \
if (unlikely (found)) \
goto found_it; \
s_align++; \
n -= 64; \
} while (0)
/* While there's still lots more data to potentially be read,
continue issuing prefetches for the 4th cacheline out. */
while (n >= 256)
{
prefetch (s_align + 24);
CACHELINE_LOOP;
}
/* Up to 3 cache lines remaining. Continue issuing advanced
loads, but stop prefetching. */
while (n >= 64)
CACHELINE_LOOP;
}
/* Quadword aligned loop. */
current = *s_align;
while (n > 8)
{
found = find (current, c);
if (unlikely (found))
goto found_it;
current = *++s_align;
n -= 8;
}
last_quad:
{
word mask = -1ul << n;
found = find (current, c) & ~mask;
if (found == 0)
return 0;
}
found_it:
{
word offset;
#ifdef __alpha_cix__
offset = __builtin_alpha_cttz (found);
#else
/* Extract LSB. */
found &= -found;
/* Binary search for the LSB. */
offset = (found & 0x0f ? 0 : 4);
offset += (found & 0x33 ? 0 : 2);
offset += (found & 0x55 ? 0 : 1);
#endif
return (void *)((word)s_align + offset);
}
}
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-07-31 23:25 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20090713173104.GA13883@hall.aurel32.net>
[not found] ` <119aab440907131124r3fd333d3n967cdde2cf3c2e1b@mail.gmail.com>
[not found] ` <20090713211723.GE10110@hall.aurel32.net>
2009-07-13 22:16 ` Help on memchr() EGLIBC assembly code Matt Turner
2009-07-13 22:24 ` Aurelien Jarno
2009-07-15 19:48 ` Richard Henderson
2009-07-19 14:29 ` Aurelien Jarno
2009-07-26 23:45 ` Aurelien Jarno
2009-07-27 9:29 ` Aurelien Jarno
2009-07-30 0:24 ` Richard Henderson
2009-07-30 16:29 ` Aurelien Jarno
2009-07-31 23:25 ` Richard Henderson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).