* Help with string.S
@ 2000-07-08 22:57 Dan Malek
2000-07-08 23:57 ` Dan Malek
2000-07-10 6:14 ` Daniel Marmier
0 siblings, 2 replies; 13+ messages in thread
From: Dan Malek @ 2000-07-08 22:57 UTC (permalink / raw)
To: linuxppc-dev
I found the source of the 4xx and 8xx troubles in the 2.4.xx kernel.
The functions in arch/ppc/lib/string.S are broken for anything other
than 32-byte cache lines. I am making the changes, but it would be
nice to have someone else look at this as well. Beyond the apparent
parameters, the code makes many implicit assumptions that a cache
line is 32 bytes.
Thanks.
-- Dan
** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/
* Re: Help with string.S
2000-07-08 22:57 Help with string.S Dan Malek
@ 2000-07-08 23:57 ` Dan Malek
2000-07-10 6:14 ` Daniel Marmier
1 sibling, 0 replies; 13+ messages in thread
From: Dan Malek @ 2000-07-08 23:57 UTC (permalink / raw)
To: linuxppc-dev
Dan Malek wrote:
>
> I found the source of the 4xx and 8xx troubles in the 2.4.xx kernel.
It wasn't really the source of the problem, but the functions are
still not correct here. I converted the easy ones, the difficult one
is copy_tofrom_user with all of its potential exception cases. I
converted for 16- and 32-byte lines, but the 64- and 128-byte lines
need some work.
Thanks.
-- Dan
* Re: Help with string.S
2000-07-08 22:57 Help with string.S Dan Malek
2000-07-08 23:57 ` Dan Malek
@ 2000-07-10 6:14 ` Daniel Marmier
2000-07-10 15:17 ` David Edelsohn
2000-07-10 22:42 ` Dan Malek
1 sibling, 2 replies; 13+ messages in thread
From: Daniel Marmier @ 2000-07-10 6:14 UTC (permalink / raw)
To: Dan Malek; +Cc: linuxppc-dev
Hi Dan,
IIRC, I have sent you a patch that did the right thing for 16-byte
cache lines at time of 2.3.99-pre5. So this is a known problem and
had already been fixed. Of course, if there are caches with 64 or
128 byte lines, some more work needs to be done.
What gives me trouble is the fact that dcbz instruction in function
arch/ppc/lib/string.S:__copy_tofrom_user does not seem to work for me.
But the function works fine if I remove that instruction. Has anybody
else experienced similar problems ?
Any suggestions welcome,
Daniel Marmier
* Re: Help with string.S
2000-07-10 6:14 ` Daniel Marmier
@ 2000-07-10 15:17 ` David Edelsohn
2000-07-10 22:42 ` Dan Malek
1 sibling, 0 replies; 13+ messages in thread
From: David Edelsohn @ 2000-07-10 15:17 UTC (permalink / raw)
To: daniel.marmier; +Cc: Dan Malek, linuxppc-dev
>>>>> Daniel Marmier writes:
Daniel> IIRC, I have sent you a patch that did the right thing for 16-byte
Daniel> cache lines at time of 2.3.99-pre5. So this is a known problem and
Daniel> had already been fixed. Of course, if there are caches with 64 or
Daniel> 128 byte lines, some more work needs to be done.
Current 64-bit PowerPC chips use a cacheline size of 128 bytes.
Assuming 32 bytes, or 32 and 16 bytes, or any small fixed set of values
is a mistake.
David
* Re: Help with string.S
2000-07-10 6:14 ` Daniel Marmier
2000-07-10 15:17 ` David Edelsohn
@ 2000-07-10 22:42 ` Dan Malek
2000-07-11 5:50 ` Daniel Marmier
2000-07-11 10:06 ` Adrian Cox
1 sibling, 2 replies; 13+ messages in thread
From: Dan Malek @ 2000-07-10 22:42 UTC (permalink / raw)
To: daniel.marmier; +Cc: linuxppc-dev
Daniel Marmier wrote:
> IIRC, I have sent you a patch that did the right thing for 16-byte
> cache lines at time of 2.3.99-pre5.
Now I remember :-). I see too much code from too many places every
day!
> ...... Of course, if there are caches with 64 or
> 128 byte lines, some more work needs to be done.
Lots more :-).
> What gives me trouble is the fact that dcbz instruction in function
> arch/ppc/lib/string.S:__copy_tofrom_user does not seem to work for me.
These are becoming a pain in the ass instructions. Has anyone ever
done some performance analysis to see what we really gain here in
real life? Sure, locally and logically you can make an intuitive
argument, but we are sure fetching lots of instructions just to get
this aligned, and further to actually move the data.
These instructions certainly don't work on uncached memory space,
causing the alignment exception and probably horrible performance without
people knowing. These instructions used to cause the exception on
the early MPC8xx processors when copyback cache wasn't enabled. Today,
the newer silicon doesn't fault at all regardless of cache mode. I
guess I need to determine what is really happening. Nothing would
be fine, but it appears _something_ (usually incorrect) happens.
> But the function works fine if I remove that instruction.
I'm still a C code fan:
for(i=0; i<count; i++)
*d++ = *s++;
...and let the compiler guys make it go fast :-).
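For reference, the loop above written out as a complete C function. This
is only a sketch: the name copy_bytes and its memcpy-style signature are
assumptions made for illustration, not anything taken from
arch/ppc/lib/string.S, and it assumes the two regions do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Plain byte-at-a-time copy with no cache instructions at all.
 * copy_bytes is a hypothetical name, not a kernel function.
 * The source and destination regions must not overlap. */
void *copy_bytes(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < count; i++)
        *d++ = *s++;

    return dst;
}
```

Because it never touches dcbz, a routine like this behaves the same
whatever the cache line size is; whether it is fast enough is a separate
question that only measurement can answer.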
You know, we could make this even faster by using the Altivec and the
new cache streaming modes on the 7400 processors :-). I've tested this
in applications. It really works.
-- Dan
* Re: Help with string.S
2000-07-10 22:42 ` Dan Malek
@ 2000-07-11 5:50 ` Daniel Marmier
2000-07-13 18:52 ` Dan Malek
2000-07-11 10:06 ` Adrian Cox
1 sibling, 1 reply; 13+ messages in thread
From: Daniel Marmier @ 2000-07-11 5:50 UTC (permalink / raw)
To: Dan Malek; +Cc: linuxppc-dev
Dan Malek wrote:
> These are becoming a pain in the ass instructions. Has anyone ever
> done some performance analysis to see what we really gain here in
> real life? Sure, locally and logically you can make an intuitive
> argument, but we are sure fetching lots of instructions just to get
> this aligned, and further to actually move the data.
>
> These instructions certainly don't work on uncached memory space,
> causing the alignment exception and probably horrible performance without
> people knowing. These instructions used to cause the exception on
> the early MPC8xx processors when copyback cache wasn't enabled. Today,
> the newer silicon doesn't fault at all regardless of cache mode. I
> guess I need to determine what is really happening. Nothing would
> be fine, but it appears _something_ (usually incorrect) happens.
I have seen this happen on cacheable memory with copyback enabled.
The dcbz-memcpy caused the destination to be zeroed, IIRC.
> > But the function works fine if I remove that instruction.
>
> I'm still a C code fan:
> for(i=0; i<count; i++)
> *d++ = *s++;
>
> ...and let the compiler guys make it go fast :-).
That would be cool, but I am sure the asm funcs perform much better.
I'll try to do some benchmarking if I have time.
Daniel M.
* Re: Help with string.S
2000-07-10 22:42 ` Dan Malek
2000-07-11 5:50 ` Daniel Marmier
@ 2000-07-11 10:06 ` Adrian Cox
2000-07-11 15:53 ` Dan Malek
1 sibling, 1 reply; 13+ messages in thread
From: Adrian Cox @ 2000-07-11 10:06 UTC (permalink / raw)
To: Dan Malek; +Cc: linuxppc-dev
Dan Malek wrote:
> > What gives me trouble is the fact that dcbz instruction in function
> > arch/ppc/lib/string.S:__copy_tofrom_user does not seem to work for me.
> These are becoming a pain in the ass instructions. Has anyone ever
> done some performance analysis to see what we really gain here in
> real life? Sure, locally and logically you can make an intuitive
> argument, but we are sure fetching lots of instructions just to get
> this aligned, and further to actually move the data.
The 7xx(x) processors don't have the alignment handler set up to cover
this problem in 2.2, so they just get an oops when somebody writes to
uncached memory, like a framebuffer device. This could probably be
solved by starting the function with a test of the address, and using a
version without cache operations for target addresses above the kernel
image of memory.
Or by removing the cache operations. Even if they stay, could they be a
compilation time optimisation for particular processors?
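A rough C sketch of the address test suggested above. Everything named
here is hypothetical: dst_allows_cache_ops and the cached_end boundary
are illustrative stand-ins, and where that boundary really lies depends
on the platform's memory map.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the whole destination [dst, dst + n) lies below
 * cached_end, so a copy routine using cache operations may be chosen;
 * return 0 if a plain copy without cache operations should be used.
 * cached_end is a hypothetical boundary supplied by the caller. */
int dst_allows_cache_ops(const void *dst, size_t n, uintptr_t cached_end)
{
    uintptr_t start = (uintptr_t)dst;

    if (n > cached_end)
        return 0;                    /* cannot fit below the boundary */

    /* Written as a subtraction so that start + n cannot overflow. */
    return start <= cached_end - n;
}
```

A caller would run this check once on entry and branch to either the
dcbz version or the plain version of the copy loop.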
> You know, we could make this even faster by using the Altivec and the
> new cache streaming modes on the 7400 processors :-). I've tested this
> in applications. It really works.
The 7400 certainly doesn't need the dcbz, as it will perform an implicit
allocation if the entire cache line is written by store instructions.
- Adrian Cox, AG Electronics
* Re: Help with string.S
2000-07-11 10:06 ` Adrian Cox
@ 2000-07-11 15:53 ` Dan Malek
0 siblings, 0 replies; 13+ messages in thread
From: Dan Malek @ 2000-07-11 15:53 UTC (permalink / raw)
To: Adrian Cox; +Cc: Dan Malek, linuxppc-dev
Adrian Cox wrote:
> The 7xx(x) processors don't have the alignment handler set up ....
Paul and I (and possibly others) conspired and added this in the late
2.3.xx kernels for all processors. It had been floating around for
the MPC8xx processors, I hit it again on the 8260, and we just made
the code generic for all processors. It "fixes" alignment faults and
will also zero memory on a dcbz fault. Hmmm, I wonder if this code
actually gets called and if it still does the right thing? I'll check
it again.
> Or by removing the cache operations. Even if they stay, could they be a
> compilation time optimisation for particular processors?
While the code wasn't really correct for anything but 32 byte cache
lines, it should work correctly on 16 byte lines. You don't get the
performance increase as the dcbz is only performed every other cache
line. However, like David mentioned, it really is broken for 64 and
128 byte cache lines. Here, you zero a long line, but only fill 32
bytes of data. You end up with a nearly zero filled buffer.
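The failure mode described above can be modelled in portable C. This is
a simulation under stated assumptions, not the kernel code: model_dcbz
stands in for dcbz by zeroing the whole line with memset, model_copy
mimics a loop that assumes 32-byte lines, and hw_line is the line size
the hardware really has. All three names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ASSUMED_LINE 32u /* line size the copy loop was written for */

/* Software stand-in for dcbz: zero the whole hw_line-sized line that
 * contains offset off. Offsets play the role of addresses, so buf is
 * treated as if it started on a line boundary. */
void model_dcbz(unsigned char *buf, size_t off, size_t hw_line)
{
    assert(hw_line != 0 && (hw_line & (hw_line - 1)) == 0);
    memset(buf + (off & ~(hw_line - 1)), 0, hw_line);
}

/* Copy len bytes in 32-byte steps, issuing one modelled dcbz per step,
 * as a loop written for 32-byte lines would. */
void model_copy(unsigned char *dst, const unsigned char *src,
                size_t len, size_t hw_line)
{
    size_t off;

    assert(len % ASSUMED_LINE == 0 && len % hw_line == 0);
    for (off = 0; off < len; off += ASSUMED_LINE) {
        model_dcbz(dst, off, hw_line);
        memcpy(dst + off, src + off, ASSUMED_LINE);
    }
}

/* Count how many destination bytes ended up equal to the source. */
size_t count_matching(const unsigned char *a, const unsigned char *b,
                      size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        n += (a[i] == b[i]);
    return n;
}
```

With hw_line set to 16 or 32 the copy comes out intact. With hw_line set
to 128, each step re-zeroes the 96 bytes of the line it is not about to
fill, so in this model only the last 32 bytes of every 128-byte line
keep their data.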
Anyway, enough talk, it has to be fixed. I'll do the best I can. I
would like to remove the assumption in copy_tofrom_user that we can
fault in so many places in the cache line. Considering all of the
alignment restrictions, it seems to me you will only fault on the
first access to the cache line (it isn't like you are going to
cross a page boundary in the middle of a line). This would simplify
the function and make for a much smaller exception table.
> The 7400 certainly doesn't need the dcbz, as it will perform an implicit
> allocation if the entire cache line is written by store instructions.
No, but those cache streaming instructions and data move cache hints
really do something. It was my attempt at humor, you see :-).
-- Dan
* Re: Help with string.S
2000-07-11 5:50 ` Daniel Marmier
@ 2000-07-13 18:52 ` Dan Malek
0 siblings, 0 replies; 13+ messages in thread
From: Dan Malek @ 2000-07-13 18:52 UTC (permalink / raw)
To: daniel.marmier; +Cc: Dan Malek, linuxppc-dev
Daniel Marmier wrote:
>
> Dan Malek wrote:
> > These are becoming a pain in the ass instructions.
> I have seen this happen on cacheable memory with copyback enabled.
> The dcbz-memcpy caused the destination to be zeroed, IIRC.
OK, I think I am bailing out here. For some reason, if I remove
the 'dcbz' instructions on the MPC8xx processor the world is just
a better place. I don't know why, maybe because of some of the
TLB mapping, but I can't find a reason.
I am going to put #ifdef CONFIG_8xx around the dcbz instructions,
and where they are actually used to zero-fill memory I will use
store operations. The other option is to make the changes at a
higher level (like make 'clear_page' call 'memset' with 0), but I
think the direct assembly changes are preferable. Suggestions welcome.
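In C terms, the shape of that change might look like the sketch below.
It is an illustration only: the real change would be made in assembly,
zero_region and zero_line_cache_op are hypothetical names, LINE_SIZE of
16 bytes is an assumed 8xx line size, and the "cache op" path is a
memset stand-in so the sketch runs on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 16u /* assumed 8xx data cache line size, in bytes */

/* Stand-in for a dcbz-based line clear. memset keeps the sketch
 * portable; real code would issue dcbz on the line instead. */
void zero_line_cache_op(unsigned char *line)
{
    memset(line, 0, LINE_SIZE);
}

/* Zero len bytes at p. p must be 4-byte aligned and len a multiple of
 * LINE_SIZE. With CONFIG_8xx defined, only ordinary word stores are
 * used; otherwise the per-line clear is used. */
void zero_region(void *p, size_t len)
{
    assert((uintptr_t)p % sizeof(uint32_t) == 0);
    assert(len % LINE_SIZE == 0);
#ifdef CONFIG_8xx
    uint32_t *w = p;

    for (size_t i = 0; i < len / sizeof *w; i++)
        w[i] = 0;                    /* plain stores, no cache ops */
#else
    unsigned char *c = p;

    for (size_t off = 0; off < len; off += LINE_SIZE)
        zero_line_cache_op(c + off); /* one line per iteration */
#endif
}
```

Building with -DCONFIG_8xx selects the store loop; the result in memory
is the same either way, which is the point of hiding the choice behind
the #ifdef.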
I'll keep looking for a better solution, but I can't hold up others
trying to use this kernel.
Thanks.
-- Dan
* Re: Help with string.S
@ 2000-08-16 7:26 Graham Stoney
2000-08-16 16:22 ` tom_gall
2000-08-17 19:28 ` Dan Malek
0 siblings, 2 replies; 13+ messages in thread
From: Graham Stoney @ 2000-08-16 7:26 UTC (permalink / raw)
To: dan; +Cc: linuxppc-dev
Casting our minds back to July, Dan Malek wrote:
> OK, I think I am bailing out here. For some reason, if I remove
> the 'dcbz' instructions on the MPC8xx processor the world is just
> a better place. I don't know why, maybe because of some of the
> TLB mapping, but I can't find a reason.
What was the eventual outcome of this? I've been doing some 2.2.13
kernel profiling on the 860, and __copy_tofrom_user is coming up as a
hotspot.
I tried dropping in the new improved version from
linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
work for 32 byte cache lines.
I hacked it around and found the same as you: it won't work with the
dcbz in there, and of course it doesn't run any faster than the old
version without it. It's certainly getting more complex in there, and I
see your point about whether the extra code will actually make it run
any faster, especially on 8xx CPUs with small I-caches. I'd be keen to
test whatever you've come up with to see if it's actually better than
the old 2.2 code on 8xx CPUs.
It sounds like a few people have at least had a shot at adding support
for other than 32 byte cache lines, but none have propagated into the
official kernels; how does that happen anyway?
Thanks,
Graham
* Re: Help with string.S
2000-08-16 7:26 Graham Stoney
@ 2000-08-16 16:22 ` tom_gall
2000-08-17 0:50 ` Graham Stoney
2000-08-17 19:28 ` Dan Malek
1 sibling, 1 reply; 13+ messages in thread
From: tom_gall @ 2000-08-16 16:22 UTC (permalink / raw)
To: Graham Stoney; +Cc: dan, linuxppc-dev
Graham Stoney wrote:
>
> Casting our minds back to July, Dan Malek wrote:
> > OK, I think I am bailing out here. For some reason, if I remove
> > the 'dcbz' instructions on the MPC8xx processor the world is just
> > a better place. I don't know why, maybe because of some of the
> > TLB mapping, but I can't find a reason.
>
> What was the eventual outcome of this? I've been doing some 2.2.13
> kernel profiling on the 860, and __copy_tofrom_user is coming up as a
> hotspot.
> I tried dropping in the new improved version from
> linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
> work for 32 byte cache lines.
Graham,
That's not right. Which version of linux-2.4.0-x do you have? I'm sure Paul's,
Cort's and I know for sure mine work for more than 32 byte cache lines
because 2 of my boxes have 128 byte cache lines and there's support in there for
it. (Paul wrote it... awesome stuff)
> I hacked it around and found the same as you: it won't work with the
> dcbz in there, and of course it doesn't run any faster than the old
> version without it. It's certainly getting more complex in there, and I
> see your point about whether the extra code will actually make it run
> any faster, especially on 8xx CPUs with small I-caches. I'd be keen to
> test whatever you've come up with to see if it's actually better than
> the old 2.2 code on 8xx CPUs.
This isn't the only spot either, by the way ... glibc has the same problem.
> It sounds like a few people have at least had a shot at adding support
> for other than 32 byte cache lines, but none have propagated into the
> official kernels; how does that happen anyway?
By official are you meaning Linus' ?
Regards,
Tom
--
Tom Gall - PowerPC Linux Team "Where's the ka-boom? There was
Linux Technology Center supposed to be an earth
(w) tom_gall@vnet.ibm.com shattering ka-boom!"
(w) 507-253-4558 -- Marvin Martian
(h) tgall@uswest.net
http://oss.software.ibm.com/developerworks/opensource/linux
* Re: Help with string.S
2000-08-16 16:22 ` tom_gall
@ 2000-08-17 0:50 ` Graham Stoney
0 siblings, 0 replies; 13+ messages in thread
From: Graham Stoney @ 2000-08-17 0:50 UTC (permalink / raw)
To: tom_gall; +Cc: Linux PowerPC developers mailing list
Hi Tom,
Thanks for the response...
Graham Stoney wrote:
> I tried dropping in the new improved version from linux-2.4.0-test7-pre4,
> and none of the 8xx mods are in there: it'll only work for 32 byte cache
> lines.
tom_gall@vnet.ibm.com writes:
> That's not right. Which version of linux-2.4.0-x do you have? I'm sure Paul's,
> Cort's and I know for sure mine work for more than 32 byte cache lines
> because 2 of my boxes have 128 byte cache lines and there's support in there
> for it. (Paul wrote it... awesome stuff)
I'm talking about linux-2.4.0-test7-pre4, from:
http://www.kernel.org/pub/linux/kernel/v2.4/linux-2.4.0-test6.tar.bz2
plus:
http://www.kernel.org/pub/linux/kernel/testing/test7-pre4.gz
I believe this is the latest test snapshot of what will soon become the
official 2.4. Is there some other repository I should be looking at for the
latest ppc stuff?
> This isn't the only spot either, by the way ... glibc has the same problem.
Yes, in sysdeps/powerpc/memset.S. This is easily enough solved by nuking it,
and allowing glibc to use the generic C version.
> > It sounds like a few people have at least had a shot at adding support
> > for other than 32 byte cache lines, but none have propagated into the
> > official kernels; how does that happen anyway?
>
> By official are you meaning Linus' ?
Yes. Do I need to run BitKeeper to keep up with the latest ppc changes?
They don't seem to be making it in to Linus's kernel...
Thanks,
Graham
--
Graham Stoney
Principal Hardware/Software Engineer
Canon Information Systems Research Australia
Ph: +61 2 9805 2909 Fax: +61 2 9805 2929
* Re: Help with string.S
2000-08-16 7:26 Graham Stoney
2000-08-16 16:22 ` tom_gall
@ 2000-08-17 19:28 ` Dan Malek
1 sibling, 0 replies; 13+ messages in thread
From: Dan Malek @ 2000-08-17 19:28 UTC (permalink / raw)
To: Graham Stoney; +Cc: dan, linuxppc-dev
Graham Stoney wrote:
> ..... I've been doing some 2.2.13
> kernel profiling on the 860, and __copy_tofrom_user is coming up as a
> hotspot.
In what kind of test?
> I tried dropping in the new improved version from
> linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
> work for 32 byte cache lines.
Hmmm....I checked it into the FSM BK tree a long time ago.
-- Dan