* Help with string.S
@ 2000-07-08 22:57 Dan Malek
2000-07-08 23:57 ` Dan Malek
2000-07-10 6:14 ` Daniel Marmier
0 siblings, 2 replies; 14+ messages in thread
From: Dan Malek @ 2000-07-08 22:57 UTC
To: linuxppc-dev
I found the source of the 4xx and 8xx troubles in the 2.4.xx kernel.
The functions in arch/ppc/lib/string.S are broken for anything other
than 32-byte cache lines. I am making the changes, but it would be
nice to have someone else look at this as well. Beyond the apparent
parameters, there are lots of hidden assumptions that the cache line
is 32 bytes.
Thanks.
-- Dan
** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/
* Re: Help with string.S
2000-07-08 22:57 Help with string.S Dan Malek
@ 2000-07-08 23:57 ` Dan Malek
2000-07-10 6:14 ` Daniel Marmier
1 sibling, 0 replies; 14+ messages in thread
From: Dan Malek @ 2000-07-08 23:57 UTC
To: linuxppc-dev
Dan Malek wrote:
>
> I found the source of the 4xx and 8xx troubles in the 2.4.xx kernel.
It wasn't really the source of the problem, but the functions are
still not correct here. I converted the easy ones; the difficult one
is copy_tofrom_user, with all of its potential exception cases. I
converted it for 16- and 32-byte lines, but the 64- and 128-byte lines
need some work.
Thanks.
-- Dan
* Re: Help with string.S
2000-07-08 22:57 Help with string.S Dan Malek
2000-07-08 23:57 ` Dan Malek
@ 2000-07-10 6:14 ` Daniel Marmier
2000-07-10 15:17 ` David Edelsohn
2000-07-10 22:42 ` Dan Malek
1 sibling, 2 replies; 14+ messages in thread
From: Daniel Marmier @ 2000-07-10 6:14 UTC
To: Dan Malek; +Cc: linuxppc-dev
Hi Dan,
IIRC, I have sent you a patch that did the right thing for 16-byte
cache lines at the time of 2.3.99-pre5. So this is a known problem and
has already been fixed. Of course, if there are caches with 64 or
128 byte lines, some more work needs to be done.
What gives me trouble is the fact that the dcbz instruction in function
arch/ppc/lib/string.S:__copy_tofrom_user does not seem to work for me.
But the function works fine if I remove that instruction. Has anybody
else experienced similar problems?
Any suggestions welcome,
Daniel Marmier
* Re: Help with string.S
2000-07-10 6:14 ` Daniel Marmier
@ 2000-07-10 15:17 ` David Edelsohn
2000-07-10 22:42 ` Dan Malek
1 sibling, 0 replies; 14+ messages in thread
From: David Edelsohn @ 2000-07-10 15:17 UTC
To: daniel.marmier; +Cc: Dan Malek, linuxppc-dev
>>>>> Daniel Marmier writes:
Daniel> IIRC, I have sent you a patch that did the right thing for 16-byte
Daniel> cache lines at the time of 2.3.99-pre5. So this is a known problem and
Daniel> has already been fixed. Of course, if there are caches with 64 or
Daniel> 128 byte lines, some more work needs to be done.
Current 64-bit PowerPC chips use a cacheline size of 128 bytes.
Assuming 32 bytes, or 32 bytes and 16 bytes, or any small fixed set of values
is a mistake.
David
* Re: Help with string.S
2000-07-10 6:14 ` Daniel Marmier
2000-07-10 15:17 ` David Edelsohn
@ 2000-07-10 22:42 ` Dan Malek
2000-07-11 5:50 ` Daniel Marmier
2000-07-11 10:06 ` Adrian Cox
1 sibling, 2 replies; 14+ messages in thread
From: Dan Malek @ 2000-07-10 22:42 UTC
To: daniel.marmier; +Cc: linuxppc-dev
Daniel Marmier wrote:
> IIRC, I have sent you a patch that did the right thing for 16-byte
> cache lines at the time of 2.3.99-pre5.
Now I remember :-). I see too much code from too many places every
day!
> ...... Of course, if there are caches with 64 or
> 128 byte lines, some more work needs to be done.
Lots more :-).
> What gives me trouble is the fact that the dcbz instruction in function
> arch/ppc/lib/string.S:__copy_tofrom_user does not seem to work for me.
These are becoming a pain in the ass instructions. Has anyone ever
done some performance analysis to see what we really gain here in
real life? Sure, locally and logically you can make an intuitive
argument, but we are sure fetching lots of instructions just to get
this aligned, and further to actually move the data.
These instructions certainly don't work on uncached memory space,
causing the alignment exception and probably horrible performance without
people knowing. These instructions used to cause the exception on
the early MPC8xx processors when copyback cache wasn't enabled. Today,
the newer silicon doesn't fault at all regardless of cache mode. I
guess I need to determine what is really happening. Nothing would
be fine, but it appears _something_ (usually incorrect) happens.
> But the function works fine if I remove that instruction.
I'm still a C code fan:
    for (i = 0; i < count; i++)
        *d++ = *s++;
...and let the compiler guys make it go fast :-).
You know, we could make this even faster by using the Altivec and the
new cache streaming modes on the 7400 processors :-). I've tested this
in applications. It really works.
-- Dan
* Re: Help with string.S
2000-07-10 22:42 ` Dan Malek
@ 2000-07-11 5:50 ` Daniel Marmier
2000-07-13 18:52 ` Dan Malek
2000-07-11 10:06 ` Adrian Cox
1 sibling, 1 reply; 14+ messages in thread
From: Daniel Marmier @ 2000-07-11 5:50 UTC
To: Dan Malek; +Cc: linuxppc-dev
Dan Malek wrote:
> These are becoming a pain in the ass instructions. Has anyone ever
> done some performance analysis to see what we really gain here in
> real life? Sure, locally and logically you can make an intuitive
> argument, but we are sure fetching lots of instructions just to get
> this aligned, and further to actually move the data.
>
> These instructions certainly don't work on uncached memory space,
> causing the alignment exception and probably horrible performance without
> people knowing. These instructions used to cause the exception on
> the early MPC8xx processors when copyback cache wasn't enabled. Today,
> the newer silicon doesn't fault at all regardless of cache mode. I
> guess I need to determine what is really happening. Nothing would
> be fine, but it appears _something_ (usually incorrect) happens.
I have seen this happen on cacheable memory with copyback enabled.
The dcbz-memcpy caused the destination to be zeroed, IIRC.
> > But the function works fine if I remove that instruction.
>
> I'm still a C code fan:
>     for (i = 0; i < count; i++)
>         *d++ = *s++;
>
> ...and let the compiler guys make it go fast :-).
That would be cool, but I am sure the asm funcs perform much better.
I'll try to do some benchmarking if I have time.
Daniel M.
* Re: Help with string.S
2000-07-10 22:42 ` Dan Malek
2000-07-11 5:50 ` Daniel Marmier
@ 2000-07-11 10:06 ` Adrian Cox
2000-07-11 15:53 ` Dan Malek
1 sibling, 1 reply; 14+ messages in thread
From: Adrian Cox @ 2000-07-11 10:06 UTC
To: Dan Malek; +Cc: linuxppc-dev
Dan Malek wrote:
> > What gives me trouble is the fact that the dcbz instruction in function
> > arch/ppc/lib/string.S:__copy_tofrom_user does not seem to work for me.
> These are becoming a pain in the ass instructions. Has anyone ever
> done some performance analysis to see what we really gain here in
> real life? Sure, locally and logically you can make an intuitive
> argument, but we are sure fetching lots of instructions just to get
> this aligned, and further to actually move the data.
The 7xx(x) processors don't have the alignment handler set up to cover
this problem in 2.2, so they just get an oops when somebody writes to
uncached memory, like a framebuffer device. This could probably be
solved by starting the function with a test of the address, and using a
version without cache operations for target addresses above the kernel
image of memory.
Or by removing the cache operations. Even if they stay, could they be a
compilation time optimisation for particular processors?
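The address test suggested above could be sketched roughly as follows. This is only an illustration in portable C, not the kernel's code: `struct cached_range`, `dst_is_cacheable` and `choose_copy_path` are hypothetical names, and the bounds of the cacheable region are assumed to be supplied by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the cacheable region; not a kernel structure. */
struct cached_range {
    uintptr_t start;  /* first cacheable address */
    uintptr_t end;    /* one past the last cacheable address */
};

/* True only if [dst, dst + n) lies entirely inside the cacheable range. */
static bool dst_is_cacheable(const struct cached_range *r, uintptr_t dst, size_t n)
{
    return dst >= r->start && dst <= r->end && n <= r->end - dst;
}

enum copy_path { COPY_PLAIN, COPY_WITH_CACHE_OPS };

/* Pick the dcbz-style path only when the whole target is known to be cacheable. */
static enum copy_path choose_copy_path(const struct cached_range *r,
                                       const void *dst, size_t n)
{
    return dst_is_cacheable(r, (uintptr_t)dst, n) ? COPY_WITH_CACHE_OPS
                                                  : COPY_PLAIN;
}
```

A framebuffer address outside the range, or a copy that straddles the end of it, would take the plain path. The cost is one extra compare-and-branch sequence per call, which is the kind of overhead the thread is debating.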
> You know, we could make this even faster by using the Altivec and the
> new cache streaming modes on the 7400 processors :-). I've tested this
> in applications. It really works.
The 7400 certainly doesn't need the dcbz, as it will perform an implicit
allocation if the entire cache line is written by store instructions.
- Adrian Cox, AG Electronics
* Re: Help with string.S
2000-07-11 10:06 ` Adrian Cox
@ 2000-07-11 15:53 ` Dan Malek
0 siblings, 0 replies; 14+ messages in thread
From: Dan Malek @ 2000-07-11 15:53 UTC
To: Adrian Cox; +Cc: Dan Malek, linuxppc-dev
Adrian Cox wrote:
> The 7xx(x) processors don't have the alignment handler set up ....
Paul and I (and possibly others) conspired and added this in the late
2.3.xx kernels for all processors. It had been floating around for
the MPC8xx processors, I hit it again on the 8260, and we just made
the code generic for all processors. It "fixes" alignment faults and
will also zero memory on a dcbz fault. Hmmm, I wonder if this code
actually gets called and if it still does the right thing? I'll check
it again.
> Or by removing the cache operations. Even if they stay, could they be a
> compilation time optimisation for particular processors?
While the code wasn't really correct for anything but 32 byte cache
lines, it should work correctly on 16 byte lines. You don't get the
performance increase as the dcbz is only performed every other cache
line. However, like David mentioned, it really is broken for 64 and
128 byte cache lines. Here, you zero a long line, but only fill 32
bytes of data. You end up with a nearly zero filled buffer.
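The failure described above can be reproduced with a small host-side model. It is a sketch under stated assumptions, not the code in string.S: `model_dcbz` merely imitates dcbz by zeroing the line that contains an offset, and the copy loop is assumed to issue one dcbz per 32-byte block. All names here are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE   256  /* bytes copied in the experiment */
#define COPY_BLOCK 32   /* block size the copy loop assumes */

/* Model of dcbz: zero the whole cache line that contains offset off. */
static void model_dcbz(unsigned char *buf, size_t off, size_t line_size)
{
    size_t start = off & ~(line_size - 1);  /* line_size is a power of two */
    memset(buf + start, 0, line_size);
}

/* Copy loop that issues one dcbz per 32-byte block, whatever the real line size. */
static void copy_assuming_32(unsigned char *dst, const unsigned char *src,
                             size_t n, size_t real_line_size)
{
    size_t off;

    for (off = 0; off < n; off += COPY_BLOCK) {
        model_dcbz(dst, off, real_line_size);
        memcpy(dst + off, src + off, COPY_BLOCK);
    }
}

/* Number of destination bytes that still match the source after the copy. */
static size_t surviving_bytes(size_t real_line_size)
{
    unsigned char src[BUF_SIZE], dst[BUF_SIZE];
    size_t i, ok = 0;

    for (i = 0; i < BUF_SIZE; i++)
        src[i] = (unsigned char)(i % 255 + 1);  /* never zero */
    memset(dst, 0, sizeof(dst));

    copy_assuming_32(dst, src, BUF_SIZE, real_line_size);

    for (i = 0; i < BUF_SIZE; i++)
        ok += (dst[i] == src[i]);
    return ok;
}
```

In this model, `surviving_bytes(16)` and `surviving_bytes(32)` both return 256, while `surviving_bytes(64)` returns 128 and `surviving_bytes(128)` returns 64: each later dcbz wipes the blocks already copied into the same long line. With 16-byte lines the result is correct, but only every other line gets a dcbz.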
Anyway, enough talk, it has to be fixed. I'll do the best I can. I
would like to remove the assumption in copy_tofrom_user that we can
fault in so many places in the cache line. Considering all of the
alignment restrictions, it seems to me you will only fault on the
first access to the cache line (it isn't like you are going to
cross a page boundary in the middle of a line). This would simplify
the function and make for a much smaller exception table.
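The page-boundary argument can be checked with a few lines of C. This sketch assumes power-of-two line sizes that divide the page size; the helper names are hypothetical. It only covers accesses that are aligned to a cache line, so it says nothing about a second pointer that is not equally aligned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the byte range [addr, addr + len) touches more than one page. */
static bool span_crosses_page(uintptr_t addr, size_t len, size_t page_size)
{
    return (addr / page_size) != ((addr + len - 1) / page_size);
}

/* Check every naturally aligned cache line in the first npages pages. */
static bool aligned_lines_stay_in_page(size_t line_size, size_t page_size,
                                       size_t npages)
{
    uintptr_t addr;

    for (addr = 0; addr < page_size * npages; addr += line_size)
        if (span_crosses_page(addr, line_size, page_size))
            return false;
    return true;
}
```

For 4096-byte pages, `aligned_lines_stay_in_page` holds for 16, 32, 64 and 128 byte lines, whereas an unaligned 32-byte span such as `span_crosses_page(4080, 32, 4096)` does cross into the next page.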
> The 7400 certainly doesn't need the dcbz, as it will perform an implicit
> allocation if the entire cache line is written by store instructions.
No, but those cache streaming instructions and data move cache hints
really do something. It was my attempt at humor, you see :-).
-- Dan
* Re: Help with string.S
2000-07-11 5:50 ` Daniel Marmier
@ 2000-07-13 18:52 ` Dan Malek
0 siblings, 0 replies; 14+ messages in thread
From: Dan Malek @ 2000-07-13 18:52 UTC
To: daniel.marmier; +Cc: Dan Malek, linuxppc-dev
Daniel Marmier wrote:
>
> Dan Malek wrote:
> > These are becoming a pain in the ass instructions.
> I have seen this happen on cacheable memory with copyback enabled.
> The dcbz-memcpy caused the destination to be zeroed, IIRC.
OK, I think I am bailing out here. For some reason, if I remove
the 'dcbz' instructions on the MPC8xx processor the world is just
a better place. I don't know why, maybe because of some of the
TLB mapping, but I can't find a reason.
I am going to put #ifdef CONFIG_8xx around the dcbz instructions,
and where they are actually used to zero-fill memory I will use
store operations. The other option is to make the changes at a
higher level (like make 'clear_page' call 'memset' with 0), but I
think the direct assembly changes are preferable. Suggestions welcome.
I'll keep looking for a better solution, but I can't hold up others
trying to use this kernel.
Thanks.
-- Dan
* Re: Help with string.S
@ 2000-08-16 7:26 Graham Stoney
2000-08-16 16:22 ` tom_gall
2000-08-17 19:28 ` Dan Malek
0 siblings, 2 replies; 14+ messages in thread
From: Graham Stoney @ 2000-08-16 7:26 UTC
To: dan; +Cc: linuxppc-dev
Casting our minds back to July, Dan Malek wrote:
> OK, I think I am bailing out here. For some reason, if I remove
> the 'dcbz' instructions on the MPC8xx processor the world is just
> a better place. I don't know why, maybe because of some of the
> TLB mapping, but I can't find a reason.
What was the eventual outcome of this? I've been doing some 2.2.13
kernel profiling on the 860, and __copy_tofrom_user is coming up as a
hotspot.
I tried dropping in the new improved version from
linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
work for 32 byte cache lines.
I hacked it around and found the same as you: it won't work with the
dcbz in there, and of course it doesn't run any faster than the old
version without it. It's certainly getting more complex in there, and I
see your point about whether the extra code will actually make it run
any faster, especially on 8xx CPUs with small I-caches. I'd be keen to
test whatever you've come up with to see if it's actually better than
the old 2.2 code on 8xx CPUs.
It sounds like a few people have at least had a shot at adding support
for other than 32 byte cache lines, but none have propagated into the
official kernels; how does that happen anyway?
Thanks,
Graham
* Re: Help with string.S
2000-08-16 7:26 Help with string.S Graham Stoney
@ 2000-08-16 16:22 ` tom_gall
2000-08-17 0:50 ` Graham Stoney
2000-08-17 19:28 ` Dan Malek
1 sibling, 1 reply; 14+ messages in thread
From: tom_gall @ 2000-08-16 16:22 UTC
To: Graham Stoney; +Cc: dan, linuxppc-dev
Graham Stoney wrote:
>
> Casting our minds back to July, Dan Malek wrote:
> > OK, I think I am bailing out here. For some reason, if I remove
> > the 'dcbz' instructions on the MPC8xx processor the world is just
> > a better place. I don't know why, maybe because of some of the
> > TLB mapping, but I can't find a reason.
>
> What was the eventual outcome of this? I've been doing some 2.2.13
> kernel profiling on the 860, and __copy_tofrom_user is coming up as a
> hotspot.
> I tried dropping in the new improved version from
> linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
> work for 32 byte cache lines.
Graham,
That's not right. Which version of linux-2.4.0-x do you have? I'm sure Paul's
and Cort's work, and I know for sure mine works, for more than 32 byte cache
lines, because 2 of my boxes have 128 byte cache lines and there's support in
there for it. (Paul wrote it... awesome stuff)
> I hacked it around and found the same as you: it won't work with the
> dcbz in there, and of course it doesn't run any faster than the old
> version without it. It's certainly getting more complex in there, and I
> see your point about whether the extra code will actually make it run
> any faster, especially on 8xx CPUs with small I-caches. I'd be keen to
> test whatever you've come up with to see if it's actually better than
> the old 2.2 code on 8xx CPUs.
This isn't the only spot, by the way ... glibc has the same problem.
> It sounds like a few people have at least had a shot at adding support
> for other than 32 byte cache lines, but none have propagated into the
> official kernels; how does that happen anyway?
By official, do you mean Linus'?
Regards,
Tom
--
Tom Gall - PowerPC Linux Team
Linux Technology Center
(w) tom_gall@vnet.ibm.com
(w) 507-253-4558
(h) tgall@uswest.net
http://oss.software.ibm.com/developerworks/opensource/linux

"Where's the ka-boom? There was supposed to be an earth
shattering ka-boom!" -- Marvin Martian
* Re: Help with string.S
2000-08-16 16:22 ` tom_gall
@ 2000-08-17 0:50 ` Graham Stoney
0 siblings, 0 replies; 14+ messages in thread
From: Graham Stoney @ 2000-08-17 0:50 UTC
To: tom_gall; +Cc: Linux PowerPC developers mailing list
Hi Tom,
Thanks for the response...
Graham Stoney wrote:
> I tried dropping in the new improved version from linux-2.4.0-test7-pre4,
> and none of the 8xx mods are in there: it'll only work for 32 byte cache
> lines.
tom_gall@vnet.ibm.com writes:
> That's not right. Which version of linux-2.4.0-x do you have? I'm sure Paul's
> and Cort's work, and I know for sure mine works, for more than 32 byte cache
> lines, because 2 of my boxes have 128 byte cache lines and there's support in
> there for it. (Paul wrote it... awesome stuff)
I'm talking about linux-2.4.0-test7-pre4, from:
http://www.kernel.org/pub/linux/kernel/v2.4/linux-2.4.0-test6.tar.bz2
plus:
http://www.kernel.org/pub/linux/kernel/testing/test7-pre4.gz
I believe this is the latest test snapshot of what will soon become the
official 2.4. Is there some other repository I should be looking at for the
latest ppc stuff?
> This isn't the only spot, by the way ... glibc has the same problem.
Yes, in sysdeps/powerpc/memset.S. This is easily enough solved by nuking it,
and allowing glibc to use the generic C version.
> > It sounds like a few people have at least had a shot at adding support
> > for other than 32 byte cache lines, but none have propagated into the
> > official kernels; how does that happen anyway?
>
> By official, do you mean Linus'?
Yes. Do I need to run BitKeeper to keep up with the latest ppc changes?
They don't seem to be making it into Linus's kernel...
Thanks,
Graham
--
Graham Stoney
Principal Hardware/Software Engineer
Canon Information Systems Research Australia
Ph: +61 2 9805 2909 Fax: +61 2 9805 2929
* Re: Help with string.S
2000-08-16 7:26 Help with string.S Graham Stoney
2000-08-16 16:22 ` tom_gall
@ 2000-08-17 19:28 ` Dan Malek
2000-08-18 2:10 ` Kernel TCP performance profiling (was Re: Help with string.S) Graham Stoney
1 sibling, 1 reply; 14+ messages in thread
From: Dan Malek @ 2000-08-17 19:28 UTC
To: Graham Stoney; +Cc: dan, linuxppc-dev
Graham Stoney wrote:
> ..... I've been doing some 2.2.13
> kernel profiling on the 860, and __copy_tofrom_user is coming up as a
> hotspot.
In what kind of test?
> I tried dropping in the new improved version from
> linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
> work for 32 byte cache lines.
Hmmm....I checked it into the FSM BK tree a long time ago.
-- Dan
* Kernel TCP performance profiling (was Re: Help with string.S)
2000-08-17 19:28 ` Dan Malek
@ 2000-08-18 2:10 ` Graham Stoney
0 siblings, 0 replies; 14+ messages in thread
From: Graham Stoney @ 2000-08-18 2:10 UTC
To: Dan Malek; +Cc: dan, Linux PowerPC developers mailing list
Graham Stoney wrote:
> ..... I've been doing some 2.2.13
> kernel profiling on the 860, and __copy_tofrom_user is coming up as a
> hotspot.
Dan Malek writes:
> In what kind of test?
Reading data from a TCP connected socket at full speed over the FEC and just
dropping it, to measure the maximum theoretical TCP throughput. I added
/proc/profile support to the ppc kernel (it was missing) to see where the
time was going; the patch to do this is available at:
http://members.xoom.com/greyhams/linux/patches/2.2/profile.patch
Here are the top ten functions output from readprofile:
  3490 total                        0.0050
   972 csum_partial_copy_generic    6.5676
   754 __copy_to_user               2.4481
   129 do_lost_interrupts           2.0156
   113 kfree                        0.1519
    94 tcp_recvmsg                  0.0625
    85 kmalloc                      0.1250
    74 alloc_skb                    0.2534
    73 fec_enet_rx                  0.1393
    64 tcp_rcv_established          0.0370
    62 tcp_v4_rcv                   0.0686
The count values on the left are in jiffies, and those on the right are in
jiffies per instruction (I think real time would be more useful!). I split
__copy_tofrom_user so it would appear separately in the profile as
__copy_to_user and __copy_from_user.
It shows that almost half the time in TCP reception is consumed in:
1. checksumming & copying the data into the socket buffer in
csum_partial_copy_generic (called from fec_enet_rx->eth_copy_and_sum->
csum_partial_copy->csum_partial_copy_generic),
and
2. copying the result out to the user (called from sys_read->sock_read->
sock_recvmsg->tcp_recvmsg->memcpy_toiovec->copy_to_user->__copy_to_user)
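The "almost half" figure follows directly from the tick counts in the profile above. A minimal sketch of the arithmetic, using only the numbers from the readprofile output (the struct and function names are hypothetical):

```c
#include <stddef.h>
#include <string.h>

struct profile_entry {
    const char *name;
    unsigned ticks;
};

/* The two copy routines from the readprofile output above. */
static const struct profile_entry copy_entries[] = {
    { "csum_partial_copy_generic", 972 },
    { "__copy_to_user",            754 },
};

#define TOTAL_TICKS 3490u  /* the "total" line of the profile */

/* Look up the tick count for a function name; 0 if it is not listed. */
static unsigned ticks_for(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(copy_entries) / sizeof(copy_entries[0]); i++)
        if (strcmp(copy_entries[i].name, name) == 0)
            return copy_entries[i].ticks;
    return 0;
}

/* Share of the whole profile, as a percentage. */
static double percent_of_total(unsigned ticks)
{
    return 100.0 * ticks / TOTAL_TICKS;
}
```

The two routines together account for 972 + 754 = 1726 of 3490 ticks, which `percent_of_total(1726)` puts at just under 49.5%.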
I've got this crazy idea that the FEC could DMA directly to the skb to
eliminate the first copy. Pity the FEC can't calculate IP checksums for us,
but eliminating the copy should make it go faster even though tcp_v4_rcv
would then need to calculate the checksum in software. Would you like to tell
me why this won't work before I spend hours trying to implement it? :-)
> > I tried dropping in the new improved version from
> > linux-2.4.0-test7-pre4, and none of the 8xx mods are in there: it'll only
> > work for 32 byte cache lines.
>
> Hmmm....I checked it into the FSM BK tree a long time ago.
Anyone know how/when these propagate to the stuff on kernel.org?
Thanks,
Graham
--
Graham Stoney
Principal Hardware/Software Engineer
Canon Information Systems Research Australia
Ph: +61 2 9805 2909 Fax: +61 2 9805 2929