* Re: [PATCH] optimize ia32 memmove [not found] <200312300713.hBU7DGC4024213@hera.kernel.org> @ 2003-12-30 7:32 ` Jeff Garzik 2003-12-30 7:51 ` Andrew Morton 2003-12-30 10:21 ` Ed Sweetman 0 siblings, 2 replies; 12+ messages in thread From: Jeff Garzik @ 2003-12-30 7:32 UTC (permalink / raw) To: manfred, akpm; +Cc: Linux Kernel Mailing List Linux Kernel Mailing List wrote: > ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org > > [PATCH] optimize ia32 memmove > > From: Manfred Spraul <manfred@colorfullife.com> > > The memmove implementation of i386 is not optimized: it uses movsb, which is > far slower than movsd. The optimization is trivial: if dest is less than > source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved > around 300 million cpu ticks in cache_flusharray(): [...] > diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h > --- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 > +++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 > @@ -299,14 +299,9 @@ > static inline void * memmove(void * dest,const void * src, size_t n) > { > int d0, d1, d2; > -if (dest<src) > -__asm__ __volatile__( > - "rep\n\t" > - "movsb" > - : "=&c" (d0), "=&S" (d1), "=&D" (d2) > - :"0" (n),"1" (src),"2" (dest) > - : "memory"); > -else > +if (dest<src) { > + memcpy(dest,src,n); > +} else > __asm__ __volatile__( > "std\n\t" > "rep\n\t" Dumb question, though... what about the overlap case, when dest<src ? It seems to me this change is ignoring that. Jeff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik @ 2003-12-30 7:51 ` Andrew Morton 2003-12-30 7:56 ` Jeff Garzik 2003-12-30 10:21 ` Ed Sweetman 1 sibling, 1 reply; 12+ messages in thread From: Andrew Morton @ 2003-12-30 7:51 UTC (permalink / raw) To: Jeff Garzik; +Cc: manfred, linux-kernel Jeff Garzik <jgarzik@pobox.com> wrote: > > Linux Kernel Mailing List wrote: > > ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org > > > > [PATCH] optimize ia32 memmove > > > > From: Manfred Spraul <manfred@colorfullife.com> > > > > The memmove implementation of i386 is not optimized: it uses movsb, which is > > far slower than movsd. The optimization is trivial: if dest is less than > > source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved > > around 300 million cpu ticks in cache_flusharray(): > [...] > > diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h > > --- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 > > +++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 > > @@ -299,14 +299,9 @@ > > static inline void * memmove(void * dest,const void * src, size_t n) > > { > > int d0, d1, d2; > > -if (dest<src) > > -__asm__ __volatile__( > > - "rep\n\t" > > - "movsb" > > - : "=&c" (d0), "=&S" (d1), "=&D" (d2) > > - :"0" (n),"1" (src),"2" (dest) > > - : "memory"); > > -else > > +if (dest<src) { > > + memcpy(dest,src,n); > > +} else > > __asm__ __volatile__( > > "std\n\t" > > "rep\n\t" > > Dumb question, though... what about the overlap case, when dest<src ? > It seems to me this change is ignoring that. > "if dest is less that source, then call memcpy". If the move is to a higher address we do it the old way. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 7:51 ` Andrew Morton @ 2003-12-30 7:56 ` Jeff Garzik 2003-12-30 8:11 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Jeff Garzik @ 2003-12-30 7:56 UTC (permalink / raw) To: Andrew Morton; +Cc: manfred, linux-kernel Andrew Morton wrote: > Jeff Garzik <jgarzik@pobox.com> wrote: > >>Linux Kernel Mailing List wrote: >> >>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org >>> >>> [PATCH] optimize ia32 memmove >>> >>> From: Manfred Spraul <manfred@colorfullife.com> >>> >>> The memmove implementation of i386 is not optimized: it uses movsb, which is >>> far slower than movsd. The optimization is trivial: if dest is less than >>> source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved >>> around 300 million cpu ticks in cache_flusharray(): >> >>[...] >> >>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h >>>--- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 >>>+++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 >>>@@ -299,14 +299,9 @@ >>> static inline void * memmove(void * dest,const void * src, size_t n) >>> { >>> int d0, d1, d2; >>>-if (dest<src) >>>-__asm__ __volatile__( >>>- "rep\n\t" >>>- "movsb" >>>- : "=&c" (d0), "=&S" (d1), "=&D" (d2) >>>- :"0" (n),"1" (src),"2" (dest) >>>- : "memory"); >>>-else >>>+if (dest<src) { >>>+ memcpy(dest,src,n); >>>+} else >>> __asm__ __volatile__( >>> "std\n\t" >>> "rep\n\t" >> >>Dumb question, though... what about the overlap case, when dest<src ? >> It seems to me this change is ignoring that. >> > > > "if dest is less that source, then call memcpy". If the move is to a > higher address we do it the old way. I'm confused... that doesn't say anything to me about overlap. They can still overlap: Consider if dest is 1 byte less than src, and n==128... Jeff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 7:56 ` Jeff Garzik @ 2003-12-30 8:11 ` Andrew Morton 2003-12-30 8:11 ` Andreas Dilger 2003-12-30 9:58 ` Linus Torvalds 2 siblings, 0 replies; 12+ messages in thread From: Andrew Morton @ 2003-12-30 8:11 UTC (permalink / raw) To: Jeff Garzik; +Cc: manfred, linux-kernel Jeff Garzik <jgarzik@pobox.com> wrote: > > > "if dest is less that source, then call memcpy". If the move is to a > > higher address we do it the old way. > > > I'm confused... that doesn't say anything to me about overlap. > Overlap is OK if dest<src, because memcpy starts with the lowest address and works upwards. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 7:56 ` Jeff Garzik 2003-12-30 8:11 ` Andrew Morton @ 2003-12-30 8:11 ` Andreas Dilger 2003-12-30 10:05 ` Linus Torvalds 2003-12-30 9:58 ` Linus Torvalds 2 siblings, 1 reply; 12+ messages in thread From: Andreas Dilger @ 2003-12-30 8:11 UTC (permalink / raw) To: Jeff Garzik; +Cc: Andrew Morton, manfred, linux-kernel On Dec 30, 2003 02:56 -0500, Jeff Garzik wrote: > Andrew Morton wrote: > > Jeff Garzik <jgarzik@pobox.com> wrote: > > > >>Linux Kernel Mailing List wrote: > >> > >>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org > >>> > >>> [PATCH] optimize ia32 memmove > >>> > >>> From: Manfred Spraul <manfred@colorfullife.com> > >>> > >>> The memmove implementation of i386 is not optimized: it uses movsb, which is > >>> far slower than movsd. The optimization is trivial: if dest is less than > >>> source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved > >>> around 300 million cpu ticks in cache_flusharray(): > >> > >>[...] > >> > >>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h > >>>--- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 > >>>+++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 > >>>@@ -299,14 +299,9 @@ > >>> static inline void * memmove(void * dest,const void * src, size_t n) > >>> { > >>> int d0, d1, d2; > >>>-if (dest<src) > >>>-__asm__ __volatile__( > >>>- "rep\n\t" > >>>- "movsb" > >>>- : "=&c" (d0), "=&S" (d1), "=&D" (d2) > >>>- :"0" (n),"1" (src),"2" (dest) > >>>- : "memory"); > >>>-else > >>>+if (dest<src) { > >>>+ memcpy(dest,src,n); > >>>+} else > >>> __asm__ __volatile__( > >>> "std\n\t" > >>> "rep\n\t" > >> > >>Dumb question, though... what about the overlap case, when dest<src ? > >> It seems to me this change is ignoring that. > >> > > > > > > "if dest is less that source, then call memcpy". If the move is to a > > higher address we do it the old way. > > > I'm confused... that doesn't say anything to me about overlap. > > They can still overlap: Consider if dest is 1 byte less than src, and > n==128... The non-overlapping cases are probably very common and worth optimizing for: if (dest + n < src || src + n < dest) { memcpy(dest, src, n); } else { "old way" } People generally call memmove() if they _think_ the areas might overlap, but if memcpy() can be so much faster it is probably worth catching the cases where the two areas don't actually overlap. The above assumes that "dest + n" and "src + n" don't wrap, but that is probably a broken case in the current code so not worth adding extra checks in for. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 8:11 ` Andreas Dilger @ 2003-12-30 10:05 ` Linus Torvalds 0 siblings, 0 replies; 12+ messages in thread From: Linus Torvalds @ 2003-12-30 10:05 UTC (permalink / raw) To: Andreas Dilger; +Cc: Jeff Garzik, Andrew Morton, manfred, linux-kernel On Tue, 30 Dec 2003, Andreas Dilger wrote: > > The non-overlapping cases are probably very common and worth optimizing for: No, almost all non-overlapping users already just use "memcpy()". So most of the kernel uses of "memmove()" are likely overlapping - and just optimizing the non-overlap case probably doesn't help a lot. That's why you optimize the overlapping case in one direction. We really should do the other case right too. Linus ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 7:56 ` Jeff Garzik 2003-12-30 8:11 ` Andrew Morton 2003-12-30 8:11 ` Andreas Dilger @ 2003-12-30 9:58 ` Linus Torvalds 2003-12-30 10:17 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 12+ messages in thread From: Linus Torvalds @ 2003-12-30 9:58 UTC (permalink / raw) To: Jeff Garzik; +Cc: Andrew Morton, manfred, linux-kernel On Tue, 30 Dec 2003, Jeff Garzik wrote: > > I'm confused... that doesn't say anything to me about overlap. > > They can still overlap: Consider if dest is 1 byte less than src, and > n==128... But then anything that does the loads in ascending order is still ok, so it shouldn't matter - by the time "dest" has been overwritten, the source data has already been read. And all the "memcpy()" implementations had better do that anyway, in order to get nice memory access patterns. "rep movsl" certainly does. So assuming we have an ascending "memcpy()", the only case we need to care about is "overlap && dest > src". Now, if we have a non-ascending memcpy(), we have trouble. Linus ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 9:58 ` Linus Torvalds @ 2003-12-30 10:17 ` Jeremy Fitzhardinge 2003-12-30 11:12 ` Manfred Spraul 0 siblings, 1 reply; 12+ messages in thread From: Jeremy Fitzhardinge @ 2003-12-30 10:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff Garzik, Andrew Morton, manfred, Linux Kernel List On Tue, 2003-12-30 at 01:58, Linus Torvalds wrote: > But then anything that does the loads in ascending order is still ok, so > it shouldn't matter - by the time "dest" has been overwritten, the source > data has already been read. And all the "memcpy()" implementations had > better do that anyway, in order to get nice memory access patterns. "rep > movsl" certainly does. A PPC memcpy may end up clearing the destination before reading the source (using the cache-line zeroing instruction, to prevent the destination from being spuriously read to populate the cache line). J ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 10:17 ` Jeremy Fitzhardinge @ 2003-12-30 11:12 ` Manfred Spraul 2003-12-30 20:17 ` H. Peter Anvin 0 siblings, 1 reply; 12+ messages in thread From: Manfred Spraul @ 2003-12-30 11:12 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Linus Torvalds, Jeff Garzik, Andrew Morton, Linux Kernel List Jeremy Fitzhardinge wrote: >On Tue, 2003-12-30 at 01:58, Linus Torvalds wrote: > >>But then anything that does the loads in ascending order is still ok, so >>it shouldn't matter - by the time "dest" has been overwritten, the source >>data has already been read. And all the "memcpy()" implementations had >>better do that anyway, in order to get nice memory access patterns. "rep >>movsl" certainly does. AMD recommends to perform bulk copies backwards: That defeats the hw prefecher, and results in even better access patterns. Doesn't matter in this case, memmove is never used for bulk copies. >A PPC memcpy may end up clearing the destination before reading the >source (using the cache-line zeroing instruction, to prevent the >destination from being spuriously read to populate the cache line). > The change is i386 only, no effect on other archs. I found the unoptimized memmove in oprofiles of dbt2 testruns: slab contains a few memmoves to keep it's recently used arrays in strict LIFO order. Typically perhaps 100 bytes. -- Manfred ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 11:12 ` Manfred Spraul @ 2003-12-30 20:17 ` H. Peter Anvin 0 siblings, 0 replies; 12+ messages in thread From: H. Peter Anvin @ 2003-12-30 20:17 UTC (permalink / raw) To: linux-kernel Followup to: <3FF15DAB.8080203@colorfullife.com> By author: Manfred Spraul <manfred@colorfullife.com> In newsgroup: linux.dev.kernel > > AMD recommends to perform bulk copies backwards: That defeats the hw > prefecher, and results in even better access patterns. Doesn't matter in > this case, memmove is never used for bulk copies. > That's also a microoptimization for one particular microarchitecture *bug*. Hardware prefetchers are going omnidirectional going forward. Additionally, nearly all bulk copies are performed forward (DF=0) in existing codebases. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! If you send me mail in HTML format I will assume it's spam. "Unix gives you enough rope to shoot yourself in the foot." Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik 2003-12-30 7:51 ` Andrew Morton @ 2003-12-30 10:21 ` Ed Sweetman 2003-12-30 10:37 ` Andrew Morton 1 sibling, 1 reply; 12+ messages in thread From: Ed Sweetman @ 2003-12-30 10:21 UTC (permalink / raw) To: Jeff Garzik; +Cc: manfred, akpm, Linux Kernel Mailing List Is this the entire patch? When was the original post since I cant find the message going back over a month ago in the archives or any current posts. Jeff Garzik wrote: > Linux Kernel Mailing List wrote: > >> ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org >> >> [PATCH] optimize ia32 memmove >> >> From: Manfred Spraul <manfred@colorfullife.com> >> >> diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h >> --- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 >> +++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003 >> @@ -299,14 +299,9 @@ >> static inline void * memmove(void * dest,const void * src, size_t n) >> { >> int d0, d1, d2; >> -if (dest<src) >> -__asm__ __volatile__( >> - "rep\n\t" >> - "movsb" >> - : "=&c" (d0), "=&S" (d1), "=&D" (d2) >> - :"0" (n),"1" (src),"2" (dest) >> - : "memory"); >> -else >> +if (dest<src) { >> + memcpy(dest,src,n); >> +} else >> __asm__ __volatile__( >> "std\n\t" >> "rep\n\t" > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove 2003-12-30 10:21 ` Ed Sweetman @ 2003-12-30 10:37 ` Andrew Morton 0 siblings, 0 replies; 12+ messages in thread From: Andrew Morton @ 2003-12-30 10:37 UTC (permalink / raw) To: Ed Sweetman; +Cc: jgarzik, manfred, linux-kernel Ed Sweetman <ed.sweetman@wmich.edu> wrote: > > Is this the entire patch? When was the original post since I cant find > the message going back over a month ago in the archives or any current > posts. > The original patch is on the 2.5/2.6 commits mailing list, bk-commits-head@vger.kernel.org. That list sets linux-kernel@vger.kernel.org as the From: address, so replies come here. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2003-12-30 20:18 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <200312300713.hBU7DGC4024213@hera.kernel.org>
2003-12-30 7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
2003-12-30 7:51 ` Andrew Morton
2003-12-30 7:56 ` Jeff Garzik
2003-12-30 8:11 ` Andrew Morton
2003-12-30 8:11 ` Andreas Dilger
2003-12-30 10:05 ` Linus Torvalds
2003-12-30 9:58 ` Linus Torvalds
2003-12-30 10:17 ` Jeremy Fitzhardinge
2003-12-30 11:12 ` Manfred Spraul
2003-12-30 20:17 ` H. Peter Anvin
2003-12-30 10:21 ` Ed Sweetman
2003-12-30 10:37 ` Andrew Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox