* [U-Boot] [PATCH V2 0/3] make memcpy and memset faster
@ 2009-10-08 11:29 Alessandro Rubini
2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 11:29 UTC (permalink / raw)
To: u-boot
I've added 32-bit lcd to the Nomadik (not submitted yet), and I found
the scroll to be very slow, as the screen is big.
Instead of activating the "if 0" stanza for 32-bit scroll in lcd.c,
I'd better have a faster memcpy/memset globally. So this patch set
adds ulong-wide memcpy and memset, then removes the "#if 0" part in the
scroll function. For me scrolling is 4 times faster on a 32 bit system.
V2: I incorporated most of the comments, but I didn't change the for
loops to help the compiler optimizing it, since nowadays gcc is
already doing the loops his own way irrespective of what i write.
Similarly, I'm not interested in "4 bytes at a time, then 1 at a time"
as it's quite a corner case. If such optimizations are really useful,
then we'd better have hand-crafted assembly for each arch, possibly
lifted from glibc.
Alessandro Rubini (3):
memcpy: copy one word at a time if possible
memset: fill one word at a time if possible
lcd: remove '#if 0' 32-bit scroll, now memcpy does it
common/lcd.c | 21 ---------------------
lib_generic/string.c | 34 +++++++++++++++++++++++++++++-----
2 files changed, 29 insertions(+), 26 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini @ 2009-10-08 11:30 ` Alessandro Rubini 2009-10-08 15:12 ` Peter Tyser ` (2 more replies) 2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini ` (3 subsequent siblings) 4 siblings, 3 replies; 24+ messages in thread From: Alessandro Rubini @ 2009-10-08 11:30 UTC (permalink / raw) To: u-boot From: Alessandro Rubini <rubini@unipv.it> Signed-off-by: Alessandro Rubini <rubini@unipv.it> Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> --- lib_generic/string.c | 17 +++++++++++++---- 1 files changed, 13 insertions(+), 4 deletions(-) diff --git a/lib_generic/string.c b/lib_generic/string.c index 181eda6..9911941 100644 --- a/lib_generic/string.c +++ b/lib_generic/string.c @@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count) * You should not use this function to access IO space, use memcpy_toio() * or memcpy_fromio() instead. */ -void * memcpy(void * dest,const void *src,size_t count) +void * memcpy(void *dest, const void *src, size_t count) { - char *tmp = (char *) dest, *s = (char *) src; + char *d8 = (char *)dest, *s8 = (char *)src; + unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; + /* if all data is aligned (common case), copy a word at a time */ + if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { + count /= sizeof(unsigned long); + while (count--) + *dl++ = *sl++; + return dest; + } + /* else, use 1-byte copy */ while (count--) - *tmp++ = *s++; + *d8++ = *s8++; return dest; } -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini @ 2009-10-08 15:12 ` Peter Tyser 2009-10-08 16:00 ` Alessandro Rubini 2009-10-08 19:14 ` Mike Frysinger 2009-10-08 20:44 ` Wolfgang Denk 2 siblings, 1 reply; 24+ messages in thread From: Peter Tyser @ 2009-10-08 15:12 UTC (permalink / raw) To: u-boot On Thu, 2009-10-08 at 13:30 +0200, Alessandro Rubini wrote: > From: Alessandro Rubini <rubini@unipv.it> > > Signed-off-by: Alessandro Rubini <rubini@unipv.it> > Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> > --- > lib_generic/string.c | 17 +++++++++++++---- > 1 files changed, 13 insertions(+), 4 deletions(-) > > diff --git a/lib_generic/string.c b/lib_generic/string.c > index 181eda6..9911941 100644 > --- a/lib_generic/string.c > +++ b/lib_generic/string.c > @@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count) > * You should not use this function to access IO space, use memcpy_toio() > * or memcpy_fromio() instead. > */ > -void * memcpy(void * dest,const void *src,size_t count) > +void * memcpy(void *dest, const void *src, size_t count) > { > - char *tmp = (char *) dest, *s = (char *) src; > + char *d8 = (char *)dest, *s8 = (char *)src; > + unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; > > + /* if all data is aligned (common case), copy a word at a time */ > + if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { > + count /= sizeof(unsigned long); > + while (count--) > + *dl++ = *sl++; > + return dest; > + } > + /* else, use 1-byte copy */ > while (count--) > - *tmp++ = *s++; > + *d8++ = *s8++; > > return dest; > } Hi Alessandro, No interest in the suggestion to not require count to be an exact multiple of 4/8? I don't think it would be that hard to update the logic accordingly and this would let your code be utilized much more often, especially if/when we run on a 64-bit machine. Conceptually it seems like a cleaner implementation too, but that's probably just my preference:) Best, Peter ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 15:12 ` Peter Tyser @ 2009-10-08 16:00 ` Alessandro Rubini 2009-10-08 16:30 ` Peter Tyser 2009-10-08 20:47 ` Wolfgang Denk 0 siblings, 2 replies; 24+ messages in thread From: Alessandro Rubini @ 2009-10-08 16:00 UTC (permalink / raw) To: u-boot > No interest in the suggestion to not require count to be an exact > multiple of 4/8? Actually, I wrote about that in my patch 0/3. > I don't think it would be that hard to update the logic accordingly > and this would let your code be utilized much more often, especially > if/when we run on a 64-bit machine. That's true, but I think the most important case is lcd scrolling, where it's usually a big power of two -- that's where we had the #ifdef, so the problem was known, I suppose. Currently, I don't even know if this is going to be picked up, so I don't want to go too far -- and in that case I would like to measure things and be able to test for stupid bugs, so it takes time. Scrolling the video is an easy test for memcpy. If there's interest, there's memmov to fix; it's used pretty often and it's the same idea as memcpy (well, scrolling should use memmove, in theory). /alessandro ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 16:00 ` Alessandro Rubini @ 2009-10-08 16:30 ` Peter Tyser 2009-10-08 18:23 ` Alessandro Rubini 2009-10-08 20:47 ` Wolfgang Denk 1 sibling, 1 reply; 24+ messages in thread From: Peter Tyser @ 2009-10-08 16:30 UTC (permalink / raw) To: u-boot On Thu, 2009-10-08 at 18:00 +0200, Alessandro Rubini wrote: > > No interest in the suggestion to not require count to be an exact > > multiple of 4/8? > > Actually, I wrote about that in my patch 0/3. Sorry, I should have read more thoroughly. > > I don't think it would be that hard to update the logic accordingly > > and this would let your code be utilized much more often, especially > > if/when we run on a 64-bit machine. > > That's true, but I think the most important case is lcd scrolling, > where it's usually a big power of two -- that's where we had the #ifdef, > so the problem was known, I suppose. I think the most important case for *you* is lcd scrolling, but for 99% of everyone else, it isn't at all:) memcpy() and memset() are used 100 times more often in non-lcd related code and most boards don't even have LCDs. In my opinion, we shouldn't be fixing/bloating common code for 1 outlying scenario, we should be trying to improve it for common cases. > Currently, I don't even know if this is going to be picked up, so I > don't want to go too far -- and in that case I would like to measure > things and be able to test for stupid bugs, so it takes time. Sure, I understand where you're coming from. FWIW, if people don't want to pick this up as it affects lots of people, you could always add an architecture-specific memcpy/memset implementation. > If there's interest, there's memmov to fix; it's used pretty often and > it's the same idea as memcpy (well, scrolling should use memmove, > in theory). I'll quit pestering, just wanted to put my $.02 in:) Best, Peter ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 16:30 ` Peter Tyser @ 2009-10-08 18:23 ` Alessandro Rubini 2009-10-08 19:09 ` Peter Tyser 0 siblings, 1 reply; 24+ messages in thread From: Alessandro Rubini @ 2009-10-08 18:23 UTC (permalink / raw) To: u-boot >> That's true, but I think the most important case is lcd scrolling, >> where it's usually a big power of two -- that's where we had the #ifdef, >> so the problem was known, I suppose. > > I think the most important case for *you* is lcd scrolling, but for 99% > of everyone else, it isn't at all:) Well, its a big memcpy, and it has direct effect on the user. Every other copy is smaller, or has no interactive value. > memcpy() and memset() are used 100 times more often in non-lcd > related code and most boards don't even have LCDs. That's true. But it's only a boot loader (I just looked at what Nicolas Pitre did in the kernel for ARM strcpy and, well....). So I made some measures (it's one of Pike's rules of programming: * Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest. ) I booted in u-boot, typed "setenv stdout serial" then "boot", which goes over the ethernet. Stopped the system after u-boot gave over control to the kernel. Result: 10412 memcopies so divided (number, length): 3941 4 1583 6 772 20 1 46 1 47 3 60 1024 64 1 815 1 888 770 1148 1543 1480 1 2283 1 3836 770 4096 So I dare say non-power-of-4 is a minority anyways: 1587 calls, 12689 bytes. i.e. 15.2% of the calls and 0.2% of the data. Data collected in memory with patch below, used with following line: od -An -t d4 logfile | awk '{print $4}' | sort -n | uniq -c diff --git a/include/configs/nhk8815.h b/include/configs/nhk8815.h index edd698e..a390f28 100644 --- a/include/configs/nhk8815.h +++ b/include/configs/nhk8815.h @@ -28,6 +28,8 @@ #include <nomadik.h> +#define CONFIG_MCLOGSIZE (16*1024) + #define CONFIG_ARM926EJS #define CONFIG_NOMADIK #define CONFIG_NOMADIK_8815 /* cpu variant */ diff --git a/lib_generic/string.c b/lib_generic/string.c index 5f7aff9..5afa11e 100644 --- a/lib_generic/string.c +++ b/lib_generic/string.c @@ -19,6 +19,7 @@ #include <linux/string.h> #include <linux/ctype.h> #include <malloc.h> +#include <common.h> #if 0 /* not used - was: #ifndef __HAVE_ARCH_STRNICMP */ @@ -461,11 +462,29 @@ char * bcopy(const char * src, char * dest, int count) * You should not use this function to access IO space, use memcpy_toio() * or memcpy_fromio() instead. */ + +#ifndef CONFIG_MCLOGSIZE /* if you want to log the memcpy calls, define it */ +#define CONFIG_MCLOGSIZE 0 +#endif +struct mclog {int idx; void *dst; const void *src; int cnt;}; +static struct mclog mclog[CONFIG_MCLOGSIZE]; + void * memcpy(void *dest, const void *src, size_t count) { char *d8 = (char *)dest, *s8 = (char *)src; unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; + if (CONFIG_MCLOGSIZE) { + static int idx; + struct mclog *p = mclog + (idx % (CONFIG_MCLOGSIZE ?: 1)); + if (!idx) printf("memcpy log at %p, size 0x%x\n", + mclog, sizeof(mclog)); + p->idx = idx++; + p->dst = dest; + p->src = src; + p->cnt = count; + } + /* if all data is aligned (common case), copy a word at a time */ if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { count /= sizeof(unsigned long); ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 18:23 ` Alessandro Rubini @ 2009-10-08 19:09 ` Peter Tyser 2009-10-08 19:17 ` Alessandro Rubini 0 siblings, 1 reply; 24+ messages in thread From: Peter Tyser @ 2009-10-08 19:09 UTC (permalink / raw) To: u-boot On Thu, 2009-10-08 at 20:23 +0200, Alessandro Rubini wrote: > >> That's true, but I think the most important case is lcd scrolling, > >> where it's usually a big power of two -- that's where we had the #ifdef, > >> so the problem was known, I suppose. > > > > I think the most important case for *you* is lcd scrolling, but for 99% > > of everyone else, it isn't at all:) > > Well, its a big memcpy, and it has direct effect on the user. Every > other copy is smaller, or has no interactive value. > > > memcpy() and memset() are used 100 times more often in non-lcd > > related code and most boards don't even have LCDs. > > That's true. But it's only a boot loader (I just looked at what Nicolas > Pitre did in the kernel for ARM strcpy and, well....). > > So I made some measures (it's one of Pike's rules of programming: > > * Rule 2. Measure. Don't tune for speed until you've measured, and even > then don't unless one part of the code overwhelms the rest. > > ) > > I booted in u-boot, typed "setenv stdout serial" then "boot", which goes > over the ethernet. Stopped the system after u-boot gave over control to > the kernel. Result: 10412 memcopies so divided (number, length): > > 3941 4 > 1583 6 > 772 20 > 1 46 > 1 47 > 3 60 > 1024 64 > 1 815 > 1 888 > 770 1148 > 1543 1480 > 1 2283 > 1 3836 > 770 4096 > > So I dare say non-power-of-4 is a minority anyways: 1587 calls, 12689 bytes. > i.e. 15.2% of the calls and 0.2% of the data. The statistics are going to be very different for different scenarios. For example, network operations seem to be the majority of your large memcpys, this isn't the case for everyone. If/when U-Boot runs on 64-bit hardware, the stats will change too. In any case, my only suggestion would be that if we're improving memcpy()/memset(), do the extra 10% of effort required to make them a little better. That 10% of effort will improve 15.2% of all memcpy() calls for the foreseeable future:) I honestly don't care much what implementation you choose as I only currently use PPC, which has their own memcpy()/memset(). I'm only trying to be helpful, feel free to proceed however you see fit, I promise I won't comment on future patches:) Best, Peter ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 19:09 ` Peter Tyser @ 2009-10-08 19:17 ` Alessandro Rubini 2009-10-08 20:40 ` Wolfgang Denk 0 siblings, 1 reply; 24+ messages in thread From: Alessandro Rubini @ 2009-10-08 19:17 UTC (permalink / raw) To: u-boot > The statistics are going to be very different for different scenarios. Yes, I know. > For example, network operations seem to be the majority of your large > memcpys, this isn't the case for everyone. True. I noticed it after sending -- although I expected it. > In any case, my only suggestion would be that if we're improving > memcpy()/memset(), do the extra 10% of effort required to make them a > little better. That 10% of effort will improve 15.2% of all memcpy() > calls for the foreseeable future:) It mainly depends on Wolfgang but, hey, it's not 10% of effort. > I promise I won't comment on future patches:) No problem at all. And I apologize if my tone looked rude, it wasn't meant to. Thank you for your comments. /alessandro, who didn't notice ppc has an asm implementation of its own ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 19:17 ` Alessandro Rubini @ 2009-10-08 20:40 ` Wolfgang Denk 0 siblings, 0 replies; 24+ messages in thread From: Wolfgang Denk @ 2009-10-08 20:40 UTC (permalink / raw) To: u-boot Dear Alessandro Rubini, In message <20091008191734.GA13161@mail.gnudd.com> you wrote: > > > In any case, my only suggestion would be that if we're improving > > memcpy()/memset(), do the extra 10% of effort required to make them a > > little better. That 10% of effort will improve 15.2% of all memcpy() > > calls for the foreseeable future:) > > It mainly depends on Wolfgang but, hey, it's not 10% of effort. It's less, right ? :-) I think now we have this part of code in focus, we should to the Right Thing :-) Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Men will always be men -- no matter where they are. -- Harry Mudd, "Mudd's Women", stardate 1329.8 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 16:00 ` Alessandro Rubini 2009-10-08 16:30 ` Peter Tyser @ 2009-10-08 20:47 ` Wolfgang Denk 1 sibling, 0 replies; 24+ messages in thread From: Wolfgang Denk @ 2009-10-08 20:47 UTC (permalink / raw) To: u-boot Dear Alessandro Rubini, In message <20091008160026.GA9077@mail.gnudd.com> you wrote: > > No interest in the suggestion to not require count to be an exact > > multiple of 4/8? > > Actually, I wrote about that in my patch 0/3. Hm. Your argument was not exctly convincing, though. > Currently, I don't even know if this is going to be picked up, so I > don't want to go too far -- and in that case I would like to measure > things and be able to test for stupid bugs, so it takes time. > Scrolling the video is an easy test for memcpy. Sure this is going to ber picked uyp. And actually the implementation can be made even simpler doing both. > If there's interest, there's memmov to fix; it's used pretty often and > it's the same idea as memcpy (well, scrolling should use memmove, > in theory). Volunteers welcome :-) Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de "Where humor is concerned there are no standards -- no one can say what is good or bad, although you can be sure that everyone will. - John Kenneth Galbraith ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini 2009-10-08 15:12 ` Peter Tyser @ 2009-10-08 19:14 ` Mike Frysinger 2009-10-08 20:44 ` Wolfgang Denk 2 siblings, 0 replies; 24+ messages in thread From: Mike Frysinger @ 2009-10-08 19:14 UTC (permalink / raw) To: u-boot On Thursday 08 October 2009 07:30:02 Alessandro Rubini wrote: > + if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { when i talked about changing the int cast, i was talking about the pointers. pointers should always be cast to (unsigned long). -mike -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part. Url : http://lists.denx.de/pipermail/u-boot/attachments/20091008/23acd1cb/attachment.pgp ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini 2009-10-08 15:12 ` Peter Tyser 2009-10-08 19:14 ` Mike Frysinger @ 2009-10-08 20:44 ` Wolfgang Denk 2009-10-09 4:42 ` Chris Moore 2 siblings, 1 reply; 24+ messages in thread From: Wolfgang Denk @ 2009-10-08 20:44 UTC (permalink / raw) To: u-boot Dear Alessandro Rubini, In message <45d5e3a574bf4844f46f50b2c88054a5b28f973b.1255000877.git.rubini@ unipv.it> you wrote: > From: Alessandro Rubini <rubini@unipv.it> > > Signed-off-by: Alessandro Rubini <rubini@unipv.it> > Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> > --- > lib_generic/string.c | 17 +++++++++++++---- > 1 files changed, 13 insertions(+), 4 deletions(-) > > diff --git a/lib_generic/string.c b/lib_generic/string.c > index 181eda6..9911941 100644 > --- a/lib_generic/string.c > +++ b/lib_generic/string.c > @@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count) > * You should not use this function to access IO space, use memcpy_toio() > * or memcpy_fromio() instead. > */ > -void * memcpy(void * dest,const void *src,size_t count) > +void * memcpy(void *dest, const void *src, size_t count) > { > - char *tmp = (char *) dest, *s = (char *) src; > + char *d8 = (char *)dest, *s8 = (char *)src; > + unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; > > + /* if all data is aligned (common case), copy a word at a time */ > + if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { > + count /= sizeof(unsigned long); > + while (count--) > + *dl++ = *sl++; > + return dest; > + } > + /* else, use 1-byte copy */ > while (count--) > - *tmp++ = *s++; > + *d8++ = *s8++; > > return dest; I think we should change this if-else into a plain if, something like that: void * memcpy(void *dest, const void *src, size_t count) { char *tmp = (char *) dest, *s = (char *) src; char *d8 = (char *)dest, *s8 = (char *)src; unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; /* while all data is aligned (common case), copy a word at a time */ if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { while (count) { *dl++ = *sl++; count -= sizeof(unsigned long); } } while (count--) *d8++ = *s8++; return dest; } This way we can have both - the "long" copy of a potential aligne dfirst part, and the byte copy of any trailing (or unaligned) part. Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Computers make excellent and efficient servants, but I have no wish to serve under them. Captain, a starship also runs on loyalty to one man. And nothing can replace it or him. -- Spock, "The Ultimate Computer", stardate 4729.4 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-08 20:44 ` Wolfgang Denk @ 2009-10-09 4:42 ` Chris Moore 2009-10-09 10:11 ` Mark Jackson 2009-10-09 11:12 ` Wolfgang Denk 0 siblings, 2 replies; 24+ messages in thread From: Chris Moore @ 2009-10-09 4:42 UTC (permalink / raw) To: u-boot Wolfgang Denk a ?crit : > I think we should change this if-else into a plain if, something like > that: > > void * memcpy(void *dest, const void *src, size_t count) > { > char *tmp = (char *) dest, *s = (char *) src; > char *d8 = (char *)dest, *s8 = (char *)src; > unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; > > /* while all data is aligned (common case), copy a word at a time */ > if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { > while (count) { > *dl++ = *sl++; > count -= sizeof(unsigned long); > } > } > while (count--) > *d8++ = *s8++; > > return dest; > } > > This way we can have both - the "long" copy of a potential aligne > dfirst part, and the byte copy of any trailing (or unaligned) part. > > I agree wholeheartedly with the idea but shouldn't it be more like this (untested) code : void * memcpy(void *dest, const void *src, size_t count) { char *d8, *s8; unsigned long *dl = dest, *sl = src; /* while all data is aligned (common case), copy multiple bytes at a time */ if ( (((int)(long)dest | (int)(long)src) & (sizeof(*dl) - 1)) == 0) { while (count >= sizeof(*dl)) { *dl++ = *sl++; count -= sizeof(*dl); } } d8 = (char *)dl; s8 = (char *)sl; /* copy any remaining data byte by byte */ while (count--) *d8++ = *s8++; return dest; } Remarks : 1) My curious (int) (long) pointer casts are intended to avoid compiler warnings while avoiding unnecessary calculations in long. On some architectures long calculations are less efficient than int ones. In fact I wonder whether, on such architectures, it might not also be better to perform the copy with int size chunks. 2) Personally I prefer sizeof(*dl) to sizeof(unsigned long) as there is less risk of error if the type of the chunks is changed. 3) In C (but not in C++) I think the casts from void * to unsigned long * are unnecessary. But as I said all this is completely untested :( Cheers, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-09 4:42 ` Chris Moore @ 2009-10-09 10:11 ` Mark Jackson 2009-10-09 10:26 ` Mike Frysinger 2009-10-09 11:12 ` Wolfgang Denk 1 sibling, 1 reply; 24+ messages in thread From: Mark Jackson @ 2009-10-09 10:11 UTC (permalink / raw) To: u-boot Chris Moore wrote: > I agree wholeheartedly with the idea but shouldn't it be more like this > (untested) code : > > void * memcpy(void *dest, const void *src, size_t count) > > { > char *d8, *s8; > unsigned long *dl = dest, *sl = src; > In here, would it be overkill to add byte copying until data is aligned, and then fall into the aligned copy code. In that case, you'd still gain a speed increase if you're starting at an unaligned address ? > /* while all data is aligned (common case), copy multiple bytes at a time */ > if ( (((int)(long)dest | (int)(long)src) & (sizeof(*dl) - 1)) == 0) { > while (count >= sizeof(*dl)) { > *dl++ = *sl++; > count -= sizeof(*dl); > } > } > > d8 = (char *)dl; > s8 = (char *)sl; > > /* copy any remaining data byte by byte */ > while (count--) > *d8++ = *s8++; > > return dest; > } Regards Mark ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-09 10:11 ` Mark Jackson @ 2009-10-09 10:26 ` Mike Frysinger 2009-10-11 7:06 ` Chris Moore 0 siblings, 1 reply; 24+ messages in thread From: Mike Frysinger @ 2009-10-09 10:26 UTC (permalink / raw) To: u-boot On Friday 09 October 2009 06:11:16 Mark Jackson wrote: > Chris Moore wrote: > > I agree wholeheartedly with the idea but shouldn't it be more like this > > (untested) code : > > > > void * memcpy(void *dest, const void *src, size_t count) > > > > { > > char *d8, *s8; > > unsigned long *dl = dest, *sl = src; > > In here, would it be overkill to add byte copying until data is aligned, > and then fall into the aligned copy code. both addresses have to be unaligned the same ... if ((ulong)dl & (sizeof(*dl) - 1) == (ulong)sl & (sizeof(*sl) - 1)) > In that case, you'd still gain a speed increase if you're starting at an > unaligned address ? now it's a question of how often does this come up and is it worth the code size increase ? -mike -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part. Url : http://lists.denx.de/pipermail/u-boot/attachments/20091009/fe1b4c85/attachment.pgp ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-09 10:26 ` Mike Frysinger @ 2009-10-11 7:06 ` Chris Moore 0 siblings, 0 replies; 24+ messages in thread From: Chris Moore @ 2009-10-11 7:06 UTC (permalink / raw) To: u-boot Mike Frysinger a ?crit : > On Friday 09 October 2009 06:11:16 Mark Jackson wrote: > >> Chris Moore wrote: >> >>> I agree wholeheartedly with the idea but shouldn't it be more like this >>> (untested) code : >>> >>> void * memcpy(void *dest, const void *src, size_t count) >>> >>> { >>> char *d8, *s8; >>> unsigned long *dl = dest, *sl = src; >>> >> In here, would it be overkill to add byte copying until data is aligned, >> and then fall into the aligned copy code. >> > > both addresses have to be unaligned the same ... > > if ((ulong)dl & (sizeof(*dl) - 1) == (ulong)sl & (sizeof(*sl) - 1)) > > >> In that case, you'd still gain a speed increase if you're starting at an >> unaligned address ? >> > > now it's a question of how often does this come up and is it worth the code > size increase ? > Like Mike I'm not sure if it is worth it for memcpy. However it may be for memset where only the destination has to be aligned. Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible 2009-10-09 4:42 ` Chris Moore 2009-10-09 10:11 ` Mark Jackson @ 2009-10-09 11:12 ` Wolfgang Denk 1 sibling, 0 replies; 24+ messages in thread From: Wolfgang Denk @ 2009-10-09 11:12 UTC (permalink / raw) To: u-boot Dear Chris Moore, In message <4ACEBF19.4010902@free.fr> you wrote: > > I agree wholeheartedly with the idea but shouldn't it be more like this > (untested) code : Thanks for catching this. Of course you are right - my code was untested as well, as usual ;-) Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de A UNIX saleslady, Lenore, Enjoys work, but she likes the beach more. She found a good way To combine work and play: She sells C shells by the seashore. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 2/3] memset: fill one word at a time if possible 2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini 2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini @ 2009-10-08 11:30 ` Alessandro Rubini 2009-10-08 20:46 ` Wolfgang Denk 2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini ` (2 subsequent siblings) 4 siblings, 1 reply; 24+ messages in thread From: Alessandro Rubini @ 2009-10-08 11:30 UTC (permalink / raw) To: u-boot From: Alessandro Rubini <rubini@unipv.it> Signed-off-by: Alessandro Rubini <rubini@unipv.it> Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> --- lib_generic/string.c | 17 ++++++++++++++++- 1 files changed, 16 insertions(+), 1 deletions(-) diff --git a/lib_generic/string.c b/lib_generic/string.c index 9911941..5f7aff9 100644 --- a/lib_generic/string.c +++ b/lib_generic/string.c @@ -404,7 +404,22 @@ char *strswab(const char *s) void * memset(void * s,int c,size_t count) { char *xs = (char *) s; - + unsigned long *sl = (unsigned long *) s; + unsigned long cl = 0; + int i; + + /* do it one word at a time (32 bits or 64 bits) if possible */ + if ( ((count | (int)s) & (sizeof(long) - 1)) == 0) { + count /= sizeof(long); + for (i=0; i<sizeof(long); ++i) { + cl <<= 8; + cl |= c & 0xff; + } + while (count--) + *sl++ = cl; + return s; + } + /* else, fill 8 bits@a time */ while (count--) *xs++ = c; -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 2/3] memset: fill one word at a time if possible 2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini @ 2009-10-08 20:46 ` Wolfgang Denk 0 siblings, 0 replies; 24+ messages in thread From: Wolfgang Denk @ 2009-10-08 20:46 UTC (permalink / raw) To: u-boot Dear Alessandro Rubini, In message <f6eaf0a025bdc32019286a66bf10cc2184f00995.1255000877.git.rubini@ unipv.it> you wrote: ... > void * memset(void * s,int c,size_t count) > { > char *xs = (char *) s; > - > + unsigned long *sl = (unsigned long *) s; > + unsigned long cl = 0; > + int i; Same changes as suggested for the memcpy() patch go here: > + /* do it one word at a time (32 bits or 64 bits) if possible */ > + if ( ((count | (int)s) & (sizeof(long) - 1)) == 0) { > + count /= sizeof(long); delete this; > + for (i=0; i<sizeof(long); ++i) { > + cl <<= 8; > + cl |= c & 0xff; > + } > + while (count--) > + *sl++ = cl; and make this while (count) { *sl++ = cl; count -= sizeof(long); } > + return s; delete this, too. > + } > + /* else, fill 8 bits at a time */ Delete the comment, too. > while (count--) > *xs++ = c; Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de "One lawyer can steal more than a hundred men with guns." - The Godfather ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it 2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini 2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini 2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini @ 2009-10-08 11:30 ` Alessandro Rubini 2009-11-22 22:34 ` Wolfgang Denk 2009-10-08 20:36 ` [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Wolfgang Denk 2009-10-08 21:30 ` Mike Frysinger 4 siblings, 1 reply; 24+ messages in thread From: Alessandro Rubini @ 2009-10-08 11:30 UTC (permalink / raw) To: u-boot From: Alessandro Rubini <rubini@unipv.it> Signed-off-by: Alessandro Rubini <rubini@unipv.it> Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> --- common/lcd.c | 21 --------------------- 1 files changed, 0 insertions(+), 21 deletions(-) diff --git a/common/lcd.c b/common/lcd.c index dc8fea6..4e31618 100644 --- a/common/lcd.c +++ b/common/lcd.c @@ -99,32 +99,11 @@ static int lcd_getfgcolor (void); static void console_scrollup (void) { -#if 1 /* Copy up rows ignoring the first one */ memcpy (CONSOLE_ROW_FIRST, CONSOLE_ROW_SECOND, CONSOLE_SCROLL_SIZE); /* Clear the last one */ memset (CONSOLE_ROW_LAST, COLOR_MASK(lcd_color_bg), CONSOLE_ROW_SIZE); -#else - /* - * Poor attempt to optimize speed by moving "long"s. - * But the code is ugly, and not a bit faster :-( - */ - ulong *t = (ulong *)CONSOLE_ROW_FIRST; - ulong *s = (ulong *)CONSOLE_ROW_SECOND; - ulong l = CONSOLE_SCROLL_SIZE / sizeof(ulong); - uchar c = lcd_color_bg & 0xFF; - ulong val= (c<<24) | (c<<16) | (c<<8) | c; - - while (l--) - *t++ = *s++; - - t = (ulong *)CONSOLE_ROW_LAST; - l = CONSOLE_ROW_SIZE / sizeof(ulong); - - while (l-- > 0) - *t++ = val; -#endif } /*----------------------------------------------------------------------*/ -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it 2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini @ 2009-11-22 22:34 ` Wolfgang Denk 2009-11-24 23:04 ` Anatolij Gustschin 0 siblings, 1 reply; 24+ messages in thread From: Wolfgang Denk @ 2009-11-22 22:34 UTC (permalink / raw) To: u-boot Dear Anatolij, In message <48a15e2db97a6bccf6cc0ebdfae45bc5f946a2f7.1255000877.git.rubini@ unipv.it> Alessandro Rubini wrote: > From: Alessandro Rubini <rubini@unipv.it> > > Signed-off-by: Alessandro Rubini <rubini@unipv.it> > Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> Could you please comment on your position about this patch (and eventually apply it) ? Thanks in advance. Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man." - George Bernard Shaw ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it 2009-11-22 22:34 ` Wolfgang Denk @ 2009-11-24 23:04 ` Anatolij Gustschin 0 siblings, 0 replies; 24+ messages in thread From: Anatolij Gustschin @ 2009-11-24 23:04 UTC (permalink / raw) To: u-boot Dear Wolfgang, Wolfgang Denk <wd@denx.de> wrote: > Dear Anatolij, > > In message <48a15e2db97a6bccf6cc0ebdfae45bc5f946a2f7.1255000877.git.rubini@ unipv.it> Alessandro Rubini wrote: > > From: Alessandro Rubini <rubini@unipv.it> > > > > Signed-off-by: Alessandro Rubini <rubini@unipv.it> > > Acked-by: Andrea Gallo <andrea.gallo@stericsson.com> > > Could you please comment on your position about this patch (and > eventually apply it) ? Thanks in advance. there was a V4 of this patch, you already applied it, so this patch is obsolete. Best regards, Anatolij ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 0/3] make memcpy and memset faster 2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini ` (2 preceding siblings ...) 2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini @ 2009-10-08 20:36 ` Wolfgang Denk 2009-10-08 21:30 ` Mike Frysinger 4 siblings, 0 replies; 24+ messages in thread From: Wolfgang Denk @ 2009-10-08 20:36 UTC (permalink / raw) To: u-boot Dear Alessandro Rubini, In message <cover.1255000877.git.rubini@unipv.it> you wrote: > > Similarly, I'm not interested in "4 bytes at a time, then 1 at a time" > as it's quite a corner case. If such optimizations are really useful, > then we'd better have hand-crafted assembly for each arch, possibly > lifted from glibc. I disagree here, especially as the change is actually trivial to implement and proably results even in smaller code size. Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Pray to God, but keep rowing to shore. - Russian Proverb ^ permalink raw reply [flat|nested] 24+ messages in thread
* [U-Boot] [PATCH V2 0/3] make memcpy and memset faster 2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini ` (3 preceding siblings ...) 2009-10-08 20:36 ` [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Wolfgang Denk @ 2009-10-08 21:30 ` Mike Frysinger 4 siblings, 0 replies; 24+ messages in thread From: Mike Frysinger @ 2009-10-08 21:30 UTC (permalink / raw) To: u-boot On Thursday 08 October 2009 07:29:51 Alessandro Rubini wrote: > Similarly, I'm not interested in "4 bytes at a time, then 1 at a time" > as it's quite a corner case. If such optimizations are really useful, > then we'd better have hand-crafted assembly for each arch, possibly > lifted from glibc. why ? it's trivial to implement with little code impact. have your code run while the len is larger than 4 (sizeof-whatever), then fall through to the loop that runs while the len is larger than 0 instead of immediately returning. -mike -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part. Url : http://lists.denx.de/pipermail/u-boot/attachments/20091008/1c47d9e6/attachment.pgp ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2009-11-24 23:04 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini 2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini 2009-10-08 15:12 ` Peter Tyser 2009-10-08 16:00 ` Alessandro Rubini 2009-10-08 16:30 ` Peter Tyser 2009-10-08 18:23 ` Alessandro Rubini 2009-10-08 19:09 ` Peter Tyser 2009-10-08 19:17 ` Alessandro Rubini 2009-10-08 20:40 ` Wolfgang Denk 2009-10-08 20:47 ` Wolfgang Denk 2009-10-08 19:14 ` Mike Frysinger 2009-10-08 20:44 ` Wolfgang Denk 2009-10-09 4:42 ` Chris Moore 2009-10-09 10:11 ` Mark Jackson 2009-10-09 10:26 ` Mike Frysinger 2009-10-11 7:06 ` Chris Moore 2009-10-09 11:12 ` Wolfgang Denk 2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini 2009-10-08 20:46 ` Wolfgang Denk 2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini 2009-11-22 22:34 ` Wolfgang Denk 2009-11-24 23:04 ` Anatolij Gustschin 2009-10-08 20:36 ` [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Wolfgang Denk 2009-10-08 21:30 ` Mike Frysinger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox