public inbox for u-boot@lists.denx.de
 help / color / mirror / Atom feed
* [U-Boot] [PATCH V2 0/3] make memcpy and memset faster
@ 2009-10-08 11:29 Alessandro Rubini
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 11:29 UTC (permalink / raw)
  To: u-boot

I've added 32-bit lcd to the Nomadik (not submitted yet), and I found
the scroll to be very slow, as the screen is big.

Instead of activating the "if 0" stanza for 32-bit scroll in lcd.c,
I'd better have a faster memcpy/memset globally. So this patch set
adds ulong-wide memcpy and memset, then removes the "#if 0" part in the
scroll function. For me scrolling is 4 times faster on a 32 bit system.

V2: I incorporated most of the comments, but I didn't change the for
loops to help the compiler optimizing it, since nowadays gcc is
already doing the loops his own way irrespective of what i write.

Similarly, I'm not interested in "4 bytes at a time, then 1 at a time"
as it's quite a corner case. If such optimizations are really useful,
then we'd better have hand-crafted assembly for each arch, possibly
lifted from glibc.

Alessandro Rubini (3):
  memcpy: copy one word at a time if possible
  memset: fill one word at a time if possible
  lcd: remove '#if 0' 32-bit scroll, now memcpy does it

 common/lcd.c         |   21 ---------------------
 lib_generic/string.c |   34 +++++++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 26 deletions(-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini
@ 2009-10-08 11:30 ` Alessandro Rubini
  2009-10-08 15:12   ` Peter Tyser
                     ` (2 more replies)
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 11:30 UTC (permalink / raw)
  To: u-boot

From: Alessandro Rubini <rubini@unipv.it>

Signed-off-by: Alessandro Rubini <rubini@unipv.it>
Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
---
 lib_generic/string.c |   17 +++++++++++++----
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/lib_generic/string.c b/lib_generic/string.c
index 181eda6..9911941 100644
--- a/lib_generic/string.c
+++ b/lib_generic/string.c
@@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count)
  * You should not use this function to access IO space, use memcpy_toio()
  * or memcpy_fromio() instead.
  */
-void * memcpy(void * dest,const void *src,size_t count)
+void * memcpy(void *dest, const void *src, size_t count)
 {
-	char *tmp = (char *) dest, *s = (char *) src;
+	char *d8 = (char *)dest, *s8 = (char *)src;
+	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
 
+	/* if all data is aligned (common case), copy a word at a time */
+	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
+		count /= sizeof(unsigned long);
+		while (count--)
+			*dl++ = *sl++;
+		return dest;
+	}
+	/* else, use 1-byte copy */
 	while (count--)
-		*tmp++ = *s++;
+		*d8++ = *s8++;
 
 	return dest;
 }
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 2/3] memset: fill one word at a time if possible
  2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
@ 2009-10-08 11:30 ` Alessandro Rubini
  2009-10-08 20:46   ` Wolfgang Denk
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 11:30 UTC (permalink / raw)
  To: u-boot

From: Alessandro Rubini <rubini@unipv.it>

Signed-off-by: Alessandro Rubini <rubini@unipv.it>
Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
---
 lib_generic/string.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/lib_generic/string.c b/lib_generic/string.c
index 9911941..5f7aff9 100644
--- a/lib_generic/string.c
+++ b/lib_generic/string.c
@@ -404,7 +404,22 @@ char *strswab(const char *s)
 void * memset(void * s,int c,size_t count)
 {
 	char *xs = (char *) s;
-
+	unsigned long *sl = (unsigned long *) s;
+	unsigned long cl = 0;
+	int i;
+
+	/* do it one word at a time (32 bits or 64 bits) if possible */
+	if ( ((count | (int)s) & (sizeof(long) - 1)) == 0) {
+		count /= sizeof(long);
+		for (i=0; i<sizeof(long); ++i) {
+			cl <<= 8;
+			cl |= c & 0xff;
+		}
+		while (count--)
+			*sl++ = cl;
+		return s;
+	}
+	/* else, fill 8 bits@a time */
 	while (count--)
 		*xs++ = c;
 
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it
  2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini
@ 2009-10-08 11:30 ` Alessandro Rubini
  2009-11-22 22:34   ` Wolfgang Denk
  2009-10-08 20:36 ` [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Wolfgang Denk
  2009-10-08 21:30 ` Mike Frysinger
  4 siblings, 1 reply; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 11:30 UTC (permalink / raw)
  To: u-boot

From: Alessandro Rubini <rubini@unipv.it>

Signed-off-by: Alessandro Rubini <rubini@unipv.it>
Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
---
 common/lcd.c |   21 ---------------------
 1 files changed, 0 insertions(+), 21 deletions(-)

diff --git a/common/lcd.c b/common/lcd.c
index dc8fea6..4e31618 100644
--- a/common/lcd.c
+++ b/common/lcd.c
@@ -99,32 +99,11 @@ static int lcd_getfgcolor (void);
 
 static void console_scrollup (void)
 {
-#if 1
 	/* Copy up rows ignoring the first one */
 	memcpy (CONSOLE_ROW_FIRST, CONSOLE_ROW_SECOND, CONSOLE_SCROLL_SIZE);
 
 	/* Clear the last one */
 	memset (CONSOLE_ROW_LAST, COLOR_MASK(lcd_color_bg), CONSOLE_ROW_SIZE);
-#else
-	/*
-	 * Poor attempt to optimize speed by moving "long"s.
-	 * But the code is ugly, and not a bit faster :-(
-	 */
-	ulong *t = (ulong *)CONSOLE_ROW_FIRST;
-	ulong *s = (ulong *)CONSOLE_ROW_SECOND;
-	ulong    l = CONSOLE_SCROLL_SIZE / sizeof(ulong);
-	uchar  c = lcd_color_bg & 0xFF;
-	ulong val= (c<<24) | (c<<16) | (c<<8) | c;
-
-	while (l--)
-		*t++ = *s++;
-
-	t = (ulong *)CONSOLE_ROW_LAST;
-	l = CONSOLE_ROW_SIZE / sizeof(ulong);
-
-	while (l-- > 0)
-		*t++ = val;
-#endif
 }
 
 /*----------------------------------------------------------------------*/
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
@ 2009-10-08 15:12   ` Peter Tyser
  2009-10-08 16:00     ` Alessandro Rubini
  2009-10-08 19:14   ` Mike Frysinger
  2009-10-08 20:44   ` Wolfgang Denk
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Tyser @ 2009-10-08 15:12 UTC (permalink / raw)
  To: u-boot

On Thu, 2009-10-08 at 13:30 +0200, Alessandro Rubini wrote:
> From: Alessandro Rubini <rubini@unipv.it>
> 
> Signed-off-by: Alessandro Rubini <rubini@unipv.it>
> Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
> ---
>  lib_generic/string.c |   17 +++++++++++++----
>  1 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/lib_generic/string.c b/lib_generic/string.c
> index 181eda6..9911941 100644
> --- a/lib_generic/string.c
> +++ b/lib_generic/string.c
> @@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count)
>   * You should not use this function to access IO space, use memcpy_toio()
>   * or memcpy_fromio() instead.
>   */
> -void * memcpy(void * dest,const void *src,size_t count)
> +void * memcpy(void *dest, const void *src, size_t count)
>  {
> -	char *tmp = (char *) dest, *s = (char *) src;
> +	char *d8 = (char *)dest, *s8 = (char *)src;
> +	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
>  
> +	/* if all data is aligned (common case), copy a word at a time */
> +	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
> +		count /= sizeof(unsigned long);
> +		while (count--)
> +			*dl++ = *sl++;
> +		return dest;
> +	}
> +	/* else, use 1-byte copy */
>  	while (count--)
> -		*tmp++ = *s++;
> +		*d8++ = *s8++;
>  
>  	return dest;
>  }

Hi Alessandro,
No interest in the suggestion to not require count to be an exact
multiple of 4/8?  I don't think it would be that hard to update the
logic accordingly and this would let your code be utilized much more
often, especially if/when we run on a 64-bit machine.  Conceptually it
seems like a cleaner implementation too, but that's probably just my
preference:)

Best,
Peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 15:12   ` Peter Tyser
@ 2009-10-08 16:00     ` Alessandro Rubini
  2009-10-08 16:30       ` Peter Tyser
  2009-10-08 20:47       ` Wolfgang Denk
  0 siblings, 2 replies; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 16:00 UTC (permalink / raw)
  To: u-boot

> No interest in the suggestion to not require count to be an exact
> multiple of 4/8?

Actually, I wrote about that in my patch 0/3.

> I don't think it would be that hard to update the logic accordingly
> and this would let your code be utilized much more often, especially
> if/when we run on a 64-bit machine.

That's true, but I think the most important case is lcd scrolling,
where it's usually a big power of two -- that's where we had the #ifdef,
so the problem was known, I suppose.

Currently, I don't even know if this is going to be picked up, so I
don't want to go too far -- and in that case I would like to measure
things and be able to test for stupid bugs, so it takes time.
Scrolling the video is an easy test for memcpy.

If there's interest, there's memmov to fix; it's used pretty often and
it's the same idea as memcpy (well, scrolling should use memmove,
in theory).

/alessandro

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 16:00     ` Alessandro Rubini
@ 2009-10-08 16:30       ` Peter Tyser
  2009-10-08 18:23         ` Alessandro Rubini
  2009-10-08 20:47       ` Wolfgang Denk
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Tyser @ 2009-10-08 16:30 UTC (permalink / raw)
  To: u-boot

On Thu, 2009-10-08 at 18:00 +0200, Alessandro Rubini wrote:
> > No interest in the suggestion to not require count to be an exact
> > multiple of 4/8?
> 
> Actually, I wrote about that in my patch 0/3.

Sorry, I should have read more thoroughly.

> > I don't think it would be that hard to update the logic accordingly
> > and this would let your code be utilized much more often, especially
> > if/when we run on a 64-bit machine.
> 
> That's true, but I think the most important case is lcd scrolling,
> where it's usually a big power of two -- that's where we had the #ifdef,
> so the problem was known, I suppose.

I think the most important case for *you* is lcd scrolling, but for 99%
of everyone else, it isn't at all:)  memcpy() and memset() are used 100
times more often in non-lcd related code and most boards don't even have
LCDs.  In my opinion, we shouldn't be fixing/bloating common code for 1
outlying scenario, we should be trying to improve it for common cases.

> Currently, I don't even know if this is going to be picked up, so I
> don't want to go too far -- and in that case I would like to measure
> things and be able to test for stupid bugs, so it takes time.

Sure, I understand where you're coming from.  FWIW, if people don't want
to pick this up as it affects lots of people, you could always add an
architecture-specific memcpy/memset implementation.

> If there's interest, there's memmov to fix; it's used pretty often and
> it's the same idea as memcpy (well, scrolling should use memmove,
> in theory).

I'll quit pestering, just wanted to put my $.02 in:)

Best,
Peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 16:30       ` Peter Tyser
@ 2009-10-08 18:23         ` Alessandro Rubini
  2009-10-08 19:09           ` Peter Tyser
  0 siblings, 1 reply; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 18:23 UTC (permalink / raw)
  To: u-boot

>> That's true, but I think the most important case is lcd scrolling,
>> where it's usually a big power of two -- that's where we had the #ifdef,
>> so the problem was known, I suppose.
> 
> I think the most important case for *you* is lcd scrolling, but for 99%
> of everyone else, it isn't at all:)

Well, its a big memcpy, and it has direct effect on the user. Every
other copy is smaller, or has no interactive value. 

> memcpy() and memset() are used 100 times more often in non-lcd
> related code and most boards don't even have LCDs.

That's true. But it's only a boot loader (I just looked at what Nicolas
Pitre did in the kernel for ARM strcpy and, well....).

So I made some measures (it's one of Pike's rules of programming:

     * Rule 2. Measure. Don't tune for speed until you've measured, and even
       then don't unless one part of the code overwhelms the rest.

)

I booted in u-boot, typed "setenv stdout serial" then "boot", which goes
over the ethernet. Stopped the system after u-boot gave over control to
the kernel. Result: 10412 memcopies so divided (number, length): 

   3941 4
   1583 6
    772 20
      1 46
      1 47
      3 60
   1024 64
      1 815
      1 888
    770 1148
   1543 1480
      1 2283
      1 3836
    770 4096

So I dare say non-power-of-4 is a minority anyways: 1587 calls, 12689 bytes.
i.e. 15.2% of the calls and 0.2% of the data.

Data collected in memory with patch below, used with following line:

od -An -t d4 logfile | awk '{print $4}' | sort -n | uniq -c

diff --git a/include/configs/nhk8815.h b/include/configs/nhk8815.h
index edd698e..a390f28 100644
--- a/include/configs/nhk8815.h
+++ b/include/configs/nhk8815.h
@@ -28,6 +28,8 @@
 
 #include <nomadik.h>
 
+#define CONFIG_MCLOGSIZE (16*1024)
+
 #define CONFIG_ARM926EJS
 #define CONFIG_NOMADIK
 #define CONFIG_NOMADIK_8815	/* cpu variant */
diff --git a/lib_generic/string.c b/lib_generic/string.c
index 5f7aff9..5afa11e 100644
--- a/lib_generic/string.c
+++ b/lib_generic/string.c
@@ -19,6 +19,7 @@
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <malloc.h>
+#include <common.h>
 
 
 #if 0 /* not used - was: #ifndef __HAVE_ARCH_STRNICMP */
@@ -461,11 +462,29 @@ char * bcopy(const char * src, char * dest, int count)
  * You should not use this function to access IO space, use memcpy_toio()
  * or memcpy_fromio() instead.
  */
+
+#ifndef CONFIG_MCLOGSIZE /* if you want to log the memcpy calls, define it */
+#define CONFIG_MCLOGSIZE 0
+#endif
+struct mclog {int idx; void *dst; const void *src; int cnt;};
+static struct mclog mclog[CONFIG_MCLOGSIZE];
+
 void * memcpy(void *dest, const void *src, size_t count)
 {
 	char *d8 = (char *)dest, *s8 = (char *)src;
 	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
 
+	if (CONFIG_MCLOGSIZE) {
+		static int idx;
+		struct mclog *p = mclog + (idx % (CONFIG_MCLOGSIZE ?: 1));
+		if (!idx) printf("memcpy log at %p, size 0x%x\n",
+				 mclog, sizeof(mclog));
+		p->idx = idx++;
+		p->dst = dest;
+		p->src = src;
+		p->cnt = count;
+	}
+
 	/* if all data is aligned (common case), copy a word at a time */
 	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
 		count /= sizeof(unsigned long);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 18:23         ` Alessandro Rubini
@ 2009-10-08 19:09           ` Peter Tyser
  2009-10-08 19:17             ` Alessandro Rubini
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Tyser @ 2009-10-08 19:09 UTC (permalink / raw)
  To: u-boot

On Thu, 2009-10-08 at 20:23 +0200, Alessandro Rubini wrote:
> >> That's true, but I think the most important case is lcd scrolling,
> >> where it's usually a big power of two -- that's where we had the #ifdef,
> >> so the problem was known, I suppose.
> > 
> > I think the most important case for *you* is lcd scrolling, but for 99%
> > of everyone else, it isn't at all:)
> 
> Well, its a big memcpy, and it has direct effect on the user. Every
> other copy is smaller, or has no interactive value. 
> 
> > memcpy() and memset() are used 100 times more often in non-lcd
> > related code and most boards don't even have LCDs.
> 
> That's true. But it's only a boot loader (I just looked at what Nicolas
> Pitre did in the kernel for ARM strcpy and, well....).
> 
> So I made some measures (it's one of Pike's rules of programming:
> 
>      * Rule 2. Measure. Don't tune for speed until you've measured, and even
>        then don't unless one part of the code overwhelms the rest.
> 
> )
> 
> I booted in u-boot, typed "setenv stdout serial" then "boot", which goes
> over the ethernet. Stopped the system after u-boot gave over control to
> the kernel. Result: 10412 memcopies so divided (number, length): 
> 
>    3941 4
>    1583 6
>     772 20
>       1 46
>       1 47
>       3 60
>    1024 64
>       1 815
>       1 888
>     770 1148
>    1543 1480
>       1 2283
>       1 3836
>     770 4096
> 
> So I dare say non-power-of-4 is a minority anyways: 1587 calls, 12689 bytes.
> i.e. 15.2% of the calls and 0.2% of the data.

The statistics are going to be very different for different scenarios.
For example, network operations seem to be the majority of your large
memcpys, this isn't the case for everyone.  If/when U-Boot runs on
64-bit hardware, the stats will change too.

In any case, my only suggestion would be that if we're improving
memcpy()/memset(), do the extra 10% of effort required to make them a
little better.  That 10% of effort will improve 15.2% of all memcpy()
calls for the foreseeable future:)

I honestly don't care much what implementation you choose as I only
currently use PPC, which has their own memcpy()/memset().  I'm only
trying to be helpful, feel free to proceed however you see fit, I
promise I won't comment on future patches:)

Best,
Peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
  2009-10-08 15:12   ` Peter Tyser
@ 2009-10-08 19:14   ` Mike Frysinger
  2009-10-08 20:44   ` Wolfgang Denk
  2 siblings, 0 replies; 24+ messages in thread
From: Mike Frysinger @ 2009-10-08 19:14 UTC (permalink / raw)
  To: u-boot

On Thursday 08 October 2009 07:30:02 Alessandro Rubini wrote:
> +	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {

when i talked about changing the int cast, i was talking about the pointers.  
pointers should always be cast to (unsigned long).
-mike
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://lists.denx.de/pipermail/u-boot/attachments/20091008/23acd1cb/attachment.pgp 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 19:09           ` Peter Tyser
@ 2009-10-08 19:17             ` Alessandro Rubini
  2009-10-08 20:40               ` Wolfgang Denk
  0 siblings, 1 reply; 24+ messages in thread
From: Alessandro Rubini @ 2009-10-08 19:17 UTC (permalink / raw)
  To: u-boot

> The statistics are going to be very different for different scenarios.

Yes, I know.

> For example, network operations seem to be the majority of your large
> memcpys, this isn't the case for everyone.

True. I noticed it after sending -- although I expected it.

> In any case, my only suggestion would be that if we're improving
> memcpy()/memset(), do the extra 10% of effort required to make them a
> little better.  That 10% of effort will improve 15.2% of all memcpy()
> calls for the foreseeable future:)

It mainly depends on Wolfgang but, hey, it's not 10% of effort.

> I promise I won't comment on future patches:)

No problem at all. And I apologize if my tone looked rude, it wasn't
meant to.  Thank you for your comments.

/alessandro, who didn't notice ppc has an asm implementation of its own

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 0/3] make memcpy and memset faster
  2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini
                   ` (2 preceding siblings ...)
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini
@ 2009-10-08 20:36 ` Wolfgang Denk
  2009-10-08 21:30 ` Mike Frysinger
  4 siblings, 0 replies; 24+ messages in thread
From: Wolfgang Denk @ 2009-10-08 20:36 UTC (permalink / raw)
  To: u-boot

Dear Alessandro Rubini,

In message <cover.1255000877.git.rubini@unipv.it> you wrote:
> 
> Similarly, I'm not interested in "4 bytes at a time, then 1 at a time"
> as it's quite a corner case. If such optimizations are really useful,
> then we'd better have hand-crafted assembly for each arch, possibly
> lifted from glibc.

I disagree here, especially as the change is actually trivial to
implement and proably results even in smaller code size.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
Pray to God, but keep rowing to shore. - Russian Proverb

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 19:17             ` Alessandro Rubini
@ 2009-10-08 20:40               ` Wolfgang Denk
  0 siblings, 0 replies; 24+ messages in thread
From: Wolfgang Denk @ 2009-10-08 20:40 UTC (permalink / raw)
  To: u-boot

Dear Alessandro Rubini,

In message <20091008191734.GA13161@mail.gnudd.com> you wrote:
> 
> > In any case, my only suggestion would be that if we're improving
> > memcpy()/memset(), do the extra 10% of effort required to make them a
> > little better.  That 10% of effort will improve 15.2% of all memcpy()
> > calls for the foreseeable future:)
> 
> It mainly depends on Wolfgang but, hey, it's not 10% of effort.

It's less, right ? :-)


I think now we have this part of code in focus, we should to the Right
Thing :-)

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
Men will always be men -- no matter where they are.
	-- Harry Mudd, "Mudd's Women", stardate 1329.8

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
  2009-10-08 15:12   ` Peter Tyser
  2009-10-08 19:14   ` Mike Frysinger
@ 2009-10-08 20:44   ` Wolfgang Denk
  2009-10-09  4:42     ` Chris Moore
  2 siblings, 1 reply; 24+ messages in thread
From: Wolfgang Denk @ 2009-10-08 20:44 UTC (permalink / raw)
  To: u-boot

Dear Alessandro Rubini,

In message <45d5e3a574bf4844f46f50b2c88054a5b28f973b.1255000877.git.rubini@ unipv.it> you wrote:
> From: Alessandro Rubini <rubini@unipv.it>
> 
> Signed-off-by: Alessandro Rubini <rubini@unipv.it>
> Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
> ---
>  lib_generic/string.c |   17 +++++++++++++----
>  1 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/lib_generic/string.c b/lib_generic/string.c
> index 181eda6..9911941 100644
> --- a/lib_generic/string.c
> +++ b/lib_generic/string.c
> @@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count)
>   * You should not use this function to access IO space, use memcpy_toio()
>   * or memcpy_fromio() instead.
>   */
> -void * memcpy(void * dest,const void *src,size_t count)
> +void * memcpy(void *dest, const void *src, size_t count)
>  {
> -	char *tmp = (char *) dest, *s = (char *) src;
> +	char *d8 = (char *)dest, *s8 = (char *)src;
> +	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
>  
> +	/* if all data is aligned (common case), copy a word at a time */
> +	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
> +		count /= sizeof(unsigned long);
> +		while (count--)
> +			*dl++ = *sl++;
> +		return dest;
> +	}
> +	/* else, use 1-byte copy */
>  	while (count--)
> -		*tmp++ = *s++;
> +		*d8++ = *s8++;
>  
>  	return dest;

I think we should change this if-else into a plain if, something like
that:

void * memcpy(void *dest, const void *src, size_t count)
{
	char *tmp = (char *) dest, *s = (char *) src;
	char *d8 = (char *)dest, *s8 = (char *)src;
	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;

	/* while all data is aligned (common case), copy a word at a time */
	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
		while (count) {
			*dl++ = *sl++;
			count -= sizeof(unsigned long);
		}
	}
	while (count--)
		*d8++ = *s8++;

	return dest;
}

This way we can have both - the "long" copy of a potential aligne
dfirst part, and the byte copy of any trailing (or unaligned) part.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
Computers make excellent and efficient servants, but I have  no  wish
to  serve under them. Captain, a starship also runs on loyalty to one
man. And nothing can replace it or him.
	-- Spock, "The Ultimate Computer", stardate 4729.4

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 2/3] memset: fill one word at a time if possible
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini
@ 2009-10-08 20:46   ` Wolfgang Denk
  0 siblings, 0 replies; 24+ messages in thread
From: Wolfgang Denk @ 2009-10-08 20:46 UTC (permalink / raw)
  To: u-boot

Dear Alessandro Rubini,

In message <f6eaf0a025bdc32019286a66bf10cc2184f00995.1255000877.git.rubini@ unipv.it> you wrote:
...
>  void * memset(void * s,int c,size_t count)
>  {
>  	char *xs = (char *) s;
> -
> +	unsigned long *sl = (unsigned long *) s;
> +	unsigned long cl = 0;
> +	int i;

Same changes as suggested for the memcpy() patch go here:

> +	/* do it one word at a time (32 bits or 64 bits) if possible */
> +	if ( ((count | (int)s) & (sizeof(long) - 1)) == 0) {
> +		count /= sizeof(long);

delete this;

> +		for (i=0; i<sizeof(long); ++i) {
> +			cl <<= 8;
> +			cl |= c & 0xff;
> +		}
> +		while (count--)
> +			*sl++ = cl;

and make this

		while (count) {
			*sl++ = cl;
			count -= sizeof(long);
		}

> +		return s;

delete this, too.

> +	}
> +	/* else, fill 8 bits at a time */

Delete the comment, too.

>  	while (count--)
>  		*xs++ = c;

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
"One lawyer can steal more than a hundred men with guns."
- The Godfather

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 16:00     ` Alessandro Rubini
  2009-10-08 16:30       ` Peter Tyser
@ 2009-10-08 20:47       ` Wolfgang Denk
  1 sibling, 0 replies; 24+ messages in thread
From: Wolfgang Denk @ 2009-10-08 20:47 UTC (permalink / raw)
  To: u-boot

Dear Alessandro Rubini,

In message <20091008160026.GA9077@mail.gnudd.com> you wrote:
> > No interest in the suggestion to not require count to be an exact
> > multiple of 4/8?
> 
> Actually, I wrote about that in my patch 0/3.

Hm. Your argument was not exctly convincing, though.

> Currently, I don't even know if this is going to be picked up, so I
> don't want to go too far -- and in that case I would like to measure
> things and be able to test for stupid bugs, so it takes time.
> Scrolling the video is an easy test for memcpy.

Sure this is going to ber picked uyp. And actually the implementation
can be made even simpler doing both.

> If there's interest, there's memmov to fix; it's used pretty often and
> it's the same idea as memcpy (well, scrolling should use memmove,
> in theory).

Volunteers welcome :-)

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
"Where humor is concerned there are no standards -- no  one  can  say
what is good or bad, although you can be sure that everyone will.
- John Kenneth Galbraith

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 0/3] make memcpy and memset faster
  2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini
                   ` (3 preceding siblings ...)
  2009-10-08 20:36 ` [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Wolfgang Denk
@ 2009-10-08 21:30 ` Mike Frysinger
  4 siblings, 0 replies; 24+ messages in thread
From: Mike Frysinger @ 2009-10-08 21:30 UTC (permalink / raw)
  To: u-boot

On Thursday 08 October 2009 07:29:51 Alessandro Rubini wrote:
> Similarly, I'm not interested in "4 bytes at a time, then 1 at a time"
> as it's quite a corner case. If such optimizations are really useful,
> then we'd better have hand-crafted assembly for each arch, possibly
> lifted from glibc.

why ?  it's trivial to implement with little code impact.  have your code run 
while the len is larger than 4 (sizeof-whatever), then fall through to the 
loop that runs while the len is larger than 0 instead of immediately 
returning.
-mike
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://lists.denx.de/pipermail/u-boot/attachments/20091008/1c47d9e6/attachment.pgp 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-08 20:44   ` Wolfgang Denk
@ 2009-10-09  4:42     ` Chris Moore
  2009-10-09 10:11       ` Mark Jackson
  2009-10-09 11:12       ` Wolfgang Denk
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Moore @ 2009-10-09  4:42 UTC (permalink / raw)
  To: u-boot

Wolfgang Denk a ?crit :
> I think we should change this if-else into a plain if, something like
> that:
>
> void * memcpy(void *dest, const void *src, size_t count)
> {
> 	char *tmp = (char *) dest, *s = (char *) src;
> 	char *d8 = (char *)dest, *s8 = (char *)src;
> 	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
>
> 	/* while all data is aligned (common case), copy a word at a time */
> 	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
> 		while (count) {
> 			*dl++ = *sl++;
> 			count -= sizeof(unsigned long);
> 		}
> 	}
> 	while (count--)
> 		*d8++ = *s8++;
>
> 	return dest;
> }
>
> This way we can have both - the "long" copy of a potential aligne
> dfirst part, and the byte copy of any trailing (or unaligned) part.
>
>   

I agree wholeheartedly with the idea but shouldn't it be more like this 
(untested) code :

void * memcpy(void *dest, const void *src, size_t count)

{
	char *d8, *s8;
	unsigned long *dl = dest, *sl = src;

	/* while all data is aligned (common case), copy multiple bytes at a time */
	if ( (((int)(long)dest | (int)(long)src) & (sizeof(*dl) - 1)) == 0) {
		while (count >= sizeof(*dl)) {
			*dl++ = *sl++;
			count -= sizeof(*dl);
		}
	}

	d8 = (char *)dl;
	s8 = (char *)sl;

	/* copy any remaining data byte by byte */
	while (count--)
		*d8++ = *s8++;

	return dest;
}


Remarks :
1) My curious (int) (long) pointer casts are intended to avoid compiler 
warnings while avoiding unnecessary calculations in long.
On some architectures long calculations are less efficient than int ones.
In fact I wonder whether, on such architectures, it might not also be 
better to perform the copy with int size chunks.
2) Personally I prefer sizeof(*dl) to sizeof(unsigned long) as there is 
less risk of error if the type of the chunks is changed.
3) In C (but not in C++) I think the casts from void * to unsigned long 
* are unnecessary.

But as I said all this is completely untested :(

Cheers,
Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-09  4:42     ` Chris Moore
@ 2009-10-09 10:11       ` Mark Jackson
  2009-10-09 10:26         ` Mike Frysinger
  2009-10-09 11:12       ` Wolfgang Denk
  1 sibling, 1 reply; 24+ messages in thread
From: Mark Jackson @ 2009-10-09 10:11 UTC (permalink / raw)
  To: u-boot

Chris Moore wrote:
> I agree wholeheartedly with the idea but shouldn't it be more like this 
> (untested) code :
> 
> void * memcpy(void *dest, const void *src, size_t count)
> 
> {
> 	char *d8, *s8;
> 	unsigned long *dl = dest, *sl = src;
> 

In here, would it be overkill to add byte copying until data is aligned, and
then fall into the aligned copy code.

In that case, you'd still gain a speed increase if you're starting at an unaligned
address ?

> 	/* while all data is aligned (common case), copy multiple bytes at a time */
> 	if ( (((int)(long)dest | (int)(long)src) & (sizeof(*dl) - 1)) == 0) {
> 		while (count >= sizeof(*dl)) {
> 			*dl++ = *sl++;
> 			count -= sizeof(*dl);
> 		}
> 	}
> 
> 	d8 = (char *)dl;
> 	s8 = (char *)sl;
> 
> 	/* copy any remaining data byte by byte */
> 	while (count--)
> 		*d8++ = *s8++;
> 
> 	return dest;
> }

Regards
Mark

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-09 10:11       ` Mark Jackson
@ 2009-10-09 10:26         ` Mike Frysinger
  2009-10-11  7:06           ` Chris Moore
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Frysinger @ 2009-10-09 10:26 UTC (permalink / raw)
  To: u-boot

On Friday 09 October 2009 06:11:16 Mark Jackson wrote:
> Chris Moore wrote:
> > I agree wholeheartedly with the idea but shouldn't it be more like this
> > (untested) code :
> >
> > void * memcpy(void *dest, const void *src, size_t count)
> >
> > {
> > 	char *d8, *s8;
> > 	unsigned long *dl = dest, *sl = src;
> 
> In here, would it be overkill to add byte copying until data is aligned,
>  and then fall into the aligned copy code.

both addresses have to be unaligned the same ...

if ((ulong)dl & (sizeof(*dl) - 1) == (ulong)sl & (sizeof(*sl) - 1))

> In that case, you'd still gain a speed increase if you're starting at an
>  unaligned address ?

now it's a question of how often does this come up and is it worth the code 
size increase ?
-mike
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://lists.denx.de/pipermail/u-boot/attachments/20091009/fe1b4c85/attachment.pgp 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-09  4:42     ` Chris Moore
  2009-10-09 10:11       ` Mark Jackson
@ 2009-10-09 11:12       ` Wolfgang Denk
  1 sibling, 0 replies; 24+ messages in thread
From: Wolfgang Denk @ 2009-10-09 11:12 UTC (permalink / raw)
  To: u-boot

Dear Chris Moore,

In message <4ACEBF19.4010902@free.fr> you wrote:
>
> I agree wholeheartedly with the idea but shouldn't it be more like this 
> (untested) code :

Thanks for catching this.

Of course you are right - my code was untested as well, as usual ;-)

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
A UNIX saleslady, Lenore,
Enjoys work, but she likes the beach more.
        She found a good way
        To combine work and play:
She sells C shells by the seashore.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible
  2009-10-09 10:26         ` Mike Frysinger
@ 2009-10-11  7:06           ` Chris Moore
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Moore @ 2009-10-11  7:06 UTC (permalink / raw)
  To: u-boot

Mike Frysinger a ?crit :
> On Friday 09 October 2009 06:11:16 Mark Jackson wrote:
>   
>> Chris Moore wrote:
>>     
>>> I agree wholeheartedly with the idea but shouldn't it be more like this
>>> (untested) code :
>>>
>>> void * memcpy(void *dest, const void *src, size_t count)
>>>
>>> {
>>> 	char *d8, *s8;
>>> 	unsigned long *dl = dest, *sl = src;
>>>       
>> In here, would it be overkill to add byte copying until data is aligned,
>>  and then fall into the aligned copy code.
>>     
>
> both addresses have to be unaligned the same ...
>
> if ((ulong)dl & (sizeof(*dl) - 1) == (ulong)sl & (sizeof(*sl) - 1))
>
>   
>> In that case, you'd still gain a speed increase if you're starting at an
>>  unaligned address ?
>>     
>
> now it's a question of how often does this come up and is it worth the code 
> size increase ?
>   

Like Mike I'm not sure if it is worth it for memcpy.
However it may be for memset where only the destination has to be aligned.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it
  2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini
@ 2009-11-22 22:34   ` Wolfgang Denk
  2009-11-24 23:04     ` Anatolij Gustschin
  0 siblings, 1 reply; 24+ messages in thread
From: Wolfgang Denk @ 2009-11-22 22:34 UTC (permalink / raw)
  To: u-boot

Dear Anatolij,

In message <48a15e2db97a6bccf6cc0ebdfae45bc5f946a2f7.1255000877.git.rubini@ unipv.it> Alessandro Rubini wrote:
> From: Alessandro Rubini <rubini@unipv.it>
> 
> Signed-off-by: Alessandro Rubini <rubini@unipv.it>
> Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>

Could you please comment on your position about this patch (and
eventually apply it) ?  Thanks in advance.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
The reasonable man adapts himself to the world; the unreasonable  one
persists  in  trying  to  adapt  the  world to himself. Therefore all
progress depends on the unreasonable man."      - George Bernard Shaw

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it
  2009-11-22 22:34   ` Wolfgang Denk
@ 2009-11-24 23:04     ` Anatolij Gustschin
  0 siblings, 0 replies; 24+ messages in thread
From: Anatolij Gustschin @ 2009-11-24 23:04 UTC (permalink / raw)
  To: u-boot

Dear Wolfgang,

Wolfgang Denk <wd@denx.de> wrote:

> Dear Anatolij,
> 
> In message <48a15e2db97a6bccf6cc0ebdfae45bc5f946a2f7.1255000877.git.rubini@ unipv.it> Alessandro Rubini wrote:
> > From: Alessandro Rubini <rubini@unipv.it>
> > 
> > Signed-off-by: Alessandro Rubini <rubini@unipv.it>
> > Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
> 
> Could you please comment on your position about this patch (and
> eventually apply it) ?  Thanks in advance.

there was a V4 of this patch, you already applied it, so this
patch is obsolete.
 
Best regards,
Anatolij

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2009-11-24 23:04 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-08 11:29 [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Alessandro Rubini
2009-10-08 11:30 ` [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible Alessandro Rubini
2009-10-08 15:12   ` Peter Tyser
2009-10-08 16:00     ` Alessandro Rubini
2009-10-08 16:30       ` Peter Tyser
2009-10-08 18:23         ` Alessandro Rubini
2009-10-08 19:09           ` Peter Tyser
2009-10-08 19:17             ` Alessandro Rubini
2009-10-08 20:40               ` Wolfgang Denk
2009-10-08 20:47       ` Wolfgang Denk
2009-10-08 19:14   ` Mike Frysinger
2009-10-08 20:44   ` Wolfgang Denk
2009-10-09  4:42     ` Chris Moore
2009-10-09 10:11       ` Mark Jackson
2009-10-09 10:26         ` Mike Frysinger
2009-10-11  7:06           ` Chris Moore
2009-10-09 11:12       ` Wolfgang Denk
2009-10-08 11:30 ` [U-Boot] [PATCH V2 2/3] memset: fill " Alessandro Rubini
2009-10-08 20:46   ` Wolfgang Denk
2009-10-08 11:30 ` [U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it Alessandro Rubini
2009-11-22 22:34   ` Wolfgang Denk
2009-11-24 23:04     ` Anatolij Gustschin
2009-10-08 20:36 ` [U-Boot] [PATCH V2 0/3] make memcpy and memset faster Wolfgang Denk
2009-10-08 21:30 ` Mike Frysinger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox