* [OpenRISC] [RFC PATCH 1/1] openrisc: Add optimized memcpy routine
2016-03-21 21:36 ` [OpenRISC] [RFC PATCH 1/1] openrisc: Add optimized " Stafford Horne
@ 2016-03-21 22:27 ` Jeremy Bennett
0 siblings, 0 replies; 3+ messages in thread
From: Jeremy Bennett @ 2016-03-21 22:27 UTC (permalink / raw)
To: openrisc
Nice work.
It's a while since I've looked at this, but you might like to consider
whether the compiler is also aggressively optimizing memcpy in all
circumstances. The compiler should generate __builtin_memcpy for things
like structure assignment, and then lower that to suitably efficient
inline code in most circumstances.
Best wishes,
Jeremy
On 21/03/16 22:36, Stafford Horne wrote:
> The default memcpy routine provided in lib does only byte copies.
> Using word copies we can lower boot time and the cycles spent in
> memcpy quite significantly.
>
> Booting on my de0 nano I see boot times go from 7.2 to 5.6 seconds.
> The avg cycles in memcpy during boot go from 6467 to 1887.
>
> This commit contains an option menu for people to see what I tried
> but in the end we should only leave the implementation we want to
> keep.
> The implementations I tested and avg cycles:
> - Word Copies + Loop Unrolls + Non Aligned 1882
> - Word Copies + Loop Unrolls 1887
> - Word Copies 2441
> - Byte Copies + Loop Unrolls 6467
> - Byte Copies 7600
>
> I would suggest going with the Word Copies + Loop Unrolls one as it
> provides the best tradeoff between simplicity and boot speedup.
>
> Signed-off-by: Stafford Horne <shorne@gmail.com>
> ---
> arch/openrisc/Kconfig | 61 ++++++
> arch/openrisc/TODO.openrisc | 1 -
> arch/openrisc/include/asm/string.h | 5 +
> arch/openrisc/lib/Makefile | 3 +-
> arch/openrisc/lib/memcpy.c | 377 +++++++++++++++++++++++++++++++++++++
> 5 files changed, 445 insertions(+), 2 deletions(-)
> create mode 100644 arch/openrisc/lib/memcpy.c
>
> diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
> index 6e88268..68c0588 100644
> --- a/arch/openrisc/Kconfig
> +++ b/arch/openrisc/Kconfig
> @@ -115,6 +115,67 @@ config OPENRISC_HAVE_INST_LWA_SWA
>
> endmenu
>
> +menu "Optimized Lib"
> +
> +config OPT_LIB_FUNCTION
> + bool "Enable optimized lib functions"
> + default y
> + help
> + Turns on optimized library functions (memcpy and memset).
> + They are optimized to use word memory operations instead of
> + the default byte operations.
> +
> +choice
> + prompt "Optimized lib Implementation"
> + default OPT_LIB_WORD_NONALIGNED
> + depends on OPT_LIB_FUNCTION
> +
> +config OPT_LIB_WORD_NONALIGNED
> + bool "Non-Aligned Word Operations"
> + help
> + This implementation performs word operations with loop unrolling.
> + It supports word operations on non-aligned memory, but at the cost
> + of shift and OR operations to fix up alignment.
> +
> + This should be the fastest implementation.
> +
> +config OPT_LIB_WORD_UNROLL
> + bool "Unrolled Loop Word Operations"
> + help
> + This implementation performs word operations and loop unrolls.
> + However, if memory being operated on is not word aligned it will
> + fall back to using byte operations.
> +
> + This may be as fast as the non-aligned implementation if the shift
> + and OR operations needed to fix up alignment are slower than the
> + 4 byte memory operations.
> +
> + This should be the 2nd fastest implementation.
> +
> +config OPT_LIB_WORD
> + bool "Simple Word Operations"
> + help
> + This implementation performs word operations if data is word aligned
> + then falls back to byte operations. It does not do loop unrolling.
> +
> + This should be the 3rd fastest implementation.
> +
> +config OPT_LIB_BYTE_UNROLL
> + bool "Unrolled Loop Byte Operations"
> + help
> + This implementation performs byte operations with loop unrolling.
> +
> + This should be the 4th fastest implementation.
> +
> +config OPT_LIB_BYTE
> + bool "Simple Byte Operations"
> + help
> + Simple byte operations. No frills but should not have any problem
> + working on any architecture.
> +
> +endchoice
> +
> +endmenu
> +
> config NR_CPUS
> int "Maximum number of CPUs (2-32)"
> range 2 32
> diff --git a/arch/openrisc/TODO.openrisc b/arch/openrisc/TODO.openrisc
> index acfeef9..a2bda7b 100644
> --- a/arch/openrisc/TODO.openrisc
> +++ b/arch/openrisc/TODO.openrisc
> @@ -13,4 +13,3 @@ that are due for investigation shortly, i.e. our TODO list:
> or1k and this change is slowly trickling through the stack. For the time
> being, or32 is equivalent to or1k.
>
> --- Implement optimized version of memcpy and memset
> diff --git a/arch/openrisc/include/asm/string.h b/arch/openrisc/include/asm/string.h
> index 33470d4..04111b2 100644
> --- a/arch/openrisc/include/asm/string.h
> +++ b/arch/openrisc/include/asm/string.h
> @@ -1,7 +1,12 @@
> #ifndef __ASM_OPENRISC_STRING_H
> #define __ASM_OPENRISC_STRING_H
>
> +#ifdef CONFIG_OPT_LIB_FUNCTION
> #define __HAVE_ARCH_MEMSET
> extern void *memset(void *s, int c, __kernel_size_t n);
>
> +#define __HAVE_ARCH_MEMCPY
> +extern void *memcpy(void *dest, __const void *src, __kernel_size_t n);
> +#endif
> +
> #endif /* __ASM_OPENRISC_STRING_H */
> diff --git a/arch/openrisc/lib/Makefile b/arch/openrisc/lib/Makefile
> index 67c583e..c3316f6 100644
> --- a/arch/openrisc/lib/Makefile
> +++ b/arch/openrisc/lib/Makefile
> @@ -2,4 +2,5 @@
> # Makefile for or32 specific library files..
> #
>
> -obj-y = memset.o string.o delay.o
> +obj-y := delay.o string.o
> +obj-$(CONFIG_OPT_LIB_FUNCTION) += memset.o memcpy.o
> diff --git a/arch/openrisc/lib/memcpy.c b/arch/openrisc/lib/memcpy.c
> new file mode 100644
> index 0000000..36a7aac
> --- /dev/null
> +++ b/arch/openrisc/lib/memcpy.c
> @@ -0,0 +1,377 @@
> +/*
> + * arch/openrisc/lib/memcpy.c
> + *
> + * Optimized memory copy routines for openrisc. These are mostly copied
> + * from other sources but slightly extended based on ideas discussed in
> + * #openrisc.
> + *
> + * The non-aligned word implementation is based on the microblaze
> + * version found in:
> + * arch/microblaze/lib/memcpy.c
> + * but this is extended to have loop unrolls. This only supports
> + * big endian at the moment.
> + *
> + * The byte unroll implementation is a copy of that found in:
> + * arm/boot/compressed/string.c
> + *
> + * The word unroll implementation is an extension of the byte
> + * unrolled implementation, but using word copies (if things are
> + * properly aligned).
> + */
> +
> +#ifndef _MC_TEST
> +#include <linux/export.h>
> +
> +#include <linux/string.h>
> +#endif
> +
> +#if defined(CONFIG_OPT_LIB_WORD_NONALIGNED)
> +/*
> + * Helper to make the copy loops below more manageable: load the next
> + * aligned word, emit it combined with the bytes held over from the
> + * previous word, and keep the leftover bytes for the next round.
> + */
> +#define __OFFSET_MEMCPY(n) value = *src_w++; \
> + *dest_w++ = buf_hold | value >> ( ( 4 - n ) * 8 ); \
> + buf_hold = value << ( n * 8 )
> +
> +void *memcpy(void *dest, const void *src, __kernel_size_t n)
> +{
> + const char *src_b = src;
> + char *dest_b = dest;
> + int i;
> +
> + /* The following code tries to optimize the copy by using word
> + * alignment. This will work fine if both source and destination are
> + * aligned on the same boundary. However, if they are aligned on
> + * different boundaries shifts will be necessary.
> + */
> + const uint32_t *src_w;
> + uint32_t *dest_w;
> +
> + if (likely(n >= 4)) {
> + unsigned value, buf_hold;
> +
> + /* Align the destination to a word boundary. */
> + /* This is done in an endian independent manner. */
> + switch ((unsigned long)dest_b & 3) {
> + case 1:
> + *dest_b++ = *src_b++;
> + --n;
> + case 2:
> + *dest_b++ = *src_b++;
> + --n;
> + case 3:
> + *dest_b++ = *src_b++;
> + --n;
> + }
> +
> + dest_w = (void *)dest_b;
> +
> + /* Choose a copy scheme based on the source, this is done big endian */
> + /* alignment relative to destination. */
> + switch ((unsigned long)src_b & 3) {
> + case 0x0: /* Both byte offsets are aligned */
> + src_w = (const uint32_t *)src_b;
> +
> + /* Copy 32 bytes per loop */
> + for (i = n >> 5; i > 0; i--) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 4) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 3) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 2)
> + *dest_w++ = *src_w++;
> +
> + src_b = (const char *)src_w;
> + break;
> +
> + case 0x1: /* Unaligned - Off by 1 */
> + /* Word align the source */
> + src_w = (const void *) ((unsigned)src_b & ~3);
> + /* Load the holding buffer */
> + buf_hold = *src_w++ << 8;
> +
> + for (i = n >> 5; i > 0; i--) {
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + }
> +
> + if (n & 1 << 4) {
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + }
> +
> + if (n & 1 << 3) {
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + }
> +
> + if (n & 1 << 2) {
> + __OFFSET_MEMCPY(1);
> + }
> +
> + /* Realign the source */
> + src_b = (const void *)src_w;
> + src_b -= 3;
> + break;
> + case 0x2: /* Unaligned - Off by 2 */
> + /* Word align the source */
> + src_w = (const void *) ((unsigned)src_b & ~3);
> + /* Load the holding buffer */
> + buf_hold = *src_w++ << 16;
> +
> + for (i = n >> 5; i > 0; i--) {
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + }
> +
> + if (n & 1 << 4) {
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + }
> +
> + if (n & 1 << 3) {
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + }
> +
> + if (n & 1 << 2) {
> + __OFFSET_MEMCPY(2);
> + }
> +
> + /* Realign the source */
> + src_b = (const void *)src_w;
> + src_b -= 2;
> + break;
> + case 0x3: /* Unaligned - Off by 3 */
> + /* Word align the source */
> + src_w = (const void *) ((unsigned)src_b & ~3);
> + /* Load the holding buffer */
> + buf_hold = *src_w++ << 24;
> +
> + for (i = n >> 5; i > 0; i--) {
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + }
> +
> + if (n & 1 << 4) {
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + }
> +
> + if (n & 1 << 3) {
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + }
> +
> + if (n & 1 << 2) {
> + __OFFSET_MEMCPY(3);
> + }
> +
> + /* Realign the source */
> + src_b = (const void *)src_w;
> + src_b -= 1;
> + break;
> + }
> + dest_b = (void *)dest_w;
> + }
> +
> + /* Finish off any remaining bytes with byte copies */
> + if (n & 1 << 1) {
> + *dest_b++ = *src_b++;
> + *dest_b++ = *src_b++;
> + }
> +
> + if (n & 1)
> + *dest_b++ = *src_b++;
> +
> + return dest;
> +}
> +#elif defined(CONFIG_OPT_LIB_WORD)
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + unsigned char *d;
> + const unsigned char *s;
> + uint32_t *dest_w = (uint32_t *) dest;
> + const uint32_t *src_w = (const uint32_t *) src;
> +
> + /* If both source and dest are word aligned copy words */
> + if (!((unsigned)dest_w & 3) && !((unsigned)src_w & 3)) {
> + for (; n >= 4; n -= 4)
> + *dest_w++ = *src_w++;
> + }
> +
> + d = (unsigned char *) dest_w;
> + s = (unsigned char *) src_w;
> +
> + /* For remaining or if not aligned, copy bytes */
> + for (; n >= 1; n -= 1)
> + *d++ = *s++;
> +
> + return dest;
> +
> +}
> +#elif defined(CONFIG_OPT_LIB_WORD_UNROLL)
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + int i = 0;
> + unsigned char *d;
> + const unsigned char *s;
> + uint32_t *dest_w = (uint32_t *) dest;
> + const uint32_t *src_w = (const uint32_t *) src;
> +
> + /* If both source and dest are word aligned copy words */
> + if (!((unsigned)dest_w & 3) && !((unsigned)src_w & 3)) {
> + /* Copy 32 bytes per loop */
> + for (i = n >> 5; i > 0; i--) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 4) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 3) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 2)
> + *dest_w++ = *src_w++;
> +
> + d = (unsigned char *) dest_w;
> + s = (unsigned char *) src_w;
> +
> + } else {
> + d = (unsigned char *) dest_w;
> + s = (unsigned char *) src_w;
> +
> + for (i = n >> 3; i > 0; i--) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1 << 2) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> + }
> +
> + if (n & 1 << 1) {
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1)
> + *d++ = *s++;
> +
> + return dest;
> +}
> +
> +#elif defined(CONFIG_OPT_LIB_BYTE_UNROLL)
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + int i = 0;
> + unsigned char *d = (unsigned char *) dest;
> + const unsigned char *s = (const unsigned char *) src;
> +
> + /* Byte copies only, but unroll the loop */
> + for (i = n >> 3; i > 0; i--) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1 << 2) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1 << 1) {
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1)
> + *d++ = *s++;
> +
> + return dest;
> +}
> +#else /* CONFIG_OPT_LIB_BYTE fallback */
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + unsigned char *d = (unsigned char *) dest;
> + const unsigned char *s = (const unsigned char *) src;
> +
> + /* Simple byte copy, no frills */
> + for (; n > 0; n--)
> + *d++ = *s++;
> +
> + return dest;
> +}
> +#endif
> +
> +EXPORT_SYMBOL(memcpy);
>
--
Tel: +44 (1590) 610184
Cell: +44 (7970) 676050
SkypeID: jeremybennett
Twitter: @jeremypbennett
Email: jeremy.bennett at embecosm.com
Web: www.embecosm.com
PGP key: 1024D/BEF58172FB4754E1 2009-03-20