* [OpenRISC] [RFC PATCH] Optimized memcpy routine
@ 2016-03-21 14:29 Stafford Horne
  2016-03-21 14:29 ` [OpenRISC] [PATCH] openrisc: Add optimized " Stafford Horne
  0 siblings, 1 reply; 5+ messages in thread
From: Stafford Horne @ 2016-03-21 14:29 UTC (permalink / raw)
To: openrisc

Hello,

This patch shows what I have been doing and which methods I have tried
in order to provide an optimized memcpy routine. As you can see from
the results in the patch, the numbers are promising and things are
working well.

If there is no major opposition within a few days I will update the
patch to remove the Kconfig options and keep only the `word copies +
loop unrolls` implementation.

Thanks
-Stafford

Stafford Horne (1):
  openrisc: Add optimized memcpy routine

 arch/openrisc/Kconfig              |  61 ++++++
 arch/openrisc/TODO.openrisc        |   1 -
 arch/openrisc/include/asm/string.h |   5 +
 arch/openrisc/lib/Makefile         |   3 +-
 arch/openrisc/lib/memcpy.c         | 377 +++++++++++++++++++++++++++++++++++++
 5 files changed, 445 insertions(+), 2 deletions(-)
 create mode 100644 arch/openrisc/lib/memcpy.c

--
2.5.0
* [OpenRISC] [PATCH] openrisc: Add optimized memcpy routine
  2016-03-21 14:29 [OpenRISC] [RFC PATCH] Optimized memcpy routine Stafford Horne
@ 2016-03-21 14:29 ` Stafford Horne
  2016-03-23  4:54   ` Jonas Bonn
  0 siblings, 1 reply; 5+ messages in thread
From: Stafford Horne @ 2016-03-21 14:29 UTC (permalink / raw)
To: openrisc

The default memcpy routine provided in lib does only byte copies.
Using word copies we can lower boot time and the cycles spent in
memcpy quite significantly.

Booting on my de0 nano I see boot times go from 7.2 to 5.6 seconds.
The avg cycles in memcpy during boot go from 6467 to 1887.

This commit contains an option menu for people to see what I tried,
but in the end we should only leave the implementation we want to
keep. The implementations I tested and their avg cycles:
 - Word Copies + Loop Unrolls + Non Aligned  1882
 - Word Copies + Loop Unrolls                1887
 - Word Copies                               2441
 - Byte Copies + Loop Unrolls                6467
 - Byte Copies                               7600

I would suggest going with the Word Copies + Loop Unrolls one as it
provides the best tradeoff between simplicity and boot speedup.

Signed-off-by: Stafford Horne <shorne@gmail.com>
---
 arch/openrisc/Kconfig              |  61 ++++++
 arch/openrisc/TODO.openrisc        |   1 -
 arch/openrisc/include/asm/string.h |   5 +
 arch/openrisc/lib/Makefile         |   3 +-
 arch/openrisc/lib/memcpy.c         | 377 +++++++++++++++++++++++++++++++++++++
 5 files changed, 445 insertions(+), 2 deletions(-)
 create mode 100644 arch/openrisc/lib/memcpy.c

diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 6e88268..68c0588 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -115,6 +115,67 @@ config OPENRISC_HAVE_INST_LWA_SWA
 
 endmenu
 
+menu "Optimized Lib"
+
+config OPT_LIB_FUNCTION
+	bool "Enable optimized lib functions"
+	default y
+	help
+	  Turns on optimized library functions (memcpy and memset).
+	  They are optimized to use word memory operations instead of
+	  the default byte operations.
+
+choice
+	prompt "Optimized lib Implementation"
+	default OPT_LIB_WORD_NONALIGNED
+	depends on OPT_LIB_FUNCTION
+
+config OPT_LIB_WORD_NONALIGNED
+	bool "Non-Aligned Word Operations"
+	help
+	  This implementation performs word operations and loop unrolls.
+	  It supports word operations on non-aligned memory, but at the
+	  cost of doing shift and OR operations to fix up the alignment.
+
+	  This should be the fastest implementation.
+
+config OPT_LIB_WORD_UNROLL
+	bool "Unrolled Loop Word Operations"
+	help
+	  This implementation performs word operations and loop unrolls.
+	  However, if the memory being operated on is not word aligned
+	  it falls back to using byte operations.
+
+	  This may be as fast as the non-aligned implementation if the
+	  shift and OR operations used to fix up the alignment are
+	  slower than the extra byte memory operations.
+
+	  This should be the 2nd fastest implementation.
+
+config OPT_LIB_WORD
+	bool "Simple Word Operations"
+	help
+	  This implementation performs word operations if the data is
+	  word aligned, then falls back to byte operations.  It does
+	  not do loop unrolling.
+
+	  This should be the 3rd fastest implementation.
+
+config OPT_LIB_BYTE_UNROLL
+	bool "Unrolled Loop Byte Operations"
+	help
+	  This implementation performs byte operations with loop
+	  unrolling.
+
+	  This should be the 4th fastest implementation.
+
+config OPT_LIB_BYTE
+	bool "Simple Byte Operations"
+	help
+	  Simple byte operations.  No frills, but should not have any
+	  problem working on any architecture.
+
+endchoice
+
+endmenu
+
 config NR_CPUS
 	int "Maximum number of CPUs (2-32)"
 	range 2 32
diff --git a/arch/openrisc/TODO.openrisc b/arch/openrisc/TODO.openrisc
index acfeef9..a2bda7b 100644
--- a/arch/openrisc/TODO.openrisc
+++ b/arch/openrisc/TODO.openrisc
@@ -13,4 +13,3 @@ that are due for investigation shortly, i.e. our TODO list:
    or1k and this change is slowly trickling through the stack.  For the time
    being, or32 is equivalent to or1k.
 
--- Implement optimized version of memcpy and memset
diff --git a/arch/openrisc/include/asm/string.h b/arch/openrisc/include/asm/string.h
index 33470d4..04111b2 100644
--- a/arch/openrisc/include/asm/string.h
+++ b/arch/openrisc/include/asm/string.h
@@ -1,7 +1,12 @@
 #ifndef __ASM_OPENRISC_STRING_H
 #define __ASM_OPENRISC_STRING_H
 
+#ifdef CONFIG_OPT_LIB_FUNCTION
 #define __HAVE_ARCH_MEMSET
 extern void *memset(void *s, int c, __kernel_size_t n);
 
+#define __HAVE_ARCH_MEMCPY
+extern void *memcpy(void *dest, __const void *src, __kernel_size_t n);
+#endif
+
 #endif /* __ASM_OPENRISC_STRING_H */
diff --git a/arch/openrisc/lib/Makefile b/arch/openrisc/lib/Makefile
index 67c583e..c3316f6 100644
--- a/arch/openrisc/lib/Makefile
+++ b/arch/openrisc/lib/Makefile
@@ -2,4 +2,5 @@
 # Makefile for or32 specific library files..
 #
 
-obj-y = memset.o string.o delay.o
+obj-y := delay.o string.o
+obj-$(CONFIG_OPT_LIB_FUNCTION) += memset.o memcpy.o
diff --git a/arch/openrisc/lib/memcpy.c b/arch/openrisc/lib/memcpy.c
new file mode 100644
index 0000000..36a7aac
--- /dev/null
+++ b/arch/openrisc/lib/memcpy.c
@@ -0,0 +1,377 @@
+/*
+ * arch/openrisc/lib/memcpy.c
+ *
+ * Optimized memory copy routines for openrisc.  These are mostly
+ * copied from other sources but slightly extended based on ideas
+ * discussed in #openrisc.
+ *
+ * The word non-aligned implementation is based on the microblaze
+ * one found in:
+ *   arch/microblaze/lib/memcpy.c
+ * but is extended to have loop unrolls.  It only supports big
+ * endian at the moment.
+ *
+ * The byte unroll implementation is a copy of that found in:
+ *   arm/boot/compressed/string.c
+ *
+ * The word unroll implementation is an extension of the byte
+ * unrolled implementation, but using word copies (if things are
+ * properly aligned).
+ */
+
+#ifndef _MC_TEST
+#include <linux/export.h>
+#include <linux/string.h>
+#endif
+
+#if defined(CONFIG_OPT_LIB_WORD_NONALIGNED)
+/*
+ * Make the loops below a bit more manageable: build one destination
+ * word from the bytes held over from the previous aligned load
+ * (buf_hold) plus the top bytes of the new one, then re-prime the
+ * holding buffer.  n is the source misalignment in bytes.
+ */
+#define __OFFSET_MEMCPY(n) do { \
+		value = *src_w++; \
+		*dest_w++ = buf_hold | value >> ((4 - n) * 8); \
+		buf_hold = value << (n * 8); \
+	} while (0)
+
+void *memcpy(void *dest, const void *src, __kernel_size_t n)
+{
+	const char *src_b = src;
+	char *dest_b = dest;
+	int i;
+
+	/* The following code tries to optimize the copy by using word
+	 * accesses.  This works fine if both source and destination
+	 * are aligned on the same boundary.  However, if they are
+	 * aligned on different boundaries shifts will be necessary.
+	 */
+	const uint32_t *src_w;
+	uint32_t *dest_w;
+
+	if (likely(n >= 4)) {
+		unsigned value, buf_hold;
+
+		/* Align the destination to a word boundary.
+		 * This is done in an endian independent manner.
+		 */
+		switch ((unsigned long)dest_b & 3) {
+		case 1:
+			*dest_b++ = *src_b++;
+			--n;
+			/* Fall through */
+		case 2:
+			*dest_b++ = *src_b++;
+			--n;
+			/* Fall through */
+		case 3:
+			*dest_b++ = *src_b++;
+			--n;
+		}
+
+		dest_w = (void *)dest_b;
+
+		/* Choose a copy scheme based on the source alignment
+		 * relative to the destination; the shifts below
+		 * assume big endian.
+		 */
+		switch ((unsigned long)src_b & 3) {
+		case 0x0:	/* Both byte offsets are aligned */
+			src_w = (const uint32_t *)src_b;
+
+			/* Copy 32 bytes per loop */
+			for (i = n >> 5; i > 0; i--) {
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+			}
+
+			if (n & 1 << 4) {
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+			}
+
+			if (n & 1 << 3) {
+				*dest_w++ = *src_w++;
+				*dest_w++ = *src_w++;
+			}
+
+			if (n & 1 << 2)
+				*dest_w++ = *src_w++;
+
+			src_b = (const char *)src_w;
+			break;
+
+		case 0x1:	/* Unaligned - Off by 1 */
+			/* Word align the source */
+			src_w = (const void *)((unsigned long)src_b & ~3);
+			/* Load the holding buffer */
+			buf_hold = *src_w++ << 8;
+
+			for (i = n >> 5; i > 0; i--) {
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+			}
+
+			if (n & 1 << 4) {
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+			}
+
+			if (n & 1 << 3) {
+				__OFFSET_MEMCPY(1);
+				__OFFSET_MEMCPY(1);
+			}
+
+			if (n & 1 << 2) {
+				__OFFSET_MEMCPY(1);
+			}
+
+			/* Realign the source */
+			src_b = (const void *)src_w;
+			src_b -= 3;
+			break;
+
+		case 0x2:	/* Unaligned - Off by 2 */
+			/* Word align the source */
+			src_w = (const void *)((unsigned long)src_b & ~3);
+			/* Load the holding buffer */
+			buf_hold = *src_w++ << 16;
+
+			for (i = n >> 5; i > 0; i--) {
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+			}
+
+			if (n & 1 << 4) {
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+			}
+
+			if (n & 1 << 3) {
+				__OFFSET_MEMCPY(2);
+				__OFFSET_MEMCPY(2);
+			}
+
+			if (n & 1 << 2) {
+				__OFFSET_MEMCPY(2);
+			}
+
+			/* Realign the source */
+			src_b = (const void *)src_w;
+			src_b -= 2;
+			break;
+
+		case 0x3:	/* Unaligned - Off by 3 */
+			/* Word align the source */
+			src_w = (const void *)((unsigned long)src_b & ~3);
+			/* Load the holding buffer */
+			buf_hold = *src_w++ << 24;
+
+			for (i = n >> 5; i > 0; i--) {
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+			}
+
+			if (n & 1 << 4) {
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+			}
+
+			if (n & 1 << 3) {
+				__OFFSET_MEMCPY(3);
+				__OFFSET_MEMCPY(3);
+			}
+
+			if (n & 1 << 2) {
+				__OFFSET_MEMCPY(3);
+			}
+
+			/* Realign the source */
+			src_b = (const void *)src_w;
+			src_b -= 1;
+			break;
+		}
+		dest_b = (void *)dest_w;
+	}
+
+	/* Finish off any remaining bytes */
+	if (n & 1 << 1) {
+		*dest_b++ = *src_b++;
+		*dest_b++ = *src_b++;
+	}
+
+	if (n & 1)
+		*dest_b++ = *src_b++;
+
+	return dest;
+}
+#elif defined(CONFIG_OPT_LIB_WORD)
+void *memcpy(void *dest, __const void *src, __kernel_size_t n)
+{
+	unsigned char *d;
+	const unsigned char *s;
+	uint32_t *dest_w = (uint32_t *)dest;
+	const uint32_t *src_w = (const uint32_t *)src;
+
+	/* If both source and dest are word aligned, copy words */
+	if (!((unsigned long)dest_w & 3) && !((unsigned long)src_w & 3)) {
+		for (; n >= 4; n -= 4)
+			*dest_w++ = *src_w++;
+	}
+
+	d = (unsigned char *)dest_w;
+	s = (const unsigned char *)src_w;
+
+	/* Copy the remainder, or everything if not aligned */
+	for (; n >= 1; n -= 1)
+		*d++ = *s++;
+
+	return dest;
+}
+#elif defined(CONFIG_OPT_LIB_WORD_UNROLL)
+void *memcpy(void *dest, __const void *src, __kernel_size_t n)
+{
+	int i;
+	unsigned char *d;
+	const unsigned char *s;
+	uint32_t *dest_w = (uint32_t *)dest;
+	const uint32_t *src_w = (const uint32_t *)src;
+
+	/* If both source and dest are word aligned, copy words */
+	if (!((unsigned long)dest_w & 3) && !((unsigned long)src_w & 3)) {
+		/* Copy 32 bytes per loop */
+		for (i = n >> 5; i > 0; i--) {
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+		}
+
+		if (n & 1 << 4) {
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+		}
+
+		if (n & 1 << 3) {
+			*dest_w++ = *src_w++;
+			*dest_w++ = *src_w++;
+		}
+
+		if (n & 1 << 2)
+			*dest_w++ = *src_w++;
+
+		d = (unsigned char *)dest_w;
+		s = (const unsigned char *)src_w;
+	} else {
+		d = (unsigned char *)dest_w;
+		s = (const unsigned char *)src_w;
+
+		/* Not aligned; copy 8 bytes per loop */
+		for (i = n >> 3; i > 0; i--) {
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+		}
+
+		if (n & 1 << 2) {
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+			*d++ = *s++;
+		}
+	}
+
+	if (n & 1 << 1) {
+		*d++ = *s++;
+		*d++ = *s++;
+	}
+
+	if (n & 1)
+		*d++ = *s++;
+
+	return dest;
+}
+#elif defined(CONFIG_OPT_LIB_BYTE_UNROLL)
+void *memcpy(void *dest, __const void *src, __kernel_size_t n)
+{
+	int i;
+	unsigned char *d = dest;
+	const unsigned char *s = src;
+
+	/* Copy 8 bytes per loop */
+	for (i = n >> 3; i > 0; i--) {
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+	}
+
+	if (n & 1 << 2) {
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+		*d++ = *s++;
+	}
+
+	if (n & 1 << 1) {
+		*d++ = *s++;
+		*d++ = *s++;
+	}
+
+	if (n & 1)
+		*d++ = *s++;
+
+	return dest;
+}
+#else /* CONFIG_OPT_LIB_BYTE fallback */
+void *memcpy(void *dest, __const void *src, __kernel_size_t n)
+{
+	unsigned char *d = dest;
+	const unsigned char *s = src;
+
+	/* Simple byte copy */
+	for (; n > 0; n--)
+		*d++ = *s++;
+
+	return dest;
+}
+#endif
+
+EXPORT_SYMBOL(memcpy);
--
2.5.0
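The _MC_TEST guard in memcpy.c suggests the routines were also built
outside the kernel.  A standalone harness along the following lines
(illustrative only; word_memcpy and the buffer sizes are invented
here, not part of the patch) checks the simple word-copy variant
against the libc memcpy at every source/destination alignment:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The CONFIG_OPT_LIB_WORD variant from the patch, renamed so it can
 * link alongside the libc memcpy it is checked against. */
static void *word_memcpy(void *dest, const void *src, size_t n)
{
	unsigned char *d;
	const unsigned char *s;
	uint32_t *dest_w = dest;
	const uint32_t *src_w = src;

	/* If both source and dest are word aligned, copy words */
	if (!((uintptr_t)dest_w & 3) && !((uintptr_t)src_w & 3)) {
		for (; n >= 4; n -= 4)
			*dest_w++ = *src_w++;
	}

	/* Copy the remainder, or everything if not aligned */
	d = (unsigned char *)dest_w;
	s = (const unsigned char *)src_w;
	for (; n >= 1; n -= 1)
		*d++ = *s++;

	return dest;
}

int main(void)
{
	unsigned char src[128], got[128], want[128];
	size_t s_off, d_off, n;
	int fails = 0;

	for (n = 0; n < sizeof(src); n++)
		src[n] = (unsigned char)n;

	/* Every alignment pair, lengths well past the unroll sizes */
	for (s_off = 0; s_off < 4; s_off++)
		for (d_off = 0; d_off < 4; d_off++)
			for (n = 0; n <= 72; n++) {
				memset(got, 0xAA, sizeof(got));
				memset(want, 0xAA, sizeof(want));
				word_memcpy(got + d_off, src + s_off, n);
				memcpy(want + d_off, src + s_off, n);
				if (memcmp(got, want, sizeof(got))) {
					printf("FAIL s_off=%zu d_off=%zu n=%zu\n",
					       s_off, d_off, n);
					fails++;
				}
			}

	printf("%s\n", fails ? "some copies differ" : "all alignments OK");
	return fails != 0;
}

Built with a plain gcc invocation, a nonzero exit status means some
copy differed from the reference.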
* [OpenRISC] [PATCH] openrisc: Add optimized memcpy routine
  2016-03-21 14:29 ` [OpenRISC] [PATCH] openrisc: Add optimized " Stafford Horne
@ 2016-03-23  4:54   ` Jonas Bonn
  2016-03-24 22:06     ` Stafford Horne
  0 siblings, 1 reply; 5+ messages in thread
From: Jonas Bonn @ 2016-03-23 4:54 UTC (permalink / raw)
To: openrisc

Hi Stafford,

Looks really good.  Here's my two cents worth:

i) I personally don't care much for the open-coded loop unrolling
because it makes a lot of assumptions about the underlying
implementation; put it behind CONFIG_OR1200 (or whatever that option
was called) if you really want to do it this way.

ii) The NONALIGNED variant appears overly complex for little gain.

iii) That said, the "simple" word copy variant is what I'd probably
choose.

What would probably be even better though is:

#define memcpy(...) __builtin_memcpy(...)

and put the below optimisations directly into GCC.  I haven't looked
at GCC in a long time and don't know if anyone is currently
maintaining it; it used to be in need of the below optimization as
well, but perhaps someone has done it in the meantime.  If not,
that's probably the best place for it; there you can better take the
underlying implementation into account in order to do proper loop
unrolling and other optimizations (speaking of which, I believe loop
unrolling is an optimization pass in GCC... not sure what
architecture hooks it relies on).

That said, if the GCC folk are impossible, your Linux optimisation
will be fine.  Let's just get a reliable "Tested-by" on there once
you've settled on which variant you want to support.

Top-notch work!
/Jonas

On 03/21/2016 03:29 PM, Stafford Horne wrote:
> The default memcpy routine provided in lib does only byte copies.
> Using word copies we can lower boot time and the cycles spent in
> memcpy quite significantly.
>
> Booting on my de0 nano I see boot times go from 7.2 to 5.6 seconds.
> The avg cycles in memcpy during boot go from 6467 to 1887.
>
> This commit contains an option menu for people to see what I tried,
> but in the end we should only leave the implementation we want to
> keep. The implementations I tested and their avg cycles:
>  - Word Copies + Loop Unrolls + Non Aligned  1882
>  - Word Copies + Loop Unrolls                1887
>  - Word Copies                               2441
>  - Byte Copies + Loop Unrolls                6467
>  - Byte Copies                               7600
>
> I would suggest going with the Word Copies + Loop Unrolls one as it
> provides the best tradeoff between simplicity and boot speedup.
>
> Signed-off-by: Stafford Horne <shorne@gmail.com>
<snip>
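To make point (ii) concrete: each destination word in the NONALIGNED
variant costs one load plus a shift, an OR, and a second shift to
re-prime the holding buffer.  A minimal standalone demo of the
off-by-1 case (the word values are made up; since it operates on word
values rather than memory bytes, it behaves the same on any host):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Three aligned big-endian source words; the copy wants the
	 * byte stream starting at offset 1: 11 22 33 44 55 66 77 88 */
	uint32_t src_w[3] = { 0x00112233, 0x44556677, 0x8899aabb };
	uint32_t dst_w[2];
	uint32_t value, buf_hold;
	int i;

	/* Prime the holding buffer with bytes 1..3 of the first word */
	buf_hold = src_w[0] << 8;

	/* Each iteration is what __OFFSET_MEMCPY(1) expands to */
	for (i = 0; i < 2; i++) {
		value = src_w[i + 1];               /* next aligned load   */
		dst_w[i] = buf_hold | value >> 24;  /* 3 held + 1 new byte */
		buf_hold = value << 8;              /* carry bytes 1..3    */
	}

	/* Prints 0x11223344 0x55667788 */
	printf("0x%08x 0x%08x\n", dst_w[0], dst_w[1]);
	return 0;
}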
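A minimal sketch of the header change Jonas suggests, assuming memcpy
is mapped to the compiler builtin (hypothetical: this only pays off
once the or1k GCC backend can expand the builtin inline, for example
via the movmem pattern Stafford mentions below; otherwise GCC simply
emits a call back to memcpy):

/* Hypothetical arch/openrisc/include/asm/string.h variant: let GCC
 * expand small fixed-size copies inline and fall back to the library
 * routine for everything it cannot expand. */
#define __HAVE_ARCH_MEMCPY
extern void *memcpy(void *dest, __const void *src, __kernel_size_t n);
#define memcpy(d, s, n) __builtin_memcpy((d), (s), (n))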
* [OpenRISC] [PATCH] openrisc: Add optimized memcpy routine
  2016-03-23  4:54 ` Jonas Bonn
@ 2016-03-24 22:06   ` Stafford Horne
  2016-03-25 12:18     ` Jeremy Bennett
  0 siblings, 1 reply; 5+ messages in thread
From: Stafford Horne @ 2016-03-24 22:06 UTC (permalink / raw)
To: openrisc

On Wed, 23 Mar 2016, Jonas Bonn wrote:

> Hi Stafford,
>
> Looks really good.  Here's my two cents worth:
>
> i) I personally don't care much for the open-coded loop unrolling
> because it makes a lot of assumptions about the underlying
> implementation; put it behind CONFIG_OR1200 (or whatever that option
> was called) if you really want to do it this way.

Yes, it's still CONFIG_OR1200.

> ii) The NONALIGNED variant appears overly complex for little gain.

I agree; it is complex, and you lose a lot of cycles doing the shifts
and masks, so what you gain from the word loads/stores is mostly
cancelled out.

> iii) That said, the "simple" word copy variant is what I'd probably
> choose.
>
> What would probably be even better though is:
>
> #define memcpy(...) __builtin_memcpy(...)
>
> and put the below optimisations directly into GCC.
<snip>

This reply took me some time because I spent the last few days
looking into GCC.  The openrisc GCC back end seems to need a rewrite
before it can be upstreamed due to copyright issues, but we are still
maintaining the port and keeping it up to date with upstream.

I did find where to add the memcpy builtin (defining a define_expand
movmem in or1k.md).  We have a proposal to update the GCC port as
part of a GSoC student project [1].  If that gets taken up I will
help with it.  In the meantime I will update the patch based on this
feedback.

> That said, if the GCC folk are impossible, your Linux optimisation
> will be fine.  Let's just get a reliable "Tested-by" on there once
> you've settled on which variant you want to support.

I have had some others from #openrisc compile and run my branch [2]
on github.  For my own tests on the de0 nano it is working well, but
I'll see if someone has a good load for it.

> Top-notch work!

Thank you for the review.

[1] https://lists.fossi-foundation.org/listinfo/discussion
[2] https://github.com/stffrdhrn/linux/tree/openrisc
* [OpenRISC] [PATCH] openrisc: Add optimized memcpy routine
  2016-03-24 22:06 ` Stafford Horne
@ 2016-03-25 12:18   ` Jeremy Bennett
  0 siblings, 0 replies; 5+ messages in thread
From: Jeremy Bennett @ 2016-03-25 12:18 UTC (permalink / raw)
To: openrisc

On 24/03/16 22:06, Stafford Horne wrote:
<snip>
> This reply took me some time because I spent the last few days
> looking into GCC.  The openrisc GCC back end seems to need a rewrite
> before it can be upstreamed due to copyright issues, but we are
> still maintaining the port and keeping it up to date with upstream.

Hi Stafford,

I thought BlueCmd had resolved almost all of the copyright issues by
talking to most of the copyright holders, so I didn't think there was
much that needed rewriting.

Certainly there is no restriction on anything done by Joern Rennecke
(amylaar) as part of the major work he did on GCC 4.5.1 a few years
ago.  I don't think Joern is on this mailing list, but I'm sure he'll
be happy to review any OpenRISC GCC work.

Best wishes,

Jeremy

--
Tel:      +44 (1590) 610184
Cell:     +44 (7970) 676050
SkypeID:  jeremybennett
Twitter:  @jeremypbennett
Email:    jeremy.bennett at embecosm.com
Web:      www.embecosm.com
PGP key:  1024D/BEF58172FB4754E1 2009-03-20