* [OpenRISC] [RFC PATCH 1/1] openrisc: Add optimized memcpy routine
2016-03-21 21:36 ` [OpenRISC] [RFC PATCH 1/1] openrisc: Add optimized " Stafford Horne
@ 2016-03-21 22:27 ` Jeremy Bennett
0 siblings, 0 replies; 3+ messages in thread
From: Jeremy Bennett @ 2016-03-21 22:27 UTC (permalink / raw)
To: openrisc
Nice work.
It's a while since I've looked at this, but you might like to consider
whether the compiler is also aggressively optimizing memcpy in all
circumstances. The compiler should generate __builtin_memcpy for things
like structure assignment, and then lower that to suitably efficient
inline code in most circumstances.
Best wishes,
Jeremy
On 21/03/16 22:36, Stafford Horne wrote:
> The default memcpy routine provided in lib does only byte copies.
> Using word copies we can lower boot time and the cycles spent in
> memcpy quite significantly.
>
> Booting on my de0 nano I see boot times go from 7.2 to 5.6 seconds.
> The avg cycles in memcpy during boot go from 6467 to 1887.
>
> This commit contains an option menu for people to see what I tried
> but in the end we should only leave the implementation we want to
> keep.
> The implementations I tested and avg cycles:
> - Word Copies + Loop Unrolls + Non Aligned 1882
> - Word Copies + Loop Unrolls 1887
> - Word Copies 2441
> - Byte Copies + Loop Unrolls 6467
> - Byte Copies 7600
>
> I would suggest going with the Word Copies + Loop Unrolls one as it
> provides the best tradeoff between simplicity and boot speedup.
>
> Signed-off-by: Stafford Horne <shorne@gmail.com>
> ---
> arch/openrisc/Kconfig | 61 ++++++
> arch/openrisc/TODO.openrisc | 1 -
> arch/openrisc/include/asm/string.h | 5 +
> arch/openrisc/lib/Makefile | 3 +-
> arch/openrisc/lib/memcpy.c | 377 +++++++++++++++++++++++++++++++++++++
> 5 files changed, 445 insertions(+), 2 deletions(-)
> create mode 100644 arch/openrisc/lib/memcpy.c
>
> diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
> index 6e88268..68c0588 100644
> --- a/arch/openrisc/Kconfig
> +++ b/arch/openrisc/Kconfig
> @@ -115,6 +115,67 @@ config OPENRISC_HAVE_INST_LWA_SWA
>
> endmenu
>
> +menu "Optimized Lib"
> +
> +config OPT_LIB_FUNCTION
> + bool "Enable optimized lib functions"
> + default y
> + help
> + Turns on optimized library functions (memcpy and memset).
> + They are optimized to use word memory operations instead of
> + the default byte operations.
> +
> +choice
> + prompt "Optimized lib Implementation"
> + default OPT_LIB_WORD_NONALIGNED
> + depends on OPT_LIB_FUNCTION
> +
> +config OPT_LIB_WORD_NONALIGNED
> + bool "Non-Aligned Word Operations"
> + help
> + This implementation performs word operations with loop unrolling.
> + It supports word operations on non-aligned memory, but at the cost
> + of shift and OR operations to fix up alignment.
> +
> + This should be the fastest implementation.
> +
> +config OPT_LIB_WORD_UNROLL
> + bool "Unrolled Loop Word Operations"
> + help
> + This implementation performs word operations and loop unrolls.
> + However, if memory being operated on is not word aligned it will
> + fall back to using byte operations.
> +
> + This may be as fast as the non-aligned implementation if the shift
> + and OR operations needed to fix up alignment are slower than the
> + 4 byte memory operations.
> +
> + This should be the 2nd fastest implementation.
> +
> +config OPT_LIB_WORD
> + bool "Simple Word Operations"
> + help
> + This implementation performs word operations if data is word aligned
> + then falls back to byte operations. It does not do loop unrolling.
> +
> + This should be the 3rd fastest implementation.
> +
> +config OPT_LIB_BYTE_UNROLL
> + bool "Unrolled Loop Byte Operations"
> + help
> + This implementation performs byte operations with loop unrolling.
> +
> + This should be the 4th fastest implementation.
> +
> +config OPT_LIB_BYTE
> + bool "Simple Byte Operations"
> + help
> + Simple byte operations. No frills but should not have any problem
> + working on any architecture.
> +
> +endchoice
> +
> +endmenu
> +
> config NR_CPUS
> int "Maximum number of CPUs (2-32)"
> range 2 32
> diff --git a/arch/openrisc/TODO.openrisc b/arch/openrisc/TODO.openrisc
> index acfeef9..a2bda7b 100644
> --- a/arch/openrisc/TODO.openrisc
> +++ b/arch/openrisc/TODO.openrisc
> @@ -13,4 +13,3 @@ that are due for investigation shortly, i.e. our TODO list:
> or1k and this change is slowly trickling through the stack. For the time
> being, or32 is equivalent to or1k.
>
> --- Implement optimized version of memcpy and memset
> diff --git a/arch/openrisc/include/asm/string.h b/arch/openrisc/include/asm/string.h
> index 33470d4..04111b2 100644
> --- a/arch/openrisc/include/asm/string.h
> +++ b/arch/openrisc/include/asm/string.h
> @@ -1,7 +1,12 @@
> #ifndef __ASM_OPENRISC_STRING_H
> #define __ASM_OPENRISC_STRING_H
>
> +#ifdef CONFIG_OPT_LIB_FUNCTION
> #define __HAVE_ARCH_MEMSET
> extern void *memset(void *s, int c, __kernel_size_t n);
>
> +#define __HAVE_ARCH_MEMCPY
> +extern void *memcpy(void *dest, __const void *src, __kernel_size_t n);
> +#endif
> +
> #endif /* __ASM_OPENRISC_STRING_H */
> diff --git a/arch/openrisc/lib/Makefile b/arch/openrisc/lib/Makefile
> index 67c583e..c3316f6 100644
> --- a/arch/openrisc/lib/Makefile
> +++ b/arch/openrisc/lib/Makefile
> @@ -2,4 +2,5 @@
> # Makefile for or32 specific library files..
> #
>
> -obj-y = memset.o string.o delay.o
> +obj-y := delay.o string.o
> +obj-$(CONFIG_OPT_LIB_FUNCTION) += memset.o memcpy.o
> diff --git a/arch/openrisc/lib/memcpy.c b/arch/openrisc/lib/memcpy.c
> new file mode 100644
> index 0000000..36a7aac
> --- /dev/null
> +++ b/arch/openrisc/lib/memcpy.c
> @@ -0,0 +1,377 @@
> +/*
> + * arch/openrisc/lib/memcpy.c
> + *
> + * Optimized memory copy routines for openrisc. These are mostly copied
> + * from other sources but slightly extended based on ideas discussed in
> + * #openrisc.
> + *
> + * The non-aligned word implementation is based on the microblaze
> + * version found in:
> + * arch/microblaze/lib/memcpy.c
> + * but this is extended to have loop unrolls. This only supports
> + * big endian at the moment.
> + *
> + * The byte unroll implementation is a copy of that found in:
> + * arm/boot/compressed/string.c
> + *
> + * The word unroll implementation is an extension of the byte
> + * unrolled implementation, but using word copies (if things are
> + * properly aligned).
> + */
> +
> +#ifndef _MC_TEST
> +#include <linux/export.h>
> +
> +#include <linux/string.h>
> +#endif
> +
> +#if defined(CONFIG_OPT_LIB_WORD_NONALIGNED)
> +/*
> + * Helper to make the copy loops below more manageable: load the next
> + * aligned word, emit it combined with the bytes held over from the
> + * previous word, and keep the leftover bytes for the next round.
> + */
> +#define __OFFSET_MEMCPY(n) value = *src_w++; \
> + *dest_w++ = buf_hold | value >> ( ( 4 - n ) * 8 ); \
> + buf_hold = value << ( n * 8 )
> +
> +void *memcpy(void *dest, const void *src, __kernel_size_t n)
> +{
> + const char *src_b = src;
> + char *dest_b = dest;
> + int i;
> +
> + /* The following code tries to optimize the copy by using word
> + * alignment. This will work fine if both source and destination are
> + * aligned on the same boundary. However, if they are aligned on
> + * different boundaries shifts will be necessary.
> + */
> + const uint32_t *src_w;
> + uint32_t *dest_w;
> +
> + if (likely(n >= 4)) {
> + unsigned value, buf_hold;
> +
> + /* Align the destination to a word boundary. */
> + /* This is done in an endian independent manner. */
> + switch ((unsigned long)dest_b & 3) {
> + case 1:
> + *dest_b++ = *src_b++;
> + --n;
> + case 2:
> + *dest_b++ = *src_b++;
> + --n;
> + case 3:
> + *dest_b++ = *src_b++;
> + --n;
> + }
> +
> + dest_w = (void *)dest_b;
> +
> + /* Choose a copy scheme based on the source, this is done big endian */
> + /* alignment relative to destination. */
> + switch ((unsigned long)src_b & 3) {
> + case 0x0: /* Both byte offsets are aligned */
> + src_w = (const uint32_t *)src_b;
> +
> + /* Copy 32 bytes per loop */
> + for (i = n >> 5; i > 0; i--) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 4) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 3) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 2)
> + *dest_w++ = *src_w++;
> +
> + src_b = (const char *)src_w;
> + break;
> +
> + case 0x1: /* Unaligned - Off by 1 */
> + /* Word align the source */
> + src_w = (const void *) ((unsigned)src_b & ~3);
> + /* Load the holding buffer */
> + buf_hold = *src_w++ << 8;
> +
> + for (i = n >> 5; i > 0; i--) {
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + }
> +
> + if (n & 1 << 4) {
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + }
> +
> + if (n & 1 << 3) {
> + __OFFSET_MEMCPY(1);
> + __OFFSET_MEMCPY(1);
> + }
> +
> + if (n & 1 << 2) {
> + __OFFSET_MEMCPY(1);
> + }
> +
> + /* Realign the source */
> + src_b = (const void *)src_w;
> + src_b -= 3;
> + break;
> + case 0x2: /* Unaligned - Off by 2 */
> + /* Word align the source */
> + src_w = (const void *) ((unsigned)src_b & ~3);
> + /* Load the holding buffer */
> + buf_hold = *src_w++ << 16;
> +
> + for (i = n >> 5; i > 0; i--) {
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + }
> +
> + if (n & 1 << 4) {
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + }
> +
> + if (n & 1 << 3) {
> + __OFFSET_MEMCPY(2);
> + __OFFSET_MEMCPY(2);
> + }
> +
> + if (n & 1 << 2) {
> + __OFFSET_MEMCPY(2);
> + }
> +
> + /* Realign the source */
> + src_b = (const void *)src_w;
> + src_b -= 2;
> + break;
> + case 0x3: /* Unaligned - Off by 3 */
> + /* Word align the source */
> + src_w = (const void *) ((unsigned)src_b & ~3);
> + /* Load the holding buffer */
> + buf_hold = *src_w++ << 24;
> +
> + for (i = n >> 5; i > 0; i--) {
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + }
> +
> + if (n & 1 << 4) {
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + }
> +
> + if (n & 1 << 3) {
> + __OFFSET_MEMCPY(3);
> + __OFFSET_MEMCPY(3);
> + }
> +
> + if (n & 1 << 2) {
> + __OFFSET_MEMCPY(3);
> + }
> +
> + /* Realign the source */
> + src_b = (const void *)src_w;
> + src_b -= 1;
> + break;
> + }
> + dest_b = (void *)dest_w;
> + }
> +
> + /* Finish off any remaining bytes with byte copies */
> + if (n & 1 << 1) {
> + *dest_b++ = *src_b++;
> + *dest_b++ = *src_b++;
> + }
> +
> + if (n & 1)
> + *dest_b++ = *src_b++;
> +
> + return dest;
> +}
> +#elif defined(CONFIG_OPT_LIB_WORD)
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + unsigned char *d;
> + const unsigned char *s;
> + uint32_t *dest_w = (uint32_t *) dest;
> + const uint32_t *src_w = (const uint32_t *) src;
> +
> + /* If both source and dest are word aligned copy words */
> + if (!((unsigned)dest_w & 3) && !((unsigned)src_w & 3)) {
> + for (; n >= 4; n -= 4)
> + *dest_w++ = *src_w++;
> + }
> +
> + d = (unsigned char *) dest_w;
> + s = (unsigned char *) src_w;
> +
> + /* For remaining or if not aligned, copy bytes */
> + for (; n >= 1; n -= 1)
> + *d++ = *s++;
> +
> + return dest;
> +
> +}
> +#elif defined(CONFIG_OPT_LIB_WORD_UNROLL)
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + int i = 0;
> + unsigned char *d;
> + const unsigned char *s;
> + uint32_t *dest_w = (uint32_t *) dest;
> + const uint32_t *src_w = (const uint32_t *) src;
> +
> + /* If both source and dest are word aligned copy words */
> + if (!((unsigned)dest_w & 3) && !((unsigned)src_w & 3)) {
> + /* Copy 32 bytes per loop */
> + for (i = n >> 5; i > 0; i--) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 4) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 3) {
> + *dest_w++ = *src_w++;
> + *dest_w++ = *src_w++;
> + }
> +
> + if (n & 1 << 2)
> + *dest_w++ = *src_w++;
> +
> + d = (unsigned char *) dest_w;
> + s = (unsigned char *) src_w;
> +
> + } else {
> + d = (unsigned char *) dest_w;
> + s = (unsigned char *) src_w;
> +
> + for (i = n >> 3; i > 0; i--) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1 << 2) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> + }
> +
> + if (n & 1 << 1) {
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1)
> + *d++ = *s++;
> +
> + return dest;
> +}
> +
> +#elif defined(CONFIG_OPT_LIB_BYTE_UNROLL)
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + int i = 0;
> + unsigned char *d = (unsigned char *) dest;
> + const unsigned char *s = (const unsigned char *) src;
> +
> + /* Byte copies only, but unroll the loop */
> + for (i = n >> 3; i > 0; i--) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1 << 2) {
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1 << 1) {
> + *d++ = *s++;
> + *d++ = *s++;
> + }
> +
> + if (n & 1)
> + *d++ = *s++;
> +
> + return dest;
> +}
> +#else /* CONFIG_OPT_LIB_BYTE fallback */
> +void *memcpy(void *dest, __const void *src, __kernel_size_t n)
> +{
> + unsigned char *d = (unsigned char *) dest;
> + const unsigned char *s = (const unsigned char *) src;
> +
> + /* Simple byte copy, no frills */
> + for (; n > 0; n--)
> + *d++ = *s++;
> +
> + return dest;
> +}
> +#endif
> +
> +EXPORT_SYMBOL(memcpy);
>
--
Tel: +44 (1590) 610184
Cell: +44 (7970) 676050
SkypeID: jeremybennett
Twitter: @jeremypbennett
Email: jeremy.bennett at embecosm.com
Web: www.embecosm.com
PGP key: 1024D/BEF58172FB4754E1 2009-03-20