* [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
@ 2004-09-10 0:00 Randolph Chung
2004-09-10 0:15 ` Randolph Chung
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Randolph Chung @ 2004-09-10 0:00 UTC (permalink / raw)
To: parisc-linux
This has been on my todo list for some time -- the parisc kernel
currently uses a one-byte-at-a-time implementation for
copy_{to,from,in}_user. Obviously there is much room for improvement.
Here's an attempt at a more optimal version.
There is a somewhat lengthy comment below in memcpy.c about how this
works, so I will not repeat that here. This seems to work ok for me for
64-bit-smp. Test reports on other systems and different workloads are
welcome (but see the warning below).
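As a reading aid for the memcpy.c comment, here is a minimal userspace sketch of how pa_memcpy() appears to choose between its copy paths, based on the relative alignment of source and destination. The enum and the function name pick_path() are made up for illustration; only THRESHOLD and the src ^ dst test come from the patch, and sizeof(double) is assumed to be 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the paths pa_memcpy() can take. */
enum copy_path { PATH_DOUBLE, PATH_WORD, PATH_SHIFT_MERGE, PATH_BYTE };

#define THRESHOLD 16	/* same cutoff as in the patch */

/* Illustrative helper, not part of the patch. */
static enum copy_path pick_path(uintptr_t src, uintptr_t dst, size_t len)
{
	uintptr_t diff = src ^ dst;	/* low bits differ iff offsets differ */

	if ((diff & (sizeof(double) - 1)) == 0)
		return PATH_DOUBLE;	/* same offset within a doubleword */
	if (len < THRESHOLD)
		return PATH_BYTE;	/* too short to be worth aligning */
	if ((diff & (sizeof(unsigned int) - 1)) == 0)
		return PATH_WORD;	/* same offset within a word */
	return PATH_SHIFT_MERGE;	/* align dst, shift src words into place */
}
```

For example, pick_path(0x1000, 0x2004, 100) selects the word loop, while the same pointers with a length of 8 fall back to the byte loop.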
Unfortunately this doesn't speed up as many things as I thought it
would. Quite a bit of userspace copying goes through
copy_user_page(), which already uses an unrolled register copy loop. I
suspect this implementation is still slightly faster than that, so
possibly that can be slightly optimized as well. You will see the most
improvement in pipe or local socket intensive workloads. On the a500
that I tested on, this improved bw_pipe and bw_unix (from lmbench)
results by about 400%. But for e.g. ftp or make vmlinux there was no
measurable performance difference.
I am still verifying the error handling, so there is a
chance that this patch will corrupt your data. Use with care! With that
caveat, suggestions for further optimizations and improvements are much
appreciated!!
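The error-handling contract being verified is that the copy routines return 0 on success and otherwise the number of bytes not transferred. A minimal userspace model of that contract is sketched here. It is only a model: model_copy() and its fault_at parameter are hypothetical, standing in for the first offset that would trap; the real kernel finds the fixup through the exception table rather than a parameter.

```c
#include <stddef.h>

/*
 * Userspace model only. "fault_at" is a made-up stand-in for the first
 * offset at which an access would fault.
 */
static size_t model_copy(unsigned char *dst, const unsigned char *src,
			 size_t len, size_t fault_at)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (i == fault_at)
			return len - i;	/* bytes NOT copied */
		dst[i] = src[i];
	}
	return 0;			/* everything was copied */
}
```

A faster implementation has to report the same count even when the fault lands in the middle of an unrolled block, which is the part that is hard to get right.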
Many thanks to Grant and Carlos for looking over and helping me spot
some problems in my earlier attempts.
randolph
Index: arch/parisc/lib/Makefile
===================================================================
RCS file: /var/cvs/linux-2.6/arch/parisc/lib/Makefile,v
retrieving revision 1.2
diff -u -p -r1.2 Makefile
--- arch/parisc/lib/Makefile 1 Jul 2004 18:30:36 -0000 1.2
+++ arch/parisc/lib/Makefile 9 Sep 2004 23:51:35 -0000
@@ -2,6 +2,6 @@
# Makefile for parisc-specific library files
#
-lib-y := lusercopy.o bitops.o checksum.o io.o memset.o
+lib-y := lusercopy.o bitops.o checksum.o io.o memset.o memcpy.o
lib-$(CONFIG_SMP) += debuglocks.o
Index: arch/parisc/lib/memcpy.c
===================================================================
RCS file: arch/parisc/lib/memcpy.c
diff -N arch/parisc/lib/memcpy.c
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ arch/parisc/lib/memcpy.c 9 Sep 2004 23:51:35 -0000
@@ -0,0 +1,457 @@
+/*
+ * Optimized memory copy routines.
+ *
+ * Copyright (C) 2004 Randolph Chung <tausq@debian.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Portions derived from the GNU C Library
+ * Copyright (C) 1991, 1997, 2003 Free Software Foundation, Inc.
+ *
+ * Several strategies are tried to get the best performance for various
+ * conditions. In the optimal case, we copy 64-bytes in an unrolled loop using
+ * fp regs. This is followed by loops that copy 32- or 16-bytes at a time using
+ * general registers. Unaligned copies are handled either by aligning the
+ * destination and then using shift-and-write method, or in a few cases by
+ * falling back to a byte-at-a-time copy.
+ *
+ * I chose to implement this in C because it is easier to maintain and debug,
+ * and in my experiments it appears that the C code generated by gcc (3.3/3.4
+ * at the time of writing) is fairly optimal. Unfortunately some of the
+ * semantics of the copy routine (exception handling) are difficult to express
+ * in C, so we have to play some tricks to get it to work.
+ *
+ * All the loads and stores are done via explicit asm() code in order to use
+ * the right space registers.
+ *
+ * Testing with various alignments and buffer sizes shows that this code is
+ * often >10x faster than a simple byte-at-a-time copy, even for strangely
+ * aligned operands. It is interesting to note that the glibc version
+ * of memcpy (written in C) is actually quite fast already. This routine is
+ * able to beat it by 30-40% for aligned copies because of the loop unrolling,
+ * but in some cases the glibc version is still slightly faster. This lends
+ * more credibility to the idea that gcc can generate very good code as long
+ * as we are careful.
+ */
+
+#ifdef __KERNEL__
+#include <linux/config.h>
+#include <linux/compiler.h>
+#include <asm/uaccess.h>
+#define s_space "%%sr1"
+#define d_space "%%sr2"
+#else
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+#define inline __inline__ __attribute__((always_inline))
+#define s_space "%%sr0"
+#define d_space "%%sr0"
+#define pa_memcpy new2_copy
+#define L1_CACHE_BYTES 64
+#endif
+
+#define preserve_branch(label) do { \
+ volatile int dummy; \
+ /* The following branch is never taken, it's just here to */ \
+ /* prevent gcc from optimizing away our exception code. */ \
+ if (unlikely(dummy != dummy)) \
+ goto label; \
+} while (0)
+
+#define get_user_space() (segment_eq(get_fs(), KERNEL_DS) ? 0 : mfsp(3))
+
+#define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))
+#define THRESHOLD 16
+
+#ifndef __LP64__
+#define EXC_WORD ".word"
+#else
+#define EXC_WORD ".dword"
+#endif
+
+#define def_load_ai_insn(_insn,_sz,_tt,_s,_a,_t,_e) \
+ __asm__ __volatile__ ( \
+ "1:\t" #_insn ",ma " #_sz "(" _s ",%1), %0\n" \
+ "\t.section __ex_table,\"aw\"\n" \
+ "\t" EXC_WORD "\t1b\n" \
+ "\t" EXC_WORD "\t(" #_e "-1b)\n" \
+ "\t.previous\n" \
+ : "=" #_tt(_t), "+r"(_a) \
+ : "1"(_a) \
+ : "r8")
+
+#define def_store_ai_insn(_insn,_sz,_tt,_s,_a,_t,_e) \
+ __asm__ __volatile__ ( \
+ "1:\t" #_insn ",ma %1, " #_sz "(" _s ",%0)\n" \
+ "\t.section __ex_table,\"aw\"\n" \
+ "\t" EXC_WORD "\t1b\n" \
+ "\t" EXC_WORD "\t(" #_e "-1b)\n" \
+ "\t.previous\n" \
+ : "+r"(_a) \
+ : #_tt(_t), "0"(_a) \
+ : "r8")
+
+#define ldbma(_s, _a, _t, _e) def_load_ai_insn(ldbs,1,r,_s,_a,_t,_e)
+#define stbma(_s, _t, _a, _e) def_store_ai_insn(stbs,1,r,_s,_a,_t,_e)
+#define ldwma(_s, _a, _t, _e) def_load_ai_insn(ldw,4,r,_s,_a,_t,_e)
+#define stwma(_s, _t, _a, _e) def_store_ai_insn(stw,4,r,_s,_a,_t,_e)
+#define flddma(_s, _a, _t, _e) def_load_ai_insn(fldd,8,f,_s,_a,_t,_e)
+#define fstdma(_s, _t, _a, _e) def_store_ai_insn(fstd,8,f,_s,_a,_t,_e)
+
+#define ldw(_s,_o,_a,_t,_e) \
+ __asm__ __volatile__ ( \
+ "1:\tldw " #_o "(" _s ",%1), %0\n" \
+ "\t.section __ex_table,\"aw\"\n" \
+ "\t" EXC_WORD "\t1b\n" \
+ "\t" EXC_WORD "\t(" #_e "-1b)\n" \
+ "\t.previous\n" \
+ : "=r"(_t) \
+ : "r"(_a) \
+ : "r8")
+
+#define stw(_s,_t,_o,_a,_e) \
+ __asm__ __volatile__ ( \
+ "1:\tstw %0, " #_o "(" _s ",%1)\n" \
+ "\t.section __ex_table,\"aw\"\n" \
+ "\t" EXC_WORD "\t1b\n" \
+ "\t" EXC_WORD "\t(" #_e "-1b)\n" \
+ "\t.previous\n" \
+ : \
+ : "r"(_t), "r"(_a) \
+ : "r8")
+
+#ifdef CONFIG_PREFETCH
+extern inline void prefetch_src(const void *addr)
+{
+ __asm__("ldw 0(" s_space ",%0), %%r0" : : "r" (addr));
+}
+
+extern inline void prefetch_dst(const void *addr)
+{
+ __asm__("ldd 0(" d_space ",%0), %%r0" : : "r" (addr));
+}
+#else
+#define prefetch_src(addr)
+#define prefetch_dst(addr)
+#endif
+
+/* Copy from a not-aligned src to an aligned dst, using shifts. Handles 4 words
+ * per loop. This code is derived from glibc.
+ */
+static inline unsigned long copy_dstaligned(unsigned long dst, unsigned long src, unsigned long len)
+{
+ /* gcc complains that a2 and a3 may be uninitialized, but actually
+ * they cannot be. Initialize a2/a3 to shut gcc up.
+ */
+ register unsigned int a0, a1, a2 = 0, a3 = 0;
+ int sh_1, sh_2;
+
+ /* prefetch_src((const void *)src); */
+
+ /* Calculate how to shift a word read at the memory operation
+ aligned srcp to make it aligned for copy. */
+ sh_1 = 8 * (src % sizeof(unsigned int));
+ sh_2 = 8 * sizeof(unsigned int) - sh_1;
+
+ /* Make src aligned by rounding it down. */
+ src &= -sizeof(unsigned int);
+
+ switch (len % 4)
+ {
+ case 2:
+ /* a1 = ((unsigned int *) src)[0];
+ a2 = ((unsigned int *) src)[1]; */
+ ldw(s_space, 0, src, a1, copy_dstaligned_exc);
+ ldw(s_space, 4, src, a2, copy_dstaligned_exc);
+ src -= 1 * sizeof(unsigned int);
+ dst -= 3 * sizeof(unsigned int);
+ len += 2;
+ goto do1;
+ case 3:
+ /* a0 = ((unsigned int *) src)[0];
+ a1 = ((unsigned int *) src)[1]; */
+ ldw(s_space, 0, src, a0, copy_dstaligned_exc);
+ ldw(s_space, 4, src, a1, copy_dstaligned_exc);
+ src -= 0 * sizeof(unsigned int);
+ dst -= 2 * sizeof(unsigned int);
+ len += 1;
+ goto do2;
+ case 0:
+ if (len == 0)
+ return 0;
+ /* a3 = ((unsigned int *) src)[0];
+ a0 = ((unsigned int *) src)[1]; */
+ ldw(s_space, 0, src, a3, copy_dstaligned_exc);
+ ldw(s_space, 4, src, a0, copy_dstaligned_exc);
+ src -=-1 * sizeof(unsigned int);
+ dst -= 1 * sizeof(unsigned int);
+ len += 0;
+ goto do3;
+ case 1:
+ /* a2 = ((unsigned int *) src)[0];
+ a3 = ((unsigned int *) src)[1]; */
+ ldw(s_space, 0, src, a2, copy_dstaligned_exc);
+ ldw(s_space, 4, src, a3, copy_dstaligned_exc);
+ src -=-2 * sizeof(unsigned int);
+ dst -= 0 * sizeof(unsigned int);
+ len -= 1;
+ if (len == 0)
+ goto do0;
+ goto do4; /* No-op. */
+ }
+
+ do
+ {
+ /* prefetch_src((const void *)(src + 4 * sizeof(unsigned int))); */
+do4:
+ /* a0 = ((unsigned int *) src)[0]; */
+ ldw(s_space, 0, src, a0, copy_dstaligned_exc);
+ /* ((unsigned int *) dst)[0] = MERGE (a2, sh_1, a3, sh_2); */
+ stw(d_space, MERGE (a2, sh_1, a3, sh_2), 0, dst, copy_dstaligned_exc);
+do3:
+ /* a1 = ((unsigned int *) src)[1]; */
+ ldw(s_space, 4, src, a1, copy_dstaligned_exc);
+ /* ((unsigned int *) dst)[1] = MERGE (a3, sh_1, a0, sh_2); */
+ stw(d_space, MERGE (a3, sh_1, a0, sh_2), 4, dst, copy_dstaligned_exc);
+do2:
+ /* a2 = ((unsigned int *) src)[2]; */
+ ldw(s_space, 8, src, a2, copy_dstaligned_exc);
+ /* ((unsigned int *) dst)[2] = MERGE (a0, sh_1, a1, sh_2); */
+ stw(d_space, MERGE (a0, sh_1, a1, sh_2), 8, dst, copy_dstaligned_exc);
+do1:
+ /* a3 = ((unsigned int *) src)[3]; */
+ ldw(s_space, 12, src, a3, copy_dstaligned_exc);
+ /* ((unsigned int *) dst)[3] = MERGE (a1, sh_1, a2, sh_2); */
+ stw(d_space, MERGE (a1, sh_1, a2, sh_2), 12, dst, copy_dstaligned_exc);
+
+ src += 4 * sizeof(unsigned int);
+ dst += 4 * sizeof(unsigned int);
+ len -= 4;
+ }
+ while (len != 0);
+
+do0:
+ /* ((unsigned int *) dst)[0] = MERGE (a2, sh_1, a3, sh_2); */
+ stw(d_space, MERGE (a2, sh_1, a3, sh_2), 0, dst, copy_dstaligned_exc);
+
+ preserve_branch(handle_error);
+
+ return 0;
+
+handle_error:
+ __asm__ __volatile__ ("copy_dstaligned_exc:\n");
+ printk("copy_dstaligned_exc! returning with %lu\n", len * 4);
+ return len * 4;
+}
+
+
+/* Returns 0 for success, otherwise, returns number of bytes not transferred. */
+unsigned long pa_memcpy(void *dstp, const void *srcp, unsigned long len)
+{
+ register unsigned long src, dst, t1, t2, t3;
+ register unsigned char *pcs, *pcd;
+ register unsigned int *pws, *pwd;
+ register double *pds, *pdd;
+ unsigned long ret = 0;
+
+ src = (unsigned long)srcp;
+ dst = (unsigned long)dstp;
+ pcs = (unsigned char *)srcp;
+ pcd = (unsigned char *)dstp;
+
+ /* prefetch_src((const void *)srcp); */
+
+ if (unlikely(len == 0))
+ return 0;
+
+ /* Check alignment */
+ t1 = (src ^ dst);
+ if (unlikely(t1 & (sizeof(double)-1)))
+ goto unaligned_copy;
+
+ /* src and dst have same alignment. */
+
+ /* Copy bytes till we are double-aligned. */
+ t2 = src & (sizeof(double) - 1);
+ if (unlikely(t2 != 0)) {
+ t2 = sizeof(double) - t2;
+ while (t2 && len) {
+ /* *pcd++ = *pcs++; */
+ ldbma(s_space, pcs, t3, pa_memcpy_exc);
+ len--;
+ stbma(d_space, t3, pcd, pa_memcpy_exc);
+ t2--;
+ }
+ }
+
+ pds = (double *)pcs;
+ pdd = (double *)pcd;
+
+ /* Copy 8 doubles at a time */
+ while (len >= 8*sizeof(double)) {
+ register double r1, r2, r3, r4, r5, r6, r7, r8;
+ /* prefetch_src((char *)pds + L1_CACHE_BYTES); */
+ flddma(s_space, pds, r1, pa_memcpy_exc);
+ flddma(s_space, pds, r2, pa_memcpy_exc);
+ flddma(s_space, pds, r3, pa_memcpy_exc);
+ flddma(s_space, pds, r4, pa_memcpy_exc);
+ fstdma(d_space, r1, pdd, pa_memcpy_exc);
+ fstdma(d_space, r2, pdd, pa_memcpy_exc);
+ fstdma(d_space, r3, pdd, pa_memcpy_exc);
+ fstdma(d_space, r4, pdd, pa_memcpy_exc);
+
+#if 0
+ if (L1_CACHE_BYTES <= 32)
+ prefetch_src((char *)pds + L1_CACHE_BYTES);
+#endif
+ flddma(s_space, pds, r5, pa_memcpy_exc);
+ flddma(s_space, pds, r6, pa_memcpy_exc);
+ flddma(s_space, pds, r7, pa_memcpy_exc);
+ flddma(s_space, pds, r8, pa_memcpy_exc);
+ fstdma(d_space, r5, pdd, pa_memcpy_exc);
+ fstdma(d_space, r6, pdd, pa_memcpy_exc);
+ fstdma(d_space, r7, pdd, pa_memcpy_exc);
+ fstdma(d_space, r8, pdd, pa_memcpy_exc);
+ len -= 8*sizeof(double);
+ }
+
+ pws = (unsigned int *)pds;
+ pwd = (unsigned int *)pdd;
+
+word_copy:
+ while (len >= 8*sizeof(unsigned int)) {
+ register unsigned int r1,r2,r3,r4,r5,r6,r7,r8;
+ /* prefetch_src((char *)pws + L1_CACHE_BYTES); */
+ ldwma(s_space, pws, r1, pa_memcpy_exc);
+ ldwma(s_space, pws, r2, pa_memcpy_exc);
+ ldwma(s_space, pws, r3, pa_memcpy_exc);
+ ldwma(s_space, pws, r4, pa_memcpy_exc);
+ stwma(d_space, r1, pwd, pa_memcpy_exc);
+ stwma(d_space, r2, pwd, pa_memcpy_exc);
+ stwma(d_space, r3, pwd, pa_memcpy_exc);
+ stwma(d_space, r4, pwd, pa_memcpy_exc);
+
+ ldwma(s_space, pws, r5, pa_memcpy_exc);
+ ldwma(s_space, pws, r6, pa_memcpy_exc);
+ ldwma(s_space, pws, r7, pa_memcpy_exc);
+ ldwma(s_space, pws, r8, pa_memcpy_exc);
+ stwma(d_space, r5, pwd, pa_memcpy_exc);
+ stwma(d_space, r6, pwd, pa_memcpy_exc);
+ stwma(d_space, r7, pwd, pa_memcpy_exc);
+ stwma(d_space, r8, pwd, pa_memcpy_exc);
+ len -= 8*sizeof(unsigned int);
+ }
+
+ while (len >= 4*sizeof(unsigned int)) {
+ register unsigned int r1,r2,r3,r4;
+ ldwma(s_space, pws, r1, pa_memcpy_exc);
+ ldwma(s_space, pws, r2, pa_memcpy_exc);
+ ldwma(s_space, pws, r3, pa_memcpy_exc);
+ ldwma(s_space, pws, r4, pa_memcpy_exc);
+ stwma(d_space, r1, pwd, pa_memcpy_exc);
+ stwma(d_space, r2, pwd, pa_memcpy_exc);
+ stwma(d_space, r3, pwd, pa_memcpy_exc);
+ stwma(d_space, r4, pwd, pa_memcpy_exc);
+ len -= 4*sizeof(unsigned int);
+ }
+
+ pcs = (unsigned char *)pws;
+ pcd = (unsigned char *)pwd;
+
+byte_copy:
+ while (len) {
+ /* *pcd++ = *pcs++; */
+ ldbma(s_space, pcs, t3, pa_memcpy_exc);
+ stbma(d_space, t3, pcd, pa_memcpy_exc);
+ len--;
+ }
+
+ return 0;
+
+unaligned_copy:
+ if (len < THRESHOLD)
+ goto byte_copy;
+
+ /* possibly we are aligned on a word, but not on a double... */
+ if (likely((t1 & (sizeof(unsigned int)-1)) == 0)) {
+ t2 = src & (sizeof(unsigned int) - 1);
+
+ if (unlikely(t2 != 0)) {
+ t2 = sizeof(unsigned int) - t2;
+ while (t2) {
+ /* *pcd++ = *pcs++; */
+ ldbma(s_space, pcs, t3, pa_memcpy_exc);
+ stbma(d_space, t3, pcd, pa_memcpy_exc);
+ len--; t2--;
+ }
+ }
+
+ pws = (unsigned int *)pcs;
+ pwd = (unsigned int *)pcd;
+ goto word_copy;
+ }
+
+ /* Align the destination. */
+ if (unlikely((dst & (sizeof(unsigned int) - 1)) != 0)) {
+ t2 = sizeof(unsigned int) - (dst & (sizeof(unsigned int) - 1));
+ while (t2) {
+ /* *pcd++ = *pcs++; */
+ ldbma(s_space, pcs, t3, pa_memcpy_exc);
+ stbma(d_space, t3, pcd, pa_memcpy_exc);
+ len--;
+ t2--;
+ }
+ dst = (unsigned long)pcd;
+ src = (unsigned long)pcs;
+ }
+
+ ret = copy_dstaligned(dst, src, len / sizeof(unsigned int));
+ if (ret)
+ return ret;
+
+ pcs += (len & -sizeof(unsigned int));
+ pcd += (len & -sizeof(unsigned int));
+ len %= sizeof(unsigned int);
+
+ preserve_branch(handle_error);
+
+ goto byte_copy;
+
+handle_error:
+ __asm__ __volatile__ ("pa_memcpy_exc:\n");
+ printk("pa_memcpy_exc! returning with %lu\n", len + 1);
+ return len + 1;
+}
+
+unsigned long copy_to_user(void __user *dst, const void *src, unsigned long len)
+{
+ mtsp(0, 1);
+ mtsp(get_user_space(), 2);
+ return pa_memcpy(dst, src, len);
+}
+
+unsigned long copy_from_user(void *dst, const void __user *src, unsigned long len)
+{
+ mtsp(get_user_space(), 1);
+ mtsp(0, 2);
+ return pa_memcpy(dst, src, len);
+}
+
+unsigned long copy_in_user(void __user *dst, const void __user *src, unsigned long len)
+{
+ mtsp(get_user_space(), 1);
+ mtsp(get_user_space(), 2);
+ return pa_memcpy(dst, src, len);
+}
Index: arch/parisc/Makefile
===================================================================
RCS file: /var/cvs/linux-2.6/arch/parisc/Makefile,v
retrieving revision 1.13
diff -u -p -r1.13 Makefile
--- arch/parisc/Makefile 8 Sep 2004 15:06:48 -0000 1.13
+++ arch/parisc/Makefile 9 Sep 2004 23:51:35 -0000
@@ -38,7 +38,7 @@ cflags-y := -pipe
cflags-y += -mno-space-regs -mfast-indirect-calls
# No fixed-point multiply
-cflags-y += -mdisable-fpregs
+#cflags-y += -mdisable-fpregs
# Without this, "ld -r" results in .text sections that are too big
# (> 0x40000) for branches to reach stubs.
Index: include/asm-parisc/uaccess.h
===================================================================
RCS file: /var/cvs/linux-2.6/include/asm-parisc/uaccess.h,v
retrieving revision 1.13
diff -u -p -r1.13 uaccess.h
--- include/asm-parisc/uaccess.h 4 Feb 2004 18:24:55 -0000 1.13
+++ include/asm-parisc/uaccess.h 9 Sep 2004 23:51:35 -0000
@@ -273,11 +273,11 @@ extern long lstrnlen_user(const char __u
#define clear_user lclear_user
#define __clear_user lclear_user
-#define copy_from_user lcopy_from_user
-#define __copy_from_user lcopy_from_user
-#define copy_to_user lcopy_to_user
-#define __copy_to_user lcopy_to_user
-#define copy_in_user lcopy_in_user
-#define __copy_in_user lcopy_in_user
+unsigned long copy_to_user(void __user *dst, const void *src, unsigned long len);
+#define __copy_to_user copy_to_user
+unsigned long copy_from_user(void *dst, const void __user *src, unsigned long len);
+#define __copy_from_user copy_from_user
+unsigned long copy_in_user(void __user *dst, const void __user *src, unsigned long len);
+#define __copy_in_user copy_in_user
#endif /* __PARISC_UACCESS_H */
--
Randolph Chung
Debian GNU/Linux Developer, hppa/ia64 ports
http://www.tausq.org/
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux
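For readers who want to see the shift-and-write idea from copy_dstaligned() in isolation, here is a minimal userspace sketch. It is not the patch's code: load_be32(), store_be32() and shift_merge_copy() are made-up helper names, and words are assembled byte by byte in big-endian order (as on PA-RISC) so the sketch behaves the same on any host. Only the MERGE macro is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))

/* Read 4 bytes as one big-endian word (illustrative helper). */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write one word as 4 big-endian bytes (illustrative helper). */
static void store_be32(unsigned char *p, uint32_t w)
{
	p[0] = (unsigned char)(w >> 24);
	p[1] = (unsigned char)(w >> 16);
	p[2] = (unsigned char)(w >> 8);
	p[3] = (unsigned char)w;
}

/*
 * Copy nwords words to dst from the bytes starting at base + off, where
 * base is a word boundary and off is 1..3. The source is only ever read
 * one whole word at a time; each output word is stitched together from
 * two neighbouring source words. Note that this reads nwords + 1 source
 * words, so the last source word must be readable in full.
 */
static void shift_merge_copy(unsigned char *dst, const unsigned char *base,
			     unsigned int off, size_t nwords)
{
	unsigned int sh_1, sh_2;
	uint32_t w0, w1;
	size_t i;

	assert(off >= 1 && off <= 3);	/* off == 0 would shift by 32 bits */
	sh_1 = 8 * off;
	sh_2 = 32 - sh_1;

	w0 = load_be32(base);
	for (i = 0; i < nwords; i++) {
		w1 = load_be32(base + 4 * (i + 1));
		store_be32(dst + 4 * i, MERGE(w0, sh_1, w1, sh_2));
		w0 = w1;
	}
}
```

The patch unrolls this loop four times and rotates the roles of a0..a3 instead of copying w1 into w0, which saves a register move per word.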
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
2004-09-10 0:00 [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines Randolph Chung
@ 2004-09-10 0:15 ` Randolph Chung
2004-09-10 14:04 ` Carlos O'Donell
[not found] ` <200409100656.i8A6u4hI026343@hiauly1.hia.nrc.ca>
2004-09-16 19:18 ` Randolph Chung
2 siblings, 1 reply; 8+ messages in thread
From: Randolph Chung @ 2004-09-10 0:15 UTC (permalink / raw)
To: parisc-linux
In reference to a message from Randolph Chung, dated Sep 09:
> This has been on my todo list for some time -- the parisc kernel
> currently uses a one-byte-at-a-time implementation for
> copy_{to,from,in}_user. Obviously there is much room for improvement.
> Here's an attempt at a more optimal version.
A few more addon questions/RFCs:
- this patch relies on using fp regs for copy operations. In our current
  kernel we actually have -mdisable-fpregs, but i was not able to
  ascertain why. we don't currently have fp lazy state saving. there's
  been some talk about doing this, but it requires some help from gcc.
  is this ok?
- i have a test suite for testing the copy routine in userspace, so
  memcpy.c is written in a way to make it easily sharable between
  userspace and kernel. is that ok to include the userspace support code
  in the kernel?
- this is the first time the kernel has had to change sr2 in the kernel
  (thus Carlos' earlier patch to restore it in the syscall path).
  Arguably we should not do the sr2 save/restore in the syscall path and
  instead do it in the copy routines, in order to make syscall as fast
  as possible. Thoughts?
  Alternatively, if we play some tricks with the preprocessor we can get
  away with only using sr1, but having 3 copies of the copy routines in
  the code. I actually tried this, but found the hacks required to do it
  to be rather ugly.
- gcc gurus (Dave :) might think what I'm doing with exceptions to be
  too ugly^Wfragile. are there better ways to do it? :)
  this is a somewhat compelling reason to write the whole thing in
  assembly (which will also allow for a few more small optimizations).
  do you think the advantages of maintaining C code outweigh the
  slight performance penalty and hacks required?
randolph
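The userspace test suite mentioned above is not included in the thread. As a rough idea of what such a harness could look like, here is a minimal sketch that checks a copy routine against every source and destination offset within a doubleword and every length up to a limit, and also verifies that nothing outside the destination range is touched. All names (copy_fn, check_copy, off_by_one_copy) are hypothetical and not taken from the author's suite.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OFF 8	/* try every offset within a doubleword */
#define BUF_LEN 256

typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

/* Deliberately broken routine, to show that the harness catches errors. */
static void *off_by_one_copy(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len ? len - 1 : 0);
}

/*
 * Returns 0 if fn behaves like memcpy for all tested offsets and lengths,
 * 1 on the first mismatch, -1 if max_len does not fit the scratch buffers.
 */
static int check_copy(copy_fn fn, size_t max_len)
{
	unsigned char src[BUF_LEN], dst[BUF_LEN], want[BUF_LEN];
	size_t soff, doff, len, i;

	if (max_len + MAX_OFF > BUF_LEN)
		return -1;

	for (i = 0; i < BUF_LEN; i++)
		src[i] = (unsigned char)(i * 7 + 1);

	for (soff = 0; soff < MAX_OFF; soff++) {
		for (doff = 0; doff < MAX_OFF; doff++) {
			for (len = 0; len <= max_len; len++) {
				/* 0xAA guard pattern detects stray writes */
				memset(dst, 0xAA, sizeof(dst));
				memset(want, 0xAA, sizeof(want));
				for (i = 0; i < len; i++)
					want[doff + i] = src[soff + i];

				fn(dst + doff, src + soff, len);
				if (memcmp(dst, want, sizeof(dst)) != 0)
					return 1;
			}
		}
	}
	return 0;
}
```

Calling check_copy(memcpy, 64) should return 0, while check_copy(off_by_one_copy, 64) should report a mismatch. In the userspace build of memcpy.c the routine under test would be new2_copy, wrapped to match the copy_fn signature.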
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
2004-09-10 0:15 ` Randolph Chung
@ 2004-09-10 14:04 ` Carlos O'Donell
0 siblings, 0 replies; 8+ messages in thread
From: Carlos O'Donell @ 2004-09-10 14:04 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
> - this is the first time the kernel has had to change sr2 in the kernel
> (thus Carlos' earlier patch to restore it in the syscall path).
> Arguably we should not do the sr2 save/restore in the syscall path and
> instead do it in the copy routines, in order to make syscall as fast
> as possible. Thoughts?
In an optimal world:
- Syscalls must only copy sr7 (user space id) to sr3 (where the kernel
  expects user space id to live), and then set sr7 to kernel space id
  zero, when ready.
- Light weight syscalls clear sr2 so that users don't abuse security.
- syscall_restore should set sr7 to user space id and clear sr2 for
  syscalls (don't reload from TASK_PT).
- syscall_restore_rfi should store sr3 into sr7, and r0 into sr2 for
  restoration in restore_rfi.
- When starting a new process sr7 should be set to the process's space
  id, and sr2 should be cleared for making syscalls with.
a. No userspace code will be using scratch space registers because
   we don't support that. We will be clobbering sr2 for the user.
   If in the future we decided to change the kernel's space, we just
   clobber it to something else. As long as no userspace code is
   setting sr2 and expecting it to stay set.
b. The kernel might use temp scratch registers, in which case the
   kernel can use sr1,sr2,sr5 and sr6, though for each sr we use
   as a temporary we have to incur the penalty of a save/restore
   during an interruption (I think that's right eh?).
c. On a schedule we should be saving only the state of those sr's
   that are used as temp and are important to the process?
--- I need to review code here, I don't know what path is taken
during a context switch to save/restore state ---
> Alternatively, if we play some tricks with the preprocessor we can get
> away with only using sr1, but having 3 copies of the copy routines in
> the code. I actually tried this, but found the hacks required to do it
> to be rather ugly.
Difficult to maintain.
> - gcc gurus (Dave :) might think what I'm doing with exceptions to be
> too ugly^Wfragile. are there better ways to do it? :)
GCC might break this if it gets smarter. Though I like the current
method of indicating volatility.
> this is a somewhat compelling reason to write the whole thing in
> assembly (which will also allow for a few more small optimizations).
> do you think the advantages of maintaining C code outweigh the
> slight performance penalty and hacks required?
Perhaps. The C code is nice though, you could *almost* abstract it to
all architectures that have double,single word,byte load store insns :)
Thanks for this wicked work.
c.
[parent not found: <200409100656.i8A6u4hI026343@hiauly1.hia.nrc.ca>]
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
[not found] ` <200409100656.i8A6u4hI026343@hiauly1.hia.nrc.ca>
@ 2004-09-10 8:14 ` Randolph Chung
2004-09-10 8:46 ` John David Anglin
0 siblings, 1 reply; 8+ messages in thread
From: Randolph Chung @ 2004-09-10 8:14 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux
In reference to a message from John David Anglin, dated Sep 10:
> Have you looked at using stby or stdby for unaligned copies?
do you mean for the entire copy, or just to copy enough bytes to get it
aligned? i've thought about it, for replacing the copy-till-aligned
bits, but haven't tried... i always get confused when i try to
understand how those insns work :)
randolph
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
2004-09-10 8:14 ` Randolph Chung
@ 2004-09-10 8:46 ` John David Anglin
0 siblings, 0 replies; 8+ messages in thread
From: John David Anglin @ 2004-09-10 8:46 UTC (permalink / raw)
To: randolph; +Cc: parisc-linux
> In reference to a message from John David Anglin, dated Sep 10:
> > Have you looked at using stby or stdby for unaligned copies?
>
> do you mean for the entire copy, or just to copy enough bytes to get it
> aligned?
I believe that the "b" completer can be used to write the first set of
bytes needed to get you to an alignment boundary. The "e" completer
can be used to store the final residual. The PA block move in GCC uses
the latter.
Basically, you use aligned loads and stby/stdby to store the right set
of bytes at the beginning and end. It's much more efficient than doing
the beginning and end byte by byte.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
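To make the begin/middle/end split concrete, here is a small sketch of the arithmetic only. It does not model the stby/stdby instructions themselves; it just computes how many bytes a "begin" store would have to cover to reach a word boundary, how many whole words follow, and how many residual bytes an "end" store would handle. The struct and function names are made up for illustration, and a 4-byte word is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how a copy to dst would be split up. */
struct copy_split {
	size_t head;	/* bytes up to the first word boundary */
	size_t words;	/* whole aligned words in the middle */
	size_t tail;	/* residual bytes after the last whole word */
};

static struct copy_split split_for_dst(uintptr_t dst, size_t len)
{
	const size_t w = sizeof(uint32_t);
	struct copy_split s;

	/* Distance to the next word boundary; 0 if already aligned. */
	s.head = (w - (dst & (w - 1))) & (w - 1);
	if (s.head > len)
		s.head = len;	/* short copy ends before the boundary */

	s.words = (len - s.head) / w;
	s.tail = (len - s.head) % w;
	return s;
}
```

For a 10-byte copy to an address ending in 1, this gives a 3-byte head, one whole word and a 3-byte tail, so the byte-by-byte loops at both ends would each collapse into a single partial store.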
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines 2004-09-10 0:00 [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines Randolph Chung 2004-09-10 0:15 ` Randolph Chung [not found] ` <200409100656.i8A6u4hI026343@hiauly1.hia.nrc.ca> @ 2004-09-16 19:18 ` Randolph Chung [not found] ` <20040917144813.GH28936@baldric.uwo.ca> 2 siblings, 1 reply; 8+ messages in thread From: Randolph Chung @ 2004-09-16 19:18 UTC (permalink / raw) To: parisc-linux Version 2 of optimized copy routines. Changes compared to the previous version: - uses new exception mechanism to get the error semantics right - cleaned up debugs - added TODO items If there are no more comments, i'll check this in in a couple of days. thanks randolph Index: arch/parisc/Makefile =================================================================== RCS file: /var/cvs/linux-2.6/arch/parisc/Makefile,v retrieving revision 1.14 diff -u -p -r1.14 Makefile --- arch/parisc/Makefile 15 Sep 2004 14:11:49 -0000 1.14 +++ arch/parisc/Makefile 16 Sep 2004 19:15:54 -0000 @@ -38,7 +38,7 @@ cflags-y := -pipe cflags-y += -mno-space-regs -mfast-indirect-calls # No fixed-point multiply -cflags-y += -mdisable-fpregs +#cflags-y += -mdisable-fpregs # Without this, "ld -r" results in .text sections that are too big # (> 0x40000) for branches to reach stubs. 
Index: arch/parisc/lib/Makefile =================================================================== RCS file: /var/cvs/linux-2.6/arch/parisc/lib/Makefile,v retrieving revision 1.3 diff -u -p -r1.3 Makefile --- arch/parisc/lib/Makefile 15 Sep 2004 16:08:48 -0000 1.3 +++ arch/parisc/lib/Makefile 16 Sep 2004 19:15:58 -0000 @@ -2,6 +2,6 @@ # Makefile for parisc-specific library files # -lib-y := lusercopy.o bitops.o checksum.o io.o memset.o fixup.o +lib-y := lusercopy.o bitops.o checksum.o io.o memset.o fixup.o memcpy.o lib-$(CONFIG_SMP) += debuglocks.o Index: arch/parisc/lib/memcpy.c =================================================================== RCS file: arch/parisc/lib/memcpy.c diff -N arch/parisc/lib/memcpy.c --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ arch/parisc/lib/memcpy.c 16 Sep 2004 19:15:58 -0000 @@ -0,0 +1,499 @@ +/* + * Optimized memory copy routines. + * + * Copyright (C) 2004 Randolph Chung <tausq@debian.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Portions derived from the GNU C Library + * Copyright (C) 1991, 1997, 2003 Free Software Foundation, Inc. + * + * Several strategies are tried to try to get the best performance for various + * conditions. In the optimal case, we copy 64-bytes in an unrolled loop using + * fp regs. 
This is followed by loops that copy 32- or 16-bytes at a time using + * general registers. Unaligned copies are handled either by aligning the + * destination and then using shift-and-write method, or in a few cases by + * falling back to a byte-at-a-time copy. + * + * I chose to implement this in C because it is easier to maintain and debug, + * and in my experiments it appears that the C code generated by gcc (3.3/3.4 + * at the time of writing) is fairly optimal. Unfortunately some of the + * semantics of the copy routine (exception handling) is difficult to express + * in C, so we have to play some tricks to get it to work. + * + * All the loads and stores are done via explicit asm() code in order to use + * the right space registers. + * + * Testing with various alignments and buffer sizes shows that this code is + * often >10x faster than a simple byte-at-a-time copy, even for strangely + * aligned operands. It is interesting to note that the glibc version + * of memcpy (written in C) is actually quite fast already. This routine is + * able to beat it by 30-40% for aligned copies because of the loop unrolling, + * but in some cases the glibc version is still slightly faster. This lends + * more credibility that gcc can generate very good code as long as we are + * careful. 
+ * + * TODO: + * - cache prefetching needs more experimentation to get optimal settings + * - try not to use the post-increment address modifiers; they create additional + * interlocks + * - replace byte-copy loops with stybs sequences + */ + +#ifdef __KERNEL__ +#include <linux/config.h> +#include <linux/compiler.h> +#include <asm/uaccess.h> +#define s_space "%%sr1" +#define d_space "%%sr2" +#else +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) +#define inline __inline__ __attribute__((always_inline)) +#define s_space "%%sr0" +#define d_space "%%sr0" +#define pa_memcpy new2_copy +#define L1_CACHE_BYTES 64 +#endif + +DECLARE_PER_CPU(struct exception_data, exception_data); + +#define preserve_branch(label) do { \ + volatile int dummy; \ + /* The following branch is never taken, it's just here to */ \ + /* prevent gcc from optimizing away our exception code. */ \ + if (unlikely(dummy != dummy)) \ + goto label; \ +} while (0) + +#define get_user_space() (segment_eq(get_fs(), KERNEL_DS) ? 0 : mfsp(3)) + +#define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2))) +#define THRESHOLD 16 + +#ifdef DEBUG_MEMCPY +#define DPRINTF(fmt, args...) do { printk(KERN_DEBUG "%s:%d:%s ", __FILE__, __LINE__, __FUNCTION__ ); printk(KERN_DEBUG fmt, ##args ); } while (0) +#else +#define DPRINTF(fmt, args...) 
+#endif
+
+#ifndef __LP64__
+#define EXC_WORD ".word"
+#else
+#define EXC_WORD ".dword"
+#endif
+
+#define def_load_ai_insn(_insn,_sz,_tt,_s,_a,_t,_e) \
+	__asm__ __volatile__ ( \
+	"1:\t" #_insn ",ma " #_sz "(" _s ",%1), %0\n" \
+	"\t.section __ex_table,\"aw\"\n" \
+	"\t" EXC_WORD "\t1b\n" \
+	"\t" EXC_WORD "\t" #_e "\n" \
+	"\t.previous\n" \
+	: "=" #_tt(_t), "+r"(_a) \
+	: "1"(_a) \
+	: "r8")
+
+#define def_store_ai_insn(_insn,_sz,_tt,_s,_a,_t,_e) \
+	__asm__ __volatile__ ( \
+	"1:\t" #_insn ",ma %1, " #_sz "(" _s ",%0)\n" \
+	"\t.section __ex_table,\"aw\"\n" \
+	"\t" EXC_WORD "\t1b\n" \
+	"\t" EXC_WORD "\t" #_e "\n" \
+	"\t.previous\n" \
+	: "+r"(_a) \
+	: #_tt(_t), "0"(_a) \
+	: "r8")
+
+#define ldbma(_s, _a, _t, _e) def_load_ai_insn(ldbs,1,r,_s,_a,_t,_e)
+#define stbma(_s, _t, _a, _e) def_store_ai_insn(stbs,1,r,_s,_a,_t,_e)
+#define ldwma(_s, _a, _t, _e) def_load_ai_insn(ldw,4,r,_s,_a,_t,_e)
+#define stwma(_s, _t, _a, _e) def_store_ai_insn(stw,4,r,_s,_a,_t,_e)
+#define flddma(_s, _a, _t, _e) def_load_ai_insn(fldd,8,f,_s,_a,_t,_e)
+#define fstdma(_s, _t, _a, _e) def_store_ai_insn(fstd,8,f,_s,_a,_t,_e)
+
+#define ldw(_s,_o,_a,_t,_e) \
+	__asm__ __volatile__ ( \
+	"1:\tldw " #_o "(" _s ",%1), %0\n" \
+	"\t.section __ex_table,\"aw\"\n" \
+	"\t" EXC_WORD "\t1b\n" \
+	"\t" EXC_WORD "\t" #_e "\n" \
+	"\t.previous\n" \
+	: "=r"(_t) \
+	: "r"(_a) \
+	: "r8")
+
+#define stw(_s,_t,_o,_a,_e) \
+	__asm__ __volatile__ ( \
+	"1:\tstw %0, " #_o "(" _s ",%1)\n" \
+	"\t.section __ex_table,\"aw\"\n" \
+	"\t" EXC_WORD "\t1b\n" \
+	"\t" EXC_WORD "\t" #_e "\n" \
+	"\t.previous\n" \
+	: \
+	: "r"(_t), "r"(_a) \
+	: "r8")
+
+#ifdef CONFIG_PREFETCH
+extern inline void prefetch_src(const void *addr)
+{
+	__asm__("ldw 0(" s_space ",%0), %%r0" : : "r" (addr));
+}
+
+extern inline void prefetch_dst(const void *addr)
+{
+	__asm__("ldd 0(" d_space ",%0), %%r0" : : "r" (addr));
+}
+#else
+#define prefetch_src(addr)
+#define prefetch_dst(addr)
+#endif
+
+/* Copy from a not-aligned src to an aligned dst, using shifts. Handles 4 words
+ * per loop. This code is derived from glibc.
+ */
+static inline unsigned long copy_dstaligned(unsigned long dst, unsigned long src, unsigned long len, unsigned long o_dst, unsigned long o_src, unsigned long o_len)
+{
+	/* gcc complains that a2 and a3 may be uninitialized, but actually
+	 * they cannot be. Initialize a2/a3 to shut gcc up.
+	 */
+	register unsigned int a0, a1, a2 = 0, a3 = 0;
+	int sh_1, sh_2;
+	struct exception_data *d;
+
+	/* prefetch_src((const void *)src); */
+
+	/* Calculate how to shift a word read at the memory operation
+	   aligned srcp to make it aligned for copy. */
+	sh_1 = 8 * (src % sizeof(unsigned int));
+	sh_2 = 8 * sizeof(unsigned int) - sh_1;
+
+	/* Make src aligned by rounding it down. */
+	src &= -sizeof(unsigned int);
+
+	switch (len % 4)
+	{
+		case 2:
+			/* a1 = ((unsigned int *) src)[0];
+			   a2 = ((unsigned int *) src)[1]; */
+			ldw(s_space, 0, src, a1, cda_ldw_exc);
+			ldw(s_space, 4, src, a2, cda_ldw_exc);
+			src -= 1 * sizeof(unsigned int);
+			dst -= 3 * sizeof(unsigned int);
+			len += 2;
+			goto do1;
+		case 3:
+			/* a0 = ((unsigned int *) src)[0];
+			   a1 = ((unsigned int *) src)[1]; */
+			ldw(s_space, 0, src, a0, cda_ldw_exc);
+			ldw(s_space, 4, src, a1, cda_ldw_exc);
+			src -= 0 * sizeof(unsigned int);
+			dst -= 2 * sizeof(unsigned int);
+			len += 1;
+			goto do2;
+		case 0:
+			if (len == 0)
+				return 0;
+			/* a3 = ((unsigned int *) src)[0];
+			   a0 = ((unsigned int *) src)[1]; */
+			ldw(s_space, 0, src, a3, cda_ldw_exc);
+			ldw(s_space, 4, src, a0, cda_ldw_exc);
+			src -=-1 * sizeof(unsigned int);
+			dst -= 1 * sizeof(unsigned int);
+			len += 0;
+			goto do3;
+		case 1:
+			/* a2 = ((unsigned int *) src)[0];
+			   a3 = ((unsigned int *) src)[1]; */
+			ldw(s_space, 0, src, a2, cda_ldw_exc);
+			ldw(s_space, 4, src, a3, cda_ldw_exc);
+			src -=-2 * sizeof(unsigned int);
+			dst -= 0 * sizeof(unsigned int);
+			len -= 1;
+			if (len == 0)
+				goto do0;
+			goto do4; /* No-op. */
+	}
+
+	do
+	{
+		/* prefetch_src((const void *)(src + 4 * sizeof(unsigned int))); */
+do4:
+		/* a0 = ((unsigned int *) src)[0]; */
+		ldw(s_space, 0, src, a0, cda_ldw_exc);
+		/* ((unsigned int *) dst)[0] = MERGE (a2, sh_1, a3, sh_2); */
+		stw(d_space, MERGE (a2, sh_1, a3, sh_2), 0, dst, cda_stw_exc);
+do3:
+		/* a1 = ((unsigned int *) src)[1]; */
+		ldw(s_space, 4, src, a1, cda_ldw_exc);
+		/* ((unsigned int *) dst)[1] = MERGE (a3, sh_1, a0, sh_2); */
+		stw(d_space, MERGE (a3, sh_1, a0, sh_2), 4, dst, cda_stw_exc);
+do2:
+		/* a2 = ((unsigned int *) src)[2]; */
+		ldw(s_space, 8, src, a2, cda_ldw_exc);
+		/* ((unsigned int *) dst)[2] = MERGE (a0, sh_1, a1, sh_2); */
+		stw(d_space, MERGE (a0, sh_1, a1, sh_2), 8, dst, cda_stw_exc);
+do1:
+		/* a3 = ((unsigned int *) src)[3]; */
+		ldw(s_space, 12, src, a3, cda_ldw_exc);
+		/* ((unsigned int *) dst)[3] = MERGE (a1, sh_1, a2, sh_2); */
+		stw(d_space, MERGE (a1, sh_1, a2, sh_2), 12, dst, cda_stw_exc);
+
+		src += 4 * sizeof(unsigned int);
+		dst += 4 * sizeof(unsigned int);
+		len -= 4;
+	}
+	while (len != 0);
+
+do0:
+	/* ((unsigned int *) dst)[0] = MERGE (a2, sh_1, a3, sh_2); */
+	stw(d_space, MERGE (a2, sh_1, a3, sh_2), 0, dst, cda_stw_exc);
+
+	preserve_branch(handle_load_error);
+	preserve_branch(handle_store_error);
+
+	return 0;
+
+handle_load_error:
+	__asm__ __volatile__ ("cda_ldw_exc:\n");
+	d = &__get_cpu_var(exception_data);
+	DPRINTF("cda_ldw_exc: o_len=%lu fault_addr=%lu o_src=%lu ret=%lu\n",
+		o_len, d->fault_addr, o_src, o_len - d->fault_addr + o_src);
+	return o_len * 4 - d->fault_addr + o_src;
+
+handle_store_error:
+	__asm__ __volatile__ ("cda_stw_exc:\n");
+	d = &__get_cpu_var(exception_data);
+	DPRINTF("cda_stw_exc: o_len=%lu fault_addr=%lu o_dst=%lu ret=%lu\n",
+		o_len, d->fault_addr, o_dst, o_len - d->fault_addr + o_dst);
+	return o_len * 4 - d->fault_addr + o_dst;
+}
+
+
+/* Returns 0 for success, otherwise, returns number of bytes not transferred. */
+unsigned long pa_memcpy(void *dstp, const void *srcp, unsigned long len)
+{
+	register unsigned long src, dst, t1, t2, t3;
+	register char *pcs, *pcd;
+	register unsigned int *pws, *pwd;
+	register double *pds, *pdd;
+	unsigned long ret = 0;
+	unsigned long o_dst, o_src, o_len;
+	struct exception_data *d;
+
+	src = (unsigned long)srcp;
+	dst = (unsigned long)dstp;
+	pcs = (unsigned char *)srcp;
+	pcd = (unsigned char *)dstp;
+
+	o_dst = dst; o_src = src; o_len = len;
+
+	/* prefetch_src((const void *)srcp); */
+
+	if (unlikely(len == 0))
+		return 0;
+
+	/* Check alignment */
+	t1 = (src ^ dst);
+	if (unlikely(t1 & (sizeof(double)-1)))
+		goto unaligned_copy;
+
+	/* src and dst have same alignment. */
+
+	/* Copy bytes till we are double-aligned. */
+	t2 = src & (sizeof(double) - 1);
+	if (unlikely(t2 != 0)) {
+		t2 = sizeof(double) - t2;
+		while (t2 && len) {
+			/* *pcd++ = *pcs++; */
+			ldbma(s_space, pcs, t3, pmc_load_exc);
+			len--;
+			stbma(d_space, t3, pcd, pmc_store_exc);
+			t2--;
+		}
+	}
+
+	pds = (double *)pcs;
+	pdd = (double *)pcd;
+
+	/* Copy 8 doubles at a time */
+	while (len >= 8*sizeof(double)) {
+		register double r1, r2, r3, r4, r5, r6, r7, r8;
+		/* prefetch_src((char *)pds + L1_CACHE_BYTES); */
+		flddma(s_space, pds, r1, pmc_load_exc);
+		flddma(s_space, pds, r2, pmc_load_exc);
+		flddma(s_space, pds, r3, pmc_load_exc);
+		flddma(s_space, pds, r4, pmc_load_exc);
+		fstdma(d_space, r1, pdd, pmc_store_exc);
+		fstdma(d_space, r2, pdd, pmc_store_exc);
+		fstdma(d_space, r3, pdd, pmc_store_exc);
+		fstdma(d_space, r4, pdd, pmc_store_exc);
+
+#if 0
+		if (L1_CACHE_BYTES <= 32)
+			prefetch_src((char *)pds + L1_CACHE_BYTES);
+#endif
+		flddma(s_space, pds, r5, pmc_load_exc);
+		flddma(s_space, pds, r6, pmc_load_exc);
+		flddma(s_space, pds, r7, pmc_load_exc);
+		flddma(s_space, pds, r8, pmc_load_exc);
+		fstdma(d_space, r5, pdd, pmc_store_exc);
+		fstdma(d_space, r6, pdd, pmc_store_exc);
+		fstdma(d_space, r7, pdd, pmc_store_exc);
+		fstdma(d_space, r8, pdd, pmc_store_exc);
+		len -= 8*sizeof(double);
+	}
+
+	pws = (unsigned int *)pds;
+	pwd = (unsigned int *)pdd;
+
+word_copy:
+	while (len >= 8*sizeof(unsigned int)) {
+		register unsigned int r1,r2,r3,r4,r5,r6,r7,r8;
+		/* prefetch_src((char *)pws + L1_CACHE_BYTES); */
+		ldwma(s_space, pws, r1, pmc_load_exc);
+		ldwma(s_space, pws, r2, pmc_load_exc);
+		ldwma(s_space, pws, r3, pmc_load_exc);
+		ldwma(s_space, pws, r4, pmc_load_exc);
+		stwma(d_space, r1, pwd, pmc_store_exc);
+		stwma(d_space, r2, pwd, pmc_store_exc);
+		stwma(d_space, r3, pwd, pmc_store_exc);
+		stwma(d_space, r4, pwd, pmc_store_exc);
+
+		ldwma(s_space, pws, r5, pmc_load_exc);
+		ldwma(s_space, pws, r6, pmc_load_exc);
+		ldwma(s_space, pws, r7, pmc_load_exc);
+		ldwma(s_space, pws, r8, pmc_load_exc);
+		stwma(d_space, r5, pwd, pmc_store_exc);
+		stwma(d_space, r6, pwd, pmc_store_exc);
+		stwma(d_space, r7, pwd, pmc_store_exc);
+		stwma(d_space, r8, pwd, pmc_store_exc);
+		len -= 8*sizeof(unsigned int);
+	}
+
+	while (len >= 4*sizeof(unsigned int)) {
+		register unsigned int r1,r2,r3,r4;
+		ldwma(s_space, pws, r1, pmc_load_exc);
+		ldwma(s_space, pws, r2, pmc_load_exc);
+		ldwma(s_space, pws, r3, pmc_load_exc);
+		ldwma(s_space, pws, r4, pmc_load_exc);
+		stwma(d_space, r1, pwd, pmc_store_exc);
+		stwma(d_space, r2, pwd, pmc_store_exc);
+		stwma(d_space, r3, pwd, pmc_store_exc);
+		stwma(d_space, r4, pwd, pmc_store_exc);
+		len -= 4*sizeof(unsigned int);
+	}
+
+	pcs = (unsigned char *)pws;
+	pcd = (unsigned char *)pwd;
+
+byte_copy:
+	while (len) {
+		/* *pcd++ = *pcs++; */
+		ldbma(s_space, pcs, t3, pmc_load_exc);
+		stbma(d_space, t3, pcd, pmc_store_exc);
+		len--;
+	}
+
+	return 0;
+
+unaligned_copy:
+	if (len < THRESHOLD)
+		goto byte_copy;
+
+	/* possibly we are aligned on a word, but not on a double... */
+	if (likely(t1 & (sizeof(unsigned int)-1)) == 0) {
+		t2 = src & (sizeof(unsigned int) - 1);
+
+		if (unlikely(t2 != 0)) {
+			t2 = sizeof(unsigned int) - t2;
+			while (t2) {
+				/* *pcd++ = *pcs++; */
+				ldbma(s_space, pcs, t3, pmc_load_exc);
+				stbma(d_space, t3, pcd, pmc_store_exc);
+				t2--;
+			}
+		}
+
+		pws = (unsigned int *)pcs;
+		pwd = (unsigned int *)pcd;
+		goto word_copy;
+	}
+
+	/* Align the destination. */
+	if (unlikely((dst & (sizeof(unsigned int) - 1)) != 0)) {
+		t2 = sizeof(unsigned int) - (dst & (sizeof(unsigned int) - 1));
+		while (t2) {
+			/* *pcd++ = *pcs++; */
+			ldbma(s_space, pcs, t3, pmc_load_exc);
+			stbma(d_space, t3, pcd, pmc_store_exc);
+			len--;
+			t2--;
+		}
+		dst = (unsigned long)pcd;
+		src = (unsigned long)pcs;
+	}
+
+	ret = copy_dstaligned(dst, src, len / sizeof(unsigned int),
+		o_dst, o_src, o_len);
+	if (ret)
+		return ret;
+
+	pcs += (len & -sizeof(unsigned int));
+	pcd += (len & -sizeof(unsigned int));
+	len %= sizeof(unsigned int);
+
+	preserve_branch(handle_load_error);
+	preserve_branch(handle_store_error);
+
+	goto byte_copy;
+
+handle_load_error:
+	__asm__ __volatile__ ("pmc_load_exc:\n");
+	d = &__get_cpu_var(exception_data);
+	DPRINTF("pmc_load_exc: o_len=%lu fault_addr=%lu o_src=%lu ret=%lu\n",
+		o_len, d->fault_addr, o_src, o_len - d->fault_addr + o_src);
+	return o_len - d->fault_addr + o_src;
+
+handle_store_error:
+	__asm__ __volatile__ ("pmc_store_exc:\n");
+	d = &__get_cpu_var(exception_data);
+	DPRINTF("pmc_store_exc: o_len=%lu fault_addr=%lu o_dst=%lu ret=%lu\n",
+		o_len, d->fault_addr, o_dst, o_len - d->fault_addr + o_dst);
+	return o_len - d->fault_addr + o_dst;
+}
+
+#ifdef __KERNEL__
+unsigned long copy_to_user(void __user *dst, const void *src, unsigned long len)
+{
+	mtsp(0, 1);
+	mtsp(get_user_space(), 2);
+	return pa_memcpy(dst, src, len);
+}
+
+unsigned long copy_from_user(void *dst, const void __user *src, unsigned long len)
+{
+	mtsp(get_user_space(), 1);
+	mtsp(0, 2);
+	return pa_memcpy(dst, src, len);
+}
+
+unsigned long copy_in_user(void __user *dst, const void __user *src, unsigned long len)
+{
+	mtsp(get_user_space(), 1);
+	mtsp(get_user_space(), 2);
+	return pa_memcpy(dst, src, len);
+}
+#endif
Index: include/asm-parisc/uaccess.h
===================================================================
RCS file: /var/cvs/linux-2.6/include/asm-parisc/uaccess.h,v
retrieving revision 1.16
diff -u -p -r1.16 uaccess.h
--- include/asm-parisc/uaccess.h 15 Sep 2004 16:08:48 -0000 1.16
+++ include/asm-parisc/uaccess.h 16 Sep 2004 19:16:21 -0000
@@ -267,12 +267,12 @@ extern long lstrnlen_user(const char __u
 #define clear_user lclear_user
 #define __clear_user lclear_user
 
-#define copy_from_user lcopy_from_user
-#define __copy_from_user lcopy_from_user
-#define copy_to_user lcopy_to_user
-#define __copy_to_user lcopy_to_user
-#define copy_in_user lcopy_in_user
-#define __copy_in_user lcopy_in_user
+unsigned long copy_to_user(void __user *dst, const void *src, unsigned long len);
+#define __copy_to_user copy_to_user
+unsigned long copy_from_user(void *dst, const void __user *src, unsigned long len);
+#define __copy_from_user copy_from_user
+unsigned long copy_in_user(void __user *dst, const void __user *src, unsigned long len);
+#define __copy_in_user copy_in_user
 
 #define __copy_to_user_inatomic __copy_to_user
 #define __copy_from_user_inatomic __copy_from_user
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux
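For readers following the shift-and-write path in copy_dstaligned(), here is a minimal portable sketch of the same idea. It is not code from the patch: merge_copy(), load_be32(), store_be32() and merge_copy_selftest() are hypothetical names, and the words are assembled byte by byte in big-endian order (an assumption made so that the patch's MERGE formula, written for big-endian PA-RISC, also behaves the same on a little-endian host). The source is only ever read as whole aligned words, and each destination word is built from two neighbouring source words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as the MERGE macro in the patch (big-endian word order). */
#define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))

/* Hypothetical helpers: read/write a 32-bit word in big-endian byte order,
 * standing in for the ldw/stw instructions used by the kernel code. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void store_be32(unsigned char *p, uint32_t w)
{
	p[0] = (unsigned char)(w >> 24);
	p[1] = (unsigned char)(w >> 16);
	p[2] = (unsigned char)(w >> 8);
	p[3] = (unsigned char)w;
}

/* Copy nwords words to dst from the unaligned address base + off, where
 * base is the source rounded down to a word boundary and off is 1..3.
 * Reads nwords + 1 whole words starting at base. */
static void merge_copy(unsigned char *dst, const unsigned char *base,
		       unsigned int off, size_t nwords)
{
	unsigned int sh_1 = 8 * off;	/* bits to drop from the first word */
	unsigned int sh_2 = 32 - sh_1;	/* bits to take from the next word */
	uint32_t prev = load_be32(base);
	size_t i;

	assert(off >= 1 && off <= 3);	/* off == 0 would shift by 32 bits */
	for (i = 0; i < nwords; i++) {
		uint32_t next = load_be32(base + 4 * (i + 1));

		store_be32(dst + 4 * i, MERGE(prev, sh_1, next, sh_2));
		prev = next;
	}
}

/* Returns the number of offsets for which the result differs from a
 * plain byte-wise view of the source. */
static int merge_copy_selftest(void)
{
	unsigned char src[4 * 9], dst[4 * 8];
	unsigned int off;
	int failures = 0;
	size_t i;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (unsigned char)(i * 7 + 1);

	for (off = 1; off <= 3; off++) {
		memset(dst, 0, sizeof(dst));
		merge_copy(dst, src, off, 8);
		if (memcmp(dst, src + off, sizeof(dst)) != 0)
			failures++;
	}
	return failures;
}
```

Calling merge_copy_selftest() from a small main() is enough to try it. The patch's version additionally unrolls the loop four words at a time and enters it at do1..do4 depending on len % 4, which this sketch leaves out for clarity.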
[parent not found: <20040917144813.GH28936@baldric.uwo.ca>]
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
[not found] ` <20040917144813.GH28936@baldric.uwo.ca>
@ 2004-09-17 15:34 ` Randolph Chung
2004-09-17 16:23 ` Carlos O'Donell
0 siblings, 1 reply; 8+ messages in thread
From: Randolph Chung @ 2004-09-17 15:34 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: parisc-linux

> What tests have you run against this?
> I think the original version was pretty stable on my 32-bit UP c3k.
> Included kernel compiles and some lws tests.

i've done kernel compiles at various -j levels, and run some lmbench
benchmarks.

as you know, there's also a memcpy testsuite that verified the main
copying algorithm for various block sizes and alignments.

i also have a small test program that is able to exercise the
exception cases in the code, so i've tested the various load/store
exception cases in the two functions (pa_memcpy and copy_dstaligned).

currently, the exception test program is a bit indirect (it requires a
lot of manual work). i have an idea for a better automated testsuite
which still needs to be implemented.

randolph
-- 
Randolph Chung
Debian GNU/Linux Developer, hppa/ia64 ports
http://www.tausq.org/
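The memcpy testsuite mentioned in this message is not part of the thread, so the following is only a rough sketch of what a check over block sizes and alignments might look like in userspace C. All names (copy_fn, check_one, check_copy_alignments, the PAD/MAX_ALIGN/MAX_LEN limits) are hypothetical. It tries every source and destination offset from 0 to 7 with every length up to a limit, and verifies both the copied bytes and that nothing outside the destination range was touched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical harness; none of these names come from the patch. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

enum { PAD = 16, MAX_ALIGN = 8, MAX_LEN = 256 };
#define BUF_SIZE (PAD + MAX_ALIGN + MAX_LEN + PAD)
#define GUARD 0xAA

/* Run one copy at the given offsets and length; return 1 on any mismatch. */
static int check_one(copy_fn fn, size_t s_off, size_t d_off, size_t len)
{
	static unsigned char src[BUF_SIZE], dst[BUF_SIZE];
	size_t start = PAD + d_off;
	size_t i;

	for (i = 0; i < BUF_SIZE; i++)
		src[i] = (unsigned char)(i * 31 + 7);
	memset(dst, GUARD, BUF_SIZE);

	fn(dst + start, src + PAD + s_off, len);

	for (i = 0; i < BUF_SIZE; i++) {
		unsigned char want = GUARD;	/* outside the range: untouched */

		if (i >= start && i < start + len)
			want = src[PAD + s_off + (i - start)];
		if (dst[i] != want)
			return 1;
	}
	return 0;
}

/* Sweep all offset pairs in 0..MAX_ALIGN-1 and all lengths 0..max_len.
 * Returns the number of failing combinations. */
static int check_copy_alignments(copy_fn fn, size_t max_len)
{
	size_t s_off, d_off, len;
	int failures = 0;

	assert(max_len <= MAX_LEN);
	for (s_off = 0; s_off < MAX_ALIGN; s_off++)
		for (d_off = 0; d_off < MAX_ALIGN; d_off++)
			for (len = 0; len <= max_len; len++)
				failures += check_one(fn, s_off, d_off, len);
	return failures;
}
```

For example, check_copy_alignments(memcpy, 256) should report zero failures; a userspace build of the routine under test (the patch's non-kernel branch renames pa_memcpy to new2_copy) would need a small wrapper, since it returns a count instead of a pointer.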
* Re: [parisc-linux] [RFC] Optimi[zs]e copy_*_user routines
2004-09-17 15:34 ` Randolph Chung
@ 2004-09-17 16:23 ` Carlos O'Donell
0 siblings, 0 replies; 8+ messages in thread
From: Carlos O'Donell @ 2004-09-17 16:23 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux

Randolph,

> i've done kernel compiles at various -j levels, and run some lmbench
> benchmarks.

Good.

> as you know, there's also a memcpy testsuite that verified the main
> copying algorithm for various block sizes and alignments.

Yup. I ran those first.

> i also have a small test program that is able to exercise the
> exception cases in the code, so i've tested the various load/store
> exception cases in the two functions (pa_memcpy and copy_dstaligned).

Cool, those are the most important cases to test. My lws testcase
exercises the exception code, but obviously not the exception cases in
pa_memcpy and copy_dstaligned.

> currently, the exception test program is a bit indirect (it requires a
> lot of manual work). i have an idea for a better automated testsuite
> which still needs to be implemented.

Kernel module? :) If you did it in C it could be used for a lot of
arches.

c.
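As an illustration of what an automated check of the exception cases could assert, here is a small arch-independent model in plain C. It is not the test program discussed above and does not trigger real faults: model_copy() and model_copy_selftest() are hypothetical names, and the "fault" is simulated by stopping when the source pointer reaches a chosen address. What it does mirror is the return convention used by pa_memcpy's load handler in the patch, o_len - fault_addr + o_src, that is, the number of bytes not transferred.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a copy that faults on a load at fault_addr.
 * Returns 0 on success, otherwise the number of bytes not copied,
 * computed the same way as the patch's load handler:
 * o_len - fault_addr + o_src. Pass NULL for "no fault". */
static unsigned long model_copy(unsigned char *dst, const unsigned char *src,
				unsigned long len,
				const unsigned char *fault_addr)
{
	const unsigned char *o_src = src;
	unsigned long o_len = len;

	while (len) {
		if (src == fault_addr)
			return o_len - (unsigned long)(fault_addr - o_src);
		*dst++ = *src++;
		len--;
	}
	return 0;
}

/* Inject a simulated fault at every source offset and check both the
 * return value and the destination contents. Returns the failure count. */
static int model_copy_selftest(void)
{
	enum { LEN = 64 };
	unsigned char src[LEN], dst[LEN];
	unsigned long i, k;
	int failures = 0;

	for (i = 0; i < LEN; i++)
		src[i] = (unsigned char)(i + 1);	/* never zero */

	for (k = 0; k < LEN; k++) {
		memset(dst, 0, LEN);
		if (model_copy(dst, src, LEN, src + k) != LEN - k)
			failures++;
		if (memcmp(dst, src, k) != 0)		/* bytes before the fault */
			failures++;
		for (i = k; i < LEN; i++)		/* bytes after: untouched */
			if (dst[i] != 0)
				failures++;
	}

	if (model_copy(dst, src, LEN, NULL) != 0 || memcmp(dst, src, LEN) != 0)
		failures++;
	return failures;
}
```

A real test would replace model_copy() with the routine under test and provoke genuine faults, for instance with an unmapped page next to the buffer; the assertions on the returned count and on the untouched tail of the destination could stay the same.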