public inbox for linux-kernel@vger.kernel.org
* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-24 19:34 efficient copy_to_user and copy_from_user routines in Linux Kernel Mala Anand
@ 2002-06-24 19:33 ` David S. Miller
  2002-06-24 20:24 ` [Lse-tech] " Niels Christiansen
  2002-06-25 17:03 ` Andrew Morton
  2 siblings, 0 replies; 14+ messages in thread
From: David S. Miller @ 2002-06-24 19:33 UTC (permalink / raw)
  To: manand; +Cc: linux-kernel, lse-tech

   From: "Mala Anand" <manand@us.ibm.com>
   Date: Mon, 24 Jun 2002 14:34:08 -0500

   The 2.5.19 copy routines use the movsl instruction.  We found that when
   the src or dst addresses are not aligned on 8 bytes, performance can be
   improved by using the integer registers instead of the movsl instruction.
   For tcpip, the src or dst addresses are often misaligned.

If the code is going to become so much larger, move the implementation
out of the header file and into arch/i386/lib/foo.S

It makes no sense to inline it anymore if it is going to be
implemented with so many instructions.
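
A minimal sketch of the split being suggested (illustrative names, not an
existing interface): the header keeps only a declaration, and the body of
the routine lives out of line.

/* include/asm-i386/uaccess.h: declaration only (hypothetical name) */
unsigned long __copy_user_unrolled(void *to, const void *from, unsigned long n);

/* arch/i386/lib/copy_user.S (or a .c file there) would then carry the
 * unrolled implementation, so every call site shares one out-of-line copy
 * instead of inlining ~100 instructions. */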

^ permalink raw reply	[flat|nested] 14+ messages in thread

* efficient copy_to_user and copy_from_user routines in Linux Kernel
@ 2002-06-24 19:34 Mala Anand
  2002-06-24 19:33 ` David S. Miller
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mala Anand @ 2002-06-24 19:34 UTC (permalink / raw)
  To: Linux Kernel Mailing List, lse-tech

Here is a 2.5.19 patch that improves the performance of IA32 copy_to_user
and copy_from_user routines used by :

(1) tcpip protocol stack
(2) file systems

The 2.5.19 copy routines use the movsl instruction.  We found that when the
src or dst addresses are not aligned on 8 bytes, performance can be improved
by using the integer registers instead of the movsl instruction.  For tcpip,
the src or dst addresses are often misaligned.

The patch uses the integer registers if :

(1) length of the copy >= 64 AND src or dst not aligned on 8 bytes.

We found that the patch improves both network throughput and overall
CPU utilization of the sender/receiver when tested using netperf.
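
In plain C the idea amounts to the sketch below (illustrative only; the
actual patch does this inside the __copy_user/__copy_user_zeroing asm,
together with the exception-table fixups a user-space sketch cannot show):

#include <string.h>

/* Sketch of the heuristic, not the kernel code: unless the copy is large
 * and misaligned, fall back to memcpy (standing in for rep; movsl/movsb);
 * otherwise move the bulk 64 bytes at a time with 32-bit integer loads
 * and stores, as the unrolled asm loop does.  (Unaligned 32-bit accesses
 * are fine on x86, which is the whole point.) */
static void copy_user_sketch(void *to, const void *from, unsigned long n)
{
        unsigned int *d = to;
        const unsigned int *s = from;

        if (n >= 64 && (((unsigned long)to | (unsigned long)from) & 7)) {
                while (n >= 64) {
                        int i;
                        for (i = 0; i < 16; i++)        /* 16 x 4 bytes */
                                d[i] = s[i];
                        d += 16;
                        s += 16;
                        n -= 64;
                }
        }
        memcpy(d, s, n);        /* remainder, or the whole aligned copy */
}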

Here are some netperf (www.netperf.org) 2.5.19 UP results with and without
the patch :

client & server used for the test : 996 MHz Pentium III, Intel PRO/1000 Gb NIC

netperf 12865 -l 60 -H perf1 -t TCP_STREAM -c -C -i 10,2 -I 95,10 -s 65536 -S 65536

The message size used by netperf was varied from 512 to 65536 bytes; the MTU was 1500.

       2.5.19        2.5.19+patch   -- % improvement using patch --
 msg   Throughput    Throughput                   Sender   Receiver
 size  10^6bits/s    10^6bits/s     Throughput      CPU         CPU

  512      741.84      842.00          13.5 %     -1.6 %      0.0 %
 1024      816.61      922.77          13.0 %     -2.8 %      0.0 %
 2048      854.06      940.02          10.1 %     -4.7 %     10.0 %
 4096      913.12      940.12           3.0 %     -1.5 %     19.9 %
 8192      925.36      940.34           1.7 %    -16.3 %     27.9 %
16384      936.51      935.84          -0.1 %      1.8 %     33.4 %
32768      875.11      892.54           2.0 %     -2.5 %     33.9 %
65536      885.79      930.38           5.0 %    -21.0 %     13.3 %

We also instrumented the copy routines in order to measure the number of
CPU cycles required to copy a 1448-byte piece of memory :

        buffer    CPU
method  aligned   cycles
------  -------   ------
movsl   YES       3000
movsl   NO        7000
integer NO        4000
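
For reference, a minimal user-space sketch of how a per-copy cycle count
like the ones above can be taken with rdtsc; it times memcpy rather than
the kernel's __copy_user and takes a single noisy sample, so it is
illustrative only:

#include <stdio.h>
#include <string.h>

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
        static char src[4096], dst[4096];
        unsigned long long t0, t1;

        t0 = rdtsc();
        memcpy(dst, src + 4, 1448);     /* +4 offsets the source alignment */
        t1 = rdtsc();

        printf("1448-byte copy: %llu cycles\n", t1 - t0);
        return 0;
}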

Badari Pulavarty has suggested using mmx registers instead of integer
registers in the unrolled loop copy method.  We are both investigating
the performance of the copy routines when the mmx registers are used.

Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088

Here is the patch:

diff -Naur linux-519/include/asm-i386/uaccess.h linux-519copy/include/asm-i386/uaccess.h
--- linux-519/include/asm-i386/uaccess.h  Wed Jun 12 12:37:00 2002
+++ linux-519copy/include/asm-i386/uaccess.h    Mon Jun 17 07:52:46 2002
@@ -253,55 +253,199 @@
  */

 /* Generic arbitrary sized copy.  */
-#define __copy_user(to,from,size)                          \
-do {                                                 \
-     int __d0, __d1;                                       \
-     __asm__ __volatile__(                                 \
-           "0:   rep; movsl\n"                             \
-           "     movl %3,%0\n"                             \
-           "1:   rep; movsb\n"                             \
-           "2:\n"                                          \
-           ".section .fixup,\"ax\"\n"                      \
-           "3:   lea 0(%3,%0,4),%0\n"                      \
-           "     jmp 2b\n"                           \
-           ".previous\n"                                   \
-           ".section __ex_table,\"a\"\n"                   \
-           "     .align 4\n"                         \
-           "     .long 0b,3b\n"                            \
-           "     .long 1b,2b\n"                            \
-           ".previous"                               \
-           : "=&c"(size), "=&D" (__d0), "=&S" (__d1)       \
-           : "r"(size & 3), "0"(size / 4), "1"(to), "2"(from)    \
-           : "memory");                                    \
-} while (0)
+#define __copy_user(to,from,size)                          \
+do {                                                 \
+     int __d0, __d1, __d2;                                         \
+     __asm__ __volatile__(                                 \
+                "       cmpl $63, %0\n"                               \
+                "       jbe  5f\n"                                    \
+                "       mov %%si, %%ax\n"                             \
+                "       test $7, %%al\n"                              \
+                "       jz  5f\n"                                     \
+                "       .align 2,0x90\n"                              \
+                "0:     movl 32(%4), %%eax\n"                         \
+            "       cmpl $67, %0\n"                              \
+                "       jbe 1f\n"                                     \
+                "       movl 64(%4), %%eax\n"                         \
+                "       .align 2,0x90\n"                              \
+           "1:     movl 0(%4), %%eax\n"                          \
+                "       movl 4(%4), %%edx\n"                          \
+                "2:     movl %%eax, 0(%3)\n"                          \
+           "21:    movl %%edx, 4(%3)\n"                          \
+           "       movl 8(%4), %%eax\n"                          \
+           "       movl 12(%4),%%edx\n"                          \
+           "3:     movl %%eax, 8(%3)\n"                          \
+           "31:    movl %%edx, 12(%3)\n"                         \
+           "       movl 16(%4), %%eax\n"                         \
+           "       movl 20(%4), %%edx\n"                         \
+                "4:     movl %%eax, 16(%3)\n"                         \
+                "41:    movl %%edx, 20(%3)\n"                         \
+                "       movl 24(%4), %%eax\n"                         \
+                "       movl 28(%4), %%edx\n"                         \
+                "10:    movl %%eax, 24(%3)\n"                         \
+                "51:    movl %%edx, 28(%3)\n"                         \
+                "       movl 32(%4), %%eax\n"                         \
+                "       movl 36(%4), %%edx\n"                         \
+                "11:    movl %%eax, 32(%3)\n"                         \
+                "61:    movl %%edx, 36(%3)\n"                         \
+                "       movl 40(%4), %%eax\n"                         \
+                "       movl 44(%4), %%edx\n"                         \
+                "12:    movl %%eax, 40(%3)\n"                         \
+                "71:    movl %%edx, 44(%3)\n"                         \
+                "       movl 48(%4), %%eax\n"                         \
+                "       movl 52(%4), %%edx\n"                         \
+                "13:    movl %%eax, 48(%3)\n"                         \
+                "81:    movl %%edx, 52(%3)\n"                         \
+                "       movl 56(%4), %%eax\n"                         \
+                "       movl 60(%4), %%edx\n"                         \
+                "14:    movl %%eax, 56(%3)\n"                         \
+                "91:    movl %%edx, 60(%3)\n"                         \
+                "       addl $-64, %0\n"                              \
+                "       addl $64, %4\n"                               \
+                "       addl $64, %3\n"                               \
+                "       cmpl $63, %0\n"                               \
+                "       ja  0b\n"                                     \
+           "5:   movl  %0, %%eax\n"                            \
+                "       shrl  $2, %0\n"                               \
+                "       andl  $3, %%eax\n"                            \
+                "       cld\n"                                        \
+                "6:     rep; movsl\n"                                 \
+                "       movl %%eax, %0\n"                             \
+           "7:   rep; movsb\n"                             \
+           "8:\n"                                          \
+           ".section .fixup,\"ax\"\n"                      \
+           "9:   lea 0(%%eax,%0,4),%0\n"                   \
+           "     jmp 8b\n"                           \
+                "15:    movl %6, %0\n"                                \
+                "       jmp 8b\n"                                     \
+           ".previous\n"                                   \
+           ".section __ex_table,\"a\"\n"                   \
+           "     .align 4\n"                         \
+           "     .long 2b,15b\n"                           \
+           "     .long 21b,15b\n"                    \
+           "     .long 3b,15b\n"                           \
+           "     .long 31b,15b\n"                    \
+           "     .long 4b,15b\n"                           \
+           "     .long 41b,15b\n"                    \
+           "     .long 10b,15b\n"                      \
+           "     .long 51b,15b\n"                    \
+           "     .long 11b,15b\n"                    \
+           "     .long 61b,15b\n"                    \
+           "     .long 12b,15b\n"                    \
+           "     .long 71b,15b\n"                    \
+           "     .long 13b,15b\n"                    \
+           "     .long 81b,15b\n"                    \
+           "     .long 14b,15b\n"                          \
+           "     .long 91b,15b\n"                    \
+           "     .long 6b,9b\n"                            \
+                "       .long 7b,8b\n"                                \
+           ".previous"                               \
+           : "=&c"(size), "=&D" (__d0), "=&S" (__d1)       \
+           :  "1"(to), "2"(from), "0"(size),"i"(-EFAULT)         \
+           : "eax", "edx", "memory");                      \
+ } while (0)
+

 #define __copy_user_zeroing(to,from,size)                        \
 do {                                                 \
-     int __d0, __d1;                                       \
+     int __d0, __d1, __d2;                                 \
      __asm__ __volatile__(                                 \
-           "0:   rep; movsl\n"                             \
-           "     movl %3,%0\n"                             \
-           "1:   rep; movsb\n"                             \
-           "2:\n"                                          \
-           ".section .fixup,\"ax\"\n"                      \
-           "3:   lea 0(%3,%0,4),%0\n"                      \
-           "4:   pushl %0\n"                         \
-           "     pushl %%eax\n"                            \
-           "     xorl %%eax,%%eax\n"                       \
-           "     rep; stosb\n"                             \
-           "     popl %%eax\n"                             \
-           "     popl %0\n"                          \
-           "     jmp 2b\n"                           \
-           ".previous\n"                                   \
-           ".section __ex_table,\"a\"\n"                   \
-           "     .align 4\n"                         \
-           "     .long 0b,3b\n"                            \
-           "     .long 1b,4b\n"                            \
-           ".previous"                               \
-           : "=&c"(size), "=&D" (__d0), "=&S" (__d1)       \
-           : "r"(size & 3), "0"(size / 4), "1"(to), "2"(from)    \
-           : "memory");                                    \
-} while (0)
+                "       cmpl $63, %0\n"                               \
+                "       jbe  5f\n"                                    \
+                "       movl %%di, %%ax\n"                            \
+                "       test $7, %%al\n"                              \
+                "       jz   5f\n"                                    \
+                "       .align 2,0x90\n"                              \
+                "0:     movl 32(%4), %%eax\n"                         \
+           "       cmpl $67, %0\n"                               \
+                "       jbe 2f\n"                                     \
+                "1:     movl 64(%4), %%eax\n"                         \
+                "       .align 2,0x90\n"                              \
+           "2:     movl 0(%4), %%eax\n"                          \
+                "21:    movl 4(%4), %%edx\n"                          \
+                "       movl %%eax, 0(%3)\n"                          \
+           "       movl %%edx, 4(%3)\n"                          \
+           "3:     movl 8(%4), %%eax\n"                          \
+           "31:    movl 12(%4),%%edx\n"                          \
+           "       movl %%eax, 8(%3)\n"                          \
+           "       movl %%edx, 12(%3)\n"                         \
+           "4:     movl 16(%4), %%eax\n"                         \
+           "41:    movl 20(%4), %%edx\n"                         \
+                "       movl %%eax, 16(%3)\n"                         \
+                "       movl %%edx, 20(%3)\n"                         \
+                "10:    movl 24(%4), %%eax\n"                         \
+                "51:    movl 28(%4), %%edx\n"                         \
+                "       movl %%eax, 24(%3)\n"                         \
+                "       movl %%edx, 28(%3)\n"                         \
+                "11:    movl 32(%4), %%eax\n"                         \
+                "61:    movl 36(%4), %%edx\n"                         \
+                "       movl %%eax, 32(%3)\n"                         \
+                "       movl %%edx, 36(%3)\n"                         \
+                "12:    movl 40(%4), %%eax\n"                         \
+                "71:    movl 44(%4), %%edx\n"                         \
+                "       movl %%eax, 40(%3)\n"                         \
+                "       movl %%edx, 44(%3)\n"                         \
+                "13:    movl 48(%4), %%eax\n"                         \
+                "81:    movl 52(%4), %%edx\n"                         \
+                "       movl %%eax, 48(%3)\n"                         \
+                "       movl %%edx, 52(%3)\n"                         \
+                "14:    movl 56(%4), %%eax\n"                         \
+                "91:    movl 60(%4), %%edx\n"                         \
+                "       movl %%eax, 56(%3)\n"                         \
+                "       movl %%edx, 60(%3)\n"                         \
+                "       addl $-64, %0\n"                              \
+                "       addl $64, %4\n"                               \
+                "       addl $64, %3\n"                               \
+                "       cmpl $63, %0\n"                               \
+                "       ja  0b\n"                                     \
+           "5:   movl  %0, %%eax\n"                            \
+                "       shrl  $2, %0\n"                               \
+                "       andl $3, %%eax\n"                             \
+                "       cld\n"                                        \
+                "6:     rep; movsl\n"                                 \
+                "       movl %%eax,%0\n"                              \
+           "7:   rep; movsb\n"                             \
+           "8:\n"                                          \
+           ".section .fixup,\"ax\"\n"                      \
+           "9:   lea 0(%%eax,%0,4),%0\n"                   \
+           "16:  pushl %0\n"                             \
+           "     pushl %%eax\n"                            \
+           "     xorl %%eax,%%eax\n"                       \
+           "     rep; stosb\n"                             \
+           "     popl %%eax\n"                             \
+           "     popl %0\n"                          \
+           "     jmp 8b\n"                           \
+                "15:    movl %6, %0\n"                                \
+                "       jmp 8b\n"                                     \
+           ".previous\n"                                   \
+           ".section __ex_table,\"a\"\n"                   \
+           "     .align 4\n"                         \
+           "     .long 0b,16b\n"                           \
+           "     .long 1b,16b\n"                           \
+           "     .long 2b,16b\n"                           \
+           "     .long 21b,16b\n"                    \
+           "     .long 3b,16b\n"                           \
+           "     .long 31b,16b\n"                    \
+           "     .long 4b,16b\n"                           \
+           "     .long 41b,16b\n"                    \
+           "     .long 10b,16b\n"                      \
+           "     .long 51b,16b\n"                    \
+           "     .long 11b,16b\n"                    \
+           "     .long 61b,16b\n"                    \
+           "     .long 12b,16b\n"                    \
+           "     .long 71b,16b\n"                    \
+           "     .long 13b,16b\n"                    \
+           "     .long 81b,16b\n"                    \
+           "     .long 14b,16b\n"                          \
+           "     .long 91b,16b\n"                    \
+           "     .long 6b,9b\n"                            \
+                "       .long 7b,16b\n"                               \
+           ".previous"                               \
+           : "=&c"(size), "=&D" (__d0), "=&S" (__d1)       \
+           :  "1"(to), "2"(from), "0"(size),"i"(-EFAULT)         \
+           : "eax", "edx", "memory");                      \
+ } while (0)

 /* We let the __ versions of copy_from/to_user inline, because they're often
  * used in fast paths and have only a small space overhead.
@@ -578,24 +722,16 @@
 }

 #define copy_to_user(to,from,n)                      \
-     (__builtin_constant_p(n) ?                \
-      __constant_copy_to_user((to),(from),(n)) :     \
-      __generic_copy_to_user((to),(from),(n)))
+      __generic_copy_to_user((to),(from),(n))

 #define copy_from_user(to,from,n)              \
-     (__builtin_constant_p(n) ?                \
-      __constant_copy_from_user((to),(from),(n)) :   \
-      __generic_copy_from_user((to),(from),(n)))
+      __generic_copy_from_user((to),(from),(n))

 #define __copy_to_user(to,from,n)              \
-     (__builtin_constant_p(n) ?                \
-      __constant_copy_to_user_nocheck((to),(from),(n)) :   \
-      __generic_copy_to_user_nocheck((to),(from),(n)))
+      __generic_copy_to_user_nocheck((to),(from),(n))

 #define __copy_from_user(to,from,n)                  \
-     (__builtin_constant_p(n) ?                \
-      __constant_copy_from_user_nocheck((to),(from),(n)) : \
-      __generic_copy_from_user_nocheck((to),(from),(n)))
+      __generic_copy_from_user_nocheck((to),(from),(n))

 long strncpy_from_user(char *dst, const char *src, long count);
 long __strncpy_from_user(char *dst, const char *src, long count);




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Lse-tech] efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-24 19:34 efficient copy_to_user and copy_from_user routines in Linux Kernel Mala Anand
  2002-06-24 19:33 ` David S. Miller
@ 2002-06-24 20:24 ` Niels Christiansen
  2002-06-25 17:03 ` Andrew Morton
  2 siblings, 0 replies; 14+ messages in thread
From: Niels Christiansen @ 2002-06-24 20:24 UTC (permalink / raw)
  To: Linux Kernel Mailing List, lse-tech

Mala,

As you may recall, I showed you results back in February with MMX registers
enabled.  I also gave you a simple patch that activates the use of the MMX
registers for testing.  It would be interesting if you could run your test
with the MMX patch so we could see the difference.  In case you forgot the
patch, it is as simple as turning CONFIG_X86_USE_3DNOW on for Pentium III in
arch/i386/config.in.

Niels Christiansen

----- Original Message -----
From: "Mala Anand" <manand@us.ibm.com>
To: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>;
<lse-tech@lists.sourceforge.net>
Sent: Monday, June 24, 2002 2:34 PM
Subject: [Lse-tech] efficient copy_to_user and copy_from_user routines in
Linux Kernel


> Here is a 2.5.19 patch that improves the performance of IA32 copy_to_user
> and copy_from_user routines used by :
>
> (1) tcpip protocol stack
> (2) file systems
>
> Badari Pulavarty has suggested using mmx registers instead of integer
> registers in the unrolled loop copy method.  We are both investigating
> the performance of the copy routines when the mmx registers are used.
>
> Regards,
>     Mala


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-24 19:34 efficient copy_to_user and copy_from_user routines in Linux Kernel Mala Anand
  2002-06-24 19:33 ` David S. Miller
  2002-06-24 20:24 ` [Lse-tech] " Niels Christiansen
@ 2002-06-25 17:03 ` Andrew Morton
  2002-06-25 17:43   ` kuznet
                     ` (2 more replies)
  2 siblings, 3 replies; 14+ messages in thread
From: Andrew Morton @ 2002-06-25 17:03 UTC (permalink / raw)
  To: Mala Anand; +Cc: Linux Kernel Mailing List, lse-tech

Mala Anand wrote:
> 
> Here is a 2.5.19 patch that improves the performance of IA32 copy_to_user
> and copy_from_user routines used by :
> 
> (1) tcpip protocol stack
> (2) file systems
> 


This came up about a year back when zerocopy networking was merged.
Intel boxes started running more slowly purely because of the 8+8
alignment thing.

I changed tcp to use a different copy if either source or dest were
not eight-byte aligned, and found that the resulting improvement
across a mixed networking load was only 1%.  Your numbers are higher,
so perhaps there are different alignments in the mix...

One question:  have you tested on other CPU types?  This problem is
very specific to Intel hardware.  On AMD, the eight-byte alignment
artifact does not exist at all.  It could be that your patch is not
desirable on such CPUs?

-

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 17:03 ` Andrew Morton
@ 2002-06-25 17:43   ` kuznet
  2002-06-25 19:47     ` Andrew Morton
  2002-06-26 13:54     ` Hirokazu Takahashi
  2002-06-25 18:58   ` [Lse-tech] " Niels Christiansen
  2002-06-26 14:50   ` Bill Hartner
  2 siblings, 2 replies; 14+ messages in thread
From: kuznet @ 2002-06-25 17:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Hello!

> I changed tcp to use a different copy if either source or dest were
> not eight-byte aligned, and found that the resulting improvement
> across a mixed networking load was only 1%.  Your numbers are higher,
> so perhaps there are different alignments in the mix...

Did you look at the sender or change both of the functions?

After that accident TCP was changed and it does not use copy_from_user any
more; it does copy_and_csum even when no checksum is required.  So, his
results on the sender side (except for a strange anomaly at msg size 8K)
just confirm the nil effect of copy_from_user.

As for copy_to_user, we forgot about this entirely,
worrying mostly about the sender side. :-)

Alexey

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Lse-tech] Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 17:03 ` Andrew Morton
  2002-06-25 17:43   ` kuznet
@ 2002-06-25 18:58   ` Niels Christiansen
  2002-06-25 19:11     ` Dave Jones
  2002-06-26 14:50   ` Bill Hartner
  2 siblings, 1 reply; 14+ messages in thread
From: Niels Christiansen @ 2002-06-25 18:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List, lse-tech


Indeed, I ordered a P4 for these tests while still at IBM but AMD boxes are
not available to Mala, I believe.  Before I got sacked, I even bought a P4
to test this and other things at home but I lost interest in the matter so
never actually got around to testing.

When I did test back in February I created a few test programs and found
that the code generated by GCC Version 3.x and the library that came with
RedHat 6.2 gave almost as good results as the patches Mala then had
available.  Maybe it is time to see if the compiler has improved enough to
scrap the copy code in the kernel in favor of code as generated by the
compiler.

Niels

----- Original Message -----
From: "Andrew Morton" <akpm@zip.com.au>
To: "Mala Anand" <manand@us.ibm.com>
Cc: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>;
<lse-tech@lists.sourceforge.net>
Sent: Tuesday, June 25, 2002 12:03 PM
Subject: [Lse-tech] Re: efficient copy_to_user and copy_from_user routines
in Linux Kernel


> Mala Anand wrote:
> >
> > Here is a 2.5.19 patch that improves the performance of IA32
copy_to_user
> > and copy_from_user routines used by :
> >
> > (1) tcpip protocol stack
> > (2) file systems
> >
>
> One question:  have you tested on other CPU types?  This problem is
> very specific to Intel hardware.  On AMD, the eight-byte alignment
> artifact does not exist at all.  It could be that your patch is not
> desirable on such CPUs?
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Lse-tech] Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 18:58   ` [Lse-tech] " Niels Christiansen
@ 2002-06-25 19:11     ` Dave Jones
  2002-06-26 12:58       ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Jones @ 2002-06-25 19:11 UTC (permalink / raw)
  To: Niels Christiansen; +Cc: Andrew Morton, Linux Kernel Mailing List, lse-tech

On Tue, Jun 25, 2002 at 01:58:01PM -0500, Niels Christiansen wrote:
 
 > Maybe it is time to see if the compiler has improved enough to
 > scrap the copy code in the kernel in favor of code as generated by the
 > compiler.

This came up about a month ago. I'll repeat what I said then.
"I'll believe it when I see it".

        Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 17:43   ` kuznet
@ 2002-06-25 19:47     ` Andrew Morton
  2002-06-25 21:46       ` Chris Friesen
  2002-06-26 13:54     ` Hirokazu Takahashi
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2002-06-25 19:47 UTC (permalink / raw)
  To: kuznet; +Cc: linux-kernel

kuznet@ms2.inr.ac.ru wrote:
> 
> Hello!
> 
> > I changed tcp to use a different copy if either source or dest were
> > not eight-byte aligned, and found that the resulting improvement
> > across a mixed networking load was only 1%.  Your numbers are higher,
> > so perhaps there are different alignments in the mix...
> 
> Did you look at the sender or change both of the functions?

I changed it to use csum_copy_from_user() instead of copy_from_user()
if the source and dest weren't 8-byte aligned.   No other changes
in there.   

> After that accident TCP was changed and it does not use copy_from_user any
> more; it does copy_and_csum even when no checksum is required.  So, his
> results on the sender side (except for a strange anomaly at msg size 8K)
> just confirm the nil effect of copy_from_user.

Yup.

> As for copy_to_user, we forgot about this entirely,
> worrying mostly about the sender side. :-)

We didn't really forget, but we were trying to get a 2.4 kernel out,
so it became a "fix in 2.5" item.  You're right, we should fix it in
2.4.

I wrote a little app to test this - it times a couple of copy algorithms
at all possible alignments.  It may be useful for someone...  http://www.zip.com.au/~akpm/linux/cptimer.tar.gz
I think it covers everything - uncached/cached source/dest,
all possible transfer alignments.
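
The idea is roughly the following (a toy sketch, not the actual cptimer
code): sweep every source/destination offset within an 8-byte word and
time a bulk copy at each combination.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF   (1024 * 1024)     /* 1 MB bulk copy per iteration */
#define LOOPS 100

int main(void)
{
        static char src[BUF + 16], dst[BUF + 16];
        int s, d, i;

        for (s = 0; s < 8; s++) {
                for (d = 0; d < 8; d++) {
                        clock_t t0 = clock();

                        for (i = 0; i < LOOPS; i++)
                                memcpy(dst + d, src + s, BUF);

                        printf("src+%d dst+%d: %ld ticks\n",
                               s, d, (long)(clock() - t0));
                }
        }
        return 0;
}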

The cost of getting it wrong is, iirc, 40% slowdown.  In the
kernel's single most expensive function.

-

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 19:47     ` Andrew Morton
@ 2002-06-25 21:46       ` Chris Friesen
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Friesen @ 2002-06-25 21:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

> I wrote a little app to test this - it times a couple of copy algorithms
> at all possible alignments.  It may be useful for someone...  http://www.zip.com.au/~akpm/linux/cptimer.tar.gz
> I think it covers everything - uncached/cached source/dest,
> all possible transfer alignments.

I have a problem.

[cfriesen@pcard0ks cptimer]$ ./report.sh 
vendor_id       : GenuineIntel
model name      : Pentium III (Katmai)
stepping        : 3
cpu MHz         : 548.633
cache size      : 512 KB


CP_OPTS= ./all-alignments.sh
expr: syntax error
expr: syntax error
expr: syntax error
expr: syntax error
expr: syntax error
expr: non-numeric argument
expr: syntax error
expr: non-numeric argument
expr: syntax error
expr: syntax error


Any ideas?

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10  
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Lse-tech] Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 19:11     ` Dave Jones
@ 2002-06-26 12:58       ` Andi Kleen
  0 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2002-06-26 12:58 UTC (permalink / raw)
  To: Dave Jones, Niels Christiansen, Andrew Morton,
	Linux Kernel Mailing List, lse-tech

On Tue, Jun 25, 2002 at 09:11:27PM +0200, Dave Jones wrote:
> On Tue, Jun 25, 2002 at 01:58:01PM -0500, Niels Christiansen wrote:
>  
>  > Maybe it is time to see if the compiler has improved enough to
>  > scrap the copy code in the kernel in favor of code as generated by the
>  > compiler.
> 
> This came up about a month ago. I'll repeat what I said then.
> "I'll believe it when I see it".

Just look at the x86-64 port (2.5).

The code generated by gcc 3.1 is a lot better than the inline macros.
For example, it knows the alignment of target/source and emits
unrolled big (4, 2, 1 byte) moves and some tricks.
We're using that on x86-64.  For tricky cases (where it cannot determine
length or alignment) it'll still call out to out-of-line functions, which
should be optimized, notably not just use rep ; s... like the inline
macros, which isn't very efficient on Athlon.
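
For illustration, the kind of copy a compiler can expand completely inline
when it can see the size and alignment at compile time (a toy example, not
code from the x86-64 port):

#include <string.h>

struct hdr { unsigned long a, b, c, d; };   /* size and alignment known */

void copy_hdr(struct hdr *dst, const struct hdr *src)
{
        /* with optimization, gcc expands this constant-size memcpy into a
         * few word moves instead of a call or a rep; movs sequence */
        memcpy(dst, src, sizeof(*dst));
}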

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
  2002-06-25 17:43   ` kuznet
  2002-06-25 19:47     ` Andrew Morton
@ 2002-06-26 13:54     ` Hirokazu Takahashi
  1 sibling, 0 replies; 14+ messages in thread
From: Hirokazu Takahashi @ 2002-06-26 13:54 UTC (permalink / raw)
  To: akpm; +Cc: kuznet, linux-kernel

Hello,

I have a patch which lets sendmsg use copy_from_user instead of
csum_and_copy_from_user when a NIC supports HW checksumming.
It currently works only on kernel 2.4; I haven't ported it to kernel 2.5 yet.
If you want it, I'll port it after I come back from Ottawa.

You can get the patch against kernel 2.4 from
ftp://ftp.valinux.co.jp/pub/people/taka/tune/2.4.17/va-udptcpchecksum-2.4.17-test1.patch

Would you try it?

> > I changed tcp to use a different copy if either source or dest were
> > not eight-byte aligned, and found that the resulting improvement
> > across a mixed networking load was only 1%.  Your numbers are higher,
> > so perhaps there are different alignments in the mix...
> 
> Did you look at the sender or change both of the functions?
> 
> After that accident TCP was changed and it does not use copy_from_user any
> more; it does copy_and_csum even when no checksum is required.  So, his
> results on the sender side (except for a strange anomaly at msg size 8K)
> just confirm the nil effect of copy_from_user.
> 
> As for copy_to_user, we forgot about this entirely,
> worrying mostly about the sender side. :-)

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Lse-tech] Re: efficient copy_to_user and copy_from_user routines in  Linux Kernel
  2002-06-25 17:03 ` Andrew Morton
  2002-06-25 17:43   ` kuznet
  2002-06-25 18:58   ` [Lse-tech] " Niels Christiansen
@ 2002-06-26 14:50   ` Bill Hartner
  2 siblings, 0 replies; 14+ messages in thread
From: Bill Hartner @ 2002-06-26 14:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mala Anand, Linux Kernel Mailing List, lse-tech



Andrew Morton wrote:
> 
> Mala Anand wrote:
> >
> > Here is a 2.5.19 patch that improves the performance of IA32 copy_to_user
> > and copy_from_user routines used by :
...
> 
> One question:  have you tested on other CPU types?  This problem is
> very specific to Intel hardware.  On AMD, the eight-byte alignment
> artifact does not exist at all.  It could be that your patch is not
> desirable on such CPUs?
> 

In Mala's lab, there are a couple of 1.6 GHz P4 systems that can be used for testing.

There is also a Netbench (P4 and PIII Xeon) and SPECweb99 (PIII Xeon) setup
that can be used for further testing.

There are some older P6 systems available too.  Not sure about AMD yet.

Bill

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
@ 2002-06-28 11:50 Mala Anand
  0 siblings, 0 replies; 14+ messages in thread
From: Mala Anand @ 2002-06-28 11:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: akpm, Linux Kernel Mailing List, lse-tech, Mala Anand

>Mala Anand wrote:
>>
>> Here is a 2.5.19 patch that improves the performance of IA32 copy_to_user
>> and copy_from_user routines used by :
>>
>> (1) tcpip protocol stack
>> (2) file systems
>>


>This came up about a year back when zerocopy networking was merged.
>Intel boxes started running more slowly purely because of the 8+8
>alignment thing.

>I changed tcp to use a different copy if either source or dest were
>not eight-byte aligned, and found that the resulting improvement
>across a mixed networking load was only 1%.  Your numbers are higher,
>so perhaps there are different alignments in the mix...

I will test on other workloads when I return to work after OLS
and vacation.  However, we tested an earlier version of this patch on
Netbench using sendfile and gained around a 3% improvement.  The baseline
profiling showed that Netbench was spending 10% in generic_copy_to_user.
The tcp options are aligned on a 4-byte boundary, so depending on the
options used, the address of the data (the source address passed to
generic_copy_to_user) should fall on a 4- or 8-byte boundary.  I agree
with you that more testing is needed.


>One question:  have you tested on other CPU types?  This problem is
>very specific to Intel hardware.  On AMD, the eight-byte alignment
>artifact does not exist at all.  It could be that your patch is not
>desirable on such CPUs?

I tested only on Pentium II and III.  I will test it on Pentium IV.
When I said 8-byte alignment, I meant 8 bytes and greater.  I will
try to check out AMD also.



Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088




                                                                                                                                       
   Andrew Morton <akpm@zip.com.au> (sent by akpm@us.ibm.com), 06/25/2002 12:03 PM
   To: Mala Anand/Austin/IBM@IBMUS
   cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, lse-tech@lists.sourceforge.net
   Subject: Re: efficient copy_to_user and copy_from_user routines in Linux Kernel




-




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: efficient copy_to_user and copy_from_user routines in Linux Kernel
@ 2002-06-28 12:35 Mala Anand
  0 siblings, 0 replies; 14+ messages in thread
From: Mala Anand @ 2002-06-28 12:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List, lse-tech, Mala Anand

Andrew Morton wrote:

>>This came up about a year back when zerocopy networking was merged.
>>Intel boxes started running more slowly purely because of the 8+8
>>alignment thing.

>>I changed tcp to use a different copy if either source or dest were
>>not eight-byte aligned, and found that the resulting improvement
>>across a mixed networking load was only 1%.  Your numbers are higher,
>>so perhaps there are different alignments in the mix...

>I will test on other workloads when I return to work after OLS
>and vacation.  However, we tested an earlier version of this patch on
>Netbench using sendfile and gained around a 3% improvement.  The baseline
>profiling showed that Netbench was spending 10% in generic_copy_to_user.
>The tcp options are aligned on a 4-byte boundary, so depending on the
>options used, the address of the data (the source address passed to
>generic_copy_to_user) should fall on a 4- or 8-byte boundary.  I agree
>with you that more testing is needed.

One correction to the above statement...
Due to the tcp options being aligned on a 4-byte boundary, the source
address passed to generic_copy_to_user should fall on a 4-, 8-, 12-,
16-byte, etc. boundary.  However, I have seen that 4- and 12-byte
alignment using the unrolled loop performed better than the string copy.
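
For example, assuming the packet headers themselves start on an 8-byte
boundary:

  20 (IP header) + 20 (TCP header, no options)             = 40 -> 0 mod 8
  20 (IP header) + 20 (TCP header) + 12 (timestamp option) = 52 -> 4 mod 8

so the payload handed to generic_copy_to_user starts 8-byte aligned in the
first case and only 4-byte aligned in the second.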


>>One question:  have you tested on other CPU types?  This problem is
>>very specific to Intel hardware.  On AMD, the eight-byte alignment
>>artifact does not exist at all.  It could be that your patch is not
>>desirable on such CPUs?

>I tested only on Pentium II and III.  I will test it on Pentium IV.
>When I said 8-byte alignment, I meant 8 bytes and greater.  I will
>try to check out AMD also.

Same correction here: 8-byte alignment means 8, 16 bytes or greater.

Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088




^ permalink raw reply	[flat|nested] 14+ messages in thread


Thread overview: 14+ messages:
2002-06-24 19:34 efficient copy_to_user and copy_from_user routines in Linux Kernel Mala Anand
2002-06-24 19:33 ` David S. Miller
2002-06-24 20:24 ` [Lse-tech] " Niels Christiansen
2002-06-25 17:03 ` Andrew Morton
2002-06-25 17:43   ` kuznet
2002-06-25 19:47     ` Andrew Morton
2002-06-25 21:46       ` Chris Friesen
2002-06-26 13:54     ` Hirokazu Takahashi
2002-06-25 18:58   ` [Lse-tech] " Niels Christiansen
2002-06-25 19:11     ` Dave Jones
2002-06-26 12:58       ` Andi Kleen
2002-06-26 14:50   ` Bill Hartner
  -- strict thread matches above, loose matches on Subject: below --
2002-06-28 11:50 Mala Anand
2002-06-28 12:35 Mala Anand
