[PATCH] zerocopy NFS for 2.5.36

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-18  8:14 Hirokazu Takahashi
  2002-09-18 23:00 ` David S. Miller
  2002-10-14  5:50 ` Neil Brown
  0 siblings, 2 replies; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-09-18  8:14 UTC (permalink / raw)
  To: Neil Brown, linux-kernel, nfs

Hello,

I ported the zerocopy NFS patches against linux-2.5.36.

I made va05-zerocopy-nfsdwrite-2.5.36.patch more generic,
so that it would be easy to merge with NFSv4. Each procedure can
chose whether it can accept splitted buffers or not.
And I fixed a probelem that nfsd couldn't handle NFS-symlink
requests which were very large.

1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables HW-checksum against outgoing packets including UDP frames.

2)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va11-udpsendfile-2.5.36.patch
This patch makes sendfile systemcall over UDP work. It also supports
UDP_CORK interface which is very similar to TCP_CORK. And you can call
sendmsg/senfile with MSG_MORE flags over UDP sockets.

3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes the problem of x86 csum_partilal() routines which
can't handle odd addressed buffers.

4)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va01-zerocopy-rpc-2.5.36.patch
This patch makes RPC can send some pieces of data and pages without copy.

5)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va02-zerocopy-nfsdread-2.5.36.patch
This patch makes NFSD send pages in pagecache directly when NFS clinets request
file-read.

6)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va03-zerocopy-nfsdreaddir-2.5.36.patch
nfsd_readdir can also send pages without copy.

7)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va04-zerocopy-shadowsock-2.5.36.patch
This patch makes per-cpu UDP sockets so that NFSD can send UDP frames on
each prosessor simultaneously.
Without the patch we can send only one UDP frame at the time as a UDP socket
have to be locked during sending some pages to serialize them.

8)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va05-zerocopy-nfsdwrite-2.5.36.patch
This patch enables NFS-write uses writev interface. NFSd can handle NFS
requests without reassembling IP fragments into one UDP frame.

9)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/taka-writev-2.5.36.patch
This patch makes writev for regular file work faster.
It also can be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/

Caution:
       XFS doesn't support writev interface yet. NFS write on XFS might
       slow down with No.8 patch. I wish SGI guys will implement it.

10)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va07-nfsbigbuf-2.5.36.patch
This makes NFS buffer much bigger (60KB).
60KB buffer is the same to 32KB buffer for linux-kernel as both of them
require 64KB chunk.

11)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va09-zerocopy-tempsendto-2.5.36.patch
If you don't want to use sendfile over UDP yet, you can apply it instead of No.1 and No.2 patches.

Regards,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18  8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
@ 2002-09-18 23:00 ` David S. Miller
  2002-09-18 23:54   ` Alan Cox
  2002-09-21 11:56   ` Pavel Machek
  2002-10-14  5:50 ` Neil Brown
  1 sibling, 2 replies; 36+ messages in thread
From: David S. Miller @ 2002-09-18 23:00 UTC (permalink / raw)
  To: taka; +Cc: neilb, linux-kernel, nfs

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Wed, 18 Sep 2002 17:14:31 +0900 (JST)

   1)
   ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
   This patch enables HW-checksum against outgoing packets including UDP frames.

Can you explain the TCP parts?  They look very wrong.

It was discussed long ago that csum_and_copy_from_user() performs
better than plain copy_from_user() on x86.  I do not remember all
details, but I do know that using copy_from_user() is not a real
improvement at least on x86 architecture.

The rest of the changes (ie. the getfrag() logic to set
skb->ip_summed) looks fine.

   3)
   ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
   This patch fixes the problem of x86 csum_partilal() routines which
   can't handle odd addressed buffers.

I've sent Linus this fix already.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18 23:00 ` David S. Miller
@ 2002-09-18 23:54   ` Alan Cox
  2002-09-19  0:16     ` Andrew Morton
  2002-09-21 11:56   ` Pavel Machek
  1 sibling, 1 reply; 36+ messages in thread
From: Alan Cox @ 2002-09-18 23:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: taka, neilb, linux-kernel, nfs

On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86.  I do not remember all

The better was a freak of PPro/PII scheduling I think

> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

The same as bit is easy to explain. Its totally memory bandwidth limited
on current x86-32 processors. (Although I'd welcome demonstrations to
the contrary on newer toys)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18 23:54   ` Alan Cox
@ 2002-09-19  0:16     ` Andrew Morton
  2002-09-19  2:13       ` Aaron Lehmann
  2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Morton @ 2002-09-19  0:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: David S. Miller, taka, neilb, linux-kernel, nfs

Alan Cox wrote:
> 
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86.  I do not remember all
> 
> The better was a freak of PPro/PII scheduling I think
> 
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
> 
> The same as bit is easy to explain. Its totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)

Nope.  There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:

copy_32 uses movsl.  copy_duff just uses a stream of "movl"s

Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s     
nbytes=10240  from_align=0, to_align=0
    copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

OK, movsl wins.   But now give the source address 8+1 alignment:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240  from_align=1, to_align=0
    copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec

The "movl"-based copy wins.  By miles.

Make the source 8+4 aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240  from_align=4, to_align=0
    copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec

So movl still beats movsl, by lots.

I have various scriptlets which generate the entire matrix.

I think I ended up deciding that we should use movsl _only_
when both src and dsc are 8-byte-aligned.  And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.

It needs redoing.  These differences are really big, and this
is the kernel's most expensive function.

A little project for someone.

The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  0:16     ` Andrew Morton
@ 2002-09-19  2:13       ` Aaron Lehmann
  2002-09-19  3:30         ` Andrew Morton
  2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
  1 sibling, 1 reply; 36+ messages in thread
From: Aaron Lehmann @ 2002-09-19  2:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nfs

> akpm:/usr/src/cptimer> ./cptimer -d -s     
> nbytes=10240  from_align=0, to_align=0
>     copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

It's disappointing that this program doesn't seem to support
benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
Those seem to be the more interesting memcpy functions on modern
systems.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  2:13       ` Aaron Lehmann
@ 2002-09-19  3:30         ` Andrew Morton
  2002-09-19 10:42           ` Alan Cox
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2002-09-19  3:30 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: linux-kernel, nfs

Aaron Lehmann wrote:
> 
> > akpm:/usr/src/cptimer> ./cptimer -d -s
> > nbytes=10240  from_align=0, to_align=0
> >     copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> > __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
> 
> It's disappointing that this program doesn't seem to support
> benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> Those seem to be the more interesting memcpy functions on modern
> systems.

Well the source is there, and the licensing terms are most reasonable.

But then, the source was there eighteen months ago and nothing happened.
Sigh.

I think in-kernel MMX has fatal drawbacks anyway.  Not sure what
they are - I prefer to pretend that x86 CPUs execute raw C.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  3:30         ` Andrew Morton
@ 2002-09-19 10:42           ` Alan Cox
  0 siblings, 0 replies; 36+ messages in thread
From: Alan Cox @ 2002-09-19 10:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aaron Lehmann, linux-kernel, nfs

On Thu, 2002-09-19 at 04:30, Andrew Morton wrote:
> > It's disappointing that this program doesn't seem to support
> > benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> > Those seem to be the more interesting memcpy functions on modern
> > systems.
> 
> Well the source is there, and the licensing terms are most reasonable.
> 
> But then, the source was there eighteen months ago and nothing happened.
> Sigh.
> 
> I think in-kernel MMX has fatal drawbacks anyway.  Not sure what
> they are - I prefer to pretend that x86 CPUs execute raw C.

MMX isnt useful for anything smaller than about 512bytes-1K. Its not
useful in interrupt handlers. The list goes on.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  0:16     ` Andrew Morton
  2002-09-19  2:13       ` Aaron Lehmann
@ 2002-09-19 13:15       ` Hirokazu Takahashi
  2002-09-19 20:42         ` Andrew Morton
  1 sibling, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-09-19 13:15 UTC (permalink / raw)
  To: akpm; +Cc: alan, davem, neilb, linux-kernel, nfs

Hello,

> > > details, but I do know that using copy_from_user() is not a real
> > > improvement at least on x86 architecture.
> > 
> > The same as bit is easy to explain. Its totally memory bandwidth limited
> > on current x86-32 processors. (Although I'd welcome demonstrations to
> > the contrary on newer toys)
> 
> Nope.  There are distinct alignment problems with movsl-based
> memcpy on PII and (at least) "Pentium III (Coppermine)", which is
> tested here:
...
> I have various scriptlets which generate the entire matrix.
> 
> I think I ended up deciding that we should use movsl _only_
> when both src and dsc are 8-byte-aligned.  And that when you
> multiply the gain from that by the frequency*size with which
> funny alignments are used by TCP the net gain was 2% or something.

Amazing! I beleived 4-byte-aligned was enough.
read/write systemcalls may also reduce their penalties.

> It needs redoing.  These differences are really big, and this
> is the kernel's most expensive function.
> 
> A little project for someone.

OK, if there is nobody who wants to do it I'll do it by myself.

> The tools are at http://www.zip.com.au/~/linux/cptimer.tar.gz


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
@ 2002-09-19 20:42         ` Andrew Morton
  2002-09-19 21:12           ` David S. Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2002-09-19 20:42 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: alan, davem, neilb, linux-kernel, nfs

Hirokazu Takahashi wrote:
> 
> ...
> > It needs redoing.  These differences are really big, and this
> > is the kernel's most expensive function.
> >
> > A little project for someone.
> 
> OK, if there is nobody who wants to do it I'll do it by myself.

That would be fantastic - thanks.  This is more a measurement
and testing exercise than a coding one.  And if those measurements
are sufficiently nice (eg: >5%) then a 2.4 backport should be done.

It seems that movsl works acceptably with all alignments on AMD
hardware, although this needs to be checked with more recent machines.

movsl is a (bad) loss on PII and PIII for all alignments except 8&8.
Don't know about P4 - I can test that in a day or two.

I expect that a minimal, 90% solution would be just:

fancy_copy_to_user(dst, src, count)
{
	if (arch_has_sane_movsl || ((dst|src) & 7) == 0)
		movsl_copy_to_user(dst, src, count);
	else
		movl_copy_to_user(dst, src, count);
}

and

#ifndef ARCH_HAS_FANCY_COPY_USER
#define fancy_copy_to_user copy_to_user
#endif

and we really only need fancy_copy_to_user in a handful of
places - the bulk copies in networking and filemap.c.  For all
the other call sites it's probably more important to keep the
code footprint down than it is to squeeze the last few drops out
of the copy speed.

Mala Anand has done some work on this.  See
http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.3/0100.html

<searches>  Yes, I have a copy of Mala's patch here which works
against 2.5.current.  Mala's patch will cause quite an expansion
of kernel size; we would need an implementation which did not
use inlining.  This work was discussed at OLS2002.  See
http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz


 uaccess.h |  252 +++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 193 insertions(+), 59 deletions(-)

--- 2.5.25/include/asm-i386/uaccess.h~fast-cu	Tue Jul  9 21:34:58 2002
+++ 2.5.25-akpm/include/asm-i386/uaccess.h	Tue Jul  9 21:51:03 2002
@@ -253,55 +253,197 @@ do {									\
  */
 
 /* Generic arbitrary sized copy.  */
-#define __copy_user(to,from,size)					\
-do {									\
-	int __d0, __d1;							\
-	__asm__ __volatile__(						\
-		"0:	rep; movsl\n"					\
-		"	movl %3,%0\n"					\
-		"1:	rep; movsb\n"					\
-		"2:\n"							\
-		".section .fixup,\"ax\"\n"				\
-		"3:	lea 0(%3,%0,4),%0\n"				\
-		"	jmp 2b\n"					\
-		".previous\n"						\
-		".section __ex_table,\"a\"\n"				\
-		"	.align 4\n"					\
-		"	.long 0b,3b\n"					\
-		"	.long 1b,2b\n"					\
-		".previous"						\
-		: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
-		: "r"(size & 3), "0"(size / 4), "1"(to), "2"(from)	\
-		: "memory");						\
+#define __copy_user(to,from,size)				\
+do {								\
+	int __d0, __d1;						\
+	__asm__ __volatile__(					\
+	"       cmpl $63, %0\n"					\
+	"       jbe  5f\n"					\
+	"       mov %%esi, %%eax\n"				\
+	"       test $7, %%al\n"				\
+	"       jz  5f\n"					\
+	"       .align 2,0x90\n"				\
+	"0:     movl 32(%4), %%eax\n"				\
+	"       cmpl $67, %0\n"					\
+	"       jbe 1f\n"					\
+	"       movl 64(%4), %%eax\n"				\
+	"       .align 2,0x90\n"				\
+	"1:     movl 0(%4), %%eax\n"				\
+	"       movl 4(%4), %%edx\n"				\
+	"2:     movl %%eax, 0(%3)\n"				\
+	"21:    movl %%edx, 4(%3)\n"				\
+	"       movl 8(%4), %%eax\n"				\
+	"       movl 12(%4),%%edx\n"				\
+	"3:     movl %%eax, 8(%3)\n"				\
+	"31:    movl %%edx, 12(%3)\n"				\
+	"       movl 16(%4), %%eax\n"				\
+	"       movl 20(%4), %%edx\n"				\
+	"4:     movl %%eax, 16(%3)\n"				\
+	"41:    movl %%edx, 20(%3)\n"				\
+	"       movl 24(%4), %%eax\n"				\
+	"       movl 28(%4), %%edx\n"				\
+	"10:    movl %%eax, 24(%3)\n"				\
+	"51:    movl %%edx, 28(%3)\n"				\
+	"       movl 32(%4), %%eax\n"				\
+	"       movl 36(%4), %%edx\n"				\
+	"11:    movl %%eax, 32(%3)\n"				\
+	"61:    movl %%edx, 36(%3)\n"				\
+	"       movl 40(%4), %%eax\n"				\
+	"       movl 44(%4), %%edx\n"				\
+	"12:    movl %%eax, 40(%3)\n"				\
+	"71:    movl %%edx, 44(%3)\n"				\
+	"       movl 48(%4), %%eax\n"				\
+	"       movl 52(%4), %%edx\n"				\
+	"13:    movl %%eax, 48(%3)\n"				\
+	"81:    movl %%edx, 52(%3)\n"				\
+	"       movl 56(%4), %%eax\n"				\
+	"       movl 60(%4), %%edx\n"				\
+	"14:    movl %%eax, 56(%3)\n"				\
+	"91:    movl %%edx, 60(%3)\n"				\
+	"       addl $-64, %0\n"				\
+	"       addl $64, %4\n"					\
+	"       addl $64, %3\n"					\
+	"       cmpl $63, %0\n"					\
+	"       ja  0b\n"					\
+	"5:   movl  %0, %%eax\n"				\
+	"       shrl  $2, %0\n"					\
+	"       andl  $3, %%eax\n"				\
+	"       cld\n"						\
+	"6:     rep; movsl\n"					\
+	"       movl %%eax, %0\n"				\
+	"7:   rep; movsb\n"					\
+	"8:\n"							\
+	".section .fixup,\"ax\"\n"				\
+	"9:   lea 0(%%eax,%0,4),%0\n"				\
+	"     jmp 8b\n"						\
+	"15:    movl %6, %0\n"					\
+	"       jmp 8b\n"					\
+	".previous\n"						\
+	".section __ex_table,\"a\"\n"				\
+	"     .align 4\n"					\
+	"     .long 2b,15b\n"					\
+	"     .long 21b,15b\n"					\
+	"     .long 3b,15b\n"					\
+	"     .long 31b,15b\n"					\
+	"     .long 4b,15b\n"					\
+	"     .long 41b,15b\n"					\
+	"     .long 10b,15b\n"					\
+	"     .long 51b,15b\n"					\
+	"     .long 11b,15b\n"					\
+	"     .long 61b,15b\n"					\
+	"     .long 12b,15b\n"					\
+	"     .long 71b,15b\n"					\
+	"     .long 13b,15b\n"					\
+	"     .long 81b,15b\n"					\
+	"     .long 14b,15b\n"					\
+	"     .long 91b,15b\n"					\
+	"     .long 6b,9b\n"					\
+	"       .long 7b,8b\n"					\
+	".previous"						\
+	: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
+	:  "1"(to), "2"(from), "0"(size),"i"(-EFAULT)		\
+	: "eax", "edx", "memory");				\
 } while (0)
 
-#define __copy_user_zeroing(to,from,size)				\
-do {									\
-	int __d0, __d1;							\
-	__asm__ __volatile__(						\
-		"0:	rep; movsl\n"					\
-		"	movl %3,%0\n"					\
-		"1:	rep; movsb\n"					\
-		"2:\n"							\
-		".section .fixup,\"ax\"\n"				\
-		"3:	lea 0(%3,%0,4),%0\n"				\
-		"4:	pushl %0\n"					\
-		"	pushl %%eax\n"					\
-		"	xorl %%eax,%%eax\n"				\
-		"	rep; stosb\n"					\
-		"	popl %%eax\n"					\
-		"	popl %0\n"					\
-		"	jmp 2b\n"					\
-		".previous\n"						\
-		".section __ex_table,\"a\"\n"				\
-		"	.align 4\n"					\
-		"	.long 0b,3b\n"					\
-		"	.long 1b,4b\n"					\
-		".previous"						\
-		: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
-		: "r"(size & 3), "0"(size / 4), "1"(to), "2"(from)	\
-		: "memory");						\
-} while (0)
+#define __copy_user_zeroing(to,from,size)			\
+do {								\
+	int __d0, __d1;						\
+	__asm__ __volatile__(					\
+	"       cmpl $63, %0\n"					\
+	"       jbe  5f\n"					\
+	"       movl %%edi, %%eax\n"				\
+	"       test $7, %%al\n"				\
+	"       jz   5f\n"					\
+	"       .align 2,0x90\n"				\
+	"0:     movl 32(%4), %%eax\n"				\
+	"       cmpl $67, %0\n"					\
+	"       jbe 2f\n"					\
+	"1:     movl 64(%4), %%eax\n"				\
+	"       .align 2,0x90\n"				\
+	"2:     movl 0(%4), %%eax\n"				\
+	"21:    movl 4(%4), %%edx\n"				\
+	"       movl %%eax, 0(%3)\n"				\
+	"       movl %%edx, 4(%3)\n"				\
+	"3:     movl 8(%4), %%eax\n"				\
+	"31:    movl 12(%4),%%edx\n"				\
+	"       movl %%eax, 8(%3)\n"				\
+	"       movl %%edx, 12(%3)\n"				\
+	"4:     movl 16(%4), %%eax\n"				\
+	"41:    movl 20(%4), %%edx\n"				\
+	"       movl %%eax, 16(%3)\n"				\
+	"       movl %%edx, 20(%3)\n"				\
+	"10:    movl 24(%4), %%eax\n"				\
+	"51:    movl 28(%4), %%edx\n"				\
+	"       movl %%eax, 24(%3)\n"				\
+	"       movl %%edx, 28(%3)\n"				\
+	"11:    movl 32(%4), %%eax\n"				\
+	"61:    movl 36(%4), %%edx\n"				\
+	"       movl %%eax, 32(%3)\n"				\
+	"       movl %%edx, 36(%3)\n"				\
+	"12:    movl 40(%4), %%eax\n"				\
+	"71:    movl 44(%4), %%edx\n"				\
+	"       movl %%eax, 40(%3)\n"				\
+	"       movl %%edx, 44(%3)\n"				\
+	"13:    movl 48(%4), %%eax\n"				\
+	"81:    movl 52(%4), %%edx\n"				\
+	"       movl %%eax, 48(%3)\n"				\
+	"       movl %%edx, 52(%3)\n"				\
+	"14:    movl 56(%4), %%eax\n"				\
+	"91:    movl 60(%4), %%edx\n"				\
+	"       movl %%eax, 56(%3)\n"				\
+	"       movl %%edx, 60(%3)\n"				\
+	"       addl $-64, %0\n"				\
+	"       addl $64, %4\n"					\
+	"       addl $64, %3\n"					\
+	"       cmpl $63, %0\n"					\
+	"       ja  0b\n"					\
+	"5:   movl  %0, %%eax\n"				\
+	"       shrl  $2, %0\n"					\
+	"       andl $3, %%eax\n"				\
+	"       cld\n"						\
+	"6:     rep; movsl\n"					\
+	"       movl %%eax,%0\n"				\
+	"7:   rep; movsb\n"					\
+	"8:\n"							\
+	".section .fixup,\"ax\"\n"				\
+	"9:   lea 0(%%eax,%0,4),%0\n"				\
+	"16:  pushl %0\n"					\
+	"     pushl %%eax\n"					\
+	"     xorl %%eax,%%eax\n"				\
+	"     rep; stosb\n"					\
+	"     popl %%eax\n"					\
+	"     popl %0\n"					\
+	"     jmp 8b\n"						\
+	"15:    movl %6, %0\n"					\
+	"       jmp 8b\n"					\
+	".previous\n"						\
+	".section __ex_table,\"a\"\n"				\
+	"     .align 4\n"					\
+	"     .long 0b,16b\n"					\
+	"     .long 1b,16b\n"					\
+	"     .long 2b,16b\n"					\
+	"     .long 21b,16b\n"					\
+	"     .long 3b,16b\n"					\
+	"     .long 31b,16b\n"					\
+	"     .long 4b,16b\n"					\
+	"     .long 41b,16b\n"					\
+	"     .long 10b,16b\n"					\
+	"     .long 51b,16b\n"					\
+	"     .long 11b,16b\n"					\
+	"     .long 61b,16b\n"					\
+	"     .long 12b,16b\n"					\
+	"     .long 71b,16b\n"					\
+	"     .long 13b,16b\n"					\
+	"     .long 81b,16b\n"					\
+	"     .long 14b,16b\n"					\
+	"     .long 91b,16b\n"					\
+	"     .long 6b,9b\n"					\
+	"       .long 7b,16b\n"					\
+	".previous"						\
+	: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
+	:  "1"(to), "2"(from), "0"(size),"i"(-EFAULT)		\
+	: "eax", "edx", "memory");				\
+ } while (0)
 
 /* We let the __ versions of copy_from/to_user inline, because they're often
  * used in fast paths and have only a small space overhead.
@@ -578,24 +720,16 @@ __constant_copy_from_user_nocheck(void *
 }
 
 #define copy_to_user(to,from,n)				\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_to_user((to),(from),(n)) :	\
-	 __generic_copy_to_user((to),(from),(n)))
+	__generic_copy_to_user((to),(from),(n))
 
 #define copy_from_user(to,from,n)			\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_from_user((to),(from),(n)) :	\
-	 __generic_copy_from_user((to),(from),(n)))
+	__generic_copy_from_user((to),(from),(n))
 
 #define __copy_to_user(to,from,n)			\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_to_user_nocheck((to),(from),(n)) :	\
-	 __generic_copy_to_user_nocheck((to),(from),(n)))
+	__generic_copy_to_user_nocheck((to),(from),(n))
 
 #define __copy_from_user(to,from,n)			\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_from_user_nocheck((to),(from),(n)) :	\
-	 __generic_copy_from_user_nocheck((to),(from),(n)))
+	__generic_copy_from_user_nocheck((to),(from),(n))
 
 long strncpy_from_user(char *dst, const char *src, long count);
 long __strncpy_from_user(char *dst, const char *src, long count);

-

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19 20:42         ` Andrew Morton
@ 2002-09-19 21:12           ` David S. Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David S. Miller @ 2002-09-19 21:12 UTC (permalink / raw)
  To: akpm; +Cc: taka, alan, neilb, linux-kernel, nfs

   From: Andrew Morton <akpm@digeo.com>
   Date: Thu, 19 Sep 2002 13:42:13 -0700

   Mala's patch will cause quite an expansion
   of kernel size; we would need an implementation which did not
   use inlining.

It definitely belongs in arch/i386/lib/copy.c or whatever,
not inlined.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18 23:00 ` David S. Miller
  2002-09-18 23:54   ` Alan Cox
@ 2002-09-21 11:56   ` Pavel Machek
  1 sibling, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2002-09-21 11:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: taka, neilb, linux-kernel, nfs

Hi!
>    
>    1)
>    ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
>    This patch enables HW-checksum against outgoing packets including UDP frames.
>    
> Can you explain the TCP parts?  They look very wrong.
> 
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86.  I do not remember all
> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

Well, if this is the case, we need to #define copy_from_user csum_and_copy_from_user :-).

								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18  8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
  2002-09-18 23:00 ` David S. Miller
@ 2002-10-14  5:50 ` Neil Brown
  2002-10-14  6:15   ` David S. Miller
  2002-10-14 12:01   ` Hirokazu Takahashi
  1 sibling, 2 replies; 36+ messages in thread
From: Neil Brown @ 2002-10-14  5:50 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: David S. Miller, linux-kernel, nfs

On Wednesday September 18, taka@valinux.co.jp wrote:
> Hello,
> 
> I ported the zerocopy NFS patches against linux-2.5.36.
> 

hi,
 I finally got around to looking at this.
 It looks good.

 However it really needs the MSG_MORE support for udp_sendmsg to be
 accepted before there is any point merging the rpc/nfsd bits.

 Would you like to see if davem is happy with that bit first and get
 it in?  Then I will be happy to forward the nfsd specific bit.

 I'm bit I'm not very sure about is the 'shadowsock' patch for having
 several xmit sockets, one per CPU.  What sort of speedup do you get
 from this?  How important is it really?

NeilBrown

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14  5:50 ` Neil Brown
@ 2002-10-14  6:15   ` David S. Miller
  2002-10-14 10:45     ` kuznet
  2002-10-14 12:01   ` Hirokazu Takahashi
  1 sibling, 1 reply; 36+ messages in thread
From: David S. Miller @ 2002-10-14  6:15 UTC (permalink / raw)
  To: neilb; +Cc: taka, linux-kernel, nfs, kuznet

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Mon, 14 Oct 2002 15:50:02 +1000

    Would you like to see if davem is happy with that bit first and get
    it in?  Then I will be happy to forward the nfsd specific bit.
   
Alexey is working on this, or at least he was. :-)
(Alexey this is about the UDP cork changes)

    I'm bit I'm not very sure about is the 'shadowsock' patch for having
    several xmit sockets, one per CPU.  What sort of speedup do you get
    from this?  How important is it really?
   
Personally, it seems rather essential for scalability on SMP.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14  6:15   ` David S. Miller
@ 2002-10-14 10:45     ` kuznet
  2002-10-14 10:48       ` David S. Miller
  0 siblings, 1 reply; 36+ messages in thread
From: kuznet @ 2002-10-14 10:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: neilb, taka, linux-kernel, nfs

Hello!

> Alexey is working on this, or at least he was. :-)
> (Alexey this is about the UDP cork changes)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Alexey

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14 10:45     ` kuznet
@ 2002-10-14 10:48       ` David S. Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David S. Miller @ 2002-10-14 10:48 UTC (permalink / raw)
  To: kuznet; +Cc: neilb, taka, linux-kernel, nfs

   From: kuznet@ms2.inr.ac.ru
   Date: Mon, 14 Oct 2002 14:45:33 +0400 (MSD)

   I took two patches of the batch:
   
   va10-hwchecksum-2.5.36.patch
   va11-udpsendfile-2.5.36.patch
   
   I did not worry about the rest i.e. sunrpc/* part.

Neil and the NFS folks can take care of those parts
once the generic UDP parts are in.

So, no worries.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14  5:50 ` Neil Brown
  2002-10-14  6:15   ` David S. Miller
@ 2002-10-14 12:01   ` Hirokazu Takahashi
  2002-10-14 14:12     ` Andrew Theurer
  2002-10-16  3:44     ` Neil Brown
  1 sibling, 2 replies; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-14 12:01 UTC (permalink / raw)
  To: neilb; +Cc: davem, linux-kernel, nfs

Hello, Neil

> > I ported the zerocopy NFS patches against linux-2.5.36.
> 
> hi,
>  I finally got around to looking at this.
>  It looks good.

Thanks!

>  However it really needs the MSG_MORE support for udp_sendmsg to be
>  accepted before there is any point merging the rpc/nfsd bits.
> 
>  Would you like to see if davem is happy with that bit first and get
>  it in?  Then I will be happy to forward the nfsd specific bit.

Yes.

>  I'm bit I'm not very sure about is the 'shadowsock' patch for having
>  several xmit sockets, one per CPU.  What sort of speedup do you get
>  from this?  How important is it really?

It's not so important.

davem> Personally, it seems rather essential for scalability on SMP.

Yes.
It will be effective on large scale SMP machines as all kNFSd shares
one NFS port. A udp socket can't send data on each CPU at the same
time while MSG_MORE/UDP_CORK options are set.
The UDP socket have to block any other requests during making a UDP frame.


Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14 12:01   ` Hirokazu Takahashi
@ 2002-10-14 14:12     ` Andrew Theurer
  2002-10-16  3:44     ` Neil Brown
  1 sibling, 0 replies; 36+ messages in thread
From: Andrew Theurer @ 2002-10-14 14:12 UTC (permalink / raw)
  To: neilb, Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs

> Hello, Neil
>
> > > I ported the zerocopy NFS patches against linux-2.5.36.
> >
> > hi,
> >  I finally got around to looking at this.
> >  It looks good.
>
> Thanks!
>
> >  However it really needs the MSG_MORE support for udp_sendmsg to be
> >  accepted before there is any point merging the rpc/nfsd bits.
> >
> >  Would you like to see if davem is happy with that bit first and get
> >  it in?  Then I will be happy to forward the nfsd specific bit.
>
> Yes.
>
> >  I'm bit I'm not very sure about is the 'shadowsock' patch for having
> >  several xmit sockets, one per CPU.  What sort of speedup do you get
> >  from this?  How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket have to block any other requests during making a UDP frame.

I experienced this exact problem a few months ago.  I had a test where
several clients read a file or files cached on a linux server.  TCP was just
fine, I could get 100% CPU on all CPUs on the server.  TCP zerocopy was even
better, by about 50% throughput.  UDP could not get better than 33% CPU, one
CPU working on those UDP requests and I assume a portion of another CPU
handling some inturrupt stuff.  Essentially 2P and 4P throughput was only as
good as UP throughput.  It is essential to get scaling on UDP.  That
combined with the UDP zerocopy, we will have one extremely fast NFS server.

Andrew Theurer
IBM LTC


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14 12:01   ` Hirokazu Takahashi
  2002-10-14 14:12     ` Andrew Theurer
@ 2002-10-16  3:44     ` Neil Brown
  2002-10-16  4:31       ` David S. Miller
  2002-10-16 11:09       ` Hirokazu Takahashi
  1 sibling, 2 replies; 36+ messages in thread
From: Neil Brown @ 2002-10-16  3:44 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs

On Monday October 14, taka@valinux.co.jp wrote:
> >  I'm bit I'm not very sure about is the 'shadowsock' patch for having
> >  several xmit sockets, one per CPU.  What sort of speedup do you get
> >  from this?  How important is it really?
> 
> It's not so important.
> 
> davem> Personally, it seems rather essential for scalability on SMP.
> 
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket have to block any other requests during making a UDP frame.
> 

After thinking about this some more, I suspect it would have to be
quite large scale SMP to get much contention.
The only contention on the udp socket is, as you say, assembling a udp
frame, and it would be surprised if that takes a substantial faction
of the time to handle a request.

Presumably on a sufficiently large SMP machine that this became an
issue, there would be multiple NICs.  Maybe it would make sense to
have one udp socket for each NIC.  Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.

NeilBrown

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  3:44     ` Neil Brown
@ 2002-10-16  4:31       ` David S. Miller
  2002-10-16 15:04         ` Andrew Theurer
  2002-10-17  2:03         ` [NFS] " Andrew Theurer
  2002-10-16 11:09       ` Hirokazu Takahashi
  1 sibling, 2 replies; 36+ messages in thread
From: David S. Miller @ 2002-10-16  4:31 UTC (permalink / raw)
  To: neilb; +Cc: taka, linux-kernel, nfs

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Wed, 16 Oct 2002 13:44:04 +1000

   Presumably on a sufficiently large SMP machine that this became an
   issue, there would be multiple NICs.  Maybe it would make sense to
   have one udp socket for each NIC.  Would that make sense? or work?
   It feels to me to be cleaner than one for each CPU.

Doesn't make much sense.

Usually we are talking via one IP address, and thus over
one device.  It could be using multiple NICs via BONDING,
but that would be transparent to anything at the socket
level.

Really, I think there is real value to making the socket
per-cpu even on a 2 or 4 way system.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  4:31       ` David S. Miller
@ 2002-10-16 15:04         ` Andrew Theurer
  2002-10-17  2:03         ` [NFS] " Andrew Theurer
  1 sibling, 0 replies; 36+ messages in thread
From: Andrew Theurer @ 2002-10-16 15:04 UTC (permalink / raw)
  To: David S. Miller, neilb; +Cc: taka, linux-kernel, nfs

On Tuesday 15 October 2002 11:31 pm, David S. Miller wrote:
>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
>
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs.  Maybe it would make sense to
>    have one udp socket for each NIC.  Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device.  It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am trying my best today to get a 4 way system up and running for this test.  
IMO, per cpu is best..  with just one socket, I seriously could not get over 
33% cpu utilization on a 4 way (back in April).  With TCP, I could max it 
out.  I'll update later today hopefully with some promising results.

-Andrew

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  4:31       ` David S. Miller
  2002-10-16 15:04         ` Andrew Theurer
@ 2002-10-17  2:03         ` Andrew Theurer
  2002-10-17  2:31           ` Hirokazu Takahashi
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Theurer @ 2002-10-17  2:03 UTC (permalink / raw)
  To: neilb, David S. Miller; +Cc: taka, linux-kernel, nfs

>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
>
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs.  Maybe it would make sense to
>    have one udp socket for each NIC.  Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device.  It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am still seeing some sort of problem on an 8 way (hyperthreaded 8
logical/4 physical) on UDP with these patches.  I cannot get more than 2
NFSd threads in a run state at one time.  TCP usually has 8 or more.  The
test involves 40 100Mbit clients reading a 200 MB file on one server (4
acenic adapters) in cache.  I am fighting some other issues at the moment
(acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
138 MB/sec for NFSv2,TCP.  With the patches, 115 MB/sec for NFSv2,UDP and
181 MB/sec for NFSv2,TCP.  One CPU is maxed due to acpi int storm, so I
think the results will get better.  I'm not sure what other lock or
contention point this is hitting on UDP.  If there is anything I can do to
help, please let me know, thanks.

Andrew Theurer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17  2:03         ` [NFS] " Andrew Theurer
@ 2002-10-17  2:31           ` Hirokazu Takahashi
  2002-10-17 13:16             ` Andrew Theurer
  0 siblings, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17  2:31 UTC (permalink / raw)
  To: habanero; +Cc: neilb, davem, linux-kernel, nfs

Hello,

Thanks for testing my patches.

> I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> logical/4 physical) on UDP with these patches.  I cannot get more than 2
> NFSd threads in a run state at one time.  TCP usually has 8 or more.  The
> test involves 40 100Mbit clients reading a 200 MB file on one server (4
> acenic adapters) in cache.  I am fighting some other issues at the moment
> (acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> 138 MB/sec for NFSv2,TCP.  With the patches, 115 MB/sec for NFSv2,UDP and
> 181 MB/sec for NFSv2,TCP.  One CPU is maxed due to acpi int storm, so I
> think the results will get better.  I'm not sure what other lock or
> contention point this is hitting on UDP.  If there is anything I can do to
> help, please let me know, thanks.

I guess some UDP packets might be lost. It may happen easily as UDP protocol
doesn't support flow control.
Can you check how many errors has happened? 
You can see them in /proc/net/snmp of the server and the clients.

And how many threads did you start on your machine?
Buffer size of a UDP socket depends on number of kNFS threads.
Large number of threads might help you.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17  2:31           ` Hirokazu Takahashi
@ 2002-10-17 13:16             ` Andrew Theurer
  2002-10-17 13:26               ` Hirokazu Takahashi
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Theurer @ 2002-10-17 13:16 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36


> Hello,
>
> Thanks for testing my patches.
>
> > I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> > logical/4 physical) on UDP with these patches.  I cannot get more than 2
> > NFSd threads in a run state at one time.  TCP usually has 8 or more.
The
> > test involves 40 100Mbit clients reading a 200 MB file on one server (4
> > acenic adapters) in cache.  I am fighting some other issues at the
moment
> > (acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP
and
> > 138 MB/sec for NFSv2,TCP.  With the patches, 115 MB/sec for NFSv2,UDP
and
> > 181 MB/sec for NFSv2,TCP.  One CPU is maxed due to acpi int storm, so I
> > think the results will get better.  I'm not sure what other lock or
> > contention point this is hitting on UDP.  If there is anything I can do
to
> > help, please let me know, thanks.
>
> I guess some UDP packets might be lost. It may happen easily as UDP
protocol
> doesn't support flow control.
> Can you check how many errors has happened?
> You can see them in /proc/net/snmp of the server and the clients.

server: Udp: InDatagrams NoPorts InErrors OutDatagrams
        Udp: 1000665 41 0 1000666

clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
         Udp: 200403 0 0 200406
         (all clients the same)

> And how many threads did you start on your machine?
> Buffer size of a UDP socket depends on number of kNFS threads.
> Large number of threads might help you.

128 threads.  client rsize=8196.  Server and client MTU is 1500.

Andrew Theurer


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 13:16             ` Andrew Theurer
@ 2002-10-17 13:26               ` Hirokazu Takahashi
  2002-10-17 14:10                 ` Andrew Theurer
  0 siblings, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 13:26 UTC (permalink / raw)
  To: habanero; +Cc: neilb, davem, linux-kernel, nfs

Hi,

> server: Udp: InDatagrams NoPorts InErrors OutDatagrams
>         Udp: 1000665 41 0 1000666
> clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
>          Udp: 200403 0 0 200406
>          (all clients the same)

How about IP datagrams?  You can see the IP fields in /proc/net/snmp
IP layer may also discard them.

> > And how many threads did you start on your machine?
> > Buffer size of a UDP socket depends on number of kNFS threads.
> > Large number of threads might help you.
> 
> 128 threads.  client rsize=8196.  Server and client MTU is 1500.

It seems enough...


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 13:26               ` Hirokazu Takahashi
@ 2002-10-17 14:10                 ` Andrew Theurer
  2002-10-17 16:26                   ` Hirokazu Takahashi
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Theurer @ 2002-10-17 14:10 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

> Hi,
>
> > server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> >         Udp: 1000665 41 0 1000666
> > clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> >          Udp: 200403 0 0 200406
> >          (all clients the same)
>
> How about IP datagrams?  You can see the IP fields in /proc/net/snmp
> IP layer may also discard them.

Server:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000

A Client:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0


Andrew Theurer


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 14:10                 ` Andrew Theurer
@ 2002-10-17 16:26                   ` Hirokazu Takahashi
  2002-10-18  5:38                     ` Trond Myklebust
  0 siblings, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 16:26 UTC (permalink / raw)
  To: habanero; +Cc: neilb, davem, linux-kernel, nfs

Hi,

> > How about IP datagrams?  You can see the IP fields in /proc/net/snmp
> > IP layer may also discard them.
> 
> Server:
> 
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000
> 
> A Client:
> 
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0

It looks fine.  
Hmmm....  What version of linux do you use?

Congestion avoidance mechanism of NFS clients might cause this situation.
I think the congestion window size is not enough for high end machines.
You can make the window be larger as a test.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 16:26                   ` Hirokazu Takahashi
@ 2002-10-18  5:38                     ` Trond Myklebust
  2002-10-18  7:19                       ` Hirokazu Takahashi
  0 siblings, 1 reply; 36+ messages in thread
From: Trond Myklebust @ 2002-10-18  5:38 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: habanero, neilb, davem, linux-kernel, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > Congestion avoidance mechanism of NFS clients might cause this
     > situation.  I think the congestion window size is not enough
     > for high end machines.  You can make the window be larger as a
     > test.

The congestion avoidance window is supposed to adapt to the bandwidth
that is available. Turn congestion avoidance off if you like, but my
experience is that doing so tends to seriously degrade performance as
the number of timeouts + resends skyrockets.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-18  5:38                     ` Trond Myklebust
@ 2002-10-18  7:19                       ` Hirokazu Takahashi
  2002-10-18 15:12                         ` Andrew Theurer
  0 siblings, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-18  7:19 UTC (permalink / raw)
  To: trond.myklebust; +Cc: habanero, neilb, davem, linux-kernel, nfs

Hello,

>      > Congestion avoidance mechanism of NFS clients might cause this
>      > situation.  I think the congestion window size is not enough
>      > for high end machines.  You can make the window be larger as a
>      > test.
> 
> The congestion avoidance window is supposed to adapt to the bandwidth
> that is available. Turn congestion avoidance off if you like, but my
> experience is that doing so tends to seriously degrade performance as
> the number of timeouts + resends skyrockets.

Yes, you must be right.

But I guess Andrew may use a great machine so that the transfer rate
has exeeded the maximum size of the congestion avoidance window.
Can we determin preferable maximum window size dynamically?

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-18  7:19                       ` Hirokazu Takahashi
@ 2002-10-18 15:12                         ` Andrew Theurer
  2002-10-19 20:34                           ` Hirokazu Takahashi
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Theurer @ 2002-10-18 15:12 UTC (permalink / raw)
  To: trond.myklebust, Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

> >      > Congestion avoidance mechanism of NFS clients might cause this
> >      > situation.  I think the congestion window size is not enough
> >      > for high end machines.  You can make the window be larger as a
> >      > test.
> >
> > The congestion avoidance window is supposed to adapt to the bandwidth
> > that is available. Turn congestion avoidance off if you like, but my
> > experience is that doing so tends to seriously degrade performance as
> > the number of timeouts + resends skyrockets.
>
> Yes, you must be right.
>
> But I guess Andrew may use a great machine so that the transfer rate
> has exeeded the maximum size of the congestion avoidance window.
> Can we determin preferable maximum window size dynamically?

Is this a concern on the client only?  I can run a test with just one client
and see if I can saturate the 100Mbit adapter.  If I can, would we need to
make any adjustments then?  FYI, at 115 MB/sec total throughput, that's only
2.875 MB/sec for each of the 40 clients.  For the TCP result of 181 MB/sec,
that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
100Mbit client.

Andrew Theurer


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-18 15:12                         ` Andrew Theurer
@ 2002-10-19 20:34                           ` Hirokazu Takahashi
  2002-10-22 21:16                             ` Andrew Theurer
  0 siblings, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-19 20:34 UTC (permalink / raw)
  To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hello,

> > Congestion avoidance mechanism of NFS clients might cause this
> > situation.  I think the congestion window size is not enough
> > for high end machines.  You can make the window be larger as a
> > test.

> Is this a concern on the client only?  I can run a test with just one client
> and see if I can saturate the 100Mbit adapter.  If I can, would we need to
> make any adjustments then?  FYI, at 115 MB/sec total throughput, that's only
> 2.875 MB/sec for each of the 40 clients.  For the TCP result of 181 MB/sec,
> that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
> 100Mbit client.

I think it's a client issue. NFS servers don't care about cogestion of UDP
traffic and they will try to response to all NFS requests as fast as they can.

You can try to increase the number of clients or the number of mount points
for a test. It's easy to mount the same directory of the server on some
directries of the client so that each of them can work simultaneously.
   # mount -t nfs server:/foo   /baa1
   # mount -t nfs server:/foo   /baa2
   # mount -t nfs server:/foo   /baa3

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-19 20:34                           ` Hirokazu Takahashi
@ 2002-10-22 21:16                             ` Andrew Theurer
  2002-10-23  9:29                               ` Hirokazu Takahashi
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Theurer @ 2002-10-22 21:16 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

On Saturday 19 October 2002 15:34, Hirokazu Takahashi wrote:
> Hello,
>
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation.  I think the congestion window size is not enough
> > > for high end machines.  You can make the window be larger as a
> > > test.
> >
> > Is this a concern on the client only?  I can run a test with just one
> > client and see if I can saturate the 100Mbit adapter.  If I can, would we
> > need to make any adjustments then?  FYI, at 115 MB/sec total throughput,
> > that's only 2.875 MB/sec for each of the 40 clients.  For the TCP result
> > of 181 MB/sec, that's 4.525 MB/sec, IMO, both of which are comfortable
> > throughputs for a 100Mbit client.
>
> I think it's a client issue. NFS servers don't care about cogestion of UDP
> traffic and they will try to response to all NFS requests as fast as they
> can.
>
> You can try to increase the number of clients or the number of mount points
> for a test. It's easy to mount the same directory of the server on some
> directries of the client so that each of them can work simultaneously.
>    # mount -t nfs server:/foo   /baa1
>    # mount -t nfs server:/foo   /baa2
>    # mount -t nfs server:/foo   /baa3

I don't think it is a client congestion issue at this point.  I can run the 
test with just one client on UDP and achieve 11.2 MB/sec with just one mount 
point.  The client has 100 Mbit Ethernet, so should be the upper limit (or 
really close).  In the 40 client read test, I have only achieved 2.875 MB/sec 
per client.  That and the fact that there are never more than 2 nfsd threads 
in a run state at one time (for UDP only) leads me to believe there is still 
a scaling problem on the server for UDP.  I will continue to run the test and 
poke a prod around.  Hopefully something will jump out at me.  Thanks for all 
the input!

Andrew Theurer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-22 21:16                             ` Andrew Theurer
@ 2002-10-23  9:29                               ` Hirokazu Takahashi
  2002-10-24 15:32                                 ` Andrew Theurer
  0 siblings, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23  9:29 UTC (permalink / raw)
  To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hi,

> > > > Congestion avoidance mechanism of NFS clients might cause this
> > > > situation.  I think the congestion window size is not enough
> > > > for high end machines.  You can make the window be larger as a
> > > > test.

> I don't think it is a client congestion issue at this point.  I can run the 
> test with just one client on UDP and achieve 11.2 MB/sec with just one mount 
> point.  The client has 100 Mbit Ethernet, so should be the upper limit (or 
> really close).  In the 40 client read test, I have only achieved 2.875 MB/sec 
> per client.  That and the fact that there are never more than 2 nfsd threads 
> in a run state at one time (for UDP only) leads me to believe there is still 
> a scaling problem on the server for UDP.  I will continue to run the test and 
> poke a prod around.  Hopefully something will jump out at me.  Thanks for all 
> the input!

Can You check /proc/net/rpc/nfsd which shows how many NFS requests have
been retransmitted ?

# cat /proc/net/rpc/nfsd
rc 0 27680 162118
  ^^^
This field means the clinents have retransmitted pakeckets.
The transmission ratio will slow down if it have happened once.
It may occur if the response from the server is slower than the
clinents expect.

And you can use older version - e.g. linux-2.4 series - for clients
and see what will happen as older versions don't have any intelligent
features.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-23  9:29                               ` Hirokazu Takahashi
@ 2002-10-24 15:32                                 ` Andrew Theurer
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Theurer @ 2002-10-24 15:32 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

> > I don't think it is a client congestion issue at this point.  I can run
the
> > test with just one client on UDP and achieve 11.2 MB/sec with just one
mount
> > point.  The client has 100 Mbit Ethernet, so should be the upper limit
(or
> > really close).  In the 40 client read test, I have only achieved 2.875
MB/sec
> > per client.  That and the fact that there are never more than 2 nfsd
threads
> > in a run state at one time (for UDP only) leads me to believe there is
still
> > a scaling problem on the server for UDP.  I will continue to run the
test and
> > poke a prod around.  Hopefully something will jump out at me.  Thanks
for all
> > the input!
>
> Can You check /proc/net/rpc/nfsd which shows how many NFS requests have
> been retransmitted ?
>
> # cat /proc/net/rpc/nfsd
> rc 0 27680 162118
>   ^^^
> This field means the clinents have retransmitted pakeckets.
> The transmission ratio will slow down if it have happened once.
> It may occur if the response from the server is slower than the
> clinents expect.

/proc/net/rpc/nfsd
rc 0 1 1025221

> And you can use older version - e.g. linux-2.4 series - for clients
> and see what will happen as older versions don't have any intelligent
> features.

Actually all of the clients are 2.4 (RH 7.0).  I could change them out to
2.5, but it may take me a little while.

Let me do a little digging around.  I seem to recall an issue I had earlier
this year when waking up the nfsd threads and having most of them just go
back to sleep.  I need to go back to that code and understand it a little
better.   Thanks for all of your help.

Andrew Theurer


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  3:44     ` Neil Brown
  2002-10-16  4:31       ` David S. Miller
@ 2002-10-16 11:09       ` Hirokazu Takahashi
  2002-10-16 17:02         ` kaza
  1 sibling, 1 reply; 36+ messages in thread
From: Hirokazu Takahashi @ 2002-10-16 11:09 UTC (permalink / raw)
  To: neilb; +Cc: davem, linux-kernel, nfs

Hello,

> > It will be effective on large scale SMP machines as all kNFSd shares
> > one NFS port. A udp socket can't send data on each CPU at the same
> > time while MSG_MORE/UDP_CORK options are set.
> > The UDP socket have to block any other requests during making a UDP frame.
> > 

> After thinking about this some more, I suspect it would have to be
> quite large scale SMP to get much contention.

I have no idea how much contention will happen. I haven't checked the
performance of it on large scale SMP yet as I don't have such a great
machines.

Does anyone help us?

> The only contention on the udp socket is, as you say, assembling a udp
> frame, and it would be surprised if that takes a substantial faction
> of the time to handle a request.

After assembling a udp frame, kNFSd may drive a NIC to transmit the frame.

> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs.  Maybe it would make sense to
> have one udp socket for each NIC.  Would that make sense? or work?

Some CPUs often share one GbE NIC today as a NIC can handle much data
than one CPU can. I think that CPU seems likely to become bottleneck.
Personally I guess several CPUs will share one 10GbE NIC in the near
future even if it's a high end machine. (It's just my guess)

But I don't know how effective this patch works......

devem> Doesn't make much sense.
devem> 
devem> Usually we are talking via one IP address, and thus over
devem> one device.  It could be using multiple NICs via BONDING,
devem> but that would be transparent to anything at the socket
devem> level.
devem> 
devem> Really, I think there is real value to making the socket
devem> per-cpu even on a 2 or 4 way system.

I wish so.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16 11:09       ` Hirokazu Takahashi
@ 2002-10-16 17:02         ` kaza
  2002-10-17  4:36           ` rddunlap
  0 siblings, 1 reply; 36+ messages in thread
From: kaza @ 2002-10-16 17:02 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

Hello,

On Wed, Oct 16, 2002 at 08:09:00PM +0900,
Hirokazu Takahashi-san wrote:
> > After thinking about this some more, I suspect it would have to be
> > quite large scale SMP to get much contention.
> 
> I have no idea how much contention will happen. I haven't checked the
> performance of it on large scale SMP yet as I don't have such a great
> machines.
> 
> Does anyone help us?

Why don't you propose the performance test to OSDL? (OSDL-J is more
better, I think)   OSDL provide hardware resources and operation staffs.

If you want, I can help you to propose it. :-)

-- 
Ko Kazaana / editor-in-chief of "TechStyle" ( http://techstyle.jp/ )
GnuPG Fingerprint = 1A50 B204 46BD EE22 2E8C  903F F2EB CEA7 4BCF 808F

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16 17:02         ` kaza
@ 2002-10-17  4:36           ` rddunlap
  0 siblings, 0 replies; 36+ messages in thread
From: rddunlap @ 2002-10-17  4:36 UTC (permalink / raw)
  To: kaza; +Cc: Hirokazu Takahashi, neilb, davem, linux-kernel, nfs

On Thu, 17 Oct 2002 kaza@kk.iij4u.or.jp wrote:

| Hello,
|
| On Wed, Oct 16, 2002 at 08:09:00PM +0900,
| Hirokazu Takahashi-san wrote:
| > > After thinking about this some more, I suspect it would have to be
| > > quite large scale SMP to get much contention.
| >
| > I have no idea how much contention will happen. I haven't checked the
| > performance of it on large scale SMP yet as I don't have such a great
| > machines.
| >
| > Does anyone help us?
|
| Why don't you propose the performance test to OSDL? (OSDL-J is more
| better, I think)   OSDL provide hardware resources and operation staffs.

and why do you say that?  8;)

| If you want, I can help you to propose it. :-)

That's the right thing to do.

-- 
~Randy

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2002-10-24 15:19 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-09-18  8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
2002-09-18 23:00 ` David S. Miller
2002-09-18 23:54   ` Alan Cox
2002-09-19  0:16     ` Andrew Morton
2002-09-19  2:13       ` Aaron Lehmann
2002-09-19  3:30         ` Andrew Morton
2002-09-19 10:42           ` Alan Cox
2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
2002-09-19 20:42         ` Andrew Morton
2002-09-19 21:12           ` David S. Miller
2002-09-21 11:56   ` Pavel Machek
2002-10-14  5:50 ` Neil Brown
2002-10-14  6:15   ` David S. Miller
2002-10-14 10:45     ` kuznet
2002-10-14 10:48       ` David S. Miller
2002-10-14 12:01   ` Hirokazu Takahashi
2002-10-14 14:12     ` Andrew Theurer
2002-10-16  3:44     ` Neil Brown
2002-10-16  4:31       ` David S. Miller
2002-10-16 15:04         ` Andrew Theurer
2002-10-17  2:03         ` [NFS] " Andrew Theurer
2002-10-17  2:31           ` Hirokazu Takahashi
2002-10-17 13:16             ` Andrew Theurer
2002-10-17 13:26               ` Hirokazu Takahashi
2002-10-17 14:10                 ` Andrew Theurer
2002-10-17 16:26                   ` Hirokazu Takahashi
2002-10-18  5:38                     ` Trond Myklebust
2002-10-18  7:19                       ` Hirokazu Takahashi
2002-10-18 15:12                         ` Andrew Theurer
2002-10-19 20:34                           ` Hirokazu Takahashi
2002-10-22 21:16                             ` Andrew Theurer
2002-10-23  9:29                               ` Hirokazu Takahashi
2002-10-24 15:32                                 ` Andrew Theurer
2002-10-16 11:09       ` Hirokazu Takahashi
2002-10-16 17:02         ` kaza
2002-10-17  4:36           ` rddunlap

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).