[PATCH] zerocopy NFS for 2.5.36


* [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-18  8:14 Hirokazu Takahashi
  2002-09-18 23:00 ` David S. Miller
  2002-10-14  5:50 ` Neil Brown
  0 siblings, 2 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-09-18  8:14 UTC (permalink / raw)
  To: Neil Brown, linux-kernel, nfs

Hello,

I ported the zerocopy NFS patches against linux-2.5.36.

I made va05-zerocopy-nfsdwrite-2.5.36.patch more generic,
so that it will be easy to merge with NFSv4. Each procedure can
choose whether it accepts split buffers or not.
I also fixed a problem where nfsd couldn't handle very large
NFS-symlink requests.

1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables hardware checksumming of outgoing packets, including UDP frames.

2)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va11-udpsendfile-2.5.36.patch
This patch makes the sendfile system call work over UDP. It also supports
a UDP_CORK interface, which is very similar to TCP_CORK. You can also call
sendmsg/sendfile with the MSG_MORE flag on UDP sockets.

3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes a problem in the x86 csum_partial() routines, which
can't handle odd-addressed buffers.

4)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va01-zerocopy-rpc-2.5.36.patch
This patch lets RPC send pieces of data and pages without copying.

5)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va02-zerocopy-nfsdread-2.5.36.patch
This patch makes NFSD send pages in the pagecache directly when NFS clients request
a file read.

6)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va03-zerocopy-nfsdreaddir-2.5.36.patch
nfsd_readdir can also send pages without copying.

7)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va04-zerocopy-shadowsock-2.5.36.patch
This patch creates per-CPU UDP sockets so that NFSD can send UDP frames on
each processor simultaneously.
Without the patch we can send only one UDP frame at a time, because a UDP socket
has to be locked while sending pages in order to serialize them.

8)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va05-zerocopy-nfsdwrite-2.5.36.patch
This patch makes NFS write use the writev interface. NFSD can handle NFS
requests without reassembling IP fragments into one UDP frame.

9)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/taka-writev-2.5.36.patch
This patch makes writev on regular files work faster.
It also can be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/

Caution:
       XFS doesn't support the writev interface yet. NFS write on XFS might
       slow down with patch No.8. I hope the SGI guys will implement it.

10)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va07-nfsbigbuf-2.5.36.patch
This makes the NFS buffer much bigger (60KB).
To the Linux kernel a 60KB buffer costs the same as a 32KB buffer, as both of them
require a 64KB chunk.

11)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va09-zerocopy-tempsendto-2.5.36.patch
If you don't want to use sendfile over UDP yet, you can apply this instead of patches No.1 and No.2.
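
To illustrate the gather-write idea behind patch No.8, here is a minimal
userspace sketch (not the nfsd code) of how writev() writes several separate
buffers in one call, without first copying them into one contiguous buffer.
The helper name write_fragments and its 8-fragment limit are hypothetical;
only writev() itself is the real POSIX interface.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>     /* pipe(), read(), close() for callers */

#define MAX_FRAGS 8     /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: write up to MAX_FRAGS NUL-terminated fragments to fd
 * with a single writev() call. Returns bytes written, or -1 on error. */
ssize_t write_fragments(int fd, const char *frags[], int nfrags)
{
    struct iovec iov[MAX_FRAGS];
    int i;

    assert(nfrags >= 0 && nfrags <= MAX_FRAGS);
    for (i = 0; i < nfrags; i++) {
        iov[i].iov_base = (void *)frags[i];  /* writev() only reads it */
        iov[i].iov_len = strlen(frags[i]);
    }
    return writev(fd, iov, nfrags);
}
```

A caller passes an array of fragment pointers and a descriptor from open() or
pipe(); the kernel walks the iovec array itself, so no staging buffer is needed.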

Regards,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18  8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
@ 2002-09-18 23:00 ` David S. Miller
  2002-09-18 23:54     ` Alan Cox
  2002-09-21 11:56     ` Pavel Machek
  2002-10-14  5:50 ` Neil Brown
  1 sibling, 2 replies; 98+ messages in thread
From: David S. Miller @ 2002-09-18 23:00 UTC (permalink / raw)
  To: taka; +Cc: neilb, linux-kernel, nfs

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Wed, 18 Sep 2002 17:14:31 +0900 (JST)

   1)
   ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
   This patch enables HW-checksum against outgoing packets including UDP frames.

Can you explain the TCP parts?  They look very wrong.

It was discussed long ago that csum_and_copy_from_user() performs
better than plain copy_from_user() on x86.  I do not remember all
details, but I do know that using copy_from_user() is not a real
improvement at least on x86 architecture.
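
For readers unfamiliar with the trade-off: a copy-and-checksum routine folds
the Internet checksum into the copy loop, so the data is read only once. The
portable userspace sketch below shows the idea. It is not the kernel's
csum_and_copy_from_user(), which is architecture-specific and also handles
faults; the name copy_and_csum is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: copy len bytes from src to dst and return the 16-bit
 * ones'-complement (Internet) checksum of the data, reading each source
 * byte exactly once. */
uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint64_t)s[i] << 8) | s[i + 1];  /* big-endian 16-bit word */
    }
    if (i < len) {                                /* odd trailing byte */
        d[i] = s[i];
        sum += (uint64_t)s[i] << 8;
    }
    while (sum >> 16)                             /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Whether this single pass beats a plain copy followed by hardware checksumming
is exactly the measurement question raised here; the sketch makes no claim
either way.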

The rest of the changes (ie. the getfrag() logic to set
skb->ip_summed) looks fine.

   3)
   ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
   This patch fixes the problem of x86 csum_partial() routines which
   can't handle odd addressed buffers.

I've sent Linus this fix already.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18 23:00 ` David S. Miller
@ 2002-09-18 23:54     ` Alan Cox
  2002-09-21 11:56     ` Pavel Machek
  1 sibling, 0 replies; 98+ messages in thread
From: Alan Cox @ 2002-09-18 23:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: taka, neilb, linux-kernel, nfs

On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86.  I do not remember all

The better was a freak of PPro/PII scheduling I think

> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

The "same as" bit is easy to explain. It's totally memory bandwidth limited
on current x86-32 processors. (Although I'd welcome demonstrations to
the contrary on newer toys)




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18 23:54     ` Alan Cox
  (?)
@ 2002-09-19  0:16     ` Andrew Morton
  2002-09-19  2:13       ` Aaron Lehmann
  2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
  -1 siblings, 2 replies; 98+ messages in thread
From: Andrew Morton @ 2002-09-19  0:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: David S. Miller, taka, neilb, linux-kernel, nfs

Alan Cox wrote:
> 
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86.  I do not remember all
> 
> The better was a freak of PPro/PII scheduling I think
> 
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
> 
> The same as bit is easy to explain. Its totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)

Nope.  There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:

copy_32 uses movsl.  copy_duff just uses a stream of "movl"s

Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s     
nbytes=10240  from_align=0, to_align=0
    copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

OK, movsl wins.   But now give the source address 8+1 alignment:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240  from_align=1, to_align=0
    copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec

The "movl"-based copy wins.  By miles.

Make the source 8+4 aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240  from_align=4, to_align=0
    copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec

So movl still beats movsl, by lots.

I have various scriptlets which generate the entire matrix.

I think I ended up deciding that we should use movsl _only_
when both src and dst are 8-byte-aligned.  And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.

It needs redoing.  These differences are really big, and this
is the kernel's most expensive function.

A little project for someone.

The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz
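
A rough userspace sketch of this kind of measurement, for readers without the
tarball. This is not cptimer: the function names and buffer padding are
assumptions, plain memcpy() stands in for the movsl and movl variants, and
nothing here controls cache state the way the uncached runs above do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Return a pointer inside buf that is 8-byte aligned plus `offset` bytes.
 * buf needs at least 16 spare bytes beyond the data size. */
unsigned char *align8_plus(unsigned char *buf, unsigned offset)
{
    uintptr_t p = ((uintptr_t)buf + 7) & ~(uintptr_t)7;

    assert(offset < 8);
    return (unsigned char *)(p + offset);
}

/* Copy nbytes `loops` times from an (8+from_align)-aligned source to an
 * (8+to_align)-aligned destination. Returns elapsed CPU seconds, or -1.0
 * if allocation fails. */
double time_copy(size_t nbytes, unsigned from_align, unsigned to_align,
                 int loops)
{
    unsigned char *sbuf = malloc(nbytes + 16);
    unsigned char *dbuf = malloc(nbytes + 16);
    double secs = -1.0;

    if (sbuf && dbuf) {
        unsigned char *src = align8_plus(sbuf, from_align);
        unsigned char *dst = align8_plus(dbuf, to_align);
        clock_t t0;
        int i;

        memset(src, 0xa5, nbytes);
        t0 = clock();
        for (i = 0; i < loops; i++)
            memcpy(dst, src, nbytes);
        secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    free(sbuf);
    free(dbuf);
    return secs;
}
```

Calling time_copy(10240, 1, 0, loops) against time_copy(10240, 0, 0, loops)
mirrors the "-f 1" run above. An optimizing compiler may collapse the repeated
memcpy() calls, so a serious benchmark needs more care than this.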

^ permalink raw reply	[flat|nested] 98+ messages in thread

* RE: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-19  2:00 Lever, Charles
  0 siblings, 0 replies; 98+ messages in thread
From: Lever, Charles @ 2002-09-19  2:00 UTC (permalink / raw)
  To: 'Andrew Morton'
  Cc: David S. Miller, taka, neilb, linux-kernel, nfs, Alan Cox

dude, that's pretty cool.

if you were re-implementing XDR, you think a series of movl
instructions would be best?  i'm not sure how practical that
is for an architecture-independent implementation.

> > > It was discussed long ago that csum_and_copy_from_user() performs 
> > > better than plain copy_from_user() on x86.  I do not remember all
> > 
> > The better was a freak of PPro/PII scheduling I think
> > 
> > > details, but I do know that using copy_from_user() is not a real 
> > > improvement at least on x86 architecture.
> > 
> > The same as bit is easy to explain. Its totally memory bandwidth 
> > limited on current x86-32 processors. (Although I'd welcome 
> > demonstrations to the contrary on newer toys)
> 
> Nope.  There are distinct alignment problems with movsl-based 
> memcpy on PII and (at least) "Pentium III (Coppermine)", 
> which is tested here:
> 
> copy_32 uses movsl.  copy_duff just uses a stream of "movl"s
> 
> Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:
> 
> akpm:/usr/src/cptimer> ./cptimer -d -s     
> nbytes=10240  from_align=0, to_align=0
>     copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
> 
> OK, movsl wins.   But now give the source address 8+1 alignment:
> 
> akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
> nbytes=10240  from_align=1, to_align=0
>     copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec
> 
> The "movl"-based copy wins.  By miles.
> 
> Make the source 8+4 aligned:
> 
> akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
> nbytes=10240  from_align=4, to_align=0
>     copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec
> 
> So movl still beats movsl, by lots.
> 
> I have various scriptlets which generate the entire matrix.
> 
> I think I ended up deciding that we should use movsl _only_ 
> when both src and dsc are 8-byte-aligned.  And that when you 
> multiply the gain from that by the frequency*size with which 
> funny alignments are used by TCP the net gain was 2% or something.
> 
> It needs redoing.  These differences are really big, and this 
> is the kernel's most expensive function.
> 
> A little project for someone.
> 
> The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  0:16     ` Andrew Morton
@ 2002-09-19  2:13       ` Aaron Lehmann
  2002-09-19  3:30           ` Andrew Morton
  2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
  1 sibling, 1 reply; 98+ messages in thread
From: Aaron Lehmann @ 2002-09-19  2:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nfs

> akpm:/usr/src/cptimer> ./cptimer -d -s     
> nbytes=10240  from_align=0, to_align=0
>     copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

It's disappointing that this program doesn't seem to support
benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
Those seem to be the more interesting memcpy functions on modern
systems.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  2:13       ` Aaron Lehmann
@ 2002-09-19  3:30           ` Andrew Morton
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Morton @ 2002-09-19  3:30 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: linux-kernel, nfs

Aaron Lehmann wrote:
> 
> > akpm:/usr/src/cptimer> ./cptimer -d -s
> > nbytes=10240  from_align=0, to_align=0
> >     copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> > __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
> 
> It's disappointing that this program doesn't seem to support
> benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> Those seem to be the more interesting memcpy functions on modern
> systems.

Well the source is there, and the licensing terms are most reasonable.

But then, the source was there eighteen months ago and nothing happened.
Sigh.

I think in-kernel MMX has fatal drawbacks anyway.  Not sure what
they are - I prefer to pretend that x86 CPUs execute raw C.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  3:30           ` Andrew Morton
@ 2002-09-19 10:42             ` Alan Cox
  -1 siblings, 0 replies; 98+ messages in thread
From: Alan Cox @ 2002-09-19 10:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aaron Lehmann, linux-kernel, nfs

On Thu, 2002-09-19 at 04:30, Andrew Morton wrote:
> > It's disappointing that this program doesn't seem to support
> > benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> > Those seem to be the more interesting memcpy functions on modern
> > systems.
> 
> Well the source is there, and the licensing terms are most reasonable.
> 
> But then, the source was there eighteen months ago and nothing happened.
> Sigh.
> 
> I think in-kernel MMX has fatal drawbacks anyway.  Not sure what
> they are - I prefer to pretend that x86 CPUs execute raw C.

MMX isn't useful for anything smaller than about 512 bytes to 1K. It's not
useful in interrupt handlers. The list goes on.




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19  0:16     ` Andrew Morton
  2002-09-19  2:13       ` Aaron Lehmann
@ 2002-09-19 13:15       ` Hirokazu Takahashi
  2002-09-19 20:42         ` Andrew Morton
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-09-19 13:15 UTC (permalink / raw)
  To: akpm; +Cc: alan, davem, neilb, linux-kernel, nfs

Hello,

> > > details, but I do know that using copy_from_user() is not a real
> > > improvement at least on x86 architecture.
> > 
> > The same as bit is easy to explain. Its totally memory bandwidth limited
> > on current x86-32 processors. (Although I'd welcome demonstrations to
> > the contrary on newer toys)
> 
> Nope.  There are distinct alignment problems with movsl-based
> memcpy on PII and (at least) "Pentium III (Coppermine)", which is
> tested here:
...
> I have various scriptlets which generate the entire matrix.
> 
> I think I ended up deciding that we should use movsl _only_
> when both src and dsc are 8-byte-aligned.  And that when you
> multiply the gain from that by the frequency*size with which
> funny alignments are used by TCP the net gain was 2% or something.

Amazing! I believed 4-byte alignment was enough.
The read/write system calls may also see their penalties reduced.

> It needs redoing.  These differences are really big, and this
> is the kernel's most expensive function.
> 
> A little project for someone.

OK, if nobody else wants to do it, I'll do it myself.

> The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
@ 2002-09-19 20:42         ` Andrew Morton
  2002-09-19 21:12             ` [NFS] " David S. Miller
  0 siblings, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2002-09-19 20:42 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: alan, davem, neilb, linux-kernel, nfs

Hirokazu Takahashi wrote:
> 
> ...
> > It needs redoing.  These differences are really big, and this
> > is the kernel's most expensive function.
> >
> > A little project for someone.
> 
> OK, if there is nobody who wants to do it I'll do it by myself.

That would be fantastic - thanks.  This is more a measurement
and testing exercise than a coding one.  And if those measurements
are sufficiently nice (eg: >5%) then a 2.4 backport should be done.

It seems that movsl works acceptably with all alignments on AMD
hardware, although this needs to be checked with more recent machines.

movsl is a (bad) loss on PII and PIII for all alignments except 8&8.
Don't know about P4 - I can test that in a day or two.

I expect that a minimal, 90% solution would be just:

fancy_copy_to_user(dst, src, count)
{
	if (arch_has_sane_movsl || ((dst|src) & 7) == 0)
		movsl_copy_to_user(dst, src, count);
	else
		movl_copy_to_user(dst, src, count);
}

and

#ifndef ARCH_HAS_FANCY_COPY_USER
#define fancy_copy_to_user copy_to_user
#endif

and we really only need fancy_copy_to_user in a handful of
places - the bulk copies in networking and filemap.c.  For all
the other call sites it's probably more important to keep the
code footprint down than it is to squeeze the last few drops out
of the copy speed.
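
As a userspace illustration of that dispatch (not kernel code): the alignment
test has to be done on the pointer values as integers, and both copy routines
below are plain C stand-ins for the movsl- and movl-based versions. Every name
here is hypothetical, including the runtime flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum copy_path { PATH_STRING_MOVE, PATH_WORD_LOOP };

/* Assumed runtime flag: nonzero on CPUs where the string-move copy is
 * acceptable at any alignment. */
int arch_has_sane_movsl = 0;

/* Decide which path the alignment test would take. */
enum copy_path choose_copy_path(const void *dst, const void *src)
{
    if (arch_has_sane_movsl ||
        ((((uintptr_t)dst | (uintptr_t)src) & 7) == 0))
        return PATH_STRING_MOVE;
    return PATH_WORD_LOOP;
}

/* Stand-in for an unrolled movl copy: a byte loop, safe at any alignment. */
static void word_loop_copy(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (count--)
        *d++ = *s++;
}

/* Copy count bytes, picking the routine by alignment; returns the path used. */
enum copy_path fancy_copy(void *dst, const void *src, size_t count)
{
    enum copy_path path = choose_copy_path(dst, src);

    assert(dst != NULL && src != NULL);
    if (path == PATH_STRING_MOVE)
        memcpy(dst, src, count);          /* stand-in for rep movsl */
    else
        word_loop_copy(dst, src, count);
    return path;
}
```

Keeping the decision in one out-of-line function rather than at every call
site is what keeps the code footprint small for the ordinary callers.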

Mala Anand has done some work on this.  See
http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.3/0100.html

<searches>  Yes, I have a copy of Mala's patch here which works
against 2.5.current.  Mala's patch will cause quite an expansion
of kernel size; we would need an implementation which did not
use inlining.  This work was discussed at OLS2002.  See
http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz


 uaccess.h |  252 +++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 193 insertions(+), 59 deletions(-)

--- 2.5.25/include/asm-i386/uaccess.h~fast-cu	Tue Jul  9 21:34:58 2002
+++ 2.5.25-akpm/include/asm-i386/uaccess.h	Tue Jul  9 21:51:03 2002
@@ -253,55 +253,197 @@ do {									\
  */
 
 /* Generic arbitrary sized copy.  */
-#define __copy_user(to,from,size)					\
-do {									\
-	int __d0, __d1;							\
-	__asm__ __volatile__(						\
-		"0:	rep; movsl\n"					\
-		"	movl %3,%0\n"					\
-		"1:	rep; movsb\n"					\
-		"2:\n"							\
-		".section .fixup,\"ax\"\n"				\
-		"3:	lea 0(%3,%0,4),%0\n"				\
-		"	jmp 2b\n"					\
-		".previous\n"						\
-		".section __ex_table,\"a\"\n"				\
-		"	.align 4\n"					\
-		"	.long 0b,3b\n"					\
-		"	.long 1b,2b\n"					\
-		".previous"						\
-		: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
-		: "r"(size & 3), "0"(size / 4), "1"(to), "2"(from)	\
-		: "memory");						\
+#define __copy_user(to,from,size)				\
+do {								\
+	int __d0, __d1;						\
+	__asm__ __volatile__(					\
+	"       cmpl $63, %0\n"					\
+	"       jbe  5f\n"					\
+	"       mov %%esi, %%eax\n"				\
+	"       test $7, %%al\n"				\
+	"       jz  5f\n"					\
+	"       .align 2,0x90\n"				\
+	"0:     movl 32(%4), %%eax\n"				\
+	"       cmpl $67, %0\n"					\
+	"       jbe 1f\n"					\
+	"       movl 64(%4), %%eax\n"				\
+	"       .align 2,0x90\n"				\
+	"1:     movl 0(%4), %%eax\n"				\
+	"       movl 4(%4), %%edx\n"				\
+	"2:     movl %%eax, 0(%3)\n"				\
+	"21:    movl %%edx, 4(%3)\n"				\
+	"       movl 8(%4), %%eax\n"				\
+	"       movl 12(%4),%%edx\n"				\
+	"3:     movl %%eax, 8(%3)\n"				\
+	"31:    movl %%edx, 12(%3)\n"				\
+	"       movl 16(%4), %%eax\n"				\
+	"       movl 20(%4), %%edx\n"				\
+	"4:     movl %%eax, 16(%3)\n"				\
+	"41:    movl %%edx, 20(%3)\n"				\
+	"       movl 24(%4), %%eax\n"				\
+	"       movl 28(%4), %%edx\n"				\
+	"10:    movl %%eax, 24(%3)\n"				\
+	"51:    movl %%edx, 28(%3)\n"				\
+	"       movl 32(%4), %%eax\n"				\
+	"       movl 36(%4), %%edx\n"				\
+	"11:    movl %%eax, 32(%3)\n"				\
+	"61:    movl %%edx, 36(%3)\n"				\
+	"       movl 40(%4), %%eax\n"				\
+	"       movl 44(%4), %%edx\n"				\
+	"12:    movl %%eax, 40(%3)\n"				\
+	"71:    movl %%edx, 44(%3)\n"				\
+	"       movl 48(%4), %%eax\n"				\
+	"       movl 52(%4), %%edx\n"				\
+	"13:    movl %%eax, 48(%3)\n"				\
+	"81:    movl %%edx, 52(%3)\n"				\
+	"       movl 56(%4), %%eax\n"				\
+	"       movl 60(%4), %%edx\n"				\
+	"14:    movl %%eax, 56(%3)\n"				\
+	"91:    movl %%edx, 60(%3)\n"				\
+	"       addl $-64, %0\n"				\
+	"       addl $64, %4\n"					\
+	"       addl $64, %3\n"					\
+	"       cmpl $63, %0\n"					\
+	"       ja  0b\n"					\
+	"5:   movl  %0, %%eax\n"				\
+	"       shrl  $2, %0\n"					\
+	"       andl  $3, %%eax\n"				\
+	"       cld\n"						\
+	"6:     rep; movsl\n"					\
+	"       movl %%eax, %0\n"				\
+	"7:   rep; movsb\n"					\
+	"8:\n"							\
+	".section .fixup,\"ax\"\n"				\
+	"9:   lea 0(%%eax,%0,4),%0\n"				\
+	"     jmp 8b\n"						\
+	"15:    movl %6, %0\n"					\
+	"       jmp 8b\n"					\
+	".previous\n"						\
+	".section __ex_table,\"a\"\n"				\
+	"     .align 4\n"					\
+	"     .long 2b,15b\n"					\
+	"     .long 21b,15b\n"					\
+	"     .long 3b,15b\n"					\
+	"     .long 31b,15b\n"					\
+	"     .long 4b,15b\n"					\
+	"     .long 41b,15b\n"					\
+	"     .long 10b,15b\n"					\
+	"     .long 51b,15b\n"					\
+	"     .long 11b,15b\n"					\
+	"     .long 61b,15b\n"					\
+	"     .long 12b,15b\n"					\
+	"     .long 71b,15b\n"					\
+	"     .long 13b,15b\n"					\
+	"     .long 81b,15b\n"					\
+	"     .long 14b,15b\n"					\
+	"     .long 91b,15b\n"					\
+	"     .long 6b,9b\n"					\
+	"       .long 7b,8b\n"					\
+	".previous"						\
+	: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
+	:  "1"(to), "2"(from), "0"(size),"i"(-EFAULT)		\
+	: "eax", "edx", "memory");				\
 } while (0)
 
-#define __copy_user_zeroing(to,from,size)				\
-do {									\
-	int __d0, __d1;							\
-	__asm__ __volatile__(						\
-		"0:	rep; movsl\n"					\
-		"	movl %3,%0\n"					\
-		"1:	rep; movsb\n"					\
-		"2:\n"							\
-		".section .fixup,\"ax\"\n"				\
-		"3:	lea 0(%3,%0,4),%0\n"				\
-		"4:	pushl %0\n"					\
-		"	pushl %%eax\n"					\
-		"	xorl %%eax,%%eax\n"				\
-		"	rep; stosb\n"					\
-		"	popl %%eax\n"					\
-		"	popl %0\n"					\
-		"	jmp 2b\n"					\
-		".previous\n"						\
-		".section __ex_table,\"a\"\n"				\
-		"	.align 4\n"					\
-		"	.long 0b,3b\n"					\
-		"	.long 1b,4b\n"					\
-		".previous"						\
-		: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
-		: "r"(size & 3), "0"(size / 4), "1"(to), "2"(from)	\
-		: "memory");						\
-} while (0)
+#define __copy_user_zeroing(to,from,size)			\
+do {								\
+	int __d0, __d1;						\
+	__asm__ __volatile__(					\
+	"       cmpl $63, %0\n"					\
+	"       jbe  5f\n"					\
+	"       movl %%edi, %%eax\n"				\
+	"       test $7, %%al\n"				\
+	"       jz   5f\n"					\
+	"       .align 2,0x90\n"				\
+	"0:     movl 32(%4), %%eax\n"				\
+	"       cmpl $67, %0\n"					\
+	"       jbe 2f\n"					\
+	"1:     movl 64(%4), %%eax\n"				\
+	"       .align 2,0x90\n"				\
+	"2:     movl 0(%4), %%eax\n"				\
+	"21:    movl 4(%4), %%edx\n"				\
+	"       movl %%eax, 0(%3)\n"				\
+	"       movl %%edx, 4(%3)\n"				\
+	"3:     movl 8(%4), %%eax\n"				\
+	"31:    movl 12(%4),%%edx\n"				\
+	"       movl %%eax, 8(%3)\n"				\
+	"       movl %%edx, 12(%3)\n"				\
+	"4:     movl 16(%4), %%eax\n"				\
+	"41:    movl 20(%4), %%edx\n"				\
+	"       movl %%eax, 16(%3)\n"				\
+	"       movl %%edx, 20(%3)\n"				\
+	"10:    movl 24(%4), %%eax\n"				\
+	"51:    movl 28(%4), %%edx\n"				\
+	"       movl %%eax, 24(%3)\n"				\
+	"       movl %%edx, 28(%3)\n"				\
+	"11:    movl 32(%4), %%eax\n"				\
+	"61:    movl 36(%4), %%edx\n"				\
+	"       movl %%eax, 32(%3)\n"				\
+	"       movl %%edx, 36(%3)\n"				\
+	"12:    movl 40(%4), %%eax\n"				\
+	"71:    movl 44(%4), %%edx\n"				\
+	"       movl %%eax, 40(%3)\n"				\
+	"       movl %%edx, 44(%3)\n"				\
+	"13:    movl 48(%4), %%eax\n"				\
+	"81:    movl 52(%4), %%edx\n"				\
+	"       movl %%eax, 48(%3)\n"				\
+	"       movl %%edx, 52(%3)\n"				\
+	"14:    movl 56(%4), %%eax\n"				\
+	"91:    movl 60(%4), %%edx\n"				\
+	"       movl %%eax, 56(%3)\n"				\
+	"       movl %%edx, 60(%3)\n"				\
+	"       addl $-64, %0\n"				\
+	"       addl $64, %4\n"					\
+	"       addl $64, %3\n"					\
+	"       cmpl $63, %0\n"					\
+	"       ja  0b\n"					\
+	"5:   movl  %0, %%eax\n"				\
+	"       shrl  $2, %0\n"					\
+	"       andl $3, %%eax\n"				\
+	"       cld\n"						\
+	"6:     rep; movsl\n"					\
+	"       movl %%eax,%0\n"				\
+	"7:   rep; movsb\n"					\
+	"8:\n"							\
+	".section .fixup,\"ax\"\n"				\
+	"9:   lea 0(%%eax,%0,4),%0\n"				\
+	"16:  pushl %0\n"					\
+	"     pushl %%eax\n"					\
+	"     xorl %%eax,%%eax\n"				\
+	"     rep; stosb\n"					\
+	"     popl %%eax\n"					\
+	"     popl %0\n"					\
+	"     jmp 8b\n"						\
+	"15:    movl %6, %0\n"					\
+	"       jmp 8b\n"					\
+	".previous\n"						\
+	".section __ex_table,\"a\"\n"				\
+	"     .align 4\n"					\
+	"     .long 0b,16b\n"					\
+	"     .long 1b,16b\n"					\
+	"     .long 2b,16b\n"					\
+	"     .long 21b,16b\n"					\
+	"     .long 3b,16b\n"					\
+	"     .long 31b,16b\n"					\
+	"     .long 4b,16b\n"					\
+	"     .long 41b,16b\n"					\
+	"     .long 10b,16b\n"					\
+	"     .long 51b,16b\n"					\
+	"     .long 11b,16b\n"					\
+	"     .long 61b,16b\n"					\
+	"     .long 12b,16b\n"					\
+	"     .long 71b,16b\n"					\
+	"     .long 13b,16b\n"					\
+	"     .long 81b,16b\n"					\
+	"     .long 14b,16b\n"					\
+	"     .long 91b,16b\n"					\
+	"     .long 6b,9b\n"					\
+	"       .long 7b,16b\n"					\
+	".previous"						\
+	: "=&c"(size), "=&D" (__d0), "=&S" (__d1)		\
+	:  "1"(to), "2"(from), "0"(size),"i"(-EFAULT)		\
+	: "eax", "edx", "memory");				\
+ } while (0)
 
 /* We let the __ versions of copy_from/to_user inline, because they're often
  * used in fast paths and have only a small space overhead.
@@ -578,24 +720,16 @@ __constant_copy_from_user_nocheck(void *
 }
 
 #define copy_to_user(to,from,n)				\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_to_user((to),(from),(n)) :	\
-	 __generic_copy_to_user((to),(from),(n)))
+	__generic_copy_to_user((to),(from),(n))
 
 #define copy_from_user(to,from,n)			\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_from_user((to),(from),(n)) :	\
-	 __generic_copy_from_user((to),(from),(n)))
+	__generic_copy_from_user((to),(from),(n))
 
 #define __copy_to_user(to,from,n)			\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_to_user_nocheck((to),(from),(n)) :	\
-	 __generic_copy_to_user_nocheck((to),(from),(n)))
+	__generic_copy_to_user_nocheck((to),(from),(n))
 
 #define __copy_from_user(to,from,n)			\
-	(__builtin_constant_p(n) ?			\
-	 __constant_copy_from_user_nocheck((to),(from),(n)) :	\
-	 __generic_copy_from_user_nocheck((to),(from),(n)))
+	__generic_copy_from_user_nocheck((to),(from),(n))
 
 long strncpy_from_user(char *dst, const char *src, long count);
 long __strncpy_from_user(char *dst, const char *src, long count);

-

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-19 20:42         ` Andrew Morton
@ 2002-09-19 21:12             ` David S. Miller
  0 siblings, 0 replies; 98+ messages in thread
From: David S. Miller @ 2002-09-19 21:12 UTC (permalink / raw)
  To: akpm; +Cc: taka, alan, neilb, linux-kernel, nfs

   From: Andrew Morton <akpm@digeo.com>
   Date: Thu, 19 Sep 2002 13:42:13 -0700
   
   Mala's patch will cause quite an expansion
   of kernel size; we would need an implementation which did not
   use inlining.

It definitely belongs in arch/i386/lib/copy.c or whatever,
not inlined.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
       [not found]   ` <3D8A36A5.846D806@digeo.com.suse.lists.linux.kernel>
@ 2002-09-20  1:00     ` Andi Kleen
  2002-09-20  1:09       ` Andrew Morton
  0 siblings, 1 reply; 98+ messages in thread
From: Andi Kleen @ 2002-09-20  1:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hirokazu Takahashi, alan, davem, neilb, linux-kernel, nfs

Andrew Morton <akpm@digeo.com> writes:

> Hirokazu Takahashi wrote:
> > 
> > ...
> > > It needs redoing.  These differences are really big, and this
> > > is the kernel's most expensive function.
> > >
> > > A little project for someone.
> > 
> > OK, if there is nobody who wants to do it I'll do it by myself.
> 
> That would be fantastic - thanks.  This is more a measurement
> and testing exercise than a coding one.  And if those measurements
> are sufficiently nice (eg: >5%) then a 2.4 backport should be done.

Very interesting IMHO would be to find a heuristic to switch between
a write combining copy and a cache hot copy. Write combining is good 
for blasting huge amounts of data quickly without killing your caches.
Cache hot is good for everything else.

But it'll need hints from the higher level code. e.g. read and write
could turn on write combining for bigger writes (let's say >8K) 
I discovered that just unconditionally turning it on for all copies 
is not good because it forces data out of cache. But I still have hope
that it helps for selected copies.
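
A minimal sketch of what such a hint-driven switch could look like, in
user-space C for illustration. The names choose_copy and WC_COPY_THRESHOLD
and the 8K cutoff are assumptions made up for this example, not existing
kernel interfaces, and the threshold would need measuring:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff, taken from the ">8K" guess; needs tuning per CPU. */
#define WC_COPY_THRESHOLD (8 * 1024)

enum copy_kind { COPY_CACHE_HOT, COPY_WRITE_COMBINING };

/*
 * Pick a copy strategy from the length and a caller-supplied hint.
 * dst_reused_soon is nonzero when the higher-level code expects the
 * destination to be touched again shortly, so it should stay in cache.
 */
enum copy_kind choose_copy(size_t len, int dst_reused_soon)
{
    if (dst_reused_soon)
        return COPY_CACHE_HOT;
    return len > WC_COPY_THRESHOLD ? COPY_WRITE_COMBINING : COPY_CACHE_HOT;
}
```

The point of passing the hint separately is that length alone is not
enough: a large copy whose destination is read right away still wants
the cache-hot path.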

-Andi

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  1:00     ` Andi Kleen
@ 2002-09-20  1:09       ` Andrew Morton
  2002-09-20  1:23         ` Andi Kleen
  0 siblings, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2002-09-20  1:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hirokazu Takahashi, alan, davem, neilb, linux-kernel, nfs

Andi Kleen wrote:
> 
> Andrew Morton <akpm@digeo.com> writes:
> 
> > Hirokazu Takahashi wrote:
> > >
> > > ...
> > > > It needs redoing.  These differences are really big, and this
> > > > is the kernel's most expensive function.
> > > >
> > > > A little project for someone.
> > >
> > > OK, if there is nobody who wants to do it I'll do it by myself.
> >
> > That would be fantastic - thanks.  This is more a measurement
> > and testing exercise than a coding one.  And if those measurements
> > are sufficiently nice (eg: >5%) then a 2.4 backport should be done.
> 
> Very interesting IMHO would be to find a heuristic to switch between
> a write combining copy and a cache hot copy. Write combining is good
> for blasting huge amounts of data quickly without killing your caches.
> Cache hot is good for everything else.

I expect that caching userspace and not pagecache would be
a reasonable choice.

> But it'll need hints from the higher level code. e.g. read and write
> could turn on write combining for bigger writes (let's say >8K)
> I discovered that just unconditionally turning it on for all copies
> is not good because it forces data out of cache. But I still have hope
> that it helps for selected copies.

Well if it's a really big read then bypassing the CPU cache on
the userspace-side buffer would make sense.

Can you control the cacheability of the memory reads as well?

What restrictions are there on these instructions?  Would
they force us to bear the cost of the alignment problem?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  1:09       ` Andrew Morton
@ 2002-09-20  1:23         ` Andi Kleen
  2002-09-20  1:27           ` David S. Miller
  0 siblings, 1 reply; 98+ messages in thread
From: Andi Kleen @ 2002-09-20  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Hirokazu Takahashi, alan, davem, neilb, linux-kernel,
	nfs

On Thu, Sep 19, 2002 at 06:09:34PM -0700, Andrew Morton wrote:
> > Very interesting IMHO would be to find a heuristic to switch between
> > a write combining copy and a cache hot copy. Write combining is good
> > for blasting huge amounts of data quickly without killing your caches.
> > Cache hot is good for everything else.
> 
> I expect that caching userspace and not pagecache would be
> a reasonable choice.

Normally yes, but not always. e.g. for squid you don't really want to 
cache user space.

But I guess it would be a reasonable heuristic. Or at least worth a try :-)

> 
> > But it'll need hints from the higher level code. e.g. read and write
> > could turn on write combining for bigger writes (let's say >8K)
> > I discovered that just unconditionally turning it on for all copies
> > is not good because it forces data out of cache. But I still have hope
> > that it helps for selected copies.
> 
> Well if it's a really big read then bypassing the CPU cache on
> the userspace-side buffer would make sense.
> 
> Can you control the cacheability of the memory reads as well?

SSE2 has hints for that (prefetchnta and even prefetcht0/t1 etc. for different
cache levels), but it's not completely clear how closely
the CPUs follow these.

For writing it's much more obvious and usually documented even.

> 
> What restrictions are there on these instructions?  Would
> they force us to bear the cost of the alignment problem?

They should be aligned, otherwise it makes no sense. When you assume it's
more likely that either source or destination is unaligned then you can easily
align either source or destination. The trick is to choose the right one;
it varies by call site.
(these are for big copies so a small alignment function is lost in the noise)

x86-64 copy_*_user currently aligns the destination, but hardcoding that
is a bit dumb and I'm not completely happy with it.
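
To illustrate the "align the destination, then stream" idea, here is a
rough user-space sketch using the SSE2 intrinsics. copy_nocache is a
hypothetical name for this example, not the x86-64 copy_*_user code, and
without SSE2 it simply falls back to memcpy:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy n bytes, bypassing the cache on the stores. */
void *copy_nocache(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* Head: plain copy until the destination is 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Body: unaligned loads, aligned non-temporal stores (movntdq). */
    for (; n >= 16; d += 16, s += 16, n -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }
    _mm_sfence();       /* order the streamed stores before later ones */
#endif
    memcpy(d, s, n);    /* tail, or the whole copy when SSE2 is absent */
    return dst;
}
```

This version hardcodes destination alignment, which is exactly the
choice questioned above; a variant aligning the source would use aligned
loads and could not use the streaming store for unaligned destinations.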

-Andi

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  1:23         ` Andi Kleen
@ 2002-09-20  1:27           ` David S. Miller
  2002-09-20  2:06             ` Andi Kleen
  0 siblings, 1 reply; 98+ messages in thread
From: David S. Miller @ 2002-09-20  1:27 UTC (permalink / raw)
  To: ak; +Cc: akpm, taka, alan, neilb, linux-kernel, nfs

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 20 Sep 2002 03:23:46 +0200

   On Thu, Sep 19, 2002 at 06:09:34PM -0700, Andrew Morton wrote:
   > Can you control the cacheability of the memory reads as well?

   SSE2 has hints for that (prefetchnta and even prefetcht0/t1 etc. for different
   cache levels), but it's not completely clear how closely
   the CPUs follow these.

   For writing it's much more obvious and usually documented even.

See "movntdq/movnti", the latter of which even works on integer
registers.  Ben LaHaise pointed this out to me earlier today.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  2:06             ` Andi Kleen
@ 2002-09-20  2:01               ` David S. Miller
  2002-09-20  2:28                 ` Andi Kleen
  0 siblings, 1 reply; 98+ messages in thread
From: David S. Miller @ 2002-09-20  2:01 UTC (permalink / raw)
  To: ak; +Cc: akpm, taka, alan, neilb, linux-kernel, nfs

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 20 Sep 2002 04:06:19 +0200

   > See "movntdq/movnti", the latter of which even works on integer
   > registers.  Ben LaHaise pointed this out to me earlier today.

   The issue is that you really want to do prefetching in these loops
   (waiting for the hardware prefetch is too slow because it needs several
   cache misses to trigger) so for cache hints on reading only prefetch
   instructions are interesting.

I'm talking about using this to bypass the cache on the stores.
The prefetches are a separate issue and I agree with you on that.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  1:27           ` David S. Miller
@ 2002-09-20  2:06             ` Andi Kleen
  2002-09-20  2:01               ` David S. Miller
  0 siblings, 1 reply; 98+ messages in thread
From: Andi Kleen @ 2002-09-20  2:06 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, akpm, taka, alan, neilb, linux-kernel, nfs

> See "movntdq/movnti", the latter of which even works on integer
> registers.  Ben LaHaise pointed this out to me earlier today.

The issue is that you really want to do prefetching in these loops
(waiting for the hardware prefetch is too slow because it needs several
cache misses to trigger) so for cache hints on reading only prefetch
instructions are interesting.

-Andi

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  2:28                 ` Andi Kleen
@ 2002-09-20  2:20                   ` David S. Miller
  2002-09-20  2:35                     ` Andi Kleen
  0 siblings, 1 reply; 98+ messages in thread
From: David S. Miller @ 2002-09-20  2:20 UTC (permalink / raw)
  To: ak; +Cc: akpm, taka, alan, neilb, linux-kernel, nfs

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 20 Sep 2002 04:28:19 +0200
   
   You cannot really use these instructions on Athlon,

I know that Athlon lacks these instructions, they are p4 sse2
only.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  2:01               ` David S. Miller
@ 2002-09-20  2:28                 ` Andi Kleen
  2002-09-20  2:20                   ` David S. Miller
  0 siblings, 1 reply; 98+ messages in thread
From: Andi Kleen @ 2002-09-20  2:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, akpm, taka, alan, neilb, linux-kernel, nfs

On Thu, Sep 19, 2002 at 07:01:54PM -0700, David S. Miller wrote:
>    From: Andi Kleen <ak@suse.de>
>    Date: Fri, 20 Sep 2002 04:06:19 +0200
> 
>    > See "movntdq/movnti", the latter of which even works on integer
>    > registers.  Ben LaHaise pointed this out to me earlier today.
>    
>    The issue is that you really want to do prefetching in these loops
>    (waiting for the hardware prefetch is too slow because it needs several
>    cache misses to trigger) so for cache hints on reading only prefetch
>    instructions are interesting.
>    
> I'm talking about using this to bypass the cache on the stores.
> The prefetches are a separate issue and I agree with you on that.

I was talking generally. You cannot really use these instructions on Athlon,
because they're microcoded and slow or do not exist. On Athlon it needs
3dnow write combining functions (adding FPU overhead so may not be worth
it). On P3/P4 you can use movnti/movntdq yes.

Just doing it for reads is more tricky/dubious.

-Andi

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-20  2:20                   ` David S. Miller
@ 2002-09-20  2:35                     ` Andi Kleen
  0 siblings, 0 replies; 98+ messages in thread
From: Andi Kleen @ 2002-09-20  2:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, akpm, taka, alan, neilb, linux-kernel, nfs

On Thu, Sep 19, 2002 at 07:20:48PM -0700, David S. Miller wrote:
>    From: Andi Kleen <ak@suse.de>
>    Date: Fri, 20 Sep 2002 04:28:19 +0200
>    
>    You cannot really use these instructions on Athlon,
> 
> I know that Athlon lacks these instructions, they are p4 sse2
> only.

AFAIK it is an SSE1 feature.

Athlon actually has movnti in newer models, just you do not really want to 
use it.

-Andi

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18 23:00 ` David S. Miller
@ 2002-09-21 11:56     ` Pavel Machek
  2002-09-21 11:56     ` Pavel Machek
  1 sibling, 0 replies; 98+ messages in thread
From: Pavel Machek @ 2002-09-21 11:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: taka, neilb, linux-kernel, nfs

Hi!
>    
>    1)
>    ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
>    This patch enables HW-checksum against outgoing packets including UDP frames.
>    
> Can you explain the TCP parts?  They look very wrong.
> 
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86.  I do not remember all
> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

Well, if this is the case, we need to #define copy_from_user csum_and_copy_from_user :-).
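
For readers unfamiliar with the idea being discussed, the sketch below
shows copy and checksum folded into a single pass over the data, so the
source is only read once. It is a user-space illustration only:
copy_and_csum is a hypothetical name, and the kernel's
csum_and_copy_from_user() has a different signature and returns a 32-bit
partial sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical analogue: copy len bytes from src to dst and return the
 * 16-bit ones'-complement sum of the data (RFC 1071 style, not inverted).
 */
uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];    /* big-endian word */
    }
    if (i < len) {              /* odd trailing byte, padded with zero */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    while (sum >> 16)           /* fold the carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```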

								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-09-18  8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
  2002-09-18 23:00 ` David S. Miller
@ 2002-10-14  5:50 ` Neil Brown
  2002-10-14  6:15   ` David S. Miller
                     ` (2 more replies)
  1 sibling, 3 replies; 98+ messages in thread
From: Neil Brown @ 2002-10-14  5:50 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: David S. Miller, linux-kernel, nfs

On Wednesday September 18, taka@valinux.co.jp wrote:
> Hello,
> 
> I ported the zerocopy NFS patches against linux-2.5.36.
> 

hi,
 I finally got around to looking at this.
 It looks good.

 However it really needs the MSG_MORE support for udp_sendmsg to be
 accepted before there is any point merging the rpc/nfsd bits.

 Would you like to see if davem is happy with that bit first and get
 it in?  Then I will be happy to forward the nfsd specific bit.

 The bit I'm not very sure about is the 'shadowsock' patch for having
 several xmit sockets, one per CPU.  What sort of speedup do you get
 from this?  How important is it really?

NeilBrown

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14  5:50 ` Neil Brown
@ 2002-10-14  6:15   ` David S. Miller
  2002-10-14 10:45     ` kuznet
  2002-10-14 12:01   ` Hirokazu Takahashi
  2002-10-18 13:11   ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
  2 siblings, 1 reply; 98+ messages in thread
From: David S. Miller @ 2002-10-14  6:15 UTC (permalink / raw)
  To: neilb; +Cc: taka, linux-kernel, nfs, kuznet

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Mon, 14 Oct 2002 15:50:02 +1000

    Would you like to see if davem is happy with that bit first and get
    it in?  Then I will be happy to forward the nfsd specific bit.
   
Alexey is working on this, or at least he was. :-)
(Alexey this is about the UDP cork changes)

    The bit I'm not very sure about is the 'shadowsock' patch for having
    several xmit sockets, one per CPU.  What sort of speedup do you get
    from this?  How important is it really?
   
Personally, it seems rather essential for scalability on SMP.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14  6:15   ` David S. Miller
@ 2002-10-14 10:45     ` kuznet
  2002-10-14 10:48       ` David S. Miller
  0 siblings, 1 reply; 98+ messages in thread
From: kuznet @ 2002-10-14 10:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: neilb, taka, linux-kernel, nfs

Hello!

> Alexey is working on this, or at least he was. :-)
> (Alexey this is about the UDP cork changes)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Alexey

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14 10:45     ` kuznet
@ 2002-10-14 10:48       ` David S. Miller
  0 siblings, 0 replies; 98+ messages in thread
From: David S. Miller @ 2002-10-14 10:48 UTC (permalink / raw)
  To: kuznet; +Cc: neilb, taka, linux-kernel, nfs

   From: kuznet@ms2.inr.ac.ru
   Date: Mon, 14 Oct 2002 14:45:33 +0400 (MSD)

   I took two patches of the batch:
   
   va10-hwchecksum-2.5.36.patch
   va11-udpsendfile-2.5.36.patch
   
   I did not worry about the rest i.e. sunrpc/* part.

Neil and the NFS folks can take care of those parts
once the generic UDP parts are in.

So, no worries.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14  5:50 ` Neil Brown
  2002-10-14  6:15   ` David S. Miller
@ 2002-10-14 12:01   ` Hirokazu Takahashi
  2002-10-14 14:12     ` Andrew Theurer
  2002-10-16  3:44     ` Neil Brown
  2002-10-18 13:11   ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
  2 siblings, 2 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-14 12:01 UTC (permalink / raw)
  To: neilb; +Cc: davem, linux-kernel, nfs

Hello, Neil

> > I ported the zerocopy NFS patches against linux-2.5.36.
> 
> hi,
>  I finally got around to looking at this.
>  It looks good.

Thanks!

>  However it really needs the MSG_MORE support for udp_sendmsg to be
>  accepted before there is any point merging the rpc/nfsd bits.
> 
>  Would you like to see if davem is happy with that bit first and get
>  it in?  Then I will be happy to forward the nfsd specific bit.

Yes.

>  The bit I'm not very sure about is the 'shadowsock' patch for having
>  several xmit sockets, one per CPU.  What sort of speedup do you get
>  from this?  How important is it really?

It's not so important.

davem> Personally, it seems rather essential for scalability on SMP.

Yes.
It will be effective on large scale SMP machines as all kNFSd threads
share one NFS port. A UDP socket can't send data on each CPU at the
same time while the MSG_MORE/UDP_CORK options are set.
The UDP socket has to block any other requests while a UDP frame is
being assembled.
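
As a rough user-space illustration of what MSG_MORE does on a UDP socket
(on kernels that support it), the sketch below sends two pieces over
loopback as a single datagram. udp_msg_more_demo is a hypothetical
helper written for this example; it is not part of the patches.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical demo: send two pieces as one datagram using MSG_MORE and
 * read it back. Returns the received datagram length, or -1 on error.
 */
ssize_t udp_msg_more_demo(char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct timeval tv = { 2, 0 };   /* do not hang forever in recv() */
    ssize_t n = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    assert(out != NULL);
    if (rx < 0 || tx < 0)
        goto done;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0 ||
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
        connect(tx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* The first piece is held back in the socket ... */
    if (send(tx, "hello ", 6, MSG_MORE) != 6)
        goto done;
    /* ... and goes out together with the final piece. */
    if (send(tx, "world", 5, 0) != 5)
        goto done;
    n = recv(rx, out, outlen, 0);
done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```

Between the two send() calls the socket holds a half-built frame, which
is why a second sender on the same socket has to wait.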


Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14 12:01   ` Hirokazu Takahashi
@ 2002-10-14 14:12     ` Andrew Theurer
  2002-10-16  3:44     ` Neil Brown
  1 sibling, 0 replies; 98+ messages in thread
From: Andrew Theurer @ 2002-10-14 14:12 UTC (permalink / raw)
  To: neilb, Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs

> Hello, Neil
>
> > > I ported the zerocopy NFS patches against linux-2.5.36.
> >
> > hi,
> >  I finally got around to looking at this.
> >  It looks good.
>
> Thanks!
>
> >  However it really needs the MSG_MORE support for udp_sendmsg to be
> >  accepted before there is any point merging the rpc/nfsd bits.
> >
> >  Would you like to see if davem is happy with that bit first and get
> >  it in?  Then I will be happy to forward the nfsd specific bit.
>
> Yes.
>
> >  The bit I'm not very sure about is the 'shadowsock' patch for having
> >  several xmit sockets, one per CPU.  What sort of speedup do you get
> >  from this?  How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd threads
> share one NFS port. A UDP socket can't send data on each CPU at the
> same time while the MSG_MORE/UDP_CORK options are set.
> The UDP socket has to block any other requests while a UDP frame is
> being assembled.

I experienced this exact problem a few months ago.  I had a test where
several clients read a file or files cached on a linux server.  TCP was just
fine, I could get 100% CPU on all CPUs on the server.  TCP zerocopy was even
better, by about 50% throughput.  UDP could not get better than 33% CPU, one
CPU working on those UDP requests and I assume a portion of another CPU
handling some interrupt stuff.  Essentially 2P and 4P throughput was only as
good as UP throughput.  It is essential to get scaling on UDP.  That
combined with the UDP zerocopy, we will have one extremely fast NFS server.

Andrew Theurer
IBM LTC

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-14 12:01   ` Hirokazu Takahashi
  2002-10-14 14:12     ` Andrew Theurer
@ 2002-10-16  3:44     ` Neil Brown
  2002-10-16  4:31       ` David S. Miller
  2002-10-16 11:09       ` Hirokazu Takahashi
  1 sibling, 2 replies; 98+ messages in thread
From: Neil Brown @ 2002-10-16  3:44 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs

On Monday October 14, taka@valinux.co.jp wrote:
> >  The bit I'm not very sure about is the 'shadowsock' patch for having
> >  several xmit sockets, one per CPU.  What sort of speedup do you get
> >  from this?  How important is it really?
> 
> It's not so important.
> 
> davem> Personally, it seems rather essential for scalability on SMP.
> 
> Yes.
> It will be effective on large scale SMP machines as all kNFSd threads
> share one NFS port. A UDP socket can't send data on each CPU at the
> same time while the MSG_MORE/UDP_CORK options are set.
> The UDP socket has to block any other requests while a UDP frame is
> being assembled.
> 

After thinking about this some more, I suspect it would have to be
quite large scale SMP to get much contention.
The only contention on the udp socket is, as you say, assembling a udp
frame, and I would be surprised if that takes a substantial fraction
of the time to handle a request.

Presumably on a sufficiently large SMP machine that this became an
issue, there would be multiple NICs.  Maybe it would make sense to
have one udp socket for each NIC.  Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.

NeilBrown

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  3:44     ` Neil Brown
@ 2002-10-16  4:31       ` David S. Miller
  2002-10-16 15:04         ` Andrew Theurer
  2002-10-17  2:03         ` [NFS] " Andrew Theurer
  2002-10-16 11:09       ` Hirokazu Takahashi
  1 sibling, 2 replies; 98+ messages in thread
From: David S. Miller @ 2002-10-16  4:31 UTC (permalink / raw)
  To: neilb; +Cc: taka, linux-kernel, nfs

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Wed, 16 Oct 2002 13:44:04 +1000

   Presumably on a sufficiently large SMP machine that this became an
   issue, there would be multiple NICs.  Maybe it would make sense to
   have one udp socket for each NIC.  Would that make sense? or work?
   It feels to me to be cleaner than one for each CPU.

Doesn't make much sense.

Usually we are talking via one IP address, and thus over
one device.  It could be using multiple NICs via BONDING,
but that would be transparent to anything at the socket
level.

Really, I think there is real value to making the socket
per-cpu even on a 2 or 4 way system.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  3:44     ` Neil Brown
  2002-10-16  4:31       ` David S. Miller
@ 2002-10-16 11:09       ` Hirokazu Takahashi
  2002-10-16 17:02         ` kaza
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-16 11:09 UTC (permalink / raw)
  To: neilb; +Cc: davem, linux-kernel, nfs

Hello,

> > It will be effective on large scale SMP machines as all kNFSd threads
> > share one NFS port. A UDP socket can't send data on each CPU at the
> > same time while the MSG_MORE/UDP_CORK options are set.
> > The UDP socket has to block any other requests while a UDP frame is
> > being assembled.
> > 

> After thinking about this some more, I suspect it would have to be
> quite large scale SMP to get much contention.

I have no idea how much contention will happen. I haven't checked its
performance on large scale SMP yet as I don't have such a large
machine.

Can anyone help us?

> The only contention on the udp socket is, as you say, assembling a udp
> frame, and I would be surprised if that takes a substantial fraction
> of the time to handle a request.

After assembling a udp frame, kNFSd may drive a NIC to transmit the frame.

> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs.  Maybe it would make sense to
> have one udp socket for each NIC.  Would that make sense? or work?

Several CPUs often share one GbE NIC today, as a NIC can handle more data
than one CPU can. I think the CPU is likely to become the bottleneck.
Personally I guess several CPUs will share one 10GbE NIC in the near
future even on a high-end machine. (It's just my guess.)

But I don't know how effective this patch is.

davem> Doesn't make much sense.
davem> 
davem> Usually we are talking via one IP address, and thus over
davem> one device.  It could be using multiple NICs via BONDING,
davem> but that would be transparent to anything at the socket
davem> level.
davem> 
davem> Really, I think there is real value to making the socket
davem> per-cpu even on a 2 or 4 way system.

I hope so.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* RE: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-16 14:04 Lever, Charles
  0 siblings, 0 replies; 98+ messages in thread
From: Lever, Charles @ 2002-10-16 14:04 UTC (permalink / raw)
  To: neilb; +Cc: taka, linux-kernel, nfs, 'David S. Miller'

> -----Original Message-----
> From: David S. Miller [mailto:davem@redhat.com]
> Sent: Wednesday, October 16, 2002 12:31 AM
>
>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
> 
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs.  Maybe it would make sense to
>    have one udp socket for each NIC.  Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>    
> Doesn't make much sense.
> 
> Usually we are talking via one IP address, and thus over
> one device.  It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
> 
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

having a local socket per CPU is very good for SMP scaling.
it multiplies input buffer space, and reduces socket lock
and CPU cache contention.

sorry, i don't have measurements.
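
to make the idea concrete, here is a small user-space sketch of
"one transmit socket per CPU". everything in it (NSOCKS, tx_socks_init,
tx_sock_for, tx_socks_close) is a made-up name for illustration; it is
not the shadowsock patch, and in the kernel the index would come from
the current processor id rather than a parameter.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSOCKS 4    /* hypothetical: number of CPUs on the machine */

static int tx_socks[NSOCKS];

/* Open one UDP socket per CPU. Returns 0 on success, -1 on failure. */
int tx_socks_init(void)
{
    for (int i = 0; i < NSOCKS; i++) {
        tx_socks[i] = socket(AF_INET, SOCK_DGRAM, 0);
        if (tx_socks[i] < 0) {
            while (i-- > 0)
                close(tx_socks[i]);
            return -1;
        }
    }
    return 0;
}

/*
 * Map a CPU number to its private socket, so senders running on
 * different CPUs never wait on the same socket lock.
 */
int tx_sock_for(unsigned int cpu)
{
    return tx_socks[cpu % NSOCKS];
}

void tx_socks_close(void)
{
    for (int i = 0; i < NSOCKS; i++)
        close(tx_socks[i]);
}
```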

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  4:31       ` David S. Miller
@ 2002-10-16 15:04         ` Andrew Theurer
  2002-10-17  2:03         ` [NFS] " Andrew Theurer
  1 sibling, 0 replies; 98+ messages in thread
From: Andrew Theurer @ 2002-10-16 15:04 UTC (permalink / raw)
  To: David S. Miller, neilb; +Cc: taka, linux-kernel, nfs

On Tuesday 15 October 2002 11:31 pm, David S. Miller wrote:
>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
>
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs.  Maybe it would make sense to
>    have one udp socket for each NIC.  Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device.  It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am trying my best today to get a 4 way system up and running for this test.  
IMO, per cpu is best..  with just one socket, I seriously could not get over 
33% cpu utilization on a 4 way (back in April).  With TCP, I could max it 
out.  I'll update later today hopefully with some promising results.

-Andrew

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16 11:09       ` Hirokazu Takahashi
@ 2002-10-16 17:02         ` kaza
  2002-10-17  4:36           ` rddunlap
  0 siblings, 1 reply; 98+ messages in thread
From: kaza @ 2002-10-16 17:02 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

Hello,

On Wed, Oct 16, 2002 at 08:09:00PM +0900,
Hirokazu Takahashi-san wrote:
> > After thinking about this some more, I suspect it would have to be
> > quite large scale SMP to get much contention.
> 
> I have no idea how much contention will happen. I haven't checked its
> performance on large scale SMP yet as I don't have such a large
> machine.
> 
> Can anyone help us?

Why don't you propose the performance test to OSDL? (OSDL-J would be
better, I think.)  OSDL provides hardware resources and operations staff.

If you want, I can help you propose it. :-)

-- 
Ko Kazaana / editor-in-chief of "TechStyle" ( http://techstyle.jp/ )
GnuPG Fingerprint = 1A50 B204 46BD EE22 2E8C  903F F2EB CEA7 4BCF 808F

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16  4:31       ` David S. Miller
  2002-10-16 15:04         ` Andrew Theurer
@ 2002-10-17  2:03         ` Andrew Theurer
  2002-10-17  2:31           ` Hirokazu Takahashi
  1 sibling, 1 reply; 98+ messages in thread
From: Andrew Theurer @ 2002-10-17  2:03 UTC (permalink / raw)
  To: neilb, David S. Miller; +Cc: taka, linux-kernel, nfs

>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
>
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs.  Maybe it would make sense to
>    have one udp socket for each NIC.  Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device.  It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am still seeing some sort of problem on an 8 way (hyperthreaded 8
logical/4 physical) on UDP with these patches.  I cannot get more than 2
NFSd threads in a run state at one time.  TCP usually has 8 or more.  The
test involves 40 100Mbit clients reading a 200 MB file on one server (4
acenic adapters) in cache.  I am fighting some other issues at the moment
(acpi weirdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
138 MB/sec for NFSv2,TCP.  With the patches, 115 MB/sec for NFSv2,UDP and
181 MB/sec for NFSv2,TCP.  One CPU is maxed due to acpi int storm, so I
think the results will get better.  I'm not sure what other lock or
contention point this is hitting on UDP.  If there is anything I can do to
help, please let me know, thanks.

Andrew Theurer

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17  2:03         ` [NFS] " Andrew Theurer
@ 2002-10-17  2:31           ` Hirokazu Takahashi
  2002-10-17 13:16               ` [NFS] " Andrew Theurer
  0 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17  2:31 UTC (permalink / raw)
  To: habanero; +Cc: neilb, davem, linux-kernel, nfs

Hello,

Thanks for testing my patches.

> I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> logical/4 physical) on UDP with these patches.  I cannot get more than 2
> NFSd threads in a run state at one time.  TCP usually has 8 or more.  The
> test involves 40 100Mbit clients reading a 200 MB file on one server (4
> acenic adapters) in cache.  I am fighting some other issues at the moment
> (acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> 138 MB/sec for NFSv2,TCP.  With the patches, 115 MB/sec for NFSv2,UDP and
> 181 MB/sec for NFSv2,TCP.  One CPU is maxed due to acpi int storm, so I
> think the results will get better.  I'm not sure what other lock or
> contention point this is hitting on UDP.  If there is anything I can do to
> help, please let me know, thanks.

I guess some UDP packets might be getting lost. That can happen easily
because UDP has no flow control.
Can you check how many errors have occurred?
You can see them in /proc/net/snmp on the server and the clients.

And how many threads did you start on your machine?
The buffer size of a UDP socket depends on the number of kNFSd threads,
so a larger number of threads might help.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-16 17:02         ` kaza
@ 2002-10-17  4:36           ` rddunlap
  0 siblings, 0 replies; 98+ messages in thread
From: rddunlap @ 2002-10-17  4:36 UTC (permalink / raw)
  To: kaza; +Cc: Hirokazu Takahashi, neilb, davem, linux-kernel, nfs

On Thu, 17 Oct 2002 kaza@kk.iij4u.or.jp wrote:

| Hello,
|
| On Wed, Oct 16, 2002 at 08:09:00PM +0900,
| Hirokazu Takahashi-san wrote:
| > > After thinking about this some more, I suspect it would have to be
| > > quite large scale SMP to get much contention.
| >
| > I have no idea how much contention will happen. I haven't checked the
| > performance of it on large scale SMP yet as I don't have such a great
| > machines.
| >
| > Does anyone help us?
|
| Why don't you propose the performance test to OSDL? (OSDL-J is more
| better, I think)   OSDL provide hardware resources and operation staffs.

and why do you say that?  8;)

| If you want, I can help you to propose it. :-)

That's the right thing to do.

-- 
~Randy

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17  2:31           ` Hirokazu Takahashi
@ 2002-10-17 13:16               ` Andrew Theurer
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Theurer @ 2002-10-17 13:16 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36


> Hello,
>
> Thanks for testing my patches.
>
> > I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> > logical/4 physical) on UDP with these patches.  I cannot get more than 2
> > NFSd threads in a run state at one time.  TCP usually has 8 or more.  The
> > test involves 40 100Mbit clients reading a 200 MB file on one server (4
> > acenic adapters) in cache.  I am fighting some other issues at the moment
> > (acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> > 138 MB/sec for NFSv2,TCP.  With the patches, 115 MB/sec for NFSv2,UDP and
> > 181 MB/sec for NFSv2,TCP.  One CPU is maxed due to acpi int storm, so I
> > think the results will get better.  I'm not sure what other lock or
> > contention point this is hitting on UDP.  If there is anything I can do to
> > help, please let me know, thanks.
>
> I guess some UDP packets might be lost. It may happen easily as UDP protocol
> doesn't support flow control.
> Can you check how many errors has happened?
> You can see them in /proc/net/snmp of the server and the clients.

server: Udp: InDatagrams NoPorts InErrors OutDatagrams
        Udp: 1000665 41 0 1000666

clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
         Udp: 200403 0 0 200406
         (all clients the same)

> And how many threads did you start on your machine?
> Buffer size of a UDP socket depends on number of kNFS threads.
> Large number of threads might help you.

128 threads.  client rsize=8196.  Server and client MTU is 1500.

Andrew Theurer
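
The Udp counters above come as a header line followed by a value line with
the same number of columns.  A minimal sketch of pulling one counter out of
such a pair is shown here; snmp_field() is a hypothetical helper written for
this note, not an existing kernel or libc interface, and it assumes the two
lines have already been read from /proc/net/snmp.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in a header/value line pair as printed by
 * /proc/net/snmp, for example:
 *   "Udp: InDatagrams NoPorts InErrors OutDatagrams"
 *   "Udp: 1000665 41 0 1000666"
 * Returns 0 and stores the value in *out, or -1 if the name is not found.
 */
static int snmp_field(const char *header, const char *values,
                      const char *name, unsigned long long *out)
{
    size_t nlen = strlen(name);

    for (;;) {
        /* Skip separators, then measure the next token on each line. */
        header += strspn(header, " \t\n");
        values += strspn(values, " \t\n");
        size_t hlen = strcspn(header, " \t\n");
        size_t vlen = strcspn(values, " \t\n");

        if (hlen == 0 || vlen == 0)
            return -1;          /* ran out of columns */

        if (hlen == nlen && strncmp(header, name, nlen) == 0) {
            *out = strtoull(values, NULL, 10);
            return 0;
        }
        header += hlen;
        values += vlen;
    }
}
```

With the server lines quoted above, looking up "InErrors" yields 0 and
"NoPorts" yields 41, which matches the observation that UDP itself is not
reporting receive errors.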



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 13:16               ` [NFS] " Andrew Theurer
@ 2002-10-17 13:26                 ` Hirokazu Takahashi
  -1 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 13:26 UTC (permalink / raw)
  To: habanero; +Cc: neilb, davem, linux-kernel, nfs

Hi,

> server: Udp: InDatagrams NoPorts InErrors OutDatagrams
>         Udp: 1000665 41 0 1000666
> clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
>          Udp: 200403 0 0 200406
>          (all clients the same)

How about IP datagrams?  You can see the Ip fields in /proc/net/snmp;
the IP layer may also discard them.

> > And how many threads did you start on your machine?
> > Buffer size of a UDP socket depends on number of kNFS threads.
> > Large number of threads might help you.
> 
> 128 threads.  client rsize=8196.  Server and client MTU is 1500.

That seems like enough...



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 13:26                 ` [NFS] " Hirokazu Takahashi
  (?)
@ 2002-10-17 14:10                 ` Andrew Theurer
  2002-10-17 16:26                     ` [NFS] " Hirokazu Takahashi
  -1 siblings, 1 reply; 98+ messages in thread
From: Andrew Theurer @ 2002-10-17 14:10 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

> Hi,
>
> > server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> >         Udp: 1000665 41 0 1000666
> > clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> >          Udp: 200403 0 0 200406
> >          (all clients the same)
>
> How about IP datagrams?  You can see the IP fields in /proc/net/snmp
> IP layer may also discard them.

Server:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000

A Client:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0


Andrew Theurer
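
The fragmentation counters above are consistent with each READ reply being
split into about six IP fragments: the client shows ReasmReqds of 1200000
against ReasmOKs of 200008, and the server shows FragCreates of 6000000
against roughly one million UDP OutDatagrams reported earlier in the thread.
The arithmetic can be sketched as follows.  The helper name is made up, and
it assumes a 20-byte IPv4 header without options and an 8-byte UDP header;
the exact size of the RPC/NFS reply header is not given in the thread.

```c
/*
 * Number of IPv4 fragments needed for one UDP datagram carrying
 * udp_payload bytes over a link with the given MTU.
 * Assumes a 20-byte IPv4 header (no options) and an 8-byte UDP header.
 */
static unsigned int udp_fragments(unsigned int udp_payload, unsigned int mtu)
{
    const unsigned int ip_hdr = 20, udp_hdr = 8;
    /* Every fragment except the last carries a multiple of 8 data bytes. */
    unsigned int per_frag = ((mtu - ip_hdr) / 8) * 8;
    unsigned int total = udp_payload + udp_hdr;

    return (total + per_frag - 1) / per_frag;
}
```

For the reported rsize of 8196 and an MTU of 1500 this gives 6, and it stays
at 6 when a reply header of around a hundred bytes is added on top.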

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 14:10                 ` Andrew Theurer
@ 2002-10-17 16:26                     ` Hirokazu Takahashi
  0 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 16:26 UTC (permalink / raw)
  To: habanero; +Cc: neilb, davem, linux-kernel, nfs

Hi,

> > How about IP datagrams?  You can see the IP fields in /proc/net/snmp
> > IP layer may also discard them.
> 
> Server:
> 
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000
> 
> A Client:
> 
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0

It looks fine.
Hmmm....  What version of Linux are you using?

The congestion avoidance mechanism of the NFS clients might cause this
situation.  I think the congestion window size is not large enough for
high-end machines.  You could make the window larger as a test.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-17 16:26                     ` [NFS] " Hirokazu Takahashi
  (?)
@ 2002-10-18  5:38                     ` Trond Myklebust
  2002-10-18  7:19                       ` Hirokazu Takahashi
  -1 siblings, 1 reply; 98+ messages in thread
From: Trond Myklebust @ 2002-10-18  5:38 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: habanero, neilb, davem, linux-kernel, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > Congestion avoidance mechanism of NFS clients might cause this
     > situation.  I think the congestion window size is not enough
     > for high end machines.  You can make the window be larger as a
     > test.

The congestion avoidance window is supposed to adapt to the bandwidth
that is available. Turn congestion avoidance off if you like, but my
experience is that doing so tends to seriously degrade performance as
the number of timeouts + resends skyrockets.

Cheers,
  Trond
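
To make the discussion concrete, here is a generic additive-increase /
multiplicative-decrease window of the kind being described.  This is an
illustrative sketch only: the structure, field names and constants are
invented for this note and it is not the sunrpc client's actual code.  The
"max" field stands for the upper bound that is suspected above of being too
small for fast machines.

```c
/* Hypothetical window state, counted in whole outstanding requests. */
struct cwnd {
    unsigned int cur;   /* requests currently allowed in flight */
    unsigned int max;   /* hard upper bound on the window */
    unsigned int acked; /* replies seen since the last increase */
};

/* A reply arrived in time: grow by one slot per full window of replies. */
static void cwnd_on_reply(struct cwnd *w)
{
    if (++w->acked >= w->cur) {
        w->acked = 0;
        if (w->cur < w->max)
            w->cur++;
    }
}

/* A request timed out: halve the window, but never go below one. */
static void cwnd_on_timeout(struct cwnd *w)
{
    w->cur = w->cur > 1 ? w->cur / 2 : 1;
    w->acked = 0;
}
```

In this model the window keeps growing while replies arrive on time and
stops at "max", so a bound that is too low caps throughput even when no
timeouts occur, while removing the backoff entirely gives up the protection
against timeouts and resends.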

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-18  5:38                     ` Trond Myklebust
@ 2002-10-18  7:19                       ` Hirokazu Takahashi
  2002-10-18 15:12                           ` [NFS] " Andrew Theurer
  0 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-18  7:19 UTC (permalink / raw)
  To: trond.myklebust; +Cc: habanero, neilb, davem, linux-kernel, nfs

Hello,

>      > Congestion avoidance mechanism of NFS clients might cause this
>      > situation.  I think the congestion window size is not enough
>      > for high end machines.  You can make the window be larger as a
>      > test.
> 
> The congestion avoidance window is supposed to adapt to the bandwidth
> that is available. Turn congestion avoidance off if you like, but my
> experience is that doing so tends to seriously degrade performance as
> the number of timeouts + resends skyrockets.

Yes, you must be right.

But I guess Andrew may be using such a powerful machine that the transfer
rate has exceeded the maximum size of the congestion avoidance window.
Can we determine a preferable maximum window size dynamically?

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH] zerocopy NFS for 2.5.43
  2002-10-14  5:50 ` Neil Brown
  2002-10-14  6:15   ` David S. Miller
  2002-10-14 12:01   ` Hirokazu Takahashi
@ 2002-10-18 13:11   ` Hirokazu Takahashi
  2002-10-23  1:18     ` Neil Brown
  2 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-18 13:11 UTC (permalink / raw)
  To: neilb; +Cc: nfs

[-- Attachment #1: Type: Text/Plain, Size: 453 bytes --]

Hello,

I've ported the zerocopy patches to linux-2.5.43, on top of
davem's udp-sendfile patches and the patches you posted
on Wed, 16 Oct.

Sadly, zerocopy NFS doesn't work with NFSv4 yet:
kNFSd won't use the zerocopy mechanism for NFSv4 requests.
If possible, I can make NFSv4 use zerocopy after Halloween.

I also fixed a small bug where pages might be lost
when nfsd_readdir hits an error.


Thank you,
Hirokazu Takahashi.
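
The xdr_pack_data() helper in the attached va02 patch computes its padding
as (XDR_QUADLEN(count) << 2) - count and then checks whether that padding
still fits in the last page.  The same arithmetic can be sketched in
standalone form as below; the function names are made up for this note and
the page size is passed in as a parameter rather than taken from PAGE_SIZE.

```c
#include <stddef.h>

/* Number of 4-byte XDR words needed to hold len bytes (rounded up). */
static size_t xdr_quadlen(size_t len)
{
    return (len + 3) >> 2;
}

/* Pad bytes that must follow len bytes of opaque data on the wire. */
static size_t xdr_pad_bytes(size_t len)
{
    return (xdr_quadlen(len) << 2) - len;
}

/*
 * Nonzero if the pad can be appended to the last page segment, which
 * starts at offset within its page and is len bytes long.
 */
static int pad_fits_in_page(size_t offset, size_t len, size_t pad,
                            size_t page_size)
{
    return offset + len + pad <= page_size;
}
```

When the pad does not fit, the patch instead appends a separate non-page
segment pointing at a static zero-filled variable.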


[-- Attachment #2: rpcfix2.5.43-2.patch --]
[-- Type: Text/Plain, Size: 1094 bytes --]

--- linux/net/sunrpc/svcsock.c.ORG	Thu Oct 17 14:10:43 2030
+++ linux/net/sunrpc/svcsock.c	Fri Oct 18 11:20:27 2030
@@ -882,17 +882,18 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 
 	dprintk("svc: TCP complete record (%d bytes)\n", len);
 
+	rqstp->rq_skbuff         = 0;
+	rqstp->rq_argbuf.buf    += 1;
+	rqstp->rq_argbuf.len     = (len >> 2) + 1;
+	rqstp->rq_argbuf.buflen  = (len >> 2) + 1;
+
 	/* Position reply write pointer immediately args,
 	 * allowing for record length */
-	rqstp->rq_resbuf.base = rqstp->rq_argbuf.base + (len>>2);
-	rqstp->rq_resbuf.buf  = rqstp->rq_resbuf.base + 1;
-	rqstp->rq_resbuf.len  = 1;
-	rqstp->rq_resbuf.buflen= rqstp->rq_argbuf.buflen - (len>>2) - 1;
+	rqstp->rq_resbuf.base   += rqstp->rq_argbuf.buflen;
+	rqstp->rq_resbuf.buf     = rqstp->rq_resbuf.base + 1;
+	rqstp->rq_resbuf.len     = 1;
+	rqstp->rq_resbuf.buflen -= rqstp->rq_argbuf.buflen;
 
-	rqstp->rq_skbuff      = 0;
-	rqstp->rq_argbuf.buf += 1;
-	rqstp->rq_argbuf.len  = (len >> 2);
-	rqstp->rq_argbuf.buflen = (len >> 2);
 	rqstp->rq_prot	      = IPPROTO_TCP;
 
 	/* Reset TCP read info */

[-- Attachment #3: va01-zerocopy-rpc-2.5.43.patch --]
[-- Type: Text/Plain, Size: 10122 bytes --]

--- linux.ORG/include/linux/sunrpc/svc.h	Fri Oct 18 12:26:43 2030
+++ linux/include/linux/sunrpc/svc.h	Fri Oct 18 12:29:31 2030
@@ -48,7 +48,7 @@ struct svc_serv {
  * This is use to determine the max number of pages nfsd is
  * willing to return in a single READ operation.
  */
-#define RPCSVC_MAXPAYLOAD	16384u
+#define RPCSVC_MAXPAYLOAD	(1024u*64)
 
 /*
  * Buffer to store RPC requests or replies in.
@@ -61,7 +61,7 @@ struct svc_serv {
  *
  * The array of iovecs can hold additional data that the server process
  * may not want to copy into the RPC reply buffer, but pass to the 
- * network sendmsg routines directly. The prime candidate for this
+ * network sendmsg/sendpage routines directly. The prime candidate for this
  * will of course be NFS READ operations, but one might also want to
  * do something about READLINK and READDIR. It might be worthwhile
  * to implement some generic readdir cache in the VFS layer...
@@ -70,7 +70,7 @@ struct svc_serv {
  * the list of IP fragments once we get to process fragmented UDP
  * datagrams directly.
  */
-#define RPCSVC_MAXIOV		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1)
+#define RPCSVC_MAXIOV		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 2)
 struct svc_buf {
 	u32 *			area;	/* allocated memory */
 	u32 *			base;	/* base of RPC datagram */
@@ -78,10 +78,24 @@ struct svc_buf {
 	u32 *			buf;	/* read/write pointer */
 	int			len;	/* current end of buffer */
 
-	/* iovec for zero-copy NFS READs */
-	struct iovec		iov[RPCSVC_MAXIOV];
+	/* 
+	 * iovec for zero-copy NFS READs
+	 * pages and non-page data can be mixed.
+	 */
+	struct rpcio_vec {
+		struct page		*rpc_page;
+		union {
+			void		*riov_base;
+			unsigned long	riov_offset;
+		} u;
+		__kernel_size_t		rpc_len;
+	} iov[RPCSVC_MAXIOV];
 	int			nriov;
 };
+
+#define rpc_base	u.riov_base
+#define rpc_offset	u.riov_offset
+
 #define svc_getu32(argp, val)	{ (val) = *(argp)->buf++; (argp)->len--; }
 #define svc_putu32(resp, val)	{ *(resp)->buf++ = (val); (resp)->len++; }
 
--- linux.ORG/include/linux/sunrpc/svcsock.h	Fri Oct 18 12:26:43 2030
+++ linux/include/linux/sunrpc/svcsock.h	Fri Oct 18 12:29:31 2030
@@ -10,6 +10,7 @@
 #define SUNRPC_SVCSOCK_H
 
 #include <linux/sunrpc/svc.h>
+#include <asm/semaphore.h>
 
 /*
  * RPC server socket.
@@ -37,6 +38,7 @@ struct svc_sock {
 
 	struct list_head	sk_deferred;	/* deferred requests that need to
 						 * be revisted */
+	struct semaphore	sk_sem;		/* serialize sending data */
 
 	int			(*sk_recvfrom)(struct svc_rqst *rqstp);
 	int			(*sk_sendto)(struct svc_rqst *rqstp);
--- linux.ORG/net/sunrpc/svc.c	Fri Oct 18 12:26:48 2030
+++ linux/net/sunrpc/svc.c	Fri Oct 18 12:29:31 2030
@@ -106,8 +106,7 @@ svc_destroy(struct svc_serv *serv)
 
 /*
  * Allocate an RPC server buffer
- * Later versions may do nifty things by allocating multiple pages
- * of memory directly and putting them into the bufp->iov.
+ * Multiple pages can be put into the bufp->iov.
  */
 int
 svc_init_buffer(struct svc_buf *bufp, unsigned int size)
@@ -119,8 +118,9 @@ svc_init_buffer(struct svc_buf *bufp, un
 	bufp->len    = 0;
 	bufp->buflen = size >> 2;
 
-	bufp->iov[0].iov_base = bufp->area;
-	bufp->iov[0].iov_len  = size;
+	bufp->iov[0].rpc_base = bufp->area;
+	bufp->iov[0].rpc_len  = size;
+	bufp->iov[0].rpc_page = NULL;
 	bufp->nriov = 1;
 
 	return 1;
--- linux.ORG/net/sunrpc/svcsock.c	Fri Oct 18 12:28:35 2030
+++ linux/net/sunrpc/svcsock.c	Fri Oct 18 12:29:31 2030
@@ -22,6 +22,7 @@
 #include <linux/sched.h>
 #include <linux/errno.h>
 #include <linux/fcntl.h>
+#include <linux/pagemap.h>
 #include <linux/net.h>
 #include <linux/in.h>
 #include <linux/inet.h>
@@ -270,6 +271,8 @@ static void
 svc_sock_release(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk = rqstp->rq_sock;
+	struct svc_buf	*bufp = &rqstp->rq_resbuf;
+	int i;
 
 	svc_release_skb(rqstp);
 
@@ -283,6 +286,13 @@ svc_sock_release(struct svc_rqst *rqstp)
 		       rqstp->rq_reserved,
 		       rqstp->rq_resbuf.len<<2);
 
+	for (i = 0; i < bufp->nriov; i++) {
+		if (bufp->iov[i].rpc_page) {
+			put_page(bufp->iov[i].rpc_page);
+			bufp->iov[i].rpc_page = NULL;
+		}
+	}
+
 	rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base;
 	rqstp->rq_resbuf.len = 0;
 	svc_reserve(rqstp, 0);
@@ -318,38 +328,55 @@ svc_wake_up(struct svc_serv *serv)
  * Generic sendto routine
  */
 static int
-svc_sendto(struct svc_rqst *rqstp, struct iovec *iov, int nr)
+svc_sendto(struct svc_rqst *rqstp, struct rpcio_vec *iov, int nr)
 {
 	mm_segment_t	oldfs;
 	struct svc_sock	*svsk = rqstp->rq_sock;
 	struct socket	*sock = svsk->sk_sock;
 	struct msghdr	msg;
-	int		i, buflen, len;
-
-	for (i = buflen = 0; i < nr; i++)
-		buflen += iov[i].iov_len;
+	unsigned int	flags = MSG_MORE;
+	int		len = 0;
+	int		result, i;
 
 	msg.msg_name    = &rqstp->rq_addr;
 	msg.msg_namelen = sizeof(rqstp->rq_addr);
-	msg.msg_iov     = iov;
-	msg.msg_iovlen  = nr;
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
+	msg.msg_iovlen  = 1;
 
-	/* This was MSG_DONTWAIT, but I now want it to wait.
-	 * The only thing that it would wait for is memory and
-	 * if we are fairly low on memory, then we aren't likely
-	 * to make much progress anyway.
-	 * sk->sndtimeo is set to 30seconds just in case.
-	 */
-	msg.msg_flags	= 0;
+	/* Grab svsk->sk_sem to serialize outgoing data. */
+	down(&svsk->sk_sem);
 
-	oldfs = get_fs(); set_fs(KERNEL_DS);
-	len = sock_sendmsg(sock, &msg, buflen);
-	set_fs(oldfs);
+	/* 
+	 * svc_sendto() assumes rqstp->rq_resbuf.iov[0].rpc_page is NULL
+	 * when RPC over UDP is used, as the sendpage interface cannot
+	 * pass a destination address.
+	 */
+	for (i = 0; i < nr; i++) {
+		if (i == nr - 1)
+			flags = 0;
+		if (iov[i].rpc_page) {
+			result = sock->ops->sendpage(sock, iov[i].rpc_page, iov[i].rpc_offset, iov[i].rpc_len, flags);
+		} else {
+			struct iovec uiov;
+			uiov.iov_base   = iov[i].rpc_base;
+			uiov.iov_len    = iov[i].rpc_len;
+			msg.msg_iov     = &uiov;
+			msg.msg_flags	= flags;
+			oldfs = get_fs(); set_fs(KERNEL_DS);
+			result = sock_sendmsg(sock, &msg, iov[i].rpc_len);
+			set_fs(oldfs);
+		}
+		if (result < 0) {
+			if (!len) len = result;
+			break;
+		}
+		len += result;
+	}
+	up(&svsk->sk_sem);
 
-	dprintk("svc: socket %p sendto([%p %Zu... ], %d, %d) = %d\n",
-			rqstp->rq_sock, iov[0].iov_base, iov[0].iov_len, nr, buflen, len);
+	dprintk("svc: socket %p sendto([%p %Zu... ], %d) = %d\n",
+			rqstp->rq_sock, iov[0].rpc_base, iov[0].rpc_len, nr, len);
 
 	return len;
 }
@@ -375,19 +402,25 @@ svc_recv_available(struct svc_sock *svsk
  * Generic recvfrom routine.
  */
 static int
-svc_recvfrom(struct svc_rqst *rqstp, struct iovec *iov, int nr, int buflen)
+svc_recvfrom(struct svc_rqst *rqstp, struct rpcio_vec *iov, int nr, int buflen)
 {
 	mm_segment_t	oldfs;
 	struct msghdr	msg;
 	struct socket	*sock;
-	int		len, alen;
+	int		len, alen, i;
+	struct iovec uiov[RPCSVC_MAXIOV];
 
 	rqstp->rq_addrlen = sizeof(rqstp->rq_addr);
 	sock = rqstp->rq_sock->sk_sock;
 
+	for (i = 0; i < nr; i++) {
+		uiov[i].iov_base = iov[i].rpc_base;
+		uiov[i].iov_len  = iov[i].rpc_len;
+	}
+
 	msg.msg_name    = &rqstp->rq_addr;
 	msg.msg_namelen = sizeof(rqstp->rq_addr);
-	msg.msg_iov     = iov;
+	msg.msg_iov     = uiov;
 	msg.msg_iovlen  = nr;
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
@@ -406,7 +439,7 @@ svc_recvfrom(struct svc_rqst *rqstp, str
 	sock->ops->getname(sock, (struct sockaddr *)&rqstp->rq_addr, &alen, 1);
 
 	dprintk("svc: socket %p recvfrom(%p, %Zu) = %d\n",
-		rqstp->rq_sock, iov[0].iov_base, iov[0].iov_len, len);
+		rqstp->rq_sock, iov[0].rpc_base, iov[0].rpc_len, len);
 
 	return len;
 }
@@ -567,8 +600,8 @@ svc_udp_sendto(struct svc_rqst *rqstp)
 	 * care of by the server implementation itself.
 	 */
 	/* bufp->base = bufp->area; */
-	bufp->iov[0].iov_base = bufp->base;
-	bufp->iov[0].iov_len  = bufp->len << 2;
+	bufp->iov[0].rpc_base = bufp->base;
+	bufp->iov[0].rpc_len  = bufp->len << 2;
 
 	error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
 	if (error == -ECONNREFUSED)
@@ -827,10 +860,11 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	 */
 	if (svsk->sk_tcplen < 4) {
 		unsigned long	want = 4 - svsk->sk_tcplen;
-		struct iovec	iov;
+		struct rpcio_vec	iov;
 
-		iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen;
-		iov.iov_len  = want;
+		iov.rpc_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen;
+		iov.rpc_len  = want;
+		iov.rpc_page  = NULL;
 		if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0)
 			goto error;
 		svsk->sk_tcplen += len;
@@ -872,8 +906,8 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	set_bit(SK_DATA, &svsk->sk_flags);
 
 	/* Frob argbuf */
-	bufp->iov[0].iov_base += 4;
-	bufp->iov[0].iov_len  -= 4;
+	bufp->iov[0].rpc_base += 4;
+	bufp->iov[0].rpc_len  -= 4;
 
 	/* Now receive data */
 	len = svc_recvfrom(rqstp, bufp->iov, bufp->nriov, svsk->sk_reclen);
@@ -931,21 +965,25 @@ svc_tcp_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_buf	*bufp = &rqstp->rq_resbuf;
 	int sent;
+	int buflen = bufp->len << 2;
+	int i;
 
 	/* Set up the first element of the reply iovec.
 	 * Any other iovecs that may be in use have been taken
 	 * care of by the server implementation itself.
 	 */
-	bufp->iov[0].iov_base = bufp->base;
-	bufp->iov[0].iov_len  = bufp->len << 2;
-	bufp->base[0] = htonl(0x80000000|((bufp->len << 2) - 4));
+	bufp->iov[0].rpc_base = bufp->base;
+	bufp->iov[0].rpc_len  = buflen;
+	for (i = 1; i < bufp->nriov; i++)
+		buflen += bufp->iov[i].rpc_len;
+	bufp->base[0] = htonl(0x80000000|(buflen - 4));
 
 	sent = svc_sendto(rqstp, bufp->iov, bufp->nriov);
-	if (sent != bufp->len<<2) {
+	if (sent != buflen) {
 		printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n",
 		       rqstp->rq_sock->sk_server->sv_name,
 		       (sent<0)?"got error":"sent only",
-		       sent, bufp->len << 2);
+		       sent, buflen);
 		svc_delete_socket(rqstp->rq_sock);
 		sent = -EAGAIN;
 	}
@@ -1185,6 +1223,7 @@ svc_setup_socket(struct svc_serv *serv, 
 	svsk->sk_server = serv;
 	svsk->sk_lastrecv = CURRENT_TIME;
 	INIT_LIST_HEAD(&svsk->sk_deferred);
+	sema_init(&svsk->sk_sem, 1);
 
 	/* Initialize the socket */
 	if (sock->type == SOCK_DGRAM)
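
In the svc_tcp_sendto() hunk above, the TCP record marker has to cover the
page segments as well as the header buffer, which is why the patch sums
rpc_len over all iov entries before writing base[0].  A standalone sketch of
that computation follows.  The function name is made up for this note, and
it assumes, as the patch does, that the first segment already includes the
four bytes reserved for the marker itself.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Build the 4-byte RPC record marker for a single-fragment TCP reply.
 * seg_len[] holds the byte length of each segment of the reply; the first
 * segment is assumed to include the 4 bytes reserved for the marker, so
 * they are subtracted again.  Requires nseg >= 1 and a total of at least 4.
 * The result is in host byte order and must be passed through htonl()
 * before it goes on the wire.
 */
static uint32_t rpc_record_marker(const size_t *seg_len, size_t nseg)
{
    size_t total = 0;

    for (size_t i = 0; i < nseg; i++)
        total += seg_len[i];

    /* Top bit marks the last fragment; low 31 bits carry the length. */
    return UINT32_C(0x80000000) | (uint32_t)(total - 4);
}
```

For a reply made of a 104-byte header buffer followed by two 4096-byte
pages, the marker is 0x80000000 | 8292.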

[-- Attachment #4: va02-zerocopy-nfsdread-2.5.43.patch --]
[-- Type: Text/Plain, Size: 6693 bytes --]

--- linux.ORG/fs/nfsd/nfs3xdr.c	Fri Oct 18 12:26:29 2030
+++ linux/fs/nfsd/nfs3xdr.c	Fri Oct 18 12:32:35 2030
@@ -13,6 +13,7 @@
 #include <linux/spinlock.h>
 #include <linux/dcache.h>
 #include <linux/namei.h>
+#include <linux/pagemap.h>
 
 #include <linux/sunrpc/xdr.h>
 #include <linux/sunrpc/svc.h>
@@ -78,6 +79,34 @@ encode_fh(u32 *p, struct svc_fh *fhp)
 }
 
 /*
+ * Pad extra data at the end of the packet, as the length of an RPC packet
+ * must be a multiple of 4 bytes (one u32).
+ */
+static inline u32 *
+xdr_pack_data(struct svc_rqst *rqstp, u32 *p, unsigned long count)
+{
+	int pad = (XDR_QUADLEN(count) << 2) - count;
+	unsigned int index = rqstp->rq_resbuf.nriov;
+	struct rpcio_vec *iov = rqstp->rq_resbuf.iov;
+
+	if (index == 1)
+		return p + XDR_QUADLEN(count);
+
+	/* The last page may have enough room to pad. */
+	if (iov[index-1].rpc_page &&
+	    iov[index-1].rpc_offset + iov[index-1].rpc_len + pad <= PAGE_SIZE) {
+		iov[index - 1].rpc_len += pad;
+	} else {
+		static long dummy = 0;
+		iov[index].rpc_base = &dummy;
+		iov[index].rpc_len = pad;
+		iov[index].rpc_page = NULL;
+		rqstp->rq_resbuf.nriov++;
+	}
+	return p;
+}
+
+/*
  * Decode a file name and make sure that the path contains
  * no slashes or null bytes.
  */
@@ -569,7 +598,7 @@ nfs3svc_encode_readlinkres(struct svc_rq
 	p = encode_post_op_attr(rqstp, p, &resp->fh);
 	if (resp->status == 0) {
 		*p++ = htonl(resp->len);
-		p += XDR_QUADLEN(resp->len);
+		p = xdr_pack_data(rqstp, p, resp->len);
 	}
 	return xdr_ressize_check(rqstp, p);
 }
@@ -584,7 +613,7 @@ nfs3svc_encode_readres(struct svc_rqst *
 		*p++ = htonl(resp->count);
 		*p++ = htonl(resp->eof);
 		*p++ = htonl(resp->count);	/* xdr opaque count */
-		p += XDR_QUADLEN(resp->count);
+		p = xdr_pack_data(rqstp, p, resp->count);
 	}
 	return xdr_ressize_check(rqstp, p);
 }
@@ -647,7 +676,7 @@ nfs3svc_encode_readdirres(struct svc_rqs
 	if (resp->status == 0) {
 		/* stupid readdir cookie */
 		memcpy(p, resp->verf, 8); p += 2;
-		p += XDR_QUADLEN(resp->count);
+		p = xdr_pack_data(rqstp, p, resp->count);
 	}
 
 	return xdr_ressize_check(rqstp, p);
--- linux.ORG/fs/nfsd/nfsxdr.c	Fri Oct 18 12:26:29 2030
+++ linux/fs/nfsd/nfsxdr.c	Fri Oct 18 12:32:35 2030
@@ -55,6 +55,35 @@ encode_fh(u32 *p, struct svc_fh *fhp)
 	return p + (NFS_FHSIZE>> 2);
 }
 
+
+/*
+ * Pad extra data at the end of the packet, as the length of an RPC packet
+ * must be a multiple of 4 bytes (one u32).
+ */
+static inline u32 *
+xdr_pack_data(struct svc_rqst *rqstp, u32 *p, unsigned long count)
+{
+	int pad = (XDR_QUADLEN(count) << 2) - count;
+	unsigned int index = rqstp->rq_resbuf.nriov;
+	struct rpcio_vec *iov = rqstp->rq_resbuf.iov;
+
+	if (index == 1)
+		return p + XDR_QUADLEN(count);
+
+	/* The last page may have enough room to pad. */
+	if (iov[index-1].rpc_page &&
+	    iov[index-1].rpc_offset + iov[index-1].rpc_len + pad <= PAGE_SIZE) {
+		iov[index - 1].rpc_len += pad;
+	} else {
+		static long dummy = 0;
+		iov[index].rpc_base = &dummy;
+		iov[index].rpc_len = pad;
+		iov[index].rpc_page = NULL;
+		rqstp->rq_resbuf.nriov++;
+	}
+	return p;
+}
+
 /*
  * Decode a file name and make sure that the path contains
  * no slashes or null bytes.
@@ -361,7 +390,7 @@ nfssvc_encode_readlinkres(struct svc_rqs
 					struct nfsd_readlinkres *resp)
 {
 	*p++ = htonl(resp->len);
-	p += XDR_QUADLEN(resp->len);
+	p = xdr_pack_data(rqstp, p, resp->len);
 	return xdr_ressize_check(rqstp, p);
 }
 
@@ -371,7 +400,7 @@ nfssvc_encode_readres(struct svc_rqst *r
 {
 	p = encode_fattr(rqstp, p, &resp->fh);
 	*p++ = htonl(resp->count);
-	p += XDR_QUADLEN(resp->count);
+	p = xdr_pack_data(rqstp, p, resp->count);
 
 	return xdr_ressize_check(rqstp, p);
 }
@@ -380,7 +409,7 @@ int
 nfssvc_encode_readdirres(struct svc_rqst *rqstp, u32 *p,
 					struct nfsd_readdirres *resp)
 {
-	p += XDR_QUADLEN(resp->count);
+	p = xdr_pack_data(rqstp, p, resp->count);
 	return xdr_ressize_check(rqstp, p);
 }
 
--- linux.ORG/fs/nfsd/vfs.c	Fri Oct 18 12:26:29 2030
+++ linux/fs/nfsd/vfs.c	Fri Oct 18 12:36:13 2030
@@ -13,6 +13,7 @@
  * dentry, don't worry--they have been taken care of.
  *
  * Copyright (C) 1995-1999 Olaf Kirch <okir@monad.swb.de>
+ * Zerocopy NFS support (C) 2002 Hirokazu Takahashi <taka@valinux.co.jp>
  */
 
 #include <linux/config.h>
@@ -28,6 +29,7 @@
 #include <linux/net.h>
 #include <linux/unistd.h>
 #include <linux/slab.h>
+#include <linux/pagemap.h>
 #include <linux/in.h>
 #include <linux/module.h>
 #include <linux/namei.h>
@@ -571,6 +573,61 @@ found:
 }
 
 /*
+ * Grab and keep cached pages associated with a file in the svc_rqst
+ * so that they can be passed to the network sendmsg/sendpage routines
+ * directly. They will be released after the sending has completed.
+ */
+static int
+nfsd_read_actor(read_descriptor_t *desc, struct page *page, unsigned long offset , unsigned long size)
+{
+	unsigned long count = desc->count;
+	struct svc_rqst *rqstp = (struct svc_rqst *)desc->buf;
+	unsigned int index = rqstp->rq_resbuf.nriov;
+	struct rpcio_vec *iov = rqstp->rq_resbuf.iov;
+
+	if (size > count)
+		size = count;
+
+	if (page == iov[index-1].rpc_page
+		  && offset == iov[index-1].rpc_offset + iov[index-1].rpc_len) {
+		/* the page can be coalesced */
+		iov[index-1].rpc_len += size;
+	} else {
+		rqstp->rq_resbuf.nriov++;
+		get_page(page);
+		iov[index].rpc_page = page;
+		iov[index].rpc_offset = offset;
+		iov[index].rpc_len = size;
+	}
+
+	desc->count = count - size;
+	desc->written += size;
+	return size;
+}
+
+static inline ssize_t
+nfsd_getpages(struct file *filp, struct svc_rqst *rqstp, unsigned long count)
+{
+	read_descriptor_t desc;
+	ssize_t	retval;
+
+	if (!count)
+		return 0;
+
+	desc.written = 0;
+	desc.count = count;
+	desc.buf = (char *)rqstp;
+	desc.error = 0;
+	do_generic_file_read(filp, &filp->f_pos, &desc, nfsd_read_actor);
+
+	retval = desc.written;
+	if (!retval)
+		retval = desc.error;
+	return retval;
+}
+
+
+/*
  * Read data from a file. count must contain the requested read count
  * on entry. On return, *count contains the number of bytes actually read.
  * N.B. After this call fhp needs an fh_put
@@ -601,10 +658,17 @@ nfsd_read(struct svc_rqst *rqstp, struct
 	if (ra)
 		file.f_ra = ra->p_ra;
 
-	oldfs = get_fs();
-	set_fs(KERNEL_DS);
-	err = vfs_read(&file, buf, *count, &offset);
-	set_fs(oldfs);
+	/* ToDo: NFSv4 can't handle fragmented data yet. */
+/* 	if (inode->i_mapping->a_ops->readpage) { */
+	if (inode->i_mapping->a_ops->readpage && rqstp->rq_vers <= 3) {
+		file.f_pos = offset;
+		err = nfsd_getpages(&file, rqstp, *count);
+	} else {
+		oldfs = get_fs();
+		set_fs(KERNEL_DS);
+		err = vfs_read(&file, buf, *count, &offset);
+		set_fs(oldfs);
+	}
 
 	/* Write back readahead params */
 	if (ra)
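
The padding rule that xdr_pack_data() in the patch above applies can be illustrated outside the kernel. The following is a minimal userspace sketch, not part of the patch: xdr_quadlen() mirrors the kernel's XDR_QUADLEN macro, while the helper name xdr_pad_bytes() is made up for this example.

```c
#include <stddef.h>

/* Number of 32-bit words needed to hold len bytes (same formula as XDR_QUADLEN). */
static size_t xdr_quadlen(size_t len)
{
	return (len + 3) >> 2;
}

/*
 * Hypothetical helper: number of pad bytes that must follow len bytes of
 * opaque data so that the total is a multiple of 4, as XDR requires.
 */
static size_t xdr_pad_bytes(size_t len)
{
	return (xdr_quadlen(len) << 2) - len;
}
```

For example, a 5-byte payload needs 3 pad bytes and a 4096-byte payload needs none. The patch either grows the last page entry by that amount or appends a small extra iovec entry for the pad.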

[-- Attachment #5: va03-zerocopy-nfsdreaddir-2.5.43.patch --]
[-- Type: Text/Plain, Size: 1426 bytes --]

--- linux.ORG/fs/nfsd/vfs.c	Fri Oct 18 21:24:43 2030
+++ linux/fs/nfsd/vfs.c	Fri Oct 18 21:23:48 2030
@@ -1460,6 +1460,7 @@ nfsd_readdir(struct svc_rqst *rqstp, str
 	int		oldlen, eof, err;
 	struct file	file;
 	struct readdir_cd cd;
+	struct page *page = NULL;
 
 	err = nfsd_open(rqstp, fhp, S_IFDIR, MAY_READ, &file);
 	if (err)
@@ -1469,6 +1470,15 @@ nfsd_readdir(struct svc_rqst *rqstp, str
 
 	file.f_pos = offset;
 
+	/* ToDo: NFSv4 can't handle fragmented data yet. */
+/* 	if (*countp <= (PAGE_SIZE >> 2)) { */
+	if (*countp <= (PAGE_SIZE >> 2) && rqstp->rq_vers <= 3) {
+		/* Don't care if we couldn't get a page. */
+		page = alloc_page(GFP_KERNEL);
+		if (page)
+			buffer = page_address(page);
+	}
+
 	/* Set up the readdir context */
 	memset(&cd, 0, sizeof(cd));
 	cd.rqstp  = rqstp;
@@ -1518,11 +1528,22 @@ nfsd_readdir(struct svc_rqst *rqstp, str
 	*p++ = htonl(eof);		/* end of directory */
 	*countp = (caddr_t) p - (caddr_t) buffer;
 
+	if (page) {
+		int index = rqstp->rq_resbuf.nriov;
+		get_page(page);
+		rqstp->rq_resbuf.iov[index].rpc_page = page;
+		rqstp->rq_resbuf.iov[index].rpc_base = NULL;
+		rqstp->rq_resbuf.iov[index].rpc_len = *countp;
+		rqstp->rq_resbuf.nriov++;
+	}
+
 	dprintk("nfsd: readdir result %d bytes, eof %d offset %d\n",
 				*countp, eof,
 				cd.offset? ntohl(*cd.offset) : -1);
 	err = 0;
 out_close:
+	if (page)
+		put_page(page);
 	nfsd_close(&file);
 out:
 	return err;

[-- Attachment #6: va04-zerocopy-shadowsock-2.5.43.patch --]
[-- Type: Text/Plain, Size: 6649 bytes --]

--- linux.ORG/include/linux/sunrpc/svcsock.h	Fri Oct 18 12:32:04 2030
+++ linux/include/linux/sunrpc/svcsock.h	Fri Oct 18 12:42:02 2030
@@ -52,6 +52,7 @@ struct svc_sock {
 	int			sk_reclen;	/* length of record */
 	int			sk_tcplen;	/* current read length */
 	time_t			sk_lastrecv;	/* time of last received request */
+	struct svc_sock		**sk_shadow;	/* shadow sockets for sending */
 };
 
 /*
--- linux.ORG/net/sunrpc/svcsock.c	Fri Oct 18 12:32:04 2030
+++ linux/net/sunrpc/svcsock.c	Fri Oct 18 12:42:02 2030
@@ -65,7 +65,9 @@
 
 
 static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *,
-					 int *errp, int pmap_reg);
+					 int *errp, int type);
+#define SVSK_PMAP_REGISTER	1
+#define SVSK_SHADOW		2
 static void		svc_udp_data_ready(struct sock *, int);
 static int		svc_udp_recvfrom(struct svc_rqst *);
 static int		svc_udp_sendto(struct svc_rqst *);
@@ -260,6 +262,8 @@ svc_sock_put(struct svc_sock *svsk)
 	if (!--(svsk->sk_inuse) && test_bit(SK_DEAD, &svsk->sk_flags)) {
 		spin_unlock_bh(&serv->sv_lock);
 		dprintk("svc: releasing dead socket\n");
+		if (svsk->sk_shadow)
+			kfree(svsk->sk_shadow);
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	}
@@ -328,10 +332,10 @@ svc_wake_up(struct svc_serv *serv)
  * Generic sendto routine
  */
 static int
-svc_sendto(struct svc_rqst *rqstp, struct rpcio_vec *iov, int nr)
+svc_sendto(struct svc_rqst *rqstp, struct svc_sock *svsk,
+				 struct rpcio_vec *iov, int nr)
 {
 	mm_segment_t	oldfs;
-	struct svc_sock	*svsk = rqstp->rq_sock;
 	struct socket	*sock = svsk->sk_sock;
 	struct msghdr	msg;
 	unsigned int	flags = MSG_MORE;
@@ -593,6 +597,7 @@ static int
 svc_udp_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_buf	*bufp = &rqstp->rq_resbuf;
+	struct svc_sock	*svsk = rqstp->rq_sock;
 	int		error;
 
 	/* Set up the first element of the reply iovec.
@@ -603,10 +608,25 @@ svc_udp_sendto(struct svc_rqst *rqstp)
 	bufp->iov[0].rpc_base = bufp->base;
 	bufp->iov[0].rpc_len  = bufp->len << 2;
 
-	error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+#ifdef CONFIG_SMP
+	if (svsk->sk_shadow) {
+		struct svc_sock	*shadow = svsk->sk_shadow[smp_processor_id()];
+		if (shadow) {
+			struct svc_serv	*serv = svsk->sk_server;
+			svsk = shadow;
+			if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
+				svc_sock_setbufsize(svsk->sk_sock,
+					(serv->sv_nrthreads+3) * serv->sv_bufsz,
+					(serv->sv_nrthreads+3) * serv->sv_bufsz);
+		}
+
+	}
+#endif
+
+	error = svc_sendto(rqstp, svsk, bufp->iov, bufp->nriov);
 	if (error == -ECONNREFUSED)
 		/* ICMP error on earlier request. */
-		error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+		error = svc_sendto(rqstp, svsk, bufp->iov, bufp->nriov);
 
 	return error;
 }
@@ -978,7 +998,7 @@ svc_tcp_sendto(struct svc_rqst *rqstp)
 		buflen += bufp->iov[i].rpc_len;
 	bufp->base[0] = htonl(0x80000000|(buflen - 4));
 
-	sent = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+	sent = svc_sendto(rqstp, rqstp->rq_sock, bufp->iov, bufp->nriov);
 	if (sent != buflen) {
 		printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n",
 		       rqstp->rq_sock->sk_server->sv_name,
@@ -1201,7 +1221,7 @@ svc_send(struct svc_rqst *rqstp)
  */
 static struct svc_sock *
 svc_setup_socket(struct svc_serv *serv, struct socket *sock,
-					int *errp, int pmap_register)
+					int *errp, int type)
 {
 	struct svc_sock	*svsk;
 	struct sock	*inet;
@@ -1222,6 +1242,7 @@ svc_setup_socket(struct svc_serv *serv, 
 	svsk->sk_owspace = inet->write_space;
 	svsk->sk_server = serv;
 	svsk->sk_lastrecv = CURRENT_TIME;
+	svsk->sk_shadow = NULL;
 	INIT_LIST_HEAD(&svsk->sk_deferred);
 	sema_init(&svsk->sk_sem, 1);
 
@@ -1234,7 +1255,7 @@ if (svsk->sk_sk == NULL)
 	printk(KERN_WARNING "svsk->sk_sk == NULL after svc_prot_init!\n");
 
 	/* Register socket with portmapper */
-	if (*errp >= 0 && pmap_register)
+	if (*errp >= 0 && type == SVSK_PMAP_REGISTER)
 		*errp = svc_register(serv, inet->protocol,
 				     ntohs(inet_sk(inet)->sport));
 
@@ -1246,13 +1267,13 @@ if (svsk->sk_sk == NULL)
 
 
 	spin_lock_bh(&serv->sv_lock);
-	if (!pmap_register) {
+	if (type == SVSK_PMAP_REGISTER || type == SVSK_SHADOW) {
+		clear_bit(SK_TEMP, &svsk->sk_flags);
+		list_add(&svsk->sk_list, &serv->sv_permsocks);
+	} else {
 		set_bit(SK_TEMP, &svsk->sk_flags);
 		list_add(&svsk->sk_list, &serv->sv_tempsocks);
 		serv->sv_tmpcnt++;
-	} else {
-		clear_bit(SK_TEMP, &svsk->sk_flags);
-		list_add(&svsk->sk_list, &serv->sv_permsocks);
 	}
 	spin_unlock_bh(&serv->sv_lock);
 
@@ -1261,6 +1282,61 @@ if (svsk->sk_sk == NULL)
 	return svsk;
 }
 
+
+/*
+ * Create a shadow socket which has the same source port as the given svsk.
+ * Let each CPU have its own socket to send packets.
+ */
+static int
+svc_create_shadow_socket(struct svc_serv *serv, struct svc_sock	*svsk,
+				int protocol, struct sockaddr_in *sin)
+{
+#ifdef CONFIG_SMP
+	int		error;
+	struct socket	*newsock;
+	struct svc_sock	*newsvsk;
+	int		i;
+
+	if (num_online_cpus() == 1)
+		return 0;
+
+	svsk->sk_shadow = kmalloc(sizeof(struct svc_sock*)*NR_CPUS, GFP_KERNEL);
+	if (!svsk->sk_shadow)
+		return -ENOMEM;
+
+	memset(svsk->sk_shadow, 0, sizeof(struct svc_sock*)*NR_CPUS);
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+
+		if ((error = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &newsock)) < 0)
+			return error;
+		if ((newsvsk = svc_setup_socket(serv, newsock, &error, SVSK_SHADOW)) == NULL) {
+			sock_release(newsock);
+			return error;
+		}
+		/*
+		 * Make newsvsk a shadow of svsk.
+		 */
+		newsock->sk->reuse = 1; /* allow address reuse */
+		error = newsock->ops->bind(newsock, (struct sockaddr *) sin,
+						sizeof(*sin));
+		if (error < 0) {
+			sock_release(newsock);
+			kfree(newsvsk);
+			return error;
+		}
+		/*
+		 * Unhash the new socket so that it does not receive packets.
+		 */
+		newsock->sk->prot->unhash(newsock->sk);
+		svsk->sk_shadow[i] = newsvsk;
+	}
+#endif
+	return 0;
+}
+
 /*
  * Create socket for RPC service.
  */
@@ -1300,8 +1376,13 @@ svc_create_socket(struct svc_serv *serv,
 			goto bummer;
 	}
 
-	if ((svsk = svc_setup_socket(serv, sock, &error, 1)) != NULL)
-		return 0;
+	if ((svsk = svc_setup_socket(serv, sock, &error, SVSK_PMAP_REGISTER)) == NULL)
+		goto bummer;
+
+	if (protocol == IPPROTO_UDP && sin != NULL)
+		svc_create_shadow_socket(serv, svsk, protocol, sin);
+
+	return 0;
 
 bummer:
 	dprintk("svc: svc_create_socket error = %d\n", -error);
@@ -1340,6 +1421,8 @@ svc_delete_socket(struct svc_sock *svsk)
 
 	if (!svsk->sk_inuse) {
 		spin_unlock_bh(&serv->sv_lock);
+		if (svsk->sk_shadow)
+			kfree(svsk->sk_shadow);
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	} else {

[-- Attachment #7: va05-zerocopy-nfsdwrite-2.5.43.patch --]
[-- Type: Text/Plain, Size: 18875 bytes --]

--- linux.ORG/include/linux/sunrpc/svc.h	Fri Oct 18 21:24:38 2030
+++ linux/include/linux/sunrpc/svc.h	Fri Oct 18 21:26:01 2030
@@ -70,7 +70,7 @@ struct svc_serv {
  * the list of IP fragments once we get to process fragmented UDP
  * datagrams directly.
  */
-#define RPCSVC_MAXIOV		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 2)
+#define RPCSVC_MAXIOV		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE*3 + 2)
 struct svc_buf {
 	u32 *			area;	/* allocated memory */
 	u32 *			base;	/* base of RPC datagram */
@@ -79,7 +79,7 @@ struct svc_buf {
 	int			len;	/* current end of buffer */
 
 	/* 
-	 * iovec for zero-copy NFS READs
+	 * iovec for zero-copy NFS READs/WRITEs
 	 * pages and non-page data can be mixed.
 	 */
 	struct rpcio_vec {
@@ -204,7 +204,13 @@ struct svc_procedure {
 	unsigned int		pc_count;	/* call count */
 	unsigned int		pc_cachetype;	/* cache info (NFS) */
 	unsigned int		pc_xdrressize;	/* maximum size of XDR reply */
+	unsigned int		pc_flags;
 };
+
+/*
+ * pc_flags
+ */
+#define RPC_HANDLE_IOVARG	0x1	/* can accept separated arg buffers */
 
 /*
  * This is the RPC server thread function prototype
--- linux.ORG/net/sunrpc/svcsock.c	Fri Oct 18 21:26:29 2030
+++ linux/net/sunrpc/svcsock.c	Fri Oct 18 21:26:01 2030
@@ -514,6 +514,98 @@ svc_write_space(struct sock *sk)
 	}
 }
 
+static inline int
+svc_map_skb_rpciovec_one(struct sk_buff *skb, struct rpcio_vec *iov, int *slotp)
+{
+	int i;
+	int slot = *slotp;
+
+	if (slot >= RPCSVC_MAXIOV)
+		return 1;
+
+	iov[slot].rpc_page = NULL;
+	iov[slot].rpc_base = skb->data;
+	iov[slot].rpc_len = skb_headlen(skb);
+	slot++;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+		if (slot >= RPCSVC_MAXIOV)
+			return 1;
+		/* TODO: Highmem is not supported yet.  */
+		if (PageHighMem(frag->page))
+			return 1;
+		/* 
+		 * Some drivers may split an skb across several pages in the
+		 * near future, because slab allocation for GbE jumbo frames
+		 * causes too much memory pressure.
+		 */
+		iov[slot].rpc_page = frag->page;
+		iov[slot].rpc_offset = frag->page_offset;
+		iov[slot].rpc_len = frag->size;
+		slot++;
+	}
+	*slotp = slot;
+	return 0;
+}
+
+/*
+ * Map fragments in the skb into rpc_iovec if possible.
+ */
+static inline int
+svc_map_skb_rpciovec(struct sk_buff *skb, struct svc_buf *bufp)
+{
+	int slot = 0;
+	struct sk_buff *list;
+
+	/* 
+	 * Make sure the first buffer is big enough that knfsd or other
+	 * services can handle it easily.
+	 */
+	if (skb_headlen(skb) < 1400)
+		return 1;
+
+	if (svc_map_skb_rpciovec_one(skb, bufp->iov, &slot))
+		return 1;
+
+	bufp->iov[0].rpc_base += sizeof(struct udphdr);
+	bufp->iov[0].rpc_len -= sizeof(struct udphdr);
+
+	for (list = skb_shinfo(skb)->frag_list; list; list = list->next) {
+		if (svc_map_skb_rpciovec_one(list, bufp->iov, &slot))
+			return 1;
+	}
+	bufp->nriov = slot;
+	return 0;
+}
+
+/*
+ * Copy data from fragmented UDP frame into the RPC buffer.
+ */
+static inline u32*
+svc_copy_skb_argbuf(struct svc_rqst *rqstp, struct sk_buff *skb)
+{
+	struct iovec iov;
+	mm_segment_t		oldfs;
+	int err;
+
+	iov.iov_base = rqstp->rq_argbuf.buf;
+	iov.iov_len = skb->len - sizeof(struct udphdr);
+
+	oldfs = get_fs(); set_fs(KERNEL_DS);
+	if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
+		err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), &iov, iov.iov_len);
+	} else {
+		err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), &iov);
+	}
+	set_fs(oldfs);
+	if (err)
+		return NULL;
+
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+	return rqstp->rq_argbuf.buf;
+}
+
 /*
  * Receive a datagram from a UDP socket.
  */
@@ -549,9 +641,13 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 	}
 	set_bit(SK_DATA, &svsk->sk_flags); /* there may be more data... */
 
-	/* Sorry. */
-	if (skb_is_nonlinear(skb)) {
-		if (skb_linearize(skb, GFP_KERNEL) != 0) {
+	len  = skb->len - sizeof(struct udphdr);
+	data = (u32 *) (skb->data + sizeof(struct udphdr));
+
+	if (skb_is_nonlinear(skb) &&
+			svc_map_skb_rpciovec(skb, &rqstp->rq_argbuf)) {
+		data = svc_copy_skb_argbuf(rqstp, skb);
+		if (data == NULL) {
 			kfree_skb(skb);
 			svc_sock_received(svsk);
 			return 0;
@@ -566,16 +662,15 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 		}
 	}
 
-
-	len  = skb->len - sizeof(struct udphdr);
-	data = (u32 *) (skb->data + sizeof(struct udphdr));
-
 	rqstp->rq_skbuff      = skb;
 	rqstp->rq_argbuf.base = data;
 	rqstp->rq_argbuf.buf  = data;
 	rqstp->rq_argbuf.len  = (len >> 2);
 	rqstp->rq_argbuf.buflen = (len >> 2);
-	/* rqstp->rq_resbuf      = rqstp->rq_defbuf; */
+
+	rqstp->rq_resbuf.base   += rqstp->rq_argbuf.buflen;
+	rqstp->rq_resbuf.buf     = rqstp->rq_resbuf.base;
+	rqstp->rq_resbuf.buflen -= rqstp->rq_argbuf.buflen;
 	rqstp->rq_prot        = IPPROTO_UDP;
 
 	/* Get sender address */
@@ -1067,6 +1162,17 @@ svc_sock_update_bufs(struct svc_serv *se
 	spin_unlock_bh(&serv->sv_lock);
 }
 
+inline void
+svc_clear_buffer(struct svc_buf *target, struct svc_buf *defbuf)
+{
+	target->base   = defbuf->base;
+	target->buflen = defbuf->buflen;
+	target->buf    = defbuf->buf;
+	target->len    = defbuf->len;
+	target->iov[0] = defbuf->iov[0];
+	target->nriov  = defbuf->nriov;
+}
+
 /*
  * Receive the next request on any socket.
  */
@@ -1090,8 +1196,8 @@ svc_recv(struct svc_serv *serv, struct s
 			 rqstp);
 
 	/* Initialize the buffers */
-	rqstp->rq_argbuf = rqstp->rq_defbuf;
-	rqstp->rq_resbuf = rqstp->rq_defbuf;
+	svc_clear_buffer(&rqstp->rq_argbuf, &rqstp->rq_defbuf);
+	svc_clear_buffer(&rqstp->rq_resbuf, &rqstp->rq_defbuf);
 
 	if (signalled())
 		return -EINTR;
--- linux.ORG/net/sunrpc/svc.c	Fri Oct 18 21:24:38 2030
+++ linux/net/sunrpc/svc.c	Fri Oct 18 21:26:01 2030
@@ -13,6 +13,7 @@
 #include <linux/net.h>
 #include <linux/in.h>
 #include <linux/unistd.h>
+#include <linux/pagemap.h>
 
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/xdr.h>
@@ -233,6 +234,40 @@ svc_register(struct svc_serv *serv, int 
 	return error;
 }
 
+static inline void
+svc_linearize_argbuf(struct svc_rqst *rqstp)
+{
+	struct svc_buf *argp = &rqstp->rq_argbuf;
+	char *newbuf;
+	char *base;
+	char *p;
+	unsigned int skip, len;
+	int i;
+
+	skip = (char*)argp->buf - (char*)argp->iov[0].rpc_base;
+	len = argp->iov[0].rpc_len - skip;
+	newbuf = (char*)rqstp->rq_defbuf.base + skip;
+
+	memcpy(newbuf, argp->buf, len);
+	p = newbuf + len;
+
+	for (i = 1; i < argp->nriov; i++) {
+		if (argp->iov[i].rpc_page) {
+			base = kmap(argp->iov[i].rpc_page) + argp->iov[i].rpc_offset;
+		} else {
+			base = argp->iov[i].rpc_base;
+		}
+		memcpy(p, base, argp->iov[i].rpc_len);
+		p += argp->iov[i].rpc_len;
+		if (argp->iov[i].rpc_page)
+			kunmap(argp->iov[i].rpc_page);
+	}
+	rqstp->rq_argbuf.base = rqstp->rq_defbuf.base;
+	rqstp->rq_argbuf.buf = (u32*)newbuf;
+	rqstp->rq_argbuf.nriov = 1;
+}
+
+
 /*
  * Process the RPC request.
  */
@@ -322,6 +357,15 @@ svc_process(struct svc_serv *serv, struc
 	 */
 	if (procp->pc_xdrressize)
 		svc_reserve(rqstp, procp->pc_xdrressize<<2);
+
+	/* Linearize argbuf when the procedure can't handle split buffers.
+	 * This rarely happens with NFS v2/v3, but it can happen more often
+	 * with NFS v4 because of its compound procedures. NFSv4 xdr routines
+	 * must either handle split buffers or leave the RPC_HANDLE_IOVARG
+	 * flag unset.
+	 */
+	if (argp->nriov > 1 && !(procp->pc_flags & RPC_HANDLE_IOVARG))
+		svc_linearize_argbuf(rqstp);
 
 	/* Call the function that processes the request. */
 	if (!versp->vs_dispatch) {
--- linux.ORG/fs/nfsd/vfs.c	Fri Oct 18 21:26:22 2030
+++ linux/fs/nfsd/vfs.c	Fri Oct 18 21:26:01 2030
@@ -686,6 +686,61 @@ out:
 	return err;
 }
 
+static inline int
+nfsd_writev(struct svc_rqst *rqstp, struct file	*file,
+				char *buf, unsigned long cnt)
+{
+	struct iovec		iov[RPCSVC_MAXIOV];
+	struct rpcio_vec	*rpciov = rqstp->rq_argbuf.iov;
+	unsigned int		len, sub;
+	char			*base = NULL;
+	int			slot = 0;
+	int			i;
+	mm_segment_t		oldfs;
+	int			err;
+
+	/* Look for the starting rpciov including the buf. */
+	for (i = 0; i < rqstp->rq_argbuf.nriov; i++, rpciov++) {
+		if (rpciov->rpc_page) {
+			/* HighMem is not supported yet. */
+			if (PageHighMem(rpciov->rpc_page))
+				BUG();
+			base = page_address(rpciov->rpc_page) + rpciov->rpc_offset;
+		} else {
+			base = rpciov->rpc_base;
+		}
+		if (base <= buf && buf < base + rpciov->rpc_len)
+			break;
+	}
+
+	iov[slot].iov_base = buf;
+	iov[slot].iov_len = rpciov->rpc_len - (buf - base);
+	len = iov[slot].iov_len;
+	for (i++, slot++, rpciov++ ; i < rqstp->rq_argbuf.nriov; i++, slot++, rpciov++) {
+		if (rpciov->rpc_page) {
+			/* HighMem is not supported yet. */
+			if (PageHighMem(rpciov->rpc_page))
+				BUG();
+			iov[slot].iov_base = page_address(rpciov->rpc_page) + rpciov->rpc_offset;
+		} else {
+			iov[slot].iov_base = rpciov->rpc_base;
+		}
+		iov[slot].iov_len = rpciov->rpc_len;
+		len += iov[slot].iov_len;
+	}
+	while (len > cnt) {
+		sub = min_t(unsigned int, iov[slot-1].iov_len, len - cnt);
+		len -= sub;
+		iov[slot-1].iov_len -= sub;
+		if (iov[slot-1].iov_len == 0)
+			slot--;
+	}
+	oldfs = get_fs(); set_fs(KERNEL_DS);
+	err = file->f_op->writev(file, iov, slot, &file->f_pos);
+	set_fs(oldfs);
+	return err;
+}
+
 /*
  * Write data to a file.
  * The stable flag requests synchronous writes.
@@ -740,11 +795,16 @@ nfsd_write(struct svc_rqst *rqstp, struc
 		file.f_flags |= O_SYNC;
 
 	/* Write the data. */
-	oldfs = get_fs(); set_fs(KERNEL_DS);
-	err = vfs_write(&file, buf, cnt, &offset);
+	if (rqstp->rq_argbuf.nriov == 1) {
+		oldfs = get_fs(); set_fs(KERNEL_DS);
+		err = vfs_write(&file, buf, cnt, &offset);
+		set_fs(oldfs);
+	} else {
+		file.f_pos = offset;		/* set write offset */
+		err = nfsd_writev(rqstp, &file, buf, cnt);
+	}
 	if (err >= 0)
 		nfsdstats.io_write += cnt;
-	set_fs(oldfs);
 
 	/* clear setuid/setgid flag after write */
 	if (err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) {
--- linux.ORG/fs/nfsd/nfsproc.c	Fri Oct 18 21:18:42 2030
+++ linux/fs/nfsd/nfsproc.c	Fri Oct 18 21:26:01 2030
@@ -522,7 +522,7 @@ nfsd_proc_statfs(struct svc_rqst * rqstp
 #define nfssvc_release_none	NULL
 struct nfsd_void { int dummy; };
 
-#define PROC(name, argt, rest, relt, cache, respsize)	\
+#define PROC(name, argt, rest, relt, cache, respsize, flags)	\
  { (svc_procfunc) nfsd_proc_##name,		\
    (kxdrproc_t) nfssvc_decode_##argt,		\
    (kxdrproc_t) nfssvc_encode_##rest,		\
@@ -532,6 +532,7 @@ struct nfsd_void { int dummy; };
    0,						\
    cache,					\
    respsize,				       	\
+   flags,					\
  }
 
 #define ST 1		/* status */
@@ -539,24 +540,24 @@ struct nfsd_void { int dummy; };
 #define	AT 18		/* attributes */
 
 static struct svc_procedure		nfsd_procedures2[18] = {
-  PROC(null,	 void,		void,		none,		RC_NOCACHE, ST),
-  PROC(getattr,	 fhandle,	attrstat,	fhandle,	RC_NOCACHE, ST+AT),
-  PROC(setattr,  sattrargs,	attrstat,	fhandle,	RC_REPLBUFF, ST+AT),
-  PROC(none,	 void,		void,		none,		RC_NOCACHE, ST),
-  PROC(lookup,	 diropargs,	diropres,	fhandle,	RC_NOCACHE, ST+FH+AT),
-  PROC(readlink, fhandle,	readlinkres,	none,		RC_NOCACHE, ST+1+NFS_MAXPATHLEN/4),
-  PROC(read,	 readargs,	readres,	fhandle,	RC_NOCACHE, ST+AT+1+NFSSVC_MAXBLKSIZE),
-  PROC(none,	 void,		void,		none,		RC_NOCACHE, ST),
-  PROC(write,	 writeargs,	attrstat,	fhandle,	RC_REPLBUFF, ST+AT),
-  PROC(create,	 createargs,	diropres,	fhandle,	RC_REPLBUFF, ST+FH+AT),
-  PROC(remove,	 diropargs,	void,		none,		RC_REPLSTAT, ST),
-  PROC(rename,	 renameargs,	void,		none,		RC_REPLSTAT, ST),
-  PROC(link,	 linkargs,	void,		none,		RC_REPLSTAT, ST),
-  PROC(symlink,	 symlinkargs,	void,		none,		RC_REPLSTAT, ST),
-  PROC(mkdir,	 createargs,	diropres,	fhandle,	RC_REPLBUFF, ST+FH+AT),
-  PROC(rmdir,	 diropargs,	void,		none,		RC_REPLSTAT, ST),
-  PROC(readdir,	 readdirargs,	readdirres,	none,		RC_REPLBUFF, 0),
-  PROC(statfs,	 fhandle,	statfsres,	none,		RC_NOCACHE, ST+5),
+  PROC(null,	 void,		void,		none,		RC_NOCACHE, ST, 0),
+  PROC(getattr,	 fhandle,	attrstat,	fhandle,	RC_NOCACHE, ST+AT, 0),
+  PROC(setattr,  sattrargs,	attrstat,	fhandle,	RC_REPLBUFF, ST+AT, 0),
+  PROC(none,	 void,		void,		none,		RC_NOCACHE, ST, 0),
+  PROC(lookup,	 diropargs,	diropres,	fhandle,	RC_NOCACHE, ST+FH+AT, 0),
+  PROC(readlink, fhandle,	readlinkres,	none,		RC_NOCACHE, ST+1+NFS_MAXPATHLEN/4, 0),
+  PROC(read,	 readargs,	readres,	fhandle,	RC_NOCACHE, ST+AT+1+NFSSVC_MAXBLKSIZE, 0),
+  PROC(none,	 void,		void,		none,		RC_NOCACHE, ST, 0),
+  PROC(write,	 writeargs,	attrstat,	fhandle,	RC_REPLBUFF, ST+AT, RPC_HANDLE_IOVARG),
+  PROC(create,	 createargs,	diropres,	fhandle,	RC_REPLBUFF, ST+FH+AT, 0),
+  PROC(remove,	 diropargs,	void,		none,		RC_REPLSTAT, ST, 0),
+  PROC(rename,	 renameargs,	void,		none,		RC_REPLSTAT, ST, 0),
+  PROC(link,	 linkargs,	void,		none,		RC_REPLSTAT, ST, 0),
+  PROC(symlink,	 symlinkargs,	void,		none,		RC_REPLSTAT, ST, 0),
+  PROC(mkdir,	 createargs,	diropres,	fhandle,	RC_REPLBUFF, ST+FH+AT, 0),
+  PROC(rmdir,	 diropargs,	void,		none,		RC_REPLSTAT, ST, 0),
+  PROC(readdir,	 readdirargs,	readdirres,	none,		RC_REPLBUFF, 0, 0),
+  PROC(statfs,	 fhandle,	statfsres,	none,		RC_NOCACHE, ST+5, 0),
 };
 
 
--- linux.ORG/fs/nfsd/nfs3proc.c	Fri Oct 18 21:18:42 2030
+++ linux/fs/nfsd/nfs3proc.c	Fri Oct 18 21:26:01 2030
@@ -645,7 +645,7 @@ nfsd3_proc_commit(struct svc_rqst * rqst
 #define nfsd3_voidres			nfsd3_voidargs
 struct nfsd3_voidargs { int dummy; };
 
-#define PROC(name, argt, rest, relt, cache, respsize)	\
+#define PROC(name, argt, rest, relt, cache, respsize, flags)	\
  { (svc_procfunc) nfsd3_proc_##name,		\
    (kxdrproc_t) nfs3svc_decode_##argt##args,	\
    (kxdrproc_t) nfs3svc_encode_##rest##res,	\
@@ -655,6 +655,7 @@ struct nfsd3_voidargs { int dummy; };
    0,						\
    cache,					\
    respsize,					\
+   flags,					\
  }
 
 #define ST 1		/* status*/
@@ -664,28 +665,28 @@ struct nfsd3_voidargs { int dummy; };
 #define WC (7+pAT)	/* WCC attributes */
 
 static struct svc_procedure		nfsd_procedures3[22] = {
-  PROC(null,	 void,		void,		void,	  RC_NOCACHE, ST),
-  PROC(getattr,	 fhandle,	attrstat,	fhandle,  RC_NOCACHE, ST+AT),
-  PROC(setattr,  sattr,		wccstat,	fhandle,  RC_REPLBUFF, ST+WC),
-  PROC(lookup,	 dirop,		dirop,		fhandle2, RC_NOCACHE, ST+FH+pAT+pAT),
-  PROC(access,	 access,	access,		fhandle,  RC_NOCACHE, ST+pAT+1),
-  PROC(readlink, fhandle,	readlink,	fhandle,  RC_NOCACHE, ST+pAT+1+NFS3_MAXPATHLEN/4),
-  PROC(read,	 read,		read,		fhandle,  RC_NOCACHE, ST+pAT+4+NFSSVC_MAXBLKSIZE),
-  PROC(write,	 write,		write,		fhandle,  RC_REPLBUFF, ST+WC+4),
-  PROC(create,	 create,	create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC),
-  PROC(mkdir,	 mkdir,		create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC),
-  PROC(symlink,	 symlink,	create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC),
-  PROC(mknod,	 mknod,		create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC),
-  PROC(remove,	 dirop,		wccstat,	fhandle,  RC_REPLBUFF, ST+WC),
-  PROC(rmdir,	 dirop,		wccstat,	fhandle,  RC_REPLBUFF, ST+WC),
-  PROC(rename,	 rename,	rename,		fhandle2, RC_REPLBUFF, ST+WC+WC),
-  PROC(link,	 link,		link,		fhandle2, RC_REPLBUFF, ST+pAT+WC),
-  PROC(readdir,	 readdir,	readdir,	fhandle,  RC_NOCACHE, 0),
-  PROC(readdirplus,readdirplus,	readdir,	fhandle,  RC_NOCACHE, 0),
-  PROC(fsstat,	 fhandle,	fsstat,		void,     RC_NOCACHE, ST+pAT+2*6+1),
-  PROC(fsinfo,   fhandle,	fsinfo,		void,     RC_NOCACHE, ST+pAT+12),
-  PROC(pathconf, fhandle,	pathconf,	void,     RC_NOCACHE, ST+pAT+6),
-  PROC(commit,	 commit,	commit,		fhandle,  RC_NOCACHE, ST+WC+2),
+  PROC(null,	 void,		void,		void,	  RC_NOCACHE, ST, 0),
+  PROC(getattr,	 fhandle,	attrstat,	fhandle,  RC_NOCACHE, ST+AT, 0),
+  PROC(setattr,  sattr,		wccstat,	fhandle,  RC_REPLBUFF, ST+WC, 0),
+  PROC(lookup,	 dirop,		dirop,		fhandle2, RC_NOCACHE, ST+FH+pAT+pAT, 0),
+  PROC(access,	 access,	access,		fhandle,  RC_NOCACHE, ST+pAT+1, 0),
+  PROC(readlink, fhandle,	readlink,	fhandle,  RC_NOCACHE, ST+pAT+1+NFS3_MAXPATHLEN/4, 0),
+  PROC(read,	 read,		read,		fhandle,  RC_NOCACHE, ST+pAT+4+NFSSVC_MAXBLKSIZE, 0),
+  PROC(write,	 write,		write,		fhandle,  RC_REPLBUFF, ST+WC+4, RPC_HANDLE_IOVARG),
+  PROC(create,	 create,	create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0),
+  PROC(mkdir,	 mkdir,		create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0),
+  PROC(symlink,	 symlink,	create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0),
+  PROC(mknod,	 mknod,		create,		fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0),
+  PROC(remove,	 dirop,		wccstat,	fhandle,  RC_REPLBUFF, ST+WC, 0),
+  PROC(rmdir,	 dirop,		wccstat,	fhandle,  RC_REPLBUFF, ST+WC, 0),
+  PROC(rename,	 rename,	rename,		fhandle2, RC_REPLBUFF, ST+WC+WC, 0),
+  PROC(link,	 link,		link,		fhandle2, RC_REPLBUFF, ST+pAT+WC, 0),
+  PROC(readdir,	 readdir,	readdir,	fhandle,  RC_NOCACHE, 0, 0),
+  PROC(readdirplus,readdirplus,	readdir,	fhandle,  RC_NOCACHE, 0, 0),
+  PROC(fsstat,	 fhandle,	fsstat,		void,     RC_NOCACHE, ST+pAT+2*6+1, 0),
+  PROC(fsinfo,   fhandle,	fsinfo,		void,     RC_NOCACHE, ST+pAT+12, 0),
+  PROC(pathconf, fhandle,	pathconf,	void,     RC_NOCACHE, ST+pAT+6, 0),
+  PROC(commit,	 commit,	commit,		fhandle,  RC_NOCACHE, ST+WC+2, 0),
 };
 
 struct svc_version	nfsd_version3 = {
--- linux.ORG/fs/nfsd/nfs4proc.c	Fri Oct 18 21:18:42 2030
+++ linux/fs/nfsd/nfs4proc.c	Fri Oct 18 21:26:01 2030
@@ -711,7 +711,7 @@ out:
 #define nfs4svc_release_compound	NULL
 struct nfsd4_voidargs { int dummy; };
 
-#define PROC(name, argt, rest, relt, cache, respsize)	\
+#define PROC(name, argt, rest, relt, cache, respsize, flags)	\
  { (svc_procfunc) nfsd4_proc_##name,		\
    (kxdrproc_t) nfs4svc_decode_##argt##args,	\
    (kxdrproc_t) nfs4svc_encode_##rest##res,	\
@@ -721,6 +721,7 @@ struct nfsd4_voidargs { int dummy; };
    0,						\
    cache,					\
    respsize,					\
+   flags,					\
  }
 
 /*
@@ -734,8 +735,8 @@ struct nfsd4_voidargs { int dummy; };
  * better XID's.
  */
 static struct svc_procedure		nfsd_procedures4[2] = {
-  PROC(null,	 void,		void,		void,	  RC_NOCACHE, 1),
-  PROC(compound, compound,	compound,	compound, RC_NOCACHE, NFSD_BUFSIZE)
+  PROC(null,	 void,		void,		void,	  RC_NOCACHE, 1, 0),
+  PROC(compound, compound,	compound,	compound, RC_NOCACHE, NFSD_BUFSIZE, 0)
 };
 
 struct svc_version	nfsd_version4 = {
--- linux.ORG/fs/lockd/svcproc.c	Fri Oct 18 21:18:42 2030
+++ linux/fs/lockd/svcproc.c	Fri Oct 18 21:26:01 2030
@@ -553,6 +553,7 @@ struct nlm_void			{ int dummy; };
    .pc_argsize	= sizeof(struct nlm_##argt),		\
    .pc_ressize	= sizeof(struct nlm_##rest),		\
    .pc_xdrressize = respsize,				\
+   .pc_flags	 = 0,					\
  }
 
 #define	Ck	(1+8)	/* cookie */
--- linux.ORG/fs/lockd/svc4proc.c	Fri Oct 18 21:18:42 2030
+++ linux/fs/lockd/svc4proc.c	Fri Oct 18 21:26:01 2030
@@ -527,6 +527,7 @@ struct nlm_void			{ int dummy; };
    .pc_argsize	= sizeof(struct nlm_##argt),		\
    .pc_ressize	= sizeof(struct nlm_##rest),		\
    .pc_xdrressize = respsize,				\
+   .pc_flags	 = 0,					\
  }
 #define	Ck	(1+8)	/* cookie */
 #define	No	(1+1024/4)	/* netobj */
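
The tail-trimming step in nfsd_writev() above (shrinking the iovec array until it describes exactly cnt bytes) can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch's code: the function name iov_trim_tail() is hypothetical, and unlike the loop in the patch it also stops when the array becomes empty.

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: trim an iovec array from the tail so that it
 * describes at most `want` bytes. Returns the number of entries still
 * in use afterwards.
 */
static int iov_trim_tail(struct iovec *iov, int n, size_t want)
{
	size_t total = 0;
	int i;

	/* Sum the bytes currently described by the array. */
	for (i = 0; i < n; i++)
		total += iov[i].iov_len;

	while (n > 0 && total > want) {
		size_t excess = total - want;
		size_t last = iov[n - 1].iov_len;
		size_t sub = last < excess ? last : excess;

		iov[n - 1].iov_len -= sub;
		total -= sub;
		if (iov[n - 1].iov_len == 0)
			n--;	/* last entry fully trimmed, drop it */
	}
	return n;
}
```

With entries of 100, 4096 and 4096 bytes and a target of 4200 bytes, the last entry shrinks to 4 bytes and all three entries stay in use; with a target of 100 bytes only the first entry remains.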

[-- Attachment #8: va07-nfsbigbuf-2.5.43.patch --]
[-- Type: Text/Plain, Size: 416 bytes --]

--- linux.ORG/include/linux/nfsd/const.h	Sat Oct 12 13:22:12 2002
+++ linux/include/linux/nfsd/const.h	Sun Oct 13 22:07:37 2030
@@ -20,9 +20,9 @@
 #define NFSSVC_MAXVERS		3
 
 /*
- * Maximum blocksize supported by daemon currently at 32K
+ * Maximum blocksize supported by daemon currently at 60K
  */
-#define NFSSVC_MAXBLKSIZE	(32*1024)
+#define NFSSVC_MAXBLKSIZE	((60*1024)&~(PAGE_SIZE-1))
 
 #ifdef __KERNEL__
 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-18  7:19                       ` Hirokazu Takahashi
@ 2002-10-18 15:12                           ` Andrew Theurer
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Theurer @ 2002-10-18 15:12 UTC (permalink / raw)
  To: trond.myklebust, Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

> >      > Congestion avoidance mechanism of NFS clients might cause this
> >      > situation.  I think the congestion window size is not enough
> >      > for high end machines.  You can make the window be larger as a
> >      > test.
> >
> > The congestion avoidance window is supposed to adapt to the bandwidth
> > that is available. Turn congestion avoidance off if you like, but my
> > experience is that doing so tends to seriously degrade performance as
> > the number of timeouts + resends skyrockets.
>
> Yes, you must be right.
>
> But I guess Andrew may use a great machine so that the transfer rate
> has exceeded the maximum size of the congestion avoidance window.
> Can we determine a preferable maximum window size dynamically?

Is this a concern on the client only?  I can run a test with just one client
and see if I can saturate the 100Mbit adapter.  If I can, would we need to
make any adjustments then?  FYI, at 115 MB/sec total throughput, that's only
2.875 MB/sec for each of the 40 clients.  For the TCP result of 181 MB/sec,
that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
100Mbit client.
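
The per-client figures above are the aggregate throughput divided evenly over the 40 clients. A small sketch of that arithmetic, with hypothetical helper names, and with the link limit taken as the raw line rate (no protocol overhead assumed):

```c
#include <assert.h>

/* Aggregate throughput shared evenly across clients, in MB/sec. */
static double per_client_mb(double total_mb_per_sec, int clients)
{
    assert(clients > 0);
    return total_mb_per_sec / clients;
}

/* Raw line rate of a link in MB/sec (8 bits per byte, overhead ignored). */
static double link_limit_mb(double mbit_per_sec)
{
    return mbit_per_sec / 8.0;
}
```

Under these assumptions a 100Mbit link tops out at 12.5 MB/sec, so both 2.875 and 4.525 MB/sec per client are well below what a single client adapter can carry.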

Andrew Theurer



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-18 15:12                           ` [NFS] " Andrew Theurer
@ 2002-10-19 20:34                             ` Hirokazu Takahashi
  -1 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-19 20:34 UTC (permalink / raw)
  To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hello,

> > Congestion avoidance mechanism of NFS clients might cause this
> > situation.  I think the congestion window size is not enough
> > for high end machines.  You can make the window be larger as a
> > test.

> Is this a concern on the client only?  I can run a test with just one client
> and see if I can saturate the 100Mbit adapter.  If I can, would we need to
> make any adjustments then?  FYI, at 115 MB/sec total throughput, that's only
> 2.875 MB/sec for each of the 40 clients.  For the TCP result of 181 MB/sec,
> that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
> 100Mbit client.

I think it's a client issue. NFS servers don't care about congestion of UDP
traffic and they will try to respond to all NFS requests as fast as they can.

You can try increasing the number of clients or the number of mount points
as a test. It's easy to mount the same server directory on several
directories of the client so that each of them can work simultaneously.
   # mount -t nfs server:/foo   /baa1
   # mount -t nfs server:/foo   /baa2
   # mount -t nfs server:/foo   /baa3

Thank you,
Hirokazu Takahashi.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-22 21:16                               ` Andrew Theurer
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Theurer @ 2002-10-22 21:16 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

On Saturday 19 October 2002 15:34, Hirokazu Takahashi wrote:
> Hello,
>
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation.  I think the congestion window size is not enough
> > > for high end machines.  You can make the window be larger as a
> > > test.
> >
> > Is this a concern on the client only?  I can run a test with just one
> > client and see if I can saturate the 100Mbit adapter.  If I can, would we
> > need to make any adjustments then?  FYI, at 115 MB/sec total throughput,
> > that's only 2.875 MB/sec for each of the 40 clients.  For the TCP result
> > of 181 MB/sec, that's 4.525 MB/sec, IMO, both of which are comfortable
> > throughputs for a 100Mbit client.
>
> I think it's a client issue. NFS servers don't care about congestion of UDP
> traffic and they will try to respond to all NFS requests as fast as they
> can.
>
> You can try increasing the number of clients or the number of mount points
> as a test. It's easy to mount the same server directory on several
> directories of the client so that each of them can work simultaneously.
>    # mount -t nfs server:/foo   /baa1
>    # mount -t nfs server:/foo   /baa2
>    # mount -t nfs server:/foo   /baa3

I don't think it is a client congestion issue at this point.  I can run the
test with just one client on UDP and achieve 11.2 MB/sec with just one mount
point.  The client has 100 Mbit Ethernet, so that should be the upper limit (or
really close).  In the 40 client read test, I have only achieved 2.875 MB/sec
per client.  That and the fact that there are never more than 2 nfsd threads
in a run state at one time (for UDP only) leads me to believe there is still
a scaling problem on the server for UDP.  I will continue to run the test and
poke and prod around.  Hopefully something will jump out at me.  Thanks for all
the input!

Andrew Theurer

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-18 13:11   ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
@ 2002-10-23  1:18     ` Neil Brown
  2002-10-23  3:53       ` Hirokazu Takahashi
                         ` (2 more replies)
  0 siblings, 3 replies; 98+ messages in thread
From: Neil Brown @ 2002-10-23  1:18 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs

On Friday October 18, taka@valinux.co.jp wrote:
> Hello,
> 
> I've ported the zerocopy patches against linux-2.5.43 with
> davem's udp-sendfile patches and your patches which you posted
> on Wed,16 Oct.

Thanks for these...

I have been thinking some more about this, trying to understand the
big picture, and I'm afraid that I think I want some more changes.

In particular, I think it would be good to use 'struct xdr_buf' from
sunrpc/xdr.h instead of svc_buf.  This is what the nfs client uses and
we could share some of the infrastructure.

I think this would work quite well for sending read responses as there
is a 'head' iovec for the interesting bits of the packet, an array of
pages for the data, and a 'tail' iovec for the padding.
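
The head/pages/tail layout described above can be modelled in userspace as follows. This is only a simplified sketch: struct model_xdr_buf and both helpers are hypothetical names, not the kernel's definition in sunrpc/xdr.h, and the 4-byte padding rule is the general XDR rule for opaque data.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Simplified userspace model; NOT the kernel's struct xdr_buf. */
struct model_xdr_buf {
    struct iovec head;      /* RPC and NFS reply header ("interesting bits") */
    void       **pages;     /* array of page-sized data buffers */
    size_t       page_len;  /* bytes of file data held in pages */
    struct iovec tail;      /* XDR padding that follows the data */
};

/* XDR pads opaque data up to the next multiple of 4 bytes. */
static size_t xdr_pad_len(size_t data_len)
{
    return (4 - (data_len & 3)) & 3;
}

/* Total number of bytes that would go on the wire for this buffer. */
static size_t model_total_len(const struct model_xdr_buf *b)
{
    assert(b != NULL);
    return b->head.iov_len + b->page_len + b->tail.iov_len;
}
```

For a read reply, only head and tail would need to be written by the server; the pages array can point straight at the data to be sent.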

I'm not certain about receiving write requests.
I imagine that it might work to:
  1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
    skb into the head iovec, and hold onto the skbuf (like we
    currently do).
  2/ enter the nfs server to parse that header.
  3/ When the server finds it needs more data for a write, it
     collects the pages and calls xdr_partial_copy_from_skb
     to copy the rest of the skb directly into the page cache.
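
The three steps above can be sketched in userspace, with a flat byte array standing in for the skb. Everything here is an assumption for illustration: copy_head and copy_payload are hypothetical names, plain memcpy replaces xdr_partial_copy_from_skb, and the offset of the write data is taken as already known from parsing the header in step 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HEAD_MAX 1024u   /* "the first 1K" copied into the head */
#define PAGE_SZ  4096u   /* assumed page size */

/* Step 1: copy at most HEAD_MAX bytes of the packet into head. */
static size_t copy_head(const unsigned char *pkt, size_t pkt_len,
                        unsigned char *head)
{
    size_t n = pkt_len < HEAD_MAX ? pkt_len : HEAD_MAX;

    memcpy(head, pkt, n);
    return n;
}

/*
 * Step 3: copy the payload that starts at data_off into page-sized
 * buffers. Returns the number of pages used, or -1 if the caller did
 * not supply enough pages.
 */
static long copy_payload(const unsigned char *pkt, size_t pkt_len,
                         size_t data_off, unsigned char **pages,
                         size_t npages)
{
    size_t left, used = 0;

    assert(data_off <= pkt_len);
    left = pkt_len - data_off;
    if (left > npages * PAGE_SZ)
        return -1;
    while (left > 0) {
        size_t n = left < PAGE_SZ ? left : PAGE_SZ;

        memcpy(pages[used++], pkt + data_off, n);
        data_off += n;
        left -= n;
    }
    return (long)used;
}
```

The point of the split is that the bulk of the packet is copied only once, directly into its final destination, after the header has told the server where that destination is.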

Does that make any sense?

Also, I am wondering about the way that you put zero-copy support into
nfsd_readdir.

Presumably the gain is that sock_sendmsg does a copy into a
skbuf and then a DMA out of that, while ->sendpage does just the DMA.
In that case, maybe it would be better to get "struct page *" pointers
for the pages in the default buffer, and pass them to 
->sendpage.

I would like to get to a situation where we don't need to do a 64K
kmalloc for each server, but can work entirely with individual pages.

I might try converting svcsock etc to use xdr_buf later today or
tomorrow unless I hear a good reason why it won't work, or someone
else beats me to it...

NeilBrown

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  1:18     ` Neil Brown
@ 2002-10-23  3:53       ` Hirokazu Takahashi
  2002-10-23  5:40         ` Hirokazu Takahashi
  2002-10-23  6:10         ` Neil Brown
  2002-10-23 21:50       ` Hirokazu Takahashi
  2002-10-25  9:52       ` Hirokazu Takahashi
  2 siblings, 2 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23  3:53 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> > I've ported the zerocopy patches against linux-2.5.43 with
> > davem's udp-sendfile patches and your patches which you posted
> > on Wed,16 Oct.
> 
> Thanks for these...
> 
> I have been thinking some more about this, trying to understand the
> big picture, and I'm afraid that I think I want some more changes.
> 
> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf.  This is what the nfs client uses and
> we could share some of the infrastructure.

It sounds good that they share the same infrastructure.
I agree with your approach. 

> I think this would work quite well for sending read responses as there
> is a 'head' iovec for the interesting bits of the packet, an array of
> pages for the data, and a 'tail' iovec for the padding.

One point I'm wondering about is that xdr_buf can't handle NFSv4 compound
operations correctly yet. I don't know what will happen if we send some
page data and some non-page data together, as it will try to pack several
operations into one xdr_buf.

If we care about NFSv4 it could be like this:

    struct svc_buf {
        u32 *                   area;   /* allocated memory */
        u32 *                   base;   /* base of RPC datagram */
        int                     buflen; /* total length of buffer */
        u32 *                   buf;    /* read/write pointer */
        int                     len;    /* current end of buffer */

        struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
        int                     nriov;
    };

I guess it would be better to fix NFSv4 problems after Halloween.

> I'm not certain about receiving write requests.
> I imagine that it might work to:
>   1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
>     skb into the head iovec, and hold onto the skbuf (like we
>     currently do).
>   2/ enter the nfs server to parse that header.
>   3/ When the server finds it needs more data for a write, it
>      collects the pages and calls xdr_partial_copy_from_skb
>      to copy the rest of the skb directly into the page cache.

I think it would be hard work, because it amounts to writing another
generic_file_write function. I feel it may be overkill.
  e.g. We must read a page if it isn't in the cache.
       We must allocate disk blocks if the file doesn't have them yet X-(
       Some filesystems like XFS have their own way of updating the pagecache.

We should keep kNFSd away from the implementation of VM/FS
as much as we can.

> Does that make any sense?
> 
> Also, I am wondering about the way that you put zero-copy support into
> nfsd_readdir.
> 
> Presumably the gain is that sock_sendmsg does a copy into a
> skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> In that case, maybe it would be better to get "struct page *" pointers
> for the pages in the default buffer, and pass them to 
> ->sendpage.

It seems a good idea.

The problem is that it's hard to know when the page will be released.
The page will be held by the TCP/IP stack. TCP may hold it for a while
for retransmission. UDP packets may also be held in the driver queue
after ->sendpage has completed.

We should check the reference count of the default buffer and
decide whether to use the buffer or allocate a new one.
I think almost all requests can use the default buffer.

> I would like to get to a situation where we don't need to do a 64K
> kmalloc for each server, but can work entirely with individual pages.
> 
> I might try converting svcsock etc to use xdr_buf later today or
> tomorrow unless I hear a good reason why it won't work, or someone
> else beats me to it...

If you don't mind, I'll work on the readdir stuff
while you're fighting with the xdr_buf stuff.

Thank you,
Hirokazu Takahashi


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  3:53       ` Hirokazu Takahashi
@ 2002-10-23  5:40         ` Hirokazu Takahashi
  2002-10-23  6:03           ` Neil Brown
  2002-10-23  6:10         ` Neil Brown
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23  5:40 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> > Also, I am wondering about the way that you put zero-copy support into
> > nfsd_readdir.
> > 
> > Presumably the gain is that sock_sendmsg does a copy into a
> > skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> > In that case, maybe it would be better to get "struct page *" pointers
> > for the pages in the default buffer, and pass them to 
> > ->sendpage.
> 
> It seems a good idea.
> 
> The problem is that it's hard to know when the page will be released.
> The page will be held by the TCP/IP stack. TCP may hold it for a while
> for retransmission. UDP packets may also be held in the driver queue
> after ->sendpage has completed.
> 
> We should check the reference count of the default buffer and
> decide whether to use the buffer or allocate a new one.
> I think almost all requests can use the default buffer.

I mean we can't use a page in the default buffer.
We should use the page next to the default buffer or we should
prepare another page for nfsd_readdir.

I don't know whether allocating an extra page for each server
is good or not.
What do you think about it?

> > I would like to get to a situation where we don't need to do a 64K
> > kmalloc for each server, but can work entirely with individual pages.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  5:40         ` Hirokazu Takahashi
@ 2002-10-23  6:03           ` Neil Brown
  2002-10-23 22:35             ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Neil Brown @ 2002-10-23  6:03 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs

On Wednesday October 23, taka@valinux.co.jp wrote:
> Hello,
> 
> > > Also, I am wondering about the way that you put zero-copy support into
> > > nfsd_readdir.
> > > 
> > > Presumably the gain is that sock_sendmsg does a copy into a
> > > skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> > > In that case, maybe it would be better to get "struct page *" pointers
> > > for the pages in the default buffer, and pass them to 
> > > ->sendpage.
> > 
> > It seems a good idea.
> > 
> > The problem is that it's hard to know when the page will be released.
> > The page will be held by the TCP/IP stack. TCP may hold it for a while
> > for retransmission. UDP packets may also be held in the driver queue
> > after ->sendpage has completed.
> > 
> > We should check the reference count of the default buffer and
> > decide whether to use the buffer or allocate a new one.
> > I think almost all requests can use the default buffer.
> 
> I mean we can't use a page in the default buffer.
> We should use the page next to the default buffer or we should
> prepare another page for nfsd_readdir.
> 
> I don't know whether allocating an extra page for each server
> is good or not.
> What do you think about it?

I think I would change the approach to buffering.
Instead of having a fixed set of pages, we just allocate new pages as
needed, having handed old ones over to the networking layer.

So we have a pool of pages that we draw from when generating replies,
and refill before accepting a new request.
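
A minimal userspace sketch of such a pool, assuming a small fixed capacity and with malloc standing in for the page allocator. The names page_pool, pool_refill and pool_take are hypothetical; a page that has been taken is considered owned by whoever took it (here, standing in for the networking layer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_PAGES 16u     /* assumed pool capacity */
#define PAGE_SZ    4096u   /* assumed page size */

struct page_pool {
    void  *page[POOL_PAGES];
    size_t count;           /* number of valid entries in page[] */
};

/* Top the pool up to capacity; called before accepting a new request. */
static int pool_refill(struct page_pool *p)
{
    assert(p != NULL);
    while (p->count < POOL_PAGES) {
        void *pg = malloc(PAGE_SZ);

        if (pg == NULL)
            return -1;      /* allocation failed, pool stays partly filled */
        p->page[p->count++] = pg;
    }
    return 0;
}

/* Draw one page for a reply; ownership passes to the caller. */
static void *pool_take(struct page_pool *p)
{
    assert(p != NULL);
    return p->count ? p->page[--p->count] : NULL;
}
```

Because taken pages are never returned to the pool, there is no need to know when the networking layer has finished with them; the pool is simply topped up with fresh pages.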

Of course that is a fairly big change from where we are now so it might
take a while.  We should probably get zero copy reads in first...

NeilBrown


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  3:53       ` Hirokazu Takahashi
  2002-10-23  5:40         ` Hirokazu Takahashi
@ 2002-10-23  6:10         ` Neil Brown
  2002-10-23  7:08           ` Hirokazu Takahashi
  2002-10-23 15:23           ` Trond Myklebust
  1 sibling, 2 replies; 98+ messages in thread
From: Neil Brown @ 2002-10-23  6:10 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs, William A.(Andy) Adamson, trond.myklebust

On Wednesday October 23, taka@valinux.co.jp wrote:
> 
> One point I'm wondering about is that xdr_buf can't handle NFSv4 compound
> operations correctly yet. I don't know what will happen if we send some
> page data and some non-page data together, as it will try to pack several
> operations into one xdr_buf.
> 
> If we care about NFSv4 it could be like this:
> 
>     struct svc_buf {
>         u32 *                   area;   /* allocated memory */
>         u32 *                   base;   /* base of RPC datagram */
>         int                     buflen; /* total length of buffer */
>         u32 *                   buf;    /* read/write pointer */
>         int                     len;    /* current end of buffer */
> 
>         struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
>         int                     nriov;
>     };
> 
> I guess it would be better to fix NFSv4 problems after Halloween.
> 

Hmm. I wonder what plans there are for this w.r.t. the NFSv4 client.
Andy? Trond?

I suspect that COMPOUNDS with multiple READ or WRITE requests would be
fairly rare, and it would probably be reasonable to respond with
ERESOURCE (or however it is spelt).

i.e. Reject any operation that would need to use a second set of pages
in a response.


> > I'm not certain about receiving write requests.
> > I imagine that it might work to:
> >   1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> >     skb into the head iovec, and hold onto the skbuf (like we
> >     currently do).
> >   2/ enter the nfs server to parse that header.
> >   3/ When the server finds it needs more data for a write, it
> >      collects the pages and calls xdr_partial_copy_from_skb
> >      to copy the rest of the skb directly into the page cache.
> 
> I think it would be hard work, because it amounts to writing another
> generic_file_write function. I feel it may be overkill.
>   e.g. We must read a page if it isn't in the cache.
>        We must allocate disk blocks if the file doesn't have them yet X-(
>        Some filesystems like XFS have their own way of updating the pagecache.
> 
> We should keep kNFSd away from the implementation of VM/FS
> as much as we can.

Could we not use 'mmap'?   Maybe not, and probably best to avoid it as
you say.

I was thinking it would be nice to be able to do the udp-checksum at
the same time as the copy-into-page-cache, but maybe we just say that
you need a NIC that does checksums if you want to do single-copy NFS
writes. 

NeilBrown


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  6:10         ` Neil Brown
@ 2002-10-23  7:08           ` Hirokazu Takahashi
  2002-10-23 15:23           ` Trond Myklebust
  1 sibling, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23  7:08 UTC (permalink / raw)
  To: neilb; +Cc: nfs, andros, trond.myklebust

Hello,

> > If we care about NFSv4 it could be like this:
> > 
> >     struct svc_buf {
> >         u32 *                   area;   /* allocated memory */
> >         u32 *                   base;   /* base of RPC datagram */
> >         int                     buflen; /* total length of buffer */
> >         u32 *                   buf;    /* read/write pointer */
> >         int                     len;    /* current end of buffer */
> > 
> >         struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
> >         int                     nriov;
> >     };
> > 
> > I guess it would be better to fix NFSv4 problems after Halloween.
> > 
> 
> Hmm. I wonder what plans there are for this w.r.t. the NFSv4 client.
> Andy? Trond?
> 
> I suspect that COMPOUNDS with multiple READ or WRITE requests would be
> fairly rare, and it would probably be reasonable to respond with
> ERESOURCE (or however it is spelt).

Yeah, it might be.

> i.e. Reject any operation that would need to use a second set of pages
> in a response.

> > > I'm not certain about receiving write requests.
> > > I imagine that it might work to:
> > >   1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> > >     skb into the head iovec, and hold onto the skbuf (like we
> > >     currently do).
> > >   2/ enter the nfs server to parse that header.
> > >   3/ When the server finds it needs more data for a write, it
> > >      collects the pages and calls xdr_partial_copy_from_skb
> > >      to copy the rest of the skb directly into the page cache.
> > 
> > I think it would be hard work, because it amounts to writing another
> > generic_file_write function. I feel it may be overkill.
> >   e.g. We must read a page if it isn't in the cache.
> >        We must allocate disk blocks if the file doesn't have them yet X-(
> >        Some filesystems like XFS have their own way of updating the pagecache.
> > 
> > We should keep kNFSd away from the implementation of VM/FS
> > as much as we can.
> 
> Could we not use 'mmap'?   Maybe not, and probably best to avoid it as
> you say.

Using mmap sounds interesting to me and I was thinking about it.

Regular mmap will cause many block reads from disk on each page fault,
as the fault handler can't know what size of write will happen after the fault.
Those reads are pointless if the write size is 4KB, which often happens on NFS.

Standard write/writev can handle it without reading blocks.

> I was thinking it would be nice to be able to do the udp-checksum at
> the same time as the copy-into-page-cache, but maybe we just say that
> you need a NIC that does checksums if you want to do single-copy NFS
> writes. 

Or we can enhance the standard generic_file_write() to accept a
copy routine, like this:

generic_file_write(file, buf, count, ppos, nfsd_write_actor);
generic_file_writev(file, iovec, nr_segs, ppos, nfsd_write_actor);

nfsd_write_actor(struct page *page, int offset, ......)
{
	xdr_partial_copy_from_skb(.....)
}

But I realized there is one big problem with both approaches.
What can we do when the checksum turns out to be wrong?
The pages will already be filled with broken data.
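
The problem can be seen in a userspace sketch of a combined copy-and-checksum routine. This is an illustration only: copy_and_csum is a hypothetical name, the checksum is the generic 16-bit one's complement Internet checksum (RFC 1071) rather than the kernel's optimized routines, and the length is assumed to be datagram sized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst while accumulating the 16-bit
 * one's complement Internet checksum of the data (RFC 1071).
 * The result is only known after the last byte has been copied,
 * so dst is already overwritten when a mismatch is detected.
 */
static uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint32_t sum = 0;
    size_t i;

    assert(len <= 65535);   /* keeps the 32-bit accumulator from overflowing */
    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];
    }
    if (i < len) {          /* odd trailing byte is padded with zero */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    while (sum >> 16)       /* fold carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

If dst is a page-cache page, a caller that compares the returned value with the expected checksum learns about corruption too late to avoid the overwrite, which is why a NIC that verifies checksums before the copy is attractive here.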

Thank you,
Hirokazu Takahashi.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-22 21:16                               ` [NFS] " Andrew Theurer
  (?)
@ 2002-10-23  9:29                               ` Hirokazu Takahashi
  2002-10-24 15:32                                 ` Andrew Theurer
  -1 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23  9:29 UTC (permalink / raw)
  To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hi,

> > > > Congestion avoidance mechanism of NFS clients might cause this
> > > > situation.  I think the congestion window size is not enough
> > > > for high end machines.  You can make the window be larger as a
> > > > test.

> I don't think it is a client congestion issue at this point.  I can run the
> test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> point.  The client has 100 Mbit Ethernet, so that should be the upper limit (or
> really close).  In the 40 client read test, I have only achieved 2.875 MB/sec
> per client.  That and the fact that there are never more than 2 nfsd threads
> in a run state at one time (for UDP only) leads me to believe there is still
> a scaling problem on the server for UDP.  I will continue to run the test and
> poke and prod around.  Hopefully something will jump out at me.  Thanks for all
> the input!

Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
been retransmitted?

# cat /proc/net/rpc/nfsd
rc 0 27680 162118
  ^^^
This field means the clients have retransmitted packets.
The transmission rate will slow down once that has happened.
It may occur if the response from the server is slower than the
clients expect.
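
The rc line can be picked apart with a few lines of C. This is a sketch under an assumption: the three numbers are taken to be reply-cache hits, misses and nocache counts, in that order, with the first one being the retransmission indicator marked above. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed meaning of the three "rc" fields, in order. */
struct rc_stats {
    unsigned long hits;     /* requests answered from the reply cache */
    unsigned long misses;
    unsigned long nocache;
};

/* Parse one "rc ..." line; returns 0 on success, -1 if it does not match. */
static int parse_rc_line(const char *line, struct rc_stats *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "rc %lu %lu %lu",
               &out->hits, &out->misses, &out->nocache) != 3)
        return -1;
    return 0;
}
```

For the sample line above this gives a first field of 0, i.e. no retransmissions seen by the server under that reading.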

And you can use an older version - e.g. the linux-2.4 series - for the clients
and see what happens, as older versions don't have any intelligent
features.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  6:10         ` Neil Brown
  2002-10-23  7:08           ` Hirokazu Takahashi
@ 2002-10-23 15:23           ` Trond Myklebust
  1 sibling, 0 replies; 98+ messages in thread
From: Trond Myklebust @ 2002-10-23 15:23 UTC (permalink / raw)
  To: Neil Brown
  Cc: Hirokazu Takahashi, nfs, William A.(Andy) Adamson,
	trond.myklebust

>>>>> " " == Neil Brown <neilb@cse.unsw.edu.au> writes:

     > Hmm. I wonder what plans there are for this w.r.t. the NFSv4
     > client.  Andy? Trond?

There's really no need for anything beyond what we have. There's no
call in the client for stringing more than one set of pages together:
you are always dealing with a single set of contiguous pages to
read/write.

     > I suspect that COMPOUNDS with multiple READ or WRITE requests
     > would be fairly rare, and it would probably be reasonable to
     > respond with ERESOURCE (or however it is spelt).

Alternatively, you could add a list_head to the xdr_buf struct so that
you can string several of them together. Frankly, though, it would be
a rather strange NFSv4 client that wants to do this sort of
operation. There's just no advantage to it...

     > I was thinking it would be nice to be able to do the
     > udp-checksum at the same time as the copy-into-page-cache, but
     > maybe we just say that you need a NIC that does checksums if
     > you want to do single-copy NFS writes.

Right. The very last thing you want to do is to copy into the page
cache, then find out that the checksum didn't match up.
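
To make the hazard concrete, here is a small user-space sketch of a
combined copy-and-checksum loop. It is an illustration only: it uses the
plain RFC 1071 Internet checksum in portable C, not the kernel's
optimised routines, and the function name is made up. The point to
notice is that the checksum is known only after the last byte has
already been written to the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst and return the 16-bit Internet checksum
 * (one's-complement sum of big-endian 16-bit words, then complemented).
 * By the time the result is available, dst has already been overwritten.
 */
uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint64_t sum = 0;
	size_t i;

	assert(len == 0 || (dst != NULL && src != NULL));

	for (i = 0; i + 1 < len; i += 2) {
		d[i] = s[i];
		d[i + 1] = s[i + 1];
		sum += ((uint32_t)s[i] << 8) | s[i + 1];
	}
	if (i < len) {
		/* odd trailing byte is treated as padded with a zero byte */
		d[i] = s[i];
		sum += (uint32_t)s[i] << 8;
	}
	/* fold the carries back into the low 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

If dst were a page-cache page, a mismatch detected here would mean the
page already holds bad data, which is why checksum offload in the NIC is
the more attractive precondition.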

Cheers,
  Trond


-------------------------------------------------------
This sf.net email is sponsored by: Influence the future 
of Java(TM) technology. Join the Java Community 
Process(SM) (JCP(SM)) program now. 
http://ads.sourceforge.net/cgi-bin/redirect.pl?sunm0002en

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  1:18     ` Neil Brown
  2002-10-23  3:53       ` Hirokazu Takahashi
@ 2002-10-23 21:50       ` Hirokazu Takahashi
  2002-10-23 23:55         ` Trond Myklebust
  2002-10-25  9:52       ` Hirokazu Takahashi
  2 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 21:50 UTC (permalink / raw)
  To: Trond Myklebust, neilb; +Cc: nfs

Hello,

> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf.  This is what the nfs client uses and
> we could share some of the infrastructure.

I was thinking about the nfs clients.
Why don't we make xprt_sendmsg() use the sendpage interface instead
of calling sock_sendmsg(), so that we can avoid the deadlock which
multiple kmap()s in xprt_sendmsg() might cause on heavily loaded machines?


Thank you,
Hirokazu Takahashi.




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  6:03           ` Neil Brown
@ 2002-10-23 22:35             ` Hirokazu Takahashi
  0 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 22:35 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> > > > Also, I am wondering about the way that you put zero-copy support into
> > > > nfsd_readdir.

> I think I would change the approach to buffering.
> Instead of having a fixed set of pages, we just allocate new pages as
> needed, having handed old ones over to the networking layer.
> 
> So we have a pool of pages that we draw from when generating replies,
> and refill before accepting a new request.

We could also put the RPC/NFS headers on pages and send them without a copy.
This seems good for NFSv4 COMPOUNDs.

> Of course that is a fairly big change from where we are now, so it might
> take a while.  We should probably get zero copy reads in first...

Yes.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23 21:50       ` Hirokazu Takahashi
@ 2002-10-23 23:55         ` Trond Myklebust
  2002-10-24  1:33           ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Trond Myklebust @ 2002-10-23 23:55 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > I was thinking about the nfs clients.  Why don't we make
     > xprt_sendmsg() use the sendpage interface instead of calling
     > sock_sendmsg(), so that we can avoid the deadlock which multiple
     > kmap()s in xprt_sendmsg() might cause on heavily loaded
     > machines?

I'm definitely in favour of such a change. Particularly so if the UDP
interface is ready.

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23 23:55         ` Trond Myklebust
@ 2002-10-24  1:33           ` Hirokazu Takahashi
  2002-10-27 10:39             ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-24  1:33 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

>      > I was thinking about the nfs clients.  Why don't we make
>      > xprt_sendmsg() use the sendpage interface instead of calling
>      > sock_sendmsg(), so that we can avoid the deadlock which multiple
>      > kmap()s in xprt_sendmsg() might cause on heavily loaded
>      > machines?
> 
> I'm definitely in favour of such a change. Particularly so if the UDP
> interface is ready.

I've implemented it, and you can find it in linux-2.5.44.
The interface is the same as TCP's.

Thank you,
Hirokazu Takahashi.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-23  9:29                               ` Hirokazu Takahashi
@ 2002-10-24 15:32                                 ` Andrew Theurer
  2002-10-27 11:10                                   ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Andrew Theurer @ 2002-10-24 15:32 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

> > I don't think it is a client congestion issue at this point.  I can run the
> > test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> > point.  The client has 100 Mbit Ethernet, so that should be the upper limit (or
> > really close).  In the 40 client read test, I have only achieved 2.875 MB/sec
> > per client.  That and the fact that there are never more than 2 nfsd threads
> > in a run state at one time (for UDP only) leads me to believe there is still
> > a scaling problem on the server for UDP.  I will continue to run the test and
> > poke and prod around.  Hopefully something will jump out at me.  Thanks for all
> > the input!
>
> Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
> been retransmitted?
>
> # cat /proc/net/rpc/nfsd
> rc 0 27680 162118
>   ^^^
> This field counts the packets that the clients have retransmitted.
> The transmission rate slows down once a retransmission has happened.
> Retransmissions may occur if the response from the server is slower
> than the clients expect.

/proc/net/rpc/nfsd
rc 0 1 1025221

> You could also use an older version (e.g. the linux-2.4 series) for the
> clients and see what happens, as older versions don't have any intelligent
> features.

Actually all of the clients are 2.4 (RH 7.0).  I could change them out to
2.5, but it may take me a little while.

Let me do a little digging around.  I seem to recall an issue I had earlier
this year when waking up the nfsd threads and having most of them just go
back to sleep.  I need to go back to that code and understand it a little
better.   Thanks for all of your help.

Andrew Theurer

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-23  1:18     ` Neil Brown
  2002-10-23  3:53       ` Hirokazu Takahashi
  2002-10-23 21:50       ` Hirokazu Takahashi
@ 2002-10-25  9:52       ` Hirokazu Takahashi
  2002-10-25 12:41         ` Neil Brown
  2002-10-25 17:23         ` Trond Myklebust
  2 siblings, 2 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-25  9:52 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> I have been thinking some more about this, trying to understand the
> big picture, and I'm afraid that I think I want some more changes.
> 
> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf.  This is what the nfs client uses and
> we could share some of the infrastructure.

I just realized it would be hard to use xdr_buf, as it cannot
handle data in a socket buffer. Each socket buffer consists of
some non-page data and some pages, and each of them might have its
own offset and length.
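
The mismatch can be sketched as follows. The sketch assumes that the
page part of an xdr_buf is described by a single page_base/page_len
pair, so only the first page may start at a non-zero offset and only the
last page may be partly filled. "struct frag", SKETCH_PAGE_SIZE and the
function name are hypothetical stand-ins for this example, not kernel
types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the example */

/* Hypothetical stand-in for one paged fragment of a socket buffer. */
struct frag {
	unsigned int offset;	/* start of the data within its page */
	unsigned int size;	/* number of data bytes in that page */
};

/*
 * Return true if the fragments form one contiguous run of page data,
 * i.e. something a single (page_base, page_len) pair could describe.
 */
bool frags_fit_single_page_run(const struct frag *frags, size_t n)
{
	size_t i;

	assert(n == 0 || frags != NULL);

	for (i = 0; i < n; i++) {
		bool first = (i == 0);
		bool last = (i == n - 1);

		/* data must not run past the end of its page */
		if (frags[i].offset + frags[i].size > SKETCH_PAGE_SIZE)
			return false;
		/* only the first page may start in the middle */
		if (!first && frags[i].offset != 0)
			return false;
		/* only the last page may stop before the page end */
		if (!last && frags[i].offset + frags[i].size != SKETCH_PAGE_SIZE)
			return false;
	}
	return true;
}
```

A received socket buffer gives no such guarantee, since every fragment
carries its own offset and size.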

> I'm not certain about receiving write requests.
> I imagine that it might work to:
>   1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
>     skb into the head iovec, and hold onto the skbuf (like we
>     currently do).

And I came up with another idea: kNFSd could handle TCP data
in a socket buffer directly, without a copy, if we can enhance
tcp_read_sock() not to release the buffer while kNFSd is using it.
kNFSd would handle TCP data as if it were a UDP datagram.
The differences are that kNFSd may grab several TCP socket buffers at
once and that the buffers may be shared with other kNFSds.

Thank you,
Hirokazu Takahashi.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-25  9:52       ` Hirokazu Takahashi
@ 2002-10-25 12:41         ` Neil Brown
  2002-10-26  3:11           ` Hirokazu Takahashi
  2002-10-30 23:29           ` Hirokazu Takahashi
  2002-10-25 17:23         ` Trond Myklebust
  1 sibling, 2 replies; 98+ messages in thread
From: Neil Brown @ 2002-10-25 12:41 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs

On Friday October 25, taka@valinux.co.jp wrote:
> Hello,
> 
> > I have been thinking some more about this, trying to understand the
> > big picture, and I'm afraid that I think I want some more changes.
> > 
> > In particular, I think it would be good to use 'struct xdr_buf' from
> > sunrpc/xdr.h instead of svc_buf.  This is what the nfs client uses and
> > we could share some of the infrastructure.
> 
> I just realized it would be hard to use xdr_buf, as it cannot
> handle data in a socket buffer. Each socket buffer consists of
> some non-page data and some pages, and each of them might have its
> own offset and length.

You would only want this for single-copy write requests - right?

I think we have to treat them as a special case and pass the skbuf all
the way up to nfsd in that case.
You would only want to try this if:
   The NIC had verified the checksum
   The packet was some minimum size (1K? 1 PAGE ??)
   We were using AUTH_UNIX, nothing more interesting like crypto
     security
   The first fragment was some minimum size (size of a write without
   the data).
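
As a rough sketch, the four conditions could be gathered into one
predicate like this. Everything here is hypothetical: the struct, the
enum and the two thresholds are made up for illustration (the minimum
sizes are still an open question above), and none of it is existing
sunrpc code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical authentication flavours for the example. */
enum auth_flavor { AUTH_FLAVOR_NULL, AUTH_FLAVOR_UNIX, AUTH_FLAVOR_OTHER };

/* Hypothetical summary of a received write request. */
struct write_request_info {
	bool nic_verified_csum;	/* checksum already checked by the NIC */
	size_t packet_len;	/* total size of the request */
	enum auth_flavor auth;	/* RPC authentication flavour in use */
	size_t first_frag_len;	/* bytes in the first (header) fragment */
};

#define FASTPATH_MIN_PACKET 1024	/* assumed; "1K? 1 PAGE ??" is undecided */
#define FASTPATH_MIN_HEADER 128		/* assumed size of a write without data */

/* Return true only if every fast-path condition holds. */
bool can_use_skb_fast_path(const struct write_request_info *req)
{
	assert(req != NULL);
	return req->nic_verified_csum &&
	       req->packet_len >= FASTPATH_MIN_PACKET &&
	       req->auth == AUTH_FLAVOR_UNIX &&
	       req->first_frag_len >= FASTPATH_MIN_HEADER;
}
```

Any request that fails the test would simply take the normal copying
path.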

I would make a special 'fast-path' for that case which didn't copy any
data but passed a skbuf up, and code in nfs*xdr.c would convert that
into an iovec[].

I am working on a patch which changes rpcsvc to use xdr_buf.  Some of
it works.  Some doesn't.  I include it below for your reference.  I
repeat: it doesn't work yet.
Once it is done, adding the rest of zero-copy should be fairly easy.

> 
> > I'm not certain about receiving write requests.
> > I imagine that it might work to:
> >   1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> >     skb into the head iovec, and hold onto the skbuf (like we
> >     currently do).
> 
> And I came up with another idea: kNFSd could handle TCP data
> in a socket buffer directly, without a copy, if we can enhance
> tcp_read_sock() not to release the buffer while kNFSd is using it.
> kNFSd would handle TCP data as if it were a UDP datagram.
> The differences are that kNFSd may grab several TCP socket buffers at
> once and that the buffers may be shared with other kNFSds.

That might work... though TCP doesn't have the same concept of a
'packet' that UDP does.  You might end up with a socket buffer that had
all of one request and part of the next... still I'm sure it is
possible.
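
To illustrate the boundary problem: RPC over TCP frames each request
with a 4-byte record marker (top bit = last fragment, low 31 bits =
fragment length), so whoever consumes the buffers has to find the
boundaries itself. The following user-space sketch uses a made-up
function name and works on a plain byte array rather than a real
socket buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RM_LAST_FRAG	0x80000000u	/* top bit: last fragment of record */
#define RM_LEN_MASK	0x7fffffffu	/* low 31 bits: fragment length */

/*
 * Look at buf[0..len). If it starts with a complete record fragment,
 * return its total size (4-byte marker plus payload) and set *last.
 * Return 0 if the marker or the payload is still incomplete.
 */
size_t rpc_fragment_size(const uint8_t *buf, size_t len, int *last)
{
	uint32_t marker;
	size_t need;

	assert(buf != NULL && last != NULL);

	if (len < 4)
		return 0;
	/* the marker is sent in network (big-endian) byte order */
	marker = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		 ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	need = 4 + (size_t)(marker & RM_LEN_MASK);
	if (len < need)
		return 0;
	*last = (marker & RM_LAST_FRAG) != 0;
	return need;
}
```

A buffer holding all of one request and part of the next yields a
non-zero size for the first call and 0 for the remainder, which then
has to wait for more data.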

NeilBrown


-----incomplete, buggy, don't-use-it patch starts here----
--- ./fs/nfsd/nfssvc.c	2002/10/21 03:23:44	1.2
+++ ./fs/nfsd/nfssvc.c	2002/10/25 05:08:01
@@ -277,7 +277,8 @@ nfsd_dispatch(struct svc_rqst *rqstp, u3
 
 	/* Decode arguments */
 	xdr = proc->pc_decode;
-	if (xdr && !xdr(rqstp, rqstp->rq_argbuf.buf, rqstp->rq_argp)) {
+	if (xdr && !xdr(rqstp, (u32*)rqstp->rq_arg.head[0].iov_base,
+			rqstp->rq_argp)) {
 		dprintk("nfsd: failed to decode arguments!\n");
 		nfsd_cache_update(rqstp, RC_NOCACHE, NULL);
 		*statp = rpc_garbage_args;
@@ -293,14 +294,15 @@ nfsd_dispatch(struct svc_rqst *rqstp, u3
 	}
 		
 	if (rqstp->rq_proc != 0)
-		svc_putu32(&rqstp->rq_resbuf, nfserr);
+		svc_putu32(&rqstp->rq_res.head[0], nfserr);
 
 	/* Encode result.
 	 * For NFSv2, additional info is never returned in case of an error.
 	 */
 	if (!(nfserr && rqstp->rq_vers == 2)) {
 		xdr = proc->pc_encode;
-		if (xdr && !xdr(rqstp, rqstp->rq_resbuf.buf, rqstp->rq_resp)) {
+		if (xdr && !xdr(rqstp, (u32*)rqstp->rq_res.head[0].iov_base,
+				rqstp->rq_resp)) {
 			/* Failed to encode result. Release cache entry */
 			dprintk("nfsd: failed to encode result!\n");
 			nfsd_cache_update(rqstp, RC_NOCACHE, NULL);
--- ./fs/nfsd/vfs.c	2002/10/24 01:35:37	1.1
+++ ./fs/nfsd/vfs.c	2002/10/24 04:13:31
@@ -571,13 +571,35 @@ found:
 }
 
 /*
+ * reduce iovec:
+ * Reduce the effective size of the passed iovec to
+ * match the count
+ */
+static void reduce_iovec(struct iovec *vec, int *vlenp, int count)
+{
+	int vlen = *vlenp;
+	int i;
+
+	i = 0;
+	while (i < vlen && count > vec[i].iov_len) {
+		count -= vec[i].iov_len;
+		i++;
+	}
+	if (i >= vlen)
+		return; /* ERROR??? */
+	vec[i].iov_len = count;
+	if (count) i++;
+	*vlenp = i;
+}
+
+/*
  * Read data from a file. count must contain the requested read count
  * on entry. On return, *count contains the number of bytes actually read.
  * N.B. After this call fhp needs an fh_put
  */
 int
 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
-          char *buf, unsigned long *count)
+          struct iovec *vec, int vlen, unsigned long *count)
 {
 	struct raparms	*ra;
 	mm_segment_t	oldfs;
@@ -601,9 +623,10 @@ nfsd_read(struct svc_rqst *rqstp, struct
 	if (ra)
 		file.f_ra = ra->p_ra;
 
+	reduce_iovec(vec, &vlen, *count);
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	err = vfs_read(&file, buf, *count, &offset);
+	err = vfs_readv(&file, vec, vlen, *count, &offset);
 	set_fs(oldfs);
 
 	/* Write back readahead params */
@@ -629,7 +652,8 @@ out:
  */
 int
 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
-				char *buf, unsigned long cnt, int *stablep)
+				struct iovec *vec, int vlen,
+	   			unsigned long cnt, int *stablep)
 {
 	struct svc_export	*exp;
 	struct file		file;
@@ -675,9 +699,10 @@ nfsd_write(struct svc_rqst *rqstp, struc
 	if (stable && !EX_WGATHER(exp))
 		file.f_flags |= O_SYNC;
 
+	reduce_iovec(vec, &vlen, cnt);
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	err = vfs_write(&file, buf, cnt, &offset);
+	err = vfs_writev(&file, vec, vlen, cnt, &offset);
 	if (err >= 0)
 		nfsdstats.io_write += cnt;
 	set_fs(oldfs);
--- ./fs/nfsd/nfsctl.c	2002/10/21 06:35:17	1.2
+++ ./fs/nfsd/nfsctl.c	2002/10/24 11:22:53
@@ -130,13 +130,12 @@ static int exports_open(struct inode *in
 	char *namebuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
 	if (namebuf == NULL)
 		return -ENOMEM;
-	else
-		((struct seq_file *)file->private_data)->private = namebuf;
 
 	res = seq_open(file, &nfs_exports_op);
-	if (!res)
+	if (res)
 		kfree(namebuf);
-
+	else
+		((struct seq_file *)file->private_data)->private = namebuf;
 	return res;
 }
 static int exports_release(struct inode *inode, struct file *file)
--- ./fs/nfsd/nfsxdr.c	2002/10/24 01:06:36	1.1
+++ ./fs/nfsd/nfsxdr.c	2002/10/25 05:31:51
@@ -14,6 +14,7 @@
 #include <linux/sunrpc/svc.h>
 #include <linux/nfsd/nfsd.h>
 #include <linux/nfsd/xdr.h>
+#include <linux/mm.h>
 
 #define NFSDDBG_FACILITY		NFSDDBG_XDR
 
@@ -176,27 +177,6 @@ encode_fattr(struct svc_rqst *rqstp, u32
 	return p;
 }
 
-/*
- * Check buffer bounds after decoding arguments
- */
-static inline int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_argbuf;
-
-	return p - buf->base <= buf->buflen;
-}
-
-static inline int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_resbuf;
-
-	buf->len = p - buf->base;
-	dprintk("nfsd: ressize_check p %p base %p len %d\n",
-			p, buf->base, buf->buflen);
-	return (buf->len <= buf->buflen);
-}
 
 /*
  * XDR decode functions
@@ -241,13 +221,29 @@ int
 nfssvc_decode_readargs(struct svc_rqst *rqstp, u32 *p,
 					struct nfsd_readargs *args)
 {
+	int len;
+	int v,pn;
 	if (!(p = decode_fh(p, &args->fh)))
 		return 0;
 
 	args->offset    = ntohl(*p++);
-	args->count     = ntohl(*p++);
-	args->totalsize = ntohl(*p++);
+	len = args->count     = ntohl(*p++);
+	p++; /* totalcount - unused */
 
+	/* FIXME range check ->count */
+	/* set up somewhere to store response.
+	 * We take pages, put them on reslist and include in iovec
+	 */
+	v=0; 
+	while (len > 0) {
+		pn=rqstp->rq_resused;
+		take_page(rqstp);
+		args->vec[v].iov_base = page_address(rqstp->rq_respages[pn]);
+		args->vec[v].iov_len = PAGE_SIZE;
+		v++;
+		len -= PAGE_SIZE;
+	}
+	args->vlen = v;
 	return xdr_argsize_check(rqstp, p);
 }
 
@@ -255,17 +251,27 @@ int
 nfssvc_decode_writeargs(struct svc_rqst *rqstp, u32 *p,
 					struct nfsd_writeargs *args)
 {
+	int len;
+	int v;
 	if (!(p = decode_fh(p, &args->fh)))
 		return 0;
 
 	p++;				/* beginoffset */
 	args->offset = ntohl(*p++);	/* offset */
 	p++;				/* totalcount */
-	args->len = ntohl(*p++);
-	args->data = (char *) p;
-	p += XDR_QUADLEN(args->len);
-
-	return xdr_argsize_check(rqstp, p);
+	len = args->len = ntohl(*p++);
+	args->vec[0].iov_base = (void*)p;
+	args->vec[0].iov_len = rqstp->rq_arg.head[0].iov_len -
+				(((void*)p) - rqstp->rq_arg.head[0].iov_base);
+	v = 0;
+	while (len > args->vec[v].iov_len) {
+		len -= args->vec[v].iov_len;
+		v++;
+		args->vec[v].iov_base = page_address(rqstp->rq_argpages[v]);
+		args->vec[v].iov_len = PAGE_SIZE;
+	}
+	args->vlen = v+1;
+	return 1; /* FIXME */
 }
 
 int
@@ -371,9 +377,22 @@ nfssvc_encode_readres(struct svc_rqst *r
 {
 	p = encode_fattr(rqstp, p, &resp->fh);
 	*p++ = htonl(resp->count);
-	p += XDR_QUADLEN(resp->count);
+	xdr_ressize_check(rqstp, p);
 
-	return xdr_ressize_check(rqstp, p);
+	/* now update rqstp->rq_res to reflect data aswell */
+	rqstp->rq_res.page_base = 0;
+	rqstp->rq_res.page_len = resp->count;
+	if (resp->count & 3) {
+		/* need to pad with tail */
+		rqstp->rq_res.tail[0].iov_base = p;
+		*p = 0;
+		rqstp->rq_res.tail[0].iov_len = 4 - (resp->count&3);
+	}
+	rqstp->rq_res.len = 
+		rqstp->rq_res.head[0].iov_len+
+		rqstp->rq_res.page_len+
+		rqstp->rq_res.tail[0].iov_len;
+	return 1;
 }
 
 int
--- ./fs/nfsd/nfs3xdr.c	2002/10/24 01:07:00	1.1
+++ ./fs/nfsd/nfs3xdr.c	2002/10/25 05:14:26
@@ -269,27 +269,6 @@ encode_wcc_data(struct svc_rqst *rqstp, 
 	return encode_post_op_attr(rqstp, p, fhp);
 }
 
-/*
- * Check buffer bounds after decoding arguments
- */
-static inline int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_argbuf;
-
-	return p - buf->base <= buf->buflen;
-}
-
-static inline int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_resbuf;
-
-	buf->len = p - buf->base;
-	dprintk("nfsd: ressize_check p %p base %p len %d\n",
-			p, buf->base, buf->buflen);
-	return (buf->len <= buf->buflen);
-}
 
 /*
  * XDR decode functions
--- ./fs/nfsd/nfscache.c	2002/10/24 03:37:10	1.1
+++ ./fs/nfsd/nfscache.c	2002/10/24 04:30:23
@@ -41,7 +41,7 @@ static struct svc_cacherep *	lru_tail;
 static struct svc_cacherep *	nfscache;
 static int			cache_disabled = 1;
 
-static int	nfsd_cache_append(struct svc_rqst *rqstp, struct svc_buf *data);
+static int	nfsd_cache_append(struct svc_rqst *rqstp, struct iovec *vec);
 
 /* 
  * locking for the reply cache:
@@ -107,7 +107,7 @@ nfsd_cache_shutdown(void)
 
 	for (rp = lru_head; rp; rp = rp->c_lru_next) {
 		if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF)
-			kfree(rp->c_replbuf.buf);
+			kfree(rp->c_replvec.iov_base);
 	}
 
 	cache_disabled = 1;
@@ -242,8 +242,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp
 
 	/* release any buffer */
 	if (rp->c_type == RC_REPLBUFF) {
-		kfree(rp->c_replbuf.buf);
-		rp->c_replbuf.buf = NULL;
+		kfree(rp->c_replvec.iov_base);
+		rp->c_replvec.iov_base = NULL;
 	}
 	rp->c_type = RC_NOCACHE;
  out:
@@ -272,11 +272,11 @@ found_entry:
 	case RC_NOCACHE:
 		break;
 	case RC_REPLSTAT:
-		svc_putu32(&rqstp->rq_resbuf, rp->c_replstat);
+		svc_putu32(&rqstp->rq_res.head[0], rp->c_replstat);
 		rtn = RC_REPLY;
 		break;
 	case RC_REPLBUFF:
-		if (!nfsd_cache_append(rqstp, &rp->c_replbuf))
+		if (!nfsd_cache_append(rqstp, &rp->c_replvec))
 			goto out;	/* should not happen */
 		rtn = RC_REPLY;
 		break;
@@ -308,13 +308,14 @@ void
 nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, u32 *statp)
 {
 	struct svc_cacherep *rp;
-	struct svc_buf	*resp = &rqstp->rq_resbuf, *cachp;
+	struct iovec	*resv = &rqstp->rq_res.head[0], *cachv;
 	int		len;
 
 	if (!(rp = rqstp->rq_cacherep) || cache_disabled)
 		return;
 
-	len = resp->len - (statp - resp->base);
+	len = resv->iov_len - ((char*)statp - (char*)resv->iov_base);
+	len >>= 2;
 	
 	/* Don't cache excessive amounts of data and XDR failures */
 	if (!statp || len > (256 >> 2)) {
@@ -329,16 +330,16 @@ nfsd_cache_update(struct svc_rqst *rqstp
 		rp->c_replstat = *statp;
 		break;
 	case RC_REPLBUFF:
-		cachp = &rp->c_replbuf;
-		cachp->buf = (u32 *) kmalloc(len << 2, GFP_KERNEL);
-		if (!cachp->buf) {
+		cachv = &rp->c_replvec;
+		cachv->iov_base = kmalloc(len << 2, GFP_KERNEL);
+		if (!cachv->iov_base) {
 			spin_lock(&cache_lock);
 			rp->c_state = RC_UNUSED;
 			spin_unlock(&cache_lock);
 			return;
 		}
-		cachp->len = len;
-		memcpy(cachp->buf, statp, len << 2);
+		cachv->iov_len = len << 2;
+		memcpy(cachv->iov_base, statp, len << 2);
 		break;
 	}
 	spin_lock(&cache_lock);
@@ -353,19 +354,20 @@ nfsd_cache_update(struct svc_rqst *rqstp
 
 /*
  * Copy cached reply to current reply buffer. Should always fit.
+ * FIXME as reply is in a page, we should just attach the page, and
+ * keep a refcount....
  */
 static int
-nfsd_cache_append(struct svc_rqst *rqstp, struct svc_buf *data)
+nfsd_cache_append(struct svc_rqst *rqstp, struct iovec *data)
 {
-	struct svc_buf	*resp = &rqstp->rq_resbuf;
+	struct iovec	*vec = &rqstp->rq_res.head[0];
 
-	if (resp->len + data->len > resp->buflen) {
+	if (vec->iov_len + data->iov_len > PAGE_SIZE) {
 		printk(KERN_WARNING "nfsd: cached reply too large (%d).\n",
-				data->len);
+				data->iov_len);
 		return 0;
 	}
-	memcpy(resp->buf, data->buf, data->len << 2);
-	resp->buf += data->len;
-	resp->len += data->len;
+	memcpy((char*)vec->iov_base + vec->iov_len, data->iov_base, data->iov_len);
+	vec->iov_len += data->iov_len;
 	return 1;
 }
--- ./fs/nfsd/nfsproc.c	2002/10/24 02:23:57	1.1
+++ ./fs/nfsd/nfsproc.c	2002/10/25 05:32:04
@@ -30,11 +30,11 @@ typedef struct svc_buf	svc_buf;
 #define NFSDDBG_FACILITY		NFSDDBG_PROC
 
 
-static void
-svcbuf_reserve(struct svc_buf *buf, u32 **ptr, int *len, int nr)
+static inline void
+svcbuf_reserve(struct xdr_buf *buf, u32 **ptr, int *len, int nr)
 {
-	*ptr = buf->buf + nr;
-	*len = buf->buflen - buf->len - nr;
+	*ptr = (u32*)(buf->head[0].iov_base+buf->head[0].iov_len) + nr;
+	*len = ((PAGE_SIZE-buf->head[0].iov_len)>>2) - nr;
 }
 
 static int
@@ -109,7 +109,7 @@ nfsd_proc_readlink(struct svc_rqst *rqst
 	dprintk("nfsd: READLINK %s\n", SVCFH_fmt(&argp->fh));
 
 	/* Reserve room for status and path length */
-	svcbuf_reserve(&rqstp->rq_resbuf, &path, &dummy, 2);
+	svcbuf_reserve(&rqstp->rq_res, &path, &dummy, 2);
 
 	/* Read the symlink. */
 	resp->len = NFS_MAXPATHLEN;
@@ -127,8 +127,7 @@ static int
 nfsd_proc_read(struct svc_rqst *rqstp, struct nfsd_readargs *argp,
 				       struct nfsd_readres  *resp)
 {
-	u32 *	buffer;
-	int	nfserr, avail;
+	int	nfserr;
 
 	dprintk("nfsd: READ    %s %d bytes at %d\n",
 		SVCFH_fmt(&argp->fh),
@@ -137,22 +136,21 @@ nfsd_proc_read(struct svc_rqst *rqstp, s
 	/* Obtain buffer pointer for payload. 19 is 1 word for
 	 * status, 17 words for fattr, and 1 word for the byte count.
 	 */
-	svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &avail, 19);
 
-	if ((avail << 2) < argp->count) {
+	if ((32768/*FIXME*/) < argp->count) {
 		printk(KERN_NOTICE
 			"oversized read request from %08x:%d (%d bytes)\n",
 				ntohl(rqstp->rq_addr.sin_addr.s_addr),
 				ntohs(rqstp->rq_addr.sin_port),
 				argp->count);
-		argp->count = avail << 2;
+		argp->count = 32768;
 	}
 	svc_reserve(rqstp, (19<<2) + argp->count + 4);
 
 	resp->count = argp->count;
 	nfserr = nfsd_read(rqstp, fh_copy(&resp->fh, &argp->fh),
 				  argp->offset,
-				  (char *) buffer,
+			   	  argp->vec, argp->vlen,
 				  &resp->count);
 
 	return nfserr;
@@ -175,7 +173,7 @@ nfsd_proc_write(struct svc_rqst *rqstp, 
 
 	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
 				   argp->offset,
-				   argp->data,
+				   argp->vec, argp->vlen,
 				   argp->len,
 				   &stable);
 	return nfserr;
@@ -477,7 +475,7 @@ nfsd_proc_readdir(struct svc_rqst *rqstp
 		argp->count, argp->cookie);
 
 	/* Reserve buffer space for status */
-	svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count, 1);
+	svcbuf_reserve(&rqstp->rq_res, &buffer, &count, 1);
 
 	/* Shrink to the client read size */
 	if (count > (argp->count >> 2))
--- ./fs/nfsd/nfs3proc.c	2002/10/24 04:37:41	1.1
+++ ./fs/nfsd/nfs3proc.c	2002/10/25 05:34:44
@@ -43,11 +43,11 @@ static int	nfs3_ftypes[] = {
 /*
  * Reserve room in the send buffer
  */
-static void
-svcbuf_reserve(struct svc_buf *buf, u32 **ptr, int *len, int nr)
+static inline void
+svcbuf_reserve(struct xdr_buf *buf, u32 **ptr, int *len, int nr)
 {
-	*ptr = buf->buf + nr;
-	*len = buf->buflen - buf->len - nr;
+	*ptr = (u32*)(buf->head[0].iov_base+buf->head[0].iov_len) + nr;
+	*len = ((PAGE_SIZE-buf->head[0].iov_len)>>2) - nr;
 }
 
 /*
@@ -150,7 +150,7 @@ nfsd3_proc_readlink(struct svc_rqst *rqs
 	dprintk("nfsd: READLINK(3) %s\n", SVCFH_fmt(&argp->fh));
 
 	/* Reserve room for status, post_op_attr, and path length */
-	svcbuf_reserve(&rqstp->rq_resbuf, &path, &dummy,
+	svcbuf_reserve(&rqstp->rq_res, &path, &dummy,
 				1 + NFS3_POST_OP_ATTR_WORDS + 1);
 
 	/* Read the symlink. */
@@ -179,7 +179,7 @@ nfsd3_proc_read(struct svc_rqst *rqstp, 
 	 * 1 (status) + 22 (post_op_attr) + 1 (count) + 1 (eof)
 	 * + 1 (xdr opaque byte count) = 26
 	 */
-	svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &avail,
+	svcbuf_reserve(&rqstp->rq_res, &buffer, &avail,
 			1 + NFS3_POST_OP_ATTR_WORDS + 3);
 	resp->count = argp->count;
 	if ((avail << 2) < resp->count)
@@ -447,7 +447,7 @@ nfsd3_proc_readdir(struct svc_rqst *rqst
 				argp->count, (u32) argp->cookie);
 
 	/* Reserve buffer space for status, attributes and verifier */
-	svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count,
+	svcbuf_reserve(&rqstp->rq_res, &buffer, &count,
 				1 + NFS3_POST_OP_ATTR_WORDS + 2);
 
 	/* Make sure we've room for the NULL ptr & eof flag, and shrink to
@@ -482,7 +482,7 @@ nfsd3_proc_readdirplus(struct svc_rqst *
 				argp->count, (u32) argp->cookie);
 
 	/* Reserve buffer space for status, attributes and verifier */
-	svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count,
+	svcbuf_reserve(&rqstp->rq_res, &buffer, &count,
 				1 + NFS3_POST_OP_ATTR_WORDS + 2);
 
 	/* Make sure we've room for the NULL ptr & eof flag, and shrink to
--- ./fs/lockd/xdr.c	2002/10/24 01:01:26	1.1
+++ ./fs/lockd/xdr.c	2002/10/25 05:14:36
@@ -216,25 +216,6 @@ nlm_encode_testres(u32 *p, struct nlm_re
 	return p;
 }
 
-/*
- * Check buffer bounds after decoding arguments
- */
-static inline int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_argbuf;
-
-	return p - buf->base <= buf->buflen;
-}
-
-static inline int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_resbuf;
-
-	buf->len = p - buf->base;
-	return (buf->len <= buf->buflen);
-}
 
 /*
  * First, the server side XDR functions
--- ./fs/lockd/xdr4.c	2002/10/24 01:05:40	1.1
+++ ./fs/lockd/xdr4.c	2002/10/25 05:14:44
@@ -223,26 +223,6 @@ nlm4_encode_testres(u32 *p, struct nlm_r
 
 
 /*
- * Check buffer bounds after decoding arguments
- */
-static int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_argbuf;
-
-	return p - buf->base <= buf->buflen;
-}
-
-static int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
-	struct svc_buf	*buf = &rqstp->rq_resbuf;
-
-	buf->len = p - buf->base;
-	return (buf->len <= buf->buflen);
-}
-
-/*
  * First, the server side XDR functions
  */
 int
--- ./fs/read_write.c	2002/10/24 01:22:09	1.1
+++ ./fs/read_write.c	2002/10/24 02:54:13
@@ -207,6 +207,53 @@ ssize_t vfs_read(struct file *file, char
 	return ret;
 }
 
+ssize_t vfs_readv(struct file *file, struct iovec *vec, int vlen, size_t count, loff_t *pos)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	ssize_t ret;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EBADF;
+	if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
+		return -EINVAL;
+
+	ret = locks_verify_area(FLOCK_VERIFY_READ, inode, file, *pos, count);
+	if (!ret) {
+		ret = security_ops->file_permission (file, MAY_READ);
+		if (!ret) {
+			if (file->f_op->readv)
+				ret = file->f_op->readv(file, vec, vlen, pos);
+			else {
+				/* do it by hand */
+				struct iovec *vector = vec;
+				ret = 0;
+				while (vlen > 0) {
+					void *base = vector->iov_base;
+					size_t len = vector->iov_len;
+					ssize_t nr;
+					vector++;
+					vlen--;
+					if (file->f_op->read)
+						nr = file->f_op->read(file, base, len, pos);
+					else
+						nr = do_sync_read(file, base, len, pos);
+					if (nr < 0) {
+						if (!ret) ret = nr;
+						break;
+					}
+					ret += nr;
+					if (nr != len)
+						break;
+				}
+			}
+			if (ret > 0)
+				dnotify_parent(file->f_dentry, DN_ACCESS);
+		}
+	}
+
+	return ret;
+}
+
 ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos)
 {
 	struct kiocb kiocb;
@@ -247,6 +294,53 @@ ssize_t vfs_write(struct file *file, con
 	return ret;
 }
 
+ssize_t vfs_writev(struct file *file, const struct iovec *vec, int vlen, size_t count, loff_t *pos)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	ssize_t ret;
+
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EBADF;
+	if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+		return -EINVAL;
+
+	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, *pos, count);
+	if (!ret) {
+		ret = security_ops->file_permission (file, MAY_WRITE);
+		if (!ret) {
+			if (file->f_op->writev)
+				ret = file->f_op->writev(file, vec, vlen, pos);
+			else {
+				/* do it by hand */
+				const struct iovec *vector = vec;
+				ret = 0;
+				while (vlen > 0) {
+					void * base = vector->iov_base;
+					size_t len = vector->iov_len;
+					ssize_t nr;
+					vector++;
+					vlen--;
+					if (file->f_op->write)
+						nr = file->f_op->write(file, base, len, pos);
+					else
+						nr = do_sync_write(file, base, len, pos);
+					if (nr < 0) {
+						if (!ret) ret = nr;
+						break;
+					}
+					ret += nr;
+					if (nr != len)
+						break;
+				}
+			}
+			if (ret > 0)
+				dnotify_parent(file->f_dentry, DN_MODIFY);
+		}
+	}
+
+	return ret;
+}
+
 asmlinkage ssize_t sys_read(unsigned int fd, char * buf, size_t count)
 {
 	struct file *file;
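Note (not part of the patch): the "do it by hand" fallback in vfs_readv()
and vfs_writev() above walks the iovec array one segment at a time and stops
at the first error or short transfer. A minimal user-space sketch of that
loop follows. The names struct memfile, mem_read() and readv_by_hand() are
hypothetical stand-ins for the kernel's struct file and f_op->read, chosen
only so the logic can be compiled outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical in-memory "file" standing in for struct file. */
struct memfile {
	const char *data;
	size_t size;
};

/* Hypothetical stand-in for f_op->read: copy up to len bytes at *pos. */
static ssize_t mem_read(struct memfile *f, void *buf, size_t len, off_t *pos)
{
	size_t off = (size_t)*pos;
	size_t n;

	assert(*pos >= 0);
	if (off >= f->size)
		return 0;			/* end of file */
	n = f->size - off;
	if (n > len)
		n = len;
	memcpy(buf, f->data + off, n);
	*pos += (off_t)n;
	return (ssize_t)n;
}

/* Same control flow as the fallback branch: accumulate until error or short read. */
static ssize_t readv_by_hand(struct memfile *f, const struct iovec *vec,
			     int vlen, off_t *pos)
{
	ssize_t ret = 0;

	while (vlen-- > 0) {
		ssize_t nr = mem_read(f, vec->iov_base, vec->iov_len, pos);

		if (nr < 0) {
			if (!ret)
				ret = nr;	/* report the error only if nothing was read */
			break;
		}
		ret += nr;
		if ((size_t)nr != vec->iov_len)
			break;			/* short read: stop here */
		vec++;
	}
	return ret;
}
```

As in the patch, an error after some data has been transferred is swallowed
and the byte count so far is returned instead.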
--- ./include/linux/sunrpc/svc.h	2002/10/23 00:38:26	1.1
+++ ./include/linux/sunrpc/svc.h	2002/10/25 05:14:06
@@ -48,43 +48,49 @@ struct svc_serv {
  * This is use to determine the max number of pages nfsd is
  * willing to return in a single READ operation.
  */
-#define RPCSVC_MAXPAYLOAD	16384u
+#define RPCSVC_MAXPAYLOAD	(64*1024u)
 
 /*
- * Buffer to store RPC requests or replies in.
- * Each server thread has one of these beasts.
+ * RPC Requests and replies are stored in one or more pages.
+ * We maintain an array of pages for each server thread.
+ * Requests are copied into these pages as they arrive.  Remaining
+ * pages are available to write the reply into.
  *
- * Area points to the allocated memory chunk currently owned by the
- * buffer. Base points to the buffer containing the request, which is
- * different from area when directly reading from an sk_buff. buf is
- * the current read/write position while processing an RPC request.
+ * Currently pages are all re-used by the same server.  Later we 
+ * will use ->sendpage to transmit pages with reduced copying.  In
+ * that case we will need to give away the page and allocate new ones.
+ * In preparation for this, we explicitly move pages off the recv
+ * list onto the transmit list, and back.
  *
- * The array of iovecs can hold additional data that the server process
- * may not want to copy into the RPC reply buffer, but pass to the 
- * network sendmsg routines directly. The prime candidate for this
- * will of course be NFS READ operations, but one might also want to
- * do something about READLINK and READDIR. It might be worthwhile
- * to implement some generic readdir cache in the VFS layer...
+ * We use xdr_buf for holding responses as it fits well with NFS
+ * read responses (that have a header, and some data pages, and possibly
+ * a tail) and means we can share some client side routines.
  *
- * On the receiving end of the RPC server, the iovec may be used to hold
- * the list of IP fragments once we get to process fragmented UDP
- * datagrams directly.
- */
-#define RPCSVC_MAXIOV		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1)
-struct svc_buf {
-	u32 *			area;	/* allocated memory */
-	u32 *			base;	/* base of RPC datagram */
-	int			buflen;	/* total length of buffer */
-	u32 *			buf;	/* read/write pointer */
-	int			len;	/* current end of buffer */
-
-	/* iovec for zero-copy NFS READs */
-	struct iovec		iov[RPCSVC_MAXIOV];
-	int			nriov;
-};
-#define svc_getu32(argp, val)	{ (val) = *(argp)->buf++; (argp)->len--; }
-#define svc_putu32(resp, val)	{ *(resp)->buf++ = (val); (resp)->len++; }
+ * The xdr_buf.head iovec always points to the first page in the rq_*pages
+ * list.  The xdr_buf.pages pointer points to the second page on that
+ * list.  xdr_buf.tail points to the end of the first page.
+ * This assumes that the non-page part of an rpc reply will fit
+ * in a page - NFSd ensures this.  lockd also has no trouble.
+ */
+#define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1)
+
+static inline u32 svc_getu32(struct iovec *iov)
+{
+	u32 val, *vp;
+	vp = iov->iov_base;
+	val = *vp++;
+	iov->iov_base = (void*)vp;
+	iov->iov_len -= sizeof(u32);
+	return val;
+}
+static inline void svc_putu32(struct iovec *iov, u32 val)
+{
+	u32 *vp = iov->iov_base + iov->iov_len;
+	*vp = val;
+	iov->iov_len += sizeof(u32);
+}
 
+
 /*
  * The context of a single thread, including the request currently being
  * processed.
@@ -102,9 +108,15 @@ struct svc_rqst {
 	struct svc_cred		rq_cred;	/* auth info */
 	struct sk_buff *	rq_skbuff;	/* fast recv inet buffer */
 	struct svc_deferred_req*rq_deferred;	/* deferred request we are replaying */
-	struct svc_buf		rq_defbuf;	/* default buffer */
-	struct svc_buf		rq_argbuf;	/* argument buffer */
-	struct svc_buf		rq_resbuf;	/* result buffer */
+
+	struct xdr_buf		rq_arg;
+	struct xdr_buf		rq_res;
+	struct page *		rq_argpages[RPCSVC_MAXPAGES];
+	struct page *		rq_respages[RPCSVC_MAXPAGES];
+	short			rq_argused;	/* pages used for argument */
+	short			rq_arghi;	/* pages available in argument page list */
+	short			rq_resused;	/* pages used for result */
+
 	u32			rq_xid;		/* transmission id */
 	u32			rq_prog;	/* program number */
 	u32			rq_vers;	/* program version */
@@ -136,6 +148,38 @@ struct svc_rqst {
 	wait_queue_head_t	rq_wait;	/* synchronization */
 };
 
+/*
+ * Check buffer bounds after decoding arguments
+ */
+static inline int
+xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
+{
+	char *cp = (char *)p;
+	struct iovec *vec = &rqstp->rq_arg.head[0];
+	return cp - (char*)vec->iov_base <= vec->iov_len;
+}
+
+static inline int
+xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
+{
+	struct iovec *vec = &rqstp->rq_res.head[0];
+	char *cp = (char*)p;
+
+	vec->iov_len = cp - (char*)vec->iov_base;
+	rqstp->rq_res.len = vec->iov_len;
+
+	return vec->iov_len <= PAGE_SIZE;
+}
+
+static inline int take_page(struct svc_rqst *rqstp)
+{
+	if (rqstp->rq_arghi <= rqstp->rq_argused)
+		return -ENOMEM;
+	rqstp->rq_respages[rqstp->rq_resused++] =
+		rqstp->rq_argpages[--rqstp->rq_arghi];
+	return 0;
+}
+
 struct svc_deferred_req {
 	struct svc_serv		*serv;
 	u32			prot;	/* protocol (UDP or TCP) */
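Note (not part of the patch): the new svc_getu32()/svc_putu32() helpers use
the iovec differently on each side. On the argument side iov_base advances
and iov_len counts the bytes still unread; on the result side iov_base stays
put and iov_len counts the bytes written so far. A user-space sketch of that
convention follows. The names get_u32() and put_u32() are hypothetical, and
memcpy() replaces the kernel's direct u32 access so the code does not rely on
void-pointer arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Consume one 32-bit word from the front of the iovec (argument side). */
static uint32_t get_u32(struct iovec *iov)
{
	uint32_t val;

	assert(iov->iov_len >= sizeof(val));	/* caller must have checked length */
	memcpy(&val, iov->iov_base, sizeof(val));
	iov->iov_base = (char *)iov->iov_base + sizeof(val);
	iov->iov_len -= sizeof(val);
	return val;
}

/* Append one 32-bit word at the current end of the iovec (result side).
 * The caller is assumed to guarantee that the backing buffer has room. */
static void put_u32(struct iovec *iov, uint32_t val)
{
	memcpy((char *)iov->iov_base + iov->iov_len, &val, sizeof(val));
	iov->iov_len += sizeof(val);
}
```

Because the result iovec never moves its base, a caller can later truncate
the reply by resetting iov_len, which is what svc_process() does when it
rewrites the status word on error.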
--- ./include/linux/nfsd/xdr.h	2002/10/24 01:49:48	1.1
+++ ./include/linux/nfsd/xdr.h	2002/10/25 02:21:03
@@ -29,16 +29,16 @@ struct nfsd_readargs {
 	struct svc_fh		fh;
 	__u32			offset;
 	__u32			count;
-	__u32			totalsize;
+	struct iovec		vec[RPCSVC_MAXPAGES];
+	int			vlen;
 };
 
 struct nfsd_writeargs {
 	svc_fh			fh;
-	__u32			beginoffset;
 	__u32			offset;
-	__u32			totalcount;
-	__u8 *			data;
 	int			len;
+	struct iovec		vec[RPCSVC_MAXPAGES];
+	int			vlen;
 };
 
 struct nfsd_createargs {
--- ./include/linux/nfsd/nfsd.h	2002/10/24 04:04:03	1.1
+++ ./include/linux/nfsd/nfsd.h	2002/10/24 04:13:19
@@ -97,9 +97,9 @@ int		nfsd_open(struct svc_rqst *, struct
 				int, struct file *);
 void		nfsd_close(struct file *);
 int		nfsd_read(struct svc_rqst *, struct svc_fh *,
-				loff_t, char *, unsigned long *);
+				loff_t, struct iovec *,int, unsigned long *);
 int		nfsd_write(struct svc_rqst *, struct svc_fh *,
-				loff_t, char *, unsigned long, int *);
+				loff_t, struct iovec *,int, unsigned long, int *);
 int		nfsd_readlink(struct svc_rqst *, struct svc_fh *,
 				char *, int *);
 int		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
--- ./include/linux/nfsd/cache.h	2002/10/24 03:41:12	1.1
+++ ./include/linux/nfsd/cache.h	2002/10/24 03:41:35
@@ -32,12 +32,12 @@ struct svc_cacherep {
 	u32			c_vers;
 	unsigned long		c_timestamp;
 	union {
-		struct svc_buf	u_buffer;
+		struct iovec	u_vec;
 		u32		u_status;
 	}			c_u;
 };
 
-#define c_replbuf		c_u.u_buffer
+#define c_replvec		c_u.u_vec
 #define c_replstat		c_u.u_status
 
 /* cache entry states */
--- ./include/linux/fs.h	2002/10/24 01:34:48	1.1
+++ ./include/linux/fs.h	2002/10/24 02:53:14
@@ -793,6 +793,8 @@ struct seq_file;
 
 extern ssize_t vfs_read(struct file *, char *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char *, size_t, loff_t *);
+extern ssize_t vfs_readv(struct file *, struct iovec *, int, size_t, loff_t *);
+extern ssize_t vfs_writev(struct file *, const struct iovec *, int, size_t, loff_t *);
 
 /*
  * NOTE: write_inode, delete_inode, clear_inode, put_inode can be called
--- ./net/sunrpc/svc.c	2002/10/23 12:35:50	1.1
+++ ./net/sunrpc/svc.c	2002/10/25 05:41:14
@@ -13,6 +13,7 @@
 #include <linux/net.h>
 #include <linux/in.h>
 #include <linux/unistd.h>
+#include <linux/mm.h>
 
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/xdr.h>
@@ -35,7 +36,6 @@ svc_create(struct svc_program *prog, uns
 
 	if (!(serv = (struct svc_serv *) kmalloc(sizeof(*serv), GFP_KERNEL)))
 		return NULL;
-
 	memset(serv, 0, sizeof(*serv));
 	serv->sv_program   = prog;
 	serv->sv_nrthreads = 1;
@@ -105,35 +105,41 @@ svc_destroy(struct svc_serv *serv)
 }
 
 /*
- * Allocate an RPC server buffer
- * Later versions may do nifty things by allocating multiple pages
- * of memory directly and putting them into the bufp->iov.
+ * Allocate an RPC server's buffer space.
+ * We allocate pages and place them in rq_argpages.
  */
-int
-svc_init_buffer(struct svc_buf *bufp, unsigned int size)
+static int
+svc_init_buffer(struct svc_rqst *rqstp, unsigned int size)
 {
-	if (!(bufp->area = (u32 *) kmalloc(size, GFP_KERNEL)))
-		return 0;
-	bufp->base   = bufp->area;
-	bufp->buf    = bufp->area;
-	bufp->len    = 0;
-	bufp->buflen = size >> 2;
-
-	bufp->iov[0].iov_base = bufp->area;
-	bufp->iov[0].iov_len  = size;
-	bufp->nriov = 1;
-
-	return 1;
+	int pages = 2 + (size+ PAGE_SIZE -1) / PAGE_SIZE;
+	int arghi;
+	
+	rqstp->rq_argused = 0;
+	rqstp->rq_resused = 0;
+	arghi = 0;
+	while (pages) {
+		struct page *p = alloc_page(GFP_KERNEL);
+		if (!p)
+			break;
+		dprintk("svc: allocated page %d (%d to go)\n", arghi, pages-1);
+		rqstp->rq_argpages[arghi++] = p;
+		pages--;
+	}
+	rqstp->rq_arghi = arghi;
+	return ! pages;
 }
 
 /*
  * Release an RPC server buffer
  */
-void
-svc_release_buffer(struct svc_buf *bufp)
+static void
+svc_release_buffer(struct svc_rqst *rqstp)
 {
-	kfree(bufp->area);
-	bufp->area = 0;
+	while (rqstp->rq_arghi)
+		put_page(rqstp->rq_argpages[--rqstp->rq_arghi]);
+	while (rqstp->rq_resused)
+		put_page(rqstp->rq_respages[--rqstp->rq_resused]);
+	rqstp->rq_argused = 0;
 }
 
 /*
@@ -154,7 +160,7 @@ svc_create_thread(svc_thread_fn func, st
 
 	if (!(rqstp->rq_argp = (u32 *) kmalloc(serv->sv_xdrsize, GFP_KERNEL))
 	 || !(rqstp->rq_resp = (u32 *) kmalloc(serv->sv_xdrsize, GFP_KERNEL))
-	 || !svc_init_buffer(&rqstp->rq_defbuf, serv->sv_bufsz))
+	 || !svc_init_buffer(rqstp, serv->sv_bufsz))
 		goto out_thread;
 
 	serv->sv_nrthreads++;
@@ -180,7 +186,7 @@ svc_exit_thread(struct svc_rqst *rqstp)
 {
 	struct svc_serv	*serv = rqstp->rq_server;
 
-	svc_release_buffer(&rqstp->rq_defbuf);
+	svc_release_buffer(rqstp);
 	if (rqstp->rq_resp)
 		kfree(rqstp->rq_resp);
 	if (rqstp->rq_argp)
@@ -242,37 +248,49 @@ svc_process(struct svc_serv *serv, struc
 	struct svc_program	*progp;
 	struct svc_version	*versp = NULL;	/* compiler food */
 	struct svc_procedure	*procp = NULL;
-	struct svc_buf *	argp = &rqstp->rq_argbuf;
-	struct svc_buf *	resp = &rqstp->rq_resbuf;
+	struct iovec *		argv = &rqstp->rq_arg.head[0];
+	struct iovec *		resv = &rqstp->rq_res.head[0];
 	kxdrproc_t		xdr;
-	u32			*bufp, *statp;
+	u32			*statp;
 	u32			dir, prog, vers, proc,
 				auth_stat, rpc_stat;
 
 	rpc_stat = rpc_success;
-	bufp = argp->buf;
 
-	if (argp->len < 5)
+	if (argv->iov_len < 6*4)
 		goto err_short_len;
 
-	dir  = ntohl(*bufp++);
-	vers = ntohl(*bufp++);
+	/* Set up the response xdr_buf.
+	 * Initially it has just one page.
+	 */
+	take_page(rqstp); /* must succeed */
+	resv->iov_base = page_address(rqstp->rq_respages[0]);
+	resv->iov_len = 0;
+	rqstp->rq_res.pages = rqstp->rq_respages+1;
+	rqstp->rq_res.len = 0;
+	/* tcp needs a space for the record length... */
+	if (rqstp->rq_prot == IPPROTO_TCP)
+		svc_putu32(resv, 0);
+
+	rqstp->rq_xid = svc_getu32(argv);
+	svc_putu32(resv, rqstp->rq_xid);
+
+	dir  = ntohl(svc_getu32(argv));
+	vers = ntohl(svc_getu32(argv));
 
 	/* First words of reply: */
-	svc_putu32(resp, xdr_one);		/* REPLY */
-	svc_putu32(resp, xdr_zero);		/* ACCEPT */
+	svc_putu32(resv, xdr_one);		/* REPLY */
 
 	if (dir != 0)		/* direction != CALL */
 		goto err_bad_dir;
 	if (vers != 2)		/* RPC version number */
 		goto err_bad_rpc;
 
-	rqstp->rq_prog = prog = ntohl(*bufp++);	/* program number */
-	rqstp->rq_vers = vers = ntohl(*bufp++);	/* version number */
-	rqstp->rq_proc = proc = ntohl(*bufp++);	/* procedure number */
+	svc_putu32(resv, xdr_zero);		/* ACCEPT */
 
-	argp->buf += 5;
-	argp->len -= 5;
+	rqstp->rq_prog = prog = ntohl(svc_getu32(argv));	/* program number */
+	rqstp->rq_vers = vers = ntohl(svc_getu32(argv));	/* version number */
+	rqstp->rq_proc = proc = ntohl(svc_getu32(argv));	/* procedure number */
 
 	/*
 	 * Decode auth data, and add verifier to reply buffer.
@@ -307,8 +325,8 @@ svc_process(struct svc_serv *serv, struc
 	serv->sv_stats->rpccnt++;
 
 	/* Build the reply header. */
-	statp = resp->buf;
-	svc_putu32(resp, rpc_success);		/* RPC_SUCCESS */
+	statp = resv->iov_base + resv->iov_len;
+	svc_putu32(resv, rpc_success);		/* RPC_SUCCESS */
 
 	/* Bump per-procedure stats counter */
 	procp->pc_count++;
@@ -327,14 +345,14 @@ svc_process(struct svc_serv *serv, struc
 	if (!versp->vs_dispatch) {
 		/* Decode arguments */
 		xdr = procp->pc_decode;
-		if (xdr && !xdr(rqstp, rqstp->rq_argbuf.buf, rqstp->rq_argp))
+		if (xdr && !xdr(rqstp, argv->iov_base, rqstp->rq_argp))
 			goto err_garbage;
 
 		*statp = procp->pc_func(rqstp, rqstp->rq_argp, rqstp->rq_resp);
 
 		/* Encode reply */
 		if (*statp == rpc_success && (xdr = procp->pc_encode)
-		 && !xdr(rqstp, rqstp->rq_resbuf.buf, rqstp->rq_resp)) {
+		 && !xdr(rqstp, resv->iov_base+resv->iov_len, rqstp->rq_resp)) {
 			dprintk("svc: failed to encode reply\n");
 			/* serv->sv_stats->rpcsystemerr++; */
 			*statp = rpc_system_err;
@@ -347,7 +365,7 @@ svc_process(struct svc_serv *serv, struc
 
 	/* Check RPC status result */
 	if (*statp != rpc_success)
-		resp->len = statp + 1 - resp->base;
+		resv->iov_len = ((void*)statp)  - resv->iov_base + 4;
 
 	/* Release reply info */
 	if (procp->pc_release)
@@ -369,7 +387,7 @@ svc_process(struct svc_serv *serv, struc
 
 err_short_len:
 #ifdef RPC_PARANOIA
-	printk("svc: short len %d, dropping request\n", argp->len);
+	printk("svc: short len %d, dropping request\n", (int)argv->iov_len);
 #endif
 	goto dropit;			/* drop request */
 
@@ -382,18 +400,19 @@ err_bad_dir:
 
 err_bad_rpc:
 	serv->sv_stats->rpcbadfmt++;
-	resp->buf[-1] = xdr_one;	/* REJECT */
-	svc_putu32(resp, xdr_zero);	/* RPC_MISMATCH */
-	svc_putu32(resp, xdr_two);	/* Only RPCv2 supported */
-	svc_putu32(resp, xdr_two);
+	svc_putu32(resv, xdr_one);	/* REJECT */
+	svc_putu32(resv, xdr_zero);	/* RPC_MISMATCH */
+	svc_putu32(resv, xdr_two);	/* Only RPCv2 supported */
+	svc_putu32(resv, xdr_two);
 	goto sendit;
 
 err_bad_auth:
 	dprintk("svc: authentication failed (%d)\n", ntohl(auth_stat));
 	serv->sv_stats->rpcbadauth++;
-	resp->buf[-1] = xdr_one;	/* REJECT */
-	svc_putu32(resp, xdr_one);	/* AUTH_ERROR */
-	svc_putu32(resp, auth_stat);	/* status */
+	resv->iov_len -= 4;
+	svc_putu32(resv, xdr_one);	/* REJECT */
+	svc_putu32(resv, xdr_one);	/* AUTH_ERROR */
+	svc_putu32(resv, auth_stat);	/* status */
 	goto sendit;
 
 err_bad_prog:
@@ -403,7 +422,7 @@ err_bad_prog:
 	/* else it is just a Solaris client seeing if ACLs are supported */
 #endif
 	serv->sv_stats->rpcbadfmt++;
-	svc_putu32(resp, rpc_prog_unavail);
+	svc_putu32(resv, rpc_prog_unavail);
 	goto sendit;
 
 err_bad_vers:
@@ -411,9 +430,9 @@ err_bad_vers:
 	printk("svc: unknown version (%d)\n", vers);
 #endif
 	serv->sv_stats->rpcbadfmt++;
-	svc_putu32(resp, rpc_prog_mismatch);
-	svc_putu32(resp, htonl(progp->pg_lovers));
-	svc_putu32(resp, htonl(progp->pg_hivers));
+	svc_putu32(resv, rpc_prog_mismatch);
+	svc_putu32(resv, htonl(progp->pg_lovers));
+	svc_putu32(resv, htonl(progp->pg_hivers));
 	goto sendit;
 
 err_bad_proc:
@@ -421,7 +440,7 @@ err_bad_proc:
 	printk("svc: unknown procedure (%d)\n", proc);
 #endif
 	serv->sv_stats->rpcbadfmt++;
-	svc_putu32(resp, rpc_proc_unavail);
+	svc_putu32(resv, rpc_proc_unavail);
 	goto sendit;
 
 err_garbage:
@@ -429,6 +448,6 @@ err_garbage:
 	printk("svc: failed to decode args\n");
 #endif
 	serv->sv_stats->rpcbadfmt++;
-	svc_putu32(resp, rpc_garbage_args);
+	svc_putu32(resv, rpc_garbage_args);
 	goto sendit;
 }
--- ./net/sunrpc/svcsock.c	2002/10/21 23:40:50	1.2
+++ ./net/sunrpc/svcsock.c	2002/10/25 07:22:30
@@ -234,7 +234,7 @@ svc_sock_received(struct svc_sock *svsk)
  */
 void svc_reserve(struct svc_rqst *rqstp, int space)
 {
-	space += rqstp->rq_resbuf.len<<2;
+	space += rqstp->rq_res.head[0].iov_len;
 
 	if (space < rqstp->rq_reserved) {
 		struct svc_sock *svsk = rqstp->rq_sock;
@@ -278,13 +278,12 @@ svc_sock_release(struct svc_rqst *rqstp)
 	 * But first, check that enough space was reserved
 	 * for the reply, otherwise we have a bug!
 	 */
-	if ((rqstp->rq_resbuf.len<<2) >  rqstp->rq_reserved)
+	if ((rqstp->rq_res.len) >  rqstp->rq_reserved)
 		printk(KERN_ERR "RPC request reserved %d but used %d\n",
 		       rqstp->rq_reserved,
-		       rqstp->rq_resbuf.len<<2);
+		       rqstp->rq_res.len);
 
-	rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base;
-	rqstp->rq_resbuf.len = 0;
+	rqstp->rq_res.head[0].iov_len = 0;
 	svc_reserve(rqstp, 0);
 	rqstp->rq_sock = NULL;
 
@@ -480,13 +479,15 @@ svc_write_space(struct sock *sk)
 /*
  * Receive a datagram from a UDP socket.
  */
+extern int
+csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb);
+
 static int
 svc_udp_recvfrom(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk = rqstp->rq_sock;
 	struct svc_serv	*serv = svsk->sk_server;
 	struct sk_buff	*skb;
-	u32		*data;
 	int		err, len;
 
 	if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
@@ -512,33 +513,19 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 	}
 	set_bit(SK_DATA, &svsk->sk_flags); /* there may be more data... */
 
-	/* Sorry. */
-	if (skb_is_nonlinear(skb)) {
-		if (skb_linearize(skb, GFP_KERNEL) != 0) {
-			kfree_skb(skb);
-			svc_sock_received(svsk);
-			return 0;
-		}
-	}
+	len  = skb->len - sizeof(struct udphdr);
 
-	if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
-		if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) {
-			skb_free_datagram(svsk->sk_sk, skb);
-			svc_sock_received(svsk);
-			return 0;
-		}
+	if (csum_partial_copy_to_xdr(&rqstp->rq_arg, skb)) {
+		/* checksum error */
+		skb_free_datagram(svsk->sk_sk, skb);
+		svc_sock_received(svsk);
+		return 0;
 	}
 
 
-	len  = skb->len - sizeof(struct udphdr);
-	data = (u32 *) (skb->data + sizeof(struct udphdr));
-
-	rqstp->rq_skbuff      = skb;
-	rqstp->rq_argbuf.base = data;
-	rqstp->rq_argbuf.buf  = data;
-	rqstp->rq_argbuf.len  = (len >> 2);
-	rqstp->rq_argbuf.buflen = (len >> 2);
-	/* rqstp->rq_resbuf      = rqstp->rq_defbuf; */
+	rqstp->rq_arg.len = len;
+	rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
+	rqstp->rq_argused += (rqstp->rq_arg.page_len + PAGE_SIZE - 1)/ PAGE_SIZE;
 	rqstp->rq_prot        = IPPROTO_UDP;
 
 	/* Get sender address */
@@ -546,6 +533,8 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 	rqstp->rq_addr.sin_port = skb->h.uh->source;
 	rqstp->rq_addr.sin_addr.s_addr = skb->nh.iph->saddr;
 
+	skb_free_datagram(svsk->sk_sk, skb);
+
 	if (serv->sv_stats)
 		serv->sv_stats->netudpcnt++;
 
@@ -559,21 +548,37 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 static int
 svc_udp_sendto(struct svc_rqst *rqstp)
 {
-	struct svc_buf	*bufp = &rqstp->rq_resbuf;
 	int		error;
+	struct iovec vec[RPCSVC_MAXPAGES];
+	int v;
+	int base, len;
 
 	/* Set up the first element of the reply iovec.
 	 * Any other iovecs that may be in use have been taken
 	 * care of by the server implementation itself.
 	 */
-	/* bufp->base = bufp->area; */
-	bufp->iov[0].iov_base = bufp->base;
-	bufp->iov[0].iov_len  = bufp->len << 2;
+	vec[0] = rqstp->rq_res.head[0];
+	v=1;
+	base=rqstp->rq_res.page_base;
+	len = rqstp->rq_res.page_len;
+	while (len) {
+		vec[v].iov_base = page_address(rqstp->rq_res.pages[v-1]) + base;
+		vec[v].iov_len = PAGE_SIZE-base;
+		if (len <= vec[v].iov_len)
+			vec[v].iov_len = len;
+		len -= vec[v].iov_len;
+		base = 0;
+		v++;
+	}
+	if (rqstp->rq_res.tail[0].iov_len) {
+		vec[v] = rqstp->rq_res.tail[0];
+		v++;
+	}
 
-	error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+	error = svc_sendto(rqstp, vec, v);
 	if (error == -ECONNREFUSED)
 		/* ICMP error on earlier request. */
-		error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+		error = svc_sendto(rqstp, vec, v);
 
 	return error;
 }
@@ -785,8 +790,9 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk = rqstp->rq_sock;
 	struct svc_serv	*serv = svsk->sk_server;
-	struct svc_buf	*bufp = &rqstp->rq_argbuf;
 	int		len;
+	struct iovec vec[RPCSVC_MAXPAGES];
+	int pnum, vlen;
 
 	dprintk("svc: tcp_recv %p data %d conn %d close %d\n",
 		svsk, test_bit(SK_DATA, &svsk->sk_flags),
@@ -851,7 +857,7 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		}
 		svsk->sk_reclen &= 0x7fffffff;
 		dprintk("svc: TCP record, %d bytes\n", svsk->sk_reclen);
-		if (svsk->sk_reclen > (bufp->buflen<<2)) {
+		if (svsk->sk_reclen > (32768 /*FIXME*/)) {
 			printk(KERN_NOTICE "RPC: bad TCP reclen 0x%08lx (large)\n",
 			       (unsigned long) svsk->sk_reclen);
 			goto err_delete;
@@ -869,30 +875,35 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		svc_sock_received(svsk);
 		return -EAGAIN;	/* record not complete */
 	}
+	len = svsk->sk_reclen;
 	set_bit(SK_DATA, &svsk->sk_flags);
 
-	/* Frob argbuf */
-	bufp->iov[0].iov_base += 4;
-	bufp->iov[0].iov_len  -= 4;
+	vec[0] = rqstp->rq_arg.head[0];
+	vlen = PAGE_SIZE;
+	pnum = 1;
+	while (vlen < len) {
+		vec[pnum].iov_base = page_address(rqstp->rq_argpages[rqstp->rq_argused++]);
+		vec[pnum].iov_len = PAGE_SIZE;
+		pnum++;
+		vlen += PAGE_SIZE;
+	}
 
 	/* Now receive data */
-	len = svc_recvfrom(rqstp, bufp->iov, bufp->nriov, svsk->sk_reclen);
+	len = svc_recvfrom(rqstp, vec, pnum, len);
 	if (len < 0)
 		goto error;
 
 	dprintk("svc: TCP complete record (%d bytes)\n", len);
-
-	/* Position reply write pointer immediately after args,
-	 * allowing for record length */
-	rqstp->rq_resbuf.base = rqstp->rq_argbuf.base + 1 + (len>>2);
-	rqstp->rq_resbuf.buf  = rqstp->rq_resbuf.base + 1;
-	rqstp->rq_resbuf.len  = 1;
-	rqstp->rq_resbuf.buflen= rqstp->rq_argbuf.buflen - (len>>2) - 1;
+	rqstp->rq_arg.len = len;
+	rqstp->rq_arg.page_base = 0;
+	if (len <= rqstp->rq_arg.head[0].iov_len) {
+		rqstp->rq_arg.head[0].iov_len = len;
+		rqstp->rq_arg.page_len = 0;
+	} else {
+		rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
+	}
 
 	rqstp->rq_skbuff      = 0;
-	rqstp->rq_argbuf.buf += 1;
-	rqstp->rq_argbuf.len  = (len >> 2);
-	rqstp->rq_argbuf.buflen = (len >> 2) +1;
 	rqstp->rq_prot	      = IPPROTO_TCP;
 
 	/* Reset TCP read info */
@@ -928,23 +939,44 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 static int
 svc_tcp_sendto(struct svc_rqst *rqstp)
 {
-	struct svc_buf	*bufp = &rqstp->rq_resbuf;
+	struct xdr_buf	*xbufp = &rqstp->rq_res;
+	struct iovec vec[RPCSVC_MAXPAGES];
+	int v;
+	int base, len;
 	int sent;
+	u32 reclen;
 
 	/* Set up the first element of the reply iovec.
 	 * Any other iovecs that may be in use have been taken
 	 * care of by the server implementation itself.
 	 */
-	bufp->iov[0].iov_base = bufp->base;
-	bufp->iov[0].iov_len  = bufp->len << 2;
-	bufp->base[0] = htonl(0x80000000|((bufp->len << 2) - 4));
+	reclen = htonl(0x80000000|((xbufp->len ) - 4));
+	memcpy(xbufp->head[0].iov_base, &reclen, 4);
+
+	vec[0] = rqstp->rq_res.head[0];
+	v=1;
+	base= xbufp->page_base;
+	len = xbufp->page_len;
+	while (len) {
+		vec[v].iov_base = page_address(xbufp->pages[v-1]) + base;
+		vec[v].iov_len = PAGE_SIZE-base;
+		if (len <= vec[v].iov_len)
+			vec[v].iov_len = len;
+		len -= vec[v].iov_len;
+		base = 0;
+		v++;
+	}
+	if (xbufp->tail[0].iov_len) {
+		vec[v] = xbufp->tail[0];
+		v++;
+	}
 
-	sent = svc_sendto(rqstp, bufp->iov, bufp->nriov);
-	if (sent != bufp->len<<2) {
+	sent = svc_sendto(rqstp, vec, v);
+	if (sent != xbufp->len) {
 		printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n",
 		       rqstp->rq_sock->sk_server->sv_name,
 		       (sent<0)?"got error":"sent only",
-		       sent, bufp->len << 2);
+		       sent, xbufp->len);
 		svc_delete_socket(rqstp->rq_sock);
 		sent = -EAGAIN;
 	}
@@ -1016,6 +1048,8 @@ svc_recv(struct svc_serv *serv, struct s
 {
 	struct svc_sock		*svsk =NULL;
 	int			len;
+	int 			pages;
+	struct xdr_buf		*arg;
 	DECLARE_WAITQUEUE(wait, current);
 
 	dprintk("svc: server %p waiting for data (to = %ld)\n",
@@ -1031,9 +1065,35 @@ svc_recv(struct svc_serv *serv, struct s
 			 rqstp);
 
 	/* Initialize the buffers */
-	rqstp->rq_argbuf = rqstp->rq_defbuf;
-	rqstp->rq_resbuf = rqstp->rq_defbuf;
+	/* first reclaim pages that were moved to response list */
+	while (rqstp->rq_resused) 
+		rqstp->rq_argpages[rqstp->rq_arghi++] =
+			rqstp->rq_respages[--rqstp->rq_resused];
+	/* now allocate needed pages.  If we get a failure, sleep briefly */
+	pages = 2 + (serv->sv_bufsz + PAGE_SIZE -1) / PAGE_SIZE;
+	while (rqstp->rq_arghi < pages) {
+		struct page *p = alloc_page(GFP_KERNEL);
+		if (!p) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_timeout(HZ/2);
+			current->state = TASK_RUNNING;
+			continue;
+		}
+		rqstp->rq_argpages[rqstp->rq_arghi++] = p;
+	}
 
+	/* Make arg->head point to first page and arg->pages point to rest */
+	arg = &rqstp->rq_arg;
+	arg->head[0].iov_base = page_address(rqstp->rq_argpages[0]);
+	arg->head[0].iov_len = PAGE_SIZE;
+	rqstp->rq_argused = 1;
+	arg->pages = rqstp->rq_argpages + 1;
+	arg->page_base = 0;
+	/* save at least one page for response */
+	arg->page_len = (pages-2)*PAGE_SIZE;
+	arg->len = (pages-1)*PAGE_SIZE;
+	arg->tail[0].iov_len = 0;
+	
 	if (signalled())
 		return -EINTR;
 
@@ -1109,12 +1169,6 @@ svc_recv(struct svc_serv *serv, struct s
 	rqstp->rq_userset = 0;
 	rqstp->rq_chandle.defer = svc_defer;
 
-	svc_getu32(&rqstp->rq_argbuf, rqstp->rq_xid);
-	svc_putu32(&rqstp->rq_resbuf, rqstp->rq_xid);
-
-	/* Assume that the reply consists of a single buffer. */
-	rqstp->rq_resbuf.nriov = 1;
-
 	if (serv->sv_stats)
 		serv->sv_stats->netcnt++;
 	return len;
@@ -1354,23 +1408,25 @@ static struct cache_deferred_req *
 svc_defer(struct cache_req *req)
 {
 	struct svc_rqst *rqstp = container_of(req, struct svc_rqst, rq_chandle);
-	int size = sizeof(struct svc_deferred_req) + (rqstp->rq_argbuf.buflen << 2);
+	int size = sizeof(struct svc_deferred_req) + (rqstp->rq_arg.head[0].iov_len);
 	struct svc_deferred_req *dr;
 
+	if (rqstp->rq_arg.page_len)
+		return NULL; /* if more than a page, give up FIXME */
 	if (rqstp->rq_deferred) {
 		dr = rqstp->rq_deferred;
 		rqstp->rq_deferred = NULL;
 	} else {
 		/* FIXME maybe discard if size too large */
-		dr = kmalloc(size<<2, GFP_KERNEL);
+		dr = kmalloc(size, GFP_KERNEL);
 		if (dr == NULL)
 			return NULL;
 
 		dr->serv = rqstp->rq_server;
 		dr->prot = rqstp->rq_prot;
 		dr->addr = rqstp->rq_addr;
-		dr->argslen = rqstp->rq_argbuf.buflen;
-		memcpy(dr->args, rqstp->rq_argbuf.base, dr->argslen<<2);
+		dr->argslen = rqstp->rq_arg.head[0].iov_len >> 2;
+		memcpy(dr->args, rqstp->rq_arg.head[0].iov_base, dr->argslen<<2);
 	}
 	spin_lock(&rqstp->rq_server->sv_lock);
 	rqstp->rq_sock->sk_inuse++;
@@ -1388,10 +1444,10 @@ static int svc_deferred_recv(struct svc_
 {
 	struct svc_deferred_req *dr = rqstp->rq_deferred;
 
-	rqstp->rq_argbuf.base = dr->args;
-	rqstp->rq_argbuf.buf  = dr->args;
-	rqstp->rq_argbuf.len  = dr->argslen;
-	rqstp->rq_argbuf.buflen = dr->argslen;
+	rqstp->rq_arg.head[0].iov_base = dr->args;
+	rqstp->rq_arg.head[0].iov_len = dr->argslen<<2;
+	rqstp->rq_arg.page_len = 0;
+	rqstp->rq_arg.len = dr->argslen<<2;
 	rqstp->rq_prot        = dr->prot;
 	rqstp->rq_addr        = dr->addr;
 	return dr->argslen<<2;
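Note (not part of the patch): svc_udp_sendto() and svc_tcp_sendto() above
both flatten the reply xdr_buf (head, then page data starting at page_base,
then an optional tail) into a local iovec array before calling svc_sendto().
A user-space sketch of that walk follows. struct sketch_xdr_buf,
SKETCH_PAGE_SIZE and xdr_to_iovec() are hypothetical names, pages are plain
char pointers instead of struct page, and the capacity check on the output
array is an addition that the patch itself does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical, simplified view of the fields the send path reads. */
struct sketch_xdr_buf {
	struct iovec head;	/* RPC and NFS headers */
	char **pages;		/* stands in for page_address(pages[i]) */
	size_t page_base;	/* offset of the data in the first page */
	size_t page_len;	/* bytes of page data */
	struct iovec tail;	/* optional trailer, iov_len == 0 if unused */
};

/* Fill vec[] from xb. Returns the number of entries used, or -1 if max is too small. */
static int xdr_to_iovec(const struct sketch_xdr_buf *xb, struct iovec *vec, int max)
{
	size_t base = xb->page_base;
	size_t len = xb->page_len;
	int v = 0, pg = 0;

	assert(base < SKETCH_PAGE_SIZE);
	if (max < 1)
		return -1;
	vec[v++] = xb->head;
	while (len) {
		size_t chunk = SKETCH_PAGE_SIZE - base;

		if (chunk > len)
			chunk = len;	/* last page may be only partly used */
		if (v == max)
			return -1;
		vec[v].iov_base = xb->pages[pg++] + base;
		vec[v].iov_len = chunk;
		v++;
		len -= chunk;
		base = 0;		/* only the first page has an offset */
	}
	if (xb->tail.iov_len) {
		if (v == max)
			return -1;
		vec[v++] = xb->tail;
	}
	return v;
}
```

In the worst case the array needs one entry for the head, one per data page
and one for the tail, which is worth keeping in mind when sizing the
on-stack vec[RPCSVC_MAXPAGES] used in the patch.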
--- ./net/sunrpc/svcauth.c	2002/10/24 06:01:17	1.1
+++ ./net/sunrpc/svcauth.c	2002/10/24 06:01:52
@@ -40,8 +40,7 @@ svc_authenticate(struct svc_rqst *rqstp,
 	*statp = rpc_success;
 	*authp = rpc_auth_ok;
 
-	svc_getu32(&rqstp->rq_argbuf, flavor);
-	flavor = ntohl(flavor);
+	flavor = ntohl(svc_getu32(&rqstp->rq_arg.head[0]));
 
 	dprintk("svc: svc_authenticate (%d)\n", flavor);
 	if (flavor >= RPC_AUTH_MAXFLAVOR || !(aops = authtab[flavor])) {
--- ./net/sunrpc/xprt.c	2002/10/24 00:34:53	1.1
+++ ./net/sunrpc/xprt.c	2002/10/24 01:00:36
@@ -655,7 +655,7 @@ skb_read_and_csum_bits(skb_reader_t *des
  * We have set things up such that we perform the checksum of the UDP
  * packet in parallel with the copies into the RPC client iovec.  -DaveM
  */
-static int
+int
 csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb)
 {
 	skb_reader_t desc;
--- ./net/sunrpc/svcauth_unix.c	2002/10/24 06:09:05	1.1
+++ ./net/sunrpc/svcauth_unix.c	2002/10/25 07:14:44
@@ -287,20 +287,20 @@ void svcauth_unix_purge(void)
 static int
 svcauth_null_accept(struct svc_rqst *rqstp, u32 *authp, int proc)
 {
-	struct svc_buf	*argp = &rqstp->rq_argbuf;
-	struct svc_buf	*resp = &rqstp->rq_resbuf;
+	struct iovec	*argv = &rqstp->rq_arg.head[0];
+	struct iovec	*resv = &rqstp->rq_res.head[0];
 	int		rv=0;
 	struct ip_map key, *ipm;
 
-	if ((argp->len -= 3) < 0) {
+	if (argv->iov_len < 3*4)
 		return SVC_GARBAGE;
-	}
-	if (*(argp->buf)++ != 0) {	/* we already skipped the flavor */
+
+	if (svc_getu32(argv) != 0) { 
 		dprintk("svc: bad null cred\n");
 		*authp = rpc_autherr_badcred;
 		return SVC_DENIED;
 	}
-	if (*(argp->buf)++ != RPC_AUTH_NULL || *(argp->buf)++ != 0) {
+	if (svc_getu32(argv) != RPC_AUTH_NULL || svc_getu32(argv) != 0) {
 		dprintk("svc: bad null verf\n");
 		*authp = rpc_autherr_badverf;
 		return SVC_DENIED;
@@ -312,8 +312,8 @@ svcauth_null_accept(struct svc_rqst *rqs
 	rqstp->rq_cred.cr_groups[0] = NOGROUP;
 
 	/* Put NULL verifier */
-	svc_putu32(resp, RPC_AUTH_NULL);
-	svc_putu32(resp, 0);
+	svc_putu32(resv, RPC_AUTH_NULL);
+	svc_putu32(resv, 0);
 
 	key.m_class = rqstp->rq_server->sv_program->pg_class;
 	key.m_addr = rqstp->rq_addr.sin_addr;
@@ -368,64 +368,70 @@ struct auth_ops svcauth_null = {
 int
 svcauth_unix_accept(struct svc_rqst *rqstp, u32 *authp, int proc)
 {
-	struct svc_buf	*argp = &rqstp->rq_argbuf;
-	struct svc_buf	*resp = &rqstp->rq_resbuf;
+	struct iovec	*argv = &rqstp->rq_arg.head[0];
+	struct iovec	*resv = &rqstp->rq_res.head[0];
 	struct svc_cred	*cred = &rqstp->rq_cred;
-	u32		*bufp = argp->buf, slen, i;
-	int		len   = argp->len;
+	u32		slen, i;
+	int		len   = argv->iov_len;
 	int		rv=0;
 	struct ip_map key, *ipm;
 
-	if ((len -= 3) < 0)
+	if ((len -= 3*4) < 0)
 		return SVC_GARBAGE;
 
-	bufp++;					/* length */
-	bufp++;					/* time stamp */
-	slen = XDR_QUADLEN(ntohl(*bufp++));	/* machname length */
-	if (slen > 64 || (len -= slen + 3) < 0)
+	svc_getu32(argv);			/* length */
+	svc_getu32(argv);			/* time stamp */
+	slen = XDR_QUADLEN(ntohl(svc_getu32(argv)));	/* machname length */
+	if (slen > 64 || (len -= (slen + 3)*4) < 0)
 		goto badcred;
-	bufp += slen;				/* skip machname */
-
-	cred->cr_uid = ntohl(*bufp++);		/* uid */
-	cred->cr_gid = ntohl(*bufp++);		/* gid */
+printk("namelen %d name %.*s\n", slen, slen*4, (char*)argv->iov_base);
+	argv->iov_base = (void*)((u32*)argv->iov_base + slen);	/* skip machname */
 
-	slen = ntohl(*bufp++);			/* gids length */
-	if (slen > 16 || (len -= slen + 2) < 0)
+	cred->cr_uid = ntohl(svc_getu32(argv));		/* uid */
+	cred->cr_gid = ntohl(svc_getu32(argv));		/* gid */
+printk("uid=%d gid=%d\n", cred->cr_uid, cred->cr_gid);
+	slen = ntohl(svc_getu32(argv));			/* gids length */
+	printk("%d gids (%d)\n", slen, len);
+	if (slen > 16 || (len -= (slen + 2)*4) < 0)
 		goto badcred;
-	for (i = 0; i < NGROUPS && i < slen; i++)
-		cred->cr_groups[i] = ntohl(*bufp++);
+	for (i = 0; i < slen; i++)
+		if (i < NGROUPS)
+			cred->cr_groups[i] = ntohl(svc_getu32(argv));
+		else
+			svc_getu32(argv);
 	if (i < NGROUPS)
 		cred->cr_groups[i] = NOGROUP;
-	bufp += (slen - i);
+	printk("..got %d\n", i);
 
-	if (*bufp++ != RPC_AUTH_NULL || *bufp++ != 0) {
+	if (svc_getu32(argv) != RPC_AUTH_NULL || svc_getu32(argv) != 0) {
+		printk("nogo\n");
 		*authp = rpc_autherr_badverf;
 		return SVC_DENIED;
 	}
 
-	argp->buf = bufp;
-	argp->len = len;
-
 	/* Put NULL verifier */
-	svc_putu32(resp, RPC_AUTH_NULL);
-	svc_putu32(resp, 0);
+	svc_putu32(resv, RPC_AUTH_NULL);
+	svc_putu32(resv, 0);
+	printk("put NULL\n");
 
 	key.m_class = rqstp->rq_server->sv_program->pg_class;
 	key.m_addr = rqstp->rq_addr.sin_addr;
 
+	printk("key is <%s>, %x\n", key.m_class, key.m_addr.s_addr);
+
 	ipm = ip_map_lookup(&key, 0);
 
 	rqstp->rq_client = NULL;
-
+	printk(ipm?"Yes\n": "No\n");
 	if (ipm)
 		switch (cache_check(&ip_map_cache, &ipm->h, &rqstp->rq_chandle)) {
-		case -EAGAIN:
+		case -EAGAIN:printk("EAGAIN\n");
 			rv = SVC_DROP;
 			break;
-		case -ENOENT:
+		case -ENOENT:printk("NOENT\n");
 			rv = SVC_OK; /* rq_client is NULL */
 			break;
-		case 0:
+		case 0: printk("Zero\n");
 			rqstp->rq_client = &ipm->m_client->h;
 			cache_get(&rqstp->rq_client->h);
 			ip_map_put(&ipm->h, &ip_map_cache);
@@ -434,7 +440,7 @@ svcauth_unix_accept(struct svc_rqst *rqs
 		default: BUG();
 		}
 	else rv = SVC_DROP;
-
+	if (rqstp->rq_client==NULL) printk("clinet NULL and proc %d\n", proc);
 	if (rqstp->rq_client == NULL && proc != 0)
 		goto badcred;
 	return rv;
--- ./kernel/ksyms.c	2002/10/24 01:33:59	1.1
+++ ./kernel/ksyms.c	2002/10/24 01:34:08
@@ -254,7 +254,9 @@ EXPORT_SYMBOL(find_inode_number);
 EXPORT_SYMBOL(is_subdir);
 EXPORT_SYMBOL(get_unused_fd);
 EXPORT_SYMBOL(vfs_read);
+EXPORT_SYMBOL(vfs_readv);
 EXPORT_SYMBOL(vfs_write);
+EXPORT_SYMBOL(vfs_writev);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);


-------------------------------------------------------
This sf.net email is sponsored by: Influence the future 
of Java(TM) technology. Join the Java Community 
Process(SM) (JCP(SM)) program now. 
http://ads.sourceforge.net/cgi-bin/redirect.pl?sunm0004en
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-25  9:52       ` Hirokazu Takahashi
  2002-10-25 12:41         ` Neil Brown
@ 2002-10-25 17:23         ` Trond Myklebust
  2002-10-26  3:26           ` Hirokazu Takahashi
  1 sibling, 1 reply; 98+ messages in thread
From: Trond Myklebust @ 2002-10-25 17:23 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:


    >> In particular, I think it would be good to use 'struct xdr_buf'
    >> from sunrpc/xdr.h instead of svc_buf.  This is what the nfs
    >> client uses and we could share some of the infrastructure.

     > I just realized it would be hard to use the xdr_buf as it
     > couldn't handle data in a socket buffer. Each socket buffer
     > consists of some non-page data and some pages and each of them
     > might have its own offset and length.

Then the following trivial modification would be quite sufficient

struct xdr_buf {
        struct list_head list;          /* Further xdr_buf */
        struct iovec    head[1],        /* RPC header + non-page data */
                        tail[1];        /* Appended after page data */

        struct page **  pages;          /* Array of contiguous pages */
        unsigned int    page_base,      /* Start of page data */
                        page_len;       /* Length of page data */

        unsigned int    len;            /* Total length of data */

};

With equally trivial fixes to xdr_kmap() and friends. None of this
needs to affect existing client usage, and may in fact be useful for
optimizing use of v4 COMPOUNDS later.
(I was wrong about this BTW: being able to flush out all the dirty
pages in a file to disk using a single COMPOUND would indeed be worth
the trouble once we've managed to drop UDP as the primary NFS
transport mechanism. For one thing, you would only tie up a single
nfsd thread when writing to the file)
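
A minimal userspace sketch of the chained layout proposed above, assuming a plain `next` pointer in place of the kernel's `struct list_head`. The names `xdr_buf_sketch`, `iovec_sketch` and `xdr_chain_len` are illustrative only and are not part of the sunrpc code; the point is just that a consumer walks the chain and sums head, page and tail lengths per buffer.

```c
#include <stddef.h>

/* Stand-in for struct iovec so the sketch needs no kernel headers. */
struct iovec_sketch {
	void	*iov_base;
	size_t	iov_len;
};

/* Hypothetical, simplified version of the proposed struct xdr_buf. */
struct xdr_buf_sketch {
	struct xdr_buf_sketch	*next;		/* replaces "struct list_head list" */
	struct iovec_sketch	head[1];	/* RPC header + non-page data */
	struct iovec_sketch	tail[1];	/* appended after page data */
	unsigned int		page_len;	/* length of page data */
};

/* Sum the payload carried by every buffer in the chain. */
static size_t xdr_chain_len(const struct xdr_buf_sketch *buf)
{
	size_t total = 0;

	for (; buf != NULL; buf = buf->next)
		total += buf->head[0].iov_len + buf->page_len +
			 buf->tail[0].iov_len;
	return total;
}
```

With a single buffer in the chain this reduces to the existing head + pages + tail accounting, which is why existing client usage would not need to change.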

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-25 12:41         ` Neil Brown
@ 2002-10-26  3:11           ` Hirokazu Takahashi
  2002-10-26  3:46             ` Benjamin LaHaise
  2002-10-30 23:29           ` Hirokazu Takahashi
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-26  3:11 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello

> > > I have been thinking some more about this, trying to understand the
> > > big picture, and I'm afraid that I think I want some more changes.
> > > 
> > > In particular, I think it would be good to use 'struct xdr_buf' from
> > > sunrpc/xdr.h instead of svc_buf.  This is what the nfs client uses and
> > > we could share some of the infrastructure.
> > 
> > I just realized it would be hard to use the xdr_buf as it couldn't
> > handle data in a socket buffer. Each socket buffer consists of
> > some non-page data and some pages and each of them might have its
> > own offset and length.
> 
> You would only want this for single-copy write request  - right?

Yes.

> I think we have to treat them as a special case and pass the skbuf all
> the way up to nfsd in that case.
> You would only want to try this if:
>    The NIC had verified the checksum
>    The packet was some minimum size (1K? 1 PAGE ??)
>    We were using AUTH_UNIX, nothing more interesting like crypto
>      security
>    The first fragment was some minimum size (size of a write without
>    the data).
> 
> I would make a special 'fast-path' for that case which didn't copy any
> data but passed a skbuf up, and code in nfs*xdr.c would convert that
> into an iovec[];

In my implementation only the sunrpc layer handles the skbuff, and the
nfsd layer is kept away from its details. I felt this approach
was not bad.

Yes, your approach is also good and will work fine.
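
The fast-path conditions listed in the quoted mail can be read as a single predicate. The sketch below is only an illustration: `rx_info_sketch`, `fastpath_ok` and both threshold constants are hypothetical, and the mail itself leaves the minimum sizes open ("1K? 1 PAGE ??").

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical auth flavours; only "is it AUTH_UNIX" matters here. */
enum auth_flavor_sketch { FLAVOR_NULL, FLAVOR_UNIX, FLAVOR_OTHER };

/* Hypothetical summary of a received request. */
struct rx_info_sketch {
	bool			hw_csum_ok;	/* NIC verified the checksum */
	size_t			pkt_len;	/* total packet length */
	size_t			first_frag_len;	/* length of the first fragment */
	enum auth_flavor_sketch	flavor;
};

/* Assumed thresholds; the thread does not fix these values. */
#define MIN_PKT_LEN_SKETCH	1024	/* "some minimum size" */
#define MIN_WRITE_HDR_SKETCH	128	/* "size of a write without the data" */

/* True only if every condition for the no-copy path holds. */
static bool fastpath_ok(const struct rx_info_sketch *rx)
{
	return rx->hw_csum_ok &&
	       rx->pkt_len >= MIN_PKT_LEN_SKETCH &&
	       rx->flavor == FLAVOR_UNIX &&
	       rx->first_frag_len >= MIN_WRITE_HDR_SKETCH;
}
```

Anything that fails the predicate would simply take the existing copying path.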

> I am working on a patch which changes rpcsvc to use xdr_buf.  Some of
> it works.  Some doesn't.  I include it below for your reference.  I
> repeat: it doesn't work yet.
> Once it is done, adding the rest of zero-copy should be fairly easy.

OK.

It's good that you're implementing vfs_readv and vfs_writev, although
I've also realized they don't support aio yet.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-25 17:23         ` Trond Myklebust
@ 2002-10-26  3:26           ` Hirokazu Takahashi
  0 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-26  3:26 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

> Then the following trivial modification would be quite sufficient

Yes, it looks good as it's rare to use two or more xdr_bufs.
We can allocate extra xdr_bufs dynamically.

> struct xdr_buf {
>         struct list_head list;          /* Further xdr_buf */
>         struct iovec    head[1],        /* RPC header + non-page data */
>                         tail[1];        /* Appended after page data */
> 
>         struct page **  pages;          /* Array of contiguous pages */
>         unsigned int    page_base,      /* Start of page data */
>                         page_len;       /* Length of page data */
> 
>         unsigned int    len;            /* Total length of data */
> 
> };
> 
> With equally trivial fixes to xdr_kmap() and friends. None of this
> needs to affect existing client usage, and may in fact be useful for
> optimizing use of v4 COMPOUNDS later.
> (I was wrong about this BTW: being able to flush out all the dirty
> pages in a file to disk using a single COMPOUND would indeed be worth
> the trouble once we've managed to drop UDP as the primary NFS
> transport mechanism. For one thing, you would only tie up a single
> nfsd thread when writing to the file)



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-26  3:11           ` Hirokazu Takahashi
@ 2002-10-26  3:46             ` Benjamin LaHaise
  2002-10-27 22:46               ` Neil Brown
  0 siblings, 1 reply; 98+ messages in thread
From: Benjamin LaHaise @ 2002-10-26  3:46 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs

On Sat, Oct 26, 2002 at 12:11:50PM +0900, Hirokazu Takahashi wrote:
> OK.
> 
> It's good that you're implementing vfs_readv and vfs_writev, although
> I've also realized they don't support aio yet.

The aio methods are soon switching over to vectored operations for a 
few reasons.  It's likely that non-vectored methods will be gone soon.

		-ben
-- 
"Do you seek knowledge in time travel?"



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-24  1:33           ` Hirokazu Takahashi
@ 2002-10-27 10:39             ` Hirokazu Takahashi
  2002-10-28 16:31               ` Trond Myklebust
  0 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-27 10:39 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

> >      > I was thinking about the nfs clients.  Why don't we make
> >      > xprt_sendmsg() use the sendpage interface instead of calling
> >      > sock_sendmsg() so that we can avoid dead-lock which multiple
> >      > kmap()s in xprt_sendmsg() might cause on heavily loaded
> >      > machines.
> > 
> > I'm definitely in favour of such a change. Particularly so if the UDP
> > interface is ready.

I just modified xprt_sendmsg() to use the sendpage interface.
I've checked that it works fine on both TCP and UDP.

I think this code needs to be cleaned up, but I don't have any good ideas
about how.
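
One possible way to tidy the page loop is to separate the arithmetic from the socket calls. The following is a userspace sketch under that assumption, not the patch's code: `split_pages`, `page_chunk_sketch` and `SKETCH_PAGE_SIZE` are made-up names, and the page size is assumed to be 4096. It turns `page_base`, `page_len` and the number of already-sent bytes (`skip`) into (page index, offset, length) triples, one per sendpage call.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* One piece of page data that a single sendpage call would cover. */
struct page_chunk_sketch {
	size_t	page;	/* index into the pages array */
	size_t	offset;	/* byte offset inside that page */
	size_t	len;	/* bytes to send from that page */
};

/*
 * Split the page data [page_base + skip, page_base + page_len) into
 * per-page chunks.  Returns the number of chunks written (at most max).
 */
static size_t split_pages(size_t page_base, size_t page_len, size_t skip,
			  struct page_chunk_sketch *out, size_t max)
{
	size_t n = 0, pos, remaining;

	if (skip >= page_len)
		return 0;	/* page data already sent completely */

	pos = page_base + skip;
	remaining = page_len - skip;
	while (remaining > 0 && n < max) {
		size_t off = pos % SKETCH_PAGE_SIZE;
		size_t len = SKETCH_PAGE_SIZE - off;

		if (len > remaining)
			len = remaining;
		out[n].page = pos / SKETCH_PAGE_SIZE;
		out[n].offset = off;
		out[n].len = len;
		n++;
		pos += len;
		remaining -= len;
	}
	return n;
}
```

A sender built on this would loop over the chunks, pass MSG_MORE on every chunk except the last piece of the request, and stop at the first short write.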


Thank you,
Hirokazu Takahashi.



--- linux/net/sunrpc/xdr.c.ORG	Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c	Sun Oct 27 19:07:05 2030
@@ -110,12 +110,15 @@ xdr_encode_pages(struct xdr_buf *xdr, st
 	xdr->page_len = len;
 
 	if (len & 3) {
-		struct iovec *iov = xdr->tail;
 		unsigned int pad = 4 - (len & 3);
-
-		iov->iov_base = (void *) "\0\0\0";
-		iov->iov_len  = pad;
 		len += pad;
+		if (((base + len) & ~PAGE_CACHE_MASK) + pad <= PAGE_CACHE_SIZE) {
+			xdr->page_len += pad;
+		} else {
+			struct iovec *iov = xdr->tail;
+			iov->iov_base = (void *) "\0\0\0";
+			iov->iov_len  = pad;
+		}
 	}
 	xdr->len += len;
 }
--- linux/net/sunrpc/xprt.c.ORG	Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c	Sun Oct 27 19:07:38 2030
@@ -60,6 +60,7 @@
 #include <linux/unistd.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 #include <net/sock.h>
 #include <net/checksum.h>
@@ -207,48 +208,107 @@ xprt_release_write(struct rpc_xprt *xprt
 	spin_unlock_bh(&xprt->sock_lock);
 }
 
+static inline int
+__xprt_sendmsg(struct socket *sock, struct xdr_buf *xdr, struct msghdr *msg, size_t skip)
+{
+	unsigned int	slen = xdr->len - skip;
+	mm_segment_t	oldfs;
+	int		result = 0;
+	struct page	**ppage = xdr->pages;
+	unsigned int	len, pglen = xdr->page_len;
+	size_t		base = 0;
+	int		flags;
+	int		ret;
+	struct iovec	niv;
+
+	msg->msg_iov	= &niv;
+	msg->msg_iovlen	= 1;
+
+	if (xdr->head[0].iov_len > skip) {
+		len = xdr->head[0].iov_len - skip;
+		niv.iov_base = xdr->head[0].iov_base + skip;
+		niv.iov_len = len;
+		if (slen > len)
+			msg->msg_flags |= MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, msg, len);
+		set_fs(oldfs);
+		if (result != len)
+			return result;
+		slen -= len;
+		skip = 0;
+	} else {
+		skip -= xdr->head[0].iov_len;
+	}
+	if (pglen == 0)
+		goto send_tail;
+	if (skip >= pglen) {
+		skip -= pglen;
+		goto send_tail;
+	}
+	if (skip || xdr->page_base) {
+		pglen -= skip;
+		base = xdr->page_base + skip;
+		ppage += base >> PAGE_CACHE_SHIFT;
+		base &= ~PAGE_CACHE_MASK;
+	}
+	len = PAGE_CACHE_SIZE - base;
+	if (len > pglen) len = pglen;
+	flags = MSG_MORE;
+	while (pglen > 0) {
+		if (slen == len)
+			flags = 0;
+		ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		if (ret > 0)
+			result += ret;
+		if (ret != len) {
+			if (result == 0)
+				result = ret;
+			return result;
+		}
+		slen -= len;
+		pglen -= len;
+		len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+		base = 0;
+		ppage++;
+	}
+	skip = 0;
+send_tail:
+	if (xdr->tail[0].iov_len) {
+		niv.iov_base = xdr->tail[0].iov_base + skip;
+		niv.iov_len = xdr->tail[0].iov_len - skip;
+		msg->msg_flags &= ~MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		ret = sock_sendmsg(sock, msg, niv.iov_len);
+		set_fs(oldfs);
+		if (ret > 0)
+			result += ret;
+		if (result == 0)
+			result = ret;
+	}
+	return result;
+}
+
 /*
  * Write data to socket.
  */
 static inline int
 xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
 {
-	struct socket	*sock = xprt->sock;
 	struct msghdr	msg;
-	struct xdr_buf	*xdr = &req->rq_snd_buf;
-	struct iovec	niv[MAX_IOVEC];
-	unsigned int	niov, slen, skip;
-	mm_segment_t	oldfs;
 	int		result;
 
-	if (!sock)
-		return -ENOTCONN;
-
-	xprt_pktdump("packet data:",
-				req->rq_svec->iov_base,
-				req->rq_svec->iov_len);
-
-	/* Dont repeat bytes */
-	skip = req->rq_bytes_sent;
-	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
-
 	msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov	= niv;
-	msg.msg_iovlen	= niov;
 	msg.msg_name	= (struct sockaddr *) &xprt->addr;
 	msg.msg_namelen = sizeof(xprt->addr);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
-	set_fs(oldfs);
-
-	xdr_kunmap(xdr, skip);
+	result = __xprt_sendmsg(xprt->sock, &req->rq_snd_buf, &msg, req->rq_bytes_sent);
 
-	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);
+	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);
 
 	if (result >= 0)
 		return result;



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.36
  2002-10-24 15:32                                 ` Andrew Theurer
@ 2002-10-27 11:10                                   ` Hirokazu Takahashi
  0 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-27 11:10 UTC (permalink / raw)
  To: habanero; +Cc: nfs

Hi,

>> Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
>> been retransmitted?

You can also check the client side.
/proc/net/rpc/nfs
net 0 0 0 0
rpc 191035 4339 0
           ^^^^
This field shows us how many times the client has retransmitted RPC requests.
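
For scripted checks, the retransmission count can be pulled out of that text with a few lines of C. This is a sketch only: `parse_rpc_retrans` is a made-up helper, it works on a buffer the caller has already read from /proc/net/rpc/nfs, and it assumes the first number on the "rpc" line is the total call count and the second is the retransmission count marked above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the line starting with "rpc " and read its first two counters.
 * Returns 0 on success, -1 if no usable "rpc" line is present.
 */
static int parse_rpc_retrans(const char *text,
			     unsigned long *calls, unsigned long *retrans)
{
	const char *p = text;

	while (p != NULL && *p != '\0') {
		if (strncmp(p, "rpc ", 4) == 0 &&
		    sscanf(p + 4, "%lu %lu", calls, retrans) == 2)
			return 0;
		p = strchr(p, '\n');	/* advance to the next line */
		if (p != NULL)
			p++;
	}
	return -1;
}
```

Sampling the counter before and after a benchmark run and taking the difference gives the retransmissions caused by that run.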

Thank you,
Hirokazu Takahashi.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-26  3:46             ` Benjamin LaHaise
@ 2002-10-27 22:46               ` Neil Brown
  0 siblings, 0 replies; 98+ messages in thread
From: Neil Brown @ 2002-10-27 22:46 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Hirokazu Takahashi, nfs

On Friday October 25, bcrl@redhat.com wrote:
> On Sat, Oct 26, 2002 at 12:11:50PM +0900, Hirokazu Takahashi wrote:
> > OK.
> > 
> > It's good that you're implementing vfs_readv and vfs_writev, although
> > I've also realized they don't support aio yet.
> 
> The aio methods are soon switching over to vectored operations for a 
> few reasons.  It's likely that non-vectored methods will be gone soon.

If you are introducing new 'vectored' operations, it would be nice if
they work well for kernel-space as well as user-space.

In the 'old days' before CONFIG_HIGHMEM, you could just

	oldfs = get_fs(); set_fs(KERNEL_DS);
        ...whatever....
	set_fs(oldfs);

to use kernel addresses.  But with CONFIG_HIGHMEM the kernel often
wants to work with "struct page *" instead of just a "void *",
so this doesn't always work.
It would be nice if you could pass in an 'actor' which for user-space
access would call copy-to/from-user and for kernel-space would do
kmap/copy/kunmap.
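
A rough userspace sketch of that 'actor' idea, with hypothetical names throughout (`copy_actor_t`, `vectored_copy`, `memcpy_actor`, `seg_sketch`). The vectored loop does not know how bytes reach the destination; it only calls the actor. In the kernel one actor might wrap copy_to_user() and another kmap()/memcpy()/kunmap(); here a plain memcpy() stands in for both.

```c
#include <stddef.h>
#include <string.h>

/* Moves len bytes from src to offset off of the destination context. */
typedef size_t (*copy_actor_t)(void *dst_ctx, size_t off,
			       const void *src, size_t len);

/* One source segment of the vectored operation. */
struct seg_sketch {
	const void	*base;
	size_t		len;
};

/* Actor for a directly addressable destination buffer. */
static size_t memcpy_actor(void *dst_ctx, size_t off,
			   const void *src, size_t len)
{
	memcpy((char *)dst_ctx + off, src, len);
	return len;
}

/* Feed every segment to the actor; stop at the first short copy. */
static size_t vectored_copy(const struct seg_sketch *segs, int nsegs,
			    void *dst_ctx, copy_actor_t actor)
{
	size_t done = 0;
	int i;

	for (i = 0; i < nsegs; i++) {
		size_t n = actor(dst_ctx, done, segs[i].base, segs[i].len);

		done += n;
		if (n != segs[i].len)
			break;
	}
	return done;
}
```

The caller picks the actor once per request, so the same vectored code path would serve both user-space and kernel-space buffers.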

Just a thought.....

NeilBrown



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-27 10:39             ` Hirokazu Takahashi
@ 2002-10-28 16:31               ` Trond Myklebust
  2002-10-28 23:39                 ` Hirokazu Takahashi
  2002-10-29  6:36                 ` Hirokazu Takahashi
  0 siblings, 2 replies; 98+ messages in thread
From: Trond Myklebust @ 2002-10-28 16:31 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > --- linux/net/sunrpc/xdr.c.ORG Sat Oct 26 21:21:16 2030
     > +++ linux/net/sunrpc/xdr.c Sun Oct 27 19:07:05 2030
     > @@ -110,12 +110,15 @@ xdr_encode_pages(struct xdr_buf *xdr, st
     >  	xdr->page_len = len;
     >
     >  	if (len & 3) {
     > -		struct iovec *iov = xdr->tail;
     >  		unsigned int pad = 4 - (len & 3);
     > -
     > -		iov->iov_base = (void *) "\0\0\0";
     > -		iov->iov_len  = pad;
     >  		len += pad;
     > +		if (((base + len) & ~PAGE_CACHE_MASK) + pad <= PAGE_CACHE_SIZE) {
     > +			xdr->page_len += pad;

No!!! I believe I told you quite explicitly earlier:

 - RFC1832 states that *all* variable length data must be padded with
   zeros, and that is certainly not the case if the pages you are
   pointing to are in the page cache.

 - Worse: That data is not even guaranteed to have been initialized.
   In effect this means that your 'optimization' is leaking random
   data from the kernel and onto the internet. In security-conscious
   circles this is not considered a good thing...

Please leave that padding so that it *always* returns zeros...
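
To make the requirement concrete, here is a small userspace sketch of XDR-style padding that always emits explicit zero bytes rather than whatever follows the data in memory. `xdr_pad_len` and `xdr_put_opaque` are illustrative names, not functions from net/sunrpc/xdr.c.

```c
#include <stddef.h>
#include <string.h>

/* Bytes needed to round len up to a multiple of 4 (0..3). */
static size_t xdr_pad_len(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/*
 * Copy len bytes of opaque data into out and follow them with explicit
 * zero padding.  out must have room for len + 3 bytes.  Returns the
 * number of bytes written.
 */
static size_t xdr_put_opaque(unsigned char *out, const void *data, size_t len)
{
	size_t pad = xdr_pad_len(len);

	memcpy(out, data, len);
	memset(out + len, 0, pad);	/* never reuse adjacent memory as padding */
	return len + pad;
}
```

Extending `page_len` over the bytes that happen to follow the data in a page-cache page skips the memset step, which is exactly how stale or uninitialized bytes would end up on the wire.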

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-28 16:31               ` Trond Myklebust
@ 2002-10-28 23:39                 ` Hirokazu Takahashi
  2002-10-29  6:36                 ` Hirokazu Takahashi
  1 sibling, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-28 23:39 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

>  - RFC1832 states that *all* variable length data must be padded with
>    zeros, and that is certainly not the case if the pages you are
>    pointing to are in the page cache.

Yes, you're right.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-28 16:31               ` Trond Myklebust
  2002-10-28 23:39                 ` Hirokazu Takahashi
@ 2002-10-29  6:36                 ` Hirokazu Takahashi
  2002-10-29 15:09                   ` Trond Myklebust
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-29  6:36 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

>  - RFC1832 states that *all* variable length data must be padded with
>    zeros, and that is certainly not the case if the pages you are
>    pointing to are in the page cache.

I've changed my approach.

Shall we use ZERO_PAGE to pad RPC requests, for the sake of
performance?  Using non-page data is a little inefficient because
the skbuff implementation doesn't allow non-page data to be appended
to a skbuff which already has pages. Only pages can be appended to it.
If we didn't use a page, the TCP/IP stack would allocate a new page to
store the small zero-padded data.

The last page is never coalesced with the data of the next RPC request
in the case of UDP, while it might be on TCP.


What do you think of this approach?

Thank you,
Hirokazu Takahashi.


--- linux/include/linux/sunrpc/xdr.h.ORG	Sun Oct 27 17:56:07 2030
+++ linux/include/linux/sunrpc/xdr.h	Tue Oct 29 14:30:48 2030
@@ -48,12 +48,15 @@ typedef int	(*kxdrproc_t)(void *rqstp, u
  * operations and/or has a need for scatter/gather involving pages.
  */
 struct xdr_buf {
-	struct iovec	head[1],	/* RPC header + non-page data */
-			tail[1];	/* Appended after page data */
+	struct iovec	head[1];	/* RPC header + non-page data */
+	struct page *	head_page;	/* Page for head if needed */
 
 	struct page **	pages;		/* Array of contiguous pages */
 	unsigned int	page_base,	/* Start of page data */
 			page_len;	/* Length of page data */
+
+	struct iovec	tail[1];	/* Appended after page data */
+	struct page *	tail_page;	/* Page for tail if needed */
 
 	unsigned int	len;		/* Total length of data */
 
--- linux/net/sunrpc/xdr.c.ORG	Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c	Tue Oct 29 14:20:52 2030
@@ -113,8 +113,9 @@ xdr_encode_pages(struct xdr_buf *xdr, st
 		struct iovec *iov = xdr->tail;
 		unsigned int pad = 4 - (len & 3);
 
-		iov->iov_base = (void *) "\0\0\0";
+		iov->iov_base = (void *)0;
 		iov->iov_len  = pad;
+		xdr->tail_page = sunrpc_get_zeropage();
 		len += pad;
 	}
 	xdr->len += len;
--- linux/net/sunrpc/xprt.c.ORG	Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c	Tue Oct 29 14:22:14 2030
@@ -60,6 +60,7 @@
 #include <linux/unistd.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 #include <net/sock.h>
 #include <net/checksum.h>
@@ -207,48 +208,101 @@ xprt_release_write(struct rpc_xprt *xprt
 	spin_unlock_bh(&xprt->sock_lock);
 }
 
+static inline int
+__xprt_sendmsg(struct socket *sock, struct xdr_buf *xdr, struct msghdr *msg, size_t skip)
+{
+	unsigned int	slen = xdr->len - skip;
+	mm_segment_t	oldfs;
+	int		result = 0;
+	struct page	**ppage = xdr->pages;
+	unsigned int	len, pglen = xdr->page_len;
+	size_t		base = 0;
+	int		flags;
+	int		ret;
+	struct iovec	niv;
+
+	msg->msg_iov	= &niv;
+	msg->msg_iovlen	= 1;
+
+	if (xdr->head[0].iov_len > skip) {
+		len = xdr->head[0].iov_len - skip;
+		niv.iov_base = xdr->head[0].iov_base + skip;
+		niv.iov_len = len;
+		if (slen > len)
+			msg->msg_flags |= MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, msg, len);
+		set_fs(oldfs);
+		if (result != len)
+			return result;
+		slen -= len;
+		skip = 0;
+	} else {
+		skip -= xdr->head[0].iov_len;
+	}
+	if (pglen == 0)
+		goto send_tail;
+	if (skip >= pglen) {
+		skip -= pglen;
+		goto send_tail;
+	}
+	if (skip || xdr->page_base) {
+		pglen -= skip;
+		base = xdr->page_base + skip;
+		ppage += base >> PAGE_CACHE_SHIFT;
+		base &= ~PAGE_CACHE_MASK;
+	}
+	len = PAGE_CACHE_SIZE - base;
+	if (len > pglen) len = pglen;
+	flags = MSG_MORE;
+	while (pglen > 0) {
+		if (slen == len)
+			flags = 0;
+		ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		if (ret > 0)
+			result += ret;
+		if (ret != len) {
+			if (result == 0)
+				result = ret;
+			return result;
+		}
+		slen -= len;
+		pglen -= len;
+		len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+		base = 0;
+		ppage++;
+	}
+	skip = 0;
+send_tail:
+	if (xdr->tail[0].iov_len) {
+		ret = sock->ops->sendpage(sock, xdr->tail_page, (int)xdr->tail[0].iov_base + skip, xdr->tail[0].iov_len - skip, 0);
+		if (ret > 0)
+			result += ret;
+		if (result == 0)
+			result = ret;
+	}
+	return result;
+}
+
 /*
  * Write data to socket.
  */
 static inline int
 xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
 {
-	struct socket	*sock = xprt->sock;
 	struct msghdr	msg;
-	struct xdr_buf	*xdr = &req->rq_snd_buf;
-	struct iovec	niv[MAX_IOVEC];
-	unsigned int	niov, slen, skip;
-	mm_segment_t	oldfs;
 	int		result;
 
-	if (!sock)
-		return -ENOTCONN;
-
-	xprt_pktdump("packet data:",
-				req->rq_svec->iov_base,
-				req->rq_svec->iov_len);
-
-	/* Dont repeat bytes */
-	skip = req->rq_bytes_sent;
-	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
-
 	msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov	= niv;
-	msg.msg_iovlen	= niov;
 	msg.msg_name	= (struct sockaddr *) &xprt->addr;
 	msg.msg_namelen = sizeof(xprt->addr);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
-	set_fs(oldfs);
-
-	xdr_kunmap(xdr, skip);
+	result = __xprt_sendmsg(xprt->sock, &req->rq_snd_buf, &msg, req->rq_bytes_sent);
 
-	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);
+	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);
 
 	if (result >= 0)
 		return result;
--- linux/net/sunrpc/sunrpc_syms.c.ORG	Tue Oct 29 14:18:45 2030
+++ linux/net/sunrpc/sunrpc_syms.c	Tue Oct 29 14:15:27 2030
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(auth_unix_lookup);
 EXPORT_SYMBOL(cache_check);
 EXPORT_SYMBOL(cache_clean);
 EXPORT_SYMBOL(cache_flush);
+EXPORT_SYMBOL(cache_purge);
 EXPORT_SYMBOL(cache_fresh);
 EXPORT_SYMBOL(cache_init);
 EXPORT_SYMBOL(cache_register);
@@ -130,6 +131,36 @@ EXPORT_SYMBOL(nfsd_debug);
 EXPORT_SYMBOL(nlm_debug);
 #endif
 
+/* RPC general use */
+EXPORT_SYMBOL(sunrpc_get_zeropage);
+
+static struct page *sunrpc_zero_page;
+
+struct page *
+sunrpc_get_zeropage(void)
+{
+	return sunrpc_zero_page;
+}
+
+static int __init
+sunrpc_init_zeropage(void)
+{
+	sunrpc_zero_page = alloc_page(GFP_ATOMIC);
+	if (sunrpc_zero_page == NULL) {
+		printk(KERN_ERR "RPC: couldn't allocate zero_page.\n");
+		return 1;
+	}
+	clear_page(page_address(sunrpc_zero_page));
+	return 0;
+}
+
+static void __exit
+sunrpc_cleanup_zeropage(void)
+{
+	put_page(sunrpc_zero_page);
+	sunrpc_zero_page = NULL;
+}
+
 static int __init
 init_sunrpc(void)
 {
@@ -141,12 +172,14 @@ init_sunrpc(void)
 #endif
 	cache_register(&auth_domain_cache);
 	cache_register(&ip_map_cache);
+	sunrpc_init_zeropage();
 	return 0;
 }
 
 static void __exit
 cleanup_sunrpc(void)
 {
+	sunrpc_cleanup_zeropage();
 	cache_unregister(&auth_domain_cache);
 	cache_unregister(&ip_map_cache);
 #ifdef RPC_DEBUG
--- linux/include/linux/sunrpc/types.h.ORG	Tue Oct 29 11:31:13 2030
+++ linux/include/linux/sunrpc/types.h	Tue Oct 29 11:37:49 2030
@@ -13,10 +13,14 @@
 #include <linux/workqueue.h>
 #include <linux/sunrpc/debug.h>
 #include <linux/list.h>
+#include <linux/mm.h>
 
 /*
  * Shorthands
  */
 #define signalled()		(signal_pending(current))
+
+extern struct page * sunrpc_get_zeropage(void);
+
 
 #endif /* _LINUX_SUNRPC_TYPES_H_ */



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-29  6:36                 ` Hirokazu Takahashi
@ 2002-10-29 15:09                   ` Trond Myklebust
  2002-10-29 16:27                     ` Hirokazu Takahashi
  2002-10-30  3:18                     ` Hirokazu Takahashi
  0 siblings, 2 replies; 98+ messages in thread
From: Trond Myklebust @ 2002-10-29 15:09 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > Shall we use ZERO_PAGE to pad RPC requests for the purpose of
     > its performance?  Using non-page data is a little inefficient
     > as the implementation of skbuff doesn't allow to append
     > non-page data to a skbuff which already has pages. Only pages
     > can be appended to it.  If we didn't, TCP/IP stack would
     > allocate a new page to store the small zero-padded data.

Hmmm... What if we just drop actually storing a pointer to the
ZERO_PAGE? Instead, define the convention that

if (xdr_buf->tail[0].iov_base == NULL)
	padding = xdr_buf->tail[0].iov_len;

and just have xprt_sendmsg() magically append 'padding' bytes from
your ZERO_PAGE.

Unless, of course, you've got another use for the head_page/tail_page?
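
A userspace sketch of that convention, assuming a small static zero block in place of the shared zero page. `tail_iov_sketch`, `zero_block` and `emit_tail` are hypothetical names; a real sender would call sendpage on the zero page instead of copying.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the tail iovec of an xdr_buf. */
struct tail_iov_sketch {
	const void	*iov_base;	/* NULL means "zero padding" */
	size_t		iov_len;
};

/* Stands in for the zero page; XDR padding is at most 3 bytes. */
static const unsigned char zero_block[4];

/*
 * Append the tail to out.  A NULL iov_base is read as "iov_len bytes of
 * padding", which are taken from zero_block.  Returns bytes written.
 */
static size_t emit_tail(unsigned char *out, const struct tail_iov_sketch *tail)
{
	size_t done = 0;

	if (tail->iov_base != NULL) {
		memcpy(out, tail->iov_base, tail->iov_len);
		return tail->iov_len;
	}
	while (done < tail->iov_len) {
		size_t n = tail->iov_len - done;

		if (n > sizeof(zero_block))
			n = sizeof(zero_block);
		memcpy(out + done, zero_block, n);
		done += n;
	}
	return done;
}
```

The encoder then only records the pad length in the tail, and no extra page pointers are needed in the buffer structure.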

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-29 15:09                   ` Trond Myklebust
@ 2002-10-29 16:27                     ` Hirokazu Takahashi
  2002-10-29 16:49                       ` Trond Myklebust
  2002-10-30  3:18                     ` Hirokazu Takahashi
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-29 16:27 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

Thank you for your reply.

>      > Shall we use ZERO_PAGE to pad RPC requests for the purpose of
>      > its performance?  Using non-page data is a little inefficient
>      > as the implementation of skbuff doesn't allow appending
>      > non-page data to a skbuff which already has pages. Only pages
>      > can be appended to it.  If we didn't, TCP/IP stack would
>      > allocate a new page to store the small zero-padded data.
> 
> Hmmm... What if we just drop actually storing a pointer to the
> ZERO_PAGE? Instead, define the convention that
>
> if (xdr_buf->tail[0].iov_base == NULL)
> 	padding = xdr_buf->tail[0].iov_len;
> 
> and just have xprt_sendmsg() magically append 'padding' bytes from
> your ZERO_PAGE.

Yes, it's possible.
OK, I'll modify it.

> Unless, of course, you've got another use for the head_page/tail_page?

I just wanted to make it general.
I guessed head_page (or head_pages) might be useful for big NFSv4
COMPOUND messages as we could send a head without any copies.
But it's just my guess.

Thank you,
Hirokazu Takahashi.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-29 16:27                     ` Hirokazu Takahashi
@ 2002-10-29 16:49                       ` Trond Myklebust
  0 siblings, 0 replies; 98+ messages in thread
From: Trond Myklebust @ 2002-10-29 16:49 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

    >> Unless, of course, you've got another use for the
    >> head_page/tail_page?

     > I just wanted to make it general.  I guessed head_page (or
     > head_pages) might be useful for big NFSv4 COMPOUND messages as
     > we could send a head without any copies.  But it's just my
     > guess.

It's good to know that this is possible, but let's not overdesign: we
don't want to implement this unless we know that we have a need.

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-29 15:09                   ` Trond Myklebust
  2002-10-29 16:27                     ` Hirokazu Takahashi
@ 2002-10-30  3:18                     ` Hirokazu Takahashi
  1 sibling, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-30  3:18 UTC (permalink / raw)
  To: trond.myklebust; +Cc: neilb, nfs

Hello,

I've simplified the patch as you suggested.

> Hmmm... What if we just drop actually storing a pointer to the
> ZERO_PAGE? Instead, define the convention that
> 
> if (xdr_buf->tail[0].iov_base == NULL)
> 	padding = xdr_buf->tail[0].iov_len;
> 
> and just have xprt_sendmsg() magically append 'padding' bytes from
> your ZERO_PAGE.


--- linux/net/sunrpc/xdr.c.ORG	Sat Oct 26 21:21:16 2002
+++ linux/net/sunrpc/xdr.c	Wed Oct 30 11:11:03 2002
@@ -113,7 +113,8 @@ xdr_encode_pages(struct xdr_buf *xdr, st
 		struct iovec *iov = xdr->tail;
 		unsigned int pad = 4 - (len & 3);
 
-		iov->iov_base = (void *) "\0\0\0";
+		/* NULL means a request to pad it with zero. */
+		iov->iov_base = NULL;
 		iov->iov_len  = pad;
 		len += pad;
 	}
--- linux/net/sunrpc/xprt.c.ORG	Sun Oct 27 17:07:17 2002
+++ linux/net/sunrpc/xprt.c	Wed Oct 30 12:16:05 2002
@@ -60,6 +60,7 @@
 #include <linux/unistd.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 #include <net/sock.h>
 #include <net/checksum.h>
@@ -207,48 +208,113 @@ xprt_release_write(struct rpc_xprt *xprt
 	spin_unlock_bh(&xprt->sock_lock);
 }
 
-/*
- * Write data to socket.
- */
 static inline int
-xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
+__xprt_sendmsg(struct rpc_xprt *xprt, struct xdr_buf *xdr, size_t skip)
 {
 	struct socket	*sock = xprt->sock;
+	unsigned int	slen = xdr->len - skip;
+	struct page	**ppage = xdr->pages;
+	unsigned int	len, pglen = xdr->page_len;
+	size_t		base = 0;
 	struct msghdr	msg;
-	struct xdr_buf	*xdr = &req->rq_snd_buf;
-	struct iovec	niv[MAX_IOVEC];
-	unsigned int	niov, slen, skip;
+	struct iovec	niv;
+	int		flags;
 	mm_segment_t	oldfs;
-	int		result;
-
-	if (!sock)
-		return -ENOTCONN;
-
-	xprt_pktdump("packet data:",
-				req->rq_svec->iov_base,
-				req->rq_svec->iov_len);
-
-	/* Dont repeat bytes */
-	skip = req->rq_bytes_sent;
-	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
+	int		result = 0;
+	int		ret;
 
 	msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov	= niv;
-	msg.msg_iovlen	= niov;
 	msg.msg_name	= (struct sockaddr *) &xprt->addr;
 	msg.msg_namelen = sizeof(xprt->addr);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
+	msg.msg_iov	= &niv;
+	msg.msg_iovlen	= 1;
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
-	set_fs(oldfs);
+	if (xdr->head[0].iov_len > skip) {
+		len = xdr->head[0].iov_len - skip;
+		niv.iov_base = xdr->head[0].iov_base + skip;
+		niv.iov_len = len;
+		if (slen > len)
+			msg.msg_flags |= MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, &msg, len);
+		set_fs(oldfs);
+		if (result != len)
+			return result;
+		slen -= len;
+		skip = 0;
+	} else {
+		skip -= xdr->head[0].iov_len;
+	}
+	if (pglen == 0)
+		goto send_tail;
+	if (skip >= pglen) {
+		skip -= pglen;
+		goto send_tail;
+	}
+	if (skip || xdr->page_base) {
+		pglen -= skip;
+		base = xdr->page_base + skip;
+		ppage += base >> PAGE_CACHE_SHIFT;
+		base &= ~PAGE_CACHE_MASK;
+	}
+	len = PAGE_CACHE_SIZE - base;
+	if (len > pglen) len = pglen;
+	flags = MSG_MORE;
+	while (pglen > 0) {
+		if (slen == len)
+			flags = 0;
+		ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		if (ret > 0)
+			result += ret;
+		if (ret != len) {
+			if (result == 0)
+				result = ret;
+			return result;
+		}
+		slen -= len;
+		pglen -= len;
+		len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+		base = 0;
+		ppage++;
+	}
+	skip = 0;
+send_tail:
+	if (xdr->tail[0].iov_len) {
+		if (xdr->tail[0].iov_base == NULL) {
+			/* tail[0].iov_base == NULL requires zero padding */
+			ret = sock->ops->sendpage(sock, sunrpc_get_zeropage(),
+					0, xdr->tail[0].iov_len - skip, 0);
+		} else {
+			niv.iov_base = xdr->tail[0].iov_base + skip;
+			niv.iov_len = xdr->tail[0].iov_len - skip;
+			msg.msg_flags &= ~MSG_MORE;
+			oldfs = get_fs(); set_fs(get_ds());
+			clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+			ret = sock_sendmsg(sock, &msg, niv.iov_len);
+			set_fs(oldfs);
+		}
+		if (ret > 0)
+			result += ret;
+		if (result == 0)
+			result = ret;
+	}
+	return result;
+}
+
+/*
+ * Write data to socket.
+ */
+static inline int
+xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
+{
+	int		result;
 
-	xdr_kunmap(xdr, skip);
+	result = __xprt_sendmsg(xprt, &req->rq_snd_buf, req->rq_bytes_sent);
 
-	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);
+	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);
 
 	if (result >= 0)
 		return result;
--- linux/net/sunrpc/sunrpc_syms.c.ORG	Tue Oct 29 14:18:45 2002
+++ linux/net/sunrpc/sunrpc_syms.c	Tue Oct 29 14:15:27 2002
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(auth_unix_lookup);
 EXPORT_SYMBOL(cache_check);
 EXPORT_SYMBOL(cache_clean);
 EXPORT_SYMBOL(cache_flush);
+EXPORT_SYMBOL(cache_purge);
 EXPORT_SYMBOL(cache_fresh);
 EXPORT_SYMBOL(cache_init);
 EXPORT_SYMBOL(cache_register);
@@ -130,6 +131,36 @@ EXPORT_SYMBOL(nfsd_debug);
 EXPORT_SYMBOL(nlm_debug);
 #endif
 
+/* RPC general use */
+EXPORT_SYMBOL(sunrpc_get_zeropage);
+
+static struct page *sunrpc_zero_page;
+
+struct page *
+sunrpc_get_zeropage(void)
+{
+	return sunrpc_zero_page;
+}
+
+static int __init
+sunrpc_init_zeropage(void)
+{
+	sunrpc_zero_page = alloc_page(GFP_ATOMIC);
+	if (sunrpc_zero_page == NULL) {
+		printk(KERN_ERR "RPC: couldn't allocate zero_page.\n");
+		return 1;
+	}
+	clear_page(page_address(sunrpc_zero_page));
+	return 0;
+}
+
+static void __exit
+sunrpc_cleanup_zeropage(void)
+{
+	put_page(sunrpc_zero_page);
+	sunrpc_zero_page = NULL;
+}
+
 static int __init
 init_sunrpc(void)
 {
@@ -141,12 +172,14 @@ init_sunrpc(void)
 #endif
 	cache_register(&auth_domain_cache);
 	cache_register(&ip_map_cache);
+	sunrpc_init_zeropage();
 	return 0;
 }
 
 static void __exit
 cleanup_sunrpc(void)
 {
+	sunrpc_cleanup_zeropage();
 	cache_unregister(&auth_domain_cache);
 	cache_unregister(&ip_map_cache);
 #ifdef RPC_DEBUG
--- linux/include/linux/sunrpc/types.h.ORG	Tue Oct 29 11:31:13 2002
+++ linux/include/linux/sunrpc/types.h	Tue Oct 29 11:37:49 2002
@@ -13,10 +13,14 @@
 #include <linux/workqueue.h>
 #include <linux/sunrpc/debug.h>
 #include <linux/list.h>
+#include <linux/mm.h>
 
 /*
  * Shorthands
  */
 #define signalled()		(signal_pending(current))
+
+extern struct page * sunrpc_get_zeropage(void);
+
 
 #endif /* _LINUX_SUNRPC_TYPES_H_ */
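
The page loop in __xprt_sendmsg() above walks a byte range that starts
at page_base + skip and cuts it into per-page pieces, one sendpage call
each.  The following user-space sketch shows only that arithmetic.  It
is an illustration, not the patch itself: struct chunk, split_pages()
and the 4096-byte SKETCH_PAGE_SIZE are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* One per-page piece: which page, where in it, and how many bytes. */
struct chunk {
	size_t page;
	size_t offset;
	size_t len;
};

/*
 * Split the range [page_base + skip, page_base + page_len) into pieces
 * that never cross a page boundary.  Returns the number of pieces
 * stored in 'out' (at most 'max').
 */
static size_t split_pages(size_t page_base, size_t page_len, size_t skip,
			  struct chunk *out, size_t max)
{
	size_t n = 0;
	size_t pos, left;

	if (skip >= page_len)
		return 0;	/* everything in the pages was already sent */
	pos = page_base + skip;
	left = page_len - skip;
	while (left > 0 && n < max) {
		size_t off = pos & (SKETCH_PAGE_SIZE - 1);
		size_t len = SKETCH_PAGE_SIZE - off;

		if (len > left)
			len = left;
		out[n].page = pos >> SKETCH_PAGE_SHIFT;
		out[n].offset = off;
		out[n].len = len;
		n++;
		pos += len;
		left -= len;
	}
	return n;
}
```

Only the first piece can start at a non-zero offset, and only the last
piece can be shorter than a page; a partially sent request resumes by
passing the bytes already sent as 'skip'.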



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-25 12:41         ` Neil Brown
  2002-10-26  3:11           ` Hirokazu Takahashi
@ 2002-10-30 23:29           ` Hirokazu Takahashi
  2002-10-30 23:53             ` Neil Brown
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-30 23:29 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

How is it going?

neilb> I would make a special 'fast-path' for that case which didn't copy any
neilb> data but passed a skbuf up, and code in nfs*xdr.c would convert that
neilb> into an iovec[];
neilb> 
neilb> I am working on a patch which changes rpcsvc to use xdr_buf.  Some of
neilb> it works.  Some doesn't.  I include it below for your reference.  I
neilb> repeat: it doesn't work yet.
neilb> Once it is done, adding the rest of zero-copy should be fairly easy.

-------------------------------------------------------
This sf.net email is sponsored by: Influence the future 
of Java(TM) technology. Join the Java Community 
Process(SM) (JCP(SM)) program now. 
http://ads.sourceforge.net/cgi-bin/redirect.pl?sunm0004en
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-30 23:29           ` Hirokazu Takahashi
@ 2002-10-30 23:53             ` Neil Brown
  2002-10-31  2:06               ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Neil Brown @ 2002-10-30 23:53 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs

On Thursday October 31, taka@valinux.co.jp wrote:
> Hello,
> 
> How is it going?

I've just sent some patches to Linus and nfs@lists....

The rest of the zero copy stuff should fit in quite easily, with the
possible exception of single-copy writes: I haven't looked very hard
at that yet.

NeilBrown

> 
> neilb> I would make a special 'fast-path' for that case which didn't copy any
> neilb> data but passed a skbuf up, and code in nfs*xdr.c would convert that
> neilb> into an iovec[];
> neilb> 
> neilb> I am working on a patch which changes rpcsvc to use xdr_buf.  Some of
> neilb> it works.  Some doesn't.  I include it below for your reference.  I
> neilb> repeat: it doesn't work yet.
> neilb> Once it is done, adding the rest of zero-copy should be fairly easy.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-30 23:53             ` Neil Brown
@ 2002-10-31  2:06               ` Hirokazu Takahashi
  2002-10-31 15:40                 ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-31  2:06 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> > How is it going?
> 
> I've just sent some patches to Linus and nfs@lists....

Thanks. I've seen them in linux-2.5.45.

> The rest of the zero copy stuff should fit in quite easily, with the
> possible exception of single-copy writes: I haven't looked very hard
> at that yet.

Ok,
I'll try to port the zero copy stuff on it.



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-31  2:06               ` Hirokazu Takahashi
@ 2002-10-31 15:40                 ` Hirokazu Takahashi
  2002-10-31 16:56                   ` Hirokazu Takahashi
  2002-11-01  0:54                   ` Neil Brown
  0 siblings, 2 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-31 15:40 UTC (permalink / raw)
  To: neilb; +Cc: nfs

[-- Attachment #1: Type: Text/Plain, Size: 553 bytes --]

Hello,

> The rest of the zero copy stuff should fit in quite easily, with the
> possible exception of single-copy writes: I haven't looked very hard
> at that yet.

I just ported part of the zero copy stuff against linux-2.5.45.
single-copy writes and per-cpu sockets are not included yet.
And I fixed a problem that NFS over TCP wouldn't work. 

va-nfsd-sendpage.patch        ....use sendpage instead of sock_sendmsg.
va-sunrpc-zeropage.patch      ....zero filled page for padding.
va-nfsd-vfsread.patch         ....zero-copy nfsd_read/nfsd_readdir.

[-- Attachment #2: zerocopy-2.5.45.taz --]
[-- Type: Application/Octet-Stream, Size: 6263 bytes --]

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-31 15:40                 ` Hirokazu Takahashi
@ 2002-10-31 16:56                   ` Hirokazu Takahashi
  2002-11-01  1:10                     ` Neil Brown
  2002-11-01  0:54                   ` Neil Brown
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-10-31 16:56 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> > The rest of the zero copy stuff should fit in quite easily, with the
> > possible exception of single-copy writes: I haven't looked very hard
> > at that yet.
> 
> I just ported part of the zero copy stuff against linux-2.5.45.
> single-copy writes and per-cpu sockets are not included yet.
> And I fixed a problem that NFS over TCP wouldn't work. 

I also ported the per-cpu socket patch against linux-2.5.45.


--- include/linux/sunrpc/svcsock.h.ORG3	Fri Nov  1 01:29:52 2002
+++ include/linux/sunrpc/svcsock.h	Fri Nov  1 01:31:28 2002
@@ -51,6 +51,7 @@ struct svc_sock {
 	int			sk_reclen;	/* length of record */
 	int			sk_tcplen;	/* current read length */
 	time_t			sk_lastrecv;	/* time of last received request */
+	struct svc_sock		**sk_shadow;	/* shadow sockets for sending */
 };
 
 /*
--- net/sunrpc/svcsock.c.ORG3	Fri Nov  1 01:30:14 2002
+++ net/sunrpc/svcsock.c	Fri Nov  1 01:51:34 2002
@@ -64,7 +64,9 @@
 
 
 static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *,
-					 int *errp, int pmap_reg);
+					 int *errp, int type);
+#define SVSK_PMAP_REGISTER	1
+#define SVSK_SHADOW		2
 static void		svc_udp_data_ready(struct sock *, int);
 static int		svc_udp_recvfrom(struct svc_rqst *);
 static int		svc_udp_sendto(struct svc_rqst *);
@@ -259,6 +261,8 @@ svc_sock_put(struct svc_sock *svsk)
 	if (!--(svsk->sk_inuse) && test_bit(SK_DEAD, &svsk->sk_flags)) {
 		spin_unlock_bh(&serv->sv_lock);
 		dprintk("svc: releasing dead socket\n");
+		if (svsk->sk_shadow)
+			kfree(svsk->sk_shadow);
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	}
@@ -326,6 +330,27 @@ svc_wake_up(struct svc_serv *serv)
 	spin_unlock_bh(&serv->sv_lock);
 }
 
+static inline struct svc_sock *
+svc_get_svsk(struct svc_rqst *rqstp)
+{
+	struct svc_sock	*svsk = rqstp->rq_sock;
+#ifdef CONFIG_SMP
+	if (svsk->sk_shadow) {
+		struct svc_sock	*shadow = svsk->sk_shadow[smp_processor_id()];
+		if (shadow) {
+			struct svc_serv	*serv = svsk->sk_server;
+			svsk = shadow;
+			if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
+				svc_sock_setbufsize(svsk->sk_sock,
+					(serv->sv_nrthreads+3) * serv->sv_bufsz,
+					(serv->sv_nrthreads+3) * serv->sv_bufsz);
+		}
+
+	}
+#endif
+	return svsk;
+}
+
 /*
  * Generic sendto routine
  */
@@ -333,7 +358,7 @@ static int
 svc_sendto(struct svc_rqst *rqstp, struct xdr_buf *xdr)
 {
 	mm_segment_t	oldfs;
-	struct svc_sock	*svsk = rqstp->rq_sock;
+	struct svc_sock	*svsk = svc_get_svsk(rqstp);
 	struct socket	*sock = svsk->sk_sock;
 	struct msghdr	msg;
 	int		slen;
@@ -1228,7 +1253,7 @@ svc_send(struct svc_rqst *rqstp)
  */
 static struct svc_sock *
 svc_setup_socket(struct svc_serv *serv, struct socket *sock,
-					int *errp, int pmap_register)
+					int *errp, int type)
 {
 	struct svc_sock	*svsk;
 	struct sock	*inet;
@@ -1249,6 +1274,7 @@ svc_setup_socket(struct svc_serv *serv, 
 	svsk->sk_owspace = inet->write_space;
 	svsk->sk_server = serv;
 	svsk->sk_lastrecv = CURRENT_TIME;
+	svsk->sk_shadow = NULL;
 	INIT_LIST_HEAD(&svsk->sk_deferred);
 	sema_init(&svsk->sk_sem, 1);
 
@@ -1261,7 +1287,7 @@ if (svsk->sk_sk == NULL)
 	printk(KERN_WARNING "svsk->sk_sk == NULL after svc_prot_init!\n");
 
 	/* Register socket with portmapper */
-	if (*errp >= 0 && pmap_register)
+	if (*errp >= 0 && type == SVSK_PMAP_REGISTER)
 		*errp = svc_register(serv, inet->protocol,
 				     ntohs(inet_sk(inet)->sport));
 
@@ -1273,13 +1299,13 @@ if (svsk->sk_sk == NULL)
 
 
 	spin_lock_bh(&serv->sv_lock);
-	if (!pmap_register) {
+	if (type == SVSK_PMAP_REGISTER || type == SVSK_SHADOW) {
+		clear_bit(SK_TEMP, &svsk->sk_flags);
+		list_add(&svsk->sk_list, &serv->sv_permsocks);
+	} else {
 		set_bit(SK_TEMP, &svsk->sk_flags);
 		list_add(&svsk->sk_list, &serv->sv_tempsocks);
 		serv->sv_tmpcnt++;
-	} else {
-		clear_bit(SK_TEMP, &svsk->sk_flags);
-		list_add(&svsk->sk_list, &serv->sv_permsocks);
 	}
 	spin_unlock_bh(&serv->sv_lock);
 
@@ -1288,6 +1314,61 @@ if (svsk->sk_sk == NULL)
 	return svsk;
 }
 
+
+/*
+ * Create a shadow socket which has the same sport as the given svsk.
+ * Let each cpu have its own socket to send packets.
+ */
+static int
+svc_create_shadow_socket(struct svc_serv *serv, struct svc_sock	*svsk,
+				int protocol, struct sockaddr_in *sin)
+{
+#ifdef CONFIG_SMP
+	int		error;
+	struct socket	*newsock;
+	struct svc_sock	*newsvsk;
+	int		i;
+
+	if (num_online_cpus() == 1)
+		return 0;
+
+	svsk->sk_shadow = kmalloc(sizeof(struct svc_sock*)*NR_CPUS, GFP_KERNEL);
+	if (!svsk->sk_shadow)
+		return -ENOMEM;
+
+	memset(svsk->sk_shadow, 0, sizeof(struct svc_sock*)*NR_CPUS);
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+
+		if ((error = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &newsock)) < 0)
+			return error;
+		if ((newsvsk = svc_setup_socket(serv, newsock, &error, SVSK_SHADOW)) == NULL) {
+			sock_release(newsock);
+			return error;
+		}
+		/*
+		 * Make the newsvsk as shadow of the svsk.
+		 */
+		newsock->sk->reuse = 1; /* allow address reuse */
+		error = newsock->ops->bind(newsock, (struct sockaddr *) sin,
+						sizeof(*sin));
+		if (error < 0) {
+			sock_release(newsock);
+			kfree(newsvsk);
+			return error;
+		}
+		/*
+		 * Unhash the newsocket not to receive packets.
+		 */
+		newsock->sk->prot->unhash(newsock->sk);
+		svsk->sk_shadow[i] = newsvsk;
+	}
+#endif
+	return 0;
+}
+
 /*
  * Create socket for RPC service.
  */
@@ -1327,8 +1408,13 @@ svc_create_socket(struct svc_serv *serv,
 			goto bummer;
 	}
 
-	if ((svsk = svc_setup_socket(serv, sock, &error, 1)) != NULL)
-		return 0;
+	if ((svsk = svc_setup_socket(serv, sock, &error, SVSK_PMAP_REGISTER)) == NULL)
+		goto bummer;
+
+	if (protocol == IPPROTO_UDP && sin != NULL)
+		svc_create_shadow_socket(serv, svsk, protocol, sin);
+
+	return 0;
 
 bummer:
 	dprintk("svc: svc_create_socket error = %d\n", -error);
@@ -1367,6 +1453,8 @@ svc_delete_socket(struct svc_sock *svsk)
 
 	if (!svsk->sk_inuse) {
 		spin_unlock_bh(&serv->sv_lock);
+		if (svsk->sk_shadow)
+			kfree(svsk->sk_shadow);
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	} else {
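
The selection done by svc_get_svsk() in the patch above amounts to
"use this CPU's shadow socket if one exists, otherwise fall back to the
primary socket".  A user-space sketch of just that lookup follows.  The
types and names (struct sketch_sock, pick_send_sock(), SKETCH_NR_CPUS)
are made up for the example, and the CPU number is passed in explicitly
instead of coming from smp_processor_id().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4	/* assumed CPU count for the example */

/* Hypothetical stand-in for struct svc_sock. */
struct sketch_sock {
	int id;
	struct sketch_sock **shadow;	/* NULL, or SKETCH_NR_CPUS entries */
};

/*
 * Return the socket that 'cpu' should send on: its own shadow socket if
 * one was created, otherwise the primary socket itself.
 */
static struct sketch_sock *pick_send_sock(struct sketch_sock *primary,
					  unsigned int cpu)
{
	if (primary->shadow && cpu < SKETCH_NR_CPUS && primary->shadow[cpu])
		return primary->shadow[cpu];
	return primary;
}
```

Because the fallback is the primary socket, a partially populated
shadow array (for example, for CPUs that were offline when the sockets
were created) still yields a usable socket for every sender.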



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-31 15:40                 ` Hirokazu Takahashi
  2002-10-31 16:56                   ` Hirokazu Takahashi
@ 2002-11-01  0:54                   ` Neil Brown
  2002-11-01  1:39                     ` Hirokazu Takahashi
  2002-11-01  3:41                     ` Hirokazu Takahashi
  1 sibling, 2 replies; 98+ messages in thread
From: Neil Brown @ 2002-11-01  0:54 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs

On Friday November 1, taka@valinux.co.jp wrote:
> Hello,
> 
> > The rest of the zero copy stuff should fit in quite easily, with the
> > possible exception of single-copy writes: I haven't looked very hard
> > at that yet.
> 
> I just ported part of the zero copy stuff against linux-2.5.45.
> single-copy writes and per-cpu sockets are not included yet.
> And I fixed a problem that NFS over TCP wouldn't work. 
> 
> 
> va-nfsd-sendpage.patch        ....use sendpage instead of sock_sendmsg.
> va-sunrpc-zeropage.patch      ....zero filled page for padding.
> va-nfsd-vfsread.patch         ....zero-copy nfsd_read/nfsd_readdir.

A lot of this looks fine.

I would like to leave the tail pointing into the end of the first page
(just after the head) rather than using the sunrpc_zero_page thing as
the latter doesn't seem necessary.
Also, I would like to send the head and tail with sendpage rather
than using sock_sendmsg.
To give the destination address, you can call sock_sendmsg with
a length of 0, and then call ->sendpage for each page or page
fragment.
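
The sequence described above could be laid out roughly as follows.
This is a user-space sketch that only builds the list of send
operations; it does not touch a socket.  The names (struct send_op,
build_plan(), SKETCH_PAGE_SIZE) are hypothetical, page index -1 stands
for the zero-length sendmsg that carries the destination address, page
0 is the page holding both head and tail, and the payload is assumed to
start at offset 0 of page 1.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

struct send_op {
	int    page;	/* -1: address-only sendmsg; 0: head/tail page; 1..: data */
	size_t offset;
	size_t len;
	int    more;	/* 1 if more fragments follow (MSG_MORE-like) */
};

static int push(struct send_op *ops, size_t *n, size_t max,
		int page, size_t offset, size_t len)
{
	if (*n >= max)
		return -1;
	ops[*n].page = page;
	ops[*n].offset = offset;
	ops[*n].len = len;
	ops[*n].more = 1;
	(*n)++;
	return 0;
}

/* Returns the number of operations, or 0 if the plan cannot be built. */
static size_t build_plan(size_t head_len, size_t data_len, size_t tail_len,
			 struct send_op *ops, size_t max)
{
	size_t n = 0;
	int page = 1;

	/* The tail sits just after the head, so both must share page 0. */
	if (head_len == 0 || head_len + tail_len > SKETCH_PAGE_SIZE)
		return 0;
	if (push(ops, &n, max, -1, 0, 0))		/* destination only */
		return 0;
	if (push(ops, &n, max, 0, 0, head_len))		/* head */
		return 0;
	while (data_len > 0) {				/* payload pages */
		size_t len = data_len < SKETCH_PAGE_SIZE ?
			     data_len : SKETCH_PAGE_SIZE;

		if (push(ops, &n, max, page++, 0, len))
			return 0;
		data_len -= len;
	}
	if (tail_len > 0 && push(ops, &n, max, 0, head_len, tail_len))
		return 0;
	ops[n - 1].more = 0;	/* the last fragment completes the datagram */
	return n;
}
```

Every operation except the last is marked as having more to follow, so
the fragments are coalesced into a single message.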

You should be able to remove the calls to svcbuf_reserve in
nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
'buffer' variable as well.

If you could make those changes (or convince me otherwise), I will
forward the patches to Linus,
Thanks.

NeilBrown


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-10-31 16:56                   ` Hirokazu Takahashi
@ 2002-11-01  1:10                     ` Neil Brown
  2002-11-04 21:13                       ` Andrew Theurer
  0 siblings, 1 reply; 98+ messages in thread
From: Neil Brown @ 2002-11-01  1:10 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs, David S. Miller

On Friday November 1, taka@valinux.co.jp wrote:
> Hello,
> 
> > > The rest of the zero copy stuff should fit in quite easily, with the
> > > possible exception of single-copy writes: I haven't looked very hard
> > > at that yet.
> > 
> > I just ported part of the zero copy stuff against linux-2.5.45.
> > single-copy writes and per-cpu sockets are not included yet.
> > And I fixed a problem that NFS over TCP wouldn't work. 
> 
> I also ported the per-cpu socket patch against linux-2.5.45.
> 

I still don't really like this patch.
I appreciate that some sort of SMP awareness may be appropriate for
nfsd, but this just doesn't feel right.

One possibility that I have considered goes like this:

- Allow a (udp) socket to have 'cpu affinity' registered.
- Have udp_v4_lookup add to the score for sockets that
  like the current cpu, and reject sockets that don't like this
  cpu.
- Have some cpu affinity with the nfsd threads, probably having
  a separate idle-server-queue for each cpu.  Possibly half the
  threads would be tied to a cpu, the other half would float, and
  only be used if no cpu-local threads were available.

Then instead of having special 'shadow' sockets, we just create NCPUS
normal udp sockets, instead of one, and give each a cpu affinity.
This would mean that receiving would benefit from multiple sockets
as well as sending.
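
To make the idea concrete, here is a user-space sketch of an
affinity-aware lookup.  It is not the real udp_v4_lookup: the structure,
the scoring weights, and the rule that cpus_allowed == 0 means "no
affinity registered" are all assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a bound UDP socket. */
struct sketch_udp_sock {
	unsigned short port;
	unsigned long  cpus_allowed;	/* bitmask; 0 = no affinity registered */
};

/* Score one socket for a packet handled on 'cpu'; -1 means reject. */
static int score_sock(const struct sketch_udp_sock *sk,
		      unsigned short dport, unsigned int cpu)
{
	int score;

	if (sk->port != dport)
		return -1;
	score = 1;
	if (sk->cpus_allowed != 0) {
		if (cpu >= sizeof(unsigned long) * CHAR_BIT ||
		    !(sk->cpus_allowed & (1UL << cpu)))
			return -1;	/* socket does not like this cpu */
		score += 2;		/* socket likes this cpu */
	}
	return score;
}

/* Return the index of the best-scoring socket, or -1 if none matches. */
static int lookup_sock(const struct sketch_udp_sock *socks, int n,
		       unsigned short dport, unsigned int cpu)
{
	int best = -1, best_score = -1, i;

	for (i = 0; i < n; i++) {
		int s = score_sock(&socks[i], dport, cpu);

		if (s > best_score) {
			best_score = s;
			best = i;
		}
	}
	return best;
}
```

With one socket per CPU plus an unbound one, each CPU receives on its
own socket and the unbound socket catches packets that arrive on a CPU
without a dedicated socket.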

I have very little experience with this sort of SMP issue, so I may
be missing something obvious, but to me, this approach seems cleaner
and more general.

Dave: what would you think of having a "unsigned long cpus_allowed"
in struct inet_opt and putting the appropriate checks in
udp_v4_lookup??  Is it worth experimenting with?

NeilBrown



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-11-01  0:54                   ` Neil Brown
@ 2002-11-01  1:39                     ` Hirokazu Takahashi
  2002-11-01  3:41                     ` Hirokazu Takahashi
  1 sibling, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-11-01  1:39 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hello,

> > va-nfsd-sendpage.patch        ....use sendpage instead of sock_sendmsg.
> > va-sunrpc-zeropage.patch      ....zero filled page for padding.
> > va-nfsd-vfsread.patch         ....zero-copy nfsd_read/nfsd_readdir.
> 
> A lot of this looks fine.
>
> I would like to leave the tail pointing into the end of the first page
> (just after the head) rather than using the sunrpc_zero_page thing as
> the latter doesn't seem necessary.
> Also, I would like to send the head and tail with sendpage rather
> than using sock_sendmsg.

Yes, we can.
I'll do it, though it seems a little bit tricky.

> To give the destination address, you can call sock_sendmsg with
> a length of 0, and then call ->sendpage for each page or page
> fragment.

Ok.

> You should be able to remove the calls to svcbuf_reserve in
> nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
> 'buffer' variable as well.

Yes, you're right.

> If you could make those changes (or convince me otherwise), I will
> forward the patches to Linus,
> Thanks.

Thanks!



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-11-01  0:54                   ` Neil Brown
  2002-11-01  1:39                     ` Hirokazu Takahashi
@ 2002-11-01  3:41                     ` Hirokazu Takahashi
  2002-11-01  4:20                       ` Neil Brown
  1 sibling, 1 reply; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-11-01  3:41 UTC (permalink / raw)
  To: neilb; +Cc: nfs

[-- Attachment #1: Type: Text/Plain, Size: 843 bytes --]

Hello,

I updated the patches.
I'll send you 2 patches

     va-nfsd-sendpage.patch
     va-nfsd-vfsread.patch

> I would like to leave the tail pointing into the end of the first page
> (just after the head) rather than using the sunrpc_zero_page thing as
> the latter doesn't seem necessary.
> Also, I would like to send the head and tail with sendpage rather
> than using sock_sendmsg.
> To give the destination address, you can call sock_sendmsg with
> a length of 0, and then call ->sendpage for each page or page
> fragment.
> 
> 
> You should be able to remove the calls to svcbuf_reserve in
> nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
> 'buffer' variable as well.
> 
> 
> If you could make those changes (or convince me otherwise), I will
> forward the patches to Linus,
> Thanks.

Thank you,
Hirokazu Takahashi.



[-- Attachment #2: zerocopy-2.5.45-new.taz --]
[-- Type: Application/Octet-Stream, Size: 5912 bytes --]

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-11-01  3:41                     ` Hirokazu Takahashi
@ 2002-11-01  4:20                       ` Neil Brown
  2002-11-01  5:07                         ` Hirokazu Takahashi
  0 siblings, 1 reply; 98+ messages in thread
From: Neil Brown @ 2002-11-01  4:20 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: nfs

On Friday November 1, taka@valinux.co.jp wrote:
> Hello,
> 
> I updated the patches.
> I'll send you 2 patches
> 
>      va-nfsd-sendpage.patch
>      va-nfsd-vfsread.patch
> 

Thanks.
I made a couple of little changes and sent them to Linus and the list.
1/ I simplified the sending of the tail a bit more.  We assume the
tail is *always* in the same page as the head, and just sendpage it.

2/ I removed svcbuf_reserve and the buffer variable from
   nfsd3_proc_readdirplus as well :-)

Thanks again,
NeilBrown



* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-11-01  4:20                       ` Neil Brown
@ 2002-11-01  5:07                         ` Hirokazu Takahashi
  0 siblings, 0 replies; 98+ messages in thread
From: Hirokazu Takahashi @ 2002-11-01  5:07 UTC (permalink / raw)
  To: neilb; +Cc: nfs

Hi,

> Thanks.
> I made a couple of little changes and sent them to Linus and the list.
> 1/ I simplified the sending of the tail a bit more.  We assume the
> tail is *always* in the same page as the head, and just sendpage it.

It looks fine!

> 2/ I removed svcbuf_reserve and the buffer variable from
>    nfsd3_proc_readdirplus as well :-)

Thanks.




* Re: Re: [PATCH] zerocopy NFS for 2.5.43
  2002-11-01  1:10                     ` Neil Brown
@ 2002-11-04 21:13                       ` Andrew Theurer
  0 siblings, 0 replies; 98+ messages in thread
From: Andrew Theurer @ 2002-11-04 21:13 UTC (permalink / raw)
  To: Neil Brown, Hirokazu Takahashi; +Cc: nfs, David S. Miller

> > I also ported the per-cpu socket patch against linux2.5.45.
>
> I still don't really like this patch.
> I appreciate that some sort of SMP awareness may be appropriate for
> nfsd, but this just doesn't feel right.
>
> One possibility that I have considered goes like this:
>
> - Allow a (udp) socket to have 'cpu affinity' registered.
> - Get udp_v4_lookup to add to the score for sockets that
>   like the current cpu, and reject sockets that don't like this
>   cpu.
> - Have some cpu affinity with the nfsd threads, probably having
>   a separate idle-server-queue for each cpu.  Possibly half the
>   threads would be tied to a cpu, the other half would float, and
>   only be used if no cpu-local threads were available.

This all sounds great, I wish I knew how to do this :)

> Then instead of having special 'shadow' sockets, we just create NCPUS
> normal udp sockets, instead of one, and give each a cpu affinity.
> This would mean that receiving would benefit from multiple sockets
> as well as sending.

So, the target socket getting populated on inbound traffic would likely depend
on which CPU took the net card interrupt?  And the resulting CPU would handle
the NFS request?  If so, and you had a good interrupt balance across CPUs,
that sounds fine.  If you have an interrupt imbalance, it could be really
bad.  This doesn't sound like a problem on a system like PIII, where
interrupts can float (don't know how irqbalance works with that) but I'm not
so sure about P4, even with irqbalance.  Over time they do balance out, but
in my experience a particular interrupt is being handled by one CPU or
another with a significant (in this context) amount of time between
destination changes.

> I have very little experience with these sort of SMP issues, so I may
> be missing something obvious, but to me, this approach seems cleaner
> and more general.
>
> Dave: what would you think of having an "unsigned long cpus_allowed"
> in struct inet_opt and putting the appropriate checks in
> udp_v4_lookup??  Is it worth experimenting with?
>
> NeilBrown
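
To make the quoted proposal a little more concrete, here is a small
userspace model of an affinity-aware lookup. It is only a sketch: struct
model_sock, model_score and model_lookup are hypothetical names, and the
real udp_v4_lookup scores on more fields than the port. The only idea taken
from the thread is a per-socket cpus_allowed mask that adds to the score
when the current cpu is allowed and rejects the socket otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; this is not the kernel's socket structure. */
struct model_sock {
    unsigned short port;         /* bound local port */
    unsigned long cpus_allowed;  /* bit n set: likes cpu n; 0: no affinity */
};

/* Score one candidate socket; -1 means "reject". */
static int model_score(const struct model_sock *sk, unsigned short dport,
                       unsigned int cpu)
{
    int score;

    assert(cpu < 8 * sizeof(unsigned long));

    if (sk->port != dport)
        return -1;

    score = 1;  /* port matches */
    if (sk->cpus_allowed != 0) {
        if (!(sk->cpus_allowed & (1UL << cpu)))
            return -1;  /* socket does not like this cpu */
        score += 1;     /* socket likes this cpu */
    }
    return score;
}

/* Return the best-scoring socket for (dport, cpu), or NULL if none match. */
static const struct model_sock *model_lookup(const struct model_sock *socks,
                                             size_t n, unsigned short dport,
                                             unsigned int cpu)
{
    const struct model_sock *best = NULL;
    int best_score = -1;

    for (size_t i = 0; i < n; i++) {
        int s = model_score(&socks[i], dport, cpu);

        if (s > best_score) {
            best_score = s;
            best = &socks[i];
        }
    }
    return best;
}
```

With one socket per cpu bound to the same port, a packet handled on cpu N
would land on the socket whose mask includes N, and a socket with no
affinity would act as the fallback.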




end of thread, other threads:[~2002-11-04 21:45 UTC | newest]

Thread overview: 98+ messages
2002-09-18  8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
2002-09-18 23:00 ` David S. Miller
2002-09-18 23:54   ` Alan Cox
2002-09-18 23:54     ` Alan Cox
2002-09-19  0:16     ` Andrew Morton
2002-09-19  2:13       ` Aaron Lehmann
2002-09-19  3:30         ` Andrew Morton
2002-09-19  3:30           ` Andrew Morton
2002-09-19 10:42           ` Alan Cox
2002-09-19 10:42             ` Alan Cox
2002-09-19 13:15       ` [NFS] " Hirokazu Takahashi
2002-09-19 20:42         ` Andrew Morton
2002-09-19 21:12           ` David S. Miller
2002-09-19 21:12             ` [NFS] " David S. Miller
2002-09-21 11:56   ` Pavel Machek
2002-09-21 11:56     ` Pavel Machek
2002-10-14  5:50 ` Neil Brown
2002-10-14  6:15   ` David S. Miller
2002-10-14 10:45     ` kuznet
2002-10-14 10:48       ` David S. Miller
2002-10-14 12:01   ` Hirokazu Takahashi
2002-10-14 14:12     ` Andrew Theurer
2002-10-16  3:44     ` Neil Brown
2002-10-16  4:31       ` David S. Miller
2002-10-16 15:04         ` Andrew Theurer
2002-10-17  2:03         ` [NFS] " Andrew Theurer
2002-10-17  2:31           ` Hirokazu Takahashi
2002-10-17 13:16             ` Andrew Theurer
2002-10-17 13:16               ` [NFS] " Andrew Theurer
2002-10-17 13:26               ` Hirokazu Takahashi
2002-10-17 13:26                 ` [NFS] " Hirokazu Takahashi
2002-10-17 14:10                 ` Andrew Theurer
2002-10-17 16:26                   ` Hirokazu Takahashi
2002-10-17 16:26                     ` [NFS] " Hirokazu Takahashi
2002-10-18  5:38                     ` Trond Myklebust
2002-10-18  7:19                       ` Hirokazu Takahashi
2002-10-18 15:12                         ` Andrew Theurer
2002-10-18 15:12                           ` [NFS] " Andrew Theurer
2002-10-19 20:34                           ` Hirokazu Takahashi
2002-10-19 20:34                             ` [NFS] " Hirokazu Takahashi
2002-10-22 21:16                             ` Andrew Theurer
2002-10-22 21:16                               ` [NFS] " Andrew Theurer
2002-10-23  9:29                               ` Hirokazu Takahashi
2002-10-24 15:32                                 ` Andrew Theurer
2002-10-27 11:10                                   ` Hirokazu Takahashi
2002-10-16 11:09       ` Hirokazu Takahashi
2002-10-16 17:02         ` kaza
2002-10-17  4:36           ` rddunlap
2002-10-18 13:11   ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
2002-10-23  1:18     ` Neil Brown
2002-10-23  3:53       ` Hirokazu Takahashi
2002-10-23  5:40         ` Hirokazu Takahashi
2002-10-23  6:03           ` Neil Brown
2002-10-23 22:35             ` Hirokazu Takahashi
2002-10-23  6:10         ` Neil Brown
2002-10-23  7:08           ` Hirokazu Takahashi
2002-10-23 15:23           ` Trond Myklebust
2002-10-23 21:50       ` Hirokazu Takahashi
2002-10-23 23:55         ` Trond Myklebust
2002-10-24  1:33           ` Hirokazu Takahashi
2002-10-27 10:39             ` Hirokazu Takahashi
2002-10-28 16:31               ` Trond Myklebust
2002-10-28 23:39                 ` Hirokazu Takahashi
2002-10-29  6:36                 ` Hirokazu Takahashi
2002-10-29 15:09                   ` Trond Myklebust
2002-10-29 16:27                     ` Hirokazu Takahashi
2002-10-29 16:49                       ` Trond Myklebust
2002-10-30  3:18                     ` Hirokazu Takahashi
2002-10-25  9:52       ` Hirokazu Takahashi
2002-10-25 12:41         ` Neil Brown
2002-10-26  3:11           ` Hirokazu Takahashi
2002-10-26  3:46             ` Benjamin LaHaise
2002-10-27 22:46               ` Neil Brown
2002-10-30 23:29           ` Hirokazu Takahashi
2002-10-30 23:53             ` Neil Brown
2002-10-31  2:06               ` Hirokazu Takahashi
2002-10-31 15:40                 ` Hirokazu Takahashi
2002-10-31 16:56                   ` Hirokazu Takahashi
2002-11-01  1:10                     ` Neil Brown
2002-11-04 21:13                       ` Andrew Theurer
2002-11-01  0:54                   ` Neil Brown
2002-11-01  1:39                     ` Hirokazu Takahashi
2002-11-01  3:41                     ` Hirokazu Takahashi
2002-11-01  4:20                       ` Neil Brown
2002-11-01  5:07                         ` Hirokazu Takahashi
2002-10-25 17:23         ` Trond Myklebust
2002-10-26  3:26           ` Hirokazu Takahashi
  -- strict thread matches above, loose matches on Subject: below --
2002-09-19  2:00 [NFS] Re: [PATCH] zerocopy NFS for 2.5.36 Lever, Charles
     [not found] <3D89176B.40FFD09B@digeo.com.suse.lists.linux.kernel>
     [not found] ` <20020919.221513.28808421.taka@valinux.co.jp.suse.lists.linux.kernel>
     [not found]   ` <3D8A36A5.846D806@digeo.com.suse.lists.linux.kernel>
2002-09-20  1:00     ` Andi Kleen
2002-09-20  1:09       ` Andrew Morton
2002-09-20  1:23         ` Andi Kleen
2002-09-20  1:27           ` David S. Miller
2002-09-20  2:06             ` Andi Kleen
2002-09-20  2:01               ` David S. Miller
2002-09-20  2:28                 ` Andi Kleen
2002-09-20  2:20                   ` David S. Miller
2002-09-20  2:35                     ` Andi Kleen
2002-10-16 14:04 Lever, Charles
