* [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-18 8:14 Hirokazu Takahashi
2002-09-18 23:00 ` David S. Miller
2002-10-14 5:50 ` Neil Brown
0 siblings, 2 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-09-18 8:14 UTC (permalink / raw)
To: Neil Brown, linux-kernel, nfs
Hello,
I have ported the zerocopy NFS patches to linux-2.5.36.
I made va05-zerocopy-nfsdwrite-2.5.36.patch more generic
so that it will be easy to merge with NFSv4. Each procedure can
choose whether or not it accepts split buffers.
I also fixed a problem where nfsd couldn't handle very large
NFS symlink requests.
1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables hardware checksumming of outgoing packets, including UDP frames.
2)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va11-udpsendfile-2.5.36.patch
This patch makes the sendfile system call work over UDP. It also supports
a UDP_CORK interface, which is very similar to TCP_CORK, and lets you call
sendmsg/sendfile with the MSG_MORE flag on UDP sockets.
3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes a problem in the x86 csum_partial() routines, which
can't handle odd-addressed buffers.
4)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va01-zerocopy-rpc-2.5.36.patch
This patch lets RPC send pieces of data and pages without copying.
5)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va02-zerocopy-nfsdread-2.5.36.patch
This patch makes NFSD send pages directly from the pagecache when NFS clients
request a file read.
6)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va03-zerocopy-nfsdreaddir-2.5.36.patch
nfsd_readdir can also send pages without copying.
7)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va04-zerocopy-shadowsock-2.5.36.patch
This patch creates per-CPU UDP sockets so that NFSD can send UDP frames on
each processor simultaneously.
Without the patch we can send only one UDP frame at a time, as a UDP socket
has to be locked while some pages are being sent in order to serialize them.
8)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va05-zerocopy-nfsdwrite-2.5.36.patch
This patch makes NFS write use the writev interface. NFSD can handle NFS
requests without reassembling IP fragments into one UDP frame.
9)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/taka-writev-2.5.36.patch
This patch makes writev on regular files work faster.
It can also be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/
Caution:
XFS doesn't support the writev interface yet. NFS write on XFS might
slow down with patch No.8. I hope the SGI guys will implement it.
10)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va07-nfsbigbuf-2.5.36.patch
This makes the NFS buffer much bigger (60KB).
For the Linux kernel a 60KB buffer costs the same as a 32KB buffer, as both
of them require a 64KB chunk.
11)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va09-zerocopy-tempsendto-2.5.36.patch
If you don't want to use sendfile over UDP yet, you can apply this patch instead of patches No.1 and No.2.
Regards,
Hirokazu Takahashi
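Editorial note on patch No.10: the claim that a 60KB buffer and a 32KB buffer both need a 64KB chunk can be checked with a little arithmetic. The sketch below is an illustration only. It assumes 4KB pages, an allocator that hands out power-of-two runs of pages, and a hypothetical 1KB of header overhead on top of the NFS payload; none of these figures or names come from the patch itself.

```c
#include <stddef.h>

/* Assumption for illustration: 4KB pages, as on i386. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Round a request up to the power-of-two multiple of the page size that a
 * buddy-style page allocator would have to hand back for it.
 */
size_t chunk_bytes(size_t request)
{
	size_t chunk = SKETCH_PAGE_SIZE;

	while (chunk < request)
		chunk *= 2;
	return chunk;
}
```

Under those assumptions chunk_bytes(32 * 1024 + 1024) and chunk_bytes(60 * 1024 + 1024) both come out at 64KB, because anything even slightly larger than 32KB spills into the next power of two.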
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-18 8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
@ 2002-09-18 23:00 ` David S. Miller
2002-09-18 23:54 ` Alan Cox
2002-09-21 11:56 ` Pavel Machek
2002-10-14 5:50 ` Neil Brown
1 sibling, 2 replies; 87+ messages in thread
From: David S. Miller @ 2002-09-18 23:00 UTC (permalink / raw)
To: taka; +Cc: neilb, linux-kernel, nfs

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Wed, 18 Sep 2002 17:14:31 +0900 (JST)

   1)
   ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
   This patch enables HW-checksum against outgoing packets including UDP frames.

Can you explain the TCP parts? They look very wrong.

It was discussed long ago that csum_and_copy_from_user() performs
better than plain copy_from_user() on x86. I do not remember all
details, but I do know that using copy_from_user() is not a real
improvement at least on x86 architecture.

The rest of the changes (ie. the getfrag() logic to set
skb->ip_summed) looks fine.

   3)
   ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
   This patch fixes the problem of x86 csum_partial() routines which
   can't handle odd addressed buffers.

I've sent Linus this fix already.

^ permalink raw reply [flat|nested] 87+ messages in thread
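Editorial note: the appeal of csum_and_copy_from_user() is that the checksum is folded into the copy loop, so the data is read only once. The following is a minimal userspace sketch of that idea, not the kernel routine. The name csum_and_copy() is hypothetical, it computes the RFC 1071 Internet checksum one byte pair at a time rather than with tuned assembly, and it ignores user-space fault handling entirely.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst and return the 16-bit ones' complement
 * Internet checksum (RFC 1071) of the data, touching each source byte once.
 * Bytes are combined explicitly, so the buffers may start at any address,
 * odd ones included.
 */
uint16_t csum_and_copy(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		d[i] = s[i];
		d[i + 1] = s[i + 1];
		sum += ((uint32_t)s[i] << 8) | s[i + 1];
		sum = (sum & 0xffff) + (sum >> 16);	/* fold the carry back in */
	}
	if (i < len) {				/* odd trailing byte, zero padded */
		d[i] = s[i];
		sum += (uint32_t)s[i] << 8;
	}
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Whether the combined loop actually beats a plain copy followed by a separate checksum pass is exactly what the replies below argue about; the sketch only shows what "combined" means.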
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-18 23:00 ` David S. Miller
@ 2002-09-18 23:54 ` Alan Cox
2002-09-21 11:56 ` Pavel Machek
1 sibling, 0 replies; 87+ messages in thread
From: Alan Cox @ 2002-09-18 23:54 UTC (permalink / raw)
To: David S. Miller; +Cc: taka, neilb, linux-kernel, nfs

On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86. I do not remember all

The better was a freak of PPro/PII scheduling I think

> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

The "same as" bit is easy to explain. It's totally memory bandwidth limited
on current x86-32 processors. (Although I'd welcome demonstrations to
the contrary on newer toys)

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-18 23:54 ` Alan Cox
(?)
@ 2002-09-19 0:16 ` Andrew Morton
2002-09-19 2:13 ` Aaron Lehmann
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
-1 siblings, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2002-09-19 0:16 UTC (permalink / raw)
To: Alan Cox; +Cc: David S. Miller, taka, neilb, linux-kernel, nfs

Alan Cox wrote:
>
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86. I do not remember all
>
> The better was a freak of PPro/PII scheduling I think
>
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
>
> The "same as" bit is easy to explain. It's totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)

Nope. There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:

copy_32 uses movsl. copy_duff just uses a stream of "movl"s

Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s
nbytes=10240 from_align=0, to_align=0
copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

OK, movsl wins. But now give the source address 8+1 alignment:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240 from_align=1, to_align=0
copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec

The "movl"-based copy wins. By miles.

Make the source 8+4 aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240 from_align=4, to_align=0
copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec

So movl still beats movsl, by lots.

I have various scriptlets which generate the entire matrix.

I think I ended up deciding that we should use movsl _only_
when both src and dst are 8-byte-aligned. And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.

It needs redoing. These differences are really big, and this
is the kernel's most expensive function.

A little project for someone.

The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

^ permalink raw reply [flat|nested] 87+ messages in thread
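Editorial note: the measurement above varies only the byte offset of the source (and optionally the destination) from an 8-byte boundary. The sketch below shows the shape of such a harness in portable C. It is not cptimer: the function name time_copy() is hypothetical, it times the C library's memcpy() rather than hand-written movsl or movl loops, it does nothing to keep the buffers uncached, and it assumes malloc() returns memory aligned to at least 8 bytes.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Time `iters` copies of nbytes bytes, with source and destination offset
 * by from_align and to_align bytes (0..7) from an 8-byte boundary.
 * Returns the throughput in Mbytes/sec, or -1.0 if allocation fails.
 */
double time_copy(size_t nbytes, size_t from_align, size_t to_align, int iters)
{
	/* Assumption: malloc() memory is at least 8-byte aligned. */
	unsigned char *src = malloc(nbytes + 8);
	unsigned char *dst = malloc(nbytes + 8);
	volatile unsigned char sink = 0;
	clock_t start, ticks;
	double seconds;
	int i;

	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}
	memset(src, 0x5a, nbytes + 8);
	memset(dst, 0, nbytes + 8);

	start = clock();
	for (i = 0; i < iters; i++) {
		memcpy(dst + (to_align & 7), src + (from_align & 7), nbytes);
		sink ^= dst[to_align & 7];	/* keep the copy from being elided */
	}
	ticks = clock() - start;

	free(src);
	free(dst);
	(void)sink;
	if (ticks <= 0)
		ticks = 1;			/* timer too coarse; avoid dividing by 0 */
	seconds = (double)ticks / CLOCKS_PER_SEC;
	return (double)nbytes * iters / (1024.0 * 1024.0) / seconds;
}
```

Calling time_copy(10240, 0, 0, n) and time_copy(10240, 1, 0, n) mirrors the first two runs above. The numbers it reports depend on the C library and CPU, and a modern memcpy() may well hide the alignment effect that the 2002 figures show.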
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 0:16 ` Andrew Morton
@ 2002-09-19 2:13 ` Aaron Lehmann
2002-09-19 3:30 ` Andrew Morton
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
1 sibling, 1 reply; 87+ messages in thread
From: Aaron Lehmann @ 2002-09-19 2:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, nfs

> akpm:/usr/src/cptimer> ./cptimer -d -s
> nbytes=10240 from_align=0, to_align=0
> copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

It's disappointing that this program doesn't seem to support
benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
Those seem to be the more interesting memcpy functions on modern
systems.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 2:13 ` Aaron Lehmann
@ 2002-09-19 3:30 ` Andrew Morton
0 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2002-09-19 3:30 UTC (permalink / raw)
To: Aaron Lehmann; +Cc: linux-kernel, nfs

Aaron Lehmann wrote:
>
> > akpm:/usr/src/cptimer> ./cptimer -d -s
> > nbytes=10240 from_align=0, to_align=0
> > copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> > __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
>
> It's disappointing that this program doesn't seem to support
> benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> Those seem to be the more interesting memcpy functions on modern
> systems.

Well the source is there, and the licensing terms are most reasonable.

But then, the source was there eighteen months ago and nothing happened.
Sigh.

I think in-kernel MMX has fatal drawbacks anyway. Not sure what
they are - I prefer to pretend that x86 CPUs execute raw C.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 3:30 ` Andrew Morton
@ 2002-09-19 10:42 ` Alan Cox
-1 siblings, 0 replies; 87+ messages in thread
From: Alan Cox @ 2002-09-19 10:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: Aaron Lehmann, linux-kernel, nfs

On Thu, 2002-09-19 at 04:30, Andrew Morton wrote:
> > It's disappointing that this program doesn't seem to support
> > benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> > Those seem to be the more interesting memcpy functions on modern
> > systems.
>
> Well the source is there, and the licensing terms are most reasonable.
>
> But then, the source was there eighteen months ago and nothing happened.
> Sigh.
>
> I think in-kernel MMX has fatal drawbacks anyway. Not sure what
> they are - I prefer to pretend that x86 CPUs execute raw C.

MMX isn't useful for anything smaller than about 512 bytes to 1K. It's not
useful in interrupt handlers. The list goes on.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 0:16 ` Andrew Morton
2002-09-19 2:13 ` Aaron Lehmann
@ 2002-09-19 13:15 ` Hirokazu Takahashi
2002-09-19 20:42 ` Andrew Morton
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-09-19 13:15 UTC (permalink / raw)
To: akpm; +Cc: alan, davem, neilb, linux-kernel, nfs

Hello,

> > > details, but I do know that using copy_from_user() is not a real
> > > improvement at least on x86 architecture.
> >
> > The "same as" bit is easy to explain. It's totally memory bandwidth limited
> > on current x86-32 processors. (Although I'd welcome demonstrations to
> > the contrary on newer toys)
>
> Nope. There are distinct alignment problems with movsl-based
> memcpy on PII and (at least) "Pentium III (Coppermine)", which is
> tested here:
...
> I have various scriptlets which generate the entire matrix.
>
> I think I ended up deciding that we should use movsl _only_
> when both src and dst are 8-byte-aligned. And that when you
> multiply the gain from that by the frequency*size with which
> funny alignments are used by TCP the net gain was 2% or something.

Amazing! I believed 4-byte alignment was enough.
The read/write system calls might also have their penalties reduced.

> It needs redoing. These differences are really big, and this
> is the kernel's most expensive function.
>
> A little project for someone.

OK, if nobody else wants to do it, I'll do it myself.

> The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
@ 2002-09-19 20:42 ` Andrew Morton
2002-09-19 21:12 ` [NFS] " David S. Miller
0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2002-09-19 20:42 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: alan, davem, neilb, linux-kernel, nfs

Hirokazu Takahashi wrote:
>
> ...
> > It needs redoing. These differences are really big, and this
> > is the kernel's most expensive function.
> >
> > A little project for someone.
>
> OK, if there is nobody who wants to do it I'll do it by myself.

That would be fantastic - thanks. This is more a measurement
and testing exercise than a coding one. And if those measurements
are sufficiently nice (eg: >5%) then a 2.4 backport should be done.

It seems that movsl works acceptably with all alignments on AMD
hardware, although this needs to be checked with more recent machines.

movsl is a (bad) loss on PII and PIII for all alignments except 8&8.
Don't know about P4 - I can test that in a day or two.

I expect that a minimal, 90% solution would be just:

fancy_copy_to_user(dst, src, count)
{
	if (arch_has_sane_movsl || ((dst|src) & 7) == 0)
		movsl_copy_to_user(dst, src, count);
	else
		movl_copy_to_user(dst, src, count);
}

and

#ifndef ARCH_HAS_FANCY_COPY_USER
#define fancy_copy_to_user copy_to_user
#endif

and we really only need fancy_copy_to_user in a handful of
places - the bulk copies in networking and filemap.c. For all
the other call sites it's probably more important to keep the
code footprint down than it is to squeeze the last few drops out
of the copy speed.

Mala Anand has done some work on this. See
http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.3/0100.html

<searches>

Yes, I have a copy of Mala's patch here which works against
2.5.current. Mala's patch will cause quite an expansion of
kernel size; we would need an implementation which did not
use inlining. This work was discussed at OLS2002.
See http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz

 uaccess.h | 252 +++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 193 insertions(+), 59 deletions(-)

--- 2.5.25/include/asm-i386/uaccess.h~fast-cu Tue Jul 9 21:34:58 2002
+++ 2.5.25-akpm/include/asm-i386/uaccess.h Tue Jul 9 21:51:03 2002
@@ -253,55 +253,197 @@ do { \
  */
 
 /* Generic arbitrary sized copy. */
-#define __copy_user(to,from,size) \
-do { \
- int __d0, __d1; \
- __asm__ __volatile__( \
- "0: rep; movsl\n" \
- " movl %3,%0\n" \
- "1: rep; movsb\n" \
- "2:\n" \
- ".section .fixup,\"ax\"\n" \
- "3: lea 0(%3,%0,4),%0\n" \
- " jmp 2b\n" \
- ".previous\n" \
- ".section __ex_table,\"a\"\n" \
- " .align 4\n" \
- " .long 0b,3b\n" \
- " .long 1b,2b\n" \
- ".previous" \
- : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
- : "r"(size & 3), "0"(size / 4), "1"(to), "2"(from) \
- : "memory"); \
+#define __copy_user(to,from,size) \
+do { \
+ int __d0, __d1; \
+ __asm__ __volatile__( \
+ " cmpl $63, %0\n" \
+ " jbe 5f\n" \
+ " mov %%esi, %%eax\n" \
+ " test $7, %%al\n" \
+ " jz 5f\n" \
+ " .align 2,0x90\n" \
+ "0: movl 32(%4), %%eax\n" \
+ " cmpl $67, %0\n" \
+ " jbe 1f\n" \
+ " movl 64(%4), %%eax\n" \
+ " .align 2,0x90\n" \
+ "1: movl 0(%4), %%eax\n" \
+ " movl 4(%4), %%edx\n" \
+ "2: movl %%eax, 0(%3)\n" \
+ "21: movl %%edx, 4(%3)\n" \
+ " movl 8(%4), %%eax\n" \
+ " movl 12(%4),%%edx\n" \
+ "3: movl %%eax, 8(%3)\n" \
+ "31: movl %%edx, 12(%3)\n" \
+ " movl 16(%4), %%eax\n" \
+ " movl 20(%4), %%edx\n" \
+ "4: movl %%eax, 16(%3)\n" \
+ "41: movl %%edx, 20(%3)\n" \
+ " movl 24(%4), %%eax\n" \
+ " movl 28(%4), %%edx\n" \
+ "10: movl %%eax, 24(%3)\n" \
+ "51: movl %%edx, 28(%3)\n" \
+ " movl 32(%4), %%eax\n" \
+ " movl 36(%4), %%edx\n" \
+ "11: movl %%eax, 32(%3)\n" \
+ "61: movl %%edx, 36(%3)\n" \
+ " movl 40(%4), %%eax\n" \
+ " movl 44(%4), %%edx\n" \
+ "12: movl %%eax, 40(%3)\n" \
+ "71: movl %%edx, 44(%3)\n" \
+ " movl 48(%4), %%eax\n" \
+ " movl 52(%4), %%edx\n" \
+ "13: movl %%eax, 48(%3)\n" \
+ "81: movl %%edx, 52(%3)\n" \
+ " movl 56(%4), %%eax\n" \
+ " movl 60(%4), %%edx\n" \
+ "14: movl %%eax, 56(%3)\n" \
+ "91: movl %%edx, 60(%3)\n" \
+ " addl $-64, %0\n" \
+ " addl $64, %4\n" \
+ " addl $64, %3\n" \
+ " cmpl $63, %0\n" \
+ " ja 0b\n" \
+ "5: movl %0, %%eax\n" \
+ " shrl $2, %0\n" \
+ " andl $3, %%eax\n" \
+ " cld\n" \
+ "6: rep; movsl\n" \
+ " movl %%eax, %0\n" \
+ "7: rep; movsb\n" \
+ "8:\n" \
+ ".section .fixup,\"ax\"\n" \
+ "9: lea 0(%%eax,%0,4),%0\n" \
+ " jmp 8b\n" \
+ "15: movl %6, %0\n" \
+ " jmp 8b\n" \
+ ".previous\n" \
+ ".section __ex_table,\"a\"\n" \
+ " .align 4\n" \
+ " .long 2b,15b\n" \
+ " .long 21b,15b\n" \
+ " .long 3b,15b\n" \
+ " .long 31b,15b\n" \
+ " .long 4b,15b\n" \
+ " .long 41b,15b\n" \
+ " .long 10b,15b\n" \
+ " .long 51b,15b\n" \
+ " .long 11b,15b\n" \
+ " .long 61b,15b\n" \
+ " .long 12b,15b\n" \
+ " .long 71b,15b\n" \
+ " .long 13b,15b\n" \
+ " .long 81b,15b\n" \
+ " .long 14b,15b\n" \
+ " .long 91b,15b\n" \
+ " .long 6b,9b\n" \
+ " .long 7b,8b\n" \
+ ".previous" \
+ : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
+ : "1"(to), "2"(from), "0"(size),"i"(-EFAULT) \
+ : "eax", "edx", "memory"); \
 } while (0)
 
-#define __copy_user_zeroing(to,from,size) \
-do { \
- int __d0, __d1; \
- __asm__ __volatile__( \
- "0: rep; movsl\n" \
- " movl %3,%0\n" \
- "1: rep; movsb\n" \
- "2:\n" \
- ".section .fixup,\"ax\"\n" \
- "3: lea 0(%3,%0,4),%0\n" \
- "4: pushl %0\n" \
- " pushl %%eax\n" \
- " xorl %%eax,%%eax\n" \
- " rep; stosb\n" \
- " popl %%eax\n" \
- " popl %0\n" \
- " jmp 2b\n" \
- ".previous\n" \
- ".section __ex_table,\"a\"\n" \
- " .align 4\n" \
- " .long 0b,3b\n" \
- " .long 1b,4b\n" \
- ".previous" \
- : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
- : "r"(size & 3), "0"(size / 4), "1"(to), "2"(from) \
- : "memory"); \
-} while (0)
+#define __copy_user_zeroing(to,from,size) \
+do { \
+ int __d0, __d1; \
+ __asm__ __volatile__( \
+ " cmpl $63, %0\n" \
+ " jbe 5f\n" \
+ " movl %%edi, %%eax\n" \
+ " test $7, %%al\n" \
+ " jz 5f\n" \
+ " .align 2,0x90\n" \
+ "0: movl 32(%4), %%eax\n" \
+ " cmpl $67, %0\n" \
+ " jbe 2f\n" \
+ "1: movl 64(%4), %%eax\n" \
+ " .align 2,0x90\n" \
+ "2: movl 0(%4), %%eax\n" \
+ "21: movl 4(%4), %%edx\n" \
+ " movl %%eax, 0(%3)\n" \
+ " movl %%edx, 4(%3)\n" \
+ "3: movl 8(%4), %%eax\n" \
+ "31: movl 12(%4),%%edx\n" \
+ " movl %%eax, 8(%3)\n" \
+ " movl %%edx, 12(%3)\n" \
+ "4: movl 16(%4), %%eax\n" \
+ "41: movl 20(%4), %%edx\n" \
+ " movl %%eax, 16(%3)\n" \
+ " movl %%edx, 20(%3)\n" \
+ "10: movl 24(%4), %%eax\n" \
+ "51: movl 28(%4), %%edx\n" \
+ " movl %%eax, 24(%3)\n" \
+ " movl %%edx, 28(%3)\n" \
+ "11: movl 32(%4), %%eax\n" \
+ "61: movl 36(%4), %%edx\n" \
+ " movl %%eax, 32(%3)\n" \
+ " movl %%edx, 36(%3)\n" \
+ "12: movl 40(%4), %%eax\n" \
+ "71: movl 44(%4), %%edx\n" \
+ " movl %%eax, 40(%3)\n" \
+ " movl %%edx, 44(%3)\n" \
+ "13: movl 48(%4), %%eax\n" \
+ "81: movl 52(%4), %%edx\n" \
+ " movl %%eax, 48(%3)\n" \
+ " movl %%edx, 52(%3)\n" \
+ "14: movl 56(%4), %%eax\n" \
+ "91: movl 60(%4), %%edx\n" \
+ " movl %%eax, 56(%3)\n" \
+ " movl %%edx, 60(%3)\n" \
+ " addl $-64, %0\n" \
+ " addl $64, %4\n" \
+ " addl $64, %3\n" \
+ " cmpl $63, %0\n" \
+ " ja 0b\n" \
+ "5: movl %0, %%eax\n" \
+ " shrl $2, %0\n" \
+ " andl $3, %%eax\n" \
+ " cld\n" \
+ "6: rep; movsl\n" \
+ " movl %%eax,%0\n" \
+ "7: rep; movsb\n" \
+ "8:\n" \
+ ".section .fixup,\"ax\"\n" \
+ "9: lea 0(%%eax,%0,4),%0\n" \
+ "16: pushl %0\n" \
+ " pushl %%eax\n" \
+ " xorl %%eax,%%eax\n" \
+ " rep; stosb\n" \
+ " popl %%eax\n" \
+ " popl %0\n" \
+ " jmp 8b\n" \
+ "15: movl %6, %0\n" \
+ " jmp 8b\n" \
+ ".previous\n" \
+ ".section __ex_table,\"a\"\n" \
+ " .align 4\n" \
+ " .long 0b,16b\n" \
+ " .long 1b,16b\n" \
+ " .long 2b,16b\n" \
+ " .long 21b,16b\n" \
+ " .long 3b,16b\n" \
+ " .long 31b,16b\n" \
+ " .long 4b,16b\n" \
+ " .long 41b,16b\n" \
+ " .long 10b,16b\n" \
+ " .long 51b,16b\n" \
+ " .long 11b,16b\n" \
+ " .long 61b,16b\n" \
+ " .long 12b,16b\n" \
+ " .long 71b,16b\n" \
+ " .long 13b,16b\n" \
+ " .long 81b,16b\n" \
+ " .long 14b,16b\n" \
+ " .long 91b,16b\n" \
+ " .long 6b,9b\n" \
+ " .long 7b,16b\n" \
+ ".previous" \
+ : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
+ : "1"(to), "2"(from), "0"(size),"i"(-EFAULT) \
+ : "eax", "edx", "memory"); \
+ } while (0)
 
 /* We let the __ versions of copy_from/to_user inline, because they're often
  * used in fast paths and have only a small space overhead.
@@ -578,24 +720,16 @@ __constant_copy_from_user_nocheck(void *
 }
 
 #define copy_to_user(to,from,n) \
- (__builtin_constant_p(n) ? \
- __constant_copy_to_user((to),(from),(n)) : \
- __generic_copy_to_user((to),(from),(n)))
+ __generic_copy_to_user((to),(from),(n))
 
 #define copy_from_user(to,from,n) \
- (__builtin_constant_p(n) ? \
- __constant_copy_from_user((to),(from),(n)) : \
- __generic_copy_from_user((to),(from),(n)))
+ __generic_copy_from_user((to),(from),(n))
 
 #define __copy_to_user(to,from,n) \
- (__builtin_constant_p(n) ? \
- __constant_copy_to_user_nocheck((to),(from),(n)) : \
- __generic_copy_to_user_nocheck((to),(from),(n)))
+ __generic_copy_to_user_nocheck((to),(from),(n))
 
 #define __copy_from_user(to,from,n) \
- (__builtin_constant_p(n) ? \
- __constant_copy_from_user_nocheck((to),(from),(n)) : \
- __generic_copy_from_user_nocheck((to),(from),(n)))
+ __generic_copy_from_user_nocheck((to),(from),(n))
 
 long strncpy_from_user(char *dst, const char *src, long count);
 long __strncpy_from_user(char *dst, const char *src, long count);

-

^ permalink raw reply [flat|nested] 87+ messages in thread
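Editorial note: the "minimal, 90% solution" proposed in this message is an alignment test that picks one of two copy routines. The userspace sketch below shows that dispatch so the condition can be exercised. It is an illustration under assumptions, not kernel code: fancy_copy(), unrolled_copy(), enum copy_path and the arch_has_sane_string_copy flag are hypothetical names, memcpy() stands in for the "rep; movsl" path, and a simple word loop stands in for the unrolled movl path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum copy_path { COPY_PATH_STRING, COPY_PATH_UNROLLED };

/* Assumption: set to 1 on CPUs whose string copy copes with any alignment. */
int arch_has_sane_string_copy = 0;

/* Stand-in for the movl-based loop: copy 32-bit words, then the tail bytes. */
static void unrolled_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
	while (n >= sizeof(uint32_t)) {
		uint32_t word;

		memcpy(&word, src, sizeof(word));	/* alignment-safe load */
		memcpy(dst, &word, sizeof(word));	/* alignment-safe store */
		src += sizeof(word);
		dst += sizeof(word);
		n -= sizeof(word);
	}
	while (n--)
		*dst++ = *src++;
}

/* Copy count bytes and report which path the alignment test selected. */
enum copy_path fancy_copy(void *dst, const void *src, size_t count)
{
	if (arch_has_sane_string_copy ||
	    (((uintptr_t)dst | (uintptr_t)src) & 7) == 0) {
		memcpy(dst, src, count);	/* stand-in for "rep; movsl" */
		return COPY_PATH_STRING;
	}
	unrolled_copy(dst, src, count);
	return COPY_PATH_UNROLLED;
}
```

OR-ing the two addresses before masking with 7 tests both pointers at once: the result is zero only when source and destination are each 8-byte aligned, which is the "8&8" case the measurements favour for movsl.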
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 20:42 ` Andrew Morton
@ 2002-09-19 21:12 ` David S. Miller
0 siblings, 0 replies; 87+ messages in thread
From: David S. Miller @ 2002-09-19 21:12 UTC (permalink / raw)
To: akpm; +Cc: taka, alan, neilb, linux-kernel, nfs

   From: Andrew Morton <akpm@digeo.com>
   Date: Thu, 19 Sep 2002 13:42:13 -0700

   Mala's patch will cause quite an expansion of
   kernel size; we would need an implementation which did not
   use inlining.

It definitely belongs in arch/i386/lib/copy.c or whatever,
not inlined.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-18 23:00 ` David S. Miller
@ 2002-09-21 11:56 ` Pavel Machek
2002-09-21 11:56 ` Pavel Machek
1 sibling, 0 replies; 87+ messages in thread
From: Pavel Machek @ 2002-09-21 11:56 UTC (permalink / raw)
To: David S. Miller; +Cc: taka, neilb, linux-kernel, nfs

Hi!

> 1)
> ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
> This patch enables HW-checksum against outgoing packets including UDP frames.
>
> Can you explain the TCP parts? They look very wrong.
>
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86. I do not remember all
> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

Well, if this is the case, we need to

#define copy_from_user csum_and_copy_from_user

:-).

Pavel
--
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-18 8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
2002-09-18 23:00 ` David S. Miller
@ 2002-10-14 5:50 ` Neil Brown
2002-10-14 6:15 ` David S. Miller
` (2 more replies)
1 sibling, 3 replies; 87+ messages in thread
From: Neil Brown @ 2002-10-14 5:50 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: David S. Miller, linux-kernel, nfs

On Wednesday September 18, taka@valinux.co.jp wrote:
> Hello,
>
> I ported the zerocopy NFS patches against linux-2.5.36.
>

hi,
I finally got around to looking at this.
It looks good.

However it really needs the MSG_MORE support for udp_sendmsg to be
accepted before there is any point merging the rpc/nfsd bits.

Would you like to see if davem is happy with that bit first and get
it in? Then I will be happy to forward the nfsd specific bit.

The bit I'm not very sure about is the 'shadowsock' patch for having
several xmit sockets, one per CPU. What sort of speedup do you get
from this? How important is it really?

NeilBrown

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-14 5:50 ` Neil Brown
@ 2002-10-14 6:15 ` David S. Miller
2002-10-14 10:45 ` kuznet
2002-10-14 12:01 ` Hirokazu Takahashi
2002-10-18 13:11 ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
2 siblings, 1 reply; 87+ messages in thread
From: David S. Miller @ 2002-10-14 6:15 UTC (permalink / raw)
To: neilb; +Cc: taka, linux-kernel, nfs, kuznet

   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Mon, 14 Oct 2002 15:50:02 +1000

   Would you like to see if davem is happy with that bit first and get
   it in? Then I will be happy to forward the nfsd specific bit.

Alexey is working on this, or at least he was. :-)
(Alexey this is about the UDP cork changes)

   The bit I'm not very sure about is the 'shadowsock' patch for having
   several xmit sockets, one per CPU. What sort of speedup do you get
   from this? How important is it really?

Personally, it seems rather essential for scalability on SMP.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-14 6:15 ` David S. Miller
@ 2002-10-14 10:45 ` kuznet
2002-10-14 10:48 ` David S. Miller
0 siblings, 1 reply; 87+ messages in thread
From: kuznet @ 2002-10-14 10:45 UTC (permalink / raw)
To: David S. Miller; +Cc: neilb, taka, linux-kernel, nfs

Hello!

> Alexey is working on this, or at least he was. :-)
> (Alexey this is about the UDP cork changes)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Alexey

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-14 10:45 ` kuznet
@ 2002-10-14 10:48 ` David S. Miller
0 siblings, 0 replies; 87+ messages in thread
From: David S. Miller @ 2002-10-14 10:48 UTC (permalink / raw)
To: kuznet; +Cc: neilb, taka, linux-kernel, nfs
   From: kuznet@ms2.inr.ac.ru
   Date: Mon, 14 Oct 2002 14:45:33 +0400 (MSD)
   I took two patches of the batch:
   va10-hwchecksum-2.5.36.patch
   va11-udpsendfile-2.5.36.patch
   I did not worry about the rest i.e. sunrpc/* part.
Neil and the NFS folks can take care of those parts once the generic
UDP parts are in.
So, no worries.
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-14 5:50 ` Neil Brown
2002-10-14 6:15 ` David S. Miller
@ 2002-10-14 12:01 ` Hirokazu Takahashi
2002-10-14 14:12 ` Andrew Theurer
2002-10-16 3:44 ` Neil Brown
2002-10-18 13:11 ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
2 siblings, 2 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-14 12:01 UTC (permalink / raw)
To: neilb; +Cc: davem, linux-kernel, nfs
Hello, Neil
> > I ported the zerocopy NFS patches against linux-2.5.36.
>
> hi,
> I finally got around to looking at this.
> It looks good.
Thanks!
> However it really needs the MSG_MORE support for udp_sendmsg to be
> accepted before there is any point merging the rpc/nfsd bits.
>
> Would you like to see if davem is happy with that bit first and get
> it in? Then I will be happy to forward the nfsd specific bit.
Yes.
> The bit I'm not very sure about is the 'shadowsock' patch for having
> several xmit sockets, one per CPU. What sort of speedup do you get
> from this? How important is it really?
It's not so important.
davem> Personally, it seems rather essential for scalability on SMP.
Yes.
It will be effective on large scale SMP machines as all kNFSd shares
one NFS port. A udp socket can't send data on each CPU at the same
time while MSG_MORE/UDP_CORK options are set.
The UDP socket has to block any other requests while making a UDP frame.
Thank you,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 87+ messages in thread
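[Editor's illustration, not part of the thread or of the patches.] The
serialization described above can be seen from userspace on a current
Linux host, where MSG_MORE on UDP sockets is available. The sketch below
is only a loopback demonstration under that assumption: the helper name
send_corked is made up for this note, and the 0x8000 fallback is the Linux
value of MSG_MORE, used in case the Python build does not export the
constant. While the datagram is pending, nothing else can be sent on that
socket, which is why a single shared socket serializes the nfsd threads.

```python
import socket

# Linux value of MSG_MORE from <linux/socket.h>; fallback in case this
# Python build does not export socket.MSG_MORE (assumption: Linux host).
MSG_MORE = getattr(socket, "MSG_MORE", 0x8000)


def send_corked(parts, timeout=2.0):
    """Send several byte strings as ONE UDP datagram over loopback.

    Every part except the last is sent with MSG_MORE, so the kernel keeps
    appending to the pending datagram; the final send without the flag
    pushes it out. Returns the single datagram seen by the receiver.
    """
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(timeout)
        sender.connect(receiver.getsockname())
        for part in parts[:-1]:
            sender.send(part, MSG_MORE)   # datagram stays pending on this socket
        sender.send(parts[-1])            # no flag: the frame is completed and sent
        return receiver.recv(65535)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(send_corked([b"rpc-header|", b"page-1|", b"page-2"]))
```

The three pieces stand in for an RPC header followed by page data; the
receiver sees them as one datagram.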
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-14 12:01 ` Hirokazu Takahashi
@ 2002-10-14 14:12 ` Andrew Theurer
2002-10-16 3:44 ` Neil Brown
1 sibling, 0 replies; 87+ messages in thread
From: Andrew Theurer @ 2002-10-14 14:12 UTC (permalink / raw)
To: neilb, Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs
> Hello, Neil
>
> > > I ported the zerocopy NFS patches against linux-2.5.36.
> >
> > hi,
> > I finally got around to looking at this.
> > It looks good.
>
> Thanks!
>
> > However it really needs the MSG_MORE support for udp_sendmsg to be
> > accepted before there is any point merging the rpc/nfsd bits.
> >
> > Would you like to see if davem is happy with that bit first and get
> > it in? Then I will be happy to forward the nfsd specific bit.
>
> Yes.
>
> > The bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket has to block any other requests while making a UDP frame.
I experienced this exact problem a few months ago. I had a test where
several clients read a file or files cached on a linux server. TCP was just
fine, I could get 100% CPU on all CPUs on the server. TCP zerocopy was even
better, by about 50% throughput. UDP could not get better than 33% CPU, one
CPU working on those UDP requests and I assume a portion of another CPU
handling some interrupt stuff. Essentially 2P and 4P throughput was only as
good as UP throughput. It is essential to get scaling on UDP. That combined
with the UDP zerocopy, we will have one extremely fast NFS server.
Andrew Theurer
IBM LTC
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-14 12:01 ` Hirokazu Takahashi
2002-10-14 14:12 ` Andrew Theurer
@ 2002-10-16 3:44 ` Neil Brown
2002-10-16 4:31 ` David S. Miller
2002-10-16 11:09 ` Hirokazu Takahashi
1 sibling, 2 replies; 87+ messages in thread
From: Neil Brown @ 2002-10-16 3:44 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs
On Monday October 14, taka@valinux.co.jp wrote:
> > The bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket has to block any other requests while making a UDP frame.
>
After thinking about this some more, I suspect it would have to be
quite large scale SMP to get much contention.
The only contention on the udp socket is, as you say, assembling a udp
frame, and I would be surprised if that takes a substantial fraction
of the time to handle a request.
Presumably on a sufficiently large SMP machine that this became an
issue, there would be multiple NICs. Maybe it would make sense to
have one udp socket for each NIC. Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.
NeilBrown
^ permalink raw reply [flat|nested] 87+ messages in thread
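[Editor's illustration, not part of the thread.] The argument above turns
on what fraction of each request is spent assembling the UDP frame under
the socket lock. One rough way to reason about it is Amdahl's law, treating
that fraction as the serialized part of the work. This is a generic model
and not a measurement of nfsd; the fractions in the loop are hypothetical.

```python
def max_speedup(serial_fraction, cpus):
    """Amdahl's law: best-case speedup on `cpus` processors when
    `serial_fraction` of the per-request work must run under one lock."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)


if __name__ == "__main__":
    # Hypothetical serialized fractions, not measured values.
    for fraction in (0.05, 0.25, 0.75, 1.0):
        row = [round(max_speedup(fraction, n), 2) for n in (1, 2, 4, 8)]
        print(fraction, row)
```

Under this model a small serialized fraction costs little on 2 or 4 CPUs,
while a report that 4P throughput equals UP throughput would correspond to
a serialized fraction close to 1.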
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 3:44 ` Neil Brown
@ 2002-10-16 4:31 ` David S. Miller
2002-10-16 15:04 ` Andrew Theurer
2002-10-17 2:03 ` [NFS] " Andrew Theurer
2002-10-16 11:09 ` Hirokazu Takahashi
1 sibling, 2 replies; 87+ messages in thread
From: David S. Miller @ 2002-10-16 4:31 UTC (permalink / raw)
To: neilb; +Cc: taka, linux-kernel, nfs
   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Wed, 16 Oct 2002 13:44:04 +1000
   Presumably on a sufficiently large SMP machine that this became an
   issue, there would be multiple NICs. Maybe it would make sense to
   have one udp socket for each NIC. Would that make sense? or work?
   It feels to me to be cleaner than one for each CPU.
Doesn't make much sense.
Usually we are talking via one IP address, and thus over
one device. It could be using multiple NICs via BONDING,
but that would be transparent to anything at the socket
level.
Really, I think there is real value to making the socket
per-cpu even on a 2 or 4 way system.
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 4:31 ` David S. Miller
@ 2002-10-16 15:04 ` Andrew Theurer
2002-10-17 2:03 ` [NFS] " Andrew Theurer
1 sibling, 0 replies; 87+ messages in thread
From: Andrew Theurer @ 2002-10-16 15:04 UTC (permalink / raw)
To: David S. Miller, neilb; +Cc: taka, linux-kernel, nfs
On Tuesday 15 October 2002 11:31 pm, David S. Miller wrote:
>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
>
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs. Maybe it would make sense to
>    have one udp socket for each NIC. Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device. It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.
I am trying my best today to get a 4 way system up and running for this test.
IMO, per cpu is best.. with just one socket, I seriously could not get over
33% cpu utilization on a 4 way (back in April). With TCP, I could max it
out. I'll update later today hopefully with some promising results.
-Andrew
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 4:31 ` David S. Miller
2002-10-16 15:04 ` Andrew Theurer
@ 2002-10-17 2:03 ` Andrew Theurer
2002-10-17 2:31 ` Hirokazu Takahashi
1 sibling, 1 reply; 87+ messages in thread
From: Andrew Theurer @ 2002-10-17 2:03 UTC (permalink / raw)
To: neilb, David S. Miller; +Cc: taka, linux-kernel, nfs
>    From: Neil Brown <neilb@cse.unsw.edu.au>
>    Date: Wed, 16 Oct 2002 13:44:04 +1000
>
>    Presumably on a sufficiently large SMP machine that this became an
>    issue, there would be multiple NICs. Maybe it would make sense to
>    have one udp socket for each NIC. Would that make sense? or work?
>    It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device. It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.
I am still seeing some sort of problem on an 8 way (hyperthreaded 8
logical/4 physical) on UDP with these patches. I cannot get more than 2
NFSd threads in a run state at one time. TCP usually has 8 or more. The
test involves 40 100Mbit clients reading a 200 MB file on one server (4
acenic adapters) in cache. I am fighting some other issues at the moment
(acpi weirdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
think the results will get better. I'm not sure what other lock or
contention point this is hitting on UDP. If there is anything I can do to
help, please let me know, thanks.
Andrew Theurer
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 2:03 ` [NFS] " Andrew Theurer
@ 2002-10-17 2:31 ` Hirokazu Takahashi
2002-10-17 13:16 ` [NFS] " Andrew Theurer
0 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 2:31 UTC (permalink / raw)
To: habanero; +Cc: neilb, davem, linux-kernel, nfs
Hello,
Thanks for testing my patches.
> I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> logical/4 physical) on UDP with these patches. I cannot get more than 2
> NFSd threads in a run state at one time. TCP usually has 8 or more. The
> test involves 40 100Mbit clients reading a 200 MB file on one server (4
> acenic adapters) in cache. I am fighting some other issues at the moment
> (acpi weirdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> 138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
> 181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
> think the results will get better. I'm not sure what other lock or
> contention point this is hitting on UDP. If there is anything I can do to
> help, please let me know, thanks.
I guess some UDP packets might be lost. It may happen easily as UDP protocol
doesn't support flow control.
Can you check how many errors have happened?
You can see them in /proc/net/snmp of the server and the clients.
And how many threads did you start on your machine?
Buffer size of a UDP socket depends on number of kNFS threads.
Large number of threads might help you.
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-17 13:16 ` Andrew Theurer
0 siblings, 0 replies; 87+ messages in thread
From: Andrew Theurer @ 2002-10-17 13:16 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs
Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
> Hello,
>
> Thanks for testing my patches.
>
> > I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> > logical/4 physical) on UDP with these patches. I cannot get more than 2
> > NFSd threads in a run state at one time. TCP usually has 8 or more. The
> > test involves 40 100Mbit clients reading a 200 MB file on one server (4
> > acenic adapters) in cache. I am fighting some other issues at the moment
> > (acpi weirdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> > 138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
> > 181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
> > think the results will get better. I'm not sure what other lock or
> > contention point this is hitting on UDP. If there is anything I can do to
> > help, please let me know, thanks.
>
> I guess some UDP packets might be lost. It may happen easily as UDP protocol
> doesn't support flow control.
> Can you check how many errors have happened?
> You can see them in /proc/net/snmp of the server and the clients.
server: Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 1000665 41 0 1000666
clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 200403 0 0 200406
(all clients the same)
> And how many threads did you start on your machine?
> Buffer size of a UDP socket depends on number of kNFS threads.
> Large number of threads might help you.
128 threads. client rsize=8196. Server and client MTU is 1500.
Andrew Theurer
^ permalink raw reply [flat|nested] 87+ messages in thread
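[Editor's illustration, not part of the thread.] Counters such as the ones
posted above come from /proc/net/snmp, which is laid out as pairs of
lines: a line of counter names followed by a line of values, both starting
with the protocol name. A minimal sketch of reading that layout follows;
the function name parse_snmp is made up for this note, and the sample text
is the server's Udp pair copied from the message.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: int}}.

    The file comes in pairs of lines: a header line with the counter names
    followed by a line with the values, both prefixed by "Proto:".
    """
    tables = {}
    lines = [line.split() for line in text.splitlines() if line.strip()]
    for names, values in zip(lines[0::2], lines[1::2]):
        if names[0] != values[0] or len(names) != len(values):
            raise ValueError("header/value lines do not match: %r" % names[0])
        proto = names[0].rstrip(":")
        tables[proto] = {n: int(v) for n, v in zip(names[1:], values[1:])}
    return tables


# Server-side Udp counters as posted in the message above.
SERVER_UDP = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 1000665 41 0 1000666
"""

if __name__ == "__main__":
    udp = parse_snmp(SERVER_UDP)["Udp"]
    print("InErrors:", udp["InErrors"], "NoPorts:", udp["NoPorts"])
    # On a live Linux host: parse_snmp(open("/proc/net/snmp").read())
```

An InErrors value of 0 on both sides is what rules out the dropped
datagrams suspected earlier in the thread, at least at the UDP layer.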
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-17 13:26 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 13:26 UTC (permalink / raw)
To: habanero; +Cc: neilb, davem, linux-kernel, nfs
Hi,
> server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> Udp: 1000665 41 0 1000666
> clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> Udp: 200403 0 0 200406
> (all clients the same)
How about IP datagrams? You can see the IP fields in /proc/net/snmp
IP layer may also discard them.
> > And how many threads did you start on your machine?
> > Buffer size of a UDP socket depends on number of kNFS threads.
> > Large number of threads might help you.
>
> 128 threads. client rsize=8196. Server and client MTU is 1500.
It seems enough...
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 13:26 ` [NFS] " Hirokazu Takahashi
(?)
@ 2002-10-17 14:10 ` Andrew Theurer
2002-10-17 16:26 ` [NFS] " Hirokazu Takahashi
-1 siblings, 1 reply; 87+ messages in thread
From: Andrew Theurer @ 2002-10-17 14:10 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs
> Hi,
>
> > server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> > Udp: 1000665 41 0 1000666
> > clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> > Udp: 200403 0 0 200406
> > (all clients the same)
>
> How about IP datagrams? You can see the IP fields in /proc/net/snmp
> IP layer may also discard them.
Server:
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000
A Client:
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0
Andrew Theurer
^ permalink raw reply [flat|nested] 87+ messages in thread
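[Editor's illustration, not part of the thread.] One way to sanity-check
the Ip counters above is to compare them with the number of IP fragments a
single READ reply should need at the stated rsize and MTU. The sketch
assumes IPv4 with a 20-byte header and no options; RPC_OVERHEAD is a
guessed figure for the RPC/NFS reply header, not a value taken from the
thread.

```python
import math


def fragments_per_datagram(udp_payload, mtu=1500, ip_header=20, udp_header=8):
    """Number of IPv4 fragments needed to carry one UDP datagram.

    Every fragment except the last carries a multiple of 8 bytes, so the
    usable data per fragment is rounded down to that boundary.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((udp_payload + udp_header) / per_fragment)


RPC_OVERHEAD = 128  # assumed size of the RPC/NFS reply header, in bytes

if __name__ == "__main__":
    # Expected fragments for one reply at rsize=8196 and MTU 1500.
    print(fragments_per_datagram(8196 + RPC_OVERHEAD))
    # Observed on the server: FragCreates / Udp OutDatagrams.
    print(6000000 / 1000666)
    # Observed on a client: ReasmReqds / ReasmOKs.
    print(1200000 / 200008)
```

Both observed ratios come out very close to the expected 6 fragments per
datagram, and ReasmFails is 0 on the client, so the counters show no sign
of fragments being lost on the way.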
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-17 16:26 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 16:26 UTC (permalink / raw)
To: habanero; +Cc: neilb, davem, linux-kernel, nfs
Hi,
> > How about IP datagrams? You can see the IP fields in /proc/net/snmp
> > IP layer may also discard them.
>
> Server:
>
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000
>
> A Client:
>
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0
It looks fine.
Hmmm.... What version of linux do you use?
Congestion avoidance mechanism of NFS clients might cause this situation.
I think the congestion window size is not enough for high end machines.
You can make the window be larger as a test.
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 16:26 ` [NFS] " Hirokazu Takahashi
(?)
@ 2002-10-18 5:38 ` Trond Myklebust
2002-10-18 7:19 ` Hirokazu Takahashi
-1 siblings, 1 reply; 87+ messages in thread
From: Trond Myklebust @ 2002-10-18 5:38 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: habanero, neilb, davem, linux-kernel, nfs
>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:
     > Congestion avoidance mechanism of NFS clients might cause this
     > situation. I think the congestion window size is not enough
     > for high end machines. You can make the window be larger as a
     > test.
The congestion avoidance window is supposed to adapt to the bandwidth
that is available. Turn congestion avoidance off if you like, but my
experience is that doing so tends to seriously degrade performance as
the number of timeouts + resends skyrockets.
Cheers,
  Trond
^ permalink raw reply [flat|nested] 87+ messages in thread
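[Editor's illustration, not part of the thread.] The idea of a window
that adapts to the available bandwidth can be shown with a toy
additive-increase/multiplicative-decrease loop. This is a generic
textbook scheme and is not claimed to be what the sunrpc client code
actually does; the function name, the growth rule, and the max_cwnd cap
are all assumptions made for this note.

```python
def aimd_window(outcomes, cwnd=1.0, max_cwnd=16.0):
    """Toy additive-increase/multiplicative-decrease window.

    outcomes: iterable of booleans, True for a reply received in time,
    False for a timeout. Returns the window size after each event.
    """
    history = []
    for replied in outcomes:
        if replied:
            # Grow by roughly one slot per full window of replies.
            cwnd = min(max_cwnd, cwnd + 1.0 / cwnd)
        else:
            # Back off sharply on a timeout, but never below one request.
            cwnd = max(1.0, cwnd / 2.0)
        history.append(cwnd)
    return history


if __name__ == "__main__":
    # 30 timely replies, one timeout, then 10 more timely replies.
    events = [True] * 30 + [False] + [True] * 10
    print([round(w, 2) for w in aimd_window(events)])
```

With no timeouts the window climbs until it reaches max_cwnd and then
stays there. A fixed cap of that kind is the sort of limit a very fast
server and network could run into.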
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-18 5:38 ` Trond Myklebust
@ 2002-10-18 7:19 ` Hirokazu Takahashi
2002-10-18 15:12 ` [NFS] " Andrew Theurer
0 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-18 7:19 UTC (permalink / raw)
To: trond.myklebust; +Cc: habanero, neilb, davem, linux-kernel, nfs
Hello,
> > Congestion avoidance mechanism of NFS clients might cause this
> > situation. I think the congestion window size is not enough
> > for high end machines. You can make the window be larger as a
> > test.
>
> The congestion avoidance window is supposed to adapt to the bandwidth
> that is available. Turn congestion avoidance off if you like, but my
> experience is that doing so tends to seriously degrade performance as
> the number of timeouts + resends skyrockets.
Yes, you must be right.
But I guess Andrew may use a great machine so that the transfer rate
has exceeded the maximum size of the congestion avoidance window.
Can we determine preferable maximum window size dynamically?
Thank you,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-18 15:12 ` Andrew Theurer
0 siblings, 0 replies; 87+ messages in thread
From: Andrew Theurer @ 2002-10-18 15:12 UTC (permalink / raw)
To: trond.myklebust, Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation. I think the congestion window size is not enough
> > > for high end machines. You can make the window be larger as a
> > > test.
> >
> > The congestion avoidance window is supposed to adapt to the bandwidth
> > that is available. Turn congestion avoidance off if you like, but my
> > experience is that doing so tends to seriously degrade performance as
> > the number of timeouts + resends skyrockets.
>
> Yes, you must be right.
>
> But I guess Andrew may use a great machine so that the transfer rate
> has exceeded the maximum size of the congestion avoidance window.
> Can we determine preferable maximum window size dynamically?
Is this a concern on the client only? I can run a test with just one client
and see if I can saturate the 100Mbit adapter. If I can, would we need to
make any adjustments then? FYI, at 115 MB/sec total throughput, that's only
2.875 MB/sec for each of the 40 clients. For the TCP result of 181 MB/sec,
that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
100Mbit client.
Andrew Theurer
^ permalink raw reply [flat|nested] 87+ messages in thread
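[Editor's illustration, not part of the thread.] The per-client figures
quoted above follow from dividing the aggregate throughput by the 40
clients. The sketch below redoes that arithmetic and compares the result
with the raw rate of a 100 Mbit link. It ignores protocol overhead and
treats MB as 10^6 bytes, which is an assumption; the helper name is made
up for this note.

```python
def per_client_share(total_mb_per_sec, clients, link_mbit=100):
    """Return (MB/sec per client, fraction of the client's raw link rate)."""
    per_client = total_mb_per_sec / clients
    link_mb_per_sec = link_mbit / 8   # 100 Mbit/s is 12.5 MB/s before overhead
    return per_client, per_client / link_mb_per_sec


if __name__ == "__main__":
    for label, total in (("UDP", 115), ("TCP", 181)):
        share, fraction = per_client_share(total, 40)
        print(f"{label}: {share:.3f} MB/sec per client, {fraction:.0%} of link rate")
```

Each client is therefore using well under half of its link in both runs,
which is why the clients' network adapters are not the suspected limit.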
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-19 20:34 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-19 20:34 UTC (permalink / raw)
To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs
Hello,
> > Congestion avoidance mechanism of NFS clients might cause this
> > situation. I think the congestion window size is not enough
> > for high end machines. You can make the window be larger as a
> > test.
> Is this a concern on the client only? I can run a test with just one client
> and see if I can saturate the 100Mbit adapter. If I can, would we need to
> make any adjustments then? FYI, at 115 MB/sec total throughput, that's only
> 2.875 MB/sec for each of the 40 clients. For the TCP result of 181 MB/sec,
> that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
> 100Mbit client.
I think it's a client issue. NFS servers don't care about congestion of UDP
traffic and they will try to respond to all NFS requests as fast as they can.
You can try to increase the number of clients or the number of mount points
for a test. It's easy to mount the same directory of the server on some
directories of the client so that each of them can work simultaneously.
# mount -t nfs server:/foo /baa1
# mount -t nfs server:/foo /baa2
# mount -t nfs server:/foo /baa3
Thank you,
Hirokazu Takahashi.
^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-22 21:16 ` Andrew Theurer
0 siblings, 0 replies; 87+ messages in thread
From: Andrew Theurer @ 2002-10-22 21:16 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

On Saturday 19 October 2002 15:34, Hirokazu Takahashi wrote:
> Hello,
>
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation. I think the congestion window size is not enough
> > > for high end machines. You can make the window be larger as a
> > > test.
> >
> > Is this a concern on the client only? I can run a test with just one
> > client and see if I can saturate the 100Mbit adapter. If I can, would we
> > need to make any adjustments then? FYI, at 115 MB/sec total throughput,
> > that's only 2.875 MB/sec for each of the 40 clients. For the TCP result
> > of 181 MB/sec, that's 4.525 MB/sec, IMO, both of which are comfortable
> > throughputs for a 100Mbit client.
>
> I think it's a client issue. NFS servers don't care about congestion of UDP
> traffic and they will try to respond to all NFS requests as fast as they
> can.
>
> You can try to increase the number of clients or the number of mount points
> for a test. It's easy to mount the same directory of the server on several
> directories of the client so that each of them can work simultaneously.
> # mount -t nfs server:/foo /baa1
> # mount -t nfs server:/foo /baa2
> # mount -t nfs server:/foo /baa3

I don't think it is a client congestion issue at this point. I can run the
test with just one client on UDP and achieve 11.2 MB/sec with just one mount
point. The client has 100 Mbit Ethernet, so that should be the upper limit
(or really close). In the 40 client read test, I have only achieved 2.875
MB/sec per client. That, and the fact that there are never more than 2 nfsd
threads in a run state at one time (for UDP only), leads me to believe there
is still a scaling problem on the server for UDP. I will continue to run the
test and poke and prod around. Hopefully something will jump out at me.
Thanks for all the input!

Andrew Theurer

^ permalink raw reply [flat|nested] 87+ messages in thread
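The arithmetic behind this argument can be checked with a small C sketch. The helper names are hypothetical, and the 12.5 MB/sec ceiling is an assumption (100 Mbit/s divided by 8, ignoring all protocol overhead), not a figure taken from the test setup.

```c
#include <stdio.h>

/* Hypothetical helper: average throughput per client in MB/sec. */
static double per_client_mb(double total_mb_per_sec, int clients)
{
    return total_mb_per_sec / clients;
}

/* Hypothetical helper: raw link ceiling in MB/sec for a speed in Mbit/s.
 * Assumes 8 bits per byte and no protocol overhead. */
static double link_ceiling_mb(double mbit_per_sec)
{
    return mbit_per_sec / 8.0;
}

/* Print the figures quoted in the thread next to the assumed ceiling. */
static void print_summary(void)
{
    double ceiling = link_ceiling_mb(100.0);

    printf("link ceiling:        %.3f MB/sec\n", ceiling);
    printf("single UDP client:   %.1f%% of ceiling\n",
           100.0 * 11.2 / ceiling);
    printf("UDP, 40 clients:     %.3f MB/sec each (%.1f%% of ceiling)\n",
           per_client_mb(115.0, 40),
           100.0 * per_client_mb(115.0, 40) / ceiling);
    printf("TCP, 40 clients:     %.3f MB/sec each\n",
           per_client_mb(181.0, 40));
}
```

Under these assumptions a single client reaches about 90% of its link, while each of the 40 clients in the UDP run uses only about 23%, which is consistent with the suspicion that the clients' links are not the limiting factor.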
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-22 21:16 ` [NFS] " Andrew Theurer
(?)
@ 2002-10-23 9:29 ` Hirokazu Takahashi
2002-10-24 15:32 ` Andrew Theurer
-1 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 9:29 UTC (permalink / raw)
To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hi,

> > > > Congestion avoidance mechanism of NFS clients might cause this
> > > > situation. I think the congestion window size is not enough
> > > > for high end machines. You can make the window be larger as a
> > > > test.

> I don't think it is a client congestion issue at this point. I can run the
> test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> point. The client has 100 Mbit Ethernet, so should be the upper limit (or
> really close). In the 40 client read test, I have only achieved 2.875 MB/sec
> per client. That and the fact that there are never more than 2 nfsd threads
> in a run state at one time (for UDP only) leads me to believe there is still
> a scaling problem on the server for UDP. I will continue to run the test and
> poke and prod around. Hopefully something will jump out at me. Thanks for all
> the input!

Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
been retransmitted?

# cat /proc/net/rpc/nfsd
rc 0 27680 162118
^^^
This field means the clients have retransmitted packets.
The transmission rate will slow down once that has happened.
It may occur if the response from the server is slower than the
clients expect.

You can also use an older version - e.g. the linux-2.4 series - for the
clients and see what happens, as older versions don't have any intelligent
features.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-23 9:29 ` Hirokazu Takahashi
@ 2002-10-24 15:32 ` Andrew Theurer
2002-10-27 11:10 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Andrew Theurer @ 2002-10-24 15:32 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

> > I don't think it is a client congestion issue at this point. I can run the
> > test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> > point. The client has 100 Mbit Ethernet, so should be the upper limit (or
> > really close). In the 40 client read test, I have only achieved 2.875 MB/sec
> > per client. That and the fact that there are never more than 2 nfsd threads
> > in a run state at one time (for UDP only) leads me to believe there is still
> > a scaling problem on the server for UDP. I will continue to run the test and
> > poke and prod around. Hopefully something will jump out at me. Thanks for all
> > the input!
>
> Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
> been retransmitted?
>
> # cat /proc/net/rpc/nfsd
> rc 0 27680 162118
> ^^^
> This field means the clients have retransmitted packets.
> The transmission rate will slow down once that has happened.
> It may occur if the response from the server is slower than the
> clients expect.

/proc/net/rpc/nfsd
rc 0 1 1025221

> You can also use an older version - e.g. the linux-2.4 series - for the
> clients and see what happens, as older versions don't have any intelligent
> features.

Actually all of the clients are 2.4 (RH 7.0). I could change them out to
2.5, but it may take me a little while.

Let me do a little digging around. I seem to recall an issue I had earlier
this year when waking up the nfsd threads and having most of them just go
back to sleep. I need to go back to that code and understand it a little
better. Thanks for all of your help.

Andrew Theurer

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-24 15:32 ` Andrew Theurer
@ 2002-10-27 11:10 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-27 11:10 UTC (permalink / raw)
To: habanero; +Cc: nfs

Hi,

>> Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
>> been retransmitted?

You can also check the client side.

/proc/net/rpc/nfs
net 0 0 0 0
rpc 191035 4339 0
           ^^^^
This field shows us how many times the client has retransmitted RPC
requests.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply [flat|nested] 87+ messages in thread
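As a rough illustration of how the client-side counter above could be turned into a retransmission ratio, here is a minimal C sketch. It assumes the `rpc` line has the layout `rpc <calls> <retrans> <authrefresh>`; only the meaning of the second number comes from the message above, the names of the first and third fields and all identifiers in the code are assumptions made for this example.

```c
#include <stdio.h>

/* Hypothetical container for the three numbers on the "rpc" line. */
struct rpc_client_stats {
    unsigned long calls;       /* assumed: total RPC calls issued */
    unsigned long retrans;     /* retransmitted RPC requests */
    unsigned long authrefresh; /* assumed: credential refreshes */
};

/* Parse one line of /proc/net/rpc/nfs.
 * Returns 0 if it is an "rpc" line with three numbers, -1 otherwise. */
static int parse_rpc_line(const char *line, struct rpc_client_stats *st)
{
    if (sscanf(line, "rpc %lu %lu %lu",
               &st->calls, &st->retrans, &st->authrefresh) != 3)
        return -1;
    return 0;
}

/* Fraction of calls that were retransmitted; 0.0 when no calls were made. */
static double retrans_ratio(const struct rpc_client_stats *st)
{
    if (st->calls == 0)
        return 0.0;
    return (double)st->retrans / (double)st->calls;
}
```

Feeding it the line quoted above (`rpc 191035 4339 0`) gives a ratio of roughly 2.3%, under the assumed field layout.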
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 3:44 ` Neil Brown
2002-10-16 4:31 ` David S. Miller
@ 2002-10-16 11:09 ` Hirokazu Takahashi
2002-10-16 17:02 ` kaza
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-16 11:09 UTC (permalink / raw)
To: neilb; +Cc: davem, linux-kernel, nfs

Hello,

> > It will be effective on large scale SMP machines as all kNFSd shares
> > one NFS port. A udp socket can't send data on each CPU at the same
> > time while MSG_MORE/UDP_CORK options are set.
> > The UDP socket has to block any other requests while making a UDP frame.
> >
> After thinking about this some more, I suspect it would have to be
> quite large scale SMP to get much contention.

I have no idea how much contention will happen. I haven't checked the
performance of it on large scale SMP yet as I don't have such a large
machine.

Can anyone help us?

> The only contention on the udp socket is, as you say, assembling a udp
> frame, and I would be surprised if that takes a substantial fraction
> of the time to handle a request.

After assembling a udp frame, kNFSd may drive a NIC to transmit the frame.

> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs. Maybe it would make sense to
> have one udp socket for each NIC. Would that make sense? or work?

Several CPUs often share one GbE NIC today, as a NIC can handle much more
data than one CPU can. I think the CPU is likely to become the bottleneck.
Personally I guess several CPUs will share one 10GbE NIC in the near
future even if it's a high end machine. (It's just my guess.)

But I don't know how effectively this patch works......

davem> Doesn't make much sense.
davem>
davem> Usually we are talking via one IP address, and thus over
davem> one device. It could be using multiple NICs via BONDING,
davem> but that would be transparent to anything at the socket
davem> level.
davem>
davem> Really, I think there is real value to making the socket
davem> per-cpu even on a 2 or 4 way system.

I hope so.

^ permalink raw reply [flat|nested] 87+ messages in thread
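The per-CPU socket idea discussed in this message can be pictured with a small userland model. This is only a sketch under stated assumptions: it is not kernel code, every name in it (`model_sock`, `model_svc_sock`, `pick_send_sock`, `MODEL_NR_CPUS`) is hypothetical, and it shows just the selection step, where each CPU prefers its own sending socket and falls back to the shared one when it has none.

```c
#include <stddef.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for the model */

/* Stand-in for a socket; only an id so the choice can be observed. */
struct model_sock {
    int id;
};

/* Stand-in for a service socket with optional per-CPU sending sockets. */
struct model_svc_sock {
    struct model_sock *main_sock;             /* shared receive/send socket */
    struct model_sock *shadow[MODEL_NR_CPUS]; /* NULL where none exists */
};

/* Pick the socket a sender running on `cpu` should use.
 * Falls back to the shared socket if the CPU index is out of range
 * or that CPU has no per-CPU socket. */
static struct model_sock *
pick_send_sock(const struct model_svc_sock *svsk, int cpu)
{
    if (cpu >= 0 && cpu < MODEL_NR_CPUS && svsk->shadow[cpu] != NULL)
        return svsk->shadow[cpu];
    return svsk->main_sock;
}
```

With one sending socket per CPU, a sender that is in the middle of assembling a frame only holds its own socket, so senders on other CPUs do not have to wait for it; the shared socket remains the fallback.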
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 11:09 ` Hirokazu Takahashi
@ 2002-10-16 17:02 ` kaza
2002-10-17 4:36 ` rddunlap
0 siblings, 1 reply; 87+ messages in thread
From: kaza @ 2002-10-16 17:02 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs

Hello,

On Wed, Oct 16, 2002 at 08:09:00PM +0900,
Hirokazu Takahashi-san wrote:
> > After thinking about this some more, I suspect it would have to be
> > quite large scale SMP to get much contention.
>
> I have no idea how much contention will happen. I haven't checked the
> performance of it on large scale SMP yet as I don't have such a large
> machine.
>
> Can anyone help us?

Why don't you propose the performance test to OSDL? (OSDL-J would be
better, I think.) OSDL provides hardware resources and operations staff.

If you want, I can help you to propose it. :-)

--
Ko Kazaana / editor-in-chief of "TechStyle" ( http://techstyle.jp/ )
GnuPG Fingerprint = 1A50 B204 46BD EE22 2E8C 903F F2EB CEA7 4BCF 808F

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 17:02 ` kaza
@ 2002-10-17 4:36 ` rddunlap
0 siblings, 0 replies; 87+ messages in thread
From: rddunlap @ 2002-10-17 4:36 UTC (permalink / raw)
To: kaza; +Cc: Hirokazu Takahashi, neilb, davem, linux-kernel, nfs

On Thu, 17 Oct 2002 kaza@kk.iij4u.or.jp wrote:

| Hello,
|
| On Wed, Oct 16, 2002 at 08:09:00PM +0900,
| Hirokazu Takahashi-san wrote:
| > > After thinking about this some more, I suspect it would have to be
| > > quite large scale SMP to get much contention.
| >
| > I have no idea how much contention will happen. I haven't checked the
| > performance of it on large scale SMP yet as I don't have such a large
| > machine.
| >
| > Can anyone help us?
|
| Why don't you propose the performance test to OSDL? (OSDL-J would be
| better, I think.) OSDL provides hardware resources and operations staff.

and why do you say that? 8;)

| If you want, I can help you to propose it. :-)

That's the right thing to do.

--
~Randy

^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH] zerocopy NFS for 2.5.43
2002-10-14 5:50 ` Neil Brown
2002-10-14 6:15 ` David S. Miller
2002-10-14 12:01 ` Hirokazu Takahashi
@ 2002-10-18 13:11 ` Hirokazu Takahashi
2002-10-23 1:18 ` Neil Brown
2 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-18 13:11 UTC (permalink / raw)
To: neilb; +Cc: nfs

[-- Attachment #1: Type: Text/Plain, Size: 453 bytes --]

Hello,

I've ported the zerocopy patches to linux-2.5.43, on top of davem's
udp-sendfile patches and the patches you posted on Wed, 16 Oct.

It's sad that zerocopy NFS doesn't work with NFSv4 yet.
kNFSd won't use the zerocopy mechanism for NFSv4 requests.
If possible I can make NFSv4 use zerocopy after Halloween.

I also fixed a small bug where pages might be lost when nfsd_readdir
happens to hit an error.

Thank you,
Hirokazu Takahashi.

[-- Attachment #2: rpcfix2.5.43-2.patch --]
[-- Type: Text/Plain, Size: 1094 bytes --]

--- linux/net/sunrpc/svcsock.c.ORG	Thu Oct 17 14:10:43 2030
+++ linux/net/sunrpc/svcsock.c	Fri Oct 18 11:20:27 2030
@@ -882,17 +882,18 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 
 	dprintk("svc: TCP complete record (%d bytes)\n", len);
 
+	rqstp->rq_skbuff = 0;
+	rqstp->rq_argbuf.buf += 1;
+	rqstp->rq_argbuf.len = (len >> 2) + 1;
+	rqstp->rq_argbuf.buflen = (len >> 2) + 1;
+
 	/* Position reply write pointer immediately args,
 	 * allowing for record length */
-	rqstp->rq_resbuf.base = rqstp->rq_argbuf.base + (len>>2);
-	rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base + 1;
-	rqstp->rq_resbuf.len = 1;
-	rqstp->rq_resbuf.buflen= rqstp->rq_argbuf.buflen - (len>>2) - 1;
+	rqstp->rq_resbuf.base += rqstp->rq_argbuf.buflen;
+	rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base + 1;
+	rqstp->rq_resbuf.len = 1;
+	rqstp->rq_resbuf.buflen -= rqstp->rq_argbuf.buflen;
 
-	rqstp->rq_skbuff = 0;
-	rqstp->rq_argbuf.buf += 1;
-	rqstp->rq_argbuf.len = (len >> 2);
-	rqstp->rq_argbuf.buflen = (len >> 2);
 	rqstp->rq_prot = IPPROTO_TCP;
 
 	/* Reset TCP read info */

[-- Attachment #3:
va01-zerocopy-rpc-2.5.43.patch --] [-- Type: Text/Plain, Size: 10122 bytes --] --- linux.ORG/include/linux/sunrpc/svc.h Fri Oct 18 12:26:43 2030 +++ linux/include/linux/sunrpc/svc.h Fri Oct 18 12:29:31 2030 @@ -48,7 +48,7 @@ struct svc_serv { * This is use to determine the max number of pages nfsd is * willing to return in a single READ operation. */ -#define RPCSVC_MAXPAYLOAD 16384u +#define RPCSVC_MAXPAYLOAD (1024u*64) /* * Buffer to store RPC requests or replies in. @@ -61,7 +61,7 @@ struct svc_serv { * * The array of iovecs can hold additional data that the server process * may not want to copy into the RPC reply buffer, but pass to the - * network sendmsg routines directly. The prime candidate for this + * network sendmsg/sendpage routines directly. The prime candidate for this * will of course be NFS READ operations, but one might also want to * do something about READLINK and READDIR. It might be worthwhile * to implement some generic readdir cache in the VFS layer... @@ -70,7 +70,7 @@ struct svc_serv { * the list of IP fragments once we get to process fragmented UDP * datagrams directly. */ -#define RPCSVC_MAXIOV ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1) +#define RPCSVC_MAXIOV ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 2) struct svc_buf { u32 * area; /* allocated memory */ u32 * base; /* base of RPC datagram */ @@ -78,10 +78,24 @@ struct svc_buf { u32 * buf; /* read/write pointer */ int len; /* current end of buffer */ - /* iovec for zero-copy NFS READs */ - struct iovec iov[RPCSVC_MAXIOV]; + /* + * iovec for zero-copy NFS READs + * pages and non-page data can be mixed. 
+ */ + struct rpcio_vec { + struct page *rpc_page; + union { + void *riov_base; + unsigned long riov_offset; + } u; + __kernel_size_t rpc_len; + } iov[RPCSVC_MAXIOV]; int nriov; }; + +#define rpc_base u.riov_base +#define rpc_offset u.riov_offset + #define svc_getu32(argp, val) { (val) = *(argp)->buf++; (argp)->len--; } #define svc_putu32(resp, val) { *(resp)->buf++ = (val); (resp)->len++; } --- linux.ORG/include/linux/sunrpc/svcsock.h Fri Oct 18 12:26:43 2030 +++ linux/include/linux/sunrpc/svcsock.h Fri Oct 18 12:29:31 2030 @@ -10,6 +10,7 @@ #define SUNRPC_SVCSOCK_H #include <linux/sunrpc/svc.h> +#include <asm/semaphore.h> /* * RPC server socket. @@ -37,6 +38,7 @@ struct svc_sock { struct list_head sk_deferred; /* deferred requests that need to * be revisted */ + struct semaphore sk_sem; /* serialize sending data */ int (*sk_recvfrom)(struct svc_rqst *rqstp); int (*sk_sendto)(struct svc_rqst *rqstp); --- linux.ORG/net/sunrpc/svc.c Fri Oct 18 12:26:48 2030 +++ linux/net/sunrpc/svc.c Fri Oct 18 12:29:31 2030 @@ -106,8 +106,7 @@ svc_destroy(struct svc_serv *serv) /* * Allocate an RPC server buffer - * Later versions may do nifty things by allocating multiple pages - * of memory directly and putting them into the bufp->iov. + * Multiple pages can be put into the bufp->iov. 
*/ int svc_init_buffer(struct svc_buf *bufp, unsigned int size) @@ -119,8 +118,9 @@ svc_init_buffer(struct svc_buf *bufp, un bufp->len = 0; bufp->buflen = size >> 2; - bufp->iov[0].iov_base = bufp->area; - bufp->iov[0].iov_len = size; + bufp->iov[0].rpc_base = bufp->area; + bufp->iov[0].rpc_len = size; + bufp->iov[0].rpc_page = NULL; bufp->nriov = 1; return 1; --- linux.ORG/net/sunrpc/svcsock.c Fri Oct 18 12:28:35 2030 +++ linux/net/sunrpc/svcsock.c Fri Oct 18 12:29:31 2030 @@ -22,6 +22,7 @@ #include <linux/sched.h> #include <linux/errno.h> #include <linux/fcntl.h> +#include <linux/pagemap.h> #include <linux/net.h> #include <linux/in.h> #include <linux/inet.h> @@ -270,6 +271,8 @@ static void svc_sock_release(struct svc_rqst *rqstp) { struct svc_sock *svsk = rqstp->rq_sock; + struct svc_buf *bufp = &rqstp->rq_resbuf; + int i; svc_release_skb(rqstp); @@ -283,6 +286,13 @@ svc_sock_release(struct svc_rqst *rqstp) rqstp->rq_reserved, rqstp->rq_resbuf.len<<2); + for (i = 0; i < bufp->nriov; i++) { + if (bufp->iov[i].rpc_page) { + put_page(bufp->iov[i].rpc_page); + bufp->iov[i].rpc_page = NULL; + } + } + rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base; rqstp->rq_resbuf.len = 0; svc_reserve(rqstp, 0); @@ -318,38 +328,55 @@ svc_wake_up(struct svc_serv *serv) * Generic sendto routine */ static int -svc_sendto(struct svc_rqst *rqstp, struct iovec *iov, int nr) +svc_sendto(struct svc_rqst *rqstp, struct rpcio_vec *iov, int nr) { mm_segment_t oldfs; struct svc_sock *svsk = rqstp->rq_sock; struct socket *sock = svsk->sk_sock; struct msghdr msg; - int i, buflen, len; - - for (i = buflen = 0; i < nr; i++) - buflen += iov[i].iov_len; + unsigned int flags = MSG_MORE; + int len = 0; + int result, i; msg.msg_name = &rqstp->rq_addr; msg.msg_namelen = sizeof(rqstp->rq_addr); - msg.msg_iov = iov; - msg.msg_iovlen = nr; msg.msg_control = NULL; msg.msg_controllen = 0; + msg.msg_iovlen = 1; - /* This was MSG_DONTWAIT, but I now want it to wait. 
- * The only thing that it would wait for is memory and - * if we are fairly low on memory, then we aren't likely - * to make much progress anyway. - * sk->sndtimeo is set to 30seconds just in case. - */ - msg.msg_flags = 0; + /* Grab svsk->sk_sem to serialize outgoing data. */ + down(&svsk->sk_sem); - oldfs = get_fs(); set_fs(KERNEL_DS); - len = sock_sendmsg(sock, &msg, buflen); - set_fs(oldfs); + /* + * svc_sendto() assumes rqstp->rq_resbuf.page[0] is NULL + * when RPC over UDP is used as sendpage interface cannot + * pass destination address. + */ + for (i = 0; i < nr; i++) { + if (i == nr - 1) + flags = 0; + if (iov[i].rpc_page) { + result = sock->ops->sendpage(sock, iov[i].rpc_page, iov[i].rpc_offset, iov[i].rpc_len, flags); + } else { + struct iovec uiov; + uiov.iov_base = iov[i].rpc_base; + uiov.iov_len = iov[i].rpc_len; + msg.msg_iov = &uiov; + msg.msg_flags = flags; + oldfs = get_fs(); set_fs(KERNEL_DS); + result = sock_sendmsg(sock, &msg, iov[i].rpc_len); + set_fs(oldfs); + } + if (result < 0) { + if (!len) len = result; + break; + } + len += result; + } + up(&svsk->sk_sem); - dprintk("svc: socket %p sendto([%p %Zu... ], %d, %d) = %d\n", - rqstp->rq_sock, iov[0].iov_base, iov[0].iov_len, nr, buflen, len); + dprintk("svc: socket %p sendto([%p %Zu... ], %d) = %d\n", + rqstp->rq_sock, iov[0].rpc_base, iov[0].rpc_len, nr, len); return len; } @@ -375,19 +402,25 @@ svc_recv_available(struct svc_sock *svsk * Generic recvfrom routine. 
*/ static int -svc_recvfrom(struct svc_rqst *rqstp, struct iovec *iov, int nr, int buflen) +svc_recvfrom(struct svc_rqst *rqstp, struct rpcio_vec *iov, int nr, int buflen) { mm_segment_t oldfs; struct msghdr msg; struct socket *sock; - int len, alen; + int len, alen, i; + struct iovec uiov[RPCSVC_MAXIOV]; rqstp->rq_addrlen = sizeof(rqstp->rq_addr); sock = rqstp->rq_sock->sk_sock; + for (i = 0; i < nr; i++) { + uiov[i].iov_base = iov[i].rpc_base; + uiov[i].iov_len = iov[i].rpc_len; + } + msg.msg_name = &rqstp->rq_addr; msg.msg_namelen = sizeof(rqstp->rq_addr); - msg.msg_iov = iov; + msg.msg_iov = uiov; msg.msg_iovlen = nr; msg.msg_control = NULL; msg.msg_controllen = 0; @@ -406,7 +439,7 @@ svc_recvfrom(struct svc_rqst *rqstp, str sock->ops->getname(sock, (struct sockaddr *)&rqstp->rq_addr, &alen, 1); dprintk("svc: socket %p recvfrom(%p, %Zu) = %d\n", - rqstp->rq_sock, iov[0].iov_base, iov[0].iov_len, len); + rqstp->rq_sock, iov[0].rpc_base, iov[0].rpc_len, len); return len; } @@ -567,8 +600,8 @@ svc_udp_sendto(struct svc_rqst *rqstp) * care of by the server implementation itself. 
*/ /* bufp->base = bufp->area; */ - bufp->iov[0].iov_base = bufp->base; - bufp->iov[0].iov_len = bufp->len << 2; + bufp->iov[0].rpc_base = bufp->base; + bufp->iov[0].rpc_len = bufp->len << 2; error = svc_sendto(rqstp, bufp->iov, bufp->nriov); if (error == -ECONNREFUSED) @@ -827,10 +860,11 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp) */ if (svsk->sk_tcplen < 4) { unsigned long want = 4 - svsk->sk_tcplen; - struct iovec iov; + struct rpcio_vec iov; - iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen; - iov.iov_len = want; + iov.rpc_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen; + iov.rpc_len = want; + iov.rpc_page = NULL; if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0) goto error; svsk->sk_tcplen += len; @@ -872,8 +906,8 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp) set_bit(SK_DATA, &svsk->sk_flags); /* Frob argbuf */ - bufp->iov[0].iov_base += 4; - bufp->iov[0].iov_len -= 4; + bufp->iov[0].rpc_base += 4; + bufp->iov[0].rpc_len -= 4; /* Now receive data */ len = svc_recvfrom(rqstp, bufp->iov, bufp->nriov, svsk->sk_reclen); @@ -931,21 +965,25 @@ svc_tcp_sendto(struct svc_rqst *rqstp) { struct svc_buf *bufp = &rqstp->rq_resbuf; int sent; + int buflen = bufp->len << 2; + int i; /* Set up the first element of the reply iovec. * Any other iovecs that may be in use have been taken * care of by the server implementation itself. 
*/ - bufp->iov[0].iov_base = bufp->base; - bufp->iov[0].iov_len = bufp->len << 2; - bufp->base[0] = htonl(0x80000000|((bufp->len << 2) - 4)); + bufp->iov[0].rpc_base = bufp->base; + bufp->iov[0].rpc_len = buflen; + for (i = 1; i < bufp->nriov; i++) + buflen += bufp->iov[i].rpc_len; + bufp->base[0] = htonl(0x80000000|(buflen - 4)); sent = svc_sendto(rqstp, bufp->iov, bufp->nriov); - if (sent != bufp->len<<2) { + if (sent != buflen) { printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n", rqstp->rq_sock->sk_server->sv_name, (sent<0)?"got error":"sent only", - sent, bufp->len << 2); + sent, buflen); svc_delete_socket(rqstp->rq_sock); sent = -EAGAIN; } @@ -1185,6 +1223,7 @@ svc_setup_socket(struct svc_serv *serv, svsk->sk_server = serv; svsk->sk_lastrecv = CURRENT_TIME; INIT_LIST_HEAD(&svsk->sk_deferred); + sema_init(&svsk->sk_sem, 1); /* Initialize the socket */ if (sock->type == SOCK_DGRAM) [-- Attachment #4: va02-zerocopy-nfsdread-2.5.43.patch --] [-- Type: Text/Plain, Size: 6693 bytes --] --- linux.ORG/fs/nfsd/nfs3xdr.c Fri Oct 18 12:26:29 2030 +++ linux/fs/nfsd/nfs3xdr.c Fri Oct 18 12:32:35 2030 @@ -13,6 +13,7 @@ #include <linux/spinlock.h> #include <linux/dcache.h> #include <linux/namei.h> +#include <linux/pagemap.h> #include <linux/sunrpc/xdr.h> #include <linux/sunrpc/svc.h> @@ -78,6 +79,34 @@ encode_fh(u32 *p, struct svc_fh *fhp) } /* + * Pad extra data at the end of the packet as the length of RPC packet + * must be multiple of u32. + */ +static inline u32 * +xdr_pack_data(struct svc_rqst *rqstp, u32 *p, unsigned long count) +{ + int pad = (XDR_QUADLEN(count) << 2) - count; + unsigned int index = rqstp->rq_resbuf.nriov; + struct rpcio_vec *iov = rqstp->rq_resbuf.iov; + + if (index == 1) + return p + XDR_QUADLEN(count); + + /* The last page may have enough room to pad. 
*/ + if (iov[index-1].rpc_page && + iov[index-1].rpc_offset + iov[index-1].rpc_len + pad <= PAGE_SIZE) { + iov[index - 1].rpc_len += pad; + } else { + static long dummy = 0; + iov[index].rpc_base = &dummy; + iov[index].rpc_len = pad; + iov[index].rpc_page = NULL; + rqstp->rq_resbuf.nriov++; + } + return p; +} + +/* * Decode a file name and make sure that the path contains * no slashes or null bytes. */ @@ -569,7 +598,7 @@ nfs3svc_encode_readlinkres(struct svc_rq p = encode_post_op_attr(rqstp, p, &resp->fh); if (resp->status == 0) { *p++ = htonl(resp->len); - p += XDR_QUADLEN(resp->len); + p = xdr_pack_data(rqstp, p, resp->len); } return xdr_ressize_check(rqstp, p); } @@ -584,7 +613,7 @@ nfs3svc_encode_readres(struct svc_rqst * *p++ = htonl(resp->count); *p++ = htonl(resp->eof); *p++ = htonl(resp->count); /* xdr opaque count */ - p += XDR_QUADLEN(resp->count); + p = xdr_pack_data(rqstp, p, resp->count); } return xdr_ressize_check(rqstp, p); } @@ -647,7 +676,7 @@ nfs3svc_encode_readdirres(struct svc_rqs if (resp->status == 0) { /* stupid readdir cookie */ memcpy(p, resp->verf, 8); p += 2; - p += XDR_QUADLEN(resp->count); + p = xdr_pack_data(rqstp, p, resp->count); } return xdr_ressize_check(rqstp, p); --- linux.ORG/fs/nfsd/nfsxdr.c Fri Oct 18 12:26:29 2030 +++ linux/fs/nfsd/nfsxdr.c Fri Oct 18 12:32:35 2030 @@ -55,6 +55,35 @@ encode_fh(u32 *p, struct svc_fh *fhp) return p + (NFS_FHSIZE>> 2); } + +/* + * Pad extra data at the end of the packet as the length of RPC packet + * must be multiple of u32. + */ +static inline u32 * +xdr_pack_data(struct svc_rqst *rqstp, u32 *p, unsigned long count) +{ + int pad = (XDR_QUADLEN(count) << 2) - count; + unsigned int index = rqstp->rq_resbuf.nriov; + struct rpcio_vec *iov = rqstp->rq_resbuf.iov; + + if (index == 1) + return p + XDR_QUADLEN(count); + + /* The last page may have enough room to pad. 
*/ + if (iov[index-1].rpc_page && + iov[index-1].rpc_offset + iov[index-1].rpc_len + pad <= PAGE_SIZE) { + iov[index - 1].rpc_len += pad; + } else { + static long dummy = 0; + iov[index].rpc_base = &dummy; + iov[index].rpc_len = pad; + iov[index].rpc_page = NULL; + rqstp->rq_resbuf.nriov++; + } + return p; +} + /* * Decode a file name and make sure that the path contains * no slashes or null bytes. @@ -361,7 +390,7 @@ nfssvc_encode_readlinkres(struct svc_rqs struct nfsd_readlinkres *resp) { *p++ = htonl(resp->len); - p += XDR_QUADLEN(resp->len); + p = xdr_pack_data(rqstp, p, resp->len); return xdr_ressize_check(rqstp, p); } @@ -371,7 +400,7 @@ nfssvc_encode_readres(struct svc_rqst *r { p = encode_fattr(rqstp, p, &resp->fh); *p++ = htonl(resp->count); - p += XDR_QUADLEN(resp->count); + p = xdr_pack_data(rqstp, p, resp->count); return xdr_ressize_check(rqstp, p); } @@ -380,7 +409,7 @@ int nfssvc_encode_readdirres(struct svc_rqst *rqstp, u32 *p, struct nfsd_readdirres *resp) { - p += XDR_QUADLEN(resp->count); + p = xdr_pack_data(rqstp, p, resp->count); return xdr_ressize_check(rqstp, p); } --- linux.ORG/fs/nfsd/vfs.c Fri Oct 18 12:26:29 2030 +++ linux/fs/nfsd/vfs.c Fri Oct 18 12:36:13 2030 @@ -13,6 +13,7 @@ * dentry, don't worry--they have been taken care of. * * Copyright (C) 1995-1999 Olaf Kirch <okir@monad.swb.de> + * Zerocpy NFS support (C) 2002 Hirokazu Takahashi <taka@valinux.co.jp> */ #include <linux/config.h> @@ -28,6 +29,7 @@ #include <linux/net.h> #include <linux/unistd.h> #include <linux/slab.h> +#include <linux/pagemap.h> #include <linux/in.h> #include <linux/module.h> #include <linux/namei.h> @@ -571,6 +573,61 @@ found: } /* + * Grab and keep cached pages assosiated with a file in the svc_rqst + * so that they can be passed to the netowork sendmsg/sendpage routines + * directrly. They will be released after the sending has completed. 
+ */ +static int +nfsd_read_actor(read_descriptor_t *desc, struct page *page, unsigned long offset , unsigned long size) +{ + unsigned long count = desc->count; + struct svc_rqst *rqstp = (struct svc_rqst *)desc->buf; + unsigned int index = rqstp->rq_resbuf.nriov; + struct rpcio_vec *iov = rqstp->rq_resbuf.iov; + + if (size > count) + size = count; + + if (page == iov[index-1].rpc_page + && offset == iov[index-1].rpc_offset + iov[index-1].rpc_len) { + /* the page can be coalesced */ + iov[index-1].rpc_len += size; + } else { + rqstp->rq_resbuf.nriov++; + get_page(page); + iov[index].rpc_page = page; + iov[index].rpc_offset = offset; + iov[index].rpc_len = size; + } + + desc->count = count - size; + desc->written += size; + return size; +} + +static inline ssize_t +nfsd_getpages(struct file *filp, struct svc_rqst *rqstp, unsigned long count) +{ + read_descriptor_t desc; + ssize_t retval; + + if (!count) + return 0; + + desc.written = 0; + desc.count = count; + desc.buf = (char *)rqstp; + desc.error = 0; + do_generic_file_read(filp, &filp->f_pos, &desc, nfsd_read_actor); + + retval = desc.written; + if (!retval) + retval = desc.error; + return retval; +} + + +/* * Read data from a file. count must contain the requested read count * on entry. On return, *count contains the number of bytes actually read. * N.B. After this call fhp needs an fh_put @@ -601,10 +658,17 @@ nfsd_read(struct svc_rqst *rqstp, struct if (ra) file.f_ra = ra->p_ra; - oldfs = get_fs(); - set_fs(KERNEL_DS); - err = vfs_read(&file, buf, *count, &offset); - set_fs(oldfs); + /* ToDo: NFSv4 can't handle fragmented data yet. 
*/ +/* if (inode->i_mapping->a_ops->readpage) { */ + if (inode->i_mapping->a_ops->readpage && rqstp->rq_vers <= 3) { + file.f_pos = offset; + err = nfsd_getpages(&file, rqstp, *count); + } else { + oldfs = get_fs(); + set_fs(KERNEL_DS); + err = vfs_read(&file, buf, *count, &offset); + set_fs(oldfs); + } /* Write back readahead params */ if (ra) [-- Attachment #5: va03-zerocopy-nfsdreaddir-2.5.43.patch --] [-- Type: Text/Plain, Size: 1426 bytes --] --- linux.ORG/fs/nfsd/vfs.c Fri Oct 18 21:24:43 2030 +++ linux/fs/nfsd/vfs.c Fri Oct 18 21:23:48 2030 @@ -1460,6 +1460,7 @@ nfsd_readdir(struct svc_rqst *rqstp, str int oldlen, eof, err; struct file file; struct readdir_cd cd; + struct page *page = NULL; err = nfsd_open(rqstp, fhp, S_IFDIR, MAY_READ, &file); if (err) @@ -1469,6 +1470,15 @@ nfsd_readdir(struct svc_rqst *rqstp, str file.f_pos = offset; + /* ToDo: NFSv4 can't handle fragmented data yet. */ +/* if (*countp <= (PAGE_SIZE >> 2)) { */ + if (*countp <= (PAGE_SIZE >> 2) && rqstp->rq_vers <= 3) { + /* Don't care if we couldn't get a page. */ + page = alloc_page(GFP_KERNEL); + if (page) + buffer = page_address(page); + } + /* Set up the readdir context */ memset(&cd, 0, sizeof(cd)); cd.rqstp = rqstp; @@ -1518,11 +1528,22 @@ nfsd_readdir(struct svc_rqst *rqstp, str *p++ = htonl(eof); /* end of directory */ *countp = (caddr_t) p - (caddr_t) buffer; + if (page) { + int index = rqstp->rq_resbuf.nriov; + get_page(page); + rqstp->rq_resbuf.iov[index].rpc_page = page; + rqstp->rq_resbuf.iov[index].rpc_base = NULL; + rqstp->rq_resbuf.iov[index].rpc_len = *countp; + rqstp->rq_resbuf.nriov++; + } + dprintk("nfsd: readdir result %d bytes, eof %d offset %d\n", *countp, eof, cd.offset? 
ntohl(*cd.offset) : -1); err = 0; out_close: + if (page) + put_page(page); nfsd_close(&file); out: return err; [-- Attachment #6: va04-zerocopy-shadowsock-2.5.43.patch --] [-- Type: Text/Plain, Size: 6649 bytes --] --- linux.ORG/include/linux/sunrpc/svcsock.h Fri Oct 18 12:32:04 2030 +++ linux/include/linux/sunrpc/svcsock.h Fri Oct 18 12:42:02 2030 @@ -52,6 +52,7 @@ struct svc_sock { int sk_reclen; /* length of record */ int sk_tcplen; /* current read length */ time_t sk_lastrecv; /* time of last received request */ + struct svc_sock **sk_shadow; /* shadow sockets for sending */ }; /* --- linux.ORG/net/sunrpc/svcsock.c Fri Oct 18 12:32:04 2030 +++ linux/net/sunrpc/svcsock.c Fri Oct 18 12:42:02 2030 @@ -65,7 +65,9 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *, - int *errp, int pmap_reg); + int *errp, int type); +#define SVSK_PMAP_REGISTER 1 +#define SVSK_SHADOW 2 static void svc_udp_data_ready(struct sock *, int); static int svc_udp_recvfrom(struct svc_rqst *); static int svc_udp_sendto(struct svc_rqst *); @@ -260,6 +262,8 @@ svc_sock_put(struct svc_sock *svsk) if (!--(svsk->sk_inuse) && test_bit(SK_DEAD, &svsk->sk_flags)) { spin_unlock_bh(&serv->sv_lock); dprintk("svc: releasing dead socket\n"); + if (svsk->sk_shadow) + kfree(svsk->sk_shadow); sock_release(svsk->sk_sock); kfree(svsk); } @@ -328,10 +332,10 @@ svc_wake_up(struct svc_serv *serv) * Generic sendto routine */ static int -svc_sendto(struct svc_rqst *rqstp, struct rpcio_vec *iov, int nr) +svc_sendto(struct svc_rqst *rqstp, struct svc_sock *svsk, + struct rpcio_vec *iov, int nr) { mm_segment_t oldfs; - struct svc_sock *svsk = rqstp->rq_sock; struct socket *sock = svsk->sk_sock; struct msghdr msg; unsigned int flags = MSG_MORE; @@ -593,6 +597,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp) { struct svc_buf *bufp = &rqstp->rq_resbuf; + struct svc_sock *svsk = rqstp->rq_sock; int error; /* Set up the first element of the reply iovec. 
@@ -603,10 +608,25 @@ svc_udp_sendto(struct svc_rqst *rqstp) bufp->iov[0].rpc_base = bufp->base; bufp->iov[0].rpc_len = bufp->len << 2; - error = svc_sendto(rqstp, bufp->iov, bufp->nriov); +#ifdef CONFIG_SMP + if (svsk->sk_shadow) { + struct svc_sock *shadow = svsk->sk_shadow[smp_processor_id()]; + if (shadow) { + struct svc_serv *serv = svsk->sk_server; + svsk = shadow; + if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags)) + svc_sock_setbufsize(svsk->sk_sock, + (serv->sv_nrthreads+3) * serv->sv_bufsz, + (serv->sv_nrthreads+3) * serv->sv_bufsz); + } + + } +#endif + + error = svc_sendto(rqstp, svsk, bufp->iov, bufp->nriov); if (error == -ECONNREFUSED) /* ICMP error on earlier request. */ - error = svc_sendto(rqstp, bufp->iov, bufp->nriov); + error = svc_sendto(rqstp, svsk, bufp->iov, bufp->nriov); return error; } @@ -978,7 +998,7 @@ svc_tcp_sendto(struct svc_rqst *rqstp) buflen += bufp->iov[i].rpc_len; bufp->base[0] = htonl(0x80000000|(buflen - 4)); - sent = svc_sendto(rqstp, bufp->iov, bufp->nriov); + sent = svc_sendto(rqstp, rqstp->rq_sock, bufp->iov, bufp->nriov); if (sent != buflen) { printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n", rqstp->rq_sock->sk_server->sv_name, @@ -1201,7 +1221,7 @@ svc_send(struct svc_rqst *rqstp) */ static struct svc_sock * svc_setup_socket(struct svc_serv *serv, struct socket *sock, - int *errp, int pmap_register) + int *errp, int type) { struct svc_sock *svsk; struct sock *inet; @@ -1222,6 +1242,7 @@ svc_setup_socket(struct svc_serv *serv, svsk->sk_owspace = inet->write_space; svsk->sk_server = serv; svsk->sk_lastrecv = CURRENT_TIME; + svsk->sk_shadow = NULL; INIT_LIST_HEAD(&svsk->sk_deferred); sema_init(&svsk->sk_sem, 1); @@ -1234,7 +1255,7 @@ if (svsk->sk_sk == NULL) printk(KERN_WARNING "svsk->sk_sk == NULL after svc_prot_init!\n"); /* Register socket with portmapper */ - if (*errp >= 0 && pmap_register) + if (*errp >= 0 && type == SVSK_PMAP_REGISTER) *errp = svc_register(serv, 
inet->protocol, ntohs(inet_sk(inet)->sport)); @@ -1246,13 +1267,13 @@ if (svsk->sk_sk == NULL) spin_lock_bh(&serv->sv_lock); - if (!pmap_register) { + if (type == SVSK_PMAP_REGISTER || type == SVSK_SHADOW) { + clear_bit(SK_TEMP, &svsk->sk_flags); + list_add(&svsk->sk_list, &serv->sv_permsocks); + } else { set_bit(SK_TEMP, &svsk->sk_flags); list_add(&svsk->sk_list, &serv->sv_tempsocks); serv->sv_tmpcnt++; - } else { - clear_bit(SK_TEMP, &svsk->sk_flags); - list_add(&svsk->sk_list, &serv->sv_permsocks); } spin_unlock_bh(&serv->sv_lock); @@ -1261,6 +1282,61 @@ if (svsk->sk_sk == NULL) return svsk; } + +/* + * Create a shadow socket which has the same sport of given svsk. + * Let each cpu have its own socket to send packets. + */ +static int +svc_create_shadow_socket(struct svc_serv *serv, struct svc_sock *svsk, + int protocol, struct sockaddr_in *sin) +{ +#ifdef CONFIG_SMP + int error; + struct socket *newsock; + struct svc_sock *newsvsk; + int i; + + if (num_online_cpus() == 1) + return 0; + + svsk->sk_shadow = kmalloc(sizeof(struct svc_sock*)*NR_CPUS, GFP_KERNEL); + if (!svsk->sk_shadow) + return -ENOMEM; + + memset(svsk->sk_shadow, 0, sizeof(struct svc_sock*)*NR_CPUS); + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + + if ((error = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &newsock)) < 0) + return error; + if ((newsvsk = svc_setup_socket(serv, newsock, &error, SVSK_SHADOW)) == NULL) { + sock_release(newsock); + return error; + } + /* + * Make the newsvsk as shadow of the svsk. + */ + newsock->sk->reuse = 1; /* allow address reuse */ + error = newsock->ops->bind(newsock, (struct sockaddr *) sin, + sizeof(*sin)); + if (error < 0) { + sock_release(newsock); + kfree(newsvsk); + return error; + } + /* + * Unhash the newsocket not to receive packets. + */ + newsock->sk->prot->unhash(newsock->sk); + svsk->sk_shadow[i] = newsvsk; + } +#endif + return 0; +} + /* * Create socket for RPC service. 
*/ @@ -1300,8 +1376,13 @@ svc_create_socket(struct svc_serv *serv, goto bummer; } - if ((svsk = svc_setup_socket(serv, sock, &error, 1)) != NULL) - return 0; + if ((svsk = svc_setup_socket(serv, sock, &error, SVSK_PMAP_REGISTER)) == NULL) + goto bummer; + + if (protocol == IPPROTO_UDP && sin != NULL) + svc_create_shadow_socket(serv, svsk, protocol, sin); + + return 0; bummer: dprintk("svc: svc_create_socket error = %d\n", -error); @@ -1340,6 +1421,8 @@ svc_delete_socket(struct svc_sock *svsk) if (!svsk->sk_inuse) { spin_unlock_bh(&serv->sv_lock); + if (svsk->sk_shadow) + kfree(svsk->sk_shadow); sock_release(svsk->sk_sock); kfree(svsk); } else { [-- Attachment #7: va05-zerocopy-nfsdwrite-2.5.43.patch --] [-- Type: Text/Plain, Size: 18875 bytes --] --- linux.ORG/include/linux/sunrpc/svc.h Fri Oct 18 21:24:38 2030 +++ linux/include/linux/sunrpc/svc.h Fri Oct 18 21:26:01 2030 @@ -70,7 +70,7 @@ struct svc_serv { * the list of IP fragments once we get to process fragmented UDP * datagrams directly. */ -#define RPCSVC_MAXIOV ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 2) +#define RPCSVC_MAXIOV ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE*3 + 2) struct svc_buf { u32 * area; /* allocated memory */ u32 * base; /* base of RPC datagram */ @@ -79,7 +79,7 @@ struct svc_buf { int len; /* current end of buffer */ /* - * iovec for zero-copy NFS READs + * iovec for zero-copy NFS READs/WRITEs * pages and non-page data can be mixed. 
*/ struct rpcio_vec { @@ -204,7 +204,13 @@ struct svc_procedure { unsigned int pc_count; /* call count */ unsigned int pc_cachetype; /* cache info (NFS) */ unsigned int pc_xdrressize; /* maximum size of XDR reply */ + unsigned int pc_flags; }; + +/* + * pc_flags + */ +#define RPC_HANDLE_IOVARG 0x1 /* can accept separated arg buffers */ /* * This is the RPC server thread function prototype --- linux.ORG/net/sunrpc/svcsock.c Fri Oct 18 21:26:29 2030 +++ linux/net/sunrpc/svcsock.c Fri Oct 18 21:26:01 2030 @@ -514,6 +514,98 @@ svc_write_space(struct sock *sk) } } +static inline int +svc_map_skb_rpciovec_one(struct sk_buff *skb, struct rpcio_vec *iov, int *slotp) +{ + int i; + int slot = *slotp; + + if (slot >= RPCSVC_MAXIOV) + return 1; + + iov[slot].rpc_page = NULL; + iov[slot].rpc_base = skb->data; + iov[slot].rpc_len = skb_headlen(skb); + slot++; + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + if (slot >= RPCSVC_MAXIOV) + return 1; + /* TODO: Highmem is not supported yet. */ + if (PageHighMem(frag->page)) + return 1; + /* + * Some drivers would split skb into some pages in the near + * future as slab for jumbo frames of GbE causes memory + * pressure too much. + */ + iov[slot].rpc_page = frag->page; + iov[slot].rpc_offset = frag->page_offset; + iov[slot].rpc_len = frag->size; + slot++; + } + *slotp = slot; + return 0; +} + +/* + * Map fragments in the skb into rpc_iovec if possible. + */ +static inline int +svc_map_skb_rpciovec(struct sk_buff *skb, struct svc_buf *bufp) +{ + int slot = 0; + struct sk_buff *list; + + /* + * Make sure the first buffer big so that knfsd or other services + * can handle it easily. 
+ */ + if (skb_headlen(skb) < 1400) + return 1; + + if (svc_map_skb_rpciovec_one(skb, bufp->iov, &slot)) + return 1; + + bufp->iov[0].rpc_base += sizeof(struct udphdr); + bufp->iov[0].rpc_len -= sizeof(struct udphdr); + + for (list = skb_shinfo(skb)->frag_list; list; list = list->next) { + if (svc_map_skb_rpciovec_one(list, bufp->iov, &slot)) + return 1; + } + bufp->nriov = slot; + return 0; +} + +/* + * Copy data from fragmented UDP frame into the RPC buffer. + */ +static inline u32* +svc_copy_skb_argbuf(struct svc_rqst *rqstp, struct sk_buff *skb) +{ + struct iovec iov; + mm_segment_t oldfs; + int err; + + iov.iov_base = rqstp->rq_argbuf.buf; + iov.iov_len = skb->len - sizeof(struct udphdr); + + oldfs = get_fs(); set_fs(KERNEL_DS); + if (skb->ip_summed == CHECKSUM_UNNECESSARY) { + err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), &iov, iov.iov_len); + } else { + err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), &iov); + } + set_fs(oldfs); + if (err) + return NULL; + + skb->ip_summed = CHECKSUM_UNNECESSARY; + return rqstp->rq_argbuf.buf; +} + /* * Receive a datagram from a UDP socket. */ @@ -549,9 +641,13 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) } set_bit(SK_DATA, &svsk->sk_flags); /* there may be more data... */ - /* Sorry. 
*/ - if (skb_is_nonlinear(skb)) { - if (skb_linearize(skb, GFP_KERNEL) != 0) { + len = skb->len - sizeof(struct udphdr); + data = (u32 *) (skb->data + sizeof(struct udphdr)); + + if (skb_is_nonlinear(skb) && + svc_map_skb_rpciovec(skb, &rqstp->rq_argbuf)) { + data = svc_copy_skb_argbuf(rqstp, skb); + if (data == NULL) { kfree_skb(skb); svc_sock_received(svsk); return 0; @@ -566,16 +662,15 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) } } - - len = skb->len - sizeof(struct udphdr); - data = (u32 *) (skb->data + sizeof(struct udphdr)); - rqstp->rq_skbuff = skb; rqstp->rq_argbuf.base = data; rqstp->rq_argbuf.buf = data; rqstp->rq_argbuf.len = (len >> 2); rqstp->rq_argbuf.buflen = (len >> 2); - /* rqstp->rq_resbuf = rqstp->rq_defbuf; */ + + rqstp->rq_resbuf.base += rqstp->rq_argbuf.buflen; + rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base; + rqstp->rq_resbuf.buflen -= rqstp->rq_argbuf.buflen; rqstp->rq_prot = IPPROTO_UDP; /* Get sender address */ @@ -1067,6 +1162,17 @@ svc_sock_update_bufs(struct svc_serv *se spin_unlock_bh(&serv->sv_lock); } +inline void +svc_clear_buffer(struct svc_buf *target, struct svc_buf *defbuf) +{ + target->base = defbuf->base; + target->buflen = defbuf->buflen; + target->buf = defbuf->buf; + target->len = defbuf->len; + target->iov[0] = defbuf->iov[0]; + target->nriov = defbuf->nriov; +} + /* * Receive the next request on any socket. 
*/ @@ -1090,8 +1196,8 @@ svc_recv(struct svc_serv *serv, struct s rqstp); /* Initialize the buffers */ - rqstp->rq_argbuf = rqstp->rq_defbuf; - rqstp->rq_resbuf = rqstp->rq_defbuf; + svc_clear_buffer(&rqstp->rq_argbuf, &rqstp->rq_defbuf); + svc_clear_buffer(&rqstp->rq_resbuf, &rqstp->rq_defbuf); if (signalled()) return -EINTR; --- linux.ORG/net/sunrpc/svc.c Fri Oct 18 21:24:38 2030 +++ linux/net/sunrpc/svc.c Fri Oct 18 21:26:01 2030 @@ -13,6 +13,7 @@ #include <linux/net.h> #include <linux/in.h> #include <linux/unistd.h> +#include <linux/pagemap.h> #include <linux/sunrpc/types.h> #include <linux/sunrpc/xdr.h> @@ -233,6 +234,40 @@ svc_register(struct svc_serv *serv, int return error; } +static inline void +svc_linearize_argbuf(struct svc_rqst *rqstp) +{ + struct svc_buf *argp = &rqstp->rq_argbuf; + char *newbuf; + char *base; + char *p; + unsigned int skip, len; + int i; + + skip = (char*)argp->buf - (char*)argp->iov[0].rpc_base; + len = argp->iov[0].rpc_len - skip; + newbuf = (char*)rqstp->rq_defbuf.base + skip; + + memcpy(newbuf, argp->buf, len); + p = newbuf + len; + + for (i = 1; i < argp->nriov; i++) { + if (argp->iov[i].rpc_page) { + base = kmap(argp->iov[i].rpc_page) + argp->iov[i].rpc_offset; + } else { + base = argp->iov[i].rpc_base; + } + memcpy(p, base, argp->iov[i].rpc_len); + p += argp->iov[i].rpc_len; + if (argp->iov[i].rpc_page) + kunmap(argp->iov[i].rpc_page); + } + rqstp->rq_argbuf.base = rqstp->rq_defbuf.base; + rqstp->rq_argbuf.buf = (u32*)newbuf; + rqstp->rq_argbuf.nriov = 1; +} + + /* * Process the RPC request. */ @@ -322,6 +357,15 @@ svc_process(struct svc_serv *serv, struc */ if (procp->pc_xdrressize) svc_reserve(rqstp, procp->pc_xdrressize<<2); + + /* Linearize argbuf when the procedure can't handle it. + * It rarely happens on NFS v2/v3 but it would sometimes happen on + * NFS v4 according to its compound procedures. NFSv4 xdr routines + * have to handle splitted buffers or don't set RPC_HANDLE_IOVARG + * flag in the beginning. 
+ */ + if (argp->nriov > 1 && !(procp->pc_flags & RPC_HANDLE_IOVARG)) + svc_linearize_argbuf(rqstp); /* Call the function that processes the request. */ if (!versp->vs_dispatch) { --- linux.ORG/fs/nfsd/vfs.c Fri Oct 18 21:26:22 2030 +++ linux/fs/nfsd/vfs.c Fri Oct 18 21:26:01 2030 @@ -686,6 +686,61 @@ out: return err; } +static inline int +nfsd_writev(struct svc_rqst *rqstp, struct file *file, + char *buf, unsigned long cnt) +{ + struct iovec iov[RPCSVC_MAXIOV]; + struct rpcio_vec *rpciov = rqstp->rq_argbuf.iov; + unsigned int len, sub; + char *base = NULL; + int slot = 0; + int i; + mm_segment_t oldfs; + int err; + + /* Look for the starting rpciov including the buf. */ + for (i = 0; i < rqstp->rq_argbuf.nriov; i++) { + if (rpciov->rpc_page) { + /* HighMem is not supported yet. */ + if (PageHighMem(rpciov->rpc_page)) + BUG(); + base = page_address(rpciov->rpc_page) + rpciov->rpc_offset; + } else { + base = rpciov->rpc_base; + } + if (base <= buf && buf < base + rpciov->rpc_len) + break; + } + + iov[slot].iov_base = buf; + iov[slot].iov_len = rpciov->rpc_len - (buf - base); + len = iov[slot].iov_len; + for (i++, slot++, rpciov++ ; i < rqstp->rq_argbuf.nriov; i++, slot++, rpciov++) { + if (rpciov->rpc_page) { + /* HighMem is not supported yet. */ + if (PageHighMem(rpciov->rpc_page)) + BUG(); + iov[slot].iov_base = page_address(rpciov->rpc_page) + rpciov->rpc_offset; + } else { + iov[slot].iov_base = rpciov->rpc_base; + } + iov[slot].iov_len = rpciov->rpc_len; + len += iov[slot].iov_len; + } + while (len > cnt) { + sub = min_t(unsigned int, iov[slot-1].iov_len, len - cnt); + len -= sub; + iov[slot-1].iov_len -= sub; + if (iov[slot-1].iov_len == 0) + slot--; + } + oldfs = get_fs(); set_fs(KERNEL_DS); + err = file->f_op->writev(file, iov, slot, &file->f_pos); + set_fs(oldfs); + return err; +} + /* * Write data to a file. * The stable flag requests synchronous writes. 
@@ -740,11 +795,16 @@ nfsd_write(struct svc_rqst *rqstp, struc file.f_flags |= O_SYNC; /* Write the data. */ - oldfs = get_fs(); set_fs(KERNEL_DS); - err = vfs_write(&file, buf, cnt, &offset); + if (rqstp->rq_argbuf.nriov == 1) { + oldfs = get_fs(); set_fs(KERNEL_DS); + err = vfs_write(&file, buf, cnt, &offset); + set_fs(oldfs); + } else { + file.f_pos = offset; /* set write offset */ + err = nfsd_writev(rqstp, &file, buf, cnt); + } if (err >= 0) nfsdstats.io_write += cnt; - set_fs(oldfs); /* clear setuid/setgid flag after write */ if (err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) { --- linux.ORG/fs/nfsd/nfsproc.c Fri Oct 18 21:18:42 2030 +++ linux/fs/nfsd/nfsproc.c Fri Oct 18 21:26:01 2030 @@ -522,7 +522,7 @@ nfsd_proc_statfs(struct svc_rqst * rqstp #define nfssvc_release_none NULL struct nfsd_void { int dummy; }; -#define PROC(name, argt, rest, relt, cache, respsize) \ +#define PROC(name, argt, rest, relt, cache, respsize, flags) \ { (svc_procfunc) nfsd_proc_##name, \ (kxdrproc_t) nfssvc_decode_##argt, \ (kxdrproc_t) nfssvc_encode_##rest, \ @@ -532,6 +532,7 @@ struct nfsd_void { int dummy; }; 0, \ cache, \ respsize, \ + flags, \ } #define ST 1 /* status */ @@ -539,24 +540,24 @@ struct nfsd_void { int dummy; }; #define AT 18 /* attributes */ static struct svc_procedure nfsd_procedures2[18] = { - PROC(null, void, void, none, RC_NOCACHE, ST), - PROC(getattr, fhandle, attrstat, fhandle, RC_NOCACHE, ST+AT), - PROC(setattr, sattrargs, attrstat, fhandle, RC_REPLBUFF, ST+AT), - PROC(none, void, void, none, RC_NOCACHE, ST), - PROC(lookup, diropargs, diropres, fhandle, RC_NOCACHE, ST+FH+AT), - PROC(readlink, fhandle, readlinkres, none, RC_NOCACHE, ST+1+NFS_MAXPATHLEN/4), - PROC(read, readargs, readres, fhandle, RC_NOCACHE, ST+AT+1+NFSSVC_MAXBLKSIZE), - PROC(none, void, void, none, RC_NOCACHE, ST), - PROC(write, writeargs, attrstat, fhandle, RC_REPLBUFF, ST+AT), - PROC(create, createargs, diropres, fhandle, RC_REPLBUFF, ST+FH+AT), - PROC(remove, diropargs, void, none, 
RC_REPLSTAT, ST), - PROC(rename, renameargs, void, none, RC_REPLSTAT, ST), - PROC(link, linkargs, void, none, RC_REPLSTAT, ST), - PROC(symlink, symlinkargs, void, none, RC_REPLSTAT, ST), - PROC(mkdir, createargs, diropres, fhandle, RC_REPLBUFF, ST+FH+AT), - PROC(rmdir, diropargs, void, none, RC_REPLSTAT, ST), - PROC(readdir, readdirargs, readdirres, none, RC_REPLBUFF, 0), - PROC(statfs, fhandle, statfsres, none, RC_NOCACHE, ST+5), + PROC(null, void, void, none, RC_NOCACHE, ST, 0), + PROC(getattr, fhandle, attrstat, fhandle, RC_NOCACHE, ST+AT, 0), + PROC(setattr, sattrargs, attrstat, fhandle, RC_REPLBUFF, ST+AT, 0), + PROC(none, void, void, none, RC_NOCACHE, ST, 0), + PROC(lookup, diropargs, diropres, fhandle, RC_NOCACHE, ST+FH+AT, 0), + PROC(readlink, fhandle, readlinkres, none, RC_NOCACHE, ST+1+NFS_MAXPATHLEN/4, 0), + PROC(read, readargs, readres, fhandle, RC_NOCACHE, ST+AT+1+NFSSVC_MAXBLKSIZE, 0), + PROC(none, void, void, none, RC_NOCACHE, ST, 0), + PROC(write, writeargs, attrstat, fhandle, RC_REPLBUFF, ST+AT, RPC_HANDLE_IOVARG), + PROC(create, createargs, diropres, fhandle, RC_REPLBUFF, ST+FH+AT, 0), + PROC(remove, diropargs, void, none, RC_REPLSTAT, ST, 0), + PROC(rename, renameargs, void, none, RC_REPLSTAT, ST, 0), + PROC(link, linkargs, void, none, RC_REPLSTAT, ST, 0), + PROC(symlink, symlinkargs, void, none, RC_REPLSTAT, ST, 0), + PROC(mkdir, createargs, diropres, fhandle, RC_REPLBUFF, ST+FH+AT, 0), + PROC(rmdir, diropargs, void, none, RC_REPLSTAT, ST, 0), + PROC(readdir, readdirargs, readdirres, none, RC_REPLBUFF, 0, 0), + PROC(statfs, fhandle, statfsres, none, RC_NOCACHE, ST+5, 0), }; --- linux.ORG/fs/nfsd/nfs3proc.c Fri Oct 18 21:18:42 2030 +++ linux/fs/nfsd/nfs3proc.c Fri Oct 18 21:26:01 2030 @@ -645,7 +645,7 @@ nfsd3_proc_commit(struct svc_rqst * rqst #define nfsd3_voidres nfsd3_voidargs struct nfsd3_voidargs { int dummy; }; -#define PROC(name, argt, rest, relt, cache, respsize) \ +#define PROC(name, argt, rest, relt, cache, respsize, flags) \ { 
(svc_procfunc) nfsd3_proc_##name, \ (kxdrproc_t) nfs3svc_decode_##argt##args, \ (kxdrproc_t) nfs3svc_encode_##rest##res, \ @@ -655,6 +655,7 @@ struct nfsd3_voidargs { int dummy; }; 0, \ cache, \ respsize, \ + flags, \ } #define ST 1 /* status*/ @@ -664,28 +665,28 @@ struct nfsd3_voidargs { int dummy; }; #define WC (7+pAT) /* WCC attributes */ static struct svc_procedure nfsd_procedures3[22] = { - PROC(null, void, void, void, RC_NOCACHE, ST), - PROC(getattr, fhandle, attrstat, fhandle, RC_NOCACHE, ST+AT), - PROC(setattr, sattr, wccstat, fhandle, RC_REPLBUFF, ST+WC), - PROC(lookup, dirop, dirop, fhandle2, RC_NOCACHE, ST+FH+pAT+pAT), - PROC(access, access, access, fhandle, RC_NOCACHE, ST+pAT+1), - PROC(readlink, fhandle, readlink, fhandle, RC_NOCACHE, ST+pAT+1+NFS3_MAXPATHLEN/4), - PROC(read, read, read, fhandle, RC_NOCACHE, ST+pAT+4+NFSSVC_MAXBLKSIZE), - PROC(write, write, write, fhandle, RC_REPLBUFF, ST+WC+4), - PROC(create, create, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC), - PROC(mkdir, mkdir, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC), - PROC(symlink, symlink, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC), - PROC(mknod, mknod, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC), - PROC(remove, dirop, wccstat, fhandle, RC_REPLBUFF, ST+WC), - PROC(rmdir, dirop, wccstat, fhandle, RC_REPLBUFF, ST+WC), - PROC(rename, rename, rename, fhandle2, RC_REPLBUFF, ST+WC+WC), - PROC(link, link, link, fhandle2, RC_REPLBUFF, ST+pAT+WC), - PROC(readdir, readdir, readdir, fhandle, RC_NOCACHE, 0), - PROC(readdirplus,readdirplus, readdir, fhandle, RC_NOCACHE, 0), - PROC(fsstat, fhandle, fsstat, void, RC_NOCACHE, ST+pAT+2*6+1), - PROC(fsinfo, fhandle, fsinfo, void, RC_NOCACHE, ST+pAT+12), - PROC(pathconf, fhandle, pathconf, void, RC_NOCACHE, ST+pAT+6), - PROC(commit, commit, commit, fhandle, RC_NOCACHE, ST+WC+2), + PROC(null, void, void, void, RC_NOCACHE, ST, 0), + PROC(getattr, fhandle, attrstat, fhandle, RC_NOCACHE, ST+AT, 0), + PROC(setattr, sattr, wccstat, fhandle, 
RC_REPLBUFF, ST+WC, 0), + PROC(lookup, dirop, dirop, fhandle2, RC_NOCACHE, ST+FH+pAT+pAT, 0), + PROC(access, access, access, fhandle, RC_NOCACHE, ST+pAT+1, 0), + PROC(readlink, fhandle, readlink, fhandle, RC_NOCACHE, ST+pAT+1+NFS3_MAXPATHLEN/4, 0), + PROC(read, read, read, fhandle, RC_NOCACHE, ST+pAT+4+NFSSVC_MAXBLKSIZE, 0), + PROC(write, write, write, fhandle, RC_REPLBUFF, ST+WC+4, RPC_HANDLE_IOVARG), + PROC(create, create, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0), + PROC(mkdir, mkdir, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0), + PROC(symlink, symlink, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0), + PROC(mknod, mknod, create, fhandle2, RC_REPLBUFF, ST+(1+FH+pAT)+WC, 0), + PROC(remove, dirop, wccstat, fhandle, RC_REPLBUFF, ST+WC, 0), + PROC(rmdir, dirop, wccstat, fhandle, RC_REPLBUFF, ST+WC, 0), + PROC(rename, rename, rename, fhandle2, RC_REPLBUFF, ST+WC+WC, 0), + PROC(link, link, link, fhandle2, RC_REPLBUFF, ST+pAT+WC, 0), + PROC(readdir, readdir, readdir, fhandle, RC_NOCACHE, 0, 0), + PROC(readdirplus,readdirplus, readdir, fhandle, RC_NOCACHE, 0, 0), + PROC(fsstat, fhandle, fsstat, void, RC_NOCACHE, ST+pAT+2*6+1, 0), + PROC(fsinfo, fhandle, fsinfo, void, RC_NOCACHE, ST+pAT+12, 0), + PROC(pathconf, fhandle, pathconf, void, RC_NOCACHE, ST+pAT+6, 0), + PROC(commit, commit, commit, fhandle, RC_NOCACHE, ST+WC+2, 0), }; struct svc_version nfsd_version3 = { --- linux.ORG/fs/nfsd/nfs4proc.c Fri Oct 18 21:18:42 2030 +++ linux/fs/nfsd/nfs4proc.c Fri Oct 18 21:26:01 2030 @@ -711,7 +711,7 @@ out: #define nfs4svc_release_compound NULL struct nfsd4_voidargs { int dummy; }; -#define PROC(name, argt, rest, relt, cache, respsize) \ +#define PROC(name, argt, rest, relt, cache, respsize, flags) \ { (svc_procfunc) nfsd4_proc_##name, \ (kxdrproc_t) nfs4svc_decode_##argt##args, \ (kxdrproc_t) nfs4svc_encode_##rest##res, \ @@ -721,6 +721,7 @@ struct nfsd4_voidargs { int dummy; }; 0, \ cache, \ respsize, \ + flags, \ } /* @@ -734,8 +735,8 @@ struct 
nfsd4_voidargs { int dummy; }; * better XID's. */ static struct svc_procedure nfsd_procedures4[2] = { - PROC(null, void, void, void, RC_NOCACHE, 1), - PROC(compound, compound, compound, compound, RC_NOCACHE, NFSD_BUFSIZE) + PROC(null, void, void, void, RC_NOCACHE, 1, 0), + PROC(compound, compound, compound, compound, RC_NOCACHE, NFSD_BUFSIZE, 0) }; struct svc_version nfsd_version4 = { --- linux.ORG/fs/lockd/svcproc.c Fri Oct 18 21:18:42 2030 +++ linux/fs/lockd/svcproc.c Fri Oct 18 21:26:01 2030 @@ -553,6 +553,7 @@ struct nlm_void { int dummy; }; .pc_argsize = sizeof(struct nlm_##argt), \ .pc_ressize = sizeof(struct nlm_##rest), \ .pc_xdrressize = respsize, \ + .pc_flags = 0, \ } #define Ck (1+8) /* cookie */ --- linux.ORG/fs/lockd/svc4proc.c Fri Oct 18 21:18:42 2030 +++ linux/fs/lockd/svc4proc.c Fri Oct 18 21:26:01 2030 @@ -527,6 +527,7 @@ struct nlm_void { int dummy; }; .pc_argsize = sizeof(struct nlm_##argt), \ .pc_ressize = sizeof(struct nlm_##rest), \ .pc_xdrressize = respsize, \ + .pc_flags = 0, \ } #define Ck (1+8) /* cookie */ #define No (1+1024/4) /* netobj */ [-- Attachment #8: va07-nfsbigbuf-2.5.43.patch --] [-- Type: Text/Plain, Size: 416 bytes --] --- linux.ORG/include/linux/nfsd/const.h Sat Oct 12 13:22:12 2002 +++ linux/include/linux/nfsd/const.h Sun Oct 13 22:07:37 2030 @@ -20,9 +20,9 @@ #define NFSSVC_MAXVERS 3 /* - * Maximum blocksize supported by daemon currently at 32K + * Maximum blocksize supported by daemon currently at 60K */ -#define NFSSVC_MAXBLKSIZE (32*1024) +#define NFSSVC_MAXBLKSIZE ((60*1024)&~(PAGE_SIZE-1)) #ifdef __KERNEL__ ^ permalink raw reply [flat|nested] 87+ messages in thread
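The va07 change above rounds the 60K limit down to a whole number of pages with `(60*1024)&~(PAGE_SIZE-1)`. A minimal userspace sketch of that arithmetic follows. The helper name `sketch_max_blksize` and the page sizes passed to it are illustrative assumptions, not kernel names, and the mask is only valid for power-of-two page sizes.

```c
#include <assert.h>

/*
 * Userspace sketch of the rounding used for NFSSVC_MAXBLKSIZE in va07.
 * sketch_max_blksize() is a hypothetical helper, not a kernel function.
 * page_size must be a power of two for the mask to be valid.
 */
static unsigned long sketch_max_blksize(unsigned long page_size)
{
	/* Clear the low bits so the result is a multiple of page_size. */
	return (60UL * 1024UL) & ~(page_size - 1UL);
}
```

With 4K pages the limit stays at 61440 bytes (exactly 15 pages); with 8K pages it drops to 57344 bytes (7 pages). A page size of 64K would round the limit down to 0, so the expression presumably assumes pages well below 60K.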
* Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-18 13:11 ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
@ 2002-10-23 1:18 ` Neil Brown
2002-10-23 3:53 ` Hirokazu Takahashi
` (2 more replies)
0 siblings, 3 replies; 87+ messages in thread
From: Neil Brown @ 2002-10-23 1:18 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs

On Friday October 18, taka@valinux.co.jp wrote:
> Hello,
>
> I've ported the zerocopy patches against linux-2.5.43 with
> davem's udp-sendfile patches and your patches which you posted
> on Wed, 16 Oct.

Thanks for these...

I have been thinking some more about this, trying to understand the
big picture, and I'm afraid that I think I want some more changes.

In particular, I think it would be good to use 'struct xdr_buf' from
sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
we could share some of the infrastructure.

I think this would work quite well for sending read responses as there
is a 'head' iovec for the interesting bits of the packet, an array of
pages for the data, and a 'tail' iovec for the padding.

I'm not certain about receiving write requests.
I imagine that it might work to:
1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
   skb into the head iovec, and hold onto the skbuf (like we
   currently do).
2/ enter the nfs server to parse that header.
3/ When the server finds it needs more data for a write, it
   collects the pages and calls xdr_partial_copy_from_skb
   to copy the rest of the skb directly into the page cache.

Does that make any sense?

Also, I am wondering about the way that you put zero-copy support into
nfsd_readdir.

Presumably the gain is that sock_sendmsg does a copy into a
skbuf and then a DMA out of that, while ->sendpage does just the DMA.
In that case, maybe it would be better to get "struct page *" pointers
for the pages in the default buffer, and pass them to
->sendpage.

I would like to get to a situation where we don't need to do a 64K
kmalloc for each server, but can work entirely with individual pages.

I might try converting svcsock etc to use xdr_buf later today or
tomorrow unless I hear a good reason why it won't work, or someone
else beats me to it...

NeilBrown
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
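The head/pages/tail layout Neil describes for read replies can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct sketch_xdr_buf`, its field names, and the helper functions are hypothetical stand-ins and are not claimed to match the kernel's `struct xdr_buf`; a 4K page size is assumed. The one fact relied on is that XDR pads opaque data to a multiple of 4 bytes, which is what the tail would carry.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4K pages */

/* Hypothetical userspace model; the field names are illustrative only. */
struct sketch_vec {
	void   *base;
	size_t  len;
};

struct sketch_xdr_buf {
	struct sketch_vec head;	/* RPC/NFS reply header */
	void  **pages;		/* file data, one pointer per page */
	size_t  page_base;	/* offset of the data in the first page */
	size_t  page_len;	/* bytes of data held in the pages */
	struct sketch_vec tail;	/* XDR padding after the data */
};

/* XDR pads opaque data to a multiple of 4 bytes. */
static size_t sketch_xdr_pad(size_t n)
{
	return (4 - (n & 3)) & 3;
}

/* Number of page slots the data region touches. */
static size_t sketch_pages_needed(size_t page_base, size_t page_len)
{
	if (page_len == 0)
		return 0;
	return (page_base + page_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Describe a READ reply: header in head, data in pages, padding in tail. */
static void sketch_fill_read_reply(struct sketch_xdr_buf *buf,
				   void *hdr, size_t hdr_len,
				   void **pages, size_t page_base,
				   size_t count)
{
	static char zeros[4];	/* padding bytes are zero */

	buf->head.base = hdr;
	buf->head.len = hdr_len;
	buf->pages = pages;
	buf->page_base = page_base;
	buf->page_len = count;
	buf->tail.base = zeros;
	buf->tail.len = sketch_xdr_pad(count);
}

/* Total number of bytes the three parts describe. */
static size_t sketch_total_len(const struct sketch_xdr_buf *buf)
{
	return buf->head.len + buf->page_len + buf->tail.len;
}
```

In this model only the head and the (at most 3-byte) tail would ever be copied; the data pages are referenced in place, which is the point of the zero-copy read path.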
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 1:18 ` Neil Brown
@ 2002-10-23 3:53 ` Hirokazu Takahashi
2002-10-23 5:40 ` Hirokazu Takahashi
2002-10-23 6:10 ` Neil Brown
2002-10-23 21:50 ` Hirokazu Takahashi
2002-10-25 9:52 ` Hirokazu Takahashi
2 siblings, 2 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 3:53 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> > I've ported the zerocopy patches against linux-2.5.43 with
> > davem's udp-sendfile patches and your patches which you posted
> > on Wed, 16 Oct.
>
> Thanks for these...
>
> I have been thinking some more about this, trying to understand the
> big picture, and I'm afraid that I think I want some more changes.
>
> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> we could share some of the infrastructure.

Sharing the same infrastructure sounds good.
I agree with your approach.

> I think this would work quite well for sending read responses as there
> is a 'head' iovec for the interesting bits of the packet, an array of
> pages for the data, and a 'tail' iovec for the padding.

One point worries me: xdr_buf can't handle NFSv4 compound operations
correctly yet. I don't know what will happen if we send page data and
non-page data together, since NFSv4 will try to pack several operations
into one xdr_buf.

If we care about NFSv4, it could look like this:

    struct svc_buf {
            u32 *   area;    /* allocated memory */
            u32 *   base;    /* base of RPC datagram */
            int     buflen;  /* total length of buffer */
            u32 *   buf;     /* read/write pointer */
            int     len;     /* current end of buffer */
            struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
            int     nriov;
    };

I guess it would be better to fix the NFSv4 problems after Halloween.

> I'm not certain about receiving write requests.
> I imagine that it might work to:
> 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
>    skb into the head iovec, and hold onto the skbuf (like we
>    currently do).
> 2/ enter the nfs server to parse that header.
> 3/ When the server finds it needs more data for a write, it
>    collects the pages and calls xdr_partial_copy_from_skb
>    to copy the rest of the skb directly into the page cache.

I think that would be hard work: it amounts to writing another
generic_file_write function. I feel it may be overkill.
For example, we must read a page in if it isn't in the cache, and we
must allocate disk blocks if the file doesn't have them yet X-(
Some filesystems, like XFS, have their own way of updating the pagecache.

We should keep kNFSd away from the implementation of VM/FS as much
as we can.

> Does that make any sense?
>
> Also, I am wondering about the way that you put zero-copy support into
> nfsd_readdir.
>
> Presumably the gain is that sock_sendmsg does a copy into a
> skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> In that case, maybe it would be better to get "struct page *" pointers
> for the pages in the default buffer, and pass them to
> ->sendpage.

That seems like a good idea.

The problem is that it's hard to know when the page will be released.
The page will be held by the TCP/IP stack. TCP may hold it for a while
for retransmission. UDP packets may also be held in the driver queue
after ->sendpage has returned.

We should check the reference count of the default buffer and
decide whether to use that buffer or allocate a new one.
We think almost all requests can use the default buffer.

> I would like to get to a situation where we don't need to do a 64K
> kmalloc for each server, but can work entirely with individual pages.
>
> I might try converting svcsock etc to use xdr_buf later today or
> tomorrow unless I hear a good reason why it won't work, or someone
> else beats me to it...

If you don't mind, I'll work on the readdir stuff while you're fighting
with the xdr_buf stuff.

Thank you,
Hirokazu Takahashi
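The "check the reference count, then reuse or allocate" decision described in this reply can be sketched in userspace as follows. Everything here is a hypothetical stand-in: `struct sketch_page` and the `sketch_*` helpers only imitate reference-counted pages, and the rule "a count of 1 means only the server thread holds the page" is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted page. */
struct sketch_page {
	int  refcount;
	char data[4096];
};

static struct sketch_page *sketch_page_alloc(void)
{
	struct sketch_page *p = calloc(1, sizeof(*p));

	if (p)
		p->refcount = 1;	/* the caller owns one reference */
	return p;
}

static void sketch_page_get(struct sketch_page *p)
{
	p->refcount++;
}

static void sketch_page_put(struct sketch_page *p)
{
	if (--p->refcount == 0)
		free(p);
}

/*
 * Pick the page to build a reply in. If only the server thread references
 * the default page, reuse it. Otherwise an earlier reply is still queued
 * somewhere below us, so fall back to a fresh page (which may be NULL if
 * the allocation fails).
 */
static struct sketch_page *sketch_pick_reply_page(struct sketch_page *def)
{
	if (def->refcount == 1)
		return def;
	return sketch_page_alloc();
}
```

In the common case the earlier reply has already left the stack, the count is back to 1, and no allocation is needed, which matches the expectation that almost all requests can use the default buffer.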
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 3:53 ` Hirokazu Takahashi
@ 2002-10-23 5:40 ` Hirokazu Takahashi
2002-10-23 6:03 ` Neil Brown
2002-10-23 6:10 ` Neil Brown
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 5:40 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> > Also, I am wondering about the way that you put zero-copy support into
> > nfsd_readdir.
> >
> > Presumably the gain is that sock_sendmsg does a copy into a
> > skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> > In that case, maybe it would be better to get "struct page *" pointers
> > for the pages in the default buffer, and pass them to
> > ->sendpage.
>
> That seems like a good idea.
>
> The problem is that it's hard to know when the page will be released.
> The page will be held by the TCP/IP stack. TCP may hold it for a while
> for retransmission. UDP packets may also be held in the driver queue
> after ->sendpage has returned.
>
> We should check the reference count of the default buffer and
> decide whether to use that buffer or allocate a new one.
> We think almost all requests can use the default buffer.

I mean we can't use a page in the default buffer.
We should use the page next to the default buffer, or we should
prepare another page for nfsd_readdir.

I don't know whether allocating an extra page for each server
is good or not.
What do you think about it?

> > I would like to get to a situation where we don't need to do a 64K
> > kmalloc for each server, but can work entirely with individual pages.
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 5:40 ` Hirokazu Takahashi
@ 2002-10-23 6:03 ` Neil Brown
2002-10-23 22:35 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Neil Brown @ 2002-10-23 6:03 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs

On Wednesday October 23, taka@valinux.co.jp wrote:
> Hello,
>
> > > Also, I am wondering about the way that you put zero-copy support into
> > > nfsd_readdir.
> > >
> > > Presumably the gain is that sock_sendmsg does a copy into a
> > > skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> > > In that case, maybe it would be better to get "struct page *" pointers
> > > for the pages in the default buffer, and pass them to
> > > ->sendpage.
> >
> > It seems good idea.
> >
> > The problem is that it's hard to know when the page will be released.
> > The page will be held by TCP/IP stack. TCP may hold it for a while
> > by way of retransmition. UDP pakcets may also held in driver-queue
> > after ->sendpage has done.
> >
> > We should check reference count of the default buffer and
> > decide to use the buffer or allocate new one.
> > We think Almost request can use the default buffer.
>
> I mean we can't use a page in the default buffer.
> We should use the page next to the default buffer or we should
> prepare another page for nfsd_readdir.
>
> I don't know whether allocating an extra page for each server
> is good or not.
> How do you think about it?

I think I would change the approach to buffering.
Instead of having a fixed set of pages, we just allocate new pages as
needed, having handed old ones over to the networking layer.

So we have a pool of pages that we draw from when generating replies,
and refill before accepting a new request.

Of course that is a fairly big change from where we are now, so it might
take a while. We should probably get zero copy reads in first...

NeilBrown
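The pool idea in the message above can be sketched in user space. This is only an illustration under stated assumptions: `struct page_pool`, `pool_refill()`, `pool_take()` and `net_release_page()` are hypothetical names, `calloc()`/`free()` stand in for the kernel page allocator, and the real nfsd buffering may look nothing like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_PAGE_SIZE 4096	/* assumed page size for this sketch */
#define POOL_CAPACITY  16	/* hypothetical per-server pool size */

/* Hypothetical per-server pool of free pages. */
struct page_pool {
	void  *pages[POOL_CAPACITY];
	size_t count;		/* number of pages currently held */
};

/*
 * Top the pool up to 'want' pages. Meant to be called before a new
 * request is accepted. Returns 0 on success, -1 if allocation failed.
 */
static int pool_refill(struct page_pool *pool, size_t want)
{
	assert(want <= POOL_CAPACITY);
	while (pool->count < want) {
		void *page = calloc(1, POOL_PAGE_SIZE);

		if (page == NULL)
			return -1;
		pool->pages[pool->count++] = page;
	}
	return 0;
}

/*
 * Take one page out of the pool to build a reply. Ownership moves to
 * the caller, who hands the page to the network layer; it is never
 * returned to the pool. Returns NULL if the pool is empty.
 */
static void *pool_take(struct page_pool *pool)
{
	if (pool->count == 0)
		return NULL;
	return pool->pages[--pool->count];
}

/* Stand-in for the network layer dropping its reference after transmit. */
static void net_release_page(void *page)
{
	free(page);
}
```

The point of the design is that the server never has to know when the network layer is done with a page: it gives the page away and replaces it at the next refill.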
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 6:03 ` Neil Brown
@ 2002-10-23 22:35 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 22:35 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> > > > Also, I am wondering about the way that you put zero-copy support into
> > > > nfsd_readdir.

> I think I would change the approach to buffering.
> Instead of having a fixed set of pages, we just allocate new pages as
> needed, having handed old ones over to the networking layer.
>
> So we have a pool of pages that we draw from when generating replies,
> and refill before accepting a new request.

We can also put RPC/NFS headers on pages and send them without copying.
This seems good for NFSv4 COMPOUNDs.

> Ofcourse that is a fairly big change from where we are now so it might
> take a while. We should probably get zero copy reads in first...

Yes.
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 3:53 ` Hirokazu Takahashi
2002-10-23 5:40 ` Hirokazu Takahashi
@ 2002-10-23 6:10 ` Neil Brown
2002-10-23 7:08 ` Hirokazu Takahashi
2002-10-23 15:23 ` Trond Myklebust
1 sibling, 2 replies; 87+ messages in thread
From: Neil Brown @ 2002-10-23 6:10 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs, William A.(Andy) Adamson, trond.myklebust

On Wednesday October 23, taka@valinux.co.jp wrote:
>
> I'm wondering one point that the xdr_buf can't hanldle NFSv4 compound
> operation correctly yet. I don't know what will happen if we send some
> page data and some non-page data together as it will try to pack some
> operations in one xdr_buf.
>
> If we care about NFSv4 it could be like this:
>
> struct svc_buf {
> 	u32 *	area;	/* allocated memory */
> 	u32 *	base;	/* base of RPC datagram */
> 	int	buflen;	/* total length of buffer */
> 	u32 *	buf;	/* read/write pointer */
> 	int	len;	/* current end of buffer */
>
> 	struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
> 	int nriov;
> }
>
> I guess it would be better to fix NFSv4 problems after Halloween.
>

Hmm. I wonder what plans there are for this w.r.t. the NFSv4 client.
Andy? Trond?

I suspect that COMPOUNDS with multiple READ or WRITE requests would be
fairly rare, and it would probably be reasonable to respond with
ERESOURCE (or however it is spelt).
i.e. Reject any operation that would need to use a second set of pages
in a response.

> > I'm not certain about receiving write requests.
> > I imagine that it might work to:
> >  1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> >     skb into the head iovec, and hold onto the skbuf (like we
> >     currently do).
> >  2/ enter the nfs server to parse that header.
> >  3/ When the server finds it needs more data for a write, it
> >     collects the pages and calls xdr_partial_copy_from_skb
> >     to copy the rest of the skb directly into the page cache.
>
> I think it will be hard work that it's the same that we make another
> generic_file_write function. I feel it may be overkill.
>  e.g. We must read a page if it isn't on the cache.
>       We must allocate disk blocks if the file don't have yet X-(
>       Some filesytems like XFS have its own way of updating pagecache.
>
> We should make kNFSd keep away from the implementation of VM/FS
> as possible as we can.

Could we not use 'mmap'? Maybe not, and probably best to avoid it as
you say.

I was thinking it would be nice to be able to do the udp-checksum at
the same time as the copy-into-page-cache, but maybe we just say that
you need a NIC that does checksums if you want to do single-copy NFS
writes.

NeilBrown
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 6:10 ` Neil Brown
@ 2002-10-23 7:08 ` Hirokazu Takahashi
2002-10-23 15:23 ` Trond Myklebust
1 sibling, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 7:08 UTC (permalink / raw)
To: neilb; +Cc: nfs, andros, trond.myklebust

Hello,

> > If we care about NFSv4 it could be like this:
> >
> > struct svc_buf {
> > 	u32 *	area;	/* allocated memory */
> > 	u32 *	base;	/* base of RPC datagram */
> > 	int	buflen;	/* total length of buffer */
> > 	u32 *	buf;	/* read/write pointer */
> > 	int	len;	/* current end of buffer */
> >
> > 	struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
> > 	int nriov;
> > }
> >
> > I guess it would be better to fix NFSv4 problems after Halloween.
> >
>
> Hmm. I wonder what plans there are for this w.r.t. to NFSv4 client.
> Andy? Trond?
>
> I suspect that COMPOUNDS with multiple READ or WRITE requests would be
> fairly rare, and it would probably be reasonable to respond with
> ERESOURCE (or however it is spelt).

Yeah, it might be.

> i.e. Reject any operation that would need to use a second set of pages
> in a response.
>
> > > I'm not certain about receiving write requests.
> > > I imagine that it might work to:
> > >  1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> > >     skb into the head iovec, and hold onto the skbuf (like we
> > >     currently do).
> > >  2/ enter the nfs server to parse that header.
> > >  3/ When the server finds it needs more data for a write, it
> > >     collects the pages and calls xdr_partial_copy_from_skb
> > >     to copy the rest of the skb directly into the page cache.
> >
> > I think it will be hard work that it's the same that we make another
> > generic_file_write function. I feel it may be overkill.
> >  e.g. We must read a page if it isn't on the cache.
> >       We must allocate disk blocks if the file don't have yet X-(
> >       Some filesytems like XFS have its own way of updating pagecache.
> >
> > We should make kNFSd keep away from the implementation of VM/FS
> > as possible as we can.
>
> Could we not use 'mmap'? Maybe not, and probably best to avoid it as
> you say.

Using mmap sounds interesting to me, and I was thinking about it.
Regular mmap will cause many block reads from disk on each page fault,
as the fault handler can't know what size of write will follow the fault.
Those reads are wasted if the write is 4KB, which often happens
with NFS.
Standard write/writev can handle this case without reading blocks.

> I was thinking it would be nice to be able to do the udp-checksum at
> the same time as the copy-into-page-cache, but maybe we just say that
> you need a NIC that does checksums if you want to do single-copy NFS
> writes.

Or we can enhance the standard generic_file_write() to take a
copy routine, like this:

	generic_file_write(file, buf, count, ppos, nfsd_write_actor);
	generic_file_writev(file, iovec, nr_segs, ppos, nfsd_write_actor);

	nfsd_write_actor(struct page *page, int offset, ......)
	{
		xdr_partial_copy_from_skb(.....)
	}

But I realized there is one big problem with both approaches.
What can we do when the checksum turns out to be wrong?
The pages will already be filled with broken data.

Thank you,
Hirokazu Takahashi.
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 6:10 ` Neil Brown
2002-10-23 7:08 ` Hirokazu Takahashi
@ 2002-10-23 15:23 ` Trond Myklebust
1 sibling, 0 replies; 87+ messages in thread
From: Trond Myklebust @ 2002-10-23 15:23 UTC (permalink / raw)
To: Neil Brown
Cc: Hirokazu Takahashi, nfs, William A.(Andy) Adamson, trond.myklebust

>>>>> " " == Neil Brown <neilb@cse.unsw.edu.au> writes:

     > Hmm. I wonder what plans there are for this w.r.t. to NFSv4
     > client. Andy? Trond?

There's really no need for anything beyond what we have. There's no
call in the client for stringing more than one set of pages together:
you are always dealing with a single set of contiguous pages to
read/write.

     > I suspect that COMPOUNDS with multiple READ or WRITE requests
     > would be fairly rare, and it would probably be reasonable to
     > respond with ERESOURCE (or however it is spelt).

Alternatively, you could add a list_head to the xdr_buf struct so that
you can string several of them together.
Frankly, though, it would be a rather strange NFSv4 client that wants
to do this sort of operation. There's just no advantage to it...

     > I was thinking it would be nice to be able to do the
     > udp-checksum at the same time as the copy-into-page-cache, but
     > maybe we just say that you need a NIC that does checksums if
     > you want to do single-copy NFS writes.

Right. The very last thing you want to do is to copy into the page
cache, then find out that the checksum didn't match up.

Cheers,
  Trond
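To make the ordering problem concrete, here is a small user-space sketch that verifies a checksum before anything reaches the destination buffer. It is not kernel code: `inet_checksum()` and `copy_if_checksum_ok()` are hypothetical helper names, the checksum is the generic RFC 1071 Internet checksum computed over a bare byte buffer (no UDP pseudo-header), and the destination array merely stands in for a page-cache page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * RFC 1071 Internet checksum over 'len' bytes, read as big-endian
 * 16-bit words. An odd trailing byte is treated as the high byte of
 * a final word padded with zero.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Copy 'src' into 'dst' only if its checksum matches 'expected'.
 * Returns 0 on success, -1 on mismatch. On mismatch 'dst' is left
 * untouched, which is the property a combined copy-and-checksum
 * pass cannot give.
 */
static int copy_if_checksum_ok(uint8_t *dst, const uint8_t *src,
			       size_t len, uint16_t expected)
{
	assert(len <= 65535);	/* keeps the 32-bit accumulator safe */
	if (inet_checksum(src, len) != expected)
		return -1;
	memcpy(dst, src, len);
	return 0;
}
```

Verifying first costs a second pass over the data, which is exactly the cost the thread wants to avoid; with a NIC that validates checksums in hardware, the first pass is already done by the time the data arrives and only the copy remains.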
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 1:18 ` Neil Brown
2002-10-23 3:53 ` Hirokazu Takahashi
@ 2002-10-23 21:50 ` Hirokazu Takahashi
2002-10-23 23:55 ` Trond Myklebust
2002-10-25 9:52 ` Hirokazu Takahashi
2 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 21:50 UTC (permalink / raw)
To: Trond Myklebust, neilb; +Cc: nfs

Hello,

> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> we could share some of the infrastructure.

I was thinking about the nfs clients.
Why don't we make xprt_sendmsg() use the sendpage interface instead
of calling sock_sendmsg(), so that we can avoid the deadlock that
multiple kmap()s in xprt_sendmsg() might cause on heavily loaded
machines?

Thank you,
Hirokazu Takahashi.
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 21:50 ` Hirokazu Takahashi
@ 2002-10-23 23:55 ` Trond Myklebust
2002-10-24 1:33 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Trond Myklebust @ 2002-10-23 23:55 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > I was thinking about the nfs clients. Why don't we make
     > xprt_sendmsg() use the sendpage interface instead of calling
     > sock_sendmsg() so that we can avoid dead-lock which multiple
     > kmap()s in xprt_sendmsg() might cause on heavily loaded
     > machines.

I'm definitely in favour of such a change. Particularly so if the UDP
interface is ready.

Cheers,
  Trond
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 23:55 ` Trond Myklebust
@ 2002-10-24 1:33 ` Hirokazu Takahashi
2002-10-27 10:39 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-24 1:33 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

> > I was thinking about the nfs clients. Why don't we make
> > xprt_sendmsg() use the sendpage interface instead of calling
> > sock_sendmsg() so that we can avoid dead-lock which multiple
> > kmap()s in xprt_sendmsg() might cause on heavily loaded
> > machines.
>
> I'm definitely in favour of such a change. Particularly so if the UDP
> interface is ready.

I've implemented it, and you can find it in linux-2.5.44.
The interface is the same as TCP's.

Thank you,
Hirokazu Takahashi.
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-24 1:33 ` Hirokazu Takahashi
@ 2002-10-27 10:39 ` Hirokazu Takahashi
2002-10-28 16:31 ` Trond Myklebust
0 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-27 10:39 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

> > > I was thinking about the nfs clients. Why don't we make
> > > xprt_sendmsg() use the sendpage interface instead of calling
> > > sock_sendmsg() so that we can avoid dead-lock which multiple
> > > kmap()s in xprt_sendmsg() might cause on heavily loaded
> > > machines.
> >
> > I'm definitely in favour of such a change. Particularly so if the UDP
> > interface is ready.

I have just modified xprt_sendmsg() to use the sendpage interface.
I've checked that it works fine on both TCP and UDP.

I think this code needs to be cleaned up, but I don't have any good
ideas about how.

Thank you,
Hirokazu Takahashi.

--- linux/net/sunrpc/xdr.c.ORG	Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c	Sun Oct 27 19:07:05 2030
@@ -110,12 +110,15 @@ xdr_encode_pages(struct xdr_buf *xdr, st
 	xdr->page_len = len;
 
 	if (len & 3) {
-		struct iovec *iov = xdr->tail;
 		unsigned int pad = 4 - (len & 3);
-
-		iov->iov_base = (void *) "\0\0\0";
-		iov->iov_len = pad;
 		len += pad;
+		if (((base + len) & ~PAGE_CACHE_MASK) + pad <= PAGE_CACHE_SIZE) {
+			xdr->page_len += pad;
+		} else {
+			struct iovec *iov = xdr->tail;
+			iov->iov_base = (void *) "\0\0\0";
+			iov->iov_len = pad;
+		}
 	}
 	xdr->len += len;
 }
--- linux/net/sunrpc/xprt.c.ORG	Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c	Sun Oct 27 19:07:38 2030
@@ -60,6 +60,7 @@
 #include <linux/unistd.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 #include <net/sock.h>
 #include <net/checksum.h>
@@ -207,48 +208,107 @@ xprt_release_write(struct rpc_xprt *xprt
 	spin_unlock_bh(&xprt->sock_lock);
 }
 
+static inline int
+__xprt_sendmsg(struct socket *sock, struct xdr_buf *xdr, struct msghdr *msg, size_t skip)
+{
+	unsigned int slen = xdr->len - skip;
+	mm_segment_t oldfs;
+	int result = 0;
+	struct page **ppage = xdr->pages;
+	unsigned int len, pglen = xdr->page_len;
+	size_t base = 0;
+	int flags;
+	int ret;
+	struct iovec niv;
+
+	msg->msg_iov = &niv;
+	msg->msg_iovlen = 1;
+
+	if (xdr->head[0].iov_len > skip) {
+		len = xdr->head[0].iov_len - skip;
+		niv.iov_base = xdr->head[0].iov_base + skip;
+		niv.iov_len = len;
+		if (slen > len)
+			msg->msg_flags |= MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, msg, len);
+		set_fs(oldfs);
+		if (result != len)
+			return result;
+		slen -= len;
+		skip = 0;
+	} else {
+		skip -= xdr->head[0].iov_len;
+	}
+	if (pglen == 0)
+		goto send_tail;
+	if (skip >= pglen) {
+		skip -= pglen;
+		goto send_tail;
+	}
+	if (skip || xdr->page_base) {
+		pglen -= skip;
+		base = xdr->page_base + skip;
+		ppage += base >> PAGE_CACHE_SHIFT;
+		base &= ~PAGE_CACHE_MASK;
+	}
+	len = PAGE_CACHE_SIZE - base;
+	if (len > pglen) len = pglen;
+	flags = MSG_MORE;
+	while (pglen > 0) {
+		if (slen == len)
+			flags = 0;
+		ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		if (ret > 0)
+			result += ret;
+		if (ret != len) {
+			if (result == 0)
+				result = ret;
+			return result;
+		}
+		slen -= len;
+		pglen -= len;
+		len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+		base = 0;
+		ppage++;
+	}
+	skip = 0;
+send_tail:
+	if (xdr->tail[0].iov_len) {
+		niv.iov_base = xdr->tail[0].iov_base + skip;
+		niv.iov_len = xdr->tail[0].iov_len - skip;
+		msg->msg_flags &= ~MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		ret = sock_sendmsg(sock, msg, niv.iov_len);
+		set_fs(oldfs);
+		if (ret > 0)
+			result += ret;
+		if (result == 0)
+			result = ret;
+	}
+	return result;
+}
+
 /*
  * Write data to socket.
  */
 static inline int
 xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
 {
-	struct socket	*sock = xprt->sock;
 	struct msghdr	msg;
-	struct xdr_buf	*xdr = &req->rq_snd_buf;
-	struct iovec	niv[MAX_IOVEC];
-	unsigned int	niov, slen, skip;
-	mm_segment_t	oldfs;
 	int		result;
 
-	if (!sock)
-		return -ENOTCONN;
-
-	xprt_pktdump("packet data:",
-				req->rq_svec->iov_base,
-				req->rq_svec->iov_len);
-
-	/* Dont repeat bytes */
-	skip = req->rq_bytes_sent;
-	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
-	msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov	= niv;
-	msg.msg_iovlen	= niov;
 	msg.msg_name	= (struct sockaddr *) &xprt->addr;
 	msg.msg_namelen = sizeof(xprt->addr);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
-	set_fs(oldfs);
-
-	xdr_kunmap(xdr, skip);
+	result = __xprt_sendmsg(xprt->sock, &req->rq_snd_buf, &msg, req->rq_bytes_sent);
 
-	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);
+	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);
 
 	if (result >= 0)
 		return result;
 
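The page loop in __xprt_sendmsg() above walks a byte range that starts part-way into the first page and sends one chunk per page. The arithmetic can be tried out in user space with the sketch below. It is an illustration only: `struct page_chunk` and `split_into_page_chunks()` are hypothetical names, and a fixed 4096-byte page size is assumed in place of PAGE_CACHE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed; stands in for PAGE_CACHE_SIZE */

/* One piece of the range, confined to a single page. */
struct page_chunk {
	size_t page_index;	/* which page in the array */
	size_t offset;		/* byte offset inside that page */
	size_t len;		/* bytes to send from that page */
};

/*
 * Split the byte range [base, base + len), laid over an array of
 * consecutive pages, into per-page chunks. Writes at most 'max'
 * chunks into 'out' and returns how many were written.
 */
static size_t split_into_page_chunks(size_t base, size_t len,
				     struct page_chunk *out, size_t max)
{
	size_t n = 0;
	size_t index = base / SKETCH_PAGE_SIZE;
	size_t offset = base % SKETCH_PAGE_SIZE;

	assert(out != NULL || max == 0);
	while (len > 0 && n < max) {
		/* The first chunk may be short; later ones start at 0. */
		size_t chunk = SKETCH_PAGE_SIZE - offset;

		if (chunk > len)
			chunk = len;
		out[n].page_index = index;
		out[n].offset = offset;
		out[n].len = chunk;
		n++;
		len -= chunk;
		index++;
		offset = 0;
	}
	return n;
}
```

For example, a range starting at byte 4000 with length 5000 splits into 96 bytes at the end of page 0, all 4096 bytes of page 1, and 808 bytes at the start of page 2. In the patch, each such chunk corresponds to one ->sendpage call, with MSG_MORE set on every call except the one that sends the last remaining bytes.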
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-27 10:39 ` Hirokazu Takahashi
@ 2002-10-28 16:31 ` Trond Myklebust
2002-10-28 23:39 ` Hirokazu Takahashi
2002-10-29 6:36 ` Hirokazu Takahashi
0 siblings, 2 replies; 87+ messages in thread
From: Trond Myklebust @ 2002-10-28 16:31 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > --- linux/net/sunrpc/xdr.c.ORG Sat Oct 26 21:21:16 2030
     > +++ linux/net/sunrpc/xdr.c Sun Oct 27 19:07:05 2030
     > @@ -110,12 +110,15 @@ xdr_encode_pages(struct xdr_buf *xdr, st
     >  	xdr->page_len = len;
     >
     >  	if (len & 3) {
     > -		struct iovec *iov = xdr->tail;
     >  		unsigned int pad = 4 - (len & 3);
     > -
     > -		iov->iov_base = (void *) "\0\0\0";
     > -		iov->iov_len = pad;
     >  		len += pad;
     > +		if (((base + len) & ~PAGE_CACHE_MASK) + pad <= PAGE_CACHE_SIZE) {
     > +			xdr->page_len += pad;

No!!! I believe I told you quite explicitly earlier:

  - RFC1832 states that *all* variable length data must be padded with
    zeros, and that is certainly not the case if the pages you are
    pointing to are in the page cache.

  - Worse: That data is not even guaranteed to have been initialized.
    In effect this means that your 'optimization' is leaking random
    data from the kernel and onto the internet. In security-conscious
    circles this is not considered a good thing...

Please leave that padding so that it *always* returns zeros...

Cheers,
  Trond
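The padding rule referred to above is easy to state in code. The following user-space sketch encodes a variable-length opaque the way RFC 1832 describes it: a 4-byte big-endian length, the data, then zero bytes up to the next multiple of four. The names `xdr_pad_len()` and `xdr_encode_opaque_sketch()` are hypothetical and are not the kernel's XDR helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of zero bytes needed to round 'len' up to a multiple of 4. */
static size_t xdr_pad_len(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/*
 * Encode variable-length opaque data: length word, payload, zero pad.
 * Returns the number of bytes written, or 0 if 'out' is too small.
 * The pad bytes are written explicitly, so nothing that happened to
 * be in 'out' beforehand can end up on the wire.
 */
static size_t xdr_encode_opaque_sketch(uint8_t *out, size_t out_size,
				       const void *data, uint32_t len)
{
	size_t pad = xdr_pad_len(len);
	size_t total = 4 + (size_t)len + pad;

	assert(out != NULL && (data != NULL || len == 0));
	if (total > out_size)
		return 0;
	out[0] = (uint8_t)(len >> 24);
	out[1] = (uint8_t)(len >> 16);
	out[2] = (uint8_t)(len >> 8);
	out[3] = (uint8_t)len;
	if (len > 0)
		memcpy(out + 4, data, len);
	memset(out + 4 + len, 0, pad);	/* padding must be zeros */
	return total;
}
```

The quoted patch computes the same pad length with `4 - (len & 3)` inside an `if (len & 3)` guard. The objection is about where the pad bytes come from: extending `page_len` sends whatever follows the data in the page, while an explicit zero source, as in the `memset()` here, does not.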
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-28 16:31 ` Trond Myklebust
@ 2002-10-28 23:39 ` Hirokazu Takahashi
2002-10-29 6:36 ` Hirokazu Takahashi
1 sibling, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-28 23:39 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

> - RFC1832 states that *all* variable length data must be padded with
> zeros, and that is certainly not the case if the pages you are
> pointing to are in the page cache.

Yes, you're right.
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-28 16:31 ` Trond Myklebust
2002-10-28 23:39 ` Hirokazu Takahashi
@ 2002-10-29 6:36 ` Hirokazu Takahashi
2002-10-29 15:09 ` Trond Myklebust
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-29 6:36 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

> - RFC1832 states that *all* variable length data must be padded with
> zeros, and that is certainly not the case if the pages you are
> pointing to are in the page cache.

I've changed my approach.

Shall we use ZERO_PAGE to pad RPC requests, for the sake of
performance?
Using non-page data is a little inefficient, as the skbuff
implementation doesn't allow non-page data to be appended to a skbuff
which already has pages. Only pages can be appended to it.
If we didn't use a page, the TCP/IP stack would allocate a new page to
store the small zero padding.

With UDP the last page is never coalesced with the data of the next
RPC request, while with TCP it might be.

What do you think of this approach?

Thank you,
Hirokazu Takahashi.

--- linux/include/linux/sunrpc/xdr.h.ORG	Sun Oct 27 17:56:07 2030
+++ linux/include/linux/sunrpc/xdr.h	Tue Oct 29 14:30:48 2030
@@ -48,12 +48,15 @@ typedef int (*kxdrproc_t)(void *rqstp, u
  * operations and/or has a need for scatter/gather involving pages.
  */
 struct xdr_buf {
-	struct iovec	head[1],	/* RPC header + non-page data */
-			tail[1];	/* Appended after page data */
+	struct iovec	head[1];	/* RPC header + non-page data */
+	struct page *	head_page;	/* Page for head if needed */
 
 	struct page **	pages;		/* Array of contiguous pages */
 	unsigned int	page_base,	/* Start of page data */
 			page_len;	/* Length of page data */
+
+	struct iovec	tail[1];	/* Appended after page data */
+	struct page *	tail_page;	/* Page for tail if needed */
 
 	unsigned int	len;		/* Total length of data */
 
--- linux/net/sunrpc/xdr.c.ORG	Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c	Tue Oct 29 14:20:52 2030
@@ -113,8 +113,9 @@ xdr_encode_pages(struct xdr_buf *xdr, st
 		struct iovec *iov = xdr->tail;
 		unsigned int pad = 4 - (len & 3);
 
-		iov->iov_base = (void *) "\0\0\0";
+		iov->iov_base = (void *)0;
 		iov->iov_len  = pad;
+		xdr->tail_page = sunrpc_get_zeropage();
 		len += pad;
 	}
 	xdr->len += len;
--- linux/net/sunrpc/xprt.c.ORG	Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c	Tue Oct 29 14:22:14 2030
@@ -60,6 +60,7 @@
 #include <linux/unistd.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 #include <net/sock.h>
 #include <net/checksum.h>
@@ -207,48 +208,101 @@ xprt_release_write(struct rpc_xprt *xprt
 	spin_unlock_bh(&xprt->sock_lock);
 }
 
+static inline int
+__xprt_sendmsg(struct socket *sock, struct xdr_buf *xdr, struct msghdr *msg, size_t skip)
+{
+	unsigned int slen = xdr->len - skip;
+	mm_segment_t oldfs;
+	int result = 0;
+	struct page **ppage = xdr->pages;
+	unsigned int len, pglen = xdr->page_len;
+	size_t base = 0;
+	int flags;
+	int ret;
+	struct iovec niv;
+
+	msg->msg_iov = &niv;
+	msg->msg_iovlen = 1;
+
+	if (xdr->head[0].iov_len > skip) {
+		len = xdr->head[0].iov_len - skip;
+		niv.iov_base = xdr->head[0].iov_base + skip;
+		niv.iov_len = len;
+		if (slen > len)
+			msg->msg_flags |= MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, msg, len);
+		set_fs(oldfs);
+		if (result != len)
+			return result;
+		slen -= len;
+		skip = 0;
+	} else {
+		skip -= xdr->head[0].iov_len;
+	}
+	if (pglen == 0)
+		goto send_tail;
+	if (skip >= pglen) {
+		skip -= pglen;
+		goto send_tail;
+	}
+	if (skip || xdr->page_base) {
+		pglen -= skip;
+		base = xdr->page_base + skip;
+		ppage += base >> PAGE_CACHE_SHIFT;
+		base &= ~PAGE_CACHE_MASK;
+	}
+	len = PAGE_CACHE_SIZE - base;
+	if (len > pglen) len = pglen;
+	flags = MSG_MORE;
+	while (pglen > 0) {
+		if (slen == len)
+			flags = 0;
+		ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		if (ret > 0)
+			result += ret;
+		if (ret != len) {
+			if (result == 0)
+				result = ret;
+			return result;
+		}
+		slen -= len;
+		pglen -= len;
+		len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+		base = 0;
+		ppage++;
+	}
+	skip = 0;
+send_tail:
+	if (xdr->tail[0].iov_len) {
+		ret = sock->ops->sendpage(sock, xdr->tail_page, (int)xdr->tail[0].iov_base + skip, xdr->tail[0].iov_len - skip, 0);
+		if (ret > 0)
+			result += ret;
+		if (result == 0)
+			result = ret;
+	}
+	return result;
+}
+
 /*
  * Write data to socket.
  */
 static inline int
 xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
 {
-	struct socket	*sock = xprt->sock;
 	struct msghdr	msg;
-	struct xdr_buf	*xdr = &req->rq_snd_buf;
-	struct iovec	niv[MAX_IOVEC];
-	unsigned int	niov, slen, skip;
-	mm_segment_t	oldfs;
 	int		result;
 
-	if (!sock)
-		return -ENOTCONN;
-
-	xprt_pktdump("packet data:",
-				req->rq_svec->iov_base,
-				req->rq_svec->iov_len);
-
-	/* Dont repeat bytes */
-	skip = req->rq_bytes_sent;
-	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
-	msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov	= niv;
-	msg.msg_iovlen	= niov;
 	msg.msg_name	= (struct sockaddr *) &xprt->addr;
 	msg.msg_namelen = sizeof(xprt->addr);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
-	set_fs(oldfs);
-
-	xdr_kunmap(xdr, skip);
+	result = __xprt_sendmsg(xprt->sock, &req->rq_snd_buf, &msg, req->rq_bytes_sent);
 
-	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);
+	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);
 
 	if (result >= 0)
 		return result;
 
--- linux/net/sunrpc/sunrpc_syms.c.ORG	Tue Oct 29 14:18:45 2030
+++ linux/net/sunrpc/sunrpc_syms.c	Tue Oct 29 14:15:27 2030
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(auth_unix_lookup);
 EXPORT_SYMBOL(cache_check);
 EXPORT_SYMBOL(cache_clean);
 EXPORT_SYMBOL(cache_flush);
+EXPORT_SYMBOL(cache_purge);
 EXPORT_SYMBOL(cache_fresh);
 EXPORT_SYMBOL(cache_init);
 EXPORT_SYMBOL(cache_register);
@@ -130,6 +131,36 @@ EXPORT_SYMBOL(nfsd_debug);
 EXPORT_SYMBOL(nlm_debug);
 #endif
 
+/* RPC general use */
+EXPORT_SYMBOL(sunrpc_get_zeropage);
+
+static struct page *sunrpc_zero_page;
+
+struct page *
+sunrpc_get_zeropage(void)
+{
+	return sunrpc_zero_page;
+}
+
+static int __init
+sunrpc_init_zeropage(void)
+{
+	sunrpc_zero_page = alloc_page(GFP_ATOMIC);
+	if (sunrpc_zero_page == NULL) {
+		printk(KERN_ERR "RPC: couldn't allocate zero_page.\n");
+		return 1;
+	}
+	clear_page(page_address(sunrpc_zero_page));
+	return 0;
+}
+
+static void __exit
+sunrpc_cleanup_zeropage(void)
+{
+	put_page(sunrpc_zero_page);
+	sunrpc_zero_page = NULL;
+}
+
 static int __init
 init_sunrpc(void)
 {
@@ -141,12 +172,14 @@ init_sunrpc(void)
 #endif
 	cache_register(&auth_domain_cache);
 	cache_register(&ip_map_cache);
+	sunrpc_init_zeropage();
 	return 0;
 }
 
 static void __exit
 cleanup_sunrpc(void)
 {
+	sunrpc_cleanup_zeropage();
 	cache_unregister(&auth_domain_cache);
 	cache_unregister(&ip_map_cache);
 #ifdef RPC_DEBUG
--- linux/include/linux/sunrpc/types.h.ORG	Tue Oct 29 11:31:13 2030
+++ linux/include/linux/sunrpc/types.h	Tue Oct 29 11:37:49 2030
@@ -13,10 +13,14 @@
 #include <linux/workqueue.h>
 #include <linux/sunrpc/debug.h>
 #include <linux/list.h>
+#include <linux/mm.h>
 
 /*
  * Shorthands
  */
 #define signalled()		(signal_pending(current))
+
+extern struct page * sunrpc_get_zeropage(void);
+
 
 #endif /* _LINUX_SUNRPC_TYPES_H_ */
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-29 6:36 ` Hirokazu Takahashi
@ 2002-10-29 15:09 ` Trond Myklebust
2002-10-29 16:27 ` Hirokazu Takahashi
2002-10-30 3:18 ` Hirokazu Takahashi
0 siblings, 2 replies; 87+ messages in thread
From: Trond Myklebust @ 2002-10-29 15:09 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > Shall we use ZERO_PAGE to pad RPC requests for the purpose of
     > its performance? Using non-page data is a little inefficient
     > as the implementation of skbuff doesn't allow to append
     > non-page data to a skbuff which already have pages. Only pages
     > can be appneded to it. If we didn't, TCP/IP stack would
     > allocate a new page to store the small zero-padded data.

Hmmm... What if we just drop actually storing a pointer to the
ZERO_PAGE? Instead, define the convention that

  if (xdr_buf->tail[0].iov_base == NULL)
	padding = xdr_buf->tail[0].iov_len;

and just have xprt_sendmsg() magically append 'padding' bytes from
your ZERO_PAGE.

Unless, of course, you've got another use for the head_page/tail_page?

Cheers,
  Trond
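The convention proposed above (a NULL tail base means "send iov_len zero bytes") can be mocked up in user space as follows. Everything here is an assumption made for illustration: `struct sketch_iovec` mirrors only the two iovec fields discussed, `emit_tail()` is a hypothetical stand-in for the tail handling in the send path, and a small static zero buffer plays the role of the zero page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the two iovec fields the convention uses. */
struct sketch_iovec {
	const void *iov_base;
	size_t      iov_len;
};

/* Static storage is zero-initialised; this stands in for the zero page. */
static const uint8_t zero_pad[4];

/*
 * Append the tail to 'out'. If iov_base is NULL the tail is pure XDR
 * padding, so iov_len bytes are taken from the shared zero source
 * instead of from caller memory. Returns bytes written, or 0 if
 * 'out' is too small.
 */
static size_t emit_tail(const struct sketch_iovec *tail,
			uint8_t *out, size_t out_size)
{
	const void *src = tail->iov_base;

	if (tail->iov_len > out_size)
		return 0;
	if (src == NULL) {
		/* XDR padding is at most 3 bytes. */
		assert(tail->iov_len <= sizeof(zero_pad));
		src = zero_pad;
	}
	memcpy(out, src, tail->iov_len);
	return tail->iov_len;
}
```

With this convention the buffer structure needs no extra page pointer: the sender decides where the zeros come from, and an ordinary non-NULL tail is sent unchanged.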
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-29 15:09 ` Trond Myklebust
@ 2002-10-29 16:27 ` Hirokazu Takahashi
2002-10-29 16:49 ` Trond Myklebust
2002-10-30 3:18 ` Hirokazu Takahashi
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-29 16:27 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

Thank you for your reply.

> > Shall we use ZERO_PAGE to pad RPC requests for the purpose of
> > its performance? Using non-page data is a little inefficient
> > as the implementation of skbuff doesn't allow to append
> > non-page data to a skbuff which already have pages. Only pages
> > can be appended to it. If we didn't, TCP/IP stack would
> > allocate a new page to store the small zero-padded data.
>
> Hmmm... What if we just drop actually storing a pointer to the
> ZERO_PAGE? Instead, define the convention that
>
>     if (xdr_buf->tail[0].iov_base == NULL)
>         padding = xdr_buf->tail[0].iov_len;
>
> and just have xprt_sendmsg() magically append 'padding' bytes from
> your ZERO_PAGE.

Yes, it's possible. OK, I'll modify it.

> Unless, of course, you've got another use for the head_page/tail_page?

I just wanted to make it general. I guessed head_page (or head_pages)
might be useful for big NFSv4 COMPOUND messages, as we could send
a head without any copies. But it's just my guess.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-29 16:27 ` Hirokazu Takahashi
@ 2002-10-29 16:49 ` Trond Myklebust
0 siblings, 0 replies; 87+ messages in thread
From: Trond Myklebust @ 2002-10-29 16:49 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

>> Unless, of course, you've got another use for the
>> head_page/tail_page?

> I just wanted to make it general. I guessed head_page (or
> head_pages) might be useful for big NFSv4 COMPOUND messages as
> we could send a head without any copies. But it's just my
> guess.

It's good to know that this is possible, but let's not overdesign: we
don't want to implement this unless we know that we have a need.

Cheers,
Trond

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-29 15:09 ` Trond Myklebust
2002-10-29 16:27 ` Hirokazu Takahashi
@ 2002-10-30 3:18 ` Hirokazu Takahashi
1 sibling, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-30 3:18 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

I've simplified the patch as you suggested.

> Hmmm... What if we just drop actually storing a pointer to the
> ZERO_PAGE? Instead, define the convention that
>
>     if (xdr_buf->tail[0].iov_base == NULL)
>         padding = xdr_buf->tail[0].iov_len;
>
> and just have xprt_sendmsg() magically append 'padding' bytes from
> your ZERO_PAGE.

--- linux/net/sunrpc/xdr.c.ORG	Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c	Wed Oct 30 11:11:03 2030
@@ -113,7 +113,8 @@ xdr_encode_pages(struct xdr_buf *xdr, st
 		struct iovec *iov = xdr->tail;
 		unsigned int pad = 4 - (len & 3);
 
-		iov->iov_base = (void *) "\0\0\0";
+		/* NULL means a request to pad it with zero. */
+		iov->iov_base = NULL;
 		iov->iov_len  = pad;
 		len += pad;
 	}
--- linux/net/sunrpc/xprt.c.ORG	Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c	Wed Oct 30 12:16:05 2030
@@ -60,6 +60,7 @@
 #include <linux/unistd.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/file.h>
+#include <linux/pagemap.h>
 
 #include <net/sock.h>
 #include <net/checksum.h>
@@ -207,48 +208,113 @@ xprt_release_write(struct rpc_xprt *xprt
 	spin_unlock_bh(&xprt->sock_lock);
 }
 
-/*
- * Write data to socket.
- */
 static inline int
-xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
+__xprt_sendmsg(struct rpc_xprt *xprt, struct xdr_buf *xdr, size_t skip)
 {
 	struct socket	*sock = xprt->sock;
+	unsigned int	slen = xdr->len - skip;
+	struct page	**ppage = xdr->pages;
+	unsigned int	len, pglen = xdr->page_len;
+	size_t		base = 0;
 	struct msghdr	msg;
-	struct xdr_buf	*xdr = &req->rq_snd_buf;
-	struct iovec	niv[MAX_IOVEC];
-	unsigned int	niov, slen, skip;
+	struct iovec	niv;
+	int		flags;
 	mm_segment_t	oldfs;
-	int		result;
-
-	if (!sock)
-		return -ENOTCONN;
-
-	xprt_pktdump("packet data:",
-				req->rq_svec->iov_base,
-				req->rq_svec->iov_len);
-
-	/* Dont repeat bytes */
-	skip = req->rq_bytes_sent;
-	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
+	int		result = 0;
+	int		ret;
 
 	msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov	= niv;
-	msg.msg_iovlen	= niov;
 	msg.msg_name	= (struct sockaddr *) &xprt->addr;
 	msg.msg_namelen = sizeof(xprt->addr);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
+	msg.msg_iov	= &niv;
+	msg.msg_iovlen	= 1;
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
-	set_fs(oldfs);
+	if (xdr->head[0].iov_len > skip) {
+		len = xdr->head[0].iov_len - skip;
+		niv.iov_base = xdr->head[0].iov_base + skip;
+		niv.iov_len  = len;
+		if (slen > len)
+			msg.msg_flags |= MSG_MORE;
+		oldfs = get_fs(); set_fs(get_ds());
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, &msg, len);
+		set_fs(oldfs);
+		if (result != len)
+			return result;
+		slen -= len;
+		skip = 0;
+	} else {
+		skip -= xdr->head[0].iov_len;
+	}
+	if (pglen == 0)
+		goto send_tail;
+	if (skip >= pglen) {
+		skip -= pglen;
+		goto send_tail;
+	}
+	if (skip || xdr->page_base) {
+		pglen -= skip;
+		base = xdr->page_base + skip;
+		ppage += base >> PAGE_CACHE_SHIFT;
+		base &= ~PAGE_CACHE_MASK;
+	}
+	len = PAGE_CACHE_SIZE - base;
+	if (len > pglen) len = pglen;
+	flags = MSG_MORE;
+	while (pglen > 0) {
+		if (slen == len)
+			flags = 0;
+		ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		if (ret > 0)
+			result += ret;
+		if (ret != len) {
+			if (result == 0)
+				result = ret;
+			return result;
+		}
+		slen -= len;
+		pglen -= len;
+		len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+		base = 0;
+		ppage++;
+	}
+	skip = 0;
+send_tail:
+	if (xdr->tail[0].iov_len) {
+		if (xdr->tail[0].iov_base == NULL) {
+			/* tail[0].iov_base == NULL requires zero padding */
+			ret = sock->ops->sendpage(sock, sunrpc_get_zeropage(),
+					0, xdr->tail[0].iov_len - skip, 0);
+		} else {
+			niv.iov_base = xdr->tail[0].iov_base + skip;
+			niv.iov_len  = xdr->tail[0].iov_len - skip;
+			msg.msg_flags &= ~MSG_MORE;
+			oldfs = get_fs(); set_fs(get_ds());
+			clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+			ret = sock_sendmsg(sock, &msg, niv.iov_len);
+			set_fs(oldfs);
+		}
+		if (ret > 0)
+			result += ret;
+		if (result == 0)
+			result = ret;
+	}
+	return result;
+}
+
+/*
+ * Write data to socket.
+ */
+static inline int
+xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
+{
+	int result;
 
-	xdr_kunmap(xdr, skip);
+	result = __xprt_sendmsg(xprt, &req->rq_snd_buf, req->rq_bytes_sent);
 
-	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);
+	dprintk("RPC:      xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);
 
 	if (result >= 0)
 		return result;
--- linux/net/sunrpc/sunrpc_syms.c.ORG	Tue Oct 29 14:18:45 2030
+++ linux/net/sunrpc/sunrpc_syms.c	Tue Oct 29 14:15:27 2030
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(auth_unix_lookup);
 EXPORT_SYMBOL(cache_check);
 EXPORT_SYMBOL(cache_clean);
 EXPORT_SYMBOL(cache_flush);
+EXPORT_SYMBOL(cache_purge);
 EXPORT_SYMBOL(cache_fresh);
 EXPORT_SYMBOL(cache_init);
 EXPORT_SYMBOL(cache_register);
@@ -130,6 +131,36 @@ EXPORT_SYMBOL(nfsd_debug);
 EXPORT_SYMBOL(nlm_debug);
 #endif
 
+/* RPC general use */
+EXPORT_SYMBOL(sunrpc_get_zeropage);
+
+static struct page *sunrpc_zero_page;
+
+struct page *
+sunrpc_get_zeropage(void)
+{
+	return sunrpc_zero_page;
+}
+
+static int __init
+sunrpc_init_zeropage(void)
+{
+	sunrpc_zero_page = alloc_page(GFP_ATOMIC);
+	if (sunrpc_zero_page == NULL) {
+		printk(KERN_ERR "RPC: couldn't allocate zero_page.\n");
+		return 1;
+	}
+	clear_page(page_address(sunrpc_zero_page));
+	return 0;
+}
+
+static void __exit
+sunrpc_cleanup_zeropage(void)
+{
+	put_page(sunrpc_zero_page);
+	sunrpc_zero_page = NULL;
+}
+
 static int __init
 init_sunrpc(void)
 {
@@ -141,12 +172,14 @@ init_sunrpc(void)
 #endif
 	cache_register(&auth_domain_cache);
 	cache_register(&ip_map_cache);
+	sunrpc_init_zeropage();
 	return 0;
 }
 
 static void __exit
 cleanup_sunrpc(void)
 {
+	sunrpc_cleanup_zeropage();
 	cache_unregister(&auth_domain_cache);
 	cache_unregister(&ip_map_cache);
 #ifdef RPC_DEBUG
--- linux/include/linux/sunrpc/types.h.ORG	Tue Oct 29 11:31:13 2030
+++ linux/include/linux/sunrpc/types.h	Tue Oct 29 11:37:49 2030
@@ -13,10 +13,14 @@
 #include <linux/workqueue.h>
 #include <linux/sunrpc/debug.h>
 #include <linux/list.h>
+#include <linux/mm.h>
 
 /*
  * Shorthands
  */
 #define signalled()		(signal_pending(current))
+
+extern struct page * sunrpc_get_zeropage(void);
+
 
 #endif /* _LINUX_SUNRPC_TYPES_H_ */

^ permalink raw reply [flat|nested] 87+ messages in thread
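The page loop in __xprt_sendmsg() above turns (page_base, page_len, skip) into one sendpage() call per page. The arithmetic can be checked in isolation with a userspace sketch. Everything here is an assumption made for illustration: `DEMO_PAGE_SIZE`, `struct demo_chunk` and `demo_split_pages()` are hypothetical names, a fixed 4096-byte page is assumed, and recording a chunk stands in for the actual sendpage() call.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* One would-be sendpage() call: which page, offset within it, byte count. */
struct demo_chunk {
	size_t page;
	size_t offset;
	size_t len;
};

/*
 * Split the page section of a buffer into per-page chunks.
 *   page_base: offset of the first data byte within the first page
 *   page_len:  total bytes of page data
 *   skip:      bytes of page data already sent
 * Returns the number of chunks stored in out (at most max).
 */
static size_t demo_split_pages(size_t page_base, size_t page_len, size_t skip,
			       struct demo_chunk *out, size_t max)
{
	size_t n = 0;
	size_t left, base, page;

	assert(out != NULL || max == 0);
	if (skip >= page_len)
		return 0;		/* page data already fully sent */

	left = page_len - skip;
	base = page_base + skip;
	page = base / DEMO_PAGE_SIZE;	/* first page still holding unsent data */
	base %= DEMO_PAGE_SIZE;		/* offset inside that page */

	while (left > 0 && n < max) {
		size_t len = DEMO_PAGE_SIZE - base;

		if (len > left)
			len = left;
		out[n].page = page;
		out[n].offset = base;
		out[n].len = len;
		n++;
		left -= len;
		base = 0;		/* later pages start at offset 0 */
		page++;
	}
	return n;
}
```

With page_base = 100 and page_len = 8192, the sketch yields three chunks of 3996, 4096 and 100 bytes. Resuming with skip = 4000 starts at offset 4 of the second page. In the patch, every chunk except the last is sent with MSG_MORE when more data follows.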
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-23 1:18 ` Neil Brown
2002-10-23 3:53 ` Hirokazu Takahashi
2002-10-23 21:50 ` Hirokazu Takahashi
@ 2002-10-25 9:52 ` Hirokazu Takahashi
2002-10-25 12:41 ` Neil Brown
2002-10-25 17:23 ` Trond Myklebust
2 siblings, 2 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-25 9:52 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> I have been thinking some more about this, trying to understand the
> big picture, and I'm afraid that I think I want some more changes.
>
> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> we could share some of the infrastructure.

I just realized it would be hard to use the xdr_buf as it couldn't
handle data in a socket buffer. Each socket buffer consists of
some non-page data and some pages, and each of them might have its
own offset and length.

> I'm not certain about receiving write requests.
> I imagine that it might work to:
> 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
>    skb into the head iovec, and hold onto the skbuf (like we
>    currently do).

And I came up with another idea: kNFSd could handle TCP data
in a socket buffer directly without copy if we can enhance
tcp_read_sock() not to release it while kNFSd is using it.
kNFSd would handle TCP data as if it were a UDP datagram.
The differences are that kNFSd may grab some TCP socket buffers at once
and that the buffers may be shared with other kNFSd's.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-25 9:52 ` Hirokazu Takahashi
@ 2002-10-25 12:41 ` Neil Brown
2002-10-26 3:11 ` Hirokazu Takahashi
2002-10-30 23:29 ` Hirokazu Takahashi
2002-10-25 17:23 ` Trond Myklebust
1 sibling, 2 replies; 87+ messages in thread
From: Neil Brown @ 2002-10-25 12:41 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs

On Friday October 25, taka@valinux.co.jp wrote:
> Hello,
>
> > I have been thinking some more about this, trying to understand the
> > big picture, and I'm afraid that I think I want some more changes.
> >
> > In particular, I think it would be good to use 'struct xdr_buf' from
> > sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> > we could share some of the infrastructure.
>
> I just realized it would be hard to use the xdr_buf as it couldn't
> handle data in a socket buffer. Each socket buffer consists of
> some non-page data and some pages and each of them might have its
> own offset and length.

You would only want this for single-copy write requests - right?
I think we have to treat them as a special case and pass the skbuf all
the way up to nfsd in that case.
You would only want to try this if:
   The NIC had verified the checksum
   The packet was some minimum size (1K? 1 PAGE ??)
   We were using AUTH_UNIX, nothing more interesting like crypto
     security
   The first fragment was some minimum size (size of a write without
     the data).

I would make a special 'fast-path' for that case which didn't copy
any data but passed a skbuf up, and code in nfs*xdr.c would convert
that into an iovec[].

I am working on a patch which changes rpcsvc to use xdr_buf. Some of
it works. Some doesn't. I include it below for your reference.
I repeat: it doesn't work yet.
Once it is done, adding the rest of zero-copy should be fairly easy.

> > I'm not certain about receiving write requests.
> > I imagine that it might work to:
> > 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> >    skb into the head iovec, and hold onto the skbuf (like we
> >    currently do).
>
> And I came up with another idea that kNFSd could handle TCP data
> in a socket buffer directly without copy if we can enhance the
> tcp_read_sock() not to release it while kNFSd is using it.
> kNFSd would handle TCP data as if it were a UDP datagram.
> The differences are kNFSd may grab some TCP socket buffers at once
> and the buffers may be shared to other kNFSd's.

That might work... though TCP doesn't have the same concept of a
'packet' that UDP does. You might end up with a socket buffer that had
all of one request and part of the next... still I'm sure it is
possible.

NeilBrown

-----incomplete, buggy, don't-use-it patch starts here----
--- ./fs/nfsd/nfssvc.c 2002/10/21 03:23:44 1.2 +++ ./fs/nfsd/nfssvc.c 2002/10/25 05:08:01 @@ -277,7 +277,8 @@ nfsd_dispatch(struct svc_rqst *rqstp, u3 /* Decode arguments */ xdr = proc->pc_decode; - if (xdr && !xdr(rqstp, rqstp->rq_argbuf.buf, rqstp->rq_argp)) { + if (xdr && !xdr(rqstp, (u32*)rqstp->rq_arg.head[0].iov_base, + rqstp->rq_argp)) { dprintk("nfsd: failed to decode arguments!\n"); nfsd_cache_update(rqstp, RC_NOCACHE, NULL); *statp = rpc_garbage_args; @@ -293,14 +294,15 @@ nfsd_dispatch(struct svc_rqst *rqstp, u3 } if (rqstp->rq_proc != 0) - svc_putu32(&rqstp->rq_resbuf, nfserr); + svc_putu32(&rqstp->rq_res.head[0], nfserr); /* Encode result. * For NFSv2, additional info is never returned in case of an error. */ if (!(nfserr && rqstp->rq_vers == 2)) { xdr = proc->pc_encode; - if (xdr && !xdr(rqstp, rqstp->rq_resbuf.buf, rqstp->rq_resp)) { + if (xdr && !xdr(rqstp, (u32*)rqstp->rq_res.head[0].iov_base, + rqstp->rq_resp)) { /* Failed to encode result.
Release cache entry */ dprintk("nfsd: failed to encode result!\n"); nfsd_cache_update(rqstp, RC_NOCACHE, NULL); --- ./fs/nfsd/vfs.c 2002/10/24 01:35:37 1.1 +++ ./fs/nfsd/vfs.c 2002/10/24 04:13:31 @@ -571,13 +571,35 @@ found: } /* + * reduce iovec: + * Reduce the effective size of the passed iovec to + * match the count + */ +static void reduce_iovec(struct iovec *vec, int *vlenp, int count) +{ + int vlen = *vlenp; + int i; + + i = 0; + while (i < vlen && count > vec->iov_len) { + count -= vec->iov_len; + i++; + } + if (i >= vlen) + return; /* ERROR??? */ + vec->iov_len -= count; + if (count) i++; + *vlenp = i; +} + +/* * Read data from a file. count must contain the requested read count * on entry. On return, *count contains the number of bytes actually read. * N.B. After this call fhp needs an fh_put */ int nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, - char *buf, unsigned long *count) + struct iovec *vec, int vlen, unsigned long *count) { struct raparms *ra; mm_segment_t oldfs; @@ -601,9 +623,10 @@ nfsd_read(struct svc_rqst *rqstp, struct if (ra) file.f_ra = ra->p_ra; + reduce_iovec(vec, &vlen, *count); oldfs = get_fs(); set_fs(KERNEL_DS); - err = vfs_read(&file, buf, *count, &offset); + err = vfs_readv(&file, vec, vlen, *count, &offset); set_fs(oldfs); /* Write back readahead params */ @@ -629,7 +652,8 @@ out: */ int nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, - char *buf, unsigned long cnt, int *stablep) + struct iovec *vec, int vlen, + unsigned long cnt, int *stablep) { struct svc_export *exp; struct file file; @@ -675,9 +699,10 @@ nfsd_write(struct svc_rqst *rqstp, struc if (stable && !EX_WGATHER(exp)) file.f_flags |= O_SYNC; + reduce_iovec(vec, &vlen, cnt); /* Write the data. 
*/ oldfs = get_fs(); set_fs(KERNEL_DS); - err = vfs_write(&file, buf, cnt, &offset); + err = vfs_writev(&file, vec, vlen, cnt, &offset); if (err >= 0) nfsdstats.io_write += cnt; set_fs(oldfs); --- ./fs/nfsd/nfsctl.c 2002/10/21 06:35:17 1.2 +++ ./fs/nfsd/nfsctl.c 2002/10/24 11:22:53 @@ -130,13 +130,12 @@ static int exports_open(struct inode *in char *namebuf = kmalloc(PAGE_SIZE, GFP_KERNEL); if (namebuf == NULL) return -ENOMEM; - else - ((struct seq_file *)file->private_data)->private = namebuf; res = seq_open(file, &nfs_exports_op); - if (!res) + if (res) kfree(namebuf); - + else + ((struct seq_file *)file->private_data)->private = namebuf; return res; } static int exports_release(struct inode *inode, struct file *file) --- ./fs/nfsd/nfsxdr.c 2002/10/24 01:06:36 1.1 +++ ./fs/nfsd/nfsxdr.c 2002/10/25 05:31:51 @@ -14,6 +14,7 @@ #include <linux/sunrpc/svc.h> #include <linux/nfsd/nfsd.h> #include <linux/nfsd/xdr.h> +#include <linux/mm.h> #define NFSDDBG_FACILITY NFSDDBG_XDR @@ -176,27 +177,6 @@ encode_fattr(struct svc_rqst *rqstp, u32 return p; } -/* - * Check buffer bounds after decoding arguments - */ -static inline int -xdr_argsize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_argbuf; - - return p - buf->base <= buf->buflen; -} - -static inline int -xdr_ressize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_resbuf; - - buf->len = p - buf->base; - dprintk("nfsd: ressize_check p %p base %p len %d\n", - p, buf->base, buf->buflen); - return (buf->len <= buf->buflen); -} /* * XDR decode functions @@ -241,13 +221,29 @@ int nfssvc_decode_readargs(struct svc_rqst *rqstp, u32 *p, struct nfsd_readargs *args) { + int len; + int v,pn; if (!(p = decode_fh(p, &args->fh))) return 0; args->offset = ntohl(*p++); - args->count = ntohl(*p++); - args->totalsize = ntohl(*p++); + len = args->count = ntohl(*p++); + p++; /* totalcount - unused */ + /* FIXME range check ->count */ + /* set up somewhere to store response. 
+ * We take pages, put them on reslist and include in iovec + */ + v=0; + while (len > 0) { + pn=rqstp->rq_resused; + take_page(rqstp); + args->vec[v].iov_base = page_address(rqstp->rq_respages[pn]); + args->vec[v].iov_len = PAGE_SIZE; + v++; + len -= PAGE_SIZE; + } + args->vlen = v; return xdr_argsize_check(rqstp, p); } @@ -255,17 +251,27 @@ int nfssvc_decode_writeargs(struct svc_rqst *rqstp, u32 *p, struct nfsd_writeargs *args) { + int len; + int v; if (!(p = decode_fh(p, &args->fh))) return 0; p++; /* beginoffset */ args->offset = ntohl(*p++); /* offset */ p++; /* totalcount */ - args->len = ntohl(*p++); - args->data = (char *) p; - p += XDR_QUADLEN(args->len); - - return xdr_argsize_check(rqstp, p); + len = args->len = ntohl(*p++); + args->vec[0].iov_base = (void*)p; + args->vec[0].iov_len = rqstp->rq_arg.head[0].iov_len - + (((void*)p) - rqstp->rq_arg.head[0].iov_base); + v = 0; + while (len > args->vec[v].iov_len) { + len -= args->vec[v].iov_len; + v++; + args->vec[v].iov_base = page_address(rqstp->rq_argpages[v]); + args->vec[v].iov_len = PAGE_SIZE; + } + args->vlen = v+1; + return 1; /* FIXME */ } int @@ -371,9 +377,22 @@ nfssvc_encode_readres(struct svc_rqst *r { p = encode_fattr(rqstp, p, &resp->fh); *p++ = htonl(resp->count); - p += XDR_QUADLEN(resp->count); + xdr_ressize_check(rqstp, p); - return xdr_ressize_check(rqstp, p); + /* now update rqstp->rq_res to reflect data aswell */ + rqstp->rq_res.page_base = 0; + rqstp->rq_res.page_len = resp->count; + if (resp->count & 3) { + /* need to pad with tail */ + rqstp->rq_res.tail[0].iov_base = p; + *p = 0; + rqstp->rq_res.tail[0].iov_len = 4 - (resp->count&3); + } + rqstp->rq_res.len = + rqstp->rq_res.head[0].iov_len+ + rqstp->rq_res.page_len+ + rqstp->rq_res.tail[0].iov_len; + return 1; } int --- ./fs/nfsd/nfs3xdr.c 2002/10/24 01:07:00 1.1 +++ ./fs/nfsd/nfs3xdr.c 2002/10/25 05:14:26 @@ -269,27 +269,6 @@ encode_wcc_data(struct svc_rqst *rqstp, return encode_post_op_attr(rqstp, p, fhp); } -/* - * Check buffer 
bounds after decoding arguments - */ -static inline int -xdr_argsize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_argbuf; - - return p - buf->base <= buf->buflen; -} - -static inline int -xdr_ressize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_resbuf; - - buf->len = p - buf->base; - dprintk("nfsd: ressize_check p %p base %p len %d\n", - p, buf->base, buf->buflen); - return (buf->len <= buf->buflen); -} /* * XDR decode functions --- ./fs/nfsd/nfscache.c 2002/10/24 03:37:10 1.1 +++ ./fs/nfsd/nfscache.c 2002/10/24 04:30:23 @@ -41,7 +41,7 @@ static struct svc_cacherep * lru_tail; static struct svc_cacherep * nfscache; static int cache_disabled = 1; -static int nfsd_cache_append(struct svc_rqst *rqstp, struct svc_buf *data); +static int nfsd_cache_append(struct svc_rqst *rqstp, struct iovec *vec); /* * locking for the reply cache: @@ -107,7 +107,7 @@ nfsd_cache_shutdown(void) for (rp = lru_head; rp; rp = rp->c_lru_next) { if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF) - kfree(rp->c_replbuf.buf); + kfree(rp->c_replvec.iov_base); } cache_disabled = 1; @@ -242,8 +242,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp /* release any buffer */ if (rp->c_type == RC_REPLBUFF) { - kfree(rp->c_replbuf.buf); - rp->c_replbuf.buf = NULL; + kfree(rp->c_replvec.iov_base); + rp->c_replvec.iov_base = NULL; } rp->c_type = RC_NOCACHE; out: @@ -272,11 +272,11 @@ found_entry: case RC_NOCACHE: break; case RC_REPLSTAT: - svc_putu32(&rqstp->rq_resbuf, rp->c_replstat); + svc_putu32(&rqstp->rq_res.head[0], rp->c_replstat); rtn = RC_REPLY; break; case RC_REPLBUFF: - if (!nfsd_cache_append(rqstp, &rp->c_replbuf)) + if (!nfsd_cache_append(rqstp, &rp->c_replvec)) goto out; /* should not happen */ rtn = RC_REPLY; break; @@ -308,13 +308,14 @@ void nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, u32 *statp) { struct svc_cacherep *rp; - struct svc_buf *resp = &rqstp->rq_resbuf, *cachp; + struct iovec *resv = 
&rqstp->rq_res.head[0], *cachv; int len; if (!(rp = rqstp->rq_cacherep) || cache_disabled) return; - len = resp->len - (statp - resp->base); + len = resv->iov_len - ((char*)statp - (char*)resv->iov_base); + len >>= 2; /* Don't cache excessive amounts of data and XDR failures */ if (!statp || len > (256 >> 2)) { @@ -329,16 +330,16 @@ nfsd_cache_update(struct svc_rqst *rqstp rp->c_replstat = *statp; break; case RC_REPLBUFF: - cachp = &rp->c_replbuf; - cachp->buf = (u32 *) kmalloc(len << 2, GFP_KERNEL); - if (!cachp->buf) { + cachv = &rp->c_replvec; + cachv->iov_base = kmalloc(len << 2, GFP_KERNEL); + if (!cachv->iov_base) { spin_lock(&cache_lock); rp->c_state = RC_UNUSED; spin_unlock(&cache_lock); return; } - cachp->len = len; - memcpy(cachp->buf, statp, len << 2); + cachv->iov_len = len << 2; + memcpy(cachv->iov_base, statp, len << 2); break; } spin_lock(&cache_lock); @@ -353,19 +354,20 @@ nfsd_cache_update(struct svc_rqst *rqstp /* * Copy cached reply to current reply buffer. Should always fit. + * FIXME as reply is in a page, we should just attach the page, and + * keep a refcount.... 
*/ static int -nfsd_cache_append(struct svc_rqst *rqstp, struct svc_buf *data) +nfsd_cache_append(struct svc_rqst *rqstp, struct iovec *data) { - struct svc_buf *resp = &rqstp->rq_resbuf; + struct iovec *vec = &rqstp->rq_res.head[0]; - if (resp->len + data->len > resp->buflen) { + if (vec->iov_len + data->iov_len > PAGE_SIZE) { printk(KERN_WARNING "nfsd: cached reply too large (%d).\n", - data->len); + data->iov_len); return 0; } - memcpy(resp->buf, data->buf, data->len << 2); - resp->buf += data->len; - resp->len += data->len; + memcpy((char*)vec->iov_base + vec->iov_len, data->iov_base, data->iov_len); + vec->iov_len += data->iov_len; return 1; } --- ./fs/nfsd/nfsproc.c 2002/10/24 02:23:57 1.1 +++ ./fs/nfsd/nfsproc.c 2002/10/25 05:32:04 @@ -30,11 +30,11 @@ typedef struct svc_buf svc_buf; #define NFSDDBG_FACILITY NFSDDBG_PROC -static void -svcbuf_reserve(struct svc_buf *buf, u32 **ptr, int *len, int nr) +static inline void +svcbuf_reserve(struct xdr_buf *buf, u32 **ptr, int *len, int nr) { - *ptr = buf->buf + nr; - *len = buf->buflen - buf->len - nr; + *ptr = (u32*)(buf->head[0].iov_base+buf->head[0].iov_len) + nr; + *len = ((PAGE_SIZE-buf->head[0].iov_len)>>2) - nr; } static int @@ -109,7 +109,7 @@ nfsd_proc_readlink(struct svc_rqst *rqst dprintk("nfsd: READLINK %s\n", SVCFH_fmt(&argp->fh)); /* Reserve room for status and path length */ - svcbuf_reserve(&rqstp->rq_resbuf, &path, &dummy, 2); + svcbuf_reserve(&rqstp->rq_res, &path, &dummy, 2); /* Read the symlink. */ resp->len = NFS_MAXPATHLEN; @@ -127,8 +127,7 @@ static int nfsd_proc_read(struct svc_rqst *rqstp, struct nfsd_readargs *argp, struct nfsd_readres *resp) { - u32 * buffer; - int nfserr, avail; + int nfserr; dprintk("nfsd: READ %s %d bytes at %d\n", SVCFH_fmt(&argp->fh), @@ -137,22 +136,21 @@ nfsd_proc_read(struct svc_rqst *rqstp, s /* Obtain buffer pointer for payload. 19 is 1 word for * status, 17 words for fattr, and 1 word for the byte count. 
*/ - svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &avail, 19); - if ((avail << 2) < argp->count) { + if ((32768/*FIXME*/) < argp->count) { printk(KERN_NOTICE "oversized read request from %08x:%d (%d bytes)\n", ntohl(rqstp->rq_addr.sin_addr.s_addr), ntohs(rqstp->rq_addr.sin_port), argp->count); - argp->count = avail << 2; + argp->count = 32768; } svc_reserve(rqstp, (19<<2) + argp->count + 4); resp->count = argp->count; nfserr = nfsd_read(rqstp, fh_copy(&resp->fh, &argp->fh), argp->offset, - (char *) buffer, + argp->vec, argp->vlen, &resp->count); return nfserr; @@ -175,7 +173,7 @@ nfsd_proc_write(struct svc_rqst *rqstp, nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh), argp->offset, - argp->data, + argp->vec, argp->vlen, argp->len, &stable); return nfserr; @@ -477,7 +475,7 @@ nfsd_proc_readdir(struct svc_rqst *rqstp argp->count, argp->cookie); /* Reserve buffer space for status */ - svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count, 1); + svcbuf_reserve(&rqstp->rq_res, &buffer, &count, 1); /* Shrink to the client read size */ if (count > (argp->count >> 2)) --- ./fs/nfsd/nfs3proc.c 2002/10/24 04:37:41 1.1 +++ ./fs/nfsd/nfs3proc.c 2002/10/25 05:34:44 @@ -43,11 +43,11 @@ static int nfs3_ftypes[] = { /* * Reserve room in the send buffer */ -static void -svcbuf_reserve(struct svc_buf *buf, u32 **ptr, int *len, int nr) +static inline void +svcbuf_reserve(struct xdr_buf *buf, u32 **ptr, int *len, int nr) { - *ptr = buf->buf + nr; - *len = buf->buflen - buf->len - nr; + *ptr = (u32*)(buf->head[0].iov_base+buf->head[0].iov_len) + nr; + *len = ((PAGE_SIZE-buf->head[0].iov_len)>>2) - nr; } /* @@ -150,7 +150,7 @@ nfsd3_proc_readlink(struct svc_rqst *rqs dprintk("nfsd: READLINK(3) %s\n", SVCFH_fmt(&argp->fh)); /* Reserve room for status, post_op_attr, and path length */ - svcbuf_reserve(&rqstp->rq_resbuf, &path, &dummy, + svcbuf_reserve(&rqstp->rq_res, &path, &dummy, 1 + NFS3_POST_OP_ATTR_WORDS + 1); /* Read the symlink. 
*/ @@ -179,7 +179,7 @@ nfsd3_proc_read(struct svc_rqst *rqstp, * 1 (status) + 22 (post_op_attr) + 1 (count) + 1 (eof) * + 1 (xdr opaque byte count) = 26 */ - svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &avail, + svcbuf_reserve(&rqstp->rq_res, &buffer, &avail, 1 + NFS3_POST_OP_ATTR_WORDS + 3); resp->count = argp->count; if ((avail << 2) < resp->count) @@ -447,7 +447,7 @@ nfsd3_proc_readdir(struct svc_rqst *rqst argp->count, (u32) argp->cookie); /* Reserve buffer space for status, attributes and verifier */ - svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count, + svcbuf_reserve(&rqstp->rq_res, &buffer, &count, 1 + NFS3_POST_OP_ATTR_WORDS + 2); /* Make sure we've room for the NULL ptr & eof flag, and shrink to @@ -482,7 +482,7 @@ nfsd3_proc_readdirplus(struct svc_rqst * argp->count, (u32) argp->cookie); /* Reserve buffer space for status, attributes and verifier */ - svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count, + svcbuf_reserve(&rqstp->rq_res, &buffer, &count, 1 + NFS3_POST_OP_ATTR_WORDS + 2); /* Make sure we've room for the NULL ptr & eof flag, and shrink to --- ./fs/lockd/xdr.c 2002/10/24 01:01:26 1.1 +++ ./fs/lockd/xdr.c 2002/10/25 05:14:36 @@ -216,25 +216,6 @@ nlm_encode_testres(u32 *p, struct nlm_re return p; } -/* - * Check buffer bounds after decoding arguments - */ -static inline int -xdr_argsize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_argbuf; - - return p - buf->base <= buf->buflen; -} - -static inline int -xdr_ressize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_resbuf; - - buf->len = p - buf->base; - return (buf->len <= buf->buflen); -} /* * First, the server side XDR functions --- ./fs/lockd/xdr4.c 2002/10/24 01:05:40 1.1 +++ ./fs/lockd/xdr4.c 2002/10/25 05:14:44 @@ -223,26 +223,6 @@ nlm4_encode_testres(u32 *p, struct nlm_r /* - * Check buffer bounds after decoding arguments - */ -static int -xdr_argsize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_argbuf; 
- - return p - buf->base <= buf->buflen; -} - -static int -xdr_ressize_check(struct svc_rqst *rqstp, u32 *p) -{ - struct svc_buf *buf = &rqstp->rq_resbuf; - - buf->len = p - buf->base; - return (buf->len <= buf->buflen); -} - -/* * First, the server side XDR functions */ int --- ./fs/read_write.c 2002/10/24 01:22:09 1.1 +++ ./fs/read_write.c 2002/10/24 02:54:13 @@ -207,6 +207,53 @@ ssize_t vfs_read(struct file *file, char return ret; } +ssize_t vfs_readv(struct file *file, struct iovec *vec, int vlen, size_t count, loff_t *pos) +{ + struct inode *inode = file->f_dentry->d_inode; + ssize_t ret; + + if (!(file->f_mode & FMODE_READ)) + return -EBADF; + if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read)) + return -EINVAL; + + ret = locks_verify_area(FLOCK_VERIFY_READ, inode, file, *pos, count); + if (!ret) { + ret = security_ops->file_permission (file, MAY_READ); + if (!ret) { + if (file->f_op->readv) + ret = file->f_op->readv(file, vec, vlen, pos); + else { + /* do it by hand */ + struct iovec *vector = vec; + ret = 0; + while (vlen > 0) { + void * base = vector->iov_base; + size_t len = vector->iov_len; + ssize_t nr; + vector++; + vlen--; + if (file->f_op->read) + nr = file->f_op->read(file, base, len, pos); + else + nr = do_sync_read(file, base, len, pos); + if (nr < 0) { + if (!ret) ret = nr; + break; + } + ret += nr; + if (nr != len) + break; + } + } + if (ret > 0) + dnotify_parent(file->f_dentry, DN_ACCESS); + } + } + + return ret; +} + ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos) { struct kiocb kiocb; @@ -247,6 +294,53 @@ ssize_t vfs_write(struct file *file, con return ret; } +ssize_t vfs_writev(struct file *file, const struct iovec *vec, int vlen, size_t count, loff_t *pos) +{ + struct inode *inode = file->f_dentry->d_inode; + ssize_t ret; + + if (!(file->f_mode & FMODE_WRITE)) + return -EBADF; + if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write)) + return -EINVAL; + + ret = 
locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, *pos, count); + if (!ret) { + ret = security_ops->file_permission (file, MAY_WRITE); + if (!ret) { + if (file->f_op->writev) + ret = file->f_op->writev(file, vec, vlen, pos); + else { + /* do it by hand */ + struct iovec *vector = vec; + ret = 0; + while (vlen > 0) { + void * base = vector->iov_base; + size_t len = vector->iov_len; + ssize_t nr; + vector++; + vlen--; + if (file->f_op->write) + nr = file->f_op->write(file, base, len, pos); + else + nr = do_sync_write(file, base, len, pos); + if (nr < 0) { + if (!ret) ret = nr; + break; + } + ret += nr; + if (nr != len) + break; + } + } + if (ret > 0) + dnotify_parent(file->f_dentry, DN_MODIFY); + } + } + + return ret; +} + asmlinkage ssize_t sys_read(unsigned int fd, char * buf, size_t count) { struct file *file; --- ./include/linux/sunrpc/svc.h 2002/10/23 00:38:26 1.1 +++ ./include/linux/sunrpc/svc.h 2002/10/25 05:14:06 @@ -48,43 +48,49 @@ struct svc_serv { * This is use to determine the max number of pages nfsd is * willing to return in a single READ operation. */ -#define RPCSVC_MAXPAYLOAD 16384u +#define RPCSVC_MAXPAYLOAD (64*1024u) /* - * Buffer to store RPC requests or replies in. - * Each server thread has one of these beasts. + * RPC Requsts and replies are stored in one or more pages. + * We maintain an array of pages for each server thread. + * Requests are copied into these pages as they arrive. Remaining + * pages are available to write the reply into. * - * Area points to the allocated memory chunk currently owned by the - * buffer. Base points to the buffer containing the request, which is - * different from area when directly reading from an sk_buff. buf is - * the current read/write position while processing an RPC request. + * Currently pages are all re-used by the same server. Later we + * will use ->sendpage to transmit pages with reduced copying. In + * that case we will need to give away the page and allocate new ones. 
+ * In preparation for this, we explicitly move pages off the recv + * list onto the transmit list, and back. * - * The array of iovecs can hold additional data that the server process - * may not want to copy into the RPC reply buffer, but pass to the - * network sendmsg routines directly. The prime candidate for this - * will of course be NFS READ operations, but one might also want to - * do something about READLINK and READDIR. It might be worthwhile - * to implement some generic readdir cache in the VFS layer... + * We use xdr_buf for holding responses as it fits well with NFS + * read responses (that have a header, and some data pages, and possibly + * a tail) and means we can share some client side routines. * - * On the receiving end of the RPC server, the iovec may be used to hold - * the list of IP fragments once we get to process fragmented UDP - * datagrams directly. - */ -#define RPCSVC_MAXIOV ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1) -struct svc_buf { - u32 * area; /* allocated memory */ - u32 * base; /* base of RPC datagram */ - int buflen; /* total length of buffer */ - u32 * buf; /* read/write pointer */ - int len; /* current end of buffer */ - - /* iovec for zero-copy NFS READs */ - struct iovec iov[RPCSVC_MAXIOV]; - int nriov; -}; -#define svc_getu32(argp, val) { (val) = *(argp)->buf++; (argp)->len--; } -#define svc_putu32(resp, val) { *(resp)->buf++ = (val); (resp)->len++; } + * The xdr_buf.head iovec always points to the first page in the rq_*pages + * list. The xdr_buf.pages pointer points to the second page on that + * list. xdr_buf.tail points to the end of the first page. + * This assumes that the non-page part of an rpc reply will fit + * in a page - NFSd ensures this. lockd also has no trouble. 
+ */ +#define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1) + +static inline u32 svc_getu32(struct iovec *iov) +{ + u32 val, *vp; + vp = iov->iov_base; + val = *vp++; + iov->iov_base = (void*)vp; + iov->iov_len -= sizeof(u32); + return val; +} +static inline void svc_putu32(struct iovec *iov, u32 val) +{ + u32 *vp = iov->iov_base + iov->iov_len; + *vp = val; + iov->iov_len += sizeof(u32); +} + /* * The context of a single thread, including the request currently being * processed. @@ -102,9 +108,15 @@ struct svc_rqst { struct svc_cred rq_cred; /* auth info */ struct sk_buff * rq_skbuff; /* fast recv inet buffer */ struct svc_deferred_req*rq_deferred; /* deferred request we are replaying */ - struct svc_buf rq_defbuf; /* default buffer */ - struct svc_buf rq_argbuf; /* argument buffer */ - struct svc_buf rq_resbuf; /* result buffer */ + + struct xdr_buf rq_arg; + struct xdr_buf rq_res; + struct page * rq_argpages[RPCSVC_MAXPAGES]; + struct page * rq_respages[RPCSVC_MAXPAGES]; + short rq_argused; /* pages used for argument */ + short rq_arghi; /* pages available in argument page list */ + short rq_resused; /* pages used for result */ + u32 rq_xid; /* transmission id */ u32 rq_prog; /* program number */ u32 rq_vers; /* program version */ @@ -136,6 +148,38 @@ struct svc_rqst { wait_queue_head_t rq_wait; /* synchronization */ }; +/* + * Check buffer bounds after decoding arguments + */ +static inline int +xdr_argsize_check(struct svc_rqst *rqstp, u32 *p) +{ + char *cp = (char *)p; + struct iovec *vec = &rqstp->rq_arg.head[0]; + return cp - (char*)vec->iov_base <= vec->iov_len; +} + +static inline int +xdr_ressize_check(struct svc_rqst *rqstp, u32 *p) +{ + struct iovec *vec = &rqstp->rq_res.head[0]; + char *cp = (char*)p; + + vec->iov_len = cp - (char*)vec->iov_base; + rqstp->rq_res.len = vec->iov_len; + + return vec->iov_len <= PAGE_SIZE; +} + +static int inline take_page(struct svc_rqst *rqstp) +{ + if (rqstp->rq_arghi <= rqstp->rq_argused) + return 
-ENOMEM; + rqstp->rq_respages[rqstp->rq_resused++] = + rqstp->rq_argpages[--rqstp->rq_arghi]; + return 0; +} + struct svc_deferred_req { struct svc_serv *serv; u32 prot; /* protocol (UDP or TCP) */ --- ./include/linux/nfsd/xdr.h 2002/10/24 01:49:48 1.1 +++ ./include/linux/nfsd/xdr.h 2002/10/25 02:21:03 @@ -29,16 +29,16 @@ struct nfsd_readargs { struct svc_fh fh; __u32 offset; __u32 count; - __u32 totalsize; + struct iovec vec[RPCSVC_MAXPAGES]; + int vlen; }; struct nfsd_writeargs { svc_fh fh; - __u32 beginoffset; __u32 offset; - __u32 totalcount; - __u8 * data; int len; + struct iovec vec[RPCSVC_MAXPAGES]; + int vlen; }; struct nfsd_createargs { --- ./include/linux/nfsd/nfsd.h 2002/10/24 04:04:03 1.1 +++ ./include/linux/nfsd/nfsd.h 2002/10/24 04:13:19 @@ -97,9 +97,9 @@ int nfsd_open(struct svc_rqst *, struct int, struct file *); void nfsd_close(struct file *); int nfsd_read(struct svc_rqst *, struct svc_fh *, - loff_t, char *, unsigned long *); + loff_t, struct iovec *,int, unsigned long *); int nfsd_write(struct svc_rqst *, struct svc_fh *, - loff_t, char *, unsigned long, int *); + loff_t, struct iovec *,int, unsigned long, int *); int nfsd_readlink(struct svc_rqst *, struct svc_fh *, char *, int *); int nfsd_symlink(struct svc_rqst *, struct svc_fh *, --- ./include/linux/nfsd/cache.h 2002/10/24 03:41:12 1.1 +++ ./include/linux/nfsd/cache.h 2002/10/24 03:41:35 @@ -32,12 +32,12 @@ struct svc_cacherep { u32 c_vers; unsigned long c_timestamp; union { - struct svc_buf u_buffer; + struct iovec u_vec; u32 u_status; } c_u; }; -#define c_replbuf c_u.u_buffer +#define c_replvec c_u.u_vec #define c_replstat c_u.u_status /* cache entry states */ --- ./include/linux/fs.h 2002/10/24 01:34:48 1.1 +++ ./include/linux/fs.h 2002/10/24 02:53:14 @@ -793,6 +793,8 @@ struct seq_file; extern ssize_t vfs_read(struct file *, char *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char *, size_t, loff_t *); +extern ssize_t vfs_readv(struct file *, struct iovec *, int, 
size_t, loff_t *); +extern ssize_t vfs_writev(struct file *, const struct iovec *, int, size_t, loff_t *); /* * NOTE: write_inode, delete_inode, clear_inode, put_inode can be called --- ./net/sunrpc/svc.c 2002/10/23 12:35:50 1.1 +++ ./net/sunrpc/svc.c 2002/10/25 05:41:14 @@ -13,6 +13,7 @@ #include <linux/net.h> #include <linux/in.h> #include <linux/unistd.h> +#include <linux/mm.h> #include <linux/sunrpc/types.h> #include <linux/sunrpc/xdr.h> @@ -35,7 +36,6 @@ svc_create(struct svc_program *prog, uns if (!(serv = (struct svc_serv *) kmalloc(sizeof(*serv), GFP_KERNEL))) return NULL; - memset(serv, 0, sizeof(*serv)); serv->sv_program = prog; serv->sv_nrthreads = 1; @@ -105,35 +105,41 @@ svc_destroy(struct svc_serv *serv) } /* - * Allocate an RPC server buffer - * Later versions may do nifty things by allocating multiple pages - * of memory directly and putting them into the bufp->iov. + * Allocate an RPC server's buffer space. + * We allocate pages and place them in rq_argpages. */ -int -svc_init_buffer(struct svc_buf *bufp, unsigned int size) +static int +svc_init_buffer(struct svc_rqst *rqstp, unsigned int size) { - if (!(bufp->area = (u32 *) kmalloc(size, GFP_KERNEL))) - return 0; - bufp->base = bufp->area; - bufp->buf = bufp->area; - bufp->len = 0; - bufp->buflen = size >> 2; - - bufp->iov[0].iov_base = bufp->area; - bufp->iov[0].iov_len = size; - bufp->nriov = 1; - - return 1; + int pages = 2 + (size+ PAGE_SIZE -1) / PAGE_SIZE; + int arghi; + + rqstp->rq_argused = 0; + rqstp->rq_resused = 0; + arghi = 0; + while (pages) { + struct page *p = alloc_page(GFP_KERNEL); + if (!p) + break; + printk("allocated page %d (%d to go)\n", arghi, pages-1); + rqstp->rq_argpages[arghi++] = p; + pages--; + } + rqstp->rq_arghi = arghi; + return ! 
pages; } /* * Release an RPC server buffer */ -void -svc_release_buffer(struct svc_buf *bufp) +static void +svc_release_buffer(struct svc_rqst *rqstp) { - kfree(bufp->area); - bufp->area = 0; + while (rqstp->rq_arghi) + put_page(rqstp->rq_argpages[--rqstp->rq_arghi]); + while (rqstp->rq_resused) + put_page(rqstp->rq_respages[--rqstp->rq_resused]); + rqstp->rq_argused = 0; } /* @@ -154,7 +160,7 @@ svc_create_thread(svc_thread_fn func, st if (!(rqstp->rq_argp = (u32 *) kmalloc(serv->sv_xdrsize, GFP_KERNEL)) || !(rqstp->rq_resp = (u32 *) kmalloc(serv->sv_xdrsize, GFP_KERNEL)) - || !svc_init_buffer(&rqstp->rq_defbuf, serv->sv_bufsz)) + || !svc_init_buffer(rqstp, serv->sv_bufsz)) goto out_thread; serv->sv_nrthreads++; @@ -180,7 +186,7 @@ svc_exit_thread(struct svc_rqst *rqstp) { struct svc_serv *serv = rqstp->rq_server; - svc_release_buffer(&rqstp->rq_defbuf); + svc_release_buffer(rqstp); if (rqstp->rq_resp) kfree(rqstp->rq_resp); if (rqstp->rq_argp) @@ -242,37 +248,49 @@ svc_process(struct svc_serv *serv, struc struct svc_program *progp; struct svc_version *versp = NULL; /* compiler food */ struct svc_procedure *procp = NULL; - struct svc_buf * argp = &rqstp->rq_argbuf; - struct svc_buf * resp = &rqstp->rq_resbuf; + struct iovec * argv = &rqstp->rq_arg.head[0]; + struct iovec * resv = &rqstp->rq_res.head[0]; kxdrproc_t xdr; - u32 *bufp, *statp; + u32 *statp; u32 dir, prog, vers, proc, auth_stat, rpc_stat; rpc_stat = rpc_success; - bufp = argp->buf; - if (argp->len < 5) + if (argv->iov_len < 6*4) goto err_short_len; - dir = ntohl(*bufp++); - vers = ntohl(*bufp++); + /* setup response xdr_buf. + * Initially it has just one page + */ + take_page(rqstp); /* must succeed */ + resv->iov_base = page_address(rqstp->rq_respages[0]); + resv->iov_len = 0; + rqstp->rq_res.pages = rqstp->rq_respages+1; + rqstp->rq_res.len = 0; + /* tcp needs a space for the record length... 
*/ + if (rqstp->rq_prot == IPPROTO_TCP) + svc_putu32(resv, 0); + + rqstp->rq_xid = svc_getu32(argv); + svc_putu32(resv, rqstp->rq_xid); + + dir = ntohl(svc_getu32(argv)); + vers = ntohl(svc_getu32(argv)); /* First words of reply: */ - svc_putu32(resp, xdr_one); /* REPLY */ - svc_putu32(resp, xdr_zero); /* ACCEPT */ + svc_putu32(resv, xdr_one); /* REPLY */ if (dir != 0) /* direction != CALL */ goto err_bad_dir; if (vers != 2) /* RPC version number */ goto err_bad_rpc; - rqstp->rq_prog = prog = ntohl(*bufp++); /* program number */ - rqstp->rq_vers = vers = ntohl(*bufp++); /* version number */ - rqstp->rq_proc = proc = ntohl(*bufp++); /* procedure number */ + svc_putu32(resv, xdr_zero); /* ACCEPT */ - argp->buf += 5; - argp->len -= 5; + rqstp->rq_prog = prog = ntohl(svc_getu32(argv)); /* program number */ + rqstp->rq_vers = vers = ntohl(svc_getu32(argv)); /* version number */ + rqstp->rq_proc = proc = ntohl(svc_getu32(argv)); /* procedure number */ /* * Decode auth data, and add verifier to reply buffer. @@ -307,8 +325,8 @@ svc_process(struct svc_serv *serv, struc serv->sv_stats->rpccnt++; /* Build the reply header. 
*/ - statp = resp->buf; - svc_putu32(resp, rpc_success); /* RPC_SUCCESS */ + statp = resv->iov_base +resv->iov_len; + svc_putu32(resv, rpc_success); /* RPC_SUCCESS */ /* Bump per-procedure stats counter */ procp->pc_count++; @@ -327,14 +345,14 @@ svc_process(struct svc_serv *serv, struc if (!versp->vs_dispatch) { /* Decode arguments */ xdr = procp->pc_decode; - if (xdr && !xdr(rqstp, rqstp->rq_argbuf.buf, rqstp->rq_argp)) + if (xdr && !xdr(rqstp, argv->iov_base, rqstp->rq_argp)) goto err_garbage; *statp = procp->pc_func(rqstp, rqstp->rq_argp, rqstp->rq_resp); /* Encode reply */ if (*statp == rpc_success && (xdr = procp->pc_encode) - && !xdr(rqstp, rqstp->rq_resbuf.buf, rqstp->rq_resp)) { + && !xdr(rqstp, resv->iov_base+resv->iov_len, rqstp->rq_resp)) { dprintk("svc: failed to encode reply\n"); /* serv->sv_stats->rpcsystemerr++; */ *statp = rpc_system_err; @@ -347,7 +365,7 @@ svc_process(struct svc_serv *serv, struc /* Check RPC status result */ if (*statp != rpc_success) - resp->len = statp + 1 - resp->base; + resv->iov_len = ((void*)statp) - resv->iov_base + 4; /* Release reply info */ if (procp->pc_release) @@ -369,7 +387,7 @@ svc_process(struct svc_serv *serv, struc err_short_len: #ifdef RPC_PARANOIA - printk("svc: short len %d, dropping request\n", argp->len); + printk("svc: short len %d, dropping request\n", argv->iov_len); #endif goto dropit; /* drop request */ @@ -382,18 +400,19 @@ err_bad_dir: err_bad_rpc: serv->sv_stats->rpcbadfmt++; - resp->buf[-1] = xdr_one; /* REJECT */ - svc_putu32(resp, xdr_zero); /* RPC_MISMATCH */ - svc_putu32(resp, xdr_two); /* Only RPCv2 supported */ - svc_putu32(resp, xdr_two); + svc_putu32(resv, xdr_one); /* REJECT */ + svc_putu32(resv, xdr_zero); /* RPC_MISMATCH */ + svc_putu32(resv, xdr_two); /* Only RPCv2 supported */ + svc_putu32(resv, xdr_two); goto sendit; err_bad_auth: dprintk("svc: authentication failed (%d)\n", ntohl(auth_stat)); serv->sv_stats->rpcbadauth++; - resp->buf[-1] = xdr_one; /* REJECT */ - svc_putu32(resp, 
xdr_one); /* AUTH_ERROR */ - svc_putu32(resp, auth_stat); /* status */ + resv->iov_len -= 4; + svc_putu32(resv, xdr_one); /* REJECT */ + svc_putu32(resv, xdr_one); /* AUTH_ERROR */ + svc_putu32(resv, auth_stat); /* status */ goto sendit; err_bad_prog: @@ -403,7 +422,7 @@ err_bad_prog: /* else it is just a Solaris client seeing if ACLs are supported */ #endif serv->sv_stats->rpcbadfmt++; - svc_putu32(resp, rpc_prog_unavail); + svc_putu32(resv, rpc_prog_unavail); goto sendit; err_bad_vers: @@ -411,9 +430,9 @@ err_bad_vers: printk("svc: unknown version (%d)\n", vers); #endif serv->sv_stats->rpcbadfmt++; - svc_putu32(resp, rpc_prog_mismatch); - svc_putu32(resp, htonl(progp->pg_lovers)); - svc_putu32(resp, htonl(progp->pg_hivers)); + svc_putu32(resv, rpc_prog_mismatch); + svc_putu32(resv, htonl(progp->pg_lovers)); + svc_putu32(resv, htonl(progp->pg_hivers)); goto sendit; err_bad_proc: @@ -421,7 +440,7 @@ err_bad_proc: printk("svc: unknown procedure (%d)\n", proc); #endif serv->sv_stats->rpcbadfmt++; - svc_putu32(resp, rpc_proc_unavail); + svc_putu32(resv, rpc_proc_unavail); goto sendit; err_garbage: @@ -429,6 +448,6 @@ err_garbage: printk("svc: failed to decode args\n"); #endif serv->sv_stats->rpcbadfmt++; - svc_putu32(resp, rpc_garbage_args); + svc_putu32(resv, rpc_garbage_args); goto sendit; } --- ./net/sunrpc/svcsock.c 2002/10/21 23:40:50 1.2 +++ ./net/sunrpc/svcsock.c 2002/10/25 07:22:30 @@ -234,7 +234,7 @@ svc_sock_received(struct svc_sock *svsk) */ void svc_reserve(struct svc_rqst *rqstp, int space) { - space += rqstp->rq_resbuf.len<<2; + space += rqstp->rq_res.head[0].iov_len; if (space < rqstp->rq_reserved) { struct svc_sock *svsk = rqstp->rq_sock; @@ -278,13 +278,12 @@ svc_sock_release(struct svc_rqst *rqstp) * But first, check that enough space was reserved * for the reply, otherwise we have a bug! 
*/ - if ((rqstp->rq_resbuf.len<<2) > rqstp->rq_reserved) + if ((rqstp->rq_res.len) > rqstp->rq_reserved) printk(KERN_ERR "RPC request reserved %d but used %d\n", rqstp->rq_reserved, - rqstp->rq_resbuf.len<<2); + rqstp->rq_res.len); - rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base; - rqstp->rq_resbuf.len = 0; + rqstp->rq_res.head[0].iov_len = 0; svc_reserve(rqstp, 0); rqstp->rq_sock = NULL; @@ -480,13 +479,15 @@ svc_write_space(struct sock *sk) /* * Receive a datagram from a UDP socket. */ +extern int +csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb); + static int svc_udp_recvfrom(struct svc_rqst *rqstp) { struct svc_sock *svsk = rqstp->rq_sock; struct svc_serv *serv = svsk->sk_server; struct sk_buff *skb; - u32 *data; int err, len; if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags)) @@ -512,33 +513,19 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) } set_bit(SK_DATA, &svsk->sk_flags); /* there may be more data... */ - /* Sorry. */ - if (skb_is_nonlinear(skb)) { - if (skb_linearize(skb, GFP_KERNEL) != 0) { - kfree_skb(skb); - svc_sock_received(svsk); - return 0; - } - } + len = skb->len - sizeof(struct udphdr); - if (skb->ip_summed != CHECKSUM_UNNECESSARY) { - if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) { - skb_free_datagram(svsk->sk_sk, skb); - svc_sock_received(svsk); - return 0; - } + if (csum_partial_copy_to_xdr(&rqstp->rq_arg, skb)) { + /* checksum error */ + skb_free_datagram(svsk->sk_sk, skb); + svc_sock_received(svsk); + return 0; } - len = skb->len - sizeof(struct udphdr); - data = (u32 *) (skb->data + sizeof(struct udphdr)); - - rqstp->rq_skbuff = skb; - rqstp->rq_argbuf.base = data; - rqstp->rq_argbuf.buf = data; - rqstp->rq_argbuf.len = (len >> 2); - rqstp->rq_argbuf.buflen = (len >> 2); - /* rqstp->rq_resbuf = rqstp->rq_defbuf; */ + rqstp->rq_arg.len = len; + rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len; + rqstp->rq_argused += (rqstp->rq_arg.page_len + PAGE_SIZE - 1)/ PAGE_SIZE; rqstp->rq_prot = 
IPPROTO_UDP; /* Get sender address */ @@ -546,6 +533,8 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) rqstp->rq_addr.sin_port = skb->h.uh->source; rqstp->rq_addr.sin_addr.s_addr = skb->nh.iph->saddr; + skb_free_datagram(svsk->sk_sk, skb); + if (serv->sv_stats) serv->sv_stats->netudpcnt++; @@ -559,21 +548,37 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) static int svc_udp_sendto(struct svc_rqst *rqstp) { - struct svc_buf *bufp = &rqstp->rq_resbuf; int error; + struct iovec vec[RPCSVC_MAXPAGES]; + int v; + int base, len; /* Set up the first element of the reply iovec. * Any other iovecs that may be in use have been taken * care of by the server implementation itself. */ - /* bufp->base = bufp->area; */ - bufp->iov[0].iov_base = bufp->base; - bufp->iov[0].iov_len = bufp->len << 2; + vec[0] = rqstp->rq_res.head[0]; + v=1; + base=rqstp->rq_res.page_base; + len = rqstp->rq_res.page_len; + while (len) { + vec[v].iov_base = page_address(rqstp->rq_res.pages[v-1]) + base; + vec[v].iov_len = PAGE_SIZE-base; + if (len <= vec[v].iov_len) + vec[v].iov_len = len; + len -= vec[v].iov_len; + base = 0; + v++; + } + if (rqstp->rq_res.tail[0].iov_len) { + vec[v] = rqstp->rq_res.tail[0]; + v++; + } - error = svc_sendto(rqstp, bufp->iov, bufp->nriov); + error = svc_sendto(rqstp, vec, v); if (error == -ECONNREFUSED) /* ICMP error on earlier request. 
*/ - error = svc_sendto(rqstp, bufp->iov, bufp->nriov); + error = svc_sendto(rqstp, vec, v); return error; } @@ -785,8 +790,9 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp) { struct svc_sock *svsk = rqstp->rq_sock; struct svc_serv *serv = svsk->sk_server; - struct svc_buf *bufp = &rqstp->rq_argbuf; int len; + struct iovec vec[RPCSVC_MAXPAGES]; + int pnum, vlen; dprintk("svc: tcp_recv %p data %d conn %d close %d\n", svsk, test_bit(SK_DATA, &svsk->sk_flags), @@ -851,7 +857,7 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp) } svsk->sk_reclen &= 0x7fffffff; dprintk("svc: TCP record, %d bytes\n", svsk->sk_reclen); - if (svsk->sk_reclen > (bufp->buflen<<2)) { + if (svsk->sk_reclen > (32768 /*FIXME*/)) { printk(KERN_NOTICE "RPC: bad TCP reclen 0x%08lx (large)\n", (unsigned long) svsk->sk_reclen); goto err_delete; @@ -869,30 +875,35 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp) svc_sock_received(svsk); return -EAGAIN; /* record not complete */ } + len = svsk->sk_reclen; set_bit(SK_DATA, &svsk->sk_flags); - /* Frob argbuf */ - bufp->iov[0].iov_base += 4; - bufp->iov[0].iov_len -= 4; + vec[0] = rqstp->rq_arg.head[0]; + vlen = PAGE_SIZE; + pnum = 1; + while (vlen < len) { + vec[pnum].iov_base = page_address(rqstp->rq_argpages[rqstp->rq_argused++]); + vec[pnum].iov_len = PAGE_SIZE; + pnum++; + vlen += PAGE_SIZE; + } /* Now receive data */ - len = svc_recvfrom(rqstp, bufp->iov, bufp->nriov, svsk->sk_reclen); + len = svc_recvfrom(rqstp, vec, pnum, len); if (len < 0) goto error; dprintk("svc: TCP complete record (%d bytes)\n", len); - - /* Position reply write pointer immediately after args, - * allowing for record length */ - rqstp->rq_resbuf.base = rqstp->rq_argbuf.base + 1 + (len>>2); - rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base + 1; - rqstp->rq_resbuf.len = 1; - rqstp->rq_resbuf.buflen= rqstp->rq_argbuf.buflen - (len>>2) - 1; + rqstp->rq_arg.len = len; + rqstp->rq_arg.page_base = 0; + if (len <= rqstp->rq_arg.head[0].iov_len) { + rqstp->rq_arg.head[0].iov_len = len; + 
rqstp->rq_arg.page_len = 0; + } else { + rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len; + } rqstp->rq_skbuff = 0; - rqstp->rq_argbuf.buf += 1; - rqstp->rq_argbuf.len = (len >> 2); - rqstp->rq_argbuf.buflen = (len >> 2) +1; rqstp->rq_prot = IPPROTO_TCP; /* Reset TCP read info */ @@ -928,23 +939,44 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp) static int svc_tcp_sendto(struct svc_rqst *rqstp) { - struct svc_buf *bufp = &rqstp->rq_resbuf; + struct xdr_buf *xbufp = &rqstp->rq_res; + struct iovec vec[RPCSVC_MAXPAGES]; + int v; + int base, len; int sent; + u32 reclen; /* Set up the first element of the reply iovec. * Any other iovecs that may be in use have been taken * care of by the server implementation itself. */ - bufp->iov[0].iov_base = bufp->base; - bufp->iov[0].iov_len = bufp->len << 2; - bufp->base[0] = htonl(0x80000000|((bufp->len << 2) - 4)); + reclen = htonl(0x80000000|((xbufp->len ) - 4)); + memcpy(xbufp->head[0].iov_base, &reclen, 4); + + vec[0] = rqstp->rq_res.head[0]; + v=1; + base= xbufp->page_base; + len = xbufp->page_len; + while (len) { + vec[v].iov_base = page_address(xbufp->pages[v-1]) + base; + vec[v].iov_len = PAGE_SIZE-base; + if (len <= vec[v].iov_len) + vec[v].iov_len = len; + len -= vec[v].iov_len; + base = 0; + v++; + } + if (xbufp->tail[0].iov_len) { + vec[v] = xbufp->tail[0]; + v++; + } - sent = svc_sendto(rqstp, bufp->iov, bufp->nriov); - if (sent != bufp->len<<2) { + sent = svc_sendto(rqstp, vec, v); + if (sent != xbufp->len) { printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n", rqstp->rq_sock->sk_server->sv_name, (sent<0)?"got error":"sent only", - sent, bufp->len << 2); + sent, xbufp->len); svc_delete_socket(rqstp->rq_sock); sent = -EAGAIN; } @@ -1016,6 +1048,8 @@ svc_recv(struct svc_serv *serv, struct s { struct svc_sock *svsk =NULL; int len; + int pages; + struct xdr_buf *arg; DECLARE_WAITQUEUE(wait, current); dprintk("svc: server %p waiting for data (to = %ld)\n", @@ -1031,9 
+1065,35 @@ svc_recv(struct svc_serv *serv, struct s rqstp); /* Initialize the buffers */ - rqstp->rq_argbuf = rqstp->rq_defbuf; - rqstp->rq_resbuf = rqstp->rq_defbuf; + /* first reclaim pages that were moved to response list */ + while (rqstp->rq_resused) + rqstp->rq_argpages[rqstp->rq_arghi++] = + rqstp->rq_respages[--rqstp->rq_resused]; + /* now allocate needed pages. If we get a failure, sleep briefly */ + pages = 2 + (serv->sv_bufsz + PAGE_SIZE -1) / PAGE_SIZE; + while (rqstp->rq_arghi < pages) { + struct page *p = alloc_page(GFP_KERNEL); + if (!p) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(HZ/2); + current->state = TASK_RUNNING; + continue; + } + rqstp->rq_argpages[rqstp->rq_arghi++] = p; + } + /* Make arg->head point to first page and arg->pages point to rest */ + arg = &rqstp->rq_arg; + arg->head[0].iov_base = page_address(rqstp->rq_argpages[0]); + arg->head[0].iov_len = PAGE_SIZE; + rqstp->rq_argused = 1; + arg->pages = rqstp->rq_argpages + 1; + arg->page_base = 0; + /* save at least one page for response */ + arg->page_len = (pages-2)*PAGE_SIZE; + arg->len = (pages-1)*PAGE_SIZE; + arg->tail[0].iov_len = 0; + if (signalled()) return -EINTR; @@ -1109,12 +1169,6 @@ svc_recv(struct svc_serv *serv, struct s rqstp->rq_userset = 0; rqstp->rq_chandle.defer = svc_defer; - svc_getu32(&rqstp->rq_argbuf, rqstp->rq_xid); - svc_putu32(&rqstp->rq_resbuf, rqstp->rq_xid); - - /* Assume that the reply consists of a single buffer. 
*/ - rqstp->rq_resbuf.nriov = 1; - if (serv->sv_stats) serv->sv_stats->netcnt++; return len; @@ -1354,23 +1408,25 @@ static struct cache_deferred_req * svc_defer(struct cache_req *req) { struct svc_rqst *rqstp = container_of(req, struct svc_rqst, rq_chandle); - int size = sizeof(struct svc_deferred_req) + (rqstp->rq_argbuf.buflen << 2); + int size = sizeof(struct svc_deferred_req) + (rqstp->rq_arg.head[0].iov_len); struct svc_deferred_req *dr; + if (rqstp->rq_arg.page_len) + return NULL; /* if more than a page, give up FIXME */ if (rqstp->rq_deferred) { dr = rqstp->rq_deferred; rqstp->rq_deferred = NULL; } else { /* FIXME maybe discard if size too large */ - dr = kmalloc(size<<2, GFP_KERNEL); + dr = kmalloc(size, GFP_KERNEL); if (dr == NULL) return NULL; dr->serv = rqstp->rq_server; dr->prot = rqstp->rq_prot; dr->addr = rqstp->rq_addr; - dr->argslen = rqstp->rq_argbuf.buflen; - memcpy(dr->args, rqstp->rq_argbuf.base, dr->argslen<<2); + dr->argslen = rqstp->rq_arg.head[0].iov_len >> 2; + memcpy(dr->args, rqstp->rq_arg.head[0].iov_base, dr->argslen<<2); } spin_lock(&rqstp->rq_server->sv_lock); rqstp->rq_sock->sk_inuse++; @@ -1388,10 +1444,10 @@ static int svc_deferred_recv(struct svc_ { struct svc_deferred_req *dr = rqstp->rq_deferred; - rqstp->rq_argbuf.base = dr->args; - rqstp->rq_argbuf.buf = dr->args; - rqstp->rq_argbuf.len = dr->argslen; - rqstp->rq_argbuf.buflen = dr->argslen; + rqstp->rq_arg.head[0].iov_base = dr->args; + rqstp->rq_arg.head[0].iov_len = dr->argslen<<2; + rqstp->rq_arg.page_len = 0; + rqstp->rq_arg.len = dr->argslen<<2; rqstp->rq_prot = dr->prot; rqstp->rq_addr = dr->addr; return dr->argslen<<2; --- ./net/sunrpc/svcauth.c 2002/10/24 06:01:17 1.1 +++ ./net/sunrpc/svcauth.c 2002/10/24 06:01:52 @@ -40,8 +40,7 @@ svc_authenticate(struct svc_rqst *rqstp, *statp = rpc_success; *authp = rpc_auth_ok; - svc_getu32(&rqstp->rq_argbuf, flavor); - flavor = ntohl(flavor); + flavor = ntohl(svc_getu32(&rqstp->rq_arg.head[0])); dprintk("svc: svc_authenticate 
(%d)\n", flavor); if (flavor >= RPC_AUTH_MAXFLAVOR || !(aops = authtab[flavor])) { --- ./net/sunrpc/xprt.c 2002/10/24 00:34:53 1.1 +++ ./net/sunrpc/xprt.c 2002/10/24 01:00:36 @@ -655,7 +655,7 @@ skb_read_and_csum_bits(skb_reader_t *des * We have set things up such that we perform the checksum of the UDP * packet in parallel with the copies into the RPC client iovec. -DaveM */ -static int +int csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb) { skb_reader_t desc; --- ./net/sunrpc/svcauth_unix.c 2002/10/24 06:09:05 1.1 +++ ./net/sunrpc/svcauth_unix.c 2002/10/25 07:14:44 @@ -287,20 +287,20 @@ void svcauth_unix_purge(void) static int svcauth_null_accept(struct svc_rqst *rqstp, u32 *authp, int proc) { - struct svc_buf *argp = &rqstp->rq_argbuf; - struct svc_buf *resp = &rqstp->rq_resbuf; + struct iovec *argv = &rqstp->rq_arg.head[0]; + struct iovec *resv = &rqstp->rq_res.head[0]; int rv=0; struct ip_map key, *ipm; - if ((argp->len -= 3) < 0) { + if (argv->iov_len < 3*4) return SVC_GARBAGE; - } - if (*(argp->buf)++ != 0) { /* we already skipped the flavor */ + + if (svc_getu32(argv) != 0) { dprintk("svc: bad null cred\n"); *authp = rpc_autherr_badcred; return SVC_DENIED; } - if (*(argp->buf)++ != RPC_AUTH_NULL || *(argp->buf)++ != 0) { + if (svc_getu32(argv) != RPC_AUTH_NULL || svc_getu32(argv) != 0) { dprintk("svc: bad null verf\n"); *authp = rpc_autherr_badverf; return SVC_DENIED; @@ -312,8 +312,8 @@ svcauth_null_accept(struct svc_rqst *rqs rqstp->rq_cred.cr_groups[0] = NOGROUP; /* Put NULL verifier */ - svc_putu32(resp, RPC_AUTH_NULL); - svc_putu32(resp, 0); + svc_putu32(resv, RPC_AUTH_NULL); + svc_putu32(resv, 0); key.m_class = rqstp->rq_server->sv_program->pg_class; key.m_addr = rqstp->rq_addr.sin_addr; @@ -368,64 +368,70 @@ struct auth_ops svcauth_null = { int svcauth_unix_accept(struct svc_rqst *rqstp, u32 *authp, int proc) { - struct svc_buf *argp = &rqstp->rq_argbuf; - struct svc_buf *resp = &rqstp->rq_resbuf; + struct iovec *argv = 
&rqstp->rq_arg.head[0]; + struct iovec *resv = &rqstp->rq_res.head[0]; struct svc_cred *cred = &rqstp->rq_cred; - u32 *bufp = argp->buf, slen, i; - int len = argp->len; + u32 slen, i; + int len = argv->iov_len; int rv=0; struct ip_map key, *ipm; - if ((len -= 3) < 0) + if ((len -= 3*4) < 0) return SVC_GARBAGE; - bufp++; /* length */ - bufp++; /* time stamp */ - slen = XDR_QUADLEN(ntohl(*bufp++)); /* machname length */ - if (slen > 64 || (len -= slen + 3) < 0) + svc_getu32(argv); /* length */ + svc_getu32(argv); /* time stamp */ + slen = XDR_QUADLEN(ntohl(svc_getu32(argv))); /* machname length */ + if (slen > 64 || (len -= (slen + 3)*4) < 0) goto badcred; - bufp += slen; /* skip machname */ - - cred->cr_uid = ntohl(*bufp++); /* uid */ - cred->cr_gid = ntohl(*bufp++); /* gid */ +printk("namelen %d name %.*s\n", slen, slen*4, (char*)argv->iov_base); + argv->iov_base = (void*)((u32*)argv->iov_base + slen); /* skip machname */ - slen = ntohl(*bufp++); /* gids length */ - if (slen > 16 || (len -= slen + 2) < 0) + cred->cr_uid = ntohl(svc_getu32(argv)); /* uid */ + cred->cr_gid = ntohl(svc_getu32(argv)); /* gid */ +printk("uid=%d gid=%d\n", cred->cr_uid, cred->cr_gid); + slen = ntohl(svc_getu32(argv)); /* gids length */ + printk("%d gids (%d)\n", slen, len); + if (slen > 16 || (len -= (slen + 2)*4) < 0) goto badcred; - for (i = 0; i < NGROUPS && i < slen; i++) - cred->cr_groups[i] = ntohl(*bufp++); + for (i = 0; i < slen; i++) + if (i < NGROUPS) + cred->cr_groups[i] = ntohl(svc_getu32(argv)); + else + svc_getu32(argv); if (i < NGROUPS) cred->cr_groups[i] = NOGROUP; - bufp += (slen - i); + printk("..got %d\n", i); - if (*bufp++ != RPC_AUTH_NULL || *bufp++ != 0) { + if (svc_getu32(argv) != RPC_AUTH_NULL || svc_getu32(argv) != 0) { + printk("nogo\n"); *authp = rpc_autherr_badverf; return SVC_DENIED; } - argp->buf = bufp; - argp->len = len; - /* Put NULL verifier */ - svc_putu32(resp, RPC_AUTH_NULL); - svc_putu32(resp, 0); + svc_putu32(resv, RPC_AUTH_NULL); + svc_putu32(resv, 
0); + printk("put NULL\n"); key.m_class = rqstp->rq_server->sv_program->pg_class; key.m_addr = rqstp->rq_addr.sin_addr; + printk("key is <%s>, %x\n", key.m_class, key.m_addr.s_addr); + ipm = ip_map_lookup(&key, 0); rqstp->rq_client = NULL; - + printk(ipm?"Yes\n": "No\n"); if (ipm) switch (cache_check(&ip_map_cache, &ipm->h, &rqstp->rq_chandle)) { - case -EAGAIN: + case -EAGAIN:printk("EAGAIN\n"); rv = SVC_DROP; break; - case -ENOENT: + case -ENOENT:printk("NOENT\n"); rv = SVC_OK; /* rq_client is NULL */ break; - case 0: + case 0: printk("Zero\n"); rqstp->rq_client = &ipm->m_client->h; cache_get(&rqstp->rq_client->h); ip_map_put(&ipm->h, &ip_map_cache); @@ -434,7 +440,7 @@ svcauth_unix_accept(struct svc_rqst *rqs default: BUG(); } else rv = SVC_DROP; - + if (rqstp->rq_client==NULL) printk("clinet NULL and proc %d\n", proc); if (rqstp->rq_client == NULL && proc != 0) goto badcred; return rv; --- ./kernel/ksyms.c 2002/10/24 01:33:59 1.1 +++ ./kernel/ksyms.c 2002/10/24 01:34:08 @@ -254,7 +254,9 @@ EXPORT_SYMBOL(find_inode_number); EXPORT_SYMBOL(is_subdir); EXPORT_SYMBOL(get_unused_fd); EXPORT_SYMBOL(vfs_read); +EXPORT_SYMBOL(vfs_readv); EXPORT_SYMBOL(vfs_write); +EXPORT_SYMBOL(vfs_writev); EXPORT_SYMBOL(vfs_create); EXPORT_SYMBOL(vfs_mkdir); EXPORT_SYMBOL(vfs_mknod);
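
The send paths in the patch above (svc_udp_sendto and svc_tcp_sendto) walk the reply xdr_buf as head, then pages, then tail, and build an iovec array from it. The following is only a userspace sketch of that walk, not the kernel code: struct sketch_xdr_buf, SKETCH_PAGE_SIZE and sketch_xdr_to_iovec are hypothetical stand-ins, pages are plain char buffers instead of struct page, and the bounds check against max is an addition that the patch itself does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: stand-in for the kernel's PAGE_SIZE */

/* Hypothetical userspace stand-in for struct xdr_buf. */
struct sketch_xdr_buf {
    struct iovec head;   /* RPC/NFS reply header */
    char **pages;        /* payload pages, SKETCH_PAGE_SIZE bytes each */
    size_t page_base;    /* offset of the payload within the first page */
    size_t page_len;     /* number of payload bytes held in pages */
    struct iovec tail;   /* optional trailer; unused when iov_len is 0 */
};

/*
 * Fill vec[] with the head, one entry per page fragment, then the tail.
 * Returns the number of entries used, or -1 if vec[] (max entries) is
 * too small.
 */
static int sketch_xdr_to_iovec(const struct sketch_xdr_buf *xdr,
                               struct iovec *vec, int max)
{
    size_t base = xdr->page_base;
    size_t len = xdr->page_len;
    int v = 0, page = 0;

    assert(base < SKETCH_PAGE_SIZE);
    if (max < 1)
        return -1;
    vec[v++] = xdr->head;

    while (len > 0) {
        /* Only the first page starts at page_base; later pages start at 0. */
        size_t chunk = SKETCH_PAGE_SIZE - base;
        if (chunk > len)
            chunk = len;
        if (v >= max)
            return -1;
        vec[v].iov_base = xdr->pages[page++] + base;
        vec[v].iov_len = chunk;
        v++;
        len -= chunk;
        base = 0;
    }

    if (xdr->tail.iov_len) {
        if (v >= max)
            return -1;
        vec[v++] = xdr->tail;
    }
    return v;
}
```

With a 40-byte head, page_base = 100 and page_len = 5000, this yields three entries: the head, 3996 bytes from the first page, and 1004 bytes from the second.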
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-25 12:41 ` Neil Brown
@ 2002-10-26 3:11 ` Hirokazu Takahashi
2002-10-26 3:46 ` Benjamin LaHaise
2002-10-30 23:29 ` Hirokazu Takahashi
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-26 3:11 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello

> > > I have been thinking some more about this, trying to understand the
> > > big picture, and I'm afraid that I think I want some more changes.
> > >
> > > In particular, I think it would be good to use 'struct xdr_buf' from
> > > sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> > > we could share some of the infrastructure.
> >
> > I just realized it would be hard to use the xdr_buf as it couldn't
> > handle data in a socket buffer. Each socket buffer consists of
> > some non-page data and some pages and each of them might have its
> > own offset and length.
>
> You would only want this for single-copy write request - right?

Yes.

> I think we have to treat them as a special case and pass the skbuf all
> the way up to nfsd in that case.
> You would only want to try this if:
> The NIC had verified the checksum
> The packet was some minimum size (1K? 1 PAGE ??)
> We were using AUTH_UNIX, nothing more interesting like crypto
> security
> The first fragment was some minimum size (size of a write without
> the data).
>
> I would make a special 'fast-path' for that case which didn't copy any
> data but passed a skbuf up, and code in nfs*xdr.c would convert that
> into an iovec[];

In my implementation only the sunrpc layer handles the skbuff, and the
nfsd layer is kept away from its details. I felt this approach was not bad.

Yes, your approach is also good and will work fine.

> I am working on a patch which changes rpcsvc to use xdr_buf. Some of
> it works. Some doesn't. I include it below for your reference I
> repeat: it doesn't work yet.
> Once it is done, adding the rest of zero-copy should be fairly easy.

OK.

It's good that you're implementing vfs_readv and vfs_writev, though
I've also realized that they don't support aio yet.

^ permalink raw reply [flat|nested] 87+ messages in thread
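The fast-path conditions quoted in the message above can be read as one predicate over a received request. The following is a minimal user-space sketch of that reading, not code from any posted patch: `struct rx_request`, `can_use_write_fastpath()` and both thresholds are hypothetical names and values chosen for illustration (the thread itself leaves the minimum sizes open, "1K? 1 PAGE ??").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a received RPC request. None of these names
 * come from the kernel sources; they only mirror the conditions listed
 * in the message. */
enum auth_flavor { FLAVOR_NULL, FLAVOR_UNIX, FLAVOR_CRYPTO };

struct rx_request {
	bool hw_csum_ok;        /* NIC already verified the checksum */
	size_t total_len;       /* size of the whole request */
	size_t first_frag_len;  /* size of the first fragment */
	enum auth_flavor flavor;
};

/* Assumed thresholds; the thread does not settle on values. */
#define FASTPATH_MIN_LEN   1024  /* "1K? 1 PAGE ??" */
#define WRITE_HDR_MAX_LEN   128  /* "size of a write without the data" */

/* Return true only when every condition for the single-copy write
 * fast path holds; otherwise the caller falls back to copying. */
static bool can_use_write_fastpath(const struct rx_request *rq)
{
	assert(rq != NULL);
	if (!rq->hw_csum_ok)
		return false;
	if (rq->total_len < FASTPATH_MIN_LEN)
		return false;
	if (rq->flavor != FLAVOR_UNIX)
		return false;
	if (rq->first_frag_len < WRITE_HDR_MAX_LEN)
		return false;
	return true;
}
```

Keeping the test in one place means the slow (copying) path stays the default and the fast path is taken only when all four conditions are met.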
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-26 3:11 ` Hirokazu Takahashi
@ 2002-10-26 3:46 ` Benjamin LaHaise
2002-10-27 22:46 ` Neil Brown
0 siblings, 1 reply; 87+ messages in thread
From: Benjamin LaHaise @ 2002-10-26 3:46 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs

On Sat, Oct 26, 2002 at 12:11:50PM +0900, Hirokazu Takahashi wrote:
> OK.
>
> It's good that you're implementing vfs_readv and vfs_writev, though
> I've also realized that they don't support aio yet.

The aio methods are soon switching over to vectored operations for a
few reasons. It's likely that non-vectored methods will be gone soon.

-ben
--
"Do you seek knowledge in time travel?"

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-26 3:46 ` Benjamin LaHaise
@ 2002-10-27 22:46 ` Neil Brown
0 siblings, 0 replies; 87+ messages in thread
From: Neil Brown @ 2002-10-27 22:46 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Hirokazu Takahashi, nfs

On Friday October 25, bcrl@redhat.com wrote:
> On Sat, Oct 26, 2002 at 12:11:50PM +0900, Hirokazu Takahashi wrote:
> > OK.
> >
> > It's good that you're implementing vfs_readv and vfs_writev, though
> > I've also realized that they don't support aio yet.
>
> The aio methods are soon switching over to vectored operations for a
> few reasons. It's likely that non-vectored methods will be gone soon.

If you are introducing new 'vectored' operations, it would be nice if
they work well for kernel-space as well as user-space.

In the 'old days' before CONFIG_HIGHMEM, you could just
oldfs = get_fs(); set_fs(KERNEL_DS);
...whatever....
set_fs(oldfs);
to use kernel addresses. But with CONFIG_HIGHMEM the kernel often wants
to work with "struct page *" instead of just a "void *", so this
doesn't always work.

It would be nice if you could pass in an 'actor' which
for user-space access would call copy-to/from-user
for kernel-space would do kmap/copy/kunmap

Just a thought.....

NeilBrown

^ permalink raw reply [flat|nested] 87+ messages in thread
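The 'actor' idea in the message above can be illustrated outside the kernel. This is a rough user-space sketch under stated assumptions: `copy_actor_t`, `vectored_read()`, `flat_actor()` and `paged_actor()` are hypothetical names, plain `memcpy` stands in for copy_to_user, and a fixed array of tiny simulated pages stands in for the kmap/copy/kunmap case. It only shows how one vectored routine could serve both kinds of destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical actor: consumes up to len bytes from src and returns how
 * many bytes it actually took. */
typedef size_t (*copy_actor_t)(void *ctx, const void *src, size_t len);

struct seg { const void *base; size_t len; };

/* Feed every source segment to the actor; stop on a short copy. */
static size_t vectored_read(const struct seg *segs, size_t nsegs,
			    copy_actor_t actor, void *ctx)
{
	size_t total = 0;

	assert(actor != NULL);
	for (size_t i = 0; i < nsegs; i++) {
		size_t done = actor(ctx, segs[i].base, segs[i].len);

		total += done;
		if (done < segs[i].len)
			break;
	}
	return total;
}

/* Stand-in for the user-space case (copy_to_user in the kernel). */
struct flat_dest { char *buf; size_t cap; size_t used; };

static size_t flat_actor(void *ctx, const void *src, size_t len)
{
	struct flat_dest *d = ctx;
	size_t room = d->cap - d->used;

	if (len > room)
		len = room;
	memcpy(d->buf + d->used, src, len);
	d->used += len;
	return len;
}

/* Stand-in for the kernel-space case: the destination is a set of
 * separate pages, so the copy proceeds one page at a time. */
#define SIM_PAGE_SIZE 8

struct paged_dest { char (*pages)[SIM_PAGE_SIZE]; size_t npages; size_t used; };

static size_t paged_actor(void *ctx, const void *src, size_t len)
{
	struct paged_dest *d = ctx;
	const char *p = src;
	size_t copied = 0;

	while (copied < len && d->used < d->npages * SIM_PAGE_SIZE) {
		size_t page = d->used / SIM_PAGE_SIZE;
		size_t off = d->used % SIM_PAGE_SIZE;
		size_t chunk = SIM_PAGE_SIZE - off;

		if (chunk > len - copied)
			chunk = len - copied;
		/* a kernel actor would kmap() the page here ... */
		memcpy(d->pages[page] + off, p + copied, chunk);
		/* ... and kunmap() it here */
		d->used += chunk;
		copied += chunk;
	}
	return copied;
}
```

The caller picks the actor, so the vectored routine itself never needs to know whether the destination is a flat buffer or a list of pages.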
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-25 12:41 ` Neil Brown
2002-10-26 3:11 ` Hirokazu Takahashi
@ 2002-10-30 23:29 ` Hirokazu Takahashi
2002-10-30 23:53 ` Neil Brown
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-30 23:29 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

How is it going?

neilb> I would make a special 'fast-path' for that case which didn't copy any
neilb> data but passed a skbuf up, and code in nfs*xdr.c would convert that
neilb> into an iovec[];
neilb>
neilb> I am working on a patch which changes rpcsvc to use xdr_buf. Some of
neilb> it works. Some doesn't. I include it below for your reference I
neilb> repeat: it doesn't work yet.
neilb> Once it is done, adding the rest of zero-copy should be fairly easy.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-30 23:29 ` Hirokazu Takahashi
@ 2002-10-30 23:53 ` Neil Brown
2002-10-31 2:06 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Neil Brown @ 2002-10-30 23:53 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs

On Thursday October 31, taka@valinux.co.jp wrote:
> Hello,
>
> How is it going?

I've just sent some patches to Linus and nfs@lists....

The rest of the zero copy stuff should fit in quite easily, with the
possible exception of single-copy writes: I haven't looked very hard
at that yet.

NeilBrown

>
> neilb> I would make a special 'fast-path' for that case which didn't copy any
> neilb> data but passed a skbuf up, and code in nfs*xdr.c would convert that
> neilb> into an iovec[];
> neilb>
> neilb> I am working on a patch which changes rpcsvc to use xdr_buf. Some of
> neilb> it works. Some doesn't. I include it below for your reference I
> neilb> repeat: it doesn't work yet.
> neilb> Once it is done, adding the rest of zero-copy should be fairly easy.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-30 23:53 ` Neil Brown
@ 2002-10-31 2:06 ` Hirokazu Takahashi
2002-10-31 15:40 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-31 2:06 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> > How is it going?
>
> I've just sent some patches to Linus and nfs@lists....

Thanks. I've seen them in linux-2.5.45.

> The rest of the zero copy stuff should fit in quite easily, with the
> possible exception of single-copy writes: I haven't looked very hard
> at that yet.

Ok, I'll try to port the zero copy stuff on it.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-31 2:06 ` Hirokazu Takahashi
@ 2002-10-31 15:40 ` Hirokazu Takahashi
2002-10-31 16:56 ` Hirokazu Takahashi
2002-11-01 0:54 ` Neil Brown
0 siblings, 2 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-31 15:40 UTC (permalink / raw)
To: neilb; +Cc: nfs

[-- Attachment #1: Type: Text/Plain, Size: 553 bytes --]

Hello,

> The rest of the zero copy stuff should fit in quite easily, with the
> possible exception of single-copy writes: I haven't looked very hard
> at that yet.

I just ported part of the zero copy stuff against linux-2.5.45.
Single-copy writes and per-cpu sockets are not included yet.
And I fixed a problem that NFS over TCP wouldn't work.

va-nfsd-sendpage.patch ....use sendpage instead of sock_sendmsg.
va-sunrpc-zeropage.patch ....zero filled page for padding.
va-nfsd-vfsread.patch ....zero-copy nfsd_read/nfsd_readdir.

[-- Attachment #2: zerocopy-2.5.45.taz --]
[-- Type: Application/Octet-Stream, Size: 6263 bytes --]

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-31 15:40 ` Hirokazu Takahashi
@ 2002-10-31 16:56 ` Hirokazu Takahashi
2002-11-01 1:10 ` Neil Brown
2002-11-01 0:54 ` Neil Brown
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-31 16:56 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> > The rest of the zero copy stuff should fit in quite easily, with the
> > possible exception of single-copy writes: I haven't looked very hard
> > at that yet.
>
> I just ported part of the zero copy stuff against linux-2.5.45.
> single-copy writes and per-cpu sockets are not included yet.
> And I fixed a problem that NFS over TCP wouldn't work.

I also ported the per-cpu socket patch against linux2.5.45.

--- include/linux/sunrpc/svcsock.h.ORG3 Fri Nov 1 01:29:52 2030
+++ include/linux/sunrpc/svcsock.h Fri Nov 1 01:31:28 2030
@@ -51,6 +51,7 @@ struct svc_sock {
 	int			sk_reclen;	/* length of record */
 	int			sk_tcplen;	/* current read length */
 	time_t			sk_lastrecv;	/* time of last received request */
+	struct svc_sock		**sk_shadow;	/* shadow sockets for sending */
 };
 
 /*
--- net/sunrpc/svcsock.c.ORG3 Fri Nov 1 01:30:14 2030
+++ net/sunrpc/svcsock.c Fri Nov 1 01:51:34 2030
@@ -64,7 +64,9 @@
 
 
 static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *,
-					 int *errp, int pmap_reg);
+					 int *errp, int type);
+#define SVSK_PMAP_REGISTER	1
+#define SVSK_SHADOW		2
 static void		svc_udp_data_ready(struct sock *, int);
 static int		svc_udp_recvfrom(struct svc_rqst *);
 static int		svc_udp_sendto(struct svc_rqst *);
@@ -259,6 +261,8 @@ svc_sock_put(struct svc_sock *svsk)
 	if (!--(svsk->sk_inuse) && test_bit(SK_DEAD, &svsk->sk_flags)) {
 		spin_unlock_bh(&serv->sv_lock);
 		dprintk("svc: releasing dead socket\n");
+		if (svsk->sk_shadow)
+			kfree(svsk->sk_shadow);
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	}
@@ -326,6 +330,27 @@ svc_wake_up(struct svc_serv *serv)
 	spin_unlock_bh(&serv->sv_lock);
 }
 
+static inline struct svc_sock *
+svc_get_svsk(struct svc_rqst *rqstp)
+{
+	struct svc_sock *svsk = rqstp->rq_sock;
+#ifdef CONFIG_SMP
+	if (svsk->sk_shadow) {
+		struct svc_sock *shadow = svsk->sk_shadow[smp_processor_id()];
+		if (shadow) {
+			struct svc_serv *serv = svsk->sk_server;
+			svsk = shadow;
+			if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
+				svc_sock_setbufsize(svsk->sk_sock,
+					(serv->sv_nrthreads+3) * serv->sv_bufsz,
+					(serv->sv_nrthreads+3) * serv->sv_bufsz);
+		}
+
+	}
+#endif
+	return svsk;
+}
+
 /*
  * Generic sendto routine
  */
@@ -333,7 +358,7 @@ static int
 svc_sendto(struct svc_rqst *rqstp, struct xdr_buf *xdr)
 {
 	mm_segment_t	oldfs;
-	struct svc_sock	*svsk = rqstp->rq_sock;
+	struct svc_sock	*svsk = svc_get_svsk(rqstp);
 	struct socket	*sock = svsk->sk_sock;
 	struct msghdr	msg;
 	int		slen;
@@ -1228,7 +1253,7 @@ svc_send(struct svc_rqst *rqstp)
  */
 static struct svc_sock *
 svc_setup_socket(struct svc_serv *serv, struct socket *sock,
-					int *errp, int pmap_register)
+					int *errp, int type)
 {
 	struct svc_sock	*svsk;
 	struct sock	*inet;
@@ -1249,6 +1274,7 @@ svc_setup_socket(struct svc_serv *serv,
 	svsk->sk_owspace = inet->write_space;
 	svsk->sk_server = serv;
 	svsk->sk_lastrecv = CURRENT_TIME;
+	svsk->sk_shadow = NULL;
 	INIT_LIST_HEAD(&svsk->sk_deferred);
 	sema_init(&svsk->sk_sem, 1);
 
@@ -1261,7 +1287,7 @@ if (svsk->sk_sk == NULL)
 	printk(KERN_WARNING "svsk->sk_sk == NULL after svc_prot_init!\n");
 
 	/* Register socket with portmapper */
-	if (*errp >= 0 && pmap_register)
+	if (*errp >= 0 && type == SVSK_PMAP_REGISTER)
 		*errp = svc_register(serv, inet->protocol,
 				     ntohs(inet_sk(inet)->sport));
 
@@ -1273,13 +1299,13 @@ if (svsk->sk_sk == NULL)
 
 
 	spin_lock_bh(&serv->sv_lock);
-	if (!pmap_register) {
+	if (type == SVSK_PMAP_REGISTER || type == SVSK_SHADOW) {
+		clear_bit(SK_TEMP, &svsk->sk_flags);
+		list_add(&svsk->sk_list, &serv->sv_permsocks);
+	} else {
 		set_bit(SK_TEMP, &svsk->sk_flags);
 		list_add(&svsk->sk_list, &serv->sv_tempsocks);
 		serv->sv_tmpcnt++;
-	} else {
-		clear_bit(SK_TEMP, &svsk->sk_flags);
-		list_add(&svsk->sk_list, &serv->sv_permsocks);
 	}
 	spin_unlock_bh(&serv->sv_lock);
 
@@ -1288,6 +1314,61 @@ if (svsk->sk_sk == NULL)
 	return svsk;
 }
 
+
+/*
+ * Create a shadow socket which has the same sport of given svsk.
+ * Let each cpu have its own socket to send packets.
+ */
+static int
+svc_create_shadow_socket(struct svc_serv *serv, struct svc_sock *svsk,
+				int protocol, struct sockaddr_in *sin)
+{
+#ifdef CONFIG_SMP
+	int error;
+	struct socket *newsock;
+	struct svc_sock *newsvsk;
+	int i;
+
+	if (num_online_cpus() == 1)
+		return 0;
+
+	svsk->sk_shadow = kmalloc(sizeof(struct svc_sock*)*NR_CPUS, GFP_KERNEL);
+	if (!svsk->sk_shadow)
+		return -ENOMEM;
+
+	memset(svsk->sk_shadow, 0, sizeof(struct svc_sock*)*NR_CPUS);
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+
+		if ((error = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &newsock)) < 0)
+			return error;
+		if ((newsvsk = svc_setup_socket(serv, newsock, &error, SVSK_SHADOW)) == NULL) {
+			sock_release(newsock);
+			return error;
+		}
+		/*
+		 * Make the newsvsk as shadow of the svsk.
+		 */
+		newsock->sk->reuse = 1; /* allow address reuse */
+		error = newsock->ops->bind(newsock, (struct sockaddr *) sin,
+						sizeof(*sin));
+		if (error < 0) {
+			sock_release(newsock);
+			kfree(newsvsk);
+			return error;
+		}
+		/*
+		 * Unhash the newsocket not to receive packets.
+		 */
+		newsock->sk->prot->unhash(newsock->sk);
+		svsk->sk_shadow[i] = newsvsk;
+	}
+#endif
+	return 0;
+}
+
 /*
  * Create socket for RPC service.
  */
@@ -1327,8 +1408,13 @@ svc_create_socket(struct svc_serv *serv,
 		goto bummer;
 	}
 
-	if ((svsk = svc_setup_socket(serv, sock, &error, 1)) != NULL)
-		return 0;
+	if ((svsk = svc_setup_socket(serv, sock, &error, SVSK_PMAP_REGISTER)) == NULL)
+		goto bummer;
+
+	if (protocol == IPPROTO_UDP && sin != NULL)
+		svc_create_shadow_socket(serv, svsk, protocol, sin);
+
+	return 0;
 
 bummer:
 	dprintk("svc: svc_create_socket error = %d\n", -error);
@@ -1367,6 +1453,8 @@ svc_delete_socket(struct svc_sock *svsk)
 
 	if (!svsk->sk_inuse) {
 		spin_unlock_bh(&serv->sv_lock);
+		if (svsk->sk_shadow)
+			kfree(svsk->sk_shadow);
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	} else {

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-31 16:56 ` Hirokazu Takahashi
@ 2002-11-01 1:10 ` Neil Brown
2002-11-04 21:13 ` Andrew Theurer
0 siblings, 1 reply; 87+ messages in thread
From: Neil Brown @ 2002-11-01 1:10 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs, David S. Miller

On Friday November 1, taka@valinux.co.jp wrote:
> Hello,
>
> > > The rest of the zero copy stuff should fit in quite easily, with the
> > > possible exception of single-copy writes: I haven't looked very hard
> > > at that yet.
> >
> > I just ported part of the zero copy stuff against linux-2.5.45.
> > single-copy writes and per-cpu sockets are not included yet.
> > And I fixed a problem that NFS over TCP wouldn't work.
>
> I also ported the per-cpu socket patch against linux2.5.45.
>

I still don't really like this patch.
I appreciate that some sort of SMP awareness may be appropriate for
nfsd, but this just doesn't feel right.

One possibility that I have considered goes like this:

- Allow a (udp) socket to have 'cpu affinity' registered.
- Get udp_v4_lookup to add to the score for sockets that
  like the current cpu, and reject sockets that don't like this
  cpu.
- Have some cpu affinity with the nfsd threads, probably having
  a separate idle-server-queue for each cpu. Possibly half the
  threads would be tied to a cpu, the other half would float, and
  only be used if no cpu-local threads were available.

Then instead of having special 'shadow' sockets, we just create NCPUS
normal udp sockets, instead of one, and give each a cpu affinity.
This would mean that receiving would benefit from multiple sockets
as well as sending.

I have very little experience with this sort of SMP issue, so I may
be missing something obvious, but to me, this approach seems cleaner
and more general.

Dave: what would you think of having a "unsigned long cpus_allowed"
in struct inet_opt and putting the appropriate checks in
udp_v4_lookup?? Is it worth experimenting with?

NeilBrown

^ permalink raw reply [flat|nested] 87+ messages in thread
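The scoring rule proposed in the message above (add to the score for sockets that like the current cpu, reject those that do not) can be sketched in user space as follows. This is an illustration only: `struct sim_sock`, `score_sock()` and `lookup_sock()` are hypothetical, the score weights are arbitrary, and treating an empty mask as "no preference" is an added assumption so that floating sockets can still match. The real udp_v4_lookup also scores addresses and interfaces, which are left out here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified socket record. */
struct sim_sock {
	unsigned short port;
	unsigned long cpus_allowed; /* bit n set: cpu n preferred; 0: no preference (assumed) */
};

/* Return -1 to reject, otherwise a score where larger is better. */
static int score_sock(const struct sim_sock *sk, unsigned short dport,
		      unsigned int cpu)
{
	int score = 1;

	assert(cpu < sizeof(unsigned long) * CHAR_BIT);
	if (sk->port != dport)
		return -1;
	if (sk->cpus_allowed != 0) {
		if (!(sk->cpus_allowed & (1UL << cpu)))
			return -1;	/* socket does not like this cpu */
		score += 2;		/* socket likes this cpu */
	}
	return score;
}

/* Pick the best-scoring socket for a packet arriving on the given cpu. */
static const struct sim_sock *lookup_sock(const struct sim_sock *socks,
					  size_t n, unsigned short dport,
					  unsigned int cpu)
{
	const struct sim_sock *best = NULL;
	int best_score = -1;

	for (size_t i = 0; i < n; i++) {
		int s = score_sock(&socks[i], dport, cpu);

		if (s > best_score) {
			best = &socks[i];
			best_score = s;
		}
	}
	return best;
}
```

With one socket per cpu bound to the same port, a packet handled on cpu 1 lands on the socket whose mask includes cpu 1, which is the receive-side benefit the message describes.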
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-11-01 1:10 ` Neil Brown
@ 2002-11-04 21:13 ` Andrew Theurer
0 siblings, 0 replies; 87+ messages in thread
From: Andrew Theurer @ 2002-11-04 21:13 UTC (permalink / raw)
To: Neil Brown, Hirokazu Takahashi; +Cc: nfs, David S. Miller

> > I also ported the per-cpu socket patch against linux2.5.45.
>
> I still don't really like this patch.
> I appreciate that some sort of SMP awareness may be appropriate for
> nfsd, but this just doesn't feel right.
>
> One possibility that I have considered goes like this:
>
> - Allow a (udp) socket to have 'cpu affinity' registered.
> - Get udp_v4_lookup to add to the score for sockets that
>   like the current cpu, and reject sockets that don't like this
>   cpu.
> - Have some cpu affinity with the nfsd threads, probably having
>   a separate idle-server-queue for each cpu. Possibly half the
>   threads would be tied to a cpu, the other half would float, and
>   only be used if no cpu-local threads were available.

This all sounds great, I wish I knew how to do this :)

> Then instead of having special 'shadow' sockets, we just create NCPUS
> normal udp sockets, instead of one, and give each a cpu affinity.
> This would mean that receiving would benefit from multiple sockets
> as well as sending.

So, the target socket getting populated on inbound traffic would likely depend
on which CPU took the net card interrupt? And the resulting CPU would handle
the NFS request? If so, and you had a good interrupt balance across CPUs,
that sounds fine. If you have an interrupt imbalance, it could be really
bad. This doesn't sound like a problem on a system like PIII, where
interrupts can float (don't know how irqbalance works with that) but I'm not
so sure about P4, even with irqbalance. Over time they do balance out, but
in my experience a particular interrupt is being handled by one CPU or
another with a significant (in this context) amount of time between
destination changes.

> I have very little experience with this sort of SMP issue, so I may
> be missing something obvious, but to me, this approach seems cleaner
> and more general.
>
> Dave: what would you think of having a "unsigned long cpus_allowed"
> in struct inet_opt and putting the appropriate checks in
> udp_v4_lookup?? Is it worth experimenting with?
>
> NeilBrown

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-31 15:40 ` Hirokazu Takahashi
2002-10-31 16:56 ` Hirokazu Takahashi
@ 2002-11-01 0:54 ` Neil Brown
2002-11-01 1:39 ` Hirokazu Takahashi
2002-11-01 3:41 ` Hirokazu Takahashi
1 sibling, 2 replies; 87+ messages in thread
From: Neil Brown @ 2002-11-01 0:54 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs

On Friday November 1, taka@valinux.co.jp wrote:
> Hello,
>
> > The rest of the zero copy stuff should fit in quite easily, with the
> > possible exception of single-copy writes: I haven't looked very hard
> > at that yet.
>
> I just ported part of the zero copy stuff against linux-2.5.45.
> single-copy writes and per-cpu sockets are not included yet.
> And I fixed a problem that NFS over TCP wouldn't work.
>
>
> va-nfsd-sendpage.patch ....use sendpage instead of sock_sendmsg.
> va-sunrpc-zeropage.patch ....zero filled page for padding.
> va-nfsd-vfsread.patch ....zero-copy nfsd_read/nfsd_readdir.

A lot of this looks fine.

I would like to leave the tail pointing into the end of the first page
(just after the head) rather than using the sunrpc_zero_page thing as
the latter doesn't seem necessary.
Also, I would like to send the head and tail with sendpage rather
than using sock_sendmsg.
To give the destination address, you can call sock_sendmsg with
a length of 0, and then call ->sendpage for each page or page
fragment.

You should be able to remove the calls to svcbuf_reserve in
nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
'buffer' variable as well.

If you could make those changes (or convince me otherwise), I will
forward the patches to Linus,
Thanks.

NeilBrown

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-11-01 0:54 ` Neil Brown
@ 2002-11-01 1:39 ` Hirokazu Takahashi
2002-11-01 3:41 ` Hirokazu Takahashi
1 sibling, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-11-01 1:39 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hello,

> > va-nfsd-sendpage.patch ....use sendpage instead of sock_sendmsg.
> > va-sunrpc-zeropage.patch ....zero filled page for padding.
> > va-nfsd-vfsread.patch ....zero-copy nfsd_read/nfsd_readdir.
>
> A lot of this looks fine.
>
> I would like to leave the tail pointing into the end of the first page
> (just after the head) rather than using the sunrpc_zero_page thing as
> the latter doesn't seem necessary.
> Also, I would like to send the head and tail with sendpage rather
> than using sock_sendmsg.

Yes, we can. I'll do it, though it seems a little bit tricky.

> To give the destination address, you can call sock_sendmsg with
> a length of 0, and then call ->sendpage for each page or page
> fragment.

Ok.

> You should be able to remove the calls to svcbuf_reserve in
> nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
> 'buffer' variable as well.

Yes, you're right.

> If you could make those changes (or convince me otherwise), I will
> forward the patches to Linus,
> Thanks.

Thanks!

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-11-01 0:54 ` Neil Brown
2002-11-01 1:39 ` Hirokazu Takahashi
@ 2002-11-01 3:41 ` Hirokazu Takahashi
2002-11-01 4:20 ` Neil Brown
1 sibling, 1 reply; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-11-01 3:41 UTC (permalink / raw)
To: neilb; +Cc: nfs

[-- Attachment #1: Type: Text/Plain, Size: 843 bytes --]

Hello,

I updated the patches.
I'll send you 2 patches

va-nfsd-sendpage.patch
va-nfsd-vfsread.patch

> I would like to leave the tail pointing into the end of the first page
> (just after the head) rather than using the sunrpc_zero_page thing as
> the latter doesn't seem necessary.
> Also, I would like to send the head and tail with sendpage rather
> than using sock_sendmsg.
> To give the destination address, you can call sock_sendmsg with
> a length of 0, and then call ->sendpage for each page or page
> fragment.
>
>
> You should be able to remove the calls to svcbuf_reserve in
> nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
> 'buffer' variable as well.
>
>
> If you could make those changes (or convince me otherwise), I will
> forward the patches to Linus,
> Thanks.

Thank you,
Hirokazu Takahashi.

[-- Attachment #2: zerocopy-2.5.45-new.taz --]
[-- Type: Application/Octet-Stream, Size: 5912 bytes --]

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-11-01 3:41 ` Hirokazu Takahashi
@ 2002-11-01 4:20 ` Neil Brown
2002-11-01 5:07 ` Hirokazu Takahashi
0 siblings, 1 reply; 87+ messages in thread
From: Neil Brown @ 2002-11-01 4:20 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: nfs

On Friday November 1, taka@valinux.co.jp wrote:
> Hello,
>
> I updated the patches.
> I'll send you 2 patches
>
> va-nfsd-sendpage.patch
> va-nfsd-vfsread.patch
>

Thanks.
I made a couple of little changes and sent them to Linus and the list.
1/ I simplified the sending of the tail a bit more. We assume the
tail is *always* in the same page as the head, and just sendpage it.
2/ I removed svcbuf_reserve and the buffer variable from
nfsd3_proc_readdirplus as well :-)

Thanks again,
NeilBrown

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-11-01 4:20 ` Neil Brown
@ 2002-11-01 5:07 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-11-01 5:07 UTC (permalink / raw)
To: neilb; +Cc: nfs

Hi,

> Thanks.
> I made a couple of little changes and sent them to Linus and the list.
> 1/ I simplified the sending of the tail a bit more. We assume the
> tail is *always* in the same page as the head, and just sendpage it.

It looks fine!

> 2/ I removed svcbuf_reserve and the buffer variable from
> nfsd3_proc_readdirplus as well :-)

Thanks.

^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-25 9:52 ` Hirokazu Takahashi
2002-10-25 12:41 ` Neil Brown
@ 2002-10-25 17:23 ` Trond Myklebust
2002-10-26 3:26 ` Hirokazu Takahashi
1 sibling, 1 reply; 87+ messages in thread
From: Trond Myklebust @ 2002-10-25 17:23 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, nfs

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

>> In particular, I think it would be good to use 'struct xdr_buf'
>> from sunrpc/xdr.h instead of svc_buf. This is what the nfs
>> client uses and we could share some of the infrastructure.

> I just realized it would be hard to use the xdr_buf as it
> couldn't handle data in a socket buffer. Each socket buffer
> consists of some non-page data and some pages and each of them
> might have its own offset and length.

Then the following trivial modification would be quite sufficient

struct xdr_buf {
	struct list_head list;		/* Further xdr_buf */
	struct iovec	head[1],	/* RPC header + non-page data */
			tail[1];	/* Appended after page data */

	struct page **	pages;		/* Array of contiguous pages */
	unsigned int	page_base,	/* Start of page data */
			page_len;	/* Length of page data */

	unsigned int	len;		/* Total length of data */

};

With equally trivial fixes to xdr_kmap() and friends. None of this
needs to affect existing client usage, and may in fact be useful for
optimizing use of v4 COMPOUNDS later.
(I was wrong about this BTW: being able to flush out all the dirty
pages in a file to disk using a single COMPOUND would indeed be worth
the trouble once we've managed to drop UDP as the primary NFS
transport mechanism. For one thing, you would only tie up a single
nfsd thread when writing to the file)

Cheers,
Trond

^ permalink raw reply [flat|nested] 87+ messages in thread
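To make the head/pages/tail layout of the structure above concrete, here is a small user-space sketch that walks such a buffer and emits one iovec per contiguous piece. It is an illustration under assumptions, not the kernel's xdr_kmap(): `struct sim_xdr_buf` and `sim_xdr_to_iov()` are hypothetical, plain `char *` buffers stand in for `struct page *`, the `list` member is omitted, and the page size is fixed at 4096.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SIM_PAGE_SIZE 4096u

/* User-space stand-in for the quoted structure (no list member). */
struct sim_xdr_buf {
	struct iovec head[1];	/* RPC header + non-page data */
	struct iovec tail[1];	/* appended after page data */
	char **pages;		/* stand-in for struct page ** */
	unsigned int page_base;	/* offset of page data in the first page */
	unsigned int page_len;	/* length of page data */
	unsigned int len;	/* total length of data */
};

/* Fill out[] with head, each page fragment, then tail.
 * Returns the number of entries used, or -1 if max is too small. */
static int sim_xdr_to_iov(const struct sim_xdr_buf *xb,
			  struct iovec *out, int max)
{
	unsigned int pg = xb->page_base / SIM_PAGE_SIZE;
	unsigned int off = xb->page_base % SIM_PAGE_SIZE;
	unsigned int left = xb->page_len;
	int n = 0;

	assert(out != NULL);
	if (xb->head[0].iov_len) {
		if (n == max)
			return -1;
		out[n++] = xb->head[0];
	}
	while (left) {
		unsigned int chunk = SIM_PAGE_SIZE - off;

		if (chunk > left)
			chunk = left;
		if (n == max)
			return -1;
		out[n].iov_base = xb->pages[pg] + off;
		out[n].iov_len = chunk;
		n++;
		left -= chunk;
		off = 0;	/* later pages start at offset 0 */
		pg++;
	}
	if (xb->tail[0].iov_len) {
		if (n == max)
			return -1;
		out[n++] = xb->tail[0];
	}
	return n;
}
```

Only the first page fragment honours `page_base`; every later fragment starts at the beginning of its page, which is why a single base/length pair is enough to describe the whole page run.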
* Re: Re: [PATCH] zerocopy NFS for 2.5.43
2002-10-25 17:23 ` Trond Myklebust
@ 2002-10-26 3:26 ` Hirokazu Takahashi
0 siblings, 0 replies; 87+ messages in thread
From: Hirokazu Takahashi @ 2002-10-26 3:26 UTC (permalink / raw)
To: trond.myklebust; +Cc: neilb, nfs

Hello,

> Then the following trivial modification would be quite sufficient

Yes, it looks good as it's rare to use two or more xdr_bufs.
We can allocate extra xdr_bufs dynamically.

> struct xdr_buf {
> 	struct list_head list;		/* Further xdr_buf */
> 	struct iovec	head[1],	/* RPC header + non-page data */
> 			tail[1];	/* Appended after page data */
>
> 	struct page **	pages;		/* Array of contiguous pages */
> 	unsigned int	page_base,	/* Start of page data */
> 			page_len;	/* Length of page data */
>
> 	unsigned int	len;		/* Total length of data */
>
> };
>
> With equally trivial fixes to xdr_kmap() and friends. None of this
> needs to affect existing client usage, and may in fact be useful for
> optimizing use of v4 COMPOUNDS later.
> (I was wrong about this BTW: being able to flush out all the dirty
> pages in a file to disk using a single COMPOUND would indeed be worth
> the trouble once we've managed to drop UDP as the primary NFS
> transport mechanism. For one thing, you would only tie up a single
> nfsd thread when writing to the file)

^ permalink raw reply [flat|nested] 87+ messages in thread
end of thread [~2002-11-04 21:45 UTC]
Thread overview: 87+ messages
2002-09-18 8:14 [PATCH] zerocopy NFS for 2.5.36 Hirokazu Takahashi
2002-09-18 23:00 ` David S. Miller
2002-09-18 23:54 ` Alan Cox
2002-09-18 23:54 ` Alan Cox
2002-09-19 0:16 ` Andrew Morton
2002-09-19 2:13 ` Aaron Lehmann
2002-09-19 3:30 ` Andrew Morton
2002-09-19 3:30 ` Andrew Morton
2002-09-19 10:42 ` Alan Cox
2002-09-19 10:42 ` Alan Cox
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
2002-09-19 20:42 ` Andrew Morton
2002-09-19 21:12 ` David S. Miller
2002-09-19 21:12 ` [NFS] " David S. Miller
2002-09-21 11:56 ` Pavel Machek
2002-09-21 11:56 ` Pavel Machek
2002-10-14 5:50 ` Neil Brown
2002-10-14 6:15 ` David S. Miller
2002-10-14 10:45 ` kuznet
2002-10-14 10:48 ` David S. Miller
2002-10-14 12:01 ` Hirokazu Takahashi
2002-10-14 14:12 ` Andrew Theurer
2002-10-16 3:44 ` Neil Brown
2002-10-16 4:31 ` David S. Miller
2002-10-16 15:04 ` Andrew Theurer
2002-10-17 2:03 ` [NFS] " Andrew Theurer
2002-10-17 2:31 ` Hirokazu Takahashi
2002-10-17 13:16 ` Andrew Theurer
2002-10-17 13:16 ` [NFS] " Andrew Theurer
2002-10-17 13:26 ` Hirokazu Takahashi
2002-10-17 13:26 ` [NFS] " Hirokazu Takahashi
2002-10-17 14:10 ` Andrew Theurer
2002-10-17 16:26 ` Hirokazu Takahashi
2002-10-17 16:26 ` [NFS] " Hirokazu Takahashi
2002-10-18 5:38 ` Trond Myklebust
2002-10-18 7:19 ` Hirokazu Takahashi
2002-10-18 15:12 ` Andrew Theurer
2002-10-18 15:12 ` [NFS] " Andrew Theurer
2002-10-19 20:34 ` Hirokazu Takahashi
2002-10-19 20:34 ` [NFS] " Hirokazu Takahashi
2002-10-22 21:16 ` Andrew Theurer
2002-10-22 21:16 ` [NFS] " Andrew Theurer
2002-10-23 9:29 ` Hirokazu Takahashi
2002-10-24 15:32 ` Andrew Theurer
2002-10-27 11:10 ` Hirokazu Takahashi
2002-10-16 11:09 ` Hirokazu Takahashi
2002-10-16 17:02 ` kaza
2002-10-17 4:36 ` rddunlap
2002-10-18 13:11 ` [PATCH] zerocopy NFS for 2.5.43 Hirokazu Takahashi
2002-10-23 1:18 ` Neil Brown
2002-10-23 3:53 ` Hirokazu Takahashi
2002-10-23 5:40 ` Hirokazu Takahashi
2002-10-23 6:03 ` Neil Brown
2002-10-23 22:35 ` Hirokazu Takahashi
2002-10-23 6:10 ` Neil Brown
2002-10-23 7:08 ` Hirokazu Takahashi
2002-10-23 15:23 ` Trond Myklebust
2002-10-23 21:50 ` Hirokazu Takahashi
2002-10-23 23:55 ` Trond Myklebust
2002-10-24 1:33 ` Hirokazu Takahashi
2002-10-27 10:39 ` Hirokazu Takahashi
2002-10-28 16:31 ` Trond Myklebust
2002-10-28 23:39 ` Hirokazu Takahashi
2002-10-29 6:36 ` Hirokazu Takahashi
2002-10-29 15:09 ` Trond Myklebust
2002-10-29 16:27 ` Hirokazu Takahashi
2002-10-29 16:49 ` Trond Myklebust
2002-10-30 3:18 ` Hirokazu Takahashi
2002-10-25 9:52 ` Hirokazu Takahashi
2002-10-25 12:41 ` Neil Brown
2002-10-26 3:11 ` Hirokazu Takahashi
2002-10-26 3:46 ` Benjamin LaHaise
2002-10-27 22:46 ` Neil Brown
2002-10-30 23:29 ` Hirokazu Takahashi
2002-10-30 23:53 ` Neil Brown
2002-10-31 2:06 ` Hirokazu Takahashi
2002-10-31 15:40 ` Hirokazu Takahashi
2002-10-31 16:56 ` Hirokazu Takahashi
2002-11-01 1:10 ` Neil Brown
2002-11-04 21:13 ` Andrew Theurer
2002-11-01 0:54 ` Neil Brown
2002-11-01 1:39 ` Hirokazu Takahashi
2002-11-01 3:41 ` Hirokazu Takahashi
2002-11-01 4:20 ` Neil Brown
2002-11-01 5:07 ` Hirokazu Takahashi
2002-10-25 17:23 ` Trond Myklebust
2002-10-26 3:26 ` Hirokazu Takahashi