* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-20 1:00 ` Andi Kleen
0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2002-09-20 1:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Hirokazu Takahashi, alan, davem, neilb, linux-kernel, nfs
Andrew Morton <akpm@digeo.com> writes:
> Hirokazu Takahashi wrote:
> >
> > ...
> > > It needs redoing. These differences are really big, and this
> > > is the kernel's most expensive function.
> > >
> > > A little project for someone.
> >
> > OK, if there is nobody who wants to do it I'll do it by myself.
>
> That would be fantastic - thanks. This is more a measurement
> and testing exercise than a coding one. And if those measurements
> are sufficiently nice (eg: >5%) then a 2.4 backport should be done.

Very interesting IMHO would be to find a heuristic to switch between
a write combining copy and a cache hot copy. Write combining is good
for blasting huge amounts of data quickly without killing your caches.
Cache hot is good for everything else.

But it'll need hints from the higher level code. e.g. read and write
could turn on write combining for bigger writes (let's say >8K)
I discovered that just unconditionally turning it on for all copies
is not good because it forces data out of cache. But I still have hope
that it helps for selected copies.

-Andi
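The heuristic proposed in the message above could look roughly like the following. This is only a sketch under stated assumptions: the names `choose_copy`, `copy_kind`, and `NT_COPY_THRESHOLD` are hypothetical and do not come from the kernel, and the 8K threshold is simply the guess from the mail, not a measured value.

```c
#include <stddef.h>

/* Hypothetical threshold: the ">8K" guess from the mail, not a tuned value. */
#define NT_COPY_THRESHOLD (8 * 1024)

/* Hypothetical names for the two copy strategies discussed in the thread. */
enum copy_kind {
    COPY_CACHE_HOT,        /* ordinary copy, destination stays in cache */
    COPY_WRITE_COMBINING   /* non-temporal stores, destination bypasses cache */
};

/*
 * Pick a copy strategy. streaming_hint is the "hint from the higher level
 * code": nonzero when the caller (for example a large read or write) does
 * not expect the destination to be touched again soon.
 */
static enum copy_kind choose_copy(size_t len, int streaming_hint)
{
    if (streaming_hint && len > NT_COPY_THRESHOLD)
        return COPY_WRITE_COMBINING;
    return COPY_CACHE_HOT;   /* the safe default for everything else */
}
```

Keeping cache-hot as the default matches the observation that turning write combining on unconditionally forces useful data out of cache; only copies that are both large and flagged by the caller take the other path.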
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-20 1:09 ` Andrew Morton
0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2002-09-20 1:09 UTC (permalink / raw)
To: Andi Kleen; +Cc: Hirokazu Takahashi, alan, davem, neilb, linux-kernel, nfs
Andi Kleen wrote:
>
> Andrew Morton <akpm@digeo.com> writes:
>
> > Hirokazu Takahashi wrote:
> > >
> > > ...
> > > > It needs redoing. These differences are really big, and this
> > > > is the kernel's most expensive function.
> > > >
> > > > A little project for someone.
> > >
> > > OK, if there is nobody who wants to do it I'll do it by myself.
> >
> > That would be fantastic - thanks. This is more a measurement
> > and testing exercise than a coding one. And if those measurements
> > are sufficiently nice (eg: >5%) then a 2.4 backport should be done.
>
> Very interesting IMHO would be to find a heuristic to switch between
> a write combining copy and a cache hot copy. Write combining is good
> for blasting huge amounts of data quickly without killing your caches.
> Cache hot is good for everything else.

I expect that caching userspace and not pagecache would be
a reasonable choice.

> But it'll need hints from the higher level code. e.g. read and write
> could turn on write combining for bigger writes (let's say >8K)
> I discovered that just unconditionally turning it on for all copies
> is not good because it forces data out of cache. But I still have hope
> that it helps for selected copies.

Well if it's a really big read then bypassing the CPU cache on
the userspace-side buffer would make sense.

Can you control the cacheability of the memory reads as well?

What restrictions are there on these instructions? Would
they force us to bear the cost of the alignment problem?
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-20 1:23 ` Andi Kleen
0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2002-09-20 1:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Hirokazu Takahashi, alan, davem, neilb, linux-kernel, nfs
On Thu, Sep 19, 2002 at 06:09:34PM -0700, Andrew Morton wrote:
> > Very interesting IMHO would be to find a heuristic to switch between
> > a write combining copy and a cache hot copy. Write combining is good
> > for blasting huge amounts of data quickly without killing your caches.
> > Cache hot is good for everything else.
>
> I expect that caching userspace and not pagecache would be
> a reasonable choice.

Normally yes, but not always. e.g. for squid you don't really want to
cache user space. But I guess it would be a reasonable heuristic. Or at
least worth a try :-)

> > But it'll need hints from the higher level code. e.g. read and write
> > could turn on write combining for bigger writes (let's say >8K)
> > I discovered that just unconditionally turning it on for all copies
> > is not good because it forces data out of cache. But I still have hope
> > that it helps for selected copies.
>
> Well if it's a really big read then bypassing the CPU cache on
> the userspace-side buffer would make sense.
>
> Can you control the cacheability of the memory reads as well?

SSE2 has hints for that (prefetchnta and even prefetcht0,1 etc. for
different cache hierarchies), but it's not completely clear how much
the CPUs follow these. For writing it's much more obvious and usually
documented even.

> What restrictions are there on these instructions? Would
> they force us to bear the cost of the alignment problem?

They should be aligned, otherwise it makes no sense. When you assume it's
more likely that either the source or the destination is unaligned, then
you can easily align one of the two. The trick is to choose the right
one; it varies by call site. (These are for big copies, so a small
alignment function is lost in the noise.)

x86-64 copy_*_user currently aligns the destination, but hardcoding that
is a bit dumb and I'm not completely happy with it.

-Andi
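The "align the destination first" approach described above can be sketched in plain C as follows. This is an illustration only, not the x86-64 copy_*_user code: `COPY_ALIGN`, `bytes_to_align`, and `copy_dst_aligned` are hypothetical names, the 16-byte alignment is an assumption, and the bulk step is a plain `memcpy` standing in for whatever streaming loop would really be used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed alignment of the bulk copy loop (16 bytes, as for SSE stores). */
#define COPY_ALIGN 16

/* Number of leading bytes to copy before p reaches COPY_ALIGN alignment. */
static size_t bytes_to_align(const void *p)
{
    uintptr_t misalign = (uintptr_t)p & (COPY_ALIGN - 1);
    return misalign ? COPY_ALIGN - misalign : 0;
}

/* Copy len bytes, handling an unaligned destination head separately. */
static void *copy_dst_aligned(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t head = bytes_to_align(d);

    if (head > len)
        head = len;

    /* Small head copy: cheap compared with the big copy that follows. */
    memcpy(d, s, head);

    /* Bulk copy: the destination is now aligned, the source may not be. */
    memcpy(d + head, s + head, len - head);
    return dst;
}
```

Aligning the source instead would only change which pointer is passed to `bytes_to_align`; which of the two is worth aligning is exactly the per-call-site choice the message says should not be hardcoded.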
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-20 1:23 ` [NFS] " Andi Kleen
(?)
@ 2002-09-20 1:27 ` David S. Miller
2002-09-20 2:06 ` Andi Kleen
-1 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2002-09-20 1:27 UTC (permalink / raw)
To: ak; +Cc: akpm, taka, alan, neilb, linux-kernel, nfs
   From: Andi Kleen <ak@suse.de>
   Date: Fri, 20 Sep 2002 03:23:46 +0200

   On Thu, Sep 19, 2002 at 06:09:34PM -0700, Andrew Morton wrote:
   > Can you control the cachability of the memory reads as well?

   SSE2 has hints for that (prefetchnti and even prefetcht0,1 etc. for
   different cache hierarchies), but it's not completely clear on how much
   the CPUs follow these. For writing it's much more obvious and usually
   documented even.

See "movntdq/movnti", the latter of which even works on general-purpose
registers. Ben LaHaise pointed this out to me earlier today.
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-20 1:27 ` David S. Miller
@ 2002-09-20 2:06 ` Andi Kleen
2002-09-20 2:01 ` David S. Miller
0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2002-09-20 2:06 UTC (permalink / raw)
To: David S. Miller; +Cc: ak, akpm, taka, alan, neilb, linux-kernel, nfs
> See "movntdq/movnti", the latter of which even works on general-purpose
> registers. Ben LaHaise pointed this out to me earlier today.

The issue is that you really want to do prefetching in these loops
(waiting for the hardware prefetch is too slow because it needs several
cache misses to trigger) so for cache hints on reading only prefetch
instructions are interesting.

-Andi
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-20 2:06 ` Andi Kleen
@ 2002-09-20 2:01 ` David S. Miller
2002-09-20 2:28 ` Andi Kleen
0 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2002-09-20 2:01 UTC (permalink / raw)
To: ak; +Cc: akpm, taka, alan, neilb, linux-kernel, nfs
   From: Andi Kleen <ak@suse.de>
   Date: Fri, 20 Sep 2002 04:06:19 +0200

   > See "movntdq/movnti", the latter of which even works on general-purpose
   > registers. Ben LaHaise pointed this out to me earlier today.

   The issue is that you really want to do prefetching in these loops
   (waiting for the hardware prefetch is too slow because it needs several
   cache misses to trigger) so for cache hints on reading only prefetch
   instructions are interesting.

I'm talking about using this to bypass the cache on the stores.
The prefetches are a separate issue and I agree with you on that.
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-20 2:01 ` David S. Miller
@ 2002-09-20 2:28 ` Andi Kleen
2002-09-20 2:20 ` David S. Miller
0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2002-09-20 2:28 UTC (permalink / raw)
To: David S. Miller; +Cc: ak, akpm, taka, alan, neilb, linux-kernel, nfs
On Thu, Sep 19, 2002 at 07:01:54PM -0700, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Fri, 20 Sep 2002 04:06:19 +0200
>
> > See "movntdq/movnti", the latter of which even works on general-purpose
> > registers. Ben LaHaise pointed this out to me earlier today.
>
> The issue is that you really want to do prefetching in these loops
> (waiting for the hardware prefetch is too slow because it needs several
> cache misses to trigger) so for cache hints on reading only prefetch
> instructions are interesting.
>
> I'm talking about using this to bypass the cache on the stores.
> The prefetches are a separate issue and I agree with you on that.

I was talking generally. You cannot really use these instructions on
Athlon, because they're microcoded and slow or do not exist. On Athlon it
needs 3dnow write combining functions (adding FPU overhead so may not be
worth it). On P3/P4 you can use movnti/movntdq yes. Just doing it for
reads is more tricky/dubious.

-Andi
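For readers unfamiliar with the instructions being discussed, here is a userspace sketch of a copy whose stores bypass the cache. It is not kernel code and not anything posted in this thread: `copy_nocache_store` is a hypothetical name, and it assumes a compiler that provides the SSE2 intrinsics in `<emmintrin.h>`, where `_mm_stream_si128` is the intrinsic form of movntdq. On builds without SSE2 it falls back to `memcpy`.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy len bytes with non-temporal stores.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16;
 * src may be unaligned. Loads are ordinary cached loads.
 */
static void copy_nocache_store(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < len / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* unaligned load from source */
        _mm_stream_si128(d + i, v);          /* movntdq: store bypasses cache */
    }
    _mm_sfence();  /* make the streamed stores visible before returning */
#else
    memcpy(dst, src, len);  /* no SSE2: plain cached copy */
#endif
}
```

Only the store side is non-temporal here, which is the case David Miller is describing. In the kernel, touching SSE registers also means saving and restoring FPU state, which is the overhead Andi Kleen alludes to for the 3dnow variant.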
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-20 2:28 ` Andi Kleen
@ 2002-09-20 2:20 ` David S. Miller
2002-09-20 2:35 ` Andi Kleen
0 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2002-09-20 2:20 UTC (permalink / raw)
To: ak; +Cc: akpm, taka, alan, neilb, linux-kernel, nfs
   From: Andi Kleen <ak@suse.de>
   Date: Fri, 20 Sep 2002 04:28:19 +0200

   You cannot really use these instructions on Athlon,

I know that Athlon lacks these instructions, they are p4 sse2
only.
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-20 2:20 ` David S. Miller
@ 2002-09-20 2:35 ` Andi Kleen
0 siblings, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2002-09-20 2:35 UTC (permalink / raw)
To: David S. Miller; +Cc: ak, akpm, taka, alan, neilb, linux-kernel, nfs
On Thu, Sep 19, 2002 at 07:20:48PM -0700, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Fri, 20 Sep 2002 04:28:19 +0200
>
> You cannot really use these instructions on Athlon,
>
> I know that Athlon lacks these instructions, they are p4 sse2
> only.

AFAIK it is an SSE1 feature. Athlon actually has movnti in newer models,
just you do not really want to use it.

-Andi
* Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-10-16 3:44 Neil Brown
2002-10-16 4:31 ` David S. Miller
0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2002-10-16 3:44 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: davem, linux-kernel, nfs
On Monday October 14, taka@valinux.co.jp wrote:
> > The bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket has to block any other requests while making a UDP frame.
>
After thinking about this some more, I suspect it would have to be
quite large scale SMP to get much contention.
The only contention on the udp socket is, as you say, assembling a udp
frame, and I would be surprised if that takes a substantial fraction
of the time to handle a request.
Presumably on a sufficiently large SMP machine that this became an
issue, there would be multiple NICs. Maybe it would make sense to
have one udp socket for each NIC. Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.
NeilBrown
* Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 3:44 Neil Brown
@ 2002-10-16 4:31 ` David S. Miller
2002-10-17 2:03 ` [NFS] " Andrew Theurer
0 siblings, 1 reply; 20+ messages in thread
From: David S. Miller @ 2002-10-16 4:31 UTC (permalink / raw)
To: neilb; +Cc: taka, linux-kernel, nfs
   From: Neil Brown <neilb@cse.unsw.edu.au>
   Date: Wed, 16 Oct 2002 13:44:04 +1000

   Presumably on a sufficiently large SMP machine that this became an
   issue, there would be multiple NICs. Maybe it would make sense to
   have one udp socket for each NIC. Would that make sense? or work?
   It feels to me to be cleaner than one for each CPU.

Doesn't make much sense.

Usually we are talking via one IP address, and thus over
one device. It could be using multiple NICs via BONDING,
but that would be transparent to anything at the socket
level.

Really, I think there is real value to making the socket
per-cpu even on a 2 or 4 way system.
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-16 4:31 ` David S. Miller
@ 2002-10-17 2:03 ` Andrew Theurer
2002-10-17 2:31 ` Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2002-10-17 2:03 UTC (permalink / raw)
To: neilb, David S. Miller; +Cc: taka, linux-kernel, nfs
> From: Neil Brown <neilb@cse.unsw.edu.au>
> Date: Wed, 16 Oct 2002 13:44:04 +1000
>
> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs. Maybe it would make sense to
> have one udp socket for each NIC. Would that make sense? or work?
> It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device. It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am still seeing some sort of problem on an 8 way (hyperthreaded 8
logical/4 physical) on UDP with these patches. I cannot get more than 2
NFSd threads in a run state at one time. TCP usually has 8 or more. The
test involves 40 100Mbit clients reading a 200 MB file on one server (4
acenic adapters) in cache. I am fighting some other issues at the moment
(acpi weirdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
think the results will get better. I'm not sure what other lock or
contention point this is hitting on UDP. If there is anything I can do to
help, please let me know, thanks.

Andrew Theurer
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 2:03 ` [NFS] " Andrew Theurer
@ 2002-10-17 2:31 ` Hirokazu Takahashi
2002-10-17 13:16 ` Andrew Theurer
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 2:31 UTC (permalink / raw)
To: habanero; +Cc: neilb, davem, linux-kernel, nfs
Hello,

Thanks for testing my patches.

> I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> logical/4 physical) on UDP with these patches. I cannot get more than 2
> NFSd threads in a run state at one time. TCP usually has 8 or more. The
> test involves 40 100Mbit clients reading a 200 MB file on one server (4
> acenic adapters) in cache. I am fighting some other issues at the moment
> (acpi weirdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> 138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
> 181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
> think the results will get better. I'm not sure what other lock or
> contention point this is hitting on UDP. If there is anything I can do to
> help, please let me know, thanks.

I guess some UDP packets might be lost. It may happen easily as the UDP
protocol doesn't support flow control.
Can you check how many errors have happened?
You can see them in /proc/net/snmp of the server and the clients.

And how many threads did you start on your machine?
The buffer size of a UDP socket depends on the number of kNFS threads.
A large number of threads might help you.
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 2:31 ` Hirokazu Takahashi
@ 2002-10-17 13:16 ` Andrew Theurer
2002-10-17 13:26 ` Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2002-10-17 13:16 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs
Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
> Hello,
>
> Thanks for testing my patches.
>
> > I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> > logical/4 physical) on UDP with these patches. I cannot get more than 2
> > NFSd threads in a run state at one time. TCP usually has 8 or more. The
> > test involves 40 100Mbit clients reading a 200 MB file on one server (4
> > acenic adapters) in cache. I am fighting some other issues at the moment
> > (acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> > 138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
> > 181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
> > think the results will get better. I'm not sure what other lock or
> > contention point this is hitting on UDP. If there is anything I can do to
> > help, please let me know, thanks.
>
> I guess some UDP packets might be lost. It may happen easily as UDP protocol
> doesn't support flow control.
> Can you check how many errors has happened?
> You can see them in /proc/net/snmp of the server and the clients.

server:
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 1000665 41 0 1000666

clients:
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 200403 0 0 200406
(all clients the same)

> And how many threads did you start on your machine?
> Buffer size of a UDP socket depends on number of kNFS threads.
> Large number of threads might help you.

128 threads. client rsize=8196. Server and client MTU is 1500.

Andrew Theurer
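The counters quoted above come in /proc/net/snmp's two-line form: a line of field names followed by a line of values in the same order. A small sketch of pulling one counter out of such a pair is shown below; `snmp_field` is a hypothetical helper written for this illustration, and the sample strings in the usage note are the server figures from the message above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in a /proc/net/snmp style line pair.
 * names:  e.g. "Udp: InDatagrams NoPorts InErrors OutDatagrams"
 * values: e.g. "Udp: 1000665 41 0 1000666"
 * Returns 0 and stores the counter in *out, or -1 if the field is absent.
 */
static int snmp_field(const char *names, const char *values,
                      const char *field, unsigned long *out)
{
    char name[64], value[64];
    int n_used, v_used;

    /* Skip the protocol prefix ("Udp:") at the start of both lines. */
    if (sscanf(names, "%63s%n", name, &n_used) != 1)
        return -1;
    if (sscanf(values, "%63s%n", value, &v_used) != 1)
        return -1;
    names += n_used;
    values += v_used;

    /* Walk both lines in lockstep, one whitespace-separated token each. */
    while (sscanf(names, "%63s%n", name, &n_used) == 1 &&
           sscanf(values, "%63s%n", value, &v_used) == 1) {
        if (strcmp(name, field) == 0) {
            *out = strtoul(value, NULL, 10);
            return 0;
        }
        names += n_used;
        values += v_used;
    }
    return -1;
}
```

Called with the server's two Udp lines and the field name "InErrors", this yields 0, which is the figure relevant to the packet-loss question; "NoPorts" yields 41.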
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 13:16 ` Andrew Theurer
@ 2002-10-17 13:26 ` Hirokazu Takahashi
2002-10-17 14:10 ` [NFS] " Andrew Theurer
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 13:26 UTC (permalink / raw)
To: habanero; +Cc: neilb, davem, linux-kernel, nfs
Hi,

> server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> Udp: 1000665 41 0 1000666
> clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> Udp: 200403 0 0 200406
> (all clients the same)

How about IP datagrams? You can see the IP fields in /proc/net/snmp
IP layer may also discard them.

> > And how many threads did you start on your machine?
> > Buffer size of a UDP socket depends on number of kNFS threads.
> > Large number of threads might help you.
>
> 128 threads. client rsize=8196. Server and client MTU is 1500.

It seems enough...
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 13:26 ` Hirokazu Takahashi
@ 2002-10-17 14:10 ` Andrew Theurer
2002-10-17 16:26 ` Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2002-10-17 14:10 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs
> Hi,
>
> > server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> > Udp: 1000665 41 0 1000666
> > clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> > Udp: 200403 0 0 200406
> > (all clients the same)
>
> How about IP datagrams? You can see the IP fields in /proc/net/snmp
> IP layer may also discard them.

Server:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000

A Client:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0

Andrew Theurer
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 14:10 ` [NFS] " Andrew Theurer
@ 2002-10-17 16:26 ` Hirokazu Takahashi
2002-10-18 5:38 ` [NFS] " Trond Myklebust
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-17 16:26 UTC (permalink / raw)
To: habanero; +Cc: neilb, davem, linux-kernel, nfs
Hi,

> > How about IP datagrams? You can see the IP fields in /proc/net/snmp
> > IP layer may also discard them.
>
> Server:
>
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 1 64 4088714 0 0 720 0 0 4086393 12233109 2 0 0 0 0 0 0 0 6000000
>
> A Client:
>
> Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams
> InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes
> ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
> Ip: 2 64 2115252 0 0 0 0 0 1115244 646510 0 0 0 1200000 200008 0 0 0 0

It looks fine.

Hmmm.... What version of linux do you use?

Congestion avoidance mechanism of NFS clients might cause this situation.
I think the congestion window size is not enough for high end machines.
You can make the window be larger as a test.
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-17 16:26 ` Hirokazu Takahashi
@ 2002-10-18 5:38 ` Trond Myklebust
2002-10-18 7:19 ` Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Trond Myklebust @ 2002-10-18 5:38 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: habanero, neilb, davem, linux-kernel, nfs
>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > Congestion avoidance mechanism of NFS clients might cause this
     > situation. I think the congestion window size is not enough
     > for high end machines. You can make the window be larger as a
     > test.

The congestion avoidance window is supposed to adapt to the bandwidth
that is available. Turn congestion avoidance off if you like, but my
experience is that doing so tends to seriously degrade performance as
the number of timeouts + resends skyrockets.

Cheers,
  Trond
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-18 5:38 ` [NFS] " Trond Myklebust
@ 2002-10-18 7:19 ` Hirokazu Takahashi
2002-10-18 15:12 ` Andrew Theurer
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-18 7:19 UTC (permalink / raw)
To: trond.myklebust; +Cc: habanero, neilb, davem, linux-kernel, nfs
Hello,

> > Congestion avoidance mechanism of NFS clients might cause this
> > situation. I think the congestion window size is not enough
> > for high end machines. You can make the window be larger as a
> > test.
>
> The congestion avoidance window is supposed to adapt to the bandwidth
> that is available. Turn congestion avoidance off if you like, but my
> experience is that doing so tends to seriously degrade performance as
> the number of timeouts + resends skyrockets.

Yes, you must be right.

But I guess Andrew may be using such a fast machine that the transfer
rate has exceeded the maximum size of the congestion avoidance window.
Can we determine a preferable maximum window size dynamically?

Thank you,
Hirokazu Takahashi.
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-18 7:19 ` Hirokazu Takahashi
@ 2002-10-18 15:12 ` Andrew Theurer
2002-10-19 20:34 ` Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2002-10-18 15:12 UTC (permalink / raw)
To: trond.myklebust, Hirokazu Takahashi; +Cc: neilb, davem, linux-kernel, nfs
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation. I think the congestion window size is not enough
> > > for high end machines. You can make the window be larger as a
> > > test.
> >
> > The congestion avoidance window is supposed to adapt to the bandwidth
> > that is available. Turn congestion avoidance off if you like, but my
> > experience is that doing so tends to seriously degrade performance as
> > the number of timeouts + resends skyrockets.
>
> Yes, you must be right.
>
> But I guess Andrew may use a great machine so that the transfer rate
> has exceeded the maximum size of the congestion avoidance window.
> Can we determine preferable maximum window size dynamically?

Is this a concern on the client only? I can run a test with just one client
and see if I can saturate the 100Mbit adapter. If I can, would we need to
make any adjustments then? FYI, at 115 MB/sec total throughput, that's only
2.875 MB/sec for each of the 40 clients. For the TCP result of 181 MB/sec,
that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
100Mbit client.

Andrew Theurer
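The per-client figures in the message above follow from dividing the aggregate throughput by the 40 clients. The sketch below reproduces that arithmetic; the function names are hypothetical, and the 12.5 MB/sec link figure is an assumption (100 Mbit/s divided by 8, ignoring all protocol overhead) used only to put the numbers in proportion.

```c
/* Raw 100 Mbit/s line rate in MB/sec; assumption: no protocol overhead. */
#define LINK_MB_PER_SEC (100.0 / 8.0)

/* Aggregate server throughput shared evenly across the clients. */
static double per_client_mb(double total_mb_per_sec, int clients)
{
    return total_mb_per_sec / clients;
}

/* Fraction of the raw link rate that one client is using. */
static double link_fraction(double client_mb_per_sec)
{
    return client_mb_per_sec / LINK_MB_PER_SEC;
}
```

With the numbers from the thread, 115 MB/sec over 40 clients is 2.875 MB/sec each and 181 MB/sec is 4.525 MB/sec each, roughly a quarter and a third of the raw link rate respectively, which is why a single 100Mbit client is not expected to be the limiting factor.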
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-18 15:12 ` Andrew Theurer
@ 2002-10-19 20:34 ` Hirokazu Takahashi
2002-10-22 21:16 ` Andrew Theurer
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-19 20:34 UTC (permalink / raw)
To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hello,

> > Congestion avoidance mechanism of NFS clients might cause this
> > situation. I think the congestion window size is not enough
> > for high end machines. You can make the window be larger as a
> > test.

> Is this a concern on the client only? I can run a test with just one client
> and see if I can saturate the 100Mbit adapter. If I can, would we need to
> make any adjustments then? FYI, at 115 MB/sec total throughput, that's only
> 2.875 MB/sec for each of the 40 clients. For the TCP result of 181 MB/sec,
> that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
> 100Mbit client.

I think it's a client issue. NFS servers don't care about congestion of UDP
traffic and they will try to respond to all NFS requests as fast as they can.

You can try to increase the number of clients or the number of mount points
for a test. It's easy to mount the same directory of the server on several
directories of the client so that each of them can work simultaneously.

   # mount -t nfs server:/foo /baa1
   # mount -t nfs server:/foo /baa2
   # mount -t nfs server:/foo /baa3

Thank you,
Hirokazu Takahashi.
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-19 20:34 ` Hirokazu Takahashi
@ 2002-10-22 21:16 ` Andrew Theurer
2002-10-23 9:29 ` [NFS] " Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2002-10-22 21:16 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

On Saturday 19 October 2002 15:34, Hirokazu Takahashi wrote:
> Hello,
>
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation. I think the congestion window size is not enough
> > > for high end machines. You can make the window be larger as a
> > > test.
> >
> > Is this a concern on the client only? I can run a test with just one
> > client and see if I can saturate the 100Mbit adapter. If I can, would we
> > need to make any adjustments then? FYI, at 115 MB/sec total throughput,
> > that's only 2.875 MB/sec for each of the 40 clients. For the TCP result
> > of 181 MB/sec, that's 4.525 MB/sec, IMO, both of which are comfortable
> > throughputs for a 100Mbit client.
>
> I think it's a client issue. NFS servers don't care about congestion of UDP
> traffic and they will try to respond to all NFS requests as fast as they
> can.
>
> You can try to increase the number of clients or the number of mount points
> for a test. It's easy to mount the same directory of the server on some
> directories of the client so that each of them can work simultaneously.
> # mount -t nfs server:/foo /baa1
> # mount -t nfs server:/foo /baa2
> # mount -t nfs server:/foo /baa3

I don't think it is a client congestion issue at this point. I can run the
test with just one client on UDP and achieve 11.2 MB/sec with just one mount
point. The client has 100 Mbit Ethernet, so that should be the upper limit (or
really close). In the 40 client read test, I have only achieved 2.875 MB/sec
per client. That and the fact that there are never more than 2 nfsd threads
in a run state at one time (for UDP only) leads me to believe there is still
a scaling problem on the server for UDP. I will continue to run the test and
poke and prod around. Hopefully something will jump out at me. Thanks for all
the input!

Andrew Theurer
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-22 21:16 ` Andrew Theurer
@ 2002-10-23 9:29 ` Hirokazu Takahashi
2002-10-24 15:32 ` Andrew Theurer
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-23 9:29 UTC (permalink / raw)
To: habanero; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

Hi,

> > > > Congestion avoidance mechanism of NFS clients might cause this
> > > > situation. I think the congestion window size is not enough
> > > > for high end machines. You can make the window be larger as a
> > > > test.

> I don't think it is a client congestion issue at this point. I can run the
> test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> point. The client has 100 Mbit Ethernet, so should be the upper limit (or
> really close). In the 40 client read test, I have only achieved 2.875 MB/sec
> per client. That and the fact that there are never more than 2 nfsd threads
> in a run state at one time (for UDP only) leads me to believe there is still
> a scaling problem on the server for UDP. I will continue to run the test and
> poke and prod around. Hopefully something will jump out at me. Thanks for all
> the input!

Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
been retransmitted?

   # cat /proc/net/rpc/nfsd
   rc 0 27680 162118
      ^^^
   This field means the clients have retransmitted packets.
   The transmission rate will slow down once that has happened.
   It may occur if the response from the server is slower than the
   clients expect.

And you can use an older version - e.g. linux-2.4 series - for clients
and see what will happen, as older versions don't have any intelligent
features.

Thank you,
Hirokazu Takahashi.
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-23 9:29 ` [NFS] " Hirokazu Takahashi
@ 2002-10-24 15:32 ` Andrew Theurer
2002-10-27 11:10 ` Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Theurer @ 2002-10-24 15:32 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: trond.myklebust, neilb, davem, linux-kernel, nfs

> > I don't think it is a client congestion issue at this point. I can run the
> > test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> > point. The client has 100 Mbit Ethernet, so should be the upper limit (or
> > really close). In the 40 client read test, I have only achieved 2.875 MB/sec
> > per client. That and the fact that there are never more than 2 nfsd threads
> > in a run state at one time (for UDP only) leads me to believe there is still
> > a scaling problem on the server for UDP. I will continue to run the test and
> > poke and prod around. Hopefully something will jump out at me. Thanks for all
> > the input!
>
> Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
> been retransmitted?
>
> # cat /proc/net/rpc/nfsd
> rc 0 27680 162118
>    ^^^
> This field means the clients have retransmitted packets.
> The transmission rate will slow down once that has happened.
> It may occur if the response from the server is slower than the
> clients expect.

/proc/net/rpc/nfsd
rc 0 1 1025221

> And you can use an older version - e.g. linux-2.4 series - for clients
> and see what will happen, as older versions don't have any intelligent
> features.

Actually all of the clients are 2.4 (RH 7.0). I could change them out to 2.5,
but it may take me a little while.

Let me do a little digging around. I seem to recall an issue I had earlier
this year when waking up the nfsd threads and having most of them just go
back to sleep. I need to go back to that code and understand it a little
better. Thanks for all of your help.

Andrew Theurer
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-10-24 15:32 ` Andrew Theurer
@ 2002-10-27 11:10 ` Hirokazu Takahashi
0 siblings, 0 replies; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-10-27 11:10 UTC (permalink / raw)
To: habanero; +Cc: nfs

Hi,

>> Can You check /proc/net/rpc/nfsd which shows how many NFS requests have
>> been retransmitted ?

You can also check the client side.

   /proc/net/rpc/nfs
   net 0 0 0 0
   rpc 191035 4339 0
              ^^^^
   This field shows us how many times the client has retransmitted
   RPC requests.

Thank you,
Hirokazu Takahashi.
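
To turn the client-side counters above into a retransmission rate, the
"rpc" line can be parsed and the second field divided by the first. The
sketch below is an illustration only: the struct and function names are
hypothetical, and the field labels (calls, retransmissions, then a third
counter usually reported as authentication refreshes) are an assumption
based on how nfsstat labels this line, not something stated in the thread.

```c
#include <stdio.h>

/* Hypothetical container for the three counters on the "rpc" line of
 * /proc/net/rpc/nfs; the labels are assumed, see the note above. */
struct rpc_client_stats {
    unsigned long calls;        /* total RPC calls sent */
    unsigned long retrans;      /* calls that had to be retransmitted */
    unsigned long authrefresh;  /* third counter, assumed meaning */
};

/* Parse one line of the form "rpc <calls> <retrans> <authrefresh>".
 * Returns 0 on success, -1 if the line does not match. */
static int parse_rpc_line(const char *line, struct rpc_client_stats *st)
{
    int n = sscanf(line, "rpc %lu %lu %lu",
                   &st->calls, &st->retrans, &st->authrefresh);
    return n == 3 ? 0 : -1;
}

/* Retransmissions as a percentage of all calls (0 if no calls yet). */
static double retrans_percent(const struct rpc_client_stats *st)
{
    if (st->calls == 0)
        return 0.0;
    return 100.0 * (double)st->retrans / (double)st->calls;
}
```

For the sample line in the message, 4339 retransmissions out of 191035
calls is a little under 2.3%.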
* Re: [PATCH] zerocopy NFS for 2.5.36
@ 2002-09-19 0:16 Andrew Morton
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-09-19 0:16 UTC (permalink / raw)
To: Alan Cox; +Cc: David S. Miller, taka, neilb, linux-kernel, nfs
Alan Cox wrote:
>
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86. I do not remember all
>
> The better was a freak of PPro/PII scheduling I think
>
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
>
> The same as bit is easy to explain. It's totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)
Nope. There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:
copy_32 uses movsl. copy_duff just uses a stream of "movl"s
Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:
akpm:/usr/src/cptimer> ./cptimer -d -s
nbytes=10240 from_align=0, to_align=0
copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
OK, movsl wins. But now give the source address 8+1 alignment:
akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240 from_align=1, to_align=0
copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec
The "movl"-based copy wins. By miles.
Make the source 8+4 aligned:
akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240 from_align=4, to_align=0
copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec
So movl still beats movsl, by lots.
I have various scriptlets which generate the entire matrix.
I think I ended up deciding that we should use movsl _only_
when both src and dst are 8-byte-aligned. And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.
It needs redoing. These differences are really big, and this
is the kernel's most expensive function.
A little project for someone.
The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz
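
The rule of thumb above (use the string-move copy only when both source
and destination are 8-byte-aligned) can be sketched in portable user-space
C. Everything here is hypothetical and for illustration: the function
names are invented, memcpy() stands in for the "rep; movsl" path, and a
plain 4-byte loop stands in for the unrolled "movl" path. It shows the
dispatch logic only, not the measured performance.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Which of the two (stand-in) copy routines was selected. */
enum copy_path { COPY_STRING_OP, COPY_WORD_LOOP };

/* Nonzero when both pointers are 8-byte-aligned: OR the addresses
 * together and test the low three bits once. */
static int both_aligned8(const void *dst, const void *src)
{
    return (((uintptr_t)dst | (uintptr_t)src) & 7u) == 0;
}

/* Stand-in for the "rep; movsl" path. */
static void copy_string_op(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for the "movl" path: 4 bytes at a time, then a byte tail.
 * The small memcpy calls avoid unaligned-access undefined behaviour. */
static void copy_word_loop(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof(w));
        memcpy(d, &w, sizeof(w));
        d += 4;
        s += 4;
        n -= 4;
    }
    while (n--)
        *d++ = *s++;
}

/* Pick a routine from the alignment of both pointers, copy, and report
 * which path was taken so the choice can be observed. */
static enum copy_path choose_and_copy(void *dst, const void *src, size_t n)
{
    if (both_aligned8(dst, src)) {
        copy_string_op(dst, src, n);
        return COPY_STRING_OP;
    }
    copy_word_loop(dst, src, n);
    return COPY_WORD_LOOP;
}
```

A real implementation would also need to weigh copy length and CPU model,
since the numbers in this message come from PII/PIII hardware only.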
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 0:16 Andrew Morton
@ 2002-09-19 13:15 ` Hirokazu Takahashi
2002-09-19 20:42 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: Hirokazu Takahashi @ 2002-09-19 13:15 UTC (permalink / raw)
To: akpm; +Cc: alan, davem, neilb, linux-kernel, nfs

Hello,

> > > details, but I do know that using copy_from_user() is not a real
> > > improvement at least on x86 architecture.
> >
> > The same as bit is easy to explain. It's totally memory bandwidth limited
> > on current x86-32 processors. (Although I'd welcome demonstrations to
> > the contrary on newer toys)
>
> Nope. There are distinct alignment problems with movsl-based
> memcpy on PII and (at least) "Pentium III (Coppermine)", which is
> tested here:
...
> I have various scriptlets which generate the entire matrix.
>
> I think I ended up deciding that we should use movsl _only_
> when both src and dst are 8-byte-aligned. And that when you
> multiply the gain from that by the frequency*size with which
> funny alignments are used by TCP the net gain was 2% or something.

Amazing! I believed 4-byte alignment was enough.
The read/write system calls may also reduce their penalties.

> It needs redoing. These differences are really big, and this
> is the kernel's most expensive function.
>
> A little project for someone.

OK, if there is nobody who wants to do it I'll do it myself.

> The tools are at http://www.zip.com.au/~/linux/cptimer.tar.gz
* Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
@ 2002-09-19 20:42 ` Andrew Morton
2002-09-19 21:12 ` David S. Miller
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2002-09-19 20:42 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: alan, davem, neilb, linux-kernel, nfs

Hirokazu Takahashi wrote:
>
> ...
> > It needs redoing. These differences are really big, and this
> > is the kernel's most expensive function.
> >
> > A little project for someone.
>
> OK, if there is nobody who wants to do it I'll do it by myself.

That would be fantastic - thanks. This is more a measurement
and testing exercise than a coding one. And if those measurements
are sufficiently nice (eg: >5%) then a 2.4 backport should be done.

It seems that movsl works acceptably with all alignments on AMD
hardware, although this needs to be checked with more recent machines.

movsl is a (bad) loss on PII and PIII for all alignments except 8&8.
Don't know about P4 - I can test that in a day or two.

I expect that a minimal, 90% solution would be just:

        fancy_copy_to_user(dst, src, count)
        {
                if (arch_has_sane_movsl || ((dst|src) & 7) == 0)
                        movsl_copy_to_user(dst, src, count);
                else
                        movl_copy_to_user(dst, src, count);
        }

and

        #ifndef ARCH_HAS_FANCY_COPY_USER
        #define fancy_copy_to_user copy_to_user
        #endif

and we really only need fancy_copy_to_user in a handful of
places - the bulk copies in networking and filemap.c. For all
the other call sites it's probably more important to keep the
code footprint down than it is to squeeze the last few drops out
of the copy speed.

Mala Anand has done some work on this. See
http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.3/0100.html

<searches>

Yes, I have a copy of Mala's patch here which works against
2.5.current. Mala's patch will cause quite an expansion of
kernel size; we would need an implementation which did not
use inlining. This work was discussed at OLS2002.
See http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz

 uaccess.h |  252 +++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 193 insertions(+), 59 deletions(-)

--- 2.5.25/include/asm-i386/uaccess.h~fast-cu	Tue Jul 9 21:34:58 2002
+++ 2.5.25-akpm/include/asm-i386/uaccess.h	Tue Jul 9 21:51:03 2002
@@ -253,55 +253,197 @@ do { \
  */
 
 /* Generic arbitrary sized copy. */
-#define __copy_user(to,from,size) \
-do { \
-        int __d0, __d1; \
-        __asm__ __volatile__( \
-                "0: rep; movsl\n" \
-                " movl %3,%0\n" \
-                "1: rep; movsb\n" \
-                "2:\n" \
-                ".section .fixup,\"ax\"\n" \
-                "3: lea 0(%3,%0,4),%0\n" \
-                " jmp 2b\n" \
-                ".previous\n" \
-                ".section __ex_table,\"a\"\n" \
-                " .align 4\n" \
-                " .long 0b,3b\n" \
-                " .long 1b,2b\n" \
-                ".previous" \
-                : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
-                : "r"(size & 3), "0"(size / 4), "1"(to), "2"(from) \
-                : "memory"); \
+#define __copy_user(to,from,size) \
+do { \
+        int __d0, __d1; \
+        __asm__ __volatile__( \
+                " cmpl $63, %0\n" \
+                " jbe 5f\n" \
+                " mov %%esi, %%eax\n" \
+                " test $7, %%al\n" \
+                " jz 5f\n" \
+                " .align 2,0x90\n" \
+                "0: movl 32(%4), %%eax\n" \
+                " cmpl $67, %0\n" \
+                " jbe 1f\n" \
+                " movl 64(%4), %%eax\n" \
+                " .align 2,0x90\n" \
+                "1: movl 0(%4), %%eax\n" \
+                " movl 4(%4), %%edx\n" \
+                "2: movl %%eax, 0(%3)\n" \
+                "21: movl %%edx, 4(%3)\n" \
+                " movl 8(%4), %%eax\n" \
+                " movl 12(%4),%%edx\n" \
+                "3: movl %%eax, 8(%3)\n" \
+                "31: movl %%edx, 12(%3)\n" \
+                " movl 16(%4), %%eax\n" \
+                " movl 20(%4), %%edx\n" \
+                "4: movl %%eax, 16(%3)\n" \
+                "41: movl %%edx, 20(%3)\n" \
+                " movl 24(%4), %%eax\n" \
+                " movl 28(%4), %%edx\n" \
+                "10: movl %%eax, 24(%3)\n" \
+                "51: movl %%edx, 28(%3)\n" \
+                " movl 32(%4), %%eax\n" \
+                " movl 36(%4), %%edx\n" \
+                "11: movl %%eax, 32(%3)\n" \
+                "61: movl %%edx, 36(%3)\n" \
+                " movl 40(%4), %%eax\n" \
+                " movl 44(%4), %%edx\n" \
+                "12: movl %%eax, 40(%3)\n" \
+                "71: movl %%edx, 44(%3)\n" \
+                " movl 48(%4), %%eax\n" \
+                " movl 52(%4), %%edx\n" \
+                "13: movl %%eax, 48(%3)\n" \
+                "81: movl %%edx, 52(%3)\n" \
+                " movl 56(%4), %%eax\n" \
+                " movl 60(%4), %%edx\n" \
+                "14: movl %%eax, 56(%3)\n" \
+                "91: movl %%edx, 60(%3)\n" \
+                " addl $-64, %0\n" \
+                " addl $64, %4\n" \
+                " addl $64, %3\n" \
+                " cmpl $63, %0\n" \
+                " ja 0b\n" \
+                "5: movl %0, %%eax\n" \
+                " shrl $2, %0\n" \
+                " andl $3, %%eax\n" \
+                " cld\n" \
+                "6: rep; movsl\n" \
+                " movl %%eax, %0\n" \
+                "7: rep; movsb\n" \
+                "8:\n" \
+                ".section .fixup,\"ax\"\n" \
+                "9: lea 0(%%eax,%0,4),%0\n" \
+                " jmp 8b\n" \
+                "15: movl %6, %0\n" \
+                " jmp 8b\n" \
+                ".previous\n" \
+                ".section __ex_table,\"a\"\n" \
+                " .align 4\n" \
+                " .long 2b,15b\n" \
+                " .long 21b,15b\n" \
+                " .long 3b,15b\n" \
+                " .long 31b,15b\n" \
+                " .long 4b,15b\n" \
+                " .long 41b,15b\n" \
+                " .long 10b,15b\n" \
+                " .long 51b,15b\n" \
+                " .long 11b,15b\n" \
+                " .long 61b,15b\n" \
+                " .long 12b,15b\n" \
+                " .long 71b,15b\n" \
+                " .long 13b,15b\n" \
+                " .long 81b,15b\n" \
+                " .long 14b,15b\n" \
+                " .long 91b,15b\n" \
+                " .long 6b,9b\n" \
+                " .long 7b,8b\n" \
+                ".previous" \
+                : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
+                : "1"(to), "2"(from), "0"(size),"i"(-EFAULT) \
+                : "eax", "edx", "memory"); \
 } while (0)
 
-#define __copy_user_zeroing(to,from,size) \
-do { \
-        int __d0, __d1; \
-        __asm__ __volatile__( \
-                "0: rep; movsl\n" \
-                " movl %3,%0\n" \
-                "1: rep; movsb\n" \
-                "2:\n" \
-                ".section .fixup,\"ax\"\n" \
-                "3: lea 0(%3,%0,4),%0\n" \
-                "4: pushl %0\n" \
-                " pushl %%eax\n" \
-                " xorl %%eax,%%eax\n" \
-                " rep; stosb\n" \
-                " popl %%eax\n" \
-                " popl %0\n" \
-                " jmp 2b\n" \
-                ".previous\n" \
-                ".section __ex_table,\"a\"\n" \
-                " .align 4\n" \
-                " .long 0b,3b\n" \
-                " .long 1b,4b\n" \
-                ".previous" \
-                : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
-                : "r"(size & 3), "0"(size / 4), "1"(to), "2"(from) \
-                : "memory"); \
-} while (0)
+#define __copy_user_zeroing(to,from,size) \
+do { \
+        int __d0, __d1; \
+        __asm__ __volatile__( \
+                " cmpl $63, %0\n" \
+                " jbe 5f\n" \
+                " movl %%edi, %%eax\n" \
+                " test $7, %%al\n" \
+                " jz 5f\n" \
+                " .align 2,0x90\n" \
+                "0: movl 32(%4), %%eax\n" \
+                " cmpl $67, %0\n" \
+                " jbe 2f\n" \
+                "1: movl 64(%4), %%eax\n" \
+                " .align 2,0x90\n" \
+                "2: movl 0(%4), %%eax\n" \
+                "21: movl 4(%4), %%edx\n" \
+                " movl %%eax, 0(%3)\n" \
+                " movl %%edx, 4(%3)\n" \
+                "3: movl 8(%4), %%eax\n" \
+                "31: movl 12(%4),%%edx\n" \
+                " movl %%eax, 8(%3)\n" \
+                " movl %%edx, 12(%3)\n" \
+                "4: movl 16(%4), %%eax\n" \
+                "41: movl 20(%4), %%edx\n" \
+                " movl %%eax, 16(%3)\n" \
+                " movl %%edx, 20(%3)\n" \
+                "10: movl 24(%4), %%eax\n" \
+                "51: movl 28(%4), %%edx\n" \
+                " movl %%eax, 24(%3)\n" \
+                " movl %%edx, 28(%3)\n" \
+                "11: movl 32(%4), %%eax\n" \
+                "61: movl 36(%4), %%edx\n" \
+                " movl %%eax, 32(%3)\n" \
+                " movl %%edx, 36(%3)\n" \
+                "12: movl 40(%4), %%eax\n" \
+                "71: movl 44(%4), %%edx\n" \
+                " movl %%eax, 40(%3)\n" \
+                " movl %%edx, 44(%3)\n" \
+                "13: movl 48(%4), %%eax\n" \
+                "81: movl 52(%4), %%edx\n" \
+                " movl %%eax, 48(%3)\n" \
+                " movl %%edx, 52(%3)\n" \
+                "14: movl 56(%4), %%eax\n" \
+                "91: movl 60(%4), %%edx\n" \
+                " movl %%eax, 56(%3)\n" \
+                " movl %%edx, 60(%3)\n" \
+                " addl $-64, %0\n" \
+                " addl $64, %4\n" \
+                " addl $64, %3\n" \
+                " cmpl $63, %0\n" \
+                " ja 0b\n" \
+                "5: movl %0, %%eax\n" \
+                " shrl $2, %0\n" \
+                " andl $3, %%eax\n" \
+                " cld\n" \
+                "6: rep; movsl\n" \
+                " movl %%eax,%0\n" \
+                "7: rep; movsb\n" \
+                "8:\n" \
+                ".section .fixup,\"ax\"\n" \
+                "9: lea 0(%%eax,%0,4),%0\n" \
+                "16: pushl %0\n" \
+                " pushl %%eax\n" \
+                " xorl %%eax,%%eax\n" \
+                " rep; stosb\n" \
+                " popl %%eax\n" \
+                " popl %0\n" \
+                " jmp 8b\n" \
+                "15: movl %6, %0\n" \
+                " jmp 8b\n" \
+                ".previous\n" \
+                ".section __ex_table,\"a\"\n" \
+                " .align 4\n" \
+                " .long 0b,16b\n" \
+                " .long 1b,16b\n" \
+                " .long 2b,16b\n" \
+                " .long 21b,16b\n" \
+                " .long 3b,16b\n" \
+                " .long 31b,16b\n" \
+                " .long 4b,16b\n" \
+                " .long 41b,16b\n" \
+                " .long 10b,16b\n" \
+                " .long 51b,16b\n" \
+                " .long 11b,16b\n" \
+                " .long 61b,16b\n" \
+                " .long 12b,16b\n" \
+                " .long 71b,16b\n" \
+                " .long 13b,16b\n" \
+                " .long 81b,16b\n" \
+                " .long 14b,16b\n" \
+                " .long 91b,16b\n" \
+                " .long 6b,9b\n" \
+                " .long 7b,16b\n" \
+                ".previous" \
+                : "=&c"(size), "=&D" (__d0), "=&S" (__d1) \
+                : "1"(to), "2"(from), "0"(size),"i"(-EFAULT) \
+                : "eax", "edx", "memory"); \
+ } while (0)
 
 /* We let the __ versions of copy_from/to_user inline, because they're often
  * used in fast paths and have only a small space overhead.
@@ -578,24 +720,16 @@ __constant_copy_from_user_nocheck(void *
 }
 
 #define copy_to_user(to,from,n) \
-        (__builtin_constant_p(n) ? \
-         __constant_copy_to_user((to),(from),(n)) : \
-         __generic_copy_to_user((to),(from),(n)))
+        __generic_copy_to_user((to),(from),(n))
 
 #define copy_from_user(to,from,n) \
-        (__builtin_constant_p(n) ? \
-         __constant_copy_from_user((to),(from),(n)) : \
-         __generic_copy_from_user((to),(from),(n)))
+        __generic_copy_from_user((to),(from),(n))
 
 #define __copy_to_user(to,from,n) \
-        (__builtin_constant_p(n) ? \
-         __constant_copy_to_user_nocheck((to),(from),(n)) : \
-         __generic_copy_to_user_nocheck((to),(from),(n)))
+        __generic_copy_to_user_nocheck((to),(from),(n))
 
 #define __copy_from_user(to,from,n) \
-        (__builtin_constant_p(n) ? \
-         __constant_copy_from_user_nocheck((to),(from),(n)) : \
-         __generic_copy_from_user_nocheck((to),(from),(n)))
+        __generic_copy_from_user_nocheck((to),(from),(n))
 
 long strncpy_from_user(char *dst, const char *src, long count);
 long __strncpy_from_user(char *dst, const char *src, long count);
* Re: Re: [PATCH] zerocopy NFS for 2.5.36
2002-09-19 20:42 ` Andrew Morton
@ 2002-09-19 21:12 ` David S. Miller
0 siblings, 0 replies; 20+ messages in thread
From: David S. Miller @ 2002-09-19 21:12 UTC (permalink / raw)
To: akpm; +Cc: taka, alan, neilb, linux-kernel, nfs

   From: Andrew Morton <akpm@digeo.com>
   Date: Thu, 19 Sep 2002 13:42:13 -0700

   Mala's patch will cause quite an expansion of kernel size; we would
   need an implementation which did not use inlining.

It definitely belongs in arch/i386/lib/copy.c or whatever, not
inlined.
end of thread, other threads:[~2002-10-27 11:17 UTC | newest]
Thread overview: 20+ messages
[not found] <3D89176B.40FFD09B@digeo.com.suse.lists.linux.kernel>
[not found] ` <20020919.221513.28808421.taka@valinux.co.jp.suse.lists.linux.kernel>
[not found] ` <3D8A36A5.846D806@digeo.com.suse.lists.linux.kernel>
2002-09-20 1:00 ` Re: [PATCH] zerocopy NFS for 2.5.36 Andi Kleen
2002-09-20 1:00 ` [NFS] " Andi Kleen
2002-09-20 1:09 ` Andrew Morton
2002-09-20 1:09 ` [NFS] " Andrew Morton
2002-09-20 1:23 ` Andi Kleen
2002-09-20 1:23 ` [NFS] " Andi Kleen
2002-09-20 1:27 ` David S. Miller
2002-09-20 2:06 ` Andi Kleen
2002-09-20 2:01 ` David S. Miller
2002-09-20 2:28 ` Andi Kleen
2002-09-20 2:20 ` David S. Miller
2002-09-20 2:35 ` Andi Kleen
2002-10-16 3:44 Neil Brown
2002-10-16 4:31 ` David S. Miller
2002-10-17 2:03 ` [NFS] " Andrew Theurer
2002-10-17 2:31 ` Hirokazu Takahashi
2002-10-17 13:16 ` Andrew Theurer
2002-10-17 13:26 ` Hirokazu Takahashi
2002-10-17 14:10 ` [NFS] " Andrew Theurer
2002-10-17 16:26 ` Hirokazu Takahashi
2002-10-18 5:38 ` [NFS] " Trond Myklebust
2002-10-18 7:19 ` Hirokazu Takahashi
2002-10-18 15:12 ` Andrew Theurer
2002-10-19 20:34 ` Hirokazu Takahashi
2002-10-22 21:16 ` Andrew Theurer
2002-10-23 9:29 ` [NFS] " Hirokazu Takahashi
2002-10-24 15:32 ` Andrew Theurer
2002-10-27 11:10 ` Hirokazu Takahashi
-- strict thread matches above, loose matches on Subject: below --
2002-09-19 0:16 Andrew Morton
2002-09-19 13:15 ` [NFS] " Hirokazu Takahashi
2002-09-19 20:42 ` Andrew Morton
2002-09-19 21:12 ` David S. Miller