* [RFC] nfs: use 2*rsize readahead size
@ 2010-02-24  2:41 Wu Fengguang
  2010-02-24  3:29 ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-02-24 2:41 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux Memory Management List, LKML

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers

	readahead size		throughput
		   16k		 35.5 MB/s
		   32k		 54.3 MB/s
		   64k		 64.1 MB/s
		  128k		 70.5 MB/s
		  256k		 74.6 MB/s
	rsize ==> 512k		 77.4 MB/s
		 1024k		 85.5 MB/s
		 2048k		 86.8 MB/s
		 4096k		 87.9 MB/s
		 8192k		 89.0 MB/s
		16384k		 87.7 MB/s

So it seems that readahead_size=2*rsize (ie. keep two RPC requests in
flight) can already get near full NFS bandwidth.

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/nfs/client.c   |    4 +++-
 fs/nfs/internal.h |    8 --------
 2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/client.c	2010-02-24 10:16:00.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 	server->backing_dev_info.name = "nfs";
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					2 * server->rpages);
 	server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
 
 	if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/internal.h	2010-02-23 13:26:00.000000000 +0800
@@ -10,14 +10,6 @@
 
 struct nfs_string;
 
-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- *	  their needs. People that do NFS over a slow network, might for
- *	  instance want to reduce it to something closer to 1 for improved
- *	  interactive response.
- */
-#define NFS_MAX_READAHEAD	(RPC_DEF_SLOT_TABLE - 1)
-
 /*
  * Determine if sessions are in use.
  */
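The sizing the patch proposes can be reproduced in userspace for a quick feel of the change (a sketch only; the real computation runs in nfs_server_set_fsinfo(), and the 512k bdi default is taken from the follow-ups below on default_backing_dev_info.ra_pages):

#!/bin/sh
# Sketch: client readahead under the old and proposed formulas.
rsize_kb=${1:-512}                 # NFS rsize in KB (assumed)
default_ra_kb=512                  # bdi default readahead (per the thread)
old_kb=$((15 * rsize_kb))          # rsize * NFS_MAX_READAHEAD
new_kb=$((2 * rsize_kb))           # 2*rsize, i.e. two RPCs in flight
[ "$new_kb" -lt "$default_ra_kb" ] && new_kb=$default_ra_kb
echo "old readahead: ${old_kb}k   proposed: ${new_kb}k"

For rsize=512k this gives 7680k versus 1024k; for rsize=32k the default floor keeps the result at 512k rather than 64k.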
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 2:41 [RFC] nfs: use 2*rsize readahead size Wu Fengguang @ 2010-02-24 3:29 ` Dave Chinner 2010-02-24 4:18 ` Wu Fengguang 2010-02-24 4:24 ` Dave Chinner 0 siblings, 2 replies; 21+ messages in thread From: Dave Chinner @ 2010-02-24 3:29 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, linux-nfs, linux-fsdevel, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > readahead size 512k*15=7680k is too large than necessary for typical > clients. > > On a e1000e--e1000e connection, I got the following numbers > > readahead size throughput > 16k 35.5 MB/s > 32k 54.3 MB/s > 64k 64.1 MB/s > 128k 70.5 MB/s > 256k 74.6 MB/s > rsize ==> 512k 77.4 MB/s > 1024k 85.5 MB/s > 2048k 86.8 MB/s > 4096k 87.9 MB/s > 8192k 89.0 MB/s > 16384k 87.7 MB/s > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > can already get near full NFS bandwidth. > > The test script is: > > #!/bin/sh > > file=/mnt/sparse > BDI=0:15 > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > do > echo 3 > /proc/sys/vm/drop_caches > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > echo readahead_size=${rasize}k > dd if=$file of=/dev/null bs=4k count=1024000 > done That's doing a cached read out of the server cache, right? You might find the results are different if the server has to read the file from disk. I would expect reads from the server cache not to require much readahead as there is no IO latency on the server side for the readahead to hide.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
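One way to take the server cache out of the picture, as Dave suggests, is to drop both caches between runs so the server must hit disk (a sketch; the server host name is an assumption and root is needed on both ends):

#!/bin/sh
# Force real disk IO on the server for the next read pass.
server=nfs-server                          # assumed host name
ssh $server 'echo 3 > /proc/sys/vm/drop_caches'
echo 3 > /proc/sys/vm/drop_caches          # client's own cache too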
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 3:29 ` Dave Chinner @ 2010-02-24 4:18 ` Wu Fengguang 2010-02-24 5:22 ` Dave Chinner 2010-02-24 4:24 ` Dave Chinner 1 sibling, 1 reply; 21+ messages in thread From: Wu Fengguang @ 2010-02-24 4:18 UTC (permalink / raw) To: Dave Chinner Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote: > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > > readahead size 512k*15=7680k is too large than necessary for typical > > clients. > > > > On a e1000e--e1000e connection, I got the following numbers > > > > readahead size throughput > > 16k 35.5 MB/s > > 32k 54.3 MB/s > > 64k 64.1 MB/s > > 128k 70.5 MB/s > > 256k 74.6 MB/s > > rsize ==> 512k 77.4 MB/s > > 1024k 85.5 MB/s > > 2048k 86.8 MB/s > > 4096k 87.9 MB/s > > 8192k 89.0 MB/s > > 16384k 87.7 MB/s > > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > > can already get near full NFS bandwidth. > > > > The test script is: > > > > #!/bin/sh > > > > file=/mnt/sparse > > BDI=0:15 > > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > > do > > echo 3 > /proc/sys/vm/drop_caches > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > > echo readahead_size=${rasize}k > > dd if=$file of=/dev/null bs=4k count=1024000 > > done > > That's doing a cached read out of the server cache, right? You It does not involve disk IO at least. (The sparse file dataset is larger than server cache.) > might find the results are different if the server has to read the > file from disk. I would expect reads from the server cache not > to require much readahead as there is no IO latency on the server > side for the readahead to hide.... Sure the result will be different when disk IO is involved. In this case I would expect the server admin to setup the optimal readahead size for the disk(s). It sounds silly to have client_readahead_size > server_readahead_size Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 4:18 ` Wu Fengguang @ 2010-02-24 5:22 ` Dave Chinner 2010-02-24 6:12 ` Wu Fengguang [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org> 0 siblings, 2 replies; 21+ messages in thread From: Dave Chinner @ 2010-02-24 5:22 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 12:18:22PM +0800, Wu Fengguang wrote: > On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote: > > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > > > readahead size 512k*15=7680k is too large than necessary for typical > > > clients. > > > > > > On a e1000e--e1000e connection, I got the following numbers > > > > > > readahead size throughput > > > 16k 35.5 MB/s > > > 32k 54.3 MB/s > > > 64k 64.1 MB/s > > > 128k 70.5 MB/s > > > 256k 74.6 MB/s > > > rsize ==> 512k 77.4 MB/s > > > 1024k 85.5 MB/s > > > 2048k 86.8 MB/s > > > 4096k 87.9 MB/s > > > 8192k 89.0 MB/s > > > 16384k 87.7 MB/s > > > > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > > > can already get near full NFS bandwidth. > > > > > > The test script is: > > > > > > #!/bin/sh > > > > > > file=/mnt/sparse > > > BDI=0:15 > > > > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > > > do > > > echo 3 > /proc/sys/vm/drop_caches > > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > > > echo readahead_size=${rasize}k > > > dd if=$file of=/dev/null bs=4k count=1024000 > > > done > > > > That's doing a cached read out of the server cache, right? You > > It does not involve disk IO at least. (The sparse file dataset is > larger than server cache.) It still results in effectively the same thing: very low, consistent IO latency. Effectively all the test results show is that on a clean, low latency, uncongested network an unloaded NFS server that has no IO latency, a client only requires one 512k readahead block to hide 90% of the server read request latency. I don't think this is a particularly good test to base a new default on, though. e.g. What is the result with a smaller rsize? When the server actually has to do disk IO? When multiple clients are reading at the same time so the server may not detect accesses as sequential and issue readahead? When another client is writing to the server at the same time as the read and causing significant read IO latency at the server? What I'm trying to say is that while I agree with your premise that a 7.8MB readahead window is probably far larger than was ever intended, I disagree with your methodology and environment for selecting a better default value. The default readahead value needs to work well in as many situations as possible, not just in perfect 1:1 client/server environment. > > might find the results are different if the server has to read the > > file from disk. I would expect reads from the server cache not > > to require much readahead as there is no IO latency on the server > > side for the readahead to hide.... > > Sure the result will be different when disk IO is involved. > In this case I would expect the server admin to setup the optimal > readahead size for the disk(s). The default should do the right thing when disk IO is involved, as almost no-one has an NFS server that doesn't do IO.... 
;) > It sounds silly to have > > client_readahead_size > server_readahead_size I don't think it is - the client readahead has to take into account the network latency as well as the server latency. e.g. a network with a high bandwidth but high latency is going to need much more client side readahead than a high bandwidth, low latency network to get the same throughput. Hence it is not uncommon to see larger readahead windows on network clients than for local disk access. Also, the NFS server may not even be able to detect sequential IO patterns because of the combined access patterns from the clients, and so the only effective readahead might be what the clients issue.... Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
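The latency argument can be put in rough numbers with a bandwidth-delay product estimate (a sketch; the link speed and RTT below are illustrative assumptions, not measurements from this thread):

#!/bin/sh
# Bytes that must be in flight to keep the link busy: bandwidth * RTT.
bw_mb_per_s=${1:-110}                 # usable GigE bandwidth, MB/s (assumed)
rtt_ms=${2:-10}                       # round-trip time, ms (assumed)
bdp_kb=$((bw_mb_per_s * rtt_ms))      # MB/s * ms == KB
echo "bandwidth-delay product ~ ${bdp_kb}k"
echo "client readahead below this cannot keep the pipe full"

At 110 MB/s and 10ms RTT that is roughly 1.1MB of readahead just to cover the network, before any server-side IO latency is added.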
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 5:22 ` Dave Chinner @ 2010-02-24 6:12 ` Wu Fengguang 2010-02-24 7:39 ` Dave Chinner [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org> 1 sibling, 1 reply; 21+ messages in thread From: Wu Fengguang @ 2010-02-24 6:12 UTC (permalink / raw) To: Dave Chinner Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: > On Wed, Feb 24, 2010 at 12:18:22PM +0800, Wu Fengguang wrote: > > On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote: > > > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > > > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > > > > readahead size 512k*15=7680k is too large than necessary for typical > > > > clients. > > > > > > > > On a e1000e--e1000e connection, I got the following numbers > > > > > > > > readahead size throughput > > > > 16k 35.5 MB/s > > > > 32k 54.3 MB/s > > > > 64k 64.1 MB/s > > > > 128k 70.5 MB/s > > > > 256k 74.6 MB/s > > > > rsize ==> 512k 77.4 MB/s > > > > 1024k 85.5 MB/s > > > > 2048k 86.8 MB/s > > > > 4096k 87.9 MB/s > > > > 8192k 89.0 MB/s > > > > 16384k 87.7 MB/s > > > > > > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > > > > can already get near full NFS bandwidth. > > > > > > > > The test script is: > > > > > > > > #!/bin/sh > > > > > > > > file=/mnt/sparse > > > > BDI=0:15 > > > > > > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > > > > do > > > > echo 3 > /proc/sys/vm/drop_caches > > > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > > > > echo readahead_size=${rasize}k > > > > dd if=$file of=/dev/null bs=4k count=1024000 > > > > done > > > > > > That's doing a cached read out of the server cache, right? You > > > > It does not involve disk IO at least. (The sparse file dataset is > > larger than server cache.) > > It still results in effectively the same thing: very low, consistent > IO latency. > > Effectively all the test results show is that on a clean, low > latency, uncongested network an unloaded NFS server that has no IO > latency, a client only requires one 512k readahead block to hide 90% > of the server read request latency. I don't think this is a > particularly good test to base a new default on, though. > > e.g. What is the result with a smaller rsize? When the server > actually has to do disk IO? When multiple clients are reading at > the same time so the server may not detect accesses as sequential > and issue readahead? When another client is writing to the server at > the same time as the read and causing significant read IO latency at > the server? > > What I'm trying to say is that while I agree with your premise that > a 7.8MB readahead window is probably far larger than was ever > intended, I disagree with your methodology and environment for > selecting a better default value. The default readahead value needs > to work well in as many situations as possible, not just in perfect > 1:1 client/server environment. Good points. It's imprudent to change a default value based on one single benchmark. Need to collect more data, which may take time.. > > > might find the results are different if the server has to read the > > > file from disk. I would expect reads from the server cache not > > > to require much readahead as there is no IO latency on the server > > > side for the readahead to hide.... 
> > > > Sure the result will be different when disk IO is involved. > > In this case I would expect the server admin to setup the optimal > > readahead size for the disk(s). > > The default should do the right thing when disk IO is involved, as Agreed. > almost no-one has an NFS server that doesn't do IO.... ;) Sure. > > It sounds silly to have > > > > client_readahead_size > server_readahead_size > > I don't think it is - the client readahead has to take into account > the network latency as well as the server latency. e.g. a network > with a high bandwidth but high latency is going to need much more > client side readahead than a high bandwidth, low latency network to > get the same throughput. Hence it is not uncommon to see larger > readahead windows on network clients than for local disk access. Hmm I wonder if I can simulate a high-bandwidth high-latency network with e1000's RxIntDelay/TxIntDelay parameters.. > Also, the NFS server may not even be able to detect sequential IO > patterns because of the combined access patterns from the clients, > and so the only effective readahead might be what the clients > issue.... Ah yes. Even though the upstream kernel can handle it well, one may run a pretty old kernel, or other UNIX systems. If it really happens, the default 512K won't behave too bad, but may well be sub-optimal. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 6:12 ` Wu Fengguang @ 2010-02-24 7:39 ` Dave Chinner 2010-02-26 7:49 ` [RFC] nfs: use 4*rsize " Wu Fengguang 0 siblings, 1 reply; 21+ messages in thread From: Dave Chinner @ 2010-02-24 7:39 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote: > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: > > What I'm trying to say is that while I agree with your premise that > > a 7.8MB readahead window is probably far larger than was ever > > intended, I disagree with your methodology and environment for > > selecting a better default value. The default readahead value needs > > to work well in as many situations as possible, not just in perfect > > 1:1 client/server environment. > > Good points. It's imprudent to change a default value based on one > single benchmark. Need to collect more data, which may take time.. Agreed - better to spend time now to get it right... > > > It sounds silly to have > > > > > > client_readahead_size > server_readahead_size > > > > I don't think it is - the client readahead has to take into account > > the network latency as well as the server latency. e.g. a network > > with a high bandwidth but high latency is going to need much more > > client side readahead than a high bandwidth, low latency network to > > get the same throughput. Hence it is not uncommon to see larger > > readahead windows on network clients than for local disk access. > > Hmm I wonder if I can simulate a high-bandwidth high-latency network > with e1000's RxIntDelay/TxIntDelay parameters.. I think netem is the blessed method of emulating different network behaviours. There's a howto+faq for setting it up here: http://www.linuxfoundation.org/collaborate/workgroups/networking/netem Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
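For reference, a minimal netem setup for this kind of emulation looks like the following (eth0 and the delay values are assumptions; the measurements later in the thread use the same mechanism):

#!/bin/sh
# Add, adjust and remove an artificial delay on eth0.
# Run on both ends to emulate a symmetric high-latency link.
tc qdisc add dev eth0 root netem delay 10ms
# ... run the readahead benchmark here ...
tc qdisc change dev eth0 root netem delay 200ms
tc qdisc del dev eth0 root netem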
* [RFC] nfs: use 4*rsize readahead size
  2010-02-24  7:39 ` Dave Chinner
@ 2010-02-26  7:49   ` Wu Fengguang
  2010-03-02  3:10     ` Wu Fengguang
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-02-26 7:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Trond Myklebust, linux-nfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > > What I'm trying to say is that while I agree with your premise that
> > > a 7.8MB readahead window is probably far larger than was ever
> > > intended, I disagree with your methodology and environment for
> > > selecting a better default value. The default readahead value needs
> > > to work well in as many situations as possible, not just in perfect
> > > 1:1 client/server environment.
> >
> > Good points. It's imprudent to change a default value based on one
> > single benchmark. Need to collect more data, which may take time..
>
> Agreed - better to spend time now to get it right...

I collected more data with large network latency as well as rsize=32k,
and updated the readahead size accordingly to 4*rsize.

===
nfs: use 4*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers
(this reads a sparse file from the server and involves no disk IO)

readahead size	normal		1ms+1ms		5ms+5ms		10ms+10ms(*)
	  16k	 35.5 MB/s	 4.8 MB/s	 2.1 MB/s	 1.2 MB/s
	  32k	 54.3 MB/s	 6.7 MB/s	 3.6 MB/s	 2.3 MB/s
	  64k	 64.1 MB/s	12.6 MB/s	 6.5 MB/s	 4.7 MB/s
	 128k	 70.5 MB/s	20.1 MB/s	11.9 MB/s	 8.7 MB/s
	 256k	 74.6 MB/s	38.6 MB/s	21.3 MB/s	15.0 MB/s
rsize ==> 512k	 77.4 MB/s	59.4 MB/s	39.8 MB/s	25.5 MB/s
	1024k	 85.5 MB/s	77.9 MB/s	65.7 MB/s	43.0 MB/s
	2048k	 86.8 MB/s	81.5 MB/s	84.1 MB/s	59.7 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	4096k	 87.9 MB/s	77.4 MB/s	56.2 MB/s	59.2 MB/s
	8192k	 89.0 MB/s	81.2 MB/s	78.0 MB/s	41.2 MB/s
       16384k	 87.7 MB/s	85.8 MB/s	62.0 MB/s	56.5 MB/s

readahead size	normal		1ms+1ms		5ms+5ms		10ms+10ms(*)
	  16k	 37.2 MB/s	 6.4 MB/s	 2.1 MB/s	 1.2 MB/s
rsize ==>  32k	 56.6 MB/s	 6.8 MB/s	 3.6 MB/s	 2.3 MB/s
	  64k	 66.1 MB/s	12.7 MB/s	 6.6 MB/s	 4.7 MB/s
	 128k	 69.3 MB/s	22.0 MB/s	12.2 MB/s	 8.9 MB/s
	 256k	 69.6 MB/s	41.8 MB/s	20.7 MB/s	14.7 MB/s
	 512k	 71.3 MB/s	54.1 MB/s	25.0 MB/s	16.9 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	1024k	 71.5 MB/s	48.4 MB/s	26.0 MB/s	16.7 MB/s
	2048k	 71.7 MB/s	53.2 MB/s	25.3 MB/s	17.6 MB/s
	4096k	 71.5 MB/s	50.4 MB/s	25.7 MB/s	17.1 MB/s
	8192k	 71.1 MB/s	52.3 MB/s	26.3 MB/s	16.9 MB/s
       16384k	 70.2 MB/s	56.6 MB/s	27.0 MB/s	16.8 MB/s

(*) 10ms+10ms means to add delay on both client & server sides with
	# /sbin/tc qdisc change dev eth0 root netem delay 10ms
    The total >=20ms delay is so large for NFS, that a simple `vi some.sh`
    command takes a dozen seconds. Note that the actual delay reported
    by ping is larger, eg. for the 1ms+1ms case:
	rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms

So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in
flight) is able to get near full NFS bandwidth. Reducing the multiple
from 15 to 4 not only makes the client side readahead size more sane
(2MB by default), but also reduces the disorder of the server side
RPC read requests, which yields better server side readahead behavior.

To avoid small readahead when the client mounts with "-o rsize=32k" or
the server only supports rsize <= 32k, we take the max of 4*rsize and
default_backing_dev_info.ra_pages. The latter defaults to 512K, and can
be explicitly changed by the user with kernel parameter "readahead=" and
runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which
takes effect for future NFS mounts).

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Dave Chinner <david@fromorbit.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c   |    4 +++-
 fs/nfs/internal.h |    8 --------
 2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c	2010-02-26 10:10:46.000000000 +0800
+++ linux/fs/nfs/client.c	2010-02-26 11:07:22.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 	server->backing_dev_info.name = "nfs";
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					4 * server->rpages);
 	server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
 
 	if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h	2010-02-26 10:10:46.000000000 +0800
+++ linux/fs/nfs/internal.h	2010-02-26 11:07:07.000000000 +0800
@@ -10,14 +10,6 @@
 
 struct nfs_string;
 
-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- *	  their needs. People that do NFS over a slow network, might for
- *	  instance want to reduce it to something closer to 1 for improved
- *	  interactive response.
- */
-#define NFS_MAX_READAHEAD	(RPC_DEF_SLOT_TABLE - 1)
-
 /*
  * Determine if sessions are in use.
  */
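The readahead a live NFS mount actually ends up with can be checked from userspace (a sketch; the mount point is an assumption, and the bdi lookup is the same mountinfo trick that comes up later in the thread):

#!/bin/sh
# Print the readahead size of a given NFS mount.
mnt=/mnt                          # assumed NFS mount point
bdi=$(grep " $mnt " /proc/self/mountinfo | awk '{print $3}')
cat /sys/devices/virtual/bdi/$bdi/read_ahead_kb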
* Re: [RFC] nfs: use 4*rsize readahead size 2010-02-26 7:49 ` [RFC] nfs: use 4*rsize " Wu Fengguang @ 2010-03-02 3:10 ` Wu Fengguang 2010-03-02 14:19 ` Trond Myklebust 2010-03-02 20:14 ` Bret Towe 0 siblings, 2 replies; 21+ messages in thread From: Wu Fengguang @ 2010-03-02 3:10 UTC (permalink / raw) To: Dave Chinner Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML Dave, Here is one more test on a big ext4 disk file: 16k 39.7 MB/s 32k 54.3 MB/s 64k 63.6 MB/s 128k 72.6 MB/s 256k 71.7 MB/s rsize ==> 512k 71.7 MB/s 1024k 72.2 MB/s 2048k 71.0 MB/s 4096k 73.0 MB/s 8192k 74.3 MB/s 16384k 74.5 MB/s It shows that >=128k client side readahead is enough for single disk case :) As for RAID configurations, I guess big server side readahead should be enough. #!/bin/sh file=/mnt/ext4_test/zero BDI=0:24 for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 do echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb echo readahead_size=${rasize}k fadvise $file 0 0 dontneed ssh p9 "fadvise $file 0 0 dontneed" dd if=$file of=/dev/null bs=4k count=402400 done Thanks, Fengguang On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote: > On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote: > > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote: > > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: > > > > What I'm trying to say is that while I agree with your premise that > > > > a 7.8MB readahead window is probably far larger than was ever > > > > intended, I disagree with your methodology and environment for > > > > selecting a better default value. The default readahead value needs > > > > to work well in as many situations as possible, not just in perfect > > > > 1:1 client/server environment. > > > > > > Good points. It's imprudent to change a default value based on one > > > single benchmark. Need to collect more data, which may take time.. > > > > Agreed - better to spend time now to get it right... > > I collected more data with large network latency as well as rsize=32k, > and updates the readahead size accordingly to 4*rsize. > > === > nfs: use 2*rsize readahead size > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > readahead size 512k*15=7680k is too large than necessary for typical > clients. 
> > On a e1000e--e1000e connection, I got the following numbers > (this reads sparse file from server and involves no disk IO) > > readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) > 16k 35.5 MB/s 4.8 MB/s 2.1 MB/s 1.2 MB/s > 32k 54.3 MB/s 6.7 MB/s 3.6 MB/s 2.3 MB/s > 64k 64.1 MB/s 12.6 MB/s 6.5 MB/s 4.7 MB/s > 128k 70.5 MB/s 20.1 MB/s 11.9 MB/s 8.7 MB/s > 256k 74.6 MB/s 38.6 MB/s 21.3 MB/s 15.0 MB/s > rsize ==> 512k 77.4 MB/s 59.4 MB/s 39.8 MB/s 25.5 MB/s > 1024k 85.5 MB/s 77.9 MB/s 65.7 MB/s 43.0 MB/s > 2048k 86.8 MB/s 81.5 MB/s 84.1 MB/s 59.7 MB/s > 4096k 87.9 MB/s 77.4 MB/s 56.2 MB/s 59.2 MB/s > 8192k 89.0 MB/s 81.2 MB/s 78.0 MB/s 41.2 MB/s > 16384k 87.7 MB/s 85.8 MB/s 62.0 MB/s 56.5 MB/s > > readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) > 16k 37.2 MB/s 6.4 MB/s 2.1 MB/s 1.2 MB/s > rsize ==> 32k 56.6 MB/s 6.8 MB/s 3.6 MB/s 2.3 MB/s > 64k 66.1 MB/s 12.7 MB/s 6.6 MB/s 4.7 MB/s > 128k 69.3 MB/s 22.0 MB/s 12.2 MB/s 8.9 MB/s > 256k 69.6 MB/s 41.8 MB/s 20.7 MB/s 14.7 MB/s > 512k 71.3 MB/s 54.1 MB/s 25.0 MB/s 16.9 MB/s > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > 1024k 71.5 MB/s 48.4 MB/s 26.0 MB/s 16.7 MB/s > 2048k 71.7 MB/s 53.2 MB/s 25.3 MB/s 17.6 MB/s > 4096k 71.5 MB/s 50.4 MB/s 25.7 MB/s 17.1 MB/s > 8192k 71.1 MB/s 52.3 MB/s 26.3 MB/s 16.9 MB/s > 16384k 70.2 MB/s 56.6 MB/s 27.0 MB/s 16.8 MB/s > > (*) 10ms+10ms means to add delay on both client & server sides with > # /sbin/tc qdisc change dev eth0 root netem delay 10ms > The total >=20ms delay is so large for NFS, that a simple `vi some.sh` > command takes a dozen seconds. Note that the actual delay reported > by ping is larger, eg. for the 1ms+1ms case: > rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms > > > So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in > flight) is able to get near full NFS bandwidth. Reducing the mulriple > from 15 to 4 not only makes the client side readahead size more sane > (2MB by default), but also reduces the disorderness of the server side > RPC read requests, which yeilds better server side readahead behavior. > > To avoid small readahead when the client mount with "-o rsize=32k" or > the server only supports rsize <= 32k, we take the max of 2*rsize and > default_backing_dev_info.ra_pages. The latter defaults to 512K, and can > be explicitly changed by user with kernel parameter "readahead=" and > runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which > takes effective for future NFS mounts). 
> > The test script is: > > #!/bin/sh > > file=/mnt/sparse > BDI=0:15 > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > do > echo 3 > /proc/sys/vm/drop_caches > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > echo readahead_size=${rasize}k > dd if=$file of=/dev/null bs=4k count=1024000 > done > > CC: Dave Chinner <david@fromorbit.com> > CC: Trond Myklebust <Trond.Myklebust@netapp.com> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > fs/nfs/client.c | 4 +++- > fs/nfs/internal.h | 8 -------- > 2 files changed, 3 insertions(+), 9 deletions(-) > > --- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800 > +++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800 > @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct > server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; > > server->backing_dev_info.name = "nfs"; > - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; > + server->backing_dev_info.ra_pages = max_t(unsigned long, > + default_backing_dev_info.ra_pages, > + 4 * server->rpages); > server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; > > if (server->wsize > max_rpc_payload) > --- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800 > +++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800 > @@ -10,14 +10,6 @@ > > struct nfs_string; > > -/* Maximum number of readahead requests > - * FIXME: this should really be a sysctl so that users may tune it to suit > - * their needs. People that do NFS over a slow network, might for > - * instance want to reduce it to something closer to 1 for improved > - * interactive response. > - */ > -#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1) > - > /* > * Determine if sessions are in use. > */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
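If big server-side readahead is what carries RAID configurations, it can be raised on the exported device roughly like this (a sketch to be run on the server; the device name and the 4MB figure are assumptions):

#!/bin/sh
# blockdev --setra takes 512-byte sectors: 8192 sectors == 4MB.
blockdev --setra 8192 /dev/sdb
blockdev --getra /dev/sdb        # verify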
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 3:10 ` Wu Fengguang @ 2010-03-02 14:19 ` Trond Myklebust 2010-03-02 17:33 ` John Stoffel 2010-03-02 20:14 ` Bret Towe 1 sibling, 1 reply; 21+ messages in thread From: Trond Myklebust @ 2010-03-02 14:19 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: > Dave, > > Here is one more test on a big ext4 disk file: > > 16k 39.7 MB/s > 32k 54.3 MB/s > 64k 63.6 MB/s > 128k 72.6 MB/s > 256k 71.7 MB/s > rsize ==> 512k 71.7 MB/s > 1024k 72.2 MB/s > 2048k 71.0 MB/s > 4096k 73.0 MB/s > 8192k 74.3 MB/s > 16384k 74.5 MB/s > > It shows that >=128k client side readahead is enough for single disk > case :) As for RAID configurations, I guess big server side readahead > should be enough. There are lots of people who would like to use NFS on their company WAN, where you typically have high bandwidths (up to 10GigE), but often a high latency too (due to geographical dispersion). My ping latency from here to a typical server in NetApp's Bangalore office is ~ 312ms. I read your test results with 10ms delays, but have you tested with higher than that? Cheers Trond -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 14:19 ` Trond Myklebust @ 2010-03-02 17:33 ` John Stoffel [not found] ` <19341.19446.356359.99958-HgN6juyGXH5AfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: John Stoffel @ 2010-03-02 17:33 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML >>>>> "Trond" == Trond Myklebust <Trond.Myklebust@netapp.com> writes: Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: >> Dave, >> >> Here is one more test on a big ext4 disk file: >> >> 16k 39.7 MB/s >> 32k 54.3 MB/s >> 64k 63.6 MB/s >> 128k 72.6 MB/s >> 256k 71.7 MB/s >> rsize ==> 512k 71.7 MB/s >> 1024k 72.2 MB/s >> 2048k 71.0 MB/s >> 4096k 73.0 MB/s >> 8192k 74.3 MB/s >> 16384k 74.5 MB/s >> >> It shows that >=128k client side readahead is enough for single disk >> case :) As for RAID configurations, I guess big server side readahead >> should be enough. Trond> There are lots of people who would like to use NFS on their Trond> company WAN, where you typically have high bandwidths (up to Trond> 10GigE), but often a high latency too (due to geographical Trond> dispersion). My ping latency from here to a typical server in Trond> NetApp's Bangalore office is ~ 312ms. I read your test results Trond> with 10ms delays, but have you tested with higher than that? If you have that high a latency, the low level TCP protocol is going to kill your performance before you get to the NFS level. You really need to open up the TCP window size at that point. And it only gets worse as the bandwidth goes up too. There's no good solution, because while you can get good throughput at points, latency is going to suffer no matter what. John -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 4*rsize readahead size [not found] ` <19341.19446.356359.99958-HgN6juyGXH5AfugRpC6u6w@public.gmane.org> @ 2010-03-02 18:42 ` Trond Myklebust 2010-03-03 3:27 ` Wu Fengguang 0 siblings, 1 reply; 21+ messages in thread From: Trond Myklebust @ 2010-03-02 18:42 UTC (permalink / raw) To: John Stoffel Cc: Wu Fengguang, Dave Chinner, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Memory Management List, LKML On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote: > >>>>> "Trond" == Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> writes: > > Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: > >> Dave, > >> > >> Here is one more test on a big ext4 disk file: > >> > >> 16k 39.7 MB/s > >> 32k 54.3 MB/s > >> 64k 63.6 MB/s > >> 128k 72.6 MB/s > >> 256k 71.7 MB/s > >> rsize ==> 512k 71.7 MB/s > >> 1024k 72.2 MB/s > >> 2048k 71.0 MB/s > >> 4096k 73.0 MB/s > >> 8192k 74.3 MB/s > >> 16384k 74.5 MB/s > >> > >> It shows that >=128k client side readahead is enough for single disk > >> case :) As for RAID configurations, I guess big server side readahead > >> should be enough. > > Trond> There are lots of people who would like to use NFS on their > Trond> company WAN, where you typically have high bandwidths (up to > Trond> 10GigE), but often a high latency too (due to geographical > Trond> dispersion). My ping latency from here to a typical server in > Trond> NetApp's Bangalore office is ~ 312ms. I read your test results > Trond> with 10ms delays, but have you tested with higher than that? > > If you have that high a latency, the low level TCP protocol is going > to kill your performance before you get to the NFS level. You really > need to open up the TCP window size at that point. And it only gets > worse as the bandwidth goes up too. Yes. You need to open the TCP window in addition to reading ahead aggressively. > There's no good solution, because while you can get good throughput at > points, latency is going to suffer no matter what. It depends upon your workload. Sequential read and write should still be doable if you have aggressive readahead and open up for lots of parallel write RPCs. Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
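The usual client-side knobs for opening the TCP window look like the following (values are illustrative assumptions; as the follow-ups show, the NFS server's own socket buffers are a separate problem):

#!/bin/sh
# Allow large TCP windows on high bandwidth-delay paths.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"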
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 18:42 ` Trond Myklebust @ 2010-03-03 3:27 ` Wu Fengguang 2010-04-14 21:22 ` Dean Hildebrand 0 siblings, 1 reply; 21+ messages in thread From: Wu Fengguang @ 2010-03-03 3:27 UTC (permalink / raw) To: Trond Myklebust Cc: John Stoffel, Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Mar 03, 2010 at 02:42:19AM +0800, Trond Myklebust wrote: > On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote: > > >>>>> "Trond" == Trond Myklebust <Trond.Myklebust@netapp.com> writes: > > > > Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: > > >> Dave, > > >> > > >> Here is one more test on a big ext4 disk file: > > >> > > >> 16k 39.7 MB/s > > >> 32k 54.3 MB/s > > >> 64k 63.6 MB/s > > >> 128k 72.6 MB/s > > >> 256k 71.7 MB/s > > >> rsize ==> 512k 71.7 MB/s > > >> 1024k 72.2 MB/s > > >> 2048k 71.0 MB/s > > >> 4096k 73.0 MB/s > > >> 8192k 74.3 MB/s > > >> 16384k 74.5 MB/s > > >> > > >> It shows that >=128k client side readahead is enough for single disk > > >> case :) As for RAID configurations, I guess big server side readahead > > >> should be enough. > > > > Trond> There are lots of people who would like to use NFS on their > > Trond> company WAN, where you typically have high bandwidths (up to > > Trond> 10GigE), but often a high latency too (due to geographical > > Trond> dispersion). My ping latency from here to a typical server in > > Trond> NetApp's Bangalore office is ~ 312ms. I read your test results > > Trond> with 10ms delays, but have you tested with higher than that? > > > > If you have that high a latency, the low level TCP protocol is going > > to kill your performance before you get to the NFS level. You really > > need to open up the TCP window size at that point. And it only gets > > worse as the bandwidth goes up too. > > Yes. You need to open the TCP window in addition to reading ahead > aggressively. I only get ~10MB/s throughput with following settings. # huge NFS ra size echo 89512 > /sys/devices/virtual/bdi/0:15/read_ahead_kb # on both sides /sbin/tc qdisc add dev eth0 root netem delay 200ms net.core.rmem_max = 873800000 net.core.wmem_max = 655360000 net.ipv4.tcp_rmem = 8192 87380000 873800000 net.ipv4.tcp_wmem = 4096 65536000 655360000 Did I miss something? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
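One way to see whether the window is actually opening on the wire is to look at the NFS connection's socket details while the transfer runs (a sketch; assumes NFS over TCP on the default port 2049):

#!/bin/sh
# Show cwnd/rtt/send-queue details for the NFS connection.
ss -tin | grep -A1 ':2049'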
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-03 3:27 ` Wu Fengguang @ 2010-04-14 21:22 ` Dean Hildebrand 0 siblings, 0 replies; 21+ messages in thread From: Dean Hildebrand @ 2010-04-14 21:22 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, John Stoffel, Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML You cannot simply update linux system tcp parameters and expect nfs to work well performance-wise over the wan. The NFS server does not use system tcp parameters. This is a long standing issue. A patch was originally added in 2.6.30 that enabled NFS to use linux tcp buffer autotuning, which would resolve the issue, but a regression was reported (http://thread.gmane.org/gmane.linux.kernel/826598 ) and so they removed the patch. Maybe its time to rethink allowing users to manually set linux nfs server tcp buffer sizes? Years have passed on this subject and people are still waiting. Good performance over the wan will require manually setting tcp buffer sizes. As mentioned in the regression thread, autotuning can reduce performance by up to 10%. Here is a patch (slightly outdated) that creates 2 sysctls that allow users to manually to set NFS TCP buffer sizes. The first link also has a fair amount of background information on the subject. http://www.spinics.net/lists/linux-nfs/msg01338.html http://www.spinics.net/lists/linux-nfs/msg01339.html Dean Wu Fengguang wrote: > On Wed, Mar 03, 2010 at 02:42:19AM +0800, Trond Myklebust wrote: > >> On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote: >> >>>>>>>> "Trond" == Trond Myklebust <Trond.Myklebust@netapp.com> writes: >>>>>>>> >>> Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: >>> >>>>> Dave, >>>>> >>>>> Here is one more test on a big ext4 disk file: >>>>> >>>>> 16k 39.7 MB/s >>>>> 32k 54.3 MB/s >>>>> 64k 63.6 MB/s >>>>> 128k 72.6 MB/s >>>>> 256k 71.7 MB/s >>>>> rsize ==> 512k 71.7 MB/s >>>>> 1024k 72.2 MB/s >>>>> 2048k 71.0 MB/s >>>>> 4096k 73.0 MB/s >>>>> 8192k 74.3 MB/s >>>>> 16384k 74.5 MB/s >>>>> >>>>> It shows that >=128k client side readahead is enough for single disk >>>>> case :) As for RAID configurations, I guess big server side readahead >>>>> should be enough. >>>>> >>> Trond> There are lots of people who would like to use NFS on their >>> Trond> company WAN, where you typically have high bandwidths (up to >>> Trond> 10GigE), but often a high latency too (due to geographical >>> Trond> dispersion). My ping latency from here to a typical server in >>> Trond> NetApp's Bangalore office is ~ 312ms. I read your test results >>> Trond> with 10ms delays, but have you tested with higher than that? >>> >>> If you have that high a latency, the low level TCP protocol is going >>> to kill your performance before you get to the NFS level. You really >>> need to open up the TCP window size at that point. And it only gets >>> worse as the bandwidth goes up too. >>> >> Yes. You need to open the TCP window in addition to reading ahead >> aggressively. >> > > I only get ~10MB/s throughput with following settings. > > # huge NFS ra size > echo 89512 > /sys/devices/virtual/bdi/0:15/read_ahead_kb > > # on both sides > /sbin/tc qdisc add dev eth0 root netem delay 200ms > > net.core.rmem_max = 873800000 > net.core.wmem_max = 655360000 > net.ipv4.tcp_rmem = 8192 87380000 873800000 > net.ipv4.tcp_wmem = 4096 65536000 655360000 > > Did I miss something? 
> > Thanks, > Fengguang > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
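Besides socket buffers, the number of read RPCs a client can keep in flight is bounded by the sunrpc slot table, which is also where the old NFS_MAX_READAHEAD = RPC_DEF_SLOT_TABLE - 1 = 15 figure came from. A sketch for high-latency mounts (the value 128 is an assumption, and the setting generally needs to be in place before the mount is made):

#!/bin/sh
# Raise the per-transport RPC slot count for WAN mounts.
echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
cat /proc/sys/sunrpc/tcp_slot_table_entries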
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 3:10 ` Wu Fengguang 2010-03-02 14:19 ` Trond Myklebust @ 2010-03-02 20:14 ` Bret Towe 2010-03-03 1:43 ` Wu Fengguang 1 sibling, 1 reply; 21+ messages in thread From: Bret Towe @ 2010-03-02 20:14 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Mon, Mar 1, 2010 at 7:10 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > Dave, > > Here is one more test on a big ext4 disk file: > > 16k 39.7 MB/s > 32k 54.3 MB/s > 64k 63.6 MB/s > 128k 72.6 MB/s > 256k 71.7 MB/s > rsize ==> 512k 71.7 MB/s > 1024k 72.2 MB/s > 2048k 71.0 MB/s > 4096k 73.0 MB/s > 8192k 74.3 MB/s > 16384k 74.5 MB/s > > It shows that >=128k client side readahead is enough for single disk > case :) As for RAID configurations, I guess big server side readahead > should be enough. > > #!/bin/sh > > file=/mnt/ext4_test/zero > BDI=0:24 > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > do > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > echo readahead_size=${rasize}k > fadvise $file 0 0 dontneed > ssh p9 "fadvise $file 0 0 dontneed" > dd if=$file of=/dev/null bs=4k count=402400 > done how do you determine which bdi to use? I skimmed thru the filesystem in /sys and didn't see anything that says which is what > Thanks, > Fengguang > > On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote: >> On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote: >> > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote: >> > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: >> > > > What I'm trying to say is that while I agree with your premise that >> > > > a 7.8MB readahead window is probably far larger than was ever >> > > > intended, I disagree with your methodology and environment for >> > > > selecting a better default value. The default readahead value needs >> > > > to work well in as many situations as possible, not just in perfect >> > > > 1:1 client/server environment. >> > > >> > > Good points. It's imprudent to change a default value based on one >> > > single benchmark. Need to collect more data, which may take time.. >> > >> > Agreed - better to spend time now to get it right... >> >> I collected more data with large network latency as well as rsize=32k, >> and updates the readahead size accordingly to 4*rsize. >> >> === >> nfs: use 2*rsize readahead size >> >> With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS >> readahead size 512k*15=7680k is too large than necessary for typical >> clients. 
>> >> On a e1000e--e1000e connection, I got the following numbers >> (this reads sparse file from server and involves no disk IO) >> >> readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) >> 16k 35.5 MB/s 4.8 MB/s 2.1 MB/s 1.2 MB/s >> 32k 54.3 MB/s 6.7 MB/s 3.6 MB/s 2.3 MB/s >> 64k 64.1 MB/s 12.6 MB/s 6.5 MB/s 4.7 MB/s >> 128k 70.5 MB/s 20.1 MB/s 11.9 MB/s 8.7 MB/s >> 256k 74.6 MB/s 38.6 MB/s 21.3 MB/s 15.0 MB/s >> rsize ==> 512k 77.4 MB/s 59.4 MB/s 39.8 MB/s 25.5 MB/s >> 1024k 85.5 MB/s 77.9 MB/s 65.7 MB/s 43.0 MB/s >> 2048k 86.8 MB/s 81.5 MB/s 84.1 MB/s 59.7 MB/s >> 4096k 87.9 MB/s 77.4 MB/s 56.2 MB/s 59.2 MB/s >> 8192k 89.0 MB/s 81.2 MB/s 78.0 MB/s 41.2 MB/s >> 16384k 87.7 MB/s 85.8 MB/s 62.0 MB/s 56.5 MB/s >> >> readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) >> 16k 37.2 MB/s 6.4 MB/s 2.1 MB/s 1.2 MB/s >> rsize ==> 32k 56.6 MB/s 6.8 MB/s 3.6 MB/s 2.3 MB/s >> 64k 66.1 MB/s 12.7 MB/s 6.6 MB/s 4.7 MB/s >> 128k 69.3 MB/s 22.0 MB/s 12.2 MB/s 8.9 MB/s >> 256k 69.6 MB/s 41.8 MB/s 20.7 MB/s 14.7 MB/s >> 512k 71.3 MB/s 54.1 MB/s 25.0 MB/s 16.9 MB/s >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 1024k 71.5 MB/s 48.4 MB/s 26.0 MB/s 16.7 MB/s >> 2048k 71.7 MB/s 53.2 MB/s 25.3 MB/s 17.6 MB/s >> 4096k 71.5 MB/s 50.4 MB/s 25.7 MB/s 17.1 MB/s >> 8192k 71.1 MB/s 52.3 MB/s 26.3 MB/s 16.9 MB/s >> 16384k 70.2 MB/s 56.6 MB/s 27.0 MB/s 16.8 MB/s >> >> (*) 10ms+10ms means to add delay on both client & server sides with >> # /sbin/tc qdisc change dev eth0 root netem delay 10ms >> The total >=20ms delay is so large for NFS, that a simple `vi some.sh` >> command takes a dozen seconds. Note that the actual delay reported >> by ping is larger, eg. for the 1ms+1ms case: >> rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms >> >> >> So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in >> flight) is able to get near full NFS bandwidth. Reducing the mulriple >> from 15 to 4 not only makes the client side readahead size more sane >> (2MB by default), but also reduces the disorderness of the server side >> RPC read requests, which yeilds better server side readahead behavior. >> >> To avoid small readahead when the client mount with "-o rsize=32k" or >> the server only supports rsize <= 32k, we take the max of 2*rsize and >> default_backing_dev_info.ra_pages. The latter defaults to 512K, and can >> be explicitly changed by user with kernel parameter "readahead=" and >> runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which >> takes effective for future NFS mounts). 
>> >> The test script is: >> >> #!/bin/sh >> >> file=/mnt/sparse >> BDI=0:15 >> >> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 >> do >> echo 3 > /proc/sys/vm/drop_caches >> echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb >> echo readahead_size=${rasize}k >> dd if=$file of=/dev/null bs=4k count=1024000 >> done >> >> CC: Dave Chinner <david@fromorbit.com> >> CC: Trond Myklebust <Trond.Myklebust@netapp.com> >> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> >> --- >> fs/nfs/client.c | 4 +++- >> fs/nfs/internal.h | 8 -------- >> 2 files changed, 3 insertions(+), 9 deletions(-) >> >> --- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800 >> +++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800 >> @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct >> server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; >> >> server->backing_dev_info.name = "nfs"; >> - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; >> + server->backing_dev_info.ra_pages = max_t(unsigned long, >> + default_backing_dev_info.ra_pages, >> + 4 * server->rpages); >> server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; >> >> if (server->wsize > max_rpc_payload) >> --- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800 >> +++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800 >> @@ -10,14 +10,6 @@ >> >> struct nfs_string; >> >> -/* Maximum number of readahead requests >> - * FIXME: this should really be a sysctl so that users may tune it to suit >> - * their needs. People that do NFS over a slow network, might for >> - * instance want to reduce it to something closer to 1 for improved >> - * interactive response. >> - */ >> -#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1) >> - >> /* >> * Determine if sessions are in use. >> */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 20:14 ` Bret Towe @ 2010-03-03 1:43 ` Wu Fengguang 0 siblings, 0 replies; 21+ messages in thread From: Wu Fengguang @ 2010-03-03 1:43 UTC (permalink / raw) To: Bret Towe Cc: Dave Chinner, Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Mar 03, 2010 at 04:14:33AM +0800, Bret Towe wrote: > how do you determine which bdi to use? I skimmed thru > the filesystem in /sys and didn't see anything that says which is what MOUNTPOINT=" /mnt/ext4_test " # grep "$MOUNTPOINT" /proc/$$/mountinfo|awk '{print $3}' 0:24 Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
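Where a mountpoint(1) with -d support is available, it prints the same MAJ:MIN directly (a sketch; the mount point is the one from the test above):

#!/bin/sh
# Print the device number backing a mount point, e.g. 0:24.
mountpoint -d /mnt/ext4_test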
* Re: [RFC] nfs: use 2*rsize readahead size [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org> @ 2010-02-24 11:18 ` Akshat Aranya 2010-02-25 12:37 ` Wu Fengguang 0 siblings, 1 reply; 21+ messages in thread From: Akshat Aranya @ 2010-02-24 11:18 UTC (permalink / raw) To: Dave Chinner Cc: Wu Fengguang, Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 12:22 AM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote: > >> It sounds silly to have >> >> client_readahead_size > server_readahead_size > > I don't think it is - the client readahead has to take into account > the network latency as well as the server latency. e.g. a network > with a high bandwidth but high latency is going to need much more > client side readahead than a high bandwidth, low latency network to > get the same throughput. Hence it is not uncommon to see larger > readahead windows on network clients than for local disk access. > > Also, the NFS server may not even be able to detect sequential IO > patterns because of the combined access patterns from the clients, > and so the only effective readahead might be what the clients > issue.... > In my experiments, I have observed that the server-side readahead shuts off rather quickly even with a single client because the client readahead causes multiple pending read RPCs on the server which are then serviced in random order and the pattern observed by the underlying file system is non-sequential. In our file system, we had to override what the VFS thought was a random workload and continue to do readahead anyway. Cheers, Akshat -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 11:18 ` [RFC] nfs: use 2*rsize " Akshat Aranya @ 2010-02-25 12:37 ` Wu Fengguang 0 siblings, 0 replies; 21+ messages in thread From: Wu Fengguang @ 2010-02-25 12:37 UTC (permalink / raw) To: Akshat Aranya Cc: Dave Chinner, Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 07:18:26PM +0800, Akshat Aranya wrote: > On Wed, Feb 24, 2010 at 12:22 AM, Dave Chinner <david@fromorbit.com> wrote: > > > > >> It sounds silly to have > >> > >> client_readahead_size > server_readahead_size > > > > I don't think it is - the client readahead has to take into account > > the network latency as well as the server latency. e.g. a network > > with a high bandwidth but high latency is going to need much more > > client side readahead than a high bandwidth, low latency network to > > get the same throughput. Hence it is not uncommon to see larger > > readahead windows on network clients than for local disk access. > > > > Also, the NFS server may not even be able to detect sequential IO > > patterns because of the combined access patterns from the clients, > > and so the only effective readahead might be what the clients > > issue.... > > > > In my experiments, I have observed that the server-side readahead > shuts off rather quickly even with a single client because the client > readahead causes multiple pending read RPCs on the server which are > then serviced in random order and the pattern observed by the > underlying file system is non-sequential. In our file system, we had > to override what the VFS thought was a random workload and continue to > do readahead anyway. What's the server side kernel version, plus client/server side readahead size? I'd expect the context readahead to handle it well. With the patchset in <http://lkml.org/lkml/2010/2/23/376>, you can actually see the readahead details: # echo 1 > /debug/tracing/events/readahead/enable # cp test-file /dev/null # cat /debug/tracing/trace # trimmed output readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4 readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8 readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16 readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32 readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24 readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0 And I've actually verified the NFS case with the help of such traces long ago. When client_readahead_size <= server_readahead_size, the readahead requests may look a bit random at first, and then will quickly turn into a perfect series of sequential context readaheads. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size
  2010-02-24  3:29 ` Dave Chinner
  2010-02-24  4:18   ` Wu Fengguang
@ 2010-02-24  4:24   ` Dave Chinner
  2010-02-24  4:33     ` Wu Fengguang
  [not found]         ` <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
  1 sibling, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2010-02-24 4:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Trond Myklebust, linux-nfs, linux-fsdevel,
      Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
[...]
> > So it seems that readahead_size=2*rsize (ie. keep two RPC requests
> > in flight) can already get near full NFS bandwidth.
[...]
> That's doing a cached read out of the server cache, right? You
> might find the results are different if the server has to read the
> file from disk. I would expect reads from the server cache not
> to require much readahead as there is no IO latency on the server
> side for the readahead to hide....

FWIW, if you mount the client with "-o rsize=32k", or if the server only
supports rsize <= 32k, then this will probably hurt throughput a lot,
because readahead will then be capped at 64k instead of 480k....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
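The small-rsize case is easy to reproduce by forcing rsize at mount time and re-running the dd test; a sketch, with placeholder server and export names:

#!/bin/sh
# Sketch: force a small rsize to test the capped-readahead concern.
# "server:/export" is a placeholder for the actual test setup.
mount -t nfs -o rsize=32768 server:/export /mnt
grep ' /mnt ' /proc/mounts      # confirm the rsize the server actually granted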
* Re: [RFC] nfs: use 2*rsize readahead size
  2010-02-24  4:24 ` Dave Chinner
@ 2010-02-24  4:33   ` Wu Fengguang
  0 siblings, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-02-24 4:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Trond Myklebust, linux-nfs@vger.kernel.org,
      linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
[...]
> FWIW, if you mount the client with "-o rsize=32k", or if the server only
> supports rsize <= 32k, then this will probably hurt throughput a lot,
> because readahead will then be capped at 64k instead of 480k....

That's why I take the max of 2*rsize and the system default readahead
size (which will be enlarged to 512K):

-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					2 * server->rpages);

Thanks,
Fengguang
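A quick back-of-the-envelope comparison makes the effect of the clamp concrete; this assumes the old NFS_MAX_READAHEAD=15 multiplier and a 512k system default:

#!/bin/sh
# Sketch: old client readahead default (15 * rsize) vs. the patched default
# (max(system default, 2 * rsize)), assuming a 512k system default.
for rsize_kb in 32 512
do
	old_kb=$((rsize_kb * 15))
	new_kb=$((rsize_kb * 2))
	[ "$new_kb" -lt 512 ] && new_kb=512
	echo "rsize=${rsize_kb}k: old ${old_kb}k, new ${new_kb}k"
done
# prints:
#   rsize=32k:  old 480k,  new 512k
#   rsize=512k: old 7680k, new 1024k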
[parent not found: <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>]
* Re: [RFC] nfs: use 2*rsize readahead size
  [not found] ` <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
@ 2010-02-24  4:43   ` Wu Fengguang
  2010-02-24  5:24     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-02-24 4:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
[...]
> > That's doing a cached read out of the server cache, right? You
> > might find the results are different if the server has to read the
> > file from disk. I would expect reads from the server cache not
> > to require much readahead as there is no IO latency on the server
> > side for the readahead to hide....
>
> FWIW, if you mount the client with "-o rsize=32k", or if the server only
> supports rsize <= 32k, then this will probably hurt throughput a lot,
> because readahead will then be capped at 64k instead of 480k....

I should have mentioned that in the changelog; hope the updated one
helps.

Thanks,
Fengguang
---
nfs: use 2*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On a e1000e--e1000e connection, I got the following numbers (this
reads a sparse file from the server and involves no disk IO):

readahead size	throughput
	   16k	 35.5 MB/s
	   32k	 54.3 MB/s
	   64k	 64.1 MB/s
	  128k	 70.5 MB/s
	  256k	 74.6 MB/s
rsize ==> 512k	 77.4 MB/s
	 1024k	 85.5 MB/s
	 2048k	 86.8 MB/s
	 4096k	 87.9 MB/s
	 8192k	 89.0 MB/s
	16384k	 87.7 MB/s

So it seems that readahead_size=2*rsize (ie. keep two RPC requests in
flight) can already get near full NFS bandwidth.

To avoid small readahead when the client mounts with "-o rsize=32k" or
the server only supports rsize <= 32k, we take the max of 2*rsize and
default_backing_dev_info.ra_pages. The latter defaults to 512K, is
scaled down automatically when system memory is less than 512M, and can
be changed explicitly by the user with the kernel parameter
"readahead=".

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
CC: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/nfs/client.c   |    4 +++-
 fs/nfs/internal.h |    8 --------
 2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/client.c	2010-02-24 10:16:00.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 	server->backing_dev_info.name = "nfs";
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					2 * server->rpages);
 	server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
 
 	if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/internal.h	2010-02-23 13:26:00.000000000 +0800
@@ -10,14 +10,6 @@
 
 struct nfs_string;
 
-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- *	  their needs. People that do NFS over a slow network, might for
- *	  instance want to reduce it to something closer to 1 for improved
- *	  interactive response.
- */
-#define NFS_MAX_READAHEAD	(RPC_DEF_SLOT_TABLE - 1)
-
 /*
  * Determine if sessions are in use.
  */
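Whatever the compiled-in default ends up being, the effective window for a given NFS mount can still be inspected and overridden at runtime through its BDI, as the thread's own test script does. A sketch, assuming util-linux's mountpoint(1) for the maj:min lookup:

#!/bin/sh
# Sketch: inspect and override the client readahead window for one NFS mount
# at runtime, independent of the compiled-in default.
MNT=/mnt
BDI=$(mountpoint -d "$MNT")                               # e.g. "0:15"
cat /sys/devices/virtual/bdi/$BDI/read_ahead_kb           # current window, KB
echo 1024 > /sys/devices/virtual/bdi/$BDI/read_ahead_kb   # e.g. 2*rsize for rsize=512k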
* Re: [RFC] nfs: use 2*rsize readahead size
  2010-02-24  4:43 ` Wu Fengguang
@ 2010-02-24  5:24   ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2010-02-24 5:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 12:43:56PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
[...]
> > FWIW, if you mount the client with "-o rsize=32k", or if the server only
> > supports rsize <= 32k, then this will probably hurt throughput a lot,
> > because readahead will then be capped at 64k instead of 480k....
>
> I should have mentioned that in the changelog; hope the updated one
> helps.

Sorry, my fault for not reading the code correctly.

Cheers,
Dave.
--
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
end of thread, other threads: [~2010-04-14 21:22 UTC | newest]

Thread overview: 21+ messages
2010-02-24  2:41 [RFC] nfs: use 2*rsize readahead size Wu Fengguang
2010-02-24  3:29 ` Dave Chinner
2010-02-24  4:18 ` Wu Fengguang
2010-02-24  5:22 ` Dave Chinner
2010-02-24  6:12 ` Wu Fengguang
2010-02-24  7:39 ` Dave Chinner
2010-02-26  7:49 ` [RFC] nfs: use 4*rsize " Wu Fengguang
2010-03-02  3:10 ` Wu Fengguang
2010-03-02 14:19 ` Trond Myklebust
2010-03-02 17:33 ` John Stoffel
     [not found] ` <19341.19446.356359.99958-HgN6juyGXH5AfugRpC6u6w@public.gmane.org>
2010-03-02 18:42 ` Trond Myklebust
2010-03-03  3:27 ` Wu Fengguang
2010-04-14 21:22 ` Dean Hildebrand
2010-03-02 20:14 ` Bret Towe
2010-03-03  1:43 ` Wu Fengguang
     [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
2010-02-24 11:18 ` [RFC] nfs: use 2*rsize " Akshat Aranya
2010-02-25 12:37 ` Wu Fengguang
2010-02-24  4:24 ` Dave Chinner
2010-02-24  4:33 ` Wu Fengguang
     [not found] ` <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
2010-02-24  4:43 ` Wu Fengguang
2010-02-24  5:24 ` Dave Chinner