* [RFC] nfs: use 2*rsize readahead size
@ 2010-02-24  2:41 Wu Fengguang
  2010-02-24  3:29 ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-02-24 2:41 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux Memory Management List, LKML

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers

	readahead size		throughput
		   16k		 35.5 MB/s
		   32k		 54.3 MB/s
		   64k		 64.1 MB/s
		  128k		 70.5 MB/s
		  256k		 74.6 MB/s
	rsize ==> 512k		 77.4 MB/s
		 1024k		 85.5 MB/s
		 2048k		 86.8 MB/s
		 4096k		 87.9 MB/s
		 8192k		 89.0 MB/s
		16384k		 87.7 MB/s

So it seems that readahead_size=2*rsize (ie. keep two RPC requests in
flight) can already get near full NFS bandwidth.

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/nfs/client.c   |    4 +++-
 fs/nfs/internal.h |    8 --------
 2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/client.c	2010-02-24 10:16:00.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 	server->backing_dev_info.name = "nfs";
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					2 * server->rpages);
 	server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
 
 	if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/internal.h	2010-02-23 13:26:00.000000000 +0800
@@ -10,14 +10,6 @@
 
 struct nfs_string;
 
-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- *	  their needs. People that do NFS over a slow network, might for
- *	  instance want to reduce it to something closer to 1 for improved
- *	  interactive response.
- */
-#define NFS_MAX_READAHEAD	(RPC_DEF_SLOT_TABLE - 1)
-
 /*
  * Determine if sessions are in use.
  */
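The sizing the patch proposes can be reproduced in userspace for a quick feel of the change (a sketch only; the real computation runs in nfs_server_set_fsinfo(), and the 512k bdi default is taken from the follow-ups below on default_backing_dev_info.ra_pages):

#!/bin/sh
# Sketch: client readahead under the old and proposed formulas.
rsize_kb=${1:-512}                 # NFS rsize in KB (assumed)
default_ra_kb=512                  # bdi default readahead (per the thread)
old_kb=$((15 * rsize_kb))          # rsize * NFS_MAX_READAHEAD
new_kb=$((2 * rsize_kb))           # 2*rsize, i.e. two RPCs in flight
[ "$new_kb" -lt "$default_ra_kb" ] && new_kb=$default_ra_kb
echo "old readahead: ${old_kb}k   proposed: ${new_kb}k"

For rsize=512k this gives 7680k versus 1024k; for rsize=32k the default floor keeps the result at 512k rather than 64k.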
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 2:41 [RFC] nfs: use 2*rsize readahead size Wu Fengguang @ 2010-02-24 3:29 ` Dave Chinner 2010-02-24 4:18 ` Wu Fengguang 2010-02-24 4:24 ` Dave Chinner 0 siblings, 2 replies; 21+ messages in thread From: Dave Chinner @ 2010-02-24 3:29 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, linux-nfs, linux-fsdevel, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > readahead size 512k*15=7680k is too large than necessary for typical > clients. > > On a e1000e--e1000e connection, I got the following numbers > > readahead size throughput > 16k 35.5 MB/s > 32k 54.3 MB/s > 64k 64.1 MB/s > 128k 70.5 MB/s > 256k 74.6 MB/s > rsize ==> 512k 77.4 MB/s > 1024k 85.5 MB/s > 2048k 86.8 MB/s > 4096k 87.9 MB/s > 8192k 89.0 MB/s > 16384k 87.7 MB/s > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > can already get near full NFS bandwidth. > > The test script is: > > #!/bin/sh > > file=/mnt/sparse > BDI=0:15 > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > do > echo 3 > /proc/sys/vm/drop_caches > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > echo readahead_size=${rasize}k > dd if=$file of=/dev/null bs=4k count=1024000 > done That's doing a cached read out of the server cache, right? You might find the results are different if the server has to read the file from disk. I would expect reads from the server cache not to require much readahead as there is no IO latency on the server side for the readahead to hide.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
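One way to take the server cache out of the picture, as Dave suggests, is to drop both caches between runs so the server must hit disk (a sketch; the server host name is an assumption and root is needed on both ends):

#!/bin/sh
# Force real disk IO on the server for the next read pass.
server=nfs-server                          # assumed host name
ssh $server 'echo 3 > /proc/sys/vm/drop_caches'
echo 3 > /proc/sys/vm/drop_caches          # client's own cache too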
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 3:29 ` Dave Chinner @ 2010-02-24 4:18 ` Wu Fengguang 2010-02-24 5:22 ` Dave Chinner 2010-02-24 4:24 ` Dave Chinner 1 sibling, 1 reply; 21+ messages in thread From: Wu Fengguang @ 2010-02-24 4:18 UTC (permalink / raw) To: Dave Chinner Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote: > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > > readahead size 512k*15=7680k is too large than necessary for typical > > clients. > > > > On a e1000e--e1000e connection, I got the following numbers > > > > readahead size throughput > > 16k 35.5 MB/s > > 32k 54.3 MB/s > > 64k 64.1 MB/s > > 128k 70.5 MB/s > > 256k 74.6 MB/s > > rsize ==> 512k 77.4 MB/s > > 1024k 85.5 MB/s > > 2048k 86.8 MB/s > > 4096k 87.9 MB/s > > 8192k 89.0 MB/s > > 16384k 87.7 MB/s > > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > > can already get near full NFS bandwidth. > > > > The test script is: > > > > #!/bin/sh > > > > file=/mnt/sparse > > BDI=0:15 > > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > > do > > echo 3 > /proc/sys/vm/drop_caches > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > > echo readahead_size=${rasize}k > > dd if=$file of=/dev/null bs=4k count=1024000 > > done > > That's doing a cached read out of the server cache, right? You It does not involve disk IO at least. (The sparse file dataset is larger than server cache.) > might find the results are different if the server has to read the > file from disk. I would expect reads from the server cache not > to require much readahead as there is no IO latency on the server > side for the readahead to hide.... Sure the result will be different when disk IO is involved. In this case I would expect the server admin to setup the optimal readahead size for the disk(s). It sounds silly to have client_readahead_size > server_readahead_size Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 4:18 ` Wu Fengguang @ 2010-02-24 5:22 ` Dave Chinner 2010-02-24 6:12 ` Wu Fengguang [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org> 0 siblings, 2 replies; 21+ messages in thread From: Dave Chinner @ 2010-02-24 5:22 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 12:18:22PM +0800, Wu Fengguang wrote: > On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote: > > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > > > readahead size 512k*15=7680k is too large than necessary for typical > > > clients. > > > > > > On a e1000e--e1000e connection, I got the following numbers > > > > > > readahead size throughput > > > 16k 35.5 MB/s > > > 32k 54.3 MB/s > > > 64k 64.1 MB/s > > > 128k 70.5 MB/s > > > 256k 74.6 MB/s > > > rsize ==> 512k 77.4 MB/s > > > 1024k 85.5 MB/s > > > 2048k 86.8 MB/s > > > 4096k 87.9 MB/s > > > 8192k 89.0 MB/s > > > 16384k 87.7 MB/s > > > > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > > > can already get near full NFS bandwidth. > > > > > > The test script is: > > > > > > #!/bin/sh > > > > > > file=/mnt/sparse > > > BDI=0:15 > > > > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > > > do > > > echo 3 > /proc/sys/vm/drop_caches > > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > > > echo readahead_size=${rasize}k > > > dd if=$file of=/dev/null bs=4k count=1024000 > > > done > > > > That's doing a cached read out of the server cache, right? You > > It does not involve disk IO at least. (The sparse file dataset is > larger than server cache.) It still results in effectively the same thing: very low, consistent IO latency. Effectively all the test results show is that on a clean, low latency, uncongested network an unloaded NFS server that has no IO latency, a client only requires one 512k readahead block to hide 90% of the server read request latency. I don't think this is a particularly good test to base a new default on, though. e.g. What is the result with a smaller rsize? When the server actually has to do disk IO? When multiple clients are reading at the same time so the server may not detect accesses as sequential and issue readahead? When another client is writing to the server at the same time as the read and causing significant read IO latency at the server? What I'm trying to say is that while I agree with your premise that a 7.8MB readahead window is probably far larger than was ever intended, I disagree with your methodology and environment for selecting a better default value. The default readahead value needs to work well in as many situations as possible, not just in perfect 1:1 client/server environment. > > might find the results are different if the server has to read the > > file from disk. I would expect reads from the server cache not > > to require much readahead as there is no IO latency on the server > > side for the readahead to hide.... > > Sure the result will be different when disk IO is involved. > In this case I would expect the server admin to setup the optimal > readahead size for the disk(s). The default should do the right thing when disk IO is involved, as almost no-one has an NFS server that doesn't do IO.... 
;) > It sounds silly to have > > client_readahead_size > server_readahead_size I don't think it is - the client readahead has to take into account the network latency as well as the server latency. e.g. a network with a high bandwidth but high latency is going to need much more client side readahead than a high bandwidth, low latency network to get the same throughput. Hence it is not uncommon to see larger readahead windows on network clients than for local disk access. Also, the NFS server may not even be able to detect sequential IO patterns because of the combined access patterns from the clients, and so the only effective readahead might be what the clients issue.... Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
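The latency argument can be put in rough numbers with a bandwidth-delay product estimate (a sketch; the link speed and RTT below are illustrative assumptions, not measurements from this thread):

#!/bin/sh
# Bytes that must be in flight to keep the link busy: bandwidth * RTT.
bw_mb_per_s=${1:-110}                 # usable GigE bandwidth, MB/s (assumed)
rtt_ms=${2:-10}                       # round-trip time, ms (assumed)
bdp_kb=$((bw_mb_per_s * rtt_ms))      # MB/s * ms == KB
echo "bandwidth-delay product ~ ${bdp_kb}k"
echo "client readahead below this cannot keep the pipe full"

At 110 MB/s and 10ms RTT that is roughly 1.1MB of readahead just to cover the network, before any server-side IO latency is added.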
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 5:22 ` Dave Chinner @ 2010-02-24 6:12 ` Wu Fengguang 2010-02-24 7:39 ` Dave Chinner [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org> 1 sibling, 1 reply; 21+ messages in thread From: Wu Fengguang @ 2010-02-24 6:12 UTC (permalink / raw) To: Dave Chinner Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: > On Wed, Feb 24, 2010 at 12:18:22PM +0800, Wu Fengguang wrote: > > On Wed, Feb 24, 2010 at 11:29:34AM +0800, Dave Chinner wrote: > > > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote: > > > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > > > > readahead size 512k*15=7680k is too large than necessary for typical > > > > clients. > > > > > > > > On a e1000e--e1000e connection, I got the following numbers > > > > > > > > readahead size throughput > > > > 16k 35.5 MB/s > > > > 32k 54.3 MB/s > > > > 64k 64.1 MB/s > > > > 128k 70.5 MB/s > > > > 256k 74.6 MB/s > > > > rsize ==> 512k 77.4 MB/s > > > > 1024k 85.5 MB/s > > > > 2048k 86.8 MB/s > > > > 4096k 87.9 MB/s > > > > 8192k 89.0 MB/s > > > > 16384k 87.7 MB/s > > > > > > > > So it seems that readahead_size=2*rsize (ie. keep two RPC requests in flight) > > > > can already get near full NFS bandwidth. > > > > > > > > The test script is: > > > > > > > > #!/bin/sh > > > > > > > > file=/mnt/sparse > > > > BDI=0:15 > > > > > > > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > > > > do > > > > echo 3 > /proc/sys/vm/drop_caches > > > > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > > > > echo readahead_size=${rasize}k > > > > dd if=$file of=/dev/null bs=4k count=1024000 > > > > done > > > > > > That's doing a cached read out of the server cache, right? You > > > > It does not involve disk IO at least. (The sparse file dataset is > > larger than server cache.) > > It still results in effectively the same thing: very low, consistent > IO latency. > > Effectively all the test results show is that on a clean, low > latency, uncongested network an unloaded NFS server that has no IO > latency, a client only requires one 512k readahead block to hide 90% > of the server read request latency. I don't think this is a > particularly good test to base a new default on, though. > > e.g. What is the result with a smaller rsize? When the server > actually has to do disk IO? When multiple clients are reading at > the same time so the server may not detect accesses as sequential > and issue readahead? When another client is writing to the server at > the same time as the read and causing significant read IO latency at > the server? > > What I'm trying to say is that while I agree with your premise that > a 7.8MB readahead window is probably far larger than was ever > intended, I disagree with your methodology and environment for > selecting a better default value. The default readahead value needs > to work well in as many situations as possible, not just in perfect > 1:1 client/server environment. Good points. It's imprudent to change a default value based on one single benchmark. Need to collect more data, which may take time.. > > > might find the results are different if the server has to read the > > > file from disk. I would expect reads from the server cache not > > > to require much readahead as there is no IO latency on the server > > > side for the readahead to hide.... 
> > > > Sure the result will be different when disk IO is involved. > > In this case I would expect the server admin to setup the optimal > > readahead size for the disk(s). > > The default should do the right thing when disk IO is involved, as Agreed. > almost no-one has an NFS server that doesn't do IO.... ;) Sure. > > It sounds silly to have > > > > client_readahead_size > server_readahead_size > > I don't think it is - the client readahead has to take into account > the network latency as well as the server latency. e.g. a network > with a high bandwidth but high latency is going to need much more > client side readahead than a high bandwidth, low latency network to > get the same throughput. Hence it is not uncommon to see larger > readahead windows on network clients than for local disk access. Hmm I wonder if I can simulate a high-bandwidth high-latency network with e1000's RxIntDelay/TxIntDelay parameters.. > Also, the NFS server may not even be able to detect sequential IO > patterns because of the combined access patterns from the clients, > and so the only effective readahead might be what the clients > issue.... Ah yes. Even though the upstream kernel can handle it well, one may run a pretty old kernel, or other UNIX systems. If it really happens, the default 512K won't behave too bad, but may well be sub-optimal. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 6:12 ` Wu Fengguang @ 2010-02-24 7:39 ` Dave Chinner 2010-02-26 7:49 ` [RFC] nfs: use 4*rsize " Wu Fengguang 0 siblings, 1 reply; 21+ messages in thread From: Dave Chinner @ 2010-02-24 7:39 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote: > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: > > What I'm trying to say is that while I agree with your premise that > > a 7.8MB readahead window is probably far larger than was ever > > intended, I disagree with your methodology and environment for > > selecting a better default value. The default readahead value needs > > to work well in as many situations as possible, not just in perfect > > 1:1 client/server environment. > > Good points. It's imprudent to change a default value based on one > single benchmark. Need to collect more data, which may take time.. Agreed - better to spend time now to get it right... > > > It sounds silly to have > > > > > > client_readahead_size > server_readahead_size > > > > I don't think it is - the client readahead has to take into account > > the network latency as well as the server latency. e.g. a network > > with a high bandwidth but high latency is going to need much more > > client side readahead than a high bandwidth, low latency network to > > get the same throughput. Hence it is not uncommon to see larger > > readahead windows on network clients than for local disk access. > > Hmm I wonder if I can simulate a high-bandwidth high-latency network > with e1000's RxIntDelay/TxIntDelay parameters.. I think netem is the blessed method of emulating different network behaviours. There's a howto+faq for setting it up here: http://www.linuxfoundation.org/collaborate/workgroups/networking/netem Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
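For reference, a minimal netem setup for this kind of emulation looks like the following (eth0 and the delay values are assumptions; the measurements later in the thread use the same mechanism):

#!/bin/sh
# Add, adjust and remove an artificial delay on eth0.
# Run on both ends to emulate a symmetric high-latency link.
tc qdisc add dev eth0 root netem delay 10ms
# ... run the readahead benchmark here ...
tc qdisc change dev eth0 root netem delay 200ms
tc qdisc del dev eth0 root netem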
* [RFC] nfs: use 4*rsize readahead size
  2010-02-24  7:39 ` Dave Chinner
@ 2010-02-26  7:49   ` Wu Fengguang
  2010-03-02  3:10     ` Wu Fengguang
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-02-26 7:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Trond Myklebust, linux-nfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > > What I'm trying to say is that while I agree with your premise that
> > > a 7.8MB readahead window is probably far larger than was ever
> > > intended, I disagree with your methodology and environment for
> > > selecting a better default value. The default readahead value needs
> > > to work well in as many situations as possible, not just in perfect
> > > 1:1 client/server environment.
> >
> > Good points. It's imprudent to change a default value based on one
> > single benchmark. Need to collect more data, which may take time..
>
> Agreed - better to spend time now to get it right...

I collected more data with large network latency as well as rsize=32k,
and updated the readahead size accordingly to 4*rsize.

===
nfs: use 4*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On an e1000e--e1000e connection, I got the following numbers
(this reads a sparse file from the server and involves no disk IO)

readahead size	normal		1ms+1ms		5ms+5ms		10ms+10ms(*)
	  16k	 35.5 MB/s	 4.8 MB/s	 2.1 MB/s	 1.2 MB/s
	  32k	 54.3 MB/s	 6.7 MB/s	 3.6 MB/s	 2.3 MB/s
	  64k	 64.1 MB/s	12.6 MB/s	 6.5 MB/s	 4.7 MB/s
	 128k	 70.5 MB/s	20.1 MB/s	11.9 MB/s	 8.7 MB/s
	 256k	 74.6 MB/s	38.6 MB/s	21.3 MB/s	15.0 MB/s
rsize ==> 512k	 77.4 MB/s	59.4 MB/s	39.8 MB/s	25.5 MB/s
	1024k	 85.5 MB/s	77.9 MB/s	65.7 MB/s	43.0 MB/s
	2048k	 86.8 MB/s	81.5 MB/s	84.1 MB/s	59.7 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	4096k	 87.9 MB/s	77.4 MB/s	56.2 MB/s	59.2 MB/s
	8192k	 89.0 MB/s	81.2 MB/s	78.0 MB/s	41.2 MB/s
       16384k	 87.7 MB/s	85.8 MB/s	62.0 MB/s	56.5 MB/s

readahead size	normal		1ms+1ms		5ms+5ms		10ms+10ms(*)
	  16k	 37.2 MB/s	 6.4 MB/s	 2.1 MB/s	 1.2 MB/s
rsize ==>  32k	 56.6 MB/s	 6.8 MB/s	 3.6 MB/s	 2.3 MB/s
	  64k	 66.1 MB/s	12.7 MB/s	 6.6 MB/s	 4.7 MB/s
	 128k	 69.3 MB/s	22.0 MB/s	12.2 MB/s	 8.9 MB/s
	 256k	 69.6 MB/s	41.8 MB/s	20.7 MB/s	14.7 MB/s
	 512k	 71.3 MB/s	54.1 MB/s	25.0 MB/s	16.9 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	1024k	 71.5 MB/s	48.4 MB/s	26.0 MB/s	16.7 MB/s
	2048k	 71.7 MB/s	53.2 MB/s	25.3 MB/s	17.6 MB/s
	4096k	 71.5 MB/s	50.4 MB/s	25.7 MB/s	17.1 MB/s
	8192k	 71.1 MB/s	52.3 MB/s	26.3 MB/s	16.9 MB/s
       16384k	 70.2 MB/s	56.6 MB/s	27.0 MB/s	16.8 MB/s

(*) 10ms+10ms means to add delay on both client & server sides with
	# /sbin/tc qdisc change dev eth0 root netem delay 10ms
    The total >=20ms delay is so large for NFS, that a simple `vi some.sh`
    command takes a dozen seconds. Note that the actual delay reported
    by ping is larger, eg. for the 1ms+1ms case:
	rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms

So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in
flight) is able to get near full NFS bandwidth. Reducing the multiple
from 15 to 4 not only makes the client side readahead size more sane
(2MB by default), but also reduces the disorder of the server side
RPC read requests, which yields better server side readahead behavior.

To avoid small readahead when the client mounts with "-o rsize=32k" or
the server only supports rsize <= 32k, we take the max of 4*rsize and
default_backing_dev_info.ra_pages. The latter defaults to 512K, and can
be explicitly changed by the user with kernel parameter "readahead=" and
runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which
takes effect for future NFS mounts).

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Dave Chinner <david@fromorbit.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/client.c   |    4 +++-
 fs/nfs/internal.h |    8 --------
 2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c	2010-02-26 10:10:46.000000000 +0800
+++ linux/fs/nfs/client.c	2010-02-26 11:07:22.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 	server->backing_dev_info.name = "nfs";
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					4 * server->rpages);
 	server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
 
 	if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h	2010-02-26 10:10:46.000000000 +0800
+++ linux/fs/nfs/internal.h	2010-02-26 11:07:07.000000000 +0800
@@ -10,14 +10,6 @@
 
 struct nfs_string;
 
-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- *	  their needs. People that do NFS over a slow network, might for
- *	  instance want to reduce it to something closer to 1 for improved
- *	  interactive response.
- */
-#define NFS_MAX_READAHEAD	(RPC_DEF_SLOT_TABLE - 1)
-
 /*
  * Determine if sessions are in use.
  */
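The readahead a live NFS mount actually ends up with can be checked from userspace (a sketch; the mount point is an assumption, and the bdi lookup is the same mountinfo trick that comes up later in the thread):

#!/bin/sh
# Print the readahead size of a given NFS mount.
mnt=/mnt                          # assumed NFS mount point
bdi=$(grep " $mnt " /proc/self/mountinfo | awk '{print $3}')
cat /sys/devices/virtual/bdi/$bdi/read_ahead_kb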
* Re: [RFC] nfs: use 4*rsize readahead size 2010-02-26 7:49 ` [RFC] nfs: use 4*rsize " Wu Fengguang @ 2010-03-02 3:10 ` Wu Fengguang 2010-03-02 14:19 ` Trond Myklebust 2010-03-02 20:14 ` Bret Towe 0 siblings, 2 replies; 21+ messages in thread From: Wu Fengguang @ 2010-03-02 3:10 UTC (permalink / raw) To: Dave Chinner Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML Dave, Here is one more test on a big ext4 disk file: 16k 39.7 MB/s 32k 54.3 MB/s 64k 63.6 MB/s 128k 72.6 MB/s 256k 71.7 MB/s rsize ==> 512k 71.7 MB/s 1024k 72.2 MB/s 2048k 71.0 MB/s 4096k 73.0 MB/s 8192k 74.3 MB/s 16384k 74.5 MB/s It shows that >=128k client side readahead is enough for single disk case :) As for RAID configurations, I guess big server side readahead should be enough. #!/bin/sh file=/mnt/ext4_test/zero BDI=0:24 for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 do echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb echo readahead_size=${rasize}k fadvise $file 0 0 dontneed ssh p9 "fadvise $file 0 0 dontneed" dd if=$file of=/dev/null bs=4k count=402400 done Thanks, Fengguang On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote: > On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote: > > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote: > > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: > > > > What I'm trying to say is that while I agree with your premise that > > > > a 7.8MB readahead window is probably far larger than was ever > > > > intended, I disagree with your methodology and environment for > > > > selecting a better default value. The default readahead value needs > > > > to work well in as many situations as possible, not just in perfect > > > > 1:1 client/server environment. > > > > > > Good points. It's imprudent to change a default value based on one > > > single benchmark. Need to collect more data, which may take time.. > > > > Agreed - better to spend time now to get it right... > > I collected more data with large network latency as well as rsize=32k, > and updates the readahead size accordingly to 4*rsize. > > === > nfs: use 2*rsize readahead size > > With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS > readahead size 512k*15=7680k is too large than necessary for typical > clients. 
> > On a e1000e--e1000e connection, I got the following numbers > (this reads sparse file from server and involves no disk IO) > > readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) > 16k 35.5 MB/s 4.8 MB/s 2.1 MB/s 1.2 MB/s > 32k 54.3 MB/s 6.7 MB/s 3.6 MB/s 2.3 MB/s > 64k 64.1 MB/s 12.6 MB/s 6.5 MB/s 4.7 MB/s > 128k 70.5 MB/s 20.1 MB/s 11.9 MB/s 8.7 MB/s > 256k 74.6 MB/s 38.6 MB/s 21.3 MB/s 15.0 MB/s > rsize ==> 512k 77.4 MB/s 59.4 MB/s 39.8 MB/s 25.5 MB/s > 1024k 85.5 MB/s 77.9 MB/s 65.7 MB/s 43.0 MB/s > 2048k 86.8 MB/s 81.5 MB/s 84.1 MB/s 59.7 MB/s > 4096k 87.9 MB/s 77.4 MB/s 56.2 MB/s 59.2 MB/s > 8192k 89.0 MB/s 81.2 MB/s 78.0 MB/s 41.2 MB/s > 16384k 87.7 MB/s 85.8 MB/s 62.0 MB/s 56.5 MB/s > > readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) > 16k 37.2 MB/s 6.4 MB/s 2.1 MB/s 1.2 MB/s > rsize ==> 32k 56.6 MB/s 6.8 MB/s 3.6 MB/s 2.3 MB/s > 64k 66.1 MB/s 12.7 MB/s 6.6 MB/s 4.7 MB/s > 128k 69.3 MB/s 22.0 MB/s 12.2 MB/s 8.9 MB/s > 256k 69.6 MB/s 41.8 MB/s 20.7 MB/s 14.7 MB/s > 512k 71.3 MB/s 54.1 MB/s 25.0 MB/s 16.9 MB/s > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > 1024k 71.5 MB/s 48.4 MB/s 26.0 MB/s 16.7 MB/s > 2048k 71.7 MB/s 53.2 MB/s 25.3 MB/s 17.6 MB/s > 4096k 71.5 MB/s 50.4 MB/s 25.7 MB/s 17.1 MB/s > 8192k 71.1 MB/s 52.3 MB/s 26.3 MB/s 16.9 MB/s > 16384k 70.2 MB/s 56.6 MB/s 27.0 MB/s 16.8 MB/s > > (*) 10ms+10ms means to add delay on both client & server sides with > # /sbin/tc qdisc change dev eth0 root netem delay 10ms > The total >=20ms delay is so large for NFS, that a simple `vi some.sh` > command takes a dozen seconds. Note that the actual delay reported > by ping is larger, eg. for the 1ms+1ms case: > rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms > > > So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in > flight) is able to get near full NFS bandwidth. Reducing the mulriple > from 15 to 4 not only makes the client side readahead size more sane > (2MB by default), but also reduces the disorderness of the server side > RPC read requests, which yeilds better server side readahead behavior. > > To avoid small readahead when the client mount with "-o rsize=32k" or > the server only supports rsize <= 32k, we take the max of 2*rsize and > default_backing_dev_info.ra_pages. The latter defaults to 512K, and can > be explicitly changed by user with kernel parameter "readahead=" and > runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which > takes effective for future NFS mounts). 
> > The test script is: > > #!/bin/sh > > file=/mnt/sparse > BDI=0:15 > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > do > echo 3 > /proc/sys/vm/drop_caches > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > echo readahead_size=${rasize}k > dd if=$file of=/dev/null bs=4k count=1024000 > done > > CC: Dave Chinner <david@fromorbit.com> > CC: Trond Myklebust <Trond.Myklebust@netapp.com> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > fs/nfs/client.c | 4 +++- > fs/nfs/internal.h | 8 -------- > 2 files changed, 3 insertions(+), 9 deletions(-) > > --- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800 > +++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800 > @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct > server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; > > server->backing_dev_info.name = "nfs"; > - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; > + server->backing_dev_info.ra_pages = max_t(unsigned long, > + default_backing_dev_info.ra_pages, > + 4 * server->rpages); > server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; > > if (server->wsize > max_rpc_payload) > --- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800 > +++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800 > @@ -10,14 +10,6 @@ > > struct nfs_string; > > -/* Maximum number of readahead requests > - * FIXME: this should really be a sysctl so that users may tune it to suit > - * their needs. People that do NFS over a slow network, might for > - * instance want to reduce it to something closer to 1 for improved > - * interactive response. > - */ > -#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1) > - > /* > * Determine if sessions are in use. > */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
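If big server-side readahead is what carries RAID configurations, it can be raised on the exported device roughly like this (a sketch to be run on the server; the device name and the 4MB figure are assumptions):

#!/bin/sh
# blockdev --setra takes 512-byte sectors: 8192 sectors == 4MB.
blockdev --setra 8192 /dev/sdb
blockdev --getra /dev/sdb        # verify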
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 3:10 ` Wu Fengguang @ 2010-03-02 14:19 ` Trond Myklebust 2010-03-02 17:33 ` John Stoffel 2010-03-02 20:14 ` Bret Towe 1 sibling, 1 reply; 21+ messages in thread From: Trond Myklebust @ 2010-03-02 14:19 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: > Dave, > > Here is one more test on a big ext4 disk file: > > 16k 39.7 MB/s > 32k 54.3 MB/s > 64k 63.6 MB/s > 128k 72.6 MB/s > 256k 71.7 MB/s > rsize ==> 512k 71.7 MB/s > 1024k 72.2 MB/s > 2048k 71.0 MB/s > 4096k 73.0 MB/s > 8192k 74.3 MB/s > 16384k 74.5 MB/s > > It shows that >=128k client side readahead is enough for single disk > case :) As for RAID configurations, I guess big server side readahead > should be enough. There are lots of people who would like to use NFS on their company WAN, where you typically have high bandwidths (up to 10GigE), but often a high latency too (due to geographical dispersion). My ping latency from here to a typical server in NetApp's Bangalore office is ~ 312ms. I read your test results with 10ms delays, but have you tested with higher than that? Cheers Trond -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 14:19 ` Trond Myklebust @ 2010-03-02 17:33 ` John Stoffel [not found] ` <19341.19446.356359.99958-HgN6juyGXH5AfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: John Stoffel @ 2010-03-02 17:33 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML >>>>> "Trond" == Trond Myklebust <Trond.Myklebust@netapp.com> writes: Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: >> Dave, >> >> Here is one more test on a big ext4 disk file: >> >> 16k 39.7 MB/s >> 32k 54.3 MB/s >> 64k 63.6 MB/s >> 128k 72.6 MB/s >> 256k 71.7 MB/s >> rsize ==> 512k 71.7 MB/s >> 1024k 72.2 MB/s >> 2048k 71.0 MB/s >> 4096k 73.0 MB/s >> 8192k 74.3 MB/s >> 16384k 74.5 MB/s >> >> It shows that >=128k client side readahead is enough for single disk >> case :) As for RAID configurations, I guess big server side readahead >> should be enough. Trond> There are lots of people who would like to use NFS on their Trond> company WAN, where you typically have high bandwidths (up to Trond> 10GigE), but often a high latency too (due to geographical Trond> dispersion). My ping latency from here to a typical server in Trond> NetApp's Bangalore office is ~ 312ms. I read your test results Trond> with 10ms delays, but have you tested with higher than that? If you have that high a latency, the low level TCP protocol is going to kill your performance before you get to the NFS level. You really need to open up the TCP window size at that point. And it only gets worse as the bandwidth goes up too. There's no good solution, because while you can get good throughput at points, latency is going to suffer no matter what. John -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 4*rsize readahead size [not found] ` <19341.19446.356359.99958-HgN6juyGXH5AfugRpC6u6w@public.gmane.org> @ 2010-03-02 18:42 ` Trond Myklebust 2010-03-03 3:27 ` Wu Fengguang 0 siblings, 1 reply; 21+ messages in thread From: Trond Myklebust @ 2010-03-02 18:42 UTC (permalink / raw) To: John Stoffel Cc: Wu Fengguang, Dave Chinner, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Memory Management List, LKML On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote: > >>>>> "Trond" == Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> writes: > > Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: > >> Dave, > >> > >> Here is one more test on a big ext4 disk file: > >> > >> 16k 39.7 MB/s > >> 32k 54.3 MB/s > >> 64k 63.6 MB/s > >> 128k 72.6 MB/s > >> 256k 71.7 MB/s > >> rsize ==> 512k 71.7 MB/s > >> 1024k 72.2 MB/s > >> 2048k 71.0 MB/s > >> 4096k 73.0 MB/s > >> 8192k 74.3 MB/s > >> 16384k 74.5 MB/s > >> > >> It shows that >=128k client side readahead is enough for single disk > >> case :) As for RAID configurations, I guess big server side readahead > >> should be enough. > > Trond> There are lots of people who would like to use NFS on their > Trond> company WAN, where you typically have high bandwidths (up to > Trond> 10GigE), but often a high latency too (due to geographical > Trond> dispersion). My ping latency from here to a typical server in > Trond> NetApp's Bangalore office is ~ 312ms. I read your test results > Trond> with 10ms delays, but have you tested with higher than that? > > If you have that high a latency, the low level TCP protocol is going > to kill your performance before you get to the NFS level. You really > need to open up the TCP window size at that point. And it only gets > worse as the bandwidth goes up too. Yes. You need to open the TCP window in addition to reading ahead aggressively. > There's no good solution, because while you can get good throughput at > points, latency is going to suffer no matter what. It depends upon your workload. Sequential read and write should still be doable if you have aggressive readahead and open up for lots of parallel write RPCs. Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
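The usual client-side knobs for opening the TCP window look like the following (values are illustrative assumptions; as the follow-ups show, the NFS server's own socket buffers are a separate problem):

#!/bin/sh
# Allow large TCP windows on high bandwidth-delay paths.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"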
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 18:42 ` Trond Myklebust @ 2010-03-03 3:27 ` Wu Fengguang 2010-04-14 21:22 ` Dean Hildebrand 0 siblings, 1 reply; 21+ messages in thread From: Wu Fengguang @ 2010-03-03 3:27 UTC (permalink / raw) To: Trond Myklebust Cc: John Stoffel, Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Mar 03, 2010 at 02:42:19AM +0800, Trond Myklebust wrote: > On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote: > > >>>>> "Trond" == Trond Myklebust <Trond.Myklebust@netapp.com> writes: > > > > Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: > > >> Dave, > > >> > > >> Here is one more test on a big ext4 disk file: > > >> > > >> 16k 39.7 MB/s > > >> 32k 54.3 MB/s > > >> 64k 63.6 MB/s > > >> 128k 72.6 MB/s > > >> 256k 71.7 MB/s > > >> rsize ==> 512k 71.7 MB/s > > >> 1024k 72.2 MB/s > > >> 2048k 71.0 MB/s > > >> 4096k 73.0 MB/s > > >> 8192k 74.3 MB/s > > >> 16384k 74.5 MB/s > > >> > > >> It shows that >=128k client side readahead is enough for single disk > > >> case :) As for RAID configurations, I guess big server side readahead > > >> should be enough. > > > > Trond> There are lots of people who would like to use NFS on their > > Trond> company WAN, where you typically have high bandwidths (up to > > Trond> 10GigE), but often a high latency too (due to geographical > > Trond> dispersion). My ping latency from here to a typical server in > > Trond> NetApp's Bangalore office is ~ 312ms. I read your test results > > Trond> with 10ms delays, but have you tested with higher than that? > > > > If you have that high a latency, the low level TCP protocol is going > > to kill your performance before you get to the NFS level. You really > > need to open up the TCP window size at that point. And it only gets > > worse as the bandwidth goes up too. > > Yes. You need to open the TCP window in addition to reading ahead > aggressively. I only get ~10MB/s throughput with following settings. # huge NFS ra size echo 89512 > /sys/devices/virtual/bdi/0:15/read_ahead_kb # on both sides /sbin/tc qdisc add dev eth0 root netem delay 200ms net.core.rmem_max = 873800000 net.core.wmem_max = 655360000 net.ipv4.tcp_rmem = 8192 87380000 873800000 net.ipv4.tcp_wmem = 4096 65536000 655360000 Did I miss something? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
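One way to see whether the window is actually opening on the wire is to look at the NFS connection's socket details while the transfer runs (a sketch; assumes NFS over TCP on the default port 2049):

#!/bin/sh
# Show cwnd/rtt/send-queue details for the NFS connection.
ss -tin | grep -A1 ':2049'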
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-03 3:27 ` Wu Fengguang @ 2010-04-14 21:22 ` Dean Hildebrand 0 siblings, 0 replies; 21+ messages in thread From: Dean Hildebrand @ 2010-04-14 21:22 UTC (permalink / raw) To: Wu Fengguang Cc: Trond Myklebust, John Stoffel, Dave Chinner, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML You cannot simply update linux system tcp parameters and expect nfs to work well performance-wise over the wan. The NFS server does not use system tcp parameters. This is a long standing issue. A patch was originally added in 2.6.30 that enabled NFS to use linux tcp buffer autotuning, which would resolve the issue, but a regression was reported (http://thread.gmane.org/gmane.linux.kernel/826598 ) and so they removed the patch. Maybe its time to rethink allowing users to manually set linux nfs server tcp buffer sizes? Years have passed on this subject and people are still waiting. Good performance over the wan will require manually setting tcp buffer sizes. As mentioned in the regression thread, autotuning can reduce performance by up to 10%. Here is a patch (slightly outdated) that creates 2 sysctls that allow users to manually to set NFS TCP buffer sizes. The first link also has a fair amount of background information on the subject. http://www.spinics.net/lists/linux-nfs/msg01338.html http://www.spinics.net/lists/linux-nfs/msg01339.html Dean Wu Fengguang wrote: > On Wed, Mar 03, 2010 at 02:42:19AM +0800, Trond Myklebust wrote: > >> On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote: >> >>>>>>>> "Trond" == Trond Myklebust <Trond.Myklebust@netapp.com> writes: >>>>>>>> >>> Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote: >>> >>>>> Dave, >>>>> >>>>> Here is one more test on a big ext4 disk file: >>>>> >>>>> 16k 39.7 MB/s >>>>> 32k 54.3 MB/s >>>>> 64k 63.6 MB/s >>>>> 128k 72.6 MB/s >>>>> 256k 71.7 MB/s >>>>> rsize ==> 512k 71.7 MB/s >>>>> 1024k 72.2 MB/s >>>>> 2048k 71.0 MB/s >>>>> 4096k 73.0 MB/s >>>>> 8192k 74.3 MB/s >>>>> 16384k 74.5 MB/s >>>>> >>>>> It shows that >=128k client side readahead is enough for single disk >>>>> case :) As for RAID configurations, I guess big server side readahead >>>>> should be enough. >>>>> >>> Trond> There are lots of people who would like to use NFS on their >>> Trond> company WAN, where you typically have high bandwidths (up to >>> Trond> 10GigE), but often a high latency too (due to geographical >>> Trond> dispersion). My ping latency from here to a typical server in >>> Trond> NetApp's Bangalore office is ~ 312ms. I read your test results >>> Trond> with 10ms delays, but have you tested with higher than that? >>> >>> If you have that high a latency, the low level TCP protocol is going >>> to kill your performance before you get to the NFS level. You really >>> need to open up the TCP window size at that point. And it only gets >>> worse as the bandwidth goes up too. >>> >> Yes. You need to open the TCP window in addition to reading ahead >> aggressively. >> > > I only get ~10MB/s throughput with following settings. > > # huge NFS ra size > echo 89512 > /sys/devices/virtual/bdi/0:15/read_ahead_kb > > # on both sides > /sbin/tc qdisc add dev eth0 root netem delay 200ms > > net.core.rmem_max = 873800000 > net.core.wmem_max = 655360000 > net.ipv4.tcp_rmem = 8192 87380000 873800000 > net.ipv4.tcp_wmem = 4096 65536000 655360000 > > Did I miss something? 
> > Thanks, > Fengguang > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
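Besides socket buffers, the number of read RPCs a client can keep in flight is bounded by the sunrpc slot table, which is also where the old NFS_MAX_READAHEAD = RPC_DEF_SLOT_TABLE - 1 = 15 figure came from. A sketch for high-latency mounts (the value 128 is an assumption, and the setting generally needs to be in place before the mount is made):

#!/bin/sh
# Raise the per-transport RPC slot count for WAN mounts.
echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
cat /proc/sys/sunrpc/tcp_slot_table_entries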
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 3:10 ` Wu Fengguang 2010-03-02 14:19 ` Trond Myklebust @ 2010-03-02 20:14 ` Bret Towe 2010-03-03 1:43 ` Wu Fengguang 1 sibling, 1 reply; 21+ messages in thread From: Bret Towe @ 2010-03-02 20:14 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Mon, Mar 1, 2010 at 7:10 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > Dave, > > Here is one more test on a big ext4 disk file: > > 16k 39.7 MB/s > 32k 54.3 MB/s > 64k 63.6 MB/s > 128k 72.6 MB/s > 256k 71.7 MB/s > rsize ==> 512k 71.7 MB/s > 1024k 72.2 MB/s > 2048k 71.0 MB/s > 4096k 73.0 MB/s > 8192k 74.3 MB/s > 16384k 74.5 MB/s > > It shows that >=128k client side readahead is enough for single disk > case :) As for RAID configurations, I guess big server side readahead > should be enough. > > #!/bin/sh > > file=/mnt/ext4_test/zero > BDI=0:24 > > for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 > do > echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb > echo readahead_size=${rasize}k > fadvise $file 0 0 dontneed > ssh p9 "fadvise $file 0 0 dontneed" > dd if=$file of=/dev/null bs=4k count=402400 > done how do you determine which bdi to use? I skimmed thru the filesystem in /sys and didn't see anything that says which is what > Thanks, > Fengguang > > On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote: >> On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote: >> > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote: >> > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote: >> > > > What I'm trying to say is that while I agree with your premise that >> > > > a 7.8MB readahead window is probably far larger than was ever >> > > > intended, I disagree with your methodology and environment for >> > > > selecting a better default value. The default readahead value needs >> > > > to work well in as many situations as possible, not just in perfect >> > > > 1:1 client/server environment. >> > > >> > > Good points. It's imprudent to change a default value based on one >> > > single benchmark. Need to collect more data, which may take time.. >> > >> > Agreed - better to spend time now to get it right... >> >> I collected more data with large network latency as well as rsize=32k, >> and updates the readahead size accordingly to 4*rsize. >> >> === >> nfs: use 2*rsize readahead size >> >> With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS >> readahead size 512k*15=7680k is too large than necessary for typical >> clients. 
>> >> On a e1000e--e1000e connection, I got the following numbers >> (this reads sparse file from server and involves no disk IO) >> >> readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) >> 16k 35.5 MB/s 4.8 MB/s 2.1 MB/s 1.2 MB/s >> 32k 54.3 MB/s 6.7 MB/s 3.6 MB/s 2.3 MB/s >> 64k 64.1 MB/s 12.6 MB/s 6.5 MB/s 4.7 MB/s >> 128k 70.5 MB/s 20.1 MB/s 11.9 MB/s 8.7 MB/s >> 256k 74.6 MB/s 38.6 MB/s 21.3 MB/s 15.0 MB/s >> rsize ==> 512k 77.4 MB/s 59.4 MB/s 39.8 MB/s 25.5 MB/s >> 1024k 85.5 MB/s 77.9 MB/s 65.7 MB/s 43.0 MB/s >> 2048k 86.8 MB/s 81.5 MB/s 84.1 MB/s 59.7 MB/s >> 4096k 87.9 MB/s 77.4 MB/s 56.2 MB/s 59.2 MB/s >> 8192k 89.0 MB/s 81.2 MB/s 78.0 MB/s 41.2 MB/s >> 16384k 87.7 MB/s 85.8 MB/s 62.0 MB/s 56.5 MB/s >> >> readahead size normal 1ms+1ms 5ms+5ms 10ms+10ms(*) >> 16k 37.2 MB/s 6.4 MB/s 2.1 MB/s 1.2 MB/s >> rsize ==> 32k 56.6 MB/s 6.8 MB/s 3.6 MB/s 2.3 MB/s >> 64k 66.1 MB/s 12.7 MB/s 6.6 MB/s 4.7 MB/s >> 128k 69.3 MB/s 22.0 MB/s 12.2 MB/s 8.9 MB/s >> 256k 69.6 MB/s 41.8 MB/s 20.7 MB/s 14.7 MB/s >> 512k 71.3 MB/s 54.1 MB/s 25.0 MB/s 16.9 MB/s >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 1024k 71.5 MB/s 48.4 MB/s 26.0 MB/s 16.7 MB/s >> 2048k 71.7 MB/s 53.2 MB/s 25.3 MB/s 17.6 MB/s >> 4096k 71.5 MB/s 50.4 MB/s 25.7 MB/s 17.1 MB/s >> 8192k 71.1 MB/s 52.3 MB/s 26.3 MB/s 16.9 MB/s >> 16384k 70.2 MB/s 56.6 MB/s 27.0 MB/s 16.8 MB/s >> >> (*) 10ms+10ms means to add delay on both client & server sides with >> # /sbin/tc qdisc change dev eth0 root netem delay 10ms >> The total >=20ms delay is so large for NFS, that a simple `vi some.sh` >> command takes a dozen seconds. Note that the actual delay reported >> by ping is larger, eg. for the 1ms+1ms case: >> rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms >> >> >> So it seems that readahead_size=4*rsize (ie. keep 4 RPC requests in >> flight) is able to get near full NFS bandwidth. Reducing the mulriple >> from 15 to 4 not only makes the client side readahead size more sane >> (2MB by default), but also reduces the disorderness of the server side >> RPC read requests, which yeilds better server side readahead behavior. >> >> To avoid small readahead when the client mount with "-o rsize=32k" or >> the server only supports rsize <= 32k, we take the max of 2*rsize and >> default_backing_dev_info.ra_pages. The latter defaults to 512K, and can >> be explicitly changed by user with kernel parameter "readahead=" and >> runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb" (which >> takes effective for future NFS mounts). 
>> >> The test script is: >> >> #!/bin/sh >> >> file=/mnt/sparse >> BDI=0:15 >> >> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384 >> do >> echo 3 > /proc/sys/vm/drop_caches >> echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb >> echo readahead_size=${rasize}k >> dd if=$file of=/dev/null bs=4k count=1024000 >> done >> >> CC: Dave Chinner <david@fromorbit.com> >> CC: Trond Myklebust <Trond.Myklebust@netapp.com> >> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> >> --- >> fs/nfs/client.c | 4 +++- >> fs/nfs/internal.h | 8 -------- >> 2 files changed, 3 insertions(+), 9 deletions(-) >> >> --- linux.orig/fs/nfs/client.c 2010-02-26 10:10:46.000000000 +0800 >> +++ linux/fs/nfs/client.c 2010-02-26 11:07:22.000000000 +0800 >> @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct >> server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; >> >> server->backing_dev_info.name = "nfs"; >> - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; >> + server->backing_dev_info.ra_pages = max_t(unsigned long, >> + default_backing_dev_info.ra_pages, >> + 4 * server->rpages); >> server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; >> >> if (server->wsize > max_rpc_payload) >> --- linux.orig/fs/nfs/internal.h 2010-02-26 10:10:46.000000000 +0800 >> +++ linux/fs/nfs/internal.h 2010-02-26 11:07:07.000000000 +0800 >> @@ -10,14 +10,6 @@ >> >> struct nfs_string; >> >> -/* Maximum number of readahead requests >> - * FIXME: this should really be a sysctl so that users may tune it to suit >> - * their needs. People that do NFS over a slow network, might for >> - * instance want to reduce it to something closer to 1 for improved >> - * interactive response. >> - */ >> -#define NFS_MAX_READAHEAD (RPC_DEF_SLOT_TABLE - 1) >> - >> /* >> * Determine if sessions are in use. >> */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 4*rsize readahead size 2010-03-02 20:14 ` Bret Towe @ 2010-03-03 1:43 ` Wu Fengguang 0 siblings, 0 replies; 21+ messages in thread From: Wu Fengguang @ 2010-03-03 1:43 UTC (permalink / raw) To: Bret Towe Cc: Dave Chinner, Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Mar 03, 2010 at 04:14:33AM +0800, Bret Towe wrote: > how do you determine which bdi to use? I skimmed thru > the filesystem in /sys and didn't see anything that says which is what MOUNTPOINT=" /mnt/ext4_test " # grep "$MOUNTPOINT" /proc/$$/mountinfo|awk '{print $3}' 0:24 Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
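Where a mountpoint(1) with -d support is available, it prints the same MAJ:MIN directly (a sketch; the mount point is the one from the test above):

#!/bin/sh
# Print the device number backing a mount point, e.g. 0:24.
mountpoint -d /mnt/ext4_test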
* Re: [RFC] nfs: use 2*rsize readahead size [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org> @ 2010-02-24 11:18 ` Akshat Aranya 2010-02-25 12:37 ` Wu Fengguang 0 siblings, 1 reply; 21+ messages in thread From: Akshat Aranya @ 2010-02-24 11:18 UTC (permalink / raw) To: Dave Chinner Cc: Wu Fengguang, Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 12:22 AM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote: > >> It sounds silly to have >> >> client_readahead_size > server_readahead_size > > I don't think it is - the client readahead has to take into account > the network latency as well as the server latency. e.g. a network > with a high bandwidth but high latency is going to need much more > client side readahead than a high bandwidth, low latency network to > get the same throughput. Hence it is not uncommon to see larger > readahead windows on network clients than for local disk access. > > Also, the NFS server may not even be able to detect sequential IO > patterns because of the combined access patterns from the clients, > and so the only effective readahead might be what the clients > issue.... > In my experiments, I have observed that the server-side readahead shuts off rather quickly even with a single client because the client readahead causes multiple pending read RPCs on the server which are then serviced in random order and the pattern observed by the underlying file system is non-sequential. In our file system, we had to override what the VFS thought was a random workload and continue to do readahead anyway. Cheers, Akshat -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size 2010-02-24 11:18 ` [RFC] nfs: use 2*rsize " Akshat Aranya @ 2010-02-25 12:37 ` Wu Fengguang 0 siblings, 0 replies; 21+ messages in thread From: Wu Fengguang @ 2010-02-25 12:37 UTC (permalink / raw) To: Akshat Aranya Cc: Dave Chinner, Trond Myklebust, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML On Wed, Feb 24, 2010 at 07:18:26PM +0800, Akshat Aranya wrote: > On Wed, Feb 24, 2010 at 12:22 AM, Dave Chinner <david@fromorbit.com> wrote: > > > > >> It sounds silly to have > >> > >> client_readahead_size > server_readahead_size > > > > I don't think it is - the client readahead has to take into account > > the network latency as well as the server latency. e.g. a network > > with a high bandwidth but high latency is going to need much more > > client side readahead than a high bandwidth, low latency network to > > get the same throughput. Hence it is not uncommon to see larger > > readahead windows on network clients than for local disk access. > > > > Also, the NFS server may not even be able to detect sequential IO > > patterns because of the combined access patterns from the clients, > > and so the only effective readahead might be what the clients > > issue.... > > > > In my experiments, I have observed that the server-side readahead > shuts off rather quickly even with a single client because the client > readahead causes multiple pending read RPCs on the server which are > then serviced in random order and the pattern observed by the > underlying file system is non-sequential. In our file system, we had > to override what the VFS thought was a random workload and continue to > do readahead anyway. What's the server side kernel version, plus client/server side readahead size? I'd expect the context readahead to handle it well. With the patchset in <http://lkml.org/lkml/2010/2/23/376>, you can actually see the readahead details: # echo 1 > /debug/tracing/events/readahead/enable # cp test-file /dev/null # cat /debug/tracing/trace # trimmed output readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4 readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8 readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16 readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32 readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24 readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0 And I've actually verified the NFS case with the help of such traces long ago. When client_readahead_size <= server_readahead_size, the readahead requests may look a bit random at first, and then will quickly turn into a perfect series of sequential context readaheads. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] nfs: use 2*rsize readahead size
  2010-02-24  3:29 ` Dave Chinner
  2010-02-24  4:18   ` Wu Fengguang
@ 2010-02-24  4:24   ` Dave Chinner
  2010-02-24  4:33     ` Wu Fengguang
  [not found]         ` <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
  1 sibling, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2010-02-24 4:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Trond Myklebust, linux-nfs, linux-fsdevel,
      Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
[...]
> > So it seems that readahead_size=2*rsize (ie. keep two RPC requests
> > in flight) can already get near full NFS bandwidth.
[...]
> That's doing a cached read out of the server cache, right? You
> might find the results are different if the server has to read the
> file from disk. I would expect reads from the server cache not
> to require much readahead as there is no IO latency on the server
> side for the readahead to hide....

FWIW, if you mount the client with "-o rsize=32k", or if the server only
supports rsize <= 32k, then this will probably hurt throughput a lot,
because readahead will then be capped at 64k instead of 480k....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
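The small-rsize case is easy to reproduce by forcing rsize at mount time and re-running the dd test; a sketch, with placeholder server and export names:

#!/bin/sh
# Sketch: force a small rsize to test the capped-readahead concern.
# "server:/export" is a placeholder for the actual test setup.
mount -t nfs -o rsize=32768 server:/export /mnt
grep ' /mnt ' /proc/mounts      # confirm the rsize the server actually granted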
* Re: [RFC] nfs: use 2*rsize readahead size
  2010-02-24  4:24 ` Dave Chinner
@ 2010-02-24  4:33   ` Wu Fengguang
  0 siblings, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-02-24 4:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Trond Myklebust, linux-nfs@vger.kernel.org,
      linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
[...]
> FWIW, if you mount the client with "-o rsize=32k", or if the server only
> supports rsize <= 32k, then this will probably hurt throughput a lot,
> because readahead will then be capped at 64k instead of 480k....

That's why I take the max of 2*rsize and the system default readahead
size (which will be enlarged to 512K):

-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					2 * server->rpages);

Thanks,
Fengguang
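A quick back-of-the-envelope comparison makes the effect of the clamp concrete; this assumes the old NFS_MAX_READAHEAD=15 multiplier and a 512k system default:

#!/bin/sh
# Sketch: old client readahead default (15 * rsize) vs. the patched default
# (max(system default, 2 * rsize)), assuming a 512k system default.
for rsize_kb in 32 512
do
	old_kb=$((rsize_kb * 15))
	new_kb=$((rsize_kb * 2))
	[ "$new_kb" -lt 512 ] && new_kb=512
	echo "rsize=${rsize_kb}k: old ${old_kb}k, new ${new_kb}k"
done
# prints:
#   rsize=32k:  old 480k,  new 512k
#   rsize=512k: old 7680k, new 1024k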
[parent not found: <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>]
* Re: [RFC] nfs: use 2*rsize readahead size
  [not found] ` <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
@ 2010-02-24  4:43   ` Wu Fengguang
  2010-02-24  5:24     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-02-24 4:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 10:41:01AM +0800, Wu Fengguang wrote:
[...]
> > That's doing a cached read out of the server cache, right? You
> > might find the results are different if the server has to read the
> > file from disk. I would expect reads from the server cache not
> > to require much readahead as there is no IO latency on the server
> > side for the readahead to hide....
>
> FWIW, if you mount the client with "-o rsize=32k", or if the server only
> supports rsize <= 32k, then this will probably hurt throughput a lot,
> because readahead will then be capped at 64k instead of 480k....

I should have mentioned that in the changelog; hope the updated one
helps.

Thanks,
Fengguang
---
nfs: use 2*rsize readahead size

With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
readahead size 512k*15=7680k is larger than necessary for typical
clients.

On a e1000e--e1000e connection, I got the following numbers (this
reads a sparse file from the server and involves no disk IO):

readahead size	throughput
	   16k	 35.5 MB/s
	   32k	 54.3 MB/s
	   64k	 64.1 MB/s
	  128k	 70.5 MB/s
	  256k	 74.6 MB/s
rsize ==> 512k	 77.4 MB/s
	 1024k	 85.5 MB/s
	 2048k	 86.8 MB/s
	 4096k	 87.9 MB/s
	 8192k	 89.0 MB/s
	16384k	 87.7 MB/s

So it seems that readahead_size=2*rsize (ie. keep two RPC requests in
flight) can already get near full NFS bandwidth.

To avoid small readahead when the client mounts with "-o rsize=32k" or
the server only supports rsize <= 32k, we take the max of 2*rsize and
default_backing_dev_info.ra_pages. The latter defaults to 512K, is
scaled down automatically when system memory is less than 512M, and can
be changed explicitly by the user with the kernel parameter
"readahead=".

The test script is:

#!/bin/sh

file=/mnt/sparse
BDI=0:15

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
	echo readahead_size=${rasize}k
	dd if=$file of=/dev/null bs=4k count=1024000
done

CC: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
CC: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/nfs/client.c   |    4 +++-
 fs/nfs/internal.h |    8 --------
 2 files changed, 3 insertions(+), 9 deletions(-)

--- linux.orig/fs/nfs/client.c	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/client.c	2010-02-24 10:16:00.000000000 +0800
@@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
 	server->backing_dev_info.name = "nfs";
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	server->backing_dev_info.ra_pages = max_t(unsigned long,
+					default_backing_dev_info.ra_pages,
+					2 * server->rpages);
 	server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
 
 	if (server->wsize > max_rpc_payload)
--- linux.orig/fs/nfs/internal.h	2010-02-23 11:15:44.000000000 +0800
+++ linux/fs/nfs/internal.h	2010-02-23 13:26:00.000000000 +0800
@@ -10,14 +10,6 @@
 
 struct nfs_string;
 
-/* Maximum number of readahead requests
- * FIXME: this should really be a sysctl so that users may tune it to suit
- *	  their needs. People that do NFS over a slow network, might for
- *	  instance want to reduce it to something closer to 1 for improved
- *	  interactive response.
- */
-#define NFS_MAX_READAHEAD	(RPC_DEF_SLOT_TABLE - 1)
-
 /*
  * Determine if sessions are in use.
  */
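Whatever the compiled-in default ends up being, the effective window for a given NFS mount can still be inspected and overridden at runtime through its BDI, as the thread's own test script does. A sketch, assuming util-linux's mountpoint(1) for the maj:min lookup:

#!/bin/sh
# Sketch: inspect and override the client readahead window for one NFS mount
# at runtime, independent of the compiled-in default.
MNT=/mnt
BDI=$(mountpoint -d "$MNT")                               # e.g. "0:15"
cat /sys/devices/virtual/bdi/$BDI/read_ahead_kb           # current window, KB
echo 1024 > /sys/devices/virtual/bdi/$BDI/read_ahead_kb   # e.g. 2*rsize for rsize=512k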
* Re: [RFC] nfs: use 2*rsize readahead size
  2010-02-24  4:43 ` Wu Fengguang
@ 2010-02-24  5:24   ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2010-02-24 5:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Trond Myklebust, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      Linux Memory Management List, LKML

On Wed, Feb 24, 2010 at 12:43:56PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 12:24:14PM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 02:29:34PM +1100, Dave Chinner wrote:
[...]
> > FWIW, if you mount the client with "-o rsize=32k", or if the server only
> > supports rsize <= 32k, then this will probably hurt throughput a lot,
> > because readahead will then be capped at 64k instead of 480k....
>
> I should have mentioned that in the changelog; hope the updated one
> helps.

Sorry, my fault for not reading the code correctly.

Cheers,
Dave.
--
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
end of thread, other threads: [~2010-04-14 21:22 UTC | newest]

Thread overview: 21+ messages
2010-02-24  2:41 [RFC] nfs: use 2*rsize readahead size Wu Fengguang
2010-02-24  3:29 ` Dave Chinner
2010-02-24  4:18 ` Wu Fengguang
2010-02-24  5:22 ` Dave Chinner
2010-02-24  6:12 ` Wu Fengguang
2010-02-24  7:39 ` Dave Chinner
2010-02-26  7:49 ` [RFC] nfs: use 4*rsize " Wu Fengguang
2010-03-02  3:10 ` Wu Fengguang
2010-03-02 14:19 ` Trond Myklebust
2010-03-02 17:33 ` John Stoffel
     [not found] ` <19341.19446.356359.99958-HgN6juyGXH5AfugRpC6u6w@public.gmane.org>
2010-03-02 18:42 ` Trond Myklebust
2010-03-03  3:27 ` Wu Fengguang
2010-04-14 21:22 ` Dean Hildebrand
2010-03-02 20:14 ` Bret Towe
2010-03-03  1:43 ` Wu Fengguang
     [not found] ` <20100224052215.GH16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
2010-02-24 11:18 ` [RFC] nfs: use 2*rsize " Akshat Aranya
2010-02-25 12:37 ` Wu Fengguang
2010-02-24  4:24 ` Dave Chinner
2010-02-24  4:33 ` Wu Fengguang
     [not found] ` <20100224042414.GG16175-CJ6yYqJ1V6CgjvmRZuSThA@public.gmane.org>
2010-02-24  4:43 ` Wu Fengguang
2010-02-24  5:24 ` Dave Chinner