* [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
@ 2008-06-10 18:54 Dean Hildebrand
  2008-06-11 19:48 ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-10 18:54 UTC
  To: linux-nfs

Hello,

The motivation for this patch is improved WAN write performance plus 
greater user control on the server of the TCP buffer values (window 
size).  The TCP window determines the amount of outstanding data that a 
client can have on the wire and should be large enough that an NFS client 
can fill up the pipe (the bandwidth * delay product).  Currently the TCP 
receive buffer size (used for client writes) is set very low, which 
prevents a client from filling up a network pipe with a large bandwidth 
* delay product.

Currently, the server TCP send window is set to accommodate the maximum 
number of outstanding NFSD read requests (# nfsds * maxiosize), while 
the server TCP receive window is set to a fixed value which can hold a 
few requests.  While these values set a TCP window size that is fine in 
LAN environments with a small BDP, WAN environments can require a much 
larger TCP window size, e.g., a 10GigE transatlantic link with a rtt of 
120 ms has a BDP of approx 60MB.
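
The bandwidth * delay product mentioned above can be worked out directly. The sketch below is illustrative only; `bdp_bytes` is a hypothetical helper, and it assumes the full nominal line rate is available. Under that assumption a 10 Gbit/s link at 120 ms comes out at about 150 MB, so the ~60 MB figure quoted above would correspond to roughly 4 Gbit/s of sustained throughput at that rtt.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes: (bits/s * s) / 8."""
    return bandwidth_bits_per_s * rtt_s / 8


# Test setup described in this mail: GigE with 30 ms of added delay.
gige_30ms = bdp_bytes(1e9, 0.030)
# Example from the paragraph above, ASSUMING full 10 Gbit/s line rate.
tengige_120ms = bdp_bytes(10e9, 0.120)

print(f"GigE, 30 ms:    {gige_30ms / 1e6:.2f} MB")
print(f"10GigE, 120 ms: {tengige_120ms / 1e6:.2f} MB")
```

For the GigE test network used in the experiment, the same formula gives about 3.75 MB.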

I have a patch to net/sunrpc/svcsock.c that allows a user to manually set 
the server TCP send and receive buffer sizes through the sysctl interface 
to suit the required TCP window of their network architecture.  It adds two 
/proc entries, one for the receive buffer size and one for the send 
buffer size:
/proc/sys/sunrpc/tcp_sndbuf
/proc/sys/sunrpc/tcp_rcvbuf

The patch uses the current buffer sizes in the code as minimum values, 
which the user cannot decrease.  If the user sets a value of 0 in either 
/proc entry, it resets the buffer size to the default value.  The 
/proc values take effect when the TCP connection is initialized (mount 
time).  The values are bounded above by the *minimum* of the /proc 
values and the network TCP sysctls.
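
The rules in the paragraph above can be summarized in a few lines. This is a sketch of the described behaviour, not the patch itself: `effective_bufsize` and its parameter names are hypothetical, and the choice to let the floor win when the default exceeds the system ceiling is an assumption.

```python
def effective_bufsize(requested: int, default_min: int, system_max: int) -> int:
    """Pick the buffer size a new server connection would get.

    requested   -- value written to the sunrpc /proc entry (0 means "default")
    default_min -- size the server code computes today; acts as a floor
    system_max  -- ceiling taken from the network TCP sysctls (assumed bytes)
    """
    if requested == 0:
        return default_min
    # Cap at the system ceiling first, then apply the floor.
    # ASSUMPTION: the floor wins if default_min > system_max.
    return max(default_min, min(requested, system_max))


print(effective_bufsize(0, 3158016, 16777216))         # reset to default
print(effective_bufsize(1024, 3158016, 16777216))      # below the floor
print(effective_bufsize(64 << 20, 3158016, 16777216))  # above the ceiling
```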

To demonstrate the usefulness of this patch, details of an experiment 
between 2 computers with a rtt of 30 ms are provided below.  In this 
experiment, increasing the server /proc/sys/sunrpc/tcp_rcvbuf value 
doubles write performance.

EXPERIMENT
==========
This experiment simulates a WAN by using tc together with netem to add a 
30 ms delay to all packets on an NFS client.  The goal is to show that by 
only changing tcp_rcvbuf, the NFS client can increase write performance 
in the WAN.  To verify the patch has the desired effect on the TCP 
window, I created two tcptrace plots that show the difference in TCP 
window behaviour before and after the server TCP rcvbuf size is 
increased.  With the default server TCP buffer size of 6M, the TCP window 
tops out around 4.6M, whereas with the server TCP buffer size increased 
to 32M, the TCP window tops out around 13M.  Performance jumps from 
43 MB/s to 90 MB/s.

Hardware:
2 dual-core opteron blades
GigE, Broadcom NetXtreme II BCM57065 cards
A single gigabit switch in the middle
1500 MTU
8 GB memory

Software:
Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
RHEL4

NFS Configuration:
64 rpc slots
32 nfsds
Export ext3 file system.  This disk is quite slow, so I exported it 
using async to reduce the effect of the disk on the back end.  This way, 
the experiments record the time it takes for the data to get to the 
server (not to the disk).
# exportfs -v
/export     <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)

# cat /proc/mounts
bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0

fs.nfs.nfs_congestion_kb = 91840
net.ipv4.tcp_congestion_control = cubic

Network tc Command executed on client:
tc qdisc add dev eth0 root netem delay 30ms
rtt from client (bear108) to server (bear109):
# ping bear109
PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0 ttl=64 time=31.4 ms
64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1 ttl=64 time=32.0 ms

TCP Configuration on client and server:
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename
# Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maximum size of a message queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
### IPV4 specific settings
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
# on systems with a VERY fast bus -> memory interface this is the big gainer
net.ipv4.tcp_rmem = 4096 16777216 16777216
net.ipv4.tcp_wmem = 4096 16777216 16777216
net.ipv4.tcp_mem = 4096 16777216 16777216
### CORE settings (mostly for socket and UDP effect)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max =  16777216
net.core.netdev_max_backlog = 300000
# Don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# make sure we don't run out of memory
vm.min_free_kbytes = 32768
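
As a rough check of the settings above against a given link, the maximum in a `min default max` triplet can be compared with the bandwidth * delay product. The helpers `parse_triplet` and `covers_bdp` are hypothetical names used only for this sketch. Note that net.ipv4.tcp_mem is counted in pages rather than bytes, so a byte comparison like this applies to tcp_rmem and tcp_wmem only.

```python
def parse_triplet(text: str) -> tuple:
    """Parse a 'min default max' sysctl value such as tcp_rmem."""
    lo, default, hi = (int(field) for field in text.split())
    return lo, default, hi


def covers_bdp(tcp_rmem: str, bandwidth_bits_per_s: float, rtt_s: float) -> bool:
    """True if the tcp_rmem maximum is at least one bandwidth*delay product."""
    _, _, max_bytes = parse_triplet(tcp_rmem)
    return max_bytes >= bandwidth_bits_per_s * rtt_s / 8


# Value from the configuration above; on a live system it could be read
# from /proc/sys/net/ipv4/tcp_rmem instead.
rmem = "4096 16777216 16777216"
print(covers_bdp(rmem, 1e9, 0.030))    # GigE, 30 ms rtt
print(covers_bdp(rmem, 10e9, 0.120))   # 10GigE at line rate, 120 ms rtt
```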

Experiments:

On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
[root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
[root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
3158016
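
One decomposition that reproduces the default of 3158016 shown above is three maximum-size messages, each a 1 MiB payload plus one 4 KiB page. This is offered as an assumption that happens to match the number, not as a statement of what the server code does; the constant names are hypothetical.

```python
PAGE_SIZE = 4096          # ASSUMPTION: 4 KiB pages on the test machines
MAX_PAYLOAD = 1048576     # matches rsize/wsize=1048576 in /proc/mounts
MESSAGES_BUFFERED = 3     # ASSUMPTION: "a few requests" taken to mean three

max_message = MAX_PAYLOAD + PAGE_SIZE
default_rcvbuf = MESSAGES_BUFFERED * max_message
# "the real tcp buffer size is double tcp_rcvbuf"
real_buffer = 2 * default_rcvbuf

print(default_rcvbuf)          # 3158016
print(real_buffer / 2**20)     # 6.0234375 (MiB)
```

Doubling the result gives roughly 6M, which agrees with the default TCP buffer size quoted in the experiment description.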

On Client:
mount -t nfs bear109:/export /mnt
[root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
...
             KB  reclen   write
         512000    1024   43252
umount /mnt

On server:
[root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
[root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
16777216

On Client:
mount -t nfs bear109:/export /mnt
[root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
...
             KB  reclen   write
         512000    1024   90396
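
The two iozone results can be put side by side with a little arithmetic. This sketch assumes iozone's write column is in KB/s with 1 KB = 1024 bytes, and compares against the raw 1 Gbit/s line rate without accounting for protocol overhead.

```python
# iozone write throughput, ASSUMED to be KB/s with 1 KB = 1024 bytes.
before_kb_s = 43252   # default tcp_rcvbuf (3158016)
after_kb_s = 90396    # tcp_rcvbuf = 16777216

speedup = after_kb_s / before_kb_s
after_bits_s = after_kb_s * 1024 * 8
gige_fraction = after_bits_s / 1e9   # share of raw GigE line rate

print(f"speedup: {speedup:.2f}x")
print(f"fraction of 1 Gbit/s: {gige_fraction:.0%}")
```

Under those assumptions the larger buffer gives a little over a 2x improvement and brings the write rate to roughly three quarters of the raw link speed.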

Dean
IBM Almaden Research Center


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-10 18:54 [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values Dean Hildebrand
@ 2008-06-11 19:48 ` Chuck Lever
  2008-06-11 21:01   ` Talpey, Thomas
  2008-06-12 21:03   ` Dean Hildebrand
  0 siblings, 2 replies; 15+ messages in thread
From: Chuck Lever @ 2008-06-11 19:48 UTC
  To: Dean Hildebrand; +Cc: linux-nfs

Howdy Dean-

On Jun 10, 2008, at 2:54 PM, Dean Hildebrand wrote:
> The motivation for this patch is improved WAN write performance plus  
> greater user control on the server of the TCP buffer values (window  
> size).  The TCP window determines the amount of outstanding data  
> that a client can have on the wire and should be large enough that a  
> NFS client can fill up the pipe (the bandwidth * delay product).   
> Currently the TCP receive buffer size (used for client writes) is  
> set very low, which prevents a client from filling up a network pipe  
> with a large bandwidth * delay product.
>
> Currently, the server TCP send window is set to accommodate the  
> maximum number of outstanding NFSD read requests (# nfsds *  
> maxiosize), while the server TCP receive window is set to a fixed  
> value which can hold a few requests.  While these values set a TCP  
> window size that is fine in LAN environments with a small BDP, WAN  
> environments can require a much larger TCP window size, e.g., 10GigE  
> transatlantic link with a rtt of 120 ms has a BDP of approx 60MB.

Was the receive buffer size computation adjusted when support for  
large transfer sizes was recently added to the NFS server?

> I have a patch to net/svc/svcsock.c that allows a user to manually  
> set the server TCP send and receive buffer through the sysctl  
> interface. to suit the required TCP window of their network  
> architecture.  It adds two /proc entries, one for the receive buffer  
> size and one for the send buffer size:
> /proc/sys/sunrpc/tcp_sndbuf
> /proc/sys/sunrpc/tcp_rcvbuf

What I'm wondering is if we can find some algorithm to set the buffer  
and window sizes *automatically*.  Why can't the NFS server select an  
appropriately large socket buffer size by default?

Since the socket buffer size is just a limit (no memory is allocated)  
why, for example, shouldn't the buffer size be large for all  
environments that have sufficient physical memory?

> The uses the current buffer sizes in the code are as minimum values,  
> which the user cannot decrease.  If the user sets a value of 0 in  
> either /proc entry, it resets the buffer size to the default value.   
> The set /proc values are utilized when the TCP connection is  
> initialized (mount time).  The values are bounded above by the  
> *minimum* of the /proc values and the network TCP sysctls.
>
> To demonstrate the usefulness of this patch, details of an  
> experiment between 2 computers with a rtt of 30ms is provided  
> below.  In this experiment, increasing the server  
> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>
> EXPERIMENT
> ==========
> This experiment simulates a WAN by using tc together with netem to  
> add a 30 ms delay to all packets on a nfs client.  The goal is to  
> show that by only changing tcp_rcvbuf, the nfs client can increase  
> write performance in the WAN. To verify the patch has the desired  
> effect on the TCP window, I created two tcptrace plots that show the  
> difference in tcp window behaviour before and after the server TCP  
> rcvbuf size is increased.  When using the default server tcpbuf  
> value of 6M, we can see the TCP window top out around 4.6 M, whereas  
> increasing the server tcpbuf value to 32M, we can see that the TCP  
> window tops out around 13M.  Performance jumps from 43 MB/s to  
> 90 MB/s.
>
> Hardware:
> 2 dual-core opteron blades
> GigE, Broadcom NetXtreme II BCM57065 cards
> A single gigabit switch in the middle
> 1500 MTU
> 8 GB memory
>
> Software:
> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
> RHEL4
>
> NFS Configuration:
> 64 rpc slots
> 32 nfsds
> Export ext3 file system.  This disk is quite slow, I therefore  
> exported using async to reduce the effect of the disk on the back  
> end.  This way, the experiments record the time it takes for the  
> data to get to the server (not to the disk).
> # exportfs -v
> /export     <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>
> # cat /proc/mounts
> bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0
>
> fs.nfs.nfs_congestion_kb = 91840
> net.ipv4.tcp_congestion_control = cubic
>
> Network tc Command executed on client:
> tc qdisc add dev eth0 root netem delay 30ms
> rtt from client (bear108) to server (bear109)
> #ping bear109
> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0  
> ttl=64 time=31.4 ms
> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1  
> ttl=64 time=32.0 ms
>
> TCP Configuration on client and server:
> # Controls IP packet forwarding
> net.ipv4.ip_forward = 0
> # Controls source route verification
> net.ipv4.conf.default.rp_filter = 1
> # Do not accept source routing
> net.ipv4.conf.default.accept_source_route = 0
> # Controls the System Request debugging functionality of the kernel
> kernel.sysrq = 0
> # Controls whether core dumps will append the PID to the core filename
> # Useful for debugging multi-threaded applications
> kernel.core_uses_pid = 1
> # Controls the use of TCP syncookies
> net.ipv4.tcp_syncookies = 1
> # Controls the maximum size of a message, in bytes
> kernel.msgmnb = 65536
> # Controls the default maxmimum size of a mesage queue
> kernel.msgmax = 65536
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 68719476736
> # Controls the maximum number of shared memory segments, in pages
> kernel.shmall = 4294967296
> ### IPV4 specific settings
> net.ipv4.tcp_timestamps = 0
> net.ipv4.tcp_sack = 1
> # on systems with a VERY fast bus -> memory interface this is the big gainer
> net.ipv4.tcp_rmem = 4096 16777216 16777216
> net.ipv4.tcp_wmem = 4096 16777216 16777216
> net.ipv4.tcp_mem = 4096 16777216 16777216
> ### CORE settings (mostly for socket and UDP effect)
> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.core.rmem_default = 16777216
> net.core.wmem_default = 16777216
> net.core.optmem_max =  16777216
> net.core.netdev_max_backlog = 300000
> # Don't cache ssthresh from previous connection
> net.ipv4.tcp_no_metrics_save = 1
> # make sure we don't run out of memory
> vm.min_free_kbytes = 32768
>
> Experiments:
>
> On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
> 3158016
>
> On Client:
> mount -t nfs bear109:/export /mnt
> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
> ...
>            KB  reclen   write
>        512000    1024   43252
> umount /mnt
>
> On server:
> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
> 16777216
>
> On Client:
> mount -t nfs bear109:/export /mnt
> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
> ...
>            KB  reclen   write
>        512000    1024   90396

The numbers you have here are averages over the whole run.  Performing  
these tests using a variety of record lengths and file sizes (up to  
several tens of gigabytes) would be useful to see where different  
memory and network latencies kick in.

In addition, have you looked at network traces to see if the server's  
TCP implementation is behaving optimally (or near optimally)?  Have  
you tried using some of the more esoteric TCP congestion algorithms  
available in 2.6 kernels?

There are also fairly unsophisticated ways to add longer delays on  
your test network, and turning up the latency knob would be a useful  
test.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-11 19:48 ` Chuck Lever
@ 2008-06-11 21:01   ` Talpey, Thomas
  2008-06-12 21:03   ` Dean Hildebrand
  1 sibling, 0 replies; 15+ messages in thread
From: Talpey, Thomas @ 2008-06-11 21:01 UTC
  To: Chuck Lever; +Cc: Dean Hildebrand, linux-nfs

At 03:48 PM 6/11/2008, Chuck Lever wrote:
>There are also fairly unsophisticated ways to add longer delays on  
>your test network, and turning up the latency knob would be a useful  
>test.

IPFW can be used to limit effective bandwidth and insert latencies in
strange, wonderful and easy-to-configure ways. I think there's a
preconfigured profile for an earth-moon link, for instance. Using a pair
of virtual client/server instances on a single machine would be a great
way to prototype this (no wire bottlenecks - pure simulation), too.

Just an idea!

Tom.



* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-11 19:48 ` Chuck Lever
  2008-06-11 21:01   ` Talpey, Thomas
@ 2008-06-12 21:03   ` Dean Hildebrand
  2008-06-13 18:51     ` Chuck Lever
  2008-06-13 20:53     ` J. Bruce Fields
  1 sibling, 2 replies; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-12 21:03 UTC
  To: Chuck Lever; +Cc: linux-nfs

Hi Chuck,


Chuck Lever wrote:
> Howdy Dean-
>
> On Jun 10, 2008, at 2:54 PM, Dean Hildebrand wrote:
>> The motivation for this patch is improved WAN write performance plus 
>> greater user control on the server of the TCP buffer values (window 
>> size).  The TCP window determines the amount of outstanding data that 
>> a client can have on the wire and should be large enough that a NFS 
>> client can fill up the pipe (the bandwidth * delay product).  
>> Currently the TCP receive buffer size (used for client writes) is set 
>> very low, which prevents a client from filling up a network pipe with 
>> a large bandwidth * delay product.
>>
>> Currently, the server TCP send window is set to accommodate the 
>> maximum number of outstanding NFSD read requests (# nfsds * 
>> maxiosize), while the server TCP receive window is set to a fixed 
>> value which can hold a few requests.  While these values set a TCP 
>> window size that is fine in LAN environments with a small BDP, WAN 
>> environments can require a much larger TCP window size, e.g., 10GigE 
>> transatlantic link with a rtt of 120 ms has a BDP of approx 60MB.
>
> Was the receive buffer size computation adjusted when support for 
> large transfer sizes was recently added to the NFS server?
Yes, it is based on the transfer size.  So in the current code, having a 
larger transfer size can improve efficiency PLUS help create a larger 
possible TCP window.  The issue seems to be that the TCP window, # of NFSDs, 
and transfer size are all independent variables that need to be tuned 
individually depending on rtt, network bandwidth, disk bandwidth, etc.  
We can already adjust the last 2, so this patch helps adjust the first 
(the TCP window).
>
>> I have a patch to net/svc/svcsock.c that allows a user to manually 
>> set the server TCP send and receive buffer through the sysctl 
>> interface. to suit the required TCP window of their network 
>> architecture.  It adds two /proc entries, one for the receive buffer 
>> size and one for the send buffer size:
>> /proc/sys/sunrpc/tcp_sndbuf
>> /proc/sys/sunrpc/tcp_rcvbuf
>
> What I'm wondering is if we can find some algorithm to set the buffer 
> and window sizes *automatically*.  Why can't the NFS server select an 
> appropriately large socket buffer size by default?

>
> Since the socket buffer size is just a limit (no memory is allocated) 
> why, for example, shouldn't the buffer size be large for all 
> environments that have sufficient physical memory?
I think the problem there is that the only way to set the buffer size 
automatically would be to know the rtt and bandwidth of the network 
connection.  Excessive numbers of packets can get dropped if the TCP 
buffer is set too large for a specific network connection.  In this 
case, the window opens too wide and lets too many packets out into the 
system.  Somewhere along the path buffers start overflowing and packets 
are lost, then TCP congestion avoidance kicks in and cuts the window size 
dramatically, and performance along with it.  This type of behaviour 
creates a sawtooth pattern for the TCP window, which is less favourable 
than the steadier pattern that results when the TCP buffer size 
is set appropriately.
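
The cost of the sawtooth can be illustrated with a toy model. The sketch below is a simplified additive-increase/multiplicative-decrease flow, not the CUBIC or BIC algorithms discussed in this thread, which use different growth and back-off rules; `average_window` and `capacity` are hypothetical names.

```python
def average_window(capacity: int, rounds: int = 10_000) -> float:
    """Average window of a toy AIMD flow, in packets.

    capacity -- packets the path (pipe plus queues) can hold in flight.
    The window grows by one packet per round trip and is halved whenever
    it exceeds capacity, which stands in for a loss event.
    """
    window = capacity // 2
    total = 0
    for _ in range(rounds):
        total += window
        window += 1
        if window > capacity:
            window = capacity // 2   # multiplicative decrease after loss
    return total / rounds


capacity = 1000
ratio = average_window(capacity) / capacity
print(f"{ratio:.2f}")   # roughly 0.75 of the peak window
```

In this model the window spends its time ramping between half and full capacity, so on average only about three quarters of the peak window is in use, whereas a window that sits steadily at the right size would use all of it.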

Another point is that setting the buffer size isn't always a 
straightforward process.  All the papers I've read on the subject say, 
and my experience confirms, that setting TCP buffer sizes is more of an 
art.

So having the server set a good default value is half the battle, but 
allowing users to twiddle with this value is vital.
>
>> The uses the current buffer sizes in the code are as minimum values, 
>> which the user cannot decrease.  If the user sets a value of 0 in 
>> either /proc entry, it resets the buffer size to the default value.  
>> The set /proc values are utilized when the TCP connection is 
>> initialized (mount time).  The values are bounded above by the 
>> *minimum* of the /proc values and the network TCP sysctls.
>>
>> To demonstrate the usefulness of this patch, details of an experiment 
>> between 2 computers with a rtt of 30ms is provided below.  In this 
>> experiment, increasing the server /proc/sys/sunrpc/tcp_rcvbuf value 
>> doubles write performance.
>>
>> EXPERIMENT
>> ==========
>> This experiment simulates a WAN by using tc together with netem to 
>> add a 30 ms delay to all packets on a nfs client.  The goal is to 
>> show that by only changing tcp_rcvbuf, the nfs client can increase 
>> write performance in the WAN. To verify the patch has the desired 
>> effect on the TCP window, I created two tcptrace plots that show the 
>> difference in tcp window behaviour before and after the server TCP 
>> rcvbuf size is increased.  When using the default server tcpbuf value 
>> of 6M, we can see the TCP window top out around 4.6 M, whereas 
>> increasing the server tcpbuf value to 32M, we can see that the TCP 
>> window tops out around 13M.  Performance jumps from 43 MB/s to 90 MB/s.
>>
>> Hardware:
>> 2 dual-core opteron blades
>> GigE, Broadcom NetXtreme II BCM57065 cards
>> A single gigabit switch in the middle
>> 1500 MTU
>> 8 GB memory
>>
>> Software:
>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>> RHEL4
>>
>> NFS Configuration:
>> 64 rpc slots
>> 32 nfsds
>> Export ext3 file system.  This disk is quite slow, I therefore 
>> exported using async to reduce the effect of the disk on the back 
>> end.  This way, the experiments record the time it takes for the data 
>> to get to the server (not to the disk).
>> # exportfs -v
>> /export     <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>
>> # cat /proc/mounts
>> bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0
>>
>> fs.nfs.nfs_congestion_kb = 91840
>> net.ipv4.tcp_congestion_control = cubic
>>
>> Network tc Command executed on client:
>> tc qdisc add dev eth0 root netem delay 30ms
>> rtt from client (bear108) to server (bear109)
>> #ping bear109
>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0 ttl=64 
>> time=31.4 ms
>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1 ttl=64 
>> time=32.0 ms
>>
>> TCP Configuration on client and server:
>> # Controls IP packet forwarding
>> net.ipv4.ip_forward = 0
>> # Controls source route verification
>> net.ipv4.conf.default.rp_filter = 1
>> # Do not accept source routing
>> net.ipv4.conf.default.accept_source_route = 0
>> # Controls the System Request debugging functionality of the kernel
>> kernel.sysrq = 0
>> # Controls whether core dumps will append the PID to the core filename
>> # Useful for debugging multi-threaded applications
>> kernel.core_uses_pid = 1
>> # Controls the use of TCP syncookies
>> net.ipv4.tcp_syncookies = 1
>> # Controls the maximum size of a message, in bytes
>> kernel.msgmnb = 65536
>> # Controls the default maxmimum size of a mesage queue
>> kernel.msgmax = 65536
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
>> ### IPV4 specific settings
>> net.ipv4.tcp_timestamps = 0
>> net.ipv4.tcp_sack = 1
>> # on systems with a VERY fast bus -> memory interface this is the big gainer
>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>> net.ipv4.tcp_mem = 4096 16777216 16777216
>> ### CORE settings (mostly for socket and UDP effect)
>> net.core.rmem_max = 16777216
>> net.core.wmem_max = 16777216
>> net.core.rmem_default = 16777216
>> net.core.wmem_default = 16777216
>> net.core.optmem_max =  16777216
>> net.core.netdev_max_backlog = 300000
>> # Don't cache ssthresh from previous connection
>> net.ipv4.tcp_no_metrics_save = 1
>> # make sure we don't run out of memory
>> vm.min_free_kbytes = 32768
>>
>> Experiments:
>>
>> On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>> 3158016
>>
>> On Client:
>> mount -t nfs bear109:/export /mnt
>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>> ...
>>            KB  reclen   write
>>        512000    1024   43252
>> umount /mnt
>>
>> On server:
>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>> 16777216
>>
>> On Client:
>> mount -t nfs bear109:/export /mnt
>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>> ...
>>            KB  reclen   write
>>        512000    1024   90396
>
> The numbers you have here are averages over the whole run.  Performing 
> these tests using a variety of record lengths and file sizes (up to 
> several tens of gigabytes) would be useful to see where different 
> memory and network latencies kick in.
Definitely useful, although I'm not sure how this relates to this 
patch.  This patch isn't trying to alter default values, predict 
buffer sizes based on rtt values, or dynamically alter the TCP window 
based on dropped packets; it just gives users the ability to 
customize the server TCP buffer size.  The information you are curious 
about is more relevant to creating better default values for the TCP 
buffer size.  That could be useful, but it would be a long process, and 
there are so many variables that I'm not sure you could pick proper 
default values anyway.  The important thing is that the client can 
currently set its TCP buffer size via the sysctls, but this is useless if 
the server is stuck at a fixed value, since the TCP window will be the 
minimum of the client's and server's TCP buffer sizes.  The server cannot 
simply do the same thing as the client and rely on the TCP sysctls, 
because it also needs to ensure it has enough buffer space for each 
NFSD.

My goal with this patch is to provide users with the same flexibility 
that the client has regarding tcp buffer sizes, but also ensure that the 
minimum amount of buffer space that the NFSDs require is allocated.
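
For comparison, an ordinary user-space program asks for a per-socket receive buffer with SO_RCVBUF. The sketch below is only an analogue of what the thread discusses, since the in-kernel NFS server does not go through this call; `request_rcvbuf` is a hypothetical helper. On Linux the kernel doubles the requested value to leave room for bookkeeping and caps the request at net.core.rmem_max, which parallels the doubling noted in the experiment.

```python
import socket


def request_rcvbuf(requested: int) -> int:
    """Ask for a receive buffer of `requested` bytes; return what was granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


requested = 65536
granted = request_rcvbuf(requested)
# On Linux the granted value is typically twice the request, provided the
# request does not exceed net.core.rmem_max; other systems may differ.
print(requested, granted)
```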
>
> In addition, have you looked at network traces to see if the server's 
> TCP implementation is behaving optimally (or near optimally)?  Have 
> you tried using some of the more esoteric TCP congestion algorithms 
> available in 2.6 kernels?
I guess you are asking if I'm sure that I'm fixing the right problem?  
Nothing is broken in terms of the TCP implementation; it just requires 
bigger buffers to handle a larger BDP.  iperf, bbcp, etc. all use the 
same TCP implementation and all work fine if given a large enough 
buffer size, so I know TCP is fine.  From reading WAN tuning papers, I 
know that a 3 MB server TCP buffer size (the current rcvbuf default 
in the Linux server) is not sufficient for a BDP of, for example, 60 MB or 
more.  I've tried every TCP congestion algorithm available in the kernel 
at one point or another, but I've actually found bic to be the best in 
WAN environments since it is one of the most aggressive.
>
> There are also fairly unsophisticated ways to add longer delays on 
> your test network, and turning up the latency knob would be a useful 
> test.
My experiment uses tc with netem to control the latency, so I can run 
any experiment, but I don't learn a lot beyond the experiment that I've 
presented.  Essentially, the bigger the BDP, the bigger your tcp buffers 
need to be.

The NFS client currently leaves tcp buffer sizes to the user, and I 
would prefer to do the same on the server via a sysctl.
Dean
>
> -- 
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-12 21:03   ` Dean Hildebrand
@ 2008-06-13 18:51     ` Chuck Lever
  2008-06-13 20:56       ` J. Bruce Fields
  2008-06-14  1:07       ` Dean Hildebrand
  2008-06-13 20:53     ` J. Bruce Fields
  1 sibling, 2 replies; 15+ messages in thread
From: Chuck Lever @ 2008-06-13 18:51 UTC
  To: Dean Hildebrand; +Cc: linux-nfs

On Jun 12, 2008, at 5:03 PM, Dean Hildebrand wrote:
> Hi Chuck,
>
> Chuck Lever wrote:
>> Howdy Dean-
>>
>> On Jun 10, 2008, at 2:54 PM, Dean Hildebrand wrote:
>>> The motivation for this patch is improved WAN write performance  
>>> plus greater user control on the server of the TCP buffer values  
>>> (window size).  The TCP window determines the amount of  
>>> outstanding data that a client can have on the wire and should be  
>>> large enough that a NFS client can fill up the pipe (the bandwidth  
>>> * delay product).  Currently the TCP receive buffer size (used for  
>>> client writes) is set very low, which prevents a client from  
>>> filling up a network pipe with a large bandwidth * delay product.
>>>
>>> Currently, the server TCP send window is set to accommodate the  
>>> maximum number of outstanding NFSD read requests (# nfsds *  
>>> maxiosize), while the server TCP receive window is set to a fixed  
>>> value which can hold a few requests.  While these values set a TCP  
>>> window size that is fine in LAN environments with a small BDP, WAN  
>>> environments can require a much larger TCP window size, e.g.,  
>>> 10GigE transatlantic link with a rtt of 120 ms has a BDP of approx  
>>> 60MB.
>>
>> Was the receive buffer size computation adjusted when support for  
>> large transfer sizes was recently added to the NFS server?
> Yes, it is based on the transfer size.  So in the current code,  
> having a larger transfer size can improve efficiency PLUS help  
> create a larger possible TCP window.  The issue seems to be that tcp  
> window, # of NFSDs, and transfer size are all independent variables  
> that need to be tuned individually depending on rtt, network  
> bandwidth, disk bandwidth, etc etc...  We can adjust the last 2, so  
> this patch helps adjust the first (tcp window).
>>
>>> I have a patch to net/svc/svcsock.c that allows a user to manually  
>>> set the server TCP send and receive buffer through the sysctl  
>>> interface. to suit the required TCP window of their network  
>>> architecture.  It adds two /proc entries, one for the receive  
>>> buffer size and one for the send buffer size:
>>> /proc/sys/sunrpc/tcp_sndbuf
>>> /proc/sys/sunrpc/tcp_rcvbuf
>>
>> What I'm wondering is if we can find some algorithm to set the  
>> buffer and window sizes *automatically*.  Why can't the NFS server  
>> select an appropriately large socket buffer size by default?
>
>>
>> Since the socket buffer size is just a limit (no memory is  
>> allocated) why, for example, shouldn't the buffer size be large for  
>> all environments that have sufficient physical memory?
> I think the problem there is that the only way to set the buffer  
> size automatically would be to know the rtt and bandwidth of the  
> network connection.  Excessive numbers of packets can get dropped if  
> the TCP buffer is set too large for a specific network connection.

> In this case, the window opens too wide and lets too many packets  
> out into the system, somewhere along the path buffers start  
> overflowing and packets are lost, TCP congestion avoidance kicks in  
> and cuts the window size dramatically and performance along with  
> it.  This type of behaviour creates a sawtooth pattern for the TCP  
> window, which is less favourable than a more steady state pattern  
> that is created if the TCP buffer size is set appropriately.

Agreed it is a performance problem, but I thought some of the newer  
TCP congestion algorithms were specifically designed to address this  
by not closing the window as aggressively.

Once the window is wide open, then, it would appear that choosing a  
good congestion avoidance algorithm is also important.

> Another point is that setting the buffer size isn't always a  
> straightforward process.  All papers I've read on the subject, and  
> my experience confirms this, is that setting tcp buffer sizes is  
> more of an art.
>
> So having the server set a good default value is half the battle,  
> but allowing users to twiddle with this value is vital.

>>> The uses the current buffer sizes in the code are as minimum  
>>> values, which the user cannot decrease.  If the user sets a value  
>>> of 0 in either /proc entry, it resets the buffer size to the  
>>> default value.  The set /proc values are utilized when the TCP  
>>> connection is initialized (mount time).  The values are bounded  
>>> above by the *minimum* of the /proc values and the network TCP  
>>> sysctls.
>>>
>>> To demonstrate the usefulness of this patch, details of an
>>> experiment between 2 computers with a rtt of 30ms is provided
>>> below.  In this experiment, increasing the server
>>> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>>>
>>> EXPERIMENT
>>> ==========
>>> This experiment simulates a WAN by using tc together with netem to  
>>> add a 30 ms delay to all packets on a nfs client.  The goal is to  
>>> show that by only changing tcp_rcvbuf, the nfs client can increase  
>>> write performance in the WAN. To verify the patch has the desired  
>>> effect on the TCP window, I created two tcptrace plots that show  
>>> the difference in tcp window behaviour before and after the server  
>>> TCP rcvbuf size is increased.  When using the default server  
>>> tcpbuf value of 6M, we can see the TCP window top out around 4.6  
>>> M, whereas increasing the server tcpbuf value to 32M, we can see  
>>> that the TCP window tops out around 13M.  Performance jumps from  
>>> 43 MB/s to 90 MB/s.
>>>
>>> Hardware:
>>> 2 dual-core opteron blades
>>> GigE, Broadcom NetXtreme II BCM57065 cards
>>> A single gigabit switch in the middle
>>> 1500 MTU
>>> 8 GB memory
>>>
>>> Software:
>>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>>> RHEL4
>>>
>>> NFS Configuration:
>>> 64 rpc slots
>>> 32 nfsds
>>> Export ext3 file system.  This disk is quite slow, I therefore  
>>> exported using async to reduce the effect of the disk on the back  
>>> end.  This way, the experiments record the time it takes for the  
>>> data to get to the server (not to the disk).
>>> # exportfs -v
>>> /export <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>>
>>> # cat /proc/mounts
>>> bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0
>>>
>>> fs.nfs.nfs_congestion_kb = 91840
>>> net.ipv4.tcp_congestion_control = cubic
>>>
>>> Network tc Command executed on client:
>>> tc qdisc add dev eth0 root netem delay 30ms
>>> rtt from client (bear108) to server (bear109)
>>> #ping bear109
>>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0  
>>> ttl=64 time=31.4 ms
>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1  
>>> ttl=64 time=32.0 ms
>>>
>>> TCP Configuration on client and server:
>>> # Controls IP packet forwarding
>>> net.ipv4.ip_forward = 0
>>> # Controls source route verification
>>> net.ipv4.conf.default.rp_filter = 1
>>> # Do not accept source routing
>>> net.ipv4.conf.default.accept_source_route = 0
>>> # Controls the System Request debugging functionality of the kernel
>>> kernel.sysrq = 0
>>> # Controls whether core dumps will append the PID to the core  
>>> filename
>>> # Useful for debugging multi-threaded applications
>>> kernel.core_uses_pid = 1
>>> # Controls the use of TCP syncookies
>>> net.ipv4.tcp_syncookies = 1
>>> # Controls the maximum size of a message, in bytes
>>> kernel.msgmnb = 65536
>>> # Controls the default maxmimum size of a mesage queue
>>> kernel.msgmax = 65536
>>> # Controls the maximum shared segment size, in bytes
>>> kernel.shmmax = 68719476736
>>> # Controls the maximum number of shared memory segments, in pages
>>> kernel.shmall = 4294967296
>>> ### IPV4 specific settings
>>> net.ipv4.tcp_timestamps = 0
>>> net.ipv4.tcp_sack = 1
>>> # on systems with a VERY fast bus -> memory interface this is the  
>>> big gainer
>>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>>> net.ipv4.tcp_mem = 4096 16777216 16777216
>>> ### CORE settings (mostly for socket and UDP effect)
>>> net.core.rmem_max = 16777216
>>> net.core.wmem_max = 16777216
>>> net.core.rmem_default = 16777216
>>> net.core.wmem_default = 16777216
>>> net.core.optmem_max =  16777216
>>> net.core.netdev_max_backlog = 300000
>>> # Don't cache ssthresh from previous connection
>>> net.ipv4.tcp_no_metrics_save = 1
>>> # make sure we don't run out of memory
>>> vm.min_free_kbytes = 32768
>>>
>>> Experiments:
>>>
>>> On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
>>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>> 3158016
>>>
>>> On Client:
>>> mount -t nfs bear109:/export /mnt
>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>> ...
>>>           KB  reclen   write
>>>       512000    1024   43252
>>> umount /mnt
>>>
>>> On server:
>>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>> 16777216
>>>
>>> On Client:
>>> mount -t nfs bear109:/export /mnt
>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>> ...
>>>           KB  reclen   write
>>>       512000    1024   90396
>>
>> The numbers you have here are averages over the whole run.   
>> Performing these tests using a variety of record lengths and file  
>> sizes (up to several tens of gigabytes) would be useful to see  
>> where different memory and network latencies kick in.
> Definitely useful, although I'm not sure how this relates to this  
> patch.

It relates to the whole idea that this is a valid and useful parameter  
to tweak.

What your experiment shows is that there is some improvement when the  
TCP window is allowed to expand.  It does not demonstrate that the  
*best* way to provide this facility is to allow administrators to tune  
the server's TCP buffer sizes.

A single average number can hide a host of underlying sins.  This  
simple experiment, for example, does not demonstrate that TCP window  
size is the most significant issue here.  It does not show that it is  
more or less effective to adjust the window size than to select an  
appropriate congestion control algorithm (say, BIC).  It does not show  
whether the client and server are using TCP optimally.  It does not  
expose problems related to having a single data stream with one  
blocking head (eg SCTP can allow multiple streams over the same  
connection; or better performance might be achieved with multiple TCP  
connections, even if they allow only small windows).

> This patch isn't trying to alter default values, or predict buffer  
> sizes based on rtt values, or dynamically alter the tcp window based  
> on dropped packets, etc, it is just giving users the ability to  
> customize the server tcp buffer size.

I know you posted this patch because of the experiments at CITI with  
long-run 10GbE, and it's handy to now have this to experiment with.

It might also be helpful if we had a patch that made the server  
perform better in common environments, so a better default setting it  
seems to me would have greater value than simply creating a new tuning  
knob.

Would it be hard to add a metric or two with this tweak that would  
allow admins to see how often a socket buffer was completely full,  
completely empty, or how often the window size is being aggressively  
cut?

While we may not be able to determine a single optimal buffer size for  
all BDPs, are there diminishing returns in most common cases for  
increasing the buffer size past, say, 16MB?

> The information you are curious about is more relevant to creating  
> better default values of the tcp buffer size.  This could be useful,  
> but would be a long process and there are so many variables that I'm  
> not sure that you could pick proper default values anyways.  The  
> important thing is that the client can currently set its tcp buffer  
> size via the sysctl's, this is useless if the server is stuck at a  
> fixed value since the tcp window will be the minimum of the client  
> and server's tcp buffer sizes.


Well, Linux servers are not the only servers that a Linux client will  
ever encounter, so the client-side sysctl isn't as bad as useless.   
But one can argue whether that knob is ever tweaked by client  
administrators, and how useful it is.

> The server cannot do just the same thing as the client since it  
> cannot just rely on the tcp sysctl's since it also needs to ensure  
> it has enough buffer space for each NFSD.

I agree the server's current logic is too conservative.

However, the server has an automatic load-leveling feature -- it can  
close sockets if it notices it is running out of resources, and the  
Linux server does this already.  I don't think it would be terribly  
harmful to overcommit the socket buffer space since we have such a  
safety valve.

> My goal with this patch is to provide users with the same  
> flexibility that the client has regarding tcp buffer sizes, but also  
> ensure that the minimum amount of buffer space that the NFSDs  
> require is allocated.

What is the formula you used to determine the value to poke into the  
sysctl, btw?

What is an appropriate setting for a server that has to handle a mix  
of local and remote clients, for example, or a client that has to  
connect to a mix of local and remote servers?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-12 21:03   ` Dean Hildebrand
  2008-06-13 18:51     ` Chuck Lever
@ 2008-06-13 20:53     ` J. Bruce Fields
  2008-06-13 23:58       ` Dean Hildebrand
  1 sibling, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2008-06-13 20:53 UTC (permalink / raw)
  To: Dean Hildebrand; +Cc: Chuck Lever, linux-nfs

On Thu, Jun 12, 2008 at 02:03:20PM -0700, Dean Hildebrand wrote:
> Another point is that setting the buffer size isn't always a  
> straightforward process.  All papers I've read on the subject, and my  
> experience confirms this, is that setting tcp buffer sizes is more of an  
> art.

Aie.  It's bad enough if we have a half-dozen or so sysctl's to set to
get decent performance out of the nfs server.  I don't like to hear
that, on top of that, the choice of at least one of those variables is
an art....

We can leave some knobs in there for the people that like to read that
sort of paper, but the rest of the world will need *some* sort of
heuristics.

--b.


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-13 18:51     ` Chuck Lever
@ 2008-06-13 20:56       ` J. Bruce Fields
  2008-06-14  1:07       ` Dean Hildebrand
  1 sibling, 0 replies; 15+ messages in thread
From: J. Bruce Fields @ 2008-06-13 20:56 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Dean Hildebrand, linux-nfs, aglo

On Fri, Jun 13, 2008 at 02:51:18PM -0400, Chuck Lever wrote:
> On Jun 12, 2008, at 5:03 PM, Dean Hildebrand wrote:
>> I think the problem there is that the only way to set the buffer size 
>> automatically would be to know the rtt and bandwidth of the network 
>> connection.  Excessive numbers of packets can get dropped if the TCP 
>> buffer is set too large for a specific network connection.
>
>> In this case, the window opens too wide and lets too many packets out 
>> into the system, somewhere along the path buffers start overflowing and 
>> packets are lost, TCP congestion avoidance kicks in and cuts the window 
>> size dramatically and performance along with it.  This type of 
>> behaviour creates a sawtooth pattern for the TCP window, which is less 
>> favourable than a more steady state pattern that is created if the TCP 
>> buffer size is set appropriately.
>
> Agreed it is a performance problem, but I thought some of the newer TCP 
> congestion algorithms were specifically designed to address this by not 
> closing the window as aggressively.
>
> Once the window is wide open, then, it would appear that choosing a good 
> congestion avoidance algorithm is also important.

Any references for Olga or me to read on that sort of behavior?

--b.


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-13 20:53     ` J. Bruce Fields
@ 2008-06-13 23:58       ` Dean Hildebrand
  2008-06-16 17:59         ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-13 23:58 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Chuck Lever, linux-nfs



J. Bruce Fields wrote:
> On Thu, Jun 12, 2008 at 02:03:20PM -0700, Dean Hildebrand wrote:
>   
>> Another point is that setting the buffer size isn't always a  
>> straightforward process.  All papers I've read on the subject, and my  
>> experience confirms this, is that setting tcp buffer sizes is more of an  
>> art.
>>     
>
> Aie.  It's bad enough if we have a half-dozen or so sysctl's to set to
> get decent performance out of the nfs server.  I don't like to hear
> that, on top of that, the choice of at least one of those variables is
> an art....
>
> We can leave some knobs in there for the people that like to read that
> sort of paper, but the rest of the world will need *some* sort of
> heuristics.
>   
Yeah, who thought computers could be artistic?!  More fun that way I 
figure :)

The reason it is an art is that you don't know the hardware that exists
between the client and server.  Talking about things like BDP is fine,
but in reality there are limited buffer sizes, flaky hardware,
fluctuations in traffic, etc.  Using the BDP as a starting point
still seems like the best solution, but since the linux server doesn't
know anything about what the BDP is, it is tough to hard code any value
into the linux kernel.  As you said, we can just give a reasonable
default value and then ensure people can play with the knobs.  Most
people use NFS within a LAN, and to date there has been little if any
discussion on using NFS over the WAN (hence my interest), so I would
argue that the current values might not be all that bad with regard to
defaults (at least we know the behaviour isn't horrible for most people).
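
As a rough illustration of using the BDP as a starting point, here is a
minimal Python sketch that computes it for the test setup described in
this thread (nominal GigE line rate, ~31.4 ms rtt from the ping output).
The helper name bdp_bytes is made up for this example and is not part of
the patch, and a real path will deviate from the nominal figure for the
reasons given above.

```python
# Bandwidth-delay product (BDP): the amount of data that must be in flight
# to keep a link busy. The inputs are the nominal figures from the
# experiment in this thread; bdp_bytes is a hypothetical helper name.

def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return bandwidth_bits_per_s * rtt_s / 8


if __name__ == "__main__":
    gige = 1e9      # nominal GigE line rate, in bits per second
    rtt = 0.0314    # round-trip time taken from the ping output, in seconds
    print(f"BDP is roughly {bdp_bytes(gige, rtt) / 1e6:.1f} MB")
```

The result is only a starting value for the socket buffer size; it says
nothing about intermediate buffers or competing traffic on the path.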

Networks are messy.  Anyone who wants to work in the WAN is going to 
have to read about such things, no way around it.  A simple google 
search for 'tcp wan' or 'tcp wan linux' gives loads of suggestions on 
how to configure your network, so it really isn't a burden on sysadmins 
to do such a search and then use the given knobs to adjust the tcp 
buffer size appropriately.  My patch gives sysadmins the ability to do 
the google search and then have some knobs to turn.

Some sample tcp tuning guides that I like:
http://acs.lbl.gov/TCP-tuning/tcp-wan-perf.pdf
http://acs.lbl.gov/TCP-tuning/linux.html
http://gentoo-wiki.com/HOWTO_TCP_Tuning (especially relevant is the part 
about the receive buffer)
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Hildebrand_98265.pdf 
(our initial paper on pNFS tuning)

Dean



* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-13 18:51     ` Chuck Lever
  2008-06-13 20:56       ` J. Bruce Fields
@ 2008-06-14  1:07       ` Dean Hildebrand
  2008-06-16 18:59         ` Chuck Lever
  1 sibling, 1 reply; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-14  1:07 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs



Chuck Lever wrote:
> On Jun 12, 2008, at 5:03 PM, Dean Hildebrand wrote:
>> Hi Chuck,
>>
>> Chuck Lever wrote:
>>> Howdy Dean-
>>>
>>> On Jun 10, 2008, at 2:54 PM, Dean Hildebrand wrote:
>>>> The motivation for this patch is improved WAN write performance
>>>> plus greater user control on the server of the TCP buffer values
>>>> (window size). The TCP window determines the amount of outstanding
>>>> data that a client can have on the wire and should be large enough
>>>> that a NFS client can fill up the pipe (the bandwidth * delay
>>>> product). Currently the TCP receive buffer size (used for client
>>>> writes) is set very low, which prevents a client from filling up a
>>>> network pipe with a large bandwidth * delay product.
>>>>
>>>> Currently, the server TCP send window is set to accommodate the
>>>> maximum number of outstanding NFSD read requests (# nfsds *
>>>> maxiosize), while the server TCP receive window is set to a fixed
>>>> value which can hold a few requests. While these values set a TCP
>>>> window size that is fine in LAN environments with a small BDP, WAN
>>>> environments can require a much larger TCP window size, e.g.,
>>>> 10GigE transatlantic link with a rtt of 120 ms has a BDP of approx
>>>> 60MB.
>>>
>>> Was the receive buffer size computation adjusted when support for
>>> large transfer sizes was recently added to the NFS server?
>> Yes, it is based on the transfer size. So in the current code, having
>> a larger transfer size can improve efficiency PLUS help create a
>> larger possible TCP window. The issue seems to be that tcp window, #
>> of NFSDs, and transfer size are all independent variables that need
>> to be tuned individually depending on rtt, network bandwidth, disk
>> bandwidth, etc etc... We can adjust the last 2, so this patch helps
>> adjust the first (tcp window).
>>>
>>>> I have a patch to net/svc/svcsock.c that allows a user to manually
>>>> set the server TCP send and receive buffer through the sysctl
>>>> interface. to suit the required TCP window of their network
>>>> architecture. It adds two /proc entries, one for the receive buffer
>>>> size and one for the send buffer size:
>>>> /proc/sys/sunrpc/tcp_sndbuf
>>>> /proc/sys/sunrpc/tcp_rcvbuf
>>>
>>> What I'm wondering is if we can find some algorithm to set the
>>> buffer and window sizes *automatically*. Why can't the NFS server
>>> select an appropriately large socket buffer size by default?
>>
>>>
>>> Since the socket buffer size is just a limit (no memory is
>>> allocated) why, for example, shouldn't the buffer size be large for
>>> all environments that have sufficient physical memory?
>> I think the problem there is that the only way to set the buffer size
>> automatically would be to know the rtt and bandwidth of the network
>> connection. Excessive numbers of packets can get dropped if the TCP
>> buffer is set too large for a specific network connection.
>
>> In this case, the window opens too wide and lets too many packets out
>> into the system, somewhere along the path buffers start overflowing
>> and packets are lost, TCP congestion avoidance kicks in and cuts the
>> window size dramatically and performance along with it. This type of
>> behaviour creates a sawtooth pattern for the TCP window, which is
>> less favourable than a more steady state pattern that is created if
>> the TCP buffer size is set appropriately.
>
> Agreed it is a performance problem, but I thought some of the newer
> TCP congestion algorithms were specifically designed to address this
> by not closing the window as aggressively.
Yes, every tcp algorithm seems to have its own niche. Personally, I have
found bic the best in the WAN as it is pretty aggressive at returning to
the original window size. Since cubic is now the Linux default, and
changing the tcp cong control algorithm is done for an entire system
(meaning local clients could be adversely affected by choosing one
designed for specialized networks), I think we should try to optimize cubic.
>
> Once the window is wide open, then, it would appear that choosing a
> good congestion avoidance algorithm is also important.
Yes, but it is always important to avoid ever letting the window get too
wide, as this will cause a hiccup every single time you try to send a
bunch of data (a tcp window closes very quickly after data is
transmitted, so waiting 1 second causes you to start from the beginning
with a small window).
>
>> Another point is that setting the buffer size isn't always a
>> straightforward process. All papers I've read on the subject, and my
>> experience confirms this, is that setting tcp buffer sizes is more of
>> an art.
>>
>> So having the server set a good default value is half the battle, but
>> allowing users to twiddle with this value is vital.
>
>>>> The uses the current buffer sizes in the code are as minimum
>>>> values, which the user cannot decrease. If the user sets a value of
>>>> 0 in either /proc entry, it resets the buffer size to the default
>>>> value. The set /proc values are utilized when the TCP connection is
>>>> initialized (mount time). The values are bounded above by the
>>>> *minimum* of the /proc values and the network TCP sysctls.
>>>>
>>>> To demonstrate the usefulness of this patch, details of an
>>>> experiment between 2 computers with a rtt of 30ms is provided
>>>> below. In this experiment, increasing the server
>>>> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>>>>
>>>> EXPERIMENT
>>>> ==========
>>>> This experiment simulates a WAN by using tc together with netem to
>>>> add a 30 ms delay to all packets on a nfs client. The goal is to
>>>> show that by only changing tcp_rcvbuf, the nfs client can increase
>>>> write performance in the WAN. To verify the patch has the desired
>>>> effect on the TCP window, I created two tcptrace plots that show
>>>> the difference in tcp window behaviour before and after the server
>>>> TCP rcvbuf size is increased. When using the default server tcpbuf
>>>> value of 6M, we can see the TCP window top out around 4.6 M,
>>>> whereas increasing the server tcpbuf value to 32M, we can see that
>>>> the TCP window tops out around 13M. Performance jumps from 43 MB/s
>>>> to 90 MB/s.
>>>>
>>>> Hardware:
>>>> 2 dual-core opteron blades
>>>> GigE, Broadcom NetXtreme II BCM57065 cards
>>>> A single gigabit switch in the middle
>>>> 1500 MTU
>>>> 8 GB memory
>>>>
>>>> Software:
>>>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>>>> RHEL4
>>>>
>>>> NFS Configuration:
>>>> 64 rpc slots
>>>> 32 nfsds
>>>> Export ext3 file system. This disk is quite slow, I therefore
>>>> exported using async to reduce the effect of the disk on the back
>>>> end. This way, the experiments record the time it takes for the
>>>> data to get to the server (not to the disk).
>>>> # exportfs -v
>>>> /export <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>>>
>>>> # cat /proc/mounts
>>>> bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0
>>>>
>>>> fs.nfs.nfs_congestion_kb = 91840
>>>> net.ipv4.tcp_congestion_control = cubic
>>>>
>>>> Network tc Command executed on client:
>>>> tc qdisc add dev eth0 root netem delay 30ms
>>>> rtt from client (bear108) to server (bear109)
>>>> #ping bear109
>>>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0
>>>> ttl=64 time=31.4 ms
>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1
>>>> ttl=64 time=32.0 ms
>>>>
>>>> TCP Configuration on client and server:
>>>> # Controls IP packet forwarding
>>>> net.ipv4.ip_forward = 0
>>>> # Controls source route verification
>>>> net.ipv4.conf.default.rp_filter = 1
>>>> # Do not accept source routing
>>>> net.ipv4.conf.default.accept_source_route = 0
>>>> # Controls the System Request debugging functionality of the kernel
>>>> kernel.sysrq = 0
>>>> # Controls whether core dumps will append the PID to the core filename
>>>> # Useful for debugging multi-threaded applications
>>>> kernel.core_uses_pid = 1
>>>> # Controls the use of TCP syncookies
>>>> net.ipv4.tcp_syncookies = 1
>>>> # Controls the maximum size of a message, in bytes
>>>> kernel.msgmnb = 65536
>>>> # Controls the default maxmimum size of a mesage queue
>>>> kernel.msgmax = 65536
>>>> # Controls the maximum shared segment size, in bytes
>>>> kernel.shmmax = 68719476736
>>>> # Controls the maximum number of shared memory segments, in pages
>>>> kernel.shmall = 4294967296
>>>> ### IPV4 specific settings
>>>> net.ipv4.tcp_timestamps = 0
>>>> net.ipv4.tcp_sack = 1
>>>> # on systems with a VERY fast bus -> memory interface this is the big gainer
>>>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>>>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>>>> net.ipv4.tcp_mem = 4096 16777216 16777216
>>>> ### CORE settings (mostly for socket and UDP effect)
>>>> net.core.rmem_max = 16777216
>>>> net.core.wmem_max = 16777216
>>>> net.core.rmem_default = 16777216
>>>> net.core.wmem_default = 16777216
>>>> net.core.optmem_max = 16777216
>>>> net.core.netdev_max_backlog = 300000
>>>> # Don't cache ssthresh from previous connection
>>>> net.ipv4.tcp_no_metrics_save = 1
>>>> # make sure we don't run out of memory
>>>> vm.min_free_kbytes = 32768
>>>>
>>>> Experiments:
>>>>
>>>> On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
>>>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>> 3158016
>>>>
>>>> On Client:
>>>> mount -t nfs bear109:/export /mnt
>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>> ...
>>>> KB reclen write
>>>> 512000 1024 43252
>>>> umount /mnt
>>>>
>>>> On server:
>>>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>> 16777216
>>>>
>>>> On Client:
>>>> mount -t nfs bear109:/export /mnt
>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>> ...
>>>> KB reclen write
>>>> 512000 1024 90396
>>>
>>> The numbers you have here are averages over the whole run.
>>> Performing these tests using a variety of record lengths and file
>>> sizes (up to several tens of gigabytes) would be useful to see where
>>> different memory and network latencies kick in.
>> Definitely useful, although I'm not sure how this relates to this patch.
>
> It relates to the whole idea that this is a valid and useful parameter
> to tweak.
>
> What your experiment shows is that there is some improvement when the
> TCP window is allowed to expand. It does not demonstrate that the
> *best* way to provide this facility is to allow administrators to tune
> the server's TCP buffer sizes.
By definition of how TCP is designed, tweaking the send and receive
buffer sizes is useful. Please see the tcp tuning guides in my other
post. I would characterize tweaking the buffers as a necessary condition
but not a sufficient condition to achieve good throughput with tcp over
long distances.
>
> A single average number can hide a host of underlying sins. This
> simple experiment, for example, does not demonstrate that TCP window
> size is the most significant issue here.
I would say it slightly differently, that it demonstrates that it is
significant, but maybe not the *most* significant. There are many
possible bottlenecks and possible knobs to tweak. For example, I'm still
not achieving link speeds, so I'm sure there are other bottlenecks that
are causing reduced performance.
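
One way to sanity-check where a bottleneck sits: a TCP sender can have at
most one window of data in flight per round trip, so window / rtt, capped
by the line rate, is an upper bound on throughput. The following is a
minimal Python sketch of that bound, assuming the tcptrace window figures
quoted above (4.6M and 13M, read here as multiples of 10^6 bytes), the
~31.4 ms rtt and a nominal 1 Gb/s link. The function name is hypothetical
and the model ignores slow start, loss and protocol overhead.

```python
# Window-limited throughput bound: at most one TCP window per round trip,
# and never more than the line rate. All inputs are assumptions taken from
# the figures quoted in this thread; the function name is hypothetical.

def max_throughput_bytes_per_s(window_bytes: float, rtt_s: float,
                               link_bits_per_s: float) -> float:
    """Return an upper bound on steady-state throughput in bytes/second."""
    window_limit = window_bytes / rtt_s     # one full window per round trip
    link_limit = link_bits_per_s / 8        # line rate converted to bytes/s
    return min(window_limit, link_limit)


if __name__ == "__main__":
    rtt = 0.0314        # seconds, from the ping output
    link = 1e9          # nominal GigE, bits per second
    for window in (4.6e6, 13e6):
        bound = max_throughput_bytes_per_s(window, rtt, link)
        print(f"window {window / 1e6:.1f} MB -> at most {bound / 1e6:.1f} MB/s")
```

With these inputs the bound for both window sizes is the line rate, above
the measured 43 MB/s and 90 MB/s averages, which is consistent with other
bottlenecks or with a window that spends much of the run below its peak.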
> It does not show that it is more or less effective to adjust the
> window size than to select an appropriate congestion control algorithm
> (say, BIC).
Any tcp cong. control algorithm is highly dependent on the tcp buffer
size. The choice of algorithm changes the behaviour when packets are
dropped and in the initial opening of the window, but once the window is
open and no packets are being dropped, the algorithm is irrelevant. So
BIC, or westwood, or highspeed might do better in the face of dropped
packets, but since the current receive buffer is so small, dropped
packets are not the problem. Once we can use the sysctl's to tweak the
server buffer size, only then is the choice of algorithm going to be
important.
> It does not show whether the client and server are using TCP optimally.
I'm not sure what you mean by *optimally*. They use tcp the only way
they know how, no?
> It does not expose problems related to having a single data stream
> with one blocking head (eg SCTP can allow multiple streams over the
> same connection; or better performance might be achieved with multiple
> TCP connections, even if they allow only small windows).
Yes, using multiple tcp connections might be useful, but that doesn't
mean you wouldn't want to adjust the tcp window of each one using my
patch. Actually, I can't seem to find the quote, but I read somewhere
that achieving performance in the WAN can be done 2 different ways: a)
If you can tune the buffer sizes that is the best way to go, but b) if
you don't have root access to change the linux tcp settings then using
multiple tcp streams can compensate for small buffer sizes.
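
The trade-off between the two approaches can be sketched with a simple
rule of thumb: if each stream is limited to a fixed window, the windows
add up, so roughly BDP / window streams are needed to cover the pipe. The
Python below is only an idealised illustration under that assumption. It
ignores packet loss and the fact that parallel streams compete with each
other, the function name is made up for this example, and the inputs
reuse figures mentioned in this thread (a 60 MB BDP and a 16 MB window).

```python
import math

# Idealised estimate: how many parallel TCP streams, each capped at a fixed
# window, are needed so that their combined windows cover the BDP.
# streams_needed is a hypothetical helper, not part of any patch.

def streams_needed(bdp_bytes: int, per_stream_window_bytes: int) -> int:
    """Return the smallest stream count whose total window is >= the BDP."""
    if per_stream_window_bytes <= 0:
        raise ValueError("per-stream window must be positive")
    return max(1, math.ceil(bdp_bytes / per_stream_window_bytes))


if __name__ == "__main__":
    bdp = 60 * 10**6        # BDP figure quoted in this thread, in bytes
    window = 16 * 10**6     # per-stream window, in bytes
    print(streams_needed(bdp, window), "streams")
```

Raising the per-stream window lowers the count toward one, which is the
case for tuning the buffers when that is possible.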

Andy has/had a patch to add multiple tcp streams to NFS. I think his
patch and my patch work in collaboration to improve wan performance.
>
>> This patch isn't trying to alter default values, or predict buffer
>> sizes based on rtt values, or dynamically alter the tcp window based
>> on dropped packets, etc, it is just giving users the ability to
>> customize the server tcp buffer size.
>
> I know you posted this patch because of the experiments at CITI with
> long-run 10GbE, and it's handy to now have this to experiment with.
Actually at IBM we have our own reasons for using NFS over the WAN. I
would like to get these 2 knobs into the kernel as it is hard to tell
customers to apply kernel patches....
>
> It might also be helpful if we had a patch that made the server
> perform better in common environments, so a better default setting it
> seems to me would have greater value than simply creating a new tuning
> knob.
I think there are possibly 2 (or more) patches. One that improves the
default buffer sizes and one that lets sysadmins tweak the value. I
don't see why they are mutually exclusive. My patch is a first step
towards allowing NFS into WAN environments. Linux currently has sysctl
values for the TCP parameters for exactly this reason: it is impossible
to predict the network environment of a linux machine. If the Linux nfs
server isn't going to build off of the existing Linux TCP values (which
all sysadmins know how to tweak), then it must allow sysadmins to tweak
the NFS server tcp values, either using my patch or some other related
patch. I'm open to how the server tcp buffers are tweaked, they just need
to be able to be tweaked. For example, if all tcp buffer values in linux
were taken out of the /proc file system and hardcoded, I think there
would be a revolt.
>
> Would it be hard to add a metric or two with this tweak that would
> allow admins to see how often a socket buffer was completely full,
> completely empty, or how often the window size is being aggressively cut?
So I've done this using tcpdump in combination with tcptrace. I've shown
people at citi how the tcp window grows in the experiment I describe.
>
> While we may not be able to determine a single optimal buffer size for
> all BDPs, are there diminishing returns in most common cases for
> increasing the buffer size past, say, 16MB?
Good question. It all depends on how much data you are transferring. In
order to fully open a 128MB tcp window over a very long WAN, you will
need to transfer at least a few gigabytes of data. If you only transfer
100 MB at a time, then you will probably be fine with a 16 MB window as
you are not transferring enough data to open the window anyways. In our
environment, we are expecting to transfer 100s of GB if not even more,
so the 16 MB window would be very limiting.
>
>> The information you are curious about is more relevant to creating
>> better default values of the tcp buffer size. This could be useful,
>> but would be a long process and there are so many variables that I'm
>> not sure that you could pick proper default values anyways. The
>> important thing is that the client can currently set its tcp buffer
>> size via the sysctl's, this is useless if the server is stuck at a
>> fixed value since the tcp window will be the minimum of the client
>> and server's tcp buffer sizes.
>
>
> Well, Linux servers are not the only servers that a Linux client will
> ever encounter, so the client-side sysctl isn't as bad as useless. But
> one can argue whether that knob is ever tweaked by client
> administrators, and how useful it is.
Definitely not useless. Doing a google search for 'tcp_rmem' returns
over 11000 hits describing how to configure tcp settings. (ok, I didn't
review every result, but the first few pages of results are telling) It
doesn't really matter what OS the client and server use, as long as both
have the ability to tweak the tcp buffer size.
>
>> The server cannot do just the same thing as the client since it
>> cannot just rely on the tcp sysctl's since it also needs to ensure it
>> has enough buffer space for each NFSD.
>
> I agree the server's current logic is too conservative.
>
> However, the server has an automatic load-leveling feature -- it can
> close sockets if it notices it is running out of resources, and the
> Linux server does this already. I don't think it would be terribly
> harmful to overcommit the socket buffer space since we have such a
> safety valve.
The tcp tuning guides in my other post comment on exactly my point that
providing too large a tcp window can be harmful to performance.
>
>> My goal with this patch is to provide users with the same flexibility
>> that the client has regarding tcp buffer sizes, but also ensure that
>> the minimum amount of buffer space that the NFSDs require is allocated.
>
> What is the formula you used to determine the value to poke into the
> sysctl, btw?
I like this doc: http://acs.lbl.gov/TCP-tuning/tcp-wan-perf.pdf
The optimal buffer size is twice the bandwidth * (one-way) delay product
of the link, i.e.:
buffer size = bandwidth * RTT
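
As a quick check of the formula, here is a minimal Python sketch. The
function names (bdp_bytes, request_buffers) are made up for illustration
and are not part of the patch. The 50 ms / 10 MB/sec figures come from the
LBL guide quoted in this message, and the GigE / 30 ms case mirrors the
netem experiment from the original posting. Units are decimal bytes. The
second function is only a rough userspace analogue of what the server would
do in the kernel, and the kernel may clamp or adjust whatever is requested.

```python
import socket
from typing import Tuple


def bdp_bytes(bandwidth_bytes_per_sec: int, rtt_ms: int) -> int:
    """buffer size = bandwidth * RTT, in bytes (integer arithmetic)."""
    return bandwidth_bytes_per_sec * rtt_ms // 1000


def request_buffers(sock: socket.socket, nbytes: int) -> Tuple[int, int]:
    """Ask for nbytes of send and receive buffer on one socket.

    Returns the (send, receive) sizes the kernel reports afterwards,
    which need not equal the requested value.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


# Worked example from the LBL guide: 50 ms ping time, 10 MB/sec path.
print(bdp_bytes(10_000_000, 50))    # 500000 bytes, i.e. 500 KB
# GigE (125 MB/sec) with the 30 ms netem delay used in the experiment.
print(bdp_bytes(125_000_000, 30))   # 3750000 bytes, i.e. 3.75 MB
```

On Linux the size reported by getsockopt is usually larger than the size
requested (the kernel reserves room for bookkeeping), and it is capped by
net.core.rmem_max / net.core.wmem_max, so the returned pair is worth
checking rather than assuming.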

Here is the entire relevant part:

"""
2.0 TCP Buffer Sizes
TCP uses what it calls the "congestion window," or CWND, to determine
how many packets can be sent at one time. The larger the congestion
window size, the higher the throughput. The TCP "slow start" and
"congestion avoidance" algorithms determine the size of the congestion
window. The maximum congestion window is related to the amount of
buffer space that the kernel allocates for each socket. For each
socket, there is a default value for the buffer size, which can be
changed by the program using a system library call just before opening
the socket. There is also a kernel enforced maximum buffer size. The
buffer size can be adjusted for both the send and receive ends of the
socket.

To achieve maximal throughput it is critical to use optimal TCP send
and receive socket buffer sizes for the link you are using. If the
buffers are too small, the TCP congestion window will never fully open
up. If the buffers are too large, the sender can overrun the receiver,
and the TCP window will shut down. For more information, see the
references on page 38.

Users often wonder why, on a network where the slowest hop from site A
to site B is 100 Mbps (about 12 MB/sec), using ftp they can only get a
throughput of 500 KB/sec. The answer is obvious if you consider the
following: typical latency across the US is about 25 ms, and many
operating systems use a default TCP buffer size of either 24 or 32 KB
(Linux is only 8 KB). Assuming a default TCP buffer of 24KB, the
maximum utilization of the pipe will only be 24/300 = 8% (.96 MB/sec),
even under ideal conditions. In fact, the buffer size typically needs
to be double the TCP congestion window size to keep the pipe full, so
in reality only about 4% utilization of the network is achieved, or
about 500 KB/sec. Therefore if you are using untuned TCP buffers you'll
often get less than 5% of the possible bandwidth across a high-speed
WAN path. This is why it is essential to tune the TCP buffers to the
optimal value.

The optimal buffer size is twice the bandwidth * delay product of the
link:
buffer size = 2 * bandwidth * delay
The ping program can be used to get the delay, and pipechar or pchar,
described below, can be used to get the bandwidth of the slowest hop in
your path. Since ping gives the round-trip time (RTT), this formula can
be used instead of the previous one:
buffer size = bandwidth * RTT
For example, if your ping time is 50 ms, and the end-to-end network
consists of all 100BT Ethernet and OC3 (155 Mbps), the TCP buffers
should be 0.05 sec * 10 MB/sec = 500 KB. If you are connected via a T1
line (1 Mbps) or less, the default buffers are fine, but if you are
using a network faster than that, you will almost certainly benefit
from some buffer tuning.

Two TCP settings need to be considered: the default TCP send and
receive buffer size and the maximum TCP send and receive buffer size.
Note that most of today's UNIX OSes by default have a maximum TCP
buffer size of only 256 KB (and the default maximum for Linux is only
64 KB!). For instructions on how to increase the maximum TCP buffer,
see Appendix A. Setting the default TCP buffer size greater than 128 KB
will adversely affect LAN performance. Instead, the UNIX setsockopt
call should be used in your sender and receiver to set the optimal
buffer size for the link you are using. Use of setsockopt is described
in Appendix B.

It is not necessary to set both the send and receive buffer to the
optimal value, as the socket will use the smaller of the two values.
However, it is necessary to make sure both are large enough. A common
technique is to set the buffer in the server quite large (e.g., 512 KB)
and then let the client determine and set the correct "optimal" value.
"""
>
> What is an appropriate setting for a server that has to handle a mix
> of local and remote clients, for example, or a client that has to
> connect to a mix of local and remote servers?
Yes, this is a tricky one. I believe the best way to handle it is to set
the server tcp buffer to the MAX(local, remote) and then let the local
client set a smaller tcp buffer and the remote client set a larger tcp
buffer. The problem there is: what if the local client is also a remote
client of another nfs server?? At this point there seem to be some
limitations.....
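
A tiny sketch of the MAX(local, remote) idea, assuming (as stated earlier
in this message) that the usable window is the minimum of the client and
server buffer sizes. The function names and the two buffer sizes are
hypothetical, chosen only to make the arithmetic visible.

```python
def server_buffer(client_buffers: dict) -> int:
    """Size the server buffer for the most demanding client."""
    return max(client_buffers.values())


def effective_window(client_buf: int, server_buf: int) -> int:
    """The usable window is capped by the smaller of the two buffers."""
    return min(client_buf, server_buf)


# Hypothetical per-client buffer sizes: a LAN client and a WAN client.
clients = {"local": 256 * 1024, "remote": 16 * 1024 * 1024}

srv = server_buffer(clients)
for name, buf in clients.items():
    # Each client ends up limited by its own, smaller-or-equal setting.
    print(name, effective_window(buf, srv))
```

The sketch also shows the limitation raised above: a client has one buffer
setting, so a machine that is local to one server and remote to another
cannot be sized well for both.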

btw, here is another good paper with regards to tcp buffer sizing in the
WAN:
"Optimizing 10-Gigabit Ethernet for Networks of Workstations, Clusters,
and Grids: A Case Study"
http://portal.acm.org/citation.cfm?id=1050200

I also found the parts in this page regarding tcp setting very very
useful (it also briefly talks about multiple tcp streams):
http://pcbunn.cithep.caltech.edu/bbcp/using_bbcp.htm
Dean

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-13 23:58       ` Dean Hildebrand
@ 2008-06-16 17:59         ` J. Bruce Fields
  2008-06-18 18:33           ` Dean Hildebrand
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2008-06-16 17:59 UTC (permalink / raw)
  To: Dean Hildebrand; +Cc: Chuck Lever, linux-nfs

On Fri, Jun 13, 2008 at 04:58:04PM -0700, Dean Hildebrand wrote:
> The reason it is an art is that you don't know the hardware that exists  
> between the client and server.  Talking about things like BDP is fine,  
> but in reality there are limited buffer sizes, flaky hardware,  
> fluctuations in traffic, etc etc.  Using the BDP as a starting point  
> though seems like the best solution, but since the linux server doesn't  
> know anything about what the BDP is, it is tough to hard code any value  
> into the linux kernel.  As you said, if we just give a reasonable  
> default value and then ensure people can play with the knobs.  Most  
> people use NFS within a LAN, and to date there has been little if any  
> discussion on using NFS over the WAN (hence my interest), so I would  
> argue that the current values might not be all that bad with regards to  
> defaults (at least we know the behaviour isn't horrible for most people).
>
> Networks are messy.  Anyone who wants to work in the WAN is going to  
> have to read about such things, no way around it.  A simple google  
> search for 'tcp wan' or 'tcp wan linux' gives loads of suggestions on  
> how to configure your network, so it really isn't a burden on sysadmins  
> to do such a search and then use the given knobs to adjust the tcp  
> buffer size appropriately.  My patch gives sysadmins the ability to do  
> the google search and then have some knobs to turn.
>
> Some sample tcp tuning guides that I like:
> http://acs.lbl.gov/TCP-tuning/tcp-wan-perf.pdf
> http://acs.lbl.gov/TCP-tuning/linux.html
> http://gentoo-wiki.com/HOWTO_TCP_Tuning (especially relevant is the part  
> about the receive buffer)
 
> http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Hildebrand_98265.pdf 
> (our initial paper on pNFS tuning)

Several of those refer to problems that can happen when the receive
buffer size is set unusually high, but none of them give a really
detailed description of the behavior in that case--do you know of any?

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-14  1:07       ` Dean Hildebrand
@ 2008-06-16 18:59         ` Chuck Lever
  2008-06-17 22:03           ` Dean Hildebrand
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2008-06-16 18:59 UTC (permalink / raw)
  To: Dean Hildebrand; +Cc: linux-nfs

On Jun 13, 2008, at 9:07 PM, Dean Hildebrand wrote:
> Chuck Lever wrote:
>> On Jun 12, 2008, at 5:03 PM, Dean Hildebrand wrote:
>>> Hi Chuck,
>>>
>>> Chuck Lever wrote:
>>>> Howdy Dean-
>>>>
>>>> On Jun 10, 2008, at 2:54 PM, Dean Hildebrand wrote:
>>>>> The motivation for this patch is improved WAN write performance  
>>>>> plus greater user control on the server of the TCP buffer values  
>>>>> (window size). The TCP window determines the amount of  
>>>>> outstanding data that a client can have on the wire and should  
>>>>> be large enough that a NFS client can fill up the pipe (the  
>>>>> bandwidth * delay product). Currently the TCP receive buffer  
>>>>> size (used for client writes) is set very low, which prevents a  
>>>>> client from filling up a network pipe with a large bandwidth *  
>>>>> delay product.
>>>>>
>>>>> Currently, the server TCP send window is set to accommodate the  
>>>>> maximum number of outstanding NFSD read requests (# nfsds *  
>>>>> maxiosize), while the server TCP receive window is set to a  
>>>>> fixed value which can hold a few requests. While these values  
>>>>> set a TCP window size that is fine in LAN environments with a  
>>>>> small BDP, WAN environments can require a much larger TCP window  
>>>>> size, e.g., 10GigE transatlantic link with a rtt of 120 ms has a  
>>>>> BDP of approx 60MB.
>>>>
>>>> Was the receive buffer size computation adjusted when support for  
>>>> large transfer sizes was recently added to the NFS server?
>>> Yes, it is based on the transfer size. So in the current code,  
>>> having a larger transfer size can improve efficiency PLUS help  
>>> create a larger possible TCP window. The issue seems to be that  
>>> tcp window, # of NFSDs, and transfer size are all independent  
>>> variables that need to be tuned individually depending on rtt,  
>>> network bandwidth, disk bandwidth, etc etc... We can adjust the  
>>> last 2, so this patch helps adjust the first (tcp window).
>>>>
>>>>> I have a patch to net/sunrpc/svcsock.c that allows a user to
>>>>> manually set the server TCP send and receive buffer through the
>>>>> sysctl interface to suit the required TCP window of their
>>>>> network architecture. It adds two /proc entries, one for the
>>>>> receive buffer size and one for the send buffer size:
>>>>> /proc/sys/sunrpc/tcp_sndbuf
>>>>> /proc/sys/sunrpc/tcp_rcvbuf
>>>>
>>>> What I'm wondering is if we can find some algorithm to set the  
>>>> buffer and window sizes *automatically*. Why can't the NFS server  
>>>> select an appropriately large socket buffer size by default?
>>>
>>>>
>>>> Since the socket buffer size is just a limit (no memory is  
>>>> allocated) why, for example, shouldn't the buffer size be large  
>>>> for all environments that have sufficient physical memory?
>>> I think the problem there is that the only way to set the buffer  
>>> size automatically would be to know the rtt and bandwidth of the  
>>> network connection. Excessive numbers of packets can get dropped  
>>> if the TCP buffer is set too large for a specific network  
>>> connection.
>>
>>> In this case, the window opens too wide and lets too many packets  
>>> out into the system, somewhere along the path buffers start  
>>> overflowing and packets are lost, TCP congestion avoidance kicks  
>>> in and cuts the window size dramatically and performance along  
>>> with it. This type of behaviour creates a sawtooth pattern for the  
>>> TCP window, which is less favourable than a more steady state  
>>> pattern that is created if the TCP buffer size is set appropriately.
>>
>> Agreed it is a performance problem, but I thought some of the newer  
>> TCP congestion algorithms were specifically designed to address  
>> this by not closing the window as aggressively.
> Yes, every tcp algorithm seems to have its own niche. Personally, I  
> have found bic the best in the WAN as it is pretty aggressive at  
> returning to the original window size. Since cubic is now the Linux  
> default, and changing the tcp cong control algorithm is done for an  
> entire system (meaning local clients could be adversely affected by  
> choosing one designed for specialized networks), I think we should  
> try to optimize cubic.
>>
>> Once the window is wide open, then, it would appear that choosing a  
>> good congestion avoidance algorithm is also important.
> Yes, but it is always important to avoid ever letting the window get
> too wide, as this will cause a hiccup every single time you try to
> send a bunch of data (a tcp window closes very quickly after data is
> transmitted, so waiting 1 second causes you to start from the
> beginning with a small window).

Since what we really want to limit is the maximum size of the TCP  
receive window, it would be more precise to change the name of the new  
sysctl to something like nfs_tcp_max_window_size.

>>> Another point is that setting the buffer size isn't always a
>>> straightforward process. All the papers I've read on the subject
>>> say, and my experience confirms, that setting tcp buffer sizes is
>>> more of an art.
>>>
>>> So having the server set a good default value is half the battle,  
>>> but allowing users to twiddle with this value is vital.
>>
>>>>> The patch uses the current buffer sizes in the code as minimum
>>>>> values, which the user cannot decrease. If the user sets a value
>>>>> of 0 in either /proc entry, it resets the buffer size to the
>>>>> default value. The set /proc values are utilized when the TCP
>>>>> connection is initialized (mount time). The values are bounded
>>>>> above by the *minimum* of the /proc values and the network TCP
>>>>> sysctls.
>>>>>
>>>>> To demonstrate the usefulness of this patch, details of an
>>>>> experiment between 2 computers with a rtt of 30ms are provided
>>>>> below. In this experiment, increasing the server
>>>>> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>>>>>
>>>>> EXPERIMENT
>>>>> ==========
>>>>> This experiment simulates a WAN by using tc together with netem  
>>>>> to add a 30 ms delay to all packets on a nfs client. The goal is  
>>>>> to show that by only changing tcp_rcvbuf, the nfs client can  
>>>>> increase write performance in the WAN. To verify the patch has  
>>>>> the desired effect on the TCP window, I created two tcptrace  
>>>>> plots that show the difference in tcp window behaviour before  
>>>>> and after the server TCP rcvbuf size is increased. When using  
>>>>> the default server tcpbuf value of 6M, we can see the TCP window  
>>>>> top out around 4.6 M, whereas increasing the server tcpbuf value  
>>>>> to 32M, we can see that the TCP window tops out around 13M.  
>>>>> Performance jumps from 43 MB/s to 90 MB/s.
>>>>>
>>>>> Hardware:
>>>>> 2 dual-core opteron blades
>>>>> GigE, Broadcom NetXtreme II BCM57065 cards
>>>>> A single gigabit switch in the middle
>>>>> 1500 MTU
>>>>> 8 GB memory
>>>>>
>>>>> Software:
>>>>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>>>>> RHEL4
>>>>>
>>>>> NFS Configuration:
>>>>> 64 rpc slots
>>>>> 32 nfsds
>>>>> Export ext3 file system. This disk is quite slow, I therefore  
>>>>> exported using async to reduce the effect of the disk on the  
>>>>> back end. This way, the experiments record the time it takes for  
>>>>> the data to get to the server (not to the disk).
>>>>> # exportfs -v
>>>>> /export <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>>>>
>>>>> # cat /proc/mounts
>>>>> bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0
>>>>>
>>>>> fs.nfs.nfs_congestion_kb = 91840
>>>>> net.ipv4.tcp_congestion_control = cubic
>>>>>
>>>>> Network tc Command executed on client:
>>>>> tc qdisc add dev eth0 root netem delay 30ms
>>>>> rtt from client (bear108) to server (bear109)
>>>>> #ping bear109
>>>>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0  
>>>>> ttl=64 time=31.4 ms
>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1  
>>>>> ttl=64 time=32.0 ms
>>>>>
>>>>> TCP Configuration on client and server:
>>>>> # Controls IP packet forwarding
>>>>> net.ipv4.ip_forward = 0
>>>>> # Controls source route verification
>>>>> net.ipv4.conf.default.rp_filter = 1
>>>>> # Do not accept source routing
>>>>> net.ipv4.conf.default.accept_source_route = 0
>>>>> # Controls the System Request debugging functionality of the kernel
>>>>> kernel.sysrq = 0
>>>>> # Controls whether core dumps will append the PID to the core filename
>>>>> # Useful for debugging multi-threaded applications
>>>>> kernel.core_uses_pid = 1
>>>>> # Controls the use of TCP syncookies
>>>>> net.ipv4.tcp_syncookies = 1
>>>>> # Controls the maximum size of a message, in bytes
>>>>> kernel.msgmnb = 65536
>>>>> # Controls the default maximum size of a message queue
>>>>> kernel.msgmax = 65536
>>>>> # Controls the maximum shared segment size, in bytes
>>>>> kernel.shmmax = 68719476736
>>>>> # Controls the maximum number of shared memory segments, in pages
>>>>> kernel.shmall = 4294967296
>>>>> ### IPV4 specific settings
>>>>> net.ipv4.tcp_timestamps = 0
>>>>> net.ipv4.tcp_sack = 1
>>>>> # on systems with a VERY fast bus -> memory interface this is the big gainer
>>>>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>>>>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>>>>> net.ipv4.tcp_mem = 4096 16777216 16777216
>>>>> ### CORE settings (mostly for socket and UDP effect)
>>>>> net.core.rmem_max = 16777216
>>>>> net.core.wmem_max = 16777216
>>>>> net.core.rmem_default = 16777216
>>>>> net.core.wmem_default = 16777216
>>>>> net.core.optmem_max = 16777216
>>>>> net.core.netdev_max_backlog = 300000
>>>>> # Don't cache ssthresh from previous connection
>>>>> net.ipv4.tcp_no_metrics_save = 1
>>>>> # make sure we don't run out of memory
>>>>> vm.min_free_kbytes = 32768
>>>>>
>>>>> Experiments:
>>>>>
>>>>> On Server: (note that the real tcp buffer size is double  
>>>>> tcp_rcvbuf)
>>>>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>> 3158016
>>>>>
>>>>> On Client:
>>>>> mount -t nfs bear109:/export /mnt
>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>> ...
>>>>> KB reclen write
>>>>> 512000 1024 43252
>>>>> umount /mnt
>>>>>
>>>>> On server:
>>>>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>> 16777216
>>>>>
>>>>> On Client:
>>>>> mount -t nfs bear109:/export /mnt
>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>> ...
>>>>> KB reclen write
>>>>> 512000 1024 90396
>>>>
>>>> The numbers you have here are averages over the whole run.  
>>>> Performing these tests using a variety of record lengths and file  
>>>> sizes (up to several tens of gigabytes) would be useful to see  
>>>> where different memory and network latencies kick in.
>>> Definitely useful, although I'm not sure how this relates to this  
>>> patch.
>>
>> It relates to the whole idea that this is a valid and useful  
>> parameter to tweak.
>>
>> What your experiment shows is that there is some improvement when  
>> the TCP window is allowed to expand. It does not demonstrate that  
>> the *best* way to provide this facility is to allow administrators  
>> to tune the server's TCP buffer sizes.
> By definition of how TCP is designed, tweaking the send and receive
> buffer sizes is useful. Please see the tcp tuning guides in my
> other post. I would characterize tweaking the buffers as a necessary
> condition but not a sufficient condition to achieve good throughput
> with tcp over long distances.
>>
>> A single average number can hide a host of underlying sins. This  
>> simple experiment, for example, does not demonstrate that TCP  
>> window size is the most significant issue here.
> I would say it slightly differently, that it demonstrates that it is  
> significant, but maybe not the *most* significant. There are many  
> possible bottlenecks and possible knobs to tweak. For example, I'm  
> still not achieving link speeds, so I'm sure there are other  
> bottlenecks that are causing reduced performance.

I think that's my basic point.  We don't have the full picture yet.   
There are benefits to adjusting the maximum window size, but as we  
learn more it may turn out that we want an entirely different knob or  
knobs.

>> It does not show that it is more or less effective to adjust the  
>> window size than to select an appropriate congestion control  
>> algorithm (say, BIC).
> Any tcp cong. control algorithm is highly dependent on the tcp  
> buffer size. The choice of algorithm changes the behaviour when  
> packets are dropped and in the initial opening of the window, but  
> once the window is open and no packets are being dropped, the  
> algorithm is irrelevant. So BIC, or westwood, or highspeed might do  
> better in the face of dropped packets, but since the current receive  
> buffer is so small, dropped packets are not the problem. Once we can  
> use the sysctl's to tweak the server buffer size, only then is the  
> choice of algorithm going to be important.

Maybe my use of the terminology is imprecise, but clearly the  
congestion control algorithm matters for determining the TCP window  
size, which is exactly what we're discussing here.

>> It does not show whether the client and server are using TCP  
>> optimally.
> I'm not sure what you mean by *optimally*. They use tcp the only way
> they know how, no?

I'm talking about whether they use Nagle, when they PUSH, how they use  
the window (servers can close a window when they are busy, for  
example), and of course whether they can or should use multiple  
connections.

>> It does not expose problems related to having a single data stream  
>> with one blocking head (eg SCTP can allow multiple streams over the  
>> same connection; or better performance might be achieved with  
>> multiple TCP connections, even if they allow only small windows).
> Yes, using multiple tcp connections might be useful, but that  
> doesn't mean you wouldn't want to adjust the tcp window of each one  
> using my patch. Actually, I can't seem to find the quote, but I read  
> somewhere that achieving performance in the WAN can be done 2  
> different ways: a) If you can tune the buffer sizes that is the best  
> way to go, but b) if you don't have root access to change the linux  
> tcp settings then using multiple tcp streams can compensate for  
> small buffer sizes.
>
> Andy has/had a patch to add multiple tcp streams to NFS. I think his  
> patch and my patch work in collaboration to improve wan performance.

Yep, I've discussed this work with him several times.  This might be a  
more practical solution than allowing larger window sizes (one reason  
being the dangers of allowing the window to get too large).

While the use of multiple streams has benefits besides increasing the  
effective TCP window size, only the client side controls the number of  
connections.  The server wouldn't have much to say about it.
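
To put a rough number on "effective TCP window size" with several
connections, here is an idealized Python sketch. It assumes the streams
share the path evenly, that each one can keep its full window in flight,
and that nothing is lost on the way; real connections will do worse. The
function name and the example window sizes are hypothetical.

```python
import math


def streams_needed(bdp_bytes: int, per_stream_window_bytes: int) -> int:
    """Number of connections whose combined windows cover the BDP."""
    if per_stream_window_bytes <= 0:
        raise ValueError("window must be positive")
    return max(1, math.ceil(bdp_bytes / per_stream_window_bytes))


# GigE with a 30 ms rtt: about 3.75 MB needs to be in flight.
bdp = 3_750_000
print(streams_needed(bdp, 1024 * 1024))      # 1 MiB windows -> 4 streams
print(streams_needed(bdp, 4 * 1024 * 1024))  # 4 MiB windows -> 1 stream
```

Since only the client decides how many connections to open, this is a
client-side calculation; the server can only bound the per-connection
window.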

>>> This patch isn't trying to alter default values, or predict buffer  
>>> sizes based on rtt values, or dynamically alter the tcp window  
>>> based on dropped packets, etc, it is just giving users the ability  
>>> to customize the server tcp buffer size.
>>
>> I know you posted this patch because of the experiments at CITI  
>> with long-run 10GbE, and it's handy to now have this to experiment  
>> with.
> Actually at IBM we have our own reasons for using NFS over the WAN.  
> I would like to get these 2 knobs into the kernel as it is hard to  
> tell customers to apply kernel patches....

>> It might also be helpful if we had a patch that made the server  
>> perform better in common environments, so a better default setting  
>> it seems to me would have greater value than simply creating a new  
>> tuning knob.
> I think there are possibly 2 (or more) patches. One that improves  
> the default buffer sizes and one that lets sysadmins tweak the  
> value. I don't see why they are mutually exclusive.

They are not.  I'm OK with studying the problem and adjusting the  
defaults appropriately.

The issue is whether adding this knob is the right approach to  
adjusting the server.  I don't think we have enough information to  
understand if this is the most useful approach.  In other words, it  
seems like a band-aid right now, but in the long run it might be the  
correct answer.

> My patch is a first step towards allowing NFS into WAN environments.  
> Linux currently has sysctl values for the TCP parameters for exactly  
> this reason, it is impossible to predict the network environment of  
> a linux machine.

> If the Linux nfs server isn't going to build off of the existing  
> Linux TCP values (which all sysadmins know how to tweak), then it  
> must allow sysadmins to tweak the NFS server tcp values, either  
> using my patch or some other related patch. I'm open to how the  
> server tcp buffers are tweaked, they just need to be able to be  
> tweaked. For example, if all tcp buffer values in linux were taken  
> out of the /proc file system and hardcoded, I think there would be a  
> revolt.

I'm not arguing for no tweaking.  What I'm saying is we should provide  
knobs that are as useful as possible, and include metrics and clear  
instructions for when and how to set the knob.

You've shown there is improvement, but not that this is the best  
solution.   It just feels like the work isn't done yet.

>> Would it be hard to add a metric or two with this tweak that would  
>> allow admins to see how often a socket buffer was completely full,  
>> completely empty, or how often the window size is being  
>> aggressively cut?
> So I've done this using tcpdump in combination with tcptrace. I've  
> shown people at citi how the tcp window grows in the experiment I  
> describe.

No, I mean as a part of the patch that adds the tweak, it should  
report various new statistics that can allow admins to see that they  
need adjustment, or that there isn't a problem at all in this area.

Scientific system tuning means assessing the problem, trying a change,  
then measuring to see if it was effective, or if it caused more  
trouble.  Lather, rinse, repeat.
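
For the "measure" step, one source that already exists is the system-wide
TCP counter table in /proc/net/netstat. The Python sketch below parses that
header-line / value-line format. It is not the per-socket statistic asked
for above, only an illustration of the kind of before/after comparison
meant here. The counter names in SAMPLE are assumed to be present on the
kernel in question (check your own /proc/net/netstat), and the values are
invented sample data, not measurements.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"Prefix.Name": value}.

    The file consists of line pairs: a header line of counter names and
    a line of values, both starting with the same "Prefix:" tag.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    counters = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}.{name}"] = int(number)
    return counters


def read_netstat(path: str = "/proc/net/netstat") -> dict:
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_netstat(f.read())


# Invented sample data in the same layout as the real file.
SAMPLE = """\
TcpExt: PruneCalled RcvPruned OfoPruned TCPRcvCollapsed
TcpExt: 3 0 1 42
IpExt: InOctets OutOctets
IpExt: 1000 2000
"""

stats = parse_netstat(SAMPLE)
print(stats["TcpExt.PruneCalled"], stats["TcpExt.TCPRcvCollapsed"])
```

Taking one snapshot before a test run and one after, then subtracting,
gives a crude view of whether receive buffers came under pressure during
the run.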

>> While we may not be able to determine a single optimal buffer size  
>> for all BDPs, are there diminishing returns in most common cases  
>> for increasing the buffer size past, say, 16MB?
> Good question. It all depends on how much data you are transferring.  
> In order to fully open a 128MB tcp window over a very long WAN, you  
> will need to transfer at least a few gigabytes of data. If you only  
> transfer 100 MB at a time, then you will probably be fine with a 16  
> MB window as you are not transferring enough data to open the window  
> anyways. In our environment, we are expecting to transfer 100s of GB  
> if not even more, so the 16 MB window would be very limiting.

What about for a fast LAN?

>>> The information you are curious about is more relevant to creating  
>>> better default values of the tcp buffer size. This could be  
>>> useful, but would be a long process and there are so many  
>>> variables that I'm not sure that you could pick proper default  
>>> values anyways. The important thing is that the client can  
>>> currently set its tcp buffer size via the sysctl's, this is  
>>> useless if the server is stuck at a fixed value since the tcp  
>>> window will be the minimum of the client and server's tcp buffer  
>>> sizes.
>>
>>
>> Well, Linux servers are not the only servers that a Linux client  
>> will ever encounter, so the client-side sysctl isn't as bad as  
>> useless. But one can argue whether that knob is ever tweaked by  
>> client administrators, and how useful it is.
> Definitely not useless. Doing a google search for 'tcp_rmem' returns  
> over 11000 hits describing how to configure tcp settings. (ok, I  
> didn't review every result, but the first few pages of results are  
> telling) It doesn't really matter what OS the client and server use,  
> as long as both have the ability to tweak the tcp buffer size.

The number of hits may reflect the desperation that many have had over  
the years to get better performance from the Linux NFS  
implementation.  These days we have better performance out of the box,  
so there is less need for this kind of after-market tweaking.

I think we would be in a much better place if the client and server  
implementations worked "well enough" in nearly any network or  
environment.  That's been my goal since I started working on Linux NFS  
seven years ago.

>> What is an appropriate setting for a server that has to handle a  
>> mix of local and remote clients, for example, or a client that has  
>> to connect to a mix of local and remote servers?
> Yes, this is a tricky one. I believe the best way to handle it is to  
> set the server tcp buffer to the MAX(local, remote) and then let the  
> local client set a smaller tcp buffer and the remote client set a  
> larger tcp buffer. The problem there is that then what if the local  
> client is also a remote client of another nfs server?? At this point  
> there seems to be some limitations.....

Using multiple connections solves this problem pretty well, I think.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-16 18:59         ` Chuck Lever
@ 2008-06-17 22:03           ` Dean Hildebrand
  2008-06-18 21:32             ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-17 22:03 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

Here is my view of the situation:

We have a full picture of TCP.  TCP is well known, there are lots of 
papers/info on it, and I have no doubt about what is occurring with TCP as I 
have traces that clearly show what is happening.  All documents and 
information clearly state that the buffer size is a critical part of 
improving TCP performance in the WAN.  In addition, the congestion 
control algorithm does NOT control the maximum size of the TCP window.  
The CCA controls how quickly the window reaches the maximum size, what 
happens when a packet is dropped and when to close the window.  The only 
item that controls the maximum size of the TCP window is the buffer 
values that I want a sysctl to tweak (just to be in line with the 
existing tcp buffer sysctls in Documentation/networking/ip-sysctl.txt)
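
As a rough illustration of why the buffer matters: a single TCP
connection can have at most one window of unacknowledged data in flight
per round trip, so its throughput is bounded by window / rtt.  A minimal
Python sketch of that bound (the function name and the example numbers
are only for illustration, they are not taken from the patch):

```python
def max_throughput_bytes_per_sec(window_bytes, rtt_seconds):
    """Upper bound on single-connection TCP throughput: one window per RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt must be positive")
    return window_bytes / rtt_seconds


MB = 1024 * 1024

# Illustrative numbers: a 4 MB window over a 200 ms round trip.
# The bound works out to about 20 MB/s, whichever congestion control
# algorithm is in use.
bound = max_throughput_bytes_per_sec(4 * MB, 0.200)
print(f"{bound / MB:.1f} MB/s")
```

This is only an upper bound; slow start, packet loss and other
bottlenecks can keep the real number well below it.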

What we don't have is a full picture of the other parts of transferring 
data from client to server, e.g., Trond just fixed a bug with regards to 
the writeback cache which should help write performance, that was an 
unknown up until this point.

Multiple TCP Streams
====================
There is a really big downside to multiple TCP streams: you have 
multiple TCP streams :)  Each one has its own overhead, connection setup 
cost, and of course its own TCP window.  With a WAN rtt of 200 ms (typical 
over satellite) and the current buffer size of 4MB, the nfs client would 
need 50+ TCP connections to fill the pipe.  That is a 
lot of overhead compared with simply following the standard TCP 
tuning knowhow of increasing the buffer sizes. 
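
To make the arithmetic behind the "50+" figure concrete, here is a small
Python sketch.  It assumes a 10 Gb/s link (the 10GigE case from the
original posting; the bandwidth is not stated in this paragraph) and
treats 4MB as 4 MiB.  The function name is made up for the example.

```python
def connections_needed(bandwidth_bps, rtt_ms, window_bytes):
    """Parallel TCP connections needed so the combined windows cover the BDP."""
    # Bandwidth * delay product in bytes (bits/s * ms -> bytes).
    bdp_bytes = bandwidth_bps * rtt_ms // (8 * 1000)
    # Ceiling division: a partially used window still needs a connection.
    return -(-bdp_bytes // window_bytes)


# Assumed link: 10 Gb/s, 200 ms rtt, 4 MiB window per connection.
# BDP is 250,000,000 bytes, so this prints 60.
print(connections_needed(10_000_000_000, 200, 4 * 1024 * 1024))
```

With a single connection and a buffer sized to the BDP, the same
calculation returns 1.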

The main documentation showing that multiple tcp streams help over the 
WAN is from GridFTP experiments.  They go over the pros and cons of the 
approach, but also talk about how tcp buffer size is very 
important.  Multiple tcp streams are not a replacement for a proper 
buffer size 
(http://www.globus.org/alliance/publications/clusterworld/0904GridFinal.pdf)

If you have documentation contradicting these experiments, I would be 
very interested to see it.

One Variable or Two
===================
I'd be happy with using a single variable for both the send and receive 
buffers, but since we are essentially doing the same thing as the 
net.ipv4.tcp_wmem/rmem variables, I think nfsd_tcp_max_mem would be more 
in line with existing Linux terminology. (also, we are talking about 
nfsd, not nfs, so I'd prefer to make that clear in the variable name)

Summary
=======
I'm providing you with all the information I have with regards to my 
experiments with NFS and TCP.  I agree that a better default is needed 
and my patch allows further experimentation to get to that value.  My 
patch does not modify current NFS behaviour.  It changes a hard 
coded value for the server buffer size to be a variable in /proc.  
Blocking a method to modify this hard coded value means blocking further 
experimentation to find a better default value.  My patch is a first 
step toward trying to find a good default tcp server buffer value.
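
For reference, the behaviour described in the original posting (the
existing sizes act as a floor, writing 0 resets to the default, and the
network TCP sysctls cap the value) could be sketched roughly as follows.
This is a Python illustration of my reading of those semantics, not the
patch code; the function and parameter names are made up, and which
network sysctl provides the cap is an assumption.

```python
def effective_bufsize(requested, default, net_limit):
    """Buffer size the server would use for a new connection (sketch)."""
    if requested == 0:
        # Writing 0 resets the knob to the built-in default.
        return default
    # The built-in default acts as a floor: users cannot go below it.
    floored = max(requested, default)
    # The network-wide TCP limit acts as a ceiling.
    return min(floored, net_limit)


DEFAULT = 3158016     # value read back after "echo 0" in the experiment
NET_LIMIT = 16777216  # assumed cap, e.g. net.core.rmem_max in the experiment

for requested in (0, 1024, 16777216, 64 * 1024 * 1024):
    print(requested, "->", effective_bufsize(requested, DEFAULT, NET_LIMIT))
```

What should happen when the network limit is below the built-in default
is not spelled out in the description; the sketch simply lets the limit
win.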

Dean




>
> Since what we really want to limit is the maximum size of the TCP 
> receive window, it would be more precise to change the name of the new 
> sysctl to something like nfs_tcp_max_window_size.

>
>>>> Another point is that setting the buffer size isn't always a 
>>>> straightforward process. All papers I've read on the subject, and 
>>>> my experience confirms this, is that setting tcp buffer sizes is 
>>>> more of an art.
>>>>
>>>> So having the server set a good default value is half the battle, 
>>>> but allowing users to twiddle with this value is vital.
>>>
>>>>>> The patch uses the current buffer sizes in the code as minimum 
>>>>>> values, which the user cannot decrease. If the user sets a value 
>>>>>> of 0 in either /proc entry, it resets the buffer size to the 
>>>>>> default value. The set /proc values are utilized when the TCP 
>>>>>> connection is initialized (mount time). The values are bounded 
>>>>>> above by the *minimum* of the /proc values and the network TCP 
>>>>>> sysctls.
>>>>>>
>>>>>> To demonstrate the usefulness of this patch, details of an 
>>>>>> experiment between 2 computers with a rtt of 30ms is provided 
>>>>>> below. In this experiment, increasing the server 
>>>>>> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>>>>>>
>>>>>> EXPERIMENT
>>>>>> ==========
>>>>>> This experiment simulates a WAN by using tc together with netem 
>>>>>> to add a 30 ms delay to all packets on a nfs client. The goal is 
>>>>>> to show that by only changing tcp_rcvbuf, the nfs client can 
>>>>>> increase write performance in the WAN. To verify the patch has 
>>>>>> the desired effect on the TCP window, I created two tcptrace 
>>>>>> plots that show the difference in tcp window behaviour before and 
>>>>>> after the server TCP rcvbuf size is increased. When using the 
>>>>>> default server tcpbuf value of 6M, we can see the TCP window top 
>>>>>> out around 4.6 M, whereas increasing the server tcpbuf value to 
>>>>>> 32M, we can see that the TCP window tops out around 13M. 
>>>>>> Performance jumps from 43 MB/s to 90 MB/s.
>>>>>>
>>>>>> Hardware:
>>>>>> 2 dual-core opteron blades
>>>>>> GigE, Broadcom NetXtreme II BCM57065 cards
>>>>>> A single gigabit switch in the middle
>>>>>> 1500 MTU
>>>>>> 8 GB memory
>>>>>>
>>>>>> Software:
>>>>>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>>>>>> RHEL4
>>>>>>
>>>>>> NFS Configuration:
>>>>>> 64 rpc slots
>>>>>> 32 nfsds
>>>>>> Export ext3 file system. This disk is quite slow, I therefore 
>>>>>> exported using async to reduce the effect of the disk on the back 
>>>>>> end. This way, the experiments record the time it takes for the 
>>>>>> data to get to the server (not to the disk).
>>>>>> # exportfs -v
>>>>>> /export 
>>>>>> <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>>>>>
>>>>>> # cat /proc/mounts
>>>>>> bear109:/export /mnt nfs 
>>>>>> rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 
>>>>>> 0 0
>>>>>>
>>>>>> fs.nfs.nfs_congestion_kb = 91840
>>>>>> net.ipv4.tcp_congestion_control = cubic
>>>>>>
>>>>>> Network tc Command executed on client:
>>>>>> tc qdisc add dev eth0 root netem delay 30ms
>>>>>> rtt from client (bear108) to server (bear109)
>>>>>> #ping bear109
>>>>>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0 
>>>>>> ttl=64 time=31.4 ms
>>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1 
>>>>>> ttl=64 time=32.0 ms
>>>>>>
>>>>>> TCP Configuration on client and server:
>>>>>> # Controls IP packet forwarding
>>>>>> net.ipv4.ip_forward = 0
>>>>>> # Controls source route verification
>>>>>> net.ipv4.conf.default.rp_filter = 1
>>>>>> # Do not accept source routing
>>>>>> net.ipv4.conf.default.accept_source_route = 0
>>>>>> # Controls the System Request debugging functionality of the kernel
>>>>>> kernel.sysrq = 0
>>>>>> # Controls whether core dumps will append the PID to the core 
>>>>>> filename
>>>>>> # Useful for debugging multi-threaded applications
>>>>>> kernel.core_uses_pid = 1
>>>>>> # Controls the use of TCP syncookies
>>>>>> net.ipv4.tcp_syncookies = 1
>>>>>> # Controls the maximum size of a message, in bytes
>>>>>> kernel.msgmnb = 65536
>>>>>> # Controls the default maximum size of a message queue
>>>>>> kernel.msgmax = 65536
>>>>>> # Controls the maximum shared segment size, in bytes
>>>>>> kernel.shmmax = 68719476736
>>>>>> # Controls the maximum number of shared memory segments, in pages
>>>>>> kernel.shmall = 4294967296
>>>>>> ### IPV4 specific settings
>>>>>> net.ipv4.tcp_timestamps = 0
>>>>>> net.ipv4.tcp_sack = 1
>>>>>> # on systems with a VERY fast bus -> memory interface this is the 
>>>>>> big gainer
>>>>>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>>>>>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>>>>>> net.ipv4.tcp_mem = 4096 16777216 16777216
>>>>>> ### CORE settings (mostly for socket and UDP effect)
>>>>>> net.core.rmem_max = 16777216
>>>>>> net.core.wmem_max = 16777216
>>>>>> net.core.rmem_default = 16777216
>>>>>> net.core.wmem_default = 16777216
>>>>>> net.core.optmem_max = 16777216
>>>>>> net.core.netdev_max_backlog = 300000
>>>>>> # Don't cache ssthresh from previous connection
>>>>>> net.ipv4.tcp_no_metrics_save = 1
>>>>>> # make sure we don't run out of memory
>>>>>> vm.min_free_kbytes = 32768
>>>>>>
>>>>>> Experiments:
>>>>>>
>>>>>> On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
>>>>>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>>> 3158016
>>>>>>
>>>>>> On Client:
>>>>>> mount -t nfs bear109:/export /mnt
>>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>>> ...
>>>>>> KB reclen write
>>>>>> 512000 1024 43252
>>>>>> umount /mnt
>>>>>>
>>>>>> On server:
>>>>>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>>> 16777216
>>>>>>
>>>>>> On Client:
>>>>>> mount -t nfs bear109:/export /mnt
>>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>>> ...
>>>>>> KB reclen write
>>>>>> 512000 1024 90396
>>>>>
>>>>> The numbers you have here are averages over the whole run. 
>>>>> Performing these tests using a variety of record lengths and file 
>>>>> sizes (up to several tens of gigabytes) would be useful to see 
>>>>> where different memory and network latencies kick in.
>>>> Definitely useful, although I'm not sure how this relates to this 
>>>> patch.
>>>
>>> It relates to the whole idea that this is a valid and useful 
>>> parameter to tweak.
>>>
>>> What your experiment shows is that there is some improvement when 
>>> the TCP window is allowed to expand. It does not demonstrate that 
>>> the *best* way to provide this facility is to allow administrators 
>>> to tune the server's TCP buffer sizes.
>> By definition of how TCP is designed, tweaking the send and receive 
>> buffer sizes is useful. Please see the tcp tuning guides in my 
>> other post. I would characterize tweaking the buffers as a necessary 
>> condition but not a sufficient condition to achieve good throughput 
>> with tcp over long distances.
>>>
>>> A single average number can hide a host of underlying sins. This 
>>> simple experiment, for example, does not demonstrate that TCP window 
>>> size is the most significant issue here.
>> I would say it slightly differently, that it demonstrates that it is 
>> significant, but maybe not the *most* significant. There are many 
>> possible bottlenecks and possible knobs to tweak. For example, I'm 
>> still not achieving link speeds, so I'm sure there are other 
>> bottlenecks that are causing reduced performance.
>
> I think that's my basic point.  We don't have the full picture yet.  
> There are benefits to adjusting the maximum window size, but as we 
> learn more it may turn out that we want an entirely different knob or 
> knobs.

>
>>> It does not show that it is more or less effective to adjust the 
>>> window size than to select an appropriate congestion control 
>>> algorithm (say, BIC).
>> Any tcp cong. control algorithm is highly dependent on the tcp buffer 
>> size. The choice of algorithm changes the behaviour when packets are 
>> dropped and in the initial opening of the window, but once the window 
>> is open and no packets are being dropped, the algorithm is 
>> irrelevant. So BIC, or westwood, or highspeed might do better in the 
>> face of dropped packets, but since the current receive buffer is so 
>> small, dropped packets are not the problem. Once we can use the 
>> sysctl's to tweak the server buffer size, only then is the choice of 
>> algorithm going to be important.
>
> Maybe my use of the terminology is imprecise, but clearly the 
> congestion control algorithm matters for determining the TCP window 
> size, which is exactly what we're discussing here.

>
>>> It does not show whether the client and server are using TCP optimally.
>> I'm not sure what you mean by *optimally*. They use tcp the only way 
>> they know how, no?
>
> I'm talking about whether they use Nagle, when they PUSH, how they use 
> the window (servers can close a window when they are busy, for 
> example), and of course whether they can or should use multiple 
> connections.

>
>>> It does not expose problems related to having a single data stream 
>>> with one blocking head (eg SCTP can allow multiple streams over the 
>>> same connection; or better performance might be achieved with 
>>> multiple TCP connections, even if they allow only small windows).
>> Yes, using multiple tcp connections might be useful, but that doesn't 
>> mean you wouldn't want to adjust the tcp window of each one using my 
>> patch. Actually, I can't seem to find the quote, but I read somewhere 
>> that achieving performance in the WAN can be done 2 different ways: 
>> a) If you can tune the buffer sizes that is the best way to go, but 
>> b) if you don't have root access to change the linux tcp settings 
>> then using multiple tcp streams can compensate for small buffer sizes.
>>
>> Andy has/had a patch to add multiple tcp streams to NFS. I think his 
>> patch and my patch work in collaboration to improve wan performance.
>
> Yep, I've discussed this work with him several times.  This might be a 
> more practical solution than allowing larger window sizes (one reason 
> being the dangers of allowing the window to get too large).
>
> While the use of multiple streams has benefits besides increasing the 
> effective TCP window size, only the client side controls the number of 
> connections.  The server wouldn't have much to say about it.

>
>>>> This patch isn't trying to alter default values, or predict buffer 
>>>> sizes based on rtt values, or dynamically alter the tcp window 
>>>> based on dropped packets, etc, it is just giving users the ability 
>>>> to customize the server tcp buffer size.
>>>
>>> I know you posted this patch because of the experiments at CITI with 
>>> long-run 10GbE, and it's handy to now have this to experiment with.
>> Actually at IBM we have our own reasons for using NFS over the WAN. I 
>> would like to get these 2 knobs into the kernel as it is hard to tell 
>> customers to apply kernel patches....
>
>>> It might also be helpful if we had a patch that made the server 
>>> perform better in common environments, so a better default setting 
>>> it seems to me would have greater value than simply creating a new 
>>> tuning knob.
>> I think there are possibly 2 (or more) patches. One that improves the 
>> default buffer sizes and one that lets sysadmins tweak the value. I 
>> don't see why they are mutually exclusive.
>
> They are not.  I'm OK with studying the problem and adjusting the 
> defaults appropriately.
>
> The issue is whether adding this knob is the right approach to 
> adjusting the server.  I don't think we have enough information to 
> understand if this is the most useful approach.  In other words, it 
> seems like a band-aid right now, but in the long run it might be the 
> correct answer.

>
>> My patch is a first step towards allowing NFS into WAN environments. 
>> Linux currently has sysctl values for the TCP parameters for exactly 
>> this reason, it is impossible to predict the network environment of a 
>> linux machine.
>
>> If the Linux nfs server isn't going to build off of the existing 
>> Linux TCP values (which all sysadmins know how to tweak), then it 
>> must allow sysadmins to tweak the NFS server tcp values, either using 
>> my patch or some other related patch. I'm open to how the server tcp 
>> buffers are tweaked, they just need to be able to be tweaked. For 
>> example, if all tcp buffer values in linux were taken out of the 
>> /proc file system and hardcoded, I think there would be a revolt.
>
> I'm not arguing for no tweaking.  What I'm saying is we should provide 
> knobs that are as useful as possible, and include metrics and clear 
> instructions for when and how to set the knob.
>
> You've shown there is improvement, but not that this is the best 
> solution.   It just feels like the work isn't done yet.

>
>>> Would it be hard to add a metric or two with this tweak that would 
>>> allow admins to see how often a socket buffer was completely full, 
>>> completely empty, or how often the window size is being aggressively 
>>> cut?
>> So I've done this using tcpdump in combination with tcptrace. I've 
>> shown people at citi how the tcp window grows in the experiment I 
>> describe.
>
> No, I mean as a part of the patch that adds the tweak, it should 
> report various new statistics that can allow admins to see that they 
> need adjustment, or that there isn't a problem at all in this area.
>
> Scientific system tuning means assessing the problem, trying a change, 
> then measuring to see if it was effective, or if it caused more 
> trouble.  Lather, rinse, repeat.

>
>>> While we may not be able to determine a single optimal buffer size 
>>> for all BDPs, are there diminishing returns in most common cases for 
>>> increasing the buffer size past, say, 16MB?
>> Good question. It all depends on how much data you are transferring. 
>> In order to fully open a 128MB tcp window over a very long WAN, you 
>> will need to transfer at least a few gigabytes of data. If you only 
>> transfer 100 MB at a time, then you will probably be fine with a 16 
>> MB window as you are not transferring enough data to open the window 
>> anyways. In our environment, we are expecting to transfer 100s of GB 
>> if not even more, so the 16 MB window would be very limiting.
>
> What about for a fast LAN?

>
>>>> The information you are curious about is more relevant to creating 
>>>> better default values of the tcp buffer size. This could be useful, 
>>>> but would be a long process and there are so many variables that 
>>>> I'm not sure that you could pick proper default values anyways. The 
>>>> important thing is that the client can currently set its tcp buffer 
>>>> size via the sysctl's, this is useless if the server is stuck at a 
>>>> fixed value since the tcp window will be the minimum of the client 
>>>> and server's tcp buffer sizes.
>>>
>>>
>>> Well, Linux servers are not the only servers that a Linux client 
>>> will ever encounter, so the client-side sysctl isn't as bad as 
>>> useless. But one can argue whether that knob is ever tweaked by 
>>> client administrators, and how useful it is.
>> Definitely not useless. Doing a google search for 'tcp_rmem' returns 
>> over 11000 hits describing how to configure tcp settings. (ok, I 
>> didn't review every result, but the first few pages of results are 
>> telling) It doesn't really matter what OS the client and server use, 
>> as long as both have the ability to tweak the tcp buffer size.
>
> The number of hits may reflect the desperation that many have had over 
> the years to get better performance from the Linux NFS 
> implementation.  These days we have better performance out of the box, 
> so there is less need for this kind of after-market tweaking.
>
> I think we would be in a much better place if the client and server 
> implementations worked "well enough" in nearly any network or 
> environment.  That's been my goal since I started working on Linux NFS 
> seven years ago.
>
>>> What is an appropriate setting for a server that has to handle a mix 
>>> of local and remote clients, for example, or a client that has to 
>>> connect to a mix of local and remote servers?
>> Yes, this is a tricky one. I believe the best way to handle it is to 
>> set the server tcp buffer to the MAX(local, remote) and then let the 
>> local client set a smaller tcp buffer and the remote client set a 
>> larger tcp buffer. The problem there is that then what if the local 
>> client is also a remote client of another nfs server?? At this point 
>> there seems to be some limitations.....
>
> Using multiple connections solves this problem pretty well, I think.
>
> -- 
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com

* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-16 17:59         ` J. Bruce Fields
@ 2008-06-18 18:33           ` Dean Hildebrand
  0 siblings, 0 replies; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-18 18:33 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Chuck Lever, linux-nfs



J. Bruce Fields wrote:
> On Fri, Jun 13, 2008 at 04:58:04PM -0700, Dean Hildebrand wrote:
>   
>> The reason it is an art is that you don't know the hardware that exists  
>> between the client and server.  Talking about things like BDP is fine,  
>> but in reality there are limited buffer sizes, flaky hardware,  
>> fluctuations in traffic, etc etc.  Using the BDP as a starting point  
>> though seems like the best solution, but since the linux server doesn't  
>> know anything about what the BDP is, it is tough to hard code any value  
>> into the linux kernel.  As you said, we can just give a reasonable  
>> default value and then ensure people can play with the knobs.  Most  
>> people use NFS within a LAN, and to date there has been little if any  
>> discussion on using NFS over the WAN (hence my interest), so I would  
>> argue that the current values might not be all that bad with regards to  
>> defaults (at least we know the behaviour isn't horrible for most people).
>>
>> Networks are messy.  Anyone who wants to work in the WAN is going to  
>> have to read about such things, no way around it.  A simple google  
>> search for 'tcp wan' or 'tcp wan linux' gives loads of suggestions on  
>> how to configure your network, so it really isn't a burden on sysadmins  
>> to do such a search and then use the given knobs to adjust the tcp  
>> buffer size appropriately.  My patch gives sysadmins the ability to do  
>> the google search and then have some knobs to turn.
>>
>> Some sample tcp tuning guides that I like:
>> http://acs.lbl.gov/TCP-tuning/tcp-wan-perf.pdf
>> http://acs.lbl.gov/TCP-tuning/linux.html
>> http://gentoo-wiki.com/HOWTO_TCP_Tuning (especially relevant is the part  
>> about the receive buffer)
>>     
>  
>   
>> http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Hildebrand_98265.pdf 
>> (our initial paper on pNFS tuning)
>>     
>
> Several of those refer to problems that can happen when the receive
> buffer size is set unusually high, but none of them give a really
> detailed description of the behavior in that case--do you know of any?
>   
In an earlier post, I referred to the saw-tooth pattern that will happen 
with the window when the sender transmits faster than the receiver can 
receive.  I believe bic and cubic try to reduce the impact by not 
closing the window all the way, but it is still better to not 
intentionally lose packets by setting the receive buffer too high.  Not 
sure if I sent this doc out already, but it also has some info on tuned 
buffers vs. parallel tcp streams.  It also shows some graphs of the 
window closing once too many packets are lost.
http://acs.lbl.gov/TCP-tuning/TCP-Tuning-Tutorial.pdf
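
To show the saw-tooth shape I mean, here is a toy Python model.  It uses
simple Reno-style additive increase / multiplicative decrease (grow by
one segment per round trip, halve on loss), not bic or cubic, and it
treats "the receiver or path cannot keep up" as a fixed capacity.  All
names and numbers are made up for illustration.

```python
def sawtooth(capacity, rounds, start=1):
    """Toy AIMD trace: window size (in segments) at each round trip."""
    window = start
    trace = []
    for _ in range(rounds):
        trace.append(window)
        if window > capacity:
            # Overshoot: packets are lost, so halve the window.
            window = max(1, window // 2)
        else:
            # No loss this round trip: open the window by one segment.
            window += 1
    return trace


# The window climbs past the capacity of 8, drops back to about half,
# and climbs again.
print(sawtooth(capacity=8, rounds=20))
```

Every overshoot costs a round of lost packets and a window that then has
to grow back, which is why it is better not to provoke the losses in the
first place.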

Sections 2.1 and 2.2 of the following paper, published at SC2002, give an 
interesting intro to tuning tcp buffers and the ups and downs of using 
parallel TCP streams.  They quote the gridftp papers and indicate that 
the best performance is with parallel tcp streams and tuned buffers.  
They give the danger of setting a buffer size too big as:
"Although memory is comparably cheap, the vast majority of the 
connections are so small that allocating large buffers to each flow can 
put any system at risk of running out of memory."
http://www.supercomp.org/sc2002/paperpdfs/pap.pap151.pdf
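
A back-of-the-envelope Python sketch of that memory risk, using made-up
numbers (1000 connections, 16 MiB send and receive buffers each).  The
buffer sizes are limits rather than up-front allocations, so this is the
worst case where every connection fills both buffers.

```python
def worst_case_buffer_bytes(connections, sndbuf_bytes, rcvbuf_bytes):
    """Memory consumed if every connection fills both of its buffers."""
    return connections * (sndbuf_bytes + rcvbuf_bytes)


MiB = 1024 * 1024
GiB = 1024 * MiB

# Hypothetical server: 1000 connections, 16 MiB send + 16 MiB receive each.
total = worst_case_buffer_bytes(1000, 16 * MiB, 16 * MiB)
print(f"{total / GiB:.2f} GiB")
```

That worst case is several times the 8 GB of memory in the machines used
for the experiment.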

(Note, both of the following docs are from the same person.  There are 
other docs, but they don't seem to be quite as clear.)
Dean


* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-17 22:03           ` Dean Hildebrand
@ 2008-06-18 21:32             ` Chuck Lever
  2008-06-25  1:06               ` Dean Hildebrand
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2008-06-18 21:32 UTC (permalink / raw)
  To: Dean Hildebrand; +Cc: linux-nfs

Dean Hildebrand wrote:
> We have a full picture of TCP.  TCP is well known, there are lots of 
> papers/info on it, I have no doubt on what is occurring with TCP as I 
> have traces that clearly show what is happening.  All documents and 
> information clearly state that the buffer size is a critical part of 
> improving TCP performance in the WAN.  In addition, the congestion 
> control algorithm does NOT control the maximum size of the TCP window.  
> The CCA controls how quickly the window reaches the maximum size, what 
> happens when a packet is dropped and when to close the window.  The only 
> item that controls the maximum size of the TCP window is the buffer 
> values that I want a sysctl to tweak (just to be in line with the 
> existing tcp buffer sysctls in Documentation/networking/ip-sysctl.txt)

IMO it's just plain broken that the application layer has to understand 
and manage this detail about TCP.

> What we don't have is a full picture of the other parts of transferring 
> data from client to server, e.g., Trond just fixed a bug with regards to 
> the writeback cache which should help write performance, that was an 
> unknown up until this point.
> 
> Multiple TCP Streams
> ====================
> There is a really big downside to multiple TCP streams: you have 
> multiple TCP streams :)  Each one has its own overhead, setup connection 
> cost, and of course TCP window.  With a WAN rtt of 200 ms (typical 
> over satellite) and the current buffer size of 4MB, the nfs client would 
> need 50+ TCP connections to achieve the correct performance.  That is a
> lot of overhead when comparing it with simply following the standard TCP 
> tuning knowhow of increasing the buffer sizes.

I suspect that anyone operating NFS over a sat link would have an 
already lowered performance expectation.

If the maximum window size defaults to, say, 16MB, and you have a 
smaller RTT (which is typical of intercontinental 10GbE links which you 
might be more willing to pump huge amounts of data over than a sat 
link), you will need fewer concurrent connections to achieve optimal 
performance, and that becomes more practical.

For networks with a much smaller BDP (like, say, MOST OF THEM :-) you 
might be able to get away with only a few connections, or even one, if 
we decide to use a larger default maximum window size.

There are plenty of advantages to having multiple connections between 
client and server.  The fact that it helps the large BDP case is just a 
bonus.

> The main documentation showing that multiple tcp streams help over the 
> WAN is from GridFTP experiments.  They go over the pos and neg of the 
> approach, but also talk about how tcp buffer size is also very 
> important.  Multiple tcp streams is not a replacement for a proper 
> buffer size 
> (http://www.globus.org/alliance/publications/clusterworld/0904GridFinal.pdf)

Here's something else to consider:

TCP is likely not the right transport protocol for networks with a large 
BDP.  Perhaps SCTP, which embeds support for multiple streams in a 
single connection, is better for this case... and what we really want to 
do is create an SCTP-based transport capability for NFS.  Or maybe we 
really want to use iWARP over SCTP.

> If you have documentation counteracting these experiments I would be 
> very interested to see them.

I think you are willfully misinterpreting my objection to your sysctl patch.

I never said your experiments are incorrect; they are valuable.  My 
point is that they don't demonstrate that this is a useful knob for our 
most common use cases, and that it is the correct and only way to get 
"good enough" performance for most common deployments of NFS.  It helps 
the large BDP case, but as you said, it doesn't make all the problems go 
away there either.

Is it easy to get optimal results with this?  How do admins evaluate the 
results of changing this value?  Is it easy to get bad results with it?  
Can it result in bad behavior that causes problems for other users 
of the network?

I also never said "use multiple connections but leave the buffer size 
alone."  I think we agree that a larger receive buffer size is a good 
idea for the NFS server, in general.  The question is whether allowing 
admins to tune it is the most effective way to benefit performance for 
our user base, or can we get away with using a more optimal but fixed 
default size (which is simpler for admins to understand and for us to 
maintain)?

Or are we just working around what is effectively a defect in TCP itself?

I think we need to look at the bigger picture, which contains plenty of 
other interesting alternatives that may have larger benefit.

> One Variable or Two
> ===================
> I'd be happy with using a single variable for both the send and receive 
> buffers, but since we are essentially doing the same thing as the 
> net.ipv4.tcp_wmem/rmem variables, I think nfsd_tcp_max_mem would be more 
> in line with existing Linux terminology. (also, we are talking about 
> nfsd, not nfs, so I'd prefer to make that clear in the variable name)
> 
> Summary
> =======
> I'm providing you with all the information I have with regards to my 
> experiments with NFS and TCP.  I agree that a better default is needed 
> and my patch allows further experimentation to get to that value.  My 
> patch does not modify current NFS behaviour.  It changes a hard 
> coded value for the server buffer size to be a variable in /proc.  
> Blocking a method to modify this hard coded value means blocking further 
> experimentation to find a better default value.  My patch is a first 
> step toward trying to find a good default tcp server buffer value.

>> Since what we really want to limit is the maximum size of the TCP 
>> receive window, it would be more precise to change the name of the new 
>> sysctl to something like nfs_tcp_max_window_size.
> 
>>
>>>>> Another point is that setting the buffer size isn't always a 
>>>>> straightforward process. All papers I've read on the subject, and 
>>>>> my experience confirms this, is that setting tcp buffer sizes is 
>>>>> more of an art.
>>>>>
>>>>> So having the server set a good default value is half the battle, 
>>>>> but allowing users to twiddle with this value is vital.
>>>>
>>>>>>> The patch uses the current buffer sizes in the code as minimum 
>>>>>>> values, which the user cannot decrease. If the user sets a value 
>>>>>>> of 0 in either /proc entry, it resets the buffer size to the 
>>>>>>> default value. The set /proc values are utilized when the TCP 
>>>>>>> connection is initialized (mount time). The values are bounded 
>>>>>>> above by the *minimum* of the /proc values and the network TCP 
>>>>>>> sysctls.
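
The clamping rules in the paragraph above can be sketched as follows.
This is a minimal illustration, not the patch's code: the function name
effective_rcvbuf and its parameters are hypothetical, DEFAULT is the
value the experiment below reports after writing 0, and system_max
stands in for whichever network TCP sysctl limit applies.  It also
assumes that the floor wins if the two bounds ever conflict.

```python
def effective_rcvbuf(requested, default, system_max):
    """Return the buffer size implied by the rules described above.

    Hypothetical helper for illustration only; not taken from the patch.
    """
    if requested == 0:
        # Writing 0 resets the value to the built-in default.
        return default
    # Bounded above by the minimum of the /proc value and the TCP sysctl limit.
    capped = min(requested, system_max)
    # The built-in default is a floor that the user cannot go below
    # (assumption: the floor is applied last).
    return max(capped, default)


DEFAULT = 3158016      # value read back after "echo 0" in the experiment
SYSTEM_MAX = 16777216  # net.ipv4.tcp_rmem maximum used in the experiment

for requested in (0, 1048576, 16777216, 33554432):
    print(requested, "->", effective_rcvbuf(requested, DEFAULT, SYSTEM_MAX))
```

With these inputs, a request below the default is raised to the
default, and a request above the system limit is cut back to it.
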
>>>>>>>
>>>>>>> To demonstrate the usefulness of this patch, details of an 
>>>>>>> experiment between 2 computers with a rtt of 30ms are provided 
>>>>>>> below. In this experiment, increasing the server 
>>>>>>> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>>>>>>>
>>>>>>> EXPERIMENT
>>>>>>> ==========
>>>>>>> This experiment simulates a WAN by using tc together with netem 
>>>>>>> to add a 30 ms delay to all packets on a nfs client. The goal is 
>>>>>>> to show that by only changing tcp_rcvbuf, the nfs client can 
>>>>>>> increase write performance in the WAN. To verify the patch has 
>>>>>>> the desired effect on the TCP window, I created two tcptrace 
>>>>>>> plots that show the difference in tcp window behaviour before and 
>>>>>>> after the server TCP rcvbuf size is increased. When using the 
>>>>>>> default server tcpbuf value of 6M, we can see the TCP window top 
>>>>>>> out around 4.6 M, whereas increasing the server tcpbuf value to 
>>>>>>> 32M, we can see that the TCP window tops out around 13M. 
>>>>>>> Performance jumps from 43 MB/s to 90 MB/s.
>>>>>>>
>>>>>>> Hardware:
>>>>>>> 2 dual-core opteron blades
>>>>>>> GigE, Broadcom NetXtreme II BCM57065 cards
>>>>>>> A single gigabit switch in the middle
>>>>>>> 1500 MTU
>>>>>>> 8 GB memory
>>>>>>>
>>>>>>> Software:
>>>>>>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>>>>>>> RHEL4
>>>>>>>
>>>>>>> NFS Configuration:
>>>>>>> 64 rpc slots
>>>>>>> 32 nfsds
>>>>>>> Export ext3 file system. This disk is quite slow, I therefore 
>>>>>>> exported using async to reduce the effect of the disk on the back 
>>>>>>> end. This way, the experiments record the time it takes for the 
>>>>>>> data to get to the server (not to the disk).
>>>>>>> # exportfs -v
>>>>>>> /export 
>>>>>>> <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>>>>>>
>>>>>>> # cat /proc/mounts
>>>>>>> bear109:/export /mnt nfs 
>>>>>>> rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 
>>>>>>> 0 0
>>>>>>>
>>>>>>> fs.nfs.nfs_congestion_kb = 91840
>>>>>>> net.ipv4.tcp_congestion_control = cubic
>>>>>>>
>>>>>>> Network tc Command executed on client:
>>>>>>> tc qdisc add dev eth0 root netem delay 30ms
>>>>>>> rtt from client (bear108) to server (bear109)
>>>>>>> #ping bear109
>>>>>>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>>>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0 
>>>>>>> ttl=64 time=31.4 ms
>>>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1 
>>>>>>> ttl=64 time=32.0 ms
>>>>>>>
>>>>>>> TCP Configuration on client and server:
>>>>>>> # Controls IP packet forwarding
>>>>>>> net.ipv4.ip_forward = 0
>>>>>>> # Controls source route verification
>>>>>>> net.ipv4.conf.default.rp_filter = 1
>>>>>>> # Do not accept source routing
>>>>>>> net.ipv4.conf.default.accept_source_route = 0
>>>>>>> # Controls the System Request debugging functionality of the kernel
>>>>>>> kernel.sysrq = 0
>>>>>>> # Controls whether core dumps will append the PID to the core 
>>>>>>> filename
>>>>>>> # Useful for debugging multi-threaded applications
>>>>>>> kernel.core_uses_pid = 1
>>>>>>> # Controls the use of TCP syncookies
>>>>>>> net.ipv4.tcp_syncookies = 1
>>>>>>> # Controls the maximum size of a message, in bytes
>>>>>>> kernel.msgmnb = 65536
>>>>>>> # Controls the default maxmimum size of a mesage queue
>>>>>>> kernel.msgmax = 65536
>>>>>>> # Controls the maximum shared segment size, in bytes
>>>>>>> kernel.shmmax = 68719476736
>>>>>>> # Controls the maximum number of shared memory segments, in pages
>>>>>>> kernel.shmall = 4294967296
>>>>>>> ### IPV4 specific settings
>>>>>>> net.ipv4.tcp_timestamps = 0
>>>>>>> net.ipv4.tcp_sack = 1
>>>>>>> # on systems with a VERY fast bus -> memory interface this is the 
>>>>>>> big gainer
>>>>>>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>>>>>>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>>>>>>> net.ipv4.tcp_mem = 4096 16777216 16777216
>>>>>>> ### CORE settings (mostly for socket and UDP effect)
>>>>>>> net.core.rmem_max = 16777216
>>>>>>> net.core.wmem_max = 16777216
>>>>>>> net.core.rmem_default = 16777216
>>>>>>> net.core.wmem_default = 16777216
>>>>>>> net.core.optmem_max = 16777216
>>>>>>> net.core.netdev_max_backlog = 300000
>>>>>>> # Don't cache ssthresh from previous connection
>>>>>>> net.ipv4.tcp_no_metrics_save = 1
>>>>>>> # make sure we don't run out of memory
>>>>>>> vm.min_free_kbytes = 32768
>>>>>>>
>>>>>>> Experiments:
>>>>>>>
>>>>>>> On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
>>>>>>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>> 3158016
>>>>>>>
>>>>>>> On Client:
>>>>>>> mount -t nfs bear109:/export /mnt
>>>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>>>> ...
>>>>>>> KB reclen write
>>>>>>> 512000 1024 43252
>>>>>>> umount /mnt
>>>>>>>
>>>>>>> On server:
>>>>>>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>> 16777216
>>>>>>>
>>>>>>> On Client:
>>>>>>> mount -t nfs bear109:/export /mnt
>>>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>>>> ...
>>>>>>> KB reclen write
>>>>>>> 512000 1024 90396
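
The note that the real tcp buffer size is double tcp_rcvbuf is what
ties the two sysctl values in this experiment to the "6M" and "32M"
figures in its description.  A small sketch of that arithmetic, with a
hypothetical helper name and assuming "M" means 2**20 bytes:

```python
def real_buffer_bytes(tcp_rcvbuf):
    # Per the experiment's note, the real buffer is double the sysctl value.
    return 2 * tcp_rcvbuf


# The two tcp_rcvbuf values read back on the server in the experiment.
for value in (3158016, 16777216):
    real = real_buffer_bytes(value)
    print(f"tcp_rcvbuf={value} -> {real} bytes ({real / 2**20:.2f} MiB)")
```
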
>>>>>>
>>>>>> The numbers you have here are averages over the whole run. 
>>>>>> Performing these tests using a variety of record lengths and file 
>>>>>> sizes (up to several tens of gigabytes) would be useful to see 
>>>>>> where different memory and network latencies kick in.
>>>>> Definitely useful, although I'm not sure how this relates to this 
>>>>> patch.
>>>>
>>>> It relates to the whole idea that this is a valid and useful 
>>>> parameter to tweak.
>>>>
>>>> What your experiment shows is that there is some improvement when 
>>>> the TCP window is allowed to expand. It does not demonstrate that 
>>>> the *best* way to provide this facility is to allow administrators 
>>>> to tune the server's TCP buffer sizes.
>>> By definition of how TCP is designed, tweaking the send and receive 
>>> buffer sizes is useful. Please see the tcp tuning guides in my 
>>> other post. I would characterize tweaking the buffers as a necessary 
>>> condition but not a sufficient condition to achieve good throughput 
>>> with tcp over long distances.
>>>>
>>>> A single average number can hide a host of underlying sins. This 
>>>> simple experiment, for example, does not demonstrate that TCP window 
>>>> size is the most significant issue here.
>>> I would say it slightly differently, that it demonstrates that it is 
>>> significant, but maybe not the *most* significant. There are many 
>>> possible bottlenecks and possible knobs to tweak. For example, I'm 
>>> still not achieving link speeds, so I'm sure there are other 
>>> bottlenecks that are causing reduced performance.
>>
>> I think that's my basic point.  We don't have the full picture yet.  
>> There are benefits to adjusting the maximum window size, but as we 
>> learn more it may turn out that we want an entirely different knob or 
>> knobs.
> 
>>
>>>> It does not show that it is more or less effective to adjust the 
>>>> window size than to select an appropriate congestion control 
>>>> algorithm (say, BIC).
>>> Any tcp cong. control algorithm is highly dependent on the tcp buffer 
>>> size. The choice of algorithm changes the behaviour when packets are 
>>> dropped and in the initial opening of the window, but once the window 
>>> is open and no packets are being dropped, the algorithm is 
>>> irrelevant. So BIC, or westwood, or highspeed might do better in the 
>>> face of dropped packets, but since the current receive buffer is so 
>>> small, dropped packets are not the problem. Once we can use the 
>>> sysctl's to tweak the server buffer size, only then is the choice of 
>>> algorithm going to be important.
>>
>> Maybe my use of the terminology is imprecise, but clearly the 
>> congestion control algorithm matters for determining the TCP window 
>> size, which is exactly what we're discussing here.
> 
>>
>>>> It does not show whether the client and server are using TCP optimally.
>>> I'm not sure what you mean by *optimally*. They use tcp the only way 
>>> they know how, no?
>>
>> I'm talking about whether they use Nagle, when they PUSH, how they use 
>> the window (servers can close a window when they are busy, for 
>> example), and of course whether they can or should use multiple 
>> connections.
> 
>>
>>>> It does not expose problems related to having a single data stream 
>>>> with one blocking head (eg SCTP can allow multiple streams over the 
>>>> same connection; or better performance might be achieved with 
>>>> multiple TCP connections, even if they allow only small windows).
>>> Yes, using multiple tcp connections might be useful, but that doesn't 
>>> mean you wouldn't want to adjust the tcp window of each one using my 
>>> patch. Actually, I can't seem to find the quote, but I read somewhere 
>>> that achieving performance in the WAN can be done 2 different ways: 
>>> a) If you can tune the buffer sizes that is the best way to go, but 
>>> b) if you don't have root access to change the linux tcp settings 
>>> then using multiple tcp streams can compensate for small buffer sizes.
>>>
>>> Andy has/had a patch to add multiple tcp streams to NFS. I think his 
>>> patch and my patch work in collaboration to improve wan performance.
>>
>> Yep, I've discussed this work with him several times.  This might be a 
>> more practical solution than allowing larger window sizes (one reason 
>> being the dangers of allowing the window to get too large).
>>
>> While the use of multiple streams has benefits besides increasing the 
>> effective TCP window size, only the client side controls the number of 
>> connections.  The server wouldn't have much to say about it.
> 
>>
>>>>> This patch isn't trying to alter default values, or predict buffer 
>>>>> sizes based on rtt values, or dynamically alter the tcp window 
>>>>> based on dropped packets, etc, it is just giving users the ability 
>>>>> to customize the server tcp buffer size.
>>>>
>>>> I know you posted this patch because of the experiments at CITI with 
>>>> long-run 10GbE, and it's handy to now have this to experiment with.
>>> Actually at IBM we have our own reasons for using NFS over the WAN. I 
>>> would like to get these 2 knobs into the kernel as it is hard to tell 
>>> customers to apply kernel patches....
>>
>>>> It might also be helpful if we had a patch that made the server 
>>>> perform better in common environments, so a better default setting 
>>>> it seems to me would have greater value than simply creating a new 
>>>> tuning knob.
>>> I think there are possibly 2 (or more) patches. One that improves the 
>>> default buffer sizes and one that lets sysadmins tweak the value. I 
>>> don't see why they are mutually exclusive.
>>
>> They are not.  I'm OK with studying the problem and adjusting the 
>> defaults appropriately.
>>
>> The issue is whether adding this knob is the right approach to 
>> adjusting the server.  I don't think we have enough information to 
>> understand if this is the most useful approach.  In other words, it 
>> seems like a band-aid right now, but in the long run it might be the 
>> correct answer.
> 
>>
>>> My patch is a first step towards allowing NFS into WAN environments. 
>>> Linux currently has sysctl values for the TCP parameters for exactly 
>>> this reason, it is impossible to predict the network environment of a 
>>> linux machine.
>>
>>> If the Linux nfs server isn't going to build off of the existing 
>>> Linux TCP values (which all sysadmins know how to tweak), then it 
>>> must allow sysadmins to tweak the NFS server tcp values, either using 
>>> my patch or some other related patch. I'm open to how the server tcp 
>>> buffers are tweaked, they just need to be able to be tweaked. For 
>>> example, if all tcp buffer values in linux were taken out of the 
>>> /proc file system and hardcoded, I think there would be a revolt.
>>
>> I'm not arguing for no tweaking.  What I'm saying is we should provide 
>> knobs that are as useful as possible, and include metrics and clear 
>> instructions for when and how to set the knob.
>>
>> You've shown there is improvement, but not that this is the best 
>> solution.   It just feels like the work isn't done yet.
> 
>>
>>>> Would it be hard to add a metric or two with this tweak that would 
>>>> allow admins to see how often a socket buffer was completely full, 
>>>> completely empty, or how often the window size is being aggressively 
>>>> cut?
>>> So I've done this using tcpdump in combination with tcptrace. I've 
>>> shown people at citi how the tcp window grows in the experiment I 
>>> describe.
>>
>> No, I mean as a part of the patch that adds the tweak, it should 
>> report various new statistics that can allow admins to see that they 
>> need adjustment, or that there isn't a problem at all in this area.
>>
>> Scientific system tuning means assessing the problem, trying a change, 
>> then measuring to see if it was effective, or if it caused more 
>> trouble.  Lather, rinse, repeat.
> 
>>
>>>> While we may not be able to determine a single optimal buffer size 
>>>> for all BDPs, are there diminishing returns in most common cases for 
>>>> increasing the buffer size past, say, 16MB?
>>> Good question. It all depends on how much data you are transferring. 
>>> In order to fully open a 128MB tcp window over a very long WAN, you 
>>> will need to transfer at least a few gigabytes of data. If you only 
>>> transfer 100 MB at a time, then you will probably be fine with a 16 
>>> MB window as you are not transferring enough data to open the window 
>>> anyways. In our environment, we are expecting to transfer 100s of GB 
>>> if not even more, so the 16 MB window would be very limiting.
>>
>> What about for a fast LAN?
> 
>>
>>>>> The information you are curious about is more relevant to creating 
>>>>> better default values of the tcp buffer size. This could be useful, 
>>>>> but would be a long process and there are so many variables that 
>>>>> I'm not sure that you could pick proper default values anyways. The 
>>>>> important thing is that the client can currently set its tcp buffer 
>>>>> size via the sysctl's, this is useless if the server is stuck at a 
>>>>> fixed value since the tcp window will be the minimum of the client 
>>>>> and server's tcp buffer sizes.
>>>>
>>>>
>>>> Well, Linux servers are not the only servers that a Linux client 
>>>> will ever encounter, so the client-side sysctl isn't as bad as 
>>>> useless. But one can argue whether that knob is ever tweaked by 
>>>> client administrators, and how useful it is.
>>> Definitely not useless. Doing a google search for 'tcp_rmem' returns 
>>> over 11000 hits describing how to configure tcp settings. (ok, I 
>>> didn't review every result, but the first few pages of results are 
>>> telling) It doesn't really matter what OS the client and server use, 
>>> as long as both have the ability to tweak the tcp buffer size.
>>
>> The number of hits may reflect the desperation that many have had over 
>> the years to get better performance from the Linux NFS 
>> implementation.  These days we have better performance out of the box, 
>> so there is less need for this kind of after-market tweaking.
>>
>> I think we would be in a much better place if the client and server 
>> implementations worked "well enough" in nearly any network or 
>> environment.  That's been my goal since I started working on Linux NFS 
>> seven years ago.
>>
>>>> What is an appropriate setting for a server that has to handle a mix 
>>>> of local and remote clients, for example, or a client that has to 
>>>> connect to a mix of local and remote servers?
>>> Yes, this is a tricky one. I believe the best way to handle it is to 
>>> set the server tcp buffer to the MAX(local, remote) and then let the 
>>> local client set a smaller tcp buffer and the remote client set a 
>>> larger tcp buffer. The problem there is that then what if the local 
>>> client is also a remote client of another nfs server?? At this point 
>>> there seems to be some limitations.....
>>
>> Using multiple connections solves this problem pretty well, I think.
>>
>> -- 
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com

* Re: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
  2008-06-18 21:32             ` Chuck Lever
@ 2008-06-25  1:06               ` Dean Hildebrand
  0 siblings, 0 replies; 15+ messages in thread
From: Dean Hildebrand @ 2008-06-25  1:06 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs

Hi Chuck,

It seems we are at an impasse.  You disagree with the current way Linux 
does TCP tuning (through sysctls), and so you disagree with my patch, 
which follows the current way of doing things.  The thing is, we are 
living in a world where Linux does its TCP tuning through sysctls.  We 
must live with this fact and try to develop a short-term solution that 
works within this framework.  A long-term solution should be 
investigated at the same time; I like everything you said about using 
SCTP, iWARP, etc.

I never asked you to contradict MY experiments, only the experiments 
from GridFTP, which demonstrate that BOTH the rcv buffer AND the number 
of TCP connections are important over long fat links.  But here is 
something to consider: if you don't like my sysctl to control the rcv 
buffer, how would you control the # of tcp connections?  Possibly a 
sysctl?  Maybe a mount option?  Either way, you are exposing this 
information to the application layer.  Plus, why are you biasing one 
type of tcp tuning vs. another without any experiments to back your 
bias up?

In summary, defaults are important, and I think Olga's patch helps a lot 
in that regard, but they cannot replace customized tuning.

Dean

Chuck Lever wrote:
> Dean Hildebrand wrote:
>> We have a full picture of TCP.  TCP is well known, there are lots of 
>> papers/info on it, I have no doubt on what is occurring with TCP as I 
>> have traces that clearly show what is happening.  All documents and 
>> information clearly state that the buffer size is a critical part of 
>> improving TCP performance in the WAN.  In addition, the congestion 
>> control algorithm does NOT control the maximum size of the TCP 
>> window.  The CCA controls how quickly the window reaches the maximum 
>> size, what happens when a packet is dropped and when to close the 
>> window.  The only item that controls the maximum size of the TCP 
>> window is the buffer values that I want a sysctl to tweak (just to be 
>> in line with the existing tcp buffer sysctls in 
>> Documentation/networking/ip-sysctl.txt)
>
> IMO it's just plain broken that the application layer has to 
> understand and manage this detail about TCP.
>
>> What we don't have is a full picture of the other parts of 
>> transferring data from client to server, e.g., Trond just fixed a bug 
>> with regards to the writeback cache which should help write 
>> performance, that was an unknown up until this point.
>>
>> Multiple TCP Streams
>> ===============
>> There is a really big downside to multiple TCP streams: you have 
>> multiple TCP streams :)  Each one has its own overhead, setup 
>> connection cost, and of course TCP window.  With a WAN rtt of 200 
>> ms (typical over satellite) and the current buffer size of 4MB, the 
>> nfs client would need 50+ TCP connections to achieve the correct 
>> performance.  That is a lot of overhead when comparing it with 
>> simply following the standard TCP tuning knowhow of increasing the 
>> buffer sizes.
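
As a rough check on the "50+ connections" estimate above: each
connection can have at most one window of data in flight per round
trip, so the number of connections needed to fill a link is about the
bandwidth * delay product divided by the per-connection window.  The
sketch below assumes a 10 Gb/s link (the paragraph does not state a
link speed; 10GigE is the example used at the start of this thread),
takes "4MB" to mean 4 * 2**20 bytes, and uses hypothetical function
names.

```python
import math


def bdp_bytes(link_bits_per_sec, rtt_sec):
    # Bandwidth * delay product: bytes that must be in flight to fill the link.
    return link_bits_per_sec / 8 * rtt_sec


def connections_needed(link_bits_per_sec, rtt_sec, window_bytes):
    # Each connection contributes at most one window per round trip.
    return math.ceil(bdp_bytes(link_bits_per_sec, rtt_sec) / window_bytes)


LINK = 10e9   # assumption: 10 Gb/s link
RTT = 0.200   # 200 ms, as in the paragraph above

print(connections_needed(LINK, RTT, 4 * 2**20))   # 4 MiB window  -> 60
print(connections_needed(LINK, RTT, 16 * 2**20))  # 16 MiB window -> 15
```

Under these assumptions the 4MB case lands in the same range as the
"50+" figure, and a 16MB window cuts the connection count to roughly a
quarter.
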
>
> I suspect that anyone operating NFS over a sat link would have an 
> already lowered performance expectation.
>
> If the maximum window size defaults to, say, 16MB, and you have a 
> smaller RTT (which is typical of intercontinental 10GbE links which 
> you might be more willing to pump huge amounts of data over than a sat 
> link), you will need fewer concurrent connections to achieve optimal 
> performance, and that becomes more practical.
>
> For networks with a much smaller BDP (like, say, MOST OF THEM :-) you 
> might be able to get away with only a few connections, or even one, if 
> we decide to use a larger default maximum window size.
>
> There are plenty of advantages to having multiple connections between 
> client and server.  The fact that it helps the large BDP case is just 
> a bonus.
>
>> The main documentation showing that multiple tcp streams help over 
>> the WAN is from GridFTP experiments.  They go over the pos and neg of 
>> the approach, but also talk about how tcp buffer size is also very 
>> important.  Multiple tcp streams is not a replacement for a proper 
>> buffer size 
>> (http://www.globus.org/alliance/publications/clusterworld/0904GridFinal.pdf) 
>>
>
> Here's something else to consider:
>
> TCP is likely not the right transport protocol for networks with a 
> large BDP.  Perhaps SCTP, which embeds support for multiple streams in 
> a single connection, is better for this case... and what we really 
> want to do is create an SCTP-based transport capability for NFS.  Or 
> maybe we really want to use iWARP over SCTP.
>
>> If you have documentation counteracting these experiments I would be 
>> very interested to see them.
>
> I think you are willfully misinterpreting my objection to your sysctl 
> patch.
>
> I never said your experiments are incorrect; they are valuable.  My 
> point is that they don't demonstrate that this is a useful knob for 
> our most common use cases, and that it is the correct and only way to 
> get "good enough" performance for most common deployments of NFS.  It 
> helps the large BDP case, but as you said, it doesn't make all the 
> problems go away there either.
>
> Is it easy to get optimal results with this?  How do admins evaluate 
> the results of changing this value?  Is it easy to get bad results 
> with it?  Can it result in bad behavior that results in problems for 
> other users of the network?
>
> I also never said "use multiple connections but leave the buffer size 
> alone."  I think we agree that a larger receive buffer size is a good 
> idea for the NFS server, in general.  The question is whether allowing 
> admins to tune it is the most effective way to benefit performance for 
> our user base, or can we get away with using a more optimal but fixed 
> default size (which is simpler for admins to understand and for us to 
> maintain)?
>
> Or are we just working around what is effectively a defect in TCP itself?
>
> I think we need to look at the bigger picture, which contains plenty 
> of other interesting alternatives that may have larger benefit.
>
>> One Variable or Two
>> ===============
>> I'd be happy with using a single variable for both the send and 
>> receive buffers, but since we are essentially doing the same thing as 
>> the net.ipv4.tcp_wmem/rmem variables, I think nfsd_tcp_max_mem would 
>> be more in line with existing Linux terminology. (also, we are 
>> talking about nfsd, not nfs, so I'd prefer to make that clear in the 
>> variable name)
>>
>> Summary
>> =======
>> I'm providing you with all the information I have with regards to my 
>> experiments with NFS and TCP.  I agree that a better default is 
>> needed and my patch allows further experimentation to get to that 
>> value.  My patch does not modify current NFS behaviour.  It 
>> changes a hard coded value for the server buffer size to be a 
>> variable in /proc.  Blocking a method to modify this hard coded value 
>> means blocking further experimentation to find a better default 
>> value.  My patch is a first step toward trying to find a good default 
>> tcp server buffer value.
>
>>> Since what we really want to limit is the maximum size of the TCP 
>>> receive window, it would be more precise to change the name of the 
>>> new sysctl to something like nfs_tcp_max_window_size.
>>
>>>
>>>>>> Another point is that setting the buffer size isn't always a 
>>>>>> straightforward process. All papers I've read on the subject, and 
>>>>>> my experience confirms this, is that setting tcp buffer sizes is 
>>>>>> more of an art.
>>>>>>
>>>>>> So having the server set a good default value is half the battle, 
>>>>>> but allowing users to twiddle with this value is vital.
>>>>>
>>>>>>>> The patch uses the current buffer sizes in the code as minimum 
>>>>>>>> values, which the user cannot decrease. If the user sets a 
>>>>>>>> value of 0 in either /proc entry, it resets the buffer size to 
>>>>>>>> the default value. The set /proc values are utilized when the 
>>>>>>>> TCP connection is initialized (mount time). The values are 
>>>>>>>> bounded above by the *minimum* of the /proc values and the 
>>>>>>>> network TCP sysctls.
>>>>>>>>
>>>>>>>> To demonstrate the usefulness of this patch, details of an 
>>>>>>>> experiment between 2 computers with a rtt of 30ms are provided 
>>>>>>>> below. In this experiment, increasing the server 
>>>>>>>> /proc/sys/sunrpc/tcp_rcvbuf value doubles write performance.
>>>>>>>>
>>>>>>>> EXPERIMENT
>>>>>>>> ==========
>>>>>>>> This experiment simulates a WAN by using tc together with netem 
>>>>>>>> to add a 30 ms delay to all packets on a nfs client. The goal 
>>>>>>>> is to show that by only changing tcp_rcvbuf, the nfs client can 
>>>>>>>> increase write performance in the WAN. To verify the patch has 
>>>>>>>> the desired effect on the TCP window, I created two tcptrace 
>>>>>>>> plots that show the difference in tcp window behaviour before 
>>>>>>>> and after the server TCP rcvbuf size is increased. When using 
>>>>>>>> the default server tcpbuf value of 6M, we can see the TCP 
>>>>>>>> window top out around 4.6 M, whereas increasing the server 
>>>>>>>> tcpbuf value to 32M, we can see that the TCP window tops out 
>>>>>>>> around 13M. Performance jumps from 43 MB/s to 90 MB/s.
>>>>>>>>
>>>>>>>> Hardware:
>>>>>>>> 2 dual-core opteron blades
>>>>>>>> GigE, Broadcom NetXtreme II BCM57065 cards
>>>>>>>> A single gigabit switch in the middle
>>>>>>>> 1500 MTU
>>>>>>>> 8 GB memory
>>>>>>>>
>>>>>>>> Software:
>>>>>>>> Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
>>>>>>>> RHEL4
>>>>>>>>
>>>>>>>> NFS Configuration:
>>>>>>>> 64 rpc slots
>>>>>>>> 32 nfsds
>>>>>>>> Export ext3 file system. This disk is quite slow, I therefore 
>>>>>>>> exported using async to reduce the effect of the disk on the 
>>>>>>>> back end. This way, the experiments record the time it takes 
>>>>>>>> for the data to get to the server (not to the disk).
>>>>>>>> # exportfs -v
>>>>>>>> /export 
>>>>>>>> <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)
>>>>>>>>
>>>>>>>> # cat /proc/mounts
>>>>>>>> bear109:/export /mnt nfs 
>>>>>>>> rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 
>>>>>>>> 0 0
>>>>>>>>
>>>>>>>> fs.nfs.nfs_congestion_kb = 91840
>>>>>>>> net.ipv4.tcp_congestion_control = cubic
>>>>>>>>
>>>>>>>> Network tc Command executed on client:
>>>>>>>> tc qdisc add dev eth0 root netem delay 30ms
>>>>>>>> rtt from client (bear108) to server (bear109)
>>>>>>>> #ping bear109
>>>>>>>> PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
>>>>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0 
>>>>>>>> ttl=64 time=31.4 ms
>>>>>>>> 64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1 
>>>>>>>> ttl=64 time=32.0 ms
>>>>>>>>
>>>>>>>> TCP Configuration on client and server:
>>>>>>>> # Controls IP packet forwarding
>>>>>>>> net.ipv4.ip_forward = 0
>>>>>>>> # Controls source route verification
>>>>>>>> net.ipv4.conf.default.rp_filter = 1
>>>>>>>> # Do not accept source routing
>>>>>>>> net.ipv4.conf.default.accept_source_route = 0
>>>>>>>> # Controls the System Request debugging functionality of the 
>>>>>>>> kernel
>>>>>>>> kernel.sysrq = 0
>>>>>>>> # Controls whether core dumps will append the PID to the core 
>>>>>>>> filename
>>>>>>>> # Useful for debugging multi-threaded applications
>>>>>>>> kernel.core_uses_pid = 1
>>>>>>>> # Controls the use of TCP syncookies
>>>>>>>> net.ipv4.tcp_syncookies = 1
>>>>>>>> # Controls the maximum size of a message, in bytes
>>>>>>>> kernel.msgmnb = 65536
>>>>>>>> # Controls the default maxmimum size of a mesage queue
>>>>>>>> kernel.msgmax = 65536
>>>>>>>> # Controls the maximum shared segment size, in bytes
>>>>>>>> kernel.shmmax = 68719476736
>>>>>>>> # Controls the maximum number of shared memory segments, in pages
>>>>>>>> kernel.shmall = 4294967296
>>>>>>>> ### IPV4 specific settings
>>>>>>>> net.ipv4.tcp_timestamps = 0
>>>>>>>> net.ipv4.tcp_sack = 1
>>>>>>>> # on systems with a VERY fast bus -> memory interface this is 
>>>>>>>> the big gainer
>>>>>>>> net.ipv4.tcp_rmem = 4096 16777216 16777216
>>>>>>>> net.ipv4.tcp_wmem = 4096 16777216 16777216
>>>>>>>> net.ipv4.tcp_mem = 4096 16777216 16777216
>>>>>>>> ### CORE settings (mostly for socket and UDP effect)
>>>>>>>> net.core.rmem_max = 16777216
>>>>>>>> net.core.wmem_max = 16777216
>>>>>>>> net.core.rmem_default = 16777216
>>>>>>>> net.core.wmem_default = 16777216
>>>>>>>> net.core.optmem_max = 16777216
>>>>>>>> net.core.netdev_max_backlog = 300000
>>>>>>>> # Don't cache ssthresh from previous connection
>>>>>>>> net.ipv4.tcp_no_metrics_save = 1
>>>>>>>> # make sure we don't run out of memory
>>>>>>>> vm.min_free_kbytes = 32768
>>>>>>>>
>>>>>>>> Experiments:
>>>>>>>>
>>>>>>>> On Server: (note that the real tcp buffer size is double 
>>>>>>>> tcp_rcvbuf)
>>>>>>>> [root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>>> 3158016
>>>>>>>>
>>>>>>>> On Client:
>>>>>>>> mount -t nfs bear109:/export /mnt
>>>>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>>>>> ...
>>>>>>>> KB reclen write
>>>>>>>> 512000 1024 43252
>>>>>>>> umount /mnt
>>>>>>>>
>>>>>>>> On server:
>>>>>>>> [root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>>> [root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
>>>>>>>> 16777216
>>>>>>>>
>>>>>>>> On Client:
>>>>>>>> mount -t nfs bear109:/export /mnt
>>>>>>>> [root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
>>>>>>>> ...
>>>>>>>> KB reclen write
>>>>>>>> 512000 1024 90396
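
The two iozone write figures above can be put side by side as a
speedup and as a fraction of the link.  This is a minimal sketch that
assumes iozone's "KB" means 1024 bytes per second and compares against
the raw 1 Gb/s line rate, ignoring Ethernet, IP, TCP and RPC overhead;
the helper names are hypothetical.

```python
GIGE_BYTES_PER_SEC = 1e9 / 8  # raw 1 Gb/s line rate, no protocol overhead


def to_bytes_per_sec(iozone_kb_per_sec):
    # Assumption: iozone reports throughput in units of 1024 bytes per second.
    return iozone_kb_per_sec * 1024


def link_utilization(iozone_kb_per_sec):
    # Fraction of the raw GigE line rate that the write throughput reaches.
    return to_bytes_per_sec(iozone_kb_per_sec) / GIGE_BYTES_PER_SEC


before_kb, after_kb = 43252, 90396  # write results from the two runs above

print(f"speedup: {after_kb / before_kb:.2f}x")
for label, kb in (("default rcvbuf", before_kb), ("rcvbuf 16777216", after_kb)):
    mb_per_sec = to_bytes_per_sec(kb) / 1e6
    print(f"{label}: {mb_per_sec:.1f} MB/s, {link_utilization(kb):.0%} of raw GigE")
```

Under these assumptions the second run is a little over twice as fast
as the first and reaches roughly three quarters of the raw link rate,
which fits the later remark that link speed is still not being
achieved.
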
>>>>>>>
>>>>>>> The numbers you have here are averages over the whole run. 
>>>>>>> Performing these tests using a variety of record lengths and 
>>>>>>> file sizes (up to several tens of gigabytes) would be useful to 
>>>>>>> see where different memory and network latencies kick in.
>>>>>> Definitely useful, although I'm not sure how this relates to this 
>>>>>> patch.
>>>>>
>>>>> It relates to the whole idea that this is a valid and useful 
>>>>> parameter to tweak.
>>>>>
>>>>> What your experiment shows is that there is some improvement when 
>>>>> the TCP window is allowed to expand. It does not demonstrate that 
>>>>> the *best* way to provide this facility is to allow administrators 
>>>>> to tune the server's TCP buffer sizes.
>>>> By definition of how TCP is designed, tweaking the send and receive 
>>>> buffer sizes is useful. Please see the tcp tuning guides in my 
>>>> other post. I would characterize tweaking the buffers as a 
>>>> necessary condition but not a sufficient condition to achieve good 
>>>> throughput with tcp over long distances.
>>>>>
>>>>> A single average number can hide a host of underlying sins. This 
>>>>> simple experiment, for example, does not demonstrate that TCP 
>>>>> window size is the most significant issue here.
>>>> I would say it slightly differently, that it demonstrates that it 
>>>> is significant, but maybe not the *most* significant. There are 
>>>> many possible bottlenecks and possible knobs to tweak. For example, 
>>>> I'm still not achieving link speeds, so I'm sure there are other 
>>>> bottlenecks that are causing reduced performance.
>>>
>>> I think that's my basic point.  We don't have the full picture yet.  
>>> There are benefits to adjusting the maximum window size, but as we 
>>> learn more it may turn out that we want an entirely different knob 
>>> or knobs.
>>
>>>
>>>>> It does not show that it is more or less effective to adjust the 
>>>>> window size than to select an appropriate congestion control 
>>>>> algorithm (say, BIC).
>>>> Any tcp cong. control algorithm is highly dependent on the tcp 
>>>> buffer size. The choice of algorithm changes the behaviour when 
>>>> packets are dropped and in the initial opening of the window, but 
>>>> once the window is open and no packets are being dropped, the 
>>>> algorithm is irrelevant. So BIC, or westwood, or highspeed might do 
>>>> better in the face of dropped packets, but since the current 
>>>> receive buffer is so small, dropped packets are not the problem. 
>>>> Once we can use the sysctl's to tweak the server buffer size, only 
>>>> then is the choice of algorithm going to be important.
>>>
>>> Maybe my use of the terminology is imprecise, but clearly the 
>>> congestion control algorithm matters for determining the TCP window 
>>> size, which is exactly what we're discussing here.
>>
>>>
>>>>> It does not show whether the client and server are using TCP 
>>>>> optimally.
>>>> I'm not sure what you mean by *optimally*. They use tcp the only 
>>>> way they know how, no?
>>>
>>> I'm talking about whether they use Nagle, when they PUSH, how they 
>>> use the window (servers can close a window when they are busy, for 
>>> example), and of course whether they can or should use multiple 
>>> connections.
>>
>>>
>>>>> It does not expose problems related to having a single data stream 
>>>>> with one blocking head (eg SCTP can allow multiple streams over 
>>>>> the same connection; or better performance might be achieved with 
>>>>> multiple TCP connections, even if they allow only small windows).
>>>> Yes, using multiple tcp connections might be useful, but that 
>>>> doesn't mean you wouldn't want to adjust the tcp window of each one 
>>>> using my patch. Actually, I can't seem to find the quote, but I 
>>>> read somewhere that achieving performance in the WAN can be done 2 
>>>> different ways: a) If you can tune the buffer sizes that is the 
>>>> best way to go, but b) if you don't have root access to change the 
>>>> linux tcp settings then using multiple tcp streams can compensate 
>>>> for small buffer sizes.
>>>>
>>>> Andy has/had a patch to add multiple tcp streams to NFS. I think 
>>>> his patch and my patch work in collaboration to improve wan 
>>>> performance.
>>>
>>> Yep, I've discussed this work with him several times.  This might be 
>>> a more practical solution than allowing larger window sizes (one 
>>> reason being the dangers of allowing the window to get too large).
>>>
>>> While the use of multiple streams has benefits besides increasing 
>>> the effective TCP window size, only the client side controls the 
>>> number of connections.  The server wouldn't have much to say about it.
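The trade-off between one large window and several small ones can be put into rough numbers. The sketch below assumes that n independent connections can together keep about n times the per-connection window in flight; that is a simplification that ignores congestion control interactions and how requests are spread across connections. The function name is hypothetical, and the two sizes are simply the buffer values that appear in the experiment earlier in the thread.

```python
import math


def connections_needed(target_window_bytes, per_connection_window_bytes):
    """Smallest number of connections whose combined windows reach the target.

    Simplifying assumption: the aggregate in-flight data is the sum of the
    per-connection windows.
    """
    return math.ceil(target_window_bytes / per_connection_window_bytes)


# Values taken from the experiment quoted earlier in the thread:
small_window = 3158016    # default server tcp_rcvbuf
large_window = 16777216   # tuned server tcp_rcvbuf

n = connections_needed(large_window, small_window)
print(f"{n} connections with {small_window}-byte windows cover "
      f"{n * small_window} bytes (target {large_window})")
```

Under this simple model the client chooses n, which matches the point above that the server has little say in how many connections are used.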
>>
>>>
>>>>>> This patch isn't trying to alter default values, or predict 
>>>>>> buffer sizes based on rtt values, or dynamically alter the tcp 
>>>>>> window based on dropped packets, etc, it is just giving users the 
>>>>>> ability to customize the server tcp buffer size.
>>>>>
>>>>> I know you posted this patch because of the experiments at CITI 
>>>>> with long-run 10GbE, and it's handy to now have this to experiment 
>>>>> with.
>>>> Actually at IBM we have our own reasons for using NFS over the WAN. 
>>>> I would like to get these 2 knobs into the kernel as it is hard to 
>>>> tell customers to apply kernel patches....
>>>
>>>>> It might also be helpful if we had a patch that made the server 
>>>>> perform better in common environments, so a better default setting 
>>>>> it seems to me would have greater value than simply creating a new 
>>>>> tuning knob.
>>>> I think there are possibly 2 (or more) patches. One that improves 
>>>> the default buffer sizes and one that lets sysadmins tweak the 
>>>> value. I don't see why they are mutually exclusive.
>>>
>>> They are not.  I'm OK with studying the problem and adjusting the 
>>> defaults appropriately.
>>>
>>> The issue is whether adding this knob is the right approach to 
>>> adjusting the server.  I don't think we have enough information to 
>>> understand if this is the most useful approach.  In other words, it 
>>> seems like a band-aid right now, but in the long run it might be the 
>>> correct answer.
>>
>>>
>>>> My patch is a first step towards allowing NFS into WAN 
>>>> environments. Linux currently has sysctl values for the TCP 
>>>> parameters for exactly this reason, it is impossible to predict the 
>>>> network environment of a linux machine.
>>>
>>>> If the Linux nfs server isn't going to build off of the existing 
>>>> Linux TCP values (which all sysadmins know how to tweak), then it 
>>>> must allow sysadmins to tweak the NFS server tcp values, either 
>>>> using my patch or some other related patch. I'm open to how the 
>>>> server tcp buffers are tweaked, they just need to be able to be 
>>>> tweaked. For example, if all tcp buffer values in linux were taken 
>>>> out of the /proc file system and hardcoded, I think there would be 
>>>> a revolt.
>>>
>>> I'm not arguing for no tweaking.  What I'm saying is we should 
>>> provide knobs that are as useful as possible, and include metrics 
>>> and clear instructions for when and how to set the knob.
>>>
>>> You've shown there is improvement, but not that this is the best 
>>> solution.   It just feels like the work isn't done yet.
>>
>>>
>>>>> Would it be hard to add a metric or two with this tweak that would 
>>>>> allow admins to see how often a socket buffer was completely full, 
>>>>> completely empty, or how often the window size is being 
>>>>> aggressively cut?
>>>> So I've done this using tcpdump in combination with tcptrace. I've 
>>>> shown people at citi how the tcp window grows in the experiment I 
>>>> describe.
>>>
>>> No, I mean as a part of the patch that adds the tweak, it should 
>>> report various new statistics that can allow admins to see that they 
>>> need adjustment, or that there isn't a problem at all in this area.
>>>
>>> Scientific system tuning means assessing the problem, trying a 
>>> change, then measuring to see if it was effective, or if it caused 
>>> more trouble.  Lather, rinse, repeat.
>>
>>>
>>>>> While we may not be able to determine a single optimal buffer size 
>>>>> for all BDPs, are there diminishing returns in most common cases 
>>>>> for increasing the buffer size past, say, 16MB?
>>>> Good question. It all depends on how much data you are 
>>>> transferring. In order to fully open a 128MB tcp window over a very 
>>>> long WAN, you will need to transfer at least a few gigabytes of 
>>>> data. If you only transfer 100 MB at a time, then you will probably 
>>>> be fine with a 16 MB window as you are not transferring enough data 
>>>> to open the window anyways. In our environment, we are expecting to 
>>>> transfer 100s of GB if not even more, so the 16 MB window would be 
>>>> very limiting.
>>>
>>> What about for a fast LAN?
>>
>>>
>>>>>> The information you are curious about is more relevant to 
>>>>>> creating better default values of the tcp buffer size. This could 
>>>>>> be useful, but would be a long process and there are so many 
>>>>>> variables that I'm not sure that you could pick proper default 
>>>>>> values anyways. The important thing is that the client can 
>>>>>> currently set its tcp buffer size via the sysctl's, this is 
>>>>>> useless if the server is stuck at a fixed value since the tcp 
>>>>>> window will be the minimum of the client and server's tcp buffer 
>>>>>> sizes.
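The "minimum of the client and server" argument can be sketched numerically. The model below assumes the usable window is simply the smaller of the two buffer sizes and that a single connection can move at most one window per round trip; real stacks reserve part of the buffer for overhead, so treat the result as an upper bound. The function names are hypothetical, and the 120 ms RTT is the transatlantic example from the opening message, not the RTT of the test bed used for the iozone runs.

```python
def effective_window(client_buf_bytes, server_buf_bytes):
    """Usable window under the simplifying assumption window = min(buffers)."""
    return min(client_buf_bytes, server_buf_bytes)


def throughput_bound(window_bytes, rtt_seconds):
    """Upper bound in bytes/s: at most one full window per round trip."""
    return window_bytes / rtt_seconds


client_buf = 16777216  # client tuned via net.ipv4.tcp_rmem / tcp_wmem
rtt = 0.120            # assumed RTT in seconds (illustrative only)

for server_buf in (3158016, 16777216):
    window = effective_window(client_buf, server_buf)
    bound = throughput_bound(window, rtt)
    print(f"server buffer {server_buf}: window {window} bytes, "
          f"at most {bound / 2**20:.1f} MiB/s")
```

With the server at its default, raising the client buffer further changes nothing in this model, because the minimum is already set by the server side.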
>>>>>
>>>>>
>>>>> Well, Linux servers are not the only servers that a Linux client 
>>>>> will ever encounter, so the client-side sysctl isn't as bad as 
>>>>> useless. But one can argue whether that knob is ever tweaked by 
>>>>> client administrators, and how useful it is.
>>>> Definitely not useless. Doing a google search for 'tcp_rmem' 
>>>> returns over 11000 hits describing how to configure tcp settings. 
>>>> (ok, I didn't review every result, but the first few pages of 
>>>> results are telling) It doesn't really matter what OS the client 
>>>> and server use, as long as both have the ability to tweak the tcp 
>>>> buffer size.
>>>
>>> The number of hits may reflect the desperation that many have had 
>>> over the years to get better performance from the Linux NFS 
>>> implementation.  These days we have better performance out of the 
>>> box, so there is less need for this kind of after-market tweaking.
>>>
>>> I think we would be in a much better place if the client and server 
>>> implementations worked "well enough" in nearly any network or 
>>> environment.  That's been my goal since I started working on Linux 
>>> NFS seven years ago.
>>>
>>>>> What is an appropriate setting for a server that has to handle a 
>>>>> mix of local and remote clients, for example, or a client that has 
>>>>> to connect to a mix of local and remote servers?
>>>> Yes, this is a tricky one. I believe the best way to handle it is 
>>>> to set the server tcp buffer to the MAX(local, remote) and then let 
>>>> the local client set a smaller tcp buffer and the remote client set 
>>>> a larger tcp buffer. The problem then is: what if the local client 
>>>> is also a remote client of another nfs server? At this point there 
>>>> seem to be some limitations.
>>>
>>> Using multiple connections solves this problem pretty well, I think.
>>>
>>> -- 
>>> Chuck Lever
>>> chuck[dot]lever[at]oracle[dot]com

Thread overview: 15+ messages
2008-06-10 18:54 [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values Dean Hildebrand
2008-06-11 19:48 ` Chuck Lever
2008-06-11 21:01   ` Talpey, Thomas
2008-06-12 21:03   ` Dean Hildebrand
2008-06-13 18:51     ` Chuck Lever
2008-06-13 20:56       ` J. Bruce Fields
2008-06-14  1:07       ` Dean Hildebrand
2008-06-16 18:59         ` Chuck Lever
2008-06-17 22:03           ` Dean Hildebrand
2008-06-18 21:32             ` Chuck Lever
2008-06-25  1:06               ` Dean Hildebrand
2008-06-13 20:53     ` J. Bruce Fields
2008-06-13 23:58       ` Dean Hildebrand
2008-06-16 17:59         ` J. Bruce Fields
2008-06-18 18:33           ` Dean Hildebrand
