From: Dean Hildebrand <seattleplus@gmail.com>
To: linux-nfs@vger.kernel.org
Subject: [PATCH 0/1] SUNRPC: Add sysctl variables for server TCP snd/rcv buffer values
Date: Tue, 10 Jun 2008 11:54:28 -0700
Message-ID: <484ECDE4.6030108@gmail.com>

Hello,

The motivation for this patch is improved WAN write performance plus 
greater user control on the server of the TCP buffer values (window 
size).  The TCP window determines the amount of outstanding data that a 
client can have on the wire and should be large enough that an NFS client 
can fill up the pipe (the bandwidth * delay product).  Currently the TCP 
receive buffer size (used for client writes) is set very low, which 
prevents a client from filling up a network pipe with a large bandwidth 
* delay product.
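
As a rough illustration of the bandwidth * delay product mentioned above, 
the following Python sketch computes it for the GigE link and 30 ms delay 
used in the experiment later in this message.  The function name 
bdp_bytes is hypothetical and is not part of the patch.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth * delay product in bytes, using integer arithmetic."""
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


# GigE link with a 30 ms round-trip time, as in the experiment.
gige_bdp = bdp_bytes(10**9, 30)
print(gige_bdp)  # 3750000 bytes, i.e. about 3.75 MB
```

A TCP window smaller than this value cannot keep such a link full.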

Currently, the server TCP send window is set to accommodate the maximum 
number of outstanding NFSD read requests (# nfsds * maxiosize), while 
the server TCP receive window is set to a fixed value which can hold a 
few requests.  While these values set a TCP window size that is fine in 
LAN environments with a small BDP, WAN environments can require a much 
larger TCP window size, e.g., a 10GigE transatlantic link with an RTT of 
120 ms has a BDP of approx. 150 MB (10 Gb/s * 0.12 s / 8).
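
The sizing rule described above can be sketched as follows.  The thread 
count (32 nfsds) and the 1 MB I/O size come from the configuration in 
this message.  The receive-side breakdown (three requests, each a 
maximum-size payload plus one 4 KiB page) is only an assumption, chosen 
because it reproduces the default of 3158016 bytes reported later in 
this message.  The function names are hypothetical.

```python
MAX_IO_BYTES = 1048576  # rsize/wsize from the mount options in this message
PAGE_SIZE = 4096        # assumption: 4 KiB pages


def send_buffer_bytes(nfsd_threads: int, max_io_bytes: int = MAX_IO_BYTES) -> int:
    """Send buffer sized for one maximum-size read reply per nfsd thread."""
    return nfsd_threads * max_io_bytes


def recv_buffer_bytes(requests: int,
                      max_io_bytes: int = MAX_IO_BYTES,
                      overhead: int = PAGE_SIZE) -> int:
    """Receive buffer sized for a fixed, small number of write requests."""
    return requests * (max_io_bytes + overhead)


print(send_buffer_bytes(32))  # 33554432 (32 MiB)
print(recv_buffer_bytes(3))   # 3158016
```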

I have a patch to net/sunrpc/svcsock.c that allows a user to manually set 
the server TCP send and receive buffer sizes through the sysctl interface 
to suit the required TCP window of their network architecture.  It adds two 
/proc entries, one for the receive buffer size and one for the send 
buffer size:
/proc/sys/sunrpc/tcp_sndbuf
/proc/sys/sunrpc/tcp_rcvbuf

The patch uses the current buffer sizes in the code as minimum values, 
which the user cannot decrease.  If the user sets a value of 0 in either 
/proc entry, it resets the buffer size to the default value.  The /proc 
values are applied when the TCP connection is initialized (mount 
time).  The effective buffer sizes are bounded above by the *minimum* of 
the /proc values and the network TCP sysctls.
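
The rules in the paragraph above can be sketched in Python as shown 
below.  This is only an illustration of the described behaviour, not the 
patch code.  The function name effective_bufsize is hypothetical, and 
using net.core.rmem_max as the network limit is an assumption (the 
message does not say which TCP sysctl applies).

```python
def effective_bufsize(requested: int, default: int, net_limit: int) -> int:
    """Return the buffer size that would be used for a new connection.

    requested: value written to the /proc entry (0 means "reset to default")
    default:   the size currently hard-coded, acting as a minimum
    net_limit: upper bound taken from the network TCP sysctls (assumed)
    """
    if requested == 0 or requested < default:
        value = default          # 0 resets; smaller values are not allowed
    else:
        value = requested
    return min(value, net_limit)  # never exceed the network limit


DEFAULT_RCVBUF = 3158016   # default reported by the server in this message
NET_LIMIT = 16777216       # net.core.rmem_max from the configuration below

print(effective_bufsize(0, DEFAULT_RCVBUF, NET_LIMIT))         # 3158016
print(effective_bufsize(1024, DEFAULT_RCVBUF, NET_LIMIT))      # 3158016
print(effective_bufsize(16777216, DEFAULT_RCVBUF, NET_LIMIT))  # 16777216
print(effective_bufsize(67108864, DEFAULT_RCVBUF, NET_LIMIT))  # 16777216
```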

To demonstrate the usefulness of this patch, details of an experiment 
between 2 computers with an RTT of 30 ms are provided below.  In this 
experiment, increasing the server /proc/sys/sunrpc/tcp_rcvbuf value 
doubles write performance.

EXPERIMENT
==========
This experiment simulates a WAN by using tc together with netem to add a 
30 ms delay to all packets on an NFS client.  The goal is to show that by 
only changing tcp_rcvbuf, the nfs client can increase write performance 
in the WAN. To verify the patch has the desired effect on the TCP 
window, I created two tcptrace plots that show the difference in tcp 
window behaviour before and after the server TCP rcvbuf size is 
increased.  With the default server receive buffer size of 6M, the TCP 
window tops out around 4.6M, whereas with the server receive buffer 
increased to 32M, the TCP window tops out around 13M.  Write 
performance jumps from 43 MB/s to 90 MB/s.

Hardware:
2 dual-core opteron blades
GigE, Broadcom NetXtreme II BCM57065 cards
A single gigabit switch in the middle
1500 MTU
8 GB memory

Software:
Kernel: Bruce's 2.6.25-rc9-CITI_NFS4_ALL-1 tree
RHEL4

NFS Configuration:
64 rpc slots
32 nfsds
Export ext3 file system.  This disk is quite slow, so I exported it 
using async to reduce the effect of the disk on the back end.  This way, 
the experiments record the time it takes for the data to get to the 
server (not to the disk).
# exportfs -v
/export     <world>(rw,async,wdelay,nohide,insecure,no_root_squash,fsid=0)

# cat /proc/mounts
bear109:/export /mnt nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=9.1.74.144 0 0

fs.nfs.nfs_congestion_kb = 91840
net.ipv4.tcp_congestion_control = cubic

Network tc Command executed on client:
tc qdisc add dev eth0 root netem delay 30ms
rtt from client (bear108) to server (bear109):
# ping bear109
PING bear109.almaden.ibm.com (9.1.74.144) 56(84) bytes of data.
64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=0 ttl=64 time=31.4 ms
64 bytes from bear109.almaden.ibm.com (9.1.74.144): icmp_seq=1 ttl=64 time=32.0 ms

TCP Configuration on client and server:
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename
# Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maximum size of a message queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
### IPV4 specific settings
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
# on systems with a VERY fast bus -> memory interface this is the big gainer
net.ipv4.tcp_rmem = 4096 16777216 16777216
net.ipv4.tcp_wmem = 4096 16777216 16777216
net.ipv4.tcp_mem = 4096 16777216 16777216
### CORE settings (mostly for socket and UDP effect)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max =  16777216
net.core.netdev_max_backlog = 300000
# Don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# make sure we don't run out of memory
vm.min_free_kbytes = 32768

Experiments:

On Server: (note that the real tcp buffer size is double tcp_rcvbuf)
[root@bear109 ~]# echo 0 > /proc/sys/sunrpc/tcp_rcvbuf
[root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
3158016

On Client:
mount -t nfs bear109:/export /mnt
[root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
...
             KB  reclen   write
         512000    1024   43252
umount /mnt

On server:
[root@bear109 ~]# echo 16777216 > /proc/sys/sunrpc/tcp_rcvbuf
[root@bear109 ~]# cat /proc/sys/sunrpc/tcp_rcvbuf
16777216

On Client:
mount -t nfs bear109:/export /mnt
[root@bear108 ~]# iozone -aec -i 0 -+n -f /mnt/test -r 1M -s 500M
...
             KB  reclen   write
         512000    1024   90396
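
For reference, the improvement between the two runs above can be 
computed with a few lines of Python.  This sketch assumes the iozone 
"write" column is in KB/s, which is consistent with the 43 MB/s and 
90 MB/s figures quoted earlier.  The variable names are illustrative.

```python
# iozone write results from the two runs (assumed to be KB/s)
baseline_kb_s = 43252  # default tcp_rcvbuf (3158016)
tuned_kb_s = 90396     # tcp_rcvbuf set to 16777216

speedup = tuned_kb_s / baseline_kb_s
print(f"{speedup:.2f}x")  # 2.09x
```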

Dean
IBM Almaden Research Center
