From: Harshula Jayasuriya <harshula@redhat.com>
To: Steve Dickson <SteveD@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
Chuck Lever <chuck.lever@oracle.com>, Olaf Kirch <okir@suse.de>
Subject: [PATCH] nfs-utils: Add a warning to the nfs manpage regarding using NFS over UDP on high-speed links
Date: Wed, 09 May 2012 10:59:24 +1000
Message-ID: <1336525164.21032.9.camel@serendib>
In-Reply-To: <2194470C-5FD9-4317-9A30-2E6C244138D5@oracle.com>
* Using NFS over UDP on high-speed links such as Gigabit can cause
silent data corruption.
* The man page text was written by Olaf Kirch and committed to openSUSE
(but not upstream):
https://build.opensuse.org/package/view_file?file=warn-nfs-udp.patch&package=nfs-utils&project=openSUSE%3AFactory&rev=8e3e60c70e8270cd4afa036e13f6b2bb
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Olaf Kirch <okir@suse.com>
---
utils/mount/nfs.man | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 81 insertions(+), 0 deletions(-)
diff --git a/utils/mount/nfs.man b/utils/mount/nfs.man
index 0d20cf0..87e27e1 100644
--- a/utils/mount/nfs.man
+++ b/utils/mount/nfs.man
@@ -500,6 +500,8 @@ Specifying a netid that uses TCP forces all traffic from the
command and the NFS client to use TCP.
Specifying a netid that uses UDP forces all traffic types to use UDP.
.IP
+.B Before using NFS over UDP, refer to the TRANSPORT METHODS section.
+.IP
If the
.B proto
mount option is not specified, the
@@ -514,6 +516,8 @@ The
option is an alternative to specifying
.BR proto=udp.
It is included for compatibility with other operating systems.
+.IP
+.B Before using NFS over UDP, refer to the TRANSPORT METHODS section.
.TP 1.5i
.B tcp
The
@@ -1070,6 +1074,83 @@ or
options are specified more than once on the same mount command line,
then the value of the rightmost instance of each of these options
takes effect.
+.SS "Using NFS over UDP on high-speed links"
+Using NFS over UDP on high-speed links such as Gigabit
+.BR "can cause silent data corruption" .
+.P
+The problem can be triggered at high loads, and is caused by problems in
+IP fragment reassembly. NFS reads and writes typically transmit UDP packets
+of 4 Kilobytes or more, which have to be broken up into several fragments
+in order to be sent over the Ethernet link, which limits packets to 1500
+bytes by default. This process happens at the IP network layer and is
+called fragmentation.
+.P
+In order to identify fragments that belong together, IP assigns a 16-bit
+.I IP ID
+value to each packet; fragments generated from the same UDP packet
+will have the same IP ID. The receiving system will collect these
+fragments and combine them to form the original UDP packet. This process
+is called reassembly. The default timeout for packet reassembly is
+30 seconds; if the network stack does not receive all fragments of
+a given packet within this interval, it assumes the missing fragment(s)
+got lost and discards those it already received.
+.P
+The problem this creates over high-speed links is that it is possible
+to send more than 65536 packets within 30 seconds. In fact, with
+heavy NFS traffic one can observe that the IP IDs repeat after about
+5 seconds.
+.P
+This has serious effects on reassembly: if one fragment gets lost,
+another fragment
+.I from a different packet
+but with the
+.I same IP ID
+will arrive within the 30 second timeout, and the network stack will
+combine these fragments to form a new packet. Most of the time, network
+layers above IP will detect this mismatched reassembly - in the case
+of UDP, the UDP checksum, which is a 16-bit checksum over the entire
+packet payload, will usually not match, and UDP will discard the
+bad packet.
+.P
+However, the UDP checksum is only 16 bits wide, so there is a chance of 1 in
+65536 that it will match even if the packet payload is completely
+random (which very often isn't the case). If the checksum does match,
+silent data corruption will occur.
+.P
+This potential should be taken seriously, at least on Gigabit
+Ethernet.
+Network speeds of 100Mbit/s should be considered less
+problematic, because with most traffic patterns IP ID wraparound
+will take much longer than 30 seconds.
+.P
+It is therefore strongly recommended to use
+.BR "NFS over TCP where possible" ,
+since TCP sizes its segments to fit the link and so avoids IP fragmentation.
+.P
+If you absolutely have to use NFS over UDP over Gigabit Ethernet,
+some steps can be taken to mitigate the problem and reduce the
+probability of corruption:
+.TP +1.5i
+.I Jumbo frames:
+Many Gigabit network cards are capable of transmitting
+frames bigger than the 1500 byte limit of traditional Ethernet, typically
+9000 bytes. Using jumbo frames of 9000 bytes will allow you to run NFS over
+UDP at a page size of 8K without fragmentation. Of course, this is
+only feasible if all involved stations support jumbo frames.
+.IP
+To enable a machine to send jumbo frames on cards that support it,
+it is sufficient to configure the interface for an MTU value of 9000.
+.TP +1.5i
+.I Lower reassembly timeout:
+By lowering this timeout below the time it takes the IP ID counter
+to wrap around, incorrect reassembly of fragments can be prevented
+as well. To do so, simply write the new timeout value (in seconds)
+to the file
+.BR /proc/sys/net/ipv4/ipfrag_time .
+.IP
+A value of 2 seconds will greatly reduce the probability of IP ID clashes on
+a single Gigabit link, while still allowing for a reasonable timeout
+when receiving fragmented traffic from distant peers.
.SH "DATA AND METADATA COHERENCE"
Some modern cluster file systems provide
perfect cache coherence among their clients.
--
1.7.7.6
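
The wraparound figures quoted in the new man page text can be checked with
some rough arithmetic. The sketch below is not part of the patch. It
assumes one IP ID is consumed per UDP datagram, that the link is saturated
by NFS traffic alone, and that header overhead can be ignored. The function
names and the 8192-byte datagram size are illustrative choices, not
anything taken from nfs-utils or the kernel.

```python
# Sketch only: estimates how quickly the 16-bit IP ID space wraps.
# Assumptions (not from the patch): one IP ID per UDP datagram, a link
# saturated by NFS traffic, and no allowance for protocol headers.

IP_ID_SPACE = 2 ** 16  # number of distinct 16-bit IP ID values


def ip_id_wrap_seconds(link_bits_per_sec: float, datagram_bytes: int) -> float:
    """Return the seconds until IP IDs repeat on a saturated link."""
    datagrams_per_sec = link_bits_per_sec / (datagram_bytes * 8)
    return IP_ID_SPACE / datagrams_per_sec


def reassembly_at_risk(wrap_seconds: float, ipfrag_time: float) -> bool:
    """True if an IP ID can be reused while old fragments are still queued."""
    return wrap_seconds < ipfrag_time


if __name__ == "__main__":
    # Compare both link speeds against the default (30 s) and the
    # suggested (2 s) reassembly timeout from the man page text.
    for name, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
        wrap = ip_id_wrap_seconds(rate, 8192)
        for timeout in (30, 2):
            risk = reassembly_at_risk(wrap, timeout)
            print(f"{name}: wrap after {wrap:.1f}s, "
                  f"ipfrag_time={timeout}s, at risk: {risk}")
```

Under these assumptions a Gigabit link wraps the IP ID space in roughly
4.3 seconds, which is in line with the "about 5 seconds" observed, while a
100Mbit/s link takes roughly 43 seconds and so stays above the 30 second
default. Smaller datagrams consume IP IDs faster, so real traffic may wrap
sooner than this estimate.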