Re: [PATCH] nfs-utils: Add a warning to the nfs manpage regarding using NFS over UDP on high-speed links

linux-nfs.vger.kernel.org archive mirror

From: Steve Dickson <SteveD@redhat.com>
To: Harshula Jayasuriya <harshula@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Chuck Lever <chuck.lever@oracle.com>, Olaf Kirch <okir@suse.de>
Subject: Re: [PATCH] nfs-utils: Add a warning to the nfs manpage regarding using NFS over UDP on high-speed links
Date: Wed, 09 May 2012 14:14:37 -0400
Message-ID: <4FAAB40D.50503@RedHat.com>
In-Reply-To: <1336525164.21032.9.camel@serendib>



On 05/08/2012 08:59 PM, Harshula Jayasuriya wrote:
> * Using NFS over UDP on high-speed links such as Gigabit can cause
>   silent data corruption.
> * The man page text was written by Olaf Kirch and committed to openSUSE
>   (but not upstream):
> https://build.opensuse.org/package/view_file?file=warn-nfs-udp.patch&package=nfs-utils&project=openSUSE%3AFactory&rev=8e3e60c70e8270cd4afa036e13f6b2bb
> 
> Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
> Acked-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Olaf Kirch <okir@suse.com>
Committed...

steved.
> ---
>  utils/mount/nfs.man |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 81 insertions(+), 0 deletions(-)
> 
> diff --git a/utils/mount/nfs.man b/utils/mount/nfs.man
> index 0d20cf0..87e27e1 100644
> --- a/utils/mount/nfs.man
> +++ b/utils/mount/nfs.man
> @@ -500,6 +500,8 @@ Specifying a netid that uses TCP forces all traffic from the
>  command and the NFS client to use TCP.
>  Specifying a netid that uses UDP forces all traffic types to use UDP.
>  .IP
> +.B Before using NFS over UDP, refer to the TRANSPORT METHODS section.
> +.IP
>  If the
>  .B proto
>  mount option is not specified, the
> @@ -514,6 +516,8 @@ The
>  option is an alternative to specifying
>  .BR proto=udp.
>  It is included for compatibility with other operating systems.
> +.IP
> +.B Before using NFS over UDP, refer to the TRANSPORT METHODS section.
>  .TP 1.5i
>  .B tcp
>  The
> @@ -1070,6 +1074,83 @@ or
>  options are specified more than once on the same mount command line,
>  then the value of the rightmost instance of each of these options
>  takes effect.
> +.SS "Using NFS over UDP on high-speed links"
> +Using NFS over UDP on high-speed links such as Gigabit
> +.BR "can cause silent data corruption" .
> +.P
> +The problem can be triggered at high loads, and is caused by problems in
> +IP fragment reassembly. NFS reads and writes typically transmit UDP packets
> +of 4 Kilobytes or more, which have to be broken up into several fragments
> +in order to be sent over the Ethernet link, which limits packets to 1500
> +bytes by default. This process happens at the IP network layer and is
> +called fragmentation.
> +.P
> +In order to identify fragments that belong together, IP assigns a 16-bit
> +.I IP ID
> +value to each packet; fragments generated from the same UDP packet
> +will have the same IP ID. The receiving system will collect these
> +fragments and combine them to form the original UDP packet. This process
> +is called reassembly. The default timeout for packet reassembly is
> +30 seconds; if the network stack does not receive all fragments of
> +a given packet within this interval, it assumes the missing fragment(s)
> +got lost and discards those it already received.
> +.P
> +The problem this creates over high-speed links is that it is possible
> +to send more than 65536 packets within 30 seconds. In fact, with
> +heavy NFS traffic one can observe that the IP IDs repeat after about
> +5 seconds.
> +.P
> +This has serious effects on reassembly: if one fragment gets lost,
> +another fragment
> +.I from a different packet
> +but with the
> +.I same IP ID
> +will arrive within the 30 second timeout, and the network stack will
> +combine these fragments to form a new packet. Most of the time, network
> +layers above IP will detect this mismatched reassembly - in the case
> +of UDP, the UDP checksum, which is a 16-bit checksum over the entire
> +packet payload, will usually not match, and UDP will discard the
> +bad packet.
> +.P
> +However, the UDP checksum is only 16 bits wide, so there is a chance of 1 in
> +65536 that it will match even if the packet payload is completely
> +random (which very often isn't the case). If the checksum does match,
> +silent data corruption will occur.
> +.P
> +This potential should be taken seriously, at least on Gigabit
> +Ethernet.
> +Network speeds of 100Mbit/s should be considered less
> +problematic, because with most traffic patterns IP ID wrap around
> +will take much longer than 30 seconds.
> +.P
> +It is therefore strongly recommended to use
> +.BR "NFS over TCP where possible" ,
> +since TCP does not perform fragmentation.
> +.P
> +If you absolutely have to use NFS over UDP over Gigabit Ethernet,
> +some steps can be taken to mitigate the problem and reduce the
> +probability of corruption:
> +.TP +1.5i
> +.I Jumbo frames:
> +Many Gigabit network cards are capable of transmitting
> +frames bigger than the 1500 byte limit of traditional Ethernet, typically
> +9000 bytes. Using jumbo frames of 9000 bytes will allow you to run NFS over
> +UDP at a page size of 8K without fragmentation. Of course, this is
> +only feasible if all involved stations support jumbo frames.
> +.IP
> +To enable a machine to send jumbo frames on cards that support it,
> +it is sufficient to configure the interface for an MTU value of 9000.
> +.TP +1.5i
> +.I Lower reassembly timeout:
> +By lowering this timeout below the time it takes the IP ID counter
> +to wrap around, incorrect reassembly of fragments can be prevented
> +as well. To do so, simply write the new timeout value (in seconds)
> +to the file
> +.BR /proc/sys/net/ipv4/ipfrag_time .
> +.IP
> +A value of 2 seconds will greatly reduce the probability of IP ID clashes on
> +a single Gigabit link, while still allowing for a reasonable timeout
> +when receiving fragmented traffic from distant peers.
>  .SH "DATA AND METADATA COHERENCE"
>  Some modern cluster file systems provide
>  perfect cache coherence among their clients.
> -- 1.7.7.6
>
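
For anyone wanting to sanity-check the figures in the man page text above,
here is a rough back-of-the-envelope sketch in Python. It is not part of the
patch. The function names are made up for illustration, and the model is
deliberately simple: an IPv4 header without options, one IP ID consumed per
UDP datagram from a single counter, and a link saturated by NFS payload alone
(RPC and NFS header overhead is ignored).

```python
import math

IP_HEADER = 20          # bytes; IPv4 header without options (assumption)
UDP_HEADER = 8          # bytes
IP_ID_SPACE = 2 ** 16   # the IP ID field is 16 bits wide


def fragments_per_datagram(udp_payload, mtu=1500):
    """Number of IP fragments needed to carry one UDP datagram."""
    ip_payload = udp_payload + UDP_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


def ip_id_wrap_seconds(link_bits_per_sec, udp_payload):
    """Rough time for the IP ID counter to wrap at full line rate."""
    datagrams_per_sec = link_bits_per_sec / 8 / udp_payload
    return IP_ID_SPACE / datagrams_per_sec


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: 8K datagram ->",
              fragments_per_datagram(8192, mtu), "fragment(s)")
    for name, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
        print(f"{name}: IP ID wrap after about",
              round(ip_id_wrap_seconds(rate, 8192), 1), "s")
```

Under these assumptions an 8K datagram needs 6 fragments at an MTU of 1500
and a single frame at an MTU of 9000, and the IP ID counter wraps after
roughly 4 seconds on Gigabit versus roughly 43 seconds at 100Mbit/s. That is
in line with the "about 5 seconds" observation and with the 30 second
reassembly timeout being the critical threshold.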

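The "lower reassembly timeout" mitigation in the patch text amounts to writing
a number of seconds to /proc/sys/net/ipv4/ipfrag_time. A minimal sketch of
that step in Python follows. The helper names are hypothetical, writing the
real file requires root, and the change does not persist across reboots
unless it is also configured through sysctl.

```python
from pathlib import Path

# Path named in the man page text; the helpers below are illustrative only.
IPFRAG_TIME = Path("/proc/sys/net/ipv4/ipfrag_time")


def read_ipfrag_time(path=IPFRAG_TIME):
    """Return the current IP fragment reassembly timeout in seconds."""
    return int(Path(path).read_text().strip())


def set_ipfrag_time(seconds, path=IPFRAG_TIME):
    """Write a new reassembly timeout (whole seconds, must be positive)."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive whole number of seconds")
    Path(path).write_text(f"{seconds}\n")


# Example (as root), using the 2 second value suggested in the text:
#   set_ipfrag_time(2)
#   print(read_ipfrag_time())
```

The path is a parameter so the helpers can be tried against a scratch file
before touching the live setting.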

Thread overview: 16+ messages
2012-02-28  5:22 "Using NFS over UDP on high-speed links such as Gigabit can cause silent data corruption." Harshula
2012-02-28 11:52 ` Jeff Layton
2012-02-28 12:32   ` Harshula
2012-02-28 12:41     ` Jeff Layton
2012-03-05  1:56       ` Harshula
2012-02-28 12:46   ` Jim Rees
2012-02-28 12:57     ` Jeff Layton
2012-02-28 14:35     ` Chuck Lever
2012-02-28 15:09       ` Jim Rees
2012-02-28 15:50       ` Chuck Lever
2012-03-05  2:17         ` Harshula
2012-03-05 15:08           ` Chuck Lever
2012-05-09  0:59             ` [PATCH] nfs-utils: Add a warning to the nfs manpage regarding using NFS over UDP on high-speed links Harshula Jayasuriya
2012-05-09 18:14               ` Steve Dickson [this message]
2012-05-09 18:38               ` Peter Staubach
2012-05-09 22:16                 ` Harshula
