Performance tuning on modern 2.6 series kernels

* Performance tuning on modern 2.6 series kernels
@ 2007-06-13 17:45 Bill Johnstone
  2007-06-13 18:58 ` Trond Myklebust
  2007-06-14 18:00 ` Chuck Lever
  0 siblings, 2 replies; 4+ messages in thread
From: Bill Johnstone @ 2007-06-13 17:45 UTC (permalink / raw)
  To: nfs

Hello all.

I've been trying to understand performance tuning for NFSv3 on Linux
2.6 kernels (say 2.6.20 and up), and I just wound up uncertain, with a
lot of questions.  Additionally, the Linux-NFS documentation for
performance tuning at http://nfs.sourceforge.net/nfs-howto/ar01s05.html
has a lot of discussion regarding 2.4 series kernels, which doesn't
really seem that relevant these days.  It would be great if there was
more discussion of 2.6-related information (if so desired, I could
submit some changes to the docs based on what I hear back).  Anyway,
here's what I would really appreciate some clarification on:

1. TCP vs. UDP.  Assuming a high-bandwidth switched network, with
enough disk and network bandwidth on the storage server, and
rsize/wsize (for both) less than the MTU, UDP should have less overhead
and be faster than TCP, correct?

2. Large rsize/wsize, and TCP vs. UDP again.  I understand that if the
rsize/wsize is larger than the MTU, a UDP packet containing NFS
read/write data will be fragmented into multiple packets.  However, all
the documentation in this case seems to imply that even though the
same MTU restriction applies to TCP, TCP will be faster in this case
than the fragmented UDP.  Why is this?  Doesn't TCP need to "fragment"
the NFS payload as well?

3. Socket buffers.  Many references mention changing
/proc/sys/net/core/{r,w}mem_default and
/proc/sys/net/core/{r,w}mem_max
but documents such as NetApp's "Using the Linux NFS Client with Network
Appliance Filers" ( http://www.netapp.com/library/tr/3183.pdf ) mention
that this is not necessary in 2.6 kernels, as they "auto-tune" socket
buffers.  However, even if we needn't change the default values, is
there any benefit to increasing the rmem_max / wmem_max parameters?

4. tcp_window_scaling, tcp_timestamps, and tcp_sack .  There is an IBM
Linux NFS performance tuning article at
http://www.ibm.com/developerworks/eserver/library/es-033104.html which
mentions turning off all three of these features, and presumably the
tcp_window_scaling causes them to see an effective MSS less than what
they get without it.  Is there merit to turning these features off
today, on modern 2.6 kernels?

Thank you very much for any enlightenment you provide.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

* Re: Performance tuning on modern 2.6 series kernels
  2007-06-13 17:45 Performance tuning on modern 2.6 series kernels Bill Johnstone
@ 2007-06-13 18:58 ` Trond Myklebust
  2007-06-14 18:00 ` Chuck Lever
  1 sibling, 0 replies; 4+ messages in thread
From: Trond Myklebust @ 2007-06-13 18:58 UTC (permalink / raw)
  To: Bill Johnstone; +Cc: nfs

On Wed, 2007-06-13 at 10:45 -0700, Bill Johnstone wrote:
> Hello all.
> 
> I've been trying to understand performance tuning for NFSv3 on Linux
> 2.6 kernels (say 2.6.20 and up), and I just wound up uncertain, with a
> lot of questions.  Additionally, the Linux-NFS documentation for
> performance tuning at http://nfs.sourceforge.net/nfs-howto/ar01s05.html
> has a lot of discussion regarding 2.4 series kernels, which doesn't
> really seem that relevant these days.  It would be great if there was
> more discussion of 2.6-related information (if so desired, I could
> submit some changes to the docs based on what I hear back).  Anyway,
> here's what I would really appreciate some clarification on:
> 
> 1. TCP vs. UDP.  Assuming a high-bandwidth switched network, with
> enough disk and network bandwidth on the storage server, and
> rsize/wsize (for both) less than the MTU, UDP should have less overhead
> and be faster than TCP, correct?

Only if the network is reliable. On lossy networks, TCP is infinitely
preferable.

> 2. Large rsize/wsize, and TCP vs. UDP again.  I understand that if the
> rsize/wsize is larger than the MTU, a UDP packet containing NFS
> read/write data will be fragmented into multiple packets.  However, all
> the documentation in this case seems to imply that even though the
> same MTU restriction applies to TCP, TCP will be faster in this case
> than the fragmented UDP.  Why is this?  Doesn't TCP need to "fragment"
> the NFS payload as well?

Yes, but it doesn't need to retransmit the entire message if just one
fragment gets lost. Normally, it will just retransmit the one fragment.
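
To put rough numbers on that point, here is a minimal Python sketch. It
assumes IPv4 with a 1500-byte MTU, independent per-packet loss, and a
32 KB rsize/wsize; none of those figures come from this thread, and the
function names are made up for illustration.

```python
import math

IP_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8


def fragments_needed(payload_bytes: int, mtu: int = 1500) -> int:
    """IP fragments required to carry one UDP datagram of this payload size."""
    per_fragment = mtu - IP_HEADER  # 1480 for a 1500-byte MTU, a multiple of 8
    return math.ceil((payload_bytes + UDP_HEADER) / per_fragment)


def udp_datagram_loss(per_packet_loss: float, fragments: int) -> float:
    """Chance that at least one fragment is lost, which loses the whole datagram."""
    return 1.0 - (1.0 - per_packet_loss) ** fragments


if __name__ == "__main__":
    frags = fragments_needed(32768)
    loss = udp_datagram_loss(0.001, frags)
    print(f"{frags} fragments, datagram loss probability {loss:.4f}")
```

Under these assumptions a 0.1% packet loss rate becomes a loss rate of
roughly 2% for whole 32 KB requests over UDP, and each loss costs a full
RPC retransmit after a timeout. TCP would normally resend only the one
missing segment.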

> 3. Socket buffers.  Many references mention changing
> /proc/sys/net/core/{r,w}mem_default and
> /proc/sys/net/core/{r,w}mem_max
> but documents such as NetApp's "Using the Linux NFS Client with Network
> Appliance Filers" ( http://www.netapp.com/library/tr/3183.pdf ) mention
> that this is not necessary in 2.6 kernels, as they "auto-tune" socket
> buffers.  However, even if we needn't change the default values, is
> there any benefit to increasing the rmem_max / wmem_max parameters?

No. The RPC layer overrides them.

> 4. tcp_window_scaling, tcp_timestamps, and tcp_sack .  There is an IBM
> Linux NFS performance tuning article at
> http://www.ibm.com/developerworks/eserver/library/es-033104.html which
> mentions turning off all three of these features, and presumably the
> tcp_window_scaling causes them to see an effective MSS less than what
> they get without it.  Is there merit to turning these features off
> today, on modern 2.6 kernels?

Those were in a section on "Web applications", not NFS AFAICS. Their
main reason for turning off window scaling is presumably to work around
the "broken router" issue??? (see http://lwn.net/Articles/92727/)
Otherwise, given their choices of tcp_rmem/tcp_wmem/tcp_mem, the
decision to turn off window scaling seems very odd...

See http://www.psc.edu/networking/projects/tcptune/ for a better
explanation of these 3 parameters.

NFS will usually want to allow receive buffers of > 64k, so you are
probably better off leaving window scaling on.
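
The 64k figure comes from the 16-bit window field in the TCP header.
As a rough sketch (the link speeds and round-trip times below are
illustrative assumptions, not measurements from this thread), the
bandwidth-delay product shows when an unscaled window becomes the
bottleneck:

```python
MAX_UNSCALED_WINDOW = 65535  # largest value the 16-bit TCP window field can hold


def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the link busy for one round trip."""
    return bits_per_second / 8 * rtt_seconds


def needs_window_scaling(bits_per_second: float, rtt_seconds: float) -> bool:
    """True when the pipe holds more data than an unscaled window can advertise."""
    return bandwidth_delay_product(bits_per_second, rtt_seconds) > MAX_UNSCALED_WINDOW


if __name__ == "__main__":
    for label, bps, rtt in (("100 Mbit/s, 1 ms", 100e6, 0.001),
                            ("1 Gbit/s, 1 ms", 1e9, 0.001)):
        bdp = bandwidth_delay_product(bps, rtt)
        print(f"{label}: BDP {bdp:.0f} bytes, scaling needed: "
              f"{needs_window_scaling(bps, rtt)}")
```

With these example numbers, gigabit Ethernet at a 1 ms round trip
already needs more than 64k in flight, so turning window scaling off
would cap throughput below line rate.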

tcp_timestamps should probably also be left on, since you really do want
good estimation of the RTT and protection against sequence number
wraparound.

Finally, you should probably keep SACK on too if you are using window
scaling.
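
To check what a given machine is currently doing, the three settings
can be read from /proc. This is a small sketch assuming the usual
Linux location /proc/sys/net/ipv4; the function name and the choice to
report unreadable entries as None are illustrative.

```python
from pathlib import Path

TCP_SETTINGS = ("tcp_window_scaling", "tcp_timestamps", "tcp_sack")


def read_tcp_settings(base: str = "/proc/sys/net/ipv4") -> dict:
    """Return the integer value of each setting, or None if it cannot be read."""
    values = {}
    for name in TCP_SETTINGS:
        try:
            values[name] = int((Path(base) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None  # file missing, unreadable, or not a number
    return values


if __name__ == "__main__":
    for name, value in read_tcp_settings().items():
        state = "unknown" if value is None else ("off" if value == 0 else "on")
        print(f"{name}: {state}")
```

Any entry reported as "off" is worth a second look, given the
recommendation above to leave all three enabled.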


* Re: Performance tuning on modern 2.6 series kernels
  2007-06-13 17:45 Performance tuning on modern 2.6 series kernels Bill Johnstone
  2007-06-13 18:58 ` Trond Myklebust
@ 2007-06-14 18:00 ` Chuck Lever
  1 sibling, 0 replies; 4+ messages in thread
From: Chuck Lever @ 2007-06-14 18:00 UTC (permalink / raw)
  To: Bill Johnstone; +Cc: nfs

Bill Johnstone wrote:
> 1. TCP vs. UDP.  Assuming a high-bandwidth switched network, with
> enough disk and network bandwidth on the storage server, and
> rsize/wsize (for both) less than the MTU, UDP should have less overhead
> and be faster than TCP, correct?

The overhead of TCP ACKs and extra TCP header fields does cause some
slowdown, but how much depends on network latency, host interrupt
latency, and host processor speed on the server and client.  The
slowdown was measured at about 5% several years ago on gigahertz-class
Pentium III processors, but is probably much less these days.
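
For the header part of that overhead alone, a back-of-the-envelope
sketch is below. It assumes IPv4 without options and a 1500-byte MTU,
and treats the TCP timestamp option as 12 bytes per segment (10 bytes
plus padding). It ignores ACK traffic and CPU cost entirely, so it is
only a lower bound on the difference.

```python
IP_HEADER = 20              # IPv4 header without options
UDP_HEADER = 8
TCP_HEADER = 20             # TCP header without options
TCP_TIMESTAMP_OPTION = 12   # 10-byte option padded to a 4-byte boundary


def payload_fraction(mtu: int, transport_header: int) -> float:
    """Fraction of each full-sized packet that carries application data."""
    return (mtu - IP_HEADER - transport_header) / mtu


if __name__ == "__main__":
    mtu = 1500
    cases = (("UDP", UDP_HEADER),
             ("TCP", TCP_HEADER),
             ("TCP + timestamps", TCP_HEADER + TCP_TIMESTAMP_OPTION))
    for label, header in cases:
        print(f"{label}: {payload_fraction(mtu, header):.2%} payload")
```

On these assumptions the gap in wire efficiency between UDP and TCP
with timestamps is under two percentage points.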

In almost any real world network, TCP will win over UDP because it is 
properly designed to recover from network losses due to congestion and 
link speed mismatches.  Even the "perfect" network you describe can 
suffer packet losses from bus and buffer overruns and hardware errors.

In addition, the NFSv4 spec requires the use of a reliable transport 
protocol, which UDP is not.  So for NFSv4, UDP isn't an option.

In other words, TCP is the only choice for any modern NFS deployment.

> 2. Large rsize/wsize, and TCP vs. UDP again.  I understand that if the
> rsize/wsize is larger than the MTU, a UDP packet containing NFS
> read/write data will be fragmented into multiple packets.  However, all
> the documentation in this case seems to imply that even though the
> same MTU restriction applies to TCP, TCP will be faster in this case
> than the fragmented UDP.  Why is this?  Doesn't TCP need to "fragment"
> the NFS payload as well?

TCP does break data into MTU-sized chunks, but has proper management of 
the chunks using 32-bit TCP sequence numbers.  Fragmented UDP relies 
only on the 16-bit ID field in the IP header for reassembly, and that 
field is known to wrap in many common real-world situations.  IP ID 
wrapping can cause fragment misassembly, which results in either a bad 
checksum on reassembly or corrupt file data.
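
A quick estimate of how fast the ID field can wrap, as a sketch only:
it assumes one sender saturating a link with fixed-size datagrams to a
single peer, ignores header bytes, and compares against a 30-second
fragment reassembly timeout (believed to be the Linux ipfrag_time
default, but treat that as an assumption).

```python
IP_ID_SPACE = 2 ** 16  # distinct values of the 16-bit IP ID field


def ip_id_wrap_seconds(link_bits_per_second: float, datagram_bytes: int) -> float:
    """Seconds until the IP ID repeats when the link is kept full."""
    datagrams_per_second = link_bits_per_second / 8 / datagram_bytes
    return IP_ID_SPACE / datagrams_per_second


if __name__ == "__main__":
    reassembly_timeout = 30.0  # seconds, assumed default
    for size in (8192, 32768):
        wrap = ip_id_wrap_seconds(1e9, size)
        risky = wrap < reassembly_timeout
        print(f"{size}-byte datagrams at 1 Gbit/s: wrap in {wrap:.1f} s, "
              f"inside reassembly window: {risky}")
```

Under these assumptions the ID wraps well inside the reassembly window
on gigabit Ethernet, which is how a stale fragment from an old datagram
can be matched with fragments of a new one.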

TCP was designed as a *reliable* transport protocol, and thus provides 
many guarantees that the data that leaves one host is the same as the 
data that arrives at the receiving end.  UDP makes no guarantees about 
data reliability.

* Re: Performance tuning on modern 2.6 series kernels
@ 2007-06-15  4:17 Bill Johnstone
  0 siblings, 0 replies; 4+ messages in thread
From: Bill Johnstone @ 2007-06-15  4:17 UTC (permalink / raw)
  To: nfs

--- Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2007-06-13 at 10:45 -0700, Bill Johnstone wrote:

> > 4. tcp_window_scaling, tcp_timestamps, and tcp_sack .  There is an IBM
> > Linux NFS performance tuning article at
> > http://www.ibm.com/developerworks/eserver/library/es-033104.html which
> > mentions turning off all three of these features, and presumably the
> > tcp_window_scaling causes them to see an effective MSS less than what
> > they get without it.  Is there merit to turning these features off
> > today, on modern 2.6 kernels?
> 
> Those were in a section on "Web applications", not NFS AFAICS. Their
> main reason for turning off window scaling is presumably to work around
> the "broken router" issue??? (see http://lwn.net/Articles/92727/)
> Otherwise, given their choices of tcp_rmem/tcp_wmem/tcp_mem, the
> decision to turn off window scaling seems very odd...

Yes, I was mistaken about the overall intent of the article.  It was
on generic "performance optimization" but came up as a relevant hit in
a Google search for NFS Linux performance tuning.

They are not very clear about the specific reasons for turning each
feature off, but they indicate that having them on reduced their MSS
below the maximum (and expected) value for 1500-byte MTU Ethernet.
Perhaps, as you say, this is an implementation artifact of the TCP
window scaling "broken router" workaround.  Either way, it seems like
in a modern implementation, we should leave these features on.

end of thread, other threads:[~2007-06-15  4:17 UTC | newest]

Thread overview: 4+ messages
2007-06-13 17:45 Performance tuning on modern 2.6 series kernels Bill Johnstone
2007-06-13 18:58 ` Trond Myklebust
2007-06-14 18:00 ` Chuck Lever
  -- strict thread matches above, loose matches on Subject: below --
2007-06-15  4:17 Bill Johnstone
