bug in linux mount? (says NetApp)

* bug in linux mount? (says NetApp)
@ 2006-07-11 19:00 Gregory Baker
  2006-07-11 20:21 ` Chuck Lever
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Gregory Baker @ 2006-07-11 19:00 UTC (permalink / raw)
  To: nfs; +Cc: autofs

We have thousands of linux clients hitting netapp file servers (many 
3500 series, clustered) on a local gigabit LAN.  From time to time, 
applications return "file not found" when attempting to automount a 
directory and access a file.  An example of this is a long running 
process, which reads in data, processes it for hours (in which time the 
filesystem is unmounted) then tries to read more data from that mount 
point (which causes a "file not found" error in the application).  This 
occurs about 1/100th of the time.

Researching at Netapp turns up this bit by Chuck Lever (Linux NFS 
contributor)

"Using the Linux NFS Client with Network Appliance Filers"
http://www.netapp.com/library/tr/3183.pdf  (February 2006)

page 10 says...

"Due to a bug in the mount command, the default retransmission timeout 
value on Linux for NFS over TCP is quite small...To obtain standard 
behavior, we strongly recommend using "timeo=600, retrans=2" explicitly 
when mounting via TCP."

Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3) 
would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths 
of a second (10 seconds).  It appears netapp is suggesting waiting 
600+600 = 1200 tenths (120 seconds) before giving up on the mount command...
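
For reference, the arithmetic above can be reproduced with a small shell
sketch.  It assumes the simple model used in this message, where the wait
doubles after every retransmission and is never capped.  The function name
backoff_total_tenths is made up for illustration, and the in-kernel client
may well back off differently.

```shell
# Sum the waits for one initial send plus `retrans` retries, doubling the
# timeout each time. Units are tenths of a second, as with the timeo option.
# Assumption: pure doubling with no upper cap, matching the arithmetic above.
backoff_total_tenths() {
    timeo=$1
    retrans=$2
    total=0
    i=0
    while [ "$i" -le "$retrans" ]; do
        total=$((total + timeo))
        timeo=$((timeo * 2))
        i=$((i + 1))
    done
    echo "$total"
}

backoff_total_tenths 7 3    # 7+14+28+56
```

With timeo=7 and retrans=3 this prints 105, i.e. 10.5 seconds.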

* What "bug" in the mount command do you believe NetApp is talking about?

* What do you think proper options for NFS auto/mounts would be for 
extremely busy centralized NFS filers?

* What is the reference standard behavior?

Thanks,

--Greg

-- 
----------------------------------------------------------------------
Greg Baker                                         512-602-3287 (work)
gregory.baker@amd.com                              512-602-6970 (fax)
5900 E. Ben White Blvd MS 626                      512-555-1212 (info)
Austin, TX 78741

-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug in linux mount? (says NetApp)
  2006-07-11 19:00 bug in linux mount? (says NetApp) Gregory Baker
@ 2006-07-11 20:21 ` Chuck Lever
  2006-07-14 20:36   ` Gregory Baker
  2006-07-11 23:27 ` [NFS] " Trond Myklebust
  2006-07-12  0:40 ` Blake Golliher
  2 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2006-07-11 20:21 UTC (permalink / raw)
  To: gregory.baker; +Cc: autofs, nfs

On 7/11/06, Gregory Baker <gregory.baker@amd.com> wrote:
> We have thousands of linux clients hitting netapp file servers (many
> 3500 series, clustered) on a local gigabit LAN.  From time to time,
> applications return "file not found" when attempting to automount a
> directory and access a file.  An example of this is a long running
> process, which reads in data, processes it for hours (in which time the
> filesystem is unmounted) then tries to read more data from that mount
> point (which causes a "file not found" error in the application).  This
> occurs about 1/100th of the time.
>
> Researching at Netapp turns up this bit by Chuck Lever (Linux NFS
> contributor)
>
> "Using the Linux NFS Client with Network Appliance Filers"
> http://www.netapp.com/library/tr/3183.pdf  (February 2006)
>
> page 10 says...
>
> "Due to a bug in the mount command, the default retransmission timeout
> value on Linux for NFS over TCP is quite small...To obtain standard
> behavior, we strongly recommend using "timeo=600, retrans=2" explicitly
> when mounting via TCP."
>
> Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3)
> would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths
> of a second (10 seconds).  It appears netapp is suggesting waiting
> 600+600 = 1200 tenths (120 seconds) before giving up on the mount command...

It's important to distinguish two different types of timeouts.

1.  The mount operation has timed out.

2.  After the mount operation succeeds, an NFS RPC operation has timed out.

TR-3183 discusses the proper settings for 2, but you are experiencing 1.

The automounter attempts to mount one of the filer's exports, but the
mount request times out causing the mounted-on directory to be
exposed.  Your filer is heavily loaded, and the filer's mountd is
single-threaded.  The filer may also be experiencing delays when
requesting information from external servers (like DNS or NIS), in
which case the mount request is held up at the filer.

Both sides are at fault:  the Linux mount command should retry (and I
believe later releases of RHEL 3 were fixed to do this) and the filer
configuration should be reviewed to make sure there are no avoidable
delays while processing mount requests.
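
On a client whose mount command does not retry by itself, a wrapper along
these lines could serve as a stopgap for manual or scripted mounts.  This is
only a sketch: the function name retry, the attempt count and the delay are
assumptions, and it does not change how the automounter itself invokes mount.

```shell
# Retry a command up to `attempts` times, sleeping `delay` seconds between
# tries. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep when another attempt is still to come.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Hypothetical usage (server and paths are placeholders, not from this thread):
#   retry 5 2 mount -t nfs -o tcp filer:/vol/vol0 /mnt/vol0
```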

> * What "bug" in the mount command do you believe NetApp is talking about?

The bug is that the mount command overrides the proper default RPC
timeout value with a timeout value of 0.7 seconds.  This is *not* the
timeout for mount operations, it is the timeout for the in-kernel NFS
client to retransmit RPC requests.

> * What do you think proper options for NFS auto/mounts would be for
> extremely busy centralized NFS filers?

If you are using NFS over TCP, the proper timeout value is 60 seconds.

> * What is the reference standard behavior?

Solaris, which is the NFSv3 reference implementation, uses effectively
a 60 second timeout on TCP mounts.

-- 
"We who cut mere stones must always be envisioning cathedrals"
   -- Quarry worker's creed

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [NFS] bug in linux mount? (says NetApp)
  2006-07-11 19:00 bug in linux mount? (says NetApp) Gregory Baker
  2006-07-11 20:21 ` Chuck Lever
@ 2006-07-11 23:27 ` Trond Myklebust
  2006-07-11 23:34   ` Gregory Baker
                     ` (2 more replies)
  2006-07-12  0:40 ` Blake Golliher
  2 siblings, 3 replies; 10+ messages in thread
From: Trond Myklebust @ 2006-07-11 23:27 UTC (permalink / raw)
  To: gregory.baker; +Cc: autofs, nfs

On Tue, 2006-07-11 at 14:00 -0500, Gregory Baker wrote:
> We have thousands of linux clients hitting netapp file servers (many 
> 3500 series, clustered) on a local gigabit LAN.  From time to time, 
> applications return "file not found" when attempting to automount a 
> directory and access a file.  An example of this is a long running 
> process, which reads in data, processes it for hours (in which time the 
> filesystem is unmounted) then tries to read more data from that mount 
> point (which causes a "file not found" error in the application).  This 
> occurs about 1/100th of the time.
> 
> Researching at Netapp turns up this bit by Chuck Lever (Linux NFS 
> contributor)
> 
> "Using the Linux NFS Client with Network Appliance Filers"
> http://www.netapp.com/library/tr/3183.pdf  (February 2006)
> 
> page 10 says...
> 
> "Due to a bug in the mount command, the default retransmission timeout 
> value on Linux for NFS over TCP is quite small...To obtain standard 
> behavior, we strongly recommend using "timeo=600, retrans=2" explicitly 
> when mounting via TCP."
> 
> Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3) 
> would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths 
> of a second (10 seconds).  It appears netapp is suggesting waiting 
> 600+600 = 1200 tenths (120 seconds) before giving up on the mount command...

No they are not. See below.

> * What "bug" in the mount command do you believe NetApp is talking about?

It has nothing to do with the mount timeout: Chuck is talking about the
retransmission timeout for TCP connections 'timeo' which should indeed
be set to a high value since TCP guarantees message delivery (unlike UDP
which requires a small timeo value). Setting it too low means that you
end up spamming your server with a load of unnecessary retransmissions.

This was indeed the case for some older versions of 'mount' and also for
older versions of the am-utils/amd automounters.

> * What do you think proper options for NFS auto/mounts would be for 
> extremely busy centralized NFS filers?

Something like

mount -t nfs -ohard,timeo=600,retrans=2,rsize=32768,wsize=32768,tcp foo:/ /bar

should be a fairly safe bet. You might want to add the 'intr' flag too,
depending on how you feel about the behaviour w.r.t. pressing ^C.
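
To keep those options consistent across scripts, the suggestion above could
be wrapped roughly as follows.  This is a sketch only: the function names and
the INTR switch are invented for illustration, and the command is printed
rather than executed so it can be reviewed first.

```shell
# Build the option string suggested above. Set INTR=1 to append the
# optional 'intr' flag.
nfs_tcp_opts() {
    opts="hard,timeo=600,retrans=2,rsize=32768,wsize=32768,tcp"
    if [ "${INTR:-0}" = "1" ]; then
        opts="$opts,intr"
    fi
    echo "$opts"
}

# Print (do not run) the mount command for an export ($1) and mount point ($2).
show_mount_cmd() {
    echo "mount -t nfs -o $(nfs_tcp_opts) $1 $2"
}

show_mount_cmd foo:/ /bar
```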

> * What is the reference standard behavior?

To which reference are you referring?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [NFS] bug in linux mount? (says NetApp)
  2006-07-11 23:27 ` [NFS] " Trond Myklebust
@ 2006-07-11 23:34   ` Gregory Baker
  2006-07-12  3:03   ` [autofs] " Ian Kent
  2006-07-12  9:32   ` James Pearson
  2 siblings, 0 replies; 10+ messages in thread
From: Gregory Baker @ 2006-07-11 23:34 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: autofs, nfs


Thanks Trond!

I was referring to the 'standard' comment from the netapp PDF:

"Due to a bug in the mount command, the default retransmission timeout
value on Linux for NFS over TCP is quite small...To obtain standard
behavior, we strongly recommend using "timeo=600, retrans=2" explicitly
when mounting via TCP."

And was wondering what the 'standard' was.  Chuck politely pointed me to 
Solaris as the NFSv3 reference for 'standard'.

Thanks,

--Greg

-- 
----------------------------------------------------------------------
Greg Baker                                         512-602-3287 (work)
gregory.baker@amd.com                              512-602-6970 (fax)
5900 E. Ben White Blvd MS 626                      512-555-1212 (info)
Austin, TX 78741

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug in linux mount? (says NetApp)
  2006-07-11 19:00 bug in linux mount? (says NetApp) Gregory Baker
  2006-07-11 20:21 ` Chuck Lever
  2006-07-11 23:27 ` [NFS] " Trond Myklebust
@ 2006-07-12  0:40 ` Blake Golliher
  2006-07-12  1:07   ` Gregory Baker
  2 siblings, 1 reply; 10+ messages in thread
From: Blake Golliher @ 2006-07-12  0:40 UTC (permalink / raw)
  To: gregory.baker; +Cc: autofs, nfs

What version of OnTap are you running?

-Blake

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug in linux mount? (says NetApp)
  2006-07-12  0:40 ` Blake Golliher
@ 2006-07-12  1:07   ` Gregory Baker
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Baker @ 2006-07-12  1:07 UTC (permalink / raw)
  To: Blake Golliher; +Cc: autofs, nfs


We have various versions; 3500 and R series are running 7.x, the older 
840/880 are running 6.5.x

  NetApp Release 7.0.2P4: Wed Nov 30 01:52:11 PST 2005 charity.amd.com
  NetApp Release 7.0.3P4: Sat Jan 28 06:03:18 PST 2006 chang.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 diego.amd.com
  NetApp Release 7.0.3P4: Sat Jan 28 06:03:18 PST 2006 eng.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 escher.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 faith.amd.com
  NetApp Release 6.5.6P10D3: Thu Apr 13 22:13:21 PDT 2006 fortitud.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 frida.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 hope.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 justice.amd.com
  NetApp Release 7.0.3P4: Sat Jan 28 06:03:18 PST 2006 kuching.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 okeeffe.amd.com
  NetApp Release 7.0.2P4: Wed Nov 30 01:52:11 PST 2005 prudence.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 raffael.amd.com
  NetApp Release 7.0.3P4: Sat Jan 28 06:03:18 PST 2006 siam.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 sobriety.amd.com
  NetApp Release 6.5.5P7: Thu Jul 21 02:10:50 PDT 2005 vangogh.amd.com
  NetApp Release 7.2RC1: Wed Feb 15 00:46:09 PST 2006 jacen.amd.com
  NetApp Release 7.2RC1: Wed Feb 15 00:46:09 PST 2006 jaina.amd.com

--Greg

-- 
----------------------------------------------------------------------
Greg Baker                                         512-602-3287 (work)
gregory.baker@amd.com                              512-602-6970 (fax)
5900 E. Ben White Blvd MS 626                      512-555-1212 (info)
Austin, TX 78741





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [autofs] Re:  bug in linux mount? (says NetApp)
  2006-07-11 23:27 ` [NFS] " Trond Myklebust
  2006-07-11 23:34   ` Gregory Baker
@ 2006-07-12  3:03   ` Ian Kent
  2006-07-12 12:19     ` Trond Myklebust
  2006-07-12  9:32   ` James Pearson
  2 siblings, 1 reply; 10+ messages in thread
From: Ian Kent @ 2006-07-12  3:03 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: autofs, gregory.baker, nfs

On Tue, 2006-07-11 at 19:27 -0400, Trond Myklebust wrote:
> On Tue, 2006-07-11 at 14:00 -0500, Gregory Baker wrote:
> > We have thousands of linux clients hitting netapp file servers (many 
> > 3500 series, clustered) on a local gigabit LAN.  From time to time, 
> > applications return "file not found" when attempting to automount a 
> > directory and access a file.  An example of this is a long running 
> > process, which reads in data, processes it for hours (in which time the 
> > filesystem is unmounted) then tries to read more data from that mount 
> > point (which causes a "file not found" error in the application).  This 
> > occurs about 1/100th of the time.
> > 
> > Researching at Netapp turns up this bit by Chuck Lever (Linux NFS 
> > contributor)
> > 
> > "Using the Linux NFS Client with Network Appliance Filers"
> > http://www.netapp.com/library/tr/3183.pdf  (February 2006)
> > 
> > page 10 says...
> > 
> > "Due to a bug in the mount command, the default retransmission timeout 
> > value on Linux for NFS over TCP is quite small...To obtain standard 
> > behavior, we strongly recommend using "timeo=600, retrans=2" explicitly 
> > when mounting via TCP."
> > 
> > Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3) 
> > would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths 
> > of a second (10 seconds).  It appears netapp is suggesting waiting 
> > 600+600 = 1200 tenths (120 seconds) before giving up on the mount command...
> 
> No they are not. See below.
> 
> > * What "bug" in the mount command do you believe NetApp is talking about?
> 
> It has nothing to do with the mount timeout: Chuck is talking about the
> retransmission timeout for TCP connections 'timeo' which should indeed
> be set to a high value since TCP guarantees message delivery (unlike UDP
> which requires a small timeo value). Setting it too low means that you
> end up spamming your server with a load of unnecessary retransmissions.
> 
> This was indeed the case for some older versions of 'mount' and also for
> older versions of the am-utils/amd automounters.
> 
> > * What do you think proper options for NFS auto/mounts would be for 
> > extremely busy centralized NFS filers?
> 
> Something like
> 
> mount -t nfs -ohard,timeo=600,retrans=2,rsize=32768,wsize=32768,tcp foo:/ /bar

I thought that the default timeo had changed to 600 (60 secs) for TCP
mounts in later versions of mount (it should be) and that the retrans
shouldn't matter as 60 secs is the RPC major timeout. I thought the
point of this default was to prevent RPC from retransmitting as the TCP
layer would take care of it. 

Trond, what am I missing?

> 
> should be a fairly safe bet. You might want to add the 'intr' flag too,
> depending on how you feel about the behaviour w.r.t. pressing ^C.
> 
> > * What is the reference standard behavior?

I don't think it's a standard so much as common sense.
Certainly, if the defaults are set to values that are sensible for UDP, you
are much more likely to see bogus failures.

Ian



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug in linux mount? (says NetApp)
  2006-07-11 23:27 ` [NFS] " Trond Myklebust
  2006-07-11 23:34   ` Gregory Baker
  2006-07-12  3:03   ` [autofs] " Ian Kent
@ 2006-07-12  9:32   ` James Pearson
  2 siblings, 0 replies; 10+ messages in thread
From: James Pearson @ 2006-07-12  9:32 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: autofs, nfs

Trond Myklebust wrote:
> 
> It has nothing to do with the mount timeout: Chuck is talking about the
> retransmission timeout for TCP connections 'timeo' which should indeed
> be set to a high value since TCP guarantees message delivery (unlike UDP
> which requires a small timeo value). Setting it too low means that you
> end up spamming your server with a load of unnecessary retransmissions.
> 
> This was indeed the case for some older versions of 'mount' and also for
> older versions of the am-utils/amd automounters.

Do you know how to tell what value of timeo is being used by default?

The source code for mount (nfsmount.c, part of util-linux v2.12) has the 
comment:

   /* timeo is filled in after we know whether it'll be TCP or UDP */

Can I assume, in this case, the value of timeo will be a suitable value 
for tcp mounts?
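
One possible way to check after the fact, sketched here under the assumption
that the running kernel lists timeo= among the NFS options in /proc/mounts
(not every kernel does), is to read the value back from an existing mount.
The function names are made up for illustration.

```shell
# Print the value of one key=value option from a comma-separated option
# string, or nothing if the option is absent.
mount_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

# Print mount point and timeo for every NFS entry in a mounts table.
# Fields are: device, mount point, fs type, options, dump, pass.
# Reads /proc/mounts unless another file is given as $1.
show_nfs_timeo() {
    while read -r dev mnt fstype opts _rest; do
        case "$fstype" in
            nfs|nfs4) echo "$mnt timeo=$(mount_opt "$opts" timeo)" ;;
        esac
    done < "${1:-/proc/mounts}"
}
```

Running show_nfs_timeo with no argument on a client with NFS mounts lists one
line per mount; an empty value after timeo= means the kernel did not report
the option.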

Thanks

James Pearson


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [autofs] Re:  bug in linux mount? (says NetApp)
  2006-07-12  3:03   ` [autofs] " Ian Kent
@ 2006-07-12 12:19     ` Trond Myklebust
  0 siblings, 0 replies; 10+ messages in thread
From: Trond Myklebust @ 2006-07-12 12:19 UTC (permalink / raw)
  To: Ian Kent; +Cc: autofs, gregory.baker, nfs

On Wed, 2006-07-12 at 11:03 +0800, Ian Kent wrote:
> On Tue, 2006-07-11 at 19:27 -0400, Trond Myklebust wrote:
> > Something like
> > 
> > mount -t nfs -ohard,timeo=600,retrans=2,rsize=32768,wsize=32768,tcp foo:/ /bar
> 
> I thought that the default timeo had changed to 600 (60 secs) for TCP
> mounts in later versions of mount (it should be) and that the retrans
> shouldn't matter as 60 secs is the RPC major timeout. I thought the
> point of this default was to prevent RPC from retransmitting as the TCP
> layer would take care of it.
> 
> Trond, what am I missing?
> 
> > should be a fairly safe bet. You might want to add the 'intr' flag too,
> > depending on how you feel about the behaviour w.r.t. pressing ^C.
> > 
> > > * What is the reference standard behavior?
> 
> I don't think it's a standard so much as common sense.
> Certainly, if the defaults are set to values that are sensible for UDP, you
> are much more likely to see bogus failures.
> 
> Ian

The fact that not everyone may be using a recent version of mount :-)
The above line should work fine with all versions.

Actually, the "retrans" value is pretty irrelevant for hard mounts. It
is more important for soft mounts for which it defines the number of
resends before giving up the RPC call and returning an error to the
application.

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bug in linux mount? (says NetApp)
  2006-07-11 20:21 ` Chuck Lever
@ 2006-07-14 20:36   ` Gregory Baker
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Baker @ 2006-07-14 20:36 UTC (permalink / raw)
  To: Chuck Lever; +Cc: autofs, nfs


Ahh... I should have expanded "linux clients" to "linux clients running 
RHEL 3 U5".

[greg@apathy greg]$ rpm -qa util-linux
util-linux-2.11y-31.6

Red Hat support has this to say...

[...snip...]

"I have been looking into this issue and I have found other people are 
experiencing similar behavior.  I also found a fix that was added to the 
util-linux package that I think addresses this issue...... I believe 
this is what Chuck refers to with his comment "and I believe later 
releases of RHEL 3 were fixed to do this"

From the upstream package change log:

"RHEL3 util-linux >=2.11y-31.8 should make the default 70s (instead of 
7s) for TCP mounts:

* Wed Jun  8 2005 Steve Dickson <SteveD@RedHat.com> 2.11y-31.8
- Changed nfsmount to retry calls to mountd in foreground as
  well as in background (bz# 138775)
- Increased TCP timeouts to 70 secs (bz# 151097)"

I am pretty sure this will fix the problem that you are seeing.  The 
util-linux package in the Red Hat Enterprise Linux AS (v. 3 for x86) 
Beta channel on RHN is version util-linux-2.11y-31.16.i386.rpm, which 
should have this fix in it."

[...snip...]

The bug/errata

http://rhn.redhat.com/errata/RHBA-2005-626.html

became available in RHEL3 U6.  Sigh.
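
For anyone wanting to check a client quickly, a rough sketch of the version
comparison follows.  It assumes GNU sort with the -V option is available and
that its ordering is close enough to RPM's own comparison for strings shaped
like "2.11y-31.8"; the function name ver_at_least is made up.

```shell
# Succeed if version-release string $1 sorts at or after $2.
# Assumption: GNU `sort -V` ordering stands in for RPM's comparison here;
# it is not RPM's own algorithm.
ver_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical usage on an RPM-based client:
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' util-linux)
#   ver_at_least "$installed" 2.11y-31.8 && echo "has the TCP timeout change"
```

With the 2.11y-31.6 package shown above the check fails, while 2.11y-31.16
passes.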

We skipped U3 and U4 (autofs woes) and U6 (we had just finished upgrading 
from U2 to U5 and were dealing with the fallout), and recently began using 
U7 (to support Sun x4100 SAS drives).

Thanks,

--Greg

Chuck Lever wrote:
> On 7/11/06, Gregory Baker <gregory.baker@amd.com> wrote:
>> We have thousands of linux clients hitting netapp file servers (many
>> 3500 series, clustered) on a local gigabit LAN.  From time to time,
>> applications return "file not found" when attempting to automount a
>> directory and access a file.  An example of this is a long running
>> process, which reads in data, processes it for hours (in which time the
>> filesystem is unmounted) then tries to read more data from that mount
>> point (which causes a "file not found" error in the application).  This
>> occurs about 1/100th of the time.
>>
>> Researching at Netapp turns up this bit by Chuck Lever (Linux NFS
>> contributor)
>>
>> "Using the Linux NFS Client with Network Appliance Filers"
>> http://www.netapp.com/library/tr/3183.pdf  (February 2006)
>>
>> page 10 says...
>>
>> "Due to a bug in the mount command, the default retransmission timeout
>> value on Linux for NFS over TCP is quite small...To obtain standard
>> behavior, we strongly recommend using "timeo=600, retrans=2" explicitly
>> when mounting via TCP."
>>
>> Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3)
>> would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths
>> of a second (10 seconds).  It appears netapp is suggesting waiting
>> 600+600 = 1200 tenths (120 seconds) before giving up on the mount 
>> command...
> 
> It's important to distinguish two different types of timeouts.
> 
> 1.  The mount operation has timed out.
> 
> 2.  After the mount operation succeeds, an NFS RPC operation has timed out.
> 
> TR-3183 discusses the proper settings for 2, but you are experiencing 1.
> 
> The automounter attempts to mount one of the filer's exports, but the
> mount request times out causing the mounted-on directory to be
> exposed.  Your filer is heavily loaded, and the filer's mountd is
> single-threaded.  The filer may also be experiencing delays when
> requesting information from external servers (like DNS or NIS), in
> which case the mount request is held up at the filer.
> 
> Both sides are at fault:  the Linux mount command should retry (and I
> believe later releases of RHEL 3 were fixed to do this) and the filer
> configuration should be reviewed to make sure there are no avoidable
> delays while processing mount requests.
> 
>> * What "bug" in the mount command do you believe NetApp is talking about?
> 
> The bug is that the mount command overrides the proper default RPC
> timeout value with a timeout value of 0.7 seconds.  This is *not* the
> timeout for mount operations, it is the timeout for the in-kernel NFS
> client to retransmit RPC requests.
> 
>> * What do you think proper options for NFS auto/mounts would be for
>> extremely busy centralized NFS filers?
> 
> If you are using NFS over TCP, the proper timeout value is 60 seconds.
> 
>> * What is the reference standard behavior?
> 
> Solaris, which is the NFSv3 reference implementation, uses effectively
> a 60 second timeout on TCP mounts.
> 
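
For reference, the timeout sums in the quoted question (7+14+28+56 and 600+600) can be reproduced with a short sketch. It only mirrors the arithmetic as written in that message: the number of attempts and the backoff factor are taken from those sums, not from the kernel's actual retransmit logic, and `total_wait_tenths` is a hypothetical helper name. As the reply points out, these values govern RPC retransmission after a successful mount, not the mount operation itself.

```python
def total_wait_tenths(timeo: int, attempts: int, factor: int = 2) -> int:
    """Sum the per-attempt timeouts, in tenths of a second.

    Each attempt waits `factor` times longer than the previous one;
    factor=1 means every attempt waits the same `timeo`.
    """
    return sum(timeo * factor ** i for i in range(attempts))


# timeo=7 with four attempts, doubling each time: 7+14+28+56
print(total_wait_tenths(7, 4))               # 105 tenths, about 10.5 s

# timeo=600 with two attempts and no backoff: 600+600
print(total_wait_tenths(600, 2, factor=1))   # 1200 tenths, 120 s
```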

-- 
----------------------------------------------------------------------
Greg Baker                                         512-602-3287 (work)
gregory.baker@amd.com                              512-602-6970 (fax)
5900 E. Ben White Blvd MS 626                      512-555-1212 (info)
Austin, TX 78741






end of thread, other threads:[~2006-07-14 20:36 UTC | newest]

Thread overview: 10+ messages
2006-07-11 19:00 bug in linux mount? (says NetApp) Gregory Baker
2006-07-11 20:21 ` Chuck Lever
2006-07-14 20:36   ` Gregory Baker
2006-07-11 23:27 ` [NFS] " Trond Myklebust
2006-07-11 23:34   ` Gregory Baker
2006-07-12  3:03   ` [autofs] " Ian Kent
2006-07-12 12:19     ` Trond Myklebust
2006-07-12  9:32   ` James Pearson
2006-07-12  0:40 ` Blake Golliher
2006-07-12  1:07   ` Gregory Baker
