linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Mark Hills <mark@pogo.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>, Neil Brown <neilb@suse.de>,
	linux-nfs@vger.kernel.org
Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
Date: Wed, 8 Dec 2010 10:38:38 -0500	[thread overview]
Message-ID: <20101208153838.GD32305@fieldses.org> (raw)
In-Reply-To: <alpine.NEB.2.01.1012071143080.28772@jrf.vwaro.pbz>

On Wed, Dec 08, 2010 at 02:45:40PM +0000, Mark Hills wrote:
> On Wed, 1 Dec 2010, Chuck Lever wrote:
> 
> > On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> > 
> > > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> > > 
> > >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > > [...]
> > >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> > >>> the client log we see messages like:
> > >>> 
> > >>>  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> > >>> 
> > >>> Using strace and isolating one of these, I can see a non-blocking connect 
> > >>> has already managed to make a connection and even send/receive some data. 
> > >>> 
> > >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> > >>> mount.nfs when waiting for a response of some sort. The socket in question 
> > >>> is a connection to mountd:
> > >>> 
> > >>>  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> > >>>  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> > >>>  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> > >>> 
> > >>> When it returns:
> > >>> 
> > >>>  26512 <... poll resumed> )              = 0 (Timeout)
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> > >>>  26512 close(3)                          = 0
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>>  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> > >>> 
> > >>> There's no re-try from here, just a failed mount.
> [...]
> > >> Can you tell where the mountd threads are typically waiting?
> > > 
> > > Here's a trace from mountd. Note the long pause after fdatasync():
> > > 
> > > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > > 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> [...]
> > fdatasync() will write all outstanding dirty data on that partition to 
> > persistent storage, not just the dirty data for that file.  If /var is 
> > on a partition that has a lot of other activity, then this is the 
> > expected outcome, unfortunately.  I would expect such delays to be due 
> > to a large amount of outstanding writes, not to periodic synchronous 
> > writes.
> 
> Moving rmtab to a tmpfs, this moved the problem to another bottleneck 
> writing to /var, this time via syslog.
> 
> We identified a crontab entry which uses RPM to apply updates to the 
> server OS at randomised times which was blocking /var for 10 seconds or 
> so.

So is the problem that mountd ends up waiting on a synchronous write of
one of those "authenticated mount request" log messages?

Maybe we should downgrade those messages?  On a server with a lot of
clients a log on every mount call seems obnoxious.  Doing it
synchronously even more so (though I guess that's a problem for the
syslog configuration.)

> This returns the issue to giving up after the 9999 millisecond timeout 
> above, which I think is making the client overly sensitive.
> 
> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
> We are using mount.nfs for a v3 mount over TCP.
> 
> Locating this timeout wasn't easy. There's no '9999' in any of the 
> nfs-utils or glibc's rpc code.
> 
> It turns out that the timeout originates in 
> nfs-utils/support/nfs/rpc_socket.c as
> 
>   #define NFSRPC_TIMEOUT_TCP	(10)
> 
> This is used to populate a struct timeval *timeout which is used and 
> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
> milliseconds is used in an RPC clnt_call operation.
> 
> 10 seconds over TCP is simply too short to be reliable here; it's not a 
> long period of time for a fatal timeout. I will attempt to test on our 
> cluster with an extended timeout.

Should this be tied to some mount option?

Chuck probably knows how this is supposed to work....

--b.

> 
> I presume that the bug is that the timeout is too short, and not that the 
> timeout should cause a retry in a layer above -- because re-sending the 
> request over reliable TCP would not make sense.
> 
> For UDP the retries are handled inside the RPC call, clntudp_call(). In 
> which case, what is the reason the UDP timeout differs from TCP? And is it 
> also too small?

  reply	other threads:[~2010-12-08 15:38 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-15 18:43 Listen backlog set to 64 Mark Hills
2010-11-16 18:20 ` J. Bruce Fields
2010-11-16 19:05   ` Mark Hills
2010-11-16 22:08   ` Neil Brown
2010-11-29 20:59     ` J. Bruce Fields
2010-11-30 17:50       ` Mark Hills
2010-11-30 20:00         ` J. Bruce Fields
2010-11-30 22:09           ` Mark Hills
2010-12-01 18:18           ` Mark Hills
2010-12-01 18:28             ` Chuck Lever
2010-12-01 18:46               ` J. Bruce Fields
2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
2010-12-08 15:38                 ` J. Bruce Fields [this message]
2010-12-08 16:45                 ` Chuck Lever
2010-12-08 17:31                   ` Mark Hills
2010-12-08 18:28                     ` Chuck Lever
2010-12-08 18:37                       ` J. Bruce Fields
2010-12-08 20:34                         ` Chuck Lever
2010-12-08 21:04                         ` Chuck Lever
2010-12-13 16:19                       ` Chuck Lever
2010-12-01 18:36             ` Listen backlog set to 64 J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101208153838.GD32305@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mark@pogo.org.uk \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).