From: Chuck Lever <chucklever@gmail.com>
To: Guillaume Morin <guillaume@morinfr.org>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Chris Mason <clm@fb.com>
Subject: Re: [BUG] nfs3 client stops retrying to connect
Date: Mon, 8 Jun 2015 13:50:47 -0400
Message-ID: <21A8A567-1EB4-4E3A-8DB8-BD07212044D0@gmail.com>
In-Reply-To: <20150608171006.GA13396@bender.morinfr.org>


On Jun 8, 2015, at 1:10 PM, Guillaume Morin <guillaume@morinfr.org> wrote:

> Chuck,
> 
> On 04 Jun 22:57, Chuck Lever wrote:
>>> I am 100% sure that XPRT_CONNECTING is the issue because 1) the state
>>> had the flag set, 2) there was absolutely no NFS network traffic between the
>>> client and the server 3) I "unfroze" the mounts by clearing it manually.
>>> 
>>> xs_tcp_cancel_linger_timeout, I think, is guaranteed to clear the flag.
>> 
>> I'm speculating based on some comments in the git log, but what if
>> the transport never sees TCP_CLOSE, but rather gets an error_report
>> callback instead?
> 
> I don't think that could be it: xs_tcp_setup_socket() does the
> connecting and clears the bit in all cases, so by the time you would get
> a TCP_CLOSE the bit would have been cleared a while ago.

The linger timer is started by FIN_WAIT1 or LAST_ACK, and
xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
XPRT_CONNECTION_ABORT.

At a guess there could be a race between xs_tcp_cancel_linger_timeout
and the connect worker clearing those flags.
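
From memory, the two paths look roughly like this (a simplified sketch,
not the exact net/sunrpc/xprtsock.c code in your kernel, so treat the
details as assumptions):

/* FIN_WAIT1 / LAST_ACK path: arm the linger timeout */
static void xs_tcp_schedule_linger_timeout(struct rpc_xprt *xprt,
					   unsigned long timeout)
{
	struct sock_xprt *transport =
		container_of(xprt, struct sock_xprt, xprt);

	if (xprt_test_and_set_connecting(xprt))	/* sets XPRT_CONNECTING */
		return;
	set_bit(XPRT_CONNECTION_ABORT, &xprt->state);
	queue_delayed_work(rpciod_workqueue, &transport->connect_worker,
			   timeout);
}

/* TCP_CLOSE path: cancel the linger timeout */
static void xs_tcp_cancel_linger_timeout(struct rpc_xprt *xprt)
{
	struct sock_xprt *transport =
		container_of(xprt, struct sock_xprt, xprt);

	/*
	 * If the delayed work has already started running (or was
	 * already cancelled by someone else), cancel_delayed_work()
	 * returns false and XPRT_CONNECTING is NOT cleared here.
	 * From that point we depend entirely on xs_tcp_setup_socket()
	 * reaching its xprt_clear_connecting() call.
	 */
	if (cancel_delayed_work(&transport->connect_worker))
		xprt_clear_connecting(xprt);
}

So anything that kills the pending connect worker without also clearing
the bit would leave XPRT_CONNECTING set for good.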

> So that's why I thought the best explanation was finding a place where
> the worker task running xs_tcp_setup_socket() is cancelled without the
> bit being cleared.  That is how I found xs_tcp_close().
> 
>>> Either the callback is canceled and the caller clears the flag, or the
>>> callback itself will do it.  I am not sure how this could leave the flag
>>> set, but I am not familiar with this code, so I could totally be missing
>>> something obvious.
>>> 
>>> xs_tcp_close() is the only thing I have found which cancels the callback
>>> and does not clear the flag.
>> 
>> How would xs_tcp_close() be invoked?
> 
> TBH I do not know.  It's the close() method of the xprt, so I am assuming
> there are a few places it could be called from.  But I am not familiar
> with the code base.

AFAICT ->close is invoked when the transport is being shut down, in other
words at umount time. It is also invoked when the autoclose timer fires.

Autoclose is simply a mechanism for reaping NFS sockets that are idle.
I think the timer is 5 or 6 minutes.
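
Roughly, the path looks like this (again a sketch from memory of
net/sunrpc/xprt.c, simplified, with the details assumed rather than
checked against your kernel):

/* fires from xprt->timer once the transport has been idle too long */
static void xprt_init_autodisconnect(unsigned long data)
{
	struct rpc_xprt *xprt = (struct rpc_xprt *)data;

	/* skip if requests are outstanding or the transport is busy */
	if (test_and_set_bit(XPRT_LOCKED, &xprt->state))
		return;
	queue_work(rpciod_workqueue, &xprt->task_cleanup);
}

/* the queued work ends up calling the transport's ->close method */
static void xprt_autoclose(struct work_struct *work)
{
	struct rpc_xprt *xprt =
		container_of(work, struct rpc_xprt, task_cleanup);

	xprt->ops->close(xprt);		/* xs_tcp_close() for a TCP xprt */
	xprt_release_write(xprt, NULL);
}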

Autoclose won’t fire if there is frequent work being done on the mount
point. If this is related to autoclose, then the workload on the client
might need to be sparse (NFS requests only every few minutes or so) to
reproduce it.

For example, the trigger might be that autoclose fires and tries to shut
down the socket after the server has stopped responding.
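
If that is the sequence, the close path you found is exactly where it
would go wrong. Sketching it from memory (assumed shape, not verbatim
code): xs_tcp_close() ends up in a teardown path that cancels the
pending connect worker but never clears XPRT_CONNECTING, and as far as
I can tell xprt_connect() refuses to call ->connect while that bit is
set, so every later connect attempt returns without touching the
network, which matches your symptoms:

static void xs_close(struct rpc_xprt *xprt)
{
	struct sock_xprt *transport =
		container_of(xprt, struct sock_xprt, xprt);

	/* kills a connect worker that may still own XPRT_CONNECTING... */
	cancel_delayed_work_sync(&transport->connect_worker);

	/* ...tears down the socket... */
	xs_reset_transport(transport);

	/*
	 * ...but nothing on this path calls xprt_clear_connecting(),
	 * so if the connect worker was cancelled before it ever ran
	 * (e.g. a linger timeout that was still pending), nothing is
	 * left to clear the bit and the transport stays "connecting"
	 * forever.
	 */
	xprt_disconnect_done(xprt);
}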

> We had to move an NFS server on Friday, and a few machines hit the same
> issue again…

That suggests one requirement for your reproducer: after clients have
mounted it, the NFS server needs to be fully down for an extended period.

Since some clients recovered, I assume the server retained its IP address.
Did the network route change?

--
Chuck Lever
chucklever@gmail.com




Thread overview: 10+ messages
2015-05-21  1:21 [BUG] nfs3 client stops retrying to connect Guillaume Morin
2015-06-03 18:31 ` Chuck Lever
2015-06-04 20:06   ` Guillaume Morin
2015-06-04 21:23     ` Chuck Lever
2015-06-04 22:14       ` Guillaume Morin
2015-06-05  2:57         ` Chuck Lever
2015-06-08 17:10           ` Guillaume Morin
2015-06-08 17:50             ` Chuck Lever [this message]
2015-06-08 18:12               ` Guillaume Morin
2015-08-25 15:16                 ` Guillaume Morin
