linux-nfs.vger.kernel.org archive mirror
From: Guillaume Morin <guillaume@morinfr.org>
To: Chuck Lever <chucklever@gmail.com>
Cc: Guillaume Morin <guillaume@morinfr.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Chris Mason <clm@fb.com>
Subject: Re: [BUG] nfs3 client stops retrying to connect
Date: Mon, 8 Jun 2015 20:12:10 +0200
Message-ID: <20150608181210.GA18244@bender.morinfr.org>
In-Reply-To: <21A8A567-1EB4-4E3A-8DB8-BD07212044D0@gmail.com>

On 08 Jun 13:50, Chuck Lever wrote:
> The linger timer is started by FIN_WAIT1 or LAST_ACK, and
> xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
> XPRT_CONNECTION_ABORT.
> 
> At a guess there could be a race between xs_tcp_cancel_linger_timeout
> and the connect worker clearing those flags.

The connect worker is xs_tcp_setup_socket().  It clears the connecting
bit in all code paths.  So the only kind of race I can see here is
another function cancelling it before it runs without clearing the bit.

xs_tcp_cancel_linger_timeout() does the right thing afaict.  It clears
the bit if cancel_delayed_work() returns a non-zero value.

The only other place where the worker is cancelled is xs_close(), but it
does not clear the bit.  So if it cancels the worker before the worker
has started running, the bit will stay set.
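
To make the suspected sequence concrete, here is a minimal userspace
sketch of the two cancel paths described above.  It is not kernel code:
every toy_* name is hypothetical, the two booleans merely stand in for
the connecting bit and a queued connect worker, and toy_try_connect()
assumes that a new connect attempt is skipped while the bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_xprt {
	bool connecting;   /* stands in for the connecting bit */
	bool work_pending; /* stands in for a queued, not-yet-run connect worker */
};

/* Models a cancel that reports whether the work was still pending. */
bool toy_cancel_work(struct toy_xprt *x)
{
	assert(x != NULL);
	bool was_pending = x->work_pending;
	x->work_pending = false;
	return was_pending;
}

/* Queue the connect worker and mark the transport as connecting. */
void toy_schedule_connect(struct toy_xprt *x)
{
	assert(x != NULL);
	x->connecting = true;
	x->work_pending = true;
}

/* The worker clears the bit, but only if it actually gets to run. */
void toy_run_worker(struct toy_xprt *x)
{
	assert(x != NULL);
	if (!x->work_pending)
		return;
	x->work_pending = false;
	x->connecting = false;
}

/* Cancel path that clears the bit when it really cancelled the worker. */
void toy_cancel_and_clear(struct toy_xprt *x)
{
	if (toy_cancel_work(x))
		x->connecting = false;
}

/* Cancel path that cancels the worker but never touches the bit. */
void toy_cancel_only(struct toy_xprt *x)
{
	(void)toy_cancel_work(x);
}

/* Assumption: a connect attempt is skipped while the bit is set. */
bool toy_try_connect(struct toy_xprt *x)
{
	assert(x != NULL);
	if (x->connecting)
		return false;
	toy_schedule_connect(x);
	return true;
}
```

In this model, toy_schedule_connect() followed by toy_cancel_only()
leaves connecting set with no worker left to clear it, so every later
toy_try_connect() returns false.  With toy_cancel_and_clear() instead,
the next attempt goes through.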

> AFAICT ->close is invoked when the transport is being shut down, in other
> words at umount time. It is also invoked when the autoclose timer fires.
> 
> Autoclose is simply a mechanism for reaping NFS sockets that are idle.
> I think the timer is 5 or 6 minutes.
> 
> Autoclose won't fire if there is frequent work being done on the mount
> point. If this is related to autoclose, then the workload on the client
> might need to be sparse (NFS requests only every few minutes or so) to
> reproduce it.
> 
> For example, autoclose fires and tries to shut down the socket after the
> server is no longer responding.

It does not seem that autoclose is the cause here since it has happened
only during server outages.

If autoclose and umount are the only things that can call xs_close(),
that seems unlikely to be the problem.  But I see that xprt_connect()
can call it too, so that gives me some hope.

> > We had to move an nfs server on friday and I got a few machines that had
> > the same issue again?
> 
> That suggests one requirement for your reproducer: after clients have
> mounted it, the NFS server needs to be fully down for an extended period.

Yes, that seems to be the case, but if it's a race, a long outage just
gives it more opportunities to trigger.

> Since some clients recovered, I assume the server retained its IP address.
> Did the network route change?

No, the route did not change.

-- 
Guillaume Morin <guillaume@morinfr.org>
