From: Guillaume Morin <guillaume@morinfr.org>
To: Chuck Lever <chucklever@gmail.com>,
Guillaume Morin <guillaume@morinfr.org>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
Trond Myklebust <trond.myklebust@primarydata.com>,
Chris Mason <clm@fb.com>
Subject: Re: [BUG] nfs3 client stops retrying to connect
Date: Tue, 25 Aug 2015 17:16:14 +0200
Message-ID: <20150825151614.GA31127@bender.morinfr.org>
In-Reply-To: <20150608181210.GA18244@bender.morinfr.org>
On 08 Jun 20:12, Guillaume Morin wrote:
>
> On 08 Jun 13:50, Chuck Lever wrote:
> > The linger timer is started by FIN_WAIT1 or LAST_ACK, and
> > xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
> > XPRT_CONNECTION_ABORT.
> >
> > At a guess there could be a race between xs_tcp_cancel_linger_timeout
> > and the connect worker clearing those flags.
>
> The connect worker is xs_tcp_setup_socket(). It clears the connecting
> bit in all code paths. So the only kind of race I can see here is
> another function cancelling it before it runs without clearing the bit.
>
> xs_tcp_cancel_linger_timeout() does the right thing afaict. It clears
> the bit if cancel_delayed_work() returns a non-zero value.
>
> The only other place where the worker is cancelled is xs_close() but it
> does not clear the bit. So if it cancels the worker before it had
> started running, the bit will stay up.

FWIW I patched our production kernel a couple of months ago to clear the
connecting bit in xs_close(). Since then we have had a few NFS server
outages and the problem has never recurred, whereas before the change we
always had a few machines that could not reconnect. I feel fairly
confident this was the bug.

I am posting the change in case it helps someone running one of the
stable kernels.

sunrpc: call xprt_clear_connecting in xs_close

It closes the race where the CONNECTING bit in the xprt
is left on while the kernel is not trying to connect.

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 41c2f9d..1b71c59 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -891,6 +891,7 @@ static void xs_close(struct rpc_xprt *xprt)
 	dprintk("RPC: xs_close xprt %p\n", xprt);
 	cancel_delayed_work_sync(&transport->connect_worker);
+	xprt_clear_connecting(xprt);
 	xs_reset_transport(transport);
 	xprt->reestablish_timeout = 0;

Another option would be to call clear_bit() a few lines later, but
clear_bit() is never used for CONNECTING, so I went with this.
--
Guillaume Morin <guillaume@morinfr.org>