Linux NFS development
From: Peter Staubach <staubach@redhat.com>
To: Neil Horman <nhorman@redhat.com>
Cc: linux-nfs@vger.kernel.org, lhh@redhat.com, nfsv4@linux-nfs.org,
	Jeff Layton <jlayton@redhat.com>
Subject: Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
Date: Mon, 09 Jun 2008 11:37:27 -0400
Message-ID: <484D4E37.3060001@redhat.com>
In-Reply-To: <20080609152321.GA20181@hmsendeavour.rdu.redhat.com>

Neil Horman wrote:
> On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
>   
>> Jeff Layton wrote:
>>     
>>> Apologies for the long email, but I ran into an interesting problem the
>>> other day and am looking for some feedback on my general approach to
>>> fixing it before I spend too much time on it:
>>>
>>> We (RH) have a cluster-suite product that some people use for making HA
>>> NFS services. When our QA folks test this, they often will start up
>>> some operations that do activity on an NFS mount from the cluster and
>>> then rapidly do failovers between cluster machines and make sure
>>> everything keeps moving along. The cluster is designed to not shut down
>>> nfsd's when a failover occurs. nfsd's are considered a "shared
>>> resource". It's possible that there could be multiple clustered
>>> services for NFS-sharing, so when a failover occurs, we just manipulate
>>> the exports table.
>>>
>>> The problem we've run into is that occasionally they fail over to the
>>> alternate machine and then back very rapidly. Because nfsd's are not
>>> shut down on failover, sockets are not closed. So what happens is
>>> something like this on TCP mounts:
>>>
>>> - client has NFS mount from clustered NFS service on one server
>>>
>>> - service fails over, new server doesn't know anything about the
>>>  existing socket, so it sends a RST back to the client when data
>>>  comes in. Client closes connection and reopens it and does some
>>>  I/O on the socket.
>>>
>>> - service fails back to original server. The original socket there
>>>  is still open, but now the TCP sequence numbers are off. When
>>>  packets come into the server we end up with an ACK storm, and the
>>>  client hangs for a long time.
>>>
>>> Neil Horman did a good writeup of this problem here for those that
>>> want the gory details:
>>>
>>>    https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
>>>
>>> I can think of 3 ways to fix this:
>>>
>>> 1) Add something like the recently added "unlock_ip" interface that
>>> was added for NLM. Maybe a "close_ip" that allows us to close all
>>> nfsd sockets connected to a given local IP address. So clustering
>>> software could do something like:
>>>
>>>    # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>>>
>>> ...and make sure that all of the sockets are closed.
>>>
>>> 2) just use the same "unlock_ip" interface and just have it also
>>> close sockets in addition to dropping locks.
>>>
>>> 3) have an nfsd close all non-listening connections when it gets a
>>> certain signal (maybe SIGUSR1 or something). Connections on
>>> sockets that aren't failing over should just get a RST and would
>>> reopen their connections.
>>>
>>> ...my preference would probably be approach #1.
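
To make option 1 concrete, here is a minimal sketch of what a cluster
agent might do on failover. It assumes the proposed "close_ip" file
exists, which it does not today; the path and the helper name
close_nfsd_sockets() are hypothetical and only mirror the echo command
quoted above.

```python
import ipaddress
from pathlib import Path

# Hypothetical path: this is the interface proposed in the thread,
# not one that is known to exist in the kernel.
CLOSE_IP_PATH = Path("/proc/fs/nfsd/close_ip")


def close_nfsd_sockets(ip, control_file=CLOSE_IP_PATH):
    """Validate `ip`, then write it to the control file.

    This is the same operation as `echo <ip> > <control_file>`, with the
    address checked first so a typo never reaches the kernel interface.
    """
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    Path(control_file).write_text(f"{addr}\n")
    return str(addr)
```

A failover script would call close_nfsd_sockets("10.20.30.40") after
removing the floating address from the node, so that no stale
connection is left behind for a later failback.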
>>>
>>> I've only really done some rudimentary perusing of the code, so there
>>> may be roadblocks with some of these approaches I haven't considered.
>>> Does anyone have thoughts on the general problem or idea for a solution?
>>>
>>> The situation is a bit specific to failover testing -- most people failing
>>> over don't do it so rapidly, but we'd still like to ensure that this
>>> problem doesn't occur if someone does do it.
>>>
>>> Thanks,
>>>  
>>>       
>> This doesn't sound like it would be an NFS specific situation.
>> Why doesn't TCP handle this, without causing an ACK storm?
>>
>>     
>
> You're right, it's not a problem specific to NFS; any TCP-based service in
> which sockets are not explicitly closed by the application is subject to this
> problem.  However, I think NFS is currently the only clustered service that we
> offer in which we explicitly leave nfsd running during such a 'soft' failover,
> and so practically speaking, this is the only place that this issue manifests
> itself.  If we could shut down nfsd on the server doing a failover, that would
> solve this problem (as it prevents the problem with all other clustered
> TCP-based services), but from what I'm told, that's a non-starter.
>
>   

I think this last option would be a good thing to pursue anyway,
or we should at least understand why it is considered a
"non-starter".  When failing a service away, why not stop
the service on the original node?

These floating virtual IP and ARP games can get tricky to handle
in boundary cases like this one.

> As for why TCP doesn't handle this, that's because the situation is ambiguous
> from the point of view of the client and server.  The writeup in the bugzilla
> has all the gory details, but the executive summary is that during rapid
> failover, the client will ACK some data to server A in the cluster, and some to
> server B in the cluster.  If you quickly fail over and back between the servers
> in the cluster, each server will see some gaps in the data stream sequence
> numbers, but the client will see that all data has been ACKed.  This leaves the
> connection in an unrecoverable state.
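
A toy model may help show why the two ends cannot resynchronize. It is
only a sketch under simplifying assumptions: it applies the TCP rule
that an out-of-window segment on an established connection is answered
with a bare ACK and dropped, and it ignores the ACK-field checks,
timers, and any rate limiting a real stack applies. The Endpoint class,
count_acks(), and all sequence numbers are made up for illustration.

```python
from dataclasses import dataclass

MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


@dataclass
class Endpoint:
    name: str
    snd_nxt: int          # next sequence number this side will send
    rcv_nxt: int          # next sequence number this side expects
    rcv_wnd: int = 65535  # receive window size


def in_window(ep, seg_seq):
    """True if seg_seq falls inside ep's receive window (modulo 2**32)."""
    return (seg_seq - ep.rcv_nxt) % MOD < ep.rcv_wnd


def count_acks(client, server, payload_len=100, cap=1000):
    """Send one data segment from client; count the ACKs that follow.

    Returns the number of ACK segments exchanged, giving up at `cap`.
    """
    sender, receiver = client, server
    seg_seq, seg_len = client.snd_nxt, payload_len
    client.snd_nxt = (client.snd_nxt + payload_len) % MOD
    acks = 0
    while acks < cap:
        if in_window(receiver, seg_seq):
            if seg_len == 0:
                return acks  # in-window bare ACK: nothing to answer
            receiver.rcv_nxt = (seg_seq + seg_len) % MOD
        # Either data was accepted or the segment was out of window;
        # in both cases the receiver answers with a bare ACK.
        acks += 1
        sender, receiver = receiver, sender
        seg_seq, seg_len = sender.snd_nxt, 0
    return acks


# Both ends agree on the connection state: one ACK and the exchange ends.
in_sync = count_acks(Endpoint("client", 1000, 5000),
                     Endpoint("server", 5000, 1000))

# The server still holds the socket from before the failover, so each
# side's segments land outside the other's window and the ACKs never stop.
stale = count_acks(Endpoint("client", 1000, 5000),
                   Endpoint("stale server", 900000, 700000), cap=50)
```

In the second case the exchange only ends because of the cap, which is
the model's stand-in for the ACK storm described in the bugzilla.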

I wonder what would happen if we put some other NFS/RPC/TCP/IP
implementation into this situation.  Would it see and generate
the same behavior?

       ps
