rapid clustered nfs server failover and hung clients -- how best to close the sockets?

Linux NFS development

* rapid clustered nfs server failover and hung clients -- how best to close the sockets?
@ 2008-06-09 14:31 Jeff Layton
       [not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 14:31 UTC (permalink / raw)
  To: linux-nfs, nfsv4; +Cc: nhorman, lhh

Apologies for the long email, but I ran into an interesting problem the
other day and am looking for some feedback on my general approach to
fixing it before I spend too much time on it:

We (RH) have a cluster-suite product that some people use for making HA
NFS services. When our QA folks test this, they often start some
operations that generate activity on an NFS mount from the cluster,
then rapidly fail over between cluster machines to make sure
everything keeps moving along. The cluster is designed to not shut down
nfsd's when a failover occurs. nfsd's are considered a "shared
resource". It's possible that there could be multiple clustered
services for NFS-sharing, so when a failover occurs, we just manipulate
the exports table.

The problem we've run into is that occasionally they fail over to the
alternate machine and then back very rapidly. Because nfsd's are not
shut down on failover, sockets are not closed. So what happens is
something like this on TCP mounts:

- client has NFS mount from clustered NFS service on one server

- service fails over, new server doesn't know anything about the
  existing socket, so it sends a RST back to the client when data
  comes in. Client closes connection and reopens it and does some
  I/O on the socket.

- service fails back to original server. The original socket there
  is still open, but now the TCP sequence numbers are off. When
  packets come into the server we end up with an ACK storm, and the
  client hangs for a long time.
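
To make the last step concrete, here is a toy model of the exchange. It is
only a sketch: it assumes the client reuses its source port (so its packets
match the stale socket on the original server) and it reduces TCP to the
RFC 793 rule that an out-of-window segment is dropped and answered with an
ACK. The names Endpoint and count_ack_exchange are made up for illustration.

```python
from dataclasses import dataclass

SEQ_SPACE = 2**32   # TCP sequence numbers wrap at 32 bits
WINDOW = 65535      # assumed receive window, in bytes


@dataclass
class Endpoint:
    name: str
    snd_nxt: int  # next sequence number this side will send
    rcv_nxt: int  # next sequence number this side expects to receive

    def accepts(self, seg_seq: int) -> bool:
        """Simplified RFC 793 acceptability test for a zero-length segment."""
        return (seg_seq - self.rcv_nxt) % SEQ_SPACE < WINDOW


def count_ack_exchange(client: Endpoint, server: Endpoint, max_rounds: int = 10) -> int:
    """Count the ACKs bounced back and forth, giving up after max_rounds."""
    sender, receiver = client, server
    seg_seq = sender.snd_nxt
    acks = 0
    for _ in range(max_rounds):
        if receiver.accepts(seg_seq):
            break  # segment is in the window, so no corrective ACK is needed
        # Out-of-window segment: the receiver drops it and replies with an
        # ACK that carries its own (equally unacceptable) sequence number.
        acks += 1
        seg_seq = receiver.snd_nxt
        sender, receiver = receiver, sender
    return acks
```

When neither side's sequence numbers fall inside the other's window, the
loop never breaks and the count always reaches max_rounds, which is the
storm. With matching state the function returns 0.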

Neil Horman did a good writeup of this problem here for those that
want the gory details:

    https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16

I can think of 3 ways to fix this:

1) Add something like the "unlock_ip" interface that was recently
added for NLM. Maybe a "close_ip" that allows us to close all
nfsd sockets connected to a given local IP address. So clustering
software could do something like:

    # echo 10.20.30.40 > /proc/fs/nfsd/close_ip

...and make sure that all of the sockets are closed.

2) Use the same "unlock_ip" interface and have it also close
sockets in addition to dropping locks.

3) have an nfsd close all non-listening connections when it gets a
certain signal (maybe SIGUSR1 or something). Connections on
sockets that aren't failing over should just get a RST and would
reopen their connections.
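
As a rough user-space analogy for approach 3 (nfsd itself runs as kernel
threads, so this is not how it would be implemented there), a server could
track its accepted connections and drop them all on a signal while leaving
the listener alone. ConnectionRegistry is a hypothetical name, and the
sketch uses a plain close() without modelling what ends up on the wire.

```python
import signal
import socket


class ConnectionRegistry:
    """Tracks accepted connections so they can all be dropped on demand."""

    def __init__(self, listener: socket.socket):
        self.listener = listener   # never closed by close_connections()
        self.connections = set()

    def add(self, conn: socket.socket) -> None:
        self.connections.add(conn)

    def close_connections(self) -> int:
        """Close every tracked connection; return how many were closed."""
        closed = len(self.connections)
        for conn in self.connections:
            conn.close()
        self.connections.clear()
        return closed

    def install(self, signum=None) -> None:
        """Drop all connections whenever signum (default SIGUSR1) arrives."""
        if signum is None:
            signum = signal.SIGUSR1
        signal.signal(signum, lambda _sig, _frame: self.close_connections())
```

The downside mentioned above is visible here: the handler has no notion of
which local address a connection belongs to, so everything is dropped.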

...my preference would probably be approach #1.
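
From the cluster software's side, approach 1 could look something like the
sketch below. Note that /proc/fs/nfsd/close_ip is only the interface
proposed in this mail and does not exist; the function name and the
control_file parameter are likewise made up for illustration.

```python
import ipaddress
from pathlib import Path

# Hypothetical: this is the proposed file, not an existing kernel interface.
CLOSE_IP_PATH = Path("/proc/fs/nfsd/close_ip")


def request_socket_close(address: str, control_file: Path = CLOSE_IP_PATH) -> str:
    """Validate a local IP address and write it to the control file."""
    ip = ipaddress.ip_address(address)  # raises ValueError on malformed input
    line = f"{ip}\n"
    with open(control_file, "w") as f:
        f.write(line)
    return line
```

Validating the address first means a typo in the cluster configuration
fails loudly in user space rather than being handed to the kernel.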

I've only really done some rudimentary perusing of the code, so there
may be roadblocks with some of these approaches I haven't considered.
Does anyone have thoughts on the general problem or ideas for a solution?

The situation is a bit specific to failover testing -- most people failing
over don't do it so rapidly, but we'd still like to ensure that this
problem doesn't occur if someone does do it.

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>
_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4


* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
       [not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 15:03   ` Peter Staubach
  2008-06-09 15:18     ` Jeff Layton
  2008-06-09 15:23     ` Neil Horman
  0 siblings, 2 replies; 38+ messages in thread
From: Peter Staubach @ 2008-06-09 15:03 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, nfsv4, nhorman, lhh

This doesn't sound like it would be an NFS specific situation.
Why doesn't TCP handle this, without causing an ACK storm?

    Thanx...

       ps

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:03   ` Peter Staubach
@ 2008-06-09 15:18     ` Jeff Layton
       [not found]       ` <20080609111821.6e06d4f8-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2008-06-09 15:51       ` Talpey, Thomas
  2008-06-09 15:23     ` Neil Horman
  1 sibling, 2 replies; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 15:18 UTC (permalink / raw)
  To: Peter Staubach; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 09 Jun 2008 11:03:53 -0400
Peter Staubach <staubach@redhat.com> wrote:

> 
> This doesn't sound like it would be an NFS specific situation.
> Why doesn't TCP handle this, without causing an ACK storm?
> 

No, it's not specific to NFS. It can happen to any "service" that
floats IP addresses between machines, but does not close the sockets
that are connected to those addresses. Most services that fail over
(at least in RH's cluster server) shut down the daemons on failover
too, which tends to mitigate this problem elsewhere.

I'm not sure how the TCP layer can really handle this situation. On
the wire, it looks to the client and server like the connection has
been hijacked (and in a sense, it has). It would be nice if it
didn't end up in an ACK storm, but I'm not aware of a way to prevent
that that stays within the spec.

-- 
Jeff Layton <jlayton@redhat.com>

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
       [not found]       ` <20080609111821.6e06d4f8-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 15:31         ` Neil Horman
  2008-06-09 15:43           ` Jeff Layton
       [not found]         ` <RTPCLUEXC1-PRDOLZCH000001d2-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
  1 sibling, 1 reply; 38+ messages in thread
From: Neil Horman @ 2008-06-09 15:31 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Peter Staubach, linux-nfs, nfsv4, nhorman, lhh

On Mon, Jun 09, 2008 at 11:18:21AM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 11:03:53 -0400
> Peter Staubach <staubach@redhat.com> wrote:
> 
> > 
> > This doesn't sound like it would be an NFS specific situation.
> > Why doesn't TCP handle this, without causing an ACK storm?
> > 
> 
> No, it's not specific to NFS. It can happen to any "service" that
> floats IP addresses between machines, but does not close the sockets
> that are connected to those addresses. Most services that fail over
> (at least in RH's cluster server) shut down the daemons on failover
> too, which tends to mitigate this problem elsewhere.
> 
> I'm not sure how the TCP layer can really handle this situation. On
> the wire, it looks to the client and server like the connection has
> been hijacked (and in a sense, it has). It would be nice if it
> didn't end up in an ACK storm, but I'm not aware of a way to prevent
> that that stays within the spec.
> 
I've not really thought it through yet, but would iptables be another option
here?  Could you, if you performed a soft failover, add a rule that responds to
any frame on an active connection that isn't a SYN frame by forcing the sending
of an ACK frame?  It probably wouldn't scale, and it's kind of ugly, but it could
work...

Neil


> -- 
> Jeff Layton <jlayton@redhat.com>

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:31         ` Neil Horman
@ 2008-06-09 15:43           ` Jeff Layton
  0 siblings, 0 replies; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 15:43 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 9 Jun 2008 11:31:55 -0400
Neil Horman <nhorman@redhat.com> wrote:

> On Mon, Jun 09, 2008 at 11:18:21AM -0400, Jeff Layton wrote:
> > On Mon, 09 Jun 2008 11:03:53 -0400
> > Peter Staubach <staubach@redhat.com> wrote:
> > 
> > > 
> > > This doesn't sound like it would be an NFS specific situation.
> > > Why doesn't TCP handle this, without causing an ACK storm?
> > > 
> > 
> > No, it's not specific to NFS. It can happen to any "service" that
> > floats IP addresses between machines, but does not close the sockets
> > that are connected to those addresses. Most services that fail over
> > (at least in RH's cluster server) shut down the daemons on failover
> > too, which tends to mitigate this problem elsewhere.
> > 
> > I'm not sure how the TCP layer can really handle this situation. On
> > the wire, it looks to the client and server like the connection has
> > been hijacked (and in a sense, it has). It would be nice if it
> > didn't end up in an ACK storm, but I'm not aware of a way to prevent
> > that that stays within the spec.
> > 
> I've not really thought it through yet, but would iptables be another option
> here?  Could you, if you performed a soft failover, add a rule that responds to
> any frame on an active connection that isn't a SYN frame by forcing the sending
> of an ACK frame?  It probably wouldn't scale, and it's kind of ugly, but it could
> work...
> 

Yow, that is ugly...

So once a client does a new SYN, what would have to happen to make the
connection then work? That sounds pretty complicated. I could foresee
using iptables here though...

When the service is "leaving" the server:

1) add rule to drop all traffic to port 2049
2) restart all of the nfsd's
3) remove iptables rule

...that would (briefly) disrupt communications between all clients and
the server, but it probably would work. You'd need to drop traffic to
prevent races that might get you a "Connection Refused".

Still, it's a kludge. I'd prefer a fix that didn't cause service
disruptions for anything but the stuff that's failing over. Also, that
would be pretty nightmarish from a coding standpoint. People have all
sorts of firewalling configurations, so doing this may be difficult in
practice.

Cheers,
-- 
Jeff Layton <jlayton@redhat.com>

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
       [not found]               ` <20080609122249.51767b21-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 16:40                 ` Talpey, Thomas
  2008-06-09 16:46                   ` Jeff Layton
  2008-06-09 18:03                   ` J. Bruce Fields
  0 siblings, 2 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 16:40 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Peter Staubach, linux-nfs, lhh, nfsv4, nhorman

At 12:22 PM 6/9/2008, Jeff Layton wrote:
>That might be worth investigating, but sounds like it might cause problems
>with the services associated with IP addresses that are staying on the
>victim server.

Jeff, I think you have many years of job security to look forward to, here. :-)

Since you sent this to the NFSv4 list - is there any chance you're thinking
to not transparently take over IP addresses, but use NFSv4 locations and
referrals for these "migrations"? Yes, I know some clients may not quite be
there yet.

Tom.


* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:40                 ` Talpey, Thomas
@ 2008-06-09 16:46                   ` Jeff Layton
  2008-06-09 18:03                   ` J. Bruce Fields
  1 sibling, 0 replies; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 16:46 UTC (permalink / raw)
  To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 09 Jun 2008 12:40:05 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:

> At 12:22 PM 6/9/2008, Jeff Layton wrote:
> >That might be worth investigating, but sounds like it might cause problems
> >with the services associated with IP addresses that are staying on the
> >victim server.
> 
> Jeff, I think you have many years of job security to look forward to, here. :-)
> 

:-)

> Since you sent this to the NFSv4 list - is there any chance you're thinking
> to not transparently take over IP addresses, but use NFSv4 locations and
> referrals for these "migrations"? Yes, I know some clients may not quite be
> there yet.
> 

An interesting thought. I sent this to the nfsv4 list since I assume
nfsv4 will also be affected by this problem....

I'm not aware of any plans to integrate the new v4 stuff into our
cluster product. It would make a lot of sense though, so perhaps after
it gets some more upstream soak time we'll want to consider it. That
would be an extremely attractive thing with something like GFS on
the backend.

-- 
Jeff Layton <jlayton@redhat.com>

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:40                 ` Talpey, Thomas
  2008-06-09 16:46                   ` Jeff Layton
@ 2008-06-09 18:03                   ` J. Bruce Fields
  1 sibling, 0 replies; 38+ messages in thread
From: J. Bruce Fields @ 2008-06-09 18:03 UTC (permalink / raw)
  To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman, Jeff Layton

On Mon, Jun 09, 2008 at 12:40:05PM -0400, Talpey, Thomas wrote:
> At 12:22 PM 6/9/2008, Jeff Layton wrote:
> >That might be worth investigating, but sounds like it might cause problems
> >with the services associated with IP addresses that are staying on the
> >victim server.
> 
> Jeff, I think you have many years of job security to look forward to, here. :-)
> 
> Since you sent this to the NFSv4 list - is there any chance you're thinking
> to not transparently take over IP addresses, but use NFSv4 locations and
> referrals for these "migrations"?

Yeah, definitely.  We've got a prototype and some other work in
progress--hopefully there'll be something "real" in the coming months!

There's some overlap with nfsv2/v3, though (not in this case, but in the
need for lock migration, for example).  And people really are using this
floating-ip address stuff now, so anything we can do to make it more
reliable or easier to use is welcome.

--b.

> Yes, I know some clients may not quite be
> there yet.

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
       [not found]             ` <RTPCLUEXC1-PRDF8Eqf000001d4-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
       [not found]               ` <20080609122249.51767b21-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 17:14               ` J. Bruce Fields
  1 sibling, 0 replies; 38+ messages in thread
From: J. Bruce Fields @ 2008-06-09 17:14 UTC (permalink / raw)
  To: Talpey, Thomas
  Cc: Jeff Layton, Peter Staubach, linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 09, 2008 at 12:09:48PM -0400, Talpey, Thomas wrote:
> At 12:01 PM 6/9/2008, Jeff Layton wrote:
> >On Mon, 09 Jun 2008 11:51:51 -0400
> >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> >
> >> At 11:18 AM 6/9/2008, Jeff Layton wrote:
> >> >No, it's not specific to NFS. It can happen to any "service" that
> >> >floats IP addresses between machines, but does not close the sockets
> >> >that are connected to those addresses. Most services that fail over
> >> >(at least in RH's cluster server) shut down the daemons on failover
> >> >too, which tends to mitigate this problem elsewhere.
> >> 
> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> >> victim server?
> >
> >The victim server might have other nfsd/lockd's running on it. Stopping
> >all the nfsd's could bring down lockd, and then you have to deal with lock
> >recovery on the stuff that isn't moving to the other server.
> 
> But but but... the IP address is the only identification the client can use
> to isolate a server.

Right.

> You're telling me that some locks will migrate and some won't?  Good
> luck with that! The clients are going to be mightily confused.

Locks migrate or not depending on the server ip address.  Where do you
see the confusion?

--b.

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:18     ` Jeff Layton
       [not found]       ` <20080609111821.6e06d4f8-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 15:51       ` Talpey, Thomas
  2008-06-09 16:01         ` Jeff Layton
  1 sibling, 1 reply; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 15:51 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

At 11:18 AM 6/9/2008, Jeff Layton wrote:
>No, it's not specific to NFS. It can happen to any "service" that
>floats IP addresses between machines, but does not close the sockets
>that are connected to those addresses. Most services that fail over
>(at least in RH's cluster server) shut down the daemons on failover
>too, which tends to mitigate this problem elsewhere.

Why exactly don't you choose to restart the nfsd's (and lockd's) on the
victim server? Failing that, for TCP at least would ifdown/ifup accomplish
the socket reset?

Tom.


* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:51       ` Talpey, Thomas
@ 2008-06-09 16:01         ` Jeff Layton
  2008-06-09 16:03           ` Neil Horman
  2008-06-09 16:09           ` Talpey, Thomas
  0 siblings, 2 replies; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 16:01 UTC (permalink / raw)
  To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 09 Jun 2008 11:51:51 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:

> At 11:18 AM 6/9/2008, Jeff Layton wrote:
> >No, it's not specific to NFS. It can happen to any "service" that
> >floats IP addresses between machines, but does not close the sockets
> >that are connected to those addresses. Most services that fail over
> >(at least in RH's cluster server) shut down the daemons on failover
> >too, so tends to mitigate this problem elsewhere.
> 
> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> victim server?

The victim server might have other nfsd/lockd's running on it. Stopping
all the nfsd's could bring down lockd, and then you have to deal with lock
recovery on the stuff that isn't moving to the other server.

> Failing that, for TCP at least would ifdown/ifup accomplish
> the socket reset?
> 

I don't think ifdown/ifup closes the sockets, but maybe someone can
correct me on this...

-- 
Jeff Layton <jlayton@redhat.com>

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:01         ` Jeff Layton
@ 2008-06-09 16:03           ` Neil Horman
  2008-06-09 16:09           ` Talpey, Thomas
  1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 16:03 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 09, 2008 at 12:01:10PM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 11:51:51 -0400
> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> 
> > At 11:18 AM 6/9/2008, Jeff Layton wrote:
> > >No, it's not specific to NFS. It can happen to any "service" that
> > >floats IP addresses between machines, but does not close the sockets
> > >that are connected to those addresses. Most services that fail over
> > >(at least in RH's cluster server) shut down the daemons on failover
> > >too, so tends to mitigate this problem elsewhere.
> > 
> > Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> > victim server?
> 
> The victim server might have other nfsd/lockd's running on them. Stopping
> all the nfsd's could bring down lockd, and then you have to deal with lock
> recovery on the stuff that isn't moving to the other server.
> 
> > Failing that, for TCP at least would ifdown/ifup accomplish
> > the socket reset?
> > 
> 
> I don't think ifdown/ifup closes the sockets, but maybe someone can
> correct me on this...
> 
ifup/ifdown doesn't do anything to the sockets per se, but it could have any
number of side effects depending on how other aspects of your
network/application are configured.  It's certainly not a reliable way to
destroy a connection.
Neil

> -- 
> Jeff Layton <jlayton@redhat.com>

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:01         ` Jeff Layton
  2008-06-09 16:03           ` Neil Horman
@ 2008-06-09 16:09           ` Talpey, Thomas
  2008-06-09 16:22             ` Jeff Layton
  1 sibling, 1 reply; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 16:09 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

At 12:01 PM 6/9/2008, Jeff Layton wrote:
>On Mon, 09 Jun 2008 11:51:51 -0400
>"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>
>> At 11:18 AM 6/9/2008, Jeff Layton wrote:
>> >No, it's not specific to NFS. It can happen to any "service" that
>> >floats IP addresses between machines, but does not close the sockets
>> >that are connected to those addresses. Most services that fail over
>> >(at least in RH's cluster server) shut down the daemons on failover
>> >too, so tends to mitigate this problem elsewhere.
>> 
>> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
>> victim server?
>
>The victim server might have other nfsd/lockd's running on them. Stopping
>all the nfsd's could bring down lockd, and then you have to deal with lock
>recovery on the stuff that isn't moving to the other server.

But but but... the IP address is the only identification the client can use
to isolate a server. You're telling me that some locks will migrate and
some won't? Good luck with that! The clients are going to be mightily
confused.

>
>> Failing that, for TCP at least would ifdown/ifup accomplish
>> the socket reset?
>> 
>
>I don't think ifdown/ifup closes the sockets, but maybe someone can
>correct me on this...

No, it doesn't close the sockets, but it sends interface-down status to them.
The nfsd's, in theory, should close the sockets in response. But, it's possible
(probable?) that nfsd may ignore this, and do nothing. It's just an idea.

Tom.

* Re: rapid clustered nfs server failover and hung clients --   how best to close the sockets?
  2008-06-09 16:09           ` Talpey, Thomas
@ 2008-06-09 16:22             ` Jeff Layton
  2008-06-09 19:36               ` Chuck Lever
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 16:22 UTC (permalink / raw)
  To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 09 Jun 2008 12:09:48 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:

> At 12:01 PM 6/9/2008, Jeff Layton wrote:
> >On Mon, 09 Jun 2008 11:51:51 -0400
> >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> >
> >> At 11:18 AM 6/9/2008, Jeff Layton wrote:
> >> >No, it's not specific to NFS. It can happen to any "service" that
> >> >floats IP addresses between machines, but does not close the sockets
> >> >that are connected to those addresses. Most services that fail over
> >> >(at least in RH's cluster server) shut down the daemons on failover
> >> >too, so tends to mitigate this problem elsewhere.
> >> 
> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> >> victim server?
> >
> >The victim server might have other nfsd/lockd's running on them. Stopping
> >all the nfsd's could bring down lockd, and then you have to deal with lock
> >recovery on the stuff that isn't moving to the other server.
> 
> But but but... the IP address is the only identification the client can use
> to isolate a server. You're telling me that some locks will migrate and
> some won't? Good luck with that! The clients are going to be mightily
> confused.
> 

Maybe I'm not being clear. My understanding is this:

Right now, when we fail over we send a SIGKILL to lockd, and then send
a SM_NOTIFY to all of the clients that the "victim" server has,
regardless of what IP address the clients are talking to. So all locks
get dropped and all clients should recover their locks. Since the
service will fail over to the new host, locks that were in that export
will get recovered on the "new" host.

But, we just recently added this new "unlock_ip" interface. With that,
we should be able to just send SM_NOTIFY's to clients of that IP
address. Locks associated with that server address will be recovered
and the others should be unaffected.
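
As a rough illustration of that step (a sketch, not the cluster suite's
actual code): on kernels that provide the interface, the server address is
written to the nfsd control file, assumed here to live at
/proc/fs/nfsd/unlock_ip. The helper name release_locks_for() and its
"control" parameter are made up for the example. The write only asks the
server to drop the locks taken through that address; notifying clients is
a separate step.

```python
import ipaddress
from pathlib import Path

# Assumed location of the control file on a mounted nfsd filesystem.
UNLOCK_IP = Path("/proc/fs/nfsd/unlock_ip")


def release_locks_for(server_ip: str, control: Path = UNLOCK_IP) -> None:
    """Ask the server to drop locks held through one of its own addresses.

    Hypothetical helper: `control` is a parameter only so the function can
    be exercised against an ordinary file instead of the real interface.
    """
    # Validate first so a typo never reaches the kernel interface.
    addr = ipaddress.ip_address(server_ip)
    control.write_text(f"{addr}\n")
```

A failover script would call release_locks_for() with the floating address
that is about to move, before that address is taken off the interface.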

> >
> >> Failing that, for TCP at least would ifdown/ifup accomplish
> >> the socket reset?
> >> 
> >
> >I don't think ifdown/ifup closes the sockets, but maybe someone can
> >correct me on this...
> 
> No, it doesn't close the sockets, but it sends interface-down status to them.
> The nfsd's, in theory, should close the sockets in response. But, it's possible
> (probable?) that nfsd may ignore this, and do nothing. It's just an idea.
> 

That might be worth investigating, but sounds like it might cause problems
with the services associated with IP addresses that are staying on the
victim server.

-- 
Jeff Layton <jlayton@redhat.com>

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:22             ` Jeff Layton
@ 2008-06-09 19:36               ` Chuck Lever
  2008-06-09 20:11                 ` Jeff Layton
  0 siblings, 1 reply; 38+ messages in thread
From: Chuck Lever @ 2008-06-09 19:36 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 9, 2008 at 12:22 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Mon, 09 Jun 2008 12:09:48 -0400
> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>
>> At 12:01 PM 6/9/2008, Jeff Layton wrote:
>> >On Mon, 09 Jun 2008 11:51:51 -0400
>> >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>> >
>> >> At 11:18 AM 6/9/2008, Jeff Layton wrote:
>> >> >No, it's not specific to NFS. It can happen to any "service" that
>> >> >floats IP addresses between machines, but does not close the sockets
>> >> >that are connected to those addresses. Most services that fail over
>> >> >(at least in RH's cluster server) shut down the daemons on failover
>> >> >too, so tends to mitigate this problem elsewhere.
>> >>
>> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
>> >> victim server?
>> >
>> >The victim server might have other nfsd/lockd's running on them. Stopping
>> >all the nfsd's could bring down lockd, and then you have to deal with lock
>> >recovery on the stuff that isn't moving to the other server.
>>
>> But but but... the IP address is the only identification the client can use
>> to isolate a server. You're telling me that some locks will migrate and
>> some won't? Good luck with that! The clients are going to be mightily
>> confused.
>>
>
> Maybe I'm not being clear. My understanding is this:
>
> Right now, when we fail over we send a SIGKILL to lockd, and then send
> a SM_NOTIFY to all of the clients that the "victim" server has,
> regardless of what IP address the clients are talking to. So all locks
> get dropped and all clients should recover their locks. Since the
> service will fail over to the new host, locks that were in that export
> will get recovered on the "new" host.
>
> But, we just recently added this new "unlock_ip" interface. With that,
> we should be able to just send SM_NOTIFY's to clients of that IP
> address. Locks associated with that server address will be recovered
> and the others should be unaffected.

Maybe that's a little imprecise.

The failover_unlock_ip() API doesn't send any SM_NOTIFY calls at all.
It tells the server's NLM to drop all locks held by that IP, but
there's logic in nlmsvc_is_client() specifically to keep monitoring
these clients.  The SM_NOTIFY calls will come from user space, just to
be clear.

If this is truly a service migration, I would think that the old
server would want to stop monitoring these clients anyway.

> All of
> the NSM/NLM stuff here is really separate from the main problem I'm
> interested in at the moment, which is how to deal with the old, stale
> sockets that nfsd has open after the local address disappears.

IMO it's just the reverse: the main problem is how to do service
migration in a robust fashion; the bugs you are focused on right at
the moment are due to the fact the current migration strategy is
poorly designed.  The real issue is how do you fix your design, and
that's a lot bigger than addressing a few SYNs and ACKs.  I do not
believe there is going to be a simple network level fix here if you
want to prevent more corner cases.

I am still of the opinion that you can't do this without involvement
from the nfsd threads.  The old server is going to have to stop
accepting incoming connections during the failover period.  NetApp
found that it is not enough to drop a newly accepted connection
without having read any data -- that confuses some clients.  Your
server really does need to shut off the listener, in order to refuse
new connections.

I think this might be a new server state.  A bunch of nfsd threads
will exist and be processing NFS requests, but there will be no
listener.
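
A user-space analogy of that state, as a minimal sketch: nfsd's listener
lives in the kernel, so this only shows the socket-level behaviour being
relied on. Closing the listening socket leaves already-accepted
connections usable, while new connection attempts are refused outright
instead of being accepted and then dropped. The names stop_listening() and
demo() are invented for the example, and it runs over loopback only.

```python
import socket
from typing import Tuple


def stop_listening(listener: socket.socket) -> None:
    """Close only the listening socket; accepted connections stay open."""
    listener.close()


def demo() -> Tuple[bytes, bool]:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # loopback, kernel-chosen port
    listener.listen(8)
    port = listener.getsockname()[1]

    # One client is connected before the "failover" begins.
    old_client = socket.create_connection(("127.0.0.1", port))
    served, _ = listener.accept()

    stop_listening(listener)

    # The established connection keeps carrying requests...
    old_client.sendall(b"request")
    received = served.recv(16)

    # ...but a new connection attempt is refused by the TCP stack.
    refused = False
    try:
        socket.create_connection(("127.0.0.1", port), timeout=2).close()
    except ConnectionRefusedError:
        refused = True

    old_client.close()
    served.close()
    return received, refused
```

The point of the design is that the refusal comes from the network stack
itself, so the client never sees a connection that opens and then
vanishes.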

Then the old server can drain the doomed sockets and disconnect them
in an orderly manner.  This will prevent a lot of segment ordering
problems and keep network layer confusion about socket state to a
minimum.  It's a good idea to try to return any pending replies to
clients before closing the connection to reduce the likelihood of RPC
retransmits.  To prevent the clients from transmitting any new
requests, use a half-close (just close the receiving half of the
connection on the server).

Naturally this will have to be time-bounded because clients can be too
busy to read any remaining data off the socket, or could just be dead.
That shouldn't hold up your service migration event.
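
The drain-then-disconnect sequence can be sketched with ordinary sockets,
again as a user-space analogy and not as a description of how nfsd is
implemented. drain_and_close() is a hypothetical helper: it shuts down the
receiving half, flushes whatever replies are still pending, and gives up
once an overall time limit expires so that a slow or dead client cannot
stall the migration.

```python
import socket
import time
from typing import Iterable


def drain_and_close(conn: socket.socket,
                    pending_replies: Iterable[bytes],
                    limit_s: float) -> bool:
    """Half-close, flush pending replies, then close.

    Returns True if every reply was handed to the kernel before the time
    limit, False if the limit expired or the peer had already gone away.
    Note that a successful sendall() means the data was queued for
    sending, not that the client has read it.
    """
    deadline = time.monotonic() + limit_s
    flushed = True
    try:
        conn.shutdown(socket.SHUT_RD)        # accept no further requests
        for reply in pending_replies:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                flushed = False              # out of time; stop draining
                break
            conn.settimeout(remaining)       # bound this send as well
            conn.sendall(reply)
        if flushed:
            conn.shutdown(socket.SHUT_WR)    # orderly end of our side
    except OSError:                          # includes socket timeouts
        flushed = False
    finally:
        conn.close()
    return flushed
```

A caller would run this over every connection bound to the migrating
address and carry on with the failover whether or not each call returns
True.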

Any clients attempting to connect to the old server during failover
will be refused.  If they are trying to access legitimate NFS
resources that have not been migrated, they will retry connecting
later, so this really shouldn't be an issue.  Clients connecting to
the new server should be OK, but again, I think they should be fenced
from the old server's file system until the old server has finished
processing any pending requests from clients that are being migrated
to the new server.

When failover is complete, the old server can start accepting new TCP
connections again.  Clients connecting to the old server looking for
migrated resources should get something like ESTALE ("These are not
the file handles you are looking for.").

In this way, the server is in control over the migration, and isn't
depending on any wonky TCP behavior to make it happen correctly.  It's
using entirely legitimate features of the socket interface to move
each client through the necessary states of migration.

Now that the network connections are figured out, your servers can
start worrying about recovering NLM, NSM, and DRC state.

-- 
I am certain that these presidents will understand the cry of the
people of Bolivia, of the people of Latin America and the whole world,
which wants to have more food and not more cars. First food, then if
something's left over, more cars, more automobiles. I think that life
has to come first.
-- Evo Morales

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 19:36               ` Chuck Lever
@ 2008-06-09 20:11                 ` Jeff Layton
  2008-06-09 20:56                   ` Chuck Lever
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 20:11 UTC (permalink / raw)
  To: chucklever; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 9 Jun 2008 15:36:18 -0400
"Chuck Lever" <chuck.lever@oracle.com> wrote:

> On Mon, Jun 9, 2008 at 12:22 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Mon, 09 Jun 2008 12:09:48 -0400
> > "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> >
> >> At 12:01 PM 6/9/2008, Jeff Layton wrote:
> >> >On Mon, 09 Jun 2008 11:51:51 -0400
> >> >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> >> >
> >> >> At 11:18 AM 6/9/2008, Jeff Layton wrote:
> >> >> >No, it's not specific to NFS. It can happen to any "service" that
> >> >> >floats IP addresses between machines, but does not close the sockets
> >> >> >that are connected to those addresses. Most services that fail over
> >> >> >(at least in RH's cluster server) shut down the daemons on failover
> >> >> >too, so tends to mitigate this problem elsewhere.
> >> >>
> >> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> >> >> victim server?
> >> >
> >> >The victim server might have other nfsd/lockd's running on them. Stopping
> >> >all the nfsd's could bring down lockd, and then you have to deal with lock
> >> >recovery on the stuff that isn't moving to the other server.
> >>
> >> But but but... the IP address is the only identification the client can use
> >> to isolate a server. You're telling me that some locks will migrate and
> >> some won't? Good luck with that! The clients are going to be mightily
> >> confused.
> >>
> >
> > Maybe I'm not being clear. My understanding is this:
> >
> > Right now, when we fail over we send a SIGKILL to lockd, and then send
> > a SM_NOTIFY to all of the clients that the "victim" server has,
> > regardless of what IP address the clients are talking to. So all locks
> > get dropped and all clients should recover their locks. Since the
> > service will fail over to the new host, locks that were in that export
> > will get recovered on the "new" host.
> >
> > But, we just recently added this new "unlock_ip" interface. With that,
> > we should be able to just send SM_NOTIFY's to clients of that IP
> > address. Locks associated with that server address will be recovered
> > and the others should be unaffected.
> 
> Maybe that's a little imprecise.
> 
> The failover_unlock_ip() API doesn't send any SM_NOTIFY calls at all,
> it tells the server's NLM to drop all locks held by that IP, but
> there's logic in nlmsvc_is_client() specifically to keep monitoring
> these clients.  The SM_NOTIFY calls will come from user space, just to
> be clear.
> 
> If this is truly a service migration, I would think that the old
> server would want to stop monitoring these clients anyway.
> 
> > All of
> > the NSM/NLM stuff here is really separate from the main problem I'm
> > interested in at the moment, which is how to deal with the old, stale
> > sockets that nfsd has open after the local address disappears.
> 
> IMO it's just the reverse: the main problem is how to do service
> migration in a robust fashion; the bugs you are focused on right at
> the moment are due to the fact the current migration strategy is
> poorly designed.  The real issue is how do you fix your design, and
> that's a lot bigger than addressing a few SYNs and ACKs.  I do not
> believe there is going to be a simple network level fix here if you
> want to prevent more corner cases.
> 
> I am still of the opinion that you can't do this without involvement
> from the nfsd threads.  The old server is going to have to stop
> accepting incoming connections during the failover period.  NetApp
> found that it is not enough to drop a newly accepted connection
> without having read any data -- that confuses some clients.  Your
> server really does need to shut off the listener, in order to refuse
> new connections.
> 

I'm not sure I follow your logic here. The first thing that happens
when failover occurs is that the IP address is removed from the
interface. This prevents new connections on that IP address (and new
packets for existing connections for that matter). Why would this not
be sufficient to prevent new activity on those sockets?

> I think this might be a new server state.  A bunch of nfsd threads
> will exist and be processing NFS requests, but there will be no
> listener.
> 
> Then the old server can drain the doomed sockets and disconnect them
> in an orderly manner.  This will prevent a lot of segment ordering
> problems and keep network layer confusion about socket state to a
> minimum.  It's a good idea to try to return any pending replies to
> clients before closing the connection to reduce the likelihood of RPC
> retransmits.  To prevent the clients from transmitting any new
> requests, use a half-close (just close the receiving half of the
> connection on the server).
> 

Ahh ok. So you're thinking that we need to keep the IP address in place
so that we can send replies for RPC's that are still in progress? That
makes sense.

I suppose that instead of shutting down the listener altogether, we
could just have the listener refuse connections for the given destination
address. That's probably simpler and would mean less disruption for exports
on other IP addrs.

That said, if we assume we want to use the unlock_ip interface then
there's a potential race between writing to unlock_ip and taking down
the address. I'll have to think about how to deal with that. Maybe some
sort of three-stage teardown:

1) refuse new connections for the IP address, drain the RPC queues,
   half close sockets

2) remove the address from the interface

3) close sockets the rest of the way, stop refusing connections

Then again, we might actually be better off restarting nfsd instead. It's
certainly simpler...

> Naturally this will have to be time-bounded because clients can be too
> busy to read any remaining data off the socket, or could just be dead.
>  That shouldn't hold up your service migration event.
> 

Definitely.

> Any clients attempting to connect to the old server during failover
> will be refused.  If they are trying to access legitimate NFS
> resources that have not been migrated, they will retry connecting
> later, so this really shouldn't be an issue.  Clients connecting to
> the new server should be OK, but again, I think they should be fenced
> from the old server's file system until the old server has finished
> processing any pending requests from clients that are being migrated
> to the new server.
> 
> When failover is complete, the old server can start accepting new TCP
> connections again.  Clients connecting to the old server looking for
> migrated resources should get something like ESTALE ("These are not
> the file handles you are looking for.").
> 

I think we return -EACCES or something (whatever you get when you try to
access something that isn't exported). We remove the export from the
exports table when we fail over.

> In this way, the server is in control over the migration, and isn't
> depending on any wonky TCP behavior to make it happen correctly.  It's
> using entirely legitimate features of the socket interface to move
> each client through the necessary states of migration.
> 
> Now that the network connections are figured out, your servers can
> start worrying about recovering NLM, NSM, and DRC state.
> 
-- 
Jeff Layton <jlayton@redhat.com>

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 20:11                 ` Jeff Layton
@ 2008-06-09 20:56                   ` Chuck Lever
  0 siblings, 0 replies; 38+ messages in thread
From: Chuck Lever @ 2008-06-09 20:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, chucklever, lhh, nfsv4, nhorman

On Jun 9, 2008, at 4:11 PM, Jeff Layton wrote:
> On Mon, 9 Jun 2008 15:36:18 -0400
> "Chuck Lever" <chuck.lever@oracle.com> wrote:
>> On Mon, Jun 9, 2008 at 12:22 PM, Jeff Layton <jlayton@redhat.com> wrote:
>>> On Mon, 09 Jun 2008 12:09:48 -0400
>>> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>>>
>>>> At 12:01 PM 6/9/2008, Jeff Layton wrote:
>>>>> On Mon, 09 Jun 2008 11:51:51 -0400
>>>>> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>>>>>
>>>>>> At 11:18 AM 6/9/2008, Jeff Layton wrote:
>>>>>>> No, it's not specific to NFS. It can happen to any "service" that
>>>>>>> floats IP addresses between machines, but does not close the sockets
>>>>>>> that are connected to those addresses. Most services that fail over
>>>>>>> (at least in RH's cluster server) shut down the daemons on failover
>>>>>>> too, so tends to mitigate this problem elsewhere.
>>>>>>
>>>>>> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
>>>>>> victim server?
>>>>>
>>>>> The victim server might have other nfsd/lockd's running on them. Stopping
>>>>> all the nfsd's could bring down lockd, and then you have to deal with lock
>>>>> recovery on the stuff that isn't moving to the other server.
>>>>
>>>> But but but... the IP address is the only identification the client can use
>>>> to isolate a server. You're telling me that some locks will migrate and
>>>> some won't? Good luck with that! The clients are going to be mightily
>>>> confused.
>>>>
>>>
>>> Maybe I'm not being clear. My understanding is this:
>>>
>>> Right now, when we fail over we send a SIGKILL to lockd, and then send
>>> a SM_NOTIFY to all of the clients that the "victim" server has,
>>> regardless of what IP address the clients are talking to. So all locks
>>> get dropped and all clients should recover their locks. Since the
>>> service will fail over to the new host, locks that were in that export
>>> will get recovered on the "new" host.
>>>
>>> But, we just recently added this new "unlock_ip" interface. With that,
>>> we should be able to just send SM_NOTIFY's to clients of that IP
>>> address. Locks associated with that server address will be recovered
>>> and the others should be unaffected.
>>
>> Maybe that's a little imprecise.
>>
>> The failover_unlock_ip() API doesn't send any SM_NOTIFY calls at all,
>> it tells the server's NLM to drop all locks held by that IP, but
>> there's logic in nlmsvc_is_client() specifically to keep monitoring
>> these clients.  The SM_NOTIFY calls will come from user space, just to
>> be clear.
>>
>> If this is truly a service migration, I would think that the old
>> server would want to stop monitoring these clients anyway.
>>
>>> All of
>>> the NSM/NLM stuff here is really separate from the main problem I'm
>>> interested in at the moment, which is how to deal with the old, stale
>>> sockets that nfsd has open after the local address disappears.
>>
>> IMO it's just the reverse: the main problem is how to do service
>> migration in a robust fashion; the bugs you are focused on right at
>> the moment are due to the fact the current migration strategy is
>> poorly designed.  The real issue is how do you fix your design, and
>> that's a lot bigger than addressing a few SYNs and ACKs.  I do not
>> believe there is going to be a simple network level fix here if you
>> want to prevent more corner cases.
>>
>> I am still of the opinion that you can't do this without involvement
>> from the nfsd threads.  The old server is going to have to stop
>> accepting incoming connections during the failover period.  NetApp
>> found that it is not enough to drop a newly accepted connection
>> without having read any data -- that confuses some clients.  Your
>> server really does need to shut off the listener, in order to refuse
>> new connections.
>>
>
> I'm not sure I follow your logic here. The first thing that happens
> when failover occurs is that the IP address is removed from the
> interface. This prevents new connections on that IP address (and new
> packets for existing connections for that matter). Why would this not
> be sufficient to prevent new activity on those sockets?

Because precisely the situation that you have observed occurs.  The  
clients and servers get confused about network state because the TCP  
connections weren't properly shut down.

>> I think this might be a new server state.  A bunch of nfsd threads
>> will exist and be processing NFS requests, but there will be no
>> listener.
>>
>> Then the old server can drain the doomed sockets and disconnect them
>> in an orderly manner.  This will prevent a lot of segment ordering
>> problems and keep network layer confusion about socket state to a
>> minimum.  It's a good idea to try to return any pending replies to
>> clients before closing the connection to reduce the likelihood of RPC
>> retransmits.  To prevent the clients from transmitting any new
>> requests, use a half-close (just close the receiving half of the
>> connection on the server).
>
> Ahh ok. So you're thinking that we need to keep the IP address in place
> so that we can send replies for RPC's that are still in progress? That
> makes sense.

There is a part of NFSD failover that must occur before the IP address  
is taken down.  Otherwise you orphan NFS requests on the server and  
TCP segments in the network.

You have an opportunity, during service migration, to shut down the  
old service gracefully so that you greatly reduce the risk of data  
loss or corruption.

> I suppose that instead of shutting down the listener altogether, we
> could just have the listener refuse connections for the given destination
> address.

I didn't think you could do that to an active listener.  Even if you  
could, NFSD would depend on the specifics of the network layer  
implementation to disallow races or partially connected states while  
the listener socket was transitioning.

Since these events are rare compared to RPC requests and new
connections, I would think it wouldn't matter if the listener wasn't
available for a brief period.  What matters is that the service shutdown
on the old server is clean and orderly.

> That said, if we assume we want to use the unlock_ip interface then
> there's a potential race between writing to unlock_ip and taking down
> the address. I'll have to think about how to deal with that maybe some
> sort of 3 stage teardown:
>
> 1) refuse new connections for the IP address, drain the RPC queues,
>   half close sockets
>
> 2) remove the address from the interface
>
> 3) close sockets the rest of the way, stop refusing connections

I'm not sure what you accomplish with "close sockets the rest of the  
way" after you have removed the address from the interface?

The NFSDs should gracefully destroy all resources on that address  
before you remove the address from the interface.  Closing the sockets  
properly means that both halves of the connection duplex have an  
opportunity to go through the FIN,ACK dance.

There is still a risk that things will get confused.  But in most  
normal cases, this is enough to ensure an orderly transition of the  
network connections.
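
The ordering argued for above can be written down as a small sketch. Every
callback here (stop_listener, close_sockets, remove_address,
start_listener) is a hypothetical hook standing in for whatever the
cluster software and nfsd would really provide; the only thing the code
asserts is the sequence: sockets on the migrating address are closed
completely before the address leaves the interface, and the listener comes
back afterwards.

```python
from typing import Callable, List


def migrate_address(ip: str,
                    stop_listener: Callable[[], None],
                    close_sockets: Callable[[str], None],
                    remove_address: Callable[[str], None],
                    start_listener: Callable[[], None]) -> List[str]:
    """Run the teardown steps in order and return the steps completed."""
    done: List[str] = []

    stop_listener()                  # refuse new connections first
    done.append("stop_listener")
    try:
        close_sockets(ip)            # drain and fully close (FIN/ACK)
        done.append("close_sockets")
        remove_address(ip)           # only now take the address down
        done.append("remove_address")
    finally:
        start_listener()             # resume service for what stays behind
        done.append("start_listener")
    return done
```

Restarting the listener in a finally block is a design choice of this
sketch: services that are not migrating should come back even if one of
the middle steps fails.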

>> Naturally this will have to be time-bounded because clients can be too
>> busy to read any remaining data off the socket, or could just be dead.
>> That shouldn't hold up your service migration event.
>
> Definitely.
>
>> Any clients attempting to connect to the old server during failover
>> will be refused.  If they are trying to access legitimate NFS
>> resources that have not been migrated, they will retry connecting
>> later, so this really shouldn't be an issue.  Clients connecting to
>> the new server should be OK, but again, I think they should be fenced
>> from the old server's file system until the old server has finished
>> processing any pending requests from clients that are being migrated
>> to the new server.
>>
>> When failover is complete, the old server can start accepting new TCP
>> connections again.  Clients connecting to the old server looking for
>> migrated resources should get something like ESTALE ("These are not
>> the file handles you are looking for.").
>
> I think we return -EACCES or something (whatever you get when you try to
> access something that isn't exported). We remove the export from the
> exports table when we fail over.

>> In this way, the server is in control over the migration, and isn't
>> depending on any wonky TCP behavior to make it happen correctly.  It's
>> using entirely legitimate features of the socket interface to move
>> each client through the necessary states of migration.
>>
>> Now that the network connections are figured out, your servers can
>> start worrying about recovering NLM, NSM, and DRC state.
>>
> -- 
> Jeff Layton <jlayton@redhat.com>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:03   ` Peter Staubach
  2008-06-09 15:18     ` Jeff Layton
@ 2008-06-09 15:23     ` Neil Horman
  2008-06-09 15:37       ` Peter Staubach
                         ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 15:23 UTC (permalink / raw)
  To: Peter Staubach; +Cc: linux-nfs, lhh, nfsv4, nhorman, Jeff Layton

On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
> This doesn't sound like it would be an NFS specific situation.
> Why doesn't TCP handle this, without causing an ACK storm?
> 

You're right, it's not a problem specific to NFS; any TCP-based service in which
sockets are not explicitly closed by the application is subject to this
problem.  However, I think NFS is currently the only clustered service that we
offer in which we explicitly leave nfsd running during such a 'soft' failover,
and so practically speaking, this is the only place that this issue manifests
itself.  If we could shut down nfsd on the server doing a failover, that would
solve this problem (as it prevents the problem with all other clustered TCP-based
services), but from what I'm told, that's a non-starter.

As for why TCP doesn't handle this, that's because the situation is ambiguous from
the point of view of the client and server.  The write-up in the bugzilla has
all the gory details, but the executive summary is that during rapid failover,
the client will ack some data to server A in the cluster, and some to server B
in the cluster.  If you quickly fail over and back between the servers in the
cluster, each server will see some gaps in the data stream sequence numbers, but
the client will see that all data has been acked.  This leaves the connection in
an unrecoverable state.
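
To make the stalemate concrete, here is a toy Python model of two endpoints whose idea of the sequence space has diverged. It is a deliberate simplification: it ignores windows, wraparound and payload length, and only encodes the basic TCP rule that a segment with an unacceptable sequence number is answered with a bare ACK carrying the receiver's own state. The names `Endpoint` and `exchange` and all sequence numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Endpoint:
    name: str
    snd_nxt: int  # next sequence number this side will send
    rcv_nxt: int  # next sequence number this side expects to receive

    def acceptable(self, seq: int, ack: int) -> bool:
        # Simplified check: no window, no wraparound, no payload length.
        return seq == self.rcv_nxt and ack <= self.snd_nxt

    def bare_ack(self) -> Tuple[int, int]:
        # An empty segment advertising this side's view of the connection.
        return (self.snd_nxt, self.rcv_nxt)


def exchange(client: Endpoint, server: Endpoint, max_segments: int = 10) -> int:
    """Count the bare ACKs bounced back and forth after one client segment.

    Stops early as soon as either side accepts what it receives.
    """
    receiver, other = server, client
    seq, ack = client.bare_ack()
    bounced = 0
    while bounced < max_segments:
        if receiver.acceptable(seq, ack):
            break
        # Unacceptable segment: reply with our own state, which the peer
        # will in turn find unacceptable.
        seq, ack = receiver.bare_ack()
        receiver, other = other, receiver
        bounced += 1
    return bounced
```

With matching state, for example `exchange(Endpoint("client", 100, 500), Endpoint("server", 500, 100))`, nothing bounces. With a stale server socket, for example `Endpoint("server A", 5000, 1000)` against a client at `Endpoint("client", 9000, 7000)`, the exchange runs until the `max_segments` cap, which is the model's stand-in for the ACK storm.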

Regards
Neil
  
>    Thanx...
> 
>       ps

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:23     ` Neil Horman
@ 2008-06-09 15:37       ` Peter Staubach
  2008-06-09 15:49         ` Jeff Layton
  2008-06-09 16:04         ` Neil Horman
  2008-06-09 15:46       ` Chuck Lever
  2008-06-09 16:00       ` Peter Staubach
  2 siblings, 2 replies; 38+ messages in thread
From: Peter Staubach @ 2008-06-09 15:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, Jeff Layton

Neil Horman wrote:
> On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
>   
>> This doesn't sound like it would be an NFS specific situation.
>> Why doesn't TCP handle this, without causing an ACK storm?
>>
>>     
>
> You're right, it's not a problem specific to NFS; any TCP-based service in which
> sockets are not explicitly closed by the application is subject to this
> problem.  However, I think NFS is currently the only clustered service that we
> offer in which we explicitly leave nfsd running during such a 'soft' failover,
> and so practically speaking, this is the only place that this issue manifests
> itself.  If we could shut down nfsd on the server doing a failover, that would
> solve this problem (as it prevents the problem with all other clustered TCP-based
> services), but from what I'm told, that's a non-starter.
>
>   

I think that this last would be a good thing to pursue anyway,
or at least be able to understand why it would be considered to
be a "non-starter".  When failing away a service, why not stop
the service on the original node?

These floating virtual IP and ARP games can get tricky to handle
in boundary cases like this one.

> As for why TCP doesn't handle this, that's because the situation is ambiguous from
> the point of view of the client and server.  The write-up in the bugzilla has
> all the gory details, but the executive summary is that during rapid failover,
> the client will ack some data to server A in the cluster, and some to server B
> in the cluster.  If you quickly fail over and back between the servers in the
> cluster, each server will see some gaps in the data stream sequence numbers, but
> the client will see that all data has been acked.  This leaves the connection in
> an unrecoverable state.

I would wonder what happens if we stick some other NFS/RPC/TCP/IP
implementation into the situation.  I wonder if it would see and
generate the same situation?

       ps

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:37       ` Peter Staubach
@ 2008-06-09 15:49         ` Jeff Layton
       [not found]           ` <20080609114909.131cfaef-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2008-06-09 16:04         ` Neil Horman
  1 sibling, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 15:49 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Neil Horman, lhh, nfsv4, linux-nfs

On Mon, 09 Jun 2008 11:37:27 -0400
Peter Staubach <staubach@redhat.com> wrote:

> Neil Horman wrote:
> > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
> >   
> >> This doesn't sound like it would be an NFS specific situation.
> >> Why doesn't TCP handle this, without causing an ACK storm?
> >>
> >>     
> >
> > You're right, it's not a problem specific to NFS; any TCP-based service in which
> > sockets are not explicitly closed by the application is subject to this
> > problem.  However, I think NFS is currently the only clustered service that we
> > offer in which we explicitly leave nfsd running during such a 'soft' failover,
> > and so practically speaking, this is the only place that this issue manifests
> > itself.  If we could shut down nfsd on the server doing a failover, that would
> > solve this problem (as it prevents the problem with all other clustered TCP-based
> > services), but from what I'm told, that's a non-starter.
> >
> >   
> 
> I think that this last would be a good thing to pursue anyway,
> or at least be able to understand why it would be considered to
> be a "non-starter".  When failing away a service, why not stop
> the service on the original node?
> 

Suppose you have more than one "NFS service". People do occasionally set
up NFS exports in separate services. Also, there's the possibility of a
mix of clustered + non-clustered exports. So shutting down nfsd could
disrupt NFS services on any IP addresses that remain on the box.

That said, we could maybe shut down nfsd and trust that retransmissions
will take care of the problem. That could be racy though.
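
The per-address alternative (the "close_ip" idea from the original message, which is only a proposal in this thread and not an existing interface) can be sketched in user space as follows. This is a minimal Python illustration under that assumption: the helper name `close_by_local_ip` is hypothetical, and real nfsd connections live in the kernel, not in a Python list.

```python
import socket
from typing import Iterable, List


def close_by_local_ip(conns: Iterable[socket.socket],
                      local_ip: str) -> List[socket.socket]:
    """Close connections whose local address is local_ip.

    Returns the connections that were left open, so services bound to
    other addresses on the same box keep running undisturbed.
    """
    survivors = []
    for conn in conns:
        # For AF_INET sockets getsockname() returns (address, port).
        if conn.getsockname()[0] == local_ip:
            conn.close()
        else:
            survivors.append(conn)
    return survivors
```

The point of filtering on the local address rather than stopping the daemon is exactly the one made above: clustered and non-clustered exports on other addresses are not touched.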

> These floating virtual IP and ARP games can get tricky to handle
> in boundary cases like this one.
> 
> > As for why TCP doesn't handle this, that's because the situation is ambiguous from
> > the point of view of the client and server.  The write-up in the bugzilla has
> > all the gory details, but the executive summary is that during rapid failover,
> > the client will ack some data to server A in the cluster, and some to server B
> > in the cluster.  If you quickly fail over and back between the servers in the
> > cluster, each server will see some gaps in the data stream sequence numbers, but
> > the client will see that all data has been acked.  This leaves the connection in
> > an unrecoverable state.
> 
> I would wonder what happens if we stick some other NFS/RPC/TCP/IP
> implementation into the situation.  I wonder if it would see and
> generate the same situation?
> 

Assuming you mean changing the client to a different sort of OS, then yes, I
think the same thing would likely happen unless it has some mechanism to
break out of an ACK storm like this.

-- 
Jeff Layton <jlayton@redhat.com>

[parent not found: <20080609114909.131cfaef-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
       [not found]           ` <20080609114909.131cfaef-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 16:01             ` Chuck Lever
  0 siblings, 0 replies; 38+ messages in thread
From: Chuck Lever @ 2008-06-09 16:01 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Peter Staubach, Neil Horman, linux-nfs, nfsv4, lhh

On Mon, Jun 9, 2008 at 11:49 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Mon, 09 Jun 2008 11:37:27 -0400
> Peter Staubach <staubach@redhat.com> wrote:
>
>> Neil Horman wrote:
>> > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
>> >
>> >> This doesn't sound like it would be an NFS specific situation.
>> >> Why doesn't TCP handle this, without causing an ACK storm?
>> >>
>> >>
>> >
>> > You're right, it's not a problem specific to NFS; any TCP-based service in which
>> > sockets are not explicitly closed by the application is subject to this
>> > problem.  However, I think NFS is currently the only clustered service that we
>> > offer in which we explicitly leave nfsd running during such a 'soft' failover,
>> > and so practically speaking, this is the only place that this issue manifests
>> > itself.  If we could shut down nfsd on the server doing a failover, that would
>> > solve this problem (as it prevents the problem with all other clustered TCP-based
>> > services), but from what I'm told, that's a non-starter.
>> >
>> >
>>
>> I think that this last would be a good thing to pursue anyway,
>> or at least be able to understand why it would be considered to
>> be a "non-starter".  When failing away a service, why not stop
>> the service on the original node?
>>
>
> Suppose you have more than one "NFS service". People do occasionally set
> up NFS exports in separate services. Also, there's the possibility of a
> mix of clustered + non-clustered exports. So shutting down nfsd could
> disrupt NFS services on any IP addresses that remain on the box.
>
> That said, we could maybe shut down nfsd and trust that retransmissions
> will take care of the problem. That could be racy though.

In that case, it might make sense to have an nfsd-specific mechanism
that allows you to fence exports instead of whole servers.

-- 
I am certain that these presidents will understand the cry of the
people of Bolivia, of the people of Latin America and the whole world,
which wants to have more food and not more cars. First food, then if
something's left over, more cars, more automobiles. I think that life
has to come first.
-- Evo Morales

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:37       ` Peter Staubach
  2008-06-09 15:49         ` Jeff Layton
@ 2008-06-09 16:04         ` Neil Horman
  1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 16:04 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Neil Horman, Jeff Layton, linux-nfs, nfsv4, lhh

On Mon, Jun 09, 2008 at 11:37:27AM -0400, Peter Staubach wrote:
> Neil Horman wrote:
> >On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
> >  
> >>This doesn't sound like it would be an NFS specific situation.
> >>Why doesn't TCP handle this, without causing an ACK storm?
> >>
> >>    
> >
> >You're right, it's not a problem specific to NFS; any TCP-based service in which
> >sockets are not explicitly closed by the application is subject to this
> >problem.  However, I think NFS is currently the only clustered service that we
> >offer in which we explicitly leave nfsd running during such a 'soft' failover,
> >and so practically speaking, this is the only place that this issue manifests
> >itself.  If we could shut down nfsd on the server doing a failover, that would
> >solve this problem (as it prevents the problem with all other clustered TCP-based
> >services), but from what I'm told, that's a non-starter.
> >
> >  
> 
> I think that this last would be a good thing to pursue anyway,
> or at least be able to understand why it would be considered to
> be a "non-starter".  When failing away a service, why not stop
> the service on the original node?
> 
> These floating virtual IP and ARP games can get tricky to handle
> in boundary cases like this one.
> 
> >As for why TCP doesn't handle this, that's because the situation is ambiguous from
> >the point of view of the client and server.  The write-up in the bugzilla has
> >all the gory details, but the executive summary is that during rapid failover,
> >the client will ack some data to server A in the cluster, and some to server B
> >in the cluster.  If you quickly fail over and back between the servers in the
> >cluster, each server will see some gaps in the data stream sequence numbers, but
> >the client will see that all data has been acked.  This leaves the connection in
> >an unrecoverable state.
> 
> I would wonder what happens if we stick some other NFS/RPC/TCP/IP
> implementation into the situation.  I wonder if it would see and
> generate the same situation?
> 
>       ps
I can only imagine it would.  The problem doesn't stem from any particular
idiosyncrasy in the provided nfsd, but rather from the fact that nfsd is kept
running on both servers between failovers.

Neil


-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:23     ` Neil Horman
  2008-06-09 15:37       ` Peter Staubach
@ 2008-06-09 15:46       ` Chuck Lever
  2008-06-09 16:00       ` Peter Staubach
  2 siblings, 0 replies; 38+ messages in thread
From: Chuck Lever @ 2008-06-09 15:46 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, Jeff Layton

On Mon, Jun 9, 2008 at 11:23 AM, Neil Horman <nhorman@redhat.com> wrote:
> On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
>> Jeff Layton wrote:
>> >Apologies for the long email, but I ran into an interesting problem the
>> >other day and am looking for some feedback on my general approach to
>> >fixing it before I spend too much time on it:
>> >
>> >We (RH) have a cluster-suite product that some people use for making HA
>> >NFS services. When our QA folks test this, they often will start up
>> >some operations that do activity on an NFS mount from the cluster and
>> >then rapidly do failovers between cluster machines and make sure
>> >everything keeps moving along. The cluster is designed to not shut down
>> >nfsd's when a failover occurs. nfsd's are considered a "shared
>> >resource". It's possible that there could be multiple clustered
>> >services for NFS-sharing, so when a failover occurs, we just manipulate
>> >the exports table.
>> >
>> >The problem we've run into is that occasionally they fail over to the
>> >alternate machine and then back very rapidly. Because nfsd's are not
>> >shut down on failover, sockets are not closed. So what happens is
>> >something like this on TCP mounts:
>> >
>> >- client has NFS mount from clustered NFS service on one server
>> >
>> >- service fails over, new server doesn't know anything about the
>> >  existing socket, so it sends a RST back to the client when data
>> >  comes in. Client closes connection and reopens it and does some
>> >  I/O on the socket.
>> >
>> >- service fails back to original server. The original socket there
>> >  is still open, but now the TCP sequence numbers are off. When
>> >  packets come into the server we end up with an ACK storm, and the
>> >  client hangs for a long time.
>> >
>> >Neil Horman did a good writeup of this problem here for those that
>> >want the gory details:
>> >
>> >    https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
>> >
>> >I can think of 3 ways to fix this:
>> >
>> >1) Add something like the recently added "unlock_ip" interface that
>> >was added for NLM. Maybe a "close_ip" that allows us to close all
>> >nfsd sockets connected to a given local IP address. So clustering
>> >software could do something like:
>> >
>> >    # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>> >
>> >...and make sure that all of the sockets are closed.
>> >
>> >2) just use the same "unlock_ip" interface and just have it also
>> >close sockets in addition to dropping locks.
>> >
>> >3) have an nfsd close all non-listening connections when it gets a
>> >certain signal (maybe SIGUSR1 or something). Connections on a
>> >sockets that aren't failing over should just get a RST and would
>> >reopen their connections.
>> >
>> >...my preference would probably be approach #1.
>> >
>> >I've only really done some rudimentary perusing of the code, so there
>> >may be roadblocks with some of these approaches I haven't considered.
>> >Does anyone have thoughts on the general problem or idea for a solution?
>> >
>> >The situation is a bit specific to failover testing -- most people failing
>> >over don't do it so rapidly, but we'd still like to ensure that this
>> >problem doesn't occur if someone does do it.
>> >
>> >Thanks,
>> >
>>
>> This doesn't sound like it would be an NFS specific situation.
>> Why doesn't TCP handle this, without causing an ACK storm?

The NetApp guys can tell you all kinds of horror stories about filer
cluster failover and TCP.

The servers must stop responding to client requests and to client
connection attempts during the failover.  Some clients are not smart
enough to delay their reconnect attempt and will hammer the server
until it finally responds.  That is probably part of the reason for
the "ACK storm".

You also have a problem with what to do about your server's DRC.
During the failover, some requests may get through to the failing
server, and may be executed and retired, but the reply never gets back
to the client because the socket is torn down.

So the best bet for something like this, if you can't shut down the
nfsd, is to fence the failing server from the network and from
back-end storage.  Something like iptables will not be adequate to
handle the NFS/RPC idempotency issues.
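
The DRC concern above can be illustrated with a toy model. This is
only a sketch, not how nfsd's reply cache is implemented: the Server
class, the (client, xid) cache key and the simplified "remove"
operation are all assumptions made for the example. A non-idempotent
request executes on server A, its reply is lost, and the retransmission
lands on server B, which has no cache entry for it.

```python
class Server:
    """Toy NFS server with its own duplicate reply cache (DRC)."""

    def __init__(self, files):
        self.files = files      # shared back-end storage (a set of names)
        self.drc = {}           # (client, xid) -> cached reply

    def remove(self, client, xid, name):
        key = (client, xid)
        if key in self.drc:     # retransmission: replay the cached reply
            return self.drc[key]
        if name in self.files:
            self.files.remove(name)
            reply = "OK"
        else:
            reply = "ENOENT"
        self.drc[key] = reply
        return reply


storage = {"a.txt"}             # both servers see the same storage
server_a = Server(storage)
server_b = Server(storage)

# The request executes on A, but the reply never reaches the client.
lost_reply = server_a.remove("client1", 42, "a.txt")

# The client retransmits the same XID; after failover it reaches B,
# which re-executes the call against storage that already changed.
retry_on_b = server_b.remove("client1", 42, "a.txt")

# Had the retry reached A instead, its DRC would have replayed the reply.
retry_on_a = server_a.remove("client1", 42, "a.txt")

print(lost_reply, retry_on_b, retry_on_a)
```

In this model the client is told the file does not exist even though
its own request removed it, which is the kind of result that packet
filtering alone does nothing to prevent.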

> You're right, its not a problem specific to NFS, any TCP based service in which
> sockets are not explicitly closed on the application are subject to this
> problem.  however, I think NFS is currently the only clustered service that we
> offer in which we explicitly leave nfsd running during such a 'soft' failover,
> and so practically speaking, this is the only place that this issue manifests
> itself.  If we could shut down nfsd on the server doing a failover, that would
> solve this problem (as it prevents the problem with all other clustered tcp
> based services), but from what I'm told, thats a non-starter.
>
> As for why TCP doesnt handle this, thats because the situation is ambiguous from
> the point of view of the client and server.  The write up in the bugzilla has
> all the gory details, but the executive summary is that during rapid failover,
> the client will ack some data to server A in the cluster, and some to server B
> in the cluster.  If you quickly fail over and back between the servers in the
> cluster, each server will see some gaps in the data stream sequence numbers, but
> the client will see that all data has been acked.  This leaves the connection in
> an unrecoverable state.

-- 
I am certain that these presidents will understand the cry of the
people of Bolivia, of the people of Latin America and the whole world,
which wants to have more food and not more cars. First food, then if
something's left over, more cars, more automobiles. I think that life
has to come first.
-- Evo Morales
_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:23     ` Neil Horman
  2008-06-09 15:37       ` Peter Staubach
  2008-06-09 15:46       ` Chuck Lever
@ 2008-06-09 16:00       ` Peter Staubach
  2008-06-09 16:24         ` Neil Horman
  2 siblings, 1 reply; 38+ messages in thread
From: Peter Staubach @ 2008-06-09 16:00 UTC (permalink / raw)
  To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, Jeff Layton

Neil Horman wrote:
> On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
>   
>> Jeff Layton wrote:
>>     
>>> Apologies for the long email, but I ran into an interesting problem the
>>> other day and am looking for some feedback on my general approach to
>>> fixing it before I spend too much time on it:
>>>
>>> We (RH) have a cluster-suite product that some people use for making HA
>>> NFS services. When our QA folks test this, they often will start up
>>> some operations that do activity on an NFS mount from the cluster and
>>> then rapidly do failovers between cluster machines and make sure
>>> everything keeps moving along. The cluster is designed to not shut down
>>> nfsd's when a failover occurs. nfsd's are considered a "shared
>>> resource". It's possible that there could be multiple clustered
>>> services for NFS-sharing, so when a failover occurs, we just manipulate
>>> the exports table.
>>>
>>> The problem we've run into is that occasionally they fail over to the
>>> alternate machine and then back very rapidly. Because nfsd's are not
>>> shut down on failover, sockets are not closed. So what happens is
>>> something like this on TCP mounts:
>>>
>>> - client has NFS mount from clustered NFS service on one server
>>>
>>> - service fails over, new server doesn't know anything about the
>>>  existing socket, so it sends a RST back to the client when data
>>>  comes in. Client closes connection and reopens it and does some
>>>  I/O on the socket.
>>>
>>> - service fails back to original server. The original socket there
>>>  is still open, but now the TCP sequence numbers are off. When
>>>  packets come into the server we end up with an ACK storm, and the
>>>  client hangs for a long time.
>>>
>>> Neil Horman did a good writeup of this problem here for those that
>>> want the gory details:
>>>
>>>    https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
>>>
>>> I can think of 3 ways to fix this:
>>>
>>> 1) Add something like the recently added "unlock_ip" interface that
>>> was added for NLM. Maybe a "close_ip" that allows us to close all
>>> nfsd sockets connected to a given local IP address. So clustering
>>> software could do something like:
>>>
>>>    # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>>>
>>> ...and make sure that all of the sockets are closed.
>>>
>>> 2) just use the same "unlock_ip" interface and just have it also
>>> close sockets in addition to dropping locks.
>>>
>>> 3) have an nfsd close all non-listening connections when it gets a
>>> certain signal (maybe SIGUSR1 or something). Connections on a
>>> sockets that aren't failing over should just get a RST and would
>>> reopen their connections.
>>>
>>> ...my preference would probably be approach #1.
>>>
>>> I've only really done some rudimentary perusing of the code, so there
>>> may be roadblocks with some of these approaches I haven't considered.
>>> Does anyone have thoughts on the general problem or idea for a solution?
>>>
>>> The situation is a bit specific to failover testing -- most people failing
>>> over don't do it so rapidly, but we'd still like to ensure that this
>>> problem doesn't occur if someone does do it.
>>>
>>> Thanks,
>>>  
>>>       
>> This doesn't sound like it would be an NFS specific situation.
>> Why doesn't TCP handle this, without causing an ACK storm?
>>
>>     
>
> You're right, its not a problem specific to NFS, any TCP based service in which
> sockets are not explicitly closed on the application are subject to this
> problem.  however, I think NFS is currently the only clustered service that we
> offer in which we explicitly leave nfsd running during such a 'soft' failover,
> and so practically speaking, this is the only place that this issue manifests
> itself.  If we could shut down nfsd on the server doing a failover, that would
> solve this problem (as it prevents the problem with all other clustered tcp
> based services), but from what I'm told, thats a non-starter.
>
> As for why TCP doesnt handle this, thats because the situation is ambiguous from
> the point of view of the client and server.  The write up in the bugzilla has
> all the gory details, but the executive summary is that during rapid failover,
> the client will ack some data to server A in the cluster, and some to server B
> in the cluster.  If you quickly fail over and back between the servers in the
> cluster, each server will see some gaps in the data stream sequence numbers, but
> the client will see that all data has been acked.  This leaves the connection in
> an unrecoverable state.

This doesn't seem so ambiguous from the client's viewpoint to me.

The server sends back an ACK for a sequence number which is less
than the beginning sequence number that the client has to
retransmit.  Shouldn't that imply a problem to the client and
cause the TCP on the client to give up and return an error to
the caller, in this case the RPC?

Can there be gaps in sequence numbers?

    Thanx...

       ps

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:00       ` Peter Staubach
@ 2008-06-09 16:24         ` Neil Horman
  0 siblings, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 16:24 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Neil Horman, lhh, nfsv4, linux-nfs, Jeff Layton

On Mon, Jun 09, 2008 at 12:00:37PM -0400, Peter Staubach wrote:
> Neil Horman wrote:
> >On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote:
> >  
> >>Jeff Layton wrote:
> >>    
> >>>Apologies for the long email, but I ran into an interesting problem the
> >>>other day and am looking for some feedback on my general approach to
> >>>fixing it before I spend too much time on it:
> >>>
> >>>We (RH) have a cluster-suite product that some people use for making HA
> >>>NFS services. When our QA folks test this, they often will start up
> >>>some operations that do activity on an NFS mount from the cluster and
> >>>then rapidly do failovers between cluster machines and make sure
> >>>everything keeps moving along. The cluster is designed to not shut down
> >>>nfsd's when a failover occurs. nfsd's are considered a "shared
> >>>resource". It's possible that there could be multiple clustered
> >>>services for NFS-sharing, so when a failover occurs, we just manipulate
> >>>the exports table.
> >>>
> >>>The problem we've run into is that occasionally they fail over to the
> >>>alternate machine and then back very rapidly. Because nfsd's are not
> >>>shut down on failover, sockets are not closed. So what happens is
> >>>something like this on TCP mounts:
> >>>
> >>>- client has NFS mount from clustered NFS service on one server
> >>>
> >>>- service fails over, new server doesn't know anything about the
> >>> existing socket, so it sends a RST back to the client when data
> >>> comes in. Client closes connection and reopens it and does some
> >>> I/O on the socket.
> >>>
> >>>- service fails back to original server. The original socket there
> >>> is still open, but now the TCP sequence numbers are off. When
> >>> packets come into the server we end up with an ACK storm, and the
> >>> client hangs for a long time.
> >>>
> >>>Neil Horman did a good writeup of this problem here for those that
> >>>want the gory details:
> >>>
> >>>   https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
> >>>
> >>>I can think of 3 ways to fix this:
> >>>
> >>>1) Add something like the recently added "unlock_ip" interface that
> >>>was added for NLM. Maybe a "close_ip" that allows us to close all
> >>>nfsd sockets connected to a given local IP address. So clustering
> >>>software could do something like:
> >>>
> >>>   # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> >>>
> >>>...and make sure that all of the sockets are closed.
> >>>
> >>>2) just use the same "unlock_ip" interface and just have it also
> >>>close sockets in addition to dropping locks.
> >>>
> >>>3) have an nfsd close all non-listening connections when it gets a
> >>>certain signal (maybe SIGUSR1 or something). Connections on a
> >>>sockets that aren't failing over should just get a RST and would
> >>>reopen their connections.
> >>>
> >>>...my preference would probably be approach #1.
> >>>
> >>>I've only really done some rudimentary perusing of the code, so there
> >>>may be roadblocks with some of these approaches I haven't considered.
> >>>Does anyone have thoughts on the general problem or idea for a solution?
> >>>
> >>>The situation is a bit specific to failover testing -- most people 
> >>>failing
> >>>over don't do it so rapidly, but we'd still like to ensure that this
> >>>problem doesn't occur if someone does do it.
> >>>
> >>>Thanks,
> >>> 
> >>>      
> >>This doesn't sound like it would be an NFS specific situation.
> >>Why doesn't TCP handle this, without causing an ACK storm?
> >>
> >>    
> >
> >You're right, its not a problem specific to NFS, any TCP based service in 
> >which
> >sockets are not explicitly closed on the application are subject to this
> >problem.  however, I think NFS is currently the only clustered service 
> >that we
> >offer in which we explicitly leave nfsd running during such a 'soft' 
> >failover,
> >and so practically speaking, this is the only place that this issue 
> >manifests
> >itself.  If we could shut down nfsd on the server doing a failover, that 
> >would
> >solve this problem (as it prevents the problem with all other clustered tcp
> >based services), but from what I'm told, thats a non-starter.
> >
> >As for why TCP doesnt handle this, thats because the situation is 
> >ambiguous from
> >the point of view of the client and server.  The write up in the bugzilla 
> >has
> >all the gory details, but the executive summary is that during rapid 
> >failover,
> >the client will ack some data to server A in the cluster, and some to 
> >server B
> >in the cluster.  If you quickly fail over and back between the servers in 
> >the
> >cluster, each server will see some gaps in the data stream sequence 
> >numbers, but
> >the client will see that all data has been acked.  This leaves the 
> >connection in
> >an unrecoverable state.
> 
> This doesn't seem so ambiguous from the client's viewpoint to me.
> 
> The server sends back an ACK for a sequence number which is less
> than the beginning sequence number that the client has to
> retransmit.  Shouldn't that imply a problem to the client and
> cause the TCP on the client to give up and return an error to
> the caller, in this case the RPC?
> 
> Can there be gaps in sequence numbers?
> 
No, there can't be gaps in sequence numbers, but the fact that there are gaps on
a given connection is in fact ambiguous.  See RFC 793 page 36/37 for a more
detailed explanation.  The RFC mandates that in response to an out of range
sequence number for an established connection, the peer can only respond with an
empty ACK containing the next available send-sequence number.

The problem lies in the fact that, due to the failover and failback, the peers
have differing views on what state the connection is in.  The NFS client has,
at the time this problem occurs, seen ACKs to all the data it has sent.  As such,
it now sees this ACK that is backward in time and assumes that this frame
somehow got lost in the network, and just now made it here, after all the
subsequent frames did.  The appropriate thing, per the RFC, is to ignore it, and
send an ACK reminding the peer of where it is in sequence.

The NFS server, on the other hand, is in fact missing a chunk of sequence
numbers, which were acked by the other server in the cluster during the
failover/failback period.  So it legitimately thinks that some set of sequence
numbers got dropped, and it can't continue until it has them.  The only thing it
can do is continue to ACK its last seen sequence number, hoping that the client
will retransmit them (which it should, because as far as this server is
concerned, it never acked them).
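
To make the loop concrete, here is a small simulation.  It is a sketch
under simplifying assumptions, not a model of the Linux TCP stack: the
window size is made up, 32-bit wraparound is ignored, and it assumes the
stale socket's numbers are outside the other side's window in both
directions.  Each side applies the RFC 793 acceptability test to an
incoming zero-length segment and answers an unacceptable one with its
own empty ACK.

```python
from dataclasses import dataclass

WINDOW = 1000  # assumed receive window in bytes


@dataclass
class Endpoint:
    snd_nxt: int  # next sequence number this side will send
    rcv_nxt: int  # next sequence number this side expects

    def acceptable(self, seg_seq):
        # RFC 793 acceptability test for a zero-length segment,
        # ignoring 32-bit wraparound for simplicity.
        return self.rcv_nxt <= seg_seq < self.rcv_nxt + WINDOW


def count_empty_acks(client, server, limit):
    """Return how many data-less segments fly before one is accepted.

    Gives up and returns `limit` if the two sides never agree.
    """
    sender, receiver = client, server
    sent = 0
    while sent < limit:
        sent += 1                            # sender emits <SEQ=SND.NXT><ACK=RCV.NXT>
        if receiver.acceptable(sender.snd_nxt):
            break                            # views agree, nothing more to say
        sender, receiver = receiver, sender  # unacceptable: the peer ACKs back
    return sent


# Both sides agree on where the connection is.
in_sync = count_empty_acks(Endpoint(5000, 9000), Endpoint(9000, 5000), limit=50)

# The client's numbers come from its newer connection; the server's socket is stale.
stale = count_empty_acks(Endpoint(50000, 90000), Endpoint(7000, 3000), limit=50)

print(in_sync, stale)
```

With agreeing views the exchange stops after one segment.  With the
stale socket neither side ever accepts the other's ACK, so the exchange
only ends because the simulation caps it.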

There could be an argument made, I suppose, for adding some sort of knob to set a
threshold for this particular behavior (X data-less ACKs in Y amount of time ==
RST or some such), but I'm sure that won't get much upstream traction (at least
I won't propose it), since the knob would violate the RFC, possibly reset
legitimate connections (think keepalive frames), and really only solve a
problem that is manufactured by keeping processes alive (albeit apparently
necessary) in such a way that two systems share a TCP connection.
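
For illustration only, such a knob would amount to a sliding-window
counter like the sketch below.  Nothing like this exists in the stack;
the class name, its parameters and the timings are hypothetical.  It
also shows the false-positive risk: with a long enough interval,
ordinary periodic data-less segments trip the same threshold.

```python
from collections import deque


class EmptyAckLimiter:
    """Flag a connection after max_acks data-less ACKs within interval seconds."""

    def __init__(self, max_acks, interval):
        self.max_acks = max_acks
        self.interval = interval
        self.stamps = deque()

    def on_empty_ack(self, now):
        """Record one data-less ACK; return True if the threshold is reached."""
        self.stamps.append(now)
        # Forget ACKs that have aged out of the window.
        while self.stamps and now - self.stamps[0] > self.interval:
            self.stamps.popleft()
        return len(self.stamps) >= self.max_acks


# An ACK storm: one empty ACK per millisecond.
storm = EmptyAckLimiter(max_acks=5, interval=1.0)
storm_hits = [storm.on_empty_ack(i * 0.001) for i in range(10)]

# Periodic data-less segments every 75 seconds, with a tight interval.
sparse = EmptyAckLimiter(max_acks=5, interval=1.0)
sparse_hits = [sparse.on_empty_ack(i * 75.0) for i in range(10)]

# The same periodic traffic, but the interval was configured too generously.
loose = EmptyAckLimiter(max_acks=5, interval=600.0)
loose_hits = [loose.on_empty_ack(i * 75.0) for i in range(10)]

print(storm_hits.index(True), any(sparse_hits), any(loose_hits))
```

The third case is the one that worries me: whether a healthy connection
gets reset would depend entirely on how the two values were picked.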

Regards
Neil

>    Thanx...
> 
>       ps

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 14:31 rapid clustered nfs server failover and hung clients -- how best to close the sockets? Jeff Layton
       [not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 15:51 ` J. Bruce Fields
  2008-06-09 16:02   ` Jeff Layton
  2008-06-09 17:14 ` Wendy Cheng
  2 siblings, 1 reply; 38+ messages in thread
From: J. Bruce Fields @ 2008-06-09 15:51 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> I can think of 3 ways to fix this:
> 
> 1) Add something like the recently added "unlock_ip" interface that
> was added for NLM. Maybe a "close_ip" that allows us to close all
> nfsd sockets connected to a given local IP address. So clustering
> software could do something like:
> 
>     # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> 
> ...and make sure that all of the sockets are closed.
> 
> 2) just use the same "unlock_ip" interface and just have it also
> close sockets in addition to dropping locks.
> 
> 3) have an nfsd close all non-listening connections when it gets a
> certain signal (maybe SIGUSR1 or something). Connections on a
> sockets that aren't failing over should just get a RST and would
> reopen their connections.
> 
> ...my preference would probably be approach #1.

What do you see as the advantage of #1 over #2?  Are there cases where
someone would want to drop locks but not also close connections (or
vice-versa)?

--b.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 15:51 ` J. Bruce Fields
@ 2008-06-09 16:02   ` Jeff Layton
  2008-06-09 17:23     ` J. Bruce Fields
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 16:02 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 9 Jun 2008 11:51:36 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> > I can think of 3 ways to fix this:
> > 
> > 1) Add something like the recently added "unlock_ip" interface that
> > was added for NLM. Maybe a "close_ip" that allows us to close all
> > nfsd sockets connected to a given local IP address. So clustering
> > software could do something like:
> > 
> >     # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> > 
> > ...and make sure that all of the sockets are closed.
> > 
> > 2) just use the same "unlock_ip" interface and just have it also
> > close sockets in addition to dropping locks.
> > 
> > 3) have an nfsd close all non-listening connections when it gets a
> > certain signal (maybe SIGUSR1 or something). Connections on a
> > sockets that aren't failing over should just get a RST and would
> > reopen their connections.
> > 
> > ...my preference would probably be approach #1.
> 
> What do you see as the advantage of #1 over #2?  Are there cases where
> someone would want to drop locks but not also close connections (or
> vice-versa)?
> 

There's no real advantage that I can see (maybe if they're running a
cluster with no NLM services somehow). Mostly it's that "unlock_ip" seems
to imply that it deals with locking, and this doesn't. I'd be OK with #2
if it's a reasonable solution. Given what Chuck mentioned, it sounds
like we'll also need to take care to make sure that existing calls
complete and the replies get flushed out too, so this could be more
complicated than I had anticipated.
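
For what it's worth, with #2 the clustering side would not need to
change: it would keep writing the service address into the existing
unlock_ip file. A minimal Python sketch of such a caller follows. The
helper name and the procfile parameter are made up for illustration,
it assumes the file keeps taking a bare IPv4 address, and whether the
write would also close sockets depends on kernel changes that don't
exist yet.

```python
import ipaddress


def release_service_ip(address, procfile="/proc/fs/nfsd/unlock_ip"):
    """Write a validated IPv4 address to the nfsd unlock_ip control file."""
    ip = ipaddress.IPv4Address(address)  # raises ValueError on malformed input
    with open(procfile, "w") as f:
        f.write(f"{ip}\n")
```

Validating the address first means a typo in the cluster configuration
fails loudly in userspace rather than being handed to the kernel.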

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 16:02   ` Jeff Layton
@ 2008-06-09 17:23     ` J. Bruce Fields
  2008-06-09 19:10       ` Jeff Layton
  0 siblings, 1 reply; 38+ messages in thread
From: J. Bruce Fields @ 2008-06-09 17:23 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 09, 2008 at 12:02:43PM -0400, Jeff Layton wrote:
> On Mon, 9 Jun 2008 11:51:36 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> > > I can think of 3 ways to fix this:
> > > 
> > > 1) Add something like the recently added "unlock_ip" interface that
> > > was added for NLM. Maybe a "close_ip" that allows us to close all
> > > nfsd sockets connected to a given local IP address. So clustering
> > > software could do something like:
> > > 
> > >     # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> > > 
> > > ...and make sure that all of the sockets are closed.
> > > 
> > > 2) just use the same "unlock_ip" interface and just have it also
> > > close sockets in addition to dropping locks.
> > > 
> > > 3) have an nfsd close all non-listening connections when it gets a
> > > certain signal (maybe SIGUSR1 or something). Connections on a
> > > sockets that aren't failing over should just get a RST and would
> > > reopen their connections.
> > > 
> > > ...my preference would probably be approach #1.
> > 
> > What do you see as the advantage of #1 over #2?  Are there cases where
> > someone would want to drop locks but not also close connections (or
> > vice-versa)?
> > 
> 
> There's no real advantage that I can see (maybe if they're running a
> cluster with no NLM services somehow). Mostly that "unlock_ip" seems to
> imply that it deals with locking, and this doesn't. I'd be OK with #2
> if it's a reasonable solution. Given what Chuck mentioned, it sounds
> like we'll also need to take care to make sure that existing calls
> complete and the replies get flushed out too, so this could be more
> complicated that I had anticipated.

It seems to me that in the long run what we'd like is a virtualized NFS
service--you should be able to start and stop independent "servers"
hosted on a single kernel, and to clients they should look like
completely independent servers.

And I guess the question is how little "virtualization" you can get away
with and still have the whole thing work.

But anyway, ideally I think there'd be a single interface that says
"shut down the nfs service provided via server ip x.y.z.w, for possible
migration to another host".  That's the only operation anyone really
wants to do--independent control over the tcp connections, and the locks,
and the rpc cache, and whatever else needs to be dealt with, sounds
unlikely to be useful.
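
As a rough picture of what I mean by a single interface, here is a toy
Python model.  It is not kernel code and none of these names exist
anywhere; the per-address state and the order of the steps are
assumptions made for the sake of the example.

```python
class ServiceIP:
    """Hypothetical per-address NFS state; every name here is illustrative."""

    def __init__(self, address):
        self.address = address
        self.accepting = True
        self.connections = {"client1:812", "client2:790"}
        self.locks = {("client1", "/export/a")}
        self.reply_cache = {("client1", 42): "OK"}

    def shutdown_for_migration(self):
        """Quiesce everything tied to this address in one call."""
        self.accepting = False       # stop taking new requests first
        self.locks.clear()           # drop locks held via this address
        self.reply_cache.clear()     # forget cached replies for this address
        closed = sorted(self.connections)
        self.connections.clear()     # close the TCP connections last
        return closed


svc = ServiceIP("10.20.30.40")
closed = svc.shutdown_for_migration()
print(closed)
```

The point is only that the caller makes one request per address and
never has to know which pieces of state hang off it.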

--b.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:23     ` J. Bruce Fields
@ 2008-06-09 19:10       ` Jeff Layton
  2008-06-09 20:19         ` Lon Hohberger
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 19:10 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 9 Jun 2008 13:23:13 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Jun 09, 2008 at 12:02:43PM -0400, Jeff Layton wrote:
> > On Mon, 9 Jun 2008 11:51:36 -0400
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> > > > I can think of 3 ways to fix this:
> > > > 
> > > > 1) Add something like the recently added "unlock_ip" interface that
> > > > was added for NLM. Maybe a "close_ip" that allows us to close all
> > > > nfsd sockets connected to a given local IP address. So clustering
> > > > software could do something like:
> > > > 
> > > >     # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
> > > > 
> > > > ...and make sure that all of the sockets are closed.
> > > > 
> > > > 2) just use the same "unlock_ip" interface and just have it also
> > > > close sockets in addition to dropping locks.
> > > > 
> > > > 3) have an nfsd close all non-listening connections when it gets a
> > > > certain signal (maybe SIGUSR1 or something). Connections on a
> > > > sockets that aren't failing over should just get a RST and would
> > > > reopen their connections.
> > > > 
> > > > ...my preference would probably be approach #1.
> > > 
> > > What do you see as the advantage of #1 over #2?  Are there cases where
> > > someone would want to drop locks but not also close connections (or
> > > vice-versa)?
> > > 
> > 
> > There's no real advantage that I can see (maybe if they're running a
> > cluster with no NLM services somehow). Mostly that "unlock_ip" seems to
> > imply that it deals with locking, and this doesn't. I'd be OK with #2
> > if it's a reasonable solution. Given what Chuck mentioned, it sounds
> > like we'll also need to take care to make sure that existing calls
> > complete and the replies get flushed out too, so this could be more
> > complicated that I had anticipated.
> 
> It seems to me that in the long run what we'd like is a virtualized NFS
> service--you should be able to start and stop independent "servers"
> hosted on a single kernel, and to clients they should look like
> completely independent servers.
> 
> And I guess the question is how little "virtualization" you can get away
> with and still have the whole thing work.

Yep. That was Lon's exact question. Could we start nfsd's that just
work for certain exports? The answer (of course) is currently no.

As an idle side thought, I wonder whether/how we could make nfsd
containerized? I wonder if it's possible to run a local nfsd in
a Solaris zone/container thingy.

> 
> But anyway, ideally I think there'd be a single interface that says
> "shut down the nfs service provided via server ip x.y.z.w, for possible
> migration to another host".  That's the only operation anyone really
> want to do--independent control over the tcp connections, and the locks,
> and the rpc cache, and whatever else needs to be dealt with, sounds
> unlikely to be useful.
> 

Ok. When I get some time to work on this, I'll plan to work on hooking
into the current unlock_ip interface rather than creating a new
procfile. That does seem to make the most sense, though the name
"unlock_ip" might not really adequately convey what it will now be doing...
 
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 19:10       ` Jeff Layton
@ 2008-06-09 20:19         ` Lon Hohberger
  0 siblings, 0 replies; 38+ messages in thread
From: Lon Hohberger @ 2008-06-09 20:19 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, nfsv4, nhorman


On Mon, 2008-06-09 at 15:10 -0400, Jeff Layton wrote:
> > It seems to me that in the long run what we'd like is a virtualized NFS
> > service--you should be able to start and stop independent "servers"
> > hosted on a single kernel, and to clients they should look like
> > completely independent servers.
> > 
> > And I guess the question is how little "virtualization" you can get away
> > with and still have the whole thing work.
> 
> Yep. That was Lon's exact question. Could we start nfsd's that just
> work for certain exports? The answer (of course) is currently no.

s/exports/IP addresses/

-- Lon



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 14:31 rapid clustered nfs server failover and hung clients -- how best to close the sockets? Jeff Layton
       [not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2008-06-09 15:51 ` J. Bruce Fields
@ 2008-06-09 17:14 ` Wendy Cheng
  2008-06-09 17:24   ` Jeff Layton
  2008-06-09 18:07   ` Neil Horman
  2 siblings, 2 replies; 38+ messages in thread
From: Wendy Cheng @ 2008-06-09 17:14 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

Jeff Layton wrote:
> The problem we've run into is that occasionally they fail over to the
> alternate machine and then back very rapidly. 

It is a well-known issue in the NFS-TCP failover arena (or, more
specifically, for floating IP applications) that failing over from
server A to server B and then immediately failing back from B to A
does *not* work well. IIRC, in the last round of discussion with Red Hat
GPS and support folks, we concluded that most applications/users *can*
tolerate this restriction.

Maybe another, more basic question: other than QA efforts, are there
real NFSv2/v3 applications depending on this "feature"? Or would this
take tons of effort for something that will not see much use when
it is finally delivered?

-- Wendy


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:14 ` Wendy Cheng
@ 2008-06-09 17:24   ` Jeff Layton
  2008-06-09 17:51     ` Talpey, Thomas
  2008-06-09 18:10     ` Neil Horman
  2008-06-09 18:07   ` Neil Horman
  1 sibling, 2 replies; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 17:24 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, 09 Jun 2008 13:14:56 -0400
Wendy Cheng <s.wendy.cheng@gmail.com> wrote:

> Jeff Layton wrote:
> > The problem we've run into is that occasionally they fail over to the
> > alternate machine and then back very rapidly. 
> 
> It is a well-known issue in the NFS-TCP failover arena (or, more
> specifically, for floating-IP applications) that failing over from server A
> to server B and then immediately failing back from server B to A does
> *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> support folks, we concluded that most applications/users *can*
> tolerate this restriction.
> 
> Maybe another, more basic question: other than QA efforts, are there
> real NFSv2/v3 applications that depend on this "feature"? Or would this
> take tons of effort for something that will not see much use once
> it is finally delivered?
> 

Certainly a valid question...

While rapid failover like this is unusual, it's easily possible for a
sysadmin to do it. Maybe they moved the wrong service, or their downtime
was for something very brief but the service had to be off of the host to
make the change. In that case, a quick failover and back could easily
be something that happens in a real environment.

As to whether it's worth a ton of effort, that's a tough call. People want
HA services to guard against outages. Anything that jeopardizes that is
probably worth fixing. This could be solved with documentation, but a note
like:

"Be sure to wait for X minutes between failovers"

...wouldn't instill me with a lot of confidence. We'd have to have
some sort of mechanism to enforce this, and that would be less than
ideal.
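
For illustration only, this is roughly what such an enforcement mechanism
might look like inside a cluster manager. It is a minimal sketch: the
FailbackGuard class and its min_interval parameter (standing in for the
"X minutes" above) are hypothetical and do not come from any real cluster
suite.

```python
import time


class FailbackGuard:
    """Refuse to relocate a service again until min_interval seconds have passed."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # hypothetical stand-in for "X minutes"
        self.clock = clock                # injectable so the guard can be tested
        self.last_move = None             # time of the last permitted relocation

    def try_move(self):
        """Return True and record the move if it is allowed, else False."""
        now = self.clock()
        if self.last_move is not None and now - self.last_move < self.min_interval:
            return False
        self.last_move = now
        return True


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
guard = FailbackGuard(min_interval=300, clock=lambda: fake_now[0])
print(guard.try_move())   # first failover is allowed
fake_now[0] = 10.0
print(guard.try_move())   # failing back 10 s later is refused
fake_now[0] = 300.0
print(guard.try_move())   # allowed again once the interval has elapsed
```

The sketch leaves the hard part open: it says nothing about what value of
min_interval would actually be safe.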

IMO, the ideal thing would be to make sure that the "old" server is
ready to pick up the service again as soon as possible after the service
leaves it.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:24   ` Jeff Layton
@ 2008-06-09 17:51     ` Talpey, Thomas
  2008-06-09 17:59       ` Talpey, Thomas
  2008-06-09 19:01       ` Jeff Layton
  2008-06-09 18:10     ` Neil Horman
  1 sibling, 2 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 17:51 UTC (permalink / raw)
  To: Jeff Layton; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

At 01:24 PM 6/9/2008, Jeff Layton wrote:
>
>"Be sure to wait for X minutes between failovers"

At least one grace period.

>
>...wouldn't instill me with a lot of confidence. We'd have to have
>some sort of mechanism to enforce this, and that would be less than
>ideal.
>
>IMO, the ideal thing would be to make sure that the "old" server is
>ready to pick up the service again as soon as possible after the service
>leaves it.

A great goal, but it seems to me you've bundled a lot of other
incompatible requirements along with it. Having some services
restart and not others, for example. And mixing transparent IP
address takeover with stateful recovery such as TCP reconnect
and NSM/NLM. NSM provides only notification, there's no way for
either server to know for sure all the clients have completed
either switch-to or switch-back.

Of course, you could switch to UDP-only, that would fix the
TCP issue. But it won't fix NSM/NLM.

Tom.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:51     ` Talpey, Thomas
@ 2008-06-09 17:59       ` Talpey, Thomas
  2008-06-09 19:01       ` Jeff Layton
  1 sibling, 0 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 17:59 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman, Wendy Cheng

At 01:51 PM 6/9/2008, Talpey, Thomas wrote:
>and NSM/NLM. NSM provides only notification, there's no way for
>either server to know for sure all the clients have completed
>either switch-to or switch-back.

Just in case it helps to understand why relying on NSM is so risky:

	<http://www.connectathon.org/talks06/talpey-cthon06-nsm.pdf>

Slides 16, 17 and 23, especially.

Tom.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:51     ` Talpey, Thomas
  2008-06-09 17:59       ` Talpey, Thomas
@ 2008-06-09 19:01       ` Jeff Layton
  2008-06-09 19:13         ` Talpey, Thomas
  1 sibling, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 19:01 UTC (permalink / raw)
  To: Talpey, Thomas; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

On Mon, 09 Jun 2008 13:51:05 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:

> At 01:24 PM 6/9/2008, Jeff Layton wrote:
> >
> >"Be sure to wait for X minutes between failovers"
> 
> At least one grace period.
> 

Actually, we have to wait until all of the sockets on the old server
time out. This is difficult to predict and can be quite long.

> >
> >...wouldn't instill me with a lot of confidence. We'd have to have
> >some sort of mechanism to enforce this, and that would be less than
> >ideal.
> >
> >IMO, the ideal thing would be to make sure that the "old" server is
> >ready to pick up the service again as soon as possible after the service
> >leaves it.
> 
> A great goal, but it seems to me you've bundled a lot of other
> incompatible requirements along with it. Having some services
> restart and not others, for example. And mixing transparent IP
> address takeover with stateful recovery such as TCP reconnect
> and NSM/NLM. NSM provides only notification, there's no way for
> either server to know for sure all the clients have completed
> either switch-to or switch-back.
> 

Thanks for the slides -- very interesting.

Yep. NSM is risky, but this is really the same situation as a solo NFS
server spontaneously rebooting. The failover we're doing is really just
simulating that (for the case of lockd anyway). The unreliability is just
an unfortunate fact of life with NFSv2/3...

> Of course, you could switch to UDP-only, that would fix the
> TCP issue. But it won't fix NSM/NLM.
> 

Right. Nothing can really fix that so we just have to make do. All of
the NSM/NLM stuff here is really separate from the main problem I'm
interested in at the moment, which is how to deal with the old, stale
sockets that nfsd has open after the local address disappears.
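
To make "stale" concrete, here is a small user-space model of the selection
step: pick out the connections whose local address is no longer configured on
the node. This is only a sketch under assumptions. The Conn record and the
stale_connections() helper are made up for illustration, the addresses are
documentation-range examples, and nothing here reflects how nfsd or the kernel
actually tracks its sockets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conn:
    """Hypothetical record of one established TCP connection."""
    local_ip: str
    local_port: int
    peer_ip: str
    peer_port: int


def stale_connections(conns, configured_ips):
    """Return connections whose local address is no longer configured here."""
    configured = set(configured_ips)
    return [c for c in conns if c.local_ip not in configured]


conns = [
    Conn("192.0.2.10", 2049, "198.51.100.7", 801),  # bound to the floating IP
    Conn("192.0.2.1", 2049, "198.51.100.8", 915),   # bound to the node's own IP
]

# After failover the floating address 192.0.2.10 has left this node.
for c in stale_connections(conns, configured_ips=["192.0.2.1"]):
    print("would close:", c)
```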

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 19:01       ` Jeff Layton
@ 2008-06-09 19:13         ` Talpey, Thomas
  0 siblings, 0 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 19:13 UTC (permalink / raw)
  To: Jeff Layton; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

At 03:01 PM 6/9/2008, Jeff Layton wrote:
>On Mon, 09 Jun 2008 13:51:05 -0400
>"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>
>> At 01:24 PM 6/9/2008, Jeff Layton wrote:
>> >
>> >"Be sure to wait for X minutes between failovers"
>> 
>> At least one grace period.
>> 
>
>Actually, we have to wait until all of the sockets on the old server
>time out. This is difficult to predict and can be quite long.

I just gave the floor. The ceiling is yours. :-)

Orphaned server TCP sockets, btw, in general last forever without keepalive.
Even with keepalive, they can last many tens of minutes.
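
To put rough numbers on the keepalive case, here is a minimal sketch. It
assumes the socket has SO_KEEPALIVE enabled and uses the stock Linux sysctl
defaults (tcp_keepalive_time=7200, tcp_keepalive_intvl=75,
tcp_keepalive_probes=9), with every probe going unanswered. The function name
is made up for illustration, and the "tuned" values are arbitrary examples
rather than recommendations.

```python
def keepalive_drop_seconds(idle=7200, interval=75, probes=9):
    """Seconds of silence before a keepalive-enabled TCP socket is dropped.

    The kernel waits `idle` seconds after the last traffic, then sends up to
    `probes` probes spaced `interval` seconds apart, and gives up on the
    connection once the last probe has gone unanswered.
    """
    return idle + interval * probes


default = keepalive_drop_seconds()
print(f"Linux defaults: {default} s (about {default / 60:.0f} min)")

tuned = keepalive_drop_seconds(idle=600, interval=30, probes=5)
print(f"tuned example:  {tuned} s (about {tuned / 60:.1f} min)")
```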

Tom.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:24   ` Jeff Layton
  2008-06-09 17:51     ` Talpey, Thomas
@ 2008-06-09 18:10     ` Neil Horman
  1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 18:10 UTC (permalink / raw)
  To: Jeff Layton; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

On Mon, Jun 09, 2008 at 01:24:25PM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 13:14:56 -0400
> Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> 
> > Jeff Layton wrote:
> > > The problem we've run into is that occasionally they fail over to the
> > > alternate machine and then back very rapidly. 
> > 
> > > It is a well-known issue in the NFS-TCP failover arena (or, more
> > > specifically, for floating-IP applications) that failing over from server A
> > > to server B and then immediately failing back from server B to A does
> > > *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> > > support folks, we concluded that most applications/users *can*
> > > tolerate this restriction.
> > > 
> > > Maybe another, more basic question: other than QA efforts, are there
> > > real NFSv2/v3 applications that depend on this "feature"? Or would this
> > > take tons of effort for something that will not see much use once
> > > it is finally delivered?
> > 
> 
> Certainly a valid question...
> 
> While rapid failover like this is unusual, it's easily possible for a
> sysadmin to do it. Maybe they moved the wrong service, or their downtime
> was for something very brief but the service had to be off of the host to
> make the change. In that case, a quick failover and back could easily
> be something that happens in a real environment.
> 
> As to whether it's worth a ton of effort, that's a tough call. People want
> HA services to guard against outages. Anything that jeopardizes that is
> probably worth fixing. This could be solved with documentation, but a note
> like:
> 
> "Be sure to wait for X minutes between failovers"
> 
That's the real problem here.  Given the problem as we've described it, it's
possible for X to be _large_, potentially indefinite.

> IMO, the ideal thing would be to make sure that the "old" server is
> ready to pick up the service again as soon as possible after the service
> leaves it.
> 
Yes, this is really what needs to happen.  In this environment, a floating IP
address effectively means that nfsd services can inadvertently 'share' a tcp
connection, and if nfsd is to play in a floating IP environment it needs to be
able to handle that sharing...
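
A toy model of what goes wrong when two servers' views of "one" connection
disagree: each side finds the other's segment unacceptable, drops it, and
answers with a bare ACK describing its own state, which the peer in turn finds
unacceptable. The sketch below assumes a simplified version of the usual TCP
acceptability check (sequence number inside the receive window, ACK not ahead
of what was sent). The Endpoint class, count_challenge_acks() and all the
numbers are invented for illustration, not taken from any real stack.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit


class Endpoint:
    """One side's idea of the connection state (hypothetical, simplified)."""

    def __init__(self, name, snd_nxt, rcv_nxt, rcv_wnd=65_535):
        self.name = name
        self.snd_nxt = snd_nxt    # next sequence number this side will send
        self.rcv_nxt = rcv_nxt    # next sequence number this side expects
        self.rcv_wnd = rcv_wnd    # assumed receive window size

    def acceptable(self, seq, ack):
        """Simplified check: seq in the receive window, ack not beyond snd_nxt."""
        in_window = (seq - self.rcv_nxt) % SEQ_MOD < self.rcv_wnd
        ack_ok = (self.snd_nxt - ack) % SEQ_MOD < 2 ** 31
        return in_window and ack_ok

    def pure_ack(self):
        """A zero-length segment carrying this side's current (seq, ack)."""
        return self.snd_nxt, self.rcv_nxt


def count_challenge_acks(client, server, limit=20):
    """Count ACKs bounced back and forth before one side accepts a segment."""
    sender, receiver = client, server
    seq, ack = sender.pure_ack()
    for n in range(limit):
        if receiver.acceptable(seq, ack):
            return n
        # Unacceptable: drop it and reply with our own view of the state.
        sender, receiver = receiver, sender
        seq, ack = sender.pure_ack()
    return limit  # gave up: nothing in the exchange ever resynchronizes the two


in_sync = count_challenge_acks(Endpoint("client", 100, 200),
                               Endpoint("server", 200, 100))
desynced = count_challenge_acks(Endpoint("client", 5_000_000, 9_000_000),
                                Endpoint("stale server", 70_000, 40_000))
print("in sync:", in_sync, "ACKs; desynchronized:", desynced, "ACKs (hit the limit)")
```

Since a bare ACK carries no data, neither side's state advances, which is why
the desynchronized exchange only stops at the artificial limit in this model.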

Neil

> -- 
> Jeff Layton <jlayton@redhat.com>

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
  2008-06-09 17:14 ` Wendy Cheng
  2008-06-09 17:24   ` Jeff Layton
@ 2008-06-09 18:07   ` Neil Horman
  1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 18:07 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-nfs, lhh, nfsv4, nhorman, Jeff Layton

On Mon, Jun 09, 2008 at 01:14:56PM -0400, Wendy Cheng wrote:
> Jeff Layton wrote:
> >The problem we've run into is that occasionally they fail over to the
> >alternate machine and then back very rapidly. 
> 
> It is a well-known issue in the NFS-TCP failover arena (or, more
> specifically, for floating-IP applications) that failing over from server A
> to server B and then immediately failing back from server B to A does
> *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> support folks, we concluded that most applications/users *can*
> tolerate this restriction.

I think the big problem here is that this restriction has a window that can be
particularly long-lived.  If an application doesn't close its sockets, the time
between a failover event and the time when it is safe to fail back is bounded
by the lifetime of the socket on the 'failed' server.  Given the right
configuration, this could be indefinite.  Worse, you could fail at just the
wrong time, after the sequence number wraps completely, and pick up where you
left off, not knowing you lost 4GB of data in the process.
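
The wrap-around case is just modular arithmetic on 32-bit sequence numbers.
The sketch below is deliberately simplified: it treats the client's byte
stream as one continuous sequence space and ignores the fresh initial sequence
number a real reconnect would choose, so it illustrates the arithmetic rather
than any particular stack. The helper names and the numbers are made up.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap modulo 2**32 (4 GiB of data)


def advance(seq, nbytes):
    """Sequence number after sending nbytes more bytes."""
    return (seq + nbytes) % SEQ_MOD


def in_receive_window(seq, rcv_nxt, rcv_wnd):
    """True if seq falls inside [rcv_nxt, rcv_nxt + rcv_wnd), modulo 2**32."""
    return (seq - rcv_nxt) % SEQ_MOD < rcv_wnd


stale_rcv_nxt = 1_000_000  # what the orphaned socket on the old server expects
rcv_wnd = 65_535           # assumed receive window

# Shortly after failover the client has moved on, so its data is out of window.
soon_after = advance(stale_rcv_nxt, 10_000_000)
print(in_receive_window(soon_after, stale_rcv_nxt, rcv_wnd))

# After exactly 4 GiB has gone elsewhere, the numbers line up again and the
# stale socket would see the segment as the next one it was waiting for.
after_wrap = advance(stale_rcv_nxt, 2 ** 32)
print(in_receive_window(after_wrap, stale_rcv_nxt, rcv_wnd))
```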


> 
> Maybe another, more basic question: other than QA efforts, are there
> real NFSv2/v3 applications that depend on this "feature"? Or would this
> take tons of effort for something that will not see much use once
> it is finally delivered?
> 
> -- Wendy
> 
> 
> 

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

^ permalink raw reply	[flat|nested] 38+ messages in thread
