* rapid clustered nfs server failover and hung clients -- how best to close the sockets?
@ 2008-06-09 14:31 Jeff Layton
From: Jeff Layton @ 2008-06-09 14:31 UTC
To: linux-nfs, nfsv4; +Cc: nhorman, lhh
Apologies for the long email, but I ran into an interesting problem the
other day and am looking for some feedback on my general approach to
fixing it before I spend too much time on it:
We (RH) have a cluster-suite product that some people use for making HA
NFS services. When our QA folks test this, they often will start up
some operations that do activity on an NFS mount from the cluster and
then rapidly do failovers between cluster machines and make sure
everything keeps moving along. The cluster is designed to not shut down
nfsd's when a failover occurs. nfsd's are considered a "shared
resource". It's possible that there could be multiple clustered
services for NFS-sharing, so when a failover occurs, we just manipulate
the exports table.
The problem we've run into is that occasionally they fail over to the
alternate machine and then back very rapidly. Because nfsd's are not
shut down on failover, sockets are not closed. So what happens is
something like this on TCP mounts:
- client has NFS mount from clustered NFS service on one server
- service fails over, new server doesn't know anything about the
existing socket, so it sends a RST back to the client when data
comes in. Client closes connection and reopens it and does some
I/O on the socket.
- service fails back to original server. The original socket there
is still open, but now the TCP sequence numbers are off. When
packets come into the server we end up with an ACK storm, and the
client hangs for a long time.
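To make the failback step concrete, here is a toy model of why two desynchronized endpoints keep answering each other. It assumes the client reuses the same source port when it reconnects (so the stale socket on the original server still matches the connection), and it relies on the RFC 793 rule that an out-of-window segment without RST is answered with an ACK. The class, the function, the sequence numbers and the window size are all made up for illustration; this is not kernel code.

```python
from dataclasses import dataclass

MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


@dataclass
class Endpoint:
    """Minimal per-connection state; hypothetical, for illustration only."""
    snd_nxt: int          # next sequence number this side will send
    rcv_nxt: int          # next sequence number this side expects
    rcv_wnd: int = 65535  # assumed receive window

    def acceptable(self, seq: int) -> bool:
        # RFC 793 test for a zero-length segment with a non-zero window:
        # RCV.NXT <= SEG.SEQ < RCV.NXT + RCV.WND (modulo 2**32).
        return (seq - self.rcv_nxt) % MOD < self.rcv_wnd


def count_acks(client: Endpoint, server: Endpoint, limit: int = 1000) -> int:
    """Count corrective ACKs exchanged after the client sends one segment.

    Stops when a segment lands in the receiver's window, or after
    `limit` ACKs if the two sides never agree.
    """
    sender, receiver = client, server
    acks = 0
    while acks < limit:
        if receiver.acceptable(sender.snd_nxt):
            break  # in window: processed normally, no corrective ACK
        # Out of window: the receiver answers with an ACK carrying its own
        # sequence number, which the peer rejects in turn.
        acks += 1
        sender, receiver = receiver, sender
    return acks


# In sync: the server expects exactly what the client sends.
in_sync = count_acks(Endpoint(snd_nxt=1000, rcv_nxt=5000),
                     Endpoint(snd_nxt=5000, rcv_nxt=1000))

# After failback: the client carries the new connection's numbers,
# the original server still holds the old connection's numbers.
stale = count_acks(Endpoint(snd_nxt=900_000_000, rcv_nxt=700_000_000),
                   Endpoint(snd_nxt=5000, rcv_nxt=1000))
```

In this model the in-sync pair exchanges no corrective ACKs, while the stale pair runs until the limit. On a real network the exchange would presumably stop only when one of the ACKs is lost or a timer intervenes.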
Neil Horman did a good writeup of this problem here for those that
want the gory details:
https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
I can think of 3 ways to fix this:
1) Add something like the "unlock_ip" interface that was recently
added for NLM. Maybe a "close_ip" that allows us to close all
nfsd sockets connected to a given local IP address. So clustering
software could do something like:
# echo 10.20.30.40 > /proc/fs/nfsd/close_ip
...and make sure that all of the sockets are closed.
2) just use the same "unlock_ip" interface and just have it also
close sockets in addition to dropping locks.
3) have an nfsd close all non-listening connections when it gets a
certain signal (maybe SIGUSR1 or something). Connections on
sockets that aren't failing over should just get a RST and would
reopen their connections.
...my preference would probably be approach #1.
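For illustration, the cluster side of approach #1 might look like the sketch below. The close_ip file is only the proposal from this mail, not an existing kernel interface; the helper name and its control_file parameter are hypothetical, and the write format simply mirrors the echo example above.

```python
import ipaddress
from pathlib import Path

# Proposed in this thread only; treat the path as an assumption.
PROPOSED_CLOSE_IP = Path("/proc/fs/nfsd/close_ip")


def request_close_ip(addr: str, control_file: Path = PROPOSED_CLOSE_IP) -> None:
    """Ask nfsd to close all sockets connected to the local address `addr`.

    Hypothetical helper mirroring `echo ADDR > /proc/fs/nfsd/close_ip`.
    Raises ValueError for a malformed address before touching the file.
    """
    ip = ipaddress.ip_address(addr)  # validates IPv4/IPv6 syntax
    control_file.write_text(f"{ip}\n")
```

The clustering software would presumably call something like request_close_ip("10.20.30.40") on the node that is giving up the address, as part of moving the service away.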
I've only really done some rudimentary perusing of the code, so there
may be roadblocks with some of these approaches I haven't considered.
Does anyone have thoughts on the general problem or ideas for a solution?
The situation is a bit specific to failover testing -- most people failing
over don't do it so rapidly, but we'd still like to ensure that this
problem doesn't occur if someone does do it.
Thanks,
--
Jeff Layton <jlayton@redhat.com>
_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Peter Staubach @ 2008-06-09 15:03 UTC
To: Jeff Layton; +Cc: linux-nfs, nfsv4, nhorman, lhh
Jeff Layton wrote:
> - service fails back to original server. The original socket there
> is still open, but now the TCP sequence numbers are off. When
> packets come into the server we end up with an ACK storm, and the
> client hangs for a long time.
This doesn't sound like it would be an NFS specific situation.
Why doesn't TCP handle this, without causing an ACK storm?
Thanx...
ps
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Jeff Layton @ 2008-06-09 15:18 UTC
To: Peter Staubach; +Cc: linux-nfs, lhh, nfsv4, nhorman
On Mon, 09 Jun 2008 11:03:53 -0400
Peter Staubach <staubach@redhat.com> wrote:
> This doesn't sound like it would be an NFS specific situation.
> Why doesn't TCP handle this, without causing an ACK storm?
No, it's not specific to NFS. It can happen to any "service" that
floats IP addresses between machines, but does not close the sockets
that are connected to those addresses. Most services that fail over
(at least in RH's cluster server) shut down the daemons on failover
too, so that tends to mitigate this problem elsewhere.
I'm not sure how the TCP layer can really handle this situation. On
the wire, it looks to the client and server like the connection has
been hijacked (and in a sense, it has). It would be nice if it
didn't end up in an ACK storm, but I'm not aware of a way to prevent
that while staying within the spec.
--
Jeff Layton <jlayton@redhat.com>
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Neil Horman @ 2008-06-09 15:31 UTC
To: Jeff Layton; +Cc: Peter Staubach, linux-nfs, nfsv4, nhorman, lhh
On Mon, Jun 09, 2008 at 11:18:21AM -0400, Jeff Layton wrote:
> I'm not sure how the TCP layer can really handle this situation. On
> the wire, it looks to the client and server like the connection has
> been hijacked (and in a sense, it has). It would be nice if it
> didn't end up in an ACK storm, but I'm not aware of a way to prevent
> that while staying within the spec.
I've not really thought it through yet, but would iptables be another option
here? Could you, if you performed a soft failover, add a rule that responds to
any frame on an active connection that isn't a SYN frame by forcing the sending
of an ACK frame? It probably wouldn't scale, and it's kind of ugly, but it could
work...
Neil
--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Jeff Layton @ 2008-06-09 15:43 UTC
To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, nhorman
On Mon, 9 Jun 2008 11:31:55 -0400
Neil Horman <nhorman@redhat.com> wrote:
> I've not really thought it through yet, but would iptables be another option
> here? Could you, if you performed a soft failover, add a rule that responds to
> any frame on an active connection that isn't a SYN frame by forcing the sending
> of an ACK frame? It probably wouldn't scale, and it's kind of ugly, but it could
> work...
Yow, that is ugly...
So once a client does a new SYN, what would have to happen to make the
connection then work? That sounds pretty complicated. I could foresee
using iptables here though...
When the service is "leaving" the server:
1) add rule to drop all traffic to port 2049
2) restart all of the nfsd's
3) remove iptables rule
...that would (briefly) disrupt communications between all clients and
the server, but it probably would work. You'd need to drop traffic to
prevent races that might get you a "Connection Refused".
Still, it's a kludge. I'd prefer a fix that didn't cause service
disruptions for anything but the stuff that's failing over. Also, that
would be pretty nightmarish from a coding standpoint. People have all
sorts of firewalling configurations, so doing this may be difficult in
practice.
Cheers,
--
Jeff Layton <jlayton@redhat.com>
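The three iptables steps above could be sketched roughly as follows. The exact rule (INPUT chain, TCP only, DROP target) and the restart command are assumptions about one particular setup; as noted, real firewall configurations vary, so this is an illustration of the ordering rather than something to deploy. The function name and the `run` parameter are hypothetical.

```python
import subprocess
from typing import Callable, List

NFS_PORT = "2049"
# Assumed rule: drop inbound TCP traffic to the NFS port.
RULE = ["INPUT", "-p", "tcp", "--dport", NFS_PORT, "-j", "DROP"]


def restart_nfsd_blocked(
    restart_cmd: List[str],
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> None:
    """Block NFS traffic, restart nfsd, then unblock.

    `run` executes one command; the default needs root. The rule is
    removed in a `finally` so a failed restart does not leave the
    port blocked.
    """
    run(["iptables", "-I"] + RULE)      # 1) drop all traffic to port 2049
    try:
        run(list(restart_cmd))          # 2) restart all of the nfsd's
    finally:
        run(["iptables", "-D"] + RULE)  # 3) remove the iptables rule
```

Passing a recording function as `run` gives a dry run, for example `cmds = []; restart_nfsd_blocked(["service", "nfs", "restart"], run=cmds.append)`, where the restart command is again only an assumed example.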
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Talpey, Thomas @ 2008-06-09 16:40 UTC
To: Jeff Layton; +Cc: Peter Staubach, linux-nfs, lhh, nfsv4, nhorman
At 12:22 PM 6/9/2008, Jeff Layton wrote:
>That might be worth investigating, but sounds like it might cause problems
>with the services associated with IP addresses that are staying on the
>victim server.
Jeff, I think you have many years of job security to look forward to, here. :-)
Since you sent this to the NFSv4 list - is there any chance you're thinking
to not transparently take over IP addresses, but use NFSv4 locations and
referrals for these "migrations"? Yes, I know some clients may not quite be
there yet.
Tom.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Jeff Layton @ 2008-06-09 16:46 UTC
To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman
On Mon, 09 Jun 2008 12:40:05 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> Jeff, I think you have many years of job security to look forward to, here. :-)
:-)
> Since you sent this to the NFSv4 list - is there any chance you're thinking
> to not transparently take over IP addresses, but use NFSv4 locations and
> referrals for these "migrations"? Yes, I know some clients may not quite be
> there yet.
An interesting thought. I sent this to the nfsv4 list since I assume
nfsv4 will also be affected by this problem....
I'm not aware of any plans to integrate the new v4 stuff into our
cluster product. It would make a lot of sense though, so perhaps
after it gets some more upstream soak time we'll want to consider it.
That would be an extremely attractive thing with something like GFS on
the backend.
--
Jeff Layton <jlayton@redhat.com>
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: J. Bruce Fields @ 2008-06-09 18:03 UTC
To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman, Jeff Layton
On Mon, Jun 09, 2008 at 12:40:05PM -0400, Talpey, Thomas wrote:
> Since you sent this to the NFSv4 list - is there any chance you're thinking
> to not transparently take over IP addresses, but use NFSv4 locations and
> referrals for these "migrations"?
Yeah, definitely. We've got a prototype and some other work in
progress--hopefully there'll be something "real" in the coming months!
There's some overlap with nfsv2/v3, though (not in this case, but in the
need for lock migration, for example). And people really are using this
floating-ip address stuff now, so anything we can do to make it more
reliable or easier to use is welcome.
--b.
> Yes, I know some clients may not quite be
> there yet.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: J. Bruce Fields @ 2008-06-09 17:14 UTC
To: Talpey, Thomas; +Cc: Jeff Layton, Peter Staubach, linux-nfs, lhh, nfsv4, nhorman
On Mon, Jun 09, 2008 at 12:09:48PM -0400, Talpey, Thomas wrote:
> At 12:01 PM 6/9/2008, Jeff Layton wrote:
> >The victim server might have other nfsd/lockd's running on them. Stopping
> >all the nfsd's could bring down lockd, and then you have to deal with lock
> >recovery on the stuff that isn't moving to the other server.
>
> But but but... the IP address is the only identification the client can use
> to isolate a server.
Right.
> You're telling me that some locks will migrate and some won't? Good
> luck with that! The clients are going to be mightily confused.
Locks migrate or not depending on the server ip address. Where do you
see the confusion?
--b.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Talpey, Thomas @ 2008-06-09 15:51 UTC
To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman
At 11:18 AM 6/9/2008, Jeff Layton wrote:
>No, it's not specific to NFS. It can happen to any "service" that
>floats IP addresses between machines, but does not close the sockets
>that are connected to those addresses. Most services that fail over
>(at least in RH's cluster server) shut down the daemons on failover
>too, so tends to mitigate this problem elsewhere.
Why exactly don't you choose to restart the nfsd's (and lockd's) on the
victim server? Failing that, for TCP at least would ifdown/ifup accomplish
the socket reset?
Tom.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Jeff Layton @ 2008-06-09 16:01 UTC
To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman
On Mon, 09 Jun 2008 11:51:51 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> victim server?
The victim server might have other nfsd/lockd's running on it. Stopping
all the nfsd's could bring down lockd, and then you have to deal with lock
recovery on the stuff that isn't moving to the other server.
> Failing that, for TCP at least would ifdown/ifup accomplish
> the socket reset?
I don't think ifdown/ifup closes the sockets, but maybe someone can
correct me on this...
--
Jeff Layton <jlayton@redhat.com>
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 16:01 ` Jeff Layton
@ 2008-06-09 16:03 ` Neil Horman
2008-06-09 16:09 ` Talpey, Thomas
1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 16:03 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 09, 2008 at 12:01:10PM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 11:51:51 -0400
> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>
> > At 11:18 AM 6/9/2008, Jeff Layton wrote:
> > >No, it's not specific to NFS. It can happen to any "service" that
> > >floats IP addresses between machines, but does not close the sockets
> > >that are connected to those addresses. Most services that fail over
> > >(at least in RH's cluster server) shut down the daemons on failover
> > >too, so tends to mitigate this problem elsewhere.
> >
> > Why exactly don't you choose to restart the nfsd's (and lockd's) on the
> > victim server?
>
> The victim server might have other nfsd/lockd's running on them. Stopping
> all the nfsd's could bring down lockd, and then you have to deal with lock
> recovery on the stuff that isn't moving to the other server.
>
> > Failing that, for TCP at least would ifdown/ifup accomplish
> > the socket reset?
> >
>
> I don't think ifdown/ifup closes the sockets, but maybe someone can
> correct me on this...
>
ifdown/ifup doesn't do anything to the sockets per se, but it could have any
number of side effects depending on how other aspects of your
network/application are configured. It's certainly not a reliable way to
destroy a connection.

Neil

> --
> Jeff Layton <jlayton@redhat.com>

--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 16:01 ` Jeff Layton
2008-06-09 16:03 ` Neil Horman
@ 2008-06-09 16:09 ` Talpey, Thomas
2008-06-09 16:22 ` Jeff Layton
1 sibling, 1 reply; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 16:09 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

At 12:01 PM 6/9/2008, Jeff Layton wrote:
>On Mon, 09 Jun 2008 11:51:51 -0400
>"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>
>> At 11:18 AM 6/9/2008, Jeff Layton wrote:
>> >No, it's not specific to NFS. It can happen to any "service" that
>> >floats IP addresses between machines, but does not close the sockets
>> >that are connected to those addresses. Most services that fail over
>> >(at least in RH's cluster server) shut down the daemons on failover
>> >too, so tends to mitigate this problem elsewhere.
>>
>> Why exactly don't you choose to restart the nfsd's (and lockd's) on the
>> victim server?
>
>The victim server might have other nfsd/lockd's running on them. Stopping
>all the nfsd's could bring down lockd, and then you have to deal with lock
>recovery on the stuff that isn't moving to the other server.

But but but... the IP address is the only identification the client can use
to isolate a server. You're telling me that some locks will migrate and
some won't? Good luck with that! The clients are going to be mightily
confused.

>
>> Failing that, for TCP at least would ifdown/ifup accomplish
>> the socket reset?
>>
>
>I don't think ifdown/ifup closes the sockets, but maybe someone can
>correct me on this...

No, it doesn't close the sockets, but it sends interface-down status to them.
The nfsd's, in theory, should close the sockets in response. But, it's possible
(probable?) that nfsd may ignore this, and do nothing. It's just an idea.

Tom.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 16:09 ` Talpey, Thomas @ 2008-06-09 16:22 ` Jeff Layton 2008-06-09 19:36 ` Chuck Lever 0 siblings, 1 reply; 38+ messages in thread From: Jeff Layton @ 2008-06-09 16:22 UTC (permalink / raw) To: Talpey, Thomas; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, 09 Jun 2008 12:09:48 -0400 "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: > At 12:01 PM 6/9/2008, Jeff Layton wrote: > >On Mon, 09 Jun 2008 11:51:51 -0400 > >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: > > > >> At 11:18 AM 6/9/2008, Jeff Layton wrote: > >> >No, it's not specific to NFS. It can happen to any "service" that > >> >floats IP addresses between machines, but does not close the sockets > >> >that are connected to those addresses. Most services that fail over > >> >(at least in RH's cluster server) shut down the daemons on failover > >> >too, so tends to mitigate this problem elsewhere. > >> > >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the > >> victim server? > > > >The victim server might have other nfsd/lockd's running on them. Stopping > >all the nfsd's could bring down lockd, and then you have to deal with lock > >recovery on the stuff that isn't moving to the other server. > > But but but... the IP address is the only identification the client can use > to isolate a server. You're telling me that some locks will migrate and > some won't? Good luck with that! The clients are going to be mightily > confused. > Maybe I'm not being clear. My understanding is this: Right now, when we fail over we send a SIGKILL to lockd, and then send a SM_NOTIFY to all of the clients that the "victim" server has, regardless of what IP address the clients are talking to. So all locks get dropped and all clients should recover their locks. Since the service will fail over to the new host, locks that were in that export will get recovered on the "new" host. 
But, we just recently added this new "unlock_ip" interface. With that, we should be able to just send SM_NOTIFY's to clients of that IP address. Locks associated with that server address will be recovered and the others should be unaffected. > > > >> Failing that, for TCP at least would ifdown/ifup accomplish > >> the socket reset? > >> > > > >I don't think ifdown/ifup closes the sockets, but maybe someone can > >correct me on this... > > No, it doesn't close the sockets, but it sends interface-down status to them. > The nfsd's, in theory, should close the sockets in response. But, it's possible > (probable?) that nfsd may ignore this, and do nothing. It's just an idea. > That might be worth investigating, but sounds like it might cause problems with the services associated with IP addresses that are staying on the victim server. -- Jeff Layton <jlayton@redhat.com> _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 38+ messages in thread
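For reference, the unlock_ip write mentioned in the message above might be driven from a cluster script roughly as follows. This is only a sketch under assumptions: the path /proc/fs/nfsd/unlock_ip and the helper name release_locks_for_ip are not given in the thread, and the value written is assumed to be the floating server address. The write only asks the kernel to drop locks; notifying clients is a separate step that is not shown here.

```python
import ipaddress
from pathlib import Path

# Assumed location of the interface on a host with the nfsd filesystem
# mounted; the thread itself only calls it "unlock_ip".
UNLOCK_IP_PATH = Path("/proc/fs/nfsd/unlock_ip")


def release_locks_for_ip(server_ip: str, control_file: Path = UNLOCK_IP_PATH) -> str:
    """Write one floating server address to the unlock_ip control file.

    Returns the exact line that was written. Hypothetical helper: it does
    not send SM_NOTIFY and it does not close any sockets.
    """
    # Validate first so that a typo never reaches the kernel interface.
    address = ipaddress.ip_address(server_ip)
    line = f"{address}\n"
    with open(control_file, "w") as handle:
        handle.write(line)
    return line
```

Passing the control file as a parameter keeps the helper testable against an ordinary file, and would let the same code target a different control file if one were added later.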
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 16:22 ` Jeff Layton @ 2008-06-09 19:36 ` Chuck Lever 2008-06-09 20:11 ` Jeff Layton 0 siblings, 1 reply; 38+ messages in thread From: Chuck Lever @ 2008-06-09 19:36 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, Jun 9, 2008 at 12:22 PM, Jeff Layton <jlayton@redhat.com> wrote: > On Mon, 09 Jun 2008 12:09:48 -0400 > "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: > >> At 12:01 PM 6/9/2008, Jeff Layton wrote: >> >On Mon, 09 Jun 2008 11:51:51 -0400 >> >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: >> > >> >> At 11:18 AM 6/9/2008, Jeff Layton wrote: >> >> >No, it's not specific to NFS. It can happen to any "service" that >> >> >floats IP addresses between machines, but does not close the sockets >> >> >that are connected to those addresses. Most services that fail over >> >> >(at least in RH's cluster server) shut down the daemons on failover >> >> >too, so tends to mitigate this problem elsewhere. >> >> >> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the >> >> victim server? >> > >> >The victim server might have other nfsd/lockd's running on them. Stopping >> >all the nfsd's could bring down lockd, and then you have to deal with lock >> >recovery on the stuff that isn't moving to the other server. >> >> But but but... the IP address is the only identification the client can use >> to isolate a server. You're telling me that some locks will migrate and >> some won't? Good luck with that! The clients are going to be mightily >> confused. >> > > Maybe I'm not being clear. My understanding is this: > > Right now, when we fail over we send a SIGKILL to lockd, and then send > a SM_NOTIFY to all of the clients that the "victim" server has, > regardless of what IP address the clients are talking to. So all locks > get dropped and all clients should recover their locks. 
Since the > service will fail over to the new host, locks that were in that export > will get recovered on the "new" host. > > But, we just recently added this new "unlock_ip" interface. With that, > we should be able to just send SM_NOTIFY's to clients of that IP > address. Locks associated with that server address will be recovered > and the others should be unaffected. Maybe that's a little imprecise. The failover_unlock_ip() API doesn't send any SM_NOTIFY calls at all, it tells the server's NLM to drop all locks held by that IP, but there's logic in nlmsvc_is_client() specifically to keep monitoring these clients. The SM_NOTIFY calls will come from user space, just to be clear. If this is truly a service migration, I would think that the old server would want to stop monitoring these clients anyway. > All of > the NSM/NLM stuff here is really separate from the main problem I'm > interested in at the moment, which is how to deal with the old, stale > sockets that nfsd has open after the local address disappears. IMO it's just the reverse: the main problem is how to do service migration in a robust fashion; the bugs you are focused on right at the moment are due to the fact the current migration strategy is poorly designed. The real issue is how do you fix your design, and that's a lot bigger than addressing a few SYNs and ACKs. I do not believe there is going to be a simple network level fix here if you want to prevent more corner cases. I am still of the opinion that you can't do this without involvement from the nfsd threads. The old server is going to have to stop accepting incoming connections during the failover period. NetApp found that it is not enough to drop a newly accepted connection without having read any data -- that confuses some clients. Your server really does need to shut off the listener, in order to refuse new connections. I think this might be a new server state. 
A bunch of nfsd threads
will exist and be processing NFS requests, but there will be no
listener.

Then the old server can drain the doomed sockets and disconnect them
in an orderly manner. This will prevent a lot of segment ordering
problems and keep network layer confusion about socket state to a
minimum. It's a good idea to try to return any pending replies to
clients before closing the connection to reduce the likelihood of RPC
retransmits. To prevent the clients from transmitting any new
requests, use a half-close (just close the receiving half of the
connection on the server).

Naturally this will have to be time-bounded because clients can be too
busy to read any remaining data off the socket, or could just be dead.
That shouldn't hold up your service migration event.

Any clients attempting to connect to the old server during failover
will be refused. If they are trying to access legitimate NFS
resources that have not been migrated, they will retry connecting
later, so this really shouldn't be an issue. Clients connecting to
the new server should be OK, but again, I think they should be fenced
from the old server's file system until the old server has finished
processing any pending requests from clients that are being migrated
to the new server.

When failover is complete, the old server can start accepting new TCP
connections again. Clients connecting to the old server looking for
migrated resources should get something like ESTALE ("These are not
the file handles you are looking for.").

In this way, the server is in control over the migration, and isn't
depending on any wonky TCP behavior to make it happen correctly. It's
using entirely legitimate features of the socket interface to move
each client through the necessary states of migration.

Now that the network connections are figured out, your servers can
start worrying about recovering NLM, NSM, and DRC state.
--
I am certain that these presidents will understand the cry of the people of Bolivia, of the people of Latin America and the whole world, which wants to have more food and not more cars. First food, then if something's left over, more cars, more automobiles. I think that life has to come first.
-- Evo Morales
^ permalink raw reply [flat|nested] 38+ messages in thread
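The drain-then-disconnect sequence described in the message above (half-close the receiving side, flush pending replies, give up after a deadline) can be illustrated with ordinary user-space sockets. This is a sketch only: nfsd's sockets live in the kernel, so drain_and_close, its parameters, and the 5 second default are hypothetical names and values chosen for illustration. Note also that SHUT_RD is local to the server and does not by itself notify the client.

```python
import socket
import time


def drain_and_close(conn, pending_replies, deadline_seconds=5.0):
    """Half-close, flush queued replies, then close, all within a deadline.

    Returns True if every reply was handed to the socket and our FIN was
    queued; False if the deadline passed or the peer went away first.
    """
    deadline = time.monotonic() + deadline_seconds
    flushed = False
    try:
        # Receiving half only: the server stops reading new requests.
        conn.shutdown(socket.SHUT_RD)
        for reply in pending_replies:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("drain deadline passed")
            # Bound each send by whatever time is left overall.
            conn.settimeout(remaining)
            conn.sendall(reply)
        # Sending half: the FIN follows the queued reply data.
        conn.shutdown(socket.SHUT_WR)
        flushed = True
    except OSError:
        # Covers timeouts and dead peers; migration must not wait on them.
        pass
    finally:
        conn.close()
    return flushed
```

Recomputing the remaining time before every send is what keeps the whole drain bounded, rather than allowing one full timeout per reply.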
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 19:36 ` Chuck Lever @ 2008-06-09 20:11 ` Jeff Layton 2008-06-09 20:56 ` Chuck Lever 0 siblings, 1 reply; 38+ messages in thread From: Jeff Layton @ 2008-06-09 20:11 UTC (permalink / raw) To: chucklever; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, 9 Jun 2008 15:36:18 -0400 "Chuck Lever" <chuck.lever@oracle.com> wrote: > On Mon, Jun 9, 2008 at 12:22 PM, Jeff Layton <jlayton@redhat.com> wrote: > > On Mon, 09 Jun 2008 12:09:48 -0400 > > "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: > > > >> At 12:01 PM 6/9/2008, Jeff Layton wrote: > >> >On Mon, 09 Jun 2008 11:51:51 -0400 > >> >"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: > >> > > >> >> At 11:18 AM 6/9/2008, Jeff Layton wrote: > >> >> >No, it's not specific to NFS. It can happen to any "service" that > >> >> >floats IP addresses between machines, but does not close the sockets > >> >> >that are connected to those addresses. Most services that fail over > >> >> >(at least in RH's cluster server) shut down the daemons on failover > >> >> >too, so tends to mitigate this problem elsewhere. > >> >> > >> >> Why exactly don't you choose to restart the nfsd's (and lockd's) on the > >> >> victim server? > >> > > >> >The victim server might have other nfsd/lockd's running on them. Stopping > >> >all the nfsd's could bring down lockd, and then you have to deal with lock > >> >recovery on the stuff that isn't moving to the other server. > >> > >> But but but... the IP address is the only identification the client can use > >> to isolate a server. You're telling me that some locks will migrate and > >> some won't? Good luck with that! The clients are going to be mightily > >> confused. > >> > > > > Maybe I'm not being clear. 
My understanding is this: > > > > Right now, when we fail over we send a SIGKILL to lockd, and then send > > a SM_NOTIFY to all of the clients that the "victim" server has, > > regardless of what IP address the clients are talking to. So all locks > > get dropped and all clients should recover their locks. Since the > > service will fail over to the new host, locks that were in that export > > will get recovered on the "new" host. > > > > But, we just recently added this new "unlock_ip" interface. With that, > > we should be able to just send SM_NOTIFY's to clients of that IP > > address. Locks associated with that server address will be recovered > > and the others should be unaffected. > > Maybe that's a little imprecise. > > The failover_unlock_ip() API doesn't send any SM_NOTIFY calls at all, > it tells the server's NLM to drop all locks held by that IP, but > there's logic in nlmsvc_is_client() specifically to keep monitoring > these clients. The SM_NOTIFY calls will come from user space, just to > be clear. > > If this is truly a service migration, I would think that the old > server would want to stop monitoring these clients anyway. > > > All of > > the NSM/NLM stuff here is really separate from the main problem I'm > > interested in at the moment, which is how to deal with the old, stale > > sockets that nfsd has open after the local address disappears. > > IMO it's just the reverse: the main problem is how to do service > migration in a robust fashion; the bugs you are focused on right at > the moment are due to the fact the current migration strategy is > poorly designed. The real issue is how do you fix your design, and > that's a lot bigger than addressing a few SYNs and ACKs. I do not > believe there is going to be a simple network level fix here if you > want to prevent more corner cases. > > I am still of the opinion that you can't do this without involvement > from the nfsd threads. 
The old server is going to have to stop > accepting incoming connections during the failover period. NetApp > found that it is not enough to drop a newly accepted connection > without having read any data -- that confuses some clients. Your > server really does need to shut off the listener, in order to refuse > new connections. > I'm not sure I follow your logic here. The first thing that happens when failover occurs is that the IP address is removed from the interface. This prevents new connections on that IP address (and new packets for existing connections for that matter). Why would this not be sufficient to prevent new activity on those sockets? > I think this might be a new server state. A bunch of nfsd threads > will exist and be processing NFS requests, but there will be no > listener. > > Then the old server can drain the doomed sockets and disconnect them > in an orderly manner. This will prevent a lot of segment ordering > problems and keep network layer confusion about socket state to a > minimum. It's a good idea to try to return any pending replies to > clients before closing the connection to reduce the likelihood of RPC > retransmits. To prevent the clients from transmitting any new > requests, use a half-close (just close the receiving half of the > connection on the server). > Ahh ok. So you're thinking that we need to keep the IP address in place so that we can send replies for RPC's that are still in progress? That makes sense. I suppose that instead of shutting down the listener altogether, we could just have the listener refuse connections for the given destination address. That's probably simpler and would mean less disruption for exports on other IP addrs. That said, if we assume we want to use the unlock_ip interface then there's a potential race between writing to unlock_ip and taking down the address. 
I'll have to think about how to deal with that; maybe some sort of
3-stage teardown:

1) refuse new connections for the IP address, drain the RPC queues,
   half close sockets

2) remove the address from the interface

3) close sockets the rest of the way, stop refusing connections

Then again, we might actually be better off restarting nfsd instead. It's
certainly simpler...

> Naturally this will have to be time-bounded because clients can be too
> busy to read any remaining data off the socket, or could just be dead.
> That shouldn't hold up your service migration event.
>

Definitely.

> Any clients attempting to connect to the old server during failover
> will be refused. If they are trying to access legitimate NFS
> resources that have not been migrated, they will retry connecting
> later, so this really shouldn't be an issue. Clients connecting to
> the new server should be OK, but again, I think they should be fenced
> from the old server's file system until the old server has finished
> processing any pending requests from clients that are being migrated
> to the new server.
>
> When failover is complete, the old server can start accepting new TCP
> connections again. Clients connecting to the old server looking for
> migrated resources should get something like ESTALE ("These are not
> the file handles you are looking for.").
>

I think we return -EACCES or something (whatever you get when you try to
access something that isn't exported). We remove the export from the
exports table when we fail over.

> In this way, the server is in control over the migration, and isn't
> depending on any wonky TCP behavior to make it happen correctly. It's
> using entirely legitimate features of the socket interface to move
> each client through the necessary states of migration.
>
> Now that the network connections are figured out, your servers can
> start worrying about recovering NLM, NSM, and DRC state.
> -- Jeff Layton <jlayton@redhat.com> _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 38+ messages in thread
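The ordering constraint in the three-stage teardown proposed in the message above can be written down as a small driver. Everything here is hypothetical: the step names, run_teardown, and the hook mechanism are illustration only, and none of these hooks correspond to an existing nfsd interface. The sketch only fixes the order in which a cluster manager would invoke whatever mechanisms end up existing.

```python
from typing import Callable, Dict, List, Tuple

# (stage number, step name), in the order listed in the proposal.
TEARDOWN_STEPS: List[Tuple[int, str]] = [
    (1, "refuse_new_connections"),
    (1, "drain_rpc_queues"),
    (1, "half_close_sockets"),
    (2, "remove_address"),
    (3, "close_sockets"),
    (3, "accept_connections_again"),
]


def run_teardown(ip: str, hooks: Dict[str, Callable[[str], None]]) -> List[str]:
    """Call one hook per step, strictly in order, for a single floating IP.

    Refuses to start unless every step has a hook, so the address is never
    removed by a partially configured teardown.
    """
    missing = [name for _, name in TEARDOWN_STEPS if name not in hooks]
    if missing:
        raise KeyError("no hook for: " + ", ".join(missing))
    completed = []
    for stage, name in TEARDOWN_STEPS:
        hooks[name](ip)
        completed.append(f"{stage}:{name}")
    return completed
```

Checking for missing hooks up front is a design choice: a teardown that stops halfway would leave the listener refusing connections with the address still in place.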
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 20:11 ` Jeff Layton @ 2008-06-09 20:56 ` Chuck Lever 0 siblings, 0 replies; 38+ messages in thread From: Chuck Lever @ 2008-06-09 20:56 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-nfs, chucklever, lhh, nfsv4, nhorman On Jun 9, 2008, at 4:11 PM, Jeff Layton wrote: > On Mon, 9 Jun 2008 15:36:18 -0400 > "Chuck Lever" <chuck.lever@oracle.com> wrote: >> On Mon, Jun 9, 2008 at 12:22 PM, Jeff Layton <jlayton@redhat.com> >> wrote: >>> On Mon, 09 Jun 2008 12:09:48 -0400 >>> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: >>> >>>> At 12:01 PM 6/9/2008, Jeff Layton wrote: >>>>> On Mon, 09 Jun 2008 11:51:51 -0400 >>>>> "Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote: >>>>> >>>>>> At 11:18 AM 6/9/2008, Jeff Layton wrote: >>>>>>> No, it's not specific to NFS. It can happen to any "service" >>>>>>> that >>>>>>> floats IP addresses between machines, but does not close the >>>>>>> sockets >>>>>>> that are connected to those addresses. Most services that fail >>>>>>> over >>>>>>> (at least in RH's cluster server) shut down the daemons on >>>>>>> failover >>>>>>> too, so tends to mitigate this problem elsewhere. >>>>>> >>>>>> Why exactly don't you choose to restart the nfsd's (and >>>>>> lockd's) on the >>>>>> victim server? >>>>> >>>>> The victim server might have other nfsd/lockd's running on them. >>>>> Stopping >>>>> all the nfsd's could bring down lockd, and then you have to deal >>>>> with lock >>>>> recovery on the stuff that isn't moving to the other server. >>>> >>>> But but but... the IP address is the only identification the >>>> client can use >>>> to isolate a server. You're telling me that some locks will >>>> migrate and >>>> some won't? Good luck with that! The clients are going to be >>>> mightily >>>> confused. >>>> >>> >>> Maybe I'm not being clear. 
My understanding is this: >>> >>> Right now, when we fail over we send a SIGKILL to lockd, and then >>> send >>> a SM_NOTIFY to all of the clients that the "victim" server has, >>> regardless of what IP address the clients are talking to. So all >>> locks >>> get dropped and all clients should recover their locks. Since the >>> service will fail over to the new host, locks that were in that >>> export >>> will get recovered on the "new" host. >>> >>> But, we just recently added this new "unlock_ip" interface. With >>> that, >>> we should be able to just send SM_NOTIFY's to clients of that IP >>> address. Locks associated with that server address will be recovered >>> and the others should be unaffected. >> >> Maybe that's a little imprecise. >> >> The failover_unlock_ip() API doesn't send any SM_NOTIFY calls at all, >> it tells the server's NLM to drop all locks held by that IP, but >> there's logic in nlmsvc_is_client() specifically to keep monitoring >> these clients. The SM_NOTIFY calls will come from user space, just >> to >> be clear. >> >> If this is truly a service migration, I would think that the old >> server would want to stop monitoring these clients anyway. >> >>> All of >>> the NSM/NLM stuff here is really separate from the main problem I'm >>> interested in at the moment, which is how to deal with the old, >>> stale >>> sockets that nfsd has open after the local address disappears. >> >> IMO it's just the reverse: the main problem is how to do service >> migration in a robust fashion; the bugs you are focused on right at >> the moment are due to the fact the current migration strategy is >> poorly designed. The real issue is how do you fix your design, and >> that's a lot bigger than addressing a few SYNs and ACKs. I do not >> believe there is going to be a simple network level fix here if you >> want to prevent more corner cases. >> >> I am still of the opinion that you can't do this without involvement >> from the nfsd threads. 
The old server is going to have to stop >> accepting incoming connections during the failover period. NetApp >> found that it is not enough to drop a newly accepted connection >> without having read any data -- that confuses some clients. Your >> server really does need to shut off the listener, in order to refuse >> new connections. >> > > I'm not sure I follow your logic here. The first thing that happens > when failover occurs is that the IP address is removed from the > interface. This prevents new connections on that IP address (and new > packets for existing connections for that matter). Why would this not > be sufficient to prevent new activity on those sockets? Because precisely the situation that you have observed occurs. The clients and servers get confused about network state because the TCP connections weren't properly shut down. >> I think this might be a new server state. A bunch of nfsd threads >> will exist and be processing NFS requests, but there will be no >> listener. >> >> Then the old server can drain the doomed sockets and disconnect them >> in an orderly manner. This will prevent a lot of segment ordering >> problems and keep network layer confusion about socket state to a >> minimum. It's a good idea to try to return any pending replies to >> clients before closing the connection to reduce the likelihood of RPC >> retransmits. To prevent the clients from transmitting any new >> requests, use a half-close (just close the receiving half of the >> connection on the server). > > Ahh ok. So you're thinking that we need to keep the IP address in > place > so that we can send replies for RPC's that are still in progress? That > makes sense. There is a part of NFSD failover that must occur before the IP address is taken down. Otherwise you orphan NFS requests on the server and TCP segments in the network. 
You have an opportunity, during service migration, to shut down the old service gracefully so that you greatly reduce the risk of data loss or corruption. > I suppose that instead of shutting down the listener altogether, we > could just have the listener refuse connections for the given > destination address. I didn't think you could do that to an active listener. Even if you could, NFSD would depend on the specifics of the network layer implementation to disallow races or partially connected states while the listener socket was transitioning. Since these events are rare compared to RPC requests and new connections, I would think it wouldn't matter if the listener wasn't available for a brief period. What matters is the service shut down on the old server is clean and orderly. > That said, if we assume we want to use the unlock_ip interface then > there's a potential race between writing to unlock_ip and taking down > the address. I'll have to think about how to deal with that maybe some > sort of 3 stage teardown: > > 1) refuse new connections for the IP address, drain the RPC queues, > half close sockets > > 2) remove the address from the interface > > 3) close sockets the rest of the way, stop refusing connections I'm not sure what you accomplish with "close sockets the rest of the way" after you have removed the address from the interface? The NFSDs should gracefully destroy all resources on that address before you remove the address from the interface. Closing the sockets properly means that both halves of the connection duplex have an opportunity to go through the FIN,ACK dance. There is still a risk that things will get confused. But in most normal cases, this is enough to ensure an orderly transition of the network connections. >> Naturally this will have to be time-bounded because clients can be >> too >> busy to read any remaining data off the socket, or could just be >> dead. >> That shouldn't hold up your service migration event. > > Definitely. 
> >> Any clients attempting to connect to the old server during failover >> will be refused. If they are trying to access legitimate NFS >> resources that have not been migrated, they will retry connecting >> later, so this really shouldn't be an issue. Clients connecting to >> the new server should be OK, but again, I think they should be fenced >> from the old server's file system until the old server has finished >> processing any pending requests from clients that are being migrated >> to the new server. >> >> When failover is complete, the old server can start accepting new TCP >> connections again. Clients connecting to the old server looking for >> migrated resources should get something like ESTALE ("These are not >> the file handles you are looking for."). > > I think we return -EACCES or something (whatever you get when you > try to > access something that isn't exported). We remove the export from the > exports table when we fail over. >> In this way, the server is in control over the migration, and isn't >> depending on any wonky TCP behavior to make it happen correctly. >> It's >> using entirely legitimate features of the socket interface to move >> each client through the necessary states of migration. >> >> Now that the network connections are figured out, your servers can >> start worrying about recoverying NLM, NSM, and DRC state. >> > -- > Jeff Layton <jlayton@redhat.com> -- Chuck Lever chuck[dot]lever[at]oracle[dot]com _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 38+ messages in thread
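The "no listener, but existing connections still served" state discussed above has a direct user-space analogue, shown here on the loopback interface. This is an illustration of ordinary socket behaviour, not of nfsd internals; listener_shutoff_demo is a hypothetical name, and the sketch assumes a host where loopback TCP is available.

```python
import socket


def listener_shutoff_demo():
    """Close a listener and show what happens to old and new connections.

    Returns (request, reply, refused): the bytes exchanged over the
    connection accepted before the listener closed, and whether a fresh
    connection attempt afterwards was refused.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(8)
    address = listener.getsockname()

    client = socket.create_connection(address, timeout=2.0)
    established, _peer = listener.accept()
    established.settimeout(2.0)

    # From here on there is no listener on this address and port.
    listener.close()

    try:
        # The already-accepted connection keeps carrying traffic.
        client.sendall(b"request")
        request = established.recv(len(b"request"), socket.MSG_WAITALL)
        established.sendall(b"reply")
        reply = client.recv(len(b"reply"), socket.MSG_WAITALL)

        # A new connection attempt now has nothing to connect to.
        try:
            socket.create_connection(address, timeout=2.0).close()
            refused = False
        except ConnectionRefusedError:
            refused = True
    finally:
        established.close()
        client.close()
    return request, reply, refused
```

Because the refusal comes from the network stack rather than from application code, no new connection reaches a partially connected state while the established one is being drained.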
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:03 ` Peter Staubach 2008-06-09 15:18 ` Jeff Layton @ 2008-06-09 15:23 ` Neil Horman 2008-06-09 15:37 ` Peter Staubach ` (2 more replies) 1 sibling, 3 replies; 38+ messages in thread From: Neil Horman @ 2008-06-09 15:23 UTC (permalink / raw) To: Peter Staubach; +Cc: linux-nfs, lhh, nfsv4, nhorman, Jeff Layton On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: > Jeff Layton wrote: > >Apologies for the long email, but I ran into an interesting problem the > >other day and am looking for some feedback on my general approach to > >fixing it before I spend too much time on it: > > > >We (RH) have a cluster-suite product that some people use for making HA > >NFS services. When our QA folks test this, they often will start up > >some operations that do activity on an NFS mount from the cluster and > >then rapidly do failovers between cluster machines and make sure > >everything keeps moving along. The cluster is designed to not shut down > >nfsd's when a failover occurs. nfsd's are considered a "shared > >resource". It's possible that there could be multiple clustered > >services for NFS-sharing, so when a failover occurs, we just manipulate > >the exports table. > > > >The problem we've run into is that occasionally they fail over to the > >alternate machine and then back very rapidly. Because nfsd's are not > >shut down on failover, sockets are not closed. So what happens is > >something like this on TCP mounts: > > > >- client has NFS mount from clustered NFS service on one server > > > >- service fails over, new server doesn't know anything about the > > existing socket, so it sends a RST back to the client when data > > comes in. Client closes connection and reopens it and does some > > I/O on the socket. > > > >- service fails back to original server. The original socket there > > is still open, but now the TCP sequence numbers are off. 
When > > packets come into the server we end up with an ACK storm, and the > > client hangs for a long time. > > > >Neil Horman did a good writeup of this problem here for those that > >want the gory details: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 > > > >I can think of 3 ways to fix this: > > > >1) Add something like the recently added "unlock_ip" interface that > >was added for NLM. Maybe a "close_ip" that allows us to close all > >nfsd sockets connected to a given local IP address. So clustering > >software could do something like: > > > > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > > > >...and make sure that all of the sockets are closed. > > > >2) just use the same "unlock_ip" interface and just have it also > >close sockets in addition to dropping locks. > > > >3) have an nfsd close all non-listening connections when it gets a > >certain signal (maybe SIGUSR1 or something). Connections on a > >sockets that aren't failing over should just get a RST and would > >reopen their connections. > > > >...my preference would probably be approach #1. > > > >I've only really done some rudimentary perusing of the code, so there > >may be roadblocks with some of these approaches I haven't considered. > >Does anyone have thoughts on the general problem or idea for a solution? > > > >The situation is a bit specific to failover testing -- most people failing > >over don't do it so rapidly, but we'd still like to ensure that this > >problem doesn't occur if someone does do it. > > > >Thanks, > > > > This doesn't sound like it would be an NFS specific situation. > Why doesn't TCP handle this, without causing an ACK storm? > You're right, its not a problem specific to NFS, any TCP based service in which sockets are not explicitly closed on the application are subject to this problem. 
However, I think NFS is currently the only clustered service that we offer in which we explicitly leave nfsd running during such a 'soft' failover, and so practically speaking, this is the only place that this issue manifests itself. If we could shut down nfsd on the server doing a failover, that would solve this problem (as it prevents the problem with all other clustered TCP-based services), but from what I'm told, that's a non-starter.

As for why TCP doesn't handle this, that's because the situation is ambiguous from the point of view of the client and server. The write-up in the bugzilla has all the gory details, but the executive summary is that during rapid failover, the client will ack some data to server A in the cluster, and some to server B in the cluster. If you quickly fail over and back between the servers in the cluster, each server will see some gaps in the data stream sequence numbers, but the client will see that all data has been acked. This leaves the connection in an unrecoverable state.

Regards
Neil

> Thanx...
>
> ps

--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
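The desynchronized state Neil describes can be illustrated with a toy model. The sketch below is not real TCP: the `Endpoint` class, the `WINDOW` size, and all sequence numbers are made up for illustration, and it ignores sequence wraparound, retransmission timers, and any rate limiting a real stack might apply. It only shows that when each side finds the other's sequence number unacceptable, every bare ACK provokes another bare ACK.

```python
from dataclasses import dataclass

WINDOW = 1000  # toy receive window in bytes (an assumption, not a real default)


@dataclass
class Endpoint:
    name: str
    snd_nxt: int  # next sequence number this side will send
    rcv_nxt: int  # next sequence number this side expects to receive

    def acceptable(self, seq):
        """Simplified in-window check; sequence wraparound is ignored."""
        return self.rcv_nxt <= seq < self.rcv_nxt + WINDOW

    def bare_ack(self):
        """Return an empty ACK segment as a (seq, ack) pair."""
        return (self.snd_nxt, self.rcv_nxt)


def count_ack_exchanges(client, server, limit=10):
    """Client sends one empty segment; count the ACKs that bounce back and
    forth until one side accepts a segment or the limit is reached."""
    sender, receiver = client, server
    seq, _ack = sender.bare_ack()
    exchanged = 0
    while exchanged < limit:
        if receiver.acceptable(seq):
            return exchanged
        # Unacceptable segment: the receiver answers with a bare ACK,
        # so the roles swap and the other side now has to judge it.
        sender, receiver = receiver, sender
        seq, _ack = sender.bare_ack()
        exchanged += 1
    return exchanged


# Both ends agree on the sequence space: the segment is accepted at once.
in_sync = count_ack_exchanges(Endpoint("client", 100, 500),
                              Endpoint("server A", 500, 100))

# The client has since reconnected via server B and uses new sequence
# numbers, while server A still holds the stale socket from before.
stale = count_ack_exchanges(Endpoint("client", 90000, 70000),
                            Endpoint("server A (stale)", 500, 100))
```

With the in-sync pair the count is zero; with the stale socket the exchange only stops because the toy model caps it at `limit`.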
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:23 ` Neil Horman @ 2008-06-09 15:37 ` Peter Staubach 2008-06-09 15:49 ` Jeff Layton 2008-06-09 16:04 ` Neil Horman 2008-06-09 15:46 ` Chuck Lever 2008-06-09 16:00 ` Peter Staubach 2 siblings, 2 replies; 38+ messages in thread From: Peter Staubach @ 2008-06-09 15:37 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, Jeff Layton Neil Horman wrote: > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: > >> Jeff Layton wrote: >> >>> Apologies for the long email, but I ran into an interesting problem the >>> other day and am looking for some feedback on my general approach to >>> fixing it before I spend too much time on it: >>> >>> We (RH) have a cluster-suite product that some people use for making HA >>> NFS services. When our QA folks test this, they often will start up >>> some operations that do activity on an NFS mount from the cluster and >>> then rapidly do failovers between cluster machines and make sure >>> everything keeps moving along. The cluster is designed to not shut down >>> nfsd's when a failover occurs. nfsd's are considered a "shared >>> resource". It's possible that there could be multiple clustered >>> services for NFS-sharing, so when a failover occurs, we just manipulate >>> the exports table. >>> >>> The problem we've run into is that occasionally they fail over to the >>> alternate machine and then back very rapidly. Because nfsd's are not >>> shut down on failover, sockets are not closed. So what happens is >>> something like this on TCP mounts: >>> >>> - client has NFS mount from clustered NFS service on one server >>> >>> - service fails over, new server doesn't know anything about the >>> existing socket, so it sends a RST back to the client when data >>> comes in. Client closes connection and reopens it and does some >>> I/O on the socket. >>> >>> - service fails back to original server. 
The original socket there >>> is still open, but now the TCP sequence numbers are off. When >>> packets come into the server we end up with an ACK storm, and the >>> client hangs for a long time. >>> >>> Neil Horman did a good writeup of this problem here for those that >>> want the gory details: >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 >>> >>> I can think of 3 ways to fix this: >>> >>> 1) Add something like the recently added "unlock_ip" interface that >>> was added for NLM. Maybe a "close_ip" that allows us to close all >>> nfsd sockets connected to a given local IP address. So clustering >>> software could do something like: >>> >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip >>> >>> ...and make sure that all of the sockets are closed. >>> >>> 2) just use the same "unlock_ip" interface and just have it also >>> close sockets in addition to dropping locks. >>> >>> 3) have an nfsd close all non-listening connections when it gets a >>> certain signal (maybe SIGUSR1 or something). Connections on a >>> sockets that aren't failing over should just get a RST and would >>> reopen their connections. >>> >>> ...my preference would probably be approach #1. >>> >>> I've only really done some rudimentary perusing of the code, so there >>> may be roadblocks with some of these approaches I haven't considered. >>> Does anyone have thoughts on the general problem or idea for a solution? >>> >>> The situation is a bit specific to failover testing -- most people failing >>> over don't do it so rapidly, but we'd still like to ensure that this >>> problem doesn't occur if someone does do it. >>> >>> Thanks, >>> >>> >> This doesn't sound like it would be an NFS specific situation. >> Why doesn't TCP handle this, without causing an ACK storm? >> >> > > You're right, its not a problem specific to NFS, any TCP based service in which > sockets are not explicitly closed on the application are subject to this > problem. 
however, I think NFS is currently the only clustered service that we
> offer in which we explicitly leave nfsd running during such a 'soft' failover,
> and so practically speaking, this is the only place that this issue manifests
> itself. If we could shut down nfsd on the server doing a failover, that would
> solve this problem (as it prevents the problem with all other clustered tcp
> based services), but from what I'm told, thats a non-starter.
>
>

I think that this last would be a good thing to pursue anyway, or at least we should understand why it is considered to be a "non-starter". When failing away a service, why not stop the service on the original node?

These floating virtual IP and ARP games can get tricky to handle in boundary cases like this one.

> As for why TCP doesnt handle this, thats because the situation is ambiguous from
> the point of view of the client and server. The write up in the bugzilla has
> all the gory details, but the executive summary is that during rapid failover,
> the client will ack some data to server A in the cluster, and some to server B
> in the cluster. If you quickly fail over and back between the servers in the
> cluster, each server will see some gaps in the data stream sequence numbers, but
> the client will see that all data has been acked. This leaves the connection in
> an unrecoverable state.

I wonder what happens if we stick some other NFS/RPC/TCP/IP implementation into the situation. Would it see and generate the same situation?

ps
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:37 ` Peter Staubach @ 2008-06-09 15:49 ` Jeff Layton [not found] ` <20080609114909.131cfaef-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 2008-06-09 16:04 ` Neil Horman 1 sibling, 1 reply; 38+ messages in thread From: Jeff Layton @ 2008-06-09 15:49 UTC (permalink / raw) To: Peter Staubach; +Cc: Neil Horman, lhh, nfsv4, linux-nfs On Mon, 09 Jun 2008 11:37:27 -0400 Peter Staubach <staubach@redhat.com> wrote: > Neil Horman wrote: > > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: > > > >> Jeff Layton wrote: > >> > >>> Apologies for the long email, but I ran into an interesting problem the > >>> other day and am looking for some feedback on my general approach to > >>> fixing it before I spend too much time on it: > >>> > >>> We (RH) have a cluster-suite product that some people use for making HA > >>> NFS services. When our QA folks test this, they often will start up > >>> some operations that do activity on an NFS mount from the cluster and > >>> then rapidly do failovers between cluster machines and make sure > >>> everything keeps moving along. The cluster is designed to not shut down > >>> nfsd's when a failover occurs. nfsd's are considered a "shared > >>> resource". It's possible that there could be multiple clustered > >>> services for NFS-sharing, so when a failover occurs, we just manipulate > >>> the exports table. > >>> > >>> The problem we've run into is that occasionally they fail over to the > >>> alternate machine and then back very rapidly. Because nfsd's are not > >>> shut down on failover, sockets are not closed. So what happens is > >>> something like this on TCP mounts: > >>> > >>> - client has NFS mount from clustered NFS service on one server > >>> > >>> - service fails over, new server doesn't know anything about the > >>> existing socket, so it sends a RST back to the client when data > >>> comes in. 
Client closes connection and reopens it and does some > >>> I/O on the socket. > >>> > >>> - service fails back to original server. The original socket there > >>> is still open, but now the TCP sequence numbers are off. When > >>> packets come into the server we end up with an ACK storm, and the > >>> client hangs for a long time. > >>> > >>> Neil Horman did a good writeup of this problem here for those that > >>> want the gory details: > >>> > >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 > >>> > >>> I can think of 3 ways to fix this: > >>> > >>> 1) Add something like the recently added "unlock_ip" interface that > >>> was added for NLM. Maybe a "close_ip" that allows us to close all > >>> nfsd sockets connected to a given local IP address. So clustering > >>> software could do something like: > >>> > >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > >>> > >>> ...and make sure that all of the sockets are closed. > >>> > >>> 2) just use the same "unlock_ip" interface and just have it also > >>> close sockets in addition to dropping locks. > >>> > >>> 3) have an nfsd close all non-listening connections when it gets a > >>> certain signal (maybe SIGUSR1 or something). Connections on a > >>> sockets that aren't failing over should just get a RST and would > >>> reopen their connections. > >>> > >>> ...my preference would probably be approach #1. > >>> > >>> I've only really done some rudimentary perusing of the code, so there > >>> may be roadblocks with some of these approaches I haven't considered. > >>> Does anyone have thoughts on the general problem or idea for a solution? > >>> > >>> The situation is a bit specific to failover testing -- most people failing > >>> over don't do it so rapidly, but we'd still like to ensure that this > >>> problem doesn't occur if someone does do it. > >>> > >>> Thanks, > >>> > >>> > >> This doesn't sound like it would be an NFS specific situation. > >> Why doesn't TCP handle this, without causing an ACK storm? 
> >> > >> > > > > You're right, its not a problem specific to NFS, any TCP based service in which > > sockets are not explicitly closed on the application are subject to this > > problem. however, I think NFS is currently the only clustered service that we > > offer in which we explicitly leave nfsd running during such a 'soft' failover, > > and so practically speaking, this is the only place that this issue manifests > > itself. If we could shut down nfsd on the server doing a failover, that would > > solve this problem (as it prevents the problem with all other clustered tcp > > based services), but from what I'm told, thats a non-starter. > > > > > > I think that this last would be a good thing to pursue anyway, > or at least be able to understand why it would be considered to > be a "non-starter". When failing away a service, why not stop > the service on the original node? > Suppose you have more than one "NFS service". People do occasionally set up NFS exports in separate services. Also, there's the possibility of a mix of clustered + non-clustered exports. So shutting down nfsd could disrupt NFS services on any IP addresses that remain on the box. That said, we could maybe shut down nfsd and trust that retransmissions will take care of the problem. That could be racy though. > These floating virtual IP and ARP games can get tricky to handle > in the boundary cases like this sort of one. > > > As for why TCP doesnt handle this, thats because the situation is ambiguous from > > the point of view of the client and server. The write up in the bugzilla has > > all the gory details, but the executive summary is that during rapid failover, > > the client will ack some data to server A in the cluster, and some to server B > > in the cluster. If you quickly fail over and back between the servers in the > > cluster, each server will see some gaps in the data stream sequence numbers, but > > the client will see that all data has been acked. 
This leaves the connection in
> > an unrecoverable state.
>
> I would wonder what happens if we stick some other NFS/RPC/TCP/IP
> implementation into the situation. I wonder if it would see and
> generate the same situation?
>

Assuming you mean changing the client to a different sort of OS, then yes, I think the same thing would likely happen unless it has some mechanism to break out of an ACK storm like this.

--
Jeff Layton <jlayton@redhat.com>
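Jeff's objection to stopping nfsd (it would disrupt exports on addresses that stay on the box) is what makes a per-address approach such as the proposed "close_ip" attractive. The following is a minimal sketch of that selection logic only, assuming a simplified socket record. `NfsdSocket` and `select_for_close` are hypothetical names for illustration and do not correspond to any kernel structure or existing interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NfsdSocket:
    """Hypothetical, simplified view of one nfsd TCP socket."""
    local_ip: str
    peer_ip: str
    listening: bool = False


def select_for_close(sockets, floating_ip):
    """Pick the established sockets whose local address is the migrating IP.

    Listening sockets and connections on other local addresses are left
    alone, so services that stay on this node keep their connections.
    """
    return [s for s in sockets
            if not s.listening and s.local_ip == floating_ip]


sockets = [
    NfsdSocket("0.0.0.0", "", listening=True),   # the shared listener
    NfsdSocket("10.20.30.40", "192.168.1.5"),    # service being failed over
    NfsdSocket("10.20.30.40", "192.168.1.6"),    # same service, second client
    NfsdSocket("10.20.30.41", "192.168.1.5"),    # service staying on this node
]

to_close = select_for_close(sockets, "10.20.30.40")
```

Only the two connections on 10.20.30.40 are selected; the listener and the connection on 10.20.30.41 are untouched, which is the property that a signal-based "close everything" approach would not have.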
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? [not found] ` <20080609114909.131cfaef-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2008-06-09 16:01 ` Chuck Lever 0 siblings, 0 replies; 38+ messages in thread From: Chuck Lever @ 2008-06-09 16:01 UTC (permalink / raw) To: Jeff Layton; +Cc: Peter Staubach, Neil Horman, linux-nfs, nfsv4, lhh On Mon, Jun 9, 2008 at 11:49 AM, Jeff Layton <jlayton@redhat.com> wrote: > On Mon, 09 Jun 2008 11:37:27 -0400 > Peter Staubach <staubach@redhat.com> wrote: > >> Neil Horman wrote: >> > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: >> > >> >> Jeff Layton wrote: >> >> >> >>> Apologies for the long email, but I ran into an interesting problem the >> >>> other day and am looking for some feedback on my general approach to >> >>> fixing it before I spend too much time on it: >> >>> >> >>> We (RH) have a cluster-suite product that some people use for making HA >> >>> NFS services. When our QA folks test this, they often will start up >> >>> some operations that do activity on an NFS mount from the cluster and >> >>> then rapidly do failovers between cluster machines and make sure >> >>> everything keeps moving along. The cluster is designed to not shut down >> >>> nfsd's when a failover occurs. nfsd's are considered a "shared >> >>> resource". It's possible that there could be multiple clustered >> >>> services for NFS-sharing, so when a failover occurs, we just manipulate >> >>> the exports table. >> >>> >> >>> The problem we've run into is that occasionally they fail over to the >> >>> alternate machine and then back very rapidly. Because nfsd's are not >> >>> shut down on failover, sockets are not closed. 
So what happens is >> >>> something like this on TCP mounts: >> >>> >> >>> - client has NFS mount from clustered NFS service on one server >> >>> >> >>> - service fails over, new server doesn't know anything about the >> >>> existing socket, so it sends a RST back to the client when data >> >>> comes in. Client closes connection and reopens it and does some >> >>> I/O on the socket. >> >>> >> >>> - service fails back to original server. The original socket there >> >>> is still open, but now the TCP sequence numbers are off. When >> >>> packets come into the server we end up with an ACK storm, and the >> >>> client hangs for a long time. >> >>> >> >>> Neil Horman did a good writeup of this problem here for those that >> >>> want the gory details: >> >>> >> >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 >> >>> >> >>> I can think of 3 ways to fix this: >> >>> >> >>> 1) Add something like the recently added "unlock_ip" interface that >> >>> was added for NLM. Maybe a "close_ip" that allows us to close all >> >>> nfsd sockets connected to a given local IP address. So clustering >> >>> software could do something like: >> >>> >> >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip >> >>> >> >>> ...and make sure that all of the sockets are closed. >> >>> >> >>> 2) just use the same "unlock_ip" interface and just have it also >> >>> close sockets in addition to dropping locks. >> >>> >> >>> 3) have an nfsd close all non-listening connections when it gets a >> >>> certain signal (maybe SIGUSR1 or something). Connections on a >> >>> sockets that aren't failing over should just get a RST and would >> >>> reopen their connections. >> >>> >> >>> ...my preference would probably be approach #1. >> >>> >> >>> I've only really done some rudimentary perusing of the code, so there >> >>> may be roadblocks with some of these approaches I haven't considered. >> >>> Does anyone have thoughts on the general problem or idea for a solution? 
>> >>> >> >>> The situation is a bit specific to failover testing -- most people failing >> >>> over don't do it so rapidly, but we'd still like to ensure that this >> >>> problem doesn't occur if someone does do it. >> >>> >> >>> Thanks, >> >>> >> >>> >> >> This doesn't sound like it would be an NFS specific situation. >> >> Why doesn't TCP handle this, without causing an ACK storm? >> >> >> >> >> > >> > You're right, its not a problem specific to NFS, any TCP based service in which >> > sockets are not explicitly closed on the application are subject to this >> > problem. however, I think NFS is currently the only clustered service that we >> > offer in which we explicitly leave nfsd running during such a 'soft' failover, >> > and so practically speaking, this is the only place that this issue manifests >> > itself. If we could shut down nfsd on the server doing a failover, that would >> > solve this problem (as it prevents the problem with all other clustered tcp >> > based services), but from what I'm told, thats a non-starter. >> > >> > >> >> I think that this last would be a good thing to pursue anyway, >> or at least be able to understand why it would be considered to >> be a "non-starter". When failing away a service, why not stop >> the service on the original node? >> > > Suppose you have more than one "NFS service". People do occasionally set > up NFS exports in separate services. Also, there's the possibility of a > mix of clustered + non-clustered exports. So shutting down nfsd could > disrupt NFS services on any IP addresses that remain on the box. > > That said, we could maybe shut down nfsd and trust that retransmissions > will take care of the problem. That could be racy though. In that case, it might make sense to have an nfsd-specific mechanism that allows you to fence exports instead of whole servers. 
--
I am certain that these presidents will understand the cry of the people of Bolivia, of the people of Latin America and the whole world, which wants to have more food and not more cars. First food, then if something's left over, more cars, more automobiles. I think that life has to come first.
-- Evo Morales
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:37 ` Peter Staubach 2008-06-09 15:49 ` Jeff Layton @ 2008-06-09 16:04 ` Neil Horman 1 sibling, 0 replies; 38+ messages in thread From: Neil Horman @ 2008-06-09 16:04 UTC (permalink / raw) To: Peter Staubach; +Cc: Neil Horman, Jeff Layton, linux-nfs, nfsv4, lhh On Mon, Jun 09, 2008 at 11:37:27AM -0400, Peter Staubach wrote: > Neil Horman wrote: > >On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: > > > >>Jeff Layton wrote: > >> > >>>Apologies for the long email, but I ran into an interesting problem the > >>>other day and am looking for some feedback on my general approach to > >>>fixing it before I spend too much time on it: > >>> > >>>We (RH) have a cluster-suite product that some people use for making HA > >>>NFS services. When our QA folks test this, they often will start up > >>>some operations that do activity on an NFS mount from the cluster and > >>>then rapidly do failovers between cluster machines and make sure > >>>everything keeps moving along. The cluster is designed to not shut down > >>>nfsd's when a failover occurs. nfsd's are considered a "shared > >>>resource". It's possible that there could be multiple clustered > >>>services for NFS-sharing, so when a failover occurs, we just manipulate > >>>the exports table. > >>> > >>>The problem we've run into is that occasionally they fail over to the > >>>alternate machine and then back very rapidly. Because nfsd's are not > >>>shut down on failover, sockets are not closed. So what happens is > >>>something like this on TCP mounts: > >>> > >>>- client has NFS mount from clustered NFS service on one server > >>> > >>>- service fails over, new server doesn't know anything about the > >>> existing socket, so it sends a RST back to the client when data > >>> comes in. Client closes connection and reopens it and does some > >>> I/O on the socket. 
> >>> > >>>- service fails back to original server. The original socket there > >>> is still open, but now the TCP sequence numbers are off. When > >>> packets come into the server we end up with an ACK storm, and the > >>> client hangs for a long time. > >>> > >>>Neil Horman did a good writeup of this problem here for those that > >>>want the gory details: > >>> > >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 > >>> > >>>I can think of 3 ways to fix this: > >>> > >>>1) Add something like the recently added "unlock_ip" interface that > >>>was added for NLM. Maybe a "close_ip" that allows us to close all > >>>nfsd sockets connected to a given local IP address. So clustering > >>>software could do something like: > >>> > >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > >>> > >>>...and make sure that all of the sockets are closed. > >>> > >>>2) just use the same "unlock_ip" interface and just have it also > >>>close sockets in addition to dropping locks. > >>> > >>>3) have an nfsd close all non-listening connections when it gets a > >>>certain signal (maybe SIGUSR1 or something). Connections on a > >>>sockets that aren't failing over should just get a RST and would > >>>reopen their connections. > >>> > >>>...my preference would probably be approach #1. > >>> > >>>I've only really done some rudimentary perusing of the code, so there > >>>may be roadblocks with some of these approaches I haven't considered. > >>>Does anyone have thoughts on the general problem or idea for a solution? > >>> > >>>The situation is a bit specific to failover testing -- most people > >>>failing > >>>over don't do it so rapidly, but we'd still like to ensure that this > >>>problem doesn't occur if someone does do it. > >>> > >>>Thanks, > >>> > >>> > >>This doesn't sound like it would be an NFS specific situation. > >>Why doesn't TCP handle this, without causing an ACK storm? 
> >> > >> > > > >You're right, its not a problem specific to NFS, any TCP based service in > >which > >sockets are not explicitly closed on the application are subject to this > >problem. however, I think NFS is currently the only clustered service > >that we > >offer in which we explicitly leave nfsd running during such a 'soft' > >failover, > >and so practically speaking, this is the only place that this issue > >manifests > >itself. If we could shut down nfsd on the server doing a failover, that > >would > >solve this problem (as it prevents the problem with all other clustered tcp > >based services), but from what I'm told, thats a non-starter. > > > > > > I think that this last would be a good thing to pursue anyway, > or at least be able to understand why it would be considered to > be a "non-starter". When failing away a service, why not stop > the service on the original node? > > These floating virtual IP and ARP games can get tricky to handle > in the boundary cases like this sort of one. > > >As for why TCP doesnt handle this, thats because the situation is > >ambiguous from > >the point of view of the client and server. The write up in the bugzilla > >has > >all the gory details, but the executive summary is that during rapid > >failover, > >the client will ack some data to server A in the cluster, and some to > >server B > >in the cluster. If you quickly fail over and back between the servers in > >the > >cluster, each server will see some gaps in the data stream sequence > >numbers, but > >the client will see that all data has been acked. This leaves the > >connection in > >an unrecoverable state. > > I would wonder what happens if we stick some other NFS/RPC/TCP/IP > implementation into the situation. I wonder if it would see and > generate the same situation? > > ps I can only imagine it would. 
The problem doesn't stem from any particular idiosyncrasy in the provided nfsd, but rather from the fact that the nfsd is kept running on both servers between failovers.

Neil

--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:23 ` Neil Horman 2008-06-09 15:37 ` Peter Staubach @ 2008-06-09 15:46 ` Chuck Lever 2008-06-09 16:00 ` Peter Staubach 2 siblings, 0 replies; 38+ messages in thread From: Chuck Lever @ 2008-06-09 15:46 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, Jeff Layton On Mon, Jun 9, 2008 at 11:23 AM, Neil Horman <nhorman@redhat.com> wrote: > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: >> Jeff Layton wrote: >> >Apologies for the long email, but I ran into an interesting problem the >> >other day and am looking for some feedback on my general approach to >> >fixing it before I spend too much time on it: >> > >> >We (RH) have a cluster-suite product that some people use for making HA >> >NFS services. When our QA folks test this, they often will start up >> >some operations that do activity on an NFS mount from the cluster and >> >then rapidly do failovers between cluster machines and make sure >> >everything keeps moving along. The cluster is designed to not shut down >> >nfsd's when a failover occurs. nfsd's are considered a "shared >> >resource". It's possible that there could be multiple clustered >> >services for NFS-sharing, so when a failover occurs, we just manipulate >> >the exports table. >> > >> >The problem we've run into is that occasionally they fail over to the >> >alternate machine and then back very rapidly. Because nfsd's are not >> >shut down on failover, sockets are not closed. So what happens is >> >something like this on TCP mounts: >> > >> >- client has NFS mount from clustered NFS service on one server >> > >> >- service fails over, new server doesn't know anything about the >> > existing socket, so it sends a RST back to the client when data >> > comes in. Client closes connection and reopens it and does some >> > I/O on the socket. >> > >> >- service fails back to original server. 
The original socket there >> > is still open, but now the TCP sequence numbers are off. When >> > packets come into the server we end up with an ACK storm, and the >> > client hangs for a long time. >> > >> >Neil Horman did a good writeup of this problem here for those that >> >want the gory details: >> > >> > https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 >> > >> >I can think of 3 ways to fix this: >> > >> >1) Add something like the recently added "unlock_ip" interface that >> >was added for NLM. Maybe a "close_ip" that allows us to close all >> >nfsd sockets connected to a given local IP address. So clustering >> >software could do something like: >> > >> > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip >> > >> >...and make sure that all of the sockets are closed. >> > >> >2) just use the same "unlock_ip" interface and just have it also >> >close sockets in addition to dropping locks. >> > >> >3) have an nfsd close all non-listening connections when it gets a >> >certain signal (maybe SIGUSR1 or something). Connections on a >> >sockets that aren't failing over should just get a RST and would >> >reopen their connections. >> > >> >...my preference would probably be approach #1. >> > >> >I've only really done some rudimentary perusing of the code, so there >> >may be roadblocks with some of these approaches I haven't considered. >> >Does anyone have thoughts on the general problem or idea for a solution? >> > >> >The situation is a bit specific to failover testing -- most people failing >> >over don't do it so rapidly, but we'd still like to ensure that this >> >problem doesn't occur if someone does do it. >> > >> >Thanks, >> > >> >> This doesn't sound like it would be an NFS specific situation. >> Why doesn't TCP handle this, without causing an ACK storm? The NetApp guys can tell you all kinds of horror stories about filer cluster failover and TCP. The servers must stop responding to client requests and to client connection attempts during the failover. 
Some clients are not smart enough to delay their reconnect attempt and will hammer the server until it finally responds. That is probably part of the reason for the "ACK storm".

You also have a problem with what to do about your server's DRC (duplicate reply cache). During the failover, some requests may get through to the failing server, and may be executed and retired, but the reply never gets back to the client because the socket is torn down.

So the best bet for something like this, if you can't shut down the nfsd, is to fence the failing server from the network and from back-end storage. Something like iptables will not be adequate to handle the NFS/RPC idempotency issues.

> You're right, its not a problem specific to NFS, any TCP based service in which
> sockets are not explicitly closed on the application are subject to this
> problem. however, I think NFS is currently the only clustered service that we
> offer in which we explicitly leave nfsd running during such a 'soft' failover,
> and so practically speaking, this is the only place that this issue manifests
> itself. If we could shut down nfsd on the server doing a failover, that would
> solve this problem (as it prevents the problem with all other clustered tcp
> based services), but from what I'm told, thats a non-starter.
>
> As for why TCP doesnt handle this, thats because the situation is ambiguous from
> the point of view of the client and server. The write up in the bugzilla has
> all the gory details, but the executive summary is that during rapid failover,
> the client will ack some data to server A in the cluster, and some to server B
> in the cluster. If you quickly fail over and back between the servers in the
> cluster, each server will see some gaps in the data stream sequence numbers, but
> the client will see that all data has been acked. This leaves the connection in
> an unrecoverable state.

-- I am certain that these presidents will understand the cry of the people of Bolivia, of the people of Latin America and the whole world, which wants to have more food and not more cars. First food, then if something's left over, more cars, more automobiles. I think that life has to come first. -- Evo Morales _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 38+ messages in thread
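As a rough illustration of the DRC concern raised above, the following is a minimal sketch, assuming each cluster node keeps its own private duplicate reply cache while sharing back-end storage. The names (ToyServer, handle_remove, demo) are hypothetical and are not nfsd code; the point is only that a retransmitted non-idempotent request lands on a server that has never seen its XID.

```python
# Toy model of a per-server duplicate reply cache (DRC). All names are
# hypothetical; this only illustrates the failure mode, not nfsd itself.

class ToyServer:
    def __init__(self, name, shared_files):
        self.name = name
        self.files = shared_files   # back-end storage shared by the cluster
        self.drc = {}               # (client, xid) -> cached reply

    def handle_remove(self, client, xid, filename):
        key = (client, xid)
        if key in self.drc:         # retransmission: replay the cached reply
            return self.drc[key]
        if filename in self.files:
            self.files.remove(filename)
            reply = "OK"
        else:
            reply = "ENOENT"
        self.drc[key] = reply
        return reply


def demo():
    storage = {"data.txt"}
    server_a = ToyServer("A", storage)
    server_b = ToyServer("B", storage)

    # The request executes on A, but assume the reply is lost in the failover.
    lost_reply = server_a.handle_remove("client1", 42, "data.txt")

    # The client retransmits the same XID to B, whose DRC is empty.
    retry_on_b = server_b.handle_remove("client1", 42, "data.txt")

    # Had the retry reached A, the cached reply would have been replayed.
    retry_on_a = server_a.handle_remove("client1", 42, "data.txt")
    return lost_reply, retry_on_b, retry_on_a
```

In this toy model the client sees a spurious ENOENT for an operation that actually succeeded, which is the kind of idempotency problem that packet filtering alone would not address.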
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:23 ` Neil Horman 2008-06-09 15:37 ` Peter Staubach 2008-06-09 15:46 ` Chuck Lever @ 2008-06-09 16:00 ` Peter Staubach 2008-06-09 16:24 ` Neil Horman 2 siblings, 1 reply; 38+ messages in thread From: Peter Staubach @ 2008-06-09 16:00 UTC (permalink / raw) To: Neil Horman; +Cc: linux-nfs, lhh, nfsv4, Jeff Layton Neil Horman wrote: > On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: > >> Jeff Layton wrote: >> >>> Apologies for the long email, but I ran into an interesting problem the >>> other day and am looking for some feedback on my general approach to >>> fixing it before I spend too much time on it: >>> >>> We (RH) have a cluster-suite product that some people use for making HA >>> NFS services. When our QA folks test this, they often will start up >>> some operations that do activity on an NFS mount from the cluster and >>> then rapidly do failovers between cluster machines and make sure >>> everything keeps moving along. The cluster is designed to not shut down >>> nfsd's when a failover occurs. nfsd's are considered a "shared >>> resource". It's possible that there could be multiple clustered >>> services for NFS-sharing, so when a failover occurs, we just manipulate >>> the exports table. >>> >>> The problem we've run into is that occasionally they fail over to the >>> alternate machine and then back very rapidly. Because nfsd's are not >>> shut down on failover, sockets are not closed. So what happens is >>> something like this on TCP mounts: >>> >>> - client has NFS mount from clustered NFS service on one server >>> >>> - service fails over, new server doesn't know anything about the >>> existing socket, so it sends a RST back to the client when data >>> comes in. Client closes connection and reopens it and does some >>> I/O on the socket. >>> >>> - service fails back to original server. 
The original socket there >>> is still open, but now the TCP sequence numbers are off. When >>> packets come into the server we end up with an ACK storm, and the >>> client hangs for a long time. >>> >>> Neil Horman did a good writeup of this problem here for those that >>> want the gory details: >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 >>> >>> I can think of 3 ways to fix this: >>> >>> 1) Add something like the recently added "unlock_ip" interface that >>> was added for NLM. Maybe a "close_ip" that allows us to close all >>> nfsd sockets connected to a given local IP address. So clustering >>> software could do something like: >>> >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip >>> >>> ...and make sure that all of the sockets are closed. >>> >>> 2) just use the same "unlock_ip" interface and just have it also >>> close sockets in addition to dropping locks. >>> >>> 3) have an nfsd close all non-listening connections when it gets a >>> certain signal (maybe SIGUSR1 or something). Connections on a >>> sockets that aren't failing over should just get a RST and would >>> reopen their connections. >>> >>> ...my preference would probably be approach #1. >>> >>> I've only really done some rudimentary perusing of the code, so there >>> may be roadblocks with some of these approaches I haven't considered. >>> Does anyone have thoughts on the general problem or idea for a solution? >>> >>> The situation is a bit specific to failover testing -- most people failing >>> over don't do it so rapidly, but we'd still like to ensure that this >>> problem doesn't occur if someone does do it. >>> >>> Thanks, >>> >>> >> This doesn't sound like it would be an NFS specific situation. >> Why doesn't TCP handle this, without causing an ACK storm? >> >> > > You're right, its not a problem specific to NFS, any TCP based service in which > sockets are not explicitly closed on the application are subject to this > problem. 
however, I think NFS is currently the only clustered service that we > offer in which we explicitly leave nfsd running during such a 'soft' failover, > and so practically speaking, this is the only place that this issue manifests > itself. If we could shut down nfsd on the server doing a failover, that would > solve this problem (as it prevents the problem with all other clustered tcp > based services), but from what I'm told, thats a non-starter. > > As for why TCP doesnt handle this, thats because the situation is ambiguous from > the point of view of the client and server. The write up in the bugzilla has > all the gory details, but the executive summary is that during rapid failover, > the client will ack some data to server A in the cluster, and some to server B > in the cluster. If you quickly fail over and back between the servers in the > cluster, each server will see some gaps in the data stream sequence numbers, but > the client will see that all data has been acked. This leaves the connection in > an unrecoverable state. This doesn't seem so ambiguous from the client's viewpoint to me. The server sends back an ACK for a sequence number which is less than the beginning sequence number that the client has to retransmit. Shouldn't that imply a problem to the client and cause the TCP on the client to give up and return an error to the caller, in this case the RPC? Can there be gaps in sequence numbers? Thanx... ps _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 16:00 ` Peter Staubach @ 2008-06-09 16:24 ` Neil Horman 0 siblings, 0 replies; 38+ messages in thread From: Neil Horman @ 2008-06-09 16:24 UTC (permalink / raw) To: Peter Staubach; +Cc: Neil Horman, lhh, nfsv4, linux-nfs, Jeff Layton On Mon, Jun 09, 2008 at 12:00:37PM -0400, Peter Staubach wrote: > Neil Horman wrote: > >On Mon, Jun 09, 2008 at 11:03:53AM -0400, Peter Staubach wrote: > > > >>Jeff Layton wrote: > >> > >>>Apologies for the long email, but I ran into an interesting problem the > >>>other day and am looking for some feedback on my general approach to > >>>fixing it before I spend too much time on it: > >>> > >>>We (RH) have a cluster-suite product that some people use for making HA > >>>NFS services. When our QA folks test this, they often will start up > >>>some operations that do activity on an NFS mount from the cluster and > >>>then rapidly do failovers between cluster machines and make sure > >>>everything keeps moving along. The cluster is designed to not shut down > >>>nfsd's when a failover occurs. nfsd's are considered a "shared > >>>resource". It's possible that there could be multiple clustered > >>>services for NFS-sharing, so when a failover occurs, we just manipulate > >>>the exports table. > >>> > >>>The problem we've run into is that occasionally they fail over to the > >>>alternate machine and then back very rapidly. Because nfsd's are not > >>>shut down on failover, sockets are not closed. So what happens is > >>>something like this on TCP mounts: > >>> > >>>- client has NFS mount from clustered NFS service on one server > >>> > >>>- service fails over, new server doesn't know anything about the > >>> existing socket, so it sends a RST back to the client when data > >>> comes in. Client closes connection and reopens it and does some > >>> I/O on the socket. > >>> > >>>- service fails back to original server. 
The original socket there > >>> is still open, but now the TCP sequence numbers are off. When > >>> packets come into the server we end up with an ACK storm, and the > >>> client hangs for a long time. > >>> > >>>Neil Horman did a good writeup of this problem here for those that > >>>want the gory details: > >>> > >>> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16 > >>> > >>>I can think of 3 ways to fix this: > >>> > >>>1) Add something like the recently added "unlock_ip" interface that > >>>was added for NLM. Maybe a "close_ip" that allows us to close all > >>>nfsd sockets connected to a given local IP address. So clustering > >>>software could do something like: > >>> > >>> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > >>> > >>>...and make sure that all of the sockets are closed. > >>> > >>>2) just use the same "unlock_ip" interface and just have it also > >>>close sockets in addition to dropping locks. > >>> > >>>3) have an nfsd close all non-listening connections when it gets a > >>>certain signal (maybe SIGUSR1 or something). Connections on a > >>>sockets that aren't failing over should just get a RST and would > >>>reopen their connections. > >>> > >>>...my preference would probably be approach #1. > >>> > >>>I've only really done some rudimentary perusing of the code, so there > >>>may be roadblocks with some of these approaches I haven't considered. > >>>Does anyone have thoughts on the general problem or idea for a solution? > >>> > >>>The situation is a bit specific to failover testing -- most people > >>>failing > >>>over don't do it so rapidly, but we'd still like to ensure that this > >>>problem doesn't occur if someone does do it. > >>> > >>>Thanks, > >>> > >>> > >>This doesn't sound like it would be an NFS specific situation. > >>Why doesn't TCP handle this, without causing an ACK storm? 
> >> > >> > > > >You're right, its not a problem specific to NFS, any TCP based service in > >which > >sockets are not explicitly closed on the application are subject to this > >problem. however, I think NFS is currently the only clustered service > >that we > >offer in which we explicitly leave nfsd running during such a 'soft' > >failover, > >and so practically speaking, this is the only place that this issue > >manifests > >itself. If we could shut down nfsd on the server doing a failover, that > >would > >solve this problem (as it prevents the problem with all other clustered tcp > >based services), but from what I'm told, thats a non-starter. > > > >As for why TCP doesnt handle this, thats because the situation is > >ambiguous from > >the point of view of the client and server. The write up in the bugzilla > >has > >all the gory details, but the executive summary is that during rapid > >failover, > >the client will ack some data to server A in the cluster, and some to > >server B > >in the cluster. If you quickly fail over and back between the servers in > >the > >cluster, each server will see some gaps in the data stream sequence > >numbers, but > >the client will see that all data has been acked. This leaves the > >connection in > >an unrecoverable state. > > This doesn't seem so ambiguous from the client's viewpoint to me. > > The server sends back an ACK for a sequence number which is less > than the beginning sequence number that the client has to > retransmit. Shouldn't that imply a problem to the client and > cause the TCP on the client to give up and return an error to > the caller, in this case the RPC? > > Can there be gaps in sequence numbers? > No there can't be gaps in sequence numbers, but the fact that there are on a given connection is in fact ambiguous. See RFC 793 page 36/37 for a more detailed explination. 
The RFC mandates that, in response to an out-of-range sequence number on
an established connection, the peer can only respond with an empty ACK
containing the next available send-sequence number. The problem lies in
the fact that, due to the failover and failback, the peers have differing
views of what state the connection is in.

The NFS client has, at the time this problem occurs, seen ACKs for all
the data it has sent. As such, it now sees this ACK that is backward in
time and assumes that the frame somehow got lost in the network and only
just arrived, after all the subsequent frames did. The appropriate thing,
per the RFC, is to ignore it and send an ACK reminding the peer of where
it is in sequence.

The NFS server, on the other hand, is in fact missing a chunk of sequence
numbers, which were acked by the other server in the cluster during the
failover/failback period. So it legitimately thinks that some set of
sequence numbers got dropped, and it can't continue until it has them.
The only thing it can do is continue to ACK its last seen sequence
number, hoping that the client will retransmit them (which it should,
because as far as this server is concerned, it never acked them).

There could be an argument made, I suppose, for adding some sort of knob
to set a threshold for this particular behavior (X data-less ACKs in Y
amount of time == RST, or some such), but I'm sure that won't get much
upstream traction (at least I won't propose it), since the knob would
violate the RFC, possibly reset legitimate connections (think keepalive
frames), and really only solve a problem that is manufactured by keeping
processes alive (albeit apparently necessarily) in such a way that two
systems share a TCP connection.

Regards
Neil

> Thanx...
>
> ps

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman@redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
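The exchange Neil describes can be sketched with a deliberately simplified model. The assumptions here are mine: only the RFC 793 rule that an out-of-window segment on a synchronized connection elicits an empty ACK is modelled, with no timers, no retransmission, no acknowledgment-number checks, and no sequence wrap-around. Endpoint and count_acks are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    snd_nxt: int          # next sequence number this side will send
    rcv_nxt: int          # next sequence number this side expects to receive
    rcv_wnd: int = 65535  # receive window size

    def acceptable(self, seg_seq: int) -> bool:
        # Simplified window test for a zero-length segment (no wrap-around).
        return self.rcv_nxt <= seg_seq < self.rcv_nxt + self.rcv_wnd

    def empty_ack(self) -> tuple:
        # (seq, ack) of a data-less ACK stating this side's current view.
        return (self.snd_nxt, self.rcv_nxt)


def count_acks(client: Endpoint, server: Endpoint, max_segments: int = 20) -> int:
    """Count empty ACKs exchanged before both sides go quiet, or we give up."""
    sender, receiver = client, server
    seq, _ack = sender.empty_ack()
    sent = 1
    while sent < max_segments:
        if receiver.acceptable(seq):
            break  # in-window pure ACK: nothing to answer
        # Out-of-window segment: the receiver may only answer with an empty ACK.
        sender, receiver = receiver, sender
        seq, _ack = sender.empty_ack()
        sent += 1
    return sent


# Both sides agree on the connection state: one segment, then silence.
in_sync = count_acks(Endpoint("client", 1000, 5000),
                     Endpoint("server", 5000, 1000))

# Stale server socket versus a client whose numbers have moved on:
# every empty ACK is out of window for the other side, so the loop
# only stops because the model caps it at max_segments.
desynced = count_acks(Endpoint("client", 900000, 700000),
                      Endpoint("stale-server", 5000, 1000))
```

Neither side ever sends a RST in this model, which matches the point above that a threshold knob would have to go beyond what the RFC allows.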
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 14:31 rapid clustered nfs server failover and hung clients -- how best to close the sockets? Jeff Layton
[not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-09 15:51 ` J. Bruce Fields
2008-06-09 16:02 ` Jeff Layton
2008-06-09 17:14 ` Wendy Cheng
2 siblings, 1 reply; 38+ messages in thread
From: J. Bruce Fields @ 2008-06-09 15:51 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote:
> I can think of 3 ways to fix this:
>
> 1) Add something like the recently added "unlock_ip" interface that
> was added for NLM. Maybe a "close_ip" that allows us to close all
> nfsd sockets connected to a given local IP address. So clustering
> software could do something like:
>
>     # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>
> ...and make sure that all of the sockets are closed.
>
> 2) just use the same "unlock_ip" interface and just have it also
> close sockets in addition to dropping locks.
>
> 3) have an nfsd close all non-listening connections when it gets a
> certain signal (maybe SIGUSR1 or something). Connections on a
> sockets that aren't failing over should just get a RST and would
> reopen their connections.
>
> ...my preference would probably be approach #1.

What do you see as the advantage of #1 over #2? Are there cases where
someone would want to drop locks but not also close connections (or
vice-versa)?

--b.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 15:51 ` J. Bruce Fields @ 2008-06-09 16:02 ` Jeff Layton 2008-06-09 17:23 ` J. Bruce Fields 0 siblings, 1 reply; 38+ messages in thread From: Jeff Layton @ 2008-06-09 16:02 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, 9 Jun 2008 11:51:36 -0400 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote: > > I can think of 3 ways to fix this: > > > > 1) Add something like the recently added "unlock_ip" interface that > > was added for NLM. Maybe a "close_ip" that allows us to close all > > nfsd sockets connected to a given local IP address. So clustering > > software could do something like: > > > > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > > > > ...and make sure that all of the sockets are closed. > > > > 2) just use the same "unlock_ip" interface and just have it also > > close sockets in addition to dropping locks. > > > > 3) have an nfsd close all non-listening connections when it gets a > > certain signal (maybe SIGUSR1 or something). Connections on a > > sockets that aren't failing over should just get a RST and would > > reopen their connections. > > > > ...my preference would probably be approach #1. > > What do you see as the advantage of #1 over #2? Are there cases where > someone would want to drop locks but not also close connections (or > vice-versa)? > There's no real advantage that I can see (maybe if they're running a cluster with no NLM services somehow). Mostly that "unlock_ip" seems to imply that it deals with locking, and this doesn't. I'd be OK with #2 if it's a reasonable solution. Given what Chuck mentioned, it sounds like we'll also need to take care to make sure that existing calls complete and the replies get flushed out too, so this could be more complicated that I had anticipated. 
--
Jeff Layton <jlayton@redhat.com>
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 16:02 ` Jeff Layton @ 2008-06-09 17:23 ` J. Bruce Fields 2008-06-09 19:10 ` Jeff Layton 0 siblings, 1 reply; 38+ messages in thread From: J. Bruce Fields @ 2008-06-09 17:23 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, Jun 09, 2008 at 12:02:43PM -0400, Jeff Layton wrote: > On Mon, 9 Jun 2008 11:51:36 -0400 > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote: > > > I can think of 3 ways to fix this: > > > > > > 1) Add something like the recently added "unlock_ip" interface that > > > was added for NLM. Maybe a "close_ip" that allows us to close all > > > nfsd sockets connected to a given local IP address. So clustering > > > software could do something like: > > > > > > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > > > > > > ...and make sure that all of the sockets are closed. > > > > > > 2) just use the same "unlock_ip" interface and just have it also > > > close sockets in addition to dropping locks. > > > > > > 3) have an nfsd close all non-listening connections when it gets a > > > certain signal (maybe SIGUSR1 or something). Connections on a > > > sockets that aren't failing over should just get a RST and would > > > reopen their connections. > > > > > > ...my preference would probably be approach #1. > > > > What do you see as the advantage of #1 over #2? Are there cases where > > someone would want to drop locks but not also close connections (or > > vice-versa)? > > > > There's no real advantage that I can see (maybe if they're running a > cluster with no NLM services somehow). Mostly that "unlock_ip" seems to > imply that it deals with locking, and this doesn't. I'd be OK with #2 > if it's a reasonable solution. 
Given what Chuck mentioned, it sounds
> like we'll also need to take care to make sure that existing calls
> complete and the replies get flushed out too, so this could be more
> complicated that I had anticipated.

It seems to me that in the long run what we'd like is a virtualized NFS
service--you should be able to start and stop independent "servers"
hosted on a single kernel, and to clients they should look like
completely independent servers.

And I guess the question is how little "virtualization" you can get away
with and still have the whole thing work.

But anyway, ideally I think there'd be a single interface that says
"shut down the nfs service provided via server ip x.y.z.w, for possible
migration to another host". That's the only operation anyone really
wants to do--independent control over the tcp connections, and the locks,
and the rpc cache, and whatever else needs to be dealt with, sounds
unlikely to be useful.

--b.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 17:23 ` J. Bruce Fields @ 2008-06-09 19:10 ` Jeff Layton 2008-06-09 20:19 ` Lon Hohberger 0 siblings, 1 reply; 38+ messages in thread From: Jeff Layton @ 2008-06-09 19:10 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, 9 Jun 2008 13:23:13 -0400 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Mon, Jun 09, 2008 at 12:02:43PM -0400, Jeff Layton wrote: > > On Mon, 9 Jun 2008 11:51:36 -0400 > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > On Mon, Jun 09, 2008 at 10:31:37AM -0400, Jeff Layton wrote: > > > > I can think of 3 ways to fix this: > > > > > > > > 1) Add something like the recently added "unlock_ip" interface that > > > > was added for NLM. Maybe a "close_ip" that allows us to close all > > > > nfsd sockets connected to a given local IP address. So clustering > > > > software could do something like: > > > > > > > > # echo 10.20.30.40 > /proc/fs/nfsd/close_ip > > > > > > > > ...and make sure that all of the sockets are closed. > > > > > > > > 2) just use the same "unlock_ip" interface and just have it also > > > > close sockets in addition to dropping locks. > > > > > > > > 3) have an nfsd close all non-listening connections when it gets a > > > > certain signal (maybe SIGUSR1 or something). Connections on a > > > > sockets that aren't failing over should just get a RST and would > > > > reopen their connections. > > > > > > > > ...my preference would probably be approach #1. > > > > > > What do you see as the advantage of #1 over #2? Are there cases where > > > someone would want to drop locks but not also close connections (or > > > vice-versa)? > > > > > > > There's no real advantage that I can see (maybe if they're running a > > cluster with no NLM services somehow). Mostly that "unlock_ip" seems to > > imply that it deals with locking, and this doesn't. 
I'd be OK with #2
> > if it's a reasonable solution. Given what Chuck mentioned, it sounds
> > like we'll also need to take care to make sure that existing calls
> > complete and the replies get flushed out too, so this could be more
> > complicated that I had anticipated.
>
> It seems to me that in the long run what we'd like is a virtualized NFS
> service--you should be able to start and stop independent "servers"
> hosted on a single kernel, and to clients they should look like
> completely independent servers.
>
> And I guess the question is how little "virtualization" you can get away
> with and still have the whole thing work.

Yep. That was Lon's exact question. Could we start nfsd's that just
work for certain exports? The answer (of course) is currently no.

As an idle side thought, I wonder whether/how we could make nfsd
containerized? I wonder if it's possible to run a local nfsd in a
Solaris zone/container thingy.

>
> But anyway, ideally I think there'd be a single interface that says
> "shut down the nfs service provided via server ip x.y.z.w, for possible
> migration to another host". That's the only operation anyone really
> want to do--independent control over the tcp connections, and the locks,
> and the rpc cache, and whatever else needs to be dealt with, sounds
> unlikely to be useful.
>

Ok. When I get some time to work on this, I'll plan to work on hooking
into the current unlock_ip interface rather than creating a new
procfile. That does seem to make the most sense, though the name
"unlock_ip" might not really adequately convey what it will now be
doing...

--
Jeff Layton <jlayton@redhat.com>
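For clustering software, the single-interface idea discussed above might be driven roughly as follows. This is a minimal sketch under stated assumptions: the control file path /proc/fs/nfsd/unlock_ip is assumed by analogy with the close_ip example earlier in the thread, and whether writing to it would also close nfsd sockets is precisely what is being proposed, not established behaviour. release_service_ip is a hypothetical helper name.

```python
import ipaddress
from pathlib import Path

# Assumed location of the control file; the socket-closing behaviour is
# the proposal under discussion, not something this sketch can rely on.
UNLOCK_IP = Path("/proc/fs/nfsd/unlock_ip")


def release_service_ip(ip: str, control_file: Path = UNLOCK_IP) -> str:
    """Validate ip, then write it to the control file like `echo ip > file`."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    line = f"{addr}\n"
    with open(control_file, "w") as f:
        f.write(line)
    return line
```

Validating the address before writing keeps a typo in the cluster configuration from reaching the kernel interface; passing control_file explicitly makes the helper easy to exercise against an ordinary file.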
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 19:10 ` Jeff Layton
@ 2008-06-09 20:19 ` Lon Hohberger
0 siblings, 0 replies; 38+ messages in thread
From: Lon Hohberger @ 2008-06-09 20:19 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, nhorman

On Mon, 2008-06-09 at 15:10 -0400, Jeff Layton wrote:
> > It seems to me that in the long run what we'd like is a virtualized NFS
> > service--you should be able to start and stop independent "servers"
> > hosted on a single kernel, and to clients they should look like
> > completely independent servers.
> >
> > And I guess the question is how little "virtualization" you can get away
> > with and still have the whole thing work.
>
> Yep. That was Lon's exact question. Could we start nfsd's that just
> work for certain exports? The answer (of course) is currently no.

s/exports/IP addresses/

-- Lon
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 14:31 rapid clustered nfs server failover and hung clients -- how best to close the sockets? Jeff Layton
[not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-09 15:51 ` J. Bruce Fields
@ 2008-06-09 17:14 ` Wendy Cheng
2008-06-09 17:24 ` Jeff Layton
2008-06-09 18:07 ` Neil Horman
2 siblings, 2 replies; 38+ messages in thread
From: Wendy Cheng @ 2008-06-09 17:14 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman

Jeff Layton wrote:
> The problem we've run into is that occasionally they fail over to the
> alternate machine and then back very rapidly.

It is a well-known issue in the NFS-TCP failover arena (or, more
specifically, for floating-IP applications) that failing over from
server A to server B and then immediately failing back from B to A does
*not* work well. IIRC, in the last round of discussions with Red Hat GPS
and support folks, we concluded that most applications/users *can*
tolerate this restriction.

Maybe another, more basic question: other than QA efforts, are there
real NFSv2/v3 applications depending on this "feature"? Or might this
take tons of effort for something that will not see much use when it is
finally delivered?

-- Wendy
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets? 2008-06-09 17:14 ` Wendy Cheng @ 2008-06-09 17:24 ` Jeff Layton 2008-06-09 17:51 ` Talpey, Thomas 2008-06-09 18:10 ` Neil Horman 2008-06-09 18:07 ` Neil Horman 1 sibling, 2 replies; 38+ messages in thread From: Jeff Layton @ 2008-06-09 17:24 UTC (permalink / raw) To: Wendy Cheng; +Cc: linux-nfs, lhh, nfsv4, nhorman On Mon, 09 Jun 2008 13:14:56 -0400 Wendy Cheng <s.wendy.cheng@gmail.com> wrote: > Jeff Layton wrote: > > The problem we've run into is that occasionally they fail over to the > > alternate machine and then back very rapidly. > > It is a well known issue in the NFS-TCP failover arena (or more > specifically, for floating IP applications) that failover from server A > to server B, then immediately failing back from server B to A would > *not* work well. IIRC last round of discussing with Red Hat GPS and > support folks, we concluded that most of the applications/users *can* > tolerate this restriction. > > Maybe another more basic question: "other than QA efforts, are there > real NFSv2/v3 applications depending on this "feature" ? Or there may > need tons of efforts for something that will not have much usages when > it is finally delivered ? > Certainly a valid question... While rapid failover like this is unusual, it's easily possible for a sysadmin to do it. Maybe they moved the wrong service, or their downtime was for something very brief but the service had to be off of the host to make the change. In that case, a quick failover and back could easily be something that happens in a real environment. As to whether it's worth a ton of effort, that's a tough call. People want HA services to guard against outages. Anything that jeopardizes that is probably worth fixing. This could be solved with documentation, but a note like: "Be sure to wait for X minutes between failovers" ...wouldn't instill me with a lot of confidence. 
We'd have to have some sort of mechanism to enforce this, and that would
be less than ideal.

IMO, the ideal thing would be to make sure that the "old" server is
ready to pick up the service again as soon as possible after the service
leaves it.

--
Jeff Layton <jlayton@redhat.com>
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 17:24 ` Jeff Layton
@ 2008-06-09 17:51 ` Talpey, Thomas
2008-06-09 17:59 ` Talpey, Thomas
2008-06-09 19:01 ` Jeff Layton
2008-06-09 18:10 ` Neil Horman
1 sibling, 2 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 17:51 UTC (permalink / raw)
To: Jeff Layton; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

At 01:24 PM 6/9/2008, Jeff Layton wrote:
>
>"Be sure to wait for X minutes between failovers"

At least one grace period.

>
>...wouldn't instill me with a lot of confidence. We'd have to have
>some sort of mechanism to enforce this, and that would be less than
>ideal.
>
>IMO, the ideal thing would be to make sure that the "old" server is
>ready to pick up the service again as soon as possible after the service
>leaves it.

A great goal, but it seems to me you've bundled a lot of other
incompatible requirements along with it. Having some services
restart and not others, for example. And mixing transparent IP
address takeover with stateful recovery such as TCP reconnect
and NSM/NLM. NSM provides only notification, there's no way for
either server to know for sure all the clients have completed
either switch-to or switch-back.

Of course, you could switch to UDP-only, that would fix the
TCP issue. But it won't fix NSM/NLM.

Tom.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 17:51 ` Talpey, Thomas
@ 2008-06-09 17:59 ` Talpey, Thomas
2008-06-09 19:01 ` Jeff Layton
1 sibling, 0 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 17:59 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, lhh, nfsv4, nhorman, Wendy Cheng

At 01:51 PM 6/9/2008, Talpey, Thomas wrote:
>and NSM/NLM. NSM provides only notification, there's no way for
>either server to know for sure all the clients have completed
>either switch-to or switch-back.

Just in case it helps to understand why relying on NSM is so risky:

<http://www.connectathon.org/talks06/talpey-cthon06-nsm.pdf>

Slides 16, 17 and 23, especially.

Tom.
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 17:51 ` Talpey, Thomas
2008-06-09 17:59 ` Talpey, Thomas
@ 2008-06-09 19:01 ` Jeff Layton
2008-06-09 19:13 ` Talpey, Thomas
1 sibling, 1 reply; 38+ messages in thread
From: Jeff Layton @ 2008-06-09 19:01 UTC (permalink / raw)
To: Talpey, Thomas; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

On Mon, 09 Jun 2008 13:51:05 -0400
"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:

> At 01:24 PM 6/9/2008, Jeff Layton wrote:
> >
> >"Be sure to wait for X minutes between failovers"
>
> At least one grace period.
>

Actually, we have to wait until all of the sockets on the old server
time out. This is difficult to predict and can be quite long.

> >
> >...wouldn't instill me with a lot of confidence. We'd have to have
> >some sort of mechanism to enforce this, and that would be less than
> >ideal.
> >
> >IMO, the ideal thing would be to make sure that the "old" server is
> >ready to pick up the service again as soon as possible after the service
> >leaves it.
>
> A great goal, but it seems to me you've bundled a lot of other
> incompatible requirements along with it. Having some services
> restart and not others, for example. And mixing transparent IP
> address takeover with stateful recovery such as TCP reconnect
> and NSM/NLM. NSM provides only notification, there's no way for
> either server to know for sure all the clients have completed
> either switch-to or switch-back.
>

Thanks for the slides -- very interesting. Yep. NSM is risky, but this is
really the same situation as a solo NFS server spontaneously rebooting.
The failover we're doing is really just simulating that (for the case of
lockd anyway). The unreliability is just an unfortunate fact of life with
NFSv2/3...

> Of course, you could switch to UDP-only, that would fix the
> TCP issue. But it won't fix NSM/NLM.
>

Right. Nothing can really fix that so we just have to make do.

All of the NSM/NLM stuff here is really separate from the main problem
I'm interested in at the moment, which is how to deal with the old, stale
sockets that nfsd has open after the local address disappears.

--
Jeff Layton <jlayton@redhat.com>
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 19:01 ` Jeff Layton
@ 2008-06-09 19:13 ` Talpey, Thomas
0 siblings, 0 replies; 38+ messages in thread
From: Talpey, Thomas @ 2008-06-09 19:13 UTC (permalink / raw)
To: Jeff Layton; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

At 03:01 PM 6/9/2008, Jeff Layton wrote:
>On Mon, 09 Jun 2008 13:51:05 -0400
>"Talpey, Thomas" <Thomas.Talpey@netapp.com> wrote:
>
>> At 01:24 PM 6/9/2008, Jeff Layton wrote:
>> >
>> >"Be sure to wait for X minutes between failovers"
>>
>> At least one grace period.
>>
>
>Actually, we have to wait until all of the sockets on the old server
>time out. This is difficult to predict and can be quite long.

I just gave the floor. The ceiling is yours. :-)

Orphaned server TCP sockets, btw, in general last forever without
keepalive. Even with keepalive, they can last many tens of minutes.

Tom.
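
[Editorial sketch, not part of the original message.] The keepalive remark
above can be made concrete with a little arithmetic. The snippet below is a
minimal illustration in Python, assuming the stock Linux sysctl defaults
(tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9);
the helper names keepalive_detect_seconds and enable_keepalive are
hypothetical. It applies to ordinary userspace sockets only: nfsd's sockets
are owned by the kernel, and nothing here says whether keepalive probes are
of any use once the floating address has left the host.

```python
import socket


def keepalive_detect_seconds(idle, interval, probes):
    """Approximate worst-case time before an idle, orphaned connection
    is declared dead: the idle period, then `probes` unanswered probes
    spaced `interval` seconds apart."""
    return idle + interval * probes


# Assumed Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
DEFAULT_DETECT = keepalive_detect_seconds(7200, 75, 9)  # 7875 s, about 131 minutes


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on keepalive for one socket and, where the platform exposes
    the Linux-specific per-socket knobs, shorten the timers.
    Returns the approximate detection time for the requested settings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # not available on every platform
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return keepalive_detect_seconds(idle, interval, probes)
```

With the assumed defaults the detection time is over two hours, which is
consistent with "many tens of minutes" being an optimistic figure; without
SO_KEEPALIVE set at all there is no timer, hence "forever".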
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 17:24 ` Jeff Layton
2008-06-09 17:51 ` Talpey, Thomas
@ 2008-06-09 18:10 ` Neil Horman
1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 18:10 UTC (permalink / raw)
To: Jeff Layton; +Cc: lhh, linux-nfs, Wendy Cheng, nfsv4, nhorman

On Mon, Jun 09, 2008 at 01:24:25PM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 13:14:56 -0400
> Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
>
> > Jeff Layton wrote:
> > > The problem we've run into is that occasionally they fail over to the
> > > alternate machine and then back very rapidly.
> >
> > It is a well known issue in the NFS-TCP failover arena (or more
> > specifically, for floating IP applications) that failover from server A
> > to server B, then immediately failing back from server B to A would
> > *not* work well. IIRC last round of discussing with Red Hat GPS and
> > support folks, we concluded that most of the applications/users *can*
> > tolerate this restriction.
> >
> > Maybe another more basic question: "other than QA efforts, are there
> > real NFSv2/v3 applications depending on this "feature" ? Or there may
> > need tons of efforts for something that will not have much usages when
> > it is finally delivered ?
> >
>
> Certainly a valid question...
>
> While rapid failover like this is unusual, it's easily possible for a
> sysadmin to do it. Maybe they moved the wrong service, or their downtime
> was for something very brief but the service had to be off of the host to
> make the change. In that case, a quick failover and back could easily
> be something that happens in a real environment.
>
> As to whether it's worth a ton of effort, that's a tough call. People want
> HA services to guard against outages. Anything that jeopardizes that is
> probably worth fixing. This could be solved with documentation, but a note
> like:
>
> "Be sure to wait for X minutes between failovers"
>

That's the real problem here. Given the problem as we've described it, it's
possible for X to be _large_, potentially indefinite.

> IMO, the ideal thing would be to make sure that the "old" server is
> ready to pick up the service again as soon as possible after the service
> leaves it.
>

Yes, this is really what needs to happen. In this environment, a floating IP
address effectively means that nfsd services can inadvertently 'share' a tcp
connection, and if nfsd is to play in a floating IP environment it needs to be
able to handle that sharing...

Neil

> --
> Jeff Layton <jlayton@redhat.com>

--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
2008-06-09 17:14 ` Wendy Cheng
2008-06-09 17:24 ` Jeff Layton
@ 2008-06-09 18:07 ` Neil Horman
1 sibling, 0 replies; 38+ messages in thread
From: Neil Horman @ 2008-06-09 18:07 UTC (permalink / raw)
To: Wendy Cheng; +Cc: linux-nfs, lhh, nfsv4, nhorman, Jeff Layton

On Mon, Jun 09, 2008 at 01:14:56PM -0400, Wendy Cheng wrote:
> Jeff Layton wrote:
> >The problem we've run into is that occasionally they fail over to the
> >alternate machine and then back very rapidly.
>
> It is a well known issue in the NFS-TCP failover arena (or more
> specifically, for floating IP applications) that failover from server A
> to server B, then immediately failing back from server B to A would
> *not* work well. IIRC last round of discussing with Red Hat GPS and
> support folks, we concluded that most of the applications/users *can*
> tolerate this restriction.

I think the big problem here is that this restriction has a window that can be
particularly long lived. If an application doesn't close its sockets, the time
between a failover event and the time when it is safe to fail back is bounded
by the lifetime of the socket on the 'failed' server. Given the right
configuration, this could be indefinite. Worse, you could fail at just the
wrong time after the sequence number wraps completely, and pick up where you
left off, not knowing you lost 4GB of data in the process.

>
> Maybe another more basic question: "other than QA efforts, are there
> real NFSv2/v3 applications depending on this "feature" ? Or there may
> need tons of efforts for something that will not have much usages when
> it is finally delivered ?
>
> -- Wendy
>
>
>

--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/
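
[Editorial sketch, not part of the original message.] The "4GB" figure in
Neil's reply to Wendy follows from TCP sequence numbers being 32 bits wide:
after exactly 2**32 bytes the counter returns to the same value, so a stale
socket that still remembers its old position has no way, from the sequence
number alone, to tell "nothing happened" from "4 GiB went by". The Python
below is a deliberately simplified model; the helper names seq_after_bytes
and in_window are hypothetical, and real stacks apply further checks (for
example TCP timestamps) that are ignored here.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def seq_after_bytes(start_seq, nbytes):
    """Sequence number reached after sending nbytes starting at start_seq."""
    return (start_seq + nbytes) % SEQ_MOD


def in_window(seq, rcv_nxt, window):
    """True if seq lies in [rcv_nxt, rcv_nxt + window) modulo 2**32,
    i.e. a receiver expecting rcv_nxt would consider it acceptable."""
    return (seq - rcv_nxt) % SEQ_MOD < window


# The stale socket on the old server still expects byte 1000.
stale_rcv_nxt = 1000
window = 65535

# Exactly one full wrap later the sequence number looks identical.
wrapped = seq_after_bytes(stale_rcv_nxt, 2 ** 32)
accepted_after_wrap = in_window(wrapped, stale_rcv_nxt, window)

# Half a wrap later it is far outside the window and would be rejected.
half = seq_after_bytes(stale_rcv_nxt, 2 ** 31)
accepted_after_half = in_window(half, stale_rcv_nxt, window)
```

In this model accepted_after_wrap is True and accepted_after_half is False:
only the unlucky case where the stream lands back inside the stale window is
silently accepted, which is the "just the wrong time" scenario described.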
Thread overview: 38+ messages
2008-06-09 14:31 rapid clustered nfs server failover and hung clients -- how best to close the sockets? Jeff Layton
[not found] ` <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-09 15:03 ` Peter Staubach
2008-06-09 15:18 ` Jeff Layton
[not found] ` <20080609111821.6e06d4f8-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-09 15:31 ` Neil Horman
2008-06-09 15:43 ` Jeff Layton
[not found] ` <RTPCLUEXC1-PRDOLZCH000001d2-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
[not found] ` <20080609120110.1fee7221-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
[not found] ` <RTPCLUEXC1-PRDF8Eqf000001d4-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
[not found] ` <20080609122249.51767b21-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-09 16:40 ` Talpey, Thomas
2008-06-09 16:46 ` Jeff Layton
2008-06-09 18:03 ` J. Bruce Fields
2008-06-09 17:14 ` J. Bruce Fields
2008-06-09 15:51 ` Talpey, Thomas
2008-06-09 16:01 ` Jeff Layton
2008-06-09 16:03 ` Neil Horman
2008-06-09 16:09 ` Talpey, Thomas
2008-06-09 16:22 ` Jeff Layton
2008-06-09 19:36 ` Chuck Lever
2008-06-09 20:11 ` Jeff Layton
2008-06-09 20:56 ` Chuck Lever
2008-06-09 15:23 ` Neil Horman
2008-06-09 15:37 ` Peter Staubach
2008-06-09 15:49 ` Jeff Layton
[not found] ` <20080609114909.131cfaef-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-09 16:01 ` Chuck Lever
2008-06-09 16:04 ` Neil Horman
2008-06-09 15:46 ` Chuck Lever
2008-06-09 16:00 ` Peter Staubach
2008-06-09 16:24 ` Neil Horman
2008-06-09 15:51 ` J. Bruce Fields
2008-06-09 16:02 ` Jeff Layton
2008-06-09 17:23 ` J. Bruce Fields
2008-06-09 19:10 ` Jeff Layton
2008-06-09 20:19 ` Lon Hohberger
2008-06-09 17:14 ` Wendy Cheng
2008-06-09 17:24 ` Jeff Layton
2008-06-09 17:51 ` Talpey, Thomas
2008-06-09 17:59 ` Talpey, Thomas
2008-06-09 19:01 ` Jeff Layton
2008-06-09 19:13 ` Talpey, Thomas
2008-06-09 18:10 ` Neil Horman
2008-06-09 18:07 ` Neil Horman