* [RFC] Multiple server selection and replicated mount failover
@ 2006-05-02  5:56 Ian Kent
  2006-05-24  5:05 ` Ian Kent
  0 siblings, 1 reply; 16+ messages in thread
From: Ian Kent @ 2006-05-02  5:56 UTC (permalink / raw)
  To: nfs; +Cc: linux-fsdevel, autofs mailing list

Hi all,

For some time now I have had code in autofs that attempts to select an
appropriate server from a weighted list to satisfy server priority
selection and Replicated Server requirements. The code has been
problematic from the beginning and is still incorrect, largely due to me
not merging the original patch well and also not fixing it correctly
afterward.

So I'd like to have this work properly and to do that I also need to
consider read-only NFS mount fail over.

The rules for server selection are, in order of priority (I believe):

1) Hosts on the local subnet.
2) Hosts on the local network.
3) Hosts on other networks.

Each of these proximity groups is made up of the largest number of
servers supporting a given NFS protocol version. For example, if there
were 5 servers and 4 supported v3 and 2 supported v2, then the candidate
group would be made up of the 4 supporting v3. Within the group of
candidate servers the one with the best response time is selected.
Selection within a proximity group can be further influenced by a
zero-based weight associated with each host. The higher the weight (a
cost really) the less likely a server is to be selected. I'm not clear
on exactly how the weight influences the selection, so perhaps someone
who is familiar with this could explain it?

Apart from mount time server selection, read-only replicated servers
need to be able to fail over to another server if the current one
becomes unavailable.

The questions I have are:

1) What is the best place for each part of this process to be
   carried out?
   - mount time selection.
   - read-only mount fail over.

2) What mechanisms would be best to use for the selection process?

3) Is there any existing work available that anyone is aware
   of that could be used as a reference?

4) How does NFS v4 fit into this picture, as I believe that some
   of this functionality is included within the protocol?

Any comments or suggestions or reference code would be very much
appreciated.

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread
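To make the selection rules above concrete, here is a minimal user-space
sketch. It assumes each candidate has already been probed; every type,
field and function name in it is hypothetical (this is not the autofs
code), and treating the weight as a multiplier on the measured response
time is only one plausible reading of "higher weight, less likely".

#include <stddef.h>

enum proximity { PROX_SUBNET, PROX_NETWORK, PROX_OTHER };

struct candidate {
	const char *host;
	enum proximity prox;	/* 1) subnet, 2) network, 3) other */
	int vers;		/* NFS version the server answered for (2-4) */
	unsigned long rtt_us;	/* measured response time */
	unsigned long weight;	/* zero-based cost from the map entry */
};

/*
 * Pick a server: nearest proximity group first, then the protocol
 * version supported by the most servers in that group, then the best
 * (weight-adjusted) response time.
 */
static struct candidate *select_server(struct candidate *c, size_t n)
{
	size_t count[5] = { 0, 0, 0, 0, 0 };	/* indexed by NFS version */
	enum proximity prox = PROX_OTHER;
	struct candidate *best = NULL;
	int vers = 2;
	size_t i;
	int v;

	/* 1. Nearest proximity class that has at least one member. */
	for (i = 0; i < n; i++)
		if (c[i].prox < prox)
			prox = c[i].prox;

	/* 2. Within that class, the version supported by the most servers,
	 *    e.g. 4 servers at v3 beat 2 servers at v2. */
	for (i = 0; i < n; i++)
		if (c[i].prox == prox && c[i].vers >= 2 && c[i].vers <= 4)
			count[c[i].vers]++;
	for (v = 2; v <= 4; v++)
		if (count[v] >= count[vers])
			vers = v;

	/* 3. Best response time wins; the zero-based weight is treated as a
	 *    cost multiplier, so heavily weighted hosts lose close contests. */
	for (i = 0; i < n; i++) {
		if (c[i].prox != prox || c[i].vers != vers)
			continue;
		if (!best || c[i].rtt_us * (c[i].weight + 1) <
			     best->rtt_us * (best->weight + 1))
			best = &c[i];
	}
	return best;		/* NULL only if the list was empty */
}

The exact weight semantics are precisely what the post above asks about,
so real code would want to match whatever the existing implementations do.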
* Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-02  5:56 [RFC] Multiple server selection and replicated mount failover Ian Kent
@ 2006-05-24  5:05 ` Ian Kent
  2006-05-24 13:02   ` [NFS] " Peter Staubach
  2006-05-24 16:29   ` Trond Myklebust
  0 siblings, 2 replies; 16+ messages in thread
From: Ian Kent @ 2006-05-24  5:05 UTC (permalink / raw)
  To: nfs; +Cc: linux-fsdevel, autofs mailing list

On Tue, 2 May 2006, Ian Kent wrote:

> 
> Hi all,
> 
> For some time now I have had code in autofs that attempts to select an
> appropriate server from a weighted list to satisfy server priority
> selection and Replicated Server requirements. The code has been
> problematic from the beginning and is still incorrect, largely due to me
> not merging the original patch well and also not fixing it correctly
> afterward.
> 
> So I'd like to have this work properly and to do that I also need to
> consider read-only NFS mount fail over.
> 
> The rules for server selection are, in order of priority (I believe):
> 
> 1) Hosts on the local subnet.
> 2) Hosts on the local network.
> 3) Hosts on other networks.
> 
> Each of these proximity groups is made up of the largest number of
> servers supporting a given NFS protocol version. For example, if there
> were 5 servers and 4 supported v3 and 2 supported v2, then the candidate
> group would be made up of the 4 supporting v3. Within the group of
> candidate servers the one with the best response time is selected.
> Selection within a proximity group can be further influenced by a
> zero-based weight associated with each host. The higher the weight (a
> cost really) the less likely a server is to be selected. I'm not clear
> on exactly how the weight influences the selection, so perhaps someone
> who is familiar with this could explain it?

I've re-written the server selection code now and I believe it works
correctly.

> 
> Apart from mount time server selection, read-only replicated servers
> need to be able to fail over to another server if the current one
> becomes unavailable.
> 
> The questions I have are:
> 
> 1) What is the best place for each part of this process to be
>    carried out?
>    - mount time selection.
>    - read-only mount fail over.

I think mount time selection should be done in mount and I believe the
failover needs to be done in the kernel against the list established with
the user space selection. The list should only change when a umount and
then a mount occurs (surely this is the only practical way to do it?).

The code that I now have for the selection process can potentially improve
the code used by patches to mount for probing NFS servers, and doing this
once in one place has to be better than doing it in both automount and
mount.

The failover is another story.

It seems to me that there are two similar ways to do this:

1) Pass a list of address and path entries to NFS at mount time and
intercept errors, identify if the host is down and if it is select and
mount another server.

2) Mount each member of the list with the best one on top and intercept
errors, identify if the host is down and if it is select another from the
list of mounts and put it atop the mounts. Maintaining the ordering with
this approach could be difficult.

With either of these approaches handling open files and held locks appears
to be the difficult part.

Anyone have anything to contribute on how I could handle this or problems
that I will encounter?

snip ..

> 
> 3) Is there any existing work available that anyone is aware
>    of that could be used as a reference?

Still wondering about this.

> 
> 4) How does NFS v4 fit into this picture, as I believe that some
>    of this functionality is included within the protocol?

And this.

NFS v4 appears quite different, so should I be considering this for v2 and
v3 only?

> 
> Any comments or suggestions or reference code would be very much
> appreciated.

Still.

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread
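On the probing side, the kind of check mount or automount needs can be
done entirely in user space with a timed NULL RPC per protocol version.
The sketch below is only illustrative and is not the code referred to in
the message above: the program number 100003 is the well-known NFS
program, but everything else (the UDP transport, the 2-second timeout,
the function names) is an arbitrary assumption.

#include <stdio.h>
#include <sys/time.h>
#include <rpc/rpc.h>

#define NFS_PROG	100003UL	/* well-known NFS program number */

/*
 * Time a NULL procedure call against the server's NFS service for one
 * protocol version.  Returns the round trip time in microseconds, or -1
 * if the server did not answer for that version.
 */
static long probe_nfs_version(const char *host, unsigned long version)
{
	struct timeval timeout = { 2, 0 };	/* arbitrary 2 second limit */
	struct timeval start, end;
	enum clnt_stat status;
	CLIENT *clnt;

	clnt = clnt_create(host, NFS_PROG, version, "udp");
	if (clnt == NULL)
		return -1;			/* version not registered */

	gettimeofday(&start, NULL);
	status = clnt_call(clnt, NULLPROC,
			   (xdrproc_t) xdr_void, NULL,
			   (xdrproc_t) xdr_void, NULL, timeout);
	gettimeofday(&end, NULL);
	clnt_destroy(clnt);

	if (status != RPC_SUCCESS)
		return -1;

	return (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_usec - start.tv_usec);
}

int main(int argc, char *argv[])
{
	unsigned long vers;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <host>\n", argv[0]);
		return 1;
	}
	for (vers = 2; vers <= 3; vers++) {
		long rtt = probe_nfs_version(argv[1], vers);
		if (rtt >= 0)
			printf("%s: v%lu answered in %ld us\n",
			       argv[1], vers, rtt);
	}
	return 0;
}

Run once against each address in a replicated map entry, a probe like
this yields the response-time and version data the selection step needs.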
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 5:05 ` Ian Kent @ 2006-05-24 13:02 ` Peter Staubach 2006-05-24 13:45 ` Ian Kent 2006-05-24 20:45 ` [NFS] " John T. Kohl 2006-05-24 16:29 ` Trond Myklebust 1 sibling, 2 replies; 16+ messages in thread From: Peter Staubach @ 2006-05-24 13:02 UTC (permalink / raw) To: Ian Kent; +Cc: nfs, linux-fsdevel, autofs mailing list Ian Kent wrote: >On Tue, 2 May 2006, Ian Kent wrote: > > > >>Hi all, >> >>For some time now I have had code in autofs that attempts to select an >>appropriate server from a weighted list to satisfy server priority >>selection and Replicated Server requirements. The code has been >>problematic from the beginning and is still incorrect largely due to me >>not merging the original patch well and also not fixing it correctly >>afterward. >> >>So I'd like to have this work properly and to do that I also need to >>consider read-only NFS mount fail over. >> >>The rules for server selection are, in order of priority (I believe): >> >>1) Hosts on the local subnet. >>2) Hosts on the local network. >>3) Hosts on other network. >> >>Each of these proximity groups is made up of the largest number of >>servers supporting a given NFS protocol version. For example if there were >>5 servers and 4 supported v3 and 2 supported v2 then the candidate group >>would be made up of the 4 supporting v3. Within the group of candidate >>servers the one with the best response time is selected. Selection >>within a proximity group can be further influenced by a zero based weight >>associated with each host. The higher the weight (a cost really) the less >>likely a server is to be selected. I'm not clear on exactly how he weight >>influences the selection, so perhaps someone who is familiar with this >>could explain it? >> >> > >I've re-written the server selection code now and I believe it works >correctly. > > > >>Apart from mount time server selection read-only replicated servers need >>to be able to fail over to another server if the current one becomes >>unavailable. >> >>The questions I have are: >> >>1) What is the best place for each part of this process to be >> carried out. >> - mount time selection. >> - read-only mount fail over. >> >> > >I think mount time selection should be done in mount and I believe the >failover needs to be done in the kernel against the list established with >the user space selection. The list should only change when a umount >and then a mount occurs (surely this is the only practical way to do it >?). > >The code that I now have for the selection process can potentially improve >the code used by patches to mount for probing NFS servers and doing this >once in one place has to be better than doing it in automount and mount. > >The failover is another story. > >It seems to me that there are two similar ways to do this: > >1) Pass a list of address and path entries to NFS at mount time and >intercept errors, identify if the host is down and if it is select and >mount another server. > >2) Mount each member of the list with the best one on top and intercept >errors, identify if the host is down and if it is select another from the >list of mounts and put it atop the mounts. Maintaining the ordering with >this approach could be difficult. > >With either of these approaches handling open files and held locks appears >to be the the difficult part. > >Anyone have anything to contribute on how I could handle this or problems >that I will encounter? 
>
>
>

It seems to me that there is one other way, which is similar to #1 except
that instead of passing path entries to NFS at mount time, pass in file
handles. This keeps all of the MOUNT protocol processing at the user
level and does not require the kernel to learn anything about the MOUNT
protocol. It also allows a reasonable list to be constructed, with
checking to ensure that all the servers support the same version of the
NFS protocol, probably that all of the servers support the same transport
protocol, etc.

>snip ..
>
>
>>3) Is there any existing work available that anyone is aware
>>   of that could be used as a reference.
>>
>
>Still wondering about this.
>
>

Well, there is the Solaris support.

>>4) How does NFS v4 fit into this picture as I believe that some
>>   of this functionality is included within the protocol.
>>
>
>And this.
>
>NFS v4 appears quite different so should I be considering this for v2 and
>v3 only?
>
>
>>Any comments or suggestions or reference code would be very much
>>appreciated.
>>

The Solaris support works by passing a list of structs containing server
information down into the kernel at mount time. This makes normal mounting
just a subset of the replicated support, because a normal mount would just
contain a list with a single entry.

When the Solaris client gets a timeout from an RPC, it checks to see
whether this file and mount are failover'able. This checks to see whether
there are alternate servers in the list and could contain a check to see
if there are locks existing on the file. If there are locks, then don't
fail over. The alternative to doing this is to attempt to move the lock,
but this could be problematic because there would be no guarantee that
the new lock could be acquired.

Anyway, if the file is failover'able, then a new server is chosen from the
list and the file handle associated with the file is remapped to the
equivalent file on the new server. This is done by repeating the lookups
done to get the original file handle. Once the new file handle is
acquired, some minimal checks are done to try to ensure that the files
are the "same". This is probably mostly checking to see whether the sizes
of the two files are the same.

Please note that this approach contains the interesting aspect that
files are only failed over when they need to be and are not failed over
proactively. This can lead to the situation where processes using the
file system can be talking to many of the different underlying servers,
all at the same time. If a server goes down and then comes back up before
a process which was talking to that server notices, then it will just
continue to use that server, while another process, which noticed the
failed server, may have failed over to a new server.

The key ingredient to this approach, I think, is a list of servers and
information about them, and then information for each active NFS inode
that keeps track of the pathname used to discover the file handle and
also the server which is currently being used by the specific file.

       Thanx...

          ps

^ permalink raw reply	[flat|nested] 16+ messages in thread
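For discussion's sake, the shape of that timeout-driven remap might look
something like the sketch below. Every type and helper in it
(struct replica, remap_by_lookup(), replica_file_size(), ...) is made up
for illustration; it is neither the Solaris code nor anything in the
Linux client, but it captures the two checks described above: only fail
over when there is somewhere to go and no locks are held, and re-walk
the stored path to obtain an equivalent handle.

#include <netinet/in.h>

struct replica {
	struct sockaddr_in addr;	/* server address */
	const char *export_path;	/* path exported by that server */
};

struct replica_list {
	struct replica *servers;
	int nr_servers;
	int current;			/* index of the server in use */
};

struct failover_file {
	struct replica_list *replicas;
	const char *rel_path;		/* path used for the original lookup */
	unsigned char fh[64];		/* current file handle */
	unsigned int fh_len;
	long long size;			/* last known size, for sanity check */
	int locks_held;
};

/* Hypothetical helpers: repeat the lookups on a replica to produce a new
 * handle, and fetch the size of the file that handle refers to. */
int remap_by_lookup(struct replica *r, const char *rel_path,
		    unsigned char *fh, unsigned int *fh_len);
long long replica_file_size(struct replica *r, const unsigned char *fh,
			    unsigned int fh_len);

/* Called when an RPC to the current server has timed out.  Returns 0 if
 * the caller should retry against the newly selected server. */
int maybe_failover(struct failover_file *f)
{
	struct replica_list *rl = f->replicas;
	int tried, next;

	if (rl->nr_servers < 2)
		return -1;		/* nowhere else to go */
	if (f->locks_held)
		return -1;		/* don't try to migrate locks */

	for (tried = 1; tried < rl->nr_servers; tried++) {
		next = (rl->current + tried) % rl->nr_servers;

		/* Repeat the lookups that produced the original handle. */
		if (remap_by_lookup(&rl->servers[next], f->rel_path,
				    f->fh, &f->fh_len) != 0)
			continue;	/* this replica is unreachable too */

		/* Minimal check that it really is the "same" file. */
		if (replica_file_size(&rl->servers[next], f->fh, f->fh_len)
		    != f->size)
			continue;

		rl->current = next;
		return 0;
	}
	return -1;			/* no usable replica found */
}

A single-entry list makes the ordinary, non-replicated case just the
degenerate form of the same structure, which matches the "normal mount is
a subset" observation above.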
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 13:02 ` [NFS] " Peter Staubach @ 2006-05-24 13:45 ` Ian Kent 2006-05-24 14:04 ` Peter Staubach 2006-05-24 20:45 ` [NFS] " John T. Kohl 1 sibling, 1 reply; 16+ messages in thread From: Ian Kent @ 2006-05-24 13:45 UTC (permalink / raw) To: Peter Staubach; +Cc: nfs, linux-fsdevel, autofs mailing list On Wed, 2006-05-24 at 09:02 -0400, Peter Staubach wrote: > Ian Kent wrote: > > > > >I've re-written the server selection code now and I believe it works > >correctly. > > > > > > > >>Apart from mount time server selection read-only replicated servers need > >>to be able to fail over to another server if the current one becomes > >>unavailable. > >> > >>The questions I have are: > >> > >>1) What is the best place for each part of this process to be > >> carried out. > >> - mount time selection. > >> - read-only mount fail over. > >> > >> > > > >I think mount time selection should be done in mount and I believe the > >failover needs to be done in the kernel against the list established with > >the user space selection. The list should only change when a umount > >and then a mount occurs (surely this is the only practical way to do it > >?). > > > >The code that I now have for the selection process can potentially improve > >the code used by patches to mount for probing NFS servers and doing this > >once in one place has to be better than doing it in automount and mount. > > > >The failover is another story. > > > >It seems to me that there are two similar ways to do this: > > > >1) Pass a list of address and path entries to NFS at mount time and > >intercept errors, identify if the host is down and if it is select and > >mount another server. > > > >2) Mount each member of the list with the best one on top and intercept > >errors, identify if the host is down and if it is select another from the > >list of mounts and put it atop the mounts. Maintaining the ordering with > >this approach could be difficult. > > > >With either of these approaches handling open files and held locks appears > >to be the the difficult part. > > > >Anyone have anything to contribute on how I could handle this or problems > >that I will encounter? > > > > > > > > It seems to me that there is one other way which is similiar to #1 except > that instead of passing path entries to NFS at mount time, pass in file > handles. This keeps all of the MOUNT protocol processing at the user > level and does not require the kernel to learn anything about the MOUNT > protocol. It also allows a reasonable list to be constructed, with > checking to ensure that all the servers support the same version of the > NFS protocol, probably that all of the server support the same transport > protocol, etc. Of course, like #1 but with the benefits of #2 without the clutter. I guess all I would have to do then is the vfs mount to make it happen. Are we assuming a restriction like all the mounts have the same path exported from the server? mtab could get a little confused. > > >snip .. > > > > > > > >>3) Is there any existing work available that anyone is aware > >> of that could be used as a reference. > >> > >> > > > >Still wondering about this. > > > > > > > > Well, there is the Solaris support. But I'm not supposed to peek at that am I (cough, splutter, ...)? > > >>4) How does NFS v4 fit into this picture as I believe that some > >> of this functionality is included within the protocol. > >> > >> > > > >And this. 
> >
> >NFS v4 appears quite different so should I be considering this for v2 and
> >v3 only?
> >
> >
> >>Any comments or suggestions or reference code would be very much
> >>appreciated.
> >>
> >
> The Solaris support works by passing a list of structs containing server
> information down into the kernel at mount time. This makes normal mounting
> just a subset of the replicated support because a normal mount would just
> contain a list with a single entry.

Cool. That's the way the selection code I have works, except for the
kernel bit of course.

> 
> When the Solaris client gets a timeout from an RPC, it checks to see
> whether this file and mount are failover'able. This checks to see whether
> there are alternate servers in the list and could contain a check to see
> if there are locks existing on the file. If there are locks, then don't
> fail over. The alternative to doing this is to attempt to move the lock,
> but this could be problematic because there would be no guarantee that
> the new lock could be acquired.

Yep. Failing over the locks looks like it could turn into a nightmare
really fast. Sounds like a good simplifying restriction for a first stab
at this.

> 
> Anyway, if the file is failover'able, then a new server is chosen from the
> list and the file handle associated with the file is remapped to the
> equivalent file on the new server. This is done by repeating the lookups
> done to get the original file handle. Once the new file handle is
> acquired, some minimal checks are done to try to ensure that the files
> are the "same". This is probably mostly checking to see whether the sizes
> of the two files are the same.
> 
> Please note that this approach contains the interesting aspect that
> files are only failed over when they need to be and are not failed over
> proactively. This can lead to the situation where processes using the
> file system can be talking to many of the different underlying servers,
> all at the same time. If a server goes down and then comes back up before
> a process which was talking to that server notices, then it will just
> continue to use that server, while another process, which noticed the
> failed server, may have failed over to a new server.

Interesting. This hadn't occurred to me yet.

I was still at the stage of wondering whether the "on demand" approach
would work, but the simplifying restriction above should make it workable
(I think ....).

> 
> The key ingredient to this approach, I think, is a list of servers and
> information about them, and then information for each active NFS inode
> that keeps track of the pathname used to discover the file handle and
> also the server which is currently being used by the specific file.

Haven't quite got to the path issues yet.
But can't we just get the path from d_path?
It will return the path from a given dentry to the root of the mount, if
I remember correctly, and we have a file handle for the server.

But you're talking about the difficulty of the housekeeping overall, I
think.

> Thanx...

Thanks for your comments.
Much appreciated and certainly very helpful.

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread
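On the path question: whatever d_path() does or does not give, a
mount-relative name can also be rebuilt by walking d_parent up to the
root of the filesystem, which is roughly the per-file bookkeeping needed
to repeat the lookups on another server. The fragment below is only a
rough kernel-style sketch, not code from any tree; it ignores the dcache
locking and rename races that real code would have to handle.

#include <linux/dcache.h>
#include <linux/string.h>

/*
 * Build the path of a dentry relative to the root of its filesystem,
 * filling the supplied buffer from the end.  Sketch only: no locking,
 * no protection against concurrent renames.
 */
static char *export_relative_path(struct dentry *dentry, char *buf, int buflen)
{
	char *end = buf + buflen;
	int len;

	*--end = '\0';
	while (!IS_ROOT(dentry)) {
		len = dentry->d_name.len;
		if (end - buf < len + 1)
			return NULL;		/* buffer too small */
		end -= len;
		memcpy(end, dentry->d_name.name, len);
		*--end = '/';
		dentry = dentry->d_parent;
	}
	if (*end == '\0')			/* dentry was the root itself */
		*--end = '/';
	return end;
}

Whether something like this, d_path(), or simply recording the name at
lookup time is used matters less than keeping that information per active
file, as described above.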
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-24 13:45     ` Ian Kent
@ 2006-05-24 14:04       ` Peter Staubach
  2006-05-24 14:31         ` Ian Kent
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Staubach @ 2006-05-24 14:04 UTC (permalink / raw)
  To: Ian Kent; +Cc: nfs, linux-fsdevel, autofs mailing list

Ian Kent wrote:

>
>Of course, like #1 but with the benefits of #2 without the clutter. I
>guess all I would have to do then is the vfs mount to make it happen.
>Are we assuming a restriction like all the mounts have the same path
>exported from the server? mtab could get a little confused.
>
>

I don't think that this needs to be restricted like this. I think that
it would be nice if mtab just showed the arguments which were used to
the mount command, ie. with all of the server and path combinations listed
just like they were on the command line or in the autofs map.

>But I'm not supposed to peek at that am I (cough, splutter, ...)?
>
>

Yes, I know. I am trying to figure out how much of the architecture and
implementation that I can talk about too... :-)

>Cool. That's the way the selection code I have works, except for the
>kernel bit of course.
>
>

Good news!

>Yep. Failing over the locks looks like it could turn into a nightmare
>really fast. Sounds like a good simplifying restriction for a first stab
>at this.
>
>

Agreed.

>Interesting. This hadn't occurred to me yet.
>
>I was still at the stage of wondering whether the "on demand" approach
>would work but the simplifying restriction above should make it workable
>(I think ....).
>
>

I think that simple is good. We can always get more complicated later, if
need be. (Hope not... :-) )

>>The key ingredient to this approach, I think, is a list of servers and
>>information about them, and then information for each active NFS inode
>>that keeps track of the pathname used to discover the file handle and
>>also the server which is currently being used by the specific file.
>>
>>
>
>Haven't quite got to the path issues yet.
>But can't we just get the path from d_path?
>It will return the path from a given dentry to the root of the mount, if
>I remember correctly, and we have a file handle for the server.
>
>But you're talking about the difficulty of the housekeeping overall, I
>think.
>
>

Yes, I was specifically trying to avoid talking about how to manage the
information. I think that that is an implementation detail, which is
better left until after the high level architecture is defined.

>> Thanx...
>>
>>
>
>Thanks for your comments.
>Much appreciated and certainly very helpful.
>

You're welcome.

I can try to talk more about the architecture and implementation that I am
familiar with, if you like.

       ps

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-24 14:04       ` Peter Staubach
@ 2006-05-24 14:31         ` Ian Kent
  0 siblings, 0 replies; 16+ messages in thread
From: Ian Kent @ 2006-05-24 14:31 UTC (permalink / raw)
  To: Peter Staubach; +Cc: nfs, linux-fsdevel, autofs mailing list

On Wed, 2006-05-24 at 10:04 -0400, Peter Staubach wrote:
> Ian Kent wrote:
> 
> >
> >Of course, like #1 but with the benefits of #2 without the clutter. I
> >guess all I would have to do then is the vfs mount to make it happen.
> >Are we assuming a restriction like all the mounts have the same path
> >exported from the server? mtab could get a little confused.
> >
> >
> 
> I don't think that this needs to be restricted like this. I think that
> it would be nice if mtab just showed the arguments which were used to
> the mount command, ie. with all of the server and path combinations listed
> just like they were on the command line or in the autofs map.
> 
> >But I'm not supposed to peek at that am I (cough, splutter, ...)?
> >
> >
> 
> Yes, I know. I am trying to figure out how much of the architecture and
> implementation that I can talk about too... :-)
> 
> >Cool. That's the way the selection code I have works, except for the
> >kernel bit of course.
> >
> >
> 
> Good news!

It's in autofs 5 now. The idea is that it will work for any mount
string, replicated syntax or not. So there's no extra mucking around.

I hope to push it into mount and provide a configure option to disable
it in autofs if mount can do it instead.

> 
> >Yep. Failing over the locks looks like it could turn into a nightmare
> >really fast. Sounds like a good simplifying restriction for a first stab
> >at this.
> >
> >
> 
> Agreed.
> 
> >Interesting. This hadn't occurred to me yet.
> >
> >I was still at the stage of wondering whether the "on demand" approach
> >would work but the simplifying restriction above should make it workable
> >(I think ....).
> >
> >
> 
> I think that simple is good. We can always get more complicated later, if
> need be. (Hope not... :-) )
> 
> >>The key ingredient to this approach, I think, is a list of servers and
> >>information about them, and then information for each active NFS inode
> >>that keeps track of the pathname used to discover the file handle and
> >>also the server which is currently being used by the specific file.
> >>
> >>
> >
> >Haven't quite got to the path issues yet.
> >But can't we just get the path from d_path?
> >It will return the path from a given dentry to the root of the mount, if
> >I remember correctly, and we have a file handle for the server.
> >
> >But you're talking about the difficulty of the housekeeping overall, I
> >think.
> >
> >
> 
> Yes, I was specifically trying to avoid talking about how to manage the
> information. I think that that is an implementation detail, which is
> better left until after the high level architecture is defined.

Yep.

> 
> >> Thanx...
> >>
> >>
> >
> >Thanks for your comments.
> >Much appreciated and certainly very helpful.
> >
> 
> You're welcome.
> 
> I can try to talk more about the architecture and implementation that I am
> familiar with, if you like.

Any and all information is good.
Food for thought will give me something to eat!

Ian
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 13:02 ` [NFS] " Peter Staubach 2006-05-24 13:45 ` Ian Kent @ 2006-05-24 20:45 ` John T. Kohl 2006-05-24 20:52 ` Dan Stromberg ` (2 more replies) 1 sibling, 3 replies; 16+ messages in thread From: John T. Kohl @ 2006-05-24 20:45 UTC (permalink / raw) To: Peter Staubach; +Cc: Ian Kent, nfs, linux-fsdevel, autofs mailing list >>>>> "PS" == Peter Staubach <staubach@redhat.com> writes: PS> When the Solaris client gets a timeout from an RPC, it checks to see PS> whether this file and mount are failover'able. This checks to see PS> whether there are alternate servers in the list and could contain a PS> check to see if there are locks existing on the file. If there are PS> locks, then don't failover. The alternative to doing this is to PS> attempt to move the lock, but this could be problematic because PS> there would be no guarantee that the new lock could be acquired. PS> Anyway, if the file is failover'able, then a new server is chosen PS> from the list and the file handle associated with the file is PS> remapped to the equivalent file on the new server. This is done by PS> repeating the lookups done to get the original file handle. Once PS> the new file handle is acquired, then some minimal checks are done PS> to try to ensure that the files are the "same". This is probably PS> mostly checking to see whether the sizes of the two files are the PS> same. PS> Please note that this approach contains the interesting aspect that PS> files are only failed over when they need to be and are not failed over PS> proactively. This can lead to the situation where processes using the PS> the file system can be talking to many of the different underlying PS> servers, all at the sametime. If a server goes down and then comes back PS> up before a process, which was talking to that server, notices, then it PS> will just continue to use that server, while another process, which PS> noticed the failed server, may have failed over to a new server. If you have multiple processes talking to different server replicas, can you then get cases where the processes aren't sharing the same files given the same name? Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and starts working on it. It then sits around doing nothing for a while. Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2, and then looks up "c/file.c" which will be referencing the object on server 2 ? A & B then try locking to cooperate... Are replicas only useful for read-only copies? If they're read-only, do locks even make sense? -- John Kohl Senior Software Engineer - Rational Software - IBM Software Group Lexington, Massachusetts, USA jtk@us.ibm.com <http://www.ibm.com/software/rational/> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 20:45 ` [NFS] " John T. Kohl @ 2006-05-24 20:52 ` Dan Stromberg 2006-05-29 7:31 ` [autofs] " Ian Kent 2006-05-30 12:02 ` Jeff Moyer 2 siblings, 0 replies; 16+ messages in thread From: Dan Stromberg @ 2006-05-24 20:52 UTC (permalink / raw) To: John T. Kohl Cc: Peter Staubach, Ian Kent, nfs, linux-fsdevel, autofs mailing list, strombrg On Wed, 2006-05-24 at 16:45 -0400, John T. Kohl wrote: > > If you have multiple processes talking to different server replicas, can > you then get cases where the processes aren't sharing the same files given > the same name? Sounds like it to me. > Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and > starts working on it. It then sits around doing nothing for a while. > > Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2, > and then looks up "c/file.c" which will be referencing the object on > server 2 ? Yup. > A & B then try locking to cooperate... Yup. To get good locking semantics, you'll probably need a distributed filesystem like GFS or Lustre. > Are replicas only useful for read-only copies? If they're read-only, do > locks even make sense? Yes, they can - imagine a software system that wants to make sure only one process is accessing some data at a time. The lock might be in one filesystem, but the data might be in another. Or you might be reading from a device that has really expensive seeks (a DVD comes to mind), so you want to be sure that only one thing is reading from it at a time. There are other possible scenarios I imagine, but many folks may be able to live without any of them. :) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 20:45 ` [NFS] " John T. Kohl 2006-05-24 20:52 ` Dan Stromberg @ 2006-05-29 7:31 ` Ian Kent 2006-05-30 12:02 ` Jeff Moyer 2 siblings, 0 replies; 16+ messages in thread From: Ian Kent @ 2006-05-29 7:31 UTC (permalink / raw) To: John T. Kohl; +Cc: Peter Staubach, linux-fsdevel, autofs mailing list, nfs On Thu, 24 May 2006, John T. Kohl wrote: > >>>>> "PS" == Peter Staubach <staubach@redhat.com> writes: > > PS> When the Solaris client gets a timeout from an RPC, it checks to see > PS> whether this file and mount are failover'able. This checks to see > PS> whether there are alternate servers in the list and could contain a > PS> check to see if there are locks existing on the file. If there are > PS> locks, then don't failover. The alternative to doing this is to > PS> attempt to move the lock, but this could be problematic because > PS> there would be no guarantee that the new lock could be acquired. > > PS> Anyway, if the file is failover'able, then a new server is chosen > PS> from the list and the file handle associated with the file is > PS> remapped to the equivalent file on the new server. This is done by > PS> repeating the lookups done to get the original file handle. Once > PS> the new file handle is acquired, then some minimal checks are done > PS> to try to ensure that the files are the "same". This is probably > PS> mostly checking to see whether the sizes of the two files are the > PS> same. > > PS> Please note that this approach contains the interesting aspect that > PS> files are only failed over when they need to be and are not failed over > PS> proactively. This can lead to the situation where processes using the > PS> the file system can be talking to many of the different underlying > PS> servers, all at the sametime. If a server goes down and then comes back > PS> up before a process, which was talking to that server, notices, then it > PS> will just continue to use that server, while another process, which > PS> noticed the failed server, may have failed over to a new server. > > If you have multiple processes talking to different server replicas, can > you then get cases where the processes aren't sharing the same files given > the same name? > > Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and > starts working on it. It then sits around doing nothing for a while. > > Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2, > and then looks up "c/file.c" which will be referencing the object on > server 2 ? > > A & B then try locking to cooperate... > > Are replicas only useful for read-only copies? If they're read-only, do > locks even make sense? Apps will take locks whether it makes sense or not. So refusing to fail-over if locks are held is likely the best approach. The case of replica filesystems themselves being updated could give rise to some interesting difficulties. Ian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 20:45 ` [NFS] " John T. Kohl 2006-05-24 20:52 ` Dan Stromberg 2006-05-29 7:31 ` [autofs] " Ian Kent @ 2006-05-30 12:02 ` Jeff Moyer 2 siblings, 0 replies; 16+ messages in thread From: Jeff Moyer @ 2006-05-30 12:02 UTC (permalink / raw) To: John T. Kohl Cc: Peter Staubach, linux-fsdevel, autofs mailing list, nfs, Ian Kent ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; jtk@us.ibm.com (John T. Kohl) adds: >>>>>> "PS" == Peter Staubach <staubach@redhat.com> writes: PS> When the Solaris client gets a timeout from an RPC, it checks to see PS> whether this file and mount are failover'able. This checks to see PS> whether there are alternate servers in the list and could contain a PS> check to see if there are locks existing on the file. If there are PS> locks, then don't failover. The alternative to doing this is to PS> attempt to move the lock, but this could be problematic because there PS> would be no guarantee that the new lock could be acquired. PS> Anyway, if the file is failover'able, then a new server is chosen from PS> the list and the file handle associated with the file is remapped to PS> the equivalent file on the new server. This is done by repeating the PS> lookups done to get the original file handle. Once the new file handle PS> is acquired, then some minimal checks are done to try to ensure that PS> the files are the "same". This is probably mostly checking to see PS> whether the sizes of the two files are the same. PS> Please note that this approach contains the interesting aspect that PS> files are only failed over when they need to be and are not failed over PS> proactively. This can lead to the situation where processes using the PS> the file system can be talking to many of the different underlying PS> servers, all at the sametime. If a server goes down and then comes PS> back up before a process, which was talking to that server, notices, PS> then it will just continue to use that server, while another process, PS> which noticed the failed server, may have failed over to a new server. jtk> If you have multiple processes talking to different server replicas, jtk> can you then get cases where the processes aren't sharing the same jtk> files given the same name? jtk> Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and jtk> starts working on it. It then sits around doing nothing for a while. jtk> Process "B" cd's to /mount/a/b, gets a timeout, fails over to server jtk> 2, and then looks up "c/file.c" which will be referencing the object jtk> on server 2 ? jtk> A & B then try locking to cooperate... jtk> Are replicas only useful for read-only copies? If they're read-only, jtk> do locks even make sense? In the docs I've read, the replicated failover only works for read-only file systems. You can have a replicated server entry for read-write file systems, but only one of those will be mounted by the automounter. To change servers would require a timeout (unmount) and subsequent lookup (mount). I don't think we need to try to kill ourselves by making this too complex. -Jeff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 5:05 ` Ian Kent 2006-05-24 13:02 ` [NFS] " Peter Staubach @ 2006-05-24 16:29 ` Trond Myklebust 2006-05-24 17:58 ` [autofs] " Jeff Moyer 1 sibling, 1 reply; 16+ messages in thread From: Trond Myklebust @ 2006-05-24 16:29 UTC (permalink / raw) To: Ian Kent; +Cc: nfs, linux-fsdevel, autofs mailing list On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: > It seems to me that there are two similar ways to do this: > > 1) Pass a list of address and path entries to NFS at mount time and > intercept errors, identify if the host is down and if it is select and > mount another server. > > 2) Mount each member of the list with the best one on top and intercept > errors, identify if the host is down and if it is select another from the > list of mounts and put it atop the mounts. Maintaining the ordering with > this approach could be difficult. Solaris has implemented option (1). To me, that is the approach that makes the most sense: why add the overhead of maintaining all these redundant mounts? > With either of these approaches handling open files and held locks appears > to be the the difficult part. Always has been, and always will. We're working on this problem, but progress is slow. In any case, we'll be concentrating on solving it for NFSv4 first (since that has native support for migrated/replicated volumes). > Anyone have anything to contribute on how I could handle this or problems > that I will encounter? > > > snip .. > > > > > 3) Is there any existing work available that anyone is aware > > of that could be used as a reference. > > Still wondering about this. > > > > > 4) How does NFS v4 fit into this picture as I believe that some > > of this functionality is included within the protocol. > > And this. > > NFS v4 appears quite different so should I be considering this for v2 and > v3 only? NFSv4 has full support for migration/replication in the protocol. If a filesystem fails on a given server, then the server itself will tell the client where it can find the replicas. There should be no need to provide that information at mount time. Cheers, Trond ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 16:29 ` Trond Myklebust @ 2006-05-24 17:58 ` Jeff Moyer 2006-05-24 18:31 ` Trond Myklebust 0 siblings, 1 reply; 16+ messages in thread From: Jeff Moyer @ 2006-05-24 17:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Ian Kent, linux-fsdevel, autofs mailing list, nfs ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds: trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: >> > 4) How does NFS v4 fit into this picture as I believe that some >> > of this functionality is included within the protocol. >> >> And this. >> >> NFS v4 appears quite different so should I be considering this for v2 and >> v3 only? trond.myklebust> NFSv4 has full support for migration/replication in the trond.myklebust> protocol. If a filesystem fails on a given server, then trond.myklebust> the server itself will tell the client where it can find trond.myklebust> the replicas. There should be no need to provide that trond.myklebust> information at mount time. And what happens when the server disappears? -Jeff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 17:58 ` [autofs] " Jeff Moyer @ 2006-05-24 18:31 ` Trond Myklebust 2006-05-24 19:17 ` Peter Staubach 2006-05-25 3:56 ` Ian Kent 0 siblings, 2 replies; 16+ messages in thread From: Trond Myklebust @ 2006-05-24 18:31 UTC (permalink / raw) To: Jeff Moyer; +Cc: Ian Kent, linux-fsdevel, autofs mailing list, nfs On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote: > ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds: > > trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: > >> > 4) How does NFS v4 fit into this picture as I believe that some > >> > of this functionality is included within the protocol. > >> > >> And this. > >> > >> NFS v4 appears quite different so should I be considering this for v2 and > >> v3 only? > > trond.myklebust> NFSv4 has full support for migration/replication in the > trond.myklebust> protocol. If a filesystem fails on a given server, then > trond.myklebust> the server itself will tell the client where it can find > trond.myklebust> the replicas. There should be no need to provide that > trond.myklebust> information at mount time. > > And what happens when the server disappears? There are 2 strategies for dealing with that: Firstly, we can maintain a cache of the list of replica volumes (we can request the list of replicas when we mount the original volume). Secondly, there are plans to add a backup list of failover servers in a specialised DNS record. This strategy could be made to work for NFSv2/v3 too. Cheers, Trond ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 18:31 ` Trond Myklebust @ 2006-05-24 19:17 ` Peter Staubach 2006-05-24 19:45 ` Trond Myklebust 2006-05-25 3:56 ` Ian Kent 1 sibling, 1 reply; 16+ messages in thread From: Peter Staubach @ 2006-05-24 19:17 UTC (permalink / raw) To: Trond Myklebust Cc: Jeff Moyer, Ian Kent, linux-fsdevel, autofs mailing list, nfs Trond Myklebust wrote: >On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote: > > >>==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds: >> >>trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: >> >> >>>>>4) How does NFS v4 fit into this picture as I believe that some >>>>> of this functionality is included within the protocol. >>>>> >>>>> >>>>And this. >>>> >>>>NFS v4 appears quite different so should I be considering this for v2 and >>>>v3 only? >>>> >>>> >>trond.myklebust> NFSv4 has full support for migration/replication in the >>trond.myklebust> protocol. If a filesystem fails on a given server, then >>trond.myklebust> the server itself will tell the client where it can find >>trond.myklebust> the replicas. There should be no need to provide that >>trond.myklebust> information at mount time. >> >>And what happens when the server disappears? >> >> > >There are 2 strategies for dealing with that: > >Firstly, we can maintain a cache of the list of replica volumes (we can >request the list of replicas when we mount the original volume). > > > This assumes a lot on the part of the server and it doesn't seem to me that current server implementations are ready with the infrastructure to be able to make this a reality. I think that the client should be prepared to handle this sort of scenario but also be prepared to take a list of servers at mount time too. >Secondly, there are plans to add a backup list of failover servers in a >specialised DNS record. This strategy could be made to work for NFSv2/v3 >too. > This would seem to be a solution for how to determine the list of replicas, but not how the NFS client fails over from one replica to the next. Thanx... ps ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 19:17 ` Peter Staubach @ 2006-05-24 19:45 ` Trond Myklebust 0 siblings, 0 replies; 16+ messages in thread From: Trond Myklebust @ 2006-05-24 19:45 UTC (permalink / raw) To: Peter Staubach Cc: Jeff Moyer, Ian Kent, linux-fsdevel, autofs mailing list, nfs On Wed, 2006-05-24 at 15:17 -0400, Peter Staubach wrote: > This would seem to be a solution for how to determine the list of replicas, > but not how the NFS client fails over from one replica to the next. I'm fully aware of _that_. As I said earlier, work is in progress. Cheers, Trond ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-24 18:31           ` Trond Myklebust
  2006-05-24 19:17             ` Peter Staubach
@ 2006-05-25  3:56             ` Ian Kent
  1 sibling, 0 replies; 16+ messages in thread
From: Ian Kent @ 2006-05-25  3:56 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jeff Moyer, linux-fsdevel, autofs mailing list, nfs

On Wed, 2006-05-24 at 14:31 -0400, Trond Myklebust wrote:
> On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote:
> > ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds:
> > 
> > trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:
> > >> > 4) How does NFS v4 fit into this picture as I believe that some
> > >> >    of this functionality is included within the protocol.
> > >> 
> > >> And this.
> > >> 
> > >> NFS v4 appears quite different so should I be considering this for v2 and
> > >> v3 only?
> > 
> > trond.myklebust> NFSv4 has full support for migration/replication in the
> > trond.myklebust> protocol. If a filesystem fails on a given server, then
> > trond.myklebust> the server itself will tell the client where it can find
> > trond.myklebust> the replicas. There should be no need to provide that
> > trond.myklebust> information at mount time.
> > 
> > And what happens when the server disappears?
> 
> There are 2 strategies for dealing with that:
> 
> Firstly, we can maintain a cache of the list of replica volumes (we can
> request the list of replicas when we mount the original volume).
> 
> Secondly, there are plans to add a backup list of failover servers in a
> specialised DNS record. This strategy could be made to work for NFSv2/v3
> too.
> 

I see. That would work fine.

Personally, I'm not keen on using DNS for this as it adds another
source, separate from the original source, that needs to be kept up to
date. Unfortunately, in many environments it's not possible to deploy
new services, often for several years after they become available.

So there is a need to do this for v2 and v3 in the absence of v4. We at
least need to support the mount syntax used in other industry OSs to
round out the v2 and v3 implementation, so using mount seems the
logical thing to do. I think this would also fit in well with v4 in
that, as you mention above, the replica information needs to be
gathered at mount time.

I have the opportunity to spend some time on this now. Ideally I would
like to fit in with the work that is being done for v4 as much as
possible. For example, I noticed references to a struct nfs_fs_locations
in your patch set which may be useful for the information I need.
However, I haven't spotted anything that relates to failure detection
and failover itself (OK, I know you said you're working on it), so
perhaps I can contribute to this in a way that could help your v4 work.

So what's your plan for this?

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread