* [RFC] Multiple server selection and replicated mount failover
@ 2006-05-02  5:56 Ian Kent
  2006-05-24  5:05 ` Ian Kent
  0 siblings, 1 reply; 16+ messages in thread
From: Ian Kent @ 2006-05-02  5:56 UTC (permalink / raw)
  To: nfs; +Cc: linux-fsdevel, autofs mailing list

Hi all,

For some time now I have had code in autofs that attempts to select an
appropriate server from a weighted list to satisfy server priority
selection and Replicated Server requirements. The code has been
problematic from the beginning and is still incorrect, largely due to me
not merging the original patch well and also not fixing it correctly
afterward.

So I'd like to have this work properly and to do that I also need to
consider read-only NFS mount fail over.

The rules for server selection are, in order of priority (I believe):

1) Hosts on the local subnet.
2) Hosts on the local network.
3) Hosts on other networks.

Each of these proximity groups is made up of the largest number of
servers supporting a given NFS protocol version. For example, if there
were 5 servers and 4 supported v3 and 2 supported v2, then the candidate
group would be made up of the 4 supporting v3. Within the group of
candidate servers the one with the best response time is selected.
Selection within a proximity group can be further influenced by a
zero-based weight associated with each host. The higher the weight (a
cost really) the less likely a server is to be selected. I'm not clear
on exactly how the weight influences the selection, so perhaps someone
who is familiar with this could explain it?

Apart from mount time server selection, read-only replicated servers
need to be able to fail over to another server if the current one
becomes unavailable.

The questions I have are:

1) What is the best place for each part of this process to be
   carried out?
   - mount time selection.
   - read-only mount fail over.

2) What mechanisms would be best to use for the selection process?

3) Is there any existing work available that anyone is aware
   of that could be used as a reference?

4) How does NFS v4 fit into this picture, as I believe that some
   of this functionality is included within the protocol?

Any comments or suggestions or reference code would be very much
appreciated.

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread
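To make the selection rules above concrete, here is a minimal user-space
sketch. It assumes each candidate has already been probed; every type,
field and function name in it is hypothetical (this is not the autofs
code), and treating the weight as a multiplier on the measured response
time is only one plausible reading of "higher weight, less likely".

#include <stddef.h>

enum proximity { PROX_SUBNET, PROX_NETWORK, PROX_OTHER };

struct candidate {
	const char *host;
	enum proximity prox;	/* 1) subnet, 2) network, 3) other */
	int vers;		/* NFS version the server answered for (2-4) */
	unsigned long rtt_us;	/* measured response time */
	unsigned long weight;	/* zero-based cost from the map entry */
};

/*
 * Pick a server: nearest proximity group first, then the protocol
 * version supported by the most servers in that group, then the best
 * (weight-adjusted) response time.
 */
static struct candidate *select_server(struct candidate *c, size_t n)
{
	size_t count[5] = { 0, 0, 0, 0, 0 };	/* indexed by NFS version */
	enum proximity prox = PROX_OTHER;
	struct candidate *best = NULL;
	int vers = 2;
	size_t i;
	int v;

	/* 1. Nearest proximity class that has at least one member. */
	for (i = 0; i < n; i++)
		if (c[i].prox < prox)
			prox = c[i].prox;

	/* 2. Within that class, the version supported by the most servers,
	 *    e.g. 4 servers at v3 beat 2 servers at v2. */
	for (i = 0; i < n; i++)
		if (c[i].prox == prox && c[i].vers >= 2 && c[i].vers <= 4)
			count[c[i].vers]++;
	for (v = 2; v <= 4; v++)
		if (count[v] >= count[vers])
			vers = v;

	/* 3. Best response time wins; the zero-based weight is treated as a
	 *    cost multiplier, so heavily weighted hosts lose close contests. */
	for (i = 0; i < n; i++) {
		if (c[i].prox != prox || c[i].vers != vers)
			continue;
		if (!best || c[i].rtt_us * (c[i].weight + 1) <
			     best->rtt_us * (best->weight + 1))
			best = &c[i];
	}
	return best;		/* NULL only if the list was empty */
}

The exact weight semantics are precisely what the post above asks about,
so real code would want to match whatever the existing implementations do.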
* Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-02  5:56 [RFC] Multiple server selection and replicated mount failover Ian Kent
@ 2006-05-24  5:05 ` Ian Kent
  2006-05-24 13:02   ` [NFS] " Peter Staubach
  2006-05-24 16:29   ` Trond Myklebust
  0 siblings, 2 replies; 16+ messages in thread
From: Ian Kent @ 2006-05-24  5:05 UTC (permalink / raw)
  To: nfs; +Cc: linux-fsdevel, autofs mailing list

On Tue, 2 May 2006, Ian Kent wrote:

> 
> Hi all,
> 
> For some time now I have had code in autofs that attempts to select an
> appropriate server from a weighted list to satisfy server priority
> selection and Replicated Server requirements. The code has been
> problematic from the beginning and is still incorrect, largely due to me
> not merging the original patch well and also not fixing it correctly
> afterward.
> 
> So I'd like to have this work properly and to do that I also need to
> consider read-only NFS mount fail over.
> 
> The rules for server selection are, in order of priority (I believe):
> 
> 1) Hosts on the local subnet.
> 2) Hosts on the local network.
> 3) Hosts on other networks.
> 
> Each of these proximity groups is made up of the largest number of
> servers supporting a given NFS protocol version. For example, if there
> were 5 servers and 4 supported v3 and 2 supported v2, then the candidate
> group would be made up of the 4 supporting v3. Within the group of
> candidate servers the one with the best response time is selected.
> Selection within a proximity group can be further influenced by a
> zero-based weight associated with each host. The higher the weight (a
> cost really) the less likely a server is to be selected. I'm not clear
> on exactly how the weight influences the selection, so perhaps someone
> who is familiar with this could explain it?

I've re-written the server selection code now and I believe it works
correctly.

> 
> Apart from mount time server selection, read-only replicated servers
> need to be able to fail over to another server if the current one
> becomes unavailable.
> 
> The questions I have are:
> 
> 1) What is the best place for each part of this process to be
>    carried out?
>    - mount time selection.
>    - read-only mount fail over.

I think mount time selection should be done in mount and I believe the
failover needs to be done in the kernel against the list established with
the user space selection. The list should only change when a umount and
then a mount occurs (surely this is the only practical way to do it?).

The code that I now have for the selection process can potentially improve
the code used by patches to mount for probing NFS servers, and doing this
once in one place has to be better than doing it in both automount and
mount.

The failover is another story.

It seems to me that there are two similar ways to do this:

1) Pass a list of address and path entries to NFS at mount time and
intercept errors, identify if the host is down and if it is select and
mount another server.

2) Mount each member of the list with the best one on top and intercept
errors, identify if the host is down and if it is select another from the
list of mounts and put it atop the mounts. Maintaining the ordering with
this approach could be difficult.

With either of these approaches handling open files and held locks appears
to be the difficult part.

Anyone have anything to contribute on how I could handle this or problems
that I will encounter?

snip ..

> 
> 3) Is there any existing work available that anyone is aware
>    of that could be used as a reference?

Still wondering about this.

> 
> 4) How does NFS v4 fit into this picture, as I believe that some
>    of this functionality is included within the protocol?

And this.

NFS v4 appears quite different, so should I be considering this for v2 and
v3 only?

> 
> Any comments or suggestions or reference code would be very much
> appreciated.

Still.

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread
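On the probing side, the kind of check mount or automount needs can be
done entirely in user space with a timed NULL RPC per protocol version.
The sketch below is only illustrative and is not the code referred to in
the message above: the program number 100003 is the well-known NFS
program, but everything else (the UDP transport, the 2-second timeout,
the function names) is an arbitrary assumption.

#include <stdio.h>
#include <sys/time.h>
#include <rpc/rpc.h>

#define NFS_PROG	100003UL	/* well-known NFS program number */

/*
 * Time a NULL procedure call against the server's NFS service for one
 * protocol version.  Returns the round trip time in microseconds, or -1
 * if the server did not answer for that version.
 */
static long probe_nfs_version(const char *host, unsigned long version)
{
	struct timeval timeout = { 2, 0 };	/* arbitrary 2 second limit */
	struct timeval start, end;
	enum clnt_stat status;
	CLIENT *clnt;

	clnt = clnt_create(host, NFS_PROG, version, "udp");
	if (clnt == NULL)
		return -1;			/* version not registered */

	gettimeofday(&start, NULL);
	status = clnt_call(clnt, NULLPROC,
			   (xdrproc_t) xdr_void, NULL,
			   (xdrproc_t) xdr_void, NULL, timeout);
	gettimeofday(&end, NULL);
	clnt_destroy(clnt);

	if (status != RPC_SUCCESS)
		return -1;

	return (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_usec - start.tv_usec);
}

int main(int argc, char *argv[])
{
	unsigned long vers;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <host>\n", argv[0]);
		return 1;
	}
	for (vers = 2; vers <= 3; vers++) {
		long rtt = probe_nfs_version(argv[1], vers);
		if (rtt >= 0)
			printf("%s: v%lu answered in %ld us\n",
			       argv[1], vers, rtt);
	}
	return 0;
}

Run once against each address in a replicated map entry, a probe like
this yields the response-time and version data the selection step needs.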
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 5:05 ` Ian Kent @ 2006-05-24 13:02 ` Peter Staubach 2006-05-24 13:45 ` Ian Kent 2006-05-24 20:45 ` [NFS] " John T. Kohl 2006-05-24 16:29 ` Trond Myklebust 1 sibling, 2 replies; 16+ messages in thread From: Peter Staubach @ 2006-05-24 13:02 UTC (permalink / raw) To: Ian Kent; +Cc: nfs, linux-fsdevel, autofs mailing list Ian Kent wrote: >On Tue, 2 May 2006, Ian Kent wrote: > > > >>Hi all, >> >>For some time now I have had code in autofs that attempts to select an >>appropriate server from a weighted list to satisfy server priority >>selection and Replicated Server requirements. The code has been >>problematic from the beginning and is still incorrect largely due to me >>not merging the original patch well and also not fixing it correctly >>afterward. >> >>So I'd like to have this work properly and to do that I also need to >>consider read-only NFS mount fail over. >> >>The rules for server selection are, in order of priority (I believe): >> >>1) Hosts on the local subnet. >>2) Hosts on the local network. >>3) Hosts on other network. >> >>Each of these proximity groups is made up of the largest number of >>servers supporting a given NFS protocol version. For example if there were >>5 servers and 4 supported v3 and 2 supported v2 then the candidate group >>would be made up of the 4 supporting v3. Within the group of candidate >>servers the one with the best response time is selected. Selection >>within a proximity group can be further influenced by a zero based weight >>associated with each host. The higher the weight (a cost really) the less >>likely a server is to be selected. I'm not clear on exactly how he weight >>influences the selection, so perhaps someone who is familiar with this >>could explain it? >> >> > >I've re-written the server selection code now and I believe it works >correctly. > > > >>Apart from mount time server selection read-only replicated servers need >>to be able to fail over to another server if the current one becomes >>unavailable. >> >>The questions I have are: >> >>1) What is the best place for each part of this process to be >> carried out. >> - mount time selection. >> - read-only mount fail over. >> >> > >I think mount time selection should be done in mount and I believe the >failover needs to be done in the kernel against the list established with >the user space selection. The list should only change when a umount >and then a mount occurs (surely this is the only practical way to do it >?). > >The code that I now have for the selection process can potentially improve >the code used by patches to mount for probing NFS servers and doing this >once in one place has to be better than doing it in automount and mount. > >The failover is another story. > >It seems to me that there are two similar ways to do this: > >1) Pass a list of address and path entries to NFS at mount time and >intercept errors, identify if the host is down and if it is select and >mount another server. > >2) Mount each member of the list with the best one on top and intercept >errors, identify if the host is down and if it is select another from the >list of mounts and put it atop the mounts. Maintaining the ordering with >this approach could be difficult. > >With either of these approaches handling open files and held locks appears >to be the the difficult part. > >Anyone have anything to contribute on how I could handle this or problems >that I will encounter? 
>
>
>

It seems to me that there is one other way, which is similar to #1 except
that instead of passing path entries to NFS at mount time, pass in file
handles. This keeps all of the MOUNT protocol processing at the user
level and does not require the kernel to learn anything about the MOUNT
protocol. It also allows a reasonable list to be constructed, with
checking to ensure that all the servers support the same version of the
NFS protocol, probably that all of the servers support the same transport
protocol, etc.

>snip ..
>
>
>>3) Is there any existing work available that anyone is aware
>>   of that could be used as a reference.
>>
>
>Still wondering about this.
>
>

Well, there is the Solaris support.

>>4) How does NFS v4 fit into this picture as I believe that some
>>   of this functionality is included within the protocol.
>>
>
>And this.
>
>NFS v4 appears quite different so should I be considering this for v2 and
>v3 only?
>
>
>>Any comments or suggestions or reference code would be very much
>>appreciated.
>>

The Solaris support works by passing a list of structs containing server
information down into the kernel at mount time. This makes normal mounting
just a subset of the replicated support, because a normal mount would just
contain a list with a single entry.

When the Solaris client gets a timeout from an RPC, it checks to see
whether this file and mount are failover'able. This checks to see whether
there are alternate servers in the list and could contain a check to see
if there are locks existing on the file. If there are locks, then don't
fail over. The alternative to doing this is to attempt to move the lock,
but this could be problematic because there would be no guarantee that
the new lock could be acquired.

Anyway, if the file is failover'able, then a new server is chosen from the
list and the file handle associated with the file is remapped to the
equivalent file on the new server. This is done by repeating the lookups
done to get the original file handle. Once the new file handle is
acquired, some minimal checks are done to try to ensure that the files
are the "same". This is probably mostly checking to see whether the sizes
of the two files are the same.

Please note that this approach contains the interesting aspect that
files are only failed over when they need to be and are not failed over
proactively. This can lead to the situation where processes using the
file system can be talking to many of the different underlying servers,
all at the same time. If a server goes down and then comes back up before
a process which was talking to that server notices, then it will just
continue to use that server, while another process, which noticed the
failed server, may have failed over to a new server.

The key ingredient to this approach, I think, is a list of servers and
information about them, and then information for each active NFS inode
that keeps track of the pathname used to discover the file handle and
also the server which is currently being used by the specific file.

       Thanx...

          ps

^ permalink raw reply	[flat|nested] 16+ messages in thread
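For discussion's sake, the shape of that timeout-driven remap might look
something like the sketch below. Every type and helper in it
(struct replica, remap_by_lookup(), replica_file_size(), ...) is made up
for illustration; it is neither the Solaris code nor anything in the
Linux client, but it captures the two checks described above: only fail
over when there is somewhere to go and no locks are held, and re-walk
the stored path to obtain an equivalent handle.

#include <netinet/in.h>

struct replica {
	struct sockaddr_in addr;	/* server address */
	const char *export_path;	/* path exported by that server */
};

struct replica_list {
	struct replica *servers;
	int nr_servers;
	int current;			/* index of the server in use */
};

struct failover_file {
	struct replica_list *replicas;
	const char *rel_path;		/* path used for the original lookup */
	unsigned char fh[64];		/* current file handle */
	unsigned int fh_len;
	long long size;			/* last known size, for sanity check */
	int locks_held;
};

/* Hypothetical helpers: repeat the lookups on a replica to produce a new
 * handle, and fetch the size of the file that handle refers to. */
int remap_by_lookup(struct replica *r, const char *rel_path,
		    unsigned char *fh, unsigned int *fh_len);
long long replica_file_size(struct replica *r, const unsigned char *fh,
			    unsigned int fh_len);

/* Called when an RPC to the current server has timed out.  Returns 0 if
 * the caller should retry against the newly selected server. */
int maybe_failover(struct failover_file *f)
{
	struct replica_list *rl = f->replicas;
	int tried, next;

	if (rl->nr_servers < 2)
		return -1;		/* nowhere else to go */
	if (f->locks_held)
		return -1;		/* don't try to migrate locks */

	for (tried = 1; tried < rl->nr_servers; tried++) {
		next = (rl->current + tried) % rl->nr_servers;

		/* Repeat the lookups that produced the original handle. */
		if (remap_by_lookup(&rl->servers[next], f->rel_path,
				    f->fh, &f->fh_len) != 0)
			continue;	/* this replica is unreachable too */

		/* Minimal check that it really is the "same" file. */
		if (replica_file_size(&rl->servers[next], f->fh, f->fh_len)
		    != f->size)
			continue;

		rl->current = next;
		return 0;
	}
	return -1;			/* no usable replica found */
}

A single-entry list makes the ordinary, non-replicated case just the
degenerate form of the same structure, which matches the "normal mount is
a subset" observation above.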
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 13:02 ` [NFS] " Peter Staubach @ 2006-05-24 13:45 ` Ian Kent 2006-05-24 14:04 ` Peter Staubach 2006-05-24 20:45 ` [NFS] " John T. Kohl 1 sibling, 1 reply; 16+ messages in thread From: Ian Kent @ 2006-05-24 13:45 UTC (permalink / raw) To: Peter Staubach; +Cc: nfs, linux-fsdevel, autofs mailing list On Wed, 2006-05-24 at 09:02 -0400, Peter Staubach wrote: > Ian Kent wrote: > > > > >I've re-written the server selection code now and I believe it works > >correctly. > > > > > > > >>Apart from mount time server selection read-only replicated servers need > >>to be able to fail over to another server if the current one becomes > >>unavailable. > >> > >>The questions I have are: > >> > >>1) What is the best place for each part of this process to be > >> carried out. > >> - mount time selection. > >> - read-only mount fail over. > >> > >> > > > >I think mount time selection should be done in mount and I believe the > >failover needs to be done in the kernel against the list established with > >the user space selection. The list should only change when a umount > >and then a mount occurs (surely this is the only practical way to do it > >?). > > > >The code that I now have for the selection process can potentially improve > >the code used by patches to mount for probing NFS servers and doing this > >once in one place has to be better than doing it in automount and mount. > > > >The failover is another story. > > > >It seems to me that there are two similar ways to do this: > > > >1) Pass a list of address and path entries to NFS at mount time and > >intercept errors, identify if the host is down and if it is select and > >mount another server. > > > >2) Mount each member of the list with the best one on top and intercept > >errors, identify if the host is down and if it is select another from the > >list of mounts and put it atop the mounts. Maintaining the ordering with > >this approach could be difficult. > > > >With either of these approaches handling open files and held locks appears > >to be the the difficult part. > > > >Anyone have anything to contribute on how I could handle this or problems > >that I will encounter? > > > > > > > > It seems to me that there is one other way which is similiar to #1 except > that instead of passing path entries to NFS at mount time, pass in file > handles. This keeps all of the MOUNT protocol processing at the user > level and does not require the kernel to learn anything about the MOUNT > protocol. It also allows a reasonable list to be constructed, with > checking to ensure that all the servers support the same version of the > NFS protocol, probably that all of the server support the same transport > protocol, etc. Of course, like #1 but with the benefits of #2 without the clutter. I guess all I would have to do then is the vfs mount to make it happen. Are we assuming a restriction like all the mounts have the same path exported from the server? mtab could get a little confused. > > >snip .. > > > > > > > >>3) Is there any existing work available that anyone is aware > >> of that could be used as a reference. > >> > >> > > > >Still wondering about this. > > > > > > > > Well, there is the Solaris support. But I'm not supposed to peek at that am I (cough, splutter, ...)? > > >>4) How does NFS v4 fit into this picture as I believe that some > >> of this functionality is included within the protocol. > >> > >> > > > >And this. 
> >
> >NFS v4 appears quite different so should I be considering this for v2 and
> >v3 only?
> >
> >
> >>Any comments or suggestions or reference code would be very much
> >>appreciated.
> >>
> >
> The Solaris support works by passing a list of structs containing server
> information down into the kernel at mount time. This makes normal mounting
> just a subset of the replicated support because a normal mount would just
> contain a list with a single entry.

Cool. That's the way the selection code I have works, except for the
kernel bit of course.

> 
> When the Solaris client gets a timeout from an RPC, it checks to see
> whether this file and mount are failover'able. This checks to see whether
> there are alternate servers in the list and could contain a check to see
> if there are locks existing on the file. If there are locks, then don't
> fail over. The alternative to doing this is to attempt to move the lock,
> but this could be problematic because there would be no guarantee that
> the new lock could be acquired.

Yep. Failing over the locks looks like it could turn into a nightmare
really fast. Sounds like a good simplifying restriction for a first stab
at this.

> 
> Anyway, if the file is failover'able, then a new server is chosen from the
> list and the file handle associated with the file is remapped to the
> equivalent file on the new server. This is done by repeating the lookups
> done to get the original file handle. Once the new file handle is
> acquired, some minimal checks are done to try to ensure that the files
> are the "same". This is probably mostly checking to see whether the sizes
> of the two files are the same.
> 
> Please note that this approach contains the interesting aspect that
> files are only failed over when they need to be and are not failed over
> proactively. This can lead to the situation where processes using the
> file system can be talking to many of the different underlying servers,
> all at the same time. If a server goes down and then comes back up before
> a process which was talking to that server notices, then it will just
> continue to use that server, while another process, which noticed the
> failed server, may have failed over to a new server.

Interesting. This hadn't occurred to me yet.

I was still at the stage of wondering whether the "on demand" approach
would work, but the simplifying restriction above should make it workable
(I think ....).

> 
> The key ingredient to this approach, I think, is a list of servers and
> information about them, and then information for each active NFS inode
> that keeps track of the pathname used to discover the file handle and
> also the server which is currently being used by the specific file.

Haven't quite got to the path issues yet.
But can't we just get the path from d_path?
It will return the path from a given dentry to the root of the mount, if
I remember correctly, and we have a file handle for the server.

But you're talking about the difficulty of the housekeeping overall, I
think.

> Thanx...

Thanks for your comments.
Much appreciated and certainly very helpful.

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread
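On the path question: whatever d_path() does or does not give, a
mount-relative name can also be rebuilt by walking d_parent up to the
root of the filesystem, which is roughly the per-file bookkeeping needed
to repeat the lookups on another server. The fragment below is only a
rough kernel-style sketch, not code from any tree; it ignores the dcache
locking and rename races that real code would have to handle.

#include <linux/dcache.h>
#include <linux/string.h>

/*
 * Build the path of a dentry relative to the root of its filesystem,
 * filling the supplied buffer from the end.  Sketch only: no locking,
 * no protection against concurrent renames.
 */
static char *export_relative_path(struct dentry *dentry, char *buf, int buflen)
{
	char *end = buf + buflen;
	int len;

	*--end = '\0';
	while (!IS_ROOT(dentry)) {
		len = dentry->d_name.len;
		if (end - buf < len + 1)
			return NULL;		/* buffer too small */
		end -= len;
		memcpy(end, dentry->d_name.name, len);
		*--end = '/';
		dentry = dentry->d_parent;
	}
	if (*end == '\0')			/* dentry was the root itself */
		*--end = '/';
	return end;
}

Whether something like this, d_path(), or simply recording the name at
lookup time is used matters less than keeping that information per active
file, as described above.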
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-24 13:45     ` Ian Kent
@ 2006-05-24 14:04       ` Peter Staubach
  2006-05-24 14:31         ` Ian Kent
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Staubach @ 2006-05-24 14:04 UTC (permalink / raw)
  To: Ian Kent; +Cc: nfs, linux-fsdevel, autofs mailing list

Ian Kent wrote:

>
>Of course, like #1 but with the benefits of #2 without the clutter. I
>guess all I would have to do then is the vfs mount to make it happen.
>Are we assuming a restriction like all the mounts have the same path
>exported from the server? mtab could get a little confused.
>
>

I don't think that this needs to be restricted like this. I think that
it would be nice if mtab just showed the arguments which were used to
the mount command, ie. with all of the server and path combinations listed
just like they were on the command line or in the autofs map.

>But I'm not supposed to peek at that am I (cough, splutter, ...)?
>
>

Yes, I know. I am trying to figure out how much of the architecture and
implementation that I can talk about too... :-)

>Cool. That's the way the selection code I have works, except for the
>kernel bit of course.
>
>

Good news!

>Yep. Failing over the locks looks like it could turn into a nightmare
>really fast. Sounds like a good simplifying restriction for a first stab
>at this.
>
>

Agreed.

>Interesting. This hadn't occurred to me yet.
>
>I was still at the stage of wondering whether the "on demand" approach
>would work but the simplifying restriction above should make it workable
>(I think ....).
>
>

I think that simple is good. We can always get more complicated later, if
need be. (Hope not... :-) )

>>The key ingredient to this approach, I think, is a list of servers and
>>information about them, and then information for each active NFS inode
>>that keeps track of the pathname used to discover the file handle and
>>also the server which is currently being used by the specific file.
>>
>>
>
>Haven't quite got to the path issues yet.
>But can't we just get the path from d_path?
>It will return the path from a given dentry to the root of the mount, if
>I remember correctly, and we have a file handle for the server.
>
>But you're talking about the difficulty of the housekeeping overall, I
>think.
>
>

Yes, I was specifically trying to avoid talking about how to manage the
information. I think that that is an implementation detail, which is
better left until after the high level architecture is defined.

>> Thanx...
>>
>>
>
>Thanks for your comments.
>Much appreciated and certainly very helpful.
>

You're welcome.

I can try to talk more about the architecture and implementation that I am
familiar with, if you like.

       ps

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-24 14:04       ` Peter Staubach
@ 2006-05-24 14:31         ` Ian Kent
  0 siblings, 0 replies; 16+ messages in thread
From: Ian Kent @ 2006-05-24 14:31 UTC (permalink / raw)
  To: Peter Staubach; +Cc: nfs, linux-fsdevel, autofs mailing list

On Wed, 2006-05-24 at 10:04 -0400, Peter Staubach wrote:
> Ian Kent wrote:
> 
> >
> >Of course, like #1 but with the benefits of #2 without the clutter. I
> >guess all I would have to do then is the vfs mount to make it happen.
> >Are we assuming a restriction like all the mounts have the same path
> >exported from the server? mtab could get a little confused.
> >
> >
> 
> I don't think that this needs to be restricted like this. I think that
> it would be nice if mtab just showed the arguments which were used to
> the mount command, ie. with all of the server and path combinations listed
> just like they were on the command line or in the autofs map.
> 
> >But I'm not supposed to peek at that am I (cough, splutter, ...)?
> >
> >
> 
> Yes, I know. I am trying to figure out how much of the architecture and
> implementation that I can talk about too... :-)
> 
> >Cool. That's the way the selection code I have works, except for the
> >kernel bit of course.
> >
> >
> 
> Good news!

It's in autofs 5 now. The idea is that it will work for any mount
string, replicated syntax or not. So there's no extra mucking around.

I hope to push it into mount and provide a configure option to disable
it in autofs if mount can do it instead.

> 
> >Yep. Failing over the locks looks like it could turn into a nightmare
> >really fast. Sounds like a good simplifying restriction for a first stab
> >at this.
> >
> >
> 
> Agreed.
> 
> >Interesting. This hadn't occurred to me yet.
> >
> >I was still at the stage of wondering whether the "on demand" approach
> >would work but the simplifying restriction above should make it workable
> >(I think ....).
> >
> >
> 
> I think that simple is good. We can always get more complicated later, if
> need be. (Hope not... :-) )
> 
> >>The key ingredient to this approach, I think, is a list of servers and
> >>information about them, and then information for each active NFS inode
> >>that keeps track of the pathname used to discover the file handle and
> >>also the server which is currently being used by the specific file.
> >>
> >>
> >
> >Haven't quite got to the path issues yet.
> >But can't we just get the path from d_path?
> >It will return the path from a given dentry to the root of the mount, if
> >I remember correctly, and we have a file handle for the server.
> >
> >But you're talking about the difficulty of the housekeeping overall, I
> >think.
> >
> >
> 
> Yes, I was specifically trying to avoid talking about how to manage the
> information. I think that that is an implementation detail, which is
> better left until after the high level architecture is defined.

Yep.

> 
> >> Thanx...
> >>
> >>
> >
> >Thanks for your comments.
> >Much appreciated and certainly very helpful.
> >
> 
> You're welcome.
> 
> I can try to talk more about the architecture and implementation that I am
> familiar with, if you like.

Any and all information is good.
Food for thought will give me something to eat!

Ian
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 13:02 ` [NFS] " Peter Staubach 2006-05-24 13:45 ` Ian Kent @ 2006-05-24 20:45 ` John T. Kohl 2006-05-24 20:52 ` Dan Stromberg ` (2 more replies) 1 sibling, 3 replies; 16+ messages in thread From: John T. Kohl @ 2006-05-24 20:45 UTC (permalink / raw) To: Peter Staubach; +Cc: Ian Kent, nfs, linux-fsdevel, autofs mailing list >>>>> "PS" == Peter Staubach <staubach@redhat.com> writes: PS> When the Solaris client gets a timeout from an RPC, it checks to see PS> whether this file and mount are failover'able. This checks to see PS> whether there are alternate servers in the list and could contain a PS> check to see if there are locks existing on the file. If there are PS> locks, then don't failover. The alternative to doing this is to PS> attempt to move the lock, but this could be problematic because PS> there would be no guarantee that the new lock could be acquired. PS> Anyway, if the file is failover'able, then a new server is chosen PS> from the list and the file handle associated with the file is PS> remapped to the equivalent file on the new server. This is done by PS> repeating the lookups done to get the original file handle. Once PS> the new file handle is acquired, then some minimal checks are done PS> to try to ensure that the files are the "same". This is probably PS> mostly checking to see whether the sizes of the two files are the PS> same. PS> Please note that this approach contains the interesting aspect that PS> files are only failed over when they need to be and are not failed over PS> proactively. This can lead to the situation where processes using the PS> the file system can be talking to many of the different underlying PS> servers, all at the sametime. If a server goes down and then comes back PS> up before a process, which was talking to that server, notices, then it PS> will just continue to use that server, while another process, which PS> noticed the failed server, may have failed over to a new server. If you have multiple processes talking to different server replicas, can you then get cases where the processes aren't sharing the same files given the same name? Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and starts working on it. It then sits around doing nothing for a while. Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2, and then looks up "c/file.c" which will be referencing the object on server 2 ? A & B then try locking to cooperate... Are replicas only useful for read-only copies? If they're read-only, do locks even make sense? -- John Kohl Senior Software Engineer - Rational Software - IBM Software Group Lexington, Massachusetts, USA jtk@us.ibm.com <http://www.ibm.com/software/rational/> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 20:45 ` [NFS] " John T. Kohl @ 2006-05-24 20:52 ` Dan Stromberg 2006-05-29 7:31 ` [autofs] " Ian Kent 2006-05-30 12:02 ` Jeff Moyer 2 siblings, 0 replies; 16+ messages in thread From: Dan Stromberg @ 2006-05-24 20:52 UTC (permalink / raw) To: John T. Kohl Cc: Peter Staubach, Ian Kent, nfs, linux-fsdevel, autofs mailing list, strombrg On Wed, 2006-05-24 at 16:45 -0400, John T. Kohl wrote: > > If you have multiple processes talking to different server replicas, can > you then get cases where the processes aren't sharing the same files given > the same name? Sounds like it to me. > Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and > starts working on it. It then sits around doing nothing for a while. > > Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2, > and then looks up "c/file.c" which will be referencing the object on > server 2 ? Yup. > A & B then try locking to cooperate... Yup. To get good locking semantics, you'll probably need a distributed filesystem like GFS or Lustre. > Are replicas only useful for read-only copies? If they're read-only, do > locks even make sense? Yes, they can - imagine a software system that wants to make sure only one process is accessing some data at a time. The lock might be in one filesystem, but the data might be in another. Or you might be reading from a device that has really expensive seeks (a DVD comes to mind), so you want to be sure that only one thing is reading from it at a time. There are other possible scenarios I imagine, but many folks may be able to live without any of them. :) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 20:45 ` [NFS] " John T. Kohl 2006-05-24 20:52 ` Dan Stromberg @ 2006-05-29 7:31 ` Ian Kent 2006-05-30 12:02 ` Jeff Moyer 2 siblings, 0 replies; 16+ messages in thread From: Ian Kent @ 2006-05-29 7:31 UTC (permalink / raw) To: John T. Kohl; +Cc: Peter Staubach, linux-fsdevel, autofs mailing list, nfs On Thu, 24 May 2006, John T. Kohl wrote: > >>>>> "PS" == Peter Staubach <staubach@redhat.com> writes: > > PS> When the Solaris client gets a timeout from an RPC, it checks to see > PS> whether this file and mount are failover'able. This checks to see > PS> whether there are alternate servers in the list and could contain a > PS> check to see if there are locks existing on the file. If there are > PS> locks, then don't failover. The alternative to doing this is to > PS> attempt to move the lock, but this could be problematic because > PS> there would be no guarantee that the new lock could be acquired. > > PS> Anyway, if the file is failover'able, then a new server is chosen > PS> from the list and the file handle associated with the file is > PS> remapped to the equivalent file on the new server. This is done by > PS> repeating the lookups done to get the original file handle. Once > PS> the new file handle is acquired, then some minimal checks are done > PS> to try to ensure that the files are the "same". This is probably > PS> mostly checking to see whether the sizes of the two files are the > PS> same. > > PS> Please note that this approach contains the interesting aspect that > PS> files are only failed over when they need to be and are not failed over > PS> proactively. This can lead to the situation where processes using the > PS> the file system can be talking to many of the different underlying > PS> servers, all at the sametime. If a server goes down and then comes back > PS> up before a process, which was talking to that server, notices, then it > PS> will just continue to use that server, while another process, which > PS> noticed the failed server, may have failed over to a new server. > > If you have multiple processes talking to different server replicas, can > you then get cases where the processes aren't sharing the same files given > the same name? > > Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and > starts working on it. It then sits around doing nothing for a while. > > Process "B" cd's to /mount/a/b, gets a timeout, fails over to server 2, > and then looks up "c/file.c" which will be referencing the object on > server 2 ? > > A & B then try locking to cooperate... > > Are replicas only useful for read-only copies? If they're read-only, do > locks even make sense? Apps will take locks whether it makes sense or not. So refusing to fail-over if locks are held is likely the best approach. The case of replica filesystems themselves being updated could give rise to some interesting difficulties. Ian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 20:45 ` [NFS] " John T. Kohl 2006-05-24 20:52 ` Dan Stromberg 2006-05-29 7:31 ` [autofs] " Ian Kent @ 2006-05-30 12:02 ` Jeff Moyer 2 siblings, 0 replies; 16+ messages in thread From: Jeff Moyer @ 2006-05-30 12:02 UTC (permalink / raw) To: John T. Kohl Cc: Peter Staubach, linux-fsdevel, autofs mailing list, nfs, Ian Kent ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; jtk@us.ibm.com (John T. Kohl) adds: >>>>>> "PS" == Peter Staubach <staubach@redhat.com> writes: PS> When the Solaris client gets a timeout from an RPC, it checks to see PS> whether this file and mount are failover'able. This checks to see PS> whether there are alternate servers in the list and could contain a PS> check to see if there are locks existing on the file. If there are PS> locks, then don't failover. The alternative to doing this is to PS> attempt to move the lock, but this could be problematic because there PS> would be no guarantee that the new lock could be acquired. PS> Anyway, if the file is failover'able, then a new server is chosen from PS> the list and the file handle associated with the file is remapped to PS> the equivalent file on the new server. This is done by repeating the PS> lookups done to get the original file handle. Once the new file handle PS> is acquired, then some minimal checks are done to try to ensure that PS> the files are the "same". This is probably mostly checking to see PS> whether the sizes of the two files are the same. PS> Please note that this approach contains the interesting aspect that PS> files are only failed over when they need to be and are not failed over PS> proactively. This can lead to the situation where processes using the PS> the file system can be talking to many of the different underlying PS> servers, all at the sametime. If a server goes down and then comes PS> back up before a process, which was talking to that server, notices, PS> then it will just continue to use that server, while another process, PS> which noticed the failed server, may have failed over to a new server. jtk> If you have multiple processes talking to different server replicas, jtk> can you then get cases where the processes aren't sharing the same jtk> files given the same name? jtk> Process "A" looks up /mount/a/b/c/file.c (using server 1) opens it and jtk> starts working on it. It then sits around doing nothing for a while. jtk> Process "B" cd's to /mount/a/b, gets a timeout, fails over to server jtk> 2, and then looks up "c/file.c" which will be referencing the object jtk> on server 2 ? jtk> A & B then try locking to cooperate... jtk> Are replicas only useful for read-only copies? If they're read-only, jtk> do locks even make sense? In the docs I've read, the replicated failover only works for read-only file systems. You can have a replicated server entry for read-write file systems, but only one of those will be mounted by the automounter. To change servers would require a timeout (unmount) and subsequent lookup (mount). I don't think we need to try to kill ourselves by making this too complex. -Jeff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 5:05 ` Ian Kent 2006-05-24 13:02 ` [NFS] " Peter Staubach @ 2006-05-24 16:29 ` Trond Myklebust 2006-05-24 17:58 ` [autofs] " Jeff Moyer 1 sibling, 1 reply; 16+ messages in thread From: Trond Myklebust @ 2006-05-24 16:29 UTC (permalink / raw) To: Ian Kent; +Cc: nfs, linux-fsdevel, autofs mailing list On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: > It seems to me that there are two similar ways to do this: > > 1) Pass a list of address and path entries to NFS at mount time and > intercept errors, identify if the host is down and if it is select and > mount another server. > > 2) Mount each member of the list with the best one on top and intercept > errors, identify if the host is down and if it is select another from the > list of mounts and put it atop the mounts. Maintaining the ordering with > this approach could be difficult. Solaris has implemented option (1). To me, that is the approach that makes the most sense: why add the overhead of maintaining all these redundant mounts? > With either of these approaches handling open files and held locks appears > to be the the difficult part. Always has been, and always will. We're working on this problem, but progress is slow. In any case, we'll be concentrating on solving it for NFSv4 first (since that has native support for migrated/replicated volumes). > Anyone have anything to contribute on how I could handle this or problems > that I will encounter? > > > snip .. > > > > > 3) Is there any existing work available that anyone is aware > > of that could be used as a reference. > > Still wondering about this. > > > > > 4) How does NFS v4 fit into this picture as I believe that some > > of this functionality is included within the protocol. > > And this. > > NFS v4 appears quite different so should I be considering this for v2 and > v3 only? NFSv4 has full support for migration/replication in the protocol. If a filesystem fails on a given server, then the server itself will tell the client where it can find the replicas. There should be no need to provide that information at mount time. Cheers, Trond ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 16:29 ` Trond Myklebust @ 2006-05-24 17:58 ` Jeff Moyer 2006-05-24 18:31 ` Trond Myklebust 0 siblings, 1 reply; 16+ messages in thread From: Jeff Moyer @ 2006-05-24 17:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Ian Kent, linux-fsdevel, autofs mailing list, nfs ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds: trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: >> > 4) How does NFS v4 fit into this picture as I believe that some >> > of this functionality is included within the protocol. >> >> And this. >> >> NFS v4 appears quite different so should I be considering this for v2 and >> v3 only? trond.myklebust> NFSv4 has full support for migration/replication in the trond.myklebust> protocol. If a filesystem fails on a given server, then trond.myklebust> the server itself will tell the client where it can find trond.myklebust> the replicas. There should be no need to provide that trond.myklebust> information at mount time. And what happens when the server disappears? -Jeff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 17:58 ` [autofs] " Jeff Moyer @ 2006-05-24 18:31 ` Trond Myklebust 2006-05-24 19:17 ` Peter Staubach 2006-05-25 3:56 ` Ian Kent 0 siblings, 2 replies; 16+ messages in thread From: Trond Myklebust @ 2006-05-24 18:31 UTC (permalink / raw) To: Jeff Moyer; +Cc: Ian Kent, linux-fsdevel, autofs mailing list, nfs On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote: > ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds: > > trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: > >> > 4) How does NFS v4 fit into this picture as I believe that some > >> > of this functionality is included within the protocol. > >> > >> And this. > >> > >> NFS v4 appears quite different so should I be considering this for v2 and > >> v3 only? > > trond.myklebust> NFSv4 has full support for migration/replication in the > trond.myklebust> protocol. If a filesystem fails on a given server, then > trond.myklebust> the server itself will tell the client where it can find > trond.myklebust> the replicas. There should be no need to provide that > trond.myklebust> information at mount time. > > And what happens when the server disappears? There are 2 strategies for dealing with that: Firstly, we can maintain a cache of the list of replica volumes (we can request the list of replicas when we mount the original volume). Secondly, there are plans to add a backup list of failover servers in a specialised DNS record. This strategy could be made to work for NFSv2/v3 too. Cheers, Trond ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 18:31 ` Trond Myklebust @ 2006-05-24 19:17 ` Peter Staubach 2006-05-24 19:45 ` Trond Myklebust 2006-05-25 3:56 ` Ian Kent 1 sibling, 1 reply; 16+ messages in thread From: Peter Staubach @ 2006-05-24 19:17 UTC (permalink / raw) To: Trond Myklebust Cc: Jeff Moyer, Ian Kent, linux-fsdevel, autofs mailing list, nfs Trond Myklebust wrote: >On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote: > > >>==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds: >> >>trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote: >> >> >>>>>4) How does NFS v4 fit into this picture as I believe that some >>>>> of this functionality is included within the protocol. >>>>> >>>>> >>>>And this. >>>> >>>>NFS v4 appears quite different so should I be considering this for v2 and >>>>v3 only? >>>> >>>> >>trond.myklebust> NFSv4 has full support for migration/replication in the >>trond.myklebust> protocol. If a filesystem fails on a given server, then >>trond.myklebust> the server itself will tell the client where it can find >>trond.myklebust> the replicas. There should be no need to provide that >>trond.myklebust> information at mount time. >> >>And what happens when the server disappears? >> >> > >There are 2 strategies for dealing with that: > >Firstly, we can maintain a cache of the list of replica volumes (we can >request the list of replicas when we mount the original volume). > > > This assumes a lot on the part of the server and it doesn't seem to me that current server implementations are ready with the infrastructure to be able to make this a reality. I think that the client should be prepared to handle this sort of scenario but also be prepared to take a list of servers at mount time too. >Secondly, there are plans to add a backup list of failover servers in a >specialised DNS record. This strategy could be made to work for NFSv2/v3 >too. > This would seem to be a solution for how to determine the list of replicas, but not how the NFS client fails over from one replica to the next. Thanx... ps ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover 2006-05-24 19:17 ` Peter Staubach @ 2006-05-24 19:45 ` Trond Myklebust 0 siblings, 0 replies; 16+ messages in thread From: Trond Myklebust @ 2006-05-24 19:45 UTC (permalink / raw) To: Peter Staubach Cc: Jeff Moyer, Ian Kent, linux-fsdevel, autofs mailing list, nfs On Wed, 2006-05-24 at 15:17 -0400, Peter Staubach wrote: > This would seem to be a solution for how to determine the list of replicas, > but not how the NFS client fails over from one replica to the next. I'm fully aware of _that_. As I said earlier, work is in progress. Cheers, Trond ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover
  2006-05-24 18:31           ` Trond Myklebust
  2006-05-24 19:17             ` Peter Staubach
@ 2006-05-25  3:56             ` Ian Kent
  1 sibling, 0 replies; 16+ messages in thread
From: Ian Kent @ 2006-05-25  3:56 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jeff Moyer, linux-fsdevel, autofs mailing list, nfs

On Wed, 2006-05-24 at 14:31 -0400, Trond Myklebust wrote:
> On Wed, 2006-05-24 at 13:58 -0400, Jeff Moyer wrote:
> > ==> Regarding [autofs] Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover; Trond Myklebust <trond.myklebust@fys.uio.no> adds:
> > 
> > trond.myklebust> On Wed, 2006-05-24 at 13:05 +0800, Ian Kent wrote:
> > >> > 4) How does NFS v4 fit into this picture as I believe that some
> > >> >    of this functionality is included within the protocol.
> > >> 
> > >> And this.
> > >> 
> > >> NFS v4 appears quite different so should I be considering this for v2 and
> > >> v3 only?
> > 
> > trond.myklebust> NFSv4 has full support for migration/replication in the
> > trond.myklebust> protocol. If a filesystem fails on a given server, then
> > trond.myklebust> the server itself will tell the client where it can find
> > trond.myklebust> the replicas. There should be no need to provide that
> > trond.myklebust> information at mount time.
> > 
> > And what happens when the server disappears?
> 
> There are 2 strategies for dealing with that:
> 
> Firstly, we can maintain a cache of the list of replica volumes (we can
> request the list of replicas when we mount the original volume).
> 
> Secondly, there are plans to add a backup list of failover servers in a
> specialised DNS record. This strategy could be made to work for NFSv2/v3
> too.
> 

I see. That would work fine.

Personally, I'm not keen on using DNS for this as it adds another
source, separate from the original source, that needs to be kept up to
date. Unfortunately, in many environments it's not possible to deploy
new services, often for several years after they become available.

So there is a need to do this for v2 and v3 in the absence of v4. We at
least need to support the mount syntax used in other industry OSs to
round out the v2 and v3 implementation, so using mount seems the
logical thing to do. I think this would also fit in well with v4 in
that, as you mention above, the replica information needs to be
gathered at mount time.

I have the opportunity to spend some time on this now. Ideally I would
like to fit in with the work that is being done for v4 as much as
possible. For example, I noticed references to a struct nfs_fs_locations
in your patch set which may be useful for the information I need.
However, I haven't spotted anything that relates to failure detection
and failover itself (OK, I know you said you're working on it), so
perhaps I can contribute to this in a way that could help your v4 work.

So what's your plan for this?

Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread