From: Ian Kent <raven@themaw.net>
To: Peter Staubach <staubach@redhat.com>
Cc: nfs@lists.sourceforge.net,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
autofs mailing list <autofs@linux.kernel.org>
Subject: Re: [NFS] Re: [RFC] Multiple server selection and replicated mount failover
Date: Wed, 24 May 2006 21:45:45 +0800
Message-ID: <1148478346.8182.22.camel@raven.themaw.net>
In-Reply-To: <44745972.2010305@redhat.com>
On Wed, 2006-05-24 at 09:02 -0400, Peter Staubach wrote:
> Ian Kent wrote:
>
> >
> >I've re-written the server selection code now and I believe it works
> >correctly.
> >
> >>Apart from mount time server selection, read-only replicated mounts need
> >>to be able to fail over to another server if the current one becomes
> >>unavailable.
> >>
> >>The questions I have are:
> >>
> >>1) What is the best place for each part of this process to be
> >>   carried out:
> >>   - mount time selection.
> >>   - read-only mount fail over.
> >
> >I think mount time selection should be done in mount, and I believe the
> >failover needs to be done in the kernel, against the list established by
> >the user space selection. The list should only change when a umount and
> >then a mount occurs (surely this is the only practical way to do it?).
> >
> >The code that I now have for the selection process could potentially
> >improve on the code used by the patches to mount for probing NFS servers,
> >and doing the probing once, in one place, has to be better than doing it
> >in both automount and mount.
> >
> >The failover is another story.
> >
> >It seems to me that there are two similar ways to do this:
> >
> >1) Pass a list of address and path entries to NFS at mount time,
> >intercept errors, identify whether the host is down and, if it is,
> >select and mount another server.
> >
> >2) Mount each member of the list with the best one on top, intercept
> >errors, identify whether the host is down and, if it is, select another
> >from the list of mounts and put it atop the mounts. Maintaining the
> >ordering with this approach could be difficult.
> >
> >With either of these approaches, handling open files and held locks
> >appears to be the difficult part.
> >
> >Anyone have anything to contribute on how I could handle this, or on
> >problems that I will encounter?
> >
>
> It seems to me that there is one other way which is similar to #1, except
> that instead of passing path entries to NFS at mount time, pass in file
> handles. This keeps all of the MOUNT protocol processing at the user
> level and does not require the kernel to learn anything about the MOUNT
> protocol. It also allows a reasonable list to be constructed, with
> checking to ensure that all the servers support the same version of the
> NFS protocol, probably that all of the servers support the same transport
> protocol, etc.
Of course, like #1 but with the benefits of #2, without the clutter. I
guess all I would have to do then is the vfs mount to make it happen.
Are we assuming a restriction that all the servers export the same path?
mtab could get a little confused.
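To make that concrete, here's a rough sketch of the sort of per-server
record I'd imagine user space handing down once it has done the MOUNT
protocol work and the version/transport checks (the name and layout are
made up purely for illustration):

#include <linux/in.h>

/*
 * Hypothetical per-server entry, filled in by user space after
 * probing the servers and fetching the root file handles via the
 * MOUNT protocol.  A mount would pass an array of these, best
 * candidate first; a normal mount is just an array of one.
 */
struct nfs_replica_info {
        struct sockaddr_in  addr;     /* server address */
        unsigned int        version;  /* NFS version, same across the list */
        unsigned int        proto;    /* transport, same across the list */
        unsigned int        fh_len;   /* root file handle length */
        unsigned char       fh[64];   /* root file handle from MOUNT */
};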
>
> >snip ..
> >
> >>3) Is there any existing work available that anyone is aware
> >>   of that could be used as a reference?
> >
> >Still wondering about this.
> >
>
> Well, there is the Solaris support.
But I'm not supposed to peek at that, am I (cough, splutter, ...)?
>
> >>4) How does NFS v4 fit into this picture, as I believe that some
> >>   of this functionality is included within the protocol?
> >
> >And this.
> >
> >NFS v4 appears quite different, so should I be considering this for v2
> >and v3 only?
> >
> >>Any comments or suggestions or reference code would be very much
> >>appreciated.
> >>
>
> The Solaris support works by passing a list of structs containing server
> information down into the kernel at mount time. This makes normal
> mounting just a subset of the replicated support, because a normal mount
> would just pass a list containing a single entry.
Cool. That's the way the selection code I have works, except for the
kernel bit, of course.
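To give a feel for it, the ordering step amounts to something like this
(user space, heavily simplified, all names illustrative):

#include <stdlib.h>

/*
 * Illustrative sketch only: probe each candidate, record a response
 * time, and sort so that the best responder heads the list handed to
 * the kernel.
 */
struct probe_result {
        const char *hostname;
        long rtt_usec;          /* round trip time of a NULL RPC ping */
};

static int by_rtt(const void *a, const void *b)
{
        const struct probe_result *pa = a, *pb = b;

        return (pa->rtt_usec > pb->rtt_usec) - (pa->rtt_usec < pb->rtt_usec);
}

/* after probing:  qsort(results, n, sizeof(*results), by_rtt); */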
>
> When the Solaris client gets a timeout from an RPC, it checks to see
> whether this file and mount are failover'able. This means checking
> whether there are alternate servers in the list, and could include a
> check for locks existing on the file. If there are locks, then don't
> fail over. The alternative is to attempt to move the lock, but this
> could be problematic because there would be no guarantee that the new
> lock could be acquired.
Yep. Failing over the locks looks like it could turn into a nightmare
really fast. Sounds like a good simplifying restriction for a first stab
at this.
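In fact the decision on an RPC timeout then reduces to something like
this (kernel side; every name below is invented for illustration):

/*
 * Sketch of the failover decision.  The ordering of the checks is the
 * point: no alternate servers, or locks held on the file, means no
 * failover.  nfs_replica_list and nfs_inode_has_locks() are made up.
 */
static int nfs_should_failover(struct nfs_replica_list *list,
                               struct inode *inode)
{
        if (list->nservers <= 1)        /* nowhere else to go */
                return 0;
        if (nfs_inode_has_locks(inode)) /* the simplifying restriction: */
                return 0;               /* never try to move held locks */
        return 1;
}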
>
> Anyway, if the file is failover'able, then a new server is chosen from the
> list and the file handle associated with the file is remapped to the
> equivalent file on the new server. This is done by repeating the lookups
> done to get the original file handle. Once the new file handle is acquired,
> then some minimal checks are done to try to ensure that the files are the
> "same". This is probably mostly checking to see whether the sizes of the
> two files are the same.
>
> Please note that this approach contains the interesting aspect that
> files are only failed over when they need to be and are not failed over
> proactively. This can lead to the situation where processes using the
> file system can be talking to many of the different underlying servers,
> all at the same time. If a server goes down and then comes back up
> before a process which was talking to that server notices, then that
> process will just continue to use that server, while another process,
> which noticed the failed server, may have failed over to a new server.
Interesting. This hadn't occurred to me yet.
I was still at the stage of wondering whether the "on demand" approach
would work, but the simplifying restriction above should make it workable
(I think ...).
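If I've read the remap step right, it would go roughly like this (a
sketch only; nfs_open_file and the helper functions are invented, and
only the minimal size check is shown):

/*
 * Sketch: remap an open file to the newly selected server by
 * repeating the lookups that produced the original file handle,
 * then do a minimal "same file" check before switching over.
 */
static int nfs_remap_file(struct nfs_open_file *of,
                          struct nfs_replica_info *new_server)
{
        struct nfs_fh new_fh;
        struct nfs_fattr new_attr;
        int err;

        /* walk the stored pathname on the new server */
        err = nfs_lookup_path(new_server, of->pathname, &new_fh, &new_attr);
        if (err)
                return err;

        /* minimal sanity check that this is the "same" file */
        if (new_attr.size != of->cached_size)
                return -ESTALE;

        of->fh = new_fh;                /* switch the handle ... */
        of->server = new_server;        /* ... and the server */
        return 0;
}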
>
> The key ingredient to this approach, I think, is a list of servers and
> information about them, and then information for each active NFS inode
> that keeps track of the pathname used to discover the file handle and
> also the server currently being used by the specific file.
Haven't quite got to the path issues yet.
But can't we just get the path from d_path? It will return the path from
a given dentry to the root of the mount, if I remember correctly, and we
have a file handle for the server.
But you're talking about the difficulty of the housekeeping overall, I
think.
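For reference, d_path() as I remember it currently looks like this, so
recovering a name to replay the lookups with might be as simple as
(locking and error handling omitted):

/* From memory -- the current prototype in fs/dcache.c: */
char *d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
             char *buf, int buflen);

/* Illustrative use, recovering the path for an open file: */
char buf[PATH_MAX];
char *path = d_path(file->f_dentry, file->f_vfsmnt, buf, sizeof(buf));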
> Thanx...
Thanks for your comments.
Much appreciated and certainly very helpful.
Ian