From: Wendy Cheng <wcheng@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
Date: Tue, 17 Apr 2007 15:30:21 -0400
Message-ID: <4625204D.1030509@redhat.com>
In-Reply-To: <46156F3F.3070606@redhat.com>
A few new thoughts from the latest round of review are really good and
worth doing.
However, since this particular NLM patch set is only part of the overall
scaffolding code to allow NFS V3 server failover before NFS V4 is
widely adopted and stabilized, I'm wondering whether we should push
ourselves too far for something that will be replaced soon. Lon and I
have been discussing the possibility of proposing new design changes to
the existing state monitoring protocol itself - but I'm leaning toward
eventually *not* doing client SM_NOTIFY at all (by passing the lock states
directly from the fail-over server to the take-over server if at all
possible). This would consolidate a few upcoming work items, such as the
NFSD V3 request reply cache entries (or at least the non-idempotent
operation entries) and the NFS V4 states that need to be moved between
the failover servers.
In general, NFS cluster failover has been error-prone and has timing
constraints (e.g. failover must finish within a sensible time interval).
Would it make more sense to have a workable solution with restricted
application first? We can always merge the various pieces together later as
we learn more from our users. For this reason, simple and plain
patches like this set would work best for now.
In any case, the following collects the review comments so far:
o 1-1 [from hch]
"Dropping locks should also support uuid or dev_t based exports."
A valid request. The easiest solution might be simply to take Neil's idea
of using the export path name. So this issue is combined into 1-3 (see below
for details).
o 1-2 [from hch]
"It would be nice to have a more general push api for changes to
filesystem state, that works on a similar basis as getting information
from /etc/exports."
Could hch (or anyone) elaborate on this? Should I interpret it as
implementing a configuration file that describes the failover options,
in a format similar to /etc/exports (including filesystem
identifiers, the length of the grace period, etc.), plus a command (maybe
two - one on the failover server and one on the take-over server) to kick
off the failover based on the pre-defined configuration file?
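To make the question concrete, one shape such a file could take is sketched below. Everything in it is hypothetical: the line format, the option names (fsid, ip, grace), the 90-second default, and the parse_failover_line helper are illustrative assumptions, not part of the patch set and not existing /etc/exports syntax.

```python
from dataclasses import dataclass


@dataclass
class FailoverEntry:
    path: str   # export path, written the way /etc/exports would write it
    fsid: int   # filesystem identifier (hypothetical option name)
    ip: str     # floating server address (hypothetical option name)
    grace: int  # grace period in seconds (hypothetical option name)


def parse_failover_line(line: str) -> FailoverEntry:
    """Parse one line of the assumed form: <path> key=value[,key=value...]"""
    path, _, opts = line.strip().partition(" ")
    options = dict(item.split("=", 1) for item in opts.strip().split(","))
    return FailoverEntry(
        path=path,
        fsid=int(options["fsid"]),
        ip=options["ip"],
        grace=int(options.get("grace", 90)),  # default value is an assumption
    )


entry = parse_failover_line("/mnt/shared/exports fsid=1234,ip=10.10.1.1,grace=60")
print(entry)
```

A failover command on either server could then read the same file, which is the property that makes the /etc/exports analogy attractive.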
o 1-3 [from neilb]
"It would seem to make more sense to use the filesystem name (i.e. a
path) by writing a directory name to /proc/fs/nfsd/nlm_unlock and maybe
also to /proc/fs/nlm_restart_grace_for_fs, and have 'my_name' in the
SM_MON request be the path name of the export point rather than the network
address."
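From user space, the suggestion amounts to writing a directory name into a control file. A minimal sketch follows, assuming the file accepts a single newline-terminated path; that input format, the drop_locks helper, and the example export path are assumptions, and the control file itself exists only on a kernel carrying such a patch. The demonstration writes to a scratch file so that it runs anywhere.

```python
import os
import tempfile

# Control file named in the review comment; present only on a patched kernel.
NLM_UNLOCK = "/proc/fs/nfsd/nlm_unlock"


def drop_locks(export_path: str, control_file: str = NLM_UNLOCK) -> None:
    """Write export_path to the control file (newline-terminated by assumption)."""
    with open(control_file, "w") as f:
        f.write(export_path + "\n")


# Exercise the helper against a scratch file instead of /proc.
with tempfile.TemporaryDirectory() as tmp:
    scratch = os.path.join(tmp, "nlm_unlock")
    drop_locks("/mnt/shared/exports", control_file=scratch)
    with open(scratch) as f:
        written = f.read()

print(repr(written))
```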
It was my mistake to mention in the previous post that we could use "fsid"
in the "my_name" field. As Lon pointed out, SM_MON requires the server
address so that we do not blindly notify clients, which could result in
unspecified behavior. On the other hand, the "path name" idea does
solve various problems if we want to support different types of existing
filesystem identifiers for failover purposes. Combined with the configuration
file mentioned in 1-2, this could be a nice long-term solution. A few
concerns (about using the path name alone):
* String comparison can be error-prone and slow.
* It loses the abstraction provided by the "fsid" approach, particularly
for cluster filesystem load balancing. With the "fsid" approach,
we could simply export the same directory using two different fsids
(associated with two different IP addresses) for various purposes on the
same node.
* We will have to repeatedly educate users that "dev_t" is not unique
across reboots or nodes; that a uuid is restricted to one single disk
partition; and that both require extra steps to obtain values from
somewhere else that are not easily read by human eyes. My support
experience has taught me that by the time users really understand the
difference, they'll switch to fsid anyway.
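The string-comparison concern can be illustrated in a few lines. The export path below is made up, and posixpath.normpath is only a lexical cleanup (it does not resolve symlinks), so this is a sketch of the pitfall rather than a recommended matching scheme.

```python
import posixpath

# Three spellings of the same (hypothetical) export directory.
a = "/mnt/shared/exports"
b = "/mnt/shared/exports/"
c = "/mnt/shared/../shared/exports"

# Raw string comparison treats them as three different filesystems.
raw_equal = (a == b, a == c)

# Lexical normalization makes them match, at the cost of extra work per
# comparison, and it still cannot see through symlinks.
norm_equal = (
    posixpath.normpath(a) == posixpath.normpath(b),
    posixpath.normpath(a) == posixpath.normpath(c),
)

print(raw_equal, norm_equal)
```

An integer fsid has only one spelling, so the equivalent check is a single integer comparison.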
o 1-4 [from bfields]
"Unrelated bug fixes should be broken out from the feature patches."
Will do.
o 2-1 [from cluster coherent NFS conf. call]
"Hooks to allow a cluster filesystem to do its own "start" and "stop" of
the grace period."
This could be solved by using a configuration file as described in 1-2.
o 3-1 [from okir]
"There's not enough room in the SM_MON request to accommodate additional
network addresses (e.g. IPv6)".
SM_MON is sent and received *within* the very same server. Does it really
matter whether we follow the protocol standard in this particular RPC
call? My guess is not. The current patch writes the server IP into the
"my_name" field as a variable-length character array. I see no reason this
can't be a larger character array (say 128 bytes for IPv6) to accommodate
all the existing network addressing we know of.
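As a rough check on the size argument: the longest textual IPv6 address is 45 characters (INET6_ADDRSTRLEN is 46 including the terminating NUL), so a 128-byte field leaves ample room. The sketch below packs an address string into a NUL-padded fixed-size field; MY_NAME_LEN and pack_my_name are illustrative names, not identifiers from the patch or from the SM_MON protocol.

```python
import ipaddress
import struct

MY_NAME_LEN = 128  # size suggested in the text; not a protocol constant


def pack_my_name(addr: str) -> bytes:
    """Validate addr and return it as a NUL-padded, fixed-size field."""
    text = str(ipaddress.ip_address(addr)).encode("ascii")
    if len(text) >= MY_NAME_LEN:  # keep room for a terminating NUL
        raise ValueError("address does not fit in my_name")
    return struct.pack(f"{MY_NAME_LEN}s", text)


v4 = pack_my_name("10.10.1.1")
v6 = pack_my_name("2001:db8::1")
print(len(v4), len(v6))
```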
o 3-2 [from okir]
"Should we think about replacing SM_MON with some new design altogether
(think netlink)?"
Yes. But before we spend the effort, I would suggest we focus on:
1. Offering a tentative, workable NFS V3 solution for our users first.
2. Checking the requirements of the NFS V4 implementation so we don't end
up revising the new changes again when V4 failover arrives.
In short, my vote is to take this (NLM) patch set and let people try it
out while we switch gears to look into other NFS V3 failover issues
(nfsd in particular). Neil?
-- Wendy