* [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-05 21:50 UTC
To: nfs, cluster-devel; +Cc: Lon Hohberger
Revised patches, based on the 2.6.21-rc4 kernel and nfs-utils-1.1.0-rc1,
that address the issues discussed in:
https://www.redhat.com/archives/cluster-devel/2006-September/msg00034.html
Quick How-to:
1) The failover server exports the filesystem with the "fsid" option:
/etc/exports entry> /mnt/shared/exports *(fsid=1234,sync,rw)
2) The failover server starts rpc.statd with the "-H" option.
3) The failover server drops the locks associated with the fsid:
shell> echo 1234 > /proc/fs/nfsd/nlm_unlock
4) The takeover server enters a per-fsid grace period:
shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace
5) The takeover server notifies clients to reclaim their locks:
shell> /usr/sbin/sm-notify -f -v floating_ip_address -P an_sm_directory
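Taken together, steps 3-5 amount to a short scripted sequence. The sketch
below is illustrative only: the fsid, the floating address, and the statd
state directory are example values, and the actual migration of the IP
address and filesystem between the servers is left to the cluster manager.
shell> # on the failover (outgoing) server: drop the locks held on fsid 1234
shell> echo 1234 > /proc/fs/nfsd/nlm_unlock
shell> # ... migrate the filesystem and the floating IP to the takeover server ...
shell> # on the takeover (incoming) server: start a per-fsid grace period,
shell> # then tell the clients recorded by rpc.statd to reclaim their locks
shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace
shell> /usr/sbin/sm-notify -f -v 10.0.0.50 -P /var/lib/nfs/sm-ha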
Patch Summary:
4-1: implement /proc/fs/nfsd/nlm_unlock
4-2: implement /proc/fs/nfsd/nlm_set_igrace
4-3: correctly record the incoming server IP interface and pass it to rpc.statd.
4-4: nfs-utils statd changes
4-1 includes a fix for an existing lockd bug, as discussed in:
http://sourceforge.net/mailarchive/forum.php?thread_name=4603506D.5040807%40redhat.com&forum_name=nfs
(subject: [NFS] Question about f_count in struct nlm_file)
4-4 includes a fix for an existing nfs-utils statd bug, as discussed in:
http://sourceforge.net/mailarchive/message.php?msg_name=46142B4F.1030507%40redhat.com
(subject: Re: [NFS] lockd and statd)
Misc:
o No IPv6 support yet (to limit the testing effort)
o NFS V3 only - will compare notes with the CITI folks on NFS V4 issues
o Some error-injection tests are still needed.
-- Wendy
* Re: [PATCH 0/4 Revised] NLM - lock failover
From: J. Bruce Fields @ 2007-04-11 17:01 UTC
To: Wendy Cheng; +Cc: cluster-devel, Lon Hohberger, nfs

On Thu, Apr 05, 2007 at 05:50:55PM -0400, Wendy Cheng wrote:
> [snip]
> 4-1 includes a fix for an existing lockd bug, as discussed in:
> http://sourceforge.net/mailarchive/forum.php?thread_name=4603506D.5040807%40redhat.com&forum_name=nfs
> (subject: [NFS] Question about f_count in struct nlm_file)

That's the one separate chunk in nlm_traverse_files()? Could you keep
that split out as a separate patch? I see that it got some discussion
before but I'm not clear what the resolution was....

--b.
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-17 19:30 UTC
To: Neil Brown; +Cc: cluster-devel, nfs

A few of the new thoughts from the latest round of review are really good
and worth doing. However, since this particular NLM patch set is only part
of the scaffolding needed to allow NFS V3 server failover before NFS V4 is
widely adopted and stabilized, I wonder whether we should let ourselves be
dragged too far for something that will be replaced soon.

Lon and I have been discussing the possibility of proposing design changes
to the existing state-monitoring protocol itself - but I'm leaning toward
eventually *not* doing client SM_NOTIFY at all (by passing the lock state
directly from the failover server to the takeover server, if at all
possible). That would consolidate a few of the next work items, such as
moving NFSD V3 request reply cache entries (or at least the entries for
non-idempotent operations) or NFS V4 state between the failover servers.

In general, NFS cluster failover has been error prone and has timing
constraints (e.g. failover must finish within a sensible time interval).
Would it make more sense to have a workable solution with restricted
applicability first? We can always merge the various pieces together later
as we learn more from our users. For that reason, simple and plain patches
like this set would work best for now.

In any case, the following collects the review comments so far:

o 1-1 [from hch] "Dropping locks should also support uuid or dev_t based
  exports."

  A valid request. The easiest solution might be simply to take Neil's
  idea of using the export path name, so this issue is folded into 1-3
  (see below for details).

o 1-2 [from hch] "It would be nice to have a more general push api for
  changes to filesystem state, that works on a similar basis as getting
  information from /etc/exports."

  Could hch (or anyone) elaborate on this? Should I interpret it as a
  configuration file describing the failover options, with a format
  similar to /etc/exports (including filesystem identifiers, the length
  of the grace period, etc.), plus a command (maybe two - one on the
  failover server and one on the takeover server) to kick off the failover
  based on that configuration file? (A hypothetical sketch of such a file
  follows this message.)

o 1-3 [from neilb] "It would seem to make more sense to use the filesystem
  name (i.e. a path), by writing a directory name to
  /proc/fs/nfsd/nlm_unlock and maybe also to
  /proc/fs/nlm_restart_grace_for_fs, and have 'my_name' in the SM_MON
  request be the path name of the export point rather than the network
  address."

  It was my mistake to suggest in a previous post that we could put the
  "fsid" in the "my_name" field. As Lon pointed out, SM_MON requires a
  server address so that we do not blindly notify clients, which could
  result in unspecified behavior. On the other hand, the "path name" idea
  does solve various problems if we want to support the different kinds of
  existing filesystem identifiers for failover purposes. Combined with the
  configuration file mentioned in 1-2, this could be a nice long-term
  solution.

  A few concerns about using the path name alone:
  * String comparison can be error-prone and slow.
  * It loses the abstraction provided by the "fsid" approach, particularly
    for cluster-filesystem load balancing. With the "fsid" approach, we
    can export the same directory under two different fsids (associated
    with two different IP addresses) for different purposes on the same
    node.
  * We would have to repeatedly educate users that "dev_t" is not stable
    across reboots or nodes, that a uuid is restricted to a single disk
    partition, and that both require extra steps to obtain values that are
    not easily read by human eyes. My support experience has taught me
    that by the time users really understand the difference, they switch
    to fsid anyway.

o 1-4 [from bfields] "Unrelated bug fixes should be broken out from the
  feature patches."

  Will do.

o 2-1 [from the cluster-coherent NFS conference call] "Hooks to allow a
  cluster filesystem to do its own 'start' and 'stop' of the grace
  period."

  This could be solved by the configuration file described in 1-2.

o 3-1 [from okir] "There's not enough room in the SM_MON request to
  accommodate additional network addresses (e.g. IPv6)."

  SM_MON is sent and received *within* the very same server. Does it
  really matter whether we follow the protocol standard in this particular
  RPC call? My guess is no. The current patch writes the server IP into
  the "my_name" field as a variable-length character array. I see no
  reason this can't be a larger character array (say 128 bytes for IPv6)
  to accommodate all the existing network addressing we know of.

o 3-2 [from okir] "Should we think about replacing SM_MON with some new
  design altogether (think netlink)?"

  Yes. But before we spend the effort, I would suggest we focus on:
  1. Offering a tentative, workable NFS V3 solution for our users first.
  2. Checking the requirements of the NFS V4 implementation, so we don't
     end up revising these changes again when V4 failover arrives.

In short, my vote is to take this (NLM) patch set and let people try it
out while we switch gears to look into other NFS V3 failover issues (nfsd
in particular). Neil?

-- Wendy
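As a purely hypothetical illustration of the configuration file floated in
1-2 (the file name, fields, and syntax are all invented here - nothing
parses such a file today):

/etc/nfs-failover.conf entry> # one failover unit per line:
/etc/nfs-failover.conf entry> # <export-path>        <identifier>  <grace-seconds>
/etc/nfs-failover.conf entry> /mnt/shared/exports    fsid=1234     50
/etc/nfs-failover.conf entry> /mnt/backup/exports    fsid=5678     90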
* Re: [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-18 18:56 UTC
To: nfs, cluster-devel

Arjan, I need an objective opinion and am wondering whether you could give
me some advice ...

I'm quite upset with this Christoph guy and really want to "talk back".
Would my response be so strong that it ends up harming my later patches?

-- Wendy

Christoph Hellwig wrote:
> On Tue, Apr 17, 2007 at 10:11:13PM -0400, Wendy Cheng wrote:
> > However, since this particular NLM patch set is only part of the
> > overall scaffolding code to allow NFS V3 server failover before NFS V4
> > is widely adopted and stabilized, I'm wondering whether we should drag
> > ourselves too far for something that will be replaced soon.
>
> I don't think that's a valid argument. "We hack this up because it's
> going to be obsolete mid-term" never was a really good argument. And in
> this case it's a particularly bad one. People won't rush to NFSv4 just
> because someone declares it stable now. And if they did, we couldn't
> simply rip out existing functionality.

"Hack" and "bad" are very subjective words in this context. Compared to
much of the other code currently living in the Linux kernel tree, this
patch set, which has gone through three rounds of extensive review and
discussion, deserves at least an "average" standing in terms of solution,
quality, and testing effort. On the other hand, I certainly welcome
further constructive suggestions and ideas.

-- Wendy
* Re: [Cluster-devel] Re: [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-18 19:46 UTC
To: Wendy Cheng; +Cc: cluster-devel, nfs

oops :( ...
* Re: [PATCH 0/4 Revised] NLM - lock failover
From: Christoph Hellwig @ 2007-04-19 14:41 UTC
To: Wendy Cheng; +Cc: cluster-devel, nfs

On Wed, Apr 18, 2007 at 02:56:18PM -0400, Wendy Cheng wrote:
> Arjan, I need an objective opinion and am wondering whether you could
> give me some advice ...
>
> I'm quite upset with this Christoph guy and really want to "talk back".
> Would my response be so strong that it ends up harming my later patches?

I don't mind strong answers, but I'd generally prefer clueful strong
answers :)
* Re: [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-19 15:08 UTC
To: Christoph Hellwig; +Cc: cluster-devel, nfs

Christoph Hellwig wrote:
> [snip]
>
> I don't mind strong answers, but I'd generally prefer clueful strong
> answers :)

Well, I have been seriously considering your previous advice about getting
another job. Maybe working on a new email system that would let people
cancel or recall mistakenly sent emails would be a very good new
project? :)

Glad to know you don't feel offended.

-- Wendy
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Neil Brown @ 2007-04-19 7:04 UTC
To: Wendy Cheng; +Cc: cluster-devel, nfs

On Tuesday April 17, wcheng@redhat.com wrote:
>
> In short, my vote is to take this (NLM) patch set and let people try it
> out while we switch gears to look into other NFS V3 failover issues
> (nfsd in particular). Neil ?

I agree with Christoph in that we should do it properly. That doesn't mean
that we need a complete solution. But we do want to make sure to avoid any
design decisions that we might not want to be stuck with. Sometimes that's
unavoidable, but let's try a little harder for the moment.

One thing that has been bothering me is that sometimes the "filesystem"
(in the guise of an fsid) is used to talk to the kernel about failover
issues (when flushing locks or restarting the grace period), and sometimes
the local network address is used (when talking with statd).

I would rather use a single identifier. In my previous email I was leaning
towards using the filesystem as the single identifier. Today I'm leaning
the other way - towards using the local network address.

It works like this:

  We have a module parameter for lockd, something like "virtual_server".
  If that is set to 0, none of the following changes are effective.
  If it is set to 1:

  The destination address of any lockd request becomes part of the
  key used to find the nsm_handle.
  The my_name field in SM_MON and SM_UNMON requests is set to a
  textual representation of that destination address.
  The reply to SM_MON (currently completely ignored by all versions
  of Linux) gets an extra value which indicates how many more seconds
  of grace period there are to go. This could be stuffed into res_stat,
  maybe.
  Places where we currently check 'nlmsvc_grace_period' get moved to
  *after* the nlmsvc_retrieve_args call, and the grace_period value
  is extracted from host->nsm.

This is the full extent of the kernel changes.

  To remove old locks, we arrange for the callbacks registered with
  statd for the relevant clients to be called.
  To set the grace period, we make sure statd knows about it, and it
  will return the relevant information to lockd.
  To notify clients of the need to reclaim locks, we simply use the
  information stored by statd, which contains the local network
  address.

The only aspect of this that gives me any cause for concern is overloading
the return value of SM_MON. Possibly it might be cleaner to define an
SM_MON2 with different args, or whatever. As this interface is entirely
local to the one machine, and as it can quite easily be kept
back-compatible, I think the concept is fine.

Statd would need to pass the my_name field to the ha-callout rather than
replacing it with "127.0.0.1", but other than that I don't think any
changes are needed to statd (though I haven't thought that through fully
yet).

Comments?

NeilBrown
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-19 14:53 UTC
To: Neil Brown; +Cc: cluster-devel, nfs

Neil Brown wrote:
> On Tuesday April 17, wcheng@redhat.com wrote:
>
>> In short, my vote is to take this (NLM) patch set and let people try
>> it out while we switch gears to look into other NFS V3 failover issues
>> (nfsd in particular). Neil ?
>
> I agree with Christoph in that we should do it properly. That doesn't
> mean that we need a complete solution. But we do want to make sure to
> avoid any design decisions that we might not want to be stuck with.
> Sometimes that's unavoidable, but let's try a little harder for the
> moment.

As with any code review, if you set personal feelings aside, at the end of
the day you start to appreciate some of the harsh-looking comments. This
is definitely one of those moments. I agree we should try harder.

NFS failover has been a difficult subject. There is a three-year-old Red
Hat bugzilla asking for this feature, plus a few others marked as
duplicates. Reading through the comments last night, I feel strongly that
we should put restrictions on the implementation, to avoid dragging users
through another three years.

> [snip - the virtual_server design proposal, quoted in full above]

Need some time to look into the ramifications ... comments will follow
soon.

-- Wendy
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-24 3:30 UTC
To: Neil Brown; +Cc: cluster-devel, nfs

Neil Brown wrote:

> One thing that has been bothering me is that sometimes the "filesystem"
> (in the guise of an fsid) is used to talk to the kernel about failover
> issues (when flushing locks or restarting the grace period), and
> sometimes the local network address is used (when talking with statd).

This is a perception issue - it depends on how the design is described.
More on this later.

> I would rather use a single identifier. In my previous email I was
> leaning towards using the filesystem as the single identifier. Today
> I'm leaning the other way - towards using the local network address.

I guess you're juggling too many things and have forgotten why we came
down this route? We started the discussion using the network interface (to
drop the locks) but found that it wouldn't work well on local filesystems
such as ext3. There is really no control over which local (server-side)
interface NFS clients will use (though it shouldn't be hard to implement
one). When the failover server starts to remove the locks, it needs a way
to find *all* of the locks associated with the will-be-moved partition;
this is what allows the umount to succeed. The server IP address alone
can't guarantee that. That was the reason we switched to fsid. Also
remember this is NFS v2/v3 - clients have no knowledge of server
migration.

Now, back to the first paragraph. An active-active failover can be
described as a five-step process:

Step 1. Quiesce the floating network address.
Step 2. Move the exported filesystem directories from Server A to Server B.
Step 3. Re-enable the network interface.
Step 4. Inform clients about the changes via the NSM (Network Status
        Monitor) protocol.
Step 5. Grace period.

I was told last week that, independent of lockd, some cluster filesystems
have their own implementation of a grace period. It is on the wish list
that this feature be taken into consideration. IMHO, the overall process
should be viewed as a collaboration between the filesystem, the network
interface, and the NFS protocol itself. Mixing filesystem and network
operations is unavoidable. On the other hand, the currently proposed
interface is extensible - say, prefixing a non-numerical string "DEV" or
"UUID" to ask for dropping locks, as in:

shell> echo "DEV12390" > /proc/fs/nfsd/nlm_unlock

or allowing an individual grace period of 10 seconds, as in:

shell> echo "1234@10" > nlm_set_grace_for_fsid

(A consolidated illustration of this proposed syntax follows this
message.)

With the above said, some of the following flow confuses me ... comments
inlined below.

> It works like this:
>
>   We have a module parameter for lockd, something like "virtual_server".
>   If that is set to 0, none of the following changes are effective.
>   If it is set to 1:

OK with me ...

>   The destination address of any lockd request becomes part of the
>   key used to find the nsm_handle.

As explained above, the address alone can't guarantee that the locks
associated with one particular filesystem get cleaned up.

>   The my_name field in SM_MON and SM_UNMON requests is set to a
>   textual representation of that destination address.

That's what the current patch does.

>   The reply to SM_MON (currently completely ignored by all versions
>   of Linux) gets an extra value which indicates how many more seconds
>   of grace period there are to go. This could be stuffed into res_stat,
>   maybe.
>   Places where we currently check 'nlmsvc_grace_period' get moved to
>   *after* the nlmsvc_retrieve_args call, and the grace_period value
>   is extracted from host->nsm.

OK with me, but I don't see the advantage?

> This is the full extent of the kernel changes.
>
>   To remove old locks, we arrange for the callbacks registered with
>   statd for the relevant clients to be called.
>   To set the grace period, we make sure statd knows about it, and it
>   will return the relevant information to lockd.
>   To notify clients of the need to reclaim locks, we simply use the
>   information stored by statd, which contains the local network
>   address.

I'm lost here... help?

> The only aspect of this that gives me any cause for concern is
> overloading the return value of SM_MON. Possibly it might be cleaner
> to define an SM_MON2 with different args, or whatever.
> As this interface is entirely local to the one machine, and as it can
> quite easily be kept back-compatible, I think the concept is fine.

Agree!

> Statd would need to pass the my_name field to the ha-callout rather
> than replacing it with "127.0.0.1", but other than that I don't think
> any changes are needed to statd (though I haven't thought that through
> fully yet).

That's what the current patch does.

> Comments?

I feel we're in a loop again... If there is any way I can shorten this
discussion, please do let me know.

-- Wendy
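To put the proposed extension in one place (this syntax is a proposal
only - none of it is implemented by the posted patches; the dev_t value
and grace time are made up, and the /proc/fs/nfsd/ location of
nlm_set_grace_for_fsid is assumed):

shell> echo 1234 > /proc/fs/nfsd/nlm_unlock                   # fsid-based drop, as posted
shell> echo "DEV12390" > /proc/fs/nfsd/nlm_unlock             # proposed: drop by dev_t
shell> echo "1234@10" > /proc/fs/nfsd/nlm_set_grace_for_fsid  # proposed: fsid 1234, 10s grace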
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Neil Brown @ 2007-04-24 5:52 UTC
To: wcheng; +Cc: cluster-devel, nfs

On Monday April 23, wcheng@redhat.com wrote:
> Neil Brown wrote:
>
> > One thing that has been bothering me is that sometimes the
> > "filesystem" (in the guise of an fsid) is used to talk to the kernel
> > about failover issues (when flushing locks or restarting the grace
> > period), and sometimes the local network address is used (when talking
> > with statd).
>
> This is a perception issue - it depends on how the design is described.

Perception affects understanding. Understanding is vital.

> More on this later.

OK.

> I guess you're juggling too many things and have forgotten why we came
> down this route?

Probably :-)

> We started the discussion using the network interface (to drop the
> locks) but found that it wouldn't work well on local filesystems such
> as ext3. There is really no control over which local (server-side)
> interface NFS clients will use (though it shouldn't be hard to
> implement one). When the failover server starts to remove the locks, it
> needs a way to find *all* of the locks associated with the
> will-be-moved partition; this is what allows the umount to succeed. The
> server IP address alone can't guarantee that. That was the reason we
> switched to fsid. Also remember this is NFS v2/v3 - clients have no
> knowledge of server migration.

Hmmm... I had in mind that you would have some name in the DNS like
"virtual-nas-foo" which maps to a number of IP addresses, and every client
that wants to access /bar, which is known to be served by virtual-nas-foo,
would

   mount virtual-nas-foo:/bar /bar

and some server (A) from the pool of possibilities would configure a bunch
of virtual interfaces with the different IP addresses that the DNS knows
to be associated with 'virtual-nas-foo'. It might also configure a bunch
of other virtual interfaces with the addresses of 'virtual-nas-baz', but
no client would ever try to

   mount virtual-nas-baz:/bar /bar

because, while that might work depending on the server configuration, it
is clearly a configuration error, and as soon as /bar was migrated from A
to B, those clients would mysteriously lose service.

So it seems to me we do know exactly the list of local addresses that
could possibly be associated with locks on a given filesystem. They are
exactly the IP addresses that are publicly acknowledged to be usable for
that filesystem. And if any client tries to access the filesystem using a
different IP address, then they are doing the wrong thing and should be
reformatted.

Maybe the idea of using network addresses was the first suggestion, and
maybe it was rejected for the reasons you give, but those reasons do not
currently seem valid to me. Maybe those who proposed them (and maybe that
was me) couldn't see the big picture at the time.... maybe I still don't
see the big picture?

> >   The reply to SM_MON (currently completely ignored by all versions
> >   of Linux) gets an extra value which indicates how many more seconds
> >   of grace period there are to go. This could be stuffed into
> >   res_stat, maybe.
> >   Places where we currently check 'nlmsvc_grace_period' get moved to
> >   *after* the nlmsvc_retrieve_args call, and the grace_period value
> >   is extracted from host->nsm.
>
> OK with me, but I don't see the advantage?

So we can have a different grace period for each different 'host'.

> > This is the full extent of the kernel changes.
> >
> >   To remove old locks, we arrange for the callbacks registered with
> >   statd for the relevant clients to be called.
> >   To set the grace period, we make sure statd knows about it, and it
> >   will return the relevant information to lockd.
> >   To notify clients of the need to reclaim locks, we simply use the
> >   information stored by statd, which contains the local network
> >   address.
>
> I'm lost here... help?

OK, I'll try not to be so terse.

> >   To remove old locks, we arrange for the callbacks registered with
> >   statd for the relevant clients to be called.

Part of unmounting the filesystem from Server A requires getting Server A
to drop all the locks on the filesystem. We know they can only be held by
clients that sent requests to a given set of IP addresses. Lockd created
an 'nsm' for each client/local-IP pair and registered each of those with
statd. The information registered with statd includes the details of an
RPC call that can be made to lockd to tell it to drop all the locks owned
by that client/local-IP pair.

The statd in 1.1.0 records all this information in the files created in
/var/lib/nfs/sm (and could pass it to the ha-callout if required). So when
it is time to unmount the filesystem, some program can look through all
the files in nfs/sm, read each of the lines, find those which relate to
any of the local IP addresses that we want to move, and initiate the RPC
callback described on that line. This will tell lockd to drop those locks.
When all the RPCs have been sent, lockd will not hold any locks on that
filesystem any more.

> >   To set the grace period, we make sure statd knows about it, and it
> >   will return the relevant information to lockd.

On Server B, we mount the filesystem(s) and export them. When a lock
request arrives from some client, lockd needs to know whether the grace
period is still active, and we want that determination to depend on which
filesystem/local-IP was used. One way to do that is to have the
information passed in by statd when lockd asks for the client to be
monitored. A possible implementation would be to have the ha-callout find
out when the virtual server was migrated and return the number of seconds
remaining by writing it to stdout; statd could run the ha-callout with its
output connected to a pipe, read the number, and include that in the reply
to SM_MON. (A sketch of such a callout follows this message.)

> >   To notify clients of the need to reclaim locks, we simply use the
> >   information stored by statd, which contains the local network
> >   address.

Once the filesystem is exported on Server B, we need to notify all clients
to reclaim their locks. We can find the same lines that were used to tell
lockd to drop locks on the server, and use that information to tell
clients that they need to reclaim (or information recorded elsewhere by
the ha-callout can do the same thing).

Does that make it clearer?

> I feel we're in a loop again... If there is any way I can shorten this
> discussion, please do let me know.

Much as the 'waterfall model' is frowned upon these days, I wonder if it
could serve us here. I feel it has taken me quite a while to gain a full
understanding of what you are trying to achieve. Maybe it would be useful
to have a concise/precise description of what the goal is. I think a lot
of the issues have now become clear, but there remains the issue of what
system-wide configurations are expected, and what configurations we can
rule 'out of scope' and decide we don't have to deal with. Once we have a
clear statement of the goal that we can agree on, it should be a lot
easier to evaluate and reason about different implementation proposals.

NeilBrown
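A minimal sketch of what such an ha-callout might look like, under stated
assumptions: the argument positions, the grace-state file, and the idea of
statd reading the callout's stdout are all invented here for illustration
- today's statd does not read anything back from its -H callout.

   #!/bin/sh
   # Hypothetical ha-callout: print the remaining grace seconds for the
   # local (virtual) address passed via my_name. Argument order and the
   # state file below are assumptions, not an existing interface.
   event="$1"      # e.g. add-client / del-client
   client="$2"     # monitored client name
   my_addr="$3"    # local virtual address (from the SM_MON my_name field)
   state="/var/lib/nfs/ha/grace.$my_addr"   # written at takeover: epoch end time
   if [ -r "$state" ]; then
       end=$(cat "$state")
       now=$(date +%s)
       rem=$((end - now))
       [ "$rem" -gt 0 ] && echo "$rem" || echo 0
   else
       echo 0      # no migration in progress: no extra grace period
   fi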
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-26 4:35 UTC
To: Neil Brown; +Cc: cluster-devel, nfs

Neil Brown wrote:
> On Monday April 23, wcheng@redhat.com wrote:
> [snip]
>
> So it seems to me we do know exactly the list of local addresses that
> could possibly be associated with locks on a given filesystem. They are
> exactly the IP addresses that are publicly acknowledged to be usable
> for that filesystem. And if any client tries to access the filesystem
> using a different IP address, then they are doing the wrong thing and
> should be reformatted.

A convincing argument... unfortunately, this happens to be a case where we
need to protect the server from clients' misbehavior. For a local
filesystem (ext3), if any file reference count is non-zero (i.e. some
clients are still holding locks), the filesystem can't be unmounted. We
would have to fail the failover to avoid data corruption.

> Maybe the idea of using network addresses was the first suggestion, and
> maybe it was rejected for the reasons you give, but those reasons do
> not currently seem valid to me. Maybe those who proposed them (and
> maybe that was me) couldn't see the big picture at the time...

This debate has been (so far) tolerable and helpful - so I'm not going to
comment on this paragraph :) ... But I have to remind people that my first
proposal was adding new flags to the export command (say "exportfs -ud" to
unexport and drop locks, and "exportfs -g" to re-export and start the
grace period). Then we moved to "echo a network address into procfs", and
later switched to the "fsid" approach. A very long journey ...

> > OK with me, but I don't see the advantage?
>
> So we can have a different grace period for each different 'host'.

IMHO, having a grace period for each client (host) is overkill.

> [snip]
>
> The statd in 1.1.0 records all this information in the files created in
> /var/lib/nfs/sm (and could pass it to the ha-callout if required). So
> when it is time to unmount the filesystem, some program can look
> through all the files in nfs/sm, read each of the lines, find those
> which relate to any of the local IP addresses that we want to move, and
> initiate the RPC callback described on that line. This will tell lockd
> to drop those locks. When all the RPCs have been sent, lockd will not
> hold any locks on that filesystem any more.

A bright idea! But it doesn't solve the issue of misbehaving clients who
come in on unwanted (server) interfaces, does it?

> [snip]
> I feel it has taken me quite a while to gain a full understanding of
> what you are trying to achieve. Maybe it would be useful to have a
> concise/precise description of what the goal is. I think a lot of the
> issues have now become clear, but there remains the issue of what
> system-wide configurations are expected, and what configurations we can
> rule 'out of scope' and decide we don't have to deal with.

I'm working on the write-up now. But could the following temporarily serve
the purpose? What is not clear from this thread of discussion?

http://www.redhat.com/archives/linux-cluster/2006-June/msg00050.html

-- Wendy
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Neil Brown @ 2007-04-26 5:43 UTC
To: wcheng; +Cc: cluster-devel, nfs

On Thursday April 26, wcheng@redhat.com wrote:
>
> A convincing argument... unfortunately, this happens to be a case where
> we need to protect the server from clients' misbehavior. For a local
> filesystem (ext3), if any file reference count is non-zero (i.e. some
> clients are still holding locks), the filesystem can't be unmounted. We
> would have to fail the failover to avoid data corruption.

I think this is a tangential problem. "Removing locks held by troublesome
clients so that I can unmount my filesystem" is quite different from
"removing locks held by clients using virtual-NAS-foo so they can be
migrated".

I would have no problem with a new file in the nfsd filesystem such that

   echo /some/path > /proc/fs/nfsd/nlm_drop_locks

would cause lockd to drop all locks on all files with the same 'struct
super' as "/some/path"->i_sb. But I think that is independent
functionality: it might be useful to people who aren't doing active-active
failover, but it happens also to be useful in conjunction with
active-active failover.

We could discuss whether it should be "same superblock" or "same
vfsmount". Both make sense to some extent; the latter is possibly more
flexible.

If you had this interface, you might not need to send the various RPC
calls to lockd to get it to drop locks.... but then, if you had a cluster
filesystem and wanted to move only some clients to a different host, you
would not want to drop *all* the locks on the filesystem, so maybe both
interfaces are still needed.

> IMHO, having a grace period for each client (host) is overkill.

Yes, it gives you much more flexibility than you would ever want or use,
and in that sense it is overkill. But it also makes available the specific
flexibility that you do want (a grace period per local address) with an
extremely simple change to the lockd interface, which I think is a big
win.

> > [snip]
> > I feel it has taken me quite a while to gain a full understanding of
> > what you are trying to achieve. Maybe it would be useful to have a
> > concise/precise description of what the goal is.
>
> I'm working on the write-up now. But could the following temporarily
> serve the purpose? What is not clear from this thread of discussion?
>
> http://www.redhat.com/archives/linux-cluster/2006-June/msg00050.html

Lots of things are not clear - mostly things that have since become clear
in the ongoing discussion:
 - the many-IPs to many-filesystems possibility
 - the need to explicitly handle misconfigured clients
 - the details of the needs with respect to SM_NOTIFY callbacks
 - the "big picture" stuff.

I confess that I had a much shallower understanding of how statd interacts
with lockd when this discussion first started. I'm sure that slowed me
down in understanding the key issues and in suggesting workable
possibilities.

I am sorry that this has taken so long. However, I think we are very close
to a solution that will solve everybody's needs. And you've found some
bugs along the way!!

NeilBrown
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Wendy Cheng @ 2007-04-27 2:24 UTC
To: Neil Brown; +Cc: cluster-devel, nfs

Neil Brown wrote:
> On Thursday April 26, wcheng@redhat.com wrote:
> [snip]
>
> I think this is a tangential problem. "Removing locks held by
> troublesome clients so that I can unmount my filesystem" is quite
> different from "removing locks held by clients using virtual-NAS-foo so
> they can be migrated".

The reason to unmount is that we want to migrate the virtual IP. IMO they
are the same issue, but it is silly to keep fighting about this. In any
case, one interface is better than two, if you allow me to insist on this.

So how about we make the RPC call to lockd to tell it to drop the locks
owned by the client/local-IP pair, as you proposed, *but* add an "OR" with
the fsid to foolproof the process? Say something like this:

   RPC_to_lockd_with(client_host, client_ip, fsid);

   if ((host == client_host && vip == client_ip) ||
       (get_fsid(file) == client_fsid))
           drop_the_locks();

This logic (the RPC to lockd) would be triggered by a new command added to
the nfs-utils package.

If we can agree on this, the rest would be easy. Done?

-- Wendy
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
From: Neil Brown @ 2007-04-27 6:00 UTC
To: wcheng; +Cc: cluster-devel, nfs

On Thursday April 26, wcheng@redhat.com wrote:
> Neil Brown wrote:
> [snip]
>
> The reason to unmount is that we want to migrate the virtual IP.

The reason to unmount is that we want to migrate the filesystem. In your
application that happens at the same time as migrating the virtual IP, but
they are still distinct operations.

> IMO they are the same issue, but it is silly to keep fighting about
> this. In any case, one interface is better than two, if you allow me to
> insist on this.

How many interfaces depends somewhat on how many jobs there are to do. You
want to destroy state that will be rebuilt on a different server, and you
want to force-unmount a filesystem. Two different jobs; two interfaces
seems OK. If they could both be done with one simple interface, that would
be ideal, but I'm not sure they can.

And no one gets to insist on anything. You are writing the code; I am
accepting/rejecting it. We both need to agree or we won't move forward.
(Well... I could just write the code myself, but I don't plan to do that.)

> So how about we make the RPC call to lockd to tell it to drop the locks
> owned by the client/local-IP pair, as you proposed, *but* add an "OR"
> with the fsid to foolproof the process? Say something like this:
>
>    RPC_to_lockd_with(client_host, client_ip, fsid);
>
>    if ((host == client_host && vip == client_ip) ||
>        (get_fsid(file) == client_fsid))
>            drop_the_locks();
>
> This logic (the RPC to lockd) would be triggered by a new command added
> to the nfs-utils package.
>
> If we can agree on this, the rest would be easy. Done?

Sorry, but I cannot agree with this, and I think the rest is still easy.

The more I think about it, the less I like the idea of using an fsid. The
fsid concept was created simply because we needed something that would fit
inside a filehandle, and I think that is the only place it should be used.
Outside of filehandles, we have a perfectly good and well-understood
mechanism for identifying files and filesystems: a "path name".

The functionality "drop all locks held by lockd on a particular
filesystem" is potentially useful outside of any failover configuration,
and should work on any filesystem, not just one that was exported with
'fsid='. So if you need that, then I think it really must be implemented
by something a lot like

   echo -n /path/name > /proc/fs/nfs/nlm_unlock_filesystem

This is something that we could possibly teach "fuser -k" about - so that
it can effectively 'kill' the part of lockd that is accessing a given
filesystem. It is useful for failover, but definitely useful beyond
failover.

Everything else can be done in the RPC interface between lockd and statd,
leveraging the "my_name" field to identify state based on which local
network address was used. All this other functionality is completely
agnostic about the particular filesystem and just looks at the virtual IP
that was used, and it is all that you need unless you have a misbehaving
client. You would do all the lockd/statd/rpc stuff, then try to unmount
the filesystem. If that fails, try "fuser -k -m /whatever" and try the
unmount again. (A sketch of that flow follows this message.)

Another interface alternative might be to hook into umount(MNT_FORCE), but
that would require even broader review, and probably isn't worth it....

NeilBrown
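As a sketch, the outgoing server's unmount path under this proposal might
look like the following. The nlm_unlock_filesystem file is only proposed
above, not implemented, and the mount point is an example value:

   # on the outgoing server, after the lockd/statd RPC callbacks:
   umount /mnt/shared/exports || {
       # drop any stray lockd locks via the proposed interface,
       # kill remaining local users, and retry the unmount
       echo -n /mnt/shared/exports > /proc/fs/nfs/nlm_unlock_filesystem
       fuser -k -m /mnt/shared/exports
       umount /mnt/shared/exports
   }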
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 6:00 ` Neil Brown @ 2007-04-27 11:15 ` Jeff Layton 2007-04-27 12:40 ` Neil Brown 0 siblings, 1 reply; 42+ messages in thread From: Jeff Layton @ 2007-04-27 11:15 UTC (permalink / raw) To: Neil Brown; +Cc: cluster-devel, nfs On Fri, Apr 27, 2007 at 04:00:13PM +1000, Neil Brown wrote: > On Thursday April 26, wcheng@redhat.com wrote: > > Neil Brown wrote: > > > > >On Thursday April 26, wcheng@redhat.com wrote: > > > > > > > > >>A convincing argument... unfortunately, this happens to be a case where > > >>we need to protect server from client's misbehaviors. For a local > > >>filesystem (ext3), if any file reference count is not zero (i.e. some > > >>clients are still holding the locks), the filesystem can't be > > >>un-mounted. We would have to fail the failover to avoid data corruption. > > >> > > >> > > > > > >I think this is a tangential problem. > > >"removing locks held by troublesome clients so that I can unmount my > > >filesystem" is quite different from "remove locks held by client > > >clients using virtual-NAS-foo so they can be migrated". > > > > > > > > The reason to unmount is because we want to migrate the virtual IP. > > The reason to unmount is because we want to migrate the filesystem. In > your application that happens at the same time as migrating the > virtual IP, but they are still distinct operations. > > > IMO > > they are the same issue but it is silly to keep fighting about this. In > > any case, one interface is better than two, if you allow me to insist on > > this. > > How many interfaces depends somewhat on how many jobs to do. > You want to destroy state that will be rebuilt on a different server, > and you want to force-unmount a filesystem. Two different jobs. Two > interfaces seems OK. > If they could both be done with one simple interface that would be > ideal, but I'm not sure they can. > > And no-one gets to insist on anything. > You are writing the code. I am accepting/rejecting it. We both need > to agree or we won't move forward. (Well... I could just write code > myself, but I don't plan to do that). > > > > > So how about we do RPC call to lockd to tell it to drop the locks owned > > by the client/local-IP pair as you proposed, *but* add an "OR" with fsid > > to fool proof the process ? Say something like this: > > > > RPC_to_lockd_with (client_host, client_ip, fsid); > > if ((host == client_host && vip == client_ip) || > > (get_fsid(file) == client_fsid)) > > drop_the_locks(); > > > > This logic (RPC to lockd) will be triggered by a new command added to > > nfs-util package. > > > > If we can agree on this, the rest would be easy. Done ? > > Sorry, but we cannot agree with this, and I think the rest is still > easy. > > The more I think about it, the less I like the idea of using an fsid. > The fsid concept was created simply because we needed something that > would fit inside a filehandle. I think that is the only place it > should be used. > Outside of filehandles, we have a perfectly good and well-understood > mechanism for identifying files and filesystems. It is a "path name". > The functionality "drop all locks held by lockd on a particular > filesystem" is potentially useful outside of any fail-over > configuration, and should work on any filesystem, not just one that > was exported with 'fsid='. 
> > So if you need that, then I think it really must be implemented by > something a lot like > echo -n /path/name > /proc/fs/nfs/nlm_unlock_filesystem > > This is something that we could possible teach "fuser -k" about - so > it can effectively 'kill' that part of lockd that is accessing a given > filesystem. It is useful to failover, but definitely useful beyond > failover. Just a note that I posted a patch ~ a year ago that did precisely that. The interface was a little bit different. I had userspace echoing in a dev_t number, but it wouldn't be too hard to change it to use a pathname instead. Subject was: [PATCH] lockd: add procfs control to cue lockd to release all locks on a device ...if anyone is interested in having me resurrect it. -- Jeff > > > Everything else can be done in the RPC interface between lockd and > statd, leveraging the "my_name" field to identify state based on which > local network address was used. All this other functionality is > completely agnostic about the particular filesystem and just looks at > the virtual IP that was used. > All this other functionality is all that you need unless you have a > misbehaving client. > You would do all the lockd/statd/rpc stuff. Then try to unmount the > filesystem. If that fails, try "fuser -k -m /whatever" and try the > unmount again. > > Another interface alternative might be to hook in to > umount(MNT_FORCE), but that would require even broader review, and > probably isn't worth it.... > > NeilBrown > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > NFS maillist - NFS@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nfs > ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
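A sketch contrasting the two interfaces discussed above; both proc file names are proposals from this thread, not merged code, and the dev_t value is illustrative:

shell> stat -c '%d' /mnt/shared/exports    # the filesystem's device number, as Jeff's earlier patch expected
2049
shell> echo 2049 > /proc/fs/nfsd/nlm_unlock_device                        # dev_t form (hypothetical file name)
shell> echo -n /mnt/shared/exports > /proc/fs/nfsd/nlm_unlock_filesystem  # pathname form, no trailing newline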
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 11:15 ` Jeff Layton @ 2007-04-27 12:40 ` Neil Brown 2007-04-27 13:42 ` Jeff Layton 0 siblings, 1 reply; 42+ messages in thread From: Neil Brown @ 2007-04-27 12:40 UTC (permalink / raw) To: Jeff Layton; +Cc: cluster-devel, nfs On Friday April 27, jlayton@poochiereds.net wrote: > On Fri, Apr 27, 2007 at 04:00:13PM +1000, Neil Brown wrote: > > > > So if you need that, then I think it really must be implemented by > > something a lot like > > echo -n /path/name > /proc/fs/nfs/nlm_unlock_filesystem > > > > This is something that we could possible teach "fuser -k" about - so > > it can effectively 'kill' that part of lockd that is accessing a given > > filesystem. It is useful to failover, but definitely useful beyond > > failover. > > Just a note that I posted a patch ~ a year ago that did precisely that. The > interface was a little bit different. I had userspace echoing in a dev_t > number, but it wouldn't be too hard to change it to use a pathname instead. > > Subject was: > > [PATCH] lockd: add procfs control to cue lockd to release all locks on a device > > ...if anyone is interested in having me resurrect it. > > -- Jeff http://lkml.org/lkml/2006/4/10/240 Looks like no-one ever replied. I probably didn't see it: things on linux-kernel that don't have 'nfs' or 'raid' (or a few related strings) in the subject have at best an even chance of me seeing them. I've just added 'lockd' to the list of important strings :-) I would rather a path name, and would rather it came through the 'nfsd' filesystem, but those are fairly trivial changes. nlm_traverse_files has changed a bit since then, but it should be easier to unlock based on filesystem with the current code (especially if we made the first arg a void*..). NeilBrown ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 12:40 ` Neil Brown @ 2007-04-27 13:42 ` Jeff Layton 2007-04-27 14:17 ` Christoph Hellwig 2007-04-27 15:12 ` Jeff Layton 0 siblings, 2 replies; 42+ messages in thread From: Jeff Layton @ 2007-04-27 13:42 UTC (permalink / raw) To: Neil Brown, Wendy Cheng; +Cc: cluster-devel, nfs On Fri, Apr 27, 2007 at 10:40:16PM +1000, Neil Brown wrote: > On Friday April 27, jlayton@poochiereds.net wrote: > > On Fri, Apr 27, 2007 at 04:00:13PM +1000, Neil Brown wrote: > > > > > > So if you need that, then I think it really must be implemented by > > > something a lot like > > > echo -n /path/name > /proc/fs/nfs/nlm_unlock_filesystem > > > > > > This is something that we could possible teach "fuser -k" about - so > > > it can effectively 'kill' that part of lockd that is accessing a given > > > filesystem. It is useful to failover, but definitely useful beyond > > > failover. > > > > Just a note that I posted a patch ~ a year ago that did precisely that. The > > interface was a little bit different. I had userspace echoing in a dev_t > > number, but it wouldn't be too hard to change it to use a pathname instead. > > > > Subject was: > > > > [PATCH] lockd: add procfs control to cue lockd to release all locks on a device > > > > ...if anyone is interested in having me resurrect it. > > > > -- Jeff > > http://lkml.org/lkml/2006/4/10/240 > > Looks like no-one ever replied. > I probably didn't see it: things on linux-kernel that don't have > 'nfs' or 'raid' (or a few related strings) in the subject have at best > an even chance of me seeing them. I've just added 'lockd' to the list > of important strings :-) > > I would rather a path name, and would rather it came through the > 'nfsd' filesystem, but those are fairly trivial changes. > > nlm_traverse_files has changed a bit since then, but it should be > easier to unlock based on filesystem with the current code > (especially if we made the first arg a void*..). > > NeilBrown > Ok, I'll toss cleaning that patch up and reposting it on to my to-do list... Wendy, it seems like you had some objection to this patch at the time, but the nature of it escapes me. Do you recall what your concern with it was? Thanks, Jeff ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 13:42 ` Jeff Layton @ 2007-04-27 14:17 ` Christoph Hellwig 2007-04-27 15:42 ` J. Bruce Fields 2007-04-27 15:12 ` Jeff Layton 1 sibling, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2007-04-27 14:17 UTC (permalink / raw) To: Jeff Layton; +Cc: Neil Brown, cluster-devel, nfs On Fri, Apr 27, 2007 at 09:42:48AM -0400, Jeff Layton wrote: > Ok, I'll toss cleaning that patch up and reposting it on to my to-do list... > > Wendy, it seems like you had some objection to this patch at the time, but > the nature of it escapes me. Do you recall what your concern with it was? I like the idea of the patch. Instead of writing a dev_t into procfs it should probably be changed to write a path to a new file in the nfsctl filesystem. In fact couldn't this be treated as a reexport with a NFSEXP_ flag meaning drop all locks to avoid creating new interfaces? ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
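From userspace, Christoph's suggestion might look like a re-export carrying a hypothetical action option (called "unlock" here; no such export option exists today) that svc_export_parse() would translate into the new NFSEXP_ flag:

shell> exportfs -o fsid=1234,sync,rw,unlock '*:/mnt/shared/exports'   # re-export; kernel drops the locks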
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 14:17 ` Christoph Hellwig @ 2007-04-27 15:42 ` J. Bruce Fields 2007-04-27 15:36 ` Wendy Cheng 0 siblings, 1 reply; 42+ messages in thread From: J. Bruce Fields @ 2007-04-27 15:42 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Neil Brown, cluster-devel, nfs, Jeff Layton On Fri, Apr 27, 2007 at 03:17:10PM +0100, Christoph Hellwig wrote: > In fact couldn't this be treated as a reexport with a NFSEXP_ flag > meaning drop all locks to avoid creating new interfaces? Off hand, I can't see any reason why that wouldn't work. The code to handle it would probably go in fs/nfsd/export.c:svc_export_parse(). --b. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 15:42 ` J. Bruce Fields @ 2007-04-27 15:36 ` Wendy Cheng 2007-04-27 16:31 ` J. Bruce Fields 2007-04-27 20:34 ` Frank van Maarseveen 0 siblings, 2 replies; 42+ messages in thread From: Wendy Cheng @ 2007-04-27 15:36 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Neil Brown, cluster-devel, nfs, Jeff Layton J. Bruce Fields wrote: > On Fri, Apr 27, 2007 at 03:17:10PM +0100, Christoph Hellwig wrote: > >> In fact couldn't this be treated as a reexport with a NFSEXP_ flag >> meaning drop all locks to avoid creating new interfaces? >> > > Off hand, I can't see any reason why that wouldn't work. The code to > handle it would probably go in fs/nfsd/export.c:svc_export_parse(). > > Sigh :( ... folks, we go back to the loop again. That *was* my first proposal ... -- Wendy ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 15:36 ` Wendy Cheng @ 2007-04-27 16:31 ` J. Bruce Fields 2007-04-27 22:22 ` Neil Brown 0 siblings, 1 reply; 42+ messages in thread From: J. Bruce Fields @ 2007-04-27 16:31 UTC (permalink / raw) To: Wendy Cheng Cc: Christoph Hellwig, Neil Brown, cluster-devel, nfs, Jeff Layton On Fri, Apr 27, 2007 at 11:36:16AM -0400, Wendy Cheng wrote: > J. Bruce Fields wrote: > >On Fri, Apr 27, 2007 at 03:17:10PM +0100, Christoph Hellwig wrote: > > > >>In fact couldn't this be treated as a reexport with a NFSEXP_ flag > >>meaning drop all locks to avoid creating new interfaces? > >> > > > >Off hand, I can't see any reason why that wouldn't work. The code to > >handle it would probably go in fs/nfsd/export.c:svc_export_parse(). > > > > > Sign :( ... folks, we go back to the loop again. That *was* my first > proposal ... So you're talking about this and followups?: http://marc.info/?l=linux-nfs&m=115009204513790&w=2 I just took a look and couldn't find any complaints about that approach. Were they elsewhere? I understand the frustration. There's a balance between, on the one hand, being willing to throw out some hard work and start over if someone comes up with a real objection, and, on the other hand, sticking to a design when you're convinced it's right. I *really* appreciate good review, but I also try to avoid doing something I don't like just because it seems to be the only way to make somebody else happy.... If they've got a real point then I should be able to understand it. If not, then I risk doing all the work to make them happy just to throw it away because I can't defend the approach in the end, or because I find out I misunderstood their original point. (Then again, sometimes I do just have to trust somebody. And sometimes I guess learning who can be trusted about what is part of the process.) In this case I think the complaint about requiring fsid's on everything is legitimate, and the original approach of using the export was sensible. But I haven't been paying as much attention as I should have, and I probably missed something. --b. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 16:31 ` J. Bruce Fields @ 2007-04-27 22:22 ` Neil Brown 2007-04-29 20:13 ` J. Bruce Fields 0 siblings, 1 reply; 42+ messages in thread From: Neil Brown @ 2007-04-27 22:22 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Christoph Hellwig, cluster-devel, nfs, Jeff Layton On Friday April 27, bfields@fieldses.org wrote: > On Fri, Apr 27, 2007 at 11:36:16AM -0400, Wendy Cheng wrote: > > J. Bruce Fields wrote: > > >On Fri, Apr 27, 2007 at 03:17:10PM +0100, Christoph Hellwig wrote: > > > > > >>In fact couldn't this be treated as a reexport with a NFSEXP_ flag > > >>meaning drop all locks to avoid creating new interfaces? > > >> > > > > > >Off hand, I can't see any reason why that wouldn't work. The code to > > >handle it would probably go in fs/nfsd/export.c:svc_export_parse(). > > > > > > > > Sign :( ... folks, we go back to the loop again. That *was* my first > > proposal ... Yes, I grinned when I saw it too. Your first proposal was actually a flag to "unexport", whereas Christoph's seems to be a flag to "export". So there is at least a subtle difference. A flag to unexport cannot work because we don't call unexport - we just flush a kernel cache. A flag to export is just .... weird. All the other export flags are state flags. This would be an action flag. They are quite different things. Setting a state flag again is a no-op. Setting an action flag again has a very real effect. Also, each filesystem is potentially exported multiple times for different sets of clients. If such a flag (whether on 'export' or 'unexport') just said "remove locks from this set of clients" it wouldn't meet the needs, and if it said "remove all locks" it would be a very irregular interface. > > So you're talking about this and followups?: > > http://marc.info/?l=linux-nfs&m=115009204513790&w=2 > > I just took a look and couldn't find any complaints about that > approach. Were they elsewhere? https://www.redhat.com/archives/linux-cluster/2006-June/msg00101.html Is where I said that I don't like the unexport flag. NeilBrown ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 22:22 ` Neil Brown @ 2007-04-29 20:13 ` J. Bruce Fields 2007-04-29 23:10 ` Neil Brown 0 siblings, 1 reply; 42+ messages in thread From: J. Bruce Fields @ 2007-04-29 20:13 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, cluster-devel, nfs, Jeff Layton On Sat, Apr 28, 2007 at 08:22:55AM +1000, Neil Brown wrote: > A flag to unexport cannot work because we don't call unexport - we > just flush a kernel cache. > > A flag to export is just .... weird. All the other export flags are > state flags. This would be an action flag. They are quite different > things. Setting a state flag again is a no-op. Setting an action > flag again has a very real effect. In this case the second set shouldn't have any effect--whatever flag is set should prevent further locks from being accepted, shouldn't it? (If it matters.) > Also, each filesystem is potentially exported multiple times for > different sets of clients. If such a flag (whether on 'export' or > 'unexport') just said "remove locks from this set of clients" it > wouldn't meet the needs, and if it said "remove all locks" it would be > a very irregular interface. The same could be said of the "fsid=" option on exports. It doesn't make sense to provide different filehandle- or path- name spaces depending on the IP address of a client. If my laptop changes IP address, then I can (grudgingly) accept the fact that the server may have to deny me access that I had before--maybe it just can't trust the network I moved to for whatever reason--but I'd really rather it didn't suddenly start giving me different paths, or different filehandles, or different semantics (like sync vs. async). So the export interface is already being used for stuff that's really intended to be per-filesystem rather than per-(filesystem, client) pair. > > So you're talking about this and followups?: > > > > http://marc.info/?l=linux-nfs&m=115009204513790&w=2 > > > > I just took a look and couldn't find any complaints about that > > approach. Were they elsewhere? > > https://www.redhat.com/archives/linux-cluster/2006-June/msg00101.html > > Is where I said that I don't like the unexport flag. Got it, thanks. --b. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-29 20:13 ` J. Bruce Fields @ 2007-04-29 23:10 ` Neil Brown 2007-04-30 5:19 ` Wendy Cheng 2007-05-04 18:42 ` J. Bruce Fields 0 siblings, 2 replies; 42+ messages in thread From: Neil Brown @ 2007-04-29 23:10 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Christoph Hellwig, cluster-devel, nfs, Jeff Layton On Sunday April 29, bfields@fieldses.org wrote: > On Sat, Apr 28, 2007 at 08:22:55AM +1000, Neil Brown wrote: > > A flag to unexport cannot work because we don't call unexport - we > > just flush a kernel cache. > > > > A flag to export is just .... weird. All the other export flags are > > state flags. This would be an action flag. They are quite different > > things. Setting a state flag again is a no-op. Setting an action > > flag again has a very real effect. > > In this case the second set shouldn't have any effect--whatever flag is > set should prevent further locks from being accepted, shouldn't it? (If > it matters.) yes, I guess a "No locks are allowed against this export" makes more sense than "Remove all locks on this export now". Though currently the locks are against the filesystem - the export can disappear from the cache while the locks remain - so it's a long way from perfect. Possibly we could insist that the export remains in the kernel while files are locked .... but we update export flags by replacing the export, so that would be a little awkward. Also, I think I was half-thinking about the "reset the grace period" operation, and that looks a lot like an action.... unless you make it grace_period_ends=seconds-since-epoch. That might work. > > > Also, each filesystem is potentially exported multiple times for > > different sets of clients. If such a flag (whether on 'export' or > > 'unexport') just said "remove locks from this set of clients" it > > wouldn't meet the needs, and if it said "remove all locks" it would be > > a very irregular interface. > > The same could be said of the "fsid=" option on exports. It doesn't > make sense to provide different filehandle- or path- name spaces > depending on the IP address of a client. If my laptop changes IP > address, then I can (grudgingly) accept the fact that the server may > have to deny me access that I had before--maybe it just can't trust the > network I moved to for whatever reason--but I'd really rather it didn't > suddenly start giving me paths, or different filehandles, or different > semantics (like sync vs. async). > > So the export interface is already being used for stuff that's really > intended to be per-filesystem rather than per-(filesystem, client) pair. ro/rw is often different based on client address, but yes: a lot of the flags don't really make sense being different for different clients on the same filesystem. My feeling was that the "nolocks" flag is essentially pointless unless it is the same for all exports on the one filesystem, and that gives it a very different feel. To make use of such a flag you could not rely on the normal mechanism for loading flag information: on-demand loading by mountd. You would need to look through /proc/fs/nfsd/exports, find all the current exports for the filesystem, tell the kernel to change each export to have the "nolocks" flag. And then when you have done all of that, you want to immediately remove all those export entries so you can unmount the filesystem. So while it could be made to work, it doesn't feel clean at all.
A grace_period_ends=seconds-since-epoch flag would not have most of those problems. e.g. it could be demand loaded. But there is the risk that it might be set for some exports on a given filesystem and not for others. And the consequence of that is that some clients might not be able to reclaim their locks (because the lock has already been given to a client which didn't know about the new grace period). Now maybe it would be good to have a bunch of nfsd options that are explicitly per-filesystem rather than per-export. Maybe that is the sort of interface we should be designing. echo "+nolocks /path/to/filesystem" > /proc/fs/nfsd/filesystem_settings echo "grace_end=12345678 /path/to/filesystem" > /proc/.... echo "-write_gather /path" > ..... We would need to be clear on how long those settings remain in the kernel, how it can be told to completely forget a particular filesystem etc.. But we probably don't need to go over-board straight away. I like the interface: echo -n "flag flag .. /path/name" > /proc/fs/nfsd/filesystem_settings where if flags is "?flag", then the value is returned by a subsequent read on the same file-descriptor. At this point we only need "nolocks" and "grace_end". The grace_end information persists until that point in time. The "nolocks" information .... doesn't persist(?). NeilBrown ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
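A usage sketch of the proposed filesystem_settings file, assuming the "?flag" query convention above; the write-then-read-on-the-same-descriptor pattern matches how the existing nfsd control files (e.g. "filehandle") already behave:

shell> exec 3<> /proc/fs/nfsd/filesystem_settings
shell> echo -n '+nolocks /path/to/filesystem' >&3               # set a flag
shell> echo -n 'grace_end=1178000000 /path/to/filesystem' >&3   # seconds since epoch
shell> echo -n '?nolocks /path/to/filesystem' >&3               # ask for a value
shell> read -u 3 value                                          # reply comes back on the same fd
shell> exec 3>&-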
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-29 23:10 ` Neil Brown @ 2007-04-30 5:19 ` Wendy Cheng 0 siblings, 0 replies; 42+ messages in thread From: Wendy Cheng @ 2007-04-30 5:19 UTC (permalink / raw) To: Neil Brown Cc: J. Bruce Fields, Christoph Hellwig, cluster-devel, nfs, Jeff Layton Neil Brown wrote: >But we probably don't need to go over-board straight away. >I like the interface: > echo -n "flag flag .. /path/name" > /proc/fs/nfsd/filesystem_settings > >where if flags is "?flag", then the value is returned by a subsequent >read on the same file-descriptor. > > > Will do a quick prototype to see whether this would work as well as it appears. I haven't given up on the RPC call (into lockd) either since it seems to be a bright idea. -- Wendy ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-29 23:10 ` Neil Brown 2007-04-30 5:19 ` Wendy Cheng @ 2007-05-04 18:42 ` J. Bruce Fields 2007-05-04 21:35 ` Wendy Cheng 1 sibling, 1 reply; 42+ messages in thread From: J. Bruce Fields @ 2007-05-04 18:42 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, cluster-devel, nfs, Jeff Layton On Mon, Apr 30, 2007 at 09:10:38AM +1000, Neil Brown wrote: > where if flags is "?flag", then the value is returned by a subsequent > read on the same file-descriptor. The ?flag thing seems a little awkward. It'd be nice if we could get all the flags for a single filesystem just by cat'ing an appropriate file. --b. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-05-04 18:42 ` J. Bruce Fields @ 2007-05-04 21:35 ` Wendy Cheng 0 siblings, 0 replies; 42+ messages in thread From: Wendy Cheng @ 2007-05-04 21:35 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, Christoph Hellwig, cluster-devel, nfs, Jeff Layton J. Bruce Fields wrote: >On Mon, Apr 30, 2007 at 09:10:38AM +1000, Neil Brown wrote: > > >>where if flags is "?flag", then the value is returned by a subsequent >>read on the same file-descriptor. >> >> > >The ?flag thing seems a little awkward. It'd be nice if we could get >all the flags for a single filesystem just by cat'ing an appropriate >file. > >--b. > > ok, makes sense ... Wendy ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 15:36 ` Wendy Cheng 2007-04-27 16:31 ` J. Bruce Fields @ 2007-04-27 20:34 ` Frank van Maarseveen 2007-04-28 3:55 ` Wendy Cheng 1 sibling, 1 reply; 42+ messages in thread From: Frank van Maarseveen @ 2007-04-27 20:34 UTC (permalink / raw) To: Wendy Cheng Cc: Christoph Hellwig, cluster-devel, Jeff Layton, Neil Brown, J. Bruce Fields, nfs On Fri, Apr 27, 2007 at 11:36:16AM -0400, Wendy Cheng wrote: > J. Bruce Fields wrote: > > On Fri, Apr 27, 2007 at 03:17:10PM +0100, Christoph Hellwig wrote: > > > >> In fact couldn't this be treated as a reexport with a NFSEXP_ flag > >> meaning drop all locks to avoid creating new interfaces? > >> > > > > Off hand, I can't see any reason why that wouldn't work. The code to > > handle it would probably go in fs/nfsd/export.c:svc_export_parse(). > > > > > Sign :( ... folks, we go back to the loop again. That *was* my first > proposal ... I'm quite interested in _any_ patch which would allow me to drop the locks obtained by NFS clients on a specific export, either by (1) "exportfs -uh" or by (2) "echo /some/path > /proc/fs/nfsd/nlm_drop_lock" as Neil mentioned. I want to migrate virtual(*) NFS servers _including_ the locks without having to tear down the whole machine. In my case "migration" is a sort of scheduled failover: no HA or clusters involved. At first, the "exportfs -uh" proposal (maybe fsid driven) seems "the right thing" because after unexporting there's no valid case to preserve the locks AFAICS. Unexport implies EACCES for subsequent NFS accesses anyway and unexporting /cdrom for example is _required_ in order to be able to umount and eject the thing. As it stands today, unexporting is not even enough to be able to unmount it and that's not good. (the need to having to unexport a /cdrom before being able to eject it is actually a problem on its own -- a separate issue). So why not drop the locks always upon unexport? Stupid question of course because exporting anything will not send out any sm notifications so that would break symmetry. I'd prefer (2) "echo /some/path > /proc/fs/nfsd/nlm_drop_lock" because: - Tying the -h (drop locks) option to -u (unexport) is too restrictive IMO. For one thing, there's a bug in the linux NFS client locking code (I reported this earlier) which results in locks not being removed from the server. It was not too difficult to reproduce and programs on the client will wait forever due to this. To handle these kind of situations I need (2) on the server. - (2) may be useful for other NFS server setups: it is inherently more flexible. - (2) does not depend on nfs-utils. It's simpler. (*) virtual in this case means a UUID or IP based fsid= option and an additional IP address on eth0 per export entry, such, that it becomes possible to move an export entry + disk partition + local mount to different hardware without needing to remount it on all <large number> NFS clients. -- Frank ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
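A sketch of the "virtual server" arrangement from Frank's footnote, with made-up device, address and fsid; the point is that the floating IP, the fsid and the disk partition move between machines as one unit, so clients never need to remount:

shell> ip addr add 192.168.10.50/24 dev eth0            # additional per-export IP on eth0
shell> mount /dev/vg0/vol1 /export/vol1                 # the partition that moves with it
shell> exportfs -o fsid=1234,sync,rw '*:/export/vol1'   # fsid stays stable across hardware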
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 20:34 ` Frank van Maarseveen @ 2007-04-28 3:55 ` Wendy Cheng 2007-04-28 4:51 ` Neil Brown 0 siblings, 1 reply; 42+ messages in thread From: Wendy Cheng @ 2007-04-28 3:55 UTC (permalink / raw) To: Frank van Maarseveen Cc: Christoph Hellwig, cluster-devel, Jeff Layton, Neil Brown, J. Bruce Fields, nfs Frank van Maarseveen wrote: >I'm quite interested in _any_ patch which would allow me to drop >the locks obtained by NFS clients on a specific export, either by (1) >"exportfs -uh" or by (2) "echo /some/path > /proc/fs/nfsd/nlm_drop_lock" >as Neil mentioned. > > Thanks for commenting on this. Opinions from users who will eventually use these interfaces are always valued. >[snip] > > >I'd prefer (2) "echo /some/path > /proc/fs/nfsd/nlm_drop_lock" because: > > To convert the first patch of this submitted series from "fsid" to "/some/path" is a no-brainer, since we had gone thru several rounds of similar changes. However, my questions (they are more questions for Neil) are, if I convert the first patch to do this, 1) then why do we still need the RPC drop-lock call in nfs-util ? 2) what should we do for the 2nd patch ? i.e., how do we communicate to the take-over server that it is time for its action, by RPC call or by "echo /some/path > /proc/fs/nfsd/nlm_set_grace_or_whatever" ? In general, I feel if we do this "/some/path" approach, we may as well simply convert the 2nd patch from "fsid" to "/some/path". Then we would finish this long journey. -- Wendy >- Tying the -h (drop locks) option to -u (unexport) is too restrictive IMO. > For one thing, there's a bug in the linux NFS client locking code (I > reported this earlier) which results in locks not being removed from > the server. It was not too difficult to reproduce and programs on the > client will wait forever due to this. To handle these kind of situations > I need (2) on the server. > >- (2) may be useful for other NFS server setups: it is inherently more > flexible. > >- (2) does not depend on nfs-utils. It's simpler. > > >(*) virtual in this case means a UUID or IP based fsid= option and an >additional IP address on eth0 per export entry, such, that it becomes >possible to move an export entry + disk partition + local mount to >different hardware without needing to remount it on all <large number> >NFS clients. > > > ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-28 3:55 ` Wendy Cheng @ 2007-04-28 4:51 ` Neil Brown 2007-04-28 5:26 ` Marc Eshel 2007-04-28 12:33 ` Frank van Maarseveen 0 siblings, 2 replies; 42+ messages in thread From: Neil Brown @ 2007-04-28 4:51 UTC (permalink / raw) To: wcheng Cc: cluster-devel, Frank van Maarseveen, Jeff Layton, Christoph Hellwig, nfs, J. Bruce Fields On Friday April 27, wcheng@redhat.com wrote: > Frank van Maarseveen wrote: > > > >I'd prefer (2) "echo /some/path > /proc/fs/nfsd/nlm_drop_lock" because: > > > > > To convert the first patch of this submitted series from "fsid" to > "/some/path" is a no-brainer, since we had gone thru several rounds of > similar changes. However, my questions (it is more of a Neil's question) > are, if I convert the first patch to do this, > > 1) then why do we still need the RPC drop-lock call in nfs-util ? Maybe we don't. I can imagine a (probably hypothetical) situation where you want to drop some but not all of the locks on a filesystem - if it is a cluster-aware filesystem that several virtual-NAS's export, and you want to move just one virtual-NAS. But if you don't want to be able to do that, you obviously don't have to. > 2) what should we do for the 2nd patch ? i.e., how do we communicate > with the take-over server it is time for its action, by RPC call or by > "echo /some/path > /proc/fs/nfsd/nlm_set_grace_or_whatever" ? I'm happy with using a path name like this to restart the grace period. Where would you store the per-filesystem grace-period-end?? I guess you would need a new little data structure indexed by ... 'struct super_block *' I guess. It would need to hold a reference on the superblock until the grace period expired, wouldn't it? It might seem 'obvious' to store it in 'struct svc_export', but there can be several of these per filesystem, and more could be added after you set the grace period. So it would be messy to get that right. > > In general, I feel if we do this "/some/path" approach, we may as well > simply convert the 2nd patch from "fsid" to "/some/path". Then we would > finish this long journey. Certainly a lot closer. If we are creating "nlm_drop_locks" and "nlm_set_grace" interfaces, we should spend a few moments considering exactly what semantics they should have. In both cases we write a filename. Presumably it must start with a '/' and be null terminated, so you use "echo -n" rather than "echo". After all, a filename can contain a newline. Is there any extra info we might want to pass in or out at the same time? For nlm_drop_locks, we might also want to be able to query locked - "Do you hold any locks on this filesystem". Even "how many?". For set_grace, we might want to ask how many seconds are left in the grace period (I'm not sure how this info would be used, but it is always nice to be able to read any value that you can write). Does it make sense to have a single file with composite semantics? We write XX/path/name where XX can be: a number, to set seconds remaining in grace period a '?' (or empty string) to query state a '-' to remove all locks (and cancels any grace period) We then read back two numbers, the seconds remaining in the grace period, and the number of locked files. Then we need to make sure we choose appropriate names. I think that the string 'lockd' makes more sense than 'nlm', as we are interacting with the daemon, not configuring the protocol. We might not need either, as the file is inside /proc/fs/nfsd; it is obviously related to nfsd.
And if we can use the interface to query, then names like 'set' and 'drop' are probably misplaced. Maybe "grace" and "locks". If no path is given, the requests have system-wide effect. If there is a non-empty path, just that filesystem is queried/modified. These are just possibilities. I'm quite happy with either 1 or 2 files. I just want to be sure a number of options have been considered, and that a reasoned choice has been made. NeilBrown ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
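A usage sketch of the composite single-file interface just described, using "locks" as a stand-in name; the XX/path/name syntax and the two-number reply follow the message above and are proposals only:

shell> exec 3<> /proc/fs/nfsd/locks
shell> echo -n '90/exports/data' >&3       # 90 seconds of grace for this filesystem
shell> echo -n '?/exports/data' >&3        # query state
shell> read -u 3 grace_left locked_files   # seconds left in grace, number of locked files
shell> echo -n '-/exports/data' >&3        # drop all locks, cancel any grace period
shell> exec 3>&-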
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-28 4:51 ` Neil Brown @ 2007-04-28 5:26 ` Marc Eshel 2007-04-28 12:33 ` Frank van Maarseveen 0 siblings, 0 replies; 42+ messages in thread From: Marc Eshel @ 2007-04-28 5:26 UTC (permalink / raw) To: Neil Brown Cc: cluster-devel, Jeff Layton, Christoph Hellwig, J. Bruce Fields, nfs, Frank van Maarseveen, nfs-bounces nfs-bounces@lists.sourceforge.net wrote on 04/27/2007 09:51:17 PM: > On Friday April 27, wcheng@redhat.com wrote: > > Frank van Maarseveen wrote: > > > > > >I'd prefer (2) "echo /some/path > /proc/fs/nfsd/nlm_drop_lock" because: > > > > > > > > To convert the first patch of this submitted series from "fsid" to > > "/some/path" is a no-brainer, since we had gone thru several rounds of > > similar changes. However, my questions (it is more of a Neil's question) > > are, if I convert the first patch to do this, > > > > 1) then why do we still need the RPC drop-lock call in nfs-util ? > > Maybe we don't. > I can imagine a (probably hypothetical) situation where you want to > drop some but not all of the locks on a filesystem - if it is a > cluster-aware filesystem that several virtual-NAS's export, and you > want to move just one virtual-NAS. But if you don't want to be able > to do that, you obviously don't have to. It would be very useful for cluster filesystems, which can export the same filesystem from several servers using multiple IP addresses on each server, to be able to move IP addresses among servers for load balancing. Marc. > > 2) what should we do for the 2nd patch ? i.e., how do we communicate > > with the take-over server it is time for its action, by RPC call or by > > "echo /some/path > /proc/fs/nfsd/nlm_set_grace_or_whatever" ? > > I'm happy with using a path name like this to restart the grace > period. Where would you store the per-filesystem grace-period-end?? > I guess you would need a new little data structure indexed by > ... 'struct super_block *' I guess. It would need to hold a reference > on the superblock until the grace period expired would it? > > It might seem 'obvious' to store it in 'struct svc_export', but there > can be several of these per filesystem, and more could be added after > you set the grace period. So it would be messy to get that right. > > > > > > In general, I feel if we do this "/some/path" approach, we may as well > > simply convert the 2nd patch from "fsid" to "/some/path". Then we would > > finish this long journey. > > Certainly a lot closer. > If we are creating "nlm_drop_locks" and "nlm_set_grace" interfaces, we > should spend a few moments considering exactly what semantics they > should have. > > In both cases we write a filename. Presumably it must start with a > '/' and be null terminated, so you use "echo -n" rather than "echo". > After all, a filename can contain a newline. > > Is there any extra info we might want to pass in or out at the same > time? > > For nlm_drop_locks, we might also want to be able to query locked - > "Do you hold any locks on this filesystem". Even "how many?". > For set_grace, we might want to ask how many seconds are left in the > grace period (I'm not sure how this info would be used, but it is > always nice to be able to read any value that you can write). > > Does it make sense to have a single file with composite semantics? > > We write > XX/path/name > where XX can be: > a number, to set second remaining in grace period > a '?'
(or empty string) to query state > a '-' to remove all locks (and cancels any grace period) > We then read back two numbers, the seconds remaining in the grace > period, and the number of locked files. > > Then we need to make sure we choose appropriate names. I think that > the string 'lockd' make more sense than 'nlm', as we are interacting > with the daemon, not configuring the protocol. We might not either > need either as the file is inside /proc/fs/nfsd, it is obviously > related to nfsd. > And if we can use the interface to query, then names like 'set' and > 'drop' and probably mis-placed. Maybe "grace" and "locks". > If no path is given, the requests have system-wide effect. If there > is a non-empty path, just that filesystem if queried/modified. > > These are just possibilities. I'm quite happy with either 1 or 2 > files. I just want to be sure a number of options have been > considered, and that a reasoned choice as been made. > > NeilBrown > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > NFS maillist - NFS@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nfs ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-28 4:51 ` Neil Brown 2007-04-28 5:26 ` Marc Eshel @ 2007-04-28 12:33 ` Frank van Maarseveen 1 sibling, 0 replies; 42+ messages in thread From: Frank van Maarseveen @ 2007-04-28 12:33 UTC (permalink / raw) To: Neil Brown Cc: cluster-devel, Jeff Layton, Christoph Hellwig, nfs, J. Bruce Fields On Sat, Apr 28, 2007 at 02:51:17PM +1000, Neil Brown wrote: > On Friday April 27, wcheng@redhat.com wrote: [...] > Certainly a lot closer. > If we are creating "nlm_drop_locks" and "nlm_set_grace" interfaces, we > should spend a few moments considering exactly what semantics they > should have. > > In both cases we write a filename. Presumably it must start with a > '/' and be null terminated, so you use "echo -n" rather than "echo". > After all, a filename can contain a newline. I don't care much about the trailing newline. Try mounting and exporting it and mounting it on the client ;-). Truncating the string at the first newline may be a practical thing to do. > > Is there any extra info we might want to pass in or out at the same > time? > > For nlm_drop_locks, we might also want to be able to query locked - > "Do you hold any locks on this filesystem". Even "how many?". The "no locks dropped" case might be useful. #locks dropped is only informational (without client info) and covers the first case too so that would be my choice but I don't have any strong opinion about this. > > Does it make sense to have a single file with composite semantics? Only if that would avoid an otherwise unavoidable race. There are just too many components involved with NFS so to avoid any race I'd probably unplug it temporarily with iptables or "ip addr del..." But I would like to be able to drop locks without entering grace mode: a zero second grace mode when combined. > > We write > XX/path/name > where XX can be: Try mounting and exporting pathnames with spaces.. that's not going to work anytime soon, or even anytime at all (other unixes). So no need to use / as separator. > a number, to set second remaining in grace period > a '?' (or empty string) to query state You mean: write "?/path/name" to tell the kernel what subsequent reads should query? > a '-' to remove all locks (and cancels any grace period) That's a strange combination. But cancelling a grace period is equivalent with setting it to zero seconds so no need for a special case. I'd go for simplicity: one file per function (unless there's an unavoidable race). What about: /proc/fs/nfsd/nlm_grace: Write a number to set the grace period in seconds (0==cancel). May be followed by a space + pathname to indicate the superblock/list of svc_something the grace period applies to (otherwise it's global). Truncate the string at a newline. /proc/fs/nfsd/nlm_unlock: Write either a pathname or "" to drop locks. This has the same syntax as the second field of nlm_grace. Optional: In addition to a pathname support "fsid=" syntax in both cases. If you wanna go wild then support a file= syntax to recover from stale locks on individual files due to buggy clients. -- Frank ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. 
http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
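Frank's two-file alternative, restated as commands; the file names and syntax are his proposal, not an existing interface:

shell> echo '90 /exports/data' > /proc/fs/nfsd/nlm_grace   # grace period for one filesystem
shell> echo '0 /exports/data' > /proc/fs/nfsd/nlm_grace    # 0 == cancel it
shell> echo '/exports/data' > /proc/fs/nfsd/nlm_unlock     # drop locks without entering grace
shell> echo '' > /proc/fs/nfsd/nlm_unlock                  # empty string: drop locks globally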
* Re: [Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover 2007-04-27 13:42 ` Jeff Layton 2007-04-27 14:17 ` Christoph Hellwig @ 2007-04-27 15:12 ` Jeff Layton 1 sibling, 0 replies; 42+ messages in thread From: Jeff Layton @ 2007-04-27 15:12 UTC (permalink / raw) To: Neil Brown, Wendy Cheng, cluster-devel, nfs On Fri, Apr 27, 2007 at 09:42:48AM -0400, Jeff Layton wrote: > Ok, I'll toss cleaning that patch up and reposting it on to my to-do list... > > Wendy, it seems like you had some objection to this patch at the time, but > the nature of it escapes me. Do you recall what your concern with it was? > > Thanks, > Jeff > Actually, on second thought, I'm going to leave this in Wendy's capable hands. The changes to the existing code and the interface changes needed are probably enough to make it so that this patch will be rewritten from scratch anyway... Cheers, Jeff ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4 Revised] NLM - lock failover 2007-04-05 21:50 [PATCH 0/4 Revised] NLM - lock failover Wendy Cheng 2007-04-11 17:01 ` J. Bruce Fields 2007-04-17 19:30 ` [Cluster-devel] " Wendy Cheng @ 2007-04-25 14:18 ` J. Bruce Fields 2007-04-25 14:10 ` Wendy Cheng 2011-11-30 10:13 ` Pavel 3 siblings, 1 reply; 42+ messages in thread From: J. Bruce Fields @ 2007-04-25 14:18 UTC (permalink / raw) To: Wendy Cheng; +Cc: cluster-devel, Lon Hohberger, nfs On Thu, Apr 05, 2007 at 05:50:55PM -0400, Wendy Cheng wrote: > 1) Failover server exports filesystem with "fsid" option as: > /etc/exports entry> /mnt/shared/exports *(fsid=1234,sync,rw) > 2) Failover server dispatch rpc.statd with "-H" option. > 3) Failover server drops locks based on fsid by: > shell> echo 1234 > /proc/fs/nfsd/nlm_unlock > 4) Takeover server enters per fsid grace period by: > shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace > 5) Takeover server notifies clients for lock reclaim by: > shell> /usr/sbin/sm-notify -f -v floating_ip_address -P an_sm_directory I don't understand statd and lockd as well as I should. Where exactly does the takeover server stop serving requests, and the failover server start? If this isn't done carefully, you can leave a window between steps 3 and 4 where a client could acquire a lock before its rightful owner reclaims it, right? --b. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4 Revised] NLM - lock failover 2007-04-25 14:18 ` J. Bruce Fields @ 2007-04-25 14:10 ` Wendy Cheng 2007-04-25 15:21 ` Marc Eshel 2007-04-25 15:59 ` J. Bruce Fields 0 siblings, 2 replies; 42+ messages in thread From: Wendy Cheng @ 2007-04-25 14:10 UTC (permalink / raw) To: J. Bruce Fields; +Cc: cluster-devel, Lon Hohberger, nfs J. Bruce Fields wrote: > On Thu, Apr 05, 2007 at 05:50:55PM -0400, Wendy Cheng wrote: > >> 1) Failover server exports filesystem with "fsid" option as: >> /etc/exports entry> /mnt/shared/exports *(fsid=1234,sync,rw) >> 2) Failover server dispatch rpc.statd with "-H" option. >> 3) Failover server drops locks based on fsid by: >> shell> echo 1234 > /proc/fs/nfsd/nlm_unlock >> 4) Takeover server enters per fsid grace period by: >> shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace >> 5) Takeover server notifies clients for lock reclaim by: >> shell> /usr/sbin/sm-notify -f -v floating_ip_address -P an_sm_directory >> > > I don't understand statd and lockd as well as I should. Where exactly > does the takeover server stop serving requests, and the failover server > start? If this isn't done carefully, you can leave a window between > steps 3 and 4 where a client could acquire a lock before its rightful > owner reclaims it, right? > > The detailed overall steps were described in the first email we sent a *long* time (> 6 months, I think) ago. The first step of the whole process is tearing down the floating IP from the failover server. The IP is not accessible until the filesystem has safely failed over and SM_NOTIFY is ready to be sent. Last round of discussion gave me the impression that as long as I rebased the code into akpm's mm tree, these patches would get accepted. So I have been quite careless in this submission and just realized people have a very short memory :) .. Will do the write-up and put it somewhere so we don't need to go thru this again. -- Wendy ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
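Putting Wendy's ordering together with the quick how-to quoted above, the transition comes out roughly as follows; a sketch only, with device path, address and sm directory made up for illustration:

shell> # -- on the server failing over --
shell> ip addr del 10.10.10.1/24 dev eth0    # tear down the floating IP first
shell> exportfs -u '*:/mnt/shared/exports'   # un-export the affected entries
shell> echo 1234 > /proc/fs/nfsd/nlm_unlock  # drop locks held under this fsid
shell> umount /mnt/shared/exports            # release the filesystem
shell> # -- on the takeover server --
shell> mount /dev/shared_vg/vol1 /mnt/shared/exports
shell> exportfs -o fsid=1234,sync,rw '*:/mnt/shared/exports'
shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace   # per-fsid grace period
shell> ip addr add 10.10.10.1/24 dev eth0         # IP becomes reachable only now
shell> /usr/sbin/sm-notify -f -v 10.10.10.1 -P /var/lib/nfs/sm-ha   # clients reclaim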
* Re: [PATCH 0/4 Revised] NLM - lock failover 2007-04-25 14:10 ` Wendy Cheng @ 2007-04-25 15:21 ` Marc Eshel 2007-04-25 15:19 ` Wendy Cheng 2007-04-25 15:59 ` J. Bruce Fields 1 sibling, 1 reply; 42+ messages in thread From: Marc Eshel @ 2007-04-25 15:21 UTC (permalink / raw) To: Wendy Cheng Cc: J. Bruce Fields, cluster-devel, Lon Hohberger, nfs, nfs-bounces nfs-bounces@lists.sourceforge.net wrote on 04/25/2007 07:10:31 AM: > J. Bruce Fields wrote: > > On Thu, Apr 05, 2007 at 05:50:55PM -0400, Wendy Cheng wrote: > > > >> 1) Failover server exports filesystem with "fsid" option as: > >> /etc/exports entry> /mnt/shared/exports *(fsid=1234,sync,rw) > >> 2) Failover server dispatch rpc.statd with "-H" option. > >> 3) Failover server drops locks based on fsid by: > >> shell> echo 1234 > /proc/fs/nfsd/nlm_unlock > >> 4) Takeover server enters per fsid grace period by: > >> shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace > >> 5) Takeover server notifies clients for lock reclaim by: > >> shell> /usr/sbin/sm-notify -f -v floating_ip_address -P an_sm_directory > >> > > > > I don't understand statd and lockd as well as I should. Where exactly > > does the takeover server stop serving requests, and the failover server > > start? If this isn't done carefully, you can leave a window between > > steps 3 and 4 where a client could acquire a lock before its rightful > > owner reclaims it, right? > > > > > The detailed overall steps were described in the first email we sent > *long* time (> 6 months, I think) ago. The first step of the whole > process is tearing down the floating IP from the failover server. The IP > is not accessible until filesystem is safely fail-over and SM_NOTIFY > ready to be sent. I thought this is a solution for an active active server where a cluster file system can export the same file system from multiple NFS servers. Marc. > > Last round of discussion gave me an impression that as long as I rebased > the code into akpm's mm tree, these patches would get accepted. So I > have been quite careless in this submission and just realized people > have a very short memory :) .. Will do the write-up and put it somewhere > so we don't need to go thru this again. > > -- Wendy > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > NFS maillist - NFS@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nfs ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4 Revised] NLM - lock failover 2007-04-25 15:21 ` Marc Eshel @ 2007-04-25 15:19 ` Wendy Cheng 2007-04-25 15:39 ` [Cluster-devel] " Wendy Cheng 0 siblings, 1 reply; 42+ messages in thread From: Wendy Cheng @ 2007-04-25 15:19 UTC (permalink / raw) To: Marc Eshel Cc: J. Bruce Fields, cluster-devel, Lon Hohberger, nfs, nfs-bounces Marc Eshel wrote: >> The detailed overall steps were described in the first email we sent >> *long* time (> 6 months, I think) ago. The first step of the whole >> process is tearing down the floating IP from the failover server. The IP >> is not accessible until filesystem is safely fail-over and SM_NOTIFY >> ready to be sent. >> > > I thought this is a solution for an active active server where a cluster > file system can export the same file system from multiple NFS servers. > Marc. > > Yes ... but remember we should have two cases here: 1) Local filesystems such as ext3 - neither the IP nor the filesystem is accessible during the transition. 2) Cluster filesystems such as GFS or GPFS - the filesystem may still be accessible (depending on the configuration, say you have advertised two exported IP addresses, each serving different subdirectories for the very same cluster filesystem). The failover IP should be suspended during the transition until SM_NOTIFY is ready to go out (but the other IP should be up and servicing the requests as it should). I assume people understand that the affected export entries should have been un-exported (as part of the over-all process). -- Wendy ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ NFS maillist - NFS@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nfs ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [Cluster-devel] Re: [PATCH 0/4 Revised] NLM - lock failover
  2007-04-25 15:19             ` Wendy Cheng
@ 2007-04-25 15:39               ` Wendy Cheng
  0 siblings, 0 replies; 42+ messages in thread
From: Wendy Cheng @ 2007-04-25 15:39 UTC (permalink / raw)
  To: cluster-devel, nfs

Wendy Cheng wrote:
> Marc Eshel wrote:
> > I thought this was a solution for an active-active setup, where a
> > cluster file system can export the same file system from multiple
> > NFS servers.
>
> Yes ... but remember we have two cases here:
>
> 1) Local filesystems such as ext3 - both the IP and the filesystem are
>    inaccessible during the transition.
> 2) Cluster filesystems such as GFS or GPFS - the filesystem may still
>    be accessible ... The failover IP should be suspended during the
>    transition until the SM_NOTIFY is ready to go out, but the other IP
>    should stay up and service requests as usual.
>
> I assume people understand that the affected export entries should
> have been un-exported as part of the overall process.

To remind people: we described the overall steps in the first round of
kernel interface discussions submitted to the nfs mailing list, before
the code was implemented. These steps included un-exporting the entries,
tearing down the IP, filesystem migration, SM_NOTIFY, the grace period,
etc. I'm in the middle of redoing the write-up; it will be uploaded
somewhere soon.

-- Wendy
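Pulling those steps together, the takeover-node half might look like the
following sketch. The device name, mount point, addresses, and statd
directory are illustrative; only the /proc/fs/nfsd files and the
sm-notify invocation come from this patch set:

    # Takeover (incoming) node -- runs after the failover node has torn
    # down the IP, un-exported the entry, and dropped its locks:
    mount /dev/cluster_vg/shared /mnt/shared/exports       # 1. take over the filesystem
    echo 1234 > /proc/fs/nfsd/nlm_set_igrace               # 2. enter the per-fsid grace period
    exportfs -o fsid=1234,sync,rw '*:/mnt/shared/exports'  # 3. re-export the entry
    ip addr add 10.0.0.100/24 dev eth0                     # 4. bring up the floating IP
    /usr/sbin/sm-notify -f -v 10.0.0.100 -P /var/lib/nfs/sm-ha  # 5. tell clients to reclaim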
* Re: [PATCH 0/4 Revised] NLM - lock failover
  2007-04-25 14:10         ` Wendy Cheng
  2007-04-25 15:21           ` Marc Eshel
@ 2007-04-25 15:59           ` J. Bruce Fields
  2007-04-25 15:52             ` Wendy Cheng
  1 sibling, 1 reply; 42+ messages in thread
From: J. Bruce Fields @ 2007-04-25 15:59 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: cluster-devel, Lon Hohberger, nfs

On Wed, Apr 25, 2007 at 10:10:31AM -0400, Wendy Cheng wrote:
> The detailed overall steps were described in the first email we sent a
> *long* time ago (> 6 months, I think). The first step of the whole
> process is tearing down the floating IP on the failover server. The IP
> is not accessible again until the filesystem has safely failed over and
> the SM_NOTIFY is ready to be sent.

I understand, thanks.

> The last round of discussion gave me the impression that as long as I
> rebased the code onto akpm's -mm tree, these patches would get accepted.
> So I have been quite careless in this submission and just realized
> people have very short memories :) .. Will do the write-up and put it
> somewhere so we don't need to go through this again.

Yeah, apologies for the short memory. I'll try to follow more closely
from now on!

If practical, it would be helpful to have any such documentation in the
final version of the patches, though, either as patch comments or (maybe
better in this case) as comments in the code or in Documentation/. When
someone needs to go back and find out how this was all meant to work,
it'll be easier to find in the source tree or the git history than in
the mail archives.

--b.
* Re: [PATCH 0/4 Revised] NLM - lock failover
  2007-04-25 15:59           ` J. Bruce Fields
@ 2007-04-25 15:52             ` Wendy Cheng
  0 siblings, 0 replies; 42+ messages in thread
From: Wendy Cheng @ 2007-04-25 15:52 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: cluster-devel, Lon Hohberger, nfs

J. Bruce Fields wrote:
> If practical, it would be helpful to have any such documentation in the
> final version of the patches, though, either as patch comments or
> (maybe better in this case) as comments in the code or in
> Documentation/. When someone needs to go back and find out how this was
> all meant to work, it'll be easier to find in the source tree or the
> git history than in the mail archives.

I'm correcting the oversight at this moment .... I had been working in
cluster-only development groups for too long and forgot that the NFS
mailing list is a community of people from many different backgrounds.

-- Wendy
* Re: [PATCH 0/4 Revised] NLM - lock failover
  2007-04-05 21:50 [PATCH 0/4 Revised] NLM - lock failover Wendy Cheng
  ` (2 preceding siblings ...)
  2007-04-25 14:18 ` J. Bruce Fields
@ 2011-11-30 10:13 ` Pavel
  3 siblings, 0 replies; 42+ messages in thread
From: Pavel @ 2011-11-30 10:13 UTC (permalink / raw)
  To: linux-nfs

Wendy Cheng <wcheng <at> redhat.com> writes:
> Quick How-to:
> 1) Failover server exports the filesystem with the "fsid" option as:
>    /etc/exports entry> /mnt/shared/exports *(fsid=1234,sync,rw)
> 2) Failover server dispatches rpc.statd with the "-H" option.
> 3) Failover server drops locks based on fsid by:
>    shell> echo 1234 > /proc/fs/nfsd/nlm_unlock
> 4) Takeover server enters a per-fsid grace period by:
>    shell> echo 1234 > /proc/fs/nfsd/nlm_set_igrace
> 5) Takeover server notifies clients for lock reclaim by:
>    shell> /usr/sbin/sm-notify -f -v floating_ip_address -P an_sm_directory
> [...]

Hi everyone!

I'm building an A/A cluster using NFS v3 and local file systems, and I am
looking for efficient ways to fail over (for now I have to restart
nfs-kernel-server on the takeover node to be able to initiate a grace
period), so the solutions discussed here are very interesting to me.

Now (4 years later), in current nfs-utils packages (v. 1.2.2-4 and later),
I can see that the ability to release locks really was implemented and is
working well (I mean the /proc/fs/nfsd/unlock_ip and
/proc/fs/nfsd/unlock_filesystem interfaces). But what about reacquiring
locks on the node the share migrates to? I've been going through various
mailing lists and found a lot of discussions on the topic (also dated
mainly 2007), but I can't seem to find any rpc-based mechanism or
interface like /proc/fs/nfsd/nlm_set_igrace to do that. Was it ever made?

Thanks!
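The mainline interfaces mentioned above key on the server IP address or
the export path rather than the fsid. A minimal usage sketch, with an
illustrative address and path:

    # Drop all NLM locks that clients hold through a given server IP:
    echo 10.0.0.100 > /proc/fs/nfsd/unlock_ip
    # Or drop all NLM locks held on one exported filesystem:
    echo /mnt/shared/exports > /proc/fs/nfsd/unlock_filesystem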
end of thread, other threads:[~2011-11-30 10:14 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-05 21:50 [PATCH 0/4 Revised] NLM - lock failover Wendy Cheng
2007-04-11 17:01 ` J. Bruce Fields
2007-04-17 19:30   ` [Cluster-devel] " Wendy Cheng
2007-04-18 18:56     ` Wendy Cheng
2007-04-18 19:46       ` [Cluster-devel] " Wendy Cheng
2007-04-19 14:41         ` Christoph Hellwig
2007-04-19 15:08           ` Wendy Cheng
2007-04-19  7:04       ` [Cluster-devel] " Neil Brown
2007-04-19 14:53         ` Wendy Cheng
2007-04-24  3:30           ` Wendy Cheng
2007-04-24  5:52             ` Neil Brown
2007-04-26  4:35               ` Wendy Cheng
2007-04-26  5:43                 ` Neil Brown
2007-04-27  2:24                   ` Wendy Cheng
2007-04-27  6:00                     ` Neil Brown
2007-04-27 11:15                       ` Jeff Layton
2007-04-27 12:40                         ` Neil Brown
2007-04-27 13:42                           ` Jeff Layton
2007-04-27 14:17                             ` Christoph Hellwig
2007-04-27 15:42                               ` J. Bruce Fields
2007-04-27 15:36                             ` Wendy Cheng
2007-04-27 16:31                               ` J. Bruce Fields
2007-04-27 22:22                                 ` Neil Brown
2007-04-29 20:13                                   ` J. Bruce Fields
2007-04-29 23:10                                     ` Neil Brown
2007-04-30  5:19                                       ` Wendy Cheng
2007-05-04 18:42                                         ` J. Bruce Fields
2007-05-04 21:35                                           ` Wendy Cheng
2007-04-27 20:34                             ` Frank van Maarseveen
2007-04-28  3:55                               ` Wendy Cheng
2007-04-28  4:51                                 ` Neil Brown
2007-04-28  5:26                                   ` Marc Eshel
2007-04-28 12:33                                 ` Frank van Maarseveen
2007-04-27 15:12                       ` Jeff Layton
2007-04-25 14:18 ` J. Bruce Fields
2007-04-25 14:10   ` Wendy Cheng
2007-04-25 15:21     ` Marc Eshel
2007-04-25 15:19       ` Wendy Cheng
2007-04-25 15:39         ` [Cluster-devel] " Wendy Cheng
2007-04-25 15:59     ` J. Bruce Fields
2007-04-25 15:52       ` Wendy Cheng
2011-11-30 10:13 ` Pavel