* [RFC] NLM lock failover admin interface
@ 2006-06-12 5:25 Wendy Cheng
2006-06-12 6:11 ` Wendy Cheng
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-12 5:25 UTC
To: nfs; +Cc: linux-cluster
NFS v2/v3 active-active NLM lock failover has been an issue with our
cluster suite. With the current implementation, the cluster suite tries
to carry the workaround as far as it can with user-mode scripts where,
upon failover, on the taken-over server, it:
1. Tear down virtual IP.
2. Unexport the subject NFS export.
3. Signal lockd to drop the locks.
4. Un-mount filesystem if needed.
There are many other issues (such as /var/lib/nfs/statd/sm file, etc)
but this particular post is to further refine step 3 to avoid the 50
second global (default) grace period for all NFS exports; i.e., we would
like to be able to selectively drop locks (only) associated with the
requested exports without disrupting other NFS services.
We've done some prototype coding but would like to seek community
consensus on the admin interface if possible. We've tried out
the following:
1. /proc interface, say writing the fsid into a /proc directory entry
would end up dropping all NLM locks associated with the NFS export that
has fsid in its /etc/exports file.
2. Adding a new flag into "exportfs" command, say "h", such that
"exportfs -uh *:/export_path"
would un-export the entry and drop the NLM locks associated with the
entry.
3. Add a new nfsctl by re-using a 2.4 kernel flag (NFSCTL_FOLOCKS) where
it takes:
    struct nfsctl_folocks {
        int type;
        unsigned int fsid;
        unsigned int devno;
    };
as input argument. Depending on "type", the kernel call would drop the
locks associated with either the fsid, or devno.
The core of the implementation is a new cloned version of
nlm_traverse_files() where it searches the "nlm_files" list one by one
to compare the fsid (or devno) based on nlm_file.f_handle field. A
helper function is also implemented to extract the fsid (or devno) from
f_handle.
The new function is planned to allow failover to abort if the file can't
be closed. We may also put the file locks back if abort occurs.
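A minimal user-space sketch of the traversal described above, not the prototype itself. It assumes the fsid and devno have already been decoded from f_handle, so `struct sim_file`, `sim_match()`, `sim_drop_locks()` and the `FOLOCKS_BY_*` values are hypothetical names used only for illustration. Instead of putting locks back after an abort, this sketch checks every matching file first and refuses to start if one of them cannot be closed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector values for the "type" field. */
enum { FOLOCKS_BY_FSID = 0, FOLOCKS_BY_DEVNO = 1 };

/* Argument layout as proposed in the RFC. */
struct nfsctl_folocks {
	int type;
	unsigned int fsid;
	unsigned int devno;
};

/* Simplified stand-in for an nlm_files entry (assumption: fsid and
 * devno were already extracted from the file handle). */
struct sim_file {
	unsigned int fsid;
	unsigned int devno;
	int nlocks;		/* locks currently held on this file */
	int busy;		/* nonzero if the file cannot be closed */
	struct sim_file *next;
};

/* Does this file belong to the export selected by arg? */
static int sim_match(const struct sim_file *f,
		     const struct nfsctl_folocks *arg)
{
	switch (arg->type) {
	case FOLOCKS_BY_FSID:
		return f->fsid == arg->fsid;
	case FOLOCKS_BY_DEVNO:
		return f->devno == arg->devno;
	default:
		return 0;
	}
}

/* Returns the number of locks dropped, or -1 if the request is
 * invalid or a matching file is busy (nothing is dropped then). */
int sim_drop_locks(struct sim_file *head, const struct nfsctl_folocks *arg)
{
	struct sim_file *f;
	int dropped = 0;

	assert(arg != NULL);
	if (arg->type != FOLOCKS_BY_FSID && arg->type != FOLOCKS_BY_DEVNO)
		return -1;

	/* Pass 1: abort before touching anything if a file is busy. */
	for (f = head; f != NULL; f = f->next)
		if (sim_match(f, arg) && f->busy)
			return -1;

	/* Pass 2: drop the locks of every matching file. */
	for (f = head; f != NULL; f = f->next) {
		if (sim_match(f, arg)) {
			dropped += f->nlocks;
			f->nlocks = 0;
		}
	}
	return dropped;
}
```

The two-pass check is a simplification chosen for the sketch; a kernel implementation would also have to deal with locking of the list and with files that become busy between the passes.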
Would appreciate comments on the above admin interface. As soon as the
external interface can be finalized, the code will be submitted for
review.
-- Wendy
* Re: [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-12 6:11 UTC
To: linux clustering; +Cc: nfs
On Mon, 2006-06-12 at 01:25 -0400, Wendy Cheng wrote:
> NFS v2/v3 active-active NLM lock failover has been an issue with our
> cluster suite. With the current implementation, the cluster suite tries
> to carry the workaround as far as it can with user-mode scripts where,
> upon failover, on the taken-over server, it:
>
> 1. Tear down virtual IP.
> 2. Unexport the subject NFS export.
> 3. Signal lockd to drop the locks.
> 4. Un-mount filesystem if needed.
>
> There are many other issues (such as /var/lib/nfs/statd/sm file, etc)
> but this particular post is to further refine step 3 to avoid the 50
> second global (default) grace period for all NFS exports; i.e., we would
> like to be able to selectively drop locks (only) associated with the
> requested exports without disrupting other NFS services.
>
> We've done some prototype coding but would like to seek community
> consensus on the admin interface if possible.
While ping-ponging emails with our base kernel folks to choose between
/proc, exportfs, or nfsctl (internally within the company - mostly with
steved and staubach), Peter suggested trying multiple lockd(s) to handle
different NFS exports. In that case, we may need to change a big portion
of the lockd kernel code. I prefer not to go that far since lockd
failover is our cluster suite's immediate issue. However, if this
approach gets everyone's vote, we'll comply.
-- Wendy
* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
From: J. Bruce Fields @ 2006-06-12 15:00 UTC
To: linux clustering; +Cc: nfs
On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
> 2. Adding a new flag into "exportfs" command, say "h", such that
>
> "exportfs -uh *:/export_path"
>
> would un-export the entry and drop the NLM locks associated with the
> entry.
What does the kernel interface end up looking like in that case?
--b.
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-12 15:44 UTC
To: J. Bruce Fields; +Cc: nfs, linux clustering
[-- Attachment #1: Type: text/plain, Size: 955 bytes --]
J. Bruce Fields wrote:
> On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
>> 2. Adding a new flag into "exportfs" command, say "h", such that
>>
>> "exportfs -uh *:/export_path"
>>
>> would un-export the entry and drop the NLM locks associated with the
>> entry.
>
> What does the kernel interface end up looking like in that case?
Happy to see this new exportfs command get a positive response - it was
our original pick too.
Uploaded is part of a draft version of the 2.4 base kernel patch - we're
cleaning up the 2.6 patches at this moment. It basically adds a new export
flag (NFSEXP_FOLOCK - note that ex_flags is an int but is currently only
defined up to 16 bits) so nfs-utils and the kernel can communicate.
The nice thing about this approach is the recovery part - the take-over
server can use the counterpart command to export and set the grace period
for one particular interface within the same system call.
-- Wendy
[-- Attachment #2: gfs_nlm.patch --]
[-- Type: text/plain, Size: 785 bytes --]
--- linux-2.4.21-43.EL/fs/nfsd/export.c 2006-05-14 17:16:21.000000000 -0400
+++ linux/fs/nfsd/export.c 2006-05-29 02:13:29.000000000 -0400
@@ -388,6 +388,10 @@ exp_unexport(struct nfsctl_export *nxp)
 			exp_do_unexport(exp);
 			err = 0;
 		}
+		if (nxp->ex_flags & NFSEXP_FOLOCK) {
+			dprintk("exp_unexport: nfsd_lockd_unexport called\n");
+			nfsd_lockd_unexport(clp);
+		}
 	}
 	exp_unlock();
--- linux-2.4.21-43.EL/include/linux/nfsd/export.h 2006-05-14 17:23:57.000000000 -0400
+++ linux/include/linux/nfsd/export.h 2006-05-29 02:12:07.000000000 -0400
@@ -42,7 +42,7 @@
 #define NFSEXP_FSID 0x2000
 #define NFSEXP_NOACL 0x8000 /* turn off acl support */
 #define NFSEXP_ALLFLAGS 0xFFFF
-
+#define NFSEXP_FOLOCK 0x00010000 /* NLM lock failover */
 #ifdef __KERNEL__
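One detail of the patch above can be checked in isolation: the new bit lies outside the existing NFSEXP_ALLFLAGS mask. The sketch below copies the four flag values from the patch; the three helper functions are hypothetical, and whether any kernel or nfs-utils path actually masks ex_flags this way is an assumption, not something the patch shows.

```c
#include <assert.h>

/* Values copied from the patch against include/linux/nfsd/export.h. */
#define NFSEXP_FSID     0x2000
#define NFSEXP_NOACL    0x8000
#define NFSEXP_ALLFLAGS 0xFFFF
#define NFSEXP_FOLOCK   0x00010000	/* NLM lock failover */

/* Hypothetical: a sanitizer that only knows the original 16 bits. */
int sanitize_legacy(int ex_flags)
{
	return ex_flags & NFSEXP_ALLFLAGS;
}

/* Hypothetical: a sanitizer that has been taught about the new bit. */
int sanitize_extended(int ex_flags)
{
	return ex_flags & (NFSEXP_ALLFLAGS | NFSEXP_FOLOCK);
}

/* The test used by the patched exp_unexport(). */
int wants_lock_failover(int ex_flags)
{
	return (ex_flags & NFSEXP_FOLOCK) != 0;
}

/* The new flag and the old mask share no bits. */
void check_flag_layout(void)
{
	assert((NFSEXP_FOLOCK & NFSEXP_ALLFLAGS) == 0);
}
```

If any code between exportfs and exp_unexport() masks the flags with NFSEXP_ALLFLAGS, the failover request would be dropped silently, so such a mask would need to be widened together with the new define.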
* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
From: Madhan P @ 2006-06-12 16:20 UTC
To: J. Bruce Fields, Wendy Cheng; +Cc: linux clustering, nfs
For what it's worth, would second this approach of using a flag to
unexport and associating the cleanup with that. Another quick hack we
used was to store the NSM entries on a standard location on the
respective exported filesystem, so that notification is sent once the
filesystem comes back online on the destination server and is exported
again. BTW, this was not on Linux. It was a simple solution providing
the necessary active/active and active/passive cluster support.
- Madhan
>>> On 6/12/2006 at 9:14:55 pm, in message <448D8BF7.7010105@redhat.com>,
Wendy Cheng <wcheng@redhat.com> wrote:
> J. Bruce Fields wrote:
>
>> On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
>>> 2. Adding a new flag into "exportfs" command, say "h", such that
>>>
>>> "exportfs -uh *:/export_path"
>>>
>>> would un-export the entry and drop the NLM locks associated with the
>>> entry.
>>
>> What does the kernel interface end up looking like in that case?
>
> Happy to see this new exportfs command get a positive response - it was
> our original pick too.
>
> Uploaded is part of a draft version of the 2.4 base kernel patch - we're
> cleaning up the 2.6 patches at this moment. It basically adds a new export
> flag (NFSEXP_FOLOCK - note that ex_flags is an int but is currently only
> defined up to 16 bits) so nfs-utils and the kernel can communicate.
>
> The nice thing about this approach is the recovery part - the take-over
> server can use the counterpart command to export and set the grace period
> for one particular interface within the same system call.
>
> -- Wendy
* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
From: Madhan P @ 2006-06-12 16:58 UTC
To: nfs; +Cc: linux clustering
For what it's worth, would second this approach of using a flag to
unexport and associating the cleanup with that. Another quick hack we
used was to store the NSM entries on a standard location on the
respective exported filesystem, so that notification is sent once the
filesystem comes back online on the destination server and is exported
again. BTW, this was not on Linux. It was a simple solution providing
the necessary active/active and active/passive cluster support.
- Madhan
>>> On 6/12/2006 at 9:14:55 pm, in message <448D8BF7.7010105@redhat.com>,
Wendy Cheng <wcheng@redhat.com> wrote:
> J. Bruce Fields wrote:
>
>> On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
>>> 2. Adding a new flag into "exportfs" command, say "h", such that
>>>
>>> "exportfs -uh *:/export_path"
>>>
>>> would un-export the entry and drop the NLM locks associated with the
>>> entry.
>>
>> What does the kernel interface end up looking like in that case?
>
> Happy to see this new exportfs command get a positive response - it was
> our original pick too.
>
> Uploaded is part of a draft version of the 2.4 base kernel patch - we're
> cleaning up the 2.6 patches at this moment. It basically adds a new export
> flag (NFSEXP_FOLOCK - note that ex_flags is an int but is currently only
> defined up to 16 bits) so nfs-utils and the kernel can communicate.
>
> The nice thing about this approach is the recovery part - the take-over
> server can use the counterpart command to export and set the grace period
> for one particular interface within the same system call.
>
> -- Wendy
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-12 18:09 UTC
To: Madhan P; +Cc: linux clustering, nfs
Madhan P wrote:
> For what it's worth, would second this approach of using a flag to
> unexport and associating the cleanup with that.
Happy to have another vote :) ! It is appreciated.
> Another quick hack we
> used was to store the NSM entries on a standard location on the
> respective exported filesystem, so that notification is sent once the
> filesystem comes back online on the destination server and is exported
> again. BTW, this was not on Linux. It was a simple solution providing
> the necessary active/active and active/passive cluster support.
Lon Hohberge (from our cluster suite team) has been working on a similar
setup too (to structure the MSM file directory). We'll submit the
associated kernel patch when it is ready ("rpc.statd -H" needs some
bandaids). Future reviews and comments are also appreciated.
-- Wendy
* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
From: Steve Dickson @ 2006-06-12 17:23 UTC
To: Wendy Cheng; +Cc: J. Bruce Fields, linux clustering, nfs
Wendy Cheng wrote:
> The nice thing about this approach is the recovery part - the take-over
> server can use the counterpart command to export and set the grace period
> for one particular interface within the same system call.
Actually this is a pretty clean and simple interface... imho..
The only issue I had was adding a flag to an older version and
then having to carry that flag forward... So if this interface is
accepted and added to the mainline nfs-utils (which it should be.. imho)
the fact that it is so clean and simple would make the backporting
fairly trivial...
steved.
* Re: [RFC] NLM lock failover admin interface
From: James Yarbrough @ 2006-06-12 17:27 UTC
To: Wendy Cheng; +Cc: linux-cluster, nfs
> 2. Adding a new flag into "exportfs" command, say "h", such that
>
> "exportfs -uh *:/export_path"
>
> would un-export the entry and drop the NLM locks associated with the
> entry.
This is fine for releasing the locks, but how do you plan to re-enter
the grace period for reclaiming the locks when you relocate the export?
And how do you intend to segregate the export for which reclaims are
valid from the ones which are not? How do you plan to support the
sending of SM_NOTIFY? This might be where a lockd per export has an
advantage.
--
jmy@sgi.com
650 933 3124
Why is there a snake in my Coke?
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-12 19:07 UTC
To: James Yarbrough; +Cc: linux-cluster, nfs
James Yarbrough wrote:
>> 2. Adding a new flag into "exportfs" command, say "h", such that
>>
>> "exportfs -uh *:/export_path"
>>
>> would un-export the entry and drop the NLM locks associated with the
>> entry.
>
> This is fine for releasing the locks, but how do you plan to re-enter
> the grace period for reclaiming the locks when you relocate the export?
> And how do you intend to segregate the export for which reclaims are
> valid from the ones which are not? How do you plan to support the
> sending of SM_NOTIFY? This might be where a lockd per export has an
> advantage.
Yeah, that's why Peter's idea (different lockd(s)) is also attractive.
However, on the practical side, we don't plan to introduce kernel
patches aggressively. The approach is to stay away from the mainline NLM
code base until we have enough QA cycles to make sure things work.
The unexport part would leave other NFS services on the taken-over
server uninterrupted. On the take-over server side, we currently do a
global grace period. The plan has been to put a little delay before
fixing the take-over server's logic due to other NLM/posix lock issues -
for example, the current (linux) NLM doesn't bother to call the
filesystem's lock method (which virtually disables any cluster
filesystem's NFS locking across different NFS servers). However, if we
have enough resources and/or volunteers, we may do these things in
parallel. The following are planned:
Take-over server logic:
1. Set up the statd sm file (currently /var/lib/nfs/statd/sm or the
equivalent configured directory) properly.
2. rpc.statd is dispatched with the "--ha-callout" option.
3. Implement the ha-callout user mode program to create separate
statd sm files for each exported IP.
4. Export the target filesystem and set up the grace period based on
fsid (or devno). It will be used in NLM procedure calls by extracting
the fsid (or devno) from the NFS file handle to decide whether to
accept or reject non-reclaiming requests.
5. Bring up the failover IP address.
6. Send SM_NOTIFY to client machines using the configured sm directory
created by the ha-callout program (rpc.statd -N -P).
Step 4 will be the counterpart of our unexport flag.
-- Wendy
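Step 4 of the plan above can be sketched in user space as a small per-fsid grace table. This is only an illustration under stated assumptions: `grace_start()`, `in_grace()`, `nlm_lock_allowed()` and the fixed-size table are hypothetical names, the caller passes in the fsid already extracted from the file handle, and time is an explicit parameter so the logic can be exercised without a clock. It also assumes that reclaim requests are refused once the grace period for that fsid has ended.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GRACE 16	/* hypothetical table size */

struct grace_entry {
	unsigned int fsid;
	long expires;	/* first second at which grace is over */
	int used;
};

static struct grace_entry grace_table[MAX_GRACE];

/* Start (or restart) a grace period for one export.
 * Returns 0 on success, -1 if the table is full. */
int grace_start(unsigned int fsid, long now, long period)
{
	int i, slot = -1;

	assert(period > 0);
	for (i = 0; i < MAX_GRACE; i++) {
		if (grace_table[i].used && grace_table[i].fsid == fsid) {
			slot = i;	/* reuse the existing entry */
			break;
		}
		if (!grace_table[i].used && slot < 0)
			slot = i;	/* remember the first free slot */
	}
	if (slot < 0)
		return -1;
	grace_table[slot].fsid = fsid;
	grace_table[slot].expires = now + period;
	grace_table[slot].used = 1;
	return 0;
}

/* Is this export currently inside its grace period? */
int in_grace(unsigned int fsid, long now)
{
	int i;

	for (i = 0; i < MAX_GRACE; i++)
		if (grace_table[i].used && grace_table[i].fsid == fsid)
			return now < grace_table[i].expires;
	return 0;
}

/* During grace only reclaims are accepted for that fsid; outside
 * grace only new (non-reclaim) requests are accepted. */
int nlm_lock_allowed(unsigned int fsid, int reclaim, long now)
{
	if (in_grace(fsid, now))
		return reclaim != 0;
	return reclaim == 0;
}
```

With a table like this, an export that has just been taken over can sit in its own grace period while requests for every other fsid on the same server are handled normally.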
* Re: [RFC] NLM lock failover admin interface
From: Neil Brown @ 2006-06-13 3:17 UTC
To: Wendy Cheng; +Cc: linux-cluster, nfs
On Monday June 12, wcheng@redhat.com wrote:
> NFS v2/v3 active-active NLM lock failover has been an issue with our
> cluster suite. With the current implementation, the cluster suite tries
> to carry the workaround as far as it can with user-mode scripts where,
> upon failover, on the taken-over server, it:
>
> 1. Tear down virtual IP.
> 2. Unexport the subject NFS export.
> 3. Signal lockd to drop the locks.
> 4. Un-mount filesystem if needed.
>
...
> we would
> like to be able to selectively drop locks (only) associated with the
> requested exports without disrupting other NFS services.
There seems to be an unstated assumption here that there is one
virtual IP per exported filesystem. Is that true?
Assuming it is and that I understand properly what you want to
do.... I think that maybe the right thing to do is *not* drop the
locks on a particular filesystem, but to drop the locks made to a
particular virtual IP.
Then it would make a lot of sense to have one lockd thread per IP, and
signal the lockd in order to drop the locks.
True: that might be more code. But if it is the right thing to do,
then it should be done that way.
On the other hand, I can see a value in removing all the locks for a
particular filesystem quite independent of failover requirements.
If I want to force-unmount a filesystem, I need to unexport it, and I
need to kill all the locks. Currently you can only remove locks from
all filesystems, which might not be ideal.
I'm not at all keen on the NFSEXP_FOLOCK flag to exp_unexport, as that
is an interface that I would like to discard eventually. The
preferred mechanism for exporting filesystems is to flush the
appropriate 'cache', and allow it to be repopulated with whatever is
still valid via upcalls to mountd.
So:
I think if we really want to "remove all NFS locks on a filesystem",
we could probably tie it into umount - maybe have lockd register some
callback which gets called just before s_op->umount_begin.
If we want to remove all locks that arrived on a particular
interface, then we should arrange to do exactly that. There are a
number of different options here.
One is the multiple-lockd-threads idea.
One is to register a callback when an interface is shut down.
Another (possibly the best) is to arrange a new signal for lockd
which says "Drop any locks which were sent to IP addresses that are
no longer valid local addresses".
So those are my thoughts. Do any of them seem reasonable to you?
NeilBrown
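The last option in the message above ("drop any locks which were sent to IP addresses that are no longer valid local addresses") can be sketched as a simple filter. This assumes each lock records the server address it arrived on, which is not something the thread establishes that lockd does; `struct lock_rec`, `is_local()` and `drop_stale_locks()` are hypothetical names, and addresses are plain IPv4 values in host byte order for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-lock record: the server IP the request was sent
 * to, and whether the lock is still held. */
struct lock_rec {
	uint32_t server_ip;
	int held;
};

/* Is ip one of the addresses currently configured on this host? */
static int is_local(uint32_t ip, const uint32_t *local, size_t nlocal)
{
	size_t i;

	for (i = 0; i < nlocal; i++)
		if (local[i] == ip)
			return 1;
	return 0;
}

/* Release every held lock whose server address is no longer local.
 * Returns the number of locks released. */
size_t drop_stale_locks(struct lock_rec *locks, size_t nlocks,
			const uint32_t *local, size_t nlocal)
{
	size_t i, dropped = 0;

	assert(locks != NULL || nlocks == 0);
	for (i = 0; i < nlocks; i++) {
		if (locks[i].held &&
		    !is_local(locks[i].server_ip, local, nlocal)) {
			locks[i].held = 0;
			dropped++;
		}
	}
	return dropped;
}
```

In this model the failover script would tear down the virtual IP first and then trigger the sweep, so the list of local addresses itself identifies which locks to release.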
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-13 7:00 UTC
To: Neil Brown; +Cc: linux-cluster, nfs
On Tue, 2006-06-13 at 13:17 +1000, Neil Brown wrote:
> So:
> I think if we really want to "remove all NFS locks on a filesystem",
> we could probably tie it into umount - maybe have lockd register some
> callback which gets called just before s_op->umount_begin.
The "umount_begin" idea was at one time on my list but got discarded. The
thought was that nfsd was not a filesystem, neither was lockd. How to
register something with VFS umount for non-filesystem kernel modules ?
Invent another autofs-like pseudo filesystem ? Mostly, not every
filesystem would like to get un-mounted upon failover (GFS, for example,
does not get un-mounted by our cluster suite upon failover).
> If we want to remove all locks that arrived on a particular
> interface, then we should arrange to do exactly that. There are a
> number of different options here.
> One is the multiple-lockd-threads idea.
Certainly a good option. To make it happen, we still need an admin
interface. How to pass the IP address from user mode into the kernel -
care to give some suggestions if you have them handy ? Should socket
ports get dynamically assigned ? Will we have scalability issues ?
> One is to register a callback when an interface is shut down.
> Another (possibly the best) is to arrange a new signal for lockd
> which says "Drop any locks which were sent to IP addresses that are
> no longer valid local addresses".
These, again, give individual filesystems no freedom to adjust what they
need upon failover. But I'll check them out this week - maybe there are
good socket layer hooks that I overlooked.
>
> So those are my thoughts. Do any of them seem reasonable to you?
>
The comments are greatly appreciated. And hopefully we can reach
agreement soon.
-- Wendy
* Re: [RFC] NLM lock failover admin interface
From: Neil Brown @ 2006-06-13 7:08 UTC
To: Wendy Cheng; +Cc: nfs, linux-cluster
On Tuesday June 13, wcheng@redhat.com wrote:
> > One is to register a callback when an interface is shut down.
> > Another (possibly the best) is to arrange a new signal for lockd
> > which says "Drop any locks which were sent to IP addresses that are
> > no longer valid local addresses".
>
> These, again, give individual filesystems no freedom to adjust what they
> need upon failover. But I'll check them out this week - maybe there are
> good socket layer hooks that I overlooked.
>
Can you say more about what sort of adjustments an individual
filesystem might want the freedom to make? It might help me
understand the issues better.
Thanks,
NeilBrown
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-14 6:54 UTC
To: Neil Brown; +Cc: linux-cluster, nfs
Hi,
KABI (kernel application binary interface) commitment is a big thing
from our end - so I would like to focus more on the interface agreement
before jumping into coding and implementation details.
> One is the multiple-lockd-threads idea.
Assume we still have this on the table.... Could I expect the admin
interface to go through the rpc.lockd command (man page and nfs-utils
code changes) ? The modified command will take similar options as
rpc.statd; more specifically, the -n, -o, and -p (see "man rpc.statd").
To pass the individual IP (socket address) to the kernel, we'll need
nfsctl with struct nfsctl_svc modified.
For the kernel piece, since we're there anyway, could we have the
individual lockd IP interface passed to SM (statd) (in the SM_MON call) ?
This would allow statd to structure its SM files based on each lockd IP
address, an important part of lock recovery.
> One is to register a callback when an interface is shut down.
Haven't checked out the (linux) socket interface yet. I'm very fuzzy on
how this can be done. Anyone has good ideas ?
> Another (possibly the best) is to arrange a new signal for lockd
> which says "Drop any locks which were sent to IP addresses that are
> no longer valid local addresses".
Very appealing - but the devil's always in the details. How to decide
which IP address is no longer valid ? Or how does lockd know about these
IP addresses ? And how to associate one particular IP address with the
"struct nlm_file" entries within the nlm_files list ? Need a few more
days to sort this out (or anyone already has ideas in mind ?).
-- Wendy
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Christoph Hellwig @ 2006-06-14 11:36 UTC
To: Wendy Cheng; +Cc: Neil Brown, nfs, linux-cluster
On Wed, Jun 14, 2006 at 02:54:51AM -0400, Wendy Cheng wrote:
> Hi,
>
> KABI (kernel application binary interface) commitment is a big thing
> from our end - so I would like to focus more on the interface agreement
> before jumping into coding and implementation details.
Please stop this crap now. If you don't get that there is no kernel
internal ABI and there never will be get a different job ASAP.
* Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-14 13:39 UTC
To: Christoph Hellwig; +Cc: Neil Brown, nfs, linux-cluster
On Wed, 2006-06-14 at 12:36 +0100, Christoph Hellwig wrote:
> On Wed, Jun 14, 2006 at 02:54:51AM -0400, Wendy Cheng wrote:
> > Hi,
> >
> > KABI (kernel application binary interface) commitment is a big thing
> > from our end - so I would like to focus more on the interface agreement
> > before jumping into coding and implementation details.
>
> Please stop this crap now. If you don't get that there is no kernel
> internal ABI and there never will be get a different job ASAP.
Actually I don't quite understand this statement (sorry! English is not
my native language) but it is ok. People are entitled to different
opinions and I respect yours.
On the technical side, this is just a precaution: in case we need to
touch some kernel export symbols, it would be nice to have the external
(and admin) interfaces decided before we start to code.
So I'll not talk about this and I assume we can keep focusing on NLM
issues. No more noise from each other. Fair ?
-- Wendy
* Re: Re: [NFS] [RFC] NLM lock failover admin interface
From: Wendy Cheng @ 2006-06-14 14:00 UTC
To: linux clustering; +Cc: Neil Brown, nfs
On Wed, 2006-06-14 at 02:54 -0400, Wendy Cheng wrote:
>
> Assume we still have this on the table.... Could I expect the admin
> interface to go through the rpc.lockd command (man page and nfs-utils
> code changes) ? The modified command will take similar options as
> rpc.statd; more specifically, the -n, -o, and -p (see "man rpc.statd").
> To pass the individual IP (socket address) to the kernel, we'll need
> nfsctl with struct nfsctl_svc modified.
I want to make sure people catch this. Here we're talking about NFS
system call interface changes. We need either a new NFS syscall or
an altered nfsctl_svc structure.
-- Wendy
>
> For the kernel piece, since we're there anyway, could we have the
> individual lockd IP interface passed to SM (statd) (in the SM_MON call) ?
> This would allow statd to structure its SM files based on each lockd IP
> address, an important part of lock recovery.
>
> > One is to register a callback when an interface is shut down.
>
> Haven't checked out the (linux) socket interface yet. I'm very fuzzy on
> how this can be done. Anyone has good ideas ?
>
> > Another (possibly the best) is to arrange a new signal for lockd
> > which says "Drop any locks which were sent to IP addresses that are
> > no longer valid local addresses".
>
> Very appealing - but the devil's always in the details. How to decide
> which IP address is no longer valid ? Or how does lockd know about these
> IP addresses ? And how to associate one particular IP address with the
> "struct nlm_file" entries within the nlm_files list ? Need a few more
> days to sort this out (or anyone already has ideas in mind ?).
>
> -- Wendy
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
* Re: [NFS] Re: [RFC] NLM lock failover admin interface
From: William A.(Andy) Adamson @ 2006-06-15 14:07 UTC
To: Wendy Cheng; +Cc: Neil Brown, nfs, linux clustering
this discussion has centered around removing the locks of an export.
we also want the interface to be able to remove the locks owned by a single
client. this is needed to enable client migration between replicas or between
nodes in a cluster file system. it is not acceptable to place an entire export
in grace just to move a small number of clients.
-->Andy
wcheng@redhat.com said:
> On Wed, 2006-06-14 at 02:54 -0400, Wendy Cheng wrote:
> > Assume we still have this on the table.... Could I expect the admin
> > interface to go through the rpc.lockd command (man page and nfs-utils
> > code changes) ? The modified command will take similar options as
> > rpc.statd; more specifically, the -n, -o, and -p (see "man rpc.statd").
> > To pass the individual IP (socket address) to the kernel, we'll need
> > nfsctl with struct nfsctl_svc modified.
>
> I want to make sure people catch this. Here we're talking about NFS
> system call interface changes. We need either a new NFS syscall or
> an altered nfsctl_svc structure.
> -- Wendy
* Re: [NFS] Re: [RFC] NLM lock failover admin interface
2006-06-15 14:07 ` [NFS] " William A.(Andy) Adamson
@ 2006-06-15 15:09 ` Wendy Cheng
2006-06-16 6:09 ` [Linux-cluster] " Neil Brown
1 sibling, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-15 15:09 UTC (permalink / raw)
To: William A.(Andy) Adamson; +Cc: Neil Brown, nfs, linux clustering

William A.(Andy) Adamson wrote:

>this discussion has centered around removing the locks of an export.
>we also want the interface to be able to remove the locks owned by a single
>client. this is needed to enable client migration between replicas or between
>nodes in a cluster file system. it is not acceptable to place an entire export
>in grace just to move a small number of clients.
>

Andy,

Gotcha ... forgot about NFS V4. BTW, the discussion has moved back to
/proc interface. I agree we need to add one more layer of granularity
into it. Glad you caught this flaw.

-- Wendy

* Re: [Linux-cluster] Re: [RFC] NLM lock failover admin interface
2006-06-15 14:07 ` [NFS] " William A.(Andy) Adamson
2006-06-15 15:09 ` Wendy Cheng
@ 2006-06-16 6:09 ` Neil Brown
2006-06-16 15:39 ` [NFS] " William A.(Andy) Adamson
1 sibling, 1 reply; 28+ messages in thread
From: Neil Brown @ 2006-06-16 6:09 UTC (permalink / raw)
To: William A.(Andy) Adamson; +Cc: linux clustering, nfs, Wendy Cheng

On Thursday June 15, andros@citi.umich.edu wrote:
> this discussion has centered around removing the locks of an export.
> we also want the interface to be able to remove the locks owned by a single
> client. this is needed to enable client migration between replicas or between
> nodes in a cluster file system. it is not acceptable to place an entire export
> in grace just to move a small number of clients.

Hmmmm....
You want to remove all the locks owned by a particular client
with the intention of reclaiming those locks against a different NFS
server (on a cluster filesystem)
and you don't want to put the whole filesystem into grace mode while
doing it.

Is that correct?

Sounds extremely racy to me. Suppose some other client takes a
conflicting lock between dropping them on one server and claiming them
on the other? That would be bad. The purpose of the grace mode is
precisely to avoid this sort of race.

It would seem that what you "really" want to do is to tell the cluster
filesystem to migrate the locks to a different node and somehow tell
lockd about it.

Is there a comprehensive design document about how this is going to
work, because I'm feeling doubtful.

For the 'between replicas' case - I'm not sure locking makes sense.
Locking on a read-only filesystem is pretty pointless, and presumably
replicas are read-only???

Basically, dropping locks that are expected to be picked up again,
without putting the whole filesystem into a grace period simply
doesn't sound workable to me.

Am I missing something?

NeilBrown

* Re: [NFS] Re: [RFC] NLM lock failover admin interface
2006-06-16 6:09 ` [Linux-cluster] " Neil Brown
@ 2006-06-16 15:39 ` William A.(Andy) Adamson
0 siblings, 0 replies; 28+ messages in thread
From: William A.(Andy) Adamson @ 2006-06-16 15:39 UTC (permalink / raw)
To: Neil Brown; +Cc: linux clustering, nfs

> On Thursday June 15, andros@citi.umich.edu wrote:
> > this discussion has centered around removing the locks of an export.
> > we also want the interface to be able to remove the locks owned by a single
> > client. this is needed to enable client migration between replicas or between
> > nodes in a cluster file system. it is not acceptable to place an entire export
> > in grace just to move a small number of clients.
>
> Hmmmm....
> You want to remove all the locks owned by a particular client
> with the intention of reclaiming those locks against a different NFS
> server (on a cluster filesystem)
> and you don't want to put the whole filesystem into grace mode while
> doing it.
>
> Is that correct?

yes.

>
> Sounds extremely racy to me. Suppose some other client takes a
> conflicting lock between dropping them on one server and claiming them
> on the other? That would be bad. The purpose of the grace mode is
> precisely to avoid this sort of race.

the idea is that the underlying file system can place only the files with
locks held by the migrating client(s) into grace, leaving all other files
for normal operation. the migrating (nfsv4) client then reclaims opens,
locks and delegations on the new server. it's just reducing the scope of
the grace period.

>
> It would seem that what you "really" want to do is to tell the cluster
> filesystem to migrate the locks to a different node and somehow tell
> lockd about it.

what we really want is for the cluster file system to share the locks
between the original node and the new node. then the client can simply be
redirected and no grace period or reclaim is needed. this is much harder
to code than a reduced grace period as described above. from what we hear,
lustre has this functionality.

either way, the files with locks held by the migrating client need to be
identified by both the lock manager (lockd/nfsv4 server) and the
underlying fs.

>
> Is there a comprehensive design document about how this is going to
> work, because I'm feeling doubtful.

we have a work in progress - it's not done but may help describe our
thinking.

http://wiki.linux-nfs.org/index.php/Recovery_and_migration

>
> For the 'between replicas' case - I'm not sure locking makes sense.
> Locking on a read-only filesystem is pretty pointless, and presumably
> replicas are read-only???

nope. we have a promising prototype read/write replica scheme that we are
testing.

http://www.citi.umich.edu/techreports/reports/citi-tr-06-3.pdf

i agree this is an outlying case.... but another immediate consumer of
such an interface would be an administrator who needs to remove the locks
for a client.

-->Andy

>
> Basically, dropping locks that are expected to be picked up again,
> without putting the whole filesystem into a grace period simply
> doesn't sound workable to me.
>
> Am I missing something?
>
> NeilBrown

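The reduced-scope grace period Andy describes can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct mig_file`, `begin_client_migration()` and `grace_permits()` are invented for illustration and are not lockd, nfsd, or cluster filesystem code. It assumes each file records at most one lock-holding client and that, as with an ordinary grace period, a file in grace accepts only reclaim requests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file record; not a kernel structure. */
struct mig_file {
    int  holder;    /* id of the client holding a lock, 0 if none */
    bool in_grace;  /* true while only reclaims are accepted */
};

/* Place only the files locked by the migrating client into grace.
 * Returns the number of files marked. */
size_t begin_client_migration(struct mig_file *files, size_t n, int client)
{
    size_t marked = 0;

    for (size_t i = 0; i < n; i++) {
        if (client != 0 && files[i].holder == client) {
            files[i].in_grace = true;
            marked++;
        }
    }
    return marked;
}

/* Per-file grace decision: a file in grace accepts only a reclaim from
 * the client that held the lock; any other file accepts only ordinary
 * (non-reclaim) requests. Lock conflict checks are out of scope here. */
bool grace_permits(const struct mig_file *f, int client, bool is_reclaim)
{
    if (f->in_grace)
        return is_reclaim && client == f->holder;
    return !is_reclaim;
}
```

Under these assumptions, a conflicting request from another client on a file being migrated is refused until the reclaim completes, which is the race Neil raises, while files untouched by the migrating client keep operating normally.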
* Re: [NFS] [RFC] NLM lock failover admin interface
2006-06-14 6:54 ` [NFS] " Wendy Cheng
2006-06-14 11:36 ` Christoph Hellwig
2006-06-14 14:00 ` Wendy Cheng
@ 2006-06-15 4:27 ` Neil Brown
2006-06-15 6:39 ` Wendy Cheng
2 siblings, 1 reply; 28+ messages in thread
From: Neil Brown @ 2006-06-15 4:27 UTC (permalink / raw)
To: Wendy Cheng; +Cc: nfs, linux-cluster

On Wednesday June 14, wcheng@redhat.com wrote:
> Hi,
>
> KABI (kernel application binary interface) commitment is a big thing
> from our end - so I would like to focus more on the interface agreement
> before jumping into coding and implementation details.
>

Before we can agree on an interface, we need to be clear what
functionality is required.

You started out suggesting that the required functionality was to
"remove all locks that lockd holds on a particular filesystem".

I responded that I suspect a better functionality was
"remove all locks that lockd holds on behalf of a particular IP address".

You replied that such an approach

> give[s] individual filesystem no freedom to adjust what they
> need upon failover.

I asked:

> Can you say more about what sort of adjustments an individual filesystem
> might want the freedom to make? It might help me understand the
> issues better.

and am still waiting for an answer. Without an answer, I still lean
towards an IP-address based approach, and the reply from James
Yarbrough seems to support that (though I don't want to read too much
into his comments).

Lockd is not currently structured to associate locks with
server-ip-addresses. There is an assumption that one client may talk
to any of the IP addresses that the server supports. This is clearly
not the case for the failover scenario that you are considering, so a
little restructuring might be in order.

Some locks will be held on behalf of a client, no matter what
interface the requests arrive on. Other locks will be held on behalf
of a client and tied to a particular server IP address. Probably the
easiest way to make this distinction is as a new nfsd export flag.

So, maybe something like this:

  Add a 'struct sockaddr_in' to 'struct nlm_file'.
  If nlm_fopen returns (say) 3, then treat it as success, and
    also copy rqstp->rq_addr into that 'sockaddr_in'.
  Define a new file in the 'nfsd' filesystem into which can
    be written an IP address and which calls some new lockd
    function which releases all locks held for that IP address.
  Probably get nlm_lookup_file to insist that if the sockaddr_in
    is defined in a lock, it must match the one in rqstp

Does that sound OK ?

> > One is the multiple-lockd-threads idea.
>
> Assume we still have this on the table.... Could I expect the admin
> interface goes thru rpc.lockd command (man page and nfs-util code
> changes) ? The modified command will take similar options as rpc.statd;
> more specifically, the -n, -o, and -p (see "man rpc.statd"). To pass the
> individual IP (socket address) to kernel, we'll need nfsctl with struct
> nfsctl_svc modified.

I'm losing interest in the multiple-lockd-threads approach myself (for
the moment anyway :-)
However I would be against trying to re-use rpc.lockd - that was a
mistake that is best forgotten.
If the above approach were taken, then I don't think you need anything
more than
   echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
(or whatever), though if you really want to wrap that in a shell
script that might be ok.

>
> For the kernel piece, since we're there anyway, could we have the
> individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> This would allow statd to structure its SM files based on each lockd IP
> address, an important part of lock recovery.
>

Maybe.... but I don't get the scenario.
Surely the SM files are only needed when the server restarts, and in
that case it needs to notify all clients... Or is it that you want to
make sure the notification comes from the right IP address.... I guess
that would make sense. Is that what you are after?

> > One is to register a callback when an interface is shut down.
>
> Haven't checked out (linux) socket interface yet. I'm very fuzzy how
> this can be done. Anyone has good ideas ?

No good idea, but I have a feeling there is a callback we could use.
However I think I am going off this idea.

>
> > Another (possibly the best) is to arrange a new signal for lockd
> > which say "Drop any locks which were sent to IP addresses that are
> > no longer valid local addresses".
>
> Very appealing - but the devil's always in the details. How to decide
> which IP address is no longer valid ? Or how does lockd know about these
> IP addresses ? And how to associate one particular IP address with the
> "struct nlm_file" entries within nlm_files list ? Need few more days to
> sort this out (or any one already has ideas in mind ?).

See above.

NeilBrown

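The per-IP-address proposal in the message above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: `struct vip_file`, `vip_file_add()`, `vip_unlock()` and `vip_file_count()` are hypothetical names, not lockd symbols, and a plain 32-bit address stands in for the 'struct sockaddr_in' that would be added to 'struct nlm_file'. An address of 0 marks an entry that is not tied to any server address.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lockd file entry; not struct nlm_file. */
struct vip_file {
    struct vip_file *next;
    uint32_t server_ip;   /* server address the request arrived on, 0 = untied */
    unsigned int nlocks;  /* locks held on this file */
};

/* Push a new entry; returns the new head, or the old head if malloc fails. */
struct vip_file *vip_file_add(struct vip_file *head, uint32_t server_ip,
                              unsigned int nlocks)
{
    struct vip_file *f = malloc(sizeof(*f));

    if (f == NULL)
        return head;
    f->next = head;
    f->server_ip = server_ip;
    f->nlocks = nlocks;
    return f;
}

/* Unlink and free every entry tied to server_ip.
 * Returns the number of locks dropped; untied entries are never touched. */
unsigned int vip_unlock(struct vip_file **headp, uint32_t server_ip)
{
    unsigned int dropped = 0;
    struct vip_file **pp = headp;

    while (*pp != NULL) {
        struct vip_file *f = *pp;

        if (f->server_ip != 0 && f->server_ip == server_ip) {
            dropped += f->nlocks;
            *pp = f->next;   /* unlink before freeing */
            free(f);
        } else {
            pp = &f->next;
        }
    }
    return dropped;
}

/* Count the remaining entries. */
size_t vip_file_count(const struct vip_file *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

The walk uses a pointer-to-pointer so that entries can be unlinked without tracking a separate "previous" node; in the kernel the equivalent traversal would also need lockd's own locking, which this sketch leaves out.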
* Re: [NFS] [RFC] NLM lock failover admin interface
2006-06-15 4:27 ` [NFS] " Neil Brown
@ 2006-06-15 6:39 ` Wendy Cheng
2006-06-15 8:02 ` Neil Brown
0 siblings, 1 reply; 28+ messages in thread
From: Wendy Cheng @ 2006-06-15 6:39 UTC (permalink / raw)
To: Neil Brown; +Cc: nfs, linux-cluster

On Thu, 2006-06-15 at 14:27 +1000, Neil Brown wrote:

> You started out suggesting that the required functionality was to
> "remove all locks that lockd holds on a particular filesystem".

I didn't make this clear. No, we don't want to "remove all locks
associated with a particular filesystem". We want to "remove all locks
associated with an NFS service" - one NFS service is normally associated
with one NFS export. For example, say in /etc/exports:

/mnt/export_fs/dir_1 *(fsid=1,async,rw)
/mnt/export_fs/dir_2 *(fsid=2,async,rw)

One same filesystem (export_fs) is exported via two entries, each with
its own fsid. The "fsid" is eventually encoded as part of the filehandle
stored into "struct nlm_file" and linked into nlm_file global list.

This is to allow, not only active-active failover (for local filesystem
such as ext3), but also load balancing for cluster file systems (such as
GFS).

In reality, each NFS service is associated with one virtual IP. The
failover and load-balancing tasks are carried out by moving the virtual
IP around - so I'm ok with the idea of "remove all locks that lockd
holds on behalf of a particular IP address".

>
> Lockd is not currently structured to associate locks with
> server-ip-addresses. There is an assumption that one client may talk
> to any of the IP addresses that the server supports. This is clearly
> not the case for the failover scenario that you are considering, so a
> little restructuring might be in order.
>
> Some locks will be held on behalf of a client, no matter what
> interface the requests arrive on. Other locks will be held on behalf
> of a client and tied to a particular server IP address. Probably the
> easiest way to make this distinction is as a new nfsd export flag.

We're very close now - note that I originally proposed adding a new nfsd
export flag (NFSEXP_FOLOCKS) so we can OR it into export's ex_flag upon
un-export. If the new action flag is set, a new sub-call added into
unexport kernel routine will walk thru nlm_file to find the export entry
(matched by either fsid or devno, taken from filehandle, within nlm_file
struct); then subsequently release the lock.

The ex_flag is an "int" but currently only used up to 16 bit. So my new
export flag is defined as: NFSEXP_FOLOCKS 0x00010000.

>
> So, maybe something like this:
>
>   Add a 'struct sockaddr_in' to 'struct nlm_file'.
>   If nlm_fopen returns (say) 3, then treat it as success, and
>     also copy rqstp->rq_addr into that 'sockaddr_in'.
>   Define a new file in the 'nfsd' filesystem into which can
>     be written an IP address and which calls some new lockd
>     function which releases all locks held for that IP address.
>   Probably get nlm_lookup_file to insist that if the sockaddr_in
>     is defined in a lock, it must match the one in rqstp

Yes, we definitely can do this but there is a "BUT" from our end. What I
did in my prototyping code is taking filehandle from nlm_file structure
and yank the fsid (or devno) out of it (so we didn't need to know the
socket address). With (your) above approach, adding a new field into
"struct nlm_file" to hold the sock addr, sadly say, violates our KABI
policy.

I learnt my lesson. Forget KABI for now. Let me see what you have in the
next paragraph (so I can know how to respond ...)

>
> > > One is the multiple-lockd-threads idea.
>
> I'm losing interest in the multiple-lockd-threads approach myself (for
> the moment anyway :-)
> However I would be against trying to re-use rpc.lockd - that was a
> mistake that is best forgotten.
> If the above approach were taken, then I don't think you need anything
> more than
>    echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
> (or whatever), though if you really want to wrap that in a shell
> script that might be ok.

This is funny - so we go back to /proc. OK with me :) but you may want
to re-think my exportfs command approach. Want me to go over the
unexport flow again ? The idea is to add a new user mode flag, say "-h".
If you unexport the interface as:

shell> exportfs -u *:/export_path    // nothing happens, old behavior

but if you do:

shell> exportfs -hu *:/export_path   // the kernel code would walk thru
                                     // nlm_file list to release the
                                     // locks.

The "-h" "OR" 0x00010000 into ex_flags field of struct nfsctl_export so
kernel can know what to do. With fsid (or devno) in filehandle within
nlm_file, we don't need socket address at all.

But again, I'm OK with /proc approach. However, with /proc approach, we
may need socket address (since not every export uses fsid and devno is
not easy to get).

Do we agree now ? In simple sentence, I prefer my original "exportfs
-hu" approach. But I'm ok with /proc if you insist.

>
> >
> > For the kernel piece, since we're there anyway, could we have the
> > individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> > This would allow statd to structure its SM files based on each lockd IP
> > address, an important part of lock recovery.
> >
>
> Maybe.... but I don't get the scenario.
> Surely the SM files are only needed when the server restarts, and in
> that case it needs to notify all clients... Or is it that you want to
> make sure the notification comes from the right IP address.... I guess
> that would make sense. Is that what you are after?

Yes ! Right now, lockd doesn't pass the specific server address (that
client connects to) to statd. I don't know how the "-H" can ever work.
Consider this a bug. If you forget what "rpc.statd -H" is, check out the
man page (man rpc.statd).

Thank you for the patience - I'm grateful.

-- Wendy

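The unexport-flag approach described in this message can be sketched as below. The flag value 0x00010000 is the one proposed in the message; everything else is a hypothetical user-space stand-in (`struct export_req`, `struct held_file`, `unexport_drop_locks()`), assuming the fsid has already been extracted from each file handle. It is not the nfsd or lockd code path.

```c
#include <stddef.h>

/* Flag value as proposed in the message above (first bit past the low 16). */
#define NFSEXP_FOLOCKS 0x00010000

/* Hypothetical unexport request: only the fields the sketch needs. */
struct export_req {
    int ex_flags;
    unsigned int fsid;
};

/* Hypothetical lock record with the fsid already pulled from the handle. */
struct held_file {
    unsigned int fsid;
    int locked;   /* nonzero while a lock is held */
};

/* On unexport, release locks whose fsid matches the export, but only
 * when the caller set NFSEXP_FOLOCKS. Returns the number released. */
size_t unexport_drop_locks(const struct export_req *req,
                           struct held_file *files, size_t n)
{
    size_t released = 0;

    if (!(req->ex_flags & NFSEXP_FOLOCKS))
        return 0;   /* plain "exportfs -u": old behavior, locks stay */

    for (size_t i = 0; i < n; i++) {
        if (files[i].locked && files[i].fsid == req->fsid) {
            files[i].locked = 0;
            released++;
        }
    }
    return released;
}
```

Because the match is on fsid alone, two exports of the same filesystem with different fsid values (as in the /etc/exports example) are handled independently, which is the granularity the message argues for.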
* Re: [RFC] NLM lock failover admin interface
2006-06-15 6:39 ` Wendy Cheng
@ 2006-06-15 8:02 ` Neil Brown
2006-06-15 18:43 ` Wendy Cheng
0 siblings, 1 reply; 28+ messages in thread
From: Neil Brown @ 2006-06-15 8:02 UTC (permalink / raw)
To: Wendy Cheng; +Cc: nfs, linux-cluster

On Thursday June 15, wcheng@redhat.com wrote:
> On Thu, 2006-06-15 at 14:27 +1000, Neil Brown wrote:
>
> > You started out suggesting that the required functionality was to
> > "remove all locks that lockd holds on a particular filesystem".
>
> I didn't make this clear. No, we don't want to "remove all locks
> associated with a particular filesystem". We want to "remove all locks
> associated with an NFS service" - one NFS service is normally associated
> with one NFS export. For example, say in /etc/exports:
>
> /mnt/export_fs/dir_1 *(fsid=1,async,rw)
> /mnt/export_fs/dir_2 *(fsid=2,async,rw)

That makes sense.

>
> One same filesystem (export_fs) is exported via two entries, each with
> its own fsid. The "fsid" is eventually encoded as part of the filehandle
> stored into "struct nlm_file" and linked into nlm_file global list.
>
> This is to allow, not only active-active failover (for local filesystem
> such as ext3), but also load balancing for cluster file systems (such as
> GFS).

Could you please explain to me what "active-active failover for local
filesystem such as ext3" means (I'm not very familiar with cluster
terminology).
It sounds like the filesystem is active on two nodes at once, which of
course cannot work for ext3, so I am confused.
And if you are doing "failover", what has failed?

The load-balancing scenario makes sense (at least so far...).

>
> In reality, each NFS service is associated with one virtual IP. The
> failover and load-balancing tasks are carried out by moving the virtual
> IP around - so I'm ok with the idea of "remove all locks that lockd
> holds on behalf of a particular IP address".
>

Good. :-)

> >
> > Lockd is not currently structured to associate locks with
> > server-ip-addresses. There is an assumption that one client may talk
> > to any of the IP addresses that the server supports. This is clearly
> > not the case for the failover scenario that you are considering, so a
> > little restructuring might be in order.
> >
> > Some locks will be held on behalf of a client, no matter what
> > interface the requests arrive on. Other locks will be held on behalf
> > of a client and tied to a particular server IP address. Probably the
> > easiest way to make this distinction is as a new nfsd export flag.
>
> We're very close now - note that I originally proposed adding a new nfsd
> export flag (NFSEXP_FOLOCKS) so we can OR it into export's ex_flag upon
> un-export. If the new action flag is set, a new sub-call added into
> unexport kernel routine will walk thru nlm_file to find the export entry
> (matched by either fsid or devno, taken from filehandle, within nlm_file
> struct); then subsequently release the lock.
>
> The ex_flag is an "int" but currently only used up to 16 bit. So my new
> export flag is defined as: NFSEXP_FOLOCKS 0x00010000.
>

Our two export flags mean VERY different things.
Mine says 'locks against this export are per-server-ip-address'.
Yours says (I think) 'remove all lockd locks from this export' and is
really an unexport flag, not an export flag.

And this makes it not really workable. We no longer require the user
of the nfssvc syscall to unexport filesystems. In fact nfs-utils doesn't
use it at all if /proc/fs/nfsd is mounted. Filesystems are unexported
by their entry in the export cache expiring, or the cache being
flushed.

There is simply no room in the current knfsd design for an unexport
flag - sorry ;-(

> >
> > So, maybe something like this:
> >
> >   Add a 'struct sockaddr_in' to 'struct nlm_file'.
> >   If nlm_fopen returns (say) 3, then treat it as success, and
> >     also copy rqstp->rq_addr into that 'sockaddr_in'.
> >   Define a new file in the 'nfsd' filesystem into which can
> >     be written an IP address and which calls some new lockd
> >     function which releases all locks held for that IP address.
> >   Probably get nlm_lookup_file to insist that if the sockaddr_in
> >     is defined in a lock, it must match the one in rqstp
>
> Yes, we definitely can do this but there is a "BUT" from our end. What I
> did in my prototyping code is taking filehandle from nlm_file structure
> and yank the fsid (or devno) out of it (so we didn't need to know the
> socket address). With (your) above approach, adding a new field into
> "struct nlm_file" to hold the sock addr, sadly say, violates our KABI
> policy.

Does it?
'struct nlm_file' is a structure that is entirely local to lockd.
It does not feature in any of the interface between lockd and any
other part of the kernel. It is not part of any credible KABI.
The other changes I suggest involve adding an exported symbol to
lockd, which does change the KABI but in a completely back-compatible
way, and re-interpreting the return value of a callout.
That could not break any external module - it could only break
someone's setup if they had an alternate lockd module, but I don't
think your KABI policy allows people to replace modules and stay
supported.

However, as you say....

>
> I learnt my lesson. Forget KABI for now. Let me see what you have in the
> next paragraph (so I can know how to respond ...)
>

....we aren't going to let KABI issues get in our way.

> >
> > > > One is the multiple-lockd-threads idea.
> >
> > I'm losing interest in the multiple-lockd-threads approach myself (for
> > the moment anyway :-)
> > However I would be against trying to re-use rpc.lockd - that was a
> > mistake that is best forgotten.
> > If the above approach were taken, then I don't think you need anything
> > more than
> >    echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
> > (or whatever), though if you really want to wrap that in a shell
> > script that might be ok.
>
> This is funny - so we go back to /proc. OK with me :)

Only sort-of back to /proc. /proc/fs/nfsd is a separate filesystem
which happens to be mounted there normally.
The unexport system call goes through this exact same filesystem
(though it is somewhat under-the-hood) so at that level, we are
really proposing the same style of interface implementation.

> but you may want
> to re-think my exportfs command approach. Want me to go over the
> unexport flow again ? The idea is to add a new user mode flag, say "-h".
> If you unexport the interface as:
>
> shell> exportfs -u *:/export_path    // nothing happens, old behavior
>
> but if you do:
>
> shell> exportfs -hu *:/export_path   // the kernel code would walk thru
>                                      // nlm_file list to release the
>                                      // locks.
>
> The "-h" "OR" 0x00010000 into ex_flags field of struct nfsctl_export so
> kernel can know what to do. With fsid (or devno) in filehandle within
> nlm_file, we don't need socket address at all.

But apart from nfsctl_export being a dead end, this is still
exportpoint specific rather than IP address specific.

>
> But again, I'm OK with /proc approach. However, with /proc approach, we
> may need socket address (since not every export uses fsid and devno is
> not easy to get).

Absolutely. We need a socket address.
As part of this process you are shutting down an interface. We know
(or can easily discover) the address of that interface. That is
exactly the address that we feed to nfsd.

>
> Do we agree now ? In simple sentence, I prefer my original "exportfs
> -hu" approach. But I'm ok with /proc if you insist.
>

I'm not at an 'insist'ing stage at the moment - I like to at least
pretend to be open minded :-)

The main thing I don't like about your "exportfs -hu" approach is that
I don't think it will work (actually, looking at nfs-utils, I'm not so
sure that "exportfs -u" will work at all if you don't have
/proc/fs/nfsd mounted....)

The other thing I don't like is that it doesn't address your primary
need - decommissioning an IP address.
Rather it addresses a secondary need - removing some locks from some
filesystems.

But I'm still open to debate...

> >
> > >
> > > For the kernel piece, since we're there anyway, could we have the
> > > individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> > > This would allow statd to structure its SM files based on each lockd IP
> > > address, an important part of lock recovery.
> > >
> >
> > Maybe.... but I don't get the scenario.
> > Surely the SM files are only needed when the server restarts, and in
> > that case it needs to notify all clients... Or is it that you want to
> > make sure the notification comes from the right IP address.... I guess
> > that would make sense. Is that what you are after?
>
> Yes ! Right now, lockd doesn't pass the specific server address (that
> client connects to) to statd. I don't know how the "-H" can ever work.
> Consider this a bug. If you forget what "rpc.statd -H" is, check out the
> man page (man rpc.statd).

I have to admit I have never given that code a lot of attention. I
reviewed it when it was sent - it seemed to make sense and had no
obvious problems - so I accepted it. I wouldn't be enormously
surprised if it didn't work in some situations.

>
> Thank you for the patience - I'm grateful.

Ditto. Conversations work much better when people are patient and
polite.

Thanks,
NeilBrown

* Re: [RFC] NLM lock failover admin interface
2006-06-15 8:02 ` Neil Brown
@ 2006-06-15 18:43 ` Wendy Cheng
0 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-15 18:43 UTC (permalink / raw)
To: Neil Brown; +Cc: linux clustering, nfs

Neil Brown wrote:

>Could you please explain to me what "active-active failover for local
>filesystem such as ext3" means
>

Clustering is a prolific subject so the term may mean different things
to different people. The setup we discuss here is to move an NFS service
from one server to the other while both servers are up and running
(active-active). The goal is not to disturb other NFS services that are
not involved with the transition.

>It sounds like the filesystem is active on two nodes at once, which of
>course cannot work for ext3, so I am confused.
>And if you are doing "failover", what has failed?
>
>The load-balancing scenario makes sense (at least so far...).
>

Local filesystem such as ext3 will never be mounted on more than one
node but cluster filesystems (e.g. our GFS) will. Moving ext3 normally
implies error conditions (a true failover) though in rare cases, it may
be kicked off for load balancing purpose. Current GFS locking has the
"node-id" concept - the easiest way (at this moment) for virtual IP to
float around is to drop the locks and let NLM reclaim the locks from the
new server.

>
>Our two export flags mean VERY different things.
>Mine says 'locks against this export are per-server-ip-address'.
>Yours says (I think) 'remove all lockd locks from this export' and is
>really an unexport flag, not an export flag.
>
>And this makes it not really workable. We no longer require the user
>of the nfssvc syscall to unexport filesystems. In fact nfs-utils doesn't
>use it at all if /proc/fs/nfsd is mounted. Filesystems are unexported
>by their entry in the export cache expiring, or the cache being
>flushed.
>

The important thing (for me) is the vfsmount reference count which can
only be properly decreased when unexport is triggered. Without
decreasing the vfsmount, ext3 can not be un-mounted (and we need to
umount ext3 upon failover). I haven't looked into community versions of
kernel source for a while (but I'll check). So what can I do to ensure
this will happen ? - i.e., after the filesystem has been accessed by
nfsd, how can I safely un-mount it without shutting down nfsd (and/or
lockd) ?

>'struct nlm_file' is a structure that is entirely local to lockd.
>It does not feature in any of the interface between lockd and any
>other part of the kernel. It is not part of any credible KABI.
>The other changes I suggest involve adding an exported symbol to
>lockd, which does change the KABI but in a completely back-compatible
>way, and re-interpreting the return value of a callout.
>That could not break any external module - it could only break
>someone's setup if they had an alternate lockd module, but I don't
>think your KABI policy allows people to replace modules and stay
>supported.
>

Yes, you're right ! I looked into the wrong code (well, it was late in
the night so I was not very functional at that moment). Had some
prototype code where I transported the nlm_file from one server to
another server, experimenting auto-reclaiming locks without statd. I
exported the nlm_file list there. So let's forget about this.

>>>>> One is the multiple-lockd-threads idea.
>>>>>
>>>I'm losing interest in the multiple-lockd-threads approach myself (for
>>>the moment anyway :-)
>>>

Good! because I'm not sure whether we'll hit scalability issue or not
(100 nfs services implies 100 lockd threads !).

>>>However I would be against trying to re-use rpc.lockd - that was a
>>>mistake that is best forgotten.
>>>

Highlight this :) ... Give me some comfort feelings that I'm not the
only person who would make mistakes.

>>>If the above approach were taken, then I don't think you need anything
>>>more than
>>>   echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
>>>(or whatever), though if you really want to wrap that in a shell
>>>script that might be ok.
>>>
>>This is funny - so we go back to /proc. OK with me :)
>>
>
>Only sort-of back to /proc. /proc/fs/nfsd is a separate filesystem
>which happens to be mounted there normally.
>The unexport system call goes through this exact same filesystem
>(though it is somewhat under-the-hood) so at that level, we are
>really proposing the same style of interface implementation.
>
>>But again, I'm OK with /proc approach. However, with /proc approach, we
>>may need socket address (since not every export uses fsid and devno is
>>not easy to get).
>>
>
>Absolutely. We need a socket address.
>As part of this process you are shutting down an interface. We know
>(or can easily discover) the address of that interface. That is
>exactly the address that we feed to nfsd.
>

Now, it looks good ! Will do the following:

1. Further understand the steps to make sure we can un-mount ext3 due to
"unexport" method changes.
2. Start to code to the /proc interface and make sure "rpc.statd -H"
can work (lock reclaiming needs it). Will keep NFS v4 in mind as well.

By the way, there is a socket state-change-handler (TCP only) and/or
network interface notification routine that seem to be workable (your
previous thoughts). However, I don't plan to keep exploring that
possibility since we now have a simple and workable method in place.

-- Wendy

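The user-space side of the agreed /proc-style interface amounts to writing a dotted-quad address into a control file, as in the `echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock` example quoted above. A minimal C sketch of that step follows. The helper name `request_unlock()` is hypothetical, and the control file path is a parameter because the thread itself leaves the file name open ("or whatever"); the sketch only validates the address with `inet_pton()` and writes it out.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Validate an IPv4 dotted-quad string and write it, newline-terminated,
 * to the control file at ctl_path (hypothetical; e.g. a file under the
 * nfsd filesystem). Returns 0 on success, -1 on a bad address or I/O error. */
int request_unlock(const char *ctl_path, const char *ipv4)
{
    struct in_addr addr;
    FILE *fp;
    int ok;

    /* Reject anything that is not a well-formed IPv4 address. */
    if (inet_pton(AF_INET, ipv4, &addr) != 1)
        return -1;

    fp = fopen(ctl_path, "w");
    if (fp == NULL)
        return -1;

    ok = fprintf(fp, "%s\n", ipv4) > 0;
    if (fclose(fp) != 0)   /* a failed close can hide a failed write */
        ok = 0;

    return ok ? 0 : -1;
}
```

Validating before opening the file means a malformed address never reaches the kernel interface, which keeps the wrapper script Neil mentions trivially small.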
* Re: [RFC] NLM lock failover admin interface
2006-06-12 5:25 [RFC] NLM lock failover admin interface Wendy Cheng
` (3 preceding siblings ...)
2006-06-13 3:17 ` Neil Brown
@ 2006-06-13 15:23 ` James Yarbrough
4 siblings, 0 replies; 28+ messages in thread
From: James Yarbrough @ 2006-06-13 15:23 UTC (permalink / raw)
To: Wendy Cheng, Neil Brown; +Cc: linux-cluster, nfs

> There seems to be an unstated assumption here that there is one
> virtual IP per exported filesystem. Is that true?

This is the normal case for such HA services. There may actually be a
single IP address covering multiple filesystems and/or NFS exports.

> I think that maybe the right thing to do is *not* drop the locks on a
> particular filesystem, but to drop the locks made to a particular
> virtual IP.

For filesystems such as ext2 or xfs, you unmount the filesystem on the
current server and mount it on the new server when doing a failover. In
this case, you have to be able to get rid of all the locks first and you
do that for the entire filesystem. For a cluster filesystem such as
cxfs, you don't actually unmount the filesystem, so you really need the
per-IP address approach.

> If I want to force-unmount a filesystem, I need to unexport it, and I
> need to kill all the locks. Currently you can only remove locks from
> all filesystems, which might not be ideal.

This is definitely less than ideal. This will force notification and
reclaim for all exported filesystems. This can be a significant problem.

jmy@sgi.com
650 933 3124
Why is there a snake in my Coke?
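The difference between the two selection rules discussed above can be shown with a toy model. Everything here is hypothetical: the `Lock` record, its fields, and the two helper functions are illustrative only and do not correspond to lockd's internal structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lock:
    """Hypothetical record of one NLM lock held on the server."""
    server_ip: str  # virtual IP the client used to reach the server
    fsid: int       # filesystem the locked file lives on
    client: str     # host holding the lock


def drop_by_fsid(locks, fsid):
    """Return the locks that survive dropping one whole filesystem."""
    return [lk for lk in locks if lk.fsid != fsid]


def drop_by_ip(locks, server_ip):
    """Return the locks that survive dropping one virtual IP."""
    return [lk for lk in locks if lk.server_ip != server_ip]


locks = [
    Lock("10.0.0.1", fsid=1, client="a"),
    Lock("10.0.0.1", fsid=2, client="b"),  # one IP covering two filesystems
    Lock("10.0.0.2", fsid=2, client="c"),  # same filesystem, second IP
]
```

In this model, `drop_by_fsid(locks, 2)` releases the locks of clients b and c, which is what an ext2 or xfs failover needs before unmounting. `drop_by_ip(locks, "10.0.0.1")` releases a and b but leaves c's lock on the still-mounted filesystem, which matches the cluster filesystem case.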
* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
@ 2006-06-12 14:45 Stanley, Jon
2006-06-13 3:39 ` Wendy Cheng
0 siblings, 1 reply; 28+ messages in thread
From: Stanley, Jon @ 2006-06-12 14:45 UTC (permalink / raw)
To: linux clustering, nfs

> -----Original Message-----
> From: linux-cluster-bounces@redhat.com
> [mailto:linux-cluster-bounces@redhat.com] On Behalf Of Wendy Cheng
> Sent: Monday, June 12, 2006 12:26 AM
> To: nfs@lists.sourceforge.net
> Cc: linux-cluster@redhat.com
> Subject: [Linux-cluster] [RFC] NLM lock failover admin interface
>

NOTE - I don't use NFS functionality in Cluster Suite, so my comments
may be entirely meaningless.

> 1. /proc interface, say writing the fsid into a /proc directory entry
> would end up dropping all NLM locks associated with the NFS export that
> has fsid in its /etc/exports file.

This would definitely have its advantages for people who know what
they're doing - they could drop all locks without unexporting the
filesystem. However, it also gives people the opportunity to shoot
themselves in the foot - by eliminating locks that are needed. After
weighing the pros and cons, I really don't think that any method
accessible via /proc is a good idea.

> 2. Adding a new flag into "exportfs" command, say "h", such that
>
> "exportfs -uh *:/export_path"
>
> would un-export the entry and drop the NLM locks associated with the
> entry.
>

This is the best of the three, IMHO. It gives you the safety of
*knowing* that the filesystem was unexported before dropping the locks,
and prevents folks from shooting themselves in the foot.

The other option that was mentioned, a separate lockd for each fs, is
also a good idea - but would require a lot of coding no doubt, and
introduce more instability into what I already perceive as an unstable
NFS subsystem in Linux (I *refuse* to use Linux as an NFS server and
instead go with Solaris - I've had *really* bad experiences with Linux
NFS under load - but that's getting OT).
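The ordering argument above (unexport first, then drop the locks) can be sketched as a guard in a toy model. The `ExportTable` class and its methods are hypothetical and exist only to illustrate the safety property; they do not model the real `exportfs` command or the proposed "h" flag.

```python
class ExportTable:
    """Toy model of a server's export list and per-export lock counts."""

    def __init__(self):
        self.exported = set()
        self.locks = {}  # export path -> number of locks held

    def export(self, path):
        self.exported.add(path)
        self.locks.setdefault(path, 0)

    def drop_locks(self, path):
        """Refuse to drop locks while clients can still reach the export."""
        if path in self.exported:
            raise RuntimeError(f"{path} is still exported")
        # Return how many locks were released for this export.
        return self.locks.pop(path, 0)

    def unexport_and_drop(self, path):
        """Combined operation: unexport first, then drop the locks."""
        self.exported.discard(path)
        return self.drop_locks(path)
```

With the combined call, there is no window in which locks are dropped on an export that clients can still use, which is the foot-shooting case the message is worried about.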
* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
2006-06-12 14:45 [Linux-cluster] " Stanley, Jon
@ 2006-06-13 3:39 ` Wendy Cheng
0 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-13 3:39 UTC (permalink / raw)
To: Stanley, Jon; +Cc: nfs, linux clustering

On Mon, 2006-06-12 at 09:45 -0500, Stanley, Jon wrote:
>
> > -----Original Message-----
> > From: linux-cluster-bounces@redhat.com
> > [mailto:linux-cluster-bounces@redhat.com] On Behalf Of Wendy Cheng
> > Sent: Monday, June 12, 2006 12:26 AM
> > To: nfs@lists.sourceforge.net
> > Cc: linux-cluster@redhat.com
> > Subject: [Linux-cluster] [RFC] NLM lock failover admin interface

Jon,

Thank you for reviewing this - it helps !

-- Wendy

> >
> > 1. /proc interface, say writing the fsid into a /proc directory entry
> > would end up dropping all NLM locks associated with the NFS export
> > that has fsid in its /etc/exports file.
>
> This would definitely have its advantages for people who know what
> they're doing - they could drop all locks without unexporting the
> filesystem. However, it also gives people the opportunity to shoot
> themselves in the foot - by eliminating locks that are needed. After
> weighing the pros and cons, I really don't think that any method
> accessible via /proc is a good idea.
>
> >
> > 2. Adding a new flag into "exportfs" command, say "h", such that
> >
> > "exportfs -uh *:/export_path"
> >
> > would un-export the entry and drop the NLM locks associated with the
> > entry.
> >
>
> This is the best of the three, IMHO. It gives you the safety of
> *knowing* that the filesystem was unexported before dropping the locks,
> and prevents folks from shooting themselves in the foot.
>
> The other option that was mentioned, a separate lockd for each fs, is
> also a good idea - but would require a lot of coding no doubt, and
> introduce more instability into what I already perceive as an unstable
> NFS subsystem in Linux (I *refuse* to use Linux as an NFS server and
> instead go with Solaris - I've had *really* bad experiences with Linux
> NFS under load - but that's getting OT).