Linux NFS development
* [RFC] NLM lock failover admin interface
@ 2006-06-12  5:25 Wendy Cheng
  2006-06-12  6:11 ` Wendy Cheng
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-12  5:25 UTC (permalink / raw)
  To: nfs; +Cc: linux-cluster

NFS v2/v3 active-active NLM lock failover has been an issue with our
cluster suite. The current implementation carries the workaround as far
as it can with user-mode scripts. Upon failover, on the taken-over
server, the scripts:

1. Tear down the virtual IP.
2. Unexport the subject NFS export.
3. Signal lockd to drop the locks.
4. Unmount the filesystem if needed.

There are many other issues (such as the /var/lib/nfs/statd/sm file),
but this particular post is about refining step 3 to avoid the global
grace period (50 seconds by default) for all NFS exports; i.e., we would
like to be able to selectively drop only the locks associated with the
requested exports, without disrupting other NFS services.

We've done some prototype coding but would like to seek community
consensus on the admin interface if possible. We've tried out the
following:

1. A /proc interface: writing the fsid into a /proc directory entry
would drop all NLM locks associated with the NFS export that has that
fsid in its /etc/exports entry.

2. Adding a new flag to the "exportfs" command, say "h", such that

   "exportfs -uh *:/export_path"

would un-export the entry and drop the NLM locks associated with the
entry.

3. Adding a new nfsctl command, reusing a 2.4 kernel flag
(NFSCTL_FOLOCKS), that takes:

   struct nfsctl_folocks {
        int           type;
        unsigned int  fsid;
        unsigned int  devno;
   };

as its input argument. Depending on "type", the kernel call would drop
the locks associated with either the fsid or the devno.
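
A minimal user-space sketch of how the "type" field in option 3 might
select the match key. Only the struct layout comes from the proposal
above; the FOLOCKS_BY_FSID / FOLOCKS_BY_DEVNO values and the
folocks_matches() helper are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical selector values; the RFC does not define them. */
enum folocks_type {
	FOLOCKS_BY_FSID  = 1,
	FOLOCKS_BY_DEVNO = 2,
};

/* Layout as proposed in the RFC. */
struct nfsctl_folocks {
	int           type;
	unsigned int  fsid;
	unsigned int  devno;
};

/*
 * Return true if a file identified by (file_fsid, file_devno) is covered
 * by the failover request. Unknown types match nothing, so a malformed
 * request cannot drop unrelated locks.
 */
static bool folocks_matches(const struct nfsctl_folocks *req,
			    unsigned int file_fsid, unsigned int file_devno)
{
	assert(req != NULL);

	switch (req->type) {
	case FOLOCKS_BY_FSID:
		return req->fsid == file_fsid;
	case FOLOCKS_BY_DEVNO:
		return req->devno == file_devno;
	default:
		return false;
	}
}
```

Treating an unknown "type" as "match nothing" is a design choice of this
sketch, not something stated in the proposal.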

The core of the implementation is a cloned version of
nlm_traverse_files() that walks the "nlm_files" list entry by entry and
compares the fsid (or devno) derived from the nlm_file.f_handle field. A
helper function is also implemented to extract the fsid (or devno) from
f_handle.

The new function is designed to let failover abort if a file can't be
closed. We may also put the file locks back if an abort occurs.
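
The traversal and abort behaviour described above can be modelled in
user space roughly as follows. This is only a sketch: struct sim_file is
a simplified stand-in for the kernel's struct nlm_file, the fsid is
stored directly instead of being extracted from f_handle, and the
"check everything first, then drop" ordering is an assumption chosen so
that an abort never has to put locks back.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct nlm_file (hypothetical). */
struct sim_file {
	unsigned int     fsid;    /* would be extracted from f_handle */
	int              nlocks;  /* number of NLM locks held on the file */
	int              busy;    /* non-zero if the file cannot be closed */
	struct sim_file *next;
};

/*
 * Drop the locks of every file on the list whose fsid matches.
 * Returns the number of files whose locks were dropped, or -1 if any
 * matching file is busy, in which case no state is changed.
 */
static int sim_drop_locks_by_fsid(struct sim_file *head, unsigned int fsid)
{
	struct sim_file *f;
	int dropped = 0;

	/* Pass 1: refuse to start if a matching file cannot be closed. */
	for (f = head; f != NULL; f = f->next)
		if (f->fsid == fsid && f->busy)
			return -1;

	/* Pass 2: release locks on matching files only. */
	for (f = head; f != NULL; f = f->next) {
		if (f->fsid != fsid)
			continue;
		assert(f->nlocks >= 0);
		f->nlocks = 0;
		dropped++;
	}
	return dropped;
}
```

Files belonging to other exports are never touched, which is the point
of the selective drop.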

Would appreciate comments on the above admin interface. As soon as the
external interface can be finalized, the code will be submitted for
review.

-- Wendy


* Re: [RFC] NLM lock failover admin interface
  2006-06-12  5:25 [RFC] NLM lock failover admin interface Wendy Cheng
@ 2006-06-12  6:11 ` Wendy Cheng
  2006-06-12 15:00 ` [Linux-cluster] " J. Bruce Fields
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-12  6:11 UTC (permalink / raw)
  To: linux clustering; +Cc: nfs

On Mon, 2006-06-12 at 01:25 -0400, Wendy Cheng wrote:
> NFS v2/v3 active-active NLM lock failover has been an issue with our
> cluster suite. With current implementation, it (cluster suite) is trying
> to carry the workaround as much as it can with user mode scripts where,
> upon failover, on taken-over server, it:
> 
> 1. Tear down virtual IP.
> 2. Unexport the subject NFS export.
> 3. Signal lockd to drop the locks.
> 4. Un-mount filesystem if needed.
> 
> There are many other issues (such as /var/lib/nfs/statd/sm file, etc)
> but this particular post is to further refine step 3 to avoid the 50
> second global (default) grace period for all NFS exports; i.e., we would
> like to be able to selectively drop locks (only) associated with the
> requested exports without disrupting other NFS services. 
> 
> We've done some prototype (coding) works but would like to search for
> community consensus on the admin interface if possible. 

While ping-ponging emails with our base kernel folks (internally within
the company - mostly with steved and staubach) to choose between /proc,
exportfs, or nfsctl, Peter suggested trying multiple lockd(s) to handle
different NFS exports. In that case, we might have to change a big
portion of the lockd kernel code. I'd prefer not to go that far, since
lockd failover is our cluster suite's immediate issue. However, if this
approach gets everyone's vote, we'll comply.

-- Wendy


* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
@ 2006-06-12 14:45 Stanley, Jon
  2006-06-13  3:39 ` Wendy Cheng
  0 siblings, 1 reply; 28+ messages in thread
From: Stanley, Jon @ 2006-06-12 14:45 UTC (permalink / raw)
  To: linux clustering, nfs

 

> -----Original Message-----
> From: linux-cluster-bounces@redhat.com 
> [mailto:linux-cluster-bounces@redhat.com] On Behalf Of Wendy Cheng
> Sent: Monday, June 12, 2006 12:26 AM
> To: nfs@lists.sourceforge.net
> Cc: linux-cluster@redhat.com
> Subject: [Linux-cluster] [RFC] NLM lock failover admin interface
> 
NOTE - I don't use NFS functionality in Cluster Suite, so my comments may
be entirely meaningless.

> 
> 1. /proc interface, say writing the fsid into a /proc directory entry
> would end up dropping all NLM locks associated with the NFS 
> export that
> has fsid in its /etc/exports file.

This would definitely have its advantages for people who know what
they're doing - they could drop all locks without unexporting the
filesystem.  However, it also gives people the opportunity to shoot
themselves in the foot - by eliminating locks that are needed.  After
weighing the pros and cons, I really don't think that any method
accessible via /proc is a good idea.

> 
> 2. Adding a new flag into "exportfs" command, say "h", such that
> 
>    "exportfs -uh *:/export_path"
> 
> would un-export the entry and drop the NLM locks associated with the
> entry.
> 

This is the best of the three, IMHO.  Gives you the safety of *knowing*
that the filesystem was unexported before dropping the locks, and
preventing folks from shooting themselves in the foot.

The other option that was mentioned, a separate lockd for each fs, is
also a good idea - but would require a lot of coding no doubt, and
introduce more instability into what I already perceive as an unstable
NFS subsystem in Linux (I *refuse* to use Linux as an NFS server and
instead go with Solaris - I've had *really* bad experiences with Linux
NFS under load - but that's getting OT).


_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
  2006-06-12  5:25 [RFC] NLM lock failover admin interface Wendy Cheng
  2006-06-12  6:11 ` Wendy Cheng
@ 2006-06-12 15:00 ` J. Bruce Fields
  2006-06-12 15:44   ` [NFS] " Wendy Cheng
  2006-06-12 17:27 ` James Yarbrough
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: J. Bruce Fields @ 2006-06-12 15:00 UTC (permalink / raw)
  To: linux clustering; +Cc: nfs

On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
> 2. Adding a new flag into "exportfs" command, say "h", such that
> 
>    "exportfs -uh *:/export_path"
> 
> would un-export the entry and drop the NLM locks associated with the
> entry.

What does the kernel interface end up looking like in that case?

--b.



* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-12 15:00 ` [Linux-cluster] " J. Bruce Fields
@ 2006-06-12 15:44   ` Wendy Cheng
  2006-06-12 16:20     ` [Linux-cluster] " Madhan P
  2006-06-12 17:23     ` [Linux-cluster] " Steve Dickson
  0 siblings, 2 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-12 15:44 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: nfs, linux clustering

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

J. Bruce Fields wrote:

>On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
>
>>2. Adding a new flag into "exportfs" command, say "h", such that
>>
>>   "exportfs -uh *:/export_path"
>>
>>would un-export the entry and drop the NLM locks associated with the
>>entry.
>
>What does the kernel interface end up looking like in that case?
>
Happy to see the new exportfs flag getting a positive response - it was
our original pick too.

Attached is part of a draft version of the 2.4-based kernel patch - we're
cleaning up the 2.6 patches at this moment. It basically adds a new export
flag (NFSEXP_FOLOCK - note that ex_flags is an int but currently only has
16 bits defined) so that nfs-utils and the kernel can communicate.

The nice thing about this approach is the recovery part - the take-over
server can use the counterpart command to export and set the grace period
for one particular interface within the same system call.
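
A small sketch of the flag arithmetic, using the values from the
export.h hunk in the attached patch. It shows that the new bit sits
just above the 16 bits covered by NFSEXP_ALLFLAGS, so any code that
masks ex_flags with NFSEXP_ALLFLAGS would strip it; whether such a mask
exists on the unexport path is not shown in this thread. The two helper
functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the export.h hunk of the draft patch. */
#define NFSEXP_FSID      0x2000
#define NFSEXP_NOACL     0x8000
#define NFSEXP_ALLFLAGS  0xFFFF
#define NFSEXP_FOLOCK    0x00010000  /* NLM lock failover */

/* True if the unexport request asks for NLM locks to be dropped. */
static bool wants_lock_failover(int ex_flags)
{
	return (ex_flags & NFSEXP_FOLOCK) != 0;
}

/* True if the flag word carries bits outside the legacy 16-bit mask. */
static bool has_bits_beyond_allflags(int ex_flags)
{
	return (ex_flags & ~NFSEXP_ALLFLAGS) != 0;
}
```

For example, wants_lock_failover(NFSEXP_FSID | NFSEXP_FOLOCK) is true,
while the same word masked with NFSEXP_ALLFLAGS no longer carries the
request.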

-- Wendy

[-- Attachment #2: gfs_nlm.patch --]
[-- Type: text/plain, Size: 785 bytes --]

--- linux-2.4.21-43.EL/fs/nfsd/export.c	2006-05-14 17:16:21.000000000 -0400
+++ linux/fs/nfsd/export.c	2006-05-29 02:13:29.000000000 -0400
@@ -388,6 +388,10 @@ exp_unexport(struct nfsctl_export *nxp)
 			exp_do_unexport(exp);
 			err = 0;
 		}
+		if (nxp->ex_flags & NFSEXP_FOLOCK) {
+			dprintk("exp_unexport: nfsd_lockd_unexport called\n");
+			nfsd_lockd_unexport(clp);
+		}
 	}
 
 	exp_unlock();
--- linux-2.4.21-43.EL/include/linux/nfsd/export.h	2006-05-14 17:23:57.000000000 -0400
+++ linux/include/linux/nfsd/export.h	2006-05-29 02:12:07.000000000 -0400
@@ -42,7 +42,7 @@
 #define NFSEXP_FSID			0x2000
 #define NFSEXP_NOACL			0x8000	/* turn off acl support */
 #define NFSEXP_ALLFLAGS		0xFFFF
-
+#define NFSEXP_FOLOCK			0x00010000	/* NLM lock failover */
 
 #ifdef __KERNEL__
 


* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
  2006-06-12 15:44   ` [NFS] " Wendy Cheng
@ 2006-06-12 16:20     ` Madhan P
  2006-06-12 16:58       ` Madhan P
  2006-06-12 18:09       ` [NFS] " Wendy Cheng
  2006-06-12 17:23     ` [Linux-cluster] " Steve Dickson
  1 sibling, 2 replies; 28+ messages in thread
From: Madhan P @ 2006-06-12 16:20 UTC (permalink / raw)
  To: J. Bruce Fields, Wendy Cheng; +Cc: linux clustering, nfs

For what it's worth, I would second this approach of using a flag on
unexport and associating the cleanup with that.  Another quick hack we
used was to store the NSM entries at a standard location on the
respective exported filesystem, so that notification is sent once the
filesystem comes back online on the destination server and is exported
again.  BTW, this was not on Linux. It was a simple solution providing
the necessary active/active and active/passive cluster support.

- Madhan

>>> On 6/12/2006 at 9:14:55 pm, in message <448D8BF7.7010105@redhat.com>,
Wendy Cheng <wcheng@redhat.com> wrote:
> J. Bruce Fields wrote:
>
>>On Mon, Jun 12, 2006 at 01:25:43AM -0400, Wendy Cheng wrote:
>>
>>>2. Adding a new flag into "exportfs" command, say "h", such that
>>>
>>>   "exportfs -uh *:/export_path"
>>>
>>>would un-export the entry and drop the NLM locks associated with the
>>>entry.
>>
>>What does the kernel interface end up looking like in that case?
>
> Happy to see this new exportfs command gets positive response - it was
> our original pick too.
>
> Uploaded is part of a draft version of 2.4 base kernel patch - we're
> cleaning up 2.6 patches at this moment. It basically adds a new export
> flag (NFSEXP_FOLOCK - note that ex_flags is an int but is currently only
> defined up to 16 bits) so nfs-util and kernel can communicate.
>
> The nice thing about this approach is the recovery part - the take-over
> server can use the counter part command to export and set grace period
> for one particular interface within the same system call.
>
> -- Wendy



* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
  2006-06-12 16:20     ` [Linux-cluster] " Madhan P
@ 2006-06-12 16:58       ` Madhan P
  2006-06-12 18:09       ` [NFS] " Wendy Cheng
  1 sibling, 0 replies; 28+ messages in thread
From: Madhan P @ 2006-06-12 16:58 UTC (permalink / raw)
  To: nfs; +Cc: linux clustering

For what it's worth, I would second this approach of using a flag on
unexport and associating the cleanup with that.  Another quick hack we
used was to store the NSM entries at a standard location on the
respective exported filesystem, so that notification is sent once the
filesystem comes back online on the destination server and is exported
again.  BTW, this was not on Linux. It was a simple solution providing
the necessary active/active and active/passive cluster support.

- Madhan


* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
  2006-06-12 15:44   ` [NFS] " Wendy Cheng
  2006-06-12 16:20     ` [Linux-cluster] " Madhan P
@ 2006-06-12 17:23     ` Steve Dickson
  1 sibling, 0 replies; 28+ messages in thread
From: Steve Dickson @ 2006-06-12 17:23 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: J. Bruce Fields, linux clustering, nfs

Wendy Cheng wrote:
> The nice thing about this approach is the recovery part - the take-over 
> server can use the counter part command to export and set grace period 
> for one particular interface within the same system call.
Actually this is a pretty clean and simple interface... imho..
The only issue I had was adding a flag to an older version and then
having to carry that flag forward... So if this interface is
accepted and added to the mainline nfs-utils (which it should be.. imho),
the fact that it is so clean and simple would make the backporting fairly
trivial...

steved.



* Re: [RFC] NLM lock failover admin interface
  2006-06-12  5:25 [RFC] NLM lock failover admin interface Wendy Cheng
  2006-06-12  6:11 ` Wendy Cheng
  2006-06-12 15:00 ` [Linux-cluster] " J. Bruce Fields
@ 2006-06-12 17:27 ` James Yarbrough
  2006-06-12 19:07   ` [NFS] " Wendy Cheng
  2006-06-13  3:17 ` Neil Brown
  2006-06-13 15:23 ` James Yarbrough
  4 siblings, 1 reply; 28+ messages in thread
From: James Yarbrough @ 2006-06-12 17:27 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-cluster, nfs

> 2. Adding a new flag into "exportfs" command, say "h", such that
> 
>    "exportfs -uh *:/export_path"
> 
> would un-export the entry and drop the NLM locks associated with the
> entry.

This is fine for releasing the locks, but how do you plan to re-enter
the grace period for reclaiming the locks when you relocate the export?
And how do you intend to segregate the export for which reclaims are
valid from the ones which are not?  How do you plan to support the
sending of SM_NOTIFY?  This might be where a lockd per export has an
advantage.

-- 
jmy@sgi.com
650 933 3124

Why is there a snake in my Coke?



* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-12 16:20     ` [Linux-cluster] " Madhan P
  2006-06-12 16:58       ` Madhan P
@ 2006-06-12 18:09       ` Wendy Cheng
  1 sibling, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-12 18:09 UTC (permalink / raw)
  To: Madhan P; +Cc: linux clustering, nfs

Madhan P wrote:

>For what it's worth, would second this approach of using a flag to
>unexport and associating the cleanup with that. 
>

Happy to have another vote :)  !  It is appreciated.

> Another quick hack we
>used was to store the NSM entries on a standard location on the
>respective exported filesystem, so that notification is sent once the
>filesystem comes back online on the destination server and is exported
>again.  BTW, this was not on Linux. It was a  simple solution providing
>the necessary active/active and active/passive cluster support.
>  
>

Lon Hohberger (from our cluster suite team) has been working on a similar
setup too (to structure the NSM file directory). We'll submit the
associated kernel patch when it is ready ("rpc.statd -H" needs some
band-aids). Further reviews and comments are also appreciated.

-- Wendy


* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-12 17:27 ` James Yarbrough
@ 2006-06-12 19:07   ` Wendy Cheng
  0 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-12 19:07 UTC (permalink / raw)
  To: James Yarbrough; +Cc: linux-cluster, nfs

James Yarbrough wrote:

>>2. Adding a new flag into "exportfs" command, say "h", such that
>>
>>   "exportfs -uh *:/export_path"
>>
>>would un-export the entry and drop the NLM locks associated with the
>>entry.
>>    
>>
>
>This is fine for releasing the locks, but how do you plan to re-enter
>the grace period for reclaiming the locks when you relocate the export?
>And how do you intend to segregate the export for which reclaims are
>valid from the ones which are not?  How do you plan to support the
>sending of SM_NOTIFY?  This might be where a lockd per export has an
>advantage.
>
>  
>
Yeah, that's why Peter's idea (different lockd(s)) is also attractive.
However, on the practical side, we don't plan to introduce kernel
patches aggressively. The approach is to stay away from the mainline NLM
code base until we have had enough QA cycles to make sure things work. The
unexport part would leave other NFS services on the taken-over server
uninterrupted. On the take-over server side, we currently do a global
grace period. The plan has been to delay fixing the take-over server's
logic a little because of other NLM/POSIX lock issues - for example, the
current (Linux) NLM doesn't bother to call the filesystem's lock method
(which virtually disables any cluster filesystem's NFS locking across
different NFS servers). However, if we have enough resources and/or
volunteers, we may do these things in parallel. The following are
planned:

Take-over server logic:
1. Set up the statd sm file (currently /var/lib/nfs/statd/sm or the
    equivalent configured directory) properly.
2. Dispatch rpc.statd with the "--ha-callout" option.
3. Implement the ha-callout user-mode program to create separate
    statd sm files for each exported IP.
4. Export the target filesystem and set up a grace period based on
    fsid (or devno). NLM procedure calls will use it by extracting
    the fsid (or devno) from the NFS file handle to decide whether to
    accept or reject non-reclaim requests.
5. Bring up the failover IP address.
6. Send SM_NOTIFY to client machines using the configured sm
    directory created by the ha-callout program (rpc.statd -N -P).

Step 4 will be the counterpart of our unexport flag.
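
The per-export grace period in step 4 could look roughly like the
user-space sketch below. Everything here is hypothetical: the table,
its fixed size, and the grace_start() / grace_allows() names are
assumptions for illustration, not the planned kernel code. The sketch
only models the rule stated in step 4 (non-reclaim requests are
rejected while the export is in grace); what to do with reclaims that
arrive after the grace period is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_GRACE_ENTRIES 16   /* arbitrary limit for the sketch */

/* One entry per export that has been given a grace period. */
struct grace_entry {
	unsigned int fsid;
	time_t       expires;   /* absolute end of the grace period */
	bool         in_use;
};

static struct grace_entry grace_table[MAX_GRACE_ENTRIES];

/* Start (or restart) a grace period for one fsid. Returns 0, or -1 if full. */
static int grace_start(unsigned int fsid, time_t now, int seconds)
{
	struct grace_entry *free_slot = NULL;
	size_t i;

	assert(seconds >= 0);
	for (i = 0; i < MAX_GRACE_ENTRIES; i++) {
		if (grace_table[i].in_use && grace_table[i].fsid == fsid) {
			grace_table[i].expires = now + seconds;
			return 0;
		}
		if (!grace_table[i].in_use && free_slot == NULL)
			free_slot = &grace_table[i];
	}
	if (free_slot == NULL)
		return -1;
	free_slot->fsid = fsid;
	free_slot->expires = now + seconds;
	free_slot->in_use = true;
	return 0;
}

/*
 * Decide whether an NLM lock request for the given fsid may proceed.
 * Reclaims always pass in this sketch; non-reclaim requests pass only
 * if the export has no grace period or the period has expired.
 * Expired entries are never freed here, to keep the sketch short.
 */
static bool grace_allows(unsigned int fsid, bool reclaim, time_t now)
{
	size_t i;

	if (reclaim)
		return true;
	for (i = 0; i < MAX_GRACE_ENTRIES; i++) {
		if (grace_table[i].in_use && grace_table[i].fsid == fsid)
			return now >= grace_table[i].expires;
	}
	return true;   /* this export was never put into grace */
}
```

With a 50 second period started at time 1000 for fsid 5, a non-reclaim
request for fsid 5 is refused at time 1010 while one for fsid 6 goes
through, so exports that did not fail over are not disturbed.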

-- Wendy

 


* Re: [RFC] NLM lock failover admin interface
  2006-06-12  5:25 [RFC] NLM lock failover admin interface Wendy Cheng
                   ` (2 preceding siblings ...)
  2006-06-12 17:27 ` James Yarbrough
@ 2006-06-13  3:17 ` Neil Brown
  2006-06-13  7:00   ` [NFS] " Wendy Cheng
  2006-06-14  6:54   ` [NFS] " Wendy Cheng
  2006-06-13 15:23 ` James Yarbrough
  4 siblings, 2 replies; 28+ messages in thread
From: Neil Brown @ 2006-06-13  3:17 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-cluster, nfs

On Monday June 12, wcheng@redhat.com wrote:
> NFS v2/v3 active-active NLM lock failover has been an issue with our
> cluster suite. With current implementation, it (cluster suite) is trying
> to carry the workaround as much as it can with user mode scripts where,
> upon failover, on taken-over server, it:
> 
> 1. Tear down virtual IP.
> 2. Unexport the subject NFS export.
> 3. Signal lockd to drop the locks.
> 4. Un-mount filesystem if needed.
> 
...
>                                                                 we would
> like to be able to selectively drop locks (only) associated with the
> requested exports without disrupting other NFS services. 

There seems to be an unstated assumption here that there is one
virtual IP per exported filesystem.  Is that true?

Assuming it is and that I understand properly what you want to do....

I think that maybe the right thing to do is *not* drop the locks on a
particular filesystem, but to drop the locks made to a particular
virtual IP.

Then it would make a lot of sense to have one lockd thread per IP, and
signal the lockd in order to drop the locks.
True: that might be more code.  But if it is the right thing to do,
then it should be done that way.

On the other hand, I can see a value in removing all the locks for a
particular filesystem quite independent of failover requirements.
If I want to force-unmount a filesystem, I need to unexport it, and I
need to kill all the locks.  Currently you can only remove locks from
all filesystems, which might not be ideal.

I'm not at all keen on the NFSEXP_FOLOCK flag to exp_unexport, as that
is an interface that I would like to discard eventually.  The
preferred mechanism for exporting filesystems is to flush the
appropriate 'cache', and allow it to be repopulated with whatever is
still valid via upcalls to mountd.

So:
 I think if we really want to "remove all NFS locks on a filesystem",
 we could probably tie it into umount - maybe have lockd register some
 callback which gets called just before s_op->umount_begin.

 If we want to remove all locks that arrived on a particular
 interface, then we should arrange to do exactly that.  There are a
 number of different options here. 
  One is the multiple-lockd-threads idea.
  One is to register a callback when an interface is shut down.
  Another (possibly the best) is to arrange a new signal for lockd
  which says "Drop any locks which were sent to IP addresses that are
  no longer valid local addresses".
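
The last option can be sketched in user space as a filter over the held
locks. This assumes each lock records the server address it arrived on,
which lockd may not track today; struct sim_lock and both function
names are hypothetical. The list of currently valid local addresses is
passed in by the caller (a user-space program could build it with
getifaddrs(3)); IPv4 addresses are plain 32-bit values for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one lock plus the server address it arrived on. */
struct sim_lock {
	uint32_t server_ip;   /* IPv4 address the client sent the request to */
	bool     held;
};

/* True if ip appears in the list of currently valid local addresses. */
static bool ip_is_local(uint32_t ip, const uint32_t *local, size_t nlocal)
{
	size_t i;

	for (i = 0; i < nlocal; i++)
		if (local[i] == ip)
			return true;
	return false;
}

/*
 * Release every held lock whose server address is no longer local.
 * Returns the number of locks released.
 */
static size_t drop_locks_for_stale_ips(struct sim_lock *locks, size_t nlocks,
				       const uint32_t *local, size_t nlocal)
{
	size_t i, dropped = 0;

	assert(locks != NULL || nlocks == 0);
	for (i = 0; i < nlocks; i++) {
		if (!locks[i].held)
			continue;
		if (ip_is_local(locks[i].server_ip, local, nlocal))
			continue;
		locks[i].held = false;
		dropped++;
	}
	return dropped;
}
```

Calling it again right after a failover is harmless: locks already
released are skipped, so the second call returns 0.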

So those are my thoughts.  Do any of them seem reasonable to you?

NeilBrown




* Re: [Linux-cluster] [RFC] NLM lock failover admin interface
  2006-06-12 14:45 [Linux-cluster] " Stanley, Jon
@ 2006-06-13  3:39 ` Wendy Cheng
  0 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-13  3:39 UTC (permalink / raw)
  To: Stanley, Jon; +Cc: nfs, linux clustering

On Mon, 2006-06-12 at 09:45 -0500, Stanley, Jon wrote:
>  
> > -----Original Message-----
> > From: linux-cluster-bounces@redhat.com 
> > [mailto:linux-cluster-bounces@redhat.com] On Behalf Of Wendy Cheng
> > Sent: Monday, June 12, 2006 12:26 AM
> > To: nfs@lists.sourceforge.net
> > Cc: linux-cluster@redhat.com
> > Subject: [Linux-cluster] [RFC] NLM lock failover admin interface

Jon, thank you for reviewing this - it helps!

-- Wendy

> > 
> > 1. /proc interface, say writing the fsid into a /proc directory entry
> > would end up dropping all NLM locks associated with the NFS 
> > export that
> > has fsid in its /etc/exports file.
> 
> This would definitely have its advantages for people who know what
> they're doing - they could drop all locks without unexporting the
> filesystem.  However, it also gives people the opportunity to shoot
> themselves in the foot - by eliminating locks that are needed.  After
> weighing the pros and cons, I really don't think that any method
> accessible via /proc is a good idea.
> 
> > 
> > 2. Adding a new flag into "exportfs" command, say "h", such that
> > 
> >    "exportfs -uh *:/export_path"
> > 
> > would un-export the entry and drop the NLM locks associated with the
> > entry.
> > 
> 
> This is the best of the three, IMHO.  Gives you the safety of *knowing*
> that the filesystem was unexported before dropping the locks, and
> preventing folks from shooting themselves in the foot.
> 
> The other option that was mentioned, a separate lockd for each fs, is
> also a good idea - but would require a lot of coding no doubt, and
> introduce more instability into what I already perceive as an unstable
> NFS subsystem in Linux (I *refuse* to use Linux as an NFS server and
> instead go with Solaris - I've had *really* bad experiences with Linux
> NFS under load - but that's getting OT).
> 

* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-13  3:17 ` Neil Brown
@ 2006-06-13  7:00   ` Wendy Cheng
  2006-06-13  7:08     ` Neil Brown
  2006-06-14  6:54   ` [NFS] " Wendy Cheng
  1 sibling, 1 reply; 28+ messages in thread
From: Wendy Cheng @ 2006-06-13  7:00 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-cluster, nfs

On Tue, 2006-06-13 at 13:17 +1000, Neil Brown wrote:

> So:
>  I think if we really want to "remove all NFS locks on a filesystem",
>  we could probably tie it into umount - maybe have lockd register some
>  callback which gets called just before s_op->umount_begin.

The "umount_begin" idea was on my list at one time but got discarded. The
thinking was that nfsd is not a filesystem, and neither is lockd. How
would a non-filesystem kernel module register something with VFS umount?
Invent another autofs-like pseudo filesystem? Mainly, not every
filesystem wants to be unmounted upon failover (GFS, for example, is not
unmounted by our cluster suite upon failover).

>  If we want to remove all locks that arrived on a particular
>  interface, then we should arrange to do exactly that.  There are a
>  number of different options here. 
>   One is the multiple-lockd-threads idea.

Certainly a good option. To make it happen, we still need an admin
interface. How would the IP address be passed from user mode into the
kernel - care to give some suggestions if you have them handy? Should
socket ports be dynamically assigned? Will we have scalability issues?
 
>   One is to register a callback when an interface is shut down.
>   Another (possibly the best) is to arrange a new signal for lockd
>   which say "Drop any locks which were sent to IP addresses that are
>   no longer valid local addresses".

These, again, give individual filesystems no freedom to adjust what they
need upon failover. But I'll check them out this week - maybe there are
good socket layer hooks that I overlooked.

> 
> So those are my thoughts.  Do any of them seem reasonable to you?
> 

The comments are greatly appreciated. And hopefully we can reach
agreement soon.   

-- Wendy


* Re: [RFC] NLM lock failover admin interface
  2006-06-13  7:00   ` [NFS] " Wendy Cheng
@ 2006-06-13  7:08     ` Neil Brown
  0 siblings, 0 replies; 28+ messages in thread
From: Neil Brown @ 2006-06-13  7:08 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: nfs, linux-cluster

On Tuesday June 13, wcheng@redhat.com wrote:
> >   One is to register a callback when an interface is shut down.
> >   Another (possibly the best) is to arrange a new signal for lockd
> >   which say "Drop any locks which were sent to IP addresses that are
> >   no longer valid local addresses".
> 
> These, again, give individual filesystem no freedom to adjust what they
> need upon failover. But I'll check them out this week - maybe there are
> good socket layer hooks that I overlook. 
> 

Can you say more about what sort of adjustments an individual filesystem
might want the freedom to make?  It might help me understand the
issues better.

Thanks,
NeilBrown



* Re: [RFC] NLM lock failover admin interface
  2006-06-12  5:25 [RFC] NLM lock failover admin interface Wendy Cheng
                   ` (3 preceding siblings ...)
  2006-06-13  3:17 ` Neil Brown
@ 2006-06-13 15:23 ` James Yarbrough
  4 siblings, 0 replies; 28+ messages in thread
From: James Yarbrough @ 2006-06-13 15:23 UTC (permalink / raw)
  To: Wendy Cheng, Neil Brown; +Cc: linux-cluster, nfs


> There seems to be an unstated assumption here that there is one
> virtual IP per exported filesystem.  Is that true?

This is the normal case for such HA services.  There may actually be
a single IP address covering multiple filesystems and/or NFS exports.

> I think that maybe the right thing to do is *not* drop the locks on a
> particular filesystem, but to drop the locks made to a particular
> virtual IP.

For filesystems such as ext2 or xfs, you unmount the filesystem on the
current server and mount it on the new server when doing a failover.
In this case, you have to be able to get rid of all the locks first and
you do that for the entire filesystem.  For a cluster filesystem such as
cxfs, you don't actually unmount the filesystem, so you really need the
per-IP address approach.

> If I want to force-unmount a filesystem, I need to unexport it, and I
> need to kill all the locks.  Currently you can only remove locks from
> all filesystems, which might not be ideal.

This is definitely less than ideal.  This will force notification and
reclaim for all exported filesystems.  This can be a significant problem.

jmy@sgi.com
650 933 3124

Why is there a snake in my Coke?



* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-13  3:17 ` Neil Brown
  2006-06-13  7:00   ` [NFS] " Wendy Cheng
@ 2006-06-14  6:54   ` Wendy Cheng
  2006-06-14 11:36     ` Christoph Hellwig
                       ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-14  6:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-cluster, nfs

Hi,

KABI (kernel application binary interface) commitment is a big thing
from our end - so I would like to focus more on the interface agreement
before jumping into coding and implementation details. 

>   One is the multiple-lockd-threads idea.

Assuming this is still on the table: could I expect the admin interface
to go through the rpc.lockd command (with man page and nfs-utils code
changes)? The modified command would take options similar to rpc.statd's,
specifically -n, -o, and -p (see "man rpc.statd"). To pass the individual
IP (socket address) to the kernel, we would need nfsctl with a modified
struct nfsctl_svc.

For the kernel piece, since we're there anyway, could we have the
individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
This would allow statd to structure its SM files based on each lockd IP
address, an important part of lock recovery.

>   One is to register a callback when an interface is shut down.

Haven't checked out the (Linux) socket interface yet, and I'm very fuzzy
on how this could be done. Does anyone have good ideas?

>   Another (possibly the best) is to arrange a new signal for lockd
>   which say "Drop any locks which were sent to IP addresses that are
>   no longer valid local addresses".

Very appealing - but the devil's always in the details. How do we decide
which IP address is no longer valid? How does lockd know about these
IP addresses? And how do we associate one particular IP address with the
"struct nlm_file" entries within the nlm_files list? I need a few more
days to sort this out (unless anyone already has ideas in mind).

-- Wendy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-14  6:54   ` [NFS] " Wendy Cheng
@ 2006-06-14 11:36     ` Christoph Hellwig
  2006-06-14 13:39       ` Wendy Cheng
  2006-06-14 14:00     ` Wendy Cheng
  2006-06-15  4:27     ` [NFS] " Neil Brown
  2 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2006-06-14 11:36 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: Neil Brown, nfs, linux-cluster

On Wed, Jun 14, 2006 at 02:54:51AM -0400, Wendy Cheng wrote:
> Hi,
> 
> KABI (kernel application binary interface) commitment is a big thing
> from our end - so I would like to focus more on the interface agreement
> before jumping into coding and implementation details. 

Please stop this crap now.  If you don't get that there is no kernel internal
ABI and there never will be, get a different job ASAP.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-14 11:36     ` Christoph Hellwig
@ 2006-06-14 13:39       ` Wendy Cheng
  0 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-14 13:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Neil Brown, nfs, linux-cluster

On Wed, 2006-06-14 at 12:36 +0100, Christoph Hellwig wrote:
> On Wed, Jun 14, 2006 at 02:54:51AM -0400, Wendy Cheng wrote:
> > Hi,
> > 
> > KABI (kernel application binary interface) commitment is a big thing
> > from our end - so I would like to focus more on the interface agreement
> > before jumping into coding and implementation details. 
> 
> Please stop this crap now.  If you don't get that there is no kernel internal
> ABI and there never will be, get a different job ASAP.

Actually I don't quite understand this statement (sorry! English is not
my native language) but it is ok. People are entitled to different
opinions and I respect yours.  

On the technical side, this is just a precaution: in case we need to touch
some exported kernel symbols, it would be nice to have the external (and
admin) interfaces decided before we start to code. 

So I'll not talk about this any more, and I assume we can keep focusing on
NLM issues. No more noise from either of us. Fair ?

-- Wendy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-14  6:54   ` [NFS] " Wendy Cheng
  2006-06-14 11:36     ` Christoph Hellwig
@ 2006-06-14 14:00     ` Wendy Cheng
  2006-06-15 14:07       ` [NFS] " William A.(Andy) Adamson
  2006-06-15  4:27     ` [NFS] " Neil Brown
  2 siblings, 1 reply; 28+ messages in thread
From: Wendy Cheng @ 2006-06-14 14:00 UTC (permalink / raw)
  To: linux clustering; +Cc: Neil Brown, nfs

On Wed, 2006-06-14 at 02:54 -0400, Wendy Cheng wrote:

> 
> Assume we still have this on the table.... Could I expect the admin
> interface goes thru rpc.lockd command (man page and nfs-util code
> changes) ? The modified command will take similar options as rpc.statd;
> more specifically, the -n, -o, and -p (see "man rpc.statd"). To pass the
> individual IP (socket address) to kernel, we'll need nfsctl with struct
> nfsctl_svc modified.

I want to make sure people catch this. Here we're talking about NFS
system call interface changes. We need either a new NFS syscall or a
change to the existing nfsctl_svc structure.

-- Wendy

> 
> For the kernel piece, since we're there anyway, could we have the
> individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> This would allow statd to structure its SM files based on each lockd IP
> address, an important part of lock recovery.
> 
> >   One is to register a callback when an interface is shut down.
> 
> Haven't checked out (linux) socket interface yet. I'm very fuzzy how
> this can be done. Anyone has good ideas ? 
> 
> >   Another (possibly the best) is to arrange a new signal for lockd
> >   which say "Drop any locks which were sent to IP addresses that are
> >   no longer valid local addresses".
> 
> Very appealing - but the devil's always in the details. How to decide
> which IP address is no longer valid ? Or how does lockd know about these
> IP addresses ? And how to associate one particular IP address with the
> "struct nlm_file" entries within nlm_files list ? Need few more days to
> sort this out (or any one already has ideas in mind ?).
> 
> -- Wendy
> 
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-14  6:54   ` [NFS] " Wendy Cheng
  2006-06-14 11:36     ` Christoph Hellwig
  2006-06-14 14:00     ` Wendy Cheng
@ 2006-06-15  4:27     ` Neil Brown
  2006-06-15  6:39       ` Wendy Cheng
  2 siblings, 1 reply; 28+ messages in thread
From: Neil Brown @ 2006-06-15  4:27 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: nfs, linux-cluster

On Wednesday June 14, wcheng@redhat.com wrote:
> Hi,
> 
> KABI (kernel application binary interface) commitment is a big thing
> from our end - so I would like to focus more on the interface agreement
> before jumping into coding and implementation details. 
> 

Before we can agree on an interface, we need to be clear what
functionality is required.

You started out suggesting that the required functionality was to
"remove all locks that lockd holds on a particular filesystem".

I responded that I suspect a better functionality was "remove all
locks that lockd holds on behalf of a particular IP address".

You replied that such an approach

>  give[s] individual filesystem no freedom to adjust what they
> need upon failover. 

I asked:
> Can you say more about what sort of adjustments an individual filesystem
> might want the freedom to make?  It might help me understand the
> issues better.

and am still waiting for an answer.  Without an answer, I still lean
towards an IP-address based approach, and the reply from James
Yarbrough seems to support that (though I don't want to read too much
into his comments).

Lockd is not currently structured to associate locks with
server-ip-addresses.  There is an assumption that one client may talk
to any of the IP addresses that the server supports.  This is clearly
not the case for the failover scenario that you are considering, so a
little restructuring might be in order.

Some locks will be held on behalf of a client, no matter what
interface the requests arrive on.  Other locks will be held on behalf
of a client and tied to a particular server IP address.  Probably the
easiest way to make this distinction is as a new nfsd export flag.

So, maybe something like this:

  Add a 'struct sockaddr_in' to 'struct nlm_file'.
  If nlm_fopen returns (say) 3, then treat it as success, and
    also copy rqstp->rq_addr into that 'sockaddr_in'.
  Define a new file in the 'nfsd' filesystem into which can
    be written an IP address and which calls some new lockd
    function which releases all locks held for that IP address.
  Probably get nlm_lookup_file to insist that if the sockaddr_in
    is defined in a lock, it must match the one in rqstp.
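
To make the steps above concrete, here is a small userspace model in C.
It is only a sketch under stated assumptions, not lockd code: 'struct
file_rec', 'release_locks_for_addr' and 'rec_matches_request' are
hypothetical names, and the server address is held as a plain IPv4
integer (0 meaning "not tied to any server address") rather than a real
'struct sockaddr_in'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-file lock record. The real
 * 'struct nlm_file' has different fields; 'server_addr' models the
 * proposed extra address field (0 = not tied to a server address). */
struct file_rec {
    uint32_t server_addr;   /* server IP the locks arrived on */
    int nlocks;             /* number of locks currently held */
    struct file_rec *next;  /* next record in the list */
};

/* Drop every lock recorded against 'addr'.
 * Returns the number of locks that were dropped. */
static int release_locks_for_addr(struct file_rec *head, uint32_t addr)
{
    int dropped = 0;

    assert(addr != 0);  /* 0 is reserved for "untied" records */
    for (struct file_rec *f = head; f != NULL; f = f->next) {
        if (f->server_addr == addr) {
            dropped += f->nlocks;
            f->nlocks = 0;
        }
    }
    return dropped;
}

/* Model of the lookup check: a request that arrived on 'req_addr' may
 * use a record only if the record is untied or tied to that address. */
static int rec_matches_request(const struct file_rec *f, uint32_t req_addr)
{
    return f->server_addr == 0 || f->server_addr == req_addr;
}
```

In this model, records with 'server_addr' set to 0 survive every release
call, which corresponds to the locks held "no matter what interface the
requests arrive on".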

Does that sound OK ?


> >   One is the multiple-lockd-threads idea.
> 
> Assume we still have this on the table.... Could I expect the admin
> interface goes thru rpc.lockd command (man page and nfs-util code
> changes) ? The modified command will take similar options as rpc.statd;
> more specifically, the -n, -o, and -p (see "man rpc.statd"). To pass the
> individual IP (socket address) to kernel, we'll need nfsctl with struct
> nfsctl_svc modified.

I'm losing interest in the multiple-lockd-threads approach myself (for
the moment anyway :-)
However I would be against trying to re-use rpc.lockd - that was a
mistake that is best forgotten.
If the above approach were taken, then I don't think you need anything
more than
   echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
(or whatever), though if you really want to wrap that in a shell
script that might be ok.
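
For a cluster manager that would rather not shell out, the same write
could be done from C. The sketch below is an assumption-laden
illustration: 'write_unlock_request' is a hypothetical helper, the
control file is only a proposal in this thread (so the path is passed in
by the caller), and the one-address-per-line format is a guess.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Validate 'ip' as a dotted-quad IPv4 address, then write it followed
 * by a newline to the control file at 'path'.
 * Returns 0 on success, -1 on a bad address or an I/O error. */
static int write_unlock_request(const char *path, const char *ip)
{
    struct in_addr addr;
    FILE *fp;
    int ok;

    assert(path != NULL && ip != NULL);

    /* inet_pton() returns 1 only for a well-formed address. */
    if (inet_pton(AF_INET, ip, &addr) != 1)
        return -1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    ok = fprintf(fp, "%s\n", ip) > 0;
    if (fclose(fp) != 0)  /* write errors may only surface at close */
        ok = 0;
    return ok ? 0 : -1;
}
```

Validating the address before opening the file keeps malformed input
away from the kernel interface entirely.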

> 
> For the kernel piece, since we're there anyway, could we have the
> individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> This would allow statd to structure its SM files based on each lockd IP
> address, an important part of lock recovery.
> 

Maybe....  but I don't get the scenario.
Surely the SM files are only needed when the server restarts, and in
that case it needs to notify all clients... Or is it that you want to
make sure the notification comes from the right IP address.... I guess
that would make sense.  Is that what you are after?


> >   One is to register a callback when an interface is shut down.
> 
> Haven't checked out (linux) socket interface yet. I'm very fuzzy how
> this can be done. Anyone has good ideas ? 

No good idea, but I have a feeling there is a callback we could use.
However I think I am going off this idea.

> 
> >   Another (possibly the best) is to arrange a new signal for lockd
> >   which say "Drop any locks which were sent to IP addresses that are
> >   no longer valid local addresses".
> 
> Very appealing - but the devil's always in the details. How to decide
> which IP address is no longer valid ? Or how does lockd know about these
> IP addresses ? And how to associate one particular IP address with the
> "struct nlm_file" entries within nlm_files list ? Need few more days to
> sort this out (or any one already has ideas in mind ?).

See above.

NeilBrown

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [NFS] [RFC] NLM lock failover admin interface
  2006-06-15  4:27     ` [NFS] " Neil Brown
@ 2006-06-15  6:39       ` Wendy Cheng
  2006-06-15  8:02         ` Neil Brown
  0 siblings, 1 reply; 28+ messages in thread
From: Wendy Cheng @ 2006-06-15  6:39 UTC (permalink / raw)
  To: Neil Brown; +Cc: nfs, linux-cluster

On Thu, 2006-06-15 at 14:27 +1000, Neil Brown wrote:

> You started out suggesting that the required functionality was to
> "remove all locks that lockd holds on a particular filesystem".

I didn't make this clear. No, we don't want to "remove all locks
associated with a particular filesystem". We want to "remove all locks
associated with an NFS service" - one NFS service is normally associated
with one NFS export. For example, say in /etc/exports:

/mnt/export_fs/dir_1     *(fsid=1,async,rw)
/mnt/export_fs/dir_2     *(fsid=2,async,rw)

The same filesystem (export_fs) is exported via two entries, each with
its own fsid. The "fsid" is eventually encoded as part of the filehandle
stored in "struct nlm_file" and linked into the global nlm_files list. 
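
As a rough illustration of matching by fsid, here is a userspace model
in C. Everything in it is hypothetical: 'struct toy_fh' is a
deliberately simplified file handle (real knfsd handles have a richer
layout), and 'struct toy_file' and 'drop_locks_for_fsid' are made-up
names standing in for the nlm_file list and the walk over it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified file handle carrying the export's fsid. */
struct toy_fh {
    uint32_t fsid;    /* value of fsid= from /etc/exports */
    uint32_t fileid;  /* identifies the file within the export */
};

/* Hypothetical per-file record holding the handle and a lock state. */
struct toy_file {
    struct toy_fh fh;
    int held;         /* nonzero while locks are held on the file */
};

/* Release the locks on every file whose handle carries 'fsid'.
 * Returns the number of files whose locks were released. */
static size_t drop_locks_for_fsid(struct toy_file *files, size_t n,
                                  uint32_t fsid)
{
    size_t dropped = 0;

    assert(files != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (files[i].fh.fsid == fsid && files[i].held) {
            files[i].held = 0;
            dropped++;
        }
    }
    return dropped;
}
```

With the two exports above, dropping fsid 1 would leave the locks taken
through the dir_2 export (fsid 2) untouched, even though both live on
the same filesystem.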

This is to allow, not only active-active failover (for local filesystem
such as ext3), but also load balancing for cluster file systems (such as
GFS). 

In reality, each NFS service is associated with one virtual IP. The
failover and load-balancing tasks are carried out by moving the virtual
IP around - so I'm ok with the idea of "remove all locks that lockd
holds on behalf of a particular IP address".
  
> 
> Lockd is not currently structured to associate locks with
> server-ip-addresses.  There is an assumption that one client may talk
> to any of the IP addresses that the server supports.  This is clearly
> not the case for the failover scenario that you are considering, so a
> little restructuring might be in order.
> 
> Some locks will be held on behalf of a client, no matter what
> interface the requests arrive on.  Other locks will be held on behalf
> of a client and tied to a particular server IP address.  Probably the
> easiest way to make this distinction is as a new nfsd export flag.

We're very close now - note that I originally proposed adding a new nfsd
export flag (NFSEXP_FOLOCKS) so we can OR it into export's ex_flag upon
un-export. If the new action flag is set, a new sub-call added into
unexport kernel routine will walk thru nlm_file to find the export entry
(matched by either fsid or devno, taken from filehandle, within nlm_file
struct); then subsequently release the lock.   

The ex_flag is an "int" but only its low 16 bits are currently used. So my
new export flag is defined as: NFSEXP_FOLOCKS 0x00010000. 
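
The bit arithmetic can be checked with a few lines of C. This is a
sketch only: NFSEXP_FOLOCKS uses the value proposed here, while
LOW16_FLAG_MASK, 'wants_lock_drop' and 'set_lock_drop' are hypothetical
names, and the mask simply encodes the statement that the low 16 bits
are the ones in use.

```c
#include <assert.h>

#define NFSEXP_FOLOCKS  0x00010000  /* value proposed in this message */
#define LOW16_FLAG_MASK 0x0000FFFF  /* assumption: bits already in use */

/* Return nonzero if the caller asked for locks to be dropped. */
static int wants_lock_drop(int ex_flags)
{
    assert(ex_flags >= 0);  /* the sign bit is not a valid flag here */
    return (ex_flags & NFSEXP_FOLOCKS) != 0;
}

/* Return 'ex_flags' with the lock-drop request added; all existing
 * flag bits are left unchanged. */
static int set_lock_drop(int ex_flags)
{
    return ex_flags | NFSEXP_FOLOCKS;
}
```

Because 0x00010000 is bit 16, it cannot collide with any flag stored in
bits 0 to 15.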

> 
> So, maybe something like this:
> 
>   Add a 'struct sockaddr_in' to 'struct nlm_file'.
>   If nlm_fopen returns (say) 3, then treat it as success, and
>     also copy rqstp->rq_addr into that 'sockaddr_in'.
>   define a new file in the 'nfsd' filesystem into which can
>     be written an IP address and which calls some new lockd
>     function which releases all locks held for that IP address.
>   Probably get nlm_lookup_file to insist that if the sockaddr_in
>     is defined in a lock, it must match the one in rqstp

Yes, we definitely can do this but there is a "BUT" from our end. What I
did in my prototyping code is taking filehandle from nlm_file structure
and yank the fsid (or devno) out of it (so we didn't need to know the
socket address). With (your) above approach, adding a new field into
"struct nlm_file" to hold the sock addr, sad to say, violates our KABI
policy. 

I learnt my lesson. Forget KABI for now. Let me see what you have in the
next paragraph (so I know how to respond ...)

> 
> 
> > >   One is the multiple-lockd-threads idea.
> 
> I'm losing interest in the multiple-lockd-threads approach myself (for
> the moment anyway :-)
> However I would be against trying to re-use rpc.lockd - that was a
> mistake that is best forgotten.
> If the above approach were taken, then I don't think you need anything
> more than
>    echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
> (or whatever), though if you really want to wrap that in a shell
> script that might be ok.

This is funny - so we go back to /proc. OK with me :) but you may want
to re-think my exportfs command approach. Want me to go over the
unexport flow again ? The idea is to add a new user mode flag, say "-h".
If you unexport the interface as:

shell> exportfs -u *:/export_path   // nothing happens, old behavior

but if you do:

shell> exportfs -hu *:/export_path   // the kernel code would walk thru 
                                     // nlm_file list to release the
                                     // locks.

The "-h" ORs 0x00010000 into the ex_flags field of struct nfsctl_export so
kernel can know what to do. With fsid (or devno) in filehandle within
nlm_file, we don't need socket address at all. 

But again, I'm OK with /proc approach. However, with /proc approach, we
may need socket address (since not every export uses fsid and devno is
not easy to get).

Do we agree now ? In short, I prefer my original "exportfs -hu"
approach. But I'm ok with /proc if you insist.


> 
> > 
> > For the kernel piece, since we're there anyway, could we have the
> > individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> > This would allow statd to structure its SM files based on each lockd IP
> > address, an important part of lock recovery.
> > 
> 
> Maybe....  but I don't get the scenario.
> Surely the SM files are only needed when the server  restarts, and in
> that case it needs to notify all clients... Or is it that you want to
> make sure the notification comes from the right IP address.... I guess
> that would make sense.  Is that what you are after?

Yes ! Right now, lockd doesn't pass the specific server address (that
client connects to) to statd. I don't know how the "-H" can ever work.
Consider this a bug. If you forget what "rpc.statd -H" is, check out the
man page (man rpc.statd).

Thank you for the patience - I'm grateful.

-- Wendy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] NLM lock failover admin interface
  2006-06-15  6:39       ` Wendy Cheng
@ 2006-06-15  8:02         ` Neil Brown
  2006-06-15 18:43           ` Wendy Cheng
  0 siblings, 1 reply; 28+ messages in thread
From: Neil Brown @ 2006-06-15  8:02 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: nfs, linux-cluster

On Thursday June 15, wcheng@redhat.com wrote:
> On Thu, 2006-06-15 at 14:27 +1000, Neil Brown wrote:
> 
> > You started out suggesting that the required functionality was to
> > "remove all locks that lockd holds on a particular filesystem".
> 
> I didn't make this clear. No, we don't want to "remove all locks
> associated with a particular filesystem". We want to "remove all locks
> associated with an NFS service" - one NFS service is normally associated
> with one NFS export. For example, say in /etc/exports:
> 
> /mnt/export_fs/dir_1     *(fsid=1,async,rw)
> /mnt/export_fs/dir_2     *(fsid=2,async,rw)

That makes sense.

> 
> One same filesystem (export_fs) is exported via two entries, each with
> its own fsid. The "fsid" is eventually encoded as part of the filehandle
> stored into "struct nlm_file" and linked into nlm_file global list. 
> 
> This is to allow, not only active-active failover (for local filesystem
> such as ext3), but also load balancing for cluster file systems (such as
> GFS). 

Could you please explain to me what "active-active failover for local
filesystem such as ext3" means (I'm not very familiar with cluster
terminology).
It sounds like the filesystem is active on two nodes at once, which of
course cannot work for ext3, so I am confused.
And if you are doing "failover", what has failed?

The load-balancing scenario makes sense (at least so far...).

> 
> In reality, each NFS service is associated with one virtual IP. The
> failover and load-balancing tasks are carried out by moving the virtual
> IP around - so I'm ok with the idea of "remove all locks that lockd
> holds on behalf of a particular IP address".
>   

Good. :-)

> > 
> > Lockd is not currently structured to associate locks with
> > server-ip-addresses.  There is an assumption that one client may talk
> > to any of the IP addresses that the server supports.  This is clearly
> > not the case for the failover scenario that you are considering, so a
> > little restructuring might be in order.
> > 
> > Some locks will be held on behalf of a client, no matter what
> > interface the requests arrive on.  Other locks will be held on behalf
> > of a client and tied to a particular server IP address.  Probably the
> > easiest way to make this distinction is as a new nfsd export flag.
> 
> We're very close now - note that I originally proposed adding a new nfsd
> export flag (NFSEXP_FOLOCKS) so we can OR it into export's ex_flag upon
> un-export. If the new action flag is set, a new sub-call added into
> unexport kernel routine will walk thru nlm_file to find the export entry
> (matched by either fsid or devno, taken from filehandle, within nlm_file
> struct); then subsequently release the lock.   
> 
> The ex_flag is an "int" but currently only used up to 16 bit. So my new
> export flag is defined as: NFSEXP_FOLOCKS 0x00010000. 
> 

Our two export flags mean VERY different things.
Mine says 'locks against this export are per-server-ip-address'.
Yours says (I think) 'remove all lockd locks from this export' and is
really an unexport flag, not an export flag.

And this makes it not really workable.  We no longer require the use
of the nfssvc syscall to unexport filesystems.  In fact nfs-utils doesn't
use it at all if /proc/fs/nfsd is mounted.  Filesystems are unexported
by their entry in the export cache expiring, or the cache being
flushed.

There is simply no room in the current knfsd design for an unexport
flag - sorry ;-(


> > 
> > So, maybe something like this:
> > 
> >   Add a 'struct sockaddr_in' to 'struct nlm_file'.
> >   If nlm_fopen returns (say) 3, then treat it as success, and
> >     also copy rqstp->rq_addr into that 'sockaddr_in'.
> >   define a new file in the 'nfsd' filesystem into which can
> >     be written an IP address and which calls some new lockd
> >     function which releases all locks held for that IP address.
> >   Probably get nlm_lookup_file to insist that if the sockaddr_in
> >     is defined in a lock, it must match the one in rqstp
> 
> Yes, we definitely can do this but there is a "BUT" from our end. What I
> did in my prototyping code is taking filehandle from nlm_file structure
> and yank the fsid (or devno) out of it (so we didn't need to know the
> socket address). With (your) above approach, adding a new field into
> "struct nlm_file" to hold the sock addr, sadly say, violates our KABI
> policy. 

Does it?
'struct nlm_file' is a structure that is entirely local to lockd.
It does not feature in any of the interface between lockd and any
other part of the kernel.  It is not part of any credible KABI.
The other changes I suggest involve adding an exported symbol to
lockd, which does change the KABI but in a completely back-compatible
way, and re-interpreting the return value of a callout.  
That could not break any external module - it could only break
someone's setup if they had an alternate lockd module, but I don't think
your KABI policy allows people to replace modules and stay supported.

However, as you say....

> 
> I learnt my lesson. Forget KABI for now. Let me see what you have in the
> next paragraph (so I can know how to response ...)
> 

....we aren't going to let KABI issues get in our way.

> > 
> > 
> > > >   One is the multiple-lockd-threads idea.
> > 
> > I'm losing interest in the multiple-lockd-threads approach myself (for
> > the moment anyway :-)
> > However I would be against trying to re-use rpc.lockd - that was a
> > mistake that is best forgotten.
> > If the above approach were taken, then I don't think you need anything
> > more than
> >    echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
> > (or whatever), though if you really want to wrap that in a shell
> > script that might be ok.
> 
> This is funny - so we go back to /proc. OK with me :)

Only sort-of back to /proc.  /proc/fs/nfsd is a separate filesystem
which happens to be mounted there normally.
The unexport system call goes through this exact same filesystem
(though it is somewhat under-the-hood) so at that level, we are
really proposing the same style of interface implementation.

>                                                       but you may want
> to re-think my exportfs command approach. Want me to go over the
> unexport flow again ? The idea is to add a new user mode flag, say "-h".
> If you unexport the interface as:
> 
> shell> exportfs -u *:/export_path   // nothing happens, old behavior
> 
> but if you do:
> 
> shell> exportfs -hu *:/export_path   // the kernel code would walk thru 
>                                      // nlm_file list to release the
>                                      // locks.
> 
> The "-h" ORs 0x00010000 into the ex_flags field of struct nfsctl_export so
> kernel can know what to do. With fsid (or devno) in filehandle within
> nlm_file, we don't need socket address at all. 

But apart from nfsctl_export being a dead end, this is still
exportpoint specific rather than IP address specific.

> 
> But again, I'm OK with /proc approach. However, with /proc approach, we
> may need socket address (since not every export uses fsid and devno is
> not easy to get).

Absolutely. We need a socket address.  
As part of this process you are shutting down an interface.  We know
(or can easily discover) the address of that interface.  That is
exactly the address that we feed to nfsd.

> 
> Do we agree now ? In short, I prefer my original "exportfs -hu"
> approach. But I'm ok with /proc if you insist.
> 

I'm not at an 'insist'ing stage at the moment - I like to at least
pretend to be open minded :-)

The main thing I don't like about your "exportfs -hu" approach is that
I don't think it will work (actually, looking at nfs-utils, I'm not so
sure that "exportfs -u" will work at all if you don't have 
/proc/fs/nfsd mounted....)

The other thing I don't like is that it doesn't address your primary
need - decommissioning an IP address.
Rather it addresses a secondary need - removing some locks from some
filesystems. 

But I'm still open to debate...

> 
> > 
> > > 
> > > For the kernel piece, since we're there anyway, could we have the
> > > individual lockd IP interface passed to SM (statd) (in SM_MON call) ?
> > > This would allow statd to structure its SM files based on each lockd IP
> > > address, an important part of lock recovery.
> > > 
> > 
> > Maybe....  but I don't get the scenario.
> > Surely the SM files are only needed when the server  restarts, and in
> > that case it needs to notify all clients... Or is it that you want to
> > make sure the notification comes from the right IP address.... I guess
> > that would make sense.  Is that what you are after?
> 
> Yes ! Right now, lockd doesn't pass the specific server address (that
> client connects to) to statd. I don't know how the "-H" can ever work.
> Consider this a bug. If you forget what "rpc.statd -H" is, check out the
> man page (man rpc.statd).

I have to admit I have never given that code a lot of attention.  I
reviewed it when it was sent - it seemed to make sense and had no obvious
problems - so I accepted it.  I wouldn't be enormously surprised if it
didn't work in some situations.

> 
> Thank you for the patience - I'm grateful.

Ditto.
Conversations work much better when people are patient and polite.

Thanks,
NeilBrown



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [NFS] Re: [RFC] NLM lock failover admin interface
  2006-06-14 14:00     ` Wendy Cheng
@ 2006-06-15 14:07       ` William A.(Andy) Adamson
  2006-06-15 15:09         ` Wendy Cheng
  2006-06-16  6:09         ` [Linux-cluster] " Neil Brown
  0 siblings, 2 replies; 28+ messages in thread
From: William A.(Andy) Adamson @ 2006-06-15 14:07 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: Neil Brown, nfs, linux clustering

This discussion has centered around removing the locks of an export.
We also want the interface to be able to remove the locks owned by a single 
client. This is needed to enable client migration between replicas or between 
nodes in a cluster file system. It is not acceptable to place an entire export 
in grace just to move a small number of clients.

-->Andy

wcheng@redhat.com said:
> On Wed, 2006-06-14 at 02:54 -0400, Wendy Cheng wrote:
> 
> Assume we still have this on the table.... Could I expect the admin
> interface goes thru rpc.lockd command (man page and nfs-util code
> changes) ? The modified command will take similar options as rpc.statd;
> more specifically, the -n, -o, and -p (see "man rpc.statd"). To pass the
> individual IP (socket address) to kernel, we'll need nfsctl with struct
> nfsctl_svc modified.
>
> I want to make sure people catch this. Here we're talking about NFS system
> call interface changes. We need either a new NFS syscall or altering the
> existing nfsctl_svc structure.

> -- Wendy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [NFS] Re: [RFC] NLM lock failover admin interface
  2006-06-15 14:07       ` [NFS] " William A.(Andy) Adamson
@ 2006-06-15 15:09         ` Wendy Cheng
  2006-06-16  6:09         ` [Linux-cluster] " Neil Brown
  1 sibling, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-15 15:09 UTC (permalink / raw)
  To: William A.(Andy) Adamson; +Cc: Neil Brown, nfs, linux clustering

William A.(Andy) Adamson wrote:

>This discussion has centered around removing the locks of an export.
>We also want the interface to be able to remove the locks owned by a single 
>client. This is needed to enable client migration between replicas or between 
>nodes in a cluster file system. It is not acceptable to place an entire export 
>in grace just to move a small number of clients.
>
>  
>
Andy,

Gotcha ... forgot about NFS V4. BTW, the discussion has moved back to 
/proc interface. I agree we need to add one more layer of granularity 
into it. Glad you caught this flaw.

-- Wendy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] NLM lock failover admin interface
  2006-06-15  8:02         ` Neil Brown
@ 2006-06-15 18:43           ` Wendy Cheng
  0 siblings, 0 replies; 28+ messages in thread
From: Wendy Cheng @ 2006-06-15 18:43 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux clustering, nfs

Neil Brown wrote:

>Could you please explain to me what "active-active failover for local
>filesystem such as ext3" means 
>
Clustering is a broad subject, so the term may mean different things 
to different people. The setup we discuss here is to move an NFS service 
from one server to the other while both servers are up and running 
(active-active). The goal is not to disturb other NFS services that are 
not involved with the transition.

>It sounds like the filesystem is active on two nodes at once, which of
>course cannot work for ext3, so I am confused.
>And if you are doing "failover", what has failed?
>
>The load-balancing scenario makes sense (at least so far...).
>  
>
A local filesystem such as ext3 will never be mounted on more than one 
node, but cluster filesystems (e.g. our GFS) will. Moving ext3 normally 
implies error conditions (a true failover) though in rare cases, it may 
be kicked off for load balancing purpose. Current GFS locking has the 
"node-id" concept - the easiest way (at this moment) for virtual IP to 
float around is to drop the locks and let NLM reclaim the locks from the 
new server.

>
>Our two export flags mean VERY different things.
>Mine says 'locks against this export are per-server-ip-address'.
>Yours says (I think) 'remove all lockd locks from this export' and is
>really an unexport flag, not an export flag.
>
>And this makes it not really workable.  We no-longer require the user
>of the nfssvc syscall to unexport filesystems.  In fact nfs-utils doesn't
>use it at all if /proc/fs/nfsd is mounted.  filesystems are unexported
>by their entry in the export cache expiring, or the cache being
>flushed.
>  
>
The important thing (for me) is the vfsmount reference count, which can 
only be properly decreased when unexport is triggered. Without 
decreasing the vfsmount reference count, ext3 cannot be un-mounted (and 
we need to umount ext3 upon failover). I haven't looked into community 
versions of the kernel source for a while (but I'll check). So what can I 
do to ensure this will happen ? - i.e., after the filesystem has been 
accessed by nfsd, how can I safely un-mount it without shutting down nfsd 
(and/or lockd) ?   

>'struct nlm_file' is a structure that is entirely local to lockd.
>It does not feature in any of the interface between lockd and any
>other part of the kernel.  It is not part of any credible KABI.
>The other changes I suggest involve adding an exported symbol to
>lockd, which does change the KABI but in a completely back-compatible
>way, and re-interpreting the return value of a callout.  
>That could not break any external module - it could only break
>someone's setup if they had an alternate lockd module, but I don't think
>your KABI policy allows people to replace modules and stay supported,
>  
>
Yes, you're right! I looked into the wrong code (well, it was late at 
night so I was not very functional at that moment). I had some 
prototype code where I transported the nlm_file from one server to 
another, experimenting with auto-reclaiming locks without statd. I 
exported the nlm_file list there. So let's forget about this.

>>>>>  One is the multiple-lockd-threads idea.
>>>>>          
>>>>>
>>>I'm losing interest in the multiple-lockd-threads approach myself (for
>>>the moment anyway :-)
>>>      
>>>
Good! Because I'm not sure whether we'll hit scalability issues or not 
(100 NFS services implies 100 lockd threads!).

>>>However I would be against trying to re-use rpc.lockd - that was a
>>>mistake that is best forgotten.
>>>      
>>>
Highlight this :) ... It gives me some comfort that I'm not the 
only person who makes mistakes.

>>>If the above approach were taken, then I don't think you need anything
>>>more than
>>>   echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock
>>>(or whatever), though if you really want to wrap that in a shell
>>>script that might be ok.
>>>      
>>>
>>This is funny - so we go back to /proc. OK with me :)
>>    
>>
>
>Only sort-of back to /proc.  /proc/fs/nfsd is a separate filesystem
>which happens to be mounted there normally.
>The unexport system call goes through this exact same filesystem
>(though it is somewhat under-the-hood) so at that level, we are
>really proposing the same style of interface implementation.
>  
>
>>But again, I'm OK with /proc approach. However, with /proc approach, we
>>may need socket address (since not every export uses fsid and devno is
>>not easy to get).
>>    
>>
>
>Absolutely. We need a socket address.  
>As part of this process you are shutting down an interface.  We know
>(or can easily discover) the address of that interface.  That is
>exactly the address that we feed to nfsd.
>  
>
Now it looks good! I will do the following:
 
1. Further understand the steps needed to make sure we can un-mount ext3, 
given the changes to the "unexport" method.
2. Start coding to the /proc interface and make sure "rpc.statd -H" can 
work (lock reclaiming needs it). I will keep NFS v4 in mind as well.
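
As a rough illustration of the /proc-style interface discussed above, the sketch below writes a server address to a control file, mirroring the quoted `echo aa.bb.cc.dd > /proc/fs/nfsd/vserver_unlock`. The file name was only floated in the thread "(or whatever)", so it is an assumption rather than an existing kernel interface, and `request_unlock` is a hypothetical helper name.

```python
import ipaddress
import os
import tempfile

# Path quoted in the thread; the name was offered "(or whatever)", so it is
# a placeholder and not a confirmed kernel interface.
UNLOCK_PATH = "/proc/fs/nfsd/vserver_unlock"


def request_unlock(server_ip, control_path=UNLOCK_PATH):
    """Write a dotted-quad server address to the control file.

    Equivalent in spirit to `echo aa.bb.cc.dd > control_path`. Raises
    ValueError for a malformed IPv4 address, so a typo is rejected before
    anything is written.
    """
    addr = ipaddress.IPv4Address(server_ip)  # validates the string
    with open(control_path, "w") as ctl:
        ctl.write(f"{addr}\n")
    return str(addr)


if __name__ == "__main__":
    # Dry run against a temporary file instead of the real control file.
    fd, scratch = tempfile.mkstemp()
    os.close(fd)
    request_unlock("192.0.2.10", scratch)
    with open(scratch) as f:
        print(f.read(), end="")
    os.remove(scratch)
```

Passing the path as a parameter keeps the helper testable on a machine where the control file does not exist.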

By the way, there is a socket state-change-handler (TCP only) and/or 
network interface notification routine that seem to be workable (your 
previous thoughts). However, I don't plan to keep exploring that 
possibility since we now have a simple and workable method in place.

-- Wendy

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* Re: [Linux-cluster] Re: [RFC] NLM lock failover admin interface
  2006-06-15 14:07       ` [NFS] " William A.(Andy) Adamson
  2006-06-15 15:09         ` Wendy Cheng
@ 2006-06-16  6:09         ` Neil Brown
  2006-06-16 15:39           ` [NFS] " William A.(Andy) Adamson
  1 sibling, 1 reply; 28+ messages in thread
From: Neil Brown @ 2006-06-16  6:09 UTC (permalink / raw)
  To: William A.(Andy) Adamson; +Cc: linux clustering, nfs, Wendy Cheng

On Thursday June 15, andros@citi.umich.edu wrote:
> this discussion has centered around removing the locks of an export.
> we also want the interface to be able to remove the locks owned by a single 
> client. this is needed to enable client migration between replicas or between 
> nodes in a cluster file system. it is not acceptable to place an entire export 
> in grace just to move a small number of clients.

Hmmmm....
You want to remove all the locks owned by a particular client
with the intention of reclaiming those locks against a different NFS
server (on a cluster filesystem)
and you don't want to put the whole filesystem into grace mode while
doing it.

Is that correct?

Sounds extremely racy to me.  Suppose some other client takes a
conflicting lock between dropping them on one server and claiming them
on the other?  That would be bad.  The purpose of the grace mode is
precisely to avoid this sort of race.
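
The race described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: one exclusive lock per file, a single `LockServer` class standing in for lockd on each node, and a hypothetical `migrate` helper that lets an intruding client ask for the same files in the window between the drop and the reclaim. None of these names come from the kernel.

```python
class LockServer:
    """Toy NLM-like server: one exclusive lock per file, optional grace mode."""

    def __init__(self, in_grace=False):
        self.in_grace = in_grace
        self.locks = {}  # file path -> owning client

    def lock(self, client, path, reclaim=False):
        # While in grace, only reclaim requests are honoured.
        if self.in_grace and not reclaim:
            return False
        # Refuse if another client already holds the lock.
        if self.locks.get(path, client) != client:
            return False
        self.locks[path] = client
        return True

    def drop_client(self, client):
        """Drop every lock held by `client`; return the affected paths."""
        dropped = [p for p, owner in self.locks.items() if owner == client]
        for p in dropped:
            del self.locks[p]
        return dropped


def migrate(old, new, client, intruder):
    """Move `client`'s locks from `old` to `new` while `intruder` races.

    Returns True only if the client reclaimed every lock it held.
    """
    paths = old.drop_client(client)
    # Window between drop and reclaim: the intruder tries the same files.
    for p in paths:
        new.lock(intruder, p)
    return all(new.lock(client, p, reclaim=True) for p in paths)


if __name__ == "__main__":
    for grace in (False, True):
        old = LockServer()
        old.lock("client-a", "/export/file")
        ok = migrate(old, LockServer(in_grace=grace), "client-a", "client-b")
        print(f"grace={grace}: reclaim succeeded = {ok}")
```

Without grace mode on the new server the intruder wins the lock and the reclaim fails; with grace mode the intruder is refused and the reclaim goes through.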

It would seem that what you "really" want to do is to tell the cluster
filesystem to migrate the locks to a different node and somehow tell
lockd about it.

Is there a comprehensive design document about how this is going to
work, because I'm feeling doubtful.

For the 'between replicas' case - I'm not sure locking makes sense.
Locking on a read-only filesystem is pretty pointless, and presumably
replicas are read-only???

Basically, dropping locks that are expected to be picked up again,
without putting the whole filesystem into a grace period simply
doesn't sound workable to me.

Am I missing something?

NeilBrown



* Re: [NFS] Re: [RFC] NLM lock failover admin interface
  2006-06-16  6:09         ` [Linux-cluster] " Neil Brown
@ 2006-06-16 15:39           ` William A.(Andy) Adamson
  0 siblings, 0 replies; 28+ messages in thread
From: William A.(Andy) Adamson @ 2006-06-16 15:39 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux clustering, nfs

> On Thursday June 15, andros@citi.umich.edu wrote:
> > this discussion has centered around removing the locks of an export.
> > we also want the interface to be able to remove the locks owned by a single 
> > client. this is needed to enable client migration between replicas or between 
> > nodes in a cluster file system. it is not acceptable to place an entire export 
> > in grace just to move a small number of clients.
> 
> Hmmmm....
> You want to remove all the locks owned by a particular client
> with the intention of reclaiming those locks against a different NFS
> server (on a cluster filesystem)
> and you don't want to put the whole filesystem into grace mode while
> doing it.
> 
> Is that correct?

yes.

> 
> Sounds extremely racy to me.  Suppose some other client takes a
> conflicting lock between dropping them on one server and claiming them
> on the other?  That would be bad.  The purpose of the grace mode is
> precisely to avoid this sort of race.

the idea is that the underlying file system can place only the files with 
locks held by the migrating client(s) into grace, leaving all other files for 
normal operation. the migrating (nfsv4) client then reclaims opens, locks and 
delegations on the new server. it's just reducing the scope of the grace period.
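
a minimal sketch of the reduced-scope grace idea, under stated assumptions: it models only byte-range-free exclusive locks (not opens or delegations), and `ScopedGraceServer` and `begin_migration` are hypothetical names for illustration, not an existing lockd or nfsd API. only files listed for the migrating client are held in grace; every other file is served normally.

```python
class ScopedGraceServer:
    """Toy server where grace applies per file rather than per export."""

    def __init__(self):
        self.locks = {}  # file path -> owning client
        self.grace = {}  # file path -> client allowed to reclaim it

    def begin_migration(self, client, paths):
        """Put only the migrating client's files into grace."""
        for p in paths:
            self.grace[p] = client

    def lock(self, client, path, reclaim=False):
        expected = self.grace.get(path)
        if expected is None:
            # File is not in grace: a reclaim has nothing to reclaim.
            if reclaim:
                return False
        elif not (reclaim and client == expected):
            # File is in grace: only the migrating client may reclaim.
            return False
        # Ordinary conflict check against the current holder.
        if self.locks.get(path, client) != client:
            return False
        self.locks[path] = client
        self.grace.pop(path, None)  # a reclaimed file leaves grace
        return True


if __name__ == "__main__":
    srv = ScopedGraceServer()
    srv.begin_migration("client-a", ["/export/a"])
    print(srv.lock("client-b", "/export/a"))                # file in grace
    print(srv.lock("client-b", "/export/b"))                # unaffected file
    print(srv.lock("client-a", "/export/a", reclaim=True))  # reclaim
```

the point of the design is that the set of files in grace has to be known to both the lock manager and the underlying filesystem, which is what `begin_migration` stands in for here.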

> 
> It would seem that what you "really" want to do is to tell the cluster
> filesystem to migrate the locks to a different node and somehow tell
> lockd about it.

what we really want is for the cluster file system to share the locks between 
the original node and the new node. then the client can simply be redirected 
and no grace period or reclaim is needed. this is much harder to code than a 
reduced grace period as described above. from what we hear, lustre has this 
functionality.

either way, the files with locks held by the migrating client need to be 
identified by both the lock manager (lockd/nfsv4 server) and the underlying fs.

> 
> Is there a comprehensive design document about how this is going to
> work, because I'm feeling doubtful.

we have a work in progress - it's not done but may help describe our thinking.

http://wiki.linux-nfs.org/index.php/Recovery_and_migration

> 
> For the 'between replicas' case - I'm not sure locking makes sense.
> Locking on a read-only filesystem is pretty pointless, and presumably
> replicas are read-only???

nope. we have a promising prototype read/write replica scheme that we are 
testing.

http://www.citi.umich.edu/techreports/reports/citi-tr-06-3.pdf

i agree this is an outlying case....

but another immediate consumer of such an interface would be an administrator 
who needs to remove the locks for a client.

-->Andy

> 
> Basically, dropping locks that are expected to be picked up again,
> without putting the whole filesystem into a grace period simply
> doesn't sound workable to me.
> 
> Am I missing something?
> 
> NeilBrown

