* [Cluster-devel] [RFC PATCH 0/3] NLM lock failover
@ 2006-06-29 17:47 Wendy Cheng
2006-08-01 1:55 ` [Cluster-devel] [PATCH " Wendy Cheng
0 siblings, 1 reply; 14+ messages in thread
From: Wendy Cheng @ 2006-06-29 17:47 UTC (permalink / raw)
To: cluster-devel.redhat.com
The uploaded patches implement NLM lock failover as discussed in:
http://www.redhat.com/archives/linux-cluster/2006-June/msg00050.html
[PATCH 1/3]
Add a new admin interface into current nfsd procfs filesystem to trigger
NLM lock releasing logic. The command is invoked by echoing the server
virtual IP address into /proc/fs/nfsd/nlm_unlock file as:
shell> cd /proc/fs/nfsd
shell> echo 10.10.1.1 > nlm_unlock
It is currently restricted to IPv4 addressing, and the command should be
invoked from the taken-over NFS server. The to-do list (not yet
implemented) includes:
1. IPv6 addressing enablement.
2. Adding a "client:server" IP pair to allow NFS v4 lock failover, as
   proposed by Andy Adamson (CITI).
[PATCH 2/3]
Add the take-over server's counterpart command into the nfsd procfs
interface to allow selective setting of a per-(virtual-)IP lockd grace
period. It is also invoked by echoing the virtual IP into the
/proc/fs/nfsd/nlm_set_ip_grace file as:
shell> cd /proc/fs/nfsd
shell> echo 10.10.1.1 > nlm_set_ip_grace
It is likewise restricted to IPv4 addressing, and the command is
expected to be invoked from the take-over NFS server.
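The two echo commands above lend themselves to a small wrapper script. The sketch below assumes only the /proc/fs/nfsd/nlm_unlock and /proc/fs/nfsd/nlm_set_ip_grace files described in these patches; the `valid_ipv4` helper and the function names are hypothetical additions, motivated by the interface being IPv4-only:

```shell
#!/bin/sh
# Sketch of a failover helper around the two procfs files described
# above. valid_ipv4 is a hypothetical helper; the patches themselves
# only document "echo <ip> > <file>".

# Check for a plausible IPv4 dotted quad (the interface is IPv4-only).
valid_ipv4() {
    echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
    oldIFS=$IFS; IFS=.
    set -- $1                      # split into the four octets
    IFS=$oldIFS
    for octet in "$@"; do
        [ "$octet" -le 255 ] || return 1
    done
}

# On the taken-over server: release NLM locks held via this virtual IP.
nlm_unlock() {
    valid_ipv4 "$1" || { echo "bad IPv4 address: $1" >&2; return 1; }
    echo "$1" > /proc/fs/nfsd/nlm_unlock
}

# On the take-over server: start a per-IP lockd grace period.
nlm_set_ip_grace() {
    valid_ipv4 "$1" || { echo "bad IPv4 address: $1" >&2; return 1; }
    echo "$1" > /proc/fs/nfsd/nlm_set_ip_grace
}
```

Guarding the echo this way avoids silently writing malformed addresses into the procfs files during a scripted failover.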
[PATCH 3/3]
This kernel patch has *not* been unit tested yet, and it needs to be
paired with user mode nfs-utils changes (not ready in time for this
RFC). It puts the taken-over IPv4 address, in standard dot notation,
into the 3rd parameter of the ha_callout program (see man rpc.statd for
details) for the "add-client" event. For "del-client", we assume the
monitored host should be removed from the machine-wide lists,
regardless of the server's interface.

Once agreed upon, we will integrate the changes into our cluster suite
to start a full functional verification test. Please comment.
-- Wendy
^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Cluster-devel] [PATCH 0/3] NLM lock failover
  2006-06-29 17:47 [Cluster-devel] [RFC PATCH 0/3] NLM lock failover Wendy Cheng
@ 2006-08-01  1:55 ` Wendy Cheng
  [not found]   ` <message from Wendy Cheng on Monday July 31>
  2006-08-04  9:27   ` Greg Banks
  0 siblings, 2 replies; 14+ messages in thread
From: Wendy Cheng @ 2006-08-01 1:55 UTC (permalink / raw)
To: cluster-devel.redhat.com

For background info, please check out:
o http://www.redhat.com/archives/linux-cluster/2006-June/msg00050.html
  for the interface discussion.
o https://www.redhat.com/archives/cluster-devel/2006-June/msg00231.html
  for the first drafted code review.

Notes and restrictions:
o With nfs-utils-1.0.8-rc4 and nfs-utils-lib-1.0.8, the tests went
  surprisingly well, particularly the ha-callout feature. *No* change is
  made to these two user mode utility packages.
o The nfs-utils config flag RESTRICTED_STATD must be off for NLM
  failover to function correctly.
o The third parameter passed to the rpc.statd ha-callout program is no
  longer the system_utsname.nodename (set by sethostname()). It is,
  instead, the specific IP interface where the server receives the
  client's request.
o The patches are for NFS v2/v3 only. However, we do leave room for
  future NFS v4 expansion. For example, echoing "client_ip at server_ip"
  into /proc/fs/nfsd/nlm_unlock could be used to drop the v4 locks.
o The IPv6 modification is not included in this patch set. If required,
  it will be submitted as another patch set.

PATCH 1/3
---------
Add a new admin interface into the current nfsd procfs filesystem to
trigger the NLM lock releasing logic. The command is invoked by echoing
the server IPv4 address (in standard dot notation) into the
/proc/fs/nfsd/nlm_unlock file as:

shell> cd /proc/fs/nfsd
shell> echo 10.10.1.1 > nlm_unlock

PATCH 2/3
---------
Add the take-over server's counterpart command into the nfsd procfs
interface to allow selective setting of a per-(virtual-)IP lockd grace
period.
The grace period setting follows the current system-wide grace period
rule and default. It is also invoked by echoing the server IPv4 address
(in dot notation) into the /proc/fs/nfsd/nlm_set_ip_grace file:

shell> cd /proc/fs/nfsd
shell> echo 10.10.1.1 > nlm_set_ip_grace

PATCH 3/3
---------
This kernel patch adds a new field into struct nlm_host that holds the
server IP address. Upon SM_MON and SM_UNMON procedure calls, the IP (in
standard v4 dot notation) is placed in the "my_name" string and passed
to the local statd daemon. This enables the ha-callout program ("man
rpc.statd" for details) to receive the IP address of the server that
has received the client's request. Before this change, the my_name
string is universally set to system_utsname.nodename.

The user mode HA implementation is expected to:
1. Specify a user mode ha-callout program (rpc.statd -H) for receiving
   client monitored states.
2. Based on the info passed by ha-callout, an individual state
   directory should be created that can be read from the take-over
   server.
3. Upon failover, on the take-over server, send out notifications to
   the NFS clients via the (rpc.statd -n server_ip -P
   individual_state_directory -N) command.

-- Wendy
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  [not found] ` <message from Wendy Cheng on Monday July 31>
@ 2006-08-03  4:14 ` Neil Brown
  2006-08-03 21:34   ` Wendy Cheng
  2006-08-07 22:38   ` Wendy Cheng
  0 siblings, 2 replies; 14+ messages in thread
From: Neil Brown @ 2006-08-03 4:14 UTC (permalink / raw)
To: cluster-devel.redhat.com

Thanks for these.

First note: it helps a lot if the Subject line for each patch contains
a distinctive short description of what the patch does. Rather than
just "NLM lock failover" three times, maybe:

  Add nlm_unlock file to nfsd fs to allow selective unlocking based on server IP
  Add nlm_set_ip_grace file to allow selective setting of grace time
  Use IP address rather than hostname in SM_{UN,}MON calls to statd

Or something like that.

> PATCH 1/3
> ---------
> Add a new admin interface into current nfsd procfs filesystem to trigger
> NLM lock releasing logic. The command is invoked by echoing the server
> IP V4 address (in standard dot notation) into /proc/fs/nfsd/nlm_unlock
> file as:
>
> shell> cd /proc/fs/nfsd
> shell> echo 10.10.1.1 > nlm_unlock

This patch makes an assumption that any given filehandle will only
arrive at one particular interface - never more. This is implicit in
the fact that f_iaddr is stored in 'struct nlm_file', which is indexed
by filehandle.

In the case where you are intending to support active-active failover
this should be the case, but obviously configuration errors are
possible.

I think what I would like is that if requests arrive at two (or more)
different interfaces for the one file, then f_iaddr is cleared and some
flag is set. When an IP is written to nlm_unlock, if the flag is set,
then a warning message is printed:

  Note: some files accessed via multiple interfaces will not be unlocked.

A consequence of this is that you cannot have a virtual server with
two (or more) interfaces. Is this likely to be a problem? e.g.
if you have 4 physical interfaces on your server, might you want to
bind a different IP to each for each virtual server? If you did, then
my change above would mean that you couldn't do failover, and we might
need to look at other options...

Possibly (and maybe this is more work than is justified), lockd can
monitor interface usage and deduce interface pools based on seeing the
same filehandle on multiple interfaces. Then, when an unlock request
arrives on nlm_unlock, lockd would require all interfaces that touched
a file to be 'unlocked' before actually dropping the locks on the file.

As you can probably tell, I was "thinking out loud" there, and it may
not be particularly coherent or cohesive. Do you have any thoughts on
these issues?

> PATCH 2/3
> ---------
> Add take-over server counter-part command into nfsd procfs interface to
> allow selective setting of per (virtual) ip (lockd) grace period. The
> grace period setting follows current system-wide grace period rule and
> default. It is also invoked by echoing the server IP V4 address (in dot
> notation) into /proc/fs/nfsd/nlm_set_ip_grace file:
>
> shell> cd /proc/fs/nfsd
> shell> echo 10.10.1.1 > nlm_set_ip_grace

I think nlm_ip_mutex should be a spinlock, and I don't think you should
need to hold the lock in __nlm_servs_gc, as you have already
disconnected these entries from the global list.

  +extern unsigned long set_grace_period(void);  /* see fs/lockd/svc.c */

That should go in a header file.

  +	switch (passthru_check) {
  +	case NLMSVC_FO_PASSTHRU:
  +		break;
  +	case NLMSVC_FO_RECLAIM:
  +		if (argp->reclaim) break;
  +	default:
  +		return nlm_lck_denied_grace_period;
  +	}

I'd rather you spelt out

	case NLMSVC_FO_BLOCK_ANY:

rather than used 'default:' - it makes the code more readable. And
surely you should check for NLMSVC_FO_PASSTHRU before calling
nlmsvc_fo_check???

It seems to me that it would be clearer not to put the nlmsvc_fo_check
call inside nlm*svc_retrieve_args, but rather to define e.g.
	nlmsvc_is_grace_period(rqstp)

which checks nlmsvc_grace_period and the list of IPs, and then replace
every

	if (nlmsvc_grace_period) {
		resp->status = nlm_lck_denied_grace_period;
		return rpc_success;
	}

with

	if (nlmsvc_is_grace_period(rqstp)) {
		resp->status = nlm_lck_denied_grace_period;
		return rpc_success;
	}

and every

	if (nlmsvc_grace_period && !argp->reclaim) {
		resp->status = nlm_lck_denied_grace_period;
		return rpc_success;
	}

with

	if (!argp->reclaim && nlmsvc_is_grace_period(rqstp)) {
		resp->status = nlm_lck_denied_grace_period;
		return rpc_success;
	}

Does that seem reasonable to you?

> PATCH 3/3
> ---------
> This kernel patch adds a new field into struct nlm_host that holds the
> server IP address. Upon SM_MON and SM_UNMON procedure calls, the IP (in
> standard V4 dot notation) is placed as the "my_name" string and passed
> to local statd daemon. This enables ha-callout program ("man rpc.statd"
> for details) to receive the IP address of the server that has received
> the client's request. Before this change, my_name string is universally
> set to system_utsname.nodename.
>
> The user mode HA implementation is expected to:
> 1. Specify a user mode ha-callout program (rpc.statd -H) for receiving
>    client monitored states.
> 2. Based on info passed by ha-callout, individual state-directory should
>    be created and can be read from take-over server.
> 3. Upon failover, on take-over server, send out notification to nfs
>    client via (rpc.statd -n server_ip -P individual_state_directory -N)
>    command.

Was it necessary to rename "s_server" to "s_where"? Could you not have
just introduced "s_server_ip"??? And if you really want s_where, then I
would like some #defines, and make

  +	if (server)
  +		host->h_where = 1;
  +	else
  +		host->h_where = 0;

something like

	host->where = server ? ON_SERVER : ON_CLIENT;

(reading some more) In fact, you don't need h_where at all. Just change
h_server to be the IP address, and then use e.g.
  -	args.proto = (host->h_proto<<1) | host->h_server;
  +	args.serv  = host->h_server;
  +	args.proto = (host->h_proto<<1) | (host->h_server?1:0);

(or host->h_server!=0 or !!host->h_server - whatever takes your fancy).

  @@ -142,7 +142,7 @@ out_err:
   static u32 *
   xdr_encode_common(struct rpc_rqst *rqstp, u32 *p, struct nsm_args *argp)
   {
  -	char	buffer[20];
  +	char	buffer[100];

This looks like it should really be a separate patch, and should
probably be __NEW_UTS_LEN+1 rather than 100.

But I worry about other people who might be using ha-callout already
and expecting a host name there. We are making a non-optional
user-visible change here. Is it really a safe thing to do?

Thanks,
NeilBrown
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-03  4:14 ` [Cluster-devel] Re: [NFS] " Neil Brown
@ 2006-08-03 21:34 ` Wendy Cheng
  2006-08-07 22:38 ` Wendy Cheng
  1 sibling, 0 replies; 14+ messages in thread
From: Wendy Cheng @ 2006-08-03 21:34 UTC (permalink / raw)
To: cluster-devel.redhat.com

Neil Brown wrote:

>First note: it helps a lot if the Subject line for each patch
>contains a distinctive short description of what the patch does.
>
This is due to inexperience with open source patch submission plus
end-of-day fatigue :) .. It will be improved.

>>PATCH 1/3
>>---------
>>This patch makes an assumption that any given filehandle will only
>>arrive at one particular interface - never more. This is implicit in
>>the fact that f_iaddr is stored in 'struct nlm_file' which is indexed
>>by filehandle.
>>
>>.....
>>
>>A consequence of this is that you cannot have a virtual server with
>>two (or more) interfaces. Is this likely to be a problem?
>>e.g. if you have 4 physical interfaces on your server, might you want
>>to bind a different IP to each for each virtual server?
>>If you did, then my change above would mean that you couldn't do
>>failover, and we might need to look at other options...
>>
>>Possibly (and maybe this is more work than is justified), lockd can
>>monitor interface usage and deduce interface pools based on seeing the
>>same filehandle on multiple interfaces. Then when an unlock request
>>arrives on nlm_unlock, lockd would require all interfaces that touched
>>a file to be 'unlocked' before actually dropping the locks on the
>>file.
>>
>>As you can probably tell I was "thinking out loud" there and it may
>>not be particularly coherent or cohesive.
>>
>>Do you have any thoughts on these issues?
>>
Another option is dropping the (NLM) locks based on "fsid" (which can
be retrieved from the filehandle), instead of the virtual IP address.
Note that "fsid" has a good use in a cluster environment (compared to
device major/minor numbers, since different nodes may have different
device numbers). Do you see anything bad about the fsid approach?

One catch (about fsid) I can think of is that it must be passed from
lockd to statd (and then to the ha-callout program). The current SM_MON
and SM_UNMON protocol doesn't have any extra field for us to do that.
Would adding one more field cause any issue? e.g. the current SM_MON
arguments are:

  string<1024>	mon_name;
  string<1024>	my_name;
  uint32	my_prog;
  uint32	my_vers;
  uint32	my_proc;
  opaque[16]	priv;

Would adding "opaque[16] fsid" after "priv" be ok? Ditto for SM_UNMON.
On the other hand, the fsid could be the 4th parameter passed to the
ha-callout program (so that we can avoid breaking any existing
ha-callout application).

Let's give it a few more days to think these issues over. All the
others (comments for PATCH 2/3 and 3/3) are helpful coding advice -
they are appreciated and changes will be made accordingly.

-- Wendy
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-03  4:14 ` [Cluster-devel] Re: [NFS] " Neil Brown
  2006-08-03 21:34 ` Wendy Cheng
@ 2006-08-07 22:38 ` Wendy Cheng
  1 sibling, 0 replies; 14+ messages in thread
From: Wendy Cheng @ 2006-08-07 22:38 UTC (permalink / raw)
To: cluster-devel.redhat.com

Neil Brown wrote:

>This patch makes an assumption that any given filehandle will only arrive at
>one particular interface - never more. This is implicit in the fact
>that f_iaddr is stored in 'struct nlm_file' which is indexed by
>filehandle.
>
>In the case where you are intending to support active-active failover
>this should be the case, but obviously configuration errors are
>possible.
>
>I think what I would like is that if requests arrive at two (or more)
>different interfaces for the one file, then f_iaddr is cleared and some
>flag is set.
>When an IP is written to nlm_unlock, if the flag is set, then a
>warning message is printed:
>  Note: some files accessed via multiple interfaces will not be
>  unlocked.
>
I have given this issue a long thought over the weekend. The suggested
changes would work, but so would the "fsid" approach. I prefer "fsid"
because it is simpler to implement and very effective. The problem with
the fsid approach is that it is a little bit awkward to explain - I
don't believe it is well understood (or even known) by most system
admins. We may need a good write-up of the procedures. It does,
however, effectively handle the issues associated with an export entry
getting accessed via multiple virtual IP interfaces.

The test runs (with fsid) today have been encouraging. I will push the
revised patches for review as soon as they pass some sanity checks and
testing. The following are the main changes made by this new
implementation:

First, it is required to export the filesystem using "fsid"; e.g.
"export *:/mnt/tank1 (fsid=9468,rw)".

Patch 1: Drop the locks based on fsid; e.g.
  "echo 9468 > /proc/fs/nfsd/nlm_unlock"

Patch 2: Set an individual grace period based on fsid:

  "echo 9468 > /proc/fs/nfsd/nlm_set_igrace"

Patch 3: Utility functions to allow the cluster suite (or system admin)
to implement its own client reclaim notifications.

Unfortunately, it is too cumbersome to switch Patch 3 to using fsid, so
the old trick stays (we still use the server IP to facilitate the
client reclaim notification process).

In the meantime, the following is how I yank the fsid out of the file
handle. I send it here for preview purposes. If anyone spots anything
wrong, please do let me know. This will be the "guts" of this whole
thing:

-- Wendy

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fsid_from_fh.txt
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20060807/9e34b65b/attachment.txt>
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-01  1:55 ` [Cluster-devel] [PATCH " Wendy Cheng
  [not found] ` <message from Wendy Cheng on Monday July 31>
@ 2006-08-04  9:27 ` Greg Banks
  2006-08-04 13:27 ` Wendy Cheng
  1 sibling, 1 reply; 14+ messages in thread
From: Greg Banks @ 2006-08-04 9:27 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Tue, 2006-08-01 at 11:55, Wendy Cheng wrote:
> o The nfs-utils config flag RESTRICTED_STATD must be off for NLM
>   failover to function correctly.

That would reopen this ancient security hole:

http://www.cert.org/advisories/CA-99-05-statd-automountd.html

which might not be the best of ideas.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-04  9:27 ` Greg Banks
@ 2006-08-04 13:27 ` Wendy Cheng
  2006-08-04 14:56 ` Wendy Cheng
  2006-08-07  4:05 ` Greg Banks
  0 siblings, 2 replies; 14+ messages in thread
From: Wendy Cheng @ 2006-08-04 13:27 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Fri, 2006-08-04 at 19:27 +1000, Greg Banks wrote:
> On Tue, 2006-08-01 at 11:55, Wendy Cheng wrote:
> > o The nfs-utils config flag RESTRICTED_STATD must be off for NLM
> >   failover to function correctly.
>
> That would reopen this ancient security hole:
>
> http://www.cert.org/advisories/CA-99-05-statd-automountd.html
>
> which might not be the best of ideas.
>
OK, thanks! I'll look into this. But I believe nfs-utils-1.0.8-rc4 has
this off by default?

-- Wendy
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-04 13:27 ` Wendy Cheng
@ 2006-08-04 14:56 ` Wendy Cheng
  2006-08-04 15:51 ` Trond Myklebust
  1 sibling, 1 reply; 14+ messages in thread
From: Wendy Cheng @ 2006-08-04 14:56 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Fri, 2006-08-04 at 09:27 -0400, Wendy Cheng wrote:
> On Fri, 2006-08-04 at 19:27 +1000, Greg Banks wrote:
> > On Tue, 2006-08-01 at 11:55, Wendy Cheng wrote:
> > > o The nfs-utils config flag RESTRICTED_STATD must be off for NLM
> > >   failover to function correctly.
> >
> > That would reopen this ancient security hole:
> >
> > http://www.cert.org/advisories/CA-99-05-statd-automountd.html
> >
> > which might not be the best of ideas.
> >
> OK, thanks! I'll look into this. But I believe nfs-utils-1.0.8-rc4 has
> this off by default?
>
Anyway, better be conservative than sorry - I think we want to switch to
the "fsid" approach to avoid messing with these networking issues,
including the IPv6 modification. That is, we use the fsid as the key to
drop the locks and to set the per-fsid NLM grace period. The ha-callout
will have a 4th argument (fsid) when invoked.

-- Wendy
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-04 14:56 ` Wendy Cheng
@ 2006-08-04 15:51 ` Trond Myklebust
  2006-08-05  5:44 ` Wendy Cheng
  0 siblings, 1 reply; 14+ messages in thread
From: Trond Myklebust @ 2006-08-04 15:51 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Fri, 2006-08-04 at 10:56 -0400, Wendy Cheng wrote:
> Anyway, better be conservative than sorry - I think we want to switch to
> the "fsid" approach to avoid messing with these networking issues,
> including the IPv6 modification. That is, we use the fsid as the key to
> drop the locks and to set the per-fsid NLM grace period. The ha-callout
> will have a 4th argument (fsid) when invoked.

What is the point of doing that? As far as the client is concerned, a
server has either rebooted or it hasn't. It doesn't know about single
filesystems rebooting.

Cheers,
  Trond
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-04 15:51 ` Trond Myklebust
@ 2006-08-05  5:44 ` Wendy Cheng
  2006-08-07  4:05 ` Greg Banks
  0 siblings, 1 reply; 14+ messages in thread
From: Wendy Cheng @ 2006-08-05 5:44 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Fri, 2006-08-04 at 11:51 -0400, Trond Myklebust wrote:
> On Fri, 2006-08-04 at 10:56 -0400, Wendy Cheng wrote:
> > Anyway, better be conservative than sorry - I think we want to switch to
> > the "fsid" approach to avoid messing with these networking issues,
> > including the IPv6 modification. That is, we use the fsid as the key to
> > drop the locks and to set the per-fsid NLM grace period. The ha-callout
> > will have a 4th argument (fsid) when invoked.
>
> What is the point of doing that? As far as the client is concerned, a
> server has either rebooted or it hasn't. It doesn't know about single
> filesystems rebooting.
>
For active-active failover, the submitted patches allow:
1: Dropping the locks tied to one particular floating IP (on the old
   server).
2: Notifying the relevant clients that the floating IP has been
   rebooted.
3: Setting a per-IP NLM grace period.
4: The (notified) NFS clients reclaiming their locks against the new
   server.

While the above 4 steps are being executed, both servers stay alive
with their other NFS services un-interrupted. (1) and (3) are
accomplished by Patch 3-1 and Patch 3-2. (4) is the NFS client's task,
which follows its existing logic without changes. For (2), the basics
are built upon the existing rpc.statd HA features, specifically the -H
and -n/-N/-P options. It, however, needs Patch 3-3 to pass the correct
floating IP address into the rpc.statd user mode daemon, as follows:

For a system not involved in HA failover, nothing has changed. All the
new functions are optional, added-on features. For cluster failover:

1. The rpc.statd is dispatched as "rpc.statd -H ha-callout".
2.
Upon each monitor RPC call (SM_MON or SM_UNMON), rpc.statd receives
   the following from the kernel:
   2-a: event (mon or unmon)
   2-b: server interface
   2-c: client interface
3. The rpc.statd does its normal chores by writing or deleting the
   client interface to/from the default sm directory. The server
   interface is not used here (btw, this is the existing logic, without
   changes).
4. Then, the rpc.statd invokes ha-callout with the following three
   arguments:
   4-a: event (add-client or del-client)
   4-b: server interface
   4-c: client interface
   The ha-callout (in our case, it will be part of the RHCS cluster
   suite) builds multiple sm directories based on 4-b, say
   /shared_storage/sm_x, where x is the server's IP interface.
5. Upon failover, the cluster suite invokes "rpc.statd -n x -N -P
   /shared_storage/sm_x" to notify the affected clients. The new
   short-lived rpc.statd sends the notification to the relevant (nlm)
   clients and subsequently exits. The old rpc.statd (from step 1) is
   not aware of the failover event.

Note that before patch 3-3, the kernel always sets 2-b to
system_utsname.nodename. For rpc.statd, if the RESTRICTED_STATD flag is
on, rpc.statd always sets 4-b to 127.0.0.1. Without RESTRICTED_STATD
on, it sets 4-b to whatever was passed by the kernel (via 2-b). What
(kernel) patch 3-3 does is set 2-b to the floating IP so rpc.statd can
get the correct IP and pass it into 4-b.

Greg said (I haven't figured out how) that without setting 4-b to
127.0.0.1, we "may" open a security hole. So the thinking here is,
then, let's not change anything but add an fsid as a 4th argument for
ha-callout:
   4-d: fsid
where "fsid" can be viewed as a unique identifier for an NFS export
specified in the exports file (check "man exports"); e.g.

   /failover_dir *(rw,fsid=1234)

With the added fsid info from the ha-callout program, the cluster suite
(or human administrator) should be able to associate which (nlm) client
has been affected by one particular failover.
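The ha-callout behavior in step 4 above could be sketched roughly as below. Only the argument order (event, server interface, client interface) and the per-server-IP directory convention (/shared_storage/sm_x) come from the thread; the `SM_BASE` override, the function name, and the one-file-per-client layout are assumptions of this sketch:

```shell
#!/bin/sh
# Hypothetical ha-callout handler for "rpc.statd -H", following the
# per-server-IP sm directory layout described in step 4. SM_BASE and
# the file layout are assumptions; a real implementation (e.g. the RHCS
# one mentioned in the thread) may differ.
ha_callout() {
    event=$1 server_ip=$2 client_ip=$3
    dir="${SM_BASE:-/shared_storage}/sm_$server_ip"
    case "$event" in
    add-client)
        # Remember the client so a take-over server can notify it later
        # via "rpc.statd -n <server_ip> -N -P <dir>".
        mkdir -p "$dir" && touch "$dir/$client_ip"
        ;;
    del-client)
        rm -f "$dir/$client_ip"
        ;;
    *)
        echo "unknown event: $event" >&2
        return 1
        ;;
    esac
}

# rpc.statd would invoke this script roughly as:
#   ha-callout add-client 10.10.1.1 10.15.89.66
if [ $# -gt 0 ]; then
    ha_callout "$@"
fi
```

Keeping one file per client interface makes the take-over side trivial: the notification directory for a given floating IP is just `sm_<ip>`, ready to hand to `rpc.statd -P`.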
From the implementation point of view, since the fsid, if specified, is
already part of the filehandle that is part of the nlm_file structure,
we should be able to replace the floating IP in the submitted patches
with the fsid and still accomplish the very same thing. In short, the
failover sequence with the new interface would look like:

taken-over server:
A-1. tear down the floating IP, say 10.10.1.1
A-2. unexport the subject filesystem
A-3. "echo 1234 > /proc/fs/nfsd/nlm_unlock"    // fsid=1234
A-4. umount the filesystem

take-over server:
B-1. mount the subject filesystem
B-2. "echo 1234 > /proc/fs/nfsd/nlm_set_ip_grace"
B-3. "rpc.statd -n 10.10.1.1 -N -P /shared_storage/sm_10.10.1.1"
B-4. bring up 10.10.1.1
B-5. re-export the filesystem

A-3 and B-2 could be issued multiple times if the floating IP is
associated with multiple fsid(s). Make sense?

This fsid approach can also resolve Neil's concern (about an nlm client
using the wrong server interface to access the filesystem), which I'll
follow up on sometime next week.

-- Wendy
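The A/B sequence above can be expressed as a dry-run planning script. Only the two procfs echoes and the rpc.statd invocation are taken from the thread; the device name (eth0), the exact `ip`/`exportfs` command forms, and the function names are assumptions of this sketch:

```shell
#!/bin/sh
# Dry-run sketch of the fsid-based failover sequence described above.
# Each function prints the command plan instead of executing it, so the
# ordering can be audited first. eth0 and the ip/exportfs forms are
# assumptions; only the procfs echoes and rpc.statd line come from the
# thread (which also later discusses reordering B-3..B-5).

# Taken-over server, steps A-1 .. A-4.
nlm_failover_release() {    # $1=floating IP, $2=mount point, $3=fsid
cat <<EOF
ip addr del $1 dev eth0
exportfs -u "*:$2"
echo $3 > /proc/fs/nfsd/nlm_unlock
umount $2
EOF
}

# Take-over server, steps B-1 .. B-5, as listed in the message above.
nlm_failover_acquire() {    # $1=floating IP, $2=mount point, $3=fsid
cat <<EOF
mount $2
echo $3 > /proc/fs/nfsd/nlm_set_ip_grace
rpc.statd -n $1 -N -P /shared_storage/sm_$1
ip addr add $1 dev eth0
exportfs "*:$2"
EOF
}
```

Piping a plan to `sh` would execute it; printing first makes it easy to repeat A-3/B-2 for each fsid tied to the floating IP, as the message notes.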
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-05  5:44 ` Wendy Cheng
@ 2006-08-07  4:05 ` Greg Banks
  2006-08-07 20:14 ` James Yarbrough
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Banks @ 2006-08-07 4:05 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Sat, 2006-08-05 at 15:44, Wendy Cheng wrote:
> Note that before patch 3-3, the kernel always sets 2-b to
> system_utsname.nodename. For rpc.statd, if the RESTRICTED_STATD flag is
> on, rpc.statd always sets 4-b to 127.0.0.1. Without RESTRICTED_STATD
> on, it sets 4-b to whatever was passed by the kernel (via 2-b). What
> (kernel) patch 3-3 does is set 2-b to the floating IP so rpc.statd can
> get the correct IP and pass it into 4-b.
>
> Greg said (I haven't figured out how) that without setting 4-b to
> 127.0.0.1, we "may" open a security hole.

Aha, I see what you needed. You could have changed the logic in the
RESTRICTED_STATD case of sm_mon_1_svc() not to ignore the passed
my_addr.s_addr if svc_getcaller(rqstp->rq_xprt) is a privileged port
on localhost. This would probably give you your logic without
reopening the security hole.

> take-over server:
> B-1. mount the subject filesystem
> B-2. "echo 1234 > /proc/fs/nfsd/nlm_set_ip_grace"
> B-3. "rpc.statd -n 10.10.1.1 -N -P /shared_storage/sm_10.10.1.1"
> B-4. bring up 10.10.1.1
> B-5. re-export the filesystem

Umm, don't you want to do B-3 after B-4 and B-5? Otherwise clients
might racily fail on the first try.

Also, just curious here: when do you purge the clients' ARP caches?

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-07  4:05 ` Greg Banks
@ 2006-08-07 20:14 ` James Yarbrough
  2006-08-07 21:03 ` Wendy Cheng
  0 siblings, 1 reply; 14+ messages in thread
From: James Yarbrough @ 2006-08-07 20:14 UTC (permalink / raw)
To: cluster-devel.redhat.com

> > take-over server:
> > B-1. mount the subject filesystem
> > B-2. "echo 1234 > /proc/fs/nfsd/nlm_set_ip_grace"
> > B-3. "rpc.statd -n 10.10.1.1 -N -P /shared_storage/sm_10.10.1.1"
> > B-4. bring up 10.10.1.1
> > B-5. re-export the filesystem
>
> Umm, don't you want to do B-3 after B-4 and B-5? Otherwise clients
> might racily fail on the first try.

I don't think they will necessarily fail. It depends on whether the
server sends ICMP unreachable messages and how the client responds to
those. In any case, I think the ordering should be B-5, B-4, and B-3
last. One can argue about the ordering of B-3 and B-4, but if exporting
(B-5) does not happen before bringing up the IP address (B-4), clients
can get ESTALE replies. For better transparency, it's probably best to
avoid ESTALE.

It's probably OK to do step B-3 after bringing up the IP address, since
that will mimic what happens during boot.

> Also, just curious here: when do you purge the clients' ARP caches?

I don't think you can actually do a purge from the server explicitly.
You should get the desired result when the IP address (10.10.1.1 in the
above example) is brought up. There's a gratuitous ARP that goes with
that step.

-- 
jmy at sgi.com
650 933 3124

Blow up the mailbox!
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-07 20:14 ` James Yarbrough
@ 2006-08-07 21:03 ` Wendy Cheng
  0 siblings, 0 replies; 14+ messages in thread
From: Wendy Cheng @ 2006-08-07 21:03 UTC (permalink / raw)
To: cluster-devel.redhat.com

James Yarbrough wrote:

>>>take-over server:
>>>B-1. mount the subject filesystem
>>>B-2. "echo 1234 > /proc/fs/nfsd/nlm_set_ip_grace"
>>>B-3. "rpc.statd -n 10.10.1.1 -N -P /shared_storage/sm_10.10.1.1"
>>>B-4. bring up 10.10.1.1
>>>B-5. re-export the filesystem
>>>
>>Umm, don't you want to do B-3 after B-4 and B-5? Otherwise clients
>>might racily fail on the first try.
>>
>I don't think they will necessarily fail. It depends on whether the
>server sends ICMP unreachable messages and how the client responds to
>those. In any case, I think the ordering should be B-5, B-4, and B-3
>last. One can argue about the ordering of B-3 and B-4, but if exporting
>(B-5) does not happen before bringing up the IP address (B-4), clients
>can get ESTALE replies. For better transparency, it's probably best
>to avoid ESTALE.
>
>It's probably OK to do step B-3 after bringing up the IP address since
>that will mimic what happens during boot.
>
Yes, you and Greg are mostly right - that was an oversight in my test
script. But our user mode RHCS script (Lon wrote that piece of code)
does it correctly: he did B-5, B-4, and B-3 last.

<info>  Adding export: *:/mnt/tank1 (fsid=9468,rw)
<info>  Adding export: *:/mnt/tank2 (fsid=661,rw)
<debug> Link for eth0: Detected
<info>  Adding IPv4 address 10.15.89.203 to eth0
<debug> Sending gratuitous ARP: 10.15.89.203 00:30:48:27:92:d6
        brd ff:ff:ff:ff:ff:ff
<info>  Sending reclaim notifications via tank-02
Start of nfs1 complete

-- Wendy
* [Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover
  2006-08-04 13:27 ` Wendy Cheng
  2006-08-04 14:56 ` Wendy Cheng
@ 2006-08-07  4:05 ` Greg Banks
  1 sibling, 0 replies; 14+ messages in thread
From: Greg Banks @ 2006-08-07 4:05 UTC (permalink / raw)
To: cluster-devel.redhat.com

On Fri, 2006-08-04 at 23:27, Wendy Cheng wrote:
> On Fri, 2006-08-04 at 19:27 +1000, Greg Banks wrote:
> > On Tue, 2006-08-01 at 11:55, Wendy Cheng wrote:
> > > o The nfs-utils config flag RESTRICTED_STATD must be off for NLM
> > >   failover to function correctly.
> >
> > That would reopen this ancient security hole:
> >
> > http://www.cert.org/advisories/CA-99-05-statd-automountd.html
> >
> > which might not be the best of ideas.
> >
> OK, thanks! I'll look into this. But I believe nfs-utils-1.0.8-rc4 has
> this off by default?

I really hope distros have --enable-secure-statd in their .specs. I
know SLES9+ doesn't need it, because SLES has Olaf's in-kernel
rpc.statd which (IIRC) has the equivalent of RESTRICTED_STATD
hardcoded.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Thread overview: 14+ messages
2006-06-29 17:47 [Cluster-devel] [RFC PATCH 0/3] NLM lock failover Wendy Cheng
2006-08-01 1:55 ` [Cluster-devel] [PATCH " Wendy Cheng
[not found] ` <message from Wendy Cheng on Monday July 31>
2006-08-03 4:14 ` [Cluster-devel] Re: [NFS] " Neil Brown
2006-08-03 21:34 ` Wendy Cheng
2006-08-07 22:38 ` Wendy Cheng
2006-08-04 9:27 ` Greg Banks
2006-08-04 13:27 ` Wendy Cheng
2006-08-04 14:56 ` Wendy Cheng
2006-08-04 15:51 ` Trond Myklebust
2006-08-05 5:44 ` Wendy Cheng
2006-08-07 4:05 ` Greg Banks
2006-08-07 20:14 ` James Yarbrough
2006-08-07 21:03 ` Wendy Cheng
2006-08-07 4:05 ` Greg Banks