[Cluster-devel] Re: [PATCH 1/2] NLM failover unlock commands

All of lore.kernel.org
 help / color / mirror / Atom feed

From: J. Bruce Fields <bfields@fieldses.org>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] Re: [PATCH 1/2] NLM failover unlock commands
Date: Thu, 24 Jan 2008 15:19:10 -0500	[thread overview]
Message-ID: <20080124201910.GF26164@fieldses.org> (raw)
In-Reply-To: <4798EAE1.2000707@redhat.com>

On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> In practice, it seems that both the unlock_ip and unlock_pathname
>> methods that revoke locks are going to be called together.  The two
>> separate calls therefore seem a little redundant.  The reason we *need*
>> both is that it's possible that a misconfigured client could grab locks
>> for a (server ip, export) combination that it isn't supposed to.
>>   
>
> That is not a correct assumption. The two commands (unlock_ip and  
> unlock_pathname) are not necessarily called together. It is ok for local  
> filesystem (ext3) but not for cluster filesystem where the very same  
> filesystem (or subtree) can be exported from multiple servers using  
> different subtrees.

Ouch.  Are people really doing that, and why?  What happens if the
subtrees share files (because of hard links) that are locked from both
nodes?

> Also as we discussed before, it is  
> "unlock_filesystem", *not* "unlock_pathname" (this implies sub-tree  
> exports) due to implementation difficulties (see the "Implementation  
> Notes" from http://people.redhat.com/wcheng/Patches/NFS/NLM/004.txt).

Unless I misread the latest patch, it's actually matching on the
vfsmount, right?

I guess that means we *could* handle the above situation by doing a

	mount --bind /path/to/export/point /path/to/export/point

on each export, at which point there will be a separate vfsmount for
each export point?

But I don't think that's what we really want.  The goal is to ensure
that the nfs server holds no locks on a disk filesystem so we can
unmount it completely from this machine and mount it elsewhere.  So we
should really be removing all locks for the superblock, not just for a
particular mount of that superblock.  Otherwise we'll have odd problems
if someone happens to do the unlock_filesystem downcall from a different
namespace or something.

>> So it makes sense to me to restrict locking from the beginning to
>> prevent that from happening.  Therefore I'd like to add a call at the
>> beginning like:
>>
>> 	echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>>   
>
> My second patch set is about to be sent out (doing text description at  
> this moment .. sorry for the delay).

Good, thanks.

>> before any exports are set up, which both starts a grace period, and
>> tells nfs to allow locks on the filesystem /exports/example only if
>> they're addressed to the server ip 192.168.1.1.  Then on shutdown,
>>
>> 	echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>>
>> should be sufficient to guarantee that nfsd/lockd no longer holds locks
>> on /exports/example.
>>
>> (I think Wendy's pretty close to that api already after adding the
>> second method to start grace?)
>>
>> The other advantage to having the server-ip from the start is that at
>> the time we make lock requests to the cluster filesystem, we can tell it
>> that the locks associated with 192.168.1.1 are special: they may migrate
>> as a group to another node, and on node failure they should (if
>> possible) be held to give a chance for another node to take them over.
>>
>> Internally I'd like to have an object like
>>
>> 	struct lock_manager {
>> 		char *lm_name;
>> 		...
>> 	}
>>
>> for each server ip address.  A pointer to this structure would be passed
>> with each lock request, allowing the filesystem to associate locks to
>> lock_manager's.  The name would be a string derived from the server ip
>> address that the cluster can compare to match reclaim requests with the
>> locks that they're reclaiming from another node.
>>   
>
> I still don't have a warm feeling about adding this (at this moment) -  
> somehow feel we over-engineer the lock failover issues.

I agree that that's a risk.

> Remember lock  failover is just a small portion of the general NFS
> server failover (for  both HA and load balancing purpose) issues.
> Maybe we should have  something simple and usable for 2.6 kernel
> before adding this type of  complication ?

Yeah.  We should aim to include basic failover functionality in 2.6.26,
one way or another--I think that dealing with the other issues I'm
worried about won't actually be a great deal more complicated, but if
that doesn't pan out then fine.  I would like to at least make sure this
is all working for nfsv4 as well, though.  (Currently only locks held by
v2/v3 clients are dropped.)

--b.

WARNING: multiple messages have this Message-ID (diff)

From: "J. Bruce Fields" <bfields@fieldses.org>
To: Wendy Cheng <wcheng@redhat.com>
Cc: Neil Brown <neilb@suse.de>, Christoph Hellwig <hch@infradead.org>,
	NFS list <linux-nfs@vger.kernel.org>,
	cluster-devel@redhat.com
Subject: Re: [PATCH 1/2] NLM failover unlock commands
Date: Thu, 24 Jan 2008 15:19:10 -0500	[thread overview]
Message-ID: <20080124201910.GF26164@fieldses.org> (raw)
In-Reply-To: <4798EAE1.2000707@redhat.com>

On Thu, Jan 24, 2008 at 02:45:37PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> In practice, it seems that both the unlock_ip and unlock_pathname
>> methods that revoke locks are going to be called together.  The two
>> separate calls therefore seem a little redundant.  The reason we *need*
>> both is that it's possible that a misconfigured client could grab locks
>> for a (server ip, export) combination that it isn't supposed to.
>>   
>
> That is not a correct assumption. The two commands (unlock_ip and  
> unlock_pathname) are not necessarily called together. It is ok for local  
> filesystem (ext3) but not for cluster filesystem where the very same  
> filesystem (or subtree) can be exported from multiple servers using  
> different subtrees.

Ouch.  Are people really doing that, and why?  What happens if the
subtrees share files (because of hard links) that are locked from both
nodes?

> Also as we discussed before, it is  
> "unlock_filesystem", *not* "unlock_pathname" (this implies sub-tree  
> exports) due to implementation difficulties (see the "Implementation  
> Notes" from http://people.redhat.com/wcheng/Patches/NFS/NLM/004.txt).

Unless I misread the latest patch, it's actually matching on the
vfsmount, right?

I guess that means we *could* handle the above situation by doing a

	mount --bind /path/to/export/point /path/to/export/point

on each export, at which point there will be a separate vfsmount for
each export point?

But I don't think that's what we really want.  The goal is to ensure
that the nfs server holds no locks on a disk filesystem so we can
unmount it completely from this machine and mount it elsewhere.  So we
should really be removing all locks for the superblock, not just for a
particular mount of that superblock.  Otherwise we'll have odd problems
if someone happens to do the unlock_filesystem downcall from a different
namespace or something.

>> So it makes sense to me to restrict locking from the beginning to
>> prevent that from happening.  Therefore I'd like to add a call at the
>> beginning like:
>>
>> 	echo "192.168.1.1 /exports/example" > /proc/fs/nfsd/start_grace
>>   
>
> My second patch set is about to be sent out (doing text description at  
> this moment .. sorry for the delay).

Good, thanks.

>> before any exports are set up, which both starts a grace period, and
>> tells nfs to allow locks on the filesystem /exports/example only if
>> they're addressed to the server ip 192.168.1.1.  Then on shutdown,
>>
>> 	echo "192.168.1.1" >/proc/fs/nfsd/unlock_ip
>>
>> should be sufficient to guarantee that nfsd/lockd no longer holds locks
>> on /exports/example.
>>
>> (I think Wendy's pretty close to that api already after adding the
>> second method to start grace?)
>>
>> The other advantage to having the server-ip from the start is that at
>> the time we make lock requests to the cluster filesystem, we can tell it
>> that the locks associated with 192.168.1.1 are special: they may migrate
>> as a group to another node, and on node failure they should (if
>> possible) be held to give a chance for another node to take them over.
>>
>> Internally I'd like to have an object like
>>
>> 	struct lock_manager {
>> 		char *lm_name;
>> 		...
>> 	}
>>
>> for each server ip address.  A pointer to this structure would be passed
>> with each lock request, allowing the filesystem to associate locks to
>> lock_manager's.  The name would be a string derived from the server ip
>> address that the cluster can compare to match reclaim requests with the
>> locks that they're reclaiming from another node.
>>   
>
> I still don't have a warm feeling about adding this (at this moment) -  
> somehow feel we over-engineer the lock failover issues.

I agree that that's a risk.

> Remember lock  failover is just a small portion of the general NFS
> server failover (for  both HA and load balancing purpose) issues.
> Maybe we should have  something simple and usable for 2.6 kernel
> before adding this type of  complication ?

Yeah.  We should aim to include basic failover functionality in 2.6.26,
one way or another--I think that dealing with the other issues I'm
worried about won't actually be a great deal more complicated, but if
that doesn't pan out then fine.  I would like to at least make sure this
is all working for nfsv4 as well, though.  (Currently only locks held by
v2/v3 clients are dropped.)

--b.

next prev parent reply	other threads:[~2008-01-24 20:19 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-07  5:39 [Cluster-devel] [PATCH 1/2] NLM failover unlock commands Wendy Cheng
2008-01-07  5:39 ` Wendy Cheng
     [not found] ` <message from Wendy Cheng on Monday January 7>
2008-01-08  5:18   ` [Cluster-devel] " Neil Brown
2008-01-08  5:18     ` Neil Brown
2008-01-09  2:51     ` [Cluster-devel] " Wendy Cheng
2008-01-09  2:51       ` Wendy Cheng
2008-01-08  5:31   ` [Cluster-devel] Re: [PATCH 2/2] Fix lockd panic Neil Brown
2008-01-08  5:31     ` Neil Brown
2008-01-09  3:02     ` [Cluster-devel] " Wendy Cheng
2008-01-09  3:02       ` Wendy Cheng
2008-01-09  4:43       ` [Cluster-devel] " Wendy Cheng
2008-01-09  4:43         ` Wendy Cheng
2008-01-09 23:33         ` [Cluster-devel] " Wendy Cheng
2008-01-09 23:33           ` Wendy Cheng
2008-01-12  6:51           ` [Cluster-devel] " Wendy Cheng
2008-01-12  6:51             ` Wendy Cheng
2008-01-08 17:02 ` [Cluster-devel] Re: [PATCH 1/2] NLM failover unlock commands Christoph Hellwig
2008-01-08 17:02   ` Christoph Hellwig
2008-01-08 17:49   ` [Cluster-devel] " Christoph Hellwig
2008-01-08 17:49     ` Christoph Hellwig
2008-01-08 20:57     ` [Cluster-devel] " Wendy Cheng
2008-01-08 20:57       ` Wendy Cheng
2008-01-09 18:02       ` [Cluster-devel] " Christoph Hellwig
2008-01-09 18:02         ` Christoph Hellwig
2008-01-10  7:59         ` [Cluster-devel] " Christoph Hellwig
2008-01-10  7:59           ` Christoph Hellwig
2008-01-12  7:03           ` [Cluster-devel] " Wendy Cheng
2008-01-12  7:03             ` Wendy Cheng
2008-01-12  9:38             ` [Cluster-devel] " Christoph Hellwig
2008-01-12  9:38               ` Christoph Hellwig
2008-01-14 23:07             ` [Cluster-devel] " J. Bruce Fields
2008-01-14 23:07               ` J. Bruce Fields
     [not found]               ` <message from J. Bruce Fields on Monday January 14>
2008-01-14 23:31                 ` [Cluster-devel] " Neil Brown
2008-01-14 23:31                   ` Neil Brown
     [not found]                   ` <18315.61638.14133.308991-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-01-15 16:38                     ` Chuck Lever
2008-01-22 22:53                   ` [Cluster-devel] " J. Bruce Fields
2008-01-22 22:53                     ` J. Bruce Fields
     [not found]                     ` <message from J. Bruce Fields on Tuesday January 22>
2008-01-24  4:02                       ` [Cluster-devel] " Neil Brown
2008-01-24  4:02                         ` Neil Brown
2008-01-15 16:14               ` [Cluster-devel] " Wendy Cheng
2008-01-15 16:14                 ` Wendy Cheng
2008-01-15 16:30                 ` [Cluster-devel] " J. Bruce Fields
2008-01-15 16:30                   ` J. Bruce Fields
     [not found]             ` <message from Wendy Cheng on Saturday January 12>
2008-01-14 23:52               ` [Cluster-devel] " Neil Brown
2008-01-14 23:52                 ` Neil Brown
2008-01-15 20:17                 ` [Cluster-devel] " Wendy Cheng
2008-01-15 20:17                   ` Wendy Cheng
     [not found]                   ` <message from Wendy Cheng on Tuesday January 15>
2008-01-15 20:50                     ` [Cluster-devel] " Neil Brown
2008-01-15 20:50                       ` Neil Brown
2008-01-15 20:56                       ` [Cluster-devel] " Wendy Cheng
2008-01-15 20:56                         ` Wendy Cheng
2008-01-15 22:48                       ` [Cluster-devel] " Wendy Cheng
2008-01-15 22:48                         ` Wendy Cheng
2008-01-17 15:10                         ` [Cluster-devel] " J. Bruce Fields
2008-01-17 15:10                           ` J. Bruce Fields
2008-01-17 15:48                           ` [Cluster-devel] " Wendy Cheng
2008-01-17 15:48                             ` Wendy Cheng
2008-01-17 16:08                             ` [Cluster-devel] " Wendy Cheng
2008-01-17 16:08                               ` Wendy Cheng
2008-01-17 16:10                               ` [Cluster-devel] " Wendy Cheng
2008-01-17 16:10                                 ` Wendy Cheng
2008-01-18 10:21                                 ` [Cluster-devel] " Frank van Maarseveen
2008-01-18 10:21                                   ` Frank van Maarseveen
2008-01-18 15:00                                   ` [Cluster-devel] " Wendy Cheng
2008-01-18 15:00                                     ` Wendy Cheng
2008-01-17 16:14                             ` [Cluster-devel] " J. Bruce Fields
2008-01-17 16:14                               ` J. Bruce Fields
2008-01-17 16:17                               ` [Cluster-devel] " Wendy Cheng
2008-01-17 16:17                                 ` Wendy Cheng
2008-01-17 16:21                                 ` [Cluster-devel] " J. Bruce Fields
2008-01-17 16:21                                   ` J. Bruce Fields
2008-01-17 16:31                             ` [Cluster-devel] " J. Bruce Fields
2008-01-17 16:31                               ` J. Bruce Fields
2008-01-17 16:31                               ` [Cluster-devel] " Wendy Cheng
2008-01-17 16:31                                 ` Wendy Cheng
2008-01-17 16:40                                 ` [Cluster-devel] " J. Bruce Fields
2008-01-17 16:40                                   ` J. Bruce Fields
2008-01-17 17:35                                   ` Frank Filz
2008-01-17 17:59                                     ` [Cluster-devel] " Wendy Cheng
2008-01-17 17:59                                       ` Wendy Cheng
2008-01-17 18:07                                   ` [Cluster-devel] " Wendy Cheng
2008-01-17 18:07                                     ` Wendy Cheng
2008-01-17 20:23                                     ` [Cluster-devel] " J. Bruce Fields
2008-01-17 20:23                                       ` J. Bruce Fields
2008-01-18 10:03                                       ` [Cluster-devel] " Frank van Maarseveen
2008-01-18 10:03                                         ` Frank van Maarseveen
2008-01-18 14:56                                         ` [Cluster-devel] " Wendy Cheng
2008-01-18 14:56                                           ` Wendy Cheng
2008-01-24 16:00                                       ` [Cluster-devel] " J. Bruce Fields
2008-01-24 16:00                                         ` J. Bruce Fields
2008-01-24 16:19                                         ` Peter Staubach
2008-01-24 16:39                                           ` [Cluster-devel] " J. Bruce Fields
2008-01-24 16:39                                             ` J. Bruce Fields
2008-01-24 19:45                                         ` [Cluster-devel] " Wendy Cheng
2008-01-24 19:45                                           ` Wendy Cheng
2008-01-24 20:19                                           ` J. Bruce Fields [this message]
2008-01-24 20:19                                             ` J. Bruce Fields
2008-01-24 21:06                                             ` [Cluster-devel] " Wendy Cheng
2008-01-24 21:06                                               ` Wendy Cheng
2008-01-24 21:40                                               ` [Cluster-devel] " J. Bruce Fields
2008-01-24 21:40                                                 ` J. Bruce Fields
2008-01-24 21:49                                                 ` [Cluster-devel] " Wendy Cheng
2008-01-24 21:49                                                   ` Wendy Cheng
2008-01-28  3:46                                         ` [Cluster-devel] " Felix Blyakher
2008-01-28  3:46                                           ` Felix Blyakher
2008-01-28 15:56                                           ` [Cluster-devel] " Wendy Cheng
2008-01-28 15:56                                             ` Wendy Cheng
2008-01-28 17:06                                             ` [Cluster-devel] " Felix Blyakher
2008-01-28 17:06                                               ` Felix Blyakher
2008-01-16  4:19                     ` Neil Brown
2008-01-16  4:19                       ` Neil Brown
2008-01-09  3:49   ` [Cluster-devel] " Wendy Cheng
2008-01-09  3:49     ` Wendy Cheng
2008-01-09 16:13     ` [Cluster-devel] " J. Bruce Fields
2008-01-09 16:13       ` J. Bruce Fields
  -- strict thread matches above, loose matches on Subject: below --
2008-01-07  5:53 [Cluster-devel] [PATCH 2/2] Fix lockd panic Wendy Cheng
2008-01-07  5:53 ` Wendy Cheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080124201910.GF26164@fieldses.org \
    --to=bfields@fieldses.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.