Re: [PATCH 18/19] lockd: Update NSM state from SM_MON replies

From: "J. Bruce Fields" <bfields@fieldses.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 18/19] lockd: Update NSM state from SM_MON replies
Date: Fri, 8 May 2009 11:33:23 -0400
Message-ID: <20090508153323.GA18704@fieldses.org>
In-Reply-To: <92B5F3F1-7091-4F94-B656-1733DC5A7B02@oracle.com>

On Fri, May 08, 2009 at 11:19:07AM -0400, Chuck Lever wrote:
> On Apr 28, 2009, at 3:11 PM, Chuck Lever wrote:
>> On Apr 28, 2009, at 12:38 PM, J. Bruce Fields wrote:
>>> On Tue, Apr 28, 2009 at 12:34:19PM -0400, Chuck Lever wrote:
>>>> On Apr 28, 2009, at 12:25 PM, J. Bruce Fields wrote:
>>>>> On Thu, Apr 23, 2009 at 07:33:33PM -0400, Chuck Lever wrote:
>>>>>> When rpc.statd starts up in user space at boot time, it attempts to
>>>>>> write the latest NSM local state number into
>>>>>> /proc/sys/fs/nfs/nsm_local_state.
>>>>>>
>>>>>> If lockd.ko isn't loaded yet (as is the case in most configurations),
>>>>>> that file doesn't exist, thus the kernel's NSM state remains set to
>>>>>> its initial value of zero during lockd operation.
>>>>>>
>>>>>> This is a problem because rpc.statd and lockd use the NSM state number
>>>>>> to prevent repeated lock recovery on rebooted hosts.  If lockd sends
>>>>>> a zero NSM state, but then a delayed SM_NOTIFY with a real NSM state
>>>>>> number is received, there is no way for lockd or rpc.statd to
>>>>>> distinguish that stale SM_NOTIFY from an actual reboot.  Thus lock
>>>>>> recovery could be performed after the rebooted host has already
>>>>>> started reclaiming locks, and those locks will be lost.
>>>>>>
>>>>>> We could change /etc/init.d/nfslock so it always modprobes lockd.ko
>>>>>> before starting rpc.statd.  However, if lockd.ko is ever unloaded
>>>>>> and reloaded, we are back at square one, since the NSM state is not
>>>>>> preserved across an unload/reload cycle.  This may happen frequently
>>>>>> on clients that use automounter.  A period of NFS inactivity causes
>>>>>> lockd.ko to be unloaded, and the kernel loses its NSM state setting.
>>>>>
>>>>> Aie.  Can we also fix the automounter or some other part of the
>>>>> userspace configuration?
>>>>
>>>> User space isn't the problem here... it's the fact that lockd can get
>>>> unloaded after a period of inactivity.  IMO lockd should be pinned in
>>>> the kernel after it is loaded with /etc/init.d/nfslock.
>>>>
>>>>>> Instead, let's use the fact that rpc.statd plants the local system's
>>>>>> NSM state in every SM_MON (and SM_UNMON) reply.  lockd performs a
>>>>>> synchronous SM_MON upcall to the local rpc.statd _before_ sending its
>>>>>> first NLM request to a new remote.  This would permit rpc.statd to
>>>>>> provide the current NSM state to lockd, even after lockd.ko had been
>>>>>> unloaded and reloaded.
>>>>>>
>>>>>> Note that NLMPROC_LOCK arguments are constructed before the
>>>>>> nsm_monitor() call, so we have to rearrange argument construction very
>>>>>> slightly to make this all work out.
>>>>>>
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>> ---
>>>>>>
>>>>>> fs/lockd/clntproc.c |    2 +-
>>>>>> fs/lockd/mon.c      |    6 +++++-
>>>>>> 2 files changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/lockd/clntproc.c b/fs/lockd/clntproc.c
>>>>>> index dd79570..f55b900 100644
>>>>>> --- a/fs/lockd/clntproc.c
>>>>>> +++ b/fs/lockd/clntproc.c
>>>>>> @@ -126,7 +126,6 @@ static void nlmclnt_setlockargs(struct nlm_rqst *req, struct file_lock *fl)
>>>>>> 	struct nlm_lock	*lock = &argp->lock;
>>>>>>
>>>>>> 	nlmclnt_next_cookie(&argp->cookie);
>>>>>> -	argp->state   = nsm_local_state;
>>>>>> 	memcpy(&lock->fh, NFS_FH(fl->fl_file->f_path.dentry->d_inode), sizeof(struct nfs_fh));
>>>>>> 	lock->caller  = utsname()->nodename;
>>>>>> 	lock->oh.data = req->a_owner;
>>>>>> @@ -519,6 +518,7 @@ nlmclnt_lock(struct nlm_rqst *req, struct file_lock *fl)
>>>>>>
>>>>>> 	if (nsm_monitor(host) < 0)
>>>>>> 		goto out;
>>>>>> +	req->a_args.state = nsm_local_state;
>>>>>
>>>>> Hm.  It looks like a_args.state is never used, except in ifdef'd-out
>>>>> code in nlm4svc_proc_lock() and nlmsvc_proc_lock().
>>>>> Something's wrong there.  (Not your fault; but needs looking into.)
>>>>
>>>> This isn't a big deal on the server side (I guess I should give this
>>>> patch to Trond instead of you, in that case).
>
> Since this is a client-side only patch, should I pass this to Trond  
> instead?

OK.

>
> [ more below ]
>
>>>> The client passes its NSM state number to the server in NLMPROC_LOCK
>>>> calls.  There is no mechanism for the server to pass its NSM state
>>>> number to the client via the NLM protocol.  So the first the client is
>>>> aware of the server's NSM state number is after the server reboots (via
>>>> SM_NOTIFY).  If the server never reboots, the client will never know the
>>>> server's NSM state number.
>>>
>>> So the #if 0'd code should just be deleted?
>>
>> OK, I misread your question before.
>>
>> As I read the code, our server does not appear to utilize the client's 
>> NSM state number, except for gating SM_NOTIFY requests with a 
>> previously-seen NSM state number.  The #ifdef'd code would potentially 
>> deny lock requests if it detected the state number going backwards.
>>
>> It would be nicer if the server actually tracked the client's state  
>> number, but it doesn't appear to do that today.  The #ifdef'd code  
>> serves to remind us that we should consider this.  This would also  
>> prevent a delayed SM_NOTIFY from causing the server to drop locks  
>> reacquired during the grace period accidentally.
>>
>> So I think it would be good to leave it, or replace it with a FIXME  
>> comment, for now.  Eventually we should add a little extra logic to  
>> handle this case.
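
To make the "little extra logic" concrete, here is a rough sketch of how
a server might remember a client's state number and ignore a delayed
SM_NOTIFY.  Everything in it is an assumption made for illustration:
struct client_nsm_record and both helpers are hypothetical names, not
existing lockd code, and counter wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-client record; not an existing lockd structure. */
struct client_nsm_record {
	uint32_t last_state;	/* highest state number seen; 0 = none yet */
};

/* Remember the state number carried in an NLMPROC_LOCK request. */
static void nsm_note_lock_state(struct client_nsm_record *rec, uint32_t state)
{
	/* A zero state carries no information, so it never updates the record. */
	if (state > rec->last_state)
		rec->last_state = state;
}

/*
 * Returns true if an SM_NOTIFY carrying `state` is newer than anything
 * seen so far (and records it), false if it looks stale and should not
 * trigger lock recovery.
 */
static bool nsm_notify_is_new(struct client_nsm_record *rec, uint32_t state)
{
	if (rec->last_state != 0 && state <= rec->last_state)
		return false;
	rec->last_state = state;
	return true;
}
```

With this, a client that reclaims locks with state 5 and whose SM_NOTIFY
for state 5 arrives late would not have its reclaimed locks dropped.  It
also shows why a client that sends zero defeats the check.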
>>
>>> --b.
>>>
>>>>
>>>>>> 	fl->fl_flags |= FL_ACCESS;
>>>>>> 	status = do_vfs_lock(fl);
>>>>>> diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
>>>>>> index 6d5d4a4..5017d50 100644
>>>>>> --- a/fs/lockd/mon.c
>>>>>> +++ b/fs/lockd/mon.c
>>>>>> @@ -188,8 +188,12 @@ int nsm_monitor(const struct nlm_host *host)
>>>>>> 		status = -EIO;
>>>>>> 	if (status < 0)
>>>>>> 		printk(KERN_NOTICE "lockd: cannot monitor %s\n", nsm->sm_name);
>>>>>> -	else
>>>>>> +	else {
>>>>>> 		nsm->sm_monitored = 1;
>>>>>> +		nsm_local_state = res.state;
>>>>>> +		dprintk("lockd: nsm_monitor: NSM state is now %d\n",
>>>>>> +				nsm_local_state);
>>>>>
>>>>> Could we make that a dprintk in the case where this changes
>>>>> nsm_local_state from something other than zero (nsm_local_state &&
>>>>> nsm_local_state != res.state)?
>>>>>
>>>>> (Just to make sure no statd is returning inconsistent nsm_local_state
>>>>> values here.)
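
As a user-space sketch of the check being suggested (not the kernel
code): nsm_update_local_state() is a hypothetical helper, fprintf()
stands in for the kernel's logging, and the global only mimics lockd's
nsm_local_state.  The update itself is unconditional; only the message
is gated on a nonzero value changing.

```c
#include <stdio.h>

/* Stand-in for lockd's global; zero means "not yet learned from statd". */
static int nsm_local_state;

/*
 * Record the state number from an SM_MON reply.  Returns 1 if a
 * previously known, nonzero state was replaced by a different value
 * (the case worth logging), otherwise 0.
 */
static int nsm_update_local_state(int reply_state)
{
	int changed = nsm_local_state != 0 && nsm_local_state != reply_state;

	if (changed)
		fprintf(stderr, "lockd: NSM state changed from %d to %d\n",
			nsm_local_state, reply_state);
	nsm_local_state = reply_state;
	return changed;
}
```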
>
> Having the kernel limit changes to the state number is probably not a  
> good idea.  Certain statd operations such as SM_SIMU_CRASH will modify  
> that state number.  We don't use SM_SIMU_CRASH today, but handling  
> server failover and such will likely require something like it.
>
> In any event, servers that are careful enough to track a client's NSM  
> state number will tell us pretty quickly if this is not working right.
>
>>>> I'm not sure that's a big deal, but...
>>>>
>>>> Note that the XNFS version 3 spec suggests the local lockd should
>>>> request the NSM state number when it starts up by posting an
>>>> SM_UNMON_ALL to the local statd.  That might be safer than loading it
>>>> after every SM_MON.
>
> So, the problem with using SM_UNMON_ALL when lockd starts up is that it 
> introduces yet another start-up ordering dependency.  In order for this 
> solution to work, statd is required to be running before lockd starts up.
> I think we discussed a few weeks ago how, on the server, lockd needs to
> start first so that it is available before reboot notifications are sent.

You can start statd without sending notifications.

Note that a few years ago Neil added a very detailed discussion of server
and client startup order to nfs-utils/README; it's worth reading.

--b.

> Even though this patch is for the client, I'm loath to add yet another 
> start-up ordering dependency in this area.  Theoretically this stuff 
> should work correctly no matter what order you start it (especially since 
> we don't package NFS init scripts with nfs-utils).  The current proposal 
> (using the result of SM_MON) provides adequate NSM state number updates 
> without introducing new ordering constraints.
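
For reference, the ordering the patch relies on can be modelled in a few
lines of user-space C.  This is only a sketch: every fake_* name is
hypothetical, the SM_MON upcall is reduced to a plain function call, and
a module unload is modelled by zeroing the lockd-side copy.

```c
/* All names below are hypothetical stand-ins, not the kernel's definitions. */
struct fake_statd { int state; };		/* statd's state survives on disk */
struct fake_lockd { int nsm_local_state; };	/* lost when lockd.ko unloads */
struct fake_lock_args { int state; };		/* state sent in NLMPROC_LOCK */

/* Models a successful SM_MON upcall: the reply carries statd's state. */
static int fake_nsm_monitor(struct fake_lockd *lockd,
			    const struct fake_statd *statd)
{
	lockd->nsm_local_state = statd->state;
	return 0;
}

/* Monitor first, then fill in the state, mirroring the patch's ordering. */
static int fake_nlmclnt_lock(struct fake_lockd *lockd,
			     const struct fake_statd *statd,
			     struct fake_lock_args *args)
{
	if (fake_nsm_monitor(lockd, statd) < 0)
		return -1;
	args->state = lockd->nsm_local_state;
	return 0;
}
```

Even if the lockd-side copy is reset to zero between calls, the next lock
request picks the real state number back up from the SM_MON reply before
anything is sent to the server.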
