Re: [PATCH 18/19] lockd: Update NSM state from SM_MON replies

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "J. Bruce Fields" <bfields@fieldses.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 18/19] lockd: Update NSM state from SM_MON replies
Date: Fri, 8 May 2009 11:33:23 -0400	[thread overview]
Message-ID: <20090508153323.GA18704@fieldses.org> (raw)
In-Reply-To: <92B5F3F1-7091-4F94-B656-1733DC5A7B02@oracle.com>

On Fri, May 08, 2009 at 11:19:07AM -0400, Chuck Lever wrote:
> On Apr 28, 2009, at 3:11 PM, Chuck Lever wrote:
>> On Apr 28, 2009, at 12:38 PM, J. Bruce Fields wrote:
>>> On Tue, Apr 28, 2009 at 12:34:19PM -0400, Chuck Lever wrote:
>>>> On Apr 28, 2009, at 12:25 PM, J. Bruce Fields wrote:
>>>>> On Thu, Apr 23, 2009 at 07:33:33PM -0400, Chuck Lever wrote:
>>>>>> When rpc.statd starts up in user space at boot time, it 
>>>>>> attempts to
>>>>>> write the latest NSM local state number into
>>>>>> /proc/sys/fs/nfs/nsm_local_state.
>>>>>>
>>>>>> If lockd.ko isn't loaded yet (as is the case in most  
>>>>>> configurations),
>>>>>> that file doesn't exist, thus the kernel's NSM state remains 
>>>>>> set to
>>>>>> its initial value of zero during lockd operation.
>>>>>>
>>>>>> This is a problem because rpc.statd and lockd use the NSM state
>>>>>> number
>>>>>> to prevent repeated lock recovery on rebooted hosts.  If lockd  
>>>>>> sends
>>>>>> a zero NSM state, but then a delayed SM_NOTIFY with a real NSM  
>>>>>> state
>>>>>> number is received, there is no way for lockd or rpc.statd to
>>>>>> distinguish that stale SM_NOTIFY from an actual reboot.  Thus lock
>>>>>> recovery could be performed after the rebooted host has already
>>>>>> started reclaiming locks, and those locks will be lost.
>>>>>>
>>>>>> We could change /etc/init.d/nfslock so it always modprobes  
>>>>>> lockd.ko
>>>>>> before starting rpc.statd.  However, if lockd.ko is ever unloaded
>>>>>> and reloaded, we are back at square one, since the NSM state is 
>>>>>> not
>>>>>> preserved across an unload/reload cycle.  This may happen  
>>>>>> frequently
>>>>>> on clients that use automounter.  A period of NFS inactivity  
>>>>>> causes
>>>>>> lockd.ko to be unloaded, and the kernel loses its NSM state  
>>>>>> setting.
>>>>>
>>>>> Aie.  Can we also fix the automounter or some other part of the
>>>>> userspace configuration?
>>>>
>>>> User space isn't the problem here... it's the fact that lockd can  
>>>> get
>>>> unloaded after a period of inactivity.  IMO lockd should be pinned  
>>>> in
>>>> the kernel after it is loaded with /etc/init.d/nfslock.
>>>>
>>>>>> Instead, let's use the fact that rpc.statd plants the local  
>>>>>> system's
>>>>>> NSM state in every SM_MON (and SM_UNMON) reply.  lockd performs a
>>>>>> synchronous SM_MON upcall to the local rpc.statd _before_  
>>>>>> sending its
>>>>>> first NLM request to a new remote.  This would permit rpc.statd to
>>>>>> provide the current NSM state to lockd, even after lockd.ko had 
>>>>>> been
>>>>>> unloaded and reloaded.
>>>>>>
>>>>>> Note that NLMPROC_LOCK arguments are constructed before the
>>>>>> nsm_monitor() call, so we have to rearrange argument construction
>>>>>> very
>>>>>> slightly to make this all work out.
>>>>>>
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>> ---
>>>>>>
>>>>>> fs/lockd/clntproc.c |    2 +-
>>>>>> fs/lockd/mon.c      |    6 +++++-
>>>>>> 2 files changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/lockd/clntproc.c b/fs/lockd/clntproc.c
>>>>>> index dd79570..f55b900 100644
>>>>>> --- a/fs/lockd/clntproc.c
>>>>>> +++ b/fs/lockd/clntproc.c
>>>>>> @@ -126,7 +126,6 @@ static void nlmclnt_setlockargs(struct  
>>>>>> nlm_rqst
>>>>>> *req, struct file_lock *fl)
>>>>>> 	struct nlm_lock	*lock = &argp->lock;
>>>>>>
>>>>>> 	nlmclnt_next_cookie(&argp->cookie);
>>>>>> -	argp->state   = nsm_local_state;
>>>>>> 	memcpy(&lock->fh, NFS_FH(fl->fl_file->f_path.dentry->d_inode),
>>>>>> sizeof(struct nfs_fh));
>>>>>> 	lock->caller  = utsname()->nodename;
>>>>>> 	lock->oh.data = req->a_owner;
>>>>>> @@ -519,6 +518,7 @@ nlmclnt_lock(struct nlm_rqst *req, struct
>>>>>> file_lock *fl)
>>>>>>
>>>>>> 	if (nsm_monitor(host) < 0)
>>>>>> 		goto out;
>>>>>> +	req->a_args.state = nsm_local_state;
>>>>>
>>>>> Hm.  It looks like a_args.state is never used, except in ifdef'd- 
>>>>> out
>>>>> code in nlm4svc_proc_lock() and nlmsvc_proc_lock() ifdef'd out.
>>>>> Something's wrong there.  (Not your fault; but needs looking into.)
>>>>
>>>> This isn't a big deal on the server side (I guess I should give this
>>>> patch to Trond instead of you, in that case).
>
> Since this is a client-side only patch, should I pass this to Trond  
> instead?

OK.

>
> [ more below ]
>
>>>> The client passes its NSM state number to the server in NLMPROC_LOCK
>>>> calls.  There is no mechanism for the server to pass its NSM state
>>>> number to the client via the NLM protocol.  So the first the  
>>>> client is
>>>> aware of the server's NSM state number is after the server reboots  
>>>> (via
>>>> SM_NOTIFY).  If the server never reboots, the client will never  
>>>> know the
>>>> server's NSM state number.
>>>
>>> So the #if 0'd code should just be deleted?
>>
>> OK, I misread your question before.
>>
>> As I read the code, our server does not appear to utilize the client's 
>> NSM state number, except for gating SM_NOTIFY requests with a 
>> previously-seen NSM state number.  The #ifdef'd code would potentially 
>> deny lock requests if it detected the state number going backwards.
>>
>> It would be nicer if the server actually tracked the client's state  
>> number, but it doesn't appear to do that today.  The #ifdef'd code  
>> serves to remind us that we should consider this.  This would also  
>> prevent a delayed SM_NOTIFY from causing the server to drop locks  
>> reacquired during the grace period accidentally.
>>
>> So I think it would be good to leave it, or replace it with a FIXME  
>> comment, for now.  Eventually we should add a little extra logic to  
>> handle this case.
>>
>>> --b.
>>>
>>>>
>>>>>> 	fl->fl_flags |= FL_ACCESS;
>>>>>> 	status = do_vfs_lock(fl);
>>>>>> diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
>>>>>> index 6d5d4a4..5017d50 100644
>>>>>> --- a/fs/lockd/mon.c
>>>>>> +++ b/fs/lockd/mon.c
>>>>>> @@ -188,8 +188,12 @@ int nsm_monitor(const struct nlm_host *host)
>>>>>> 		status = -EIO;
>>>>>> 	if (status < 0)
>>>>>> 		printk(KERN_NOTICE "lockd: cannot monitor %s\n", nsm->sm_name);
>>>>>> -	else
>>>>>> +	else {
>>>>>> 		nsm->sm_monitored = 1;
>>>>>> +		nsm_local_state = res.state;
>>>>>> +		dprintk("lockd: nsm_monitor: NSM state is now %d\n",
>>>>>> +				nsm_local_state);
>>>>>
>>>>> Could we make that a dprintk in the case where this changes  
>>>>> nsm_local
>>>>> state from something other than zero (nsm_lock_state &&
>>>>> nsm_local_state
>>>>> != res.state)?
>>>>>
>>>>> (Just to make sure no statd is returning inconsistent  
>>>>> nsm_local_stats
>>>>> here.)
>
> Having the kernel limit changes to the state number is probably not a  
> good idea.  Certain statd operations such as SM_SIMU_CRASH will modify  
> that state number.  We don't use SM_SIMU_CRASH today, but handling  
> server failover and such will likely require something like it.
>
> In any event, servers that are careful enough to track a client's NSM  
> state number will tell us pretty quickly if this is not working right.
>
>>>> I'm not sure that's a big deal, but...
>>>>
>>>> Note that the XNFS version 3 spec suggests the local lockd should
>>>> request the NSM state number when it starts up by posting an
>>>> SM_UNMON_ALL to the local statd.  That might be safer than loading  
>>>> it
>>>> after every SM_MON.
>
> So, the problem with using SM_UNMON_ALL when lockd starts up is that it 
> introduces yet another start-up ordering dependency.  In order for this 
> solution to work, statd is required to be running before lockd starts up. 
>  I think we discussed a few weeks ago how, on the server, lockd needs to 
> start first so that it is available before reboot notifications are sent.

You can start statd without sending notifications.

Note a few years ago Neil added a very detailed discussion of server and
client startup order to nfs-utils/README, worth reading.

--b.

> Even though this patch is for the client, I'm loathe to add yet another 
> start-up ordering dependency in this area.  Theoretically this stuff 
> should work correctly no matter what order you start it (especially since 
> we don't package NFS init scripts with nfs-utils).  The current proposal 
> (using the result of SM_MON) provides adequate NSM state number updates 
> without introducing new ordering constraints.

next prev parent reply	other threads:[~2009-05-08 15:33 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-23 23:31 [PATCH 00/19] Proposed server-side patches for 2.6.31 Chuck Lever
     [not found] ` <20090423231550.17283.24432.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-23 23:31   ` [PATCH 01/19] SUNRPC: Fix error return value of svc_addr_len() Chuck Lever
     [not found]     ` <20090423233124.17283.40252.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-25 22:17       ` J. Bruce Fields
2009-04-27 16:49         ` Chuck Lever
2009-04-27 23:51           ` J. Bruce Fields
2009-04-28 15:28             ` Chuck Lever
2009-04-28 15:31               ` J. Bruce Fields
2009-04-23 23:31   ` [PATCH 02/19] NFSD: Refactor transport removal out of __write_ports() Chuck Lever
2009-04-23 23:31   ` [PATCH 03/19] NFSD: Refactor transport addition " Chuck Lever
2009-04-23 23:31   ` [PATCH 04/19] NFSD: Refactor portlist socket closing into a helper Chuck Lever
2009-04-23 23:31   ` [PATCH 05/19] NFSD: Refactor socket creation out of __write_ports() Chuck Lever
     [not found]     ` <20090423233155.17283.37345.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-25 22:40       ` J. Bruce Fields
2009-04-23 23:32   ` [PATCH 06/19] NFSD: Note an additional requirement when passing TCP sockets to portlist Chuck Lever
2009-04-23 23:32   ` [PATCH 07/19] NFSD: Finish refactoring __write_ports() Chuck Lever
2009-04-23 23:32   ` [PATCH 08/19] NFSD: move lockd_up() before svc_addsock() Chuck Lever
2009-04-23 23:32   ` [PATCH 09/19] NFSD: Prevent a buffer overflow in svc_xprt_names() Chuck Lever
     [not found]     ` <20090423233225.17283.10176.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-27 23:56       ` J. Bruce Fields
2009-04-23 23:32   ` [PATCH 10/19] SUNRPC: pass buffer size to svc_addsock() Chuck Lever
2009-04-23 23:32   ` [PATCH 11/19] SUNRPC: pass buffer size to svc_sock_names() Chuck Lever
2009-04-23 23:32   ` [PATCH 12/19] SUNRPC: Switch one_sock_name() to use snprintf() Chuck Lever
2009-04-23 23:32   ` [PATCH 13/19] SUNRPC: Support PF_INET6 in one_sock_name() Chuck Lever
2009-04-23 23:33   ` [PATCH 14/19] SUNRPC: Clean up one_sock_name() Chuck Lever
2009-04-23 23:33   ` [PATCH 15/19] NFSD: Stricter buffer size checking in write_recoverydir() Chuck Lever
2009-04-23 23:33   ` [PATCH 16/19] NFSD: Stricter buffer size checking in write_versions() Chuck Lever
2009-04-23 23:33   ` [PATCH 17/19] NFSD: Stricter buffer size checking in fs/nfsd/nfsctl.c Chuck Lever
     [not found]     ` <20090423233325.17283.71127.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-28 16:31       ` J. Bruce Fields
2009-04-28 16:36         ` Chuck Lever
2009-04-28 21:30           ` J. Bruce Fields
2009-04-23 23:33   ` [PATCH 18/19] lockd: Update NSM state from SM_MON replies Chuck Lever
     [not found]     ` <20090423233332.17283.23011.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-28 16:25       ` J. Bruce Fields
2009-04-28 16:34         ` Chuck Lever
2009-04-28 16:38           ` J. Bruce Fields
2009-04-28 19:11             ` Chuck Lever
2009-05-08 15:19               ` Chuck Lever
2009-05-08 15:33                 ` J. Bruce Fields [this message]
2009-04-23 23:33   ` [PATCH 19/19] lockd: clean up 64-bit alignment fix in nsm_init_private() Chuck Lever
     [not found]     ` <20090423233340.17283.29580.stgit-07a7zB5ZJzbwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
2009-04-28 16:31       ` J. Bruce Fields
2009-04-28 16:35         ` Chuck Lever
2009-04-28 16:40           ` J. Bruce Fields
2009-04-28 17:24             ` Chuck Lever
2009-04-28 21:36               ` J. Bruce Fields
2009-04-28 22:03                 ` Måns Rullgård
     [not found]                   ` <yw1x63gozb9f.fsf-O+uoZmgXk1l54TAoqtyWWQ@public.gmane.org>
2009-04-28 22:14                     ` Chuck Lever
2009-04-28 22:11                 ` Chuck Lever
2009-04-28 22:23                   ` J. Bruce Fields
2009-04-28 22:31                   ` Måns Rullgård
     [not found]                     ` <yw1xws94xved.fsf-O+uoZmgXk1l54TAoqtyWWQ@public.gmane.org>
2009-04-28 22:43                       ` Chuck Lever
2009-04-28 22:52                         ` Måns Rullgård
     [not found]                           ` <yw1xskjsxuff.fsf-O+uoZmgXk1l54TAoqtyWWQ@public.gmane.org>
2009-04-29 15:16                             ` Chuck Lever
2009-04-29 18:02                               ` Måns Rullgård
2009-04-25 22:14   ` [PATCH 00/19] Proposed server-side patches for 2.6.31 J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090508153323.GA18704@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.