From: Mi Jinlong
Subject: Re: [RFC] After server stop nfslock service, client still can get lock success
Date: Thu, 19 Nov 2009 17:48:23 +0800
Message-ID: <4B051467.70404@cn.fujitsu.com>
References: <4B027123.4060100@cn.fujitsu.com> <84C94F5A-0192-4F6D-858D-0CCA92574625@oracle.com> <4B03C366.8050009@cn.fujitsu.com> <799834E0-C52E-462A-A036-64B4C4DF5C06@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Cc: "Trond.Myklebust", NFSv3 list, "J. Bruce Fields"
To: Chuck Lever
In-Reply-To: <799834E0-C52E-462A-A036-64B4C4DF5C06@oracle.com>
Sender: linux-nfs-owner@vger.kernel.org

Hi

Chuck Lever:
>
> On Nov 18, 2009, at 4:50 AM, Mi Jinlong wrote:
>
>> Hi
>>
>> Chuck Lever:
>>>
>>> On Nov 17, 2009, at 4:47 AM, Mi Jinlong wrote:
>>>
>>>> When testing NLM, I found a bug.
>>>> After the server stops the nfslock service, the client can still
>>>> get a lock successfully.
>>>>
>>>> Test process:
>>>>
>>>> Step1: client opens an NFS file.
>>>> Step2: client uses fcntl to get a lock.
>>>> Step3: client uses fcntl to release the lock.
>>>> Step4: server stops its nfslock service.
>>>> Step5: client uses fcntl to get the lock again.
>>>>
>>>> At step5, the client should fail to get the lock, but it succeeds.
>>>>
>>>> Reason:
>>>> When the server stops the nfslock service, the client's host struct
>>>> is not unmonitored at the server. When the client gets a lock again,
>>>> the client's host struct is reused but is not monitored again.
>>>> So at step5 the client can get the lock successfully.
>>>
>>> Effectively, the client is still monitored, since it is still in statd's
>>> monitored list. Shutting down statd does not remove it from the monitor
>>> list. If the local host reboots, sm-notify will still send the remote
>>> an SM_NOTIFY request, which is correct.
>>>
>>> Additionally, new clients attempting to lock files when statd is down
>>> will fail, which is correct if statd is not available.
>>>
>>> Conversely, if a monitored remote reboots, there is no way to notify the
>>> local lockd of the reboot, since statd normally relays the SM_NOTIFY to
>>> lockd, but isn't running. That might be a problem.
>>
>> Yes, it seems to be a problem.
>>
>> I haven't confirmed it, so I wanted to get your opinion.
>
> Currently, there isn't a high degree of coordination between lockd and
> statd. This is to maintain good scalability when serving NFS lock
> requests. You offered a couple of alternatives for improving this
> specific situation, but my opinion is that there are larger, more
> general coordination issues here, and that what you observed is expected
> behavior for the current design.
>
> This still seems to me like a case of "Patient: Doctor, it hurts when I
> do that." "Doctor: Well, then, don't do that." In other words, we
> assume that "service nfslock stop" won't be used under normal operating
> conditions, and we know that NLM will misbehave if you stop statd during
> normal operation.
>
>>> However, shutting down statd during normal operation is not a normal or
>>> supported thing to do.
>>>
>>>> Question:
>>>> 1. Should the server unmonitor the client's host struct
>>>> when it stops the nfslock service?
>>>>
>>>> 2. Should rpc.statd tell the kernel its status (on start and stop)
>>>> by sending an SM_NOTIFY?
>>>
>>> There are a number of other coordination issues around statd start-up
>>> and shut down. The server's grace period, for instance, is not
>>> synchronized with sending reboot notifications. So, we do recognize
>>> this is a general problem.
>>>
>>> In this case, however, I would expect indeterminate behavior if statd is
>>> shut down during normal operation, and that's exactly what we get. I'm
>>> not sure it's even reasonable to support this use case.
>>> Why would someone shut down statd and expect reliable NFSv2/v3
>>> locking behavior?
>>> In other words, with due respect, what problem would we solve by fixing
>>> this, other than making your test case work?
>>
>> When the server's nfslock service is stopped, the client sometimes
>> gets the lock and sometimes doesn't, which is puzzling.
>
> On Linux, the user space "nfslock" service is actually nothing more than
> statd. Linux's NLM service is handled in the kernel, and is started and
> stopped when either a) there are NFS mounts, or b) NFSD is started. The
> kernel's NLM service has nothing to do with "service nfslock start" any
> more. I think there used to be a user space NLM implementation.
>
>>> Out of curiosity, what happens if you try this on a Solaris server?
>>
>> I'm new to Solaris.
>> When Solaris's nlockmgr is stopped, the client can't get a lock immediately.
>
> I should have been more clear: if you stop Solaris' user space NSM
> daemon, can you lock files consistently? My bet is that Solaris will
> demonstrate a similar degree of inconsistent behavior if you try
> NFSv2/v3 locking while starting and stopping its NSM service daemon.

^_^
You are right: when I stop Solaris's NSM, the client can still get a lock
successfully. Maybe it's the same as Linux.

-- 
Regards
Mi Jinlong
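
For anyone who wants to repeat the five-step test process quoted above, the client side can be sketched roughly as follows. This is only an illustration, not the original test program: the names `try_lock`, `unlock`, and `run_test` and the `pause` callback are made up for this sketch, and step 4 still has to be done by hand on the server.

```python
import fcntl
import os


def try_lock(fd):
    """Try to take an exclusive POSIX (fcntl) lock on the whole file.

    Non-blocking: returns True on success, False if the lock call fails
    (for example with EAGAIN/EACCES on contention or ENOLCK).
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def unlock(fd):
    """Release the lock taken by try_lock()."""
    fcntl.lockf(fd, fcntl.LOCK_UN)


def run_test(path, pause=None):
    """Run the client side of steps 1-5 and return (first, second).

    `pause` is a hypothetical hook for step 4: it is called between the
    two lock attempts so the operator can stop nfslock on the server.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)  # Step1: open the file
    try:
        first = try_lock(fd)          # Step2: get the lock
        if first:
            unlock(fd)                # Step3: release the lock
        if pause is not None:
            pause()                   # Step4: server stops nfslock (manual)
        second = try_lock(fd)         # Step5: get the lock again
        if second:
            unlock(fd)
        return first, second
    finally:
        os.close(fd)
```

On an NFS mount this could be driven with something like `run_test("/mnt/nfs/testfile", pause=lambda: input("stop nfslock on the server, then press Enter"))`, where the mount path is only a placeholder. The behavior reported in this thread corresponds to both values in the returned tuple being true.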