Re: [RFC] server's statd and lockd will not sync after its nfslock restart

From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	Neil Brown <neilb@suse.de>, Steve Dickson <SteveD@redhat.com>,
	NFSv3 list <linux-nfs@vger.kernel.org>,
	Mi Jinlong <mijinlong@cn.fujitsu.com>
Subject: Re: [RFC] server's statd and lockd will not sync after its nfslock restart
Date: Thu, 17 Dec 2009 15:27:44 -0500
Message-ID: <1261081664.4080.18.camel@localhost>
In-Reply-To: <35D45F43-D98F-460E-8060-F7C5F3ADFCFE@oracle.com>

On Thu, 2009-12-17 at 11:18 -0500, Chuck Lever wrote: 
> On Dec 17, 2009, at 5:07 AM, Mi Jinlong wrote:
> > Chuck Lever :
> >> On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
> >>> Chuck Lever:
> >>>> On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
> >>>>> Hi,
> >
> > ...snip...
> >
> >>>>>
> >>>>> The Primary Reason:
> >>>>>
> >>>>> At step 3, when the client's reclaimed lock request is sent to the
> >>>>> server, the client's host (the host struct) is reused but is not
> >>>>> re-monitored by the server's lockd. After that, statd and lockd are
> >>>>> out of sync.
> >>>>
> >>>> The kernel squashes SM_MON upcalls for hosts that it already believes
> >>>> are monitored.  This is a scalability feature.
> >>>
> >>> When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
> >>> /var/lib/nfs/statd/sm.bak/.
> >>
> >> Well, it's really sm-notify that does this.  sm-notify is run by
> >> rpc.statd when it starts up.
> >>
> >> However, sm-notify should only retire the monitor list the first time
> >> it is run after a reboot.  Simply restarting statd should not change the
> >> on-disk monitor list in the slightest.  If it does, there's some kind of
> >> problem with the way sm-notify's pid file is managed, or perhaps with
> >> the nfslock script.
> >
> >  When starting, statd calls the run_sm_notify() function to run sm-notify.
> >  The command "service nfslock restart" causes statd to stop and start,
> >  so sm-notify is run. When sm-notify runs, the on-disk monitor list
> >  is changed.
> >
> >>
> >>> If lockd doesn't send an SM_MON to statd, statd will not monitor the
> >>> clients that were monitored before statd restarted.
> >>>
> >>>>> Question:
> >>>>>
> >>>>> In my opinion, if lockd is allowed to reuse the client's host, it
> >>>>> should send an SM_MON to statd when it does so. If reuse is not
> >>>>> allowed, the client's host should be destroyed immediately.
> >>>>>
> >>>>> What should lockd do?  Reuse?  Destroy?  Or some other action?
> >>>>
> >>>> I don't immediately see why lockd should change its behavior.
> >>>> Perhaps statd/sm-notify were incorrect to delete the monitor list
> >>>> when you restarted the nfslock service?
> >>>
> >>> Sorry, maybe I did not express myself clearly.
> >>> I mean that lockd reuses the host struct that was created before
> >>> statd restarted.
> >>>
> >>> It seems the monitor list was deleted when nfslock restarted.
> >>
> >> lockd does not touch any user space files; the on-disk monitor list is
> >> managed by statd and sm-notify.  A remote peer rebooting does not clear
> >> the "monitored" flag for that peer in the local kernel's lockd, so it
> >> won't send another SM_MON request.
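
For illustration only, the squashing behaviour described above can be
modelled in a few lines of C.  This is a simplified sketch, not lockd's
actual code: struct peer, monitor_peer() and the upcalls counter are
hypothetical names, and the counter merely stands in for the SM_MON
upcall to statd.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a per-peer host record. */
struct peer {
    const char *name;
    bool monitored;     /* set once an SM_MON upcall has been made */
};

/*
 * Ask for the peer to be monitored.  Returns 1 if an upcall was made,
 * 0 if it was squashed because the peer is already flagged as monitored.
 * *upcalls counts the upcalls that would have reached statd.
 */
int monitor_peer(struct peer *p, int *upcalls)
{
    if (p->monitored)
        return 0;       /* already monitored: no new SM_MON */

    (*upcalls)++;       /* stands in for the SM_MON upcall */
    p->monitored = true;
    return 1;
}
```

In this model, nothing that happens in user space resets the flag, so
if the on-disk monitor list is wiped while the flag stays set, a reused
record produces no further upcall and the two sides disagree.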
> >
> >  Yes, that's right.
> >
> >  But this case concerns the server's lockd, not the remote peer.
> >  I think that when the local system's nfslock restarts, the local
> >  kernel's lockd should clear the "monitored" flag in every client's
> >  host struct.
> >
> >>
> >> Now, it may be the case that "service nfslock start" uses a command line
> >> option that forces a fresh sm-notify run, and that is what is wiping the
> >> on-disk monitor list.  That would be the bug in this case -- sm-notify
> >> can and should be allowed to make its own determination of whether the
> >> monitor list gets retired.  Notification should not normally be forced
> >> by command line options in the nfslock script.
> >
> >  A fresh sm-notify run is caused by statd starting.
> >  I found this in the following code.
> >
> > utils/statd/statd.c
> > ...
> >         if (!(run_mode & MODE_NO_NOTIFY))
> >                 switch (pid = fork()) {
> >                 case 0:
> >                         /* child: run sm-notify */
> >                         run_sm_notify(out_port);
> >                         break;
> >                 case -1:
> >                         /* fork failed: no notification run */
> >                         break;
> >                 default:
> >                         /* parent: wait for the child to finish */
> >                         waitpid(pid, NULL, 0);
> >                 }
> > ....
> >
> >
> > I think that when statd restarts and calls sm-notify, the on-disk monitor
> > list is deleted, so lockd should clear the "monitored" flag in every
> > client's host struct. After that, a reused host struct will be
> > re-monitored and its on-disk monitor record re-created. That way, lockd
> > and statd will stay in sync.
> 
> run_sm_notify() simply forks and execs the sm-notify program.  This  
> program checks for the existence of a pid file.  If the pid file  
> exists, then sm-notify exits.  If it does not, then sm-notify retires  
> the records in /var/lib/nfs/statd/sm and posts reboot notifications.
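
As a rough sketch of the pid-file check described above (an
illustration under my own assumptions, not a copy of sm-notify's
source): the function name claim_pid_file() and the path argument are
hypothetical, and the caller is assumed to retire the monitor list and
send notifications only when it returns 1.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to create the pid file at 'path'.
 * Returns 1 if the file was created (treat this as the first run since
 * the file was last removed), 0 if it already exists (exit quietly),
 * and -1 on any other error.
 */
int claim_pid_file(const char *path)
{
    char buf[32];
    int fd, len;

    /* O_EXCL makes "check for the file and create it" one atomic step. */
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return (errno == EEXIST) ? 0 : -1;

    len = snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
    if (write(fd, buf, (size_t)len) != (ssize_t)len) {
        close(fd);
        unlink(path);   /* do not leave a half-written pid file behind */
        return -1;
    }
    close(fd);
    return 1;
}
```

The check is only as good as whatever removes the file: if nothing
deletes it at boot, the function keeps returning 0, and if something
deletes it on every service start, it keeps returning 1.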
> 
> Jeff Layton pointed out to me yesterday that Red Hat's nfslock script  
> unconditionally deletes sm-notify's pid file every time "service  
> nfslock start" is done, which effectively defeats sm-notify's reboot  
> detection.
> 
> sm-notify was written by a developer at SuSE.  SuSE Linux uses a tmpfs  
> for /var/run, but Red Hat uses permanent storage for this directory.   
> Thus on SuSE, the pid file gets deleted automatically by a reboot, but  
> on Red Hat, the pid file must be deleted "by hand" or reboot  
> notification never occurs.
> 
> So the root cause of this problem is that the current mechanism
> sm-notify uses to detect a reboot is not portable across distributions.
> 
> My new-statd prototype used a semaphore instead of a pid file to detect
> reboots.  A semaphore is shared (visible to other processes) and will
> continue to exist until it is deleted or the system reboots.  It is a
> resource that is not destroyed automatically when the sm-notify
> process exits.  If creating the semaphore fails, sm-notify exits.  If
> creating it succeeds, it runs.
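
A minimal sketch of that create-or-exit logic, assuming a System V
semaphore set is used (the thread does not say which semaphore API the
prototype relied on).  The function name claim_boot_semaphore() and the
choice of key are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Try to create a one-element semaphore set under 'key'.
 * Returns 1 if it was created (first run since boot, or since the set
 * was last removed), 0 if it already exists, and -1 on any other error.
 * The set is deliberately not removed on exit, so it outlives the
 * calling process.
 */
int claim_boot_semaphore(key_t key)
{
    /* IPC_EXCL makes the call fail with EEXIST if the set exists. */
    if (semget(key, 1, IPC_CREAT | IPC_EXCL | 0600) >= 0)
        return 1;

    return (errno == EEXIST) ? 0 : -1;
}
```

The key must be a fixed value other than IPC_PRIVATE, and it shares a
system-wide space with every other user of System V semaphores, which
is one form the name space concern could take.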
> 
> Would anyone strongly object to using a semaphore instead of a pid file
> here?  Is support for semaphores always built into kernels?  Would
> there be any problems with the small size of the semaphore name space?
> Is there another similar facility that might be better?
> 

One alternative might be to just record the kernel's random boot_id in
the pid file. That gets regenerated on each boot, so should be unique.
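
Roughly, and only as a sketch: compare the current boot_id with the one
saved on the previous run, and treat a mismatch (or a missing file) as
"rebooted".  On Linux the current value can be read from
/proc/sys/kernel/random/boot_id.  The function names, the buffer size
and the idea of passing both paths as arguments are assumptions made
for illustration.

```c
#include <stdio.h>
#include <string.h>

#define BOOT_ID_BUF 64

/* Read the first line of 'path' into buf, without the trailing newline. */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/*
 * Returns 1 if the saved boot_id is missing or differs from the current
 * one (and saves the current one), 0 if they match, -1 on error.
 */
int rebooted_since_last_run(const char *boot_id_path, const char *state_path)
{
    char current[BOOT_ID_BUF], saved[BOOT_ID_BUF];
    FILE *f;

    if (read_first_line(boot_id_path, current, sizeof(current)) < 0)
        return -1;

    if (read_first_line(state_path, saved, sizeof(saved)) == 0 &&
        strcmp(current, saved) == 0)
        return 0;       /* same boot: nothing to do */

    f = fopen(state_path, "w");
    if (!f)
        return -1;
    if (fprintf(f, "%s\n", current) < 0) {
        fclose(f);
        return -1;
    }
    return (fclose(f) == 0) ? 1 : -1;
}
```

With this scheme the file may live on permanent storage and never needs
to be deleted by an init script, since a stale copy simply fails the
comparison after the next boot.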

Trond
