RFC: merging sm-notify and rpc.statd

Linux NFS development

* RFC: merging sm-notify and rpc.statd
@ 2009-05-19 14:36 Chuck Lever
  2009-05-19 22:39 ` Neil Brown
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2009-05-19 14:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: Linux NFS mailing list

Hi Neil-

As part of IPv6 support for NFS, I've been looking at rpc.statd and
sm-notify.  IPv6 support touches so many parts of both, and the current
open-coded RPC request schedulers in both can't support netids without
major revision or replacement.  So I've decided to write a replacement
instead of grafting in support for IPv6 to the current implementation.

For many reasons I'm thinking of merging sm-notify and rpc.statd back  
together.  The two were split only a few years ago, and it seems to me  
that it was done to support SuSE's in-kernel statd, which has since  
been effectively abandoned.

Having the two separated has ushered in a host of minor  
complications.  Packaging and init-scripts are more complicated.  Both  
executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
There are two separate man pages that share a lot of the same content.

So, what do you think about folding sm-notify back into rpc.statd?   
Steve suggested there may have been a customer issue that drove the  
separation.  Do you have any recollection of the issues?

For the rest of the list: are there strong dependencies outside RH and  
SuSE distributions that would require a separate sm-notify  
executable?  Any other issues?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: merging sm-notify and rpc.statd
  2009-05-19 14:36 RFC: merging sm-notify and rpc.statd Chuck Lever
@ 2009-05-19 22:39 ` Neil Brown
       [not found]   ` <18963.13619.563465.804193-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Neil Brown @ 2009-05-19 22:39 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS mailing list

On Tuesday May 19, chuck.lever@oracle.com wrote:
> Hi Neil-
> 
> As part of IPv6 support for NFS, I've been looking at rpc.statd and sm- 
> notify.  IPv6 support touches so many parts of both, and the current  
> open-coded RPC request schedulers in both can't support netids without  
> major revision or replacement.  So I've decided to write a replacement  
> instead of grafting in support for IPv6 to the current implementation.
> 
> For many reasons I'm thinking of merging sm-notify and rpc.statd back  
> together.  The two were split only a few years ago, and it seems to me  
> that it was done to support SuSE's in-kernel statd, which has since  
> been effectively abandoned.
> 
> Having the two separated has ushered in a host of minor  
> complications.  Packaging and init-scripts are more complicated.  Both  
> executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
> There are two separate man pages that share a lot of the same content.
> 
> So, what do you think about folding sm-notify back into rpc.statd?   
> Steve suggested there may have been a customer issue that drove the  
> separation.  Do you have any recollection of the issues?
> 
> For the rest of the list: are there strong dependencies outside RH and  
> SuSE distributions that would require a separate sm-notify  
> executable?  Any other issues?

While the separation of sm-notify was presumably driven by the suse
in-kernel statd, that wasn't the reason that I copied the idea in
nfs-utils.

sm-notify and statd really have two very different tasks.

sm-notify :
   - is a 'client' for the "SM" protocol.
   - must be run at boot time, and after that is not needed.


statd :
   - is a 'server' for the "SM" protocol.
   - only needs to be running when either nfsd is running or an 
     nfs mount which supports locks is active

Thus I feel they are conceptually quite distinct.

It is probably true that they could share a slab of code, and putting
that code in a common .c file would make a lot of sense.

I am not strongly against re-uniting them.  However before doing that,
I think it would be a good idea to collect a list of the problems that
would be solved by unifying them, and then asking the question: is
unifying them the only or best solution to these problems?

Thanks,
NeilBrown

* Re: RFC: merging sm-notify and rpc.statd
       [not found]   ` <18963.13619.563465.804193-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-05-19 23:25     ` Mike Frysinger
  2009-05-20  1:05       ` NeilBrown
  2009-05-20 16:38     ` Chuck Lever
  1 sibling, 1 reply; 8+ messages in thread
From: Mike Frysinger @ 2009-05-19 23:25 UTC (permalink / raw)
  To: Neil Brown; +Cc: Chuck Lever, Linux NFS mailing list

On Tuesday 19 May 2009 18:39:47 Neil Brown wrote:
> sm-notify :
>    - is a 'client' for the "SM" protocol.
>    - must be run at boot time, and after that is not needed.
>
> statd :
>    - is a 'server' for the "SM" protocol.
>    - only needs to be running when either nfsd is running or an
>      nfs mount which supports locks is active

that last part -- any nfs mount with locks -- means that pretty much every nfs 
client out there needs it running.

sm-notify is pretty minuscule, so the overhead of having that run on a server 
is negligible, especially when combined with the already required statd.
-mike


* Re: RFC: merging sm-notify and rpc.statd
  2009-05-19 23:25     ` Mike Frysinger
@ 2009-05-20  1:05       ` NeilBrown
       [not found]         ` <bdacaae74fbd57ee96599286eae43751.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2009-05-20  1:05 UTC (permalink / raw)
  To: Mike Frysinger; +Cc: Chuck Lever, Linux NFS mailing list

On Wed, May 20, 2009 9:25 am, Mike Frysinger wrote:
> On Tuesday 19 May 2009 18:39:47 Neil Brown wrote:
>> sm-notify :
>>    - is a 'client' for the "SM" protocol.
>>    - must be run at boot time, and after that is not needed.
>>
>> statd :
>>    - is a 'server' for the "SM" protocol.
>>    - only needs to be running when either nfsd is running or an
>>      nfs mount which supports locks is active
>
> that last part -- any nfs mount with locks -- means that pretty much every
> nfs
> client out there needs it running.
>
> sm-notify is pretty minuscule, so the overhead of having that run on a
> server
> is negligible, especially when combined with the already required statd.
> -mike
>

The point is that sm-notify should really be run at boot whenever
it is installed.
statd, being a server that listens for requests from the network, should
only be run if it is needed (because most people like the policy of
only running network services that are actually needed).

If all nfs mounts are performed manually, then you might not want to
run statd for quite a long time after boot.  But you need to run
sm-notify immediately after boot to ensure that any locks you held
before the reboot get released.

They really are separate functions.

NeilBrown

* Re: RFC: merging sm-notify and rpc.statd
       [not found]         ` <bdacaae74fbd57ee96599286eae43751.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
@ 2009-05-20  1:10           ` Ben Greear
  0 siblings, 0 replies; 8+ messages in thread
From: Ben Greear @ 2009-05-20  1:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Frysinger, Chuck Lever, Linux NFS mailing list

NeilBrown wrote:
> They really are separate functions.
>   
Maybe have one code base and have it do different things based on a cmd-line
argument or the name (use a symlink for one of the functions) ?

If the code is really similar, that should allow easy reuse?

Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



* Re: RFC: merging sm-notify and rpc.statd
       [not found]   ` <18963.13619.563465.804193-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2009-05-19 23:25     ` Mike Frysinger
@ 2009-05-20 16:38     ` Chuck Lever
  2009-05-21  0:01       ` Neil Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2009-05-20 16:38 UTC (permalink / raw)
  To: Neil Brown; +Cc: Linux NFS mailing list

On May 19, 2009, at 6:39 PM, Neil Brown wrote:
> On Tuesday May 19, chuck.lever@oracle.com wrote:
>> Hi Neil-
>>
>> As part of IPv6 support for NFS, I've been looking at rpc.statd and  
>> sm-
>> notify.  IPv6 support touches so many parts of both, and the current
>> open-coded RPC request schedulers in both can't support netids  
>> without
>> major revision or replacement.  So I've decided to write a  
>> replacement
>> instead of grafting in support for IPv6 to the current  
>> implementation.
>>
>> For many reasons I'm thinking of merging sm-notify and rpc.statd back
>> together.  The two were split only a few years ago, and it seems to  
>> me
>> that it was done to support SuSE's in-kernel statd, which has since
>> been effectively abandoned.
>>
>> Having the two separated has ushered in a host of minor
>> complications.  Packaging and init-scripts are more complicated.   
>> Both
>> executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
>> There are two separate man pages that share a lot of the same  
>> content.
>>
>> So, what do you think about folding sm-notify back into rpc.statd?
>> Steve suggested there may have been a customer issue that drove the
>> separation.  Do you have any recollection of the issues?
>>
>> For the rest of the list: are there strong dependencies outside RH  
>> and
>> SuSE distributions that would require a separate sm-notify
>> executable?  Any other issues?
>
> While the separation of sm-notify was presumably driven by the suse
> in-kernel statd, that wasn't the reason that I copied the idea in
> nfs-utils.
>
> sm-notify and statd really have two very different tasks.
>
> sm-notify :
>   - is a 'client' for the "SM" protocol.
>   - must be run at boot time, and after that is not needed.

> statd :
>   - is a 'server' for the "SM" protocol.
>   - only needs to be running when either nfsd is running or an
>     nfs mount which supports locks is active
>
> Thus I feel they are conceptually quite distinct.

There are details that make it not such a clean conceptual break:

  o  Who manages the NSM state number?  sm-notify sends it out to
remote peers, and statd returns it in SM_MON and SM_UNMON replies.
There has to be some co-ordination of how the state number is
updated.  If sm-notify runs separately (for example, with the
"--force" option) and updates the state number, how does statd know
there's a new state number?  If lockd isn't loaded and running when
sm-notify runs, how is the kernel going to get the right NSM state
number?

  o  statd still has client duties: it has to post NLM callbacks to  
the local lockd.  Sending notifications to remote peers is not so  
different from that, conceptually.  One could argue, therefore, that  
we should split that piece out of statd as well, but that would mean  
we fork/exec every time we get an unauthenticated SM_NOTIFY request  
from a monitored peer.  That exposes a DoS vulnerability.

  o  statd has to wait while sm-notify copies the monitor list.  It
really shouldn't accept SM_MON requests while the notification list is
created.  But if it waits for long, it will appear that the NSM
service has died.  So there is some non-trivial synchronization
between the two, and that appears to be split between statd and
sm-notify today (and that synchronization requirement isn't documented
in any way).

  o  statd has to fire up sm-notify when it receives SM_SIMU_CRASH.
Today our lockd doesn't send that, but it could in the future.  So,
sm-notify is not strictly an "only-at-reboot" kind of affair.

  o  sm-notify tries to do a sync(2) to make sure that the file system  
state is made permanent after an NSM state update.  Bruce has  
suggested doing the sync only after the first SM_MON (to reduce  
overhead during system boot), but that moves the sync(2) far away from  
the logic that updates the state number.  That exposes us to NSM state  
number walk-back if the system crashes at the wrong time.  It's  
arguable how much of a problem that is.

  o  It is better to send notifications when lockd is up.  For  
clients, at least, lockd comes up only after the first NFS mount, and  
in automounter scenarios, that may not be for some time after a  
reboot.  Servers may not start nfslock until they do "service nfslock  
start; service nfs start" at some point possibly long after reboot.   
So should clients be notified right when the server peer starts up, or  
after the server peer has fired up its NFSD and lockd service?

  o  Those who package statd/sm-notify have to understand how these
operate.  The people who create system init-scripts are generally not
NFS experts, thus they must have local knowledge about statd and
sm-notify in order to get this all correct.  It would be more fool-proof
if we hard-coded the start-up behavior, and took it out of the hands
of the init-scripts folks, whom we do not control.  How do we document
the operational dependencies in a way that makes it very hard for
non-NFS folks to set this up incorrectly?  One way is to build it all
in a single program.

> It is probably true that they could share a slab of code, and putting
> that code in a common .c file would make a lot of sense.

Yes, I've started doing that to try to understand what code can be  
shared.

> I am not strongly against re-uniting them.  However before doing that,
> I think it would be a good idea to collect a list of the problems that
> would be solved by unifying them, and then asking the question: is
> unifying them the only or best solution to these problems?

Agreed.  See above.

If there are one or more strong reasons to keep these separate, I can  
go down that road.  But I think the practical matters of making NSM  
work in multiple Linux distributions, each with their own packaging  
and init-script mechanisms and requirements, suggests we'd be better  
off making it simple to get this right.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

* Re: RFC: merging sm-notify and rpc.statd
  2009-05-20 16:38     ` Chuck Lever
@ 2009-05-21  0:01       ` Neil Brown
       [not found]         ` <18964.39373.232045.96215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Neil Brown @ 2009-05-21  0:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS mailing list

On Wednesday May 20, chuck.lever@oracle.com wrote:
> On May 19, 2009, at 6:39 PM, Neil Brown wrote:
> > On Tuesday May 19, chuck.lever@oracle.com wrote:
> >> Hi Neil-
> >>
> >> As part of IPv6 support for NFS, I've been looking at rpc.statd and  
> >> sm-
> >> notify.  IPv6 support touches so many parts of both, and the current
> >> open-coded RPC request schedulers in both can't support netids  
> >> without
> >> major revision or replacement.  So I've decided to write a  
> >> replacement
> >> instead of grafting in support for IPv6 to the current  
> >> implementation.
> >>
> >> For many reasons I'm thinking of merging sm-notify and rpc.statd back
> >> together.  The two were split only a few years ago, and it seems to  
> >> me
> >> that it was done to support SuSE's in-kernel statd, which has since
> >> been effectively abandoned.
> >>
> >> Having the two separated has ushered in a host of minor
> >> complications.  Packaging and init-scripts are more complicated.   
> >> Both
> >> executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
> >> There are two separate man pages that share a lot of the same  
> >> content.
> >>
> >> So, what do you think about folding sm-notify back into rpc.statd?
> >> Steve suggested there may have been a customer issue that drove the
> >> separation.  Do you have any recollection of the issues?
> >>
> >> For the rest of the list: are there strong dependencies outside RH  
> >> and
> >> SuSE distributions that would require a separate sm-notify
> >> executable?  Any other issues?
> >
> > While the separation of sm-notify was presumably driven by the suse
> > in-kernel statd, that wasn't the reason that I copied the idea in
> > nfs-utils.
> >
> > sm-notify and statd really have two very different tasks.
> >
> > sm-notify :
> >   - is a 'client' for the "SM" protocol.
> >   - must be run at boot time, and after that is not needed.
> 
> > statd :
> >   - is a 'server' for the "SM" protocol.
> >   - only needs to be running when either nfsd is running or an
> >     nfs mount which supports locks is active
> >
> > Thus I feel they are conceptually quite distinct.
> 
> There are details that make it not such a clean conceptual break:
> 
>   o  Who manages the NSM state number?  sm-notify sends it out to  
> remote peers, and statd returns it in SM_MON and SM_UNMON replies.   
> There has to be some co-ordination of how the state number is  
> updated.  If sm-notify runs separately (for example, with the "-- 
> force" option) and updates the state number, how does statd know  
> there's a new state number?  If lockd isn't loaded and running when sm- 
> notify runs, how is the kernel going to get the right NSM state number?

sm-notify manages the state number.
statd must ensure that sm-notify has run before it reads the number
from the file.  As sm-notify has its own locking to ensure it is run
only once, statd simply runs sm-notify before proceeding.
sm-notify explicitly tells the kernel what the state number is.

If the lockd module isn't loaded when sm-notify runs, that might be a
small problem.  I'd have to remind myself of all the details of the
lockd protocols to be sure what was needed.  Maybe statd should pass
the state number to lockd when it first hears from lockd.

> 
>   o  statd still has client duties: it has to post NLM callbacks to  
> the local lockd.  Sending notifications to remote peers is not so  
> different from that, conceptually.  One could argue, therefore, that  
> we should split that piece out of statd as well, but that would mean  
> we fork/exec every time we get an unauthenticated SM_NOTIFY request  
> from a monitored peer.  That exposes a DoS vulnerability.

Yes, client duties.  But a client for a different protocol.
I think we have a strawman argument here.  I would certainly never
suggest that the lockd call back should be done by a separate process.

At its core, statd works like this:
   lockd says to statd "Tell me if X restarts, and tell X if I restart".
   So statd listens for X to say "I have restarted" and passes that on
   to lockd.
   Statd cannot directly tell X that it has restarted because it will
   have died first.  So it leaves a note (on the fridge) for someone
   else to do it.  That "someone else" is sm-notify.
   So sm-notify is running on behalf of the statd from before the last
   reboot.  In that sense it is quite separate from the currently
   running statd.

> 
>   o  statd has to wait while sm-notify copies the monitor list.  It  
> really shouldn't accept SM_MON requests while the notification list is  
> created.  But if it waits for long, it will appear that the NSM  
> service has died.  So there is some non-trivial synchronization  
> between the two, and that appears to be split between statd and sm- 
> notify today (and that synchronization requirement isn't documented in  
> any way).

Sounds like there could be an implementation problem here.
I don't think sm-notify needs to copy the monitor list exactly.  It
just needs to move it out of the way so statd has a clean slate.
   mv /var/lib/nfs/sm /var/lib/nfs/sm.bak.$UNIQUE
   mkdir /var/lib/nfs/sm
   # let statd continue
   # shuffle through files in /var/lib/nfs/sm.bak*

And while I agree that more documentation is a good thing, I think the
synchronization is enforced so documentation isn't essential.
statd runs sm-notify before doing anything.  sm-notify does the
minimum for synchronization before forking and exiting and allowing
statd to continue. (or maybe not as I discover below)
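
The move-aside step above can be sketched as follows.  This is only an
illustration in Python, not nfs-utils code: the function names and the
"sm.bak.<hex>" naming scheme are assumptions (the mail only says
$UNIQUE), and the sketch relies on rename being atomic within one file
system.

```python
import os
import tempfile
import uuid


def rotate_monitor_dir(base):
    """Move base/sm aside and recreate it empty; return the backup path."""
    sm = os.path.join(base, "sm")
    # Hypothetical naming scheme standing in for $UNIQUE.
    backup = os.path.join(base, "sm.bak." + uuid.uuid4().hex)
    os.rename(sm, backup)  # atomic when both paths are on one file system
    os.mkdir(sm)           # statd can continue with a clean slate
    return backup


def pending_hosts(base):
    """List (backup_dir_name, host_file) pairs still awaiting notification."""
    pending = []
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name)
        if name.startswith("sm.bak") and os.path.isdir(path):
            for host in sorted(os.listdir(path)):
                pending.append((name, host))
    return pending


if __name__ == "__main__":
    # Demonstrate on a scratch directory rather than /var/lib/nfs.
    with tempfile.TemporaryDirectory() as base:
        os.mkdir(os.path.join(base, "sm"))
        open(os.path.join(base, "sm", "client.example.net"), "w").close()
        rotate_monitor_dir(base)
        print(pending_hosts(base))
```

Because the rename is a single operation, statd never sees a
half-copied list; the notifier can then work through the sm.bak*
directories at its own pace.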

> 
>   o  statd has to fire up sm-notify when it receives SM_SIMU_CRASH.   
> Today our lockd doesn't send that, but it could in the future.  So, sm- 
> notify is not strictly an "only-at-reboot" kind of affair.

True, but not a strong case for anything I would think.

> 
>   o  sm-notify tries to do a sync(2) to make sure that the file system  
> state is made permanent after an NSM state update.  Bruce has  
> suggested doing the sync only after the first SM_MON (to reduce  
> overhead during system boot), but that moves the sync(2) far away from  
> the logic that updates the state number.  That exposes us to NSM state  
> number walk-back if the system crashes at the wrong time.  It's  
> arguable how much of a problem that is.

Sounds like there is room for improvement here, definitely.

This is only a half-formed idea, but:
  sm-notify could update 'state' to an odd number if it is even, but
     not sync anything
  statd, on the first SM_MON, updates 'state' to an even number if it
     was odd and in that case does the required sync.

 I would need to check the protocol and the code and do a bit of case
 analysis to be sure I had that right, but I suspect it is close.
 (or it could be made completely irrelevant but subsequent
  observations.  Read on!)
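
To make the half-formed idea above concrete, here is a small Python
sketch of the two halves.  It follows the wording of the idea
literally and has not been checked against the NSM protocol; the
function names and the plain-text format of the state file are
assumptions made for illustration.

```python
import os
import tempfile


def read_state(path):
    """Read the state number (stored as text in this sketch)."""
    with open(path) as f:
        return int(f.read().strip() or 0)


def write_state(path, value, sync):
    """Write the state number, optionally forcing it to stable storage."""
    with open(path, "w") as f:
        f.write(f"{value}\n")
        if sync:
            f.flush()
            os.fsync(f.fileno())


def boot_bump(path):
    """sm-notify's half: make the number odd if it is even; no sync."""
    state = read_state(path)
    if state % 2 == 0:
        state += 1
        write_state(path, state, sync=False)
    return state


def first_mon_bump(path):
    """statd's half on the first SM_MON: make it even if odd, then sync."""
    state = read_state(path)
    if state % 2 == 1:
        state += 1
        write_state(path, state, sync=True)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "state")
        write_state(p, 4, sync=False)
        print(boot_bump(p), first_mon_bump(p))
```

Both steps are idempotent, so running either one twice does not move
the number again; only the second step pays for a sync.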

> 
>   o  It is better to send notifications when lockd is up.  For  
> clients, at least, lockd comes up only after the first NFS mount, and  
> in automounter scenarios, that may not be for some time after a  
> reboot.  Servers may not start nfslock until they do "service nfslock  
> start; service nfs start" at some point possibly long after reboot.   
> So should clients be notified right when the server peer starts up, or  
> after the server peer has fired up its NFSD and lockd service?
> 

When a client notifies a server that it has rebooted, the server
simply drops the locks.  There is no need for the client lockd to be
running.

When a server notifies a client that it has rebooted, the client tries
to reclaim the locks.  So the server lockd *must* be running at that
time.  It is not a case of 'better'.  It is 'must'.
So if a machine is an NFS server that plans to keep serving, it must
start nfsd (and hence lockd) before running sm-notify.

However it is good to have statd running before lockd, as lockd needs
to talk to statd.
So the order seems to be:
  statd
  nfsd and hence lockd
  sm-notify

which is clearly documented in the README, but seems to disagree with
what we said above :-)
We want to clean out the 'sm' directory, then run statd/nfsd/lockd,
then sm-notify reads the sm.bak and sends off the notifications.

There does seem to be room for improvement here.  And I feel that
having sm-notify separate actually makes it easier to get this
right...

How about this for a bit of a left-field idea:
 - files representing monitored hosts are stored in
         /var/lib/nfs/sm.$STATE
 - At reboot, /var/lib/nfs/state is incremented (twice?) but not
    synced.
 - statd, on first SM_MON creates /var/lib/nfs/sm.$STATE if needed,
   based on the value in the 'state' file, and does the required
   sync at that point
 - sm-notify can be run at any time after nfsd (if required) is
   started, and sends notifications to any host in an sm.$STATE
   directory where $STATE < 'state'.  The 'state' number in the
   notification is $STATE (or is it $STATE+1??)

>   o  Those who package statd/sm-notify have to understand how these  
> operate.  The people who create system init-scripts are generally not  
> NFS experts, thus they must have local knowledge about statd and sm- 
> notify in order to get this all correct.  It would be more fool-proof  
> if we hard-coded the start-up behavior, and took it out of the hands  
> of the init-scripts folks, whom we do not control.  How do we document  
> the operational dependencies in a way that makes it very hard for non- 
> NFS folks to set this up incorrectly?  One way is to build it all in a  
> single program.

That is a strong argument.  It is probably part of the argument for
putting it all in the kernel too.
A valid question is: *can* we build it all into a single program?

Given that:
  state and sm need to be updated before statd responds to SM_MON
  statd should be ready to respond to SM_MON before lockd starts
  exportfs -av must be run before nfsd starts
  nfsd and lockd must start before notifications are sent on a server
  notifications (from the server) must be sent promptly after
     nfsd starts its grace period.

I find it hard to see a single statd being able to do the whole thing.

We have a 'README' to document the order.  We could provide a sample
startup script.  I don't think we *can* provide a "get it all right"
program.
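
For what it's worth, the first four constraints listed above can be
written down as a dependency graph and checked mechanically.  The
sketch below uses Python's graphlib (3.9 or later); the step names are
just labels chosen for illustration, and the "promptly after the grace
period starts" timing constraint is not modelled at all.

```python
from graphlib import TopologicalSorter

# Each key must start after everything in its value set.
STARTUP_DEPS = {
    "statd": {"update state and sm"},
    "lockd": {"statd"},
    "nfsd": {"exportfs -av"},
    "send notifications": {"nfsd", "lockd"},
}


def startup_order(deps=STARTUP_DEPS):
    """Return one start-up order that satisfies every constraint."""
    return list(TopologicalSorter(deps).static_order())


def violations(order, deps=STARTUP_DEPS):
    """Return sorted (prerequisite, step) pairs that an order gets wrong."""
    pos = {step: i for i, step in enumerate(order)}
    bad = []
    for step, before in deps.items():
        for dep in before:
            if dep not in pos or step not in pos or pos[dep] > pos[step]:
                bad.append((dep, step))
    return sorted(bad)


if __name__ == "__main__":
    print(startup_order())
    # An init script that sends notifications first breaks two constraints.
    print(violations(["send notifications", "update state and sm", "statd",
                      "lockd", "exportfs -av", "nfsd"]))
```

A check like violations() is the sort of thing a sample startup script
could be tested against, even if no single program can enforce the
order by itself.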

> 
> If there are one or more strong reasons to keep these separate, I can  
> go down that road.  But I think the practical matters of making NSM  
> work in multiple Linux distributions, each with their own packaging  
> and init-script mechanisms and requirements, suggests we'd be better  
> off making it simple to get this right.

"simple to get this right" is certainly good.
But "right" must over-rule "simple", and it seems like we might not
even really be at "right" yet. :-(

Maybe the way to make sure people get it to work is to detect broken
configurations and fail horribly...
So:
  sm-notify performs its own /var/run locking to make sure it is only
   run once (plus allow for --simu-crash??)
   It quickly updates /var/lib/nfs/ (with no sync) and then checks
   to see if mountd is running.  If it is, it assumes 'server' and
   waits a while for lockd to appear (both checks via portmap).
   Once lockd is running (or mountd was not), it sends out
   notifications.
  mountd checks if sm-notify has already run (via the /var/run file),
   and complains gently, maybe only if it is less than a few minutes
   after boot.  e.g.
    WARNING: during boot, mountd must be run before sm-notify!

  statd always runs sm-notify first and waits for it to exit, which
    it does once it has moved things aside and updated 'state'.
    On the first SM_MON call, statd calls 'fsync' on 'state' and
    related directories, and writes the 'state' value to the
    kernel....  which is moments too late.  The kernel has already
    used it.  Maybe we need a call to nsm_monitor in nlmclnt_proc,
    and maybe _reclaim and _cancel too - not sure

  mount.nfs makes sure statd is running - we already have that.

  rpc.nfsd can complain if statd is not already running, or maybe
    even just start it.

That, I think, should enforce some of the ordering, and complain
if other ordering requirements aren't met.

And just for the record: my strongest argument for keeping them
separate is that statd (being a network service) should only be started
if and when it is actually needed, while sm-notify should always be
run at boot in case it has some cleaning up to do.

Thanks,
NeilBrown

* Re: RFC: merging sm-notify and rpc.statd
       [not found]         ` <18964.39373.232045.96215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-05-21 17:14           ` Chuck Lever
  0 siblings, 0 replies; 8+ messages in thread
From: Chuck Lever @ 2009-05-21 17:14 UTC (permalink / raw)
  To: Neil Brown; +Cc: Linux NFS mailing list

On May 20, 2009, at 8:01 PM, Neil Brown wrote:
> On Wednesday May 20, chuck.lever@oracle.com wrote:
>> On May 19, 2009, at 6:39 PM, Neil Brown wrote:
>>> On Tuesday May 19, chuck.lever@oracle.com wrote:
>>>> Hi Neil-
>>>>
>>>> As part of IPv6 support for NFS, I've been looking at rpc.statd and
>>>> sm-
>>>> notify.  IPv6 support touches so many parts of both, and the  
>>>> current
>>>> open-coded RPC request schedulers in both can't support netids
>>>> without
>>>> major revision or replacement.  So I've decided to write a
>>>> replacement
>>>> instead of grafting in support for IPv6 to the current
>>>> implementation.
>>>>
>>>> For many reasons I'm thinking of merging sm-notify and rpc.statd  
>>>> back
>>>> together.  The two were split only a few years ago, and it seems to
>>>> me
>>>> that it was done to support SuSE's in-kernel statd, which has since
>>>> been effectively abandoned.
>>>>
>>>> Having the two separated has ushered in a host of minor
>>>> complications.  Packaging and init-scripts are more complicated.
>>>> Both
>>>> executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
>>>> There are two separate man pages that share a lot of the same
>>>> content.
>>>>
>>>> So, what do you think about folding sm-notify back into rpc.statd?
>>>> Steve suggested there may have been a customer issue that drove the
>>>> separation.  Do you have any recollection of the issues?
>>>>
>>>> For the rest of the list: are there strong dependencies outside RH
>>>> and
>>>> SuSE distributions that would require a separate sm-notify
>>>> executable?  Any other issues?
>>>
>>> While the separation of sm-notify was presumably driven by the suse
>>> in-kernel statd, that wasn't the reason that I copied the idea in
>>> nfs-utils.
>>>
>>> sm-notify and statd really have two very different tasks.
>>>
>>> sm-notify :
>>>  - is a 'client' for the "SM" protocol.
>>>  - must be run at boot time, and after that is not needed.
>>
>>> statd :
>>>  - is a 'server' for the "SM" protocol.
>>>  - only needs to be running when either nfsd is running or an
>>>    nfs mount which supports locks is active
>>>
>>> Thus I feel they are conceptually quite distinct.
>>
>> There are details that make it not such a clean conceptual break:
>>
>>  o  Who manages the NSM state number?  sm-notify sends it out to
>> remote peers, and statd returns it in SM_MON and SM_UNMON replies.
>> There has to be some co-ordination of how the state number is
>> updated.  If sm-notify runs separately (for example, with the "--
>> force" option) and updates the state number, how does statd know
>> there's a new state number?  If lockd isn't loaded and running when  
>> sm-
>> notify runs, how is the kernel going to get the right NSM state  
>> number?
>
> sm-notify manages the state number.
> statd must ensure that sm-notify has run before it reads the number
> from the file.  As sm-notify has its own locking to ensure it is run
> only once, statd simply runs sm-notify before proceeding.
> sm-notify explicitly tells the kernel what the state number is.

Except in the SM_SIMU_CRASH case.  sm-notify updates the on-disk state  
number, but today, statd reads the state number once at start-up, and  
never updates it.  So it would miss that case; lockd and statd would  
continue to advertise the old state number.  (I think statd is also  
supposed to simulate a crash if it gets SIGUSR1).

> If the lockd module isn't loaded when sm-notify runs, that might be a
> small problem.

That is a frequent problem on today's clients.  lockd isn't loaded by
/etc/init.d/nfslock unless there are module parameters specified (which
in most cases, there aren't).  The state number is also lost if, for
instance, the number of NFS mounts goes to zero and lockd is
unloaded.  This can easily happen on clients that manage their NFS
mounts with automounter.

In my experience our clients almost always send a zero state number  
today.

One could even go so far as to argue that an unload-load of lockd  
counts as a reboot (in terms of NSM state number management), and thus  
we should increment the NSM state number in that case to ensure that  
clients and servers start with a clean slate.
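
A minimal sketch of what "increment the NSM state number" could mean in that case.  It assumes the usual NSM convention that an odd state number means "up" and an even one means "down"; the helper name next_up_state is hypothetical and is not taken from nfs-utils.

```python
def next_up_state(state: int) -> int:
    """Return the smallest odd number strictly greater than `state`.

    Assumption: odd state numbers mean "host is up", so a fresh
    (re)start always has to land on the next odd value.
    """
    if state % 2:
        return state + 2   # already odd: skip over the even "down" value
    return state + 1       # even (or zero): the next number is odd
```

For example, a client that has been advertising zero would move to 1, and one that was at 5 would move to 7.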

>  I'd have to remind myself of all the details of the
> lockd protocols to be sure what was needed.  Maybe statd should tell
> it to lockd when it first hears from lockd.

I've sent a patch to Trond to change lockd to pick up the state number  
from SM_MON replies.  lockd could also do an SM_UNMON_ALL when it is  
first loaded, and pick up the state number from its reply.

>>  o  statd still has client duties: it has to post NLM callbacks to
>> the local lockd.  Sending notifications to remote peers is not so
>> different from that, conceptually.  One could argue, therefore, that
>> we should split that piece out of statd as well, but that would mean
>> we fork/exec every time we get an unauthenticated SM_NOTIFY request
>> from a monitored peer.  That exposes a DoS vulnerability.
>
> Yes, client duties.  But a client for a different protocol.
> I think we have a strawman argument here.  I would certainly never
> suggest that the lockd call back should be done by a separate process.
>
> At its core, statd works like this:
>   lockd says to statd "Tell me if X restarts, and tell X if I
>   restart".
>   So statd listens for X to say "I have restarted" and passes that on
>   to lockd.
>   Statd cannot directly tell X that it has restarted because it will
>   have died first.  So it leaves a note (on the fridge) for someone
>   else to do it.  That "someone else" is sm-notify.
>   So sm-notify is running on behalf of the statd from before the last
>   reboot.  In that sense it is quite separate from the currently
>   running statd.
>
>>  o  statd has to wait while sm-notify copies the monitor list.  It
>> really shouldn't accept SM_MON requests while the notification list
>> is created.  But if it waits for long, it will appear that the NSM
>> service has died.  So there is some non-trivial synchronization
>> between the two, and that appears to be split between statd and
>> sm-notify today (and that synchronization requirement isn't
>> documented in any way).
>
> Sounds like there could be an implementation problem here.
> I don't think sm-notify needs to copy the monitor list exactly.  It
> just needs to move it out of the way so statd has a clean slate.
>   mv /var/lib/nfs/sm /var/lib/nfs/sm.bak.$UNIQUE
>   mkdir /var/lib/nfs/sm
>   # let statd continue
>   # shuffle through files in /var/lib/nfs/sm.bak*

The current implementation is careful to preserve some or all existing  
files in sm.bak.  Basically if a previous notification never  
succeeded, the file for that peer stays in sm.bak, and sm-notify will  
try to notify that host again during the next reboot.  So, a file can  
be overwritten, but files for old peers are preserved in this case.

This seems reasonable to ensure peers are notified, although we may  
get a growing number of files in some situations.  We could assess a  
timeout -- after 5 reboots, we can be fairly certain the peer isn't  
coming back, and that the file should be removed.
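
To make the ageing idea concrete, here is a rough sketch.  It is not how sm-notify stores things today: the "reboot counter as file contents" format, the function name rotate_monitor_list and the MAX_REBOOTS constant are all assumptions made for illustration.

```python
from pathlib import Path

MAX_REBOOTS = 5  # assumption: the "5 reboots" figure suggested above


def rotate_monitor_list(sm: Path, sm_bak: Path,
                        max_reboots: int = MAX_REBOOTS) -> None:
    """Move peers from sm/ to sm.bak/ at reboot, ageing out stale entries.

    Hypothetical on-disk format: each sm.bak file holds a decimal count
    of reboots that have passed since the peer was last monitored.
    A successful notification would unlink the file (not shown here).
    """
    sm_bak.mkdir(parents=True, exist_ok=True)

    # Entries still present were never notified successfully: age them.
    for entry in list(sm_bak.iterdir()):
        age = int(entry.read_text() or "0") + 1
        if age >= max_reboots:
            entry.unlink()          # give up on this peer
        else:
            entry.write_text(str(age))

    # Freshly monitored peers replace any older record for the same host.
    for entry in list(sm.iterdir()):
        (sm_bak / entry.name).write_text("0")
        entry.unlink()
```

With this scheme a peer that can never be reached is retried on five consecutive reboots and dropped on the sixth, so sm.bak cannot grow without bound.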

> And while I agree that more documentation is a good thing, I think the
> synchronization is enforced so documentation isn't essential.
> statd runs sm-notify before doing anything.  sm-notify does the
> minimum for synchronization before forking and exiting and allowing
> statd to continue. (or maybe not as I discover below)

There is a rather mysterious sequence of forks at start up, and we  
happen to get this behavior today.  It's not terribly straightforward,  
and could be removed by someone in the future who is trying to reduce  
complexity.  Anyway...

>>  o  statd has to fire up sm-notify when it receives SM_SIMU_CRASH.
>> Today our lockd doesn't send that, but it could in the future.  So,
>> sm-notify is not strictly an "only-at-reboot" kind of affair.
>
> True, but not a strong case for anything I would think.
>
>>
>>  o  sm-notify tries to do a sync(2) to make sure that the file system
>> state is made permanent after an NSM state update.  Bruce has
>> suggested doing the sync only after the first SM_MON (to reduce
>> overhead during system boot), but that moves the sync(2) far away
>> from the logic that updates the state number.  That exposes us to
>> NSM state number walk-back if the system crashes at the wrong time.
>> It's arguable how much of a problem that is.
>
> Sounds like there is room for improvement here, definitely.
>
> This is only a half-formed idea, but:
>  sm-notify could update 'state' to an odd number if it is even, but
>     not sync anything
>  statd, on the first SM_MON, updates 'state' to an even number if it
>     was odd and in that case does the required sync.

That would still provide an opportunity for state number replay, which  
would make at least one subsequent notification a no-op.

Given recent discussions on lkml about the behavior of sync/fsync with  
regard to renames, unlinks, file creation and the like, I think we  
should be more conservative about this, not less.  (In fact my current  
prototype uses sqlite3 instead of flat files for all of this).
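
For the flat-file case, the conservative approach would look roughly like the sketch below: write a temporary file, fsync it, rename it over the old one, then fsync the containing directory.  The 4-byte native-endian file format and both function names are assumptions for the sake of the example, not a description of the current code.

```python
import os
import struct


def write_state_durably(path: str, state: int) -> None:
    """Replace the state file so a crash leaves the old or the new value."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp = path + ".tmp"

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, struct.pack("=i", state))  # assumed 4-byte format
        os.fsync(fd)                            # file contents reach disk
    finally:
        os.close(fd)

    os.replace(tmp, path)                       # atomic rename

    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)                           # make the rename durable
    finally:
        os.close(dfd)


def read_state(path: str) -> int:
    with open(path, "rb") as f:
        return struct.unpack("=i", f.read(4))[0]
```

The point of keeping the fsync calls right next to the update is that the state number cannot walk back after a crash, which is exactly what gets harder if the sync is deferred to the first SM_MON.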

> I would need to check the protocol and the code and do a bit of case
> analysis to be sure I had that right, but I suspect it is close.
> (or it could be made completely irrelevant by subsequent
>  observations.  Read on!)
>
>>  o  It is better to send notifications when lockd is up.  For
>> clients, at least, lockd comes up only after the first NFS mount, and
>> in automounter scenarios, that may not be for some time after a
>> reboot.  Servers may not start nfslock until they do "service nfslock
>> start; service nfs start" at some point possibly long after reboot.
>> So should clients be notified right when the server peer starts up,
>> or after the server peer has fired up its NFSD and lockd service?
>
> When a client notifies a server that it has rebooted, the server
> simply drops the locks.  There is no need for the client lockd to be
> running.

Agreed.  However, at least for Linux, statd is used on both the client  
and server, and a system can act as both concurrently.  There's no  
real way for statd to distinguish between remote clients and servers  
from an SM_MON request.

> When a server notifies a client that it has rebooted, the client tries
> to reclaim the locks.  So the server lockd *must* be running at that
> time.  It is not a case of 'better'.  It is 'must'.

Jeff Layton observed Solaris NFS servers (the reference NFSv2/v3  
implementation) sending reboot notifications before their lockd is  
alive.  That's why I qualified the requirement.

> So if a machine is an NFS server that plans to keep serving, it must
> start nfsd (and hence lockd) before running sm-notify.
>
> However it is good to have statd running before lockd, as lockd needs
> to talk to statd.
> So the order seems to be:
>  statd
>  nfsd and hence lockd
>  sm-notify
>
> which is clearly documented in the README, but seems to disagree with
> what we said above :-)
> We want to clean out the 'sm' directory, then run statd/nfsd/lockd,
> then sm-notify reads the sm.bak and sends off the notifications.
>
> There does seem to be room for improvement here.  And I feel that
> having sm-notify separate actually makes it easier to get this
> right...
>
> How about this for a bit of a left-field idea:
> - files representing monitored hosts are stored in
>         /var/lib/nfs/sm.$STATE
> - At reboot, /var/lib/nfs/state is incremented (twice?) but not
>    synced.
> - statd, on first SM_MON creates /var/lib/nfs/sm.$STATE if needed,
>   based on the value in the 'state' file, and does the required
>   sync at that point
> - sm-notify can be run at any time after nfsd (if required) is
>   started,  and send notification to any host in a sm.$STATE where
>   $STATE < 'state'.  The 'state' number in the notification is
>   $STATE (or is it $STATE+1??)

sm-notify should send the same NSM state number as lockd is sending in  
NLMPROC_LOCK requests.  afaict only odd state numbers are passed  
between peers.
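
On the sm-notify side, the sm.$STATE proposal quoted above reduces to picking every sm.N directory whose N is below the current state.  A small sketch of just that selection step follows; the function name pending_state_dirs is hypothetical, and the question of which state number to put in the notification is deliberately left out.

```python
def pending_state_dirs(names, current_state):
    """Return the sm.N names with N < current_state, oldest first.

    `names` is a plain list of directory entries, e.g. the result of
    os.listdir("/var/lib/nfs").  Anything that is not of the form
    sm.<decimal number> (such as "sm.bak" or "state") is ignored.
    """
    found = []
    for name in names:
        prefix, _, suffix = name.partition(".")
        if prefix == "sm" and suffix.isdecimal():
            n = int(suffix)
            if n < current_state:
                found.append((n, name))
    return [name for _, name in sorted(found)]
```

For instance, with entries sm.3, sm.5 and sm.7 and a current state of 7, only sm.3 and sm.5 would be candidates for notification.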

>>  o  Those who package statd/sm-notify have to understand how these
>> operate.  The people who create system init-scripts are generally not
>> NFS experts, thus they must have local knowledge about statd and sm-
>> notify in order to get this all correct.  It would be more fool-proof
>> if we hard-coded the start-up behavior, and took it out of the hands
>> of the init-scripts folks, whom we do not control.  How do we
>> document the operational dependencies in a way that makes it very
>> hard for non-NFS folks to set this up incorrectly?  One way is to
>> build it all in a single program.
>
> That is a strong argument.  It is probably part of the argument for
> putting it all in the kernel too.

Putting it _all_ in the kernel is a challenge.  One issue is that the  
kernel should never write into local files, so some user space  
interaction is rather a requirement.

However, I think a scheme where the kernel provides the NSM service
listener, and exposes its NSM cache to user space via rpc_pipefs or
some other mechanism might be better than having lockd post
SM_MON/SM_UNMON requests and listen for NLM callbacks from statd.

The kernel can provide more information about the remote peer:  the IP  
address it used to contact us; the transport protocol it used to  
contact us; and whether it is a client or a server peer.  None of that  
information is available in the NSM protocol today.

The kernel also knows for certain when reboots occur, and when the
server-side grace period starts and ends.

That's a future idea, though.  Right now we just need something that  
supports IPv6.

> A valid question is: *can* we build it all into a single program?
>
> Given that:
>  state and sm need to be updated before statd responds to SM_MON
>  statd should be ready to respond to SM_MON before lockd starts
>  exportfs -av must be run before nfsd starts
>  nfsd and lockd must start before notifications are sent on a server
>  notifications (from the server) must be sent promptly after
>     nfsd starts its grace period.

Perhaps another desirable characteristic would be to curtail or stop  
notification once the grace period ends.

But it seems to me that start of the grace period is when you want to  
post SM_NOTIFY requests.  And statd can't possibly know when that is  
unless lockd tells it.
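
The ordering constraints quoted above can be written down as a dependency graph and checked mechanically.  The sketch below uses Python's standard graphlib module (3.9 or later); the step labels are illustrative names chosen here, and the "promptly after the grace period starts" timing requirement is not modelled.

```python
from graphlib import TopologicalSorter

# step -> set of steps that must already have happened
STARTUP_DEPS = {
    "statd answers SM_MON": {"update state and sm"},
    "lockd starts": {"statd answers SM_MON"},
    "nfsd starts": {"exportfs -av"},
    "send notifications": {"nfsd starts", "lockd starts"},
}


def startup_order(deps=STARTUP_DEPS):
    """Return one valid ordering; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


def respects(order, deps=STARTUP_DEPS):
    """True if every step in `order` comes after all of its prerequisites."""
    pos = {step: i for i, step in enumerate(order)}
    return all(pos[before] < pos[after]
               for after, befores in deps.items()
               for before in befores)
```

More than one ordering satisfies the graph, but in every one of them sending notifications comes last, which is the part that statd cannot time correctly unless lockd tells it.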

> I find it hard to see a single statd being able to do the whole thing.

> We have a 'README' to document the order.  We could provide a sample
> startup script.  I don't think we *can* provide a "get it all right"
> program.

I don't see anything in your argument that explains why this can't be
done in a single program but can be done in an init script (or two).

statd could, for example, listen for signals to determine when to fire  
off sm-notify.  It already listens for SIGUSR1 today.  Or, we could  
require the kernel to post an SM_SIMU_CRASH when it is ready for statd  
to send notifications.  (That's one reason I brought up SM_SIMU_CRASH  
above).

So I guess my argument is that we can do this in a single program if  
we use a little more of the NSM protocol, ensuring that lockd  
communicates a little more with statd.
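
As a rough illustration of the signal variant, the handler only needs to record that a trigger arrived; the daemon's main loop then decides when to start sending SM_NOTIFY requests.  The class and function names below are hypothetical, and using SIGUSR1 for this purpose is an assumption for the example rather than a description of what statd's handler does today.

```python
import signal


class NotifyTrigger:
    """Records that a 'send notifications now' signal has arrived."""

    def __init__(self) -> None:
        self.pending = False

    def handler(self, signum, frame) -> None:
        # Do the minimum here; the main loop does the real work.
        self.pending = True

    def take(self) -> bool:
        """Return True once per trigger, then reset the flag."""
        fired, self.pending = self.pending, False
        return fired


def install(trigger: NotifyTrigger, signum: int = signal.SIGUSR1) -> None:
    # Must be called from the main thread of the daemon.
    signal.signal(signum, trigger.handler)
```

The main loop would call take() between RPC requests and start the notification pass when it returns True.  An SM_SIMU_CRASH upcall from the kernel could set the same flag.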

>> If there are one or more strong reasons to keep these separate, I can
>> go down that road.  But I think the practical matters of making NSM
>> work in multiple Linux distributions, each with their own packaging
>> and init-script mechanisms and requirements, suggests we'd be better
>> off making it simple to get this right.
>
> "simple to get this right" is certainly good.
> But "right" must over-rule "simple", and it seems like we might not
> even really be at "right" yet. :-(
>
> Maybe the way to make sure people get it to work is to detect broken
> configurations and fail horribly...

As Greg likes to say:  "Meh."  I think everyone will be better off if  
we try to get it all to work automatically.  With warnings, we then  
depend on the patience of distributors and administrators to  
troubleshoot this.  It should "just work."

> So:
>  sm-notify performs its own /var/run locking to make sure it is only
>   run once (plus allow for --simu-crash??)
>   It quickly updates /var/lib/nfs/ (with no sync) and then checks
>   to see if mountd is running.  If it is, it assumes 'server' and
>   waits a while for lockd to appear (both checks via portmap).
>   Once lockd is running (or mountd was not), it sends out
>   notifications.
>  mountd checks if sm-notify has already run (via the /var/run file),
>   and complains gently, maybe only if it is less than a few minutes
>   after boot.  e.g.
>    WARNING: during boot, mountd must be run before sm-notify!
>
>  statd always runs sm-notify first and waits for it to exit, which
>    it does once it has moved things aside and updated 'state'.
>    On the first SM_MON call, statd calls 'fsync' on 'state' and
>    related directories, and writes the 'state' value to the
>    kernel....  which is moments too late.  The kernel has already
>    used it.

The state number is returned in the SM_MON reply.  As mentioned, I  
sent Trond a patch for client side to dig that out before posting an  
NLMPROC_LOCK request.  The server side doesn't seem to care what its  
local NSM state is.
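
The "run only once" locking in the scheme quoted above can be sketched with an exclusive create.  The function name claim_run_once is made up for the example, and the lock file location is left to the caller rather than hard-coding anything under /var/run.

```python
import os


def claim_run_once(lock_path: str) -> bool:
    """Return True for the first caller since the lock file was cleared.

    O_CREAT | O_EXCL makes creation atomic, so two racing instances
    cannot both succeed.  The file must live somewhere that is emptied
    at boot for this to mean "once per boot" (an assumption here).
    """
    try:
        fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()}\n")   # record who claimed it
    return True
```

A forced run (the SM_SIMU_CRASH case) would need to remove or bypass the file first, which is one more place where statd and sm-notify have to agree.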

>  Maybe we need a call to nsm_monitor in nlmclnt_proc,
>    and maybe _reclaim and _cancel too - not sure
>
>  mount.nfs makes sure statd is running - we already have that.

We also have lockd checking that statd is running via an SM_MON upcall  
before sending the first NLM request on this mount point (yes, and  
that check is actually working in 2.6.29!  it now refuses to allow a  
lock operation if it can't contact statd).  Do we need both?

>   rpc.nfsd can complain if statd is not already running, or maybe
>    even just start it.

> That, I think, should enforce some of the ordering, and complain
> if other ordering requirements aren't met.
>
> And just for the record: my strongest argument for keeping them
> separate is that statd (being a network service) should only be started
> if and when it is actually needed, while sm-notify should always be
> run at boot in case it has some cleaning up to do.

OK, noted.  I take it this is more of a security thing -- try to limit  
network service exposure when possible.

I know that Linux statd has a checkered security past, but it seems  
that we're not terribly consistent on this front with other services.   
rpcbind is always running whether we have NFSD and NFS mounts or not.   
rpcbind, statd and lockd are running when we have only NFSv4 mounts,  
and rpcbind and statd run when we have no mounts at all.

Systems that don't want NFS can simply avoid starting
/etc/init.d/nfslock and /etc/init.d/nfs at boot time. IMO that's enough
-- the added dynamic starting up and shutting down of these services
makes them much more complex and fragile than needed.

There is only a single case I can think of where we might want  
notification, but not want to start statd.  That is when an admin  
decides to disable NFS on a system.  One last notification is  
appropriate, but statd shouldn't be started.  We could probably  
accomplish this with the "notify then exit" option on statd.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


end of thread, other threads:[~2009-05-21 17:15 UTC | newest]

Thread overview: 8+ messages
2009-05-19 14:36 RFC: merging sm-notify and rpc.statd Chuck Lever
2009-05-19 22:39 ` Neil Brown
     [not found]   ` <18963.13619.563465.804193-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-05-19 23:25     ` Mike Frysinger
2009-05-20  1:05       ` NeilBrown
     [not found]         ` <bdacaae74fbd57ee96599286eae43751.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-05-20  1:10           ` Ben Greear
2009-05-20 16:38     ` Chuck Lever
2009-05-21  0:01       ` Neil Brown
     [not found]         ` <18964.39373.232045.96215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-05-21 17:14           ` Chuck Lever
