* RFC: merging sm-notify and rpc.statd
@ 2009-05-19 14:36 Chuck Lever
From: Chuck Lever @ 2009-05-19 14:36 UTC
To: Neil Brown; +Cc: Linux NFS mailing list
Hi Neil-
As part of IPv6 support for NFS, I've been looking at rpc.statd and
sm-notify. IPv6 support touches so many parts of both, and the current
open-coded RPC request schedulers in both can't support netids without
major revision or replacement. So I've decided to write a replacement
instead of grafting in support for IPv6 to the current implementation.
For many reasons I'm thinking of merging sm-notify and rpc.statd back
together. The two were split only a few years ago, and it seems to me
that it was done to support SuSE's in-kernel statd, which has since
been effectively abandoned.
Having the two separated has ushered in a host of minor
complications. Packaging and init-scripts are more complicated. Both
executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
There are two separate man pages that share a lot of the same content.
So, what do you think about folding sm-notify back into rpc.statd?
Steve suggested there may have been a customer issue that drove the
separation. Do you have any recollection of the issues?
For the rest of the list: are there strong dependencies outside RH and
SuSE distributions that would require a separate sm-notify
executable? Any other issues?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
* Re: RFC: merging sm-notify and rpc.statd
From: Neil Brown @ 2009-05-19 22:39 UTC
To: Chuck Lever; +Cc: Linux NFS mailing list
On Tuesday May 19, chuck.lever@oracle.com wrote:
> Hi Neil-
>
> As part of IPv6 support for NFS, I've been looking at rpc.statd and
> sm-notify. IPv6 support touches so many parts of both, and the current
> open-coded RPC request schedulers in both can't support netids without
> major revision or replacement. So I've decided to write a replacement
> instead of grafting in support for IPv6 to the current implementation.
>
> For many reasons I'm thinking of merging sm-notify and rpc.statd back
> together. The two were split only a few years ago, and it seems to me
> that it was done to support SuSE's in-kernel statd, which has since
> been effectively abandoned.
>
> Having the two separated has ushered in a host of minor
> complications. Packaging and init-scripts are more complicated. Both
> executables have separate knowledge about /var/lib/nfs/{sm,sm.bak}.
> There are two separate man pages that share a lot of the same content.
>
> So, what do you think about folding sm-notify back into rpc.statd?
> Steve suggested there may have been a customer issue that drove the
> separation. Do you have any recollection of the issues?
>
> For the rest of the list: are there strong dependencies outside RH and
> SuSE distributions that would require a separate sm-notify
> executable? Any other issues?

While the separation of sm-notify was presumably driven by the suse
in-kernel statd, that wasn't the reason that I copied the idea in
nfs-utils.

sm-notify and statd really have two very different tasks.

sm-notify :
 - is a 'client' for the "SM" protocol.
 - must be run at boot time, and after that is not needed.

statd :
 - is a 'server' for the "SM" protocol.
 - only needs to be running when either nfsd is running or an
   nfs mount which supports locks is active

Thus I feel they are conceptually quite distinct.

It is probably true that they could share a slab of code, and putting
that code in a common .c file would make a lot of sense.

I am not strongly against re-uniting them. However before doing that,
I think it would be a good idea to collect a list of the problems that
would be solved by unifying them, and then asking the question: is
unifying them the only or best solution to these problems.

Thanks,
NeilBrown
* Re: RFC: merging sm-notify and rpc.statd
From: Mike Frysinger @ 2009-05-19 23:25 UTC
To: Neil Brown; +Cc: Chuck Lever, Linux NFS mailing list
On Tuesday 19 May 2009 18:39:47 Neil Brown wrote:
> sm-notify :
>  - is a 'client' for the "SM" protocol.
>  - must be run at boot time, and after that is not needed.
>
> statd :
>  - is a 'server' for the "SM" protocol.
>  - only needs to be running when either nfsd is running or an
>    nfs mount which supports locks is active

that last part -- any nfs mount with locks -- means that pretty much
every nfs client out there needs it running.

sm-notify is pretty minuscule, so the overhead of having that run on a
server is negligible, especially when combined with the already
required statd.
-mike
* Re: RFC: merging sm-notify and rpc.statd
From: NeilBrown @ 2009-05-20 1:05 UTC
To: Mike Frysinger; +Cc: Chuck Lever, Linux NFS mailing list
On Wed, May 20, 2009 9:25 am, Mike Frysinger wrote:
> On Tuesday 19 May 2009 18:39:47 Neil Brown wrote:
>> sm-notify :
>>  - is a 'client' for the "SM" protocol.
>>  - must be run at boot time, and after that is not needed.
>>
>> statd :
>>  - is a 'server' for the "SM" protocol.
>>  - only needs to be running when either nfsd is running or an
>>    nfs mount which supports locks is active
>
> that last part -- any nfs mount with locks -- means that pretty much
> every nfs client out there needs it running.
>
> sm-notify is pretty minuscule, so the overhead of having that run on a
> server is negligible, especially when combined with the already
> required statd.
> -mike

The point is that sm-notify should really be run at boot whenever it
is installed.
statd, being a server that listens to requests from the network, should
only be run if it is needed (because most people like the policy of
only running network services that are actually needed).

If all nfs mounts are performed manually, then you might not want to
run statd for quite a long time after boot. But you need to run
sm-notify immediately after boot to ensure that any locks you held
before the reboot get released.

They really are separate functions.

NeilBrown
* Re: RFC: merging sm-notify and rpc.statd
From: Ben Greear @ 2009-05-20 1:10 UTC
To: NeilBrown; +Cc: Mike Frysinger, Chuck Lever, Linux NFS mailing list
NeilBrown wrote:
> They really are separate functions.

Maybe have one code base and have it do different things based on a
cmd-line argument or the name (use a symlink for one of the functions)?

If the code is really similar, that should allow easy reuse?

Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
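Ben's suggestion (one code base that picks its behavior from a command-line argument or from the name it was invoked under) can be sketched roughly as below. This is only an illustration in Python, not anything from nfs-utils: the `--notify` and `--statd` flags, the role names, and the `select_role` helper are all hypothetical.

```python
import os

# Hypothetical mapping from the invoked program name to a role.
# Neither the role names nor this table exist in nfs-utils.
ROLE_BY_NAME = {"rpc.statd": "statd", "sm-notify": "notify"}


def select_role(argv):
    """Pick a role from an explicit flag first, then from the invoked name."""
    args = argv[1:]
    if "--notify" in args:      # hypothetical flag
        return "notify"
    if "--statd" in args:       # hypothetical flag
        return "statd"
    name = os.path.basename(argv[0])
    # Fall back to the long-running server role for unrecognized names.
    return ROLE_BY_NAME.get(name, "statd")
```

With a symlink named sm-notify pointing at the shared binary, `select_role(sys.argv)` would return "notify" without any flag, which is the symlink variant Ben mentions.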
* Re: RFC: merging sm-notify and rpc.statd
From: Chuck Lever @ 2009-05-20 16:38 UTC
To: Neil Brown; +Cc: Linux NFS mailing list
On May 19, 2009, at 6:39 PM, Neil Brown wrote:
> On Tuesday May 19, chuck.lever@oracle.com wrote:
>> Hi Neil-
>>
>> As part of IPv6 support for NFS, I've been looking at rpc.statd and
>> sm-notify. IPv6 support touches so many parts of both, and the
>> current open-coded RPC request schedulers in both can't support
>> netids without major revision or replacement. So I've decided to
>> write a replacement instead of grafting in support for IPv6 to the
>> current implementation.
>>
>> For many reasons I'm thinking of merging sm-notify and rpc.statd back
>> together. The two were split only a few years ago, and it seems to
>> me that it was done to support SuSE's in-kernel statd, which has
>> since been effectively abandoned.
>>
>> Having the two separated has ushered in a host of minor
>> complications. Packaging and init-scripts are more complicated.
>> Both executables have separate knowledge about
>> /var/lib/nfs/{sm,sm.bak}. There are two separate man pages that
>> share a lot of the same content.
>>
>> So, what do you think about folding sm-notify back into rpc.statd?
>> Steve suggested there may have been a customer issue that drove the
>> separation. Do you have any recollection of the issues?
>>
>> For the rest of the list: are there strong dependencies outside RH
>> and SuSE distributions that would require a separate sm-notify
>> executable? Any other issues?
>
> While the separation of sm-notify was presumably driven by the suse
> in-kernel statd, that wasn't the reason that I copied the idea in
> nfs-utils.
>
> sm-notify and statd really have two very different tasks.
>
> sm-notify :
>  - is a 'client' for the "SM" protocol.
>  - must be run at boot time, and after that is not needed.
>
> statd :
>  - is a 'server' for the "SM" protocol.
>  - only needs to be running when either nfsd is running or an
>    nfs mount which supports locks is active
>
> Thus I feel they are conceptually quite distinct.

There are details that make it not such a clean conceptual break:

o Who manages the NSM state number? sm-notify sends it out to
  remote peers, and statd returns it in SM_MON and SM_UNMON replies.
  There has to be some co-ordination of how the state number is
  updated. If sm-notify runs separately (for example, with the
  "--force" option) and updates the state number, how does statd know
  there's a new state number? If lockd isn't loaded and running when
  sm-notify runs, how is the kernel going to get the right NSM state
  number?

o statd still has client duties: it has to post NLM callbacks to
  the local lockd. Sending notifications to remote peers is not so
  different from that, conceptually. One could argue, therefore, that
  we should split that piece out of statd as well, but that would mean
  we fork/exec every time we get an unauthenticated SM_NOTIFY request
  from a monitored peer. That exposes a DoS vulnerability.

o statd has to wait while sm-notify copies the monitor list. It
  really shouldn't accept SM_MON requests while the notification list
  is created. But if it waits for long, it will appear that the NSM
  service has died. So there is some non-trivial synchronization
  between the two, and that appears to be split between statd and
  sm-notify today (and that synchronization requirement isn't
  documented in any way).

o statd has to fire up sm-notify when it receives SM_SIMU_CRASH.
  Today our lockd doesn't send that, but it could in the future. So,
  sm-notify is not strictly an "only-at-reboot" kind of affair.

o sm-notify tries to do a sync(2) to make sure that the file system
  state is made permanent after an NSM state update. Bruce has
  suggested doing the sync only after the first SM_MON (to reduce
  overhead during system boot), but that moves the sync(2) far away
  from the logic that updates the state number. That exposes us to
  NSM state number walk-back if the system crashes at the wrong time.
  It's arguable how much of a problem that is.

o It is better to send notifications when lockd is up. For
  clients, at least, lockd comes up only after the first NFS mount,
  and in automounter scenarios, that may not be for some time after a
  reboot. Servers may not start nfslock until they do "service
  nfslock start; service nfs start" at some point possibly long after
  reboot. So should clients be notified right when the server peer
  starts up, or after the server peer has fired up its NFSD and lockd
  service?

o Those who package statd/sm-notify have to understand how these
  operate. The people who create system init-scripts are generally
  not NFS experts, thus they must have local knowledge about statd and
  sm-notify in order to get this all correct. It would be more
  fool-proof if we hard-coded the start-up behavior, and took it out
  of the hands of the init-scripts folks, whom we do not control. How
  do we document the operational dependencies in a way that makes it
  very hard for non-NFS folks to set this up incorrectly? One way is
  to build it all in a single program.

> It is probably true that they could share a slab of code, and putting
> that code in a common .c file would make a lot of sense.

Yes, I've started doing that to try to understand what code can be
shared.

> I am not strongly against re-uniting them. However before doing that,
> I think it would be a good idea to collect a list of the problems that
> would be solved by unifying them, and then asking the question: is
> unifying them the only or best solution to these problems.

Agreed. See above.

If there are one or more strong reasons to keep these separate, I can
go down that road. But I think the practical matters of making NSM
work in multiple Linux distributions, each with their own packaging
and init-script mechanisms and requirements, suggest we'd be better
off making it simple to get this right.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
* Re: RFC: merging sm-notify and rpc.statd
From: Neil Brown @ 2009-05-21 0:01 UTC
To: Chuck Lever; +Cc: Linux NFS mailing list
On Wednesday May 20, chuck.lever@oracle.com wrote:
> On May 19, 2009, at 6:39 PM, Neil Brown wrote:
> > On Tuesday May 19, chuck.lever@oracle.com wrote:
> >> Hi Neil-
> >>
> >> As part of IPv6 support for NFS, I've been looking at rpc.statd and
> >> sm-notify. IPv6 support touches so many parts of both, and the
> >> current open-coded RPC request schedulers in both can't support
> >> netids without major revision or replacement. So I've decided to
> >> write a replacement instead of grafting in support for IPv6 to the
> >> current implementation.
> >>
> >> For many reasons I'm thinking of merging sm-notify and rpc.statd
> >> back together. The two were split only a few years ago, and it
> >> seems to me that it was done to support SuSE's in-kernel statd,
> >> which has since been effectively abandoned.
> >>
> >> Having the two separated has ushered in a host of minor
> >> complications. Packaging and init-scripts are more complicated.
> >> Both executables have separate knowledge about
> >> /var/lib/nfs/{sm,sm.bak}. There are two separate man pages that
> >> share a lot of the same content.
> >>
> >> So, what do you think about folding sm-notify back into rpc.statd?
> >> Steve suggested there may have been a customer issue that drove the
> >> separation. Do you have any recollection of the issues?
> >>
> >> For the rest of the list: are there strong dependencies outside RH
> >> and SuSE distributions that would require a separate sm-notify
> >> executable? Any other issues?
> >
> > While the separation of sm-notify was presumably driven by the suse
> > in-kernel statd, that wasn't the reason that I copied the idea in
> > nfs-utils.
> >
> > sm-notify and statd really have two very different tasks.
> >
> > sm-notify :
> >  - is a 'client' for the "SM" protocol.
> >  - must be run at boot time, and after that is not needed.
> >
> > statd :
> >  - is a 'server' for the "SM" protocol.
> >  - only needs to be running when either nfsd is running or an
> >    nfs mount which supports locks is active
> >
> > Thus I feel they are conceptually quite distinct.
>
> There are details that make it not such a clean conceptual break:
>
> o Who manages the NSM state number? sm-notify sends it out to
>   remote peers, and statd returns it in SM_MON and SM_UNMON replies.
>   There has to be some co-ordination of how the state number is
>   updated. If sm-notify runs separately (for example, with the
>   "--force" option) and updates the state number, how does statd know
>   there's a new state number? If lockd isn't loaded and running when
>   sm-notify runs, how is the kernel going to get the right NSM state
>   number?

sm-notify manages the state number.
statd must ensure that sm-notify has run before it reads the number
from the file. As sm-notify has its own locking to ensure it is run
only once, statd simply runs sm-notify before proceeding.
sm-notify explicitly tells the kernel what the state number is.
If the lockd module isn't loaded when sm-notify runs that might be a
small problem. I'd have to remind myself of all the details of the
lockd protocols to be sure what was needed. Maybe statd should tell
it to lockd when it first hears from lockd.

>
> o statd still has client duties: it has to post NLM callbacks to
>   the local lockd. Sending notifications to remote peers is not so
>   different from that, conceptually. One could argue, therefore, that
>   we should split that piece out of statd as well, but that would mean
>   we fork/exec every time we get an unauthenticated SM_NOTIFY request
>   from a monitored peer. That exposes a DoS vulnerability.

Yes, client duties. But a client for a different protocol.
I think we have a strawman argument here. I would certainly never
suggest that the lockd call back should be done by a separate process.

At its core, statd works like this:
  lockd says to statd "Tell me if X restarts, and tell X if I
  restart".
  So statd listens for X to say "I have restarted" and passes that on
  to lockd.
  Statd cannot directly tell X that it has restarted because it will
  have died first. So it leaves a note (on the fridge) for someone
  else to do it. That "someone else" is sm-notify.
So sm-notify is running on behalf of the statd from before the last
reboot. In that sense it is quite separate from the currently
running statd.

>
> o statd has to wait while sm-notify copies the monitor list. It
>   really shouldn't accept SM_MON requests while the notification list
>   is created. But if it waits for long, it will appear that the NSM
>   service has died. So there is some non-trivial synchronization
>   between the two, and that appears to be split between statd and
>   sm-notify today (and that synchronization requirement isn't
>   documented in any way).

Sounds like there could be an implementation problem here.
I don't think sm-notify needs to copy the monitor list exactly. It
just needs to move it out of the way so statd has a clean slate.
   mv /var/lib/nfs/sm /var/lib/nfs/sm.bak.$UNIQUE
   mkdir /var/lib/nfs/sm
   # let statd continue
   # shuffle through files in /var/lib/nfs/sm.bak*

And while I agree that more documentation is a good thing, I think the
synchronization is enforced so documentation isn't essential.
statd runs sm-notify before doing anything. sm-notify does the
minimum for synchronization before forking and exiting and allowing
statd to continue. (or maybe not as I discover below)

>
> o statd has to fire up sm-notify when it receives SM_SIMU_CRASH.
>   Today our lockd doesn't send that, but it could in the future. So,
>   sm-notify is not strictly an "only-at-reboot" kind of affair.

True, but not a strong case for anything I would think.

>
> o sm-notify tries to do a sync(2) to make sure that the file system
>   state is made permanent after an NSM state update. Bruce has
>   suggested doing the sync only after the first SM_MON (to reduce
>   overhead during system boot), but that moves the sync(2) far away
>   from the logic that updates the state number. That exposes us to
>   NSM state number walk-back if the system crashes at the wrong time.
>   It's arguable how much of a problem that is.

Sounds like there is room for improvement here, definitely.

This is only a half-formed idea, but:
  sm-notify could update 'state' to an odd number if it is even, but
  not sync anything
  statd, on the first SM_MON, updates 'state' to an even number if it
  was odd and in that case does the required sync.

I would need to check the protocol and the code and do a bit of case
analysis to be sure I had that right, but I suspect it is close.
(or it could be made completely irrelevant by subsequent
observations. Read on!)

>
> o It is better to send notifications when lockd is up. For
>   clients, at least, lockd comes up only after the first NFS mount,
>   and in automounter scenarios, that may not be for some time after a
>   reboot. Servers may not start nfslock until they do "service
>   nfslock start; service nfs start" at some point possibly long after
>   reboot. So should clients be notified right when the server peer
>   starts up, or after the server peer has fired up its NFSD and lockd
>   service?
>

When a client notifies a server that it has rebooted, the server
simply drops the locks. There is no need for the client lockd to be
running.
When a server notifies a client that it has rebooted, the client tries
to reclaim the locks. So the server lockd *must* be running at that
time. It is not a case of 'better'. It is 'must'.

So if a machine is an NFS server that plans to keep serving, it must
start nfsd (and hence lockd) before running sm-notify. However it is
good to have statd running before lockd, as lockd needs to talk to
statd.
So the order seems to be:
    statd
    nfsd and hence lockd
    sm-notify
which is clearly documented in the README, but seems to disagree with
what we said above :-)

We want to clean out the 'sm' directory, then run statd/nfsd/lockd,
then sm-notify reads the sm.bak and sends off the notifications.
There does seem to be room for improvement here. And I feel that
having sm-notify separate actually makes it easier to get this
right...

How about this for a bit of a left-field idea:
 - files representing monitored hosts are stored in
   /var/lib/nfs/sm.$STATE
 - At reboot, /var/lib/nfs/state is incremented (twice?) but not
   synced.
 - statd, on first SM_MON creates /var/lib/nfs/sm.$STATE if needed,
   based on the value in the 'state' file, and does the required sync
   at that point
 - sm-notify can be run at any time after nfsd (if required) is
   started, and send notification to any host in a sm.$STATE where
   $STATE < 'state'. The 'state' number in the notification is
   $STATE (or is it $STATE+1??)

> o Those who package statd/sm-notify have to understand how these
>   operate. The people who create system init-scripts are generally
>   not NFS experts, thus they must have local knowledge about statd and
>   sm-notify in order to get this all correct. It would be more
>   fool-proof if we hard-coded the start-up behavior, and took it out
>   of the hands of the init-scripts folks, whom we do not control. How
>   do we document the operational dependencies in a way that makes it
>   very hard for non-NFS folks to set this up incorrectly? One way is
>   to build it all in a single program.

That is a strong argument. It is probably part of the argument for
putting it all in the kernel too.

A valid question is: *can* we build it all into a single program?
Given that:
   state and sm need to be updated before statd responds to SM_MON
   statd should be ready to respond to SM_MON before lockd starts
   exportfs -av must be run before nfsd starts
   nfsd and lockd must start before notifications are sent on a server
   notifications (from the server) must be sent promptly after nfsd
     starts its grace period.

I find it hard to see a single statd being able to do the whole
thing.

We have a 'README' to document the order. We could provide a sample
startup script. I don't think we *can* provide a "get it all right"
program.

>
> If there are one or more strong reasons to keep these separate, I can
> go down that road. But I think the practical matters of making NSM
> work in multiple Linux distributions, each with their own packaging
> and init-script mechanisms and requirements, suggests we'd be better
> off making it simple to get this right.

"simple to get this right" is certainly good. But "right" must
over-rule "simple", and it seems like we might not even really be at
"right" yet. :-(

Maybe the way to make sure people get it to work is to detect broken
configurations and fail horribly...
So:
  sm-notify performs its own /var/run locking to make sure it is only
  run once (plus allow for --simu-crash??)
  It quickly updates /var/lib/nfs/ (with no sync) and then checks to
  see if mountd is running. If it is, it assumes 'server' and waits a
  while for lockd to appear (both checks via portmap). Once lockd is
  running (or mountd was not), it sends out notifications.

  mountd checks if sm-notify has already run (via the /var/run file),
  and complains gently, maybe only if it is less than a few minutes
  after boot. e.g.
     WARNING: during boot, mountd must be run before sm-notify!

  statd always runs sm-notify first and waits for it to exit, which it
  does once it has moved things aside and updated 'state'.
  On the first SM_MON call, statd calls 'fsync' on 'state' and related
  directories, and writes the 'state' value to the kernel.... which
  is moments too late. The kernel has already used it.
  Maybe we need a call to nsm_monitor in nlmclnt_proc, and maybe
  _reclaim and _cancel too - not sure

  mount.nfs makes sure statd is running - we already have that.

  rpc.nfsd can complain if statd is not already running, or maybe even
  just start it.

That, I think, should enforce some of the ordering, and complain if
other ordering requirements aren't met.

And just for the record: my strongest argument for keeping them
separate is that statd (being a network service) should only be
started if and when it is actually needed, while sm-notify should
always be run at boot in case it has some cleaning up to do.

Thanks,
NeilBrown
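The "left-field idea" in Neil's message (per-state sm.$STATE directories, with sm-notify notifying every host recorded under a state older than the current one) might look something like the sketch below. It is an illustration only, not nfs-utils code: the function names are hypothetical, the base directory is passed in rather than hard-coded, and the open question of whether the notification should carry $STATE or $STATE+1 is left to the caller.

```python
import os
import re

# Matches directory names of the form sm.<number>, per the proposal.
_SM_DIR = re.compile(r"^sm\.(\d+)$")


def stale_monitor_dirs(base, current_state):
    """Return sorted (state, path) pairs for sm.$STATE dirs with $STATE < current_state."""
    found = []
    for entry in os.listdir(base):
        match = _SM_DIR.match(entry)
        path = os.path.join(base, entry)
        if match and os.path.isdir(path) and int(match.group(1)) < current_state:
            found.append((int(match.group(1)), path))
    return sorted(found)


def hosts_to_notify(base, current_state):
    """Map each monitored host to the newest stale state it was recorded under."""
    hosts = {}
    for state, path in stale_monitor_dirs(base, current_state):
        for host in os.listdir(path):
            # Directories are visited in ascending state order,
            # so a later (newer) state overwrites an earlier one.
            hosts[host] = state
    return hosts
```

A caller would read the current number from the 'state' file, call `hosts_to_notify("/var/lib/nfs", state)`, send the notifications, and only then remove the stale directories; none of that surrounding logic is shown here.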
* Re: RFC: merging sm-notify and rpc.statd [not found] ` <18964.39373.232045.96215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-05-21 17:14 ` Chuck Lever 0 siblings, 0 replies; 8+ messages in thread From: Chuck Lever @ 2009-05-21 17:14 UTC (permalink / raw) To: Neil Brown; +Cc: Linux NFS mailing list On May 20, 2009, at 8:01 PM, Neil Brown wrote: > On Wednesday May 20, chuck.lever@oracle.com wrote: >> On May 19, 2009, at 6:39 PM, Neil Brown wrote: >>> On Tuesday May 19, chuck.lever@oracle.com wrote: >>>> Hi Neil- >>>> >>>> As part of IPv6 support for NFS, I've been looking at rpc.statd and >>>> sm- >>>> notify. IPv6 support touches so many parts of both, and the >>>> current >>>> open-coded RPC request schedulers in both can't support netids >>>> without >>>> major revision or replacement. So I've decided to write a >>>> replacement >>>> instead of grafting in support for IPv6 to the current >>>> implementation. >>>> >>>> For many reasons I'm thinking of merging sm-notify and rpc.statd >>>> back >>>> together. The two were split only a few years ago, and it seems to >>>> me >>>> that it was done to support SuSE's in-kernel statd, which has since >>>> been effectively abandoned. >>>> >>>> Having the two separated has ushered in a host of minor >>>> complications. Packaging and init-scripts are more complicated. >>>> Both >>>> executables have separate knowlege about /var/lib/nfs/{sm,sm.bak}. >>>> There are two separate man pages that share a lot of the same >>>> content. >>>> >>>> So, what do you think about folding sm-notify back into rpc.statd? >>>> Steve suggested there may have been a customer issue that drove the >>>> separation. Do you have any recollection of the issues? >>>> >>>> For the rest of the list: are there strong dependencies outside RH >>>> and >>>> SuSE distributions that would require a separate sm-notify >>>> executable? Any other issues? 
>>> >>> While the separation of sm-notify was presumably driven by the suse >>> in-kernel statd, that wasn't the reason that I copied the idea in >>> nfs-utils. >>> >>> sm-notify and statd really have two very different tasks. >>> >>> sm-notify : >>> - is a 'client' for the "SM" protocol. >>> - must be run at boot time, and after that is not needed. >> >>> statd : >>> - is a 'server' for the "SM" protocol. >>> - only needs to be running when either nfsd is running or an >>> nfs mount which supports locks is active >>> >>> Thus I feel they are conceptually quite distinct. >> >> There are details that make it not such a clean conceptual break: >> >> o Who manages the NSM state number? sm-notify sends it out to >> remote peers, and statd returns it in SM_MON and SM_UNMON replies. >> There has to be some co-ordination of how the state number is >> updated. If sm-notify runs separately (for example, with the "-- >> force" option) and updates the state number, how does statd know >> there's a new state number? If lockd isn't loaded and running when >> sm- >> notify runs, how is the kernel going to get the right NSM state >> number? > > sm-notify manages the state number. > statd must ensure that sm-notify has run before it reads the number > from the file. As sm-notify has its own locking to ensure it is run > only once, statd simple runs sm-notify before proceeded. > sm-notify explicitly tells the kernel what the state number is. Except in the SM_SIMU_CRASH case. sm-notify updates the on-disk state number, but today, statd reads the state number once at start-up, and never updates it. So it would miss that case; lockd and statd would continue to advertise the old state number. (I think statd is also supposed to simulate a crash if it gets SIGUSR1). > If the lockd modules isn't loaded when sm-notify runs that might be a > small problem. That is a frequent problem on today's clients. 
lockd isn't loaded by / etc/init.d/nfslock unless there are module parameters specified (which in most cases, there aren't). The state number is also lost if, for instance, the number of NFS mounts goes to zero and lockd is unloaded. This can easily happen on clients that manage their NFS mounts with automounter. In my experience our clients almost always send a zero state number today. One could even go so far as to argue that an unload-load of lockd counts as a reboot (in terms of NSM state number management), and thus we should increment the NSM state number in that case to ensure that clients and servers start with a clean slate. > I'd have to remind my self of all the details of the > lockd protocols to be sure what was needed. Maybe statd should tell > it to lockd when it first hears from lockd. I've sent a patch to Trond to change lockd to pick up the state number from SM_MON replies. lockd could also do an SM_UNMON_ALL when it is first loaded, and pick up the state number from its reply. >> o statd still has client duties: it has to post NLM callbacks to >> the local lockd. Sending notifications to remote peers is not so >> different from that, conceptually. One could argue, therefore, that >> we should split that piece out of statd as well, but that would mean >> we fork/exec every time we get an unauthenticated SM_NOTIFY request >> from a monitored peer. That exposes a DoS vulnerability. > > Yes, client duties. But a client for a different protocol. > I think we have a strawman argument here. I would certainly never > suggest that the lockd call back should be done by a separate process. > > At it's core, statd works like this: > lockd says to statd "Tell me if X restarts, and tell X if I > restart". > So statd listen for X to say "I have restarted" and passes that on > to to lockd. > Statd cannot directly tell X that it has restarted because it will > have died first. So it leaves a note (on the fridge) for someone > else to do it. 
>    That "someone else" is sm-notify.
> So sm-notify is running on behalf of the statd from before the last
> reboot. In that sense it is quite separate from the currently
> running statd.
>
>> o statd has to wait while sm-notify copies the monitor list. It
>> really shouldn't accept SM_MON requests while the notification list
>> is created. But if it waits for long, it will appear that the NSM
>> service has died. So there is some non-trivial synchronization
>> between the two, and that appears to be split between statd and
>> sm-notify today (and that synchronization requirement isn't
>> documented in any way).
>
> Sounds like there could be an implementation problem here.
> I don't think sm-notify needs to copy the monitor list exactly. It
> just needs to move it out of the way so statd has a clean slate.
>    mv /var/lib/nfs/sm /var/lib/nfs/sm.bak.$UNIQUE
>    mkdir /var/lib/nfs/sm
>    # let statd continue
>    # shuffle through files in /var/lib/nfs/sm.bak*

The current implementation is careful to preserve some or all existing
files in sm.bak. Basically if a previous notification never succeeded,
the file for that peer stays in sm.bak, and sm-notify will try to
notify that host again during the next reboot. So, a file can be
overwritten, but files for old peers are preserved in this case.

This seems reasonable to ensure peers are notified, although we may
get a growing number of files in some situations. We could assess a
timeout -- after 5 reboots, we can be fairly certain the peer isn't
coming back, and that the file should be removed.

> And while I agree that more documentation is a good thing, I think the
> synchronization is enforced so documentation isn't essential.
> statd runs sm-notify before doing anything. sm-notify does the
> minimum for synchronization before forking and exiting and allowing
> statd to continue. (or maybe not as I discover below)

There is a rather mysterious sequence of forks at start up, and we
happen to get this behavior today.
It's not terribly straightforward, and could be removed by someone in
the future who is trying to reduce complexity. Anyway...

>> o statd has to fire up sm-notify when it receives SM_SIMU_CRASH.
>> Today our lockd doesn't send that, but it could in the future. So,
>> sm-notify is not strictly an "only-at-reboot" kind of affair.
>
> True, but not a strong case for anything I would think.
>
>>
>> o sm-notify tries to do a sync(2) to make sure that the file system
>> state is made permanent after an NSM state update. Bruce has
>> suggested doing the sync only after the first SM_MON (to reduce
>> overhead during system boot), but that moves the sync(2) far away
>> from the logic that updates the state number. That exposes us to NSM
>> state number walk-back if the system crashes at the wrong time. It's
>> arguable how much of a problem that is.
>
> Sounds like there is room for improvement here, definitely.
>
> This is only a half-formed idea, but:
>    sm-notify could update 'state' to an odd number if it is even, but
>    not sync anything
>    statd, on the first SM_MON, updates 'state' to an even number if it
>    was odd and in that case does the required sync.

That would still provide an opportunity for state number replay, which
would make at least one subsequent notification a no-op. Given recent
discussions on lkml about the behavior of sync/fsync with regard to
renames, unlinks, file creation and the like, I think we should be
more conservative about this, not less. (In fact my current prototype
uses sqlite3 instead of flat files for all of this).

> I would need to check the protocol and the code and do a bit of case
> analysis to be sure I had that right, but I suspect it is close.
> (or it could be made completely irrelevant by subsequent
> observations. Read on!)
>
>> o It is better to send notifications when lockd is up.
>> For clients, at least, lockd comes up only after the first NFS
>> mount, and in automounter scenarios, that may not be for some time
>> after a reboot. Servers may not start nfslock until they do
>> "service nfslock start; service nfs start" at some point possibly
>> long after reboot. So should clients be notified right when the
>> server peer starts up, or after the server peer has fired up its
>> NFSD and lockd service?
>
> When a client notifies a server that it has rebooted, the server
> simply drops the locks. There is no need for the client lockd to be
> running.

Agreed. However, at least for Linux, statd is used on both the client
and server, and a system can act as both concurrently. There's no real
way for statd to distinguish between remote clients and servers from
an SM_MON request.

> When a server notifies a client that it has rebooted, the client tries
> to reclaim the locks. So the server lockd *must* be running at that
> time. It is not a case of 'better'. It is 'must'.

Jeff Layton observed Solaris NFS servers (the reference NFSv2/v3
implementation) sending reboot notifications before their lockd is
alive. That's why I qualified the requirement.

> So if a machine is an NFS server that plans to keep serving, it must
> start nfsd (and hence lockd) before running sm-notify.
>
> However it is good to have statd running before lockd, as lockd needs
> to talk to statd.
> So the order seems to be:
>    statd
>    nfsd and hence lockd
>    sm-notify
>
> which is clearly documented in the README, but seems to disagree with
> what we said above :-)
> We want to clean out the 'sm' directory, then run statd/nfsd/lockd,
> then sm-notify reads the sm.bak and sends off the notifications.
>
> There does seem to be room for improvement here. And I feel that
> having sm-notify separate actually makes it easier to get this
> right...
>
> How about this for a bit of a left-field idea:
>  - files representing monitored hosts are stored in
>    /var/lib/nfs/sm.$STATE
>  - At reboot, /var/lib/nfs/state is incremented (twice?) but not
>    synced.
>  - statd, on first SM_MON creates /var/lib/nfs/sm.$STATE if needed,
>    based on the value in the 'state' file, and does the required
>    sync at that point
>  - sm-notify can be run at any time after nfsd (if required) is
>    started, and send notification to any host in a sm.$STATE where
>    $STATE < 'state'. The 'state' number in the notification is
>    $STATE (or is it $STATE+1??)

sm-notify should send the same NSM state number as lockd is sending in
NLMPROC_LOCK requests. afaict only odd state numbers are passed
between peers.

>> o Those who package statd/sm-notify have to understand how these
>> operate. The people who create system init-scripts are generally not
>> NFS experts, thus they must have local knowledge about statd and
>> sm-notify in order to get this all correct. It would be more
>> fool-proof if we hard-coded the start-up behavior, and took it out
>> of the hands of the init-scripts folks, whom we do not control. How
>> do we document the operational dependencies in a way that makes it
>> very hard for non-NFS folks to set this up incorrectly? One way is
>> to build it all in a single program.
>
> That is a strong argument. It is probably part of the argument for
> putting it all in the kernel too.

Putting it _all_ in the kernel is a challenge. One issue is that the
kernel should never write into local files, so some user space
interaction is rather a requirement. However, I think a scheme where
the kernel provides the NSM service listener, and exposes its NSM
cache to user space via rpc_pipefs or some other mechanism might be
better than having lockd post SM_MON/SM_UNMON requests and listen for
NLM callbacks from statd.
The kernel can provide more information about the remote peer: the IP
address it used to contact us; the transport protocol it used to
contact us; and whether it is a client or a server peer. None of that
information is available in the NSM protocol today. The kernel also
knows for certain when reboots occur, and when server-side grace
period starts and ends.

That's a future idea, though. Right now we just need something that
supports IPv6.

> A valid question is: *can* we build it all into a single program?
>
> Given that:
>    state and sm need to be updated before statd responds to SM_MON
>    statd should be ready to respond to SM_MON before lockd starts
>    exportfs -av must be run before nfsd starts
>    nfsd and lockd must start before notifications are sent on a server
>    notifications (from the server) must be sent promptly after
>      nfsd starts its grace period.

Perhaps another desirable characteristic would be to curtail or stop
notification once the grace period ends. But it seems to me that start
of the grace period is when you want to post SM_NOTIFY requests. And
statd can't possibly know when that is unless lockd tells it.

> I find it hard to see a single statd being able to do the whole thing.
> We have a 'README' to document the order. We could provide a sample
> startup script. I don't think we *can* provide a "get it all right"
> program.

I don't see anything in your argument why it can't be done in a single
program, but could be done in an init script (or two). statd could,
for example, listen for signals to determine when to fire off
sm-notify. It already listens for SIGUSR1 today. Or, we could require
the kernel to post an SM_SIMU_CRASH when it is ready for statd to send
notifications. (That's one reason I brought up SM_SIMU_CRASH above).

So I guess my argument is that we can do this in a single program if
we use a little more of the NSM protocol, ensuring that lockd
communicates a little more with statd.
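
To make the "wait for a trigger, then notify" idea concrete, here is a
minimal shell sketch. It is not how rpc.statd or sm-notify work today.
The names notify_peers, on_lockd_ready and PEERS are made up for
illustration, and the choice of SIGUSR1 as the "lockd is ready"
trigger is an assumption; an SM_SIMU_CRASH upcall could call the same
handler.

```shell
#!/bin/sh
# Sketch only: a stand-in for a merged statd that holds reboot
# notifications until it is told that lockd is ready.
# notify_peers, on_lockd_ready and PEERS are hypothetical names.

notified=0   # becomes 1 once notifications have gone out for this boot
sent_to=""   # peers notified so far (kept only so the effect is visible)

# Hypothetical stand-in for posting SM_NOTIFY to each monitored peer.
notify_peers() {
    for peer in "$@"; do
        sent_to="${sent_to}${sent_to:+ }${peer}"
    done
}

# Trigger handler: send notifications at most once per boot.
on_lockd_ready() {
    if [ "$notified" -eq 0 ]; then
        # PEERS is left unquoted on purpose so it splits into words.
        notify_peers $PEERS
        notified=1
    fi
}

PEERS="client-a client-b"   # hypothetical monitored peers

# Assumed trigger: whoever knows the grace period has started sends
# SIGUSR1 to this process.
trap on_lockd_ready USR1
```

A second trigger is a no-op, so a repeated signal would not cause
peers to be notified twice for the same boot.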
>> If there are one or more strong reasons to keep these separate, I can
>> go down that road. But I think the practical matters of making NSM
>> work in multiple Linux distributions, each with their own packaging
>> and init-script mechanisms and requirements, suggests we'd be better
>> off making it simple to get this right.
>
> "simple to get this right" is certainly good.
> But "right" must over-rule "simple", and it seems like we might not
> even really be at "right" yet. :-(
>
> Maybe the way to make sure people get it to work is to detect broken
> configurations and fail horribly...

As Greg likes to say: "Meh."

I think everyone will be better off if we try to get it all to work
automatically. With warnings, we then depend on the patience of
distributors and administrators to troubleshoot this. It should "just
work."

> So:
>   sm-notify performs its own /var/run locking to make sure it is only
>     run once (plus allow for --simu-crash??)
>     It quickly updates /var/lib/nfs/ (with no sync) and then checks
>     to see if mountd is running. If it is, it assumes 'server' and
>     waits a while for lockd to appear (both checks via portmap).
>     Once lockd is running (or mountd was not), it sends out
>     notifications.
>   mountd checks if sm-notify has already run (via the /var/run file),
>     and complains gently, maybe only if it is less than a few minutes
>     before boot. e.g.
>       WARNING: during boot, mountd must be run before sm-notify!
>
>   statd always runs sm-notify first and waits for it to exit, which
>     it does once it has moved things aside and updated 'state'.
>     On the first SM_MON call, statd calls 'fsync' on 'state' and
>     related directories, and writes the 'state' value to the
>     kernel.... which is moments too late. The kernel has already
>     used it.

The state number is returned in the SM_MON reply. As mentioned, I sent
Trond a patch for client side to dig that out before posting an
NLMPROC_LOCK request. The server side doesn't seem to care what its
local NSM state is.
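
For reference, the "update 'state'" step that keeps coming up can be
sketched in a few lines of shell. This assumes the usual NSM
convention that an odd state number means "up" (which fits the remark
above that only odd numbers are passed between peers), and it assumes
the state file holds a decimal number as text, which is a
simplification for illustration and not a description of the real
file. The function names are hypothetical, and the sync/fsync question
debated above is deliberately left out.

```shell
#!/bin/sh
# Sketch only: bump an NSM-style state number at reboot.
# Assumptions: odd means "up"; the state file is decimal text.

# next_up_state N: print the smallest odd number greater than N.
next_up_state() {
    if [ $(( $1 % 2 )) -eq 0 ]; then
        echo $(( $1 + 1 ))
    else
        echo $(( $1 + 2 ))
    fi
}

# bump_state_file FILE: replace the number in FILE with the next "up"
# value and print it. A missing or empty file counts as state 0.
# (POSIX sh has no local variables, so file/old/new are globals.)
bump_state_file() {
    file=$1
    old=0
    if [ -r "$file" ]; then
        read -r old < "$file" || old=0
    fi
    old=${old:-0}
    new=$(next_up_state "$old")
    # Write to a temporary name and rename, so a reader never sees a
    # half-written file. No sync is done here.
    printf '%s\n' "$new" > "$file.tmp" && mv "$file.tmp" "$file"
    echo "$new"
}
```

For example, a file containing 6 would be rewritten to 7, and a file
containing 7 would be rewritten to 9, so the number always moves
forward and always stays odd.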
>     Maybe we need a call to nsm_monitor in nlmclnt_proc,
>     and maybe _reclaim and _cancel too - not sure
>
>   mount.nfs makes sure statd is running - we already have that.

We also have lockd checking that statd is running via an SM_MON upcall
before sending the first NLM request on this mount point (yes, and
that check is actually working in 2.6.29! it now refuses to allow a
lock operation if it can't contact statd). Do we need both?

>   rpc.nfsd can complain if statd is not already running, or maybe
>     even just start it.
> That, I think, should enforce some of the ordering, and complain
> if other ordering requirements aren't met.
>
> And just for the record: my strongest argument for keeping them
> separate is that statd (being a network service) should only be
> started if and when it is actually needed, while sm-notify should
> always be run at boot in case it has some cleaning up to do.

OK, noted. I take it this is more of a security thing -- try to limit
network service exposure when possible.

I know that Linux statd has a checkered security past, but it seems
that we're not terribly consistent on this front with other services.
rpcbind is always running whether we have NFSD and NFS mounts or not.
rpcbind, statd and lockd are running when we have only NFSv4 mounts,
and rpcbind and statd run when we have no mounts at all.

Systems that don't want NFS can simply avoid starting
/etc/init.d/nfslock and /etc/init.d/nfs at boot time. IMO that's
enough -- the added dynamic starting up and shutting down of these
services makes them much more complex and fragile than needed.

There is only a single case I can think of where we might want
notification, but not want to start statd. That is when an admin
decides to disable NFS on a system. One last notification is
appropriate, but statd shouldn't be started. We could probably
accomplish this with the "notify then exit" option on statd.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
end of thread, other threads:[~2009-05-21 17:15 UTC | newest]
Thread overview: 8+ messages
2009-05-19 14:36 RFC: merging sm-notify and rpc.statd Chuck Lever
2009-05-19 22:39 ` Neil Brown
[not found] ` <18963.13619.563465.804193-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-05-19 23:25 ` Mike Frysinger
2009-05-20 1:05 ` NeilBrown
[not found] ` <bdacaae74fbd57ee96599286eae43751.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-05-20 1:10 ` Ben Greear
2009-05-20 16:38 ` Chuck Lever
2009-05-21 0:01 ` Neil Brown
[not found] ` <18964.39373.232045.96215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-05-21 17:14 ` Chuck Lever