All of lore.kernel.org
 help / color / mirror / Atom feed
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: reboot recovery
Date: Tue, 9 Mar 2010 15:53:49 -0500	[thread overview]
Message-ID: <20100309205349.GD26453@fieldses.org> (raw)
In-Reply-To: <4B9687D7.7010902@oracle.com>

On Tue, Mar 09, 2010 at 12:39:35PM -0500, Chuck Lever wrote:
> Thanks, this is very clear.
>
> On 03/08/2010 08:46 PM, J. Bruce Fields wrote:
>> The Linux server's reboot recovery code has long-standing architectural
>> problems, fails to adhere to the specifications in some cases, and does
>> not yet handle NFSv4.1 reboot recovery.  An overhaul has been a
>> long-standing todo.
>>
>> This is my attempt to state the problem and a rough solution.
>>
>> Requirements
>> ^^^^^^^^^^^^
>>
>> Requirements, as compared to current code:
>>
>> 	- Correctly implements the algorithm described in section 8.6.3
>> 	  of rfc 3530, and eliminates known race conditions on recovery.
>> 	- Does not attempt to manage files and directories directly from
>> 	  inside the kernel.
>> 	- Supports RECLAIM_COMPLETE.
>>
>> Requirements, in more detail:
>>
>> A "server instance" is the lifetime from start to shutdown of a server;
>> a reboot ends one server instance and starts another.
>
> It would be better if you architected this not in terms of a server  
> reboot, but in terms of "service nfs stop" and "service nfs start".

Good point; fixed in my local copy.

(Though that may work for v4-only servers, since I think v2/v3 may still
have problems with restarts that don't restart everything (including the
client).)

>> Draft design
>> ^^^^^^^^^^^^
>>
>> We will modify rpc.statd to handle to manage state in userspace.
>
> Please don't.  statd is ancient krufty code that is already barely able  
> to do what it needs to do.
>
> statd is single-threaded.  It makes dozens of blocking DNS calls to  
> handle NSM protocol requests.  It makes NLM downcalls on the same thread  
> that handles everything else.  Unless an effort was undertaken to make  
> statd multithreaded, this extra work could cause signficant latency for  
> handling upcalls.

Hm, OK.  I guess I don't want to make this project dependent on
rewriting statd.

So, other possibilities:
	- Modify one of the other existing userland daemons.
	- Make a separate daemon just for this.
	- ditch the daemon entirely and depend mainly on hotplug-like
	  invocations of a userland program that exist after it handles
	  a single call.

>> Previous prototype code from CITI will be considered as a starting
>> point.
>>
>> Kernel<->user communication will use four files in the "nfsd"
>> filesystem.  All of them will use the encoding used for rpc cache
>> upcalls and downcalls, which consist of whitespace-separated fields
>> escaped as necessary to allow binary data.
>
> In general, we don't want to mix RPC listeners and upcall file  
> descriptors.  mountd has to access the cache file descriptors to satisfy  
> MNT requests, so there is a reason to do it in that case.  Here there is  
> no purpose to mix these two.  It only adds needless implementation  
> complexity and unnecessary security exposures.
>
> Yesterday, it was suggested that we split mountd into a piece that  
> handled upcalls and a piece that handled remote MNT requests via RPC.  
> Weren't you the one who argued in favor of getting rid of daemons called  
> "rpc.foo" for NFSv4-only operation? :-)

Yeah.  So I guess a subcase of the second option above would be to name
the new daemon "nfsd-userland-helper" (or something as generic) and
eventually make it handle export upcalls too.  I don't know.

>> Before starting the server, and writing to allow_client, statd will
>> manage boot times and old clients using files in /var/lib/nfs:
>>
>> 	If boot_time exists:
>> 		- It will be read, and the contents interpreted as an
>> 		  ascii-encoded unix time in seconds.
>> 		- All client records older than that time will be removed.
>> 		- The current boot_time will be recorded to
>> 		  new_boot_time (replacing any existing such file).
>> 		- All remaining clients will be written to allow_client.
>> 	If boot_time does not exist, an empty /var/lib/nfs/v4clients/ is
>> 		created if necessary, but nothing else is done.
>
> Since I've split out the pieces of statd that manage its on-disk file  
> format (see support/nsm/file.c) it shouldn't be difficult to  
> copy-n-paste the pieces needed to construct /var/lib/nfs/v4clients.
>
> I have some additional patches for statd that can detect system reboots,  
> but again, it would be better perhaps to design for "server nfs restart"  
> rather than a full system reboot.

OK.

>
>> Statd will then wait for create_client, expire_client, and grace_done
>> calls.  On grace_done, it will rename boot_time to old_boot_time, and
>> new_boot_time to boot_time.
>
> Although it's noble to attempt to reuse old code in this way, I think  
> you will be far better off constructing and using a proper scaffold for  
> dealing generically with cache upcalls.  By doing this we avoid the  
> complexity of updating working legacy code, and have a better chance for  
> building something that scales well right off the bat.  This is new  
> code, so why chain yourself to legacy problems?
>
> A starting place could be the work Trond is doing to replace idmapd.

OK, thanks for the review.

--b.

  reply	other threads:[~2010-03-09 20:52 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-09  1:46 reboot recovery J. Bruce Fields
2010-03-09 14:46 ` Andy Adamson
2010-03-09 14:53   ` J. Bruce Fields
2010-03-09 14:55     ` William A. (Andy) Adamson
2010-03-09 15:10       ` J. Bruce Fields
2010-03-09 15:17         ` William A. (Andy) Adamson
2010-03-09 16:11           ` J. Bruce Fields
2010-03-09 17:39 ` Chuck Lever
2010-03-09 20:53   ` J. Bruce Fields [this message]
2010-03-09 21:07     ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100309205349.GD26453@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.