From: Chuck Lever <chuck.lever@oracle.com>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Subject: Re: reboot recovery
Date: Tue, 09 Mar 2010 16:07:31 -0500 [thread overview]
Message-ID: <4B96B893.4030300@oracle.com> (raw)
In-Reply-To: <20100309205349.GD26453@fieldses.org>
On 03/09/2010 03:53 PM, J. Bruce Fields wrote:
> On Tue, Mar 09, 2010 at 12:39:35PM -0500, Chuck Lever wrote:
>> Thanks, this is very clear.
>>
>> On 03/08/2010 08:46 PM, J. Bruce Fields wrote:
>>> The Linux server's reboot recovery code has long-standing architectural
>>> problems, fails to adhere to the specifications in some cases, and does
>>> not yet handle NFSv4.1 reboot recovery. An overhaul has been a
>>> long-standing todo.
>>>
>>> This is my attempt to state the problem and a rough solution.
>>>
>>> Requirements
>>> ^^^^^^^^^^^^
>>>
>>> Requirements, as compared to current code:
>>>
>>> - Correctly implements the algorithm described in section 8.6.3
>>> of rfc 3530, and eliminates known race conditions on recovery.
>>> - Does not attempt to manage files and directories directly from
>>> inside the kernel.
>>> - Supports RECLAIM_COMPLETE.
>>>
>>> Requirements, in more detail:
>>>
>>> A "server instance" is the lifetime from start to shutdown of a server;
>>> a reboot ends one server instance and starts another.
>>
>> It would be better if you architected this not in terms of a server
>> reboot, but in terms of "service nfs stop" and "service nfs start".
>
> Good point; fixed in my local copy.
>
> (Though that may work for v4-only servers, since I think v2/v3 may still
> have problems with restarts that don't restart everything (including the
> client).)
Well, eventually I hope to address some of those issues. But, no use
tying our NFSv4 stuff to the problems of the v2/v3 implementation.
>>> Draft design
>>> ^^^^^^^^^^^^
>>>
>>> We will modify rpc.statd to manage state in userspace.
>>
>> Please don't. statd is ancient krufty code that is already barely able
>> to do what it needs to do.
>>
>> statd is single-threaded. It makes dozens of blocking DNS calls to
>> handle NSM protocol requests. It makes NLM downcalls on the same thread
>> that handles everything else. Unless an effort was undertaken to make
>> statd multithreaded, this extra work could cause significant latency for
>> handling upcalls.
>
> Hm, OK. I guess I don't want to make this project dependent on
> rewriting statd.
>
> So, other possibilities:
> - Modify one of the other existing userland daemons.
> - Make a separate daemon just for this.
> - Ditch the daemon entirely and depend mainly on hotplug-like
> invocations of a userland program that exits after it handles
> a single call.
>
>>> Previous prototype code from CITI will be considered as a starting
>>> point.
>>>
>>> Kernel<->user communication will use four files in the "nfsd"
>>> filesystem. All of them will use the encoding used for rpc cache
>>> upcalls and downcalls, which consist of whitespace-separated fields
>>> escaped as necessary to allow binary data.
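For readers unfamiliar with that encoding: fields are space-separated and
newline-terminated, with whitespace and backslash inside a field escaped
so binary data survives. A sketch of an encoder, assuming the three-digit
octal (\ooo) escaping used by the kernel's rpc cache code; the function
name is mine, not the kernel's:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Escape one field of an rpc-cache-style upcall line.  Whitespace,
 * backslash, and NUL bytes are replaced with three-digit octal escapes
 * (\ooo) so that fields can be split on whitespace and still carry
 * arbitrary binary data.  Illustrative only. */
static size_t encode_field(char *out, size_t outlen,
			   const char *in, size_t inlen)
{
	size_t n = 0;

	for (size_t i = 0; i < inlen; i++) {
		unsigned char c = in[i];

		if (c == '\\' || c == '\0' || isspace(c)) {
			if (n + 5 > outlen)
				break;
			n += sprintf(out + n, "\\%03o", c);
		} else {
			if (n + 2 > outlen)
				break;
			out[n++] = c;
		}
	}
	out[n] = '\0';
	return n;
}
```

So a field containing "a b\c" would go over the wire as "a\040b\134c",
and the reader can split the line on unescaped whitespace.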
>>
>> In general, we don't want to mix RPC listeners and upcall file
>> descriptors. mountd has to access the cache file descriptors to satisfy
>> MNT requests, so there is a reason to do it in that case. Here there is
>> no purpose to mix these two. It only adds needless implementation
>> complexity and unnecessary security exposures.
>>
>> Yesterday, it was suggested that we split mountd into a piece that
>> handled upcalls and a piece that handled remote MNT requests via RPC.
>> Weren't you the one who argued in favor of getting rid of daemons called
>> "rpc.foo" for NFSv4-only operation? :-)
>
> Yeah. So I guess a subcase of the second option above would be to name
> the new daemon "nfsd-userland-helper" (or something as generic) and
> eventually make it handle export upcalls too. I don't know.
I wasn't thinking of a single daemon for this stuff, necessarily, but
rather a single framework that can be easily fit to whatever task is
needed. Just alter a few constants, specify the arguments and their
types, add boiling water, type 'make' and fluff with fork.
We've already got referral/DNS, idmapper, gss, and mountd upcalls, and
they all seem to do it differently from each other.
--
chuck[dot]lever[at]oracle[dot]com