From: "J. Bruce Fields" <bfields@fieldses.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: reboot recovery
Date: Tue, 9 Mar 2010 15:53:49 -0500 [thread overview]
Message-ID: <20100309205349.GD26453@fieldses.org> (raw)
In-Reply-To: <4B9687D7.7010902@oracle.com>
On Tue, Mar 09, 2010 at 12:39:35PM -0500, Chuck Lever wrote:
> Thanks, this is very clear.
>
> On 03/08/2010 08:46 PM, J. Bruce Fields wrote:
>> The Linux server's reboot recovery code has long-standing architectural
>> problems, fails to adhere to the specifications in some cases, and does
>> not yet handle NFSv4.1 reboot recovery. An overhaul has been a
>> long-standing todo.
>>
>> This is my attempt to state the problem and a rough solution.
>>
>> Requirements
>> ^^^^^^^^^^^^
>>
>> Requirements, as compared to current code:
>>
>> - Correctly implements the algorithm described in section 8.6.3
>> of rfc 3530, and eliminates known race conditions on recovery.
>> - Does not attempt to manage files and directories directly from
>> inside the kernel.
>> - Supports RECLAIM_COMPLETE.
>>
>> Requirements, in more detail:
>>
>> A "server instance" is the lifetime from start to shutdown of a server;
>> a reboot ends one server instance and starts another.
>
> It would be better if you architected this not in terms of a server
> reboot, but in terms of "service nfs stop" and "service nfs start".
Good point; fixed in my local copy.
(Though that may work for v4-only servers, since I think v2/v3 may still
have problems with restarts that don't restart everything (including the
client).)
>> Draft design
>> ^^^^^^^^^^^^
>>
>> We will modify rpc.statd to handle to manage state in userspace.
>
> Please don't. statd is ancient krufty code that is already barely able
> to do what it needs to do.
>
> statd is single-threaded. It makes dozens of blocking DNS calls to
> handle NSM protocol requests. It makes NLM downcalls on the same thread
> that handles everything else. Unless an effort was undertaken to make
> statd multithreaded, this extra work could cause signficant latency for
> handling upcalls.
Hm, OK. I guess I don't want to make this project dependent on
rewriting statd.
So, other possibilities:
- Modify one of the other existing userland daemons.
- Make a separate daemon just for this.
- ditch the daemon entirely and depend mainly on hotplug-like
invocations of a userland program that exist after it handles
a single call.
>> Previous prototype code from CITI will be considered as a starting
>> point.
>>
>> Kernel<->user communication will use four files in the "nfsd"
>> filesystem. All of them will use the encoding used for rpc cache
>> upcalls and downcalls, which consist of whitespace-separated fields
>> escaped as necessary to allow binary data.
>
> In general, we don't want to mix RPC listeners and upcall file
> descriptors. mountd has to access the cache file descriptors to satisfy
> MNT requests, so there is a reason to do it in that case. Here there is
> no purpose to mix these two. It only adds needless implementation
> complexity and unnecessary security exposures.
>
> Yesterday, it was suggested that we split mountd into a piece that
> handled upcalls and a piece that handled remote MNT requests via RPC.
> Weren't you the one who argued in favor of getting rid of daemons called
> "rpc.foo" for NFSv4-only operation? :-)
Yeah. So I guess a subcase of the second option above would be to name
the new daemon "nfsd-userland-helper" (or something as generic) and
eventually make it handle export upcalls too. I don't know.
>> Before starting the server, and writing to allow_client, statd will
>> manage boot times and old clients using files in /var/lib/nfs:
>>
>> If boot_time exists:
>> - It will be read, and the contents interpreted as an
>> ascii-encoded unix time in seconds.
>> - All client records older than that time will be removed.
>> - The current boot_time will be recorded to
>> new_boot_time (replacing any existing such file).
>> - All remaining clients will be written to allow_client.
>> If boot_time does not exist, an empty /var/lib/nfs/v4clients/ is
>> created if necessary, but nothing else is done.
>
> Since I've split out the pieces of statd that manage its on-disk file
> format (see support/nsm/file.c) it shouldn't be difficult to
> copy-n-paste the pieces needed to construct /var/lib/nfs/v4clients.
>
> I have some additional patches for statd that can detect system reboots,
> but again, it would be better perhaps to design for "server nfs restart"
> rather than a full system reboot.
OK.
>
>> Statd will then wait for create_client, expire_client, and grace_done
>> calls. On grace_done, it will rename boot_time to old_boot_time, and
>> new_boot_time to boot_time.
>
> Although it's noble to attempt to reuse old code in this way, I think
> you will be far better off constructing and using a proper scaffold for
> dealing generically with cache upcalls. By doing this we avoid the
> complexity of updating working legacy code, and have a better chance for
> building something that scales well right off the bat. This is new
> code, so why chain yourself to legacy problems?
>
> A starting place could be the work Trond is doing to replace idmapd.
OK, thanks for the review.
--b.
next prev parent reply other threads:[~2010-03-09 20:52 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-03-09 1:46 reboot recovery J. Bruce Fields
2010-03-09 14:46 ` Andy Adamson
2010-03-09 14:53 ` J. Bruce Fields
2010-03-09 14:55 ` William A. (Andy) Adamson
2010-03-09 15:10 ` J. Bruce Fields
2010-03-09 15:17 ` William A. (Andy) Adamson
2010-03-09 16:11 ` J. Bruce Fields
2010-03-09 17:39 ` Chuck Lever
2010-03-09 20:53 ` J. Bruce Fields [this message]
2010-03-09 21:07 ` Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100309205349.GD26453@fieldses.org \
--to=bfields@fieldses.org \
--cc=chuck.lever@oracle.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox