From: Peter Staubach <staubach@redhat.com>
To: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>,
Christoph Hellwig <hch@infradead.org>,
Brad Boyer <flar@allandria.com>,
Anton Altaparmakov <aia21@cam.ac.uk>,
Andreas Dilger <adilger@clusterfs.com>,
Gary Grider <ggrider@lanl.gov>,
linux-fsdevel@vger.kernel.org
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Mon, 04 Dec 2006 13:02:23 -0500
Message-ID: <457462AF.5080601@redhat.com>
In-Reply-To: <Pine.LNX.4.62.0612011018100.15475@wtf.di.newdream.net>
Sage Weil wrote:
> On Fri, 1 Dec 2006, Trond Myklebust wrote:
>> I'm quite happy with a proposal for a statlite(). I'm objecting to
>> readdirplus() because I can't see that it offers you anything useful.
>> You haven't provided an example of an application which would clearly
>> benefit from a readdirplus() interface instead of readdir()+statlite()
>> and possibly some tools for managing cache consistency.
>
> Okay, now I think I understand where you're coming from.
>
> The difference between readdirplus() and readdir()+statlite() is that
> (depending on the mask you specify) statlite() either provides the
> "right" answer (ala stat()), or anything that is vaguely "recent."
> readdirplus() would provide size/mtime from sometime _after_ the
> initial opendir() call, establishing a useful ordering. So without
> readdirplus(), you either get readdir()+stat() and the performance
> problems I mentioned before, or readdir()+statlite() where "recent"
> may not be good enough.
>
> Instead of my previous example of process #1 waiting for process #2
> to finish and then checking the results with stat(), imagine instead
> that #1 is waiting for 100,000 other processes to finish, and then
> wants to check the results (size/mtime) of all of them.
> readdir()+statlite() won't work, and readdir()+stat() may be
> pathologically slow.
>
> Also, it's a tiring and trivial example, but even the 'ls -al'
> scenario isn't ideally addressed by readdir()+statlite(), since
> statlite() might return size/mtime from before 'ls -al' was executed
> by the user. One can easily imagine modifying a file on one host,
> then doing 'ls -al' on another host and not seeing the effects. If
> 'ls -al' can use readdirplus(), its overall application semantics can
> be preserved without hammering large directories in a distributed
> filesystem.
>
I think that there are several points which are missing here.

First, readdirplus(), without any sort of caching, is going to be _very_
expensive, performance-wise, for _any_ size directory. You can see this
by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS
semantics.

Second, the NFS client side readdirplus() implementation is going to be
_very_ expensive as well. The NFS client does write-behind and all this
data _must_ be flushed to the server _before_ the over-the-wire READDIRPLUS
can be issued. This means that the client will have to step through every
inode which is associated with the directory inode being readdirplus()'d
and ensure that all modified data has been successfully written out. This
part of the operation, for a sufficiently large directory and a sufficiently
large page cache, could take significant time in itself.

These overheads may make this new operation expensive enough that no
applications will end up using it.
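
To make the client-side cost concrete, here is a toy model in Python. It is
not the NFS client code, only a sketch under stated assumptions: the names
Inode, ToyClient, flush and pages_flushed are made up for illustration, and
the returned list merely stands in for the READDIRPLUS reply. The point is
that the flush work done before the request can go out scales with the
amount of dirty data cached under the directory.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Hypothetical cached inode: a name, a size, and unwritten pages."""
    name: str
    size: int
    dirty_pages: int = 0  # pages written locally but not yet sent to the server


@dataclass
class ToyClient:
    """Hypothetical model of a client's cached view of one directory."""
    inodes: list
    pages_flushed: int = 0

    def flush(self, inode):
        # Stand-in for writing the inode's dirty pages back to the server.
        self.pages_flushed += inode.dirty_pages
        inode.dirty_pages = 0

    def readdirplus(self):
        # Every cached inode under the directory has to be clean before the
        # attributes returned by the server can be trusted.
        for inode in self.inodes:
            if inode.dirty_pages:
                self.flush(inode)
        # Stand-in for the over-the-wire READDIRPLUS reply.
        return [(inode.name, inode.size) for inode in self.inodes]
```

With three cached inodes holding 4, 0 and 6 dirty pages, one readdirplus()
call in this model flushes all 10 pages before it returns anything.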
>> I agree that an interface which allows a userland process to offer hints to
>> the kernel as to what kind of cache consistency it requires for file
>> metadata would be useful. We already have stuff like posix_fadvise() etc
>> for file data, and perhaps it might be worth looking into how you could
>> devise something similar for metadata.
>> If what you really want is for applications to be able to manage network
>> filesystem cache consistency, then why not provide those tools instead?
>
> True, something to manage the attribute cache consistency for
> statlite() results would also address the issue by letting an
> application declare how weak its results are allowed to be. That
> seems a bit more awkward, though, and would only affect
> statlite()--the only call that allows weak consistency in the first
> place. In contrast, readdirplus maps nicely onto what filesystems
> like NFS are already doing over the wire.
Speaking of applications, how many applications are there in the world,
or even being contemplated, which are interested in a directory of
files and whether or not this set of files has changed from the previous
snapshot of the set of files? Most applications deal with one or two
files on such a basis, not multitudes. In fact, having worked with
file systems and NFS in particular for more than 20 years now, I have
yet to hear of one. This is a lot of work and complexity for very
little gain, I think.

Is this not a problem which would be better solved at the application level?
Or perhaps with finer granularity than "noac" for the NFS attribute caching?
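
As a rough illustration of what "at the application level" could mean, the
following Python sketch records size/mtime for every entry in a directory
and compares two such snapshots. The helper names snapshot() and diff() are
hypothetical; os.scandir() and DirEntry.stat() are the standard library
calls. On a network file system the values are still only as fresh as the
client's attribute cache allows, so this addresses the bookkeeping and not
the consistency question.

```python
import os


def snapshot(path):
    """Map each entry name in `path` to its (size, mtime_ns) pair."""
    snap = {}
    with os.scandir(path) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            snap[entry.name] = (st.st_size, st.st_mtime_ns)
    return snap


def diff(old, new):
    """Return (added, removed, modified) sets of names between two snapshots."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    modified = {name for name in set(old) & set(new) if old[name] != new[name]}
    return added, removed, modified
```

An application would call snapshot() once, wait for the other processes to
finish, call it again, and act only on the names that diff() reports.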
Thanx...
ps