From: Rob Ross <rross@mcs.anl.gov>
To: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>,
Trond Myklebust <trond.myklebust@fys.uio.no>,
Andreas Dilger <adilger@clusterfs.com>,
Sage Weil <sage@newdream.net>, Brad Boyer <flar@allandria.com>,
Anton Altaparmakov <aia21@cam.ac.uk>,
Gary Grider <ggrider@lanl.gov>,
linux-fsdevel@vger.kernel.org
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Wed, 06 Dec 2006 10:15:13 -0600
Message-ID: <4576EC91.6000706@mcs.anl.gov>
In-Reply-To: <20061206154426.GU3013@parisc-linux.org>

Matthew Wilcox wrote:
> On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
>> The openg() solution has the following advantages to what you propose.
>> First, it places the burden of the communication of the file handle on
>> the application process, not the file system. That means less work for
>> the file system. Second, it does not require that clients respond to
>> unexpected network traffic. Third, the network traffic is deterministic
>> -- one client interacts with the file system and then explicitly
>> performs the broadcast. Fourth, it does not require that the file system
>> store additional state on clients.
>
> You didn't address the disadvantages I pointed out on December 1st in a
> mail to the posix mailing list:
I coincidentally just wrote about some of this in another email. Wasn't
trying to avoid you...
> : I now understand this not so much as a replacement for dup() but in
> : terms of being able to open by NFS filehandle, or inode number. The
> : fh_t is presumably generated by the underlying cluster filesystem, and
> : is a handle that has meaning on all nodes that are members of the
> : cluster.
Exactly.
> : I think we need to consider security issues (that have also come up
> : when open-by-inode-number was proposed). For example, how long is the
> : fh_t intended to be valid for? Forever? Until the cluster is rebooted?
> : Could the fh_t be used by any user, or only those with credentials to
> : access the file? What happens if we revoke() the original fd?
The fh_t would be validated either (a) when openfh() is called, or (b)
on accesses using the associated capability. As Christoph pointed out,
this really is a capability and encapsulates everything necessary for a
particular user to access a particular file. It can be handed to others,
and in fact that is a critical feature for our use case.

After the openfh(), the access model is identical to that of a previously
open()ed file. So the question is what happens between the openg() and
the openfh().

Our intention was to allow servers to "forget" these fh_ts at will. So a
revoke between openg() and openfh() would kill the fh_t, and the
subsequent openfh() would fail, or subsequent accesses would fail
(depending on when the FS chose to validate).

Does this help?
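
As a rough illustration of the lifecycle described above, here is a
user-space sketch in C. It is not part of the proposal: the fh_t layout,
the sim_* functions, the choice of ESTALE as the error, and the retry
helper are all assumptions made for the example. It models the case
where the FS validates at openfh() time, the server forgets the handle
in between, and the application recovers by calling openg() again.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout: the proposal leaves fh_t opaque. */
typedef struct { uint64_t id; } fh_t;

#define MAX_HANDLES 8

/* Toy stand-in for server state: slot i holds a live handle id, or 0. */
static uint64_t live[MAX_HANDLES];
static uint64_t next_id = 1;

/* Stand-in for openg(): issue a handle for a path (path is ignored here). */
int sim_openg(const char *path, fh_t *out)
{
    (void)path;
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (live[i] == 0) {
            live[i] = next_id++;
            out->id = live[i];
            return 0;
        }
    }
    return -ENOSPC;
}

/* The server may drop a handle whenever it likes, e.g. on a revoke. */
void sim_server_forget(const fh_t *fh)
{
    for (int i = 0; i < MAX_HANDLES; i++)
        if (live[i] == fh->id)
            live[i] = 0;
}

/* Stand-in for openfh(): validate at open time. Returns a pretend
 * descriptor on success, or -ESTALE if the handle was forgotten. */
int sim_openfh(const fh_t *fh)
{
    if (fh->id == 0)
        return -ESTALE;
    for (int i = 0; i < MAX_HANDLES; i++)
        if (live[i] == fh->id)
            return 3 + i;
    return -ESTALE;
}

/* Application-side recovery: if the handle went stale, redo openg(). */
int open_with_retry(const char *path, fh_t *fh)
{
    int fd = sim_openfh(fh);
    if (fd == -ESTALE) {
        if (sim_openg(path, fh) != 0)
            return -EIO;
        fd = sim_openfh(fh);
    }
    return fd;
}
```

In a real parallel job only one process would fall back to openg() and
then pass the fresh fh_t to the others; the sketch keeps everything in
one process to stay small.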
> : I'm a little concerned about the generation of a suitable fh_t.
> : In the implementation of sutoc(), how does the kernel know which
> : filesystem to ask to translate it? It's not impossible (though it is
> : implausible) that an fh_t could be meaningful to more than one
> : filesystem.
> :
> : One possibility of fixing this could be to use a magic number at the
> : beginning of the fh_t to distinguish which filesystem this belongs
> : to (a list of currently-used magic numbers in Linux can be found at
> : http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)
>
> Christoph has also touched on some of these points, and added some I
> missed.
We could use advice on this point. Certainly it's possible to encode
information about the FS from which the fh_t originated, but we haven't
tried to spell out exactly how that would happen. Your approach
described here sounds good to me.
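
To make the magic-number idea concrete, a minimal sketch in C follows.
The two magic values are the ones listed in Linux's
include/linux/magic.h; everything else (the fh_t layout, the payload
size, the owners table, and the function names) is hypothetical and
only meant to show the lookup, not how a kernel would actually
dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as listed in Linux's include/linux/magic.h. */
#define EXT2_SUPER_MAGIC 0xEF53
#define NFS_SUPER_MAGIC  0x6969

/* Hypothetical layout: magic first, then filesystem-private bytes. */
#define FH_PAYLOAD_MAX 60
typedef struct {
    uint32_t magic;
    unsigned char payload[FH_PAYLOAD_MAX];
} fh_t;

/* Hypothetical registry entry: which filesystem owns which magic. */
struct fh_owner {
    uint32_t magic;
    const char *fs_name;
};

static const struct fh_owner owners[] = {
    { EXT2_SUPER_MAGIC, "ext2" },
    { NFS_SUPER_MAGIC,  "nfs"  },
};

/* Build an otherwise empty handle tagged with the given magic. */
fh_t fh_make(uint32_t magic)
{
    fh_t fh;
    memset(&fh, 0, sizeof fh);
    fh.magic = magic;
    return fh;
}

/* Return the name of the filesystem that should translate this
 * handle, or NULL if no registered filesystem claims the magic. */
const char *fh_owner_name(const fh_t *fh)
{
    for (size_t i = 0; i < sizeof owners / sizeof owners[0]; i++)
        if (owners[i].magic == fh->magic)
            return owners[i].fs_name;
    return NULL;
}
```

Note that a magic number identifies a filesystem type rather than a
particular mount, so a scheme along these lines would presumably still
need something in the payload to pick the right instance.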
>> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing
>> the flag) would likely cause a storm of network traffic if clients were
>> closely synchronized (which they are likely to be).
>
> I think you're referring to a naive application, rather than a naive
> cluster filesystem, right? There's several ways to fix that problem,
> including throttling broadcasts of information, having nodes ask their
> immediate neighbours if they have a cache of the information, and having
> the server not respond (wait for a retransmit) if it's recently sent out
> a broadcast.
Yes, naive application. You're right that the file system could adapt to
this, but on the other hand if we were explicitly passing the fh_t in
user space, we could just use MPI_Bcast and be done with it, with an
algorithm that is well-matched to the system, etc.
>> However, the application change issue is actually moot; we will make
>> whatever changes inside our MPI-IO implementation, and many users will
>> get the benefits for free.
>
> That's good.
Absolutely. Same goes for readx()/writex() also, BTW, at least for
MPI-IO users. We will build the input parameters inside MPI-IO using
existing information from users, rather than applying data sieving or
using multiple POSIX calls.
>> The readdirplus(), readx()/writex(), and openg()/openfh() were all
>> designed to allow our applications to explain exactly what they wanted
>> and to allow for explicit communication. I understand that there is a
>> tendency toward solutions where the FS guesses what the app is going to
>> do or is passed a hint (e.g. fadvise) about what is going to happen,
>> because these things don't require interface changes. But these
>> solutions just aren't as effective as actually spelling out what the
>> application wants.
>
> Sure, but I think you're emphasising "these interfaces let us get our
> job done" over the legitimate concerns that we have. I haven't really
> looked at the readdirplus() or readx()/writex() interfaces, but the
> security problems with openg() makes me think you haven't really looked
> at it from the "what could go wrong" perspective.
I'm sorry if it seems like I'm ignoring your concerns; that isn't my
intention. I am advocating the calls though, because the whole point in
getting into these discussions is to improve the state of things for
these access patterns.

Part of the problem is that the descriptions of these calls were written
for inclusion in a POSIX document and not for discussion on this list.
Those descriptions don't usually include detailed descriptions of
implementation options or use cases. We should have created some
additional documentation before coming to this list, but what is done is
done.

In the case of openg(), the major approach to things "going wrong" is
for the server to just forget it ever handed out the fh_t and make the
application figure it out. We think that makes implementations
relatively simple, because we don't require so much. It makes using this
capability a little more difficult outside the kernel, but we're
prepared for that.
> I'd be interested in
> reviewing the readx()/writex() interfaces, but still don't see a document
> for them anywhere.
Really? Ack! Ok. I'll talk with the others and get a readx()/writex()
page up soon, although it would be nice to let the discussion of these
few calm down a bit before we start with those...I'm not getting much
done at work right now :).
Thanks for the discussion,
Rob