linux-fsdevel.vger.kernel.org archive mirror
From: Nikolai Joukov <kolya@cs.sunysb.edu>
To: "John T. Kohl" <jtk@us.ibm.com>
Cc: Trond Myklebust <trondmy@trondhjem.org>,
	dhowells@redhat.com, nfsv4@linux-nfs.org,
	Charles Wright <cwright@cs.sunysb.edu>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [RFC] Support for stackable file systems on top of nfs
Date: Sun, 13 Nov 2005 19:44:14 -0500 (EST)
Message-ID: <Pine.GSO.4.53.0511131919530.23544@compserv1>

>> Charles> On Fri, 2005-11-11 at 08:45 -0500, John T. Kohl wrote:
>>> Other than i_mapping/f_mapping, I don't think it's possible right now
>>> for stacking file systems to handle the address_space operations in our
>>> layer *and* share the same pages with the backing-store, since the struct
>>> pages are attached to the address space via file->f_mapping.
> Charles> At Stony Brook, we've come across similar problems. It is relatively
> Charles> easy to double cache, but inefficient.  It is also relatively easy to
> Charles> single-cache, but then you don't get to intercept any of these
> Charles> interesting operations.  Getting both at once is tricky.
>
> We currently do single-caching, by passing on the mmap operation to the
> backing store (swapping in the backing store file for vma->vm_file).
> (We do the equivalent in our MVFS built for vnode kernels.)  Swapping
> the vm file is mostly workable, but we do have to be a bit too
> knowledgeable about the innards of file mapping and do some things to
> accommodate the actions taken after fop->mmap is called.

What we are discussing here is only the tip of the iceberg.  We are
discussing the simplest case:
1) Page N of the upper filesystem's file corresponds to page N of the
lower file.
2) No page processing is necessary.
In that case we use either the technique described above
(http://lxr.fsl.cs.sunysb.edu/fistgen/source/templates/Linux-2.6/file.c#L458)
or CODA's i_mapping/f_mapping way.  Both have their own problems.
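
As a rough illustration of the vm_file hand-off quoted above, here is a
user-space model of the idea. It is a sketch under stated assumptions, not
code from fistgen, MVFS, or Coda: the structures are simplified stand-ins
for the kernel ones, the lower_file link and the f_count bookkeeping are
hypothetical, and the error path is one possible policy rather than what
any real filesystem does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified user-space stand-ins, not the kernel definitions. */
struct file;
struct vm_area_struct {
    struct file *vm_file;            /* file backing this mapping */
};
struct file_operations {
    int (*mmap)(struct file *, struct vm_area_struct *);
};
struct file {
    const struct file_operations *f_op;
    int f_count;                     /* models the file reference count */
    struct file *lower_file;         /* hypothetical link from upper to lower file */
};

/* Lower filesystem's mmap: expects the VMA to already refer to its own file. */
int lower_mmap(struct file *file, struct vm_area_struct *vma)
{
    return vma->vm_file == file ? 0 : -EINVAL;
}

/* Upper (stackable) mmap: hand the whole mapping to the lower file. */
int stackable_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct file *lower_file = file->lower_file;
    int err;

    if (!lower_file || !lower_file->f_op || !lower_file->f_op->mmap)
        return -ENODEV;

    lower_file->f_count++;           /* models get_file(lower_file) */
    vma->vm_file = lower_file;       /* pages now live only in the lower cache */

    err = lower_file->f_op->mmap(lower_file, vma);
    if (err) {
        /* Assumed error policy for this model: undo the swap. */
        vma->vm_file = file;
        lower_file->f_count--;
        return err;
    }

    file->f_count--;                 /* models fput(file): the VMA no longer pins it */
    return 0;
}
```

After a successful call the VMA refers only to the lower file, so there is
a single cached copy of each page, but the upper filesystem no longer sees
any page-level operation on that mapping.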

However, for most stackable filesystems we need to intercept the
writepage/readpage/prepare_write/commit_write operations.  Also, page N of
the upper filesystem may correspond to some other page M of the lower
filesystem.  For example, this is the case for fan-out stackable
filesystems (Unionfs, and RAID-like filesystems).  There are dozens of
practical stackable filesystems where we have to double-cache only because
the Linux VFS does not allow us to intercept the page-based operations *and*
avoid double caching.  I would like to point out here that neither *BSD
nor Windows stackable filesystems have this problem.  To solve the
problem, the VFS should allow stackable filesystems to 1) do something
(calculate checksums, calculate parity for filesystem-level RAIDs, etc.)
inside of a stackable filesystem's readpage/writepage/..., and 2) call the
lower filesystem's readpage/writepage/..., passing *any* page to these lower
functions.  The page passed below may be a page of the lower filesystem
to get double caching, or an upper page to get no double caching.  Here
is a kludge that works in many cases:

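int stackable_readpage(file_t *file, page_t *page)
{
 ...
 /* Temporarily make the upper page look like a lower-filesystem page. */
 page->mapping = lower_inode->i_mapping;
 err = lower_inode->i_mapping->a_ops->readpage(lower_file, page);
 /* Restore the upper mapping so the page stays in the upper cache. */
 page->mapping = inode->i_mapping;
 ...
}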

All structures with the 'lower_' prefix belong to the lower filesystem.
It doesn't seem to be exactly the right way to go, but it provides the same
flexibility for Linux stackable filesystems that they enjoy in *BSD and
Windows.  A correct implementation would require the VFS to isolate the
page structure from the file/dentry/inode to some degree.  However, it
would be sufficient if we could make the code above work in all cases
over all existing filesystems.
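
To make the kludge above easier to experiment with, here is a user-space
model of it. This is only a sketch: the structures are minimal stand-ins
for the kernel ones, MODEL_PAGE_SIZE, the backing array, the checksum
field, and model_checksum() are all hypothetical, and the checksum step
merely stands in for whatever processing a stackable filesystem wants to
do in its own readpage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16           /* tiny page size, for illustration only */

struct file;
struct page;

/* Minimal stand-ins for the kernel structures (not the real definitions). */
struct address_space_operations {
    int (*readpage)(struct file *, struct page *);
};
struct address_space {
    const struct address_space_operations *a_ops;
};
struct inode {
    struct address_space *i_mapping;
};
struct file {
    struct inode *f_inode;
    struct file *lower_file;         /* hypothetical upper-to-lower link */
    const unsigned char *backing;    /* hypothetical "disk" of the lower file */
};
struct page {
    struct address_space *mapping;
    unsigned long index;
    unsigned char data[MODEL_PAGE_SIZE];
    uint32_t checksum;               /* hypothetical per-page result */
};

/* Lower readpage: refuses pages that do not claim to belong to its mapping. */
int lower_readpage(struct file *lower_file, struct page *page)
{
    if (page->mapping != lower_file->f_inode->i_mapping)
        return -1;
    memcpy(page->data,
           lower_file->backing + page->index * MODEL_PAGE_SIZE,
           MODEL_PAGE_SIZE);
    return 0;
}

/* Toy checksum standing in for the upper filesystem's page processing. */
uint32_t model_checksum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}

/* Upper readpage: lend the upper page to the lower filesystem, then process it. */
int stackable_readpage(struct file *file, struct page *page)
{
    struct inode *inode = file->f_inode;
    struct file *lower_file = file->lower_file;
    struct inode *lower_inode = lower_file->f_inode;
    int err;

    page->mapping = lower_inode->i_mapping;   /* pose as a lower page */
    err = lower_inode->i_mapping->a_ops->readpage(lower_file, page);
    page->mapping = inode->i_mapping;         /* restore, even on error */
    if (err)
        return err;

    page->checksum = model_checksum(page->data, MODEL_PAGE_SIZE);
    return 0;
}
```

The model only shows the control flow: the upper page is filled by the
lower readpage without a second copy, and the upper layer still gets to
run its own code on the data. It says nothing about locking, page flags,
or asynchronous completion, which are exactly where a real kernel
implementation gets difficult.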

Sincerely,
Nikolai Joukov.
**************************************
* Ph.D. student  (Advisor: Erez Zadok)
* File systems and Storage Laboratory
* Stony Brook University (SUNY)
**************************************

