From mboxrd@z Thu Jan  1 00:00:00 1970
From: Nikolai Joukov <kolya@cs.sunysb.edu>
Subject: Re: [RFC] Support for stackable file systems on top of nfs
Date: Sun, 13 Nov 2005 19:44:14 -0500 (EST)
Message-ID: <Pine.GSO.4.53.0511131919530.23544@compserv1>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: Trond Myklebust <trondmy@trondhjem.org>, dhowells@redhat.com,
	nfsv4@linux-nfs.org, Charles Wright <cwright@cs.sunysb.edu>,
	linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from sbcs.cs.sunysb.edu ([130.245.1.15]:4533 "EHLO
	sbcs.cs.sunysb.edu") by vger.kernel.org with ESMTP id S1750807AbVKNAo3
	(ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Sun, 13 Nov 2005 19:44:29 -0500
To: "John T. Kohl" <jtk@us.ibm.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

>> Charles> On Fri, 2005-11-11 at 08:45 -0500, John T. Kohl wrote:
>>> Other than i_mapping/f_mapping, I don't think it's possible right now
>>> for stacking file systems to handle the address_space operations in our
>>> layer *and* share the same pages with the backing-store, since the struct
>>> pages are attached to the address space via file->f_mapping.
> Charles> At Stony Brook, we've come across similar problems. It is relatively
> Charles> easy to double cache, but inefficient.  It is also relatively easy to
> Charles> single-cache, but then you don't get to intercept any of these
> Charles> interesting operations.  Getting both at once is tricky.
>
> We currently do single-caching, by passing on the mmap operation to the
> backing store (swapping in the backing store file for vma->vm_file).
> (We do the equivalent in our MVFS built for vnode kernels.)  Swapping
> the vm file is mostly workable, but we do have to be a bit too
> knowledgeable about the innards of file mapping and do some things to
> accommodate the actions taken after fop->mmap is called.

What we are discussing here is only the tip of the iceberg.  We are
discussing the simplest case:
1) Page N of the upper filesystem's file corresponds to page N of the
lower file.
2) No page processing is necessary.
In that case we use either the technique described above
(http://lxr.fsl.cs.sunysb.edu/fistgen/source/templates/Linux-2.6/file.c#L458)
or CODA's i_mapping/f_mapping way.  Both have their own problems.

However, for most stackable filesystems we need to intercept the
writepage/readpage/prepare_write/commit_write operations.  Also, page N of
the upper filesystem may correspond to some other page M of the lower
filesystem.  For example, this is the case for fan-out stackable
filesystems (Unionfs, and RAID-like filesystems).  There are dozens of
practical stackable filesystems where we have to double-cache only because
the Linux VFS does not allow intercepting the page-based operations *and*
avoiding double caching.  I would like to point out here that neither *BSD
nor Windows stackable filesystems have this problem.  To solve the
problem, VFS should allow stackable filesystems to 1) do something
(calculate checksums, calculate parity for filesystem-level RAIDs, etc.)
inside of a stackable filesystem's readpage/writepage/..., 2) call lower
filesystem's readpage/writepage/... passing *any* page to these lower
functions.  The page passed down may be a page of the lower filesystem
(which results in double caching) or an upper page (which avoids it).
Here is a kludge that works in many cases:

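int stackable_readpage(file_t *file, page_t *page)
{
 ...
 /* Temporarily make the upper page look like a page of the lower inode. */
 page->mapping = lower_inode->i_mapping;
 err = lower_inode->i_mapping->a_ops->readpage(lower_file, page);
 /* Restore the upper mapping so the page stays in the upper cache. */
 page->mapping = inode->i_mapping;
 ...
}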

All structures with the 'lower_' prefix belong to the lower filesystem.
This doesn't seem to be exactly the right way to go, but it provides the
same flexibility for Linux stackable filesystems that they enjoy in *BSD
and Windows.  A correct implementation requires some isolation of the page
structure from the file/dentry/inode added at the VFS level.  However, it
would be sufficient if we could make the code above work in all cases
over all existing filesystems.

Sincerely,
Nikolai Joukov.
**************************************
* Ph.D. student  (Advisor: Erez Zadok)
* File systems and Storage Laboratory
* Stony Brook University (SUNY)
**************************************