From: Nikolai Joukov
Subject: Re: [RFC] Support for stackable file systems on top of nfs
Date: Sun, 13 Nov 2005 19:44:14 -0500 (EST)
To: "John T. Kohl"
Cc: Trond Myklebust, dhowells@redhat.com, nfsv4@linux-nfs.org, Charles Wright, linux-fsdevel@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

>> Charles> On Fri, 2005-11-11 at 08:45 -0500, John T. Kohl wrote:
>>> Other than i_mapping/f_mapping, I don't think it's possible right now
>>> for stacking file systems to handle the address_space operations in our
>>> layer *and* share the same pages with the backing-store, since the struct
>>> pages are attached to the address space via file->f_mapping.
> Charles> At Stony Brook, we've come across similar problems. It is relatively
> Charles> easy to double cache, but inefficient. It is also relatively easy to
> Charles> single-cache, but then you don't get to intercept any of these
> Charles> interesting operations. Getting both at once is tricky.
>
> We currently do single-caching, by passing on the mmap operation to the
> backing store (swapping in the backing store file for vma->vm_file).
> (We do the equivalent in our MVFS built for vnode kernels.) Swapping
> the vm file is mostly workable, but we do have to be a bit too
> knowledgable about the innards of file mapping and do some things to
> accomodate the actions taken after fop->mmap is called.

What we are discussing here is only the tip of the iceberg. We are
discussing the simplest case:
1) Page N of the upper filesystem's file corresponds to page N of the
   lower file.
2) No page processing is necessary.
In that case we use either the technique described above
(http://lxr.fsl.cs.sunysb.edu/fistgen/source/templates/Linux-2.6/file.c#L458)
or CODA's i_mapping/f_mapping way. Both have their own problems.

However, for most stackable filesystems we need to intercept the
writepage/readpage/prepare_write/commit_write operations. Also, page N
of the upper filesystem may correspond to some other page M of the lower
filesystem. For example, this is the case for fan-out stackable
filesystems (Unionfs and RAID-like filesystems). There are dozens of
practical stackable filesystems where we have to double-cache only
because the Linux VFS does not allow us to intercept the page-based
operations *and* avoid double caching. I would like to point out here
that neither *BSD nor Windows stackable filesystems have this problem.

To solve the problem, the VFS should allow stackable filesystems to
1) do something (calculate checksums, calculate parity for
   filesystem-level RAIDs, etc.) inside of a stackable filesystem's
   readpage/writepage/...,
2) call the lower filesystem's readpage/writepage/..., passing *any*
   page to these lower functions.
The page passed below may be a page of the lower filesystem, to get
double caching, or an upper page, to get no double caching.

Here is a kludge that works in many cases:

    int stackable_readpage(file_t *file, page_t *page)
    {
        ...
        page->mapping = lower_inode->i_mapping;
        err = lower_inode->i_mapping->a_ops->readpage(lower_file, page);
        page->mapping = inode->i_mapping;
        ...
    }

All structures with the 'lower_' prefix belong to the lower filesystem.

This does not seem to be exactly the right way to go, but it gives Linux
stackable filesystems the same flexibility that they enjoy in *BSD and
Windows. A correct implementation requires some isolation of the page
structure from the file/dentry/inode, added at the VFS level. However,
it would be sufficient if we could make the code above work in all cases
over all existing filesystems.

Sincerely,
Nikolai Joukov.
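
For illustration only, the mapping swap in the kludge above can be
modeled in plain userspace C. This is a sketch under loud assumptions,
not kernel code: the structures are cut-down stand-ins for the kernel
ones, and DEMO_PAGE_SIZE, the `backing` and `lower_file` fields, and
demo_read_page() are hypothetical names invented for the example. The
lower readpage here rejects any page whose mapping is not its own, which
is why the upper layer borrows the lower mapping for the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16   /* hypothetical tiny page, illustration only */

struct page;
struct file;

/* Cut-down stand-ins for the kernel structures (assumptions). */
struct address_space_operations {
    int (*readpage)(struct file *file, struct page *page);
};
struct address_space {
    const struct address_space_operations *a_ops;
};
struct inode {
    struct address_space *i_mapping;
    const char *backing;        /* hypothetical: lower file contents */
};
struct file {
    struct inode *f_inode;
    struct file *lower_file;    /* hypothetical: set for upper files only */
};
struct page {
    struct address_space *mapping;
    size_t index;
    char data[DEMO_PAGE_SIZE];
};

/* Lower filesystem: only fills pages that belong to its own mapping. */
static int lower_readpage(struct file *file, struct page *page)
{
    struct inode *inode = file->f_inode;

    if (page->mapping != inode->i_mapping)
        return -1;
    if ((page->index + 1) * DEMO_PAGE_SIZE > strlen(inode->backing))
        return -1;              /* page lies past the end of the file */
    memcpy(page->data, inode->backing + page->index * DEMO_PAGE_SIZE,
           DEMO_PAGE_SIZE);
    return 0;
}

/* Upper filesystem: lend the upper page to the lower readpage. */
static int stackable_readpage(struct file *file, struct page *page)
{
    struct inode *inode = file->f_inode;
    struct file *lower_file = file->lower_file;
    struct inode *lower_inode = lower_file->f_inode;
    int err;

    page->mapping = lower_inode->i_mapping;   /* pretend it is a lower page */
    err = lower_inode->i_mapping->a_ops->readpage(lower_file, page);
    page->mapping = inode->i_mapping;         /* hand it back to the upper layer */

    /* Upper-layer processing (checksums, parity, ...) would go here. */
    return err;
}

/* Builds a two-layer stack and reads one page through the upper layer. */
int demo_read_page(size_t index, char *out)
{
    static const struct address_space_operations lower_aops = {
        .readpage = lower_readpage };
    static const struct address_space_operations upper_aops = {
        .readpage = stackable_readpage };
    struct address_space lower_mapping = { .a_ops = &lower_aops };
    struct address_space upper_mapping = { .a_ops = &upper_aops };
    struct inode lower_inode = { .i_mapping = &lower_mapping,
        .backing = "0123456789abcdefFEDCBA9876543210" };
    struct inode upper_inode = { .i_mapping = &upper_mapping, .backing = NULL };
    struct file lower_file = { .f_inode = &lower_inode, .lower_file = NULL };
    struct file upper_file = { .f_inode = &upper_inode,
        .lower_file = &lower_file };
    struct page page = { .mapping = &upper_mapping, .index = index };
    int err = upper_mapping.a_ops->readpage(&upper_file, &page);

    if (err)
        return err;
    if (page.mapping != &upper_mapping)
        return -2;              /* mapping was not restored */
    memcpy(out, page.data, DEMO_PAGE_SIZE);
    return 0;
}
```

Calling demo_read_page(1, buf) fills a single upper page with the second
16-byte page of the lower file, so only one copy of the data exists. The
model ignores locking, page flags, and asynchronous completion, which
are presumably among the reasons the real kludge works only in many
cases rather than all.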
**************************************
* Ph.D. student (Advisor: Erez Zadok)
* File systems and Storage Laboratory
* Stony Brook University (SUNY)
**************************************