linux-fsdevel.vger.kernel.org archive mirror
* Re: [RFC] Support for stackable file systems on top of nfs
@ 2005-11-14  0:44 Nikolai Joukov
  2005-11-14 16:02 ` David Howells
  2005-11-14 16:11 ` John T. Kohl
  0 siblings, 2 replies; 36+ messages in thread
From: Nikolai Joukov @ 2005-11-14  0:44 UTC (permalink / raw)
  To: John T. Kohl
  Cc: Trond Myklebust, dhowells, nfsv4, Charles Wright, linux-fsdevel

>> Charles> On Fri, 2005-11-11 at 08:45 -0500, John T. Kohl wrote:
>>> Other than i_mapping/f_mapping, I don't think it's possible right now
>>> for stacking file systems to handle the address_space operations in our
>>> layer *and* share the same pages with the backing-store, since the struct
>>> pages are attached to the address space via file->f_mapping.
> Charles> At Stony Brook, we've come across similar problems. It is relatively
> Charles> easy to double cache, but inefficient.  It is also relatively easy to
> Charles> single-cache, but then you don't get to intercept any of these
> Charles> interesting operations.  Getting both at once is tricky.
>
> We currently do single-caching, by passing on the mmap operation to the
> backing store (swapping in the backing store file for vma->vm_file).
> (We do the equivalent in our MVFS built for vnode kernels.)  Swapping
> the vm file is mostly workable, but we do have to be a bit too
> knowledgeable about the innards of file mapping and do some things to
> accommodate the actions taken after fop->mmap is called.

What we are discussing here is only the tip of the iceberg.  We are
discussing the simplest case:
1) Page N of the upper filesystem's file corresponds to page N of the
lower file.
2) No page processing is necessary.
In that case we use either the technique described above
(http://lxr.fsl.cs.sunysb.edu/fistgen/source/templates/Linux-2.6/file.c#L458)
or CODA's i_mapping/f_mapping way.  Both have their own problems.
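To make the first technique concrete, here is a minimal userspace sketch of
swapping the backing-store file into the mapping.  It is not the fistgen or
MVFS code: the structures are stripped-down stand-ins for the kernel ones,
the `lower` link and the `refcount` field are hypothetical, and the reference
handling only models the idea that the mapping should pin the lower file
instead of the upper one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for kernel structures; only the fields used here. */
struct file;
struct vm_area { struct file *vm_file; };
struct file_ops { int (*mmap)(struct file *, struct vm_area *); };
struct file {
    const struct file_ops *f_op;
    struct file *lower;  /* hypothetical link from upper file to backing file */
    int refcount;        /* hypothetical stand-in for the file reference count */
};

/* Lower filesystem's mmap: nothing to do in this model. */
static int lower_mmap(struct file *file, struct vm_area *vma)
{
    (void)file;
    (void)vma;
    return 0;
}
static const struct file_ops lower_fops = { .mmap = lower_mmap };

/*
 * Upper-layer mmap: hand the mapping over to the backing store so that
 * only the lower file's pages are ever cached (single caching).
 */
static int stackable_mmap(struct file *file, struct vm_area *vma)
{
    struct file *lower_file = file->lower;
    int err;

    if (lower_file == NULL || lower_file->f_op->mmap == NULL)
        return -ENODEV;

    assert(vma->vm_file == file);
    lower_file->refcount++;          /* the mapping now pins the lower file */
    vma->vm_file = lower_file;
    err = lower_file->f_op->mmap(lower_file, vma);
    if (err) {
        vma->vm_file = file;         /* undo the swap on failure */
        lower_file->refcount--;
        return err;
    }
    file->refcount--;                /* the mapping no longer pins the upper file */
    return 0;
}
static const struct file_ops upper_fops = { .mmap = stackable_mmap };
```

After a successful call the mapping refers only to the lower file, so the
upper layer never sees the page-level operations on it, which is the
limitation discussed above.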

However, for most stackable filesystems we need to intercept the
writepage/readpage/prepare_write/commit_write operations.  Also, page N of
the upper filesystem may correspond to some other page M of the lower
filesystem.  For example, this is the case for fan-out stackable
filesystems (Unionfs, and RAID-like filesystems).  There are dozens of
practical stackable filesystems where we have to double-cache only because
the Linux VFS does not allow intercepting the page-based operations *and*
avoiding double caching.  I would like to point out here that neither *BSD
nor Windows stackable filesystems have this problem.  To solve the
problem, VFS should allow stackable filesystems to 1) do something
(calculate checksums, calculate parity for filesystem-level RAIDs, etc.)
inside of a stackable filesystem's readpage/writepage/..., 2) call lower
filesystem's readpage/writepage/... passing *any* page to these lower
functions.  The page passed below may be a page of the lower filesystem
to get double caching, or an upper page to get no double caching.  Here
is a kludge that works in many cases:

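int stackable_readpage(file_t *file, page_t *page)
{
 ...
 /* Temporarily present the upper page as a page of the lower mapping. */
 page->mapping = lower_inode->i_mapping;
 err = lower_inode->i_mapping->a_ops->readpage(lower_file, page);
 /* Restore the upper mapping whether or not the lower read failed. */
 page->mapping = inode->i_mapping;
 ...
}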

All structures with the 'lower_' prefix belong to the lower filesystem.
This does not seem to be exactly the right way to go, but it provides the same
flexibility for Linux stackable filesystems that they enjoy in *BSD and
Windows.  A correct implementation requires some isolation of the page
structure from the file/dentry/inode added at the VFS level.  However, it
would be sufficient if we can make the code above work in all the cases
over all existing filesystems.
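The mapping swap can be exercised outside the kernel with a small model.
The sketch below is an assumption-laden illustration, not kernel code: the
structures are minimal stand-ins, `MODEL_PAGE_SIZE`, the extra `lower_file`
and `checksum` parameters, and the byte-sum "checksum" are all hypothetical,
and the lower readpage is synchronous, so the model says nothing about I/O
that completes after readpage returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16   /* tiny page, for the model only */

struct file;
struct page;
struct inode;
struct a_ops { int (*readpage)(struct file *, struct page *); };
struct address_space { struct inode *host; const struct a_ops *a_ops; };
struct inode { struct address_space *i_mapping; struct address_space i_data; };
struct file { struct inode *inode; };
struct page {
    struct address_space *mapping;
    unsigned long index;
    char data[MODEL_PAGE_SIZE];
    int uptodate;
};

/* Lower filesystem: refuses pages that do not belong to its own mapping. */
static int lower_readpage(struct file *file, struct page *page)
{
    if (page->mapping->host != file->inode)
        return -EINVAL;
    memset(page->data, 'A' + (int)(page->index % 26), MODEL_PAGE_SIZE);
    page->uptodate = 1;
    return 0;
}
static const struct a_ops lower_aops = { lower_readpage };

/* Upper filesystem: reads straight into its own page, then post-processes. */
static int stackable_readpage(struct file *file, struct file *lower_file,
                              struct page *page, unsigned *checksum)
{
    struct address_space *upper_mapping = file->inode->i_mapping;
    struct address_space *lower_mapping = lower_file->inode->i_mapping;
    int err, i;

    assert(page->mapping == upper_mapping);
    page->mapping = lower_mapping;       /* disguise the upper page */
    err = lower_mapping->a_ops->readpage(lower_file, page);
    page->mapping = upper_mapping;       /* restore even if the read failed */
    if (err)
        return err;

    /* Step 1 of the proposal: do something with the data in the upper layer. */
    *checksum = 0;
    for (i = 0; i < MODEL_PAGE_SIZE; i++)
        *checksum += (unsigned char)page->data[i];
    return 0;
}
```

Only one copy of the data exists (the upper page), yet the upper layer still
gets to run its own code around the lower read.  Calling `lower_readpage`
directly on the upper page without the swap fails in this model, which
mirrors why the disguise is needed at all.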

Sincerely,
Nikolai Joukov.
**************************************
* Ph.D. student  (Advisor: Erez Zadok)
* File systems and Storage Laboratory
* Stony Brook University (SUNY)
**************************************


* [RFC] Support for stackable file systems on top of nfs
@ 2005-11-10 17:32 Dave Kleikamp
  2005-11-10 20:07 ` Christoph Hellwig
  2005-11-10 21:24 ` Trond Myklebust
  0 siblings, 2 replies; 36+ messages in thread
From: Dave Kleikamp @ 2005-11-10 17:32 UTC (permalink / raw)
  To: nfsv4, fsdevel

The following patch allows stackable file systems, such as ClearCase's
mvfs, to run atop nfs.  mvfs has its own file and inode structures, but
points its inode->i_mapping to the lower file system's mapping.  This
causes problems when nfs's address space operations try to extract the
open context from file->private_data.

The patch adds a small overhead of checking the file structure to see if
it contains an inode that is not the mapping's host.
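As a rough illustration of that check, the following userspace sketch
models the relationship between a file, its inode, and the mapping's host.
The structures are minimal stand-ins for the kernel ones, and `inode_init`
and `model_is_valid_file` are hypothetical names; the latter mirrors the
logic of the nfs_is_valid_file() helper added by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures involved in the check. */
struct inode;
struct address_space { struct inode *host; };
struct inode { struct address_space *i_mapping; struct address_space i_data; };
struct dentry { struct inode *d_inode; };
struct file { struct dentry *f_dentry; void *private_data; };

/* An ordinary inode is the host of its own mapping. */
static void inode_init(struct inode *inode)
{
    inode->i_data.host = inode;
    inode->i_mapping = &inode->i_data;
}

/*
 * Returns 1 only if the file's inode hosts the mapping it points to.
 * A stacked inode that borrows the lower mapping fails the test, so its
 * file's private_data must not be interpreted by the lower filesystem.
 */
static int model_is_valid_file(const struct file *file)
{
    const struct inode *inode;

    if (file == NULL)
        return 0;
    inode = file->f_dentry->d_inode;
    return inode == inode->i_mapping->host;
}
```

In this model, pointing a stacked inode's `i_mapping` at the lower inode's
mapping is enough to make the check return 0 for the stacked file while it
still returns 1 for the lower file.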

I am curious if there are any other stackable file systems that could
benefit from this.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>

diff -Nurp linux-2.6.14-git/fs/nfs/direct.c linux/fs/nfs/direct.c
--- linux-2.6.14-git/fs/nfs/direct.c	2005-11-07 07:53:49.000000000 -0600
+++ linux/fs/nfs/direct.c	2005-11-09 14:58:59.000000000 -0600
@@ -604,7 +604,19 @@ nfs_direct_IO(int rw, struct kiocb *iocb
 	if (!is_sync_kiocb(iocb))
 		return result;
 
-	ctx = (struct nfs_open_context *)file->private_data;
+	if (nfs_is_valid_file(file))
+		ctx = get_nfs_open_context((struct nfs_open_context *)
+				file->private_data);
+	else {
+		/* file belongs to a stackable file system.
+		 * Can't trust the inode either */
+		inode = inode->i_mapping->host;
+
+		ctx = nfs_find_open_context(inode, NULL,
+				(rw == READ) ? FMODE_READ : FMODE_WRITE);
+		if (ctx == NULL)
+			return -EBADF;
+	}
 	switch (rw) {
 	case READ:
 		dprintk("NFS: direct_IO(read) (%s) off/no(%Lu/%lu)\n",
@@ -623,6 +635,7 @@ nfs_direct_IO(int rw, struct kiocb *iocb
 	default:
 		break;
 	}
+	put_nfs_open_context(ctx);
 	return result;
 }
 
diff -Nurp linux-2.6.14-git/fs/nfs/read.c linux/fs/nfs/read.c
--- linux-2.6.14-git/fs/nfs/read.c	2005-11-07 07:53:49.000000000 -0600
+++ linux/fs/nfs/read.c	2005-11-09 11:47:05.000000000 -0600
@@ -506,7 +506,7 @@ int nfs_readpage(struct file *file, stru
 	if (error)
 		goto out_error;
 
-	if (file == NULL) {
+	if (!nfs_is_valid_file(file)) {
 		ctx = nfs_find_open_context(inode, NULL, FMODE_READ);
 		if (ctx == NULL)
 			return -EBADF;
@@ -575,7 +575,7 @@ int nfs_readpages(struct file *filp, str
 			(long long)NFS_FILEID(inode),
 			nr_pages);
 
-	if (filp == NULL) {
+	if (!nfs_is_valid_file(filp)) {
 		desc.ctx = nfs_find_open_context(inode, NULL, FMODE_READ);
 		if (desc.ctx == NULL)
 			return -EBADF;
diff -Nurp linux-2.6.14-git/fs/nfs/write.c linux/fs/nfs/write.c
--- linux-2.6.14-git/fs/nfs/write.c	2005-11-07 07:53:49.000000000 -0600
+++ linux/fs/nfs/write.c	2005-11-09 14:14:33.000000000 -0600
@@ -703,10 +703,16 @@ static struct nfs_page * nfs_update_requ
 
 int nfs_flush_incompatible(struct file *file, struct page *page)
 {
-	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
+	struct nfs_open_context *ctx;
 	struct inode	*inode = page->mapping->host;
 	struct nfs_page	*req;
 	int		status = 0;
+
+	if (nfs_is_valid_file(file))
+		ctx = (struct nfs_open_context *)file->private_data;
+	else
+		ctx = NULL;
+
 	/*
 	 * Look for a request corresponding to this page. If there
 	 * is one, and it belongs to another file, we flush it out
@@ -733,7 +739,7 @@ int nfs_flush_incompatible(struct file *
 int nfs_updatepage(struct file *file, struct page *page,
 		unsigned int offset, unsigned int count)
 {
-	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
+	struct nfs_open_context *ctx;
 	struct inode	*inode = page->mapping->host;
 	struct nfs_page	*req;
 	int		status = 0;
@@ -743,14 +749,23 @@ int nfs_updatepage(struct file *file, st
 		file->f_dentry->d_name.name, count,
 		(long long)(page_offset(page) +offset));
 
+	if (nfs_is_valid_file(file))
+		ctx = get_nfs_open_context((struct nfs_open_context *)
+				file->private_data);
+	else {
+		ctx = nfs_find_open_context(inode, NULL, FMODE_WRITE);
+		if (!ctx)
+			return -EBADF;
+	}
+		
 	if (IS_SYNC(inode)) {
 		status = nfs_writepage_sync(ctx, inode, page, offset, count, 0);
 		if (status > 0) {
 			if (offset == 0 && status == PAGE_CACHE_SIZE)
 				SetPageUptodate(page);
-			return 0;
+			status = 0;
 		}
-		return status;
+		goto out;
 	}
 
 	/* If we're not using byte range locks, and we know the page
@@ -803,6 +818,8 @@ done:
 			status, (long long)i_size_read(inode));
 	if (status < 0)
 		ClearPageUptodate(page);
+out:
+	put_nfs_open_context(ctx);
 	return status;
 }
 
diff -Nurp linux-2.6.14-git/include/linux/nfs_fs.h linux/include/linux/nfs_fs.h
--- linux-2.6.14-git/include/linux/nfs_fs.h	2005-11-07 07:53:50.000000000 -0600
+++ linux/include/linux/nfs_fs.h	2005-11-09 11:44:53.000000000 -0600
@@ -350,6 +350,20 @@ static inline struct rpc_cred *nfs_file_
 }
 
 /*
+ * A stackable file system may have its own file & inode structures, which
+ * point to the local inode's mapping.  The address space operations cannot
+ * use the stackable file system's file structure to get to the open context
+ */
+static inline int nfs_is_valid_file(struct file *file)
+{
+	struct inode *inode;
+	if (!file)
+		return 0;
+	inode = file->f_dentry->d_inode;
+	return (inode == inode->i_mapping->host);
+}
+
+/*
  * linux/fs/nfs/xattr.c
  */
 #ifdef CONFIG_NFS_V3_ACL

-- 
David Kleikamp
IBM Linux Technology Center
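
The reference handling that the nfs_updatepage() hunk introduces (take a
reference on the open context up front, then drop it at a single exit
label) can be sketched in userspace as follows.  This is a simplified
model under stated assumptions: `struct open_context`, `ctx_get`, `ctx_put`
and `model_updatepage` are hypothetical names, `inode_ctx` stands in for
whatever nfs_find_open_context() would return, and the asynchronous path
is elided.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct open_context { int refcount; };

/* Take a reference; NULL passes through so lookups can fail. */
static struct open_context *ctx_get(struct open_context *ctx)
{
    if (ctx != NULL)
        ctx->refcount++;
    return ctx;
}

/* Drop a reference taken by ctx_get(). */
static void ctx_put(struct open_context *ctx)
{
    assert(ctx->refcount > 0);
    ctx->refcount--;
}

/*
 * file_ctx:     context from file->private_data, or NULL when the file
 *               belongs to a stackable filesystem.
 * inode_ctx:    context found via the inode, or NULL if none is open.
 * write_result: hypothetical result of the synchronous write.
 */
static int model_updatepage(struct open_context *file_ctx,
                            struct open_context *inode_ctx,
                            int sync, int write_result)
{
    struct open_context *ctx;
    int status = 0;

    if (file_ctx != NULL) {
        ctx = ctx_get(file_ctx);
    } else {
        ctx = ctx_get(inode_ctx);
        if (ctx == NULL)
            return -EBADF;      /* nothing acquired, nothing to release */
    }

    if (sync) {
        status = write_result;
        if (status > 0)
            status = 0;         /* bytes written counts as success */
        goto out;
    }
    /* The asynchronous path is elided in this model. */
out:
    ctx_put(ctx);               /* every path that took a reference ends here */
    return status;
}
```

Replacing the early returns with `goto out` is what keeps the reference
count balanced on both the success and the error path.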



Thread overview: 36+ messages
2005-11-14  0:44 [RFC] Support for stackable file systems on top of nfs Nikolai Joukov
2005-11-14 16:02 ` David Howells
2005-11-14 20:48   ` Erez Zadok
2005-11-14 21:13     ` John T. Kohl
2005-11-14 21:32       ` Jamie Lokier
2005-11-14 16:11 ` John T. Kohl
  -- strict thread matches above, loose matches on Subject: below --
2005-11-10 17:32 Dave Kleikamp
2005-11-10 20:07 ` Christoph Hellwig
2005-11-10 21:35   ` John T. Kohl
2005-11-10 21:40     ` Shaya Potter
2005-11-10 21:57       ` John T. Kohl
2005-11-10 21:50     ` Christoph Hellwig
2005-11-11  2:31     ` Trond Myklebust
2005-11-11  4:04       ` Trond Myklebust
2005-11-11 13:45         ` John T. Kohl
2005-11-11 15:27           ` Charles P. Wright
2005-11-11 17:38             ` John T. Kohl
2005-11-14 15:56     ` David Howells
2005-11-10 21:24 ` Trond Myklebust
2005-11-10 21:36   ` Shaya Potter
2005-11-10 22:18     ` Trond Myklebust
2005-11-10 22:27       ` Shaya Potter
2005-11-10 22:40         ` Trond Myklebust
2005-11-11  0:12           ` Bryan Henderson
2005-11-11  1:30             ` Brad Boyer
2005-11-11  2:06             ` Trond Myklebust
2005-11-11 18:18               ` Bryan Henderson
2005-11-11 19:22                 ` Trond Myklebust
2005-11-11 21:57                   ` Bryan Henderson
2005-11-11 22:41                     ` Trond Myklebust
2005-11-14 19:02                       ` Bryan Henderson
2005-11-11 16:40             ` Nikita Danilov
2005-11-11 18:45               ` Bryan Henderson
2005-11-11 19:31                 ` Nikita Danilov
2005-11-11 19:42                   ` Trond Myklebust
2005-11-11 23:13                   ` Bryan Henderson
