* NFS Patch for FSCache
@ 2005-05-09 10:31 Steve Dickson
2005-05-09 21:19 ` Andrew Morton
2005-06-13 12:52 ` Steve Dickson
0 siblings, 2 replies; 14+ messages in thread
From: Steve Dickson @ 2005-05-09 10:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-cachefs
[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]
Hello,
Attached is a patch that enables NFS to use David Howells'
File System Caching implementation (FSCache). Also attached
are two supplemental patches that are needed to fix two oopses that
were found during debugging (Note: these patches are also
in people.redhat.com/steved/cachefs/2.6.12-rc3-mm3/)
2.6.12-rc3-mm3-nfs-fscache.patch - David and I have
been working on this for some time now and the code
seems to be pretty solid. One issue is Trond's dislike
of how the NFS code is dependent on FSCache calls. I did
look into changing this, but it's not clear (at least to
me) how we could make things better... But that's
something that will need to be addressed.
The second issue is what we've been calling "NFS aliasing".
The fact that two mounted NFS super blocks can point to
the same page causes major fits for FSC. David has
proposed some patches to resolve this issue that are still
under review. But at this point, to stop a BUG() from popping
when a second NFS filesystem is mounted, the
2.6.12-rc3-mm3-fscache-cookie-exist.patch is needed.
The final patch, 2.6.12-rc3-mm3-cachefs-wb.patch, is needed
to stop another BUG() from popping during NFS reads.
NFS uses FSC on a per-mount basis, which means a new
mount flag, 'fsc', is needed to activate the caching.
Example:
mount -t nfs4 -o fsc server:/home /mnt/server/home
(Note: people.redhat.com/steved/cachefs/util-linux/ has the
util-linux binary and source rpms with the fsc support).
To set up a mounted cachefs partition, first initialize
the disk partition by:
echo "cachefs___" >/dev/hdg9
then mount the partition:
mount -t cachefs /dev/hdg9 /cache-hdg9
See Documentation/filesystems/caching in the kernel
source for more details.
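For reference, a persistent setup might look like the following
/etc/fstab entries (illustrative only; they assume the fsc-capable
mount binary from the util-linux packages mentioned above):

/dev/hdg9         /cache-hdg9        cachefs   defaults   0 0
server:/home      /mnt/server/home   nfs4      fsc        0 0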
I'm hopeful that you'll add these patches to your tree
so they will get some much-needed testing. I'm also going
to be pushing to get the caching code into a Fedora Core
kernel, but due to the dependency on David's new vm_ops
entry, page_mkwrite, this might take some time...
(Note: people.redhat.com/steved/cachefs/mmpatches has all of the
current mm patches)
Comments?
steved.
[-- Attachment #2: 2.6.12-rc3-mm3-nfs-fscache.patch --]
[-- Type: text/x-patch, Size: 26573 bytes --]
This patch enables NFS to use file system caching (i.e. FSCache).
To turn this feature on you must specify the -o fsc mount flag
as well as have a cachefs partition mounted.
Signed-off-by: Steve Dickson <steved@redhat.com>
--- 2.6.12-rc2-mm3/fs/nfs/file.c.orig 2005-04-23 10:13:24.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/nfs/file.c 2005-04-23 11:25:47.000000000 -0400
@@ -27,9 +27,11 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <linux/buffer_head.h>
#include <asm/uaccess.h>
#include <asm/system.h>
+#include "nfs-fscache.h"
#include "delegation.h"
@@ -194,6 +196,12 @@ nfs_file_sendfile(struct file *filp, lof
return res;
}
+static int nfs_file_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ wait_on_page_fs_misc(page);
+ return 0;
+}
+
static int
nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
{
@@ -207,6 +215,10 @@ nfs_file_mmap(struct file * file, struct
status = nfs_revalidate_inode(NFS_SERVER(inode), inode);
if (!status)
status = generic_file_mmap(file, vma);
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ vma->vm_ops->page_mkwrite = nfs_file_page_mkwrite;
+
return status;
}
@@ -258,6 +270,11 @@ static int nfs_commit_write(struct file
return status;
}
+/*
+ * since we use page->private for our own nefarious purposes when using fscache, we have to
+ * override extra address space ops to prevent fs/buffer.c from getting confused, even though we
+ * may not have asked its opinion
+ */
struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -269,6 +286,11 @@ struct address_space_operations nfs_file
#ifdef CONFIG_NFS_DIRECTIO
.direct_IO = nfs_direct_IO,
#endif
+#ifdef CONFIG_NFS_FSCACHE
+ .sync_page = block_sync_page,
+ .releasepage = nfs_releasepage,
+ .invalidatepage = nfs_invalidatepage,
+#endif
};
/*
--- 2.6.12-rc2-mm3/fs/nfs/inode.c.orig 2005-04-23 10:13:24.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/nfs/inode.c 2005-04-23 17:51:57.000000000 -0400
@@ -42,6 +42,8 @@
#include "nfs4_fs.h"
#include "delegation.h"
+#include "nfs-fscache.h"
+
#define NFSDBG_FACILITY NFSDBG_VFS
#define NFS_PARANOIA 1
@@ -169,6 +171,10 @@ nfs_clear_inode(struct inode *inode)
cred = nfsi->cache_access.cred;
if (cred)
put_rpccred(cred);
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_clear_fscookie(nfsi);
+
BUG_ON(atomic_read(&nfsi->data_updates) != 0);
}
@@ -503,6 +509,9 @@ nfs_fill_super(struct super_block *sb, s
server->namelen = NFS2_MAXNAMLEN;
}
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_fill_fscookie(sb);
+
sb->s_op = &nfs_sops;
return nfs_sb_init(sb, authflavor);
}
@@ -579,6 +588,7 @@ static int nfs_show_options(struct seq_f
{ NFS_MOUNT_NOAC, ",noac", "" },
{ NFS_MOUNT_NONLM, ",nolock", ",lock" },
{ NFS_MOUNT_NOACL, ",noacl", "" },
+ { NFS_MOUNT_FSCACHE, ",fscache", "" },
{ 0, NULL, NULL }
};
struct proc_nfs_info *nfs_infop;
@@ -623,6 +633,9 @@ nfs_zap_caches(struct inode *inode)
nfsi->flags |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
else
nfsi->flags |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_zap_fscookie(nfsi);
}
static void nfs_zap_acl_cache(struct inode *inode)
@@ -770,6 +783,9 @@ nfs_fhget(struct super_block *sb, struct
memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
nfsi->cache_access.cred = NULL;
+ if (NFS_SB(sb)->flags & NFS_MOUNT_FSCACHE)
+ nfs_fhget_fscookie(sb, nfsi);
+
unlock_new_inode(inode);
} else
nfs_refresh_inode(inode, fattr);
@@ -1076,6 +1092,9 @@ __nfs_revalidate_inode(struct nfs_server
(long long)NFS_FILEID(inode));
/* This ensures we revalidate dentries */
nfsi->cache_change_attribute++;
+
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_renew_fscookie(server, nfsi);
}
if (flags & NFS_INO_INVALID_ACL)
nfs_zap_acl_cache(inode);
@@ -1515,6 +1534,14 @@ static struct super_block *nfs_get_sb(st
goto out_err;
}
+#ifndef CONFIG_NFS_FSCACHE
+ if (data->flags & NFS_MOUNT_FSCACHE) {
+ printk(KERN_WARNING "NFS: kernel not compiled with CONFIG_NFS_FSCACHE\n");
+ kfree(server);
+ return ERR_PTR(-EINVAL);
+ }
+#endif
+
s = sget(fs_type, nfs_compare_super, nfs_set_super, server);
if (IS_ERR(s) || s->s_root)
goto out_rpciod_down;
@@ -1542,6 +1569,9 @@ static void nfs_kill_super(struct super_
kill_anon_super(s);
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_kill_fscookie(server);
+
if (server->client != NULL && !IS_ERR(server->client))
rpc_shutdown_client(server->client);
if (server->client_sys != NULL && !IS_ERR(server->client_sys))
@@ -1760,6 +1790,9 @@ static int nfs4_fill_super(struct super_
sb->s_time_gran = 1;
+ if (server->flags & NFS4_MOUNT_FSCACHE)
+ nfs4_fill_fscookie(sb);
+
sb->s_op = &nfs4_sops;
err = nfs_sb_init(sb, authflavour);
if (err == 0)
@@ -1903,6 +1936,9 @@ static void nfs4_kill_super(struct super
nfs_return_all_delegations(sb);
kill_anon_super(sb);
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_kill_fscookie(server);
+
nfs4_renewd_prepare_shutdown(server);
if (server->client != NULL && !IS_ERR(server->client))
@@ -2021,6 +2057,11 @@ static int __init init_nfs_fs(void)
{
int err;
+ /* we want to be able to cache */
+ err = nfs_register_netfs();
+ if (err < 0)
+ goto out5;
+
err = nfs_init_nfspagecache();
if (err)
goto out4;
@@ -2068,6 +2109,9 @@ out2:
out3:
nfs_destroy_nfspagecache();
out4:
+ nfs_unregister_netfs();
+out5:
+
return err;
}
@@ -2080,6 +2124,7 @@ static void __exit exit_nfs_fs(void)
nfs_destroy_readpagecache();
nfs_destroy_inodecache();
nfs_destroy_nfspagecache();
+ nfs_unregister_netfs();
#ifdef CONFIG_PROC_FS
rpc_proc_unregister("nfs");
#endif
--- 2.6.12-rc2-mm3/fs/nfs/Makefile.orig 2005-04-23 10:13:24.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/nfs/Makefile 2005-04-23 11:25:47.000000000 -0400
@@ -13,4 +13,5 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4x
delegation.o idmap.o \
callback.o callback_xdr.o callback_proc.o
nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
+nfs-$(CONFIG_NFS_FSCACHE) += nfs-fscache.o
nfs-objs := $(nfs-y)
--- /dev/null 2005-03-28 14:47:10.233040208 -0500
+++ 2.6.12-rc2-mm3/fs/nfs/nfs-fscache.c 2005-04-23 15:14:02.000000000 -0400
@@ -0,0 +1,191 @@
+/* nfs-fscache.c: NFS filesystem cache interface
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+
+#include <linux/config.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+
+#include "nfs-fscache.h"
+
+#define NFS_CACHE_FH_INDEX_SIZE sizeof(struct nfs_fh)
+
+/*
+ * the root index is
+ */
+static struct fscache_page *nfs_cache_get_page_token(struct page *page);
+
+static struct fscache_netfs_operations nfs_cache_ops = {
+ .get_page_token = nfs_cache_get_page_token,
+};
+
+struct fscache_netfs nfs_cache_netfs = {
+ .name = "nfs",
+ .version = 0,
+ .ops = &nfs_cache_ops,
+};
+
+/*
+ * the root index for the filesystem is defined by nfsd IP address and ports
+ */
+static fscache_match_val_t nfs_cache_server_match(void *target,
+ const void *entry);
+static void nfs_cache_server_update(void *source, void *entry);
+
+struct fscache_index_def nfs_cache_server_index_def = {
+ .name = "servers",
+ .data_size = 18,
+ .keys[0] = { FSCACHE_INDEX_KEYS_IPV6ADDR, 16 },
+ .keys[1] = { FSCACHE_INDEX_KEYS_BIN, 2 },
+ .match = nfs_cache_server_match,
+ .update = nfs_cache_server_update,
+};
+
+/*
+ * the primary index for each server is simply made up of a series of NFS file
+ * handles
+ */
+static fscache_match_val_t nfs_cache_fh_match(void *target, const void *entry);
+static void nfs_cache_fh_update(void *source, void *entry);
+
+struct fscache_index_def nfs_cache_fh_index_def = {
+ .name = "fh",
+ .data_size = NFS_CACHE_FH_INDEX_SIZE,
+ .keys[0] = { FSCACHE_INDEX_KEYS_BIN_SZ2,
+ sizeof(struct nfs_fh) },
+ .match = nfs_cache_fh_match,
+ .update = nfs_cache_fh_update,
+};
+
+/*
+ * get a page token for the specified page
+ * - the token will be attached to page->private and PG_private will be set on
+ * the page
+ */
+static struct fscache_page *nfs_cache_get_page_token(struct page *page)
+{
+ return fscache_page_get_private(page, GFP_NOIO);
+}
+
+static const uint8_t nfs_cache_ipv6_wrapper_for_ipv4[12] = {
+ [0 ... 9] = 0x00,
+ [10 ... 11] = 0xff
+};
+
+/*
+ * match a server record obtained from the cache
+ */
+static fscache_match_val_t nfs_cache_server_match(void *target,
+ const void *entry)
+{
+ struct nfs_server *server = target;
+ const uint8_t *data = entry;
+
+ switch (server->addr.sin_family) {
+ case AF_INET:
+ if (memcmp(data + 0,
+ &nfs_cache_ipv6_wrapper_for_ipv4,
+ 12) != 0)
+ break;
+
+ if (memcmp(data + 12, &server->addr.sin_addr, 4) != 0)
+ break;
+
+ if (memcmp(data + 16, &server->addr.sin_port, 2) != 0)
+ break;
+
+ return FSCACHE_MATCH_SUCCESS;
+
+ case AF_INET6:
+ if (memcmp(data + 0, &server->addr.sin_addr, 16) != 0)
+ break;
+
+ if (memcmp(data + 16, &server->addr.sin_port, 2) != 0)
+ break;
+
+ return FSCACHE_MATCH_SUCCESS;
+
+ default:
+ break;
+ }
+
+ return FSCACHE_MATCH_FAILED;
+}
+
+/*
+ * update a server record in the cache
+ */
+static void nfs_cache_server_update(void *source, void *entry)
+{
+ struct nfs_server *server = source;
+ uint8_t *data = entry;
+
+ switch (server->addr.sin_family) {
+ case AF_INET:
+ memcpy(data + 0, &nfs_cache_ipv6_wrapper_for_ipv4, 12);
+ memcpy(data + 12, &server->addr.sin_addr, 4);
+ memcpy(data + 16, &server->addr.sin_port, 2);
+ return;
+
+ case AF_INET6:
+ memcpy(data + 0, &server->addr.sin_addr, 16);
+ memcpy(data + 16, &server->addr.sin_port, 2);
+ return;
+
+ default:
+ return;
+ }
+}
+
+/*
+ * match a file handle record obtained from the cache
+ */
+static fscache_match_val_t nfs_cache_fh_match(void *target, const void *entry)
+{
+ struct nfs_inode *nfsi = target;
+ const uint8_t *data = entry;
+ uint16_t nsize;
+
+ /* check the file handle matches */
+ memcpy(&nsize, data, 2);
+ nsize = ntohs(nsize);
+
+ if (nsize <= NFS_CACHE_FH_INDEX_SIZE && nfsi->fh.size == nsize) {
+ if (memcmp(data + 2, nfsi->fh.data, nsize) == 0) {
+ return FSCACHE_MATCH_SUCCESS;
+ }
+ }
+
+ return FSCACHE_MATCH_FAILED;
+}
+
+/*
+ * update a fh record in the cache
+ */
+static void nfs_cache_fh_update(void *source, void *entry)
+{
+ struct nfs_inode *nfsi = source;
+ uint16_t nsize;
+ uint8_t *data = entry;
+
+ BUG_ON(nfsi->fh.size > NFS_CACHE_FH_INDEX_SIZE - 2);
+
+ /* set the file handle */
+ nsize = htons(nfsi->fh.size);
+ memcpy(data, &nsize, 2);
+ memcpy(data + 2, &nfsi->fh.data, nfsi->fh.size);
+ memset(data + 2 + nfsi->fh.size,
+ FSCACHE_INDEX_DEADFILL_PATTERN,
+ NFS_CACHE_FH_INDEX_SIZE - 2 - nfsi->fh.size);
+}
--- /dev/null 2005-03-28 14:47:10.233040208 -0500
+++ 2.6.12-rc2-mm3/fs/nfs/nfs-fscache.h 2005-04-23 17:51:15.000000000 -0400
@@ -0,0 +1,158 @@
+/* nfs-fscache.h: NFS filesystem cache interface definitions
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NFS_FSCACHE_H
+#define _NFS_FSCACHE_H
+
+#include <linux/nfs_mount.h>
+#include <linux/nfs4_mount.h>
+#include <linux/fscache.h>
+
+#ifdef CONFIG_NFS_FSCACHE
+#ifndef CONFIG_FSCACHE
+#error "CONFIG_NFS_FSCACHE is defined but not CONFIG_FSCACHE"
+#endif
+
+extern struct fscache_netfs nfs_cache_netfs;
+extern struct fscache_index_def nfs_cache_server_index_def;
+extern struct fscache_index_def nfs_cache_fh_index_def;
+
+extern int nfs_invalidatepage(struct page *, unsigned long);
+extern int nfs_releasepage(struct page *, int);
+extern int nfs_mkwrite(struct page *);
+
+static inline void
+nfs_renew_fscookie(struct nfs_server *server, struct nfs_inode *nfsi)
+{
+ struct fscache_cookie *old = nfsi->fscache;
+
+ /* retire the current fscache cache and get a new one */
+ fscache_relinquish_cookie(nfsi->fscache, 1);
+ nfsi->fscache = fscache_acquire_cookie(server->fscache, NULL, nfsi);
+
+ dfprintk(FSCACHE,
+ "NFS: revalidation new cookie (0x%p/0x%p/0x%p/0x%p)\n",
+ server, nfsi, old, nfsi->fscache);
+
+ return;
+}
+static inline void
+nfs4_fill_fscookie(struct super_block *sb)
+{
+ struct nfs_server *server = NFS_SB(sb);
+
+ /* create a cache index for looking up filehandles */
+ server->fscache = fscache_acquire_cookie(nfs_cache_netfs.primary_index,
+ &nfs_cache_fh_index_def, server);
+ if (server->fscache == NULL) {
+ printk(KERN_WARNING "NFS4: No Fscache cookie. Turning Fscache off!\n");
+ } else /* reuse the NFS mount option */
+ server->flags |= NFS_MOUNT_FSCACHE;
+
+ dfprintk(FSCACHE,"NFS: nfs4 cookie (0x%p,0x%p/0x%p)\n",
+ sb, server, server->fscache);
+
+ return;
+}
+static inline void
+nfs_fill_fscookie(struct super_block *sb)
+{
+ struct nfs_server *server = NFS_SB(sb);
+
+ /* create a cache index for looking up filehandles */
+ server->fscache = fscache_acquire_cookie(nfs_cache_netfs.primary_index,
+ &nfs_cache_fh_index_def, server);
+ if (server->fscache == NULL) {
+ server->flags &= ~NFS_MOUNT_FSCACHE;
+ printk(KERN_WARNING "NFS: No Fscache cookie. Turning Fscache off!\n");
+ }
+ dfprintk(FSCACHE,"NFS: cookie (0x%p/0x%p/0x%p)\n",
+ sb, server, server->fscache);
+
+ return;
+}
+static inline void
+nfs_fhget_fscookie(struct super_block *sb, struct nfs_inode *nfsi)
+{
+ struct nfs_server *server = NFS_SB(sb);
+
+ nfsi->fscache = fscache_acquire_cookie(server->fscache, NULL, nfsi);
+ if (server->fscache == NULL)
+ printk(KERN_WARNING "NFS: NULL FScache cookie: sb 0x%p nfsi 0x%p\n", sb, nfsi);
+
+ dfprintk(FSCACHE, "NFS: fhget new cookie (0x%p/0x%p/0x%p)\n",
+ sb, nfsi, nfsi->fscache);
+
+ return;
+}
+static inline void
+nfs_kill_fscookie(struct nfs_server *server)
+{
+ dfprintk(FSCACHE,"NFS: killing cookie (0x%p/0x%p)\n",
+ server, server->fscache);
+
+ fscache_relinquish_cookie(server->fscache, 0);
+ server->fscache = NULL;
+
+ return;
+}
+static inline void
+nfs_clear_fscookie(struct nfs_inode *nfsi)
+{
+ dfprintk(FSCACHE, "NFS: clear cookie (0x%p/0x%p)\n",
+ nfsi, nfsi->fscache);
+
+ fscache_relinquish_cookie(nfsi->fscache, 0);
+ nfsi->fscache = NULL;
+
+ return;
+}
+static inline void
+nfs_zap_fscookie(struct nfs_inode *nfsi)
+{
+ dfprintk(FSCACHE,"NFS: zapping cookie (0x%p/0x%p)\n",
+ nfsi, nfsi->fscache);
+
+ fscache_relinquish_cookie(nfsi->fscache, 1);
+ nfsi->fscache = NULL;
+
+ return;
+}
+static inline int
+nfs_register_netfs(void)
+{
+ int err;
+
+ err = fscache_register_netfs(&nfs_cache_netfs, &nfs_cache_server_index_def);
+
+ return err;
+}
+static inline void
+nfs_unregister_netfs(void)
+{
+ fscache_unregister_netfs(&nfs_cache_netfs);
+
+ return;
+}
+#else
+static inline void nfs_fill_fscookie(struct super_block *sb) {}
+static inline void nfs_fhget_fscookie(struct super_block *sb, struct nfs_inode *nfsi) {}
+static inline void nfs4_fill_fscookie(struct super_block *sb) {}
+static inline void nfs_kill_fscookie(struct nfs_server *server) {}
+static inline void nfs_clear_fscookie(struct nfs_inode *nfsi) {}
+static inline void nfs_zap_fscookie(struct nfs_inode *nfsi) {}
+static inline void
+ nfs_renew_fscookie(struct nfs_server *server, struct nfs_inode *nfsi) {}
+static inline int nfs_register_netfs() { return 0; }
+static inline void nfs_unregister_netfs() {}
+
+#endif
+#endif /* _NFS_FSCACHE_H */
--- 2.6.12-rc2-mm3/fs/nfs/read.c.orig 2005-04-23 10:13:25.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/nfs/read.c 2005-04-23 11:25:47.000000000 -0400
@@ -27,6 +27,7 @@
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
+#include <linux/nfs_mount.h>
#include <linux/smp_lock.h>
#include <asm/system.h>
@@ -73,6 +74,47 @@ int nfs_return_empty_page(struct page *p
return 0;
}
+#ifdef CONFIG_NFS_FSCACHE
+/*
+ * store a newly fetched page in fscache
+ */
+static void
+nfs_readpage_to_fscache_complete(void *cookie_data, struct page *page, void *data, int error)
+{
+ dprintk("NFS: readpage_to_fscache_complete (%p/%p/%p/%d)\n",
+ cookie_data, page, data, error);
+
+ end_page_fs_misc(page);
+}
+
+static inline void
+nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+ int ret;
+
+ dprintk("NFS: readpage_to_fscache(0x%p/0x%p/0x%p/%d)\n",
+ NFS_I(inode)->fscache, page, inode, sync);
+
+ SetPageFsMisc(page);
+ ret = fscache_write_page(NFS_I(inode)->fscache, page,
+ nfs_readpage_to_fscache_complete, NULL, GFP_KERNEL);
+ if (ret != 0) {
+ dprintk("NFS: readpage_to_fscache: error %d\n", ret);
+ fscache_uncache_page(NFS_I(inode)->fscache, page);
+ ClearPageFsMisc(page);
+ }
+
+ unlock_page(page);
+}
+#else
+static inline void
+nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+ BUG();
+}
+#endif
+
+
/*
* Read a page synchronously.
*/
@@ -149,6 +191,13 @@ static int nfs_readpage_sync(struct nfs_
ClearPageError(page);
result = 0;
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_readpage_to_fscache(inode, page, 1);
+ else
+ unlock_page(page);
+
+ return result;
+
io_error:
unlock_page(page);
nfs_readdata_free(rdata);
@@ -180,7 +229,13 @@ static int nfs_readpage_async(struct nfs
static void nfs_readpage_release(struct nfs_page *req)
{
- unlock_page(req->wb_page);
+ struct inode *d_inode = req->wb_context->dentry->d_inode;
+
+ if ((NFS_SERVER(d_inode)->flags & NFS_MOUNT_FSCACHE) &&
+ PageUptodate(req->wb_page))
+ nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+ else
+ unlock_page(req->wb_page);
nfs_clear_request(req);
nfs_release_request(req);
@@ -477,6 +532,67 @@ void nfs_readpage_result(struct rpc_task
data->complete(data, status);
}
+
+/*
+ * Read a page through the on-disc cache if possible
+ */
+#ifdef CONFIG_NFS_FSCACHE
+static void
+nfs_readpage_from_fscache_complete(void *cookie_data, struct page *page, void *data, int error)
+{
+ dprintk("NFS: readpage_from_fscache_complete (0x%p/0x%p/0x%p/%d)\n",
+ cookie_data, page, data, error);
+
+ if (error)
+ SetPageError(page);
+ else
+ SetPageUptodate(page);
+
+ unlock_page(page);
+}
+
+static inline int
+nfs_readpage_from_fscache(struct inode *inode, struct page *page)
+{
+ struct fscache_page *pageio;
+ int ret;
+
+ dprintk("NFS: readpage_from_fscache(0x%p/0x%p/0x%p)\n",
+ NFS_I(inode)->fscache, page, inode);
+
+ pageio = fscache_page_get_private(page, GFP_NOIO);
+ if (IS_ERR(pageio)) {
+ dprintk("NFS: fscache_page_get_private error %ld\n", PTR_ERR(pageio));
+ return PTR_ERR(pageio);
+ }
+
+ ret = fscache_read_or_alloc_page(NFS_I(inode)->fscache,
+ page,
+ nfs_readpage_from_fscache_complete,
+ NULL,
+ GFP_KERNEL);
+
+ switch (ret) {
+ case 1: /* read BIO submitted and wb-journal entry found */
+ BUG();
+
+ case 0: /* read BIO submitted (page in fscache) */
+ return ret;
+
+ case -ENOBUFS: /* inode not in cache */
+ case -ENODATA: /* page not in cache */
+ dprintk("NFS: fscache_read_or_alloc_page error %d\n", ret);
+ return 1;
+
+ default:
+ return ret;
+ }
+}
+#else
+static inline int
+nfs_readpage_from_fscache(struct inode *inode, struct page *page) { return 1; }
+#endif
+
/*
* Read a page over NFS.
* We read the page synchronously in the following case:
@@ -510,6 +626,13 @@ int nfs_readpage(struct file *file, stru
ctx = get_nfs_open_context((struct nfs_open_context *)
file->private_data);
if (!IS_SYNC(inode)) {
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE) {
+ error = nfs_readpage_from_fscache(inode, page);
+ if (error < 0)
+ goto out_error;
+ if (error == 0)
+ return error;
+ }
error = nfs_readpage_async(ctx, inode, page);
goto out;
}
@@ -540,6 +663,15 @@ readpage_async_filler(void *data, struct
unsigned int len;
nfs_wb_page(inode, page);
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE) {
+ int error = nfs_readpage_from_fscache(inode, page);
+ if (error < 0)
+ return error;
+ if (error == 0)
+ return error;
+ }
+
len = nfs_page_length(inode, page);
if (len == 0)
return nfs_return_empty_page(page);
@@ -613,3 +745,61 @@ void nfs_destroy_readpagecache(void)
if (kmem_cache_destroy(nfs_rdata_cachep))
printk(KERN_INFO "nfs_read_data: not all structures were freed\n");
}
+
+#ifdef CONFIG_NFS_FSCACHE
+int nfs_invalidatepage(struct page *page, unsigned long offset)
+{
+ int ret = 1;
+ struct nfs_server *server = NFS_SERVER(page->mapping->host);
+
+ BUG_ON(!PageLocked(page));
+
+ if (server->flags & NFS_MOUNT_FSCACHE) {
+ if (PagePrivate(page)) {
+ struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+ dfprintk(PAGECACHE,"NFS: fscache invalidatepage (0x%p/0x%p/0x%p)\n",
+ nfsi->fscache, page, nfsi);
+
+ fscache_uncache_page(nfsi->fscache, page);
+
+ if (offset == 0) {
+ BUG_ON(!PageLocked(page));
+ ret = 0;
+ if (!PageWriteback(page))
+ ret = page->mapping->a_ops->releasepage(page, 0);
+ }
+ }
+ } else
+ ret = 0;
+
+ return ret;
+}
+int nfs_releasepage(struct page *page, int gfp_flags)
+{
+ struct fscache_page *pageio;
+ struct nfs_server *server = NFS_SERVER(page->mapping->host);
+
+ if (server->flags & NFS_MOUNT_FSCACHE && PagePrivate(page)) {
+ struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+ dfprintk(PAGECACHE,"NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
+ nfsi->fscache, page, nfsi);
+
+ fscache_uncache_page(nfsi->fscache, page);
+ pageio = (struct fscache_page *) page->private;
+ page->private = 0;
+ ClearPagePrivate(page);
+
+ if (pageio)
+ kfree(pageio);
+ }
+
+ return 0;
+}
+int nfs_mkwrite(struct page *page)
+{
+ wait_on_page_fs_misc(page);
+ return 0;
+}
+#endif
--- 2.6.12-rc2-mm3/fs/nfs/write.c.orig 2005-04-23 10:13:25.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/nfs/write.c 2005-04-23 18:07:11.000000000 -0400
@@ -255,6 +255,38 @@ static int wb_priority(struct writeback_
}
/*
+ * store an updated page in fscache
+ */
+#ifdef CONFIG_NFS_FSCACHE
+static void
+nfs_writepage_to_fscache_complete(void *cookie_data, struct page *page, void *data, int error)
+{
+ /* really need to synchronise the end of writeback, probably using a page flag */
+}
+static inline void
+nfs_writepage_to_fscache(struct inode *inode, struct page *page)
+{
+ int ret;
+
+ dprintk("NFS: writepage_to_fscache (0x%p/0x%p/0x%p)\n",
+ NFS_I(inode)->fscache, page, inode);
+
+ ret = fscache_write_page(NFS_I(inode)->fscache, page,
+ nfs_writepage_to_fscache_complete, NULL, GFP_KERNEL);
+ if (ret != 0) {
+ dprintk("NFS: fscache_write_page error %d\n", ret);
+ fscache_uncache_page(NFS_I(inode)->fscache, page);
+ }
+}
+#else
+static inline void
+nfs_writepage_to_fscache(struct inode *inode, struct page *page)
+{
+ BUG();
+}
+#endif
+
+/*
* Write an mmapped page to the server.
*/
int nfs_writepage(struct page *page, struct writeback_control *wbc)
@@ -299,6 +331,10 @@ do_it:
err = -EBADF;
goto out;
}
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_writepage_to_fscache(inode, page);
+
lock_kernel();
if (!IS_SYNC(inode) && inode_referenced) {
err = nfs_writepage_async(ctx, inode, page, 0, offset);
--- 2.6.12-rc2-mm3/fs/Kconfig.orig 2005-04-23 10:13:23.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/Kconfig 2005-04-23 11:25:48.000000000 -0400
@@ -1456,6 +1456,13 @@ config NFS_V4
If unsure, say N.
+config NFS_FSCACHE
+ bool "Provide NFS client caching support (EXPERIMENTAL)"
+ depends on NFS_FS && FSCACHE && EXPERIMENTAL
+ help
+ Say Y here if you want NFS data to be cached locally on disc through
+ the general filesystem cache manager
+
config NFS_DIRECTIO
bool "Allow direct I/O on NFS files (EXPERIMENTAL)"
depends on NFS_FS && EXPERIMENTAL
--- 2.6.12-rc2-mm3/include/linux/nfs_fs.h.orig 2005-04-23 10:13:28.000000000 -0400
+++ 2.6.12-rc2-mm3/include/linux/nfs_fs.h 2005-04-23 15:27:22.000000000 -0400
@@ -29,6 +29,7 @@
#include <linux/nfs_xdr.h>
#include <linux/rwsem.h>
#include <linux/mempool.h>
+#include <linux/fscache.h>
/*
* Enable debugging support for nfs client.
@@ -184,6 +185,11 @@ struct nfs_inode {
int delegation_state;
struct rw_semaphore rwsem;
#endif /* CONFIG_NFS_V4*/
+
+#ifdef CONFIG_NFS_FSCACHE
+ struct fscache_cookie *fscache;
+#endif
+
struct inode vfs_inode;
};
@@ -564,6 +570,7 @@ extern void * nfs_root_data(void);
#define NFSDBG_FILE 0x0040
#define NFSDBG_ROOT 0x0080
#define NFSDBG_CALLBACK 0x0100
+#define NFSDBG_FSCACHE 0x0200
#define NFSDBG_ALL 0xFFFF
#ifdef __KERNEL__
--- 2.6.12-rc2-mm3/include/linux/nfs_fs_sb.h.orig 2005-04-23 10:13:28.000000000 -0400
+++ 2.6.12-rc2-mm3/include/linux/nfs_fs_sb.h 2005-04-23 11:25:48.000000000 -0400
@@ -3,6 +3,7 @@
#include <linux/list.h>
#include <linux/backing-dev.h>
+#include <linux/fscache.h>
/*
* NFS client parameters stored in the superblock.
@@ -47,6 +48,10 @@ struct nfs_server {
that are supported on this
filesystem */
#endif
+
+#ifdef CONFIG_NFS_FSCACHE
+ struct fscache_cookie *fscache; /* cache cookie */
+#endif
};
/* Server capabilities */
--- 2.6.12-rc2-mm3/include/linux/nfs_mount.h.orig 2005-04-23 10:13:28.000000000 -0400
+++ 2.6.12-rc2-mm3/include/linux/nfs_mount.h 2005-04-23 11:25:48.000000000 -0400
@@ -61,6 +61,7 @@ struct nfs_mount_data {
#define NFS_MOUNT_NOACL 0x0800 /* 4 */
#define NFS_MOUNT_STRICTLOCK 0x1000 /* reserved for NFSv4 */
#define NFS_MOUNT_SECFLAVOUR 0x2000 /* 5 */
+#define NFS_MOUNT_FSCACHE 0x3000
#define NFS_MOUNT_FLAGMASK 0xFFFF
#endif
--- 2.6.12-rc2-mm3/include/linux/nfs4_mount.h.orig 2005-03-02 02:38:09.000000000 -0500
+++ 2.6.12-rc2-mm3/include/linux/nfs4_mount.h 2005-04-23 11:25:48.000000000 -0400
@@ -65,6 +65,7 @@ struct nfs4_mount_data {
#define NFS4_MOUNT_NOCTO 0x0010 /* 1 */
#define NFS4_MOUNT_NOAC 0x0020 /* 1 */
#define NFS4_MOUNT_STRICTLOCK 0x1000 /* 1 */
+#define NFS4_MOUNT_FSCACHE 0x2000 /* 1 */
#define NFS4_MOUNT_FLAGMASK 0xFFFF
#endif
[-- Attachment #3: 2.6.12-rc3-mm3-fscache-cookie-exist.patch --]
[-- Type: text/x-patch, Size: 669 bytes --]
Fails a second NFS mount with EEXIST instead of an oops.
Signed-off-by: Steve Dickson <steved@redhat.com>
--- 2.6.12-rc3-mm3/fs/fscache/cookie.c.orig 2005-05-07 09:30:28.000000000 -0400
+++ 2.6.12-rc3-mm3/fs/fscache/cookie.c 2005-05-07 11:01:39.000000000 -0400
@@ -452,7 +452,11 @@ static int fscache_search_for_object(str
cache->ops->lock_node(node);
/* a node should only ever be attached to one cookie */
- BUG_ON(!list_empty(&node->cookie_link));
+ if (!list_empty(&node->cookie_link)) {
+ cache->ops->unlock_node(node);
+ ret = -EEXIST;
+ goto error;
+ }
/* attach the node to the cache's node list */
if (list_empty(&node->cache_link)) {
[-- Attachment #4: 2.6.12-rc2-mm3-cachefs-wb.patch --]
[-- Type: text/x-patch, Size: 452 bytes --]
--- 2.6.12-rc2-mm3/fs/cachefs/journal.c.save 2005-04-27 08:06:03.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/cachefs/journal.c 2005-05-03 11:11:17.000000000 -0400
@@ -682,6 +682,7 @@ static inline void cachefs_trans_batch_p
list_add_tail(&block->batch_link, plist);
block->writeback = block->page;
get_page(block->writeback);
+ SetPageWriteback(block->writeback);
/* make sure DMA can reach the data */
flush_dcache_page(block->writeback);
* Re: NFS Patch for FSCache
2005-05-09 10:31 NFS Patch for FSCache Steve Dickson
@ 2005-05-09 21:19 ` Andrew Morton
2005-05-10 18:43 ` Steve Dickson
2005-05-10 19:12 ` [Linux-cachefs] " David Howells
2005-06-13 12:52 ` Steve Dickson
1 sibling, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2005-05-09 21:19 UTC (permalink / raw)
To: Steve Dickson; +Cc: linux-fsdevel, linux-cachefs
Steve Dickson <SteveD@redhat.com> wrote:
>
> Attached is a patch that enables NFS to use David Howells'
> File System Caching implementation (FSCache).
Do you have any performance results for this?
* Re: NFS Patch for FSCache
2005-05-09 21:19 ` Andrew Morton
@ 2005-05-10 18:43 ` Steve Dickson
2005-05-10 19:12 ` [Linux-cachefs] " David Howells
1 sibling, 0 replies; 14+ messages in thread
From: Steve Dickson @ 2005-05-10 18:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-cachefs
Andrew Morton wrote:
> Steve Dickson <SteveD@redhat.com> wrote:
>
>>Attached is a patch that enables NFS to use David Howells'
>>File System Caching implementation (FSCache).
>
>
> Do you have any performance results for this?
I haven't done any formal performance testing, but from
the functionality testing I've done, I've seen
a ~20% increase in read speed (versus over-the-wire reads),
mainly due to the fact that NFS only needs to do getattrs
and such when the data is cached. But buyer beware...
this is a very rough number, so mileage may vary. ;-)
I don't have a number for writes (maybe David does),
but I'm sure there will be a penalty to cache that
data; it's something that can be improved over time.
But the real saving, imho, is the fact that those
reads were measured after the filesystem was
unmounted and then remounted. So system-wise, there
should be some gain due to the fact that NFS
is not using the network....
steved.
* Re: [Linux-cachefs] Re: NFS Patch for FSCache
2005-05-09 21:19 ` Andrew Morton
2005-05-10 18:43 ` Steve Dickson
@ 2005-05-10 19:12 ` David Howells
2005-05-14 2:18 ` Troy Benjegerdes
2005-05-16 13:30 ` David Howells
1 sibling, 2 replies; 14+ messages in thread
From: David Howells @ 2005-05-10 19:12 UTC (permalink / raw)
To: Linux filesystem caching discussion list; +Cc: Andrew Morton, linux-fsdevel
Steve Dickson <SteveD@redhat.com> wrote:
> But the real saving, imho, is the fact those reads were measured after the
> filesystem was umount then remounted. So system wise, there should be some
> gain due to the fact that NFS is not using the network....
I tested md5sum read speed also. My testbox is a dual 200MHz PPro. It's got
128MB of RAM. I've got a 100MB file on the NFS server for it to read.
No Cache: ~14s
Cold Cache: ~15s
Warm Cache: ~2s
Now these numbers are approximate because they're from memory.
Note that a cold cache is worse than no cache because CacheFS (a) has to check
the disk before NFS goes to the server, and (b) has to journal the allocations
of new data blocks. It may also have to wait whilst pages are written to disk
before it can get new ones rather than just dropping them (100MB is big enough
wrt 128MB that this will happen) and 100MB is sufficient to cause it to start
using single- and double-indirection pointers to find its blocks on disk,
though these are cached in the page cache.
David
* Re: Re: NFS Patch for FSCache
2005-05-12 22:43 [Linux-cachefs] " Lever, Charles
@ 2005-05-13 11:17 ` David Howells
2005-05-14 2:08 ` Troy Benjegerdes
2005-05-16 12:47 ` [Linux-cachefs] " David Howells
0 siblings, 2 replies; 14+ messages in thread
From: David Howells @ 2005-05-13 11:17 UTC (permalink / raw)
To: Lever, Charles
Cc: linux-fsdevel, Linux filesystem caching discussion list, SteveD
Charles Lever <Charles.Lever@netapp.com> wrote:
> to benchmark this i think you need to explore the architectural
> weaknesses of your approach. how bad will it get using cachefs with
> badly designed applications or client/server setups?
There are a number of critical points:
(1) Inodes
I need to represent in the cache the files I have cached so that I can
find the data attached to them and keep track of the last access times
for the purposes of culling inodes to make space. There are several ways
of doing this:
(a) Set aside a portion of the disk as an inode table and search it
end to end for each open attempt. Now assume I store one inode per
sector (512 bytes) and have a table of 128*1024 entries; this
means the worst case is that I have to read through 64MB (16384
pages) - and the worst case is going to happen every time there's
a cache miss. Now this can be alleviated in a couple of ways:
- By holding a bitmap, say, indicating which page-sized blocks
actually have inodes in them, but that then precludes the use of
do_generic_mapping_read() and readahead.
- By simply keeping track of where the last used block is to cut
short the scan and by moving inodes down or by always allocating
as low a slot as possible.
The really unpleasant side of this is that it will cycle on
average at least half of this space through the page cache every
time we scan the table. The average is likely to be worse than
half if cache misses are taken into account.
This also cuts out a chunk of the cache and makes it permanently
unavailable.
(b) Store metadata in an extensible file. This is what cachefs
actually does. It's slower than (a), but does allow (almost)
unlimited growth of the inode table and only uses as much of the
space in the cache as is necessary. Since I keep a list of free
inode entries, allocating a new one is very quick.
However, the worst case performance for (b) is actually worse than
for (a) because I not only have to read through the blocks of
inode definitions, but I also have to read the pointer blocks, and
there's on average one of those per 1024 inodes. Even worse is
that reading a particular inode has to be synchronous with respect
to walking the indirection chain.
This also has the same unpleasant side effects on scanning the
table as (a), only more so.
(c) Store the metadata records in a tree. This would permit
predictable and potentially constant time lookup for a particular
inode, though we would have to walk the tree to find the
particular inode we're looking for, which has to be synchronous.
Scanning a tree is potentially even worse than for flat files like
in (a) and (b) since you potentially have a lot more intermediate
nodes to walk. However, the really big advantage of a tree is that
it's a lot easier to remove dead space, thus compressing the tree,
plus it only needs to eat up as much cache space as is necessary.
Trees are a lot harder to juggle with the type of journalling that
cachefs does now, however, since the worst case for fanning out
any particular node is that you have to allocate as many new nodes
as there are slots in the page you already have plus one. So if
you make slots sector size, that's nine pages on a 4K page system,
but 129 on a 64K page system, and you have to journal all this
data...
Ideally, for optimal lookup speed, you want your tree to be
balanced, and that means dealing with rotation through nodes with
more than two pointers. This is something else the current method
of journalling is really poor at coping with.
However, the use of a wandering tree rather than exquisite
journalling would make matters easier here; the only journalling
required would be the change of tree root and the change in state
of the free block lists. The downside of a wandering tree is that
you have to maintain sufficient free space to be able to wander
whilst performing a deletion to make more space.
(2) Indexing
When a file is opened, CacheFS has to look in the cache to see if that
file is represented there. This means searching the set of cached files
for the one in which we're particularly interested.
Now I could store the lookup keys in the inodes themselves, but this has
two important consequences:
(a) It restricts the amount of auxiliary data an inode can store;
this includes such things as direct data pointers.
(b) It restricts the size of a netfs key and auxiliary data that can
be stored in an inode without increasing the on-disk inode
representation.
Furthermore, the keys of a particular class of object from a particular
netfs are generally of the same sort of size. For instance AFS vnodes
require a vnode ID (4 bytes) and a vnode uniquifier (4 bytes) per file.
Nothing else. These can be packed much more closely than can inodes
making them that much faster to search.
Therefore, I chose to arrange cachefs as a set of homogenous indexes,
where each index is defined by the netfs to hold elements of a particular
size. This makes searching an index that much faster.
However, since it's an index tree that's defined on disk, open() is still
synchronous in that it has to walk down the index tree until it finds (or
not) the particular data file it's looking for.
So the rapid opening and closing of a lot of small files is going to be
penalised; though this is reduced by the fact that we cache information
about indexes we know about. For instance, in the NFS client caching, we
have two layers of indexes: a server index (pinned by NFS server
records), each entry of which points to another index that keeps track of
the NFS files we know about by the NFS file handle.
Unfortunately, NFS file handles are potentially quite large (128 bytes
maximum for NFS4). This means that we can't fit many per page (about 30
on a 4K page). Now imagine that we're looking at a 2.6.11 kernel source
tree on an NFS server; this has about 21000 files. This means that we
require about 700 pages at least, more if the NFS filesystem wants to
store auxiliary data as well. So the worst case lookup is reading
through 700 pages (cache miss is worst case). If the key was stored in
with the inodes, this would mean at least 2625 pages to be read (assuming
8 per page). Now using do_generic_mapping_read() alleviates some of the
latency associated with this by doing readahead, but it's still quite a
turnover of the page cache.
If NFS had just the one index for all files on all servers, then the keys
would be bigger, though maybe only by 4 bytes (IP address), permitting
say 29 per page. Now assume you've got two servers, each with a kernel
source tree; that brings the number of files to 42000, and a worst case
lookup of say 1448 pages in the index or 5250 in the inode table
directly.
As you can see, keeping your indexes small greatly helps reduce the
lookup times, provided you can keep information about the indexes pinned
in memory to help you get to the bottom faster.
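As a quick sketch of the arithmetic behind those page counts (4K pages
assumed, names invented for illustration):

static unsigned int index_pages(unsigned int nr_files,
				unsigned int entries_per_page)
{
	/* 4K index pages needed to hold nr_files fixed-size keys */
	return (nr_files + entries_per_page - 1) / entries_per_page;
}

/* index_pages(21000, 30) == 700  (128-byte NFS4 file handles per page)
 * index_pages(21000, 8)  == 2625 (keys kept with the inodes instead)  */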
However, having more levels of index and subdividing the key space
between them brings its own pitfall: there are more levels, and walking
them has to be synchronous. Not only that, but creating all these indexes
on disk also has to be synchronous; and worse, has to be journalled.
Now all this is for flat-file indexes, as cachefs currently uses. If,
instead, cachefs used a tree the worst case lookup time would be (for a
balanced tree):
round_up(log1024(#objects)) * step_time
For 4K pages. Not only that, but the maximum page cache thrash would be
round_up(log1024(#objects)) too. So if you've got a million objects, then
this would be 2.
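A rough sketch of that formula in code (assuming a balanced tree with
1024 pointers per 4K page, as above; purely illustrative):

static unsigned int tree_lookup_depth(unsigned long long nr_objects)
{
	unsigned long long reach = 1;
	unsigned int depth = 0;

	/* each extra level multiplies the reachable leaf count by 1024 */
	while (reach < nr_objects) {
		reach *= 1024;
		depth++;
	}
	return depth;	/* 1,000,000 objects -> 2, as stated above */
}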
(3) Data storage
Finally, once you've found your inode, you have to be able to retrieve
data from it. cachefs does this by having a small number of direct
pointers in the inode itself, plus a single-indirect pointer, plus a
double indirect pointer, and plus potentionally higher-order indirection
pointers.
So for cachefs as it stands, up to the first 112 pages (448KB if 4K
pages) of a file are pointed to directly. These pages are really fast to
access because the inodes are held in memory as long as the fscache
cookies persist. Further into a file, performance degrades as more and
more levels of indirection blocks have to be traversed; but I think
that's acceptable. The single indirection block covers the next 4MB of
the file and then the double indirection block covers the next 4GB. We'd
then have to use triple indirection for the following 4TB and so on.
One disadvantage of doing this is that walking the indirection chain is,
of course, synchronous; though the page cache will alleviate the problem
somewhat. However, worse is that we have to journal the allocations of
pages and pointer blocks.
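A sketch of that pointer layout (assuming the figures above: 112 direct
pointers, then 1024 pointers per 4K indirection block; a sketch only,
not the actual cachefs code):

static int indirection_level(unsigned long long page_index)
{
	unsigned long long reach = 112;	/* pages covered by direct pointers */
	unsigned long long span = 1024;	/* pages added by the next level    */
	int level = 0;

	while (page_index >= reach) {
		reach += span;		/* single: +4MB, double: +4GB, ...  */
		span *= 1024;
		level++;
	}
	return level;	/* 0 = direct, 1 = single-indirect, 2 = double, ... */
}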
There are some ways of improving things:
(a) Extents
When the netfs presents some pages for caching, if these are all
contiguous in the inode's page space then they could be
represented on disk as a start page index, a size and either a
list of blocks or the first block of a contiguous chunk of disk.
Extents are quite tricky to deal with, depending on the degree of
data integrity you want. How do you deal with overlapping extents?
Do you just overwrite the first extent with the data from the
second and then write an abbreviated second extent (and perhaps
append as much of the second onto the first if possible)? What if
you're guaranteeing not to overwrite any data?
Also you really need a flexible tree structure to manage extents,
I think.
On-disk contiguous extents are also impractical as it's unlikely
that you'd be able to scrape together large runs of adjacent
blocks easily.
(b) Larger grained allocations
Performance could be improved at the cost of lowering the
utilisation of the disk by allocating blocks in groups. Pointers
in pointer blocks would then become a tuple containing a pointer
to the block group and a bitmap of the usage of the blocks in that
group - we mustn't return uninitialised data to the netfs. This
would allow one tuple to address up to 32 blocks (assuming a
32-bit bitmap); a coverage of 128KB with 4KB pages and 2MB with
64KB pages. However, the chunking factor could be anything from 2
to 32; a factor of 4 (16KB) might be a good compromise.
It would also be possible to create power-of-2 gradated allocation
bands in the cache, and attempt to allocate an appropriately sized
chunk, depending on the relationship between the target page group
position in the inode and the size of the inode's data.
This, however, would complicate the free block list handling as
each band would require its own free list maintenance.
This sort of thing would also be easy to provide mount-time tuning
for. (A minimal sketch of such a group-pointer tuple follows this
list.)
(4) Other objects
One thing cachefs doesn't supply facilities for is caching other types of
objects, such as extended attributes. This is tricky with the flat file
indexing used as there's nowhere really to store a list of the objects
required. A move to a tree based filesystem would make this possible.
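Picking up the group-pointer idea from (3)(b) above, such a tuple might
look something like this (a minimal sketch; the struct and field names
are invented):

#include <stdint.h>

struct cachefs_group_ptr {
	uint32_t group_start;	/* first disk block of the allocated group */
	uint32_t used_bitmap;	/* bit N set => block N holds valid data   */
};

/* With 4KB blocks and a 32-bit bitmap, one tuple can address up to 128KB;
 * a chunking factor of 4 (16KB) is suggested above as a compromise.      */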
> for instance, what happens when the client's cache disk is much slower
> than the server (high performance RAID with high speed networking)?
Then using a local cache won't help you, no matter how hard it tries, except
in the following circumstances:
(1) The server is not available.
(2) The network is heavily used by more than just one machine.
(3) The server is very busy.
> what happens when the client's cache disk fills up so the disk cache is
> constantly turning over (which files are kicked out of your backing
> cachefs to make room for new data)?
I want that to be based on an LRU approach, using last access times. Inodes
pinned by being open can't be disposed of and neither can inodes pinned by
being marked so; but anything else is fair game for culling.
The cache has to be scanned occasionally to build up a list of inodes that are
candidates for being culled, and I think a certain amount of space must be
kept available to satisfy allocation requests; therefore the culling needs
thresholds.
Unfortunately, culling is going to be slower than allocation in general
because we always know where we're going to allocate, but we have to search
for something to get the chop.
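As a minimal sketch of the threshold idea (the low-water figure is
invented purely for illustration):

static int culling_should_start(unsigned long free_blocks,
				unsigned long total_blocks)
{
	/* begin scanning for cull candidates once free space falls
	 * below a low-water mark; 5% here is purely illustrative */
	return free_blocks * 100 < total_blocks * 5;
}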
> what happens with multi-threaded I/O-bound applications when the cachefs is
> on a single spindle?
I don't think that multi-threaded applications are actually distinguishable on
Linux from several single-threaded applications.
Allocation has to be serialised if proper filesystem integrity is to be
maintained and if there is to be proper error handling. I do, however, try and
keep the critical section as small and as fast as possible.
> is there any performance dependency on the size of the backing cachefs?
The blocks holding virtually contiguous data may be scattered further apart on
the disk. This is unfortunate, but unless I can reclaim specific blocks or
use larger grained allocations there's not a lot I can do about that.
> do you also cache directory contents on disk?
That's entirely up to the netfs. AFS, yes; NFS, no. If cachefs were extended
to support the caching of other types of object, this would become easier for
NFS.
> remember that the application you designed this for (preserving cache
> contents across client reboots) is only one way this will be used. some
> of us would like to use this facility to provide a high-performance
> local cache larger than the client's RAM. :^)
Agreed. I'd like to be able to disable the journal. I'd also like to be able
to use it for low priority swap space. Obviously swap space can't be evicted
from the cache without approval from the swapper.
> synchronous file system metadata management is the bane of every cachefs
> implementation i know about.
Yeah. But imagine you work for a company with /usr mounted over the network by
every machine. Then the power fails. When the power comes back, all these
machines wake up, see their cache is in a bad state, reinitialise it and
immediately splat the server trying to regain /usr.
> have you measured what performance impact there is when cache files go from
> no indirection to single indirect blocks, or from single to double
> indirection? have you measured how expensive it is to reuse a single cache
> file because the cachefs file system is already full? how expensive is it
> to invalidate the data in the cache (say, if some other client changes a
> file you already have cached in your cachefs)?
Not yet. All these things require a journal update, but the changes don't
actually happen immediately, and we don't actually spend that much time
waiting around, except when we need to read something from the disk first.
The journal manager makes sure that the altered blocks hit the disk after the
journal does, and that happens in a background thread. Changes are batched up,
and we can write up to about 4000 in a batch (we must not overwrite the
previous BATCH and ACK marks in the journal).
> what about using an extent-based file system for the backing cachefs?
> that would probably not be too difficult because you have a good
> prediction already of how large the file will be (just look at the file
> size on the server).
Extents make deletion a lot slower, I suspect, because the structure is a lot
more flexible. Extents also do not eliminate indirection; they merely move it
elsewhere - an extent tree is indirect.
I'm working on making cachefs wandering tree based at the moment, at least for
the indexing. I've considered having data pointer blocks (extents) as objects
in the tree, but it's actually more complicated that way:-/
> how about using smallish chunks, like the AFS cache manager, to avoid
> indirection entirely?
In what way does this help? You have to, for example, be able to cope with a
64-bit machine requesting pages from anywhere within a file, so you might get
a request for page 0xFFFFFFFFFFFFFFFF from a sparse file and have to be able
to deal with it. How are you going to handle that without indirection? CacheFS
at the moment can't deal with that, but it wouldn't be too hard to make it
possible, it merely requires 7th-level indirection (I think). And if I move to
64-bit block pointers at some point, you'd be able to store almost that many
pages, assuming you could find a block device big enough.
> would there be any performance advantage to caching small files in memory
> and large files on disk, or vice versa?
Well, you have the page cache; but how do you decide what should be kept in
memory and what should be committed to disk? If you keep small files in
memory, then you lose them if the power goes off or you reboot, and if you're
trying to operate disconnectedly, then you've got a problem.
David
* Re: Re: NFS Patch for FSCache
2005-05-13 11:17 ` David Howells
@ 2005-05-14 2:08 ` Troy Benjegerdes
2005-05-16 12:47 ` [Linux-cachefs] " David Howells
1 sibling, 0 replies; 14+ messages in thread
From: Troy Benjegerdes @ 2005-05-14 2:08 UTC (permalink / raw)
To: Linux filesystem caching discussion list
Cc: linux-fsdevel, Lever, Charles, SteveD
> > for instance, what happens when the client's cache disk is much slower
> > than the server (high performance RAID with high speed networking)?
>
> Then using a local cache won't help you, no matter how hard it tries, except
> in the following circumstances:
>
> (1) The server is not available.
>
> (2) The network is heavily used by more than just one machine.
>
> (3) The server is very busy.
>
> > what happens when the client's cache disk fills up so the disk cache is
> > constantly turning over (which files are kicked out of your backing
> > cachefs to make room for new data)?
>
> I want that to be based on an LRU approach, using last access times. Inodes
> pinned by being open can't be disposed of and neither can inodes pinned by
> being marked so; but anything else is fair game for culling.
>
> The cache has to be scanned occasionally to build up a list of inodes that are
> candidates for being culled, and I think a certain amount of space must be
> kept available to satisfy allocation requests; therefore the culling needs
> thresholds.
>
> Unfortunately, culling is going to be slower than allocation in general
> because we always know where we're going to allocate, but we have to search
> for something to get the chop.
I would like to suggest that cache culling be driven by a userspace
daemon, with LRU usage being used as a fallback approach if the
userspace app doesn't respond fast enough. Or at the least provide a way
to load modules to provide different culling algorithms.
If the server is responding and delivering files faster than we can
write them to local disk and cull space, should we really be caching at
all? Is it even appropriate for the kernel to make that decision?
* Re: Re: NFS Patch for FSCache
2005-05-10 19:12 ` [Linux-cachefs] " David Howells
@ 2005-05-14 2:18 ` Troy Benjegerdes
2005-05-16 13:30 ` David Howells
1 sibling, 0 replies; 14+ messages in thread
From: Troy Benjegerdes @ 2005-05-14 2:18 UTC (permalink / raw)
To: Linux filesystem caching discussion list; +Cc: Andrew Morton, linux-fsdevel
On Tue, May 10, 2005 at 08:12:51PM +0100, David Howells wrote:
>
> Steve Dickson <SteveD@redhat.com> wrote:
>
> > But the real saving, imho, is the fact those reads were measured after the
> > filesystem was umount then remounted. So system wise, there should be some
> > gain due to the fact that NFS is not using the network....
>
> I tested md5sum read speed also. My testbox is a dual 200MHz PPro. It's got
> 128MB of RAM. I've got a 100MB file on the NFS server for it to read.
>
> No Cache: ~14s
> Cold Cache: ~15s
> Warm Cache: ~2s
>
> Now these numbers are approximate because they're from memory.
>
> Note that a cold cache is worse than no cache because CacheFS (a) has to check
> the disk before NFS goes to the server, and (b) has to journal the allocations
> of new data blocks. It may also have to wait whilst pages are written to disk
> before it can get new ones rather than just dropping them (100MB is big enough
> wrt 128MB that this will happen) and 100MB is sufficient to cause it to start
> using single- and double-indirection pointers to find its blocks on disk,
> though these are cached in the page cache.
How big was the cachefs filesystem?
Now try reading a 1GB file over nfs..
I have found (with openafs) that I either need a really small cache, or
a really big one.. The bigger the openafs cache gets, the slower it
goes. The only place I run with a > 1GB openafs cache is on an imap
server that has an 8gb cache for maildirs.
* Re: Re: NFS Patch for FSCache
2005-05-10 19:12 ` [Linux-cachefs] " David Howells
2005-05-14 2:18 ` Troy Benjegerdes
@ 2005-05-16 13:30 ` David Howells
1 sibling, 0 replies; 14+ messages in thread
From: David Howells @ 2005-05-16 13:30 UTC (permalink / raw)
To: Linux filesystem caching discussion list; +Cc: Andrew Morton, linux-fsdevel
Troy Benjegerdes <hozer@hozed.org> wrote:
> How big was the cachefs filesystem?
Several Gig. I don't remember how big, but the disk I tried it on is totally
kaput unfortunately.
> Now try reading a 1GB file over nfs..
I'll give that a go at some point. However, I suspect that any size over twice
the amount of pagecache available is going to scale fairly consistently until
you start hitting the lid on the cache. I say twice because firstly you fill
the pagecache with pages and start throwing them at the disk, and then you
have to start on a rolling process of waiting for those to hit the disk before
evicting them from the pagecache, which isn't going to get going smoothly
until you've ejected the original load of pages.
> I have found (with openafs) that I either need a really small cache, or
> a really big one.. The bigger the openafs cache gets, the slower it
> goes. The only place I run with a > 1GB openafs cache is on an imap
> server that has an 8gb cache for maildirs.
What filesystem underlies your OpenAFS cache? OpenAFS doesn't actually do its
own file to disk management within the cache, but uses a host filesystem for
that.
David
* Re: Re: NFS Patch for FSCache
2005-05-16 12:47 ` [Linux-cachefs] " David Howells
@ 2005-05-17 21:42 ` David Masover
2005-05-18 10:28 ` [Linux-cachefs] " David Howells
1 sibling, 0 replies; 14+ messages in thread
From: David Masover @ 2005-05-17 21:42 UTC (permalink / raw)
To: Linux filesystem caching discussion list
Cc: linux-fsdevel, Lever, Charles, SteveD
David Howells wrote:
[...]
>>If the server is responding and delivering files faster than we can
>>write them to local disk and cull space, should we really be caching at
>>all? Is it even appropriate for the kernel to make that decision?
>
>
> That's a very tricky question, and it's most likely when the network + server
> retrieval speeds are better than the disk retrieval speeds, in which case you
> shouldn't be using a cache, except for the cases of where you want to be able
> to live without the server for some reason or other; and there the cache is
> being used to enhance reliability, not speed.
>
> However, we do need to control turnover on the cache. If the rate is greater
> than we can sustain, there needs to be a way to suspend caching on certain
> files, but what the heuristic for that should be, I'm not sure.
Does the cache call sync/fsync overly often? If not, we can gain
something by using an underlying FS with lazy writes.
I think the caching should be done asynchronously. As stuff comes in,
it should be handed off both to the app requesting it and to a queue to
write it to the cache. If the queue gets too full, start dropping stuff
from it the same way you do from cache -- probably LRU or LFU or
something similar.
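
To make that concrete, here is a minimal userspace sketch of such a bounded
write-behind queue; the names, the queue depth and the use of an int as a
stand-in for a page are all invented for illustration and are not part of
FSCache:

#include <stdio.h>

#define QUEUE_MAX 8

struct wb_queue {
    int pages[QUEUE_MAX];       /* stand-in for references to pages */
    unsigned int head, tail, count;
};

static void wb_enqueue(struct wb_queue *q, int page)
{
    if (q->count == QUEUE_MAX) {
        /* queue full: shed the oldest pending cache write rather than
         * stalling the reader */
        printf("dropping cache write for page %d\n", q->pages[q->head]);
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;
    }
    q->pages[q->tail] = page;
    q->tail = (q->tail + 1) % QUEUE_MAX;
    q->count++;
}

int main(void)
{
    struct wb_queue q = { .head = 0, .tail = 0, .count = 0 };

    /* reads arrive faster than the cache writer drains the queue */
    for (int page = 0; page < 12; page++)
        wb_enqueue(&q, page);

    printf("%u cache writes still queued\n", q.count);
    return 0;
}

The point is simply that a full queue sheds the oldest pending cache write
instead of making the application wait on the cache.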
Another question -- how much performance do we lose by caching, assuming
that both the network/server and the local disk are infinitely fast?
That is, how many cycles do we lose vs. local disk access? Basically,
I'm looking for something that does what InterMezzo was supposed to --
make cache access almost as fast as local access, so that I can replace
all local stuff with a cache.
* RE: Re: NFS Patch for FSCache
@ 2005-05-18 16:32 Lever, Charles
2005-05-18 17:49 ` David Howells
0 siblings, 1 reply; 14+ messages in thread
From: Lever, Charles @ 2005-05-18 16:32 UTC (permalink / raw)
To: David Howells, Linux filesystem caching discussion list
Cc: linux-fsdevel, SteveD
> > If not, we can gain something by using an underlying FS with lazy writes.
>
> Yes, to some extent. There's still the problem of filesystem integrity to
> deal with, and lazy writes hold up journal closure. This isn't necessarily
> a problem, except when you want to delete and launder a block that has a
> write hanging over it. It's not unsolvable, just tricky.
>
> Besides, what do you mean by lazy?
as i see it, you have two things to guarantee:
1. attributes cached on the disk are either up to date, or clearly out
of date (otherwise there's no way to tell whether cached data is stale
or not), and
2. the consistency of the backing file system must be maintained.
in fact, you don't need to maintain data coherency up to the very last
moment, since the client is pushing data to the server for permanent
storage. cached data in the local backing FS can be out of date after a
client reboot without any harm whatever, so it doesn't matter a whit that
the on-disk state of the backing FS trails the page cache.
(of course you do need to sync up completely with the server if you
intend to use CacheFS for disconnected operation, but that can be
handled by "umount" rather than keeping strict data coherency all the
time).
it also doesn't matter if the backing FS can't keep up with the server.
the failure mode can be graceful, so that as the backing FS becomes
loaded, it passes more requests back to the server and caches less data
and fewer requests. this is how it works when there is more data to
cache than there is space to cache it; it should work the same way if
the I/O rate is higher than the backing FS can handle.
> Actually, probably the biggest bottleneck is the disk block allocator.
in my experience with the AFS cache manager, this is exactly the
problem. the ideal case is where the backing FS behaves a lot like swap
-- just get the bits down onto disk in any location, without any
sophisticated management of free space. the problem is keeping track of
the data blocks during a client crash/reboot.
the real problem arises when the cache is full and you want to cache a
new file. the cache manager must choose a file to reclaim, release all
the blocks for that file, then immediately reallocate them for the new
file. all of this is synchronous activity.
are there advantages to a log-structured file system for this purpose?
is there a good way to trade disk space for the performance of your
block allocator?
> Well, with infinitely fast disk and network, very little - you can afford
> to be profligate on your turnover of disk space, and this affects the
> options you might choose in designing your cache.
in fact, with an infinitely fast server and network, there would be no
need for local caching at all. so maybe that's not such an interesting
thing to consider.
it might be more appropriate to design, configure, and measure CacheFS
with real typical network and server latency numbers in mind.
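
For example, a back-of-envelope model with made-up but plausible latency
numbers shows how the benefit depends entirely on the hit rate and on the
local read path actually being faster than the network round trip (this is
only an illustrative sketch, not a measurement of CacheFS):

#include <stdio.h>

int main(void)
{
    double t_net_ms  = 6.0;   /* assumed NFS read round trip */
    double t_disk_ms = 2.0;   /* assumed local cache read */

    for (double hit = 0.0; hit <= 1.0; hit += 0.25) {
        double eff = hit * t_disk_ms + (1.0 - hit) * t_net_ms;
        printf("hit rate %.2f -> effective read latency %.2f ms\n",
               hit, eff);
    }
    return 0;
}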
> Reading one really big file (bigger than the memory available) over AFS,
> with a cold cache it took very roughly 107% of the time it took with no
> cache; but using a warm cache, it took 14% of the time it took with no
> cache. However, this is on my particular test box, and it varies a lot
> from box to box.
david, what is the behavior when the file that needs to be cached is
larger than the backing file system? for example, what happens when
some client application starts reading a large media file that won't fit
entirely in the cache?
* Re: Re: NFS Patch for FSCache
2005-05-18 16:32 Lever, Charles
@ 2005-05-18 17:49 ` David Howells
0 siblings, 0 replies; 14+ messages in thread
From: David Howells @ 2005-05-18 17:49 UTC (permalink / raw)
To: Lever, Charles
Cc: linux-fsdevel, Linux filesystem caching discussion list, SteveD
Lever, Charles <Charles.Lever@netapp.com> wrote:
> 1. attributes cached on the disk are either up to date, or clearly out
> of date (otherwise there's no way to tell whether cached data is stale
> or not), and
We have to trust the netfs to know when an inode is obsolete. That's why
cachefs calls back to the netfs to validate the inodes it finds. With AFS this
checks the vnode version number and the data version number.
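
As a rough illustration of what that callback boils down to (the structure
and field names here are hypothetical, not the real fscache interface): the
netfs compares the coherency data stored with the cached object against what
the server currently reports, and tells the cache to keep or discard it.

#include <stdint.h>
#include <stdio.h>

enum cache_verdict { CACHE_VALID, CACHE_OBSOLETE };

struct afs_vnode_coherency {
    uint32_t vnode_version;    /* bumped when the vnode metadata changes */
    uint64_t data_version;     /* bumped when the file contents change */
};

static enum cache_verdict
validate_cached_vnode(const struct afs_vnode_coherency *on_disk,
                      const struct afs_vnode_coherency *from_server)
{
    if (on_disk->vnode_version != from_server->vnode_version ||
        on_disk->data_version != from_server->data_version)
        return CACHE_OBSOLETE;    /* cache must discard and refetch */
    return CACHE_VALID;
}

int main(void)
{
    struct afs_vnode_coherency cached = { 3, 41 };
    struct afs_vnode_coherency live   = { 3, 42 };

    printf("cached copy is %s\n",
           validate_cached_vnode(&cached, &live) == CACHE_VALID
           ? "valid" : "obsolete");
    return 0;
}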
> in fact, you don't need to maintain data coherency up to the very last
> moment, since the client is pushing data to the server for permanent
> storage. cached data in the local backing FS can be out of date after a
> client reboot without any harm whatever, so it doesn't matter a whit that
> the on-disk state of the backing FS trails the page cache.
True, but also consider the fact that if a netfs wants to throw a page
into the cache, it must keep it around long enough for us to write it to disk.
So if the user is grabbing a file, say, twice the maximum pagecache
size, being too lazy will hold up the read as the VM then tries to eject pages
that are pending writing to the cache.
Actually, the best way to do this would be to get the VM involved in the
caching, I think. Currently, the netfs has to issue a write to the disk, and
there're only certain points at which it's able to do that:
- readpage completion
- page release
- writepage (if the page is altered locally)
The one that's going to impact performance least is when the netfs finishes
reading a page. Getting the VM involved would allow the VM to batch up writes
to the cache and to predict better when to do the writes.
One of the problems I've got is that I'd like to be able to gang up writes to
the cache, but that's difficult as the pages tend to be read individually
across the network, and thus complete individually.
Furthermore, consider the fact that the netfs records state tracking
information in the cache (such as AFS's data version). This must be modified
after the changed pages are written to the cache (or deleted from it) lest you
end up with old data for the version specified.
> (of course you do need to sync up completely with the server if you
> intend to use CacheFS for disconnected operation, but that can be
> handled by "umount" rather than keeping strict data coherency all the
> time).
Disconnected operation is a whole 'nother kettle of miscellaneous swimming
things.
One of the reasons I'd like to move to a wandering tree is that it makes data
journalling almost trivial; and if the tree is arranged correctly, it makes it
possible to get a free inode update too - thus allowing the netfs coherency
data to be updated simultaneously.
> it also doesn't matter if the backing FS can't keep up with the server.
> the failure mode can be graceful, so that as the backing FS becomes
> loaded, it passes more requests back to the server and caches less data
> and fewer requests. this is how it works when there is more data to
> cache than there is space to cache it; it should work the same way if
> the I/O rate is higher than the backing FS can handle.
True. I've defined the interface to return -ENOBUFS if we can't cache a file
right now, or just to silently drop the thing and tell the netfs we did
it. The read operation then would return -ENODATA next time, thus indicating
we need to fetch it from the server again.
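
A minimal userspace model of that calling convention (cache_read_page() is a
stand-in here, not the real fscache call; the NFS readpage changes later in
this thread follow broadly the same shape):

#include <errno.h>
#include <stdio.h>

static int cache_read_page(int pgindex)
{
    /* pretend only page 0 is currently resident in the cache */
    if (pgindex == 0)
        return 0;         /* read submitted from the cache */
    return -ENODATA;      /* not in the cache (or -ENOBUFS: can't cache now) */
}

static int netfs_readpage(int pgindex)
{
    int ret = cache_read_page(pgindex);

    if (ret == 0)
        return 0;                         /* served from the local cache */
    if (ret == -ENODATA || ret == -ENOBUFS)
        return 1;                         /* fall back to the server */
    return ret;                           /* genuine error */
}

int main(void)
{
    printf("page 0: %s\n", netfs_readpage(0) ? "server" : "cache");
    printf("page 5: %s\n", netfs_readpage(5) ? "server" : "cache");
    return 0;
}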
The best way to do it is probably to have hysteresis on allocation for
insertion: suspend insertion if the number of free blocks falls below some
limit and re-enable insertion if the number of free blocks rises above a
higher count. Then set the culler running if we drop below the higher limit.
And then, if insertion is suspended, we start returning -ENOBUFS on requests
to cache something. Not only that, but if a netfs wants to update a block, we
can also return -ENOBUFS and steal the block that held the old data (with a
wandering tree that's fairly easy to do). The stolen block can then be
laundered and made available to the allocator again.
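
To make the watermark idea concrete, the insertion/culling decision could be
modelled like this (the numbers, names and flags are invented for
illustration; cachefs would track real block counts):

#include <stdbool.h>
#include <stdio.h>

#define FREE_LOW   64     /* suspend insertion below this many free blocks */
#define FREE_HIGH 256     /* resume insertion above this many free blocks */

struct cache_state {
    unsigned long free_blocks;
    bool insertion_suspended;
    bool culler_running;
};

static void cache_update_watermarks(struct cache_state *c)
{
    if (c->free_blocks < FREE_LOW)
        c->insertion_suspended = true;       /* callers now get -ENOBUFS */
    else if (c->free_blocks > FREE_HIGH)
        c->insertion_suspended = false;

    /* keep the culler busy until we're comfortably above the high mark */
    c->culler_running = (c->free_blocks <= FREE_HIGH);
}

int main(void)
{
    struct cache_state c = { .free_blocks = 300 };
    unsigned long samples[] = { 300, 120, 40, 90, 300 };

    for (unsigned int i = 0; i < 5; i++) {
        c.free_blocks = samples[i];
        cache_update_watermarks(&c);
        printf("free=%lu suspended=%d culler=%d\n",
               c.free_blocks, c.insertion_suspended, c.culler_running);
    }
    return 0;
}

Note how the middle samples show the hysteresis: once insertion is suspended
at 40 free blocks it stays suspended at 90, and only resumes above 256.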
> > Actually, probably the biggest bottleneck is the disk block allocator.
>
> in my experience with the AFS cache manager, this is exactly the
> problem. the ideal case is where the backing FS behaves a lot like swap
> -- just get the bits down onto disk in any location, without any
> sophisticated management of free space. the problem is keeping track of
> the data blocks during a client crash/reboot.
Which is something swap space doesn't need to worry about. It's reinitialised
on boot. Filesystem integrity is not an issue. If we don't care about
integrity, life is easy.
The main problem in the allocator is one of tentative allocation vs journal
update moving the free list pointer. If I can hold off on the latter or just
discard the former and send the tentative block for relaundering, then I can
probably reduce the serialisation problems.
> the real problem arises when the cache is full and you want to cache a
> new file. the cache manager must choose a file to reclaim, release all
> the blocks for that file, then immediately reallocate them for the new
> file. all of this is synchronous activity.
Not exactly. I plan to have cachefs anticipate the need by keeping a float of
free blocks. Whilst this reduces the utilisation of the cache, it should
decrease the allocator latency.
> are there advantages to a log-structured file system for this purpose?
Yes, but there are a lot more problems, and the problems increase with cache
size:
(1) You need to know what's where in the cache; that means scanning the cache
on mount. You could offset this by storing your map in the block device
and only scanning on power failure (when the blockdev wasn't properly
unmounted).
(2) You need to keep a map in memory. I suppose you could keep the map on
disk and rebuild it on umount and mount after power failure. But this
does mean scanning the entire block device.
(3) When the current point in the cache catches up with the tail, what do you
do? Do you just discard the block at the tail? Or do you move it down if
it's not something you want to delete yet? (And how do you decide which?)
This will potentially have the effect of discarding regularly used items
from the cache at regular intervals; particularly if someone uses a data
set larger than the size of the cache.
(4) How do you decide where the current point is? This depends on whether
you're willing to allow pages to overlap "page boundaries" or not. You
could take a leaf out of JFFS2's book and divide the cache into erase
blocks, each of several pages. This would then cut down on the amount of
scanning you need to do, and would make handling small files trivial.
If you can get this right it would be quite cute, but it would make handling
of pinned files awkward. You can't throw away anything that's pinned, but must
slide it down instead. Now imagine half your cache is pinned - you're
potentially going to end up spending a lot of time shovelling stuff down,
unless you can skip blocks that are fully pinned.
> is there a good way to trade disk space for the performance of your
> block allocator?
Potentially. If I can keep a list of tentative allocations and add that to the
journal, then it's easy to zap them during replay. It does, however,
complicate journalling.
> in fact, with an infinitely fast server and network, there would be no
> need for local caching at all. so maybe that's not such an interesting
> thing to consider.
Not really. The only thing it guards against is the server becoming
unavailable.
> it might be more appropriate to design, configure, and measure CacheFS
> with real typical network and server latency numbers in mind.
Yes. What I'm currently using as the basis of my design is accessing a kernel
source tree over the network. That's on the order of 22000 files these days
and 320MB of disk space. That's an average occupancy of about 14.5KB of space
per file.
As an alternative load, I consider what it would take to cache /usr. That's
about 373000 files on my box and 11GB of disk space; that's about 29KB per
file.
> david, what is the behavior when the file that needs to be cached is
> larger than the backing file system? for example, what happens when
> some client application starts reading a large media file that won't fit
> entirely in the cache?
Well, it depends. If it's a large sparse file that we're only going to grab a
few blocks from then it's not a problem, but if it's something we're going to
read all of, then obviously we've got a problem. I think I have to set a limit
on the maximum number of blocks a file can occupy in a cache.
Beyond that, if a file occupies all the blocks of cache it can, then we have
to refuse allocation of more blocks for it. What happens then depends. If the
file is officially pinned by the user, we can't actually get rid of it, and
all we can do is give them rude error messages. If it's merely pinned by
virtue of being held by an fscache cookie, then we could just keep it around
until the cookie is released or we could just start silently recycling the
space it's currently occupying whilst returning -ENOBUFS to all further
attempts to cache more of that file.
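
A tiny sketch of that per-file cap (the limit and the structure are invented
for illustration, not the actual cachefs data structures):

#include <errno.h>
#include <stdio.h>

#define MAX_BLOCKS_PER_FILE 1024    /* e.g. 4MB of cache in 4KB blocks */

struct cached_file {
    unsigned long blocks_used;
};

static int cache_alloc_block(struct cached_file *f)
{
    if (f->blocks_used >= MAX_BLOCKS_PER_FILE) {
        /* if the file isn't officially pinned, its existing blocks could
         * also be silently recycled at this point */
        return -ENOBUFS;    /* caller fetches the data from the server */
    }
    f->blocks_used++;
    return 0;
}

int main(void)
{
    struct cached_file f = { .blocks_used = MAX_BLOCKS_PER_FILE - 1 };

    printf("first alloc:  %d\n", cache_alloc_block(&f));  /* succeeds */
    printf("second alloc: %d\n", cache_alloc_block(&f));  /* -ENOBUFS */
    return 0;
}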
But unless the user gives us a hint, we can't judge in advance what the best
action is. I'd consider O_DIRECT as being a hint, of course, but we may want
to make other options available.
David
* Re: Re: NFS Patch for FSCache
2005-05-18 10:28 ` [Linux-cachefs] " David Howells
@ 2005-05-19 2:18 ` Troy Benjegerdes
2005-05-19 6:48 ` David Masover
0 siblings, 1 reply; 14+ messages in thread
From: Troy Benjegerdes @ 2005-05-19 2:18 UTC (permalink / raw)
To: Linux filesystem caching discussion list; +Cc: linux-fsdevel
> Reading one really big file (bigger than the memory available) over AFS, with
> a cold cache it took very roughly 107% of the time it took with no cache; but
> using a warm cache, it took 14% of the time it took with no cache. However,
> this is on my particular test box, and it varies a lot from box to box.
What network did that box have?
I'm finding that with OpenAFS, and memcache, read performance is
affected greatly by the -chunksize argument to afsd. Using -chunksize 20
(1MB chunks) gets me around 50MB/sec, while -chunksize 18 gets
5-7MB/sec. (I believe that's the size of the 'fetchrpc' calls)
Another question with the afs client.. I'd really like to use the kafs
client to mount a root filesystem, then use OpenAFS to mount /afs so
I can have read/write support. I went so far as to patch kafs to mount
as type 'kafs', but then found that both clients want to listen on port
7000. Can I either
a) change the port for kafs
b) get working read-write and auth support for kafs?
I'm guessing a) is much more likely..
* Re: Re: NFS Patch for FSCache
2005-05-19 2:18 ` Troy Benjegerdes
@ 2005-05-19 6:48 ` David Masover
0 siblings, 0 replies; 14+ messages in thread
From: David Masover @ 2005-05-19 6:48 UTC (permalink / raw)
To: Linux filesystem caching discussion list; +Cc: linux-fsdevel
Troy Benjegerdes wrote:
>>Reading one really big file (bigger than the memory available) over AFS, with
>>a cold cache it took very roughly 107% of the time it took with no cache; but
>>using a warm cache, it took 14% of the time it took with no cache. However,
>>this is on my particular test box, and it varies a lot from box to box.
>
>
> What network did that box have?
>
> I'm finding that with OpenAFS, and memcache, read performance is
> affected greatly by the -chunksize argument to afsd. Using -chunksize 20
> (1MB chunks) gets me around 50MB/sec, while -chunksize 18 gets
> 5-7MB/sec. (I believe that's the size of the 'fetchrpc' calls)
>
> Another question with the afs client.. I'd really like to use the kafs
> client to mount a root filesystem, then use OpenAFS to mount /afs so
> I can have read/write support. I went so far as to patch kafs to mount
> as type 'kafs', but then found that both clients want to listen on port
> 7000. Can I either
>
> a) change the port for kafs
> b) get working read-write and auth support for kafs?
>
> I'm guessing a) is much more likely..
How about a more hack-ish solution?
An initrd environment of some sort. This especially makes sense if you
use a hard disk partition -- assuming the local disk is only used for
local cache, it's a good idea.
How I'd do it:
If kernel and initrd are server-side (as in a PXE boot):
- initrd copies itself to tmpfs and then deallocates itself, so that it
  can be swapped out
- new tmpfs-based initrd has an OpenAFS client, and mounts /afs
- root is somewhere in /afs, and the initrd pivot_root's to it.
If kernel and initrd are client-side (hard drive):
- as above, except:
- initrd checks for a new version of itself
- if there is a new version of kernel/initrd, it:
  - downloads both a new kernel and a new initrd
  - saves them to disk, so we don't re-download next boot
  - kexecs the new kernel/initrd
  - assuming the upgrade worked, the new kernel/initrd will:
    - check for an update again
    - won't find one and will continue booting.
Also, we don't actually need a ramdisk -- we can use a partition. This
is a little easier to manage, because we don't have to worry about
freeing the ramdisk or how much swap we need or things like that.
So, this is a bit harder to maintain than what you're suggesting, but
it's workable. The admin just needs a good set of tools to rebuild the
initrd every kernel upgrade, or when he decides to add stuff to it.
But, it provides a pretty decent working environment once it's done, and
it puts everything except the AFS client on the AFS root fs. It also
avoids using the kafs client at all -- all network stuff is done in
userspace.
* Re: NFS Patch for FSCache
2005-05-09 10:31 NFS Patch for FSCache Steve Dickson
2005-05-09 21:19 ` Andrew Morton
@ 2005-06-13 12:52 ` Steve Dickson
1 sibling, 0 replies; 14+ messages in thread
From: Steve Dickson @ 2005-06-13 12:52 UTC (permalink / raw)
To: Linux filesystem caching discussion list
Cc: Andrew Morton, linux-fsdevel, Trond Myklebust
[-- Attachment #1: Type: text/plain, Size: 349 bytes --]
I noticed that a number of NFS patches went into
2.6.12-rc6-mm1, so I wanted to make sure the patches I
posted, which allow NFS to use cachefs, had not
become stale. It turns out they hadn't, and they
still work as they did in rc3-mm3... But I figured
I would repost them anyway in hopes of getting them
reviewed and accepted in the -mm tree.
steved.
[-- Attachment #2: 2.6.12-rc6-mm1-nfs-fscache.patch --]
[-- Type: text/x-patch, Size: 26574 bytes --]
This patch enables NFS to use file system caching (i.e. FSCache).
To turn this feature on, you must specify the -o fsc mount flag
as well as have a cachefs partition mounted.
Signed-off-by: Steve Dickson <steved@redhat.com>
--- 2.6.12-rc5-mm2/fs/nfs/file.c.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/file.c 2005-06-05 11:44:48.000000000 -0400
@@ -27,9 +27,11 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <linux/buffer_head.h>
#include <asm/uaccess.h>
#include <asm/system.h>
+#include "nfs-fscache.h"
#include "delegation.h"
@@ -194,6 +196,12 @@ nfs_file_sendfile(struct file *filp, lof
return res;
}
+static int nfs_file_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ wait_on_page_fs_misc(page);
+ return 0;
+}
+
static int
nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
{
@@ -207,6 +215,10 @@ nfs_file_mmap(struct file * file, struct
status = nfs_revalidate_inode(NFS_SERVER(inode), inode);
if (!status)
status = generic_file_mmap(file, vma);
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ vma->vm_ops->page_mkwrite = nfs_file_page_mkwrite;
+
return status;
}
@@ -258,6 +270,11 @@ static int nfs_commit_write(struct file
return status;
}
+/*
+ * since we use page->private for our own nefarious purposes when using fscache, we have to
+ * override extra address space ops to prevent fs/buffer.c from getting confused, even though we
+ * may not have asked its opinion
+ */
struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -269,6 +286,11 @@ struct address_space_operations nfs_file
#ifdef CONFIG_NFS_DIRECTIO
.direct_IO = nfs_direct_IO,
#endif
+#ifdef CONFIG_NFS_FSCACHE
+ .sync_page = block_sync_page,
+ .releasepage = nfs_releasepage,
+ .invalidatepage = nfs_invalidatepage,
+#endif
};
/*
--- 2.6.12-rc5-mm2/fs/nfs/inode.c.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/inode.c 2005-06-05 11:44:48.000000000 -0400
@@ -42,6 +42,8 @@
#include "nfs4_fs.h"
#include "delegation.h"
+#include "nfs-fscache.h"
+
#define NFSDBG_FACILITY NFSDBG_VFS
#define NFS_PARANOIA 1
@@ -169,6 +171,10 @@ nfs_clear_inode(struct inode *inode)
cred = nfsi->cache_access.cred;
if (cred)
put_rpccred(cred);
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_clear_fscookie(nfsi);
+
BUG_ON(atomic_read(&nfsi->data_updates) != 0);
}
@@ -503,6 +509,9 @@ nfs_fill_super(struct super_block *sb, s
server->namelen = NFS2_MAXNAMLEN;
}
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_fill_fscookie(sb);
+
sb->s_op = &nfs_sops;
return nfs_sb_init(sb, authflavor);
}
@@ -579,6 +588,7 @@ static int nfs_show_options(struct seq_f
{ NFS_MOUNT_NOAC, ",noac", "" },
{ NFS_MOUNT_NONLM, ",nolock", ",lock" },
{ NFS_MOUNT_NOACL, ",noacl", "" },
+ { NFS_MOUNT_FSCACHE, ",fscache", "" },
{ 0, NULL, NULL }
};
struct proc_nfs_info *nfs_infop;
@@ -623,6 +633,9 @@ nfs_zap_caches(struct inode *inode)
nfsi->flags |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
else
nfsi->flags |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_zap_fscookie(nfsi);
}
static void nfs_zap_acl_cache(struct inode *inode)
@@ -770,6 +783,9 @@ nfs_fhget(struct super_block *sb, struct
memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
nfsi->cache_access.cred = NULL;
+ if (NFS_SB(sb)->flags & NFS_MOUNT_FSCACHE)
+ nfs_fhget_fscookie(sb, nfsi);
+
unlock_new_inode(inode);
} else
nfs_refresh_inode(inode, fattr);
@@ -1076,6 +1092,9 @@ __nfs_revalidate_inode(struct nfs_server
(long long)NFS_FILEID(inode));
/* This ensures we revalidate dentries */
nfsi->cache_change_attribute++;
+
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_renew_fscookie(server, nfsi);
}
if (flags & NFS_INO_INVALID_ACL)
nfs_zap_acl_cache(inode);
@@ -1515,6 +1534,14 @@ static struct super_block *nfs_get_sb(st
goto out_err;
}
+#ifndef CONFIG_NFS_FSCACHE
+ if (data->flags & NFS_MOUNT_FSCACHE) {
+ printk(KERN_WARNING "NFS: kernel not compiled with CONFIG_NFS_FSCACHE\n");
+ kfree(server);
+ return ERR_PTR(-EINVAL);
+ }
+#endif
+
s = sget(fs_type, nfs_compare_super, nfs_set_super, server);
if (IS_ERR(s) || s->s_root)
goto out_rpciod_down;
@@ -1542,6 +1569,9 @@ static void nfs_kill_super(struct super_
kill_anon_super(s);
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_kill_fscookie(server);
+
if (server->client != NULL && !IS_ERR(server->client))
rpc_shutdown_client(server->client);
if (server->client_sys != NULL && !IS_ERR(server->client_sys))
@@ -1760,6 +1790,9 @@ static int nfs4_fill_super(struct super_
sb->s_time_gran = 1;
+ if (server->flags & NFS4_MOUNT_FSCACHE)
+ nfs4_fill_fscookie(sb);
+
sb->s_op = &nfs4_sops;
err = nfs_sb_init(sb, authflavour);
if (err == 0)
@@ -1903,6 +1936,9 @@ static void nfs4_kill_super(struct super
nfs_return_all_delegations(sb);
kill_anon_super(sb);
+ if (server->flags & NFS_MOUNT_FSCACHE)
+ nfs_kill_fscookie(server);
+
nfs4_renewd_prepare_shutdown(server);
if (server->client != NULL && !IS_ERR(server->client))
@@ -2021,6 +2057,11 @@ static int __init init_nfs_fs(void)
{
int err;
+ /* we want to be able to cache */
+ err = nfs_register_netfs();
+ if (err < 0)
+ goto out5;
+
err = nfs_init_nfspagecache();
if (err)
goto out4;
@@ -2068,6 +2109,9 @@ out2:
out3:
nfs_destroy_nfspagecache();
out4:
+ nfs_unregister_netfs();
+out5:
+
return err;
}
@@ -2080,6 +2124,7 @@ static void __exit exit_nfs_fs(void)
nfs_destroy_readpagecache();
nfs_destroy_inodecache();
nfs_destroy_nfspagecache();
+ nfs_unregister_netfs();
#ifdef CONFIG_PROC_FS
rpc_proc_unregister("nfs");
#endif
--- 2.6.12-rc5-mm2/fs/nfs/Makefile.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/Makefile 2005-06-05 11:44:48.000000000 -0400
@@ -13,4 +13,5 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4x
delegation.o idmap.o \
callback.o callback_xdr.o callback_proc.o
nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
+nfs-$(CONFIG_NFS_FSCACHE) += nfs-fscache.o
nfs-objs := $(nfs-y)
--- /dev/null 2005-06-05 03:42:13.591137792 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/nfs-fscache.c 2005-06-05 11:44:48.000000000 -0400
@@ -0,0 +1,191 @@
+/* nfs-fscache.c: NFS filesystem cache interface
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+
+#include <linux/config.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+
+#include "nfs-fscache.h"
+
+#define NFS_CACHE_FH_INDEX_SIZE sizeof(struct nfs_fh)
+
+/*
+ * the root index is
+ */
+static struct fscache_page *nfs_cache_get_page_token(struct page *page);
+
+static struct fscache_netfs_operations nfs_cache_ops = {
+ .get_page_token = nfs_cache_get_page_token,
+};
+
+struct fscache_netfs nfs_cache_netfs = {
+ .name = "nfs",
+ .version = 0,
+ .ops = &nfs_cache_ops,
+};
+
+/*
+ * the root index for the filesystem is defined by nfsd IP address and ports
+ */
+static fscache_match_val_t nfs_cache_server_match(void *target,
+ const void *entry);
+static void nfs_cache_server_update(void *source, void *entry);
+
+struct fscache_index_def nfs_cache_server_index_def = {
+ .name = "servers",
+ .data_size = 18,
+ .keys[0] = { FSCACHE_INDEX_KEYS_IPV6ADDR, 16 },
+ .keys[1] = { FSCACHE_INDEX_KEYS_BIN, 2 },
+ .match = nfs_cache_server_match,
+ .update = nfs_cache_server_update,
+};
+
+/*
+ * the primary index for each server is simply made up of a series of NFS file
+ * handles
+ */
+static fscache_match_val_t nfs_cache_fh_match(void *target, const void *entry);
+static void nfs_cache_fh_update(void *source, void *entry);
+
+struct fscache_index_def nfs_cache_fh_index_def = {
+ .name = "fh",
+ .data_size = NFS_CACHE_FH_INDEX_SIZE,
+ .keys[0] = { FSCACHE_INDEX_KEYS_BIN_SZ2,
+ sizeof(struct nfs_fh) },
+ .match = nfs_cache_fh_match,
+ .update = nfs_cache_fh_update,
+};
+
+/*
+ * get a page token for the specified page
+ * - the token will be attached to page->private and PG_private will be set on
+ * the page
+ */
+static struct fscache_page *nfs_cache_get_page_token(struct page *page)
+{
+ return fscache_page_get_private(page, GFP_NOIO);
+}
+
+static const uint8_t nfs_cache_ipv6_wrapper_for_ipv4[12] = {
+ [0 ... 9] = 0x00,
+ [10 ... 11] = 0xff
+};
+
+/*
+ * match a server record obtained from the cache
+ */
+static fscache_match_val_t nfs_cache_server_match(void *target,
+ const void *entry)
+{
+ struct nfs_server *server = target;
+ const uint8_t *data = entry;
+
+ switch (server->addr.sin_family) {
+ case AF_INET:
+ if (memcmp(data + 0,
+ &nfs_cache_ipv6_wrapper_for_ipv4,
+ 12) != 0)
+ break;
+
+ if (memcmp(data + 12, &server->addr.sin_addr, 4) != 0)
+ break;
+
+ if (memcmp(data + 16, &server->addr.sin_port, 2) != 0)
+ break;
+
+ return FSCACHE_MATCH_SUCCESS;
+
+ case AF_INET6:
+ if (memcmp(data + 0, &server->addr.sin_addr, 16) != 0)
+ break;
+
+ if (memcmp(data + 16, &server->addr.sin_port, 2) != 0)
+ break;
+
+ return FSCACHE_MATCH_SUCCESS;
+
+ default:
+ break;
+ }
+
+ return FSCACHE_MATCH_FAILED;
+}
+
+/*
+ * update a server record in the cache
+ */
+static void nfs_cache_server_update(void *source, void *entry)
+{
+ struct nfs_server *server = source;
+ uint8_t *data = entry;
+
+ switch (server->addr.sin_family) {
+ case AF_INET:
+ memcpy(data + 0, &nfs_cache_ipv6_wrapper_for_ipv4, 12);
+ memcpy(data + 12, &server->addr.sin_addr, 4);
+ memcpy(data + 16, &server->addr.sin_port, 2);
+ return;
+
+ case AF_INET6:
+ memcpy(data + 0, &server->addr.sin_addr, 16);
+ memcpy(data + 16, &server->addr.sin_port, 2);
+ return;
+
+ default:
+ return;
+ }
+}
+
+/*
+ * match a file handle record obtained from the cache
+ */
+static fscache_match_val_t nfs_cache_fh_match(void *target, const void *entry)
+{
+ struct nfs_inode *nfsi = target;
+ const uint8_t *data = entry;
+ uint16_t nsize;
+
+ /* check the file handle matches */
+ memcpy(&nsize, data, 2);
+ nsize = ntohs(nsize);
+
+ if (nsize <= NFS_CACHE_FH_INDEX_SIZE && nfsi->fh.size == nsize) {
+ if (memcmp(data + 2, nfsi->fh.data, nsize) == 0) {
+ return FSCACHE_MATCH_SUCCESS;
+ }
+ }
+
+ return FSCACHE_MATCH_FAILED;
+}
+
+/*
+ * update a fh record in the cache
+ */
+static void nfs_cache_fh_update(void *source, void *entry)
+{
+ struct nfs_inode *nfsi = source;
+ uint16_t nsize;
+ uint8_t *data = entry;
+
+ BUG_ON(nfsi->fh.size > NFS_CACHE_FH_INDEX_SIZE - 2);
+
+ /* set the file handle */
+ nsize = htons(nfsi->fh.size);
+ memcpy(data, &nsize, 2);
+ memcpy(data + 2, &nfsi->fh.data, nfsi->fh.size);
+ memset(data + 2 + nfsi->fh.size,
+ FSCACHE_INDEX_DEADFILL_PATTERN,
+ NFS_CACHE_FH_INDEX_SIZE - 2 - nfsi->fh.size);
+}
--- /dev/null 2005-06-05 03:42:13.591137792 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/nfs-fscache.h 2005-06-05 11:44:48.000000000 -0400
@@ -0,0 +1,158 @@
+/* nfs-fscache.h: NFS filesystem cache interface definitions
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NFS_FSCACHE_H
+#define _NFS_FSCACHE_H
+
+#include <linux/nfs_mount.h>
+#include <linux/nfs4_mount.h>
+#include <linux/fscache.h>
+
+#ifdef CONFIG_NFS_FSCACHE
+#ifndef CONFIG_FSCACHE
+#error "CONFIG_NFS_FSCACHE is defined but not CONFIG_FSCACHE"
+#endif
+
+extern struct fscache_netfs nfs_cache_netfs;
+extern struct fscache_index_def nfs_cache_server_index_def;
+extern struct fscache_index_def nfs_cache_fh_index_def;
+
+extern int nfs_invalidatepage(struct page *, unsigned long);
+extern int nfs_releasepage(struct page *, int);
+extern int nfs_mkwrite(struct page *);
+
+static inline void
+nfs_renew_fscookie(struct nfs_server *server, struct nfs_inode *nfsi)
+{
+ struct fscache_cookie *old = nfsi->fscache;
+
+ /* retire the current fscache cache and get a new one */
+ fscache_relinquish_cookie(nfsi->fscache, 1);
+ nfsi->fscache = fscache_acquire_cookie(server->fscache, NULL, nfsi);
+
+ dfprintk(FSCACHE,
+ "NFS: revalidation new cookie (0x%p/0x%p/0x%p/0x%p)\n",
+ server, nfsi, old, nfsi->fscache);
+
+ return;
+}
+static inline void
+nfs4_fill_fscookie(struct super_block *sb)
+{
+ struct nfs_server *server = NFS_SB(sb);
+
+ /* create a cache index for looking up filehandles */
+ server->fscache = fscache_acquire_cookie(nfs_cache_netfs.primary_index,
+ &nfs_cache_fh_index_def, server);
+ if (server->fscache == NULL) {
+ printk(KERN_WARNING "NFS4: No Fscache cookie. Turning Fscache off!\n");
+ } else /* reuse the NFS mount option */
+ server->flags |= NFS_MOUNT_FSCACHE;
+
+ dfprintk(FSCACHE,"NFS: nfs4 cookie (0x%p,0x%p/0x%p)\n",
+ sb, server, server->fscache);
+
+ return;
+}
+static inline void
+nfs_fill_fscookie(struct super_block *sb)
+{
+ struct nfs_server *server = NFS_SB(sb);
+
+ /* create a cache index for looking up filehandles */
+ server->fscache = fscache_acquire_cookie(nfs_cache_netfs.primary_index,
+ &nfs_cache_fh_index_def, server);
+ if (server->fscache == NULL) {
+ server->flags &= ~NFS_MOUNT_FSCACHE;
+ printk(KERN_WARNING "NFS: No Fscache cookie. Turning Fscache off!\n");
+ }
+ dfprintk(FSCACHE,"NFS: cookie (0x%p/0x%p/0x%p)\n",
+ sb, server, server->fscache);
+
+ return;
+}
+static inline void
+nfs_fhget_fscookie(struct super_block *sb, struct nfs_inode *nfsi)
+{
+ struct nfs_server *server = NFS_SB(sb);
+
+ nfsi->fscache = fscache_acquire_cookie(server->fscache, NULL, nfsi);
+ if (server->fscache == NULL)
+ printk(KERN_WARNING "NFS: NULL FScache cookie: sb 0x%p nfsi 0x%p\n", sb, nfsi);
+
+ dfprintk(FSCACHE, "NFS: fhget new cookie (0x%p/0x%p/0x%p)\n",
+ sb, nfsi, nfsi->fscache);
+
+ return;
+}
+static inline void
+nfs_kill_fscookie(struct nfs_server *server)
+{
+ dfprintk(FSCACHE,"NFS: killing cookie (0x%p/0x%p)\n",
+ server, server->fscache);
+
+ fscache_relinquish_cookie(server->fscache, 0);
+ server->fscache = NULL;
+
+ return;
+}
+static inline void
+nfs_clear_fscookie(struct nfs_inode *nfsi)
+{
+ dfprintk(FSCACHE, "NFS: clear cookie (0x%p/0x%p)\n",
+ nfsi, nfsi->fscache);
+
+ fscache_relinquish_cookie(nfsi->fscache, 0);
+ nfsi->fscache = NULL;
+
+ return;
+}
+static inline void
+nfs_zap_fscookie(struct nfs_inode *nfsi)
+{
+ dfprintk(FSCACHE,"NFS: zapping cookie (0x%p/0x%p)\n",
+ nfsi, nfsi->fscache);
+
+ fscache_relinquish_cookie(nfsi->fscache, 1);
+ nfsi->fscache = NULL;
+
+ return;
+}
+static inline int
+nfs_register_netfs(void)
+{
+ int err;
+
+ err = fscache_register_netfs(&nfs_cache_netfs, &nfs_cache_server_index_def);
+
+ return err;
+}
+static inline void
+nfs_unregister_netfs(void)
+{
+ fscache_unregister_netfs(&nfs_cache_netfs);
+
+ return;
+}
+#else
+static inline void nfs_fill_fscookie(struct super_block *sb) {}
+static inline void nfs_fhget_fscookie(struct super_block *sb, struct nfs_inode *nfsi) {}
+static inline void nfs4_fill_fscookie(struct super_block *sb) {}
+static inline void nfs_kill_fscookie(struct nfs_server *server) {}
+static inline void nfs_clear_fscookie(struct nfs_inode *nfsi) {}
+static inline void nfs_zap_fscookie(struct nfs_inode *nfsi) {}
+static inline void
+ nfs_renew_fscookie(struct nfs_server *server, struct nfs_inode *nfsi) {}
+static inline int nfs_register_netfs() { return 0; }
+static inline void nfs_unregister_netfs() {}
+
+#endif
+#endif /* _NFS_FSCACHE_H */
--- 2.6.12-rc5-mm2/fs/nfs/read.c.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/read.c 2005-06-05 11:44:48.000000000 -0400
@@ -27,6 +27,7 @@
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
+#include <linux/nfs_mount.h>
#include <linux/smp_lock.h>
#include <asm/system.h>
@@ -73,6 +74,47 @@ int nfs_return_empty_page(struct page *p
return 0;
}
+#ifdef CONFIG_NFS_FSCACHE
+/*
+ * store a newly fetched page in fscache
+ */
+static void
+nfs_readpage_to_fscache_complete(void *cookie_data, struct page *page, void *data, int error)
+{
+ dprintk("NFS: readpage_to_fscache_complete (%p/%p/%p/%d)\n",
+ cookie_data, page, data, error);
+
+ end_page_fs_misc(page);
+}
+
+static inline void
+nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+ int ret;
+
+ dprintk("NFS: readpage_to_fscache(0x%p/0x%p/0x%p/%d)\n",
+ NFS_I(inode)->fscache, page, inode, sync);
+
+ SetPageFsMisc(page);
+ ret = fscache_write_page(NFS_I(inode)->fscache, page,
+ nfs_readpage_to_fscache_complete, NULL, GFP_KERNEL);
+ if (ret != 0) {
+ dprintk("NFS: readpage_to_fscache: error %d\n", ret);
+ fscache_uncache_page(NFS_I(inode)->fscache, page);
+ ClearPageFsMisc(page);
+ }
+
+ unlock_page(page);
+}
+#else
+static inline void
+nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+ BUG();
+}
+#endif
+
+
/*
* Read a page synchronously.
*/
@@ -149,6 +191,13 @@ static int nfs_readpage_sync(struct nfs_
ClearPageError(page);
result = 0;
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_readpage_to_fscache(inode, page, 1);
+ else
+ unlock_page(page);
+
+ return result;
+
io_error:
unlock_page(page);
nfs_readdata_free(rdata);
@@ -180,7 +229,13 @@ static int nfs_readpage_async(struct nfs
static void nfs_readpage_release(struct nfs_page *req)
{
- unlock_page(req->wb_page);
+ struct inode *d_inode = req->wb_context->dentry->d_inode;
+
+ if ((NFS_SERVER(d_inode)->flags & NFS_MOUNT_FSCACHE) &&
+ PageUptodate(req->wb_page))
+ nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+ else
+ unlock_page(req->wb_page);
nfs_clear_request(req);
nfs_release_request(req);
@@ -477,6 +532,67 @@ void nfs_readpage_result(struct rpc_task
data->complete(data, status);
}
+
+/*
+ * Read a page through the on-disc cache if possible
+ */
+#ifdef CONFIG_NFS_FSCACHE
+static void
+nfs_readpage_from_fscache_complete(void *cookie_data, struct page *page, void *data, int error)
+{
+ dprintk("NFS: readpage_from_fscache_complete (0x%p/0x%p/0x%p/%d)\n",
+ cookie_data, page, data, error);
+
+ if (error)
+ SetPageError(page);
+ else
+ SetPageUptodate(page);
+
+ unlock_page(page);
+}
+
+static inline int
+nfs_readpage_from_fscache(struct inode *inode, struct page *page)
+{
+ struct fscache_page *pageio;
+ int ret;
+
+ dprintk("NFS: readpage_from_fscache(0x%p/0x%p/0x%p)\n",
+ NFS_I(inode)->fscache, page, inode);
+
+ pageio = fscache_page_get_private(page, GFP_NOIO);
+ if (IS_ERR(pageio)) {
+ dprintk("NFS: fscache_page_get_private error %ld\n", PTR_ERR(pageio));
+ return PTR_ERR(pageio);
+ }
+
+ ret = fscache_read_or_alloc_page(NFS_I(inode)->fscache,
+ page,
+ nfs_readpage_from_fscache_complete,
+ NULL,
+ GFP_KERNEL);
+
+ switch (ret) {
+ case 1: /* read BIO submitted and wb-journal entry found */
+ BUG();
+
+ case 0: /* read BIO submitted (page in fscache) */
+ return ret;
+
+ case -ENOBUFS: /* inode not in cache */
+ case -ENODATA: /* page not in cache */
+ dprintk("NFS: fscache_read_or_alloc_page error %d\n", ret);
+ return 1;
+
+ default:
+ return ret;
+ }
+}
+#else
+static inline int
+nfs_readpage_from_fscache(struct inode *inode, struct page *page) { return 1; }
+#endif
+
/*
* Read a page over NFS.
* We read the page synchronously in the following case:
@@ -510,6 +626,13 @@ int nfs_readpage(struct file *file, stru
ctx = get_nfs_open_context((struct nfs_open_context *)
file->private_data);
if (!IS_SYNC(inode)) {
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE) {
+ error = nfs_readpage_from_fscache(inode, page);
+ if (error < 0)
+ goto out_error;
+ if (error == 0)
+ return error;
+ }
error = nfs_readpage_async(ctx, inode, page);
goto out;
}
@@ -540,6 +663,15 @@ readpage_async_filler(void *data, struct
unsigned int len;
nfs_wb_page(inode, page);
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE) {
+ int error = nfs_readpage_from_fscache(inode, page);
+ if (error < 0)
+ return error;
+ if (error == 0)
+ return error;
+ }
+
len = nfs_page_length(inode, page);
if (len == 0)
return nfs_return_empty_page(page);
@@ -613,3 +745,61 @@ void nfs_destroy_readpagecache(void)
if (kmem_cache_destroy(nfs_rdata_cachep))
printk(KERN_INFO "nfs_read_data: not all structures were freed\n");
}
+
+#ifdef CONFIG_NFS_FSCACHE
+int nfs_invalidatepage(struct page *page, unsigned long offset)
+{
+ int ret = 1;
+ struct nfs_server *server = NFS_SERVER(page->mapping->host);
+
+ BUG_ON(!PageLocked(page));
+
+ if (server->flags & NFS_MOUNT_FSCACHE) {
+ if (PagePrivate(page)) {
+ struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+ dfprintk(PAGECACHE,"NFS: fscache invalidatepage (0x%p/0x%p/0x%p)\n",
+ nfsi->fscache, page, nfsi);
+
+ fscache_uncache_page(nfsi->fscache, page);
+
+ if (offset == 0) {
+ BUG_ON(!PageLocked(page));
+ ret = 0;
+ if (!PageWriteback(page))
+ ret = page->mapping->a_ops->releasepage(page, 0);
+ }
+ }
+ } else
+ ret = 0;
+
+ return ret;
+}
+int nfs_releasepage(struct page *page, int gfp_flags)
+{
+ struct fscache_page *pageio;
+ struct nfs_server *server = NFS_SERVER(page->mapping->host);
+
+ if (server->flags & NFS_MOUNT_FSCACHE && PagePrivate(page)) {
+ struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+ dfprintk(PAGECACHE,"NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
+ nfsi->fscache, page, nfsi);
+
+ fscache_uncache_page(nfsi->fscache, page);
+ pageio = (struct fscache_page *) page->private;
+ page->private = 0;
+ ClearPagePrivate(page);
+
+ if (pageio)
+ kfree(pageio);
+ }
+
+ return 0;
+}
+int nfs_mkwrite(struct page *page)
+{
+ wait_on_page_fs_misc(page);
+ return 0;
+}
+#endif
--- 2.6.12-rc5-mm2/fs/nfs/write.c.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/fs/nfs/write.c 2005-06-05 11:44:48.000000000 -0400
@@ -255,6 +255,38 @@ static int wb_priority(struct writeback_
}
/*
+ * store an updated page in fscache
+ */
+#ifdef CONFIG_NFS_FSCACHE
+static void
+nfs_writepage_to_fscache_complete(void *cookie_data, struct page *page, void *data, int error)
+{
+ /* really need to synchronise the end of writeback, probably using a page flag */
+}
+static inline void
+nfs_writepage_to_fscache(struct inode *inode, struct page *page)
+{
+ int ret;
+
+ dprintk("NFS: writepage_to_fscache (0x%p/0x%p/0x%p)\n",
+ NFS_I(inode)->fscache, page, inode);
+
+ ret = fscache_write_page(NFS_I(inode)->fscache, page,
+ nfs_writepage_to_fscache_complete, NULL, GFP_KERNEL);
+ if (ret != 0) {
+ dprintk("NFS: fscache_write_page error %d\n", ret);
+ fscache_uncache_page(NFS_I(inode)->fscache, page);
+ }
+}
+#else
+static inline void
+nfs_writepage_to_fscache(struct inode *inode, struct page *page)
+{
+ BUG();
+}
+#endif
+
+/*
* Write an mmapped page to the server.
*/
int nfs_writepage(struct page *page, struct writeback_control *wbc)
@@ -299,6 +331,10 @@ do_it:
err = -EBADF;
goto out;
}
+
+ if (NFS_SERVER(inode)->flags & NFS_MOUNT_FSCACHE)
+ nfs_writepage_to_fscache(inode, page);
+
lock_kernel();
if (!IS_SYNC(inode) && inode_referenced) {
err = nfs_writepage_async(ctx, inode, page, 0, offset);
--- 2.6.12-rc5-mm2/fs/Kconfig.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/fs/Kconfig 2005-06-05 11:44:48.000000000 -0400
@@ -1495,6 +1495,13 @@ config NFS_V4
If unsure, say N.
+config NFS_FSCACHE
+ bool "Provide NFS client caching support (EXPERIMENTAL)"
+ depends on NFS_FS && FSCACHE && EXPERIMENTAL
+ help
+ Say Y here if you want NFS data to be cached locally on disc through
+ the general filesystem cache manager
+
config NFS_DIRECTIO
bool "Allow direct I/O on NFS files (EXPERIMENTAL)"
depends on NFS_FS && EXPERIMENTAL
--- 2.6.12-rc5-mm2/include/linux/nfs_fs.h.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/include/linux/nfs_fs.h 2005-06-05 11:44:48.000000000 -0400
@@ -29,6 +29,7 @@
#include <linux/nfs_xdr.h>
#include <linux/rwsem.h>
#include <linux/mempool.h>
+#include <linux/fscache.h>
/*
* Enable debugging support for nfs client.
@@ -184,6 +185,11 @@ struct nfs_inode {
int delegation_state;
struct rw_semaphore rwsem;
#endif /* CONFIG_NFS_V4*/
+
+#ifdef CONFIG_NFS_FSCACHE
+ struct fscache_cookie *fscache;
+#endif
+
struct inode vfs_inode;
};
@@ -564,6 +570,7 @@ extern void * nfs_root_data(void);
#define NFSDBG_FILE 0x0040
#define NFSDBG_ROOT 0x0080
#define NFSDBG_CALLBACK 0x0100
+#define NFSDBG_FSCACHE 0x0200
#define NFSDBG_ALL 0xFFFF
#ifdef __KERNEL__
--- 2.6.12-rc5-mm2/include/linux/nfs_fs_sb.h.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/include/linux/nfs_fs_sb.h 2005-06-05 11:44:48.000000000 -0400
@@ -3,6 +3,7 @@
#include <linux/list.h>
#include <linux/backing-dev.h>
+#include <linux/fscache.h>
/*
* NFS client parameters stored in the superblock.
@@ -47,6 +48,10 @@ struct nfs_server {
that are supported on this
filesystem */
#endif
+
+#ifdef CONFIG_NFS_FSCACHE
+ struct fscache_cookie *fscache; /* cache cookie */
+#endif
};
/* Server capabilities */
--- 2.6.12-rc5-mm2/include/linux/nfs_mount.h.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/include/linux/nfs_mount.h 2005-06-05 11:44:48.000000000 -0400
@@ -61,6 +61,7 @@ struct nfs_mount_data {
#define NFS_MOUNT_NOACL 0x0800 /* 4 */
#define NFS_MOUNT_STRICTLOCK 0x1000 /* reserved for NFSv4 */
#define NFS_MOUNT_SECFLAVOUR 0x2000 /* 5 */
+#define NFS_MOUNT_FSCACHE 0x3000
#define NFS_MOUNT_FLAGMASK 0xFFFF
#endif
--- 2.6.12-rc5-mm2/include/linux/nfs4_mount.h.orig 2005-06-05 11:44:35.000000000 -0400
+++ 2.6.12-rc5-mm2/include/linux/nfs4_mount.h 2005-06-05 11:44:48.000000000 -0400
@@ -65,6 +65,7 @@ struct nfs4_mount_data {
#define NFS4_MOUNT_NOCTO 0x0010 /* 1 */
#define NFS4_MOUNT_NOAC 0x0020 /* 1 */
#define NFS4_MOUNT_STRICTLOCK 0x1000 /* 1 */
+#define NFS4_MOUNT_FSCACHE 0x2000 /* 1 */
#define NFS4_MOUNT_FLAGMASK 0xFFFF
#endif
[-- Attachment #3: 2.6.12-rc6-mm1-fscache-cookie-exist.patch --]
[-- Type: text/x-patch, Size: 669 bytes --]
Fails a second NFS mount with EEXIST instead of an oops.
Signed-off-by: Steve Dickson <steved@redhat.com>
--- 2.6.12-rc3-mm3/fs/fscache/cookie.c.orig 2005-05-07 09:30:28.000000000 -0400
+++ 2.6.12-rc3-mm3/fs/fscache/cookie.c 2005-05-07 11:01:39.000000000 -0400
@@ -452,7 +452,11 @@ static int fscache_search_for_object(str
cache->ops->lock_node(node);
/* a node should only ever be attached to one cookie */
- BUG_ON(!list_empty(&node->cookie_link));
+ if (!list_empty(&node->cookie_link)) {
+ cache->ops->unlock_node(node);
+ ret = -EEXIST;
+ goto error;
+ }
/* attach the node to the cache's node list */
if (list_empty(&node->cache_link)) {
[-- Attachment #4: 2.6.12-rc6-mm1-cachefs-wb.patch --]
[-- Type: text/x-patch, Size: 594 bytes --]
This fixes a BUG() popping at mm/filemap.c:465 when reading a
100MB file using nfs4.
Signed-off-by: Steve Dickson <steved@redhat.com>
--- 2.6.12-rc2-mm3/fs/cachefs/journal.c.save 2005-04-27 08:06:03.000000000 -0400
+++ 2.6.12-rc2-mm3/fs/cachefs/journal.c 2005-05-03 11:11:17.000000000 -0400
@@ -682,6 +682,7 @@ static inline void cachefs_trans_batch_p
list_add_tail(&block->batch_link, plist);
block->writeback = block->page;
get_page(block->writeback);
+ SetPageWriteback(block->writeback);
/* make sure DMA can reach the data */
flush_dcache_page(block->writeback);