* [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit
@ 2006-04-20 16:59 David Howells
  2006-04-20 16:59 ` [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops David Howells
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

The attached patch provides a filesystem-specific page bit that a filesystem
can synchronise upon.  This can be used, for example, by a netfs to synchronise
with CacheFS writing its pages to disk.
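
For illustration only (this is not part of the patch), here's roughly how a
netfs and a cache backend might use the new bit; the my_netfs_*/my_cache_*
names are made up:

	/* netfs side: mark a page as being written to the cache in the
	 * background (PG_fs_misc here means "cache write in progress") */
	static void my_netfs_start_cache_write(struct page *page)
	{
		SetPageFsMisc(page);
		/* ... hand the page to the cache backend for writing ... */
	}

	/* netfs side: sleep until the cache backend has finished with it */
	static void my_netfs_wait_for_cache_write(struct page *page)
	{
		wait_on_page_fs_misc(page);
	}

	/* cache backend side: on I/O completion, clear the bit and wake
	 * any waiters */
	static void my_cache_write_done(struct page *page)
	{
		end_page_fs_misc(page);
	}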

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 include/linux/page-flags.h |   10 ++++++++++
 include/linux/pagemap.h    |   11 +++++++++++
 mm/filemap.c               |   17 +++++++++++++++++
 3 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d276a4e..5486874 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -74,6 +74,7 @@
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
 #define PG_checked		 8	/* kill me in 2.5.<early>. */
+#define PG_fs_misc		 8
 #define PG_arch_1		 9
 #define PG_reserved		10
 #define PG_private		11	/* Has something at ->private */
@@ -376,4 +377,13 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+/*
+ * Filesystem-specific page bit testing
+ */
+#define PageFsMisc(page)		test_bit(PG_fs_misc, &(page)->flags)
+#define SetPageFsMisc(page)		set_bit(PG_fs_misc, &(page)->flags)
+#define TestSetPageFsMisc(page)		test_and_set_bit(PG_fs_misc, &(page)->flags)
+#define ClearPageFsMisc(page)		clear_bit(PG_fs_misc, &(page)->flags)
+#define TestClearPageFsMisc(page)	test_and_clear_bit(PG_fs_misc, &(page)->flags)
+
 #endif	/* PAGE_FLAGS_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9539efd..02e7d8b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -206,6 +206,17 @@ static inline void wait_on_page_writebac
 extern void end_page_writeback(struct page *page);
 
 /*
+ * Wait for filesystem-specific page synchronisation to complete
+ */
+static inline void wait_on_page_fs_misc(struct page *page)
+{
+	if (PageFsMisc(page))
+		wait_on_page_bit(page, PG_fs_misc);
+}
+
+extern void fastcall end_page_fs_misc(struct page *page);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index 3ef2073..d4a598f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -545,6 +545,23 @@ void fastcall __lock_page(struct page *p
 EXPORT_SYMBOL(__lock_page);
 
 /*
+ * Note completion of filesystem specific page synchronisation
+ *
+ * This is used to allow a page to be written to a filesystem cache in the
+ * background without holding up the completion of readpage
+ */
+void fastcall end_page_fs_misc(struct page *page)
+{
+	smp_mb__before_clear_bit();
+	if (!TestClearPageFsMisc(page))
+		BUG();
+	smp_mb__after_clear_bit();
+	__wake_up_bit(page_waitqueue(page), &page->flags, PG_fs_misc);
+}
+
+EXPORT_SYMBOL(end_page_fs_misc);
+
+/*
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
  */


* [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
@ 2006-04-20 16:59 ` David Howells
  2006-04-20 17:40   ` Zach Brown
  2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache.  This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.  (A sketch of such a page_mkwrite() handler follows this list.)

 (*) Account for and limit the number of dirty pages.  This is one piece of
     the puzzle needed to make shared writable mappings work safely in FUSE.
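
As an illustration of the second point above, a netfs page_mkwrite() handler
might look something like the sketch below.  The my_netfs_* names and the
space-reservation helper are made up, and the PG_fs_misc wait ties in with
patch 1/7:

	/* Wait for any background cache write on this page to finish before
	 * letting userspace dirty it; a negative return causes SIGBUS. */
	static int my_netfs_page_mkwrite(struct vm_area_struct *vma,
					 struct page *page)
	{
		wait_on_page_fs_misc(page);

		if (my_netfs_reserve_space(page) < 0)	/* hypothetical helper */
			return -1;

		return 0;
	}

	static struct vm_operations_struct my_netfs_vm_ops = {
		.nopage		= filemap_nopage,
		.page_mkwrite	= my_netfs_page_mkwrite,
	};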

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    4 ++
 mm/memory.c        |   99 +++++++++++++++++++++++++++++++++++++++-------------
 mm/mmap.c          |   12 +++++-
 mm/mprotect.c      |   11 +++++-
 4 files changed, 98 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1154684..cd3c2cf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -200,6 +200,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 0ec7bc6..6c6891e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1445,25 +1445,59 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the address space that the page is about to
+			 * become writable so that it can prohibit this or wait
+			 * for the page to get into an appropriate state.
+			 *
+			 * We do this without the lock held, so that it can
+			 * sleep if it needs to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1523,6 +1557,10 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -2074,18 +2112,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/mmap.c b/mm/mmap.c
index e6ee123..6446c61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1065,7 +1065,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1089,6 +1090,12 @@ munmap_back:
 			goto free_vma;
 	}
 
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1921,7 +1928,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4c14d42..2697abd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +133,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +159,14 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.


* [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
  2006-04-20 16:59 ` [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops David Howells
@ 2006-04-20 16:59 ` David Howells
  2006-04-20 17:18   ` Christoph Hellwig
                     ` (3 more replies)
  2006-04-20 16:59 ` [PATCH 4/7] FS-Cache: Export find_get_pages() David Howells
                   ` (5 subsequent siblings)
  7 siblings, 4 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

Make it possible to avoid ENFILE checking for kernel-specific open files, such
as those used by the CacheFiles module.

After, for example, tarring up a kernel source tree over the network, the
CacheFiles module may easily have 20000+ files open in the backing filesystem,
causing all non-root processes to be given error ENFILE when they try to open
a file, socket, pipe, etc.
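
As a rough illustration (not part of the patch), a caching module might open
its backing files along the following lines; the my_cache_* name is made up
and error handling is abbreviated:

	static struct file *my_cache_open_backing_file(struct dentry *dentry,
						       struct vfsmount *mnt)
	{
		struct file *file;

		/* dentry_open_kernel() consumes the dentry and vfsmount
		 * references, even on failure, just like dentry_open() */
		file = dentry_open_kernel(dget(dentry), mntget(mnt), O_RDWR);
		if (IS_ERR(file))
			return file;

		/* the file is flagged FKFLAGS_KERNEL and so is not counted
		 * against files_stat.max_files */
		return file;
	}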

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/perfmon.c            |    2 +-
 drivers/infiniband/core/uverbs_main.c |    2 +-
 fs/eventpoll.c                        |    2 +-
 fs/file_table.c                       |   36 ++++++++++++++++++++++++---------
 fs/hugetlbfs/inode.c                  |    2 +-
 fs/inotify.c                          |    2 +-
 fs/namei.c                            |    2 +-
 fs/open.c                             |   22 +++++++++++++++++++-
 fs/pipe.c                             |    4 ++--
 include/linux/file.h                  |    1 -
 include/linux/fs.h                    |    8 ++++++-
 kernel/futex.c                        |    2 +-
 kernel/sysctl.c                       |    2 +-
 mm/shmem.c                            |    2 +-
 mm/tiny-shmem.c                       |    2 +-
 net/socket.c                          |    2 +-
 16 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 077f212..f23ab3a 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2162,7 +2162,7 @@ pfm_alloc_fd(struct file **cfile)
 
 	ret = -ENFILE;
 
-	file = get_empty_filp();
+	file = get_empty_filp(0);
 	if (!file) goto out;
 
 	/*
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index ff092a0..4f7137c 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -525,7 +525,7 @@ struct file *ib_uverbs_alloc_event_file(
 		goto err;
 	}
 
-	filp = get_empty_filp();
+	filp = get_empty_filp(0);
 	if (!filp) {
 		ret = -ENFILE;
 		goto err_fd;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1b4491c..f774038 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -714,7 +714,7 @@ static int ep_getfd(int *efd, struct ino
 
 	/* Get an ready to use file */
 	error = -ENFILE;
-	file = get_empty_filp();
+	file = get_empty_filp(0);
 	if (!file)
 		goto eexit_1;
 
diff --git a/fs/file_table.c b/fs/file_table.c
index bcea199..300e7c2 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -34,6 +34,7 @@ struct files_stat_struct files_stat = {
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
+static atomic_t nr_kernel_files;
 
 static inline void file_free_rcu(struct rcu_head *head)
 {
@@ -43,7 +44,10 @@ static inline void file_free_rcu(struct 
 
 static inline void file_free(struct file *f)
 {
-	percpu_counter_dec(&nr_files);
+	if (!(f->f_kernel_flags & FKFLAGS_KERNEL))
+		percpu_counter_dec(&nr_files);
+	else
+		atomic_dec(&nr_kernel_files);
 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
@@ -72,6 +76,7 @@ int proc_nr_files(ctl_table *table, int 
                      void __user *buffer, size_t *lenp, loff_t *ppos)
 {
 	files_stat.nr_files = get_nr_files();
+	files_stat.nr_kernel_files = atomic_read(&nr_kernel_files);
 	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
 }
 #else
@@ -86,7 +91,7 @@ int proc_nr_files(ctl_table *table, int 
  * Returns NULL, if there are no more free file structures or
  * we run out of memory.
  */
-struct file *get_empty_filp(void)
+struct file *get_empty_filp(int kernel)
 {
 	struct task_struct *tsk;
 	static int old_max;
@@ -95,20 +100,29 @@ struct file *get_empty_filp(void)
 	/*
 	 * Privileged users can go above max_files
 	 */
-	if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
-		/*
-		 * percpu_counters are inaccurate.  Do an expensive check before
-		 * we go and fail.
-		 */
-		if (percpu_counter_sum(&nr_files) >= files_stat.max_files)
-			goto over;
+	if (!kernel) {
+		if (get_nr_files() >= files_stat.max_files &&
+		    !capable(CAP_SYS_ADMIN)
+		    ) {
+			/*
+			 * percpu_counters are inaccurate.  Do an expensive
+			 * check before we go and fail.
+			 */
+			if (percpu_counter_sum(&nr_files) >=
+			    files_stat.max_files)
+				goto over;
+		}
 	}
 
 	f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
 	if (f == NULL)
 		goto fail;
 
-	percpu_counter_inc(&nr_files);
+	if (!kernel)
+		percpu_counter_inc(&nr_files);
+	else
+		atomic_inc(&nr_kernel_files);
+
 	memset(f, 0, sizeof(*f));
 	if (security_file_alloc(f))
 		goto fail_sec;
@@ -117,6 +131,7 @@ struct file *get_empty_filp(void)
 	INIT_LIST_HEAD(&f->f_u.fu_list);
 	atomic_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
+	f->f_kernel_flags = kernel ? FKFLAGS_KERNEL : 0;
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
@@ -235,6 +250,7 @@ struct file fastcall *fget_light(unsigne
 	return file;
 }
 
+EXPORT_SYMBOL(fget_light);
 
 void put_filp(struct file *file)
 {
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3a5b4e9..cc27ee8 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -770,7 +770,7 @@ struct file *hugetlb_zero_setup(size_t s
 		goto out_shm_unlock;
 
 	error = -ENFILE;
-	file = get_empty_filp();
+	file = get_empty_filp(0);
 	if (!file)
 		goto out_dentry;
 
diff --git a/fs/inotify.c b/fs/inotify.c
index 1f50302..2e66e05 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -939,7 +939,7 @@ asmlinkage long sys_inotify_init(void)
 	if (fd < 0)
 		return fd;
 
-	filp = get_empty_filp();
+	filp = get_empty_filp(0);
 	if (!filp) {
 		ret = -ENFILE;
 		goto out_put_fd;
diff --git a/fs/namei.c b/fs/namei.c
index 96723ae..6713213 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1146,7 +1146,7 @@ static int __path_lookup_intent_open(int
 		unsigned int lookup_flags, struct nameidata *nd,
 		int open_flags, int create_mode)
 {
-	struct file *filp = get_empty_filp();
+	struct file *filp = get_empty_filp(0);
 	int err;
 
 	if (filp == NULL)
diff --git a/fs/open.c b/fs/open.c
index 53ec28c..cea1538 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -962,7 +962,7 @@ struct file *dentry_open(struct dentry *
 	struct file *f;
 
 	error = -ENFILE;
-	f = get_empty_filp();
+	f = get_empty_filp(0);
 	if (f == NULL) {
 		dput(dentry);
 		mntput(mnt);
@@ -974,6 +974,26 @@ struct file *dentry_open(struct dentry *
 EXPORT_SYMBOL(dentry_open);
 
 /*
+ * open a specifically in-kernel file
+ */
+struct file *dentry_open_kernel(struct dentry *dentry, struct vfsmount *mnt, int flags)
+{
+	int error;
+	struct file *f;
+
+	error = -ENFILE;
+	f = get_empty_filp(1);
+	if (f == NULL) {
+		dput(dentry);
+		mntput(mnt);
+		return ERR_PTR(error);
+	}
+
+	return __dentry_open(dentry, mnt, flags, f, NULL);
+}
+EXPORT_SYMBOL(dentry_open_kernel);
+
+/*
  * Find an empty file descriptor entry, and mark it busy.
  */
 int get_unused_fd(void)
diff --git a/fs/pipe.c b/fs/pipe.c
index 7fefb10..6081367 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -795,11 +795,11 @@ int do_pipe(int *fd)
 	int i, j;
 
 	error = -ENFILE;
-	f1 = get_empty_filp();
+	f1 = get_empty_filp(0);
 	if (!f1)
 		goto no_files;
 
-	f2 = get_empty_filp();
+	f2 = get_empty_filp(0);
 	if (!f2)
 		goto close_f1;
 
diff --git a/include/linux/file.h b/include/linux/file.h
index 9f7c251..da7be8f 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -79,7 +79,6 @@ extern void FASTCALL(set_close_on_exec(u
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
 extern void FASTCALL(put_unused_fd(unsigned int fd));
-struct kmem_cache;
 
 extern struct file ** alloc_fd_array(int);
 extern void free_fd_array(struct file **, int);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3de2bfb..979b1d3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -33,6 +33,7 @@ struct files_stat_struct {
 	int nr_files;		/* read only */
 	int nr_free_files;	/* read only */
 	int max_files;		/* tunable */
+	int nr_kernel_files;	/* read only */
 };
 extern struct files_stat_struct files_stat;
 extern int get_max_files(void);
@@ -70,6 +71,8 @@ extern int dir_notify_enable;
    behavior for cross-node execution/opening_for_writing of files */
 #define FMODE_EXEC	16
 
+#define FKFLAGS_KERNEL	1		/* kernel internal file (not accounted) */
+
 #define RW_MASK		1
 #define RWA_MASK	2
 #define READ 0
@@ -640,6 +643,7 @@ struct file {
 	atomic_t		f_count;
 	unsigned int 		f_flags;
 	mode_t			f_mode;
+	unsigned short		f_kernel_flags;
 	loff_t			f_pos;
 	struct fown_struct	f_owner;
 	unsigned int		f_uid, f_gid;
@@ -1377,6 +1381,7 @@ extern long do_sys_open(int fdf, const c
 			int mode);
 extern struct file *filp_open(const char *, int, int);
 extern struct file * dentry_open(struct dentry *, struct vfsmount *, int);
+extern struct file * dentry_open_kernel(struct dentry *, struct vfsmount *, int);
 extern int filp_close(struct file *, fl_owner_t id);
 extern char * getname(const char __user *);
 
@@ -1577,7 +1582,7 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
-extern struct file * get_empty_filp(void);
+extern struct file * get_empty_filp(int kernel);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);
 struct bio;
@@ -1603,6 +1608,7 @@ extern ssize_t generic_file_direct_write
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern int generic_file_buffered_write_one_kernel_page(struct file *, pgoff_t, struct page *);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
diff --git a/kernel/futex.c b/kernel/futex.c
index 5699c51..7c334f3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -779,7 +779,7 @@ static int futex_fd(unsigned long uaddr,
 	ret = get_unused_fd();
 	if (ret < 0)
 		goto out;
-	filp = get_empty_filp();
+	filp = get_empty_filp(0);
 	if (!filp) {
 		put_unused_fd(ret);
 		ret = -ENFILE;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e82726f..e8f9b5f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -943,7 +943,7 @@ static ctl_table fs_table[] = {
 		.ctl_name	= FS_NRFILE,
 		.procname	= "file-nr",
 		.data		= &files_stat,
-		.maxlen		= 3*sizeof(int),
+		.maxlen		= 4*sizeof(int),
 		.mode		= 0444,
 		.proc_handler	= &proc_nr_files,
 	},
diff --git a/mm/shmem.c b/mm/shmem.c
index 37eaf42..83bbbe8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2311,7 +2311,7 @@ struct file *shmem_file_setup(char *name
 		goto put_memory;
 
 	error = -ENFILE;
-	file = get_empty_filp();
+	file = get_empty_filp(0);
 	if (!file)
 		goto put_dentry;
 
diff --git a/mm/tiny-shmem.c b/mm/tiny-shmem.c
index f9d6a9c..b014dd5 100644
--- a/mm/tiny-shmem.c
+++ b/mm/tiny-shmem.c
@@ -71,7 +71,7 @@ struct file *shmem_file_setup(char *name
 		goto put_memory;
 
 	error = -ENFILE;
-	file = get_empty_filp();
+	file = get_empty_filp(0);
 	if (!file)
 		goto put_dentry;
 
diff --git a/net/socket.c b/net/socket.c
index 23898f4..9743df2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,7 +377,7 @@ static int sock_alloc_fd(struct file **f
 
 	fd = get_unused_fd();
 	if (likely(fd >= 0)) {
-		struct file *file = get_empty_filp();
+		struct file *file = get_empty_filp(0);
 
 		*filep = file;
 		if (unlikely(!file)) {


* [PATCH 4/7] FS-Cache: Export find_get_pages()
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
  2006-04-20 16:59 ` [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops David Howells
  2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
@ 2006-04-20 16:59 ` David Howells
  2006-04-20 17:19   ` Christoph Hellwig
  2006-04-20 17:45   ` David Howells
  2006-04-20 16:59 ` [PATCH 5/7] FS-Cache: Generic filesystem caching facility David Howells
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

The attached patch exports find_get_pages() for use by the kAFS filesystem in
conjunction with its caching patch.
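
For illustration, a module might use it to batch through a mapping's pages
along these lines (the my_scan_mapping() name is made up):

	static void my_scan_mapping(struct address_space *mapping)
	{
		struct page *pages[16];
		pgoff_t index = 0;
		unsigned int i, n;

		do {
			/* takes a reference on each page it returns */
			n = find_get_pages(mapping, index, 16, pages);
			for (i = 0; i < n; i++) {
				/* ... examine or cache pages[i] here ... */
				index = pages[i]->index + 1;
				page_cache_release(pages[i]);
			}
		} while (n > 0);
	}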

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 mm/filemap.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index d4a598f..d6f7ab4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -714,6 +714,8 @@ unsigned find_get_pages(struct address_s
 	return ret;
 }
 
+EXPORT_SYMBOL(find_get_pages);
+
 /*
  * Like find_get_pages, except we only return pages which are tagged with
  * `tag'.   We update *index to index the next page for the traversal.


* [PATCH 5/7] FS-Cache: Generic filesystem caching facility
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
                   ` (2 preceding siblings ...)
  2006-04-20 16:59 ` [PATCH 4/7] FS-Cache: Export find_get_pages() David Howells
@ 2006-04-20 16:59 ` David Howells
  2006-04-21  0:46   ` Andrew Morton
  2006-04-21 14:15   ` David Howells
  2006-04-20 16:59 ` [PATCH 6/7] FS-Cache: Make kAFS use FS-Cache David Howells
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

The attached patch adds a generic intermediary (FS-Cache) by which filesystems
may call on local caching capabilities, and by which local caching backends may
make caches available:

	+---------+
	|         |                        +--------------+
	|   NFS   |--+                     |              |
	|         |  |                 +-->|   CacheFS    |
	+---------+  |   +----------+  |   |  /dev/hda5   |
	             |   |          |  |   +--------------+
	+---------+  +-->|          |  |
	|         |      |          |--+
	|   AFS   |----->| FS-Cache |
	|         |      |          |--+
	+---------+  +-->|          |  |
	             |   |          |  |   +--------------+
	+---------+  |   +----------+  |   |              |
	|         |  |                 +-->|  CacheFiles  |
	|  ISOFS  |--+                     |  /var/cache  |
	|         |                        +--------------+
	+---------+

The patch also documents the netfs interface and the cache backend
interface provided by the facility.


There are a number of reasons why I'm not using i_mapping to do this.
These have been discussed a lot on the LKML and CacheFS mailing lists,
but to summarise the basics:

 (1) Most filesystems don't report holes.  Holes in files are treated as
     blocks of zeros and can't otherwise be distinguished, which makes it
     difficult to tell blocks that have been read from the network and cached
     apart from those that haven't.

 (2) The backing inode must be fully populated before being exposed to
     userspace through the main inode because the VM/VFS goes directly to the
     backing inode and does not interrogate the front inode on VM ops.

     Therefore:

     (a) The backing inode must fit entirely within the cache.

     (b) All backed files currently open must fit entirely within the cache at
     	 the same time.

     (c) A working set of files in total larger than the cache may not be
     	 cached.

     (d) A file may not grow larger than the available space in the cache.

     (e) A file that's open and cached, and that remotely grows larger than
     	 the cache, is potentially stuffed.

 (3) Writes go to the backing filesystem, and can only be transferred to the
     network when the file is closed.

 (4) There's no record of what changes have been made, so the whole file must
     be written back.

 (5) The pages belong to the backing filesystem, and all the metadata
     associated with those pages is relevant only to the backing filesystem,
     not to anything stacked atop it.


The attached patch adds a generic core to which both networking filesystems and
caches may bind.  It transfers requests from networking filesystems to
appropriate caches if possible, or else gracefully denies them.

If this facility is disabled in the kernel configuration, then all its
operations will be trivially reducible to nothing by the compiler.
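
This relies on the usual kernel pattern of compiling the real declarations
only when the facility is enabled and providing empty static inline stubs
otherwise.  An illustrative sketch, assuming the config symbol is
CONFIG_FSCACHE and using a made-up operation name:

	#ifdef CONFIG_FSCACHE
	extern void fscache_example_op(struct fscache_cookie *cookie);
	#else
	static inline void fscache_example_op(struct fscache_cookie *cookie)
	{
		/* compiles away entirely when FS-Cache is disabled */
	}
	#endif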

FS-Cache provides the following facilities:

 (1) Caches can be added / removed at any time, even whilst in use.

 (2) Tags can be used to refer to caches, even if they're not mounted yet.

 (3) More than one cache can be used at once.  Caches can be selected
     explicitly by use of tags.

 (4) The netfs is provided with an interface that allows either party to
     withdraw caching facilities from a file (required for (1)).

 (5) A netfs may annotate cache objects that belong to it.

 (6) Cache objects can be pinned and reservations made.

 (7) The interface to the netfs returns as few errors as possible, preferring
     rather to let the netfs remain oblivious.

 (8) Cookies are used to represent indices, files and other objects to the
     netfs.  The simplest cookie is just a NULL pointer - indicating nothing
     cached there.

 (9) The netfs is allowed to propose - dynamically - any index hierarchy it
     desires, though it must be aware that the index search function is
     recursive, stack space is limited, and indices can only be children of
     indices.

(10) Indices can be used to group files together to reduce key size and to make
     group invalidation easier.  The use of indices may make lookup quicker,
     but that's cache dependent.

(11) Data I/O is effectively done directly to and from the netfs's pages.  The
     netfs indicates that page A is at index B of the data-file represented by
     cookie C, and that it should be read or written.  The cache backend may or
     may not start I/O on that page, but if it does, a netfs callback will be
     invoked to indicate completion.  The I/O may be either synchronous or
     asynchronous.  (A sketch of such a completion callback follows this
     list.)

(12) Cookies can be "retired" upon release.  At this point FS-Cache will mark
     them as obsolete and the index hierarchy rooted at that point will get
     recycled.

(13) The netfs provides a "match" function for index searches.  In addition to
     saying whether a match was made or not, this can also specify that an
     entry should be updated or deleted.
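
To illustrate (11), a netfs read-completion callback might look like the
sketch below.  The argument order follows the end_io_func() invocation
documented in backend-api.txt; the exact prototype is whatever
fscache_rw_complete_t declares in <linux/fscache.h>, and the my_netfs_* name
is made up.

	static void my_netfs_readpage_done(void *cookie_netfs_data,
					   struct page *page,
					   void *end_io_data,
					   int error)
	{
		if (error)
			SetPageError(page);
		else
			SetPageUptodate(page);

		/* readpage-style completion: let waiters see the result */
		unlock_page(page);
	}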


FS-Cache maintains a virtual indexing tree in which all indices, files, objects
and pages are kept.  Bits of this tree may actually reside in one or more
caches.

                                           FSDEF
                                             |
                        +------------------------------------+
                        |                                    |
                       NFS                                  AFS
                        |                                    |
           +--------------------------+                +-----------+
           |                          |                |           |
        homedir                     mirror          afs.org   redhat.com
           |                          |                            |
     +------------+           +---------------+              +----------+
     |            |           |               |              |          |
   00001        00002       00007           00125        vol00001   vol00002
     |            |           |               |                         |
 +---+---+     +-----+      +---+      +------+------+            +-----+----+
 |   |   |     |     |      |   |      |      |      |            |     |    |
PG0 PG1 PG2   PG0  XATTR   PG0 PG1   DIRENT DIRENT DIRENT        R/W   R/O  Bak
                     |                                            |
                    PG0                                       +-------+
                                                              |       |
                                                            00001   00003
                                                              |
                                                          +---+---+
                                                          |   |   |
                                                         PG0 PG1 PG2

In the example above, you can see two netfs's being backed: NFS and AFS.  These
have different index hierarchies:

 (*) The NFS primary index will probably contain per-server indices.  Each
     server index is indexed by NFS file handles to get data file objects.
     Each data file object can have an array of pages, but may also have
     further child objects, such as extended attributes and directory entries.
     Extended attribute objects themselves have page-array contents.

 (*) The AFS primary index contains per-cell indices.  Each cell index contains
     per-logical-volume indices.  Each volume index contains up to three
     indices for the read-write, read-only and backup mirrors of those volumes.
     Each of these contains vnode data file objects, each of which contains an
     array of pages.

The very top index is the FS-Cache master index in which individual netfs's
have entries.

Any index object may reside in more than one cache, provided it only has index
children.  Any index with non-index object children will be assumed to only
reside in one cache.


The FS-Cache overview can be found in:

	Documentation/filesystems/caching/fscache.txt

The netfs API to FS-Cache can be found in:

	Documentation/filesystems/caching/netfs-api.txt

The cache backend API to FS-Cache can be found in:

	Documentation/filesystems/caching/backend-api.txt


Signed-Off-By: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/caching/backend-api.txt |  345 +++++++
 Documentation/filesystems/caching/fscache.txt     |  151 +++
 Documentation/filesystems/caching/netfs-api.txt   |  726 ++++++++++++++
 fs/Kconfig                                        |   15 
 fs/Makefile                                       |    1 
 fs/fscache/Makefile                               |   13 
 fs/fscache/cookie.c                               | 1065 +++++++++++++++++++++
 fs/fscache/fscache-int.h                          |   71 +
 fs/fscache/fsdef.c                                |  113 ++
 fs/fscache/main.c                                 |  150 +++
 fs/fscache/page.c                                 |  548 +++++++++++
 include/linux/fscache-cache.h                     |  220 ++++
 include/linux/fscache.h                           |  484 ++++++++++
 13 files changed, 3902 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
new file mode 100644
index 0000000..896c778
--- /dev/null
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -0,0 +1,345 @@
+			  ==========================
+			  FS-CACHE CACHE BACKEND API
+			  ==========================
+
+The FS-Cache system provides an API by which actual caches can be supplied to
+FS-Cache for it to then serve out to network filesystems and other interested
+parties.
+
+This API is declared in <linux/fscache-cache.h>.
+
+
+====================================
+INITIALISING AND REGISTERING A CACHE
+====================================
+
+To start off, a cache definition must be initialised and registered for each
+cache the backend wants to make available.  For instance, CacheFS does this in
+the fill_super() operation on mounting.
+
+The cache definition (struct fscache_cache) should be initialised by calling:
+
+	void fscache_init_cache(struct fscache_cache *cache,
+				struct fscache_cache_ops *ops,
+				const char *idfmt,
+				...)
+
+Where:
+
+ (*) "cache" is a pointer to the cache definition;
+
+ (*) "ops" is a pointer to the table of operations that the backend supports on
+     this cache;
+
+ (*) and a format and printf-style arguments for constructing a label for the
+     cache.
+
+
+The cache should then be registered with FS-Cache by passing a pointer to the
+previously initialised cache definition to:
+
+	int fscache_add_cache(struct fscache_cache *cache,
+			      struct fscache_object *fsdef,
+			      const char *tagname);
+
+Two extra arguments should also be supplied:
+
+ (*) "fsdef" which should point to the object representation for the FS-Cache
+     master index in this cache.  Netfs primary index entries will be created
+     here.
+
+ (*) "tagname" which, if given, should be a text string naming this cache.  If
+     this is NULL, the identifier will be used instead.  For CacheFS, the
+     identifier is set to name the underlying block device and the tag can be
+     supplied by mount.
+
+This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
+is already in use.  0 will be returned on success.
+
+
+=====================
+UNREGISTERING A CACHE
+=====================
+
+A cache can be withdrawn from the system by calling this function with a
+pointer to the cache definition:
+
+	void fscache_withdraw_cache(struct fscache_cache *cache)
+
+In CacheFS's case, this is called by put_super().
+
+
+==================
+FS-CACHE UTILITIES
+==================
+
+FS-Cache provides some utilities that a cache backend may make use of:
+
+ (*) Find the parent of an object:
+
+	struct fscache_object *
+	fscache_find_parent_object(struct fscache_object *object)
+
+     This allows a backend to find the logical parent of an index or data file
+     in the cache hierarchy.
+
+ (*) Note occurrence of an I/O error in a cache:
+
+	void fscache_io_error(struct fscache_cache *cache)
+
+     This tells FS-Cache that an I/O error occurred in the cache.  After this
+     has been called, only resource dissociation operations (object and page
+     release) will be passed from the netfs to the cache backend for the
+     specified cache.
+
+     This does not actually withdraw the cache.  That must be done separately.
+
+
+========================
+RELEVANT DATA STRUCTURES
+========================
+
+ (*) Index/Data file FS-Cache representation cookie:
+
+	struct fscache_cookie {
+		struct fscache_object_def	*def;
+		struct fscache_netfs		*netfs;
+		void				*netfs_data;
+		...
+	};
+
+     The fields that might be of use to the backend describe the object
+     definition, the netfs definition and the netfs's data for this cookie.
+     The object definition contains functions supplied by the netfs for loading
+     and matching index entries; these are required to provide some of the
+     cache operations.
+
+ (*) In-cache object representation:
+
+	struct fscache_object {
+		struct fscache_cache		*cache;
+		struct fscache_cookie		*cookie;
+		unsigned long			flags;
+	#define FSCACHE_OBJECT_RECYCLING	1
+		...
+	};
+
+     Structures of this type should be allocated by the cache backend and
+     passed to FS-Cache when requested by the appropriate cache operation.  In
+     the case of CacheFS, they're embedded in CacheFS's internal object
+     structures.
+
+     Each object contains a pointer to the cookie that represents the object it
+     is backing.  It also contains a flag that indicates whether the object is
+     being retired when put_object() is called.  This should be initialised by
+     calling fscache_object_init(object).
+
+
+================
+CACHE OPERATIONS
+================
+
+The cache backend provides FS-Cache with a table of operations that can be
+performed on the denizens of the cache.  These are held in a structure of type:
+
+	struct fscache_cache_ops
+
+ (*) Name of cache provider [mandatory]:
+
+	const char *name
+
+     This isn't strictly an operation, but should be pointed at a string naming
+     the backend.
+
+ (*) Object lookup [mandatory]:
+
+	struct fscache_object *(*lookup_object)(struct fscache_cache *cache,
+						struct fscache_object *parent,
+						struct fscache_cookie *cookie)
+
+     This method is used to look up an object in the specified cache, given a
+     pointer to the parent object and the cookie to which the object will be
+     attached.  This should instantiate that object in the cache if it can, or
+     return -ENOBUFS or -ENOMEM if it can't.
+
+ (*) Increment object refcount [mandatory]:
+
+	struct fscache_object *(*grab_object)(struct fscache_object *object)
+
+     This method is called to increment the reference count on an object.  It
+     may fail (for instance if the cache is being withdrawn) by returning NULL.
+     It should return the object pointer if successful.
+
+ (*) Lock/Unlock object [mandatory]:
+
+	void (*lock_object)(struct fscache_object *object)
+	void (*unlock_object)(struct fscache_object *object)
+
+     These methods are used to exclusively lock an object.  It must be possible
+     to schedule with the lock held, so a spinlock isn't sufficient.
+
+ (*) Pin/Unpin object [optional]:
+
+	int (*pin_object)(struct fscache_object *object)
+	void (*unpin_object)(struct fscache_object *object)
+
+     These methods are used to pin an object into the cache.  Once pinned an
+     object cannot be reclaimed to make space.  Return -ENOSPC if there's not
+     enough space in the cache to permit this.
+
+ (*) Update object [mandatory]:
+
+	int (*update_object)(struct fscache_object *object)
+
+     This is called to update the index entry for the specified object.  The
+     new information should be in object->cookie->netfs_data.  This can be
+     obtained by calling object->cookie->def->get_aux()/get_attr().
+
+ (*) Release object reference [mandatory]:
+
+	void (*put_object)(struct fscache_object *object)
+
+     This method is used to discard a reference to an object.  The object may
+     be destroyed when all the references held by FS-Cache are released.
+
+ (*) Synchronise a cache [mandatory]:
+
+	void (*sync)(struct fscache_cache *cache)
+
+     This is called to ask the backend to synchronise a cache with its backing
+     device.
+
+ (*) Dissociate a cache [mandatory]:
+
+	void (*dissociate_pages)(struct fscache_cache *cache)
+
+     This is called to ask a cache to perform any page dissociations as part of
+     cache withdrawal.
+
+ (*) Set the data size on a cache file [mandatory]:
+
+	int (*set_i_size)(struct fscache_object *object, loff_t i_size);
+
+     This is called to indicate to the cache the maximum size a file may reach.
+     The cache may use this to reserve space on the cache.  It may also return
+     -ENOBUFS to indicate that insufficient space is available to expand the
+     metadata used to track the data.  It should return 0 if successful or
+     -ENOMEM or -EIO on error.
+
+ (*) Reserve cache space for an object's data [optional]:
+
+	int (*reserve_space)(struct fscache_object *object, loff_t size);
+
+     This is called to request that cache space be reserved to hold the data
+     for an object and the metadata used to track it.  Zero size should be
+     taken as a request to cancel a reservation.
+
+     This should return 0 if successful, -ENOSPC if there isn't enough space
+     available, or -ENOMEM or -EIO on other errors.
+
+     The reservation may exceed the size of the object, thus permitting future
+     expansion.  If the amount of space consumed by an object would exceed the
+     reservation, it's permitted to refuse requests to allocate pages, but not
+     required.  An object may be pruned down to its reservation size if larger
+     than that already.
+
+ (*) Request page be read from cache [mandatory]:
+
+	int (*read_or_alloc_page)(struct fscache_object *object,
+				  struct page *page,
+				  fscache_rw_complete_t end_io_func,
+				  void *end_io_data,
+				  gfp_t gfp)
+
+     This is called to attempt to read a netfs page from the cache, or to
+     reserve a backing block if not.  FS-Cache will have done as much checking
+     as it can before calling, but most of the work belongs to the backend.
+
+     If there's no page in the cache, then -ENODATA should be returned if the
+     backend managed to reserve a backing block; -ENOBUFS, -ENOMEM or -EIO if
+     it didn't.
+
+     If there is a page in the cache, then a read operation should be queued
+     and 0 returned.  When the read finishes, end_io_func() should be called
+     with the following arguments:
+
+	(*end_io_func)(object->cookie->netfs_data,
+		       page,
+		       end_io_data,
+		       error);
+
+     The mark_pages_cached() cookie operation should be called for the page if
+     any cache metadata is retained.  This will indicate to the netfs that the
+     page needs explicit uncaching.  This operation takes a pagevec, thus
+     allowing several pages to be marked at once.
+
+ (*) Request pages be read from cache [mandatory]:
+
+	int (*read_or_alloc_pages)(struct fscache_object *object,
+				   struct address_space *mapping,
+				   struct list_head *pages,
+				   unsigned *nr_pages,
+				   fscache_rw_complete_t end_io_func,
+				   void *end_io_data,
+				   gfp_t gfp)
+
+     This is like the previous operation, except it will be handed a list of
+     pages instead of one page.  Any pages on which a read operation is started
+     must be added to the page cache for the specified mapping and also to the
+     LRU.  Such pages must also be removed from the pages list and nr_pages
+     decremented per page.
+
+     If there was an error such as -ENOMEM, then that should be returned; else
+     if one or more pages couldn't be read or allocated, then -ENOBUFS should
+     be returned; else if one or more pages couldn't be read, then -ENODATA
+     should be returned.  If all the pages are dispatched then 0 should be
+     returned.
+
+ (*) Request page be allocated in the cache [mandatory]:
+
+	int (*allocate_page)(struct fscache_object *object,
+			     struct page *page,
+			     gfp_t gfp)
+
+     This is like read_or_alloc_page(), except that it shouldn't read from the
+     cache, even if there's data there that could be retrieved.  It should,
+     however, set up any internal metadata required such that write_page() can
+     write to the cache.
+
+     If there's no backing block available, then -ENOBUFS should be returned
+     (or -ENOMEM or -EIO if there were other problems).  If a block is
+     successfully allocated, then the netfs page should be marked and 0
+     returned.
+
+ (*) Request page be written to cache [mandatory]:
+
+	int (*write_page)(struct fscache_object *object,
+			  struct page *page,
+			  fscache_rw_complete_t end_io_func,
+			  void *end_io_data,
+			  gfp_t gfp)
+
+     This is called to write from a page on which there was a previously
+     successful read_or_alloc_page() call.  FS-Cache filters out pages that
+     don't have mappings.
+
+     If there's no backing block available, then -ENOBUFS should be returned
+     (or -ENOMEM or -EIO if there were other problems).
+
+     If the write operation could be queued, then 0 should be returned.  When
+     the write completes, end_io_func() should be called with the following
+     arguments:
+
+	(*end_io_func)(object->cookie->netfs_data,
+		       page,
+		       end_io_data,
+		       error);
+
+ (*) Discard retained per-page metadata [mandatory]:
+
+	void (*uncache_pages)(struct fscache_object *object,
+			      struct pagevec *pagevec)
+
+     This is called when one or more netfs pages are being evicted from the
+     pagecache.  The cache backend should tear down any internal representation
+     or tracking it maintains.
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
new file mode 100644
index 0000000..82c3168
--- /dev/null
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -0,0 +1,151 @@
+			  ==========================
+			  General Filesystem Caching
+			  ==========================
+
+========
+OVERVIEW
+========
+
+This facility is a general purpose cache for network filesystems, though it
+could be used for caching other things such as ISO9660 filesystems too.
+
+FS-Cache mediates between cache backends (such as CacheFS) and network
+filesystems:
+
+	+---------+
+	|         |                        +--------------+
+	|   NFS   |--+                     |              |
+	|         |  |                 +-->|   CacheFS    |
+	+---------+  |   +----------+  |   |  /dev/hda5   |
+	             |   |          |  |   +--------------+
+	+---------+  +-->|          |  |
+	|         |      |          |--+
+	|   AFS   |----->| FS-Cache |
+	|         |      |          |--+
+	+---------+  +-->|          |  |
+	             |   |          |  |   +--------------+
+	+---------+  |   +----------+  |   |              |
+	|         |  |                 +-->|  CacheFiles  |
+	|  ISOFS  |--+                     |  /var/cache  |
+	|         |                        +--------------+
+	+---------+
+
+
+FS-Cache does not take the approach of loading every opened netfs file in its
+entirety into a cache before permitting it to be accessed and then serving the
+pages out of that cache rather than out of the netfs inode, because:
+
+ (1) It must be practical to operate without a cache.
+
+ (2) The size of any accessible file must not be limited to the size of the
+     cache.
+
+ (3) The combined size of all opened files (this includes mapped libraries)
+     must not be limited to the size of the cache.
+
+ (4) The user should not be forced to download an entire file just to do a
+     one-off access of a small portion of it (such as might be done with the
+     "file" program).
+
+It instead serves the cache out in PAGE_SIZE chunks as and when requested by
+the netfs('s) using it.
+
+
+FS-Cache provides the following facilities:
+
+ (1) More than one cache can be used at once.  Caches can be selected
+     explicitly by use of tags.
+
+ (2) Caches can be added / removed at any time.
+
+ (3) The netfs is provided with an interface that allows either party to
+     withdraw caching facilities from a file (required for (2)).
+
+ (4) The interface to the netfs returns as few errors as possible, preferring
+     rather to let the netfs remain oblivious.
+
+ (5) Cookies are used to represent indices, files and other objects to the
+     netfs.  The simplest cookie is just a NULL pointer - indicating nothing
+     cached there.
+
+ (6) The netfs is allowed to propose - dynamically - any index hierarchy it
+     desires, though it must be aware that the index search function is
+     recursive, stack space is limited, and indices can only be children of
+     indices.
+
+ (7) Data I/O is done direct to and from the netfs's pages.  The netfs
+     indicates that page A is at index B of the data-file represented by cookie
+     C, and that it should be read or written.  The cache backend may or may
+     not start I/O on that page, but if it does, a netfs callback will be
+     invoked to indicate completion.  The I/O may be either synchronous or
+     asynchronous.
+
+ (8) Cookies can be "retired" upon release.  At this point FS-Cache will mark
+     them as obsolete and the index hierarchy rooted at that point will get
+     recycled.
+
+ (9) The netfs provides a "match" function for index searches.  In addition to
+     saying whether a match was made or not, this can also specify that an
+     entry should be updated or deleted.
+
+
+FS-Cache maintains a virtual indexing tree in which all indices, files, objects
+and pages are kept.  Bits of this tree may actually reside in one or more
+caches.
+
+                                           FSDEF
+                                             |
+                        +------------------------------------+
+                        |                                    |
+                       NFS                                  AFS
+                        |                                    |
+           +--------------------------+                +-----------+
+           |                          |                |           |
+        homedir                     mirror          afs.org   redhat.com
+           |                          |                            |
+     +------------+           +---------------+              +----------+
+     |            |           |               |              |          |
+   00001        00002       00007           00125        vol00001   vol00002
+     |            |           |               |                         |
+ +---+---+     +-----+      +---+      +------+------+            +-----+----+
+ |   |   |     |     |      |   |      |      |      |            |     |    |
+PG0 PG1 PG2   PG0  XATTR   PG0 PG1   DIRENT DIRENT DIRENT        R/W   R/O  Bak
+                     |                                            |
+                    PG0                                       +-------+
+                                                              |       |
+                                                            00001   00003
+                                                              |
+                                                          +---+---+
+                                                          |   |   |
+                                                         PG0 PG1 PG2
+
+In the example above, you can see two netfs's being backed: NFS and AFS.  These
+have different index hierarchies:
+
+ (*) The NFS primary index contains per-server indices.  Each server index is
+     indexed by NFS file handles to get data file objects.  Each data file
+     object can have an array of pages, but may also have further child
+     objects, such as extended attributes and directory entries.  Extended
+     attribute objects themselves have page-array contents.
+
+ (*) The AFS primary index contains per-cell indices.  Each cell index contains
+     per-logical-volume indices.  Each volume index contains up to three
+     indices for the read-write, read-only and backup mirrors of those volumes.
+     Each of these contains vnode data file objects, each of which contains an
+     array of pages.
+
+The very top index is the FS-Cache master index in which individual netfs's
+have entries.
+
+Any index object may reside in more than one cache, provided it only has index
+children.  Any index with non-index object children will be assumed to only
+reside in one cache.
+
+
+The netfs API to FS-Cache can be found in:
+
+	Documentation/filesystems/caching/netfs-api.txt
+
+The cache backend API to FS-Cache can be found in:
+
+	Documentation/filesystems/caching/backend-api.txt
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
new file mode 100644
index 0000000..db9a880
--- /dev/null
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -0,0 +1,726 @@
+			===============================
+			FS-CACHE NETWORK FILESYSTEM API
+			===============================
+
+There's an API by which a network filesystem can make use of the FS-Cache
+facilities.  This is based around a number of principles:
+
+ (1) Caches can store a number of different object types.  There are two main
+     object types: indices and files.  The first is a special type used by
+     FS-Cache to make finding objects faster and to make retiring of groups of
+     objects easier.
+
+ (2) Every index, file or other object is represented by a cookie.  This cookie
+     may or may not have anything associated with it, but the netfs doesn't
+     need to care.
+
+ (3) Barring the top-level index (one entry per cached netfs), the index
+     hierarchy for each netfs is structured according to the whim of the netfs.
+
+This API is declared in <linux/fscache.h>.
+
+This document contains the following sections:
+
+	 (1) Network filesystem definition
+	 (2) Index definition
+	 (3) Object definition
+	 (4) Network filesystem (un)registration
+	 (5) Cache tag lookup
+	 (6) Index registration
+	 (7) Data file registration
+	 (8) Miscellaneous object registration
+	 (9) Setting the data file size
+	(10) Page read/alloc/write
+	(11) Page uncaching
+	(12) Index and data file update
+	(13) Miscellaneous cookie operations
+	(14) Cookie unregistration
+	(15) Index and data file invalidation
+
+
+=============================
+NETWORK FILESYSTEM DEFINITION
+=============================
+
+FS-Cache needs a description of the network filesystem.  This is specified
+using a record of the following structure:
+
+	struct fscache_netfs {
+		uint32_t			version;
+		const char			*name;
+		struct fscache_netfs_operations	*ops;
+		struct fscache_cookie		*primary_index;
+		...
+	};
+
+The first three fields should be filled in before registration, and the fourth
+will be filled in by the registration function; any other fields should just be
+ignored and are for internal use only.
+
+The fields are:
+
+ (1) The name of the netfs (used as the key in the toplevel index).
+
+ (2) The version of the netfs (if the name matches but the version doesn't, the
+     entire in-cache hierarchy for this netfs will be scrapped and begun
+     afresh).
+
+ (3) The operations table is defined as follows:
+
+	struct fscache_netfs_operations {
+	};
+
+     Currently there aren't any functions here.
+
+ (4) The cookie representing the primary index will be allocated according to
+     another parameter passed into the registration function.
+
+For example, kAFS (linux/fs/afs/) uses the following definitions to describe
+itself:
+
+	static struct fscache_netfs_operations afs_cache_ops = {
+	};
+
+	struct fscache_netfs afs_cache_netfs = {
+		.version	= 0,
+		.name		= "afs",
+		.ops		= &afs_cache_ops,
+	};
+
+
+================
+INDEX DEFINITION
+================
+
+Indices are used for two purposes:
+
+ (1) To aid the finding of a file based on a series of keys (such as AFS's
+     "cell", "volume ID", "vnode ID").
+
+ (2) To make it easier to discard a subset of all the files cached based around
+     a particular key - for instance to mirror the removal of an AFS volume.
+
+However, since it's unlikely that any two netfs's are going to want to define
+their index hierarchies in quite the same way, FS-Cache tries to impose as few
+restraints as possible on how an index is structured and where it is placed in
+the tree.  The netfs can even mix indices and data files at the same level, but
+it's not recommended.
+
+Each index entry consists of a key of indeterminate length plus some auxiliary
+data, also of indeterminate length.
+
+There are some limits on indices:
+
+ (1) Any index containing non-index objects should be restricted to a single
+     cache.  Any such objects created within an index will be created in the
+     first cache only.  The cache in which an index is created can be
+     controlled by cache tags (see below).
+
+ (2) The entry data must be atomically journallable, so it is limited to about
+     400 bytes at present.  At least 400 bytes will be available.
+
+ (3) The depth of the index tree should be judged with care as the search
+     function is recursive.  Too many layers will run the kernel out of stack.
+
+
+=================
+OBJECT DEFINITION
+=================
+
+To define an object, a structure of the following type should be filled out:
+
+	struct fscache_object_def
+	{
+		uint8_t name[16];
+		uint8_t type;
+
+		struct fscache_cache_tag *(*select_cache)(
+			const void *parent_netfs_data,
+			const void *cookie_netfs_data);
+
+		uint16_t (*get_key)(const void *cookie_netfs_data,
+				    void *buffer,
+				    uint16_t bufmax);
+
+		void (*get_attr)(const void *cookie_netfs_data,
+				 uint64_t *size);
+
+		uint16_t (*get_aux)(const void *cookie_netfs_data,
+				    void *buffer,
+				    uint16_t bufmax);
+
+		fscache_checkaux_t (*check_aux)(void *cookie_netfs_data,
+						const void *data,
+						uint16_t datalen);
+
+		void (*mark_pages_cached)(void *cookie_netfs_data,
+					  struct address_space *mapping,
+					  struct pagevec *cached_pvec);
+
+		void (*now_uncached)(void *cookie_netfs_data);
+	};
+
+This has the following fields:
+
+ (1) The type of the object [mandatory].
+
+     This is one of the following values:
+
+	(*) FSCACHE_COOKIE_TYPE_INDEX
+
+	    This defines an index, which is a special FS-Cache type.
+
+	(*) FSCACHE_COOKIE_TYPE_DATAFILE
+
+	    This defines an ordinary data file.
+
+	(*) Any other value between 2 and 255
+
+	    This defines an extraordinary object such as an XATTR.
+
+ (2) The name of the object type (NUL terminated unless all 16 chars are used)
+     [optional].
+
+ (3) A function to select the cache in which to store an index [optional].
+
+     This function is invoked when an index needs to be instantiated in a cache
+     during the instantiation of a non-index object.  Only the immediate index
+     parent for the non-index object will be queried.  Any indices above that
+     in the hierarchy may be stored in multiple caches.  This function does not
+     need to be supplied for any non-index object or any index that will only
+     have index children.
+
+     If this function is not supplied or if it returns NULL then the first
+     cache in the parent's list will be chosen, or failing that, the first
+     cache in the master list.
+
+ (4) A function to retrieve an object's key from the netfs [mandatory].
+
+     This function will be called with the netfs data that was passed to the
+     cookie acquisition function and the maximum length of key data that it may
+     provide.  It should write the required key data into the given buffer and
+     return the quantity it wrote.
+
+ (5) A function to retrieve attribute data from the netfs [optional].
+
+     This function will be called with the netfs data that was passed to the
+     cookie acquisition function.  It should return the size of the file if
+     this is a data file.  The size may be used to govern how much cache must
+     be reserved for this file in the cache.
+
+     If the function is absent, a file size of 0 is assumed.
+
+ (6) A function to retrieve auxiliary data from the netfs [optional].
+
+     This function will be called with the netfs data that was passed to the
+     cookie acquisition function and the maximum length of auxiliary data that
+     it may provide.  It should write the auxiliary data into the given buffer
+     and return the quantity it wrote.
+
+     If this function is absent, the auxiliary data length will be set to 0.
+
+     The length of the auxiliary data buffer may be dependent on the key
+     length.  A netfs mustn't rely on being able to provide more than 400 bytes
+     for both.
+
+ (7) A function to check the auxiliary data [optional].
+
+     This function will be called to check that a match found in the cache for
+     this object is valid.  For instance with AFS it could check the auxiliary
+     data against the data version number returned by the server to determine
+     whether the index entry in a cache is still valid.
+
+     If this function is absent, it will be assumed that matching objects in a
+     cache are always valid.
+
+     If present, the function should return one of the following values:
+
+	(*) FSCACHE_CHECKAUX_OKAY		- the entry is okay as is
+	(*) FSCACHE_CHECKAUX_NEEDS_UPDATE	- the entry requires update
+	(*) FSCACHE_CHECKAUX_OBSOLETE		- the entry should be deleted
+
+     This function can also be used to extract data from the auxiliary data in
+     the cache and copy it into the netfs's structures.
+
+ (8) A function to mark a page as retaining cache metadata [mandatory].
+
+     This is called by the cache to indicate that it is retaining in-memory
+     information for this page and that the netfs should uncache the page when
+     it has finished.  This does not indicate whether there's data on the disk
+     or not.  Note that several pages at once may be presented for marking.
+
+     kAFS and NFS use the PG_private bit on the page structure for this, but
+     that may not be appropriate in all cases.
+
+     This function is not required for indices as they're not permitted to
+     hold data.
+
+ (9) A function to unmark all the pages retaining cache metadata [mandatory].
+
+     This is called by FS-Cache to indicate that a backing store is being
+     unbound from a cookie and that all the marks on the pages should be
+     cleared to prevent confusion.  Note that the cache will have torn down all
+     its tracking information so that the pages don't need to be explicitly
+     uncached.
+
+     This function is not required for indices as they're not permitted to
+     hold data.
+
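+As an illustration only, a data file object definition might be filled out
+along the following lines.  The structure, field and function names below are
+hypothetical sketches rather than kAFS's actual code, and the mandatory page
+marking operations described in (8) and (9) are omitted for brevity:
+
+	static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+						void *buffer, uint16_t bufmax)
+	{
+		const struct afs_vnode *vnode = cookie_netfs_data;
+
+		/* use the vnode ID as the lookup key */
+		if (sizeof(vnode->fid.vnode) > bufmax)
+			return 0;
+		memcpy(buffer, &vnode->fid.vnode, sizeof(vnode->fid.vnode));
+		return sizeof(vnode->fid.vnode);
+	}
+
+	static struct fscache_object_def afs_vnode_cache_object_def = {
+		.name		= "AFS.vnode",
+		.type		= FSCACHE_COOKIE_TYPE_DATAFILE,
+		.get_key	= afs_vnode_cache_get_key,
+	};
+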
+
+===================================
+NETWORK FILESYSTEM (UN)REGISTRATION
+===================================
+
+The first step is to declare the network filesystem to the cache.  This also
+involves specifying the layout of the primary index (for AFS, this would be the
+"cell" level).
+
+The registration function is:
+
+	int fscache_register_netfs(struct fscache_netfs *netfs);
+
+It just takes a pointer to the netfs definition.  It returns 0 or an error as
+appropriate.
+
+For kAFS, registration is done as follows:
+
+	ret = fscache_register_netfs(&afs_cache_netfs);
+
+The last step is, of course, unregistration:
+
+	void fscache_unregister_netfs(struct fscache_netfs *netfs);
+
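+As a sketch only, a netfs would typically register in its module init routine
+and unregister again in its exit routine; the function names and error
+handling here are illustrative:
+
+	static int __init afs_init(void)
+	{
+		int ret;
+
+		/* make the netfs known to FS-Cache before any cookies are
+		 * requested */
+		ret = fscache_register_netfs(&afs_cache_netfs);
+		if (ret < 0)
+			return ret;
+
+		/* ... remaining initialisation ... */
+		return 0;
+	}
+
+	static void __exit afs_exit(void)
+	{
+		/* all cookies must have been relinquished by this point */
+		fscache_unregister_netfs(&afs_cache_netfs);
+	}
+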
+
+================
+CACHE TAG LOOKUP
+================
+
+FS-Cache permits the use of more than one cache.  To permit particular index
+subtrees to be bound to particular caches, the second step is to look up cache
+representation tags.  This step is optional; it can be left entirely up to
+FS-Cache as to which cache should be used.  The problem with doing that is that
+FS-Cache will always pick the first cache that was registered.
+
+To get the representation for a named tag:
+
+	struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
+
+This takes a text string as the name and returns a representation of a tag.  It
+will never return an error.  It may return a dummy tag, however, if it runs out
+of memory; this will inhibit caching with this tag.
+
+Any representation so obtained must be released by passing it to this function:
+
+	void fscache_release_cache_tag(struct fscache_cache_tag *tag);
+
+The tag will be retrieved by FS-Cache when it calls the object definition
+operation select_cache().
+
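+For instance, a netfs might look up a tag when a superblock is set up and
+release it again when that superblock is torn down (the tag name and field
+shown are purely illustrative):
+
+	/* bind this mount's index subtree to the cache tagged "mycache" */
+	cell->cache_tag = fscache_lookup_cache_tag("mycache");
+
+	...
+
+	fscache_release_cache_tag(cell->cache_tag);
+
+The tag so obtained would then be returned from the netfs's select_cache()
+operation when FS-Cache asks where a non-index object should be placed.
+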
+
+==================
+INDEX REGISTRATION
+==================
+
+The third step is to inform FS-Cache about part of an index hierarchy that can
+be used to locate files.  This is done by requesting a cookie for each index in
+the path to the file:
+
+	struct fscache_cookie *
+	fscache_acquire_cookie(struct fscache_cookie *parent,
+			       struct fscache_object_def *def,
+			       void *netfs_data);
+
+This function creates an index entry in the index represented by parent,
+filling in the index entry by calling the operations pointed to by def.
+
+Note that this function never returns an error - all errors are handled
+internally.  It may also return FSCACHE_NEGATIVE_COOKIE.  It is quite
+acceptable to pass this token back to this function as the parent to another
+acquisition (or even to the relinquish cookie, read page and write page
+functions - see below).
+
+Note also that no indices are actually created in a cache until a non-index
+object needs to be created somewhere down the hierarchy.  Furthermore, an index
+may be created in several different caches independently at different times.
+This is all handled transparently, and the netfs doesn't see any of it.
+
+For example, with AFS, a cell would be added to the primary index.  This index
+entry would have a dependent inode containing a volume location index for the
+volume mappings within this cell:
+
+	cell->cache =
+		fscache_acquire_cookie(afs_cache_netfs.primary_index,
+				       &afs_cell_cache_index_def,
+				       cell);
+
+Then when a volume location was accessed, it would be entered into the cell's
+index and an inode would be allocated that acts as a volume type and hash chain
+combination:
+
+	vlocation->cache =
+		fscache_acquire_cookie(cell->cache,
+				       &afs_vlocation_cache_index_def,
+				       vlocation);
+
+And then a particular flavour of volume (R/O for example) could be added to
+that index, creating another index for vnodes (AFS inode equivalents):
+
+	volume->cache =
+		fscache_acquire_cookie(vlocation->cache,
+				       &afs_volume_cache_index_def,
+				       volume);
+
+
+======================
+DATA FILE REGISTRATION
+======================
+
+The fourth step is to request that a data file be created in the cache.  This
+is almost identical to index cookie acquisition; the only difference is that
+the type in the object definition should be something other than the index
+type.
+
+	vnode->cache =
+		fscache_acquire_cookie(volume->cache,
+				       &afs_vnode_cache_object_def,
+				       vnode);
+
+
+=================================
+MISCELLANEOUS OBJECT REGISTRATION
+=================================
+
+An optional step is to request an object of miscellaneous type be created in
+the cache.  This is almost identical to index cookie acquisition.  The only
+difference is that the type in the object definition should be something other
+than index type.  Whilst the parent object could be an index, it's more likely
+it would be some other type of object such as a data file.
+
+	xattr->cache =
+		fscache_acquire_cookie(vnode->cache,
+				       &afs_xattr_cache_object_def,
+				       xattr);
+
+Miscellaneous objects might be used to store extended attributes or directory
+entries for example.
+
+
+==========================
+SETTING THE DATA FILE SIZE
+==========================
+
+The fifth step is to set the size of the file.  This doesn't automatically
+reserve any space in the cache, but permits the cache to adjust its metadata
+for data tracking appropriately:
+
+	int fscache_set_i_size(struct fscache_cookie *cookie, loff_t i_size);
+
+The cache will return -ENOBUFS if there is no backing cache or if there is no
+space to allocate any extra metadata required in the cache.
+
+Note that attempts to read or write data pages in the cache over this size may
+be rebuffed with -ENOBUFS.
+
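+For example, a netfs might set the size whenever it fetches or revalidates an
+inode from the server (a sketch only; the vnode fields are illustrative):
+
+	/* tell the cache how big the file currently is */
+	ret = fscache_set_i_size(vnode->cache, i_size_read(&vnode->vfs_inode));
+	if (ret == -ENOBUFS) {
+		/* no cache space - carry on without caching this file */
+	}
+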
+
+=====================
+PAGE READ/ALLOC/WRITE
+=====================
+
+And the sixth step is to store and retrieve pages in the cache.  There are
+three functions that are used to do this.
+
+Note:
+
+ (1) A page should not be re-read or re-allocated without uncaching it first.
+
+ (2) A read or allocated page must be uncached when the netfs page is released
+     from the pagecache.
+
+ (3) A page should only be written to the cache if previously read or allocated.
+
+This permits the cache to maintain its page tracking in proper order.
+
+
+PAGE READ
+---------
+
+Firstly, the netfs should ask FS-Cache to examine the caches and read the
+contents cached for a particular page of a particular file if present, or else
+allocate space to store the contents if not:
+
+	typedef
+	void (*fscache_rw_complete_t)(void *cookie_data,
+				      struct page *page,
+				      void *end_io_data,
+				      int error);
+
+	int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+				       struct page *page,
+				       fscache_rw_complete_t end_io_func,
+				       void *end_io_data,
+				       gfp_t gfp);
+
+The cookie argument must specify a cookie for an object that isn't an index,
+the page specified will have the data loaded into it (and is also used to
+specify the page number), and the gfp argument is used to control how any
+memory allocations made are satisfied.
+
+If the cookie indicates the inode is not cached:
+
+ (1) The function will return -ENOBUFS.
+
+Else if there's a copy of the page resident in the cache:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) The function will submit a request to read the data from the cache's
+     backing device directly into the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the read is complete, end_io_func() will be invoked with:
+
+     (*) The netfs data supplied when the cookie was created.
+
+     (*) The page descriptor.
+
+     (*) The end_io_data argument passed to the above function.
+
+     (*) An argument that's 0 on success or negative for an error code.
+
+     If an error occurs, it should be assumed that the page contains no usable
+     data.
+
+Otherwise, if there's not a copy available in cache, but the cache may be able
+to store the page:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) A block may be reserved in the cache and attached to the object at the
+     appropriate place.
+
+ (3) The function will return -ENODATA.
+
+This function may also return -ENOMEM or -EINTR, in which case it won't have
+read any data from the cache.
+
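+Putting this together, a netfs readpage() implementation might drive the
+cache along the following lines.  This is a simplified sketch: AFS_FS_I(),
+afs_readpage_complete() and afs_readpage_from_server() are illustrative
+helpers, not actual kAFS code:
+
+	static int afs_readpage(struct file *file, struct page *page)
+	{
+		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
+		int ret;
+
+		/* try the cache first */
+		ret = fscache_read_or_alloc_page(vnode->cache, page,
+						 afs_readpage_complete, NULL,
+						 GFP_KERNEL);
+		switch (ret) {
+		case 0:
+			/* read submitted; the completion function will mark
+			 * the page up to date and unlock it */
+			return 0;
+		case -ENODATA:	/* block allocated, but no data in it yet */
+		case -ENOBUFS:	/* not cached or no space */
+		default:	/* -ENOMEM/-EINTR */
+			/* fall back to fetching from the server */
+			return afs_readpage_from_server(file, page);
+		}
+	}
+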
+
+PAGE ALLOCATE
+-------------
+
+Alternatively, if there's not expected to be any data in the cache for a page
+because the file has been extended, a block can simply be allocated instead:
+
+	int fscache_alloc_page(struct fscache_cookie *cookie,
+			       struct page *page,
+			       gfp_t gfp);
+
+This is similar to the fscache_read_or_alloc_page() function, except that it
+never reads from the cache.  It will return 0 if a block has been allocated,
+rather than -ENODATA as the other would.  One or the other must be performed
+before writing to the cache.
+
+The mark_pages_cached() cookie operation will be called on the page if
+successful.
+
+
+PAGE WRITE
+----------
+
+Secondly, if the netfs changes the contents of the page (either due to an
+initial download or if a user performs a write), then the page should be
+written back to the cache:
+
+	int fscache_write_page(struct fscache_cookie *cookie,
+			       struct page *page,
+			       fscache_rw_complete_t end_io_func,
+			       void *end_io_data,
+			       gfp_t gfp);
+
+The cookie argument must specify a data file cookie, the page specified should
+contain the data to be written (and is also used to specify the page number),
+and the gfp argument is used to control how any memory allocations made are
+satisfied.
+
+The page must have first been read or allocated successfully and must not have
+been uncached before writing is performed.
+
+If the cookie indicates the inode is not cached then:
+
+ (1) The function will return -ENOBUFS.
+
+Else if space can be allocated in the cache to hold this page:
+
+ (1) The function will submit a request to write the data to the cache's
+     backing device directly from the page specified.
+
+ (2) The function will return 0.
+
+ (3) When the write is complete the end_io_func() will be invoked with:
+
+     (*) The netfs data supplied when the cookie was created.
+
+     (*) The page descriptor.
+
+     (*) The end_io_data argument passed to the function.
+
+     (*) An argument that's 0 on success or negative for an error.
+
+     If an error occurs, it can be assumed that the page has not been written
+     to the cache, and that either there's a block containing the old data or
+     no block at all in the cache.
+
+Else if there's no space available in the cache, -ENOBUFS will be returned.
+
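+Continuing the sketch above, once a page has been filled from the server it
+might be offered to the cache as follows (helper names remain illustrative):
+
+	/* the page now holds data fetched from the server; let the cache take
+	 * a copy in the background */
+	if (fscache_write_page(vnode->cache, page,
+			       afs_write_to_cache_complete, NULL,
+			       GFP_KERNEL) < 0) {
+		/* -ENOBUFS etc: nothing was stored, but the page is still
+		 * marked and must be uncached when it leaves the pagecache */
+	}
+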
+
+MULTIPLE PAGE READ
+------------------
+
+A facility is provided to read several pages at once, as requested by the
+readpages() address space operation:
+
+	int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+					struct address_space *mapping,
+					struct list_head *pages,
+					int *nr_pages,
+					fscache_rw_complete_t end_io_func,
+					void *end_io_data,
+					gfp_t gfp);
+
+This works in a similar way to fscache_read_or_alloc_page(), except:
+
+ (1) Any page it can retrieve data for is removed from the pages list and
+     *nr_pages is decremented; the page is then dispatched for reading from
+     the cache's backing device.  Reads of adjacent pages on disk may be
+     merged for greater efficiency.
+
+ (2) The mark_pages_cached() cookie operation will be called on several pages
+     at once if they're being read or allocated.
+
+ (3) If there was a general error, then that error will be returned.
+
+     Else if some pages couldn't be allocated or read, then -ENOBUFS will be
+     returned.
+
+     Else if some pages couldn't be read but were allocated, then -ENODATA will
+     be returned.
+
+     Otherwise, if all pages had reads dispatched, then 0 will be returned, the
+     list will be empty and *nr_pages will be 0.
+
+ (4) end_io_func will be called once for each page being read as the reads
+     complete.
+
+Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
+some of the pages being read and some being allocated.  Those pages will have
+been marked appropriately and will need uncaching.
+
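+A netfs readpages() implementation might therefore look something like the
+following sketch, with the server fallback helper again being hypothetical:
+
+	static int afs_readpages(struct file *file,
+				 struct address_space *mapping,
+				 struct list_head *pages, unsigned nr_pages)
+	{
+		struct afs_vnode *vnode = AFS_FS_I(mapping->host);
+		int n = nr_pages, ret;
+
+		ret = fscache_read_or_alloc_pages(vnode->cache, mapping, pages,
+						  &n, afs_readpage_complete,
+						  NULL, GFP_KERNEL);
+
+		/* any pages left on the list were neither read nor allocated
+		 * by the cache and must be fetched from the server */
+		if (!list_empty(pages))
+			ret = afs_readpages_from_server(file, mapping, pages, n);
+
+		return ret;
+	}
+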
+
+==============
+PAGE UNCACHING
+==============
+
+To uncache a page, this function should be called:
+
+	void fscache_uncache_page(struct fscache_cookie *cookie,
+				  struct page *page);
+
+This function permits the cache to release any in-memory representation it
+might be holding for this netfs page.  This function must be called once for
+each page on which the read or write page functions above have been called to
+make sure the cache's in-memory tracking information gets torn down.
+
+Note that pages can't be explicitly deleted from a data file.  The whole
+data file must be retired (see the relinquish cookie function below).
+
+Furthermore, note that this does not cancel the asynchronous read or write
+operation started by the read/alloc and write functions.
+
+There is another unbinding operation similar to the above that takes a set of
+pages to unbind in one go:
+
+	void fscache_uncache_pagevec(struct fscache_cookie *cookie,
+				     struct pagevec *pagevec);
+
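+As an example, the uncache call would typically be made from the netfs's
+releasepage() address space operation.  The sketch below assumes the
+PG_private marking convention discussed under mark_pages_cached() above:
+
+	static int afs_releasepage(struct page *page, gfp_t gfp_flags)
+	{
+		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
+
+		if (PagePrivate(page)) {
+			/* drop the cache's in-memory tracking of this page */
+			fscache_uncache_page(vnode->cache, page);
+			ClearPagePrivate(page);
+		}
+
+		return 1;
+	}
+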
+
+==========================
+INDEX AND DATA FILE UPDATE
+==========================
+
+To request an update of the index data for an index or other object, the
+following function should be called:
+
+	void fscache_update_cookie(struct fscache_cookie *cookie);
+
+This function will refer back to the netfs_data pointer stored in the cookie by
+the acquisition function to obtain the data to write into each revised index
+entry.  The update method in the parent index definition will be called to
+transfer the data.
+
+Note that partial updates may happen automatically at other times, such as when
+data blocks are added to a data file object.
+
+
+===============================
+MISCELLANEOUS COOKIE OPERATIONS
+===============================
+
+There are a number of operations that can be used to control cookies:
+
+ (*) Cookie pinning:
+
+	int fscache_pin_cookie(struct fscache_cookie *cookie);
+	void fscache_unpin_cookie(struct fscache_cookie *cookie);
+
+     These operations permit data cookies to be pinned into the cache and to
+     have the pinning removed.  They are not permitted on index cookies.
+
+     The pinning function will return 0 if successful, -ENOBUFS if the cookie
+     isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
+     -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+     -EIO if there's any other problem.
+
+ (*) Data space reservation:
+
+	int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
+
+     This permits a netfs to request cache space be reserved to store up to the
+     given amount of a file.  It is permitted to ask for more than the current
+     size of the file to allow for future file expansion.
+
+     If size is given as zero then the reservation will be cancelled.
+
+     The function will return 0 if successful, -ENOBUFS if the cookie isn't
+     backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
+     -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+     -EIO if there's any other problem.
+
+     Note that this doesn't pin an object in a cache; it can still be culled to
+     make space if it's not in use.
+
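+For instance, a netfs could combine the two operations to make sure a
+particularly important file both has space set aside and cannot be culled
+(illustrative only):
+
+	/* reserve space for the whole file and then nail it into the cache */
+	ret = fscache_reserve_space(vnode->cache, i_size_read(inode));
+	if (ret == 0)
+		ret = fscache_pin_cookie(vnode->cache);
+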
+
+=====================
+COOKIE UNREGISTRATION
+=====================
+
+To get rid of a cookie, this function should be called:
+
+	void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+				       int retire);
+
+If retire is non-zero, then the object will be marked for recycling, and all
+copies of it will be removed from all active caches in which it is present.
+Not only that but all child objects will also be retired.
+
+If retire is zero, then the object may be available again the next time the
+acquisition function is called.  Note that retirement overrules any pinning on
+a cookie.
+
+One very important note - relinquish must NOT be called for a cookie unless all
+the cookies for "child" indices, objects and pages have been relinquished
+first.
+
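+For example, when a netfs inode is evicted from memory its data file cookie
+might simply be released for later reuse, but if the file was deleted on the
+server the cookie would be retired instead (a sketch only):
+
+	/* retire if the server-side file has gone, otherwise just release */
+	fscache_relinquish_cookie(vnode->cache, deleted ? 1 : 0);
+	vnode->cache = NULL;
+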
+
+================================
+INDEX AND DATA FILE INVALIDATION
+================================
+
+There is no direct way to invalidate an index subtree or a data file.  To do
+this, the caller should relinquish and retire the cookie they have, and then
+acquire a new one.
diff --git a/fs/Kconfig b/fs/Kconfig
index f9b5842..66acf29 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -508,6 +508,21 @@ config FUSE_FS
 	  If you want to develop a userspace FS, or if you want to use
 	  a filesystem based on FUSE, answer Y or M.
 
+menu "Caches"
+
+config FSCACHE
+	tristate "General filesystem cache manager"
+	depends on EXPERIMENTAL
+	help
+	  This option enables a generic filesystem caching manager that can be
+	  used by various network and other filesystems to cache data
+	  locally. Different sorts of caches can be plugged in, depending on the
+	  resources available.
+
+	  See Documentation/filesystems/caching/fscache.txt for more information.
+
+endmenu
+
 menu "CD-ROM/DVD Filesystems"
 
 config ISO9660_FS
diff --git a/fs/Makefile b/fs/Makefile
index 83bf478..36ee03b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -50,6 +50,7 @@ obj-y				+= devpts/
 obj-$(CONFIG_PROFILING)		+= dcookies.o
  
 # Do not add any filesystems before this line
+obj-$(CONFIG_FSCACHE)		+= fscache/
 obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
 obj-$(CONFIG_EXT3_FS)		+= ext3/ # Before ext2 so root fs can be ext3
 obj-$(CONFIG_JBD)		+= jbd/
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
new file mode 100644
index 0000000..4c91e22
--- /dev/null
+++ b/fs/fscache/Makefile
@@ -0,0 +1,13 @@
+#
+# Makefile for general filesystem caching code
+#
+
+#CFLAGS += -finstrument-functions
+
+fscache-objs := \
+	cookie.o \
+	fsdef.o \
+	main.o \
+	page.o
+
+obj-$(CONFIG_FSCACHE) := fscache.o
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
new file mode 100644
index 0000000..91577d5
--- /dev/null
+++ b/fs/fscache/cookie.c
@@ -0,0 +1,1065 @@
+/* cookie.c: general filesystem cache cookie management
+ *
+ * Copyright (C) 2004-5 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include "fscache-int.h"
+
+static LIST_HEAD(fscache_cache_tag_list);
+static LIST_HEAD(fscache_cache_list);
+static LIST_HEAD(fscache_netfs_list);
+static DECLARE_RWSEM(fscache_addremove_sem);
+static struct fscache_cache_tag fscache_nomem_tag;
+
+kmem_cache_t *fscache_cookie_jar;
+
+static void fscache_withdraw_object(struct fscache_cache *cache,
+				    struct fscache_object *object);
+
+static void __fscache_cookie_put(struct fscache_cookie *cookie);
+
+static inline void fscache_cookie_put(struct fscache_cookie *cookie)
+{
+#ifdef CONFIG_DEBUG_SLAB
+	BUG_ON((atomic_read(&cookie->usage) & 0xffff0000) == 0x6b6b0000);
+#endif
+
+	BUG_ON(atomic_read(&cookie->usage) <= 0);
+
+	if (atomic_dec_and_test(&cookie->usage))
+		__fscache_cookie_put(cookie);
+
+}
+
+/*****************************************************************************/
+/*
+ * look up a cache tag
+ */
+struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *name)
+{
+	struct fscache_cache_tag *tag, *xtag;
+
+	/* firstly check for the existence of the tag under read lock */
+	down_read(&fscache_addremove_sem);
+
+	list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+		if (strcmp(tag->name, name) == 0) {
+			atomic_inc(&tag->usage);
+			up_read(&fscache_addremove_sem);
+			return tag;
+		}
+	}
+
+	up_read(&fscache_addremove_sem);
+
+	/* the tag does not exist - create a candidate */
+	xtag = kmalloc(sizeof(*tag) + strlen(name) + 1, GFP_KERNEL);
+	if (!xtag) {
+		/* return a dummy tag if out of memory */
+		return &fscache_nomem_tag;
+	}
+
+	atomic_set(&xtag->usage, 1);
+	strcpy(xtag->name, name);
+
+	/* write lock, search again and add if still not present */
+	down_write(&fscache_addremove_sem);
+
+	list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+		if (strcmp(tag->name, name) == 0) {
+			atomic_inc(&tag->usage);
+			up_write(&fscache_addremove_sem);
+			kfree(xtag);
+			return tag;
+		}
+	}
+
+	list_add_tail(&xtag->link, &fscache_cache_tag_list);
+	up_write(&fscache_addremove_sem);
+	return xtag;
+
+} /* end __fscache_lookup_cache_tag() */
+
+/*****************************************************************************/
+/*
+ * release a reference to a cache tag
+ */
+void __fscache_release_cache_tag(struct fscache_cache_tag *tag)
+{
+	if (tag != &fscache_nomem_tag) {
+		down_write(&fscache_addremove_sem);
+
+		if (atomic_dec_and_test(&tag->usage))
+			list_del_init(&tag->link);
+		else
+			tag = NULL;
+
+		up_write(&fscache_addremove_sem);
+
+		kfree(tag);
+	}
+
+} /* end __fscache_release_cache_tag() */
+
+/*****************************************************************************/
+/*
+ * register a network filesystem for caching
+ */
+int __fscache_register_netfs(struct fscache_netfs *netfs)
+{
+	struct fscache_netfs *ptr;
+	int ret;
+
+	_enter("{%s}", netfs->name);
+
+	INIT_LIST_HEAD(&netfs->link);
+
+	/* allocate a cookie for the primary index */
+	netfs->primary_index =
+		kmem_cache_alloc(fscache_cookie_jar, SLAB_KERNEL);
+
+	if (!netfs->primary_index) {
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	/* initialise the primary index cookie */
+	memset(netfs->primary_index, 0, sizeof(*netfs->primary_index));
+
+	atomic_set(&netfs->primary_index->usage, 1);
+	atomic_set(&netfs->primary_index->children, 0);
+
+	netfs->primary_index->def		= &fscache_fsdef_netfs_def;
+	netfs->primary_index->parent		= &fscache_fsdef_index;
+	netfs->primary_index->netfs		= netfs;
+	netfs->primary_index->netfs_data	= netfs;
+
+	atomic_inc(&netfs->primary_index->parent->usage);
+	atomic_inc(&netfs->primary_index->parent->children);
+
+	init_rwsem(&netfs->primary_index->sem);
+	INIT_HLIST_HEAD(&netfs->primary_index->backing_objects);
+
+	/* check the netfs type is not already present */
+	down_write(&fscache_addremove_sem);
+
+	ret = -EEXIST;
+	list_for_each_entry(ptr, &fscache_netfs_list, link) {
+		if (strcmp(ptr->name, netfs->name) == 0)
+			goto already_registered;
+	}
+
+	list_add(&netfs->link, &fscache_netfs_list);
+	ret = 0;
+
+	printk("FS-Cache: netfs '%s' registered for caching\n", netfs->name);
+
+already_registered:
+	up_write(&fscache_addremove_sem);
+
+	if (ret < 0) {
+		netfs->primary_index->parent = NULL;
+		__fscache_cookie_put(netfs->primary_index);
+		netfs->primary_index = NULL;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_register_netfs() */
+
+EXPORT_SYMBOL(__fscache_register_netfs);
+
+/*****************************************************************************/
+/*
+ * unregister a network filesystem from the cache
+ * - all cookies must have been released first
+ */
+void __fscache_unregister_netfs(struct fscache_netfs *netfs)
+{
+	_enter("{%s.%u}", netfs->name, netfs->version);
+
+	down_write(&fscache_addremove_sem);
+
+	list_del(&netfs->link);
+	fscache_relinquish_cookie(netfs->primary_index, 0);
+
+	up_write(&fscache_addremove_sem);
+
+	printk("FS-Cache: netfs '%s' unregistered from caching\n",
+	       netfs->name);
+
+	_leave("");
+
+} /* end __fscache_unregister_netfs() */
+
+EXPORT_SYMBOL(__fscache_unregister_netfs);
+
+/*****************************************************************************/
+/*
+ * initialise a cache record
+ */
+void fscache_init_cache(struct fscache_cache *cache,
+			struct fscache_cache_ops *ops,
+			const char *idfmt,
+			...)
+{
+	va_list va;
+
+	memset(cache, 0, sizeof(*cache));
+
+	cache->ops = ops;
+
+	va_start(va, idfmt);
+	vsnprintf(cache->identifier, sizeof(cache->identifier), idfmt, va);
+	va_end(va);
+
+	INIT_LIST_HEAD(&cache->link);
+	INIT_LIST_HEAD(&cache->object_list);
+	spin_lock_init(&cache->object_list_lock);
+	init_rwsem(&cache->withdrawal_sem);
+
+} /* end fscache_init_cache() */
+
+EXPORT_SYMBOL(fscache_init_cache);
+
+/*****************************************************************************/
+/*
+ * declare a mounted cache as being open for business
+ */
+int fscache_add_cache(struct fscache_cache *cache,
+		      struct fscache_object *ifsdef,
+		      const char *tagname)
+{
+	struct fscache_cache_tag *tag;
+
+	BUG_ON(!cache->ops);
+	BUG_ON(!ifsdef);
+
+	cache->flags = 0;
+
+	if (!tagname)
+		tagname = cache->identifier;
+
+	BUG_ON(!tagname[0]);
+
+	_enter("{%s.%s},,%s", cache->ops->name, cache->identifier, tagname);
+
+	if (!cache->ops->grab_object(ifsdef))
+		BUG();
+
+	ifsdef->cookie = &fscache_fsdef_index;
+	ifsdef->cache = cache;
+	cache->fsdef = ifsdef;
+
+	down_write(&fscache_addremove_sem);
+
+	/* instantiate or allocate a cache tag */
+	list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+		if (strcmp(tag->name, tagname) == 0) {
+			if (tag->cache) {
+				printk(KERN_ERR
+				       "FS-Cache: cache tag '%s' already in use\n",
+				       tagname);
+				up_write(&fscache_addremove_sem);
+				return -EEXIST;
+			}
+
+			atomic_inc(&tag->usage);
+			goto found_cache_tag;
+		}
+	}
+
+	tag = kmalloc(sizeof(*tag) + strlen(tagname) + 1, GFP_KERNEL);
+	if (!tag) {
+		up_write(&fscache_addremove_sem);
+		return -ENOMEM;
+	}
+
+	atomic_set(&tag->usage, 1);
+	strcpy(tag->name, tagname);
+	list_add_tail(&tag->link, &fscache_cache_tag_list);
+
+found_cache_tag:
+	tag->cache = cache;
+	cache->tag = tag;
+
+	/* add the cache to the list */
+	list_add(&cache->link, &fscache_cache_list);
+
+	/* add the cache's netfs definition index object to the cache's
+	 * list */
+	spin_lock(&cache->object_list_lock);
+	list_add_tail(&ifsdef->cache_link, &cache->object_list);
+	spin_unlock(&cache->object_list_lock);
+
+	/* add the cache's netfs definition index object to the top level index
+	 * cookie as a known backing object */
+	down_write(&fscache_fsdef_index.sem);
+
+	hlist_add_head(&ifsdef->cookie_link,
+		       &fscache_fsdef_index.backing_objects);
+
+	atomic_inc(&fscache_fsdef_index.usage);
+
+	/* done */
+	up_write(&fscache_fsdef_index.sem);
+	up_write(&fscache_addremove_sem);
+
+	printk(KERN_NOTICE
+	       "FS-Cache: Cache \"%s\" added (type %s)\n",
+	       cache->tag->name, cache->ops->name);
+
+	_leave(" = 0 [%s]", cache->identifier);
+	return 0;
+
+} /* end fscache_add_cache() */
+
+EXPORT_SYMBOL(fscache_add_cache);
+
+/*****************************************************************************/
+/*
+ * note a cache I/O error
+ */
+void fscache_io_error(struct fscache_cache *cache)
+{
+	set_bit(FSCACHE_IOERROR, &cache->flags);
+
+	printk(KERN_ERR "FS-Cache: Cache %s stopped due to I/O error\n",
+	       cache->ops->name);
+
+} /* end fscache_io_error() */
+
+EXPORT_SYMBOL(fscache_io_error);
+
+/*****************************************************************************/
+/*
+ * withdraw an unmounted cache from the active service
+ */
+void fscache_withdraw_cache(struct fscache_cache *cache)
+{
+	struct fscache_object *object;
+
+	_enter("");
+
+	printk(KERN_NOTICE
+	       "FS-Cache: Withdrawing cache \"%s\"\n",
+	       cache->tag->name);
+
+	/* make the cache unavailable for cookie acquisition */
+	down_write(&cache->withdrawal_sem);
+
+	down_write(&fscache_addremove_sem);
+	list_del_init(&cache->link);
+	cache->tag->cache = NULL;
+	up_write(&fscache_addremove_sem);
+
+	/* mark all objects as being withdrawn */
+	spin_lock(&cache->object_list_lock);
+	list_for_each_entry(object, &cache->object_list, cache_link) {
+		set_bit(FSCACHE_OBJECT_WITHDRAWN, &object->flags);
+	}
+	spin_unlock(&cache->object_list_lock);
+
+	/* make sure all pages pinned by operations on behalf of the netfs are
+	 * written to disc */
+	cache->ops->sync_cache(cache);
+
+	/* dissociate all the netfs pages backed by this cache from the block
+	 * mappings in the cache */
+	cache->ops->dissociate_pages(cache);
+
+	/* we now have to destroy all the active objects pertaining to this
+	 * cache */
+	spin_lock(&cache->object_list_lock);
+
+	while (!list_empty(&cache->object_list)) {
+		object = list_entry(cache->object_list.next,
+				    struct fscache_object, cache_link);
+		list_del_init(&object->cache_link);
+		spin_unlock(&cache->object_list_lock);
+
+		_debug("withdraw %p", object->cookie);
+
+		/* we've extracted an active object from the tree - now dispose
+		 * of it */
+		fscache_withdraw_object(cache, object);
+
+		spin_lock(&cache->object_list_lock);
+	}
+
+	spin_unlock(&cache->object_list_lock);
+
+	fscache_release_cache_tag(cache->tag);
+	cache->tag = NULL;
+
+	_leave("");
+
+} /* end fscache_withdraw_cache() */
+
+EXPORT_SYMBOL(fscache_withdraw_cache);
+
+/*****************************************************************************/
+/*
+ * withdraw an object from active service at the behest of the cache
+ * - need to break the links to a cached object cookie
+ * - called under two situations:
+ *   (1) recycler decides to reclaim an in-use object
+ *   (2) a cache is unmounted
+ * - have to take care as the cookie may be in the process of being
+ *   relinquished by the netfs simultaneously
+ * - the active object is pinned by the caller holding a refcount on it
+ */
+static void fscache_withdraw_object(struct fscache_cache *cache,
+				    struct fscache_object *object)
+{
+	struct fscache_cookie *cookie, *xcookie = NULL;
+
+	_enter(",%p", object);
+
+	/* first of all we have to break the links between the object and the
+	 * cookie
+	 * - we have to hold both semaphores BUT we have to get the cookie sem
+	 *   FIRST
+	 */
+	cache->ops->lock_object(object);
+
+	cookie = object->cookie;
+	if (cookie) {
+		/* pin the cookie so that it doesn't escape */
+		atomic_inc(&cookie->usage);
+
+		/* re-order the locks to avoid deadlock */
+		cache->ops->unlock_object(object);
+		down_write(&cookie->sem);
+		cache->ops->lock_object(object);
+
+		/* erase references from the object to the cookie */
+		hlist_del_init(&object->cookie_link);
+
+		xcookie = object->cookie;
+		object->cookie = NULL;
+
+		up_write(&cookie->sem);
+	}
+
+	cache->ops->unlock_object(object);
+
+	/* we've broken the links between cookie and object */
+	if (xcookie) {
+		fscache_cookie_put(xcookie);
+		cache->ops->put_object(object);
+	}
+
+	/* unpin the cookie */
+	if (cookie) {
+		if (cookie->def && cookie->def->now_uncached)
+			cookie->def->now_uncached(cookie->netfs_data);
+		fscache_cookie_put(cookie);
+	}
+
+	_leave("");
+
+} /* end fscache_withdraw_object() */
+
+/*****************************************************************************/
+/*
+ * select a cache on which to store an object
+ * - the cache addremove semaphore must be at least read-locked by the caller
+ * - the object will never be an index
+ */
+static struct fscache_cache *fscache_select_cache_for_object(struct fscache_cookie *cookie)
+{
+	struct fscache_cache_tag *tag;
+	struct fscache_object *object;
+	struct fscache_cache *cache;
+
+	_enter("");
+
+	if (list_empty(&fscache_cache_list)) {
+		_leave(" = NULL [no cache]");
+		return NULL;
+	}
+
+	/* we check the parent to determine the cache to use */
+	down_read(&cookie->parent->sem);
+
+	/* the first in the parent's backing list should be the preferred
+	 * cache */
+	if (!hlist_empty(&cookie->parent->backing_objects)) {
+		object = hlist_entry(cookie->parent->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		cache = object->cache;
+		if (test_bit(FSCACHE_IOERROR, &cache->flags))
+			cache = NULL;
+
+		up_read(&cookie->parent->sem);
+		_leave(" = %p [parent]", cache);
+		return cache;
+	}
+
+	/* the parent is unbacked */
+	if (cookie->parent->def->type != FSCACHE_COOKIE_TYPE_INDEX) {
+		/* parent not an index and is unbacked */
+		up_read(&cookie->parent->sem);
+		_leave(" = NULL [parent ubni]");
+		return NULL;
+	}
+
+	up_read(&cookie->parent->sem);
+
+	if (!cookie->parent->def->select_cache)
+		goto no_preference;
+
+	/* ask the netfs for its preference */
+	tag = cookie->parent->def->select_cache(
+		cookie->parent->parent->netfs_data,
+		cookie->parent->netfs_data);
+
+	if (!tag)
+		goto no_preference;
+
+	if (tag == &fscache_nomem_tag) {
+		_leave(" = NULL [nomem tag]");
+		return NULL;
+	}
+
+	if (!tag->cache) {
+		_leave(" = NULL [unbacked tag]");
+		return NULL;
+	}
+
+	if (test_bit(FSCACHE_IOERROR, &tag->cache->flags))
+		return NULL;
+
+	_leave(" = %p [specific]", tag->cache);
+	return tag->cache;
+
+no_preference:
+	/* netfs has no preference - just select first cache */
+	cache = list_entry(fscache_cache_list.next,
+			   struct fscache_cache, link);
+	_leave(" = %p [first]", cache);
+	return cache;
+
+} /* end fscache_select_cache_for_object() */
+
+/*****************************************************************************/
+/*
+ * get a backing object for a cookie from the chosen cache
+ * - the cookie must be write-locked by the caller
+ * - all parent indexes will be obtained recursively first
+ */
+static struct fscache_object *fscache_lookup_object(struct fscache_cookie *cookie,
+						    struct fscache_cache *cache)
+{
+	struct fscache_cookie *parent = cookie->parent;
+	struct fscache_object *pobject, *object;
+	struct hlist_node *_p;
+
+	_enter("{%s/%s},",
+	       parent && parent->def ? parent->def->name : "",
+	       cookie->def ? (char *) cookie->def->name : "<file>");
+
+	if (test_bit(FSCACHE_IOERROR, &cache->flags))
+		return NULL;
+
+	/* see if we have the backing object for this cookie + cache immediately
+	 * to hand
+	 */
+	object = NULL;
+	hlist_for_each_entry(object, _p,
+			     &cookie->backing_objects, cookie_link
+			     ) {
+		if (object->cache == cache)
+			break;
+	}
+
+	if (object) {
+		_leave(" = %p [old]", object);
+		return object;
+	}
+
+	BUG_ON(!parent); /* FSDEF entries don't have a parent */
+
+	/* we don't have a backing cookie, so we need to consult the object's
+	 * parent index in the selected cache and maybe insert an entry
+	 * therein; so the first thing to do is make sure that the parent index
+	 * is represented on disc
+	 */
+	down_read(&parent->sem);
+
+	pobject = NULL;
+	hlist_for_each_entry(pobject, _p,
+			     &parent->backing_objects, cookie_link
+			     ) {
+		if (pobject->cache == cache)
+			break;
+	}
+
+	if (!pobject) {
+		/* we don't know about the parent object */
+		up_read(&parent->sem);
+		down_write(&parent->sem);
+
+		pobject = fscache_lookup_object(parent, cache);
+		if (IS_ERR(pobject)) {
+			up_write(&parent->sem);
+			_leave(" = %ld [no ipobj]", PTR_ERR(pobject));
+			return pobject;
+		}
+
+		_debug("pobject=%p", pobject);
+
+		BUG_ON(pobject->cookie != parent);
+
+		downgrade_write(&parent->sem);
+	}
+
+	/* now we can attempt to look up this object in the parent, possibly
+	 * creating a representation on disc when we do so
+	 */
+	object = cache->ops->lookup_object(cache, pobject, cookie);
+	up_read(&parent->sem);
+
+	if (IS_ERR(object)) {
+		_leave(" = %ld [no obj]", PTR_ERR(object));
+		return object;
+	}
+
+	/* keep track of it */
+	cache->ops->lock_object(object);
+
+	BUG_ON(!hlist_unhashed(&object->cookie_link));
+
+	/* attach to the cache's object list */
+	if (list_empty(&object->cache_link)) {
+		spin_lock(&cache->object_list_lock);
+		list_add(&object->cache_link, &cache->object_list);
+		spin_unlock(&cache->object_list_lock);
+	}
+
+	/* attach to the cookie */
+	object->cookie = cookie;
+	atomic_inc(&cookie->usage);
+	hlist_add_head(&object->cookie_link, &cookie->backing_objects);
+
+	/* done */
+	cache->ops->unlock_object(object);
+	_leave(" = %p [new]", object);
+	return object;
+
+} /* end fscache_lookup_object() */
+
+/*****************************************************************************/
+/*
+ * request a cookie to represent an object (index, datafile, xattr, etc)
+ * - parent specifies the parent object
+ *   - the top level index cookie for each netfs is stored in the fscache_netfs
+ *     struct upon registration
+ * - idef points to the definition
+ * - the netfs_data will be passed to the functions pointed to in *def
+ * - all attached caches will be searched to see if they contain this object
+ * - index objects aren't stored on disk until there's a dependent file that
+ *   needs storing
+ * - other objects are stored in a selected cache immediately, and all the
+ *   indexes forming the path to it are instantiated if necessary
+ * - we never let on to the netfs about errors
+ *   - we may set a negative cookie pointer, but that's okay
+ */
+struct fscache_cookie *__fscache_acquire_cookie(struct fscache_cookie *parent,
+						struct fscache_cookie_def *def,
+						void *netfs_data)
+{
+	struct fscache_cookie *cookie;
+	struct fscache_cache *cache;
+	struct fscache_object *object;
+	int ret = 0;
+
+	BUG_ON(!def);
+
+	_enter("{%s},{%s},%p",
+	       parent ? (char *) parent->def->name : "<no-parent>",
+	       def->name, netfs_data);
+
+	/* if there's no parent cookie, then we don't create one here either */
+	if (parent == FSCACHE_NEGATIVE_COOKIE) {
+		_leave(" [no parent]");
+		return FSCACHE_NEGATIVE_COOKIE;
+	}
+
+	/* validate the definition */
+	BUG_ON(!def->get_key);
+	BUG_ON(!def->name[0]);
+
+	BUG_ON(def->type == FSCACHE_COOKIE_TYPE_INDEX &&
+	       parent->def->type != FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* allocate and initialise a cookie */
+	cookie = kmem_cache_alloc(fscache_cookie_jar, SLAB_KERNEL);
+	if (!cookie) {
+		_leave(" [ENOMEM]");
+		return FSCACHE_NEGATIVE_COOKIE;
+	}
+
+	atomic_set(&cookie->usage, 1);
+	atomic_set(&cookie->children, 0);
+
+	atomic_inc(&parent->usage);
+	atomic_inc(&parent->children);
+
+	cookie->def		= def;
+	cookie->parent		= parent;
+	cookie->netfs		= parent->netfs;
+	cookie->netfs_data	= netfs_data;
+
+	/* now we need to see whether the backing objects for this cookie yet
+	 * exist, if not there'll be nothing to search */
+	down_read(&fscache_addremove_sem);
+
+	if (list_empty(&fscache_cache_list)) {
+		up_read(&fscache_addremove_sem);
+		_leave(" = %p [no caches]", cookie);
+		return cookie;
+	}
+
+	/* if the object is an index then we need do nothing more here - we
+	 * create indexes on disk when we need them as an index may exist in
+	 * multiple caches */
+	if (cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX) {
+		down_write(&cookie->sem);
+
+		/* the object is a file - we need to select a cache in which to
+		 * store it */
+		cache = fscache_select_cache_for_object(cookie);
+		if (!cache)
+			goto no_cache; /* couldn't decide on a cache */
+
+		/* create a file index entry on disc, along with all the
+		 * indexes required to find it again later */
+		object = fscache_lookup_object(cookie, cache);
+		if (IS_ERR(object)) {
+			ret = PTR_ERR(object);
+			goto error;
+		}
+
+		up_write(&cookie->sem);
+	}
+out:
+	up_read(&fscache_addremove_sem);
+	_leave(" = %p", cookie);
+	return cookie;
+
+no_cache:
+	ret = -ENOMEDIUM;
+	goto error_cleanup;
+error:
+	printk(KERN_ERR "FS-Cache: error from cache: %d\n", ret);
+error_cleanup:
+	if (cookie) {
+		up_write(&cookie->sem);
+		__fscache_cookie_put(cookie);
+		cookie = FSCACHE_NEGATIVE_COOKIE;
+		atomic_dec(&parent->children);
+	}
+
+	goto out;
+
+} /* end __fscache_acquire_cookie() */
+
+EXPORT_SYMBOL(__fscache_acquire_cookie);
+
+/*****************************************************************************/
+/*
+ * release a cookie back to the cache
+ * - the object will be marked as recyclable on disc if retire is true
+ * - all dependents of this cookie must have already been unregistered
+ *   (indexes/files/pages)
+ */
+void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
+{
+	struct fscache_cache *cache;
+	struct fscache_object *object;
+	struct hlist_node *_p;
+
+	if (cookie == FSCACHE_NEGATIVE_COOKIE) {
+		_leave(" [no cookie]");
+		return;
+	}
+
+	_enter("%p{%s},%d", cookie, cookie->def->name, retire);
+
+	if (atomic_read(&cookie->children) != 0) {
+		printk("FS-Cache: cookie still has children\n");
+		BUG();
+	}
+
+	/* detach pointers back to the netfs */
+	down_write(&cookie->sem);
+
+	cookie->netfs_data	= NULL;
+	cookie->def		= NULL;
+
+	/* mark retired objects for recycling */
+	if (retire) {
+		hlist_for_each_entry(object, _p,
+				     &cookie->backing_objects,
+				     cookie_link
+				     ) {
+			set_bit(FSCACHE_OBJECT_RECYCLING, &object->flags);
+		}
+	}
+
+	/* break links with all the active objects */
+	while (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object,
+				     cookie_link);
+
+		/* detach each cache object from the object cookie */
+		set_bit(FSCACHE_OBJECT_RELEASING, &object->flags);
+
+		hlist_del_init(&object->cookie_link);
+
+		cache = object->cache;
+		cache->ops->lock_object(object);
+		object->cookie = NULL;
+		cache->ops->unlock_object(object);
+
+		if (atomic_dec_and_test(&cookie->usage))
+			/* the cookie refcount shouldn't be reduced to 0 yet */
+			BUG();
+
+		spin_lock(&cache->object_list_lock);
+		list_del_init(&object->cache_link);
+		spin_unlock(&cache->object_list_lock);
+
+		cache->ops->put_object(object);
+	}
+
+	up_write(&cookie->sem);
+
+	if (cookie->parent) {
+#ifdef CONFIG_DEBUG_SLAB
+		BUG_ON((atomic_read(&cookie->parent->children) & 0xffff0000) == 0x6b6b0000);
+#endif
+		atomic_dec(&cookie->parent->children);
+	}
+
+	/* finally dispose of the cookie */
+	fscache_cookie_put(cookie);
+
+	_leave("");
+
+} /* end __fscache_relinquish_cookie() */
+
+EXPORT_SYMBOL(__fscache_relinquish_cookie);
+
+/*****************************************************************************/
+/*
+ * update the index entries backing a cookie
+ */
+void __fscache_update_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	struct hlist_node *_p;
+
+	if (cookie == FSCACHE_NEGATIVE_COOKIE) {
+		_leave(" [no cookie]");
+		return;
+	}
+
+	_enter("{%s}", cookie->def->name);
+
+	BUG_ON(!cookie->def->get_aux);
+
+	down_write(&cookie->sem);
+	down_read(&cookie->parent->sem);
+
+	/* update the index entry on disc in each cache backing this cookie */
+	hlist_for_each_entry(object, _p,
+			     &cookie->backing_objects, cookie_link
+			     ) {
+		if (!test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			object->cache->ops->update_object(object);
+	}
+
+	up_read(&cookie->parent->sem);
+	up_write(&cookie->sem);
+	_leave("");
+
+} /* end __fscache_update_cookie() */
+
+EXPORT_SYMBOL(__fscache_update_cookie);
+
+/*****************************************************************************/
+/*
+ * destroy a cookie
+ */
+static void __fscache_cookie_put(struct fscache_cookie *cookie)
+{
+	struct fscache_cookie *parent;
+
+	_enter("%p", cookie);
+
+	for (;;) {
+		parent = cookie->parent;
+		BUG_ON(!hlist_empty(&cookie->backing_objects));
+		kmem_cache_free(fscache_cookie_jar, cookie);
+
+		if (!parent)
+			break;
+
+		cookie = parent;
+		BUG_ON(atomic_read(&cookie->usage) <= 0);
+		if (!atomic_dec_and_test(&cookie->usage))
+			break;
+	}
+
+	_leave("");
+
+} /* end __fscache_cookie_put() */
+
+/*****************************************************************************/
+/*
+ * initialise a cookie jar slab element prior to any use
+ */
+void fscache_cookie_init_once(void *_cookie, kmem_cache_t *cachep,
+			      unsigned long flags)
+{
+	struct fscache_cookie *cookie = _cookie;
+
+	if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
+	    SLAB_CTOR_CONSTRUCTOR) {
+		memset(cookie, 0, sizeof(*cookie));
+		init_rwsem(&cookie->sem);
+		INIT_HLIST_HEAD(&cookie->backing_objects);
+	}
+
+} /* end fscache_cookie_init_once() */
+
+/*****************************************************************************/
+/*
+ * pin an object into the cache
+ */
+int __fscache_pin_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p", cookie);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" = -ENOBUFS");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * read and write attempts on pages
+	 */
+	down_write(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		if (!object->cache->ops->pin_object) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->pin_object(object);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_write(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_pin_cookie() */
+
+EXPORT_SYMBOL(__fscache_pin_cookie);
+
+/*****************************************************************************/
+/*
+ * unpin an object from the cache
+ */
+void __fscache_unpin_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p", cookie);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" [no obj]");
+		return;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * read and write attempts on pages
+	 */
+	down_write(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and unpin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		if (!object->cache->ops->unpin_object)
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				object->cache->ops->unpin_object(object);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_write(&cookie->sem);
+	_leave("");
+
+} /* end __fscache_unpin_cookie() */
+
+EXPORT_SYMBOL(__fscache_unpin_cookie);
diff --git a/fs/fscache/fscache-int.h b/fs/fscache/fscache-int.h
new file mode 100644
index 0000000..7aca89f
--- /dev/null
+++ b/fs/fscache/fscache-int.h
@@ -0,0 +1,71 @@
+/* fscache-int.h: internal definitions
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _FSCACHE_INT_H
+#define _FSCACHE_INT_H
+
+#include <linux/fscache-cache.h>
+#include <linux/timer.h>
+#include <linux/bio.h>
+
+extern kmem_cache_t *fscache_cookie_jar;
+
+extern struct fscache_cookie fscache_fsdef_index;
+extern struct fscache_cookie_def fscache_fsdef_netfs_def;
+
+extern void fscache_cookie_init_once(void *_cookie, kmem_cache_t *cachep, unsigned long flags);
+
+/*****************************************************************************/
+/*
+ * debug tracing
+ */
+#define dbgprintk(FMT,...) \
+	printk("[%-6.6s] "FMT"\n",current->comm ,##__VA_ARGS__)
+#define _dbprintk(FMT,...) do { } while(0)
+
+#define kenter(FMT,...)	dbgprintk("==> %s("FMT")",__FUNCTION__ ,##__VA_ARGS__)
+#define kleave(FMT,...)	dbgprintk("<== %s()"FMT"",__FUNCTION__ ,##__VA_ARGS__)
+#define kdebug(FMT,...)	dbgprintk(FMT ,##__VA_ARGS__)
+
+#define kjournal(FMT,...) _dbprintk(FMT ,##__VA_ARGS__)
+
+#define dbgfree(ADDR)  _dbprintk("%p:%d: FREEING %p",__FILE__,__LINE__,ADDR)
+
+#define dbgpgalloc(PAGE)						\
+do {									\
+	_dbprintk("PGALLOC %s:%d: %p {%lx,%lu}\n",			\
+		  __FILE__,__LINE__,					\
+		  (PAGE),(PAGE)->mapping->host->i_ino,(PAGE)->index	\
+		  );							\
+} while(0)
+
+#define dbgpgfree(PAGE)						\
+do {								\
+	if ((PAGE))						\
+		_dbprintk("PGFREE %s:%d: %p {%lx,%lu}\n",	\
+			  __FILE__,__LINE__,			\
+			  (PAGE),				\
+			  (PAGE)->mapping->host->i_ino,		\
+			  (PAGE)->index				\
+			  );					\
+} while(0)
+
+#ifdef __KDEBUG
+#define _enter(FMT,...)	kenter(FMT,##__VA_ARGS__)
+#define _leave(FMT,...)	kleave(FMT,##__VA_ARGS__)
+#define _debug(FMT,...)	kdebug(FMT,##__VA_ARGS__)
+#else
+#define _enter(FMT,...)	do { } while(0)
+#define _leave(FMT,...)	do { } while(0)
+#define _debug(FMT,...)	do { } while(0)
+#endif
+
+#endif /* _FSCACHE_INT_H */
diff --git a/fs/fscache/fsdef.c b/fs/fscache/fsdef.c
new file mode 100644
index 0000000..75a0d45
--- /dev/null
+++ b/fs/fscache/fsdef.c
@@ -0,0 +1,113 @@
+/* fsdef.c: filesystem index definition
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include "fscache-int.h"
+
+static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax);
+
+static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax);
+
+static fscache_checkaux_t fscache_fsdef_netfs_check_aux(void *cookie_netfs_data,
+							const void *data,
+							uint16_t datalen);
+
+struct fscache_cookie_def fscache_fsdef_netfs_def = {
+	.name		= "FSDEF.netfs",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= fscache_fsdef_netfs_get_key,
+	.get_aux	= fscache_fsdef_netfs_get_aux,
+	.check_aux	= fscache_fsdef_netfs_check_aux,
+};
+
+struct fscache_cookie fscache_fsdef_index = {
+	.usage		= ATOMIC_INIT(1),
+	.def		= NULL,
+	.sem		= __RWSEM_INITIALIZER(fscache_fsdef_index.sem),
+	.backing_objects = HLIST_HEAD_INIT,
+};
+
+EXPORT_SYMBOL(fscache_fsdef_index);
+
+/*****************************************************************************/
+/*
+ * get the key data for an FSDEF index record
+ */
+static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
+{
+	const struct fscache_netfs *netfs = cookie_netfs_data;
+	unsigned klen;
+
+	_enter("{%s.%u},", netfs->name, netfs->version);
+
+	klen = strlen(netfs->name);
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, netfs->name, klen);
+	return klen;
+
+} /* end fscache_fsdef_netfs_get_key() */
+
+/*****************************************************************************/
+/*
+ * get the auxiliary data for an FSDEF index record
+ */
+static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
+{
+	const struct fscache_netfs *netfs = cookie_netfs_data;
+	unsigned dlen;
+
+	_enter("{%s.%u},", netfs->name, netfs->version);
+
+	dlen = sizeof(uint32_t);
+	if (dlen > bufmax)
+		return 0;
+
+	memcpy(buffer, &netfs->version, dlen);
+	return dlen;
+
+} /* end fscache_fsdef_netfs_get_aux() */
+
+/*****************************************************************************/
+/*
+ * check that the version stored in the auxiliary data is correct
+ */
+static fscache_checkaux_t fscache_fsdef_netfs_check_aux(void *cookie_netfs_data,
+							const void *data,
+							uint16_t datalen)
+{
+	struct fscache_netfs *netfs = cookie_netfs_data;
+	uint32_t version;
+
+	_enter("{%s},,%hu", netfs->name, datalen);
+
+	if (datalen != sizeof(version)) {
+		_leave(" = OBSOLETE [dl=%d v=%zu]",
+		       datalen, sizeof(version));
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	memcpy(&version, data, sizeof(version));
+	if (version != netfs->version) {
+		_leave(" = OBSOLETE [ver=%x net=%x]",
+		       version, netfs->version);
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	_leave(" = OKAY");
+	return FSCACHE_CHECKAUX_OKAY;
+
+} /* end fscache_fsdef_netfs_check_aux() */
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
new file mode 100644
index 0000000..613c3b1
--- /dev/null
+++ b/fs/fscache/main.c
@@ -0,0 +1,150 @@
+/* main.c: general filesystem caching manager
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include "fscache-int.h"
+
+int fscache_debug = 0;
+
+static int fscache_init(void);
+static void fscache_exit(void);
+
+fs_initcall(fscache_init);
+module_exit(fscache_exit);
+
+MODULE_DESCRIPTION("FS Cache Manager");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+static void fscache_ktype_release(struct kobject *kobject);
+
+static struct sysfs_ops fscache_sysfs_ops = {
+	.show		= NULL,
+	.store		= NULL,
+};
+
+static struct kobj_type fscache_ktype = {
+	.release	= fscache_ktype_release,
+	.sysfs_ops	= &fscache_sysfs_ops,
+	.default_attrs	= NULL,
+};
+
+struct kset fscache_kset = {
+	.kobj.name	= "fscache",
+	.kobj.kset	= &fs_subsys.kset,
+	.ktype		= &fscache_ktype,
+};
+
+EXPORT_SYMBOL(fscache_kset);
+
+/*****************************************************************************/
+/*
+ * initialise the fs caching module
+ */
+static int fscache_init(void)
+{
+	int ret;
+
+	fscache_cookie_jar =
+		kmem_cache_create("fscache_cookie_jar",
+				  sizeof(struct fscache_cookie),
+				  0,
+				  0,
+				  fscache_cookie_init_once,
+				  NULL);
+
+	if (!fscache_cookie_jar) {
+		printk(KERN_NOTICE
+		       "FS-Cache: Failed to allocate a cookie jar\n");
+		return -ENOMEM;
+	}
+
+	ret = kset_register(&fscache_kset);
+	if (ret < 0) {
+		kmem_cache_destroy(fscache_cookie_jar);
+		return ret;
+	}
+
+	printk(KERN_NOTICE "FS-Cache: Loaded\n");
+	return 0;
+
+} /* end fscache_init() */
+
+/*****************************************************************************/
+/*
+ * clean up on module removal
+ */
+static void __exit fscache_exit(void)
+{
+	_enter("");
+
+	kset_unregister(&fscache_kset);
+	kmem_cache_destroy(fscache_cookie_jar);
+	printk(KERN_NOTICE "FS-Cache: unloaded\n");
+
+} /* end fscache_exit() */
+
+/*****************************************************************************/
+/*
+ * release the ktype
+ */
+static void fscache_ktype_release(struct kobject *kobject)
+{
+} /* end fscache_ktype_release() */
+
+/*****************************************************************************/
+/*
+ * clear the dead space between task_struct and kernel stack
+ * - called by supplying -finstrument-functions to gcc
+ */
+#if 0
+void __cyg_profile_func_enter (void *this_fn, void *call_site)
+__attribute__((no_instrument_function));
+
+void __cyg_profile_func_enter (void *this_fn, void *call_site)
+{
+       asm volatile("  movl    %%esp,%%edi     \n"
+                    "  andl    %0,%%edi        \n"
+                    "  addl    %1,%%edi        \n"
+                    "  movl    %%esp,%%ecx     \n"
+                    "  subl    %%edi,%%ecx     \n"
+                    "  shrl    $2,%%ecx        \n"
+                    "  movl    $0xedededed,%%eax     \n"
+                    "  rep stosl               \n"
+                    :
+                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
+                    : "eax", "ecx", "edi", "memory", "cc"
+                    );
+}
+
+void __cyg_profile_func_exit(void *this_fn, void *call_site)
+__attribute__((no_instrument_function));
+
+void __cyg_profile_func_exit(void *this_fn, void *call_site)
+{
+       asm volatile("  movl    %%esp,%%edi     \n"
+                    "  andl    %0,%%edi        \n"
+                    "  addl    %1,%%edi        \n"
+                    "  movl    %%esp,%%ecx     \n"
+                    "  subl    %%edi,%%ecx     \n"
+                    "  shrl    $2,%%ecx        \n"
+                    "  movl    $0xdadadada,%%eax     \n"
+                    "  rep stosl               \n"
+                    :
+                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
+                    : "eax", "ecx", "edi", "memory", "cc"
+                    );
+}
+#endif
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
new file mode 100644
index 0000000..b197be7
--- /dev/null
+++ b/fs/fscache/page.c
@@ -0,0 +1,548 @@
+/* page.c: general filesystem cache page I/O management
+ *
+ * Copyright (C) 2004-5 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/fscache-cache.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include "fscache-int.h"
+
+/*****************************************************************************/
+/*
+ * set the data file size on an object in the cache
+ */
+int __fscache_set_i_size(struct fscache_cookie *cookie, loff_t i_size)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,%llu,", cookie, i_size);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" = -ENOBUFS");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * read and write attempts on pages
+	 */
+	down_write(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (object->cache->ops->set_i_size &&
+		    down_read_trylock(&object->cache->withdrawal_sem)
+		    ) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->set_i_size(object,
+								     i_size);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_write(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_set_i_size() */
+
+EXPORT_SYMBOL(__fscache_set_i_size);
+
+/*****************************************************************************/
+/*
+ * reserve space for an object
+ */
+int __fscache_reserve_space(struct fscache_cookie *cookie, loff_t size)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,%llu,", cookie, size);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" = -ENOBUFS");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * read and write attempts on pages
+	 */
+	down_write(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		if (!object->cache->ops->reserve_space) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->reserve_space(object,
+									size);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_write(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_reserve_space() */
+
+EXPORT_SYMBOL(__fscache_reserve_space);
+
+/*****************************************************************************/
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - we return:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -EINTR	- interrupted
+ *   -ENOBUFS	- no backing object available in which to cache the block
+ *   -ENODATA	- no data available in the backing object for this block
+ *   0		- dispatched a read - it'll call end_io_func() when finished
+ */
+int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+				 struct page *page,
+				 fscache_rw_complete_t end_io_func,
+				 void *end_io_data,
+				 gfp_t gfp)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,{%lu},", cookie, page->index);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" -ENOBUFS [no backing objects]");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it */
+	down_read(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->read_or_alloc_page(
+					object,
+					page,
+					end_io_func,
+					end_io_data,
+					gfp);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_read(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_read_or_alloc_page() */
+
+EXPORT_SYMBOL(__fscache_read_or_alloc_page);
+
+/*****************************************************************************/
+/*
+ * read a list of pages from the cache or allocate blocks in which to store
+ * them
+ * - we return:
+ *   -ENOMEM	- out of memory, some pages may be being read
+ *   -EINTR	- interrupted, some pages may be being read
+ *   -ENOBUFS	- no backing object or space available in which to cache any
+ *                pages not being read
+ *   -ENODATA	- no data available in the backing object for some or all of
+ *                the pages
+ *   0		- dispatched a read on all pages
+ *
+ * end_io_func() will be called for each page read from the cache as it
+ * finishes being read
+ *
+ * any pages for which a read is dispatched will be removed from pages and
+ * nr_pages
+ */
+int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+				  struct address_space *mapping,
+				  struct list_head *pages,
+				  unsigned *nr_pages,
+				  fscache_rw_complete_t end_io_func,
+				  void *end_io_data,
+				  gfp_t gfp)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,,%d,,,", cookie, *nr_pages);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" -ENOBUFS [no backing objects]");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+	BUG_ON(list_empty(pages));
+	BUG_ON(*nr_pages <= 0);
+
+	/* prevent the file from being uncached whilst we access it */
+	down_read(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->read_or_alloc_pages(
+					object,
+					mapping,
+					pages,
+					nr_pages,
+					end_io_func,
+					end_io_data,
+					gfp);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_read(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_read_or_alloc_pages() */
+
+EXPORT_SYMBOL(__fscache_read_or_alloc_pages);
+
+/*****************************************************************************/
+/*
+ * allocate a block in the cache in which to store a page
+ * - we return:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -EINTR	- interrupted
+ *   -ENOBUFS	- no backing object available in which to cache the block
+ *   0		- block allocated
+ */
+int __fscache_alloc_page(struct fscache_cookie *cookie,
+			 struct page *page,
+			 gfp_t gfp)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,{%lu},", cookie, page->index);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" -ENOBUFS [no backing objects]");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it */
+	down_read(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->allocate_page(object,
+									page,
+									gfp);
+
+				object->cache->ops->put_object(object);
+			}
+
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_read(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_alloc_page() */
+
+EXPORT_SYMBOL(__fscache_alloc_page);
+
+/*****************************************************************************/
+/*
+ * request a page be stored in the cache
+ * - returns:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -EINTR	- interrupted
+ *   -ENOBUFS	- no backing object available in which to cache the page
+ *   0		- dispatched a write - it'll call end_io_func() when finished
+ */
+int __fscache_write_page(struct fscache_cookie *cookie,
+			 struct page *page,
+			 fscache_rw_complete_t end_io_func,
+			 void *end_io_data,
+			 gfp_t gfp)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,{%lu},", cookie, page->index);
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we deal with it */
+	down_read(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			/* ask the cache to honour the operation */
+			ret = object->cache->ops->write_page(object,
+							     page,
+							     end_io_func,
+							     end_io_data,
+							     gfp);
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_read(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_write_page() */
+
+EXPORT_SYMBOL(__fscache_write_page);
+
+/*****************************************************************************/
+/*
+ * request several pages be stored in the cache
+ * - returns:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -EINTR	- interrupted
+ *   -ENOBUFS	- no backing object available in which to cache the page
+ *   0		- dispatched a write - it'll call end_io_func() when finished
+ */
+int __fscache_write_pages(struct fscache_cookie *cookie,
+			  struct pagevec *pagevec,
+			  fscache_rw_complete_t end_io_func,
+			  void *end_io_data,
+			  gfp_t gfp)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,{%ld},", cookie, pagevec->nr);
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we deal with it */
+	down_read(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			/* ask the cache to honour the operation */
+			ret = object->cache->ops->write_pages(object,
+							      pagevec,
+							      end_io_func,
+							      end_io_data,
+							      gfp);
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+out:
+	up_read(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end __fscache_write_pages() */
+
+EXPORT_SYMBOL(__fscache_write_pages);
+
+/*****************************************************************************/
+/*
+ * remove a page from the cache
+ */
+void __fscache_uncache_page(struct fscache_cookie *cookie, struct page *page)
+{
+	struct fscache_object *object;
+	struct pagevec pagevec;
+
+	_enter(",{%lu}", page->index);
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" [no backing]");
+		return;
+	}
+
+	pagevec_init(&pagevec, 0);
+	pagevec_add(&pagevec, page);
+
+	/* ask the cache to honour the operation */
+	down_read(&cookie->sem);
+
+	if (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			object->cache->ops->uncache_pages(object, &pagevec);
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+	up_read(&cookie->sem);
+
+	_leave("");
+	return;
+
+} /* end __fscache_uncache_page() */
+
+EXPORT_SYMBOL(__fscache_uncache_page);
+
+/*****************************************************************************/
+/*
+ * remove a bunch of pages from the cache
+ */
+void __fscache_uncache_pages(struct fscache_cookie *cookie,
+			     struct pagevec *pagevec)
+{
+	struct fscache_object *object;
+
+	_enter(",{%ld}", pagevec->nr);
+
+	BUG_ON(pagevec->nr <= 0);
+	BUG_ON(!pagevec->pages[0]);
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" [no backing]");
+		return;
+	}
+
+	/* ask the cache to honour the operation */
+	down_read(&cookie->sem);
+
+	if (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		/* prevent the cache from being withdrawn */
+		if (down_read_trylock(&object->cache->withdrawal_sem)) {
+			object->cache->ops->uncache_pages(object, pagevec);
+			up_read(&object->cache->withdrawal_sem);
+		}
+	}
+
+	up_read(&cookie->sem);
+
+	_leave("");
+	return;
+
+} /* end __fscache_uncache_pages() */
+
+EXPORT_SYMBOL(__fscache_uncache_pages);
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
new file mode 100644
index 0000000..b3971ec
--- /dev/null
+++ b/include/linux/fscache-cache.h
@@ -0,0 +1,220 @@
+/* fscache-cache.h: general filesystem caching backing cache interface
+ *
+ * Copyright (C) 2004-6 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FSCACHE_CACHE_H
+#define _LINUX_FSCACHE_CACHE_H
+
+#include <linux/fscache.h>
+
+#define NR_MAXCACHES BITS_PER_LONG
+
+struct fscache_cache;
+struct fscache_cache_ops;
+struct fscache_object;
+
+/*
+ * cache tag definition
+ */
+struct fscache_cache_tag {
+	struct list_head		link;
+	struct fscache_cache		*cache;		/* cache referred to by this tag */
+	atomic_t			usage;
+	char				name[0];	/* tag name */
+};
+
+/*
+ * cache definition
+ */
+struct fscache_cache {
+	struct fscache_cache_ops	*ops;
+	struct fscache_cache_tag	*tag;		/* tag representing this cache */
+	struct list_head		link;		/* link in list of caches */
+	struct rw_semaphore		withdrawal_sem;	/* withdrawal control sem */
+	size_t				max_index_size;	/* maximum size of index data */
+	char				identifier[32];	/* cache label */
+
+	/* node management */
+	struct list_head		object_list;	/* list of data/index objects */
+	spinlock_t			object_list_lock;
+	struct fscache_object		*fsdef;		/* object for the fsdef index */
+	unsigned long			flags;
+#define FSCACHE_IOERROR			0	/* cache stopped on I/O error */
+};
+
+extern void fscache_init_cache(struct fscache_cache *cache,
+			       struct fscache_cache_ops *ops,
+			       const char *idfmt,
+			       ...) __attribute__ ((format (printf,3,4)));
+
+extern int fscache_add_cache(struct fscache_cache *cache,
+			     struct fscache_object *fsdef,
+			     const char *tagname);
+extern void fscache_withdraw_cache(struct fscache_cache *cache);
+
+extern void fscache_io_error(struct fscache_cache *cache);
+
+/*****************************************************************************/
+/*
+ * cache operations
+ */
+struct fscache_cache_ops {
+	/* name of cache provider */
+	const char *name;
+
+	/* look up the object for a cookie, creating it on disc if necessary */
+	struct fscache_object *(*lookup_object)(struct fscache_cache *cache,
+						struct fscache_object *parent,
+						struct fscache_cookie *cookie);
+
+	/* increment the usage count on this object (may fail if unmounting) */
+	struct fscache_object *(*grab_object)(struct fscache_object *object);
+
+	/* lock a semaphore on an object */
+	void (*lock_object)(struct fscache_object *object);
+
+	/* unlock a semaphore on an object */
+	void (*unlock_object)(struct fscache_object *object);
+
+	/* pin an object in the cache */
+	int (*pin_object)(struct fscache_object *object);
+
+	/* unpin an object in the cache */
+	void (*unpin_object)(struct fscache_object *object);
+
+	/* store the updated auxiliary data on an object */
+	void (*update_object)(struct fscache_object *object);
+
+	/* dispose of a reference to an object */
+	void (*put_object)(struct fscache_object *object);
+
+	/* sync a cache */
+	void (*sync_cache)(struct fscache_cache *cache);
+
+	/* set the data size of an object */
+	int (*set_i_size)(struct fscache_object *object, loff_t i_size);
+
+	/* reserve space for an object's data and associated metadata */
+	int (*reserve_space)(struct fscache_object *object, loff_t i_size);
+
+	/* request a backing block for a page be read or allocated in the
+	 * cache */
+	int (*read_or_alloc_page)(struct fscache_object *object,
+				  struct page *page,
+				  fscache_rw_complete_t end_io_func,
+				  void *end_io_data,
+				  unsigned long gfp);
+
+	/* request backing blocks for a list of pages be read or allocated in
+	 * the cache */
+	int (*read_or_alloc_pages)(struct fscache_object *object,
+				   struct address_space *mapping,
+				   struct list_head *pages,
+				   unsigned *nr_pages,
+				   fscache_rw_complete_t end_io_func,
+				   void *end_io_data,
+				   unsigned long gfp);
+
+	/* request a backing block for a page be allocated in the cache so that
+	 * it can be written directly */
+	int (*allocate_page)(struct fscache_object *object,
+			     struct page *page,
+			     unsigned long gfp);
+
+	/* write a page to its backing block in the cache */
+	int (*write_page)(struct fscache_object *object,
+			  struct page *page,
+			  fscache_rw_complete_t end_io_func,
+			  void *end_io_data,
+			  unsigned long gfp);
+
+	/* write several pages to their backing blocks in the cache */
+	int (*write_pages)(struct fscache_object *object,
+			   struct pagevec *pagevec,
+			   fscache_rw_complete_t end_io_func,
+			   void *end_io_data,
+			   unsigned long gfp);
+
+	/* detach backing block from a bunch of pages */
+	void (*uncache_pages)(struct fscache_object *object,
+			     struct pagevec *pagevec);
+
+	/* dissociate a cache from all the pages it was backing */
+	void (*dissociate_pages)(struct fscache_cache *cache);
+};
+
+/*****************************************************************************/
+/*
+ * data file or index object cookie
+ * - a file will only appear in one cache
+ * - a request to cache a file may or may not be honoured, subject to
+ *   constraints such as disc space
+ * - index files are created on disc just-in-time
+ */
+struct fscache_cookie {
+	atomic_t			usage;		/* number of users of this cookie */
+	atomic_t			children;	/* number of children of this cookie */
+	struct rw_semaphore		sem;		/* list creation vs scan lock */
+	struct hlist_head		backing_objects; /* object(s) backing this file/index */
+	struct fscache_cookie_def	*def;		/* definition */
+	struct fscache_cookie		*parent;	/* parent of this entry */
+	struct fscache_netfs		*netfs;		/* owner network fs definition */
+	void				*netfs_data;	/* back pointer to netfs */
+};
+
+extern struct fscache_cookie fscache_fsdef_index;
+
+/*****************************************************************************/
+/*
+ * on-disc cache file or index handle
+ */
+struct fscache_object {
+	unsigned long			flags;
+#define FSCACHE_OBJECT_RELEASING	0	/* T if object is being released */
+#define FSCACHE_OBJECT_RECYCLING	1	/* T if object is being retired */
+#define FSCACHE_OBJECT_WITHDRAWN	2	/* T if object has been withdrawn */
+
+	struct list_head		cache_link;	/* link in cache->object_list */
+	struct hlist_node		cookie_link;	/* link in cookie->backing_objects */
+	struct fscache_cache		*cache;		/* cache that supplied this object */
+	struct fscache_cookie		*cookie;	/* netfs's file/index object */
+};
+
+static inline
+void fscache_object_init(struct fscache_object *object)
+{
+	object->flags = 0;
+	INIT_LIST_HEAD(&object->cache_link);
+	INIT_HLIST_NODE(&object->cookie_link);
+	object->cache = NULL;
+	object->cookie = NULL;
+}
+
+/* find the parent index object for an object */
+static inline
+struct fscache_object *fscache_find_parent_object(struct fscache_object *object)
+{
+	struct fscache_object *parent;
+	struct fscache_cookie *cookie = object->cookie;
+	struct fscache_cache *cache = object->cache;
+	struct hlist_node *_p;
+
+	hlist_for_each_entry(parent, _p,
+			     &cookie->parent->backing_objects,
+			     cookie_link
+			     ) {
+		if (parent->cache == cache)
+			return parent;
+	}
+
+	return NULL;
+}
+
+#endif /* _LINUX_FSCACHE_CACHE_H */
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
new file mode 100644
index 0000000..8aa464b
--- /dev/null
+++ b/include/linux/fscache.h
@@ -0,0 +1,484 @@
+/* fscache.h: general filesystem caching interface
+ *
+ * Copyright (C) 2004-5 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FSCACHE_H
+#define _LINUX_FSCACHE_H
+
+#include <linux/config.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/pagemap.h>
+#include <linux/pagevec.h>
+
+#ifdef CONFIG_FSCACHE_MODULE
+#define CONFIG_FSCACHE
+#endif
+
+struct pagevec;
+struct fscache_cache_tag;
+struct fscache_cookie;
+struct fscache_netfs;
+struct fscache_netfs_operations;
+
+#define FSCACHE_NEGATIVE_COOKIE ((struct fscache_cookie *) NULL)
+
+typedef void (*fscache_rw_complete_t)(struct page *page,
+				      void *data,
+				      int error);
+
+/* result of index entry consultation */
+typedef enum {
+	FSCACHE_CHECKAUX_OKAY,		/* entry okay as is */
+	FSCACHE_CHECKAUX_NEEDS_UPDATE,	/* entry requires update */
+	FSCACHE_CHECKAUX_OBSOLETE,	/* entry requires deletion */
+} fscache_checkaux_t;
+
+/*****************************************************************************/
+/*
+ * fscache cookie definition
+ */
+struct fscache_cookie_def
+{
+	/* name of cookie type */
+	char name[16];
+
+	/* cookie type */
+	uint8_t type;
+#define FSCACHE_COOKIE_TYPE_INDEX	0
+#define FSCACHE_COOKIE_TYPE_DATAFILE	1
+
+	/* select the cache into which to insert an entry in this index
+	 * - optional
+	 * - should return a cache identifier or NULL to cause the cache to be
+	 *   inherited from the parent if possible or the first cache picked
+	 *   for a non-index file if not
+	 */
+	struct fscache_cache_tag *(*select_cache)(const void *parent_netfs_data,
+						  const void *cookie_netfs_data);
+
+	/* get an index key
+	 * - should store the key data in the buffer
+	 * - should return the amount of data stored
+	 * - not permitted to return an error
+	 * - the netfs data from the cookie being used as the source is
+	 *   presented
+	 */
+	uint16_t (*get_key)(const void *cookie_netfs_data,
+			    void *buffer,
+			    uint16_t bufmax);
+
+	/* get certain file attributes from the netfs data
+	 * - this function can be absent for an index
+	 * - not permitted to return an error
+	 * - the netfs data from the cookie being used as the source is
+	 *   presented
+	 */
+	void (*get_attr)(const void *cookie_netfs_data, uint64_t *size);
+
+	/* get the auxiliary data from netfs data
+	 * - this function can be absent if the index carries no state data
+	 * - should store the auxilliary data in the buffer
+	 * - should return the amount of data stored
+	 * - not permitted to return an error
+	 * - the netfs data from the cookie being used as the source is
+	 *   presented
+	 */
+	uint16_t (*get_aux)(const void *cookie_netfs_data,
+			    void *buffer,
+			    uint16_t bufmax);
+
+	/* consult the netfs about the state of an object
+	 * - this function can be absent if the index carries no state data
+	 * - the netfs data from the cookie being used as the target is
+	 *   presented, as is the auxiliary data
+	 */
+	fscache_checkaux_t (*check_aux)(void *cookie_netfs_data,
+					const void *data,
+					uint16_t datalen);
+
+	/* indicate pages that now have cache metadata retained
+	 * - this function should mark the specified pages as now being cached
+	 */
+	void (*mark_pages_cached)(void *cookie_netfs_data,
+				  struct address_space *mapping,
+				  struct pagevec *cached_pvec);
+
+	/* indicate the cookie is now uncached
+	 * - this function is called when the backing store currently caching
+	 *   a cookie is removed
+	 * - the netfs should use this to clean up any markers indicating
+	 *   cached pages
+	 * - this is mandatory for any object that may have data
+	 */
+	void (*now_uncached)(void *cookie_netfs_data);
+};
+
+/* pattern used to fill dead space in an index entry */
+#define FSCACHE_INDEX_DEADFILL_PATTERN 0x79
+
+#ifdef CONFIG_FSCACHE
+extern struct fscache_cookie *__fscache_acquire_cookie(struct fscache_cookie *parent,
+						       struct fscache_cookie_def *def,
+						       void *netfs_data);
+
+extern void __fscache_relinquish_cookie(struct fscache_cookie *cookie,
+					int retire);
+
+extern void __fscache_update_cookie(struct fscache_cookie *cookie);
+#endif
+
+static inline
+struct fscache_cookie *fscache_acquire_cookie(struct fscache_cookie *parent,
+					      struct fscache_cookie_def *def,
+					      void *netfs_data)
+{
+#ifdef CONFIG_FSCACHE
+	if (parent != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_acquire_cookie(parent, def, netfs_data);
+#endif
+	return FSCACHE_NEGATIVE_COOKIE;
+}
+
+static inline
+void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+			       int retire)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		__fscache_relinquish_cookie(cookie, retire);
+#endif
+}
+
+static inline
+void fscache_update_cookie(struct fscache_cookie *cookie)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		__fscache_update_cookie(cookie);
+#endif
+}
+
+/*****************************************************************************/
+/*
+ * pin or unpin a cookie in a cache
+ * - only available for data cookies
+ */
+#ifdef CONFIG_FSCACHE
+extern int __fscache_pin_cookie(struct fscache_cookie *cookie);
+extern void __fscache_unpin_cookie(struct fscache_cookie *cookie);
+#endif
+
+static inline
+int fscache_pin_cookie(struct fscache_cookie *cookie)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_pin_cookie(cookie);
+#endif
+	return -ENOBUFS;
+}
+
+static inline
+void fscache_unpin_cookie(struct fscache_cookie *cookie)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		__fscache_unpin_cookie(cookie);
+#endif
+}
+
+/*****************************************************************************/
+/*
+ * fscache cached network filesystem type
+ * - name, version and ops must be filled in before registration
+ * - all other fields will be set during registration
+ */
+struct fscache_netfs
+{
+	uint32_t			version;	/* indexing version */
+	const char			*name;		/* filesystem name */
+	struct fscache_cookie		*primary_index;
+	struct fscache_netfs_operations	*ops;
+	struct list_head		link;		/* internal link */
+};
+
+struct fscache_netfs_operations
+{
+};
+
+#ifdef CONFIG_FSCACHE
+extern int __fscache_register_netfs(struct fscache_netfs *netfs);
+extern void __fscache_unregister_netfs(struct fscache_netfs *netfs);
+#endif
+
+static inline
+int fscache_register_netfs(struct fscache_netfs *netfs)
+{
+#ifdef CONFIG_FSCACHE
+	return __fscache_register_netfs(netfs);
+#else
+	return 0;
+#endif
+}
+
+static inline
+void fscache_unregister_netfs(struct fscache_netfs *netfs)
+{
+#ifdef CONFIG_FSCACHE
+	__fscache_unregister_netfs(netfs);
+#endif
+}
+
+/*****************************************************************************/
+/*
+ * look up a cache tag
+ * - cache tags are used to select specific caches in which to cache indexes
+ */
+#ifdef CONFIG_FSCACHE
+extern struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *name);
+extern void __fscache_release_cache_tag(struct fscache_cache_tag *tag);
+#endif
+
+static inline
+struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name)
+{
+#ifdef CONFIG_FSCACHE
+	return __fscache_lookup_cache_tag(name);
+#else
+	return NULL;
+#endif
+}
+
+static inline
+void fscache_release_cache_tag(struct fscache_cache_tag *tag)
+{
+#ifdef CONFIG_FSCACHE
+	__fscache_release_cache_tag(tag);
+#endif
+}
+
+/*****************************************************************************/
+/*
+ * set the data size on a cached object
+ * - no pages beyond the end of the object will be accessible
+ * - returns -ENOBUFS if the file is not backed
+ * - returns -ENOSPC if a pinned file of that size can't be stored
+ * - returns 0 if okay
+ */
+#ifdef CONFIG_FSCACHE
+extern int __fscache_set_i_size(struct fscache_cookie *cookie, loff_t i_size);
+#endif
+
+static inline
+int fscache_set_i_size(struct fscache_cookie *cookie, loff_t i_size)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_set_i_size(cookie, i_size);
+#endif
+	return -ENOBUFS;
+}
+
+/*****************************************************************************/
+/*
+ * reserve data space for a cached object
+ * - returns -ENOBUFS if the file is not backed
+ * - returns -ENOSPC if there isn't enough space to honour the reservation
+ * - returns 0 if okay
+ */
+#ifdef CONFIG_FSCACHE
+extern int __fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
+#endif
+
+static inline
+int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_reserve_space(cookie, size);
+#endif
+	return -ENOBUFS;
+}
+
+/*****************************************************************************/
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - if the page is not backed by a file:
+ *   - -ENOBUFS will be returned and nothing more will be done
+ * - else if the page is backed by a block in the cache:
+ *   - a read will be started which will call end_io_func on completion
+ * - else if the page is unbacked:
+ *   - a block will be allocated
+ *   - -ENODATA will be returned
+ */
+#ifdef CONFIG_FSCACHE
+extern int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+					struct page *page,
+					fscache_rw_complete_t end_io_func,
+					void *end_io_data,
+					gfp_t gfp);
+#endif
+
+static inline
+int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+			       struct page *page,
+			       fscache_rw_complete_t end_io_func,
+			       void *end_io_data,
+			       gfp_t gfp)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_read_or_alloc_page(cookie, page, end_io_func,
+						    end_io_data, gfp);
+#endif
+	return -ENOBUFS;
+}
+
+#ifdef CONFIG_FSCACHE
+extern int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+					 struct address_space *mapping,
+					 struct list_head *pages,
+					 unsigned *nr_pages,
+					 fscache_rw_complete_t end_io_func,
+					 void *end_io_data,
+					 gfp_t gfp);
+#endif
+
+static inline
+int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+				struct address_space *mapping,
+				struct list_head *pages,
+				unsigned *nr_pages,
+				fscache_rw_complete_t end_io_func,
+				void *end_io_data,
+				gfp_t gfp)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_read_or_alloc_pages(cookie, mapping, pages,
+						     nr_pages, end_io_func,
+						     end_io_data, gfp);
+#endif
+	return -ENOBUFS;
+}
+
+/*
+ * allocate a block in which to store a page
+ * - if the page is not backed by a file:
+ *   - -ENOBUFS will be returned and nothing more will be done
+ * - else
+ *   - a block will be allocated if there isn't one
+ *   - 0 will be returned
+ */
+#ifdef CONFIG_FSCACHE
+extern int __fscache_alloc_page(struct fscache_cookie *cookie,
+				struct page *page,
+				gfp_t gfp);
+#endif
+
+static inline
+int fscache_alloc_page(struct fscache_cookie *cookie,
+		       struct page *page,
+		       gfp_t gfp)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_alloc_page(cookie, page, gfp);
+#endif
+	return -ENOBUFS;
+}
+
+/*
+ * request a page be stored in the cache
+ * - this request may be ignored if no cache block is currently allocated, in
+ *   which case it:
+ *   - returns -ENOBUFS
+ * - if a cache block was already allocated:
+ *   - a BIO will be dispatched to write the page (end_io_func will be called
+ *     from the completion function)
+ *   - returns 0
+ */
+#ifdef CONFIG_FSCACHE
+extern int __fscache_write_page(struct fscache_cookie *cookie,
+				struct page *page,
+				fscache_rw_complete_t end_io_func,
+				void *end_io_data,
+				gfp_t gfp);
+
+extern int __fscache_write_pages(struct fscache_cookie *cookie,
+				 struct pagevec *pagevec,
+				 fscache_rw_complete_t end_io_func,
+				 void *end_io_data,
+				 gfp_t gfp);
+#endif
+
+static inline
+int fscache_write_page(struct fscache_cookie *cookie,
+		       struct page *page,
+		       fscache_rw_complete_t end_io_func,
+		       void *end_io_data,
+		       gfp_t gfp)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_write_page(cookie, page, end_io_func,
+					    end_io_data, gfp);
+#endif
+	return -ENOBUFS;
+}
+
+static inline
+int fscache_write_pages(struct fscache_cookie *cookie,
+			struct pagevec *pagevec,
+			fscache_rw_complete_t end_io_func,
+			void *end_io_data,
+			gfp_t gfp)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		return __fscache_write_pages(cookie, pagevec, end_io_func,
+					     end_io_data, gfp);
+#endif
+	return -ENOBUFS;
+}
+
+/*
+ * indicate that caching is no longer required on a page
+ * - note: cannot cancel any outstanding BIOs between this page and the cache
+ */
+#ifdef CONFIG_FSCACHE
+extern void __fscache_uncache_page(struct fscache_cookie *cookie,
+				   struct page *page);
+extern void __fscache_uncache_pages(struct fscache_cookie *cookie,
+				    struct pagevec *pagevec);
+#endif
+
+static inline
+void fscache_uncache_page(struct fscache_cookie *cookie,
+			  struct page *page)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		__fscache_uncache_page(cookie, page);
+#endif
+}
+
+static inline
+void fscache_uncache_pagevec(struct fscache_cookie *cookie,
+			     struct pagevec *pagevec)
+{
+#ifdef CONFIG_FSCACHE
+	if (cookie != FSCACHE_NEGATIVE_COOKIE)
+		__fscache_uncache_pages(cookie, pagevec);
+#endif
+}
+
+#endif /* _LINUX_FSCACHE_H */
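
To put the per-page calls declared above in context, a netfs readpage path
would typically drive them roughly as follows.  This is only a minimal sketch:
the myfs_* names and the fallback structure are invented for illustration and
are not part of the patch.

#include <linux/fscache.h>

/* completion handler matching fscache_rw_complete_t */
static void myfs_read_complete(struct page *page, void *data, int error)
{
	if (error)
		SetPageError(page);
	else
		SetPageUptodate(page);
	unlock_page(page);
}

static void myfs_write_complete(struct page *page, void *data, int error)
{
	/* a real netfs would clear its "being written to the cache" marker
	 * for this page here */
}

/* try the cache first; fall back to the server on -ENODATA or -ENOBUFS */
static int myfs_readpage(struct fscache_cookie *cookie, struct page *page)
{
	int ret;

	ret = fscache_read_or_alloc_page(cookie, page, myfs_read_complete,
					 NULL, GFP_KERNEL);
	switch (ret) {
	case 0:
		/* a read was dispatched to the cache; myfs_read_complete()
		 * unlocks the page when the backing I/O finishes */
		return 0;

	case -ENODATA:
		/* a backing block was allocated but holds no data: fetch the
		 * page from the server, then store it in the background */
		/* ... issue server read, mark the page up to date ... */
		fscache_write_page(cookie, page, myfs_write_complete,
				   NULL, GFP_KERNEL);
		return 0;

	default:
		/* -ENOBUFS, -ENOMEM, -EINTR: behave as if uncached */
		return ret;
	}
}

/* when a page is invalidated or released, drop its cache backing */
static void myfs_forget_page(struct fscache_cookie *cookie, struct page *page)
{
	fscache_uncache_page(cookie, page);
}

Note that -ENOBUFS is the "no cache available" answer throughout the
interface, so a netfs can call these functions unconditionally and simply
carry on with its normal server I/O when that is returned.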

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/7] FS-Cache: Make kAFS use FS-Cache
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
                   ` (3 preceding siblings ...)
  2006-04-20 16:59 ` [PATCH 5/7] FS-Cache: Generic filesystem caching facility David Howells
@ 2006-04-20 16:59 ` David Howells
  2006-04-20 16:59 ` [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem David Howells
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches.  The kAFS filesystem will use caching
automatically if it's available.
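In outline, the hookup registers a struct fscache_netfs with the cache manager
at module init time and then acquires index cookies beneath its primary index,
as afs_cell_create() does for cells in the diff below.  A condensed sketch of
the registration side follows; the field values and the exact code in
fs/afs/main.c are illustrative abbreviations, not quotations from the patch:

#ifdef CONFIG_AFS_FSCACHE
static struct fscache_netfs_operations afs_cache_netfs_ops = {
};

struct fscache_netfs afs_cache_netfs = {
	.name		= "afs",	/* illustrative */
	.version	= 0,		/* bump when the index layout changes */
	.ops		= &afs_cache_netfs_ops,
};
#endif

static int __init afs_init(void)	/* heavily abridged */
{
	int ret = 0;

#ifdef CONFIG_AFS_FSCACHE
	/* register with FS-Cache; this fills in primary_index, off which the
	 * cell and other index cookies are then acquired */
	ret = fscache_register_netfs(&afs_cache_netfs);
#endif
	return ret;
}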

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 fs/Kconfig         |    7 +
 fs/afs/cache.h     |   27 ------
 fs/afs/cell.c      |  109 ++++++++++++++---------
 fs/afs/cell.h      |   16 +--
 fs/afs/cmservice.c |    2 
 fs/afs/dir.c       |   15 +--
 fs/afs/file.c      |  240 +++++++++++++++++++++++++++++++++-----------------
 fs/afs/fsclient.c  |    4 +
 fs/afs/inode.c     |   43 ++++++---
 fs/afs/internal.h  |   24 ++---
 fs/afs/main.c      |   24 ++---
 fs/afs/mntpt.c     |   12 +--
 fs/afs/proc.c      |    1 
 fs/afs/server.c    |    3 -
 fs/afs/vlocation.c |  185 ++++++++++++++++++++++++---------------
 fs/afs/vnode.c     |  249 +++++++++++++++++++++++++++++++++++++++++++---------
 fs/afs/vnode.h     |   10 +-
 fs/afs/volume.c    |   78 ++++++----------
 fs/afs/volume.h    |   28 +-----
 19 files changed, 660 insertions(+), 417 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 66acf29..6c95e58 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1816,6 +1816,13 @@ config AFS_FS
 
 	  If unsure, say N.
 
+config AFS_FSCACHE
+	bool "Provide AFS client caching support"
+	depends on AFS_FS && FSCACHE && EXPERIMENTAL
+	help
+	  Say Y here if you want AFS data to be cached locally through the
+	  generic filesystem cache manager.
+
 config RXRPC
 	tristate
 
diff --git a/fs/afs/cache.h b/fs/afs/cache.h
deleted file mode 100644
index 9eb7722..0000000
--- a/fs/afs/cache.h
+++ /dev/null
@@ -1,27 +0,0 @@
-/* cache.h: AFS local cache management interface
- *
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#ifndef _LINUX_AFS_CACHE_H
-#define _LINUX_AFS_CACHE_H
-
-#undef AFS_CACHING_SUPPORT
-
-#include <linux/mm.h>
-#ifdef AFS_CACHING_SUPPORT
-#include <linux/cachefs.h>
-#endif
-#include "types.h"
-
-#ifdef __KERNEL__
-
-#endif /* __KERNEL__ */
-
-#endif /* _LINUX_AFS_CACHE_H */
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index 009a9ae..93a0846 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -31,17 +31,21 @@ static DEFINE_RWLOCK(afs_cells_lock);
 static DECLARE_RWSEM(afs_cells_sem); /* add/remove serialisation */
 static struct afs_cell *afs_cell_root;
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
-						const void *entry);
-static void afs_cell_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_cache_cell_index_def = {
-	.name			= "cell_ix",
-	.data_size		= sizeof(struct afs_cache_cell),
-	.keys[0]		= { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
-	.match			= afs_cell_cache_match,
-	.update			= afs_cell_cache_update,
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_cell_cache_get_key(const void *cookie_netfs_data,
+				       void *buffer, uint16_t buflen);
+static uint16_t afs_cell_cache_get_aux(const void *cookie_netfs_data,
+				       void *buffer, uint16_t buflen);
+static fscache_checkaux_t afs_cell_cache_check_aux(void *cookie_netfs_data,
+						   const void *buffer,
+						   uint16_t buflen);
+
+static struct fscache_cookie_def afs_cell_cache_index_def = {
+	.name		= "AFS cell",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= afs_cell_cache_get_key,
+	.get_aux	= afs_cell_cache_get_aux,
+	.check_aux	= afs_cell_cache_check_aux,
 };
 #endif
 
@@ -115,12 +119,11 @@ int afs_cell_create(const char *name, ch
 	if (ret < 0)
 		goto error;
 
-#ifdef AFS_CACHING_SUPPORT
-	/* put it up for caching */
-	cachefs_acquire_cookie(afs_cache_netfs.primary_index,
-			       &afs_vlocation_cache_index_def,
-			       cell,
-			       &cell->cache);
+#ifdef CONFIG_AFS_FSCACHE
+	/* put it up for caching (this never returns an error) */
+	cell->cache = fscache_acquire_cookie(afs_cache_netfs.primary_index,
+					     &afs_cell_cache_index_def,
+					     cell);
 #endif
 
 	/* add to the cell lists */
@@ -345,8 +348,8 @@ static void afs_cell_destroy(struct afs_
 	list_del_init(&cell->proc_link);
 	up_write(&afs_proc_cells_sem);
 
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(cell->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_relinquish_cookie(cell->cache, 0);
 #endif
 
 	up_write(&afs_cells_sem);
@@ -526,44 +529,62 @@ void afs_cell_purge(void)
 
 /*****************************************************************************/
 /*
- * match a cell record obtained from the cache
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
-						const void *entry)
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_cell_cache_get_key(const void *cookie_netfs_data,
+				       void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_cell *ccell = entry;
-	struct afs_cell *cell = target;
+	const struct afs_cell *cell = cookie_netfs_data;
+	uint16_t klen;
 
-	_enter("{%s},{%s}", ccell->name, cell->name);
+	_enter("%p,%p,%u", cell, buffer, bufmax);
 
-	if (strncmp(ccell->name, cell->name, sizeof(ccell->name)) == 0) {
-		_leave(" = SUCCESS");
-		return CACHEFS_MATCH_SUCCESS;
-	}
+	klen = strlen(cell->name);
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, cell->name, klen);
+	return klen;
 
-	_leave(" = FAILED");
-	return CACHEFS_MATCH_FAILED;
-} /* end afs_cell_cache_match() */
+} /* end afs_cell_cache_get_key() */
 #endif
 
 /*****************************************************************************/
 /*
- * update a cell record in the cache
+ * provide new auxiliary cache data
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_cell_cache_update(void *source, void *entry)
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_cell_cache_get_aux(const void *cookie_netfs_data,
+				       void *buffer, uint16_t bufmax)
 {
-	struct afs_cache_cell *ccell = entry;
-	struct afs_cell *cell = source;
+	const struct afs_cell *cell = cookie_netfs_data;
+	uint16_t dlen;
 
-	_enter("%p,%p", source, entry);
+	_enter("%p,%p,%u", cell, buffer, bufmax);
 
-	strncpy(ccell->name, cell->name, sizeof(ccell->name));
+	dlen = cell->vl_naddrs * sizeof(cell->vl_addrs[0]);
+	dlen = min(dlen, bufmax);
+	dlen &= ~(sizeof(cell->vl_addrs[0]) - 1);
 
-	memcpy(ccell->vl_servers,
-	       cell->vl_addrs,
-	       min(sizeof(ccell->vl_servers), sizeof(cell->vl_addrs)));
+	memcpy(buffer, cell->vl_addrs, dlen);
+
+	return dlen;
+
+} /* end afs_cell_cache_get_aux() */
+#endif
+
+/*****************************************************************************/
+/*
+ * check that the auxiliary data indicates that the entry is still valid
+ */
+#ifdef CONFIG_AFS_FSCACHE
+static fscache_checkaux_t afs_cell_cache_check_aux(void *cookie_netfs_data,
+						   const void *buffer,
+						   uint16_t buflen)
+{
+	_leave(" = OKAY");
+	return FSCACHE_CHECKAUX_OKAY;
 
-} /* end afs_cell_cache_update() */
+} /* end afs_cell_cache_check_aux() */
 #endif
diff --git a/fs/afs/cell.h b/fs/afs/cell.h
index 4834910..d670502 100644
--- a/fs/afs/cell.h
+++ b/fs/afs/cell.h
@@ -13,7 +13,7 @@
 #define _LINUX_AFS_CELL_H
 
 #include "types.h"
-#include "cache.h"
+#include <linux/fscache.h>
 
 #define AFS_CELL_MAX_ADDRS 15
 
@@ -21,16 +21,6 @@ extern volatile int afs_cells_being_purg
 
 /*****************************************************************************/
 /*
- * entry in the cached cell catalogue
- */
-struct afs_cache_cell
-{
-	char			name[64];	/* cell name (padded with NULs) */
-	struct in_addr		vl_servers[15];	/* cached cell VL servers */
-};
-
-/*****************************************************************************/
-/*
  * AFS cell record
  */
 struct afs_cell
@@ -39,8 +29,8 @@ struct afs_cell
 	struct list_head	link;		/* main cell list link */
 	struct list_head	proc_link;	/* /proc cell list link */
 	struct proc_dir_entry	*proc_dir;	/* /proc dir for this cell */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+	struct fscache_cookie	*cache;		/* caching cookie */
 #endif
 
 	/* server record management */
diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index 3d097fd..f87d5a7 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -24,7 +24,7 @@
 #include "internal.h"
 
 static unsigned afscm_usage;		/* AFS cache manager usage count */
-static struct rw_semaphore afscm_sem;	/* AFS cache manager start/stop semaphore */
+static DECLARE_RWSEM(afscm_sem);	/* AFS cache manager start/stop semaphore */
 
 static int afscm_new_call(struct rxrpc_call *call);
 static void afscm_attention(struct rxrpc_call *call);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index a6dff6a..b8e7c32 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -145,7 +145,7 @@ static inline void afs_dir_check_page(st
 	qty /= sizeof(union afs_dir_block);
 
 	/* check them */
-	dbuf = page_address(page);
+	dbuf = kmap_atomic(page, KM_USER0);
 	for (tmp = 0; tmp < qty; tmp++) {
 		if (dbuf->blocks[tmp].pagehdr.magic != AFS_DIR_MAGIC) {
 			printk("kAFS: %s(%lu): bad magic %d/%d is %04hx\n",
@@ -154,12 +154,12 @@ static inline void afs_dir_check_page(st
 			goto error;
 		}
 	}
+	kunmap_atomic(dbuf, KM_USER0);
 
-	SetPageChecked(page);
 	return;
 
  error:
-	SetPageChecked(page);
+	kunmap_atomic(dbuf, KM_USER0);
 	SetPageError(page);
 
 } /* end afs_dir_check_page() */
@@ -170,7 +170,6 @@ static inline void afs_dir_check_page(st
  */
 static inline void afs_dir_put_page(struct page *page)
 {
-	kunmap(page);
 	page_cache_release(page);
 
 } /* end afs_dir_put_page() */
@@ -190,11 +189,9 @@ static struct page *afs_dir_get_page(str
 			       NULL);
 	if (!IS_ERR(page)) {
 		wait_on_page_locked(page);
-		kmap(page);
 		if (!PageUptodate(page))
 			goto fail;
-		if (!PageChecked(page))
-			afs_dir_check_page(dir, page);
+		afs_dir_check_page(dir, page);
 		if (PageError(page))
 			goto fail;
 	}
@@ -359,7 +356,7 @@ static int afs_dir_iterate(struct inode 
 
 		limit = blkoff & ~(PAGE_SIZE - 1);
 
-		dbuf = page_address(page);
+		dbuf = kmap_atomic(page, KM_USER0);
 
 		/* deal with the individual blocks stashed on this page */
 		do {
@@ -368,6 +365,7 @@ static int afs_dir_iterate(struct inode 
 			ret = afs_dir_iterate_block(fpos, dblock, blkoff,
 						    cookie, filldir);
 			if (ret != 1) {
+				kunmap_atomic(dbuf, KM_USER0);
 				afs_dir_put_page(page);
 				goto out;
 			}
@@ -376,6 +374,7 @@ static int afs_dir_iterate(struct inode 
 
 		} while (*fpos < dir->i_size && blkoff < limit);
 
+		kunmap_atomic(dbuf, KM_USER0);
 		afs_dir_put_page(page);
 		ret = 0;
 	}
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 7bb7168..7521cd5 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -16,12 +16,15 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/buffer_head.h>
 #include "volume.h"
 #include "vnode.h"
 #include <rxrpc/call.h>
 #include "internal.h"
 
+#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+
 #if 0
 static int afs_file_open(struct inode *inode, struct file *file);
 static int afs_file_release(struct inode *inode, struct file *file);
@@ -30,30 +33,68 @@ static int afs_file_release(struct inode
 static int afs_file_readpage(struct file *file, struct page *page);
 static void afs_file_invalidatepage(struct page *page, unsigned long offset);
 static int afs_file_releasepage(struct page *page, gfp_t gfp_flags);
+static int afs_file_mmap(struct file * file, struct vm_area_struct * vma);
+
+#ifdef CONFIG_AFS_FSCACHE
+static int afs_file_readpages(struct file *filp, struct address_space *mapping,
+			      struct list_head *pages, unsigned nr_pages);
+static int afs_file_page_mkwrite(struct vm_area_struct *vma, struct page *page);
+#endif
 
 struct inode_operations afs_file_inode_operations = {
 	.getattr	= afs_inode_getattr,
 };
 
+struct file_operations afs_file_file_operations = {
+	.read		= generic_file_read,
+	.mmap		= afs_file_mmap,
+};
+
 struct address_space_operations afs_fs_aops = {
 	.readpage	= afs_file_readpage,
+#ifdef CONFIG_AFS_FSCACHE
+	.readpages	= afs_file_readpages,
+#endif
 	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.releasepage	= afs_file_releasepage,
 	.invalidatepage	= afs_file_invalidatepage,
 };
 
+static struct vm_operations_struct afs_fs_vm_operations = {
+	.nopage		= filemap_nopage,
+	.populate	= filemap_populate,
+#ifdef CONFIG_AFS_FSCACHE
+	.page_mkwrite	= afs_file_page_mkwrite,
+#endif
+};
+
+/*****************************************************************************/
+/*
+ * set up a memory mapping on an AFS file
+ * - we set our own VMA ops so that we can catch the page becoming writable for
+ *   userspace for shared-writable mmap
+ */
+static int afs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	_enter("");
+
+	file_accessed(file);
+	vma->vm_ops = &afs_fs_vm_operations;
+	return 0;
+
+} /* end afs_file_mmap() */
+
 /*****************************************************************************/
 /*
  * deal with notification that a page was read from the cache
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_file_readpage_read_complete(void *cookie_data,
-					    struct page *page,
+#ifdef CONFIG_AFS_FSCACHE
+static void afs_file_readpage_read_complete(struct page *page,
 					    void *data,
 					    int error)
 {
-	_enter("%p,%p,%p,%d", cookie_data, page, data, error);
+	_enter("%p,%p,%d", page, data, error);
 
 	if (error)
 		SetPageError(page);
@@ -68,15 +109,16 @@ static void afs_file_readpage_read_compl
 /*
  * deal with notification that a page was written to the cache
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_file_readpage_write_complete(void *cookie_data,
-					     struct page *page,
+#ifdef CONFIG_AFS_FSCACHE
+static void afs_file_readpage_write_complete(struct page *page,
 					     void *data,
 					     int error)
 {
-	_enter("%p,%p,%p,%d", cookie_data, page, data, error);
+	_enter("%p,%p,%d", page, data, error);
 
-	unlock_page(page);
+	/* note that the page has been written to the cache and can now be
+	 * modified */
+	end_page_fs_misc(page);
 
 } /* end afs_file_readpage_write_complete() */
 #endif
@@ -88,16 +130,13 @@ static void afs_file_readpage_write_comp
 static int afs_file_readpage(struct file *file, struct page *page)
 {
 	struct afs_rxfs_fetch_descriptor desc;
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_page *pageio;
-#endif
 	struct afs_vnode *vnode;
 	struct inode *inode;
 	int ret;
 
 	inode = page->mapping->host;
 
-	_enter("{%lu},{%lu}", inode->i_ino, page->index);
+	_enter("{%lu},%p{%lu}", inode->i_ino, page, page->index);
 
 	vnode = AFS_FS_I(inode);
 
@@ -107,13 +146,9 @@ static int afs_file_readpage(struct file
 	if (vnode->flags & AFS_VNODE_DELETED)
 		goto error;
 
-#ifdef AFS_CACHING_SUPPORT
-	ret = cachefs_page_get_private(page, &pageio, GFP_NOIO);
-	if (ret < 0)
-		goto error;
-
+#ifdef CONFIG_AFS_FSCACHE
 	/* is it cached? */
-	ret = cachefs_read_or_alloc_page(vnode->cache,
+	ret = fscache_read_or_alloc_page(vnode->cache,
 					 page,
 					 afs_file_readpage_read_complete,
 					 NULL,
@@ -123,18 +158,20 @@ static int afs_file_readpage(struct file
 #endif
 
 	switch (ret) {
-		/* read BIO submitted and wb-journal entry found */
-	case 1:
-		BUG(); // TODO - handle wb-journal match
-
 		/* read BIO submitted (page in cache) */
 	case 0:
 		break;
 
-		/* no page available in cache */
-	case -ENOBUFS:
+		/* page not yet cached */
 	case -ENODATA:
+		_debug("cache said ENODATA");
+		goto go_on;
+
+		/* page will not be cached */
+	case -ENOBUFS:
+		_debug("cache said ENOBUFS");
 	default:
+	go_on:
 		desc.fid	= vnode->fid;
 		desc.offset	= page->index << PAGE_CACHE_SHIFT;
 		desc.size	= min((size_t) (inode->i_size - desc.offset),
@@ -148,34 +185,40 @@ static int afs_file_readpage(struct file
 		ret = afs_vnode_fetch_data(vnode, &desc);
 		kunmap(page);
 		if (ret < 0) {
-			if (ret==-ENOENT) {
-				_debug("got NOENT from server"
+			if (ret == -ENOENT) {
+				kdebug("got NOENT from server"
 				       " - marking file deleted and stale");
 				vnode->flags |= AFS_VNODE_DELETED;
 				ret = -ESTALE;
 			}
 
-#ifdef AFS_CACHING_SUPPORT
-			cachefs_uncache_page(vnode->cache, page);
+#ifdef CONFIG_AFS_FSCACHE
+			fscache_uncache_page(vnode->cache, page);
+			ClearPagePrivate(page);
 #endif
 			goto error;
 		}
 
 		SetPageUptodate(page);
 
-#ifdef AFS_CACHING_SUPPORT
-		if (cachefs_write_page(vnode->cache,
-				       page,
-				       afs_file_readpage_write_complete,
-				       NULL,
-				       GFP_KERNEL) != 0
-		    ) {
-			cachefs_uncache_page(vnode->cache, page);
-			unlock_page(page);
+		/* send the page to the cache */
+#ifdef CONFIG_AFS_FSCACHE
+		if (PagePrivate(page)) {
+			if (TestSetPageFsMisc(page))
+				BUG();
+			if (fscache_write_page(vnode->cache,
+					       page,
+					       afs_file_readpage_write_complete,
+					       NULL,
+					       GFP_KERNEL) != 0
+			    ) {
+				fscache_uncache_page(vnode->cache, page);
+				ClearPagePrivate(page);
+				end_page_fs_misc(page);
+			}
 		}
-#else
-		unlock_page(page);
 #endif
+		unlock_page(page);
 	}
 
 	_leave(" = 0");
@@ -192,20 +235,63 @@ static int afs_file_readpage(struct file
 
 /*****************************************************************************/
 /*
- * get a page cookie for the specified page
+ * read a set of pages
  */
-#ifdef AFS_CACHING_SUPPORT
-int afs_cache_get_page_cookie(struct page *page,
-			      struct cachefs_page **_page_cookie)
+#ifdef CONFIG_AFS_FSCACHE
+static int afs_file_readpages(struct file *filp, struct address_space *mapping,
+			      struct list_head *pages, unsigned nr_pages)
 {
-	int ret;
+	struct afs_vnode *vnode;
+#if 0
+	struct pagevec lru_pvec;
+	unsigned page_idx;
+#endif
+	int ret = 0;
 
-	_enter("");
-	ret = cachefs_page_get_private(page,_page_cookie, GFP_NOIO);
+	_enter(",{%lu},,%d", mapping->host->i_ino, nr_pages);
 
-	_leave(" = %d", ret);
+	vnode = AFS_FS_I(mapping->host);
+	if (vnode->flags & AFS_VNODE_DELETED) {
+		_leave(" = -ESTALE");
+		return -ESTALE;
+	}
+
+	/* attempt to read as many of the pages as possible */
+	ret = fscache_read_or_alloc_pages(vnode->cache,
+					  mapping,
+					  pages,
+					  &nr_pages,
+					  afs_file_readpage_read_complete,
+					  NULL,
+					  mapping_gfp_mask(mapping));
+
+	switch (ret) {
+		/* all pages are being read from the cache */
+	case 0:
+		BUG_ON(!list_empty(pages));
+		BUG_ON(nr_pages != 0);
+		_leave(" = 0 [reading all]");
+		return 0;
+
+		/* there were pages that couldn't be read from the cache */
+	case -ENODATA:
+	case -ENOBUFS:
+		break;
+
+		/* other error */
+	default:
+		_leave(" = %d", ret);
+		return ret;
+	}
+
+	/* load the missing pages from the network */
+	ret = read_cache_pages(mapping, pages,
+			       (void *) afs_file_readpage, NULL);
+
+	_leave(" = %d [netting]", ret);
 	return ret;
-} /* end afs_cache_get_page_cookie() */
+
+} /* end afs_file_readpages() */
 #endif
 
 /*****************************************************************************/
@@ -214,35 +300,23 @@ int afs_cache_get_page_cookie(struct pag
  */
 static void afs_file_invalidatepage(struct page *page, unsigned long offset)
 {
-	int ret = 1;
-
 	_enter("{%lu},%lu", page->index, offset);
 
 	BUG_ON(!PageLocked(page));
 
-	if (PagePrivate(page)) {
-#ifdef AFS_CACHING_SUPPORT
-		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
-		cachefs_uncache_page(vnode->cache,page);
-#endif
-
+	if (PagePrivate(page) && offset == 0) {
 		/* We release buffers only if the entire page is being
 		 * invalidated.
 		 * The get_block cached value has been unconditionally
 		 * invalidated, so real IO is not possible anymore.
 		 */
-		if (offset == 0) {
-			BUG_ON(!PageLocked(page));
-
-			ret = 0;
-			if (!PageWriteback(page))
-				ret = page->mapping->a_ops->releasepage(page,
-									0);
-			/* possibly should BUG_ON(!ret); - neilb */
-		}
+		if (!PageWriteback(page))
+			if (page->mapping->a_ops->releasepage(page, 0) != 0)
+				BUG();
 	}
 
-	_leave(" = %d", ret);
+	_leave("");
+
 } /* end afs_file_invalidatepage() */
 
 /*****************************************************************************/
@@ -251,23 +325,29 @@ static void afs_file_invalidatepage(stru
  */
 static int afs_file_releasepage(struct page *page, gfp_t gfp_flags)
 {
-	struct cachefs_page *pageio;
-
 	_enter("{%lu},%x", page->index, gfp_flags);
 
-	if (PagePrivate(page)) {
-#ifdef AFS_CACHING_SUPPORT
-		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
-		cachefs_uncache_page(vnode->cache, page);
+#ifdef CONFIG_AFS_FSCACHE
+	wait_on_page_fs_misc(page);
+	fscache_uncache_page(AFS_FS_I(page->mapping->host)->cache, page);
+	ClearPagePrivate(page);
 #endif
 
-		pageio = (struct cachefs_page *) page_private(page);
-		set_page_private(page, 0);
-		ClearPagePrivate(page);
-
-		kfree(pageio);
-	}
-
 	_leave(" = 0");
 	return 0;
+
 } /* end afs_file_releasepage() */
+
+/*****************************************************************************/
+/*
+ * wait for the disc cache to finish writing before permitting modification of
+ * our page in the page cache
+ */
+#ifdef CONFIG_AFS_FSCACHE
+static int afs_file_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+	wait_on_page_fs_misc(page);
+	return 0;
+
+} /* end afs_file_page_mkwrite() */
+#endif
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 61bc371..c88c41a 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -398,6 +398,8 @@ int afs_rxfs_fetch_file_status(struct af
 		bp++; /* spare6 */
 	}
 
+	_debug("Data Version %llx\n", vnode->status.version);
+
 	/* success */
 	ret = 0;
 
@@ -408,7 +410,7 @@ int afs_rxfs_fetch_file_status(struct af
  out_put_conn:
 	afs_server_release_callslot(server, &callslot);
  out:
-	_leave("");
+	_leave(" = %d", ret);
 	return ret;
 
  abort:
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 4ebb30a..d188380 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -65,6 +65,11 @@ static int afs_inode_map_status(struct a
 		return -EBADMSG;
 	}
 
+#ifdef CONFIG_AFS_FSCACHE
+	if (vnode->status.size != inode->i_size)
+		fscache_set_i_size(vnode->cache, vnode->status.size);
+#endif
+
 	inode->i_nlink		= vnode->status.nlink;
 	inode->i_uid		= vnode->status.owner;
 	inode->i_gid		= 0;
@@ -101,13 +106,33 @@ static int afs_inode_fetch_status(struct
 	struct afs_vnode *vnode;
 	int ret;
 
+	_enter("");
+
 	vnode = AFS_FS_I(inode);
 
 	ret = afs_vnode_fetch_status(vnode);
 
-	if (ret == 0)
+	if (ret == 0) {
+#ifdef CONFIG_AFS_FSCACHE
+		if (vnode->cache == FSCACHE_NEGATIVE_COOKIE) {
+			vnode->cache =
+				fscache_acquire_cookie(vnode->volume->cache,
+						       &afs_vnode_cache_index_def,
+						       vnode);
+			if (!vnode->cache)
+				printk("Negative\n");
+		}
+#endif
 		ret = afs_inode_map_status(vnode);
+#ifdef CONFIG_AFS_FSCACHE
+		if (ret < 0) {
+			fscache_relinquish_cookie(vnode->cache, 0);
+			vnode->cache = FSCACHE_NEGATIVE_COOKIE;
+		}
+#endif
+	}
 
+	_leave(" = %d", ret);
 	return ret;
 
 } /* end afs_inode_fetch_status() */
@@ -122,6 +147,7 @@ static int afs_iget5_test(struct inode *
 
 	return inode->i_ino == data->fid.vnode &&
 		inode->i_version == data->fid.unique;
+
 } /* end afs_iget5_test() */
 
 /*****************************************************************************/
@@ -179,20 +205,11 @@ inline int afs_iget(struct super_block *
 		return ret;
 	}
 
-#ifdef AFS_CACHING_SUPPORT
-	/* set up caching before reading the status, as fetch-status reads the
-	 * first page of symlinks to see if they're really mntpts */
-	cachefs_acquire_cookie(vnode->volume->cache,
-			       NULL,
-			       vnode,
-			       &vnode->cache);
-#endif
-
 	/* okay... it's a new inode */
 	inode->i_flags |= S_NOATIME;
 	vnode->flags |= AFS_VNODE_CHANGED;
 	ret = afs_inode_fetch_status(inode);
-	if (ret<0)
+	if (ret < 0)
 		goto bad_inode;
 
 	/* success */
@@ -278,8 +295,8 @@ void afs_clear_inode(struct inode *inode
 
 	afs_vnode_give_up_callback(vnode);
 
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(vnode->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_relinquish_cookie(vnode->cache, 0);
 	vnode->cache = NULL;
 #endif
 
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 72febdf..0bddcdf 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -16,15 +16,17 @@
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/fscache.h>
 
 /*
  * debug tracing
  */
-#define kenter(FMT, a...)	printk("==> %s("FMT")\n",__FUNCTION__ , ## a)
-#define kleave(FMT, a...)	printk("<== %s()"FMT"\n",__FUNCTION__ , ## a)
-#define kdebug(FMT, a...)	printk(FMT"\n" , ## a)
-#define kproto(FMT, a...)	printk("### "FMT"\n" , ## a)
-#define knet(FMT, a...)		printk(FMT"\n" , ## a)
+#define __kdbg(FMT, a...)	printk("[%05d] "FMT"\n", current->pid , ## a)
+#define kenter(FMT, a...)	__kdbg("==> %s("FMT")", __FUNCTION__ , ## a)
+#define kleave(FMT, a...)	__kdbg("<== %s()"FMT, __FUNCTION__ , ## a)
+#define kdebug(FMT, a...)	__kdbg(FMT , ## a)
+#define kproto(FMT, a...)	__kdbg("### "FMT , ## a)
+#define knet(FMT, a...)		__kdbg(FMT , ## a)
 
 #ifdef __KDEBUG
 #define _enter(FMT, a...)	kenter(FMT , ## a)
@@ -56,9 +58,6 @@ static inline void afs_discard_my_signal
  */
 extern struct rw_semaphore afs_proc_cells_sem;
 extern struct list_head afs_proc_cells;
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_cache_cell_index_def;
-#endif
 
 /*
  * dir.c
@@ -72,11 +71,6 @@ extern const struct file_operations afs_
 extern struct address_space_operations afs_fs_aops;
 extern struct inode_operations afs_file_inode_operations;
 
-#ifdef AFS_CACHING_SUPPORT
-extern int afs_cache_get_page_cookie(struct page *page,
-				     struct cachefs_page **_page_cookie);
-#endif
-
 /*
  * inode.c
  */
@@ -97,8 +91,8 @@ extern void afs_key_unregister(void);
 /*
  * main.c
  */
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_netfs afs_cache_netfs;
+#ifdef CONFIG_AFS_FSCACHE
+extern struct fscache_netfs afs_cache_netfs;
 #endif
 
 /*
diff --git a/fs/afs/main.c b/fs/afs/main.c
index 913c689..5840bb2 100644
--- a/fs/afs/main.c
+++ b/fs/afs/main.c
@@ -1,6 +1,6 @@
 /* main.c: AFS client file system
  *
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2002,5 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
  *
  * This program is free software; you can redistribute it and/or
@@ -14,11 +14,11 @@
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
+#include <linux/fscache.h>
 #include <rxrpc/rxrpc.h>
 #include <rxrpc/transport.h>
 #include <rxrpc/call.h>
 #include <rxrpc/peer.h>
-#include "cache.h"
 #include "cell.h"
 #include "server.h"
 #include "fsclient.h"
@@ -51,12 +51,11 @@ static struct rxrpc_peer_ops afs_peer_op
 struct list_head afs_cb_hash_tbl[AFS_CB_HASH_COUNT];
 DEFINE_SPINLOCK(afs_cb_hash_lock);
 
-#ifdef AFS_CACHING_SUPPORT
-static struct cachefs_netfs_operations afs_cache_ops = {
-	.get_page_cookie	= afs_cache_get_page_cookie,
+#ifdef CONFIG_AFS_FSCACHE
+static struct fscache_netfs_operations afs_cache_ops = {
 };
 
-struct cachefs_netfs afs_cache_netfs = {
+struct fscache_netfs afs_cache_netfs = {
 	.name			= "afs",
 	.version		= 0,
 	.ops			= &afs_cache_ops,
@@ -83,10 +82,9 @@ static int __init afs_init(void)
 	if (ret < 0)
 		return ret;
 
-#ifdef AFS_CACHING_SUPPORT
+#ifdef CONFIG_AFS_FSCACHE
 	/* we want to be able to cache */
-	ret = cachefs_register_netfs(&afs_cache_netfs,
-				     &afs_cache_cell_index_def);
+	ret = fscache_register_netfs(&afs_cache_netfs);
 	if (ret < 0)
 		goto error;
 #endif
@@ -137,8 +135,8 @@ static int __init afs_init(void)
 	afs_key_unregister();
  error_cache:
 #endif
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_unregister_netfs(&afs_cache_netfs);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_unregister_netfs(&afs_cache_netfs);
  error:
 #endif
 	afs_cell_purge();
@@ -167,8 +165,8 @@ static void __exit afs_exit(void)
 #ifdef CONFIG_KEYS_TURNED_OFF
 	afs_key_unregister();
 #endif
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_unregister_netfs(&afs_cache_netfs);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_unregister_netfs(&afs_cache_netfs);
 #endif
 	afs_proc_cleanup();
 
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index 4e6eeb5..91f57cf 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -82,7 +82,7 @@ int afs_mntpt_check_symlink(struct afs_v
 
 	ret = -EIO;
 	wait_on_page_locked(page);
-	buf = kmap(page);
+	buf = kmap_atomic(page, KM_USER0);
 	if (!PageUptodate(page))
 		goto out_free;
 	if (PageError(page))
@@ -105,7 +105,7 @@ int afs_mntpt_check_symlink(struct afs_v
 	ret = 0;
 
  out_free:
-	kunmap(page);
+	kunmap_atomic(buf, KM_USER0);
 	page_cache_release(page);
  out:
 	_leave(" = %d", ret);
@@ -195,9 +195,9 @@ static struct vfsmount *afs_mntpt_do_aut
 	if (!PageUptodate(page) || PageError(page))
 		goto error;
 
-	buf = kmap(page);
+	buf = kmap_atomic(page, KM_USER0);
 	memcpy(devname, buf, size);
-	kunmap(page);
+	kunmap_atomic(buf, KM_USER0);
 	page_cache_release(page);
 	page = NULL;
 
@@ -276,12 +276,12 @@ static void *afs_mntpt_follow_link(struc
  */
 static void afs_mntpt_expiry_timed_out(struct afs_timer *timer)
 {
-	kenter("");
+//	kenter("");
 
 	mark_mounts_for_expiry(&afs_vfsmounts);
 
 	afs_kafstimod_add_timer(&afs_mntpt_expiry_timer,
 				afs_mntpt_expiry_timeout * HZ);
 
-	kleave("");
+//	kleave("");
 } /* end afs_mntpt_expiry_timed_out() */
diff --git a/fs/afs/proc.c b/fs/afs/proc.c
index 101d21b..db58488 100644
--- a/fs/afs/proc.c
+++ b/fs/afs/proc.c
@@ -177,6 +177,7 @@ int afs_proc_init(void)
  */
 void afs_proc_cleanup(void)
 {
+	remove_proc_entry("rootcell", proc_afs);
 	remove_proc_entry("cells", proc_afs);
 
 	remove_proc_entry("fs/afs", NULL);
diff --git a/fs/afs/server.c b/fs/afs/server.c
index 62b093a..7103e10 100644
--- a/fs/afs/server.c
+++ b/fs/afs/server.c
@@ -377,7 +377,6 @@ int afs_server_request_callslot(struct a
 	else if (list_empty(&server->fs_callq)) {
 		/* no one waiting */
 		server->fs_conn_cnt[nconn]++;
-		spin_unlock(&server->fs_lock);
 	}
 	else {
 		/* someone's waiting - dequeue them and wake them up */
@@ -395,9 +394,9 @@ int afs_server_request_callslot(struct a
 		}
 		pcallslot->ready = 1;
 		wake_up_process(pcallslot->task);
-		spin_unlock(&server->fs_lock);
 	}
 
+	spin_unlock(&server->fs_lock);
 	rxrpc_put_connection(callslot->conn);
 	callslot->conn = NULL;
 
diff --git a/fs/afs/vlocation.c b/fs/afs/vlocation.c
index eced206..cfab969 100644
--- a/fs/afs/vlocation.c
+++ b/fs/afs/vlocation.c
@@ -59,17 +59,21 @@ static LIST_HEAD(afs_vlocation_update_pe
 static struct afs_vlocation *afs_vlocation_update;	/* VL currently being updated */
 static DEFINE_SPINLOCK(afs_vlocation_update_lock); /* lock guarding update queue */
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vlocation_cache_match(void *target,
-						     const void *entry);
-static void afs_vlocation_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_vlocation_cache_index_def = {
-	.name		= "vldb",
-	.data_size	= sizeof(struct afs_cache_vlocation),
-	.keys[0]	= { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
-	.match		= afs_vlocation_cache_match,
-	.update		= afs_vlocation_cache_update,
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_vlocation_cache_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t buflen);
+static uint16_t afs_vlocation_cache_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t buflen);
+static fscache_checkaux_t afs_vlocation_cache_check_aux(void *cookie_netfs_data,
+							const void *buffer,
+							uint16_t buflen);
+
+static struct fscache_cookie_def afs_vlocation_cache_index_def = {
+	.name		= "AFS.vldb",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= afs_vlocation_cache_get_key,
+	.get_aux	= afs_vlocation_cache_get_aux,
+	.check_aux	= afs_vlocation_cache_check_aux,
 };
 #endif
 
@@ -300,13 +304,12 @@ int afs_vlocation_lookup(struct afs_cell
 
 	list_add_tail(&vlocation->link, &cell->vl_list);
 
-#ifdef AFS_CACHING_SUPPORT
+#ifdef CONFIG_AFS_FSCACHE
 	/* we want to store it in the cache, plus it might already be
 	 * encached */
-	cachefs_acquire_cookie(cell->cache,
-			       &afs_volume_cache_index_def,
-			       vlocation,
-			       &vlocation->cache);
+	vlocation->cache = fscache_acquire_cookie(cell->cache,
+						  &afs_vlocation_cache_index_def,
+						  vlocation);
 
 	if (vlocation->valid)
 		goto found_in_cache;
@@ -341,7 +344,7 @@ int afs_vlocation_lookup(struct afs_cell
  active:
 	active = 1;
 
-#ifdef AFS_CACHING_SUPPORT
+#ifdef CONFIG_AFS_FSCACHE
  found_in_cache:
 #endif
 	/* try to look up a cached volume in the cell VL databases by ID */
@@ -423,9 +426,9 @@ int afs_vlocation_lookup(struct afs_cell
 
 	afs_kafstimod_add_timer(&vlocation->upd_timer, 10 * HZ);
 
-#ifdef AFS_CACHING_SUPPORT
+#ifdef CONFIG_AFS_FSCACHE
 	/* update volume entry in local cache */
-	cachefs_update_cookie(vlocation->cache);
+	fscache_update_cookie(vlocation->cache);
 #endif
 
 	*_vlocation = vlocation;
@@ -439,8 +442,8 @@ int afs_vlocation_lookup(struct afs_cell
 		}
 		else {
 			list_del(&vlocation->link);
-#ifdef AFS_CACHING_SUPPORT
-			cachefs_relinquish_cookie(vlocation->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+			fscache_relinquish_cookie(vlocation->cache, 0);
 #endif
 			afs_put_cell(vlocation->cell);
 			kfree(vlocation);
@@ -538,8 +541,8 @@ void afs_vlocation_do_timeout(struct afs
 	}
 
 	/* we can now destroy it properly */
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(vlocation->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_relinquish_cookie(vlocation->cache, 0);
 #endif
 	afs_put_cell(cell);
 
@@ -890,65 +893,103 @@ static void afs_vlocation_update_discard
 
 /*****************************************************************************/
 /*
- * match a VLDB record stored in the cache
- * - may also load target from entry
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vlocation_cache_match(void *target,
-						     const void *entry)
-{
-	const struct afs_cache_vlocation *vldb = entry;
-	struct afs_vlocation *vlocation = target;
-
-	_enter("{%s},{%s}", vlocation->vldb.name, vldb->name);
-
-	if (strncmp(vlocation->vldb.name, vldb->name, sizeof(vldb->name)) == 0
-	    ) {
-		if (!vlocation->valid ||
-		    vlocation->vldb.rtime == vldb->rtime
-		    ) {
-			vlocation->vldb = *vldb;
-			vlocation->valid = 1;
-			_leave(" = SUCCESS [c->m]");
-			return CACHEFS_MATCH_SUCCESS;
-		}
-		/* need to update cache if cached info differs */
-		else if (memcmp(&vlocation->vldb, vldb, sizeof(*vldb)) != 0) {
-			/* delete if VIDs for this name differ */
-			if (memcmp(&vlocation->vldb.vid,
-				   &vldb->vid,
-				   sizeof(vldb->vid)) != 0) {
-				_leave(" = DELETE");
-				return CACHEFS_MATCH_SUCCESS_DELETE;
-			}
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_vlocation_cache_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
+{
+	const struct afs_vlocation *vlocation = cookie_netfs_data;
+	uint16_t klen;
 
-			_leave(" = UPDATE");
-			return CACHEFS_MATCH_SUCCESS_UPDATE;
-		}
-		else {
-			_leave(" = SUCCESS");
-			return CACHEFS_MATCH_SUCCESS;
-		}
-	}
+	_enter("{%s},%p,%u", vlocation->vldb.name, buffer, bufmax);
+
+	klen = strnlen(vlocation->vldb.name, sizeof(vlocation->vldb.name));
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, vlocation->vldb.name, klen);
+
+	_leave(" = %u", klen);
+	return klen;
 
-	_leave(" = FAILED");
-	return CACHEFS_MATCH_FAILED;
-} /* end afs_vlocation_cache_match() */
+} /* end afs_vlocation_cache_get_key() */
 #endif
 
 /*****************************************************************************/
 /*
- * update a VLDB record stored in the cache
+ * provide new auxiliary cache data
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_vlocation_cache_update(void *source, void *entry)
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_vlocation_cache_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
 {
-	struct afs_cache_vlocation *vldb = entry;
-	struct afs_vlocation *vlocation = source;
+	const struct afs_vlocation *vlocation = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%s},%p,%u", vlocation->vldb.name, buffer, bufmax);
+
+	dlen = sizeof(struct afs_cache_vlocation);
+	dlen -= offsetof(struct afs_cache_vlocation, nservers);
+	if (dlen > bufmax)
+		return 0;
+
+	memcpy(buffer, (uint8_t *)&vlocation->vldb.nservers, dlen);
 
-	_enter("");
+	_leave(" = %u", dlen);
+	return dlen;
+
+} /* end afs_vlocation_cache_get_aux() */
+#endif
+
+/*****************************************************************************/
+/*
+ * check that the auxiliary data indicates that the entry is still valid
+ */
+#ifdef CONFIG_AFS_FSCACHE
+static fscache_checkaux_t afs_vlocation_cache_check_aux(void *cookie_netfs_data,
+							const void *buffer,
+							uint16_t buflen)
+{
+	const struct afs_cache_vlocation *cvldb;
+	struct afs_vlocation *vlocation = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%s},%p,%u", vlocation->vldb.name, buffer, buflen);
+
+	/* check the size of the data is what we're expecting */
+	dlen = sizeof(struct afs_cache_vlocation);
+	dlen -= offsetof(struct afs_cache_vlocation, nservers);
+	if (dlen != buflen)
+		return FSCACHE_CHECKAUX_OBSOLETE;
+
+	cvldb = container_of(buffer, struct afs_cache_vlocation, nservers);
+
+	/* if what's on disk is more valid than what's in memory, then use the
+	 * VL record from the cache */
+	if (!vlocation->valid || vlocation->vldb.rtime == cvldb->rtime) {
+		memcpy((uint8_t *)&vlocation->vldb.nservers, buffer, dlen);
+		vlocation->valid = 1;
+		_leave(" = SUCCESS [c->m]");
+		return FSCACHE_CHECKAUX_OKAY;
+	}
+
+	/* need to update the cache if the cached info differs */
+	if (memcmp(&vlocation->vldb, buffer, dlen) != 0) {
+		/* delete if the volume IDs for this name differ */
+		if (memcmp(&vlocation->vldb.vid, &cvldb->vid,
+			   sizeof(cvldb->vid)) != 0
+		    ) {
+			_leave(" = OBSOLETE");
+			return FSCACHE_CHECKAUX_OBSOLETE;
+		}
+
+		_leave(" = UPDATE");
+		return FSCACHE_CHECKAUX_NEEDS_UPDATE;
+	}
 
-	*vldb = vlocation->vldb;
+	_leave(" = OKAY");
+	return FSCACHE_CHECKAUX_OKAY;
 
-} /* end afs_vlocation_cache_update() */
+} /* end afs_vlocation_cache_check_aux() */
 #endif
diff --git a/fs/afs/vnode.c b/fs/afs/vnode.c
index 9867fef..7116128 100644
--- a/fs/afs/vnode.c
+++ b/fs/afs/vnode.c
@@ -29,17 +29,30 @@ struct afs_timer_ops afs_vnode_cb_timed_
 	.timed_out	= afs_vnode_cb_timed_out,
 };
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vnode_cache_match(void *target,
-						 const void *entry);
-static void afs_vnode_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_vnode_cache_index_def = {
-	.name		= "vnode",
-	.data_size	= sizeof(struct afs_cache_vnode),
-	.keys[0]	= { CACHEFS_INDEX_KEYS_BIN, 4 },
-	.match		= afs_vnode_cache_match,
-	.update		= afs_vnode_cache_update,
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+					void *buffer, uint16_t buflen);
+static void afs_vnode_cache_get_attr(const void *cookie_netfs_data,
+				     uint64_t *size);
+static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
+					void *buffer, uint16_t buflen);
+static fscache_checkaux_t afs_vnode_cache_check_aux(void *cookie_netfs_data,
+						    const void *buffer,
+						    uint16_t buflen);
+static void afs_vnode_cache_mark_pages_cached(void *cookie_netfs_data,
+					      struct address_space *mapping,
+					      struct pagevec *cached_pvec);
+static void afs_vnode_cache_now_uncached(void *cookie_netfs_data);
+
+struct fscache_cookie_def afs_vnode_cache_index_def = {
+	.name			= "AFS.vnode",
+	.type			= FSCACHE_COOKIE_TYPE_DATAFILE,
+	.get_key		= afs_vnode_cache_get_key,
+	.get_attr		= afs_vnode_cache_get_attr,
+	.get_aux		= afs_vnode_cache_get_aux,
+	.check_aux		= afs_vnode_cache_check_aux,
+	.mark_pages_cached	= afs_vnode_cache_mark_pages_cached,
+	.now_uncached		= afs_vnode_cache_now_uncached,
 };
 #endif
 
@@ -189,6 +202,8 @@ int afs_vnode_fetch_status(struct afs_vn
 
 	if (vnode->update_cnt > 0) {
 		/* someone else started a fetch */
+		_debug("conflict");
+
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		add_wait_queue(&vnode->update_waitq, &myself);
 
@@ -220,6 +235,7 @@ int afs_vnode_fetch_status(struct afs_vn
 		spin_unlock(&vnode->lock);
 		set_current_state(TASK_RUNNING);
 
+		_leave(" [conflicted, %d", !!(vnode->flags & AFS_VNODE_DELETED));
 		return vnode->flags & AFS_VNODE_DELETED ? -ENOENT : 0;
 	}
 
@@ -342,54 +358,197 @@ int afs_vnode_give_up_callback(struct af
 
 /*****************************************************************************/
 /*
- * match a vnode record stored in the cache
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vnode_cache_match(void *target,
-						 const void *entry)
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+					void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_vnode *cvnode = entry;
-	struct afs_vnode *vnode = target;
+	const struct afs_vnode *vnode = cookie_netfs_data;
+	uint16_t klen;
 
-	_enter("{%x,%x,%Lx},{%x,%x,%Lx}",
-	       vnode->fid.vnode,
-	       vnode->fid.unique,
-	       vnode->status.version,
-	       cvnode->vnode_id,
-	       cvnode->vnode_unique,
-	       cvnode->data_version);
-
-	if (vnode->fid.vnode != cvnode->vnode_id) {
-		_leave(" = FAILED");
-		return CACHEFS_MATCH_FAILED;
-	}
+	_enter("{%x,%x,%Lx},%p,%u",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.version,
+	       buffer, bufmax);
+
+	klen = sizeof(vnode->fid.vnode);
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, &vnode->fid.vnode, sizeof(vnode->fid.vnode));
+
+	_leave(" = %u", klen);
+	return klen;
+
+} /* end afs_vnode_cache_get_key() */
+#endif
+
+/*****************************************************************************/
+/*
+ * provide updated file attributes
+ */
+#ifdef CONFIG_AFS_FSCACHE
+static void afs_vnode_cache_get_attr(const void *cookie_netfs_data,
+				     uint64_t *size)
+{
+	const struct afs_vnode *vnode = cookie_netfs_data;
+
+	_enter("{%x,%x,%Lx},",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.version);
+
+	*size = i_size_read((struct inode *) &vnode->vfs_inode);
+
+} /* end afs_vnode_cache_get_attr() */
+#endif
+
+/*****************************************************************************/
+/*
+ * provide new auxiliary cache data
+ */
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
+					void *buffer, uint16_t bufmax)
+{
+	const struct afs_vnode *vnode = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%x,%x,%Lx},%p,%u",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.version,
+	       buffer, bufmax);
 
-	if (vnode->fid.unique != cvnode->vnode_unique ||
-	    vnode->status.version != cvnode->data_version) {
-		_leave(" = DELETE");
-		return CACHEFS_MATCH_SUCCESS_DELETE;
+	dlen = sizeof(vnode->fid.unique) + sizeof(vnode->status.version);
+	if (dlen > bufmax)
+		return 0;
+
+	memcpy(buffer, &vnode->fid.unique, sizeof(vnode->fid.unique));
+	buffer += sizeof(vnode->fid.unique);
+	memcpy(buffer, &vnode->status.version, sizeof(vnode->status.version));
+
+	_leave(" = %u", dlen);
+	return dlen;
+
+} /* end afs_vnode_cache_get_aux() */
+#endif
+
+/*****************************************************************************/
+/*
+ * check that the auxiliary data indicates that the entry is still valid
+ */
+#ifdef CONFIG_AFS_FSCACHE
+static fscache_checkaux_t afs_vnode_cache_check_aux(void *cookie_netfs_data,
+						    const void *buffer,
+						    uint16_t buflen)
+{
+	struct afs_vnode *vnode = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%x,%x,%Lx},%p,%u",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.version,
+	       buffer, buflen);
+
+	/* check the size of the data is what we're expecting */
+	dlen = sizeof(vnode->fid.unique) + sizeof(vnode->status.version);
+	if (dlen != buflen) {
+		_leave(" = OBSOLETE [len %hx != %hx]", dlen, buflen);
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	if (memcmp(buffer,
+		   &vnode->fid.unique,
+		   sizeof(vnode->fid.unique)
+		   ) != 0
+	    ) {
+		unsigned unique;
+
+		memcpy(&unique, buffer, sizeof(unique));
+
+		_leave(" = OBSOLETE [uniq %x != %x]",
+		       unique, vnode->fid.unique);
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	if (memcmp(buffer + sizeof(vnode->fid.unique),
+		   &vnode->status.version,
+		   sizeof(vnode->status.version)
+		   ) != 0
+	    ) {
+		afs_dataversion_t version;
+
+		memcpy(&version, buffer + sizeof(vnode->fid.unique),
+		       sizeof(version));
+
+		_leave(" = OBSOLETE [vers %llx != %llx]",
+		       version, vnode->status.version);
+		return FSCACHE_CHECKAUX_OBSOLETE;
 	}
 
 	_leave(" = SUCCESS");
-	return CACHEFS_MATCH_SUCCESS;
-} /* end afs_vnode_cache_match() */
+	return FSCACHE_CHECKAUX_OKAY;
+
+} /* end afs_vnode_cache_check_aux() */
 #endif
 
 /*****************************************************************************/
 /*
- * update a vnode record stored in the cache
+ * indication of pages that now have cache metadata retained
+ * - this function should mark the specified pages as now being cached
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_vnode_cache_update(void *source, void *entry)
+#ifdef CONFIG_AFS_FSCACHE
+static void afs_vnode_cache_mark_pages_cached(void *cookie_netfs_data,
+					      struct address_space *mapping,
+					      struct pagevec *cached_pvec)
 {
-	struct afs_cache_vnode *cvnode = entry;
-	struct afs_vnode *vnode = source;
+	unsigned long loop;
 
-	_enter("");
+	for (loop = 0; loop < cached_pvec->nr; loop++) {
+		struct page *page = cached_pvec->pages[loop];
 
-	cvnode->vnode_id	= vnode->fid.vnode;
-	cvnode->vnode_unique	= vnode->fid.unique;
-	cvnode->data_version	= vnode->status.version;
+		_debug("- mark %p{%lx}", page, page->index);
 
-} /* end afs_vnode_cache_update() */
+		SetPagePrivate(page);
+	}
+
+} /* end afs_vnode_cache_mark_pages_cached() */
 #endif
+
+/*****************************************************************************/
+/*
+ * indication that the cookie is no longer cached
+ * - this function is called when the backing store currently caching a cookie
+ *   is removed
+ * - the netfs should use this to clean up any markers indicating cached pages
+ * - this is mandatory for any object that may have data
+ */
+static void afs_vnode_cache_now_uncached(void *cookie_netfs_data)
+{
+	struct afs_vnode *vnode = cookie_netfs_data;
+	struct pagevec pvec;
+	pgoff_t first;
+	int loop, nr_pages;
+
+	_enter("{%x,%x,%Lx}",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.version);
+
+	pagevec_init(&pvec, 0);
+	first = 0;
+
+	for (;;) {
+		/* grab a bunch of pages to clean */
+		nr_pages = find_get_pages(vnode->vfs_inode.i_mapping, first,
+					  PAGEVEC_SIZE, pvec.pages);
+		if (!nr_pages)
+			break;
+
+		for (loop = 0; loop < nr_pages; loop++)
+			ClearPagePrivate(pvec.pages[loop]);
+
+		first = pvec.pages[nr_pages - 1]->index + 1;
+
+		pvec.nr = nr_pages;
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+
+	_leave("");
+
+} /* end afs_vnode_cache_now_uncached() */
diff --git a/fs/afs/vnode.h b/fs/afs/vnode.h
index b86a971..3f0602d 100644
--- a/fs/afs/vnode.h
+++ b/fs/afs/vnode.h
@@ -13,9 +13,9 @@
 #define _LINUX_AFS_VNODE_H
 
 #include <linux/fs.h>
+#include <linux/fscache.h>
 #include "server.h"
 #include "kafstimod.h"
-#include "cache.h"
 
 #ifdef __KERNEL__
 
@@ -32,8 +32,8 @@ struct afs_cache_vnode
 	afs_dataversion_t	data_version;	/* data version */
 };
 
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_vnode_cache_index_def;
+#ifdef CONFIG_AFS_FSCACHE
+extern struct fscache_cookie_def afs_vnode_cache_index_def;
 #endif
 
 /*****************************************************************************/
@@ -47,8 +47,8 @@ struct afs_vnode
 	struct afs_volume	*volume;	/* volume on which vnode resides */
 	struct afs_fid		fid;		/* the file identifier for this inode */
 	struct afs_file_status	status;		/* AFS status info for this file */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+	struct fscache_cookie	*cache;		/* caching cookie */
 #endif
 
 	wait_queue_head_t	update_waitq;	/* status fetch waitqueue */
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 0ff4b86..0bd5578 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -15,10 +15,10 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/fscache.h>
 #include "volume.h"
 #include "vnode.h"
 #include "cell.h"
-#include "cache.h"
 #include "cmservice.h"
 #include "fsclient.h"
 #include "vlclient.h"
@@ -28,18 +28,14 @@
 static const char *afs_voltypes[] = { "R/W", "R/O", "BAK" };
 #endif
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_volume_cache_match(void *target,
-						  const void *entry);
-static void afs_volume_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_volume_cache_index_def = {
-	.name		= "volume",
-	.data_size	= sizeof(struct afs_cache_vhash),
-	.keys[0]	= { CACHEFS_INDEX_KEYS_BIN, 1 },
-	.keys[1]	= { CACHEFS_INDEX_KEYS_BIN, 1 },
-	.match		= afs_volume_cache_match,
-	.update		= afs_volume_cache_update,
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_volume_cache_get_key(const void *cookie_netfs_data,
+					 void *buffer, uint16_t buflen);
+
+static struct fscache_cookie_def afs_volume_cache_index_def = {
+	.name		= "AFS.volume",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= afs_volume_cache_get_key,
 };
 #endif
 
@@ -214,11 +210,10 @@ int afs_volume_lookup(const char *name, 
 	}
 
 	/* attach the cache and volume location */
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_acquire_cookie(vlocation->cache,
-			       &afs_vnode_cache_index_def,
-			       volume,
-			       &volume->cache);
+#ifdef CONFIG_AFS_FSCACHE
+	volume->cache = fscache_acquire_cookie(vlocation->cache,
+					       &afs_volume_cache_index_def,
+					       volume);
 #endif
 
 	afs_get_vlocation(vlocation);
@@ -286,8 +281,8 @@ void afs_put_volume(struct afs_volume *v
 	up_write(&vlocation->cell->vl_sem);
 
 	/* finish cleaning up the volume */
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(volume->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_relinquish_cookie(volume->cache, 0);
 #endif
 	afs_put_vlocation(vlocation);
 
@@ -481,40 +476,25 @@ int afs_volume_release_fileserver(struct
 
 /*****************************************************************************/
 /*
- * match a volume hash record stored in the cache
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_volume_cache_match(void *target,
-						  const void *entry)
+#ifdef CONFIG_AFS_FSCACHE
+static uint16_t afs_volume_cache_get_key(const void *cookie_netfs_data,
+					void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_vhash *vhash = entry;
-	struct afs_volume *volume = target;
+	const struct afs_volume *volume = cookie_netfs_data;
+	uint16_t klen;
 
-	_enter("{%u},{%u}", volume->type, vhash->vtype);
+	_enter("{%u},%p,%u", volume->type, buffer, bufmax);
 
-	if (volume->type == vhash->vtype) {
-		_leave(" = SUCCESS");
-		return CACHEFS_MATCH_SUCCESS;
-	}
-
-	_leave(" = FAILED");
-	return CACHEFS_MATCH_FAILED;
-} /* end afs_volume_cache_match() */
-#endif
-
-/*****************************************************************************/
-/*
- * update a volume hash record stored in the cache
- */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_volume_cache_update(void *source, void *entry)
-{
-	struct afs_cache_vhash *vhash = entry;
-	struct afs_volume *volume = source;
+	klen = sizeof(volume->type);
+	if (klen > bufmax)
+		return 0;
 
-	_enter("");
+	memcpy(buffer, &volume->type, sizeof(volume->type));
 
-	vhash->vtype = volume->type;
+	_leave(" = %u", klen);
+	return klen;
 
-} /* end afs_volume_cache_update() */
+} /* end afs_volume_cache_get_key() */
 #endif
diff --git a/fs/afs/volume.h b/fs/afs/volume.h
index bfdcf19..fc9895a 100644
--- a/fs/afs/volume.h
+++ b/fs/afs/volume.h
@@ -12,11 +12,11 @@
 #ifndef _LINUX_AFS_VOLUME_H
 #define _LINUX_AFS_VOLUME_H
 
+#include <linux/fscache.h>
 #include "types.h"
 #include "fsclient.h"
 #include "kafstimod.h"
 #include "kafsasyncd.h"
-#include "cache.h"
 
 typedef enum {
 	AFS_VLUPD_SLEEP,		/* sleeping waiting for update timer to fire */
@@ -45,24 +45,6 @@ struct afs_cache_vlocation
 	time_t			rtime;		/* last retrieval time */
 };
 
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_vlocation_cache_index_def;
-#endif
-
-/*****************************************************************************/
-/*
- * volume -> vnode hash table entry
- */
-struct afs_cache_vhash
-{
-	afs_voltype_t		vtype;		/* which volume variation */
-	uint8_t			hash_bucket;	/* which hash bucket this represents */
-} __attribute__((packed));
-
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_volume_cache_index_def;
-#endif
-
 /*****************************************************************************/
 /*
  * AFS volume location record
@@ -73,8 +55,8 @@ struct afs_vlocation
 	struct list_head	link;		/* link in cell volume location list */
 	struct afs_timer	timeout;	/* decaching timer */
 	struct afs_cell		*cell;		/* cell to which volume belongs */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+	struct fscache_cookie	*cache;		/* caching cookie */
 #endif
 	struct afs_cache_vlocation vldb;	/* volume information DB record */
 	struct afs_volume	*vols[3];	/* volume access record pointer (index by type) */
@@ -109,8 +91,8 @@ struct afs_volume
 	atomic_t		usage;
 	struct afs_cell		*cell;		/* cell to which belongs (unrefd ptr) */
 	struct afs_vlocation	*vlocation;	/* volume location */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+	struct fscache_cookie	*cache;		/* caching cookie */
 #endif
 	afs_volid_t		vid;		/* volume ID */
 	afs_voltype_t		type;		/* type of volume */
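
For illustration, the readpage path after this conversion follows roughly the
shape sketched below.  The fscache_*() calls, the PG_fs_misc helpers and the
error conventions are the ones used in the kAFS changes above; "cookie", the
completion handlers and the fetch-from-server step are hypothetical stand-ins
rather than real kAFS code:

	static int example_readpage(struct file *file, struct page *page)
	{
		int ret;

		/* try the disc cache first: 0 means the cache dispatched a
		 * read for us; -ENODATA/-ENOBUFS mean the page must be
		 * fetched from the server instead */
		ret = fscache_read_or_alloc_page(cookie, page,
						 example_read_complete,
						 NULL, GFP_KERNEL);
		if (ret == 0)
			return 0;

		/* ... fetch the page from the server here ... */
		SetPageUptodate(page);

		/* if the cache noted this page (PG_private set via the
		 * cookie's mark_pages_cached callback), hand it over for a
		 * background write, holding PG_fs_misc until the write
		 * completes so that page_mkwrite/releasepage can wait */
		if (PagePrivate(page)) {
			if (TestSetPageFsMisc(page))
				BUG();
			if (fscache_write_page(cookie, page,
					       example_write_complete,
					       NULL, GFP_KERNEL) != 0) {
				fscache_uncache_page(cookie, page);
				ClearPagePrivate(page);
				end_page_fs_misc(page);
			}
		}
		unlock_page(page);
		return 0;
	}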

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
                   ` (4 preceding siblings ...)
  2006-04-20 16:59 ` [PATCH 6/7] FS-Cache: Make kAFS use FS-Cache David Howells
@ 2006-04-20 16:59 ` David Howells
  2006-04-21  0:57   ` Andrew Morton
                     ` (2 more replies)
  2006-04-21  0:12 ` [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit Andrew Morton
  2006-04-21 10:22 ` David Howells
  7 siblings, 3 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 16:59 UTC (permalink / raw)
  To: torvalds, akpm, steved, sct, aviro
  Cc: linux-fsdevel, linux-cachefs, nfsv4, linux-kernel

Add a cache backend that permits a mounted filesystem to be used as a backing
store for the cache.


CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling.  This is called cachefilesd and lives in
/sbin.  The source for the daemon can be downloaded from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services.  Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a proc-file - "/proc/fs/cachefiles" - that is used for
communication with the daemon.  Only one thing may have this open at once, and
whilst it is open, a cache is at least partially in existence.  The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section.  This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.


============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

	- dnotify.

	- extended attributes (xattrs).

	- openat() and friends.

	- bmap() support on files in the filesystem (FIBMAP ioctl).

	- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.


=============
CONFIGURATION
=============

The cache is configured by a script in /etc/cachefilesd.conf.  These commands
set up the cache ready for use.  The following script commands are available:

 (*) brun <N>%
 (*) bcull <N>%
 (*) bstop <N>%

	Configure the culling limits.  Optional.  See the section on culling.
	The defaults are 7%, 5% and 1% respectively.

 (*) dir <path>

	Specify the directory containing the root of the cache.  Mandatory.

 (*) tag <name>

	Specify a tag to FS-Cache to use in distinguishing multiple caches.
	Optional.  The default is "CacheFiles".

 (*) debug <mask>

	Specify a numeric bitmask to control debugging in the kernel module.
	Optional.  The default is zero (all off).
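
As an illustration only, a minimal /etc/cachefilesd.conf built from the
commands above might read as follows; the cache directory path and the limit
values here are example choices, not shipped defaults:

	dir /var/fscache
	tag CacheFiles
	brun 10%
	bcull 7%
	bstop 3%
	debug 0

Only "dir" is mandatory; the remaining lines override or restate the defaults
described above.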


==================
STARTING THE CACHE
==================

The cache is started by running the daemon.  The daemon opens the cache proc
file, configures the cache and tells it to begin caching.  At that point the
cache binds to fscache and the cache becomes live.

The daemon is run as follows:

	/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]

The flags are:

 (*) -d

	Increase the debugging level.  This can be specified multiple times and
	is cumulative with itself.

 (*) -s

	Send messages to stderr instead of syslog.

 (*) -n

	Don't daemonise and go into the background.

 (*) -f <configfile>

	Use an alternative configuration file rather than the default one.
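
For example, to try the daemon out in the foreground, with messages sent to
stderr, extra debugging and a non-default configuration file (the path here is
hypothetical):

	/sbin/cachefilesd -d -d -n -s -f /etc/cachefilesd-test.conf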


===============
THINGS TO AVOID
===============

Do not mount other things within the cache as this will cause problems.  The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.

Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.

Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).

Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.

Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.

Do not chmod files in the cache.  The module creates things with minimal
permissions to prevent random users being able to access them directly.


=============
CACHE CULLING
=============

The cache may need culling occasionally to make space.  This involves
discarding objects from the cache that have been used less recently than
anything else.  Culling is based on the access time of data objects.  Empty
directories are culled if not in use.

Cache culling is done on the basis of the percentage of blocks available in the
underlying filesystem.  There are three "limits":

 (*) brun

     If the amount of available space in the cache rises above this limit, then
     culling is turned off.

 (*) bcull

     If the amount of available space in the cache falls below this limit, then
     culling is started.

 (*) bstop

     If the amount of available space in the cache falls below this limit, then
     no further allocation of disk space is permitted until culling has raised
     the amount above this limit again.

These must be configured thusly:

	0 <= bstop < bcull < brun < 100

Note that these are percentages of available space, and do _not_ appear as 100
minus the percentage displayed by the "df" program.
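
To illustrate with the default limits (brun 7%, bcull 5%, bstop 1%): culling
starts once the free space on the backing filesystem falls below 5%, further
allocation of cache space is refused once it falls below 1%, and culling is
switched off again once enough objects have been discarded to bring free space
back above 7%.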

The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order.  A new scan of the cache is
started as soon as space is made in the table.  Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.


===============
CACHE STRUCTURE
===============

The CacheFiles module will create two directories in the directory it was
given:

 (*) cache/

 (*) graveyard/

The active cache objects all reside in the first directory.  The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard, from which the daemon will actually delete them.

The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.


The module represents index objects as directories with the filename "I..." or
"J...".  Note that the "cache/" directory is itself a special index.

Data objects are represented as files if they have no children, or directories
if they do.  Their filenames all begin "D..." or "E...".  If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.

Special objects are similar to data objects, except their filenames begin
"S..." or "T...".


If an object has children, then it will be represented as a directory.
Immediately inside the representative directory is a collection of directories
named for hash values of the child object keys with an '@' prepended.  Into
this directory, if possible, will be placed the representations of the child
objects:

	INDEX     INDEX      INDEX                             DATA FILES
	========= ========== ================================= ================
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry


If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last of which will be used to name the
object inside the final directory.  The names of the intermediate directories
will have '+' prepended:

	J1223/@23/+xy...z/+kl...m/Epqr


Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.

To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable.  The two versions of
object filenames indicate the encoding:

	OBJECT TYPE	PRINTABLE	ENCODED
	===============	===============	===============
	Index		"I..."		"J..."
	Data		"D..."		"E..."
	Special		"S..."		"T..."

Intermediate directories are always "@" or "+" as appropriate.


Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs.  The latter is used to detect stale objects in the cache and update
or retire them.


Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).


This documentation is added by the patch to:

	Documentation/filesystems/caching/cachefiles.txt

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/caching/cachefiles.txt |  274 +++++
 fs/Kconfig                                       |    8 
 fs/Makefile                                      |    1 
 fs/buffer.c                                      |    2 
 fs/cachefiles/Makefile                           |   16 
 fs/cachefiles/cf-bind.c                          |  283 +++++
 fs/cachefiles/cf-interface.c                     | 1303 ++++++++++++++++++++++
 fs/cachefiles/cf-key.c                           |  160 +++
 fs/cachefiles/cf-main.c                          |  167 +++
 fs/cachefiles/cf-namei.c                         |  837 ++++++++++++++
 fs/cachefiles/cf-proc.c                          |  510 +++++++++
 fs/cachefiles/cf-xattr.c                         |  299 +++++
 fs/cachefiles/internal.h                         |  292 +++++
 fs/fcntl.c                                       |    2 
 include/linux/pagemap.h                          |    6 
 mm/filemap.c                                     |  102 ++
 16 files changed, 4262 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt
new file mode 100644
index 0000000..c9875f2
--- /dev/null
+++ b/Documentation/filesystems/caching/cachefiles.txt
@@ -0,0 +1,274 @@
+	       ===============================================
+	       CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
+	       ===============================================
+
+Contents:
+
+ (*) Overview.
+
+ (*) Requirements.
+
+ (*) Configuration.
+
+ (*) Starting the cache.
+
+ (*) Things to avoid.
+
+
+========
+OVERVIEW
+========
+
+CacheFiles is a caching backend that uses as its cache a directory on an
+already mounted filesystem of a local type (such as Ext3).
+
+CacheFiles uses a userspace daemon to do some of the cache management - such as
+reaping stale nodes and culling.  This is called cachefilesd and lives in
+/sbin.
+
+The filesystem and data integrity of the cache are only as good as those of the
+filesystem providing the backing services.  Note that CacheFiles does not
+attempt to journal anything since the journalling interfaces of the various
+filesystems are very specific in nature.
+
+CacheFiles creates a proc-file - "/proc/fs/cachefiles" - that is used for
+communication with the daemon.  Only one thing may have this open at once, and
+whilst it is open, a cache is at least partially in existence.  The daemon
+opens this and sends commands down it to control the cache.
+
+CacheFiles is currently limited to a single cache.
+
+CacheFiles attempts to maintain at least a certain percentage of free space on
+the filesystem, shrinking the cache by culling the objects it contains to make
+space if necessary - see the "Cache Culling" section.  This means it can be
+placed on the same medium as a live set of data, and will expand to make use of
+spare space and automatically contract when the set of data requires more
+space.
+
+
+============
+REQUIREMENTS
+============
+
+The use of CacheFiles and its daemon requires the following features to be
+available in the system and in the cache filesystem:
+
+	- dnotify.
+
+	- extended attributes (xattrs).
+
+	- openat() and friends.
+
+	- bmap() support on files in the filesystem (FIBMAP ioctl).
+
+	- The use of bmap() to detect a partial page at the end of the file.
+
+It is strongly recommended that the "dir_index" option be enabled on Ext3
+filesystems being used as a cache.
+
+
+=============
+CONFIGURATION
+=============
+
+The cache is configured by a script in /etc/cachefilesd.conf.  These commands
+set the cache up ready for use.  The following script commands are available
+(an example configuration is given at the end of this section):
+
+ (*) brun <N>%
+ (*) bcull <N>%
+ (*) bstop <N>%
+
+	Configure the culling limits.  Optional.  See the section on cache
+	culling.  The defaults are 7%, 5% and 1% respectively.
+
+ (*) dir <path>
+
+	Specify the directory containing the root of the cache.  Mandatory.
+
+ (*) tag <name>
+
+	Specify a tag for FS-Cache to use in distinguishing multiple caches.
+	Optional.  The default is "CacheFiles".
+
+ (*) debug <mask>
+
+	Specify a numeric bitmask to control debugging in the kernel module.
+	Optional.  The default is zero (all off).
+
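+For illustration only (the path and tag below are hypothetical), a minimal
+configuration might look like this:
+
+	dir /var/fscache
+	tag mycache
+	brun 10%
+	bcull 7%
+	bstop 3%
+
+The daemon reads these commands at startup and sends them down the
+/proc/fs/cachefiles file described above to configure the cache before
+binding it.
+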
+
+==================
+STARTING THE CACHE
+==================
+
+The cache is started by running the daemon.  The daemon opens the cache proc
+file, configures the cache and tells it to begin caching.  At that point the
+cache binds to fscache and the cache becomes live.
+
+The daemon is run as follows:
+
+	/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
+
+The flags are:
+
+ (*) -d
+
+	Increase the debugging level.  This can be specified multiple times and
+	is cumulative.
+
+ (*) -s
+
+	Send messages to stderr instead of syslog.
+
+ (*) -n
+
+	Don't daemonise and go into the background.
+
+ (*) -f <configfile>
+
+	Use an alternative configuration file rather than the default one.
+
+
+===============
+THINGS TO AVOID
+===============
+
+Do not mount other things within the cache as this will cause problems.  The
+kernel module contains its own very cut-down path walking facility that ignores
+mountpoints, but the daemon can't avoid them.
+
+Do not create, rename or unlink files and directories in the cache whilst the
+cache is active, as this may cause the state to become uncertain.
+
+Renaming files in the cache might make objects appear to be other objects (the
+filename is part of the lookup key).
+
+Do not change or remove the extended attributes attached to cache files by the
+cache as this will cause the cache state management to get confused.
+
+Do not create files or directories in the cache, lest the cache get confused or
+serve incorrect data.
+
+Do not chmod files in the cache.  The module creates things with minimal
+permissions to prevent random users being able to access them directly.
+
+
+=============
+CACHE CULLING
+=============
+
+The cache may need culling occasionally to make space.  This involves
+discarding objects from the cache that have been used less recently than
+anything else.  Culling is based on the access time of data objects.  Empty
+directories are culled if not in use.
+
+Cache culling is done on the basis of the percentage of blocks available in the
+underlying filesystem.  There are three "limits":
+
+ (*) brun
+
+     If the amount of available space in the cache rises above this limit, then
+     culling is turned off.
+
+ (*) bcull
+
+     If the amount of available space in the cache falls below this limit, then
+     culling is started.
+
+ (*) bstop
+
+     If the amount of available space in the cache falls below this limit, then
+     no further allocation of disk space is permitted until culling has raised
+     the amount above this limit again.
+
+These must be configured thusly:
+
+	0 <= bstop < bcull < brun < 100
+
+Note that these are percentages of available space, and do _not_ appear as 100
+minus the percentage displayed by the "df" program.
+
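+As a worked example with purely illustrative figures: on a backing filesystem
+of 1,000,000 blocks with bstop set to 3%, bcull to 7% and brun to 10%, the
+cache stops allocating new blocks once fewer than 30,000 blocks are available,
+begins culling once availability drops below 70,000 blocks, and stops culling
+again once availability rises back above 100,000 blocks.
+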
+The userspace daemon scans the cache to build up a table of cullable objects.
+These are then culled in least recently used order.  A new scan of the cache is
+started as soon as space is made in the table.  Objects will be skipped if
+their atimes have changed or if the kernel module says it is still using them.
+
+
+===============
+CACHE STRUCTURE
+===============
+
+The CacheFiles module will create two directories in the directory it was
+given:
+
+ (*) cache/
+
+ (*) graveyard/
+
+The active cache objects all reside in the first directory.  The CacheFiles
+kernel module moves any retired or culled objects that it can't simply unlink
+into the graveyard, from which the daemon will actually delete them.
+
+The daemon uses dnotify to monitor the graveyard directory, and will delete
+anything that appears therein.
+
+
+The module represents index objects as directories with the filename "I..." or
+"J...".  Note that the "cache/" directory is itself a special index.
+
+Data objects are represented as files if they have no children, or directories
+if they do.  Their filenames all begin "D..." or "E...".  If represented as a
+directory, data objects will have a file in the directory called "data" that
+actually holds the data.
+
+Special objects are similar to data objects, except their filenames begin
+"S..." or "T...".
+
+
+If an object has children, then it will be represented as a directory.
+Immediately inside the representative directory is a collection of directories
+named for hash values of the child object keys with an '@' prepended.  Into
+each of these directories, if possible, will be placed the representations of
+the child objects:
+
+	INDEX     INDEX      INDEX                             DATA FILES
+	========= ========== ================================= ================
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
+
+
+If the key is so long that it exceeds NAME_MAX with the decorations added on to
+it, then it will be cut into pieces, the first few of which will be used to
+make a nest of directories, and the last of which will be the name of the
+object inside the final directory.  The names of the intermediate directories
+will have '+' prepended:
+
+	J1223/@23/+xy...z/+kl...m/Epqr
+
+
+Note that keys are raw data, and not only may they exceed NAME_MAX in size,
+they may also contain things like '/' and NUL characters, and so they may not
+be suitable for turning directly into a filename.
+
+To handle this, CacheFiles will use a suitably printable filename directly and
+"base-64" encode ones that aren't directly suitable.  The two versions of
+object filenames indicate the encoding:
+
+	OBJECT TYPE	PRINTABLE	ENCODED
+	===============	===============	===============
+	Index		"I..."		"J..."
+	Data		"D..."		"E..."
+	Special		"S..."		"T..."
+
+The names of intermediate directories always begin with "@" or "+" as
+appropriate.
+
+
+Each object in the cache has an extended attribute label that holds the object
+type ID (required to distinguish special objects) and the auxiliary data from
+the netfs.  The latter is used to detect stale objects in the cache and update
+or retire them.
+
+
+Note that CacheFiles will erase from the cache any file it doesn't recognise or
+any file of an incorrect type (such as a FIFO file or a device file).
diff --git a/fs/Kconfig b/fs/Kconfig
index 6c95e58..9ef9f14 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -521,6 +521,14 @@ config FSCACHE
 
 	  See Documentation/filesystems/caching/fscache.txt for more information.
 
+config CACHEFILES
+	tristate "Filesystem caching on files"
+	depends on FSCACHE
+	help
+	  This permits use of a mounted filesystem as a cache for other
+	  filesystems - primarily networking filesystems - thus allowing fast
+	  local disk to enhance the speed of slower devices.
+
 endmenu
 
 menu "CD-ROM/DVD Filesystems"
diff --git a/fs/Makefile b/fs/Makefile
index 36ee03b..94ab3f9 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_AFS_FS)		+= afs/
 obj-$(CONFIG_BEFS_FS)		+= befs/
 obj-$(CONFIG_HOSTFS)		+= hostfs/
 obj-$(CONFIG_HPPFS)		+= hppfs/
+obj-$(CONFIG_CACHEFILES)	+= cachefiles/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_CONFIGFS_FS)	+= configfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
diff --git a/fs/buffer.c b/fs/buffer.c
index 23f1f3a..0602bf8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -185,6 +185,8 @@ int fsync_super(struct super_block *sb)
 	return sync_blockdev(sb->s_bdev);
 }
 
+EXPORT_SYMBOL(fsync_super);
+
 /*
  * Write out and wait upon all dirty data associated with this
  * device.   Filesystem data as well as the underlying block
diff --git a/fs/cachefiles/Makefile b/fs/cachefiles/Makefile
new file mode 100644
index 0000000..fc63c0c
--- /dev/null
+++ b/fs/cachefiles/Makefile
@@ -0,0 +1,16 @@
+#
+# Makefile for caching in a mounted filesystem
+#
+
+#CFLAGS += -finstrument-functions
+
+cachefiles-objs := \
+	cf-bind.o \
+	cf-interface.o \
+	cf-key.o \
+	cf-main.o \
+	cf-namei.o \
+	cf-proc.o \
+	cf-xattr.o
+
+obj-$(CONFIG_CACHEFILES) := cachefiles.o
diff --git a/fs/cachefiles/cf-bind.c b/fs/cachefiles/cf-bind.c
new file mode 100644
index 0000000..c15ec88
--- /dev/null
+++ b/fs/cachefiles/cf-bind.c
@@ -0,0 +1,283 @@
+/* cf-bind.c: bind and unbind a cache from the filesystem backing it
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/namespace.h>
+#include <linux/statfs.h>
+#include <linux/proc_fs.h>
+#include <linux/ctype.h>
+#include "internal.h"
+
+static int cachefiles_proc_add_cache(struct cachefiles_cache *cache,
+				     struct vfsmount *mnt);
+
+/*****************************************************************************/
+/*
+ * bind a directory as a cache
+ */
+int cachefiles_proc_bind(struct cachefiles_cache *cache, char *args)
+{
+	_enter("{%u,%u,%u},%s",
+	       cache->brun_percent,
+	       cache->bcull_percent,
+	       cache->bstop_percent,
+	       args);
+
+	/* start by checking things over */
+	ASSERT(cache->bstop_percent >= 0 &&
+	       cache->bstop_percent < cache->bcull_percent &&
+	       cache->bcull_percent < cache->brun_percent &&
+	       cache->brun_percent  < 100);
+
+	if (*args) {
+		kerror("'bind' command doesn't take an argument");
+		return -EINVAL;
+	}
+
+	if (!cache->rootdirname) {
+		kerror("No cache directory specified");
+		return -EINVAL;
+	}
+
+	/* don't permit already bound caches to be re-bound */
+	if (test_bit(CACHEFILES_READY, &cache->flags)) {
+		kerror("Cache already bound");
+		return -EBUSY;
+	}
+
+	/* make sure we have copies of the tag and dirname strings */
+	if (!cache->tag) {
+		/* the tag string is released by the fops->release()
+		 * function, so we don't release it on error here */
+		cache->tag = kstrdup("CacheFiles", GFP_KERNEL);
+		if (!cache->tag)
+			return -ENOMEM;
+	}
+
+	/* add the cache */
+	return cachefiles_proc_add_cache(cache, NULL);
+
+} /* end cachefiles_proc_bind() */
+
+/*****************************************************************************/
+/*
+ * add a cache
+ */
+static int cachefiles_proc_add_cache(struct cachefiles_cache *cache,
+				     struct vfsmount *mnt)
+{
+	struct cachefiles_object *fsdef;
+	struct nameidata nd;
+	struct kstatfs stats;
+	struct dentry *graveyard, *cachedir, *root;
+	int ret;
+
+	_enter("");
+
+	/* allocate the root index object */
+	ret = -ENOMEM;
+
+	fsdef = kmem_cache_alloc(cachefiles_object_jar, SLAB_KERNEL);
+	if (!fsdef)
+		goto error_root_object;
+
+	atomic_set(&fsdef->usage, 1);
+	atomic_set(&fsdef->fscache_usage, 1);
+	fsdef->type = FSCACHE_COOKIE_TYPE_INDEX;
+
+	_debug("- fsdef %p", fsdef);
+
+	/* look up the directory at the root of the cache */
+	memset(&nd, 0, sizeof(nd));
+
+	ret = path_lookup(cache->rootdirname, LOOKUP_DIRECTORY, &nd);
+	if (ret < 0)
+		goto error_open_root;
+
+	/* bind to the special mountpoint we've prepared */
+	if (mnt) {
+		atomic_inc(&nd.mnt->mnt_sb->s_active);
+		mnt->mnt_sb = nd.mnt->mnt_sb;
+		mnt->mnt_flags = nd.mnt->mnt_flags;
+		mnt->mnt_flags |= MNT_NOSUID | MNT_NOEXEC | MNT_NODEV;
+		mnt->mnt_root = dget(nd.dentry);
+		mnt->mnt_mountpoint = mnt->mnt_root;
+
+		/* copy the name, but ignore kstrdup() failing with ENOMEM -
+		 * we'll just end up with a devicenameless mountpoint */
+		mnt->mnt_devname = kstrdup(nd.mnt->mnt_devname, GFP_KERNEL);
+		path_release(&nd);
+
+		cache->mnt = mntget(mnt);
+		root = dget(mnt->mnt_root);
+	}
+	else {
+		cache->mnt = nd.mnt;
+		root = nd.dentry;
+
+		nd.mnt = NULL;
+		nd.dentry = NULL;
+		path_release(&nd);
+	}
+
+	/* check parameters */
+	ret = -EOPNOTSUPP;
+	if (!root->d_inode ||
+	    !root->d_inode->i_op ||
+	    !root->d_inode->i_op->lookup ||
+	    !root->d_inode->i_op->mkdir ||
+	    !root->d_inode->i_op->setxattr ||
+	    !root->d_inode->i_op->getxattr ||
+	    !root->d_sb ||
+	    !root->d_sb->s_op ||
+	    !root->d_sb->s_op->statfs ||
+	    !root->d_sb->s_op->sync_fs)
+		goto error_unsupported;
+
+	ret = -EROFS;
+	if (root->d_sb->s_flags & MS_RDONLY)
+		goto error_unsupported;
+
+	/* get the cache size and blocksize */
+	ret = root->d_sb->s_op->statfs(root->d_sb, &stats);
+	if (ret < 0)
+		goto error_unsupported;
+
+	ret = -ERANGE;
+	if (stats.f_bsize <= 0)
+		goto error_unsupported;
+
+	ret = -EOPNOTSUPP;
+	if (stats.f_bsize > PAGE_SIZE)
+		goto error_unsupported;
+
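+	/* work out the shift needed to convert a count of backing-fs blocks
+	 * into a count of pages (e.g. 1KB blocks with 4KB pages give a shift
+	 * of 2) */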
+	cache->bsize = stats.f_bsize;
+	cache->bshift = 0;
+	if (stats.f_bsize < PAGE_SIZE)
+		cache->bshift = PAGE_SHIFT - long_log2(stats.f_bsize);
+
+	_debug("blksize %u (shift %u)",
+	       cache->bsize, cache->bshift);
+
+	_debug("size %llu, avail %llu", stats.f_blocks, stats.f_bavail);
+
+	/* set up caching limits */
+	stats.f_blocks >>= cache->bshift;
+	do_div(stats.f_blocks, 100);
+	cache->bstop = stats.f_blocks * cache->bstop_percent;
+	cache->bcull = stats.f_blocks * cache->bcull_percent;
+	cache->brun  = stats.f_blocks * cache->brun_percent;
+
+	_debug("limits {%llu,%llu,%llu}",
+	       cache->brun,
+	       cache->bcull,
+	       cache->bstop);
+
+	/* get the cache directory and check its type */
+	cachedir = cachefiles_get_directory(cache, root, "cache");
+	if (IS_ERR(cachedir)) {
+		ret = PTR_ERR(cachedir);
+		goto error_unsupported;
+	}
+
+	fsdef->dentry = cachedir;
+
+	ret = cachefiles_check_object_type(fsdef);
+	if (ret < 0)
+		goto error_unsupported;
+
+	/* get the graveyard directory */
+	graveyard = cachefiles_get_directory(cache, root, "graveyard");
+	if (IS_ERR(graveyard)) {
+		ret = PTR_ERR(graveyard);
+		goto error_unsupported;
+	}
+
+	cache->graveyard = graveyard;
+
+	/* publish the cache */
+	fscache_init_cache(&cache->cache,
+			   &cachefiles_cache_ops,
+			   "%02x:%02x",
+			   MAJOR(fsdef->dentry->d_sb->s_dev),
+			   MINOR(fsdef->dentry->d_sb->s_dev)
+			   );
+
+	ret = fscache_add_cache(&cache->cache, &fsdef->fscache, cache->tag);
+	if (ret < 0)
+		goto error_add_cache;
+
+	/* done */
+	set_bit(CACHEFILES_READY, &cache->flags);
+	dput(root);
+
+	printk(KERN_INFO "CacheFiles:"
+	       " File cache on %s registered\n",
+	       cache->cache.identifier);
+
+	/* check how much space the cache has */
+	cachefiles_has_space(cache, 0);
+
+	return 0;
+
+error_add_cache:
+	dput(cache->graveyard);
+	cache->graveyard = NULL;
+error_unsupported:
+	mntput(cache->mnt);
+	cache->mnt = NULL;
+	dput(fsdef->dentry);
+	fsdef->dentry = NULL;
+	dput(root);
+error_open_root:
+	kmem_cache_free(cachefiles_object_jar, fsdef);
+error_root_object:
+	kerror("Failed to register: %d", ret);
+	return ret;
+
+} /* end cachefiles_proc_add_cache() */
+
+/*****************************************************************************/
+/*
+ * unbind a cache on fd release
+ */
+void cachefiles_proc_unbind(struct cachefiles_cache *cache)
+{
+	_enter("");
+
+	if (test_bit(CACHEFILES_READY, &cache->flags)) {
+		printk(KERN_INFO "CacheFiles:"
+		       " File cache on %s unregistering\n",
+		       cache->cache.identifier);
+
+		fscache_withdraw_cache(&cache->cache);
+	}
+
+	if (cache->cache.fsdef)
+		cache->cache.ops->put_object(cache->cache.fsdef);
+
+	dput(cache->graveyard);
+	mntput(cache->mnt);
+
+	kfree(cache->rootdirname);
+	kfree(cache->tag);
+
+	_leave("");
+
+} /* end cachefiles_proc_unbind() */
diff --git a/fs/cachefiles/cf-interface.c b/fs/cachefiles/cf-interface.c
new file mode 100644
index 0000000..d94ef9a
--- /dev/null
+++ b/fs/cachefiles/cf-interface.c
@@ -0,0 +1,1303 @@
+/* cf-interface.c: CacheFiles to FS-Cache interface
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/buffer_head.h>
+#include "internal.h"
+
+#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define log2(n) ffz(~(n))
+
+/*****************************************************************************/
+/*
+ * look up the nominated node in this cache, creating it if necessary
+ */
+static struct fscache_object *cachefiles_lookup_object(
+	struct fscache_cache *_cache,
+	struct fscache_object *_parent,
+	struct fscache_cookie *cookie)
+{
+	struct cachefiles_object *parent, *object;
+	struct cachefiles_cache *cache;
+	struct cachefiles_xattr *auxdata;
+	unsigned keylen, auxlen;
+	void *buffer;
+	char *key;
+	int ret;
+
+	ASSERT(_parent);
+
+	cache = container_of(_cache, struct cachefiles_cache, cache);
+	parent = container_of(_parent, struct cachefiles_object, fscache);
+
+	//printk("\n");
+	_enter("{%s},%p,%p", cache->cache.identifier, parent, cookie);
+
+	/* create a new object record and a temporary leaf image */
+	object = kmem_cache_alloc(cachefiles_object_jar, SLAB_KERNEL);
+	if (!object)
+		goto nomem_object;
+
+	atomic_set(&object->usage, 1);
+	atomic_set(&object->fscache_usage, 1);
+
+	fscache_object_init(&object->fscache);
+	object->fscache.cookie = cookie;
+	object->fscache.cache = parent->fscache.cache;
+
+	object->type = cookie->def->type;
+
+	/* get hold of the raw key
+	 * - stick the length on the front and leave space on the back for the
+	 *   encoder
+	 */
+	buffer = kmalloc((2 + 512) + 3, GFP_KERNEL);
+	if (!buffer)
+		goto nomem_buffer;
+
+	keylen = cookie->def->get_key(cookie->netfs_data, buffer + 2, 512);
+	ASSERTCMP(keylen, <, 512);
+
+	*(uint16_t *)buffer = keylen;
+	((char *)buffer)[keylen + 2] = 0;
+	((char *)buffer)[keylen + 3] = 0;
+	((char *)buffer)[keylen + 4] = 0;
+
+	/* turn the raw key into something that we can work with as a filename */
+	key = cachefiles_cook_key(buffer, keylen + 2, object->type);
+	if (!key)
+		goto nomem_key;
+
+	/* get hold of the auxiliary data and prepend the object type */
+	auxdata = buffer;
+	auxlen = 0;
+	if (cookie->def->get_aux) {
+		auxlen = cookie->def->get_aux(cookie->netfs_data,
+					      auxdata->data, 511);
+		ASSERTCMP(auxlen, <, 511);
+	}
+
+	auxdata->len = auxlen + 1;
+	auxdata->type = cookie->def->type;
+
+	/* look up the key, creating any missing bits */
+	ret = cachefiles_walk_to_object(parent, object, key, auxdata);
+	if (ret < 0)
+		goto lookup_failed;
+
+	kfree(buffer);
+	kfree(key);
+	_leave(" = %p", &object->fscache);
+	return &object->fscache;
+
+lookup_failed:
+	kmem_cache_free(cachefiles_object_jar, object);
+	kfree(buffer);
+	kfree(key);
+	kleave(" = %d", ret);
+	return ERR_PTR(ret);
+
+nomem_key:
+	kfree(buffer);
+nomem_buffer:
+	kmem_cache_free(cachefiles_object_jar, object);
+nomem_object:
+	kleave(" = -ENOMEM");
+	return ERR_PTR(-ENOMEM);
+
+} /* end cachefiles_lookup_object() */
+
+/*****************************************************************************/
+/*
+ * increment the usage count on an inode object (may fail if unmounting)
+ */
+static struct fscache_object *cachefiles_grab_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+
+	_enter("%p", _object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+#ifdef CACHEFILES_DEBUG_SLAB
+	ASSERT((atomic_read(&object->fscache_usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+	atomic_inc(&object->fscache_usage);
+	return &object->fscache;
+
+} /* end cachefiles_grab_object() */
+
+/*****************************************************************************/
+/*
+ * lock the semaphore on an object
+ */
+static void cachefiles_lock_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+
+	_enter("%p", _object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+#ifdef CACHEFILES_DEBUG_SLAB
+	ASSERT((atomic_read(&object->fscache_usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+	down_write(&object->sem);
+
+} /* end cachefiles_lock_object() */
+
+/*****************************************************************************/
+/*
+ * unlock the semaphore on an object
+ */
+static void cachefiles_unlock_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+
+	_enter("%p", _object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	up_write(&object->sem);
+
+} /* end cachefiles_unlock_object() */
+
+/*****************************************************************************/
+/*
+ * update the auxiliary data for an object on disk
+ */
+static void cachefiles_update_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+
+	kenter("%p", _object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache, struct cachefiles_cache, cache);
+
+	//cachefiles_tree_update_object(super, object);
+
+} /* end cachefiles_update_object() */
+
+/*****************************************************************************/
+/*
+ * dispose of a reference to an object
+ */
+static void cachefiles_put_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	int ret;
+
+	ASSERT(_object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	_enter("%p{%d}", object, atomic_read(&object->usage));
+
+	ASSERT(object);
+
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+#ifdef CACHEFILES_DEBUG_SLAB
+	ASSERT((atomic_read(&object->fscache_usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+	if (!atomic_dec_and_test(&object->fscache_usage))
+		return;
+
+	_debug("- kill object %p", object);
+
+	/* delete retired objects */
+	if (test_bit(FSCACHE_OBJECT_RECYCLING, &object->fscache.flags) &&
+	    _object != cache->cache.fsdef
+	    ) {
+		_debug("- retire object %p", object);
+		cachefiles_delete_object(cache, object);
+	}
+
+	/* close the filesystem stuff attached to the object */
+	if (object->backer) {
+		if (object->backer->f_op &&
+		    object->backer->f_op->flush
+		    ) {
+			ret = object->backer->f_op->flush(object->backer);
+			if (ret < 0)
+				kerror("Backing file flush returned error %d",
+				       ret);
+		}
+		fput(object->backer);
+		object->backer = NULL;
+	}
+
+	/* note that an object is now inactive */
+	write_lock(&cache->active_lock);
+	rb_erase(&object->active_node, &cache->active_nodes);
+	write_unlock(&cache->active_lock);
+
+	dput(object->dentry);
+	object->dentry = NULL;
+
+	/* then dispose of the object */
+	kmem_cache_free(cachefiles_object_jar, object);
+
+	_leave("");
+
+} /* end cachefiles_put_object() */
+
+/*****************************************************************************/
+/*
+ * sync a cache
+ */
+static void cachefiles_sync_cache(struct fscache_cache *_cache)
+{
+	struct cachefiles_cache *cache;
+	int ret;
+
+	_enter("%p", _cache);
+
+	cache = container_of(_cache, struct cachefiles_cache, cache);
+
+	/* make sure all pages pinned by operations on behalf of the netfs are
+	 * written to disc */
+	ret = fsync_super(cache->mnt->mnt_sb);
+	if (ret == -EIO)
+		cachefiles_io_error(cache,
+				    "Attempt to sync backing fs superblock"
+				    " returned error %d",
+				    ret);
+
+} /* end cachefiles_sync_cache() */
+
+/*****************************************************************************/
+/*
+ * set the data size on an object
+ */
+static int cachefiles_set_i_size(struct fscache_object *_object, loff_t i_size)
+{
+	struct cachefiles_object *object;
+	struct iattr newattrs;
+	int ret;
+
+	_enter("%p,%llu", _object, i_size);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+	if (i_size == object->i_size)
+		return 0;
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	ASSERT(S_ISREG(object->backer->f_dentry->d_inode->i_mode));
+
+	newattrs.ia_size = i_size;
+	newattrs.ia_file = object->backer;
+	newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
+
+	mutex_lock(&object->backer->f_dentry->d_inode->i_mutex);
+	ret = notify_change(object->backer->f_dentry, &newattrs);
+	mutex_unlock(&object->backer->f_dentry->d_inode->i_mutex);
+
+	if (ret == -EIO) {
+		cachefiles_io_error_obj(object, "Size set failed");
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_set_i_size() */
+
+/*****************************************************************************/
+/*
+ * see if we have space for a number of pages in the cache
+ */
+int cachefiles_has_space(struct cachefiles_cache *cache, unsigned nr)
+{
+	struct kstatfs stats;
+	int ret;
+
+	_enter("{%llu,%llu,%llu},%d",
+	       cache->brun, cache->bcull, cache->bstop,  nr);
+
+	/* find out how many pages of blockdev are available */
+	memset(&stats, 0, sizeof(stats));
+
+	ret = cache->mnt->mnt_sb->s_op->statfs(cache->mnt->mnt_sb, &stats);
+	if (ret < 0) {
+		if (ret == -EIO)
+			cachefiles_io_error(cache, "statfs failed");
+		return ret;
+	}
+
+	stats.f_bavail >>= cache->bshift;
+
+	_debug("avail %llu", stats.f_bavail);
+
+	/* see if there is sufficient space */
+	stats.f_bavail -= nr;
+
+	ret = -ENOBUFS;
+	if (stats.f_bavail < cache->bstop)
+		goto begin_cull;
+
+	ret = 0;
+	if (stats.f_bavail < cache->bcull)
+		goto begin_cull;
+
+	if (test_bit(CACHEFILES_CULLING, &cache->flags) &&
+	    stats.f_bavail >= cache->brun
+	    ) {
+		if (test_and_clear_bit(CACHEFILES_CULLING, &cache->flags)) {
+			kdebug("cease culling");
+			send_sigurg(&cache->cachefilesd->f_owner);
+		}
+	}
+
+	_leave(" = 0");
+	return 0;
+
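+	/* tell the daemon to start culling by sending SIGURG to the owner of
+	 * its open /proc/fs/cachefiles file handle */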
+begin_cull:
+	if (!test_and_set_bit(CACHEFILES_CULLING, &cache->flags)) {
+		kdebug("### CULL CACHE ###");
+		send_sigurg(&cache->cachefilesd->f_owner);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_has_space() */
+
+/*****************************************************************************/
+/*
+ * waitqueue monitor callback, run when a read on a backing file page completes
+ */
+static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
+				  int sync, void *_key)
+{
+	struct cachefiles_one_read *monitor =
+		container_of(wait, struct cachefiles_one_read, monitor);
+	struct wait_bit_key *key = _key;
+	struct page *page = wait->private;
+
+	ASSERT(key);
+
+	_enter("{%lu},%u,%d,{%p,%u}",
+	       monitor->netfs_page->index, mode, sync,
+	       key->flags, key->bit_nr);
+
+	if (key->flags != &page->flags ||
+	    key->bit_nr != PG_locked)
+		return 0;
+
+	_debug("--- monitor %p %lx ---", page, page->flags);
+
+	if (!PageUptodate(page) && !PageError(page))
+		dump_stack();
+
+	/* remove from the waitqueue */
+	list_del(&wait->task_list);
+
+	/* move onto the action list and queue for keventd */
+	ASSERT(monitor->object);
+
+	spin_lock(&monitor->object->work_lock);
+	list_move(&monitor->obj_link, &monitor->object->read_list);
+	spin_unlock(&monitor->object->work_lock);
+
+	schedule_work(&monitor->object->read_work);
+
+	return 0;
+
+} /* end cachefiles_read_waiter() */
+
+/*****************************************************************************/
+/*
+ * let keventd drive the copying of pages
+ */
+void cachefiles_read_copier_work(void *_object)
+{
+	struct cachefiles_one_read *monitor;
+	struct cachefiles_object *object = _object;
+	struct fscache_cookie *cookie = object->fscache.cookie;
+	struct pagevec pagevec;
+	int error, max;
+
+	_enter("{ino=%lu}", object->backer->f_dentry->d_inode->i_ino);
+
+	pagevec_init(&pagevec, 0);
+
+	max = 8;
+	spin_lock_irq(&object->work_lock);
+
+	while (!list_empty(&object->read_list)) {
+		monitor = list_entry(object->read_list.next,
+				     struct cachefiles_one_read, obj_link);
+		list_del(&monitor->obj_link);
+
+		spin_unlock_irq(&object->work_lock);
+
+		_debug("- copy {%lu}", monitor->back_page->index);
+
+		error = -EIO;
+		if (PageUptodate(monitor->back_page)) {
+			copy_highpage(monitor->netfs_page, monitor->back_page);
+
+			pagevec_add(&pagevec, monitor->netfs_page);
+			cookie->def->mark_pages_cached(
+				cookie->netfs_data,
+				monitor->netfs_page->mapping,
+				&pagevec);
+			pagevec_reinit(&pagevec);
+
+			error = 0;
+		}
+
+		if (error)
+			cachefiles_io_error_obj(
+				object,
+				"readpage failed on backing file %lx",
+				(unsigned long) monitor->back_page->flags);
+
+		page_cache_release(monitor->back_page);
+
+		monitor->callback_func(monitor->netfs_page,
+				       monitor->callback_data, error);
+
+		page_cache_release(monitor->netfs_page);
+		kfree(monitor);
+
+		/* let keventd have some air occasionally */
+		max--;
+		if (max < 0 || need_resched()) {
+			if (!list_empty(&object->read_list))
+				schedule_work(&object->read_work);
+			_leave(" [maxed out]");
+			return;
+		}
+
+		spin_lock_irq(&object->work_lock);
+	}
+
+	spin_unlock_irq(&object->work_lock);
+
+	_leave("");
+
+} /* end cachefiles_read_copier_work() */
+
+/*****************************************************************************/
+/*
+ * read the backing file page corresponding to the given netfs page
+ * - an uncertain page is simply discarded, to be tried again another time
+ */
+static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
+					    fscache_rw_complete_t callback_func,
+					    void *callback_data,
+					    struct page *netpage,
+					    struct pagevec *lru_pvec)
+{
+	struct cachefiles_one_read *monitor;
+	struct address_space *bmapping;
+	struct page *newpage, *backpage;
+	int ret;
+
+	_enter("");
+
+	ASSERTCMP(pagevec_count(lru_pvec), ==, 0);
+	pagevec_reinit(lru_pvec);
+
+	_debug("read back %p{%lu,%d}",
+	       netpage, netpage->index, page_count(netpage));
+
+	monitor = kzalloc(sizeof(*monitor), GFP_KERNEL);
+	if (!monitor)
+		goto nomem;
+
+	init_waitqueue_func_entry(&monitor->monitor, cachefiles_read_waiter);
+	monitor->object = object;
+	monitor->callback_func = callback_func;
+	monitor->callback_data = callback_data;
+	monitor->netfs_page = netpage;
+
+	/* attempt to get hold of the backing page */
+	bmapping = object->backer->f_mapping;
+	newpage = NULL;
+
+	for (;;) {
+		backpage = find_get_page(bmapping, netpage->index);
+		if (backpage)
+			goto backing_page_already_present;
+
+		if (!newpage) {
+			newpage = page_cache_alloc_cold(bmapping);
+			if (!newpage)
+				goto nomem_monitor;
+		}
+
+		ret = add_to_page_cache(newpage, bmapping,
+					netpage->index, GFP_KERNEL);
+		if (ret == 0)
+			goto installed_new_backing_page;
+		if (ret != -EEXIST)
+			goto nomem_page;
+	}
+
+	/* we've installed a new backing page, so now we need to add it
+	 * to the LRU list and start it reading */
+installed_new_backing_page:
+	_debug("- new %p", newpage);
+
+	backpage = newpage;
+	newpage = NULL;
+
+	page_cache_get(backpage);
+	pagevec_add(lru_pvec, backpage);
+	__pagevec_lru_add(lru_pvec);
+
+	ret = bmapping->a_ops->readpage(object->backer, backpage);
+	if (ret < 0)
+		goto read_error;
+
+	/* set the monitor to transfer the data across */
+monitor_backing_page:
+	_debug("- monitor add");
+
+	/* install the monitor */
+	page_cache_get(monitor->netfs_page);
+	page_cache_get(backpage);
+	monitor->back_page = backpage;
+
+	spin_lock_irq(&object->work_lock);
+	list_add_tail(&monitor->obj_link, &object->read_pend_list);
+	spin_unlock_irq(&object->work_lock);
+
+	monitor->monitor.private = backpage;
+	install_page_waitqueue_monitor(backpage, &monitor->monitor);
+	monitor = NULL;
+
+	/* but the page may have been read before the monitor was
+	 * installed, so the monitor may miss the event - so we have to
+	 * ensure that we do get one in such a case */
+	if (!TestSetPageLocked(backpage))
+		unlock_page(backpage);
+	goto success;
+
+	/* if the backing page is already present, it can be in one of
+	 * three states: read in progress, read failed or read okay */
+backing_page_already_present:
+	_debug("- present");
+
+	if (newpage) {
+		page_cache_release(newpage);
+		newpage = NULL;
+	}
+
+	if (PageError(backpage))
+		goto io_error;
+
+	if (PageUptodate(backpage))
+		goto backing_page_already_uptodate;
+
+	goto monitor_backing_page;
+
+	/* the backing page is already up to date - just copy the data
+	 * across to the netfs page and complete the read */
+backing_page_already_uptodate:
+	_debug("- uptodate");
+
+	copy_highpage(netpage, backpage);
+	callback_func(netpage, callback_data, 0);
+
+success:
+	_debug("success");
+	ret = 0;
+
+out:
+	if (backpage)
+		page_cache_release(backpage);
+	kfree(monitor);
+
+	_leave(" = %d", ret);
+	return ret;
+
+read_error:
+	_debug("read error %d", ret);
+	if (ret == -ENOMEM)
+		goto out;
+io_error:
+	cachefiles_io_error_obj(object, "page read error on backing file");
+	ret = -EIO;
+	goto out;
+
+nomem_page:
+	page_cache_release(newpage);
+nomem_monitor:
+	kfree(monitor);
+nomem:
+	_leave(" = -ENOMEM");
+	return -ENOMEM;
+
+} /* end cachefiles_read_backing_file_one() */
+
+/*****************************************************************************/
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if out of memory
+ * - returns -ENOBUFS if no buffers can be made available
+ * - returns -ENOBUFS if page is beyond EOF
+ * - if the page is backed by a block in the cache:
+ *   - a read will be started which will call the callback on completion
+ *   - 0 will be returned
+ * - else if the page is unbacked:
+ *   - the metadata will be retained
+ *   - -ENODATA will be returned
+ */
+static int cachefiles_read_or_alloc_page(struct fscache_object *_object,
+					 struct page *page,
+					 fscache_rw_complete_t callback_func,
+					 void *callback_data,
+					 unsigned long gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct fscache_cookie *cookie;
+	struct pagevec pagevec;
+	struct inode *inode;
+	sector_t block0, e3block;
+	unsigned shift;
+	int ret;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache, struct cachefiles_cache, cache);
+
+	_enter("{%p},{%lx},,,", object, page->index);
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	inode = object->backer->f_dentry->d_inode;
+	ASSERT(S_ISREG(inode->i_mode));
+	ASSERT(inode->i_mapping->a_ops->bmap);
+	ASSERT(inode->i_mapping->a_ops->readpages);
+
+	/* calculate the shift required to use bmap */
+	if (inode->i_sb->s_blocksize > PAGE_SIZE)
+		return -ENOBUFS;
+
+	shift = log2(PAGE_SIZE / inode->i_sb->s_blocksize);
+
+	cookie = object->fscache.cookie;
+
+	pagevec_init(&pagevec, 0);
+
+	/* we assume the absence or presence of the first block is a good
+	 * enough indication for the page as a whole
+	 * - TODO: don't use bmap() for this as it is _not_ actually good
+	 *   enough for this as it doesn't indicate errors, but it's all we've
+	 *   got for the moment
+	 */
+	block0 = page->index;
+	block0 <<= shift;
+
+	e3block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
+	_debug("%llx -> %llx", block0, e3block);
+
+	if (e3block) {
+		/* submit the apparently valid page to the backing fs to be
+		 * read from disk */
+		ret = cachefiles_read_backing_file_one(object,
+						       callback_func,
+						       callback_data,
+						       page,
+						       &pagevec);
+		ret = 0;
+	}
+	else if (cachefiles_has_space(cache, 1) == 0) {
+		/* there's space in the cache we can use */
+		pagevec_add(&pagevec, page);
+		cookie->def->mark_pages_cached(cookie->netfs_data,
+					       page->mapping, &pagevec);
+		ret = -ENODATA;
+	}
+	else {
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_read_or_alloc_page() */
+
+/*****************************************************************************/
+/*
+ * read the backing file pages corresponding to the given set of netfs pages
+ * - any uncertain pages are simply discarded, to be tried again another time
+ */
+static int cachefiles_read_backing_file(struct cachefiles_object *object,
+					fscache_rw_complete_t callback_func,
+					void *callback_data,
+					struct address_space *mapping,
+					struct list_head *list,
+					struct pagevec *lru_pvec)
+{
+	struct cachefiles_one_read *monitor = NULL;
+	struct address_space *bmapping = object->backer->f_mapping;
+	struct page *newpage = NULL, *netpage, *_n, *backpage = NULL;
+	int ret = 0;
+
+	_enter("");
+
+	ASSERTCMP(pagevec_count(lru_pvec), ==, 0);
+	pagevec_reinit(lru_pvec);
+
+	list_for_each_entry_safe(netpage, _n, list, lru) {
+		list_del(&netpage->lru);
+
+		_debug("read back %p{%lu,%d}",
+		       netpage, netpage->index, page_count(netpage));
+
+		if (!monitor) {
+			monitor = kzalloc(sizeof(*monitor), GFP_KERNEL);
+			if (!monitor)
+				goto nomem;
+
+			init_waitqueue_func_entry(&monitor->monitor,
+						  cachefiles_read_waiter);
+			monitor->object = object;
+			monitor->callback_func = callback_func;
+			monitor->callback_data = callback_data;
+		}
+
+		for (;;) {
+			backpage = find_get_page(bmapping, netpage->index);
+			if (backpage)
+				goto backing_page_already_present;
+
+			if (!newpage) {
+				newpage = page_cache_alloc_cold(bmapping);
+				if (!newpage)
+					goto nomem;
+			}
+
+			ret = add_to_page_cache(newpage, bmapping,
+						netpage->index, GFP_KERNEL);
+			if (ret == 0)
+				goto installed_new_backing_page;
+			if (ret != -EEXIST)
+				goto nomem;
+		}
+
+		/* we've installed a new backing page, so now we need to add it
+		 * to the LRU list and start it reading */
+	installed_new_backing_page:
+		_debug("- new %p", newpage);
+
+		backpage = newpage;
+		newpage = NULL;
+
+		page_cache_get(backpage);
+		if (!pagevec_add(lru_pvec, backpage))
+			__pagevec_lru_add(lru_pvec);
+
+	reread_backing_page:
+		ret = bmapping->a_ops->readpage(object->backer, backpage);
+		if (ret < 0)
+			goto read_error;
+
+		/* add the netfs page to the pagecache and LRU, and set the
+		 * monitor to transfer the data across */
+	monitor_backing_page:
+		_debug("- monitor add");
+
+		ret = add_to_page_cache(netpage, mapping, netpage->index,
+					GFP_KERNEL);
+		if (ret < 0) {
+			if (ret == -EEXIST) {
+				page_cache_release(netpage);
+				continue;
+			}
+			goto nomem;
+		}
+
+		page_cache_get(netpage);
+		if (!pagevec_add(lru_pvec, netpage))
+			__pagevec_lru_add(lru_pvec);
+
+		/* install a monitor */
+		page_cache_get(netpage);
+		monitor->netfs_page = netpage;
+
+		page_cache_get(backpage);
+		monitor->back_page = backpage;
+
+		spin_lock_irq(&object->work_lock);
+		list_add_tail(&monitor->obj_link, &object->read_pend_list);
+		spin_unlock_irq(&object->work_lock);
+
+		monitor->monitor.private = backpage;
+		install_page_waitqueue_monitor(backpage, &monitor->monitor);
+		monitor = NULL;
+
+		/* but the page may have been read before the monitor was
+		 * installed, so the monitor may miss the event - so we have to
+		 * ensure that we do get one in such a case */
+		if (!TestSetPageLocked(backpage)) {
+			_debug("2unlock %p", backpage);
+			unlock_page(backpage);
+		}
+
+		page_cache_release(backpage);
+		backpage = NULL;
+
+		page_cache_release(netpage);
+		netpage = NULL;
+		continue;
+
+		/* if the backing page is already present, it can be in one of
+		 * three states: read in progress, read failed or read okay */
+	backing_page_already_present:
+		_debug("- present %p", backpage);
+
+		if (PageError(backpage))
+			goto io_error;
+
+		if (PageUptodate(backpage))
+			goto backing_page_already_uptodate;
+
+		_debug("- not ready %p{%lx}", backpage, backpage->flags);
+
+		if (TestSetPageLocked(backpage))
+			goto monitor_backing_page;
+
+		if (PageError(backpage)) {
+			unlock_page(backpage);
+			goto io_error;
+		}
+
+		if (PageUptodate(backpage))
+			goto backing_page_already_uptodate_unlock;
+
+		/* we've locked a page that's neither up to date nor erroneous,
+		 * so we need to attempt to read it again */
+		//if (!PageLRU(backpage))
+		//	goto add_to_LRU_and_reread_backing_page;
+
+		goto reread_backing_page;
+
+		/* the backing page is already up to date, attach the netfs
+		 * page to the pagecache and LRU and copy the data across */
+	backing_page_already_uptodate_unlock:
+		unlock_page(backpage);
+	backing_page_already_uptodate:
+		_debug("- uptodate");
+
+		ret = add_to_page_cache(netpage, mapping, netpage->index,
+					GFP_KERNEL);
+		if (ret < 0) {
+			if (ret == -EEXIST) {
+				page_cache_release(netpage);
+				continue;
+			}
+			goto nomem;
+		}
+
+		copy_highpage(netpage, backpage);
+
+		page_cache_release(backpage);
+		backpage = NULL;
+
+		page_cache_get(netpage);
+		if (!pagevec_add(lru_pvec, netpage))
+			__pagevec_lru_add(lru_pvec);
+
+		callback_func(netpage, callback_data, 0);
+
+		page_cache_release(netpage);
+		netpage = NULL;
+		continue;
+	}
+
+	netpage = NULL;
+
+	_debug("out");
+
+out:
+	/* tidy up */
+	pagevec_lru_add(lru_pvec);
+
+	if (newpage)
+		page_cache_release(newpage);
+	if (netpage)
+		page_cache_release(netpage);
+	if (backpage)
+		page_cache_release(backpage);
+	kfree(monitor);
+
+	list_for_each_entry_safe(netpage, _n, list, lru) {
+		list_del(&netpage->lru);
+		page_cache_release(netpage);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+nomem:
+	_debug("nomem");
+	ret = -ENOMEM;
+	goto out;
+
+read_error:
+	_debug("read error %d", ret);
+	if (ret == -ENOMEM)
+		goto out;
+io_error:
+	cachefiles_io_error_obj(object, "page read error on backing file");
+	ret = -EIO;
+	goto out;
+
+} /* end cachefiles_read_backing_file() */
+
+/*****************************************************************************/
+/*
+ * read a list of pages from the cache or allocate blocks in which to store
+ * them
+ */
+static int cachefiles_read_or_alloc_pages(struct fscache_object *_object,
+					  struct address_space *mapping,
+					  struct list_head *pages,
+					  unsigned *nr_pages,
+					  fscache_rw_complete_t callback_func,
+					  void *callback_data,
+					  unsigned long gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct fscache_cookie *cookie;
+	struct list_head e3pages;
+	struct pagevec pagevec;
+	struct inode *inode;
+	struct page *page, *_n;
+	unsigned shift, e3nrpages;
+	int ret, ret2, space;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache, struct cachefiles_cache, cache);
+
+	_enter("{%p},,%d,,", object, *nr_pages);
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	space = 1;
+	if (cachefiles_has_space(cache, *nr_pages) < 0)
+		space = 0;
+
+	inode = object->backer->f_dentry->d_inode;
+	ASSERT(S_ISREG(inode->i_mode));
+	ASSERT(inode->i_mapping->a_ops->bmap);
+	ASSERT(inode->i_mapping->a_ops->readpages);
+
+	/* calculate the shift required to use bmap */
+	if (inode->i_sb->s_blocksize > PAGE_SIZE)
+		return -ENOBUFS;
+
+	shift = log2(PAGE_SIZE / inode->i_sb->s_blocksize);
+
+	pagevec_init(&pagevec, 0);
+
+	cookie = object->fscache.cookie;
+
+	INIT_LIST_HEAD(&e3pages);
+	e3nrpages = 0;
+
+	ret = space ? -ENODATA : -ENOBUFS;
+	list_for_each_entry_safe(page, _n, pages, lru) {
+		sector_t block0, e3block;
+
+		/* we assume the absence or presence of the first block is a
+		 * good enough indication for the page as a whole
+		 * - TODO: don't use bmap() for this as it is _not_ actually
+		 *   good enough for this as it doesn't indicate errors, but
+		 *   it's all we've got for the moment
+		 */
+		block0 = page->index;
+		block0 <<= shift;
+
+		e3block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
+							block0);
+		_debug("%llx -> %llx", block0, e3block);
+
+		if (e3block) {
+			/* we have data - add it to the list to give to the
+			 * backing fs */
+			list_move(&page->lru, &e3pages);
+			(*nr_pages)--;
+			e3nrpages++;
+		}
+		else if (space && pagevec_add(&pagevec, page) == 0) {
+			cookie->def->mark_pages_cached(cookie->netfs_data,
+						       mapping, &pagevec);
+			pagevec_reinit(&pagevec);
+			ret = -ENODATA;
+		}
+	}
+
+	if (pagevec_count(&pagevec) > 0) {
+		cookie->def->mark_pages_cached(cookie->netfs_data,
+					       mapping, &pagevec);
+		pagevec_reinit(&pagevec);
+	}
+
+	if (list_empty(pages))
+		ret = 0;
+
+	/* submit the apparently valid pages to the backing fs to be read from disk */
+	if (e3nrpages > 0) {
+		ret2 = cachefiles_read_backing_file(object,
+						    callback_func,
+						    callback_data,
+						    mapping,
+						    &e3pages,
+						    &pagevec);
+
+		ASSERTCMP(pagevec_count(&pagevec), ==, 0);
+
+		if (ret2 == -ENOMEM || ret2 == -EINTR)
+			ret = ret2;
+	}
+
+	_leave(" = %d [nr=%u%s]",
+	       ret, *nr_pages, list_empty(pages) ? " empty" : "");
+	return ret;
+
+} /* end cachefiles_read_or_alloc_pages() */
+
+/*****************************************************************************/
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if out of memory
+ * - returns -ENOBUFS if no buffers can be made available
+ * - returns -ENOBUFS if page is beyond EOF
+ * - otherwise:
+ *   - the metadata will be retained
+ *   - 0 will be returned
+ */
+static int cachefiles_allocate_page(struct fscache_object *_object,
+				    struct page *page,
+				    unsigned long gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("%p,{%lx},,,", object, page->index);
+
+	return cachefiles_has_space(cache, 1);
+
+} /* end cachefiles_allocate_page() */
+
+/*****************************************************************************/
+/*
+ * page storer
+ */
+void cachefiles_write_work(void *_object)
+{
+	struct cachefiles_one_write *writer;
+	struct cachefiles_object *object = _object;
+	int ret, max;
+
+	_enter("%p", object);
+
+	ASSERT(!irqs_disabled());
+
+	spin_lock_irq(&object->work_lock);
+	max = 8;
+
+	while (!list_empty(&object->write_list)) {
+		writer = list_entry(object->write_list.next,
+				    struct cachefiles_one_write, obj_link);
+		list_del(&writer->obj_link);
+
+		spin_unlock_irq(&object->work_lock);
+
+		_debug("- store {%lu}", writer->netfs_page->index);
+
+		ret = generic_file_buffered_write_one_kernel_page(
+			object->backer,
+			writer->netfs_page->index,
+			writer->netfs_page);
+
+		if (ret == -ENOSPC) {
+			ret = -ENOBUFS;
+		}
+		else if (ret == -EIO) {
+			cachefiles_io_error_obj(object,
+						"write page to backing file"
+						" failed");
+			ret = -ENOBUFS;
+		}
+
+		_debug("- callback");
+		writer->callback_func(writer->netfs_page,
+				      writer->callback_data, ret);
+
+		_debug("- put net");
+		page_cache_release(writer->netfs_page);
+		kfree(writer);
+
+		/* let keventd have some air occasionally */
+		max--;
+		if (max < 0 || need_resched()) {
+			if (!list_empty(&object->write_list))
+				schedule_work(&object->write_work);
+			_leave(" [maxed out]");
+			return;
+		}
+
+		_debug("- next");
+		spin_lock_irq(&object->work_lock);
+	}
+
+	spin_unlock_irq(&object->work_lock);
+	_leave("");
+
+} /* end cachefiles_write_work() */
+
+/*****************************************************************************/
+/*
+ * request a page be stored in the cache
+ * - cache withdrawal is prevented by the caller
+ * - this request may be ignored if there's no cache block available, in which
+ *   case -ENOBUFS will be returned
+ * - if the op is in progress, 0 will be returned
+ */
+static int cachefiles_write_page(struct fscache_object *_object,
+				 struct page *page,
+				 fscache_rw_complete_t callback_func,
+				 void *callback_data,
+				 unsigned long gfp)
+{
+//	struct cachefiles_one_write *writer;
+	struct cachefiles_object *object;
+	int ret;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+	_enter("%p,%p{%lx},,,", object, page, page->index);
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	ASSERT(S_ISREG(object->backer->f_dentry->d_inode->i_mode));
+
+#if 0 // set to 1 for deferred writing
+	/* queue the operation for deferred processing by keventd */
+	writer = kzalloc(sizeof(*writer), GFP_KERNEL);
+	if (!writer)
+		return -ENOMEM;
+
+	page_cache_get(page);
+	writer->netfs_page = page;
+	writer->object = object;
+	writer->callback_func = callback_func;
+	writer->callback_data = callback_data;
+
+	spin_lock_irq(&object->work_lock);
+	list_add_tail(&writer->obj_link, &object->write_list);
+	spin_unlock_irq(&object->work_lock);
+
+	schedule_work(&object->write_work);
+	ret = 0;
+
+#else
+	/* copy the page to ext3 and let it store it in its own time */
+	ret = generic_file_buffered_write_one_kernel_page(object->backer,
+							  page->index,
+							  page);
+
+	if (ret != 0) {
+		if (ret == -EIO)
+			cachefiles_io_error_obj(object,
+						"write page to backing file"
+						" failed");
+		ret = -ENOBUFS;
+	}
+
+	callback_func(page, callback_data, ret);
+#endif
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_write_page() */
+
+/*****************************************************************************/
+/*
+ * detach a backing block from a page
+ * - cache withdrawal is prevented by the caller
+ */
+static void cachefiles_uncache_pages(struct fscache_object *_object,
+				     struct pagevec *pagevec)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("%p,{%lu,%lx},,,",
+	       object, pagevec->nr, pagevec->pages[0]->index);
+
+} /* end cachefiles_uncache_pages() */
+
+/*****************************************************************************/
+/*
+ * dissociate a cache from all the pages it was backing
+ */
+static void cachefiles_dissociate_pages(struct fscache_cache *cache)
+{
+	_enter("");
+
+} /* end cachefiles_dissociate_pages() */
+
+struct fscache_cache_ops cachefiles_cache_ops = {
+	.name			= "cachefiles",
+	.lookup_object		= cachefiles_lookup_object,
+	.grab_object		= cachefiles_grab_object,
+	.lock_object		= cachefiles_lock_object,
+	.unlock_object		= cachefiles_unlock_object,
+	.update_object		= cachefiles_update_object,
+	.put_object		= cachefiles_put_object,
+	.sync_cache		= cachefiles_sync_cache,
+	.set_i_size		= cachefiles_set_i_size,
+	.read_or_alloc_page	= cachefiles_read_or_alloc_page,
+	.read_or_alloc_pages	= cachefiles_read_or_alloc_pages,
+	.allocate_page		= cachefiles_allocate_page,
+	.write_page		= cachefiles_write_page,
+	.uncache_pages		= cachefiles_uncache_pages,
+	.dissociate_pages	= cachefiles_dissociate_pages,
+};
diff --git a/fs/cachefiles/cf-key.c b/fs/cachefiles/cf-key.c
new file mode 100644
index 0000000..86d7fe7
--- /dev/null
+++ b/fs/cachefiles/cf-key.c
@@ -0,0 +1,160 @@
+/* cf-key.c: Key to pathname encoder
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/slab.h>
+#include "internal.h"
+
+static const char cachefiles_charmap[64] =
+	"0123456789"			/* 0 - 9 */
+	"abcdefghijklmnopqrstuvwxyz"	/* 10 - 35 */
+	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"	/* 36 - 61 */
+	"_-"				/* 62 - 63 */
+	;
+
+static const char cachefiles_filecharmap[256] = {
+	/* we skip space and tab and control chars */
+	[ 33 ... 46 ] = 1,		/* '!' -> '.' */
+	/* we skip '/' as it's significant to pathwalk */
+	[ 48 ... 127 ] = 1,		/* '0' -> '~' */
+};
+
+/*****************************************************************************/
+/*
+ * turn the raw key into something cooked
+ * - the raw key should include the length in the two bytes at the front
+ * - the key may be up to 514 bytes in length (including the length word)
+ *   - "base64" encode the strange keys, mapping 3 bytes of raw to four of
+ *     cooked
+ *   - need to cut the cooked key into 252 char lengths (189 raw bytes)
+ */
+char *cachefiles_cook_key(const u8 *raw, int keylen, uint8_t type)
+{
+	unsigned char csum, ch;
+	unsigned int acc;
+	char *key;
+	int loop, len, max, seg, mark, print;
+
+	_enter(",%d", keylen);
+
+	BUG_ON(keylen < 2 || keylen > 514);
+
+	csum = raw[0] + raw[1];
+	print = 1;
+	for (loop = 2; loop < keylen; loop++) {
+		ch = raw[loop];
+		csum += ch;
+		print &= cachefiles_filecharmap[ch];
+	}
+
+	if (print) {
+		/* if the path is usable ASCII, then we render it directly */
+		max = keylen - 2;
+		max += 2;	/* two base64'd length chars on the front */
+		max += 5;	/* @checksum/M */
+		max += 3 * 2;	/* maximum number of segment dividers (".../M")
+				 * is ((514 + 251) / 252) = 3
+				 */
+		max += 1;	/* NUL on end */
+	}
+	else {
+		/* calculate the maximum length of the cooked key */
+		keylen = (keylen + 2) / 3;
+
+		max = keylen * 4;
+		max += 5;	/* @checksum/M */
+		max += 3 * 2;	/* maximum number of segment dividers (".../M")
+				 * is ((514 + 188) / 189) = 3
+				 */
+		max += 1;	/* NUL on end */
+	}
+
+	_debug("max: %d", max);
+
+	key = kmalloc(max, GFP_KERNEL);
+	if (!key)
+		return NULL;
+
+	len = 0;
+
+	/* build the cooked key */
+	sprintf(key, "@%02x/+", (unsigned) csum);
+	len = 5;
+	mark = len - 1;
+
+	if (print) {
+		acc = *(uint16_t *) raw;
+		raw += 2;
+
+		key[len + 1] = cachefiles_charmap[acc & 63];
+		acc >>= 6;
+		key[len] = cachefiles_charmap[acc & 63];
+		len += 2;
+
+		seg = 250;
+		for (loop = keylen; loop > 0; loop--) {
+			if (seg <= 0) {
+				key[len++] = '/';
+				mark = len;
+				key[len++] = '+';
+				seg = 252;
+			}
+
+			key[len++] = *raw++;
+			ASSERT(len < max);
+		}
+
+		switch (type) {
+		case FSCACHE_COOKIE_TYPE_INDEX:		type = 'I';	break;
+		case FSCACHE_COOKIE_TYPE_DATAFILE:	type = 'D';	break;
+		default:				type = 'S';	break;
+		}
+	}
+	else {
+		seg = 252;
+		for (loop = keylen; loop > 0; loop--) {
+			if (seg <= 0) {
+				key[len++] = '/';
+				mark = len;
+				key[len++] = '+';
+				seg = 252;
+			}
+
+			acc = *raw++;
+			acc |= *raw++ << 8;
+			acc |= *raw++ << 16;
+
+			_debug("acc: %06x", acc);
+
+			key[len++] = cachefiles_charmap[acc & 63];
+			acc >>= 6;
+			key[len++] = cachefiles_charmap[acc & 63];
+			acc >>= 6;
+			key[len++] = cachefiles_charmap[acc & 63];
+			acc >>= 6;
+			key[len++] = cachefiles_charmap[acc & 63];
+
+			ASSERT(len < max);
+		}
+
+		switch (type) {
+		case FSCACHE_COOKIE_TYPE_INDEX:		type = 'J';	break;
+		case FSCACHE_COOKIE_TYPE_DATAFILE:	type = 'E';	break;
+		default:				type = 'T';	break;
+		}
+	}
+
+	key[mark] = type;
+	key[len] = 0;
+
+	_leave(" = %p %d:[%s]", key, len, key);
+	return key;
+
+} /* end cachefiles_cook_key() */
diff --git a/fs/cachefiles/cf-main.c b/fs/cachefiles/cf-main.c
new file mode 100644
index 0000000..81e9e9b
--- /dev/null
+++ b/fs/cachefiles/cf-main.c
@@ -0,0 +1,167 @@
+/* cf-main.c: network filesystem caching backend to use cache files on a
+ *            premounted filesystem
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/proc_fs.h>
+#include "internal.h"
+
+unsigned long cachefiles_debug = 0;
+
+static int cachefiles_init(void);
+static void cachefiles_exit(void);
+
+fs_initcall(cachefiles_init);
+module_exit(cachefiles_exit);
+
+MODULE_DESCRIPTION("Mounted-filesystem based cache");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+kmem_cache_t *cachefiles_object_jar;
+
+static void cachefiles_object_init_once(void *_object, kmem_cache_t *cachep,
+					unsigned long flags)
+{
+	struct cachefiles_object *object = _object;
+
+	switch (flags & (SLAB_CTOR_VERIFY | SLAB_CTOR_CONSTRUCTOR)) {
+	case SLAB_CTOR_CONSTRUCTOR:
+		memset(object, 0, sizeof(*object));
+		fscache_object_init(&object->fscache);
+		init_rwsem(&object->sem);
+		spin_lock_init(&object->work_lock);
+		INIT_LIST_HEAD(&object->read_list);
+		INIT_LIST_HEAD(&object->read_pend_list);
+		INIT_WORK(&object->read_work, &cachefiles_read_copier_work,
+			  object);
+		INIT_LIST_HEAD(&object->write_list);
+		INIT_WORK(&object->write_work, &cachefiles_write_work, object);
+		break;
+
+	default:
+		break;
+	}
+}
+
+/*****************************************************************************/
+/*
+ * initialise the fs caching module
+ */
+static int cachefiles_init(void)
+{
+	struct proc_dir_entry *pde;
+	int ret;
+
+	/* create a proc entry to use as a handle for the userspace daemon */
+	ret = -ENOMEM;
+
+	pde = create_proc_entry("cachefiles", 0600, proc_root_fs);
+	if (!pde) {
+		kerror("unable to create /proc/fs/cachefiles");
+		goto error_proc;
+	}
+
+	pde->owner = THIS_MODULE;
+	pde->proc_fops = &cachefiles_proc_fops;
+	cachefiles_proc = pde;
+
+	/* create an object jar */
+	cachefiles_object_jar =
+		kmem_cache_create("cachefiles_object_jar",
+				  sizeof(struct cachefiles_object),
+				  0,
+				  SLAB_HWCACHE_ALIGN,
+				  cachefiles_object_init_once,
+				  NULL);
+	if (!cachefiles_object_jar) {
+		printk(KERN_NOTICE
+		       "CacheFiles: Failed to allocate an object jar\n");
+		goto error_object_jar;
+	}
+
+	printk(KERN_INFO "CacheFiles: Loaded\n");
+	return 0;
+
+error_object_jar:
+	remove_proc_entry("cachefiles", proc_root_fs);
+error_proc:
+	kerror("failed to register: %d", ret);
+	return ret;
+
+} /* end cachefiles_init() */
+
+/*****************************************************************************/
+/*
+ * clean up on module removal
+ */
+static void __exit cachefiles_exit(void)
+{
+	printk(KERN_INFO "CacheFiles: Unloading\n");
+
+	kmem_cache_destroy(cachefiles_object_jar);
+	remove_proc_entry("cachefiles", proc_root_fs);
+
+} /* end cachefiles_exit() */
+
+/*****************************************************************************/
+/*
+ * clear the dead space between task_struct and kernel stack
+ * - called by supplying -finstrument-functions to gcc
+ */
+#if 0
+void __cyg_profile_func_enter (void *this_fn, void *call_site)
+__attribute__((no_instrument_function));
+
+void __cyg_profile_func_enter (void *this_fn, void *call_site)
+{
+       asm volatile("  movl    %%esp,%%edi     \n"
+                    "  andl    %0,%%edi        \n"
+                    "  addl    %1,%%edi        \n"
+                    "  movl    %%esp,%%ecx     \n"
+                    "  subl    %%edi,%%ecx     \n"
+                    "  shrl    $2,%%ecx        \n"
+                    "  movl    $0xedededed,%%eax     \n"
+                    "  rep stosl               \n"
+                    :
+                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
+                    : "eax", "ecx", "edi", "memory", "cc"
+                    );
+}
+
+void __cyg_profile_func_exit(void *this_fn, void *call_site)
+__attribute__((no_instrument_function));
+
+void __cyg_profile_func_exit(void *this_fn, void *call_site)
+{
+       asm volatile("  movl    %%esp,%%edi     \n"
+                    "  andl    %0,%%edi        \n"
+                    "  addl    %1,%%edi        \n"
+                    "  movl    %%esp,%%ecx     \n"
+                    "  subl    %%edi,%%ecx     \n"
+                    "  shrl    $2,%%ecx        \n"
+                    "  movl    $0xdadadada,%%eax     \n"
+                    "  rep stosl               \n"
+                    :
+                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
+                    : "eax", "ecx", "edi", "memory", "cc"
+                    );
+}
+#endif
diff --git a/fs/cachefiles/cf-namei.c b/fs/cachefiles/cf-namei.c
new file mode 100644
index 0000000..45e2359
--- /dev/null
+++ b/fs/cachefiles/cf-namei.c
@@ -0,0 +1,837 @@
+/* cf-namei.c: CacheFiles path walking and related routines
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include "internal.h"
+
+/*****************************************************************************/
+/*
+ * record the fact that an object is now active
+ */
+static void cachefiles_mark_object_active(struct cachefiles_cache *cache,
+					  struct cachefiles_object *object)
+{
+	struct cachefiles_object *xobject;
+	struct rb_node **_p, *_parent = NULL;
+	struct dentry *dentry;
+
+	write_lock(&cache->active_lock);
+
+	dentry = object->dentry;
+	_p = &cache->active_nodes.rb_node;
+	while (*_p) {
+		_parent = *_p;
+		xobject = rb_entry(_parent,
+				   struct cachefiles_object, active_node);
+
+		if (xobject->dentry > dentry)
+			_p = &(*_p)->rb_left;
+		else if (xobject->dentry < dentry)
+			_p = &(*_p)->rb_right;
+		else
+			BUG(); /* uh oh... this dentry shouldn't be here */
+	}
+
+	rb_link_node(&object->active_node, _parent, _p);
+	rb_insert_color(&object->active_node, &cache->active_nodes);
+
+	write_unlock(&cache->active_lock);
+
+} /* end cachefiles_mark_object_active() */
+
+/*****************************************************************************/
+/*
+ * delete an object representation from the cache
+ * - file backed objects are unlinked
+ * - directory backed objects are stuffed into the graveyard for userspace to
+ *   delete
+ * - unlocks the directory mutex
+ */
+static int cachefiles_bury_object(struct cachefiles_cache *cache,
+				  struct dentry *dir,
+				  struct dentry *rep)
+{
+	struct dentry *grave, *alt, *trap;
+	struct qstr name;
+	const char *old_name;
+	char nbuffer[8 + 8 + 1];
+	int ret;
+
+	_enter(",'%*.*s','%*.*s'",
+	       dir->d_name.len, dir->d_name.len, dir->d_name.name,
+	       rep->d_name.len, rep->d_name.len, rep->d_name.name);
+
+	/* non-directories can just be unlinked */
+	if (!S_ISDIR(rep->d_inode->i_mode)) {
+		_debug("unlink stale object");
+		ret = dir->d_inode->i_op->unlink(dir->d_inode, rep);
+
+		mutex_unlock(&dir->d_inode->i_mutex);
+
+		if (ret == 0) {
+			_debug("d_delete");
+			d_delete(rep);
+		}
+		else if (ret == -EIO) {
+			cachefiles_io_error(cache, "Unlink failed");
+		}
+
+		_leave(" = %d", ret);
+		return ret;
+	}
+
+	/* directories have to be moved to the graveyard */
+	_debug("move stale object to graveyard");
+	mutex_unlock(&dir->d_inode->i_mutex);
+
+try_again:
+	/* first step is to make up a grave dentry in the graveyard */
+	sprintf(nbuffer, "%08x%08x",
+		(uint32_t) xtime.tv_sec,
+		(uint32_t) atomic_inc_return(&cache->gravecounter));
+
+	name.name = nbuffer;
+	name.len = strlen(name.name);
+
+	/* hash the name */
+	name.hash = full_name_hash(name.name, name.len);
+
+	if (dir->d_op && dir->d_op->d_hash) {
+		ret = dir->d_op->d_hash(dir, &name);
+		if (ret < 0) {
+			if (ret == -EIO)
+				cachefiles_io_error(cache, "Hash failed");
+
+			_leave(" = %d", ret);
+			return ret;
+		}
+	}
+
+	/* do the multiway lock magic */
+	trap = lock_rename(cache->graveyard, dir);
+
+	/* do some checks before getting the grave dentry */
+	if (rep->d_parent != dir) {
+		/* the entry was probably culled when we dropped the parent dir
+		 * lock */
+		unlock_rename(cache->graveyard, dir);
+		_leave(" = 0 [culled?]");
+		return 0;
+	}
+
+	if (!S_ISDIR(cache->graveyard->d_inode->i_mode)) {
+		unlock_rename(cache->graveyard, dir);
+		cachefiles_io_error(cache, "Graveyard no longer a directory");
+		return -EIO;
+	}
+
+	if (trap == rep) {
+		unlock_rename(cache->graveyard, dir);
+		cachefiles_io_error(cache, "May not make directory loop");
+		return -EIO;
+	}
+
+	if (d_mountpoint(rep)) {
+		unlock_rename(cache->graveyard, dir);
+		cachefiles_io_error(cache, "Mountpoint in cache");
+		return -EIO;
+	}
+
+	/* see if there's a dentry already there for this name */
+	grave = d_lookup(cache->graveyard, &name);
+	if (!grave) {
+		_debug("not found");
+
+		grave = d_alloc(cache->graveyard, &name);
+		if (!grave) {
+			unlock_rename(cache->graveyard, dir);
+			_leave(" = -ENOMEM");
+			return -ENOMEM;
+		}
+
+		alt = cache->graveyard->d_inode->i_op->lookup(
+			cache->graveyard->d_inode, grave, NULL);
+		if (IS_ERR(alt)) {
+			unlock_rename(cache->graveyard, dir);
+			dput(grave);
+
+			if (PTR_ERR(alt) == -ENOMEM) {
+				_leave(" = -ENOMEM");
+				return -ENOMEM;
+			}
+
+			cachefiles_io_error(cache, "Lookup error %ld",
+					    PTR_ERR(alt));
+			return -EIO;
+		}
+
+		if (alt) {
+			dput(grave);
+			grave = alt;
+		}
+	}
+
+	if (grave->d_inode) {
+		unlock_rename(cache->graveyard, dir);
+		dput(grave);
+		grave = NULL;
+		cond_resched();
+		goto try_again;
+	}
+
+	if (d_mountpoint(grave)) {
+		unlock_rename(cache->graveyard, dir);
+		dput(grave);
+		cachefiles_io_error(cache, "Mountpoint in graveyard");
+		return -EIO;
+	}
+
+	/* target should not be an ancestor of source */
+	if (trap == grave) {
+		unlock_rename(cache->graveyard, dir);
+		dput(grave);
+		cachefiles_io_error(cache, "May not make directory loop");
+		return -EIO;
+	}
+
+	/* attempt the rename */
+	DQUOT_INIT(dir->d_inode);
+	DQUOT_INIT(cache->graveyard->d_inode);
+
+	old_name = fsnotify_oldname_init(rep->d_name.name);
+
+	ret = dir->d_inode->i_op->rename(dir->d_inode, rep,
+					 cache->graveyard->d_inode, grave);
+
+	if (ret == 0) {
+		d_move(rep, grave);
+		fsnotify_move(dir->d_inode, cache->graveyard->d_inode,
+			      old_name, rep->d_name.name, 1,
+			      grave->d_inode, rep->d_inode);
+	}
+	else if (ret != -ENOMEM) {
+		cachefiles_io_error(cache, "Rename failed with error %d", ret);
+	}
+
+	fsnotify_oldname_free(old_name);
+
+	unlock_rename(cache->graveyard, dir);
+	dput(grave);
+	_leave(" = 0");
+	return 0;
+
+} /* end cachefiles_bury_object() */
+
+/*****************************************************************************/
+/*
+ * delete an object representation from the cache
+ */
+int cachefiles_delete_object(struct cachefiles_cache *cache,
+			     struct cachefiles_object *object)
+{
+	struct dentry *dir;
+	int ret;
+
+	_enter(",{%p}", object->dentry);
+
+	ASSERT(object->dentry);
+	ASSERT(object->dentry->d_inode);
+	ASSERT(object->dentry->d_parent);
+
+	dir = dget_parent(object->dentry);
+
+	mutex_lock(&dir->d_inode->i_mutex);
+	ret = cachefiles_bury_object(cache, dir, object->dentry);
+
+	dput(dir);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_delete_object() */
+
+/*****************************************************************************/
+/*
+ * walk from the parent object to the child object through the backing
+ * filesystem, creating directories as we go
+ */
+int cachefiles_walk_to_object(struct cachefiles_object *parent,
+			      struct cachefiles_object *object,
+			      char *key,
+			      struct cachefiles_xattr *auxdata)
+{
+	struct cachefiles_cache *cache;
+	struct dentry *dir, *next = NULL, *new;
+	struct file *file;
+	struct qstr name;
+	uid_t fsuid;
+	gid_t fsgid;
+	int ret;
+
+	_enter("{%p}", parent->dentry);
+
+	cache = container_of(parent->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	ASSERT(parent->dentry);
+	ASSERT(parent->dentry->d_inode);
+
+	if (!(S_ISDIR(parent->dentry->d_inode->i_mode))) {
+		// TODO: convert file to dir
+		_leave("looking up in none directory");
+		return -ENOBUFS;
+	}
+
+	fsuid = current->fsuid;
+	fsgid = current->fsgid;
+	current->fsuid = 0;
+	current->fsgid = 0;
+
+	dir = dget(parent->dentry);
+
+advance:
+	/* attempt to transit the first directory component */
+	name.name = key;
+	key = strchr(key, '/');
+	if (key) {
+		name.len = key - (char *) name.name;
+		*key++ = 0;
+	}
+	else {
+		name.len = strlen(name.name);
+	}
+
+	/* hash the name */
+	name.hash = full_name_hash(name.name, name.len);
+
+	if (dir->d_op && dir->d_op->d_hash) {
+		ret = dir->d_op->d_hash(dir, &name);
+		if (ret < 0) {
+			cachefiles_io_error(cache, "Hash failed");
+			goto error_out2;
+		}
+	}
+
+lookup_again:
+	/* search the current directory for the element name */
+	_debug("lookup '%s' %x", name.name, name.hash);
+
+	mutex_lock(&dir->d_inode->i_mutex);
+
+	next = d_lookup(dir, &name);
+	if (!next) {
+		_debug("not found");
+
+		new = d_alloc(dir, &name);
+		if (!new)
+			goto nomem_d_alloc;
+
+		ASSERT(dir->d_inode->i_op);
+		ASSERT(dir->d_inode->i_op->lookup);
+
+		next = dir->d_inode->i_op->lookup(dir->d_inode, new, NULL);
+		if (IS_ERR(next))
+			goto lookup_error;
+
+		if (!next)
+			next = new;
+		else
+			dput(new);
+
+		if (next->d_inode) {
+			ret = -EPERM;
+			if (!next->d_inode->i_op ||
+			    !next->d_inode->i_op->setxattr ||
+			    !next->d_inode->i_op->getxattr ||
+			    !next->d_inode->i_op->removexattr)
+				goto error;
+
+			if (key && (!next->d_inode->i_op->lookup ||
+				    !next->d_inode->i_op->mkdir ||
+				    !next->d_inode->i_op->create ||
+				    !next->d_inode->i_op->rename ||
+				    !next->d_inode->i_op->rmdir ||
+				    !next->d_inode->i_op->unlink))
+				goto error;
+		}
+	}
+
+	_debug("next -> %p %s", next, next->d_inode ? "positive" : "negative");
+
+	if (!key)
+		object->new = !next->d_inode;
+
+	/* we need to create the object if it's negative */
+	if (key || object->type == FSCACHE_COOKIE_TYPE_INDEX) {
+		/* index objects and intervening tree levels must be subdirs */
+		if (!next->d_inode) {
+			DQUOT_INIT(dir->d_inode);
+			ret = dir->d_inode->i_op->mkdir(dir->d_inode, next, 0);
+			if (ret < 0)
+				goto create_error;
+
+			ASSERT(next->d_inode);
+
+			fsnotify_mkdir(dir->d_inode, next);
+
+			_debug("mkdir -> %p{%p{ino=%lu}}",
+			       next, next->d_inode, next->d_inode->i_ino);
+		}
+		/* we need to make sure a positive match found a directory */
+		else if (!S_ISDIR(next->d_inode->i_mode)) {
+			kerror("inode %lu is not a directory",
+			       next->d_inode->i_ino);
+			ret = -ENOBUFS;
+			goto error;
+		}
+	}
+	else {
+		/* non-index objects start out life as files */
+		if (!next->d_inode) {
+			DQUOT_INIT(dir->d_inode);
+			ret = dir->d_inode->i_op->create(dir->d_inode, next,
+							 S_IFREG, NULL);
+			if (ret < 0)
+				goto create_error;
+
+			ASSERT(next->d_inode);
+
+			fsnotify_create(dir->d_inode, next);
+
+			_debug("create -> %p{%p{ino=%lu}}",
+			       next, next->d_inode, next->d_inode->i_ino);
+		}
+		/* we need to make sure a positive match found a directory or a
+		 * file */
+		else if (!S_ISDIR(next->d_inode->i_mode) &&
+			 !S_ISREG(next->d_inode->i_mode)
+			 ) {
+			kerror("inode %lu is not a file or directory",
+			       next->d_inode->i_ino);
+			ret = -ENOBUFS;
+			goto error;
+		}
+	}
+
+	/* process the next component */
+	if (key) {
+		_debug("advance");
+		mutex_unlock(&dir->d_inode->i_mutex);
+		dput(dir);
+		dir = next;
+		next = NULL;
+		goto advance;
+	}
+
+	/* we've found the object we were looking for */
+	object->dentry = next;
+
+	/* if we've found that the terminal object exists, then we need to
+	 * check its attributes and delete it if it's out of date */
+	if (!object->new) {
+		_debug("validate '%*.*s'",
+		       next->d_name.len, next->d_name.len, next->d_name.name);
+
+		ret = cachefiles_check_object_xattr(object, auxdata);
+		if (ret == -ESTALE) {
+			/* delete the object (the deleter drops the directory
+			 * mutex) */
+			object->dentry = NULL;
+
+			ret = cachefiles_bury_object(cache, dir, next);
+			dput(next);
+			next = NULL;
+
+			if (ret < 0)
+				goto delete_error;
+
+			_debug("redo lookup");
+			goto lookup_again;
+		}
+	}
+
+	/* note that we're now using this object */
+	cachefiles_mark_object_active(cache, object);
+
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(dir);
+	dir = NULL;
+
+	if (object->new) {
+		/* attach data to a newly constructed terminal object */
+		ret = cachefiles_set_object_xattr(object, auxdata);
+		if (ret < 0)
+			goto check_error;
+	}
+	else {
+		/* always update the atime on an object we've just looked up
+		 * (this is used to keep track of culling, and atimes are only
+		 * updated by read, write and readdir but not lookup or
+		 * open) */
+		touch_atime(cache->mnt, next);
+	}
+
+	/* open a file interface onto a data file */
+	if (object->type != FSCACHE_COOKIE_TYPE_INDEX) {
+		if (S_ISREG(object->dentry->d_inode->i_mode)) {
+			file = dentry_open_kernel(dget(object->dentry),
+						  mntget(cache->mnt), 0);
+			if (IS_ERR(file))
+				goto open_error;
+
+			object->backer = file;
+		}
+		else {
+			BUG(); // TODO: open file in data-class subdir
+		}
+	}
+
+	current->fsuid = fsuid;
+	current->fsgid = fsgid;
+	object->new = 0;
+
+	_leave(" = 0 [%lu]", object->dentry->d_inode->i_ino);
+	return 0;
+
+create_error:
+	if (ret == -EIO)
+		cachefiles_io_error(cache, "create/mkdir failed");
+	goto error;
+
+open_error:
+	ret = PTR_ERR(file);
+check_error:
+	write_lock(&cache->active_lock);
+	rb_erase(&object->active_node, &cache->active_nodes);
+	write_unlock(&cache->active_lock);
+
+	dput(object->dentry);
+	object->dentry = NULL;
+	goto error_out;
+
+delete_error:
+	_debug("delete error %d", ret);
+	goto error_out2;
+
+lookup_error:
+	_debug("lookup error %ld", PTR_ERR(next));
+	dput(new);
+	ret = PTR_ERR(next);
+	if (ret == -EIO)
+		cachefiles_io_error(cache, "Lookup failed");
+	next = NULL;
+	goto error;
+
+nomem_d_alloc:
+	ret = -ENOMEM;
+error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(next);
+error_out2:
+	dput(dir);
+error_out:
+	current->fsuid = fsuid;
+	current->fsgid = fsgid;
+
+	_leave(" = ret");
+	return ret;
+
+} /* end cachefiles_walk_to_object() */
+
+/*****************************************************************************/
+/*
+ * get a subdirectory
+ */
+struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+					struct dentry *dir,
+					const char *dirname)
+{
+	struct dentry *subdir, *new;
+	struct qstr name;
+	uid_t fsuid;
+	gid_t fsgid;
+	int ret;
+
+	_enter("");
+
+	/* set up the name */
+	name.name = dirname;
+	name.len = strlen(dirname);
+	name.hash = full_name_hash(name.name, name.len);
+
+	if (dir->d_op && dir->d_op->d_hash) {
+		ret = dir->d_op->d_hash(dir, &name);
+		if (ret < 0) {
+			if (ret == -EIO)
+				kerror("Hash failed");
+			_leave(" = %d", ret);
+			return ERR_PTR(ret);
+		}
+	}
+
+	/* search the current directory for the element name */
+	_debug("lookup '%s' %x", name.name, name.hash);
+
+	fsuid = current->fsuid;
+	fsgid = current->fsgid;
+	current->fsuid = 0;
+	current->fsgid = 0;
+
+	mutex_lock(&dir->d_inode->i_mutex);
+
+	subdir = d_lookup(dir, &name);
+	if (!subdir) {
+		_debug("not found");
+
+		new = d_alloc(dir, &name);
+		if (!new)
+			goto nomem_d_alloc;
+
+		subdir = dir->d_inode->i_op->lookup(dir->d_inode, new, NULL);
+		if (IS_ERR(subdir))
+			goto lookup_error;
+
+		if (!subdir)
+			subdir = new;
+		else
+			dput(new);
+	}
+
+	_debug("subdir -> %p %s",
+	       subdir, subdir->d_inode ? "positive" : "negative");
+
+	/* we need to create the subdir if it doesn't exist yet */
+	if (!subdir->d_inode) {
+		DQUOT_INIT(dir->d_inode);
+		ret = dir->d_inode->i_op->mkdir(dir->d_inode, subdir, 0700);
+		if (ret < 0)
+			goto mkdir_error;
+
+		ASSERT(subdir->d_inode);
+
+		fsnotify_mkdir(dir->d_inode, subdir);
+
+		_debug("mkdir -> %p{%p{ino=%lu}}",
+		       subdir,
+		       subdir->d_inode,
+		       subdir->d_inode->i_ino);
+	}
+
+	mutex_unlock(&dir->d_inode->i_mutex);
+
+	current->fsuid = fsuid;
+	current->fsgid = fsgid;
+
+	/* we need to make sure the subdir is a directory */
+	ASSERT(subdir->d_inode);
+
+	if (!S_ISDIR(subdir->d_inode->i_mode)) {
+		kerror("%s is not a directory", dirname);
+		ret = -EIO;
+		goto check_error;
+	}
+
+	ret = -EPERM;
+	if (!subdir->d_inode->i_op ||
+	    !subdir->d_inode->i_op->setxattr ||
+	    !subdir->d_inode->i_op->getxattr ||
+	    !subdir->d_inode->i_op->lookup ||
+	    !subdir->d_inode->i_op->mkdir ||
+	    !subdir->d_inode->i_op->create ||
+	    !subdir->d_inode->i_op->rename ||
+	    !subdir->d_inode->i_op->rmdir ||
+	    !subdir->d_inode->i_op->unlink)
+		goto check_error;
+
+	_leave(" = [%lu]", subdir->d_inode->i_ino);
+	return subdir;
+
+check_error:
+	dput(subdir);
+	_leave(" = %d [check]", ret);
+	return ERR_PTR(ret);
+
+mkdir_error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	kerror("mkdir %s failed with error %d", dirname, ret);
+	goto error_out;
+
+lookup_error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(new);
+	ret = PTR_ERR(subdir);
+	kerror("Lookup %s failed with error %d", dirname, ret);
+	goto error_out;
+
+nomem_d_alloc:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	ret = -ENOMEM;
+	goto error_out;
+
+error_out:
+	current->fsuid = fsuid;
+	current->fsgid = fsgid;
+	_leave(" = %d", ret);
+	return ERR_PTR(ret);
+
+} /* end cachefiles_get_directory() */
+
+/*****************************************************************************/
+/*
+ * cull an object if it's not in use
+ * - called only by cache manager daemon
+ */
+int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
+		    char *filename)
+{
+	struct cachefiles_object *object;
+	struct rb_node *_n;
+	struct dentry *victim, *new;
+	struct qstr name;
+	int ret;
+
+	_enter(",%*.*s/,%s",
+	       dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+	/* set up the name */
+	name.name = filename;
+	name.len = strlen(filename);
+	name.hash = full_name_hash(name.name, name.len);
+
+	if (dir->d_op && dir->d_op->d_hash) {
+		ret = dir->d_op->d_hash(dir, &name);
+		if (ret < 0) {
+			if (ret == -EIO)
+				cachefiles_io_error(cache, "Hash failed");
+			_leave(" = %d", ret);
+			return ret;
+		}
+	}
+
+	/* look up the victim */
+	mutex_lock(&dir->d_inode->i_mutex);
+
+	victim = d_lookup(dir, &name);
+	if (!victim) {
+		_debug("not found");
+
+		new = d_alloc(dir, &name);
+		if (!new)
+			goto nomem_d_alloc;
+
+		victim = dir->d_inode->i_op->lookup(dir->d_inode, new, NULL);
+		if (IS_ERR(victim))
+			goto lookup_error;
+
+		if (!victim)
+			victim = new;
+		else
+			dput(new);
+	}
+
+	_debug("victim -> %p %s", victim, victim->d_inode ? "positive" : "negative");
+
+	/* if the object is no longer there then we probably retired the object
+	 * at the netfs's request whilst the cull was in progress
+	 */
+	if (!victim->d_inode) {
+		mutex_unlock(&dir->d_inode->i_mutex);
+		dput(victim);
+		_leave(" = -ENOENT [absent]");
+		return -ENOENT;
+	}
+
+	/* check to see if we're using this object */
+	read_lock(&cache->active_lock);
+
+	_n = cache->active_nodes.rb_node;
+
+	while (_n) {
+		object = rb_entry(_n, struct cachefiles_object, active_node);
+
+		if (object->dentry > victim)
+			_n = _n->rb_left;
+		else if (object->dentry < victim)
+			_n = _n->rb_right;
+		else
+			goto object_in_use;
+	}
+
+	read_unlock(&cache->active_lock);
+
+	/* okay... the victim is not being used so we can cull it
+	 * - start by marking it as stale
+	 */
+	_debug("victim is cullable");
+
+	ret = cachefiles_remove_object_xattr(cache, victim);
+	if (ret < 0)
+		goto error_unlock;
+
+	/*  actually remove the victim (drops the dir mutex) */
+	_debug("bury");
+
+	ret = cachefiles_bury_object(cache, dir, victim);
+	if (ret < 0)
+		goto error;
+
+	dput(victim);
+	_leave(" = 0");
+	return 0;
+
+
+object_in_use:
+	read_unlock(&cache->active_lock);
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(victim);
+	_leave(" = -EBUSY [in use]");
+	return -EBUSY;
+
+nomem_d_alloc:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	_leave(" = -ENOMEM");
+	return -ENOMEM;
+
+lookup_error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(new);
+	ret = PTR_ERR(victim);
+	if (ret == -EIO)
+		cachefiles_io_error(cache, "Lookup failed");
+	goto choose_error;
+
+error_unlock:
+	mutex_unlock(&dir->d_inode->i_mutex);
+error:
+	dput(victim);
+choose_error:
+	if (ret == -ENOENT) {
+		/* file or dir now absent - probably retired by netfs */
+		_leave(" = -ESTALE [absent]");
+		return -ESTALE;
+	}
+
+	if (ret != -ENOMEM) {
+		kerror("Internal error: %d", ret);
+		ret = -EIO;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_cull() */
diff --git a/fs/cachefiles/cf-proc.c b/fs/cachefiles/cf-proc.c
new file mode 100644
index 0000000..fce1385
--- /dev/null
+++ b/fs/cachefiles/cf-proc.c
@@ -0,0 +1,510 @@
+/* cf-proc.c: /proc/fs/cachefiles interface
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/namespace.h>
+#include <linux/statfs.h>
+#include <linux/proc_fs.h>
+#include <linux/ctype.h>
+#include "internal.h"
+
+static int cachefiles_proc_open(struct inode *, struct file *);
+static int cachefiles_proc_release(struct inode *, struct file *);
+static ssize_t cachefiles_proc_read(struct file *, char __user *, size_t, loff_t *);
+static ssize_t cachefiles_proc_write(struct file *, const char __user *, size_t, loff_t *);
+static int cachefiles_proc_brun(struct cachefiles_cache *cache, char *args);
+static int cachefiles_proc_bcull(struct cachefiles_cache *cache, char *args);
+static int cachefiles_proc_bstop(struct cachefiles_cache *cache, char *args);
+static int cachefiles_proc_cull(struct cachefiles_cache *cache, char *args);
+static int cachefiles_proc_debug(struct cachefiles_cache *cache, char *args);
+static int cachefiles_proc_dir(struct cachefiles_cache *cache, char *args);
+static int cachefiles_proc_tag(struct cachefiles_cache *cache, char *args);
+
+struct proc_dir_entry *cachefiles_proc;
+
+static unsigned long cachefiles_open;
+
+struct file_operations cachefiles_proc_fops = {
+	.open		= cachefiles_proc_open,
+	.release	= cachefiles_proc_release,
+	.read		= cachefiles_proc_read,
+	.write		= cachefiles_proc_write,
+};
+
+struct cachefiles_proc_cmd {
+	char name[8];
+	int (*handler)(struct cachefiles_cache *cache, char *args);
+};
+
+static const struct cachefiles_proc_cmd cachefiles_proc_cmds[] = {
+	{ "bind",	cachefiles_proc_bind	},
+	{ "brun",	cachefiles_proc_brun	},
+	{ "bcull",	cachefiles_proc_bcull	},
+	{ "bstop",	cachefiles_proc_bstop	},
+	{ "cull",	cachefiles_proc_cull	},
+	{ "debug",	cachefiles_proc_debug	},
+	{ "dir",	cachefiles_proc_dir	},
+	{ "tag",	cachefiles_proc_tag	},
+	{ "",		NULL			}
+};
+
+
+/*****************************************************************************/
+/*
+ * do various checks
+ */
+static int cachefiles_proc_open(struct inode *inode, struct file *file)
+{
+	struct cachefiles_cache *cache;
+
+	_enter("");
+
+	/* only the superuser may do this */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* /proc/fs/cachefiles may only be open once at a time */
+	if (xchg(&cachefiles_open, 1) == 1)
+		return -EBUSY;
+
+	/* allocate a cache record */
+	cache = kzalloc(sizeof(struct cachefiles_cache), GFP_KERNEL);
+	if (!cache) {
+		cachefiles_open = 0;
+		return -ENOMEM;
+	}
+
+	cache->active_nodes = RB_ROOT;
+	rwlock_init(&cache->active_lock);
+
+	/* set default caching limits
+	 * - stop allocating at 1% free space
+	 * - start culling below 5% free space
+	 * - cease culling above 7% free space
+	 */
+	cache->brun_percent = 7;
+	cache->bcull_percent = 5;
+	cache->bstop_percent = 1;
+
+	file->private_data = cache;
+	cache->cachefilesd = file;
+	return 0;
+
+} /* end cachefiles_proc_open() */
+
+/*****************************************************************************/
+/*
+ * release a cache
+ */
+static int cachefiles_proc_release(struct inode *inode, struct file *file)
+{
+	struct cachefiles_cache *cache = file->private_data;
+
+	_enter("");
+
+	ASSERT(cache);
+
+	set_bit(CACHEFILES_DEAD, &cache->flags);
+
+	cachefiles_proc_unbind(cache);
+
+	ASSERT(!cache->active_nodes.rb_node);
+
+	/* clean up the control file interface */
+	cache->cachefilesd = NULL;
+	file->private_data = NULL;
+	cachefiles_open = 0;
+
+	kfree(cache);
+
+	_leave("");
+	return 0;
+
+} /* end cachefiles_proc_release() */
+
+/*****************************************************************************/
+/*
+ * read the cache state
+ */
+static ssize_t cachefiles_proc_read(struct file *file, char __user *_buffer,
+				    size_t buflen, loff_t *pos)
+{
+	struct cachefiles_cache *cache = file->private_data;
+	char buffer[256];
+	int n;
+
+	_enter(",,%zu,", buflen);
+
+	if (!test_bit(CACHEFILES_READY, &cache->flags))
+		return 0;
+
+	/* check how much space the cache has */
+	cachefiles_has_space(cache, 0);
+
+	/* summarise */
+	n = snprintf(buffer, sizeof(buffer),
+		     "cull=%c"
+		     " brun=%llx"
+		     " bcull=%llx"
+		     " bstop=%llx",
+		     test_bit(CACHEFILES_CULLING, &cache->flags) ? '1' : '0',
+		     cache->brun,
+		     cache->bcull,
+		     cache->bstop
+		     );
+
+	if (n > buflen)
+		return -EMSGSIZE;
+
+	if (copy_to_user(_buffer, buffer, n) != 0)
+		return -EFAULT;
+
+	return n;
+
+} /* end cachefiles_proc_read() */
+
+/*****************************************************************************/
+/*
+ * command the cache
+ */
+static ssize_t cachefiles_proc_write(struct file *file,
+				     const char __user *_data, size_t datalen,
+				     loff_t *pos)
+{
+	const struct cachefiles_proc_cmd *cmd;
+	struct cachefiles_cache *cache = file->private_data;
+	ssize_t ret;
+	char *data, *args, *cp;
+
+	_enter(",,%zu,", datalen);
+
+	ASSERT(cache);
+
+	if (test_bit(CACHEFILES_DEAD, &cache->flags))
+		return -EIO;
+
+	if (datalen == 0 || datalen > PAGE_SIZE - 1)
+		return -EOPNOTSUPP;
+
+	/* drag the command string into the kernel so we can parse it */
+	data = kmalloc(datalen + 1, GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(data, _data, datalen) != 0)
+		goto error;
+
+	data[datalen] = '\0';
+
+	ret = -EINVAL;
+	if (memchr(data, '\0', datalen))
+		goto error;
+
+	/* strip any newline */
+	cp = memchr(data, '\n', datalen);
+	if (cp) {
+		if (cp == data)
+			goto error;
+
+		*cp = '\0';
+	}
+
+	/* parse the command */
+	ret = -EOPNOTSUPP;
+
+	for (args = data; *args; args++)
+		if (isspace(*args))
+			break;
+	if (*args) {
+		if (args == data)
+			goto error;
+		*args = '\0';
+		for (args++; isspace(*args); args++)
+			continue;
+	}
+
+	/* run the appropriate command handler */
+	for (cmd = cachefiles_proc_cmds; cmd->name[0]; cmd++)
+		if (strcmp(cmd->name, data) == 0)
+			goto found_command;
+
+error:
+	kfree(data);
+	_leave(" = %d", ret);
+	return ret;
+
+found_command:
+	mutex_lock(&file->f_dentry->d_inode->i_mutex);
+
+	ret = -EIO;
+	if (!test_bit(CACHEFILES_DEAD, &cache->flags))
+		ret = cmd->handler(cache, args);
+
+	mutex_unlock(&file->f_dentry->d_inode->i_mutex);
+
+	if (ret == 0)
+		ret = datalen;
+	goto error;
+
+} /* end cachefiles_proc_write() */
+
+/*****************************************************************************/
+/*
+ * give a range error for cache space constraints
+ * - can be tail-called
+ */
+static int cachefiles_proc_range_error(struct cachefiles_cache *cache, char *args)
+{
+	kerror("Free space limits must be in range"
+	       " 0%%<=bstop<bcull<brun<100%%");
+
+	return -EINVAL;
+
+} /* end cachefiles_proc_range_error() */
+
+/*****************************************************************************/
+/*
+ * set the percentage of blocks at which to stop culling
+ * - command: "brun <N>%"
+ */
+static int cachefiles_proc_brun(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long brun;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	brun = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (brun <= cache->bcull_percent || brun >= 100)
+		return cachefiles_proc_range_error(cache, args);
+
+	cache->brun_percent = brun;
+	return 0;
+
+} /* end cachefiles_proc_brun() */
+
+/*****************************************************************************/
+/*
+ * set the percentage of blocks at which to start culling
+ * - command: "bcull <N>%"
+ */
+static int cachefiles_proc_bcull(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long bcull;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	bcull = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (bcull <= cache->bstop_percent || bcull >= cache->brun_percent)
+		return cachefiles_proc_range_error(cache, args);
+
+	cache->bcull_percent = bcull;
+	return 0;
+
+} /* end cachefiles_proc_bcull() */
+
+/*****************************************************************************/
+/*
+ * set the percentage of blocks at which to stop allocating
+ * - command: "bstop <N>%"
+ */
+static int cachefiles_proc_bstop(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long bstop;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	bstop = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (bstop < 0 || bstop >= cache->bcull_percent)
+		return cachefiles_proc_range_error(cache, args);
+
+	cache->bstop_percent = bstop;
+	return 0;
+
+} /* end cachefiles_proc_bstop() */
+
+/*****************************************************************************/
+/*
+ * set the cache directory
+ * - command: "dir <name>"
+ */
+static int cachefiles_proc_dir(struct cachefiles_cache *cache, char *args)
+{
+	char *dir;
+
+	_enter(",%s", args);
+
+	if (!*args) {
+		kerror("Empty directory specified");
+		return -EINVAL;
+	}
+
+	if (cache->rootdirname) {
+		kerror("Second cache directory specified");
+		return -EEXIST;
+	}
+
+	dir = kstrdup(args, GFP_KERNEL);
+	if (!dir)
+		return -ENOMEM;
+
+	cache->rootdirname = dir;
+	return 0;
+
+} /* end cachefiles_proc_dir() */
+
+/*****************************************************************************/
+/*
+ * set the cache tag
+ * - command: "tag <name>"
+ */
+static int cachefiles_proc_tag(struct cachefiles_cache *cache, char *args)
+{
+	char *tag;
+
+	_enter(",%s", args);
+
+	if (!*args) {
+		kerror("Empty tag specified");
+		return -EINVAL;
+	}
+
+	if (cache->tag)
+		return -EEXIST;
+
+	tag = kstrdup(args, GFP_KERNEL);
+	if (!tag)
+		return -ENOMEM;
+
+	cache->tag = tag;
+	return 0;
+
+} /* end cachefiles_proc_tag() */
+
+/*****************************************************************************/
+/*
+ * request a node in the cache be culled
+ * - command: "cull <dirfd> <name>"
+ */
+static int cachefiles_proc_cull(struct cachefiles_cache *cache, char *args)
+{
+	struct dentry *dir;
+	struct file *dirfile;
+	int dirfd, fput_needed, ret;
+
+	_enter(",%s", args);
+
+	dirfd = simple_strtoul(args, &args, 0);
+
+	if (!args || !isspace(*args))
+		goto inval;
+
+	while (isspace(*args))
+		args++;
+
+	if (!*args)
+		goto inval;
+
+	if (strchr(args, '/'))
+		goto inval;
+
+	if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+		kerror("cull applied to unready cache");
+		return -EIO;
+	}
+
+	if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+		kerror("cull applied to dead cache");
+		return -EIO;
+	}
+
+	/* extract the directory dentry from the fd */
+	dirfile = fget_light(dirfd, &fput_needed);
+	if (!dirfile) {
+		kerror("cull dirfd not open");
+		return -EBADF;
+	}
+
+	dir = dget(dirfile->f_dentry);
+	fput_light(dirfile, fput_needed);
+	dirfile = NULL;
+
+	if (!S_ISDIR(dir->d_inode->i_mode))
+		goto notdir;
+
+	ret = cachefiles_cull(cache, dir, args);
+
+	dput(dir);
+	_leave(" = %d", ret);
+	return ret;
+
+notdir:
+	dput(dir);
+	kerror("cull command requires dirfd to be a directory");
+	return -ENOTDIR;
+
+inval:
+	kerror("cull command requires dirfd and filename");
+	return -EINVAL;
+
+} /* end cachefiles_proc_cull() */
+
+/*****************************************************************************/
+/*
+ * set debugging mode
+ * - command: "debug <mask>"
+ */
+static int cachefiles_proc_debug(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long mask;
+
+	_enter(",%s", args);
+
+	mask = simple_strtoul(args, &args, 0);
+
+	if (!args || !isspace(*args))
+		goto inval;
+
+	cachefiles_debug = mask;
+	_leave(" = 0");
+	return 0;
+
+inval:
+	kerror("debug command requires mask");
+	return -EINVAL;
+
+} /* end cachefiles_proc_debug() */
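
For reference, the control interface above takes one command per write() to
/proc/fs/cachefiles.  A minimal, purely hypothetical userspace sequence to
configure and bind a cache might look like the snippet below; a real
cachefilesd would also poll the culling state by reading the same file, and
must keep the file descriptor open, since releasing it shuts the cache down:

/* Hypothetical driver for /proc/fs/cachefiles; illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int cmd(int fd, const char *s)
{
	/* each command must go in a single write */
	if (write(fd, s, strlen(s)) < 0) {
		perror(s);
		return -1;
	}
	return 0;
}

int main(void)
{
	int fd = open("/proc/fs/cachefiles", O_RDWR);

	if (fd < 0) {
		perror("/proc/fs/cachefiles");
		return 1;
	}

	cmd(fd, "dir /var/fscache");	/* backing directory */
	cmd(fd, "tag mycache");		/* optional binding tag */
	cmd(fd, "brun 7%");		/* cease culling above 7% free */
	cmd(fd, "bcull 5%");		/* start culling below 5% free */
	cmd(fd, "bstop 1%");		/* stop allocating below 1% free */
	cmd(fd, "bind");		/* bring the cache online (cf-bind.c) */

	pause();			/* hold the cache open */
	return 0;
}
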
diff --git a/fs/cachefiles/cf-xattr.c b/fs/cachefiles/cf-xattr.c
new file mode 100644
index 0000000..a811667
--- /dev/null
+++ b/fs/cachefiles/cf-xattr.c
@@ -0,0 +1,299 @@
+/* cf-xattr.c: CacheFiles extended attribute management
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include "internal.h"
+
+static const char cachefiles_xattr_cache[] = XATTR_USER_PREFIX "CacheFiles.cache";
+
+/*****************************************************************************/
+/*
+ * check the type label on an object
+ * - done using xattrs
+ */
+int cachefiles_check_object_type(struct cachefiles_object *object)
+{
+	struct dentry *dentry = object->dentry;
+	char type[3], xtype[3];
+	int ret;
+
+	ASSERT(dentry);
+	ASSERT(dentry->d_inode);
+	ASSERT(dentry->d_inode->i_op);
+	ASSERT(dentry->d_inode->i_op->setxattr);
+	ASSERT(dentry->d_inode->i_op->getxattr);
+
+	if (!object->fscache.cookie)
+		strcpy(type, "C3");
+	else
+		snprintf(type, 3, "%02x", object->fscache.cookie->def->type);
+
+	_enter("%p{%s}", object, type);
+
+	mutex_lock(&dentry->d_inode->i_mutex);
+
+	/* attempt to install a type label directly */
+	ret = dentry->d_inode->i_op->setxattr(dentry, cachefiles_xattr_cache,
+					      type, 2, XATTR_CREATE);
+	if (ret == 0) {
+		_debug("SET");
+		fsnotify_xattr(dentry);
+		mutex_unlock(&dentry->d_inode->i_mutex);
+		goto error;
+	}
+
+	if (ret != -EEXIST) {
+		kerror("Can't set xattr on %*.*s [%lu] (err %d)",
+		       dentry->d_name.len, dentry->d_name.len,
+		       dentry->d_name.name, dentry->d_inode->i_ino,
+		       -ret);
+		goto error;
+	}
+
+	/* read the current type label */
+	ret = dentry->d_inode->i_op->getxattr(dentry, cachefiles_xattr_cache,
+					      xtype, 3);
+	if (ret < 0) {
+		if (ret == -ERANGE)
+			goto bad_type_length;
+
+		kerror("Can't read xattr on %*.*s [%lu] (err %d)",
+		       dentry->d_name.len, dentry->d_name.len,
+		       dentry->d_name.name, dentry->d_inode->i_ino,
+		       -ret);
+		goto error;
+	}
+
+	/* check the type is what we're expecting */
+	if (ret != 2)
+		goto bad_type_length;
+
+	if (xtype[0] != type[0] || xtype[1] != type[1])
+		goto bad_type;
+
+	ret = 0;
+
+error:
+	mutex_unlock(&dentry->d_inode->i_mutex);
+	_leave(" = %d", ret);
+	return ret;
+
+bad_type_length:
+	kerror("Cache object %lu type xattr length incorrect",
+	       dentry->d_inode->i_ino);
+	ret = -EIO;
+	goto error;
+
+bad_type:
+	xtype[2] = 0;
+	kerror("Cache object %*.*s [%lu] type %s not %s",
+	       dentry->d_name.len, dentry->d_name.len,
+	       dentry->d_name.name, dentry->d_inode->i_ino,
+	       xtype, type);
+	ret = -EIO;
+	goto error;
+
+} /* end cachefiles_check_object_type() */
+
+/*****************************************************************************/
+/*
+ * set the state xattr on a cache file
+ */
+int cachefiles_set_object_xattr(struct cachefiles_object *object,
+				struct cachefiles_xattr *auxdata)
+{
+	struct dentry *dentry = object->dentry;
+	int ret;
+
+	ASSERT(object->fscache.cookie);
+	ASSERT(dentry);
+	ASSERT(dentry->d_inode->i_op->setxattr);
+
+	_enter("%p,#%d", object, auxdata->len);
+
+	/* attempt to install the cache metadata directly */
+	mutex_lock(&dentry->d_inode->i_mutex);
+
+	_debug("SET %s #%u",
+	       object->fscache.cookie->def->name, auxdata->len);
+
+	ret = dentry->d_inode->i_op->setxattr(dentry, cachefiles_xattr_cache,
+					      &auxdata->type, auxdata->len,
+					      XATTR_CREATE);
+	if (ret == 0)
+		fsnotify_xattr(dentry);
+	else if (ret != -ENOMEM)
+		cachefiles_io_error_obj(object,
+					"Failed to set xattr with error %d",
+					ret);
+
+	mutex_unlock(&dentry->d_inode->i_mutex);
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_set_object_xattr() */
+
+/*****************************************************************************/
+/*
+ * check the state xattr on a cache file
+ * - return -ESTALE if the object should be deleted
+ */
+int cachefiles_check_object_xattr(struct cachefiles_object *object,
+				  struct cachefiles_xattr *auxdata)
+{
+	struct cachefiles_xattr *auxbuf;
+	struct dentry *dentry = object->dentry;
+	int ret;
+
+	_enter("%p,#%d", object, auxdata->len);
+
+	ASSERT(dentry);
+	ASSERT(dentry->d_inode);
+	ASSERT(dentry->d_inode->i_op->setxattr);
+	ASSERT(dentry->d_inode->i_op->getxattr);
+
+	auxbuf = kmalloc(sizeof(struct cachefiles_xattr) + 512, GFP_KERNEL);
+	if (!auxbuf) {
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	mutex_lock(&dentry->d_inode->i_mutex);
+
+	/* read the current type label */
+	ret = dentry->d_inode->i_op->getxattr(dentry, cachefiles_xattr_cache,
+					      &auxbuf->type, 512 + 1);
+	if (ret < 0) {
+		if (ret == -ENODATA)
+			goto stale; /* no attribute - power went off
+				     * mid-cull? */
+
+		if (ret == -ERANGE)
+			goto bad_type_length;
+
+		cachefiles_io_error_obj(object,
+					"can't read xattr on %lu (err %d)",
+					dentry->d_inode->i_ino, -ret);
+		goto error;
+	}
+
+	/* check the on-disk object */
+	if (ret < 1)
+		goto bad_type_length;
+
+	if (auxbuf->type != auxdata->type)
+		goto stale;
+
+	auxbuf->len = ret;
+
+	/* consult the netfs */
+	if (object->fscache.cookie->def->check_aux) {
+		fscache_checkaux_t result;
+		unsigned int dlen;
+
+		dlen = auxbuf->len - 1;
+
+		_debug("checkaux %s #%u",
+		       object->fscache.cookie->def->name, dlen);
+
+		result = object->fscache.cookie->def->check_aux(
+			object->fscache.cookie->netfs_data,
+			&auxbuf->data, dlen);
+
+		switch (result) {
+			/* entry okay as is */
+		case FSCACHE_CHECKAUX_OKAY:
+			goto okay;
+
+			/* entry requires update */
+		case FSCACHE_CHECKAUX_NEEDS_UPDATE:
+			break;
+
+			/* entry requires deletion */
+		case FSCACHE_CHECKAUX_OBSOLETE:
+			goto stale;
+
+		default:
+			BUG();
+		}
+
+		/* update the current label */
+		ret = dentry->d_inode->i_op->setxattr(dentry,
+						      cachefiles_xattr_cache,
+						      &auxdata->type,
+						      auxdata->len,
+						      XATTR_REPLACE);
+		if (ret < 0) {
+			cachefiles_io_error_obj(object,
+						"Can't update xattr on %lu"
+						" (error %d)",
+						dentry->d_inode->i_ino, -ret);
+			goto error;
+		}
+	}
+
+okay:
+	ret = 0;
+
+error:
+	mutex_unlock(&dentry->d_inode->i_mutex);
+	kfree(auxbuf);
+	_leave(" = %d", ret);
+	return ret;
+
+bad_type_length:
+	kerror("Cache object %lu xattr length incorrect",
+	       dentry->d_inode->i_ino);
+	ret = -EIO;
+	goto error;
+
+stale:
+	ret = -ESTALE;
+	goto error;
+
+} /* end cachefiles_check_object_xattr() */
+
+/*****************************************************************************/
+/*
+ * remove the object's xattr to mark it stale
+ */
+int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+				   struct dentry *dentry)
+{
+	int ret;
+
+	mutex_lock(&dentry->d_inode->i_mutex);
+
+	ret = dentry->d_inode->i_op->removexattr(dentry,
+						 cachefiles_xattr_cache);
+
+	mutex_unlock(&dentry->d_inode->i_mutex);
+
+	if (ret < 0) {
+		if (ret == -ENOENT || ret == -ENODATA)
+			ret = 0;
+		else if (ret != -ENOMEM)
+			cachefiles_io_error(cache,
+					    "Can't remove xattr from %lu"
+					    " (error %d)",
+					    dentry->d_inode->i_ino, -ret);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+} /* end cachefiles_remove_object_xattr() */
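
To put the check_aux consultation above in context, a netfs supplies a callback
of this sort in its cookie definition.  The sketch below is only indicative:
the myfs_* names and the auxiliary data layout are invented, and the exact
callback prototype comes from the FS-Cache patches (linux/fscache.h) rather
than from this file:

/* Hedged sketch of a netfs check_aux callback as consulted by
 * cachefiles_check_object_xattr().  All myfs_* names are invented.
 */
struct myfs_aux {
	uint64_t	data_version;
	uint64_t	mtime;
};

struct myfs_inode {
	uint64_t	data_version;
	uint64_t	mtime;
	/* ... */
};

static fscache_checkaux_t myfs_check_aux(void *netfs_data,
					  const void *data,
					  unsigned int datalen)
{
	struct myfs_inode *inode = netfs_data;
	struct myfs_aux aux;

	if (datalen != sizeof(aux))
		return FSCACHE_CHECKAUX_OBSOLETE;	/* junk the object */

	memcpy(&aux, data, sizeof(aux));

	if (aux.data_version != inode->data_version)
		return FSCACHE_CHECKAUX_OBSOLETE;	/* server data changed */

	if (aux.mtime != inode->mtime)
		return FSCACHE_CHECKAUX_NEEDS_UPDATE;	/* refresh the xattr */

	return FSCACHE_CHECKAUX_OKAY;			/* cached object still valid */
}
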
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
new file mode 100644
index 0000000..83af6b3
--- /dev/null
+++ b/fs/cachefiles/internal.h
@@ -0,0 +1,292 @@
+/* internal.h: general netfs cache on cache files internal defs
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ *
+ * CacheFiles layout:
+ *
+ *	/..../CacheDir/
+ *		index
+ *		0/
+ *		1/
+ *		2/
+ *		  index
+ *		  0/
+ *		  1/
+ *		  2/
+ *		    index
+ *		    0
+ *		    1
+ *		    2
+ */
+
+#include <linux/fscache-cache.h>
+#include <linux/timer.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+struct cachefiles_cache;
+struct cachefiles_object;
+
+extern unsigned long cachefiles_debug;
+extern struct fscache_cache_ops cachefiles_cache_ops;
+extern struct proc_dir_entry *cachefiles_proc;
+extern struct file_operations cachefiles_proc_fops;
+
+/*****************************************************************************/
+/*
+ * node records
+ */
+struct cachefiles_object {
+	struct fscache_object		fscache;	/* fscache handle */
+	struct dentry			*dentry;	/* the file/dir representing this object */
+	struct file			*backer;	/* backing file */
+	loff_t				i_size;		/* object size */
+	atomic_t			usage;		/* basic object usage count */
+	atomic_t			fscache_usage;	/* FSDEF object usage count */
+	uint8_t				type;		/* object type */
+	uint8_t				new;		/* T if object new */
+	spinlock_t			work_lock;
+	struct rw_semaphore		sem;
+	struct work_struct		read_work;	/* read page copier */
+	struct list_head		read_list;	/* pages to copy */
+	struct list_head		read_pend_list;	/* pages pending read from backer */
+	struct work_struct		write_work;	/* page writer */
+	struct list_head		write_list;	/* pages to store */
+	struct rb_node			active_node;	/* link in active tree (dentry is key) */
+};
+
+extern kmem_cache_t *cachefiles_object_jar;
+
+/*****************************************************************************/
+/*
+ * Cache files cache definition
+ */
+struct cachefiles_cache {
+	struct fscache_cache		cache;		/* FS-Cache record */
+	struct vfsmount			*mnt;		/* mountpoint holding the cache */
+	struct dentry			*graveyard;	/* directory into which dead objects go */
+	struct file			*cachefilesd;	/* manager daemon handle */
+	struct rb_root			active_nodes;	/* active nodes (can't be culled) */
+	rwlock_t			active_lock;	/* lock for active_nodes */
+	atomic_t			gravecounter;	/* graveyard uniquifier */
+	unsigned			brun_percent;	/* when to stop culling (%) */
+	unsigned			bcull_percent;	/* when to start culling (%) */
+	unsigned			bstop_percent;	/* when to stop allocating (%) */
+	unsigned			bsize;		/* cache's block size */
+	unsigned			bshift;		/* min(log2 (PAGE_SIZE / bsize), 0) */
+	sector_t			brun;		/* when to stop culling */
+	sector_t			bcull;		/* when to start culling */
+	sector_t			bstop;		/* when to stop allocating */
+	unsigned long			flags;
+#define CACHEFILES_READY		0	/* T if cache prepared */
+#define CACHEFILES_DEAD			1	/* T if cache dead */
+#define CACHEFILES_CULLING		2	/* T if cull engaged */
+	char				*rootdirname;	/* name of cache root directory */
+	char				*tag;		/* cache binding tag */
+};
+
+/*****************************************************************************/
+/*
+ * backing file read tracking
+ */
+struct cachefiles_one_read {
+	wait_queue_t			monitor;	/* link into monitored waitqueue */
+	struct page			*back_page;	/* backing file page we're waiting for */
+	struct page			*netfs_page;	/* netfs page we're going to fill */
+	struct cachefiles_object	*object;
+	struct list_head		obj_link;	/* link in object's lists */
+	fscache_rw_complete_t		callback_func;
+	void				*callback_data;
+};
+
+/*****************************************************************************/
+/*
+ * backing file write tracking
+ */
+struct cachefiles_one_write {
+	struct page			*netfs_page;	/* netfs page to copy */
+	struct cachefiles_object	*object;
+	struct list_head		obj_link;	/* link in object's lists */
+	fscache_rw_complete_t		callback_func;
+	void				*callback_data;
+};
+
+/*****************************************************************************/
+/*
+ * auxiliary data xattr buffer
+ */
+struct cachefiles_xattr {
+	uint16_t			len;
+	uint8_t				type;
+	uint8_t				data[];
+};
+
+
+/* cf-bind.c */
+extern int cachefiles_proc_bind(struct cachefiles_cache *cache, char *args);
+extern void cachefiles_proc_unbind(struct cachefiles_cache *cache);
+
+/* cf-interface.c */
+extern void cachefiles_read_copier_work(void *_object);
+extern void cachefiles_write_work(void *_object);
+extern int cachefiles_has_space(struct cachefiles_cache *cache, unsigned nr);
+
+/* cf-key.c */
+extern char *cachefiles_cook_key(const u8 *raw, int keylen, uint8_t type);
+
+/* cf-namei.c */
+extern int cachefiles_delete_object(struct cachefiles_cache *cache,
+				    struct cachefiles_object *object);
+extern int cachefiles_walk_to_object(struct cachefiles_object *parent,
+				     struct cachefiles_object *object,
+				     char *key,
+				     struct cachefiles_xattr *auxdata);
+extern struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+					       struct dentry *dir,
+					       const char *name);
+
+extern int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
+			   char *filename);
+
+/* cf-xattr.c */
+extern int cachefiles_check_object_type(struct cachefiles_object *object);
+extern int cachefiles_set_object_xattr(struct cachefiles_object *object,
+				       struct cachefiles_xattr *auxdata);
+extern int cachefiles_check_object_xattr(struct cachefiles_object *object,
+					 struct cachefiles_xattr *auxdata);
+extern int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+					  struct dentry *dentry);
+
+
+/*****************************************************************************/
+/*
+ * debug tracing
+ */
+#define kerror(FMT,...) printk(KERN_ERR "CacheFiles: "FMT"\n" ,##__VA_ARGS__)
+
+#define cachefiles_io_error(___cache, FMT, ...)		\
+do {							\
+	kerror("I/O Error: " FMT ,##__VA_ARGS__);	\
+	fscache_io_error(&(___cache)->cache);		\
+	set_bit(CACHEFILES_DEAD, &(___cache)->flags);	\
+} while(0)
+
+#define cachefiles_io_error_obj(object, FMT, ...)			\
+do {									\
+	struct cachefiles_cache *___cache;				\
+									\
+	___cache = container_of((object)->fscache.cache,		\
+				struct cachefiles_cache, cache);	\
+	cachefiles_io_error(___cache, FMT ,##__VA_ARGS__);		\
+} while(0)
+
+#define dbgprintk(FMT,...) \
+	printk("[%-6.6s] "FMT"\n",current->comm ,##__VA_ARGS__)
+#define _dbprintk(FMT,...) do { } while(0)
+
+#define kenter(FMT,...)	dbgprintk("==> %s("FMT")",__FUNCTION__ ,##__VA_ARGS__)
+#define kleave(FMT,...)	dbgprintk("<== %s()"FMT"",__FUNCTION__ ,##__VA_ARGS__)
+#define kdebug(FMT,...)	dbgprintk(FMT ,##__VA_ARGS__)
+
+
+#define kjournal(FMT,...) _dbprintk(FMT ,##__VA_ARGS__)
+
+#define dbgfree(ADDR)  _dbprintk("%p:%d: FREEING %p",__FILE__,__LINE__,ADDR)
+
+#define dbgpgalloc(PAGE)						\
+do {									\
+	_dbprintk("PGALLOC %s:%d: %p {%lx,%lu}\n",			\
+		  __FILE__,__LINE__,					\
+		  (PAGE),(PAGE)->mapping->host->i_ino,(PAGE)->index	\
+		  );							\
+} while(0)
+
+#define dbgpgfree(PAGE)						\
+do {								\
+	if ((PAGE))						\
+		_dbprintk("PGFREE %s:%d: %p {%lx,%lu}\n",	\
+			  __FILE__,__LINE__,			\
+			  (PAGE),				\
+			  (PAGE)->mapping->host->i_ino,		\
+			  (PAGE)->index				\
+			  );					\
+} while(0)
+
+#ifdef __KDEBUG
+#define _enter(FMT,...)	kenter(FMT,##__VA_ARGS__)
+#define _leave(FMT,...)	kleave(FMT,##__VA_ARGS__)
+#define _debug(FMT,...)	kdebug(FMT,##__VA_ARGS__)
+#else
+#define _enter(FMT,...)	do { } while(0)
+#define _leave(FMT,...)	do { } while(0)
+#define _debug(FMT,...)	do { } while(0)
+#endif
+
+#if 1 // defined(__KDEBUGALL)
+
+#define ASSERT(X)							\
+do {									\
+	if (unlikely(!(X))) {						\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		BUG();							\
+	}								\
+} while(0)
+
+#define ASSERTCMP(X, OP, Y)						\
+do {									\
+	if (unlikely(!((X) OP (Y)))) {					\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		printk(KERN_ERR "%lx " #OP " %lx is false\n",		\
+		       (unsigned long)(X), (unsigned long)(Y));		\
+		BUG();							\
+	}								\
+} while(0)
+
+#define ASSERTIF(C, X)							\
+do {									\
+	if (unlikely((C) && !(X))) {					\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		BUG();							\
+	}								\
+} while(0)
+
+#define ASSERTIFCMP(C, X, OP, Y)					\
+do {									\
+	if (unlikely((C) && !((X) OP (Y)))) {				\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		printk(KERN_ERR "%lx " #OP " %lx is false\n",		\
+		       (unsigned long)(X), (unsigned long)(Y));		\
+		BUG();							\
+	}								\
+} while(0)
+
+#else
+
+#define ASSERT(X)				\
+do {						\
+} while(0)
+
+#define ASSERTCMP(X, OP, Y)			\
+do {						\
+} while(0)
+
+#define ASSERTIF(C, X)				\
+do {						\
+} while(0)
+
+#define ASSERTIFCMP(C, X, OP, Y)		\
+do {						\
+} while(0)
+
+#endif
diff --git a/fs/fcntl.c b/fs/fcntl.c
index d35cbc6..b43d821 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -529,6 +529,8 @@ int send_sigurg(struct fown_struct *fown
 	return ret;
 }
 
+EXPORT_SYMBOL(send_sigurg);
+
 static DEFINE_RWLOCK(fasync_lock);
 static kmem_cache_t *fasync_cache __read_mostly;
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 02e7d8b..bc5eba8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -217,6 +217,12 @@ static inline void wait_on_page_fs_misc(
 extern void fastcall end_page_fs_misc(struct page *page);
 
 /*
+ * permit installation of a state change monitor in the queue for a page
+ */
+extern void install_page_waitqueue_monitor(struct page *page,
+					   wait_queue_t *monitor);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
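
This is what the monitor field in struct cachefiles_one_read (see internal.h
above) is for: CacheFiles parks a wait queue entry with a custom wake function
on the backing page's waitqueue, so it gets a callback whenever the page's
state changes instead of sleeping in the read path.  The sketch below shows
the shape of such a monitor; the my_* names are illustrative and the real
version lives in cf-interface.c:

/* Sketch of a page waitqueue monitor; my_* names are invented. */
#include <linux/wait.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

struct my_page_monitor {
	wait_queue_t	monitor;
	struct page	*page;
};

static int my_page_waiter(wait_queue_t *wait, unsigned mode, int sync,
			  void *key)
{
	struct my_page_monitor *m =
		container_of(wait, struct my_page_monitor, monitor);

	if (PageLocked(m->page))
		return 0;			/* not ready yet; stay queued */

	list_del_init(&wait->task_list);	/* detach from the waitqueue */
	/* ... pass m->page to a work item for copying ... */
	return 1;
}

static void my_watch_page(struct my_page_monitor *m, struct page *page)
{
	m->page = page;
	init_waitqueue_func_entry(&m->monitor, my_page_waiter);
	install_page_waitqueue_monitor(page, &m->monitor);
}
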
diff --git a/mm/filemap.c b/mm/filemap.c
index d6f7ab4..26ecd9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -488,6 +488,18 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+void install_page_waitqueue_monitor(struct page *page, wait_queue_t *monitor)
+{
+	wait_queue_head_t *q = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue(q, monitor);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+
+EXPORT_SYMBOL_GPL(install_page_waitqueue_monitor);
+
 /**
  * unlock_page() - unlock a locked page
  *
@@ -2106,6 +2118,96 @@ generic_file_buffered_write(struct kiocb
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
+int
+generic_file_buffered_write_one_kernel_page(struct file *file,
+					    pgoff_t index,
+					    struct page *src)
+{
+	struct address_space *mapping = file->f_mapping;
+	struct address_space_operations *a_ops = mapping->a_ops;
+	struct pagevec	lru_pvec;
+	struct page *page, *cached_page = NULL;
+	void *from, *to;
+	long status = 0;
+
+	pagevec_init(&lru_pvec, 0);
+
+	page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+	if (!page) {
+		BUG_ON(cached_page);
+		return -ENOMEM;
+	}
+
+	status = a_ops->prepare_write(file, page, 0, PAGE_CACHE_SIZE);
+	if (unlikely(status)) {
+		loff_t isize = i_size_read(mapping->host);
+
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+		if (status == AOP_TRUNCATED_PAGE)
+			goto sync;
+
+		/* prepare_write() may have instantiated a few blocks outside
+		 * i_size.  Trim these off again.
+		 */
+		if (((loff_t)(index + 1) << PAGE_CACHE_SHIFT) > isize)
+			vmtruncate(mapping->host, isize);
+		goto sync;
+	}
+
+	from = kmap_atomic(src, KM_USER0);
+	to = kmap_atomic(page, KM_USER1);
+	copy_page(to, from);
+	kunmap_atomic(from, KM_USER0);
+	kunmap_atomic(to, KM_USER1);
+	flush_dcache_page(page);
+
+	status = a_ops->commit_write(file, page, 0, PAGE_CACHE_SIZE);
+	if (status == AOP_TRUNCATED_PAGE) {
+		page_cache_release(page);
+		goto sync;
+	}
+
+	if (status > 0)
+		status = 0;
+
+	unlock_page(page);
+	mark_page_accessed(page);
+	page_cache_release(page);
+	if (status < 0)
+		return status;
+
+	balance_dirty_pages_ratelimited(mapping);
+	cond_resched();
+
+sync:
+	if (cached_page)
+		page_cache_release(cached_page);
+
+	/*
+	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
+	 */
+	if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(mapping->host))) {
+		if (!a_ops->writepage)
+			status = generic_osync_inode(
+				mapping->host, mapping,
+				OSYNC_METADATA | OSYNC_DATA);
+  	}
+	
+	/*
+	 * If we get here for O_DIRECT writes then we must have fallen through
+	 * to buffered writes (block instantiation inside i_size).  So we sync
+	 * the file data here, to try to honour O_DIRECT expectations.
+	 */
+	if (unlikely(file->f_flags & O_DIRECT))
+		status = filemap_write_and_wait(mapping);
+
+	pagevec_lru_add(&lru_pvec);
+	return status;
+}
+EXPORT_SYMBOL(generic_file_buffered_write_one_kernel_page);
+
 static ssize_t
 __generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t *ppos)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
@ 2006-04-20 17:18   ` Christoph Hellwig
  2006-04-20 18:06   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2006-04-20 17:18 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, steved, sct, aviro, linux-fsdevel, linux-cachefs,
	nfsv4, linux-kernel

On Thu, Apr 20, 2006 at 05:59:33PM +0100, David Howells wrote:
> Make it possible to avoid ENFILE checking for kernel specific open files, such
> as are used by the CacheFiles module.
> 
> After, for example, tarring up a kernel source tree over the network, the
> CacheFiles module may easily have 20000+ files open in the backing filesystem,
> thus causing all non-root processes to be given error ENFILE when they try to
> open a file, socket, pipe, etc..

No, just increase the limit.  The whole point of the limit is to avoid resource
exhaustion.  A file doesn't use any fewer resources just because it's opened
from kernelspace.  If in doubt, increase the limit or even the default limit.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] FS-Cache: Export find_get_pages()
  2006-04-20 16:59 ` [PATCH 4/7] FS-Cache: Export find_get_pages() David Howells
@ 2006-04-20 17:19   ` Christoph Hellwig
  2006-04-20 17:45   ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2006-04-20 17:19 UTC (permalink / raw)
  To: David Howells
  Cc: akpm, aviro, sct, nfsv4, steved, linux-kernel, torvalds,
	linux-cachefs, linux-fsdevel

On Thu, Apr 20, 2006 at 05:59:35PM +0100, David Howells wrote:
> The attached patch exports find_get_pages() for use by the kAFS filesystem in
> conjunction with its caching patch.

Why don't you use pagevec ?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops
  2006-04-20 16:59 ` [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops David Howells
@ 2006-04-20 17:40   ` Zach Brown
  2006-04-20 18:27     ` Anton Altaparmakov
  0 siblings, 1 reply; 31+ messages in thread
From: Zach Brown @ 2006-04-20 17:40 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, akpm, steved, sct, aviro, linux-fsdevel, linux-cachefs,
	nfsv4, linux-kernel

David Howells wrote:
> The attached patch adds a new VMA operation to notify a filesystem or other
> driver about the MMU generating a fault because userspace attempted to write
> to a page mapped through a read-only PTE.

This will almost certainly help OCFS2 get shared writable mmap() right,
too, though it probably won't be the whole story.

- z

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] FS-Cache: Export find_get_pages()
  2006-04-20 16:59 ` [PATCH 4/7] FS-Cache: Export find_get_pages() David Howells
  2006-04-20 17:19   ` Christoph Hellwig
@ 2006-04-20 17:45   ` David Howells
  2006-04-21  0:15     ` Andrew Morton
  2006-04-21 13:02     ` David Howells
  1 sibling, 2 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 17:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: akpm, aviro, sct, nfsv4, steved, linux-kernel, David Howells,
	torvalds, linux-cachefs, linux-fsdevel

Christoph Hellwig <hch@infradead.org> wrote:

> > The attached patch exports find_get_pages() for use by the kAFS filesystem
> > in conjunction with its caching patch.
> 
> Why don't you use pagevec ?

You mean pagevec_lookup() I suppose... That's probably reasonable, though
slower.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
  2006-04-20 17:18   ` Christoph Hellwig
@ 2006-04-20 18:06   ` David Howells
  2006-04-21  0:11     ` Andrew Morton
  2006-04-21 10:57     ` David Howells
  2006-04-21  0:07   ` Andrew Morton
  2006-04-21 12:33   ` David Howells
  3 siblings, 2 replies; 31+ messages in thread
From: David Howells @ 2006-04-20 18:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: akpm, aviro, nfsv4, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

Christoph Hellwig <hch@infradead.org> wrote:

> > Make it possible to avoid ENFILE checking for kernel specific open files,
> > such as are used by the CacheFiles module.
> > 
> > After, for example, tarring up a kernel source tree over the network, the
> > CacheFiles module may easily have 20000+ files open in the backing
> > filesystem, thus causing all non-root processes to be given error ENFILE
> > when they try to open a file, socket, pipe, etc..
> 
> No, just increase the limit.  The whole point of the limit is to avoid
> > resource exhaustion.  A file doesn't use any fewer resources just because
> > it's opened from kernelspace.  If in doubt, increase the limit or even the
> default limit.

As I saw it, the limit is there to prevent userspace from pinning too many
file resources.  Yes, each userspace process is limited by its rlimit, but you
have to multiply that by the number of processes that can be around:

	1024 * 32767 > 32 million files


Besides, you say "increase the limit", but there are two problems with that:

 (1) Each AFS or NFS _dentry_ retained in the system pins a file in the
     backing cache if it's also cached, whether or not it's open.  So on my
     desktop box, I've got about a million dentries cached.  That means I
     might also have anything up to a million files open... Except that the
     netfs would be denied caching rights on any file beyond the ENFILE limit
     - not that that matters, since you wouldn't be able to exec or open
     anything if you weren't root - and that might include running su and the
     like.

 (2) And what should the limit actually _be_?  You haven't said.  If you
     include the cache in the limit, you can at best open ENFILE limit / 2
     files on the netfs before you get moaned at by the system... And then
     closed files aren't immediately released by the cache, so you can quickly
     find yourself backed into a corner...

With this patch, CacheFiles's consumption of files is controlled by the dcache
reclaimer and not by ENFILE, and allocating more files will cause other cache
files to be closed automatically.

The cache can't arbitrarily close backing files for which no userspace file
descriptor remains open: there may be dirty pages, or the file may be mmapped.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops
  2006-04-20 17:40   ` Zach Brown
@ 2006-04-20 18:27     ` Anton Altaparmakov
  0 siblings, 0 replies; 31+ messages in thread
From: Anton Altaparmakov @ 2006-04-20 18:27 UTC (permalink / raw)
  To: Zach Brown
  Cc: David Howells, torvalds, akpm, steved, sct, aviro, linux-fsdevel,
	linux-cachefs, nfsv4, linux-kernel

On Thu, 20 Apr 2006, Zach Brown wrote:
> David Howells wrote:
> > The attached patch adds a new VMA operation to notify a filesystem or other
> > driver about the MMU generating a fault because userspace attempted to write
> > to a page mapped through a read-only PTE.
> 
> This will almost certainly help OCFS2 get shared writable mmap() right,
> too, though it probably won't be the whole story.

This is also required by NTFS for a correct implementation of "mmap write 
into sparse region on ntfs volume with cluster size > PAGE_CACHE_SIZE".

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
  2006-04-20 17:18   ` Christoph Hellwig
  2006-04-20 18:06   ` David Howells
@ 2006-04-21  0:07   ` Andrew Morton
  2006-04-21 12:33   ` David Howells
  3 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  0:07 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> Make it possible to avoid ENFILE checking for kernel specific open files, such
> as are used by the CacheFiles module.
> 
> After, for example, tarring up a kernel source tree over the network, the
> CacheFiles module may easily have 20000+ files open in the backing filesystem,
> thus causing all non-root processes to be given error ENFILE when they try to
> open a file, socket, pipe, etc..
> 
>  ...
>
>  
>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
> +static atomic_t nr_kernel_files;

So it's not performance-critical.

> -struct file *get_empty_filp(void)
> +struct file *get_empty_filp(int kernel)

I'd suggest a new get_empty_kernel_filp(void) rather than providing a magic
argument.  (we can still have the magic argument in the new
__get_empty_filp(int), but it shouldn't be part of the caller-visible API).

> +	if (!kernel) {
> +		if (get_nr_files() >= files_stat.max_files &&
> +		    !capable(CAP_SYS_ADMIN)
> +		    ) {

ugly.

> +	f->f_kernel_flags = kernel ? FKFLAGS_KERNEL : 0;

It would be more flexible to make the caller pass in the flags directly.

>  	f->f_uid = tsk->fsuid;
>  	f->f_gid = tsk->fsgid;
>  	eventpoll_init_file(f);
> @@ -235,6 +250,7 @@ struct file fastcall *fget_light(unsigne
>  	return file;
>  }
>  
> +EXPORT_SYMBOL(fget_light);

fget_light is not otherwise referenced in this patch.

this change is not changelogged.

why non-GPL?

> +EXPORT_SYMBOL(dentry_open_kernel);

_GPL?

> @@ -640,6 +643,7 @@ struct file {
>  	atomic_t		f_count;
>  	unsigned int 		f_flags;
>  	mode_t			f_mode;
> +	unsigned short		f_kernel_flags;
>  	loff_t			f_pos;
>  	struct fown_struct	f_owner;
>  	unsigned int		f_uid, f_gid;

That's unfortunate.  There's still room in f_flags.  Was it hard to use that?

> @@ -943,7 +943,7 @@ static ctl_table fs_table[] = {
>  		.ctl_name	= FS_NRFILE,
>  		.procname	= "file-nr",
>  		.data		= &files_stat,
> -		.maxlen		= 3*sizeof(int),
> +		.maxlen		= 4*sizeof(int),
>  		.mode		= 0444,
>  		.proc_handler	= &proc_nr_files,

This changes the format of /proc/sys/fs/file-nr.  What will break?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 18:06   ` David Howells
@ 2006-04-21  0:11     ` Andrew Morton
  2006-04-21 10:57     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  0:11 UTC (permalink / raw)
  To: David Howells
  Cc: dhowells, aviro, sct, nfsv4, steved, linux-kernel, torvalds,
	linux-cachefs, linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
>   (1) Each AFS or NFS _dentry_ retained in the system pins a file in the
>       backing cache if it's also cached, whether or not it's open.

That would seem to be a great shortcoming in fscache.

I guess as memory reclaim reaps the top-level dentries those file*'s will
also be freed up, leading to their dentries becoming reclaimable, leading
to their inodes being reclaimable.

But still.  Is it not possible to release those files-pinned-by-dcache when
the top-level files are closed?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
                   ` (5 preceding siblings ...)
  2006-04-20 16:59 ` [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem David Howells
@ 2006-04-21  0:12 ` Andrew Morton
  2006-04-21 10:22 ` David Howells
  7 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  0:12 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
>   #define PG_checked		 8	/* kill me in 2.5.<early>. */
>  +#define PG_fs_misc		 8

It would be better to rename PG_checked to PG_fs_misc kernel-wide.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] FS-Cache: Export find_get_pages()
  2006-04-20 17:45   ` David Howells
@ 2006-04-21  0:15     ` Andrew Morton
  2006-04-21 13:02     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  0:15 UTC (permalink / raw)
  To: David Howells
  Cc: dhowells, aviro, sct, nfsv4, steved, linux-kernel, torvalds,
	linux-cachefs, linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > > The attached patch exports find_get_pages() for use by the kAFS filesystem
> > > in conjunction with its caching patch.
> > 
> > Why don't you use pagevec ?
> 
> You mean pagevec_lookup() I suppose... That's probably reasonable, though
> slower.
> 

But the code's using pagevecs now.  In a strange manner.

+		nr_pages = find_get_pages(vnode->vfs_inode.i_mapping, first,
+					  PAGEVEC_SIZE, pvec.pages);

that's an open-coded pagevec_lookup().
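
For reference, the pagevec-based form of that lookup would be roughly (using
the same kAFS variables as the snippet above):

	/* equivalent lookup through the pagevec API */
	nr_pages = pagevec_lookup(&pvec, vnode->vfs_inode.i_mapping,
				  first, PAGEVEC_SIZE);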

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7] FS-Cache: Generic filesystem caching facility
  2006-04-20 16:59 ` [PATCH 5/7] FS-Cache: Generic filesystem caching facility David Howells
@ 2006-04-21  0:46   ` Andrew Morton
  2006-04-21 14:15   ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  0:46 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> ...
>
> --- /dev/null
> +++ b/fs/fscache/cookie.c
> @@ -0,0 +1,1065 @@
> +/* cookie.c: general filesystem cache cookie management
> + *
> + * Copyright (C) 2004-5 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include "fscache-int.h"
> +
> +static LIST_HEAD(fscache_cache_tag_list);
> +static LIST_HEAD(fscache_cache_list);
> +static LIST_HEAD(fscache_netfs_list);
> +static DECLARE_RWSEM(fscache_addremove_sem);
> +static struct fscache_cache_tag fscache_nomem_tag;
> +
> +kmem_cache_t *fscache_cookie_jar;
> +
> +static void fscache_withdraw_object(struct fscache_cache *cache,
> +				    struct fscache_object *object);
> +
> +static void __fscache_cookie_put(struct fscache_cookie *cookie);
> +
> +static inline void fscache_cookie_put(struct fscache_cookie *cookie)
> +{
> +#ifdef CONFIG_DEBUG_SLAB
> +	BUG_ON((atomic_read(&cookie->usage) & 0xffff0000) == 0x6b6b0000);
> +#endif

Interesting.  Perhaps a comment in there?

> +/*****************************************************************************/
> +/*
> + * look up a cache tag
> + */
> +struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *name)
> +{
> +	struct fscache_cache_tag *tag, *xtag;
> +
> +	/* firstly check for the existence of the tag under read lock */
> +	down_read(&fscache_addremove_sem);
> +
> +	list_for_each_entry(tag, &fscache_cache_tag_list, link) {
> +		if (strcmp(tag->name, name) == 0) {
> +			atomic_inc(&tag->usage);
> +			up_read(&fscache_addremove_sem);
> +			return tag;
> +		}
> +	}
> +
> +	up_read(&fscache_addremove_sem);
> +
> +	/* the tag does not exist - create a candidate */
> +	xtag = kmalloc(sizeof(*tag) + strlen(name) + 1, GFP_KERNEL);
> +	if (!xtag) {
> +		/* return a dummy tag if out of memory */
> +		up_read(&fscache_addremove_sem);

We already did an up_read().

> +/*****************************************************************************/
> +/*
> + * register a network filesystem for caching
> + */
> +int __fscache_register_netfs(struct fscache_netfs *netfs)
> +{
> +	struct fscache_netfs *ptr;
> +	int ret;
> +
> +	_enter("{%s}", netfs->name);
> +
> +	INIT_LIST_HEAD(&netfs->link);
> +
> +	/* allocate a cookie for the primary index */
> +	netfs->primary_index =
> +		kmem_cache_alloc(fscache_cookie_jar, SLAB_KERNEL);
> +
> +	if (!netfs->primary_index) {
> +		_leave(" = -ENOMEM");
> +		return -ENOMEM;
> +	}
> +
> +	/* initialise the primary index cookie */
> +	memset(netfs->primary_index, 0, sizeof(*netfs->primary_index));

We have kmem_cache_zalloc() now.
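
i.e. the alloc+memset pair above would collapse to roughly:

	netfs->primary_index =
		kmem_cache_zalloc(fscache_cookie_jar, GFP_KERNEL);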

> +EXPORT_SYMBOL(__fscache_register_netfs);

This code exports a huge number of symbols.   Should they be non-GPL?

> +void fscache_withdraw_cache(struct fscache_cache *cache)
> +{
> +	struct fscache_object *object;
> +
> +	_enter("");
> +
> +	printk(KERN_NOTICE
> +	       "FS-Cache: Withdrawing cache \"%s\"\n",
> +	       cache->tag->name);
> +
> +	/* make the cache unavailable for cookie acquisition */
> +	down_write(&cache->withdrawal_sem);
> +
> +	down_write(&fscache_addremove_sem);
> +	list_del_init(&cache->link);
> +	cache->tag->cache = NULL;
> +	up_write(&fscache_addremove_sem);
> +
> +	/* mark all objects as being withdrawn */
> +	spin_lock(&cache->object_list_lock);
> +	list_for_each_entry(object, &cache->object_list, cache_link) {
> +		set_bit(FSCACHE_OBJECT_WITHDRAWN, &object->flags);
> +	}
> +	spin_unlock(&cache->object_list_lock);
> +
> +	/* make sure all pages pinned by operations on behalf of the netfs are
> +	 * written to disc */
> +	cache->ops->sync_cache(cache);
> +
> +	/* dissociate all the netfs pages backed by this cache from the block
> +	 * mappings in the cache */
> +	cache->ops->dissociate_pages(cache);
> +
> +	/* we now have to destroy all the active objects pertaining to this
> +	 * cache */
> +	spin_lock(&cache->object_list_lock);
> +
> +	while (!list_empty(&cache->object_list)) {
> +		object = list_entry(cache->object_list.next,
> +				    struct fscache_object, cache_link);
> +		list_del_init(&object->cache_link);
> +		spin_unlock(&cache->object_list_lock);
> +
> +		_debug("withdraw %p", object->cookie);
> +
> +		/* we've extracted an active object from the tree - now dispose
> +		 * of it */
> +		fscache_withdraw_object(cache, object);
> +
> +		spin_lock(&cache->object_list_lock);
> +	}

This looks livelockable.

> +	spin_unlock(&cache->object_list_lock);
> +
> +	fscache_release_cache_tag(cache->tag);
> +	cache->tag = NULL;
> +
> +	_leave("");
> +
> +} /* end fscache_withdraw_cache() */
> +
>
> ...
>
> +static struct fscache_object *fscache_lookup_object(struct fscache_cookie *cookie,
> +						    struct fscache_cache *cache)
> +{
> +	struct fscache_cookie *parent = cookie->parent;
> +	struct fscache_object *pobject, *object;
> +	struct hlist_node *_p;
> +
> +	_enter("{%s/%s},",
> +	       parent && parent->def ? parent->def->name : "",
> +	       cookie->def ? (char *) cookie->def->name : "<file>");
> +
> +	if (test_bit(FSCACHE_IOERROR, &cache->flags))
> +		return NULL;
> +
> +	/* see if we have the backing object for this cookie + cache immediately
> +	 * to hand
> +	 */
> +	object = NULL;
> +	hlist_for_each_entry(object, _p,
> +			     &cookie->backing_objects, cookie_link
> +			     ) {

That's really weird-looking.  How about

	hlist_for_each_entry(object, _p, &cookie->backing_objects, cookie_link){
		

> +	hlist_for_each_entry(pobject, _p,
> +			     &parent->backing_objects, cookie_link
> +			     ) {
> +		if (pobject->cache == cache)
> +			break;
> +	}

Ditto.

> +	if (!pobject) {
> +		/* we don't know about the parent object */
> +		up_read(&parent->sem);
> +		down_write(&parent->sem);
> +
> +		pobject = fscache_lookup_object(parent, cache);
> +		if (IS_ERR(pobject)) {
> +			up_write(&parent->sem);
> +			_leave(" = %ld [no ipobj]", PTR_ERR(pobject));
> +			return pobject;
> +		}
> +
> +		_debug("pobject=%p", pobject);
> +
> +		BUG_ON(pobject->cookie != parent);
> +
> +		downgrade_write(&parent->sem);
> +	}
> +
> +	/* now we can attempt to look up this object in the parent, possibly
> +	 * creating a representation on disc when we do so
> +	 */
> +	object = cache->ops->lookup_object(cache, pobject, cookie);
> +	up_read(&parent->sem);
> +
> +	if (IS_ERR(object)) {
> +		_leave(" = %ld [no obj]", PTR_ERR(object));
> +		return object;
> +	}
> +
> +	/* keep track of it */
> +	cache->ops->lock_object(object);
> +
> +	BUG_ON(!hlist_unhashed(&object->cookie_link));
> +
> +	/* attach to the cache's object list */
> +	if (list_empty(&object->cache_link)) {
> +		spin_lock(&cache->object_list_lock);
> +		list_add(&object->cache_link, &cache->object_list);
> +		spin_unlock(&cache->object_list_lock);
> +	}
> +
> +	/* attach to the cookie */
> +	object->cookie = cookie;
> +	atomic_inc(&cookie->usage);
> +	hlist_add_head(&object->cookie_link, &cookie->backing_objects);
> +
> +	/* done */
> +	cache->ops->unlock_object(object);

I assume ->lock_object() provides the locking for the hlist_add_head()?

> +	_leave(" = %p [new]", object);
> +	return object;
> +
> +} /* end fscache_lookup_object() */
>
> ...
>
> +/*
> + * update the index entries backing a cookie
> + */
> +void __fscache_update_cookie(struct fscache_cookie *cookie)
> +{
> +	struct fscache_object *object;
> +	struct hlist_node *_p;
> +
> +	if (cookie == FSCACHE_NEGATIVE_COOKIE) {
> +		_leave(" [no cookie]");
> +		return;
> +	}
> +
> +	_enter("{%s}", cookie->def->name);
> +
> +	BUG_ON(!cookie->def->get_aux);
> +
> +	down_write(&cookie->sem);
> +	down_read(&cookie->parent->sem);

That's a surprising locking order.  Normally things are parent-first.

> +	/* update the index entry on disc in each cache backing this cookie */
> +	hlist_for_each_entry(object, _p,
> +			     &cookie->backing_objects, cookie_link
> +			     ) {
> +		if (!test_bit(FSCACHE_IOERROR, &object->cache->flags))
> +			object->cache->ops->update_object(object);
> +	}
> +
> +	up_read(&cookie->parent->sem);
> +	up_write(&cookie->sem);
> +	_leave("");
> +
> +} /* end __fscache_update_cookie() */
> +
>
>...
>
> +int __fscache_pin_cookie(struct fscache_cookie *cookie)
> +{
> +	struct fscache_object *object;
> +	int ret;
> +
> +	_enter("%p", cookie);
> +
> +	if (hlist_empty(&cookie->backing_objects)) {
> +		_leave(" = -ENOBUFS");
> +		return -ENOBUFS;
> +	}
> +
> +	/* not supposed to use this for indexes */
> +	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
> +
> +	/* prevent the file from being uncached whilst we access it and exclude
> +	 * read and write attempts on pages
> +	 */
> +	down_write(&cookie->sem);
> +
> +	ret = -ENOBUFS;
> +	if (!hlist_empty(&cookie->backing_objects)) {
> +		/* get and pin the backing object */
> +		object = hlist_entry(cookie->backing_objects.first,
> +				     struct fscache_object, cookie_link);
> +
> +		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
> +			goto out;
> +
> +		if (!object->cache->ops->pin_object) {
> +			ret = -EOPNOTSUPP;
> +			goto out;
> +		}
> +
> +		/* prevent the cache from being withdrawn */
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {

trylock is often a sign that something is mucked up.  A comment describing
why it is needed is always appropriate.


> +			if (object->cache->ops->grab_object(object)) {
> +				/* ask the cache to honour the operation */
> +				ret = object->cache->ops->pin_object(object);
> +
> +				object->cache->ops->put_object(object);
> +			}
> +
> +			up_read(&object->cache->withdrawal_sem);
> +		}
> +	}
> +
> +out:
> +	up_write(&cookie->sem);
> +	_leave(" = %d", ret);
> +	return ret;
> +
> +} /* end __fscache_pin_cookie() */
> +
> +EXPORT_SYMBOL(__fscache_pin_cookie);
> +
> +/*****************************************************************************/
> +/*
> + * unpin an object into the cache
> + */
> +void __fscache_unpin_cookie(struct fscache_cookie *cookie)
> +{
> +	struct fscache_object *object;
> +	int ret;
> +
> +	_enter("%p", cookie);
> +
> +	if (hlist_empty(&cookie->backing_objects)) {
> +		_leave(" [no obj]");
> +		return;
> +	}
> +
> +	/* not supposed to use this for indexes */
> +	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
> +
> +	/* prevent the file from being uncached whilst we access it and exclude
> +	 * read and write attempts on pages
> +	 */
> +	down_write(&cookie->sem);
> +
> +	ret = -ENOBUFS;
> +	if (!hlist_empty(&cookie->backing_objects)) {
> +		/* get and unpin the backing object */
> +		object = hlist_entry(cookie->backing_objects.first,
> +				     struct fscache_object, cookie_link);
> +
> +		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
> +			goto out;
> +
> +		if (!object->cache->ops->unpin_object)
> +			goto out;
> +
> +		/* prevent the cache from being withdrawn */
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {

Ditto.

> +			if (object->cache->ops->grab_object(object)) {
> +				/* ask the cache to honour the operation */
> +				object->cache->ops->unpin_object(object);
> +
> +				object->cache->ops->put_object(object);
> +			}
> +
> +			up_read(&object->cache->withdrawal_sem);
> +		}
> +	}
> +
> +out:
> +	up_write(&cookie->sem);
> +	_leave("");
> +
> +} /* end __fscache_unpin_cookie() */
> +
>
>...
>
> +/*
> + * debug tracing
> + */
> +#define dbgprintk(FMT,...) \
> +	printk("[%-6.6s] "FMT"\n",current->comm ,##__VA_ARGS__)
> +#define _dbprintk(FMT,...) do { } while(0)
> +
> +#define kenter(FMT,...)	dbgprintk("==> %s("FMT")",__FUNCTION__ ,##__VA_ARGS__)
> +#define kleave(FMT,...)	dbgprintk("<== %s()"FMT"",__FUNCTION__ ,##__VA_ARGS__)
> +#define kdebug(FMT,...)	dbgprintk(FMT ,##__VA_ARGS__)
> +
> +#define kjournal(FMT,...) _dbprintk(FMT ,##__VA_ARGS__)
> +
> +#define dbgfree(ADDR)  _dbprintk("%p:%d: FREEING %p",__FILE__,__LINE__,ADDR)

That's your fourth implementation of kenter().  Maybe we
need <linux/dhowells.h>?

> --- /dev/null
> +++ b/fs/fscache/fsdef.c
> @@ -0,0 +1,113 @@
> +/* fsdef.c: filesystem index definition
> + *
> + * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include "fscache-int.h"
> +
> +static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
> +					    void *buffer, uint16_t bufmax);
> +
> +static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
> +					    void *buffer, uint16_t bufmax);

u16 would be preferred.

> new file mode 100644
> index 0000000..613c3b1
> --- /dev/null
> +++ b/fs/fscache/main.c
> @@ -0,0 +1,150 @@
> +/* main.c: general filesystem caching manager
> + *
> + * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/completion.h>
> +#include <linux/slab.h>
> +#include "fscache-int.h"
> +
> +int fscache_debug = 0;

unneeded initialisation

> +/*****************************************************************************/
> +/*
> + * initialise the fs caching module
> + */
> +static int fscache_init(void)

__init?

> +/*****************************************************************************/
> +/*
> + * release the ktype
> + */
> +static void fscache_ktype_release(struct kobject *kobject)
> +{
> +} /* end fscache_ktype_release() */

Don't empty kobject release() functions upset Greg?


> +#if 0
> +void __cyg_profile_func_enter (void *this_fn, void *call_site)
> +__attribute__((no_instrument_function));
> +
> +void __cyg_profile_func_enter (void *this_fn, void *call_site)
> +{
> +       asm volatile("  movl    %%esp,%%edi     \n"
> +                    "  andl    %0,%%edi        \n"
> +                    "  addl    %1,%%edi        \n"
> +                    "  movl    %%esp,%%ecx     \n"
> +                    "  subl    %%edi,%%ecx     \n"
> +                    "  shrl    $2,%%ecx        \n"
> +                    "  movl    $0xedededed,%%eax     \n"
> +                    "  rep stosl               \n"
> +                    :
> +                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
> +                    : "eax", "ecx", "edi", "memory", "cc"
> +                    );
> +}
> +
> +void __cyg_profile_func_exit(void *this_fn, void *call_site)
> +__attribute__((no_instrument_function));
> +
> +void __cyg_profile_func_exit(void *this_fn, void *call_site)
> +{
> +       asm volatile("  movl    %%esp,%%edi     \n"
> +                    "  andl    %0,%%edi        \n"
> +                    "  addl    %1,%%edi        \n"
> +                    "  movl    %%esp,%%ecx     \n"
> +                    "  subl    %%edi,%%ecx     \n"
> +                    "  shrl    $2,%%ecx        \n"
> +                    "  movl    $0xdadadada,%%eax     \n"
> +                    "  rep stosl               \n"
> +                    :
> +                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
> +                    : "eax", "ecx", "edi", "memory", "cc"
> +                    );
> +}
> +#endif

Removable?

> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> new file mode 100644
> index 0000000..b197be7
> --- /dev/null
> +++ b/fs/fscache/page.c
> @@ -0,0 +1,548 @@
> +/* page.c: general filesystem cache cookie management
> + *
> + * Copyright (C) 2004-5 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/fscache-cache.h>
> +#include <linux/buffer_head.h>
> +#include <linux/pagevec.h>
> +#include "fscache-int.h"
> +
> +/*****************************************************************************/
> +/*
> + * set the data file size on an object in the cache
> + */
> +int __fscache_set_i_size(struct fscache_cookie *cookie, loff_t i_size)
> +{
> +	struct fscache_object *object;
> +	int ret;
> +
> +	_enter("%p,%llu,", cookie, i_size);
> +
> +	if (hlist_empty(&cookie->backing_objects)) {
> +		_leave(" = -ENOBUFS");
> +		return -ENOBUFS;
> +	}
> +
> +	/* not supposed to use this for indexes */
> +	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
> +
> +	/* prevent the file from being uncached whilst we access it and exclude
> +	 * read and write attempts on pages
> +	 */
> +	down_write(&cookie->sem);
> +
> +	ret = -ENOBUFS;
> +	if (!hlist_empty(&cookie->backing_objects)) {
> +		/* get and pin the backing object */
> +		object = hlist_entry(cookie->backing_objects.first,
> +				     struct fscache_object, cookie_link);
> +
> +		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
> +			goto out;
> +
> +		/* prevent the cache from being withdrawn */
> +		if (object->cache->ops->set_i_size &&
> +		    down_read_trylock(&object->cache->withdrawal_sem)

another trylock.

> +		    ) {
> +			if (object->cache->ops->grab_object(object)) {
> +				/* ask the cache to honour the operation */
> +				ret = object->cache->ops->set_i_size(object,
> +								     i_size);
> +
> +				object->cache->ops->put_object(object);
> +			}
> +
> +			up_read(&object->cache->withdrawal_sem);
> +		}
> +	}
> +
> +out:
> +	up_write(&cookie->sem);
> +	_leave(" = %d", ret);
> +	return ret;
> +
> +} /* end __fscache_set_i_size() */
> +
> +EXPORT_SYMBOL(__fscache_set_i_size);
> +
> +/*****************************************************************************/
> +/*
> + * reserve space for an object
> + */
> +int __fscache_reserve_space(struct fscache_cookie *cookie, loff_t size)
> +{
> +	struct fscache_object *object;
> +	int ret;
> +
> +	_enter("%p,%llu,", cookie, size);
> +
> +	if (hlist_empty(&cookie->backing_objects)) {
> +		_leave(" = -ENOBUFS");
> +		return -ENOBUFS;
> +	}
> +
> +	/* not supposed to use this for indexes */
> +	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
> +
> +	/* prevent the file from being uncached whilst we access it and exclude
> +	 * read and write attempts on pages
> +	 */
> +	down_write(&cookie->sem);
> +
> +	ret = -ENOBUFS;
> +	if (!hlist_empty(&cookie->backing_objects)) {
> +		/* get and pin the backing object */
> +		object = hlist_entry(cookie->backing_objects.first,
> +				     struct fscache_object, cookie_link);
> +
> +		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
> +			goto out;
> +
> +		if (!object->cache->ops->reserve_space) {
> +			ret = -EOPNOTSUPP;
> +			goto out;
> +		}
> +
> +		/* prevent the cache from being withdrawn */
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {

and another.

> +			if (object->cache->ops->grab_object(object)) {
> +				/* ask the cache to honour the operation */
> +				ret = object->cache->ops->reserve_space(object,
> +									size);
> +
> +				object->cache->ops->put_object(object);
> +			}
> +
> +			up_read(&object->cache->withdrawal_sem);
> +		}
> +	}
> +
> +out:
> +	up_write(&cookie->sem);
> +	_leave(" = %d", ret);
> +	return ret;
> +
> +} /* end __fscache_reserve_space() */
> +
>
> ...
>
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> +		if (down_read_trylock(&object->cache->withdrawal_sem)) {

more.

> +
> +/* find the parent index object for a object */
> +static inline
> +struct fscache_object *fscache_find_parent_object(struct fscache_object *object)
> +{
> +	struct fscache_object *parent;
> +	struct fscache_cookie *cookie = object->cookie;
> +	struct fscache_cache *cache = object->cache;
> +	struct hlist_node *_p;
> +
> +	hlist_for_each_entry(parent, _p,
> +			     &cookie->parent->backing_objects,
> +			     cookie_link
> +			     ) {
> +		if (parent->cache == cache)
> +			return parent;
> +	}
> +
> +	return NULL;
> +}

Large?

> +#endif /* _LINUX_FSCACHE_CACHE_H */
> diff --git a/include/linux/fscache.h b/include/linux/fscache.h
> new file mode 100644
> index 0000000..8aa464b
> --- /dev/null
> +++ b/include/linux/fscache.h
> @@ -0,0 +1,484 @@
> +/* fscache.h: general filesystem caching interface
> + *
> + * Copyright (C) 2004-5 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_FSCACHE_H
> +#define _LINUX_FSCACHE_H
> +
> +#include <linux/config.h>
> +#include <linux/fs.h>
> +#include <linux/list.h>
> +#include <linux/pagemap.h>
> +#include <linux/pagevec.h>
> +
> +#ifdef CONFIG_FSCACHE_MODULE
> +#define CONFIG_FSCACHE
> +#endif

Defining symbols which are owned by the Kconfig system isn't very nice.

> +
> +/*
> + * request a page be stored in the cache
> + * - this request may be ignored if no cache block is currently allocated, in
> + *   which case it:
> + *   - returns -ENOBUFS
> + * - if a cache block was already allocated:
> + *   - a BIO will be dispatched to write the page (end_io_func will be called
> + *     from the completion function)
> + *   - returns 0
> + */
> +#ifdef CONFIG_FSCACHE
> +extern int __fscache_write_page(struct fscache_cookie *cookie,
> +				struct page *page,
> +				fscache_rw_complete_t end_io_func,
> +				void *end_io_data,
> +				gfp_t gfp);

hm.  I tend to find it better for functions to be documented at the
definition site, not at the declaration site.  I don't expect people even
think to go looking in a header file to learn about the function which
they're presently looking at.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem
  2006-04-20 16:59 ` [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem David Howells
@ 2006-04-21  0:57   ` Andrew Morton
  2006-04-21  1:16   ` Andrew Morton
  2006-04-21 14:49   ` David Howells
  2 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  0:57 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> +		/* let keventd have some air occasionally */
>  +		max--;
>  +		if (max < 0 || need_resched()) {
>  +			if (!list_empty(&object->read_list))
>  +				schedule_work(&object->read_work);
>  +			_leave(" [maxed out]");
>  +			return;
>  +		}

That's perhaps not a terribly effective way of multiplexing keventd cycles.
If someone has done a schedule_work(), that will stick an entry onto
keventd's worklist, but it won't necessarily set need_resched().

We'd need to extend the workqueue API to be able to determine whether
there's other work pending.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem
  2006-04-20 16:59 ` [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem David Howells
  2006-04-21  0:57   ` Andrew Morton
@ 2006-04-21  1:16   ` Andrew Morton
  2006-04-21 14:49   ` David Howells
  2 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21  1:16 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> ...
>
> +		ret = 0;
> +	}
> +	else if (cachefiles_has_space(cache, 1) == 0) {

	} else if (cachefiles_has_space(cache, 1) == 0) {

> +		/* there's space in the cache we can use */
> +		pagevec_add(&pagevec, page);
> +		cookie->def->mark_pages_cached(cookie->netfs_data,
> +					       page->mapping, &pagevec);
> +		ret = -ENODATA;
> +	}
> +	else {

	} else {

(many instances)

> +unsigned long cachefiles_debug = 0;

Unneeded initialisation.

> +static int cachefiles_init(void)

__init?

> +#if 0
> +void __cyg_profile_func_enter (void *this_fn, void *call_site)
> +__attribute__((no_instrument_function));
> +
> +void __cyg_profile_func_enter (void *this_fn, void *call_site)
> +{
> +       asm volatile("  movl    %%esp,%%edi     \n"
> +                    "  andl    %0,%%edi        \n"
> +                    "  addl    %1,%%edi        \n"
> +                    "  movl    %%esp,%%ecx     \n"
> +                    "  subl    %%edi,%%ecx     \n"
> +                    "  shrl    $2,%%ecx        \n"
> +                    "  movl    $0xedededed,%%eax     \n"
> +                    "  rep stosl               \n"
> +                    :
> +                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
> +                    : "eax", "ecx", "edi", "memory", "cc"
> +                    );
> +}
> +
> +void __cyg_profile_func_exit(void *this_fn, void *call_site)
> +__attribute__((no_instrument_function));
> +
> +void __cyg_profile_func_exit(void *this_fn, void *call_site)
> +{
> +       asm volatile("  movl    %%esp,%%edi     \n"
> +                    "  andl    %0,%%edi        \n"
> +                    "  addl    %1,%%edi        \n"
> +                    "  movl    %%esp,%%ecx     \n"
> +                    "  subl    %%edi,%%ecx     \n"
> +                    "  shrl    $2,%%ecx        \n"
> +                    "  movl    $0xdadadada,%%eax     \n"
> +                    "  rep stosl               \n"
> +                    :
> +                    : "i"(~(THREAD_SIZE-1)), "i"(sizeof(struct thread_info))
> +                    : "eax", "ecx", "edi", "memory", "cc"
> +                    );
> +}
> +#endif

removable?

> +/*
> + * delete an object representation from the cache
> + * - file backed objects are unlinked
> + * - directory backed objects are stuffed into the graveyard for userspace to
> + *   delete
> + * - unlocks the directory mutex
> + */
> +static int cachefiles_bury_object(struct cachefiles_cache *cache,
> +				  struct dentry *dir,
> +				  struct dentry *rep)
> +{
> +	struct dentry *grave, *alt, *trap;
> +	struct qstr name;
> +	const char *old_name;
> +	char nbuffer[8 + 8 + 1];
> +	int ret;
> +
> +	_enter(",'%*.*s','%*.*s'",
> +	       dir->d_name.len, dir->d_name.len, dir->d_name.name,
> +	       rep->d_name.len, rep->d_name.len, rep->d_name.name);
> +
> +	/* non-directories can just be unlinked */
> +	if (!S_ISDIR(rep->d_inode->i_mode)) {
> +		_debug("unlink stale object");
> +		ret = dir->d_inode->i_op->unlink(dir->d_inode, rep);
> +
> +		mutex_unlock(&dir->d_inode->i_mutex);

hm, what's going on here?  It's strange for a callee to undo an i_mutex
which some caller took.

>  EXPORT_SYMBOL(generic_file_buffered_write);
>  
> +int
> +generic_file_buffered_write_one_kernel_page(struct file *file,
> +					    pgoff_t index,
> +					    struct page *src)

Some covering comments would be nice.

> +{
> +	struct address_space *mapping = file->f_mapping;
> +	struct address_space_operations *a_ops = mapping->a_ops;
> +	struct pagevec	lru_pvec;
> +	struct page *page, *cached_page = NULL;
> +	void *from, *to;
> +	long status = 0;
> +
> +	pagevec_init(&lru_pvec, 0);
> +
> +	page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
> +	if (!page) {
> +		BUG_ON(cached_page);
> +		return -ENOMEM;
> +	}
> +
> +	status = a_ops->prepare_write(file, page, 0, PAGE_CACHE_SIZE);
> +	if (unlikely(status)) {
> +		loff_t isize = i_size_read(mapping->host);

If the host's i_mutex is held (it should be, but there are no comments)
then we can read inode->i_size directly.  Minor thing.


> +
> +	from = kmap_atomic(src, KM_USER0);
> +	to = kmap_atomic(page, KM_USER1);
> +	copy_page(to, from);
> +	kunmap_atomic(from, KM_USER0);
> +	kunmap_atomic(to, KM_USER1);

that's copy_highpage().
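
That is, the kmap/copy/kunmap sequence quoted above reduces to something like:

	copy_highpage(page, src);	/* copy src into page */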




Sigh.  It's all a huge pile of new code.  And it's only used by AFS, the
number of users of which can be counted on the fingers of one foot.  An NFS
implementation would make a testing phase much more useful.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit
  2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
                   ` (6 preceding siblings ...)
  2006-04-21  0:12 ` [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit Andrew Morton
@ 2006-04-21 10:22 ` David Howells
  2006-04-21 10:33   ` Andrew Morton
  7 siblings, 1 reply; 31+ messages in thread
From: David Howells @ 2006-04-21 10:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aviro, sct, nfsv4, steved, linux-kernel, David Howells, torvalds,
	linux-cachefs, linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> It would be better to rename PG_checked to PG_fs_misc kernel-wide.

So would deleting PG_checked and changing the PageChecked() macros to:

	#define PageChecked(page)		PageFsMisc((page))
	#define SetPageChecked(page)		SetPageFsMisc((page))
	#define ClearPageChecked(page)		ClearPageFsMisc((page))

be acceptable?  Or would you rather I replaced those too?

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit
  2006-04-21 10:22 ` David Howells
@ 2006-04-21 10:33   ` Andrew Morton
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21 10:33 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, dhowells, torvalds,
	linux-cachefs, linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > It would be better to rename PG_checked to PG_fs_misc kernel-wide.
> 
> So would deleting PG_checked and changing the PageChecked() macros to:
> 
> 	#define PageChecked(page)		PageFsMisc((page))
> 	#define SetPageChecked(page)		SetPageFsMisc((page))
> 	#define ClearPageChecked(page)		ClearPageFsMisc((page))
> 
> be acceptable?  Or would you rather I replaced those too?
> 

PG_checked is presently a misc bit which only filesystems use.  So yes, I'd
say it's appropriate to remove PageChecked() and friends altogether.

That might break out-of-tree filesystems, but they'll work it out.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 18:06   ` David Howells
  2006-04-21  0:11     ` Andrew Morton
@ 2006-04-21 10:57     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hch, aviro, nfsv4, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> That would seem to be a great shortcoming in fscache.

It's not something that's easy to get around.  The file has to be "open" to be
able to do certain operations on it.

> I guess as memory reclaim reaps the top-level dentries those file*'s will
> also be freed up, leading to their dentries becoming reclaimable, leading
> to their inodes being reclaimable.

Exactly.

> But still.  Is it not possible to release those files-pinned-by-dcache when
> the top-level files are closed?

No, for three (or maybe four) reasons:

 (1) You assume there's a "top-level" file open.  AFS lookup(), for example,
     will read the cache to get the directory contents, but there will _not_
     be an open top-level file.

     We could open and close the cache file in each lookup(), but that could
     be very bad for lookup performance.

 (2) mmap() may still have the struct file open, even though the last close()
     has happened.

 (3) There may be pages not yet written to the cache outstanding.  These
     belong to the netfs *inode* not the netfs *file*.  Whilst the flush or
     release file operation could be made to wait for these, that's not
     necessarily within their spec, and could take a long time.  How far is
     the flush op supposed to go anyway?

 (4) It prevents the data file going away whilst we have a cookie for it
     (someone might go into the cache and delete something they shouldn't).

     Pinning the dentry might work just as well, I suppose, but it makes
     little difference to the resource consumption.

     We keep at least the dentry available so that we can honour the netfs's
     requests to update the auxiliary data as quickly as possible.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
                     ` (2 preceding siblings ...)
  2006-04-21  0:07   ` Andrew Morton
@ 2006-04-21 12:33   ` David Howells
  2006-04-21 18:22     ` Andrew Morton
  2006-04-21 19:29     ` David Howells
  3 siblings, 2 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 12:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aviro, nfsv4, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> >  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
> > +static atomic_t nr_kernel_files;
> 
> So it's not performance-critical.

Hmmm... nowhere near as critical as the ENFILE accounting, plus the only place
we actually read it is for the sysctl file.

It could actually be dispensed with entirely, I suppose.

> > -struct file *get_empty_filp(void)
> > +struct file *get_empty_filp(int kernel)
> 
> I'd suggest a new get_empty_kernel_filp(void) rather than providing a magic
> argument.  (we can still have the magic argument in the new
> __get_empty_filp(int), but it shouldn't be part of the caller-visible API).
> ...
> It would be more flexible to make the caller pass in the flags directly.

So:

	struct file *get_empty_kernel_filp(unsigned short flags);

which devolves to get_empty_filp() if flags == 0?
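
A rough sketch of that shape (illustrative only, using the internal helper
suggested above):

	struct file *__get_empty_filp(int flags);

	struct file *get_empty_filp(void)
	{
		return __get_empty_filp(0);
	}

	struct file *get_empty_kernel_filp(unsigned short flags)
	{
		/* flags == 0 behaves exactly as get_empty_filp() */
		return __get_empty_filp(flags);
	}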


> > +EXPORT_SYMBOL(fget_light);
> 
> fget_light is not otherwise referenced in this patch.

Good point.  I'll move it into the cachefiles patch.

> > +EXPORT_SYMBOL(dentry_open_kernel);
> 
> _GPL?

If you wish.

> That's unfortunate.  There's still room in f_flags.  Was it hard to use that?

Yeah... but the usage of f_flags is constrained by O_xxxx flags that are part
of the userspace interface.  Using those up for purely kernel things is a bad
idea.

Note that I've not actually increased the size of the struct file - f_mode is
a 16-bit value, which is why I chose an unsigned short.

> This changes the format of /proc/sys/fs/file-nr.  What will break?

As far as I can tell, not a lot.  I've grepped through various etc, lib and
bin directories on my FC5 system, and the only match I've found is:

	/usr/lib64/sa/sadc

I'll present the count through a separate file to make sure.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] FS-Cache: Export find_get_pages()
  2006-04-20 17:45   ` David Howells
  2006-04-21  0:15     ` Andrew Morton
@ 2006-04-21 13:02     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hch, aviro, nfsv4, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> that's an open-coded pagevec_lookup().

Whilst that's true, it is still slower to use pagevec_lookup().  But since you
insist, I'll do that anyway.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7] FS-Cache: Generic filesystem caching facility
  2006-04-20 16:59 ` [PATCH 5/7] FS-Cache: Generic filesystem caching facility David Howells
  2006-04-21  0:46   ` Andrew Morton
@ 2006-04-21 14:15   ` David Howells
  2006-04-21 18:38     ` Andrew Morton
  2006-04-21 19:33     ` David Howells
  1 sibling, 2 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aviro, sct, nfsv4, steved, linux-kernel, David Howells, torvalds,
	linux-cachefs, linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> Interesting.  Perhaps a comment in there?

Yep.  It might even be worth making the actual check in linux/slab.h.

> We already did an up_read().

Yep.  Also the initialisation was done on the wrong variable (tag not
xtag)... which only worked because list_for_each_entry() always leaves a valid
pointer in tag.

> We have kmem_cache_zalloc() now.

It's not in the NFS tree I've just pulled.  Of course, git may be being a
complete git again.  I blame the original author:-)

> This looks livelockable.

How so?  The bit at the top of the function makes sure that we can't get any
new objects.

> > +	cache->ops->unlock_object(object);
> 
> I assume ->lock_object() provides the locking for the hlist_add_head()?

Half of it...  Cookie lookup and withdrawal has to deal with the interaction
between two locks:

 (1) The cookie lock.  This is taken first when approached from the netfs side.

 (2) The object lock.  This is taken first when approached from the cache side.

The cookie lock is superior to the object lock, and fscache_withdraw_object()
has to do lock reordering when withdrawing a cache from the system.

fscache_lookup_object() - to which you refer - must be called with the cookie
already write-locked.

> > +	down_write(&cookie->sem);
> > +	down_read(&cookie->parent->sem);
> 
> That's a surprising locking order.  Normally things are parent-first.

But the operation starts from the child cookie.

The order I've defined is that if both a cookie and its parent already exist,
then you must get the lock on the child before the lock on the parent if you
want both.

Lookup is okay, since the new cookie doesn't exist to the rest of the system
until we've finished, and should we need to walk up the index tree
instantiating indices, the locking is in the right order to just do that.
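
In outline, the netfs-side path therefore looks something like this (a sketch
mirroring the quoted code above):

	/* netfs side: child cookie first, then its parent */
	down_write(&cookie->sem);
	down_read(&cookie->parent->sem);

	/* ... operate on the backing objects ... */

	up_read(&cookie->parent->sem);
	up_write(&cookie->sem);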

> > +		/* prevent the cache from being withdrawn */
> > +		if (down_read_trylock(&object->cache->withdrawal_sem)) {
> 
> trylock is often a sign that something is mucked up.  A comment describing
> why it is needed is always appropriate.

In this case, it's not something mucked up.  The only thing that write-locks
cache->withdrawal_sem is the fscache_withdraw_cache(), and that then prevents
any further operations on that cache succeeding.

Obviously, before we withdraw the cache, down_read_trylock() will always
succeed, and after we withdraw the cache it will always fail; but not only
that, _whilst_ there's an operation in progress, the withdrawal will be made to
wait because that does down_write() *not* down_write_trylock().
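
In other words, the two sides pair up roughly like this (sketch):

	/* per-operation side */
	if (down_read_trylock(&object->cache->withdrawal_sem)) {
		/* the cache is still live: perform the operation */
		up_read(&object->cache->withdrawal_sem);
	}
	/* else: the cache is being withdrawn; fall out with -ENOBUFS */

	/* withdrawal side, in fscache_withdraw_cache() */
	down_write(&cache->withdrawal_sem);	/* waits for in-flight ops */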

I wonder if I should write some docs on the locking procedures used by
FS-Cache.

> That's your fourth implementation of kenter().  Maybe we
> need <linux/dhowells.h>?

:-)

Maybe I should move my debugging macros into include/linux, but then everyone
else would complain if their own versions weren't put in there, or would
complain if they were forced to use mine.

> u16 would be preferred.

As I recall, Linus said something to the effect that uint16_t and co are fine
in one's own interfaces.  I did actually ask him about that.

u16 should die really, IMO.

> > +int fscache_debug = 0;
> 
> unneeded initialisation

Yep.

> > +static int fscache_init(void)
> 
> __init?

Yep.

> Don't empty kobject release() functions upset Greg?

I hope so... then he might improve his docs:-)

The function would seem to be mandatory, but there really isn't anything for it
to do, since the object is static:-/

> Removable?

Yeah.  It's handy to have around for debugging, but it's very arch specific,
and can be added back later easily enough.

> Large?

Depends how you define "large".

It doesn't actually produce very much code.

> Defining symbols which are owned by the Kconfig system isn't very nice.

Kconfig is still broken:

	warthog>grep -r CONFIG_FSCACHE include/linux/autoconf.h 
	#define CONFIG_FSCACHE_MODULE 1
	warthog>

Modules that might depend on fscache need to know that it's there, and having
to double up every #if to detect both is stupid.

Would you suggest then:

	#if defined(CONFIG_FSCACHE) || defined(CONFIG_FSCACHE_MODULE)
	#define FSCACHE_AVAILABLE 1
	#endif

> hm.  I tend to find it better for functions to be documented at the
> definition site, not at the declaration site.  I don't expect people even
> think to go looking in a header file to learn about the function which
> they're presently looking at.

Two points:

 (1) The function they'll be looking for isn't in a .c file.

 (2) They *should* be looking in Documentation/ where all the kernel's
     interface functions are of course documented.  They will find the
     documentation for the FS-Cache interfaces there.  I'll add obvious banners
     to the tops of the header files telling them where to look.  I can even
     add them to the .c files, I suppose.

     Now I admit that I haven't added docs for get_empty_filp(),
     dentry_open_kernel() and suchlike because there's nowhere to actually add
     them yet.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem
  2006-04-20 16:59 ` [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem David Howells
  2006-04-21  0:57   ` Andrew Morton
  2006-04-21  1:16   ` Andrew Morton
@ 2006-04-21 14:49   ` David Howells
  2 siblings, 0 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 14:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aviro, sct, nfsv4, steved, linux-kernel, David Howells, torvalds,
	linux-cachefs, linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> > +unsigned long cachefiles_debug = 0;
> 
> Unneeded initialisation.

Yep.

> > +static int cachefiles_init(void)
> 
> __init?

Yep.

> removeable?

Yep.

> hm, what's going on here?  It's strange for a callee to undo an i_mutex
> which some caller took.

It happens occasionally.  The problem here is that I want to call this from
three different places, but if I drop the mutex before calling the burial
function, I have to get the mutex again to do the unlink; but as it is, I have
to drop it before I can do the rename:-/

It's not nice, but...

You have to note also that the directory's i_mutex is quite important for
interacting with the daemon.  The wonders of working through an existing
filesystem, and the wonders of co-operating with userspace.
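
Schematically, the awkward bit is something like this (hypothetical helpers,
not the actual CacheFiles code):

	/* Entered with dir->d_inode->i_mutex held by the caller. */
	static int example_bury_object(struct dentry *dir, struct dentry *victim)
	{
		int ret;

		if (example_can_just_unlink(victim)) {	/* hypothetical test */
			ret = example_unlink(dir, victim); /* needs i_mutex held */
			mutex_unlock(&dir->d_inode->i_mutex);
			return ret;
		}

		/* the rename needs both directories locked with lock_rename(),
		 * so the caller's i_mutex has to be dropped here first */
		mutex_unlock(&dir->d_inode->i_mutex);
		return example_rename_to_graveyard(victim);	/* hypothetical */
	}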

> > +int
> > +generic_file_buffered_write_one_kernel_page(struct file *file,
> > +					    pgoff_t index,
> > +					    struct page *src)
> 
> Some covering comments would be nice.

I copied those of generic_file_buffered_write() and rearranged them a bit:-)

I'll add a comment to my function.

> If the hosts's i_mutex is held (it should be, but there are no comments)
> then we can read inode->i_size directly.  Minor thing.

Ah.  Do we though?  I just copied generic_file_buffered_write() and cut it
down.  The same is done there.  The comments at the top of that function
weren't exactly forthcoming on the preconditions for calling it.
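
Just to illustrate the distinction being drawn (nothing more than a sketch):

	/* With the host inode's i_mutex held, i_size can be read directly;
	 * without it, i_size_read() is the safe accessor (it copes with
	 * 32-bit SMP via a seqcount). */
	static loff_t example_size(struct inode *inode, int i_mutex_held)
	{
		if (i_mutex_held)
			return inode->i_size;
		return i_size_read(inode);
	}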

> that's copy_highpage().

Good point.

> Sigh.  It's all a huge pile of new code.  And it's only used by AFS, the
> number of users of which can be counted on the fingers of one foot.  An NFS
> implementation would make a testing phase much more useful.

Yes...  Whilst I have it working with NFS, the NFS anti-aliasing problems are
still there and still need to be sorted.  I thought I'd got them nailed, but
then Trond changed his mind:-(

But that does not preclude putting what I can release up for review.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-21 12:33   ` David Howells
@ 2006-04-21 18:22     ` Andrew Morton
  2006-04-21 19:29     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21 18:22 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, dhowells, torvalds,
	linux-cachefs, linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> > > +struct file *get_empty_filp(int kernel)
> > 
> > I'd suggest a new get_empty_kernel_filp(void) rather than providing a magic
> > argument.  (we can still have the magic argument in the new
> > __get_empty_filp(int), but it shouldn't be part of the caller-visible API).
> > ...
> > It would be more flexible to make the caller pass in the flags directly.
> 
> So:
> 
> 	struct file *get_empty_kernel_filp(unsigned short flags);
> 
> which devolves to get_empty_filp() if flags == 0?
> 

argh, I forgot about the flag.  Oh well.  I'd suggest:

static inline struct file *get_empty_filp(void)
{
	return __get_empty_filp(0);
}

static inline struct file *get_empty_kernel_filp(void)
{
	return __get_empty_filp(0);
}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7] FS-Cache: Generic filesystem caching facility
  2006-04-21 14:15   ` David Howells
@ 2006-04-21 18:38     ` Andrew Morton
  2006-04-21 19:33     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2006-04-21 18:38 UTC (permalink / raw)
  To: David Howells
  Cc: aviro, sct, nfsv4, steved, linux-kernel, dhowells, torvalds,
	linux-cachefs, linux-fsdevel

David Howells <dhowells@redhat.com> wrote:
>
> > That's your fourth implementation of kenter().  Maybe we
> > need <linux/dhowells.h>?
> 
> :-)
> 
> Maybe I should move my debugging macros into include/linux, but then everyone
> else would complain if their own versions weren't put in there, or would
> complain if they were forced to use mine.

The number of home-made debugging macro implementations we have is quite
demented.  Developing (and maintaining) a common set would be a good idea,
IMO.

> It doesn't actually produce very much code.
> 
> > Defining symbols which are owned by the Kconfig system isn't very nice.
> 
> Kconfig is still broken:
> 
> 	warthog>grep -r CONFIG_FSCACHE include/linux/autoconf.h 
> 	#define CONFIG_FSCACHE_MODULE 1
> 	warthog>
> 
> Modules that might depend on fscache need to know that it's there,

In theory, module A isn't supposed to care whether module B was configured,
because module B might be compiled separately, or downloaded from elsewhere
or whatever.

> and having
> to double up every #if to detect both is stupid.
> 
> Would you suggest then:
> 
> 	#if defined(CONFIG_FSCACHE) || defined(CONFIG_FSCACHE_MODULE)
> 	#define FSCACHE_AVAILABLE 1
> 	#endif

yup.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files
  2006-04-21 12:33   ` David Howells
  2006-04-21 18:22     ` Andrew Morton
@ 2006-04-21 19:29     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 19:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aviro, nfsv4, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> argh, I forgot about the flag.  Oh well.  I'd suggest:

I presume you mean for the second to pass an argument of 1 to
__get_empty_filp().

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7] FS-Cache: Generic filesystem caching facility
  2006-04-21 14:15   ` David Howells
  2006-04-21 18:38     ` Andrew Morton
@ 2006-04-21 19:33     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: David Howells @ 2006-04-21 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aviro, nfsv4, linux-kernel, torvalds, linux-cachefs,
	linux-fsdevel

Andrew Morton <akpm@osdl.org> wrote:

> > Modules that might depend on fscache need to know that it's there,
> 
> In theory, module A isn't supposed to care whether module B was configured,
> because module B might be compiled separately, or dowloaded from elsewhere
> or whatever.

In this case it's sort of necessary - unless you're suggesting I make FS-Cache
mandatory...

The problem is that I don't want NFS or whatever to be carrying around the
cookie pointers if FS-Cache isn't compiled in, as leaving them out saves
memory.  But that
involves conditionally changing the composition of structures, something
that's most clearly done with cpp-conditionals.
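
For example, something along these lines (illustrative only; the structure and
macro are stand-ins, not the real nfs_inode):

	#if defined(CONFIG_FSCACHE) || defined(CONFIG_FSCACHE_MODULE)
	#define EXAMPLE_NETFS_FSCACHE 1
	#endif

	struct example_netfs_inode {
		struct inode	vfs_inode;
		/* ... */
	#ifdef EXAMPLE_NETFS_FSCACHE
		struct fscache_cookie *fscache;	/* only when FS-Cache is usable */
	#endif
	};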

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2006-04-21 19:33 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-04-20 16:59 [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit David Howells
2006-04-20 16:59 ` [PATCH 2/7] FS-Cache: Add notification of page becoming writable to VMA ops David Howells
2006-04-20 17:40   ` Zach Brown
2006-04-20 18:27     ` Anton Altaparmakov
2006-04-20 16:59 ` [PATCH 3/7] FS-Cache: Avoid ENFILE checking for kernel-specific open files David Howells
2006-04-20 17:18   ` Christoph Hellwig
2006-04-20 18:06   ` David Howells
2006-04-21  0:11     ` Andrew Morton
2006-04-21 10:57     ` David Howells
2006-04-21  0:07   ` Andrew Morton
2006-04-21 12:33   ` David Howells
2006-04-21 18:22     ` Andrew Morton
2006-04-21 19:29     ` David Howells
2006-04-20 16:59 ` [PATCH 4/7] FS-Cache: Export find_get_pages() David Howells
2006-04-20 17:19   ` Christoph Hellwig
2006-04-20 17:45   ` David Howells
2006-04-21  0:15     ` Andrew Morton
2006-04-21 13:02     ` David Howells
2006-04-20 16:59 ` [PATCH 5/7] FS-Cache: Generic filesystem caching facility David Howells
2006-04-21  0:46   ` Andrew Morton
2006-04-21 14:15   ` David Howells
2006-04-21 18:38     ` Andrew Morton
2006-04-21 19:33     ` David Howells
2006-04-20 16:59 ` [PATCH 6/7] FS-Cache: Make kAFS use FS-Cache David Howells
2006-04-20 16:59 ` [PATCH 7/7] FS-Cache: CacheFiles: A cache that backs onto a mounted filesystem David Howells
2006-04-21  0:57   ` Andrew Morton
2006-04-21  1:16   ` Andrew Morton
2006-04-21 14:49   ` David Howells
2006-04-21  0:12 ` [PATCH 1/7] FS-Cache: Provide a filesystem-specific sync'able page bit Andrew Morton
2006-04-21 10:22 ` David Howells
2006-04-21 10:33   ` Andrew Morton
