linux-fsdevel.vger.kernel.org archive mirror
From: David Howells <dhowells@redhat.com>
To: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@osdl.org>,
	dhowells@redhat.com, linux-kernel@vger.kernel.org,
	linux-cachefs@redhat.com, linux-fsdevel@vger.kernel.org,
	nfsv4@linux-nfs.org
Subject: Re: [PATCH 0/12] FS-Cache: Generic filesystem caching facility
Date: Tue, 15 Nov 2005 17:59:02 +0000
Message-ID: <11717.1132077542@warthog.cambridge.redhat.com>
In-Reply-To: <20051114150347.1188499e.akpm@osdl.org>

Andrew Morton <akpm@osdl.org> wrote:

> > > This series of patches does four things:
> > 
> > Ok, interesting, and I like most of what I see..
> 
> Less impressed.  It (still) adds a very large amount of tricksy code which
> pokes around in core pagecache functions,

What I'm trying to do is actually fairly simple in concept:

 (1) Have a metadata inode (imeta) that covers the block device.

 (2) Metadata pages are attached to imeta at page indexes corresponding to the
     block indexes on disk.

 (3) Metadata blocks on disk are to be considered invariant[*] once attached
     to the on-disk metadata tree rooted in the journal.

     [*] atimes and netfs metadata can be updated in place. They fit into a
     	 single sector, and so we assume changing them is atomic.

 (4) A new metadata tree is constructed by replacing the disk blocks that need
     to be modified with newly allocated blocks, then attaching the old
     unchanged branches to that new block. Think of RCU or LISP.

     The data then needs to be copied from the old block to the new, and the
     old block discarded when the journal is advanced.

 (5) If a copy of the old block is resident on a page in memory and is up to
     date with respect to the old block on disk, then we really don't want to
     have to copy it to another page in memory. This would require allocation
     of an extra page, as well as requiring a full-page copy, thus thrashing
     the data cache.

     What we do is to detach the page from imeta at the old block index and
     reattach it at the new block index (having allocated a new block first).

     However, this means that we potentially have to allocate new radix tree
     bits, and so we're subject to ENOMEM. But once we've allocated a new
     block, we don't want to have to try and roll the allocation back or
     recycle the block because that would potentially incur ENOMEM also...

     In fact, we don't want to take any errors at all (EIO we can deal with
     because that means the blockdev is screwed).

     So we pre-allocate sufficient radix tree bits and attach them to the task
     so that we can (a) sleep and (b) evade ENOMEM between allocating a new
     block and attaching it to the tree's superstructure.
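
As a rough illustration of step (4) above, here is a minimal userspace sketch
of the copy-and-share idea. It is not CacheFS code: struct meta_block, FANOUT
and cow_update() are names invented for this example, and real metadata blocks
live on disk rather than in malloc()'d memory. Only the blocks on the path to
the change are copied; every other branch is shared with the old tree, which
stays intact until it is discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FANOUT 4	/* hypothetical number of pointers per metadata block */

/* Hypothetical in-memory stand-in for an on-disk metadata block. */
struct meta_block {
	int payload;
	struct meta_block *child[FANOUT];
};

/*
 * Build a new tree in which the block reached by following path[0..depth-1]
 * from 'old' carries new_payload.  Blocks on that path are freshly allocated
 * copies; all other branches are shared with the old tree, which is never
 * modified.  Each path[] entry is assumed to be in the range 0..FANOUT-1.
 * Returns the new root, or NULL if allocation fails or the path runs off
 * the tree.
 */
static struct meta_block *cow_update(const struct meta_block *old,
				     const int *path, int depth,
				     int new_payload)
{
	struct meta_block *copy;

	if (!old)
		return NULL;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;

	*copy = *old;	/* copies the payload and shares every child pointer */

	if (depth == 0) {
		copy->payload = new_payload;
	} else {
		struct meta_block *sub;

		sub = cow_update(old->child[path[0]], path + 1, depth - 1,
				 new_payload);
		if (!sub) {
			free(copy);
			return NULL;
		}
		copy->child[path[0]] = sub;	/* only this branch differs */
	}
	return copy;
}
```

In this model, "advancing the journal" would correspond to switching the root
pointer to the value cow_update() returns and then releasing the old blocks
that the new tree no longer shares.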

Using buffer heads doesn't help, and using the blockdev's inode to hold pages
doesn't help. With my own metadata inode, I can keep my metadata in the
pagecache and the VM will schedule it to be written out. I also want to avoid
buffer heads because they use up a big additional chunk of memory I'd prefer
to avoid having to pin.

> slows down the radix-tree hotpath,

What I wanted was to be able to supply sufficient radix tree nodes in advance
that I wouldn't incur ENOMEM from that source, but I also needed to be able to
sleep after having loaded the cache, which meant I couldn't just shove the
extra in the per-CPU cache.

Admittedly, this is going to slow things down, and there's not a lot I can do
about that without adding full rollback support, which would be a lot of work,
particularly coping with the case of there being insufficient memory to
perform the rollback itself.

Another way to deal with this would be to provide alternate
add_to_page_cache*() and radix_tree_preload() or radix_tree_insert() functions
that could be given a cache from which to allocate radix tree nodes.
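
To make that alternative concrete, here is a minimal userspace sketch of an
insertion function that draws its nodes from a caller-supplied reserve. A toy
two-level index stands in for the radix tree, and every name in it (toy_tree,
node_reserve, toy_move_page() and so on) is invented for illustration; none
of it is an existing kernel interface. The point is the ordering: everything
that can fail happens before the page is detached, so the reattachment at the
new block index cannot hit ENOMEM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LEAF_SLOTS 64
#define TOP_SLOTS  64

/* Toy stand-ins: a two-level index whose leaves are allocated on demand. */
struct leaf { void *slot[LEAF_SLOTS]; };
struct toy_tree { struct leaf *top[TOP_SLOTS]; };

/* Caller-owned reserve; one spare leaf suffices for one insertion here. */
struct node_reserve { struct leaf *spare; };

/* The only step that can run out of memory. */
static int reserve_fill(struct node_reserve *r)
{
	if (!r->spare)
		r->spare = calloc(1, sizeof(*r->spare));
	return r->spare ? 0 : -1;
}

static void *toy_lookup(const struct toy_tree *t, unsigned int idx)
{
	const struct leaf *l;

	assert(idx < TOP_SLOTS * LEAF_SLOTS);
	l = t->top[idx / LEAF_SLOTS];
	return l ? l->slot[idx % LEAF_SLOTS] : NULL;
}

/* Insert using only the reserve; cannot fail once the reserve is filled. */
static void toy_insert_reserved(struct toy_tree *t, struct node_reserve *r,
				unsigned int idx, void *page)
{
	struct leaf **lp;

	assert(idx < TOP_SLOTS * LEAF_SLOTS);
	lp = &t->top[idx / LEAF_SLOTS];
	if (!*lp) {
		assert(r->spare);	/* caller must have filled the reserve */
		*lp = r->spare;
		r->spare = NULL;
	}
	(*lp)->slot[idx % LEAF_SLOTS] = page;
}

/* Detach a page; empty leaves are left in place to keep the toy simple. */
static void *toy_remove(struct toy_tree *t, unsigned int idx)
{
	struct leaf *l = t->top[idx / LEAF_SLOTS];
	void *page;

	if (!l)
		return NULL;
	page = l->slot[idx % LEAF_SLOTS];
	l->slot[idx % LEAF_SLOTS] = NULL;
	return page;
}

/* Move a resident page from old_idx to new_idx without copying it. */
static int toy_move_page(struct toy_tree *t, struct node_reserve *r,
			 unsigned int old_idx, unsigned int new_idx)
{
	void *page;

	if (!toy_lookup(t, old_idx))
		return -1;		/* nothing resident at old_idx */
	if (reserve_fill(r) < 0)
		return -1;		/* fail before anything has changed */

	/* A real cache would allocate the new disk block at this point. */
	page = toy_remove(t, old_idx);
	toy_insert_reserved(t, r, new_idx, page);
	return 0;
}
```

Because the reserve belongs to the caller rather than to a CPU, sleeping
between reserve_fill() and the insertion would not lose it, which is the
property the per-task cache is meant to provide.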

I could also separate out the two sorts of cache. I changed the form of the
radix tree cache from an array to a linked list with a counter. This uses less
memory at the head (which we want for adding to task_struct), but may well be
slower when dequeuing elements as the CPU's data cache can't help. I could
revert the per-CPU cache to the original form, whilst keeping the per-task
cache in the less-intrusive form.
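
For illustration, a linked-list-with-counter cache of the kind described could
look roughly like the userspace sketch below. The names (tree_node,
node_cache, node_cache_preload() and friends) are hypothetical and are not
taken from the patch. The head is just a pointer and a count, which is cheap
to embed in a per-task structure, but each dequeue has to read the node itself
to find the next one, whereas an array of pointers would keep that
bookkeeping in the head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tree node; next_free is only meaningful while it is cached. */
struct tree_node {
	struct tree_node *next_free;
	void *slots[64];
};

/* The whole cache head: one pointer plus one counter. */
struct node_cache {
	struct tree_node *head;
	unsigned int count;
};

/* Top the cache up to 'want' nodes; this is the step that may fail. */
static int node_cache_preload(struct node_cache *c, unsigned int want)
{
	while (c->count < want) {
		struct tree_node *n = calloc(1, sizeof(*n));

		if (!n)
			return -1;
		n->next_free = c->head;
		c->head = n;
		c->count++;
	}
	return 0;
}

/* Take one node; never allocates, so it cannot fail for lack of memory. */
static struct tree_node *node_cache_get(struct node_cache *c)
{
	struct tree_node *n = c->head;

	if (!n)
		return NULL;
	c->head = n->next_free;	/* pointer chase: touches the node's memory */
	n->next_free = NULL;
	c->count--;
	return n;
}

/* Release whatever was preloaded but not used. */
static void node_cache_drain(struct node_cache *c)
{
	struct tree_node *n;

	while ((n = node_cache_get(c)) != NULL)
		free(n);
}
```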

I could even remove the metadata from the pagecache entirely. I'd rather not do
that, though. Using the page cache has a lot of advantages, and they mostly
outweigh its disadvantages. Probably the biggest advantage is that I can leave
a metadata block I'm not using at the moment lying around in the pagecache; the
VM can discard it if it likes, but if not, it'll be there when I need it again.

> exports mysterious symbols.  And that's on a 60-second scan.

 (*) clear_page_dirty_for_io()

	Used by mpage_writepages() in mpage.c, of which I have a much
	simplified version.

 (*) lru_cache_add()

	Normally called indirectly via add_to_page_cache_lru(), but I wanted
	to call add_to_page_cache() directly. Sometimes I then add multiple
	pages to the LRU with pagevec_lru_add(), which is already exported,
	and sometimes a single page, which means using lru_cache_add().

> It'll be a sizeable job going through it in detail.  Not as sizeable as
> writing it though ;)

:-)

> All of this for an undisclosed speedup of AFS!

What about NFS?

> I think we need an NFS implementation and some numbers which make it
> interesting.  Or at least, some AFS numbers,

I'll generate some, at least for AFS.

> some explanation as to why they can be extrapolated to NFS and some degree of
> interest from the NFS guys.  Ditto CIFS.
> 
> Because it _is_ a lot of code.

Yes, I noticed that too:-)

David
