public inbox for linux-kernel@vger.kernel.org
Subject: Why cachefs lives directly on a block device
From: David Howells
Date: 2004-08-28 11:37 UTC
In-Reply-To: <20040828011349.31ccd069.akpm@osdl.org>
To: Andrew Morton, sct
Cc: linux-kernel, torvalds


Andrew Morton <akpm@osdl.org>:
> Maybe I'm being stoopid, but I don't see why this whole cachefs thing
> cannot operate by creating one regular file per netfs file on top of some
> existing underlying filesystem.
> 
> Why'd you design it this way?

There are several reasons:

 (1) User interface.

     I can use mount to add a cache that all interested filesystems can then
     just take immediate advantage of. Otherwise I need to find some other way
     of doing so. Admittedly, this is a minor point - I could just add a new
     pair of syscalls.

 (2) Performance.

     If I go through another filesystem, then I can't do DMA directly into
     the netfs's pages. Everything _has_ to be copied. Yes, I know direct I/O
     now exists, but it's a userspace feature that I'm not sure I can
     use.

     It was also made clear to me that I wouldn't be able to change this - ie:
     to add operations that start BIOs to read or write from a netfs page
     instead of the discfs's page in the pagecache. I can see why it might be
     hard to do for most discfs's: they are designed to use buffer heads and
     would have to keep track of BIOs in progress to blocks without having
     pages around to attach the information to.

     Furthermore, we end up going through two lots of readahead calculations,
     not one - the netfs does one and the discfs does one.

 (3) Memory.

     Having to copy to/from discfs pages also has potential implications for
     memory usage and memory pressure. You end up using a lot more memory at
     certain points - you have to get an extra page to do a read or a write;
     so if the VM is trying to dispose of a page that hasn't yet been written
     to the cache, it has to get a second page to be able to update the cache,
     and it _has_ to update the cache or punch a hole.

     Furthermore, every netfs inode in the cache has to have at least an
     inode around in memory all the time, and on a discfs you'd also have to
     have a struct dentry and probably a struct file. Cachefs only has to
     keep the inode in memory, not the dentry or file structs. Theoretically,
     I could probably dispense with the inode too, but it's probably more work
     than it's worth.

 (4) Holes.

     The discfs must support holes. The cache must be able to detect the holes
     and report to the netfs that it hasn't yet downloaded the data for that
     page. I suppose I could possibly use inode->i_op->bmap()...

     I can also punch holes in cachefs files, something that can't be done on
     other discfs's at the moment.

 (5) Data Consistency.

     Cachefs uses a pair of journals to keep track of the state of the cache
     and all the pages contained therein. This means that I don't get an
     inconsistent state in the on-disc cache and I don't lose disc space.
     
     One place where I take especial care is between the allocation of a block
     and its splicing into the usual on-disc pointer tree and the data having
     been written to disc. If power is interrupted and then restored, I can
     replay the journal and see that a block was allocated but not written and
     then punch it out. If the cache were backed by a discfs, I'm not certain
     what would happen.

     It may well be possible to mark the discfs's journal, if it has one, but
     how does the discfs deal with those marks?

     Knowing that your cache is in a good state is vitally important if you,
     say, put /usr on AFS. Someone we deal with puts everything barring /etc,
     /sbin, /lib and /var on AFS and has a humungous cache on every
     computer. Imagine if the power goes out and renders every cache
     inconsistent, requiring all the computers to nuke their caches when the
     power comes back on.

 (6) Recycling.

     Recycling is simple on cachefs. I can just scan the metadata index to
     look for inodes that require reclamation/recycling; and I can also build
     up a list of the oldest inodes so that I can nuke them to make space.

     Doing this on a discfs would require a search going down through a nest
     of directories, and would probably have to be done in userspace.

 (7) Disc Space.

     I'd want to set a maximum size to the cache, but I can't guarantee being
     able to reach that maximum size on a discfs.

     If the recycler starts to nuke cache files to make space, the freed
     blocks may just be eaten directly by userspace programs, potentially
     resulting in the entire cache being nuked. Alternatively, netfs
     operations may end up being held up because the cache can't get blocks on
     which to store the data.

     With cachefs, I can guarantee that I have access to every block.

 (8) Users.

     Users can't go into cachefs and run amok. The worst they can do is cause
     bits of the cache to be recycled early. With a discfs backed cache, they
     can do all sorts of bad things to the files belonging to the cache, and
     they can do this quite by accident.
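
The replay step described in point (5) can be sketched as follows. This is a
minimal Python model of the logic only, not kernel code. The record kinds
(ALLOC, WRITTEN), the JournalEntry type and the blocks_to_punch() helper are
hypothetical names invented for illustration; the actual cachefs journal
format is not described in this message.

```python
from dataclasses import dataclass

# Hypothetical record kinds; the real on-disc journal layout is an unknown here.
ALLOC = "alloc"      # block allocated and spliced into the pointer tree
WRITTEN = "written"  # data for that block has reached the disc


@dataclass(frozen=True)
class JournalEntry:
    kind: str
    block: int


def blocks_to_punch(journal):
    """Replay the journal and return blocks allocated but never written."""
    pending = set()
    for entry in journal:
        if entry.kind == ALLOC:
            pending.add(entry.block)
        elif entry.kind == WRITTEN:
            pending.discard(entry.block)
    # Anything still pending holds no valid data and should be punched out.
    return sorted(pending)


journal = [
    JournalEntry(ALLOC, 7),
    JournalEntry(WRITTEN, 7),
    JournalEntry(ALLOC, 9),  # power lost before the data was written
]
punched = blocks_to_punch(journal)
```

In this model block 7 survives replay because its write completed, while
block 9 is handed back as a hole, so the netfs would simply fetch that page
again instead of trusting stale contents.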


There would be some advantages to using a file-based cache rather than a
blockdev-based cache:

 (1) Writing to the cache.

     Having to copy to or from a discfs's page means that a netfs can just
     make the copy and then assume its own page is ready to go.

 (2) Doesn't require its own blockdev.

     You just nominate a directory and go from there; you don't have to
     repartition or install an extra drive to make use of cachefs in an
     existing system.

 (3) Can use xattrs to store netfs data about a file.

     Cachefs requires the netfs to store a key in any pertinent index entry,
     and it also permits arbitrary data to be stored there.

     A discfs could be requested to store the netfs's data in xattrs, and the
     filename could be used to store the key, though the key would have to be
     rendered as text not binary. Likewise indexes could be rendered as
     directories with xattrs.

 (4) You can easily make your cache bigger if the discfs has plenty of space.
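
Point (3) notes that a binary key would have to be rendered as text before it
could serve as a filename. One way to do that, sketched here in Python, is
hex encoding. Hex is an assumption chosen for illustration (the message does
not name an encoding), and key_to_filename() and filename_to_key() are
hypothetical helpers.

```python
def key_to_filename(key: bytes) -> str:
    """Render a binary netfs key as text that is safe to use as a filename."""
    # Hex output contains only [0-9a-f], so it can never include '/' or NUL.
    return key.hex()


def filename_to_key(name: str) -> bytes:
    """Recover the original binary key from its hex filename."""
    return bytes.fromhex(name)


key = b"\x00\x01afs/\xff"  # contains NUL and '/', both illegal in a filename
name = key_to_filename(key)
```

Hex doubles the length of the key, so a long key could run into the
filename length limit of the discfs; a denser encoding would trade
readability for space.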


One good point, though: I've tried to develop the cachefs-netfs interface
(cachefs.h) so that it is agnostic with respect to how underlying caches
work. This means that if the underlying mechanism changes radically, any
netfs that uses it won't have to change.

It should also be possible to change cachefs's interface such that caches of
different types can be mixed. fs/cachefs/interface.c doesn't really care, and
it could be split from cachefs entirely.
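
The idea of a backend-agnostic interface can be modelled roughly as below.
This is a Python sketch under stated assumptions, not the contents of
cachefs.h: CacheBackend, BlockDevCache, FileBackedCache and netfs_read() are
all hypothetical names, and both backends are stand-ins that keep pages in
memory. A read that returns None models a hole, that is, a page the cache
has not yet been given.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class CacheBackend(ABC):
    """Hypothetical interface that the netfs codes against."""

    @abstractmethod
    def read_page(self, key: bytes, index: int) -> Optional[bytes]:
        """Return the cached page, or None if it was never stored (a hole)."""

    @abstractmethod
    def write_page(self, key: bytes, index: int, data: bytes) -> None:
        """Store one page of netfs data."""


class BlockDevCache(CacheBackend):
    """Stand-in for a blockdev-based cache: one flat page table."""

    def __init__(self) -> None:
        self._pages = {}

    def read_page(self, key, index):
        return self._pages.get((key, index))

    def write_page(self, key, index, data):
        self._pages[(key, index)] = data


class FileBackedCache(CacheBackend):
    """Stand-in for a file-backed cache: one 'file' (dict) per key."""

    def __init__(self) -> None:
        self._files = {}

    def read_page(self, key, index):
        return self._files.get(key, {}).get(index)

    def write_page(self, key, index, data):
        self._files.setdefault(key, {})[index] = data


def netfs_read(cache: CacheBackend, key: bytes, index: int,
               fetch: Callable[[int], bytes]) -> bytes:
    """Netfs-side read: use the cache if it has the page, else fetch and fill."""
    page = cache.read_page(key, index)
    if page is None:  # hole: data not yet downloaded
        page = fetch(index)
        cache.write_page(key, index, page)
    return page
```

Because netfs_read() only touches the two interface methods, either backend
(or a mix of both) can be passed in without changing the netfs side.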

If you're going to insist it becomes file-backed, then will you be willing to
lend your support if I want to change discfs's to make this easier?

David
