From: linux@horizon.com
To: linux-fsdevel@vger.kernel.org
Subject: Limitations of ZFS
Date: 19 Nov 2005 17:31:51 -0500
Message-ID: <20051119223151.23457.qmail@science.horizon.com>

ZFS has a bunch of neat features, but I've been looking through the
source trying to understand the on-disk format and what it can't do.

The "always write out of place" design is right out of NetApp's WAFL,
described in some Usenix paper I read a long time ago.  Adding the
intent log to absorb synchronous writes without forcing a global commit
is a nice optimization.

While I think 128 bits is a bit silly, I really like the "multiple file
systems in one storage pool" design, and the "block pointer includes a
checksum" business is without any question a Good Thing.  (It also
allows cryptographic possibilities of the "block pointer includes an
IV" nature.)  In particular, as disks keep getting bigger without
getting much faster, the cost of the checksumming goes down while the
risk it protects against goes up.

The challenge any such design faces is garbage collection - how do you
know when a block is no longer in use? - and the limitations I've found
are all based on that.  You can see a description of the algorithm ZFS
uses at http://blogs.sun.com/roller/page/ahrens/20051117 but we should
explore its implications.

Every block is tracked by its birth and death - the first snapshot in
which it's referenced and the last one.  When all snapshots between
those two times have been deleted, the block can be reclaimed.  Note
that this means that hard links to snapshot files, or any other way of
making new references to snapshot data, are illegal, because that would
violate the linear nature of snapshots.

Similarly, although ZFS allows you to make a so-called "clone" live
file system out of a snapshot, there is no way to track block lifetime
accurately, so the base snapshot has to remain as long as any clone
rooted at it exists.  In particular, you can't make two clones and then
decide which one to call "mainline" later.  If you choose wrong, you're
stuck with a snapshot you can't get rid of until you've backed out all
the clones.

This is a bit like the old Linux VM system - there's no way to find all
pointers to a block, so you have to traverse all the virtual address
spaces rather than the physical one.  I haven't found the code yet, but
I expect that if a "scrubbing" operation finds a hard-failed bad block,
it has to find the problem and copy the block once for each pointer to
it.  (Of course, if the error can be corrected by rewriting in place,
it gets easier.)  This is also what makes it impossible to migrate
storage off a non-mirrored drive.

One possible solution to the above problems would be to attach a block
bitmap to each snapshot (run-length encoded and then Golomb coded for
space efficiency, of course), and then arrange the bitmaps in a tree,
where the root bitmap covers all blocks in use anywhere.  Whenever you
create or delete a snapshot, it's O(log n) steps to add or remove an
entry in the tree and recompute all the parent bitmaps up to the root.
This would also make it reasonably efficient to find all snapshots that
refer to a block or range of blocks - for every bitmap that has the
block set, check all its children and recurse as needed.  (A rough
sketch of that lookup follows.)
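To make that lookup concrete, here is a minimal sketch in C.  None of
this is from the ZFS source - the structure and function names are
mine, and the bitmaps are shown as flat bit arrays rather than the
RLE-plus-Golomb encoding suggested above - but it shows why the tree
prunes the search:

#include <stddef.h>

/*
 * Sketch only: each interior node's bitmap is the union (OR) of its
 * children's bitmaps, so a clear bit anywhere lets us skip the whole
 * subtree below it.
 */
struct bitmap_node {
        const unsigned long *bits;      /* one bit per block in the pool */
        struct bitmap_node **child;     /* NULL for a leaf, else NULL-terminated */
        int snapshot_id;                /* meaningful only at a leaf */
};

static int block_bit(const unsigned long *bits, size_t blkno)
{
        const size_t bpl = 8 * sizeof(unsigned long);

        return (bits[blkno / bpl] >> (blkno % bpl)) & 1;
}

/*
 * Call cb() for every snapshot (leaf) whose bitmap has blkno set.
 * Subtrees whose union bitmap doesn't have the bit set are never
 * visited, which is where the efficiency comes from.
 */
static void snapshots_with_block(const struct bitmap_node *n, size_t blkno,
                                 void (*cb)(int snapshot_id, void *arg),
                                 void *arg)
{
        struct bitmap_node **c;

        if (!block_bit(n->bits, blkno))
                return;
        if (!n->child) {
                cb(n->snapshot_id, arg);
                return;
        }
        for (c = n->child; *c; c++)
                snapshots_with_block(*c, blkno, cb, arg);
}

Freeing is then the same question asked of the root: a block is
reclaimable exactly when the root bitmap has its bit clear, and
deleting a snapshot means dropping one leaf and recomputing the OR on
the path back up, which is the O(log n) part.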
(Note to patent attorneys: the above suggestions are "obvious to a
person having ordinary skill in the art" of file system design.
They're straight out of the book _Managing Gigabytes_.)

To be studied is whether storing difference bitmaps - blocks that are
in the parent but NOT in this snapshot - would be enough of a saving
over storing direct bitmaps to be worth the complexity.  Some kind of
self-adjusting tree, where frequently modified file systems stay near
the root and long-lived snapshots get clustered together in a deep part
of the tree, would be ideal.

One thing that should be possible, and I wish ZFS supported, is mixing
redundancy levels within a storage pool.  E.g. I'd like my git
repositories mirrored heavily, but /var/cache/squid can be straight
RAID-0.  I haven't dug through ZFS's block pointer structure yet to see
how this could be done.

One thing I haven't found in the code yet is how it handles completely
full storage pools.  Does it deadlock, or is there a reserve to allow
processing of a delete as long as it produces a net reduction in
allocated blocks?  The accounting to allow that would be complicated,
but it's not obvious how else to handle it.  (A sketch of what I mean
by such a reserve is at the end of this message.)

Something I presume they have studied, but I'm not sure of the details
of, is the long-term fragmentation behaviour.  Always writing out of
place certainly fragments the hell out of database files.  (Something
like Reiserfs's opportunistic packing would help.)

Anyway, while there are good ideas here, I'm not sure a direct copy is
even the best thing to do.  Surely it's possible to do better?
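P.S. For what it's worth, here is a minimal sketch of the
reserve-for-deletes idea mentioned above, again in C.  Nothing in it is
from the ZFS code; the names, the structure, and the 2% figure are all
invented for illustration:

#include <stdint.h>
#include <stdbool.h>

/*
 * Sketch: keep a small slice of the pool off-limits to ordinary
 * writes.  Because everything is copy-on-write, even a delete has to
 * allocate fresh metadata blocks before anything is freed, so the
 * reserve is what lets a delete make forward progress on a "full"
 * pool.
 */
#define RESERVE_PCT     2       /* arbitrary */

struct pool_space {
        uint64_t total_blocks;
        uint64_t alloc_blocks;
};

/*
 * Ordinary writes stop at total - reserve; operations known to give a
 * net reduction in allocated blocks may dip into the reserve.
 */
static bool can_allocate(const struct pool_space *p, uint64_t want,
                         bool net_freeing_op)
{
        uint64_t reserve = p->total_blocks * RESERVE_PCT / 100;
        uint64_t limit = net_freeing_op ? p->total_blocks
                                        : p->total_blocks - reserve;

        return p->alloc_blocks + want <= limit;
}

The complicated accounting is in deciding, up front, whether a given
operation really is net-freeing; that's the part I don't see how to
avoid.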