From: linux@horizon.com
To: linux-fsdevel@vger.kernel.org
Subject: Limitations of ZFS
Date: 19 Nov 2005 17:31:51 -0500
Message-ID: <20051119223151.23457.qmail@science.horizon.com>

ZFS has a bunch of neat features, but I've been looking through the
source trying to understand the on-disk format and what it can't do.

The "always write out of place" design is right out of NetApp's WAFL,
described in some Usenix paper I read a long time ago.  Adding the
intent log to absorb synchronous writes without forcing a global commit
is a nice optimization.

While I think 128 bits is a bit silly, I really like the "multiple file
systems in one storage pool" design, and the "block pointer includes a
checksum" business is without any question a Good Thing.  (It also
allows cryptographic possibilities of the "block pointer includes an
IV" nature.)  In particular, as disks keep getting bigger without
getting much faster, the cost of the checksumming goes down while the
risk it protects against goes up.

The challenge any such design faces is garbage collection - how do you
know when a block is no longer in use? - and the limitations I've found
are all based on that.  You can see a description of the algorithm ZFS
uses at http://blogs.sun.com/roller/page/ahrens/20051117 but we should
explore its implications.

Every block is tracked by its birth and death - the first snapshot in
which it's referenced and the last one.  When all snapshots between
those two times have been deleted, the block can be reclaimed.  Note
that this means that hard links to snapshot files, or any other way of
making new references to snapshot data, are illegal, because that would
violate the linear nature of snapshots.

Similarly, although ZFS allows you to make a so-called "clone" live
file system out of a snapshot, there is no way to track block lifetime
accurately, so the base snapshot has to remain as long as any clone
rooted at it exists.  In particular, you can't make two clones and then
decide which one to call "mainline" later.  If you choose wrong, you're
stuck with a snapshot you can't get rid of until you've backed out all
the clones.

This is a bit like the old Linux VM system - there's no way to find all
pointers to a block, so you have to traverse all the virtual address
spaces rather than the physical one.  I haven't found the code yet, but
I expect that if a "scrubbing" operation finds a hard-failed bad block,
it has to find the problem and copy the block once for each pointer to
it.  (Of course, if the error can be corrected by rewriting in place,
it gets easier.)  This is also what makes it impossible to migrate
storage off a non-mirrored drive.

One possible solution to the above problems would be to attach a block
bitmap to each snapshot (run-length encoded and then Golomb coded for
space efficiency, of course), and then arrange the bitmaps in a tree,
where the root bitmap covers all blocks in use anywhere.  Whenever you
create or delete a snapshot, it's O(log n) steps to add or remove an
entry in the tree and recompute all the parent bitmaps up to the root.
This would also make it reasonably efficient to find all snapshots that
refer to a block or range of blocks - for every bitmap that has the
block set, check all its children and recurse as needed.  (A rough
sketch of that lookup follows.)
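To make that lookup concrete, here is a minimal sketch in C.  None of
this is from the ZFS source - the structure and function names are
mine, and the bitmaps are shown as flat bit arrays rather than the
RLE-plus-Golomb encoding suggested above - but it shows why the tree
prunes the search:

#include <stddef.h>

/*
 * Sketch only: each interior node's bitmap is the union (OR) of its
 * children's bitmaps, so a clear bit anywhere lets us skip the whole
 * subtree below it.
 */
struct bitmap_node {
        const unsigned long *bits;      /* one bit per block in the pool */
        struct bitmap_node **child;     /* NULL for a leaf, else NULL-terminated */
        int snapshot_id;                /* meaningful only at a leaf */
};

static int block_bit(const unsigned long *bits, size_t blkno)
{
        const size_t bpl = 8 * sizeof(unsigned long);

        return (bits[blkno / bpl] >> (blkno % bpl)) & 1;
}

/*
 * Call cb() for every snapshot (leaf) whose bitmap has blkno set.
 * Subtrees whose union bitmap doesn't have the bit set are never
 * visited, which is where the efficiency comes from.
 */
static void snapshots_with_block(const struct bitmap_node *n, size_t blkno,
                                 void (*cb)(int snapshot_id, void *arg),
                                 void *arg)
{
        struct bitmap_node **c;

        if (!block_bit(n->bits, blkno))
                return;
        if (!n->child) {
                cb(n->snapshot_id, arg);
                return;
        }
        for (c = n->child; *c; c++)
                snapshots_with_block(*c, blkno, cb, arg);
}

Freeing is then the same question asked of the root: a block is
reclaimable exactly when the root bitmap has its bit clear, and
deleting a snapshot means dropping one leaf and recomputing the OR on
the path back up, which is the O(log n) part.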
(Note to patent attorneys: the above suggestions are "obvious to a
person having ordinary skill in the art" of file system design.
They're straight out of the book _Managing Gigabytes_.)

To be studied is whether storing difference bitmaps - blocks that are
in the parent but NOT in this snapshot - would be enough of a saving
over storing direct bitmaps to be worth the complexity.  Some kind of
self-adjusting tree, where frequently modified file systems stay near
the root and long-lived snapshots get clustered together in a deep part
of the tree, would be ideal.

One thing that should be possible, and I wish ZFS supported, is mixing
redundancy levels within a storage pool.  E.g. I'd like my git
repositories mirrored heavily, but /var/cache/squid can be straight
RAID-0.  I haven't dug through ZFS's block pointer structure yet to see
how this could be done.

One thing I haven't found in the code yet is how it handles completely
full storage pools.  Does it deadlock, or is there a reserve to allow
processing of a delete as long as it produces a net reduction in
allocated blocks?  The accounting to allow that would be complicated,
but it's not obvious how else to handle it.  (A sketch of what I mean
by such a reserve is at the end of this message.)

Something I presume they have studied, but I'm not sure of the details
of, is the long-term fragmentation behaviour.  Always writing out of
place certainly fragments the hell out of database files.  (Something
like Reiserfs's opportunistic packing would help.)

Anyway, while there are good ideas here, I'm not sure a direct copy is
even the best thing to do.  Surely it's possible to do better?
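P.S. For what it's worth, here is a minimal sketch of the
reserve-for-deletes idea mentioned above, again in C.  Nothing in it is
from the ZFS code; the names, the structure, and the 2% figure are all
invented for illustration:

#include <stdint.h>
#include <stdbool.h>

/*
 * Sketch: keep a small slice of the pool off-limits to ordinary
 * writes.  Because everything is copy-on-write, even a delete has to
 * allocate fresh metadata blocks before anything is freed, so the
 * reserve is what lets a delete make forward progress on a "full"
 * pool.
 */
#define RESERVE_PCT     2       /* arbitrary */

struct pool_space {
        uint64_t total_blocks;
        uint64_t alloc_blocks;
};

/*
 * Ordinary writes stop at total - reserve; operations known to give a
 * net reduction in allocated blocks may dip into the reserve.
 */
static bool can_allocate(const struct pool_space *p, uint64_t want,
                         bool net_freeing_op)
{
        uint64_t reserve = p->total_blocks * RESERVE_PCT / 100;
        uint64_t limit = net_freeing_op ? p->total_blocks
                                        : p->total_blocks - reserve;

        return p->alloc_blocks + want <= limit;
}

The complicated accounting is in deciding, up front, whether a given
operation really is net-freeing; that's the part I don't see how to
avoid.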