From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Copeland
Subject: [RFC][PATCH 0/7] Optimized MPEG file system
Date: Wed, 15 Mar 2006 22:01:44 -0500
Message-ID: <11424781042886-git-send-email-me@bobcopeland.com>
Reply-To: Bob Copeland
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
Cc: Bob Copeland
Return-path:
Received: from mail.deathmatch.net ([216.200.85.210]:34788 "EHLO mail.deathmatch.net") by vger.kernel.org with ESMTP id S1752165AbWCPDBr (ORCPT ); Wed, 15 Mar 2006 22:01:47 -0500
In-Reply-To:
To: linux-fsdevel@vger.kernel.org
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

The following set of patches implement a new filesystem: the "Optimized
MPEG File System" or OMFS (their name, not mine; I make no claims of
actual optimization or utility with digital video). OMFS is used in
ReplayTV as well as the Rio Karma, but has only been tested with the
latter. This is a bit rough around the edges but I'd like to get some
feedback in its current state.

The structures were reverse engineered so not everything is known, but
here are the basics:

 - The directories are implemented as a hash table with hashing on the
   name. The mixture of an unfortunate hash function, file naming
   policy, and block size means that in practice only a small number of
   buckets are utilized, and chains tend to be somewhat long (usually 5
   or 6 links).

 - Files are implemented as a table of extents: a list of block numbers
   and the count of contiguous blocks allocated starting from that
   block.

 - All filesystem objects have CRCs and mirrors. Presumably this is so
   online consistency checks can be made; this module updates them but
   otherwise ignores them.

 - ReplayTV, I am told (note I have never seen an RTV image), does not
   have an on disk free bitmap. The Karma has one however. Presumably
   the ReplayTV scans the entire disk every time it's mounted.

 - The fs typically uses a large block size (8k).
   It also only uses part of a given block for "system" structures
   (e.g. inodes), so there is a lot of wasted space in the
   book-keeping.

 - Some ReplayTV models write everything byte-swapped. I don't do
   anything about this.

More documentation can be found at:

  http://rtvpatch.sourceforge.net/omfs.html
  http://bobcopeland.com/karma/

Consequently, the following "interesting" choices were made:

 - Because the blocksize is larger than a page, the FS sets the
   blocksize to the system structure blocksize (usually 2k) and
   multiplies the block numbers accordingly when needed.

 - In order to traverse the hash table inside readdir without extra
   state, the bucket index is encoded into the upper byte of fpos and
   the link index is stored in the lower 24 bits. No existing block
   size allows more than 256 hash buckets.

 - To support the ReplayTV, the module allocates space for the whole
   block bitmap at mount time instead of reading the bitmap blocks as
   needed. The code to traverse the tree for RTV at startup isn't there
   yet.

 - CRCs and mirroring are performed whenever the inode is written.

 - I realize extent traversal is very non optimal and could probably
   benefit from Badari's new get_block(s) stuff. OTOH, I don't expect
   many people to use this FS for anything but interoperability, so I'm
   trying to keep it simple to start.

At present there is no locking and several required filesystem features
are missing. Particularly I am interested in the following feedback:

 - is the use of sb_bread everywhere kosher or should that only be done
   for the superblock and in get_block(), and the pagecache used
   everywhere else?

 - the disk has a structure that doesn't represent a file or a
   directory (a continuation of the extent table for extremely
   fragmented files) which otherwise operates the same as an inode.
   Could/should it be an inode even if it wouldn't go into the dcache,
   i.e. just for mark_inode_dirty? Right now it has its own write path.

 - is the readdir thing legal?

 - what else looks really wrong?
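
For reference, the readdir position encoding described above (bucket
index in the upper byte, link index in the lower 24 bits) can be
sketched roughly as follows. This is only an illustration, not the code
from the patches: the helper names are hypothetical, and it assumes the
position is treated as a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names. Layout as described in the text:
 * upper 8 bits = hash bucket index, lower 24 bits = link index in chain. */
#define FPOS_LINK_BITS 24u
#define FPOS_LINK_MASK ((1u << FPOS_LINK_BITS) - 1u)

/* Pack a (bucket, link) pair into one 32-bit position (assumed width). */
static uint32_t fpos_encode(uint32_t bucket, uint32_t link)
{
	assert(bucket < 256u);          /* must fit in the upper byte */
	assert(link <= FPOS_LINK_MASK); /* must fit in 24 bits */
	return (bucket << FPOS_LINK_BITS) | link;
}

/* Recover the bucket index from a packed position. */
static uint32_t fpos_bucket(uint32_t fpos)
{
	return fpos >> FPOS_LINK_BITS;
}

/* Recover the link index from a packed position. */
static uint32_t fpos_link(uint32_t fpos)
{
	return fpos & FPOS_LINK_MASK;
}
```

With this layout, readdir can resume from a saved position by jumping
to the stored bucket and skipping the stored number of links, without
keeping any per-open state.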
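
The block number scaling mentioned above (the FS sets its blocksize to
the smaller system structure blocksize and multiplies block numbers
when needed) might look something like this. It is a minimal sketch
with a hypothetical helper name, assuming the on-disk block size is a
whole multiple of the system structure block size (e.g. 8k and 2k).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a block number counted in on-disk
 * filesystem blocks (e.g. 8k) into a block number counted in the
 * smaller system-structure blocks (e.g. 2k) used as the blocksize. */
static uint64_t fs_block_to_sys_block(uint64_t fs_block,
				      uint32_t fs_blocksize,
				      uint32_t sys_blocksize)
{
	/* Assumption: the large block size is a multiple of the small one. */
	assert(sys_blocksize != 0 && fs_blocksize % sys_blocksize == 0);
	return fs_block * (fs_blocksize / sys_blocksize);
}
```

With 8k filesystem blocks and 2k system blocks the factor is 4, so
on-disk block 10 starts at block 40 in the units handed to the block
layer.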
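
To make the extent table description concrete, here is a rough sketch
of mapping a file-relative block index to a disk block by a linear walk
over (start block, count) pairs. The struct and function names are
hypothetical and the real on-disk layout is not reproduced; this only
illustrates the simple traversal described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory extent; not the on-disk format. */
struct extent {
	uint64_t start; /* first disk block of the run */
	uint64_t count; /* number of contiguous blocks in the run */
};

/* Walk the table until the run containing `index` is found.
 * Returns 0 and stores the disk block in *out on success,
 * or -1 if `index` lies past the last extent. */
static int extent_lookup(const struct extent *tbl, size_t n,
			 uint64_t index, uint64_t *out)
{
	assert(tbl != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (index < tbl[i].count) {
			*out = tbl[i].start + index;
			return 0;
		}
		index -= tbl[i].count; /* skip this run */
	}
	return -1;
}
```

The cost grows with the number of extents in front of the requested
block, which is why a walk like this is a poor fit for heavily
fragmented files.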