From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Kleikamp Subject: Re: Expected getdents behaviour Date: Fri, 16 Sep 2005 06:57:18 -0500 Message-ID: <1126871838.9350.16.camel@kleikamp.austin.ibm.com> References: <1126793268.1676.9.camel@imp.csi.cam.ac.uk> <1126793558.1676.15.camel@imp.csi.cam.ac.uk> <20050915155108.GE22503@thunk.org> <20050916033945.GA11047@thunk.org> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Cc: Anton Altaparmakov , Akshat Aranya , linux-fsdevel@vger.kernel.org Return-path: Received: from e32.co.us.ibm.com ([32.97.110.130]:56272 "EHLO e32.co.us.ibm.com") by vger.kernel.org with ESMTP id S932663AbVIPL51 (ORCPT ); Fri, 16 Sep 2005 07:57:27 -0400 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e32.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id j8GBvKwP272784 for ; Fri, 16 Sep 2005 07:57:22 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by westrelay02.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8GBvKab400546 for ; Fri, 16 Sep 2005 05:57:20 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8GBvJuL010615 for ; Fri, 16 Sep 2005 05:57:19 -0600 To: "Theodore Ts'o" In-Reply-To: <20050916033945.GA11047@thunk.org> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Thu, 2005-09-15 at 23:39 -0400, Theodore Ts'o wrote: > JFS also uses a b-tree (or more than one b-tree) to index its > directory, and also gives the same guarantee, but it implements it in > an entirely different way. Here are the detail of jfs's implementation. Each directory entry is assigned sequential "index" when it is created, which is stored in the directory entry in the dtree (the primary directory data structure). In addition, an entry is created in the "index table", which is an array keyed on the index. Initially, this table can be stored directly in the inode, but as it grows, the table is stored in other blocks, and is addressed by a b-tree anchored in the inode. For an existing entry, the table stores the address of the dtree page, and the offset within the page. This table is updated whenever a change to the directory moves the entry. When an entry is deleted, the table contains the index of the next existing entry in dtree-order. Resuming a search means getting the table element for the index, checking a flag whether it exists or has been deleted, following the chain of "next" indices until an existing one is found, and getting the dtree page address and offset of the entry. The disadvantages are the need to update the directory table as entries are moved, and the fact that the index table does not shrink as entries are removed. (The table is completely truncated if all entries are removed from a directory, but that isn't typically going to happen.) > Two different filesystems; two different strategies; both implement > this guarantee without needing to lock the directory against > modifications during the readdir() scan. Take a look at the > implementations closely. > > (Hint: both involve walking the b-tree and returning readdir() entries > in tree order. The details though of which b-tree, how the b-tree is > indexed, how the index is stored in file descriptor offset, and how > telldir/seekdir are implemented are different in the two filesystems, > however.) -- David Kleikamp IBM Linux Technology Center