linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ted Ts'o <tytso@mit.edu>
To: Phillip Susi <psusi@cfl.rr.com>
Cc: Eric Sandeen <sandeen@redhat.com>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: Large directories and poor order correlation
Date: Tue, 15 Mar 2011 21:50:21 -0400	[thread overview]
Message-ID: <20110316015021.GK8120@thunk.org> (raw)
In-Reply-To: <4D7FB945.1070209@cfl.rr.com>

On Tue, Mar 15, 2011 at 03:08:53PM -0400, Phillip Susi wrote:
> 
> When you split the htree node, aren't you just moving around the
> "deleted entries"?  So the normal names remain in the same order so
> readdir() doesn't have a problem when it is ignoring the htree entries
> and just walking the normal names?

No, because the leaf entries of the htree are the actual directory
entries in the directory block.

> Why was the htree hidden inside the normal directory structure anyway?

So that we had full backwards compatibility with ext2.  I had planned
this feature in advance, and ext2 (and pre-htree ext3 implementations)
would clear the htree flag if they tried to modify an htree directory.
Since the leaf nodes (which are the ones that get split) are normal
directory blocks, and interior nodes of the tree are directory blocks
that have what appears to ext2 to be a deleted directory entry
covering the entire block, it's fully backwards compatible with
ext2/older ext3 systems.

> > Unless some files get deleted in between.  Now depending on the
> > "holes" in the directory blocks, where the new directory entries are
> > added, even in the non-htree case, could either be wherever an empty
> > directory entry could be found, or in the worst case, we might need to
> > allocate a new block and that new directory entry gets added to the
> > end of the block.
> 
> Right, but on an otherwise idle system, when you make all the files at
> once via rsync or untaring an archive, this shouldn't happen and they
> should be ( generally ) in ascending order, shouldn't they?

If the directory is freshly created, yes.  But over time, if you are
adding and deleting files, such as a Maildir directory, this will not
be the case forever.

> > I suggest that you try some experiments, using both dir_index and
> > non-dir_index file systems, and then looking at the directory using
> > the debugfs "ls" and "htree_dump" commands.  Either believe me, or
> > learn how things really work.  :-)
> 
> Now THAT sounds interesting.  Is this documented somewhere?

The debugfs man page has some documentation.

> Also, why can't chattr set/clear the 'I' flag?  Is it just a runtime
> combersome thing?  So setting and clearing the flag with debugfs
> followed by a fsck should do the trick?  And when is it automatically
> enabled?

You can't clear the 'I' flag because I didn't want to bother getting
the locking right so that it would be safe.  And turning off the 'I'
flag isn't going to help, since the directory entries are scrambled
anyway --- again, because the leaf nodes of the htree *are* normal
directory blocks.  Turning the 'I' flag on would require reindexing
the whole directory, which would be a major headache.  You'd have to
lock out all directory accesses, and then do a full copy operation ---
and remember, a directory could be potentially many megabytes, and
kernel memory is non-swappable.

> I think you are right in that if sorting is to be done at
> opendir()/readdir() time, then it should be done in libc, not the
> kernel, but it would be even better if the fs made some effort store the
> entries in a good order so no sorting is needed at all.

The problem is that we have to store them in hash tree order so the
lookups happen correctly.  We could have done what JFS does, which is
to keep three separate b-trees for its directories, and where every
single time you add or modify a file, you have to update multiple
btrees.  But, (a) this would have required an non-backwards compatible
file system format change from ext2/older ext3 file systems, and (b)
the extra b-trees updates are their own source of overhead.

    	  	  	      	    	       - Ted

  reply	other threads:[~2011-03-16  1:50 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-14 20:24 Large directories and poor order correlation Phillip Susi
2011-03-14 20:37 ` Eric Sandeen
2011-03-14 20:52   ` Phillip Susi
2011-03-14 21:12     ` Eric Sandeen
2011-03-14 21:52     ` Ted Ts'o
2011-03-14 23:43       ` Phillip Susi
2011-03-15  0:14         ` Ted Ts'o
2011-03-15 14:01           ` Phillip Susi
2011-03-15 14:33             ` Rogier Wolff
2011-03-15 14:36               ` Ric Wheeler
2011-03-15 17:08             ` Ted Ts'o
2011-03-15 19:08               ` Phillip Susi
2011-03-16  1:50                 ` Ted Ts'o [this message]
2011-03-15  7:59   ` Florian Weimer
2011-03-15 11:06     ` Theodore Tso
2011-03-15 11:23       ` Ric Wheeler
2011-03-15 11:38         ` Theodore Tso
2011-03-15 13:33       ` Rogier Wolff
2011-03-15 17:18         ` Ted Ts'o

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110316015021.GK8120@thunk.org \
    --to=tytso@mit.edu \
    --cc=linux-ext4@vger.kernel.org \
    --cc=psusi@cfl.rr.com \
    --cc=sandeen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).