linux-ext4.vger.kernel.org archive mirror
From: Phillip Susi <psusi@cfl.rr.com>
To: Ted Ts'o <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: Large directories and poor order correlation
Date: Tue, 15 Mar 2011 15:08:53 -0400
Message-ID: <4D7FB945.1070209@cfl.rr.com>
In-Reply-To: <20110315170827.GH8120@thunk.org>

On 3/15/2011 1:08 PM, Ted Ts'o wrote:
> No, because the directory blocks are the leaf nodes, and in the case
> of a node split, we need to copy half of the directory entries in one
> block, and move it to a newly allocated block.  If readdir() was
> traversing the linear directory entries, and had already traversed
> that directory block that needs to split, then you'll return those
> directory entries that got copied into a new leaf (i.e., new directory
> block) a second time.
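
The scenario quoted above can be sketched with a toy model. This is an
illustration only, not ext4 code: the directory is a plain list of
blocks, the hypothetical split_block() moves the second half of a block
by position (a real htree split divides entries by hash value), and the
reader walks the blocks linearly.

```python
def split_block(blocks, idx):
    """Toy split: move the second half of blocks[idx] into a new block
    appended at the end of the directory (hypothetical helper)."""
    block = blocks[idx]
    half = len(block) // 2
    blocks.append(block[half:])
    del block[half:]


def linear_readdir(blocks, start_block=0):
    """Yield names by walking directory blocks in on-disk order."""
    for block in blocks[start_block:]:
        yield from block


blocks = [["a", "b", "c", "d"], ["e", "f"]]

seen = list(blocks[0])      # the reader has already consumed block 0
split_block(blocks, 0)      # a concurrent create forces block 0 to split
seen += list(linear_readdir(blocks, start_block=1))

# Names that the reader was handed more than once.
duplicates = sorted({name for name in seen if seen.count(name) > 1})
```

In this model the reader sees "c" and "d" twice, because they were
copied into a block it had not yet reached.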

When you split the htree node, aren't you just moving around the
"deleted entries"?  So the normal names remain in the same order so
readdir() doesn't have a problem when it is ignoring the htree entries
and just walking the normal names?

Also, how do you deal with this when you do end up rebalancing the
htree during a readdir()?  I would think that keeping that straight
would be much more difficult than handling the problem with linear
directory entries.

Why was the htree hidden inside the normal directory structure anyway?

> Unless some files get deleted in between.  Now depending on the
> "holes" in the directory blocks, where the new directory entries are
> added, even in the non-htree case, could either be wherever an empty
> directory entry could be found, or in the worst case, we might need to
> allocate a new block and that new directory entry gets added to the
> end of the block.

Right, but on an otherwise idle system, when you make all the files at
once via rsync or untarring an archive, this shouldn't happen and they
should be (generally) in ascending order, shouldn't they?

> I suggest that you try some experiments, using both dir_index and
> non-dir_index file systems, and then looking at the directory using
> the debugfs "ls" and "htree_dump" commands.  Either believe me, or
> learn how things really work.  :-)

Now THAT sounds interesting.  Is this documented somewhere?

Also, why can't chattr set/clear the 'I' flag?  Is it just too
cumbersome to do at runtime?  So setting and clearing the flag with
debugfs followed by a fsck should do the trick?  And when is it
automatically enabled?

> I suppose we could allocate up to some tunable amount worth of
> directory space, say 64k or 128k, and do the sorting inside the
> kernel.  We then have to live with the fact that each badly behaved
> program which calls opendir(), and then a single readdir(), and then
> stops, will consume 128k of non-swappable kernel memory until the
> process gets killed.  A process which does this thousands of times
> could potentially carry out a resource exhaustion attack on the
> system.  Which we could then try to patch over, by say creating a new
> resource limit of the number of directories a process can keep open at
> a time, but then the attacker could just fork some additional child
> processes....

I think you are right in that if sorting is to be done at
opendir()/readdir() time, then it should be done in libc, not the
kernel, but it would be even better if the fs made some effort to store
the entries in a good order so no sorting is needed at all.
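
As a rough sketch of the userspace-sorting idea, assuming the goal is to
visit files in inode order so that later stat() or open() calls touch
the inode tables roughly sequentially: the helper name
names_in_inode_order() is made up for this example, and it is done here
in application code with Python's os.scandir() rather than inside libc.

```python
import os
import tempfile


def names_in_inode_order(path):
    """Read the whole directory, then return its names sorted by inode
    number (hypothetical helper; sorting happens in userspace)."""
    with os.scandir(path) as it:
        entries = [(entry.inode(), entry.name) for entry in it]
    entries.sort()
    return [name for _inode, name in entries]


# Small demonstration on a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("c", "a", "b"):
        open(os.path.join(scratch, name), "w").close()
    ordered = names_in_inode_order(scratch)
    inodes = [os.stat(os.path.join(scratch, n)).st_ino for n in ordered]
```

The memory for the sort is ordinary process memory here, so it is
swappable and charged to the caller, unlike the in-kernel buffer
discussed in the quoted text.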
