public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Theodore Ts'o" <tytso@mit.edu>
To: Daniel Phillips <phillips@arcor.de>
Cc: Alex Tomas <bzzz@tmi.comex.ru>,
	"Martin J. Bligh" <mbligh@aracnet.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	ext2-devel@lists.sourceforge.net, Andrew Morton <akpm@digeo.com>
Subject: Re: [Bug 417] New: htree much slower than regular ext3
Date: Fri, 7 Mar 2003 18:27:50 -0500	[thread overview]
Message-ID: <20030307232749.GA24572@think.thunk.org> (raw)
In-Reply-To: <20030307173425.5C4D3FAAAE@mx12.arcor-online.net>

On Sat, Mar 08, 2003 at 06:38:50PM +0100, Daniel Phillips wrote:
> > The problem is that getdents(2) returns inodes out of order and
> > du causes many head seeks. I tried to solve the problem by patch
> > I included in. The idea of the patch is pretty simple: just try
> > to sort dentries by inode number in readdir(). 

Yup, that's the problem and the fix; the only issue is the the one
which Daniel pointed out:

> The problem I see with your approach is that the traversal is no
> longer in hash order, so a leaf split in the middle of a directory
> traversal could result in a lot of duplicate dirents.  I'm not sure
> there's a way around that.

The fix in userspace would be for du and find (the two programs most
likely to care about this sort of thing) to pull into memory a large
chunks of the directory at a time, sort them by inode, and then stat
them in inode order.

The only other way around it would be to do maintain an entirely
separate B-tree on disk just for the benefit of readdir(), which could
be in inode sort order.  Readdir() could then traverse the tree in
inode sort order, thus solving the performance penalty.  (The b-tree
has to be on disk, since for large directories, we can't afford to fit
it on disk.)

> We've apparently got a simpler problem though: inode numbers aren't
> allocated densely enough in the inode table blocks.  This is the
> only thing I can think of that could cause the amount of thrashing
> we're seeing.  Spreading inode number allocations out all over a
> volume is ok, but sparse allocation within each table block is not
> ok because it multiplies the pressure on the cache.  We want a
> 30,000 entry directory to touch 1,000 - 2,000 inode table blocks,
> not the worst-case 30,000.  As above, fixing this comes down to
> tweaking the ext3_new_inode allocation goal.

Nope.  That's not it.  Inode numbers are allocated sequentially in the
inode table blocks.  You can't get any more dense that that.  Create a
bunch of inodes in a directory like this:

mke2fs -j -F -b 1024 -g 1024 -N 1200000 test.img
mount -t ext3 -o loop test.img /mnt
pushd /mnt
mkdir test test2 test3
cd test
time seq -f "%06.0f" 1 100  | xargs touch
cd ..
cd test2
time seq -f "%06.0f" 1 10000  | xargs touch
cd ..
cd test3
time seq -f "%06.0f" 1 100000  | xargs touch
cd ..
popd

Then look at the result using "ls -li".  You will see that the inode
numbers are all sequentially allocated, when you look at the inodes in
creation order (which in this particular case is in filename sort
order ).  The problem is that when you look at them in hash sort
order, the inode numbers are scattered all over the place.

> Another approach is to have the inode number correspond approximately to the 
> hash value.  That's not as hard as it sounds, it simply means that the goal 
> for ext3_new_inode should be influenced by the hash value.  (But watch out 
> for interaction with the new Orlov allocator.)  

I'm not all convinced that splattering the inodes all over the volume
is in fact a good thing, because it means splattering the block
allocation for files within the same directory all over the volume.
Arguably, read access to files in a directory is much more common than
the "du" or "find" case.  So if we need to make trade-off in one
direction or another, I'd much rather go with striving to maximize
block locality than making a readdir+stat run go fast.  This is
especially true since there *is* a user space workaround that would
speed up things for du and find.

						- Ted



  reply	other threads:[~2003-03-07 23:17 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
2003-02-28  2:55 ` Daniel Phillips
2003-02-27 21:00   ` Andreas Dilger
2003-02-28  4:12     ` Daniel Phillips
2003-02-27 21:33       ` Martin J. Bligh
2003-03-13 21:04     ` [Ext2-devel] " Stephen C. Tweedie
2003-03-07 15:46 ` Alex Tomas
2003-03-08 17:38   ` Daniel Phillips
2003-03-07 23:27     ` Theodore Ts'o [this message]
2003-03-09 19:26       ` Alex Tomas
2003-03-09  7:08     ` Alex Tomas
2003-03-10 17:58       ` Daniel Phillips
2003-03-10 21:25       ` Theodore Ts'o
2003-03-11 21:57   ` Bill Davidsen
     [not found] ` <20030307214833.00a37e35.akpm@digeo.com>
     [not found]   ` <20030308010424.Z1373@schatzie.adilger.int>
2003-03-09 22:54     ` [Ext2-devel] " Daniel Phillips
2003-03-08 23:19       ` Andrew Morton
2003-03-09 23:10   ` Daniel Phillips
     [not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net>
     [not found]   ` <m3u1ecl5h8.fsf@lexa.home.net>
2003-03-10 20:45     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
     [not found]       ` <3E6D1D25.5000004@namesys.com>
     [not found]         ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net>
2003-03-11 10:45           ` Hans Reiser
2003-03-11 13:00             ` Helge Hafting
2003-03-11 13:41               ` Daniel Phillips
2003-03-11 17:16                 ` Andreas Dilger
2003-03-11 19:39                 ` Helge Hafting
2003-03-11 20:19                   ` Daniel Phillips
2003-03-11 21:25                 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
2003-03-11 23:49                   ` Jamie Lokier
2003-03-10 20:48     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:04       ` John Bradford
2003-03-10 21:28         ` Andreas Schwab
2003-03-10 21:50           ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
2003-03-14 21:55             ` [Ext2-devel] " Stephen C. Tweedie
2003-03-10 21:33         ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:47           ` [Ext2-devel] " Bryan O'Sullivan
2003-03-10 22:02             ` Matthew Wilcox
2003-03-11  8:47               ` Jakob Oestergaard
2003-03-11 11:27                 ` John Bradford
2003-03-14 21:57               ` Stephen C. Tweedie
2003-03-15  8:39                 ` jw schultz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030307232749.GA24572@think.thunk.org \
    --to=tytso@mit.edu \
    --cc=akpm@digeo.com \
    --cc=bzzz@tmi.comex.ru \
    --cc=ext2-devel@lists.sourceforge.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mbligh@aracnet.com \
    --cc=phillips@arcor.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox