All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Radek Pazdera <rpazdera@redhat.com>
Cc: linux-ext4@vger.kernel.org, lczerner@redhat.com, kasparek@fit.vutbr.cz
Subject: Re: [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index
Date: Sat, 11 May 2013 21:28:34 +0800	[thread overview]
Message-ID: <20130511132833.GA21325@gmail.com> (raw)
In-Reply-To: <1367702922-3236-1-git-send-email-rpazdera@redhat.com>

Hi Radek,

Could you please rebase your patch series against dev branch of ext4
tree?  I couldn't apply your patches against this branch.  You can get
the tree from here:
  https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/

Regards,
                                                - Zheng

On Sat, May 04, 2013 at 11:28:33PM +0200, Radek Pazdera wrote:
> Hello everyone,
> 
> I am an university student from Brno /CZE/. I decided to try to optimise
> the readdir/stat scenario in ext4 as the final project to school. I
> posted some test results I got few months ago [1].
> 
> I tried to implement an additional tree for ext4's directory index
> that would be sorted by inode numbers. The tree then would be used
> by ext4_readdir() which should lead to substantial increase of
> performance of operations that manipulate a whole directory at once.
> 
> The performance increase should be visible especially with large
> directories or in case of low memory or cache pressure.
> 
> This patch series is what I've got so far. I must say, I originally
> thought it would be *much* simpler :).
> 
> TLDR
> ====
> 
> The series contains the implementation of the new tree and several
> rather small changes to the original directory index. I also added
> a new implementation of ext4_readdir() that uses this new tree
> instead of the original one.
> 
> It applies on 3.9, it is basically working, however, it is highly
> experimental at the moment. It doesn't survive XFS tests yet (I
> still need to work on that).
> 
> DESIGN
> ======
> 
> The tree comes with a new read-only compatible file system feature
> called "itree". It is created in a directory at the same time as the
> original dx tree -- when the directory file exceeds a signle block of
> size. It is indicated by an inode flag.
> 
> The tree resides completely outside of the directory file. I am using
> 64bit block numbers (as suggested by Ted Ts'o), and the pointer to its
> root is stored in the end of the dx_root block.
> 
> I tried to keep the structure of the tree as close as possible to the
> original dx tree. It is a B+ tree with a 64bit long compound key. The
> key consists of a inode number and a hash.
> 
> The hash is there because of hard links. On ext4 there can be as much
> as 65 000 names for a single file which is, from the point of the inode
> tree, a collision. It is a very unlikely scenario to crate over 60 000
> names for a file from a single directory. But it would be a very serious
> problem for readdir() if someone tried to do that, so I decided to add
> the hash to the key as well. This reduces the problems of collisions to
> the same as of the hash tree.
> 
> The depth of the tree is limited by a constant in the code to 3 levels,
> but it can be increased anytime.
> 
> I decided to keep the directory entries within the leaf nodes sorted,
> which might have been a bad idea (and it brought me quite a few
> sleepless nights of pointer debugging). However, it is more convenient
> for readdir and splits. And in the case of the inode tree, there
> shouldn't be that much memmoving, because inodes are allocated linearly,
> therefore we're adding to the end most of the time anyway.
> 
> I also had to implement coalesce-on-delete. Unlike the hash values,
> the inode numbers are not random, so the tree would get fragmented
> really easily. For example when a range of inodes would be freed
> and allocated somewhere else, that part of the tree would stay empty
> forever.
> 
> In this implementation I used 8 bits for "node fullness". Neighbour
> nodes in the tree are merged as soon as their total fullness is smaller
> than maximum fullness. This way, coalescing happens too often. I plan
> to reduce it to only 2 bits (25% full, 50% full, 75% full, 100% full).
> 
> There are also some optimisations to increase the fullness of blocks
> within the tree, because if the file system adds a contiguous sequence
> of inodes to a directory and we split the nodes in half, there will be
> some tree nodes that will never get another entry and still be only 50%
> full. In the current implementation, an append is performed instead of
> a split in case the new entry should be added to the end of the full
> block.
> 
> BENCHMARKS
> ==========
> 
> I did some benchmarks and compared the performance with ext4/htree,
> XFS, and btrfs up to 5 000 000 of files in a single directory. Not
> all of them are done though (they run for days).
> 
> Probably the biggest improvement can be observed with copying files.
> I used two physical SATA disks for the test. For 5M files, the itree
> implementation is 14x faster. With metadata only, ext4 is even a tiny
> bit faster than xfs:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/cp.png
> 
> With each files 4kB big, the inode tree is 11x faster:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/cp.png
> 
> On the other hand, probably the biggest drawback of this implementation
> is that it slows down file creates, as we now have to insert the entry
> into both trees. The difference gets bigger as both trees grow (and their
> blocks get further apart). For 5M directory entries, the creation is
> roughly 40% slower. Hopefully, I'll be able to optimise it a bit in the
> future:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/create.png
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/create.png
> 
> Deleting entries should also very much benefit from this. However, the
> increase of performance in my test is far lower than expected. I think
> that is due to my implementation of coalesce-on-delete, which happens
> too often and during these massive deletes, it coallesces all the time.
> I hope that when I fix that, it will get somewhere close to XFS as well:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/delete.png
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/delete.png
> 
> All of the results have a graph and a table with precise values.
> You can always get the table by replacing the suffix of the graph
> *.png to *.results in the end of the link.
> 
> Full results are available here:
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/
> 
> 
> I also did some tests on an aged file system (I used the simple 0.8
> chance to create, 0.2 to delete a file) where the results of ext4
> with itree are much better even than xfs, which gets fragmented:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/cp.png
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/readdir-stat.png
> 
> 
> There are still some things to be done, the checksums are not yet
> finished and I certainly need to do a bit of cleaning up at some
> places. But as far as features go, it all should be there already.
> 
> I'm working on this along with Lukas Czerner, who acts as my mentor
> and helps me with different things (big thanks!).
> 
> Any feedback/ideas are greatly appeciated :)!
> 
> Cheers,
> Radek
> 
> 
> Radek Pazdera (9):
>   ext4: Adding itree feature and inode flags
>   ext4: Allow sorting dx_map by inode as well
>   ext4: Adding a link to itree to the dx_root struct
>   ext4: Adding itree structures
>   ext4: Adding itree implementation I - Core
>   ext4: Adding itree implementation II - Inserting
>   ext4: Adding itree implementation III - Deleting
>   ext4: Make directory operations use itree
>   ext4: Make ext4_readdir() use itree if available
> 
>  fs/ext4/dir.c   |  177 +++-
>  fs/ext4/ext4.h  |   21 +-
>  fs/ext4/namei.c | 2442 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 2565 insertions(+), 75 deletions(-)
> 
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2013-05-11 13:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-04 21:28 [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index Radek Pazdera
2013-05-04 21:28 ` [RFC 1/9] ext4: Adding itree feature and inode flags Radek Pazdera
2013-05-04 21:28 ` [RFC 2/9] ext4: Allow sorting dx_map by inode as well Radek Pazdera
2013-05-04 21:28 ` [RFC 3/9] ext4: Adding a link to itree to the dx_root struct Radek Pazdera
2013-05-04 21:28 ` [RFC 4/9] ext4: Adding itree structures Radek Pazdera
2013-05-04 21:28 ` [RFC 5/9] ext4: Adding itree implementation I - Core Radek Pazdera
2013-05-04 21:28 ` [RFC 6/9] ext4: Adding itree implementation II - Inserting Radek Pazdera
2013-05-04 21:28 ` [RFC 7/9] ext4: Adding itree implementation III - Deleting Radek Pazdera
2013-05-04 21:28 ` [RFC 8/9] ext4: Make directory operations use itree Radek Pazdera
2013-05-04 21:28 ` [RFC 9/9] ext4: Make ext4_readdir() use itree if available Radek Pazdera
2013-05-11 13:28 ` Zheng Liu [this message]
2013-05-11 21:18   ` [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index Radek Pazdera
2013-06-16  0:55 ` Dave Chinner
2013-06-17  8:58   ` Lukáš Czerner
2013-06-19 12:10     ` Radek Pazdera
2013-06-27  9:24 ` Lukáš Czerner
2013-07-01 11:40   ` Radek Pazdera
2013-07-01 12:17     ` Lukáš Czerner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130511132833.GA21325@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=kasparek@fit.vutbr.cz \
    --cc=lczerner@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=rpazdera@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.